python – Filter out ambiguous bases from a DNA sequence

Perhaps there aren’t enough test cases here, because I don’t see why you can’t just use:

import re
pattern = re.compile(r"[^ACGT]")  # negated character class: any character other than A, C, G, T
print(pattern.sub("", toy_sequence))

This replaces every character that is not A, C, G, or T with the empty string, i.e. it deletes them. You therefore don’t need to know which characters are illegal, only which ones are legal. It should be very fast and is a lot easier to read. It also doesn’t require a function definition, because the compiled pattern effectively serves as your function, which you call with pattern.sub.
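As a minimal, self-contained sketch of the idea above: the value of toy_sequence here is a made-up example (the question's actual input is not reproduced in this answer), and the pattern assumes upper-case input.

```python
import re

# Hypothetical example input; not the sequence from the question.
toy_sequence = "ACGTNNRYACGT-ACGT"

# [^ACGT] matches any single character other than A, C, G, or T.
pattern = re.compile(r"[^ACGT]")

# Substituting the empty string deletes every matched character.
cleaned = pattern.sub("", toy_sequence)
print(cleaned)  # ACGTACGTACGT
```

If lower-case bases should be kept as well, one option is to widen the class to [^ACGTacgt] or to upper-case the sequence first; which of these is appropriate depends on your data.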

Separately, and regardless of whether the regex approach works for you, I’d like to point out that

def check_and_clean_sequence(sequence, alphabet):

is not a valuable function. cleaning_ambiguous_bases achieves the same result and will not be any slower. Checking first can at best match the cost of calling cleaning_ambiguous_bases directly, because either way every character has to be examined. If you check first, however, you may iterate through the sequence twice: once to check, and once to replace. It’s faster to just walk through it once.
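To illustrate the point, here is a rough sketch of the two strategies. The function names clean_once and check_then_clean are hypothetical stand-ins, not the question's actual cleaning_ambiguous_bases or check_and_clean_sequence, whose bodies are not shown here.

```python
import re

AMBIGUOUS = re.compile(r"[^ACGT]")

def clean_once(sequence):
    """Single pass: delete every character outside ACGT."""
    return AMBIGUOUS.sub("", sequence)

def check_then_clean(sequence):
    """Check for an illegal character first, then clean only if one is found."""
    # The search scans until the first illegal character
    # (the whole sequence if it is already clean).
    if AMBIGUOUS.search(sequence) is None:
        return sequence
    # The substitution then scans the sequence again from the start.
    return AMBIGUOUS.sub("", sequence)

# Both strategies return the same result for clean and dirty input alike.
for seq in ("ACGTACGT", "ACGTNACGT", "NNNN", ""):
    assert clean_once(seq) == check_then_clean(seq)
```

The check only saves work when the sequence is already clean, and even then it has to read every character, which is the same amount of scanning the single substitution does.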