performance – Optimise Python functions for speed


Here are some suggestions for general code cleanup. I can’t guarantee that any of these changes will improve performance.


if len(xxx):

Almost certainly you can replace these with if xxx:. It’s not guaranteed to be the same, but almost all types that support len will test false if their length is 0. This includes strings, lists, dicts, and other standard Python types.
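
A quick way to convince yourself (toy example, not from the original code):

words = []
print(bool(len(words)))  # False: len() is 0
print(bool(words))       # False: an empty list already tests false
words.append('hello')
print(bool(words))       # True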


I would turn __rstrip and __lstrip into top-level functions, or at least make them @staticmethods since they don’t use self. __generate_results could be a static method as well.
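
A minimal sketch of the @staticmethod version, assuming the methods live on a class along these lines (the class name and method body here are illustrative stand-ins, not the original code):

class TextProcessor:
    @staticmethod
    def __rstrip(token):
        # No reference to self, so a @staticmethod (or a top-level
        # function) makes the absence of instance state explicit.
        return token.rstrip(',.;!?:"')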


__rstrip and __lstrip seem to reimplement the functionality of the built-in str.rstrip and str.lstrip. Possibly, you could replace them with

display_text = t[0].rstrip(',.;!?:"')
display_text = display_text.lstrip(',.;!?:"')
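
Note that str.rstrip and str.lstrip with an argument strip any run of those characters, not a fixed count. For example:

print('hello,!?'.rstrip(',.;!?:"'))  # hello
print('"--hello'.lstrip(',.;!?:"'))  # --hello (stops at the first non-matching character)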

If it’s really important that at most 5 characters be stripped, you can do that like this:

def _rstrip(token):
    return token[:-5] + token[-5:].rstrip(',.;!?:"')

def _lstrip(token):
    return token[:5].lstrip(',.;!?:"') + token[5:]
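
For example (made-up input), only the trailing five characters are eligible for stripping:

print(_rstrip('hello,,,,,,,'))  # 'hello,,' -- the first two of seven commas survive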

token[-1] in (',', '.', ';', '!', '?', ':', '"')

Either token[-1] in {',', '.', ';', '!', '?', ':', '"'} or token[-1] in ',.;!?:"' would be faster. The latter is riskier since it’s not exactly the same: it will do a substring test if token[-1] isn’t a single character. But if token is a string then token[-1] is guaranteed to be a single character.
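
If you want to measure the difference yourself, timeit gives a quick comparison (the absolute numbers will vary by machine):

import timeit

setup = "token = 'word,'"
print(timeit.timeit("token[-1] in (',', '.', ';', '!', '?', ':', '\"')", setup=setup))
print(timeit.timeit("token[-1] in {',', '.', ';', '!', '?', ':', '\"'}", setup=setup))
print(timeit.timeit("token[-1] in ',.;!?:\"'", setup=setup))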


words = []
for t in normalised:
    if len(t[0]):
        words.append(t[0])

You can replace this with words = [t[0] for t in normalised if t[0]].
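
A quick check with made-up data that the comprehension matches the loop:

normalised = [('hello', 0), ('', 1), ('world', 2)]
words = [t[0] for t in normalised if t[0]]
print(words)  # ['hello', 'world']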


words = []
for t2 in normalised:
    if idx == t2[1]:
        words.append(t2[0])

This could be a significant source of inefficiency. You could try making a lookup table outside the loop:

normalised_lookup = collections.defaultdict(list)
for t in normalised:
    normalised_lookup[t[1]].append(t[0])

and then replace the quoted code with words = normalised_lookup[idx]. This could end up being slower, though. Also, it has the side effect that lists for the same value of idx will be shared, which could cause subtle, hard-to-catch aliasing bugs down the line. If that’s an issue, write words = list(normalised_lookup[idx]) (or tuple).
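
Here is a toy demonstration of both the lookup table and the aliasing issue (the data is made up):

import collections

# Made-up (word, index) pairs standing in for `normalised`.
normalised = [('hello', 0), ('world', 0), ('again', 1)]

normalised_lookup = collections.defaultdict(list)
for t in normalised:
    normalised_lookup[t[1]].append(t[0])

words_a = normalised_lookup[0]
words_b = normalised_lookup[0]
words_a.append('oops')
print(words_b)  # ['hello', 'world', 'oops'] -- both names alias one list

# Copying breaks the sharing:
words_c = list(normalised_lookup[1])
words_c.append('safe')
print(normalised_lookup[1])  # ['again'] -- unchanged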