Introduction to the problem
I have started working as an intern, and my first project is to come up with an implementation of company names matching. Customers send a form with the manufacturers that they want to get their products from, and as the Manufacturers are many, we need a quite efficient algorithm to match we the database registered ones.
For an example, see the table below: +-----------+----------------+-------+ | MANUINPUT | MANUFACTMATCH | SCORE | +-----------+----------------+-------+ | COMPANY A | COMPANY A Inc. | 100 | +-----------+----------------+-------+
I have created a python program that uses the FuzzyMatching. In order to become more specific, I get the scores using token_sort_ratio.
ratio = fuzz.token_sort_ratio(manu,manuInput) if ratio < 80 and len(manu) < 6 and len(manuInput) < 6: ratio = ratio - 10 for country in countries: if country in manu and country in manuInput and ratio > 82 and (len(country) > len(manuInput)/2 or len(country) > len(manu)): ratio = ratio - 5 if manu != '' and manuInput != '' and len(manu) - len(manuInput) < abs(3): if manu(0) != manuInput(0): ratio = ratio - 10 if len(manu) > 1 and len(manuInput) > 1: if manu(1) != manuInput(1): ratio = ratio - 6
My first step, which is not shown above, is that I use cleanco and an stopwords library, to remove words that are not “needed” in a company name. I also remove words like ‘proffessional’, ‘products’ etc., remove the whole punctuation, and make everything to lower.
We have also developed a database table that we define synonyms like Company A = Company A America which I also search into for a perfect exact match before calculating ratio.
The results I get, are somewhat good, but sometimes they fall off and they come out ‘random’.
Problems I have found
- As I remove many stopwords, company names might become really small. (Company A)->(A), so if an input comes with a word that is not in stopwords, (Company) A won’t match with (Company) A America, as (Company) B might exist, and it will return as a better result.
- Company names including only stopwords. For example ‘International Products Company Inc.’, becomes an empty string for a match, so it gets no match. I think this is a sacrifice to be made though. Tell me If you have an idea how to overcome that.
- There exist some cases where Company A is not Company A Inc. (although not that many), so I get some mismatches from that.
Solutions I came up with
- If we have to match an input company name, if input length is < 4, and if the starting letters don’t match with the current company we compare with, then substract 10 from the score.
- As we use token sort ratio, counts on tokenizing company names and calculating subratios, if we get ‘NAMEA’ and try to match with ‘NAME A’ (space incl.) then results get worse ratio than they should. So when I get a score less than 70, I use the partial ratio calculator also, and I keep the best result.
- Whenever there is not a perfect match, return best 4 matches. This satisfies, that the second result might be the one we need.
- I am not sure if I should try to come up with a clustering algorithm, categorizing companies based on largest partial shared string. I don’t have performance issues for the time being
I would like to come up with many more heuristics in order to improve my results more. I would appreciate your help in sharing heuristics, based on length, based on letters, based on score, and even provide an alternative to the problem (for example, a training model implementation etc.), or maybe a weighted calculation model using more string similarity functions rather than Levenshtein distance.
Thank you again for reading the whole question. Waiting for your answers :).