Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?
Also, in the comments to my previous posts the BK-tree has been mentioned as a superior data structure for fuzzy search.
So I decided to compare and benchmark the BK-tree to other options.
I. Approximate string search algorithms
Approximate string search allows us to look up a string in a list of strings and return those strings which are close according to a specific string metric.
There are many different string metrics, such as the Levenshtein distance, the Damerau-Levenshtein distance, the Hamming distance, and the Jaro-Winkler distance.
I will compare four different algorithms to look up a string in a list of strings within a maximum edit distance according to the string metric.
For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.
All algorithms strive for the same goal of short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), possibly reducing the number of full edit distance calculations further, and finally reducing the computational complexity of the edit distance calculation itself, all without compromising its accuracy.
The four different algorithms I want to compare and benchmark are:
Norvig's spelling corrector
BK-tree (Burkhard-Keller tree)
SymSpell (Symmetric Delete spelling correction algorithm)
LinSpell (Linear search spelling correction algorithm)
II. Levenshtein edit distance variations
All four algorithms use derivatives of the Levenshtein edit distance.
There are three different Levenshtein distance variants:
Levenshtein distance
Restricted Damerau-Levenshtein distance (optimal string alignment)
True Damerau-Levenshtein distance
Norvig’s algorithm uses the True Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.
SymSpell uses the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.
LinSpell uses the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.
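For reference, here is a minimal C# sketch of the Restricted Damerau-Levenshtein distance (a plain textbook version, not the optimized implementation used in the benchmark). It counts insertions, deletions, substitutions and adjacent transpositions, but unlike the True Damerau-Levenshtein distance it edits each substring at most once:

```csharp
using System;

public static class EditDistance
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance.
    public static int RestrictedDamerauLevenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;

                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,           // deletion
                    d[i, j - 1] + 1),          // insertion
                    d[i - 1, j - 1] + cost);   // substitution

                // adjacent transposition (each substring edited at most once)
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }
        return d[a.Length, b.Length];
    }
}
```

The later code sketches in this post reuse this function as their metric.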
Verbosity of search results
In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:
Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 2: Return all results within the maximum edit distance, sorted by word frequency. This allows for no early termination of the search.
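In code, the three levels can be modeled as a simple enum which the lookup routine checks to decide whether early termination is allowed (the names below are illustrative):

```csharp
public enum Verbosity
{
    Top = 0,     // only the single best suggestion: smallest edit distance, then highest word frequency
    Closest = 1, // all suggestions sharing the smallest edit distance found
    All = 2      // all suggestions within the maximum edit distance
}
```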
Norvig’s Spelling Corrector
The idea is that if we artificially generate all terms within the maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the four spelling error types (insert, delete, replace and adjacent switch) are generated. This is quite expensive, with e.g. 114,324 candidate terms generated for a word of length 9 and edit distance 2.
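A minimal C# sketch of the distance-1 candidate generation (the analogue of the edits1 function in Norvig's Python original); applying it to every distance-1 candidate again yields the distance-2 candidates, which is where the combinatorial explosion comes from:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NorvigCandidates
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz";

    // All strings at edit distance 1: deletes, adjacent switches, replaces, inserts.
    public static HashSet<string> Edits1(string word)
    {
        var result = new HashSet<string>();

        for (int i = 0; i < word.Length; i++)
            result.Add(word.Remove(i, 1));                                    // delete

        for (int i = 0; i < word.Length - 1; i++)
            result.Add(word.Substring(0, i) + word[i + 1] + word[i]
                       + word.Substring(i + 2));                              // adjacent switch

        foreach (char c in Alphabet)
        {
            for (int i = 0; i < word.Length; i++)
                result.Add(word.Substring(0, i) + c + word.Substring(i + 1)); // replace

            for (int i = 0; i <= word.Length; i++)
                result.Add(word.Insert(i, c.ToString()));                     // insert
        }
        return result;
    }

    // Edit distance 2 = Edits1 applied to every distance-1 candidate.
    public static IEnumerable<string> Edits2(string word) =>
        Edits1(word).SelectMany(Edits1);
}
```

Every generated candidate then has to be looked up in the dictionary, which is what makes the approach expensive for larger edit distances and longer words.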
BK-tree
During indexing the Levenshtein distances Levenshtein(parent node, child node) are precalculated and stored as edge keys.
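A minimal BK-tree sketch (a generic textbook version, not one of the three benchmarked implementations), using the RestrictedDamerauLevenshtein function from above as its metric. Children are keyed by their distance to the parent; the triangle inequality lets the search skip every subtree whose key differs from the query's distance to the current node by more than the maximum edit distance:

```csharp
using System.Collections.Generic;

public class BkTree
{
    class Node
    {
        public string Word;
        public Dictionary<int, Node> Children = new Dictionary<int, Node>();
        public Node(string word) { Word = word; }
    }

    Node root;

    public void Add(string word)
    {
        if (root == null) { root = new Node(word); return; }
        var node = root;
        while (true)
        {
            // Distance to the parent is precalculated here and used as the edge key.
            int d = EditDistance.RestrictedDamerauLevenshtein(word, node.Word);
            if (d == 0) return; // already present
            if (!node.Children.TryGetValue(d, out var child))
            {
                node.Children[d] = new Node(word);
                return;
            }
            node = child;
        }
    }

    public List<string> Search(string query, int maxDistance)
    {
        var results = new List<string>();
        if (root == null) return results;
        var stack = new Stack<Node>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            int d = EditDistance.RestrictedDamerauLevenshtein(query, node.Word);
            if (d <= maxDistance) results.Add(node.Word);

            // Triangle inequality: only subtrees with |key - d| <= maxDistance can contain matches.
            for (int key = d - maxDistance; key <= d + maxDistance; key++)
                if (node.Children.TryGetValue(key, out var child))
                    stack.Push(child);
        }
        return results;
    }
}
```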
There are several interesting posts discussing the BK-tree in detail:
I compared three C# implementations of the BK-Tree
SymSpell Algorithm
SymSpell is an algorithm to find all strings within a maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.
SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.
In contrast to other algorithms, only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
The speed comes from inexpensive delete-only edit candidate generation and pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!
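A sketch of the delete-only candidate generation (a simplified illustration, not the actual SymSpell source). The same function is used to index the dictionary terms during pre-calculation and to expand the input term at lookup time:

```csharp
using System.Collections.Generic;

public static class SymmetricDeletes
{
    // All variants reachable by deleting up to maxEditDistance characters from the word.
    public static HashSet<string> Generate(string word, int maxEditDistance)
    {
        var deletes = new HashSet<string> { word };
        Recurse(word, 0, maxEditDistance, deletes);
        return deletes;
    }

    static void Recurse(string word, int depth, int maxEditDistance, HashSet<string> deletes)
    {
        if (depth >= maxEditDistance || word.Length <= 1) return;
        for (int i = 0; i < word.Length; i++)
        {
            string delete = word.Remove(i, 1);
            if (deletes.Add(delete)) // skip variants already generated
                Recurse(delete, depth + 1, maxEditDistance, deletes);
        }
    }
}
```

A candidate match is found whenever a delete of the input term equals a delete (or the original form) of a dictionary term; the true edit distance between the two original terms is then verified with the Damerau-Levenshtein function, because matching deletes alone only bound the distance.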
The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.
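Prefix indexing then simply means truncating a term to its first lp characters before generating the deletes, roughly like this (building on the sketch above):

```csharp
using System.Collections.Generic;

public static class PrefixIndexing
{
    // Restrict delete generation to the first prefixLength (lp) characters.
    // A longer prefix means higher filter efficiency (speed) at the cost of a larger index.
    public static HashSet<string> DeletesForPrefix(string word, int prefixLength, int maxEditDistance)
    {
        string prefix = word.Length > prefixLength ? word.Substring(0, prefixLength) : word;
        return SymmetricDeletes.Generate(prefix, maxEditDistance);
    }
}
```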
LinSpell
This is basically a linear scan through the word list, calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough, and despite its O(n) characteristics, it turned out to outperform both the BK-tree and Norvig's algorithm.
There are several reasons for that (a simplified scan implementing most of these shortcuts is sketched after the list):
Do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is still more expensive than a linear search if the atomic cost per word in the scan is only 10% of the cost per node visit.
As the Damerau-Levenshtein calculation is very expensive, it is not the number of processed words which matters, but the number of words where we need the full Damerau-Levenshtein calculation. We can speed up the search if we can prefilter words from the calculation or terminate the calculation early once a certain edit distance is reached.
If we restrict the search to best match we can utilize options for early termination of the search.
Words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1)! If we restrict the search to best match, we can instantly terminate the search.
We do not need to calculate the edit distance if Abs(word.Length - input.Length) > maximum edit distance (or the best edit distance found so far, if we restrict the search to best match).
If we restrict the search to best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.
If we restrict the search to best match, and we have found already one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.
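Here is a simplified best-match (verbosity level 0) version of such a linear scan, putting the shortcuts from the list above together; the names are illustrative and the sketch omits the early termination inside the distance calculation itself:

```csharp
using System;
using System.Collections.Generic;

public static class LinSpellSketch
{
    // Best-match linear scan over a word-frequency dictionary.
    // Returns the closest term within maxEditDistance, or null if none exists.
    public static string Lookup(string input, Dictionary<string, long> dictionary, int maxEditDistance)
    {
        // Exact matches are frequent and cost only one hash lookup, O(1).
        if (dictionary.ContainsKey(input)) return input;

        string best = null;
        int bestDistance = maxEditDistance + 1;
        long bestCount = 0;

        foreach (var entry in dictionary)
        {
            string candidate = entry.Key;

            // Length prefilter: the length difference is a lower bound on the edit distance.
            int limit = Math.Min(maxEditDistance, bestDistance);
            if (Math.Abs(candidate.Length - input.Length) > limit) continue;

            // Frequency prune: once a distance-1 match is known, a less frequent
            // candidate cannot win (distance 0 was already ruled out above).
            if (bestDistance == 1 && entry.Value < bestCount) continue;

            int d = EditDistance.RestrictedDamerauLevenshtein(candidate, input);
            if (d > maxEditDistance) continue;

            if (d < bestDistance || (d == bestDistance && entry.Value > bestCount))
            {
                best = candidate;
                bestDistance = d;
                bestCount = entry.Value;
            }
        }
        return best;
    }
}
```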