SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking
Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?
The BK-tree has also been mentioned as a superior data structure for fuzzy search in the comments to my blog post on spelling correction.
So I decided to compare and benchmark the BK-tree to other options.
I. Approximate string search algorithms
Approximate string search allows us to look up a string in a list of strings and return those strings which are close according to a specific string metric.
There are many different string metrics, such as Levenshtein, Damerau-Levenshtein, Hamming distance, Jaro-Winkler and Strike a match.
I will compare four different algorithms to lookup a string in a list of strings within a maximum edit distance according to the Damerau-Levenshtein string metric.
For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.
All algorithms strive for the same goals to achieve short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), further reducing the number of full edit distance calculations, and finally reducing the computational complexity of the edit distance calculation itself, without compromising its accuracy.
The four different algorithms I want to compare and benchmark are:
Norvig’s Spelling Corrector
BK-tree (Burkhard-Keller tree)
SymSpell (Symmetric Delete spelling correction algorithm)
LinSpell (Linear search spelling correction algorithm)
II. Levenshtein edit distance variations
All four algorithms use derivatives of the Levenshtein edit distance. There are three different Levenshtein distances:
Levenshtein distance: adjacent transposition (AC->CA) counted as 2 edits. The triangle inequality does hold.
Restricted Damerau-Levenshtein distance (optimal string alignment algorithm): adjacent transposition counted as 1 edit, but substrings can’t be edited more than once: ed(“CA”, “ABC”) = 3. The triangle inequality does not hold.
True Damerau-Levenshtein distance: adjacent transposition counted as 1 edit, substrings can be edited more than once: ed(“CA”, “ABC”) = 2. The triangle inequality does hold.
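To make the difference concrete, here is a minimal C# sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance; this is my illustration, not the code of any of the benchmarked implementations. The single transposition case added to the plain Levenshtein recurrence is what yields ed(“CA”, “ABC”) = 3 here, while the true Damerau-Levenshtein distance yields 2.

```csharp
using System;

// Restricted Damerau-Levenshtein (optimal string alignment) distance:
// like Levenshtein, plus adjacent transpositions counted as 1 edit,
// but no substring may be edited more than once.
static int OsaDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,              // deletion
                d[i, j - 1] + 1),             // insertion
                d[i - 1, j - 1] + cost);      // substitution
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1); // transposition
        }
    }
    return d[a.Length, b.Length];
}

// OsaDistance("CA", "ABC") == 3; the true Damerau-Levenshtein distance is 2.
```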
Norvig’s algorithm uses the true Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.
The BK-tree implementation of Xenopax uses the Levenshtein edit distance. It could be modified to use the true Damerau-Levenshtein edit distance, but not the restricted Damerau-Levenshtein edit distance, where the triangle inequality required for a BK-tree does not hold.
SymSpell uses the restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the true Damerau-Levenshtein distance.
LinSpell uses the restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the true Damerau-Levenshtein distance.
Verbosity of search results
In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:
Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 2: Return all results within the maximum edit distance, sorted by word frequency. This does not allow for early termination of the search.
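As a sketch, the three levels could be encoded as a simple enum; the names below are illustrative and not necessarily those used in the benchmarked implementations.

```csharp
// Hypothetical encoding of the three verbosity levels.
enum Verbosity
{
    Top = 0,     // best match only: smallest edit distance, then highest word frequency
    Closest = 1, // all matches sharing the smallest found edit distance
    All = 2      // all matches within the maximum edit distance (no early termination)
}
```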
Norvig’s Spelling Corrector
The idea is that if we artificially generate all terms within maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the 4 spelling error types (insert, delete, replace and adjacent transposition) are generated. This is quite expensive, with e.g. 114,324 candidate terms generated for a word of length=9 and edit distance=2.
Originally it was written in Python. For the benchmark I used Lorenzo Stoakes’ faithful C# port of Peter Norvig’s algorithm, which has been extended to support edit distance 3.
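To illustrate the candidate explosion, here is a minimal C# sketch mirroring the edits1 function of Norvig’s Python original (the alphabet and method name are my assumptions, not taken from the benchmarked port). For a word of length n it yields 54n + 25 distance-1 candidates; applying it recursively produces the distance-2 and distance-3 sets.

```csharp
using System.Collections.Generic;

// Generate all candidates at edit distance 1: deletes, adjacent
// transpositions, replaces and inserts (may contain duplicates).
static IEnumerable<string> Edits1(string word)
{
    const string alphabet = "abcdefghijklmnopqrstuvwxyz";
    for (int i = 0; i <= word.Length; i++)
    {
        string left = word.Substring(0, i), right = word.Substring(i);
        if (right.Length > 0)
            yield return left + right.Substring(1);                       // delete
        if (right.Length > 1)
            yield return left + right[1] + right[0] + right.Substring(2); // transpose
        foreach (char c in alphabet)
        {
            if (right.Length > 0)
                yield return left + c + right.Substring(1);               // replace
            yield return left + c + right;                                // insert
        }
    }
}
```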
BK-tree
The BK-tree utilizes the triangle inequality, a property of the Levenshtein edit distance: Levenshtein(A,B) + Levenshtein(A,C) ≥ Levenshtein(B,C) and Levenshtein(A,B) − Levenshtein(A,C) ≤ Levenshtein(B,C).
During indexing, Levenshtein(root node, child node) is precalculated.
During lookup we calculate Levenshtein(input, root node). The triangle inequality is used as a filter, to only recursively follow those child nodes where the precalculated Levenshtein(root node, child node) is in the range [Levenshtein(input, root node) − dmax, Levenshtein(input, root node) + dmax].
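A minimal BK-tree sketch in C# (my illustration, not the Xenopax implementation) showing how the precalculated child distances drive this range filter during lookup:

```csharp
using System;
using System.Collections.Generic;

// Each child edge is keyed by the precalculated distance to its parent.
class BkNode
{
    public string Word;
    public Dictionary<int, BkNode> Children = new Dictionary<int, BkNode>();
}

class BkTree
{
    private BkNode root;
    private readonly Func<string, string, int> distance; // must satisfy the triangle inequality

    public BkTree(Func<string, string, int> distance) { this.distance = distance; }

    public void Add(string word)
    {
        if (root == null) { root = new BkNode { Word = word }; return; }
        var node = root;
        while (true)
        {
            int d = distance(word, node.Word);
            if (d == 0) return; // word already present
            if (node.Children.TryGetValue(d, out var child)) node = child;
            else { node.Children[d] = new BkNode { Word = word }; return; }
        }
    }

    public void Search(string input, int dmax, List<string> results, BkNode node = null)
    {
        node = node ?? root;
        if (node == null) return;
        int d = distance(input, node.Word);
        if (d <= dmax) results.Add(node.Word);
        // Triangle inequality filter: only children in [d - dmax, d + dmax]
        // can contain words within dmax of the input.
        for (int k = Math.Max(1, d - dmax); k <= d + dmax; k++)
            if (node.Children.TryGetValue(k, out var child))
                Search(input, dmax, results, child);
    }
}
```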
There are several interesting posts discussing the BK-tree in detail:
I compared three C# implementations of the BK-tree:
BK-Tree1 (C# implementation of tgriffith)
BK-Tree2 (C# implementation of TarasRoshko)
BK-Tree3 (C# implementation of Xenopax)
and decided to use the fastest one from Xenopax (which is also linked from Wikipedia) for this benchmark.
SymSpell Algorithm
SymSpell is an algorithm to find all strings within a maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.
SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
In contrast to other algorithms, only deletes are required; no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
The speed comes from inexpensive delete-only edit candidate generation and pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!
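A minimal sketch of the symmetric delete candidate generation (my illustration; the actual SymSpell source is organized differently): recursively remove single characters up to the maximum edit distance. For a 5-letter word with distinct letters, such as “about”, and a maximum edit distance of 3 this yields exactly the 25 deletes mentioned above (5 + 10 + 10 for one, two and three deletes).

```csharp
using System.Collections.Generic;

// Generate all variants of a word reachable by up to maxDistance deletes.
static HashSet<string> Deletes(string word, int maxDistance)
{
    var result = new HashSet<string>();
    DeletesRecurse(word, maxDistance, result);
    return result;
}

static void DeletesRecurse(string word, int remaining, HashSet<string> result)
{
    if (remaining == 0 || word.Length <= 1) return;
    for (int i = 0; i < word.Length; i++)
    {
        string del = word.Remove(i, 1);
        if (result.Add(del))                      // skip duplicates
            DeletesRecurse(del, remaining - 1, result);
    }
}

// Deletes("about", 3).Count == 25
```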
The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.
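On top of the sketch above, prefix indexing can be approximated as follows (the helper name and the parameter lp are illustrative):

```csharp
// Restrict delete candidate generation to the first lp characters of the
// word: a smaller index at a small cost in filter efficiency.
static HashSet<string> PrefixDeletes(string word, int maxDistance, int lp)
{
    string prefix = word.Length > lp ? word.Substring(0, lp) : word;
    return Deletes(prefix, maxDistance); // Deletes() from the sketch above
}
```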
The C# source code of SymSpell is released as Open Source on GitHub.
LinSpell
This is basically a linear scan through the word list, calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough, and despite its O(n) characteristics, it turned out to outperform both the BK-tree and Norvig’s algorithm.
There are several reasons for that (combined in the sketch after this list):
Do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is more expensive than a linear search where the atomic cost per word is only 10% of the cost of processing a tree node.
As the Damerau-Levenshtein calculation is very expensive, it is not the number of processed words which matters, but the number of words for which we need the full Damerau-Levenshtein calculation. We can speed up the search if we can prefilter words from the calculation or terminate the calculation once a certain edit distance is reached.
If we restrict the search to best match we can utilize options for early termination of the search.
Words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1)! If we restrict the search to best match, we can instantaneously terminate the search.
We do not need to calculate the edit distance if Abs(word.Length - input.Length) > maximum edit distance (or the best edit distance found so far, if we restrict the search to best match).
If we restrict the search to best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.
If we restrict the search to best match, and we have found already one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.
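The sketch below (my illustration, not the actual LinSpell source) combines these tweaks for a best-match lookup; it reuses OsaDistance from the sketch in section II. A real implementation would additionally pass the best distance found so far into the edit distance calculation so it can terminate early.

```csharp
using System;
using System.Collections.Generic;

// Linear scan over a word-frequency dictionary, restricted to best match.
static string LookupBestMatch(string input, Dictionary<string, long> dictionary, int maxDistance)
{
    // Words with no spelling errors are the frequent case: O(1) via hash table.
    if (dictionary.ContainsKey(input)) return input;

    string best = null;
    int bestDistance = maxDistance;
    long bestCount = -1;

    foreach (var entry in dictionary)
    {
        string word = entry.Key;

        // Length prefilter: skip the expensive calculation entirely.
        if (Math.Abs(word.Length - input.Length) > bestDistance) continue;

        // Once a distance-1 match is found, only higher-frequency terms can win.
        if (best != null && bestDistance == 1 && entry.Value <= bestCount) continue;

        int d = OsaDistance(word, input); // from the sketch in section II
        if (d < bestDistance || (d == bestDistance && entry.Value > bestCount))
        {
            bestDistance = d; bestCount = entry.Value; best = word;
        }
    }
    return best;
}
```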
The C# source code of LinSpell is released as Open Source on GitHub.