SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking

Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?

Conventional wisdom and textbooks say BK-treesarrow-up-right are especially suited for spelling correction and fuzzy string search. But does this really hold true?

Also in the comments to my blog post on spelling correctionarrow-up-right the BK-tree has been mentioned as a superior data structure for fuzzy search.

So I decided to compare and benchmark the BK-tree to other options.

I. Approximate string search algorithms

Approximate string search allows to lookup a string in a list of strings and return those strings which are close according to a specific string metricarrow-up-right.

There are are many are different string metrics like Levenshteinarrow-up-right, Damerau-Levenshteinarrow-up-right, Hamming distancearrow-up-right, Jaro-Winklerarrow-up-right and Strike a matcharrow-up-right.

I will compare four different algorithms to lookup a string in a list of strings within a maximum edit distancearrow-up-right according to the Damerau-Levenshteinarrow-up-right string metric.

For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.

All algorithms strive for the same goals to achieve short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), possibly reducing further the number of full edit distance calculations and finally reducing the computational complexity of the edit distance calculation itself, while not compromising its accuracy.

The four different algorithms I want to compare and benchmark are:

II. Levenshtein edit distance variations

All four algorithms are using derivatives of the Levenshtein edit distance. There are three different levenshtein distances:

Norvig’s algorithm is using the true Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.

The BK-Tree implementation of Xenopaxarrow-up-right is using the Levenshtein edit distance. It could be modified to use the True Damerau-Levenshtein edit distance, but not the Restricted Damerau-Levenshtein edit distance where the triangle inequality required for a BK tree does not holdarrow-up-right.

SymSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

LinSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

Verbosity of search results

In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:

Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 2: Return all results within the maximum edit distance, sorted by word frequency. This allows for no early termination of the search.

Norvig’s Spelling Corrector

The idea is if we artificially generate all terms within maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the 4 spelling error types (insert, delete, replace and adjacent switch) are generated. This is quite expensive with e.g. 114,324 candidate term generated for a word of length=9 and edit distance=2.

Originally it was written in Pythonarrow-up-right. For the benchmark I used the faithful C# port from Lorenzo Stoakesarrow-up-right of Peter Norvig’s algorithm, which has been extended to support edit distance 3.

BK-tree

The BK-treearrow-up-right utilizes the triangle inequalityarrow-up-right, a property of the Levenshtein edit distance: Levenstein(A,B)+Levenstein(A,C)≥Levenstein(B,C) and Levenstein(A,B)−Levenstein(A,C)≤Levenstein(B,C).

During indexing the Levenshtein(root node,child node) are precalculated.

During lookup we calculate Levenshtein(input ,root node). The triangle inequalityarrow-up-right is used as a filter, to only recursively follow those child nodes where the precalculated Levenstein(root node,child node) is in the range [Levenstein(input ,root node)-dmax, Levenstein(input ,root node)+dmax].

There are several interesting posts discussing the BK-tree in detail:

I compared three C# implementations of the BK-Tree

and decided to use the fastest one from Xenopaxarrow-up-right (which is also linked from Wikipedia) for this benchmark.

SymSpell Algorithm

SymsSpell is an algorithm to find all strings within an maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.

SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.

The Symmetric Delete spelling correction algorithmarrow-up-right reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Opposite to other algorithms only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

The speed comes from inexpensive delete-only edit candidate generation and pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!

The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.

The C# source code of SymSpell is released as Open Source on GitHubarrow-up-right.

LinSpell

This is basically a linear scan through the word list and calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough and despite its O(n) characteristics it turned out to excel both BK-tree and Norvig’s algorithm.

There are several reasons for that:

  • do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is more expensive than a linear search where the atomic cost is only 10%.

  • As Damerau-Levenshtein calculation is very expensive it is not the number of processed words which matters, but the number of words where we need the full Damerau-Levenshtein calculation. We can speedup the search if we can prefilter words from the calculation or terminate the calculation once a certain edit distance is reached.

  • If we restrict the search to best match we can utilize options for early termination of the search.

  • words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1) ! If we restrict the search to best match, we can instantaneously terminate the search.

  • We do not need to calculate the edit distance if Abs(word.Length-input.Length) > Maximum edit distance (or best edit distance found so far if we restrict the search to best match)

  • If we restrict the search to best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.

  • If we restrict the search to best match, and we have found already one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.

The C# source code of LinSpell is released as Open Source on GitHubarrow-up-right.

Last updated