📌
Kiến thức tôi biết
  • Giới thiệu
  • Hyperledger Fabric
    • Các công cụ và môi trường
      • Công cụ hỗ trợ cho Ubuntu
      • Docker và hơn thế nữa
      • Vagrant
      • CouchDB
      • PostgresSQL
  • HYPERLEDGER INDY
  • Laravel
    • Cài đặt
    • Tổng hợp câu lệnh
    • Blade Template Engine
  • MySQL
    • Untitled
  • Design pattern
    • Untitled
  • Angular
    • Untitled
  • React
    • Untitled
  • Ethereum
    • Untitled
  • IBM Waston iot platform
    • Untitled
  • dijango
    • Untitled
  • nodejs
    • Untitled
  • java
    • Untitled
  • C# - Winform
    • Untitled
  • C# - ASP.Net
    • Untitled
  • golang
    • Untitled
  • OOP
    • Untitled
  • load balancer
    • Untitled
  • HTML
    • Untitled
  • CSS-SCSS-LESS-SASS
    • Untitled
  • JS-JSX-TS
    • Untitled
  • Vuejs
    • Untitled
  • Ngoài lề
    • Hướng dẫn cài đặt iSpring 9.1 Việt Hóa
  • Xử lý ảnh
    • Miền không gian
    • Miền tần số
    • Ảnh màu
  • Python
    • Tổng quan
  • OpenVPN
  • Symspellpy
  • SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking
Powered by GitBook
On this page
  • I. Approximate string search algorithms
  • II. Levenshtein edit distance variations
  • Verbosity of search results
  • Norvig’s Spelling Corrector
  • BK-tree
  • SymSpell Algorithm
  • LinSpell

Was this helpful?

SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking

Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?

PreviousSymspellpy

Last updated 5 years ago

Was this helpful?

Conventional wisdom and textbooks say are especially suited for spelling correction and fuzzy string search. But does this really hold true?

Also in the comments to my the BK-tree has been mentioned as a superior data structure for fuzzy search.

So I decided to compare and benchmark the BK-tree to other options.

I. Approximate string search algorithms

Approximate string search allows to lookup a string in a list of strings and return those strings which are close according to a specific .

There are are many are different string metrics like , , , and .

I will compare four different algorithms to lookup a string in a list of strings within a maximum according to the string metric.

For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.

All algorithms strive for the same goals to achieve short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), possibly reducing further the number of full edit distance calculations and finally reducing the computational complexity of the edit distance calculation itself, while not compromising its accuracy.

The four different algorithms I want to compare and benchmark are:

  • (Symmetric Delete spelling correction algorithm)

  • (Linear search spelling correction algorithm)

II. Levenshtein edit distance variations

All four algorithms are using derivatives of the Levenshtein edit distance. There are three different levenshtein distances:

Norvig’s algorithm is using the true Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.

SymSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

LinSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

Verbosity of search results

In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:

Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 2: Return all results within the maximum edit distance, sorted by word frequency. This allows for no early termination of the search.

Norvig’s Spelling Corrector

The idea is if we artificially generate all terms within maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the 4 spelling error types (insert, delete, replace and adjacent switch) are generated. This is quite expensive with e.g. 114,324 candidate term generated for a word of length=9 and edit distance=2.

BK-tree

During indexing the Levenshtein(root node,child node) are precalculated.

There are several interesting posts discussing the BK-tree in detail:

I compared three C# implementations of the BK-Tree

SymSpell Algorithm

SymsSpell is an algorithm to find all strings within an maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.

SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.

Opposite to other algorithms only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

The speed comes from inexpensive delete-only edit candidate generation and pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!

The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.

LinSpell

This is basically a linear scan through the word list and calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough and despite its O(n) characteristics it turned out to excel both BK-tree and Norvig’s algorithm.

There are several reasons for that:

  • do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is more expensive than a linear search where the atomic cost is only 10%.

  • As Damerau-Levenshtein calculation is very expensive it is not the number of processed words which matters, but the number of words where we need the full Damerau-Levenshtein calculation. We can speedup the search if we can prefilter words from the calculation or terminate the calculation once a certain edit distance is reached.

  • If we restrict the search to best match we can utilize options for early termination of the search.

  • words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1) ! If we restrict the search to best match, we can instantaneously terminate the search.

  • We do not need to calculate the edit distance if Abs(word.Length-input.Length) > Maximum edit distance (or best edit distance found so far if we restrict the search to best match)

  • If we restrict the search to best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.

  • If we restrict the search to best match, and we have found already one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.

: adjacent transposition (AC->CA) counted as 2 edits. The triangle inequality does hold.

(Optimal string alignment algorithm): adjacent transposition counted as 1 edit, but substrings can’t be edited more than once: ed(“CA” , “ABC”) =3. The triangle inequality does not hold.

: adjacent transposition counted as 1 edit, substrings can be edited more than once: ed(“CA” , “ABC”) =2. The triangle inequality does hold.

The BK-Tree is using the Levenshtein edit distance. It could be modified to use the True Damerau-Levenshtein edit distance, but not the Restricted Damerau-Levenshtein edit distance where the .

Originally it was . For the benchmark I used the faithful of Peter Norvig’s algorithm, which has been extended to support edit distance 3.

The utilizes the , a property of the Levenshtein edit distance: Levenstein(A,B)+Levenstein(A,C)≥Levenstein(B,C) and Levenstein(A,B)−Levenstein(A,C)≤Levenstein(B,C).

During lookup we calculate Levenshtein(input ,root node). The is used as a filter, to only recursively follow those child nodes where the precalculated Levenstein(root node,child node) is in the range [Levenstein(input ,root node)-dmax, Levenstein(input ,root node)+dmax].

BK-Tree1 ()

BK-Tree2 ()

BK-Tree3 ()

and decided to use the (which is also linked from Wikipedia) for this benchmark.

The reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

The .

The .

BK-trees
blog post on spelling correction
string metric
Levenshtein
Damerau-Levenshtein
Hamming distance
Jaro-Winkler
Strike a match
edit distance
Damerau-Levenshtein
Norvig's Spelling Corrector
BK-tree (Burkhard-Keller-tree)
SymSpell
LinSpell
Levenshtein distance
Restricted Damerau-Levenshtein distance
True Damerau-Levenshtein distance
implementation of Xenopax
triangle inequality required for a BK tree does not hold
written in Python
C# port from Lorenzo Stoakes
BK-tree
triangle inequality
triangle inequality
The BK-Tree — A Data Structure for Spell Checking
Interesting data structures: the BK-tree
Interesting data structures: the BK-tree (HN discussion)
Damn Cool Algorithms, Part 1: BK-Trees
BK-Tree | Introduction & Implementation
Implementing BK-tree in Clojure
BK-tree
C# implementation of tgriffith
C# implementation of TarasRoshko
C# implementation of Xenopax
fastest one from Xenopax
Symmetric Delete spelling correction algorithm
C# source code of SymSpell is released as Open Source on GitHub
C# source code of LinSpell is released as Open Source on GitHub