Fuzzy search / fuzzy string matching
When it comes to searching, developers usually apply straightforward approaches such as exact match or searching using wildcards. In terms of SQL, queries become similar to the following:
SELECT * FROM movies WHERE name = "Forrest Gump"
Or using LIKE operator and wildcards:
SELECT * FROM movies WHERE name LIKE "%Forrest%"
This is fine, as long as you do not consider the human factor. So what can go wrong when people search for a term? There are many possibilities, including mistyping, misspelling, national spelling differences, synonyms, and other human errors.
In an ideal case, it would be ideal to handle such behaviour, such that, for example, if a user mistyped a word, they are still able to get relevant search results. To accomplish this, however, is a complicated task that will take time and effort to solve.
Theory and algorithms
Multiple algorithms have been developed to address this type of problem; the most common are: Damerau-Levenshtein Algorithm, Trigrams, Soundex and Double Metaphone. Let's quickly discuss each of these.
This algorithm calculates the Damerau-Levenshtein Distance between two strings. Damerau-Levenshtein Distance (sometimes it is called "edit distance") is the number of steps needed to make the first string look like the second one. Each step is one of the following: update (change one letter), insert (insert a letter), delete (delete a letter) and transposition of letters that are right next to each other.
For example, if we have strings "center" and "centre", conducting 1 step (flip last letters) is enough to make the words look identical. Thus, the Damerau-Levenshtein Distance between these two strings is 1.
The less distance is, the more similar strings are.
Trigram is 3 sequential letters of a string. The idea is to create a full set of all possible trigrams for each string, and then find the number of matching trigrams in those sets.
Let's get trigrams for "Forrest Gump"
'For', 'orr', 'rre', 'res', 'est', 'st', 't G', ' Gu', 'Gum','ump'
And do the same for “Forest Gump” (one missing letter)
'For', 'ore', 'res', 'est', 'st ', 't G', ' Gu', 'Gum','ump'
There are 9 matching trigrams in the sets, which is enough to treat these strings as similar and return the correct one, even if the original search request was misspelled.
Soundex and Double Metaphone
Double Metaphone expands the Soundex algorithm, which are both phonetic search algorithms. The idea is to turn a string into some sort of its phonetic representation and then compare these phonetic representations. Soundex and metaphone produce phonetics based on English pronunciation.
Soundex produces phonetic keys of the same length (it always returns 4 characters), while Metaphone allows varying lengths of keys and uses a wider range of pronunciation rules.
Double Metaphone is even more powerful by returning primary and secondary code for each name.
The idea behind phonetic search is to convert search terms and target names into their phonetic equivalents and then compare phonetic equivalents.
Various combinations of above algorithms
Of course these algorithms can be combined in multiple ways. For example, we can calculate double metaphone and then apply trigrams to the results.
Fortunately, this is not only a theoretical consideration. Plenty of modern software either supports it out of the box or has modules that support fuzzy search and matching. There are modules pg_trgm and fuzzystrmatch that provide quite a lot of flexibility with fuzzy search for those developers who use Postgresql.
SIMILARITY function lets you search strings by trigrams similarity index (from 0 to 1, where 1 means full match)
SELECT * FROM authors WHERE SIMILARITY(name,"Isac Asimov") > 0.4 ;
Pg_trigrams extension also has other useful functions, such as word_similarity (text, text) that returns the greatest similarity between one string and any substring of another string.
Levenshtein function returns Levenshtein distance between its arguments:
SELECT name FROM user WHERE levenshtein("Cameron", name) = 1 ;
This query returns all records where the number of steps required to turn the name column to the string "Cameron" is less than or equal to 1.
Metaphone is a phonetic algorithm:
SELECT * FROM animals WHERE METAPHONE(name,5) = METAPHONE("elefant",6);
The second parameter of the function is the maximum length of return value; the greater parameter is, the more similarity of the string is required.
This query, for example, will return elephant, since metaphone of "elefant" is "ELFNT", the same as metaphone of "elephant"
dmetaphone and dmetaphone_alt functions add some flexibility to the matching mechanism. Dmetaphone_alt matching requires minimal similarity of the strings
Sometimes combinations of the functions are very helpful.
SELECT * FROM movies WHERE SIMILARITY(METAPHONE(name,10), METAPHONE("Si Tomlee",10)) > 0.4
MySql has built in function SOUNDEX and nGrams functionality, which is used for full text search. There is no built-in function for levenshtein, but you can Google it, as there are pretty good implementations available that you can add.
In Oracle, there are a wide range of possibilities for fuzzy search: SOUNDEX function, UTL_MATCH package that contains ‘Edit distance’ (Levenshtein) - related functions. There are open-source PL/SQL implementations of double metaphone.