The “Levenshtein Distance” is a measure of the difference between two strings, which is calculated as the minimum number of single-character edits required to change one string into the other. These edits can be insertions, deletions, or substitutions of a single character.
Named after the Soviet mathematician Vladimir Levenshtein, who first considered this distance in 1965, it’s particularly useful in applications where you want to determine how similar two strings of text are, such as spell checking, DNA sequence alignment, and Natural Language Processing.
For example, let’s take the words “kitten” and “sitting”. The Levenshtein distance between these two words would be 3, because:
- Replace the “k” in “kitten” with an “s” to get “sitten” (substitution)
- Replace the “e” in “sitten” with an “i” to get “sittin” (substitution)
- Append a “g” to “sittin” to get “sitting” (insertion)
Therefore, it took three single-character edits to change “kitten” into “sitting”, so the Levenshtein distance is 3.
The Levenshtein distance is always at least the difference in lengths between the two strings. It is at most the length of the longer string, and it is zero if and only if the strings are equal. When used to compare two strings a and b, the Levenshtein distance is also a metric, as it fulfills the following three conditions:
- It is zero if a and b are the same string.
- It is always at least zero (non-negative).
- It is symmetric (swapping a and b doesn’t change the distance).
- It respects triangle inequality (the distance from a to b to c is always at least the distance from a to c).