To implement a function for approximate string matching, we need a reasonable metric to know how similar two strings are. I decided to use Damerau-Levenshtein distance as the metric to achieve the goal.
The Damerau-Levenshtein distance is defined as the minimum number of primitive edit operations needed to transform one string into the other and these operations are substitution, deletion, insertion and the transposition of two adjacent characters.
ABC → AXC Substitution
ABC → AC Deletion
ABC → ABXC Insertion
ABC → ACB Transposition
Here is the Delphi implementation of Damerau-Levenshtein distance algorithm:
function DamerauLevenshteinDistance(const Str1, Str2: String): Integer; function Min(const A, B, C: Integer): Integer; begin Result := A; if B < Result then Result := B; if C < Result then Result := C; end; var LenStr1, LenStr2: Integer; I, J, T, Cost, PrevCost: Integer; pStr1, pStr2, S1, S2: PChar; D: PIntegerArray; begin LenStr1 := Length(Str1); LenStr2 := Length(Str2); // save a bit memory by making the second index points to the shorter string if LenStr1 < LenStr2 then begin T := LenStr1; LenStr1 := LenStr2; LenStr2 := T; pStr1 := PChar(Str2); pStr2 := PChar(Str1); end else begin pStr1 := PChar(Str1); pStr2 := PChar(Str2); end; // bypass leading identical characters while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin Inc(pStr1); Inc(pStr2); Dec(LenStr1); Dec(LenStr2); end; // bypass trailing identical characters while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin Dec(LenStr1); Dec(LenStr2); end; // is the shorter string empty? so, the edit distance is length of the longer one if LenStr2 = 0 then begin Result := LenStr1; Exit; end; // calculate the edit distance GetMem(D, (LenStr2 + 1) * SizeOf(Integer)); for I := 0 to LenStr2 do D[I] := I; S1 := pStr1; for I := 1 to LenStr1 do begin PrevCost := I - 1; Cost := I; S2 := pStr2; for J := 1 to LenStr2 do begin if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then Cost := PrevCost else Cost := 1 + Min(Cost, PrevCost, D[J]); PrevCost := D[J]; D[J] := Cost; Inc(S2); end; Inc(S1); end; Result := D[LenStr2]; FreeMem(D); end;
DamerauLevenshteinDistance function is normally called inside a loop, I had to optimize my implementation. Otherwise, the original algorithm is smaller and more readable.
Now that we have a function to find out how similar two strings are, we can use it to calculate the similarity ratio of the two strings to approximately match them.
function StringSimilarityRatio(const Str1, Str2: String; IgnoreCase: Boolean): Double; var MaxLen: Integer; Distance: Integer; begin Result := 1.0; if Length(Str1) > Length(Str2) then MaxLen := Length(Str1) else MaxLen := Length(Str2); if MaxLen <> 0 then begin if IgnoreCase then Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2)) else Distance := DamerauLevenshteinDistance(Str1, Str2); Result := Result - (Distance / MaxLen); end; end;
The return value of
StringSimilarityRatio function is a floating-point number between 0 (zero) and 1 (one), where 0 means not similar at all and 1 means equal strings.
UPDATE: The algorithm is updated to the one that is optimized by data man.