To implement a function for approximate string matching, we need a reasonable metric to know how similar two strings are. I decided to use Damerau-Levenshtein distance as the metric to achieve the goal.

Damerau-Levenshtein Distance

The Damerau-Levenshtein distance is defined as the minimum number of primitive edit operations needed to transform one string into the other and these operations are substitution, deletion, insertion and the transposition of two adjacent characters.

ABC → AXC       Substitution
ABC → AC        Deletion
ABC → ABXC      Insertion
ABC → ACB       Transposition

Here is the Delphi implementation of Damerau-Levenshtein distance algorithm:

function DamerauLevenshteinDistance(const Str1, Str2: String): Integer;
 
  function Min(const A, B, C: Integer): Integer; 
  begin
    Result := A;
    if B < Result then
      Result := B;
    if C < Result then
      Result := C;
  end;
 
var
  LenStr1, LenStr2: Integer;
  I, J, T, Cost, PrevCost: Integer;
  pStr1, pStr2, S1, S2: PChar;
  D: PIntegerArray;
begin
  LenStr1 := Length(Str1);
  LenStr2 := Length(Str2);
 
  // save a bit memory by making the second index points to the shorter string
  if LenStr1 < LenStr2 then
  begin
    T := LenStr1;
    LenStr1 := LenStr2;
    LenStr2 := T;
    pStr1 := PChar(Str2);
    pStr2 := PChar(Str1);
  end
  else
  begin
    pStr1 := PChar(Str1);
    pStr2 := PChar(Str2);
  end;
 
  // bypass leading identical characters
  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
  begin
    Inc(pStr1);
    Inc(pStr2);
    Dec(LenStr1);
    Dec(LenStr2);
  end;
 
  // bypass trailing identical characters
  while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do
  begin
    Dec(LenStr1);
    Dec(LenStr2);
  end;
 
  // is the shorter string empty? so, the edit distance is length of the longer one
  if LenStr2 = 0 then
  begin
    Result := LenStr1;
    Exit;
  end;
 
  // calculate the edit distance
  GetMem(D, (LenStr2 + 1) * SizeOf(Integer));
 
  for I := 0 to LenStr2 do
    D[I] := I;
 
  S1 := pStr1;
  for I := 1 to LenStr1 do
  begin
    PrevCost := I - 1;
    Cost := I;
    S2 := pStr2;
    for J := 1 to LenStr2 do
    begin
      if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then
        Cost := PrevCost
      else
        Cost := 1 + Min(Cost, PrevCost, D[J]);
      PrevCost := D[J];
      D[J] := Cost;
      Inc(S2);
    end;
    Inc(S1);
  end;
  Result := D[LenStr2];
  FreeMem(D);
end;

Because DamerauLevenshteinDistance function is normally called inside a loop, I had to optimize my implementation. Otherwise, the original algorithm is smaller and more readable.

Similarity Ratio

Now that we have a function to find out how similar two strings are, we can use it to calculate the similarity ratio of the two strings to approximately match them.

function StringSimilarityRatio(const Str1, Str2: String; IgnoreCase: Boolean): Double;
var
  MaxLen: Integer;
  Distance: Integer;
begin
  Result := 1.0;
  if Length(Str1) > Length(Str2) then
    MaxLen := Length(Str1)
  else
    MaxLen := Length(Str2);
  if MaxLen <> 0 then
  begin
    if IgnoreCase then
      Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))
    else
      Distance := DamerauLevenshteinDistance(Str1, Str2);
    Result := Result - (Distance / MaxLen);
  end;
end;

The return value of StringSimilarityRatio function is a floating-point number between 0 (zero) and 1 (one), where 0 means not similar at all and 1 means equal strings.

UPDATE: The algorithm is updated to the one that is optimized by data man.

Reader's Comments »

  1. 1. By data man on November 26, 2010 at 02:47

    Thank for post.

    I added

      while (LenStr2 <> 0) and ((pStr1+LenStr1-1)^ = (pStr2+LenStr2-1)^) do
      begin
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    after

      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
      begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    and the speed has increased slightly, if the line endings match.

  2. 2. By Francisco DueƱas on January 28, 2011 at 21:47

    Did you add the lines or replace the old ones?
    ———————————————-
    I added

      while (LenStr2 <> 0) and ((pStr1+LenStr1-1)^ = (pStr2+LenStr2-1)^) do
      begin
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    after

      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
      begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    and the speed has increased slightly, if the line endings match.
    —————————————————————————-

  3. 3. By Michel Deslierres on May 21, 2013 at 16:43

    Thank you for the algorithm. There seems to be a little bug.

    When initializing D, I should loop from 0 to T-1 (which will be equal to LenStr2) :

    for I := 0 to T-1 do
    D[I] := I;

    Regards

  4. 4. By Kambiz on May 21, 2013 at 18:43

    @Michel: You’re right. Thanks for your attention!