To implement a function for approximate string matching, we need a reasonable metric to know how similar two strings are. I decided to use Damerau-Levenshtein distance as the metric to achieve the goal.

Damerau-Levenshtein Distance

The Damerau-Levenshtein distance is defined as the minimum number of primitive edit operations needed to transform one string into the other, and these operations are substitution, deletion, insertion, and the transposition of two adjacent characters.

ABC → AXC       Substitution
ABC → AC        Deletion
ABC → ABXC      Insertion
ABC → ACB       Transposition

Here is the Delphi implementation of Damerau-Levenshtein distance algorithm:

function DamerauLevenshteinDistance(const Str1, Str2: String): Integer;
var
  LenStr1, LenStr2: Integer;
  I, J, T, Cost, Minimum: Integer;
  pStr1, pStr2, S1, S2: PChar;
  D, RowPrv2, RowPrv1, RowCur, Temp: PIntegerArray;
begin
  LenStr1 := Length(Str1);
  LenStr2 := Length(Str2);
  // to save some space, make sure the second index points to the shorter string
  if LenStr1 < LenStr2 then
  begin
    T := LenStr1;
    LenStr1 := LenStr2;
    LenStr2 := T;
    pStr1 := PChar(Str2);
    pStr2 := PChar(Str1);
  end
  else
  begin
    pStr1 := PChar(Str1);
    pStr2 := PChar(Str2);
  end;
  // to save some time and space, look for exact match 
  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
  begin
    Inc(pStr1);
    Inc(pStr2);
    Dec(LenStr1);
    Dec(LenStr2);
  end;
  // when one string is empty, length of the other is the distance
  if LenStr2 = 0 then
  begin
    Result := LenStr1;
    Exit;
  end;
  // calculate the edit distance
  T := LenStr2 + 1;
  GetMem(D, 3 * T * SizeOf(Integer));
  FillChar(D^, 2 * T * SizeOf(Integer), 0);
  RowCur := D;
  RowPrv1 := @D[T];
  RowPrv2 := @D[2 * T];
  S1 := pStr1;
  for I := 1 to LenStr1 do
  begin
    Temp := RowPrv2;
    RowPrv2 := RowPrv1;
    RowPrv1 := RowCur;
    RowCur := Temp;
    RowCur[0] := I;
    S2 := pStr2;
    for J := 1 to LenStr2 do
    begin
      Cost := Ord(S1^ <> S2^);
      Minimum := RowPrv1[J - 1] + Cost;      // substitution
      T := RowCur[J - 1] + 1;                // insertion
      if T < Minimum then
        Minimum := T;
      T := RowPrv1[J] + 1;                   // deletion
      if T < Minimum then
        Minimum := T;
      if (I <> 1) and (J <> 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^) then
      begin
        T := RowPrv2[J - 2] + Cost;          // transposition
        if T < Minimum then
          Minimum := T;
      end;
      RowCur[J] := Minimum;
      Inc(S2);
    end;
    Inc(S1);
  end;
  Result := RowCur[LenStr2];
  FreeMem(D);
end;

Because DamerauLevenshteinDistance function is normally called inside a loop, I had to optimize my implementation. Otherwise, the original algorithm is smaller and more readable.

Similarity Ratio

Now that we have a function to find out how similar two strings are, we can use it to calculate the similarity ratio of the two strings to approximately match them.

function StringSimilarityRatio(const Str1, Str2: String; IgnoreCase: Boolean): Double;
var
  MaxLen: Integer;
  Distance: Integer;
begin
  Result := 1.0;
  if Length(Str1) > Length(Str2) then
    MaxLen := Length(Str1)
  else
    MaxLen := Length(Str2);
  if MaxLen <> 0 then
  begin
    if IgnoreCase then
      Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))
    else
      Distance := DamerauLevenshteinDistance(Str1, Str2);
    Result := Result - (Distance / MaxLen);
  end;
end;

The return value of StringSimilarityRatio function is a floating-point number between 0 (zero) and 1 (one), where 0 means not similar at all and 1 means equal strings.

Reader's Comments »

  1. 1. By data man on November 26, 2010 at 02:47

    Thank for post.

    I added

      while (LenStr2 <> 0) and ((pStr1+LenStr1-1)^ = (pStr2+LenStr2-1)^) do
      begin
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    after

      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
      begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    and the speed has increased slightly, if the line endings match.

  2. 2. By Francisco DueƱas on January 28, 2011 at 21:47

    Did you add the lines or replace the old ones?
    ———————————————-
    I added

      while (LenStr2 <> 0) and ((pStr1+LenStr1-1)^ = (pStr2+LenStr2-1)^) do
      begin
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    after

      while (LenStr2 <> 0) and (pStr1^ = pStr2^) do
      begin
        Inc(pStr1);
        Inc(pStr2);
        Dec(LenStr1);
        Dec(LenStr2);
      end;

    and the speed has increased slightly, if the line endings match.
    —————————————————————————-

  3. 3. By Kambiz on February 2, 2011 at 11:29

    No, I didn’t update the original post.
    You can find a more optimized code on the forum.

Leave a Reply

Required fields are marked with asterisk (*)