Dec 282010
 

Levenshtein algorithm is one of possible fuzzy strings matching algorithm. Levenshtein algorithm calculates Levenshtein distance which is a metric for measuring a difference between two strings. The Levenshtein distance is also called an edit distance and it defines minimum single character edits (insert/updates/deletes) needed to transform one string to another. Details on the algorithm itself can be found on Wikipedia.
When you need to use it in queries, functions or stored procedures you have two possibilities – T-SQL implementation and CLR implementation.
I will show both solutions here and also compare the speed of both solutions.

T-SQL implementation of Levenshtein algorithm

For T-SQL I will took the one I have found on the SqlTeam.com forums a which was originally developed by Joseph Gama as mentioned in the SqlTeam post.

CLR implementation of Levenshtein algorithm

You need to compile the code using e.g. C# Express into a an assembly and create the assembly in DB. In my Case the assembly is named [SQLCLR].

Testing of the functions

Once we have properly created both functions we can start testing it. Here is a script for test of both functions on several strings.

And here are the results of the tests:

As we can see, the functions return the same results except the second select. As we have selected not to ignore the case in the CLR version. For T-SQL strings are identical as the function uses a default collation which I have Case Insensitive. If I had a default collation Case Sensitive, then the results will be the same. It could be also possible to modify the T-SQL function to accept parameter for Case Sensitive/Insensitive comparison and then use different collations for that, but it’s not what we want to do here.

Speed comparison

As we saw in previous paragraphs here, both T-SQL and CLR version of the algorithm woks correctly. Now take a look on the calculation speed of the Levenshtein distance by both version.

For the test we can use a simple script, which will calculate the Levenshtein distance in cycle 10 000 times

Here are the results

As we can see the 10 000 times calculation using the T-SQL version took 22993 milliseconds which is in average circa 2.3 millisecond for calculating the distance for the strings in our test query.

On the other side the 10 000 times calculation using CLR took only 763 milliseconds which is in average circa 0.08 milliseconds for calculating the distance for the same strings as in T-SQL version.

Conclusion

From the results we can see that the CLR is about 30 times faster on the same machine than the T-SQL version of the same algorithm. Even the T-SQL version took only 2.3 milliseconds per calculation of sample texts and it’s quite good to use in normal usage, the use or CLR can enormously decrease the processing time when processing higher volume of records.

Also it is another example where CLR beats the T-SQL solution and where it has sense to use the CLR instead of pure T-SQL.

  3 Responses to “Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)”

  1. Thanks for the code. It works very well. Yes the CLR function is about 30-33 times faster it appears.

    We were looking to match addresses, is there a way to standardize addresses also, other than this type of matching?

  2. Thanks heaps,

    It shows that how we could get a good result in server side.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">

(required)

(required)