Improving Search with Levenshtein Distance

We've been creating a document management system for a local accountancy firm. They've asked that this new system should search document titles, and descriptions, whilst being able to cope with spelling mistakes.

Coping with spelling mistakes

Computers are like your physics teacher at school - great at maths, not very creative, but ultimately very precise. So how can a computer cope with spelling mistakes? If I search for "Mony Lawndaring", how can I get it to bring back results for "Money Laundering"?

One way to cope with spelling mistakes is to use the Levenshtein distance metric devised by Levenshtein, a Russian scientist - not to be confused with the American Pie character Jim Levenstein.

Using this algorithm, you can calculate the number of changes you need to make to one word, in order to change it into another word.

A practical example

The Levenshtein distance between "kitten" and "sitting" is three, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of 's' for 'k')
  2. sitten → sittin (substitution of 'i' for 'e')
  3. sittin → sitting (insertion of 'g' at the end)

We can then convert the difference into a percentage using the following formula:

p = (1 - l/m) × 100

Where l is the levenshtein distance and m is the length of the longest of the two words:

(1 - 3/7) × 100 = 57.14...

Applying the algorithm to our search

Now that we've got a common way of calculating the mathematical difference between words, we can apply the process to our search facility.

By calculating the percentage difference between each of the search terms entered by the user, compared to each of the words in the system's document titles/descriptions, we can work out which documents the search should retrieve.

In my calculation, I've decided to:

  1. Ignore results where the percentage match is less than 50%.
  2. Treat the percentages as ordinary numbers, and sum them to create a "total match" between the search terms and document.

For example, imagine we have two documents on our system: "Money Laundering" and "Money Facts", and the user searches for "Mony Lawndaring".

Mony Lawndaring vs Money Laundering

Word 1 Word 2 Match
Mony Money 80%
Mony Laundering 10%
Lawndaring Money 10%
Lawndaring Laundering 80%

Total match for "Mony Lawndaring" against "Money Laundering" is 160.

Mony Lawndaring vs Money Facts

Word 1 Word 2 Match
Mony Money 80%
Mony Facts 0%
Lawndaring Money 10%
Lawndaring Facts 10%

Total match for "Mony Lawndaring" against "Money Facts" is 80.

Results

Then by simply ordering the search results by the highest total match, we can see the "Money Laundering" document appear before "Money Facts" in the search results list!

The Switchplane Team

Hey! We are Switchplane and we help businesses save time and money by building them custom software. This can be anything from a job management portal to a complex ecommerce solution. Want to find out more?

About Switchplane

Contact Us

Can we help with your project?

Get in touch

Join our Newsletter

Sign up to our monthly newsletter to keep up to date on the latest Switchplane and sector news.

You can unsubscribe at any time. View our Privacy Policy