« Return to the blog

Tom - Director

Posted by Tom on 10th May 2011

Improving Search with Levenshtein Distance

We've been creating a document management system for local accountancy firm Honey Barrett.

One of their complaints about the existing system is that the search facility isn't that great. They have specifically requested that this new system should search document titles, and descriptions, whilst being able to cope with spelling mistakes.

Coping with spelling mistakes

Computers are like your physics teacher at school - great at maths, not very creative, but ultimately very precise. So how can a computer cope with spelling mistakes? If I search for "Mony Lawndaring", how can I get it to bring back results for "Money Laundering"?

Jim Levenstein, Left - Vladimir Levenshtein, Right

Jim Levenstein, Left - Vladimir Levenshtein, Right

One way to cope with spelling mistakes is to use the Levenshtein distance metric devised by Levenshtein, a Russian scientist - not to be confused with the American Pie character Jim Levenstein.

Using this algorithm, you can calculate the number of changes you need to make to one word, in order to change it into another word.

A practical example

The Levenshtein distance between "kitten" and "sitting" is three, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of 's' for 'k')
  2. sitten → sittin (substitution of 'i' for 'e')
  3. sittin → sitting (insertion of 'g' at the end).

We can then convert the difference into a percentage using the following formula:

p = (1 - l/m) × 100

Where l is the levenshtein distance and m is the length of the longest of the two words:

(1 - 3/7) × 100 = 57.14...

Applying the algorithm to our search

Now that we've got a common way of calculating the mathematical difference between words, we can apply the process to our search facility.

By calculating the percentage difference between each of the search terms entered by the user, compared to each of the words in the system's document titles/descriptions, we can work out which documents the search should retrieve.

In my calculation, I've decided to:

  1. Ignore results where the percentage match is less than 50%.
  2. Treat the percentages as ordinary numbers, and sum them to create a "total match" between the search terms and document.

For example, imagine we have two documents on our system: "Money Laundering" and "Money Facts", and the user searches for "Mony Lawndaring".

Mony Lawndaring vs Money Laundering

Word 1Word 2Match
MonyMoney80%
MonyLaundering10%
LawndaringMoney10%
LawndaringLaundering80%

Total match for "Mony Lawndaring" against "Money Laundering" is 160.

Mony Lawndaring vs Money Facts

Word 1Word 2Match
MonyMoney80%
MonyFacts0%
LawndaringMoney10%
LawndaringFacts10%

Total match for "Mony Lawndaring" against "Money Facts" is 80.

Result

Then by simply ordering the search results by the highest total match, we can see the "Money Laundering" document appear before "Money Facts" in the search results list!

Add A Comment

:
:
:
: Captcha

Recent Blogs

Michael-Developer

Twitter Tools

Do you use Twitter as part of your business marketing strategy. Here are some useful tips and tools to help you succeed. Read More »

Evie-Designer

Gradient Meshes in Adobe Illustrator

There's more to Adobe Illustrator than basic shapes and blocks of colour, and there's more to gradients than linear or radial effects. Read More »

Evie-Designer

Web Design Trends 2011

Web designers are shying away from creating gimmicky tricks, but rather clean, accessible, bug-free coding - that works. Read More »

Chris-Project Manager

Let's Do Business - Eastbourne 2011

Find out what Switchplane did at Let's Do Business Eastbourne 2011. Read More »

Joel-Director

Integrating with Kashflow using PHP

Find out how to use PHP's SoapClient class to integrate with Kashflow. Includes sample code and best practice tips. Read More »

Joel-Director

The New Switchplane Website

Our new website is ready - read about what we've updated and why, and get ready for Project Awesome! Read More »

Michael-Developer

Website Accessibility

Have you ever considered the importance of accessibility on your website? Read More »

Chris-Project Manager

Networking Tips

Business networking tips and trips featuring the adventures of Norm! Read More »

Michael-Developer

How Amazing Databases Are

A blog post about the usefulness of databases, using a person management system as an example. Read More »

Christian-Designer

Reasons to use Vector Graphics over Rasters

When should you use vectors or rasters, and does it even matter? Read More »

Michael-Developer

The Importance of Variables

Variables are useful - make sure you use them properly! Read More »

Chris-Project Manager

Some thoughts on writing for your business website...

Read some tips on preparing copy for your website. Read More »

Chris-Project Manager

Let's Do Business - Hastings 2010

Switchplane attended LDB Hastings in 2010 - read about our day. Read More »

Chris-Project Manager

Lets Do Business Eastbourne 2010

Switchplane's experience at Let's Do Business 2010 in Eastbourne. Read More »

Evie-Designer

Web Design Trends 2010 Part 1

A guide to web design trends 2010, including hand-drawn and painted layouts, typefaces, modern vectors and large headers and footers. Read More »

Evie-Designer

Common Printing Problems and how to avoid them

Even if it looks good on your screen, it isn't guaranteed to come out like that! I've selected a few of the most common print problems to watch out for. Read More »

Evie-Designer

Web design trends 2009 Part 2: Old and Torn Paper

Make your website a real piece of personal artwork by collecting and scanning-in torn and crumpled paper and using it for the design of your website (or just take a visit to istock photo!). Read More »

Evie-Designer

Web design trends 2009 Part 1: Badges

Subtle elements and new layout ideas for 2009. My guide on how to make your website look fashionable and unique, packed full of information and further resources. Read More »

Tom-Director

The Joys of JQuery

JQuery is a light-weight, cross-browser and feature packed JavaScript library. To explain why JQuery is so good, first you need to know why not using it is so bad! Read More »

Vanessa-Account Manager

Lets Do Business Eastbourne 2009

Last Thursday (25th June) Switchplane attended their first Lets Do Business event at the Winter Gardens in Eastbourne. Read More »

Vanessa-Account Manager

Top 10 Sales Tips

Times are tough. Sales are down. Morale is low. Sound familiar? Well it needn't be. Below are a few tips which are sure to help you re evaluate and succeed in pushing your sales margins. Read More »

Evie-Designer

Create printable artwork in Indesign and Photoshop

A few useful steps to help you successfully prepare your artwork for print, using Adobe Photoshop and Adobe Indesign. Learn about colour modes, resolution, size and other helpful tips! Read More »

Cron-Administrator

Search engine ranking more important than ever

Search engine optimisation is now an essential part of a successful website. And because of the way search engines now work, the focus of SEO is now about producing quality content and getting your name out there so you get incoming links. Read More »

Cron-Administrator

Four Months at Switchplane

I've been at Switchplane for four months now - to celebrate the launch of our new website I'm sharing some thoughts on what it's like working here. Read More »

Joel-Director

How to run a cron job on the first weekday of the month

Often it's useful to generate automatic reports or perform some other task at the start of the month. Find out how to schedule a cron job to run on the first weekday of each month. Read More »