Wednesday, March 3, 2010

Fuzzy string matching in PostgreSQL

A recent project required me to use fuzzy string matching, or sound alike matching, in an application that searched a list of names. It turns out there is a contrib module for the PostgreSQL database called fuzzystrmatch that provides several different matching algorithms.

The task at hand involved rewriting a legacy application, originally in PICK, in Ruby on Rails. The PICK application used a soundex search to find names of people that sounded like the search string.

Three algorithms are available as PostgreSQL functions (after installation of the fuzzystrmatch module). They are soundex(), levenshtein(), and metaphone().

Both soundex and metaphone convert a string into character codes. Soundex uses 4 characters and metaphone uses a configurable number of characters. Levenshtein directly compares two strings and returns an integer indicating how well the two strings match.

After some trial and error, I found that metaphone produced better results than soundex. I didn't test the Levenshtein function.

To improve the results, I added a classic substring search using ILIKE. The combination of ILIKE and metaphone gave me a broad, but reasonably accurate fuzzy string search.