Fuzzy Matching:


In a nutshell...

Fuzzy matching is a mathematical process used to determine how similar one piece of data is to another.

As an example: if you had 'John Arun Smith' in one system and wanted to find similar names in another system, the fuzzy matching process might return a list like this:

J A Smith
John A Smith
JA Smith
Jon Smith
Jonathan A Smith
Jon A Smithe

For each entry examined, the fuzzy matching process can give a similarity score indicating how close the match is. For example, 'John A Smith' might receive a 95% similarity score whereas 'JA Smith' might only get 72%.
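To make that concrete, here's a minimal sketch in Python. It uses the standard library's difflib (a Ratcliff/Obershelp similarity) purely as a stand-in scorer - it is not the algorithm behind the figures above, so its percentages will differ:

from difflib import SequenceMatcher

source = "John Arun Smith"
candidates = ["J A Smith", "John A Smith", "JA Smith",
              "Jon Smith", "Jonathan A Smith", "Jon A Smithe"]

# Score each candidate against the source and print it as a percentage.
for candidate in candidates:
    score = SequenceMatcher(None, source.lower(), candidate.lower()).ratio()
    print(f"{candidate:<17} {score:.0%}")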


Tell me more...

NB: I take no credit for the actual algorithms I use in my code. I've included a summary of the algorithms and authors below if you're interested.

So, is it any good? It depends on what you are trying to match and what you are trying to match it for.

Fuzzy names:
When it comes to names it's easy to see the relationship between the source and a potential match. In the example above, you could present these records to a user and ask them to select which they thought was the correct match. (You could of course automate this and simply take the match with the highest score, as in the snippet below.) This may be enough to allow you to match two records from two different systems - which is where fuzzy matching is often used.
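Automating that choice is simple. A minimal sketch, again using difflib purely as a stand-in scoring function:

from difflib import SequenceMatcher

def similarity(source, candidate):
    # Stand-in scoring function - swap in whichever algorithm suits your data.
    return SequenceMatcher(None, source.lower(), candidate.lower()).ratio()

source = "John Arun Smith"
candidates = ["J A Smith", "John A Smith", "JA Smith", "Jon Smith"]

# Keep the candidate with the highest similarity score.
best = max(candidates, key=lambda c: similarity(source, c))
print(best)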

Fuzzy numbers:
Numbers, on the other hand, are a lot more problematic. For example, if my source is 123, then is 234 a good match? After all, I have two of the three digits! What about 124? Once again, only one digit is out.

Anything else I should know?
Anywhere you have data that cannot be matched by conventional methods, you can always look at the fuzzy matching option. However, one thing to keep in mind is the time needed to do the matching.

To give you an idea of how quickly permutations can increase the number of records to process, consider this: at the beginning of a game of chess, White has 20 possible moves and Black has 20 possible replies. In just these two moves there are 400 possible game variants, and that number grows quickly as follows:

In 3 moves = 8,902
In 4 moves = 197,281
In 5 moves = 4,865,609
In 6 moves = 119,060,324

It's important to weigh the time it takes to fuzzy process the data against the time available to complete the work you are doing. It's impossible to give an accurate figure because of the many variables involved, but think at least days, and possibly weeks, of continuous processing and you won't be far out! The rough sizing sketch below shows how quickly the arithmetic stacks up.
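As a back-of-the-envelope sizing exercise (the figures below are illustrative assumptions, not benchmarks) - matching every record in one system against every record in another means n x m comparisons:

n, m = 100_000, 250_000   # assumed record counts in the two systems
comparisons = n * m       # every-record-against-every-record pairing
rate = 50_000             # assumed comparisons per second - measure your own setup

days = comparisons / rate / 86_400
print(f"{comparisons:,} comparisons = roughly {days:.1f} days at {rate:,}/second")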







The Algorithms used in the Fuzzy Matching Process:

The Dice Coefficient - Similarity metric:
The method used to feed the string to this algorithm is the linear q-gram (or n-gram) model:

Q (or N) is the length of the gram.
Q-grams: unigrams, bigrams, trigrams, etc.

Example bigrams for “Nelson”: %N Ne el ls so on n#
The Dice Coefficient is used as the threshold algorithm:

D = Dice Coefficient
SB = number of shared bigrams
TBg1 = total number of bigrams in string 1
TBg2 = total number of bigrams in string 2

D = (2 × SB) / (TBg1 + TBg2)

Double the number of shared bigrams and divide by the total number of bigrams in the two strings combined. A good Dice Coefficient is a value greater than 0.2. If the Dice value is less than 0.2, the comparison is rejected immediately and not analyzed by the other algorithms.
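Here's a minimal sketch of the bigram split and the Dice Coefficient threshold described above (the padding characters mirror the “Nelson” example):

def bigrams(s):
    padded = "%" + s.lower() + "#"   # pad the ends, as in %N ... n#
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def dice(s1, s2):
    b1, b2 = bigrams(s1), bigrams(s2)
    shared = sum(min(b1.count(g), b2.count(g)) for g in set(b1))
    return (2 * shared) / (len(b1) + len(b2))

# Pairs scoring below the 0.2 threshold are rejected before the
# costlier algorithms run.
print(dice("Nelson", "Nilsen"))   # roughly 0.43 - comfortably above 0.2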

Levenshtein Edit Distance:
The Levenshtein Distance algorithm is named after the Russian scientist Vladimir Levenshtein, who devised it in 1965. Levenshtein Edit Distance is the minimum number of single-character insertions, deletions, or replacements required to convert one string into the other.

Character transposition is detected.
Transposition is given a cost of 1.

Transposition detection is taken from:
Berghel, Hal; Roach, David: "An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm"

A Levenshtein Distance of 2, 3, or more is meaningless without also evaluating the lengths of the two strings being compared: two edits in a five-character string are far more significant than two edits in a fifty-character one.
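Here's a sketch of the edit distance with the transposition extension, in its textbook dynamic-programming form (not the optimised Berghel/Roach variant cited above):

def edit_distance(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i   # cost of deleting i characters
    for j in range(cols):
        d[0][j] = j   # cost of inserting j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # replacement
            # Adjacent-character transposition, also at a cost of 1.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(edit_distance("Jon Smith", "John Smith"))   # 1 - a single insertion
print(edit_distance("Smiht", "Smith"))            # 1 - a single transposition

One common way to allow for string length is to turn the distance into a 0-to-1 similarity, e.g. 1 - distance / max(len(s1), len(s2)).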

Longest Common Subsequence:
The LCS score is the length of the longest common (not necessarily contiguous) subsequence of characters, divided by the average length of the two strings. A good LCS value is greater than or equal to 0.8.
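A minimal dynamic-programming sketch of the LCS score as defined above:

def lcs_length(s1, s2):
    # Classic row-by-row LCS length calculation.
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, 1):
            curr.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(s1, s2):
    # LCS length divided by the average length of the two strings.
    return lcs_length(s1, s2) / ((len(s1) + len(s2)) / 2)

print(lcs_score("John A Smith", "John Arun Smith"))   # roughly 0.89, above the 0.8 threshold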

Double Metaphone:
The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms.
Phillips wrote it as a replacement for SOUNDEX.

These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."

Double Metaphone is designed primarily to encode American English names (though it also encodes most English words well) while taking into account the fact that such words can have more than one acceptable pronunciation. Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most likely pronunciation as well as an optional alternative pronunciation (hence the "double" in the name).
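The Double Metaphone rules run to several hundred lines of code, so rather than reproduce them, here's a sketch that leans on a third-party implementation (the PyPI "Metaphone" package is my assumption here - any Double Metaphone library exposing the primary and secondary encodings would do):

# Assumes: pip install Metaphone  (third-party package, not the standard library)
from metaphone import doublemetaphone

print(doublemetaphone("Nelson"))   # primary and secondary encodings, e.g. ('NLSN', '')
print(doublemetaphone("Nilsen"))   # shares its primary encoding with "Nelson"
print(doublemetaphone("Adam"))     # encodes quite differently - no phonetic match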

