Automatic Spelling Correction

This page describes the final result of some experiments I ran in an attempt to duplicate Google's automatic spelling correction.

By results, I don't mean metrics, but rather a description of the final algorithm. The algorithm correctly "corrected" 48 of the top 50 misspelled words. In the remaining two cases it guessed a different word that was very close to the "correct" word as defined by the "top 50 misspelled words" web site.

What seemed to work

Attempts to create an automatic spelling correction system using only cosine similarity did not result in a usable system. However, combining that technique with "phonetic spelling" and the "longest common subsequence" did -- along with a few other simple word-comparison rules described below.
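
For concreteness, here is a minimal sketch of what "cosine similarity" between two words can mean, assuming character-bigram counts as the vector components. The bigram feature set is an assumption -- the exact features used in the experiments are not spelled out here.

    #include <cmath>
    #include <map>
    #include <string>

    // Count the character bigrams of a word; these counts serve as the
    // word's vector components for the cosine computation below.
    std::map<std::string, int> bigrams(const std::string& w) {
        std::map<std::string, int> counts;
        for (size_t i = 0; i + 1 < w.size(); ++i)
            ++counts[w.substr(i, 2)];
        return counts;
    }

    // Cosine similarity: dot product of the two count vectors divided by
    // the product of their magnitudes. Ranges from 0 to 1 here because
    // the counts are non-negative.
    double cosine_similarity(const std::string& a, const std::string& b) {
        auto va = bigrams(a), vb = bigrams(b);
        double dot = 0, na = 0, nb = 0;
        for (const auto& [bg, n] : va) {
            na += double(n) * n;
            auto it = vb.find(bg);
            if (it != vb.end()) dot += double(n) * it->second;
        }
        for (const auto& [bg, n] : vb) nb += double(n) * n;
        return (na == 0 || nb == 0) ? 0.0 : dot / (std::sqrt(na) * std::sqrt(nb));
    }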

Beyond simply getting the spelling correction system to work, it is also necessary to make it fast. There are approximately 300,000 words in common English use, and 300,000 comparisons per lookup is just too many for an on-line or other user-interactive environment.

Here's what was done to overcome the speed problem:

  1. Store all the words in the language sorted by a cosine-similarity index of each word to a baseline word form. That is, compare each valid English word to a baseline word and use that comparison number as the word's key in a map.
  2. To search for the proper spelling of a candidate word, compute the candidate spelling's cosine similarity to the baseline word. Then use the std::upper_bound() method to find the approximate location in the valid-words map.
    Note: before calling upper_bound with the candidate word's cosine similarity to the baseline word, increase the similarity value by 1.2 to give the search a chance of handling words whose spellings have too many letters, not just missing letters.
  3. When upper_bound returns an iterator into the map of valid word spellings, process that item and all elements closer to the beginning of the map -- or until the current word's cosine similarity is less than 80% of the original similarity. Basically, just iterate backwards from the upper bound and break out of the loop when the current word's similarity is too low. (Steps 1 through 3 are sketched in code after this list.)

    In each iteration, compute the cosine similarity of the valid word at the current location in the table to the user's spelling -- not to the baseline word.

    Keep track of the closest match you find. If the current word's similarity is less than 50% of the closest match's, ignore it. Insert each current word you process into a secondary map. This secondary map uses a much more complicated key value than the valid-words table: its key is a structure that includes the longest matching subsequence, the actual similarity, and a string that holds a second phonetic spelling of the current word.

  4. After visiting all the interesting "valid words" and producing the secondary table, which sorts the possibilities on this bigger, more complicated search key, produce the same key structure for the candidate word. Then use the upper-bound algorithm on the secondary table to find the best match. Iterate backwards from the returned position a few locations -- printing each entry as a possible corrected spelling. If you find an exact match between the candidate string and a word, quit early.
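
Here is a minimal sketch of steps 1 through 3, reusing cosine_similarity() from the sketch above. One ambiguity: whether "increase the similarity value by 1.2" means adding 1.2 or multiplying by 1.2; multiplication is assumed below, since these cosine values never exceed 1. The names build_primary() and neighborhood() are invented for the sketch.

    #include <map>
    #include <string>
    #include <vector>

    // Defined in the earlier sketch.
    double cosine_similarity(const std::string& a, const std::string& b);

    const std::string kBaseline =
        "abcdefghijklmnopqrstuvwxyzabcdefghijklmnoprstuvwxy";

    // Step 1: key every valid word by its similarity to the baseline word.
    std::multimap<double, std::string>
    build_primary(const std::vector<std::string>& words) {
        std::multimap<double, std::string> table;
        for (const auto& w : words)
            table.emplace(cosine_similarity(w, kBaseline), w);
        return table;
    }

    // Steps 2 and 3: find the candidate's neighborhood with upper_bound,
    // then walk backwards until the similarity key drops below 80% of the
    // candidate's own similarity to the baseline.
    std::vector<std::string>
    neighborhood(const std::multimap<double, std::string>& table,
                 const std::string& candidate) {
        double sim = cosine_similarity(candidate, kBaseline);
        auto it = table.upper_bound(sim * 1.2);  // headroom for extra letters
        std::vector<std::string> nearby;
        while (it != table.begin()) {
            --it;
            if (it->first < 0.8 * sim) break;  // too dissimilar -- stop scanning
            nearby.push_back(it->second);      // compared to the user's word later
        }
        return nearby;
    }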

The "baseline word" used in the above algorithm is spelled like this: "abcdefghijklmnopqrstuvwxyzabcdefghijklmnoprstuvwxy".

The first phonetic algorithm mentioned above is the metaphone algorithm by Lawrence Philips; the second is the NYSIIS algorithm. The principal difference between the two is that the metaphone algorithm deals only with consonants, while NYSIIS replaces vowel sequences with a single 'A'.
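
To illustrate the difference, here is a toy sketch of just the vowel-collapsing behavior attributed to NYSIIS above; it is not a real NYSIIS encoder, which applies many more rules.

    #include <cctype>
    #include <string>

    // Collapse each run of vowels to a single 'A', leaving consonants as-is.
    std::string collapse_vowels(const std::string& word) {
        const std::string vowels = "aeiou";
        std::string out;
        bool in_vowel_run = false;
        for (char ch : word) {
            char c = std::tolower(static_cast<unsigned char>(ch));
            bool is_vowel = vowels.find(c) != std::string::npos;
            if (is_vowel && !in_vowel_run) out += 'A';
            else if (!is_vowel) out += c;
            in_vowel_run = is_vowel;
        }
        return out;
    }

Even this toy version shows why the technique helps: "believe" and the common misspelling "beleive" both collapse to "bAlAvA".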

An additional secondary sort key is computed for the word endings: s, ing, y, d, r. That is, each word ending is represented as a different numeric value so as to move the word up or down in the secondary ranking.
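
The text above doesn't give the actual numeric values, so the ones below are placeholders; the point is only that each ending maps to a distinct number that participates in the secondary ordering.

    #include <string>

    // Map the listed word endings to distinct placeholder codes; 0 means
    // the word has none of the "decoration" endings.
    int ending_code(const std::string& w) {
        auto ends_with = [&](const std::string& suf) {
            return w.size() > suf.size() &&
                   w.compare(w.size() - suf.size(), suf.size(), suf) == 0;
        };
        if (ends_with("s"))   return 1;
        if (ends_with("ing")) return 2;
        if (ends_with("y"))   return 3;
        if (ends_with("d"))   return 4;
        if (ends_with("r"))   return 5;
        return 0;
    }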

The "primary storage" map is keyed on the primary key -- not the secondary storage map. The "secondary storage" map (the one created during the process of searching for a specific word) has the actual similarity and the count of matching characters (longest subsequence) -- but, the count of matching characters is decremented if the last characters of a word are one of the "decoration" characters.

That is, the decoration doesn't count as a matching character.

And if the first character of the comparand word equals the first character of a word in the table, add one to the matching-character count.

Finally, if the word in the table differs by more than 2 characters from the comparand word, decrement the matching length by 1.
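
Pulling the last few rules together, here is a hedged sketch of the adjusted matching-character count and one possible shape for the composite secondary key. Reading "more than 2 different" as a length difference of more than two characters is an interpretation, as are the key's field order and comparison direction.

    #include <algorithm>
    #include <cstdlib>
    #include <string>
    #include <tuple>
    #include <vector>

    // Longest common subsequence length (classic dynamic programming).
    int lcs_length(const std::string& a, const std::string& b) {
        std::vector<std::vector<int>> dp(a.size() + 1,
                                         std::vector<int>(b.size() + 1, 0));
        for (size_t i = 1; i <= a.size(); ++i)
            for (size_t j = 1; j <= b.size(); ++j)
                dp[i][j] = (a[i - 1] == b[j - 1])
                               ? dp[i - 1][j - 1] + 1
                               : std::max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.size()][b.size()];
    }

    // True when the word ends in one of the "decoration" endings.
    bool has_decoration(const std::string& w) {
        for (std::string suf : {"s", "ing", "y", "d", "r"})
            if (w.size() > suf.size() &&
                w.compare(w.size() - suf.size(), suf.size(), suf) == 0)
                return true;
        return false;
    }

    // Apply the three adjustments described above to the raw LCS count.
    int adjusted_match_count(const std::string& table_word,
                             const std::string& comparand) {
        int matches = lcs_length(table_word, comparand);
        if (has_decoration(table_word))
            --matches;                 // the decoration doesn't count as a match
        if (!table_word.empty() && !comparand.empty() &&
            table_word.front() == comparand.front())
            ++matches;                 // bonus for a matching first letter
        if (std::abs(static_cast<int>(table_word.size()) -
                     static_cast<int>(comparand.size())) > 2)
            --matches;                 // penalty for a large size difference
        return matches;
    }

    // One possible shape for the composite secondary key; the text lists
    // the ingredients but not their exact ordering.
    struct SecondaryKey {
        int match_count;       // adjusted matching-character count
        double similarity;     // cosine similarity to the user's spelling
        std::string phonetic;  // second (NYSIIS-style) phonetic spelling
        int ending;            // numeric word-ending code
        bool operator<(const SecondaryKey& o) const {
            return std::tie(match_count, similarity, phonetic, ending) <
                   std::tie(o.match_count, o.similarity, o.phonetic, o.ending);
        }
    };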