K-Gram Spelling Correction

Todd Sullivan and Ashutosh Kulkarni
Project 1 of 2
Stanford's Information Retrieval and Web Mining Course
Stanford Department of Computer Science

In this project we created several spelling correctors using the Jaccard score of k-gram overlap. We augmented our correctors with Levenshtein edit distance, and various ranking and tie breaking methods. We found that a layered ranking algorithm that first creates a list of possible corrections based on Jaccard k-gram overlap and then re-ranks the list based on Levenshtein edit distance (with word frequency-based tie breaking) performed best. We also studied Lucene's functionality and built-in spell checking mechanisms, and incorporated our spelling corrector into Lucene.

Technical Report

Member Contributions

The following list details all group contributions.

  • I implemented the spelling correctors, performed all spelling corrector experiments and analyses, and wrote the spelling corrector portion of the report (Section 2).
  • Ashutosh incorporated our spelling correctors into Lucene, performed all analyses of Lucene, and wrote the Lucene portion of the report (Section 3).
  • I applied all formatting and presentation features to the report.

Source Code

Since this assignment will be used in future versions of the information retrieval and web mining course, I am not releasing the code at this time.