Language Modeling

Todd Sullivan and Pavani Vantimitta
Stanford's Natural Language Processing Course
Project 1 of 3
Stanford Department of Computer Science

In this project we built unigram, bigram, and trigram language models and combined them with several smoothing and backoff techniques (Katz backoff, linear interpolation, and Simple Good-Turing smoothing). Our models scaled well and could easily be applied to all 1,000,000 sentences in the provided corpus.
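As an illustration of the kind of model described above, the following is a minimal sketch of a linearly interpolated trigram model: maximum-likelihood trigram, bigram, and unigram estimates are mixed with fixed weights. The weights (0.5, 0.3, 0.2), the sentence-boundary tokens, and the function names are illustrative assumptions, not the actual implementation from this project (the report notes the interpolation weights were optimized separately).

```python
from collections import defaultdict

def train_ngrams(sentences):
    """Count unigrams, bigrams, and trigrams over pre-tokenized sentences.
    Boundary tokens <s> and </s> are included in the counts (a common,
    simple convention; real implementations vary)."""
    uni, bi, tri = defaultdict(int), defaultdict(int), defaultdict(int)
    total = 0
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i, w in enumerate(tokens):
            uni[w] += 1
            total += 1
            if i >= 1:
                bi[(tokens[i - 1], w)] += 1
            if i >= 2:
                tri[(tokens[i - 2], tokens[i - 1], w)] += 1
    return uni, bi, tri, total

def interp_prob(w, prev, prev2, uni, bi, tri, total, lambdas=(0.5, 0.3, 0.2)):
    """P(w | prev2, prev) as a weighted mix of trigram, bigram, and
    unigram maximum-likelihood estimates. Weights are assumed fixed here;
    in practice they would be tuned on held-out data."""
    l3, l2, l1 = lambdas
    p3 = tri[(prev2, prev, w)] / bi[(prev2, prev)] if bi[(prev2, prev)] else 0.0
    p2 = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    p1 = uni[w] / total if total else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1
```

Because the unigram term is never zero for any word seen in training, the mixture assigns nonzero probability even to unseen trigram contexts, which is the point of interpolating rather than using the trigram MLE alone.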

Technical Report

Member Contributions

The following list details each group member's contributions. These do not match the tasks originally assigned to each member; they reflect how the work was ultimately divided, given each member's abilities and other constraints.

  • I implemented the Simple Good Turing smoothing, linear interpolation weights optimization, and all optimizations involving the SuperHelper.
  • Pavani and I pair programmed the rest of the assignment.
  • I performed all experiments, collected all data, and organized all results.
  • Pavani created the graphs used in the report.
  • I wrote the entire report and performed all of the results analysis.
  • I applied all formatting and presentation features to the report.
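For context on the Simple Good-Turing smoothing mentioned above, the following is a minimal sketch of its core step (after Gale and Sampson): fit a line to log N_r versus log r, where N_r is the number of n-gram types seen r times, and use the fitted curve S(r) to produce smoothed counts r* = (r+1)·S(r+1)/S(r). This is not the project's actual code; the full method also switches between raw and smoothed estimates and handles unseen events, which this sketch omits.

```python
import math

def sgt_smoothed_counts(freq_of_freqs):
    """Given {r: N_r} (frequency of frequencies), fit log N_r = a + b*log r
    by least squares and return smoothed counts r* = (r+1)*S(r+1)/S(r).
    Sketch only: assumes at least two distinct values of r."""
    rs = sorted(freq_of_freqs)
    xs = [math.log(r) for r in rs]
    ys = [math.log(freq_of_freqs[r]) for r in rs]
    n = len(rs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    S = lambda r: math.exp(a + b * math.log(r))  # fitted N_r
    return {r: (r + 1) * S(r + 1) / S(r) for r in rs}
```

When the frequency-of-frequency data follow a power law exactly (e.g. N_r proportional to 1/r), the smoothed counts reproduce the raw counts; the fit matters precisely when the data are noisy, which is the usual situation for large r.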

Source Code

Since this assignment will be used in future versions of the natural language processing course, I am not releasing the code at this time.