Maximum Entropy Sequence Classifiers and Treebank Parsing

Todd Sullivan and Pavani Vantimitta
Stanford's Natural Language Processing Course
Project 3 of 3
Stanford Department of Computer Science

In this project we created a maximum entropy sequence classifier to label words as protein, DNA, RNA, cell line, cell type, or other. We also implemented a CKY treebank parser and investigated the effect of chunking on parse quality. Our CKY parser was fast: it parsed 20-word sentences in an average of 125 milliseconds and 69-word sentences (the longest in the GENIA corpus) in an average of 4,849 milliseconds. The cited speed for 20-word sentences "achievable with some optimization" was 5 seconds, so our parser is roughly 40 times faster than that target.

Technical Report

Member Contributions

The following list details all group contributions. These are not the tasks originally assigned to each group member; they are how the work ended up being divided, given each member's abilities and other issues.

  • I (Todd) implemented the hill climbing feature selection for the maximum entropy classifier, conceived and implemented all optimizations, and developed most of the maximum entropy classifier features.
  • Pavani and I pair-programmed the rest of the assignment.
  • I performed all experiments, collected all data, and organized all results.
  • Pavani individually selected the performance analysis examples.
  • Pavani and I jointly discussed the examples/performance analysis and created the tree visualizations.
  • Pavani selected the examples and created the graphs.
  • I wrote all sections of the report except the sections with examples.
  • I applied all formatting and presentation features to the report.

Source Code

Since this assignment will be used in future versions of the natural language processing course, I am not releasing the code at this time.