Extending Naive Bayes Classifiers for Text Classification

Todd Sullivan and Ashutosh Kulkarni
Project 2 of 2
Stanford's Information Retrieval and Web Mining Course
Stanford Department of Computer Science

In this project we implemented the basic Multivariate Bernoulli Naive Bayes and Multinomial Naive Bayes classifiers. We then extended the multinomial classifier to handle the bias introduced by skewed training data and to exploit the fact that the classifier is being used for text classification. These extensions result in the Complement Multinomial Naive Bayes (CNB), Weight-Normalized Complement Multinomial Naive Bayes (WCNB), and Transformed Weight-Normalized Complement Multinomial Naive Bayes (TWCNB) classifiers. As their names suggest, each classifier builds upon the previous one, although this does not necessarily yield a better classifier for every application. We also implemented the Chi-Square, KL divergence, and dKL divergence feature selection methods.
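To make the relationship between the multinomial and complement classifiers concrete, here is a minimal Python sketch. It is not the project's code (which is not released); it assumes the commonly described formulation in which multinomial Naive Bayes estimates each class's token probabilities from that class's own documents, while the complement variant estimates them from all the other classes and picks the class whose complement fits the document worst. The function names, the add-one smoothing parameter `alpha`, the toy data, and the choice to omit the class prior in the complement variant are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def count_tokens(docs, labels):
    """Count token occurrences per class and collect the vocabulary."""
    per_class = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_class[label].update(doc)
    vocab = sorted({tok for doc in docs for tok in doc})
    return per_class, vocab


def train_mnb(docs, labels, alpha=1.0):
    """Multinomial NB: smoothed log-likelihoods from each class's own tokens."""
    per_class, vocab = count_tokens(docs, labels)
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in Counter(labels).items()}
    log_lik = {}
    for c in log_prior:
        total = sum(per_class[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((per_class[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def train_cnb(docs, labels, alpha=1.0):
    """Complement NB: smoothed log-weights from every class EXCEPT c."""
    per_class, vocab = count_tokens(docs, labels)
    overall = Counter()
    for counts in per_class.values():
        overall.update(counts)
    weights = {}
    for c in per_class:
        comp = {t: overall[t] - per_class[c][t] for t in vocab}
        total = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((comp[t] + alpha) / total) for t in vocab}
    return weights


def classify_mnb(doc, log_prior, log_lik):
    """Pick the class with the highest posterior log-score."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


def classify_cnb(doc, weights):
    """Pick the class whose complement explains the document worst.

    Assumption: the class prior is omitted in this sketch.
    """
    scores = {c: sum(w[t] for t in doc if t in w) for c, w in weights.items()}
    return min(scores, key=scores.get)


# Hypothetical toy data, purely for illustration.
docs = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["cpu", "disk", "memory"],
    ["cpu", "memory", "code"],
]
labels = ["sports", "sports", "tech", "tech"]

log_prior, log_lik = train_mnb(docs, labels)
cnb_weights = train_cnb(docs, labels)
```

With this toy data, `classify_mnb(["team", "game"], log_prior, log_lik)` and `classify_cnb(["team", "game"], cnb_weights)` both select `"sports"`. The weight-normalized and transformed variants named above would add further steps on top of the complement weights, which this sketch does not attempt to reproduce.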

We evaluated our classifiers and feature selection methods on the 20 Newsgroups collection, a set of email messages from 20 different Usenet newsgroups with 1,000 emails from each newsgroup. In addition to comparing our classifiers and feature selection methods with each other, we compared our classifiers to SVMs and decision trees from the WEKA machine learning library. We also experimented with various preprocessing methods and features such as stemming, n-grams, stop lists, and several domain-specific features.

Technical Report

Member Contributions

The following list details all group contributions.

  • I implemented Chi-Square feature selection, k-fold cross validation, and all preprocessing methods and extra features.
  • Ashutosh implemented the Multivariate Bernoulli Naive Bayes and Multinomial Naive Bayes classifiers.
  • I optimized the multivariate and multinomial classifiers to reduce their total train/test time by two orders of magnitude to around 4 seconds.
  • Ashutosh implemented the Complement Multinomial Naive Bayes classifier using my optimized multinomial code as a base.
  • Ashutosh and I independently developed the WCNB and TWCNB classifiers. Both implementations produced the same results at the same speed.
  • Ashutosh and I pair-programmed the KL Divergence and dKL Divergence feature selection methods.
  • I wrote the Chi-Square feature selection (Section 3), k-fold cross validation (Section 5), preprocessing techniques/domain-specific features (Section 7), and WEKA comparison (Section 9) portions of the report.
  • Ashutosh wrote the multivariate and multinomial Naive Bayes (Section 2), CNB/WCNB/TWCNB (Section 6), and Chi-Square/KL Divergence/dKL Divergence feature selection comparison (Section 8) portions of the report.
  • The introduction (Section 1), built for speed (Section 4), and conclusion (Section 10) sections were co-written after the rest of the report was put together.
  • We each individually performed the experiments and analyses in our respective sections.
  • I applied all formatting and presentation features to the report.

Source Code

Since this assignment will be used in future versions of the information retrieval and web mining course, I am not releasing the code at this time.