Speaker Identification and Geographical Region Prediction in Audio Reviews

Todd Sullivan, Ashutosh Kulkarni, and Richa Bhayani
Stanford's Speech Recognition and Synthesis Course
Stanford Department of Computer Science

In this project we tackled two tasks: speaker identification and geographical region prediction from audio signals. We crawled ExpoTV.com to create a dataset of audio product reviews. We implemented diagonal-covariance and full-covariance Gaussian mixture models and used 39 MFCC features to predict the speaker of a given audio sample and the region of the U.S. (West, Midwest, Northeast, or South) in which the speaker currently lives. We show that acoustic features are indeed helpful in the region prediction task and that hundreds of small context-dependent Gaussian mixture models perform significantly better than large context-independent models.

Technical Report

Member Contributions

The following list details all group contributions.

  • I crawled ExpoTV.com, downloaded all of the videos, extracted the audio from each video, and parsed the ~200,000 crawled pages to build our dataset.
  • I implemented the diagonal covariance and full covariance Gaussian mixture models, context-dependent models, and aggregate classifier.
  • Ashutosh and Richa used HTK to extract the MFCC features from each review's audio.
  • Ashutosh and Richa used HTK and Sphinx to train a speech recognizer. They also wrote phoneme transcriptions for words in our dataset's transcripts that were missing from the phoneme dictionary, applied the trained recognizer to our dataset, and evaluated the result.
  • Ashutosh and Richa experimented with textual features for the overall classifier.
  • I applied all formatting and presentation features to the report.
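
As a rough illustration of the modeling approach named above, the following is a minimal sketch of speaker identification with diagonal-covariance Gaussian mixture models. It is not the implementation used in this project. The function names, the toy two-dimensional frames, and the model parameters are all hypothetical; in practice each frame would be a 39-dimensional MFCC vector and the mixture parameters would be estimated from training audio (for example with EM).

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of all frames under one diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # log sum_k w_k * N(x; mu_k, diag(var_k)), computed with log-sum-exp
        comps = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def identify_speaker(frames, models):
    """Return the label whose GMM gives the frames the highest log-likelihood."""
    return max(
        models,
        key=lambda label: gmm_log_likelihood(frames, *models[label]),
    )


# Hypothetical toy models: (weights, means, variances) per speaker.
# Real models would use 39-dimensional MFCC frames and trained parameters.
models = {
    "speaker_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "speaker_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
print(identify_speaker(frames, models))
```

The same scoring rule would apply to region prediction by training one mixture per region rather than one per speaker. A full-covariance variant would replace the per-dimension variance terms with a full covariance matrix per mixture component.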