Tweet Classification

Disambiguating your tweets since 2012.

CS 472/572: Principles of Artificial Intelligence

Term Project, Fall 2012, Iowa State University

This page details our approach of using machine learning for classification of tweets about companies.


Project Abstract

Twitter is a popular source for data mining due to its massive scale and coverage of current trends. It is also a veritable goldmine of useful data for businesses seeking to discover public opinion about themselves or their products. Using machine learning techniques, we analyze the accuracy of classifying whether tweets directly mention a particular company or merely contain keywords related to the company. We survey a variety of text processing steps combined with a naive Bayes classifier, focusing on the effects of different tokenizations. Our dataset consists of approximately 2,000 tweets, of which 64% directly mention the company Apple, Inc. or one of its products. The other 36% do not, but still contain keywords related to the company. Preliminary results show significant increases in accuracy from domain-specific tokenizers but, surprisingly, decreases in classification accuracy for other standard preprocessing methods such as n-grams.
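The project uses Mallet's naive Bayes implementation; as a standalone illustration of the underlying technique, here is a minimal multinomial naive Bayes with add-one smoothing over already-tokenized tweets. The class and variable names (and the toy labels in the usage example) are ours, not the project's.

```java
import java.util.*;

public class ToyNaiveBayes {
    // Per-class token counts and per-class document counts.
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    // Record one labeled, already-tokenized tweet.
    public void train(String label, List<String> tokens) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            vocab.add(t);
        }
    }

    // Return the label maximizing log P(label) + sum of log P(token | label),
    // with add-one (Laplace) smoothing over the vocabulary.
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int total = 0;
            for (int c : counts.values()) total += c;
            for (String t : tokens) {
                double p = (counts.getOrDefault(t, 0) + 1.0)
                        / (total + vocab.size());
                score += Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Because the model only sees token counts, the choice of tokenizer directly determines the feature space — which is why the tokenization comparisons above dominate the results.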


Building the Corpus

Preprocessing: the Pipeline



We present a pipeline for processing tweets that increases the accuracy and precision of the resulting models, built around a layered approach to tokenization that improves on other rule-based methods. To support it, we have built several reusable Java components that extend the Mallet API for processing data from Twitter.

Our TweetJsonIterator accepts a file of Twitter's JSON-formatted tweets and creates training Instances from their fields for easy processing during model training. We have also implemented several Pipes that can be reused or easily adapted for similar Twitter processing tasks: Link2Title, Stemmer, SpellCheck, and Tokenize, the last of which serves as a more extensible tokenizer than Mallet's default. Finally, we release our testing utility as a new way of comparing multiple similar pipelines.

All code is available through our project website. We also provide a corpus of 100,000 tweets selected by Apple-related keywords, our labeled data set of 2,000 tweets, and the bash scripts used to filter the data for near-duplicates and spam.
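The layered tokenization idea can be sketched as follows. This is a simplified, standalone illustration, not the project's Tokenize pipe (which extends Mallet's Pipe API); the regexes and placeholder tokens are our own assumptions. Each layer normalizes one class of Twitter entity before the final split, so URLs and @mentions collapse to single features instead of exploding the vocabulary.

```java
import java.util.*;
import java.util.regex.*;

public class TweetTokenizer {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    public static List<String> tokenize(String tweet) {
        // Layer 1: collapse URLs and @mentions to placeholder tokens.
        String s = URL.matcher(tweet).replaceAll(" <URL> ");
        s = MENTION.matcher(s).replaceAll(" <MENTION> ");
        // Layer 2: keep hashtag bodies as ordinary words.
        s = HASHTAG.matcher(s).replaceAll(" $1 ");
        // Layer 3: split on non-word characters (keeping the '<'/'>' of
        // placeholders) and lowercase everything but the placeholders.
        List<String> tokens = new ArrayList<>();
        for (String t : s.split("[^\\w<>]+")) {
            if (t.isEmpty()) continue;
            boolean placeholder = t.equals("<URL>") || t.equals("<MENTION>");
            tokens.add(placeholder ? t : t.toLowerCase());
        }
        return tokens;
    }
}
```

For example, `tokenize("Check the new #iPhone from @Apple http://t.co/xyz")` yields `[check, the, new, iphone, from, <MENTION>, <URL>]`.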


View our paper here and slides here.

To check out the code with git:

$ git clone