CS 472/572: Principles of Artificial Intelligence
Term Project, Fall 2012, Iowa State University
This page details our approach to using machine learning to classify tweets about companies.
- Daniel Stiner, Senior, Computer Engineering (@danstiner)
- Brandon Maxwell, Junior, Computer Science (@bmaxwell921)
- Curtis Ullerich, Senior, Computer Engineering (@curtisullerich)
Twitter is a popular source for data mining due to its massive scale and coverage of current trends. It is also a goldmine of useful data for businesses seeking to gauge public opinion about themselves or their products. Using machine learning techniques, we analyze the accuracy of classifying whether tweets directly mention a particular company or merely contain keywords related to it. We survey a variety of text processing steps combined with a naive Bayes classifier, focusing on the effects of different tokenizations. Our dataset consists of approximately 2,000 tweets, of which 64% directly mention the company Apple, Inc. or one of its products; the other 36% do not, but still contain keywords related to the company. Preliminary results show significant increases in accuracy from domain-specific tokenizers, while surprisingly showing decreases in classification accuracy for other standard preprocessing methods such as n-grams.
Building the Corpus
- By accessing the Twitter 'Firehose' stream through their public API, we harvested 100,000 tweets using broad keyword filters, with the goal of collecting a superset of relevant tweets. This yields a smaller feature set in the overall corpus and a much higher prior confidence that each individual tweet is of interest.
- We define a tweet to be ‘about’ Apple if the tweet mentions the company (via direct reference or stock symbol), one of its products, or a service it offers.
- We hand-labeled 2,000 tweets for use in training and testing, including only English-language tweets in our training set.
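The broad keyword filtering used during harvesting can be sketched in plain Java. The keyword list below is illustrative only, not our exact filter terms:

```java
import java.util.List;
import java.util.Locale;

public class KeywordFilter {
    // Illustrative Apple-related keywords; the actual harvesting filter may differ.
    private static final List<String> KEYWORDS =
            List.of("apple", "iphone", "ipad", "macbook", "$aapl");

    /** Returns true if the tweet text contains any filter keyword (case-insensitive). */
    public static boolean matches(String tweet) {
        String lower = tweet.toLowerCase(Locale.ROOT);
        for (String kw : KEYWORDS) {
            if (lower.contains(kw)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that such a filter deliberately over-collects: a tweet about apple pie matches too, which is exactly why the hand-labeling step above is needed to separate tweets that are truly 'about' the company.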
Preprocessing: the Pipeline
- We do all machine learning with the Mallet library (McCallum, 2002), developed at the University of Massachusetts. Mallet provides abstractions for a data processing pipeline, within which we use both existing and custom processing 'Pipes' to transform the data.
- Examples of transforms include: removing stop words, creating bigrams, correcting HTML escapes, replacing URLs with relevant data, stemming, and custom tokenization.
- A Twitter-aware tokenizer outperformed plain whitespace tokenization.
- Applying stemming and stop word removal to both pipelines increased accuracy in almost every case.
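As a simplified illustration of why Twitter-aware tokenization helps, the sketch below keeps @mentions, #hashtags, and URLs intact as single tokens, where naive punctuation splitting would mangle them, and then drops stop words. Our actual Tokenize pipe is layered and more complete; the regex and the tiny stop-word list here are illustrative assumptions, not our exact rules:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetTokenizer {
    // Match URLs, @mentions, #hashtags, or plain word runs (illustrative pattern).
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+|[@#]\\w+|\\w+");

    // Tiny illustrative stop-word list; a real pipeline would use a fuller one.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "is", "at");

    /** Tokenizes a tweet, keeping Twitter entities whole and removing stop words. */
    public static List<String> tokenize(String tweet) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(tweet.toLowerCase());
        while (m.find()) {
            String tok = m.group();
            if (!STOP_WORDS.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }
}
```

Because the URL and the @mention each survive as one feature instead of several punctuation fragments, the downstream naive Bayes model sees cleaner, more repeatable features.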
We have presented a pipeline for processing tweets that increases the accuracy and precision of the resulting models. We introduce a layered approach to tokenization that improves on other rule-based methods. We have built several reusable Java components that extend the Mallet API for processing Twitter data. Our TweetJsonIterator accepts a file of Twitter's JSON-formatted tweets and creates training Instances from their values for easy processing during model training. We have implemented several Pipes that can be reused or easily modified for similar Twitter processing tasks, including Link2Title, Stemmer, SpellCheck, and Tokenize, the last of which is a more extensible tokenizer than Mallet's default. We also release our testing utility as a new way of comparing multiple similar pipelines. All code is available through our project website, along with a corpus of 100,000 tweets selected by Apple-related keywords, our labeled set of 2,000 tweets, and the bash scripts used to filter the data for near-duplicates and spam.
To check out the code with git:
$ git clone email@example.com:curtisullerich/twittertrader.git