In the last post I spoke about the difficulties and some of the solutions you can employ to improve your sentiment analysis implementation. In this post I’ll talk about what’s involved when you need to train your classifier.
Why do I need to train my classifier and how is this done?
In order for your classifier to accurately predict the probability of your text belonging to a specific category (in our use case positive / negative / neutral) you need to have datasets that are relevant to these categories.
Check out the following dataset and you’ll see what I mean:
|positive|,|@PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!!|
|positive|,|@Msdebramaye I heard about that contest! Congrats girl!!|
|positive|,|UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3|
|neutral|,|Do you Share More #jokes #quotes #music #photos or #news #articles on #Facebook or #Twitter?|
|neutral|,|Good night #Twitter and #TheLegionoftheFallen. 5:45am cimes awfully early!|
|neutral|,|I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount|
|negative|,|Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh|
|negative|,|no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.|
|negative|,|Just had some bloodwork done. My arm hurts|
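The rows above are pipe-delimited, with the label first and the tweet text second. As a hedged sketch, here is one way you might parse that format into (text, label) pairs in Python; `parse_row` is an illustrative helper I've made up for this post, not a standard loader:

```python
import re

def parse_row(row):
    # Rows look like: |label|,|text| (format taken from the sample above)
    match = re.match(r"\|(\w+)\|,\|(.*)\|$", row)
    if match is None:
        return None  # skip malformed rows
    label, text = match.groups()
    return text, label

pairs = [parse_row(r) for r in [
    "|positive|,|@Msdebramaye I heard about that contest! Congrats girl!!|",
    "|negative|,|Just had some bloodwork done. My arm hurts|",
]]
```

Once parsed, each pair is ready to feed to whatever training step your classifier library expects.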
Once you have your sample dataset you can load it into your classifier. When you’ve configured and loaded your classifier you can then supply text to it for classification. (If you want to see how this works “under the hood” I recommend checking out this article here, as it breaks down the numbers and explains Bayes’ theorem.)
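To make the “loading” step concrete: for a simple word-frequency classifier, training essentially means counting how often each word appears under each label. The tiny dataset and variable names below are purely illustrative; in practice you would load your full training file here.

```python
from collections import Counter, defaultdict

# Toy rows standing in for a real training dataset
training_data = [
    ("Congrats girl, great news", "positive"),
    ("Disappointing day, my head hurts", "negative"),
    ("Good night Twitter", "neutral"),
]

word_counts = defaultdict(Counter)  # label -> word frequencies
label_counts = Counter()            # label -> number of training examples

for text, label in training_data:
    label_counts[label] += 1
    word_counts[label].update(text.lower().split())
```

These counts are the raw material the classifier uses when it later scores new text against each category.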
I’ve trained my classifier, now what will it do?
As you pass text to your trained classifier it will attempt to identify which category it belongs to (positive, negative or neutral) and will return the PROBABILITY of the match.
So if we pass in the text “I love this new phone, it’s fantastic!”, our classifier would perform its magic and arrive at something like the following for each of our categories:
- Negative: 0.1
- Neutral: 0.25
- Positive: 0.98
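A minimal sketch of how a classifier might arrive at numbers like these, assuming a Naive Bayes approach (Bayes’ theorem with add-one smoothing); the training sentences, the tokeniser, and the `classify` function are all my own illustrative choices, not a library API, and here the scores are normalised so they sum to 1:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # crude tokeniser: lowercase word characters, keeping apostrophes
    return re.findall(r"[a-z']+", text.lower())

# Tiny illustrative training set; a real one would be far larger
training_data = [
    ("I love this phone it is fantastic", "positive"),
    ("this phone is terrible I hate it", "negative"),
    ("the phone arrived today", "neutral"),
]

word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in training_data:
    label_counts[label] += 1
    word_counts[label].update(tokenize(text))

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return a probability per category via Bayes' theorem
    with add-one smoothing, normalised to sum to 1."""
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # convert log scores into probabilities (softmax over the labels)
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

probs = classify("I love this new phone, it's fantastic!")
```

With this toy training set, “positive” comes out on top for the example sentence, which is the behaviour described above.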
So we’ve covered the basics of training the classifier and its output. You might be wondering where you can get sample training datasets, so I’ve included some links to get you started:
How big should my training datasets be?
I remember that while building my first sentiment analysis tool as part of my master’s degree research, for some reason I thought “the bigger the better”. I loaded as much as I could into the classifier (hundreds of thousands of rows) – it didn’t make much difference.
Training Dataset 2 has more than 5,000 hand-classified tweets and is a decent enough dataset to get you started.
Other important factors that can make a difference are:
- Having a similar amount of sample data in each of the categories
- Sufficient cleansing and pre-processing of the data
- Training data should be related to the domain you’re trying to classify; for example, don’t train your classifier with hotel reviews if you’re trying to classify academic papers.
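As a sketch of the kind of cleansing step mentioned above, assuming tweet-style input: strip URLs and @mentions, drop the `#` from hashtags, and normalise case and whitespace. The regexes are illustrative choices, not a standard:

```python
import re

def cleanse(text):
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = text.replace("#", "")              # keep hashtag words, drop the '#'
    return " ".join(text.lower().split())     # normalise case and whitespace

cleaned = cleanse("@Msdebramaye Congrats girl!! http://tinyurl.com/49955t3 #NCAA")
```

How aggressively you cleanse (e.g. whether to keep punctuation, which carries sentiment in tweets) is itself something worth experimenting with.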
By experimenting with different approaches and data cleansing techniques, I’ve been able to get my classifiers to run at an accuracy of 91%.
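A common way to measure accuracy figures like this is to hold back part of the labelled data for evaluation: train on most of it, then score predictions on the rest. A minimal sketch, where the data, the 80/20 split ratio, and the deliberately naive predictor are all illustrative:

```python
def train_test_split(rows, test_fraction=0.2):
    # hold back the last test_fraction of the rows for evaluation
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def accuracy(predict, test_rows):
    correct = sum(1 for text, label in test_rows if predict(text) == label)
    return correct / len(test_rows)

# toy data: alternating positive/negative labels, purely for illustration
rows = [(f"tweet {i}", "positive" if i % 2 == 0 else "negative")
        for i in range(10)]
train_rows, test_rows = train_test_split(rows)

always_positive = lambda text: "positive"  # stand-in for a real classifier
score = accuracy(always_positive, test_rows)
```

In a real experiment you would shuffle the rows before splitting and plug in your trained classifier in place of the stand-in predictor.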
In this post we’ve covered the following:
- Why and how we train our classifiers
- Training datasets and how to use them
- Briefly touched on accuracy
In my next post I’ll share the results of some of my experiments with sentiment analysis. In the meantime, if you have any comments, suggestions or thoughts then please drop me a message.