In my last post I introduced Bayes' rule and its relationship to sentiment analysis. In this post I'll talk about some of the difficulties of applying sentiment analysis and what we can do to try to improve its accuracy.
Sentiment analysis can be applied to many areas, but deciding whether a statement is positive or negative can be difficult. Text is generally split into two types: facts and opinions.
Facts are objective expressions about entities, events and their properties.
Liu describes opinions as subjective expressions of people's sentiments, appraisals or general feelings towards entities and their properties (Liu, 2010).
Human language can be complex for machine learning systems to interpret, and opinions may be expressed with sarcasm or irony. The order of words can add even more confusion.
Take the following example (Frank):
“I currently use the Nikon D90 and love it, but not as much as the Canon 40D/50D. I chose the D90 for the video feature. My mistake.”
The example above mixes positive and negative sentiment about two different entities. Sarcasm can be even harder for classifiers to process, as in this example (Frank):
“After a whole 5 hours away from work, I get to go back again, I’m so lucky!”
For a classifier to process data and produce accurate results, it must first be trained, which requires collecting training data. Various sources can be used; one popular approach is a corpus of movie reviews labelled as positive or negative, to which the algorithm is then applied. The best reported accuracy for this approach is approximately 82.9% (Read).
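To make this concrete, here is a minimal sketch of a Naive Bayes classifier trained on a tiny hand-labelled corpus. The training data below is invented for illustration; real experiments such as Read's use thousands of labelled movie reviews, but the mechanics are the same.

```python
from collections import Counter, defaultdict
import math

def train(labelled_docs):
    """Count word frequencies per class and class priors."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labelled corpus -- stand-in for a real movie review dataset.
training_data = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull and boring mess", "neg"),
    ("terrible plot and awful acting", "neg"),
]
model = train(training_data)
print(classify("a great and moving story", *model))  # → pos
```

With only four documents this is no more than a sketch, but it shows the shape of the process: count words per label, then score unseen text against each label.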
Improve accuracy by processing neutral sentiment
A paper published by Koppel discusses how the majority of research into sentiment analysis ignores "neutral" sentiment. The paper goes on to argue that identifying neutral polarity is crucial: learning from positive and negative examples alone will not result in accurate classification.
Koppel also reinforces the point that “In almost all actual polarity problems, including sentiment analysis, there are, however, at least three categories that must be distinguished: positive, negative and neutral.”
An article written by Kitsuregaw also reinforces the importance of classifying and training with neutral phrases. During error analysis, the majority of errors were found to relate to neutral phrases: of 48 incorrectly classified phrases, 37 were neutral, an error rate the article attributes to not using a "neutral corpus" when training the classifier.
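The practical takeaway is that "neutral" should be a first-class label in your training data, not merely the absence of signal. The sketch below is hypothetical (the examples and the crude word-overlap scorer are mine, not Koppel's method), but it shows what three-way labelled data looks like: factual statements get their own label rather than being forced into positive or negative.

```python
# Hypothetical three-way training data: note the factual, sentiment-free
# "neutral" examples alongside the positive and negative ones.
training_data = [
    ("i love this phone, it is brilliant", "positive"),
    ("this phone is awful and i regret buying it", "negative"),
    ("the phone was delivered on tuesday", "neutral"),
    ("the box contains a charger and a manual", "neutral"),
]

def overlap_score(text, examples):
    """Crude similarity: the largest number of words shared with any example."""
    words = set(text.lower().split())
    return max(len(words & set(e.lower().split())) for e in examples)

def classify_three_way(text, labelled):
    """Assign the label whose examples share the most words with the text."""
    by_label = {}
    for example, label in labelled:
        by_label.setdefault(label, []).append(example)
    return max(by_label, key=lambda lbl: overlap_score(text, by_label[lbl]))

print(classify_three_way("the charger was in the box", training_data))  # → neutral
```

A classifier trained only on the positive and negative rows would have no choice but to mislabel the final query, which is exactly the failure mode the error analysis above describes.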
It doesn't end there, and there is more that can be done. You will often find strings of text that add no value to the classification process, which brings us onto "stop words".
In computing terms, stop words are words which are filtered out before or after processing natural language data and text (Wikipedia). Unfortunately there is no single definitive list of stop words, and in sentiment analysis any group of words can be chosen as stop words. They are sometimes known as "noise words".
Search engines, for example, do not index common stop words, in order to save disk space and speed up searches (Sullivan) – i.e. search engines "stop" looking at them.
A common list of stop words might contain words such as: a, an, and, are, as, at, be, by, for, in, is, it, of, on, the, to.
Notice they don't express much in the way of emotion?
If you're trying to perform sentiment analysis, you'll want to remove words like these from your classification process.
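Filtering stop words out is a one-liner in most languages. The list below is illustrative only; as noted above, there is no definitive list, so pick one that suits your domain.

```python
# A short, illustrative stop-word list -- choose your own for real use.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by",
              "for", "in", "is", "it", "of", "on", "the", "to"}

def remove_stop_words(text):
    """Drop stop words before classification, keeping the emotive terms."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The acting is brilliant and the plot is gripping"))
# → ['acting', 'brilliant', 'plot', 'gripping']
```

Note how everything that survives the filter actually carries sentiment, which is exactly what the classifier should be learning from.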
Collecting data is only one part of the challenge. Once data has been obtained and stop words have been removed, it must be cleansed further: to produce more accurate results, data is generally "pre-processed". The approach to pre-processing will vary depending on your problem domain.
If you were pre-processing tweets from Twitter, for example, you might replace every hyperlink with a stop word such as URL. Most URLs on Twitter are shortened, so by instructing your classifier to treat them all as a single stop word you save yourself some processing time. You may adopt the same approach with usernames to further cleanse each tweet.
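A couple of regular expressions cover both cases. This is a minimal sketch: the URL and USERNAME placeholder tokens are my own choice, and a real pipeline would likely also strip punctuation, hashtags and repeated characters.

```python
import re

def preprocess_tweet(tweet):
    """Replace hyperlinks and @usernames with placeholder tokens that the
    classifier can later treat as stop words."""
    tweet = tweet.lower().strip()
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # shortened links
    tweet = re.sub(r"@\w+", "USERNAME", tweet)      # user mentions
    return tweet

print(preprocess_tweet("@bob loving the new camera! http://t.co/abc123"))
# → USERNAME loving the new camera! URL
```

Because every link and every mention collapses to the same token, the classifier sees two stop words instead of an endless stream of unique strings.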
If you've identified suitable stop words and cleansed the data sufficiently, you're one step closer to being able to perform sentiment analysis against your data!
The final thing you have to do is to TRAIN your classifier. I’ll cover that in my next post though and discuss what options you have.
As always if you have any comments, suggestions or have other insights then please drop me a message!