People have always been interested in what specific individuals or groups of individuals think. Since the inception of the internet, increasing numbers of people have been using on-line websites and services to express their opinions. With social media sites such as Facebook, LinkedIn and Twitter, it is becoming feasible to automatically gauge public opinion on a given topic, news story, product or brand.
Opinions mined from such services can be valuable: the datasets gathered can be analysed and presented in a way that makes it easy to identify whether the on-line mood is positive or negative. This allows individuals or businesses to be proactive rather than reactive when a negative conversational thread is emerging. Conversely, positive sentiment can be identified, making it possible to find product advocates or to see which parts of a business strategy are working.
The amount of opinion data available on-line is vast compared to traditional opinion-gathering methods such as paper-based questionnaires and surveys. Just look at the following statistics, which describe what happens on-line every minute (taken from here):
- Facebook users share nearly 2.5 million pieces of content.
- Twitter users tweet nearly 300,000 times.
- Instagram users post nearly 220,000 new photos.
- YouTube users upload 72 hours of new video content.
- Apple users download nearly 50,000 apps.
- Email users send over 200 million messages.
- Amazon generates over $80,000 in online sales.
Making “sense of the noise” can be difficult; however, computational methods can be applied to automatically extract, analyse and classify this opinion data. This technique is known as Sentiment Analysis, which is commonly treated as an application of Machine Learning.
Challenges of Sentiment Analysis
Sentiment analysis is not without its challenges. On-line opinion data, for example, is usually published in natural language, which is unstructured and therefore hard to categorise. This is the problem most often encountered with sentiment analysis. The interpretation of a subject’s mood may also vary from one person to another, a problem made even harder by the format in which the subject is analysed. These challenges, coupled with certain nuances of the English language, can make some texts hard to process.
A paper written by Bing Liu discusses how performing sentiment analysis is a “multi-faceted problem” and goes on to detail some of the current challenges. You can find more information on-line by googling around, but enough of the challenges; on to some of the techniques that can be used to implement sentiment analysis.
Sentiment Analysis Techniques
There are many ways to implement Sentiment Analysis. Ultimately, it is a text classification problem, and the approaches can be broken down into two main areas (Carstens, 2011):
- Supervised Learning
- Unsupervised Learning
Supervised Learning involves the construction of a “Classifier”, and the problem has been studied intensively. The Classifier is responsible for categorising texts as having a positive, negative or neutral polarity.
The three main classification techniques are:
- Naïve Bayes
- Maximum Entropy
- Support Vector Machines (SVM)
Of these, SVM is reported to provide the best accuracy (Bing Liu / Pang et al, 2012).
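To make the idea of a Classifier a little more concrete, here is a minimal sketch of a supervised text classifier. It happens to use Naïve Bayes with add-one smoothing, written in plain Python. The training sentences, the function names (`tokenize`, `train`, `classify`) and the whitespace tokeniser are all assumptions made up for illustration; a real project would use far more labelled data and most likely an established library.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split on whitespace (deliberately simplistic)."""
    return text.lower().split()

def train(labelled_texts):
    """Count words per label from an iterable of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training texts
    for text, label in labelled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior: how common is this label in the training set?
        score = math.log(label_counts[label] / total_texts)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set covering the three polarities.
training_data = [
    ("i love this phone", "positive"),
    ("great service and great price", "positive"),
    ("i hate the battery", "negative"),
    ("terrible support and awful screen", "negative"),
    ("the phone arrived on monday", "neutral"),
    ("it comes in a box", "neutral"),
]

model = train(training_data)
print(classify("i love the great price", model))
print(classify("awful battery", model))
```

With this toy data the first sentence should come out as positive and the second as negative, but six training sentences are nowhere near enough for anything beyond a demonstration.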
Unsupervised Learning has three steps. The first is to apply Part-of-Speech (POS) tagging and then extract pairs of consecutive words whose tags conform to given patterns. The second is to estimate the sentiment orientation (SO) of each extracted phrase. The third is to compute the average SO of all the extracted phrases and classify the text as positive or negative accordingly.
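The three steps can be sketched roughly as follows. To keep the example self-contained, the input is already POS-tagged by hand, the tag patterns are just an assumed sample, and the SO scores in `PHRASE_SO` are made-up numbers standing in for whatever estimation method is actually used in step two. None of these names or values come from a real library.

```python
# Step 1 input: a hypothetical review, already tagged as (word, POS tag) pairs.
tagged_review = [
    ("the", "DT"), ("camera", "NN"), ("takes", "VBZ"),
    ("great", "JJ"), ("pictures", "NNS"), ("but", "CC"),
    ("has", "VBZ"), ("poor", "JJ"), ("battery", "NN"),
]

# Assumed sample of tag patterns for two consecutive words,
# e.g. an adjective (JJ) followed by a noun (NN / NNS).
PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("RB", "JJ")}

# Step 2 stand-in: invented sentiment orientation scores per phrase.
PHRASE_SO = {"great pictures": 2.0, "poor battery": -1.5}

def extract_phrases(tagged_words):
    """Return two-word phrases whose tag pair matches one of the patterns."""
    phrases = []
    for (word1, tag1), (word2, tag2) in zip(tagged_words, tagged_words[1:]):
        if (tag1, tag2) in PATTERNS:
            phrases.append(f"{word1} {word2}")
    return phrases

def average_so(phrases, so_scores):
    """Step 3: average the SO of all extracted phrases (0.0 if none found)."""
    scores = [so_scores.get(phrase, 0.0) for phrase in phrases]
    return sum(scores) / len(scores) if scores else 0.0

phrases = extract_phrases(tagged_review)
score = average_so(phrases, PHRASE_SO)
label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
print(phrases, score, label)
```

Here the two extracted phrases pull in opposite directions, and the review is labelled by the sign of the average.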
Naïve Bayes is the technique that I’m going to focus on for the purpose of this series of blog posts. It is based on Bayes’ theorem, which is used as a means of arriving at predictions in light of relevant evidence. The theorem deals with conditional probability and is also known as inverse probability. It was discovered by an English Presbyterian minister and mathematician called Thomas Bayes and published posthumously in 1763 (Routledge). It’s easy enough to get your head around and there are quite a few implementations on the web. I suggest checking them out.
I think that’s enough for just now. In my next post I’ll detail the underlying theory and implement an example.
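As a small taste of what “predictions in light of relevant evidence” means, Bayes’ theorem says that P(A|B) = P(B|A) × P(A) / P(B). The sketch below applies it to a single word. The probabilities are invented purely for illustration, and the function name `posterior` is my own.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' theorem for two outcomes: return P(A | B)."""
    # P(B) by the law of total probability over A and not-A.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Invented numbers: half of all reviews are positive, the word "great"
# appears in 20% of positive reviews and in 5% of the other reviews.
p_positive_given_great = posterior(
    prior_a=0.5,
    likelihood_b_given_a=0.20,
    likelihood_b_given_not_a=0.05,
)
print(round(p_positive_given_great, 2))
```

Under these assumed numbers, seeing the word “great” raises the probability that a review is positive from 0.5 to roughly 0.8.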
As always, if you have any questions, comments or suggestions then drop me a message.
Are you using sentiment analysis or machine learning in any of your projects?