In my last post I introduced sentiment analysis, the Naïve Bayes classification technique and why you or your business might be interested in this.
In this post I’ll look at Bayes’ rule in more detail, walk through an example and show how it’s connected to sentiment analysis.
The rule itself is written like this: (Boone)
p(A|B) = p(B|A) p(A) / p(B)
Now let’s break this down and explain each component:
p(A|B): ‘The probability of A given B’. This basically means the probability of finding observation A, given that some part of evidence B is there. This is what we want to find out. (Boone)
p(B|A): This is the probability of the evidence turning up, given that the outcome has occurred.
p(A): This is the probability of the outcome occurring, without the knowledge of the new evidence.
p(B): This is the probability of the evidence arising, without regard to the outcome.
The sample data set discussed by (Amiune) illustrates how the theorem can be used to decide whether or not an email is spam, given that the word “buy” appears in the mail body.
P(spam|words) = P(words|spam) P(spam) / P(words)
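We have a database of 100 emails.
- 60 of those 100 emails are spam
- 48 of those 60 emails that are spam have the word “buy”
- 12 of those 60 emails that are spam don’t have the word “buy”
- 40 of those 100 emails aren’t spam
- 4 of those 40 emails that aren’t spam have the word “buy”
- 36 of those 40 emails that aren’t spam don’t have the word “buy”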
What is the probability that an email is spam if it has the word “buy” in the content?
The answer to the above is as follows:
- There are 48 emails that are spam and have the word “buy”.
- And there are 52 emails that have the word “buy”: 48 that are spam plus 4 that aren’t spam.
So the probability that an email is spam if it has the word “buy” is 48/52 ≈ 0.92. An email containing “buy” should therefore probably go in the spam folder.
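The counting argument above can be sketched in a few lines of Python. This is only an illustration: the function and variable names are my own, and the two counts are the ones quoted in the example.

```python
# Counts quoted in the example: emails that contain the word "buy".
spam_with_buy = 48      # spam emails containing "buy"
not_spam_with_buy = 4   # non-spam emails containing "buy"


def p_spam_given_word(spam_hits: int, not_spam_hits: int) -> float:
    """Fraction of the emails containing the word that are spam."""
    total_hits = spam_hits + not_spam_hits
    if total_hits == 0:
        raise ValueError("no emails contain the word")
    return spam_hits / total_hits


result = p_spam_given_word(spam_with_buy, not_spam_with_buy)
print(f"P(spam | 'buy') = {result:.2f}")
```

Note that only the emails containing “buy” matter here; the other 48 emails never enter the calculation.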
Redefining the problem to use probabilities
As mentioned previously, the rule and its notation are based on probabilities, so we can redefine the problem to use probabilities rather than quantities. Using the same database of emails:
- 60% of those emails are spam
- 80% of those emails that are spam have the word “buy”
- 20% of those emails that are spam don’t have the word “buy”
- 40% of those emails aren’t spam
- 10% of those emails that aren’t spam have the word “buy”
- 90% of those emails that aren’t spam don’t have the word “buy”
What is the probability that an email is spam if it has the word “buy”?
The notation to arrive at the answer looks like this:
- P(spam) = probability that an email is spam
- P(not spam) = probability that an email isn’t spam
- P(“buy”|spam) = probability that an email that is spam has the word “buy”
- P(“buy”|not spam) = probability that an email that isn’t spam has the word “buy”
- P(spam|“buy”) = probability that an email that has the word “buy” is spam
- So P(spam|“buy”) is the answer we are looking for
- P(“buy”|spam) * P(spam) counts all the emails that are spam and have the word “buy”
- P(“buy”|not spam) * P(not spam) counts all the emails that aren’t spam and have the word “buy”
Summing the previous two, P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam), we count all the emails that have the word “buy”.
Meaning the resulting equation looks like this:
P(spam|“buy”) = P(“buy”|spam) * P(spam) / (P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam))
This is Bayes’ theorem.
Or, to inject the numbers: 0.8 * 0.6 / (0.8*0.6 + 0.1*0.4) = 0.48 / 0.52 ≈ 0.923
A simulation of this setup gave a very close result: 0.9222485960747988
Or in plain English, based on our existing data set, there is a 92% chance that an email containing the word “buy” is spam.
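To make the arithmetic concrete, here is a minimal Python sketch that evaluates the formula and also runs a small random simulation of the same setup. It is not the simulation that produced the figure quoted above; the function names, the sample size and the seed are my own assumptions.

```python
import random


def posterior(p_word_given_spam: float, p_spam: float,
              p_word_given_not_spam: float) -> float:
    """Bayes' theorem, with P(word) expanded over spam and not spam."""
    p_not_spam = 1.0 - p_spam
    p_word = (p_word_given_spam * p_spam
              + p_word_given_not_spam * p_not_spam)
    return p_word_given_spam * p_spam / p_word


def simulate(n_emails: int, seed: int = 0) -> float:
    """Draw random emails and measure how many 'buy' emails are spam."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    buy_total = 0
    buy_and_spam = 0
    for _ in range(n_emails):
        is_spam = rng.random() < 0.6      # 60% of emails are spam
        p_buy = 0.8 if is_spam else 0.1   # chance the email contains "buy"
        if rng.random() < p_buy:
            buy_total += 1
            buy_and_spam += int(is_spam)
    return buy_and_spam / buy_total


exact = posterior(0.8, 0.6, 0.1)
estimate = simulate(100_000)
print(f"exact: {exact:.4f}, simulated: {estimate:.4f}")
```

The simulated value will wobble slightly around the exact one depending on the seed and sample size, which is why a simulation result can differ from 0.48 / 0.52 in the third decimal place.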
So how do we use this theorem for sentiment analysis? Read on!
Sentiment Analysis using Bayes’ Theorem
Performing sentiment analysis using Bayes’ theorem involves writing a Naïve Bayes classifier, which is based on the Bayes rule that we’ve just discussed. This rule is a way of calculating the conditional probability of an outcome from a given set of known probabilities. As we’ve just seen, the rule is often used in email systems to detect whether an email is spam based on the presence of a certain set of keywords. For sentiment analysis the outcome becomes a sentiment label such as positive or negative, and the evidence is the words in the text.
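As a rough illustration of the idea, here is a minimal Naïve Bayes sentiment classifier in Python. It is a sketch only and not the sample classifier mentioned in this post: the training sentences are invented, the function names are my own, and I have assumed add-one (Laplace) smoothing and log probabilities, which are common choices but not something covered above.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented purely for illustration.
TRAINING = [
    ("i love this product", "pos"),
    ("great service and great price", "pos"),
    ("i hate this product", "neg"),
    ("terrible service and poor quality", "neg"),
]


def train(examples):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest (log) posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # log P(label): the prior
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # log P(word | label) with add-one smoothing (an assumption)
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("i love the great price", model))
print(classify("terrible quality", model))
```

The “naïve” part is the assumption that each word contributes independently, so the per-word probabilities are simply multiplied (added, in log form). The shared P(words) denominator is dropped because it is the same for every label and does not change which one wins.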
You can find a sample classifier on GitHub. Have a play around with it and see how you get on. In my next post I’ll talk a little more about the difficulties of sentiment analysis and how some of them can be alleviated.
In the meantime, feel free to reach out if you have any questions or comments.