During my MSc a few years ago, while specialising in machine learning, sentiment analysis and Bayes' theorem, I encountered a technique for improving a computer's understanding of human language: part-of-speech (POS) tagging.
What is POS Tagging?
POS tagging is the process of assigning a ‘tag/category’ (in the form of an abbreviated code) to each word (token) in a given sentence.
In the English language for example, common POS categories are:
- nouns
- verbs
- adjectives
- adverbs
- pronouns
- prepositions
- conjunctions
- interjections
Other categories can be derived from different forms of the above, for example a verb can be in its base form or in past tense.
Penn Treebank POS Tags
For the purposes of this blog post I have focussed on the Penn Treebank POS tag set. You can see the entire list of these POS tags in the table below:
Tag | Description | Tag | Description |
CC | Coordinating conjunction | PRP$ | Possessive pronoun |
CD | Cardinal number | RB | Adverb |
DT | Determiner | RBR | Adverb, comparative |
EX | Existential there | RBS | Adverb, superlative |
FW | Foreign word | RP | Particle |
IN | Preposition or subordinating conjunction | SYM | Symbol |
JJ | Adjective | TO | to |
JJR | Adjective, comparative | UH | Interjection |
JJS | Adjective, superlative | VB | Verb, base form |
LS | List item marker | VBD | Verb, past tense |
MD | Modal | VBG | Verb, gerund or present participle |
NN | Noun, singular or mass | VBN | Verb, past participle |
NNS | Noun, plural | VBP | Verb, non-3rd person singular present |
NNP | Proper noun, singular | VBZ | Verb, 3rd person singular present |
NNPS | Proper noun, plural | WDT | Wh-determiner |
PDT | Predeterminer | WP | Wh-pronoun |
POS | Possessive ending | WP$ | Possessive wh-pronoun |
PRP | Personal pronoun | WRB | Wh-adverb |
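To illustrate how the fine-grained tags in the table relate to the broad categories listed earlier, here is a minimal sketch that collapses a Penn Treebank tag into a coarse category. The function name `coarse_category` and both lookup tables are my own assumptions for illustration, not part of any tagger's API, and the grouping is deliberately rough (for example, IN also covers subordinating conjunctions).

```python
# Hypothetical helper: map a Penn Treebank tag to a broad category.
# The groupings below are an illustrative assumption, not a standard.
COARSE_EXACT = {
    "IN": "preposition",   # IN also covers subordinating conjunctions
    "CC": "conjunction",
    "UH": "interjection",
}

COARSE_BY_PREFIX = [
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS
    ("PRP", "pronoun"),    # PRP, PRP$
    ("WP", "pronoun"),     # WP, WP$
]


def coarse_category(tag: str) -> str:
    """Return a broad category for a tag, or 'other' if none applies."""
    if tag in COARSE_EXACT:
        return COARSE_EXACT[tag]
    for prefix, category in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return category
    return "other"


for tag in ["VBD", "NNS", "JJR", "DT"]:
    print(tag, "->", coarse_category(tag))
```

Tags that have no counterpart in the eight broad categories (DT, CD, RP and so on) fall through to "other" in this sketch.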
So how do I use this?
If we take the statement “I think I’m going to get a new computer” and run it through a POS tagger such as the Stanford Tagger, it takes each token (word) in the sentence and returns the following:
“I/FW think/NN I’m/NN going/VBG to/TO get/VB a/DT new/JJ computer/NN”
Having the information in this format makes it easier for the machine to identify patterns and linguistic constructs, which in turn makes the text easier to process. The machine can also begin to infer the context of each word from the words adjacent to it.
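As a rough sketch of what working with this format can look like, the snippet below splits a `word/TAG` string into (word, tag) pairs and then lists the adjacent tag pairs, which is one simple way to look at neighbouring words. The function name `parse_tagged` is hypothetical, and the input is just the tail of the example sentence above; real tagger libraries usually hand you structured output directly.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so a token containing '/' stays intact.
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("to/TO get/VB a/DT new/JJ computer/NN")
print(pairs)

# Adjacent tag pairs (bigrams) give a simple view of local context.
tags = [tag for _, tag in pairs]
print(list(zip(tags, tags[1:])))
```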
What kind of patterns would be valuable?
In some of my earlier posts I covered sentiment analysis and opinion mining. By taking POS tags into account, we can further improve the accuracy of sentiment analysis techniques by looking for specific patterns.
If we consider the following POS tagged sentence:
“phone/NN is/VBZ great/JJ”.
(The word ‘is’ would have been removed as a Stop Word during pre-processing)
We know the sentence contains a noun (phone) and that an adjective (great) has been used to describe it.
We can therefore infer that NN JJ is a specific pattern worth looking for. Others that are also valuable are listed in the table below:
Armed with this information, we can pass this specific text into a sentiment analysis API such as the Social Opinion REST API which would return the following:
As we’ve passed in very specific tokens it’s easier and quicker for a sentiment analysis engine to process this information.
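The stop-word removal and NN JJ matching described above could be sketched roughly as follows. This is a minimal illustration under my own assumptions: the stop-word list is a tiny made-up sample, the function names are hypothetical, and only the exact tags NN and JJ are matched (not NNS, JJR and so on).

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"is", "the", "a", "an"}


def remove_stop_words(pairs: List[TaggedToken]) -> List[TaggedToken]:
    """Drop tokens whose lower-cased word is in STOP_WORDS."""
    return [(word, tag) for word, tag in pairs if word.lower() not in STOP_WORDS]


def find_noun_adjective(pairs: List[TaggedToken]) -> List[Tuple[str, str]]:
    """Return (noun, adjective) for every NN immediately followed by a JJ."""
    matches = []
    for (word1, tag1), (word2, tag2) in zip(pairs, pairs[1:]):
        if tag1 == "NN" and tag2 == "JJ":
            matches.append((word1, word2))
    return matches


# 'is' is tagged VBZ per the Penn Treebank table; it is removed anyway.
tagged = [("phone", "NN"), ("is", "VBZ"), ("great", "JJ")]
filtered = remove_stop_words(tagged)
print(find_noun_adjective(filtered))  # [('phone', 'great')]
```

The matched pairs are the "very specific tokens" you would then pass on to a sentiment analysis engine.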
I hope this post has given you some insight into how POS tagging works and how it can be applied when performing text classification and manipulating datasets. There’s lots more that can be done, but the purpose of this post was to introduce it.
Some links that might be helpful are:
- Social Opinion REST API (beta) – performs sentiment analysis against text
- PartsOfSpeech.info – online POS Tagger by Stanford University
Have you ever had to use this in a project?
Have you found better patterns?
Let me know your thoughts and drop me a message.