Software Architect / Microsoft MVP (AI) and Technical Author


What is POS (Part of Speech) Tagging? How can I use it?

During my MSc a few years ago, whilst specialising in machine learning, sentiment analysis and Bayes’ theorem, I encountered POS tagging, a technique that I could use to improve a computer’s understanding of human language.

What is POS Tagging?

POS tagging is the process of assigning a ‘tag/category’ (in the form of an abbreviated code) to each word (token) in a given sentence.

In the English language for example, common POS categories are:

  • nouns
  • verbs
  • adjectives
  • adverbs
  • pronouns
  • prepositions
  • conjunctions
  • interjections

Other categories can be derived from different forms of the above; for example, a verb can be in its base form or in the past tense.

Penn Treebank POS Tags

For the purposes of this blog post I have focussed on the Penn Treebank POS Tag Set. You can see the entire list of these POS tags in the table below:

Tag   Description                               Tag   Description
CC    Coordinating conjunction                  PRP$  Possessive pronoun
CD    Cardinal number                           RB    Adverb
DT    Determiner                                RBR   Adverb, comparative
EX    Existential there                         RBS   Adverb, superlative
FW    Foreign word                              RP    Particle
IN    Preposition or subordinating conjunction  SYM   Symbol
JJ    Adjective                                 TO    to
JJR   Adjective, comparative                    UH    Interjection
JJS   Adjective, superlative                    VB    Verb, base form
LS    List item marker                          VBD   Verb, past tense
MD    Modal                                     VBG   Verb, gerund or present participle
NN    Noun, singular or mass                    VBN   Verb, past participle
NNS   Noun, plural                              VBP   Verb, non-3rd person singular present
NNP   Proper noun, singular                     VBZ   Verb, 3rd person singular present
NNPS  Proper noun, plural                       WDT   Wh-determiner
PDT   Predeterminer                             WP    Wh-pronoun
POS   Possessive ending                         WP$   Possessive wh-pronoun
PRP   Personal pronoun                          WRB   Wh-adverb
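As a rough illustration of how these fine-grained tags relate to the broad categories listed earlier, the sketch below collapses a Penn Treebank tag into one of those categories. The function name `coarse_category` and the grouping itself are my own assumptions for illustration; they are not part of the tag set or of any tagger.

```python
def coarse_category(tag: str) -> str:
    """Collapse a Penn Treebank tag into a broad category.

    The grouping is an illustrative assumption, not an official mapping.
    """
    if tag.startswith("NN"):          # NN, NNS, NNP, NNPS
        return "noun"
    if tag.startswith("VB"):          # VB, VBD, VBG, VBN, VBP, VBZ
        return "verb"
    if tag.startswith("JJ"):          # JJ, JJR, JJS
        return "adjective"
    if tag.startswith("RB") or tag == "WRB":   # RB, RBR, RBS, WRB
        return "adverb"
    if tag in {"PRP", "PRP$", "WP", "WP$"}:
        return "pronoun"
    if tag == "IN":                   # also covers subordinating conjunctions
        return "preposition"
    if tag == "CC":
        return "conjunction"
    if tag == "UH":
        return "interjection"
    return "other"                    # DT, CD, MD, TO and so on


print(coarse_category("VBD"))   # past-tense verb
print(coarse_category("JJS"))   # superlative adjective
```

Checking the tag prefix keeps the mapping short: every verb form in the table starts with VB, every noun form with NN, and so on.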

 

So how do I use this?

If we take the following statement, “I think I’m going to get a new computer”, and run it through a POS tagger such as the Stanford Tagger, the tagger will take each token (word) in the sentence and return the following:

I/FW think/NN I’m/NN going/VBG to/TO get/VB a/DT new/JJ computer/NN

Having the information in this format makes it easier for the machine to identify patterns and linguistic constructs. The machine can also begin to infer the context of each word in the sentence from the tags of the words next to it.
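To make the word/TAG format concrete, here is a minimal sketch that splits such a string into (word, tag) pairs and then lists the pairs of adjacent tags. The helper names `parse_tagged` and `tag_bigrams` are hypothetical, and the code assumes whitespace-separated tokens with the tag after the last slash.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so a word that contains '/' stays intact.
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((word, tag))
    return pairs


def tag_bigrams(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return each pair of tags that sit next to each other."""
    tags = [tag for _, tag in pairs]
    return list(zip(tags, tags[1:]))


tagged = parse_tagged("get/VB a/DT new/JJ computer/NN")
print(tagged)
print(tag_bigrams(tagged))
```

Once the output is held as pairs like this, looking for a pattern becomes a matter of scanning the tag sequence rather than the raw text.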

What kind of patterns would be valuable?

In some of my earlier posts I covered sentiment analysis and opinion mining. Taking POS tags into account, we can further improve the accuracy of sentiment analysis techniques by looking for specific patterns.

If we consider the following POS tagged sentence:

“phone/NN is/VBZ great/JJ”.

(The word ‘is’ would have been removed as a stop word during pre-processing.)

We know the sentence contains a noun (phone) and that an adjective (great) has been used to describe it.

We can therefore infer that NN JJ is a specific pattern worth looking for. There are other valuable patterns, which you can see in the table below:

[Image: table of POS patterns]
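A minimal sketch of this kind of pattern search might look like the following. It drops stop words first and then slides a window over the remaining tags. The tiny `STOP_WORDS` set and the function name `find_pattern` are assumptions made for illustration, and only the NN JJ pattern from the example sentence is used.

```python
from typing import Iterable, List, Sequence, Tuple

# Tiny illustrative stop word list (an assumption, not a real lexicon).
STOP_WORDS = {"is", "a", "the", "to"}


def find_pattern(
    pairs: Sequence[Tuple[str, str]],
    pattern: Sequence[str],
    stop_words: Iterable[str] = STOP_WORDS,
) -> List[List[str]]:
    """Return the words of every run of tokens whose tags equal `pattern`.

    Stop words are removed before matching, so 'phone is great'
    is treated as 'phone great'.
    """
    stop = set(stop_words)
    kept = [(word, tag) for word, tag in pairs if word.lower() not in stop]
    size = len(pattern)
    matches = []
    for start in range(len(kept) - size + 1):
        window = kept[start:start + size]
        if [tag for _, tag in window] == list(pattern):
            matches.append([word for word, _ in window])
    return matches


sentence = [("phone", "NN"), ("is", "VBZ"), ("great", "JJ")]
matches = find_pattern(sentence, ("NN", "JJ"))
print(matches)
```

The matched words can then be joined back into a short string, which is the kind of reduced input you would hand to a sentiment analysis step.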

Armed with this information, we can pass this specific text into a sentiment analysis API such as the Social Opinion REST API, which would return the following:

[Image: response returned by the REST API]

As we’ve passed in only very specific tokens, it’s easier and quicker for a sentiment analysis engine to process this information.

I hope this post has given you some insight into how POS tagging works and how it can be applied when performing text classification and manipulating datasets. There’s lots more that can be done, but the purpose of this post was to introduce it.

Some links that might be helpful are:

Have you ever had to use this in a project?

Have you found better patterns?

Let me know your thoughts and drop me a message.
