Text Preprocessing: Part-of-Speech (POS) Tagging using NLTK

1. Introduction

In Natural Language Processing (NLP), Part-of-Speech (POS) tagging is the process of assigning a grammatical label, such as noun, verb, or adjective, to each word in a sentence.

For example, in the sentence:
“Arjun is learning NLP.”
the words can be tagged as:

Arjun → Noun (proper noun)
is → Verb (auxiliary verb)
learning → Verb (present participle)
NLP → Noun (proper noun)

POS tagging provides syntactic information that helps NLP systems determine how words relate grammatically. It supports downstream tasks such as:

Syntactic parsing
Named Entity Recognition (NER)
Sentiment analysis
Information extraction

2. What is POS Tagging?

Definition:
Part-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech based on both its definition and its context.

Example

Word	POS Tag	Description
Arjun	NNP	Proper Noun, Singular
is	VBZ	Verb, 3rd person singular present
learning	VBG	Verb, Gerund/Participle
NLP	NNP	Proper Noun, Singular

The tags NNP, VBZ, and VBG come from the Penn Treebank POS tagset, which is widely used in NLP.

3. Installing and Importing NLTK

To perform POS tagging, we use NLTK (Natural Language Toolkit), one of the most popular Python libraries for text processing.

Step 1: Install NLTK

pip install nltk

Step 2: Import Required Modules and Download Resources

import nltk
from nltk import word_tokenize, pos_tag

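# Download the tokenizer and tagger resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Recent NLTK releases (around 3.9 and later) look for these names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')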

4. Performing POS Tagging in NLTK

Let’s start with a simple example.

text = "Arjun is learning Natural Language Processing using Python."

# Tokenize the sentence
tokens = word_tokenize(text)

# Apply POS tagging
pos_tags = pos_tag(tokens)

print(pos_tags)

Expected Output (exact tags may vary slightly between NLTK versions and tagger models)

[('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
 ('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NN'),
 ('using', 'VBG'), ('Python', 'NNP'), ('.', '.')]

Explanation:

NNP → Proper Noun, Singular
VBZ → Verb, 3rd person singular present
VBG → Verb, Gerund/Participle
JJ → Adjective
NN → Common Noun, Singular

5. POS Tagging for Multiple Sentences

To tag text that contains several sentences, first split it into sentences, then tokenize and tag each sentence separately.

from nltk import sent_tokenize

text = """Arjun is learning Natural Language Processing.
He enjoys solving problems related to text analytics."""

# Sentence tokenization
sentences = sent_tokenize(text)

# Apply word tokenization and POS tagging for each sentence
for sent in sentences:
    tokens = word_tokenize(sent)
    tags = pos_tag(tokens)
    print(tags)

Expected Output

[('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
 ('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NN'), ('.', '.')]

[('He', 'PRP'), ('enjoys', 'VBZ'), ('solving', 'VBG'),
 ('problems', 'NNS'), ('related', 'VBN'), ('to', 'TO'),
 ('text', 'NN'), ('analytics', 'NNS'), ('.', '.')]

Explanation:

PRP → Personal pronoun
NNS → Noun, plural
VBN → Verb, past participle
TO → The word “to” (preposition or infinitive marker)
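
Once each sentence is tagged, a quick way to inspect the result is to count how often each tag occurs. The sketch below works on already-tagged sentences, so it runs without downloading any models. The tagged_sentences list is copied from the expected output above rather than recomputed, and tag_frequencies is a hypothetical helper name, not part of NLTK.

```python
from collections import Counter

# Tagged sentences copied from the expected output above (assumed, not recomputed here).
tagged_sentences = [
    [("Arjun", "NNP"), ("is", "VBZ"), ("learning", "VBG"),
     ("Natural", "JJ"), ("Language", "NN"), ("Processing", "NN"), (".", ".")],
    [("He", "PRP"), ("enjoys", "VBZ"), ("solving", "VBG"),
     ("problems", "NNS"), ("related", "VBN"), ("to", "TO"),
     ("text", "NN"), ("analytics", "NNS"), (".", ".")],
]

def tag_frequencies(sentences):
    """Count how many tokens carry each POS tag across all sentences."""
    return Counter(tag for sentence in sentences for _, tag in sentence)

freqs = tag_frequencies(tagged_sentences)

# Show the three most frequent tags.
for tag, count in freqs.most_common(3):
    print(tag, count)
```

With real data, the same helper can be applied to a list built from pos_tag(word_tokenize(sent)) for each sentence.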

6. Understanding POS Tags (Penn Treebank Tagset)

The Penn Treebank Tagset defines standard POS labels used by NLTK.
You can view all of them using:

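nltk.download('tagsets')
nltk.download('tagsets_json')  # resource name expected by recent NLTK releases (around 3.9 and later)
nltk.help.upenn_tagset()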

A few common ones are listed below:

Tag	Meaning	Example
NN	Noun, singular	car, tree
NNS	Noun, plural	cars, trees
NNP	Proper noun, singular	Arjun, India
VB	Verb, base form	go, eat
VBD	Verb, past tense	went, ate
VBG	Verb, gerund or present participle	eating, running
JJ	Adjective	big, fast
RB	Adverb	quickly, silently
PRP	Personal pronoun	he, she, they
IN	Preposition or subordinating conjunction	in, on, at
DT	Determiner	the, an, a
CC	Coordinating conjunction	and, or, but
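
Because related Penn Treebank tags share a prefix (NN, NNS, and NNP are all nouns; VB, VBD, and VBG are all verbs), fine-grained tags can be collapsed into coarse categories with a prefix check. The following is a minimal sketch of that idea; the category names and the coarse_category helper are illustrative choices, not part of NLTK or the tagset itself.

```python
# Prefix -> coarse category. The category names are illustrative assumptions.
COARSE_PREFIXES = [
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS
    ("PRP", "pronoun"),    # PRP, PRP$
]

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category using its prefix."""
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return "other"

for tag in ["NNS", "NNP", "VBD", "VBG", "JJ", "RB", "PRP", "IN"]:
    print(tag, "->", coarse_category(tag))
```

Tags that match none of the listed prefixes, such as IN or DT, fall into "other" in this sketch.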

7. Example: Extracting Specific POS Tags

You can extract words belonging to a particular grammatical category, for instance only nouns or only verbs.

Example – Extracting Nouns

text = "Arjun loves playing football and watching science documentaries."

tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print("Nouns:", nouns)

Expected Output

Nouns: ['Arjun', 'football', 'science', 'documentaries']

Example – Extracting Verbs

# Reuses pos_tags from the noun example above
verbs = [word for word, tag in pos_tags if tag.startswith('VB')]
print("Verbs:", verbs)

Expected Output

Verbs: ['loves', 'playing', 'watching']

8. POS Tagging in Different Languages

NLTK’s default POS tagger is trained on English text.
For other languages, you can use libraries such as spaCy or Stanza, which provide pretrained models for many languages.

For English text, NLTK’s averaged_perceptron_tagger is accurate enough for most general-purpose use, although accuracy can drop on text that differs from its training data.

9. Combining POS Tagging with Other NLP Tasks

POS tagging is often combined with:

Named Entity Recognition (NER) – identifying named entities such as people, places, or organizations.
Chunking – grouping related words (like noun phrases).
Dependency Parsing – determining grammatical relationships between words.

Example combining POS tagging with NER:

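nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')  # resource name expected by recent NLTK releases (around 3.9 and later)
nltk.download('words')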

text = "Arjun works at Google in Bangalore."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Named Entity Recognition (NER)
ner_tree = nltk.ne_chunk(pos_tags)
print(ner_tree)

Expected Output

(S
  (PERSON Arjun/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  (GPE Bangalore/NNP)
  ./.)

Explanation:
The NER chunker uses POS tags to identify named entities such as persons, organizations, and locations. The GPE label stands for geo-political entity, which covers cities, states, and countries.

10. Visualizing POS Tags

You can visualize the structure of a POS-tagged sentence by grouping its tags into chunks with a regular-expression grammar and drawing the resulting tree.

from nltk import RegexpParser

text = "Arjun is studying advanced topics in Machine Learning."

tokens = word_tokenize(text)
tags = pos_tag(tokens)

# Define a simple grammar
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

# Create a parser
parser = RegexpParser(grammar)

# Parse and draw tree
tree = parser.parse(tags)
tree.draw()

Explanation:

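The grammar NP defines a noun phrase as an optional determiner (DT), followed by any number of adjectives (JJ), and ending with one or more noun tags (NN.* matches NN, NNS, NNP, and NNPS).
The draw() function opens an interactive parse tree window, so it needs a graphical environment with Tkinter. Without a display, print(tree) or tree.pretty_print() shows the same structure as text.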
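
To see which word sequences a pattern of this shape would pick out, without opening a window, the grammar can be approximated with Python's re module over a string of tag markers. This is only a rough sketch of the idea and not how RegexpParser is implemented; the tags in the tagged list are assumed for the example sentence, and noun_phrases is a hypothetical helper.

```python
import re

# Tags assumed for the example sentence; a real tagger's output may differ.
tagged = [
    ("Arjun", "NNP"), ("is", "VBZ"), ("studying", "VBG"),
    ("advanced", "JJ"), ("topics", "NNS"), ("in", "IN"),
    ("Machine", "NNP"), ("Learning", "NNP"), (".", "."),
]

# Same shape as the grammar NP: {<DT>?<JJ>*<NN.*>+}, written as a plain
# regex over a string of "<TAG>" markers.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")

def noun_phrases(tagged_tokens):
    """Return the word sequences whose tags match NP_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        start = tag_string[:match.start()].count("<")  # index of the first matched token
        length = match.group().count("<")              # number of tokens matched
        words = [word for word, _ in tagged_tokens[start:start + length]]
        phrases.append(" ".join(words))
    return phrases

print(noun_phrases(tagged))
# ['Arjun', 'advanced topics', 'Machine Learning']
```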

11. Advanced Example: Filtering Sentences by POS Pattern

You can also extract word sequences that follow a specific tag pattern, for example “Adjective + Noun” combinations.

text = "Beautiful flowers bloom gracefully in the garden."

tokens = word_tokenize(text)
tags = pos_tag(tokens)

# Pair each token with its successor and keep adjective-noun sequences
adj_noun_pairs = [(word, next_word)
                  for (word, tag), (next_word, next_tag) in zip(tags, tags[1:])
                  if tag.startswith('JJ') and next_tag.startswith('NN')]

print("Adjective-Noun Pairs:", adj_noun_pairs)

Expected Output

Adjective-Noun Pairs: [('Beautiful', 'flowers')]

Explanation:
This technique helps identify descriptive patterns in text, which can be useful in linguistic analysis or sentiment-based feature extraction.
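
The same idea extends to patterns of any length. The sketch below slides a window over the tagged tokens and keeps the windows whose tags start with the requested prefixes. The find_tag_pattern helper is hypothetical, and the tags in the tagged list are assumed for the example sentence rather than produced by a tagger here.

```python
def find_tag_pattern(tagged_tokens, prefixes):
    """Return tuples of consecutive words whose tags start with the given prefixes."""
    n = len(prefixes)
    if n == 0:
        return []
    matches = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if all(tag.startswith(prefix) for (_, tag), prefix in zip(window, prefixes)):
            matches.append(tuple(word for word, _ in window))
    return matches

# Tags assumed for the example sentence; a real tagger's output may differ.
tagged = [
    ("Beautiful", "JJ"), ("flowers", "NNS"), ("bloom", "VBP"),
    ("gracefully", "RB"), ("in", "IN"), ("the", "DT"),
    ("garden", "NN"), (".", "."),
]

print(find_tag_pattern(tagged, ["JJ", "NN"]))
# [('Beautiful', 'flowers')]
print(find_tag_pattern(tagged, ["IN", "DT", "NN"]))
# [('in', 'the', 'garden')]
```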

12. Common Pitfalls in POS Tagging

Ambiguity of Words:
Many words can serve multiple roles. Example:
- “Book a ticket” (verb) vs “Read a book” (noun)
Context Sensitivity:
Rule-based and statistical taggers usually look at only a few neighboring words, so they may mislabel words whose role depends on the wider sentence.
Domain Differences:
A tagger trained on news text may perform poorly on medical or technical data.
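
The ambiguity problem is easy to reproduce with a deliberately naive baseline. The sketch below is a most-frequent-tag (unigram) tagger written in plain Python; it is not the tagger NLTK uses. The three-sentence training set is hypothetical and hand-labeled for illustration only. Because the baseline ignores context, it gives “book” the same tag in both sentences.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled corpus (hypothetical, for illustration only).
TRAIN = [
    [("Read", "VB"), ("a", "DT"), ("book", "NN")],
    [("Buy", "VB"), ("the", "DT"), ("book", "NN")],
    [("Book", "VB"), ("a", "DT"), ("ticket", "NN")],
]

def train_unigram(sentences):
    """Learn the most frequent tag for each lowercased word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens, default="NN"):
    """Tag each token with its most frequent tag, ignoring context."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

model = train_unigram(TRAIN)

print(tag_unigram(model, ["Read", "a", "book"]))
# [('Read', 'VB'), ('a', 'DT'), ('book', 'NN')]
print(tag_unigram(model, ["Book", "a", "ticket"]))
# [('Book', 'NN'), ('a', 'DT'), ('ticket', 'NN')]  <- "Book" is a verb here
```

Context-aware taggers reduce this kind of error by also using features of the surrounding words.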

13. Summary

Feature	Description
Purpose	Assign grammatical roles (noun, verb, adjective, etc.) to words
Library	NLTK (`pos_tag`)
Model	`averaged_perceptron_tagger`
Tagset	Penn Treebank
Common Uses	Syntax parsing, NER, sentiment analysis
Limitations	Can mislabel ambiguous words and text from unfamiliar domains

14. Complete Verified Example

import nltk
from nltk import word_tokenize, pos_tag

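# Download required resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Recent NLTK releases (around 3.9 and later) look for these names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')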

text = "Arjun is learning Natural Language Processing using Python."

# Tokenize and tag
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Print results
print("Tokens:", tokens)
print("POS Tags:", pos_tags)

Verified Output

Tokens: ['Arjun', 'is', 'learning', 'Natural', 'Language', 'Processing', 'using', 'Python', '.']
POS Tags: [('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
           ('Natural', 'JJ'), ('Language', 'NN'),
           ('Processing', 'NN'), ('using', 'VBG'),
           ('Python', 'NNP'), ('.', '.')]

15. Conclusion

Part-of-Speech (POS) Tagging is a fundamental step in NLP pipelines that helps machines understand the grammatical structure of sentences.
Using NLTK, you can:

Tokenize text efficiently.
Apply accurate POS tagging.
Extract specific grammatical structures or patterns.
Combine it with higher-level NLP tasks like chunking, parsing, and NER.

By integrating POS tagging into your preprocessing pipeline, you give downstream models explicit grammatical features to work with.