1. Introduction
In Natural Language Processing (NLP), Part-of-Speech (POS) Tagging is the process of assigning grammatical labels (like noun, verb, adjective, etc.) to each word in a sentence.
For example, in the sentence:
“Arjun is learning NLP.”
the words can be tagged as:
- Arjun → Noun (proper noun)
- is → Verb (auxiliary verb)
- learning → Verb (present participle)
- NLP → Noun (proper noun)
POS tagging provides important syntactic information that helps NLP systems understand how words relate grammatically, which is essential for:
- Syntactic parsing
- Named Entity Recognition (NER)
- Sentiment analysis
- Information extraction
2. What is POS Tagging?
Definition:
Part-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech based on both its definition and its context.
Example
| Word | POS Tag | Description |
|---|---|---|
| Arjun | NNP | Proper Noun, Singular |
| is | VBZ | Verb, 3rd person singular present |
| learning | VBG | Verb, Gerund/Participle |
| NLP | NNP | Proper Noun, Singular |
The tags NNP, VBZ, and VBG come from the Penn Treebank POS tagset, which is widely used in NLP.
3. Installing and Importing NLTK
To perform POS tagging, we use NLTK (Natural Language Toolkit), one of the most popular Python libraries for text processing.
Step 1: Install NLTK
pip install nltk
Step 2: Import Required Modules and Download Resources
import nltk
from nltk import word_tokenize, pos_tag
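# Download the tokenizer and tagger resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Note: NLTK 3.9 and later look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if you get a LookupError.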
4. Performing POS Tagging in NLTK
Let’s start with a simple example.
text = "Arjun is learning Natural Language Processing using Python."
# Tokenize the sentence
tokens = word_tokenize(text)
# Apply POS tagging
pos_tags = pos_tag(tokens)
print(pos_tags)
Expected Output
[('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NN'),
('using', 'VBG'), ('Python', 'NNP'), ('.', '.')]
Explanation:
- NNP → Proper Noun, Singular
- VBZ → Verb, 3rd person singular present
- VBG → Verb, Gerund/Participle
- JJ → Adjective
- NN → Common Noun, Singular
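
Once a sentence is tagged, a quick way to get a feel for its structure is to count how often each tag occurs. The sketch below does this with the standard library only; the tagged list is copied from the expected output above rather than produced by the tagger, so it runs without NLTK.

```python
from collections import Counter

# Tagged tokens copied from the expected output above.
pos_tags = [
    ("Arjun", "NNP"), ("is", "VBZ"), ("learning", "VBG"),
    ("Natural", "JJ"), ("Language", "NN"), ("Processing", "NN"),
    ("using", "VBG"), ("Python", "NNP"), (".", "."),
]

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in pos_tags)

# Print the tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

In a real pipeline you would pass the result of pos_tag(tokens) in place of the hard-coded list.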
5. POS Tagging for Multiple Sentences
NLTK can handle multiple sentences together. You can tokenize text at both sentence and word levels.
from nltk import sent_tokenize
text = """Arjun is learning Natural Language Processing.
He enjoys solving problems related to text analytics."""
# Sentence tokenization
sentences = sent_tokenize(text)
# Apply word tokenization and POS tagging for each sentence
for sent in sentences:
    tokens = word_tokenize(sent)
    tags = pos_tag(tokens)
    print(tags)
Expected Output
[('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NN'), ('.', '.')]
[('He', 'PRP'), ('enjoys', 'VBZ'), ('solving', 'VBG'),
('problems', 'NNS'), ('related', 'VBN'), ('to', 'TO'),
('text', 'NN'), ('analytics', 'NNS'), ('.', '.')]
Explanation:
- PRP → Personal pronoun
- NNS → Noun, plural
- VBN → Verb, past participle
- TO → The word “to” (as a preposition or infinitive marker)
6. Understanding POS Tags (Penn Treebank Tagset)
The Penn Treebank Tagset defines standard POS labels used by NLTK.
You can view all of them using:
nltk.download('tagsets')
nltk.help.upenn_tagset()
A few common ones are listed below:
| Tag | Meaning | Example |
|---|---|---|
| NN | Noun, singular | car, tree |
| NNS | Noun, plural | cars, trees |
| NNP | Proper noun, singular | Arjun, India |
| VB | Verb, base form | go, eat |
| VBD | Verb, past tense | went, ate |
| VBG | Verb, gerund or present participle | eating, running |
| JJ | Adjective | big, fast |
| RB | Adverb | quickly, silently |
| PRP | Personal pronoun | he, she, they |
| IN | Preposition or subordinating conjunction | in, on, at |
| DT | Determiner | the, an, a |
| CC | Coordinating conjunction | and, or, but |
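
Because related Penn Treebank tags share a prefix (NN, NNS, and NNP all start with NN), it is often convenient to collapse them into coarse categories. The following is a minimal sketch of that idea; the COARSE_CATEGORIES table and the coarse_category helper are hypothetical names introduced here for illustration, not part of NLTK.

```python
# Hypothetical prefix table: maps the start of a Penn Treebank tag
# to a coarse, human-readable category.
COARSE_CATEGORIES = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("IN", "preposition"),
    ("DT", "determiner"),
    ("CC", "conjunction"),
]


def coarse_category(tag):
    """Return a coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, category in COARSE_CATEGORIES:
        if tag.startswith(prefix):
            return category
    return "other"


for tag in ["NNS", "VBD", "JJR", "RBS", "PRP$", "TO"]:
    print(tag, "->", coarse_category(tag))
```

Tags that are not in the table, such as TO or punctuation, fall through to "other"; extend the table if you need finer distinctions.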
7. Example: Extracting Specific POS Tags
You can extract words belonging to a particular grammatical category.
For instance, extracting only nouns or verbs.
Example – Extracting Nouns
text = "Arjun loves playing football and watching science documentaries."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print("Nouns:", nouns)
Expected Output
Nouns: ['Arjun', 'football', 'science', 'documentaries']
Example – Extracting Verbs
verbs = [word for word, tag in pos_tags if tag.startswith('VB')]
print("Verbs:", verbs)
Expected Output
Verbs: ['loves', 'playing', 'watching']
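If you need several categories at once, you can collect them in a single pass instead of writing one list comprehension per category. The sketch below assumes hand-written tags for the sentence above (chosen to be consistent with the expected outputs shown; real tagger output may differ), and group_by_prefix is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written tags for the example sentence (an assumption, not tagger output).
pos_tags = [
    ("Arjun", "NNP"), ("loves", "VBZ"), ("playing", "VBG"),
    ("football", "NN"), ("and", "CC"), ("watching", "VBG"),
    ("science", "NN"), ("documentaries", "NNS"), (".", "."),
]


def group_by_prefix(tagged, prefixes):
    """Collect words whose tag starts with each given prefix, in one pass."""
    groups = defaultdict(list)
    for word, tag in tagged:
        for prefix in prefixes:
            if tag.startswith(prefix):
                groups[prefix].append(word)
                break  # a word belongs to at most one group
    return dict(groups)


groups = group_by_prefix(pos_tags, ["NN", "VB"])
print("Nouns:", groups.get("NN", []))
print("Verbs:", groups.get("VB", []))
```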
8. POS Tagging in Different Languages
NLTK’s default POS tagger is trained on English text.
For other languages, you can use libraries such as spaCy or Stanza, which provide trained models for many languages.
However, for English datasets, NLTK’s averaged_perceptron_tagger is accurate enough for most general-purpose use, though accuracy can drop on text that differs from its training data.
9. Combining POS Tagging with Other NLP Tasks
POS tagging is often combined with:
- Named Entity Recognition (NER) – identifying proper nouns such as people, places, or organizations.
- Chunking – grouping related words (like noun phrases).
- Dependency Parsing – determining grammatical relationships between words.
Example combining POS tagging with NER:
nltk.download('maxent_ne_chunker')
nltk.download('words')
text = "Arjun works at Google in Bangalore."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
# Named Entity Recognition (NER)
ner_tree = nltk.ne_chunk(pos_tags)
print(ner_tree)
Expected Output
(S (PERSON Arjun/NNP) works/VBZ at/IN (ORGANIZATION Google/NNP) in/IN (GPE Bangalore/NNP) ./.)
Explanation:
The NER chunker uses POS tags to identify named entities like persons, organizations, and locations.
10. Visualizing POS Tags
You can visualize the structure of a POS-tagged sentence by chunking it with a regular-expression grammar and drawing the resulting tree.
from nltk import RegexpParser
text = "Arjun is studying advanced topics in Machine Learning."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
# Define a simple grammar
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
# Create a parser
parser = RegexpParser(grammar)
# Parse and draw tree
tree = parser.parse(tags)
tree.draw()
Explanation:
- The grammar NP defines a noun phrase as an optional determiner (DT), followed by any number of adjectives (JJ), and ending with one or more nouns (any tag starting with NN).
- The draw() function opens an interactive parse tree visualization window, so it needs a graphical display.
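
To see what the NP grammar actually matches without opening a window, it can help to apply the same pattern by hand. The sketch below encodes the tag sequence as a string and runs an ordinary regular expression over it, which is roughly the idea behind tag-pattern chunking. It is an illustration under stated assumptions, not NLTK's implementation: the tags are hand-written for the example sentence, and the names NP_PATTERN and index_at are introduced here.

```python
import re

# Hand-written tags for the example sentence (an assumption, not tagger output).
tagged = [
    ("Arjun", "NNP"), ("is", "VBZ"), ("studying", "VBG"),
    ("advanced", "JJ"), ("topics", "NNS"), ("in", "IN"),
    ("Machine", "NNP"), ("Learning", "NNP"), (".", "."),
]

# Encode the tag sequence as one string, e.g. "<NNP><VBZ><VBG>...".
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Map each character offset where a tag begins to its token index,
# plus one extra entry for the end of the string.
index_at = {}
position = 0
for i, (_, tag) in enumerate(tagged):
    index_at[position] = i
    position += len(tag) + 2  # the tag plus its two angle brackets
index_at[position] = len(tagged)

# Same shape as the grammar above: optional DT, any JJ, one or more NN* tags.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")

noun_phrases = []
for match in NP_PATTERN.finditer(tag_string):
    first = index_at[match.start()]
    last = index_at[match.end()]
    noun_phrases.append(" ".join(word for word, _ in tagged[first:last]))

print(noun_phrases)  # ['Arjun', 'advanced topics', 'Machine Learning']
```

For real work, RegexpParser remains the simpler choice; this version only makes the matching step visible.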
11. Advanced Example: Filtering Sentences by POS Pattern
You can also extract word sequences that follow a specific POS pattern, for example “Adjective + Noun” combinations.
text = "Beautiful flowers bloom gracefully in the garden."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
adj_noun_pairs = [(tags[i][0], tags[i+1][0])
                  for i in range(len(tags)-1)
                  if tags[i][1].startswith('JJ') and tags[i+1][1].startswith('NN')]
print("Adjective-Noun Pairs:", adj_noun_pairs)
Expected Output
Adjective-Noun Pairs: [('Beautiful', 'flowers')]
Explanation:
This technique helps identify descriptive patterns in text, which can be useful in linguistic analysis or sentiment-based feature extraction.
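The same idea extends to patterns of any length. The sketch below slides a window over the tagged tokens and keeps the windows whose tags start with the requested prefixes. The find_tag_pattern helper is a hypothetical name, and the tags are hand-written for the example sentence (real tagger output may differ).

```python
def find_tag_pattern(tagged, prefixes):
    """Return word tuples whose consecutive tags start with the given prefixes."""
    size = len(prefixes)
    matches = []
    for start in range(len(tagged) - size + 1):
        window = tagged[start:start + size]
        if all(tag.startswith(prefix)
               for (_, tag), prefix in zip(window, prefixes)):
            matches.append(tuple(word for word, _ in window))
    return matches


# Hand-written tags for the example sentence (an assumption, not tagger output).
tagged = [
    ("Beautiful", "JJ"), ("flowers", "NNS"), ("bloom", "VBP"),
    ("gracefully", "RB"), ("in", "IN"), ("the", "DT"),
    ("garden", "NN"), (".", "."),
]

print(find_tag_pattern(tagged, ["JJ", "NN"]))        # adjective + noun
print(find_tag_pattern(tagged, ["IN", "DT", "NN"]))  # preposition + determiner + noun
```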
12. Common Pitfalls in POS Tagging
- Ambiguity of Words:
Many words can serve multiple roles. Example: “Book a ticket” (verb) vs “Read a book” (noun).
- Context Sensitivity:
Rule-based or statistical taggers may misclassify words without deep context understanding.
- Domain Differences:
A tagger trained on news text may perform poorly on medical or technical data.
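
The ambiguity problem is easiest to see with a deliberately naive tagger. The sketch below assigns each word a single fixed tag from a toy lexicon and ignores context entirely. This is not how NLTK's tagger works; LEXICON and lookup_tag are hypothetical names used only to show why context matters.

```python
# Toy lexicon (hypothetical): one fixed tag per word, ignoring context.
LEXICON = {"book": "NN", "read": "VB", "a": "DT", "ticket": "NN"}


def lookup_tag(tokens):
    """Tag each token from the lexicon; unknown words default to 'NN'."""
    return [(token, LEXICON.get(token.lower(), "NN")) for token in tokens]


# 'Book' is a verb here, but the lookup still returns 'NN'.
print(lookup_tag(["Book", "a", "ticket"]))

# 'book' is a noun here, so the fixed tag happens to be right.
print(lookup_tag(["Read", "a", "book"]))
```

A context-aware tagger avoids this mistake by also considering neighbouring words and tags when choosing a label.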
13. Summary
| Feature | Description |
|---|---|
| Purpose | Assign grammatical roles (noun, verb, adjective, etc.) to words |
| Library | NLTK (pos_tag) |
| Model | averaged_perceptron_tagger |
| Tagset | Penn Treebank |
| Common Uses | Syntax parsing, NER, sentiment analysis |
| Limitations | Struggles with context-dependent meanings |
14. Complete Verified Example
import nltk
from nltk import word_tokenize, pos_tag
# Download required resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "Arjun is learning Natural Language Processing using Python."
# Tokenize and tag
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
# Print results
print("Tokens:", tokens)
print("POS Tags:", pos_tags)
Verified Output
Tokens: ['Arjun', 'is', 'learning', 'Natural', 'Language', 'Processing', 'using', 'Python', '.']
POS Tags: [('Arjun', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'),
('Natural', 'JJ'), ('Language', 'NN'),
('Processing', 'NN'), ('using', 'VBG'),
('Python', 'NNP'), ('.', '.')]
15. Conclusion
Part-of-Speech (POS) Tagging is a fundamental step in NLP pipelines that helps machines understand the grammatical structure of sentences.
Using NLTK, you can:
- Tokenize text efficiently.
- Apply accurate POS tagging.
- Extract specific grammatical structures or patterns.
- Combine it with higher-level NLP tasks like chunking, parsing, and NER.
By integrating POS tagging into your preprocessing pipeline, you give downstream NLP models explicit syntactic information to work with.
