1. Introduction
When working with textual data in Natural Language Processing (NLP), it’s crucial to reduce words to their base or root form. This process ensures that words with similar meanings are treated alike by algorithms.
While stemming reduces words by chopping off suffixes (and sometimes prefixes) using heuristic rules, it often produces non-dictionary stems (like “studying” → “studi”).
Lemmatization, on the other hand, is a linguistically informed process that reduces words to their valid base or dictionary form, known as a lemma.
For example:
- “running”, “ran” → “run”
- “better” → “good”
- “studies”, “studying” → “study”
Thus, lemmatization provides more accurate and meaningful results than stemming.
2. What is Lemmatization?
Lemmatization involves:
- Identifying the part of speech (POS) of a word (noun, verb, adjective, etc.).
- Reducing the word to its lemma — its dictionary form.
Lemmatization uses a vocabulary and morphological analysis of words. For example:
- The words “am”, “are”, “is” → lemma “be”.
- The word “better” → lemma “good”.
3. Setting Up NLTK for Lemmatization
We’ll use the WordNetLemmatizer from NLTK.
WordNet is a large lexical database of English words grouped into sets of synonyms called synsets.
Step 1: Install NLTK
If not already installed:
```bash
pip install nltk
```
Step 2: Import and Download Dependencies
```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Download the necessary data files
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
4. Basic Lemmatization Example
Let’s start with a simple example.
```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

text = "Arjun is running and has eaten his food. He was also studying diligently."
words = word_tokenize(text)

for word in words:
    print(f"{word} → {lemmatizer.lemmatize(word)}")
```
Expected Output
```
Arjun → Arjun
is → is
running → running
and → and
has → ha
eaten → eaten
his → his
food → food
. → .
He → He
was → wa
also → also
studying → studying
diligently → diligently
. → .
```
Observation:
The lemmatizer left most words unchanged, because by default it treats every word as a noun. For example, "running" stays "running" instead of becoming "run", and odd results like "has" → "ha" and "was" → "wa" appear because the trailing "s" is stripped as if it were a plural noun ending.
To handle this correctly, we need to specify the part of speech (POS).
5. Lemmatization with Part of Speech (POS) Tags
To get accurate results, we must provide the correct POS tag to the lemmatizer.
NLTK supports POS tags such as:
- 'n' for nouns
- 'v' for verbs
- 'a' for adjectives
- 'r' for adverbs
Let’s lemmatize again with POS specified as verbs:
```python
print(lemmatizer.lemmatize("running", pos='v'))   # run
print(lemmatizer.lemmatize("studies", pos='v'))   # study
print(lemmatizer.lemmatize("better", pos='a'))    # good
print(lemmatizer.lemmatize("ate", pos='v'))       # eat
```
Expected Output
```
run
study
good
eat
```
Explanation:
Providing POS dramatically improves accuracy:
- "running" → "run" (verb)
- "better" → "good" (adjective)
- "ate" → "eat" (verb)
6. Automating POS Tagging with NLTK
To make lemmatization automatic, we can use POS tagging to detect each word’s part of speech dynamically.
Step 1: Import and Download Tagger
```python
nltk.download('averaged_perceptron_tagger')
```
Step 2: Create a Mapping Function
NLTK’s POS tags differ from WordNet’s format. So, we need a converter function.
```python
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun
```
Step 3: Lemmatize with Automatic POS Detection
```python
from nltk import pos_tag

text = "Arjun was running fast and studying hard for his upcoming exams."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print("Word → POS → Lemma")
print("-" * 40)
for word, tag in pos_tags:
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(f"{word:10} → {tag:10} → {lemma}")
Expected Output
```
Word → POS → Lemma
----------------------------------------
Arjun      → NNP        → Arjun
was        → VBD        → be
running    → VBG        → run
fast       → RB         → fast
and        → CC         → and
studying   → VBG        → study
hard       → RB         → hard
for        → IN         → for
his        → PRP$       → his
upcoming   → JJ         → upcoming
exams      → NNS        → exam
.          → .          → .
```
Explanation:
- "was" → "be"
- "running" → "run"
- "studying" → "study"
- "exams" → "exam"
This shows the lemmatizer successfully identified verbs, adjectives, and plural nouns.
7. Comparing Stemming vs Lemmatization
Let’s compare both approaches using the same text.
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

text = "He was running and studying hard for his exams."
words = word_tokenize(text)

print(f"{'Word':<15}{'Stemmed':<15}{'Lemmatized':<15}")
print("-" * 45)
for word in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos='v')
    print(f"{word:<15}{stem:<15}{lemma:<15}")
```
Expected Output
```
Word           Stemmed        Lemmatized
---------------------------------------------
He             he             He
was            wa             be
running        run            run
and            and            and
studying       studi          study
hard           hard           hard
for            for            for
his            hi             his
exams          exam           exam
.              .              .
```
Observation:
- Stemming produces non-words like "studi", while lemmatization gives valid "study".
- Lemmatization is linguistically accurate but slightly slower.
8. When to Use Lemmatization
Use lemmatization when:
- You need linguistically correct base forms.
- The text will be interpreted by humans (search engines, chatbots).
- You plan to use word embeddings or deep learning models that are sensitive to semantics.
Use stemming when:
- Speed is more important than precision.
- The model is statistical or bag-of-words based, and small inaccuracies don’t affect results.
9. Common Pitfalls
- Ignoring POS tags — without them, most words will not be reduced correctly.
- Performance overhead — lemmatization is slower due to dictionary lookups.
- Language dependency — works primarily for English unless using other WordNet corpora.
10. Summary
| Feature | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based truncation | Dictionary + morphological analysis |
| Output Words | Not always valid | Always valid dictionary words |
| Uses POS | No | Yes |
| Accuracy | Lower | Higher |
| Speed | Fast | Slower |
| Example (“studies”) | “studi” | “study” |
