Learnitweb

Text Preprocessing: Lemmatization using NLTK

1. Introduction

When working with textual data in Natural Language Processing (NLP), it’s crucial to reduce words to their base or root form. This process ensures that words with similar meanings are treated alike by algorithms.

While stemming reduces words by chopping off suffixes (and sometimes prefixes) according to fixed rules, it often produces stems that are not dictionary words (like “studying” → “studi”).
Lemmatization, on the other hand, is a linguistically informed process that reduces words to their valid base or dictionary form, known as a lemma.

For example:

  • “running”, “ran” → “run”
  • “better” → “good”
  • “studies”, “studying” → “study”

Thus, lemmatization provides more accurate and meaningful results than stemming.


2. What is Lemmatization?

Lemmatization involves:

  1. Identifying the part of speech (POS) of a word (noun, verb, adjective, etc.).
  2. Reducing the word to its lemma — its dictionary form.

Lemmatization uses a vocabulary and morphological analysis of words. For example:

  • The words “am”, “are”, “is” → lemma “be”.
  • The word “better” → lemma “good”.

3. Setting Up NLTK for Lemmatization

We’ll use the WordNetLemmatizer from NLTK.
WordNet is a large lexical database of English words grouped into sets of synonyms called synsets.

Step 1: Install NLTK

If not already installed:

pip install nltk

Step 2: Import and Download Dependencies

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Download the necessary data files
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize on newer NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')

4. Basic Lemmatization Example

Let’s start with a simple example.

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

text = "Arjun is running and has eaten his food. He was also studying diligently."

words = word_tokenize(text)

for word in words:
    print(f"{word} → {lemmatizer.lemmatize(word)}")

Expected Output

Arjun → Arjun
is → is
running → running
and → and
has → ha
eaten → eaten
his → his
food → food
. → .
He → He
was → wa
also → also
studying → studying
diligently → diligently
. → .

Observation:
The lemmatizer left most words unchanged because, by default, it treats every word as a noun.
For example, "running" remains "running" instead of "run", and the odd results "ha" and "wa" come from the default noun rule stripping a trailing "s".
To handle this correctly, we need to specify the part of speech (POS).


5. Lemmatization with Part of Speech (POS) Tags

To get accurate results, we must provide the correct POS tag to the lemmatizer.
NLTK supports POS tags such as:

  • 'n' for noun
  • 'v' for verb
  • 'a' for adjective
  • 'r' for adverb

Let’s lemmatize again, this time passing explicit POS tags:

print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("studies", pos='v'))  # study
print(lemmatizer.lemmatize("better", pos='a'))   # good
print(lemmatizer.lemmatize("ate", pos='v'))      # eat

Expected Output

run
study
good
eat

Explanation:
Providing POS dramatically improves accuracy:

  • "running" → "run" (verb)
  • "better" → "good" (adjective)
  • "ate" → "eat" (verb)

6. Automating POS Tagging with NLTK

To make lemmatization automatic, we can use POS tagging to detect each word’s part of speech dynamically.

Step 1: Import and Download Tagger

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # required by pos_tag on newer NLTK releases

Step 2: Create a Mapping Function

NLTK’s POS tags differ from WordNet’s format. So, we need a converter function.

from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

Step 3: Lemmatize with Automatic POS Detection

from nltk import pos_tag

text = "Arjun was running fast and studying hard for his upcoming exams."

tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print("Word → POS → Lemma")
print("-" * 40)
for word, tag in pos_tags:
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(f"{word:10} → {tag:10} → {lemma}")

Expected Output

Word → POS → Lemma
----------------------------------------
Arjun      → NNP        → Arjun
was        → VBD        → be
running    → VBG        → run
fast       → RB         → fast
and        → CC         → and
studying   → VBG        → study
hard       → RB         → hard
for        → IN         → for
his        → PRP$       → his
upcoming   → JJ         → upcoming
exams      → NNS        → exam
.          → .          → .

Explanation:

  • "was" → "be"
  • "running" → "run"
  • "studying" → "study"
  • "exams" → "exam"

This shows that, with POS tags supplied automatically, the lemmatizer correctly handled verbs, adverbs, and plural nouns.


7. Comparing Stemming vs Lemmatization

Let’s compare both approaches using the same text.

from nltk import pos_tag
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

text = "He was running and studying hard for his exams."
words = word_tokenize(text)

print(f"{'Word':<15}{'Stemmed':<15}{'Lemmatized':<15}")
print("-" * 45)
for word, tag in pos_tag(words):
    stem = stemmer.stem(word)
    # Reuse get_wordnet_pos() from section 6 so each word gets the correct POS
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(f"{word:<15}{stem:<15}{lemma:<15}")

Expected Output

Word           Stemmed        Lemmatized    
---------------------------------------------
He             he             He            
was            wa             be            
running        run            run           
and            and            and           
studying       studi          study         
hard           hard           hard          
for            for            for           
his            hi             his           
exams          exam           exam          
.              .              .             

Observation:

  • Stemming produces non-words like "studi", while lemmatization gives valid "study".
  • Lemmatization is linguistically accurate but slightly slower.

8. When to Use Lemmatization

Use lemmatization when:

  • You need linguistically correct base forms.
  • The text will be interpreted by humans (search engines, chatbots).
  • You plan to use word embeddings or deep learning models that are sensitive to semantics.

Use stemming when:

  • Speed is more important than precision.
  • The model is statistical or bag-of-words based, and small inaccuracies don’t affect results.

9. Common Pitfalls

  1. Ignoring POS tags — without them, most words will not be reduced correctly.
  2. Performance overhead — lemmatization is slower due to dictionary lookups.
  3. Language dependency — works primarily for English unless using other WordNet corpora.

10. Summary

Feature              Stemming                  Lemmatization
-------------------  ------------------------  -----------------------------------
Approach             Rule-based truncation     Dictionary + morphological analysis
Output words         Not always valid          Always valid dictionary words
Uses POS             No                        Yes
Accuracy             Lower                     Higher
Speed                Fast                      Slower
Example ("studies")  "studi"                   "study"