Text Preprocessing: Stemming using NLTK

1. Introduction

In Natural Language Processing (NLP), text preprocessing is a crucial step before feeding data into machine learning or deep learning models. One important part of preprocessing is stemming, which is the process of reducing words to their root or base form.

For example:

  • “running”, “runs”, and “ran” → “run”
  • “studies”, “studied” → “studi”

Notice that the stemmed form may not always be a valid word in English, but it represents a common base form that helps algorithms treat related words as the same.


2. What is Stemming?

Stemming is a rule-based process of removing suffixes (and sometimes prefixes) from words to obtain their root form.
It does not necessarily produce linguistically correct words, but it is efficient and often sufficient for many NLP tasks such as:

  • Text classification
  • Sentiment analysis
  • Information retrieval
  • Keyword extraction

Example:

Original Word    Stemmed Form
Connection       Connect
Connections      Connect
Connected        Connect
Connecting       Connect
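
A quick sketch reproducing this table with NLTK's PorterStemmer (set up in the next section):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# PorterStemmer lowercases by default, so stems come out in lowercase
for word in ["connection", "connections", "connected", "connecting"]:
    print(word, "→", stemmer.stem(word))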

3. Setting Up NLTK for Stemming

We’ll use the Natural Language Toolkit (NLTK) — a popular Python library for text processing.

Step 1: Install NLTK

If you haven’t installed it already:

pip install nltk

Step 2: Import Required Modules

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

Step 3: Download Tokenizer Data

The word tokenizer requires the punkt package:

nltk.download('punkt')
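
Note: recent NLTK releases also ship the tokenizer tables as a separate punkt_tab resource. If word_tokenize raises a LookupError mentioning punkt_tab, download it as well:

nltk.download('punkt_tab')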

4. Understanding the Porter Stemmer

The Porter Stemmer is one of the oldest and most widely used stemming algorithms. It applies a sequence of rule-based transformations to words.

Let’s try it out:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
text = "Arjun was running and eating at the same time. He has been studying hard for his exams."

words = word_tokenize(text)

for word in words:
    print(f"{word} → {stemmer.stem(word)}")

Expected Output

Arjun → arjun
was → wa
running → run
and → and
eating → eat
at → at
the → the
same → same
time → time
. → .
He → he
has → ha
been → been
studying → studi
hard → hard
for → for
his → hi
exams → exam
. → .

Explanation:

  • The stemmer converts "running" to "run", "eating" to "eat", and "studying" to "studi".
  • Notice "studi" is not an actual English word — this is a limitation of stemming.
  • The stemmer also lowercases tokens by default, which is why "Arjun" becomes "arjun" and "He" becomes "he".
  • Porter's suffix rules apply even to short function words, so "was", "has", and "his" lose their trailing "s" and become "wa", "ha", and "hi".

5. Using the Lancaster Stemmer

NLTK also provides the Lancaster Stemmer, which is more aggressive — it sometimes removes too many characters and can produce very short stems.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ["running", "runs", "easily", "fairly"]

for word in words:
    print(f"{word} → {stemmer.stem(word)}")

Expected Output

running → run
runs → run
easily → easy
fairly → fair

Observation:
Lancaster performs deeper reductions, sometimes too aggressively. For instance, "maximum" might become "maxim", and "computerization" becomes "comput".
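
You can verify such cases with a quick side-by-side comparison against Porter:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Compare how far each stemmer cuts the same words
for w in ["maximum", "computerization"]:
    print(f"{w} | Porter: {porter.stem(w)} | Lancaster: {lancaster.stem(w)}")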


6. Using the Snowball Stemmer

The Snowball Stemmer (also called the Porter2 stemmer) is an improved version of the Porter algorithm. It supports multiple languages and is slightly less aggressive than Lancaster.

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ["running", "runs", "easily", "fairly", "studies", "studying"]

for word in words:
    print(f"{word} → {stemmer.stem(word)}")

Expected Output

running → run
runs → run
easily → easili
fairly → fair
studies → studi
studying → studi

Observation:

  • "easily""easili" (Snowball maintains some suffix structure).
  • "studies" and "studying""studi", showing consistent stemming.

The Snowball stemmer is generally preferred because it offers a balance between accuracy and aggressiveness.
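
Because Snowball is multilingual, you can list the languages NLTK bundles and build a stemmer for any of them. A small sketch (the German words here are just an illustration):

from nltk.stem import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# Build a German stemmer and stem a few inflected forms of "laufen" (to run)
german = SnowballStemmer("german")
for word in ["laufen", "läuft", "gelaufen"]:
    print(word, "→", german.stem(word))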


7. Comparing Different Stemmers

Let’s compare the three stemmers side by side:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

words = ["compute", "computer", "computing", "computed", "computation"]

print(f"{'Word':<15}{'Porter':<15}{'Lancaster':<15}{'Snowball':<15}")
print("-" * 60)
for w in words:
    print(f"{w:<15}{porter.stem(w):<15}{lancaster.stem(w):<15}{snowball.stem(w):<15}")

Expected Output

Word           Porter         Lancaster      Snowball       
------------------------------------------------------------
compute        comput         comput         comput         
computer       comput         comput         comput         
computing      comput         comput         comput         
computed       comput         comput         comput         
computation    comput         comput         comput         

All three produce "comput" — but in other examples you might notice differences in aggressiveness. For instance, Porter stems "fairly" to "fairli", while Lancaster and Snowball both give "fair" (as seen in the earlier outputs).


8. When to Use Stemming

You should use stemming when:

  • The dataset is large and performance is a priority.
  • Small differences in word forms do not affect model meaning.
  • You need quick normalization of words before building features like TF-IDF or Bag of Words (see the sketch below).
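
For example, here is a minimal sketch of stemming as a TF-IDF preprocessing step. It assumes scikit-learn is installed, and the stem_tokens helper is our own illustration, not part of NLTK:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Tokenize the text, then stem each token
    return [stemmer.stem(token) for token in word_tokenize(text)]

docs = [
    "He is running and she runs daily.",
    "They studied while he was studying.",
]

# Plug the stemming tokenizer into the vectorizer;
# token_pattern=None silences the "tokenizer overrides token_pattern" warning
vectorizer = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())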

However, for applications where linguistic correctness matters (like translation or search engines), lemmatization is preferred instead.


9. Common Pitfalls of Stemming

  1. Over-stemming:
    Different words with distinct meanings may reduce to the same root.
    Example: “universe” and “university” → “univers” (see the snippet after this list).
  2. Under-stemming:
    Related words may not reduce to the same base.
    Example: “analysis” and “analyst” remain different.
  3. Not language-aware:
    Stemmers are language-specific. Using an English stemmer on another language produces incorrect results.
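
Both kinds of mis-stemming are easy to reproduce with the Porter stemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: distinct meanings collapse into one stem
print(stemmer.stem("universe"), stemmer.stem("university"))

# Under-stemming: related words keep different stems
print(stemmer.stem("analysis"), stemmer.stem("analyst"))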

10. Summary

Stemmer Type    Characteristics               Aggressiveness    Suitable For
Porter          Most common, balanced         Medium            General NLP tasks
Lancaster       Simple but aggressive         High              Small datasets
Snowball        Advanced and multilingual     Medium-Low        Modern applications

11. Final Thoughts

Stemming is a simple yet effective normalization technique that collapses word variations into a common base form.
While it may not always produce valid English words, it shrinks the vocabulary, which speeds up and often improves text-based models.