1. Introduction
In Natural Language Processing (NLP), text preprocessing is a crucial step before feeding data into machine learning or deep learning models. One important part of preprocessing is stemming, which is the process of reducing words to their root or base form.
For example:
- “running” and “runs” → “run”
- “studies” and “studied” → “studi”
Notice that the stemmed form may not always be a valid word in English, but it represents a common base form that helps algorithms treat related words as the same. Also note that irregular forms such as “ran” are not reduced to “run”: stemmers only strip affixes by rule, so they cannot handle irregular inflection.
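These reductions can be reproduced with NLTK's Porter stemmer, which is set up and explained in the sections below; here is a minimal preview:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "studies", "studied"]:
    # Each word is reduced to its rule-based root form.
    print(word, "->", stemmer.stem(word))
```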
2. What is Stemming?
Stemming is a rule-based process of removing suffixes (and sometimes prefixes) from words to obtain their root form.
It does not necessarily produce linguistically correct words, but it is efficient and often sufficient for many NLP tasks such as:
- Text classification
- Sentiment analysis
- Information retrieval
- Keyword extraction
Example:
| Original Word | Stemmed Form |
|---|---|
| Connection | Connect |
| Connections | Connect |
| Connected | Connect |
| Connecting | Connect |
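The table above can be verified with NLTK's Porter stemmer (covered in section 4). Note that the stemmer lowercases its input by default, so the stems come back as "connect" rather than "Connect":

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connection", "connections", "connected", "connecting"]
for w in words:
    # Every inflected form collapses to the same root, "connect".
    print(w, "->", stemmer.stem(w))
```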
3. Setting Up NLTK for Stemming
We’ll use the Natural Language Toolkit (NLTK) — a popular Python library for text processing.
Step 1: Install NLTK
If you haven’t installed it already:
pip install nltk
Step 2: Import Required Modules
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
Step 3: Download Tokenizer Data
The word tokenizer requires the punkt package (newer NLTK releases also need punkt_tab):
nltk.download('punkt')
nltk.download('punkt_tab')
4. Understanding the Porter Stemmer
The Porter Stemmer is one of the oldest and most widely used stemming algorithms. It applies a sequence of rule-based transformations to words.
Let’s try it out:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
text = "Arjun was running and eating at the same time. He has been studying hard for his exams."
words = word_tokenize(text)
for word in words:
print(f"{word} → {stemmer.stem(word)}")
Expected Output
Arjun → arjun
was → wa
running → run
and → and
eating → eat
at → at
the → the
same → same
time → time
. → .
He → he
has → ha
been → been
studying → studi
hard → hard
for → for
his → hi
exams → exam
. → .
Explanation:
- The stemmer converts "running" to "run", "eating" to "eat", and "studying" to "studi".
- Notice that "studi" is not an actual English word — this is a limitation of stemming.
5. Using the Lancaster Stemmer
NLTK also provides the Lancaster Stemmer, which is more aggressive — it sometimes removes too many characters and can produce very short stems.
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
words = ["running", "runs", "easily", "fairly"]
for word in words:
print(f"{word} → {stemmer.stem(word)}")
Expected Output
running → run
runs → run
easily → easy
fairly → fair
Observation:
Lancaster performs deeper reductions, sometimes too aggressively. For instance, "maximum" might become "maxim", and "computerization" becomes "comput".
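The aggressive behaviour described above is easy to check by stemming the same words with both algorithms side by side:

```python
from nltk.stem import LancasterStemmer, PorterStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()
for w in ["maximum", "computerization"]:
    # Lancaster typically produces shorter stems than Porter.
    print(f"{w}: Porter={porter.stem(w)}, Lancaster={lancaster.stem(w)}")
```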
6. Using the Snowball Stemmer
The Snowball Stemmer (also called the Porter2 stemmer) is an improved version of the Porter algorithm. It supports multiple languages and is slightly less aggressive than Lancaster.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
words = ["running", "runs", "easily", "fairly", "studies", "studying"]
for word in words:
print(f"{word} → {stemmer.stem(word)}")
Expected Output
running → run
runs → run
easily → easili
fairly → fair
studies → studi
studying → studi
Observation:
"easily"→"easili"(Snowball maintains some suffix structure)."studies"and"studying"→"studi", showing consistent stemming.
The Snowball stemmer is generally preferred because it offers a balance between accuracy and aggressiveness.
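Snowball's multilingual support can be inspected through the languages attribute of SnowballStemmer; for example:

```python
from nltk.stem import SnowballStemmer

# Show which languages Snowball supports.
print(SnowballStemmer.languages)

# Instantiate a stemmer for a non-English language.
german = SnowballStemmer("german")
print(german.stem("Bücher"))
```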
7. Comparing Different Stemmers
Let’s compare the three stemmers side by side:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
words = ["compute", "computer", "computing", "computed", "computation"]
print(f"{'Word':<15}{'Porter':<15}{'Lancaster':<15}{'Snowball':<15}")
print("-" * 60)
for w in words:
print(f"{w:<15}{porter.stem(w):<15}{lancaster.stem(w):<15}{snowball.stem(w):<15}")
Expected Output
Word           Porter         Lancaster      Snowball
------------------------------------------------------------
compute        comput         comput         comput
computer       comput         comput         comput
computing      comput         comput         comput
computed       comput         comput         comput
computation    comput         comput         comput
All three produce "comput" — but in other examples, you might notice slight differences in aggressiveness.
8. When to Use Stemming
You should use stemming when:
- The dataset is large and performance is a priority.
- Small differences in word forms do not affect model meaning.
- You need quick normalization of words before building features like TF-IDF or Bag of Words.
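As a quick sketch of the vocabulary reduction stemming buys before building Bag of Words or TF-IDF features (the token list below is made up for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["connect", "connected", "connecting", "connection",
          "run", "running", "runs", "study", "studies", "studying"]
stems = [stemmer.stem(t) for t in tokens]

# Ten surface forms collapse to just three stems.
print("raw vocabulary size:    ", len(set(tokens)))  # 10
print("stemmed vocabulary size:", len(set(stems)))   # 3
```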
However, for applications where linguistic correctness matters (such as translation or search engines), lemmatization, which maps words to valid dictionary forms, is usually preferred.
9. Common Pitfalls of Stemming
- Over-stemming: different words with distinct meanings may reduce to the same root. Example: “universe” and “university” → “univers”.
- Under-stemming: related words may not reduce to the same base. Example: “analysis” and “analyst” remain different.
- Not language-aware: stemmers are language-specific. Using an English stemmer on another language produces incorrect results.
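Both the over-stemming and under-stemming pitfalls can be demonstrated directly with the Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: distinct words collapse to the same root.
print(stemmer.stem("universe"), stemmer.stem("university"))  # both "univers"

# Under-stemming: related words keep different stems.
print(stemmer.stem("analysis"), stemmer.stem("analyst"))
```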
10. Summary
| Stemmer Type | Characteristics | Aggressiveness | Suitable For |
|---|---|---|---|
| Porter | Most common, balanced | Medium | General NLP tasks |
| Lancaster | Simple but aggressive | High | Small datasets |
| Snowball | Advanced and multilingual | Medium-Low | Modern applications |
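To make it easy to experiment with the three stemmers from the table, a small factory helper can be used (get_stemmer is a hypothetical name, not part of NLTK):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

def get_stemmer(name: str):
    """Return a stemmer instance by name. Hypothetical helper,
    not part of NLTK itself."""
    if name == "porter":
        return PorterStemmer()
    if name == "lancaster":
        return LancasterStemmer()
    if name == "snowball":
        return SnowballStemmer("english")
    raise ValueError(f"unknown stemmer: {name}")

for name in ["porter", "lancaster", "snowball"]:
    print(name, get_stemmer(name).stem("running"))
```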
11. Final Thoughts
Stemming is a powerful normalization technique that simplifies word variations into a common base form, improving the performance of text-based models.
While it may not always produce valid words, it efficiently reduces vocabulary size and improves computational performance.
