Text Preprocessing: Stopwords Removal using NLTK

1. Introduction

When working with text data, not all words contribute equally to understanding the meaning of a sentence.
Words like “is”, “the”, “in”, “at”, “of”, “an”, and “and” occur very frequently in text but usually do not carry important semantic meaning for NLP tasks such as classification, sentiment analysis, or topic modeling.

These words are called stopwords.

Stopwords are common words that are usually removed during the text preprocessing phase to:

  • Reduce noise in the text data.
  • Decrease the dimensionality of feature space.
  • Improve model efficiency and performance.

2. What are Stopwords?

Stopwords are words that occur so frequently in a language that they carry very little useful information for text analysis.

Examples in English:

['i', 'me', 'my', 'we', 'our', 'you', 'your', 'he', 'she', 'it', 
 'is', 'was', 'were', 'be', 'been', 'am', 'are', 'do', 'does', 
 'the', 'a', 'an', 'in', 'on', 'at', 'and', 'but', 'if', 'or']

In many NLP pipelines, stopwords are removed so that models can focus on more informative words.


3. Setting Up NLTK for Stopwords

The Natural Language Toolkit (NLTK) provides a pre-defined list of stopwords for several languages, including English, French, German, and Spanish.

Step 1: Install and Import NLTK

If not installed:

pip install nltk

Step 2: Import Required Modules and Download Stopwords

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')   # tokenizer models; NLTK 3.9+ may also need 'punkt_tab'

4. Viewing the Stopword List in NLTK

You can easily view the available stopwords for English or any other supported language.

stop_words = set(stopwords.words('english'))
print(f"Total stopwords in English: {len(stop_words)}")
print("First 20 stopwords:", list(stop_words)[:20])

Expected Output

Total stopwords in English: 179
First 20 stopwords: ['couldn', 'this', 'most', 'myself', 'all', 
                     'nor', 'further', 'their', 'while', 'against', 
                     'yours', 'theirs', 'too', 'ours', 'so', 'has', 
                     'down', 'before', 'between', 'y']

Explanation:

  • NLTK currently ships 179 English stopwords (the exact count can vary between corpus versions).
  • stopwords.words('english') returns a plain list; wrapping it in a set gives fast membership lookups.
  • Because a set is unordered, the exact words shown in the "first 20" will differ between runs.

5. Basic Stopword Removal Example

Let’s take a simple text and remove all stopwords.

text = "Arjun is learning Natural Language Processing and building powerful NLP applications."

# Tokenize the text
words = word_tokenize(text)

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original Words:", words)
print("Filtered Words:", filtered_words)

Expected Output

Original Words: ['Arjun', 'is', 'learning', 'Natural', 'Language', 
                 'Processing', 'and', 'building', 'powerful', 
                 'NLP', 'applications', '.']

Filtered Words: ['Arjun', 'learning', 'Natural', 'Language', 
                 'Processing', 'building', 'powerful', 'NLP', 
                 'applications', '.']

Explanation:

  • Words like “is” and “and” are removed.
  • Important words like “Arjun”, “learning”, and “NLP” are retained.
  • This helps focus on meaningful tokens that contribute to understanding.

6. Removing Stopwords from a Paragraph

You can easily extend this logic to handle longer text or multiple sentences.

paragraph = """
Arjun is learning Natural Language Processing. 
He wants to master text preprocessing techniques 
like tokenization, stopword removal, and lemmatization.
"""

tokens = word_tokenize(paragraph)
filtered_paragraph = [word for word in tokens if word.lower() not in stop_words]

print("Filtered Text:", " ".join(filtered_paragraph))

Expected Output

Filtered Text: Arjun learning Natural Language Processing .
wants master text preprocessing techniques like tokenization ,
stopword removal , lemmatization .

Explanation:

  • The stopwords (“is”, “he”, “to”, “and”, etc.) are removed; note that “He” is filtered too, because each token is lowercased before the lookup.
  • Important keywords are retained for later stages of processing (like stemming or lemmatization).

7. Handling Punctuation and Case Sensitivity

Stopword removal is case-sensitive by default, and punctuation marks are not considered stopwords.

To handle both:

  1. Convert all tokens to lowercase.
  2. Remove punctuation.

import string

text = "This is an Example, showing how Stopword Removal works in NLP!"

# Convert to lowercase and tokenize
tokens = word_tokenize(text.lower())

# Remove punctuation and stopwords
filtered = [word for word in tokens 
            if word not in stop_words and word not in string.punctuation]

print("Filtered Tokens:", filtered)

Expected Output

Filtered Tokens: ['example', 'showing', 'stopword', 'removal', 'works', 'nlp']

Explanation:

  • The text was converted to lowercase for consistency.
  • Punctuation like “,” and “!” was removed.
  • Stopwords such as “this”, “is”, “an”, “how”, and “in” were filtered out.

8. Removing Stopwords in Multiple Languages

NLTK supports stopwords for over 20 languages, including:

  • 'english', 'french', 'german', 'spanish', 'italian', 'portuguese', etc.
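
You can check exactly which languages your installed corpus provides by listing its file ids; each file id is a language name accepted by stopwords.words(). The exact list depends on your NLTK data version.

from nltk.corpus import stopwords

# Each file id corresponds to one supported language
print(stopwords.fileids())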

Let’s see an example in French.

french_stopwords = set(stopwords.words('french'))

text_fr = "Ceci est un exemple montrant comment supprimer les mots vides en français."

tokens_fr = word_tokenize(text_fr.lower())
filtered_fr = [word for word in tokens_fr if word not in french_stopwords]

print("Filtered French Tokens:", filtered_fr)

Expected Output

Filtered French Tokens: ['ceci', 'exemple', 'montrant', 'comment', 'supprimer', 'mots', 'vides', 'français', '.']

Explanation:
The stopwords “est”, “un”, “les”, and “en” were removed automatically. Note that “ceci” and “comment” survive because they are not in NLTK’s French list, a reminder that stopword lists are language- and version-specific.


9. Creating a Custom Stopword List

Sometimes you may want to:

  • Add your own domain-specific stopwords (e.g., “data”, “information”).
  • Remove certain stopwords from the default list.

Example: Extending Stopword List

custom_stopwords = stop_words.union({'nlp', 'processing', 'data'})

text = "Arjun is working with NLP data and learning new processing techniques."

tokens = word_tokenize(text.lower())
filtered_custom = [word for word in tokens if word not in custom_stopwords and word not in string.punctuation]

print("Filtered Tokens with Custom Stopwords:", filtered_custom)

Expected Output

Filtered Tokens with Custom Stopwords: ['arjun', 'working', 'learning', 'new', 'techniques']

Explanation:
Custom stopwords (“nlp”, “processing”, “data”) are also removed.
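
The reverse direction works the same way: subtract words from the default set to keep tokens (such as negations) that the standard list would drop. A minimal sketch, reusing stop_words and the imports from the earlier examples:

# Keep negation words that the default list would otherwise remove
reduced_stopwords = stop_words - {'not', 'no', 'never'}

tokens = word_tokenize("The movie was not good.".lower())
filtered_reduced = [word for word in tokens
                    if word not in reduced_stopwords and word not in string.punctuation]

print("Filtered Tokens with Reduced Stopwords:", filtered_reduced)

Expected Output

Filtered Tokens with Reduced Stopwords: ['movie', 'not', 'good']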


10. Common Pitfalls in Stopword Removal

  1. Removing Too Many Words:
    Some stopwords like “not”, “no”, or “never” can be crucial for tasks like sentiment analysis.
    Example: “I am not happy” → removing “not” changes the meaning entirely (see the sketch after this list).
  2. Language Mismatch:
    Always ensure the stopword list matches the language of your text.
  3. Case Sensitivity:
    Convert text to lowercase before filtering stopwords.
  4. Punctuation and Numbers:
    NLTK’s stopwords list doesn’t handle punctuation or numbers — they must be removed separately.
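
A short sketch of the first pitfall, reusing the stop_words set from earlier: the negation is treated as a stopword and silently dropped.

tokens = word_tokenize("I am not happy".lower())
print([word for word in tokens if word not in stop_words])
# ['happy'] -- the negation is gone and the sentiment flips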

11. When to Use Stopword Removal

  • Use it: when building bag-of-words, TF-IDF, or word frequency models to reduce feature size.
  • Avoid it: when using models sensitive to context or negation (such as LSTMs or Transformer models like BERT), which rely on the full word sequence.
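
To illustrate the feature-size effect, here is a sketch that compares vocabulary sizes with and without the NLTK stopword list, assuming scikit-learn is installed (it is not used elsewhere in this tutorial):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Arjun is learning NLP",
        "He is building an NLP application"]

cv_all = CountVectorizer()                                   # keeps every word
cv_filtered = CountVectorizer(stop_words=list(stop_words))   # drops stopwords

cv_all.fit(docs)
cv_filtered.fit(docs)

print("Features without removal:", len(cv_all.vocabulary_))   # 8
print("Features with removal:", len(cv_filtered.vocabulary_)) # 5

scikit-learn may emit a UserWarning because a few NLTK stopwords (such as “aren't”) do not match its default tokenization; the comparison still holds.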

12. Summary

Feature             | Description
--------------------|----------------------------------------------------
Purpose             | Remove common words that add little meaning
Examples            | “is”, “the”, “in”, “and”
Library             | nltk.corpus.stopwords
Supported Languages | 20+
Common Pitfall      | Removing words that affect sentiment (“not”, “no”)
Benefit             | Reduces noise, improves model performance

13. Complete Verified Code Example

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')   # NLTK 3.9+ may also need 'punkt_tab'

# Load stopwords
stop_words = set(stopwords.words('english'))

# Sample text
text = "Arjun is learning Natural Language Processing and building NLP models using Python."

# Tokenize and preprocess
tokens = word_tokenize(text.lower())
filtered_tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]

print("Original Text:", text)
print("Filtered Tokens:", filtered_tokens)

Verified Output

Original Text: Arjun is learning Natural Language Processing and building NLP models using Python.
Filtered Tokens: ['arjun', 'learning', 'natural', 'language', 'processing', 'building', 'nlp', 'models', 'using', 'python']

14. Conclusion

Stopword removal is a fundamental step in text preprocessing that simplifies your dataset by removing frequently occurring but less informative words.
Using NLTK, you can:

  • Quickly load predefined stopword lists.
  • Customize them for your application.
  • Handle multilingual datasets effectively.

By combining stopword removal with tokenization, lemmatization, and stemming, you create a clean and efficient dataset ready for deeper NLP analysis and modeling.