1. Introduction
Before we start implementing N-Grams, let’s first understand what they are and why they matter in Natural Language Processing (NLP).
In NLP, an N-Gram is a sequence of N consecutive words (or tokens) from a given text. It is a simple yet powerful way to represent and analyze text data based on the local context of words.
For example, given the sentence:
“I love natural language processing”
Here are different forms of N-Grams:
- Unigrams (1-grams): [“I”, “love”, “natural”, “language”, “processing”]
- Bigrams (2-grams): [“I love”, “love natural”, “natural language”, “language processing”]
- Trigrams (3-grams): [“I love natural”, “love natural language”, “natural language processing”]
Each N-Gram captures a small window of context. For example, the bigram “love natural” records that these two words appeared next to each other — information that is lost when each word is considered in isolation.
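Conceptually, generating N-Grams is just sliding a window of size N across the token sequence. Before reaching for any library, here is a minimal pure-Python sketch of that idea (the function name ngrams_manual is ours, chosen for illustration):

def ngrams_manual(tokens, n):
    # Slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
print(ngrams_manual(tokens, 2))
# [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]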
Why use N-Grams?
- N-Grams are used in text classification, machine translation, sentiment analysis, speech recognition, and language modeling.
- They help capture word dependencies and contextual meaning, which are essential for modeling linguistic patterns.
2. Prerequisites
To follow this tutorial, you need:
- Python 3.x
- NLTK library (Natural Language Toolkit)
You can install NLTK using:
pip install nltk
Then, download the required resources (such as the punkt tokenizer models used by word_tokenize):
import nltk
nltk.download('punkt')
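Depending on your NLTK version, word_tokenize may also require the newer punkt_tab resource. If you see a LookupError when tokenizing, additionally run:

nltk.download('punkt_tab')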
3. Step-by-Step Implementation of N-Grams using NLTK
Let’s now go through each step carefully.
Step 1: Import the required libraries
We will use NLTK’s tokenizer and ngrams utility.
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
Step 2: Define the input text
We start with a simple example sentence.
text = "Natural Language Processing makes machines understand human language"
Step 3: Tokenize the text
Tokenization means breaking the sentence into individual words (tokens).
tokens = word_tokenize(text)
print(tokens)
Output:
['Natural', 'Language', 'Processing', 'makes', 'machines', 'understand', 'human', 'language']
Now our text is ready for N-Gram generation.
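One detail worth noting: word_tokenize preserves case, so 'Language' and 'language' above are distinct tokens. If you plan to count N-Grams later, you may want to lowercase the text first:

tokens_lower = [t.lower() for t in word_tokenize(text)]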
Step 4: Generate N-Grams using nltk.util.ngrams()
Let’s generate bigrams (2-grams) first.
bigrams = list(ngrams(tokens, 2))
print(bigrams)
Output:
[('Natural', 'Language'),
('Language', 'Processing'),
('Processing', 'makes'),
('makes', 'machines'),
('machines', 'understand'),
('understand', 'human'),
('human', 'language')]
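Note that nltk.util.ngrams() returns a generator, which is why we wrap it in list() before printing; a generator can only be iterated over once.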
Similarly, for trigrams (3-grams):
trigrams = list(ngrams(tokens, 3))
print(trigrams)
Output:
[('Natural', 'Language', 'Processing'),
('Language', 'Processing', 'makes'),
('Processing', 'makes', 'machines'),
('makes', 'machines', 'understand'),
('machines', 'understand', 'human'),
('understand', 'human', 'language')]
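nltk.util.ngrams() also accepts optional padding arguments, which is handy in language modeling when you want N-Grams that mark sentence boundaries. A small sketch, where the <s> and </s> symbols are just conventional placeholders we chose:

padded_bigrams = list(ngrams(tokens, 2, pad_left=True, pad_right=True,
                             left_pad_symbol='<s>', right_pad_symbol='</s>'))
print(padded_bigrams[0], padded_bigrams[-1])

Output:

('<s>', 'Natural') ('language', '</s>')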
Step 5: Create a function for general N-Gram generation
You can define a function that generates N-Grams for any value of n.
def generate_ngrams(text, n):
    tokens = word_tokenize(text)
    n_grams = list(ngrams(tokens, n))
    return n_grams
Now call it for different values of n:
print("Unigrams:", generate_ngrams(text, 1))
print("Bigrams:", generate_ngrams(text, 2))
print("Trigrams:", generate_ngrams(text, 3))
Step 6: Represent N-Grams as strings
Sometimes it is more convenient to join each tuple into a single readable string.
def generate_ngrams_as_strings(text, n):
    tokens = word_tokenize(text)
    n_grams = list(ngrams(tokens, n))
    return [' '.join(gram) for gram in n_grams]
print(generate_ngrams_as_strings(text, 2))
Output:
['Natural Language', 'Language Processing', 'Processing makes', 'makes machines', 'machines understand', 'understand human', 'human language']
Step 7: Count the frequency of each N-Gram
You can also count how often each N-Gram occurs in a corpus using collections.Counter.
from collections import Counter

bigram_freq = Counter(generate_ngrams_as_strings(text, 2))
print(bigram_freq)
If your text contains repeated patterns, this helps identify common N-Grams that may represent meaningful phrases.
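For a single short sentence every bigram occurs only once, so a more repetitive example makes this clearer (the sample sentence below is ours):

sample = "to be or not to be"
freq = Counter(generate_ngrams_as_strings(sample, 2))
print(freq.most_common(2))

Output:

[('to be', 2), ('be or', 1)]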
4. Practical Example: N-Grams from a Paragraph
Let’s apply this to a larger text.
paragraph = """Natural Language Processing (NLP) is a field of Artificial Intelligence.
It enables machines to read, understand, and derive meaning from human languages."""
# Tokenize and create trigrams
tokens = word_tokenize(paragraph)
trigrams = list(ngrams(tokens, 3))
for t in trigrams:
    print(t)
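Note that word_tokenize also emits punctuation tokens such as '(' and '.', which end up inside the trigrams. For many analyses you will want to filter those out first; a minimal sketch:

# Keep only tokens that contain at least one alphanumeric character
word_tokens = [t for t in word_tokenize(paragraph) if any(c.isalnum() for c in t)]
clean_trigrams = list(ngrams(word_tokens, 3))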
This approach can be extended to entire datasets for text analysis, phrase detection, or feature extraction in machine learning models.
