
N-Gram Implementation using NLTK

1. Introduction

Before we start implementing N-Grams, let’s first understand what N-Grams are and why they are important in Natural Language Processing (NLP).

In NLP, an N-Gram is a sequence of N consecutive words (or tokens) from a given text. It is a simple yet powerful way to represent and analyze text data based on the local context of words.

For example, given the sentence:

“I love natural language processing”

Here are different forms of N-Grams:

  • Unigrams (1-grams): [“I”, “love”, “natural”, “language”, “processing”]
  • Bigrams (2-grams): [“I love”, “love natural”, “natural language”, “language processing”]
  • Trigrams (3-grams): [“I love natural”, “love natural language”, “natural language processing”]

Each N-Gram captures a small portion of context. For example, “love natural” shows a strong association between those two words, which is not visible when looking at individual words alone.
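
Before reaching for a library, it helps to see how little machinery the idea needs. Here is a minimal pure-Python sketch (no NLTK) that produces the bigrams of the example sentence above:

words = "I love natural language processing".split()
bigrams = [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
print(bigrams)
# ['I love', 'love natural', 'natural language', 'language processing']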

Why use N-Grams?

  • N-Grams are used in text classification, machine translation, sentiment analysis, speech recognition, and language modeling.
  • They help capture word dependencies and contextual meaning — essential for understanding linguistic patterns.

2. Prerequisites

To follow this tutorial, you need:

  • Python 3.x
  • NLTK library (Natural Language Toolkit)

You can install NLTK using:

pip install nltk

Then, download the required resources (such as punkt, which word_tokenize uses for tokenization):

import nltk
nltk.download('punkt')
# On recent NLTK releases you may also need the newer tokenizer tables:
# nltk.download('punkt_tab')

3. Step-by-Step Implementation of N-Grams using NLTK

Let’s now go through each step carefully.

Step 1: Import the required libraries

We will use NLTK’s tokenizer and ngrams utility.

import nltk
from nltk.util import ngrams             # builds n-gram tuples from any sequence of tokens
from nltk.tokenize import word_tokenize  # splits raw text into word tokens

Step 2: Define the input text

We start with a simple example sentence.

text = "Natural Language Processing makes machines understand human language"

Step 3: Tokenize the text

Tokenization means breaking the sentence into individual words (tokens).

tokens = word_tokenize(text)
print(tokens)

Output:

['Natural', 'Language', 'Processing', 'makes', 'machines', 'understand', 'human', 'language']

Now our text is ready for N-Gram generation.


Step 4: Generate N-Grams using nltk.util.ngrams()

Let’s generate bigrams (2-grams) first. Since ngrams() returns a lazy generator, we wrap the result in list() before printing.

bigrams = list(ngrams(tokens, 2))
print(bigrams)

Output:

[('Natural', 'Language'),
 ('Language', 'Processing'),
 ('Processing', 'makes'),
 ('makes', 'machines'),
 ('machines', 'understand'),
 ('understand', 'human'),
 ('human', 'language')]

Similarly, for trigrams (3-grams):

trigrams = list(ngrams(tokens, 3))
print(trigrams)

Output:

[('Natural', 'Language', 'Processing'),
 ('Language', 'Processing', 'makes'),
 ('Processing', 'makes', 'machines'),
 ('makes', 'machines', 'understand'),
 ('machines', 'understand', 'human'),
 ('understand', 'human', 'language')]
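
NLTK also exposes the convenience wrappers nltk.bigrams() and nltk.trigrams(), which are simply shortcuts for ngrams(tokens, 2) and ngrams(tokens, 3):

print(list(nltk.bigrams(tokens)))   # equivalent to ngrams(tokens, 2)
print(list(nltk.trigrams(tokens)))  # equivalent to ngrams(tokens, 3)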

Step 5: Create a function for general N-Gram generation

You can define a function to generate N-Grams dynamically.

def generate_ngrams(text, n):
    tokens = word_tokenize(text)
    n_grams = list(ngrams(tokens, n))
    return n_grams

Now call it for different values of n:

print("Unigrams:", generate_ngrams(text, 1))
print("Bigrams:", generate_ngrams(text, 2))
print("Trigrams:", generate_ngrams(text, 3))

Step 6: Represent N-Grams as strings

Sometimes, we need to join each tuple into a readable string.

def generate_ngrams_as_strings(text, n):
    tokens = word_tokenize(text)
    n_grams = list(ngrams(tokens, n))
    return [' '.join(gram) for gram in n_grams]

print(generate_ngrams_as_strings(text, 2))

Output:

['Natural Language', 'Language Processing', 'Processing makes',
 'makes machines', 'machines understand', 'understand human', 'human language']

Step 7: Count the frequency of each N-Gram

You can also count how often each N-Gram occurs in a corpus using collections.Counter.

from collections import Counter

bigram_freq = Counter(generate_ngrams_as_strings(text, 2))
print(bigram_freq)
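
Output:

Counter({'Natural Language': 1, 'Language Processing': 1, 'Processing makes': 1,
 'makes machines': 1, 'machines understand': 1, 'understand human': 1,
 'human language': 1})

Every bigram in this short sentence occurs exactly once, so all counts are 1.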

If your text contains repeated patterns, this helps identify common N-Grams that may represent meaningful phrases.
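
With text that actually repeats phrases, the counts become informative. A small illustration (the sample string below is made up for demonstration):

sample = "to be or not to be that is the question to be answered"
bigram_freq = Counter(generate_ngrams_as_strings(sample, 2))
print(bigram_freq.most_common(3))
# [('to be', 3), ('be or', 1), ('or not', 1)]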


4. Practical Example: N-Grams from a Paragraph

Let’s apply this to a larger text.

paragraph = """Natural Language Processing (NLP) is a field of Artificial Intelligence.
It enables machines to read, understand, and derive meaning from human languages."""

# Tokenize and create trigrams
tokens = word_tokenize(paragraph)
trigrams = list(ngrams(tokens, 3))

for t in trigrams:
    print(t)
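
Note that word_tokenize keeps punctuation marks such as "(" and "." as separate tokens, and the trigrams above run straight across the boundary between the two sentences. When that matters, a common refinement, sketched here, is to split the text into sentences first with NLTK's sent_tokenize so that no N-Gram spans two sentences:

from nltk.tokenize import sent_tokenize

# Generate trigrams sentence by sentence so they never cross a sentence boundary.
for sentence in sent_tokenize(paragraph):
    for trigram in ngrams(word_tokenize(sentence), 3):
        print(trigram)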

This approach can be extended to entire datasets for text analysis, phrase detection, or feature extraction in machine learning models.