
Bag of Words (BoW) Implementation Using NLTK

1. Introduction

Bag of Words (BoW) is a simple and widely used text representation technique in Natural Language Processing (NLP). It transforms text documents into numerical feature vectors by counting the occurrences of words.

The name “Bag of Words” comes from the idea that the text is treated as a bag of words, ignoring grammar, word order, and sentence structure. Only the word frequencies matter.
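
For example, two sentences that use the same words in a different order produce exactly the same bag, as this small plain-Python check shows:

from collections import Counter

a = "the cat chased the dog".split()
b = "the dog chased the cat".split()
print(Counter(a) == Counter(b))  # True: identical word counts, so identical BoW vectors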

This representation is often used as input for algorithms like Naive Bayes, Logistic Regression, or Support Vector Machines (SVM) for text classification.
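
As an illustration of that downstream use, here is a minimal sketch that feeds a tiny BoW matrix to scikit-learn's MultinomialNB classifier. Note that scikit-learn is an extra dependency not used elsewhere in this tutorial, and the counts and labels below are made up for demonstration:

from sklearn.naive_bayes import MultinomialNB

# Hypothetical BoW matrix: 3 documents x 4 vocabulary words
X = [[2, 1, 0, 0],
     [0, 0, 2, 1],
     [1, 1, 1, 0]]
y = ["pets", "vehicles", "pets"]  # made-up class labels

clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict([[1, 1, 0, 0]]))  # classify a new BoW vector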


2. Intuition Behind Bag of Words

Imagine you have three short sentences:

  1. “The cat sat on the mat.”
  2. “The dog sat on the log.”
  3. “The cat chased the dog.”

Now, we lowercase the text and build a vocabulary from all unique words:
["the", "cat", "sat", "on", "mat", "dog", "log", "chased"]

Each document (sentence) can then be represented as a vector counting the occurrence of each word:

Word      Sentence 1   Sentence 2   Sentence 3
the           2            2            2
cat           1            0            1
dog           0            1            1
sat           1            1            0
on            1            1            0
mat           1            0            0
log           0            1            0
chased        0            0            1

This table is the Bag of Words matrix. Each row represents a word, and each column represents a document. (The implementation below builds the transpose: each row is a document and each column is a word.)
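
As a quick sanity check, the Sentence 1 column of this table can be reproduced in a few lines of plain Python (no NLTK needed yet):

vocab = ["the", "cat", "sat", "on", "mat", "dog", "log", "chased"]
tokens = "The cat sat on the mat.".lower().replace(".", "").split()
print([tokens.count(word) for word in vocab])  # [2, 1, 1, 1, 1, 0, 0, 0]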


3. Steps to Implement Bag of Words Using NLTK

Let’s now implement this concept using Python’s NLTK library.


3.1 Import Required Libraries

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import pandas as pd

Explanation:

  • nltk – provides natural language processing tools.
  • word_tokenize() – splits sentences into words.
  • stopwords – common words like “is”, “and”, “the” that can be removed.
  • string – for punctuation handling.
  • pandas – to display the Bag of Words as a matrix.

3.2 Download Required NLTK Resources

nltk.download('punkt')       # tokenizer models used by word_tokenize()
nltk.download('stopwords')   # stopword lists, including English

Note: on recent NLTK releases, you may also need nltk.download('punkt_tab') for word_tokenize() to work.
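
With these resources downloaded, a quick check shows what word_tokenize() actually returns. The trailing period comes back as its own token, which is why punctuation removal is a separate preprocessing step below:

print(word_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']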

3.3 Prepare Sample Data

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog."
]

3.4 Preprocess Text

Text preprocessing improves the quality of our Bag of Words model. It involves:

  • Lowercasing
  • Removing punctuation
  • Tokenizing
  • Removing stopwords

stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())  # Lowercase and tokenize
    tokens = [word for word in tokens if word not in string.punctuation]  # Remove punctuation
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

preprocessed_docs = [preprocess(doc) for doc in documents]
print(preprocessed_docs)

Output:

[['cat', 'sat', 'mat'], ['dog', 'sat', 'log'], ['cat', 'chased', 'dog']]

3.5 Create Vocabulary

We collect all unique words across all documents.

vocab = sorted(set([word for doc in preprocessed_docs for word in doc]))
print(vocab)

Output:

['cat', 'chased', 'dog', 'log', 'mat', 'sat']

3.6 Construct the Bag of Words Matrix

Now, for each document, count how many times each word from the vocabulary appears.

def create_bow_vector(doc):
    vector = [0] * len(vocab)          # one count slot per vocabulary word
    for word in doc:
        if word in vocab:
            index = vocab.index(word)  # position of this word in the vocabulary
            vector[index] += 1
    return vector

bow_matrix = [create_bow_vector(doc) for doc in preprocessed_docs]

df = pd.DataFrame(bow_matrix, columns=vocab, index=["Doc1", "Doc2", "Doc3"])
print(df)

Output:

        cat  chased  dog  log  mat  sat
Doc1      1       0    0    0    1    1
Doc2      0       0    1    1    0    1
Doc3      1       1    1    0    0    0

This is your Bag of Words representation, built with NLTK for preprocessing and pandas only for display.
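
A note on efficiency: vocab.index() rescans the list on every lookup, which is fine for this toy example but slow for large vocabularies. A common alternative, sketched below reusing vocab and preprocessed_docs from above, is to precompute a word-to-index dictionary:

word_to_index = {word: i for i, word in enumerate(vocab)}

def create_bow_vector_fast(doc):
    vector = [0] * len(vocab)
    for word in doc:
        index = word_to_index.get(word)  # O(1) dictionary lookup instead of a list scan
        if index is not None:
            vector[index] += 1
    return vector

bow_matrix = [create_bow_vector_fast(doc) for doc in preprocessed_docs]

This produces the same matrix as create_bow_vector() above.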


4. Interpreting the Output

  • Each row corresponds to a sentence (document).
  • Each column corresponds to a unique word (feature).
  • Each cell value shows how many times that word appears in the document.

The BoW matrix is often converted to NumPy arrays or SciPy sparse matrices for model training.
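
For example, here is a minimal sketch of that conversion, assuming NumPy and SciPy are installed and reusing bow_matrix from above:

import numpy as np
from scipy.sparse import csr_matrix

X_dense = np.array(bow_matrix)   # dense array of shape (3, 6): 3 documents x 6 vocabulary words
X_sparse = csr_matrix(X_dense)   # sparse format that stores only the non-zero counts
print(X_dense.shape, X_sparse.nnz)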