
Practical implementation of TF-IDF

1. Introduction

In this tutorial, we will walk step by step through a practical implementation of TF-IDF in Python.

2. Installing Required Libraries

We will use NLTK for preprocessing and Scikit-learn for TF-IDF vectorization.

pip install nltk scikit-learn
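
If you want to confirm that both libraries are available, a quick version check works (the exact versions printed will vary with your environment):

import sklearn
import nltk
print(sklearn.__version__, nltk.__version__)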

3. Step-by-Step Implementation in Python

Step 1: Import Required Libraries

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

Step 2: Prepare a Sample Corpus

Let’s define a small corpus (collection of documents) for demonstration.

corpus = [
    "Machine learning is fascinating",
    "Deep learning drives many AI applications",
    "Artificial intelligence and machine learning are related fields",
    "TF-IDF is used in text mining and information retrieval"
]

Step 3: Create TF-IDF Representation

Scikit-learn provides an easy way to compute TF-IDF using TfidfVectorizer.

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

Here:

  • fit_transform() learns the vocabulary and computes the TF-IDF matrix.
  • X is a sparse matrix of shape (number_of_documents × number_of_unique_terms), which you can verify as shown below.
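
As a quick sanity check, you can print the type and shape of X. With this four-document corpus the shape should be (4, 23), since the default tokenizer keeps 23 unique terms of two or more characters:

print(type(X))   # a SciPy sparse matrix (CSR format)
print(X.shape)   # (4, 23) for the sample corpus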

Step 4: Convert TF-IDF Matrix to DataFrame

To better visualize, we can convert the sparse matrix into a pandas DataFrame.

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)

Output (sample; only the first few columns are shown, values rounded):

         ai    and  applications    are  artificial   deep  drives  fascinating  fields
0     0.000  0.000         0.000  0.000       0.000  0.000   0.000        0.614   0.000
1     0.430  0.000         0.430  0.000       0.000  0.430   0.430        0.000   0.000
2     0.000  0.306         0.000  0.388       0.388  0.000   0.000        0.000   0.388
3     0.000  0.259         0.000  0.000       0.000  0.000   0.000        0.000   0.000

Each cell represents the TF-IDF score of a word in a particular document.
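
Because terms are the DataFrame columns, individual scores are easy to look up. For example, to read the score of "fascinating" in the first document, or to list a document's highest-weighted terms:

print(df.loc[0, 'fascinating'])                       # score of one term in one document (≈ 0.614 here)
print(df.loc[0].sort_values(ascending=False).head())  # highest-scoring terms in document 0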


Step 5: Inspect Vocabulary and TF-IDF Weights

To see which terms were extracted and their positions:

print(vectorizer.vocabulary_)

Example output:

{
 'machine': 15,
 'learning': 14,
 'fascinating': 7,
 'deep': 5,
 'ai': 0,
 'applications': 2,
 ...
}
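
Since vocabulary_ maps each term to its column index in X, you can also use it to pull a single score straight from the sparse matrix:

col = vectorizer.vocabulary_['machine']
print(X[0, col])   # TF-IDF score of 'machine' in the first document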

You can also view the IDF value of each term:

print(vectorizer.idf_)
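
idf_ is a NumPy array aligned with get_feature_names_out(), so the two can be zipped together for a readable listing. With the default smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t:

for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term}: {idf:.3f}")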

Step 6: Extract Top Keywords per Document

You can extract the most important keywords (highest TF-IDF values) from each document:

import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())

for doc in range(len(corpus)):
    sorted_indices = X[doc].toarray()[0].argsort()[::-1]  # term indices, highest TF-IDF first
    top_n = 3  # Top 3 keywords
    top_features = feature_names[sorted_indices[:top_n]]
    print(f"Document {doc+1} top keywords: {top_features}")

Sample Output:

Document 1 top keywords: ['fascinating' 'machine' 'is']
Document 2 top keywords: ['applications' 'deep' 'drives']
Document 3 top keywords: ['artificial' 'fields' 'related']
Document 4 top keywords: ['retrieval' 'mining' 'text']

This shows which terms are most important for each document based on TF-IDF weights.
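
One caveat: argsort also ranks zero-valued terms, so if a document has fewer than top_n nonzero terms, irrelevant padding terms can slip into the output. A small variant that restricts the ranking to nonzero scores (a sketch reusing the same X and feature_names):

for doc in range(len(corpus)):
    row = X[doc].toarray()[0]
    nonzero = row.nonzero()[0]                       # indices of terms that actually occur
    top = nonzero[row[nonzero].argsort()[::-1]][:3]  # rank only those, descending
    print(f"Document {doc+1} top keywords: {feature_names[top]}")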


4. Preprocessing Text before Applying TF-IDF

When applying TF-IDF to real-world data, preprocessing is crucial for clean and meaningful features.
You can use NLTK for this purpose.

Example of Text Cleaning:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # some NLTK versions also need this for WordNet

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                    # normalize case
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    words = text.split()
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]  # drop stop words, lemmatize the rest
    return ' '.join(words)

clean_corpus = [preprocess(doc) for doc in corpus]
print(clean_corpus)

Now you can apply TF-IDF on clean_corpus instead of corpus to get cleaner, more accurate features.
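
The cleaned corpus plugs into the same pipeline as before. A minimal sketch, reusing the TfidfVectorizer and pandas imports from earlier:

clean_vectorizer = TfidfVectorizer()
X_clean = clean_vectorizer.fit_transform(clean_corpus)

df_clean = pd.DataFrame(X_clean.toarray(), columns=clean_vectorizer.get_feature_names_out())
print(df_clean)

With stop words removed and words lemmatized, the resulting feature space is smaller and each remaining term carries more signal.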