1. Introduction
In this tutorial, we will walk step by step through a practical implementation of TF-IDF in Python.
2. Installing Required Libraries
We will use NLTK for preprocessing, Scikit-learn for TF-IDF vectorization, and pandas and NumPy to inspect the results.
pip install nltk scikit-learn pandas numpy
3. Step-by-Step Implementation in Python
Step 1: Import Required Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
Step 2: Prepare a Sample Corpus
Let’s define a small corpus (collection of documents) for demonstration.
corpus = [
"Machine learning is fascinating",
"Deep learning drives many AI applications",
"Artificial intelligence and machine learning are related fields",
"TF-IDF is used in text mining and information retrieval"
]
Step 3: Create TF-IDF Representation
Scikit-learn provides an easy way to compute TF-IDF using TfidfVectorizer.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
Here:
fit_transform() learns the vocabulary and computes the TF-IDF matrix.
X is a sparse matrix of shape (number_of_documents × number_of_unique_terms).
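You can confirm this directly; with this four-document corpus and the default tokenizer, the vocabulary contains 23 unique terms:
print(X.shape)  # (4, 23) for this corpus: 4 documents, 23 unique terms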
Step 4: Convert TF-IDF Matrix to DataFrame
To better visualize, we can convert the sparse matrix into a pandas DataFrame.
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
Output (sample; values are illustrative):
|   | ai | and | applications | are | artificial | deep | drives | fascinating | fields | … |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.577 | 0 | … |
| 1 | 0 | 0 | 0.577 | 0 | 0 | 0.577 | 0.577 | 0 | 0 | … |
| 2 | 0 | 0.408 | 0 | 0.408 | 0.408 | 0 | 0 | 0 | 0.408 | … |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
Each cell represents the TF-IDF score of a word in a particular document.
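To see where such a number comes from, here is a minimal sketch of how a single cell is computed, assuming scikit-learn's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'):
import numpy as np

n_docs = 4     # documents in the corpus
df_term = 1    # "fascinating" appears in only one document
idf = np.log((1 + n_docs) / (1 + df_term)) + 1  # smoothed IDF used by scikit-learn
tf = 1         # raw count of "fascinating" in document 1
print(tf * idf)  # raw weight; scikit-learn then L2-normalizes each document row
The printed value differs from the score in the matrix because each document row is subsequently rescaled to unit length.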
Step 5: Inspect Vocabulary and TF-IDF Weights
To see which terms were extracted and their positions:
print(vectorizer.vocabulary_)
Example output:
{
    'machine': 15,
    'learning': 14,
    'fascinating': 7,
    'deep': 5,
    'ai': 0,
    'applications': 2,
    ...
}
The values are column indices into X; scikit-learn assigns them in alphabetical order of the terms.
You can also view the IDF value of each term:
print(vectorizer.idf_)
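To pair each term with its IDF weight, you can zip the feature names with idf_:
for term, weight in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term}: {weight:.3f}")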
Step 6: Extract Top Keywords per Document
You can extract the most important keywords (highest TF-IDF values) from each document:
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
top_n = 3  # number of keywords to keep per document

for doc in range(len(corpus)):
    # Sort this document's term indices by TF-IDF weight, highest first
    sorted_indices = X[doc].toarray()[0].argsort()[::-1]
    top_features = feature_names[sorted_indices[:top_n]]
    print(f"Document {doc+1} top keywords: {top_features}")
Sample Output:
Document 1 top keywords: ['fascinating' 'learning' 'machine']
Document 2 top keywords: ['applications' 'deep' 'drives']
Document 3 top keywords: ['artificial' 'fields' 'related']
Document 4 top keywords: ['retrieval' 'mining' 'text']
This shows which terms are most important for each document based on TF-IDF weights.
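If you also want the weights alongside the keywords, a small variation of the same loop (reusing X, feature_names, and top_n from above) prints each score:
for doc in range(len(corpus)):
    row = X[doc].toarray()[0]
    for idx in row.argsort()[::-1][:top_n]:
        print(f"Document {doc+1}: {feature_names[idx]} = {row[idx]:.3f}")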
4. Preprocessing Text before Applying TF-IDF
Before computing TF-IDF on real-world data, preprocessing is crucial for accuracy.
You can use NLTK for this purpose.
Example of Text Cleaning:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                   # normalize case
    text = re.sub(r'[^a-z\s]', '', text)  # drop digits and punctuation
    words = text.split()
    # Remove stop words and reduce each word to its lemma
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return ' '.join(words)

clean_corpus = [preprocess(doc) for doc in corpus]
print(clean_corpus)
Now you can apply TF-IDF to clean_corpus instead of corpus to get cleaner, more accurate features.
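For example (clean_vectorizer and X_clean are illustrative names, not part of the earlier code):
clean_vectorizer = TfidfVectorizer()
X_clean = clean_vectorizer.fit_transform(clean_corpus)
print(clean_vectorizer.get_feature_names_out())  # stop words and inflected forms are gone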
