1. Introduction
After understanding the theoretical intuition behind Word2Vec (CBOW and Skip-gram), the next logical step is to implement it in practice.
In real-world NLP applications, we rarely train Word2Vec from scratch on small data; instead, we either train it on large corpora or use pre-trained embeddings.
This tutorial walks through a complete implementation using Python’s gensim library — one of the most popular and optimized libraries for Word2Vec.
2. Prerequisites
Before starting, ensure that you have the following installed:
pip install gensim nltk
We’ll use NLTK’s stop-word list during preprocessing; tokenization itself is handled by gensim’s simple_preprocess.
3. Step-by-Step Implementation
Step 1: Import Libraries
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
Download the necessary NLTK data files if you haven’t already:
nltk.download('punkt')
nltk.download('stopwords')
Step 2: Prepare and Clean the Text Data
Let’s use a small text dataset for demonstration.
corpus = [
    "The movie was fantastic and full of suspense",
    "I really enjoyed the performance of the lead actor",
    "The plot was dull and the movie was too long",
    "What a great and thrilling experience to watch",
    "The direction and cinematography were brilliant"
]
Now, preprocess this corpus — tokenize each sentence, remove stop words, and lowercase everything.
stop_words = set(stopwords.words('english'))
def preprocess(text):
    tokens = simple_preprocess(text)
    return [word for word in tokens if word not in stop_words]
processed_corpus = [preprocess(sentence) for sentence in corpus]
print(processed_corpus)
Output:
[['movie', 'fantastic', 'full', 'suspense'], ['really', 'enjoyed', 'performance', 'lead', 'actor'], ['plot', 'dull', 'movie', 'long'], ['great', 'thrilling', 'experience', 'watch'], ['direction', 'cinematography', 'brilliant']]
Step 3: Train the Word2Vec Model
We can now train the Word2Vec model using gensim.
model = Word2Vec(
    sentences=processed_corpus,
    vector_size=100,  # dimensionality of word embeddings
    window=3,         # context window size
    min_count=1,      # ignore words with total frequency lower than this
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    epochs=100        # number of iterations (training epochs)
)
Here:
- vector_size defines how many features each word vector will have.
- window defines the number of context words before and after the target word.
- min_count filters out rare words.
- sg=0 selects CBOW; sg=1 selects Skip-gram.
- epochs defines how many times the algorithm will pass through the entire corpus.
Step 4: Explore the Word Vectors
After training, you can inspect and analyze the learned vectors.
words = list(model.wv.index_to_key)
print("Vocabulary size:", len(words))
print("Sample words:", words[:10])
You can also view the vector representation of a specific word:
print("Vector for 'movie':\n", model.wv['movie'])
Step 5: Find Similar Words
Word2Vec can find semantically similar words using cosine similarity.
similar_words = model.wv.most_similar('movie', topn=5)
for word, similarity in similar_words:
    print(f"{word} -> {similarity:.4f}")
Output (example):
plot -> 0.84
fantastic -> 0.80
long -> 0.79
thrilling -> 0.75
suspense -> 0.73
This suggests the model has learned semantic relationships: “plot,” “fantastic,” and “suspense” appear in contexts similar to “movie.” On a corpus this small the exact scores are noisy; larger corpora yield far more reliable neighbors.
Step 6: Compute Similarity Between Two Words
sim = model.wv.similarity('movie', 'plot')
print(f"Similarity between 'movie' and 'plot': {sim:.4f}")
Step 7: Visualize Word Embeddings (Optional)
To visualize embeddings, we can reduce the dimensionality from 100D to 2D using PCA or t-SNE.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Reduce dimensions
X = model.wv[model.wv.index_to_key]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# Plot the words
plt.figure(figsize=(8,6))
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(model.wv.index_to_key):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()
This will display a 2D visualization of word vectors, where semantically related words appear close to each other.
4. Example Dry Run (How It Works Internally)
Let’s break down how CBOW works internally during training.
Sentence: “The movie was fantastic”
- The model slides a window (say size = 1, i.e., one word on each side) over the sentence.
- For the target word “movie”, its context words are “The” and “was”.
- CBOW takes the average of the context word vectors (“The” and “was”) and tries to predict the target word “movie”.
- Using backpropagation, the model adjusts weights to minimize prediction error.
- Over multiple iterations, the vectors move in such a way that words appearing in similar contexts (like “movie” and “film”) end up closer together.
This process is repeated across all sentences, gradually shaping the embedding space.
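The dry run above can be sketched numerically. The toy code below is a plain-numpy, full-softmax CBOW step (gensim actually uses optimizations like negative sampling, so this is an illustration of the mechanics, not gensim’s implementation): it averages the two context vectors, predicts the target, and repeatedly applies a gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "fantastic"]
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                         # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))    # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))   # output (prediction) weights

context, target = ["the", "was"], "movie"
lr = 0.5

for step in range(300):
    h = W_in[[idx[w] for w in context]].mean(axis=0)   # average context vectors
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                               # softmax over the vocabulary
    loss = -np.log(probs[idx[target]])                 # cross-entropy loss
    grad = probs.copy()
    grad[idx[target]] -= 1.0                           # dLoss/dScores
    g_h = W_out @ grad                                 # gradient w.r.t. the averaged vector
    W_out -= lr * np.outer(h, grad)                    # backpropagate into both matrices
    for w in context:
        W_in[idx[w]] -= lr * g_h / len(context)

predicted = vocab[int(np.argmax(probs))]
print(f"final loss: {loss:.4f}, predicted target: {predicted}")
```

After a few hundred steps the loss approaches zero and the model predicts “movie” from its context, which is exactly the adjustment gensim performs across every window in the corpus.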
5. Saving and Loading the Model
You can save the trained model for later use:
model.save("word2vec_model.model")
To load it again:
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec_model.model")
6. Using Pre-Trained Word2Vec Models
If your dataset is small, training your own Word2Vec might not capture rich semantics.
Instead, you can use pre-trained embeddings like Google’s Word2Vec model trained on 100 billion words.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
Then you can directly use:
model.most_similar('king')
7. Advantages of Word2Vec in Practice
- Semantic Understanding: captures deep relationships, so words with similar meanings or roles appear close in the vector space.
- Transferable Representations: pre-trained embeddings can be reused for many NLP tasks like classification, summarization, or translation.
- Efficient Training: gensim’s implementation is optimized for speed and can handle large corpora efficiently.
- Interpretable Vector Space: relationships like king - man + woman ≈ queen emerge naturally, showing the strength of distributed semantics.
8. Disadvantages
- Ignores Context: Word2Vec gives a single vector per word, even if it has multiple meanings (e.g., bank).
- Large Training Data Requirement: a large and diverse corpus is needed to get good-quality embeddings.
- No Sentence-Level Understanding: it only models word-level semantics; sentence or paragraph meaning requires averaging or more advanced techniques.
- Memory Consumption: high-dimensional embeddings for large vocabularies consume significant memory.
9. Applications
- Document similarity and clustering
- Sentiment classification
- Question answering systems
- Chatbots and dialogue systems
- Information retrieval
- Keyword extraction and tagging
10. Summary
| Step | Description |
|---|---|
| 1 | Preprocess and tokenize the text |
| 2 | Train Word2Vec using gensim (CBOW or Skip-gram) |
| 3 | Explore vocabulary and vectors |
| 4 | Measure word similarity |
| 5 | Visualize embeddings using PCA or t-SNE |
| 6 | Save and reuse the model for NLP tasks |
