Introduction
Word2Vec is one of the most transformative models in Natural Language Processing (NLP) for learning word embeddings — dense numerical representations of words that capture their meaning and relationships based on how they appear in text.
Word2Vec comes in two main architectures:
- Continuous Bag of Words (CBOW)
- Skip-Gram
In this tutorial, we’ll explore CBOW in depth — its working principles, training process, advantages, disadvantages, and a complete example with a dry run.
1. Intuition Behind CBOW
The Continuous Bag of Words (CBOW) model predicts a target word using its surrounding context words.
It works like a person trying to guess a missing word from a sentence.
Example:
“The cat sat on the ___.”
You can easily infer that the missing word is “mat” from the surrounding words.
That’s exactly what CBOW does — it takes the words before and after a target word and tries to predict what that missing word is.
The term “Bag of Words” means that the model ignores the order of the context words. It only considers which words appear near the target, not their sequence.
2. Architecture of CBOW
The CBOW model is a simple neural network with three layers:
Input Layer
- Each context word is represented as a one-hot vector, where one position corresponds to the word index in the vocabulary, and all others are zero.
- For a vocabulary of 10,000 words, each vector is 10,000-dimensional with only one non-zero entry.
Hidden Layer
- The input layer is connected to a hidden layer that has a much smaller size — usually 100 to 300 neurons.
- There is no activation function here; it just computes a weighted sum of the inputs using a weight matrix (W).
- This layer’s weights will eventually become the word embeddings.
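Because each input is a one-hot vector, this multiplication simply selects one row of W. Here is a tiny NumPy sketch (the vocabulary size, embedding size, and random weights are purely illustrative) showing why each row of W ends up acting as a word's embedding:

```python
import numpy as np

V, N = 6, 3                       # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input-to-hidden weight matrix

one_hot_cat = np.zeros(V)
one_hot_cat[1] = 1.0              # suppose "cat" has index 1 in the vocabulary

# Multiplying a one-hot vector by W just picks out one row of W,
# so row i of W acts as the embedding of word i.
hidden = one_hot_cat @ W
assert np.allclose(hidden, W[1])
print(hidden)
```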
Output Layer
- The hidden representation is passed through another weight matrix (W’) that maps it back to the vocabulary size.
- The model uses a softmax function to assign probabilities to all possible target words.
- The word with the highest probability is predicted as the target word.
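Putting the three layers together, the model has just two parameter matrices plus a softmax at the end. A minimal NumPy sketch of the setup (the sizes and initialization are illustrative, and W2 stands in for W'):

```python
import numpy as np

V = 10_000   # vocabulary size
N = 300      # embedding (hidden layer) size

rng = np.random.default_rng(42)
W  = rng.normal(scale=0.01, size=(V, N))   # input -> hidden weights (the future embeddings)
W2 = rng.normal(scale=0.01, size=(N, V))   # hidden -> output weights (W' in the text)

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    e = np.exp(scores - scores.max())      # subtract the max for numerical stability
    return e / e.sum()
```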
3. Step-by-Step Working of CBOW
Let’s understand the complete process in stages.
Step 1: Choose the Context Window
A window size defines how many words around the target word are used as context.
Example: For window size 2 and the sentence
“The cat sat on the mat”
If the target word is “sat”, the context words are “The”, “cat”, “on”, and “the”.
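In code, extracting the context around a target position is a simple slicing operation. A small helper function (hypothetical, for illustration only):

```python
def context_words(tokens, target_index, window=2):
    """Return the words within `window` positions before and after the target."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = "The cat sat on the mat".split()
print(context_words(tokens, tokens.index("sat")))   # ['The', 'cat', 'on', 'the']
```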
Step 2: Encode Context Words
Each context word is converted into its one-hot vector representation.
For simplicity, suppose our vocabulary has only six words:
- the, cat, sat, on, mat, floor
Then, “cat” might be represented as [0,1,0,0,0,0], and “mat” as [0,0,0,0,1,0].
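A quick sketch of one-hot encoding for this six-word vocabulary (NumPy assumed; the word order in the vocabulary is arbitrary):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "floor"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0.]
print(one_hot("mat"))   # [0. 0. 0. 0. 1. 0.]
```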
Step 3: Average the Context Vectors
The one-hot vectors of the context words are averaged (or summed).
This averaged vector represents the combined meaning of the context.
Step 4: Compute Hidden Layer Representation
The averaged input vector is multiplied by the weight matrix (W).
This produces a hidden layer output — a dense vector that captures contextual meaning.
Step 5: Predict the Target Word
The hidden vector is multiplied by another weight matrix (W’) and passed through a softmax function, which gives a probability for every word in the vocabulary.
The model predicts the word with the highest probability as the target word.
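Steps 3 to 5 boil down to two matrix multiplications and a softmax. A self-contained sketch with random, untrained weights (so the predicted word here is arbitrary until the model is trained):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "floor"]
V, N = len(vocab), 3
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))    # input -> hidden
W2 = rng.normal(scale=0.1, size=(N, V))    # hidden -> output (W')

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

context = ["the", "cat", "on", "the"]       # context around the target "sat"

# Step 3: average the one-hot context vectors.
x = np.zeros(V)
for w in context:
    x[idx[w]] += 1.0 / len(context)

# Step 4: dense hidden representation.
h = x @ W

# Step 5: scores over the vocabulary, then softmax and argmax.
probs = softmax(h @ W2)
print(vocab[int(np.argmax(probs))])
```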
Step 6: Train the Model
- The model compares its predicted word with the actual target word.
- The difference (error) is calculated.
- Using backpropagation, both weight matrices (W and W’) are updated to reduce this error.
- After many iterations, the first weight matrix W contains the learned word embeddings.
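Here is a minimal training sketch with a plain softmax and full backpropagation. Real Word2Vec implementations use tricks such as negative sampling or hierarchical softmax to avoid computing the full softmax; the corpus, learning rate, and epoch count below are illustrative only:

```python
import numpy as np

corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
V, N, window, lr = len(vocab), 3, 2, 0.05
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))   # input -> hidden (the embeddings we keep)
W2 = rng.normal(scale=0.1, size=(N, V))   # hidden -> output (W')

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

for epoch in range(200):
    for t, target in enumerate(corpus):
        context = corpus[max(0, t - window):t] + corpus[t + 1:t + 1 + window]

        # Forward pass: averaged one-hot context -> hidden -> softmax over the vocabulary.
        x = np.zeros(V)
        for w in context:
            x[idx[w]] += 1.0 / len(context)
        h = x @ W
        y_hat = softmax(h @ W2)

        # Backward pass: cross-entropy gradient, then update both matrices.
        e = y_hat.copy()
        e[idx[target]] -= 1.0             # prediction error (y_hat - y_true)
        dh = W2 @ e                       # gradient flowing back to the hidden layer
        W2 -= lr * np.outer(h, e)
        W  -= lr * np.outer(x, dh)

# After training, the rows of W are the learned word embeddings.
print({w: W[i].round(2) for w, i in idx.items()})
```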
4. Example and Dry Run
Let’s walk through a complete example to solidify your understanding.
Sentence:
“The cat sits on the mat”
Vocabulary:
[the, cat, sits, on, mat]
So, the vocabulary size is 5.
Window Size:
2 (we look at two words before and after the target word)
Target Word:
“sits”
Step 1: Identify Context Words
The context words are “The”, “cat”, “on”, and “the”.
Step 2: Convert Context Words into One-Hot Vectors
We represent each context word as a vector of size 5.
For simplicity, let’s assign:
- the → [1,0,0,0,0]
- cat → [0,1,0,0,0]
- sits → [0,0,1,0,0]
- on → [0,0,0,1,0]
- mat → [0,0,0,0,1]
The input layer for our context will therefore have four one-hot vectors:
[1,0,0,0,0], [0,1,0,0,0], [0,0,0,1,0], [1,0,0,0,0]
Step 3: Average the Context Vectors
We add these vectors and divide by 4:
[(1+0+0+1)/4, (0+1+0+0)/4, (0+0+0+0)/4, (0+0+1+0)/4, (0+0+0+0)/4]
This gives [0.5, 0.25, 0, 0.25, 0].
This averaged vector represents the combined context meaning.
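You can verify the arithmetic with a few lines of NumPy:

```python
import numpy as np

the = np.array([1, 0, 0, 0, 0])
cat = np.array([0, 1, 0, 0, 0])
on  = np.array([0, 0, 0, 1, 0])

avg = (the + cat + on + the) / 4
print(avg)   # [0.5  0.25 0.   0.25 0.  ]
```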
Step 4: Multiply with Weight Matrix (Input to Hidden)
This vector is multiplied by a weight matrix (W) of size 5×N, where N is the embedding size (say, 3).
After multiplication, we get a dense hidden vector — for instance, something like [0.21, 0.36, 0.47].
This 3-dimensional vector captures the semantic meaning of the surrounding context.
Step 5: Multiply with Second Weight Matrix (Hidden to Output)
We multiply the hidden vector by another weight matrix (W’) of size N×5 (3×5).
This gives us 5 scores — one for each word in the vocabulary.
Step 6: Apply Softmax
These scores are converted into probabilities that sum to 1.
Suppose we get something like:
- the: 0.15
- cat: 0.10
- sits: 0.50
- on: 0.15
- mat: 0.10
The model predicts “sits” as the most likely target word, which matches the actual word.
Over time, training on many such examples allows the model to adjust weights so that words with similar meanings get similar vector representations.
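The whole dry run can be reproduced end to end in a few lines. Keep in mind that the hidden vector and the probabilities depend on the weights: with random, untrained weights the output is roughly uniform, and only after training (as in the sketch in Section 3) would "sits" dominate as in the probabilities listed above.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]
V, N = len(vocab), 3
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(1)
W  = rng.normal(scale=0.1, size=(V, N))   # 5 x 3, input -> hidden
W2 = rng.normal(scale=0.1, size=(N, V))   # 3 x 5, hidden -> output

context = ["the", "cat", "on", "the"]      # Step 1: context around "sits"

x = np.mean([np.eye(V)[idx[w]] for w in context], axis=0)   # Steps 2-3
print(x)                                   # [0.5  0.25 0.   0.25 0.  ]

h = x @ W                                  # Step 4: 3-dimensional hidden vector
scores = h @ W2                            # Step 5: one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                       # Step 6: softmax

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")
```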
5. Advantages of CBOW
1. Computationally Efficient
CBOW makes a single prediction per context window (the target word), whereas Skip-Gram makes a separate prediction for every context word, so CBOW needs fewer computations and trains faster.
2. Stable for Frequent Words
By averaging context words, CBOW provides smooth, consistent embeddings for words that appear frequently in the corpus.
3. Captures Semantic Relationships
It effectively captures similarity between words based on how often they appear in similar contexts (for example, “dog” and “cat” tend to occur near similar words like “pet”, “animal”, or “cute”).
4. Less Sensitive to Noise
Averaging over several context words helps CBOW handle noisy data more robustly than models relying on a single context word.
6. Disadvantages of CBOW
1. Poor at Representing Rare Words
Rare words appear in few training windows, and their signal is further diluted when the context vectors are averaged, so CBOW often produces weaker embeddings for infrequent words than Skip-Gram does.
2. Ignores Word Order
CBOW treats all context words equally, regardless of their order in the sentence.
Thus, “dog bites man” and “man bites dog” would be represented similarly.
3. Averaging Loses Detail
Averaging the context words smooths out important distinctions and subtle differences between contexts.
4. Limited to Fixed Context Window
CBOW cannot naturally handle long-distance dependencies or capture words that are semantically related but far apart in a sentence.
7. Applications of CBOW
The word embeddings learned through CBOW can be used as features for various downstream NLP tasks, including:
- Sentiment Analysis
- Text Classification
- Named Entity Recognition (NER)
- Machine Translation
- Question Answering
- Information Retrieval
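In practice, you rarely implement CBOW from scratch; libraries such as gensim provide it directly (sg=0 selects CBOW, sg=1 selects Skip-Gram). A brief sketch, assuming gensim 4.x and a toy two-sentence corpus:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "floor"],
]

# sg=0 -> CBOW; vector_size, window, and epochs here are toy values.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

vector = model.wv["cat"]                       # the 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat", topn=3))    # nearest neighbours in the embedding space
```

The resulting vectors can then be fed as features into any of the downstream tasks listed above.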
8. Summary
The CBOW model in Word2Vec is a simple yet powerful way to learn word embeddings based on predicting a word from its surrounding context.
It’s computationally efficient, performs well for frequent words, and captures meaningful word relationships.
However, it has limitations such as ignoring word order and struggling with rare words.
Despite these, CBOW remains a foundational approach for understanding how words relate to each other in text and serves as the basis for more advanced embedding models.
