Introduction
Word2Vec revolutionized Natural Language Processing (NLP) by popularizing word embeddings: dense numerical vectors that capture semantic and syntactic relationships between words.
The architecture of Word2Vec has two main variants:
- Continuous Bag of Words (CBOW) — predicts a target word from context words.
- Skip-Gram — predicts surrounding context words from a target word.
In this tutorial, we’ll focus on the Skip-Gram model, which is particularly effective for representing rare words and capturing more detailed contextual relationships.
We’ll explore its intuition, architecture, step-by-step working, a detailed example and dry run, and the advantages and disadvantages.
1. Intuition Behind Skip-Gram
The Skip-Gram model is the opposite of CBOW.
Instead of predicting a missing target word from surrounding context, it predicts the context words given a target word.
You can think of Skip-Gram as answering this question:
“Given a particular word, what words are likely to appear near it?”
For example, if the word “cat” appears in a sentence, Skip-Gram learns to predict nearby words such as “the”, “sat”, “on”, and “mat”.
By doing this across millions of sentences, the model learns that “cat” often appears in similar contexts as “dog”, so their vector representations become similar.
The model takes its name from skip-grams: word pairs formed by skipping over the words in between, so that a target word is paired with every neighbor inside a chosen window size.
2. Architecture of Skip-Gram
The Skip-Gram model has the same three-layer neural network structure as CBOW, but the data flow is reversed.
Input Layer
- The input layer represents the target word as a one-hot vector of vocabulary size V.
- For example, if there are 10,000 words in the vocabulary, the input vector is 10,000-dimensional, with only one element set to 1 (for the target word).
Hidden Layer
- The input vector is multiplied by a weight matrix (W) that maps it into a lower-dimensional vector space (the embedding space).
- The output of this layer is the embedding vector for the target word.
- There’s no activation function — it’s a simple linear transformation.
Output Layer
- The hidden layer vector is multiplied by a second weight matrix (W’) to generate a score for every word in the vocabulary.
- The scores are passed through a softmax function to produce probabilities.
- The goal is to assign higher probabilities to the actual context words and lower probabilities to others.
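To make this data flow concrete, here is a minimal NumPy sketch of one forward pass. The vocabulary size, embedding dimension, and random weights are illustrative assumptions, not values from a trained model.

```python
import numpy as np

V, N = 10_000, 300                             # vocabulary size, embedding dimension (assumed)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.01, size=(V, N))        # input-to-hidden weights (one row per word)
W_prime = rng.normal(scale=0.01, size=(N, V))  # hidden-to-output weights

target_idx = 42                                # index of the target word in the vocabulary
x = np.zeros(V)
x[target_idx] = 1.0                            # one-hot input vector

h = x @ W                                      # hidden layer: the embedding of the target word
scores = h @ W_prime                           # one raw score per vocabulary word
probs = np.exp(scores - scores.max())          # softmax, shifted for numerical stability
probs /= probs.sum()

print(probs.shape)                             # (10000,): a probability for every word
```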
3. Step-by-Step Working of Skip-Gram
Let’s understand how Skip-Gram works conceptually, step by step.
Step 1: Choose a Context Window
A context window size (C) determines how many words before and after the target word are considered its context.
For example, with C = 2 in the sentence “The cat sat on the mat”, the context of “sat” is “The”, “cat”, “on”, and “the”.
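Before the network sees anything, the corpus is turned into (target, context) pairs by sliding this window over each sentence. Here is a minimal sketch, assuming simple whitespace tokenization; the helper name skipgram_pairs is ours for illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within the window."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

tokens = "the cat sat on the mat".split()
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```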
Step 2: Select a Target Word
The model picks one word (for example, “sat”) as the target and tries to predict its neighboring context words within the defined window.
Step 3: Encode the Target Word
The target word is converted into a one-hot vector.
Step 4: Multiply with the Weight Matrix (Input to Hidden)
This one-hot vector is multiplied by a weight matrix (W), which retrieves the embedding for that word.
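A small sketch of Steps 3 and 4 together, showing that multiplying a one-hot vector by W is nothing more than selecting one row of W. The tiny vocabulary and random weights are assumptions for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
W = np.random.default_rng(0).normal(size=(len(vocab), 3))  # assumed 3-dimensional embeddings

target = "sat"
one_hot = np.zeros(len(vocab))
one_hot[vocab.index(target)] = 1.0                     # Step 3: one-hot encode the target

embedding = one_hot @ W                                # Step 4: input-to-hidden multiplication
assert np.allclose(embedding, W[vocab.index(target)])  # identical to a simple row lookup
```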
Step 5: Predict Each Context Word
The model uses the embedding of the target word to predict each of the surrounding words separately.
It multiplies the embedding by another matrix (W’) and applies a softmax function to get a probability distribution over all vocabulary words.
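Continuing the sketch with the same assumed shapes, Step 5 scores the embedding against W’ and normalizes the scores with a softmax:

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

rng = np.random.default_rng(0)
embedding = rng.normal(size=3)             # embedding of the target word (assumed values)
W_prime = rng.normal(size=(3, 5))          # hidden-to-output weights (assumed values)

probs = softmax(embedding @ W_prime)       # one probability per vocabulary word
print(probs, probs.sum())                  # the five probabilities sum to 1
```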
Step 6: Train and Update Weights
For each context word, the model compares the predicted probability distribution with the actual context word and measures the error, typically with a cross-entropy loss.
Using backpropagation, it updates both W and W’ to reduce prediction error.
This process repeats for all words in the corpus, refining embeddings with each iteration.
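Here is a hedged sketch of what one such update looks like for a single (target, context) pair, using the full softmax and a cross-entropy loss to keep the math transparent; real implementations speed this step up with approximations.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, N, lr = 5, 3, 0.05                          # vocab size, embedding size, learning rate (assumed)
W = rng.normal(scale=0.1, size=(V, N))         # input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights

target_idx, context_idx = 2, 1                 # assumed indices of the target and context words

h = W[target_idx]                              # Step 4: embedding lookup
probs = softmax(h @ W_prime)                   # Step 5: predicted distribution over the vocabulary
loss = -np.log(probs[context_idx])             # cross-entropy against the true context word

d_scores = probs.copy()                        # gradient of the loss w.r.t. the raw scores
d_scores[context_idx] -= 1.0
grad_h = W_prime @ d_scores                    # gradient flowing back to the embedding
W_prime -= lr * np.outer(h, d_scores)          # update the hidden-to-output weights
W[target_idx] -= lr * grad_h                   # update the target word's embedding

print(round(float(loss), 4))                   # the loss shrinks as this step is repeated
```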
4. Example and Dry Run
Let’s walk through a practical example to make it intuitive.
Sentence:
“The cat sits on the mat”
Vocabulary:
[the, cat, sits, on, mat]
Vocabulary size = 5
Window Size:
2 (two words before and after the target)
Target Word:
“sits”
Step 1: Identify Context Words
For the target “sits”, the context words are “the”, “cat”, “on”, and “the”.
Step 2: Input Vector for Target Word
We represent “sits” as a one-hot vector of size 5:
[0, 0, 1, 0, 0]
Step 3: Multiply with Weight Matrix (Input to Hidden)
Let’s assume the embedding size is 3.
Multiplying the one-hot vector by the input weight matrix (W) retrieves the 3-dimensional embedding of “sits”.
Suppose it gives us something like [0.25, 0.40, 0.55].
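A quick sketch of this lookup. The values in W below are made up so that the “sits” row matches the illustrative embedding above; a real model learns these numbers during training.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]

# Assumed 5x3 input weight matrix; only the "sits" row matters for this step.
W = np.array([
    [0.10, 0.20, 0.30],   # the
    [0.15, 0.35, 0.50],   # cat
    [0.25, 0.40, 0.55],   # sits
    [0.30, 0.10, 0.20],   # on
    [0.05, 0.45, 0.25],   # mat
])

one_hot = np.array([0, 0, 1, 0, 0])   # "sits"
print(one_hot @ W)                    # [0.25 0.4  0.55]: the embedding of "sits"
```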
Step 4: Predict Each Context Word
Now the model uses this embedding to predict each of the four context words.
It multiplies this embedding by the second weight matrix (W’) and applies the softmax function to get a probability distribution over all 5 vocabulary words.
Suppose for one context word, we get probabilities like:
- the: 0.45
- cat: 0.20
- sits: 0.05
- on: 0.20
- mat: 0.10
The model predicts “the” as one of the likely context words — which is correct.
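To see where such a distribution comes from, here is a sketch of the softmax step. The raw scores below are assumed values, chosen only so that the result roughly reproduces the illustrative probabilities listed above.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]

# In the full model these scores come from multiplying the "sits" embedding by W'.
scores = np.array([-0.80, -1.61, -3.00, -1.61, -2.30])

probs = np.exp(scores - scores.max())   # softmax, shifted for numerical stability
probs /= probs.sum()
print({w: round(float(p), 2) for w, p in zip(vocab, probs)})
# {'the': 0.45, 'cat': 0.2, 'sits': 0.05, 'on': 0.2, 'mat': 0.1}
```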
Step 5: Repeat for All Context Words
This prediction process is repeated for all four context words (“the”, “cat”, “on”, and “the”).
Each time, the model adjusts its weight matrices using backpropagation to make predictions more accurate.
Step 6: Learned Embeddings
After training on large text data, the rows of the first weight matrix (W) become the final word embeddings.
These embeddings capture the meaning of words based on the contexts in which they appear.
For instance, “cat” and “dog” may end up with very similar embeddings since they often occur in similar contexts like “pet”, “animal”, or “bark/meow”.
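In practice you rarely hand-code this training loop; a library such as Gensim trains Skip-Gram directly (sg=1 selects Skip-Gram instead of CBOW). A minimal usage sketch, assuming Gensim 4.x; the toy corpus is far too small to learn meaningful vectors and only illustrates the API.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

model = Word2Vec(
    sentences,
    sg=1,            # 1 = Skip-Gram, 0 = CBOW
    vector_size=50,  # embedding dimension
    window=2,        # context window size
    min_count=1,     # keep every word, even one that appears only once
    epochs=50,
)

print(model.wv["cat"])                # the learned embedding for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```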
5. Advantages of Skip-Gram
1. Excellent for Rare Words
Skip-Gram learns meaningful representations even for infrequent words because each occurrence of a word generates its own (target, context) training pairs, so a rare word's vector receives dedicated updates instead of being averaged together with other context words as in CBOW.
2. Captures Complex Relationships
By predicting multiple context words from one target word, Skip-Gram captures richer and more nuanced word relationships than CBOW.
3. Produces High-Quality Embeddings
Skip-Gram embeddings often outperform CBOW in capturing semantic regularities such as analogies — for example,
“king – man + woman ≈ queen”.
4. Handles Small Datasets Well
Because every (target, context) pair is treated as a separate training example, Skip-Gram extracts more training signal from a limited corpus and tends to perform better than CBOW when the training data is small.
6. Disadvantages of Skip-Gram
1. Slower Training
Skip-Gram predicts multiple context words for every target word, which makes training slower compared to CBOW.
2. Less Stable for Frequent Words
It can sometimes produce noisier embeddings for frequent words since it updates them very often.
3. Ignores Word Order
Like CBOW, Skip-Gram treats all context words as independent of order.
Thus, it doesn’t distinguish between sentences like “dog bites man” and “man bites dog”.
4. Limited Context Window
It still only considers a fixed-size context window and cannot model long-distance word dependencies effectively.
7. Applications of Skip-Gram Word Embeddings
Skip-Gram embeddings are widely used in modern NLP applications, including:
- Sentiment Analysis
- Machine Translation
- Question Answering Systems
- Semantic Search
- Chatbots
- Document Similarity and Clustering
8. Summary
The Skip-Gram model in Word2Vec predicts the surrounding words given a target word.
It learns high-quality word embeddings that capture both semantic (meaning-based) and syntactic (structure-based) relationships between words.
While it trains more slowly than CBOW, Skip-Gram performs better for rare words and captures deeper contextual relationships, making it highly valuable for building robust word embeddings in NLP systems.
In essence, Skip-Gram teaches a model to “understand” the linguistic patterns of a language simply by predicting how words co-occur — without any manual supervision.
