Introduction
Word2Vec revolutionized Natural Language Processing (NLP) by popularizing word embeddings: dense numerical vectors that capture semantic and syntactic relationships between words.
The architecture of Word2Vec has two main variants:
- Continuous Bag of Words (CBOW) — predicts a target word from context words.
- Skip-Gram — predicts surrounding context words from a target word.
In this tutorial, we’ll focus on the Skip-Gram model, which is particularly effective for representing rare words and capturing more detailed contextual relationships.
We’ll explore its intuition, architecture, step-by-step working, a detailed example and dry run, and the advantages and disadvantages.
1. Intuition Behind Skip-Gram
The Skip-Gram model is the opposite of CBOW.
Instead of predicting a missing target word from surrounding context, it predicts the context words given a target word.
You can think of Skip-Gram as answering this question:
“Given a particular word, what words are likely to appear near it?”
For example, if the word “cat” appears in a sentence, Skip-Gram learns to predict nearby words such as “the”, “sat”, “on”, and “mat”.
By doing this across millions of sentences, the model learns that “cat” often appears in similar contexts as “dog”, so their vector representations become similar.
The model takes its name from skip-grams: word pairs formed by skipping over the words in between, so that a target word is paired with every neighbor inside a chosen window size.
2. Architecture of Skip-Gram
The Skip-Gram model has the same three-layer neural network structure as CBOW, but the data flow is reversed.
Input Layer
- The input layer represents the target word as a one-hot vector of vocabulary size V.
- For example, if there are 10,000 words in the vocabulary, the input vector is 10,000-dimensional, with only one element set to 1 (for the target word).
Hidden Layer
- The input vector is multiplied by a weight matrix (W) that maps it into a lower-dimensional vector space (the embedding space).
- The output of this layer is the embedding vector for the target word.
- There’s no activation function — it’s a simple linear transformation.
Output Layer
- The hidden layer vector is multiplied by a second weight matrix (W’) to generate a score for every word in the vocabulary.
- The scores are passed through a softmax function to produce probabilities.
- The goal is to assign higher probabilities to the actual context words and lower probabilities to others.
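To make this data flow concrete, here is a minimal NumPy sketch of one forward pass. The vocabulary size, embedding dimension, and random weights are illustrative assumptions, not values from a trained model.

```python
import numpy as np

V, N = 10_000, 300                             # vocabulary size, embedding dimension (assumed)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.01, size=(V, N))        # input-to-hidden weights (one row per word)
W_prime = rng.normal(scale=0.01, size=(N, V))  # hidden-to-output weights

target_idx = 42                                # index of the target word in the vocabulary
x = np.zeros(V)
x[target_idx] = 1.0                            # one-hot input vector

h = x @ W                                      # hidden layer: the embedding of the target word
scores = h @ W_prime                           # one raw score per vocabulary word
probs = np.exp(scores - scores.max())          # softmax, shifted for numerical stability
probs /= probs.sum()

print(probs.shape)                             # (10000,): a probability for every word
```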
3. Step-by-Step Working of Skip-Gram
Let’s understand how Skip-Gram works conceptually, step by step.
Step 1: Choose a Context Window
A context window size (C) determines how many words before and after the target word are considered its context.
For example, with C = 2 in the sentence “The cat sat on the mat”, the context of “sat” is “The”, “cat”, “on”, and “the”.
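Before the network sees anything, the corpus is turned into (target, context) pairs by sliding this window over each sentence. Here is a minimal sketch, assuming simple whitespace tokenization; the helper name skipgram_pairs is ours for illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within the window."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

tokens = "the cat sat on the mat".split()
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```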
Step 2: Select a Target Word
The model picks one word (for example, “sat”) as the target and tries to predict its neighboring context words within the defined window.
Step 3: Encode the Target Word
The target word is converted into a one-hot vector.
Step 4: Multiply with the Weight Matrix (Input to Hidden)
This one-hot vector is multiplied by a weight matrix (W), which retrieves the embedding for that word.
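A small sketch of Steps 3 and 4 together, showing that multiplying a one-hot vector by W is nothing more than selecting one row of W. The tiny vocabulary and random weights are assumptions for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
W = np.random.default_rng(0).normal(size=(len(vocab), 3))  # assumed 3-dimensional embeddings

target = "sat"
one_hot = np.zeros(len(vocab))
one_hot[vocab.index(target)] = 1.0                     # Step 3: one-hot encode the target

embedding = one_hot @ W                                # Step 4: input-to-hidden multiplication
assert np.allclose(embedding, W[vocab.index(target)])  # identical to a simple row lookup
```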
Step 5: Predict Each Context Word
The model uses the embedding of the target word to predict each of the surrounding words separately.
It multiplies the embedding by another matrix (W’) and applies a softmax function to get a probability distribution over all vocabulary words.
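Continuing the sketch with the same assumed shapes, Step 5 scores the embedding against W’ and normalizes the scores with a softmax:

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

rng = np.random.default_rng(0)
embedding = rng.normal(size=3)             # embedding of the target word (assumed values)
W_prime = rng.normal(size=(3, 5))          # hidden-to-output weights (assumed values)

probs = softmax(embedding @ W_prime)       # one probability per vocabulary word
print(probs, probs.sum())                  # the five probabilities sum to 1
```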
Step 6: Train and Update Weights
For each context word, the model compares the predicted probability distribution with the actual context word and measures the error, typically with a cross-entropy loss.
Using backpropagation, it updates both W and W’ to reduce prediction error.
This process repeats for all words in the corpus, refining embeddings with each iteration.
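Here is a hedged sketch of what one such update looks like for a single (target, context) pair, using the full softmax and a cross-entropy loss to keep the math transparent; real implementations speed this step up with approximations.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, N, lr = 5, 3, 0.05                          # vocab size, embedding size, learning rate (assumed)
W = rng.normal(scale=0.1, size=(V, N))         # input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights

target_idx, context_idx = 2, 1                 # assumed indices of the target and context words

h = W[target_idx]                              # Step 4: embedding lookup
probs = softmax(h @ W_prime)                   # Step 5: predicted distribution over the vocabulary
loss = -np.log(probs[context_idx])             # cross-entropy against the true context word

d_scores = probs.copy()                        # gradient of the loss w.r.t. the raw scores
d_scores[context_idx] -= 1.0
grad_h = W_prime @ d_scores                    # gradient flowing back to the embedding
W_prime -= lr * np.outer(h, d_scores)          # update the hidden-to-output weights
W[target_idx] -= lr * grad_h                   # update the target word's embedding

print(round(float(loss), 4))                   # the loss shrinks as this step is repeated
```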
4. Example and Dry Run
Let’s walk through a practical example to make it intuitive.
Sentence:
“The cat sits on the mat”
Vocabulary:
[the, cat, sits, on, mat]
Vocabulary size = 5
Window Size:
2 (two words before and after the target)
Target Word:
“sits”
Step 1: Identify Context Words
For the target “sits”, the context words are “the”, “cat”, “on”, and “the”.
Step 2: Input Vector for Target Word
We represent “sits” as a one-hot vector of size 5:
[0, 0, 1, 0, 0]
Step 3: Multiply with Weight Matrix (Input to Hidden)
Let’s assume the embedding size is 3.
Multiplying the one-hot vector by the input weight matrix (W) retrieves the 3-dimensional embedding of “sits”.
Suppose it gives us something like [0.25, 0.40, 0.55].
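A quick sketch of this lookup. The values in W below are made up so that the “sits” row matches the illustrative embedding above; a real model learns these numbers during training.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]

# Assumed 5x3 input weight matrix; only the "sits" row matters for this step.
W = np.array([
    [0.10, 0.20, 0.30],   # the
    [0.15, 0.35, 0.50],   # cat
    [0.25, 0.40, 0.55],   # sits
    [0.30, 0.10, 0.20],   # on
    [0.05, 0.45, 0.25],   # mat
])

one_hot = np.array([0, 0, 1, 0, 0])   # "sits"
print(one_hot @ W)                    # [0.25 0.4  0.55]: the embedding of "sits"
```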
Step 4: Predict Each Context Word
Now the model uses this embedding to predict each of the four context words.
It multiplies this embedding by the second weight matrix (W’) and applies the softmax function to get a probability distribution over all 5 vocabulary words.
Suppose for one context word, we get probabilities like:
- the: 0.45
- cat: 0.20
- sits: 0.05
- on: 0.20
- mat: 0.10
The model predicts “the” as one of the likely context words — which is correct.
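To see where such a distribution comes from, here is a sketch of the softmax step. The raw scores below are assumed values, chosen only so that the result roughly reproduces the illustrative probabilities listed above.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]

# In the full model these scores come from multiplying the "sits" embedding by W'.
scores = np.array([-0.80, -1.61, -3.00, -1.61, -2.30])

probs = np.exp(scores - scores.max())   # softmax, shifted for numerical stability
probs /= probs.sum()
print({w: round(float(p), 2) for w, p in zip(vocab, probs)})
# {'the': 0.45, 'cat': 0.2, 'sits': 0.05, 'on': 0.2, 'mat': 0.1}
```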
Step 5: Repeat for All Context Words
This prediction process is repeated for all four context words (“the”, “cat”, “on”, and “the”).
Each time, the model adjusts its weight matrices using backpropagation to make predictions more accurate.
Step 6: Learned Embeddings
After training on large text data, the rows of the first weight matrix (W) become the final word embeddings.
These embeddings capture the meaning of words based on the contexts in which they appear.
For instance, “cat” and “dog” may end up with very similar embeddings since they often occur in similar contexts like “pet”, “animal”, or “bark/meow”.
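In practice you rarely hand-code this training loop; a library such as Gensim trains Skip-Gram directly (sg=1 selects Skip-Gram instead of CBOW). A minimal usage sketch, assuming Gensim 4.x; the toy corpus is far too small to learn meaningful vectors and only illustrates the API.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

model = Word2Vec(
    sentences,
    sg=1,            # 1 = Skip-Gram, 0 = CBOW
    vector_size=50,  # embedding dimension
    window=2,        # context window size
    min_count=1,     # keep every word, even one that appears only once
    epochs=50,
)

print(model.wv["cat"])                # the learned embedding for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```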
5. Advantages of Skip-Gram
1. Excellent for Rare Words
Skip-Gram learns meaningful representations even for infrequent words because each occurrence of a word generates its own (target, context) training pairs, so a rare word's vector receives dedicated updates instead of being averaged together with other context words as in CBOW.
2. Captures Complex Relationships
By predicting multiple context words from one target word, Skip-Gram captures richer and more nuanced word relationships than CBOW.
3. Produces High-Quality Embeddings
Skip-Gram embeddings often outperform CBOW in capturing semantic regularities such as analogies — for example,
“king – man + woman ≈ queen”.
4. Handles Small Datasets Well
Because every (target, context) pair is treated as a separate training example, Skip-Gram extracts more training signal from a limited corpus and tends to perform better than CBOW when the training data is small.
6. Disadvantages of Skip-Gram
1. Slower Training
Skip-Gram predicts multiple context words for every target word, which makes training slower compared to CBOW.
2. Less Stable for Frequent Words
It can sometimes produce noisier embeddings for frequent words since it updates them very often.
3. Ignores Word Order
Like CBOW, Skip-Gram treats all context words as independent of order.
Thus, it doesn’t distinguish between sentences like “dog bites man” and “man bites dog”.
4. Limited Context Window
It still only considers a fixed-size context window and cannot model long-distance word dependencies effectively.
7. Applications of Skip-Gram Word Embeddings
Skip-Gram embeddings are widely used in modern NLP applications, including:
- Sentiment Analysis
- Machine Translation
- Question Answering Systems
- Semantic Search
- Chatbots
- Document Similarity and Clustering
8. Summary
The Skip-Gram model in Word2Vec predicts the surrounding words given a target word.
It learns high-quality word embeddings that capture both semantic (meaning-based) and syntactic (structure-based) relationships between words.
While it trains more slowly than CBOW, Skip-Gram performs better for rare words and captures deeper contextual relationships, making it highly valuable for building robust word embeddings in NLP systems.
In essence, Skip-Gram teaches a model to “understand” the linguistic patterns of a language simply by predicting how words co-occur — without any manual supervision.
