Converting Text Chunks into Vector Embeddings Using HuggingFace

In the previous sessions, you learned:

  1. How to load data using LangChain’s document loaders.
  2. How to split documents into smaller, meaningful chunks using text splitters.

We now move to Step 3, where you convert these text chunks into vector embeddings.
These embeddings are the foundation of modern retrieval systems, semantic search, and RAG (Retrieval-Augmented Generation) applications.

This tutorial focuses entirely on HuggingFace embeddings, which are free and run locally.

What Are Embeddings?

Embeddings are one of the most important concepts in modern AI systems—especially in search engines, chatbots, recommendation engines, and RAG-based applications.

At their core, embeddings are numerical representations of text, usually expressed as high-dimensional vectors (long lists of floating-point numbers).
Each vector captures the meaning of the text it represents.

In traditional systems, computers treated text as plain strings and could not understand meaning. For example, the words “bank” (river bank) and “bank” (financial institution) look identical as strings, even though their meanings differ entirely.
Similarly, the sentences:

  • “The car is parked in the garage.”
  • “The vehicle is inside the parking area.”

are different strings but carry very similar meanings.

Embeddings solve this problem.

How Do Embeddings Work?

Embeddings map text into a numerical vector space where:

  • similar meanings → vectors placed close together
  • different meanings → vectors placed far apart

This concept is called semantic similarity.

The process is powered by large neural network models trained on massive datasets. These models learn relationships between words, sentences, and concepts.

For example, an embedding for:

“India won the cricket match today.”

will be close to embeddings for:

“The Indian team secured victory.”
“Cricket match results favor India.”

But far from:

“The stock market fell significantly.”

This allows AI systems to understand what you mean, not just what you type.
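
To see this in action, here is a minimal sketch using the sentence-transformers library (installed later in this tutorial) and the all-MiniLM-L6-v2 model. The exact scores depend on the model, but the first pair should score much higher than the second:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "India won the cricket match today.",
    "The Indian team secured victory.",
    "The stock market fell significantly.",
]

vectors = model.encode(sentences)

# Similar meanings -> high cosine similarity; unrelated meanings -> low
print(util.cos_sim(vectors[0], vectors[1]).item())
print(util.cos_sim(vectors[0], vectors[2]).item())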


Why Are Embeddings So Important?

Embeddings enable many advanced capabilities:

1. Semantic Search

Instead of searching for exact keywords, systems look for meaning.
For example, searching “car repair tips” will also match content containing “automobile maintenance.”

2. Document Similarity

You can measure how similar two documents are by simply comparing their vectors.

3. Context Retrieval for Chatbots (RAG)

When building an LLM-powered assistant, you must feed relevant context to the model.
Embeddings help the system find the most relevant text chunks.

4. Clustering and Categorization

Documents with similar meaning form natural groups in vector space.

5. Recommendations

Embedding similarity is often used in:

  • product recommendation engines
  • article recommendation
  • music recommendation

6. Detecting Duplicates

Two texts with different wording but the same meaning will have highly similar embeddings.


What Does an Embedding Look Like?

An embedding is typically a list (vector) of numbers:

[0.12, -0.44, 0.03, 1.27, ...]

Depending on the model, the dimension may vary:

  • 768 dimensions
  • 1024 dimensions
  • 1536 dimensions
  • 3072 dimensions

Higher dimensions often capture more nuance and meaning.

OpenAI’s text-embedding-3-large produces 3072-dimensional vectors, while custom dimension settings (e.g., 1024 or 256) offer a trade-off between accuracy and storage cost.
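
You can inspect an embedding yourself with a few lines of code. This sketch assumes the all-MiniLM-L6-v2 model used later in this tutorial, which produces 384-dimensional vectors:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("The car is parked in the garage.")

print(len(vector))    # 384 for this model
print(vector[:5])     # the first few floating-point values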


Why Do We Need High-Dimensional Vectors?

Human language is incredibly rich and complex.
A high-dimensional space allows the model to encode various aspects of meaning, such as:

  • topic
  • sentiment
  • intent
  • context
  • relationships
  • entities
  • grammar patterns
  • semantic roles

Each dimension holds a learned pattern that contributes to the meaning representation.

Think of embeddings as coordinates in a semantic universe where meaning is encoded mathematically.


Embeddings in Vector Databases

Once embeddings are generated, they are stored in specialized databases called vector stores or vector databases (e.g., Chroma, Pinecone, Weaviate, Qdrant).

These databases allow fast similarity searches using metrics such as:

  • cosine similarity
  • dot-product similarity
  • Euclidean distance

This makes them the backbone of retrieval-augmented LLM applications.
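
For intuition, the same metrics can be computed by hand with NumPy. The vectors below are toy 3-dimensional examples, not real embeddings, and production vector databases use optimized approximate-nearest-neighbor indexes rather than plain arithmetic like this:

import numpy as np

a = np.array([0.12, -0.44, 0.03])
b = np.array([0.10, -0.40, 0.05])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 = same direction
dot_product = np.dot(a, b)                                        # unnormalized similarity
euclidean = np.linalg.norm(a - b)                                 # 0.0 = identical vectors

print(cosine, dot_product, euclidean)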

Why HuggingFace Embeddings?

HuggingFace embeddings are:

  • Free and open source
  • Accurate for semantic similarity tasks
  • Fast on CPU (GPU optional)
  • Widely used in real-world RAG applications
  • Easy to integrate with LangChain

Installing Required Dependencies

Install the necessary libraries:

pip install sentence-transformers
pip install langchain-community
pip install langchain-text-splitters
pip install chromadb

These give you:

  • SentenceTransformers (HuggingFace embeddings)
  • LangChain loaders and vector stores
  • Chroma vector database

Because the model runs locally, you do not need python-dotenv or an API key.

Initializing HuggingFace Embeddings

LangChain provides a ready-to-use class:

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

This creates embeddings with 384 dimensions.
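
You can verify the dimension by embedding a sample query and a couple of documents with the standard LangChain methods embed_query and embed_documents:

query_vector = embeddings.embed_query("What did the speaker say about technology?")
print(len(query_vector))                      # 384

doc_vectors = embeddings.embed_documents(["first chunk", "second chunk"])
print(len(doc_vectors), len(doc_vectors[0]))  # 2 vectors, each with 384 dimensions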


Complete Workflow

We will:

  1. Load a speech.txt file
  2. Split the text into chunks
  3. Generate HuggingFace embeddings
  4. Store them in Chroma DB
  5. Run similarity search

This is the foundation of a RAG system.

Complete Program

# ---------------------------------------------
# 1. Import Required Libraries
# ---------------------------------------------
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# ---------------------------------------------
# 2. Load the Document
# ---------------------------------------------
loader = TextLoader("speech.txt")   # Make sure this file exists
documents = loader.load()

print(f"Loaded {len(documents)} document(s).")

# ---------------------------------------------
# 3. Split Into Chunks
# ---------------------------------------------
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

final_docs = splitter.split_documents(documents)
print(f"Split into {len(final_docs)} chunks.")

# ---------------------------------------------
# 4. Initialize HuggingFace Embeddings (FREE)
# ---------------------------------------------
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("Embedding model loaded successfully!")

# ---------------------------------------------
# 5. Store Chunks in Chroma Vector Database
# ---------------------------------------------
db = Chroma.from_documents(
    documents=final_docs,
    embedding=embeddings
)

print("Chroma vector store created successfully!")

# ---------------------------------------------
# 6. Perform a Similarity Search
# ---------------------------------------------
query = "What did the speaker say about technology?"

results = db.similarity_search(query, k=3)

print("\n--- Retrieved Results ---\n")

for idx, r in enumerate(results, start=1):
    print(f"Result #{idx}:")
    print(r.page_content)
    print("-------------------------")

The following is the sample speech.txt used in this example.

Ladies and gentlemen,

Thank you for joining us today. We are living in a remarkable time, a time where technology is advancing faster than ever before. Artificial intelligence, renewable energy, and healthcare innovations are transforming our world in ways we could not have imagined a decade ago.

However, with all this rapid progress, we also face new challenges. We must ensure that technology is used responsibly, ethically, and for the benefit of all people. Innovation should not only solve problems—it should create opportunities.

Education will play a key role in shaping the next generation. We need systems that encourage curiosity, creativity, and critical thinking. The future belongs to those who are prepared to learn, adapt, and grow.

As we move forward, let us commit to building a world where technological progress and human values go hand in hand. Together, we can create a future that is both intelligent and compassionate.

Thank you.

The following are the retrieved results:

--- Retrieved Results ---

Result #1:
Ladies and gentlemen,

Thank you for joining us today. We are living in a remarkable time, a time where technology is advancing faster than ever before. Artificial intelligence, renewable energy, and healthcare innovations are transforming our world in ways we could not have imagined a decade ago.
-------------------------
Result #2:
However, with all this rapid progress, we also face new challenges. We must ensure that technology is used responsibly, ethically, and for the benefit of all people. Innovation should not only solve problems—it should create opportunities.

Education will play a key role in shaping the next generation. We need systems that encourage curiosity, creativity, and critical thinking. The future belongs to those who are prepared to learn, adapt, and grow.
-------------------------
Result #3:
As we move forward, let us commit to building a world where technological progress and human values go hand in hand. Together, we can create a future that is both intelligent and compassionate.

Thank you.
-------------------------