Working with ChromaDB Using LangChain + Hugging Face Embeddings

Vector databases play a crucial role in modern LLM-powered applications. Whenever we want to store, search, or retrieve information semantically (using meaning instead of keywords), we rely on vector stores. In this tutorial, we will focus on ChromaDB, one of the most popular and developer-friendly open-source vector databases.

This guide is written in simple language and includes a complete, working program that runs locally using free Hugging Face embeddings.

1. What Is ChromaDB?

ChromaDB is:

  • An AI-native, open-source vector database
  • Designed for LLM applications, semantic search, and RAG pipelines
  • Built for developer productivity and happiness
  • Licensed under Apache 2.0
  • Able to persist data to disk using SQLite
  • Fast, lightweight, and local-first

ChromaDB is widely used in:

  • ChatGPT-style chatbots
  • RAG (Retrieval Augmented Generation)
  • Search engines
  • Document understanding applications

2. Installing ChromaDB and Required Libraries

To use ChromaDB with LangChain, we need to install several components:

  • langchain → core framework
  • chromadb → core vector DB
  • langchain-community → Chroma integration
  • langchain-text-splitters → for chunking
  • sentence-transformers → for Hugging Face embeddings
  • torch → required by transformer models

Update your requirements.txt:

langchain
langchain-community
langchain-text-splitters
chromadb
sentence-transformers
torch

Install everything:

pip install -r requirements.txt

3. Understanding the Workflow

The ChromaDB pipeline looks like this:

  1. Load the document (speech.txt)
  2. Split the document into smaller chunks
  3. Generate embeddings using a model
    (we use sentence-transformers/all-MiniLM-L6-v2)
  4. Store embeddings inside ChromaDB
  5. Persist the vector DB to disk (SQLite)
  6. Reload it later
  7. Perform similarity search
  8. Use retriever for RAG-style tasks

This is exactly what we implement in the complete code below.

4. Document Loading and Splitting

We begin by loading speech.txt using LangChain’s TextLoader.
Then we split it using RecursiveCharacterTextSplitter.

  • chunk_size=500
  • chunk_overlap=50
    (Overlap helps preserve context between chunks)

This prepares the text for embedding.
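
As a standalone sketch, this step looks like the following (assuming speech.txt sits in the working directory):

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the raw text file into LangChain Document objects
docs = TextLoader("speech.txt").load()

# Split into overlapping chunks so each piece keeps some surrounding context
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print("Total chunks:", len(chunks))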

5. Embeddings with Hugging Face

We use:

sentence-transformers/all-MiniLM-L6-v2

Advantages:

  • Free
  • Works offline
  • Fast on CPU
  • High-quality embeddings

LangChain’s HuggingFaceEmbeddings wrapper makes it easy to use. (Recent LangChain releases move this class into the langchain-huggingface package; the langchain-community import used below still works but emits a deprecation warning.)
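
A quick way to sanity-check the model in isolation (embed_query returns a plain list of floats, and this model produces 384-dimensional vectors):

from langchain_community.embeddings import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},               # run on CPU
    encode_kwargs={"normalize_embeddings": True}  # unit-length vectors
)

vector = embedding.embed_query("a sentence to embed")
print(len(vector))  # 384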

6. Creating the Vector Store

We feed our chunks + embeddings into ChromaDB using:

Chroma.from_documents()

Important parameters:

  • persist_directory → folder where Chroma stores its SQLite DB
  • collection_name → name of the dataset
  • embedding → embedding model instance

Chroma automatically:

  • Stores vectors
  • Builds an ANN (approximate nearest neighbor) index
  • Saves metadata
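
In isolation, the call looks like this (a sketch reusing chunks and embedding from the earlier steps):

from langchain_community.vectorstores import Chroma

vector_db = Chroma.from_documents(
    documents=chunks,                    # the split documents
    embedding=embedding,                 # the HuggingFaceEmbeddings instance
    persist_directory="chroma_hf_db",    # where the SQLite DB is written
    collection_name="speech_collection"
)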

7. Persisting Chroma to Disk

Chroma stores data using:

  • A folder structure
  • A chroma.sqlite3 file

This allows:

  • Reloading the DB anytime
  • Reusing it without recomputing embeddings
  • Hosting the DB on servers or cloud
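
Reloading is just the Chroma constructor pointed at the same folder. Note the keyword difference: the constructor takes embedding_function, while from_documents takes embedding:

from langchain_community.vectorstores import Chroma

# Reopen the persisted DB without recomputing any embeddings
vector_db = Chroma(
    collection_name="speech_collection",
    embedding_function=embedding,
    persist_directory="chroma_hf_db"
)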

8. Similarity Search & Retriever

Two powerful features:

similarity_search(query)

Returns the k most relevant chunks for a query (k defaults to 4).

as_retriever()

Wraps the store in LangChain’s standard retriever interface, the building block for RAG chains.
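
Both in one short sketch (assuming the vector_db from the previous section):

# Top-k semantic search
docs = vector_db.similarity_search("What does the speaker say about the war?", k=2)
print(docs[0].page_content)

# Retriever interface, as used by RAG chains
retriever = vector_db.as_retriever(search_kwargs={"k": 1})
results = retriever.invoke("What is the main message of the speech?")
print(results[0].page_content)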

9. Complete Program

"""
ChromaDB + HuggingFace Embeddings Example
-----------------------------------------

This program:
- Loads speech.txt
- Splits text into chunks
- Creates embeddings using Hugging Face free model
- Builds Chroma vector store
- Saves it to disk
- Reloads it again
- Performs similarity search
- Uses retriever interface

Works on Windows (no Ollama, no OpenAI).
"""

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings


# ----------------------------
# CONFIGURATION
# ----------------------------
SPEECH_FILE = "speech.txt"
PERSIST_DIR = "chroma_hf_db"
COLLECTION_NAME = "speech_collection"

# Hugging Face embedding model (FREE + FAST)
HF_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def load_document(file_path):
    print("[1] Loading document:", file_path)
    loader = TextLoader(file_path)
    docs = loader.load()
    print("    Loaded", len(docs), "document(s).")
    return docs


def split_documents(docs):
    print("[2] Splitting into chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )
    chunks = splitter.split_documents(docs)
    print("    Total chunks:", len(chunks))
    return chunks


def create_embeddings():
    print("[3] Loading HuggingFace embedding model:", HF_MODEL)

    # This downloads the model only ONCE
    embedding = HuggingFaceEmbeddings(
        model_name=HF_MODEL,
        model_kwargs={'device': 'cpu'},      # use CPU
        encode_kwargs={'normalize_embeddings': True}
    )

    print("    Embedding model ready.")
    return embedding


def build_vectorstore(chunks, embeddings):
    print("[4] Creating Chroma vector DB and saving to disk...")

    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_name=COLLECTION_NAME
    )

    vector_db.persist()
    print("    Saved to:", PERSIST_DIR)
    return vector_db


def similarity_search(vector_db):
    print("[5] Running similarity search...")

    query = "What does the speaker say about entering the war?"
    docs = vector_db.similarity_search(query, k=2)

    print("\nQUERY:", query)
    print("\nRESULT:\n", docs[0].page_content)


def reload_vectorstore(embeddings):
    print("[6] Reloading stored Chroma DB...")

    vector_db = Chroma(
        collection_name=COLLECTION_NAME,
        embedding_function=embeddings,
        persist_directory=PERSIST_DIR
    )

    print("    Reloaded successfully!")
    return vector_db


def retriever_example(vector_db):
    print("[7] Retriever example...")

    retriever = vector_db.as_retriever(search_kwargs={"k": 1})
    query = "What is the main message of the speech?"

    result = retriever.invoke(query)

    print("\nQUERY:", query)
    print("\nRETRIEVED:\n", result[0].page_content)


def main():
    docs = load_document(SPEECH_FILE)
    chunks = split_documents(docs)
    embeddings = create_embeddings()

    db = build_vectorstore(chunks, embeddings)
    similarity_search(db)

    db2 = reload_vectorstore(embeddings)
    retriever_example(db2)

    print("\n[✔] Completed successfully")


if __name__ == "__main__":
    main()