In this tutorial, we will explore how data ingestion works in LangChain — the very first step in building any Generative AI pipeline.
LangChain provides a structured way to load, transform, and store data so that it can later be used for retrieval-based question answering, semantic search, or chatbot applications powered by large language models (LLMs).
We will dive deeply into the data ingestion phase, where data from multiple sources (text files, PDFs and web pages) is loaded and converted into a standard document format.
1. The Generative AI Application Flow
- Data Ingestion – Loading data from different sources (text, PDFs, web, APIs, etc.) into LangChain’s document format.
- Data Transformation – Cleaning and preprocessing the data for uniformity.
- Vectorization – Converting the text into numerical vectors using embedding models.
- Vector Store – Saving vectors into a database such as FAISS, Pinecone, or Chroma.
- Retrieval Chain – Creating a retrieval mechanism that fetches relevant documents during a query.
- Prompt Construction – Designing structured prompts that combine retrieved context with the user’s question.
- LLM Querying – Sending the final prompt to the LLM to generate the response.
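To make the later stages concrete, here is a minimal sketch of steps 3, 4, and 5 (vectorization, vector store, retrieval). It assumes documents have already been ingested with a loader (covered below), that the langchain-openai and faiss-cpu packages are installed, and that an OpenAI API key is configured; exact chain APIs vary between LangChain versions.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# `documents` is a list of LangChain Documents produced by a loader (see the examples below).
embeddings = OpenAIEmbeddings()                              # step 3: embedding model
vector_store = FAISS.from_documents(documents, embeddings)   # step 4: vector store
retriever = vector_store.as_retriever()                      # step 5: retrieval mechanism
relevant_docs = retriever.invoke("What is data ingestion?")  # fetch relevant context for a query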
2. What is Data Ingestion in LangChain?
Data ingestion is the process of loading data from multiple formats and sources into a uniform structure that LangChain understands — called a Document.
A Document in LangChain typically contains:
- page_content – the text data or content
- metadata – information about the source such as filename, URL, or page number
For example:
- A .txt file becomes a single document.
- A .pdf with multiple pages becomes a list of documents (one per page).
- A web page becomes a document containing its text content and source URL.
LangChain supports numerous data ingestion methods through Document Loaders, which can handle files, APIs, databases, and more.
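Here is a minimal sketch of what that structure looks like, building a Document by hand; the content and metadata values are made up for illustration.
from langchain_core.documents import Document
# A Document is simply text (page_content) plus metadata describing where it came from.
doc = Document(
    page_content="LangChain makes it easy to ingest data from many sources.",
    metadata={"source": "example_notes.txt"},
)
print(doc.page_content)
print(doc.metadata)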
3. Document Loaders in LangChain
Document Loaders are specialized tools that help you load content from various data sources into LangChain’s document format.
Some of the most commonly used loaders include:
- TextLoader – Reads plain text files (.txt)
- PyPDFLoader – Loads content from PDF files
- WebBaseLoader – Scrapes and extracts text from web pages
- ArxivLoader – Loads academic research papers directly from arXiv.org
- WikipediaLoader – Fetches articles from Wikipedia
- CSVLoader, ImageLoader, and many others for structured or multimedia data.
Each loader abstracts away the complexity of reading files or scraping web pages, providing a standardized interface to retrieve content.
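The text, PDF, and web loaders are demonstrated in the sections below. As a quick illustration that the other loaders share the same interface, here is a hedged sketch using WikipediaLoader; it assumes the wikipedia package is installed, and the query string is arbitrary.
from langchain_community.document_loaders import WikipediaLoader
# Every loader exposes the same load() method and returns a list of Documents.
wiki_loader = WikipediaLoader(query="Large language model", load_max_docs=1)
wiki_docs = wiki_loader.load()
print("Wikipedia documents loaded:", len(wiki_docs))
print(wiki_docs[0].metadata.get("title"))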
4. Common Data Ingestion Examples
Let’s go through each type of data source conceptually, step-by-step.
4.1 Loading a Text File
When you have plain .txt files, LangChain can easily convert them into a Document object.
All the text content becomes part of the page_content, and metadata contains the file name and path.
Typical use case:
- Reading meeting transcripts, reports, or textual notes.
Process
- Specify the file path (e.g., speech.txt).
- Use the TextLoader to read and load the content.
- LangChain automatically stores it as a single Document.
# Standard library helper for pretty-printing metadata
from pprint import pprint
from langchain_community.document_loaders import TextLoader
# Example: load a simple text file (speech.txt)
text_file_path = "speech.txt" # ensure this file exists
# Initialize loader and load documents
loader = TextLoader(text_file_path)
documents = loader.load()
# documents is typically a list with a single Document (for a small txt file)
print("Loaded text documents count:", len(documents))
print("--- sample document metadata ---")
pprint(documents[0].metadata)
print("--- sample content (first 800 chars) ---")
print(documents[0].page_content[:800])
Output
Loaded text documents count: 1
--- sample document metadata ---
{'source': 'speech.txt'}
--- sample content (first 800 chars) ---
this is Some
sample
text in the
sample text file
4.2 Loading a PDF File
PDFs are often used for reports, research papers, and contracts.
LangChain handles PDFs using PyPDFLoader, where each page is treated as a separate document.
Process
- Provide the PDF file path (e.g., attention.pdf).
- LangChain reads each page individually.
- Every page becomes a separate Document with its own metadata (page number and file name).
from langchain_community.document_loaders import PyPDFLoader
# Example: load a PDF document where each page becomes a Document
pdf_path = "sample.pdf" # put the file here
pdf_loader = PyPDFLoader(pdf_path)
pdf_docs = pdf_loader.load()
print("Pages loaded as documents:", len(pdf_docs))
print("--- metadata for page 1 ---")
pprint(pdf_docs[0].metadata)
print("--- page 1 content (first 600 chars) ---")
print(pdf_docs[0].page_content[:600])
Output
Pages loaded as documents: 2
--- metadata for page 1 ---
{'author': 'Microsoft account',
'creationdate': '2025-11-08T23:02:19+05:30',
'creator': 'Microsoft® Word 2013',
'moddate': '2025-11-08T23:02:19+05:30',
'page': 0,
'page_label': '1',
'producer': 'Microsoft® Word 2013',
'source': 'sample.pdf',
'total_pages': 2}
--- page 1 content (first 600 chars) ---
th i s is Som e
sa m p l e
text i n the
sa m p l e text fil e
Page 1
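Because each page is a separate Document, downstream steps sometimes need the whole file as one string (for example, before chunking). A minimal sketch, assuming the pdf_docs list from the example above:
# Join all page contents into a single string, preserving page order.
full_text = "\n".join(doc.page_content for doc in pdf_docs)
print("Total characters across all pages:", len(full_text))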
4.3 Loading from the Web
If you want to pull content directly from a website, LangChain provides the WebBaseLoader.
It uses BeautifulSoup under the hood for HTML parsing.
Process
- Provide one or multiple URLs.
- The loader fetches the HTML content.
- BeautifulSoup cleans and extracts only relevant portions of the text (you can specify which classes or tags to focus on).
Use Case
- Scraping blog posts, product descriptions, or documentation pages.
Customization
You can specify:
- Which HTML tags or CSS classes to extract.
- Whether to remove extra spaces or newline characters.
- Which elements to exclude (e.g., headers, footers).
from langchain_community.document_loaders import WebBaseLoader
# Example: load a web page (single URL) with default extraction
url = "https://example.com/some-article" # replace with a real URL you want to test
web_loader = WebBaseLoader(url) # default extraction for main text
web_docs = web_loader.load()
print("Web pages loaded:", len(web_docs))
for i, d in enumerate(web_docs):
    print(f"--- doc {i} metadata ---")
    pprint(d.metadata)
    print("content snippet:", d.page_content[:600])
    print()
Output
Web pages loaded: 1
--- doc 0 metadata ---
{'language': 'en',
'source': 'https://example.com/some-article',
'title': 'Example Domain'}
content snippet: Example DomainExample DomainThis domain is for use in documentation examples without needing permission. Avoid use in operations.Learn more
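Building on the customization options listed above, here is a hedged sketch that restricts extraction to specific CSS classes by passing a BeautifulSoup SoupStrainer through bs_kwargs; the class names post-title and post-content are placeholder assumptions and must match the target page's actual HTML.
import bs4
from langchain_community.document_loaders import WebBaseLoader
# bs_kwargs is forwarded to BeautifulSoup, so parse_only keeps only the matching elements.
filtered_loader = WebBaseLoader(
    web_paths=["https://example.com/some-article"],  # replace with a real URL
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content"))),
)
filtered_docs = filtered_loader.load()
print(filtered_docs[0].page_content[:300])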
