In this tutorial, we will explore how data ingestion works in LangChain — the very first step in building any Generative AI pipeline.
LangChain provides a structured way to load, transform, and store data so that it can later be used for retrieval-based question answering, semantic search, or chatbot applications powered by large language models (LLMs).
We will dive deeply into the data ingestion phase, where data from multiple sources (text files, PDFs and web pages) is loaded and converted into a standard document format.
1. The Generative AI Application Flow
- Data Ingestion – Loading data from different sources (text, PDFs, web, APIs, etc.) into LangChain’s document format.
- Data Transformation – Cleaning and preprocessing the data for uniformity.
- Vectorization – Converting the text into numerical vectors using embedding models.
- Vector Store – Saving vectors into a database such as FAISS, Pinecone, or Chroma.
- Retrieval Chain – Creating a retrieval mechanism that fetches relevant documents during a query.
- Prompt Construction – Designing structured prompts that combine retrieved context with the user’s question.
- LLM Querying – Sending the final prompt to the LLM to generate the response.
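To make the later stages concrete, here is a minimal sketch of steps 3, 4, and 5 (vectorization, vector store, retrieval). It assumes documents have already been ingested with a loader (covered below), that the langchain-openai and faiss-cpu packages are installed, and that an OpenAI API key is configured; exact chain APIs vary between LangChain versions.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# `documents` is a list of LangChain Documents produced by a loader (see the examples below).
embeddings = OpenAIEmbeddings()                              # step 3: embedding model
vector_store = FAISS.from_documents(documents, embeddings)   # step 4: vector store
retriever = vector_store.as_retriever()                      # step 5: retrieval mechanism
relevant_docs = retriever.invoke("What is data ingestion?")  # fetch relevant context for a query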
2. What is Data Ingestion in LangChain?
Data ingestion is the process of loading data from multiple formats and sources into a uniform structure that LangChain understands — called a Document.
A Document in LangChain typically contains:
- page_content – the text data or content
- metadata – information about the source such as filename, URL, or page number
For example:
- A .txt file becomes a single document.
- A .pdf with multiple pages becomes a list of documents (one per page).
- A web page becomes a document containing its text content and source URL.
LangChain supports numerous data ingestion methods through Document Loaders, which can handle files, APIs, databases, and more.
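Here is a minimal sketch of what that structure looks like, building a Document by hand; the content and metadata values are made up for illustration.
from langchain_core.documents import Document
# A Document is simply text (page_content) plus metadata describing where it came from.
doc = Document(
    page_content="LangChain makes it easy to ingest data from many sources.",
    metadata={"source": "example_notes.txt"},
)
print(doc.page_content)
print(doc.metadata)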
3. Document Loaders in LangChain
Document Loaders are specialized tools that help you load content from various data sources into LangChain’s document format.
Some of the most commonly used loaders include:
- TextLoader – Reads plain text files (.txt)
- PyPDFLoader – Loads content from PDF files
- WebBaseLoader – Scrapes and extracts text from web pages
- ArxivLoader – Loads academic research papers directly from arXiv.org
- WikipediaLoader – Fetches articles from Wikipedia
- CSVLoader, ImageLoader, and many others for structured or multimedia data.
Each loader abstracts away the complexity of reading files or scraping web pages, providing a standardized interface to retrieve content.
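The text, PDF, and web loaders are demonstrated in the sections below. As a quick illustration that the other loaders share the same interface, here is a hedged sketch using WikipediaLoader; it assumes the wikipedia package is installed, and the query string is arbitrary.
from langchain_community.document_loaders import WikipediaLoader
# Every loader exposes the same load() method and returns a list of Documents.
wiki_loader = WikipediaLoader(query="Large language model", load_max_docs=1)
wiki_docs = wiki_loader.load()
print("Wikipedia documents loaded:", len(wiki_docs))
print(wiki_docs[0].metadata.get("title"))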
4. Common Data Ingestion Examples
Let’s go through each type of data source conceptually, step-by-step.
4.1 Loading a Text File
When you have plain .txt files, LangChain can easily convert them into a Document object.
All the text content becomes part of the page_content, and metadata contains the file name and path.
Typical use case:
- Reading meeting transcripts, reports, or textual notes.
Process
- Specify the file path (e.g., speech.txt).
- Use the TextLoader to read and load the content.
- LangChain automatically stores it as a single Document.
# Standard library helper for pretty-printing metadata
from pprint import pprint
from langchain_community.document_loaders import TextLoader
# Example: load a simple text file (speech.txt)
text_file_path = "speech.txt" # ensure this file exists
# Initialize loader and load documents
loader = TextLoader(text_file_path)
documents = loader.load()
# documents is typically a list with a single Document (for a small txt file)
print("Loaded text documents count:", len(documents))
print("--- sample document metadata ---")
pprint(documents[0].metadata)
print("--- sample content (first 800 chars) ---")
print(documents[0].page_content[:800])
Output
Loaded text documents count: 1
--- sample document metadata ---
{'source': 'speech.txt'}
--- sample content (first 800 chars) ---
this is Some
sample
text in the
sample text file
4.2 Loading a PDF File
PDFs are often used for reports, research papers, and contracts.
LangChain handles PDFs using PyPDFLoader, where each page is treated as a separate document.
Process
- Provide the PDF file path (e.g., attention.pdf).
- LangChain reads each page individually.
- Every page becomes a separate Document with its own metadata (page number and file name).
from langchain_community.document_loaders import PyPDFLoader
# Example: load a PDF document where each page becomes a Document
pdf_path = "sample.pdf" # put the file here
pdf_loader = PyPDFLoader(pdf_path)
pdf_docs = pdf_loader.load()
print("Pages loaded as documents:", len(pdf_docs))
print("--- metadata for page 1 ---")
pprint(pdf_docs[0].metadata)
print("--- page 1 content (first 600 chars) ---")
print(pdf_docs[0].page_content[:600])
Output
Pages loaded as documents: 2
--- metadata for page 1 ---
{'author': 'Microsoft account',
'creationdate': '2025-11-08T23:02:19+05:30',
'creator': 'Microsoft® Word 2013',
'moddate': '2025-11-08T23:02:19+05:30',
'page': 0,
'page_label': '1',
'producer': 'Microsoft® Word 2013',
'source': 'sample.pdf',
'total_pages': 2}
--- page 1 content (first 600 chars) ---
th i s is Som e
sa m p l e
text i n the
sa m p l e text fil e
Page 1
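Because each page is a separate Document, downstream steps sometimes need the whole file as one string (for example, before chunking). A minimal sketch, assuming the pdf_docs list from the example above:
# Join all page contents into a single string, preserving page order.
full_text = "\n".join(doc.page_content for doc in pdf_docs)
print("Total characters across all pages:", len(full_text))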
4.3 Loading from the Web
If you want to pull content directly from a website, LangChain provides the WebBaseLoader.
It uses BeautifulSoup under the hood for HTML parsing.
Process
- Provide one or multiple URLs.
- The loader fetches the HTML content.
- BeautifulSoup cleans and extracts only relevant portions of the text (you can specify which classes or tags to focus on).
Use Case
- Scraping blog posts, product descriptions, or documentation pages.
Customization
You can specify:
- Which HTML tags or CSS classes to extract.
- Whether to remove extra spaces or newline characters.
- Which elements to exclude (e.g., headers, footers).
from langchain_community.document_loaders import WebBaseLoader
# Example: load a web page (single URL) with default extraction
url = "https://example.com/some-article" # replace with a real URL you want to test
web_loader = WebBaseLoader(url) # default extraction for main text
web_docs = web_loader.load()
print("Web pages loaded:", len(web_docs))
for i, d in enumerate(web_docs):
    print(f"--- doc {i} metadata ---")
    pprint(d.metadata)
    print("content snippet:", d.page_content[:600])
    print()
Output
Web pages loaded: 1
--- doc 0 metadata ---
{'language': 'en',
'source': 'https://example.com/some-article',
'title': 'Example Domain'}
content snippet: Example DomainExample DomainThis domain is for use in documentation examples without needing permission. Avoid use in operations.Learn more
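Building on the customization options listed above, here is a hedged sketch that restricts extraction to specific CSS classes by passing a BeautifulSoup SoupStrainer through bs_kwargs; the class names post-title and post-content are placeholder assumptions and must match the target page's actual HTML.
import bs4
from langchain_community.document_loaders import WebBaseLoader
# bs_kwargs is forwarded to BeautifulSoup, so parse_only keeps only the matching elements.
filtered_loader = WebBaseLoader(
    web_paths=["https://example.com/some-article"],  # replace with a real URL
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content"))),
)
filtered_docs = filtered_loader.load()
print(filtered_docs[0].page_content[:300])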
