
Text Splitting Techniques – RecursiveCharacterTextSplitter

In this tutorial, we continue our journey into LangChain, a powerful framework that connects large language models (LLMs) with external data sources and tools.

In the previous part, we explored data ingestion techniques using different document loaders to read content from PDFs, text files, web pages, and Wikipedia.

Now, we move to the next critical step — data transformation. This phase focuses on how to split large documents into smaller, manageable text chunks, which can then be used effectively by language models.

1. Why Data Transformation is Needed in LangChain

Before we dive into the implementation, let’s understand why we need to transform data into chunks.

Every large language model (LLM) such as GPT, Claude, or Gemini has a context window limit. This defines the maximum number of tokens (roughly, word pieces) it can process in a single request.

If your document is too large, it must be divided into smaller text segments so that the model can process and reason over them efficiently.

LangChain provides multiple methods to achieve this segmentation through a set of tools called Text Splitters.

2. Setting Up the Environment

Make sure you have installed the LangChain text splitting library before proceeding.

Open your terminal and run the following command:

pip install langchain-text-splitters

3. Importing Required Libraries

LangChain provides various text splitting utilities inside the langchain_text_splitters module.

For this example, we’ll use the Recursive Character Text Splitter, which is one of the most commonly used splitters.

from langchain_text_splitters import RecursiveCharacterTextSplitter
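
4. Loading the Documents

The splitting examples below assume a list of Document objects named docs. As a minimal sketch, you can recreate it with the PyPDFLoader covered in the previous part; the filename attention.pdf is just a placeholder for whatever PDF you used there, and it assumes the langchain-community and pypdf packages are installed.

from langchain_community.document_loaders import PyPDFLoader

# Load each page of the PDF as a separate Document object
loader = PyPDFLoader("attention.pdf")  # placeholder filename
docs = loader.load()

print(len(docs))  # number of pages loaded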

5. Splitting Text into Chunks

Now that we have our documents, the next step is to split them into smaller pieces.

Method 1: Recursively Splitting Text by Characters

The RecursiveCharacterTextSplitter divides documents based on character count, ensuring no chunk exceeds a certain size. It works by trying a list of separators in order (by default paragraph breaks, then newlines, then spaces, then individual characters) and recursing until every chunk fits.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Maximum number of characters per chunk
    chunk_overlap=50      # Overlapping characters between chunks
)

Here:

  • chunk_size=500 means each text chunk will have a maximum of 500 characters.
  • chunk_overlap=50 ensures some overlapping text between chunks for better continuity.
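
To see these two parameters in action before working with real documents, here is a small sketch using split_text(), which returns plain strings instead of Document objects (the sample sentence and the tiny sizes are just for illustration):

demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,      # tiny size so the effect is easy to see
    chunk_overlap=5     # a few characters repeated between chunks
)

chunks = demo_splitter.split_text("LangChain splits long text into small overlapping chunks.")
for chunk in chunks:
    print(repr(chunk))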

Now let’s split the PDF documents we loaded earlier:

final_documents = text_splitter.split_documents(docs)

You can verify the output:

print(final_documents[0])
print(final_documents[1])

You’ll notice that some parts of the text repeat slightly between chunks — this is due to the 50-character overlap.

For example:

Chunk 1: "University of Toronto ... end of text"
Chunk 2: "... University of Toronto continues ..."

This overlap ensures contextual continuity between chunks, which is helpful during embedding generation and retrieval.
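
As a quick sanity check, you can also confirm that no chunk exceeds the configured size:

print(len(final_documents))                                # total number of chunks
print(max(len(d.page_content) for d in final_documents))   # longest chunk, at most 500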

6. Handling Text Files (Plain Text Input)

Let’s now see how to handle .txt files, such as speech.txt.

Loading a Text File into Memory

with open("speech.txt", "r", encoding="utf-8") as f:
    speech = f.read()

This reads the file content into a variable called speech, which is a plain Python string, not a Document object.

You can print it to verify:

print(speech)

7. Converting Text into LangChain Documents

LangChain prefers data to be in a Document format since many downstream operations (like embeddings and retrieval) expect that structure.

To convert our text into Document objects, we can use the create_documents() method.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

# Convert text into Document objects
documents = text_splitter.create_documents([speech])

print(documents[0])
print(documents[1])

Let’s increase the chunk size slightly so that more of each sentence fits into a single chunk:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20
)

documents = text_splitter.create_documents([speech])

Now when you print the first two chunks, you’ll notice something like:

Document 1: "The world must be made safe for democracy. Its peace must be..."
Document 2: "... Its peace must be planted upon the tested foundations..."

You can see that the phrase “Its peace must be” appears in both chunks because of the 20-character overlap. The splitter repeats at most 20 characters and cuts at a separator, so the actual overlap may be slightly shorter.


8. Understanding the Output

The create_documents() and split_documents() methods both return a list of Document objects. The difference is their input: create_documents() takes raw strings, while split_documents() takes existing Document objects.

You can verify this using:

type(documents[0])

Output:

<class 'langchain_core.documents.base.Document'>

Each Document object contains:

  • page_content — the actual text chunk
  • metadata — optional additional information (like source filename or page number)

This structured format makes it easy to pass these chunks into embedding models, vector databases, or retrievers in LangChain.
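
As a final sketch, here is one way to attach metadata while splitting: the optional metadatas argument of create_documents() pairs one dictionary with each input text, and every chunk produced from that text carries a copy of it (the source value below just reuses our filename from earlier):

# Attach a metadata dict to every chunk produced from the speech text
documents = text_splitter.create_documents(
    [speech],
    metadatas=[{"source": "speech.txt"}]   # one dict per input text
)

print(documents[0].page_content)   # the text chunk itself
print(documents[0].metadata)       # {'source': 'speech.txt'}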