In this tutorial, we continue our exploration of text splitting techniques in LangChain, focusing on one of the most fundamental operations in any Retrieval-Augmented Generation (RAG) pipeline — preparing text data for large language models.
In the previous tutorial, we discussed the Recursive Character Text Splitter, learned how to use it in code, and explored how it helps us efficiently divide long documents into smaller, manageable chunks.
In this session, we’ll:
- Review key properties of the Recursive Character Text Splitter
- Understand how it differs from the Character Text Splitter
- Learn how to use the Character Text Splitter in practice with examples
- Discuss when to use which splitter
1. Quick Recap: Recursive Character Text Splitter
The Recursive Character Text Splitter is one of the most recommended and commonly used text splitters in LangChain because of its adaptability and intelligent behavior.
Here’s why:
- Generic Use Case
It is designed for general-purpose text splitting, so it works on text extracted from most document types (PDF, TXT, HTML, etc.) without tuning.
- Parameterized by a List of Characters
The splitter uses a list of characters as potential breakpoints. It starts with the largest unit (paragraphs) and recursively falls back to smaller units until each chunk fits within the desired chunk size.
- Default Separator List
By default, the recursive character splitter tries the following separators in order: "\n\n" (double newline, separates paragraphs), "\n" (single newline, separates lines), " " (space, separates words), and "" (empty string, splits between individual characters as a last resort).
- Chunk Size Measurement
The chunk_size parameter sets the maximum size of each chunk (in characters by default), and chunk_overlap defines how much text from the end of one chunk is repeated at the start of the next chunk to maintain context.
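To make chunk_size and chunk_overlap concrete, here is a minimal sketch in plain Python. It uses a fixed-width sliding window rather than LangChain's separator-aware logic, and the helper name sliding_chunks is hypothetical, so treat it as an illustration of the two parameters only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk starts with the last three characters of the previous one, which is the context carry-over that chunk_overlap is meant to provide.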
In summary, the recursive splitter:
- Prioritizes natural text boundaries
- Maintains text integrity
- Is ideal for most LLM preprocessing scenarios
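The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split is hypothetical, overlap is left out, and a fixed-width fallback stands in for the final "" separator.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first and recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: cut at fixed widths, similar in spirit to
        # the final "" entry in LangChain's default separator list.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, *finer = separators
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two has a first line.\n"
    "And a second line that is rather long indeed."
)
chunks = recursive_split(text, ["\n\n", "\n", " "], chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraph survives intact, the second paragraph is split at its newline, and only the overlong line is broken further at spaces.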
2. Introduction to Character Text Splitter
Now that we understand the recursive variant, let’s discuss another important splitter — the Character Text Splitter.
The Character Text Splitter works similarly but is more manual and straightforward in its logic. It splits text based on a single character separator instead of a hierarchy of characters.
Key Characteristics
- Splitting Mechanism
The Character Text Splitter splits the text wherever it encounters the specified separator (for example, a newline "\n" or a space " ").
If no separator is provided, it defaults to a double newline ("\n\n").
- Chunk Size Measurement
Like the recursive version, chunk size here is measured in number of characters.
The splitter merges adjacent pieces up to the specified chunk_size value, but a piece that contains no separator is kept whole, so a chunk can exceed chunk_size.
- Control and Simplicity
While this splitter offers less structural intelligence, it gives you fine-grained control: you decide exactly which separator acts as a boundary.
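The single-separator behavior described above can be approximated as follows. This is a simplified sketch in plain Python, assuming a greedy merge and no overlap; split_on_separator is a hypothetical helper, not part of LangChain.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge neighbours up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # ignore empty pieces from repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # kept whole even if it is longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


text = (
    "Short line.\n"
    "Another short line.\n"
    "This line has no newline inside it and is far longer than the limit."
)
chunks = split_on_separator(text, separator="\n", chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The two short lines are merged into one chunk, while the long line is emitted as a single oversized chunk because there is no separator inside it to split on.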
3. Installing and Importing the Character Text Splitter
If you haven’t already installed the LangChain text splitters library, run:
pip install langchain-text-splitters
Then, import the required class:
from langchain_text_splitters import CharacterTextSplitter
4. Example: Splitting a Text File
Let’s take a simple example using the file speech.txt that we used previously. This file is located in the same data_transformer folder.
Step 1: Loading the Document
We’ll load the text file into a document object using LangChain’s text loader.
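# In current LangChain releases, TextLoader lives in langchain_community (pip install langchain-community)
from langchain_community.document_loaders import TextLoader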
loader = TextLoader("speech.txt")
docs = loader.load()
print(docs[0].page_content)
This gives us the entire text of the file as a document.
Step 2: Initializing the Character Text Splitter
Next, let’s initialize our splitter:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",    # Split text wherever a newline occurs
    chunk_size=100,    # Merge pieces into chunks of up to 100 characters
    chunk_overlap=20   # Repeat up to 20 characters between consecutive chunks
)
We now use this splitter to divide our text:
split_docs = text_splitter.split_documents(docs)
You can print the first few chunks to inspect them:
print(split_docs[0].page_content)
print(split_docs[1].page_content)
Step 3: Understanding Chunking Behavior
Sometimes, you may notice warnings like:
Created a chunk of size 470, which is longer than the specified 100
This happens because the splitter found no separator (for example, a newline) within that stretch of text, so a single piece between two separators is already longer than chunk_size. In such cases, it preserves the entire piece instead of breaking the text arbitrarily.
This is expected behavior and helps maintain text coherence.
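If you would rather check for oversized chunks programmatically than rely on the warning, a small helper like the one below can report them. It is a sketch with a hypothetical name (oversized_chunks) and stand-in strings; with real output you could pass in the page_content of each split document instead.

```python
def oversized_chunks(chunks: list[str], chunk_size: int) -> list[tuple[int, int]]:
    """Return (index, length) for every chunk longer than chunk_size."""
    return [(i, len(c)) for i, c in enumerate(chunks) if len(c) > chunk_size]


# Stand-in strings for illustration only
sample_chunks = ["a" * 80, "b" * 470, "c" * 95]
report = oversized_chunks(sample_chunks, chunk_size=100)
print(report)
```

A non-empty report is a hint that the chosen separator does not occur often enough in the text, and that a finer separator or the recursive splitter may be a better fit.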
5. Using Default Separators
If you do not specify a separator, the Character Text Splitter defaults to a single separator, the double newline:
"\n\n"
Here’s an example:
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
split_docs = text_splitter.create_documents(["The world must be made safe for democracy. ..."])
You can now print and inspect the chunks:
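# Loop instead of indexing: a short text may produce only one chunk
for doc in split_docs[:2]:
    print(doc.page_content)
You’ll see that the text splits at paragraph breaks (double newlines) by default; text that contains no "\n\n" stays in a single chunk.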
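The effect of the default separator is easy to check with plain string splitting. The snippet below does not call LangChain; it assumes the library's default separator value of "\n\n" and uses made-up sample strings to show why text with only single newlines is not divided.

```python
DEFAULT_SEPARATOR = "\n\n"  # assumed default of CharacterTextSplitter's separator parameter

single_newlines = "Line one.\nLine two.\nLine three."
paragraphs = "Paragraph one.\n\nParagraph two.\n\nParagraph three."

pieces_single = single_newlines.split(DEFAULT_SEPARATOR)
pieces_paragraphs = paragraphs.split(DEFAULT_SEPARATOR)

print(len(pieces_single))      # no double newline present, so one piece
print(len(pieces_paragraphs))  # one piece per paragraph
```

If your file uses single newlines between paragraphs, passing separator="\n" explicitly, as in the earlier example, is the simplest way to get the splits you expect.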
6. Comparing Recursive vs. Character Text Splitters
| Feature | Recursive Character Text Splitter | Character Text Splitter |
|---|---|---|
| Splitting Logic | Splits text recursively based on a list of separators (paragraph → line → word → character) | Splits based on a single separator |
| Structure Preservation | Maintains natural structure of text | May produce oversized chunks if the separator is not present |
| Default Separators | \n\n, \n, space " ", and empty string "" | \n\n only |
| Best Use Case | For most general text documents | When you need explicit control over splitting logic |
| Recommendation | Use this one for almost all cases | Use only when recursive splitting is unnecessary |
7. Which One Should You Use?
If you’re unsure which splitter to pick, always start with:
RecursiveCharacterTextSplitter
It is more flexible, handles varied text structures gracefully, and is the default choice for most text processing scenarios in LangChain.
The Character Text Splitter is useful when:
- You have highly structured text with consistent separators
- You want to control exactly where the text is split
