Learnitweb

Understanding Character Text Splitter vs. Recursive Character Text Splitter

In this tutorial, we continue our exploration of text splitting techniques in LangChain, focusing on one of the most fundamental operations in any Retrieval-Augmented Generation (RAG) pipeline — preparing text data for large language models.

In the previous tutorial, we discussed the Recursive Character Text Splitter, learned how to use it in code, and explored how it helps us efficiently divide long documents into smaller, manageable chunks.

In this session, we’ll:

  • Review key properties of the Recursive Character Text Splitter
  • Understand how it differs from the Character Text Splitter
  • Learn how to use the Character Text Splitter in practice with examples
  • Discuss when to use which splitter

1. Quick Recap: Recursive Character Text Splitter

The Recursive Character Text Splitter is one of the most recommended and commonly used text splitters in LangChain because of its adaptability and intelligent behavior.

Here’s why:

  1. Generic Use Case
    It is designed for general-purpose text splitting, meaning you can use it on text extracted from most document types (PDF, TXT, HTML, etc.) without special tuning.
  2. Parameterized by a List of Characters
    The splitter uses a list of characters as potential breakpoints to divide text. It starts from the largest unit (like paragraphs) and recursively breaks down text into smaller chunks until the desired chunk size is reached.
  3. Default Separator List
    By default, the recursive character splitter uses the following characters for splitting:
    • "\n\n" (double newline: separates paragraphs)
    • "\n" (single newline: separates lines)
    • " " (space: separates words)
    • "" (empty string: last resort, splits between individual characters)
    This ensures that the splitter tries to preserve natural text structure — paragraphs, sentences, and words remain intact whenever possible.
  4. Chunk Size Measurement
    The chunk_size parameter determines how large each chunk can be (in number of characters), and chunk_overlap defines how much text from the end of one chunk should be repeated at the start of the next chunk to maintain context.

In summary, the recursive splitter:

  • Prioritizes natural text boundaries
  • Maintains text integrity
  • Is ideal for most LLM preprocessing scenarios
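
The separator hierarchy recapped above can be illustrated with a small pure-Python sketch. It is a simplified approximation written for this tutorial, not LangChain's actual implementation: the function name recursive_split is hypothetical, the sample text is made up, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: keep the oversized text as-is.
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then retry this piece with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # still fits: keep merging
        else:
            chunks.append(current)   # full: start a new chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
recursive_chunks = recursive_split(sample, chunk_size=30)
for chunk in recursive_chunks:
    print(repr(chunk))
# 'Para one is short.'
# 'Para two is a bit longer than'
# 'the limit allows.'
```

The first paragraph fits within the limit and stays whole. Only the second paragraph, which is too long, is broken further at word boundaries.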

2. Introduction to Character Text Splitter

Now that we understand the recursive variant, let’s discuss another important splitter — the Character Text Splitter.

The Character Text Splitter works similarly but is more manual and straightforward in its logic. It splits text based on a single character separator instead of a hierarchy of characters.

Key Characteristics

  1. Splitting Mechanism
    The Character Text Splitter splits the text wherever it encounters the specified separator string (for example, a newline "\n" or a space " ").
    If no separator is provided, it defaults to a double newline ("\n\n"), so the text is split at paragraph breaks.
  2. Chunk Size Measurement
    Like the recursive version, chunk size here is also measured in number of characters.
    The splitter merges the pieces between separators into chunks of at most chunk_size characters. A chunk exceeds chunk_size only when a single piece between two separators is itself longer than the limit.
  3. Control and Simplicity
    While this splitter offers less structural intelligence, it gives you fine-grained control — you decide exactly which character acts as a boundary.
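
As a rough mental model of the characteristics above, the following pure-Python sketch splits on one separator and then merges neighbouring pieces up to chunk_size. The function name character_split is hypothetical, the sample text is made up, and overlap handling is omitted. It illustrates the idea and is not the library's code.

```python
def character_split(text, separator, chunk_size):
    """Split on a single separator, then merge pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        if current and len(current) + len(separator) + len(piece) > chunk_size:
            # Adding this piece would overflow: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = piece if not current else current + separator + piece
    if current:
        chunks.append(current)
    return chunks


text = "alpha\nbeta\ngamma delta epsilon zeta\neta"
char_chunks = character_split(text, separator="\n", chunk_size=12)
print(char_chunks)
# ['alpha\nbeta', 'gamma delta epsilon zeta', 'eta']
```

The middle chunk is 24 characters long even though chunk_size is 12, because the sketch never cuts anywhere except at the chosen separator.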

3. Installing and Importing the Character Text Splitter

If you haven’t already installed the LangChain text splitters library, run:

pip install langchain-text-splitters

Then, import the required class:

from langchain_text_splitters import CharacterTextSplitter

4. Example: Splitting a Text File

Let’s take a simple example using the file speech.txt that we used previously. This file is located in the same data_transformer folder.

Step 1: Loading the Document

We’ll load the text file into a document object using LangChain’s text loader.

# In recent LangChain releases, TextLoader lives in the langchain-community package
from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
docs = loader.load()

print(docs[0].page_content)

This gives us the entire text of the file as a document.

Step 2: Initializing the Character Text Splitter

Next, let’s initialize our splitter:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",         # Split text wherever a newline occurs
    chunk_size=100,         # Each chunk can have up to 100 characters
    chunk_overlap=20        # Overlap of 20 characters between chunks
)

We now use this splitter to divide our text:

split_docs = text_splitter.split_documents(docs)

You can print the first few chunks to inspect them:

print(split_docs[0].page_content)
print(split_docs[1].page_content)
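
When inspecting chunks, it can help to measure how much text two consecutive chunks actually share. The helper below is a hypothetical utility (not part of LangChain), shown here on made-up strings. With a single-separator splitter the overlap is typically built from whole segments, so the shared text may be shorter than chunk_overlap, or absent.

```python
def shared_overlap(first, second):
    """Length of the longest suffix of `first` that is also a prefix of `second`."""
    for size in range(min(len(first), len(second)), 0, -1):
        if first.endswith(second[:size]):
            return size
    return 0


chunk_a = "line one\nline two\nline three"
chunk_b = "line three\nline four"
overlap_len = shared_overlap(chunk_a, chunk_b)
print(overlap_len)                # 10
print(repr(chunk_b[:overlap_len]))  # 'line three'
```

The same function could be called on split_docs[0].page_content and split_docs[1].page_content to check the real chunks.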

Step 3: Understanding Chunking Behavior

Sometimes, you may notice warnings like:

Created a chunk of 470 characters, which is longer than the specified 100.

This happens when a stretch of text between two separators (here, a single line) is longer than chunk_size. The splitter only cuts at the separator, so it keeps the entire segment instead of breaking the text arbitrarily.

This is expected behavior and helps maintain text coherence.
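
To see in advance which parts of a text are likely to produce such a warning, one option is to list the segments that are longer than chunk_size before splitting. This is a minimal sketch that assumes the warning corresponds to a single segment exceeding the limit. The helper name oversized_segments and the sample text are hypothetical.

```python
def oversized_segments(text, separator, chunk_size):
    """Return the segments between separators that exceed chunk_size."""
    return [seg for seg in text.split(separator) if len(seg) > chunk_size]


sample_text = "Short line.\n" + "x" * 150 + "\nAnother short line."
too_long = oversized_segments(sample_text, separator="\n", chunk_size=100)
print([len(seg) for seg in too_long])  # [150]
```

If the list is long, a smaller separator or the recursive splitter is probably a better fit for that text.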

5. Using Default Separators

If you do not specify a separator, the Character Text Splitter defaults to a single separator:

  • "\n\n" (double newline)

Here’s an example:

text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

split_docs = text_splitter.create_documents(["The world must be made safe for democracy. ..."])

You can now print and inspect the chunks:

print(split_docs[0].page_content)
print(split_docs[1].page_content)

You’ll see that, by default, the text is split only at blank lines (paragraph breaks), not at every newline.
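
The difference between splitting on a double newline and on a single newline can be seen with plain str.split, independently of LangChain. The sample text is made up for illustration.

```python
text = (
    "First paragraph, line one.\n"
    "First paragraph, line two.\n"
    "\n"
    "Second paragraph."
)

# Splitting on a blank line keeps each paragraph together.
by_paragraph = text.split("\n\n")

# Splitting on every newline separates individual lines.
by_line = [line for line in text.split("\n") if line]

print(len(by_paragraph))  # 2
print(len(by_line))       # 3
```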

6. Comparing Recursive vs. Character Text Splitters

| Feature | Recursive Character Text Splitter | Character Text Splitter |
| --- | --- | --- |
| Splitting Logic | Splits text recursively based on a list of separators (paragraph → line → word) | Splits based on a single separator |
| Structure Preservation | Maintains natural structure of text | Cuts only at the chosen separator; may produce oversized chunks if the separator is rare or absent |
| Default Separators | "\n\n", "\n", " ", "" | "\n\n" |
| Best Use Case | Most general text documents | When you need explicit control over splitting logic |
| Recommendation | Use this one for almost all cases | Use only when recursive splitting is unnecessary |

7. Which One Should You Use?

If you’re unsure which splitter to pick, always start with:

RecursiveCharacterTextSplitter

It is more flexible, handles varied text structures gracefully, and is the default choice for most text processing scenarios in LangChain.

The Character Text Splitter is useful when:

  • You have highly structured text with consistent separators
  • You want to control exactly where the text is split