When working with real-world APIs, we often receive large JSON responses that contain deeply nested objects, arrays, and long fields. Before we send this data to an LLM or convert it into embeddings for retrieval, we must break it into smaller, meaningful chunks.
However, depending on your LangChain version, the built-in RecursiveJsonSplitter may not always work reliably, especially if:
- Your JSON starts with a list
- The JSON contains deeply nested lists
- You are using an older version of langchain-text-splitters
- You encounter an unexpected IndexError or missing arguments like convert_lists
To overcome these issues, we will build a custom JSON splitter that works consistently across all Python and LangChain environments.
This tutorial explains how this splitter is designed and how you can use it to chunk JSON data effectively.
Why Do We Need a Custom JSON Splitter?
Many built-in JSON splitting tools assume the JSON root is always a dictionary and that lists can be safely traversed. But real API responses are often:
- lists of dictionaries,
- dictionaries containing lists,
- or complex nested structures.
For example:
[
  { "id": 1, "title": "Sample", "body": "..." },
  { "id": 2, "title": "Another", "body": "..." }
]
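You can confirm the shape of such a response with a few lines of standard-library Python; a splitter that assumes a dict root will trip on it. The sample payload below is illustrative:

```python
import json

# A list-rooted response, as returned by many REST APIs
raw = '[{"id": 1, "title": "Sample"}, {"id": 2, "title": "Another"}]'
parsed = json.loads(raw)
print(type(parsed).__name__)  # list, not dict

# Wrapping it in a dict gives every splitter a dict root to start from
wrapped = {"items": parsed}
print(type(wrapped).__name__)  # dict
```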
Some versions of LangChain fail to handle this kind of structure due to:
- incorrect path tracking
- missing internal functions
- no support for list-rooted JSON
- no support for certain constructor arguments
This leads to errors such as:
IndexError: list index out of range
or
TypeError: unexpected keyword argument 'convert_lists'
To avoid all of these issues, we will create a robust, custom splitter.
Objective of the Custom JSON Splitter
The custom splitter we create will:
- Traverse the JSON recursively
- Break it into chunks based on maximum character size
- Handle both dicts and lists correctly
- Maintain structural meaning
- Work with any version of LangChain
- Never throw IndexError when traversing lists
- Produce clean JSON fragments suitable for embeddings or LLMs
This gives complete control over how JSON is broken apart.
Designing the Custom JSON Splitter
We will build a recursive function that:
- Converts each JSON object into a pretty-printed JSON string
- Checks if its size exceeds the allowed limit
- If it fits — store it as a chunk
- If it doesn’t fit — recursively process its children
It handles:
- Dictionaries
- Lists
- Primitives
- Large strings
This guarantees safe splitting for all kinds of JSON structures.
Complete Working Code
Here is the full implementation of the custom JSON splitter along with an example API:
import json
import requests
from typing import Any, List

# ------------------------------------------------------------
# Custom Recursive JSON Splitter
# ------------------------------------------------------------

def json_to_chunks(data: Any, max_chars: int = 300) -> List[str]:
    """Recursively split a JSON object into chunks based on max character size."""
    chunks = []

    def recurse(obj, path=""):
        text = json.dumps(obj, indent=2)

        # If object fits into one chunk, use it directly
        if len(text) <= max_chars:
            chunks.append(text)
            return

        # If dictionary, split key-by-key
        if isinstance(obj, dict):
            for key, value in obj.items():
                recurse(value, f"{path}/{key}")
            return

        # If list, split item-by-item
        if isinstance(obj, list):
            for index, item in enumerate(obj):
                recurse(item, f"{path}[{index}]")
            return

        # For large primitive values or long strings
        chunks.append(text)

    recurse(data)
    return chunks

# ------------------------------------------------------------
# 1. Load JSON from an API
# ------------------------------------------------------------
url = "https://jsonplaceholder.typicode.com/posts"
data = requests.get(url).json()

# Wrap JSON list in a dict for consistent processing
wrapped_json = {"items": data}

print("Fetched JSON successfully.")
print("------------------------------------------------------------")

# ------------------------------------------------------------
# 2. Split JSON into chunks
# ------------------------------------------------------------
chunks = json_to_chunks(wrapped_json, max_chars=300)

print("Total chunks generated:", len(chunks))
print("------------------------------------------------------------")
print("\nFIRST 3 CHUNKS:\n")
for c in chunks[:3]:
    print(c)
print("------------------------------------------------------------")

# ------------------------------------------------------------
# 3. OPTIONAL: Convert chunks to LangChain Document objects
# ------------------------------------------------------------
try:
    from langchain.schema import Document
    docs = [Document(page_content=c, metadata={}) for c in chunks]
    print("\nCreated Document objects:", len(docs))
except ImportError:
    print("\nLangChain not installed or incompatible version.")
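To see the splitter's behavior without hitting the network, here is a self-contained sketch that runs the same json_to_chunks logic on a small inline payload. The sample data is made up for illustration:

```python
import json
from typing import Any, List

def json_to_chunks(data: Any, max_chars: int = 300) -> List[str]:
    """Same splitter as above, repeated so this snippet runs standalone."""
    chunks: List[str] = []

    def recurse(obj, path=""):
        text = json.dumps(obj, indent=2)
        if len(text) <= max_chars:      # fits: keep as one chunk
            chunks.append(text)
            return
        if isinstance(obj, dict):       # too big: descend key-by-key
            for key, value in obj.items():
                recurse(value, f"{path}/{key}")
            return
        if isinstance(obj, list):       # too big: descend item-by-item
            for index, item in enumerate(obj):
                recurse(item, f"{path}[{index}]")
            return
        chunks.append(text)             # oversized primitive: keep whole

    recurse(data)
    return chunks

# Hypothetical payload: a list root wrapped in a dict, as in the script above
payload = {"items": [{"id": i, "body": "lorem ipsum " * 10} for i in range(5)]}
chunks = json_to_chunks(payload, max_chars=150)

print("Total chunks:", len(chunks))
print(json.loads(chunks[0])["id"])  # each chunk is itself valid JSON
```

Because the serialized payload exceeds 150 characters, the splitter descends through the wrapping dict and the list, emitting one chunk per element.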
How the Custom Splitter Works
Here is the logic:
1. Convert JSON to Pretty String
We use:
text = json.dumps(obj, indent=2)
This gives a human-readable string representation.
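For instance, a small dict serializes like this:

```python
import json

obj = {"id": 1, "title": "Sample"}
text = json.dumps(obj, indent=2)
print(text)
# {
#   "id": 1,
#   "title": "Sample"
# }
print(len(text))  # this character count drives the splitting decision
```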
2. Check if the chunk fits
If the serialized fragment is at most max_chars characters long, it is stored as a single chunk and recursion stops.
3. Handle Dictionaries
If obj is a dict:
- Process each key/value pair separately
- Split deeply nested objects automatically
4. Handle Lists
If obj is a list:
- Process each element individually
- Good for arrays of objects returned by APIs
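Isolating just the list branch: when the serialized array exceeds the limit, each element becomes its own candidate for recursion. The sample posts below are illustrative:

```python
import json

posts = [
    {"id": 1, "title": "Sample"},
    {"id": 2, "title": "Another"},
]

# The splitter walks the array element by element instead of
# truncating or discarding anything.
pieces = [json.dumps(item, indent=2) for item in posts]
print(len(pieces))  # one fragment per array element
```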
5. Handle Strings and Primitives
If a value is a primitive (a long string, number, or boolean) whose serialized form still exceeds max_chars, it cannot be split any further, so it is stored whole as a single oversized chunk rather than raising an error.
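A minimal sketch of this base case, using a made-up oversized string:

```python
import json

max_chars = 300
long_text = "x" * 500  # a primitive larger than the limit

# A string has no structure to recurse into, so the splitter
# stores it whole instead of truncating it or failing.
chunk = json.dumps(long_text, indent=2)
print(len(chunk) > max_chars)  # True: one oversized chunk
```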
