In earlier lessons, we explored several text-splitting techniques in LangChain. In this tutorial, we will focus on a powerful and lesser-known utility: the HTML Header Text Splitter. This splitter helps you break down HTML documents into logical, structured chunks based on the hierarchy of HTML header tags such as <h1>, <h2>, and <h3>.
This becomes especially useful when dealing with webpages or documentation pages that contain meaningful structure. The splitter preserves contextual metadata and ensures that related content stays together.
Understanding the HTML Header Text Splitter
The HTMLHeaderTextSplitter is a structure-aware text splitter that:
- Splits text at the level of HTML header tags.
- Adds metadata for each chunk, based on which header it falls under.
- Allows returning chunks either element-by-element or as merged blocks of related content (controlled by the return_each_element option).
- Preserves semantic structure instead of blindly breaking text by character count.
This means you can split content by logical sections of the HTML document rather than arbitrary character limits. It is extremely useful for building RAG (Retrieval-Augmented Generation) systems that depend heavily on well-defined document boundaries.
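To make the idea concrete, here is a minimal sketch of header-aware splitting built only on Python's standard html.parser module. It is not LangChain's implementation: the class name HeaderSplitSketch and the sample HTML are assumptions made for illustration, and the sketch ignores many real-world details such as scripts, styles, and malformed markup.

```python
from html.parser import HTMLParser


class HeaderSplitSketch(HTMLParser):
    """Toy header-aware splitter; NOT LangChain's implementation."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.labels = dict(headers_to_split_on)
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # header tag being read right now, if any
        self.buffer = []       # text collected for the current chunk
        self.chunks = []       # finished (text, metadata) pairs

    def _flush(self):
        # Normalize whitespace and store the chunk with the headers in scope.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {self.labels[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((text, metadata))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()  # a new header closes the chunk before it
            # Keep only shallower headers ("h1" < "h2" < "h3" as strings),
            # dropping this level and any deeper ones.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.in_header = tag
            self.active[tag] = ""

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.active[tag] = self.active[tag].strip()
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.active[self.in_header] += data  # text of the header itself
        else:
            self.buffer.append(data)             # ordinary body text

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last header


sample = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
parser = HeaderSplitSketch([("h1", "Header 1"), ("h2", "Header 2")])
parser.feed(sample)
parser.close()
for text, metadata in parser.chunks:
    print(metadata, "->", text)
```

Each chunk carries the headers that were in scope when its text appeared, and a new <h2> replaces the previous one rather than stacking on top of it. The real splitter handles far more cases, but the chunk-plus-metadata shape is the same idea.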
Importing the HTML Header Text Splitter
The splitter lives in the langchain_text_splitters package (older LangChain releases exposed it through langchain.text_splitter instead):
from langchain_text_splitters import HTMLHeaderTextSplitter
Preparing an HTML String
Let’s assume you have a large HTML string that contains various header tags.
Example:
html_string = """
<!DOCTYPE html>
<html>
<body>
<h1>Foo Bar Main Section</h1>
<p>Some intro about foo.</p>
<h2>First Topic</h2>
<p>Details about the first topic.</p>
<div>
<h3>Subtopic Discussion</h3>
<p>Some deeper explanation inside a div.</p>
</div>
</body>
</html>
"""
This is the content we want to divide into structured chunks based on <h1>, <h2>, and <h3>.
Defining Headers to Split On
We can specify which header tags to consider and optionally provide custom names for them.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
Here:
- ("h1", "Header 1") means: whenever an <h1> tag appears, treat it as a major section and record its text under the metadata key "Header 1".
- Similarly for <h2> and <h3>.
Creating the Splitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
Now, the splitter knows how to process HTML and where to divide the content.
Splitting the HTML Content
html_header_splits = html_splitter.split_text(html_string)
The output will be a list of Document objects, each representing a chunk.
Each document contains:
- page_content: the actual text in the chunk
- metadata: header hierarchy information
You can simply print the result:
for doc in html_header_splits:
    print(doc)
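Typical output (the exact chunks depend on the installed version; some versions also emit each header as its own chunk, as in the output of the complete program later in this lesson):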
Document(
page_content='Some intro about foo.',
metadata={'Header 1': 'Foo Bar Main Section'}
)
Document(
page_content='Details about the first topic.',
metadata={'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic'}
)
Document(
page_content='Some deeper explanation inside a div.',
metadata={'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic', 'Header 3': 'Subtopic Discussion'}
)
This structured metadata becomes extremely valuable when indexing or performing retrieval.
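As a small illustration of why the metadata matters, the sketch below selects only the chunks that belong to one section. The helper name filter_by_header is hypothetical, and SimpleNamespace objects stand in for LangChain Document objects so that the snippet runs without any extra packages; only the page_content and metadata attributes are used.

```python
from types import SimpleNamespace


def filter_by_header(docs, key, value):
    """Return the chunks whose metadata[key] equals value."""
    return [doc for doc in docs if doc.metadata.get(key) == value]


# Stand-ins for Document objects, mirroring the example chunks above.
docs = [
    SimpleNamespace(
        page_content="Some intro about foo.",
        metadata={"Header 1": "Foo Bar Main Section"},
    ),
    SimpleNamespace(
        page_content="Details about the first topic.",
        metadata={"Header 1": "Foo Bar Main Section", "Header 2": "First Topic"},
    ),
    SimpleNamespace(
        page_content="Some deeper explanation inside a div.",
        metadata={
            "Header 1": "Foo Bar Main Section",
            "Header 2": "First Topic",
            "Header 3": "Subtopic Discussion",
        },
    ),
]

# Keep everything under the "First Topic" section, including its subtopic.
first_topic = filter_by_header(docs, "Header 2", "First Topic")
for doc in first_topic:
    print(doc.page_content)
```

Because the subtopic chunk inherits the "Header 2" value of its parent section, a single filter on that key returns both the section text and its nested content.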
Splitting Content Directly from a URL
The same technique can be used for a webpage.
Assume you have a URL with valid HTML:
url = "https://example.com/some-article"
If you load the HTML (via requests, BeautifulSoup, or LangChain loaders), you can pass it to the same splitter:
import requests
html_text = requests.get(url).text
html_header_splits = html_splitter.split_text(html_text)
Real webpages are often long and deeply nested, so splitting can take noticeably longer than it does for the small example string, but the output is still a clean set of structured chunks.
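When fetching live pages, it usually helps to set a timeout and to decode the response with the charset the server declares. The sketch below does this with the standard library's urllib.request rather than requests, purely so that it runs without extra dependencies; the function name fetch_html and the 10-second default are assumptions, not part of LangChain.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    urlopen raises an exception on network failures and on HTTP error
    statuses, so a failed request never reaches the splitter.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access, so it is left commented out here):
# html_text = fetch_html("https://example.com/some-article")
# html_header_splits = html_splitter.split_text(html_text)
```

Depending on your installed version, the splitter may also offer a split_text_from_url method that fetches and splits in one call; check your version's documentation before relying on it.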
Splitting pages this way is especially helpful for:
- Web scraping
- Document indexing
- RAG pipelines
- Search systems
- Topic-aware embeddings
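One possible reading of "topic-aware embeddings" is to prepend each chunk's header path to its text before embedding it, so the vector also reflects where the chunk sits in the page. The sketch below assumes the metadata keys "Header 1" to "Header 3" used in this lesson; the helper name with_breadcrumb and the " > " separator are hypothetical choices.

```python
def with_breadcrumb(page_content, metadata,
                    keys=("Header 1", "Header 2", "Header 3")):
    """Prefix chunk text with its header path, outermost header first."""
    path = [metadata[key] for key in keys if key in metadata]
    if not path:
        return page_content  # no headers in scope: leave the text unchanged
    return " > ".join(path) + ": " + page_content


text = with_breadcrumb(
    "Details about the first topic.",
    {"Header 1": "Foo Bar Main Section", "Header 2": "First Topic"},
)
print(text)
```

The resulting string can then be passed to whichever embedding model you use in place of the bare chunk text.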
Why Use HTML Header Splitting?
Traditional text splitters do not understand the structure of a webpage. They might break text in the middle of a section, paragraph, or even mid-sentence.
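The problem is easy to reproduce with a deliberately naive splitter that cuts every fixed number of characters. The function name split_by_size and the chunk size of 30 are arbitrary choices for this sketch.

```python
def split_by_size(text, size):
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Some intro about foo. Details about the first topic."
chunks = split_by_size(text, 30)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends in the middle of the second sentence, so text from two different sections is mixed together and neither chunk knows which header it came from.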
HTML header splitting provides:
- Semantically meaningful chunking
- More accurate retrieval in RAG applications
- Cleaner embeddings (per logical section)
- Automatic metadata extraction
- A natural fit for blogs, documentation, and educational sites
Whenever your source is HTML with proper header hierarchy, this splitter is one of the best choices.
Complete Program
# ------------------------------------------------------------
# HTML Header Text Splitting Example with LangChain
# ------------------------------------------------------------
from langchain_text_splitters import HTMLHeaderTextSplitter
import requests
# ------------------------------------------------------------
# 1. Example HTML String
# ------------------------------------------------------------
html_string = """
<!DOCTYPE html>
<html>
<body>
<h1>Foo Bar Main Section</h1>
<p>Some intro about foo.</p>
<h2>First Topic</h2>
<p>Details about the first topic.</p>
<div>
<h3>Subtopic Discussion</h3>
<p>Some deeper explanation inside a div.</p>
</div>
<h2>Second Topic</h2>
<p>Information about the second topic.</p>
</body>
</html>
"""
# ------------------------------------------------------------
# 2. Define the headers to split on
# ------------------------------------------------------------
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
# ------------------------------------------------------------
# 3. Create the HTMLHeaderTextSplitter
# ------------------------------------------------------------
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
# ------------------------------------------------------------
# 4. Split the HTML string into documents
# ------------------------------------------------------------
html_header_splits = html_splitter.split_text(html_string)
print("\n=========== Splits from HTML STRING ===========\n")
for i, doc in enumerate(html_header_splits):
    print(f"Chunk {i+1}:")
    print("CONTENT:")
    print(doc.page_content)
    print("METADATA:")
    print(doc.metadata)
    print("----------------------------------------")
# ------------------------------------------------------------
# 5. Splitting HTML from a live URL
# ------------------------------------------------------------
url = "https://www.example.com" # Replace with any real site having <h1>, <h2>, <h3>
print("\nFetching URL:", url)
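response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()  # stop early on HTTP errors such as 404
html_from_url = response.text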
html_url_splits = html_splitter.split_text(html_from_url)
print("\n=========== Splits from URL HTML ===========\n")
for i, doc in enumerate(html_url_splits):
    print(f"URL Chunk {i+1}:")
    print("CONTENT:")
    print(doc.page_content)
    print("METADATA:")
    print(doc.metadata)
    print("----------------------------------------")
Output
=========== Splits from HTML STRING ===========
Chunk 1:
CONTENT:
Foo Bar Main Section
METADATA:
{'Header 1': 'Foo Bar Main Section'}
----------------------------------------
Chunk 2:
CONTENT:
Some intro about foo.
METADATA:
{'Header 1': 'Foo Bar Main Section'}
----------------------------------------
Chunk 3:
CONTENT:
First Topic
METADATA:
{'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic'}
----------------------------------------
Chunk 4:
CONTENT:
Details about the first topic.
METADATA:
...
Learn more
METADATA:
{'Header 1': 'Example Domain'}
----------------------------------------
Conclusion
The HTML Header Text Splitter allows you to intelligently split web content based on document structure rather than arbitrary character limits. It provides clean, meaningful segments with metadata that can greatly enhance downstream tasks like vector search and retrieval.
