Splitting Text Using the HTML Header Text Splitter in LangChain

In earlier lessons, we explored several text-splitting techniques in LangChain. In this tutorial, we will focus on a powerful and lesser-known utility: the HTML Header Text Splitter. This splitter helps you break down HTML documents into logical, structured chunks based on the hierarchy of HTML header tags such as <h1>, <h2>, and <h3>.

The splitter is especially useful for webpages or documentation pages that have a meaningful header structure. It attaches the enclosing headers to each chunk as metadata and keeps related content together.

Understanding the HTML Header Text Splitter

The HTMLHeaderTextSplitter is a structure-aware text splitter that:

Splits text at the level of HTML header tags.
Adds metadata for each chunk, based on which header it falls under.
Allows returning chunks either element-by-element or as merged blocks of related content.
Preserves semantic structure instead of blindly breaking text by character count.

This means you can split content by logical sections of the HTML document rather than arbitrary character limits. It is extremely useful for building RAG (Retrieval-Augmented Generation) systems that depend heavily on well-defined document boundaries.
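To make the idea concrete, here is a minimal standard-library sketch of header-based splitting. It is not LangChain's implementation; the class name HeaderChunker and its attributes are hypothetical, and it assumes the split tags are plain h1 to h6 headers. It only illustrates how the text between headers can become a chunk that carries the enclosing headers as metadata.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware splitter (illustration only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.labels = dict(headers_to_split_on)
        self.active = {}         # tag -> text of the header currently in effect
        self.open_header = None  # header tag being read right now, if any
        self.header_text = []
        self.body_text = []
        self.chunks = []         # list of (text, metadata) tuples

    def _flush(self):
        # Emit the body text collected so far under the active headers.
        text = " ".join(self.body_text).strip()
        if text:
            metadata = {
                self.labels[tag]: title
                for tag, title in sorted(self.active.items())
            }
            self.chunks.append((text, metadata))
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()  # a new header closes the previous chunk
            self.open_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.open_header:
            # Forget headers at the same or a deeper level, then record this one.
            # String comparison is enough because tags are assumed to be h1..h6.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = "".join(self.header_text).strip()
            self.open_header = None

    def handle_data(self, data):
        if self.open_header:
            self.header_text.append(data)
        elif data.strip():
            self.body_text.append(data.strip())

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last header


chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
chunker.close()

for text, metadata in chunker.chunks:
    print(text, metadata)
```

Note how the second h2 replaces the first one in the metadata while the h1 stays in effect; that reset rule is what keeps each chunk's header hierarchy accurate.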

Importing the HTML Header Text Splitter

The splitter is imported from the langchain_text_splitters package in current releases. The import path has changed across LangChain versions, so check the one you have installed:

from langchain_text_splitters import HTMLHeaderTextSplitter
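Because the import location depends on the installed version, a defensive lookup can be handy in shared notebooks. The sketch below assumes the class lives either in langchain_text_splitters or in the older langchain.text_splitter module; the helper name load_html_header_splitter is hypothetical, and it returns None when neither module is available.

```python
import importlib


def load_html_header_splitter():
    """Return the HTMLHeaderTextSplitter class, or None if it cannot be found.

    The two module paths below are assumptions about where different
    LangChain releases expose the class.
    """
    for module_name in ("langchain_text_splitters", "langchain.text_splitter"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this package layout is not installed; try the next one
        splitter_cls = getattr(module, "HTMLHeaderTextSplitter", None)
        if splitter_cls is not None:
            return splitter_cls
    return None


splitter_cls = load_html_header_splitter()
if splitter_cls is None:
    print("HTMLHeaderTextSplitter not found; install langchain-text-splitters")
else:
    print("Using", splitter_cls.__module__)
```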

Preparing an HTML String

Let’s assume you have a large HTML string that contains various header tags.

Example:

html_string = """
<!DOCTYPE html>
<html>
<body>

<h1>Foo Bar Main Section</h1>
<p>Some intro about foo.</p>

<h2>First Topic</h2>
<p>Details about the first topic.</p>

<div>
    <h3>Subtopic Discussion</h3>
    <p>Some deeper explanation inside a div.</p>
</div>

</body>
</html>
"""

This is the content we want to divide into structured chunks based on <h1>, <h2>, and <h3>.

Defining Headers to Split On

We can specify which header tags to consider and optionally provide custom names for them.

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

Here:

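Each tuple pairs an HTML tag with the metadata key under which that header's text is stored.
("h1", "Header 1") means: whenever an <h1> tag appears, start a new section and record the header text under the key "Header 1".
The same applies to <h2> and <h3>.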
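The second element of each tuple is only a label, so you are free to choose names that suit your index. The following sketch uses hypothetical labels ("title", "section") and a hypothetical helper, metadata_for, to show how a tag-to-label mapping turns the headers in effect into a metadata dictionary. In this sketch, tags that are not listed are simply left out of the metadata.

```python
# Hypothetical custom labels: the tag selects the header, the label names the key.
custom_headers = [
    ("h1", "title"),
    ("h2", "section"),
]

label_for_tag = dict(custom_headers)


def metadata_for(active_headers):
    """Build metadata from {tag: header text}, keeping only the listed tags."""
    return {
        label_for_tag[tag]: text
        for tag, text in active_headers.items()
        if tag in label_for_tag
    }


example = metadata_for(
    {
        "h1": "Foo Bar Main Section",
        "h2": "First Topic",
        "h3": "Subtopic Discussion",  # not listed above, so not recorded
    }
)
print(example)
```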

Creating the Splitter

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

The splitter now knows which header tags mark the boundaries between chunks.

Splitting the HTML Content

html_header_splits = html_splitter.split_text(html_string)

The output will be a list of Document objects, each representing a chunk.
Each document contains:

page_content – the actual text in the chunk
metadata – header hierarchy information

You can simply print the result:

for doc in html_header_splits:
    print(doc)

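Typical output (simplified; the exact chunk boundaries and printed form depend on the installed version, and some releases also emit each header's text as its own chunk):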

Document(
    page_content='Some intro about foo.',
    metadata={'Header 1': 'Foo Bar Main Section'}
)

Document(
    page_content='Details about the first topic.',
    metadata={'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic'}
)

Document(
    page_content='Some deeper explanation inside a div.',
    metadata={'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic', 'Header 3': 'Subtopic Discussion'}
)

This structured metadata becomes extremely valuable when indexing or performing retrieval.
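As a small illustration of how that metadata can be used, the sketch below filters chunks by section and builds a breadcrumb string that could be prepended to the text before embedding. The Chunk dataclass is a stand-in for a LangChain Document (it only mirrors page_content and metadata), and the breadcrumb helper is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a Document: just page_content plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def breadcrumb(chunk, keys=("Header 1", "Header 2", "Header 3")):
    """Join the header values that are present, outermost first."""
    return " > ".join(chunk.metadata[k] for k in keys if k in chunk.metadata)


chunks = [
    Chunk("Some intro about foo.", {"Header 1": "Foo Bar Main Section"}),
    Chunk(
        "Details about the first topic.",
        {"Header 1": "Foo Bar Main Section", "Header 2": "First Topic"},
    ),
]

# Keep only the chunks that sit under a given second-level header.
under_first_topic = [c for c in chunks if c.metadata.get("Header 2") == "First Topic"]

for c in chunks:
    print(f"[{breadcrumb(c)}] {c.page_content}")
```

The same attribute access works on real Document objects, since they expose page_content and metadata in the same way.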

Splitting Content Directly from a URL

The same technique can be used for a webpage.
Assume you have a URL with valid HTML:

url = "https://example.com/some-article"

If you load the HTML (for example with requests or a LangChain document loader), you can pass it to the same splitter:

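import requests

# Fail fast on network problems instead of splitting an error page.
response = requests.get(url, timeout=10)
response.raise_for_status()
html_text = response.text
html_header_splits = html_splitter.split_text(html_text)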

Real webpages are usually much longer and more deeply nested than the example above, so splitting may take noticeably longer, but the result is the same kind of structured chunk list.
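Long sections on real pages can still exceed what an embedding model handles comfortably, so a second, size-based pass is often applied to each header chunk. The sketch below is a standard-library approximation of that idea, assuming chunks are represented as (text, metadata) pairs; the function name cap_section_length and the default limit are hypothetical. It breaks on whitespace and copies the section metadata onto every piece.

```python
import textwrap


def cap_section_length(sections, max_chars=200):
    """Split (text, metadata) pairs so that no piece exceeds max_chars.

    Pieces are cut at whitespace where possible, and each piece keeps a
    copy of the metadata of the section it came from.
    """
    pieces = []
    for text, metadata in sections:
        for part in textwrap.wrap(text, width=max_chars):
            pieces.append((part, dict(metadata)))
    return pieces


sections = [("word " * 100, {"Header 1": "Long Section"})]
pieces = cap_section_length(sections, max_chars=50)

print(len(pieces), "pieces; longest is", max(len(p) for p, _ in pieces), "characters")
```

Because the metadata travels with every piece, the header context is not lost when a long section is subdivided.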

This is helpful for tasks such as:

Web scraping
Document indexing
RAG pipelines
Search systems
Topic-aware embeddings

Why Use HTML Header Splitting?

Traditional text splitters do not understand the structure of a webpage. They might break text in the middle of a section, paragraph, or even mid-sentence.
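A quick way to see the problem is to cut a short passage at a fixed character count. The helper below, fixed_size_chunks, is a hypothetical stand-in for a naive splitter, not a LangChain class; the sample text is taken from the example HTML used earlier.

```python
def fixed_size_chunks(text, size):
    """Cut text every `size` characters, ignoring words and sentences."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Details about the first topic. Some deeper explanation inside a div."
naive_chunks = fixed_size_chunks(text, 25)

for chunk in naive_chunks:
    print(repr(chunk))
```

The first piece ends in the middle of a word, and the two sentences, which belong to different sections in the HTML, end up sharing a piece.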

HTML header splitting provides:

Semantically meaningful chunking
More accurate retrieval in RAG applications
Cleaner embeddings (per logical section)
Automatic metadata extraction
A natural fit for blogs, documentation, and educational sites

Whenever your source is HTML with proper header hierarchy, this splitter is one of the best choices.

Complete Program

# ------------------------------------------------------------
# HTML Header Text Splitting Example with LangChain
# ------------------------------------------------------------

from langchain_text_splitters import HTMLHeaderTextSplitter
import requests

# ------------------------------------------------------------
# 1. Example HTML String
# ------------------------------------------------------------

html_string = """
<!DOCTYPE html>
<html>
<body>

<h1>Foo Bar Main Section</h1>
<p>Some intro about foo.</p>

<h2>First Topic</h2>
<p>Details about the first topic.</p>

<div>
    <h3>Subtopic Discussion</h3>
    <p>Some deeper explanation inside a div.</p>
</div>

<h2>Second Topic</h2>
<p>Information about the second topic.</p>

</body>
</html>
"""

# ------------------------------------------------------------
# 2. Define the headers to split on
# ------------------------------------------------------------

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# ------------------------------------------------------------
# 3. Create the HTMLHeaderTextSplitter
# ------------------------------------------------------------

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# ------------------------------------------------------------
# 4. Split the HTML string into documents
# ------------------------------------------------------------

html_header_splits = html_splitter.split_text(html_string)

print("\n=========== Splits from HTML STRING ===========\n")

for i, doc in enumerate(html_header_splits):
    print(f"Chunk {i+1}:")
    print("CONTENT:")
    print(doc.page_content)
    print("METADATA:")
    print(doc.metadata)
    print("----------------------------------------")

# ------------------------------------------------------------
# 5. Splitting HTML from a live URL
# ------------------------------------------------------------

url = "https://www.example.com"  # Replace with any real site having <h1>, <h2>, <h3>

print("\nFetching URL:", url)
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
html_from_url = response.text

html_url_splits = html_splitter.split_text(html_from_url)

print("\n=========== Splits from URL HTML ===========\n")

for i, doc in enumerate(html_url_splits):
    print(f"URL Chunk {i+1}:")
    print("CONTENT:")
    print(doc.page_content)
    print("METADATA:")
    print(doc.metadata)
    print("----------------------------------------")

Output (truncated)

=========== Splits from HTML STRING ===========

Chunk 1:
CONTENT:
Foo Bar Main Section
METADATA:
{'Header 1': 'Foo Bar Main Section'}
----------------------------------------
Chunk 2:
CONTENT:
Some intro about foo.
METADATA:
{'Header 1': 'Foo Bar Main Section'}
----------------------------------------
Chunk 3:
CONTENT:
First Topic
METADATA:
{'Header 1': 'Foo Bar Main Section', 'Header 2': 'First Topic'}
----------------------------------------
Chunk 4:
CONTENT:
Details about the first topic.
METADATA:
...
Learn more
METADATA:
{'Header 1': 'Example Domain'}
----------------------------------------

Conclusion

The HTML Header Text Splitter allows you to intelligently split web content based on document structure rather than arbitrary character limits. It provides clean, meaningful segments with metadata that can greatly enhance downstream tasks like vector search and retrieval.