`<h1>`, `<h2>`, `<h3>`, etc.), and adds metadata for each header relevant to any given chunk.
Capabilities:
`HTMLSectionSplitter` splits HTML into larger sections such as `<section>`, `<div>`, or custom-defined sections, and internally uses `RecursiveCharacterTextSplitter` for large sections.

Use `HTMLHeaderTextSplitter` when: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
Use `HTMLSectionSplitter` when: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
Use `HTMLSemanticPreservingSplitter` when: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained.

Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
---|---|---|---|
Splits based on headers | Yes | Yes | Yes |
Preserves semantic elements (tables, lists) | No | No | Yes |
Adds metadata for headers | Yes | Yes | Yes |
Custom handlers for HTML tags | No | No | Yes |
Preserves media (images, videos) | No | No | Yes |
Considers font sizes | No | Yes | No |
Uses XSLT transformations | No | Yes | No |
To choose which headers to split on, specify `headers_to_split_on` when instantiating `HTMLHeaderTextSplitter`.
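To make the idea of header-based splitting concrete, here is a minimal standard-library sketch. It is not LangChain's implementation: `ToyHeaderSplitter` and `split_by_headers` are hypothetical names, and the sketch assumes headers are written as `h1` to `h6` tags and appear before the text they govern. It only shows how a list of (tag, metadata key) pairs can drive both the split points and the metadata attached to each chunk.

```python
from html.parser import HTMLParser


class ToyHeaderSplitter(HTMLParser):
    """Illustrative header-aware splitter; not LangChain's implementation."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # tag -> metadata key
        self.active = {}       # header tag -> title currently in scope
        self.in_header = None  # header tag being read, or None
        self.buffer = []       # text pieces collected since the last flush
        self.chunks = []       # (text, metadata) pairs

    def _take_text(self):
        # Join buffered pieces and collapse runs of whitespace.
        text = " ".join(" ".join(self.buffer).split())
        self.buffer = []
        return text

    def emit_chunk(self):
        text = self._take_text()
        if text:
            metadata = {self.names[tag]: title for tag, title in self.active.items()}
            self.chunks.append((text, metadata))

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.emit_chunk()  # text before a header belongs to the previous scope
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            title = self._take_text()
            # Keep only shallower headers; "h1" < "h2" holds as a string comparison.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = title
            self.in_header = None

    def handle_data(self, data):
        self.buffer.append(data)


def split_by_headers(html, headers_to_split_on):
    parser = ToyHeaderSplitter(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.emit_chunk()  # flush text after the last header
    return parser.chunks


html = """
<h1>Foo</h1><p>Intro about Foo.</p>
<h2>Bar</h2><p>Text about Bar.</p>
<h2>Baz</h2><p>Text about Baz.</p>
"""
chunks = split_by_headers(html, [("h1", "Header 1"), ("h2", "Header 2")])
for text, metadata in chunks:
    print(metadata, "->", text)
```

Each chunk carries the titles of the headers in scope when its text was read, and a new `h2` replaces the previous `h2` while leaving the enclosing `h1` in place.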
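Specify `return_each_element=True` when instantiating `HTMLHeaderTextSplitter` to return chunks element by element, each as a distinct `Document`, instead of combining elements that share the same metadata.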
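The difference between the two modes can be sketched with plain tuples standing in for `Document` objects. This is an illustrative stand-in rather than the library's code: `combine_elements` is a hypothetical helper, and joining merged text with a single space is an assumption made for brevity.

```python
from itertools import groupby


def combine_elements(elements, return_each_element=False):
    """Return elements as-is, or merge consecutive ones that share metadata."""
    if return_each_element:
        return list(elements)
    combined = []
    for metadata, group in groupby(elements, key=lambda element: element[1]):
        combined.append((" ".join(text for text, _ in group), metadata))
    return combined


elements = [
    ("Intro about Foo.", {"Header 1": "Foo"}),
    ("More about Foo.", {"Header 1": "Foo"}),
    ("Text about Bar.", {"Header 1": "Foo", "Header 2": "Bar"}),
]
aggregated = combine_elements(elements)
per_element = combine_elements(elements, return_each_element=True)
print(len(aggregated), len(per_element))
```

In the aggregated result the two elements under "Foo" collapse into one chunk, while the per-element result keeps all three.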
To split HTML fetched from a URL, pass the URL string to the `split_text_from_url` method.
Similarly, a local HTML file can be passed to the `split_text_from_file` method.
`HTMLHeaderTextSplitter`, which splits based on HTML headers, can be composed with another splitter which constrains splits based on character lengths, such as `RecursiveCharacterTextSplitter`.
This can be done using the `.split_documents` method of the second splitter.
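The composition can be pictured as a second pass over the header-based chunks. The sketch below is a simplified stand-in for a character-length splitter, not `RecursiveCharacterTextSplitter` itself: the standalone `split_documents` function, the tuple representation of documents, and the greedy word-based strategy are all assumptions. The point it illustrates is that every smaller piece keeps the header metadata of the chunk it came from.

```python
def split_documents(docs, chunk_size):
    """Greedy word-based length splitter that copies metadata to every piece."""
    pieces = []
    for text, metadata in docs:
        piece = ""
        for word in text.split():
            candidate = f"{piece} {word}".strip()
            if len(candidate) > chunk_size and piece:
                pieces.append((piece, dict(metadata)))
                piece = word
            else:
                piece = candidate
        if piece:
            pieces.append((piece, dict(metadata)))
    return pieces


# Output of a header-based first pass, represented as (text, metadata) pairs.
header_splits = [
    ("alpha beta gamma delta", {"Header 1": "Foo"}),
    ("short", {"Header 1": "Bar"}),
]
splits = split_documents(header_splits, chunk_size=11)
for text, metadata in splits:
    print(metadata, "->", text)
```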
While `HTMLHeaderTextSplitter` attempts to attach all “relevant” headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes “above” associated text, i.e. prior siblings, ancestors, and combinations thereof. In the news article used as an example (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged “h1”, is in a distinct subtree from the text elements that we’d expect it to be “above”, so the “h1” element and its associated text do not show up in the chunk metadata (but, where applicable, we do see “h2” and its associated text).
`HTMLSectionSplitter` is a “structure-aware” text splitter that splits text at the element level and adds metadata for each header “relevant” to any given chunk. It lets you split HTML by sections.
It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.
Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. The default is to use the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. The transformation converts the HTML to a format/layout in which sections are easier to detect. For example, `span` elements can be converted to header tags based on their font size so that they are detected as sections.
`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text to determine whether it is a section or not based on the determined font size threshold.
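The font-size idea can be illustrated without XSLT. The sketch below uses a regular expression as a stand-in for the transformation step, so it is not what the library does: the `promote_large_spans` helper, the 20-pixel threshold, and the narrow `style="font-size: Npx"` pattern are all assumptions chosen for the example. It shows a large-font `span` being rewritten as a header tag so that a header-based splitter could treat it as a section boundary.

```python
import re

# Matches spans whose only style is a pixel font size (a deliberately narrow pattern).
SPAN = re.compile(r'<span style="font-size:\s*(\d+)px;?">(.*?)</span>', re.DOTALL)


def promote_large_spans(html, threshold_px=20, header_tag="h1"):
    def replace(match):
        size, text = int(match.group(1)), match.group(2)
        if size >= threshold_px:
            return f"<{header_tag}>{text}</{header_tag}>"
        return match.group(0)  # leave smaller text untouched

    return SPAN.sub(replace, html)


html = (
    '<span style="font-size: 24px">Quarterly report</span>'
    '<span style="font-size: 12px">Revenue grew this quarter.</span>'
)
converted = promote_large_spans(html)
print(converted)
```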
`HTMLSemanticPreservingSplitter` is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would cause loss of contextual relevancy such as table headers, list headers, etc.
At its heart, this splitter is designed to create contextually relevant chunks. General recursive splitting with `HTMLHeaderTextSplitter` can cause tables, lists, and other structured elements to be split in the middle, losing significant context and creating bad chunks.
The `HTMLSemanticPreservingSplitter` is essential for splitting HTML content that includes structured elements like tables and lists, especially when it’s critical to preserve these elements intact. Additionally, its ability to define custom handlers for specific HTML tags makes it a versatile tool for processing complex HTML documents.
IMPORTANT: `max_chunk_size` is not a definite maximum size of a chunk. The maximum size is calculated while the preserved content is not a part of the chunk, to ensure it is not split. When the preserved data is added back into the chunk, the chunk size may exceed `max_chunk_size`. This is crucial to ensure we maintain the structure of the original document.
`HTMLSemanticPreservingSplitter` can preserve a table and a large list within an HTML document. Setting the chunk size to 50 characters illustrates how the splitter ensures that these elements are not split, even when they exceed the maximum defined chunk size.
`HTMLSemanticPreservingSplitter` ensures that the entire table and the unordered list (`<ul>`) are preserved within their respective chunks. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact.
This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting `Document` objects retain the full structure of these elements, ensuring that the contextual relevance of the information is maintained.
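Why a chunk can end up larger than `max_chunk_size` is easier to see in a toy version of the placeholder approach described above. This is a simplified sketch, not the library's implementation: `split_preserving`, the `__PRESERVED_n__` placeholder format, the regex-based element detection (which does not handle nesting of the same tag), and the greedy word-based splitting are all assumptions.

```python
import re

# Elements to keep intact; this toy pattern does not handle nesting of the same tag.
PRESERVE = re.compile(r"<(table|ul|ol)\b.*?</\1>", re.DOTALL)


def split_preserving(html, max_chunk_size):
    preserved = {}

    def stash(match):
        key = f"__PRESERVED_{len(preserved)}__"
        preserved[key] = match.group(0)
        return f" {key} "

    # 1. Swap preserved elements for short placeholders, then drop other tags.
    text = re.sub(r"<[^>]+>", " ", PRESERVE.sub(stash, html))

    # 2. Greedy word-based split; sizes are measured with placeholders in place.
    chunks, piece = [], ""
    for word in text.split():
        candidate = f"{piece} {word}".strip()
        if len(candidate) > max_chunk_size and piece:
            chunks.append(piece)
            piece = word
        else:
            piece = candidate
    if piece:
        chunks.append(piece)

    # 3. Restore the preserved elements, which may push a chunk past the limit.
    for key, element in preserved.items():
        chunks = [chunk.replace(key, element) for chunk in chunks]
    return chunks


table = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
html = f"<p>Intro text here.</p>{table}<p>Closing words.</p>"
chunks = split_preserving(html, max_chunk_size=50)
print(len(chunks), len(chunks[0]))
```

With the placeholder in place, the text fits within 50 characters and forms a single chunk. Once the table is restored, that chunk is longer than 50 characters, but the table is intact.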
`HTMLSemanticPreservingSplitter` allows you to define custom handlers for specific HTML elements. Some platforms have custom HTML tags that are not natively parsed by `BeautifulSoup`; when this occurs, you can use custom handlers to add the formatting logic easily.
This can be particularly useful for elements that require special processing, such as `<iframe>` tags or specific ‘data-’ elements. For example, a custom handler for `iframe` tags can convert them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the `iframe` tags while preserving other elements like tables and lists. The resulting `Document` objects show how the iframe is handled according to the custom logic you provided.
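A handler of this kind might look like the sketch below. It assumes the handler receives a parsed tag object that exposes attributes through `.get`, as a `BeautifulSoup` tag does, and that handlers are registered in a mapping from tag name to callable (called `custom_handlers` here). The exact link format is a choice made for the example, and a plain dictionary stands in for the parsed tag so the snippet runs without any third-party packages.

```python
def custom_iframe_extractor(iframe_tag):
    """Turn an <iframe> element into a Markdown-like link built from its src."""
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


# Assumed registration shape: tag name -> handler callable.
custom_handlers = {"iframe": custom_iframe_extractor}

# A dict stands in for the parsed tag here; it offers the same .get lookup.
fake_iframe = {"src": "https://example.com/embed"}
link = custom_handlers["iframe"](fake_iframe)
print(link)
```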
Important: When preserving items such as links, be mindful not to include `.` in your separators, and do not leave separators blank. `RecursiveCharacterTextSplitter` splits on full stops, which will cut links in half. Provide a separator list with `. ` (a period followed by a space) instead.
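The effect of the two separators can be seen with plain string splitting. This sketch uses `str.split` rather than any splitter class, and the sample sentence and URL are made up for illustration.

```python
text = "See the [guide](https://example.com/a.html). Then continue reading."

# Splitting on a bare "." breaks the URL apart.
naive = text.split(".")

# Splitting on ". " (period plus space) only breaks at the sentence boundary.
safe = text.split(". ")

print(naive)
print(safe)
```

The bare period cuts the link at every dot in the URL, while the period-plus-space separator leaves the link whole.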
We can also override the `<img>` tag with a custom handler and turn off `preserve_images` to insert any content we would like to embed in our chunks.
With a custom handler that extracts specific fields from an `<img>` element in HTML, we can further process the data with our agent and insert the result directly into our chunk. It is important to ensure `preserve_images` is set to `False`; otherwise the default processing of `<img>` fields will take place.
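As a rough sketch of such a handler, the snippet below builds an `<img>` handler around an injected `describe` callable. Everything named here is hypothetical: `make_img_handler`, the `describe` stub (which stands in for a call to an agent or captioning model), and the bracketed output format. As before, a dictionary stands in for the parsed tag, on the assumption that the real tag object exposes attributes through `.get`.

```python
def make_img_handler(describe):
    """Build an <img> handler that embeds the result of describe() in the chunk."""

    def handle_img(img_tag):
        src = img_tag.get("src", "")
        alt = img_tag.get("alt", "")
        return f"[image: {describe(src, alt)}]"

    return handle_img


def describe(src, alt):
    # Hypothetical stand-in for an agent call; it just falls back to alt or src.
    return alt or src


# Assumed registration shape: tag name -> handler callable.
custom_handlers = {"img": make_img_handler(describe)}

fake_img = {"src": "chart.png", "alt": "Quarterly revenue chart"}
result = custom_handlers["img"](fake_img)
print(result)
```

Passing the describing function in as an argument keeps the handler itself free of any model-specific code, so the stub can be swapped for a real call later.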