Basic usage:
By default, MarkdownHeaderTextSplitter strips the headers it splits on from each output chunk’s content. To disable this, set strip_headers = False.
The default MarkdownHeaderTextSplitter strips white spaces and new lines. To preserve the original formatting of your Markdown documents, check out ExperimentalMarkdownSyntaxTextSplitter.
How to return Markdown lines as separate documents:
By default, MarkdownHeaderTextSplitter aggregates lines based on the headers specified in headers_to_split_on. We can disable this by setting return_each_line = True.
Header information is still retained in the metadata for each document.
How to constrain chunk size:
Within each Markdown group, we can then apply any text splitter we want, such as RecursiveCharacterTextSplitter, which allows further control over the chunk size.