["\n\n", "\n", " ", ""]
. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
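As a rough illustration of this strategy, here is a simplified pure-Python sketch — not LangChain's implementation (the real splitter also handles chunk overlap, regex separators, and more) — that tries the coarsest separator first, re-splits any oversized piece with the remaining finer separators, and merges small adjacent pieces back together up to the chunk size:

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=40):
    """Simplified sketch of recursive character splitting (no overlap):
    try the coarsest separator first; any piece still too large is
    re-split with the remaining, finer separators; small adjacent
    pieces are merged back together up to chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]
    chunks, buf = [], ""
    for piece in pieces:
        if len(piece) > chunk_size and rest:
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif buf and len(buf) + len(sep) + len(piece) <= chunk_size:
            buf = buf + sep + piece  # merge small pieces, keeping the separator
        elif buf:
            chunks.append(buf)
            buf = piece
        else:
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks

text = "One two.\n\nThree four five six seven eight nine ten eleven twelve."
print(recursive_split(text, chunk_size=30))
# → ['One two.', 'Three four five six seven', 'eight nine ten eleven twelve.']
```

Note how the first paragraph survives intact as its own chunk, while the second, oversized paragraph falls through to word-level splitting and is re-merged into chunks under the size limit.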
To obtain the string content directly, use `.split_text`.

To create LangChain Document objects (e.g., for use in downstream tasks), use `.create_documents`.
Key parameters of `RecursiveCharacterTextSplitter`:

- `chunk_size`: The maximum size of a chunk, where size is determined by the `length_function`.
- `chunk_overlap`: Target overlap between chunks. Overlapping chunks help to mitigate loss of information when context is divided between chunks.
- `length_function`: Function determining the chunk size.
- `is_separator_regex`: Whether the separator list (defaulting to `["\n\n", "\n", " ", ""]`) should be interpreted as regex.
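To make the `chunk_size` / `chunk_overlap` relationship concrete, here is a hedged, stand-alone sketch — not LangChain's code, which merges separator-delimited pieces rather than windowing a flat string — showing that consecutive chunks share `chunk_overlap` characters:

```python
def overlap_chunks(text, chunk_size=10, chunk_overlap=4):
    """Illustrative sketch only: slide a window of chunk_size characters
    in steps of chunk_size - chunk_overlap, so each chunk shares its
    last chunk_overlap characters with the next one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = overlap_chunks("abcdefghijklmnopqrstuvwxyz")
# chunks[0] == "abcdefghij"; chunks[1] == "ghijklmnop" — "ghij" is shared
```

The shared characters give each chunk a little context from its neighbor, which is what mitigates information loss at chunk boundaries.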
Splitting text from languages without word boundaries
Some writing systems do not have word boundaries, for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `["\n\n", "\n", " ", ""]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:

- Add ASCII full stop “.”, Unicode fullwidth full stop “．” (used in Chinese text), and ideographic full stop “。” (used in Japanese and Chinese).
- Add zero-width space used in Thai, Myanmar, Khmer, and Japanese.
- Add ASCII comma “,”, Unicode fullwidth comma “，”, and Unicode ideographic comma “、”.
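Putting the points above together, the extended separator list might look like the following sketch (exact ordering is a judgment call — coarser separators should come first; in LangChain this list would be passed as the `separators` argument to `RecursiveCharacterTextSplitter`):

```python
separators = [
    "\n\n",
    "\n",
    " ",
    ".",        # ASCII full stop
    ",",        # ASCII comma
    "\u200b",   # zero-width space (Thai, Myanmar, Khmer, Japanese)
    "\uff0c",   # fullwidth comma "，"
    "\u3001",   # ideographic comma "、"
    "\uff0e",   # fullwidth full stop "．"
    "\u3002",   # ideographic full stop "。"
    "",         # final fallback: split between any two characters
]
```

The `\uXXXX` escapes keep the non-ASCII separators unambiguous in source code; the trailing empty string remains the last resort so no chunk can exceed the size limit.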