esprima)
tree_sitter and tree_sitter_languages.
It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.
The language used for parsing can be configured, along with the minimum number of
lines required to activate the splitting based on syntax.
If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
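The inference step can be pictured as a lookup from file extension to language name, with an explicit setting taking priority. The sketch below is a simplified illustration and not LangChain's implementation: the EXTENSION_TO_LANGUAGE table and the infer_language helper are hypothetical names introduced here, and the table lists only a handful of extensions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, abbreviated mapping used only for illustration.
EXTENSION_TO_LANGUAGE = {
    "py": "python",
    "js": "js",
    "cpp": "cpp",
    "go": "go",
}


def infer_language(filename: str, explicit: Optional[str] = None) -> Optional[str]:
    """Return the explicit language if given, else guess from the extension."""
    if explicit is not None:
        return explicit
    # Path.suffix includes the leading dot, or is empty when there is no extension.
    suffix = Path(filename).suffix.lstrip(".").lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

A file without an extension, or with one that is not in the table, yields None here, which a caller could treat as "do not apply syntax-based splitting".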
The parser_threshold parameter sets the minimum number of lines that a source code file must have to be segmented using the parser.
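To make the effect of the threshold concrete, here is a minimal sketch that segments Python source with the standard ast module. It is an illustration under stated assumptions, not LangChain's code: segment_source is a hypothetical helper, it handles Python only, and the exact boundary behavior (a file with fewer lines than the threshold is left whole) is an assumption based on the description above.

```python
import ast
from typing import List


def segment_source(code: str, parser_threshold: int = 0) -> List[str]:
    """Split Python source into top-level functions/classes, or return it whole."""
    lines = code.splitlines()
    # Files shorter than the threshold are kept as a single segment.
    if len(lines) < parser_threshold:
        return [code]
    segments = []
    for node in ast.parse(code).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            segments.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    # Fall back to the whole file when no top-level definitions are found.
    return segments or [code]
```

With a small threshold the file is split into one segment per top-level definition; with a threshold larger than the file's line count it comes back as a single unsplit segment.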
cpp.py.
Mimic the structure used in the cpp.py file, adapting it to suit the language you are incorporating.
Create test_language.py in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).
Follow the example of test_cpp.py to establish fundamental tests for the parsed elements in the new language.
Incorporate the new language into the language_parser.py file. Be sure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS, along with the docstring for LanguageParser, so that the added language is recognized and handled.
Also confirm that the language is included in text_splitter.py, in class Language, for proper parsing.
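The registration step can be sketched roughly as follows. Everything here is a stand-in for illustration: the two dictionaries only imitate the LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS tables named above, ToySegmenter and the "toy" language are made up, and the method name extract_functions_classes is an assumption about the segmenter interface rather than a confirmed signature.

```python
from typing import Dict, List, Type


class ToySegmenter:
    """Hypothetical segmenter: treats blank-line-separated blocks as units."""

    def __init__(self, code: str) -> None:
        self.code = code

    def extract_functions_classes(self) -> List[str]:
        return [block for block in self.code.split("\n\n") if block.strip()]


# Stand-ins for the two tables; the real ones live in language_parser.py.
LANGUAGE_EXTENSIONS: Dict[str, str] = {"cpp": "cpp"}
LANGUAGE_SEGMENTERS: Dict[str, Type[ToySegmenter]] = {}

# Register the made-up language "toy" with the file extension "toy".
LANGUAGE_EXTENSIONS["toy"] = "toy"
LANGUAGE_SEGMENTERS["toy"] = ToySegmenter


def test_toy_segmenter() -> None:
    """A basic test of the parsed elements for the new language."""
    code = "fn one\n  body\n\nfn two\n  body"
    segmenter = LANGUAGE_SEGMENTERS[LANGUAGE_EXTENSIONS["toy"]](code)
    assert segmenter.extract_functions_classes() == [
        "fn one\n  body",
        "fn two\n  body",
    ]
```

Keeping the extension table and the segmenter table separate means one language can be reached from several file extensions without duplicating the segmenter entry.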