## tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for the OpenAI models. Chunk size is measured by the tiktoken tokenizer, and `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`, and `TokenTextSplitter` can all be used with tiktoken directly.
To split with a `CharacterTextSplitter` and then merge chunks with tiktoken, use its `.from_tiktoken_encoder()` method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
The `.from_tiktoken_encoder()` method takes either `encoding_name` as an argument (e.g. `cl100k_base`), or the `model_name` (e.g. `gpt-4`). All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:
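A minimal sketch (the sample text and parameter values are illustrative, assuming the `langchain-text-splitters` package):

```python
# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter

# Illustrative sample text with "\n\n" separators for the splitter.
sample_text = "\n\n".join(
    ["All happy families are alike; each unhappy family is unhappy in its own way."] * 20
)

# chunk_size is measured in tokens of the chosen encoding; chunk_overlap
# and any separators are forwarded to the CharacterTextSplitter constructor.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # alternatively: model_name="gpt-4"
    chunk_size=100,
    chunk_overlap=0,
)
texts = text_splitter.split_text(sample_text)
print(texts[0])
```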
To enforce a hard constraint on the chunk size, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, where each split will be recursively split further if it has a larger size:
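A sketch under the same assumptions, reusing `sample_text` from the snippet above:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Oversized splits are recursively re-split on finer separators until
# each chunk fits within chunk_size tokens.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
texts = text_splitter.split_text(sample_text)
```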
We can also load a `TokenTextSplitter`, which works with tiktoken directly and will ensure each split is smaller than the chunk size.
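For example (parameter values are illustrative, again reusing `sample_text`):

```python
from langchain_text_splitters import TokenTextSplitter

# Splits on raw token boundaries, so every chunk is at most chunk_size tokens.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(sample_text)
print(texts[0])
```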
Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using `TokenTextSplitter` directly can therefore split the tokens for a single character between two chunks, causing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure chunks contain valid Unicode strings.
## spaCy

LangChain also implements splitters based on the spaCy tokenizer, splitting text on sentence boundaries detected by spaCy.
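A minimal sketch (the chunk size is illustrative; `SpacyTextSplitter` defaults to the `en_core_web_sm` pipeline, which must be downloaded first):

```python
# pip install spacy && python -m spacy download en_core_web_sm
from langchain_text_splitters import SpacyTextSplitter

# Splits on sentences found by spaCy; chunk size is measured in characters.
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(sample_text)
```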
## SentenceTransformers

The `SentenceTransformersTokenTextSplitter` is a specialized text splitter for use with sentence-transformer models, constraining chunks to the token window of the model you want to use. You can optionally specify:
- `chunk_overlap`: integer count of token overlap;
- `model_name`: sentence-transformer model name, defaulting to `"sentence-transformers/all-mpnet-base-v2"`;
- `tokens_per_chunk`: desired token count per chunk.
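For example (the text and parameter values are illustrative):

```python
# pip install sentence-transformers
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0,
    model_name="sentence-transformers/all-mpnet-base-v2",  # the default
    tokens_per_chunk=256,
)
chunks = splitter.split_text("Lorem ipsum dolor sit amet. " * 200)
print(splitter.count_tokens(text=chunks[0]))  # tokens per chunk, per this model
```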
## NLTK

Rather than just splitting on a fixed separator, we can use NLTK to split based on NLTK tokenizers: the text is split by the NLTK sentence tokenizer, and chunk size is measured by number of characters.
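A sketch (the chunk size is illustrative; NLTK's `punkt` tokenizer data must be downloaded first):

```python
# pip install nltk && python -c "import nltk; nltk.download('punkt')"
from langchain_text_splitters import NLTKTextSplitter

# Splits on sentences found by NLTK; chunk size is measured in characters.
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(sample_text)
```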
## KoNLPy

For Korean text, LangChain offers a splitter built on KoNLPy's Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying the part of speech for each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.
While Kkma is renowned for its detailed analysis, this precision can impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
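A minimal sketch (the Korean sample text is illustrative; konlpy requires a Java runtime):

```python
# pip install konlpy
from langchain_text_splitters import KonlpyTextSplitter

# Splits on sentence boundaries detected by Kkma; chunk size is
# measured in characters.
text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text("안녕하세요. 오늘 날씨가 좋네요. 산책을 갈까요?")
```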
## Hugging Face tokenizer

Hugging Face has many tokenizers. We can use one of them, such as `GPT2TokenizerFast`, to count text length in tokens: the text is split by the character passed in, and chunk size is measured by the number of tokens computed by the Hugging Face tokenizer.
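For example (parameter values are illustrative, reusing `sample_text` from the tiktoken snippet):

```python
# pip install transformers
from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Token counts come from the Hugging Face tokenizer; splitting itself
# still happens on the CharacterTextSplitter separator.
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=100,
    chunk_overlap=0,
)
texts = text_splitter.split_text(sample_text)
```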