Language models have a token limit, which you should not exceed. When you split your text into chunks, it is therefore a good idea to count tokens. There are many tokenizers, and they can produce very different counts for the same text, so you should count tokens with the same tokenizer used by the language model.
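For example, you can count tokens directly with js-tiktoken. A minimal sketch, assuming js-tiktoken is installed (npm install js-tiktoken) and that cl100k_base is the encoding your model uses:

import { getEncoding } from "js-tiktoken";

// Count tokens the way a model using the cl100k_base encoding would.
const enc = getEncoding("cl100k_base");
const numTokens = enc.encode("tiktoken is great!").length;
console.log(numTokens);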

js-tiktoken

js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.
We can use it with @[TokenTextSplitter] to estimate the tokens used. This will likely be more accurate for OpenAI models.
  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by tiktoken tokenizer.
npm install @langchain/textsplitters
import { TokenTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "fs";

// Example: read a long document
const stateOfTheUnion = readFileSync("state_of_the_union.txt", "utf8");
To split with a @[TokenTextSplitter] and then merge chunks with tiktoken, pass in an encodingName (e.g. cl100k_base) when initializing the @[TokenTextSplitter]. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
import { TokenTextSplitter } from "@langchain/textsplitters";

// Example: use cl100k_base encoding
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 10,
  chunkOverlap: 0,
});

const texts = await splitter.splitText(stateOfTheUnion);
console.log(texts[0]);
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

I
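To sanity-check that the splits stay within your token budget, you can recount each chunk with js-tiktoken using the same encoding the splitter was given. A sketch continuing the example above (as noted, merged splits can occasionally exceed chunkSize):

import { getEncoding } from "js-tiktoken";

// Recount each chunk with the same cl100k_base encoding the splitter used.
const enc = getEncoding("cl100k_base");
for (const chunk of texts) {
  console.log(enc.encode(chunk).length);
}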