BM25, also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
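As a rough illustration of that ranking function, here is a minimal pure-Python sketch of the Okapi BM25 score (toy code for intuition, not the `rank_bm25` implementation; `k1` and `b` are the usual free parameters):

```python
import math

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with Okapi BM25.

    `corpus` is a list of token lists; `query` is a token list.
    """
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # document frequency of each query term
    df = {term: sum(term in doc for doc in corpus) for term in query}
    # IDF with a "+ 1" inside the log to keep it non-negative
    idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}
    scores = []
    for doc in corpus:
        s = 0.0
        for term in query:
            f = doc.count(term)  # term frequency in this document
            s += idf[term] * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

corpus = [t.split() for t in ["foo", "bar", "world", "hello", "foo bar"]]
scores = bm25_scores("foo".split(), corpus)
# the pure-"foo" document outscores "foo bar"; non-matching docs score 0
```

The length normalization in the denominator is why the longer "foo bar" document scores below the exact match "foo".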
`BM25Retriever` uses the `rank_bm25` package, which you can install with:

```shell
pip install -qU rank_bm25
```
```python
from langchain_community.retrievers import BM25Retriever
```
Create a new retriever with texts

```python
retriever = BM25Retriever.from_texts(["foo", "bar", "world", "hello", "foo bar"])
```
Create a new retriever with documents

You can also create a retriever from a list of `Document` objects.

```python
from langchain_core.documents import Document

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ]
)
```
Use retriever

We can now use the retriever!

```python
result = retriever.invoke("foo")
result
```

```
[Document(metadata={}, page_content='foo'),
 Document(metadata={}, page_content='foo bar'),
 Document(metadata={}, page_content='hello'),
 Document(metadata={}, page_content='world')]
```
Preprocessing function

You can pass a custom `preprocess_func` to control how text is tokenized before BM25 scoring. Word-level tokenization (here with NLTK's `word_tokenize`) often retrieves better than the default whitespace split, especially for chunked documents. The `k` parameter caps the number of documents returned (the default is 4).
```python
import nltk

nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ],
    k=2,
    preprocess_func=word_tokenize,
)
```
```python
result = retriever.invoke("bar")
result
```

```
[Document(metadata={}, page_content='bar'),
 Document(metadata={}, page_content='foo bar')]
```
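`preprocess_func` can be any callable mapping a string to a list of tokens, and it is applied both to the indexed texts and to each incoming query. A minimal hand-rolled alternative to `word_tokenize` (a sketch using only the standard library; the function name is illustrative) might lowercase and split on word characters:

```python
import re

def preprocess(text):
    """Lowercase and split on word characters, so 'Foo-Bar!' matches 'foo bar'."""
    return re.findall(r"\w+", text.lower())

tokens = preprocess("Foo-Bar, hello!")
# -> ['foo', 'bar', 'hello']
```

Lowercasing here makes matching case-insensitive, which the default whitespace split does not provide.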
BM25Plus variant

BM25Retriever also supports the BM25Plus variant, which is designed to reduce the bias against short documents present in standard BM25. BM25Plus ensures that matched terms always contribute a positive score, which can improve recall for short texts, passages, or chunked documents commonly used in retrieval-augmented generation (RAG) workflows.

By default, BM25Retriever uses standard BM25 (BM25Okapi); BM25Plus must be explicitly enabled.
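The long-document bias and its fix can be seen in the per-term formula BM25+ uses. Below is a pure-Python sketch of Lv and Zhai's formulation for intuition only (rank_bm25's `BM25Plus` also adjusts the IDF term, so its exact numbers differ):

```python
import math

def bm25_plus_term(f, doc_len, avgdl, idf, k1=1.5, b=0.75, delta=1.0):
    """Per-term BM25+ contribution.

    The `delta` shift guarantees every matching term adds at least
    idf * delta, no matter how long the document is.
    """
    tf = f * (k1 + 1) / (k1 * (1 - b + b * doc_len / avgdl) + f)
    return idf * (tf + delta)

# a single match in a very long document still scores above the idf * delta floor
long_doc = bm25_plus_term(f=1, doc_len=1000, avgdl=50, idf=2.0)
short_doc = bm25_plus_term(f=1, doc_len=10, avgdl=50, idf=2.0)
# short_doc > long_doc > 2.0
```

With `delta=0` this reduces to the standard BM25 term score, where the long document's contribution decays toward zero.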
Example: Using BM25Plus

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

docs = [
    Document(
        page_content=(
            "LangChain provides tools for building applications with large language models. "
            "It supports retrieval augmented generation and agents."
        )
    ),
    Document(page_content="LangChain retrieval augmented generation"),
]

retriever = BM25Retriever.from_documents(
    docs,
    bm25_variant="plus",
    bm25_params={"delta": 0.5},
)

result = retriever.invoke("retrieval augmented generation")
result
```
BM25Plus is particularly useful when working with:
- Short documents or passages
- Chunked text in RAG systems
- Corpora with highly variable document lengths
For long-form documents of relatively uniform length, standard BM25 may provide slightly higher precision.