BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query. BM25Retriever uses the rank_bm25 package.
pip install -qU rank_bm25
from langchain_community.retrievers import BM25Retriever
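
To make the ranking function described above concrete, here is a minimal pure-Python sketch of BM25 scoring. It is illustrative only: the helper name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the always-positive IDF formula are assumptions chosen for the sketch, and the rank_bm25 package may compute IDF differently.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against a tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [text.split() for text in ["foo", "bar", "world", "hello", "foo bar"]]
scores = bm25_scores(["foo"], corpus)
print(scores)
```

Under these assumptions, the document "foo" scores higher than "foo bar" for the query "foo", because the same single match is spread over a longer document. Documents without the term score zero.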

Create a new retriever with texts

retriever = BM25Retriever.from_texts(["foo", "bar", "world", "hello", "foo bar"])

Create a new retriever with documents

You can also create a retriever from a list of Document objects.
from langchain_core.documents import Document

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ]
)

Use retriever

We can now use the retriever!
result = retriever.invoke("foo")
result
[Document(metadata={}, page_content='foo'),
 Document(metadata={}, page_content='foo bar'),
 Document(metadata={}, page_content='hello'),
 Document(metadata={}, page_content='world')]

Preprocessing function

Pass a custom preprocessing function to the retriever to improve search results. Tokenizing text at the word level can improve retrieval, especially for chunked documents such as those stored in vector stores like Chroma, Pinecone, or Faiss.
import nltk

nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ],
    k=2,
    preprocess_func=word_tokenize,
)

result = retriever.invoke("bar")
result
[Document(metadata={}, page_content='bar'),
 Document(metadata={}, page_content='foo bar')]
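
If NLTK is not available, a preprocessing function can also be written by hand. The sketch below assumes that `preprocess_func` accepts any callable that maps a string to a list of string tokens, as `word_tokenize` does in the example above. The helper name `simple_preprocess` is hypothetical.

```python
import re
from typing import List


def simple_preprocess(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    # Punctuation and whitespace act as separators and are dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = simple_preprocess("Hello, World! Foo-bar.")
print(tokens)
```

A function like this could then be passed as `preprocess_func=simple_preprocess` in place of `word_tokenize`. Lowercasing makes matching case-insensitive, which is a design choice that may not suit every corpus.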

BM25Plus variant

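  • BM25Retriever also supports the BM25Plus variant, which adds a lower bound (delta) to each matched term's contribution. It was designed to counter the way standard BM25's length normalization can over-penalize documents based on their length.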
  • BM25Plus ensures that matched terms always contribute a positive score, which can improve recall for short texts, passages, or chunked documents commonly used in retrieval-augmented generation (RAG) workflows.
By default, BM25Retriever uses standard BM25 (BM25Okapi). BM25Plus must be explicitly enabled.
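
The effect of the lower bound can be illustrated with a small sketch. It follows the commonly cited BM25+ formulation, in which a constant `delta` is added to the length-normalized term-frequency component. The helper names `okapi_term` and `plus_term` and the values `k1=1.5`, `b=0.75`, and `delta=0.5` are assumptions for illustration, and the rank_bm25 implementation may differ in detail.

```python
def okapi_term(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Length-normalized term-frequency component of standard BM25."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))


def plus_term(tf, doc_len, avgdl, k1=1.5, b=0.75, delta=0.5):
    """BM25+ component: the Okapi component shifted up by delta."""
    return delta + okapi_term(tf, doc_len, avgdl, k1, b)


# One occurrence of the query term in a document 100x longer than average.
okapi = okapi_term(tf=1, doc_len=1000, avgdl=10)
plus = plus_term(tf=1, doc_len=1000, avgdl=10)
print(f"okapi={okapi:.3f} plus={plus:.3f}")
```

In this setup the standard component shrinks toward zero as the document grows, while the BM25+ component never drops below `delta`, so a matched term always contributes a meaningful amount to the score.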

Example: Using BM25Plus

from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

docs = [
    Document(
        page_content=(
            "LangChain provides tools for building applications with large language models. "
            "It supports retrieval augmented generation and agents."
        )
    ),

    Document(
        page_content="LangChain retrieval augmented generation"
    ),
]

retriever = BM25Retriever.from_documents(
    docs,
    bm25_variant="plus",
    bm25_params={"delta": 0.5},
)

result = retriever.invoke("retrieval augmented generation")
result
BM25Plus is particularly useful when working with:
  • Short documents or passages
  • Chunked text in RAG systems
  • Corpora with highly variable document lengths
For long-form documents with more uniform lengths, standard BM25 may provide slightly higher precision.