Sentence Transformers is the most widely used Python framework for state-of-the-art text and image embeddings. The Hugging Face Hub hosts thousands of pretrained embedding and reranker models that run locally with no API key required; in LangChain, the embedding models are exposed through the HuggingFaceEmbeddings class.

Setup

pip install -qU langchain-huggingface
langchain-huggingface pulls in sentence-transformers as a dependency, which in turn installs transformers and torch.

Basic usage

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

query_embedding = embeddings.embed_query("What is a sentence embedding?")
doc_embeddings = embeddings.embed_documents(
    [
        "Sentence embeddings map text to dense vectors.",
        "LangChain provides a standard Embeddings interface.",
    ]
)
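embed_query returns a single vector as a list of floats, and embed_documents returns one vector per input text. A quick sanity check (the 768 dimensionality is specific to all-mpnet-base-v2):
print(len(query_embedding))  # 768
print(len(doc_embeddings), len(doc_embeddings[0]))  # 2 768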

Choosing a model

Start from the MTEB leaderboard. Strong starting points across different tradeoffs:
| Model | Params | Notes |
|---|---|---|
| sentence-transformers/all-mpnet-base-v2 | 110M | Classic, small, CPU-friendly, no prompt required |
| BAAI/bge-m3 | 570M | Multilingual; produces dense, sparse, and multi-vector embeddings in one pass |
| mixedbread-ai/mxbai-embed-large-v1 | 335M | Strong English performance, supports Matryoshka truncation |
| nomic-ai/modernbert-embed-base | 149M | 8192-token context, modern architecture |
| lightonai/DenseOn | 149M | Modern architecture, strong performance for its size |
| Qwen/Qwen3-Embedding-0.6B | 595M | Multilingual, instruction-aware, top MTEB performance |
See also Factors to weigh for a deeper walkthrough of the tradeoffs.
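mixedbread-ai/mxbai-embed-large-v1 above was trained with Matryoshka representation learning, so its vectors can be truncated to shrink the index at a modest quality cost. A minimal sketch, assuming sentence-transformers >= 2.7 (model_kwargs is forwarded to the underlying SentenceTransformer, which accepts truncate_dim):
embeddings = HuggingFaceEmbeddings(
    model_name="mixedbread-ai/mxbai-embed-large-v1",
    model_kwargs={"truncate_dim": 512},  # keep only the first 512 dimensions
)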

Normalize embeddings

Models trained with a cosine-similarity objective work best with unit-length output vectors. If your vector store uses cosine similarity (or dot product over normalized vectors), normalize at the source:
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    encode_kwargs={"normalize_embeddings": True},
)
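To verify the effect, check that output vectors have unit length (numpy comes in through the sentence-transformers dependency chain):
import numpy as np

vec = embeddings.embed_query("hello")
print(np.linalg.norm(vec))  # ~1.0 with normalize_embeddings=True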

Device and throughput

Sentence Transformers auto-selects the best available device (CUDA > MPS > CPU), so in most cases you don’t need to set device= explicitly. On a GPU, raise batch_size to keep the device saturated:
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    encode_kwargs={"batch_size": 64, "normalize_embeddings": True},
)
To pin to a specific device, pass model_kwargs={"device": "cpu"} (or "cuda:1", etc.). For multiple GPUs, set multi_process=True. For Intel CPUs, use model_kwargs={"backend": "ipex"} after installing optimum[ipex].
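For example, pinning the second GPU while keeping the larger batch size (a sketch; adjust the device string to your hardware):
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda:1"},  # or "cpu", "mps", ...
    encode_kwargs={"batch_size": 64, "normalize_embeddings": True},
)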

Query and document prompts

Some models (intfloat/e5-*, Qwen/Qwen3-Embedding-*, many BAAI/bge-*) are trained with distinct prompts for queries and documents. Pass these via encode_kwargs and query_encode_kwargs:
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    encode_kwargs={"prompt": "passage: "},
    query_encode_kwargs={"prompt": "query: "},
)
Using the right prompts at indexing and query time typically gives a meaningful retrieval quality boost. Check each model’s card on Hugging Face for the recommended prompt strings.
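As another example, the English bge v1.5 models recommend an instruction on the query side only; the prompt string below comes from the BAAI/bge-large-en-v1.5 model card (verify it against the card for your exact model):
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
    query_encode_kwargs={
        "prompt": "Represent this sentence for searching relevant passages: ",
        "normalize_embeddings": True,
    },
)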

Deploy for production

For serving Sentence Transformers models at scale, use Text Embeddings Inference (TEI), a dedicated inference server from Hugging Face with batching, GPU support, and OpenAI-compatible APIs. Point LangChain at a TEI deployment via HuggingFaceEndpointEmbeddings: see the main Hugging Face embeddings guide.
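Once a TEI container is serving a model (here assumed to be listening on localhost:8080), the client side is a drop-in swap:
from langchain_huggingface import HuggingFaceEndpointEmbeddings

embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8080")
vectors = embeddings.embed_documents(["TEI batches requests server-side."])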

Reranking

The same ecosystem hosts cross-encoder reranker models. For a local reranker on top of a vector store, see the Cross Encoder Reranker guide.
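A minimal sketch of the local setup (the model choice and top_n=3 are illustrative; requires langchain and langchain-community in addition to the packages above):
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)  # keep the 3 best documents
The linked guide shows how to wrap this compressor around a retriever.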

Troubleshooting

If the accelerate package is missing or fails to import:
pip install -qU accelerate