Sentence Transformers is the most widely used Python framework for state-of-the-art text and image embeddings. The Hugging Face Hub hosts thousands of pretrained embedding and reranker models that run locally with no API key required, accessible via the HuggingFaceEmbeddings class.
## Setup
langchain-huggingface pulls in sentence-transformers as a dependency, which in turn installs transformers and torch.
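Install the integration package:

```bash
pip install -U langchain-huggingface
```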
## Basic usage
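A minimal sketch; the model name is illustrative, and any Sentence Transformers checkpoint on the Hub works the same way:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Embed a single query string -> list[float]
vector = embeddings.embed_query("What is the capital of France?")

# Embed a batch of documents -> list[list[float]]
vectors = embeddings.embed_documents(["Paris is the capital of France."])

print(len(vector), len(vectors))
```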
## Choosing a model
Start from the MTEB leaderboard. Strong starting points across different tradeoffs:

| Model | Params | Notes |
|---|---|---|
| sentence-transformers/all-mpnet-base-v2 | 110M | Classic, small, CPU-friendly, no prompt required |
| BAAI/bge-m3 | 570M | Multilingual; produces dense, sparse, and multi-vector embeddings in one pass |
| mixedbread-ai/mxbai-embed-large-v1 | 335M | Strong English performance, supports Matryoshka truncation |
| nomic-ai/modernbert-embed-base | 149M | 8192-token context, modern architecture |
| lightonai/DenseOn | 149M | Modern architecture, strong performance for its size |
| Qwen/Qwen3-Embedding-0.6B | 595M | Multilingual, instruction-aware, top MTEB performance |
## Normalize embeddings
Models trained with a cosine-similarity objective benefit from normalized output vectors. If your vector store uses cosine similarity, normalize at the source.
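A sketch; `normalize_embeddings` is the standard Sentence Transformers flag, forwarded through `encode_kwargs`:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",  # illustrative
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors at encode time
)
```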
## Device and throughput

Sentence Transformers auto-selects the best available device (CUDA > MPS > CPU), so you don't need to set `device=` explicitly in most cases. On a GPU, raise `batch_size` to keep it fed.
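For example (the right batch size is workload-dependent; 128 is just a starting point):

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    encode_kwargs={"batch_size": 128},  # larger batches keep the GPU saturated
)
```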
To pin a device explicitly, pass `model_kwargs={"device": "cpu"}` (or `"cuda:1"`, etc.). For multiple GPUs, set `multi_process=True`. For Intel CPUs, use `model_kwargs={"backend": "ipex"}` after installing `optimum[ipex]`.
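A sketch of both options; the device string is illustrative:

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Pin the second GPU instead of relying on auto-detection
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cuda:1"},
)

# Fan out across all visible GPUs (one worker process per device)
multi_gpu_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    multi_process=True,
)
```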
## Query and document prompts
Some models (intfloat/e5-*, Qwen/Qwen3-Embedding-*, many BAAI/bge-*) are trained with distinct prompts for queries and documents. Pass these via `encode_kwargs` and `query_encode_kwargs`.
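A sketch using the E5 convention; the exact prompt strings are model-specific, so check the model card:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large",
    encode_kwargs={"prompt": "passage: "},      # applied by embed_documents
    query_encode_kwargs={"prompt": "query: "},  # applied by embed_query
)
```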
## Deploy for production
For serving Sentence Transformers models at scale, use Text Embeddings Inference (TEI), a dedicated inference server from Hugging Face with batching, GPU support, and OpenAI-compatible APIs. Point LangChain at a TEI deployment via `HuggingFaceEndpointEmbeddings`; see the main Hugging Face embeddings guide.
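A sketch, assuming a TEI container is already serving on port 8080 (the URL is illustrative):

```python
from langchain_huggingface import HuggingFaceEndpointEmbeddings

# Offload inference to the running TEI server instead of loading weights in-process
embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8080")
vector = embeddings.embed_query("hello world")
```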
## Reranking
The same ecosystem hosts cross-encoder reranker models. For a local reranker on top of a vector store, see the Cross Encoder Reranker guide.
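A minimal sketch of that pattern; the model choice and `top_n` are illustrative, and `langchain-community` is assumed to be installed:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# A cross-encoder scores each (query, document) pair jointly, then keeps the top hits
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)

# retriever = <your vector store retriever>
# reranking_retriever = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=retriever
# )
```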
## Troubleshooting

If the accelerate package is missing or fails to import, install or upgrade it:
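```bash
pip install -U accelerate
```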