Sentence Transformers is the most widely used Python framework for state-of-the-art text and image embeddings. The Hugging Face Hub hosts thousands of pretrained embedding and reranker models that run locally with no API key required, accessible via the HuggingFaceEmbeddings class.
## Setup
langchain-huggingface pulls in sentence-transformers as a dependency, which in turn installs transformers and torch.
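Install the integration package:

```bash
pip install -U langchain-huggingface
```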
## Basic usage
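A minimal sketch; the model name is illustrative, and any Sentence Transformers checkpoint on the Hub works the same way:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Embed a single query string -> list[float]
vector = embeddings.embed_query("What is the capital of France?")

# Embed a batch of documents -> list[list[float]]
vectors = embeddings.embed_documents(["Paris is the capital of France."])

print(len(vector), len(vectors))
```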
## Choosing a model
Start from the MTEB leaderboard. Strong starting points across different tradeoffs:

| Model | Params | Notes |
|---|---|---|
| sentence-transformers/all-mpnet-base-v2 | 110M | Classic, small, CPU-friendly, no prompt required |
| BAAI/bge-m3 | 570M | Multilingual; produces dense, sparse, and multi-vector embeddings in one pass |
| mixedbread-ai/mxbai-embed-large-v1 | 335M | Strong English performance, supports Matryoshka truncation |
| nomic-ai/modernbert-embed-base | 149M | 8192-token context, modern architecture |
| lightonai/DenseOn | 149M | Modern architecture, strong performance for its size |
| Qwen/Qwen3-Embedding-0.6B | 595M | Multilingual, instruction-aware, top MTEB performance |
## Normalize embeddings
Models trained with a cosine-similarity objective benefit from normalized output vectors. If your vector store uses cosine similarity, normalize at the source.
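A sketch; `normalize_embeddings` is the standard Sentence Transformers flag, forwarded through `encode_kwargs`:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",  # illustrative
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors at encode time
)
```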
## Device and throughput

Sentence Transformers auto-selects the best available device (CUDA > MPS > CPU), so you don't need to set `device=` explicitly in most cases. On a GPU, raise `batch_size` to keep it fed.
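For example (the right batch size is workload-dependent; 128 is just a starting point):

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    encode_kwargs={"batch_size": 128},  # larger batches keep the GPU saturated
)
```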
To pin a device explicitly, pass `model_kwargs={"device": "cpu"}` (or `"cuda:1"`, etc.). For multiple GPUs, set `multi_process=True`. For Intel CPUs, use `model_kwargs={"backend": "ipex"}` after installing `optimum[ipex]`.
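A sketch of both options; the device string is illustrative:

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Pin the second GPU instead of relying on auto-detection
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cuda:1"},
)

# Fan out across all visible GPUs (one worker process per device)
multi_gpu_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    multi_process=True,
)
```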
## Query and document prompts
Some models (intfloat/e5-*, Qwen/Qwen3-Embedding-*, many BAAI/bge-*) are trained with distinct prompts for queries and documents. Pass these via `encode_kwargs` and `query_encode_kwargs`.
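A sketch using the E5 convention; the exact prompt strings are model-specific, so check the model card:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large",
    encode_kwargs={"prompt": "passage: "},      # applied by embed_documents
    query_encode_kwargs={"prompt": "query: "},  # applied by embed_query
)
```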
## Deploy for production
For serving Sentence Transformers models at scale, use Text Embeddings Inference (TEI), a dedicated inference server from Hugging Face with batching, GPU support, and OpenAI-compatible APIs. Point LangChain at a TEI deployment via `HuggingFaceEndpointEmbeddings`; see the main Hugging Face embeddings guide.
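A sketch, assuming a TEI container is already serving on port 8080 (the URL is illustrative):

```python
from langchain_huggingface import HuggingFaceEndpointEmbeddings

# Offload inference to the running TEI server instead of loading weights in-process
embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8080")
vector = embeddings.embed_query("hello world")
```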
## Reranking
The same ecosystem hosts cross-encoder reranker models. For a local reranker on top of a vector store, see the Cross Encoder Reranker guide.
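A minimal sketch of that pattern; the model choice and `top_n` are illustrative, and `langchain-community` is assumed to be installed:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# A cross-encoder scores each (query, document) pair jointly, then keeps the top hits
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)

# retriever = <your vector store retriever>
# reranking_retriever = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=retriever
# )
```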
## Troubleshooting

If the accelerate package is missing or fails to import, install or upgrade it:
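```bash
pip install -U accelerate
```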