SagemakerEndpointCrossEncoder enables you to use these Hugging Face models loaded on SageMaker.
It builds on top of the ideas in ContextualCompressionRetriever; the overall structure of this document came from the Cohere Reranker documentation.
For more on why cross encoders can be used as a reranking mechanism in conjunction with embeddings for better retrieval, refer to the Hugging Face Cross-Encoders documentation.
Set up the base vector store retriever
Let’s start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can set up the retriever to retrieve a high number (20) of docs.
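The snippet below is a minimal sketch of that setup; the file name state_of_the_union.txt, the chunking parameters, and the all-MiniLM-L6-v2 embedding model are illustrative choices rather than requirements.

```python
# Sketch of the base retriever setup (file path, chunk sizes, and embedding
# model are illustrative assumptions).
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

documents = TextLoader("state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = FAISS.from_documents(texts, embeddings).as_retriever(
    search_kwargs={"k": 20}  # retrieve a high number of candidate documents
)

docs = retriever.invoke("What did the president say about Ketanji Brown Jackson?")
```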
Doing reranking with CrossEncoderReranker
Now let’s wrap our base retriever with a ContextualCompressionRetriever. CrossEncoderReranker uses HuggingFaceCrossEncoder to rerank the returned results.
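A sketch of that wrapping is shown below; the BAAI/bge-reranker-base model and top_n=3 are example choices, and retriever is the base retriever from the previous step.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load a cross-encoder locally and keep only the top 3 reranked documents.
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson?"
)
```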
Uploading Hugging Face model to SageMaker endpoint
Here is a sample inference.py for creating an endpoint that works with SagemakerEndpointCrossEncoder. For step-by-step guidance, refer to this article.
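The following is one possible sketch of such a script; the model name and the JSON keys ("text_pairs" in the request, "scores" in the response) are assumptions and must match the content handler you use with SagemakerEndpointCrossEncoder.

```python
# Sketch of inference.py for a SageMaker PyTorch/Hugging Face inference container.
# Model name and JSON payload keys are assumptions; adjust them to your content handler.
import json
from typing import Any, Dict, List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def model_fn(model_dir: str) -> Dict[str, Any]:
    # Download the cross-encoder from the Hugging Face Hub on the fly instead of
    # shipping the weights inside model.tar.gz.
    model_name = "BAAI/bge-reranker-base"  # assumed model; replace as needed
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    return {"tokenizer": tokenizer, "model": model}


def transform_fn(artifacts: Dict[str, Any], input_data: bytes, content_type: str, accept: str) -> bytes:
    tokenizer, model = artifacts["tokenizer"], artifacts["model"]
    payload = json.loads(input_data)
    text_pairs: List[List[str]] = payload["text_pairs"]  # assumed request key

    with torch.inference_mode():
        inputs = tokenizer(
            text_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        scores = model(**inputs).logits.view(-1).float().tolist()

    return json.dumps({"scores": scores}).encode("utf-8")  # assumed response key
```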
It downloads the Hugging Face model on the fly, so you do not need to keep model artifacts such as pytorch_model.bin in your model.tar.gz.
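Once the endpoint is deployed, it can be plugged into the same reranking pipeline. The sketch below assumes an endpoint named cross-encoder in us-east-1 (both placeholders) and relies on the class’s default JSON content handler; supply a custom content handler if your inference script expects a different payload format.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import SagemakerEndpointCrossEncoder

# Point the cross encoder at the deployed SageMaker endpoint.
sagemaker_cross_encoder = SagemakerEndpointCrossEncoder(
    endpoint_name="cross-encoder",  # placeholder endpoint name
    region_name="us-east-1",        # placeholder region
)

compressor = CrossEncoderReranker(model=sagemaker_cross_encoder, top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson?"
)
```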