Ray Serve is a scalable model serving library for building online inference APIs. Serve is particularly well suited for system composition, enabling you to build a complex inference service consisting of multiple chains and business logic all in Python code.
This notebook shows a simple example of how to deploy an OpenAI chain into production. You can extend it to deploy your own self-hosted models, where you can easily define the amount of hardware resources (GPUs and CPUs) needed to run your model efficiently in production. Read more about the available options, including autoscaling, in the Ray Serve documentation.
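As a sketch of what those resource and autoscaling options look like, a deployment can declare per-replica hardware requirements and replica bounds through the `@serve.deployment` decorator (the specific values below are illustrative; consult the Ray Serve documentation for the full set of options):

```python
from ray import serve


# Reserve hardware per replica and let Serve scale the replica count;
# the numbers here are placeholders, not recommendations.
@serve.deployment(
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class SelfHostedLLM:
    async def __call__(self, request) -> str:
        # Model inference would go here
        return "..."
```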
The general skeleton for deploying a service is the following:
```python
# 0: Import Ray Serve and Starlette's Request
from ray import serve
from starlette.requests import Request


# 1: Define a Ray Serve deployment.
@serve.deployment
class LLMServe:
    def __init__(self) -> None:
        # All the initialization code goes here
        pass

    async def __call__(self, request: Request) -> str:
        # You can parse the request here
        # and return a response
        return "Hello World"


# 2: Bind the model to the deployment
deployment = LLMServe.bind()

# 3: Run the deployment
serve.api.run(deployment)
```
Example of deploying an OpenAI chain with custom prompts
Get an OpenAI API key from the OpenAI platform. Running the following code will prompt you to provide your API key.
```python
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
```
```python
from getpass import getpass

OPENAI_API_KEY = getpass()
```
```python
@serve.deployment
class DeployLLM:
    def __init__(self):
        # We initialize the LLM, template and the chain here
        llm = OpenAI(openai_api_key=OPENAI_API_KEY)
        template = "Question: {question}\n\nAnswer: Let's think step by step."
        prompt = PromptTemplate.from_template(template)
        self.chain = LLMChain(llm=llm, prompt=prompt)

    def _run_chain(self, text: str):
        return self.chain(text)

    async def __call__(self, request: Request):
        # 1. Parse the request
        text = request.query_params["text"]
        # 2. Run the chain
        resp = self._run_chain(text)
        # 3. Return the response
        return resp["text"]
```
Now we can bind the deployment.
```python
# Bind the model to the deployment
deployment = DeployLLM.bind()
```
We can assign the port number (and host) when we run the deployment.
```python
# Example port number
PORT_NUMBER = 8282

# Run the deployment
serve.api.run(deployment, port=PORT_NUMBER)
```
Now that the service is deployed at localhost:8282, we can send a POST request to get the results back.
```python
import requests

text = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
# Pass the question as a query parameter so that requests URL-encodes it
response = requests.post(f"http://localhost:{PORT_NUMBER}/", params={"text": text})
print(response.content.decode())
```