- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
To use this integration, you should have the `vllm` Python package installed.
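The wrapper lives in `langchain-community`, so install both packages, e.g. `pip install -U vllm langchain-community`. Below is a minimal sketch of creating the LLM; the model name and sampling parameters are illustrative:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # required by some Hugging Face models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France?"))
```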
## Integrate the model in an LLMChain
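A sketch of wrapping the `llm` defined above in an `LLMChain`; the prompt template and question are illustrative:

```python
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"
# The result is a dict; the generated text is under the "text" key.
print(llm_chain.invoke({"question": question}))
```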
## Distributed inference

vLLM supports distributed tensor-parallel inference and serving. To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs, see the sketch below.
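A sketch, assuming a machine with 4 GPUs; the model name is illustrative:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # required by some Hugging Face models
)

print(llm.invoke("What is the future of AI?"))
```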
## Quantization

vLLM supports `awq` quantization. To enable it, pass `quantization` to `vllm_kwargs`.
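A sketch, assuming an AWQ-quantized checkpoint such as `TheBloke/Llama-2-7b-Chat-AWQ` (any AWQ model should work):

```python
from langchain_community.llms import VLLM

llm_awq = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},  # tell vLLM to load the AWQ weights
)
```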
## OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications that use the OpenAI API. The server can be queried in the same format as the OpenAI API.

### OpenAI-Compatible completion
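A sketch of querying such a server via the `VLLMOpenAI` wrapper, assuming the server was started with e.g. `vllm serve tiiuae/falcon-7b`; the base URL, API key, and model name are placeholders for whatever your server uses:

```python
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # any string works unless the server was started with --api-key
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)

print(llm.invoke("Rome is"))
```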
## LoRA adapter

LoRA adapters can be used with any vLLM model that implements `SupportsLoRA`.
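A sketch, assuming a LoRA-capable base model and a local adapter path (both placeholders); LoRA support must be enabled in `vllm_kwargs`, and passing `lora_request` through `invoke` may require a recent `langchain-community` version:

```python
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder: any model implementing SupportsLoRA
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)

# LoRARequest(adapter_name, adapter_id, adapter_path); the path is a placeholder
lora_adapter = LoRARequest("lora_adapter", 1, "/path/to/your/lora/adapter")

print(llm.invoke("Summarize your training domain.", lora_request=lora_adapter))
```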