- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
To use this integration, you should have the `vllm` Python package installed.
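The wrapper lives in `langchain-community`, so install both packages, e.g. `pip install -U vllm langchain-community`. Below is a minimal sketch of creating the LLM; the model name and sampling parameters are illustrative:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # required by some Hugging Face models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France?"))
```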
## Integrate the model in an LLMChain
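A sketch of wrapping the `llm` defined above in an `LLMChain`; the prompt template and question are illustrative:

```python
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"
# The result is a dict; the generated text is under the "text" key.
print(llm_chain.invoke({"question": question}))
```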
## Distributed inference

vLLM supports distributed tensor-parallel inference and serving. To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs, see the sketch below.
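A sketch, assuming a machine with 4 GPUs; the model name is illustrative:

```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # required by some Hugging Face models
)

print(llm.invoke("What is the future of AI?"))
```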
## Quantization

vLLM supports `awq` quantization. To enable it, pass `quantization` to `vllm_kwargs`.
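A sketch, assuming an AWQ-quantized checkpoint such as `TheBloke/Llama-2-7b-Chat-AWQ` (any AWQ model should work):

```python
from langchain_community.llms import VLLM

llm_awq = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},  # tell vLLM to load the AWQ weights
)
```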
## OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications that use the OpenAI API. The server can be queried in the same format as the OpenAI API.

### OpenAI-Compatible completion
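A sketch of querying such a server via the `VLLMOpenAI` wrapper, assuming the server was started with e.g. `vllm serve tiiuae/falcon-7b`; the base URL, API key, and model name are placeholders for whatever your server uses:

```python
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # any string works unless the server was started with --api-key
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)

print(llm.invoke("Rome is"))
```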
## LoRA adapter

LoRA adapters can be used with any vLLM model that implements `SupportsLoRA`.
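A sketch, assuming a LoRA-capable base model and a local adapter path (both placeholders); LoRA support must be enabled in `vllm_kwargs`, and passing `lora_request` through `invoke` may require a recent `langchain-community` version:

```python
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder: any model implementing SupportsLoRA
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)

# LoRARequest(adapter_name, adapter_id, adapter_path); the path is a placeholder
lora_adapter = LoRARequest("lora_adapter", 1, "/path/to/your/lora/adapter")

print(llm.invoke("Summarize your training domain.", lora_request=lora_adapter))
```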