vLLM is the highest-throughput open-source inference server for large language models, achieving 2-24x higher throughput than naive implementations through PagedAttention, a custom attention algorithm that eliminates memory waste from KV cache fragmentation. It provides an OpenAI-compatible API out of the box, making it a drop-in replacement for the OpenAI API in most applications. For teams self-hosting LLMs, vLLM is the right choice when throughput matters: it consistently outperforms both Ollama and Hugging Face TGI on tokens-per-second benchmarks under high-concurrency workloads.
PagedAttention: Why It Is Faster
The standard KV cache in transformer inference pre-allocates memory for the maximum possible sequence length. For a model with a 4,096-token context window, even if a request uses only 50 tokens in total (prompt plus generated output), the server reserves memory for all 4,096. That leaves 98.8% of the KV cache reserved for that request unused.
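The 98.8% figure is simply the share of reserved token slots that are never filled. A quick check using the numbers from the paragraph above (the function name kv_cache_waste is illustrative, not part of vLLM):

```python
def kv_cache_waste(reserved_tokens: int, used_tokens: int) -> float:
    """Return the fraction of pre-allocated KV cache slots left unused."""
    if reserved_tokens <= 0:
        raise ValueError("reserved_tokens must be positive")
    if not 0 <= used_tokens <= reserved_tokens:
        raise ValueError("used_tokens must be between 0 and reserved_tokens")
    return 1 - used_tokens / reserved_tokens


# 4,096 slots reserved, 50 actually used
waste = kv_cache_waste(reserved_tokens=4096, used_tokens=50)
print(f"{waste:.1%}")  # 98.8%
```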
PagedAttention divides the KV cache into fixed-size pages (analogous to OS virtual memory pages) and allocates them dynamically. Pages are assigned as the sequence grows and freed when the request completes. This allows:
- Larger effective batch sizes (more requests fit in the same GPU memory)
- Continuous batching (new requests are added to in-progress batches without waiting)
- Copy-on-write sharing of KV cache pages between parallel decoding paths (useful for beam search)
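The allocation scheme described above can be sketched as a toy page allocator. This is a simplified illustration, not vLLM's actual implementation: it tracks page IDs only (no tensors), and the class name, method names, and page size are all assumptions made for the example.

```python
class PagedKVCache:
    """Toy allocator: hands out fixed-size pages on demand, one table per request."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # IDs of unallocated pages
        self.block_tables = {}                    # request_id -> list of page IDs
        self.lengths = {}                         # request_id -> tokens stored

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.get(request_id, [])
        length = self.lengths.get(request_id, 0)
        # A new page is needed only when every allocated slot is already full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table = table + [self.free_pages.pop()]
        self.block_tables[request_id] = table
        self.lengths[request_id] = length + 1

    def free(self, request_id: str) -> None:
        # Return all of the request's pages to the pool when it completes.
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(50):
    cache.append_token("req-a")
pages_used = len(cache.block_tables["req-a"])
print(pages_used)  # 4 pages (64 slots) for 50 tokens
cache.free("req-a")
```

With 16-token pages, a 50-token request holds 64 slots instead of 4,096, and the remaining pages stay available for other requests in the batch.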
The result: on standard benchmarks, vLLM achieves up to 24x higher throughput than Hugging Face's original transformers inference on multi-request workloads.
Installation and Basic Setup
Requirements: NVIDIA GPU with CUDA 11.8+, Python 3.9+
pip install vllm
Start a model server:
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000 --dtype float16
Once the model has loaded, the server listens at http://localhost:8000 and exposes an OpenAI-compatible API. Any code using the OpenAI SDK can point to this endpoint by setting base_url when constructing the client:
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # vLLM does not require auth by default
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain transformers briefly"}]
)
print(response.choices[0].message.content)
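Because vLLM's throughput advantage shows up under concurrency, a client usually wants to keep several requests in flight at once. The following is a minimal sketch of one way to do that with a thread pool. The helper run_concurrently and its send parameter are hypothetical names introduced for illustration, and the commented wiring assumes the client object can safely be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(
    prompts: List[str],
    send: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Call `send` on every prompt in parallel; results keep the prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# Wiring it to the client defined above (requires the server to be running):
#
# def send(prompt: str) -> str:
#     response = client.chat.completions.create(
#         model="mistralai/Mistral-7B-Instruct-v0.3",
#         messages=[{"role": "user", "content": prompt}],
#     )
#     return response.choices[0].message.content
#
# answers = run_concurrently(["What is a KV cache?", "What is beam search?"], send)
```

Passing send as a parameter keeps the helper independent of any particular client, so it can be exercised with a stub before pointing it at a live server.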