vLLM is one of the highest-throughput open source inference servers for large language models, with reported throughput 2-24x higher than naive implementations. The gain comes from PagedAttention, a custom attention algorithm that eliminates memory waste from KV cache fragmentation. vLLM provides an OpenAI-compatible API out of the box, making it a drop-in replacement for the OpenAI API in most applications. For teams self-hosting LLMs, vLLM is the right choice when throughput matters: on high-concurrency workloads it typically outperforms both Ollama and Hugging Face TGI on tokens-per-second benchmarks.
PagedAttention: Why It Is Faster
The standard KV cache in transformer inference pre-allocates memory for the maximum possible sequence length. For a model with a 4,096-token context window, the server reserves memory for 4,096 tokens even if a request's prompt and output together use only 50. That wastes 98.8% of the KV cache reserved for that request.
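The 98.8% figure follows directly from the two numbers above. A minimal sketch of the arithmetic (the function name is illustrative, not part of vLLM):

```python
def wasted_fraction(reserved_tokens: int, used_tokens: int) -> float:
    """Fraction of a pre-allocated KV cache slot that is never written."""
    if not 0 <= used_tokens <= reserved_tokens:
        raise ValueError("used_tokens must be between 0 and reserved_tokens")
    return 1 - used_tokens / reserved_tokens


# 4,096 tokens reserved, 50 actually used.
waste = wasted_fraction(reserved_tokens=4096, used_tokens=50)
print(f"{waste:.1%} of the reserved KV cache is wasted")
```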
PagedAttention divides the KV cache into fixed-size pages (analogous to OS virtual memory pages) and allocates them dynamically. Pages are assigned as the sequence grows and freed when the request completes. This allows:
- Larger effective batch sizes (more requests fit in the same GPU memory)
- Continuous batching (new requests are added to in-progress batches without waiting)
- Copy-on-write sharing of KV cache pages between parallel decoding paths (useful for beam search)
The result: on standard benchmarks, vLLM achieves up to 24x higher throughput than Hugging Face's original transformers inference on multi-request workloads.
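The paging idea can be sketched with a toy allocator. This is not vLLM's implementation: the class name, the page size of 16 tokens, and the pool of 256 pages are assumptions chosen for illustration, and the sketch only tracks which pages a sequence owns rather than storing any keys or values.

```python
class PagedKVCache:
    """Toy allocator: hands out fixed-size pages as each sequence grows."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Allocate a new page only when every owned page is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        # Return all of the sequence's pages to the pool on completion.
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_pages=256, page_size=16)
for _ in range(50):
    cache.append_token("request-a")

pages_used = len(cache.block_tables["request-a"])
print(pages_used, "pages,", pages_used * cache.page_size, "token slots reserved")

cache.free("request-a")
```

In this sketch a 50-token request holds 4 pages (64 token slots) instead of a 4,096-token reservation, and the pages go back to the pool as soon as the request finishes.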
Installation and Basic Setup
Requirements: NVIDIA GPU with CUDA 11.8+, Python 3.9+
pip install vllm
Start a model server:
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000 --dtype float16
The server is now running at http://localhost:8000 with an OpenAI-compatible API. Any code using the OpenAI SDK can point to this endpoint by setting:
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # vLLM does not require auth by default
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain transformers briefly"}],
)
print(response.choices[0].message.content)
Handling Concurrent Requests
vLLM's continuous batching automatically handles concurrent requests efficiently. You do not need to configure batching manually for most use cases. The scheduler groups in-flight requests and processes them together at each forward pass.
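To see why continuous batching helps, here is a small simulation that counts forward passes under two policies. It is a simplified model, not vLLM's scheduler: each forward pass is assumed to produce one token per running request, and the workload (two long and two short requests) is made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Drop requests that just produced their last token.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 10, 100, 10]  # output tokens per request (hypothetical)
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(f"static: {static_steps} passes, continuous: {continuous_steps} passes")
```

With static batching, each short request holds its slot idle until the long request in the same batch finishes. With continuous batching, the slot is reused immediately, so the same work completes in fewer forward passes.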
Key configuration parameters:
# --max-num-seqs: max concurrent sequences
# --max-num-batched-tokens: max tokens per batch
# --gpu-memory-utilization: fraction of GPU memory to use
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.90
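For production, set --gpu-memory-utilization to 0.90-0.95. vLLM budgets this fraction of GPU memory for model weights, activations, and the KV cache; the remaining 5-10% is headroom that helps prevent OOM errors during traffic spikes.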
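A rough way to reason about this setting is to estimate how many tokens of KV cache fit in the budget. The sketch below is a back-of-the-envelope estimate, not vLLM's own accounting: it ignores activation memory and runtime overhead, and it assumes Mistral 7B's published architecture (32 layers, 8 KV heads, head dimension 128), FP16 cache entries, about 14 GiB of weights, and a 24 GiB GPU.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_budget(gpu_gib, utilization, weights_gib, bytes_per_token):
    """Tokens of KV cache that fit after the weights are loaded."""
    budget_bytes = (gpu_gib * utilization - weights_gib) * 1024**3
    return max(0, int(budget_bytes // bytes_per_token))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
tokens = kv_cache_token_budget(
    gpu_gib=24, utilization=0.90, weights_gib=14, bytes_per_token=per_token
)
print(f"{per_token // 1024} KiB per token, room for about {tokens:,} cached tokens")
```

Under these assumptions the budget is on the order of 60,000 tokens shared across all in-flight requests, which is why raising the utilization fraction translates directly into more concurrent sequences.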
Model Quantization
For running larger models on limited hardware, vLLM supports:
AWQ (Activation-aware Weight Quantization): 4-bit quantization with minimal quality loss. Reduces model size by ~4x.
python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq
GPTQ: Another 4-bit quantization method. Slightly lower quality than AWQ, but with wider model availability (many quantized models on Hugging Face use GPTQ).
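FP8: 8-bit floating point. Hardware-accelerated on H100 (Hopper) and Ada Lovelace GPUs; the A100 has no native FP8 tensor cores, so it does not get the same speedup. Better quality than INT4 with better throughput than FP16.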
Memory savings from quantization:
| Model | FP16 VRAM | AWQ VRAM | Speed impact |
|-------|-----------|----------|--------------|
| Mistral 7B | 14GB | 6GB | ~5% slower |
| Llama 3.1 8B | 16GB | 7GB | ~5% slower |
| Llama 3.1 70B | 140GB | 40GB | ~8% slower |
With AWQ quantization, Mistral 7B fits on a single 8GB GPU (RTX 3070, RTX 4060), significantly reducing hardware costs.
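The FP16 column above matches a simple rule of thumb: weight memory is parameter count times bits per weight. A minimal sketch, using rounded parameter counts (7, 8, and 70 billion) as an assumption:

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB."""
    return num_params_billion * bits_per_weight / 8


for name, params in [("Mistral 7B", 7), ("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.0f}GB, 4-bit weights ~{int4:.1f}GB")
```

The 4-bit estimates come out lower than the AWQ column in the table. A likely reason is that the table reports total VRAM, which also includes quantization scales, any layers left unquantized, and runtime overhead, while this estimate counts only the packed weights.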
Tensor Parallelism for Multi-GPU
For models too large for a single GPU, vLLM supports tensor parallelism across multiple GPUs:
# --tensor-parallel-size 2: shard the model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype float16
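Tensor parallelism splits the weight matrices inside each layer across GPUs, so every GPU holds a slice of every layer and the model no longer has to fit on a single device. The GPUs must exchange partial results at each layer, and this communication overhead reduces throughput by roughly 10-20% compared with running the same model on a single GPU large enough to hold it.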
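The core idea can be shown with a column-parallel linear layer in plain Python. This is an illustrative sketch only: real implementations run on GPU tensors and use collective operations such as all-gather, and the helper names and tiny matrices here are made up.

```python
def matmul(x, w):
    """Multiply a length-k vector by a k x n matrix (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous slice of the output columns."""
    n = len(w[0])
    if n % num_shards != 0:
        raise ValueError("column count must divide evenly across shards")
    step = n // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, num_shards=2)            # one shard per "GPU"
partials = [matmul(x, shard) for shard in shards]  # computed independently
gathered = [v for part in partials for v in part]  # stands in for the all-gather

print(gathered == matmul(x, w))
```

Each shard holds half of the weights and produces half of the output; the gather step is the inter-GPU communication that accounts for the overhead described above.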
Hardware Requirements and Cost Calculator
For production self-hosting, the most cost-effective GPU configurations:
Single A10G (24GB, AWS g5.2xlarge):
- On-demand: $1.212/hour = ~$876/month
- Spot: $0.36-0.45/hour = ~$260-325/month
- Models that fit: Mistral 7B (FP16), Llama 3.1 8B (FP16). Llama 3.1 70B does not fit on one A10G; even with AWQ (~40GB) it needs at least 2 GPUs
- Throughput: ~1,200 tokens/second at batch 32 for 7B models
Single RTX 4090 (24GB, bare metal):
- Hardware cost: ~$1,600 (one-time)
- Hetzner dedicated server with RTX 4090: ~$200-250/month
- Better cost per token than cloud instances for stable long-term workloads
8x A100 40GB (AWS p4d.24xlarge):
- On-demand: $32.77/hour (very expensive)
- For Llama 3.1 70B: use spot instances or a model provider
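The prices and throughput figures above can be turned into a cost per million tokens. This is a simple sketch using the A10G numbers from this section; the `utilization` parameter is an added assumption to model a server that is not saturated around the clock.

```python
def cost_per_million_tokens(hourly_usd, tokens_per_second, utilization=1.0):
    """USD per one million generated tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1_000_000


# A10G figures from this section: ~1,200 tokens/second for 7B models.
on_demand = cost_per_million_tokens(hourly_usd=1.212, tokens_per_second=1200)
spot = cost_per_million_tokens(hourly_usd=0.36, tokens_per_second=1200)
half_busy = cost_per_million_tokens(1.212, 1200, utilization=0.5)

print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.3f}, "
      f"on-demand at 50% load: ${half_busy:.2f} per 1M tokens")
```

Because the hourly price is fixed, cost per token scales inversely with utilization: a server that is busy half the time costs twice as much per token as a saturated one.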