The Problem With Naive LLM Serving
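When you naively serve an LLM, the KV (key-value) cache for each request is allocated as one contiguous memory block, typically sized for the maximum possible sequence length. Because actual request lengths vary wildly, the vLLM authors report that 60–80% of that memory is wasted on fragmentation and over-reservation. The result: fewer requests fit in a batch, GPU utilisation stays low, and throughput scales poorly as concurrent users are added.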
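To put rough numbers on that waste, here is a back-of-the-envelope sketch. The architecture constants are the published Llama 3.1 8B configuration; the request lengths and the 2048-token reservation are invented for illustration, so treat the result as indicative only.

```python
# Back-of-the-envelope KV cache sizing.
# Architecture numbers: Llama 3.1 8B config. Request lengths: made up.
NUM_LAYERS = 32      # transformer layers
NUM_KV_HEADS = 8     # grouped-query attention uses 8 KV heads
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16 / bf16


def kv_bytes_per_token() -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def wasted_fraction(actual_lengths: list[int], reserved_length: int) -> float:
    """Share of reserved KV memory never used when every request gets a
    contiguous slab sized for reserved_length tokens."""
    reserved = reserved_length * len(actual_lengths)
    used = sum(actual_lengths)
    return 1 - used / reserved


lengths = [120, 300, 750, 2048, 90]  # hypothetical final sequence lengths
print(kv_bytes_per_token() // 1024, "KiB per token")            # 128 KiB per token
print(f"{wasted_fraction(lengths, 2048):.0%} of reserved KV memory unused")  # 68%
```

At 128 KiB per token, a single 2048-token reservation holds 256 MiB, so a handful of short requests can strand gigabytes of GPU memory.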
PagedAttention: KV Cache as Virtual Memory
vLLM solves this with PagedAttention, introduced in the 2023 paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., SOSP 2023). Inspired by OS virtual memory, PagedAttention divides the KV cache into fixed-size blocks (the analogue of pages) that can be stored non-contiguously. Blocks are allocated on demand and freed as soon as a request finishes, so waste is limited to the partially filled last block of each sequence. Combined with continuous batching (new requests slot into the batch mid-flight rather than waiting for the whole batch to drain), the vLLM team reports up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher than HuggingFace TGI in its launch benchmarks.
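The bookkeeping behind on-demand block allocation can be sketched in a few lines. This is a toy model, not vLLM's actual implementation: the class name `BlockAllocator` and its methods are hypothetical, and it tracks only block ids, not tensors. The block size of 16 tokens matches vLLM's default.

```python
class BlockAllocator:
    """Toy paged KV cache: a pool of fixed-size blocks plus per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # request id -> physical blocks
        self.num_tokens: dict[str, int] = {}           # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n == len(table) * self.block_size:  # last block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1

    def free(self, request_id: str) -> None:
        # Return every block of a finished request to the pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self) -> int:
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.num_tokens.values())


alloc = BlockAllocator(num_blocks=8, block_size=16)
for i in range(20):                 # two requests decoding side by side
    alloc.append_token("req-a")
    if i < 5:
        alloc.append_token("req-b")

print(alloc.block_tables)           # {'req-a': [7, 5], 'req-b': [6]}
print(alloc.wasted_slots())         # 23
alloc.free("req-a")
print(len(alloc.free_blocks))       # 7
```

Note that `req-a` ends up in blocks 7 and 5, which are not adjacent, and that each request wastes at most `block_size - 1` slots no matter how long it grows.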
Quick Start With Docker
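The Llama 3.1 weights are gated, so pass a Hugging Face token with access to them; --ipc=host gives the container the host shared memory that PyTorch relies on.
docker run --runtime nvidia --gpus all --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<your-token>" -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1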
The server exposes an OpenAI-compatible API on port 8000. Once the model has finished loading, send a request:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Explain PagedAttention"}]}'
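The same request can be made from Python. The sketch below uses only the standard library and assumes the server from the Docker command is listening on localhost:8000; the helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the port mapping used above
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Explain PagedAttention"))
```

Because the API follows the OpenAI schema, an existing OpenAI client library pointed at the same base URL should work as well.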
Python API
pip install vllm
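from vllm import LLM, SamplingParams
# Loads the weights and reserves GPU memory for the KV cache
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Sampling settings apply to every prompt in the call
params = SamplingParams(temperature=0.7, max_tokens=512)
# generate() takes a list of prompts and returns one RequestOutput per prompt
outputs = llm.generate(["What is tensor parallelism?"], params)
# Each RequestOutput holds one or more completions; print the first
print(outputs[0].outputs[0].text)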
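The offline API pays off when you hand it many prompts at once, since the engine batches them internally. A minimal sketch follows; the wrapper name `run_batch` is hypothetical, and actually running it needs a GPU plus access to the model weights.

```python
def run_batch(prompts: list[str], max_tokens: int = 128) -> dict[str, str]:
    """Generate one completion per prompt in a single generate() call."""
    # Imported lazily so this module can be loaded on a machine without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    results = llm.generate(prompts, params)  # one RequestOutput per prompt
    # Map each prompt to the text of its first completion.
    return {r.prompt: r.outputs[0].text for r in results}


# Example call (requires a GPU):
# answers = run_batch(["What is a KV cache?", "What is continuous batching?"])
```

In a real script you would build the `LLM` object once and reuse it, since loading the weights dominates start-up time.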
Tensor Parallelism Across GPUs
For models that don't fit on a single GPU, split across N devices:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4
vLLM applies Megatron-style column/row parallelism automatically, so no code changes are needed. The tensor-parallel size should evenly divide the model's attention head count.
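To see why the column/row split works, here is a small pure-Python sketch of a two-layer MLP sharded across two pretend GPUs. It uses ReLU and tiny integer matrices for readability (both are assumptions for illustration, not what Llama uses) and replaces the all-reduce with a plain sum.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, -2, 3]]                                       # one token, hidden size 3
w1 = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1]]     # 3 x 4 up-projection
w2 = [[1, 2, 0], [0, -1, 1], [1, 1, 1], [-2, 0, 1]]    # 4 x 3 down-projection

reference = matmul(relu(matmul(x, w1)), w2)            # unsharded result

# Two "GPUs": each holds half of W1's columns and the matching half of W2's rows.
partials = [
    matmul(relu(matmul(x, w1_shard)), w2_shard)
    for w1_shard, w2_shard in zip(split_columns(w1, 2), split_rows(w2, 2))
]
combined = add(partials[0], partials[1])               # stands in for the all-reduce

print(reference)   # [[-7, 14, 7]]
print(combined)    # [[-7, 14, 7]]
```

Splitting the first matrix by columns keeps the elementwise activation local to each device, so the only communication is one sum after the second matrix.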
Quantization Options
| Method | Speed | Quality | Notes |
|---|---|---|---|
| AWQ | Fast | High | Pre-quantized weights, 4-bit |
| GPTQ | Fast | High | 4-bit, calibration required |
| int8 | Medium | Very high | LLM.int8() via bitsandbytes |
Load a pre-quantized AWQ model:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-70B-AWQ --quantization awq
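A quick estimate shows why 4-bit weights matter for a 70B model. The numbers below are assumptions: a nominal 70 billion parameters, a quantization group size of 128, and one 16-bit scale plus one 4-bit zero point per group. Real checkpoints differ slightly, and this ignores layers left unquantized.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9   # nominal parameter count (assumption)
GROUP_SIZE = 128    # assumed quantization group size
# Assumed per-group overhead: one 16-bit scale and one 4-bit zero point.
AWQ_BITS = 4 + (16 + 4) / GROUP_SIZE

fp16 = weight_gb(PARAMS_70B, 16)
awq = weight_gb(PARAMS_70B, AWQ_BITS)
print(f"fp16: {fp16:.0f} GB, 4-bit AWQ: {awq:.1f} GB")  # fp16: 140 GB, 4-bit AWQ: 36.4 GB
```

Under these assumptions the fp16 weights alone exceed a single 80 GB GPU, while the 4-bit version leaves more than half the card free for the KV cache.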
Throughput vs HuggingFace TGI
On an A100 80GB with Llama 3.1 70B (4-bit AWQ), vLLM delivers roughly 18 requests/sec at 512 output tokens versus ~4 requests/sec for TGI with the same setup. The gap widens further at higher concurrency, where continuous batching and paged KV-cache allocation let more requests share the GPU without head-of-line blocking.
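Head-of-line blocking is easy to see in a toy simulation. The sketch below counts decode steps for static versus continuous batching, assuming every step costs the same regardless of how many slots are busy and ignoring prefill; the function names and output lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue on the next step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while queue or running:
        # Fill any free slots before decoding the next token.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [512, 16, 16, 16, 512, 16, 16, 16]  # hypothetical output lengths
print(static_batching_steps(lengths, 4))       # 1024
print(continuous_batching_steps(lengths, 4))   # 528
```

With static batching the short requests in each batch sit idle behind the 512-token one; with continuous batching their slots are reused straight away, so the second long request starts at step 17 instead of step 513.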
Consult the vLLM docs for the full benchmark suite and deployment recipes, including Ray Serve integration for auto-scaling clusters.