vLLM: The Fastest Open Source LLM Inference Server

PagedAttention gives vLLM 2-24x higher throughput than naive implementations. Here is how to set it up, configure batching, quantize models, and calculate hardware costs.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read



vLLM is among the highest-throughput open source inference servers for large language models. It achieves 2-24x higher throughput than naive implementations, depending on the baseline and workload, through PagedAttention, an attention algorithm that eliminates most of the memory waste caused by KV cache fragmentation. It provides an OpenAI-compatible API out of the box, making it a drop-in replacement for the OpenAI API in most applications. For teams self-hosting LLMs, vLLM is the right choice when throughput matters: in high-concurrency tokens-per-second benchmarks it typically outperforms both Ollama and Hugging Face TGI.

PagedAttention: Why It Is Faster

A naive KV cache in transformer inference pre-allocates memory for the maximum possible sequence length. For a model with a 4,096-token context window, the server reserves memory for 4,096 tokens even if a request uses only 50 tokens in total. That leaves 98.8% of the KV cache reserved for that request unused.
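The waste figure is simply the ratio of unused to reserved token slots. A minimal sketch of that arithmetic, using only the numbers from the example above (the function name is illustrative, not part of vLLM):

```python
def wasted_fraction(reserved_tokens: int, used_tokens: int) -> float:
    """Fraction of a pre-allocated KV cache reservation that is never used."""
    if not 0 <= used_tokens <= reserved_tokens:
        raise ValueError("used_tokens must be between 0 and reserved_tokens")
    return (reserved_tokens - used_tokens) / reserved_tokens


waste = wasted_fraction(reserved_tokens=4096, used_tokens=50)
print(f"{waste:.1%}")  # prints 98.8%
```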

PagedAttention divides the KV cache into fixed-size pages (analogous to OS virtual memory pages) and allocates them dynamically. Pages are assigned as the sequence grows and freed when the request completes. This allows:

Larger effective batch sizes (more requests fit in the same GPU memory)
Continuous batching (new requests are added to in-progress batches without waiting)
Copy-on-write sharing of KV cache pages between parallel decoding paths (useful for beam search)
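The allocate-on-demand idea can be illustrated with a toy bookkeeping model. This is a sketch only: it tracks page IDs rather than tensors, the class and method names are hypothetical, and the page size of 16 tokens is an illustrative choice. It is not vLLM's actual implementation.

```python
class PagedKVCache:
    """Toy model of paged KV cache bookkeeping (page IDs only, no tensors)."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))      # pool of physical pages
        self.page_tables: dict[str, list[int]] = {}   # request -> its pages
        self.lengths: dict[str, int] = {}             # request -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Record one more token; take a new page only when the last one is full."""
        pages = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(pages) * self.page_size:     # current pages are full
            if not self.free_pages:
                raise MemoryError("KV cache is full")
            pages.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(50):                      # a 50-token request
    cache.append_token("req-1")
pages_used = len(cache.page_tables["req-1"])
print(pages_used)                        # 4 pages = 64 slots, not 4,096
cache.release("req-1")
print(len(cache.free_pages))             # 8, everything is back in the pool
```

In this model a 50-token request holds 64 token slots at most, and the pages it frees are immediately available to other requests, which is what lets more requests share the same GPU memory.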

The result: on standard benchmarks, vLLM achieves up to 24x higher throughput than Hugging Face's original transformers inference on multi-request workloads.

Installation and Basic Setup

Requirements: NVIDIA GPU with CUDA 11.8+, Python 3.9+

pip install vllm

Start a model server:

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000 \
  --dtype float16

The server is now running at http://localhost:8000 with an OpenAI-compatible API. Any code using the OpenAI SDK can point to this endpoint by setting:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # vLLM does not require auth by default
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain transformers briefly"}]
)
print(response.choices[0].message.content)

Handling Concurrent Requests

vLLM's continuous batching automatically handles concurrent requests efficiently. You do not need to configure batching manually for most use cases. The scheduler groups in-flight requests and processes them together at each forward pass.
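Because batching happens on the server, the client only needs to keep several requests in flight. The sketch below does that with the standard library alone; it assumes the server from the setup section is running on localhost:8000, and the helper names (`build_payload`, `send`, `fan_out`) are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"  # assumed: the server started earlier
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


def build_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Request body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(prompt: str) -> str:
    """Send one chat request and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def fan_out(prompts, send_fn=send, workers: int = 16):
    """Issue all prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_fn, prompts))
```

Calling `fan_out(["Explain transformers briefly"] * 32, workers=32)` against a running server puts 32 requests in flight at once, and the scheduler batches them without any client-side coordination. The `send_fn` parameter exists so the fan-out logic can be exercised without a server.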

Key configuration parameters:

# --max-num-seqs: max concurrent sequences
# --max-num-batched-tokens: max tokens per batch
# --gpu-memory-utilization: fraction of GPU memory to use
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.90

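For production, set --gpu-memory-utilization to 0.90-0.95. The flag caps the share of GPU memory vLLM claims for model weights, activations, and the KV cache. Higher values leave more room for KV cache and therefore more concurrent requests, while the remaining headroom helps prevent OOM errors during traffic spikes.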
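To see what the utilization setting buys, it helps to estimate how many tokens of KV cache fit once the weights are loaded. The sketch below is a rough estimate under stated assumptions: it ignores activation memory and other overhead, and it assumes Mistral 7B's shape is 32 layers, 8 KV heads, and a head dimension of 128 in FP16. The function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(gpu_gb: float, utilization: float, weights_gb: float,
                    bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after the weights, ignoring activations."""
    budget_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# 24 GB card with 14 GB of FP16 weights, at three utilization settings.
for util in (0.80, 0.90, 0.95):
    tokens = kv_token_budget(gpu_gb=24, utilization=util, weights_gb=14,
                             bytes_per_token=per_token)
    print(util, tokens)
```

Under these assumptions, a 24 GB card at 0.90 leaves room for roughly 58,000 cached tokens, shared across all in-flight requests.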

Model Quantization

For running larger models on limited hardware, vLLM supports:

AWQ (Activation-aware Weight Quantization): 4-bit quantization with minimal quality loss. Reduces model size by ~4x.

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq

GPTQ: Another 4-bit quantization method. Slightly lower quality than AWQ but more model availability (many quantized models on Hugging Face use GPTQ).

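FP8: 8-bit floating point. Native FP8 compute requires Hopper (H100) or Ada Lovelace GPUs; the A100 has no FP8 tensor cores, so on Ampere cards FP8 can serve only as a weight-only format that saves memory. Better quality than INT4, with better throughput than FP16 on hardware that supports it natively.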

Memory savings from quantization:

| Model | FP16 VRAM | AWQ VRAM | Speed impact |
|-------|-----------|----------|--------------|
| Mistral 7B | 14GB | 6GB | ~5% slower |
| Llama 3.1 8B | 16GB | 7GB | ~5% slower |
| Llama 3.1 70B | 140GB | 40GB | ~8% slower |

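With AWQ quantization, Mistral 7B (about 6GB) fits on a single 8GB GPU (RTX 3070, RTX 4060), significantly reducing hardware costs. The remaining ~2GB has to hold the KV cache and activations, so expect short contexts and low concurrency on such cards.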
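The FP16 figures follow from parameter count times bytes per weight. The sketch below reproduces that estimate for weights only. The pure 4-bit numbers it prints are lower than the AWQ figures quoted above, a gap that presumably comes from quantization metadata and runtime overhead (an assumption, not something measured here).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
for name, params in [("7B", 7e9), ("8B", 8e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 {fp16:.1f} GB, 4-bit {int4:.1f} GB")
```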

Tensor Parallelism for Multi-GPU

For models too large for a single GPU, vLLM supports tensor parallelism across multiple GPUs:

# --tensor-parallel-size 2: shard the model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype float16

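Tensor parallelism shards the weight matrices inside each layer across GPUs, so every GPU holds a slice of every layer and the GPUs exchange partial results during each forward pass. This enables inference on models that do not fit on a single device. The communication overhead typically costs 10-20% of throughput compared with an equivalent single-GPU setup, depending on the interconnect.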
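One common form of tensor parallelism shards a layer's weight matrix by columns: each device multiplies the same input by its own slice, and the partial outputs are gathered afterwards. The sketch below shows that idea in pure Python with plain lists standing in for GPU tensors. The helper names are illustrative and this is not vLLM's implementation.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous slice of the weight matrix's columns."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]                       # one token's activations
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]           # full layer weight

shards = split_columns(w, num_shards=2)             # one shard per "GPU"
partials = [matmul(x, shard) for shard in shards]   # computed independently
gathered = [v for part in partials for v in part]   # the gather step

print(gathered)  # [11.0, 14.0, 17.0, 20.0], same as matmul(x, w)
```

The gather step is where inter-GPU communication happens on real hardware, which is the source of the overhead mentioned above.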

Hardware Requirements and Cost Calculator

For production self-hosting, the most cost-effective GPU configurations:

Single A10G (24GB, AWS g5.2xlarge):

On-demand: $1.212/hour = ~$873/month (720 hours)
Spot: $0.36-0.45/hour = ~$260-325/month
Models that fit: Mistral 7B (FP16), Llama 3.1 8B (FP16). Llama 3.1 70B does not fit on one card: even the AWQ build (~40GB) requires 2 GPUs
Throughput: ~1,200 tokens/second at batch 32 for 7B models

Single RTX 4090 (24GB, bare metal):

Hardware cost: ~$1,600 (one-time)
Hetzner dedicated server with RTX 4090: ~$200-250/month
Better cost per token than cloud instances for stable long-term workloads

A100 (AWS p4d.24xlarge, 8x A100 40GB; AWS offers A100s only in 8-GPU instances):

On-demand: $32.77/hour for the whole instance (very expensive)
For Llama 3.1 70B: use spot instances or a model provider
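The monthly and per-token figures can be derived from the hourly price and sustained throughput. The sketch below uses the A10G numbers quoted above and assumes a 720-hour month with the GPU fully utilized the whole time. Real cost per token will be higher at lower utilization. The function names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumed: 30 days x 24 hours, always-on instance


def monthly_cost(hourly_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running one instance for the given number of hours."""
    return hourly_usd * hours


def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


print(f"{monthly_cost(1.212):.2f}")                  # 872.64
print(f"{usd_per_million_tokens(1.212, 1200):.3f}")  # 0.281
```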

Keep Reading

Running Open Source LLMs in Production — Full production guide including server choice comparison
Fine-Tuning an LLM with QLoRA — Adapting models that vLLM will serve
Open Source LLM Benchmarks 2026 — Choosing which model to serve with vLLM
