ML model serving latency is a production engineering problem, not just a modeling problem. A model that achieves state-of-the-art accuracy but takes 3 seconds per inference will not be used in an interactive product. Users start to notice delays in interactive features at roughly 300ms, and abandonment climbs as the delay grows. This guide covers the full stack of techniques for reducing ML serving latency and hitting production targets.
Why Latency Matters (The User Psychology Perspective)
Research on web performance consistently shows:
- 100ms: Users perceive the response as instant
- 300ms: Users start to notice the delay
- 1,000ms (1 second): Users lose focus and start thinking about something else
- 3,000ms (3 seconds): 40% of users abandon the interaction
For ML-powered features -- autocomplete, recommendations, classification, question answering -- the 300ms threshold is the practical target for interactive use cases. For background tasks (document analysis, batch recommendations), you have more flexibility. Know your latency budget before optimizing.
The Latency Budget: Where Time Goes
Before optimizing, measure where time is actually spent. A typical ML serving path:
Total latency = network (client to server) + preprocessing + model inference + postprocessing + network (server to client)
For a server-side model:
- Network: 10-100ms (depends on geography)
- Preprocessing (tokenization, feature extraction): 5-50ms
- Model inference: 20-2000ms (the main variable)
- Postprocessing: 1-10ms
Profile your actual system before optimizing. A model that takes 200ms to infer but 500ms to preprocess requires different optimization than the reverse.
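As a minimal sketch of per-stage measurement, the snippet below wraps each stage of a request in a timer and reports the mean time per stage. The `timed` helper, the stage names, and the placeholder stage bodies are assumptions for illustration; a real service would wrap its own preprocessing, inference, and postprocessing calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Wall-clock samples per stage, in milliseconds.
stage_ms = defaultdict(list)

@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage].append((time.perf_counter() - start) * 1000.0)

def handle_request(text):
    # The three stage bodies are placeholders; swap in the real calls.
    with timed("preprocess"):
        tokens = text.lower().split()
    with timed("inference"):
        time.sleep(0.02)  # stand-in for the model forward pass
        score = len(tokens) / 10.0
    with timed("postprocess"):
        label = "long" if score > 0.5 else "short"
    return label

def summarize():
    """Return the mean latency per stage, in milliseconds."""
    return {stage: sum(v) / len(v) for stage, v in stage_ms.items()}

for _ in range(5):
    handle_request("how do I reduce serving latency for my model")
report = summarize()
print(report)
```

In production it is usually more informative to track high percentiles (for example p95 or p99) per stage rather than the mean, since tail latency is what users on the slow path experience.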
Model Quantization: 2-4x Faster with Minimal Quality Loss
Quantization converts model weights from 32-bit floating point (FP32) to lower precision formats. The two most common:
INT8 quantization: Weights and activations represented as 8-bit integers instead of 32-bit floats. Memory footprint reduced 4x. Inference 2-4x faster on hardware with INT8 support (most modern CPUs and GPUs). Quality loss is typically 0.5-2% on benchmark tasks.
FP16 / BF16 (reduced precision): Half-precision floating point. 2x memory reduction. Up to roughly 2x faster on modern GPUs with tensor cores (NVIDIA A100, H100, consumer RTX cards). Minimal quality loss.
# Post-training dynamic quantization of a Hugging Face model
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()  # quantize for inference, not training

# Dynamic INT8 quantization (no calibration data needed).
# Weights are converted ahead of time; activations are quantized at run time.
# This path targets CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize linear layers
    dtype=torch.qint8,
)
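To make the idea behind INT8 quantization concrete, here is a small pure-Python sketch of a symmetric scheme with one shared scale per tensor. It is a simplification for illustration, not what PyTorch does internally; the function names and sample weights are made up.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four, and the only extra state is the scale. The quality loss mentioned above comes from the rounding step.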
For more aggressive quantization (8-bit and 4-bit weights), the bitsandbytes library, which implements LLM.int8() and 4-bit formats such as NF4, allows loading large language models in reduced precision on consumer hardware. Quality loss increases as precision drops, but a model that needs 80GB of GPU RAM in FP16 can fit in roughly 24GB at 4-bit.
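The memory figures above follow from simple arithmetic on weight storage. The sketch below assumes a hypothetical 40-billion-parameter model and counts weights only; activations, the KV cache, and framework overhead add to these numbers.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights only, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 40e9  # hypothetical 40B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions the weights take about 80 GB at FP16 and about 20 GB at 4-bit, which is why a 24GB card becomes viable once some headroom is left for everything else.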
ONNX Runtime provides hardware-optimized inference for quantized models across CPU and GPU. Converting a Hugging Face model to ONNX and running it with ONNX Runtime often gives a 2-5x speedup over native PyTorch inference, though the gain depends on the model and hardware, so measure it on your own workload.
Batching: GPU Utilization at Scale
GPUs are massively parallel processors. Sending a single example to a GPU underutilizes it -- the GPU can run thousands of computations in parallel, and one small input leaves most of that capacity idle. Batching sends multiple requests through the model at once, amortizing the overhead of a GPU forward pass across many examples.
Static batching: Wait until you have N requests, then run them as a batch. Simple to implement but introduces latency for users who arrive when the batch is not yet full.
Dynamic batching: Accept requests as they arrive, collect them for a configurable window (e.g., 10ms), then process the collected batch. Balances throughput and latency. This is the standard approach in production ML serving systems.
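A minimal sketch of dynamic batching with `asyncio`, assuming a model function that accepts a list of inputs and returns a list of outputs in the same order. The `DynamicBatcher` class, its parameters, and `fake_model` are hypothetical names for illustration; production servers such as Triton provide this behavior through configuration.

```python
import asyncio

class DynamicBatcher:
    """Collect requests until max_batch items or max_wait_s elapse, then run once."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn      # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def infer(self, item):
        """Called once per request; resolves when the batch containing it finishes."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]   # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)

batch_sizes = []

def fake_model(inputs):
    # Stand-in for a real batched forward pass.
    batch_sizes.append(len(inputs))
    return [len(text) for text in inputs]

async def demo():
    batcher = DynamicBatcher(fake_model, max_batch=4, max_wait_s=0.01)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer("x" * n) for n in range(1, 7)))
    worker.cancel()
    return results

results = asyncio.run(demo())
print(results, batch_sizes)
```

The two knobs trade off against each other: a longer window yields larger batches and higher throughput, while adding up to `max_wait_s` of latency to the first request in each batch.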
Continuous batching (for LLMs): Because LLM generation is iterative (one token at a time), traditional batching is inefficient -- you have to wait for the longest sequence in the batch to finish before adding new requests. Continuous batching schedules work at the level of individual generation steps: finished sequences leave the batch and waiting requests join it between tokens, which dramatically improves GPU utilization for LLM serving. vLLM combines this with PagedAttention, which manages KV cache memory in fixed-size blocks.
# vLLM for high-throughput LLM serving
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)
prompts = ["Explain gradient descent.", "What is overfitting?", "Define entropy."]
outputs = llm.generate(prompts, sampling_params) # batched automatically
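To see why token-level scheduling helps, the toy simulation below counts decode steps for the same workload under both policies. It is a simplified model under stated assumptions: every active sequence produces one token per step, prompt processing and memory limits are ignored, and the output lengths are made up.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """A finished sequence frees its slot for the next waiting request."""
    waiting = list(lengths)
    active = []   # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        # Fill free slots before each decode step.
        while waiting and len(active) < slots:
            active.append(waiting.pop(0))
        steps += 1   # one decode step for every active sequence
        active = [r - 1 for r in active if r > 1]
    return steps

# Output tokens per request (illustrative values, not measurements).
lengths = [200, 10, 10, 10, 200, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With these numbers the static policy needs 400 steps because the short requests sit idle behind the long one in each batch, while the continuous policy finishes in 210.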
Caching: Avoid Redundant Computation
For models where the same input is queried repeatedly, caching is the highest-leverage optimization -- you avoid inference entirely.
Exact match caching: Store (input_hash -> output) in Redis or a similar key-value store. Look up the hash before running inference. Works best for structured inputs (classification of a fixed set of items, embeddings for a static corpus).
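A minimal sketch of exact match caching, using an in-process dictionary as a stand-in for Redis so the example runs without a server. The class name, the `model_version` key component, and `fake_classifier` are assumptions for illustration; including a version in the key is one common way to avoid serving outputs from a previous model after a redeploy.

```python
import hashlib
import json

class ExactMatchCache:
    """In-process stand-in for a key-value store: input hash -> stored output."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(model_version, payload):
        # Serialize with sorted keys so equal payloads always hash identically.
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(f"{model_version}:{canonical}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model_version, payload, predict_fn):
        key = self.key_for(model_version, payload)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = predict_fn(payload)   # only runs on a cache miss
        self.store[key] = result
        return result

def fake_classifier(payload):
    # Hypothetical model call; replace with real inference.
    return "positive" if "good" in payload["text"] else "negative"

cache = ExactMatchCache()
first = cache.get_or_compute("v1", {"text": "good product"}, fake_classifier)
second = cache.get_or_compute("v1", {"text": "good product"}, fake_classifier)
print(first, second, cache.hits, cache.misses)
```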
Embedding caching: If you use BERT embeddings as features, cache the embeddings for each unique text. Recompute only when the underlying text changes.
KV cache (for LLMs): During autoregressive generation, the key and value matrices for the prompt tokens and for tokens already generated do not change between generation steps. Most LLM serving frameworks cache these between forward passes, so each step only computes attention inputs for the newest token, significantly reducing per-token generation cost for long prompts. This is enabled by default in vLLM and TGI.
Semantic caching: Cache based on semantic similarity, not exact match. If a user asks "What is the capital of France?" and the cache contains the answer to "Tell me the capital of France," return the cached result. Tools like GPTCache implement semantic caching for LLM applications.
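The lookup logic behind semantic caching can be sketched with a toy embedding. Here word counts stand in for a real embedding model and a linear scan stands in for a vector index; both are assumptions made to keep the example self-contained, and the 0.6 threshold is only tuned for this toy embedding. This is not how GPTCache is implemented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.entries = []   # list of (embedding, answer)

    def put(self, question, answer):
        self.entries.append((embed(question), answer))

    def get(self, question):
        """Return the answer of the most similar cached question, or None."""
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

semantic_cache = SemanticCache()
semantic_cache.put("Tell me the capital of France", "Paris")
hit = semantic_cache.get("What is the capital of France?")
miss = semantic_cache.get("How do I bake bread?")
print(hit, miss)
```

The threshold is the main risk: set it too low and users receive answers to questions they did not ask, so it is worth validating against real query pairs before enabling this in production.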
Hardware: The Infrastructure Foundation
CPU vs GPU: CPU inference is appropriate for small models (BERT-base, DistilBERT) with low request volume. A single CPU core can run BERT-base inference in roughly 20-50ms for short inputs. For high-throughput serving or large models, GPU is necessary.
GPU selection: A100 vs H100 vs consumer RTX. For production:
- A100 (80GB): Large memory, good for large models. Roughly $2-3/hour on cloud.
- H100: Faster than A100 (especially for FP8 inference). $4-5/hour on cloud.
- RTX 4090 (24GB): Cost-effective for smaller models in lower-traffic scenarios.
For LLM inference, the bottleneck is often memory bandwidth (moving weights from GPU memory to compute units), not raw FLOPS. H100 has 3.35 TB/s memory bandwidth vs A100's 2 TB/s.
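A back-of-the-envelope bound shows what those bandwidth numbers mean for generation speed. The sketch assumes that producing one token for a single sequence requires reading every weight from GPU memory once, and it ignores KV cache traffic, batching, and kernel overhead, so real throughput will be lower. The 8B-parameter FP16 model is a hypothetical example.

```python
def max_decode_tokens_per_second(num_params, bytes_per_weight, bandwidth_tb_s):
    """Upper bound for single-sequence decoding when weight reads dominate."""
    model_bytes = num_params * bytes_per_weight
    return bandwidth_tb_s * 1e12 / model_bytes

# Hypothetical 8B-parameter model stored in FP16 (2 bytes per weight).
a100 = max_decode_tokens_per_second(8e9, 2, 2.0)
h100 = max_decode_tokens_per_second(8e9, 2, 3.35)
print(f"A100: {a100:.0f} tokens/s, H100: {h100:.0f} tokens/s")
```

The same formula explains why quantization speeds up LLM decoding: halving the bytes per weight roughly doubles this bound, independent of raw FLOPS.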
Tensor cores: Modern NVIDIA GPUs have tensor cores that accelerate FP16, BF16, and INT8 matrix operations specifically. Using FP16 or INT8 quantization lets you take full advantage of these accelerators.
Optimization Runtimes: TensorRT, ONNX, and vLLM
TensorRT (NVIDIA): Compiles models specifically for NVIDIA GPU architecture, applying layer fusion, precision calibration, and kernel selection. Achieves 2-10x speedup over standard PyTorch inference on NVIDIA hardware. Most valuable for production deployments on fixed GPU types. Complex to set up.
ONNX Runtime: Cross-platform, hardware-agnostic inference. Simpler to set up than TensorRT. Provides 2-5x speedup on CPU and GPU through operator fusion and graph optimization. The right default for CPU inference.
vLLM: Purpose-built for LLM serving. PagedAttention for memory efficiency, continuous batching, tensor parallelism for multi-GPU. Standard for production LLM endpoints.
Triton Inference Server (NVIDIA): Supports multiple backends (PyTorch, ONNX, TensorRT, custom). Dynamic batching, model versioning, gRPC and REST APIs out of the box. Good for multi-model serving infrastructure.
Cold Start: Keeping Models Warm
Serverless ML inference (Lambda, Cloud Run) suffers from cold starts: the first request after a period of inactivity must load the model from storage into memory, taking 5-30 seconds for large models. This is unacceptable for user-facing features.
Solutions:
- Minimum instance count: Keep at least one instance warm at all times. Costs money even at zero traffic.
- Lazy loading with warm-up: Accept the cold start but serve cached responses or a fallback during warm-up.
- Pre-warmed containers: Deploy to always-on compute (EC2, GKE) rather than serverless. More operational overhead but no cold starts.
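The second option, serving a fallback during warm-up, can be sketched as follows. The `WarmingModel` class, the fallback value, and `load_fake_model` are hypothetical; the pattern assumes the model can be loaded in a background thread while the process already accepts traffic.

```python
import threading
import time

class WarmingModel:
    """Loads the model in the background and serves a fallback until it is ready."""

    def __init__(self, load_fn, fallback):
        self._model = None
        self._fallback = fallback
        self._ready = threading.Event()
        threading.Thread(target=self._load, args=(load_fn,), daemon=True).start()

    def _load(self, load_fn):
        model = load_fn()        # slow: read weights from storage
        model("warm-up input")   # one dummy inference so the first real request is not the slow one
        self._model = model
        self._ready.set()

    def is_ready(self):
        # Suitable for wiring to a readiness probe.
        return self._ready.is_set()

    def wait_until_ready(self, timeout=None):
        return self._ready.wait(timeout)

    def predict(self, text):
        if not self._ready.is_set():
            return self._fallback   # e.g. a cached or default response
        return self._model(text)

def load_fake_model():
    time.sleep(0.2)   # placeholder for loading weights
    return lambda text: f"label:{len(text.split())}"

service = WarmingModel(load_fake_model, fallback="label:unknown")
during_warmup = service.predict("two words")
service.wait_until_ready(timeout=5)
after_warmup = service.predict("two words")
print(during_warmup, after_warmup)
```

The dummy inference matters because many runtimes defer work such as kernel compilation or memory allocation to the first call, so a model that is loaded but never exercised can still be slow on its first real request.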
When to Pre-Compute
Some ML predictions can be computed ahead of time rather than at request time:
- User recommendations: Compute personalized recommendations nightly. At request time, simply retrieve from a key-value store. Latency becomes a database lookup (1-5ms) instead of a model inference (100-500ms). Tradeoff: recommendations are stale.
- Embeddings for a fixed corpus: If you have a fixed set of documents to search, compute embeddings offline and store in a vector database. At query time, only embed the query (fast, short text) and search the pre-computed vectors.
- Content classification: Classify all your content during ingestion rather than at read time. Store the labels in your database.
Pre-computation is the highest-leverage latency optimization when the model input is predictable and the output can be stored. It is not always possible (user query cannot be predicted in advance), but when it is, use it.
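The pattern splits into an offline job and a request-time lookup. In the sketch below a dictionary stands in for the key-value store, and `recommend_with_model` is a hypothetical stand-in for the expensive model call with a deterministic toy score.

```python
def recommend_with_model(user_id, catalog):
    # Stand-in for a slow model call: rank items by a deterministic toy score.
    return sorted(catalog, key=lambda item: (len(item) * user_id) % 7)[:2]

def nightly_job(user_ids, catalog):
    """Offline step: run the model for every known user and store the results."""
    return {user_id: recommend_with_model(user_id, catalog) for user_id in user_ids}

def get_recommendations(store, user_id, default):
    """Request-time step: a key lookup, with a default for users the job has not seen."""
    return store.get(user_id, default)

catalog = ["laptop", "desk", "monitor", "chair"]
store = nightly_job([1, 2, 3], catalog)

known_user = get_recommendations(store, 2, default=["desk"])
new_user = get_recommendations(store, 99, default=["desk"])
print(known_user, new_user)
```

The default matters in practice: users who signed up after the last job run have no stored entry, so the request path needs either a generic fallback or a slower on-demand model call.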