Reducing ML Model Serving Latency for Production

Users abandon features above 300ms. Here is the complete playbook for hitting production latency targets: quantization, batching, caching, hardware selection, and pre-computation.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#ml-serving #latency #quantization #vllm #production-ml

Model Quantization: 2-4x Faster with Minimal Quality Loss

Quantization converts model weights from 32-bit floating point (FP32) to lower-precision formats. The two most common:

INT8 quantization: Weights and activations are represented as 8-bit integers instead of 32-bit floats. Memory footprint drops 4x. Inference runs 2-4x faster on hardware with INT8 support (most modern CPUs and GPUs). Quality loss is typically 0.5-2% on benchmark tasks.

FP16 / BF16: 16-bit floating point in place of FP32 (strictly a precision reduction rather than integer quantization). 2x memory reduction. Up to 2x faster on modern GPUs with tensor cores (NVIDIA A100, H100, consumer RTX cards). Minimal quality loss.

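# Post-training dynamic quantization of a Hugging Face model with PyTorch
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()  # quantize for inference, not training

# Dynamic INT8 quantization (no calibration data needed).
# Linear-layer weights are stored as INT8; activations are quantized on the fly.
# Dynamic quantization targets CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize linear layers
    dtype=torch.qint8,
)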

For more aggressive quantization (4-bit and below), tools like bitsandbytes allow loading large language models in reduced precision on consumer hardware; the same library also implements the 8-bit LLM.int8() method. Quality loss increases, but models that previously required 80GB of GPU RAM can run on 24GB.

ONNX Runtime provides hardware-optimized inference for quantized models across CPU and GPU. Converting a Hugging Face model to ONNX and running it with ONNX Runtime typically gives a 2-5x speedup over native PyTorch inference, though the gain depends on the model and hardware.
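To make the INT8 arithmetic in this section concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The function names and the example weights are illustrative assumptions, not part of any library; real toolkits use per-channel scales, calibration, and fused integer kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Rounding to the nearest step bounds the per-weight error by half a step
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# FP32 stores 4 bytes per weight, INT8 stores 1
memory_ratio = (len(weights) * 4) / (len(weights) * 1)

print(quantized, scale, max_error, memory_ratio)
```

The 4x memory reduction falls directly out of the byte counts, and the quality loss comes from the rounding error, which shrinks as the scale (and therefore the weight range) gets smaller.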

Batching: GPU Utilization at Scale

GPUs are massively parallel processors. Sending a single example to a GPU underutilizes it: the GPU can run thousands of computations in parallel. Batching sends multiple requests through the model at once, amortizing the fixed overhead of a forward pass across many examples.

Static batching: Wait until you have N requests, then run them as a batch. Simple to implement but introduces latency for users who arrive when the batch is not yet full.

Dynamic batching: Accept requests as they arrive, collect them for a configurable window (e.g., 10ms), then process the collected batch. Balances throughput and latency. This is the standard approach in production ML serving systems.
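The collection window described above can be sketched with asyncio. This is a minimal illustration, not how any particular serving system implements it: `DynamicBatcher`, `fake_model`, and the parameter names are hypothetical, and the model call is a synchronous stand-in that would block the event loop in a real server.

```python
import asyncio


class DynamicBatcher:
    """Collects requests for up to max_wait_s (or max_batch_size), then runs one batch."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_s=0.01):
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()

    async def submit(self, item):
        # Each request gets a future that the worker resolves with its output
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the window
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self._run_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batch_sizes = []

    def fake_model(inputs):  # stand-in for a batched forward pass
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The two knobs trade off against each other: a longer window fills larger batches and raises throughput, but every request can wait up to `max_wait_s` before inference even starts.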

Continuous batching (for LLMs): Because LLM generation is iterative (one token at a time), traditional batching is inefficient: the whole batch waits for its longest sequence to finish before new requests can be added. Continuous batching schedules at the token-generation step instead: finished sequences leave the batch and waiting requests join it at every step, dramatically improving GPU utilization for LLM serving. vLLM combines continuous batching with PagedAttention, which manages KV-cache memory in fixed-size blocks.

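# vLLM for high-throughput LLM serving
from vllm import LLM, SamplingParams

# Hugging Face model ID; this is a gated model that requires access approval
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

prompts = ["Explain gradient descent.", "What is overfitting?", "Define entropy."]
outputs = llm.generate(prompts, sampling_params)  # batched automatically

# Each result keeps its prompt and a list of completions
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)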
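To see why token-level scheduling helps, here is a toy simulation that counts forward-pass steps under static and continuous batching. It is a simplified model under stated assumptions (a fixed number of batch slots, one token per request per step, no prefill cost) and does not reflect vLLM's actual scheduler; the function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest sequence finishes."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the waiting queue at every generation step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One forward pass: every active request generates one token
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests, two batch slots
lengths = [2, 8, 3, 8, 2, 2]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

In this toy workload the short requests no longer idle behind the long ones, so the same 25 tokens finish in fewer forward passes.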

Caching: Avoid Redundant Computation

For models where the same input is queried repeatedly, caching is among the highest-leverage optimizations: a cache hit avoids inference entirely.

Exact match caching: Store (input_hash -> output) in Redis or a similar key-value store. Look up the hash before running inference. Works best for structured inputs (classification of a fixed set of items, embeddings for a static corpus).
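A minimal sketch of exact match caching follows. A plain dict stands in for Redis so the example is self-contained; `ExactMatchCache` and `slow_classifier` are hypothetical names, and the JSON canonicalization assumes inputs are JSON-serializable.

```python
import hashlib
import json


class ExactMatchCache:
    """Looks up a hash of the input before falling back to inference."""

    def __init__(self, predict_fn, store=None):
        self._predict_fn = predict_fn
        self._store = {} if store is None else store  # dict stands in for Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_input):
        # Sort keys so logically equal inputs produce the same hash
        canonical = json.dumps(model_input, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def predict(self, model_input):
        key = self._key(model_input)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = self._predict_fn(model_input)
        self._store[key] = output
        return output


calls = []


def slow_classifier(item):  # stand-in for real model inference
    calls.append(item)
    return {"label": "long" if len(item["text"]) > 10 else "short"}


cache = ExactMatchCache(slow_classifier)
first = cache.predict({"text": "hello", "lang": "en"})
second = cache.predict({"lang": "en", "text": "hello"})  # same input, different key order
print(first, second, cache.hits, cache.misses)
```

Canonicalizing the input before hashing matters: without it, two requests that differ only in field order would miss the cache.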

Embedding caching: If you use BERT embeddings as features, cache the embeddings for each unique text. Recompute only when the underlying text changes.

KV cache (for LLMs): During autoregressive generation, the key and value tensors for tokens already processed (the prompt and every token generated so far) do not change between generation steps. LLM serving frameworks cache them between forward passes, so each step only computes keys and values for the newest token, significantly reducing per-token generation cost, especially for long prompts. This is enabled by default in vLLM and TGI.
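The saving can be estimated with simple counting. The sketch below counts how many token positions need their keys and values computed over a whole generation, with and without a KV cache. It is a back-of-the-envelope model, not a profile of any framework, and the prompt length and output length are assumed values.

```python
def kv_work_without_cache(prompt_len, new_tokens):
    """Every step re-processes the full sequence seen so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def kv_work_with_cache(prompt_len, new_tokens):
    """The prompt is processed once; each later step adds only the newest token."""
    if new_tokens <= 0:
        return 0
    return prompt_len + (new_tokens - 1)


# Assumed workload: 1,000-token prompt, 100 generated tokens
prompt_len, new_tokens = 1000, 100
without_cache = kv_work_without_cache(prompt_len, new_tokens)
with_cache = kv_work_with_cache(prompt_len, new_tokens)
print(without_cache, with_cache, without_cache / with_cache)
```

The gap widens with prompt length, which is why the cache matters most for long prompts.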

Semantic caching: Cache based on semantic similarity, not exact match. If a user asks "What is the capital of France?" and the cache contains the answer to "Tell me the capital of France," return the cached result. Tools like GPTCache implement semantic caching for LLM applications.
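The lookup logic behind semantic caching can be sketched as follows. This is not GPTCache's API: `SemanticCache` is a hypothetical class, and `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, used only so the example runs without dependencies. The similarity threshold is an assumed value that would need tuning in practice.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Bag-of-words vector; a real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached answer when a stored query is similar enough."""

    def __init__(self, embed=toy_embed, threshold=0.6):
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # list of (query_vector, answer)

    def put(self, query, answer):
        self._entries.append((self._embed(query), answer))

    def get(self, query):
        vector = self._embed(query)
        best_score, best_answer = 0.0, None
        for stored_vector, answer in self._entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None


semantic_cache = SemanticCache()
semantic_cache.put("Tell me the capital of France", "Paris")
hit = semantic_cache.get("What is the capital of France?")
miss = semantic_cache.get("How do I bake bread?")
print(hit, miss)
```

The threshold is the main design decision: set too low, the cache returns answers to questions that only look similar; set too high, it degenerates into exact match caching.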

Hardware: The Infrastructure Foundation

CPU vs GPU: CPU inference is appropriate for small models (BERT-base, DistilBERT) with low request volume. A single CPU core can run one BERT inference in roughly 20-50ms. For high-throughput serving or large models, a GPU is necessary.

GPU selection: A100 vs H100 vs consumer RTX. For production:

A100 (80GB): Large memory, good for large models. Roughly $2-3/hour on cloud.
H100 (80GB): Faster than A100 (especially for FP8 inference). Roughly $4-5/hour on cloud.
RTX 4090 (24GB): Cost-effective for smaller models in lower-traffic scenarios.

For LLM inference, the bottleneck is often memory bandwidth (moving weights from GPU memory to compute units), not raw FLOPS. H100 has 3.35 TB/s memory bandwidth vs A100's 2 TB/s.
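The bandwidth argument can be turned into a rough lower bound: at batch size 1, generating one token requires reading every weight once, so per-token time is at least model size divided by memory bandwidth. The sketch below applies that to an assumed 8B-parameter model at FP16 (2 bytes per parameter) using the bandwidth figures quoted above; it ignores KV-cache reads, compute time, and kernel overhead, so real latency will be higher.

```python
def min_ms_per_token(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on decode latency when weight reads dominate (batch size 1)."""
    model_bytes = num_params * bytes_per_param
    return model_bytes / bandwidth_bytes_per_s * 1000


# Assumed model: 8 billion parameters stored in FP16
a100_ms = min_ms_per_token(8e9, 2, 2.0e12)    # A100: 2 TB/s
h100_ms = min_ms_per_token(8e9, 2, 3.35e12)   # H100: 3.35 TB/s
print(round(a100_ms, 2), round(h100_ms, 2))
```

The same formula shows why quantization speeds up LLM decoding: halving `bytes_per_param` halves the bytes that must move for each token.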

Tensor cores: Modern NVIDIA GPUs have tensor cores that accelerate FP16, BF16, and INT8 matrix operations specifically. Using FP16 or INT8 quantization lets you take full advantage of these accelerators.

Optimization Runtimes: TensorRT, ONNX, and vLLM

TensorRT (NVIDIA): Compiles models specifically for NVIDIA GPU architecture, applying layer fusion, precision calibration, and kernel selection. Can achieve a 2-10x speedup over standard PyTorch inference on NVIDIA hardware. Most valuable for production deployments on fixed GPU types. Complex to set up.

ONNX Runtime: Cross-platform, hardware-agnostic inference. Simpler to set up than TensorRT. Provides 2-5x speedup on CPU and GPU through operator fusion and graph optimization. The right default for CPU inference.

vLLM: Purpose-built for LLM serving. PagedAttention for memory efficiency, continuous batching, tensor parallelism for multi-GPU. Standard for production LLM endpoints.

Triton Inference Server (NVIDIA): Supports multiple backends (PyTorch, ONNX, TensorRT, custom). Dynamic batching, model versioning, gRPC and REST APIs out of the box. Good for multi-model serving infrastructure.

Cold Start: Keeping Models Warm

Serverless ML inference (Lambda, Cloud Run) suffers from cold starts: the first request after a period of inactivity must load the model from storage into memory, taking 5-30 seconds for large models. This is unacceptable for user-facing features.

Solutions:

Minimum instance count: Keep at least one instance warm at all times. Costs money even at zero traffic.
Lazy loading with warm-up: Accept the cold start but serve cached responses or a fallback during warm-up.
Pre-warmed containers: Deploy to always-on compute (EC2, GKE) rather than serverless. More operational overhead but no cold starts.
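The warm-up and fallback pattern from the list above can be sketched as a small wrapper. Everything here is hypothetical (`WarmModelServer`, `load_fake_model`, the fallback text); the sleep stands in for loading weights from storage, and a real service would run `startup` in the background and expose `ready` through a health check.

```python
import time


class WarmModelServer:
    """Loads the model once at startup and serves a fallback until it is ready."""

    def __init__(self, load_model, warmup_input, fallback):
        self._load_model = load_model
        self._warmup_input = warmup_input
        self._fallback = fallback
        self._model = None
        self.ready = False

    def startup(self):
        model = self._load_model()
        # A dummy request forces lazy initialization before real traffic arrives
        model(self._warmup_input)
        self._model = model
        self.ready = True

    def predict(self, text):
        if not self.ready:
            return self._fallback(text)
        return self._model(text)


def load_fake_model():
    time.sleep(0.05)  # stand-in for reading weights from storage
    return lambda text: text.upper()


server = WarmModelServer(
    load_fake_model,
    warmup_input="warm-up",
    fallback=lambda text: "(warming up)",
)
before = server.predict("hello")  # served by the fallback
server.startup()
after = server.predict("hello")   # served by the loaded model
print(before, after)
```

Running a dummy inference during startup moves one-time costs out of the first user request.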

When to Pre-Compute

Some ML predictions can be computed ahead of time rather than at request time:

User recommendations: Compute personalized recommendations nightly. At request time, simply retrieve from a key-value store. Latency becomes a database lookup (1-5ms) instead of a model inference (100-500ms). Tradeoff: recommendations are stale.
Embeddings for a fixed corpus: If you have a fixed set of documents to search, compute embeddings offline and store in a vector database. At query time, only embed the query (fast, short text) and search the pre-computed vectors.
Content classification: Classify all your content during ingestion rather than at read time. Store the labels in your database.

Pre-computation is the highest-leverage latency optimization when the model input is predictable and the output can be stored. It is not always possible (a free-form user query cannot be predicted in advance), but when it is, use it.
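The recommendation case can be sketched as two functions: an offline job that runs the model for every user, and a request path that only reads the store. A dict stands in for the key-value store, and `fake_recommender`, the user IDs, and the default list are hypothetical.

```python
def nightly_recommendation_job(user_ids, recommend, store):
    """Offline path: run the (slow) model once per user and persist the result."""
    for user_id in user_ids:
        store[user_id] = recommend(user_id)


def get_recommendations(user_id, store, default):
    """Request path: a key-value lookup, no model inference."""
    return store.get(user_id, default)


model_calls = []


def fake_recommender(user_id):  # stand-in for an expensive model
    model_calls.append(user_id)
    return [f"item-{user_id}-{rank}" for rank in range(3)]


store = {}  # stand-in for Redis or another key-value store
popular_items = ["item-popular-0", "item-popular-1"]

nightly_recommendation_job(["u1", "u2"], fake_recommender, store)
calls_after_job = len(model_calls)

known_user = get_recommendations("u1", store, popular_items)
new_user = get_recommendations("u3", store, popular_items)  # signed up after the job ran
print(known_user, new_user, len(model_calls) == calls_after_job)
```

The default list covers users who did not exist when the job last ran, which is the practical face of the staleness tradeoff.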

Keep Reading

Knowledge Distillation Guide -- train smaller, faster models to replace large ones in production
ML Tools Ecosystem 2026 -- vLLM, Triton, and the full serving toolchain in context
The Complete Machine Learning Guide for Software Developers -- broader ML engineering context
