A 70B-parameter language model needs roughly 140 GB of GPU memory for its weights alone in FP16 (2 bytes per parameter), before counting activations or the KV cache. A production system serving thousands of concurrent users cannot dedicate an H100 to each user. Model compression is the set of techniques that make models smaller, faster, and cheaper to serve without unacceptable quality loss. This guide covers the main approaches, when to use each, and what tradeoffs to expect.
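The 140 GB figure is simple arithmetic: parameter count times bytes per parameter. A minimal sketch follows; the function name is illustrative, and the estimate covers weights only, ignoring activations, the KV cache, and quantization metadata such as scales.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

The same function can be used to check whether a given model fits on a target device before choosing a precision.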
Why Model Compression Matters
The gap between what a model achieves in a research notebook and what is economically feasible to serve in production is large. Compression bridges that gap along three main dimensions:
- Size: how much memory the model requires. Smaller models fit on cheaper hardware (consumer GPUs, mobile devices, edge processors).
- Latency: how long inference takes per request. Faster models handle more concurrent users on the same hardware.
- Throughput: how many requests per second the model can process. Throughput usually rises as latency falls, and a smaller model also leaves more memory for larger batches.
The tradeoff is accuracy: compression usually costs some quality. The question is whether that loss is acceptable for your use case.
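One way to make "acceptable" concrete is to score the baseline and the compressed model on the same evaluation set and compare the drop against a budget. The sketch below is illustrative only: the function names, the toy predictions, and the `max_drop` tolerance are all assumptions, not part of any library.

```python
from typing import Sequence

def accuracy(preds: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def compression_report(baseline_preds, compressed_preds, labels, max_drop=0.01):
    """Compare a compressed model with its baseline on one evaluation set.

    max_drop is a hypothetical tolerance, expressed as a fraction of accuracy.
    """
    base_acc = accuracy(baseline_preds, labels)
    comp_acc = accuracy(compressed_preds, labels)
    return {
        "baseline_acc": base_acc,
        "compressed_acc": comp_acc,
        "drop": base_acc - comp_acc,
        # How often the two models give the same answer, right or wrong.
        "agreement": accuracy(compressed_preds, baseline_preds),
        "acceptable": (base_acc - comp_acc) <= max_drop,
    }

# Toy data: 10 examples, 3 classes.
labels           = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
baseline_preds   = [0, 1, 1, 0, 2, 2, 1, 0, 2, 0]
compressed_preds = [0, 1, 1, 0, 2, 1, 1, 0, 2, 0]

report = compression_report(baseline_preds, compressed_preds, labels, max_drop=0.05)
print(report)
```

Tracking agreement alongside accuracy helps catch cases where overall accuracy is unchanged but the compressed model fails on different inputs than the baseline.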
Pruning
Pruning removes weights from a neural network based on the assumption that many weights contribute little to the model's output.
Magnitude pruning is the simplest approach: set all weights whose absolute value falls below a threshold to zero. After pruning, the model has fewer non-zero weights, but it only becomes smaller or faster in practice when it is stored in a sparse format and run with sparse matrix operations. The model is typically fine-tuned for a few epochs after pruning to recover accuracy.
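import torch
import torch.nn.utils.prune as prune
import torchvision
# Example model with a `conv1` layer; substitute your own network
model = torchvision.models.resnet18(weights=None)
# Zero out the 20% of conv1 weights with the smallest absolute value
prune.l1_unstructured(model.conv1, name="weight", amount=0.2)
# Fine-tune for a few epochs here; the mask keeps pruned weights at zero...
# Make pruning permanent: fold the mask into the weight and remove the reparametrization
prune.remove(model.conv1, "weight")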
Structured pruning removes entire neurons, filters, or attention heads rather than individual weights. Structured sparsity is more hardware-friendly because it reduces the actual matrix dimensions rather than creating sparse patterns that require specialized sparse matrix libraries.
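A minimal sketch of structured pruning with PyTorch's `prune.ln_structured`, applied to a standalone convolution layer. The layer shape and the 25% amount are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# Zero out the 25% of output filters (dim=0) with the smallest L2 norm (n=2).
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
prune.remove(conv, "weight")  # fold the mask into the weight tensor

# A filter counts as pruned when every weight in it is zero.
filter_norms = conv.weight.detach().flatten(start_dim=1).norm(dim=1)
pruned_filters = int((filter_norms == 0).sum())
print(f"Pruned {pruned_filters} of {conv.out_channels} filters")
```

Note that this only zeroes the filters; the weight tensor keeps its original shape. Realizing the size and speed benefit requires a further step that physically removes the zeroed filters and the matching input channels of the next layer.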
Typical outcome: with fine-tuning after pruning, 20-50% of weights can often be removed at a cost of roughly 1-2% accuracy or less, although the exact figure depends on the architecture and task. More aggressive pruning requires larger quality tradeoffs.
Quantization
Quantization reduces the numerical precision of model weights and activations. Standard neural network weights are stored as 32-bit floating-point numbers (FP32). Reducing to 16-bit (FP16), 8-bit integer (INT8), or 4-bit (INT4) dramatically reduces memory and speeds up computation.
FP16 (half precision): 2x smaller than FP32, with essentially no accuracy loss for inference (training in FP16 needs extra care, such as loss scaling). Modern GPUs support fast FP16 computation, which makes this the safest first step in any compression pipeline.
INT8 quantization: 4x smaller than FP32 and roughly 2-4x faster on hardware with INT8 support (Intel CPUs, NVIDIA GPUs with Tensor Cores). Accuracy loss is typically under 1% for most vision and NLP models when the quantization ranges are calibrated correctly. This is the most commonly used production quantization level.
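To show what "quantizing to INT8" means numerically, here is a sketch of per-tensor affine quantization in NumPy: floats are mapped to 8-bit integers through a scale and a zero point, then mapped back. The helper names are illustrative, and production toolkits add refinements such as per-channel scales that are not shown here.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric (affine) per-tensor quantization of float values to int8."""
    qmin, qmax = -128, 127
    # The representable range must include 0 so that zero maps exactly.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1024).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
max_err = float(np.abs(weights - dequantize(q, scale, zero_point)).max())
print(f"fp32 bytes: {weights.nbytes}, int8 bytes: {q.nbytes}, max error: {max_err:.4f}")
```

The rounding error per weight is bounded by about half the scale, which is why a tight, well-calibrated range matters: a single large outlier stretches the scale and costs precision for every other value.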
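Post-training quantization (PTQ) quantizes an already trained model without retraining. Static PTQ calibrates activation ranges on a small representative dataset; dynamic quantization, shown here, needs no calibration data because it quantizes weights ahead of time and computes activation ranges at runtime: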
import io
import torch
from torch.quantization import quantize_dynamic
# Dynamic quantization: weights stored as INT8, activations quantized at runtime.
# `model` is assumed to be an existing nn.Module that contains nn.Linear layers.
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
# Compare serialized sizes. Summing over parameters() would be misleading here:
# quantized Linear layers keep packed weights that parameters() does not list.
def serialized_size_mb(m):
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6
print(f"Original: {serialized_size_mb(model):.1f}MB, Quantized: {serialized_size_mb(quantized_model):.1f}MB")
Quantization-aware training (QAT) inserts fake quantization operations during training, allowing the model to adapt to quantization noise. QAT usually produces better accuracy than PTQ, especially for aggressive quantization (INT4), at the cost of a retraining step.
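The core trick behind fake quantization can be sketched in a few lines: round the weights in the forward pass, but let gradients flow through as if no rounding had happened (the straight-through estimator). This is a conceptual illustration with a hypothetical `fake_quantize` helper, not PyTorch's QAT tooling, which adds observers and module swapping on top of the same idea.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Forward pass: returns the quantized values.
    # Backward pass: the detached term has no gradient, so d(out)/d(x) = 1.
    return x + (x_q - x).detach()

torch.manual_seed(0)
w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quantize(w, num_bits=4)
w_q.sum().backward()

print("max rounding error:", (w_q - w).abs().max().item())
print("gradient is all ones:", bool(torch.all(w.grad == 1)))
```

Because the loss is computed on the rounded weights while updates are applied to the underlying float weights, training can settle on values that survive rounding well.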
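INT4 quantization: 8x smaller than FP32. It loses more accuracy than INT8, but the loss is often acceptable for LLMs, which tend to be large enough that the precision of individual weights matters less. GPTQ and AWQ are popular 4-bit weight quantization methods for LLMs that preserve quality by calibrating on a small dataset: GPTQ compensates for rounding error as it quantizes each layer, and AWQ rescales the weight channels that matter most for the activations.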
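With only 16 levels available, the choice of scale granularity matters a great deal. The sketch below uses plain round-to-nearest quantization, which is deliberately simpler than GPTQ or AWQ, to show how giving each small group of weights its own scale limits the damage from an outlier. The group sizes and the injected outlier are assumptions chosen for illustration.

```python
import numpy as np

def quantize_dequantize_4bit(w: np.ndarray, group_size: int) -> np.ndarray:
    """Symmetric round-to-nearest 4-bit quantization with one scale per group."""
    qmax = 7  # symmetric signed 4-bit range: integers in [-7, 7]
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(64, 128)).astype(np.float32)
w[0, 0] = 8.0  # a single outlier weight

mse_by_group = {}
for group_size in (w.size, 128, 32):  # w.size means one scale for the whole tensor
    w_hat = quantize_dequantize_4bit(w, group_size)
    mse_by_group[group_size] = float(np.mean((w - w_hat) ** 2))
    print(f"group_size={group_size:>5}: MSE={mse_by_group[group_size]:.5f}")
```

Smaller groups reduce error but store more scales, so the effective bits per weight rise slightly above 4.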
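The bitsandbytes library, used through Hugging Face Transformers, makes 4-bit loading accessible in Python (it stores weights in the FP4 or NF4 4-bit data types rather than plain INT4):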
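import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # matrix multiplies run in FP16 after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
# Gated checkpoint: requires accepting the license on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
)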
Knowledge Distillation
Knowledge distillation trains a small "student" model to mimic the behavior of a large "teacher" model. The student learns not just from the hard labels (correct class) but from the teacher's soft probability distributions, which contain information about which classes are similar.
DistilBERT is the canonical example: it is 40% smaller than BERT, 60% faster, and retains 97% of BERT's language-understanding performance as measured on the GLUE benchmark. The student was trained on BERT's output distributions, not only on the training labels.
The distillation loss is a combination of:
- Cross-entropy loss against the true labels (standard supervised loss)
- KL divergence between the student's and teacher's output distributions (distillation loss)
import torch.nn.functional as F
def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Soft loss: student learns from teacher's distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard loss: student also learns from ground truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
Temperature controls how much information the soft labels carry: the logits are divided by the temperature before the softmax. A higher temperature produces softer distributions that reveal more about inter-class relationships but also make the training signal noisier. A temperature of 4-8 is a common starting range for distillation, and it is worth tuning as a hyperparameter.
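The effect of temperature is easy to see on a single set of logits. In this sketch the logit values are made up; the point is how the probability mass spreads out as the temperature rises.

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 3.0, 1.0, -2.0])  # made-up logits for 4 classes

for temperature in (1.0, 4.0, 8.0):
    probs = F.softmax(teacher_logits / temperature, dim=-1)
    entropy = -(probs * probs.log()).sum().item()
    rounded = [round(p, 3) for p in probs.tolist()]
    print(f"T={temperature}: probs={rounded}, entropy={entropy:.3f}")
```

At T=1 nearly all the mass sits on the top class, so the student sees little beyond the hard label. At higher temperatures the relative ranking of the other classes becomes visible in the targets.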
Knowledge distillation requires access to the teacher model during training (or pre-computed teacher logits). It is more expensive upfront than PTQ but can produce higher quality compressed models, especially when the gap in size between teacher and student is large.
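A sketch of how the teacher fits into a training loop, using the same soft and hard loss terms as the function above. The tiny MLPs, the random batch, and the hyperparameters are stand-ins chosen so the example runs quickly; they are not a recommended setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4)).eval()
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

inputs = torch.randn(32, 16)         # stand-in batch of features
labels = torch.randint(0, 4, (32,))  # stand-in ground-truth classes
temperature, alpha = 4.0, 0.5

# The teacher is frozen, so its logits need no gradients. They could equally
# be computed once offline and loaded from disk during student training.
with torch.no_grad():
    teacher_logits = teacher(inputs)

losses = []
for step in range(20):
    student_logits = student(inputs)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()   # gradients reach only the student's parameters
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Only the student is passed to the optimizer, which is what keeps the extra cost limited to one teacher forward pass per batch.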
ONNX: Cross-Platform Optimized Inference
ONNX (Open Neural Network Exchange) is a model format supported by most major ML frameworks. Converting a model to ONNX and running it with ONNX Runtime provides optimized inference across CPUs, GPUs, and specialized accelerators without requiring the original training framework at inference time.
Converting a PyTorch model to ONNX:
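import torch
# `model` is assumed to be an image model that accepts 1x3x224x224 input
model.eval()  # export in inference mode (dropout off, batch norm uses running statistics)
dummy_input = torch.randn(1, 3, 224, 224)  # Batch size 1, 3 channels, 224x224
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    # Mark the batch dimension as variable so the exported graph accepts any batch size
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
)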
ONNX Runtime inference is often 1.5-3x faster than PyTorch eager-mode inference on CPU, although the gain depends on the model and hardware. The speedup comes from graph optimizations (operator fusion, constant folding, memory layout optimization) that PyTorch's eager execution mode does not perform by default.
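import onnxruntime as ort
import numpy as np
# Prefer the GPU provider and fall back to CPU when CUDA is unavailable
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
inputs = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs)  # list of NumPy arrays, one per model output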
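Speedup figures vary widely between models and machines, so it is worth measuring on your own setup. Here is a small, framework-agnostic timing helper; the `benchmark` function and its defaults are illustrative, and the workload at the bottom is a stand-in for a real inference call.

```python
import statistics
import time
from typing import Callable

def benchmark(fn: Callable[[], object], warmup: int = 5, iters: int = 50) -> dict:
    """Time a zero-argument inference callable and summarize latency in ms."""
    for _ in range(warmup):  # warm-up runs absorb lazy initialization and caching
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
        # Sequential throughput at batch size 1, derived from mean latency.
        "throughput_per_s": 1e3 / statistics.mean(samples_ms),
    }

# Stand-in workload. In practice, pass something like
# `lambda: session.run(None, inputs)` and compare it with the PyTorch call.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

When timing PyTorch on a GPU, the callable should end with `torch.cuda.synchronize()`, since CUDA kernels launch asynchronously and the timer would otherwise stop too early.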
TensorRT (NVIDIA's inference optimizer) can further optimize ONNX models for NVIDIA hardware through layer fusion and reduced-precision calibration. Additional speedups of 2-5x over ONNX Runtime are often cited, though results depend heavily on the model and GPU.
When Full Precision Is Necessary
Compression is not appropriate for every use case. Keep full FP32 or FP16 precision in the following domains:
Medical imaging and diagnosis: quantization-induced errors in segmentation or classification can lead to misdiagnosis. Regulatory requirements may also constrain changes to a model that has already been validated.
Safety-critical systems: autonomous vehicles, aerospace, industrial control systems. Even an accuracy loss that is small on average is unacceptable when individual errors can have catastrophic consequences.
Scientific computing: physics simulations, molecular dynamics, and climate models, where numerical precision directly affects result validity.
Financial calculations: models in which small numerical errors accumulate over many time steps or compound through chained calculations.
For most other applications, FP16 is a low-risk first step, INT8 is appropriate for most classification and regression tasks, and INT4 is often acceptable for LLMs, where the scale of the model makes it more robust to per-weight precision reduction.