Why Compress LLMs?
A Llama 3 70B model needs roughly 140 GB of VRAM for its FP16 weights alone (70B parameters × 2 bytes), which means at least two 80 GB A100s, or four 40 GB A100s, just to load it. Inference cost scales roughly linearly with model size. Compression techniques reduce these requirements while preserving most of the capability, making large models deployable on smaller hardware.
The three main approaches have different tradeoffs: quantization is the easiest to apply, pruning requires more expertise, and distillation is the most expensive upfront but produces the best small-model quality.
1. Quantization
Quantization reduces the numerical precision of model weights. A standard FP16 model uses 2 bytes per parameter; INT8 uses 1 byte; INT4 uses 0.5 bytes.
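The bytes-per-parameter figures translate directly into weight storage. The following is a minimal sketch of that arithmetic; the helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores the KV cache, activations, and quantization metadata such as per-group scales.

```python
# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

Real deployments need headroom on top of these numbers, so treat them as a lower bound rather than a sizing guide.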
LLM.int8(): the bitsandbytes approach. It runs most of each matrix multiplication in INT8 but keeps the small fraction of "outlier" feature dimensions (those with unusually large activation magnitudes) in FP16. Near-lossless quality, ~2x memory reduction.
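    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Requires the bitsandbytes package to be installed
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        # LLM.int8(); recent transformers releases expect a quantization
        # config rather than the bare load_in_8bit=True flag
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # place layers on the available device(s)
    )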
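To make the outlier handling concrete, here is a minimal pure-Python sketch of the idea behind LLM.int8(). It is not the bitsandbytes implementation. The function names are hypothetical, the matrices are toy data, and the threshold of 6.0 is assumed as the cutoff for what counts as an outlier. Feature dimensions with a large activation go through a full-precision path; everything else is quantized with per-vector absmax scaling.

```python
def absmax_quantize(vec):
    """Symmetric INT8 quantization of one vector: returns (integer codes, scale)."""
    scale = max((abs(v) for v in vec), default=0.0) / 127 or 1.0
    return [round(v / scale) for v in vec], scale


def matmul_int8_mixed(X, W, threshold=6.0):
    """Compute X @ W with INT8 for regular feature dims, full precision for outliers.

    X is a list of activation rows (n x k); W is a list of weight rows (k x m).
    """
    k, m = len(W), len(W[0])
    # A feature dimension is an outlier if any activation in that column is large.
    outliers = [j for j in range(k) if any(abs(row[j]) >= threshold for row in X)]
    regular = [j for j in range(k) if j not in outliers]
    # Quantize each weight column once (one scale per column).
    w_cols = [absmax_quantize([W[j][c] for j in regular]) for c in range(m)]
    out = []
    for row in X:
        xq, sx = absmax_quantize([row[j] for j in regular])  # one scale per row
        out_row = []
        for c in range(m):
            wq, sw = w_cols[c]
            int_part = sum(a * b for a, b in zip(xq, wq)) * sx * sw  # INT8 path
            fp_part = sum(row[j] * W[j][c] for j in outliers)  # full-precision path
            out_row.append(int_part + fp_part)
        out.append(out_row)
    return out


def max_abs_error(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Toy activations with one outlier feature dimension (index 2).
X = [[0.5, -1.2, 40.0, 0.3], [0.1, 0.8, -35.0, -0.7]]
W = [[0.2, -0.1], [0.4, 0.3], [0.05, -0.02], [-0.3, 0.6]]
exact = [[sum(row[j] * W[j][c] for j in range(len(W))) for c in range(len(W[0]))]
         for row in X]

err_mixed = max_abs_error(matmul_int8_mixed(X, W), exact)
# An infinite threshold disables the outlier path, giving plain INT8.
err_plain = max_abs_error(matmul_int8_mixed(X, W, threshold=float("inf")), exact)
print(f"mixed-precision error: {err_mixed:.4f}, plain INT8 error: {err_plain:.4f}")
```

With a single large activation in the row, plain absmax scaling spends almost the whole INT8 range on the outlier and rounds the small values coarsely. Splitting the outlier dimension off is what keeps the error small.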
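GPTQ: post-training quantization to 4-bit that processes the model layer by layer, using approximate second-order (Hessian) information from a small calibration dataset to compensate for rounding error. It produces better quality than naive round-to-nearest INT4 and is a common format for community-quantized models on Hugging Face.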
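AWQ (Activation-aware Weight Quantization): identifies the most important weight channels using activation statistics and protects them by rescaling them before quantization, rather than storing them at higher precision. It often matches or beats GPTQ quality, and its inference kernels are often faster.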
| Method | Bit Width | Memory vs FP16 | Quality Loss |
|--------|-----------|----------------|--------------|
| LLM.int8() | 8-bit | ~50% | Minimal |
| GPTQ | 4-bit | ~25% | Low |
| AWQ | 4-bit | ~25% | Very low |
| bitsandbytes 4-bit | 4-bit | ~25% | Low-Medium |
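For reference, the naive 4-bit baseline that GPTQ and AWQ improve on is plain round-to-nearest with one scale per group of weights. The sketch below assumes a symmetric signed range of [-7, 7] and a toy group size of 4 (real tools commonly use larger groups such as 128). The function names are hypothetical and the code does not reproduce any particular library's storage format.

```python
def quantize_int4_group(weights, group_size=4):
    """Round-to-nearest symmetric 4-bit quantization with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to the edge of the range.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4_group(codes, scales, group_size=4):
    """Reconstruct approximate weights from integer codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.12, -0.7, 0.33, 0.05, 1.4, -0.02, 0.6, -1.1]
codes, scales = quantize_int4_group(weights)
restored = dequantize_int4_group(codes, scales)
for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f}")
```

The per-group scales are also why the table says "~25%" and not exactly 25%: assuming one FP16 scale per 128 weights, storage comes to about 4 + 16/128 = 4.125 bits per weight.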
2. Pruning
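Pruning removes weights or entire structural components (attention heads, FFN neurons) from a model. Unlike quantization, which keeps every weight at lower precision, pruning removes computation outright, so it can yield real speedups and not just memory savings. Whether it does depends on the sparsity pattern and the hardware.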
SparseGPT enables one-shot pruning of Llama-scale models to 50% sparsity with small accuracy loss, using only a small calibration set and no retraining:
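    # SparseGPT is applied post-training, here through a SparseML recipe.
    # NOTE: the recipe stub below is illustrative; verify the exact stub
    # against SparseZoo before use.
    from sparseml.transformers import SparseAutoModelForCausalLM

    model = SparseAutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        recipe="zoo:nlp/text_generation/llama-3-8b/pytorch/huggingface/llama3_8b/pruned50-none",
    )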
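Structured pruning (removing entire attention heads or FFN channels) shrinks the dense matrices themselves, so it achieves real speedups on standard hardware. Unstructured pruning (zeroing individual weights) leaves matrix shapes unchanged and needs sparse kernels or hardware support, such as 2:4 semi-structured sparsity on NVIDIA Ampere and later GPUs, to realize the theoretical speedup.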
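The difference between the two pruning styles shows up in the matrix shapes. Below is a minimal sketch on a toy FFN layer. The function names are hypothetical and the scoring rules are simple assumptions (absolute magnitude and input-weight L2 norm), not the criteria used by SparseGPT or any specific library.

```python
def unstructured_prune(W, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape."""
    magnitudes = sorted(abs(w) for row in W for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in W]
    cutoff = magnitudes[n_prune - 1]  # ties at the cutoff are pruned as well
    return [[0.0 if abs(w) <= cutoff else w for w in row] for row in W]


def structured_prune_ffn(W_in, W_out, n_remove):
    """Drop the n_remove FFN neurons with the smallest input-weight L2 norm.

    W_in has one row per neuron; W_out has one column per neuron.
    """
    norms = [sum(w * w for w in row) for row in W_in]
    ranked = sorted(range(len(W_in)), key=lambda i: norms[i])
    keep = sorted(ranked[n_remove:])
    W_in_small = [W_in[i] for i in keep]
    W_out_small = [[row[i] for i in keep] for row in W_out]
    return W_in_small, W_out_small


# Toy FFN: 4 neurons, model dimension 3.
W_in = [[0.9, -0.04, 0.2], [0.01, 0.3, -0.02], [-0.5, 0.7, 0.03], [0.05, -0.06, 0.4]]
W_out = [[0.3, 0.02, -0.6, 0.01], [-0.2, 0.05, 0.4, -0.03], [0.8, -0.01, 0.1, 0.07]]

sparse_W_in = unstructured_prune(W_in, sparsity=0.5)
small_W_in, small_W_out = structured_prune_ffn(W_in, W_out, n_remove=2)

print("unstructured:", len(sparse_W_in), "x", len(sparse_W_in[0]), "(same shape, half zeros)")
print("structured:  ", len(small_W_in), "x", len(small_W_in[0]), "(fewer rows to multiply)")
```

After unstructured pruning, a dense matmul still does the same amount of work because the zeros are multiplied like any other value. After structured pruning, the matrices are smaller, so an ordinary dense kernel is faster with no special support.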
As a rough guide, structured pruning at 20-40% sparsity yields a 10-50% speedup, with 1-3% quality degradation on standard benchmarks.
3. Knowledge Distillation
Distillation trains a small "student" model to mimic a large "teacher" model. The student learns from the teacher's output probabilities (soft targets), not just the ground-truth labels.
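    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
        # Flatten (batch, seq, vocab) to (tokens, vocab) so both losses average per token
        vocab = student_logits.size(-1)
        student_logits = student_logits.reshape(-1, vocab)
        teacher_logits = teacher_logits.reshape(-1, vocab)
        # Soft targets: KL(teacher || student) at temperature T, rescaled by T^2
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Hard targets: standard cross-entropy against the ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels.reshape(-1))
        return alpha * soft_loss + (1 - alpha) * hard_loss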
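The `temperature` argument controls how much of the teacher's ranking over wrong answers the student gets to see. The following standard-library sketch uses made-up logits to show the effect; `softmax_with_temperature` is a hypothetical helper, separate from the loss function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a four-token vocabulary.
teacher_logits = [6.0, 3.0, 1.0, -2.0]
for temperature in (1.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 1 nearly all the probability mass sits on the top token, which carries little more information than a hard label. Raising the temperature flattens the distribution while preserving the ordering, so the relative scores of the other tokens contribute to the gradient.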
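Real examples: DistilBERT is 40% smaller than BERT-base while retaining 97% of its language-understanding performance. Phi-3 Mini (3.8B parameters) reaches GPT-3.5-level performance on many benchmarks, mainly through heavily filtered web data and synthetic data generated by larger models, which is a data-level relative of distillation rather than the logit-matching loss shown above.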
Distillation is the most expensive technique (requires training the student model) but produces the best small-model quality because the student learns from richer supervision than ground truth alone.
Combining Techniques
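In practice, the three techniques are complementary. One plausible production pipeline: distill a 70B model to 13B → prune 20% of heads → quantize to 4-bit AWQ. At about 0.5 bytes per parameter, the resulting weights take under 7 GB, so the model fits comfortably on a single 24 GB A10G. How close it comes to the original 70B at full precision depends on the task and on how carefully each step is tuned, because the quality losses from the three steps compound.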