LLM Compression: Pruning, Distillation, and Quantization Compared

Quantization, pruning, and knowledge distillation are three techniques for making large language models smaller and faster, each with different tradeoffs in quality, speed, and implementation complexity.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 15, 2026

8 min read

// tags

#compression #pruning #distillation #quantization #efficiency


2. Pruning

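Pruning removes weights or entire structural components (attention heads, FFN neurons) from a model. Unlike quantization, which keeps every weight and stores it at lower precision, pruning deletes parameters outright. When whole components are removed, the computation attached to them disappears too, so pruning can cut latency as well as memory.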

SparseGPT prunes Llama-scale models to 50% unstructured sparsity in one shot, without any retraining and with minimal accuracy loss:

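# SparseGPT is applied post-training, so no fine-tuning loop is needed.
from sparseml.transformers import SparseAutoModelForCausalLM

# The recipe stub names a 50%-sparsity recipe in SparseZoo. Check that the stub
# exists for your model version before relying on it.
model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    recipe="zoo:nlp/text_generation/llama-3-8b/pytorch/huggingface/llama3_8b/pruned50-none"
)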

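Structured pruning (removing entire attention heads or FFN channels) shrinks the dense matrices themselves, so it achieves real speedups on standard hardware. Unstructured pruning (zeroing individual weights) leaves tensor shapes unchanged and needs sparse kernels or hardware support, such as the 2:4 sparsity pattern on recent NVIDIA GPUs, to realize the theoretical speedup.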

With structured pruning at 20-40% sparsity, a 10-50% speedup is realistic, typically at the cost of 1-3% quality degradation on standard benchmarks.
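The difference between the two pruning styles is easier to see on a toy layer. The sketch below is illustrative only: it uses NumPy on a small random feed-forward block, ranks weights by plain magnitude and neurons by the L2 norm of their input weights, and does not implement SparseGPT. The helper names (`ffn`, `prune_unstructured`, `prune_structured`) and the layer sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_in = rng.normal(size=(d_ff, d_model))   # first FFN projection: one row per hidden neuron
W_out = rng.normal(size=(d_model, d_ff))  # second FFN projection: one column per hidden neuron

def ffn(x, w_in, w_out):
    """Two-layer feed-forward block with a ReLU in between."""
    return np.maximum(x @ w_in.T, 0.0) @ w_out.T

def prune_unstructured(w, sparsity):
    """Zero the smallest-magnitude entries; the matrix keeps its shape."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

def prune_structured(w_in, w_out, sparsity):
    """Drop whole hidden neurons, ranked by the L2 norm of their input weights."""
    n_keep = w_in.shape[0] - int(w_in.shape[0] * sparsity)
    scores = np.linalg.norm(w_in, axis=1)
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return w_in[keep], w_out[:, keep]

W_in_sparse = prune_unstructured(W_in, 0.5)
W_in_small, W_out_small = prune_structured(W_in, W_out, 0.25)

x = rng.normal(size=(4, d_model))
print(W_in_sparse.shape, float((W_in_sparse == 0).mean()))  # same shape, half the entries zero
print(W_in_small.shape, W_out_small.shape)                  # physically smaller matrices
print(ffn(x, W_in_small, W_out_small).shape)                # output shape is unchanged
```

The unstructured result still costs a full dense matrix multiply unless a sparse kernel exploits the zeros, while the structured result is an ordinary, smaller dense layer that any hardware runs faster.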

3. Knowledge Distillation

Distillation trains a small "student" model to mimic a large "teacher" model. The student learns from the teacher's output probabilities (soft targets), not just the ground-truth labels.

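import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    # Flatten (batch, seq, vocab) to (tokens, vocab) so both terms average per token
    vocab = student_logits.size(-1)
    student_logits = student_logits.reshape(-1, vocab)
    teacher_logits = teacher_logits.reshape(-1, vocab).detach()  # no gradient into the teacher
    # Soft-target term: KL(teacher || student) at the given temperature, rescaled by T^2
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean"
    ) * (temperature ** 2)
    # Hard-label term on the unscaled student logits
    hard_loss = F.cross_entropy(student_logits, labels.reshape(-1))
    return alpha * soft_loss + (1 - alpha) * hard_loss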
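To see why the loss above divides the logits by a temperature, it helps to look at what that does to the teacher's distribution. The following is a small sketch with made-up numbers: the four logits are hypothetical, and the `softmax` helper is written here in NumPy for illustration rather than taken from any library.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [6.0, 3.0, 1.0, -2.0]  # hypothetical scores for four candidate tokens

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

print(np.round(sharp, 3))  # nearly all mass on the top token
print(np.round(soft, 3))   # mass spread across the alternatives
```

At temperature 1 the top token takes about 95% of the probability, so the target is close to a one-hot label. At temperature 4 it takes about 53%, and the relative ranking of the other tokens becomes visible to the student. The `temperature ** 2` factor in the loss compensates for the smaller gradients that the softened distributions produce.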

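Real examples: DistilBERT is 40% smaller and 60% faster than BERT-base while retaining 97% of its language-understanding performance. Phi-3 Mini (3.8B parameters) reaches GPT-3.5-level performance on many benchmarks by training on heavily filtered web data and synthetic data generated by larger models, which is a data-level form of distillation rather than the logit matching shown above.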

Distillation is the most expensive of the three techniques because the student model has to be trained, but it often produces the best small-model quality: the student learns from richer supervision than ground-truth labels alone.

Combining Techniques

In practice, the three techniques are complementary. A common production pipeline: distill a 70B model to 13B → prune 20% of heads → quantize to 4-bit AWQ. The result runs on a single A10G (24 GB) and, on the tasks it was distilled for, can remain competitive with the original 70B model at full precision.
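A back-of-the-envelope calculation shows why this pipeline fits on one GPU. The sketch below counts weight storage only and assumes 16 bits per weight for the unquantized models. It ignores the KV cache, activations, and per-group quantization metadata, and it leaves out the head-pruning step because the savings from that depend on the architecture. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

A10G_MEMORY_GB = 24  # memory on a single NVIDIA A10G

teacher_fp16 = weight_memory_gb(70e9, 16)  # original 70B model
student_fp16 = weight_memory_gb(13e9, 16)  # after distillation to 13B
student_int4 = weight_memory_gb(13e9, 4)   # after 4-bit quantization

for name, gb in [("70B FP16", teacher_fp16),
                 ("13B FP16", student_fp16),
                 ("13B 4-bit", student_int4)]:
    fits = "fits" if gb < A10G_MEMORY_GB else "does not fit"
    print(f"{name}: {gb:.1f} GB, {fits} in {A10G_MEMORY_GB} GB")
```

Under these assumptions the 70B teacher needs about 140 GB for weights, the 13B student about 26 GB at FP16, and about 6.5 GB at 4 bits, which leaves most of the A10G free for the KV cache and activations.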

Quantization methods compared (weight memory relative to FP16):

Method	Bit Width	Memory vs FP16	Quality Loss
LLM.int8()	8-bit	~50%	Minimal
GPTQ	4-bit	~25%	Low
AWQ	4-bit	~25%	Very low
bitsandbytes 4-bit	4-bit	~25%	Low-Medium
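The memory column in the table follows directly from the bit width, and the quality column reflects rounding error. The sketch below makes both concrete with the simplest possible scheme: symmetric round-to-nearest quantization with one scale per group of 64 weights. It is not GPTQ, AWQ, or the bitsandbytes implementation, all of which add calibration or outlier handling on top. The function names, the group size, and the random weight matrix are assumptions made for illustration, and the code assumes no group is all zeros.

```python
import numpy as np

def quantize_absmax(w, bits=4, group_size=64):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    """Map integer codes back to floating point using the stored scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

def relative_error(w, bits):
    """Relative Frobenius-norm error introduced by a quantize/dequantize round trip."""
    q, scales = quantize_absmax(w, bits=bits)
    w_hat = dequantize(q, scales, w.shape)
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)

for bits in (8, 4):
    err = relative_error(W, bits)
    print(f"{bits}-bit: relative error {err:.4f}, nominal size {bits / 16:.0%} of FP16")
```

The 4-bit round trip loses noticeably more precision than the 8-bit one, which is the gap that calibration-based methods such as GPTQ and AWQ are designed to narrow.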
