Knowledge Distillation: Training Small Models to Match Large Ones

Knowledge distillation lets you deploy fast, small models that match the performance of large ones. Here is how it works, why soft targets help, and when to use it in production.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#knowledge-distillation #distilbert #model-compression #production-ml

DistilBERT: The Canonical Example

DistilBERT (Sanh, Debut, Chaumond, and Wolf, 2019) demonstrated that distillation could be applied to large pretrained language models, not just task-specific classifiers.

The distillation process for DistilBERT:

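Student architecture: the same general architecture as BERT-base, with 6 transformer layers instead of 12 and the token-type embeddings and pooler removed
Teacher: BERT-base (110M parameters)
Training: the student is initialized by taking one layer out of every two from the teacher (layer initialization), then trained on the same pretraining corpus with a combined objective: a distillation loss on the teacher's masked language modeling (MLM) output distributions, the standard MLM loss, and a cosine loss that aligns student and teacher hidden states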
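
The initialization step can be sketched with plain dictionaries standing in for model weights. This is a minimal illustration, not the DistilBERT training code: the helper name `init_student_from_teacher`, the key format `layer.{i}.weight`, and the choice of even-numbered teacher layers (0, 2, 4, ...) are all assumptions made for the example.

```python
def init_student_from_teacher(teacher_state, num_teacher_layers=12, stride=2):
    """Copy every `stride`-th teacher layer into consecutive student layers.

    `teacher_state` maps hypothetical keys like "layer.3.weight" to weights.
    """
    student_state = {}
    kept_layers = range(0, num_teacher_layers, stride)  # 0, 2, 4, ...
    for student_idx, teacher_idx in enumerate(kept_layers):
        prefix = f"layer.{teacher_idx}."
        for key, value in teacher_state.items():
            if key.startswith(prefix):
                # Renumber the layer so the student has contiguous indices.
                new_key = f"layer.{student_idx}." + key[len(prefix):]
                student_state[new_key] = value
    return student_state


# Toy teacher: 12 layers, each with one "weight" tagged by its layer index.
teacher_state = {f"layer.{i}.weight": f"teacher_w{i}" for i in range(12)}
student_state = init_student_from_teacher(teacher_state)
```

With a real model the same renaming would be applied to the keys of a state dict before loading it into the 6-layer student; which teacher layers to keep is a design choice worth checking against the implementation you use.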

Results:

40% fewer parameters (66M vs 110M)
60% faster inference
About 97% of BERT-base performance on the GLUE benchmark
About 97% of BERT-base F1 on SQuAD v1.1 (85.8 vs 88.5)

The layer initialization trick (starting the student weights from the teacher rather than random initialization) is important. It significantly reduces training time and improves final performance compared to random initialization.

Intermediate Layer Distillation

The original distillation paper (Hinton et al., 2015) matches only the final output distributions. You can also match intermediate representations:

Feature-based distillation (FitNets, Romero et al., 2015): Train the student's intermediate hidden states to match the teacher's hidden states. This gives the student more signal about the teacher's internal representations, not just its final outputs.
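
A feature-matching loss of this kind can be sketched in a few lines. Because the student is usually narrower than the teacher, the student's hidden state is first projected to the teacher's width. The sketch below uses plain Python lists rather than tensors, and the names `project` and `hint_loss` and the fixed projection matrix are assumptions for illustration; in practice the projection is learned jointly with the student.

```python
def project(hidden, weight):
    """Map a student vector (length d_s) to teacher width using a d_t x d_s matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def hint_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared_errors = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared_errors) / len(teacher_hidden)


# Toy example: student width 2, teacher width 3.
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # would be learned in practice
loss = hint_loss(student_hidden, teacher_hidden, weight)
```

In this toy case the projection reproduces the teacher state exactly, so the loss is zero; any mismatch raises it.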

Attention-based distillation (TinyBERT, Jiao et al., 2020): Match both the attention maps and the hidden states of selected teacher layers, in addition to the embeddings and the final predictions. The 4-layer TinyBERT reaches performance competitive with BERT-base at roughly 1/7th the size.
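
Matching attention maps needs two pieces: a mapping from student layers to teacher layers, and a distance between attention matrices. The sketch below assumes a uniform mapping (student layer m learns from teacher layer m times N/M) and a plain mean squared error over heads, queries, and keys. The function names are hypothetical, and the code assumes student and teacher have the same number of attention heads.

```python
def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Map student layer m (1-based) to teacher layer m * (N // M)."""
    step = num_teacher_layers // num_student_layers
    return {m: m * step for m in range(1, num_student_layers + 1)}


def attention_mse(student_attn, teacher_attn):
    """Mean squared error over heads x queries x keys, given nested lists."""
    total, count = 0.0, 0
    for s_head, t_head in zip(student_attn, teacher_attn):
        for s_row, t_row in zip(s_head, t_head):
            for s, t in zip(s_row, t_row):
                total += (s - t) ** 2
                count += 1
    return total / count


# A 4-layer student learning from a 12-layer teacher.
layer_map = uniform_layer_map(4, 12)
```

The total attention loss would then be the sum of `attention_mse` over each student layer and its mapped teacher layer.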

PKD (Patient Knowledge Distillation, Sun et al., 2019): Match the outputs of multiple intermediate teacher layers, not just the final layer. "Patient" refers to learning from several layers along the way rather than rushing to look only at the final output.
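
Which teacher layers the student learns from is a design choice. Two common options are evenly spaced layers and the last few layers, corresponding to the PKD-Skip and PKD-Last variants. The helpers below are a sketch under assumptions: the function names and the 1-based layer numbering are chosen for the example, and the teacher's final layer is left out on the assumption that the output-level distillation loss already covers it.

```python
def pkd_skip(num_teacher_layers, num_student_layers):
    """Evenly spaced teacher layers: every k-th layer, excluding the final one."""
    k = num_teacher_layers // num_student_layers
    return [k * i for i in range(1, num_student_layers)]


def pkd_last(num_teacher_layers, num_student_layers):
    """The (num_student_layers - 1) teacher layers just before the final one."""
    n = num_student_layers - 1
    return list(range(num_teacher_layers - n, num_teacher_layers))


# 6-layer student, 12-layer teacher (1-based layer numbers).
skip_layers = pkd_skip(12, 6)
last_layers = pkd_last(12, 6)
```

Here `skip_layers` spreads supervision across the teacher's depth, while `last_layers` concentrates it near the top.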

Each of these approaches adds training complexity but can improve the student's performance, particularly when the student is much smaller than the teacher.

Task-Specific vs Task-Agnostic Distillation

Task-agnostic distillation (like DistilBERT) happens at the pretraining stage. The student learns from the teacher on the same general pretraining task (masked language modeling). The resulting student is a general-purpose model that can be fine-tuned on downstream tasks, just like the teacher.

Task-specific distillation happens after fine-tuning. You fine-tune the teacher on task A, then distill the fine-tuned teacher into a student specifically for task A. For a given student size, this often yields higher task performance than task-agnostic distillation followed by fine-tuning.

For production deployments where you have a specific, stable task (sentiment classification, intent detection, named entity recognition), task-specific distillation is usually the right approach. You get maximum compression for your specific use case.

When to Use Knowledge Distillation in Production

Knowledge distillation is the right tool when:

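Inference latency is a hard constraint. If a full BERT model takes 50ms per request for a text classifier, distilling into a 6-layer student might bring that to roughly 30ms with minimal accuracy loss (going by DistilBERT's reported 60% speedup). A sub-10ms budget typically needs more than that: a smaller student, quantization, or an optimized runtime on top of distillation.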
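
Whether a student actually meets a latency budget is an empirical question, so it helps to measure teacher and student the same way on the target hardware. The harness below is a minimal sketch using only the standard library; the function name `measure_latency_ms`, the warmup and run counts, and the stand-in workload are assumptions for illustration.

```python
import time


def measure_latency_ms(fn, n_warmup=5, n_runs=50):
    """Return (p50, p95) latency in milliseconds for calling `fn()`."""
    for _ in range(n_warmup):
        fn()  # warm caches before timing
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = samples[len(samples) // 2]
    p95 = samples[min(len(samples) - 1, int(len(samples) * 0.95))]
    return p50, p95


# Stand-in workload; replace with a call that runs your model on one batch.
p50, p95 = measure_latency_ms(lambda: sum(range(1000)))
```

Comparing tail latency (p95) rather than the mean is usually the safer check against a hard constraint.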

Deployment target is resource-constrained, such as edge devices, mobile, or embedded systems. Distillation combined with INT8 quantization can shrink a BERT-base model from about 440MB in FP32 to well under 100MB, and to under 50MB with a student smaller than DistilBERT.
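
A back-of-the-envelope estimate shows where the savings come from: weight storage is roughly parameter count times bytes per parameter. The sketch below uses the parameter counts quoted earlier for BERT-base and DistilBERT and ignores non-weight overhead; the helper name `model_size_mb` is an assumption.

```python
def model_size_mb(num_params, bytes_per_param):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


bert_fp32 = model_size_mb(110e6, 4)     # FP32 teacher, 4 bytes per weight
student_fp32 = model_size_mb(66e6, 4)   # distilled student, still FP32
student_int8 = model_size_mb(66e6, 1)   # distilled student after INT8 quantization
```

By this estimate a 66M-parameter student at INT8 lands around 66MB, so getting below 50MB requires a student with fewer than 50M parameters.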

You have a fixed task and labeled training data. Task-specific distillation works best when you can distill the teacher's knowledge specifically for your use case.

You cannot afford API costs at scale. A distilled model running on your own hardware has predictable, low marginal cost per inference. At millions of calls per day, this matters.

When distillation is NOT worth the effort:

Your task changes frequently (you would need to re-distill)
Accuracy requirements are strict enough that even a small gap to the teacher is unacceptable
You are in early prototyping and do not know yet if the model will be used at scale

The Practical Workflow

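import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, DistilBertForSequenceClassification

# Step 1: Fine-tune teacher on your task
teacher = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# ... fine-tune teacher on your labeled data ...

# Step 2: Use teacher to generate soft labels for your training data
# `inputs` is a tokenized batch; `hard_labels` holds class indices, shape (batch_size,)
teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(**inputs).logits  # shape: (batch_size, num_labels)

# Step 3: Train student to match teacher logits
student = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
T = 4.0  # temperature: softens both distributions
alpha = 0.5  # weight on the hard-label loss

# DistilBERT does not accept token_type_ids, so pass only the fields it uses
student_inputs = {k: inputs[k] for k in ("input_ids", "attention_mask")}
student_logits = student(**student_inputs).logits
hard_loss = F.cross_entropy(student_logits, hard_labels)
# reduction="batchmean" gives the KL divergence averaged over the batch
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)

loss = alpha * hard_loss + (1 - alpha) * soft_loss
loss.backward()  # then step your optimizer as usual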
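
To see what the temperature and the T squared factor in the soft loss are doing, it can help to compute the same quantities by hand for a single example. The sketch below is a dependency-free illustration, not a replacement for the PyTorch code: the helper names `softmax` and `soft_loss` and the toy logits are assumptions. The T squared factor compensates for the gradients of the softened loss shrinking by about 1/T squared, so the soft and hard terms stay on a comparable scale.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_loss(student_logits, teacher_logits, T):
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * T ** 2


teacher_logits = [6.0, 2.0, -1.0]
sharp = softmax(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # more mass on the non-top classes
```

At T=1 almost all probability sits on the top class, while at T=4 the other classes carry visible weight, which is the extra signal the student learns from.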

Keep Reading

BERT Explained for Developers -- DistilBERT is BERT's distilled sibling; understand the parent model first
ML Model Evaluation Metrics Guide -- how to measure whether your distilled model actually retained performance
ML Serving Latency Guide -- distillation is one of several strategies for hitting latency targets in production
