DistilBERT vs BERT
DistilBERT applies knowledge distillation, a technique that predates it, to compress BERT. During pre-training, a BERT-base teacher supervises a 6-layer student that learns to reproduce the teacher's output distributions rather than only match hard labels. The result: 66M parameters versus BERT-base's 110M (40% smaller), 60% faster inference, and 97% of BERT's GLUE performance.
For production classification (sentiment, intent, topic), that 3% gap rarely matters. DistilBERT's latency advantage often matters more than marginal accuracy when serving thousands of requests per second.
The Hugging Face model hub provides cased and uncased variants (distilbert-base-cased and distilbert-base-uncased).
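To make "mimic its output distributions" concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It is not the DistilBERT training code: the function names and the temperature of 2.0 are illustrative assumptions, and the actual DistilBERT objective combines a term like this with masked-language-modeling and cosine-embedding losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; the default of 2.0 is an assumption.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # student matches teacher: 0.0
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student disagrees: positive loss
```

Unlike a cross-entropy loss against the gold label, this term also penalizes the student for ranking the wrong classes differently from the teacher, which is the extra signal distillation relies on.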
Fine-Tuning for Classification With Trainer API
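from transformers import (
    DataCollatorWithPadding,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import numpy as np

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # IMDB: negative / positive
)

dataset = load_dataset("imdb")

def tokenize(batch):
    # Truncate only; padding is applied per training batch by the collator below.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, batch_size=1000)

def compute_metrics(eval_pred):
    # Report accuracy alongside the evaluation loss.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

training_args = TrainingArguments(
    output_dir="./distilbert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch to its longest sequence
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model()  # write the final weights to output_dir for the export step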
ONNX Export for CPU Inference (5-10ms Latency)
Export to ONNX so the model can be served without a PyTorch dependency at inference time:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Assumes the fine-tuned weights were saved to this directory (e.g. via trainer.save_model()).
model = DistilBertForSequenceClassification.from_pretrained("./distilbert-sentiment")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model.eval()  # disable dropout before tracing

# Example input used to trace the graph; the dynamic axes below keep shapes flexible.
dummy_input = tokenizer("Sample text", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy_input["input_ids"], dummy_input["attention_mask"]),
    "distilbert_classifier.onnx",
    opset_version=14,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
)
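ONNX Runtime inference averages 5-10 ms per request on a 4-core CPU, compared to 50-80 ms for PyTorch on the same hardware. Latency scales with sequence length and batch size, so benchmark with inputs representative of your traffic.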
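Running the exported file with ONNX Runtime could look like the sketch below. The helper names (`load_session`, `classify`, `softmax`) and the label order are assumptions for illustration; the label order follows the IMDB convention of 0 = negative, 1 = positive. The `onnxruntime` import is deferred so the pure-Python parts can be exercised without it.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_session(path="distilbert_classifier.onnx"):
    """Create a CPU inference session; requires the onnxruntime package."""
    import onnxruntime as ort  # imported lazily: only needed for real inference
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])

def classify(texts, session, tokenizer, labels=("negative", "positive")):
    """Tokenize a batch, run the ONNX graph, and return label/score pairs.

    `labels` is a hypothetical mapping from class index to name.
    """
    enc = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="np"
    )
    # Feed names must match input_names used at export time.
    feeds = {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    (logits,) = session.run(["logits"], feeds)
    results = []
    for row in logits:
        probs = softmax([float(x) for x in row])
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": labels[best], "score": probs[best]})
    return results
```

In use, one would call `load_session()` once, load `DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")`, and pass both to `classify(["A wonderful film."], session, tokenizer)`. Creating the session once and reusing it matters, because session construction is far more expensive than a single inference call.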
FastAPI Deployment
Wrap the ONNX session in FastAPI for a production-ready classification endpoint. Load the session once at startup and accept batches of up to 32 texts per request. With 4 Gunicorn workers on a 4-vCPU machine, throughput exceeds 200 classifications/second.
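A minimal sketch of such an endpoint follows. The route path `/classify`, the request schema, and the names `create_app`, `validate_batch`, and `classify_fn` are assumptions for illustration, not taken from an existing service. `classify_fn` stands for any callable that maps a list of strings to a list of results, for example a closure over an ONNX Runtime session created once at import time.

```python
MAX_BATCH = 32  # batch limit stated in the text

def validate_batch(texts):
    """Reject empty or oversized batches before they reach the model."""
    if not texts:
        raise ValueError("texts must not be empty")
    if len(texts) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} texts per request")
    return texts

def create_app(classify_fn):
    """Build the FastAPI app around an already-loaded classifier.

    classify_fn is a hypothetical callable: list[str] -> list[dict].
    FastAPI and pydantic are imported here so the validation logic above
    stays usable without them.
    """
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    class ClassifyRequest(BaseModel):
        texts: list[str]

    app = FastAPI()

    @app.post("/classify")
    def classify_endpoint(req: ClassifyRequest):
        try:
            texts = validate_batch(req.texts)
        except ValueError as exc:
            # Surface batch-size problems as a client error.
            raise HTTPException(status_code=422, detail=str(exc)) from exc
        return {"results": classify_fn(texts)}

    return app
```

If a module named `app.py` (a hypothetical name) builds `app = create_app(...)` at import time, each worker process loads the session exactly once. Gunicorn can then serve it through Uvicorn's worker class, for example `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app`.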
Cost Comparison vs GPT-4o API
Classifying 10 million texts/month:
| Approach | Monthly Cost |
|---|---|
| GPT-4o API | ~$30,000 |
| GPT-4o-mini API | ~$600 |
| DistilBERT (1x c5.xlarge) | ~$120 |
| DistilBERT (self-hosted, bare metal) | ~$15 |
For high-volume classification tasks such as content moderation, intent routing, and topic labeling, fine-tuned DistilBERT is the economically rational choice. Fine-tune once on your labeled data; the model then classifies at near-zero marginal cost.
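The monthly figures are easier to compare as per-text costs. The short calculation below uses only numbers from this article: the approximate monthly costs in the table, the 10 million texts/month volume, and the 200 classifications/second throughput quoted for a 4-vCPU machine. A 30-day month is assumed, and the variable names are illustrative.

```python
MONTHLY_TEXTS = 10_000_000

# Approximate monthly costs in USD, copied from the table above.
monthly_cost_usd = {
    "GPT-4o API": 30_000,
    "GPT-4o-mini API": 600,
    "DistilBERT (1x c5.xlarge)": 120,
    "DistilBERT (self-hosted, bare metal)": 15,
}

def cost_per_thousand(monthly_cost, monthly_texts=MONTHLY_TEXTS):
    """Cost in USD of classifying 1,000 texts at the given monthly volume."""
    return monthly_cost * 1000 / monthly_texts

for name, cost in monthly_cost_usd.items():
    print(f"{name}: ${cost_per_thousand(cost):.4f} per 1,000 texts")

# Capacity check: how much of one 4-vCPU machine does this volume use?
SECONDS_PER_MONTH = 30 * 24 * 3600            # assumes a 30-day month
capacity = 200 * SECONDS_PER_MONTH            # texts/month at 200 texts/second
utilization = MONTHLY_TEXTS / capacity
print(f"Utilization at 200 texts/s: {utilization:.1%}")  # about 1.9%
```

At this volume a single instance sits mostly idle, so the self-hosted cost stays roughly flat as traffic grows toward the machine's capacity, while per-call API costs grow linearly with volume.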