What RoBERTa Fixed in BERT
BERT's pre-training had three weaknesses that RoBERTa corrected:
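- Static masking: BERT masks each sequence once during preprocessing, so the model sees the same masked positions every epoch. RoBERTa uses dynamic masking: a new mask is sampled each time a sequence is fed to the model, which exposes it to more varied prediction targets.
- Next Sentence Prediction (NSP): The RoBERTa authors found that removing BERT's NSP objective matched or slightly improved downstream performance. RoBERTa drops it entirely and trains only on masked language modeling.
- Training scale: BERT trained on about 16GB of text for 1M steps at a batch size of 256 sequences. RoBERTa trained on about 160GB for up to 500K steps at a batch size of 8K sequences.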
The result: RoBERTa-base outperforms BERT-base on GLUE tasks, by roughly 3-7 points depending on the task, without architectural changes. The Hugging Face RoBERTa page provides base and large variants.
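The difference between static and dynamic masking can be sketched in a few lines of plain Python. This is a simplified illustration, not the original pre-training code: the helper name `mask_tokens` is hypothetical, it works on whitespace-split words rather than subword tokens, and it omits BERT's 80/10/10 replacement rule (mask, random token, unchanged).

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token string


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of `tokens` with about `mask_prob` of the positions masked."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
num_epochs = 10

# Static masking: the mask is chosen once during preprocessing and reused every epoch.
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(num_epochs)]

# Dynamic masking: a fresh mask is sampled every time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(num_epochs)]

for epoch, masked in enumerate(dynamic_epochs[:3]):
    print(epoch, " ".join(masked))
```

In the transformers library, `DataCollatorForLanguageModeling` applies masking at batch-collation time, which gives the dynamic behavior without any extra code.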
Fine-Tuning in Under 30 Lines
from transformers import RobertaForSequenceClassification, RobertaTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
dataset = load_dataset("ag_news")
def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
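args = TrainingArguments(
    output_dir="roberta-ag-news",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
)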
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
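As written, the Trainer reports only the evaluation loss each epoch. A minimal sketch of an accuracy metric follows, assuming the usual pattern where the Trainer passes a (logits, labels) pair to a `compute_metrics` callback. It is written in plain Python so it runs without extra dependencies; `np.argmax(logits, axis=-1)` is the more common way to take the argmax. The toy logits and labels are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # Argmax over the class dimension gives the predicted class id.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}


# Toy check with 4 classes, matching num_labels=4 for AG News.
toy_logits = [
    [2.0, 0.1, 0.3, 0.0],  # predicts class 0
    [0.2, 0.1, 3.1, 0.4],  # predicts class 2
    [0.0, 1.5, 0.2, 0.1],  # predicts class 1
    [0.3, 0.2, 0.1, 0.9],  # predicts class 3
]
toy_labels = [0, 2, 3, 3]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the Trainer. Setting `metric_for_best_model="accuracy"` in TrainingArguments would then make `load_best_model_at_end` select on accuracy rather than loss.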
Multi-Label Classification
For multi-label tasks (a text belonging to multiple categories simultaneously), set problem_type so the model scores each label independently with binary cross-entropy instead of a softmax over classes:
import torch
from transformers import RobertaForSequenceClassification
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=10, problem_type="multi_label_classification")
# With this problem_type the model computes BCEWithLogitsLoss internally
# Labels must be float multi-hot vectors of length num_labels: [0.0, 1.0, 0.0, 1.0, ...]
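Two small pieces usually surround a multi-label model: turning lists of label ids into float multi-hot vectors before training, and thresholding per-label sigmoid scores at inference. The sketch below shows both in plain Python. The helper names `to_multi_hot` and `predict_labels` are hypothetical, and the 0.5 threshold is an assumption that is often tuned per label in practice.

```python
import math


def to_multi_hot(label_ids, num_labels):
    """Convert a list of active label ids into a float multi-hot vector."""
    vector = [0.0] * num_labels
    for idx in label_ids:
        vector[idx] = 1.0
    return vector


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.5):
    """Return the ids of all labels whose sigmoid score reaches the threshold."""
    return [i for i, logit in enumerate(logits) if sigmoid(logit) >= threshold]


print(to_multi_hot([1, 3], num_labels=10))
print(predict_labels([-2.0, 1.2, -0.3, 0.8, -4.0, -1.0, -0.5, -3.0, -2.2, -1.7]))  # [1, 3]
```

Unlike single-label classification, there is no argmax here: any number of labels, including none, can pass the threshold.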
Few-Shot Classification With SetFit
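For low-data scenarios (<50 labeled examples), SetFit often outperforms standard RoBERTa fine-tuning. It first fine-tunes a sentence transformer with contrastive training on pairs of labeled examples, then fits a lightweight classifier on the resulting embeddings: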
# Note: setfit 1.0 and later deprecate SetFitTrainer in favor of setfit.Trainer
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset
train_data = Dataset.from_dict({
    "text": ["positive example 1", "negative example 1", ...],  # 8-16 examples per class
    "label": [1, 0, ...]
})
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_data)
trainer.train()
SetFit can be fine-tuned on a CPU when the dataset is small, and it can approach the quality of a full RoBERTa fine-tune with far less labeled data (on the order of 50x less, depending on the task).
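The contrastive step works because a handful of labeled examples expands into many training pairs. The sketch below illustrates that idea only. It is not SetFit's actual sampling code, and the function name `make_contrastive_pairs` and the example texts are made up for illustration.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Pair every two examples; the target is 1.0 for the same class, else 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


texts = ["great product", "works perfectly", "broke in a day", "total waste of money"]
labels = [1, 1, 0, 0]

pairs = make_contrastive_pairs(texts, labels)
print(len(pairs), "pairs from", len(texts), "examples")
for text_a, text_b, target in pairs:
    print(target, "|", text_a, "|", text_b)
```

With n examples this yields n(n-1)/2 pairs, so 16 examples already produce 120 training pairs for the sentence transformer.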
Production ONNX Export
Export fine-tuned RoBERTa to ONNX following the same pattern as DistilBERT. RoBERTa-base in ONNX on CPU typically achieves 15-25ms latency per request, depending on sequence length and hardware, which is acceptable for most classification use cases without GPU infrastructure.
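Latency figures like these are worth checking on the target hardware. A minimal timing harness might look like the sketch below. The helper `measure_latency_ms` is hypothetical, and `dummy_predict` is a stand-in for whatever callable wraps the exported model (for example an ONNX Runtime session call), which is not shown here.

```python
import statistics
import time


def measure_latency_ms(predict_fn, sample, warmup=5, runs=50):
    """Time repeated single-request calls and report latency percentiles in ms."""
    # Warm-up calls are excluded so one-time setup costs do not skew the numbers.
    for _ in range(warmup):
        predict_fn(sample)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "p50_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "max_ms": timings[-1],
    }


def dummy_predict(text):
    # Stand-in for the real model call; replace with the exported model's inference.
    return len(text.split())


stats = measure_latency_ms(dummy_predict, "stocks rallied after the earnings report")
print(stats)
```

Reporting p95 alongside the median matters for serving: a model that averages 20ms but occasionally spikes may still miss a latency budget.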