The Alignment Problem at 7B Scale
Until late 2023, getting small language models to be genuinely helpful without a massive RLHF pipeline seemed out of reach. Zephyr 7B, released by the Hugging Face H4 team, changed that narrative by combining distilled supervised fine-tuning (dSFT) with Direct Preference Optimization (DPO). The resulting model outscored Llama 2 70B Chat, a model ten times its size, on MT-Bench.
What Distilled SFT Means
Traditional SFT uses human-written demonstrations. dSFT replaces those with synthetic dialogues generated by a capable teacher model (GPT-3.5 Turbo in the case of UltraChat). The Hugging Face team started from UltraChat, a synthetic dataset of roughly 1.47M multi-turn dialogues, and filtered it down to about 200,000 examples by fixing capitalization errors with truecasing heuristics and removing unhelpful responses. The UltraFeedback preference signals are not used at this stage; they feed the DPO step.
The pipeline:
- Fine-tune the base model (Mistral 7B) on the filtered UltraChat dialogues (dSFT)
- Take prompts from UltraFeedback, each paired with 4 completions produced by different LLMs
- Score completions with GPT-4 on criteria such as helpfulness, honesty, and instruction-following
- For DPO, use the highest-scoring completion as "chosen" and a random one of the other three as "rejected"
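DPO trains on (chosen, rejected) pairs, so four scored completions have to be reduced to a single pair. The following is a minimal sketch of that step, assuming the binarization described in the Zephyr paper: the completion with the highest mean score becomes "chosen", and one of the remaining completions is picked at random as "rejected". The function name `binarize` and the `text`/`scores` keys are hypothetical, chosen for illustration only.

```python
import random
from statistics import mean


def binarize(prompt, completions, rng=random):
    """Reduce scored completions to one {"prompt", "chosen", "rejected"} example.

    `completions` is a list of dicts with hypothetical keys:
    "text" (the response) and "scores" (one judge rating per criterion).
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions to form a pair")
    # Rank completions by their mean rating across all criteria.
    ranked = sorted(completions, key=lambda c: mean(c["scores"]), reverse=True)
    chosen = ranked[0]
    # Sample the rejected response from the rest instead of always taking the worst.
    rejected = rng.choice(ranked[1:])
    return {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}


# Toy ratings for illustration; real scores would come from the GPT-4 judge.
scored = [
    {"text": "response A", "scores": [9, 9, 8]},
    {"text": "response B", "scores": [6, 7, 5]},
    {"text": "response C", "scores": [4, 8, 6]},
    {"text": "response D", "scores": [2, 3, 2]},
]
pair = binarize("Explain what a hash map is.", scored, rng=random.Random(0))
print(pair["chosen"])  # response A
```

Sampling the rejected response at random, instead of always taking the lowest-scoring one, keeps the pairs varied and makes some of them harder to tell apart.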
DPO Instead of PPO
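Zephyr skips the reward model and the PPO training loop entirely. DPO reformulates alignment as a classification problem: given a chosen response and a rejected response, update the policy directly to prefer the chosen one, while a frozen copy of the SFT model serves as the reference that keeps the policy from drifting too far. This is simpler and more stable than PPO, and it is cheaper because no separate reward model is trained and no responses are sampled during training.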
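```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Targets recent TRL releases; older ones took beta= and tokenizer= on DPOTrainer.
checkpoint = "HuggingFaceH4/mistral-7b-sft-beta"  # the dSFT model
model = AutoModelForCausalLM.from_pretrained(checkpoint)      # policy to optimize
ref_model = AutoModelForCausalLM.from_pretrained(checkpoint)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Toy placeholder row; real training uses many {"prompt", "chosen", "rejected"} triples.
dpo_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["I am not sure."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(output_dir="zephyr-dpo", beta=0.1),  # beta: strength of the pull toward ref_model
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)
trainer.train()
```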
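To make the "classification problem" framing concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes the four inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The function name `dpo_loss` and its parameter names are illustrative and are not TRL's internal implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has margin 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # 0.6931... (= log 2)

# Raising the chosen response and lowering the rejected one reduces the loss.
print(dpo_loss(-20.0, -30.0, -25.0, -25.0) < math.log(2))  # True
```

The loss only depends on how the policy's preferences differ from the reference model's, which is why no explicit reward model is needed. A larger `beta` penalizes drifting away from the reference more strongly.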
MT-Bench Results
Zephyr-β scores 7.34 on MT-Bench, ahead of Llama 2 70B Chat (6.86) at 10% of the parameter count. MT-Bench is a multi-turn benchmark, graded by GPT-4, whose categories include reasoning, math, coding, and creative writing. The result was striking enough to prompt a rethinking of how much compute chat alignment actually requires.
Zephyr-α vs Zephyr-β
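Both variants follow the same dSFT-then-DPO recipe. α was the first release and scores 6.88 on MT-Bench; the refined β release reaches 7.34. The contribution of DPO itself is clearest in the paper's ablation: a dSFT-only model scores 6.64, and adding DPO on top raises that to 7.00, a gain that comes purely from preference optimization and needs only a short additional training run. This validated DPO as a practical alignment tool for resource-constrained teams.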
Lessons for Aligning Small Models
The key takeaways from the Zephyr recipe: synthetic data quality matters more than quantity (the SFT set is a filtered subset of UltraChat, not the full dataset), the choice of teacher matters because GPT-4's ratings decide which responses count as chosen or rejected, and DPO is a practical substitute for PPO when your preference data is clean. Teams building domain-specific assistants on tight budgets should start here.