The Complexity Problem With RLHF
Standard PPO-based RLHF keeps several models in memory at once: the frozen SFT (supervised fine-tuned) reference, the reward model, the policy being updated (kept close to the reference via a KL penalty), and usually a value (critic) network for PPO. PPO also adds hyperparameters (clip ratio, value-loss coefficient, GAE lambda) and is prone to training instability. The whole system is difficult to debug and expensive to run.
The DPO Insight
The DPO paper (arXiv:2305.18290) by Rafailov et al. shows that the optimal policy under the KL-constrained RLHF objective has a closed-form solution in terms of the reference policy (the SFT model) and the reward function. Inverting that solution expresses the reward through the policy, and substituting it into the reward modeling objective produces a loss that depends only on the policy being trained and the frozen reference. No separate reward model is needed.
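The substitution can be sketched in three steps. This is a condensed outline rather than the paper's full derivation; Z(x) (a per-prompt normalizing constant) and the Bradley-Terry preference model are notation introduced here and do not appear elsewhere in this text.

```latex
% 1. Closed-form optimum of the KL-constrained objective
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
                    \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)

% 2. Solve for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x)

% 3. Plug into the Bradley-Terry preference model; \beta \log Z(x) cancels
p(y_w \succ y_l \mid x)
  = \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \sigma\!\left(
      \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```

Taking the negative log-likelihood of this preference probability over the dataset, with the trainable policy pi in place of the optimum, gives the training loss.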
The DPO loss is:
L_DPO = -E[log sigma(beta * (log pi(y_w|x) - log pi_ref(y_w|x)) - beta * (log pi(y_l|x) - log pi_ref(y_l|x)))]
where y_w is the preferred response, y_l is the rejected response, pi is the model being trained, pi_ref is the frozen SFT reference, sigma is the logistic sigmoid, and beta controls how strongly the model is held near the reference (a larger beta means a stronger pull toward pi_ref).
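As a minimal sketch, the loss for a single preference pair can be computed from four sequence log-probabilities. The function and argument names below are illustrative, and plain Python floats stand in for the batched tensors a real trainer would use.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (y_w, y_l) pair.

    Each argument is a summed token log-probability log pi(y|x) of a whole
    response under the policy or the frozen reference.
    """
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After training, the policy has raised y_w and lowered y_l relative to the reference.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

With identical policy and reference the loss equals log 2, and it falls as the policy shifts probability from the rejected response to the preferred one.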
What This Means Intuitively
DPO increases the log-probability of preferred responses relative to rejected ones, with each pair weighted by how badly the implicit reward beta * (log pi(y|x) - log pi_ref(y|x)) currently misorders it. Pairs that the model already ranks correctly by a wide margin receive a small gradient, while pairs it ranks incorrectly receive a large one, which prevents overtraining on easy examples.
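The per-pair weighting can be made concrete. In the gradient of the DPO loss, each pair is scaled by sigma(r_l - r_w), where r is the implicit reward. The sketch below computes only that scalar weight, with hypothetical function names and made-up log-probabilities.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit reward beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_gradient_weight(policy_logp_w: float, policy_logp_l: float,
                        ref_logp_w: float, ref_logp_l: float,
                        beta: float = 0.1) -> float:
    """Scalar weight on a pair's gradient: sigma(r_l - r_w).

    Close to 1 when the implicit reward ranks the pair the wrong way round,
    close to 0 when the pair is already ranked correctly by a wide margin.
    """
    r_w = implicit_reward(policy_logp_w, ref_logp_w, beta)
    r_l = implicit_reward(policy_logp_l, ref_logp_l, beta)
    return sigmoid(r_l - r_w)


# Policy identical to the reference: both implicit rewards are 0.
weight_at_init = dpo_gradient_weight(-12.0, -15.0, -12.0, -15.0)

# Policy currently favors the rejected response: large weight.
weight_misordered = dpo_gradient_weight(-20.0, -10.0, -12.0, -15.0)

# Policy already favors the preferred response: small weight.
weight_well_ordered = dpo_gradient_weight(-4.0, -20.0, -12.0, -15.0)
```

The weight starts at 0.5 for every pair and then moves above or below it depending on whether the pair is still misordered.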
Preference Dataset Format
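DPO requires a dataset of (prompt, chosen, rejected) triplets, the same pairwise preference data used for RLHF reward model training. Public datasets on the Hugging Face Hub such as Anthropic's HH-RLHF, UltraFeedback, or Preference 700K can be used, although some of them need light preprocessing to separate the prompt from the two responses.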
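The preprocessing step might look like the sketch below, which splits two full transcripts that share a prompt into an explicit triplet. The transcript layout (turns introduced by "\n\nHuman:" and "\n\nAssistant:") is modeled on HH-RLHF-style records, and the helper name and toy example are assumptions for illustration.

```python
def split_shared_prompt(chosen: str, rejected: str,
                        marker: str = "\n\nAssistant:") -> dict:
    """Turn two full transcripts into a (prompt, chosen, rejected) triplet.

    Everything up to and including the last assistant marker is treated as
    the prompt; the remainder of each transcript is the response.
    """
    cut = chosen.rfind(marker)
    if cut == -1:
        raise ValueError("no assistant turn found in the chosen transcript")
    cut += len(marker)
    prompt = chosen[:cut]
    if not rejected.startswith(prompt):
        raise ValueError("chosen and rejected do not share the same prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[cut:],
        "rejected": rejected[cut:],
    }


# Toy record, not taken from any real dataset.
raw = {
    "chosen": "\n\nHuman: What is 2 + 2?\n\nAssistant: 2 + 2 equals 4.",
    "rejected": "\n\nHuman: What is 2 + 2?\n\nAssistant: It is 5.",
}
triplet = split_shared_prompt(raw["chosen"], raw["rejected"])
```

Records that fail the shared-prompt check are better dropped than repaired, since a mismatched prompt makes the comparison between the two responses meaningless.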
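# Minimal DPO training run with TRL. Argument names differ between TRL
# releases, so check them against the version you have installed.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)      # policy being trained (pi)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference (pi_ref)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding is needed for batching

# Assumption: the original snippet left `dataset` undefined. Any preference
# dataset with prompt / chosen / rejected fields works; this one is an example.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-output",  # assumed path; some versions require this argument
    beta=0.1,                 # strength of the pull toward the reference
    learning_rate=1e-6,
    per_device_train_batch_size=4,
    max_length=1024,          # prompt + response, in tokens
    max_prompt_length=512,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
)
trainer.train()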
DPO vs RLHF in Practice
DPO training is simpler, faster, and more stable than PPO-based RLHF, largely because no reward model or sampling loop runs during training. However, it is an offline method: the preference data is fixed and is never refreshed with samples from the current policy. Online DPO and SPIN (Self-Play Fine-Tuning) variants attempt to address this by generating new samples during training.
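One way to picture the online variants is a data-collection step that runs between training rounds: sample several responses per prompt from the current policy, score them, and keep the best and worst as a fresh preference pair. The sketch below is a generic illustration under that assumption, not the procedure of any specific paper. `sample` and `judge` are hypothetical callables standing in for the policy and for a reward model or other annotator.

```python
from itertools import cycle
from typing import Callable, Dict, List


def collect_online_pairs(prompts: List[str],
                         sample: Callable[[str], str],
                         judge: Callable[[str, str], float],
                         num_samples: int = 2) -> List[Dict[str, str]]:
    """Build fresh (prompt, chosen, rejected) triplets from the current policy."""
    pairs = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(num_samples)]
        scored = sorted(((judge(prompt, y), y) for y in candidates),
                        key=lambda item: item[0])
        (low_score, worst), (high_score, best) = scored[0], scored[-1]
        if high_score > low_score:  # skip ties: they carry no preference signal
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


# Toy stand-ins so the sketch runs without a model.
_answers = cycle(["4", "5"])


def toy_sample(prompt: str) -> str:
    return next(_answers)


def toy_judge(prompt: str, response: str) -> float:
    return 1.0 if response == "4" else 0.0


pairs = collect_online_pairs(["What is 2 + 2?"], toy_sample, toy_judge)
```

The resulting triplets have the same shape as a static preference dataset, so they can feed the same DPO loss on the next training round.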
ORPO and SimPO Variants
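ORPO (Odds Ratio Preference Optimization) eliminates the reference model entirely by adding an odds-ratio penalty on rejected responses to the SFT loss, so alignment happens in the same stage as supervised fine-tuning. SimPO (Simple Preference Optimization) also drops the reference model and uses the length-normalized (average per-token) log-probability as its implicit reward, together with a target margin, which reduces verbosity bias.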
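Both objectives can be sketched for a single pair without any reference log-probabilities. The code below follows the loss forms as commonly stated for SimPO and ORPO. The default values of beta, gamma, and lam are illustrative assumptions rather than recommended settings, and plain floats stand in for batched tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(logp_w: float, len_w: int, logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: the reward is the average per-token log-probability times beta;
    gamma is the target margin between the chosen and rejected rewards."""
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)


def orpo_odds_ratio_term(avg_logp_w: float, avg_logp_l: float) -> float:
    """ORPO penalty: -log sigmoid(log odds(y_w) - log odds(y_l)).

    Inputs are average per-token log-probabilities and must be < 0.
    """
    def log_odds(avg_logp: float) -> float:
        # log(p / (1 - p)) with p = exp(avg_logp)
        return avg_logp - math.log1p(-math.exp(avg_logp))

    return -log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))


def orpo_loss(avg_logp_w: float, avg_logp_l: float, lam: float = 0.1) -> float:
    """SFT negative log-likelihood on the chosen response plus the weighted penalty."""
    return -avg_logp_w + lam * orpo_odds_ratio_term(avg_logp_w, avg_logp_l)


# Same per-token likelihood, very different lengths: the SimPO rewards tie.
simpo_tied = simpo_loss(-10.0, 10, -40.0, 40, beta=2.0, gamma=0.0)

# A positive target margin asks for more than a tie.
simpo_with_margin = simpo_loss(-10.0, 10, -40.0, 40, beta=2.0, gamma=1.0)

orpo_tied = orpo_odds_ratio_term(-0.5, -0.5)
orpo_separated = orpo_odds_ratio_term(-0.3, -0.9)
```

Because the SimPO reward is divided by response length, a longer response gains nothing from its extra tokens alone; only a higher per-token likelihood moves the margin.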