The Problem With Raw Language Model Pretraining
A language model trained purely on next-token prediction learns to produce text that looks like the internet, which includes harmful content, misleading information, and text that ignores user intent. GPT-3 was powerful but difficult to deploy safely. The InstructGPT paper (arXiv:2203.02155) applied Reinforcement Learning from Human Feedback (RLHF), a technique developed in earlier work on learning from human preferences, to close that gap for instruction following.
The Three-Stage RLHF Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pretrained GPT-3 model. Collect a dataset of (prompt, ideal response) pairs written by human contractors. Fine-tune the model on this data with standard cross-entropy loss. This gives you a model that follows the format of instructions, but SFT alone is not enough: human-written demonstrations are expensive and limited in scale.
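The cross-entropy objective in this stage can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's training code: the function names, the toy logits, and the choice to mask out prompt tokens (so that only response tokens contribute to the loss) are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(step_logits, target_ids, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1.

    step_logits: one list of vocabulary logits per position
    target_ids:  the token id the model should predict at each position
    loss_mask:   1 for response tokens, 0 for prompt tokens
                 (masking the prompt is an assumption of this sketch)
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 tokens, 3 positions, the first is a prompt token.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]
loss = sft_loss(logits, targets, mask)
```

In a real setup the logits would come from the language model's forward pass and the loss would be minimized by gradient descent; the sketch only shows what quantity is being minimized.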
Stage 2: Train a Reward Model
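Sample multiple responses from the SFT model for each prompt. Human labelers rank the responses from best to worst (the paper has them rank between 4 and 9 per prompt), and each ranking is broken into pairwise comparisons. This produces a preference dataset. Train a separate model (initialized from the SFT model, with the token prediction layer replaced by a scalar output) to produce a reward score, optimized with a pairwise ranking loss: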
loss = -log(sigmoid(r_θ(x, y_w) - r_θ(x, y_l)))
where y_w is the preferred response and y_l is the rejected one.
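A minimal sketch of this loss in plain Python follows. The rewards are passed in as plain floats standing in for r_θ(x, y); the helper names are hypothetical, and averaging over all pairs derived from one ranking is an assumption of the sketch.

```python
import math
from itertools import combinations

def pairwise_loss(r_w, r_l):
    """-log(sigmoid(r_w - r_l)), computed as softplus(r_l - r_w) for stability."""
    d = r_l - r_w
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))

def reward_model_loss(ranked_rewards):
    """Average pairwise loss over every pair implied by one ranking.

    ranked_rewards: reward model scores, ordered by the labeler's ranking
                    (best response first). Needs at least two entries.
    """
    pairs = ranking_to_pairs(ranked_rewards)
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)

# Scores that agree with the labeler's ranking give a lower loss
# than scores that reverse it.
agree = reward_model_loss([3.0, 1.0, 0.0])
disagree = reward_model_loss([0.0, 1.0, 3.0])
```

The loss is small when the reward model scores the preferred response well above the rejected one, and it grows roughly linearly when the ordering is reversed, which is what pushes the model to reproduce the labelers' ordering.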
Stage 3: PPO Fine-Tuning
Use the reward model to score sampled responses and the SFT model as the starting policy. Run Proximal Policy Optimization (PPO) to update the policy to maximize expected reward, with a KL divergence penalty against the SFT model to prevent reward hacking and mode collapse.
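For readers unfamiliar with PPO, its core is a clipped surrogate objective that limits how far one update can move the policy. The sketch below shows that objective for a single action (here, a single token). The function name is hypothetical, clip_eps=0.2 is a commonly used default rather than a value taken from the source, and advantage estimation is left out entirely.

```python
import math

def ppo_clip_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action; PPO maximizes its mean.

    new_logprob: log-prob of the action under the policy being updated
    old_logprob: log-prob under the policy that generated the sample
    advantage:   how much better than expected the action turned out
    """
    ratio = math.exp(new_logprob - old_logprob)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, raising the action's probability by 50%
# earns no more credit than raising it by 20%.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
```

The clipping and the KL penalty play complementary roles: clipping bounds each individual update step, while the KL penalty bounds how far the policy drifts from the SFT model over the whole run.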
Why SFT Alone Was Not Enough
SFT suffers from two problems. First, demonstration data is expensive: you can only collect so many human-written examples. Second, the model learns to imitate the distribution of demonstrations, not to maximize quality. RLHF addresses both. Comparing two responses is faster for a labeler than writing one, and once the reward model is trained it can score as many policy samples as you can generate without further human labeling, providing a much richer training signal.
Reward Hacking and KL Penalty
Without constraints, the policy learns to produce text that the reward model scores highly but that is not actually useful — this is reward hacking. The KL penalty term keeps the RLHF model close to the SFT baseline, preventing degenerate outputs while still improving alignment.
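    # KL-penalized reward used during PPO (simplified)
    # r_ppo = r_reward_model - beta * KL(policy || sft_policy)
    # beta is a tuned hyperparameter, typically 0.02 to 0.1
    def ppo_reward(reward_score, policy_logprobs, sft_logprobs, beta=0.05):
        # Both log-prob arguments hold per-token log-probabilities of the sampled
        # response as an array or tensor (anything with elementwise "-" and .sum()).
        # Their summed difference is a single-sample estimate of the KL term.
        kl_penalty = (policy_logprobs - sft_logprobs).sum()
        return reward_score - beta * kl_penalty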
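The function above collapses everything into one number per response. Many RLHF implementations instead spread the KL penalty across tokens and add the reward model's score only at the final token, so that PPO receives a reward at every step. The sketch below shows that variant with plain Python lists; it is an illustration of a common practice, not something the source describes, and the function name is hypothetical.

```python
def per_token_rewards(reward_score, policy_logprobs, sft_logprobs, beta=0.05):
    """Per-token rewards: a KL penalty at every step, plus the score at the end.

    policy_logprobs, sft_logprobs: lists of per-token log-probabilities of the
    sampled response under the current policy and the SFT model.
    """
    rewards = [-beta * (p - s) for p, s in zip(policy_logprobs, sft_logprobs)]
    if rewards:
        rewards[-1] += reward_score  # the reward model scores the whole response
    return rewards

# The per-token rewards sum to the same sequence-level value:
# reward_score - beta * sum(policy_logprobs - sft_logprobs).
example = per_token_rewards(2.0, [-1.0, -0.5], [-1.5, -0.5], beta=0.1)
```

Because the per-token values sum to the sequence-level reward, the two formulations agree on the total; they differ only in how credit is distributed across the response.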
Results and Impact
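Labelers preferred outputs from the 1.3B InstructGPT model over outputs from the 175B GPT-3, despite the smaller model having over 100x fewer parameters. At equal size, 175B InstructGPT outputs were preferred to 175B GPT-3 outputs 85 ± 3% of the time. On TruthfulQA, InstructGPT produced truthful and informative answers about twice as often as GPT-3. The paper also reported fewer toxic outputs when the model was prompted to be respectful. RLHF did cause performance regressions on some public NLP datasets, a cost the authors call the alignment tax, which they largely mitigated by mixing pretraining gradients into the PPO updates (the PPO-ptx variant).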
How Claude and ChatGPT Use This
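Both ChatGPT and Claude use RLHF variants as core alignment techniques. Anthropic extends the approach with Constitutional AI, which uses reinforcement learning from AI feedback (RLAIF): a model critiques and compares responses against a written set of principles, reducing the amount of human preference labeling needed. The InstructGPT recipe has become a common starting point for aligning production LLMs.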