RLHF Explained: How InstructGPT Taught GPT-3 to Follow Instructions

InstructGPT showed that a 1.3B model trained with RLHF outperforms 175B GPT-3 on instruction following. Here is how the three-step pipeline actually works.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 4, 2026

9 min read

// tags

#rlhf #instructgpt #alignment #ppo #reward-model

The Problem With Raw Language Model Pretraining

A language model trained purely on next-token prediction learns to produce text that looks like the internet, which includes harmful content, misleading information, and text that ignores user intent. GPT-3 was powerful but difficult to deploy safely. The InstructGPT paper (arXiv:2203.02155) applied Reinforcement Learning from Human Feedback (RLHF), a technique developed in earlier work on learning from human preferences, to close that gap at GPT-3 scale.

The Three-Stage RLHF Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pretrained GPT-3 model. Collect a dataset of (prompt, ideal response) pairs written by human contractors. Fine-tune the model on this data with standard cross-entropy loss. This gives you a model that responds in an instruction-following format, but SFT alone is not enough: human-written demonstrations are expensive and limited in scale.
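As a rough illustration of the Stage 1 objective, the sketch below computes token-level cross-entropy for a single demonstration. The function name sft_loss and the toy probabilities are hypothetical, and masking out the prompt tokens so that only response tokens contribute is a common convention assumed here, not a detail taken from the paper.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the response, not the prompt.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Hypothetical toy example: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

In a real training loop the probabilities would come from the model's softmax output, and the loss would be averaged over a batch of demonstrations.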

Stage 2: Train a Reward Model

Sample multiple responses from the SFT model for each prompt. Human labelers rank the responses from best to worst (the paper used between 4 and 9 responses per prompt), and every pair drawn from a ranking becomes one comparison in a preference dataset. Train a separate model (initialized from SFT) to output a scalar reward score, optimized with a pairwise ranking loss:

loss = -log(sigmoid(r_θ(x, y_w) - r_θ(x, y_l)))

where y_w is the preferred response and y_l is the rejected one.
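The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's training code: the reward scores are hypothetical numbers standing in for r_θ outputs, and the helper names pairwise_loss and ranking_to_pairs are made up for this example. A ranking of K responses expands into K(K-1)/2 (preferred, rejected) pairs.

```python
import math
from itertools import combinations

def pairwise_loss(r_w, r_l):
    """-log(sigmoid(r_w - r_l)), computed as a numerically stable softplus."""
    d = r_w - r_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))

# Hypothetical reward scores for three responses, ranked best to worst by a labeler.
scores = {"a": 1.5, "b": 0.2, "c": -0.7}
pairs = ranking_to_pairs(["a", "b", "c"])
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
```

The loss shrinks toward zero as the preferred response's score pulls ahead of the rejected one, and equals log(2) when the two scores are tied.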

Stage 3: PPO Fine-Tuning

Use the SFT model as the starting policy and the reward model as the source of reward: the policy generates a response to a prompt, and the reward model scores it. Run Proximal Policy Optimization (PPO) to update the policy to maximize expected reward, with a KL divergence penalty against the SFT model to prevent reward hacking and mode collapse.
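The article does not spell out the PPO update itself, so the sketch below shows the standard clipped surrogate objective from the original PPO algorithm for a single action. The function name and the clip_eps=0.2 default are assumptions chosen for illustration, and the advantage is taken as given rather than estimated.

```python
import math

def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Clipping removes the incentive to move the ratio outside
    [1 - clip_eps, 1 + clip_eps] in a single update.
    """
    ratio = math.exp(new_logprob - old_logprob)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical case: the new policy makes the action 1.5x more likely.
gain = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
```

With a positive advantage the objective is capped once the ratio passes 1 + clip_eps, which keeps each policy update small.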

Why SFT Alone Was Not Enough

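SFT suffers from two problems. First, demonstration data is expensive: you can only collect so many human-written examples. Second, the model learns to imitate the distribution of demonstrations, not to maximize quality. RLHF addresses both. Ranking model outputs is cheaper for labelers than writing responses from scratch, and once the reward model is trained it can score as many policy samples as you generate without further human labeling, providing a much richer training signal.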

Reward Hacking and KL Penalty

Without constraints, the policy learns to produce text that the reward model scores highly but that is not actually useful. This is reward hacking. The KL penalty term keeps the RLHF model close to the SFT baseline, preventing degenerate outputs while still improving alignment.

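# PPO reward with KL penalty (simplified)
# r_ppo = r_reward_model - beta * KL(policy || sft_policy)
# where beta is typically 0.02 to 0.1

def ppo_reward(reward_score, policy_logprobs, sft_logprobs, beta=0.05):
    # policy_logprobs, sft_logprobs: arrays (e.g. NumPy or PyTorch) holding the
    # per-token log-probabilities of the sampled response under each model.
    # Their summed difference is a single-sample estimate of the KL term.
    kl_penalty = (policy_logprobs - sft_logprobs).sum()
    # A larger beta keeps the policy closer to the SFT model.
    return reward_score - beta * kl_penalty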
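In practice the penalty is often applied per token rather than once per sequence. The sketch below assumes one common arrangement: each token receives its share of the KL penalty, and the reward-model score is added on the final token. The function name per_token_rewards and the toy log-probabilities are hypothetical.

```python
def per_token_rewards(reward_score, policy_logprobs, sft_logprobs, beta=0.05):
    """Spread the KL penalty over tokens; add the reward-model score at the end."""
    if len(policy_logprobs) != len(sft_logprobs) or not policy_logprobs:
        raise ValueError("log-prob lists must be non-empty and the same length")
    rewards = [-beta * (p - s) for p, s in zip(policy_logprobs, sft_logprobs)]
    rewards[-1] += reward_score
    return rewards

# Hypothetical log-probs for a 3-token response.
policy_lp = [-0.5, -1.0, -0.2]
sft_lp = [-0.7, -1.0, -0.9]
rewards = per_token_rewards(2.0, policy_lp, sft_lp, beta=0.1)
total = sum(rewards)
```

Summing the per-token rewards gives the same value as the sequence-level ppo_reward formula above, so the two views differ only in how credit is distributed across tokens.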

Results and Impact

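Labelers preferred outputs from the 1.3B InstructGPT model over outputs from the 175B GPT-3, despite the 100x difference in parameter count. The 175B InstructGPT model was preferred over 175B GPT-3 in about 85% of comparisons. On TruthfulQA, InstructGPT produced truthful and informative answers roughly twice as often as GPT-3. The paper also reported fewer toxic outputs when the model was prompted to be respectful, and found that mixing pretraining gradients into the PPO updates kept regressions on public NLP benchmarks small. That cost of alignment on other capabilities is what the paper calls the alignment tax.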

How Claude and ChatGPT Use This

Both ChatGPT and Claude use RLHF variants as core alignment techniques. Anthropic extends it with Constitutional AI, a form of reinforcement learning from AI feedback (RLAIF) that supplements human feedback with AI-generated critiques and preference labels. The InstructGPT recipe is now a common starting point for aligning production LLMs.
