The RLHF Scaling Problem
Producing high-quality human preference data for RLHF is expensive and slow. Human annotators must compare tens to hundreds of thousands of response pairs, and their judgments can be inconsistent and biased. Making models safer through RLHF alone therefore means scaling human annotation, which becomes increasingly costly. Anthropic's Constitutional AI paper (arXiv:2212.08073) proposes replacing much of the human feedback on harmlessness with AI-generated feedback guided by a written constitution.
The Constitution
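The constitution is a short list of natural-language principles, roughly 16 per training stage in the original paper, that tell the model what a better response looks like, chiefly with respect to harmlessness. The paper's principles were written by the researchers themselves; the constitution Anthropic later published for Claude additionally draws on external sources such as the UN Universal Declaration of Human Rights, alongside principles that encourage nuanced handling of edge cases. Example principles: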
- "Choose the response that is least likely to contain harmful or unethical content"
- "Which of these responses provides a more accurate and truthful response, even if it's not what the user wants to hear?"
- "Prefer responses that avoid unnecessarily assuming or citing potential bad intent on the part of the person"
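One simple way to hold a constitution in code is as a plain list of strings from which a principle is drawn at random each time one is needed. The sketch below assumes that representation. The names `CONSTITUTION` and `sample_principle` are illustrative, not taken from the paper, and the list contains only the three example principles quoted above rather than a full constitution.

```python
import random

# Illustrative only: the three example principles quoted above,
# not the full constitution.
CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content",
    "Which of these responses provides a more accurate and truthful response, "
    "even if it's not what the user wants to hear?",
    "Prefer responses that avoid unnecessarily assuming or citing potential "
    "bad intent on the part of the person",
]


def sample_principle(rng: random.Random) -> str:
    """Draw one principle uniformly at random (hypothetical helper)."""
    return rng.choice(CONSTITUTION)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    for _ in range(3):
        print(sample_principle(rng))
```

Passing an explicit `random.Random` instance instead of using the module-level functions keeps the sampling reproducible, which helps when comparing runs.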
Two-Stage Constitutional AI
Stage 1: Supervised Learning from AI Revision
- Start with a helpful-only model (no harmlessness training)
- Sample responses to potentially harmful prompts
- Ask the model to critique its own response against a randomly sampled constitutional principle
- Ask the model to revise the response to address the critique
- Fine-tune on the revised responses
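This yields a dataset of (prompt, revised response) pairs in which harmful content has been reduced without any human-written harmlessness labels. In the paper, the critique and revision steps can be repeated several times, drawing a new principle each round.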
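The Stage 1 steps can be sketched as a loop that turns a list of prompts into fine-tuning pairs. This is a minimal sketch under stated assumptions: the model is represented by any function from text to text, the prompt wording is invented for illustration, and `build_sl_dataset` and `echo_model` are hypothetical names. The toy `echo_model` exists only so the example runs without a real language model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: any function from prompt text
# to generated text. A real system would call an actual model here.
Generate = Callable[[str], str]


def build_sl_dataset(
    generate: Generate,
    prompts: List[str],
    principles: List[str],
    n_revisions: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, revised_response) pairs for supervised fine-tuning."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        # Initial response from the helpful-only model.
        response = generate(f"Human: {prompt}\n\nAssistant:")
        for _ in range(n_revisions):
            # A fresh principle is sampled for every critique-revision round.
            principle = rng.choice(principles)
            critique = generate(
                f"Human: {prompt}\n\nAssistant: {response}\n\n"
                f"Critique the response using this principle: {principle}\n\n"
                "Critique:"
            )
            response = generate(
                f"Human: {prompt}\n\nAssistant: {response}\n\n"
                f"Critique: {critique}\n\n"
                "Rewrite the response to address the critique.\n\n"
                "Revised response:"
            )
        dataset.append((prompt, response))
    return dataset


def echo_model(text: str) -> str:
    """Toy model used only to make the sketch runnable."""
    if text.endswith("Revised response:"):
        return "[revised]"
    if text.endswith("Critique:"):
        return "[critique]"
    return "[draft]"


if __name__ == "__main__":
    pairs = build_sl_dataset(echo_model, ["How do I pick a lock?"], ["Be harmless."])
    print(pairs)
```

Only the final revision is kept for each prompt; the intermediate critiques are discarded, so the fine-tuned model learns to produce the revised answer directly.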
Stage 2: RLAIF (RL from AI Feedback)
- Generate pairs of responses for each prompt
- Ask the AI (guided by constitutional principles) which response is better and why
- Use these AI-generated preference labels to train a preference model
- Fine-tune the policy with PPO against the preference model
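RLAIF thus replaces the human labeler in RLHF with an AI that makes judgments according to written principles. In the paper this substitution applies to the harmlessness labels; helpfulness labels still come from humans.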
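To make the labeling and preference-model steps of Stage 2 concrete, the sketch below shows one plausible way to turn an AI judgment into a training signal. It assumes the feedback model exposes a log-probability for each of two answer options, that these are normalized into a soft label, and that the preference model is trained with a Bradley-Terry style cross-entropy loss. The function names are hypothetical and the numbers in the usage section are made up.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_preference_label(logprob_a: float, logprob_b: float) -> float:
    """Normalize two option log-probabilities into P(A is preferred)."""
    # A two-way softmax equals a sigmoid of the log-probability difference.
    return math.exp(log_sigmoid(logprob_a - logprob_b))


def preference_loss(score_a: float, score_b: float, target_a: float) -> float:
    """Cross-entropy between a soft target and a Bradley-Terry prediction.

    score_a and score_b are the scalar scores a preference model assigns to
    the two responses; the predicted P(A preferred) is sigmoid(score_a - score_b).
    """
    diff = score_a - score_b
    return -(target_a * log_sigmoid(diff) + (1.0 - target_a) * log_sigmoid(-diff))


if __name__ == "__main__":
    # Made-up log-probabilities for options (A) and (B) from the feedback model.
    target = soft_preference_label(-0.2, -1.8)
    print("soft label:", target)
    print("loss when scores agree:   ", preference_loss(2.0, 0.0, target))
    print("loss when scores disagree:", preference_loss(0.0, 2.0, target))
```

Using soft labels rather than hard 0/1 choices lets the preference model inherit the feedback model's uncertainty on close calls.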
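# Simplified Constitutional AI critique-revision loop.
# `model` is assumed to be any object with a generate(text) -> str method;
# this is a placeholder interface, not a specific library API.
def constitutional_revision(model, prompt, response, principle):
    # Step 1: ask the model to critique its own response against one principle.
    critique_prompt = f"""
Human: {prompt}
Assistant: {response}
Please critique the above response using the following principle:
{principle}
Critique: """
    critique = model.generate(critique_prompt)
    # Step 2: append the critique to the context and ask for a revision.
    revision_prompt = f"""
{critique_prompt}{critique}
Please revise the original response to address this critique:
Revised response: """
    return model.generate(revision_prompt)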
Results: Less Harmful, More Helpful
The key finding is that CAI-trained models produce fewer harmful outputs while giving up less helpfulness than models trained with RLHF on human harmlessness labels. The paper also reports that these models are less evasive: rather than refusing outright, they tend to engage with a sensitive request and explain their objections, which keeps them useful on ambiguous edge cases where RLHF-only models tend to over-refuse.
RLAIF vs RLHF at Scale
RLAIF scales better than RLHF with human feedback because AI-generated labels are far cheaper and faster to produce than human annotation. In principle, the AI annotator can also be prompted with different constitutional principles for different domains (medical, legal, creative), giving targeted alignment without recruiting domain-specific human annotators.
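As a rough illustration of domain-specific prompting, the sketch below selects principles by domain and formats a two-option comparison query for the AI annotator. Everything here is an assumption made for the example: the domain principles are invented, the prompt layout is only one possible multiple-choice format, and `principles_for` and `comparison_prompt` are hypothetical helpers.

```python
from typing import Dict, List

# Hypothetical principles, written for this example only.
DOMAIN_PRINCIPLES: Dict[str, List[str]] = {
    "medical": [
        "Choose the response that gives safer medical information and "
        "recommends professional care where appropriate.",
    ],
    "legal": [
        "Choose the response that avoids presenting itself as formal legal advice.",
    ],
}
GENERAL_PRINCIPLES: List[str] = [
    "Choose the response that is least likely to contain harmful or unethical content",
]


def principles_for(domain: str) -> List[str]:
    """Domain-specific principles first, then the general ones."""
    return DOMAIN_PRINCIPLES.get(domain, []) + GENERAL_PRINCIPLES


def comparison_prompt(principle: str, conversation: str, a: str, b: str) -> str:
    """Format one multiple-choice query for the AI annotator."""
    return (
        f"Consider the following conversation:\n{conversation}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {a}\n(B) {b}\n"
        "The answer is:"
    )


if __name__ == "__main__":
    principle = principles_for("medical")[0]
    print(comparison_prompt(principle, "Human: Is this dose safe?", "Yes.", "Ask a pharmacist."))
```

Unknown domains fall back to the general principles, so adding a new domain only requires a new dictionary entry.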
The HH-RLHF Dataset
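Anthropic has released the Helpful and Harmless (HH-RLHF) dataset that underlies this line of research: about 160k human preference comparisons and roughly 40k red-teaming conversations. The data was collected for Anthropic's earlier RLHF and red-teaming work, which the Constitutional AI paper builds on, and it has become a widely used resource for alignment research.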
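For readers who want to work with the preference data, the sketch below parses one record into speaker turns. It assumes the layout used by the public release, where each record has a `chosen` and a `rejected` transcript and turns are introduced by `\n\nHuman: ` and `\n\nAssistant: `. The sample record is made up for illustration, and `split_turns` and `final_responses` are hypothetical helpers. With the Hugging Face `datasets` library installed, real records can be obtained with `load_dataset("Anthropic/hh-rlhf")`, which is not called here so that the example runs offline.

```python
import re
from typing import Dict, List, Tuple

# Turn markers as they appear in the transcripts (assumed format, see above).
TURN_PATTERN = re.compile(r"\n\n(Human|Assistant): ")


def split_turns(transcript: str) -> List[Tuple[str, str]]:
    """Split a transcript into (speaker, text) turns."""
    # re.split with one capture group yields
    # [prefix, speaker1, text1, speaker2, text2, ...].
    parts = TURN_PATTERN.split(transcript)
    turns = []
    for i in range(1, len(parts) - 1, 2):
        turns.append((parts[i], parts[i + 1].strip()))
    return turns


def final_responses(record: Dict[str, str]) -> Tuple[str, str]:
    """Return the last assistant turn of the chosen and rejected transcripts."""
    chosen = split_turns(record["chosen"])
    rejected = split_turns(record["rejected"])
    return chosen[-1][1], rejected[-1][1]


if __name__ == "__main__":
    # Made-up record in the assumed format; not taken from the dataset.
    sample = {
        "chosen": "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
        "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: I don't know.",
    }
    print(split_turns(sample["chosen"]))
    print(final_responses(sample))
```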