Alignment is the set of techniques used to train language models to be helpful, harmless, and honest. As a developer building on top of LLMs, you need to understand alignment not to build it yourself, but to know what it does and does not protect you against. Model safety features reduce risk but do not eliminate it. You still need application-level guardrails.
What Alignment Means
A raw language model trained only to predict the next token will reflect whatever patterns exist in its training data. It will reproduce harmful content if that content appears in training data. It will confidently state false information. It will provide detailed instructions for dangerous activities because such instructions exist on the internet.
Alignment techniques are applied after pretraining to shape model behavior toward being useful and avoiding harm. The goal is a model that helps users accomplish legitimate tasks, declines to help with genuinely harmful requests, and communicates uncertainty rather than fabricating information.
The three main alignment objectives in practice:
- Helpful: The model should actually assist users. A model that refuses everything is safe but useless. Calibrating helpfulness vs harm avoidance is difficult.
- Harmless: The model should not produce content that causes real-world harm — dangerous instructions, violent content, content that enables illegal activity.
- Honest: The model should express uncertainty, not fabricate facts, and not deceive users.
RLHF: Reinforcement Learning from Human Feedback
RLHF is the dominant alignment technique used by OpenAI, Anthropic, Google, and most frontier lab models. The process works in three stages.
Stage 1: Supervised Fine-Tuning. Human trainers write example conversations showing ideal model behavior — good responses to difficult questions, appropriate refusals, honest answers to ambiguous questions. The pretrained model is fine-tuned on these examples.
Stage 2: Reward Model Training. The model generates multiple responses to the same prompt. Human raters rank the responses from best to worst. A separate "reward model" is trained to predict these human preference rankings. This reward model learns to score responses the way human evaluators would.
Stage 3: RL Optimization. The language model is optimized using reinforcement learning to produce outputs that score highly according to the reward model. The model learns to generate responses that humans prefer.
RLHF is effective but imperfect. The reward model is a proxy for human preferences, not a direct measurement. The model learns to maximize reward model scores, which can diverge from actual human preference through a phenomenon called "reward hacking." The human raters have their own biases and inconsistencies that get baked into the reward model.
Constitutional AI: Anthropic's Approach
Anthropic developed Constitutional AI (CAI) as a complement to RLHF. The core idea is to give the model a set of principles (a "constitution") and train it to critique and revise its own outputs against those principles.
Critique stage: The model generates an initial response. It then evaluates that response against a principle from the constitution — for example, "Does this response help the user do something harmful?" If the response violates the principle, the model identifies how.
Revision stage: The model revises its response to better align with the principle it violated.
This process generates a large dataset of critiqued and revised responses. That dataset is then used for supervised fine-tuning. The result is a model that has internalized the principles through self-critique rather than only through human feedback.
CAI reduces the burden on human feedback for harmful content evaluation, since the model can handle a large portion of its own alignment training. Anthropic's Claude models use CAI in addition to RLHF.
Why Models Still Fail Despite Alignment
Alignment significantly reduces harmful outputs but does not eliminate them. Here is why.
Jailbreaks. Users have found that framing harmful requests in certain ways bypasses refusal training. Role-playing scenarios, hypothetical framings, and gradual escalation can cause models to produce content they would refuse if asked directly. Labs patch known jailbreaks, but new ones keep appearing.
Prompt injection. If your application incorporates user-provided text into prompts — for example, summarizing a webpage — a malicious webpage can include instructions that override your system prompt. This is called prompt injection. The model may follow the injected instructions, not your intended ones.
Adversarial inputs. Carefully crafted inputs can push models toward outputs that alignment training did not anticipate. This is an active research area. Adversarial robustness for LLMs is not solved.
Training data edge cases. Alignment training covers the content categories the trainers thought of. Novel requests that fall between categories may get inconsistent treatment. A request that is tangentially related to something harmful may trigger refusal (over-refusal) or may slip through (under-refusal) depending on how the training examples were distributed.
Hallucination. Alignment training directly targets honesty, but hallucination is not fully solved. Models still confidently assert false information. The "harmless" objective focuses more on dangerous content than on factual accuracy.
Safety Features in Production Models
Modern production models include several safety layers beyond the core alignment training.
System prompt protections. You can instruct the model to stay within a defined scope in your system prompt. Well-aligned models respect these constraints and will tell users they cannot help with out-of-scope requests.
Content filtering. API providers run outputs through classifiers before returning them, catching some harmful content that slips through alignment training.
Refusal training. Models are explicitly trained to refuse specific categories of requests — instructions for weapons, sexual content involving minors, certain types of personal attacks.
Moderation APIs. OpenAI's Moderation API and similar tools let you check inputs and outputs against a separate safety classifier. This is separate from the model itself.
What You Should Do as a Developer
Do not rely on model safety as your only safeguard. Here is what application-level defense looks like.
Validate outputs before using them. If the model is generating structured data (JSON, SQL, code), parse and validate it before execution. Never execute LLM output directly.
Run outputs through a moderation layer. Use a moderation API or a separate safety classifier on model outputs before displaying them to users.
Scope your system prompt tightly. Tell the model exactly what it is and is not supposed to do. A narrow scope reduces the surface area for unexpected behavior.
Log and monitor. Keep logs of inputs and outputs. Review them for patterns of unexpected behavior. Set up alerts for high refusal rates (may indicate jailbreak attempts) or high user complaints.
Build application-level authorization. Do not let the model be the only thing standing between a user and a sensitive action. If the model is supposed to only allow certain users to do certain things, enforce that in application code, not just in the prompt.
Test adversarially. Before deploying an LLM feature, try to get it to misbehave. Write prompts designed to bypass your intended behavior. What you find in testing is better than what users find in production.
Model alignment is a meaningful safety layer. It catches a large percentage of harmful requests and shapes model behavior toward helpfulness and honesty. But it is not a complete solution. Treat aligned models the way you treat other software dependencies — useful, tested by the vendor, but not a substitute for your own validation and security practices.
Keep Reading
- How Large Language Models Work — the technical foundation that makes alignment necessary
- LLM Privacy for Enterprise — handling sensitive data alongside safety considerations
- LLM for Business Decision Making — practical guidance on where LLMs are and are not reliable
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.