A/B testing LLM outputs in production is the final validation step before fully committing to a new model version or prompt change. It routes a percentage of real traffic to the variant, measures business impact rather than output quality alone, and requires statistical significance before you ship. Done correctly, it catches cases where offline eval results did not translate to real-world improvement, which happens more often than most teams expect.
Why Offline Evals Are Not Enough
Offline evaluation on a golden dataset is essential, but it answers a different question than a production A/B test. An offline eval asks: "On my labeled test cases, does version B perform better than version A?" A production A/B test asks: "When real users interact with version B, do they achieve their goals more often?"
The gap between these two questions is significant. A prompt change that improves your eval score by 8% might have zero effect on user task completion because the test cases you labeled do not represent the distribution of real user queries. A model switch that appears to improve output quality might actually increase user frustration because the new model's output style is different and users need to adapt.
Production A/B tests close this gap.
Setting Up the Test
A basic LLM A/B test has three components: traffic splitting, metric tracking, and statistical analysis.
Traffic splitting. Route a percentage of requests to the variant. Common variant/control splits are 10/90 (cautious, low risk), 20/80 (balanced), and 50/50 (fastest to significance). For low-traffic applications, use 50/50 to reach significance faster.
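function getLLMVariant(userId: string): "control" | "variant" {
  // Deterministic assignment based on user ID.
  // cyrb53 is a non-cryptographic string hash, assumed to be defined elsewhere.
  const bucket = cyrb53(userId) % 100; // bucket in [0, 99]
  // Buckets 0-9 receive the variant: a 10/90 split.
  return bucket < 10 ? "variant" : "control";
}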
Use a deterministic assignment function (hash of user ID, not random per request) so the same user always gets the same variant. Mixing variants mid-session creates confounding effects.
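For readers who want something they can run without an external hash helper, here is a minimal sketch of the same idea. It swaps in a 32-bit FNV-1a hash as a stand-in for cyrb53, and the names `fnv1a`, `assignVariant`, `observedVariantShare`, and the `variantPercent` parameter are illustrative assumptions, not part of any library.

```typescript
// FNV-1a (32-bit): a simple non-cryptographic string hash, used here as a
// stand-in for cyrb53. It hashes UTF-16 code units, which is fine for
// typical ASCII user IDs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Equivalent to multiplying by the FNV prime 16777619 modulo 2^32.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash >>> 0;
}

// variantPercent is a hypothetical parameter: the share of users (0-100)
// routed to the variant.
function assignVariant(userId: string, variantPercent: number): "control" | "variant" {
  const bucket = fnv1a(userId) % 100; // stable bucket in [0, 99]
  return bucket < variantPercent ? "variant" : "control";
}

// Sanity check: fraction of synthetic user IDs that land in the variant.
function observedVariantShare(userCount: number, variantPercent: number): number {
  let inVariant = 0;
  for (let i = 0; i < userCount; i++) {
    if (assignVariant(`user-${i}`, variantPercent) === "variant") inVariant++;
  }
  return inVariant / userCount;
}
```

Because the bucket depends only on the user ID, repeated calls for the same user return the same variant. Running `observedVariantShare` on your own ID format before launch is a cheap way to confirm the split lands near the target.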
Metric tracking. Log both the variant assignment and the downstream outcome for every interaction. Downstream outcomes are the business metrics that matter:
- Task completion rate (did the user complete the intended action after the LLM interaction?)
- User satisfaction (thumbs up/down rating, if you surface one)
- Session continuation (did the user continue using the feature or abandon it?)
- Error rate (did the LLM output cause a downstream error or user correction?)
Track output quality metrics too (via LM-as-judge sampling), but make business metrics the decision criteria.
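As a rough illustration of the logging described above, the sketch below records one event per interaction and rolls the events up per variant. The `InteractionEvent` shape and its field names are assumptions for the example; adapt them to whatever your analytics pipeline already stores.

```typescript
type Variant = "control" | "variant";

// Hypothetical event shape: one record per LLM interaction.
interface InteractionEvent {
  userId: string;
  variant: Variant;          // which arm served this interaction
  taskCompleted: boolean;    // primary business outcome
  thumbs?: "up" | "down";    // optional explicit feedback
  causedError: boolean;      // downstream error or user correction
}

interface VariantSummary {
  interactions: number;
  completionRate: number;
  errorRate: number;
}

// Aggregate logged events into per-variant rates.
function summarize(events: InteractionEvent[]): Record<Variant, VariantSummary> {
  const result = {} as Record<Variant, VariantSummary>;
  for (const variant of ["control", "variant"] as const) {
    const rows = events.filter((e) => e.variant === variant);
    const n = rows.length;
    result[variant] = {
      interactions: n,
      completionRate: n === 0 ? 0 : rows.filter((e) => e.taskCompleted).length / n,
      errorRate: n === 0 ? 0 : rows.filter((e) => e.causedError).length / n,
    };
  }
  return result;
}
```

Storing the variant on every event, rather than recomputing it at analysis time, keeps the analysis correct even if the assignment logic changes later.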
Statistical analysis. Run a two-proportion z-test to compare task completion rates between control and variant. Calculate whether the difference is statistically significant before declaring a winner.
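A minimal sketch of that test follows, assuming you already have completion counts per arm. It uses the pooled standard error and a two-sided p-value; the normal CDF comes from the Abramowitz-Stegun approximation of erf rather than a statistics library, and all function and field names are illustrative.

```typescript
// Standard normal CDF via the Abramowitz-Stegun erf approximation
// (formula 7.1.26), accurate to about 1e-7.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

interface ZTestResult {
  zScore: number;
  pValue: number;       // two-sided
  significant: boolean; // pValue < alpha
}

// Two-proportion z-test with a pooled standard error.
// Assumes both totals are positive and the pooled rate is strictly between 0 and 1.
function twoProportionZTest(
  controlSuccesses: number,
  controlTotal: number,
  variantSuccesses: number,
  variantTotal: number,
  alpha: number = 0.05,
): ZTestResult {
  const p1 = controlSuccesses / controlTotal;
  const p2 = variantSuccesses / variantTotal;
  const pooled = (controlSuccesses + variantSuccesses) / (controlTotal + variantTotal);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlTotal + 1 / variantTotal));
  const zScore = (p2 - p1) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(zScore)));
  return { zScore, pValue, significant: pValue < alpha };
}
```

For example, `twoProportionZTest(650, 1000, 700, 1000)` compares a 65% control rate with a 70% variant rate at 1,000 users per arm, while the same rates at 100 users per arm do not reach significance.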
Sample Size Calculation
One of the most common A/B testing mistakes is stopping too early. If you look at results after 100 users and the variant appears to be winning by 12%, that is probably noise, not signal.
The correct approach is to calculate your required sample size before running the test, using these inputs:
- Baseline rate: Your current task completion rate (e.g., 0.65 = 65%)
- Minimum detectable effect: The smallest improvement worth caring about (e.g., 0.05 = 5% absolute improvement)
- Statistical power: Probability of detecting the effect if it exists (0.80 is standard)
- Significance level: P-value threshold for declaring significance (0.05 is standard)
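A rough formula: n ≈ 16 × sigma² / delta², where n is the number of users needed per variant, sigma² is the variance of the metric (p × (1 − p) for a completion rate p), and delta is the minimum detectable effect. The constant 16 corresponds to 80% power at a 0.05 significance level.
For a baseline rate of 65% and a minimum detectable effect of 5 percentage points, the formula gives 16 × 0.2275 / 0.0025 ≈ 1,456, so you need roughly 1,400-1,500 users per variant before you can trust the results. At a 10/90 traffic split with 100 users per day, the variant group accumulates only 10 users per day, so you need roughly 145 days to reach that sample size. This is why 50/50 splits are often better for faster iteration.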
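The arithmetic can be checked with a short sketch. It implements the rule of thumb n ≈ 16 × sigma² / delta² alongside the fuller two-proportion formula it approximates; the z-values 1.96 and 0.8416 are the standard constants for a two-sided 0.05 significance level and 0.80 power, and the function names are assumptions for illustration.

```typescript
// Rule of thumb: n ≈ 16 × sigma² / delta² per variant,
// with sigma² = p(1 - p) for a completion rate p.
function ruleOfThumbSampleSize(baselineRate: number, minDetectableEffect: number): number {
  const variance = baselineRate * (1 - baselineRate);
  return Math.ceil((16 * variance) / (minDetectableEffect * minDetectableEffect));
}

// Fuller formula: n = (zAlpha + zBeta)² × (p1(1 - p1) + p2(1 - p2)) / delta².
function sampleSizePerVariant(
  baselineRate: number,
  minDetectableEffect: number,
  zAlpha: number = 1.96,   // two-sided alpha = 0.05
  zBeta: number = 0.8416,  // power = 0.80
): number {
  const p1 = baselineRate;
  const p2 = baselineRate + minDetectableEffect;
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  const zSum = zAlpha + zBeta;
  return Math.ceil((zSum * zSum * varianceSum) / (minDetectableEffect * minDetectableEffect));
}

// Days needed for the smaller arm to accumulate the required users.
// variantShare is the fraction of traffic sent to the variant (e.g. 0.1 or 0.5).
function daysToReachSampleSize(perVariant: number, dailyUsers: number, variantShare: number): number {
  const smallerArmShare = Math.min(variantShare, 1 - variantShare);
  return Math.ceil(perVariant / (dailyUsers * smallerArmShare));
}
```

With a baseline of 0.65 and an effect of 0.05, the rule of thumb evaluates to roughly 1,456 users per variant and the fuller formula to roughly 1,374. At 100 users per day, that is about 146 days on a 10/90 split versus about 30 days on a 50/50 split.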
What to Measure
Priority order for LLM A/B test metrics:
- Primary business metric: Whichever downstream action indicates the LLM is doing its job. For a support bot: issue resolution rate. For a code assistant: accepted suggestion rate. For a document generator: approval rate without edits.
- User feedback signals: Thumbs up/down if you surface them. These are noisy but fast to accumulate.
- LM-as-judge quality sample: Run judge scoring on a random 5% sample from each variant. This validates that the business metric improvement is driven by quality rather than some confounding factor.
- Cost and latency: The variant should not win on quality while losing on cost or latency. Track all three.
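The last two items can be sketched as follows: a random sampler that picks interactions for judge scoring, and a per-variant rollup of cost and latency. The `Interaction` fields (`costUsd`, `latencyMs`) are hypothetical names, and the injectable `random` parameter exists only to make the sampler testable.

```typescript
type Variant = "control" | "variant";

interface Interaction {
  variant: Variant;
  costUsd: number;   // hypothetical per-request cost field
  latencyMs: number; // hypothetical end-to-end latency field
}

// Select roughly sampleRate of interactions for judge scoring. Applying the
// same rate to every interaction samples both variants at the same rate.
function sampleForJudge(
  interactions: Interaction[],
  sampleRate: number,
  random: () => number = Math.random,
): Interaction[] {
  return interactions.filter(() => random() < sampleRate);
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Mean cost and latency for one variant, to weigh against any quality win.
function costAndLatency(
  interactions: Interaction[],
  variant: Variant,
): { meanCostUsd: number; meanLatencyMs: number } {
  const rows = interactions.filter((i) => i.variant === variant);
  return {
    meanCostUsd: mean(rows.map((r) => r.costUsd)),
    meanLatencyMs: mean(rows.map((r) => r.latencyMs)),
  };
}
```

Calling `sampleForJudge(interactions, 0.05)` gives the 5% sample; means are a starting point, and latency in particular is often better compared at a high percentile.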
Common Mistakes That Invalidate Results
Stopping too early. Checking results before you have reached the precomputed sample size and calling a winner as soon as the difference looks significant. Repeated peeking inflates the false-positive rate. This is the most common mistake.
Measuring the wrong metric. Measuring "user clicked thumbs up" instead of "user completed their task." Click metrics are easier to game with more engaging but less useful outputs.
Not accounting for novelty effect. When users encounter a new model style, their engagement often spikes briefly before returning to baseline. Wait at least two weeks after any novelty spike subsides before analyzing results.
Running concurrent experiments. If you are simultaneously testing a prompt change and a UI change, you cannot attribute result differences to either one. Run one experiment at a time on any given user cohort.
Not checking for segment effects. An average improvement might mask a degradation for an important subgroup. Analyze results by user segment (power users vs. new users, mobile vs. desktop) before declaring a winner.
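A segment breakdown can be sketched in a few lines. The `Outcome` shape and the `segment` labels are assumptions for the example; the function reports the completion rate per arm and the absolute lift within each segment.

```typescript
type Variant = "control" | "variant";

// Hypothetical outcome record with a segment label attached.
interface Outcome {
  variant: Variant;
  segment: string; // e.g. "power", "new", "mobile", "desktop"
  taskCompleted: boolean;
}

interface SegmentLift {
  segment: string;
  controlRate: number;
  variantRate: number;
  lift: number; // variantRate - controlRate (absolute difference)
}

function completionRate(rows: Outcome[]): number {
  return rows.length === 0 ? 0 : rows.filter((r) => r.taskCompleted).length / rows.length;
}

// Completion rate per arm within each segment, in first-seen segment order.
function liftBySegment(outcomes: Outcome[]): SegmentLift[] {
  const segments: string[] = [];
  for (const o of outcomes) {
    if (segments.indexOf(o.segment) === -1) segments.push(o.segment);
  }
  return segments.map((segment) => {
    const inSegment = outcomes.filter((o) => o.segment === segment);
    const controlRate = completionRate(inSegment.filter((o) => o.variant === "control"));
    const variantRate = completionRate(inSegment.filter((o) => o.variant === "variant"));
    return { segment, controlRate, variantRate, lift: variantRate - controlRate };
  });
}
```

A negative `lift` in one segment alongside a positive overall result is the pattern to look for. Segments are smaller than the full cohort, so treat per-segment differences as prompts for investigation unless each segment has enough users on its own.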
When A/B Testing Is Overkill
Not every LLM change needs a full production A/B test. If you are fixing a bug in your prompt that was causing obvious failures, ship it. If you are changing a model because your current one is being deprecated, ship it.
A/B tests are worth the investment when:
- The change affects a feature with significant user volume
- You are switching to a more expensive model (need to verify the business case)
- Your offline eval showed mixed results (some metrics up, some down)
- The change affects a metric that directly impacts revenue