A/B testing an ML model is not the same as A/B testing a button color. The model's behavior changes as it receives more data. Effects can compound over time. The right metric to measure might not manifest for days or weeks. Running a standard two-week A/B test and calling it done will give you misleading results. This guide explains what makes ML A/B tests different and how to run them correctly.
How ML A/B Tests Differ From Standard Software Tests
Standard software A/B tests compare two deterministic variants. Variant A always shows a red button. Variant B always shows a green button. The response to each variant is immediate and stable.
ML model A/B tests have four complications that standard tests do not:
Temporal effects — ML model impact often compounds over time. A recommendation model that starts users on a better content path increases engagement not just on day 1 but on day 30 when those users have deeper engagement patterns. A short test will underestimate the model's long-term impact.
Model improvement during the test — if both model versions are being retrained during the test period, the new model may improve faster (because it has a better architecture or training procedure), making the comparison a moving target. Consider freezing both model versions during the test to ensure a fair comparison.
Interference between users — on platforms where users interact with each other (social networks, marketplaces), treatment and control users are not independent. If the new recommendation model shows user A more posts from user B, user B's engagement metrics change even if user B is in the control group. This violates the independence assumption of standard A/B tests.
Metric lag — the ground truth outcome you care about (subscription renewal, 30-day retention, purchase made) happens days or weeks after the model prediction. The feedback loop is much longer than in most software tests.
Setting Up the Test
Define your north star metric — one primary metric that the test will be decided on. Secondary metrics (guardrail metrics) should be monitored to catch regressions but should not determine the winner. Common ML north star metrics: revenue per user, day-30 retention, click-through rate on the primary recommended content, conversion rate.
Set your sample size before starting — use a power analysis to determine the minimum sample size needed to detect your expected effect size. If you expect a 2% relative improvement in your metric, how many users do you need to detect this with 80% statistical power and 5% significance level? Most online power calculators will give you this. Run the test until you reach this sample size — stopping early because results "look significant" inflates the false positive rate (p-hacking).
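If you would rather compute the number yourself than rely on a calculator, the standard normal-approximation formula for comparing two proportions is short. The sketch below assumes a conversion-style metric, a two-sided test, and equal traffic per variant; the function name and the 10% baseline rate are illustrative, not taken from any particular tool.

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)   # rate expected under the new model
    z_alpha = norm.ppf(1 - alpha / 2)          # critical value for the significance level
    z_power = norm.ppf(power)                  # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # sum of the two Bernoulli variances
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Hypothetical scenario: 10% baseline conversion, 2% relative improvement
n_small_lift = sample_size_per_group(baseline_rate=0.10, relative_lift=0.02)
n_large_lift = sample_size_per_group(baseline_rate=0.10, relative_lift=0.10)
print(n_small_lift, n_large_lift)
```

With these assumed inputs, the 2% relative lift needs on the order of 350,000 users per variant, which is why small expected effects translate into long tests.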
Run for full business cycles — a business cycle is one full week (or one full month for monthly metrics). User behavior varies by day of week — Monday users behave differently from Saturday users. A test that runs Tuesday-Friday will show different results than one that runs a full week. Run for at least 2 full business cycles.
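Combining the sample size requirement with the full-cycle rule is simple arithmetic. This is a minimal sketch that assumes each user enters the test once and that `eligible_users_per_day` counts newly exposed users; the function name and the traffic figures are hypothetical.

```python
import math

def planned_duration_days(required_per_variant, n_variants, eligible_users_per_day,
                          cycle_days=7, min_cycles=2):
    """Days to run: enough to reach the sample size, rounded up to whole business cycles."""
    total_needed = required_per_variant * n_variants
    days_for_sample = math.ceil(total_needed / eligible_users_per_day)
    cycles = max(min_cycles, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days

# Hypothetical: 356,000 users per variant, two variants, 40,000 new users per day
print(planned_duration_days(356_000, 2, 40_000))
```

Here the sample size is reached on day 18, so the test is extended to 21 days to cover three complete weeks.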
Pre-register your hypothesis — write down before the test starts: what the primary metric is, what effect size you expect, your significance threshold, and your sample size. This prevents post-hoc rationalization of results.
The Novelty Effect
When users encounter something new, they behave differently than they will once the novelty wears off. A new recommendation algorithm might drive higher click-through rates in week 1 simply because the recommendations look different and users explore them. By week 3, the novelty has worn off and behavior reflects the algorithm's actual quality.
Wait 2-3 weeks after the test starts before reading results. For high-novelty features (significant UI changes, new content types, new interaction patterns), the novelty effect can persist for 4-6 weeks.
One diagnostic: plot your primary metric over time for both variants. If the new model's metric starts high, declines, and then stabilizes above the control, the stabilized level is the real effect. If it declines to the control level, the initial lift was entirely novelty.
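The same diagnostic can be reduced to two numbers: the average lift in the first week and the average lift in the last week. The sketch below assumes you already have one metric value per day for each variant; the function name and the synthetic series are illustrative only.

```python
import numpy as np

def novelty_check(control_daily, treatment_daily, window=7):
    """Compare relative lift in the first and last `window` days of the test."""
    control = np.asarray(control_daily, dtype=float)
    treatment = np.asarray(treatment_daily, dtype=float)
    if len(control) != len(treatment) or len(control) < 2 * window:
        raise ValueError("need equal-length series covering at least two windows")
    lift = treatment / control - 1.0          # daily relative lift over control
    early = float(lift[:window].mean())
    late = float(lift[-window:].mean())
    return {"early_lift": early, "late_lift": late, "decay": early - late}

# Synthetic example: a 5% lasting lift plus a novelty bump that fades over time
days = np.arange(21)
control_series = np.full(21, 0.10)
treatment_series = 0.10 * (1 + 0.05 + 0.15 * np.exp(-days / 4))
result = novelty_check(control_series, treatment_series)
print(result)
```

A large positive `decay` with a `late_lift` near zero points to novelty; a `late_lift` that stays clearly positive is closer to the stabilized effect.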
Statistical Analysis
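For continuous metrics (revenue per user, session duration), use a t-test or Z-test. Welch's t-test is the safer default because it does not assume equal variances across variants. Revenue per user is usually heavy-tailed, so confirm the sample is large enough for the normal approximation to be reasonable. For proportions (conversion rate, click-through rate), use a two-proportion Z-test or chi-squared test.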
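Both kinds of test take only a few lines. The sketch below implements a pooled two-proportion Z-test by hand and uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's variant, my choice here) for a continuous metric. The counts and revenue samples are made up for illustration.

```python
import math
import numpy as np
from scipy.stats import norm, ttest_ind

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))              # two-sided p-value
    return z, p_value

# Proportion metric: hypothetical conversion counts per variant
z, p = two_proportion_ztest(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}")

# Continuous metric: synthetic revenue-per-user samples, Welch's t-test
rng = np.random.default_rng(0)
revenue_a = rng.exponential(scale=10.0, size=5000)
revenue_b = rng.exponential(scale=10.5, size=5000)
welch = ttest_ind(revenue_b, revenue_a, equal_var=False)
print(welch.statistic, welch.pvalue)
```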
Correct for multiple comparisons if you are testing multiple metrics. If you test 10 metrics at the 5% significance level and none of them has a real effect, you expect 0.5 false positives on average, and (if the metrics are independent) roughly a 40% chance of at least one. Use a Bonferroni correction (divide the significance threshold by the number of tests) or the Benjamini-Hochberg procedure, which controls the false discovery rate and is less conservative.
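To make the difference between the two corrections concrete, here is a minimal sketch of both, written with NumPy only. The ten p-values are invented for the example; the function names are mine.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i with p_(i) <= alpha * i / m."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                            # indices that sort p ascending
    thresholds = alpha * np.arange(1, m + 1) / m     # per-rank BH thresholds
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))        # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for 10 metrics
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni(p_vals).sum(), benjamini_hochberg(p_vals).sum())
```

On these made-up values Bonferroni keeps one metric and Benjamini-Hochberg keeps two, while an uncorrected 5% threshold would have flagged five.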
Check for heterogeneous treatment effects: does the new model help some user segments (power users, new users, mobile users) and hurt others? A model that looks neutral on average might be meaningfully hurting a valuable segment while helping a less important one. Segment your analysis by user type, device, tenure, and geography. Each segment is another comparison, so treat segment-level findings as exploratory unless the segments were part of your pre-registered plan.
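A per-segment lift table is usually the first thing to look at. The sketch below assumes one row per user in a pandas DataFrame with hypothetical columns `segment`, `variant` (values `control` and `treatment`), and a binary `converted` outcome.

```python
import pandas as pd

def lift_by_segment(df, segment_col="segment", variant_col="variant", metric_col="converted"):
    """Mean metric per segment and variant, plus relative lift of treatment over control."""
    rates = df.groupby([segment_col, variant_col])[metric_col].mean().unstack(variant_col)
    rates["relative_lift"] = rates["treatment"] / rates["control"] - 1.0
    return rates

# Tiny illustrative dataset: the new model helps new users and hurts power users
df = pd.DataFrame({
    "segment":   ["new"] * 8 + ["power"] * 8,
    "variant":   (["control"] * 4 + ["treatment"] * 4) * 2,
    "converted": [1, 1, 0, 0,  1, 1, 1, 0,   1, 1, 1, 0,  1, 1, 0, 0],
})
result = lift_by_segment(df)
print(result)
```

With real data you would also want a confidence interval per segment, since small segments produce noisy lifts.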
Multi-Armed Bandits as an Alternative
Traditional A/B testing treats the exploration (discovering which variant is better) and exploitation (showing users the better variant) phases as strictly separate. The test runs at a fixed traffic split until the planned sample size is reached, then 100% of traffic shifts to the winner.
Multi-armed bandits blur this separation: the algorithm continuously updates the probability of showing each variant based on observed performance, automatically routing more traffic toward the winner while the test is still running.
The Thompson Sampling bandit:
import numpy as np
from scipy.stats import beta

class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        # Beta(1, 1) prior per arm, i.e. uniform over the success rate
        self.successes = np.ones(n_arms)  # Beta prior alpha
        self.failures = np.ones(n_arms)   # Beta prior beta

    def choose_arm(self):
        # Draw one sample from each arm's posterior and play the arm with the largest draw
        samples = [beta.rvs(self.successes[i], self.failures[i]) for i in range(len(self.successes))]
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Binary reward: 1 counts as a success, anything else as a failure
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
Bandits are better than standard A/B tests when the cost of showing a suboptimal variant to users is high and the metric feedback loop is short (you learn quickly which variant is better). They are worse when the feedback loop is long (you cannot adapt quickly) or when you need a clean statistical comparison for decision-making.
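A small simulation shows what "cost of showing a suboptimal variant" means in practice. This sketch compares a fixed 50/50 split with Thompson Sampling on two arms with immediate binary feedback. The true conversion rates (5% and 15%), the round count, and the function name are all invented for the illustration.

```python
import numpy as np

def simulate(true_rates, n_rounds, policy, seed=0):
    """Return (total_reward, pulls_per_arm) for a 'thompson' or 'fixed' allocation policy."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    alpha = np.ones(n_arms)           # Beta posterior: 1 + successes
    beta_param = np.ones(n_arms)      # Beta posterior: 1 + failures
    pulls = np.zeros(n_arms, dtype=int)
    total_reward = 0
    for t in range(n_rounds):
        if policy == "thompson":
            arm = int(np.argmax(rng.beta(alpha, beta_param)))  # sample each posterior
        else:
            arm = t % n_arms                                   # fixed, even split
        reward = int(rng.random() < true_rates[arm])           # immediate binary feedback
        alpha[arm] += reward
        beta_param[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward, pulls

ts_reward, ts_pulls = simulate([0.05, 0.15], 5000, "thompson")
ab_reward, ab_pulls = simulate([0.05, 0.15], 5000, "fixed")
print("thompson:", ts_reward, ts_pulls)
print("fixed:   ", ab_reward, ab_pulls)
```

The bandit sends most traffic to the better arm once the posteriors separate, which is exactly what makes its final comparison harder to interpret as a clean A/B result: the arms end up with very unequal sample sizes.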
Shadow Mode Testing
Shadow mode runs the new model in parallel with the current model, generating predictions without serving them to users. The new model's predictions are logged and compared to the current model's predictions and to actual outcomes, but users only see the current model's output.
Shadow mode is ideal for:
- Validating that the new model's predictions are sensible before any users see them
- Estimating the impact of the new model on offline metrics before an A/B test
- Debugging the new model's behavior on real production inputs (which may differ from your validation set)
- Catching infrastructure issues (latency, error rate) before the model is live
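Implementation: in your prediction API, after calling the current model, asynchronously call the new model and log its prediction to a separate table. The async call should not block the response to the user. If the new model's predict call is itself blocking, run it in a worker thread or a separate service so it does not stall the event loop, and make sure a failure in the shadow path can never surface to the user.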
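import asyncio
import logging

_shadow_tasks = set()  # keep references so pending tasks are not garbage-collected

async def predict_with_shadow(request):
    # Current model: synchronous, blocks response
    current_prediction = current_model.predict(request.features)
    # New model: scheduled in the background, does not block response
    task = asyncio.create_task(shadow_predict_and_log(request, current_prediction))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)
    return current_prediction

async def shadow_predict_and_log(request, current_prediction):
    try:
        # Blocking predict runs in a worker thread (asyncio.to_thread, Python 3.9+)
        new_prediction = await asyncio.to_thread(new_model.predict, request.features)
        await log_shadow_result(request.id, current_prediction, new_prediction)
    except Exception:
        logging.exception("shadow prediction failed")  # never affects the user response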
Run shadow mode for at least one week before the A/B test. Use the shadow results to set expectations for A/B test effect sizes.
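Once outcomes arrive, the shadow log can be summarized into a few numbers. The sketch below assumes a classification-style model and a hypothetical record schema with `current`, `shadow`, and `outcome` keys; adapt the comparison to your own prediction type.

```python
def summarize_shadow_log(records):
    """Agreement rate between the two models and each model's accuracy against outcomes."""
    n = agree = current_correct = shadow_correct = 0
    for r in records:
        n += 1
        agree += r["current"] == r["shadow"]
        current_correct += r["current"] == r["outcome"]
        shadow_correct += r["shadow"] == r["outcome"]
    if n == 0:
        raise ValueError("no shadow records to summarize")
    return {
        "n": n,
        "agreement_rate": agree / n,
        "current_accuracy": current_correct / n,
        "shadow_accuracy": shadow_correct / n,
    }

# Illustrative log rows (made up): predictions from both models plus the observed outcome
log_rows = [
    {"current": 1, "shadow": 1, "outcome": 1},
    {"current": 0, "shadow": 1, "outcome": 1},
    {"current": 0, "shadow": 0, "outcome": 0},
    {"current": 1, "shadow": 0, "outcome": 0},
]
summary = summarize_shadow_log(log_rows)
print(summary)
```

A very high agreement rate suggests the A/B test will need a large sample to detect any difference; a low one is a signal to inspect the disagreements before exposing users.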
Measuring Long-Term Model Impact
The most important model improvements are often those that change long-term user behavior — recommendations that lead to deeper platform engagement, not just higher immediate click-through rates. These effects require longer measurement windows.
For 30-day retention impact: run the A/B test, then track both cohorts for 30 days after the test ends. This requires more patience but captures the real business value of improved recommendation quality.
For irreversible behavioral changes (a model that teaches users good habits vs bad habits), the effect may not fully appear for months. Consider running a holdout group: a small fraction of users who permanently receive the control experience and can be compared against the full rollout group to measure cumulative impact over months.
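A holdout only works if membership is stable, so assignment is usually derived from the user ID rather than stored per session. This is a minimal sketch using a salted SHA-256 hash; the salt string, the 5% fraction, and the function name are assumptions for the example.

```python
import hashlib

def in_holdout(user_id, salt="global-holdout-v1", holdout_fraction=0.05):
    """Deterministically place a stable fraction of users in the long-term holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # map the first 32 bits to [0, 1)
    return bucket < holdout_fraction

# The same user always gets the same answer, so the group stays fixed for months
print(in_holdout("user-12345"), in_holdout("user-12345"))

# Roughly 5% of a hypothetical population lands in the holdout
share = sum(in_holdout(f"user-{i}") for i in range(20_000)) / 20_000
print(share)
```

Changing the salt reshuffles the holdout, so keep it fixed for the lifetime of the measurement.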