A/B Testing ML Models in Production: What's Different and How to Do It Right

Why ML A/B tests differ from standard software tests, novelty effects, multi-armed bandits, shadow mode testing, and how to measure model impact rigorously.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#a/b-testing #ml-deployment #experimentation #multi-armed-bandit #shadow-mode


A/B testing an ML model is not the same as A/B testing a button color. The model's behavior changes as it receives more data. Effects can compound over time. The right metric to measure might not manifest for days or weeks. Running a standard two-week A/B test and calling it done will give you misleading results. This guide explains what makes ML A/B tests different and how to run them correctly.

How ML A/B Tests Differ From Standard Software Tests

Standard software A/B tests compare two deterministic variants. Variant A always shows a red button. Variant B always shows a green button. The response to each variant is immediate and stable.

ML model A/B tests have four complications that standard tests do not:

Temporal effects — ML model impact often compounds over time. A recommendation model that starts users on a better content path increases engagement not just on day 1 but on day 30 when those users have deeper engagement patterns. A short test will underestimate the model's long-term impact.

Model improvement during the test — if both model versions are being retrained during the test period, the new model may improve faster (because it has a better architecture or training procedure), making the comparison a moving target. Consider freezing both model versions during the test to ensure a fair comparison.

Interference between users — on platforms where users interact with each other (social networks, marketplaces), treatment and control users are not independent. If the new recommendation model shows user A more posts from user B, user B's engagement metrics change even if user B is in the control group. This violates the independence assumption of standard A/B tests.

Metric lag — the ground truth outcome you care about (subscription renewal, 30-day retention, purchase made) happens days or weeks after the model prediction. The feedback loop is much longer than in most software tests.

Setting Up the Test

Define your north star metric — one primary metric that the test will be decided on. Secondary metrics (guardrail metrics) should be monitored to catch regressions but should not determine the winner. Common ML north star metrics: revenue per user, day-30 retention, click-through rate on the primary recommended content, conversion rate.

Set your sample size before starting: use a power analysis to determine the minimum sample size needed to detect your expected effect size. If you expect a 2% relative improvement in your metric, how many users do you need to detect it with 80% statistical power at a 5% significance level? The answer also depends on the metric's baseline rate or variance, which most online power calculators ask for. Run the test until you reach this sample size. Stopping early because results "look significant" (often called peeking, a form of p-hacking) inflates the false positive rate.
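For a conversion-style metric, the power calculation can be sketched with the usual normal-approximation formula for comparing two proportions. This is a rough illustration, not a replacement for a proper experimentation tool. The function name `sample_size_per_arm` and the 10% baseline rate in the usage note are assumptions made for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)      # rate we hope the new model reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # critical value for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Assumed baseline: 10% conversion, 2% relative lift (10.0% -> 10.2%)
    print(sample_size_per_arm(0.10, 0.02))
```

With the assumed 10% baseline, a 2% relative lift needs on the order of 350,000 users per arm under this approximation, which is why small expected effects often force long tests.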

Run for full business cycles — a business cycle is one full week (or one full month for monthly metrics). User behavior varies by day of week — Monday users behave differently from Saturday users. A test that runs Tuesday-Friday will show different results than one that runs a full week. Run for at least 2 full business cycles.
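One way to combine the sample-size requirement with the full-cycle rule is to round the planned duration up to whole cycles. The sketch below assumes a steady daily traffic figure per arm; `planned_duration_days` and its parameters are hypothetical names for illustration.

```python
from math import ceil


def planned_duration_days(required_users_per_arm, daily_users_per_arm,
                          min_cycles=2, cycle_days=7):
    """Days to run: enough to hit the sample size, rounded up to whole business cycles."""
    days_for_sample = ceil(required_users_per_arm / daily_users_per_arm)
    cycles = max(min_cycles, ceil(days_for_sample / cycle_days))
    return cycles * cycle_days


if __name__ == "__main__":
    # 100,000 users needed per arm, 5,000 new users per arm per day (assumed numbers)
    print(planned_duration_days(100_000, 5_000))
```

Rounding up rather than down keeps every weekday equally represented in both variants.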

Pre-register your hypothesis — write down before the test starts: what the primary metric is, what effect size you expect, your significance threshold, and your sample size. This prevents post-hoc rationalization of results.
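A pre-registration does not need special tooling. One possible sketch is an immutable record that is written to version control before the test starts. The class name `PreRegistration`, its fields, and the example values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be changed after creation
class PreRegistration:
    primary_metric: str
    expected_relative_lift: float
    significance_level: float
    power: float
    sample_size_per_arm: int
    guardrail_metrics: tuple = ()


def to_record(plan):
    """Serialize the plan so it can be committed or logged before launch."""
    return json.dumps(asdict(plan), sort_keys=True)


if __name__ == "__main__":
    plan = PreRegistration("day_30_retention", 0.02, 0.05, 0.80, 350_000,
                           ("p95_latency_ms",))
    print(to_record(plan))
```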

The Novelty Effect

When users encounter something new, they behave differently than they will once the novelty wears off. A new recommendation algorithm might drive higher click-through rates in week 1 simply because the recommendations look different and users explore them. By week 3, the novelty has worn off and behavior reflects the algorithm's actual quality.

Wait 2-3 weeks after the test starts before reading results. For high-novelty features (significant UI changes, new content types, new interaction patterns), the novelty effect can persist for 4-6 weeks.

One diagnostic: plot your primary metric over time for both variants. If the new model's metric starts high, declines, and then stabilizes above the control, the stabilized level is the real effect. If it declines to the control level, the initial lift was entirely novelty.
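The same diagnostic can be approximated numerically by comparing the relative lift in the first days of the test with the lift in the last days. This is a minimal sketch that assumes you already have one aggregated metric value per day for each variant; the function names and the 7-day windows are illustrative choices, not a standard.

```python
def daily_lift(control_by_day, treatment_by_day):
    """Relative lift of treatment over control for each day."""
    return [t / c - 1.0 for c, t in zip(control_by_day, treatment_by_day)]


def novelty_summary(control_by_day, treatment_by_day, early_days=7, late_days=7):
    """Compare average lift early in the test with average lift at the end."""
    lift = daily_lift(control_by_day, treatment_by_day)
    if len(lift) < early_days + late_days:
        raise ValueError("need at least early_days + late_days of data")
    early = sum(lift[:early_days]) / early_days
    late = sum(lift[-late_days:]) / late_days
    return {"early_lift": early, "late_lift": late, "decay": early - late}


if __name__ == "__main__":
    # Synthetic data: a 5% lasting lift plus a novelty bump that fades over time
    control = [0.10] * 21
    treatment = [0.10 * (1 + 0.05 + 0.15 * 0.8 ** d) for d in range(21)]
    print(novelty_summary(control, treatment))
```

A large positive `decay` with a `late_lift` near zero suggests the early gain was mostly novelty. A `late_lift` that settles clearly above zero suggests a lasting effect.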

Statistical Analysis

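For continuous metrics (revenue per user, session duration), use a two-sample t-test (Welch's version if the two variants may have different variances) or, with large samples, a Z-test. For proportions (conversion rate, click-through rate), use a proportion Z-test or chi-squared test.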
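As an illustration of the proportion case, here is a minimal two-sided two-proportion Z-test using only the standard library. The function name and the example counts are assumptions, and the normal approximation is only reasonable with large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Assumed counts: control 1,000 / 10,000 conversions, new model 1,100 / 10,000
    z, p = two_proportion_ztest(1000, 10_000, 1100, 10_000)
    print(f"z={z:.2f}, p={p:.4f}")
```

For continuous metrics, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs Welch's t-test on two arrays of per-user values.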

Correct for multiple comparisons if you are testing multiple metrics. If you test 10 metrics at the 5% significance level, you expect 0.5 false positives on average even when the new model changes nothing, and if the metrics are independent the chance of at least one false positive is about 40%. Use a Bonferroni correction (divide the significance threshold by the number of tests) or the Benjamini-Hochberg procedure for controlling the false discovery rate.
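Both corrections are short enough to sketch directly. The functions below take a list of p-values, one per metric, and return which ones remain significant. The example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

Bonferroni is the more conservative of the two. Benjamini-Hochberg usually keeps more true effects when many metrics are tested, at the cost of tolerating a controlled share of false discoveries.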

Check for heterogeneous treatment effects: does the new model help some user segments (power users, new users, mobile users) and hurt others? A model that looks neutral on average might be meaningfully hurting a valuable segment while helping a less important one. Segment your analysis by user type, device, tenure, and geography.
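A per-segment breakdown can be sketched as a simple aggregation over per-user rows. The row layout `(segment, variant, converted)` and the segment names are assumptions for the example; in practice this would usually be a query against your experiment tables.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """rows: iterable of (segment, variant, converted), variant is 'control' or 'treatment'."""
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for segment, variant, converted in rows:
        counts[segment][variant][0] += int(converted)  # conversions
        counts[segment][variant][1] += 1               # users
    result = {}
    for segment, c in counts.items():
        if c["control"][1] == 0 or c["treatment"][1] == 0:
            continue  # skip segments that are missing one of the variants
        rate_c = c["control"][0] / c["control"][1]
        rate_t = c["treatment"][0] / c["treatment"][1]
        result[segment] = {"control": rate_c, "treatment": rate_t,
                           "abs_lift": rate_t - rate_c}
    return result


if __name__ == "__main__":
    rows = (
        [("power", "control", i < 50) for i in range(100)]
        + [("power", "treatment", i < 40) for i in range(100)]
        + [("new", "control", i < 10) for i in range(100)]
        + [("new", "treatment", i < 20) for i in range(100)]
    )
    print(lift_by_segment(rows))
```

In this synthetic data both variants convert 60 of 200 users overall, yet the new model loses 10 points among power users and gains 10 among new users. Each segment is an extra comparison, so the multiple-comparison corrections apply here too.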

Multi-Armed Bandits as an Alternative

Traditional A/B testing treats the exploration (discovering which variant is better) and exploitation (showing users the better variant) phases as strictly separate. The test runs until the planned sample size is reached, then 100% of traffic shifts to the winner.

Multi-armed bandits blur this separation: the algorithm continuously updates the probability of showing each variant based on observed performance, automatically routing more traffic toward the winner while the test is still running.

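A Thompson Sampling bandit for binary (0/1) rewards:

import numpy as np

class ThompsonSamplingBandit:
    """Thompson Sampling with a Beta(1, 1) prior on each arm's success rate."""

    def __init__(self, n_arms, seed=None):
        self.successes = np.ones(n_arms)  # Beta alpha: 1 + observed successes
        self.failures = np.ones(n_arms)   # Beta beta: 1 + observed failures
        self.rng = np.random.default_rng(seed)

    def choose_arm(self):
        # Draw one plausible success rate per arm from its posterior, then play the best draw
        samples = self.rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # reward is expected to be 1 (success) or 0 (failure)
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1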

Bandits are better than standard A/B tests when the cost of showing a suboptimal variant to users is high and the metric feedback loop is short (you learn quickly which variant is better). They are worse when the feedback loop is long (you cannot adapt quickly) or when you need a clean statistical comparison for decision-making.
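To see the traffic-shifting behavior, a small simulation helps. The sketch below is a self-contained, standard-library version of Thompson Sampling run against two arms with made-up true conversion rates; the rates, round count, and function name are assumptions for illustration only.

```python
import random


def simulate_thompson(true_rates, n_rounds=5000, seed=0):
    """Run Thompson Sampling on simulated binary rewards; return pulls per arm."""
    rng = random.Random(seed)
    successes = [1] * len(true_rates)  # Beta(1, 1) prior for each arm
    failures = [1] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Sample a plausible success rate for each arm and play the highest
        samples = [rng.betavariate(a, b) for a, b in zip(successes, failures)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated user response
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


if __name__ == "__main__":
    # Arm 0: current model at 5% conversion, arm 1: new model at 20% (assumed)
    print(simulate_thompson([0.05, 0.20]))
```

Most pulls end up on the better arm, which is the benefit. The unequal allocation is also why the final comparison is statistically less clean than a fixed 50/50 split.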

Shadow Mode Testing

Shadow mode runs the new model in parallel with the current model, generating predictions without serving them to users. The new model's predictions are logged and compared to the current model's predictions and to actual outcomes, but users only see the current model's output.

Shadow mode is ideal for:

Validating that the new model's predictions are sensible before any users see them
Estimating the impact of the new model on offline metrics before an A/B test
Debugging the new model's behavior on real production inputs (which may differ from your validation set)
Catching infrastructure issues (latency, error rate) before the model is live

Implementation: in your prediction API, after calling the current model, asynchronously call the new model and log its prediction to a separate table. The async call should not block the response to the user.

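import asyncio
import logging

# current_model, new_model, and log_shadow_result are assumed to be defined elsewhere.
_shadow_tasks = set()  # keep strong references so pending tasks are not garbage-collected

async def predict_with_shadow(request):
    # Current model: on the critical path, the user waits for this result
    current_prediction = current_model.predict(request.features)

    # New model: scheduled in the background, does not block the response
    task = asyncio.create_task(shadow_predict_and_log(request, current_prediction))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)

    return current_prediction

async def shadow_predict_and_log(request, current_prediction):
    try:
        # Run the blocking predict() in a worker thread so the event loop stays free
        new_prediction = await asyncio.to_thread(new_model.predict, request.features)
        await log_shadow_result(request.id, current_prediction, new_prediction)
    except Exception:
        # A shadow failure must never affect user-facing traffic
        logging.exception("shadow prediction failed")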

Run shadow mode for at least one week before the A/B test. Use the shadow results to set expectations for A/B test effect sizes.
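Once shadow results are logged, a first pass at comparing the two models might look like the sketch below. It assumes each logged record holds both models' predicted probabilities and, where already known, the actual 0/1 outcome. The record layout, the 0.05 agreement tolerance, and the use of the Brier score (mean squared error of the predicted probability) are illustrative choices.

```python
def summarize_shadow_log(records, tolerance=0.05):
    """Compare current vs shadow predictions logged as dicts with
    'current', 'shadow', and 'outcome' (0, 1, or None if not yet known)."""
    diffs = [abs(r["shadow"] - r["current"]) for r in records]
    agreement = sum(d <= tolerance for d in diffs) / len(records)

    # Accuracy can only be scored on records whose outcome has arrived
    labeled = [r for r in records if r.get("outcome") is not None]

    def brier(key):
        return sum((r[key] - r["outcome"]) ** 2 for r in labeled) / len(labeled)

    return {
        "agreement_rate": agreement,
        "n_labeled": len(labeled),
        "brier_current": brier("current") if labeled else None,
        "brier_shadow": brier("shadow") if labeled else None,
    }


if __name__ == "__main__":
    log = [
        {"current": 0.2, "shadow": 0.22, "outcome": 0},
        {"current": 0.8, "shadow": 0.90, "outcome": 1},
        {"current": 0.5, "shadow": 0.30, "outcome": 0},
        {"current": 0.6, "shadow": 0.61, "outcome": None},
    ]
    print(summarize_shadow_log(log))
```

A low agreement rate is not bad in itself, but it signals that the A/B test will expose users to noticeably different behavior. A lower Brier score for the shadow model is an early hint that the difference is an improvement.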

Measuring Long-Term Model Impact

The most important model improvements are often those that change long-term user behavior — recommendations that lead to deeper platform engagement, not just higher immediate click-through rates. These effects require longer measurement windows.

For 30-day retention impact: run the A/B test, then track both cohorts for 30 days after the test ends. This requires more patience but captures the real business value of improved recommendation quality.

For irreversible behavioral changes (a model that teaches users good habits vs bad habits), the effect may not fully appear for months. Consider running a holdout group: a small fraction of users who permanently receive the control experience and can be compared against the full rollout group to measure cumulative impact over months.
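A long-lived holdout needs an assignment that stays stable across sessions and deployments. One common approach, sketched here under assumptions, is to hash the user ID with a fixed salt and compare the result to the holdout fraction. The salt string, the 5% fraction, and the function name are hypothetical.

```python
import hashlib


def in_holdout(user_id, holdout_fraction=0.05, salt="long-term-holdout-v1"):
    """Deterministically place a stable fraction of users in the permanent control group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map first 32 bits to [0, 1)
    return bucket < holdout_fraction


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    share = sum(in_holdout(u) for u in users) / len(users)
    print(f"holdout share: {share:.3f}")
```

Because the assignment depends only on the user ID and the salt, no lookup table is needed, and changing the salt draws a fresh holdout for the next long-term study.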

Keep Reading

ML Monitoring and Data Drift Detection — what to watch after you have deployed the winning model
ML Deployment Patterns Guide — canary deployments as a complement to A/B testing
We Replaced 6 SaaS Tools With One: What Happened — the business context of ML-driven product decisions
