ML Monitoring in Production: Detecting Data Drift Before It Breaks Your Model

Why models degrade, what to monitor, detecting drift with statistical tests, automated retraining triggers, and tools like Evidently AI and Arize.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#ml-monitoring #data-drift #model-degradation #evidently-ai #production-ml

Statistical Tests for Drift Detection

Kolmogorov-Smirnov (KS) test - a non-parametric test that compares two continuous distributions. It measures the maximum vertical distance between the empirical CDFs of two samples. A large KS statistic with a small p-value indicates that the two samples were probably drawn from different distributions. It is commonly used to detect drift in continuous features.

from scipy.stats import ks_2samp

# Reference distribution (from training data)
reference = training_data["feature_x"].values

# Current window (last 7 days of production data)
current = production_data_last_7_days["feature_x"].values

ks_stat, p_value = ks_2samp(reference, current)

if p_value < 0.05:
    print(f"Significant drift detected in feature_x (KS={ks_stat:.3f}, p={p_value:.4f})")
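One caveat when running this check on large production windows: with enough samples, the KS test reports even negligible shifts as statistically significant. The sketch below assumes you want a minimum effect size in addition to the p-value before alerting. The `ks_drift` helper and the `min_stat=0.1` cutoff are illustrative assumptions, not standard values.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, current, alpha=0.05, min_stat=0.1):
    """Flag drift only if the shift is both statistically significant and large.

    `min_stat` is a hypothetical effect-size cutoff; tune it per feature.
    """
    result = ks_2samp(reference, current)
    stat, p_value = result[0], result[1]
    drifted = bool(p_value < alpha and stat >= min_stat)
    return drifted, float(stat), float(p_value)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=200_000)
tiny_shift = rng.normal(0.03, 1.0, size=200_000)  # negligible shift, huge sample
real_shift = rng.normal(0.5, 1.0, size=200_000)   # substantial shift

print(ks_drift(reference, tiny_shift))  # significant p-value, but small KS statistic
print(ks_drift(reference, real_shift))  # significant p-value and large KS statistic
```

Gating on the KS statistic as well as the p-value keeps alert volume manageable when the production window contains hundreds of thousands of rows.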

Population Stability Index (PSI) - measures how much a distribution has shifted by comparing binned frequencies between reference and production. The usual rule-of-thumb thresholds are: PSI below 0.1, no significant shift; PSI 0.1-0.25, moderate shift, worth investigating; PSI above 0.25, significant shift, likely model degradation. These cutoffs are conventions rather than statistical guarantees.

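import numpy as np

def psi(reference, current, buckets=10):
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges are reference quantiles, so each bin holds ~1/buckets of the reference
    breakpoints = np.percentile(reference, np.linspace(0, 100, buckets + 1))
    # Open the outer bins so production values outside the training range are still counted
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf

    ref_counts = np.histogram(reference, breakpoints)[0] / len(reference)
    cur_counts = np.histogram(current, breakpoints)[0] / len(current)

    # Floor empty bins to avoid division by zero and log(0)
    ref_counts = np.where(ref_counts == 0, 0.0001, ref_counts)
    cur_counts = np.where(cur_counts == 0, 0.0001, cur_counts)

    return np.sum((cur_counts - ref_counts) * np.log(cur_counts / ref_counts))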
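In formula terms, the function above computes the following, where $r_i$ and $c_i$ are the fractions of reference and current samples that fall in bin $i$, and $B$ is the number of buckets (the symbols are introduced here for illustration):

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(c_i - r_i\right) \ln\frac{c_i}{r_i}
```

Every term is non-negative because $c_i - r_i$ and $\ln(c_i / r_i)$ always share the same sign, so PSI is zero only when the two binned distributions match exactly.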

Chi-squared test - for categorical features, compare the category frequencies in the reference and production samples with a chi-squared test. It plays the same role as the KS test, but for discrete distributions.
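A minimal sketch of the categorical check, assuming SciPy's `chi2_contingency` applied to a two-row table of category counts. The `categorical_drift` helper and the payment-method data are made up for illustration.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def categorical_drift(reference, current, alpha=0.05):
    """Chi-squared test on category counts from two samples."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    # Use the union of categories so a category seen in only one sample is kept
    categories = sorted(set(ref_counts) | set(cur_counts))
    table = [
        [ref_counts.get(c, 0) for c in categories],
        [cur_counts.get(c, 0) for c in categories],
    ]
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return bool(p_value < alpha), float(chi2), float(p_value)

# Hypothetical payment-method feature: the "wallet" share grows in production
reference = ["card"] * 700 + ["bank"] * 250 + ["wallet"] * 50
current = ["card"] * 400 + ["bank"] * 250 + ["wallet"] * 350

drifted, chi2, p_value = categorical_drift(reference, current)
print(f"drifted={drifted}, chi2={chi2:.1f}, p={p_value:.2e}")
```

The chi-squared approximation becomes unreliable when expected counts per cell are very small, so rare categories are often merged into an "other" bucket before testing.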

Monitoring Implementation

A minimal monitoring pipeline:

1. At prediction time, log the input features, prediction, and timestamp to a monitoring table.
2. A scheduled job (hourly or daily) computes distribution statistics on the last N days of production features and compares them to training-time reference statistics.
3. Alerts fire when drift statistics exceed thresholds.
4. When ground truth labels become available (after the appropriate lag), compute accuracy for those predictions and log it.

from datetime import datetime, timezone

# `db`, `fetch_features_last_24h`, `reference_distributions`, `MONITORED_FEATURES`,
# and `alert` are assumed to be defined elsewhere in your service.

# At prediction time
async def predict_and_log(features: dict, model):
    prediction = model.predict(features)

    await db.monitoring_log.insert_one({
        "timestamp": datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated
        "features": features,
        "prediction": prediction,
        "model_version": model.version,
    })

    return prediction

# Scheduled drift check (run every hour over a rolling 24-hour window)
def check_drift():
    recent_features = fetch_features_last_24h()
    for feature_name in MONITORED_FEATURES:
        psi_score = psi(reference_distributions[feature_name], recent_features[feature_name])
        if psi_score > 0.25:
            alert(f"Drift detected in {feature_name}: PSI={psi_score:.3f}")
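The last pipeline step, scoring predictions once ground truth arrives, can be sketched as a join between the prediction log and the labels received so far. This assumes each logged record also carries a `prediction_id` key, which is hypothetical and not part of the logging code above. The function name and sample data are illustrative too.

```python
def accuracy_by_model_version(prediction_log, labels):
    """Join logged predictions with late-arriving labels and score each model version.

    prediction_log: iterable of dicts with "prediction_id", "prediction", "model_version".
    labels: dict mapping prediction_id -> ground-truth label (only those known so far).
    """
    correct, total = {}, {}
    for row in prediction_log:
        if row["prediction_id"] not in labels:
            continue  # ground truth has not arrived yet; skip rather than guess
        version = row["model_version"]
        is_correct = row["prediction"] == labels[row["prediction_id"]]
        total[version] = total.get(version, 0) + 1
        correct[version] = correct.get(version, 0) + int(is_correct)
    return {version: correct[version] / total[version] for version in total}

# Hypothetical log: four predictions, labels available for three of them
log = [
    {"prediction_id": "p1", "prediction": 1, "model_version": "v1"},
    {"prediction_id": "p2", "prediction": 0, "model_version": "v1"},
    {"prediction_id": "p3", "prediction": 1, "model_version": "v2"},
    {"prediction_id": "p4", "prediction": 0, "model_version": "v2"},
]
labels = {"p1": 1, "p2": 1, "p3": 1}

print(accuracy_by_model_version(log, labels))  # {'v1': 0.5, 'v2': 1.0}
```

Scoring per model version keeps the accuracy series comparable across deployments, so a drop can be attributed to drift rather than to a model swap.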

Automated Retraining Triggers vs Manual Review

Automated retraining (retrain when drift exceeds a threshold) is appealing but risky:

If the trigger fires on a temporary anomaly (a brief traffic spike, a data pipeline glitch), you retrain on corrupted data.
If concept drift is adversarial (attackers adapting to your fraud model), automated retraining teaches the model the new attack patterns but may also cause it to forget the older ones.
Automated retraining can create feedback loops in recommendation systems: the model shapes user behavior, user behavior becomes training data, the retrained model reinforces those behaviors.

The pragmatic approach: use monitoring alerts to trigger human review, not automatic retraining. A dashboard shows drift statistics and accuracy degradation. A data scientist reviews whether the drift is real and whether retraining is warranted, and only then triggers retraining manually. This adds latency (hours to days instead of minutes) but prevents runaway feedback loops and retraining on corrupted data.

Exception: for high-frequency models with very short feedback loops (such as CTR prediction, where feedback is immediate), automated retraining every few hours is standard practice at scale.
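One way to keep temporary anomalies from reaching the reviewer at all is to request review only when drift persists across several consecutive scheduled checks. This is a sketch of that idea, not a prescribed method. The `DriftReviewGate` class and the choice of three consecutive checks are assumptions made for illustration.

```python
from collections import deque

class DriftReviewGate:
    """Request human review only after drift persists for several consecutive checks."""

    def __init__(self, threshold=0.25, consecutive=3):
        self.threshold = threshold
        # Keeps only the outcomes of the most recent `consecutive` checks
        self.recent = deque(maxlen=consecutive)

    def update(self, psi_score):
        """Record one scheduled check; return True when review should be requested."""
        self.recent.append(psi_score > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

gate = DriftReviewGate(threshold=0.25, consecutive=3)
scores = [0.05, 0.40, 0.06, 0.30, 0.31, 0.35]  # one-off spike, then sustained drift
decisions = [gate.update(score) for score in scores]
print(decisions)  # [False, False, False, False, False, True]
```

The isolated spike at 0.40 never triggers a review, while the three sustained readings at the end do. The trade-off is slower detection of genuine sudden shifts.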

Monitoring Tools

Evidently AI (open source) is one of the most accessible monitoring tools for teams building their first ML monitoring system. It computes drift statistics, generates HTML reports, and integrates with dashboarding tools. A drift report takes only a few lines:

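# Legacy Report API; recent Evidently releases have reorganized these imports,
# so check the documentation for your installed version.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset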

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=production_df)
report.save_html("drift_report.html")

WhyLabs is a managed monitoring platform. You log statistical profiles (not raw data) from your production pipeline, and WhyLabs handles drift detection, anomaly detection, and alerting. Good for teams that do not want to manage monitoring infrastructure.

Arize AI is an enterprise ML observability platform with strong debugging tools. When drift is detected, Arize helps identify which examples are causing the drift through slice analysis and feature importance for degradation.

Grafana + Prometheus is the right choice if you prefer to build monitoring on standard infrastructure. Compute your own drift statistics, push them to Prometheus as metrics, and visualize in Grafana.
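As a rough illustration of the "push them to Prometheus" step, the sketch below renders per-feature PSI scores in the Prometheus text exposition format using only the standard library. The metric name `feature_drift_psi` and the `to_prometheus_text` helper are hypothetical. In practice you would more likely use an official Prometheus client library to expose or push the same gauge.

```python
def to_prometheus_text(psi_by_feature, metric_name="feature_drift_psi"):
    """Render per-feature PSI scores in the Prometheus text exposition format."""
    def escape(value):
        # Label values must escape backslashes, double quotes, and newlines
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    lines = [
        f"# HELP {metric_name} Population Stability Index per monitored feature",
        f"# TYPE {metric_name} gauge",
    ]
    # One gauge sample per feature, distinguished by the `feature` label
    for feature, score in sorted(psi_by_feature.items()):
        lines.append(f'{metric_name}{{feature="{escape(feature)}"}} {float(score)}')
    return "\n".join(lines) + "\n"

print(to_prometheus_text({"age": 0.31, "income": 0.04}))
```

With one labeled gauge per feature, a single Grafana panel can plot every feature's PSI over time, and an alert rule can fire when any series crosses 0.25.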

Keep Reading

A/B Testing ML Models in Production - how to test new model versions before rolling out
ML Deployment Patterns Guide - canary deployments for safe model updates
ML Data Collection Guide - building the retraining datasets that monitoring triggers

