Deploying a model to production is not the end of the ML lifecycle. It is the beginning. Models that performed well at deployment degrade over time as the world changes. A fraud detection model trained in 2023 may fail on fraud patterns that emerged in 2025. A churn prediction model trained on pre-pandemic behavior may perform poorly post-pandemic. Monitoring is what tells you when a model needs to be retrained before the business notices the degradation.
Why Models Degrade
Models degrade for three main reasons:
Data drift (also called covariate shift) — the distribution of input features changes over time. A model trained on user behavior before a major product redesign may see completely different feature distributions after the redesign. Age distribution of your user base shifts. A new device type becomes popular. Upstream data pipelines change a feature's scale or encoding.
Concept drift — the relationship between inputs and outputs changes. A model predicting house prices trained in a low-interest-rate environment will be systematically wrong in a high-interest-rate environment, even if the input features (square footage, location, number of bedrooms) are unchanged. The underlying concept (what determines price) has shifted.
Upstream data changes — a field gets dropped from the data pipeline, renamed, or its encoding changes. This is the most operationally common cause of model failure: the model's features are computed from upstream data, and a schema change silently breaks feature computation. The model receives zeros or nulls instead of meaningful values and its performance collapses.
The first two are usually gradual and show up as slow shifts in the distribution statistics described below. The third is sudden and often catastrophic, but monitoring for missing or constant features catches it quickly.
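A check for missing or constant features can be very small. The sketch below assumes features are logged as one dict per prediction; the function name `find_broken_features`, the `max_null_rate` threshold, and the sample rows are hypothetical and only illustrate the idea.

```python
import math

def find_broken_features(rows, feature_names, max_null_rate=0.5):
    """Flag features that are mostly missing or stuck on a single value.

    rows: list of feature dicts, one per logged prediction (assumed format).
    Returns {feature_name: reason} for every feature that looks broken.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    problems = {}
    for name in feature_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if not is_missing(v)]
        null_rate = 1 - len(present) / len(values) if values else 1.0
        if null_rate > max_null_rate:
            problems[name] = f"null rate {null_rate:.0%}"
        elif len(set(present)) == 1:
            problems[name] = f"constant value {present[0]!r}"
    return problems

# Hypothetical sample: "spend" is stuck at zero, "country" is mostly missing
rows = [
    {"age": 31, "spend": 0.0, "country": "DE"},
    {"age": 45, "spend": 0.0, "country": None},
    {"age": 27, "spend": 0.0, "country": None},
    {"age": 52, "spend": 0.0},
]
problems = find_broken_features(rows, ["age", "spend", "country"])
```

Running a check like this on every batch, before any drift statistics, separates "the pipeline broke" from "the world changed".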
What to Monitor
A complete ML monitoring setup tracks five things:
Prediction distribution — the distribution of the model's output values (predicted probabilities, predicted scores, class frequencies). Shift in prediction distribution is often the earliest signal of data or concept drift. If your churn model usually predicts 15% of users as high-risk and suddenly starts predicting 40% as high-risk, something has changed.
Feature distributions — the distribution of each input feature. Statistical tests (KS test for continuous features, chi-squared for categorical) compare the current distribution to a reference distribution from the training period. Large deviations indicate data drift.
Model confidence — mean prediction confidence over time. A model that is systematically less certain about its predictions than it was at launch is encountering inputs unlike its training data.
Business metrics — the downstream metric the model was built to improve: conversion rate, click-through rate, churn rate. Business metrics are the ground truth but have a lag (you need to wait for outcomes to compare to predictions). They are the lagging indicator; feature and prediction distributions are the leading indicators.
The ground truth lag problem — for many ML tasks, you do not know the true label at prediction time. You predicted churn probability 30 days ago; you know if that prediction was correct only now. This means your real-time accuracy metric is always 30 days stale. Build your monitoring to handle this lag: compute accuracy on the cohort of predictions where you now have ground truth labels, and plot it over time.
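One way to handle the lag is to group predictions by the day they were made and score only those whose labels have arrived. This is a minimal sketch using the standard library; the tuple layout, the `labels` mapping, and the 0.5 decision threshold are assumptions for illustration, not a prescribed schema.

```python
from collections import defaultdict
from datetime import date

def accuracy_by_cohort(predictions, labels, threshold=0.5):
    """Accuracy per prediction date, using only predictions whose label has arrived.

    predictions: list of (prediction_id, prediction_date, predicted_probability)
    labels: dict of prediction_id -> true outcome (bool); ids without an entry
            are still waiting for ground truth and are skipped.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred_id, pred_date, prob in predictions:
        if pred_id not in labels:
            continue  # label not available yet
        totals[pred_date] += 1
        hits[pred_date] += int((prob >= threshold) == labels[pred_id])
    return {day: hits[day] / totals[day] for day in sorted(totals)}

# Hypothetical data: prediction "e" is too recent to have a label
predictions = [
    ("a", date(2025, 1, 1), 0.9),
    ("b", date(2025, 1, 1), 0.2),
    ("c", date(2025, 1, 2), 0.7),
    ("d", date(2025, 1, 2), 0.6),
    ("e", date(2025, 2, 1), 0.8),
]
labels = {"a": True, "b": False, "c": False, "d": True}
cohort_accuracy = accuracy_by_cohort(predictions, labels)
```

Plotting the returned values by date gives the accuracy-over-time curve, with the most recent cohorts appearing only once their labels exist.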
Statistical Tests for Drift Detection
Kolmogorov-Smirnov (KS) test — a non-parametric test that compares two continuous distributions. It measures the maximum difference between the empirical CDFs of two samples. A large KS statistic (and small p-value) indicates the distributions are different. Commonly used for continuous feature drift detection.
from scipy.stats import ks_2samp
# Reference distribution (from training data)
reference = training_data["feature_x"].values
# Current window (last 7 days of production data)
current = production_data_last_7_days["feature_x"].values
ks_stat, p_value = ks_2samp(reference, current)
if p_value < 0.05:
    print(f"Significant drift detected in feature_x (KS={ks_stat:.3f}, p={p_value:.4f})")
Population Stability Index (PSI) — measures how much a distribution has shifted by comparing binned frequencies between reference and production. PSI below 0.1: no significant shift. PSI 0.1-0.25: moderate shift, worth investigating. PSI above 0.25: significant shift, likely model degradation.
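Written out, with $r_i$ and $c_i$ denoting the fraction of reference and current observations that fall in bucket $i$ (symbols chosen here for illustration), the index over $B$ buckets is:

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( c_i - r_i \right) \ln\frac{c_i}{r_i}
```

Every term is non-negative, because $c_i - r_i$ and $\ln(c_i / r_i)$ always share the same sign, so PSI is zero only when the two binned distributions match exactly.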
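import numpy as np

def psi(reference, current, buckets=10):
    # Bucket edges are reference quantiles, so each reference bucket holds about 1/buckets of the data
    breakpoints = np.percentile(reference, np.linspace(0, 100, buckets + 1))
    # Clip production values into the reference range; otherwise np.histogram silently drops them
    current = np.clip(current, breakpoints[0], breakpoints[-1])
    ref_counts = np.histogram(reference, breakpoints)[0] / len(reference)
    cur_counts = np.histogram(current, breakpoints)[0] / len(current)
    # Avoid division by zero and log(0) for empty buckets
    ref_counts = np.where(ref_counts == 0, 0.0001, ref_counts)
    cur_counts = np.where(cur_counts == 0, 0.0001, cur_counts)
    return np.sum((cur_counts - ref_counts) * np.log(cur_counts / ref_counts))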
Chi-squared test: for categorical features, compare category frequencies in the reference and current windows with a chi-squared test of homogeneity. It plays the same role as the KS test, but for discrete distributions.
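To make the categorical case concrete, here is a standard-library sketch that computes the chi-squared statistic by hand from a 2 x K table of counts. The function name `chi_squared_drift` and the device data are hypothetical. In practice `scipy.stats.chi2_contingency` performs the same test and also returns a p-value (note that it applies a continuity correction to 2 x 2 tables by default, so its statistic will differ slightly there).

```python
from collections import Counter

def chi_squared_drift(reference, current):
    """Chi-squared statistic and degrees of freedom for two categorical samples.

    Builds a 2 x K contingency table (sample x category) and compares observed
    counts with the counts expected if both samples shared one distribution.
    """
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    n_ref, n_cur = len(reference), len(current)
    total = n_ref + n_cur

    statistic = 0.0
    for category in categories:
        column_total = ref_counts[category] + cur_counts[category]
        for observed, row_total in ((ref_counts[category], n_ref),
                                    (cur_counts[category], n_cur)):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1

# Hypothetical device mix: 50/50 at training time, 80/20 in production
reference = ["ios"] * 50 + ["android"] * 50
current = ["ios"] * 80 + ["android"] * 20
statistic, dof = chi_squared_drift(reference, current)
```

With one degree of freedom the 5% critical value is 3.841, so a statistic well above that indicates the category mix has shifted. For other degrees of freedom, `scipy.stats.chi2.sf(statistic, dof)` converts the statistic to a p-value.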
Monitoring Implementation
A minimal monitoring pipeline:
- At prediction time, log the input features, prediction, and timestamp to a monitoring table.
- A scheduled job (hourly or daily) computes distribution statistics on the last N days of production features and compares them to training-time reference statistics.
- Alerts fire when drift statistics exceed thresholds.
- When ground truth labels become available (after the appropriate lag), compute accuracy for those predictions and log it.
from datetime import datetime, timezone

# At prediction time (db is assumed to be an async database client)
async def predict_and_log(features: dict, model):
    prediction = model.predict(features)
    await db.monitoring_log.insert_one({
        "timestamp": datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated
        "features": features,
        "prediction": prediction,
        "model_version": model.version,
    })
    return prediction

# Scheduled drift check (run every hour over a rolling 24-hour window)
def check_drift():
    recent_features = fetch_features_last_24h()
    for feature_name in MONITORED_FEATURES:
        psi_score = psi(reference_distributions[feature_name], recent_features[feature_name])
        if psi_score > 0.25:
            alert(f"Drift detected in {feature_name}: PSI={psi_score:.3f}")
Automated Retraining Triggers vs Manual Review
Automated retraining (retrain when drift exceeds a threshold) is appealing but risky:
- If the trigger fires on a temporary anomaly (a brief traffic spike, a data pipeline glitch), you retrain on corrupted data.
- If concept drift is adversarial (attackers adapting to your fraud model), automated retraining teaches the model to recognize the new patterns but may also teach it to miss old patterns.
- Automated retraining can create feedback loops in recommendation systems: the model shapes user behavior, user behavior becomes training data, the retrained model reinforces those behaviors.
The pragmatic approach: use monitoring alerts to trigger human review, not automatic retraining. A dashboard shows drift statistics and accuracy degradation. A data scientist reviews whether the drift is real and the retraining is warranted. Retraining is then triggered manually. This adds latency (hours to days instead of minutes) but prevents runaway feedback loops and corrupted retraining.
Exceptions: for high-frequency models with very short feedback loops (CTR prediction with immediate feedback), automated retraining every few hours is standard practice at scale.
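For the human-review path, a simple guard against temporary anomalies is to flag a feature only when drift persists across several consecutive checks. The sketch below is one possible rule, not a standard; `needs_review`, the default of three consecutive windows, and the sample histories are assumptions.

```python
def needs_review(psi_history, threshold=0.25, consecutive=3):
    """Return True only if the last `consecutive` PSI readings all exceed the threshold."""
    if len(psi_history) < consecutive:
        return False
    return all(score > threshold for score in psi_history[-consecutive:])

# A one-off spike does not trigger review; sustained drift does
spike = needs_review([0.05, 0.40, 0.06, 0.07])
sustained = needs_review([0.05, 0.30, 0.31, 0.45])
```

The trade-off is detection latency: requiring three hourly windows delays the alert by at least two hours in exchange for fewer false alarms.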
Monitoring Tools
Evidently AI (open source) is the most accessible monitoring tool for teams building their first ML monitoring system. It computes drift statistics, generates HTML reports, and integrates with dashboarding tools. A drift report takes only a few lines:
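# Import paths for Evidently's Report API in the 0.4.x releases; newer releases reorganized these modules
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset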
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=production_df)
report.save_html("drift_report.html")
WhyLabs is a managed monitoring platform. You log statistical profiles (not raw data) from your production pipeline, and WhyLabs handles drift detection, anomaly detection, and alerting. Good for teams that do not want to manage monitoring infrastructure.
Arize AI is an enterprise ML observability platform with strong debugging tools. When drift is detected, Arize helps identify which examples are causing the drift through slice analysis and feature importance for degradation.
Grafana + Prometheus is the right choice if you prefer to build monitoring on standard infrastructure. Compute your own drift statistics, push them to Prometheus as metrics, and visualize in Grafana.