Why ML Models Degrade in Production
A model trained in January on clean data may perform poorly by June because the real world changed. Three kinds of drift cause this:
Data drift: input feature distributions shift (e.g., your price feature starts ranging 0-1000 instead of 0-100).
Concept drift: the relationship between features and labels changes (e.g., user behavior that predicted churn no longer does).
Target drift: the label distribution shifts (e.g., fraud rate changes from 2% to 8%).
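Evidently, an open-source Python library, ships reports and tests for data drift and target drift. It also tracks model quality metrics, which is where concept drift usually shows up once true labels arrive.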
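To make the target drift example concrete, here is a minimal sketch of a two-proportion z-test on the fraud rate. The sample sizes (5,000 rows per period) and the 0.05 threshold are assumptions for illustration, and this is not how Evidently itself computes target drift.

```python
import math


def two_proportion_z_test(pos_ref, n_ref, pos_cur, n_cur):
    """Return (z, two-sided p-value) for a difference in label rates."""
    p_ref = pos_ref / n_ref
    p_cur = pos_cur / n_cur
    # Pooled rate under the null hypothesis that both periods share one rate.
    pooled = (pos_ref + pos_cur) / (n_ref + n_cur)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_cur))
    z = (p_cur - p_ref) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 2% fraud in 5,000 reference rows, 8% in 5,000 current rows.
z, p_value = two_proportion_z_test(pos_ref=100, n_ref=5000, pos_cur=400, n_cur=5000)
print(f"z = {z:.2f}, p = {p_value:.3g}")

# Assumed significance threshold; tune it to your tolerance for false alarms.
target_drifted = p_value < 0.05
```

With samples this large, a jump from 2% to 8% is far outside what sampling noise would explain, so the check flags drift.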
Generating a Data Drift Report
# These import paths match Evidently's 0.4.x releases; newer releases reorganized the package.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

# Load reference (training) and current (production) data
reference = pd.read_parquet("reference_data.parquet")
current = pd.read_parquet("current_data.parquet")

report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
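The HTML report shows, for each feature, the statistical test that was applied, the resulting drift score, whether drift was detected, and distribution plots. Evidently picks the test by column type and sample size: for smaller reference sets the defaults are the Kolmogorov-Smirnov test for numerical columns and chi-square for categorical ones, while larger sets default to distance measures such as Wasserstein distance.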
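As a rough illustration of what a KS-style drift check measures, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) in plain Python. The price values are made up to mirror the 0-100 versus 0-1000 example above, and the code is not Evidently's implementation; it computes only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical price feature: reference spans 0-100, current spans 0-1000.
reference_prices = list(range(0, 101))
current_prices = [i * 10 for i in range(0, 101)]

print(ks_statistic(reference_prices, reference_prices))  # identical samples
print(ks_statistic(reference_prices, current_prices))    # stretched range
```

A statistic near 0 means the two distributions overlap almost entirely; a value near 1 means they barely overlap.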
Classification Model Performance Report
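from evidently import ColumnMapping
from evidently.metric_preset import ClassificationPreset
from evidently.report import Report

# Tell Evidently which columns hold the ground truth and the model output.
# ColumnMapping has no `prediction_probas` argument: for probabilistic output,
# pass the probability column as `prediction` instead of the predicted label
# (e.g. "prob_1" for the positive class in a binary problem).
column_mapping = ColumnMapping(
    target="label",
    prediction="predicted_label",
)

report = Report(metrics=[ClassificationPreset()])
report.run(
    reference_data=reference_df,  # must have target and prediction columns
    current_data=current_df,
    column_mapping=column_mapping,
)
report.save_html("classification_report.html")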
Test Suite for Automated Pass/Fail
Reports are for humans. Test Suites are for pipelines:
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestShareOfDriftedColumns,
    TestColumnDrift,
    TestNumberOfMissingValues,
)

tests = TestSuite(tests=[
    TestShareOfDriftedColumns(lt=0.2),        # pass only if fewer than 20% of columns drift
    TestColumnDrift(column_name="user_age"),  # fail if the age column drifts
    TestNumberOfMissingValues(lt=1000),       # pass only if fewer than 1000 values are missing
])
tests.run(reference_data=reference, current_data=current)

# Fail the pipeline step when any test fails.
if not tests.as_dict()["summary"]["all_passed"]:
    raise ValueError("Data quality check failed: investigate before retraining")
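When a suite fails, it helps to log which tests failed rather than only raising. The sketch below assumes each entry under `"tests"` in the `as_dict()` result carries `name`, `status`, and `description` keys, with `"SUCCESS"` marking a pass; check that shape against your installed Evidently version. The `failed_tests` helper and the sample dictionary are hypothetical stand-ins, not captured output.

```python
def failed_tests(result):
    """Return (name, description) pairs for tests that did not succeed."""
    return [
        (test.get("name", "<unnamed>"), test.get("description", ""))
        for test in result.get("tests", [])
        if test.get("status") != "SUCCESS"
    ]


# Hand-written stand-in for tests.as_dict(); the keys are assumed, not captured output.
sample_result = {
    "summary": {"all_passed": False},
    "tests": [
        {"name": "Share of Drifted Columns", "status": "SUCCESS",
         "description": "hypothetical description"},
        {"name": "Drift per Column", "status": "FAIL",
         "description": "hypothetical description"},
    ],
}

for name, description in failed_tests(sample_result):
    print(f"FAILED: {name}: {description}")
```

In a pipeline you would pass `tests.as_dict()` instead of `sample_result` and include the returned pairs in the error message or alert.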
Integrating with Airflow
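from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_drift_check():
    # load data, run tests, raise on failure
    ...

def trigger_retraining():
    # placeholder: start your retraining job here
    ...

with DAG(
    "daily_drift_check",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),  # placeholder date; the scheduler needs one
    catchup=False,                    # do not backfill runs since start_date
) as dag:
    drift_check = PythonOperator(
        task_id="check_drift",
        python_callable=run_drift_check,
    )
    retrain = PythonOperator(
        task_id="retrain_model",
        python_callable=trigger_retraining,
        # Runs only when check_drift fails. Any failure counts, including
        # a crash while loading data, not just detected drift.
        trigger_rule="all_failed",
    )
    drift_check >> retrain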
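Because `trigger_rule="all_failed"` fires on any upstream failure, one alternative is to make drift an explicit branch decision, for example with Airflow's `BranchPythonOperator`, whose callable returns the `task_id` to follow. The sketch below shows only that decision logic in plain Python; the `skip_retraining` task id and the `choose_next_task` helper are hypothetical names, and the summary dictionary is assumed to look like the `"summary"` entry used in the test suite section.

```python
def choose_next_task(summary):
    """Map a drift-check summary to the task id that should run next.

    Raises KeyError if the summary is malformed, so a broken check fails the
    branch task instead of silently triggering a retrain.
    """
    if summary["all_passed"]:
        return "skip_retraining"  # hypothetical no-op task id
    return "retrain_model"


# A BranchPythonOperator callable could return the result of this function, e.g.
#   return choose_next_task(tests.as_dict()["summary"])
print(choose_next_task({"all_passed": True}))
print(choose_next_task({"all_passed": False}))
```

With this layout, a crash during data loading fails the branch task and nothing downstream runs, while detected drift routes to retraining on purpose.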
Evidently vs WhyLogs vs Fiddler
| | Evidently | WhyLogs | Fiddler |
|---|---|---|---|
| Open source | Yes | Yes | No (SaaS) |
| Self-host | Yes | Yes | No |
| Reports | Rich HTML | Basic | Rich (managed) |
| Real-time | Evidently Cloud | Yes (WhyLabs) | Yes |
| Price | Free / Cloud | Free / WhyLabs | $$$ |
Resources: Evidently GitHub, docs, Evidently Cloud.