Data quality is the unglamorous work that determines the quality of everything built on top of it. Models trained on bad data learn bad patterns. Dashboards built on bad data make wrong decisions visible. Pipelines that propagate bad data turn one upstream mistake into a dozen downstream problems. Investing in data quality is one of the highest-leverage things a data team can do.
This guide covers the six dimensions of data quality, how to measure each one, the tools to enforce them in production, and the real-world failure modes that cause production incidents.
The Six Dimensions
Completeness. Are required values present? Missing data has costs: a machine learning model trained on rows where income is often missing learns a different distribution than the population. Aggregations over columns with missing values produce wrong results. Measure completeness with the percentage of non-null values per column.
Accuracy. Do values reflect reality? An age of 250, a negative revenue, a zip code that does not exist -- these are inaccurate values that pass technical validity checks. Accuracy is the hardest dimension to enforce automatically because it requires external ground truth. The best you can do programmatically is range checks, pattern checks, and cross-field validation.
Consistency. Do values have the same format and meaning across sources? "New York", "new york", "NY", "New York City", and "NYC" may all refer to the same entity but will be treated as five different categories by your code. Timestamps without timezone information are inconsistent across servers in different regions. Inconsistency is pervasive in any system that integrates multiple data sources.
Timeliness. Is data fresh enough for its intended use? A dashboard showing yesterday's data as today's numbers is misleading. A fraud detection model scoring on 6-hour-old features may miss recent fraudulent behavior. Timeliness requirements vary by use case: some need real-time data, some can tolerate daily updates.
Validity. Do values conform to their defined format and type? An email field containing "not_an_email", a date field containing "N/A", or an integer field containing "unknown" are validity failures. These are the easiest to detect automatically.
Uniqueness. Are entities represented exactly once where they should be? Duplicate customers in a CRM, duplicate transactions in an orders table, duplicate rows after a bad ETL join -- these cause double-counting and incorrect aggregations.
Measuring Data Quality with Pandas
Quick measurements during EDA or pipeline monitoring:
import pandas as pd
import numpy as np
def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
"""Generate a data quality summary for all columns."""
report = pd.DataFrame(index=df.columns)
# Completeness
report["missing_count"] = df.isnull().sum()
report["missing_pct"] = (df.isnull().sum() / len(df) * 100).round(2)
# Uniqueness
report["unique_count"] = df.nunique()
report["duplicate_pct"] = (1 - df.nunique() / df.count()) * 100
# Data type
report["dtype"] = df.dtypes
return report.sort_values("missing_pct", ascending=False)
For duplicate detection:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Duplicate rows: {duplicate_count} ({duplicate_count/len(df)*100:.2f}%)")
# Check for duplicates on key fields
duplicate_ids = df.duplicated(subset=["order_id"], keep=False)
print(f"Duplicate order_ids: {duplicate_ids.sum()}")
Great Expectations: Data Quality as Code
Great Expectations (GX) is the standard tool for defining data quality rules as code and running them as automated tests in your pipeline.
import great_expectations as gx
context = gx.get_context()
# Define a batch of data to validate
batch = context.get_validator(
datasource_name="postgres_db",
data_asset_name="orders",
)
# Define expectations
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_not_be_null("customer_id")
batch.expect_column_values_to_be_between(
"amount_usd",
min_value=0.01,
max_value=50000.0,
mostly=0.99 # Allow 1% exceptions
)
batch.expect_column_values_to_match_regex(
"email",
r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$",
mostly=0.95
)
batch.expect_column_values_to_be_in_set(
"status",
["pending", "processing", "shipped", "delivered", "cancelled", "refunded"]
)
batch.expect_column_pair_values_a_to_be_greater_than_b(
"shipped_at", "created_at", or_equal=True
)
# Save expectations and build a validation checkpoint
batch.save_expectation_suite(discard_failed_expectations=False)
Expectations are saved as JSON and version-controlled. You run them as a checkpoint in your pipeline, and GX generates an HTML report showing which expectations passed and failed, with example failing values.
Data Contracts: Schema Enforcement Between Teams
A data contract is an agreement between the producer of a dataset and its consumers about the schema, semantics, and SLAs of that data. When upstream teams change a field name, add a null to a previously non-null column, or change a field's semantics, they break their data contract -- and every downstream consumer breaks silently.
Data contracts can be as simple as a documented schema checked via a CI test:
# contracts/orders_contract.py
ORDERS_CONTRACT = {
"order_id": {"type": "string", "nullable": False, "unique": True},
"customer_id": {"type": "string", "nullable": False},
"amount_usd": {"type": "float", "nullable": False, "min": 0.0},
"status": {"type": "string", "nullable": False,
"allowed_values": ["pending", "shipped", "delivered", "cancelled"]},
"created_at": {"type": "datetime", "nullable": False},
}
More mature implementations use tools like Soda Core, Pydantic-based schemas, or the emerging data contract specification format (datacontract.com).
Common Production Data Quality Failures
Upstream schema change. The most common failure. A backend team renames user_id to account_id in their database. Your pipeline joins on user_id, gets no matches, produces an empty output. If you do not have schema validation or row count checks, this failure is invisible until a stakeholder notices the dashboard is empty.
Silent introduction of nulls. A previously always-populated field starts arriving as null for new records (perhaps due to a code change in the source system). Your aggregate metrics drift downward because you are averaging fewer values. No error is thrown.
Timezone change. A source system switches from UTC to local time without notice. All timestamps shift by an offset. Time-window features (last 7 days of activity) now include wrong records.
Duplicated join keys. A new data feed introduces duplicate order_id values. Your join multiplies rows. Your revenue metric doubles overnight.
How to catch these automatically: row count monitoring (alert if today's count is < 80% or > 120% of yesterday's), schema drift detection (alert on new or removed columns, type changes), null rate monitoring (alert if null rate for a key field increases), and aggregate monitoring (alert if daily revenue deviates more than 2 standard deviations from the trailing 14-day mean).
Implementing a Monitoring Dashboard
# Minimal data quality monitoring
import pandas as pd
from datetime import datetime, timedelta
def check_pipeline_health(df: pd.DataFrame, table_name: str) -> dict:
"""Run basic health checks on a freshly loaded DataFrame."""
issues = []
# Row count
if len(df) == 0:
issues.append(f"CRITICAL: {table_name} has 0 rows")
# Key columns not null
for col in ["id", "created_at"]:
null_pct = df[col].isnull().mean()
if null_pct > 0.01:
issues.append(f"WARNING: {col} has {null_pct:.1%} null values")
# No future timestamps
if "created_at" in df.columns:
future_records = (df["created_at"] > datetime.utcnow()).sum()
if future_records > 0:
issues.append(f"WARNING: {future_records} records have future created_at")
return {"table": table_name, "row_count": len(df), "issues": issues}
Keep Reading
- Data Pipeline Guide -- where to integrate quality checks
- dbt Data Transformation Guide -- dbt's built-in testing for the transformation layer
- Feature Store Guide -- quality concerns in ML feature serving
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.