Data Quality: The Six Dimensions and How to Enforce Them in Production

Data quality determines model quality. Here is how to measure, test, and automatically enforce data quality across the six core dimensions.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

11 min read

// tags

#data-quality #great-expectations #data-contracts #data-engineering

Great Expectations: Data Quality as Code

Great Expectations (GX) is a widely used open-source tool for defining data quality rules as code and running them as automated tests in your pipeline.

import great_expectations as gx

context = gx.get_context()

# Get a validator for the data to check (Validator-style API; the exact
# arguments depend on your GX version and datasource setup)
batch = context.get_validator(
    datasource_name="postgres_db",
    data_asset_name="orders",
)

# Define expectations
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_not_be_null("customer_id")
batch.expect_column_values_to_be_between(
    "amount_usd",
    min_value=0.01,
    max_value=50000.0,
    mostly=0.99  # Allow 1% exceptions
)
batch.expect_column_values_to_match_regex(
    "email",
    # The dot before the TLD is escaped so it matches a literal "."
    r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    mostly=0.95
)
batch.expect_column_values_to_be_in_set(
    "status",
    ["pending", "processing", "shipped", "delivered", "cancelled", "refunded"]
)
batch.expect_column_pair_values_a_to_be_greater_than_b(
    "shipped_at", "created_at", or_equal=True
)

# Save expectations and build a validation checkpoint
batch.save_expectation_suite(discard_failed_expectations=False)

Expectations are saved as JSON and version-controlled. You run them as a checkpoint in your pipeline, and GX generates an HTML report showing which expectations passed and failed, with example failing values.
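
To make the mostly parameter concrete, here is a minimal plain-Python sketch of the idea. It is not GX's implementation: the passes_mostly helper is hypothetical, and ignoring nulls when computing the fraction is an assumption about the intended semantics. The expectation passes when at least the given fraction of non-null values satisfies the condition.

```python
from typing import Callable, Iterable, Optional


def passes_mostly(
    values: Iterable[Optional[float]],
    condition: Callable[[float], bool],
    mostly: float = 1.0,
) -> bool:
    """Return True if at least `mostly` of the non-null values satisfy `condition`."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # assumption: a column with nothing to check passes
    passing = sum(1 for v in non_null if condition(v))
    return passing / len(non_null) >= mostly


def in_range(value: float) -> bool:
    """Same bounds as the amount_usd expectation above."""
    return 0.01 <= value <= 50000.0


# 199 valid amounts, one out-of-range amount, one null
amounts = [10.0] * 199 + [75000.0] + [None]

strict_ok = passes_mostly(amounts, in_range, mostly=1.0)    # one bad value fails it
lenient_ok = passes_mostly(amounts, in_range, mostly=0.99)  # 199/200 is enough
```

A threshold like this keeps a handful of known-bad legacy rows from failing every run, at the cost of hiding small regressions, so it is worth choosing per column rather than globally.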

Data Contracts: Schema Enforcement Between Teams

A data contract is an agreement between the producer of a dataset and its consumers about the schema, semantics, and SLAs of that data. When upstream teams rename a field, start sending nulls in a previously non-null column, or change a field's semantics, they break the contract, and downstream consumers can fail silently.

Data contracts can be as simple as a documented schema checked via a CI test:

# contracts/orders_contract.py
ORDERS_CONTRACT = {
    "order_id": {"type": "string", "nullable": False, "unique": True},
    "customer_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "float", "nullable": False, "min": 0.0},
    "status": {"type": "string", "nullable": False,
               "allowed_values": ["pending", "shipped", "delivered", "cancelled"]},
    "created_at": {"type": "datetime", "nullable": False},
}
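
A CI test for a contract like this could look roughly like the sketch below. It validates plain dict records against a contract in the same shape as ORDERS_CONTRACT. The validate_records helper, the PY_TYPES mapping, and the sample rows are all hypothetical, and the standard library is used so the check runs without any pipeline dependencies.

```python
from datetime import datetime

# Same shape as the contract above, trimmed to three fields for brevity.
CONTRACT = {
    "order_id": {"type": "string", "nullable": False, "unique": True},
    "amount_usd": {"type": "float", "nullable": False, "min": 0.0},
    "status": {"type": "string", "nullable": False,
               "allowed_values": ["pending", "shipped", "delivered", "cancelled"]},
}

# Assumed mapping from contract type names to Python types.
PY_TYPES = {"string": str, "float": (int, float), "datetime": datetime}


def validate_records(records, contract):
    """Return a list of human-readable contract violations (empty if none)."""
    errors = []
    seen = {field: set() for field, rule in contract.items() if rule.get("unique")}
    for i, row in enumerate(records):
        for field, rule in contract.items():
            if field not in row:
                errors.append(f"row {i}: missing field {field}")
                continue
            value = row[field]
            if value is None:
                if not rule.get("nullable", True):
                    errors.append(f"row {i}: {field} is null")
                continue
            if not isinstance(value, PY_TYPES[rule["type"]]):
                errors.append(f"row {i}: {field} has type {type(value).__name__}")
                continue
            if "min" in rule and value < rule["min"]:
                errors.append(f"row {i}: {field}={value} is below {rule['min']}")
            if "allowed_values" in rule and value not in rule["allowed_values"]:
                errors.append(f"row {i}: {field}={value!r} is not allowed")
            if field in seen:
                if value in seen[field]:
                    errors.append(f"row {i}: duplicate {field}={value!r}")
                seen[field].add(value)
    return errors


good = [{"order_id": "A1", "amount_usd": 10.0, "status": "pending"}]
bad = [
    {"order_id": "A1", "amount_usd": -5.0, "status": "pending"},
    {"order_id": "A1", "amount_usd": 3.0, "status": "lost"},
    {"order_id": None, "amount_usd": 1.0},
]
good_errors = validate_records(good, CONTRACT)
bad_errors = validate_records(bad, CONTRACT)
```

In CI, a test would load a sample of the producer's output and assert that the returned list is empty, so a breaking change fails the producer's build rather than the consumer's dashboard.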

More mature implementations use tools like Soda Core, Pydantic-based schemas, or the emerging data contract specification format (datacontract.com).

Common Production Data Quality Failures

Upstream schema change. One of the most common failures. A backend team renames user_id to account_id in their database. Your pipeline joins on user_id, gets no matches, and produces empty output. Without schema validation or row count checks, the failure stays invisible until a stakeholder notices that the dashboard is empty.
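
A rename like this is cheap to catch before the join runs. The sketch below compares an expected column-to-type mapping against what actually arrived. The detect_schema_drift helper and both schemas are hypothetical, and type names are treated as plain strings.

```python
def detect_schema_drift(expected: dict, actual: dict) -> dict:
    """Compare two {column: type_name} mappings and report the differences."""
    expected_cols, actual_cols = set(expected), set(actual)
    return {
        "removed": sorted(expected_cols - actual_cols),
        "added": sorted(actual_cols - expected_cols),
        "type_changed": sorted(
            col for col in expected_cols & actual_cols
            if expected[col] != actual[col]
        ),
    }


expected_schema = {"user_id": "string", "amount_usd": "float", "created_at": "datetime"}
todays_schema = {"account_id": "string", "amount_usd": "string", "created_at": "datetime"}

drift = detect_schema_drift(expected_schema, todays_schema)
has_drift = any(drift.values())  # True if any of the three lists is non-empty
```

With pandas, the actual mapping can be built from df.dtypes.astype(str).to_dict() right after loading.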

Silent introduction of nulls. A previously always-populated field starts arriving as null for new records (perhaps due to a code change in the source system). Sums and counts drop, and averages shift because they are now computed over fewer values. No error is thrown.
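
One way to surface this is to compare today's null rate for a key field against a baseline. This is a minimal sketch: the null_rate and null_rate_alert helpers, the 1 percentage point tolerance, and the sample records are assumptions to tune for your data.

```python
def null_rate(records, field):
    """Fraction of records where `field` is missing or None (0.0 for no records)."""
    if not records:
        return 0.0
    nulls = sum(1 for row in records if row.get(field) is None)
    return nulls / len(records)


def null_rate_alert(baseline_rate, current_rate, max_increase=0.01):
    """Alert when the null rate rises more than `max_increase` above the baseline."""
    return current_rate - baseline_rate > max_increase


yesterday = [{"email": "a@example.com"}] * 99 + [{"email": None}]
today = [{"email": "a@example.com"}] * 90 + [{"email": None}] * 10

baseline = null_rate(yesterday, "email")  # 1 null in 100 records
current = null_rate(today, "email")       # 10 nulls in 100 records
alert = null_rate_alert(baseline, current)
```

Comparing against a baseline rather than a fixed ceiling matters for fields that are legitimately sparse, where only the change is informative.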

Timezone change. A source system switches from UTC to local time without notice. All timestamps shift by an offset. Time-window features (last 7 days of activity) now include the wrong records.
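
One defensive pattern is to require explicit offsets at ingestion and normalize everything to UTC before computing time windows. The sketch below is a guard, not a detector: it cannot tell that naive timestamps changed meaning, so it rejects naive values outright. The to_utc and in_last_n_days helpers and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Convert an aware datetime to UTC; reject naive ones instead of guessing."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError(f"naive timestamp {ts!r}: source must send an explicit offset")
    return ts.astimezone(timezone.utc)


def in_last_n_days(ts: datetime, now: datetime, days: int = 7) -> bool:
    """True if `ts` falls within the trailing window ending at `now`."""
    ts_utc, now_utc = to_utc(ts), to_utc(now)
    return now_utc - timedelta(days=days) <= ts_utc <= now_utc


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
# 09:30 at UTC+06:00 is 03:30 UTC on the same day
local_event = datetime(2026, 5, 18, 9, 30, tzinfo=timezone(timedelta(hours=6)))

event_utc = to_utc(local_event)
recent = in_last_n_days(local_event, now)
```

Failing loudly on naive timestamps turns a silent offset shift into an ingestion error that someone has to resolve with the source team.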

Duplicated join keys. A new data feed introduces duplicate order_id values. Your join multiplies rows. Your revenue metric doubles overnight.
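
A uniqueness check on the join key before the join catches this early. Here is a minimal sketch using the standard library; the find_duplicate_keys helper and the sample orders are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(records, key):
    """Return {key_value: count} for every key value that appears more than once."""
    counts = Counter(row[key] for row in records)
    return {value: n for value, n in counts.items() if n > 1}


orders = [
    {"order_id": "A1", "amount_usd": 20.0},
    {"order_id": "A2", "amount_usd": 35.0},
    {"order_id": "A2", "amount_usd": 35.0},  # duplicate from the new feed
]

duplicates = find_duplicate_keys(orders, "order_id")
safe_to_join = not duplicates  # an empty dict means every key is unique
```

In pandas, df["order_id"].duplicated().any() gives the same signal, and passing validate="one_to_one" or validate="many_to_one" to merge makes the join itself raise when the key is not unique on the expected side.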

How to catch these automatically: row count monitoring (alert if today's count is < 80% or > 120% of yesterday's), schema drift detection (alert on new or removed columns, type changes), null rate monitoring (alert if null rate for a key field increases), and aggregate monitoring (alert if daily revenue deviates more than 2 standard deviations from the trailing 14-day mean).
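
The row count and aggregate rules above can be sketched in a few lines. The helper names and the revenue figures are hypothetical, and the thresholds are the ones from the text (80% to 120% of yesterday's count, 2 standard deviations from the trailing mean).

```python
from statistics import mean, stdev


def row_count_alert(today: int, yesterday: int, low: float = 0.8, high: float = 1.2) -> bool:
    """Alert if today's count is below 80% or above 120% of yesterday's."""
    if yesterday == 0:
        return today != 0
    ratio = today / yesterday
    return ratio < low or ratio > high


def aggregate_alert(today_value: float, trailing: list, n_sigma: float = 2.0) -> bool:
    """Alert if today's value is more than n_sigma std devs from the trailing mean."""
    if len(trailing) < 2:
        return False  # not enough history to estimate the spread
    mu, sigma = mean(trailing), stdev(trailing)
    if sigma == 0:
        return today_value != mu
    return abs(today_value - mu) > n_sigma * sigma


# Hypothetical trailing 14 days of daily revenue (mean is 100.0)
revenue_14d = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0,
               97.0, 100.0, 101.0, 99.0, 102.0, 98.0, 100.0]

count_alert = row_count_alert(today=700, yesterday=1000)  # 70% of yesterday
count_ok = row_count_alert(today=950, yesterday=1000)     # 95% of yesterday
revenue_alert = aggregate_alert(200.0, revenue_14d)       # revenue doubled
revenue_ok = aggregate_alert(101.0, revenue_14d)          # within normal spread
```

Data with strong weekly seasonality may need a same-weekday comparison instead of yesterday's count; that refinement is left out here.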

Implementing a Monitoring Dashboard

# Minimal data quality monitoring
import pandas as pd

def check_pipeline_health(df: pd.DataFrame, table_name: str) -> dict:
    """Run basic health checks on a freshly loaded DataFrame."""
    issues = []

    # Row count
    if len(df) == 0:
        issues.append(f"CRITICAL: {table_name} has 0 rows")

    # Key columns present and not null
    for col in ["id", "created_at"]:
        if col not in df.columns:
            issues.append(f"CRITICAL: {table_name} is missing column {col}")
            continue
        null_pct = df[col].isnull().mean()
        if null_pct > 0.01:
            issues.append(f"WARNING: {col} has {null_pct:.1%} null values")

    # No future timestamps (naive values are assumed to be UTC)
    if "created_at" in df.columns:
        created_at = pd.to_datetime(df["created_at"], utc=True)
        future_records = int((created_at > pd.Timestamp.now(tz="UTC")).sum())
        if future_records > 0:
            issues.append(f"WARNING: {future_records} records have future created_at")

    return {"table": table_name, "row_count": len(df), "issues": issues}
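
To turn per-table results into something a dashboard or alert can consume, you can roll them up into one status per table. The sketch below works on dicts in the shape returned by check_pipeline_health. The summarize_health and should_block_pipeline helpers, the sample results, and the policy of blocking only on critical issues are assumptions.

```python
def summarize_health(results):
    """Roll up per-table check results into one status per table."""
    summary = {}
    for result in results:
        issues = result["issues"]
        if any(issue.startswith("CRITICAL") for issue in issues):
            status = "critical"
        elif issues:
            status = "warning"
        else:
            status = "ok"
        summary[result["table"]] = status
    return summary


def should_block_pipeline(summary):
    """Assumed policy: stop downstream jobs only when a table is critical."""
    return "critical" in summary.values()


# Hand-written results in the shape returned by check_pipeline_health
results = [
    {"table": "orders", "row_count": 1200, "issues": []},
    {"table": "customers", "row_count": 300,
     "issues": ["WARNING: created_at has 2.5% null values"]},
    {"table": "payments", "row_count": 0,
     "issues": ["CRITICAL: payments has 0 rows"]},
]

summary = summarize_health(results)
block = should_block_pipeline(summary)
```

Storing each run's summary with a timestamp gives you the history a dashboard needs to show trends rather than only the current state.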

Keep Reading

Data Pipeline Guide -- where to integrate quality checks
dbt Data Transformation Guide -- dbt's built-in testing for the transformation layer
Feature Store Guide -- quality concerns in ML feature serving
