Great Expectations: Automated Data Quality Checks for ML Pipelines

Great Expectations lets you define what good data looks like, validate it automatically in your pipeline, and generate documentation, so data issues are caught before they corrupt your models.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 7, 2026

7 min read

// tags

#great-expectations #data-quality #validation #testing #mlops

Defining Expectations

import great_expectations as gx
import pandas as pd

context = gx.get_context()

df = pd.read_csv("orders.csv")
validator = context.sources.pandas_default.read_dataframe(df)

# Basic column expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_in_set("status", ["pending", "shipped", "delivered", "cancelled"])

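# Statistical expectations
# The column mean must fall in [10.0, 500.0]
validator.expect_column_mean_to_be_between("total_usd", min_value=10.0, max_value=500.0)
# The smallest value in the column must itself fall in [0.0, 1.0]
validator.expect_column_min_to_be_between("total_usd", min_value=0.0, max_value=1.0)
# Every individual value must fall in [0, 10000]
validator.expect_column_values_to_be_between("total_usd", min_value=0, max_value=10000)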

# Schema expectations
validator.expect_column_to_exist("created_at")
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=10_000_000)

validator.save_expectation_suite(discard_failed_expectations=False)
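To make the semantics of these expectations concrete, here is a plain-Python sketch of what two of them check. It is an illustration only, not how Great Expectations implements them: the helper names `values_in_set` and `column_min_between` and the sample rows are hypothetical. The `mostly` argument mirrors the GX parameter of the same name, which lets a per-value check pass when at least that fraction of values conform.

```python
from typing import Any, Iterable, Mapping, Sequence


def values_in_set(
    rows: Sequence[Mapping[str, Any]],
    column: str,
    allowed: Iterable[Any],
    mostly: float = 1.0,
) -> bool:
    """Pass when at least `mostly` of the non-null values are in `allowed`."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return True  # assumption for this sketch: nothing to check means pass
    allowed_set = set(allowed)
    passing = sum(1 for value in values if value in allowed_set)
    return passing / len(values) >= mostly


def column_min_between(
    rows: Sequence[Mapping[str, Any]],
    column: str,
    min_value: float,
    max_value: float,
) -> bool:
    """Pass when the smallest non-null value lies in [min_value, max_value]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return False  # assumption for this sketch: no minimum means fail
    return min_value <= min(values) <= max_value


# Hypothetical sample data: one row has a status outside the allowed set.
orders = [
    {"order_id": 1, "status": "pending", "total_usd": 0.5},
    {"order_id": 2, "status": "shipped", "total_usd": 42.0},
    {"order_id": 3, "status": "delivered", "total_usd": 310.0},
    {"order_id": 4, "status": "refunded", "total_usd": 18.0},
]
STATUSES = ["pending", "shipped", "delivered", "cancelled"]

strict = values_in_set(orders, "status", STATUSES)
lenient = values_in_set(orders, "status", STATUSES, mostly=0.75)
min_ok = column_min_between(orders, "total_usd", 0.0, 1.0)
```

With this sample, the strict check fails because of the single `refunded` row, while the `mostly=0.75` variant tolerates it. The minimum check looks only at the smallest value in the column, not at every row.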

Connecting to Real Databases

import great_expectations as gx

context = gx.get_context()

# PostgreSQL
pg_datasource = context.sources.add_postgres(
    name="production_db",
    connection_string="postgresql://user:pass@host:5432/dbname",
)
asset = pg_datasource.add_table_asset("orders", table_name="orders")

# BigQuery (through the generic SQL datasource; requires the sqlalchemy-bigquery package)
bq_datasource = context.sources.add_sql(
    name="bigquery_prod",
    connection_string="bigquery://my-project",
)
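Hardcoding credentials in a connection string is risky outside a demo. One option, sketched here under assumptions, is to assemble the PostgreSQL URL from environment variables. The helper name `postgres_url_from_env` is hypothetical, and the variable names follow the libpq convention (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`); adjust them to whatever your deployment uses.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def postgres_url_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Build a postgresql:// URL from environment variables.

    The user and password are URL-encoded so characters such as '@' or '/'
    in a password do not break the URL. Raises KeyError if a required
    variable is missing.
    """
    user = quote_plus(env["PGUSER"])
    password = quote_plus(env["PGPASSWORD"])
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")  # default PostgreSQL port
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Usage (not executed here because it needs the variables to be set):
# connection_string = postgres_url_from_env()
```

The returned string can be passed as the `connection_string` argument when adding the datasource.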

Checkpoints for Pipeline Integration

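checkpoint = context.add_or_update_checkpoint(
    name="daily_orders_checkpoint",
    validations=[
        {
            # `asset` is the table asset created in the previous section
            "batch_request": asset.build_batch_request(),
            # Must match the name of a suite saved in this context
            "expectation_suite_name": "orders_suite",
        }
    ],
)

results = checkpoint.run()

# Stop the pipeline if any validation in the checkpoint failed
if not results["success"]:
    raise ValueError("Data quality check failed: pipeline halted")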

Embed this in your Airflow DAG or Prefect flow to halt pipelines on data quality failures.
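One way to package this for an orchestrator is a small gate function that a task calls. The sketch below is an assumption-laden illustration rather than an official integration: `run_quality_gate` and `DataQualityError` are hypothetical names, and the stub class only stands in for a real checkpoint so the example runs without Great Expectations installed. Raising an exception from an Airflow task callable or a Prefect task marks that task as failed, which under default settings prevents downstream tasks from running.

```python
from typing import Any


class DataQualityError(RuntimeError):
    """Raised when a checkpoint reports a failed validation."""


def run_quality_gate(checkpoint: Any, name: str = "checkpoint") -> Any:
    """Run a checkpoint and raise if it did not succeed.

    `checkpoint` is any object whose run() returns a result that supports
    result["success"], as the checkpoint result in this article does.
    """
    result = checkpoint.run()
    if not result["success"]:
        raise DataQualityError(
            f"Data quality check failed for {name!r}: pipeline halted"
        )
    return result


class StubCheckpoint:
    """Stand-in for a real checkpoint, used only to exercise the gate."""

    def __init__(self, success: bool) -> None:
        self._success = success

    def run(self) -> dict:
        return {"success": self._success}


# A passing checkpoint returns its result; a failing one raises.
ok_result = run_quality_gate(StubCheckpoint(True), "daily_orders_checkpoint")
try:
    run_quality_gate(StubCheckpoint(False), "daily_orders_checkpoint")
    gate_raised = False
except DataQualityError:
    gate_raised = True
```

Using a dedicated exception type instead of `ValueError` makes it easier to tell data quality failures apart from ordinary bugs in alerting and retry rules.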

Generating Data Docs

great_expectations docs build  # builds the static site and opens it in your browser

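Data Docs show every expectation in a suite along with the results of each validation run, so you can see when a check last ran and whether it passed. The output is a static site: host it somewhere your data consumers can reach and share that URL to document your SLAs.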
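The docs can also be rebuilt from Python, which is convenient at the end of a pipeline run and does not depend on the command-line tool being available in your GX release. A minimal sketch, assuming a data context that exposes `build_data_docs()` and `open_data_docs()` as the context used in this article does; the wrapper name `refresh_data_docs` is hypothetical.

```python
from typing import Any


def refresh_data_docs(context: Any, open_browser: bool = False) -> Any:
    """Rebuild Data Docs and optionally open them locally.

    Returns whatever build_data_docs() returns (assumed here to describe
    the sites that were built).
    """
    sites = context.build_data_docs()
    if open_browser:
        # Only useful on a machine with a browser, not on a CI worker
        context.open_data_docs()
    return sites


# Usage with a real context (not executed here):
# import great_expectations as gx
# refresh_data_docs(gx.get_context(), open_browser=True)
```

Keeping `open_browser` off by default avoids trying to launch a browser on headless pipeline workers.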

GX Cloud vs Self-Hosted

GX Core (open source): all the above features, self-managed. Free.

GX Cloud: managed UI, team collaboration, alerting. Free tier for small teams.

GX vs Soda Core

Soda Core is an alternative that defines checks in YAML instead of Python, which makes it simpler for non-engineers. GX is more powerful for complex statistical expectations and integrates more tightly with Python.

Resources: Great Expectations, GitHub, docs.
