Why Data Quality Matters for ML
"Garbage in, garbage out" is not just a saying: bad input data is among the most common causes of ML model failures in production. A model trained on clean data and then served corrupted data degrades silently, because nothing raises an error. Great Expectations (GX) adds a data quality layer to your pipeline that fails loudly when data does not meet your expectations.
Core Concepts
Expectation: a testable assertion about your data. "Column X should not be null." "Column Y mean should be between 100 and 200."
Expectation Suite: a collection of expectations for a dataset.
Checkpoint: runs an Expectation Suite against data and produces a validation result.
Data Docs: auto-generated HTML documentation showing your expectations and validation history.
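To make the vocabulary concrete, here is a minimal pure-Python sketch of how an expectation, a suite, and a checkpoint relate to each other. It does not use GX at all: the names `Expectation`, `not_null`, `in_set`, and `run_checkpoint` are illustrative stand-ins, not the library's API, and the sample rows are made up.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Expectation:
    """A named, testable assertion about a dataset (a list of rows)."""
    name: str
    check: Callable[[list[Row]], bool]


def not_null(column: str) -> Expectation:
    # Passes only if every row has a non-None value in `column`.
    return Expectation(
        name=f"{column} is not null",
        check=lambda rows: all(r.get(column) is not None for r in rows),
    )


def in_set(column: str, allowed: set) -> Expectation:
    # Passes only if every value in `column` is one of `allowed`.
    return Expectation(
        name=f"{column} is in the allowed set",
        check=lambda rows: all(r.get(column) in allowed for r in rows),
    )


def run_checkpoint(suite: list[Expectation], rows: list[Row]) -> dict:
    """Run every expectation in the suite and collect a validation result."""
    results = [{"expectation": e.name, "success": e.check(rows)} for e in suite]
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical sample data: the second row violates the status expectation.
orders = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "lost"},
]
suite = [
    not_null("order_id"),
    in_set("status", {"pending", "shipped", "delivered", "cancelled"}),
]
result = run_checkpoint(suite, orders)
print(result["success"])
```

The suite is just a list, and the checkpoint is the step that binds a suite to a concrete batch of data. GX layers persistence, batch handling, and reporting on top of the same idea.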
Defining Expectations
import great_expectations as gx
import pandas as pd
context = gx.get_context()
df = pd.read_csv("orders.csv")
# context.sources is the fluent API of the GX 0.x releases; GX 1.x uses context.data_sources
validator = context.sources.pandas_default.read_dataframe(df)
# Basic column expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_in_set("status", ["pending", "shipped", "delivered", "cancelled"])
# Statistical expectations
validator.expect_column_mean_to_be_between("total_usd", min_value=10.0, max_value=500.0)
validator.expect_column_min_to_be_between("total_usd", min_value=0.0, max_value=1.0)
validator.expect_column_values_to_be_between("total_usd", min_value=0, max_value=10000)
# Schema expectations
validator.expect_column_to_exist("created_at")
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=10_000_000)
validator.save_expectation_suite(discard_failed_expectations=False)
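Real data is rarely perfect, so many column-level expectations in GX accept a `mostly` argument: the expectation passes when at least that fraction of values satisfies it. The pure-Python sketch below illustrates the idea only. `passes_mostly` is a hypothetical helper, not a GX function, the sample totals are invented, and the sketch glosses over details such as how GX treats null values.

```python
def passes_mostly(values, predicate, mostly=1.0):
    """Return True when at least `mostly` (0..1) of `values` satisfy `predicate`."""
    if not values:
        return True  # an empty column has nothing that can violate the rule
    passing = sum(1 for v in values if predicate(v))
    return passing / len(values) >= mostly


def in_range(value):
    # Mirrors the 0..10000 bound used for total_usd above.
    return 0 <= value <= 10_000


# Hypothetical order totals: one negative value out of ten.
totals = [12.5, 40.0, 99.9, -3.0, 250.0, 18.0, 75.0, 31.0, 64.0, 120.0]

strict = passes_mostly(totals, in_range)               # every value must pass
tolerant = passes_mostly(totals, in_range, mostly=0.9)  # 90% must pass
print(strict, tolerant)
```

With the validator above, the equivalent would look like `validator.expect_column_values_to_be_between("total_usd", min_value=0, max_value=10000, mostly=0.99)`. Pick the threshold from what downstream consumers can tolerate, not from what makes the check pass today.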
Connecting to Real Databases
import great_expectations as gx
context = gx.get_context()
# PostgreSQL
pg_datasource = context.sources.add_postgres(
    name="production_db",
    connection_string="postgresql://user:pass@host:5432/dbname",
)
asset = pg_datasource.add_table_asset("orders", table_name="orders")
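# BigQuery, through the generic SQL datasource
# (assumes the sqlalchemy-bigquery dialect is installed)
bq_datasource = context.sources.add_sql(
    name="bigquery_prod",
    connection_string="bigquery://my-project",
)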
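The connection strings above embed credentials directly, which is fine for a demo but risky in a repository. One common approach is to assemble the string from environment variables at runtime. The sketch below assumes the libpq-style variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE`; both those names and the helper `postgres_url_from_env` are illustrative choices, not something GX requires.

```python
import os
from urllib.parse import quote


def postgres_url_from_env(env=os.environ) -> str:
    """Build a PostgreSQL connection string from environment variables."""
    # Percent-encode user and password so characters like "@" or "/"
    # cannot be mistaken for URL delimiters.
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Demonstration with a plain dict standing in for the real environment.
example_env = {
    "PGUSER": "gx_reader",
    "PGPASSWORD": "p@ss/word",
    "PGDATABASE": "dbname",
}
url = postgres_url_from_env(example_env)
print(url)
```

The returned string can then be passed as `connection_string=` when adding the datasource. A missing required variable raises `KeyError` immediately, which is preferable to a confusing authentication failure later.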
Checkpoints for Pipeline Integration
checkpoint = context.add_or_update_checkpoint(
    name="daily_orders_checkpoint",
    validations=[
        {
            "batch_request": asset.build_batch_request(),
            "expectation_suite_name": "orders_suite",
        }
    ],
)
results = checkpoint.run()
if not results["success"]:
    raise ValueError("Data quality check failed: pipeline halted")
Embed this in your Airflow DAG or Prefect flow to halt pipelines on data quality failures.
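When a pipeline halts, the on-call engineer needs to know which expectation failed, not just that something did. The sketch below builds a more informative error message from a checkpoint result. It assumes the result has already been converted to a plain dict (for example with `results.to_json_dict()`) and has the nested shape shown in `sample`; that shape, and the helper names `failed_expectations` and `gate`, are assumptions to verify against your GX version.

```python
def failed_expectations(checkpoint_result: dict) -> list[str]:
    """List the failed expectations found in a dict-shaped checkpoint result."""
    failures = []
    for run in checkpoint_result.get("run_results", {}).values():
        for item in run["validation_result"]["results"]:
            if not item["success"]:
                config = item["expectation_config"]
                # Table-level expectations have no "column" kwarg.
                column = config["kwargs"].get("column", "<table>")
                failures.append(f"{config['expectation_type']} on {column}")
    return failures


def gate(checkpoint_result: dict) -> None:
    """Raise ValueError naming the failed expectations; return None on success."""
    if checkpoint_result["success"]:
        return
    details = "; ".join(failed_expectations(checkpoint_result)) or "no details available"
    raise ValueError(f"Data quality check failed: {details}")


# Hypothetical result with the assumed shape: one passing, one failing expectation.
sample = {
    "success": False,
    "run_results": {
        "orders_suite/run_1": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_to_exist",
                            "kwargs": {"column": "created_at"},
                        },
                    },
                    {
                        "success": False,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "order_id"},
                        },
                    },
                ],
            }
        }
    },
}

try:
    gate(sample)
except ValueError as exc:
    message = str(exc)
    print(message)
```

In an Airflow task or Prefect flow, calling a function like `gate` as the last step of the validation task lets the raised exception mark the task as failed, which keeps downstream tasks from running on bad data.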
Generating Data Docs
context.build_data_docs()  # render the static HTML site
context.open_data_docs()   # open the local site in a browser
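Data Docs list every Expectation Suite along with the results of each validation run, so you can see what was checked, when, and whether it passed. By default the site is built on the local filesystem; to share it with data consumers as documentation of your SLAs, host the generated HTML somewhere they can reach, such as a cloud storage bucket.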
GX Cloud vs Self-Hosted
GX Core (open source): all the above features, self-managed. Free.
GX Cloud: managed UI, team collaboration, alerting. Free tier for small teams.
GX vs Soda Core
Soda Core is an alternative that defines checks in YAML instead of Python, which makes it simpler for non-engineers. GX is more powerful for complex statistical expectations and integrates more tightly with Python code.