A data pipeline is a system that moves data from one or more sources to one or more destinations, typically transforming it along the way. Pipelines are the plumbing of data infrastructure. They are unglamorous, they are critical, and building them badly creates cascading failures that are difficult to debug.
This guide covers the three pipeline types, the tools that implement each, and the decision process for choosing what your situation actually requires.
Pipeline Types
Batch pipelines process data on a schedule. Run at midnight, pull yesterday's orders, compute daily summaries, write to the data warehouse. Batch is the right choice when: data consumers can tolerate latency (daily, hourly), source data is updated periodically (daily database exports, weekly API dumps), or the transformation is expensive and only needs to run occasionally. Batch is simpler to build, test, and debug than streaming.
Streaming pipelines process data as it arrives. An event is emitted, the pipeline picks it up within milliseconds and processes it. Streaming is the right choice when: decisions must be made on fresh data (fraud detection, real-time recommendations, alerting), the source produces a continuous event stream (user clicks, IoT sensors, log data), or downstream consumers need sub-minute latency. Streaming is harder to build, test, and debug. Do not use it unless you have a genuine latency requirement.
Micro-batch pipelines run very frequently (every 1-5 minutes). They are a practical middle ground: simpler than true streaming but with much lower latency than nightly batch. Most use cases that feel like they need streaming can be solved with micro-batch.
Batch Orchestration: Airflow vs Prefect
Apache Airflow is the industry standard for batch pipeline orchestration. You define workflows as Directed Acyclic Graphs (DAGs) in Python. Tasks in the DAG represent units of work; edges represent dependencies.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
default_args = {
"owner": "data-team",
"retries": 3,
"retry_delay": timedelta(minutes=5),
}
with DAG(
"daily_revenue_pipeline",
default_args=default_args,
schedule_interval="0 6 * * *", # 6am daily
start_date=datetime(2026, 1, 1),
catchup=False,
) as dag:
extract = PythonOperator(
task_id="extract_orders",
python_callable=extract_orders_from_db,
)
transform = PythonOperator(
task_id="compute_revenue_summary",
python_callable=compute_revenue_summary,
)
load = PythonOperator(
task_id="load_to_warehouse",
python_callable=load_to_data_warehouse,
)
extract >> transform >> load
Airflow is powerful but has a steep learning curve. The scheduler, worker, and web UI are separate components. It requires a database for state storage. For teams of 10+ engineers with dedicated data engineering time, Airflow is appropriate.
Prefect takes a more Pythonic approach. You decorate functions with @flow and @task and run them. The local development experience is dramatically simpler than Airflow.
from prefect import flow, task
from prefect.schedules import CronSchedule
@task(retries=3, retry_delay_seconds=300)
def extract_orders():
return fetch_orders_from_database()
@task
def compute_summary(orders):
return aggregate_to_daily_revenue(orders)
@task
def load_to_warehouse(summary):
write_to_snowflake(summary)
@flow(schedule=CronSchedule(cron="0 6 * * *"))
def daily_revenue_pipeline():
orders = extract_orders()
summary = compute_summary(orders)
load_to_warehouse(summary)
Prefect Cloud provides a managed UI for monitoring. For small teams, Prefect is significantly easier to operate than Airflow.
dbt: SQL Transformations for the Analytics Layer
dbt (data build tool) handles the transformation layer within your data warehouse. It does not extract or load data; it transforms data that is already in the warehouse using SQL.
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
SELECT
id AS order_id,
customer_id,
CAST(created_at AS TIMESTAMP) AS created_at,
UPPER(status) AS status,
amount_cents / 100.0 AS amount_usd
FROM source
WHERE id IS NOT NULL
)
SELECT * FROM cleaned
dbt automatically infers dependencies between models, builds them in the correct order, and provides testing and documentation as first-class features.
# schema.yml: tests are defined alongside models
models:
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: amount_usd
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "> 0"
dbt runs dbt test to execute these tests after each build. This is data quality testing built into the transformation pipeline.
Data Quality Testing: Great Expectations
Great Expectations lets you define data quality assertions as code and run them as tests in your pipeline.
import great_expectations as gx
context = gx.get_context()
# Define expectations
validator = context.get_validator(
datasource_name="orders_db",
data_asset_name="orders",
)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount_usd", min_value=0, max_value=100000)
validator.expect_column_values_to_be_in_set("status", ["pending", "shipped", "delivered", "cancelled"])
validator.expect_column_pair_values_a_to_be_greater_than_b("shipped_at", "created_at")
# Save expectations as a suite and run as a checkpoint in your pipeline
When an expectation fails, your pipeline knows the data is bad before downstream consumers are affected.
ETL vs ELT: Load First, Transform Later
Traditional ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. Modern ELT (Extract, Load, Transform) loads raw data first, then transforms it using warehouse compute.
ELT is now the dominant pattern for cloud data warehouses (Snowflake, BigQuery, Redshift) for good reasons: raw data is preserved for reprocessing, transformations can be changed without re-extracting, warehouse compute is cheap, and dbt handles the T in ELT elegantly.
Use ETL when: you have strict data residency requirements (PII must be masked before entering the warehouse), transformation is computationally expensive and better done at the source, or you are loading into a relational OLTP database that cannot handle raw data.
Error Handling in Pipelines
Production pipelines fail. The question is how they fail and whether you know about it.
Retry logic handles transient failures (network timeouts, database deadlocks). Airflow and Prefect both provide configurable retry with backoff. Set retries=3 and retry_delay=5 minutes for most tasks.
Dead letter queues catch records that fail processing in streaming systems. Instead of dropping bad records, write them to a DLQ for inspection and replay.
Alerting on failure. Every production pipeline needs an alert on failure. Airflow and Prefect support email and Slack notifications. A silent pipeline failure that processes no data for 24 hours before someone notices is a common disaster.
Idempotency. Design pipeline tasks to be idempotent: running the same task twice produces the same result as running it once. This makes retries safe. Use IF NOT EXISTS, MERGE statements, or partition-based overwrite strategies.
When You Need a Pipeline vs a Cron Job
A cron job calling a Python script is a pipeline. For simple, single-step processes that run infrequently and have low criticality, a cron job is perfectly appropriate. Do not introduce Airflow for a weekly script that takes 30 seconds.
You need a proper orchestration tool when: the pipeline has multiple dependent steps that need to run in order, you need visibility into pipeline history and failures, multiple people need to understand and modify the pipeline, or you need retry logic, alerting, and parallelism.
Keep Reading
- dbt Data Transformation Guide — deep dive into the transformation layer
- Data Quality Guide — ensuring pipeline output is trustworthy
- Feature Store Guide — pipelines for ML feature serving
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.