The Orchestration Problem
Data pipelines break. Upstream data arrives late. APIs rate-limit. Databases go down. A workflow orchestrator handles scheduling, retries, logging, alerts, and dependency management so your pipelines are observable and recoverable.
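To make "retries" concrete, here is a minimal sketch of the retry loop an orchestrator runs around each task. It is plain Python, not any tool's actual implementation; `run_with_retries` and `flaky_extract` are hypothetical names invented for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T], retries: int = 3, delay_s: float = 0.0) -> T:
    """Call fn, retrying up to `retries` extra times if it raises."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure so it can be alerted on
            attempt += 1
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay_s}s")
            time.sleep(delay_s)


# Hypothetical flaky task: fails twice (late upstream data), then succeeds.
calls = {"n": 0}


def flaky_extract() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream not ready")
    return "rows"


result = run_with_retries(flaky_extract, retries=3, delay_s=0.0)
```

A real orchestrator adds what this sketch leaves out: persisted run state, logs per attempt, and alerts when the final attempt fails.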
Three tools dominate in 2026: Apache Airflow, Prefect, and Dagster.
Apache Airflow
The original. Airflow started at Airbnb in 2014 and became the industry default. It is mature, battle-tested, and has 1,000+ operators covering most major databases, cloud platforms, and SaaS APIs.
from datetime import datetime, timedelta

from airflow import DAG
# Airflow 2.x import path; Airflow 3 moves this operator to
# airflow.providers.standard.operators.python
from airflow.operators.python import PythonOperator

def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    "etl_pipeline",
    schedule="@daily",  # `schedule_interval` is deprecated since Airflow 2.4 and removed in 3.0
    start_date=datetime(2025, 1, 1),
    catchup=False,  # do not backfill runs between start_date and today
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Run extract, then transform, then load
    t1 >> t2 >> t3
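Airflow runs a scheduler process, a webserver, and workers. With the Celery executor, tasks run on a pool of long-lived workers; with the Kubernetes executor, each task runs in its own pod, so capacity scales horizontally with the cluster. The UI is functional but dated.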
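The `>>` chaining in the DAG above declares dependencies, and the scheduler only starts a task once everything upstream of it has succeeded. As a rough sketch of that ordering step (not Airflow's scheduler code), the standard library's `graphlib` can sort the same three tasks; the `dependencies` mapping and `execution_order` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on,
# mirroring extract >> transform >> load.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['extract', 'transform', 'load']
```

The cycle check is the reason these tools insist on a directed acyclic graph: a cycle has no valid run order.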
When to choose Airflow: your team already runs it, you need a specific operator that exists nowhere else, or you work in a data platform that expects Airflow.
Prefect
Prefect is the Python-first alternative. You decorate existing Python functions; there is no DAG DSL to learn.
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash

# fetch_data, clean, and db are placeholders for your own client code.

# Cache results by input for one hour; retry up to 3 times on failure
@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1), retries=3)
def extract(url: str) -> list:
    return fetch_data(url)

@task
def transform(data: list) -> list:
    return [clean(row) for row in data]

@task
def load(data: list) -> None:
    db.insert_many(data)

@flow(name="ETL Pipeline")
def etl_pipeline(url: str = "https://api.example.com/data"):
    raw = extract(url)
    cleaned = transform(raw)
    load(cleaned)

if __name__ == "__main__":
    etl_pipeline()
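The `cache_key_fn` and `cache_expiration` arguments on `extract` mean a repeat call with the same input inside the hour reuses the stored result instead of hitting the API again. The sketch below shows that idea with an in-memory dictionary. It is a simplified stand-in, not Prefect's implementation (which persists results and builds its key differently); `input_hash` and `cached` are hypothetical helpers.

```python
import functools
import hashlib
import json
import time
from typing import Any, Callable


def input_hash(*args: Any, **kwargs: Any) -> str:
    """Stable hash of JSON-serializable arguments."""
    payload = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached(expiration_s: float) -> Callable:
    """Decorator: reuse a result while its cache entry is younger than expiration_s."""
    def decorator(fn: Callable) -> Callable:
        store: dict[str, tuple[float, Any]] = {}

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            key = input_hash(*args, **kwargs)
            now = time.monotonic()
            hit = store.get(key)
            if hit is not None and now - hit[0] < expiration_s:
                return hit[1]  # fresh entry: skip the call
            value = fn(*args, **kwargs)
            store[key] = (now, value)
            return value

        return wrapper
    return decorator


calls = {"extract": 0}


@cached(expiration_s=3600)
def extract(url: str) -> list:
    calls["extract"] += 1  # count real executions
    return [url, "row"]


first = extract("https://api.example.com/data")
second = extract("https://api.example.com/data")   # same input: served from cache
third = extract("https://api.example.com/other")   # different input: runs again
```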
Prefect Cloud has a generous free tier (unlimited flows, 3 workspaces). Self-hosted Prefect Server is also available.
When to choose Prefect: you are starting a new project, your team is Python-first, you want fast setup, or you need dynamic workflows (flows that create tasks at runtime based on data).
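"Dynamic" here means the number of task runs is not known until the flow is already running. The sketch below shows the shape of that pattern with the standard library only; in Prefect you would get the same effect by calling tasks inside an ordinary loop or by mapping a task over a list. `list_partitions`, `process`, and `dynamic_pipeline` are hypothetical names, and the partition names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def list_partitions(day: str) -> list[str]:
    """Hypothetical discovery step: the partition count is only known at runtime."""
    return [f"{day}/part-{i}" for i in range(3)]


def process(partition: str) -> int:
    """Hypothetical per-partition task; returns a stand-in metric."""
    return len(partition)


def dynamic_pipeline(day: str) -> dict[str, int]:
    partitions = list_partitions(day)
    # One unit of work per discovered partition, run concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(process, partitions))
    return dict(zip(partitions, counts))


result = dynamic_pipeline("2025-01-01")
```

A static DAG definition has to know its task list when the file is parsed, which is why this pattern is harder to express there.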
Dagster
Dagster's differentiator is the software-defined asset (SDA) model. Instead of defining tasks, you define data assets and their dependencies.
import pandas as pd
from dagster import asset

@asset
def raw_orders() -> pd.DataFrame:
    # Reading s3:// paths with pandas requires the s3fs package
    return pd.read_csv("s3://bucket/orders.csv")

@asset
def cleaned_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # The parameter name `raw_orders` declares the upstream asset
    return raw_orders.dropna().rename(columns={"amt": "amount_usd"})

@asset
def revenue_by_region(cleaned_orders: pd.DataFrame) -> pd.DataFrame:
    return cleaned_orders.groupby("region")["amount_usd"].sum().reset_index()
Dagster tracks which assets are materialized, stale, or failing. The asset graph UI shows the lineage of every piece of data in your system.
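Note that the assets above never wire themselves together explicitly: each function's parameter names identify its upstream assets. A toy version of that mechanism fits in a few lines of standard-library Python. This is an illustration of the idea, not Dagster's code; `toy_asset`, `ASSETS`, `upstream_of`, and `materialize_all` are invented for the example, and the order rows are made-up data.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Any, Callable

ASSETS: dict[str, Callable[..., Any]] = {}


def toy_asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn as an asset named after the function (stand-in, not dagster.asset)."""
    ASSETS[fn.__name__] = fn
    return fn


def upstream_of(name: str) -> set[str]:
    """An asset's dependencies are simply its parameter names."""
    return set(inspect.signature(ASSETS[name]).parameters)


def materialize_all() -> dict[str, Any]:
    """Compute every asset once, upstream assets first."""
    graph = {name: upstream_of(name) for name in ASSETS}
    values: dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: values[dep] for dep in graph[name]}
        values[name] = ASSETS[name](**inputs)
    return values


@toy_asset
def raw_orders() -> list[dict]:
    return [
        {"region": "EU", "amount_usd": 10.0},
        {"region": "EU", "amount_usd": 5.0},
        {"region": "US", "amount_usd": 7.5},
    ]


@toy_asset
def revenue_by_region(raw_orders: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for row in raw_orders:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount_usd"]
    return totals


values = materialize_all()
```

Because the graph is derived from the definitions themselves, lineage comes for free: the same `graph` mapping is what a UI would draw.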
When to choose Dagster: your team is data-intensive and wants lineage plus observability, data assets are first-class concepts in your project, or dbt integration matters (Dagster has the best dbt integration of the three).
Comparison Matrix
| | Airflow | Prefect | Dagster |
|---|---|---|---|
| Learning curve | High | Low | Medium |
| Setup complexity | High | Low | Medium |
| Task model | Operators | Python decorators | Software-defined assets |
| Dynamic workflows | Limited | Excellent | Good |
| Self-host cost | Low | Low | Low |
| Managed cloud | MWAA ($$$) | Prefect Cloud (free tier) | Dagster Cloud (free tier) |
| dbt integration | Plugin | Plugin | Native |