Building Data Pipelines: Batch, Streaming, and When You Need Each

Data pipelines move data from source to destination reliably. Here is the complete guide to pipeline types, tools, and how to decide what you actually need.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

11 min read

// tags

#data-pipeline #airflow #prefect #dbt #data-engineering


dbt: SQL Transformations for the Analytics Layer

dbt (data build tool) handles the transformation layer within your data warehouse. It does not extract or load data; it transforms data that is already in the warehouse using SQL.

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(created_at AS TIMESTAMP) AS created_at,
        UPPER(status) AS status,
        amount_cents / 100.0 AS amount_usd
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM cleaned

dbt infers dependencies between models from their ref() and source() calls, builds the models in dependency order, and provides testing and documentation as first-class features.
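The ordering step can be illustrated outside dbt. The sketch below is not dbt's implementation: it assumes a hypothetical dependency map (every model name other than stg_orders is invented for illustration) and uses Python's standard graphlib module to compute one valid build order.

```python
from graphlib import TopologicalSorter

# Hypothetical model graph: each key depends on the models in its set.
# In dbt these edges would come from ref() and source() calls in the SQL.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}


def build_order(deps):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
print(order)
```

The two staging models have no dependency on each other, so either may come first. Only the relative order along each dependency edge is guaranteed.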

# schema.yml: tests are defined alongside models
# (expression_is_true comes from the dbt_utils package, which must be installed)
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"

Run dbt test to execute these tests after a build, or use dbt build, which runs models and their tests together in dependency order. This is data quality testing built into the transformation pipeline.

Data Quality Testing: Great Expectations

Great Expectations lets you define data quality assertions as code and run them as tests in your pipeline.

import great_expectations as gx

context = gx.get_context()

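# Get a validator for the orders data. The datasource and asset names must
# already be configured in the context, and the exact get_validator arguments
# differ between Great Expectations releases, so check the docs for your version.
validator = context.get_validator(
    datasource_name="orders_db",
    data_asset_name="orders",
)

# Define expectations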
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount_usd", min_value=0, max_value=100000)
validator.expect_column_values_to_be_in_set("status", ["pending", "shipped", "delivered", "cancelled"])
validator.expect_column_pair_values_a_to_be_greater_than_b("shipped_at", "created_at")

# Save expectations as a suite and run as a checkpoint in your pipeline

When an expectation fails, the checkpoint reports it, so your pipeline can stop before bad data reaches downstream consumers.
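The fail-fast idea can be sketched without any library. What follows is a plain-Python illustration, not the Great Expectations API: check_orders, gate, and DataQualityError are hypothetical names, and the rules only loosely mirror the expectations above (for example, a missing amount is treated as a violation here).

```python
from datetime import datetime

ALLOWED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}


class DataQualityError(Exception):
    """Raised to stop the pipeline when a batch violates a rule."""


def check_orders(rows):
    """Return a list of human-readable violations for a batch of order dicts."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            problems.append(f"row {i}: order_id is null")
        amount = row.get("amount_usd")
        if amount is None or not (0 <= amount <= 100000):
            problems.append(f"row {i}: amount_usd missing or out of range")
        if row.get("status") not in ALLOWED_STATUSES:
            problems.append(f"row {i}: unexpected status {row.get('status')!r}")
        shipped, created = row.get("shipped_at"), row.get("created_at")
        if shipped is not None and created is not None and shipped <= created:
            problems.append(f"row {i}: shipped_at is not after created_at")
    return problems


def gate(rows):
    """Raise before anything downstream runs if the batch is bad."""
    problems = check_orders(rows)
    if problems:
        raise DataQualityError("; ".join(problems))
    return rows


good = {"order_id": 1, "amount_usd": 25.0, "status": "shipped",
        "created_at": datetime(2026, 5, 1), "shipped_at": datetime(2026, 5, 2)}
bad = {"order_id": None, "amount_usd": -5, "status": "lost",
       "created_at": datetime(2026, 5, 2), "shipped_at": datetime(2026, 5, 1)}
```

Calling gate([good]) returns the batch unchanged, while gate([bad]) raises DataQualityError listing every violation, which is the point at which an orchestrator would mark the task as failed.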

ETL vs ELT: Load First, Transform Later

Traditional ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. Modern ELT (Extract, Load, Transform) loads raw data first, then transforms it using warehouse compute.

ELT is now the dominant pattern for cloud data warehouses (Snowflake, BigQuery, Redshift), and for good reasons: raw data is preserved for reprocessing, transformations can be changed without re-extracting, warehouse compute is comparatively cheap, and dbt handles the T in ELT well.
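The load-then-transform flow can be sketched with Python's built-in sqlite3 standing in for a cloud warehouse. That substitution is an assumption made only to keep the example self-contained, and the table names and sample rows are illustrative. Raw rows are loaded untouched, and the cleaned table is derived from them with SQL, so changing the transformation does not require extracting again.

```python
import sqlite3

raw_rows = [  # as extracted from a hypothetical source system
    (1, 101, "shipped", 2599),
    (2, 102, "pending", 1000),
    (None, 103, "cancelled", 500),  # bad record, still kept in the raw table
]

conn = sqlite3.connect(":memory:")

# Load: store the data exactly as it arrived.
conn.execute("CREATE TABLE raw_orders (id, customer_id, status, amount_cents)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: runs inside the database and can be rerun at any time.
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT id AS order_id,
           customer_id,
           UPPER(status) AS status,
           amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE id IS NOT NULL
""")

staged = conn.execute(
    "SELECT order_id, status, amount_usd FROM stg_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

The raw table still holds all three rows, including the bad one, while the staged table contains only the two valid orders.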

Use ETL when: you have strict privacy or data residency requirements (for example, PII must be masked before it enters the warehouse), the transformation is computationally expensive and better done at the source, or you are loading into a relational OLTP database that expects data in its final schema rather than in raw form.
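The PII case can be made concrete with a small transform that runs before the load step. This is a minimal sketch under stated assumptions: the field names, the SECRET_KEY value, and the choice of a keyed SHA-256 hash are all illustrative, and whether hashing is sufficient masking depends on your compliance requirements.

```python
import hashlib
import hmac

# Hypothetical key; in practice this would come from a secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_email(email: str) -> str:
    """Replace an email with a keyed hash: joins still work, the address is not stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def transform_before_load(record: dict) -> dict:
    """Return a copy of the record that is safe to load: PII masked or dropped."""
    safe = dict(record)
    safe["email_hash"] = mask_email(safe.pop("email"))
    safe.pop("full_name", None)  # drop fields the warehouse does not need
    return safe


source_record = {"order_id": 1, "email": " Ada@Example.com ",
                 "full_name": "Ada L.", "amount_usd": 25.99}
loadable = transform_before_load(source_record)
```

Because the email is normalized before hashing, the same customer maps to the same hash across records, so the warehouse can still join on email_hash without ever seeing the address.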

Error Handling in Pipelines

Production pipelines fail. The question is how they fail and whether you know about it.

Retry logic handles transient failures (network timeouts, database deadlocks). Airflow and Prefect both provide configurable retries with backoff. Three retries with a 5-minute delay is a reasonable starting point for most tasks (in Airflow, retries=3 and retry_delay=timedelta(minutes=5)).
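For scripts that run outside an orchestrator, the same behavior can be approximated by hand. The decorator below is a minimal sketch, not the Airflow or Prefect implementation: the retry name, its parameters, and the fetch_batch task are hypothetical. The sleep function is injectable so the example runs instantly.

```python
import time
from functools import wraps


def retry(retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failed attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise  # out of retries: surface the failure
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


# Usage: a flaky task that succeeds on its third call.
calls = {"count": 0}
delays = []  # records the requested delays instead of actually sleeping


@retry(retries=3, base_delay=1.0, exceptions=(TimeoutError,), sleep=delays.append)
def fetch_batch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient network error")
    return "ok"


result = fetch_batch()
```

Restricting the exceptions argument to known transient errors matters: retrying a task that fails because of a bug or bad input only delays the alert.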

Dead letter queues catch records that fail processing in streaming systems. Instead of dropping bad records, write them to a DLQ for inspection and replay.
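The pattern can be sketched in a few lines. In this illustration a plain Python list stands in for the dead letter queue (in a real streaming system it would be a separate topic or queue), and parse_event and process_stream are hypothetical names.

```python
import json


def parse_event(raw: str) -> dict:
    """Parse one message; raises on malformed or incomplete input."""
    event = json.loads(raw)
    return {"order_id": int(event["order_id"]),
            "amount_usd": float(event["amount_usd"])}


def process_stream(messages, dead_letters):
    """Process every message; failures go to dead_letters with the error attached."""
    processed = []
    for raw in messages:
        try:
            processed.append(parse_event(raw))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original payload so the record can be inspected and replayed.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed


messages = ['{"order_id": 1, "amount_usd": 25.99}', "not json", '{"order_id": 2}']
dlq = []
ok = process_stream(messages, dlq)
```

One bad message does not stop the stream: the valid record is processed, and the two failures are preserved along with the reason each one failed.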

Alerting on failure. Every production pipeline needs an alert on failure. Airflow and Prefect support email and Slack notifications. A pipeline that fails silently and processes no data for 24 hours before anyone notices is a common disaster.
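Two checks cover most of this: alert when a task raises, and alert when nothing has succeeded recently. The sketch below assumes a hypothetical notify callback (in practice it might post to Slack or send email) and hypothetical function names. It is not the notification API of Airflow or Prefect.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_success, now, max_age=timedelta(hours=24)):
    """Return an alert message if the last successful run is too old, else None."""
    age = now - last_success
    if age > max_age:
        hours = age.total_seconds() / 3600
        return f"Pipeline stale: last success {hours:.1f}h ago"
    return None


def run_with_alert(task, notify):
    """Run task; on any exception, send an alert and re-raise so the run still fails."""
    try:
        return task()
    except Exception as exc:
        notify(f"Pipeline task {task.__name__} failed: {exc!r}")
        raise


sent = []  # stands in for a real notification channel
now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
stale = check_freshness(now - timedelta(hours=30), now)
fresh = check_freshness(now - timedelta(hours=2), now)
if stale:
    sent.append(stale)
```

The freshness check is what catches the silent case: it runs on its own schedule, so it still fires when the pipeline never started at all.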

Idempotency. Design pipeline tasks to be idempotent: running the same task twice produces the same result as running it once. This makes retries safe. Use CREATE TABLE IF NOT EXISTS, MERGE statements, or partition-based overwrite strategies.
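The partition-overwrite strategy can be sketched with sqlite3 as a stand-in database. The daily_orders table and load_partition function are hypothetical. The task deletes and rewrites one day's partition inside a single transaction, so a retry replaces the rows instead of duplicating them.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace all rows for `day` in one transaction, so reruns do not duplicate data."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_orders WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (day, order_id, amount_usd) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS daily_orders (day TEXT, order_id INTEGER, amount_usd REAL)"
)

batch = [(1, 25.99), (2, 10.0)]
load_partition(conn, "2026-05-18", batch)
load_partition(conn, "2026-05-18", batch)  # a retry of the same task

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

After both calls the table holds two rows, not four. A plain INSERT without the DELETE would have doubled the data on the retry.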

When You Need a Pipeline vs a Cron Job

A cron job calling a Python script is a pipeline. For simple, single-step processes that run infrequently and have low criticality, a cron job is perfectly appropriate. Do not introduce Airflow for a weekly script that takes 30 seconds.

You need a proper orchestration tool when: the pipeline has multiple dependent steps that need to run in order, you need visibility into pipeline history and failures, multiple people need to understand and modify the pipeline, or you need retry logic, alerting, and parallelism.

Keep Reading

dbt Data Transformation Guide - deep dive into the transformation layer
Data Quality Guide - ensuring pipeline output is trustworthy
Feature Store Guide - pipelines for ML feature serving

