dbt Guide: SQL-Based Data Transformation for Analytics Engineers

dbt brings software engineering practices to SQL transformations. Here is how it works, how to structure a project, and when it is the right choice.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

11 min read

// tags

#dbt#sql#data-transformation#analytics-engineering#data-engineering

FIG. ART-29

11 min read

“

dbt Guide: SQL-Based Data Transformation for Analytics Engineers

// reading plan

sections

1,103

words

min read

// Developer Tools

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

Drizzle defines your schema in TypeScript, builds SQL-like queries with full type safety, and ships a smaller bundle than Prisma. Here is how it works.

9 min read

// Developer Tools

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

dbt (data build tool) is a transformation framework that enables analysts and engineers to transform raw data in their data warehouse using SQL, with software engineering practices built in. If you have ever maintained a spaghetti mess of SQL scripts that no one understands, runs in an undocumented order, and breaks whenever anyone changes anything, dbt is the answer.

What dbt Does (and Does Not Do)

dbt handles the T in ELT. It takes raw data that is already in your data warehouse and transforms it into clean, analysis-ready tables and views through modular, version-controlled SQL models.

What dbt does not do: extract data from source systems or load it into the warehouse. That is handled by tools like Fivetran, Airbyte, or custom extraction pipelines. dbt assumes the raw data is already there.

The core value proposition: your transformation logic lives in version-controlled SQL files, is modular and composable, is documented and tested, and is reproducible. One command (dbt run) rebuilds your entire analytics layer from scratch.

The dbt Project Structure

A dbt project is a directory of SQL files organized by layer:

models/
  staging/          -- Clean raw source data, one model per source table
    stg_orders.sql
    stg_customers.sql
    stg_products.sql
  intermediate/     -- Business logic, joins, complex transformations
    int_order_items_enriched.sql
    int_customer_lifetime_value.sql
  marts/            -- Final, aggregated, analysis-ready tables
    orders.sql
    customers.sql
    finance/
      revenue_daily.sql
      revenue_by_product.sql

Each layer has a specific purpose:

Staging models take one source table and clean it: rename columns to consistent naming conventions, cast data types, handle basic null logic, and add no business logic. A staging model for orders should look exactly like the raw orders table, just cleaner.

Intermediate models apply business logic. This is where you join staging models, compute derived fields, apply business rules ("an order is considered complete if status = 'delivered' and return_window_days has elapsed"), and build complex aggregations.

Mart models are the tables your analysts and BI tools query. They are optimized for a specific analytical domain (finance, marketing, product) and should be simple enough that analysts do not need to write complex SQL to answer common questions.

Writing dbt Models

A dbt model is a single .sql file containing a SELECT statement. dbt handles the CREATE TABLE or CREATE VIEW wrapping.

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw_postgres', 'orders') }}
),

renamed AS (
    SELECT
        id                                      AS order_id,
        customer_id,
        created_at                              AS order_created_at,
        CAST(total_cents AS FLOAT) / 100.0     AS order_total_usd,
        UPPER(TRIM(status))                    AS order_status,
        COALESCE(shipping_address_country, 'Unknown') AS shipping_country
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM renamed

The {{ source('raw_postgres', 'orders') }} syntax is a dbt Jinja function that references a source defined in your sources.yml file. This lets dbt build a lineage graph and check for source freshness.

Referencing another model uses {{ ref() }}:

-- models/marts/customers.sql
WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customer_orders AS (
    SELECT
        c.customer_id,
        c.email,
        c.created_at AS customer_since,
        COUNT(o.order_id) AS order_count,
        SUM(o.order_total_usd) AS lifetime_value_usd,
        MIN(o.order_created_at) AS first_order_at,
        MAX(o.order_created_at) AS most_recent_order_at
    FROM customers c
    LEFT JOIN orders o USING (customer_id)
    GROUP BY 1, 2, 3
)

SELECT * FROM customer_orders

dbt reads these ref() calls to build a directed acyclic graph (DAG) of model dependencies. When you run dbt run, it executes models in topological order: staging models first, then intermediate, then marts.

Testing in dbt

dbt has two types of tests: schema tests (defined in YAML) and singular tests (custom SQL).

# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from the Postgres source"
    columns:
      - name: order_id
        description: "Primary key for orders"
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - not_null
          - accepted_values:
              values: ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"]
      - name: order_total_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run dbt test to execute all tests. A failing test blocks downstream consumers from using bad data. The relationships test is particularly valuable: it checks referential integrity between models, catching join key problems before they silently produce null matches.

Documentation

dbt generates a documentation site from your schema YAML and model descriptions. Run dbt docs generate followed by dbt docs serve to get a searchable, browsable documentation site with lineage graphs showing how every model connects to every other model.

This is not optional for team environments. When a new analyst joins, the documentation site is the map of your analytics layer.

Materializations: Table, View, Incremental

By default, dbt models are materialized as views. You can override this per-model or per-directory:

# dbt_project.yml
models:
  your_project:
    staging:
      +materialized: view      # Staging: views (fast to build, always current)
    intermediate:
      +materialized: view
    marts:
      +materialized: table     # Marts: tables (fast to query)
      finance:
        +materialized: incremental  # Large finance tables: only process new data

Incremental materializations only process new or updated rows, dramatically speeding up builds for large tables:

-- models/marts/finance/events_daily.sql
{{
  config(
    materialized='incremental',
    unique_key='event_date',
    incremental_strategy='delete+insert'
  )
}}

SELECT
    DATE(created_at) AS event_date,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
    WHERE created_at >= (SELECT MAX(event_date) FROM {{ this }})
{% endif %}

GROUP BY 1

dbt Cloud vs dbt Core

dbt Core is the open-source CLI. It runs locally or in any CI/CD environment. Free.

dbt Cloud is dbt's managed platform. It provides a web IDE, job scheduling, a CI/CD integration that runs tests on pull requests, and a hosted documentation site. It is significantly easier to operate than managing dbt Core in your own infrastructure. Pricing is per-developer seat.

For small teams getting started, dbt Core with GitHub Actions for scheduling is a reasonable free setup. For teams that want managed operations, dbt Cloud is worth the cost.

When dbt Is the Right Choice

dbt is the right choice when: your data warehouse is SQL-native (Snowflake, BigQuery, Redshift, DuckDB, Databricks SQL), your transformation team thinks primarily in SQL, you want version control and testing for your analytics layer, or you are adopting the ELT pattern.

dbt is not the right choice when: your transformations require complex Python (dbt supports Python models on some warehouses, but SQL-first is still the core value), you need to extract or load data (use a dedicated EL tool), or your analytics layer is simple enough that one person can maintain it without tooling.

Keep Reading

Data Pipeline Guide -- how dbt fits into the broader pipeline architecture
Data Quality Guide -- dbt testing for data quality enforcement
SQL for Data Scientists Guide -- the SQL skills that underpin dbt modeling

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.

dbt Guide: SQL-Based Data Transformation for Analytics Engineers

Related Articles

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

What dbt Does (and Does Not Do)

The dbt Project Structure

Writing dbt Models

Testing in dbt

Documentation

Materializations: Table, View, Incremental

dbt Cloud vs dbt Core

When dbt Is the Right Choice

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Pandas for Software Developers: The Complete Guide to Data Manipulation in Python

dbt Guide: SQL-Based Data Transformation for Analytics Engineers

Related Articles

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

What dbt Does (and Does Not Do)

The dbt Project Structure

Writing dbt Models

Testing in dbt

Documentation

Materializations: Table, View, Incremental

dbt Cloud vs dbt Core

When dbt Is the Right Choice

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Pandas for Software Developers: The Complete Guide to Data Manipulation in Python

The workspace your team
actually needs