dbt (data build tool) is a transformation framework that enables analysts and engineers to transform raw data in their data warehouse using SQL, with software engineering practices built in. If you have ever maintained a spaghetti mess of SQL scripts that no one understands, runs in an undocumented order, and breaks whenever anyone changes anything, dbt is the answer.
What dbt Does (and Does Not Do)
dbt handles the T in ELT. It takes raw data that is already in your data warehouse and transforms it into clean, analysis-ready tables and views through modular, version-controlled SQL models.
What dbt does not do: extract data from source systems or load it into the warehouse. That is handled by tools like Fivetran, Airbyte, or custom extraction pipelines. dbt assumes the raw data is already there.
The core value proposition: your transformation logic lives in version-controlled SQL files, is modular and composable, is documented and tested, and is reproducible. One command (dbt run) rebuilds your entire analytics layer from scratch.
The dbt Project Structure
A dbt project is a directory of SQL files organized by layer:
models/
staging/ -- Clean raw source data, one model per source table
stg_orders.sql
stg_customers.sql
stg_products.sql
intermediate/ -- Business logic, joins, complex transformations
int_order_items_enriched.sql
int_customer_lifetime_value.sql
marts/ -- Final, aggregated, analysis-ready tables
orders.sql
customers.sql
finance/
revenue_daily.sql
revenue_by_product.sql
Each layer has a specific purpose:
Staging models take one source table and clean it: rename columns to consistent naming conventions, cast data types, handle basic null logic, and add no business logic. A staging model for orders should look exactly like the raw orders table, just cleaner.
Intermediate models apply business logic. This is where you join staging models, compute derived fields, apply business rules ("an order is considered complete if status = 'delivered' and return_window_days has elapsed"), and build complex aggregations.
Mart models are the tables your analysts and BI tools query. They are optimized for a specific analytical domain (finance, marketing, product) and should be simple enough that analysts do not need to write complex SQL to answer common questions.
Writing dbt Models
A dbt model is a single .sql file containing a SELECT statement. dbt handles the CREATE TABLE or CREATE VIEW wrapping.
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('raw_postgres', 'orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
created_at AS order_created_at,
CAST(total_cents AS FLOAT) / 100.0 AS order_total_usd,
UPPER(TRIM(status)) AS order_status,
COALESCE(shipping_address_country, 'Unknown') AS shipping_country
FROM source
WHERE id IS NOT NULL
)
SELECT * FROM renamed
The {{ source('raw_postgres', 'orders') }} syntax is a dbt Jinja function that references a source defined in your sources.yml file. This lets dbt build a lineage graph and check for source freshness.
Referencing another model uses {{ ref() }}:
-- models/marts/customers.sql
WITH customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customer_orders AS (
SELECT
c.customer_id,
c.email,
c.created_at AS customer_since,
COUNT(o.order_id) AS order_count,
SUM(o.order_total_usd) AS lifetime_value_usd,
MIN(o.order_created_at) AS first_order_at,
MAX(o.order_created_at) AS most_recent_order_at
FROM customers c
LEFT JOIN orders o USING (customer_id)
GROUP BY 1, 2, 3
)
SELECT * FROM customer_orders
dbt reads these ref() calls to build a directed acyclic graph (DAG) of model dependencies. When you run dbt run, it executes models in topological order: staging models first, then intermediate, then marts.
Testing in dbt
dbt has two types of tests: schema tests (defined in YAML) and singular tests (custom SQL).
# models/staging/schema.yml
version: 2
models:
- name: stg_orders
description: "Cleaned orders from the Postgres source"
columns:
- name: order_id
description: "Primary key for orders"
tests:
- unique
- not_null
- name: order_status
tests:
- not_null
- accepted_values:
values: ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"]
- name: order_total_usd
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "> 0"
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
Run dbt test to execute all tests. A failing test blocks downstream consumers from using bad data. The relationships test is particularly valuable: it checks referential integrity between models, catching join key problems before they silently produce null matches.
Documentation
dbt generates a documentation site from your schema YAML and model descriptions. Run dbt docs generate followed by dbt docs serve to get a searchable, browsable documentation site with lineage graphs showing how every model connects to every other model.
This is not optional for team environments. When a new analyst joins, the documentation site is the map of your analytics layer.
Materializations: Table, View, Incremental
By default, dbt models are materialized as views. You can override this per-model or per-directory:
# dbt_project.yml
models:
your_project:
staging:
+materialized: view # Staging: views (fast to build, always current)
intermediate:
+materialized: view
marts:
+materialized: table # Marts: tables (fast to query)
finance:
+materialized: incremental # Large finance tables: only process new data
Incremental materializations only process new or updated rows, dramatically speeding up builds for large tables:
-- models/marts/finance/events_daily.sql
{{
config(
materialized='incremental',
unique_key='event_date',
incremental_strategy='delete+insert'
)
}}
SELECT
DATE(created_at) AS event_date,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
WHERE created_at >= (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY 1
dbt Cloud vs dbt Core
dbt Core is the open-source CLI. It runs locally or in any CI/CD environment. Free.
dbt Cloud is dbt's managed platform. It provides a web IDE, job scheduling, a CI/CD integration that runs tests on pull requests, and a hosted documentation site. It is significantly easier to operate than managing dbt Core in your own infrastructure. Pricing is per-developer seat.
For small teams getting started, dbt Core with GitHub Actions for scheduling is a reasonable free setup. For teams that want managed operations, dbt Cloud is worth the cost.
When dbt Is the Right Choice
dbt is the right choice when: your data warehouse is SQL-native (Snowflake, BigQuery, Redshift, DuckDB, Databricks SQL), your transformation team thinks primarily in SQL, you want version control and testing for your analytics layer, or you are adopting the ELT pattern.
dbt is not the right choice when: your transformations require complex Python (dbt supports Python models on some warehouses, but SQL-first is still the core value), you need to extract or load data (use a dedicated EL tool), or your analytics layer is simple enough that one person can maintain it without tooling.
Keep Reading
- Data Pipeline Guide -- how dbt fits into the broader pipeline architecture
- Data Quality Guide -- dbt testing for data quality enforcement
- SQL for Data Scientists Guide -- the SQL skills that underpin dbt modeling
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.