SQL for Data Scientists: Window Functions, CTEs, and Query Optimization

SQL is the most important data science tool that data scientists often undervalue. Window functions alone replace hundreds of lines of pandas code.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#sql#window-functions#cte#data-science#query-optimization

FIG. ART-29

10 min read

“

SQL for Data Scientists: Window Functions, CTEs, and Query Optimization

// reading plan

sections

1,140

words

min read

// Developer Tools

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

Drizzle defines your schema in TypeScript, builds SQL-like queries with full type safety, and ships a smaller bundle than Prisma. Here is how it works.

9 min read

// Developer Tools

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

SQL is the most important data science tool you are probably underusing. Most data scientists learn pandas first and treat SQL as a secondary skill. This is backwards. The vast majority of data in real organizations lives in databases. SQL is the language those databases speak, and the compute power of a modern data warehouse is enormous. Learning to use SQL effectively means pushing expensive transformations down to the database instead of pulling gigabytes of raw data into Python.

This guide focuses on the SQL features that matter most for data science: window functions, CTEs, aggregations, and query optimization.

Window Functions: The Feature That Changes Everything

Window functions perform calculations across a set of rows that are related to the current row, without collapsing those rows into a single output row. They eliminate enormous amounts of pandas code.

ROW_NUMBER, RANK, DENSE_RANK

-- Rank employees by salary within each department
SELECT
    employee_id,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;

Difference: if two employees have the same salary, RANK skips the next number (1, 2, 2, 4), DENSE_RANK does not (1, 2, 2, 3), and ROW_NUMBER assigns arbitrary unique numbers.

Use ROW_NUMBER to get the top-N rows per group (a classic pattern):

-- Top 3 earners per department
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;

LAG and LEAD: Comparing to Adjacent Rows

-- Month-over-month revenue change
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS revenue_change,
    ROUND(100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY month))
          / LAG(revenue, 1) OVER (ORDER BY month), 2) AS pct_change
FROM monthly_revenue;

LAG(column, n) returns the value from n rows before. LEAD(column, n) returns the value from n rows after. Essential for time series analysis.

Running Totals and Moving Averages

-- Running total of revenue
SELECT
    date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date) AS cumulative_revenue,
    AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM daily_sales;

The ROWS BETWEEN clause controls which rows are included in the window. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you the last 7 days (current + 6 previous).

CTEs: Writing Readable Complex Queries

Common Table Expressions let you name intermediate results and reference them like tables. They transform unreadable nested subqueries into step-by-step logic.

-- Without CTE: hard to read
SELECT department, avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM (
        SELECT * FROM employees WHERE hire_date > '2020-01-01'
    ) recent_hires
    GROUP BY department
) dept_averages
WHERE avg_salary > 80000;

-- With CTEs: readable and maintainable
WITH recent_hires AS (
    SELECT * FROM employees
    WHERE hire_date > '2020-01-01'
),
dept_averages AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM recent_hires
    GROUP BY department
)
SELECT department, avg_salary
FROM dept_averages
WHERE avg_salary > 80000;

CTEs can reference previous CTEs. Build complex logic incrementally. In most databases, CTEs are equivalent in performance to subqueries (the optimizer handles them the same way).

Recursive CTEs are a separate feature for traversing hierarchical data (org charts, graph structures). They are worth knowing but beyond everyday use.

Aggregations and GROUP BY

The basics most developers know, but with important nuances:

-- Standard aggregation
SELECT
    department,
    COUNT(*) AS headcount,
    AVG(salary) AS avg_salary,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 10  -- Filter after aggregation (WHERE filters before)
ORDER BY avg_salary DESC;

The critical distinction: WHERE filters individual rows before aggregation, HAVING filters groups after aggregation. You cannot use an alias defined in SELECT inside WHERE (it has not been computed yet), but you can use it in HAVING in most databases.

PERCENTILE_CONT gives you the true median. Average is skewed by outliers; median is more robust for salary analysis.

JOINs: When to Use Each Type

-- INNER JOIN: only rows with matches in both tables
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

-- LEFT JOIN: all rows from left, NULL where no match on right
SELECT e.name, o.order_id
FROM employees e
LEFT JOIN orders o ON e.id = o.employee_id;
-- Employees with no orders will appear with NULL order_id

-- CROSS JOIN: every combination of rows (use with caution)
SELECT p.product, r.region
FROM products p
CROSS JOIN regions r;

LEFT JOIN is the most common after INNER JOIN. Use it when you want all records from one table regardless of whether there is a match. Right JOIN is equivalent to left JOIN with tables swapped (prefer LEFT for readability). FULL OUTER JOIN returns all rows from both tables.

Watch for accidental row multiplication in JOINs. If the join key has duplicates on the right side, each left row gets multiple matches, multiplying your row count. Always check COUNT(*) before and after joins.

Query Optimization: The Practical Rules

Use EXPLAIN to understand execution plans. In PostgreSQL: EXPLAIN ANALYZE SELECT .... This shows you whether the database is doing a sequential scan (slow for large tables) or an index scan (fast).

Add indexes on join keys and filter columns.

-- Assuming your app uses this query frequently
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);

-- Composite index for combined filters
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);

*Avoid SELECT . Always specify the columns you need. SELECT * pulls all columns across the network and prevents the database from using covering indexes.

Filter early. Apply WHERE conditions before joins when possible. Push predicates into CTEs or subqueries to reduce the data before the expensive join operation.

Use the database for aggregation, not pandas. If you need the average salary by department, run GROUP BY in SQL and return 10 rows, not 100,000 rows to aggregate in Python.

SQL vs Pandas: The Decision Rule

Use SQL when the data is in a database and the transformation can be expressed in SQL. The database has access to indexes, parallel query execution, and hardware-optimized aggregation that pandas cannot match.

Use pandas when you have already extracted data into Python, the transformation is complex Python logic that is awkward in SQL, or you are feeding the result directly to scikit-learn.

The pattern that works best in practice: SQL for extraction and initial aggregation, pandas for final manipulation and model input preparation.

Keep Reading

Pandas Guide for Developers — Python side of the data manipulation coin
Data Pipeline Guide — automate SQL and pandas workflows
Feature Engineering Practical Guide — SQL features for ML

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.

SQL for Data Scientists: Window Functions, CTEs, and Query Optimization

Related Articles

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

Window Functions: The Feature That Changes Everything

ROW_NUMBER, RANK, DENSE_RANK

LAG and LEAD: Comparing to Adjacent Rows

Running Totals and Moving Averages

CTEs: Writing Readable Complex Queries

Aggregations and GROUP BY

JOINs: When to Use Each Type

Query Optimization: The Practical Rules

SQL vs Pandas: The Decision Rule

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

How Product Teams Can Work Effectively With Machine Learning

SQL for Data Scientists: Window Functions, CTEs, and Query Optimization

Related Articles

Drizzle ORM Guide: TypeScript-First SQL That Stays Close to the Database

PostgreSQL Guide for Developers: Why Postgres Wins in 2026

Window Functions: The Feature That Changes Everything

ROW_NUMBER, RANK, DENSE_RANK

LAG and LEAD: Comparing to Adjacent Rows

Running Totals and Moving Averages

CTEs: Writing Readable Complex Queries

Aggregations and GROUP BY

JOINs: When to Use Each Type

Query Optimization: The Practical Rules

SQL vs Pandas: The Decision Rule

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

How Product Teams Can Work Effectively With Machine Learning

The workspace your team
actually needs