SQL is the most important data science tool you are probably underusing. Most data scientists learn pandas first and treat SQL as a secondary skill. This is backwards. The vast majority of data in real organizations lives in databases. SQL is the language those databases speak, and the compute power of a modern data warehouse is enormous. Learning to use SQL effectively means pushing expensive transformations down to the database instead of pulling gigabytes of raw data into Python.
This guide focuses on the SQL features that matter most for data science: window functions, CTEs, aggregations, and query optimization.
Window Functions: The Feature That Changes Everything
Window functions perform calculations across a set of rows that are related to the current row, without collapsing those rows into a single output row. They eliminate enormous amounts of pandas code.
ROW_NUMBER, RANK, DENSE_RANK
-- Rank employees by salary within each department
SELECT
employee_id,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;
Difference: if two employees have the same salary, RANK skips the next number (1, 2, 2, 4), DENSE_RANK does not (1, 2, 2, 3), and ROW_NUMBER assigns arbitrary unique numbers.
Use ROW_NUMBER to get the top-N rows per group (a classic pattern):
-- Top 3 earners per department
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3;
LAG and LEAD: Comparing to Adjacent Rows
-- Month-over-month revenue change
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS revenue_change,
ROUND(100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month), 2) AS pct_change
FROM monthly_revenue;
LAG(column, n) returns the value from n rows before. LEAD(column, n) returns the value from n rows after. Essential for time series analysis.
Running Totals and Moving Averages
-- Running total of revenue
SELECT
date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY date) AS cumulative_revenue,
AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM daily_sales;
The ROWS BETWEEN clause controls which rows are included in the window. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you the last 7 days (current + 6 previous).
CTEs: Writing Readable Complex Queries
Common Table Expressions let you name intermediate results and reference them like tables. They transform unreadable nested subqueries into step-by-step logic.
-- Without CTE: hard to read
SELECT department, avg_salary
FROM (
SELECT department, AVG(salary) AS avg_salary
FROM (
SELECT * FROM employees WHERE hire_date > '2020-01-01'
) recent_hires
GROUP BY department
) dept_averages
WHERE avg_salary > 80000;
-- With CTEs: readable and maintainable
WITH recent_hires AS (
SELECT * FROM employees
WHERE hire_date > '2020-01-01'
),
dept_averages AS (
SELECT department, AVG(salary) AS avg_salary
FROM recent_hires
GROUP BY department
)
SELECT department, avg_salary
FROM dept_averages
WHERE avg_salary > 80000;
CTEs can reference previous CTEs. Build complex logic incrementally. In most databases, CTEs are equivalent in performance to subqueries (the optimizer handles them the same way).
Recursive CTEs are a separate feature for traversing hierarchical data (org charts, graph structures). They are worth knowing but beyond everyday use.