SQL is the most important data science tool you are probably underusing. Most data scientists learn pandas first and treat SQL as a secondary skill. This is backwards. The vast majority of data in real organizations lives in databases. SQL is the language those databases speak, and the compute power of a modern data warehouse is enormous. Learning to use SQL effectively means pushing expensive transformations down to the database instead of pulling gigabytes of raw data into Python.
This guide focuses on the SQL features that matter most for data science: window functions, CTEs, aggregations, and query optimization.
Window Functions: The Feature That Changes Everything
Window functions perform calculations across a set of rows that are related to the current row, without collapsing those rows into a single output row. They eliminate enormous amounts of pandas code.
ROW_NUMBER, RANK, DENSE_RANK
-- Rank employees by salary within each department
SELECT
employee_id,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;
Difference: if two employees have the same salary, RANK skips the next number (1, 2, 2, 4), DENSE_RANK does not (1, 2, 2, 3), and ROW_NUMBER assigns arbitrary unique numbers.
Use ROW_NUMBER to get the top-N rows per group (a classic pattern):
-- Top 3 earners per department
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3;
LAG and LEAD: Comparing to Adjacent Rows
-- Month-over-month revenue change
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS revenue_change,
ROUND(100.0 * (revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month), 2) AS pct_change
FROM monthly_revenue;
LAG(column, n) returns the value from n rows before. LEAD(column, n) returns the value from n rows after. Essential for time series analysis.
Running Totals and Moving Averages
-- Running total of revenue
SELECT
date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY date) AS cumulative_revenue,
AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM daily_sales;
The ROWS BETWEEN clause controls which rows are included in the window. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you the last 7 days (current + 6 previous).
CTEs: Writing Readable Complex Queries
Common Table Expressions let you name intermediate results and reference them like tables. They transform unreadable nested subqueries into step-by-step logic.
-- Without CTE: hard to read
SELECT department, avg_salary
FROM (
SELECT department, AVG(salary) AS avg_salary
FROM (
SELECT * FROM employees WHERE hire_date > '2020-01-01'
) recent_hires
GROUP BY department
) dept_averages
WHERE avg_salary > 80000;
-- With CTEs: readable and maintainable
WITH recent_hires AS (
SELECT * FROM employees
WHERE hire_date > '2020-01-01'
),
dept_averages AS (
SELECT department, AVG(salary) AS avg_salary
FROM recent_hires
GROUP BY department
)
SELECT department, avg_salary
FROM dept_averages
WHERE avg_salary > 80000;
CTEs can reference previous CTEs. Build complex logic incrementally. In most databases, CTEs are equivalent in performance to subqueries (the optimizer handles them the same way).
Recursive CTEs are a separate feature for traversing hierarchical data (org charts, graph structures). They are worth knowing but beyond everyday use.
Aggregations and GROUP BY
The basics most developers know, but with important nuances:
-- Standard aggregation
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 10 -- Filter after aggregation (WHERE filters before)
ORDER BY avg_salary DESC;
The critical distinction: WHERE filters individual rows before aggregation, HAVING filters groups after aggregation. You cannot use an alias defined in SELECT inside WHERE (it has not been computed yet), but you can use it in HAVING in most databases.
PERCENTILE_CONT gives you the true median. Average is skewed by outliers; median is more robust for salary analysis.
JOINs: When to Use Each Type
-- INNER JOIN: only rows with matches in both tables
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;
-- LEFT JOIN: all rows from left, NULL where no match on right
SELECT e.name, o.order_id
FROM employees e
LEFT JOIN orders o ON e.id = o.employee_id;
-- Employees with no orders will appear with NULL order_id
-- CROSS JOIN: every combination of rows (use with caution)
SELECT p.product, r.region
FROM products p
CROSS JOIN regions r;
LEFT JOIN is the most common after INNER JOIN. Use it when you want all records from one table regardless of whether there is a match. Right JOIN is equivalent to left JOIN with tables swapped (prefer LEFT for readability). FULL OUTER JOIN returns all rows from both tables.
Watch for accidental row multiplication in JOINs. If the join key has duplicates on the right side, each left row gets multiple matches, multiplying your row count. Always check COUNT(*) before and after joins.
Query Optimization: The Practical Rules
Use EXPLAIN to understand execution plans. In PostgreSQL: EXPLAIN ANALYZE SELECT .... This shows you whether the database is doing a sequential scan (slow for large tables) or an index scan (fast).
Add indexes on join keys and filter columns.
-- Assuming your app uses this query frequently
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index for combined filters
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);
*Avoid SELECT . Always specify the columns you need. SELECT * pulls all columns across the network and prevents the database from using covering indexes.
Filter early. Apply WHERE conditions before joins when possible. Push predicates into CTEs or subqueries to reduce the data before the expensive join operation.
Use the database for aggregation, not pandas. If you need the average salary by department, run GROUP BY in SQL and return 10 rows, not 100,000 rows to aggregate in Python.
SQL vs Pandas: The Decision Rule
Use SQL when the data is in a database and the transformation can be expressed in SQL. The database has access to indexes, parallel query execution, and hardware-optimized aggregation that pandas cannot match.
Use pandas when you have already extracted data into Python, the transformation is complex Python logic that is awkward in SQL, or you are feeding the result directly to scikit-learn.
The pattern that works best in practice: SQL for extraction and initial aggregation, pandas for final manipulation and model input preparation.
Keep Reading
- Pandas Guide for Developers — Python side of the data manipulation coin
- Data Pipeline Guide — automate SQL and pandas workflows
- Feature Engineering Practical Guide — SQL features for ML
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.