What Is DuckDB and Why Should You Care
DuckDB is an in-process OLAP database — it runs inside your Python, R, Java, or Node process without a server. No installation, no Docker, no cluster to manage. Just pip install duckdb and you have a full SQL analytics engine.
The core insight: most analytics workloads that teams throw at Spark or BigQuery fit comfortably on a single modern machine with 32-128GB RAM. DuckDB handles these cases with dramatically less complexity.
Query Parquet Files Directly with SQL
DuckDB's killer feature is reading files directly:
import duckdb
# Query a single Parquet file
result = duckdb.sql("SELECT region, SUM(revenue) FROM 'sales.parquet' GROUP BY region").df()
# Query all Parquet files in a directory
result = duckdb.sql("SELECT * FROM read_parquet('data/*.parquet') WHERE date >= '2025-01-01'").df()
# Join two files — no loading into memory required
result = duckdb.sql("""
SELECT o.order_id, u.email, o.total
FROM read_parquet('orders.parquet') o
JOIN read_csv('users.csv') u ON o.user_id = u.id
LIMIT 1000
""").df()
DuckDB pushes predicates and projections down to the file reader — it only reads the columns and row groups you need.
DuckDB vs Spark
| | DuckDB | Spark | |---|---|---| | Setup | Zero | Complex (JVM, cluster manager) | | Data size | Up to ~500GB single machine | Petabyte-scale distributed | | Query speed (<100GB) | Often faster | Overhead from shuffle | | Cost | Free | Compute cluster cost | | SQL support | Full SQL | SparkSQL (limitations) |
For data teams processing 1-100GB per job, DuckDB eliminates the operational overhead of Spark entirely.
Python API
import duckdb
con = duckdb.connect("analytics.duckdb") # persistent database
# Create table from Parquet
con.execute("CREATE TABLE events AS SELECT * FROM read_parquet('events/*.parquet')")
# Query with pandas interop
df = con.execute("SELECT date_trunc('day', ts) AS day, count(*) FROM events GROUP BY 1").df()
# Register a Pandas DataFrame as a virtual table
import pandas as pd
users_df = pd.read_csv("users.csv")
con.register("users", users_df)
result = con.execute("SELECT * FROM users WHERE country = 'US'").df()
MotherDuck: DuckDB in the Cloud
MotherDuck is managed DuckDB with a cloud UI, team sharing, and hybrid execution (local + cloud). It uses the same DuckDB SQL dialect. Useful for teams that want DuckDB's simplicity with cloud storage and collaboration.
DuckDB + Polars Combination
import duckdb
import polars as pl
# Read with DuckDB, convert to Polars for transformation
result = duckdb.sql("SELECT * FROM 'large.parquet' WHERE status = 'active'").pl()
# Register Polars LazyFrame in DuckDB
lf = pl.scan_parquet("data/*.parquet")
duckdb.register("my_data", lf.collect().to_arrow())
result = duckdb.sql("SELECT category, AVG(value) FROM my_data GROUP BY category").df()
When to Use DuckDB
DuckDB wins for: ad-hoc analytics on files, replacing SQLite for analytical queries, embedded analytics in Python apps, replacing pandas groupby/merge on large files. Use Spark when: data exceeds single-machine RAM, you need distributed fault tolerance, or you already have a data platform built on it.
Resources: DuckDB docs, GitHub.