DuckDB: In-Process Analytics That Replaces Spark for Single-Machine Workloads

DuckDB runs inside your Python or R process with zero setup, queries Parquet files directly with SQL, and often outperforms Spark on datasets under 100GB on a single machine.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 10, 2026

7 min read

// tags

#duckdb #analytics #sql #columnar #olap


DuckDB vs Spark

Criterion	DuckDB	Spark
Setup	Zero	Complex (JVM, cluster manager)
Data size	Up to ~500GB single machine	Petabyte-scale distributed
Query speed (<100GB)	Often faster	Overhead from shuffle
Cost	Free	Compute cluster cost
SQL support	Full SQL	SparkSQL (limitations)

For data teams processing 1-100GB per job on a single machine, DuckDB removes Spark's operational overhead: there is no JVM to tune, no cluster manager to run, and no shuffle between nodes.

Python API

import duckdb
import pandas as pd

# Open (or create) a persistent database file
con = duckdb.connect("analytics.duckdb")

# Materialize a table from every Parquet file matching the glob;
# OR REPLACE keeps the script re-runnable against the same database file
con.execute("CREATE OR REPLACE TABLE events AS SELECT * FROM read_parquet('events/*.parquet')")

# Aggregate in DuckDB and fetch the result as a pandas DataFrame
df = con.execute("SELECT date_trunc('day', ts) AS day, count(*) AS n FROM events GROUP BY 1 ORDER BY 1").df()

# Register a pandas DataFrame as a virtual table (a view over the DataFrame, not a copy)
users_df = pd.read_csv("users.csv")
con.register("users", users_df)
result = con.execute("SELECT * FROM users WHERE country = 'US'").df()

MotherDuck: DuckDB in the Cloud

MotherDuck is a managed DuckDB service with a cloud UI, team sharing, and hybrid execution (local + cloud). It uses the same DuckDB SQL dialect, which makes it useful for teams that want DuckDB's simplicity with cloud storage and collaboration.

DuckDB + Polars Combination

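import duckdb
import polars as pl

# Filter with DuckDB, then hand the result to Polars for transformation
active = duckdb.sql("SELECT * FROM 'large.parquet' WHERE status = 'active'").pl()

# Expose Polars data to DuckDB: collect the LazyFrame (this loads it into memory)
# and register the resulting Arrow table under the name my_data
lf = pl.scan_parquet("data/*.parquet")
duckdb.register("my_data", lf.collect().to_arrow())

# .df() returns a pandas DataFrame; use .pl() instead to stay in Polars
summary = duckdb.sql("SELECT category, AVG(value) AS avg_value FROM my_data GROUP BY category").df()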

When to Use DuckDB

DuckDB wins for: ad-hoc analytics on files, replacing SQLite for analytical queries, embedded analytics in Python apps, and replacing pandas groupby/merge on large files. Use Spark when: data exceeds what a single machine can hold and process, you need distributed fault tolerance, or you already have a data platform built on it.

Resources: DuckDB docs, GitHub.
