Row-Based vs Columnar Storage
CSV and JSON store data row by row. To compute the average of one column across 100M rows, you read all 100M rows worth of data — including every column you do not need.
Apache Parquet stores data column by column. To compute that average, you read only the one column you need. On a table with 50 columns, this is a 50x reduction in I/O.
Why Columnar Compression Is Better
Column values are highly similar to each other — a "country" column might be 90% "US". Columnar storage enables:
- Dictionary encoding: store unique values once, reference by integer
- Run-length encoding: "US × 90,000 rows" as one entry
- Delta encoding: store differences instead of absolute values for timestamps
Result: a 10GB CSV often becomes a 500MB–2GB Parquet file. Snappy compression (default) balances speed and ratio. Zstd offers better compression at slightly higher CPU cost.
Reading Parquet with Python
# Pandas
import pandas as pd
df = pd.read_parquet("data.parquet")
df = pd.read_parquet("data.parquet", columns=["date", "user_id", "revenue"]) # column pushdown
# Polars (faster)
import polars as pl
df = pl.read_parquet("data.parquet")
df = pl.scan_parquet("data/*.parquet").filter(pl.col("date") >= "2025-01-01").collect()
# DuckDB (SQL on files)
import duckdb
result = duckdb.sql("SELECT date, SUM(revenue) FROM 'data.parquet' GROUP BY date").df()
Writing Parquet with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"date": ["2025-01-01"] * 100000, "revenue": range(100000)})
table = pa.Table.from_pandas(df)
# Write with Zstd compression
pq.write_table(
table,
"output.parquet",
compression="zstd",
row_group_size=100000,
)
Predicate Pushdown
Parquet stores min/max statistics per row group (a block of rows). Readers use these statistics to skip row groups that cannot contain matching rows:
# Only reads row groups where date might be >= 2025-06-01
df = pd.read_parquet(
"data.parquet",
filters=[("date", ">=", "2025-06-01")]
)
For this to work, sort your data by the filter column before writing.
Partitioning for Query Pruning
import pyarrow.parquet as pq
pq.write_to_dataset(
table,
root_path="data/",
partition_cols=["year", "month"], # creates data/year=2025/month=06/part-0.parquet
)
# Reading only June 2025 — skips all other partitions
df = pd.read_parquet("data/", filters=[("year", "=", 2025), ("month", "=", 6)])
Partition columns become directories. Queries that filter on partition columns skip entire directories without opening files.
Parquet vs ORC vs Avro
- Parquet: best for analytics (column scans, BI tools, Spark, DuckDB, Polars) — the default choice
- ORC: similar to Parquet, optimized for Hive/HBase, slightly better compression in some cases
- Avro: row-based, better for Kafka streaming and write-heavy workloads, schema evolution
Delta Lake on Top of Parquet
Delta Lake adds ACID transactions, schema enforcement, and time travel on top of Parquet files. The underlying files are Parquet, but Delta adds a transaction log that enables updates, deletes, and rollback — none of which plain Parquet supports.
Resources: Apache Parquet, PyArrow Parquet docs.