What Changed in Pandas 2.x
Pandas 2.0 was among the largest breaking releases in the library's history. Two changes matter most: Copy-on-Write (CoW) semantics and an optional PyArrow backend.
Copy-on-Write: No More SettingWithCopyWarning
The infamous SettingWithCopyWarning existed because Pandas 1.x could not guarantee whether an indexing operation returned a copy or a view. Copy-on-Write removes the ambiguity: anything derived from a DataFrame behaves as a copy. CoW is opt-in throughout Pandas 2.x and becomes the default in Pandas 3.0.
Old behavior (Pandas 1.x):
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
subset = df[df["A"] > 1]
subset["B"] = 99 # SettingWithCopyWarning — does this modify df?
New behavior (Pandas 2.x with CoW enabled):
# Opt in on Pandas 2.x; CoW becomes the default in 3.0
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
subset = df[df["A"] > 1]
subset["B"] = 99 # Always modifies a copy — df is unchanged
print(df["B"]) # [4, 5, 6] — unmodified
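With CoW, every object derived from another DataFrame or Series behaves as an independent copy. Where Pandas can return a view (column selection, row slices, most methods), the copy is lazy: the data is shared until one side is written to, and only then is the affected data copied. This is safer than the old behavior and often faster, because many of the defensive copies that Pandas 1.x made eagerly are avoided.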
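The lazy-copy behavior can be observed directly with np.shares_memory. The following is a minimal sketch, assuming a Pandas 2.x install with CoW switched on; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Opt in on pandas 2.x (assumption: a 2.x install; 3.0 enables this by default).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
head = df.iloc[:2]  # a row slice, backed by a view of the parent's data

# Before any write, the slice still points at the parent's buffer.
shared_before = np.shares_memory(df["B"].to_numpy(), head["B"].to_numpy())

head.iloc[0, 1] = 99  # the first write triggers the deferred copy

# After the write, the modified column lives in its own memory.
shared_after = np.shares_memory(df["B"].to_numpy(), head["B"].to_numpy())

print(shared_before, shared_after)
print(df["B"].tolist())    # parent is untouched
print(head["B"].tolist())  # only the slice changed
```

The parent frame never sees the write, which is the guarantee that makes SettingWithCopyWarning unnecessary.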
Chained assignment no longer works:
# With CoW this never updates df; Pandas emits a ChainedAssignmentError warning
df[df["A"] > 1]["B"] = 99
# Do this instead
df.loc[df["A"] > 1, "B"] = 99
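To check this on your own install, the sketch below runs two common chained forms and then the .loc form. It assumes Pandas 2.x with CoW enabled and records any warnings instead of asserting on them, since whether the warning is detected can vary between versions and execution contexts.

```python
import warnings
import pandas as pd

pd.options.mode.copy_on_write = True  # opt-in on pandas 2.x

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Both chained forms write to a temporary object, never to df.
    df[df["A"] > 1]["B"] = 99
    df["B"][df["A"] > 1] = 99

after_chained = df["B"].tolist()
# Expected to include ChainedAssignmentError on 2.x with CoW enabled.
warning_names = sorted({w.category.__name__ for w in caught})

# A single .loc call selects rows and column in one step and updates df.
df.loc[df["A"] > 1, "B"] = 99
after_loc = df["B"].tolist()

print(after_chained, warning_names)
print(after_loc)
```

The second chained form, df["B"][mask] = 99, is the one that used to modify df in Pandas 1.x, so it is the pattern most likely to change behavior during a migration.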
PyArrow Backend: 10x Less Memory on Strings
import pandas as pd
# Default NumPy backend
df_numpy = pd.read_csv("data.csv")
print(df_numpy.dtypes) # object for strings — very memory inefficient
# PyArrow backend
df_arrow = pd.read_csv("data.csv", dtype_backend="pyarrow")
print(df_arrow.dtypes) # string[pyarrow], int64[pyarrow], etc.
print(df_arrow.memory_usage(deep=True).sum()) # often 5-10x less
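PyArrow strings are stored as one contiguous UTF-8 buffer plus an offsets array, rather than as an array of pointers to individual Python string objects. A column of 1M short strings (like country codes) therefore uses a small fraction of the memory of a NumPy object array. Dictionary encoding is a separate step: for heavily repeated values, a categorical or Arrow dictionary type shrinks the column further.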
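The difference is easy to measure without a CSV file. This is a rough sketch on synthetic data (the country codes and row count are made up), assuming pyarrow is installed; the import guard only exists so the snippet still runs when it is not.

```python
import pandas as pd

# Assumption: pyarrow is installed; the flag lets the sketch run without it.
try:
    import pyarrow  # noqa: F401
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

# Synthetic data (hypothetical): 100k short, heavily repeated country codes.
codes = ["US", "DE", "IN", "BR"] * 25_000

as_object = pd.Series(codes, dtype=object)
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_object.astype("category").memory_usage(deep=True)
print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

if HAVE_PYARROW:
    arrow_bytes = as_object.astype("string[pyarrow]").memory_usage(deep=True)
    print(f"pyarrow:  {arrow_bytes:,} bytes")
```

Actual ratios depend on string length and cardinality, so measure on a sample of your own data before committing to a backend.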
Nullable Integer Types
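Pandas has proper nullable integer types. They predate 2.x (Int64 arrived in 0.24), and 2.x adds dtype_backend="numpy_nullable" to request them for a whole frame on read. The basic behavior: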
# Old: integers with NaN required float dtype
s = pd.Series([1, 2, None])
print(s.dtype) # float64 — NaN forced float
# New: nullable integer
s = pd.Series([1, 2, None], dtype="Int64") # capital I
print(s.dtype) # Int64
print(s.isna()) # [False, False, True]
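The same dtypes can be requested when loading data. A small sketch, assuming Pandas 2.0 or later and using a made-up inline CSV in place of a real file:

```python
import io
import pandas as pd

# Hypothetical inline CSV with a missing integer in the "qty" column.
csv_text = "id,qty\n1,10\n2,\n3,30\n"

default_df = pd.read_csv(io.StringIO(csv_text))
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(default_df["qty"].dtype)   # missing value forces a float column
print(nullable_df["qty"].dtype)  # stays an integer column holding pd.NA

# Reductions skip pd.NA by default, so the total covers the two known values.
total = nullable_df["qty"].sum()
missing = int(nullable_df["qty"].isna().sum())
print(total, missing)
```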
Pandas 2 vs Polars Decision Tree
- Data < 1M rows, existing Pandas codebase → stay on Pandas 2.x with CoW
- Data > 10M rows, new pipeline → use Polars
- Need SQL-style analytics on files → use DuckDB
- Need both transformation and SQL → DuckDB + Polars
Migration Checklist
- Enable CoW early: pd.options.mode.copy_on_write = True
- Replace all chained assignment with .loc[]
- Test with dtype_backend="pyarrow" and verify operations still work
- Update append() calls to pd.concat() (append was removed in 2.0)
- Update DataFrame.swapaxes() callers (deprecated in 2.1; use transpose() instead)
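The last two checklist items are mechanical rewrites. A minimal sketch with placeholder frames (df and new_rows are illustrative names, not from any real codebase):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
new_rows = pd.DataFrame({"A": [5], "B": [6]})

# Before (removed in 2.0): df = df.append(new_rows, ignore_index=True)
combined = pd.concat([df, new_rows], ignore_index=True)

# Before (deprecated in 2.1): flipped = df.swapaxes(0, 1)
flipped = df.transpose()

print(combined)
print(flipped)
```

If append() was called inside a loop, collecting the pieces in a list and calling pd.concat() once at the end is usually the better rewrite, since each concat copies the accumulated data.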
Resources: Pandas 2.0 changelog, Copy-on-Write guide.