Pandas is a Python library built on top of NumPy that provides fast, flexible data structures for working with structured data. If you know Python but not data science, pandas is the first tool you need to learn. It is the dominant library for data manipulation and analysis, installed in virtually every data science environment.
What Pandas Is and Why It Dominates
Pandas gives you two primary data structures: the Series (a one-dimensional labeled array) and the DataFrame (a two-dimensional labeled table, like a spreadsheet or SQL table). The DataFrame is what you will use for almost everything.
The library dominates data manipulation for several reasons. It handles heterogeneous data naturally (one column can be integers, another strings, another timestamps). It has built-in support for missing data. Its indexing is flexible and powerful. And it integrates seamlessly with every other tool in the Python data science stack.
For developers coming from a backend or systems background, pandas is the closest thing Python has to working with SQL tables in memory. Once you internalize that mental model, the API becomes much more intuitive.
Reading Data: CSVs and JSON
The most common entry points:
import pandas as pd
# Read a CSV
df = pd.read_csv("data.csv")
# Read JSON
df = pd.read_json("data.json")
# Read from a URL directly
df = pd.read_csv("https://example.com/data.csv")
pd.read_csv is surprisingly powerful. It handles encoding detection, date parsing, custom delimiters, skipping rows, and reading only specific columns. Check its documentation before writing custom parsing code.
For large files, use chunksize to read in batches:
for chunk in pd.read_csv("large_file.csv", chunksize=10000):
process(chunk)
Filtering Rows: Boolean Indexing
This is how you filter in pandas. You create a boolean Series and pass it to the DataFrame:
# All rows where age > 30
adults = df[df["age"] > 30]
# Multiple conditions (use & and |, not 'and'/'or')
senior_employees = df[(df["age"] > 50) & (df["department"] == "Engineering")]
# String contains
python_devs = df[df["skills"].str.contains("Python")]
The key thing to understand: df["age"] > 30 returns a Series of True/False values. Passing that back into df[...] keeps only the True rows.
Selecting Columns
# Single column returns a Series
names = df["name"]
# Multiple columns returns a DataFrame
subset = df[["name", "age", "department"]]
# Using .loc for label-based selection (rows and columns)
subset = df.loc[0:5, ["name", "age"]]
# Using .iloc for integer-based selection
subset = df.iloc[0:5, 0:3]
Prefer .loc and .iloc over chained bracket indexing. The difference matters: .loc uses labels, .iloc uses integer positions.
GroupBy and Aggregation
GroupBy is one of the most powerful operations in pandas. It works like SQL's GROUP BY:
# Average salary by department
avg_salary = df.groupby("department")["salary"].mean()
# Multiple aggregations at once
stats = df.groupby("department")["salary"].agg(["mean", "median", "count", "std"])
# GroupBy on multiple columns
team_stats = df.groupby(["department", "level"])["salary"].mean()
# Custom aggregation
result = df.groupby("department").agg(
avg_salary=("salary", "mean"),
headcount=("employee_id", "count"),
max_tenure=("years", "max")
)
The .agg() method with named aggregations is the cleanest approach for production code.
Merging DataFrames: SQL-Style Joins
pd.merge() is the pandas equivalent of SQL JOIN:
# Inner join (only matching rows)
merged = pd.merge(employees, departments, on="department_id", how="inner")
# Left join (all rows from left, matching from right)
merged = pd.merge(employees, salaries, on="employee_id", how="left")
# Join on different column names
merged = pd.merge(df1, df2, left_on="emp_id", right_on="employee_id")
The how parameter matches SQL semantics: "inner", "left", "right", "outer".
Handling Missing Values
Pandas represents missing data as NaN (for floats) or pd.NA (for nullable integer/string types). Key operations:
# Check for missing values
df.isnull().sum() # count per column
df.isnull().any() # True/False per column
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows where specific columns are missing
df_clean = df.dropna(subset=["name", "email"])
# Fill missing values
df["salary"].fillna(df["salary"].median(), inplace=False)
# Forward-fill (good for time series)
df["price"].ffill()
Never use inplace=True in production code. Assign to a new variable instead. More on this below.
Writing Results Back
# Write to CSV
df.to_csv("output.csv", index=False) # index=False prevents writing the row index
# Write to JSON
df.to_json("output.json", orient="records")
# Write to a database (SQLAlchemy engine)
df.to_sql("table_name", engine, if_exists="replace", index=False)
The Gotchas Every Developer Must Know
Chained indexing warnings. This is the most common pandas mistake:
# BAD: May not work, triggers SettingWithCopyWarning
df[df["age"] > 30]["salary"] = 100000
# GOOD: Use .loc
df.loc[df["age"] > 30, "salary"] = 100000
When you chain two indexing operations, pandas may be operating on a copy, and your assignment will silently fail. Always use .loc for combined row/column assignment.
inplace vs assignment. The inplace=True parameter seems convenient but creates problems. It returns None, making method chaining impossible, and does not actually save memory in most cases. Always use assignment:
# BAD
df.dropna(inplace=True)
# GOOD
df = df.dropna()
Copy vs view. When you slice a DataFrame, you may get a view (points to the same memory) or a copy (independent data). Pandas 2.0 introduced Copy-on-Write semantics to make this more predictable. If you are on pandas 2.0+, enable it explicitly:
pd.options.mode.copy_on_write = True
With Copy-on-Write enabled, any mutation creates a copy automatically, eliminating the chained indexing problem entirely.
When to Use Pandas vs SQL vs Polars
Use SQL when your data lives in a database. Let the database engine do the work. Pulling millions of rows into pandas to then aggregate them is wasteful. Run your GROUP BY in the database, pull the result into pandas.
Use pandas when you have already extracted data and need to manipulate it in Python. Pandas excels at irregular transformations, applying custom functions, and feeding data into ML libraries like scikit-learn.
Use polars when performance matters and your data fits in memory. Polars is a newer library with a Rust backend that is typically 5-10x faster than pandas for large datasets. It has a nearly identical API and is worth learning if you deal with datasets over a few hundred MB regularly.
For most data science tasks on datasets under 1GB, pandas is the right choice. Its ecosystem integration is unmatched, and the performance is adequate. Switch to polars or DuckDB when you hit performance limits.
Keep Reading
- Machine Learning Complete Guide for Software Developers — next step after mastering data manipulation
- SQL for Data Scientists: Window Functions and CTEs — complement pandas with SQL
- Python Data Science Tools in 2026 — the full stack context
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.