Exploratory Data Analysis: The Complete EDA Checklist for Data Scientists

EDA is the process of understanding a dataset before modeling. Skip it and your models will fail in ways you cannot explain.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#eda #exploratory-data-analysis #pandas #data-science

Exploratory data analysis (EDA) is the process of understanding a new dataset before building models. It is the most important step in any data science project and the one most commonly skipped by beginners. You cannot build a good model without understanding your data. This guide gives you a systematic checklist to follow every time.

Why EDA Matters

Machine learning models amplify the patterns in your data. If your data has a flaw, your model will learn and magnify that flaw. EDA catches problems before they become expensive mistakes:

A feature you thought was a predictor is actually a data leakage vector (it contains future information)
Your target variable has severe class imbalance that will make accuracy a misleading metric
Two features are nearly perfectly correlated, making one redundant
A numeric field has impossible values (ages of 999, negative prices) that will distort your model
Missing values are not random but systematic, carrying information themselves

EDA also generates hypotheses. Before you write a single line of model code, EDA tells you which features are likely to matter, which transformations might help, and what kind of model architecture makes sense.

Step 1: Shape and Data Types

Start with the basics:

import pandas as pd

df = pd.read_csv("data.csv")

# Shape: (rows, columns)
print(df.shape)

# Column names and data types
print(df.dtypes)

# First few rows
print(df.head())

# Statistical summary for numeric columns
print(df.describe())

# Include non-numeric columns
print(df.describe(include="all"))

Look for columns that are numeric but should be categorical (e.g., a status column encoded as 0/1/2). Look for columns that are stored as strings but should be numeric or datetime. These type mismatches cause silent errors downstream.
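The type mismatches described above can be repaired with a few pandas conversions. The sketch below uses a small made-up DataFrame; the column names (status, price, signup_date) and their values are assumptions for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data: every column is stored with a misleading type.
df = pd.DataFrame({
    "status": [0, 1, 2, 1],                   # numeric codes that are really labels
    "price": ["10.5", "20", "n/a", "7.25"],   # numbers stored as strings
    "signup_date": ["2024-01-05", "2024-02-10", "2024-03-15", "not a date"],
})

# The codes are labels, not quantities, so store them as a categorical.
df["status"] = df["status"].astype("category")

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

print(df.dtypes)
print(df.isna().sum())  # coerced failures now show up as missing values
```

Coercing is a deliberate trade-off: bad entries become missing values that Step 2 will count, rather than crashing the load or silently staying as strings.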

Step 2: Missing Value Counts

# Count missing values per column
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_report = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": missing_pct
}).sort_values("missing_pct", ascending=False)

print(missing_report[missing_report["missing_count"] > 0])

The critical question is not just how many values are missing, but why. Values that are missing completely at random (MCAR) can usually be imputed or dropped without biasing the result. Values that are missing not at random (MNAR) carry information. For example, if income is missing more often for low-income respondents, the missingness itself is a signal.

As a rule of thumb, columns with more than 50% missing values are usually not worth keeping. Between 10% and 50%, consider whether the missingness pattern is informative. Under 10%, standard imputation is usually safe.
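One way to check whether missingness is informative is to turn it into an indicator column and compare the target between rows with and without a value. This is a minimal sketch on invented data; the income and target columns are assumptions, and a real check would need far more rows than this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income tends to be missing where target == 1.
df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, np.nan, 48_000, np.nan, 71_000, 55_000],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Flag missingness as its own feature (1 = value was missing).
df["income_missing"] = df["income"].isna().astype(int)

# Compare the mean of the target for rows with and without an income value.
target_rate = df.groupby("income_missing")["target"].mean()
print(target_rate)
```

A large gap between the two groups suggests the values are not missing at random, in which case keeping the indicator as a feature is often more useful than imputing and moving on.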

Step 3: Distribution of Numerical Features

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram for all numeric columns
df.hist(figsize=(20, 15), bins=50)
plt.tight_layout()
plt.show()

# Distribution of a single column
sns.histplot(df["age"], kde=True)
plt.show()

What to look for: skewness (long tail on one side), multimodality (two humps suggesting the column actually encodes two different populations), impossible values (ages below 0, percentages above 100), and heavy tails that will affect models sensitive to scale (linear regression, neural networks, k-NN).

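Log-transform right-skewed features before feeding them to linear models. A plain log requires strictly positive values; for features that include zeros, log(1 + x) is a common substitute. Standardize (subtract mean, divide by std) features when using distance-based algorithms.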
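The two transformations can be sketched as follows. The spend column and its values are made up for illustration, and np.log1p (log of 1 + x) is used here as an assumption so that the zero in the data does not break the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature with a zero and one extreme value.
df = pd.DataFrame({"spend": [0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 1000.0]})

# log1p computes log(1 + x), so zeros are handled safely.
skew_before = df["spend"].skew()
df["spend_log"] = np.log1p(df["spend"])
skew_after = df["spend_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# Standardize: subtract the mean, divide by the (sample) standard deviation.
mean, std = df["spend_log"].mean(), df["spend_log"].std()
df["spend_z"] = (df["spend_log"] - mean) / std
```

When the data is later split for modeling, the mean and standard deviation would normally be computed on the training split only and then reused on the test split, so that no test information leaks into the transform.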

Step 4: Categorical Value Counts

# Value counts for all object columns
for col in df.select_dtypes(include="object").columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))

Look for: high-cardinality columns (thousands of unique values) that will be expensive to one-hot encode, rare categories that will have too few training examples to learn from, inconsistent formatting ("New York" vs "new york" vs "NY"), and categories that appear only in the test set (depending on the encoder, these either raise an error or produce columns that do not match the training data).
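Two of these problems, inconsistent formatting and rare categories, can often be handled with a few lines of pandas. The sketch below is illustrative only: the city values, the alias map, and the min_count threshold are all assumptions that would need to come from inspecting the real value counts.

```python
import pandas as pd

# Hypothetical column with inconsistent casing, whitespace, and an abbreviation.
df = pd.DataFrame({
    "city": ["New York", "new york", " NY", "Boston", "boston ", "Chicago",
             "New York", "Austin"],
})

# Normalize whitespace and case first.
cleaned = df["city"].str.strip().str.lower()

# Hand-built alias map (an assumption; derive it from value_counts() output).
aliases = {"ny": "new york"}
df["city_clean"] = cleaned.replace(aliases)

# Group categories seen fewer than min_count times into a single "other" bucket.
min_count = 2
counts = df["city_clean"].value_counts()
rare = counts[counts < min_count].index
df["city_grouped"] = df["city_clean"].where(~df["city_clean"].isin(rare), "other")

print(df["city_grouped"].value_counts())
```

Grouping rare values also softens the unseen-category problem, since the same "other" bucket can absorb categories that first appear at prediction time.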

Step 5: Correlation Matrix

# Pearson correlation for numeric features
corr_matrix = df.select_dtypes(include="number").corr()

# Visualize as heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix")
plt.show()

Correlations above 0.9 (or below -0.9) between two features indicate multicollinearity. In linear models, this destabilizes coefficient estimates. In tree-based models, it is less of a problem but still reduces interpretability.

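Also check the correlation between each feature and the target variable. This gives you a quick, rough feature-importance signal before modeling. Pearson correlation only measures linear association, so a feature with a low coefficient can still matter.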
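A heatmap becomes hard to read with many columns, so it can help to list the highly correlated pairs and the feature-to-target ranking directly. The sketch below runs on synthetic data generated with a fixed seed; the column names x1, x2, x3 and the 0.9 threshold are assumptions chosen to mirror the rule of thumb above.

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is a near-duplicate of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2.0 + rng.normal(scale=0.01, size=n),
    "x3": rng.normal(size=n),
})
df["target"] = 3.0 * df["x1"] + rng.normal(scale=0.5, size=n)

features = df.drop(columns="target")
corr = features.corr().abs()

# Keep only the upper triangle so each pair is listed once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)

# Rank features by absolute Pearson correlation with the target.
target_corr = features.corrwith(df["target"]).abs().sort_values(ascending=False)
print(target_corr)
```

Note that x2 ranks about as high as x1 against the target purely because it duplicates x1, which is why the pair list and the target ranking are best read together.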

Step 6: Target Variable Distribution

The target variable deserves extra attention:

# For classification: class balance
print(df["target"].value_counts(normalize=True))

# Visualize
sns.countplot(x="target", data=df)
plt.show()

# For regression: distribution of target
sns.histplot(df["target"], kde=True)
plt.show()

Class imbalance (e.g., 95% negative, 5% positive) is one of the most common problems in real-world classification. A model that always predicts the majority class will achieve 95% accuracy while being completely useless. You need to know this before choosing your evaluation metric.
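The arithmetic behind that claim is easy to verify. This sketch builds a hypothetical label vector with the 95/5 split from the example and scores a "model" that always predicts the majority class; no real classifier is involved.

```python
import pandas as pd

# Hypothetical labels: 95 negatives, 5 positives.
y = pd.Series([0] * 95 + [1] * 5, name="target")

# A baseline that always predicts the most frequent class.
majority_class = y.mode().iloc[0]
y_pred = pd.Series(majority_class, index=y.index)

# Accuracy: share of predictions that match the label.
accuracy = (y_pred == y).mean()

# Positive-class recall: share of actual positives predicted as positive.
recall_pos = ((y_pred == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"accuracy={accuracy:.2f}, positive recall={recall_pos:.2f}")
```

The baseline scores 0.95 accuracy and 0.00 recall on the positive class, which is the reason to look at per-class metrics whenever the value counts are this lopsided.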

For regression targets, check for skewness. A heavily skewed target (e.g., house prices, salary, user spend) often benefits from log transformation, changing the model from predicting the raw value to predicting the log value.
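Training on a log target means predictions come back on the log scale and have to be inverted before they are reported. The sketch below shows only the round trip, using made-up prices and a stand-in for model output; np.log1p and np.expm1 are one common pairing, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed regression target.
prices = pd.Series([120_000, 150_000, 180_000, 250_000, 2_000_000], name="price")

# Transform the target before training.
y_log = np.log1p(prices)

# Stand-in for model predictions on the log scale (no real model here).
pred_log = y_log.copy()

# Invert the transform to get predictions back in the original units.
pred_price = np.expm1(pred_log)
print(pred_price.round(0))
```

Errors measured on the log scale are relative rather than absolute, so it is worth deciding up front whether the evaluation metric should be computed before or after inverting.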

Step 7: Outlier Detection

# IQR method
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["salary"] < Q1 - 1.5 * IQR) | (df["salary"] > Q3 + 1.5 * IQR)]
print(f"Outlier count: {len(outliers)}")

# Box plot to visualize
sns.boxplot(x=df["salary"])
plt.show()

Outliers are not automatically bad. Sometimes they are the most important data points (fraud detection, anomaly detection). Before removing them, ask: are these real values or data errors? If they are real, keep them but consider whether your model can handle them.
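If the outliers turn out to be real values, two gentler options than deleting rows are to flag them or to clip them to the IQR fences. This is a sketch on invented salary figures, not a recommendation for any particular dataset.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
df = pd.DataFrame({
    "salary": [40_000, 42_000, 45_000, 47_000, 50_000, 52_000, 55_000, 1_000_000],
})

# Same IQR fences as the detection step.
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: keep the value but flag it so the model can use the information.
df["salary_outlier"] = ~df["salary"].between(lower, upper)

# Option 2: clip extreme values to the fences instead of dropping the rows.
df["salary_clipped"] = df["salary"].clip(lower=lower, upper=upper)

print(df)
```

Flagging preserves the original value for tree-based models, while clipping limits its influence on scale-sensitive models; which one fits depends on whether the extreme values are the signal or the noise.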

Automated EDA: ydata-profiling

For a quick automated report:

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Dataset Report")
profile.to_file("report.html")

This generates an HTML report with distributions, correlations, missing value analysis, and interaction plots. It is a useful starting point but does not replace manual EDA. Automated reports miss domain-specific context.

What to Document from EDA

After completing EDA, write down:

Which features have high missing rates and what you plan to do about them
Which features are correlated with the target (ranked)
Which features are correlated with each other (multicollinearity risks)
Class balance of the target (for classification)
Any data quality issues found and how they were resolved
Columns to drop and why

This document becomes the specification for your feature engineering step.
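Part of this write-up can be generated rather than typed. The sketch below collects a few of the items above into a dictionary and prints it as JSON; the toy DataFrame, the key names, and the structure of the summary are all assumptions, and the judgment calls (what to drop and why) still have to be filled in by hand.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical toy dataset.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "city": ["a", "b", "b", None, "a"],
    "target": [0, 1, 0, 0, 1],
})

missing_pct = (df.isna().mean() * 100).round(1)
balance = df["target"].value_counts(normalize=True).round(2)

summary = {
    "n_rows": int(len(df)),
    "missing_pct": {str(k): float(v) for k, v in missing_pct.items()},
    "target_balance": {str(k): float(v) for k, v in balance.items()},
    "columns_to_drop": {},  # fill in manually: column name -> reason
}

print(json.dumps(summary, indent=2))
```

Saving this output next to the notebook gives the feature engineering step a fixed reference point, even if the dataset is refreshed later.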

Keep Reading

Data Visualization Guide for Python — visualize your EDA findings effectively
Feature Engineering Practical Guide — what to do after EDA
Machine Learning Complete Guide for Software Developers — from EDA to production models
