Exploratory data analysis (EDA) is the process of understanding a new dataset before building models. It is arguably the most important step in a data science project, and the one beginners most often skip. Without understanding your data, you are unlikely to build a model you can trust. This guide gives you a systematic checklist to follow every time.
Why EDA Matters
Machine learning models amplify the patterns in your data. If your data has a flaw, your model will learn and magnify that flaw. EDA catches problems before they become expensive mistakes:
- A feature you thought was a predictor is actually a data leakage vector (it contains future information)
- Your target variable has severe class imbalance that will make accuracy a misleading metric
- Two features are nearly perfectly correlated, making one redundant
- A numeric field has impossible values (ages of 999, negative prices) that will distort your model
- Missing values are not random but systematic, carrying information themselves
EDA also generates hypotheses. Before you write a single line of model code, EDA tells you which features are likely to matter, which transformations might help, and what kind of model architecture makes sense.
Step 1: Shape and Data Types
Start with the basics:
import pandas as pd
df = pd.read_csv("data.csv")
# Shape: (rows, columns)
print(df.shape)
# Column names and data types
print(df.dtypes)
# First few rows
print(df.head())
# Statistical summary for numeric columns
print(df.describe())
# Include non-numeric columns
print(df.describe(include="all"))
Look for columns that are numeric but should be categorical (e.g., a status column encoded as 0/1/2). Look for columns that are stored as strings but should be numeric or datetime. These type mismatches cause silent errors downstream.
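As a minimal sketch of fixing such mismatches, the example below converts each kind of column described above. The frame `df_types` and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame containing the type mismatches described above.
df_types = pd.DataFrame({
    "status": [0, 1, 2, 1],                  # numeric codes that are really categories
    "price": ["10.5", "20", "n/a", "7.25"],  # numbers stored as strings
    "signup_date": ["2024-01-05", "2024-02-10", "2024-03-15", "2024-04-20"],
})

# Treat the status codes as categories rather than quantities.
df_types["status"] = df_types["status"].astype("category")

# errors="coerce" turns unparseable entries into NaN instead of raising,
# so count them afterwards rather than letting them vanish silently.
df_types["price"] = pd.to_numeric(df_types["price"], errors="coerce")
n_unparseable = int(df_types["price"].isna().sum())

# Parse ISO-formatted date strings into datetime values.
df_types["signup_date"] = pd.to_datetime(df_types["signup_date"])

print(df_types.dtypes)
print(f"Unparseable prices: {n_unparseable}")
```

Counting the coerced values matters: a conversion that quietly produces many NaN values usually points to a formatting problem worth investigating.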
Step 2: Missing Value Counts
# Count missing values per column
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_report = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": missing_pct,
}).sort_values("missing_pct", ascending=False)
print(missing_report[missing_report["missing_count"] > 0])
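The critical question is not just how many values are missing, but why. Values missing completely at random (MCAR) can usually be imputed or dropped without introducing bias. Values missing at random (MAR) depend on other observed columns, so imputation should use those columns. Values missing not at random (MNAR) depend on the unobserved value itself and carry information. For example, if income is missing more often for low-income respondents, the missingness itself is a signal.
As a rule of thumb, columns with more than 50% missing values are usually not worth keeping unless the missingness itself is predictive. Between 10% and 50%, consider whether the missingness pattern is informative. Under 10%, standard imputation is usually safe.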
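One rough way to check whether missingness is informative is to compare the target between rows where a column is missing and rows where it is present. The sketch below assumes a binary target. The frame `df_miss` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" is missing for some rows, "target" is binary.
df_miss = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 48000, np.nan, 75000, 58000],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Indicator column: 1 where income is missing, 0 otherwise.
df_miss["income_missing"] = df_miss["income"].isna().astype(int)

# Mean of a 0/1 target is the positive rate; compare it across the two groups.
target_rate_by_missing = df_miss.groupby("income_missing")["target"].mean()
print(target_rate_by_missing)
```

A large gap between the two rates suggests the missingness is related to the target, in which case keeping the indicator column as a feature may be worth trying. This check cannot tell you why the values are missing. That still requires knowledge of how the data was collected.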
Step 3: Distribution of Numerical Features
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for all numeric columns
df.hist(figsize=(20, 15), bins=50)
plt.tight_layout()
plt.show()
# Distribution of a single column
sns.histplot(df["age"], kde=True)
plt.show()
What to look for: skewness (long tail on one side), multimodality (two humps suggesting the column actually encodes two different populations), impossible values (ages below 0, percentages above 100), and heavy tails that will affect models sensitive to scale (linear regression, neural networks, k-NN).
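Consider log-transforming right-skewed features before feeding them to linear models. The logarithm is defined only for positive values, so use log(1 + x) when a feature contains zeros. Standardize features (subtract the mean, divide by the standard deviation) when using distance-based algorithms such as k-NN.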
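Both transformations can be sketched in a few lines of NumPy and pandas. The column name `spend` and its values are invented for illustration, and the statistics are computed on the full toy frame only for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature containing a zero and one very large value.
df_scale = pd.DataFrame({"spend": [0.0, 5.0, 12.0, 30.0, 2500.0]})

# log1p computes log(1 + x), so zeros are handled; negative values are not.
df_scale["spend_log"] = np.log1p(df_scale["spend"])

# Standardize: subtract the mean and divide by the standard deviation.
# In a real project, compute mean and std on the training split only
# and reuse them for validation and test data.
mean = df_scale["spend_log"].mean()
std = df_scale["spend_log"].std(ddof=0)
df_scale["spend_z"] = (df_scale["spend_log"] - mean) / std

print(df_scale)
```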
Step 4: Categorical Value Counts
# Value counts for all object columns
for col in df.select_dtypes(include="object").columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
Look for: high cardinality columns (thousands of unique values) that will be expensive to one-hot encode, rare categories that will have too few training examples to learn from, inconsistent formatting ("New York" vs "new york" vs "NY"), and categories that appear only in the test set (depending on the encoder, these raise an error, are encoded as all zeros, or produce columns that do not match the training set).
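A small sketch of cleaning up inconsistent formatting and grouping rare categories follows. The `aliases` mapping and the rarity threshold of 2 are assumptions for this toy example; real values depend on your data.

```python
import pandas as pd

# Hypothetical city column with inconsistent formatting.
cities = pd.Series(["New York", "new york ", "NY", "Boston", "boston", "Austin"])

# Normalize whitespace and case so trivially different spellings collapse.
normalized = cities.str.strip().str.lower()

# Map known abbreviations to a canonical name (hand-written, illustrative mapping).
aliases = {"ny": "new york"}
normalized = normalized.replace(aliases)

# Group categories seen fewer than 2 times into a single "other" bucket.
counts = normalized.value_counts()
rare = counts[counts < 2].index
grouped = normalized.where(~normalized.isin(rare), "other")

print(grouped.value_counts())
```

An "other" bucket also gives unseen test-set categories somewhere to go, provided you apply the same mapping at prediction time.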
Step 5: Correlation Matrix
# Pearson correlation for numeric features
corr_matrix = df.select_dtypes(include="number").corr()
# Visualize as heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix")
plt.show()
Absolute correlations above roughly 0.9 between two features signal strong collinearity. In linear models, this destabilizes coefficient estimates. Tree-based models are less affected in predictive accuracy, but importance gets split between the correlated features, which reduces interpretability.
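On wide datasets a heatmap becomes hard to read, so it can help to list the highly correlated pairs directly. The sketch below uses synthetic data. The column names and the 0.9 cutoff are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns carry the same information in different units.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df_corr = pd.DataFrame({
    "height_cm": x,
    "height_in": x / 2.54,
    "noise": rng.normal(size=200),
})

corr = df_corr.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature_a, feature_b) and filter by threshold.
# The comparison is False for NaN, so masked entries never pass the filter.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```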
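Also check the correlation between each feature and the target variable. This gives you a quick, rough signal of feature relevance before modeling. Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.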
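A ranked list of feature-target correlations can be produced with `corrwith`. This is a sketch on synthetic data with a numeric target, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic data: "signal" drives the target, "noise" does not.
rng = np.random.default_rng(1)
n = 300
signal = rng.normal(size=n)
df_target = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=n),
    "target": 2.0 * signal + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of every feature with the target, ranked by absolute value.
features = df_target.drop(columns="target")
target_corr = (
    features.corrwith(df_target["target"])
    .abs()
    .sort_values(ascending=False)
)

print(target_corr)
```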
Step 6: Target Variable Distribution
The target variable deserves extra attention:
# For classification: class balance
print(df["target"].value_counts(normalize=True))
# Visualize
sns.countplot(x="target", data=df)
plt.show()
# For regression: distribution of target
sns.histplot(df["target"], kde=True)
plt.show()
Class imbalance (e.g., 95% negative, 5% positive) is one of the most common problems in real-world classification. A model that always predicts the majority class will achieve 95% accuracy while being completely useless. You need to know this before choosing your evaluation metric.
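The 95% figure is easy to reproduce. The sketch below builds a synthetic 95/5 label vector, predicts the majority class for every row, and compares accuracy with per-class recall. Balanced accuracy is computed by hand as the mean of the two recalls.

```python
import numpy as np
import pandas as pd

# Synthetic labels: 95 negatives, 5 positives.
y_true = pd.Series([0] * 95 + [1] * 5)

# "Model" that always predicts the most frequent class.
majority_class = y_true.mode()[0]
y_pred = pd.Series(np.full(len(y_true), majority_class))

accuracy = (y_pred == y_true).mean()

# Recall per class: the fraction of each true class that was predicted correctly.
positives = y_true == 1
recall_pos = (y_pred[positives] == 1).mean()
recall_neg = (y_pred[~positives] == 0).mean()
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"accuracy={accuracy:.2f}, recall_pos={recall_pos:.2f}, "
      f"balanced_accuracy={balanced_accuracy:.2f}")
```

Accuracy looks strong while recall on the positive class is zero, which is why metrics such as recall, precision, or balanced accuracy are often more informative on imbalanced targets.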
For regression targets, check for skewness. A heavily skewed target (e.g., house prices, salary, user spend) often benefits from a log transformation, which changes the model from predicting the raw value to predicting its log. Predictions must then be transformed back to the original scale before you report errors in real units.
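A minimal sketch of the round trip for a log-transformed target is shown below. The prices are invented, and no model is trained here: the transformed values stand in for predictions made on the log scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed target, e.g. house prices.
prices = pd.Series([120_000.0, 250_000.0, 310_000.0, 1_900_000.0])

# Train on the log scale: log1p(x) = log(1 + x).
log_prices = np.log1p(prices)

# A model fitted to log_prices returns predictions on the log scale.
# expm1 is the exact inverse of log1p and maps them back to currency units.
recovered = np.expm1(log_prices)

print(pd.DataFrame({"price": prices, "log_price": log_prices, "recovered": recovered}))
```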
Step 7: Outlier Detection
# IQR method
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["salary"] < Q1 - 1.5 * IQR) | (df["salary"] > Q3 + 1.5 * IQR)]
print(f"Outlier count: {len(outliers)}")
# Box plot to visualize
sns.boxplot(x=df["salary"])
plt.show()
Outliers are not automatically bad. Sometimes they are the most important data points (fraud detection, anomaly detection). Before removing them, ask: are these real values or data errors? If they are real, keep them but consider whether your model can handle them.
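If the outliers are real, one option is to flag or cap them instead of deleting rows. The sketch below wraps the IQR rule in a helper. The function name `iqr_bounds` and the sample salaries are hypothetical.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences using the IQR rule with multiplier k."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical salaries with one extreme value.
df_out = pd.DataFrame({"salary": [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000]})

lower, upper = iqr_bounds(df_out["salary"])

# Option 1: keep every row, but record which ones fall outside the fences.
df_out["salary_is_outlier"] = ~df_out["salary"].between(lower, upper)

# Option 2: cap extreme values at the fences instead of dropping them.
df_out["salary_clipped"] = df_out["salary"].clip(lower=lower, upper=upper)

print(df_out)
```

Flagging preserves the original values for models that can use them, while clipping limits their influence on scale-sensitive models. Which one fits depends on whether the extreme values are the signal you care about.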
Automated EDA: ydata-profiling
For a quick automated report:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Dataset Report")
profile.to_file("report.html")
This generates an HTML report with distributions, correlations, missing value analysis, and interaction plots. It is a useful starting point but does not replace manual EDA. Automated reports miss domain-specific context.
What to Document from EDA
After completing EDA, write down:
- Which features have high missing rates and what you plan to do about them
- Which features are correlated with the target (ranked)
- Which features are correlated with each other (multicollinearity risks)
- Class balance of the target (for classification)
- Any data quality issues found and how they were resolved
- Columns to drop and why
This document becomes the specification for your feature engineering step.
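Part of this document can be generated rather than typed. The helper below is one possible starting point, not a standard tool: `eda_summary` and the fields it returns are assumptions, and the judgment calls (what to drop and why) still have to be written by hand.

```python
import json

import pandas as pd


def eda_summary(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic EDA facts into a JSON-serializable dict."""
    missing_pct = df.isna().mean().mul(100).round(1)
    target_dist = df[target].value_counts(normalize=True)
    return {
        "n_rows": int(len(df)),
        "n_columns": int(df.shape[1]),
        # Only columns that actually have missing values.
        "missing_pct": {c: float(p) for c, p in missing_pct[missing_pct > 0].items()},
        # Class shares for a classification target.
        "target_distribution": {str(k): float(v) for k, v in target_dist.items()},
    }


# Hypothetical toy frame.
df_doc = pd.DataFrame({"age": [25, None, 40, 31], "target": [0, 1, 0, 0]})
summary = eda_summary(df_doc, target="target")

print(json.dumps(summary, indent=2))
```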