Statistics for Data Science: The Practical Guide to What You Actually Need

The statistical concepts every data scientist needs, explained without academic detours. Descriptive statistics, distributions, hypothesis testing, and when you actually need them.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#statistics #hypothesis-testing #p-values #data-science

Statistics is the mathematical foundation of data science. But most statistics courses teach theory first, application second, leading to practitioners who can recite formulas but cannot decide whether a t-test is appropriate for their problem. This guide takes the opposite approach: here is what the concept means, here is when to use it, here is the common mistake to avoid.

Descriptive Statistics: Mean vs Median

The mean (average) and median (middle value when sorted) both describe the center of a distribution, but they behave differently in the presence of skew and outliers.

Mean is pulled toward outliers. If 10 employees earn $50,000 and one executive earns $1,000,000, the mean salary is about $136,000. The median is $50,000. Which number is more representative? It depends on the question. For total payroll calculation, mean is right. For "what does a typical employee earn," median is right.

A quick rule: if your data is symmetric and well-behaved, mean and median are similar and either works. If your data has a long tail (incomes, house prices, time-to-event data), prefer median for describing the typical case.
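The salary example above can be checked in a few lines. This is a minimal sketch using NumPy; the variable names are illustrative only.

```python
import numpy as np

# Salaries from the example above: ten employees and one executive
salaries = np.array([50_000] * 10 + [1_000_000])

mean_salary = salaries.mean()
median_salary = np.median(salaries)

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
```

A single extreme value moves the mean far above what any of the ten employees earns, while the median stays at the typical salary.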

Standard deviation measures spread around the mean. It has the same units as the original variable, making it interpretable. About 68% of normally distributed data falls within 1 standard deviation of the mean; about 95% falls within 2 standard deviations.
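The share of a normal distribution within one or two standard deviations can be read off the normal CDF. A short sketch using `scipy.stats.norm`:

```python
from scipy import stats

# Probability mass of a standard normal within k standard deviations of the mean
within_1_sd = stats.norm.cdf(1) - stats.norm.cdf(-1)
within_2_sd = stats.norm.cdf(2) - stats.norm.cdf(-2)

print(f"Within 1 SD: {within_1_sd:.4f}")
print(f"Within 2 SD: {within_2_sd:.4f}")
```

These figures hold only for data that is actually close to normal. For skewed data the shares can be quite different.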

Percentiles are often more useful than standard deviation, especially for skewed data. The 25th, 50th, and 75th percentiles (Q1, median, Q3) divide your data into quarters. The interquartile range (IQR = Q3 - Q1) is a robust measure of spread: it depends only on the middle half of the data, so extreme values barely affect it.
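To see how differently the two measures react to an outlier, here is a small sketch. The data and the `iqr` helper are made up for illustration.

```python
import numpy as np

clean = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
with_outlier = clean.copy()
with_outlier[-1] = 900.0  # hypothetical data-entry error

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    std = np.std(values, ddof=1)  # sample standard deviation
    print(f"{name:>12}: std = {std:8.2f}, IQR = {iqr(values):.2f}")
```

The standard deviation grows enormously once the outlier is present, while the IQR is unchanged because Q1 and Q3 sit in the untouched middle of the data.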

Probability Distributions: Where They Appear in Real Data

Normal distribution. The famous bell curve. Many natural measurements (heights, measurement errors, test scores) are approximately normal. More importantly, the Central Limit Theorem says that the average of many independent observations with finite variance is approximately normally distributed, regardless of the shape of the underlying distribution. This is why so many statistical tests that rely on normality still work reasonably well for sample means.
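A quick simulation makes the Central Limit Theorem concrete. The sketch below averages samples from an exponential distribution, which is strongly skewed; the sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

sample_size = 50      # observations averaged per sample
n_samples = 10_000    # number of repeated samples

# Exponential data (mean 1, standard deviation 1) is strongly right-skewed
draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = draws.mean(axis=1)

raw_skew = stats.skew(draws.ravel())
mean_skew = stats.skew(sample_means)

print(f"Skewness of raw data:     {raw_skew:.2f}")
print(f"Skewness of sample means: {mean_skew:.2f}")
print(f"Std of sample means:      {sample_means.std(ddof=1):.3f} "
      f"(theory: {1 / np.sqrt(sample_size):.3f})")
```

The sample means are far less skewed than the raw data and their spread shrinks as 1 / sqrt(n). With heavier-tailed data, convergence can be much slower, so the usual "large enough sample" rules of thumb are not guarantees.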

Binomial distribution. Models the number of successes in n independent trials with probability p of success each time. It shows up in conversion rate testing, A/B testing, and quality control. "Out of 1,000 visitors, 47 converted. Is that significantly different from our baseline of 4%?" This is a binomial problem.
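The conversion question above can be answered with an exact binomial test. This sketch assumes SciPy 1.7 or newer, where `scipy.stats.binomtest` is available.

```python
from scipy import stats

# Numbers from the example above: 47 conversions out of 1,000 visitors, baseline 4%
result = stats.binomtest(k=47, n=1000, p=0.04, alternative="two-sided")

print(f"Observed rate: {47 / 1000:.3f}")
print(f"p-value: {result.pvalue:.3f}")
```

For these numbers the p-value is well above 0.05: 47 conversions is within the range you would routinely see from a true 4% rate.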

Poisson distribution. Models the number of events occurring in a fixed interval of time or space when events happen independently. Website visits per minute, support tickets per day, defects per 100 units. If your count data has the property that variance approximately equals mean, Poisson is a good model.
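One simple way to check the "variance approximately equals mean" property is the ratio of the two, sometimes called the dispersion index. The sketch below uses simulated counts as a stand-in for real data; the rate of 4 per day is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated stand-in for "support tickets per day"; replace with your own counts
tickets_per_day = rng.poisson(lam=4.0, size=10_000)

mean = tickets_per_day.mean()
variance = tickets_per_day.var(ddof=1)
dispersion = variance / mean  # close to 1 for Poisson-like data

print(f"Mean: {mean:.2f}, variance: {variance:.2f}, dispersion: {dispersion:.2f}")
```

A ratio well above 1 (overdispersion) is common in real count data and suggests that a plain Poisson model is too simple.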

Power law / long-tail distributions. Many internet-scale phenomena follow them: number of followers on social media, revenue per customer, frequency of words in text. A small fraction of items accounts for the vast majority of the total. These distributions violate normal assumptions, so be careful applying normal-distribution-based statistics to them.

Hypothesis Testing: p-Values Done Right

A hypothesis test starts with a null hypothesis (H0) -- the boring claim that nothing interesting is happening (no difference, no effect). The p-value is the probability of seeing results at least as extreme as yours if the null hypothesis were true.

A p-value of 0.03 means: if there truly were no effect, there would be only a 3% chance of seeing a result at least this extreme by random chance. By convention (arbitrary but widely used), we reject H0 when p < 0.05.
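The definition can be made concrete with a permutation test, which estimates the p-value directly by simulating the null hypothesis. This is a sketch on hypothetical measurements, not a replacement for the standard tests covered later.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical measurements for two groups
group_a = np.array([12.1, 11.8, 12.6, 13.0, 12.4, 11.9])
group_b = np.array([12.9, 13.4, 13.1, 12.7, 13.8, 13.2])

observed_diff = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_permutations = 10_000
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    # Under H0 the group labels are interchangeable, so shuffle them
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[len(group_a):].mean() - shuffled[:len(group_a)].mean()

# Share of shuffles with a difference at least as extreme as the observed one
# (small tolerance so floating-point ties count as "at least as extreme")
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff) - 1e-12)

print(f"Observed difference: {observed_diff:.3f}")
print(f"Permutation p-value: {p_value:.4f}")
```

The p-value here is literally the fraction of "no effect" worlds that produce a difference as large as the one observed.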

Common mistakes with p-values:

The p-value does not tell you the probability that H0 is true. A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is true. It means the data are unlikely under H0.

Statistical significance is not practical significance. With a large enough sample, you can detect arbitrarily small effects. A 0.01% increase in click rate may be statistically significant with 10 million users but have no meaningful business impact.

p = 0.049 is not meaningfully different from p = 0.051. The 0.05 threshold is a convention, not a law of nature.
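The second mistake is easy to demonstrate. The sketch below applies a two-proportion z-test to the same small lift at two sample sizes; the click rates and the helper `two_proportion_p_value` are hypothetical and assume equal group sizes.

```python
import numpy as np
from scipy import stats

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided z-test p-value for a difference in proportions (equal group sizes)."""
    pooled = (rate_a + rate_b) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (rate_b - rate_a) / se
    return 2 * stats.norm.sf(abs(z))

# Hypothetical click rates: 5.0% vs 5.1%, the same lift at two sample sizes
for n in (10_000, 10_000_000):
    p = two_proportion_p_value(0.050, 0.051, n)
    print(f"n per group = {n:>10,}: p-value = {p:.4g}")
```

The effect size is identical in both runs. Only the sample size changes, and with it the p-value, which is why the decision about whether the lift matters has to come from the business context.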

Common Tests

t-test: Compares means between two groups. Use when your outcome variable is continuous and either approximately normally distributed or sampled in large enough numbers for the CLT to apply (n > 30 per group is a common rule of thumb), and you want to know if the means are different.

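from scipy import stats

# Hypothetical samples; replace with your own measurements
group_a_values = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9]
group_b_values = [12.9, 13.4, 13.1, 12.7, 13.8, 13.2]

# Two-sample t-test; equal_var=False selects Welch's variant (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(group_a_values, group_b_values, equal_var=False)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")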

Chi-squared test: Tests independence between two categorical variables. Use it for A/B tests on conversion rates or for testing whether two groups have different category distributions.

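from scipy import stats

# Chi-squared test for conversion rates (hypothetical counts)
converted_a, not_converted_a = 120, 880
converted_b, not_converted_b = 150, 850
contingency_table = [[converted_a, not_converted_a],
                     [converted_b, not_converted_b]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)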

Mann-Whitney U test: Non-parametric alternative to t-test. Use when your data is not normally distributed or contains ordinal data.
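A minimal sketch of the Mann-Whitney U test with `scipy.stats.mannwhitneyu`, using hypothetical ordinal ratings:

```python
from scipy import stats

# Hypothetical ordinal ratings (for example, 1-10 satisfaction scores)
ratings_a = [3, 4, 2, 5, 4, 3, 4, 2]
ratings_b = [6, 7, 5, 8, 7, 6, 9, 7]

# The test compares ranks, so it does not rely on normality
u_stat, p_value = stats.mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
print(f"U statistic: {u_stat:.1f}, p-value: {p_value:.4f}")
```

Because the test works on ranks, it answers a slightly different question than the t-test: whether values in one group tend to be larger than in the other, not whether the means differ.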

Confidence Intervals

A 95% confidence interval for a mean means: if you repeated the sampling procedure 100 times and computed the interval each time, approximately 95 of those intervals would contain the true mean. It does not mean there is a 95% probability the true mean is in this specific interval.

Confidence intervals are more informative than p-values alone. A p-value tells you how surprising the data would be if there were no effect; a confidence interval tells you the plausible magnitude of the effect.

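import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data; in practice df is your own DataFrame
df = pd.DataFrame({"response_time": [212.0, 198.5, 240.1, None, 225.3, 205.7, 231.9]})

# 95% CI for the mean, based on the t distribution
data = df["response_time"].dropna()
mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"Mean: {mean:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")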

Correlation vs Causation

Correlation measures the linear relationship between two variables (Pearson r from -1 to 1). It is easy to compute and easy to misinterpret.

Correlation does not imply causation. Ice cream sales and drowning rates are correlated (both increase in summer). Shoe size and reading ability in children are correlated (both increase with age). The correlation is real; the causal relationship is not.
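The ice cream example can be simulated. In the sketch below a confounder (temperature) drives both variables, and neither affects the other; all coefficients are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5_000

# Simulated confounder: temperature drives both outcomes; neither causes the other
temperature = rng.normal(size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(size=n)
drownings = 1.5 * temperature + rng.normal(size=n)

raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after removing the linear effect of temperature from both variables
partial_r = np.corrcoef(residuals(ice_cream_sales, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:               {raw_r:.2f}")
print(f"After adjusting for temperature: {partial_r:.2f}")
```

The raw correlation is strong, yet it all but disappears once temperature is accounted for. In real data the confounder is often unmeasured, which is why adjustment alone rarely settles a causal question.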

Establishing causation requires: an experiment with random assignment (A/B test), or a natural experiment, or causal inference methods (instrumental variables, difference-in-differences, regression discontinuity). Observational data analyzed without such a design cannot establish causation on its own.

The Multiple Testing Problem

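If you run 100 independent statistical tests at the p < 0.05 threshold, you expect about 5 false positives on average even when there is no real effect, and the chance of getting at least one is 1 - 0.95^100, which is over 99%. This is the multiple comparisons problem.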

In practice: if you test 50 features for correlation with your target, some will appear significant by chance. If you test every pairwise combination of marketing channels for differential effects, you will find spurious winners.

Solutions: Bonferroni correction (divide p-value threshold by number of tests), Benjamini-Hochberg procedure (controls false discovery rate), or pre-registering your hypotheses before looking at the data.
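Both corrections are short enough to write out by hand. The sketch below is an illustrative implementation on hypothetical p-values; the function names are made up here, and in practice a library routine may be preferable.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= alpha * k / m (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("Uncorrected (p < 0.05):", int(np.sum(np.array(p_values) < 0.05)))
print("Bonferroni:            ", int(bonferroni(p_values).sum()))
print("Benjamini-Hochberg:    ", int(benjamini_hochberg(p_values).sum()))
```

On these values, five tests pass the uncorrected threshold, one survives Bonferroni, and two survive Benjamini-Hochberg. Bonferroni is the stricter of the two, which is the usual trade-off: fewer false positives at the cost of more missed effects.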

When You Actually Need Formal Statistics

You need formal hypothesis testing when you are making a decision based on a comparison (A/B test, product change evaluation, clinical trial). You need confidence intervals when communicating uncertainty to stakeholders.

For exploratory data analysis and model building, formal statistics are often not the right tool. You want to understand your data, not prove hypotheses. Use visualization and descriptive statistics instead.

Keep Reading

ML Model Evaluation Metrics Guide — applying statistics to model performance
Exploratory Data Analysis Guide — putting statistics to work
Machine Learning Complete Guide for Software Developers — statistics in the ML context
