Statistics is the mathematical foundation of data science. But most statistics courses teach theory first, application second, leading to practitioners who can recite formulas but cannot decide whether a t-test is appropriate for their problem. This guide takes the opposite approach: here is what the concept means, here is when to use it, here is the common mistake to avoid.
Descriptive Statistics: Mean vs Median
The mean (average) and median (middle value when sorted) both describe the center of a distribution, but they behave differently in the presence of skew and outliers.
The mean is pulled toward outliers. If 10 employees earn $50,000 and one executive earns $1,000,000, the mean salary is about $136,000, while the median is $50,000. Which number is more representative? It depends on the question. For total payroll, the mean is right, because mean times headcount gives the total. For "what does a typical employee earn," the median is right.
A quick rule: if your data is symmetric and well-behaved, mean and median are similar and either works. If your data has a long tail (incomes, house prices, time-to-event data), prefer median for describing the typical case.
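The salary example above can be checked in a few lines. This is a minimal sketch using only Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries from the example above: ten employees and one executive.
salaries = [50_000] * 10 + [1_000_000]

mean_salary = mean(salaries)      # pulled upward by the single large value
median_salary = median(salaries)  # middle value of the sorted list

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
```

Multiplying `mean_salary` by the headcount recovers total payroll, which is why the mean is the right number for that question.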
Standard deviation measures spread around the mean. It has the same units as the original variable, which makes it interpretable. About 68% of normally distributed data falls within 1 standard deviation of the mean; about 95% falls within 2 standard deviations.
Percentiles are often more useful than standard deviation, especially for skewed data. The 25th, 50th, and 75th percentiles (Q1, median, Q3) divide your data into quarters. The interquartile range (IQR = Q3 - Q1) is a robust measure of spread because extreme values in the tails barely affect it.
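To make the contrast between standard deviation and IQR concrete, here is a small sketch with made-up numbers. The data and the `spread` helper are hypothetical, and quartile values depend on the interpolation method (this uses the default of `statistics.quantiles`).

```python
from statistics import quantiles, stdev

# Hypothetical response times in ms; the last value is an outlier.
values = [12, 15, 14, 10, 18, 13, 16, 11, 17, 95]

def spread(data):
    """Return (standard deviation, Q1, median, Q3, IQR) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4)  # three cut points -> four quarters
    return stdev(data), q1, q2, q3, q3 - q1

sd, q1, q2, q3, iqr = spread(values)
print(f"SD={sd:.1f}  Q1={q1}  median={q2}  Q3={q3}  IQR={iqr}")

# Make the outlier ten times larger and compare both measures of spread.
sd_big, *_, iqr_big = spread(values[:-1] + [950])
print(f"SD={sd_big:.1f}  IQR={iqr_big}")
```

In this example the standard deviation grows with the outlier while the IQR stays the same, because the quartiles are computed from values in the middle of the sorted data.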
Probability Distributions: Where They Appear in Real Data
Normal distribution. The famous bell curve. Many natural measurements (heights, measurement errors, test scores) are approximately normal. More importantly, the Central Limit Theorem says that the mean of many independent draws from a distribution with finite variance is approximately normally distributed, whatever the shape of the underlying distribution. This is why so many statistical tests rely on normality.
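A quick simulation can illustrate the Central Limit Theorem. The sketch below draws from an exponential distribution (an arbitrary choice of a skewed distribution) and looks at how the sample means behave; the seed and sample sizes are assumptions picked for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 50                   # observations per sample
num_samples = 2000       # number of sample means to collect

# Each observation comes from an exponential distribution with mean 1 and
# standard deviation 1, which is strongly right-skewed.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)
spread = stdev(sample_means)
within_1sd = sum(abs(m - center) <= spread for m in sample_means) / num_samples

print(f"mean of sample means: {center:.3f}  (theory: 1.000)")
print(f"sd of sample means:   {spread:.3f}  (theory: {1 / n ** 0.5:.3f})")
print(f"share within 1 sd:    {within_1sd:.2%}  (normal: about 68%)")
```

Even though individual observations are far from normal, the sample means cluster around the true mean with a spread close to the theoretical 1 / sqrt(n).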
Binomial distribution. Models the number of successes in n independent trials with probability p of success each time. It appears in conversion rate testing, A/B testing, and quality control. "Out of 1,000 visitors, 47 converted. Is that significantly different from our baseline of 4%?" This is a binomial problem.
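The conversion question above can be answered with an exact binomial calculation. This is a sketch using only the standard library; the two-sided rule shown (summing outcomes no more likely than the observed one) is one common definition, not the only one, and SciPy's `scipy.stats.binomtest` is the usual tool in practice.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed, baseline = 1000, 47, 0.04

pmf = [binom_pmf(k, n, baseline) for k in range(n + 1)]

# One-sided: probability of 47 or more conversions if the true rate is 4%.
p_upper = sum(pmf[observed:])

# Two-sided: total probability of outcomes no more likely than the observed one.
p_two_sided = sum(q for q in pmf if q <= pmf[observed])

print(f"one-sided p = {p_upper:.3f}, two-sided p = {p_two_sided:.3f}")
```

With these numbers both p-values are well above 0.05, so 47 conversions out of 1,000 is compatible with a 4% baseline.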
Poisson distribution. Models the number of events occurring in a fixed interval of time or space when events happen independently and at a constant average rate. Website visits per minute, support tickets per day, defects per 100 units. If the variance of your count data is approximately equal to its mean, Poisson is a good candidate model.
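The variance-equals-mean check is easy to run before committing to a Poisson model. The ticket counts below are invented for illustration, and `poisson_pmf` is a hand-written helper rather than a library function.

```python
from math import exp, factorial
from statistics import mean, variance

# Hypothetical support-ticket counts for 20 days.
tickets_per_day = [1, 7, 4, 2, 6, 3, 5, 4, 8, 2, 4, 5, 3, 6, 1, 4, 7, 3, 5, 4]

rate = mean(tickets_per_day)                   # estimate of the Poisson rate
dispersion = variance(tickets_per_day) / rate  # near 1 suggests Poisson is plausible

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam**k / factorial(k)

# Probability of a day with 10 or more tickets under the fitted model.
p_busy_day = 1 - sum(poisson_pmf(k, rate) for k in range(10))

print(f"rate={rate:.2f}  variance/mean={dispersion:.2f}  P(10+ tickets)={p_busy_day:.4f}")
```

A ratio far above 1 (overdispersion) is a sign that events cluster and a plain Poisson model will understate how often busy days occur.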
Power law / long-tail distributions. These describe many internet-scale phenomena: number of followers on social media, revenue per customer, frequency of words in text. A small fraction of items accounts for the vast majority of the total. The mean and variance of such data are dominated by a few extreme values, so statistics that assume normality can be badly misleading. Be careful applying them here.
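A short simulation shows how concentrated a long-tailed distribution can be. The sketch assumes a Pareto distribution with shape 1.2 as a stand-in for revenue per customer; real data may follow a different tail.

```python
import random
from statistics import mean, median

rng = random.Random(1)

# Hypothetical revenue per customer drawn from a Pareto distribution.
# A shape parameter close to 1 gives a very heavy tail.
revenue = sorted((rng.paretovariate(1.2) for _ in range(10_000)), reverse=True)

top_10_percent_share = sum(revenue[:1000]) / sum(revenue)

print(f"mean={mean(revenue):.2f}  median={median(revenue):.2f}")
print(f"share of total from top 10% of customers: {top_10_percent_share:.0%}")
```

The gap between mean and median here is the same effect as in the salary example, only far stronger.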
Hypothesis Testing: p-Values Done Right
A hypothesis test starts with a null hypothesis (H0) -- the boring claim that nothing interesting is happening (no difference, no effect). The p-value is the probability of seeing results at least as extreme as yours if the null hypothesis were true.
A p-value of 0.03 means: if there truly were no effect, there would be only a 3% chance of seeing a result at least this extreme through random variation alone. By convention (arbitrary but widely used), we reject H0 when p < 0.05.
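The definition can be turned directly into code with a permutation test: shuffle the group labels many times to simulate a world where H0 is true, then count how often the shuffled difference is at least as large as the observed one. The data below are hypothetical, and this is a sketch rather than a replacement for a library test.

```python
import random
from statistics import mean

# Hypothetical measurements for two groups.
group_a = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9, 12.7, 12.4]
group_b = [13.1, 13.4, 12.9, 13.8, 13.2, 13.6, 12.8, 13.5]

observed_diff = abs(mean(group_b) - mean(group_a))

rng = random.Random(0)
pooled = group_a + group_b
num_shuffles = 5_000
at_least_as_extreme = 0

for _ in range(num_shuffles):
    # Under H0 the group labels carry no information, so shuffle them.
    rng.shuffle(pooled)
    diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
    if diff >= observed_diff:
        at_least_as_extreme += 1

# Add 1 to numerator and denominator so the estimate is never exactly zero.
p_value = (at_least_as_extreme + 1) / (num_shuffles + 1)
print(f"observed difference: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

The p-value is simply the share of label shuffles that look at least as extreme as the real data.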
Common mistakes with p-values:
The p-value does not tell you the probability that H0 is true. A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is true. It means the data are unlikely under H0.
Statistical significance is not practical significance. With a large enough sample, you can detect arbitrarily small effects. A 0.01 percentage-point increase in click rate can be statistically significant with 10 million users per group when the baseline rate is low, yet have no meaningful business impact.
p = 0.049 is not meaningfully different from p = 0.051. The 0.05 threshold is a convention, not a law of nature.
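To illustrate the gap between statistical and practical significance, here is a sketch of a two-proportion z-test on invented numbers: 10 million users per arm and click rates of 0.100% versus 0.110%. The counts are assumptions chosen to make the point, not real data.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B test: 10 million users per arm.
n_a = n_b = 10_000_000
clicks_a, clicks_b = 10_000, 11_000

rate_a, rate_b = clicks_a / n_a, clicks_b / n_b
pooled_rate = (clicks_a + clicks_b) / (n_a + n_b)

# Two-proportion z-test using the pooled standard error under H0.
std_err = sqrt(pooled_rate * (1 - pooled_rate) * (1 / n_a + 1 / n_b))
z = (rate_b - rate_a) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"absolute lift: {rate_b - rate_a:.4%}, z = {z:.2f}, p = {p_value:.2e}")
```

The p-value is tiny, yet the absolute lift is one extra click per 10,000 users. Whether that matters is a business question the test cannot answer.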
Common Tests
t-test: Compares means between two groups. Use when your outcome variable is continuous, approximately normally distributed (or n > 30 by CLT), and you want to know if the means are different.
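from scipy import stats

# Two-sample t-test. group_a_values and group_b_values are placeholders for
# two sequences of numeric observations.
# The default assumes equal variances in both groups; pass equal_var=False
# to run Welch's t-test when that assumption is doubtful.
t_stat, p_value = stats.ttest_ind(group_a_values, group_b_values)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")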
Chi-squared test: Tests independence between two categorical variables. Use it to compare conversion rates in an A/B test or to check whether two groups have different category distributions.
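# Chi-squared test for conversion rates.
# The four counts are placeholders: converted and non-converted users per variant.
contingency_table = [[converted_a, not_converted_a],
                     [converted_b, not_converted_b]]
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)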
Mann-Whitney U test: Non-parametric alternative to the t-test. It compares the ranks of the observations rather than their means. Use it when your data is clearly not normally distributed or when the outcome is ordinal.
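The idea behind the U statistic is simple enough to sketch by hand: count the pairs in which a value from one group beats a value from the other. The version below is a simplified illustration with invented data; it uses the normal approximation with no tie or continuity correction, which `scipy.stats.mannwhitneyu` handles properly.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical skewed samples (for example, session lengths in minutes).
group_a = [1.2, 0.8, 1.5, 2.1, 0.9, 1.1, 30.0, 1.3]
group_b = [2.5, 3.1, 2.8, 4.0, 3.3, 2.9, 3.6, 45.0]

def mann_whitney_u(a, b):
    """Return (U for a, two-sided p-value) using the normal approximation.

    Simplification: no tie correction and no continuity correction.
    """
    # U counts the pairs in which the value from a beats the value from b.
    u_a = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    n_a, n_b = len(a), len(b)
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))

u_stat, p_value = mann_whitney_u(group_a, group_b)
print(f"U = {u_stat}, p-value = {p_value:.4f}")
```

Because only the ordering of values matters, the two large outliers (30.0 and 45.0) have no more influence than any other observation.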
Confidence Intervals
A 95% confidence interval for a mean means: if you repeated the sampling procedure 100 times and computed the interval each time, approximately 95 of those intervals would contain the true mean. It does not mean there is a 95% probability the true mean is in this specific interval.
Confidence intervals are more informative than p-values alone. A p-value tells you how surprising the data would be if there were no effect; a confidence interval tells you the plausible magnitude of the effect.
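import numpy as np
from scipy import stats

# 95% CI for the mean, based on the t distribution.
# Assumes df is a pandas DataFrame with a numeric "response_time" column.
data = df["response_time"].dropna()
mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean
# The df= keyword below is degrees of freedom, not the DataFrame above.
ci = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"Mean: {mean:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")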
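The repeated-sampling interpretation of a confidence interval can be checked by simulation. The sketch below assumes a normal population with known parameters (hypothetical values) and, to stay within the standard library, uses the normal critical value of about 1.96 instead of the t distribution, so coverage lands slightly under 95%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(42)
true_mean, true_sd = 100.0, 15.0   # hypothetical population parameters
n, num_experiments = 50, 1000

z = NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval
covered = 0

for _ in range(num_experiments):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    margin = z * stdev(sample) / sqrt(n)
    # Does this experiment's interval contain the true mean?
    if abs(mean(sample) - true_mean) <= margin:
        covered += 1

coverage = covered / num_experiments
print(f"intervals containing the true mean: {coverage:.1%}")
```

Each individual interval either contains the true mean or it does not; the 95% describes the procedure across many repetitions.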
Correlation vs Causation
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two variables, on a scale from -1 to 1. It is easy to compute and easy to misinterpret.
Correlation does not imply causation. Ice cream sales and drowning rates are correlated (both increase in summer). Shoe size and reading ability in children are correlated (both increase with age). The correlation is real; the causal relationship is not.
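The shoe size example can be reproduced with a simulation in which age drives both variables and shoe size has no effect on reading at all. All coefficients and noise levels below are invented; `pearson_r` is a hand-written helper.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(7)

# Hypothetical children: age drives both shoe size and reading score.
# Shoe size has no direct effect on reading in this simulation.
ages = [rng.uniform(5, 12) for _ in range(5000)]
shoe = [0.8 * age + rng.gauss(0, 1) for age in ages]
reading = [10 * age + rng.gauss(0, 8) for age in ages]

r_all = pearson_r(shoe, reading)

# Hold the confounder roughly fixed: look only at children aged 8.0 to 8.5.
same_age = [(s, r) for age, s, r in zip(ages, shoe, reading) if 8.0 <= age < 8.5]
r_same_age = pearson_r([s for s, _ in same_age], [r for _, r in same_age])

print(f"all children: r = {r_all:.2f}; same age band: r = {r_same_age:.2f}")
```

Across all children the correlation is strong, but among children of nearly the same age it largely disappears, which is the signature of a confounder.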
Establishing causation requires an experiment with random assignment (A/B test), a natural experiment, or causal inference methods (instrumental variables, difference-in-differences, regression discontinuity). A correlation in observational data, without one of these designs, cannot establish causation.
The Multiple Testing Problem
If you run 100 statistical tests at the p < 0.05 threshold, you should expect about 5 false positives on average even when there is no real effect in any of them. This is the multiple comparisons problem.
In practice: if you test 50 features for correlation with your target, some will appear significant by chance. If you test every pairwise combination of marketing channels for differential effects, you will find spurious winners.
Solutions: Bonferroni correction (divide the significance threshold by the number of tests), Benjamini-Hochberg procedure (controls the false discovery rate), or pre-registering your hypotheses before looking at the data.
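Both corrections are short enough to sketch in plain Python. The p-values below are hypothetical, and the function names are illustrative rather than taken from a library.

```python
def bonferroni(p_values, alpha=0.05):
    """Return a reject/keep flag per test using the Bonferroni threshold."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/keep flag per test, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

uncorrected = sum(p < 0.05 for p in p_values)
print("uncorrected:", uncorrected)
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg keeps more discoveries in exchange for tolerating a controlled share of false ones.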
When You Actually Need Formal Statistics
You need formal hypothesis testing when you are making a decision based on a comparison (A/B test, product change evaluation, clinical trial). You need confidence intervals when communicating uncertainty to stakeholders.
For exploratory data analysis and model building, formal statistics are often not the right tool. You want to understand your data, not prove hypotheses. Use visualization and descriptive statistics instead.