Cross-validation is the standard technique for estimating how well a trained model will perform on new, unseen data. It addresses a fundamental problem: a single train/test split can give you a very misleading estimate of true model performance, depending on which examples happen to land in your test set. Cross-validation reduces this variance by averaging performance across multiple different splits.
Why a Single Train/Test Split Is Unreliable
Suppose you have 1,000 labeled examples. You split them 80/20: 800 training, 200 test. Your model achieves 87% accuracy on the test set.
Is 87% a reliable estimate? Not necessarily. If you had split the data differently -- say, the first 800 examples for training and the last 200 for test -- you might get 83% or 91%, depending on which examples ended up in the test set. With only 200 test examples, accuracy estimates have high variance.
The 95% confidence interval (normal approximation) for a proportion of 87% estimated on 200 examples is roughly 82% to 92%. That is a 10-percentage-point range around a single number that feels precise. Cross-validation narrows this uncertainty by averaging across multiple test sets.
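The interval quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the normal (Wald) approximation for a binomial proportion; `proportion_ci` is a hypothetical helper name, not a library function.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    # Standard error of a proportion estimated from n independent examples.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = proportion_ci(0.87, 200)
print(f"95% CI: {low:.3f} to {high:.3f}")  # 95% CI: 0.823 to 0.917
```

The half-width shrinks with the square root of the test set size, so quadrupling the number of test examples only halves the interval.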
K-Fold Cross-Validation: The Standard Approach
K-fold CV divides your data into K equally-sized folds. The process:
- Split data into K folds (K = 5 or K = 10 are the most common choices)
- For each fold i from 1 to K:
  - Train on all folds except fold i
  - Evaluate on fold i
- Average the K evaluation scores
The result: every example has been used exactly once as a test example, and K - 1 times as a training example. The final performance estimate uses all your data for evaluation.
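To make the bookkeeping concrete, here is a small sketch of the splitting step using only the standard library. `k_fold_indices` is a hypothetical helper written for illustration; it is not how scikit-learn implements its splitters.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

test_counts, train_counts = Counter(), Counter()
for train_idx, test_idx in k_fold_indices(n_samples=23, k=5):
    test_counts.update(test_idx)
    train_counts.update(train_idx)

print(set(test_counts.values()))   # {1}: each example is tested exactly once
print(set(train_counts.values()))  # {4}: each example is trained on K - 1 times
```

The counts hold even when the dataset size is not divisible by K, because every example belongs to exactly one fold.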
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# X (features) and y (labels) are assumed to be loaded already
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# Example output: Accuracy: 0.871 +/- 0.023
The standard deviation across folds is as informative as the mean. A high standard deviation (+/- 0.10) suggests high variance across splits, which might indicate insufficient data or a model that is sensitive to which examples it sees.
Choosing K:
- K = 5: Lower computational cost, slightly higher (pessimistic) bias because each model trains on only 80% of the data. Good default for large datasets or slow models.
- K = 10: Slightly lower bias because each model trains on 90% of the data, potentially higher variance, roughly twice the computation. Common in academic evaluation.
- K = 5 is usually the right default. K = 10 adds computation without a meaningful improvement in most real cases.
Stratified K-Fold: Maintaining Class Balance
Standard K-fold assigns examples to folds without regard for class distribution. On an imbalanced dataset (say, 95% class A, 5% class B), you might get a fold where class B is severely underrepresented or absent entirely.
Stratified K-fold maintains the class distribution in each fold proportional to the overall distribution. This ensures each fold is a representative sample of the full dataset.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Integer-array indexing assumes X and y are NumPy arrays
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train and evaluate...
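Always use stratified K-fold for classification tasks. In scikit-learn, passing an integer cv to cross_val_score with a classifier already uses stratified folds (without shuffling). For regression, standard K-fold is fine (there are no discrete classes to balance).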
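The difference is easy to see by counting minority-class examples per fold. The sketch below assumes a made-up label vector with a 95/5 split; the feature matrix is a placeholder because only the labels influence how the folds are drawn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 950 examples of class 0, 50 of class 1.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # Placeholder features; only the labels affect the split.

def minority_counts(splitter):
    """Count class-1 examples in the validation part of each fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold:          ", plain)       # counts vary from fold to fold
print("StratifiedKFold:", stratified)  # 10 minority examples in every fold
```

With only 50 minority examples, a fold that happens to receive 5 instead of 10 can shift per-class metrics noticeably, which is the effect stratification removes.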
Leave-One-Out Cross-Validation
LOOCV is K-fold where K = N (the number of examples). You train N models, each leaving out exactly one example and testing on that example.
Advantages:
- Maximum use of training data (N-1 examples per fold)
- Minimal bias in the estimate (each model sees almost all the data)
- No randomness (deterministic, unlike K-fold with shuffling)
Disadvantages:
- Extremely expensive: you train N models instead of K models
- High variance across "folds": each test set is one example, so each evaluation is maximally noisy
- The N test scores are highly correlated (the N models are nearly identical)
LOOCV is most appropriate for very small datasets (fewer than 50-100 examples) where you cannot afford to hold out 20% of your data as a test set. For anything larger, K-fold is preferred.
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print(f"LOOCV Accuracy: {scores.mean():.3f}")
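The "maximally noisy" per-fold evaluation is visible in the raw scores. The following sketch assumes a 30-example subsample of the iris dataset and a k-nearest-neighbors classifier, both chosen only to keep the run small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_small, y_small = X[::5], y[::5]  # every 5th row: 30 examples, 10 per class

loo = LeaveOneOut()
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_small, y_small, cv=loo)

print(len(scores))        # 30: one model and one score per example
print(np.unique(scores))  # each score is either 0.0 (wrong) or 1.0 (right)
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

Because each score is a single hit or miss, only the mean is meaningful; the per-fold standard deviation does not describe model stability the way it does for K-fold.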
Time Series Cross-Validation: The Most Misunderstood Case
For time series data, standard K-fold is wrong. In most folds the training set contains data from after the test period, which simulates a scenario that cannot exist in production (predicting the past using the future).
The correct approach: always train on past data and evaluate on future data.
Expanding window CV: The training set grows with each fold. Fold 1: train on months 1-6, test on month 7. Fold 2: train on months 1-7, test on month 8. And so on. This is the most realistic simulation of how you would retrain over time.
Sliding window CV: Keep the training window fixed. Fold 1: train on months 1-6, test on month 7. Fold 2: train on months 2-7, test on month 8. This is appropriate when you expect data distribution to shift over time (older data becomes less relevant).
from sklearn.model_selection import TimeSeriesSplit
# By default TimeSeriesSplit produces expanding-window splits
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    # train and evaluate in temporal order...
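Both window styles can be produced with TimeSeriesSplit. The sketch below assumes twelve monthly rows (a made-up example) and uses the `test_size` and `max_train_size` parameters, available in scikit-learn 0.24 and later, to reproduce the month-by-month folds described above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(1, 13)  # Hypothetical data: one row per month, months 1-12.

# Expanding window: the training set keeps every earlier month.
expanding = TimeSeriesSplit(n_splits=6, test_size=1)
# Sliding window: the training set is capped at the 6 most recent months.
sliding = TimeSeriesSplit(n_splits=6, test_size=1, max_train_size=6)

for name, splitter in [("expanding", expanding), ("sliding", sliding)]:
    print(name)
    for train_idx, val_idx in splitter.split(months):
        train, val = months[train_idx], months[val_idx]
        print(f"  train on months {train[0]}-{train[-1]}, test on month {val[0]}")
```

The first fold is identical in both cases (months 1-6, test on month 7); from the second fold on, the sliding splitter drops the oldest month each time.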
Gap: In some forecasting scenarios, you should add a gap between training and validation to account for the fact that very recent data might not be available in production (due to data pipeline delays). TimeSeriesSplit supports a gap parameter.
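As a rough illustration of the gap parameter, the sketch below assumes twelve daily rows and a one-day pipeline delay. The data is made up; `gap` is counted in samples, so it only corresponds to a time span when rows are evenly spaced.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 13)  # Hypothetical data: one row per day, days 1-12.

# gap=1 drops the one sample immediately before each validation window.
tscv_gap = TimeSeriesSplit(n_splits=5, test_size=1, gap=1)
for train_idx, val_idx in tscv_gap.split(days):
    last_train, first_val = days[train_idx][-1], days[val_idx][0]
    print(f"train through day {last_train}, skip 1 day, test on day {first_val}")
```

The first fold trains through day 6 and tests on day 8, so day 7 is never seen by the model that is evaluated on day 8.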
Nested Cross-Validation: Unbiased Hyperparameter Tuning
This is the most important technique most practitioners do not use.
If you use cross-validation to select hyperparameters AND to report model performance, you get an optimistically biased performance estimate. The reason: you searched over many hyperparameter configurations and picked the best one, so the reported performance reflects the luckiest outcome of your search and overstates what you will see on new data.
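The selection effect can be shown without any model at all. This is a toy simulation under stated assumptions: 50 hypothetical configurations that all have a true accuracy of exactly 50%, each scored on 100 validation examples.

```python
import random

rng = random.Random(0)
n_configs, n_val = 50, 100

# Every hypothetical configuration has the same true accuracy of 50%:
# each validation prediction is simulated as a fair coin flip.
observed = [
    sum(rng.random() < 0.5 for _ in range(n_val)) / n_val
    for _ in range(n_configs)
]

best = max(observed)
average = sum(observed) / n_configs
print(f"average observed accuracy: {average:.3f}")
print(f"best observed accuracy:    {best:.3f}  (true accuracy is 0.500)")
```

The average stays near 0.5, but the best of the 50 scores sits above it purely by chance. Reporting that best score as the expected performance is the bias that nested CV is designed to remove.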
Nested CV uses two loops:
- Outer loop: K-fold splits for unbiased performance estimation
- Inner loop: K-fold splits within each outer training set for hyperparameter tuning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=5)
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
# GridSearchCV tunes on the inner folds; cross_val_score scores it on the outer folds
clf = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.3f}")
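Nested CV gives you a nearly unbiased estimate of what your model selection procedure will achieve on new data. It is expensive: roughly K_outer x K_inner x number_of_configs model fits, plus one refit per outer fold. For the grid above that is 5 x 3 x 9 = 135 fits plus 5 refits. It is nonetheless the statistically sound way to report performance when tuning is part of the procedure.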
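Nested CV evaluates the tuning procedure, not one fixed configuration, so each outer fold may settle on different hyperparameters. The sketch below makes that visible with `cross_validate(..., return_estimator=True)`. The synthetic dataset and the deliberately small grid are assumptions made to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"max_depth": [3, 5], "n_estimators": [10, 25]}  # small illustrative grid

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=inner_cv)
# return_estimator=True keeps the fitted GridSearchCV object from each outer fold.
result = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

for score, fitted_search in zip(result["test_score"], result["estimator"]):
    print(f"outer score {score:.3f}  chosen {fitted_search.best_params_}")
```

If the chosen parameters differ widely between outer folds, that is a hint the search is fitting noise. To get a final model, rerun the same search once on all available data.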
Common Mistakes
Leaking preprocessing into CV: Fit your scaler, imputer, or feature selector INSIDE each fold (on training data only), then apply to validation data. Never fit on the full dataset before splitting. Scikit-learn Pipelines handle this correctly when used with cross_val_score.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])
# Scaler is fit inside each fold automatically
scores = cross_val_score(pipeline, X, y, cv=5)
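How much leakage can distort an estimate is easiest to see on data with no signal at all. The sketch below assumes pure-noise features and random labels, and swaps in feature selection (`SelectKBest`) with logistic regression, because leakage through a supervised selector is more dramatic than leakage through a scaler.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: feature selection sees every label, including those in the validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Correct: the selector is refit on the training part of each fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"leaky:  {leaky.mean():.3f}")   # well above chance despite no real signal
print(f"honest: {honest.mean():.3f}")  # close to chance
```

Since the labels are random, any accuracy meaningfully above 50% in the leaky version is an artifact of selecting features with knowledge of the validation labels.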
Reporting test set performance after multiple rounds of tuning: Each time you look at the test set and adjust your model, you are effectively using the test set for model selection, and its score stops being an unbiased estimate. Hold out a final test set and evaluate on it exactly once.
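One way to organize this workflow is sketched below: split off the test set first, iterate with CV on the remainder, and touch the test set once at the end. The synthetic data and the `X_dev`/`X_test` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Set the final test set aside before any modeling decisions are made.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# All iteration (model choice, tuning, feature work) uses CV on the development set.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(f"CV accuracy on development set: {cv_scores.mean():.3f}")

# Once everything is frozen: fit on the full development set, score the test set once.
model.fit(X_dev, y_dev)
print(f"Final test accuracy: {model.score(X_test, y_test):.3f}")
```

If the final test score is disappointing, going back to tune against it turns the test set into another validation set, and a fresh holdout is needed for an honest number.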
Using CV for very large datasets unnecessarily: For 1M+ examples, a single train/validation/test split gives reliable estimates. Five-fold CV on 1M examples is 5x the compute with diminishing statistical returns. Use CV when N is small (under ~10,000).
Keep Reading
- Hyperparameter Tuning Guide -- nested CV is the right way to tune and evaluate simultaneously
- ML Model Evaluation Metrics Guide -- what to measure during each CV fold
- Ensemble Methods Guide -- stacking uses held-out fold predictions as training data for the meta-model