Cross-Validation: Reliably Estimating Model Performance on Unseen Data

A single train/test split gives you a noisy estimate of real performance. Cross-validation gives you a reliable one. Here is every variant, when to use each, and the mistakes to avoid.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#cross-validation #model-evaluation #k-fold #time-series #overfitting


Cross-validation is the standard technique for estimating how well a trained model will perform on new, unseen data. It addresses a fundamental problem: a single train/test split can give you a very misleading estimate of true model performance, depending on which examples happen to land in your test set. Cross-validation reduces this variance by averaging performance across multiple different splits.

Why a Single Train/Test Split Is Unreliable

Suppose you have 1,000 labeled examples. You split them 80/20: 800 training, 200 test. Your model achieves 87% accuracy on the test set.

Is 87% a reliable estimate? Not necessarily. If you had split the data differently -- say, the first 800 examples for training and the last 200 for test -- you might get 83% or 91%, depending on which examples ended up in the test set. With only 200 test examples, accuracy estimates have high variance.

The 95% confidence interval (normal approximation) for an accuracy of 87% measured on 200 examples is roughly 82% to 92%. That is a 10-percentage-point range around a single number that feels precise. Cross-validation narrows this uncertainty by evaluating on every example and averaging across multiple test sets.
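The interval quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the simple normal-approximation (Wald) interval; the function name `wald_interval` is made up for illustration, and other interval methods would give slightly different bounds.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

low, high = wald_interval(0.87, 200)
print(f"n=200: {low:.3f} to {high:.3f}")          # n=200: 0.823 to 0.917

# Quadrupling the number of test examples only halves the interval width
low_800, high_800 = wald_interval(0.87, 800)
print(f"n=800: {low_800:.3f} to {high_800:.3f}")  # n=800: 0.847 to 0.893
```

The width shrinks with the square root of the number of evaluated examples, which is why scoring every example once (as K-fold does) helps more than it might first appear.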

K-Fold Cross-Validation: The Standard Approach

K-fold CV divides your data into K folds of roughly equal size. The process:

1. Split data into K folds (K = 5 or K = 10 are the most common choices)
2. For each fold i from 1 to K:
   - Train on all folds except fold i
   - Evaluate on fold i
3. Average the K evaluation scores

The result: every example is used exactly once as a test example and K - 1 times as a training example. The final performance estimate is built from predictions on all of your data.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# Accuracy: 0.871 +/- 0.023

The standard deviation across folds is as informative as the mean. A high standard deviation (+/- 0.10) suggests high variance across splits, which might indicate insufficient data or a model that is sensitive to which examples it sees.
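To make the fold mechanics concrete, here is a sketch of the same procedure written as an explicit loop, with counters that track how often each example is tested and trained on. The synthetic dataset and the choice of `LogisticRegression` are assumptions made only to keep the example self-contained and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; substitute your own X and y)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)   # times each example is in a test fold
train_counts = np.zeros(len(X), dtype=int)  # times each example is in a training set
fold_scores = []

for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print(f"Accuracy: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
print("Times tested:    ", set(test_counts.tolist()))   # {1}
print("Times trained on:", set(train_counts.tolist()))  # {4}, i.e. K - 1
```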

Choosing K:

K = 5: Lower computational cost, slightly more pessimistic bias (each model trains on 80% of the data). Good default for large datasets or slow models.
K = 10: Slightly lower bias (each model trains on 90% of the data), roughly twice the computation, and similar or slightly higher variance in the estimate. Common in academic evaluation.
K = 5 is usually the right default. K = 10 adds computation without a meaningful improvement in most real cases.

Stratified K-Fold: Maintaining Class Balance

Standard K-fold splits data randomly without regard for class distribution. On an imbalanced dataset (say, 95% class A, 5% class B), you might get a fold where class B is severely underrepresented or absent entirely.

Stratified K-fold maintains the class distribution in each fold proportional to the overall distribution. This ensures each fold is a representative sample of the full dataset.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train and evaluate...

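Use stratified K-fold by default for classification tasks. In scikit-learn, passing an integer such as cv=5 to cross_val_score with a classifier already uses StratifiedKFold, though without shuffling. For regression, standard K-fold is fine (there are no discrete classes to balance).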
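A quick way to see the difference is to count minority-class examples per validation fold under both splitters. The sketch below assumes a hypothetical 95/5 label vector; the helper `minority_counts` is illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 950 of class 0, 50 of class 1
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are built

def minority_counts(splitter):
    """Number of class-1 examples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold:          ", plain)       # counts depend on the shuffle
print("StratifiedKFold:", stratified)  # [10, 10, 10, 10, 10]
```

With only 50 minority examples, a swing of a few examples per fold under plain K-fold is already a large relative change in what each fold measures.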

Leave-One-Out Cross-Validation

LOOCV is K-fold where K = N (the number of examples). You train N models, each leaving out exactly one example and testing on that example.

Advantages:

Maximum use of training data (N-1 examples per fold)
Minimal bias in the estimate (each model sees almost all the data)
No randomness (deterministic, unlike K-fold with shuffling)

Disadvantages:

Extremely expensive: you train N models instead of K models
High variance across "folds": each test set is one example, so each evaluation is maximally noisy
The N test scores are highly correlated (the N models are nearly identical)

LOOCV is most appropriate for very small datasets (fewer than 50-100 examples) where you cannot afford to hold out 20% of your data as a test set. For anything larger, K-fold is preferred.

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print(f"LOOCV Accuracy: {scores.mean():.3f}")
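The noisiness of individual LOOCV "folds" is easy to see in practice: with accuracy as the metric, each fold's score can only be 0 or 1. The following sketch assumes a small synthetic dataset and a `LogisticRegression` model, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption, standing in for a real small-N problem)
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

print("Models trained:", len(scores))                  # one per example: 40
print("Distinct fold scores:", np.unique(scores))      # a subset of {0.0, 1.0}
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

Only the mean is meaningful here; the per-fold standard deviation says little about stability because every fold contains a single example.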

Time Series Cross-Validation: The Most Misunderstood Case

For time series data, standard K-fold is wrong. Most of its folds train on data recorded after the test period, which simulates a scenario that cannot exist in production (predicting the past using the future).

The correct approach: always train on past data and evaluate on future data.

Expanding window CV: The training set grows with each fold. Fold 1: train on months 1-6, test on month 7. Fold 2: train on months 1-7, test on month 8. And so on. This is the most realistic simulation of how you would retrain over time.

Sliding window CV: Keep the training window fixed. Fold 1: train on months 1-6, test on month 7. Fold 2: train on months 2-7, test on month 8. This is appropriate when you expect data distribution to shift over time (older data becomes less relevant).

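from sklearn.model_selection import TimeSeriesSplit

# Expanding window by default; pass max_train_size to cap it (sliding window)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train and evaluate in temporal order...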

Gap: In some forecasting scenarios, you should add a gap between training and validation to account for the fact that very recent data might not be available in production (due to data pipeline delays). TimeSeriesSplit supports a gap parameter.
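The expanding window, sliding window, and gap variants described above can all be expressed with `TimeSeriesSplit` parameters (`max_train_size`, `test_size`, `gap`). The sketch below uses twelve hypothetical monthly observations so the folds line up with the month numbers used in the text; the helper `show` is illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve monthly observations, labeled 1..12 (hypothetical data)
months = np.arange(1, 13)
X = months.reshape(-1, 1)

def show(name, splitter):
    """Print and return the months used for training and testing in each fold."""
    folds = []
    print(name)
    for train_idx, test_idx in splitter.split(X):
        train, test = months[train_idx].tolist(), months[test_idx].tolist()
        folds.append((train, test))
        print(f"  train {train[0]}-{train[-1]}, test {test}")
    return folds

# Expanding window (the default): train 1-6 / test 7, then train 1-7 / test 8, ...
expanding = show("Expanding", TimeSeriesSplit(n_splits=6, test_size=1))

# Sliding window: cap the training set at six months
sliding = show("Sliding", TimeSeriesSplit(n_splits=6, test_size=1, max_train_size=6))

# Gap: leave one month unused between the end of training and the test month
gapped = show("Gap of 1", TimeSeriesSplit(n_splits=6, test_size=1, gap=1))
```

In every variant the training indices precede the test indices, which is the property that plain K-fold does not guarantee.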

Nested Cross-Validation: Unbiased Hyperparameter Tuning

This is the most important technique most practitioners do not use.

If you use cross-validation to select hyperparameters AND to report model performance, you get an optimistically biased performance estimate. The reason: you searched over many hyperparameter configurations and picked the best one -- so the reported performance reflects the best outcome of your search, which will not generalize to new data.

Nested CV uses two loops:

Outer loop: K-fold splits for unbiased performance estimation
Inner loop: K-fold splits within each training fold for hyperparameter tuning

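from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Inner loop picks hyperparameters; outer loop estimates performance
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=5)

param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
clf = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)

# Each outer fold runs a full grid search on its own training portion
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.3f}")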

Nested CV gives you a nearly unbiased estimate of what your model selection procedure will achieve on new data. It is expensive (on the order of K_outer x K_inner x number_of_configs model fits, plus one refit of the winning configuration per outer fold), but it is the statistically correct approach.
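To get a feel for the cost, the number of model fits for the grid above can be counted directly. This is a back-of-the-envelope sketch that assumes `GridSearchCV` keeps its default `refit=True`, so the winning configuration is refit once per outer fold.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
k_outer, k_inner = 5, 3

n_configs = len(ParameterGrid(param_grid))   # 3 x 3 = 9 combinations
search_fits = k_outer * k_inner * n_configs  # fits performed while searching
refits = k_outer                             # one refit of the best config per outer fold
total_fits = search_fits + refits

print(n_configs, search_fits, refits, total_fits)  # 9 135 5 140
```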

Common Mistakes

Leaking preprocessing into CV: Fit your scaler, imputer, or feature selector INSIDE each fold (on training data only), then apply to validation data. Never fit on the full dataset before splitting. Scikit-learn Pipelines handle this correctly when used with cross_val_score.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])
# Scaler is fit inside each fold automatically
scores = cross_val_score(pipeline, X, y, cv=5)
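The size of the leakage effect is easiest to appreciate on data where there is nothing to learn. The sketch below is an illustration under stated assumptions: pure-noise features, labels unrelated to them, and `SelectKBest` standing in for any data-dependent preprocessing step. Selecting features on the full dataset before cross-validating produces a score well above chance, while the pipeline version does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))  # pure noise features
y = np.array([0, 1] * 25)        # labels unrelated to X, so true accuracy is about 50%

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5)

# Correct: the selector is refit on the training part of every fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky CV accuracy:  {leaky.mean():.3f}")
print(f"Honest CV accuracy: {honest.mean():.3f}")
```

The same mechanism applies, usually less dramatically, to scalers and imputers fit on the full dataset.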

Reporting test set performance after multiple rounds of tuning: Each time you look at the test set and adjust your model, you are fitting your decisions to it, and it stops being an unbiased measure of performance. Hold out a final test set and evaluate on it exactly once.
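One way to enforce this discipline is to split off the test set before any modeling and run all cross-validation on the remainder. The following is a minimal sketch, assuming synthetic data and a deliberately tiny grid; the variable names (`X_dev`, `X_test`) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Set the final test set aside before making any modeling decisions
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Do all tuning with cross-validation on the development portion only
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_depth": [3, 5]},
    cv=5,
)
search.fit(X_dev, y_dev)

# 3. Touch the test set exactly once, with the final refit model
final_score = search.score(X_test, y_test)
print(f"CV estimate during tuning: {search.best_score_:.3f}")
print(f"Final test accuracy:       {final_score:.3f}")
```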

Using CV on very large datasets unnecessarily: For 1M+ examples, a single train/validation/test split gives reliable estimates. Five-fold CV on 1M examples costs about 5x the compute for diminishing statistical returns. CV pays off most when N is small (under roughly 10,000).

Keep Reading

Hyperparameter Tuning Guide -- nested CV is the right way to tune and evaluate simultaneously
ML Model Evaluation Metrics Guide -- what to measure during each CV fold
Ensemble Methods Guide -- stacking uses held-out fold predictions as training data for the meta-model
