Ensemble Methods: Why Combining Models Beats Any Individual Model

Bagging, boosting, and stacking: ensemble methods consistently win Kaggle competitions and improve production accuracy. Here is how each works and when to use it.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#ensemble-methods #random-forest #xgboost #stacking #boosting

Ensemble methods combine predictions from multiple models to produce a final prediction that typically outperforms the individual models. This is not a theoretical curiosity: ensembles dominate Kaggle competitions, power many production recommendation systems, and account for some of the largest accuracy jumps in applied ML. Understanding why they work and how to apply each variant will make you a more effective ML practitioner.

Why Ensembles Work: The Bias-Variance-Covariance Decomposition

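Suppose M models each predict the same target and the ensemble averages their predictions. The expected squared error of the ensemble then decomposes as:

Ensemble error = (average bias)^2 + (1/M) * average variance + ((M-1)/M) * average pairwise covariance

This equation contains a key insight: averaging leaves the bias term unchanged and shrinks the variance term by a factor of M, but the covariance term does not shrink as models are added. How much an ensemble gains therefore depends on the covariance between its models' errors. If models make uncorrelated errors -- if model A fails on example 1 but model B is correct, and model B fails on example 2 but model A is correct -- the covariance term vanishes and the errors largely cancel out in the average. If the models' errors are perfectly correlated, the ensemble is no better than a single model.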

In practice, errors are never completely uncorrelated. But if you train models on different data subsets (bagging), with different algorithms (voting ensembles), or sequentially to correct each other's errors (boosting), you can substantially reduce correlation and improve ensemble performance.
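To see how correlation limits the benefit of averaging, here is a minimal simulation sketch. It assumes zero-mean Gaussian errors with equal variance sigma^2 and one shared pairwise correlation rho; the function names and parameter values are illustrative and not taken from any library. Under those assumptions, the variance of the averaged error is sigma^2 * (1/M + (M-1)/M * rho).

```python
import numpy as np


def simulated_error_variance(M, rho, sigma=1.0, n_samples=200_000, seed=0):
    """Empirical variance of the average of M correlated model errors."""
    rng = np.random.default_rng(seed)
    # Covariance matrix: sigma^2 on the diagonal, rho * sigma^2 elsewhere.
    cov = sigma**2 * (rho * np.ones((M, M)) + (1.0 - rho) * np.eye(M))
    errors = rng.multivariate_normal(np.zeros(M), cov, size=n_samples)
    return errors.mean(axis=1).var()


def predicted_error_variance(M, rho, sigma=1.0):
    """Closed-form variance of the averaged error under the same assumptions."""
    return sigma**2 * (1.0 / M + (M - 1.0) / M * rho)


for rho in (0.0, 0.5, 0.9):
    sim = simulated_error_variance(10, rho)
    pred = predicted_error_variance(10, rho)
    print(f"rho={rho:.1f}  simulated={sim:.3f}  predicted={pred:.3f}")
```

With rho = 0 the variance falls as 1/M. As rho approaches 1, adding more models stops helping, which is why the techniques below all try to decorrelate the base models.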

Bagging: Bootstrap Aggregating

Bagging trains multiple models independently on random bootstrap samples of the training data. A bootstrap sample is a sample drawn with replacement: some examples appear multiple times, some not at all. Approximately 63.2% of original examples appear in each bootstrap sample.
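The 63.2% figure is the limit of 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows. A quick numerical check, assuming a hypothetical dataset of 100,000 rows identified only by their index:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Draw n row indices with replacement: one bootstrap sample.
bootstrap_indices = rng.integers(0, n, size=n)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = np.unique(bootstrap_indices).size / n
expected_fraction = 1.0 - np.exp(-1.0)

print(f"observed: {unique_fraction:.4f}, expected: {expected_fraction:.4f}")
```

The remaining rows, roughly 36.8%, are "out-of-bag" for that model and can serve as a built-in validation set.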

The variance reduction happens because each model is trained on a slightly different dataset, causing them to make different errors. Averaging their predictions cancels out much of the noise.

Random Forest is the canonical bagging ensemble. It combines bagging with random feature selection at each split: when building each decision tree, only a random subset of features is considered for each split. This further decorrelates the trees.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,    # more trees = better, up to diminishing returns
    max_features="sqrt", # random feature subset at each split
    max_depth=None,      # fully grown trees (high variance, reduced by averaging)
    random_state=42
)
rf.fit(X_train, y_train)

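Random Forest is robust, fast to train, and requires little hyperparameter tuning. Tree-based models also need no feature scaling, although in scikit-learn categorical features must still be encoded numerically, and native handling of missing values depends on the library version. It is often a strong first model to try for tabular data.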

Bagging primarily reduces variance. It works best with high-variance (overfit) base learners like deep decision trees. Using shallow trees with bagging provides less benefit.
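As a rough illustration of the variance reduction, the sketch below compares one fully grown decision tree with 100 bagged copies of it. The synthetic dataset, its label-noise level, and all parameter values are assumptions chosen for the example. Exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise so that a deep tree overfits.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One fully grown tree: low bias, high variance.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 100 fully grown trees, each fit on its own bootstrap sample.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

tree_acc = single_tree.score(X_te, y_te)
bagged_acc = bagged_trees.score(X_te, y_te)
print(f"single tree: {tree_acc:.3f}  bagged trees: {bagged_acc:.3f}")
```

Repeating the comparison with max_depth=2 on the base tree is a simple way to check the claim that shallow trees benefit less.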

Boosting: Sequential Error Correction

Boosting trains models sequentially, where each model focuses on correcting the mistakes of all previous models. The final prediction combines all models' outputs with learned weights.

AdaBoost (Freund and Schapire, 1997): The first widely used practical boosting algorithm. After each round, misclassified examples are upweighted, so the next model focuses on the difficult examples. It combines weak learners (each only slightly better than random guessing) into a strong learner.
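The reweighting step can be sketched for the binary (discrete) form of AdaBoost. The helper name and the toy data are hypothetical, and the sketch assumes the weak learner's weighted error lies strictly between 0 and 0.5.

```python
import numpy as np


def adaboost_reweight(weights, misclassified):
    """One round of discrete AdaBoost reweighting (illustrative helper)."""
    # Weighted error of the current weak learner.
    eps = weights[misclassified].sum()
    # Vote weight of this learner in the final ensemble.
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    # Upweight mistakes, downweight correct examples, then renormalize.
    new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha


# Ten examples with uniform weights; the weak learner gets two wrong.
weights = np.full(10, 0.1)
misclassified = np.array([True, True] + [False] * 8)

new_weights, alpha = adaboost_reweight(weights, misclassified)
print(new_weights.round(4), round(float(alpha), 4))
```

After the update, the misclassified examples carry exactly half of the total weight, so the next weak learner cannot do well by repeating the previous learner's mistakes.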

Gradient Boosting: A generalization of boosting that fits each new model to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, this is simply the residuals), treating boosting as gradient descent in function space. XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) are the dominant implementations.

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,      # small learning rate = slower, usually better
    max_depth=6,
    subsample=0.8,           # row sampling (like bagging)
    colsample_bytree=0.8,    # column sampling
    early_stopping_rounds=50,
    eval_metric="logloss"
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
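To make the residual-fitting idea concrete, here is a from-scratch sketch of gradient boosting for squared-error regression. It is a teaching example under simplifying assumptions (a synthetic sine-wave dataset, a fixed learning rate, no regularization or early stopping), not how XGBoost or LightGBM are implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate = 0.1
# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    # For squared error, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Training error keeps falling as trees are added, which is why a validation set and early stopping, as in the XGBoost call above, matter in practice.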

Boosting primarily reduces bias. It works best with high-bias, low-variance (underfit) base learners like shallow trees (max_depth 3-6). Deep trees with boosting often overfit.

The key practical difference: bagging can be parallelized across trees (they are independent), while boosting rounds are inherently sequential. Boosting libraries recover speed by parallelizing the work inside each round. LightGBM uses histogram-based splitting and leaf-wise growth, which often makes it faster than XGBoost's exact split search at comparable accuracy; XGBoost also offers a histogram-based tree method.

Stacking: Learning How to Combine Models

Stacking (Wolpert, 1992) trains a meta-model that learns how to best combine the predictions of base models. Instead of simply averaging predictions, you train a second-level model on out-of-fold base model predictions.

The process:

1. Split training data into K folds
2. For each fold i, train all base models on the K-1 other folds and predict on fold i
3. Collect these out-of-fold predictions into a new feature matrix (one column per base model)
4. Train a meta-model on this feature matrix with the true labels

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Base models
base_models = [
    RandomForestClassifier(n_estimators=100),
    GradientBoostingClassifier(n_estimators=100),
]

# Get out-of-fold predictions for each base model
meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# Refit base models on the full training data; at inference time their
# predictions on new examples become the meta-model's input features
for model in base_models:
    model.fit(X_train, y_train)

# Train meta-model
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_train)

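The meta-model learns how much weight to give each base model: for example, that the Gradient Boosting probabilities deserve more trust than the Random Forest's, or that one model is systematically overconfident. Because the meta-model here sees only the base models' predictions, those weights are the same for every example. If you also pass the original features to the meta-model, it can learn that one base model is more reliable than another in particular regions of the feature space. Either way, this is more flexible than simple averaging.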

Stacking often provides a small lift over the best individual model, on the order of 1-3% as a rule of thumb. It is standard practice in Kaggle competitions but adds complexity (multiple models to maintain, an out-of-fold generation process).
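scikit-learn packages the out-of-fold procedure as StackingClassifier, which also handles prediction on new data. The sketch below assumes a synthetic binary dataset and small base models to keep it quick. Treat the parameter values as placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                          # folds used for out-of-fold predictions
    stack_method="predict_proba",
    passthrough=False,             # True would also feed raw features to the meta-model
)
stack.fit(X_tr, y_tr)

stack_acc = stack.score(X_te, y_te)
meta_coefs = stack.final_estimator_.coef_
print(f"stacked accuracy: {stack_acc:.3f}, meta-model weights: {meta_coefs.round(2)}")
```

For a binary problem the meta-model receives one probability column per base model, so its coefficients can be read as the learned weights on each base model.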

Voting Ensembles: The Simplest Approach

Majority vote (classification) or average (regression) over independently trained models. No second-level model required.

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("xgb", xgb.XGBClassifier()),
        ("lr", LogisticRegression()),
    ],
    voting="soft"  # average predicted probabilities
)
ensemble.fit(X_train, y_train)

Soft voting (averaging probabilities) usually outperforms hard voting (majority vote of class labels) when your models are well-calibrated, because it uses each model's confidence rather than only its predicted label. Use soft voting as the default.
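A small worked example shows how the two rules can disagree. The three probabilities are made-up numbers for a single example, and the decision threshold of 0.5 is an assumption.

```python
import numpy as np

# Predicted probability of the positive class from three hypothetical models.
probs = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts a 0/1 vote, and the majority wins.
hard_votes = (probs >= 0.5).astype(int)
hard_prediction = int(hard_votes.sum() > len(probs) / 2)

# Soft voting: average the probabilities, then threshold once.
soft_prediction = int(probs.mean() >= 0.5)

print(f"hard vote: {hard_prediction}, soft vote: {soft_prediction}")
```

Two models lean slightly negative and one is strongly positive. Hard voting discards that confidence and predicts the negative class, while soft voting averages to 0.60 and predicts the positive class.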

The power of voting ensembles is in model diversity. A Random Forest, XGBoost, and logistic regression make different types of errors. An ensemble of five Random Forests with different seeds is less powerful because the models are too similar.

When Ensembles Are Worth the Complexity

Kaggle competitions: Almost always. The difference between first place and tenth place is often 0.1% AUC, and ensembling is one of the most reliable ways to get there.

High-stakes production systems: Credit scoring, fraud detection, medical diagnosis. When the cost of false positives or false negatives is high, an accuracy gain of a few percent from ensembling may be worth the added complexity.

When you have diverse base models: If you have independently built models using different algorithms and feature engineering approaches, stacking them often provides meaningful gain.

When Ensembles Are Not Worth It

Real-time inference with strict latency requirements: Running 5 models instead of 1 costs roughly five times the inference compute, and roughly five times the latency if the models run one after another. For sub-10ms latency requirements, a single well-tuned model is usually the right choice.

Early in a project: Do not reach for ensembles before your individual model is well-tuned. The gain of a percent or two from ensembling a weak model is usually smaller than the gain from better feature engineering or a better base model.

When interpretability is required: Ensembles are much harder to interpret than a single simple model. If you need to explain predictions to regulators or customers, a single decision tree or logistic regression may be required regardless of accuracy.

Keep Reading

Hyperparameter Tuning Guide -- tune your base models before ensembling them
Cross-Validation Guide -- stacking requires careful use of out-of-fold predictions to avoid leakage
ML Model Evaluation Metrics Guide -- measure the right thing when comparing ensemble vs individual model
