Ensemble methods combine predictions from multiple models to produce a final prediction that is usually more accurate than any of the individual models. This is not a theoretical curiosity -- ensembles dominate Kaggle competitions, are common in production recommendation systems, and account for some of the largest accuracy gains in applied ML. Understanding why they work and how to apply each variant will make you a more effective ML practitioner.
Why Ensembles Work: The Bias-Variance-Covariance Decomposition
For an ensemble that averages the predictions of M regression models, the expected squared error can be decomposed as:
Ensemble error = bias^2 + (1/M) * average variance + (1 - 1/M) * average pairwise covariance
Here bias is the average bias of the individual models; noise in the targets adds a further irreducible term. This equation contains a key insight: as M grows, the variance term shrinks toward zero, and what remains is governed by the covariance between models. If models make uncorrelated errors -- if model A fails on example 1 but model B is correct, and model B fails on example 2 but model A is correct -- the errors cancel out in the average. If their errors are highly correlated, adding more models helps very little.
In practice, errors are never completely uncorrelated. But if you train models on different data subsets (bagging), with different algorithms (voting ensembles), or sequentially to correct each other's errors (boosting), you can substantially reduce correlation and improve ensemble performance.
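A small simulation can make the role of correlation concrete. The sketch below assumes M models whose errors each have variance 1 and share the same pairwise correlation rho (an illustrative setup, not taken from any real dataset). It compares the simulated variance of the averaged error with the value 1/M + (1 - 1/M) * rho. The function names are hypothetical.

```python
import random


def simulated_variance(n_models, rho, n_trials=20_000, seed=0):
    """Empirical variance of the averaged error of n_models models.

    Each model's error has variance 1; any two errors have correlation rho.
    """
    rng = random.Random(seed)
    averaged_errors = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # component common to every model
        errors = [
            rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for _ in range(n_models)
        ]
        averaged_errors.append(sum(errors) / n_models)
    mean = sum(averaged_errors) / n_trials
    return sum((e - mean) ** 2 for e in averaged_errors) / n_trials


def predicted_variance(n_models, rho):
    """(1/M) * variance + (1 - 1/M) * covariance, with variance 1 and covariance rho."""
    return 1.0 / n_models + (1.0 - 1.0 / n_models) * rho


results = {}
for rho in (0.0, 0.5, 0.9):
    results[rho] = (simulated_variance(10, rho), predicted_variance(10, rho))
    print(f"rho={rho}: simulated={results[rho][0]:.3f}, predicted={results[rho][1]:.3f}")
```

With uncorrelated errors, ten models cut the error variance to about a tenth. With rho = 0.9 the average is barely better than a single model, which is why the techniques above focus on decorrelating the models.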
Bagging: Bootstrap Aggregating
Bagging trains multiple models independently on random bootstrap samples of the training data. A bootstrap sample is drawn with replacement: some examples appear multiple times, some not at all. When the bootstrap sample is the same size as the training set, approximately 63.2% (1 - 1/e) of the original examples appear in each sample.
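The 63.2% figure can be checked empirically. This is a minimal sketch using only the standard library; the helper name `unique_fraction` and the sample size of 10,000 are illustrative choices.

```python
import math
import random


def unique_fraction(n_examples, seed=0):
    """Fraction of distinct original examples in one bootstrap sample of size n_examples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_examples) for _ in range(n_examples)}
    return len(drawn) / n_examples


fraction = unique_fraction(10_000)
expected = 1 - 1 / math.e  # limit of 1 - (1 - 1/n)^n as n grows
print(f"observed {fraction:.3f}, expected about {expected:.3f}")
```

The examples left out of a given sample (roughly 36.8%) are often called out-of-bag examples.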
The variance reduction happens because each model is trained on a slightly different dataset, causing them to make different errors. Averaging their predictions cancels out much of the noise.
Random Forest is the canonical bagging ensemble. It combines bagging with random feature selection at each split: when building each decision tree, only a random subset of features is considered for each split. This further decorrelates the trees.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # more trees = better, up to diminishing returns
    max_features="sqrt",  # random feature subset at each split
    max_depth=None,       # fully grown trees (high variance, reduced by averaging)
    random_state=42,
)
rf.fit(X_train, y_train)
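Random Forest is robust, quick to train because its trees can be built in parallel, and needs little hyperparameter tuning. Support for missing values and categorical features depends on the implementation, so check your library before relying on it. It is often a good first model to try for tabular data.
Bagging primarily reduces variance. It works best with high-variance (overfit) base learners like deep decision trees. Bagging shallow trees provides less benefit, because shallow trees already have low variance and averaging does nothing to reduce their bias.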
Boosting: Sequential Error Correction
Boosting trains models sequentially, where each model focuses on correcting the mistakes of all previous models. The final prediction combines all models' outputs with learned weights.
AdaBoost (Freund and Schapire, 1997): The first widely used boosting algorithm. After each round, misclassified examples are upweighted so that the next model focuses on the difficult examples. It combines weak learners (each only slightly better than random guessing) into a strong learner.
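One round of the reweighting step can be worked through by hand. The sketch below uses the standard binary AdaBoost update with labels in {-1, +1}; the labels and weak-learner predictions are made-up values for illustration.

```python
import math

# Toy labels and weak-learner predictions in {-1, +1} (illustrative values only).
y_true = [1, 1, -1, -1, 1, -1, 1, -1, 1, 1]
y_pred = [1, 1, -1, -1, 1, -1, 1, 1, -1, -1]  # the last three are wrong

n = len(y_true)
weights = [1.0 / n] * n  # start with uniform example weights

# Weighted error of the weak learner: total weight of the misclassified examples.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's vote in the final ensemble; larger when the error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Upweight mistakes (t * p == -1) and downweight correct examples (t * p == +1).
updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
weights = [w / total for w in updated]

misclassified_mass = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
print(f"error={error:.2f}, alpha={alpha:.3f}, misclassified mass={misclassified_mass:.2f}")
```

After the update, the misclassified examples carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.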
Gradient Boosting: A generalization of boosting that fits each new model to the negative gradient of the loss with respect to the current ensemble's predictions, which for squared error is simply the residuals. This treats boosting as gradient descent in function space. XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) are the dominant implementations.
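To show the residual-fitting loop without any library, here is a minimal sketch of gradient boosting for squared error on a one-dimensional toy problem. It uses depth-1 trees (stumps) as base learners. The data, the learning rate of 0.1, and the helper names are assumptions for illustration, and production libraries add regularization and many optimizations that are omitted here.

```python
import math


def fit_stump(xs, targets):
    """Return the single-threshold predictor that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy regression data: one period of a sine wave.
xs = [i / 50 for i in range(50)]
ys = [math.sin(2 * math.pi * x) for x in xs]

learning_rate = 0.1
preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
history = [mse(ys, preds)]

for _ in range(100):
    # For squared error, the negative gradient is the residual.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print(f"training MSE: start={history[0]:.4f}, end={history[-1]:.4f}")
```

Each round takes a small step toward the residuals, so the training error never increases from one round to the next. The validation error can still rise, which is what early stopping guards against.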
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,       # small learning rate = slower, usually better
    max_depth=6,
    subsample=0.8,            # row sampling (like bagging)
    colsample_bytree=0.8,     # column sampling
    early_stopping_rounds=50,
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
Boosting primarily reduces bias. It works best with high-bias, low-variance (underfit) base learners like shallow trees (max_depth 3-6). Deep trees with boosting often overfit.
The key practical difference: bagging can be parallelized because its trees are independent, while boosting must build its trees one after another. Boosting libraries recover speed by parallelizing the work inside each tree instead. LightGBM uses histogram-based splitting and leaf-wise growth, which often makes training faster than XGBoost at comparable accuracy, though the gap depends on the dataset and settings.
Stacking: Learning How to Combine Models
Stacking (Wolpert, 1992) trains a meta-model that learns how to best combine the predictions of base models. Instead of simply averaging predictions, you train a second-level model on out-of-fold base model predictions.
The process:
- Split training data into K folds
- For each fold i, train all base models on the K-1 other folds and predict on fold i
- Collect these out-of-fold predictions into a new feature matrix (one column per base model)
- Train a meta-model on this feature matrix with the true labels
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Base models
base_models = [
    RandomForestClassifier(n_estimators=100),
    GradientBoostingClassifier(n_estimators=100),
]

# Get out-of-fold predictions for each base model (binary classification:
# column 1 of predict_proba is the positive-class probability)
meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# Train base models on full training data (used at prediction time)
for model in base_models:
    model.fit(X_train, y_train)

# Train meta-model on the out-of-fold predictions
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_train)
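In this setup the meta-model sees only the base models' out-of-fold predictions, so the logistic regression learns how much weight to give each model's predicted probability. If you also pass the original features to the meta-model, it can learn, for example, that the Random Forest is more reliable for examples where features A and B are large, while the Gradient Boosting model is more reliable where feature C is large. Either way, this is more flexible than simple averaging.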
Stacking typically provides a modest lift over the best individual model (roughly 1-3% in favorable cases), and the gain shrinks when the base models are similar. It is standard practice in Kaggle competitions but adds complexity (multiple models to maintain, out-of-fold generation process).
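scikit-learn also ships a `StackingClassifier` that wraps the out-of-fold procedure, the refit on the full training data, and the meta-model into one estimator, which removes some of the bookkeeping. The sketch below is one possible setup on synthetic data from `make_classification`. The dataset, the small `n_estimators` values, and the variable names are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,                                  # folds used for out-of-fold predictions
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# At prediction time the fitted base models feed the meta-model automatically.
test_proba = stack.predict_proba(X_test)[:, 1]
test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Setting `passthrough=True` additionally gives the meta-model the original features, at the cost of a higher risk of overfitting.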
Voting Ensembles: The Simplest Approach
Majority vote (classification) or average (regression) over independently trained models. No second-level model required.
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("xgb", xgb.XGBClassifier()), ("lr", LogisticRegression())],
    voting="soft",  # average predicted probabilities
)
ensemble.fit(X_train, y_train)
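Soft voting (averaging probabilities) usually outperforms hard voting (majority vote of class labels) when your models are well-calibrated, because it lets a confident model outweigh several uncertain ones. Use soft voting as the default whenever every model can output probabilities.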
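A tiny worked example shows how the two voting rules can disagree. The three probabilities below are hypothetical outputs for a single example, chosen so that two models lean slightly negative and one is strongly positive.

```python
# Hypothetical positive-class probabilities from three models for one example.
probabilities = [0.45, 0.45, 0.95]

# Hard voting: each model casts a 0/1 vote, and the majority wins.
hard_votes = [1 if p >= 0.5 else 0 for p in probabilities]
hard_prediction = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0

# Soft voting: average the probabilities, then threshold once.
soft_score = sum(probabilities) / len(probabilities)
soft_prediction = 1 if soft_score >= 0.5 else 0

print(f"hard vote: {hard_prediction}, soft vote: {soft_prediction} (mean={soft_score:.3f})")
```

Hard voting discards how confident each model is. Soft voting keeps that information, which only helps if the probabilities are trustworthy.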
The power of voting ensembles is in model diversity. A Random Forest, XGBoost, and logistic regression make different types of errors. An ensemble of five Random Forests with different seeds is less powerful because the models are too similar.
When Ensembles Are Worth the Complexity
Kaggle competitions: Almost always. The difference between first place and tenth place is often 0.1% AUC, and ensembling is one of the most reliable ways to get there.
High-stakes production systems: Credit scoring, fraud detection, medical diagnosis. When the cost of false positives or false negatives is high, even a small accuracy gain from ensembling may be worth the added complexity.
When you have diverse base models: If you have independently built models using different algorithms and feature engineering approaches, stacking them often provides meaningful gain.
When Ensembles Are Not Worth It
Real-time inference with strict latency requirements: Running 5 models instead of 1 costs roughly five times the compute per prediction, and five times the latency unless the models run in parallel. For sub-10ms latency requirements, a single well-tuned model is usually the right choice.
Early in a project: Do not reach for ensembles before your individual model is well-tuned. The small gain from ensembling weak models is usually less than the gain from better feature engineering or a better base model.
When interpretability is required: Ensembles are much harder to interpret than a single simple model. If you need to explain predictions to regulators or customers, a single decision tree or logistic regression may be required regardless of accuracy.