Why Gradient Boosting Dominates Tabular ML
Despite the excitement around deep learning, gradient-boosted decision trees still win a large share of Kaggle tabular competitions and often outperform neural networks on structured business data. XGBoost, LightGBM, and CatBoost are the three dominant libraries, each with distinct strengths.
XGBoost: The Industry Standard
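XGBoost did not invent gradient boosting, but it popularized the modern formulation: a regularized objective optimized with second-order gradient information, backed by a fast, scalable implementation. It is the most widely deployed of the three, has the largest community, and integrates with most of the ML ecosystem.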
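The "formula" in question is a regularized, second-order objective. A minimal sketch of the split gain it leads to is shown here, computed from per-row gradients and Hessians. The function names (leaf_score, split_gain) and the toy numbers are illustrative assumptions, not part of the XGBoost API; lam and gamma stand in for the L2 penalty and the per-split penalty.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + lam)


def split_gain(grads, hessians, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children.

    gain = 0.5 * [score(left) + score(right) - score(parent)] - gamma
    """
    g_left = sum(g for g, m in zip(grads, left_mask) if m)
    h_left = sum(h for h, m in zip(hessians, left_mask) if m)
    g_total, h_total = sum(grads), sum(hessians)
    g_right, h_right = g_total - g_left, h_total - h_left
    gain = 0.5 * (
        leaf_score(g_left, h_left, lam)
        + leaf_score(g_right, h_right, lam)
        - leaf_score(g_total, h_total, lam)
    )
    return gain - gamma


# Toy node with four rows: two with negative gradients, two with positive ones.
grads = [-0.5, -0.4, 0.6, 0.5]
hessians = [0.25, 0.24, 0.24, 0.25]
good_split = split_gain(grads, hessians, [True, True, False, False])
bad_split = split_gain(grads, hessians, [True, False, True, False])
print(good_split > bad_split)  # True
```

Here good_split separates the negative gradients from the positive ones and earns a clearly positive gain, while bad_split mixes them and comes out slightly negative, so a tree builder would reject it.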
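import xgboost as xgb
from sklearn.model_selection import train_test_split

# X, y: feature matrix and binary labels, assumed to be loaded already
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=1000,         # upper bound; early stopping picks the actual count
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,             # row sampling per tree
    colsample_bytree=0.8,      # column sampling per tree
    eval_metric="logloss",
    early_stopping_rounds=50,  # stop after 50 rounds without improvement
    device="cuda",             # GPU training (XGBoost >= 2.0); use "cpu" without a GPU
)
# use_label_encoder is deprecated and has been dropped from the arguments
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=100,               # log the validation metric every 100 rounds
)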
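By default, XGBoost grows trees level-wise (breadth-first). This tends to produce balanced trees that are less prone to overfitting, but it is often slower than LightGBM's leaf-wise growth at comparable accuracy.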
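The difference between level-wise and leaf-wise growth is easiest to see on a toy tree. The sketch below is not either library's implementation: the GAINS table is made up, and the max_splits budget is a simplification used only to compare the order in which nodes get split under each policy.

```python
import heapq

# Hypothetical split gains for nodes of a binary tree stored heap-style:
# node i has children 2*i + 1 and 2*i + 2. Nodes missing from the dict cannot be split.
GAINS = {0: 10.0, 1: 2.0, 2: 8.0, 3: 1.0, 4: 0.5, 5: 6.0, 6: 5.0}


def level_wise(gains, max_splits):
    """Split nodes level by level, left to right (breadth-first)."""
    order, frontier = [], [0]
    while frontier and len(order) < max_splits:
        next_frontier = []
        for node in frontier:
            if node in gains and len(order) < max_splits:
                order.append(node)
                next_frontier += [2 * node + 1, 2 * node + 2]
        frontier = next_frontier
    return order


def leaf_wise(gains, max_splits):
    """Always split the current leaf with the highest gain (best-first)."""
    order, heap = [], [(-gains[0], 0)]
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order


print(level_wise(GAINS, 4))  # [0, 1, 2, 3]
print(leaf_wise(GAINS, 4))   # [0, 2, 5, 6]
```

With the same budget of four splits, the leaf-wise order collects more total gain (29 versus 21 in this toy table) but builds a lopsided tree, which is why leaf-wise learners lean on limits such as a maximum leaf count and a minimum number of rows per leaf.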
LightGBM: Fastest Training
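LightGBM is Microsoft's answer to XGBoost. It combines histogram-based split finding with leaf-wise (best-first) tree growth, and it offers Gradient-based One-Side Sampling (GOSS) as an optional row-sampling strategy. Together these can make it 3-10x faster than XGBoost on large datasets, though the gap varies with the data and narrows when XGBoost uses its own histogram-based tree method.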
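To show what GOSS does to a batch of rows, here is a minimal pure-Python sketch of the sampling step as described in the LightGBM paper. The function name goss_sample and the synthetic gradients are assumptions for illustration; LightGBM's real implementation works on its internal data structures.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by Gradient-based One-Side Sampling.

    Keeps the top_rate fraction of rows with the largest |gradient|, samples
    other_rate of all rows from the remainder, and up-weights the sampled
    small-gradient rows by (1 - top_rate) / other_rate.
    """
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    by_magnitude = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = by_magnitude[:n_top], by_magnitude[n_top:]
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Synthetic gradients for 1000 rows (illustrative only).
rng = random.Random(42)
grads = [rng.gauss(0.0, 1.0) for _ in range(1000)]
idx, w = goss_sample(grads)
print(len(idx), round(w[-1], 6))  # 300 8.0
```

Each boosting round then fits its tree on 300 rows instead of 1000, and the weight of 8 on the sampled small-gradient rows keeps the gradient statistics roughly unbiased.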
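import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=63,          # key hyperparameter: controls model complexity
    min_child_samples=20,   # minimum rows per leaf (regularization)
    subsample=0.8,
    subsample_freq=1,       # row subsampling is ignored unless this is > 0
    colsample_bytree=0.8,
    device="gpu",           # needs a GPU-enabled LightGBM build; drop for CPU
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    # stop after 50 rounds without improvement, log every 100 rounds
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)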
LightGBM handles categorical features natively: pass them as pandas category columns or as integer codes listed in the categorical_feature parameter, and no one-hot encoding is needed.
CatBoost: Best for Categorical Features
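CatBoost is Yandex's contribution. Its killer feature is ordered target statistics (ordered target encoding): it encodes categorical features internally, with no manual preprocessing, and computes each row's encoding only from rows that precede it in a random permutation, which guards against target leakage.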
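The idea behind ordered target encoding can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's implementation (which uses several permutations and further refinements); the function name ordered_target_stats, the prior settings, and the toy data are assumptions.

```python
import random


def ordered_target_stats(categories, targets, order, prior=0.5, prior_weight=1.0):
    """Encode each row from the targets of same-category rows seen earlier in `order`."""
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is never part of its own encoding.
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded


categories = ["US", "US", "DE", "US", "DE"]
targets = [1, 0, 1, 1, 0]

# With the natural row order, the encodings are easy to check by hand.
print(ordered_target_stats(categories, targets, order=range(5)))
# [0.5, 0.75, 0.5, 0.5, 0.75]

# In practice the order is a random permutation of the rows.
perm = list(range(5))
random.Random(0).shuffle(perm)
shuffled = ordered_target_stats(categories, targets, order=perm)
```

A plain target mean per category would let every row see its own label; restricting each row to the rows before it is what removes that leak.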
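from catboost import CatBoostClassifier, Pool

# Columns CatBoost should treat as categorical (must exist in X_train and X_val)
cat_features = ["country", "device_type", "plan_name"]

# Pool bundles features, labels, and categorical column info
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=50,  # stop after 50 rounds without improvement
    task_type="GPU",           # use "CPU" without a GPU
    verbose=100,               # log every 100 iterations
)
model.fit(train_pool, eval_set=val_pool)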
CatBoost also works well as an untuned baseline: its default parameters often give competitive results.
Hyperparameter Tuning with Optuna
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space; learning_rate is sampled on a log scale
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000),
        "num_leaves": trial.suggest_int("num_leaves", 20, 300),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(**params)
    # Mean ROC AUC over 3-fold cross-validation on the training split
    return cross_val_score(model, X_train, y_train, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)  # best hyperparameters found
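The log=True flag on learning_rate deserves a note, since the range spans more than three orders of magnitude. The sketch below illustrates log-uniform sampling in plain Python; it is only meant to show the scale, not how Optuna's default sampler chooses values, and sample_log_uniform is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-4, 0.3
log_draws = [sample_log_uniform(rng, low, high) for _ in range(10_000)]
lin_draws = [rng.uniform(low, high) for _ in range(10_000)]

# Share of draws below 0.01: large on a log scale, small on a linear one.
share_log = sum(d < 0.01 for d in log_draws) / len(log_draws)
share_lin = sum(d < 0.01 for d in lin_draws) / len(lin_draws)
print(f"log scale: {share_log:.2f}, linear scale: {share_lin:.2f}")
```

On a linear scale only a few percent of draws land below 0.01, so small learning rates would barely be explored; on a log scale more than half of them do.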
When to Choose Each
- XGBoost: existing codebase, widest ecosystem compatibility, sklearn API
- LightGBM: large datasets (>1M rows), fastest training, many categorical features
- CatBoost: high-cardinality categoricals, minimal preprocessing, strong untuned baseline
On many competition datasets, training all three and ensembling their predictions beats any single library.
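The simplest form of ensembling is a weighted average of each model's predicted probabilities. A minimal sketch follows, with made-up probabilities standing in for the three models' validation predictions; blend is a hypothetical helper, not a library function.

```python
def blend(prob_lists, weights=None):
    """Weighted average of per-model predicted probabilities, row by row."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*prob_lists)
    ]


# Hypothetical positive-class probabilities from the three models.
xgb_probs = [0.90, 0.20, 0.60, 0.40]
lgb_probs = [0.80, 0.10, 0.70, 0.55]
cat_probs = [0.85, 0.30, 0.50, 0.45]

blended = blend([xgb_probs, lgb_probs, cat_probs])
labels = [int(p >= 0.5) for p in blended]
print(labels)  # [1, 0, 1, 0]
```

With real models, the inputs would typically come from each model's predict_proba on the same validation rows, and any non-uniform weights should be chosen on held-out data rather than on the training set.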