Hyperparameter tuning is the process of finding the model configuration settings that maximize performance on your validation data. Unlike model weights (which are learned from data during training), hyperparameters are settings you choose before training begins. Getting them right can be the difference between a mediocre model and a production-ready one.
The Hyperparameters That Actually Matter
Not all hyperparameters have equal impact. Before running any search, know which levers matter most for your model type.
Learning rate is the single most impactful hyperparameter for almost any neural network. Too high and training diverges -- loss explodes or oscillates. Too low and training converges too slowly or gets stuck. The relationship between learning rate and model performance is non-monotonic: there is a sweet spot that is often narrow. A factor of 10 in learning rate can be the difference between a converging and a diverging model.
Batch size affects both optimization dynamics and training speed. Smaller batches give noisier gradient estimates (which can help escape local minima) but are slower due to GPU underutilization. Larger batches are computationally efficient but may require a proportionally larger learning rate and can converge to sharper, less-generalizable minima. A common starting range: 32 to 512 for most tasks.
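The "proportionally larger learning rate" idea is often called the linear scaling rule. Below is a minimal sketch of it, assuming you already have a learning rate that works at some reference batch size. The function name and the numbers are illustrative, not from any library, and the scaled value is only a starting point for further tuning.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


# Hypothetical reference point: lr=0.1 was tuned at batch size 256.
for batch_size in (64, 256, 1024):
    print(batch_size, scale_learning_rate(0.1, 256, batch_size))
```

The heuristic tends to break down at very large batch sizes, which is one reason batch size and learning rate are usually searched together.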
Regularization parameters: Dropout rate (0.1 to 0.5 for most architectures), weight decay (L2 regularization, typically 1e-4 to 1e-2). Too little regularization leads to overfitting. Too much prevents learning.
Network architecture: Depth (number of layers) and width (hidden dimension size) for neural networks. These interact with each other and with learning rate. Deeper networks often need smaller learning rates.
For tree-based models (XGBoost, LightGBM): max_depth, n_estimators, min_child_weight, subsample, colsample_bytree. These have strong interactions.
Strategies: From Worst to Best
Manual Tuning
The baseline. You pick values based on intuition or published recommendations, train, evaluate, adjust. This is how researchers tuned models before automated search was practical.
It is still useful for an initial sanity check: train once with reasonable defaults, look at training curves, then decide whether to search. If your model is not even converging, search is premature.
Grid Search
Try every combination of specified values. If you search over 5 values of learning rate, 4 values of batch size, and 3 values of dropout rate, you run 5x4x3 = 60 experiments.
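The 5x4x3 example can be enumerated directly. This sketch uses only the standard library, and the specific values in each list are illustrative assumptions:

```python
from itertools import product

# Illustrative grid: 5 learning rates x 4 batch sizes x 3 dropout rates.
learning_rates = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
batch_sizes = [32, 64, 128, 256]
dropout_rates = [0.1, 0.3, 0.5]

grid = [
    {"lr": lr, "batch_size": bs, "dropout": dr}
    for lr, bs, dr in product(learning_rates, batch_sizes, dropout_rates)
]
print(len(grid))  # 60 configurations

# Adding one more hyperparameter with 4 values multiplies the cost by 4.
weight_decays = [1e-6, 1e-5, 1e-4, 1e-3]
print(len(grid) * len(weight_decays))  # 240
```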
The problem: the number of experiments grows exponentially with the number of hyperparameters, and most of the budget is spent varying hyperparameters that barely affect the result. Bergstra and Bengio (2012) showed empirically and theoretically that random search outperforms grid search when only a few hyperparameters matter, which is almost always the case.
Grid search is only worth using when you have 1-2 hyperparameters to tune and you want exhaustive coverage.
Random Search
Sample hyperparameter values randomly from specified distributions rather than a grid. Run N experiments.
This works better than grid search for a counterintuitive reason: when only a few hyperparameters matter (which is typical), random search automatically explores more values of the important ones. A 10x10 grid over two hyperparameters (100 experiments) explores only 10 values of each. 100 random samples explore 100 distinct values of each hyperparameter.
Practical recommendation: as a rule of thumb, 50-200 random samples over a well-designed search space often get you most of the way (roughly 80-90%) to the best configuration that space contains.
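The "10 values versus 100 values" argument is easy to check with a toy experiment. This is a sketch, assuming two hyperparameters that both live on the unit interval:

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

# 10x10 grid over two hyperparameters: 100 trials, 10 distinct values per axis.
axis = [i / 9 for i in range(10)]
grid_trials = [(x, y) for x in axis for y in axis]

# 100 random trials over the same unit square.
random_trials = [(rng.random(), rng.random()) for _ in range(100)]


def distinct_values(trials, index):
    """Count how many different values were tried for one hyperparameter."""
    return len({trial[index] for trial in trials})


print(distinct_values(grid_trials, 0))    # 10
print(distinct_values(random_trials, 0))  # 100 (collisions are vanishingly unlikely)
```

If only the first hyperparameter matters, the grid has effectively run 10 useful experiments, while random search has run 100.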
Bayesian Optimization
Bayesian optimization builds a probabilistic model of the objective function (validation performance as a function of hyperparameters) and uses it to intelligently choose the next configuration to try. After each experiment, it updates its model and selects the next point that maximizes expected improvement over the current best result.
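To make "expected improvement" concrete, here is a minimal sketch. It assumes the probabilistic model returns a Gaussian prediction (a mean and a standard deviation) for each candidate's validation loss. The candidate names and numbers are made up, and real libraries may use a different surrogate model or acquisition function.

```python
import math


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement for minimization under a Gaussian prediction.

    mean, std: the surrogate's predicted validation loss and its uncertainty
    best: the lowest validation loss observed so far
    """
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf


# Hypothetical surrogate predictions (mean, std) for three candidates.
candidates = {
    "A": (0.30, 0.01),  # predicted worse than the best so far, and confident
    "B": (0.26, 0.01),  # predicted slightly worse, and confident
    "C": (0.30, 0.10),  # predicted worse, but very uncertain
}
best_so_far = 0.25
scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_candidate = max(scores, key=scores.get)
print(next_candidate)  # C
```

The uncertain candidate wins even though its predicted loss is no better than A's. The acquisition function rewards exploring where the model is unsure as well as exploiting where it predicts good results.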
This is significantly more sample-efficient than random search when evaluating each configuration is expensive (as it is for large neural networks). In practice, Bayesian optimization often finds better configurations in fewer experiments than random search.
The downside: Bayesian optimization is inherently sequential (the next point depends on all previous results), making it harder to parallelize across many machines. For high-parallelism environments, asynchronous Bayesian optimization or multi-fidelity methods are more appropriate.
Optuna is a popular open-source option. Its default sampler is TPE (Tree-structured Parzen Estimator), a Bayesian-optimization-style method:
    import optuna

    def objective(trial):
        # Sample one configuration from the search space.
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
        dropout = trial.suggest_float("dropout", 0.1, 0.5)
        # build_model and train_and_evaluate are placeholders for your own code;
        # train_and_evaluate must return the validation loss as a float.
        model = build_model(dropout=dropout)
        val_loss = train_and_evaluate(model, lr=lr, batch_size=batch_size)
        return val_loss

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100)
    print(study.best_params)
Population-Based Training (PBT)
PBT (Jaderberg et al., 2017, DeepMind) combines random search with online adaptation. You train a population of models in parallel. Periodically, low-performing models are replaced by copies of high-performing models with slightly perturbed hyperparameters.
This is uniquely powerful because hyperparameters can change during training. Learning rate schedules discovered by PBT (start high, decrease in complex patterns) often outperform hand-designed schedules. Ray Tune implements PBT.
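The replace-and-perturb step can be sketched in a few lines. This is a toy illustration, not Ray Tune's API: the population is a list of dicts, "weights" stands in for real model state, and the perturbation factors are assumptions chosen for the example.

```python
import copy
import random

rng = random.Random(0)


def exploit_and_explore(population, fraction=0.25, factors=(0.8, 1.2)):
    """One PBT step: bottom performers copy a top performer, then perturb its lr.

    Each member is a dict with "weights", "lr", and "score" (higher is better).
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    # Never let the top and bottom groups overlap.
    cutoff = max(1, min(len(ranked) // 2, int(len(ranked) * fraction)))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        parent = rng.choice(top)
        member["weights"] = copy.deepcopy(parent["weights"])  # exploit
        member["lr"] = parent["lr"] * rng.choice(factors)     # explore
    return population


# Toy population of four members with made-up scores.
population = [
    {"weights": [i], "lr": lr, "score": score}
    for i, (lr, score) in enumerate(
        [(1e-1, 0.60), (1e-2, 0.90), (1e-3, 0.75), (1e-4, 0.50)]
    )
]
exploit_and_explore(population)
print([(m["weights"], m["lr"]) for m in population])
```

In a real run this step alternates with ordinary training, so each member's learning rate traces out a schedule over time instead of staying fixed.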
Tools
Optuna: Flexible, supports Bayesian optimization, CMA-ES, and other samplers. Works with any Python training code. Excellent visualization dashboard.
Ray Tune: Distributed hyperparameter tuning. Integrates with PyTorch, TensorFlow, Hugging Face Trainer. Good for large-scale searches across many machines. Supports PBT.
Weights and Biases Sweeps: Managed hyperparameter search that integrates with W&B experiment tracking. YAML configuration, automatic logging. Good for teams already using W&B.
The Early Stopping Trick
Do not run full training for each configuration. Use early stopping: monitor validation loss and terminate training if it stops improving for N epochs. This can reduce the cost of each experiment by 50-80%.
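A patience-based stopper is only a few lines. This sketch is framework-agnostic: the class name is illustrative, and the list of validation losses is made up to stand in for a real training loop.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # 6 0.6
```

Here training stops after epoch 6 of 8, because epochs 4 to 6 did not beat the best loss from epoch 3.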
For neural networks, you can also use successive halving: run all configurations for a small number of epochs, eliminate the bottom half, run the remaining configurations longer, repeat. This concentrates compute on promising configurations early.
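The halving loop described above can be sketched as follows. The evaluate function is a toy stand-in for real training (an assumption for illustration), and a real implementation would usually resume survivors from checkpoints instead of retraining them from scratch.

```python
import math
import random


def successive_halving(configs, evaluate, min_epochs=1, rounds=3):
    """Keep the better half after each round, doubling the epoch budget each time.

    evaluate(config, epochs) must return a validation loss (lower is better).
    """
    survivors = list(configs)
    epochs = min_epochs
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs))
        survivors = ranked[: max(1, len(ranked) // 2)]
        epochs *= 2
    return survivors


def toy_evaluate(config, epochs):
    """Toy loss: lowest near lr=1e-3, and shrinking as the budget grows."""
    return abs(math.log10(config["lr"]) + 3) + 1.0 / epochs


rng = random.Random(0)
configs = [{"lr": 10 ** rng.uniform(-5, -1)} for _ in range(8)]
winners = successive_halving(configs, toy_evaluate)  # 8 -> 4 -> 2 -> 1
print(winners)
```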
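Hyperband (Li et al., 2017) builds on successive halving: it runs several successive-halving brackets that trade off the number of configurations against the budget given to each. ASHA (Asynchronous Successive Halving) is an asynchronous variant designed for parallel workers. Optuna exposes these ideas as pruners (SuccessiveHalvingPruner, HyperbandPruner), and Ray Tune as schedulers (ASHAScheduler, HyperBandScheduler).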
Search Space Design
The search space you define is as important as the search algorithm. Bad defaults:
- Linear scale for learning rate: sampling uniformly between 0.001 and 0.1 puts about 90% of trials above 0.01 -- use log scale instead
- Too narrow ranges: you might exclude the optimal value
- Too wide ranges: most of the budget is wasted in hopeless regions
Good defaults:
- Learning rate: log-uniform between 1e-5 and 1e-1
- Batch size: categorical over powers of 2 (32, 64, 128, 256, 512)
- Dropout: uniform between 0.0 and 0.5
- Weight decay: log-uniform between 1e-6 and 1e-2
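The defaults above translate directly into a sampler. This is a standard-library sketch: the helper names are illustrative, and the ranges are the ones listed above.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Sample so that each order of magnitude between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one configuration from the search space listed above."""
    return {
        "lr": sample_log_uniform(rng, 1e-5, 1e-1),
        "batch_size": rng.choice([32, 64, 128, 256, 512]),
        "dropout": rng.uniform(0.0, 0.5),
        "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(1000)]

# With log-uniform sampling, roughly half the learning rates fall below 1e-3,
# the midpoint of the range on a log scale.
share_below = sum(s["lr"] < 1e-3 for s in samples) / len(samples)
print(round(share_below, 2))
```

A plain uniform sample over the same learning-rate range would almost never land below 1e-3, which is why the log scale matters.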
Reporting Results Correctly
Always report the performance of the best hyperparameters on a held-out test set -- NOT on the validation set used for selection. If you use the validation set both to select hyperparameters and to report final performance, you are reporting an overly optimistic estimate, because the winning configuration was chosen precisely for scoring well on that data. This is a form of data leakage.
The correct procedure: train/validation/test split. Tune on validation. Report on test. Never touch test during tuning.
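A three-way split can be sketched with the standard library. The function name and the 70/15/15 proportions are assumptions for illustration, and in practice you may need a stratified or time-based split instead.

```python
import random


def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into three disjoint subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Every tuning trial trains on train and is scored on val. The test subset is evaluated once, at the very end, with the single configuration you selected.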