Hyperparameter Tuning: Finding the Model Settings That Actually Matter

Learning rate, batch size, regularization -- the right hyperparameters can mean 10+ percentage points of accuracy. Here is how to find them efficiently without exhaustive search.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#hyperparameter-tuning #optuna #bayesian-optimization #ray-tune #model-training

Hyperparameter tuning is the process of finding the model configuration settings that maximize performance on your validation data. Unlike model weights (which are learned from data during training), hyperparameters are settings you choose before training begins. Getting them right can be the difference between a mediocre model and a production-ready one.

The Hyperparameters That Actually Matter

Not all hyperparameters have equal impact. Before running any search, know which levers matter most for your model type.

Learning rate is the single most impactful hyperparameter for almost any neural network. Too high and training diverges -- loss explodes or oscillates. Too low and training converges too slowly or gets stuck. The relationship between learning rate and model performance is non-monotonic: there is a sweet spot that is often narrow. A factor of 10 in learning rate can be the difference between a converging and a diverging model.
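The narrow sweet spot can be seen even on a toy problem. The sketch below is not a neural network: it assumes a hypothetical one-parameter loss f(w) = w**2 and plain gradient descent, purely to illustrate how a too-low rate barely moves, a moderate rate converges, and a too-high rate diverges.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize the toy loss f(w) = w**2 and return the final loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # plain gradient descent update
    return w * w


# Too low, reasonable, and too high for this particular toy loss.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final loss = {gradient_descent(lr):.3e}")
```

On this loss each step multiplies w by (1 - 2 * lr), so any rate above 1.0 makes the magnitude of w grow instead of shrink. Real networks have no such closed form, which is why the learning rate has to be searched.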

Batch size affects both optimization dynamics and training speed. Smaller batches give noisier gradient estimates (which can help escape local minima) but are slower due to GPU underutilization. Larger batches are computationally efficient but may require a proportionally larger learning rate and can converge to sharper, less-generalizable minima. A common starting range: 32 to 512 for most tasks.
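The "proportionally larger learning rate" idea is often applied as a linear scaling heuristic. The helper below is a minimal sketch of that arithmetic; the function name and the reference values are illustrative assumptions, and the heuristic itself is a starting point to verify on your own model, not a guarantee.

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size (linear scaling heuristic)."""
    return base_lr * new_batch_size / base_batch_size


# Hypothetical reference point: lr=0.1 was tuned at batch size 256.
print(scaled_lr(0.1, 256, 1024))
```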

Regularization parameters: Dropout rate (0.1 to 0.5 for most architectures), weight decay (L2 regularization, typically 1e-4 to 1e-2). Too little regularization leads to overfitting. Too much prevents learning.

Network architecture: Depth (number of layers) and width (hidden dimension size) for neural networks. These interact with each other and with learning rate. Deeper networks often need smaller learning rates.

For tree-based models (XGBoost, LightGBM): max_depth, n_estimators, min_child_weight, subsample, colsample_bytree. These have strong interactions.

Strategies: From Worst to Best

Manual Tuning

The baseline. You pick values based on intuition or published recommendations, train, evaluate, adjust. This is how researchers tuned models before automated search was practical.

It is still useful for an initial sanity check: train once with reasonable defaults, look at training curves, then decide whether to search. If your model is not even converging, search is premature.

Grid Search

Try every combination of specified values. If you search over 5 values of learning rate, 4 values of batch size, and 3 values of dropout rate, you run 5x4x3 = 60 experiments.

The problem: the number of experiments grows exponentially with the number of hyperparameters, and most of the budget is spent re-testing values of hyperparameters that barely affect the result. Bergstra and Bengio (2012) showed empirically and theoretically that random search outperforms grid search when only a few hyperparameters matter, which is almost always the case.

Grid search is only worth using when you have 1-2 hyperparameters to tune and you want exhaustive coverage.
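The 5x4x3 example can be enumerated directly. This is a minimal sketch using only the standard library; the specific values in the grid are illustrative assumptions, not recommendations.

```python
from itertools import product

# Illustrative grid: 5 learning rates, 4 batch sizes, 3 dropout rates.
grid = {
    "lr": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [32, 64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
}


def grid_configs(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def grid_size(values_per_param, n_params):
    """Number of experiments for a full grid with equal-sized axes."""
    return values_per_param ** n_params


configs = grid_configs(grid)
print(len(configs))        # number of experiments for the grid above
print(grid_size(5, 6))     # the same 5 values per axis, but 6 hyperparameters
```

Adding three more hyperparameters at five values each multiplies the count by 125, which is the exponential growth described above.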

Random Search

Sample hyperparameter values randomly from specified distributions rather than a grid. Run N experiments.

This works better than grid search for a counterintuitive reason: when only a few hyperparameters matter (which is typical), random search automatically explores more values of the important ones. A 10x10 grid over two hyperparameters (100 experiments) explores only 10 values of each. 100 random samples explore 100 distinct values of each hyperparameter.

Practical recommendation: for most cases, 50-200 random samples with a good search space gets you 80-90% of the way to optimal.
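A minimal random search loop might look like the following. The search space mirrors the ranges used elsewhere in this article; toy_objective is a hypothetical stand-in for a real train-and-validate run, constructed so that only the learning rate matters much.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration; the learning rate is sampled log-uniformly."""
    return {
        "lr": 10 ** rng.uniform(-5, -1),
        "batch_size": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.1, 0.5),
    }


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the lowest-loss one."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def toy_objective(config):
    """Hypothetical validation loss: best near lr=1e-3, dropout barely matters."""
    return (math.log10(config["lr"]) + 3.0) ** 2 + 0.01 * config["dropout"]


best_config, best_loss = random_search(toy_objective, n_trials=100)
print(best_config, best_loss)
```

Because every trial draws a fresh learning rate, 100 trials probe 100 different learning rates, whereas a 10x10 grid would probe only 10.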

Bayesian Optimization

Bayesian optimization builds a probabilistic model of the objective function (validation performance as a function of hyperparameters) and uses it to choose the next configuration to try. After each experiment, it updates its model and selects the next point by maximizing an acquisition function, such as the expected improvement over the current best result.

This is significantly more sample-efficient than random search when evaluating each configuration is expensive (as it is for large neural networks). In practice, Bayesian optimization often finds better configurations in fewer experiments than random search.

The downside: Bayesian optimization is inherently sequential (the next point depends on all previous results), making it harder to parallelize across many machines. For high-parallelism environments, asynchronous Bayesian optimization or multi-fidelity methods are more appropriate.

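Optuna is a widely used open-source implementation. Its default sampler is TPE (Tree-structured Parzen Estimator), a Bayesian optimization method: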

import optuna

def objective(trial):
    # Each suggest_* call defines one dimension of the search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # sampled on a log scale
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)

    # build_model and train_and_evaluate are placeholders for your own training code.
    model = build_model(dropout=dropout)
    val_loss = train_and_evaluate(model, lr=lr, batch_size=batch_size)
    return val_loss  # the value the study minimizes

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)

Population-Based Training (PBT)

PBT (Jaderberg et al., 2017, DeepMind) combines random search with online adaptation. You train a population of models in parallel. Periodically, low-performing models are replaced by copies of high-performing models with slightly perturbed hyperparameters.

This is uniquely powerful because hyperparameters can change during training. Learning rate schedules discovered by PBT (start high, decrease in complex patterns) often outperform hand-designed schedules. Ray Tune implements PBT.
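The replace-and-perturb step can be sketched in a few lines. This is a simplified illustration, not Ray Tune's implementation: the member layout (a dict with "score", "lr", and "weights"), the 25% cutoff, and the 0.8/1.2 perturbation factors are all assumptions chosen for the example.

```python
import random


def pbt_step(population, rng, perturb=(0.8, 1.2), fraction=0.25):
    """One exploit/explore step; higher 'score' is better.

    Bottom members copy the weights of a top member (exploit) and take a
    perturbed copy of its learning rate (explore). Scores are left as-is
    and would be refreshed by the next round of training.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = min(max(1, int(len(ranked) * fraction)), len(ranked) // 2)
    if cutoff == 0:
        return population  # fewer than two members: nothing to replace
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        parent = rng.choice(top)
        member["weights"] = list(parent["weights"])        # exploit
        member["lr"] = parent["lr"] * rng.choice(perturb)  # explore
    return population


population = [
    {"score": 0.9, "lr": 0.01, "weights": [1.0]},
    {"score": 0.5, "lr": 0.10, "weights": [2.0]},
    {"score": 0.7, "lr": 0.03, "weights": [3.0]},
    {"score": 0.2, "lr": 0.50, "weights": [4.0]},
]
pbt_step(population, random.Random(0))
print(population[3])
```

In a full PBT run this step alternates with short training intervals, so the learning rate each surviving lineage has seen over time forms the schedule.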

Tools

Optuna: Flexible, supports Bayesian optimization, CMA-ES, and other samplers. Works with any Python training code. Excellent visualization dashboard.

Ray Tune: Distributed hyperparameter tuning. Integrates with PyTorch, TensorFlow, Hugging Face Trainer. Good for large-scale searches across many machines. Supports PBT.

Weights and Biases Sweeps: Managed hyperparameter search that integrates with W&B experiment tracking. YAML configuration, automatic logging. Good for teams already using W&B.

The Early Stopping Trick

Do not run full training for each configuration. Use early stopping: monitor validation loss and terminate training if it stops improving for N epochs. This can reduce the cost of each experiment by 50-80%.
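A patience-based stopping rule can be written independently of any framework. The sketch below takes a recorded list of per-epoch validation losses and returns the epoch at which training would have stopped; the function name and the min_delta parameter are assumptions for illustration.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which patience-based early stopping halts.

    Training stops once the validation loss has failed to improve on the
    best value seen so far (by more than min_delta) for `patience`
    consecutive epochs. If that never happens, the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical loss curve that plateaus after epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.60]
print(stopping_epoch(losses, patience=3))
```

Note the trade-off in the example curve: a patience of 3 stops before the late improvement at the final epoch, so patience is itself a setting worth choosing with care.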

For neural networks, you can also use successive halving: run all configurations for a small number of epochs, eliminate the bottom half, run the remaining configurations longer, repeat. This concentrates compute on promising configurations early.
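Successive halving as described here fits in a short loop. This is a minimal synchronous sketch, assuming a user-supplied evaluate(config, epochs) that returns a validation loss; toy_evaluate is a hypothetical stand-in for real training, and a practical version would resume from checkpoints rather than retrain from scratch each round.

```python
import math


def successive_halving(configs, evaluate, min_epochs=1, eta=2):
    """Keep the best 1/eta of the configurations each round, with eta times more epochs."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))  # lower loss first
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return survivors[0]


def toy_evaluate(config, epochs):
    """Hypothetical loss: best near lr=1e-3, and improving with more epochs."""
    return (math.log10(config["lr"]) + 3.0) ** 2 + 1.0 / epochs


candidates = [{"lr": lr} for lr in (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 1e-1)]
print(successive_halving(candidates, toy_evaluate))
```

With eight candidates and eta=2, the rounds evaluate 8, 4, and 2 configurations at 1, 2, and 4 epochs, so most of the epochs go to the configurations that survived earlier rounds.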

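Hyperband (Li et al., 2017) extends successive halving by running it several times with different trade-offs between the number of configurations and the budget given to each. Optuna provides it as a pruner, and Ray Tune provides schedulers for both Hyperband and its asynchronous relative ASHA (Asynchronous Successive Halving).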

Search Space Design

The search space you define is as important as the search algorithm. Bad defaults:

Linear scale for learning rate: sampling uniformly between 0.001 and 0.1 puts about 90% of trials above 0.01 -- use log scale instead
Too narrow ranges: you might exclude the optimal value
Too wide ranges: most of the budget is wasted in hopeless regions

Good defaults:

Learning rate: log-uniform between 1e-5 and 1e-1
Batch size: categorical over powers of 2 (32, 64, 128, 256, 512)
Dropout: uniform between 0.0 and 0.5
Weight decay: log-uniform between 1e-6 and 1e-2
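The difference between linear and log-uniform sampling is easy to check empirically. The sketch below draws learning rates both ways over the 1e-5 to 1e-1 range and measures how many land below 1e-2; the sample count and seed are arbitrary choices for the illustration.

```python
import random


def fraction_below(samples, threshold):
    """Share of samples that fall below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)
n = 10_000

# Uniform on a linear scale: almost everything lands in the top decade.
linear = [rng.uniform(1e-5, 1e-1) for _ in range(n)]

# Log-uniform: sample the exponent uniformly, so each decade is equally likely.
log_uniform = [10 ** rng.uniform(-5, -1) for _ in range(n)]

print(f"linear:      {fraction_below(linear, 1e-2):.2f} of samples below 1e-2")
print(f"log-uniform: {fraction_below(log_uniform, 1e-2):.2f} of samples below 1e-2")
```

Three of the four decades in this range lie below 1e-2, so log-uniform sampling spends roughly three quarters of its trials there, while linear sampling spends roughly one tenth.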

Reporting Results Correctly

Always report the performance of the best hyperparameters on a held-out test set -- NOT on the validation set used for selection. If you use the validation set both to select hyperparameters and to report final performance, the estimate is optimistically biased, because the winning configuration was chosen precisely for scoring well on that data. This is a form of data leakage.

The correct procedure: train/validation/test split. Tune on validation. Report on test. Never touch test during tuning.
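A three-way split can be sketched with the standard library alone. The 70/15/15 proportions and the function name are assumptions for illustration; for time series or grouped data a random shuffle like this would not be appropriate.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve off disjoint test and validation sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]                    # touched once, for the final report
    val = items[n_test:n_test + n_val]       # used for hyperparameter selection
    train = items[n_test + n_val:]           # used to fit model weights
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test set identical across tuning runs, which makes it easier to guarantee that no search ever sees it.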

Keep Reading

Cross-Validation Guide -- reliable validation is the prerequisite for reliable hyperparameter tuning
ML Model Evaluation Metrics Guide -- know what metric to optimize during tuning
Ensemble Methods Guide -- after tuning individual models, ensembles extract further gains
