Gradient descent is the algorithm that makes most machine learning models learn. Almost every neural network you have ever used, along with many logistic regression models and linear classifiers, was trained using some version of it. Understanding gradient descent is not optional if you want to understand machine learning at any real depth.
The Core Idea: Walking Downhill
Imagine you are standing somewhere on a hilly landscape in thick fog. Your goal is to reach the lowest point in the valley. You cannot see far, but you can feel the slope under your feet. The sensible strategy: take a step in whichever direction slopes downward, then repeat.
That is gradient descent. The "landscape" is your model's loss function, which measures how wrong your model's predictions are. The "position" is the current set of model weights. The gradient is the slope -- it points in the direction in which the loss increases fastest. You move in the opposite direction (the negative gradient) to reduce the loss.
Mathematically: new_weight = old_weight - learning_rate * gradient
You do this repeatedly until the loss stops improving meaningfully. That is training.
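A minimal sketch of that update rule, assuming a toy one-dimensional loss (w - 3)^2 chosen purely for illustration. The function names, the learning rate of 0.1, and the step count are assumptions, not part of any library.

```python
def loss(w):
    # Hypothetical toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # new_weight = old_weight - learning_rate * gradient
        w = w - learning_rate * gradient(w)
    return w


final_w = gradient_descent(w=0.0)
print(final_w, loss(final_w))
```

Starting from w = 0, each step shrinks the distance to the minimum by a constant factor, so the weight should settle very close to 3.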
What the Loss Function Actually Measures
Before gradient descent can work, you need something to minimize. The loss function converts "how wrong is my model?" into a single number.
For regression (predicting continuous values), mean squared error is common: average the squared differences between predictions and true values. Squaring penalizes large errors more than small ones.
For classification (predicting categories), cross-entropy loss is standard. It penalizes confident wrong predictions very harshly. If your model says a cat image is 99% likely to be a dog, cross-entropy loss is enormous.
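Both losses can be sketched in a few lines of plain Python. The helper names below are hypothetical, and the cross-entropy helper handles only a single example, taking the probability the model assigned to the correct class.

```python
import math


def mean_squared_error(y_true, y_pred):
    # Average of squared differences between targets and predictions.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(prob_of_true_class):
    # Loss for one example: negative log of the probability given to the correct class.
    return -math.log(prob_of_true_class)


# One prediction is off by 2, the others are exact: the squared error is 4, averaged over 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

confident_right = cross_entropy(0.99)  # model gives the true class 99%
confident_wrong = cross_entropy(0.01)  # model gives the true class 1%

print(mse, confident_right, confident_wrong)
```

The confidently wrong prediction costs several hundred times more than the confidently right one, which is the harsh penalty described above.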
The loss function shapes the landscape that gradient descent navigates. Choose the wrong one and you are optimizing for the wrong thing.
Batch vs. SGD vs. Mini-Batch
The critical practical question: how many training examples do you use to compute each gradient update?
Batch gradient descent uses the entire dataset to compute one gradient update. This gives you an accurate gradient, but if your dataset has 10 million examples, you have to process all 10 million before taking a single step. It is slow, memory-intensive, and impractical for large datasets.
Stochastic gradient descent (SGD) uses one example at a time. Each example gives you a noisy estimate of the true gradient, but you get an update after every single example. This is fast but chaotic -- the loss bounces around rather than smoothly decreasing. That noise is sometimes useful because it helps escape shallow local minima, but it makes convergence erratic.
Mini-batch gradient descent is the standard approach in practice. You use a small batch of examples -- typically 32 to 256 -- to compute each gradient update. This gives you a reasonable gradient estimate, makes efficient use of GPU memory and parallelism, and updates the weights many times per epoch (thousands of times on a large dataset). Almost all modern deep learning uses mini-batch gradient descent, and when people say "SGD" in the context of neural networks, they usually mean mini-batch SGD.
The batch size is a hyperparameter you tune. Smaller batches introduce more noise (can help generalization, can destabilize training). Larger batches are more stable but can lead to sharper minima that generalize worse.
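The three variants differ only in how many examples feed each update, which a single function with a batch_size argument can show. This is a sketch under assumptions: a tiny noiseless dataset generated from y = 2x + 1, a hypothetical function name, and a learning rate and epoch count picked for this toy problem.

```python
import random


def minibatch_sgd(xs, ys, batch_size, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w * x + b with mean squared error; returns (w, b, number_of_updates)."""
    rng = random.Random(seed)
    w, b, updates = 0.0, 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw fresh random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2.0 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2.0 * e for e in errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            updates += 1
    return w, b, updates


xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

# batch_size=1 is SGD, batch_size=20 is full-batch, batch_size=4 is mini-batch.
results = {size: minibatch_sgd(xs, ys, batch_size=size) for size in (1, 4, 20)}
for size, (w, b, updates) in results.items():
    print(size, round(w, 3), round(b, 3), updates)
```

All three settings see the same data for the same number of epochs, but batch size 1 makes 20 updates per epoch while the full batch makes only one. Because this toy data has no noise, every setting should approach w = 2 and b = 1; on real data the smaller batches would show the bouncing loss described above.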
Learning Rate: The Most Important Hyperparameter
The learning rate controls how big a step you take in the downhill direction after each gradient computation.
Too large: You overshoot the minimum repeatedly. The loss oscillates or diverges entirely. Your model never converges.
Too small: You take tiny steps and training takes forever. Tiny steps also make it harder to move out of plateaus and shallow local minima, so training can stall at a sub-optimal solution.
Just right: The loss decreases smoothly and the model reaches a good solution in a reasonable number of steps.
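The three regimes are easy to reproduce on a toy loss. The sketch below assumes the loss w^2 (gradient 2w), a starting weight of 1.0, and three illustrative learning rates; none of these values come from a real model.

```python
def final_loss(learning_rate, steps=50, w=1.0):
    # Run gradient descent on loss(w) = w^2, whose gradient is 2 * w.
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w * w


too_large = final_loss(1.1)    # each step overshoots and lands farther from 0
too_small = final_loss(0.001)  # each step barely moves
just_right = final_loss(0.1)   # steady shrinkage toward 0

print(too_large, too_small, just_right)
```

With the large rate the loss ends far above its starting value of 1.0, with the small rate it has barely dropped after 50 steps, and with the middle rate it is close to zero.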
Finding the right learning rate is part science, part experience. A common starting point for Adam is 0.001. For raw SGD, 0.01 or 0.1 is typical. But these are starting points, not rules.
Learning rate schedules adjust the learning rate during training. Common approaches: decay the learning rate by a fixed factor every N epochs, use cosine annealing (smoothly decrease from high to low following a cosine curve), or use warmup (start very small, ramp up, then decay). These schedules often improve final model quality significantly.
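These schedules are just functions from the current epoch or step to a learning rate. A minimal sketch follows, with hypothetical function names and illustrative defaults (halving every 10 epochs, linear warmup); real training frameworks ship their own scheduler classes with their own parameters.

```python
import math


def step_decay(base_lr, epoch, drop_every=10, factor=0.5):
    # Multiply the learning rate by `factor` once every `drop_every` epochs.
    return base_lr * factor ** (epoch // drop_every)


def warmup_cosine(base_lr, step, total_steps, warmup_steps):
    if step < warmup_steps:
        # Linear warmup: ramp from a small value up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: decrease smoothly from base_lr toward 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 550, 1000):
    print(step, warmup_cosine(0.001, step, total_steps=1000, warmup_steps=100))
```

In this sketch the rate climbs to 0.001 over the first 100 steps, passes through half that value midway through the decay, and reaches zero at the final step.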
The Adam Optimizer: Why Everyone Uses It
Plain SGD applies the same learning rate to every weight. Adam (Adaptive Moment Estimation) adapts the learning rate for each weight individually based on the history of gradients for that weight.
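The intuition: each weight's step is scaled by the recent magnitude of its own gradient. Weights whose gradients have been consistently large get a smaller effective learning rate; weights whose gradients have been consistently small get a larger one. Weights whose gradients are noisy and keep changing sign take smaller steps, because their averaged gradient is small relative to its typical magnitude. This per-parameter adaptation often makes Adam more robust than plain SGD across a wide range of problems.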
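Adam maintains two exponentially decaying running averages: the mean of recent gradients (the first moment, which acts like momentum) and the mean of recent squared gradients (the uncentered second moment, often loosely called the variance). It divides the first by the square root of the second to compute an adaptive step for each parameter.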
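A sketch of one Adam update for a single scalar weight, written out by hand. The function name and calling convention are hypothetical; the defaults for beta1, beta2, and eps are the commonly used values and are not stated in this article. The bias-correction terms compensate for both running averages starting at zero.

```python
import math


def adam_step(w, grad, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running average of gradients (momentum-like first moment).
    m = beta1 * m + (1.0 - beta1) * grad
    # Running average of squared gradients (second moment).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction: both averages start at 0, so early values are too small.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Per-parameter step: the averaged gradient divided by its typical magnitude.
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two weights with very different gradient scales take almost the same first step.
w_big, _, _ = adam_step(w=0.0, grad=100.0, m=0.0, v=0.0, t=1)
w_small, _, _ = adam_step(w=0.0, grad=0.01, m=0.0, v=0.0, t=1)
print(w_big, w_small)
```

Plain SGD would move the first weight 10,000 times farther than the second. Here both move by roughly the learning rate, which is the per-parameter adaptation in action.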
In practice, Adam often converges faster and is less sensitive to the initial learning rate than SGD. This is why most deep learning practitioners use Adam or one of its variants. AdamW, which decouples weight decay from the gradient-based update instead of folding it into the gradient, is a common default for modern training runs.
Raw SGD with momentum is still used in some settings -- particularly for training large vision models -- because it sometimes finds flatter minima that generalize better. But for most tasks, Adam is the right starting point.
Reading Loss Curves
A loss curve plots the loss value over training steps or epochs. Reading loss curves is a core diagnostic skill.
Healthy training curve: Both training loss and validation loss decrease together, with validation loss slightly higher than training loss. They eventually plateau. The model is learning and generalizing.
Overfitting: Training loss keeps decreasing while validation loss plateaus or increases. Your model is memorizing training data rather than learning general patterns. Solutions: more data, stronger regularization (dropout, weight decay), smaller model, early stopping.
Underfitting: Both losses plateau at a high value. Your model lacks the capacity to capture the patterns in your data. Solutions: larger model, more training epochs, better features.
Unstable training: Loss oscillates wildly or diverges. Learning rate is too high, or there is a bug in your data pipeline or model architecture.
Loss spikes: Sudden jumps up followed by recovery. Often caused by a bad batch with an unusual example or a bug in data augmentation. Usually not catastrophic unless frequent.
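Some of these patterns can be flagged with a rough heuristic over the recorded losses. The function below is only a sketch: its name, the window of 3 epochs, and the tolerance are illustrative assumptions, and it cannot tell a healthy plateau from underfitting, because that depends on whether the plateau value is acceptable for your task.

```python
def diagnose(train_losses, val_losses, window=3, tolerance=1e-3):
    """Label the recent trend of two loss curves (needs more than `window` points)."""
    train_trend = train_losses[-1] - train_losses[-1 - window]
    val_trend = val_losses[-1] - val_losses[-1 - window]
    if train_losses[-1] > train_losses[0]:
        return "unstable"  # training loss ended above where it started
    if train_trend < -tolerance and val_trend >= -tolerance:
        return "overfitting"  # training improves, validation does not
    if train_trend < -tolerance and val_trend < -tolerance:
        return "still improving"
    return "plateaued"  # healthy or underfitting, depending on the loss level


improving = diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.7, 0.6, 0.5])
overfit = diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.95, 1.0, 1.05])
unstable = diagnose([1.0, 2.0, 0.5, 4.0, 9.0], [1.0, 2.0, 0.5, 4.0, 9.0])
flat = diagnose([1.0, 0.9, 0.9, 0.9, 0.9], [1.0, 0.95, 0.95, 0.95, 0.95])
print(improving, overfit, unstable, flat)
```

A check like this is no substitute for plotting the curves, but it can serve as an automated warning during long runs.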
Local Minima and Saddle Points
A common concern: will gradient descent get stuck in a local minimum (a point that is lower than its immediate neighbors but not the global lowest point)?
In high-dimensional spaces -- modern neural networks have millions or billions of parameters -- poor local minima appear to be much rarer than intuition from two-dimensional pictures suggests. Most critical points (where the gradient is zero) are saddle points: the loss curves upward in some dimensions and downward in others. Gradient descent with mini-batch noise and momentum tends to escape saddle points reasonably well.
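A two-dimensional toy example shows why noise matters at a saddle point. The sketch assumes the hypothetical loss x^2 + (y^2 - 1)^2, which has a saddle at (0, 0) with loss 1 and minima at (0, 1) and (0, -1) with loss 0. Gaussian noise added to the gradient stands in for mini-batch noise.

```python
import random


def loss(x, y):
    return x ** 2 + (y ** 2 - 1.0) ** 2


def gradient(x, y):
    return 2.0 * x, 4.0 * y * (y ** 2 - 1.0)


def descend(noise_std, steps=200, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start exactly on the line y = 0 that leads into the saddle
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Simulated mini-batch noise on each gradient component.
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y, loss(x, y)


exact = descend(noise_std=0.0)
noisy = descend(noise_std=0.01)
print(exact)
print(noisy)
```

Without noise the y-gradient is exactly zero along the starting line, so the run slides into the saddle and stops at a loss of 1. With a little noise, y is nudged off zero, the downhill direction takes over, and the run ends near one of the two minima.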
The more practical concern is not local minima but saddle points and flat regions that slow training. This is another reason adaptive optimizers like Adam are popular: momentum keeps the weights moving through flat regions, and per-parameter scaling enlarges the step where gradients are small.
Gradient Descent in Practice
Start with Adam at a learning rate of 0.001. Plot your loss curves after a short training run to check for instability. If loss diverges, reduce learning rate by 10x. If training is too slow, try increasing it slightly.
Monitor both training and validation loss to catch overfitting early. Use early stopping: stop training when validation loss has not improved for N epochs and restore the weights from the best validation checkpoint.
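The early stopping rule can be sketched as a small function over per-epoch validation losses. The function name and the patience of 3 are illustrative assumptions; in a real training loop you would also save a checkpoint whenever the best loss improves, then reload it after stopping.

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # New best: this is where a checkpoint would be saved.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and restore the best weights.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


stop_epoch, best_epoch = early_stopping([1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9])
print(stop_epoch, best_epoch)
```

In this example the best validation loss occurs at epoch 2 (counting from 0) and training stops at epoch 5, after three epochs without improvement.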
If your model is not learning at all (loss stays flat), check for: learning rate too small, vanishing gradients (use better initialization or batch normalization), data pipeline bugs, or a bug in your loss function.
Gradient descent is not magic. It is iterative numerical optimization applied to a specific objective. Understanding it lets you diagnose training problems, choose the right optimizer, and tune hyperparameters with intention rather than guesswork.