Overfitting and underfitting are the two fundamental failure modes of machine learning models. Overfitting happens when your model learns the training data too well, including its noise and quirks, and fails to generalize to new examples. Underfitting happens when your model is too simple to capture the patterns in the data at all. The training versus validation loss curve is how you tell which problem you have, and the fixes are almost entirely different.
Understanding these two failure modes is foundational. Most of the standard techniques in ML (regularization, dropout, early stopping, data augmentation, architecture selection) are, at bottom, tools for managing this tradeoff.
Overfitting: The Model Memorizes Instead of Learning
Overfitting is when a model fits the training data so closely that it captures noise rather than signal. The classic example: a model trained to classify images of cats and dogs that perfectly classifies every training image, but fails on new images it has never seen. It did not learn what makes a cat a cat; it memorized which pixel patterns corresponded to which labels in the training set.
Symptoms of overfitting:
- Training loss continues to decrease as training progresses
- Validation loss decreases initially, then starts increasing (or stops decreasing)
- A large gap between training accuracy and validation accuracy (e.g., 98% training, 72% validation)
- The model gives very confident predictions on training examples and uncertain or wrong predictions on new examples
What causes it: the model has more capacity (roughly, more parameters) than the amount and complexity of the data warrant. A 100-million-parameter model trained from scratch on 1,000 examples will usually overfit severely, because it has enough capacity to memorize every training example.
A concrete example: fitting a polynomial to six data points. A polynomial with six parameters can pass through all six points exactly (zero training error). But the curve between the points oscillates wildly, making terrible predictions for new data. A simpler polynomial with three parameters misses some training points but captures the underlying trend much better.
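The polynomial example can be reproduced in a few lines with NumPy. This is a minimal sketch on made-up data: the linear trend, the noise level, and the random seed are all assumptions chosen for illustration, not values from any real dataset.

```python
import numpy as np

# Six noisy samples of a hypothetical linear trend y = 2x + 1 (made-up data).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 6)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.2, size=x_train.shape)

# Degree 5 means six coefficients: enough to pass through all six points.
flexible = np.polyfit(x_train, y_train, deg=5)
# Degree 2 means three coefficients: it cannot hit every point exactly.
simple = np.polyfit(x_train, y_train, deg=2)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_mse_flexible = mse(flexible, x_train, y_train)
train_mse_simple = mse(simple, x_train, y_train)

# Evaluate between the training points against the noise-free trend.
x_new = np.linspace(0.0, 1.0, 101)
y_new = 2.0 * x_new + 1.0
print("train MSE, degree 5:   ", train_mse_flexible)
print("train MSE, degree 2:   ", train_mse_simple)
print("new-data MSE, degree 5:", mse(flexible, x_new, y_new))
print("new-data MSE, degree 2:", mse(simple, x_new, y_new))
```

The degree-5 fit drives training error to essentially zero because it interpolates the noise. Comparing the two new-data errors shows how much that costs on points the model has not seen.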
Underfitting: The Model Is Too Simple
Underfitting happens when the model is not powerful enough to capture the patterns in the data. Both training and validation loss are high, and performance is poor on both.
Symptoms of underfitting:
- High training loss that is not decreasing much with more training
- Training and validation loss are similarly high (small gap, but both poor)
- The model's predictions seem random or default to the most common class
What causes it: using a model that is too simple for the task (e.g., linear regression for a non-linear relationship), too few training epochs, too high a learning rate causing unstable training, or bad features that do not capture the relevant signal.
Diagnosing Which You Have: The Loss Curve
Plot training loss and validation loss on the same chart, with training steps or epochs on the x-axis and loss on the y-axis.
Overfitting pattern: training loss decreases consistently, validation loss decreases initially then flattens or increases. The two curves diverge. The divergence point is where overfitting begins.
Underfitting pattern: both curves are high and decrease slowly or plateau at a high value. The curves are close together but both at an unacceptable level.
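Good fit pattern: both curves decrease and converge to low values. Validation loss is usually slightly higher than training loss, which is expected, but the gap is small and stable. (With dropout or other training-only regularization active, validation loss can occasionally sit below training loss.)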
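The three patterns can be turned into a rough rule of thumb in code. The sketch below is only a heuristic: the function name `diagnose` and the thresholds `high_loss` and `max_gap` are hypothetical, and sensible values depend entirely on your loss function and task.

```python
def diagnose(train_losses, val_losses, high_loss=1.0, max_gap=0.25):
    """Rough label for a pair of loss curves; thresholds are illustrative only."""
    final_train, final_val = train_losses[-1], val_losses[-1]
    best_epoch = val_losses.index(min(val_losses))

    # Overfitting: validation loss bottomed out earlier and has risen since,
    # while training loss kept falling past that point.
    if (final_val > val_losses[best_epoch] + max_gap
            and final_train < train_losses[best_epoch]):
        return "overfitting"
    # Underfitting: both curves end high and close together.
    if final_train > high_loss and abs(final_val - final_train) <= max_gap:
        return "underfitting"
    # Good fit: both curves end low with a small gap.
    if final_train <= high_loss and abs(final_val - final_train) <= max_gap:
        return "good fit"
    return "unclear"


# Made-up curves, one per pattern.
print(diagnose([2.0, 1.0, 0.5, 0.2, 0.1], [2.1, 1.2, 0.9, 1.1, 1.4]))
print(diagnose([2.5, 2.3, 2.2, 2.2], [2.6, 2.4, 2.3, 2.3]))
print(diagnose([2.0, 1.0, 0.4, 0.2], [2.1, 1.1, 0.5, 0.3]))
```

A check like this does not replace looking at the plot, but it can flag runs worth inspecting when you train many models.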
PyTorch does not plot these curves for you, so record the losses during training and plot them with Matplotlib:
import matplotlib.pyplot as plt
# Assume train_losses and val_losses are lists collected during training
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label="Training Loss")
plt.plot(val_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Training vs Validation Loss")
plt.show()
Fixes for Overfitting
More Data
The most reliable fix. More training examples give the model more patterns to learn and make it harder to memorize noise. If doubling your dataset is feasible, try this first before any regularization technique.
Dropout
Dropout randomly sets a fraction of neuron outputs to zero during each training step. A common setting is 0.3 to 0.5 (30 to 50 percent of neurons disabled each step). This prevents any single neuron from becoming critical to the model's predictions, forcing the network to distribute knowledge across many neurons.
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout = nn.Dropout(0.3)  # 30% dropout
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # active only while the model is in training mode
        x = self.fc2(x)
        return x
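Dropout is disabled only when the model is in evaluation mode. Call model.eval() before inference or validation, and model.train() before resuming training.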
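In PyTorch, whether dropout is active depends on the module's mode, which is set with train() and eval(). The short sketch below applies a standalone nn.Dropout layer to a tensor of ones (a made-up input) to show the difference between the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.3)
x = torch.ones(1, 1000)  # made-up input so the effect is easy to see

dropout.train()  # training mode: dropout is active
train_out = dropout(x)
zeroed_fraction = (train_out == 0).float().mean().item()
# Surviving activations are scaled by 1 / (1 - p) so the expected value is unchanged.
kept_value = train_out.max().item()

dropout.eval()  # evaluation mode: dropout passes the input through untouched
eval_out = dropout(x)

print(f"fraction zeroed in training mode: {zeroed_fraction:.2f}")
print(f"kept activations scaled to: {kept_value:.3f}")
print("eval output equals input:", torch.equal(eval_out, x))
```

Forgetting to switch to evaluation mode is an easy mistake to make, and it shows up as noisy, slightly worse validation numbers rather than as an error.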
L1 and L2 Regularization (Weight Decay)
Regularization adds a penalty to the loss function that discourages large weights. Large weights mean the model is very sensitive to specific input patterns, which is a sign of memorization.
L2 regularization (weight decay) adds the sum of squared weights to the loss, pushing all weights toward smaller values without eliminating them. L1 regularization adds the sum of absolute weights, which tends to drive some weights to exactly zero, creating a sparse model.
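In most modern deep learning frameworks, L2 regularization is exposed as weight decay in the optimizer. In PyTorch's Adam the weight_decay term is added to the gradient as an L2 penalty, while torch.optim.AdamW applies decoupled weight decay and is a common alternative: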
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
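The weight_decay argument covers the L2 case only. An L1 penalty is usually added to the loss by hand. The following is a minimal sketch of one training step, assuming a stand-in linear model, a made-up batch, and a hypothetical penalty strength `l1_lambda`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 1)       # stand-in model for illustration
inputs = torch.randn(32, 20)   # made-up batch
targets = torch.randn(32, 1)

l1_lambda = 1e-3               # hypothetical penalty strength; tune per task
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

data_loss = nn.functional.mse_loss(model(inputs), targets)
# L1 penalty: sum of absolute values of the weights (the bias is left out here).
l1_penalty = model.weight.abs().sum()
loss = data_loss + l1_lambda * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"data loss {data_loss.item():.4f}, L1 penalty {l1_penalty.item():.4f}")
```

Leaving biases out of the penalty is a common choice, not a requirement.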
Early Stopping
Monitor validation loss during training. When it stops improving (for a patience period of, say, 10 epochs), stop training. Save the model checkpoint from the epoch with the best validation loss.
This prevents the model from continuing to train after it starts overfitting, even if training loss would keep decreasing.
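The patience logic is small enough to write by hand. The class below is a framework-independent sketch: the name `EarlyStopping`, its `step` method, and the `min_delta` parameter are illustrative choices, and the validation losses are made up.

```python
class EarlyStopping:
    """Tracks the best validation loss and signals when to stop."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: this is where a checkpoint would be saved.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improve for three epochs, then get worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

In a real training loop, the improvement branch is where you would save the model's weights so the best checkpoint can be restored after stopping.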
Simpler Model Architecture
If regularization is insufficient, reduce model capacity: use fewer layers, fewer neurons per layer, or fewer parameters overall. A smaller model has less capacity to memorize noise.
Fixes for Underfitting
More Model Capacity
Add layers or neurons. An underfitting model is constrained by its architecture. Increasing the number of parameters gives it more capacity to represent complex patterns.
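To get a feel for how much capacity a change adds, it helps to count parameters. The helper below is a hypothetical sketch for fully connected networks only. The 784, 256, and 10 layer widths match the SimpleNet example; the wider and deeper variants are made up for comparison.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(
        (n_in + 1) * n_out  # n_in weights and one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small = mlp_param_count([784, 256, 10])
wider = mlp_param_count([784, 512, 10])
deeper = mlp_param_count([784, 256, 256, 10])

print(small)   # 203530
print(wider)   # 407050
print(deeper)  # 269322
```

Doubling the hidden width roughly doubles the parameter count here, while adding a second 256-unit layer adds about a third. Parameter count is only a proxy for capacity, but it is a useful first check.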
More Training Time
Train for more epochs. Underfitting is often simply a matter of not running training long enough for the model to converge. Check the loss curve: if both curves are still decreasing, more training will help.
Better Features
If the features (inputs) do not capture the relevant signal, no amount of model capacity will help. Domain knowledge about which features matter can be more valuable than architectural changes.
Lower Learning Rate
If the learning rate is too high, training oscillates and fails to converge. Reducing the learning rate by a factor of 10 is often the first thing to try when training is unstable.
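The effect is easy to see on a toy problem. This sketch runs plain gradient descent on f(w) = w², an assumed stand-in for a real loss, with two learning rates that differ by a factor of 10.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w, starting from w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w


w_high_lr = gradient_descent(lr=1.1)   # each step multiplies w by 1 - 2.2 = -1.2
w_low_lr = gradient_descent(lr=0.11)   # each step multiplies w by 1 - 0.22 = 0.78

print(f"lr=1.1:  |w| after 50 steps = {abs(w_high_lr):.3g}")
print(f"lr=0.11: |w| after 50 steps = {abs(w_low_lr):.3g}")
```

With the high learning rate, every step overshoots the minimum and lands farther away on the other side, so the iterate grows without bound. With the lower rate, it shrinks steadily toward zero.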
Better Architecture
For some data types, certain architectures are significantly better suited than others. For images, convolutional neural networks (CNNs) typically outperform fully connected networks. For sequential data, transformers or recurrent networks typically outperform fully connected ones. Using the wrong architecture type can cause persistent underfitting regardless of training time or capacity.
The Bias-Variance Tradeoff
The bias-variance tradeoff is the theoretical framework underlying overfitting and underfitting.
Bias is the error from incorrect assumptions in the learning algorithm. High bias means the model consistently misses the target (underfitting). Variance is the error from sensitivity to fluctuations in the training data. High variance means the model's predictions change dramatically with different training samples (overfitting).
Most design decisions in ML touch this tradeoff. More complex models tend to have lower bias but higher variance. Simpler models tend to have higher bias but lower variance. Regularization reduces variance at the cost of slightly higher bias. The goal is to find the model complexity that minimizes expected error on new data, which for squared-error loss is the sum of squared bias, variance, and irreducible noise.
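For squared-error loss, the decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The bias enters squared, and the noise term sets a floor that no choice of model can lower.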
The practical approach: start simple. Check train vs. validation loss. If you are underfitting, increase complexity. If you are overfitting, apply regularization or get more data. Iterate.