Training a neural network is a pipeline, not a single step. Every stage from raw data to deployed model has decisions that compound — a bad choice in data splitting will invalidate every metric downstream, and a misconfigured learning rate can waste days of compute. This guide walks through the full process with enough detail to make real decisions.
Step 1: Data Preparation
Raw data is never model-ready. The preparation pipeline for most tasks includes:
Cleaning — remove duplicates, handle missing values (drop rows, impute with median/mode, or use a sentinel value), remove obvious corruptions.
Normalization — scale numerical features so they are on comparable ranges. Standard scaling (subtract mean, divide by std) works for most cases. Min-max scaling (rescale to [0, 1]) is better when the feature distribution is bounded. If you skip normalization, gradient-based optimizers behave poorly because different features have wildly different gradient magnitudes.
Encoding — convert categorical features to numbers. One-hot encoding for low-cardinality categoricals (fewer than ~20 values). Embedding layers for high-cardinality (user IDs, product IDs). Label encoding only for ordinal features (small, medium, large has a natural order).
Text tokenization — if your input is text, tokenize it with the same tokenizer used to pretrain your base model. Using a different tokenizer than the one the model was trained with is a silent but severe error.
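The normalization and encoding steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the function names (`standard_scale`, `min_max_scale`, `one_hot`) and the sample values are hypothetical, and a real project would more likely use a library such as scikit-learn.

```python
import math

def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant features
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # fall back to 1.0 for constant features
    return [(v - lo) / span for v in values]

def one_hot(categories):
    """Map each category to a 0/1 vector, one slot per distinct value."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = [[1 if index[c] == i else 0 for i in range(len(vocab))]
            for c in categories]
    return rows, vocab

# Hypothetical sample data.
ages = [20.0, 30.0, 40.0]
scaled = standard_scale(ages)    # zero mean, unit variance
bounded = min_max_scale(ages)    # values in [0, 1]
encoded, vocab = one_hot(["red", "blue", "red"])
```

For clarity, the statistics here are computed on whatever list is passed in; in a real pipeline they should come from the training split only.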
Step 2: Train/Validation/Test Split
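Split your data before any preprocessing that learns parameters (fitting a scaler, fitting a tokenizer vocabulary). If you normalize on the full dataset before splitting, the statistics applied to your training data are computed partly from the validation and test sets, so information about the held-out data reaches the model. This is data leakage. Fit preprocessing on the training split only, then apply the fitted transform to validation and test.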
Standard splits: 80% train, 10% validation, 10% test. For small datasets: 70/15/15. For very large datasets: 98/1/1 (with a million examples, 1% is still 10,000 validation examples, which is plenty).
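One way to keep the order right (split first, then fit) is sketched below. The helper names (`split_indices`, `fit_scaler`, `apply_scaler`) and the toy feature are assumptions made for illustration; the point is only that the mean and standard deviation are computed from the training rows and then reused unchanged on validation and test.

```python
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle row indices and cut them into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains
    return train, val, test

def fit_scaler(values):
    """Learn scaling statistics. Call this on training data only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, (std or 1.0)

def apply_scaler(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]

# Hypothetical single-feature dataset with 100 rows.
feature = [float(i) for i in range(100)]
train_idx, val_idx, test_idx = split_indices(len(feature))

mean, std = fit_scaler([feature[i] for i in train_idx])
train_x = apply_scaler([feature[i] for i in train_idx], mean, std)
val_x = apply_scaler([feature[i] for i in val_idx], mean, std)
test_x = apply_scaler([feature[i] for i in test_idx], mean, std)
```

Only `train_x` is guaranteed to have zero mean; the validation and test features will be slightly off-center, which is expected and harmless.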
The test set is held out until the very end — you evaluate on it exactly once after all training is complete. If you peek at test set metrics during development and make decisions based on them, the test set is no longer an unbiased estimate of real-world performance.
For time-series data: split by time, never randomly. If your task is predicting tomorrow's sales, your test set should be the last N days, not a random sample of days across all time. Random splits on time-series create future leakage.
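A time-based split can be sketched as follows, assuming each row carries a date. The `time_split` helper and the synthetic sales series are hypothetical; the idea is that every training row precedes every test row.

```python
from datetime import date, timedelta

def time_split(rows, n_test_days):
    """Split (date, value) rows so the last n_test_days form the test set."""
    rows = sorted(rows, key=lambda r: r[0])
    cutoff = rows[-1][0] - timedelta(days=n_test_days)
    train = [r for r in rows if r[0] <= cutoff]
    test = [r for r in rows if r[0] > cutoff]
    return train, test

# Hypothetical daily sales for 30 consecutive days.
start = date(2024, 1, 1)
sales = [(start + timedelta(days=i), 100 + i) for i in range(30)]

train, test = time_split(sales, n_test_days=7)
latest_train_day = max(d for d, _ in train)
earliest_test_day = min(d for d, _ in test)
```

The same approach extends to a three-way split by carving a validation window out of the end of the training period.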
Step 3: Model Architecture Selection
Start with the simplest architecture that could plausibly work:
Tabular data — start with gradient boosted trees (XGBoost, LightGBM) before neural networks. They train faster, require less hyperparameter tuning, and often outperform neural networks on structured tabular data. If you need differentiability or integration with other neural components, use a multi-layer perceptron (MLP).
Images — use a pretrained convolutional network (ResNet, EfficientNet) and fine-tune it. Training from scratch requires far more data and compute than most projects have.
Text — use a pretrained transformer (BERT, RoBERTa, or a generative model depending on the task). Fine-tuning a pretrained model beats training from scratch unless your domain is highly specialized.
Sequences — transformers have replaced RNNs for most sequence tasks. For low-latency applications where transformer attention is too expensive, an LSTM is still reasonable.
Step 4: Loss Function Choice
The loss function defines what "better" means during training. Getting it wrong means optimizing for the wrong objective.
Binary classification — binary cross-entropy. If classes are severely imbalanced, either weight the positive class or use focal loss (which down-weights easy negatives).
Multi-class classification — categorical cross-entropy (also called softmax loss).
Regression — mean squared error (MSE) is standard. Mean absolute error (MAE) is more robust to outliers. Huber loss combines both: MSE for small errors, MAE for large ones.
Ranking — listwise or pairwise losses (e.g., BPR loss, LambdaRank). Standard classification losses are a poor proxy for ranking quality.
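To make the differences between these losses concrete, here is a small sketch in plain Python. The function names and the `pos_weight` and `delta` parameters are illustrative choices, and the Huber version uses the common convention with a factor of 0.5 on the quadratic branch, so it matches half the squared error for small residuals.

```python
import math

def binary_cross_entropy(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean BCE; pos_weight > 1 up-weights errors on the positive class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def mse(y_true, y_pred):
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for residuals up to delta, linear beyond it."""
    total = 0.0
    for y, yh in zip(y_true, y_pred):
        err = abs(y - yh)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)

# One large outlier among otherwise perfect predictions.
targets = [0.0, 0.0, 0.0, 10.0]
preds = [0.0, 0.0, 0.0, 0.0]
loss_mse = mse(targets, preds)      # dominated by the outlier
loss_mae = mae(targets, preds)
loss_huber = huber(targets, preds)

# Weighting the positive class raises the cost of missing positives.
bce_plain = binary_cross_entropy([1, 0], [0.5, 0.5])
bce_weighted = binary_cross_entropy([1, 0], [0.5, 0.5], pos_weight=2.0)
```

On this toy input the single outlier contributes far more to MSE than to MAE or Huber, which is the robustness difference described above.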
Step 5: Optimizer Configuration
Adam is the default optimizer for almost everything. It adapts learning rates per parameter and handles sparse gradients well. Use it unless you have a specific reason not to.
Learning rate — the most important hyperparameter. Start with 1e-3 for Adam. If loss is unstable (jumping around), lower it. If loss barely moves, try 1e-2.
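Weight decay: with Adam, adding an L2 penalty to the loss is not the same as weight decay, because the penalty's gradient is rescaled by the adaptive per-parameter step size. AdamW decouples the decay from the gradient update and applies it directly to the weights. Use AdamW with weight decay of 1e-4 to 1e-2 for regularization.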
Batch size — larger batches train faster per epoch but may generalize slightly worse. 32-256 is a reasonable range for most GPU-trained models. If your GPU memory allows it, 256-512 is efficient. Gradient accumulation lets you simulate large batches with limited memory.
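The difference between Adam with an L2 penalty and AdamW is easiest to see on a single scalar weight. The sketch below is a simplified one-parameter update written for illustration (the `adam_step` function and its `decoupled` flag are hypothetical, not a library API); real code would use the optimizer classes of a framework.

```python
import math

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update on a scalar weight.

    decoupled=True  -> AdamW-style: decay is applied directly to the weight.
    decoupled=False -> L2 penalty: decay is folded into the gradient.
    """
    state["t"] += 1
    if not decoupled:
        grad = grad + weight_decay * w
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_w -= lr * weight_decay * w
    return new_w

def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# With a zero task gradient, only the regularizer moves the weight.
w_adamw = adam_step(1.0, 0.0, fresh_state(), lr=0.01,
                    weight_decay=0.1, decoupled=True)
w_l2 = adam_step(1.0, 0.0, fresh_state(), lr=0.01,
                 weight_decay=0.1, decoupled=False)
```

In this toy case the decoupled update shrinks the weight by `lr * weight_decay * w`, while the L2 variant shrinks it by roughly `lr` regardless of the decay coefficient, because Adam normalizes the penalty gradient.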
Step 6: Learning Rate Schedule
A fixed learning rate is rarely optimal. Common schedules:
Cosine annealing — decay the learning rate following a cosine curve from initial lr to near zero. Widely used, works well in most settings.
Warmup + decay — increase lr linearly for the first N steps (warmup), then decay. Standard for fine-tuning pretrained transformers. Warmup prevents large gradient updates in early training when the model weights are far from optimal.
ReduceLROnPlateau: reduce lr when validation loss stops improving. Simple and adaptive, with only a reduction factor and a patience to set instead of a full schedule.
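Linear warmup followed by cosine decay can be written as a single function of the step number. This is a sketch under assumed defaults (`base_lr`, `warmup_steps`, `min_lr` and the step counts are illustrative values, not recommendations from the text).

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup period.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    # Cosine curve from base_lr (progress=0) down to min_lr (progress=1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1100
lr_first = lr_at_step(0, total)        # small value at the start of warmup
lr_peak = lr_at_step(100, total)       # base_lr once warmup ends
lr_middle = lr_at_step(600, total)     # halfway through the decay
lr_last = lr_at_step(total, total)     # min_lr at the end
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each step.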
Step 7: Epochs, Early Stopping, and Checkpointing
Train until validation loss stops improving, not for a fixed number of epochs. Early stopping monitors validation loss and stops training when it has not improved for N consecutive epochs (patience). Typical patience: 5-10 epochs.
Checkpointing saves the model weights at each validation improvement. After early stopping triggers, load the checkpoint from the epoch with the best validation loss — not the weights from the final epoch (which may be slightly overfit).
Combine both: checkpoint every epoch that improves validation loss, stop when patience is exhausted, load the best checkpoint.
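The combined logic can be sketched as a loop over per-epoch validation losses. To stay self-contained, the example replays a hypothetical list of losses instead of training a real model, and the "checkpoint" is a placeholder dictionary standing in for saved weights.

```python
def run_early_stopping(val_losses, patience=3):
    """Replay validation losses; return (best_checkpoint, best_loss, stopped_at)."""
    best_loss = float("inf")
    best_checkpoint = None
    stale_epochs = 0
    stopped_at = None  # stays None if patience is never exhausted

    for epoch, val_loss in enumerate(val_losses):
        weights = {"epoch": epoch}  # placeholder for real model weights

        if val_loss < best_loss:
            # Improvement: save a checkpoint and reset the patience counter.
            best_loss = val_loss
            best_checkpoint = dict(weights)
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                stopped_at = epoch
                break

    return best_checkpoint, best_loss, stopped_at

# Hypothetical loss curve: improves until epoch 2, then plateaus.
losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5]
checkpoint, best, stopped = run_early_stopping(losses, patience=3)
```

Note that the run stops before the final value in the list is ever seen. That is the trade-off patience controls: a larger value tolerates longer plateaus at the cost of extra epochs.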
Step 8: Evaluation on Test Set
After training is complete and you have selected your final model using the validation set, evaluate it exactly once on the test set. Report precision, recall, F1, and AUC for classification tasks, and RMSE and MAE for regression.
If the test set metrics are significantly worse than the validation metrics, the most likely cause is that you have overfit to the validation set through repeated evaluation, so the validation scores are no longer a clean estimate. This is a process failure, not a model failure.
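For reference, several of the metrics named above can be computed directly from predictions. The sketch below covers precision, recall, F1, and RMSE in plain Python with hypothetical labels; AUC is omitted because it needs ranked scores, not hard predictions, and in practice a metrics library would normally be used.

```python
import math

def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics, treating label 1 as the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    )

# Hypothetical test-set labels and predictions.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)

regression_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```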
Common Training Failures and How to Fix Them
NaN loss — the learning rate is too high, causing gradient explosion. Reduce lr by 10x. If NaN appears immediately at step 1, check for NaN or infinite values in your input data.
Loss is not decreasing — either the learning rate is too low (increase it 10x), the model has poor initialization (try a different seed or use He/Xavier initialization), or the loss function is mismatched with the output activation (e.g., sigmoid output with MSE loss for a classification task).
Validation loss rises while training loss falls — classic overfitting. Add dropout (0.1-0.5 depending on severity), reduce model capacity (fewer layers/neurons), add weight decay, or get more training data. If you have very little data, consider a simpler model architecture.
Oscillating validation loss — learning rate is still slightly too high for the late stages of training. A schedule that decays lr over time typically fixes this.
Training is very slow: check GPU utilization. If it is below roughly 80%, the data loading pipeline is likely the bottleneck. Use prefetching, multiple DataLoader workers, and pinned memory to speed up CPU-to-GPU transfer.
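The input check suggested for NaN loss at step 1 can be as simple as scanning for non-finite values before training starts. This is a minimal sketch assuming the data is a list of numeric rows; `find_non_finite` is a hypothetical helper name.

```python
import math

def find_non_finite(rows):
    """Return (row, column) positions of NaN or infinite values."""
    bad = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c))
    return bad

# Hypothetical input batch with one NaN and one infinity.
batch = [
    [1.0, 2.0],
    [float("nan"), 3.0],
    [4.0, float("inf")],
]
problems = find_non_finite(batch)
```

Running a check like this once on the prepared dataset is cheap compared with discovering the problem after a failed training run.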
Training a neural network is iterative. Plan for 5-10 experiment cycles from first run to production-ready model. Log every experiment with its hyperparameters and results using MLflow or Weights and Biases so you can compare runs without relying on memory.