A neural network is a sequence of mathematical functions, organized in layers, that transforms an input into a prediction. Each layer applies a transformation to the output of the previous layer. The network learns which transformations produce accurate predictions by adjusting its internal parameters over many training examples, often millions. That is the complete description. Most of modern deep learning, including LLMs and computer vision models, is a variation on this core idea.
This guide explains the mechanics of neural networks through analogies and examples rather than equations. If you can read code and understand how a function maps inputs to outputs, you have everything you need to follow this.
The Building Block: A Single Neuron
A neuron is a function. It takes multiple numerical inputs, multiplies each input by a weight, sums the products, adds a bias term, and passes the result through an activation function to produce a single output.
An analogy: imagine you are deciding whether to take an umbrella. Your inputs are the weather forecast (0.0 to 1.0 probability of rain), whether the walk is long (binary), and whether you have a meeting outside (binary). Each input has a weight that reflects how much it matters for your decision. You multiply each input by its weight, sum the products, and apply a threshold: above 0.5 means take the umbrella.
That process is a single neuron. The weights determine how much each input influences the output. In a real neural network, these weights are learned from data, not set by hand.
The formal description:
output = activation(w1*x1 + w2*x2 + w3*x3 + bias)
Where w1, w2, w3 are weights, x1, x2, x3 are inputs, bias is an adjustable offset, and activation is a function that introduces non-linearity (more on this below).
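The umbrella decision can be sketched as a few lines of Python. This is a minimal illustration, not a real model: the weights, the zero bias, and the helper names `step` and `neuron` are all made up for this example, and a trained network would learn its weights from data.

```python
def step(z):
    """Threshold activation: 1.0 if z is above 0.5, otherwise 0.0."""
    return 1.0 if z > 0.5 else 0.0


def neuron(inputs, weights, bias, activation):
    """Multiply each input by its weight, sum, add the bias, then activate."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical hand-picked weights: rain chance, long walk, outdoor meeting.
WEIGHTS = [0.6, 0.2, 0.3]
BIAS = 0.0

# 70% chance of rain, long walk, no outdoor meeting: 0.42 + 0.2 + 0.0 = 0.62
rainy_day = neuron([0.7, 1.0, 0.0], WEIGHTS, BIAS, step)

# 10% chance of rain, short walk, no outdoor meeting: 0.06
clear_day = neuron([0.1, 0.0, 0.0], WEIGHTS, BIAS, step)
```

Here `rainy_day` comes out as 1.0 (take the umbrella) and `clear_day` as 0.0 (leave it at home). Changing the weights changes the decision, which is exactly what training does.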
Layers: Organizing Neurons
A neural network organizes neurons into layers. The first layer receives the raw input. The last layer produces the prediction. The layers in between are called hidden layers and perform intermediate transformations.
Input layer: The raw data. For an image classifier, this might be the pixel values. For a text classifier, this might be numerical representations of words. The input layer is not really a layer of neurons; it is just the data entry point.
Hidden layers: One or more layers of neurons. Each neuron in a hidden layer takes outputs from the previous layer as its inputs, applies a transformation, and passes the result to the next layer. These layers learn to extract features from the data: edges in images, syntax patterns in text, correlations in tabular data.
Output layer: The final layer produces the prediction. For a binary classifier (spam or not spam), this is a single neuron producing a probability between 0 and 1. For a ten-class image classifier (digits 0 through 9), this is ten neurons, one per class, each producing a score.
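To make the three kinds of layers concrete, here is a small sketch of a forward pass through one hidden layer and one output layer. The weights are arbitrary placeholder values, the helper names are invented for this example, and softmax is assumed as the way to turn output scores into probabilities (a common choice, but not the only one).

```python
import math


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` plus its bias defines one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder weights: 3 inputs -> 2 hidden neurons -> 2 output scores.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0], [-1.0, 1.0]]
OUTPUT_B = [0.0, 0.0]


def predict(x):
    hidden = dense(x, HIDDEN_W, HIDDEN_B, relu)               # hidden layer
    scores = dense(hidden, OUTPUT_W, OUTPUT_B, lambda z: z)   # output layer
    return softmax(scores)


probabilities = predict([1.0, 2.0, 3.0])  # the list passed in is the "input layer"
```

Note that the input layer never appears as a function: it is just the list handed to `predict`, which matches the point that it is a data entry point rather than a layer of neurons.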
Why "Deep" Learning: The Case for Multiple Layers
A single-layer network (one input layer, one output layer) can only learn linear patterns. It can separate two classes if they are linearly separable in the input space, but fails on anything more complex.
Adding hidden layers allows the network to learn non-linear patterns. A two-hidden-layer network can learn to detect edges and then combine edges into shapes. A deeper network learns a hierarchy: edges to shapes to objects to scenes.
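A classic small case is XOR (output 1 when exactly one of two inputs is 1), which is not linearly separable, so no single-layer network can compute it. The sketch below shows that one hidden layer of two ReLU neurons is enough. The weights are set by hand for illustration rather than learned, and `xor_network` is a name invented for this example.

```python
def relu(z):
    return max(0.0, z)


def xor_network(x1, x2):
    """Two hidden ReLU neurons followed by a linear output neuron."""
    h1 = relu(x1 + x2)        # positive when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # positive only when both inputs are on
    return h1 - 2.0 * h2      # cancel out the "both on" case


results = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer builds two intermediate features, and the output neuron combines them. That is the same compositional idea as edges combining into shapes, just at the smallest possible scale.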
The term "deep learning" refers to networks with many hidden layers. Modern image recognition networks have dozens to hundreds of layers. Transformer-based LLMs are similarly deep: the largest GPT-3 model has 96 transformer layers. More layers mean more representational capacity, which means more complex patterns can be learned, at the cost of more data and more computation required for training.
A concrete example: recognizing a cat in a photo. A single-layer network sees raw pixel values and tries to find a direct mapping from pixels to "cat." A deep network:
- Layer 1 learns to detect edges (lines in specific orientations)
- Layer 2 combines edges into corners and curves
- Layer 3 combines curves into shapes (eyes, ears, paws)
- Layer 4 combines shapes into cat-like structures
- Output layer: "cat with 94% confidence"
Each layer builds on the one before it. This compositional structure is why depth matters.
Activation Functions: Why Non-Linearity Matters
Without activation functions, a stack of layers is just a series of linear transformations, which collapses into a single linear transformation. You could have 100 layers, but the whole network would behave like one layer. Activation functions break this by introducing non-linearity.
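The collapse is easy to check numerically. The sketch below uses two small weight matrices with made-up values and omits biases for brevity; `matvec` and `matmul` are helper names invented for this example.

```python
def matvec(matrix, vector):
    """Apply one linear layer (no bias): matrix times vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix product of a and b, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.5]]
x = [1.0, 1.0]

# Two linear layers applied one after the other.
two_layers = matvec(W2, matvec(W1, x))

# One layer whose weights are the product of the two matrices.
one_layer = matvec(matmul(W2, W1), x)

# The same two layers with ReLU applied in between.
with_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])
```

Both `two_layers` and `one_layer` come out as `[8.0, 5.5]`, so the second layer added nothing a single layer could not do. With ReLU in between, the result is `[9.0, 6.0]`, which no single matrix reproduces for every input.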
The most common activation function today is ReLU (Rectified Linear Unit): for negative inputs, output 0. For positive inputs, output the input unchanged.
ReLU(x) = max(0, x)
This simple function is non-linear. Applied between layers, it allows networks to learn arbitrary curves and decision boundaries, not just straight lines. The mathematics behind why this works involves approximation theory, but the intuition is: any curved function can be approximated by many small linear pieces, and ReLU creates those pieces.
Other activation functions: sigmoid (squashes output to 0-1, useful for output layer probabilities), tanh (squashes to -1 to 1), GELU (used in transformer architectures including GPT and BERT). For most hidden layers, ReLU or its variants are the standard choice.
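For reference, each of these functions fits in a line or two of Python. This sketch uses only the standard `math` module, and the GELU shown is the exact form based on the Gaussian error function (libraries often use a faster approximation instead).

```python
import math


def relu(x):
    """0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real number into the range -1 to 1."""
    return math.tanh(x)


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


samples = [-2.0, 0.0, 2.0]
table = {f.__name__: [f(v) for v in samples] for f in (relu, sigmoid, tanh, gelu)}
```

Printing `table` shows how each function treats a negative, zero, and positive input, which is a quick way to build intuition for the differences.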
The Training Process: How Networks Learn
Training is how the network learns the right weights. The process has four steps, repeated millions of times:
Step 1 - Forward pass: Feed an input through the network, layer by layer, to get a prediction.
Step 2 - Measure error: Compare the prediction to the correct answer using a loss function. For classification, this is typically cross-entropy loss. For regression, it is mean squared error. The loss is a single number: how wrong was this prediction?
Step 3 - Backpropagation: Calculate, for each weight in the network, how much a small change in that weight would change the loss. This is a calculus operation (computing gradients), but you do not need to understand the calculus to understand what it does: it assigns "blame" to each weight in proportion to its contribution to the error.
Step 4 - Update weights: Adjust each weight slightly in the direction that reduces the loss. The size of the adjustment is controlled by the learning rate. Repeat from Step 1.
Over millions of iterations, the weights converge to values that produce accurate predictions on the training data. A neural network that took days to train on a supercomputer has simply done this update step billions of times.
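The four steps can be sketched on the smallest possible network: a single neuron with one weight, one bias, and no activation, trained with squared error. Everything here is an illustrative assumption, including the toy data (generated from y = 2x + 1), the learning rate, and the number of passes. With only one neuron, backpropagation reduces to two derivative formulas written out by hand; real frameworks compute these automatically for every weight.

```python
# Toy data generated from y = 2x + 1, made up for illustration.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def train(data, learning_rate=0.05, epochs=500):
    w, b = 0.0, 0.0
    history = []  # average loss per pass over the data
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, target in data:
            prediction = w * x + b           # Step 1: forward pass
            error = prediction - target
            epoch_loss += error ** 2         # Step 2: measure error (squared error)
            grad_w = 2.0 * error * x         # Step 3: gradient of the loss w.r.t. w
            grad_b = 2.0 * error             #         gradient of the loss w.r.t. b
            w -= learning_rate * grad_w      # Step 4: update the weights
            b -= learning_rate * grad_b
        history.append(epoch_loss / len(data))
    return w, b, history


w, b, history = train(DATA)
```

After training, `w` ends up very close to 2 and `b` very close to 1, and the values in `history` shrink toward zero. A large network does the same thing, only with far more weights and far more examples.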
Why This Matters for Developers Who Use LLMs
Understanding the training process explains several practical things about LLMs:
Why they hallucinate. The training process optimizes for producing the most likely next token, not the most accurate one. The network has no internal truth check. It generates what patterns in training data suggest should come next.
Why they have a knowledge cutoff. Training uses a dataset collected up to a specific date. The weights encode patterns from that data. Events after the cutoff are not in the weights.
Why prompting works. The same network, with the same weights, produces different outputs for different inputs. Prompting changes the input. The network's output reflects both its learned patterns and the specific prompt.
Why larger models tend to perform better. More parameters mean more capacity to store patterns. A model with 7 billion parameters can represent more complex functions than a model with 1 billion parameters. Given enough training data and compute, that extra capacity is a large part of why the largest models tend to score highest on benchmarks.
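To get a feel for what "parameters" means, you can count them for a plain fully connected network: one weight per connection between adjacent layers, plus one bias per neuron. The layer sizes below are hypothetical (784 inputs would correspond to a 28x28 pixel image feeding a ten-class classifier), and LLMs use transformer layers whose counts work differently, so treat this as a sketch of the arithmetic only.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total


small = count_parameters([784, 128, 10])
large = count_parameters([784, 512, 512, 10])
```

The smaller network has 101,770 parameters and the larger one has 669,706. Widening and deepening the hidden layers is what drives the count up.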
Keep Reading
- How Large Language Models Work: A Complete Guide Without the Math Overload - How transformers extend the neural network basics explained here
- Overfitting and Underfitting in ML: How to Diagnose and Fix Both - The core training failure modes every developer should understand
- When Not to Use Machine Learning: Simpler Solutions That Actually Work - When neural networks are the wrong tool