Convolutional Neural Networks (CNNs) transformed computer vision. Before CNNs, image classification relied on extensive hand-crafted feature engineering -- detecting edges, textures, and shapes with pipelines designed by domain experts. With CNNs, the features are learned automatically from data. AlexNet won the ImageNet 2012 challenge with a top-5 error rate roughly 10 percentage points lower than the runner-up, which relied on hand-crafted features, and that result marked the beginning of the modern deep learning era.
Understanding CNNs requires understanding three core ideas: convolution as local pattern detection, pooling as downsampling, and how modern architectures stack these ideas to build deep networks that recognize complex visual patterns.
Convolution: Detecting Local Patterns
A convolution is a sliding-window operation. A small matrix of weights called a filter or kernel slides across the image. At each position, the filter is multiplied element-wise with the overlapping image pixels and summed. This produces a single output value for each position. The result is a "feature map" -- a matrix of values indicating how strongly each spatial location matches the filter's pattern.
A 3x3 filter that has large positive weights on the left column and large negative weights on the right column will fire strongly on vertical edges (where pixel values transition from bright on the left to dark on the right). Different filters detect different patterns: horizontal edges, diagonal edges, color gradients, small textures.
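The sliding-window operation and the vertical-edge filter described above can be sketched in plain Python. This is a minimal illustration rather than how a framework implements it: the function name `conv2d_valid` and the toy image are assumptions made for the example, and, like deep learning libraries, it does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window element-wise with the kernel, then sum.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x6 grayscale image: bright (1) on the left half, dark (0) on the right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(4)]

# Vertical-edge filter: positive left column, negative right column.
vertical_edge = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 3, 3, 0]
```

The feature map is zero over the flat regions and peaks where the window straddles the bright-to-dark transition.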
The key innovation: the same filter is applied across the entire image. If a pattern (an edge, a corner, a texture) appears anywhere in the image, the filter detects it. This "translation equivariance" means the network does not need to learn separate detectors for an edge in the top-left versus an edge in the bottom-right.
In a convolutional layer, you have multiple filters -- typically 32, 64, 128, or 256. Each filter produces its own feature map. Conceptually, the layer output has shape (height, width, num_filters); PyTorch, used in the examples here, orders it as (batch, num_filters, height, width). Deeper layers combine the feature maps of shallower layers, building from simple edge detectors to complex pattern detectors.
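To make the layer's size concrete, here is a small sketch that computes the output shape and the number of learnable parameters for one convolutional layer. The helper `conv_layer_summary` is hypothetical, and it assumes stride 1 with padding chosen so that height and width are preserved.

```python
def conv_layer_summary(height, width, in_channels, num_filters, kernel_size):
    """Return (output_shape, parameter_count) for a stride-1, size-preserving conv layer."""
    # One feature map per filter; height and width are unchanged by assumption.
    output_shape = (height, width, num_filters)
    # Each filter spans every input channel and has one bias term.
    weights_per_filter = kernel_size * kernel_size * in_channels
    parameter_count = num_filters * (weights_per_filter + 1)
    return output_shape, parameter_count


shape, params = conv_layer_summary(
    height=32, width=32, in_channels=3, num_filters=64, kernel_size=3
)
print(shape)   # (32, 32, 64)
print(params)  # 1792
```

Note that the parameter count does not depend on the image size, which is a direct consequence of reusing the same filter at every position.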
```python
import torch.nn as nn

# A simple convolutional block: 3x3 convolution -> batch norm -> ReLU
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```
Batch normalization after convolution (before activation) stabilizes training by normalizing the distribution of activations within each mini-batch. It is nearly universal in modern CNNs.
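As a rough illustration of what the normalization does, the sketch below normalizes one channel's activations across a mini-batch using the batch mean and variance, then applies a learnable scale and shift. The function name and the sample values are assumptions for the example; real layers also track running statistics for use at inference time, which is omitted here.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations using mini-batch statistics."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    # gamma and beta stand in for the learnable scale and shift.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations for one channel
normalized = batch_norm_channel(activations)
print([round(v, 3) for v in normalized])
```

After normalization the values have approximately zero mean and unit variance, regardless of the scale of the incoming activations.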
Pooling: Downsampling and Spatial Invariance
Pooling layers reduce the spatial dimensions of feature maps. Max pooling, the most common type, divides the feature map into non-overlapping windows (typically 2x2) and takes the maximum value in each window. A 32x32 feature map becomes 16x16 after one 2x2 max pooling operation.
Pooling serves two purposes:
- Reduces computation. Smaller feature maps require fewer multiply-accumulate operations in subsequent layers.
- Provides spatial invariance. If a feature appears anywhere within a pooling window, the pooled output is the same. This makes the network somewhat robust to small translations of features.
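Both points can be seen in a small plain-Python sketch of 2x2 max pooling. The helper name `max_pool_2x2` and the toy feature maps are assumptions for illustration; the function expects even height and width.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a list of lists with even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


# A single strong activation at row 0, column 0 ...
a = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# ... and the same activation shifted diagonally, still inside the first window.
b = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(max_pool_2x2(a))  # [[9, 0], [0, 0]]
print(max_pool_2x2(b))  # [[9, 0], [0, 0]]
```

The 4x4 input shrinks to 2x2, and the shift inside the window leaves the output unchanged. A shift that crosses a window boundary would change the output, which is why the invariance is only partial.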
Modern architectures often reduce spatial dimensions through strided convolutions (convolution with stride > 1) rather than explicit pooling layers. Both approaches are valid; the trend has moved toward strided convolutions.
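The standard output-size arithmetic shows why the two approaches are interchangeable for downsampling. The helper below is a sketch with an assumed name; it applies the usual formula floor((size + 2 * padding - kernel_size) / stride) + 1 along one spatial dimension and ignores dilation.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


# 2x2 max pooling with stride 2 on a 32-pixel dimension.
print(conv_output_size(32, kernel_size=2, stride=2))             # 16
# 3x3 convolution with stride 2 and padding 1 on the same input.
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16
# 3x3 convolution with stride 1 and padding 1 keeps the size.
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32
```

Both downsampling options halve the dimension; the difference is that the strided convolution has learnable weights while max pooling does not.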
The Deep Architecture Pattern
A standard CNN follows a pattern: alternating convolution and pooling blocks that progressively reduce spatial dimensions while increasing the number of channels, followed by a global pooling operation, followed by fully-connected layers for classification.
As you go deeper: spatial dimensions decrease (from 224x224 to 112x112 to 56x56...) while the number of filters increases (64 to 128 to 256 to 512). Early layers detect low-level patterns (edges, colors). Later layers detect high-level patterns (eyes, wheels, text).
The final fully-connected layers combine these spatial pattern detections into a global image classification. The last layer has one neuron per class, with a softmax activation producing class probabilities.
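For reference, the softmax step can be sketched in a few lines of plain Python. The logits below are hypothetical scores for three classes, and subtracting the maximum is a common numerical-stability trick rather than part of the definition.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

In PyTorch training code the softmax is usually not written as an explicit layer, because `nn.CrossEntropyLoss` expects raw logits and applies the equivalent computation internally.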
ResNet: Solving the Vanishing Gradient Problem
As networks get deeper, training becomes harder. Gradients propagate backward through each layer during backpropagation, and in deep networks they tend to either vanish (become too small to update early layers) or explode (become too large, destabilizing training).
ResNet (Residual Networks, He et al. 2015) introduced residual connections (also called skip connections) that directly add the input of a block to its output: output = F(input) + input. This creates a "shortcut" path that gradients can flow through without passing through the nonlinear transformations of the block.
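One way to see why the shortcut helps is to apply the chain rule to a single block. The derivation below is a sketch under a simplifying assumption: it ignores any activation applied after the addition. The symbols y (block output), x (block input), and the loss \mathcal{L} are introduced here for illustration.

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial F}{\partial x} + I \right)
  = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
    + \frac{\partial \mathcal{L}}{\partial y}
```

The second term passes the upstream gradient to the block input unchanged, so even when the first term is very small, this block alone cannot shrink the gradient reaching earlier layers to zero.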
```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # bias=False: the BatchNorm layer that follows supplies its own shift
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # The residual connection
        out = self.relu(out)
        return out
```
Residual connections made it practical to train networks with 100, 152, and even more than 1,000 layers without the optimization difficulties that stalled earlier deep networks. ResNet-50 (50 layers) and ResNet-101 (101 layers) remain widely used pretrained backbones for image tasks.
The intuition: a residual block learns F(x) -- the residual transformation needed to improve upon passing x through unchanged. If the optimal transformation is close to the identity (no change), F(x) approaches zero and the block is effectively bypassed. This makes it easy for the network to learn that some blocks should not transform their input at all.
EfficientNet: Scaling Efficiently
EfficientNet (Tan and Le, 2019) addressed a practical question: if you have more compute budget, should you make the network deeper, wider (more filters), or use higher input resolution?
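EfficientNet's answer: scale all three dimensions together using fixed ratios. The baseline network (EfficientNet-B0) was found by neural architecture search, the scaling ratios were found by a small grid search on that baseline, and a single compound coefficient then scales depth, width, and resolution together. This "compound scaling" produces significantly better accuracy-efficiency trade-offs than scaling a single dimension.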
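The scaling rule can be sketched numerically. The coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values reported in the EfficientNet paper for its baseline network; the function names are hypothetical, and the released B1-B7 models use rounded settings, so this should be read as an approximation rather than a recipe for reproducing them.

```python
# Scaling coefficients reported in the EfficientNet paper for the baseline network.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(
        f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
        f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}"
    )
```

Because the coefficients are chosen so that ALPHA * BETA**2 * GAMMA**2 is close to 2, each increment of phi roughly doubles the compute budget.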
EfficientNet-B0 (the smallest variant) achieves ImageNet accuracy competitive with ResNet-50 using far fewer parameters. EfficientNet-B7 achieved state-of-the-art ImageNet accuracy at the time of publication, also with far fewer parameters than the previous state-of-the-art models. The B0-B7 variants provide a continuum for different compute budgets.
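```python
import timm  # PyTorch Image Models, a library of pretrained vision models

# Load an ImageNet-pretrained EfficientNet-B4 with a new 10-class classification head
model = timm.create_model('efficientnet_b4', pretrained=True, num_classes=10)
```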
Pretrained vs. From Scratch vs. Vision API
Pretrained model (recommended for almost all practical tasks): Use a model pretrained on ImageNet (ResNet, EfficientNet, ViT) and fine-tune it for your task. This is the right choice when your images are natural photos or visually similar to ImageNet content, which covers most tasks (product photos, medical images, satellite imagery, defect detection). With only a few hundred images per class, fine-tuning a pretrained model often produces strong results.
Training from scratch: Only justified when your images are so different from ImageNet (e.g., radar images, microscopy, 3D volumetric scans) that pretrained features do not transfer, AND you have sufficient data (tens of thousands to millions of labeled examples). Even then, consider domain-specific pretrained models (medical imaging, remote sensing) before training from scratch.
Vision API (Google Vision, AWS Rekognition, Azure Computer Vision): The right choice for generic tasks: object detection, face detection, scene classification, OCR, label detection. These APIs provide state-of-the-art performance with no model training. Use them unless your task is too specific for generic models, your volume makes API costs prohibitive, or data privacy prevents sending images to external services.
Decision framework:
- Generic task (object detection, OCR, general classification): Vision API
- Specific task, natural images, limited data (under 10K examples): Pretrained + fine-tune
- Specific task, domain-specific images, large dataset: Pretrained from related domain + fine-tune, or train from scratch
- Production at scale with proven need for custom model: Consider training from scratch with full pipeline
CNNs remain highly relevant despite the rise of Vision Transformers (ViT). For most practical applications -- especially with limited data -- fine-tuning a pretrained CNN (ResNet, EfficientNet) is the most practical and well-supported starting point.