The Paper That Changed Everything
In June 2017, researchers at Google Brain, Google Research, and the University of Toronto published a paper with an audacious title: "Attention Is All You Need." It was not hyperbole. The Transformer architecture introduced in the paper (arXiv:1706.03762) replaced the then-dominant recurrent neural networks (RNNs and LSTMs) with a mechanism called self-attention, and the large language models you interact with today, including GPT-4, Claude, Gemini, and Llama, are built on it.
What Self-Attention Actually Computes
The core operation is scaled dot-product attention. Given a sequence of tokens, you project each into three vectors: a Query (Q), a Key (K), and a Value (V). The attention output is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
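The division by sqrt(d_k) keeps the dot products from growing with the key dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k. Unscaled scores of that size would push the softmax into saturation, where gradients become extremely small. Each token's output is a weighted sum of all Value vectors, where the weights are determined by how well its Query matches each Key. In plain terms: every token attends to every token in the sequence, itself included, in a single matrix operation.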
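The effect of the scaling factor can be checked numerically. The following is a minimal sketch in plain Python (standard library only), assuming query and key components are independent Gaussian draws with zero mean and unit variance. The helper names `softmax` and `dot_variance`, the trial count, and the example scores are illustrative choices, not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k for unit-variance Gaussian q and k (assumed inputs)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials


d_k = 64
raw_var = dot_variance(d_k)   # close to d_k
scaled_var = raw_var / d_k    # dividing scores by sqrt(d_k) divides the variance by d_k

# The same scores with and without scaling: the unscaled softmax is nearly one-hot.
scores = [8.0, 0.0, -8.0, 4.0]
unscaled = softmax(scores)
scaled = softmax([s / math.sqrt(d_k) for s in scores])

print(f"variance of raw dot products: {raw_var:.1f}")
print(f"variance after scaling:       {scaled_var:.2f}")
print(f"largest weight, unscaled:     {max(unscaled):.3f}")
print(f"largest weight, scaled:       {max(scaled):.3f}")
```

With d_k = 64, the unscaled softmax puts almost all of its weight on a single position, while the scaled version keeps the distribution spread out enough for gradients to flow to the other positions.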
Multi-Head Attention
Rather than running one attention function, the paper runs h parallel attention heads (h = 8 in the base model), each with its own learned projections into a smaller subspace of dimension d_model / h. This lets the model attend to information from different representation subspaces at the same time: one head might track syntactic dependencies, another semantic similarity, another coreference. The head outputs are concatenated and projected back to the model dimension.
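The project, attend, concatenate, project sequence can be sketched in plain Python with lists of rows standing in for matrices. This is a toy under stated assumptions, not the paper's implementation: the sizes (4 tokens, d_model = 8, h = 2), the random weights, and every helper name here (`matmul`, `attention`, `multi_head_attention`, `random_matrix`) are illustrative. The paper's base model uses d_model = 512 and h = 8, with learned rather than random projections.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention for a single head (no mask)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(x, heads, w_o):
    """x: seq_len x d_model; heads: list of (W_q, W_k, W_v); w_o: (h * d_v) x d_model."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = []
    for i in range(len(x)):
        row = []
        for head in head_outputs:
            row.extend(head[i])
        concat.append(row)
    # Project back to the model dimension.
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical initialization)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, h = 4, 8, 2
d_head = d_model // h  # d_k = d_v = d_model / h

x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(h)
]
w_o = random_matrix(h * d_head, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))
```

Because each head works in a subspace of size d_model / h, the total cost stays close to that of a single full-width attention function.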
Why Positional Encoding
Self-attention on its own has no built-in sense of word order: permuting the input tokens simply permutes the outputs in the same way (the operation is permutation-equivariant). The Transformer therefore adds positional encodings to the input embeddings, using sine and cosine functions of different frequencies. Later work revisited this choice, and many modern LLMs use alternatives such as rotary positional embeddings (RoPE) or ALiBi.
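The paper defines the encoding as PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). A minimal sketch of that table in plain Python follows; the function name `sinusoidal_positional_encoding` and the sizes used in the example are illustrative assumptions.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sine/cosine positional encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: sine on even, cosine on odd.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)

# The encoding is added element-wise to the token embeddings.
# The embeddings here are zero placeholders, purely for illustration.
embeddings = [[0.0] * 16 for _ in range(50)]
inputs = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]
```

Low-index dimensions oscillate quickly with position and high-index dimensions slowly, so each position receives a distinct pattern across the vector.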
Encoder-Decoder Architecture
The original Transformer was designed for machine translation. The encoder processes the source sequence and produces contextual representations. The decoder generates the target sequence autoregressively, using masked self-attention (each position can attend only to itself and earlier positions) and cross-attention over the encoder output.
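The masking step can be sketched in plain Python. The convention assumed here is that a mask entry of 1 means "may attend" and 0 means "blocked", with blocked scores set to negative infinity before the softmax; the helper names `causal_mask` and `masked_softmax` are illustrative.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax where blocked entries are set to -inf and so get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if m else float("-inf") for s, m in zip(score_row, mask_row)]
        mx = max(masked)  # finite, because the diagonal is never blocked
        exps = [math.exp(s - mx) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform scores, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its weight evenly over the positions it is allowed to see, and every later position receives exactly zero. This is what lets a decoder train on a whole target sequence at once without leaking future tokens.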
Modern LLMs like GPT are decoder-only: they drop the encoder and cross-attention and keep just the autoregressive decoder stack. BERT is encoder-only. T5 kept the full encoder-decoder.
Why It Replaced RNNs
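RNNs process tokens sequentially: step t depends on the hidden state from step t-1, so the computation within a sequence cannot be parallelized across time steps during training. Self-attention computes all positions at once as a few large matrix multiplications, which map well onto GPUs. RNNs also suffer from vanishing gradients over long sequences. Attention provides a direct path between any two positions regardless of distance, which makes long-range dependencies much easier to learn. The trade-off is that compute and memory grow quadratically with sequence length.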
The Lasting Impact
Transformer variants now hold the top results on most major language benchmarks, and on many in vision and speech. The architecture has scaled from the original base model of roughly 65M parameters (the paper's larger configuration had about 213M) to systems with hundreds of billions of parameters, all built on the same fundamental mechanism. Understanding the original paper is not merely academic: it is the prerequisite to understanding most of the optimizations, efficiency tricks, and architecture papers that have followed.
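# Scaled dot-product attention in PyTorch, following the formula given earlier.
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., len_q, d_k), K: (..., len_k, d_k), V: (..., len_k, d_v)
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Positions where mask == 0 get -inf, so softmax gives them weight 0
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    # Each output row is a weighted sum of the value vectors
    return torch.matmul(weights, V)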