GPT (Generative Pretrained Transformer) is the architecture behind the most capable text generation models in production today. Understanding how it works -- not just what it does -- is increasingly important for developers building on top of these systems. This post goes beyond the surface and explains the design decisions that make GPT work.
The Autoregressive Approach
GPT generates text one token at a time. Given a sequence of tokens, the model predicts the probability distribution over the next token. You sample from that distribution, append the sampled token to the sequence, and repeat.
Formally, the model learns:
P(token_n | token_1, token_2, ..., token_{n-1})
This is called autoregressive generation. The model is conditioned on all previous tokens to predict the next one. This makes GPT fundamentally a next-token predictor, and essentially everything that emerges from it -- code generation, reasoning, summarization -- comes from that one training objective applied at scale.
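To make the loop concrete, here is a minimal sketch of autoregressive sampling. The "model" is a hypothetical hand-written bigram table (`TOY_BIGRAMS`), not a trained network, and it only looks at the last token, whereas a real GPT conditions on the whole sequence. The predict, sample, append, repeat structure is the part that carries over.

```python
import random

# Hypothetical toy "model": maps the most recent token to a distribution
# over the next token. A trained GPT would compute this with a neural network.
TOY_BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(tokens):
    """Return P(next token | tokens). This toy ignores everything but the last token."""
    return TOY_BIGRAMS[tokens[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)           # 1. predict
        candidates = list(dist.keys())
        weights = list(dist.values())
        token = rng.choices(candidates, weights=weights, k=1)[0]  # 2. sample
        tokens.append(token)                             # 3. append
        if token == "</s>":                              # stop at end-of-sequence
            break
    return tokens                                        # 4. otherwise repeat

print(generate(["<s>"]))
```

Changing `seed` gives a different continuation from the same prompt, which is why the same prompt can yield different completions.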
Decoder-Only Architecture: Why It Works for Generation
The original "Attention Is All You Need" transformer (Vaswani et al., 2017) had both an encoder and a decoder. The encoder reads the input sequence. The decoder generates the output sequence, attending to both the encoder output and its own previously generated tokens.
GPT strips out the encoder and uses only the decoder stack. This simplification works for generation because:
- A decoder with causal (left-to-right) self-attention can model text generation directly -- each token attends to all tokens before it, and the model generates the next token.
- The encoder is only needed when you have a separate input sequence to "read" (as in translation, where you read a source sentence and write a target sentence). For open-ended generation, there is no separate input -- the prompt is simply prepended to the generation.
The causal attention mask is the key mechanism: the token at position i can attend to positions 0 through i, but not to any position after i. This prevents the model from "cheating" during training by looking ahead at the tokens it is supposed to predict.
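A small sketch of the mask, written in plain Python for readability (real implementations use batched tensor operations). The function names are illustrative, and the all-zero `scores` matrix is made up so the effect of the mask is easy to see.

```python
import math

def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_attention_weights(scores):
    """Mask raw attention scores (an n x n list of lists), then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Future positions get -inf, so exp() sends their weight to exactly 0.
        row = [scores[i][j] if mask[i][j] else -math.inf for j in range(n)]
        row_max = max(row)  # finite: the diagonal entry is never masked
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # made-up uniform scores
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its attention evenly over positions 0 through i and puts zero weight on everything after it.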
Pretraining Data: What GPT Was Trained On
The quality and composition of pretraining data have a larger impact on model behavior than almost any other factor. Here is what the major GPT models were trained on:
GPT-2 (2019): WebText -- a dataset of web pages linked from Reddit posts with at least 3 karma points. About 40GB of text, 8 million documents. The "karma filter" was an attempt to get higher-quality text than a random web crawl.
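GPT-3 (2020): A weighted mix of CommonCrawl (filtered, 410B tokens), WebText2 (19B tokens), Books1 and Books2 (67B tokens combined), and English Wikipedia (3B tokens). Together these corpora hold roughly 500B tokens after quality filtering and deduplication. The model itself was trained on about 300B tokens sampled from this mix, with the higher-quality sources sampled more often than their size alone would suggest.
GPT-4: OpenAI's GPT-4 technical report does not disclose pretraining data details. It states only that the model was pretrained on publicly available data (such as internet data) and data licensed from third-party providers, and then fine-tuned with RLHF.
The shift from GPT-2 to GPT-3 in terms of data scale (about 40GB of text versus roughly 570GB of filtered CommonCrawl alone, plus the other corpora) is a major part of why GPT-3 exhibited capabilities that GPT-2 did not.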
Emergent Capabilities: Abilities That Appear With Scale
One of the most surprising findings in LLM research is that certain capabilities do not improve gradually with scale -- they appear suddenly once the model reaches a certain size threshold.
In-context learning is the canonical example. GPT-2 (1.5B parameters) cannot reliably follow examples in the prompt to perform a new task. GPT-3 (175B parameters) can look at a few examples of a new task in the prompt and perform it without any weight updates. This ability was not explicitly trained -- it emerged from the combination of scale and the diversity of the pretraining data.
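In practice, in-context learning just means putting worked examples in the prompt. The sketch below builds such a few-shot prompt; the `Input:` / `Output:` layout and the helper name `build_few_shot_prompt` are illustrative choices, not a format any particular model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) pairs followed by an unanswered query."""
    lines = []
    if instruction:
        lines.append(instruction)
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final "Output:" is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "cat", "Translate English to French.")
print(prompt)
```

The model's weights never change here. The task is specified entirely by the text the model is conditioned on.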
Other emergent capabilities observed at scale: multi-step arithmetic, chain-of-thought reasoning (especially with prompting), code generation from natural language descriptions, and translation between languages not heavily represented in training data.
The exact mechanism behind emergence is still debated. One hypothesis: larger models develop more efficient internal representations that can be repurposed for novel tasks. Another: emergent capabilities are artifacts of how we measure them (some metrics have sharp thresholds).
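The measurement hypothesis is easy to illustrate with arithmetic. Suppose a task is scored by exact match on a 10-token answer, and suppose (as a simplifying assumption that real models do not guarantee) each token is correct independently with the same probability. Then smooth gains in per-token accuracy turn into what looks like a sudden jump in task accuracy:

```python
def exact_match_rate(per_token_accuracy, answer_length):
    # Assumes token errors are independent -- a simplification for illustration.
    return per_token_accuracy ** answer_length

for p in (0.5, 0.7, 0.9, 0.99):
    rate = exact_match_rate(p, 10)
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens {rate:.3f}")
```

Per-token accuracy rises steadily across those four values, while exact match stays near zero for the first two and then climbs steeply. A metric that gives partial credit would show a much smoother curve for the same underlying model.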
Parameter Counts Across GPT Versions
Understanding scale puts model comparisons in context:
- GPT-2 (2019): 1.5B parameters (largest variant)
- GPT-3 (2020): 175B parameters
- GPT-3.5 / ChatGPT (2022): architecture similar to GPT-3, fine-tuned with RLHF
- GPT-4 (2023): Not officially disclosed. Unverified third-party estimates put it at approximately 1.7T parameters in a Mixture of Experts (MoE) architecture with 8 expert networks of ~220B parameters each, 2 of them active per token.
- GPT-4o (2024): Native multimodal model; parameter count undisclosed.
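The GPT-2 and GPT-3 figures above can be roughly reproduced with a common back-of-the-envelope rule: about 12 x d_model^2 weights per transformer layer, plus the embedding tables. In the sketch below, the hidden sizes and context lengths are taken from the GPT-2 and GPT-3 papers rather than from this post, and biases and layer norms are ignored, so treat the results as approximations.

```python
def approx_params(n_layers, d_model, vocab_size, context_length):
    # Per layer: ~4*d^2 for the attention projections (Q, K, V, output)
    # plus ~8*d^2 for a feed-forward block with hidden size 4*d.
    per_layer = 12 * d_model ** 2
    # Token embeddings plus learned positional embeddings.
    embeddings = (vocab_size + context_length) * d_model
    return n_layers * per_layer + embeddings

gpt2_largest = approx_params(n_layers=48, d_model=1600, vocab_size=50257, context_length=1024)
gpt3 = approx_params(n_layers=96, d_model=12288, vocab_size=50257, context_length=2048)

print(f"GPT-2 (largest): ~{gpt2_largest / 1e9:.2f}B parameters")
print(f"GPT-3:           ~{gpt3 / 1e9:.1f}B parameters")
```

Both estimates land close to the published 1.5B and 175B. Parameter count grows with the square of the hidden size but only linearly with depth.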
The MoE architecture (if the GPT-4 estimates are accurate) means the model has 1.7T total parameters but only activates a subset for each token -- keeping inference cost manageable despite the large total parameter count.
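Here is a minimal sketch of top-2 expert routing for a single token. Everything in it is illustrative: the eight "experts" are trivial scaling functions, the router logits are made up, and renormalizing the gate weights over the selected experts is one common design, not a confirmed detail of GPT-4.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(x, experts, router_logits, k=2):
    # Only the selected experts run; the rest cost nothing for this token.
    return sum(gate * experts[i](x) for i, gate in route_top_k(router_logits, k))

# Eight toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # made-up router scores

print(route_top_k(router_logits))
print(moe_forward(1.0, experts, router_logits))
```

In a real MoE layer the router is a small learned network that produces different logits for every token, so different tokens exercise different experts while the per-token compute stays roughly constant.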
Inference: How Text Is Actually Generated
When you call the OpenAI API (or run an open-source model), here is what happens for each generated token:
- Your prompt is tokenized into token IDs using the model's tokenizer (GPT models use Byte-Pair Encoding).
- The token IDs are passed through the embedding layer to get dense vectors.
- These vectors pass through N transformer decoder layers (GPT-3 has 96 layers).
- The final layer produces a logit vector over the vocabulary (50,257 tokens for GPT-3).
- Logits are divided by the temperature parameter and converted to probabilities via softmax.
- A token is sampled from this distribution.
- The sampled token is appended to the sequence and the process repeats.
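The last three steps (temperature scaling, softmax, sampling) fit in a few lines. This is a sketch over a made-up three-token vocabulary; a real model produces one logit per vocabulary entry, and the function names here are illustrative.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    # Draw one token ID in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up logits over a 3-token vocabulary
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
token_id = sample_token(logits, temperature=1.0, rng=random.Random(0))

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(token_id)
```

Dividing by a temperature below 1 widens the gaps between logits before the softmax, which concentrates probability on the top token. A temperature above 1 shrinks the gaps and spreads probability out.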
Sampling strategies:
- Greedy decoding: Always pick the highest-probability token. Fast and deterministic, but tends to produce repetitive text.
- Temperature sampling: Higher temperature = flatter distribution = more random. Lower temperature = sharper distribution = more deterministic.
- Top-k sampling: Sample only from the k most likely tokens at each step.
- Top-p (nucleus) sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p. Adapts to the shape of the distribution at each step.
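Defaults vary between systems. The OpenAI API, for example, defaults to temperature 1.0 and top_p 1.0 (no nucleus truncation), and applications commonly lower one or the other to make output more focused.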
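The strategies above differ only in how they reshape the distribution before a token is drawn. A minimal sketch over a made-up four-token distribution follows; the function names are illustrative, and `top_p_filter` stops as soon as the cumulative probability reaches or exceeds p.

```python
def _renormalize(probs, keep):
    # Zero out everything outside `keep` and rescale the rest to sum to 1.
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def greedy(probs):
    # Index of the single most likely token.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest set whose mass reaches p
            break
    return _renormalize(probs, keep)

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print(greedy(probs))
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

Top-k always keeps the same number of candidates. Top-p keeps few when the model is confident and many when the distribution is flat, which is the adaptivity described above.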
What GPT-4o Adds Over GPT-4
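GPT-4o ("o" for omni) processes images, audio, and text natively in a single model rather than routing inputs through separate modality-specific models. Before GPT-4o, for example, ChatGPT's Voice Mode chained three models: one transcribed audio to text, GPT-3.5 or GPT-4 produced a text reply, and a third converted that reply back to speech. GPT-4V's image understanding is commonly assumed to have relied on a separate vision encoder whose outputs were fed into the language model, though OpenAI has not published its architecture. GPT-4o integrates all modalities into a single end-to-end trained system.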
The practical benefits: lower latency for multimodal tasks, better integration between modalities (the model can reason about audio and images in the same context window), and speech-to-speech conversation without the round-trip through a separate ASR system.