GPT Architecture Explained: Beyond the Surface Level

GPT's autoregressive, decoder-only design enables text generation at scale. Here is how it actually works -- from pretraining data to emergent capabilities to GPT-4o.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#gpt #transformers #llm #architecture #generative-ai


Pretraining Data: What GPT Was Trained On

The quality and composition of pretraining data have a larger impact on model behavior than almost any other factor. Here is what the major GPT models were trained on:

GPT-2 (2019): WebText -- a dataset of web pages linked from Reddit posts with at least 3 karma points. About 40GB of text, 8 million documents. The "karma filter" was an attempt to get higher-quality text than a random web crawl.
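The karma filter can be sketched in a few lines. This is only an illustration of the idea, not OpenAI's pipeline: the `RedditLink` record, the `filter_by_karma` helper, and the sample URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RedditLink:
    """Hypothetical record: an outbound URL and the karma of the post linking to it."""
    url: str
    karma: int


def filter_by_karma(links, min_karma=3):
    """Keep URLs from posts with at least `min_karma` karma, dropping duplicate URLs."""
    seen = set()
    kept = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
    return kept


sample = [
    RedditLink("https://example.com/a", 5),
    RedditLink("https://example.com/b", 1),   # below threshold, dropped
    RedditLink("https://example.com/a", 12),  # duplicate URL, dropped
    RedditLink("https://example.com/c", 3),
]
print(filter_by_karma(sample))
```

The filter uses human votes as a cheap proxy for quality, so no trained quality classifier is needed at this stage.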

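GPT-3 (2020): A weighted mix of CommonCrawl (filtered, 410B tokens), WebText2 (19B tokens), Books1 and Books2 (67B tokens combined), and English Wikipedia (3B tokens). Together these sources hold roughly 500B tokens; the model was trained on approximately 300B tokens sampled from the mix, with higher-quality sources sampled more often than their raw size would suggest.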
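As a quick arithmetic check on the figures above, this sketch sums the listed token counts and prints each source's raw share of the corpus. The raw share is not the sampling weight used during training, which was set separately. The dictionary and the `token_shares` helper are illustrative names.

```python
# Token counts in billions, as listed above.
gpt3_sources_b_tokens = {
    "CommonCrawl (filtered)": 410,
    "WebText2": 19,
    "Books1 + Books2": 67,
    "English Wikipedia": 3,
}


def token_shares(sources):
    """Return each source's fraction of the combined token count."""
    total = sum(sources.values())
    return {name: count / total for name, count in sources.items()}


total_b_tokens = sum(gpt3_sources_b_tokens.values())
print(f"Combined corpus: {total_b_tokens}B tokens")
for name, share in token_shares(gpt3_sources_b_tokens).items():
    print(f"{name}: {share:.1%}")
```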

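GPT-4 (2023): OpenAI's technical report deliberately withholds details of dataset construction. It states only that the model was pretrained on publicly available data and data licensed from third-party providers, then fine-tuned with RLHF. The mix is generally assumed to include internet text, books, and code.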

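The jump in data scale from GPT-2 to GPT-3 (about 40GB of text vs. roughly 570GB of filtered CommonCrawl plus the curated corpora listed above), alongside the jump in parameter count, is a major part of why GPT-3 exhibited capabilities that GPT-2 did not.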

Emergent Capabilities: Abilities That Appear With Scale

One of the most surprising findings in LLM research is that certain capabilities do not improve gradually with scale -- they appear suddenly once the model reaches a certain size threshold.

In-context learning is the canonical example. GPT-2 (1.5B parameters) cannot reliably follow examples in the prompt to perform a new task. GPT-3 (175B parameters) can look at a few examples of a new task in the prompt and perform it without any weight updates. This ability was not explicitly trained -- it emerged from the combination of scale and the diversity of the pretraining data.
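In practice, in-context learning just means putting worked examples in the prompt. The sketch below shows one way to assemble such a few-shot prompt. The `build_few_shot_prompt` helper and the "Input:/Output:" layout are assumptions for illustration; no particular format is required.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and an unanswered query into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final example is left open so the model completes the pattern.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

No weights change here: the task is specified entirely by the text the model conditions on.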

Other emergent capabilities observed at scale: multi-step arithmetic, chain-of-thought reasoning (especially with prompting), code generation from natural language descriptions, and translation between languages not heavily represented in training data.

The exact mechanism behind emergence is still debated. One hypothesis: larger models develop more efficient internal representations that can be repurposed for novel tasks. Another: emergent capabilities are artifacts of how we measure them (some metrics have sharp thresholds).
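The measurement-artifact hypothesis can be illustrated with a toy calculation. Suppose, as a simplifying assumption, that a task needs every token of a 10-token answer to be correct and that token errors are independent. A smooth gain in per-token accuracy then looks like a sudden jump in exact-match score. The `exact_match_rate` function is a hypothetical helper.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that all tokens are correct, assuming independent token errors."""
    return per_token_accuracy ** answer_length


# Per-token accuracy improves steadily, but exact match stays near zero
# until accuracy is already high, then rises steeply.
for accuracy in [0.5, 0.7, 0.9, 0.95, 0.99]:
    score = exact_match_rate(accuracy, answer_length=10)
    print(f"per-token {accuracy:.2f} -> exact match {score:.3f}")
```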

Parameter Counts Across GPT Versions

Understanding scale puts model comparisons in context:

GPT-2 (2019): 1.5B parameters (largest variant)
GPT-3 (2020): 175B parameters
GPT-3.5 / ChatGPT (2022): architecture similar to GPT-3, fine-tuned with RLHF
GPT-4 (2023): Not officially disclosed. Unconfirmed reports put it at approximately 1.7T parameters in a Mixture of Experts (MoE) architecture with 8 expert networks of ~220B parameters each, 2 of them active per token.
GPT-4o (2024): Native multimodal model; parameter count undisclosed.
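The published counts can be roughly reproduced with a common rule of thumb: a decoder-only transformer has about 12 x layers x d_model^2 parameters in its blocks, plus the token embedding matrix. The sketch below applies it under stated assumptions. The layer counts and hidden sizes (48 and 1600 for the largest GPT-2, 96 and 12288 for GPT-3) come from the published papers, not from this article, and the estimate ignores biases, layer norms, and position embeddings.

```python
def approx_decoder_params(n_layers, d_model, vocab_size=50257):
    """Rule-of-thumb parameter count for a decoder-only transformer.

    Each block holds about 4*d^2 attention weights and 8*d^2 MLP weights
    (with a 4x hidden expansion), so 12*d^2 per layer.
    """
    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab_size * d_model
    return block_params + embedding_params


gpt2_xl = approx_decoder_params(n_layers=48, d_model=1600)
gpt3 = approx_decoder_params(n_layers=96, d_model=12288)
print(f"GPT-2 (largest): ~{gpt2_xl / 1e9:.2f}B parameters")
print(f"GPT-3:           ~{gpt3 / 1e9:.1f}B parameters")
```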

The MoE architecture (if the GPT-4 estimates are accurate) means the model has 1.7T total parameters but only activates a subset for each token -- keeping inference cost manageable despite the large total parameter count.
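To make the "subset of parameters" point concrete, here is a minimal sketch of top-2 expert routing together with the arithmetic for the rumored figures above. Treat the numbers as unverified. The `top_k_route` helper and the router logits are hypothetical, and any parameters shared across experts (attention, embeddings) are ignored.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


# Rumored configuration: 8 experts of ~220B parameters, 2 active per token.
NUM_EXPERTS = 8
PARAMS_PER_EXPERT = 220 * 10**9
ACTIVE_EXPERTS = 2

total_expert_params = NUM_EXPERTS * PARAMS_PER_EXPERT
active_expert_params = ACTIVE_EXPERTS * PARAMS_PER_EXPERT
print(f"Total expert parameters:  {total_expert_params / 1e12:.2f}T")
print(f"Active per token:         {active_expert_params / 1e9:.0f}B")

# One token's (made-up) router scores over the 8 experts.
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2])
print(routes)
```

Under these assumptions only a quarter of the expert parameters take part in any single token's forward pass.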

Inference: How Text Is Actually Generated

When you call the OpenAI API (or run an open-source model), here is what happens token by token:

Your prompt is tokenized into token IDs using the model's tokenizer (GPT models use Byte-Pair Encoding).
The token IDs are passed through the embedding layer to get dense vectors.
These vectors pass through N transformer decoder layers (GPT-3 has 96 layers).
The final layer produces a logit vector over the vocabulary (50,257 tokens for GPT-3).
Logits are divided by the temperature parameter and converted to probabilities via softmax.
A token is sampled from this distribution.
The sampled token is appended to the sequence and the process repeats.
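The loop described in the steps above can be sketched with a toy stand-in for the model. Everything here is illustrative: `toy_logits` replaces the embedding layer and transformer stack with a hard-coded rule, the six-word `VOCAB` replaces a real BPE vocabulary, and tokenization is skipped by starting from token IDs.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]


def toy_logits(token_ids):
    """Stand-in for the transformer stack: favors the ID after the last token."""
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [3.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10, temperature=1.0, seed=0):
    """Autoregressive loop: score, sample one token, append, repeat."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax_with_temperature(toy_logits(ids), temperature)
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        ids.append(next_id)
        if next_id == 0:  # stop at the end-of-sequence token
            break
    return ids


ids = generate([1], max_new_tokens=8, temperature=0.7)
print(" ".join(VOCAB[i] for i in ids))
```

A real implementation would also cache attention keys and values, so that each step only processes the newly appended token.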

Sampling strategies:

Greedy decoding: Always pick the highest-probability token. Fast and deterministic, but it tends to produce repetitive text.
Temperature sampling: Higher temperature = flatter distribution = more random. Lower temperature = sharper distribution = more deterministic.
Top-k sampling: Sample only from the k most likely tokens at each step.
Top-p (nucleus) sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p. Adapts to the shape of the distribution at each step.
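The truncation strategies above can be sketched as filters over a probability list. This is a minimal pure-Python illustration rather than any library's implementation: the function names are assumptions, and the cutoff is taken as the point where cumulative probability reaches or exceeds p.

```python
def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


flat = [0.5, 0.3, 0.15, 0.05]
peaked = [0.97, 0.01, 0.01, 0.01]
print(top_k_filter(flat, k=2))
print(top_p_filter(flat, p=0.9))    # keeps three tokens
print(top_p_filter(peaked, p=0.9))  # keeps one token
```

With the same p, the nucleus keeps three tokens for the flat distribution and one for the peaked distribution, whereas top-k always keeps exactly k.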

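The OpenAI API defaults to temperature 1.0 and top_p 1.0, which amounts to sampling from the unmodified distribution. In practice, many production systems lower the temperature or set top_p below 1 (usually one or the other, not both) to trade diversity for consistency.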

What GPT-4o Adds Over GPT-4

GPT-4o ("o" for omni) processes text, images, and audio natively in a single model rather than routing inputs through separate modality-specific models. OpenAI has not published GPT-4V's architecture, but image understanding there is generally assumed to have relied on a vision component whose outputs were fed into the language model, and voice interaction before GPT-4o ran as a pipeline of separate speech-to-text, language, and text-to-speech models. GPT-4o is trained end-to-end across all three modalities.

The practical benefits: lower latency for multimodal tasks, better integration between modalities (the model can reason about audio and images in the same context window), and speech-to-speech conversation without the round-trip through a separate ASR system.

Keep Reading

How Large Language Models Work: A Complete Guide -- the full picture of LLM mechanics, training, and inference
BERT Explained for Developers -- the encoder-only alternative to GPT, and when to use it
ML Research Papers Every Practitioner Should Know -- primary sources: GPT-3, InstructGPT, and more

