How Large Language Models Work: A Complete Guide Without the Math Overload

A plain-English guide to how LLMs actually work: tokens, attention, training vs inference, why they hallucinate, and what context windows mean for your workflow.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 16, 2026

14 min read

// tags

#llm#transformers#ai-basics#gpt#neural-networks

FIG. ART-45

14 min read

“

How Large Language Models Work: A Complete Guide Without the Math Overload

// reading plan

sections

1,884

words

min read

// LLM & Language Models

What Is GPT-5.6 Sol Ultra Will Be in Codex? A Practical Overview

GPT-5.6 Sol Ultra is a rumored model optimized for code generation, integrated into Codex. We analyze the claims, potential capabilities, and what developers should expect.

5 min read

// AI Agents

Building reliable agentic AI systems: A Practical Overview

Training vs. Fine-Tuning vs. Prompting

These three terms are often confused. They represent fundamentally different kinds of work at different cost scales.

Training from scratch means taking a transformer architecture and feeding it hundreds of billions of tokens of text over weeks or months on thousands of GPUs. GPT-4 is estimated to have cost over $100 million to train (Sam Altman, interview with StrictlyVC, 2023). This is what OpenAI, Anthropic, Google, and a handful of other organizations do. Almost no one else should.

Fine-tuning means taking a pretrained model and continuing to train it on a smaller, specific dataset to shift its behavior in a particular direction. A company might fine-tune a base model on their internal documentation to make it better at answering company-specific questions, or fine-tune on examples of their preferred writing style. Fine-tuning costs hundreds to thousands of dollars depending on dataset size and model size. It changes the model's weights permanently for the duration of that fine-tuned version.

Prompting means writing instructions in the input that tell the model how to behave. No weight changes, no training costs. This is what you do every time you write a system prompt or a few-shot example. It is the lowest-cost, highest-agility approach, and for most use cases it is sufficient. The limits of prompting: it cannot teach the model facts it was not trained on, it cannot overcome fundamental weaknesses in the base model, and the instructions live in the context window and can be overridden or forgotten across long conversations.

The practical decision framework: try prompting first. If prompting cannot solve the problem after real effort, consider fine-tuning. Training from scratch is for foundation model labs, not application developers.

Why LLMs Hallucinate

Hallucination is the single most important limitation to understand about language models. It is not a bug that will be patched. It is a structural consequence of how these models work.

An LLM generates text by predicting, at each step, which token is most likely to come next given everything that came before. It does not look anything up. It does not check facts. It does not know that it does not know something. When asked a question whose correct answer it never saw in training, it does not say "I do not have this information." It generates a plausible-sounding answer, because generating plausible-sounding text is what it was trained to do.

The three main hallucination patterns are:

Confabulation: The model invents facts. It cites papers that do not exist, quotes people saying things they never said, or states incorrect statistics confidently. The confidence is not a sign of accuracy; it is a byproduct of the generation process.

Sycophancy: The model agrees with you even when you are wrong, especially if you state something confidently in your prompt. If you say "I read that the Battle of Hastings was in 1067" and ask for confirmation, many models will find a way to partially validate your incorrect date rather than correct it directly.

Outdated knowledge: Models have a training cutoff. They do not know about events after that date. When asked about recent events, they either admit the limitation (better behavior) or hallucinate plausible-sounding but incorrect information based on pre-cutoff patterns (worse behavior).

The practical response to hallucination is not to stop using LLMs. It is to design workflows that do not depend on the model's factual reliability. Use LLMs for tasks where the human is reviewing the output. Use retrieval-augmented generation (RAG) to give the model access to ground-truth documents. Set temperature to 0 for tasks where consistency matters more than creativity. And verify any factual claim that has real consequences.

Context Windows: What They Are and Why They Matter

The context window is the total amount of text an LLM can see at once. Everything you send to the model, plus everything the model generates in response, consumes context window. Once the limit is hit, the model either stops or begins forgetting the earliest parts of the conversation.

Current limits as of May 2026:

GPT-4o: 128,000 tokens (roughly 90,000 to 100,000 words)
Claude 3.5 Sonnet: 200,000 tokens (roughly 140,000 to 150,000 words)
Gemini 1.5 Pro: 1,000,000 tokens (roughly 700,000 words)
Deepseek V3: 64,000 tokens (roughly 45,000 words)

What does 128,000 tokens actually hold? A full-length novel is approximately 80,000 to 100,000 words. All of the Harry Potter series combined is roughly 1 million words. A 128k context window can process an entire novel in a single request. Gemini's 1M window can process the entire Harry Potter series.

Why does this matter in practice? Long documents, long codebases, long conversation histories. If you are asking an LLM to analyze a 50-page contract, summarize a 200-page technical spec, or maintain a coherent conversation over a multi-hour work session, context window size is the limiting constraint.

When context limits hurt you: the model starts "forgetting" instructions given early in a long conversation. The system prompt you wrote at the top gets deprioritized as the context fills. Responses late in a long conversation often lose coherence with the original framing.

Strategies for working within context limits: summarize and compress at regular intervals, use retrieval-augmented generation to pull in only the relevant sections of large documents rather than feeding the whole thing, and reset conversations and re-inject the system prompt for long sessions.

How GPT-4o, Claude 3.5, and Gemini Differ

All three are transformer-based language models. The architectural differences between them are not fully public, but some meaningful distinctions are visible from their behavior.

GPT-4o is a multimodal model that processes text, images, and audio natively. Its strong points include code generation, structured output reliability, and instruction following. It has a larger training data corpus than earlier GPT versions and a faster inference speed than GPT-4 Turbo. Its context window (128k) is smaller than Claude's or Gemini's.

Claude 3.5 Sonnet (Anthropic) was trained with a specific focus on safety and honest uncertainty expression. It tends to be better than GPT-4o at admitting when it does not know something, which reduces some categories of hallucination. Its 200k context window is larger. Anthropic's constitutional AI training approach gives it a notably different "feel" in conversation, particularly around sensitive topics.

Gemini 1.5 Pro (Google) has the largest context window of the mainstream models. Its multimodal capabilities include video and audio understanding alongside text and images. Its 1M context window makes it the best choice for tasks requiring very long document analysis. Performance on standard benchmarks is comparable to the others at its tier.

Deepseek V3 (Deepseek AI) is notable for strong performance at a fraction of the cost of the Western frontier models. It performs competitively on coding and reasoning tasks while being accessible at much lower price points. For cost-sensitive applications, it is worth serious evaluation.

What to Read Next

If this overview raised more questions than it answered, that is the right reaction. The follow-on posts go deeper into the specific mechanics and practical applications.

Keep Reading

GPT-4o vs Claude 3.5 Sonnet vs Gemini Pro vs Deepseek V3: Honest Comparison 2026 - Benchmark scores, pricing tables, and honest assessments of what each model is best at
Prompt Engineering Complete Guide 2026 - Every technique that actually moves the needle on LLM output quality
Why LLMs Hallucinate and How to Reduce It: A Practical Guide - A deeper look at the three hallucination types and concrete mitigation techniques

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace - chat, projects, time tracking, AI meeting summaries, and invoicing - in one tool. Try it free.

How Large Language Models Work: A Complete Guide Without the Math Overload

Related Articles

What Is GPT-5.6 Sol Ultra Will Be in Codex? A Practical Overview

What a Token Is and Why It Matters

The Transformer Architecture (Without the Math)

Training vs. Fine-Tuning vs. Prompting

Why LLMs Hallucinate

Context Windows: What They Are and Why They Matter

How GPT-4o, Claude 3.5, and Gemini Differ

What to Read Next

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Building reliable agentic AI systems: A Practical Overview

What Is OpenAI Frontier Models and Codex on AWS? A Practical Overview

How Large Language Models Work: A Complete Guide Without the Math Overload

Related Articles

What Is GPT-5.6 Sol Ultra Will Be in Codex? A Practical Overview

What a Token Is and Why It Matters

The Transformer Architecture (Without the Math)

Training vs. Fine-Tuning vs. Prompting

Why LLMs Hallucinate

Context Windows: What They Are and Why They Matter

How GPT-4o, Claude 3.5, and Gemini Differ

What to Read Next

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Building reliable agentic AI systems: A Practical Overview

What Is OpenAI Frontier Models and Codex on AWS? A Practical Overview

The workspace your team
actually needs