A large language model is a neural network trained on enormous amounts of text to predict what comes next in a sequence. That is it. The rest - the coherent answers, the code generation, the apparent reasoning - emerges from doing that one thing at an enormous scale with a specific architecture called a transformer. You do not need calculus to understand how these systems work, and understanding them makes you significantly better at using them.
This guide covers everything a working professional needs to know: what tokens are, how the transformer architecture creates the illusion of understanding, why hallucination is a feature not a bug of the training process, what context windows actually limit, and when to use training versus fine-tuning versus prompting.
What a Token Is and Why It Matters
Before anything else, you need to understand tokens. A token is not a word. It is a chunk of text, usually between one and four characters, that the model treats as a single unit.
"ChatGPT" is one token. "artificial intelligence" is three tokens ("art", "ificial", " intelligence" - roughly). "Hello, world!" is four tokens. Spaces are sometimes attached to the following word, sometimes separate.
Why does this matter? Three reasons.
First, pricing. Every major LLM API charges per token, not per character or word. GPT-4o costs $5 per million input tokens and $15 per million output tokens (OpenAI pricing, May 2026). If you are building a product on top of an LLM, your costs scale with token count, and understanding what inflates token count (verbose prompts, unnecessary examples, repetitive instructions) gives you direct control over your costs.
Second, context limits. Every model has a maximum number of tokens it can process in a single request. Understanding that a 128,000-token context window holds roughly 90,000 to 100,000 words helps you plan what you can and cannot fit.
Third, quality. Models process tokens, not meaning. A token boundary can fall in the middle of a word, and the model sees the fragments. This is why LLMs sometimes handle unusual compound words or rare technical terms poorly: the tokenizer has split them in ways that obscure the meaning the model would otherwise have learned.
The Transformer Architecture (Without the Math)
The transformer, introduced by Vaswani et al. in "Attention Is All You Need" (2017), solved a problem that older architectures struggled with: how do you understand the relationship between words that are far apart in a sentence?
Older recurrent neural networks processed text sequentially, word by word. By the time they reached the end of a long sentence, the beginning had largely faded from their internal representation. Transformers process the entire input simultaneously, using a mechanism called self-attention that asks, for every token: which other tokens in this input are relevant to understanding this one, and how relevant are they?
A practical analogy: imagine reading a contract. When you encounter the word "termination" in clause 12, you do not just look at the words around it. You look back at clause 3 (which defined the term), and forward to clause 15 (which specifies the notice period). That long-range contextual lookup is what attention does, computationally, for every token simultaneously.
The model uses multiple attention heads in parallel, each learning to attend to different kinds of relationships. One head might learn to track subject-verb agreement. Another might track pronoun references. Another might track topic coherence. No one programs these patterns: they emerge from training.
The output of the attention layers feeds into feed-forward networks that transform the attended representations. Add positional encodings (to tell the model the order of tokens, since attention itself is order-agnostic), stack these layers 96 times for a large model, and you have the basic architecture of GPT-4.
Training vs. Fine-Tuning vs. Prompting
These three terms are often confused. They represent fundamentally different kinds of work at different cost scales.
Training from scratch means taking a transformer architecture and feeding it hundreds of billions of tokens of text over weeks or months on thousands of GPUs. GPT-4 is estimated to have cost over $100 million to train (Sam Altman, interview with StrictlyVC, 2023). This is what OpenAI, Anthropic, Google, and a handful of other organizations do. Almost no one else should.
Fine-tuning means taking a pretrained model and continuing to train it on a smaller, specific dataset to shift its behavior in a particular direction. A company might fine-tune a base model on their internal documentation to make it better at answering company-specific questions, or fine-tune on examples of their preferred writing style. Fine-tuning costs hundreds to thousands of dollars depending on dataset size and model size. It changes the model's weights permanently for the duration of that fine-tuned version.
Prompting means writing instructions in the input that tell the model how to behave. No weight changes, no training costs. This is what you do every time you write a system prompt or a few-shot example. It is the lowest-cost, highest-agility approach, and for most use cases it is sufficient. The limits of prompting: it cannot teach the model facts it was not trained on, it cannot overcome fundamental weaknesses in the base model, and the instructions live in the context window and can be overridden or forgotten across long conversations.
The practical decision framework: try prompting first. If prompting cannot solve the problem after real effort, consider fine-tuning. Training from scratch is for foundation model labs, not application developers.
Why LLMs Hallucinate
Hallucination is the single most important limitation to understand about language models. It is not a bug that will be patched. It is a structural consequence of how these models work.
An LLM generates text by predicting, at each step, which token is most likely to come next given everything that came before. It does not look anything up. It does not check facts. It does not know that it does not know something. When asked a question whose correct answer it never saw in training, it does not say "I do not have this information." It generates a plausible-sounding answer, because generating plausible-sounding text is what it was trained to do.
The three main hallucination patterns are:
Confabulation: The model invents facts. It cites papers that do not exist, quotes people saying things they never said, or states incorrect statistics confidently. The confidence is not a sign of accuracy; it is a byproduct of the generation process.
Sycophancy: The model agrees with you even when you are wrong, especially if you state something confidently in your prompt. If you say "I read that the Battle of Hastings was in 1067" and ask for confirmation, many models will find a way to partially validate your incorrect date rather than correct it directly.
Outdated knowledge: Models have a training cutoff. They do not know about events after that date. When asked about recent events, they either admit the limitation (better behavior) or hallucinate plausible-sounding but incorrect information based on pre-cutoff patterns (worse behavior).
The practical response to hallucination is not to stop using LLMs. It is to design workflows that do not depend on the model's factual reliability. Use LLMs for tasks where the human is reviewing the output. Use retrieval-augmented generation (RAG) to give the model access to ground-truth documents. Set temperature to 0 for tasks where consistency matters more than creativity. And verify any factual claim that has real consequences.
Context Windows: What They Are and Why They Matter
The context window is the total amount of text an LLM can see at once. Everything you send to the model, plus everything the model generates in response, consumes context window. Once the limit is hit, the model either stops or begins forgetting the earliest parts of the conversation.
Current limits as of May 2026:
- GPT-4o: 128,000 tokens (roughly 90,000 to 100,000 words)
- Claude 3.5 Sonnet: 200,000 tokens (roughly 140,000 to 150,000 words)
- Gemini 1.5 Pro: 1,000,000 tokens (roughly 700,000 words)
- Deepseek V3: 64,000 tokens (roughly 45,000 words)
What does 128,000 tokens actually hold? A full-length novel is approximately 80,000 to 100,000 words. All of the Harry Potter series combined is roughly 1 million words. A 128k context window can process an entire novel in a single request. Gemini's 1M window can process the entire Harry Potter series.
Why does this matter in practice? Long documents, long codebases, long conversation histories. If you are asking an LLM to analyze a 50-page contract, summarize a 200-page technical spec, or maintain a coherent conversation over a multi-hour work session, context window size is the limiting constraint.
When context limits hurt you: the model starts "forgetting" instructions given early in a long conversation. The system prompt you wrote at the top gets deprioritized as the context fills. Responses late in a long conversation often lose coherence with the original framing.
Strategies for working within context limits: summarize and compress at regular intervals, use retrieval-augmented generation to pull in only the relevant sections of large documents rather than feeding the whole thing, and reset conversations and re-inject the system prompt for long sessions.
How GPT-4o, Claude 3.5, and Gemini Differ
All three are transformer-based language models. The architectural differences between them are not fully public, but some meaningful distinctions are visible from their behavior.
GPT-4o is a multimodal model that processes text, images, and audio natively. Its strong points include code generation, structured output reliability, and instruction following. It has a larger training data corpus than earlier GPT versions and a faster inference speed than GPT-4 Turbo. Its context window (128k) is smaller than Claude's or Gemini's.
Claude 3.5 Sonnet (Anthropic) was trained with a specific focus on safety and honest uncertainty expression. It tends to be better than GPT-4o at admitting when it does not know something, which reduces some categories of hallucination. Its 200k context window is larger. Anthropic's constitutional AI training approach gives it a notably different "feel" in conversation, particularly around sensitive topics.
Gemini 1.5 Pro (Google) has the largest context window of the mainstream models. Its multimodal capabilities include video and audio understanding alongside text and images. Its 1M context window makes it the best choice for tasks requiring very long document analysis. Performance on standard benchmarks is comparable to the others at its tier.
Deepseek V3 (Deepseek AI) is notable for strong performance at a fraction of the cost of the Western frontier models. It performs competitively on coding and reasoning tasks while being accessible at much lower price points. For cost-sensitive applications, it is worth serious evaluation.
What to Read Next
If this overview raised more questions than it answered, that is the right reaction. The follow-on posts go deeper into the specific mechanics and practical applications.
Keep Reading
- GPT-4o vs Claude 3.5 Sonnet vs Gemini Pro vs Deepseek V3: Honest Comparison 2026 - Benchmark scores, pricing tables, and honest assessments of what each model is best at
- Prompt Engineering Complete Guide 2026 - Every technique that actually moves the needle on LLM output quality
- Why LLMs Hallucinate and How to Reduce It: A Practical Guide - A deeper look at the three hallucination types and concrete mitigation techniques
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace - chat, projects, time tracking, AI meeting summaries, and invoicing - in one tool. Try it free.