A context window is the maximum amount of text an LLM can process in a single interaction. Everything the model sees, including your system prompt, the conversation history, any documents you attach, and the model's own prior responses, counts against this limit. When the limit is reached, the model can no longer access the earlier parts of the conversation. Understanding context windows tells you what you can build with LLMs and where you will hit walls.
Current limits as of May 2026: GPT-4o handles 128,000 tokens, Claude 3.5 Sonnet handles 200,000, Gemini 1.5 Pro handles 1,000,000, and Deepseek V3 handles 64,000.
What Fits in a Context Window
Token counts are abstract until you map them to real content. Here are reference points for the most common use cases.
128,000 tokens (GPT-4o) holds approximately:
- 90,000 to 100,000 words of English text
- An entire novel (the average novel is 80,000 to 100,000 words)
- 50 to 70 pages of dense PDF documentation
- A mid-sized codebase (5,000 to 8,000 lines of code with comments)
- Roughly 200 to 300 email-length messages in a conversation history
200,000 tokens (Claude 3.5 Sonnet) holds approximately:
- 140,000 to 150,000 words
- The entire text of the Bible (about 783,000 words in the King James Version — no, this does not fit; a single book or 10 to 15 chapters would)
- A 150-page technical specification
- A substantial codebase or collection of related files
1,000,000 tokens (Gemini 1.5 Pro) holds approximately:
- 700,000 to 750,000 words
- The entire Harry Potter series is approximately 1,000,000 words (close but over the limit)
- A large enterprise codebase
- Years of customer support tickets
64,000 tokens (Deepseek V3) holds approximately:
- 45,000 to 48,000 words
- A short novel or a long report
- A medium codebase
Why Context Windows Degrade at the Edges
A larger context window does not mean uniformly good attention across the entire window. Research on the "lost in the middle" problem (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") found that LLMs are reliably better at processing information placed at the beginning and end of the context window than information placed in the middle.
The practical consequence: if you are giving an LLM a long document and asking it to follow specific instructions, put the most important instructions at the very beginning or the very end. Information buried in the middle of a long context is more likely to be under-weighted.
This also means that as conversations get longer, the system prompt you wrote at the start is being increasingly deprioritized as it gets pushed toward the middle of the context. For long conversations, re-injecting critical instructions periodically is better than relying on the model to remember them from the beginning.
When Context Limits Actually Hurt You
The situations where context limits become a real constraint:
Large codebase analysis. If you want an LLM to review an entire codebase for security vulnerabilities or architectural issues, a 100,000-line codebase at roughly 50 characters per line is approximately 5,000,000 characters, or about 1.25 million tokens. Even Gemini's 1M window is not sufficient. You need chunking strategies.
Long research sessions. If you are using an LLM for a multi-hour research session and the conversation grows to 50,000+ tokens, you will start seeing the model forget context established early in the conversation, particularly nuanced instructions or earlier conclusions.
Long document analysis with follow-up questions. A 300-page technical specification is roughly 150,000 tokens. It fits in Claude's 200k window, but adding your system prompt, the specification, and a back-and-forth conversation about it can push past 200k quickly.
Multi-document synthesis. Asking an LLM to synthesize insights across five long documents simultaneously may exceed the context window of all but Gemini 1.5 Pro.
Strategies for Working Around Context Limits
Chunking. Divide long documents into segments and process each segment separately. Collect the outputs and synthesize them in a final pass. This requires more API calls and more careful prompt design, but it works for most document analysis tasks.
Summarization. After processing a section of content, ask the model to summarize the key points, and use that summary instead of the full content in subsequent requests. This compresses context at the cost of losing some detail.
Retrieval-augmented generation (RAG). Instead of feeding the entire document into the context window, build an index of the document (using embeddings) and retrieve only the most relevant sections for each query. RAG is the standard architecture for applications dealing with large knowledge bases, and it effectively removes context window constraints for most use cases.
Conversation reset. For long work sessions, periodically extract the key conclusions from the current conversation, start a new conversation with those conclusions plus a fresh system prompt, and continue from there. This is less elegant than a continuous conversation but avoids the degradation that comes from overfull context windows.
Model selection based on context need. If your task requires a very long context, use the right model. For tasks that genuinely need 150k+ tokens, Claude 3.5 Sonnet or Gemini 1.5 Pro are the appropriate choices. Do not try to compress a naturally long-context task into a smaller window; the quality degradation is real.
What Context Window Size Does Not Change
Context window size does not affect the model's maximum output length. Even with a 1M token context window, Gemini still has a maximum output of several thousand tokens per request. Large context windows help you send more input, not receive more output.
Context window size also does not affect base model quality. A model with a larger context window is not necessarily better at reasoning, coding, or writing. The window is a constraint on input size, not a measure of intelligence.
Keep Reading
- What Is a Token in an LLM? A Plain-English Explanation — How tokens are counted, why they matter for cost, and how to check token counts in your code
- How Large Language Models Work: A Complete Guide Without the Math Overload — The full guide to LLM mechanics, including why context windows work the way they do
- GPT-4o vs Claude 3.5 Sonnet vs Gemini Pro vs Deepseek V3: Honest Comparison 2026 — Detailed comparison of context windows, pricing, and performance across the major models
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.