The Counterintuitive Finding
Longer context windows should make LLMs more capable: you can provide more documents, more background, more evidence. But the paper "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172) by Liu et al. revealed something that surprised many practitioners: when relevant information is buried in the middle of a long context, LLM performance can degrade substantially, even when the full context fits within the model's window.
The Experimental Setup
The researchers used a controlled multi-document QA task: given a question and k documents, only one of which contains the answer, the model must answer the question. They varied where the answer-containing document appeared: at positions 0%, 25%, 50%, 75%, and 100% through the context.
The models evaluated included GPT-3.5-turbo-16k, GPT-4, Claude 1.3 (100k), and Llama 2 variants. The same broad pattern appeared across models, though its magnitude varied.
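The setup above can be sketched in a few lines. This is an illustrative harness, not the paper's code: the function names, the prompt template, and the rounding used to turn a fraction of the context into a list index are all assumptions made for the example.

```python
def place_gold_document(distractors: list[str], gold: str, position: int) -> list[str]:
    """Return a new document list with `gold` inserted at index `position`."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    return distractors[:position] + [gold] + distractors[position:]


def positions_for_fractions(num_distractors: int, fractions: list[float]) -> list[int]:
    """Map fractions of the context (0.0 to 1.0) to insertion indices."""
    return [round(f * num_distractors) for f in fractions]


def build_prompt(question: str, documents: list[str]) -> str:
    """Format numbered documents followed by the question (hypothetical template)."""
    lines = [f"Document [{i + 1}] {doc}" for i, doc in enumerate(documents)]
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Sweep the answer-bearing document through the context (toy inputs).
distractors = [f"Distractor text {i}" for i in range(4)]
gold = "The answer-bearing passage."
prompts = [
    build_prompt("What is the answer?", place_gold_document(distractors, gold, pos))
    for pos in positions_for_fractions(len(distractors), [0.0, 0.25, 0.5, 0.75, 1.0])
]
```

Each prompt in `prompts` contains the same documents and differs only in where the gold passage sits, which is what isolates position as the variable under test.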
The U-Shaped Performance Curve
Performance was highest when the relevant document appeared at the very beginning (primacy effect) or very end (recency effect) of the context. Performance was lowest when the relevant document was in the middle, dropping 15 to 25 percentage points in accuracy compared to the same model with the document at position 0%.
With 20 documents, models that scored 70%+ accuracy with the answer first dropped to under 50% accuracy with the answer in the middle. That is a large loss in real-world usefulness, even though the context fit within the model's window.
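To check for this curve on your own data, one option is to log the gold document's position alongside correctness and aggregate by position. The sketch below assumes results arrive as `(gold_position, was_correct)` pairs; both function names are hypothetical, and the sample values are made-up toy data, not figures from the paper.

```python
from collections import defaultdict


def accuracy_by_position(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Aggregate (gold_position, was_correct) pairs into accuracy per position."""
    totals: dict[int, int] = defaultdict(int)
    correct: dict[int, int] = defaultdict(int)
    for position, is_correct in results:
        totals[position] += 1
        correct[position] += int(is_correct)
    return {pos: correct[pos] / totals[pos] for pos in sorted(totals)}


def edge_to_worst_gap(accuracy: dict[int, float]) -> float:
    """Percentage-point gap between the better edge position and the worst position."""
    positions = sorted(accuracy)
    best_edge = max(accuracy[positions[0]], accuracy[positions[-1]])
    return (best_edge - min(accuracy.values())) * 100


# Toy data only: three positions in a 20-document context, four trials each.
toy_results = [
    (0, True), (0, True), (0, True), (0, False),
    (10, True), (10, False), (10, True), (10, False),
    (19, True), (19, True), (19, False), (19, True),
]
curve = accuracy_by_position(toy_results)
gap = edge_to_worst_gap(curve)
```

A U-shape shows up as high values at the first and last keys of `curve` and a dip in between; `gap` summarizes the depth of that dip in percentage points.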
Why This Happens
The cause is not settled. One hypothesis is that the attention patterns a Transformer learns during pretraining are biased by position. Under causal masking, tokens at the beginning are visible to every subsequent token, so they can always be attended to. Tokens at the end sit closest to the generated answer and tend to receive strong attention from it. Tokens in the middle have neither advantage and receive proportionally less attention on average.
This is a matter of attention weight, not of cache access: during autoregressive generation the cached keys and values for every position are read at each decoding step, but middle positions that receive low attention weights contribute little to the output.
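The structural part of that argument, that earlier tokens are visible to more later tokens, can be checked with a toy causal mask. The sketch below uses plain Python lists and hypothetical function names. It shows only which positions may attend to which, not how much attention a trained model actually assigns, so on its own it does not account for the recency side of the curve.

```python
def causal_mask(num_tokens: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(num_tokens)] for q in range(num_tokens)]


def visibility_counts(mask: list[list[bool]]) -> list[int]:
    """For each key position, count the query positions allowed to attend to it."""
    num_tokens = len(mask)
    return [sum(mask[q][k] for q in range(num_tokens)) for k in range(num_tokens)]


counts = visibility_counts(causal_mask(5))
# counts == [5, 4, 3, 2, 1]: position 0 is visible to all five queries,
# while the last position is visible only to itself.
```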
Implications for RAG Systems
This finding has direct implications for how to build RAG pipelines:
- Put the most relevant chunks first or last — not in the middle of your retrieved context
- Prefer fewer, higher-quality passages over many mediocre ones — adding irrelevant middle context actively hurts
- Re-rank retrieved passages and place top-ranked results at context boundaries
def reorder_for_lost_in_middle(passages: list[str], scores: list[float]) -> list[str]:
    """
    Reorder passages to avoid the 'lost in the middle' effect.
    Highest scored passage first, second highest last, rest in between.
    """
    # Sort by score only, highest first; ties keep their original order.
    paired = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    sorted_passages = [p for _, p in paired]
    if len(sorted_passages) <= 2:
        return sorted_passages
    # Best passage first, second best last, rest in middle
    result = [sorted_passages[0]]
    result.extend(sorted_passages[2:])  # middle passages
    result.append(sorted_passages[1])  # second best at end
    return result
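A variant worth considering alternates ranked passages between the front and the back of the context, so that relevance falls off toward the middle from both ends and the weakest passages land at the center. The sketch below is one possible implementation under that assumption. The name `reorder_edges_first` and the optional `top_k` cutoff (which applies the "fewer, higher-quality passages" advice before reordering) are illustrative choices, not something taken from the paper.

```python
from typing import Optional


def reorder_edges_first(
    passages: list[str],
    scores: list[float],
    top_k: Optional[int] = None,
) -> list[str]:
    """Alternate ranked passages between the front and back of the context."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = [
        p for _, p in sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    ]
    if top_k is not None:
        ranked = ranked[:top_k]  # drop low-scoring passages before reordering
    front: list[str] = []
    back: list[str] = []
    for rank, passage in enumerate(ranked):
        # Even ranks (0, 2, 4, ...) fill the front; odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(passage)
    # Reverse the back half so the second-best passage ends up last.
    return front + back[::-1]


ordered = reorder_edges_first(["A", "B", "C", "D", "E"], [0.9, 0.8, 0.7, 0.6, 0.5])
# ordered == ["A", "C", "E", "D", "B"]
```

In this ordering the lowest-scored passage ("E") sits in the center, which is where recall is expected to be weakest.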
How to Structure Prompts for Long Contexts
Beyond RAG, this affects any long-context task:
- Put the most critical instructions at the beginning and end of the system prompt
- Put examples and background in the middle, where imprecise recall costs the least
- For summarization tasks, consider chunked approaches that process the document in windows rather than all at once
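The first and last points above can be sketched as two small helpers. Both are illustrative assumptions rather than a prescribed method: `sandwich_prompt` repeats the instructions after the background, and `split_into_windows` cuts a tokenized document into overlapping windows that can be summarized one at a time. Splitting on whitespace stands in for a real tokenizer here.

```python
def sandwich_prompt(instructions: str, background: str, task: str) -> str:
    """Place the instructions both before and after the background material."""
    return "\n\n".join([instructions, background, task, f"Reminder: {instructions}"])


def split_into_windows(words: list[str], window_size: int, overlap: int = 0) -> list[list[str]]:
    """Split `words` into windows of `window_size`, sharing `overlap` items between neighbors."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in the range [0, window_size)")
    step = window_size - overlap
    windows: list[list[str]] = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the last window already reaches the end of the document
    return windows


document = "one two three four five six seven eight nine ten"
windows = split_into_windows(document.split(), window_size=4, overlap=1)
# Three windows: words 1-4, 4-7, and 7-10, each sharing one word with its neighbor.
```

Each window is short enough that no part of it is far from a context boundary. The per-window summaries can then be combined in a second pass.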
Model Comparisons
Claude 1.3 (100k context) showed a less severe U-shape than GPT-3.5-turbo. One possible explanation is a difference in training, but that is speculation. GPT-4 showed more robustness than GPT-3.5. Llama 2 models showed the most severe degradation. The pattern was consistent in shape but varied in magnitude.