What the Test Measures
When a model claims to support 100k tokens of context, does it actually use all of that context equally? Or are there positions in the context window where recall degrades? The needle-in-a-haystack test, popularized by Gregory Kamradt, answers this question with a simple controlled experiment.
The Test Design
- The haystack: A large corpus of unrelated text (typically Paul Graham essays, repeated as needed) filling the target context length (e.g., 100k tokens)
- The needle: A single specific sentence inserted at a specific depth: "The best thing to do in San Francisco is eat a Mission burrito and go to Dolores Park on a sunny day." (or similar fact-rich sentence with a specific detail to recall)
- The question: "What is the best thing to do in San Francisco?"
- The sweep: Test all combinations of context lengths (1k, 5k, 10k, 20k, 50k, 100k tokens) and needle depths (5%, 25%, 50%, 75%, 95% through the document). This produces a 2D heatmap.
- Scoring: Each cell is graded 1 (exact recall of the needle) or 0 (miss), using a second LLM as the grader.
Results Across Models
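Claude 2.1 (200k context) is the most instructive case. In the original evaluation it missed the needle at many depths as the context grew, but Anthropic reported that prefilling the response with "Here is the most relevant sentence in the context:" raised its score on the same evaluation from 27% to 98%. Measured recall therefore depends on the prompt as well as on the model.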
GPT-4 Turbo (128k context) showed a distinct degradation zone: at roughly 70-100k tokens and certain needle depths, recall dropped to 60-70%. Recall across its extended context window was therefore inconsistent rather than uniformly weak.
Llama 2 (4k native context) failed predictably beyond its training context, a reminder that context-extension techniques (RoPE scaling, ALiBi) have to be validated for recall and do not preserve it automatically.

from typing import Protocol


class ModelClient(Protocol):
    """Any wrapper (e.g. around the OpenAI or Anthropic SDK) exposing generate()."""

    def generate(self, prompt: str) -> str: ...


def needle_haystack_test(
    model_client: ModelClient,
    grader_client: ModelClient,
    haystack_text: str,
    needle: str,
    question: str,
    context_length: int,
    depth_percent: float,
) -> bool:
    # Approximate the token budget in words (roughly 0.75 words per token)
    target_words = context_length * 3 // 4
    words = haystack_text.split()
    if not words:
        raise ValueError("haystack_text is empty")
    # Repeat the haystack as needed, then truncate to the target length
    repeats = -(-target_words // len(words))  # ceiling division
    words = (words * repeats)[:target_words]
    # Insert the needle at the specified depth of the truncated haystack,
    # so truncation can never cut the needle out
    insert_pos = int(len(words) * depth_percent)
    words = words[:insert_pos] + needle.split() + words[insert_pos:]
    text = " ".join(words)
    prompt = f"{text}\n\nQuestion: {question}\nAnswer:"
    response = model_client.generate(prompt)
    # Grade with a second LLM: does the response recall the needle's details?
    grader_prompt = (
        f"Needle: {needle}\n"
        f"Model response: {response}\n"
        "Does the response accurately recall the specific details from the "
        "needle? Answer Yes or No."
    )
    grade = grader_client.generate(grader_prompt)
    return grade.strip().lower().startswith("yes")


def run_sweep(model_client, grader_client, haystack, needle, question):
    # Same grid as described in the test design above
    context_lengths = [1000, 5000, 10000, 20000, 50000, 100000]
    depth_percents = [0.05, 0.25, 0.50, 0.75, 0.95]
    results = {}
    for length in context_lengths:
        for depth in depth_percents:
            results[(length, depth)] = needle_haystack_test(
                model_client, grader_client, haystack, needle, question,
                length, depth,
            )
    return results

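
The sweep returns a dictionary keyed by (context length, depth), which still has to be arranged into the 2D heatmap. A minimal sketch of that step follows, assuming the sweep is complete (every length and depth pair is present). The helper names `results_to_grid` and `render_text_heatmap` are hypothetical and not part of the original implementation, which draws a graphical heatmap instead.

```python
from typing import Dict, List, Tuple


def results_to_grid(
    results: Dict[Tuple[int, float], bool],
) -> Tuple[List[int], List[float], List[List[int]]]:
    """Arrange sweep results as rows of depths and columns of context lengths."""
    lengths = sorted({length for length, _ in results})
    depths = sorted({depth for _, depth in results})
    # Assumes a complete sweep: a missing (length, depth) pair raises KeyError
    grid = [
        [int(results[(length, depth)]) for length in lengths]
        for depth in depths
    ]
    return lengths, depths, grid


def render_text_heatmap(results: Dict[Tuple[int, float], bool]) -> str:
    """Render the grid as plain text: '#' marks a hit, '.' marks a miss."""
    lengths, depths, grid = results_to_grid(results)
    header = "depth  " + " ".join(f"{n // 1000:>4}k" for n in lengths)
    rows = [header]
    for depth, row in zip(depths, grid):
        cells = " ".join(f"{'#' if hit else '.':>5}" for hit in row)
        rows.append(f"{depth:>5.0%}  {cells}")
    return "\n".join(rows)
```

Each row is one needle depth and each column one context length, so a failure zone shows up as a cluster of '.' cells rather than as a single averaged accuracy number.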
Implications for Production RAG
The test reveals that "supports 200k context" does not mean "reliably recalls information from anywhere in that context." For production RAG:
- Use shorter, more focused context windows (under 50k tokens) unless the model is validated for long-context recall
- Apply the "lost in the middle" reordering strategy (put most relevant chunks first and last)
- Consider chunked processing with a final synthesis step rather than one massive context
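
The reordering strategy in the second bullet can be sketched as below. This is one possible interleaving, assuming each retrieved chunk arrives with a relevance score from your retriever; the function name `reorder_for_long_context` and the (text, score) input shape are illustrative assumptions, not part of any specific library.

```python
from typing import List, Sequence, Tuple


def reorder_for_long_context(
    scored_chunks: Sequence[Tuple[str, float]],
) -> List[str]:
    """Place the highest-scoring chunks at both ends, the weakest in the middle."""
    # Rank from most to least relevant
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    # Alternate: ranks 1, 3, 5, ... go to the front; ranks 2, 4, ... to the back
    for i, (chunk, _score) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last
    return front + back[::-1]


chunks = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5)]
print(reorder_for_long_context(chunks))  # ['a', 'c', 'e', 'd', 'b']
```

The least relevant chunk lands in the middle of the prompt, which is the region where the heatmaps tend to show weaker recall.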
Running It Yourself
The original implementation at github.com/gkamradt/LLMTest_NeedleInAHaystack generates the heatmap visualization and supports multiple model APIs. You can run the full sweep for a new model in a few hours for a few dollars in API costs.
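
To sanity-check the cost before running, a rough estimate of the input tokens for the grid described in this article can be computed as follows. This is a back-of-the-envelope sketch: the price is a placeholder assumption rather than any provider's actual rate, and grader calls and output tokens are ignored.

```python
from typing import Sequence


def estimate_sweep_input_tokens(
    context_lengths: Sequence[int], n_depths: int
) -> int:
    """Each context length is sent once per needle depth."""
    return sum(context_lengths) * n_depths


def estimate_sweep_cost(
    context_lengths: Sequence[int],
    n_depths: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Input-token cost only; the per-token price is supplied by the caller."""
    tokens = estimate_sweep_input_tokens(context_lengths, n_depths)
    return tokens / 1_000_000 * usd_per_million_input_tokens


lengths = [1000, 5000, 10000, 20000, 50000, 100000]
print(estimate_sweep_input_tokens(lengths, 5))  # 930000
# Hypothetical price of $3 per million input tokens
print(f"${estimate_sweep_cost(lengths, 5, 3.0):.2f}")
```

A denser grid, a longer maximum context, or a more expensive model scales this estimate proportionally.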