The Needle-in-a-Haystack Test: Benchmarking LLM Long-Context Recall

The needle-in-a-haystack test measures whether an LLM can recall a single specific fact embedded at varying depths in a long document, revealing which models have uniform long-context recall and which have blind spots.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 1, 2026

9 min read

// tags

#needle-in-haystack #long-context #evaluation #benchmark #recall



Results Across Models

Claude 2.1 (200k context) achieved near-perfect recall at every depth across the tested context lengths, with over 98% accuracy at 100k tokens. That made it the reference point for long-context recall quality.

GPT-4 Turbo (128k context) showed specific failure zones: at certain depths, recall dropped to 60-70% once the context reached 70-100k tokens. Recall across its extended context window was therefore inconsistent rather than uniformly degraded.

Llama 2 (4k native context) failed predictably beyond its training context length, which suggests that context extension techniques (RoPE scaling, ALiBi) do not fully preserve recall.

# model_client is assumed to be any object exposing generate(prompt) -> str,
# for example a thin wrapper around the OpenAI or Anthropic SDK client.

def needle_haystack_test(
    model_client,
    haystack_text: str,
    needle: str,
    question: str,
    context_length: int,
    depth_percent: float,
) -> bool:
    # Truncate the haystack to the target context length first, so the
    # needle can never be cut off. Rough ratio: 0.75 English words per token.
    max_words = int(context_length * 0.75)
    words = haystack_text.split()[:max_words]
    # Insert the needle at the specified depth of the truncated haystack
    insert_pos = int(len(words) * depth_percent)
    words = words[:insert_pos] + needle.split() + words[insert_pos:]
    text = " ".join(words)

    prompt = f"{text}\n\nQuestion: {question}\nAnswer:"
    response = model_client.generate(prompt)

    # Grade: ask a model whether the specific needle detail appears in the response
    grader_prompt = f"""
    Needle: {needle}
    Model response: {response}
    Does the response accurately recall the specific details from the needle? Answer Yes or No.
    """
    grade = model_client.generate(grader_prompt)
    # Match only a leading "yes" so that text such as "No, yesterday..." does not pass
    return grade.strip().lower().startswith("yes")


def run_sweep(model_client, haystack, needle, question):
    context_lengths = [1000, 5000, 10000, 25000, 50000, 100000]
    depth_percents = [0.05, 0.25, 0.50, 0.75, 0.95]
    results = {}

    for length in context_lengths:
        for depth in depth_percents:
            result = needle_haystack_test(
                model_client, haystack, needle, question, length, depth
            )
            results[(length, depth)] = result

    return results
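The dictionary returned by run_sweep can be summarized as a grid of context length against depth, which is the same view the heatmap gives. The following is a minimal sketch, assuming the results are keyed by (context_length, depth_percent) with boolean values. format_grid and failing_cells are hypothetical helper names, and the sample data is invented for illustration rather than measured.

```python
def format_grid(results: dict) -> str:
    """Render pass/fail results as text: one row per depth, one column per length."""
    lengths = sorted({length for length, _ in results})
    depths = sorted({depth for _, depth in results})
    rows = ["depth".ljust(8) + "".join(str(n).rjust(8) for n in lengths)]
    for depth in depths:
        cells = []
        for n in lengths:
            passed = results.get((n, depth))  # None if the cell was not run
            label = "-" if passed is None else ("ok" if passed else "MISS")
            cells.append(label.rjust(8))
        rows.append(f"{depth:.0%}".ljust(8) + "".join(cells))
    return "\n".join(rows)


def failing_cells(results: dict) -> list:
    """Return the (context_length, depth) pairs where the needle was not recalled."""
    return sorted(key for key, passed in results.items() if not passed)


# Invented sample data, for illustration only (not measured results)
sample = {
    (1000, 0.05): True, (1000, 0.50): True,
    (50000, 0.05): True, (50000, 0.50): False,
}
print(format_grid(sample))
print(failing_cells(sample))
```

A single boolean per cell is noisy. Repeating each cell several times and averaging would be needed before reading percentages off the grid.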

Implications for Production RAG

The test reveals that "supports 200k context" does not mean "reliably recalls information from anywhere in that context." For production RAG:

Use shorter, more focused context windows (under 50k tokens) unless the model is validated for long-context recall
Apply the "lost in the middle" reordering strategy (put most relevant chunks first and last)
Consider chunked processing with a final synthesis step rather than one massive context
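The reordering strategy in the list above can be sketched in a few lines. This assumes the retriever already returns chunks sorted from most to least relevant. reorder_for_long_context is a hypothetical helper name, not part of any particular RAG library.

```python
def reorder_for_long_context(chunks_by_relevance: list) -> list:
    """Place the most relevant chunks at the start and end of the context.

    Input must be sorted from most to least relevant. Chunks are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle, where recall tends to be weakest.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so relevance rises again toward the end
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant chunk
print(reorder_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this helps for a given model is worth confirming with the sweep itself, since the position of the weak zone differs between models.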

Running It Yourself

The original implementation at github.com/gkamradt/LLMTest_NeedleInAHaystack generates the heatmap visualization and supports multiple model APIs. You can run the full sweep for a new model in a few hours and a few dollars of API costs.
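A rough budget for a sweep follows from simple arithmetic: every (length, depth) cell sends roughly one full context of input tokens. The sketch below uses the grid from run_sweep and ignores the grading calls. The function names are hypothetical, and the per-million-token prices are placeholders, not any provider's actual rates.

```python
def estimate_sweep_tokens(context_lengths, depth_percents, answer_tokens=50):
    """Approximate token usage for one sweep, excluding grader calls."""
    calls = len(context_lengths) * len(depth_percents)
    # Each context length is sent once per depth
    input_tokens = sum(context_lengths) * len(depth_percents)
    output_tokens = calls * answer_tokens  # assumed short answers
    return calls, input_tokens, output_tokens


def estimate_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Convert token counts to dollars given prices per million tokens."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


lengths = [1000, 5000, 10000, 25000, 50000, 100000]
depths = [0.05, 0.25, 0.50, 0.75, 0.95]
calls, tokens_in, tokens_out = estimate_sweep_tokens(lengths, depths)
print(calls, tokens_in, tokens_out)  # 30 955000 1500

# Placeholder prices: substitute the current rates for the model under test
print(estimate_cost(tokens_in, tokens_out, usd_per_m_input=10.0, usd_per_m_output=30.0))
```

A denser grid, or one that extends to the model's full context window, scales the input tokens accordingly, so it is worth running this estimate before starting a large sweep.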
