MMLU Explained: What the 57-Subject LLM Benchmark Actually Tests

MMLU covers 57 academic subjects with 15,908 multiple-choice questions from elementary to professional level, and remains one of the most widely cited LLM benchmarks despite significant criticisms.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 10, 2026

9 min read

// tags

#mmlu #benchmark #evaluation #llm #academic

Scoring Methodology

Each question has four choices (A, B, C, D). The random baseline is 25%. MMLU is evaluated in two modes:

Few-shot (5-shot): Five example questions with answers are prepended to each test question. This tests whether the model can use the context format correctly and apply in-context learning.

Zero-shot: No examples are provided, so the model must rely on its parametric knowledge and infer the expected answer format from the prompt alone.
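To make the two modes concrete, here is a minimal sketch of how a k-shot prompt might be assembled. The header sentence and the `Answer:` layout follow the style commonly used for MMLU prompts, but the helper names (`format_question`, `build_prompt`) and the sample questions are invented for illustration, and real harnesses differ in the details.

```python
LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one question; the answer letter is filled in only for few-shot examples."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, test_item, examples=()):
    """Prepend zero or more solved examples to the unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, c, a) for q, c, a in examples]
    blocks.append(format_question(test_item[0], test_item[1]))
    return header + "\n\n" + "\n\n".join(blocks)

# Hypothetical items, not taken from the MMLU dataset.
examples = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
test_item = ("What is 3 * 3?", ["6", "8", "9", "12"])

zero_shot = build_prompt("elementary mathematics", test_item)
one_shot = build_prompt("elementary mathematics", test_item, examples)
print(one_shot)
```

A 5-shot prompt is the same call with five entries in `examples`. Both variants end with an open `Answer:` so the model's next token is the predicted letter.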

The score is reported as accuracy, the fraction of questions answered correctly. Most papers report the 5-shot setting.
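The accuracy calculation itself is simple. The sketch below assumes predictions and gold answers are already available as lists of letters. The `accuracy` helper and the sample data are illustrative, not part of any benchmark tooling.

```python
RANDOM_BASELINE = 1 / 4  # four answer choices per question

def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the gold letter."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Made-up predictions for eight questions.
preds = ["A", "C", "B", "D", "A", "B", "C", "D"]
gold = ["A", "C", "D", "D", "B", "B", "C", "A"]

acc = accuracy(preds, gold)
print(f"accuracy = {acc:.3f}, points above chance = {acc - RANDOM_BASELINE:.3f}")
```

Comparing against `RANDOM_BASELINE` is a useful sanity check: a score near 0.25 suggests the model, or the answer parsing, is effectively guessing.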

Score Distribution Across Models

Model	MMLU 5-shot
Random baseline	25.0%
GPT-3 (2020)	43.9%
GPT-3.5	70.0%
GPT-4	86.4%
Claude 3 Opus	86.8%
Llama 3 70B	82.0%
Human expert estimate	~89.8%

Running MMLU Evaluation

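# Install EleutherAI's lm-evaluation-harness
pip install lm-eval

# Run 5-shot MMLU on Llama 3 8B with a single GPU.
# The meta-llama checkpoints are gated: request access on Hugging Face and log in first.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results/llama3-8b-mmlu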
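Once the run finishes, the harness writes its results as JSON under the output path. The sketch below shows one way to pull the aggregate score out of such a file. It assumes the layout used by recent harness versions, where metrics sit under `results -> <task>` with keys like `acc,none`. File names and key names vary between versions, so treat both as assumptions and inspect your own output. The `extract_accuracy` helper and the sample values are hypothetical, not real results.

```python
import json

def extract_accuracy(results, task="mmlu"):
    """Return the accuracy for a task from a parsed results dictionary."""
    task_metrics = results["results"][task]
    # Recent versions key metrics as "<metric>,<filter>"; older ones use plain "acc".
    for key in ("acc,none", "acc"):
        if key in task_metrics:
            return task_metrics[key]
    raise KeyError(f"no accuracy metric found for task {task!r}")

# Placeholder payload standing in for json.load(open(<path to results file>)).
sample = json.loads(
    '{"results": {"mmlu": {"acc,none": 0.5, "acc_stderr,none": 0.004}}}'
)

mmlu_acc = extract_accuracy(sample)
print(f"MMLU accuracy: {mmlu_acc:.1%}")
```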

The Criticisms

Data contamination: MMLU questions were collected from publicly available online sources, and many appear verbatim in web crawls such as Common Crawl. Models trained on that data may have memorized answers rather than reasoned their way to them. Some studies estimate contamination rates of 10-30% for certain subjects.
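One common heuristic for spotting verbatim contamination is word n-gram overlap between a benchmark question and training text. The sketch below illustrates the idea. It is not the method behind the estimates quoted above, the window size `n=8` is an arbitrary choice for a short example, and the texts are invented.

```python
import re

def ngrams(text, n):
    """Set of lowercase word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(question, corpus_text, n=8):
    """Share of the question's n-grams that also occur in the corpus text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    return len(question_grams & ngrams(corpus_text, n)) / len(question_grams)

question = ("Which of the following best describes the function of the "
            "mitochondria in a eukaryotic cell?")
crawl_page = ("Biology quiz 4. Which of the following best describes the function "
              "of the mitochondria in a eukaryotic cell? A. Protein synthesis")
unrelated = "The mitochondria is often called the powerhouse of the cell."

leaked = overlap_fraction(question, crawl_page)
clean = overlap_fraction(question, unrelated)
print(f"crawl page overlap: {leaked:.2f}, unrelated overlap: {clean:.2f}")
```

Exact-match checks like this only catch verbatim copies. Paraphrased or translated questions pass through undetected, which is one reason contamination estimates vary.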

Multiple-choice is not real-world: Production tasks usually require open-ended generation, not selecting from four options. A model can score 80% on MMLU and still perform far worse when the same questions are asked without answer choices.

Inconsistent across papers: Implementations differ in zero-shot vs few-shot prompting, log-prob scoring vs generated answers, and whether chain-of-thought is used, so reported numbers are often not comparable. Papers can also choose whichever protocol makes their model look best.
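The difference between log-prob scoring and generation-based scoring is easy to see in a small sketch. It assumes the per-letter log-probabilities and the generated text have already been obtained from a model. The numbers, the sample output, and both helper functions are made up for illustration.

```python
import re

def pick_by_logprob(letter_logprobs):
    """Log-prob scoring: take the answer letter the model assigns the highest score."""
    return max(letter_logprobs, key=letter_logprobs.get)

def pick_by_generation(generated_text):
    """Generation scoring: parse a letter from free text such as 'the answer is C'."""
    match = re.search(r"[Aa]nswer is:?\s*\(?([ABCD])\b", generated_text)
    return match.group(1) if match else None

# Hypothetical outputs for the same question from the same model.
logprobs = {"A": -2.1, "B": -0.9, "C": -1.0, "D": -3.5}
generated = "Let me think step by step. The best answer is C."

by_logprob = pick_by_logprob(logprobs)
by_generation = pick_by_generation(generated)
print(f"log-prob pick: {by_logprob}, generation pick: {by_generation}")
```

In this invented case the two protocols disagree on the same question, and an unparseable generation returns `None`, which is typically counted as wrong. Differences like these add up across thousands of questions.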

MMLU-Pro: A Harder Variant

MMLU-Pro (arXiv:2406.01574) addresses some criticisms by:

Expanding to 10 choices instead of 4 (reducing random baseline to 10%)
Filtering for harder questions with less ambiguity
Requiring reasoning rather than pattern-matching

GPT-4 scores approximately 72% on MMLU-Pro versus roughly 87% on MMLU, so MMLU-Pro leaves more headroom and separates strong models more clearly.
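Because the two benchmarks have different random baselines, raw accuracies are not directly comparable. One way to put them on a common scale is to rescale each score so that random guessing maps to 0 and a perfect score maps to 1. This chance correction is an illustrative addition rather than something either benchmark reports. The sketch uses the approximate 87% and 72% figures above and assumes every MMLU-Pro question has exactly 10 options.

```python
def random_baseline(num_choices):
    """Expected accuracy from uniform random guessing."""
    return 1 / num_choices

def chance_corrected(accuracy, num_choices):
    """Rescale accuracy so random guessing is 0.0 and a perfect score is 1.0."""
    base = random_baseline(num_choices)
    return (accuracy - base) / (1 - base)

mmlu_score = chance_corrected(0.87, num_choices=4)
mmlu_pro_score = chance_corrected(0.72, num_choices=10)

print(f"MMLU baseline: {random_baseline(4):.0%}, corrected score: {mmlu_score:.3f}")
print(f"MMLU-Pro baseline: {random_baseline(10):.0%}, corrected score: {mmlu_pro_score:.3f}")
```

Even after removing the guessing advantage, the MMLU-Pro score stays well below the MMLU score, which is consistent with the questions themselves being harder and not just the guessing odds being lower.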
