What MMLU Is
The Massive Multitask Language Understanding benchmark (arXiv:2009.03300) by Hendrycks et al. tests LLMs on 57 subject areas spanning STEM, humanities, social sciences, and professional domains. With 15,908 multiple-choice questions at difficulty levels from elementary to expert, it was designed to measure the breadth of knowledge a model has absorbed during pretraining.
The 57 Subjects
The subjects span a wide range:
- STEM: Abstract algebra, anatomy, astronomy, chemistry, computer science, electrical engineering, physics, mathematics
- Professional: Medical genetics, clinical knowledge, jurisprudence, accountancy, management, marketing
- Humanities: History, philosophy, prehistory, moral scenarios
- Social Sciences: Economics, psychology, sociology, geography, politics
- Miscellaneous: General knowledge plus subjects such as professional medicine, professional law, and business ethics
Scoring Methodology
Each question has four choices (A, B, C, D). The random baseline is 25%. MMLU is evaluated in two modes:
Few-shot (5-shot): Five example questions with answers are prepended to each test question. This tests whether the model can use the context format correctly and apply in-context learning.
Zero-shot: No examples. Tests pure parametric knowledge.
Score is reported as accuracy (fraction of questions answered correctly). Most papers use 5-shot.
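The two modes and the accuracy metric can be sketched in a few lines of Python. This is a minimal illustration, not the harness used by any particular paper: the header sentence follows the prompt format used in the original MMLU evaluation code, while the function names (`format_example`, `build_prompt`, `accuracy`) and the sample question are hypothetical.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one question. The answer letter is filled in only for few-shot demos."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, test_question, test_options):
    """Build a k-shot prompt; pass shots=[] for zero-shot."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    demos = "".join(format_example(q, o, a) + "\n\n" for q, o, a in shots)
    return header + demos + format_example(test_question, test_options)


def accuracy(predictions, gold):
    """Fraction of questions answered correctly."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical one-shot example (a real 5-shot run would pass five demos).
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt(
    "elementary mathematics", shots, "What is 3 * 3?", ["6", "8", "9", "12"]
)
```

The prompt ends with a bare `Answer:` so that the model's next token is the predicted letter. Zero-shot evaluation uses the same function with an empty `shots` list.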
Score Distribution Across Models
| Model | MMLU 5-shot |
|-------|-------------|
| Random baseline | 25.0% |
| GPT-3 (2020) | 43.9% |
| GPT-3.5 | 70.0% |
| GPT-4 | 86.4% |
| Claude 3 Opus | 86.8% |
| Llama 3 70B | 82.0% |
| Human expert estimate | ~89.8% |
Running MMLU Evaluation
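```bash
# Using EleutherAI's lm-evaluation-harness
pip install lm-eval

# Meta-Llama-3-8B is a gated model on Hugging Face: accept the license
# and authenticate before running.
# Evaluate on all MMLU subjects with 5 in-context examples per question.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results/llama3-8b-mmlu
```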
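Once a run finishes, the per-subject scores are usually more informative than the single aggregate. The sketch below assumes the results JSON has a top-level `results` mapping keyed by task name (`mmlu`, `mmlu_anatomy`, ...) with accuracy stored under `acc,none`, which matches recent 0.4.x releases of the harness but may differ in other versions. The helper names, the `CATEGORY_GROUPS` set, and the sample values are illustrative assumptions, not harness output.

```python
# Assumed names of category-level aggregates that share the "mmlu_" prefix.
CATEGORY_GROUPS = {"stem", "humanities", "social_sciences", "other"}


def subject_accuracies(results, metric="acc,none", prefix="mmlu_"):
    """Map subject name -> accuracy, skipping the overall and category aggregates."""
    out = {}
    for task, metrics in results.get("results", {}).items():
        if not task.startswith(prefix) or metric not in metrics:
            continue
        subject = task[len(prefix):]
        if subject not in CATEGORY_GROUPS:
            out[subject] = metrics[metric]
    return out


def weakest_subjects(results, n=5):
    """Return the n lowest-scoring subjects as (name, accuracy) pairs."""
    accs = subject_accuracies(results)
    return sorted(accs.items(), key=lambda kv: kv[1])[:n]


# Hand-written sample in the assumed schema (values are made up).
sample = {
    "results": {
        "mmlu": {"acc,none": 0.65},
        "mmlu_stem": {"acc,none": 0.55},
        "mmlu_abstract_algebra": {"acc,none": 0.30},
        "mmlu_anatomy": {"acc,none": 0.60},
        "mmlu_marketing": {"acc,none": 0.85},
    }
}
print(weakest_subjects(sample, n=2))
```

For a real run, load the file with `json.load` first. The exact file name under the output directory depends on the harness version, so check what was actually written there.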
The Criticisms
Data contamination: MMLU questions were collected from publicly available sources, and many appear verbatim in Common Crawl, so a model may have memorized answers rather than reasoned its way to them. Contamination estimates vary by study and by subject, with figures of 10-30% reported for some subjects.
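A common way to estimate contamination is to check how many of a question's word n-grams also occur in the training corpus. The following is a minimal sketch of that idea, not the method of any specific study: the window size `n=8`, the `threshold=0.5` cutoff, and all function names are assumptions chosen for illustration, and a real check would need an index over the corpus rather than an in-memory string.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_grams, n=8):
    """Share of the question's n-grams that also occur in the corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:  # question shorter than n tokens
        return 0.0
    return len(q_grams & corpus_grams) / len(q_grams)


def contamination_rate(questions, corpus_text, n=8, threshold=0.5):
    """Fraction of questions whose n-gram overlap meets the threshold."""
    if not questions:
        return 0.0
    corpus_grams = ngrams(corpus_text, n)  # built once, reused per question
    flagged = sum(
        overlap_fraction(q, corpus_grams, n) >= threshold for q in questions
    )
    return flagged / len(questions)


# Toy data: the first question appears verbatim in the "corpus".
corpus = (
    "Quiz archive: What is the capital of France and which river "
    "runs through it? Answer: Paris"
)
questions = [
    "What is the capital of France and which river runs through it",
    "Which element has the atomic number seventy nine in the periodic table",
]
rate = contamination_rate(questions, corpus)
```

Exact n-gram matching misses paraphrased or reformatted copies, so a rate computed this way is best read as a lower bound.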
Multiple-choice is not real-world: Production tasks usually require open-ended generation, not selecting from four options. A model can score 80% on MMLU and still do much worse on open-ended versions of the same questions.
Inconsistent across papers: Different evaluation setups (zero-shot vs few-shot, log-prob scoring vs generation, with or without chain-of-thought) produce numbers that are not directly comparable. Papers are free to report whichever protocol makes their model look best, so check the setup before comparing scores.
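To see why the protocol matters, compare two ways of turning the same model output into an answer. In this sketch, log-prob scoring picks the letter with the highest log-probability, while generation scoring parses a letter out of free text. The function names, the regular expression, and the sample values are illustrative assumptions. Real harnesses differ in the details, which is part of the problem.

```python
import re

LETTERS = "ABCD"

# Matches phrases such as "answer is (C)" or "Answer: B".
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([ABCD])\b")


def pick_by_logprob(letter_logprobs):
    """Log-prob scoring: choose the letter the model finds most likely."""
    return max(LETTERS, key=lambda l: letter_logprobs.get(l, float("-inf")))


def pick_by_generation(generated_text):
    """Generation scoring: extract a letter from free text, or None if absent."""
    match = ANSWER_PATTERN.search(generated_text)
    if match is None:
        # Fall back to a reply that simply starts with the letter, e.g. "B. Paris".
        match = re.match(r"\s*\(?([ABCD])\b", generated_text)
    return match.group(1) if match else None


# Hypothetical outputs for one question: the two protocols disagree.
logprobs = {"A": -2.1, "B": -0.3, "C": -1.7, "D": -3.0}
generation = "Let me think step by step. The answer is (C) because..."

by_logprob = pick_by_logprob(logprobs)
by_generation = pick_by_generation(generation)
```

An unparseable generation returns `None` and counts as wrong, whereas log-prob scoring always yields one of the four letters. That difference alone can shift reported accuracy.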
MMLU-Pro: A Harder Variant
MMLU-Pro (arXiv:2406.01574) addresses some criticisms by:
- Expanding to 10 choices instead of 4 (reducing random baseline to 10%)
- Filtering for harder questions with less ambiguity
- Requiring reasoning rather than pattern-matching
The MMLU-Pro paper reports GPT-4o at roughly 72.6% on MMLU-Pro versus about 88.7% on MMLU, and the wider spread between models makes it a more discriminating benchmark.
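The effect of moving from four to ten options on the guessing floor is simple arithmetic. The sketch below also shows one common way to put scores from benchmarks with different option counts on a comparable scale by rescaling accuracy so that chance maps to 0 and a perfect score to 1. The function names are hypothetical, and this rescaling is an illustration rather than something either benchmark prescribes.

```python
def random_baseline(num_choices):
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_choices


def chance_adjusted(accuracy, num_choices):
    """Rescale accuracy so random guessing is 0.0 and a perfect score is 1.0."""
    base = random_baseline(num_choices)
    return (accuracy - base) / (1.0 - base)


mmlu_floor = random_baseline(4)       # 0.25
mmlu_pro_floor = random_baseline(10)  # 0.1
```

With a lower floor, less of any observed score can be attributed to lucky guessing, so differences between models are easier to interpret.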