Why HumanEval Is Not Enough
HumanEval and MBPP test LLMs on self-contained coding problems with simple function signatures. Real software engineering involves understanding a large existing codebase, reading issue descriptions and bug reports, making targeted changes to multiple files, and passing an existing test suite you did not write. SWE-Bench (arXiv:2310.06770) by Jimenez et al. tests exactly this.
The Benchmark Construction
SWE-Bench collected 2,294 task instances, each pairing a GitHub issue with the pull request that resolved it, from 12 popular Python repositories:
- pytest, sympy, matplotlib, scikit-learn, requests, astropy, flask, pylint, django, seaborn, sphinx, xarray
For each issue, the benchmark captures:
- The GitHub issue description (bug report or feature request)
- The repository state at the time of the issue (the "broken" state)
- The gold-patch fix that was merged to resolve the issue
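- The tests that fail before the fix and pass after it, plus the tests that already pass and must keep passing
The task: given the issue description and the repository, produce a patch that makes the failing tests pass without breaking the tests that already passed.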
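To make the shape of a task concrete, here is a minimal sketch of one instance as a Python data structure. The field names loosely mirror those of the public dataset as I understand it (the dataset itself spells the test lists in upper case), and the example values are hypothetical, not a real benchmark instance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """One benchmark task (illustrative layout, not the official schema)."""
    instance_id: str
    repo: str
    base_commit: str          # repository state before the fix
    problem_statement: str    # GitHub issue text shown to the model
    patch: str                # gold patch, hidden from the model
    fail_to_pass: List[str] = field(default_factory=list)  # must turn green
    pass_to_pass: List[str] = field(default_factory=list)  # must stay green


def model_inputs(task: TaskInstance) -> dict:
    """Return only what the model may see: the issue and the code location."""
    return {
        "repo": task.repo,
        "base_commit": task.base_commit,
        "problem_statement": task.problem_statement,
    }


# Hypothetical example values, not a real benchmark instance.
example = TaskInstance(
    instance_id="example__project-1",
    repo="example/project",
    base_commit="abc123",
    problem_statement="parse_date() crashes on empty strings",
    patch="diff --git a/project/dates.py b/project/dates.py\n...",
    fail_to_pass=["tests/test_dates.py::test_empty_string"],
    pass_to_pass=["tests/test_dates.py::test_iso_format"],
)

visible = model_inputs(example)
print(sorted(visible))
```

The point of `model_inputs` is that the gold patch and the test lists never reach the model; they are used only for scoring.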
Evaluation: Pass@1 With Existing Tests
Evaluation is binary: once the model's patch is applied, the designated tests either all pass (success) or they do not (failure). There is no partial credit and no LLM-as-judge; each repository's real test suite determines correctness. This is a significantly harder and more objective evaluation than asking a human or an LLM to judge code quality.
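The all-or-nothing rule can be sketched in a few lines. This is an illustration of the scoring logic described above, not the official harness; the function names and the `results` mapping (test name to pass/fail) are assumptions made for the example.

```python
from typing import Dict, List


def is_resolved(fail_to_pass: List[str], pass_to_pass: List[str],
                results: Dict[str, bool]) -> bool:
    """True only if every listed test passed; a missing test counts as failed."""
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of instances resolved, with no partial credit per instance."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical test results after applying a model's patch.
results = {"test_a": True, "test_b": True, "test_c": False}

fixed_and_intact = is_resolved(["test_a"], ["test_b"], results)
fixed_but_broke_other = is_resolved(["test_a"], ["test_c"], results)
rate = resolved_rate([True, False, False, True])

print(fixed_and_intact, fixed_but_broke_other, rate)
```

Note that a patch which fixes the target test but breaks a previously passing one scores zero for that instance, exactly like a patch that fixes nothing.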
Score Progression Across Models
| Model / System | SWE-Bench Verified (%) |
|----------------|------------------------|
| GPT-4 (2023 baseline) | 1.7% |
| Claude 3.5 Sonnet (2024) | 49.0% |
| Agentless (open-source) | 50.8% |
| Claude 3.7 Sonnet (2025) | ~70% |
| Top open-source agents | ~55% |
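The jump from GPT-4's 1.7% to the 50%+ of recent systems illustrates how much AI coding agents improved in roughly two years. One caveat: the GPT-4 baseline dates from 2023, before the Verified subset existed, so it was not measured on exactly the same instances as the later rows.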
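```bash
# Running SWE-Bench evaluation (simplified)
git clone https://github.com/princeton-nlp/SWE-bench
cd SWE-bench
pip install -e .

# Run inference with your model/agent.
# run_inference.py is a placeholder for your own inference script; it should
# write one prediction per instance (here to ./predictions/preds.jsonl).
python run_inference.py \
  --model_name "claude-3-5-sonnet" \
  --dataset_path "princeton-nlp/SWE-bench_Verified" \
  --output_dir ./predictions/

# Evaluate predictions.
# Recent harness versions run each instance inside Docker. Entry points and
# flags have changed between versions, so check the repository README.
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path ./predictions/preds.jsonl \
  --max_workers 4 \
  --run_id my_first_run
```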
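The evaluation step consumes a predictions file. As a rough sketch, assuming the harness expects one JSON object per line with the keys `instance_id`, `model_name_or_path`, and `model_patch` (check the repository README for your version), the file could be written like this. The instance ID, model label, and diff below are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_predictions(path: Path, patches: Dict[str, str], model_name: str) -> int:
    """Write one JSON object per line: instance ID, model label, unified diff."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for instance_id, diff in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": diff,
            }
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


# Hypothetical patch for a hypothetical instance.
patches = {
    "example__project-1": "diff --git a/project/dates.py b/project/dates.py\n...",
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "preds.jsonl"
    written = write_predictions(out_file, patches, "my-agent")
    lines = out_file.read_text(encoding="utf-8").splitlines()

first = json.loads(lines[0])
print(written, first["instance_id"])
```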
SWE-Bench Verified: A Cleaner Subset
The original 2,294 instances include some underspecified issues, as well as some tests that reject otherwise valid fixes, so even a correct patch can be scored as a failure. SWE-Bench Verified is a 500-instance subset screened by human annotators (professional software developers, in an effort led by OpenAI together with the benchmark's authors): each retained issue was confirmed to be unambiguous, solvable, and correctly specified. This subset is now the primary leaderboard for fair comparison.
What Good Agents Do Differently
Low-performing approaches generate patches without understanding the codebase. High-performing systems (such as agents built on Claude 3.7 Sonnet, or the Agentless pipeline) typically:
- Read and understand the full repository structure first
- Identify the specific files and functions related to the issue
- Write localized, targeted changes rather than large rewrites
- Run the test suite locally to verify the fix before submitting
- Handle file navigation, import resolution, and test interpretation
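The localization step in the list above can be illustrated with a deliberately simple heuristic: rank files by how many identifier-like words they share with the issue text. This is a toy sketch of the idea only, not how Claude-based agents or Agentless actually localize code; the function names and the in-memory `files` mapping are assumptions for the example.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercased identifier-like tokens with at least three characters."""
    return {tok for tok in re.findall(r"[a-z_][a-z0-9_]*", text.lower())
            if len(tok) >= 3}


def rank_files(issue: str, files: Dict[str, str],
               top_k: int = 3) -> List[Tuple[str, int]]:
    """Rank files by how many issue tokens appear in their path or contents."""
    issue_tokens = tokenize(issue)
    scored = []
    for path, source in files.items():
        overlap = issue_tokens & tokenize(path + " " + source)
        scored.append((path, len(overlap)))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Hypothetical two-file repository and issue text.
issue = "parse_date raises ValueError on empty string"
files = {
    "project/dates.py": (
        "def parse_date(value):\n"
        "    if not value:\n"
        "        raise ValueError('empty')\n"
    ),
    "project/http.py": "def fetch(url):\n    return url\n",
}

ranked = rank_files(issue, files)
print(ranked)
```

Real systems replace the overlap score with stronger signals (repository structure, embeddings, or model-driven navigation), but the goal is the same: narrow thousands of files down to the few worth editing.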
Why SWE-Bench Matters for AI Coding Tools
SWE-Bench scores are widely used as a proxy for the real-world usefulness of AI coding assistants, although a benchmark score does not guarantee performance on any particular codebase. A system that resolves about 50% of SWE-Bench Verified instances has shown that it can fix real bugs and implement real feature requests in mature Python projects. The benchmark has become the de facto standard for evaluating AI software engineering capability.