SWE-Bench is one of the most rigorous benchmarks for measuring software engineering capability in language models. It takes real GitHub issues from real open source repositories and asks models to produce code patches that resolve them. Each patch is validated by running the project's own test suite. As of May 2026, Claude 3.7 Sonnet achieves around 70% on the Verified subset, while GPT-4o sits closer to 38%. That gap has practical implications for any team building coding tools.
Why SWE-Bench Is Different From HumanEval
HumanEval, released by OpenAI in 2021, was the standard coding benchmark for years. It consists of 164 Python programming problems, similar to LeetCode-style exercises. Each problem has a docstring, starter code, and test cases. Models complete the function body.
The problem with HumanEval is that it measures the ability to write standalone functions for well-specified toy problems. Real software engineering is nothing like this. Real work involves understanding an existing codebase, reading a bug report with incomplete information, tracing through dependencies to find the root cause, and making a minimal change that does not break existing tests.
SWE-Bench (Jimenez et al. 2024, Princeton) was built to measure this. It collects 2,294 real GitHub issues from 12 popular Python repositories, including Django, Flask, scikit-learn, SymPy, matplotlib, pytest, and astropy. For each issue, the benchmark includes the test case(s) that fail before the fix and the reference patch that the project's maintainers eventually merged.
The evaluation is pass/fail: a model's generated patch either makes the failing tests pass without breaking any previously passing tests, or it does not.
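The pass/fail rule can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function name and the `results` dictionary are hypothetical, while `fail_to_pass` and `pass_to_pass` mirror the dataset's FAIL_TO_PASS and PASS_TO_PASS fields. It assumes a test missing from the results counts as a failure.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
) -> bool:
    """Return True only if the patch fixes the bug and breaks nothing.

    results maps a test name to True (passed) or False (failed) after
    the model's patch is applied. A test absent from results is treated
    as failed (an assumption of this sketch).
    """
    # Every test that failed before the fix must pass now.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    # Every test that passed before the fix must still pass.
    intact = all(results.get(test, False) for test in pass_to_pass)
    return fixed and intact


# Hypothetical test names, for illustration only.
outcome = is_resolved(
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_behavior"],
    results={"test_bug_is_fixed": True, "test_existing_behavior": False},
)
print(outcome)  # False: the bug is fixed, but an existing test broke
```

There is no partial credit under this rule: a patch that fixes the bug but breaks one unrelated test scores the same as no patch at all.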
SWE-Bench Verified vs. SWE-Bench Full
The original SWE-Bench dataset has 2,294 instances. SWE-Bench Verified is a curated subset of 500 instances, released by OpenAI, for which human software engineers reviewed the problem statements and confirmed they are unambiguous and solvable. It is the preferred subset for reporting results because the original dataset has some noisy instances where even the reference patches are debatable.
When you see headlines like "Claude achieves 70% on SWE-Bench," they almost always mean SWE-Bench Verified.
Current Leaderboard (May 2026)
Model performance on SWE-Bench Verified changes frequently as new model versions are released. As of May 2026:
- OpenAI o3: ~72% (with full computer use scaffolding)
- Claude 3.7 Sonnet: ~70%
- Claude 3.5 Sonnet: ~49%
- GPT-4o: ~38%
- Gemini 1.5 Pro: ~26%
These numbers come from the official SWE-Bench leaderboard (swebench.com). The top performers use "agentic" scaffolding where the model can run commands, read files, and iteratively edit code rather than generating a single patch in one shot.
What the Gap Means in Practice
The difference between a 38% model and a 70% model on SWE-Bench is meaningful for engineering teams. It is the difference between resolving roughly two in five real issues correctly without human intervention and resolving roughly seven in ten.
For a team using an AI coding assistant to handle first-pass bug fixes, moving from a 38% model to a 70% model nearly doubles the share of bugs resolved before a human needs to look at them (70 / 38 ≈ 1.8). That is a concrete productivity difference.
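The arithmetic is easier to see from the human side of the queue. The sketch below is a back-of-the-envelope calculation, assuming (optimistically) that the benchmark resolve rate carries over unchanged to a hypothetical backlog of 100 bugs. The function name and backlog size are illustrative.

```python
def bugs_left_for_humans(resolve_rate: float, backlog: int) -> int:
    """Bugs a human still has to handle after the model's first pass."""
    resolved = round(backlog * resolve_rate)
    return backlog - resolved


backlog = 100  # hypothetical backlog size
left_at_38 = bugs_left_for_humans(0.38, backlog)
left_at_70 = bugs_left_for_humans(0.70, backlog)

print(left_at_38)  # 62
print(left_at_70)  # 30
```

Under that assumption, the resolved share grows by about 1.8x, and the human workload drops from 62 bugs to 30, roughly half.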
However, there is an important caveat: SWE-Bench tests Python on a specific set of open source libraries. Performance on your codebase (different language, different architecture, different domain) will differ from SWE-Bench scores. Use SWE-Bench as a directional signal, not an exact predictor of task performance.
The Scaffolding Gap
One of the most important findings from SWE-Bench research is how much scaffolding matters. The same model in different scaffolding setups can have wildly different SWE-Bench scores.
"Scaffolding" refers to the agentic loop around the model: can it execute terminal commands, run tests, read multiple files, search the codebase, make multiple edit-test-revise cycles? A model with good scaffolding can approach a bug like a human engineer would. A model that produces a single patch in one shot without any ability to test is much more limited.
Cognition's Devin, SWE-agent (Princeton), and similar systems are essentially scaffolding systems designed to maximize SWE-Bench performance. They wrap models in agent loops that allow iterative debugging.
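The edit-test-revise cycle described above can be sketched as a small loop. This is not how Devin or SWE-agent are implemented. It is a minimal illustration in which `propose_patch` and `run_tests` are hypothetical callables standing in for the model and the test runner, and the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, defined only for this sketch.
ProposePatch = Callable[[str, str], str]      # (issue, feedback) -> patch
RunTests = Callable[[str], Tuple[bool, str]]  # patch -> (passed, test log)


def solve_with_scaffold(
    issue: str,
    propose_patch: ProposePatch,
    run_tests: RunTests,
    max_rounds: int = 5,
) -> Optional[str]:
    """Iterate propose -> test -> revise until tests pass or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        patch = propose_patch(issue, feedback)
        passed, log = run_tests(patch)
        if passed:
            return patch
        # Feed the failing test output back into the next attempt.
        feedback = log
    return None


# Stub "model": only produces the right patch after seeing a test failure.
def fake_model(issue: str, feedback: str) -> str:
    return "good-patch" if "AssertionError" in feedback else "bad-patch"


# Stub "test runner": accepts exactly one patch.
def fake_tests(patch: str) -> Tuple[bool, str]:
    if patch == "good-patch":
        return True, ""
    return False, "AssertionError: expected 3"


one_shot = solve_with_scaffold("bug report", fake_model, fake_tests, max_rounds=1)
iterative = solve_with_scaffold("bug report", fake_model, fake_tests, max_rounds=3)
print(one_shot)   # None
print(iterative)  # good-patch
```

Setting `max_rounds=1` reproduces the one-shot setting. The same stub model fails there and succeeds once it is allowed to see test output, which is the scaffolding gap in miniature.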
Running SWE-Bench Yourself
The SWE-Bench repository is open source (github.com/princeton-nlp/SWE-bench). Running it requires Docker because each evaluation instance needs the actual repository environment with its dependencies installed.
git clone https://github.com/princeton-nlp/SWE-bench
cd SWE-bench
pip install -e .
# Run evaluation on a specific subset
python -m swebench.harness.run_evaluation --dataset_name princeton-nlp/SWE-bench_Verified --predictions_path ./predictions.json --max_workers 4 --run_id my_eval_run
Running the full benchmark is compute-intensive. Budget several hours and significant disk space for the Docker images.
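The harness reads model output from the predictions file passed to --predictions_path. A minimal sketch of producing that file follows, assuming the harness expects one record per instance with the keys `instance_id`, `model_name_or_path`, and `model_patch`. The helper name, the model name, and the placeholder patch text are illustrative, and the instance ID is shown only as an example of the ID format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_predictions(patches: Dict[str, str], model_name: str, path: Path) -> int:
    """Write one prediction record per instance and return the record count."""
    records = [
        {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,  # unified diff text produced by the model
        }
        for instance_id, patch in patches.items()
    ]
    path.write_text(json.dumps(records, indent=2))
    return len(records)


# Illustrative input: the patch body is a placeholder, not a real diff.
example_patches = {"astropy__astropy-12907": "<unified diff goes here>"}

out_path = Path(tempfile.mkdtemp()) / "predictions.json"
count = write_predictions(example_patches, "my-model", out_path)
print(count)  # 1
```

Check the repository's README for the current expected format before a long run, since a malformed predictions file is an expensive way to discover a typo.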
SWE-Bench Limitations
Like all benchmarks, SWE-Bench has limitations:
- Python only. Real codebases are polyglot. Multi-language variants have appeared, but none is yet as established for TypeScript, Go, or Rust.
- Historical data. The issues were collected up to a certain cutoff date. Models trained after that cutoff may have seen the issues during training.
- Open source only. Enterprise codebases have different characteristics: internal libraries, complex build systems, undocumented conventions.
- Single-file patches. The benchmark is biased toward bugs that can be fixed with a small, localized change. Architectural problems are underrepresented.
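The historical-data concern can be checked roughly by splitting instances around a model's training cutoff. The sketch below assumes each instance carries an ISO 8601 `created_at` timestamp. The instance IDs, dates, and cutoff are made up for illustration, and a later creation date reduces the contamination risk without ruling it out.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def split_by_cutoff(
    instances: List[Dict[str, str]],
    cutoff: datetime,
) -> Tuple[List[str], List[str]]:
    """Return (possibly_seen, likely_unseen) instance IDs for a training cutoff."""
    possibly_seen: List[str] = []
    likely_unseen: List[str] = []
    for inst in instances:
        # Normalize a trailing "Z" so older Python versions can parse it.
        created = datetime.fromisoformat(inst["created_at"].replace("Z", "+00:00"))
        if created > cutoff:
            likely_unseen.append(inst["instance_id"])
        else:
            possibly_seen.append(inst["instance_id"])
    return possibly_seen, likely_unseen


# Made-up records for illustration only.
sample = [
    {"instance_id": "example__repo-1", "created_at": "2021-06-01T12:00:00Z"},
    {"instance_id": "example__repo-2", "created_at": "2023-09-15T08:30:00Z"},
]
model_cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)  # hypothetical cutoff

seen, unseen = split_by_cutoff(sample, model_cutoff)
print(seen)    # ['example__repo-1']
print(unseen)  # ['example__repo-2']
```

Comparing a model's score on the two groups gives a rough signal of how much memorization may be inflating its headline number.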
Despite these limitations, SWE-Bench remains the best available signal for comparing coding model capability. Use it to compare models but validate on your own codebase before committing to a choice.