For code generation, Claude 3.5 Sonnet leads on SWE-Bench Verified (~49%), the benchmark covered here that most closely resembles real software engineering work. GPT-4o (~38%) and Gemini 1.5 Pro trail it. For everyday coding tasks (function implementation, bug fixes, boilerplate generation), all three are capable. The gaps show up on complex, multi-file, real-world problems.
The Benchmarks That Actually Matter
Not all coding benchmarks are equal. Understanding what each one tests helps you interpret model claims correctly.
HumanEval
HumanEval is OpenAI's dataset of 164 Python coding problems. Each problem consists of a function signature, a docstring, and a set of unit tests. The model is given the signature and docstring and must complete the function body; the unit tests are then run against the result. Pass@1 measures how often a single generated solution passes all of the tests.
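Pass@1 is the k = 1 case of pass@k, which is usually computed with the unbiased estimator from the paper that introduced HumanEval. The sketch below assumes n solutions were sampled for a problem and c of them passed the tests; the function and parameter names are illustrative, not taken from any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement from the n are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical results for three problems, 10 samples each.
results = [(10, 10), (10, 3), (10, 0)]
print(benchmark_score(results, k=1))  # roughly 0.43
```

For k = 1 the estimator reduces to c / n, so with a single sample per problem the benchmark score is simply the fraction of problems solved.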
HumanEval scores for top models (approximate, May 2026): Claude 3.5 Sonnet 92%, GPT-4o 90.2%, Gemini 1.5 Pro ~87%. The problems are well-defined and relatively contained. HumanEval is useful for measuring basic coding ability but does not reflect the complexity of real software engineering.
SWE-Bench Verified
SWE-Bench uses real GitHub issues from popular open-source Python repositories. The task is to generate a code patch that resolves the issue and passes the repository's test suite. SWE-Bench Verified is a curated subset with issues confirmed to be reproducible and well-specified.
This is a much harder benchmark than HumanEval. Success requires understanding an existing codebase, finding the source of a bug, writing a correct fix, and passing tests. Claude 3.5 Sonnet's ~49% on SWE-Bench Verified is the strongest published score for a production API model as of May 2026. GPT-4o is around 38%.
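The pass/fail decision for a single issue can be sketched as follows. This assumes each benchmark instance carries two test lists: tests that fail before the fix and should pass after it, and tests that already pass and must keep passing. The class, field, and test names are hypothetical stand-ins, and the real harness also handles applying the patch and running the tests in an isolated environment.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    fail_to_pass: list[str]  # tests the patch is expected to fix
    pass_to_pass: list[str]  # tests that must not regress


def is_resolved(instance: Instance, results: dict[str, bool]) -> bool:
    """True only if every required test passed after the patch was applied.

    `results` maps test name -> passed?; a missing test counts as a failure.
    """
    required = instance.fail_to_pass + instance.pass_to_pass
    # An instance with no failing test to fix cannot be meaningfully resolved.
    return bool(instance.fail_to_pass) and all(
        results.get(test, False) for test in required
    )


issue = Instance(
    fail_to_pass=["test_handles_empty_input"],
    pass_to_pass=["test_basic", "test_unicode"],
)

# The patch fixes the bug but breaks an existing test, so it does not count.
after_patch = {
    "test_handles_empty_input": True,
    "test_basic": True,
    "test_unicode": False,
}
print(is_resolved(issue, after_patch))  # False
```

The all-or-nothing criterion is part of why scores are so much lower than on HumanEval: a patch that fixes the reported bug but causes a regression elsewhere earns no credit.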
MBPP
Mostly Basic Programming Problems (MBPP) tests Python function generation across a wider range of problem types than HumanEval. It includes 974 crowd-sourced problems, written to be solvable by entry-level programmers, each with a short task description and a few test cases. Top models score 85-90% on MBPP.
LiveCodeBench
LiveCodeBench tests recent competitive programming problems from LeetCode, AtCoder, and CodeForces. Each problem is tagged with its release date, so a model can be scored only on problems published after its training cutoff. That makes the benchmark more resistant to data contamination (models memorizing test answers during training) and a stronger signal of true generalization ability.
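The date-based filtering idea can be sketched in a few lines. The `Problem` record, the problem titles, and the cutoff date below are invented for illustration; the benchmark's actual data format may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    source: str      # e.g. "LeetCode", "AtCoder", "CodeForces"
    released: date   # date the problem was first published


def unseen_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff.
pool = [
    Problem("two-pointer-sum", "LeetCode", date(2023, 11, 2)),
    Problem("grid-paths", "AtCoder", date(2024, 6, 15)),
    Problem("tree-queries", "CodeForces", date(2024, 9, 1)),
]
cutoff = date(2024, 4, 30)

for problem in unseen_problems(pool, cutoff):
    print(problem.title)  # prints grid-paths, then tree-queries
```

Because each model has its own cutoff, the evaluation set differs per model, which is worth keeping in mind when comparing scores side by side.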
Why Code Generation Is Different from Text Generation
Generating code is fundamentally different from generating prose in several important ways.
Correctness is verifiable. Code either runs or it does not. Functions either pass their tests or they fail. This is different from text, where quality is subjective. Because code can be checked mechanically, coding benchmarks can be scored more objectively than writing benchmarks.
Syntax errors are objectively wrong. A mismatched bracket or a missing import is unambiguously incorrect. Models rarely produce basic syntax errors at this point, but they do produce code that is syntactically valid and semantically wrong.
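A small illustration of the difference, using an invented model output: the candidate below parses cleanly but returns the wrong answer for even-length inputs. The helper names are hypothetical, and a real harness would execute untrusted code in a sandbox rather than with a bare `exec`.

```python
import ast

# Hypothetical model output: valid Python, but the median of an
# even-length list should average the two middle values.
CANDIDATE = '''
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]
'''


def is_valid_syntax(source: str) -> bool:
    """True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_tests(source: str) -> bool:
    """Run the candidate and check it against two small test cases."""
    namespace: dict = {}
    exec(source, namespace)  # illustration only; sandbox untrusted code
    median = namespace["median"]
    return median([3, 1, 2]) == 2 and median([1, 2, 3, 4]) == 2.5


print(is_valid_syntax(CANDIDATE))  # True
print(passes_tests(CANDIDATE))     # False
```

A syntax check or linter accepts this code; only a test that exercises the even-length case exposes the bug.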
Security vulnerabilities are invisible to casual review. A model might write code that looks correct and passes tests but contains a SQL injection vulnerability, an insecure deserialization path, or a race condition. Ordinary functional tests rarely catch these, because they exercise expected inputs rather than adversarial input or unlucky timing.
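The SQL injection case is easy to demonstrate with Python's built-in sqlite3 module. In this sketch the table, data, and function names are invented; both lookups return the same result for a normal name, so a functional test passes either way, and only a hostile input separates them.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver treats the input purely as a value.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()

# A typical functional test: both versions behave identically.
print(find_user_unsafe(conn, "alice") == find_user_safe(conn, "alice"))  # True

# A hostile input turns the unsafe query into "match every row".
attack = "x' OR '1'='1"
print(len(find_user_unsafe(conn, attack)))  # 2
print(len(find_user_safe(conn, attack)))    # 0
```

Catching this class of bug takes a security-focused review, static analysis, or tests written with adversarial inputs in mind.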
Library knowledge decays. APIs change. A model trained on code from 2023 may generate function calls that no longer exist in the current version of a library. This gets worse as time passes from the training cutoff.
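One cheap guard is to check that an API a model references still exists in the installed version before relying on it. The helper below is a hypothetical sketch, not a standard tool. The example uses a real case from the standard library: `collections.Mapping` was removed in Python 3.10, and the class now lives only in `collections.abc`.

```python
import importlib
import sys


def api_exists(module_name: str, attr_path: str) -> bool:
    """True if `module_name` imports and exposes the dotted `attr_path`."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk the dotted path one attribute at a time.
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(sys.version_info[:2])
print(api_exists("collections.abc", "Mapping"))  # current location of the class
print(api_exists("collections", "Mapping"))      # old location; False on Python 3.10+
```

This only confirms that a name exists. It does not catch changed signatures or changed behavior, so generated code still needs to be run against the library versions you actually deploy.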