For code generation, Claude 3.5 Sonnet leads on SWE-Bench Verified (~49%), one of the most realistic coding benchmarks available. GPT-4o (~38%) and Gemini 1.5 Pro trail it. For everyday coding tasks (function implementation, bug fixes, boilerplate generation), all three are capable. The gaps show up on complex, multi-file, real-world problems.
The Benchmarks That Actually Matter
Not all coding benchmarks are equal. Understanding what each one tests helps you interpret model claims correctly.
HumanEval
HumanEval is OpenAI's dataset of 164 Python coding problems. Each problem provides a function signature, a docstring, and a set of unit tests. The model must complete the function. Pass@1 measures how often the model solves it correctly on the first attempt.
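When more than one sample is drawn per problem, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: given n samples of which c pass the tests, it estimates the chance that at least one of k randomly chosen samples is correct. The sketch below is a minimal version of that formula; the function names are illustrative and not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the samples contains no correct solution. math.comb returns 0
    # when k > n - c, which makes pass@k equal to 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, which matches the first-attempt reading of Pass@1.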
HumanEval scores for top models (approximate, May 2026): Claude 3.5 Sonnet 92%, GPT-4o 90.2%, Gemini 1.5 Pro ~87%. The problems are well-defined and relatively contained. HumanEval is useful for measuring basic coding ability but does not reflect the complexity of real software engineering.
SWE-Bench Verified
SWE-Bench uses real GitHub issues from popular open-source Python repositories. The task is to generate a code patch that resolves the issue and passes the repository's test suite. SWE-Bench Verified is a curated subset with issues confirmed to be reproducible and well-specified.
This is a much harder benchmark than HumanEval. Success requires understanding an existing codebase, finding the source of a bug, writing a correct fix, and passing tests. Claude 3.5 Sonnet's ~49% on SWE-Bench Verified is the strongest published score for a production API model as of May 2026. GPT-4o is around 38%.
MBPP
Mostly Basic Programming Problems (MBPP) tests Python function generation across a wider range of problem types than HumanEval. It includes 974 crowd-sourced problems at varying difficulty levels. Top models score 85-90% on MBPP.
LiveCodeBench
LiveCodeBench tests recent competitive programming problems from LeetCode, AtCoder, and CodeForces. Because a model can be evaluated only on problems published after its training cutoff, this benchmark is more resistant to data contamination (models memorizing test answers during training). It is a stronger signal of true generalization ability.
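The contamination argument comes down to a date filter. The sketch below illustrates the idea only: the `Problem` record, its field names, and the sample entries are hypothetical and do not reflect LiveCodeBench's actual schema or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record for illustration, not the benchmark's real format.
    title: str
    source: str
    released: date


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up sample entries.
problems = [
    Problem("two-pointer warmup", "LeetCode", date(2023, 3, 1)),
    Problem("grid paths", "AtCoder", date(2024, 6, 15)),
]
fresh = uncontaminated(problems, training_cutoff=date(2024, 1, 1))
```

Only the second problem survives the filter here, so it is the only one that would count toward the score of a model with that cutoff.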
Why Code Generation Is Different from Text Generation
Generating code is fundamentally different from generating prose in several important ways.
Correctness is verifiable. Code either runs or it does not. Functions either pass tests or they fail. This is different from text, where quality is subjective. Because code can be checked mechanically, coding ability can be measured more objectively than writing ability.
Syntax errors are objectively wrong. A mismatched bracket or a missing import is unambiguously incorrect. Models rarely produce basic syntax errors at this point, but they do produce code that is syntactically valid and semantically wrong.
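The difference between the two kinds of check can be shown in a few lines. This is a minimal sketch: the `generated` snippet is a made-up stand-in for model output, and the single assertion stands in for a real test suite.

```python
import ast

# Hypothetical model output: it parses cleanly but returns the minimum
# instead of the maximum.
generated = """
def largest(values):
    return sorted(values)[0]
"""


def is_valid_syntax(source: str) -> bool:
    """Syntax check only: does the source parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_tests(source: str) -> bool:
    """Semantic check: run the code and compare behavior with expectations."""
    namespace: dict = {}
    # exec runs arbitrary code; untrusted output belongs in a sandbox.
    exec(source, namespace)
    try:
        assert namespace["largest"]([3, 1, 2]) == 3
        return True
    except AssertionError:
        return False


syntax_ok = is_valid_syntax(generated)
tests_ok = passes_tests(generated)
```

Here `syntax_ok` is true and `tests_ok` is false, which is the common failure mode: valid code that does the wrong thing.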
Security vulnerabilities are invisible to casual review. A model might write code that looks correct and passes tests but contains a SQL injection vulnerability, an insecure deserialization path, or a race condition. Functional tests that only check expected outputs will usually not expose these.
Library knowledge decays. APIs change. A model trained on code from 2023 may generate function calls that no longer exist in the current version of a library. This gets worse as time passes from the training cutoff.
How to Get Better Code from LLMs
The quality of code generation output is highly sensitive to prompt quality. These practices make a meaningful difference.
Provide language and version context. Tell the model what language and version you are using. "Write this in TypeScript 5 using the ES2022 module format" produces better results than "write this in TypeScript."
Include existing patterns. Paste relevant examples from your codebase. If you have a standard way of handling errors, or a particular pattern for database queries, show the model a real example before asking it to write something new. It will match the pattern.
Specify constraints explicitly. "This function should not make any network calls," "the output must be idempotent," "avoid mutable state" — constraints that seem obvious to you are not obvious to the model. State them.
Ask for tests alongside implementation. "Implement this function and write unit tests for it" tends to produce more correct implementations than asking for the implementation alone. Writing tests pushes the model to work through edge cases.
Iterate rather than generate everything at once. For complex code, generate in stages. Get the structure right first, then fill in implementations, then add error handling. Each stage is simpler than the whole.
Review critically before using. Never commit model-generated code without reading it. Understand what it does. Look for the bugs listed in the section below.
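The prompt-side practices above (version context, existing patterns, explicit constraints, tests alongside implementation) can be combined in a small helper. This is one possible sketch: `build_code_prompt`, its parameters, and the sample values are all illustrative, and the exact wording that works best will vary by model.

```python
def build_code_prompt(
    task: str,
    language: str,
    version: str,
    example: str,
    constraints: list[str],
) -> str:
    """Assemble a prompt with version context, a house pattern, constraints, and a test request."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Language: {language} {version}\n\n"
        f"Follow the style of this existing code from our codebase:\n{example}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        f"Task: {task}\n"
        "Implement the function and write unit tests for it."
    )


# Hypothetical inputs for illustration.
prompt = build_code_prompt(
    task="Parse an ISO 8601 date string into a Date.",
    language="TypeScript",
    version="5",
    example="export function parseId(raw: string): Result<Id, ParseError> { ... }",
    constraints=[
        "This function should not make any network calls",
        "Avoid mutable state",
    ],
)
```

Keeping the prompt in a function also makes the staged approach easier: the same context can be reused for the structure pass, the implementation pass, and the error-handling pass.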
The Limits of LLM Code Generation
Even the best models have systematic failure patterns in code generation.
Subtle logical bugs. Models generate plausible-looking code that is logically wrong in non-obvious ways. Off-by-one errors in loop bounds, incorrect handling of empty arrays, wrong assumptions about concurrency, and subtle state management bugs all appear in generated code. Unit tests help but do not catch all of these.
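An off-by-one error in a loop bound is a typical case. In the hypothetical example below, the buggy version still returns plausible output for most inputs, so a loose test would pass; only a test that checks the boundary exposes it.

```python
def sliding_windows_buggy(items: list, size: int) -> list[list]:
    # Off-by-one: the range stops one step early, so the final window is dropped.
    return [items[i:i + size] for i in range(len(items) - size)]


def sliding_windows(items: list, size: int) -> list[list]:
    # Corrected bound: there are len(items) - size + 1 windows of the given size.
    # For inputs shorter than size the range is empty, so the result is [].
    return [items[i:i + size] for i in range(len(items) - size + 1)]


# A loose test that only checks the first window passes for both versions.
assert sliding_windows_buggy([1, 2, 3, 4], 2)[0] == [1, 2]
assert sliding_windows([1, 2, 3, 4], 2)[0] == [1, 2]

# A boundary test tells them apart.
assert sliding_windows([1, 2], 2) == [[1, 2]]
assert sliding_windows_buggy([1, 2], 2) == []
```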
Security vulnerabilities. LLMs are not security engineers. Code that processes user input, interacts with databases, or manages authentication requires security review regardless of who wrote it. Common vulnerabilities in generated code include: missing input validation, SQL injection via string concatenation, insecure defaults, and missing authorization checks. Run generated code through a security scanner and review any code that touches sensitive operations.
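SQL injection via string concatenation is easy to reproduce with Python's built-in `sqlite3` module. The table, function names, and payload below are made up for illustration. Both functions return the same result for an ordinary name, which is why a happy-path test does not tell them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])


def find_user_unsafe(name: str) -> list[tuple]:
    # Vulnerable: user input becomes part of the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list[tuple]:
    # Parameterized: the value is bound separately from the SQL text.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)   # the injected condition matches every row
blocked = find_user_safe(payload)    # treated as a literal name, matches nothing
```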
Performance issues. A model may generate code that is functionally correct but algorithmically inefficient. Common examples that tests do not catch include an O(n^2) solution where O(n log n) is possible, N+1 database query patterns, and SQL queries that have no supporting index.
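A small illustration of the first case: both functions below are hypothetical, return the same answers, and pass the same tests, but only one of them scales.

```python
def has_duplicates_quadratic(values: list[int]) -> bool:
    # O(n^2): compares every pair of elements.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_sorted(values: list[int]) -> bool:
    # O(n log n): after sorting, any duplicates sit next to each other.
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```

For hashable values a set-based check is typically faster still. The point is that correctness tests alone give no signal about which version was generated.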
Hallucinated APIs. Models will sometimes call functions, methods, or parameters that do not exist. The code looks plausible because the invented API follows the naming conventions of the library. Always check that every API call in generated code actually exists in the version you are using.
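For Python, a rough first check is to confirm that each referenced name resolves in the environment you actually run. The helper below is a minimal sketch with an illustrative name (`api_exists`); it verifies that an attribute exists, not that its parameters match, for which `inspect.signature` is a reasonable next step.

```python
import importlib


def api_exists(module_name: str, dotted_attr: str) -> bool:
    """Return True if module_name imports and dotted_attr resolves on it."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk the dotted path one attribute at a time.
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


real = api_exists("json", "loads")
invented = api_exists("json", "loads_fast")  # plausible-sounding but made-up name
```

Because the check runs against the installed version, it also catches calls that existed at training time and have since been removed.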
Context loss in long files. When the model is generating a long file or a complex implementation, it can lose track of earlier decisions. A variable name changes, a helper function is defined twice with slightly different behavior, or a constraint mentioned early in the conversation is forgotten. For long implementations, review the full output for consistency.
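Part of that consistency review can be automated. As one example, the sketch below uses Python's `ast` module to flag functions defined more than once at the top level of a generated file; the function name and the sample source are illustrative, and it covers only this one symptom.

```python
import ast
from collections import Counter


def duplicate_function_names(source: str) -> list[str]:
    """Names defined more than once by def or async def at module top level."""
    tree = ast.parse(source)
    names = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical generated file in which a helper was redefined with different behavior.
generated = '''
def normalize(s):
    return s.strip().lower()

def handler(s):
    return normalize(s)

def normalize(s):
    return s.strip()
'''
dupes = duplicate_function_names(generated)
```

In Python the later definition silently replaces the earlier one, so `handler` ends up calling the version without `.lower()`, and nothing fails at import time.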
Choosing the Right Model for Your Coding Use Case
For everyday coding tasks (writing utility functions, generating boilerplate, explaining code), GPT-4o and Claude 3.5 Sonnet are interchangeable in practice. Choose based on your existing setup.
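For complex multi-step coding tasks (implementing a feature that touches multiple files, fixing bugs in existing code, refactoring), Claude 3.5 Sonnet's SWE-Bench advantage becomes meaningful. The gap of roughly 11 percentage points on SWE-Bench Verified (~49% versus ~38%) does not guarantee a better result on any single task, but it is large enough to notice across many.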
For code completion in an IDE, speed matters as much as quality. Smaller, faster models like Claude Haiku or GPT-4o mini provide lower latency that makes completion feel natural. Many IDE integrations use smaller models like these.
For security-sensitive code, use the best available model and then do a dedicated security review regardless of model quality. No current model reliably catches its own security mistakes.