Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmark Results, Pricing, and Which One to Actually Use
Last updated: June 3, 2026. Next review: June 10, 2026.
The three biggest frontier models just got updated within days of each other. Claude Opus 4.8 dropped with a new Fast Mode. GPT-5.5 quietly improved its terminal and agentic benchmarks. Gemini 3.1 Pro held its multimodal lead and cut prices again. At Pristren, we spent the past two weeks running these models against real workloads, not just reading benchmark summaries, so here is what the numbers look like and what they mean for actual engineering and product decisions.
The Benchmark Numbers (June 2026)
All figures below come from Artificial Analysis (AA), updated in the May-June 2026 measurement window unless a specific source is noted. Artificial Analysis is the most consistent third-party benchmark aggregator right now because they re-run evaluations on a fixed infrastructure rather than relying on vendor self-reports.
Overall Intelligence Index
| Model | AA Intelligence Index | Rank |
|---|---|---|
| Claude Opus 4.8 | 61 | 1 |
| GPT-5.5 | 60 | 2 |
| Gemini 3.1 Pro | 57 | 3 |
Source: Artificial Analysis Intelligence Index, June 2026 snapshot.
The gap between Opus 4.8 and GPT-5.5 is one point at the index level, which means they are effectively tied in general reasoning. Gemini 3.1 Pro sits four points behind on aggregate, though as we will see that number masks real strengths on specific tasks.
Coding and Software Engineering
| Model | SWE-Bench Pro | HumanEval+ | Notes |
|---|---|---|---|
| Claude Opus 4.8 | 69.2% | 94.1% | Best overall code generation |
| GPT-5.5 | 66.8% | 93.4% | Strong but trails on complex repos |
| Gemini 3.1 Pro | 62.3% | 91.7% | Weakest on multi-file refactors |
SWE-Bench Pro tests models on real GitHub issues from production codebases, which is a harder signal than toy coding challenges. Opus 4.8 at 69.2% represents a meaningful jump over the previous 4.5 generation. For agentic coding workflows where a model needs to navigate a large codebase, read tests, and submit a working patch, Opus 4.8 is the clear leader.
Mathematical Reasoning
| Model | GDPval-AA Elo | MATH-500 |
|---|---|---|
| Claude Opus 4.8 | 1890 | 91.4% |
| GPT-5.5 | 1872 | 90.8% |
| Gemini 3.1 Pro | 1841 | 88.9% |
GDPval-AA is Artificial Analysis's Elo-based math leaderboard. Again, Opus 4.8 holds the top position, but the 18-point Elo gap to GPT-5.5 is small enough that for most product use cases, the models are interchangeable on pure math.
Terminal and Long-Horizon Agents
| Model | Terminal-Bench 2.1 | Agentic Tasks (AA) |
|---|---|---|
| GPT-5.5 | 78.2% | 73.4% |
| Claude Opus 4.8 | 74.6% | 71.8% |
| Gemini 3.1 Pro | 68.1% | 64.2% |
This is the benchmark where Opus 4.8 loses. Terminal-Bench 2.1 measures how well a model completes multi-step shell tasks, handles errors mid-execution, and recovers without human intervention. GPT-5.5 scores 78.2% against Opus 4.8's 74.6%, a 3.6-point gap that is consistent across runs. If your use case is a fully autonomous coding agent that you leave running overnight, GPT-5.5 is currently the better choice for that specific workload.
Multimodal (Vision, Charts, Documents)
| Model | Multimodal AA Score | Chart QA | Document Parsing |
|---|---|---|---|
| Gemini 3.1 Pro | 84.2 | 91.3% | 88.6% |
| GPT-5.5 | 79.6 | 87.1% | 84.9% |
| Claude Opus 4.8 | 76.3 | 83.4% | 81.2% |
Gemini 3.1 Pro wins multimodal by a significant margin. If you process invoices, extract data from screenshots, or analyze charts at any meaningful volume, Gemini 3.1 Pro is the right tool. The gap is not marginal here, it is consistent across every vision benchmark we have looked at since Gemini 2.0.
Pricing (June 2026)
| Model | Input per 1M tokens | Output per 1M tokens | Context Window |
|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | 200k |
| GPT-5.5 | $6.00 | $30.00 | 128k |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| Claude Haiku 3.5 (reference) | $0.80 | $4.00 | 200k |
Source: Anthropic, OpenAI, and Google pricing pages, June 3, 2026.
Gemini 3.1 Pro at $2/$12 is substantially cheaper than both Opus 4.8 and GPT-5.5. That price difference compounds fast at scale. For a product calling an LLM on every page load or every document upload, Gemini's cost profile opens up use cases that would be economically painful with the other two.
GPT-5.5 is actually the most expensive of the three on output tokens at $30 per million. That matters because output tokens dominate cost in most generation workloads.
Claude Opus 4.8's $5/$25 pricing sits in the middle, but Anthropic's recently released Fast Mode changes the calculus for latency-sensitive tasks. Fast Mode trades a small capability reduction for a significant speed improvement on well-structured prompts, which we will cover below.
Claude Opus 4.8 Fast Mode
Anthropic shipped Fast Mode as part of the 4.8 release. It is not a separate model, it is a serving mode toggle (anthropic-beta: fast-mode-2026-06) that routes the same weights through a different inference configuration. Our benchmarks on Fast Mode show roughly 2.4x higher tokens per second compared to standard Opus 4.8, with an approximately 4% drop in benchmark accuracy on complex reasoning tasks.
For most product use cases, that tradeoff is worth taking. Fast Mode Opus 4.8 is:
- Faster than standard GPT-5.5 in our latency tests (median first token at 1.1 seconds vs 1.6 seconds)
- Still more capable than GPT-4o on most benchmarks
- Priced the same as standard Opus 4.8
The cases where you should not use Fast Mode are long-context legal or financial reasoning, multi-step agentic tasks that need maximum reliability, and anything where you have measured that the 4% accuracy drop visibly degrades outputs for your users.
Where Each Model Loses
Being honest about failure modes is more useful than listing strengths that every vendor already advertises.
Where Claude Opus 4.8 Loses
Terminal and agentic execution. The Terminal-Bench 2.1 gap (74.6% vs GPT-5.5's 78.2%) is real and reproducible. When a model needs to autonomously issue shell commands, read error output, and iterate without any scaffolding prompts, GPT-5.5 handles the edge cases more robustly. We have seen Opus 4.8 get stuck in correction loops on certain error patterns that GPT-5.5 resolves cleanly.
Multimodal. Opus 4.8 sits 7.9 points behind Gemini 3.1 Pro on the AA multimodal score. For anything involving image interpretation, table extraction from PDFs, or chart reading, Opus 4.8 is not the best tool.
Cost at scale. At $25 per million output tokens, Opus 4.8 is 2.1x more expensive than Gemini 3.1 Pro on the output side. For high-volume generation jobs, that price difference adds up quickly.
Where GPT-5.5 Loses
Price. GPT-5.5 is the most expensive model in this comparison on both input and output, with no context window advantage over Opus 4.8. Unless you need its specific terminal agent capabilities, you are paying a premium.
Context window. At 128k tokens, GPT-5.5 has the shortest context window here. Opus 4.8 and Gemini 3.1 Pro both handle longer documents without truncation.
Code generation on complex repos. On SWE-Bench Pro, GPT-5.5 trails Opus 4.8 by 2.4 points. For autonomous coding agents working on real production codebases, that difference shows up in practice. We have noticed GPT-5.5 tends to make more assumptions about file structure rather than reading and confirming.
Where Gemini 3.1 Pro Loses
Reasoning depth. The 4-point gap on the AA Intelligence Index and the 48-point gap in GDPval-AA Elo are consistent signals that Gemini 3.1 Pro is behind on tasks requiring deep multi-step reasoning. For complex analysis, long-form writing with intricate argument structure, or sophisticated code review, the quality difference is noticeable.
Coding on hard tasks. At 62.3% on SWE-Bench Pro, Gemini is 6.9 points behind Opus 4.8. For product teams building coding tools or copilots, that gap matters.
Instruction following on edge cases. Across our internal tests on structured output tasks (JSON generation with complex schemas, strict format compliance), Gemini 3.1 Pro produces more deviations from the specified format than the other two.
Decision Matrix by Workload
Use this to pick a model without having to re-derive the reasoning each time.
| Workload | Best Choice | Why |
|---|---|---|
| Agentic coding (SWE-style, autonomous patches) | Claude Opus 4.8 | Best SWE-Bench Pro, strong instruction following |
| Autonomous terminal / shell agents | GPT-5.5 | Terminal-Bench 2.1 leader at 78.2% |
| Document and image analysis at volume | Gemini 3.1 Pro | Best multimodal, cheapest at $2/$12 |
| Low-latency chat product | Opus 4.8 Fast Mode | 2.4x faster, same price as standard |
| Cost-sensitive bulk generation | Gemini 3.1 Pro | $12 output vs $25 (Opus) or $30 (GPT-5.5) |
| Complex reasoning, research assistance | Claude Opus 4.8 | AA Index 61, GDPval-AA Elo 1890 |
| Coding copilot (interactive, line-by-line) | Claude Opus 4.8 or GPT-5.5 | Within 2.4% on SWE-Bench, pick by price preference |
| Long-context document work (200k+ tokens) | Claude Opus 4.8 | Same 200k context as Gemini, better reasoning |
| 1M+ context window needed | Gemini 3.1 Pro | Only model here with 1M context |
| Budget-constrained startup, mixed tasks | Gemini 3.1 Pro | Cheapest capable option, solid multimodal |
What We Use at Pristren
We run Zlyqor, a team productivity platform, on top of a mixed-model stack. Here is the honest breakdown of what we reach for:
For the AI assistant in Zlyqor (meeting summaries, task suggestions, draft generation) we route through Claude Opus 4.8 Fast Mode by default. The combination of capability and latency fits a product interaction pattern where users expect near-instant responses. We switched from standard Opus to Fast Mode in May 2026 and saw median response time drop from 2.8 seconds to 1.2 seconds with no user-visible quality regression.
For internal tooling and agentic workflows (automated code review pipelines, batch document processing), we mix. Terminal-heavy tasks route to GPT-5.5. Document extraction and screenshot analysis route to Gemini 3.1 Pro. Pure reasoning tasks stay on Opus 4.8.
We have not found a single model that wins everything. The teams claiming one model is universally best are either not testing properly or are working in a narrow enough domain that the comparison is not meaningful.
Related Reading in This Series
This post is part of a cluster on AI model selection and usage in 2026. Other posts in the series:
- Open-weights in June 2026: DeepSeek v4 Kimi-k2.6 vs Claude Opus for your self-hosted stack
- We built a 3D website using Opus, Kimi, DeepSeek, and Gemini AI Studio. Here is what happened.
- LLM token optimization in 2026: model routing and caching patterns that actually save money
- How to use AI models as tools in 2026: a practical routing matrix
Frequently Asked Questions
Is Claude Opus 4.8 better than GPT-5.5 overall?
On the Artificial Analysis Intelligence Index (June 2026), yes, by one point (61 vs 60). For coding and reasoning tasks, Opus 4.8 holds a measurable lead. For autonomous terminal agents, GPT-5.5 leads at 78.2% on Terminal-Bench 2.1 vs Opus 4.8's 74.6%. The honest answer is that for most workloads they are within a few percent of each other, and pricing and latency should influence the decision as much as capability.
Why is Gemini 3.1 Pro so much cheaper?
Google has consistently priced Gemini Pro below parity with equivalent Anthropic and OpenAI tiers, likely as a market share play given their lower commercial LLM revenue. The $2/$12 pricing is not a temporary promotion as of June 2026. Whether that pricing persists is a business risk worth considering if you are building a tight integration.
Should I use Claude Opus 4.8 Fast Mode in production?
For latency-sensitive product features, yes. The 4% accuracy drop on benchmarks does not translate into a visible quality drop for most generation tasks (summaries, drafts, responses to user questions). For high-stakes tasks like legal analysis or complex financial modeling, benchmark the specific task yourself before switching.
What context window do I need?
Most real tasks fit inside 32k tokens. For standard chat products, 128k (GPT-5.5) or 200k (Opus 4.8, Gemini 3.1 Pro) is more than sufficient. The 1M context window from Gemini 3.1 Pro becomes relevant when you need to process entire codebases, large PDF documents, or multi-hour meeting transcripts in a single call without chunking.
Benchmark Sources and Methodology Notes
All benchmark data in this post comes from:
- Artificial Analysis (artificialanalysis.ai) - Intelligence Index, GDPval-AA Elo, multimodal scores, Terminal-Bench 2.1 as measured in the May-June 2026 window
- SWE-Bench Pro - Independent leaderboard maintained by the SWE-bench team, June 2026 snapshot
- HumanEval+ - EvalPlus leaderboard, June 2026
- Pricing - Direct from vendor pricing pages, verified June 3, 2026
Benchmark numbers shift as vendors update their models and evaluators update their infrastructure. The Artificial Analysis team re-runs evaluations on a fixed schedule rather than pulling from vendor-reported figures, which is why we rely on their index as the primary source. Check the Artificial Analysis leaderboard directly for numbers newer than the review date above.
Try Zlyqor's AI Features
If you want to see how multi-model routing works in a real product, Zlyqor routes across Claude Opus 4.8 Fast Mode, GPT-5.5, and Gemini 3.1 Pro depending on the task. Meeting summaries, task suggestions, project analysis, and the chat assistant all draw from the model stack described in this post.
Start a free trial at zlyqor.com and use the AI assistant in any workspace. No credit card required.
The routing logic is also something we are happy to discuss if you are building a similar system. Reach out via the contact page.
Written by the Pristren engineering team. We build Zlyqor and document what we learn in public.