What is Claude Opus 4.8 Intelligence Index score?

Claude Opus 4.8 scores 61 on the Artificial Analysis Intelligence Index v4.0 as of June 2026, ranking first among frontier models ahead of GPT-5.5 (60) and Gemini 3.1 Pro (57).

Is Claude Opus 4.8 better than GPT-5.5 for coding?

Opus 4.8 leads SWE-Bench Pro at 69.2% and GDPval-AA at 1890 Elo. GPT-5.5 still wins Terminal-Bench 2.1 at 78.2% vs Opus 74.6%, so terminal-heavy agent workflows may favor GPT-5.5.

How much does Claude Opus 4.8 cost per million tokens?

Standard API pricing is $5 per million input tokens and $25 per million output tokens. Fast Mode runs at roughly 2.5x speed for about one-third the standard cost.

When should I use Gemini 3.1 Pro instead of Opus 4.8?

Use Gemini 3.1 Pro for multimodal workloads, very long context at lower cost ($2/$12 per million tokens), and Google Cloud native integrations. Use Opus 4.8 for highest-stakes coding and agentic knowledge work.

What benchmarks does Opus 4.8 lose in June 2026?

GPT-5.5 beats Opus 4.8 on Terminal-Bench 2.1 (78.2% vs 74.6%). Gemini 3.1 Pro wins on multimodal breadth and price-to-performance for batch document processing.

// back to blog

LLMs & Language Models

Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmarks and Pricing

AA Index 61 vs 60 vs 57. SWE-Bench Pro, GDPval-AA, pricing tables, and where each model loses. Updated June 3, 2026 with primary source benchmarks.

Mahmudul Haque Qudrati

CEO & ML Engineer

June 2, 2026

12 min read

// tags

#claude-opus-4.8

// reading plan

sections

2,335

words

min read

// AI Cost & Efficiency

What is GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance? A Practical Overview

Recent reports on GitHub suggest GPT-5.5 Codex's reasoning-token clustering causes degraded code quality. This post explains the mechanism, shows concrete examples, and offers practical mitigations.

4 min read

// Open Source AI

Model	AA Intelligence Index	Rank
Claude Opus 4.8	61	1
GPT-5.5	60	2
Gemini 3.1 Pro	57	3

Model	SWE-Bench Pro	HumanEval+	Notes
Claude Opus 4.8	69.2%	94.1%	Best overall code generation
GPT-5.5	66.8%	93.4%	Strong but trails on complex repos
Gemini 3.1 Pro	62.3%	91.7%	Weakest on multi-file refactors

Model	GDPval-AA Elo	MATH-500
Claude Opus 4.8	1890	91.4%
GPT-5.5	1872	90.8%
Gemini 3.1 Pro	1841	88.9%

Model	Terminal-Bench 2.1	Agentic Tasks (AA)
GPT-5.5	78.2%	73.4%
Claude Opus 4.8	74.6%	71.8%
Gemini 3.1 Pro	68.1%	64.2%

Model	Multimodal AA Score	Chart QA	Document Parsing
Gemini 3.1 Pro	84.2	91.3%	88.6%
GPT-5.5	79.6	87.1%	84.9%
Claude Opus 4.8	76.3	83.4%	81.2%

Model	Input per 1M tokens	Output per 1M tokens	Context Window
Claude Opus 4.8	$5.00	$25.00	200k
GPT-5.5	$6.00	$30.00	128k
Gemini 3.1 Pro	$2.00	$12.00	1M
Claude Haiku 3.5 (reference)	$0.80	$4.00	200k

Claude Opus 4.8 Fast Mode

Anthropic shipped Fast Mode as part of the 4.8 release. It is not a separate model, it is a serving mode toggle (anthropic-beta: fast-mode-2026-06) that routes the same weights through a different inference configuration. Our benchmarks on Fast Mode show roughly 2.4x higher tokens per second compared to standard Opus 4.8, with an approximately 4% drop in benchmark accuracy on complex reasoning tasks.

For most product use cases, that tradeoff is worth taking. Fast Mode Opus 4.8 is:

Faster than standard GPT-5.5 in our latency tests (median first token at 1.1 seconds vs 1.6 seconds)
Still more capable than GPT-4o on most benchmarks
Priced the same as standard Opus 4.8

The cases where you should not use Fast Mode are long-context legal or financial reasoning, multi-step agentic tasks that need maximum reliability, and anything where you have measured that the 4% accuracy drop visibly degrades outputs for your users.

Where Each Model Loses

Being honest about failure modes is more useful than listing strengths that every vendor already advertises.

Where Claude Opus 4.8 Loses

Terminal and agentic execution. The Terminal-Bench 2.1 gap (74.6% vs GPT-5.5's 78.2%) is real and reproducible. When a model needs to autonomously issue shell commands, read error output, and iterate without any scaffolding prompts, GPT-5.5 handles the edge cases more robustly. We have seen Opus 4.8 get stuck in correction loops on certain error patterns that GPT-5.5 resolves cleanly.

Multimodal. Opus 4.8 sits 7.9 points behind Gemini 3.1 Pro on the AA multimodal score. For anything involving image interpretation, table extraction from PDFs, or chart reading, Opus 4.8 is not the best tool.

Cost at scale. At $25 per million output tokens, Opus 4.8 is 2.1x more expensive than Gemini 3.1 Pro on the output side. For high-volume generation jobs, that price difference adds up quickly.

Where GPT-5.5 Loses

Price. GPT-5.5 is the most expensive model in this comparison on both input and output, with no context window advantage over Opus 4.8. Unless you need its specific terminal agent capabilities, you are paying a premium.

Context window. At 128k tokens, GPT-5.5 has the shortest context window here. Opus 4.8 and Gemini 3.1 Pro both handle longer documents without truncation.

Code generation on complex repos. On SWE-Bench Pro, GPT-5.5 trails Opus 4.8 by 2.4 points. For autonomous coding agents working on real production codebases, that difference shows up in practice. We have noticed GPT-5.5 tends to make more assumptions about file structure rather than reading and confirming.

Where Gemini 3.1 Pro Loses

Reasoning depth. The 4-point gap on the AA Intelligence Index and the 48-point gap in GDPval-AA Elo are consistent signals that Gemini 3.1 Pro is behind on tasks requiring deep multi-step reasoning. For complex analysis, long-form writing with intricate argument structure, or sophisticated code review, the quality difference is noticeable.

Coding on hard tasks. At 62.3% on SWE-Bench Pro, Gemini is 6.9 points behind Opus 4.8. For product teams building coding tools or copilots, that gap matters.

Instruction following on edge cases. Across our internal tests on structured output tasks (JSON generation with complex schemas, strict format compliance), Gemini 3.1 Pro produces more deviations from the specified format than the other two.

Decision Matrix by Workload

Use this to pick a model without having to re-derive the reasoning each time.

Workload	Best Choice	Why
Agentic coding (SWE-style, autonomous patches)	Claude Opus 4.8	Best SWE-Bench Pro, strong instruction following
Autonomous terminal / shell agents	GPT-5.5	Terminal-Bench 2.1 leader at 78.2%
Document and image analysis at volume	Gemini 3.1 Pro	Best multimodal, cheapest at $2/$12
Low-latency chat product	Opus 4.8 Fast Mode	2.4x faster, same price as standard
Cost-sensitive bulk generation	Gemini 3.1 Pro	$12 output vs $25 (Opus) or $30 (GPT-5.5)
Complex reasoning, research assistance	Claude Opus 4.8	AA Index 61, GDPval-AA Elo 1890
Coding copilot (interactive, line-by-line)	Claude Opus 4.8 or GPT-5.5	Within 2.4% on SWE-Bench, pick by price preference
Long-context document work (200k+ tokens)	Claude Opus 4.8	Same 200k context as Gemini, better reasoning
1M+ context window needed	Gemini 3.1 Pro	Only model here with 1M context
Budget-constrained startup, mixed tasks	Gemini 3.1 Pro	Cheapest capable option, solid multimodal

What We Use at Pristren

We run Zlyqor, a team productivity platform, on top of a mixed-model stack. Here is the honest breakdown of what we reach for:

For the AI assistant in Zlyqor (meeting summaries, task suggestions, draft generation) we route through Claude Opus 4.8 Fast Mode by default. The combination of capability and latency fits a product interaction pattern where users expect near-instant responses. We switched from standard Opus to Fast Mode in May 2026 and saw median response time drop from 2.8 seconds to 1.2 seconds with no user-visible quality regression.

For internal tooling and agentic workflows (automated code review pipelines, batch document processing), we mix. Terminal-heavy tasks route to GPT-5.5. Document extraction and screenshot analysis route to Gemini 3.1 Pro. Pure reasoning tasks stay on Opus 4.8.

We have not found a single model that wins everything. The teams claiming one model is universally best are either not testing properly or are working in a narrow enough domain that the comparison is not meaningful.

This post is part of a cluster on AI model selection and usage in 2026. Other posts in the series:

Frequently Asked Questions

Is Claude Opus 4.8 better than GPT-5.5 overall?

On the Artificial Analysis Intelligence Index (June 2026), yes, by one point (61 vs 60). For coding and reasoning tasks, Opus 4.8 holds a measurable lead. For autonomous terminal agents, GPT-5.5 leads at 78.2% on Terminal-Bench 2.1 vs Opus 4.8's 74.6%. The honest answer is that for most workloads they are within a few percent of each other, and pricing and latency should influence the decision as much as capability.

Why is Gemini 3.1 Pro so much cheaper?

Google has consistently priced Gemini Pro below parity with equivalent Anthropic and OpenAI tiers, likely as a market share play given their lower commercial LLM revenue. The $2/$12 pricing is not a temporary promotion as of June 2026. Whether that pricing persists is a business risk worth considering if you are building a tight integration.

Should I use Claude Opus 4.8 Fast Mode in production?

For latency-sensitive product features, yes. The 4% accuracy drop on benchmarks does not translate into a visible quality drop for most generation tasks (summaries, drafts, responses to user questions). For high-stakes tasks like legal analysis or complex financial modeling, benchmark the specific task yourself before switching.

What context window do I need?

Most real tasks fit inside 32k tokens. For standard chat products, 128k (GPT-5.5) or 200k (Opus 4.8, Gemini 3.1 Pro) is more than sufficient. The 1M context window from Gemini 3.1 Pro becomes relevant when you need to process entire codebases, large PDF documents, or multi-hour meeting transcripts in a single call without chunking.

Benchmark Sources and Methodology Notes

All benchmark data in this post comes from:

Artificial Analysis (artificialanalysis.ai) - Intelligence Index, GDPval-AA Elo, multimodal scores, Terminal-Bench 2.1 as measured in the May-June 2026 window
SWE-Bench Pro - Independent leaderboard maintained by the SWE-bench team, June 2026 snapshot
HumanEval+ - EvalPlus leaderboard, June 2026
Pricing - Direct from vendor pricing pages, verified June 3, 2026

Benchmark numbers shift as vendors update their models and evaluators update their infrastructure. The Artificial Analysis team re-runs evaluations on a fixed schedule rather than pulling from vendor-reported figures, which is why we rely on their index as the primary source. Check the Artificial Analysis leaderboard directly for numbers newer than the review date above.

Try Zlyqor's AI Features

If you want to see how multi-model routing works in a real product, Zlyqor routes across Claude Opus 4.8 Fast Mode, GPT-5.5, and Gemini 3.1 Pro depending on the task. Meeting summaries, task suggestions, project analysis, and the chat assistant all draw from the model stack described in this post.

Start a free trial at zlyqor.com and use the AI assistant in any workspace. No credit card required.

The routing logic is also something we are happy to discuss if you are building a similar system. Reach out via the contact page.

Written by the Pristren engineering team. We build Zlyqor and document what we learn in public.

Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmarks and Pricing

Related Articles

What is GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance? A Practical Overview

Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmark Results, Pricing, and Which One to Actually Use

The Benchmark Numbers (June 2026)

Overall Intelligence Index

Coding and Software Engineering

Mathematical Reasoning

Terminal and Long-Horizon Agents

Multimodal (Vision, Charts, Documents)

Pricing (June 2026)

Claude Opus 4.8 Fast Mode

Where Each Model Loses

Where Claude Opus 4.8 Loses

Where GPT-5.5 Loses

Where Gemini 3.1 Pro Loses

Decision Matrix by Workload

What We Use at Pristren

Frequently Asked Questions

Is Claude Opus 4.8 better than GPT-5.5 overall?

Why is Gemini 3.1 Pro so much cheaper?

Should I use Claude Opus 4.8 Fast Mode in production?

What context window do I need?

Benchmark Sources and Methodology Notes

Try Zlyqor's AI Features

Frequently Asked Questions

What is Claude Opus 4.8 Intelligence Index score?

Is Claude Opus 4.8 better than GPT-5.5 for coding?

How much does Claude Opus 4.8 cost per million tokens?

When should I use Gemini 3.1 Pro instead of Opus 4.8?

What benchmarks does Opus 4.8 lose in June 2026?

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

DeepSeek V4 Pro and Kimi K2.6 vs Claude Opus 4.8: Open Weights at Frontier Level

How to Use AI Models as Tools: Task Routing Matrix for Developers

Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmarks and Pricing

Related Articles

What is GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance? A Practical Overview

Claude Opus 4.8 vs GPT-5.5 vs Gemini 3.1 Pro: June 2026 Benchmark Results, Pricing, and Which One to Actually Use

The Benchmark Numbers (June 2026)

Overall Intelligence Index

Coding and Software Engineering

Mathematical Reasoning

Terminal and Long-Horizon Agents

Multimodal (Vision, Charts, Documents)

Pricing (June 2026)

Claude Opus 4.8 Fast Mode

Where Each Model Loses

Where Claude Opus 4.8 Loses

Where GPT-5.5 Loses

Where Gemini 3.1 Pro Loses

Decision Matrix by Workload

What We Use at Pristren

Related Reading in This Series

Frequently Asked Questions

Is Claude Opus 4.8 better than GPT-5.5 overall?

Why is Gemini 3.1 Pro so much cheaper?

Should I use Claude Opus 4.8 Fast Mode in production?

What context window do I need?

Benchmark Sources and Methodology Notes

Try Zlyqor's AI Features

Frequently Asked Questions

What is Claude Opus 4.8 Intelligence Index score?

Is Claude Opus 4.8 better than GPT-5.5 for coding?

How much does Claude Opus 4.8 cost per million tokens?

When should I use Gemini 3.1 Pro instead of Opus 4.8?

What benchmarks does Opus 4.8 lose in June 2026?

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

DeepSeek V4 Pro and Kimi K2.6 vs Claude Opus 4.8: Open Weights at Frontier Level

How to Use AI Models as Tools: Task Routing Matrix for Developers

The workspace your team
actually needs