ARC-AGI: The Benchmark That Stumped AI
The ARC-AGI benchmark tests abstract reasoning with visual pattern puzzles that humans solve easily but that AI systems have historically struggled with. Prior to o3, the best models scored around 32% (o1) and 34% (Claude 3.5).
o3 achieved 87.5% on the semi-private ARC-AGI evaluation in its high-compute configuration (75.7% in the lower-compute configuration), a jump so dramatic it made headlines and prompted the ARC Prize team to call it "a genuine breakthrough."
Benchmark Scores
| Benchmark | o3 | o1 | GPT-4o |
|-----------|----|----|--------|
| ARC-AGI | 87.5% | 32.0% | 5.0% |
| AIME 2024 | 96.7% | 74.4% | 13.4% |
| SWE-Bench | 71.7% | 48.9% | 38.0% |
| GPQA Diamond | 87.7% | 78.3% | 53.6% |
The SWE-Bench score of 71.7% is particularly significant: o3 resolved roughly 7 in 10 of the benchmark's tasks, which are drawn from real GitHub issues, without human intervention.
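To put the table in perspective, the gaps between o3 and o1 can be computed directly from its figures. The sketch below uses only the scores listed above; the dictionary layout and the `gain_over_o1` helper are illustrative names, not part of any published tooling.

```python
# Scores copied from the benchmark table above (percent).
scores = {
    "ARC-AGI": {"o3": 87.5, "o1": 32.0, "GPT-4o": 5.0},
    "AIME 2024": {"o3": 96.7, "o1": 74.4, "GPT-4o": 13.4},
    "SWE-Bench": {"o3": 71.7, "o1": 48.9, "GPT-4o": 38.0},
    "GPQA Diamond": {"o3": 87.7, "o1": 78.3, "GPT-4o": 53.6},
}


def gain_over_o1(benchmark: str) -> float:
    """Percentage-point gain of o3 over o1 on one benchmark."""
    row = scores[benchmark]
    return round(row["o3"] - row["o1"], 1)


for name in scores:
    print(f"{name}: +{gain_over_o1(name)} points over o1")
```

By this measure the ARC-AGI gap is by far the largest, while GPQA Diamond, where o1 was already strong, shows the smallest gain.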
Compute-Optimal Inference Scaling
Unlike standard models, where the compute spent on a response is effectively fixed, o3 uses test-time compute scaling: you choose how much "thinking budget" it spends on a request. More thinking generally means higher accuracy, and also higher latency and cost.
o3-mini offers three compute tiers:
- Low: Fast, cheap, similar to o1-mini
- Medium: Balanced quality/cost (recommended default)
- High: Maximum accuracy, higher latency and cost
from openai import OpenAI

client = OpenAI()

# o3-mini with medium reasoning effort
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[
        {"role": "user", "content": "Solve: if 3x + 7 = 22, what is x^2 + 2x?"}
    ],
)
print(response.choices[0].message.content)
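One way to see the trade-off between the three tiers is to run the same prompt at each effort level and compare latency and token usage. The following is a minimal sketch, not an official benchmark harness: `compare_efforts` is a hypothetical helper, and it assumes `client` behaves like the `OpenAI()` client from the example above.

```python
import time


def compare_efforts(client, prompt, efforts=("low", "medium", "high"), model="o3-mini"):
    """Run one prompt at each reasoning effort and collect basic stats.

    `client` is assumed to expose chat.completions.create like openai.OpenAI().
    """
    results = []
    for effort in efforts:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            reasoning_effort=effort,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.perf_counter() - start
        results.append({
            "effort": effort,
            "seconds": round(elapsed, 2),
            # Total completion tokens reported by the API for this call.
            "completion_tokens": response.usage.completion_tokens,
            "answer": response.choices[0].message.content,
        })
    return results


# Usage (requires the openai package and an API key):
#   from openai import OpenAI
#   for row in compare_efforts(OpenAI(), "Solve: if 3x + 7 = 22, what is x^2 + 2x?"):
#       print(row["effort"], row["seconds"], row["completion_tokens"])
```

A single prompt is only a rough signal; a representative sample of your own workload gives a better picture of where the extra effort pays off.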
When to Use o3 vs o1
Use o3/o3-mini for:
- Mathematical proofs and competition-level problems
- Complex multi-step coding tasks (algorithm design, debugging)
- Scientific reasoning and STEM research assistance
- Tasks where accuracy matters more than cost
Use o1 for:
- Tasks where o3's performance gains don't justify the cost premium
- Existing pipelines already tuned for o1 behavior
- When you need faster time-to-first-token
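The guidance above can be encoded as a small routing rule. This is a hypothetical sketch: the function name, its parameters, and the task categories are illustrative, and the split between o3 and o3-mini is an assumption based on the cost trade-off rather than something the lists state outright.

```python
# Task kinds the guidance above associates with o3/o3-mini (illustrative labels).
REASONING_HEAVY = {"math", "coding", "science"}


def choose_model(task_kind, accuracy_over_cost, tuned_for_o1=False, needs_fast_first_token=False):
    """Pick a model name following the rules of thumb listed above."""
    # Existing o1 pipelines and latency-sensitive paths stay on o1.
    if tuned_for_o1 or needs_fast_first_token:
        return "o1"
    # Accuracy-critical work goes to full o3 (assumption: cost is acceptable).
    if accuracy_over_cost:
        return "o3"
    # Reasoning-heavy but cost-conscious work goes to o3-mini.
    if task_kind in REASONING_HEAVY:
        return "o3-mini"
    # Otherwise the gains may not justify the premium.
    return "o1"


print(choose_model("math", accuracy_over_cost=True))
print(choose_model("coding", accuracy_over_cost=False))
print(choose_model("summarization", accuracy_over_cost=False))
```

Treat the thresholds as a starting point and adjust them once you have measured quality and cost on your own tasks.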
API Access
o3 and o3-mini are available through the OpenAI API. Access for o3 (full) is currently prioritized for Tier 4-5 API users. o3-mini has broader availability.
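# Full o3: highest capability, highest cost
# Reuses the `client` created in the earlier example.
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Design a Byzantine fault-tolerant consensus algorithm."}],
    max_completion_tokens=8000,  # cap on generated tokens, including reasoning tokens
)
print(response.choices[0].message.content)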
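Because access to full o3 depends on your usage tier, it can help to fall back to o3-mini when the o3 call is rejected. The sketch below is one possible approach: `create_with_fallback` is a hypothetical helper, and treating `openai.NotFoundError` and `openai.PermissionDeniedError` as the "no access" signals is an assumption you should verify against the errors your account actually returns.

```python
def create_with_fallback(create, models, access_errors, **request):
    """Try each model in order; move on when the call raises an access error.

    `create` is any callable shaped like client.chat.completions.create.
    `access_errors` is a tuple of exception types treated as "no access".
    Returns a (model_name, response) pair for the first model that works.
    """
    last_error = None
    for model in models:
        try:
            return model, create(model=model, **request)
        except access_errors as err:
            last_error = err  # remember why this model was skipped
    raise RuntimeError(f"none of {list(models)} were accessible") from last_error


# Usage (requires the openai package and an API key):
#   import openai
#   client = openai.OpenAI()
#   model, response = create_with_fallback(
#       client.chat.completions.create,
#       models=["o3", "o3-mini"],
#       access_errors=(openai.NotFoundError, openai.PermissionDeniedError),
#       messages=[{"role": "user", "content": "Design a Byzantine fault-tolerant consensus algorithm."}],
#   )
```

Passing the exception types in explicitly keeps the helper independent of the SDK and makes it easy to exercise with a stub.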
Cost Considerations
o3 is significantly more expensive than GPT-4o. For most production applications, o3-mini at medium reasoning effort hits the best accuracy/cost balance. Run a cost analysis before deploying o3 at scale — the quality gains are real, but so is the bill.
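A back-of-the-envelope cost analysis can be as simple as multiplying expected token volumes by per-token prices. The sketch below is illustrative only: the `Price` class, the `monthly_cost` helper, and every price in it are placeholders, not real pricing, so substitute current figures from the pricing page. It also assumes reasoning tokens are billed as output tokens, so include them in `output_tokens`.

```python
from dataclasses import dataclass


@dataclass
class Price:
    """USD per 1M tokens. Values used below are placeholders, not real prices."""
    input_per_million: float
    output_per_million: float


def monthly_cost(price, requests_per_day, input_tokens, output_tokens, days=30):
    """Estimated spend for a workload with fixed per-request token counts."""
    per_request = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
    return per_request * requests_per_day * days


# Placeholder prices for two hypothetical models; replace before relying on the numbers.
placeholder_prices = {
    "expensive-model": Price(input_per_million=10.0, output_per_million=40.0),
    "cheaper-model": Price(input_per_million=1.0, output_per_million=4.0),
}

for name, price in placeholder_prices.items():
    estimate = monthly_cost(price, requests_per_day=10_000, input_tokens=1_000, output_tokens=2_000)
    print(f"{name}: ${estimate:,.2f} per month")
```

Higher reasoning effort tends to produce more output tokens per request, so measure token counts at the effort level you plan to deploy rather than reusing numbers from a cheaper tier.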
Summary
o3 represents a step-change in AI reasoning capability, particularly for math, science, and complex coding. o3-mini with configurable compute tiers makes this capability accessible at reasonable cost. Track ARC-AGI results at the ARC Prize blog, and check OpenAI's API documentation for current access details.