Last updated: June 3, 2026. Benchmark figures reflect the Artificial Analysis Intelligence Index as of the same date. License terms verified against official model cards on Hugging Face. Self-host cost estimates use current spot pricing on Lambda Labs and RunPod.
This is post 2 of the Pristren AI Sprint series. You might also want to read Post 1: The 2026 Model Landscape Map before diving in, or jump ahead to Post 3: Fine-Tuning Open Weights Models Without Destroying Alignment if licensing is already settled for your use case.
The Moment the Gap Became Uncomfortable
Eighteen months ago, the safe answer to "which model should we use?" was always some variant of "Claude or GPT-4, probably Claude." The reasoning was simple: closed frontier labs had a commanding quality lead, open weights options lagged on complex reasoning, and the compliance overhead of self-hosting rarely paid off for teams under 50 engineers.
That calculus has shifted. Not shattered, but shifted enough that ignoring it is now a business decision, not a technical default.
DeepSeek V4-Pro dropped in April 2026 under a clean MIT license. A month later, Moonshot AI published Kimi K2.6 under a Modified MIT that most legal teams have approved for commercial use. Both models cleared the 50-point threshold on the Artificial Analysis Intelligence Index (AA Index), a composite benchmark covering reasoning, coding, instruction following, and multilingual comprehension.
Claude Opus 4 sits at 61 on the same index. DeepSeek V4-Pro scores 52. Kimi K2.6 scores 54. The gap is 7 to 9 points, not 20. And you can run both open models yourself, at a cost structure that no API pricing page can match at scale.
What the AA Index Actually Measures
Before treating those numbers as gospel, it is worth understanding what the Artificial Analysis Intelligence Index aggregates. The index combines performance across five capability dimensions:
- Reasoning and math -- scored via MATH-500 and GPQA Diamond variants
- Coding ability -- HumanEval+ and LiveCodeBench
- Instruction following -- IFEval and MT-Bench adapted prompts
- Long-context retrieval -- RULER at 128k context
- Multilingual comprehension -- an aggregated score across 15 languages
Each dimension is normalized to a 0-100 scale and averaged with equal weights. This means a model that dominates coding but struggles at multilingual tasks can still post a respectable composite score.
DeepSeek V4-Pro's 52 reflects excellent coding (it benchmarks comparably to Opus on LiveCodeBench) but below-average multilingual scores outside Mandarin and English. Kimi K2.6's 54 reflects stronger multilingual performance at the cost of slightly weaker reasoning on multi-step math problems. Claude Opus 4 at 61 holds consistent leads across all five dimensions, with particular advantages in instruction following and nuanced creative tasks.
If your use case is English-only coding assistance or structured data extraction, the real-world gap between 52 and 61 may be invisible to your users. If your use case requires nuanced long-form reasoning, legal document analysis, or tone-sensitive customer-facing content, you will feel those 9 points.
License Comparison: What You Can Actually Do
The legal picture is clearer than it was even six months ago, but it still requires a careful read.
| Model | License | Commercial Use | Fine-Tuning | Redistribution | Model Output Restrictions |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | MIT | Yes, unrestricted | Yes | Yes | None |
| Kimi K2.6 | Modified MIT | Yes, with attribution | Yes | Yes, with attribution | Cannot claim outputs are human-generated |
| Qwen3.7-Max | Proprietary (closed API) | Via API only | No | No | Standard API ToS |
| Claude Opus 4 | Anthropic ToS | Via API only | No (fine-tuning beta, restricted) | No | Standard API ToS |
| GPT-4.1 | OpenAI ToS | Via API only | Yes (enterprise tier) | No | Standard API ToS |
Two things stand out here.
First, DeepSeek V4-Pro is genuinely MIT. That means you can embed it in a product, ship it as part of a self-contained appliance, deploy it behind a firewall without phoning home, and build a competing product on top of it. MIT offers no restrictions that would concern a commercial software team.
Second, Kimi K2.6's Modified MIT adds two clauses: attribution (you must name the model in documentation or about pages) and a prohibition on claiming model outputs are human-generated without disclosure. The attribution requirement is operationally trivial. The output disclosure clause is consistent with EU AI Act requirements that will apply to most European deployments regardless of license anyway.
The Qwen3.7-Max Detour
Alibaba's Qwen line has an interesting recent history. Qwen2.5 and Qwen3.0 were released with permissive licenses that attracted a large open-source community. Qwen3.7-Max, the highest-capability model in the current generation, shipped as a closed API product with no public weights and no announced timeline for open release.
This matters because Qwen3.7-Max benchmarks at roughly 57 on the AA Index, and many teams had planned migration paths from earlier open Qwen models. The proprietary pivot stranded those plans and served as a concrete example of why "open weights today" does not guarantee "open weights tomorrow." If license stability over a 3-year product roadmap matters to you, Alibaba's recent behavior with Qwen3.7-Max should be in your risk model.
DeepSeek has maintained open releases consistently since V2. Moonshot AI has released all K2-series checkpoints under permissive terms. Neither company has announced any intent to close future releases, but neither has made a binding legal commitment either. This is a reputational track record, not a contractual guarantee.
Self-Hosting Cost Estimates: H100 vs API
This is where the economics get interesting, and where most "open source is free" conversations fall apart without actual numbers.
Hardware Requirements
DeepSeek V4-Pro is a 671-billion-parameter Mixture-of-Experts model. Running it at full precision requires approximately 8x H100 80GB GPUs in inference mode, or 4x H100 80GB with 4-bit GPTQ quantization at acceptable quality loss for most production workloads. Kimi K2.6 is a 235-billion-parameter dense model that fits in 4x H100 80GB at BF16 or 2x H100 80GB at INT8.
Cloud GPU Spot Pricing (June 2026)
| Provider | H100 80GB SXM (hourly, spot) | 8x H100 cluster (hourly) |
|---|---|---|
| Lambda Labs | $2.49 | $19.92 |
| RunPod | $2.71 | $21.68 |
| CoreWeave | $2.85 | $22.80 |
| AWS p4d.24xlarge (on-demand) | $32.77 | -- (8x A100, not H100) |
Reserved 1-year pricing on Lambda Labs brings the 8x H100 cluster to roughly $14.50 per hour.
API Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 |
| Kimi K2.6 (API) | $0.60 | $2.50 |
| DeepSeek V4-Pro (API) | $0.27 | $1.10 |
| GPT-4.1 (API) | $2.00 | $8.00 |
At Claude Opus 4 output pricing of $75 per million tokens, a team generating 50 million output tokens per month pays $3,750 per month on outputs alone. An 8x H100 spot cluster running continuously at 85 percent utilization costs roughly $12,150 per month. At that scale, the API is cheaper.
The break-even point, running DeepSeek V4-Pro self-hosted versus API, assuming 85 percent cluster utilization and 2,000 tokens per second throughput on the cluster, works out to roughly 180 million output tokens per month. Below that, use the API. Above that, consider self-hosting.
For Kimi K2.6, the smaller model fits on a 2x H100 cluster at $5.40 per hour spot, or roughly $3,890 per month at continuous operation. The break-even against Kimi's API pricing falls around 60 million output tokens per month. This is a much more reachable threshold for mid-sized product teams.
Hidden Costs of Self-Hosting
The hardware cost is the most visible line item, but not the largest. A realistic self-hosted inference stack adds:
- Engineering time: Minimum 0.5 FTE to maintain vLLM or TGI serving stack, monitor GPU health, handle CUDA version conflicts, and manage rolling updates. Fully loaded engineering cost: $6,000 to $10,000 per month.
- Inference optimization: Batching strategies, KV cache management, speculative decoding configuration. Expect 2-4 weeks of setup work before reaching production throughput targets.
- Monitoring and observability: Prometheus, Grafana, custom alerting for GPU memory fragmentation. Not glamorous, but silently expensive when ignored.
- Compliance overhead: If you are self-hosting to satisfy data residency requirements, you also need to budget for SOC 2 or ISO 27001 audit coverage of your inference infrastructure.
A conservative total cost of ownership for a self-hosted DeepSeek V4-Pro deployment at 200 million tokens per month lands between $18,000 and $24,000 per month including engineering overhead. At that output volume, the equivalent Claude Opus 4 API spend is $15,000 per month on outputs alone, not counting input costs. The self-hosted economics win only if you are also substituting away from Opus-tier pricing for workloads where 52 AA Index is sufficient.
When NOT to Self-Host
Self-hosting open weights models is the right call for some teams and the wrong call for many others. Here is where the economics and operational tradeoffs argue against it.
Team size below 20 engineers. The 0.5 FTE maintenance burden represents a meaningful percentage of a small team's capacity. Early-stage startups should stay on managed APIs until the unit economics clearly favor infrastructure investment.
Use cases that genuinely need frontier quality. The 9-point gap between Kimi K2.6 and Claude Opus 4 is real. If you are processing legal documents, writing medical content, or running a product where output quality is the primary differentiator, do not shave costs on the model.
Regulated environments with complex compliance requirements. Self-hosting sounds like a compliance win (data stays on your servers), but it adds infrastructure audit scope. For HIPAA, PCI-DSS, or FedRAMP environments, a managed API with a BAA or equivalent may have a lower total compliance cost than building your own secure inference cluster.
Teams without GPU infrastructure expertise. Running inference at production scale on multi-GPU clusters is operationally distinct from running training workloads. If your ML team's experience is primarily in training and fine-tuning, factor in the learning curve and the tail risk of a production incident during a peak traffic period.
Latency-sensitive applications. Managed API providers have spent years optimizing inference latency. Self-hosted vLLM on a cold cluster may not match API p99 latency without significant tuning investment. Benchmark your specific workload before committing.
Practical Decision Framework
Use this checklist before committing to self-hosted open weights deployment:
- Monthly output tokens exceed 50M for smaller models (Kimi K2.6 tier) or 180M for larger models (DeepSeek V4-Pro tier)?
- Engineering team can dedicate 0.5 FTE ongoing to inference infrastructure?
- Data residency or IP requirements make third-party API use problematic or legally constrained?
- AA Index score of 52-54 is sufficient for your specific use case (tested with real prompts, not assumed)?
- GPU cloud budget approved and spot instance interruption risk is acceptable or mitigated with reserved capacity?
If you answer yes to all five, self-hosting is worth prototyping. If you answer no to any of the first three, stay on managed APIs and revisit in 12 months.
Looking Ahead: The Compression of the Quality Gap
The 9-point AA Index gap between Kimi K2.6 and Claude Opus 4 is the smallest it has ever been. Based on DeepSeek's published research trajectory and Moonshot AI's k2-series roadmap, both are targeting new releases in Q3 and Q4 2026. Independent projections from the Artificial Analysis team suggest open weights models could reach 58-62 on the index by end of year, which would effectively eliminate the justification for Opus-tier pricing for most structured tasks.
Anthropic's response to this trend is visible in Claude's pricing evolution: the gap between Haiku and Opus pricing has widened, not narrowed, suggesting Anthropic is positioning Haiku-tier for cost-sensitive applications and defending Opus pricing on quality-differentiated enterprise use cases. This is a reasonable strategy, but it also means that for teams doing high-volume structured work (code generation, data extraction, document parsing), the open weights alternative is becoming harder to dismiss.
What This Means for How You Build in 2026
The most important shift is not "which model to use" but "build your architecture to be model-agnostic." Whether you run DeepSeek V4-Pro today or switch to Kimi K3 in six months, a hard dependency on any single provider's API format or capability profile is a liability.
OpenAI-compatible endpoints (which DeepSeek V4-Pro supports) make this easier. LiteLLM and similar routing layers reduce lock-in further. The teams that will have the most flexibility over the next 18 months are the ones building model routing into their infrastructure now, before the switching cost of migrating grows with product complexity.
For more on how to build an efficient routing layer that balances quality and cost across models, see Post 4: LLM Token Optimization and Model Routing in 2026.
Summary
The open weights vs closed weights debate in 2026 is not a binary. DeepSeek V4-Pro at AA Index 52 and Kimi K2.6 at 54 are not replacements for Claude Opus 4 at 61 in every context. But they are credible alternatives for a large class of structured, English-first workloads, and they carry licenses that give engineering teams genuine deployment flexibility.
The economics favor self-hosting only at meaningful scale (50M+ tokens per month for Kimi-tier, 180M+ for DeepSeek-tier) with engineering overhead factored in. Below those thresholds, open model APIs (not self-hosted) often offer the best cost-quality tradeoff.
Qwen3.7-Max's proprietary pivot is a reminder that "open today" does not mean "open forever." Build model-agnostic infrastructure.
Part of the Pristren AI Sprint series. Continue reading: