Open-source models match or beat Claude Opus 4.6 on benchmarks

Benchmark Results
A detailed comparison of open-source models against Claude Opus 4.6 shows competitive or superior performance across multiple categories.
General Reasoning: DeepSeek V3.2
DeepSeek V3.2 holds its own against proprietary models, with its high-compute variant (V3.2-Speciale) surpassing GPT-5.
- SWE-bench Verified: Claude Opus 4.6: 80.8%, DeepSeek V3.2: 73.0%
- LiveCodeBench: Claude Opus 4.6: 76, DeepSeek V3.2: 74.1
- MMLU-Pro: DeepSeek V3.2: 85.0%, Claude Opus 4.6: 82.0%
DeepSeek V3.2 has strong multilingual support (CJK, Arabic, European languages) and a 128K context window with sparse attention, but it falls short on creative writing and some structured-output edge cases. Inference: ~60 tok/s output, 1.18s TTFT. It is production-ready for 90%+ of general use cases, 5x cheaper than GPT-5, and 20x cheaper than Opus 4.6.
Reasoning: DeepSeek R1
DeepSeek R1 beats expensive reasoning models on several benchmarks.
- Humanity's Last Exam: DeepSeek R1: 50.2%, Claude Opus 4.6: 40.0%
- MMLU-Pro: DeepSeek R1: 88.9%, Claude Opus 4.6: 82.0%
Inference: ~30 tok/s output, ~2s TTFT, slower than non-reasoning models because of chain-of-thought processing. It is the best open-source reasoning model, matches GPT-5.2 Pro on HLE, and is 30x cheaper than o1.
Agentic: Kimi K2.5
Kimi K2.5 has 1 trillion parameters (32B active per token via MoE) and a 256K context window. It is open-source under a modified MIT license.
- Tool use improvement: Kimi K2.5: +20.1 pts, Claude Opus 4.6: +12.4 pts, GPT-5.2: +11.0 pts
- SWE-bench Verified: Claude Opus 4.6: 80.8%, Kimi K2.5: 76.8%
- Humanity's Last Exam: Kimi K2.5: 50.2%, Claude Opus 4.6: 40.0%
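Kimi K2.5 can autonomously spawn up to 100 sub-agents in parallel and handle 1,500+ tool calls without human intervention. Inference: 334 tok/s output, 0.31s TTFT (the figures listed for Kimi K2.5 Turbo in the Speed Comparison). It is the best model for autonomous agent workloads: highest output speed, best tool use, and competitive on every benchmark. Its TTFT is second only to the much smaller Llama 3.1 8B (0.2s).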
Code: MiniMax M2.5
MiniMax M2.5 is now one of the best coding models, within 0.6 points of Claude Opus 4.6 on SWE-bench Verified.
- SWE-bench Verified: Claude Opus 4.6: 80.8%, MiniMax M2.5: 80.2%, GLM-5: 77.8%
MiniMax released M2.7 on March 18, a "self-evolving" model priced at $0.30/$1.20 per M tokens. It scores in the 96th percentile on coding accuracy, gets a perfect score on general knowledge, and is one of the cheapest frontier models available. Open-source coding models effectively match the best proprietary model.
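To make the quoted $0.30/$1.20 per M tokens concrete, here is a minimal sketch of a per-request cost calculation. It assumes the first figure is the input price and the second the output price, which is the usual convention but is not stated in the source. The function name and the example token counts are hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=0.30, output_price_per_m=1.20):
    """Cost of one request in USD, given prices per million tokens.

    Assumption: 0.30 is the input price and 1.20 the output price.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical request: 20,000 prompt tokens and 2,000 generated tokens.
cost = request_cost_usd(20_000, 2_000)
print(f"${cost:.4f}")
```

Under that assumption, output tokens cost four times as much as input tokens, so long generations drive the bill more than long prompts do.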
Speed Comparison
For production, latency matters as much as quality.
Output speed (tokens/second):
- Kimi K2.5 Turbo: 334
- Llama 3.1 8B: ~200
- GLM 4.7 Flash: ~150
- DeepSeek V3.2: ~60
- Claude Opus 4.6: 46
- DeepSeek R1: ~30
Time to first token (TTFT):
- Llama 3.1 8B: 0.2s
- Kimi K2.5 Turbo: 0.31s
- GLM 4.7 Flash: 0.51s
- DeepSeek V3.2: 1.18s
Kimi K2.5 Turbo at 334 tok/s is about 7.3x faster than Opus at 46 tok/s.
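The throughput and TTFT figures above can be combined into a rough end-to-end latency estimate: time to first token plus output length divided by output speed. The sketch below assumes a steady decoding rate and ignores network and queueing delays. The 500-token response length and the function name are hypothetical. Claude Opus 4.6 and DeepSeek R1 are left out because they do not appear in the TTFT list.

```python
# Figures from the two lists above: (output tokens/second, TTFT in seconds).
# Values marked "~" in the source are approximate.
MODELS = {
    "Kimi K2.5 Turbo": (334, 0.31),
    "Llama 3.1 8B": (200, 0.2),
    "GLM 4.7 Flash": (150, 0.51),
    "DeepSeek V3.2": (60, 1.18),
}


def response_time_s(tok_per_s, ttft_s, output_tokens):
    """Estimated wall-clock time: wait for the first token, then decode at a steady rate."""
    return ttft_s + output_tokens / tok_per_s


# Hypothetical 500-token response for every model.
estimates = {
    name: response_time_s(tps, ttft, 500)
    for name, (tps, ttft) in MODELS.items()
}

# Print fastest first.
for name, seconds in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.2f}s")
```

For responses of this length, decoding time dominates: the estimate for DeepSeek V3.2 is roughly 9.5 seconds, of which only 1.18 seconds is TTFT. For very short replies the ordering can change, because TTFT then makes up most of the total.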
Vision
Open-source vision has caught up for document processing and standard image analysis. Llama 4 Scout, Qwen VL, and others handle document extraction (invoices, receipts, forms), diagram understanding, and multi-image reasoning well. Open-source vision still falls short on fine-grained spatial reasoning and non-Latin handwriting.
Overall Comparison
Best open-source model in each category compared to Claude Opus 4.6 (Opus = 100% on each axis):
- Code (SWE-bench): Open-source 80.2% vs Opus 80.8% — Opus wins by 0.6 pts. Basically tied.
- Knowledge (MMLU-Pro): Open-source 88.9% vs Opus 82.0% — Open-source wins by 6.9 pts.
- Speed (tok/s): Open-source 334 vs Opus 46 — Open-source is 7.3x faster.
- Tool Use (improvement): Open-source +20.1 pts vs Opus +12.4 pts — Open-source wins by 7.7 pts.
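The list above gives raw figures, while the intro line describes a scale where Opus = 100 on each axis. A minimal sketch of that normalization follows: each open-source figure is divided by the Opus figure and multiplied by 100. The variable and function names are hypothetical.

```python
# (best open-source figure, Claude Opus 4.6 figure) per axis, from the list above.
AXES = {
    "Code (SWE-bench)": (80.2, 80.8),
    "Knowledge (MMLU-Pro)": (88.9, 82.0),
    "Speed (tok/s)": (334, 46),
    "Tool Use (improvement)": (20.1, 12.4),
}


def relative_to_opus(open_source, opus):
    """Open-source score as a percentage of the Opus score (Opus = 100)."""
    return 100 * open_source / opus


normalized = {axis: relative_to_opus(*pair) for axis, pair in AXES.items()}

for axis, pct in normalized.items():
    print(f"{axis}: {pct:.1f}")
```

On this scale the code axis lands just under 100 and the speed axis above 700, so a chart drawn this way is dominated by speed. Because the axes use different units (percentages, tokens per second, point improvements), the normalized values are comparable only within a single axis.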
📖 Read the full source: r/LocalLLaMA