Xiaomi Open-Sources MiMo-V2.5-Pro: Nears Claude Opus 4.6 on Coding Benchmarks

Xiaomi has released the MiMo-V2.5 family of open-source models, with the Pro variant posting coding benchmark scores competitive with Claude Opus 4.6 and GPT-5.4.
Real-World Tests
V2.5-Pro completed a Peking University compiler project (a SysY compiler in Rust) in 4.3 hours with a perfect score of 233/233, higher than most students achieve after weeks of work. Given a vague prompt like "build a video editor," it autonomously produced an 8,192-line desktop application with a multi-track timeline, clip trimming, crossfades, audio mixing, and an export pipeline, taking 11.5 hours and 1,868 tool calls. In a graduate-level analog circuit design task (a Flipped-Voltage-Follower LDO in TSMC 180nm), it iterated via ngspice simulation and improved line regulation 22× and load regulation 17× over its own initial attempt.
Benchmarks vs. Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek V4 Pro
- SWE-Bench Pro: 57.2 (vs. 57.3 Claude, 57.7 GPT, 54.2 Gemini, 55.4 DeepSeek)
- SWE-Bench Verified: 78.9 (vs. 80.8 Claude, n/a GPT, 76.2 Gemini, 80.6 DeepSeek)
- Terminal-Bench 2.0: 68.4 (vs. 65.4 Claude, 75.1 GPT, 68.5 Gemini, 67.9 DeepSeek); leads Claude and DeepSeek, sits 0.1 behind Gemini, trails GPT
- Claw-Eval Pass@3: 63.8 (vs. 70.4 Claude, 60.3 GPT, 57.8 Gemini, 59.8 DeepSeek); beats GPT, Gemini, and DeepSeek, trails Claude
- HLE with tools: 48.0 (vs. 53.0 Claude, 58.7 GPT, 51.4 Gemini, 48.2 DeepSeek); lags all four on general reasoning
- GDPVal-AA: 1581 (vs. 1606 Claude, 1674 GPT, 1317 Gemini, 1554 DeepSeek); lags GPT and Claude, leads Gemini and DeepSeek
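The head-to-head margins in the list above are easy to misread, so here is a minimal Python sketch that recomputes them. The scores are copied from the list; the names `SCORES`, `margins`, and `leads` are hypothetical helpers introduced only for this illustration, and the short model labels are shorthand for the models named in the heading.

```python
# Scores copied from the benchmark list above. None marks a score the list
# gives as "n/a". All names here are illustrative, not from Xiaomi's release.
SCORES = {
    "SWE-Bench Pro":      {"MiMo": 57.2, "Claude": 57.3, "GPT": 57.7, "Gemini": 54.2, "DeepSeek": 55.4},
    "SWE-Bench Verified": {"MiMo": 78.9, "Claude": 80.8, "GPT": None, "Gemini": 76.2, "DeepSeek": 80.6},
    "Terminal-Bench 2.0": {"MiMo": 68.4, "Claude": 65.4, "GPT": 75.1, "Gemini": 68.5, "DeepSeek": 67.9},
    "Claw-Eval Pass@3":   {"MiMo": 63.8, "Claude": 70.4, "GPT": 60.3, "Gemini": 57.8, "DeepSeek": 59.8},
    "HLE with tools":     {"MiMo": 48.0, "Claude": 53.0, "GPT": 58.7, "Gemini": 51.4, "DeepSeek": 48.2},
    "GDPVal-AA":          {"MiMo": 1581, "Claude": 1606, "GPT": 1674, "Gemini": 1317, "DeepSeek": 1554},
}


def margins(benchmark):
    """Return MiMo's score minus each rival's score, skipping unreported scores."""
    row = SCORES[benchmark]
    mimo = row["MiMo"]
    return {
        model: round(mimo - score, 1)
        for model, score in row.items()
        if model != "MiMo" and score is not None
    }


def leads(benchmark):
    """Return the sorted list of rivals that MiMo outscores on this benchmark."""
    return sorted(model for model, diff in margins(benchmark).items() if diff > 0)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: leads {leads(name) or 'none'}, margins {margins(name)}")
```

Running the script prints, for each benchmark, which rivals V2.5-Pro outscores and by how much, which makes near-ties such as the 0.1-point gaps on SWE-Bench Pro and Terminal-Bench 2.0 easy to spot.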
On Claw-Eval, Xiaomi's token-efficiency chart also claims V2.5-Pro (63.8) beats Claude Sonnet 4.6. V2.5-Pro supports sustained task execution across more than 1,000 tool calls with self-correction; in one example, it autonomously caught and fixed a regression introduced by a refactoring pass at turn 512.
The weights are now openly available for download and self-hosting.
📖 Read the full source: HN AI Agents