MiMo-V2.5-Pro Benchmarked: 88% Good Win Rate vs K2.6

MiMo-V2.5-Pro, Xiaomi's latest open-weights model, has been benchmarked in autonomous games of Blood on the Clocktower — a complex social deduction game similar to Mafia/Werewolf. The benchmark, created by Reddit user cjami, pits models against each other in full games, measuring reasoning, deception, and tool use.

Key Results

Win rate: 88% as Good team, 48% as Evil team. The Good-side result is strong, but the 40-point gap makes the record lopsided; Evil play is the main weakness compared with Kimi K2.6.
Token efficiency: 183,639 output tokens per game, similar to Gemini 3.1 Pro. Kimi K2.6 uses about 580k output tokens per game, roughly 3x as many.
Cost per game: $0.99 — less than half Kimi K2.6 ($2.65) and far below Claude Opus 4.6 ($3.76).
Match duration: 2-3 hours, versus 10-15 hours for Kimi K2.6, whose verbose reasoning lengthens each game.
Tool call error rate: 0.4% — reliable for autonomous agent workflows.
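
The cost and token comparisons above are simple ratios, and a short script makes them easy to recheck. The sketch below uses only the per-game figures quoted in this summary; the dictionary layout and the helper name ratio_to_baseline are illustrative assumptions, not part of the benchmark's tooling, and the Kimi K2.6 token count is the rounded "580k" value.

```python
# Per-game figures as quoted in Key Results. The structure and names here
# are illustrative assumptions, not the benchmark's own data format.
results = {
    "MiMo-V2.5-Pro": {"cost_usd": 0.99, "output_tokens": 183_639},
    "Kimi K2.6": {"cost_usd": 2.65, "output_tokens": 580_000},  # "580k", rounded in the source
    "Claude Opus 4.6": {"cost_usd": 3.76, "output_tokens": None},  # token count not quoted
}


def ratio_to_baseline(metric, baseline="MiMo-V2.5-Pro"):
    """Return each model's value for `metric` divided by the baseline's value.

    Models with no quoted value for the metric are skipped.
    """
    base = results[baseline][metric]
    return {
        name: round(values[metric] / base, 2)
        for name, values in results.items()
        if values[metric] is not None
    }


cost_ratios = ratio_to_baseline("cost_usd")
token_ratios = ratio_to_baseline("output_tokens")

for name, ratio in cost_ratios.items():
    print(f"{name}: {ratio}x the cost per game of MiMo-V2.5-Pro")
for name, ratio in token_ratios.items():
    print(f"{name}: {ratio}x the output tokens per game of MiMo-V2.5-Pro")
```

With these inputs, Kimi K2.6 comes out at about 2.68x the cost and 3.16x the output tokens of MiMo-V2.5-Pro, and Claude Opus 4.6 at about 3.8x the cost, which matches the "less than half" and "3x" statements in the list.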

Notable Performance

Strong reasoning under uncertainty: examples include thinking from other players' perspectives in a game against GPT 5.5, and clean deductions that won a game.

Notable Mistakes

Expected an evil Baron to self-reveal, which led to a loss against Claude Opus 4.6.
Confessed its role while playing as a Minion (transcript).

Practical Takeaway

For developers who need an open-weights model with strong reasoning in multi-agent or game-theoretic settings, MiMo-V2.5-Pro offers the best value among the top-tier models compared here: lower cost per game, shorter matches, and a low tool call error rate, with room for improvement in adversarial (Evil) roles.

Full model transcripts and game logs: MiMo-V2.5-Pro on Clocktower Radio. Methodology: How-it-works.

📖 Read the full source: r/LocalLLaMA