MiMo-V2.5-Pro Benchmarked: Strong Social Deduction Reasoning, Good Value vs K2.6

MiMo-V2.5-Pro, Xiaomi's latest open-weights model, has been benchmarked in autonomous games of Blood on the Clocktower — a complex social deduction game similar to Mafia/Werewolf. The benchmark, created by Reddit user cjami, pits models against each other in full games, measuring reasoning, deception, and tool use.
Key Results
- Win rate: 88% as Good team, 48% as Evil team — overall high but lopsided. Evil performance is the main weakness vs Kimi K2.6.
- Token efficiency: 183,639 output tokens per game, similar to Gemini 3.1 Pro. Compare to Kimi K2.6 at 580k tokens (3x longer).
- Cost per game: $0.99 — less than half Kimi K2.6 ($2.65) and far below Claude Opus 4.6 ($3.76).
- Match duration: 2-3 hours (vs Kimi K2.6 which takes 10-15 hours due to verbose reasoning).
- Tool call error rate: 0.4% — reliable for autonomous agent workflows.
Notable Performance
Strong reasoning under uncertainty: example of thinking from others' perspectives vs GPT 5.5 and clean deductions winning a game.
Notable Mistakes
- Expected an evil Baron to self-reveal, leading to a loss — vs Claude Opus 4.6.
- Minion confessing their role — transcript.
Practical Takeaway
For developers needing an open-weights model with strong reasoning in multi-agent or game-theoretic settings, MiMo-V2.5-Pro offers the best value among top-tier models — lower cost, faster inference, and reasonable reliability, albeit with room for improvement in adversarial roles.
Full model transcripts and game logs: MiMo-V2.5-Pro on Clocktower Radio. Methodology: How-it-works.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw contributor criticizes project's focus on pixel-perfect parity over modern features
A Reddit post from r/openclaw details how a contributor's PR addressing resolution scaling and high-refresh-rate support was rejected for deviating from the original engine's visual constraints, sparking debate about the project's direction.

Going Full AI Engineer: Not Touching Code Anymore
Max Heyer describes a workflow where agents write all code, he only reads diffs, writes specs, and reviews. The skill that matters is taste — evaluating code is harder than producing it.

Meta Pauses Internal AI Training Program After Employee Keystroke Data Leak
Meta pauses MCI program tracking employee keystrokes after SEV 2 leak exposed private conversations, performance data, and transcriptions company-wide.

Claude restricts third-party harness usage including OpenClaw starting April 4
Anthropic will no longer allow Claude subscription limits to be used with third-party harnesses like OpenClaw starting April 4, requiring separate pay-as-you-go billing for such usage. Users will receive a one-time credit equal to their monthly subscription price and can pre-purchase usage bundles with up to 30% discount.