Qwen 3 8B outperforms larger models in blind peer evaluations on hard tasks

Evaluation Results
A blind peer evaluation system called The Multivac tested 10 small language models on 13 hard, frontier-level questions, pitched at the same difficulty level used for GPT-5.4 and Claude Opus 4.6. The judging models did not know which model had produced which response, and rankings were computed from peer consensus.
Key Findings
Qwen 3 8B (8B parameters) achieved:
- 6 first-place wins out of 13 evaluations
- Top-3 finishes in 12 of 13 tasks
- Average score of 9.40
- Worst finish: 5th place
This performance exceeded models with significantly larger parameter counts, including:
- Gemma 3 27B (27B parameters): 3 wins, 11 top-3 finishes, average 9.33
- Kimi K2.5 (32B/1T MoE): 3 wins, 5 top-3 finishes, average 8.78
- Qwen 3 32B (32B parameters): 2 wins, 5 top-3 finishes, average 8.40
Task-Specific Performance
On code tasks, Qwen 3 8B placed:
- 1st on Go concurrency debugging (9.65)
- 1st on distributed lock analysis (9.33)
- Tied 1st on SQL optimization (9.66)
On reasoning tasks, it placed:
- 1st on Simpson's Paradox (9.51)
- 1st on investment decision theory (9.63)
- 2nd on Bayesian diagnosis (9.53)
Notable Observations
Qwen 3 32B showed a significant performance drop on the distributed lock debugging task (EVAL-20260315-043330), scoring only 1.00 out of 10 while every other model scored above 5.5. The 8B model scored 9.33 on the identical task. The cause is unclear but could be related to OpenRouter routing, quantization artifacts, or a genuine failure mode.
Kimi K2.5, technically a 32B active/1T MoE model, won 3 evaluations: the 502 debugging task (9.57), Arrow's voting theorem (9.18), and survivorship bias (9.63).
Llama 3.1 8B finished last or second-to-last in 10 of 13 evaluations with an average score of 7.51, a gap of 1.89 points behind Qwen 3 8B (9.40) despite having the same parameter count.
Methodology Notes
The evaluation used a blind peer system in which 10 models respond to the same question, then each model judges all 10 responses (100 judgments per evaluation, or 90 once the 10 self-judgments are excluded). The author notes genuine limitations: AI judging AI has a circularity problem, and scores measure peer consensus rather than ground truth. A human baseline study is being developed to measure how well peer scores correlate with human judgment.
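As a rough illustration of how a peer-consensus score could be computed, here is a minimal Python sketch. It assumes each judge gives every response a numeric score and that a model's consensus score is the plain mean of the scores it receives from the other judges. The source does not describe the actual aggregation formula, so the function name `peer_consensus`, the data layout, and the averaging rule are all assumptions for illustration, not The Multivac's implementation.

```python
from statistics import mean


def peer_consensus(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank models by the mean score they receive from other judges.

    scores[judge][author] is the score `judge` gave to `author`'s response.
    Self-judgments (judge == author) are ignored. The simple mean is an
    assumption; the source does not specify the aggregation rule.
    """
    authors = {author for row in scores.values() for author in row}
    results = []
    for author in authors:
        received = [
            row[author]
            for judge, row in scores.items()
            if judge != author and author in row
        ]
        if received:
            results.append((author, mean(received)))
    # Highest consensus score first; ties broken alphabetically by model name.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


def judgment_count(n_models: int) -> int:
    """Number of judgments per evaluation once self-judgments are excluded."""
    return n_models * (n_models - 1)
```

With 10 models, `judgment_count(10)` gives the 90 non-self judgments per evaluation. Because every score comes from another model, a ranking built this way reflects agreement among the judges rather than correctness, which is the circularity the author flags.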
📖 Read the full source: r/LocalLLaMA