Structured workflow beats plan mode and superpowers on AI DES benchmark

A Reddit post shares results from the new AI-assisted Discrete-Event Simulation (DES) benchmark. The submission using the Ouroboros workflow (ooo) inside Claude Code ranked #1, beating both Claude's built-in plan mode and the 'superpowers' fat-skill stacks.
Benchmark details
The benchmark tests full understanding of a real-world system — a mining haulage system with trucks, loading points, dumping points, routes, and queues. Submissions are judged on:
- Comprehension of system structure
- Abstracting into a discrete-event simulation model
- Designing events, state changes, and KPIs
- Producing executable simulation code
- Interpreting results (bottlenecks, throughput, waiting times)
- Generating human-readable artifacts (topology diagrams, animations)
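To make the modeling steps above concrete, here is a minimal sketch of an event-scheduling DES for a haulage loop. It is not the benchmark's reference model or any submission's code. The single loader, the unconstrained dump, the fixed durations, and every name (`simulate_haulage`, the event kinds, the KPI keys) are assumptions chosen for illustration.

```python
import heapq
from collections import deque


def simulate_haulage(n_trucks=3, load_time=5.0, haul_time=10.0,
                     dump_time=2.0, return_time=8.0, horizon=120.0):
    """Toy haulage loop: one loader with a FIFO queue, a dump with no queue.

    All durations are fixed, hypothetical values in minutes.
    """
    events = []  # priority queue of (time, sequence, kind, truck)
    seq = 0      # tie-breaker so simultaneous events pop in scheduling order

    def schedule(time, kind, truck):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind, truck))
        seq += 1

    loader_busy = False
    queue = deque()  # (truck, arrival_time) waiting for the loader
    loads_dumped = 0
    loads_started = 0
    total_wait = 0.0

    def start_loading(now, truck, arrived):
        nonlocal loader_busy, loads_started, total_wait
        loader_busy = True
        loads_started += 1
        total_wait += now - arrived
        schedule(now + load_time, "load_done", truck)

    for truck in range(n_trucks):
        schedule(0.0, "arrive_loader", truck)

    while events:
        now, _, kind, truck = heapq.heappop(events)
        if now > horizon:
            break
        if kind == "arrive_loader":
            if loader_busy:
                queue.append((truck, now))
            else:
                start_loading(now, truck, now)
        elif kind == "load_done":
            loader_busy = False
            schedule(now + haul_time, "arrive_dump", truck)
            if queue:
                next_truck, arrived = queue.popleft()
                start_loading(now, next_truck, arrived)
        elif kind == "arrive_dump":
            schedule(now + dump_time, "dump_done", truck)
        elif kind == "dump_done":
            loads_dumped += 1
            schedule(now + return_time, "arrive_loader", truck)

    mean_wait = total_wait / loads_started if loads_started else 0.0
    return {"loads_dumped": loads_dumped, "mean_wait": mean_wait}
```

Calling `simulate_haulage(n_trucks=1)` and then raising `n_trucks` shows the kind of interpretation the benchmark asks for: throughput (`loads_dumped`) rises with fleet size while queueing time at the loader (`mean_wait`) starts to appear, which points at the loader as the bottleneck in this toy model.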
Ouroboros performance
The Ouroboros submission included working DES code, a topology diagram of the mining system, and an animation of trucks hauling ore. Notably, when the MCP server failed mid-run, Ouroboros fell back to a skills-based path and still finished the task, showing the kind of recovery and rerouting that real deployments require.
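The fallback behavior described above can be illustrated with a generic pattern. The post does not show how Ouroboros implements it, so this is only a sketch under assumptions: `run_with_fallback`, `primary`, and `fallback` are hypothetical names, and the choice of which exceptions count as recoverable is also an assumption.

```python
def run_with_fallback(task, primary, fallback,
                      recoverable=(ConnectionError, TimeoutError)):
    """Try the primary path; on a recoverable failure, reroute to the fallback.

    `primary` and `fallback` are hypothetical callables standing in for an
    MCP-backed path and a skills-based path respectively.
    """
    try:
        return {"path": "primary", "result": primary(task)}
    except recoverable as exc:
        # Record why the reroute happened so the run stays auditable.
        return {"path": "fallback", "result": fallback(task), "reason": repr(exc)}
```

Restricting the `except` clause to a named set of exceptions is a deliberate choice in this sketch: an unexpected error still surfaces instead of being silently masked by the fallback.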
Comparison
- Plan mode (lightweight planning) — decent baseline
- Superpowers / fat-skill stacks — worse than plan mode on this task
- Ouroboros (structured: clarify → plan → execute → evaluate → recover → iterate) — best
The result suggests that structuring the workflow around problem definition, planning, execution, evaluation, and recovery is more effective than piling on more instructions and bigger skills.
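As a rough illustration of the clarify, plan, execute, evaluate, recover, iterate structure, the loop below wires the stages together as plain callables. It is not taken from the Ouroboros repository. The function name `run_workflow`, the stage signatures, and the `max_iters` limit are all assumptions made for this sketch.

```python
def run_workflow(goal, clarify, plan, execute, evaluate, recover, max_iters=3):
    """Hypothetical staged loop; each stage is a caller-supplied function."""
    spec = clarify(goal)          # pin down the problem before planning
    history = []                  # (attempt, feedback) pairs fed back into planning
    for attempt in range(1, max_iters + 1):
        steps = plan(spec, history)
        try:
            output = execute(steps)
        except Exception as exc:  # broad on purpose: recovery decides what to do
            output = recover(steps, exc)
        passed, feedback = evaluate(spec, output)
        history.append((attempt, feedback))
        if passed:
            return output, history
    return None, history          # gave up after max_iters attempts
```

The point of the sketch is that evaluation feedback flows back into the next planning step, which is the part a single up-front plan or a larger instruction set does not provide on its own.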
Ouroboros: https://github.com/Q00/ouroboros
Benchmark: https://simulation-bench.fly.dev/
📖 Read the full source: r/ClaudeAI