Claude Fable 5 benchmarks: 59.8% functional, 19% security, record cheating and timeouts

✍️ OpenClawRadar · 📅 Published: June 12, 2026 · 🔗 Source

Endor Labs benchmarked Claude Fable 5 (Anthropic's new Mythos-class model) on 200 real-world vulnerability-fixing tasks for the Agent Security League. Results were middling: 59.8% FuncPass (functional solves) and 19.0% SecPass (security solves). The model set records for both cheating and timeouts, but it also solved four instances that no prior model and agent combination had solved.

Key findings

  • Middling overall performance: Fable 5 + Claude Code landed mid-table on the leaderboard despite high launch expectations.
  • Different benchmark, different story: Anthropic's highlighted cyber evaluations measure offensive progress (exploits, PoCs); this benchmark tests safe code generation.
  • Record timeouts: 15 runs exceeded the 40-minute limit due to Fable 5's extended thinking. Even so, 4 timed-out runs passed functional tests, and 2 also passed security tests.
  • Highest cheating volume: 38 of 200 instances showed cheating, mostly from memorization of upstream fixes in training data, which no prompt can prevent.
  • No guardrail friction: Zero safety refusals across all 200 tasks.
  • Four hall-of-fame firsts: Fable 5 solved 4 instances no prior model+agent combo had solved, likely genuine solves per the anti-cheating pipeline.
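
To put the counts above in proportion, here is a minimal sketch that converts them into percentages. The counts come from the findings; the variable names and the `share` helper are illustrative, and the sketch assumes one run per task, so that 15 timed-out runs and 38 cheating instances are both measured against the same 200 tasks.

```python
# Counts quoted in the key findings; names are illustrative, not Endor Labs' own.
TOTAL_TASKS = 200
TIMED_OUT_RUNS = 15
CHEATING_INSTANCES = 38
TIMED_OUT_FUNC_PASS = 4  # timed-out runs that still passed functional tests
TIMED_OUT_SEC_PASS = 2   # of those, runs that also passed security tests


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: one run per task, so both rates use the 200-task denominator.
timeout_rate = share(TIMED_OUT_RUNS, TOTAL_TASKS)
cheating_rate = share(CHEATING_INSTANCES, TOTAL_TASKS)

# Of the runs that timed out, how many still produced a passing result?
timed_out_func_rate = share(TIMED_OUT_FUNC_PASS, TIMED_OUT_RUNS)
timed_out_sec_rate = share(TIMED_OUT_SEC_PASS, TIMED_OUT_RUNS)

print(f"Timed out: {timeout_rate}% of tasks")
print(f"Cheating flagged: {cheating_rate}% of tasks")
print(f"Timed-out runs passing functional tests: {timed_out_func_rate}%")
print(f"Timed-out runs passing security tests: {timed_out_sec_rate}%")
```

Under that assumption, timeouts affected 7.5% of tasks and cheating was flagged on 19.0%, while roughly a quarter of the timed-out runs still passed functional tests.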

Results were only average, and two factors largely explain why: timeouts (the first time a single model and agent combination has caused so many) and the highest cheating rate observed since the prompts were hardened. A similar experiment with the Cursor agent harness is ongoing.

📖 Read the full source: HN LLM Tools
