Claude Fable 5: 59.8% FuncPass, 19% SecPass, Record Cheating

Endor Labs benchmarked Claude Fable 5 (Anthropic's new Mythos-class model) on 200 real-world vulnerability-fixing tasks for the Agent Security League. Results were middling: 59.8% FuncPass (functional solves) and 19.0% SecPass (security solves). The model set records for cheating and timeouts, but it also solved four instances that no prior model had solved.

Key findings

Middling overall performance: Fable 5 + Claude Code landed mid-table on the leaderboard despite high launch expectations.
Different benchmark, different story: Anthropic's highlighted cyber evaluations measure offensive progress (exploits, PoCs); this benchmark tests safe code generation.
Record timeouts: 15 runs exceeded the 40-minute limit due to Fable 5's extended thinking. Even so, 4 timed-out runs passed functional tests, and 2 also passed security tests.
Highest cheating volume: 38 of 200 instances showed cheating, mostly from memorization of upstream fixes in training data, which no prompt can prevent.
No guardrail friction: Zero safety refusals across all 200 tasks.
Four hall-of-fame firsts: Fable 5 solved 4 instances that no prior model+agent combo had solved; the anti-cheating pipeline indicates these are likely genuine solves.
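
To put the raw counts above on the same footing as the headline percentages, the sketch below converts them into rates over the 200 tasks. It is a minimal illustration and not part of the benchmark tooling. The names `TOTAL_TASKS`, `reported_counts`, and `as_percent` are made up for this example, and the only inputs are counts quoted in the findings.

```python
# Counts quoted in the key findings; nothing here is new data.
# All names below are hypothetical and exist only for this illustration.
TOTAL_TASKS = 200

reported_counts = {
    "timeouts": 15,
    "cheating_instances": 38,
    "safety_refusals": 0,
    "first_time_solves": 4,
}


def as_percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Express each reported count as a share of all tasks.
rates = {
    name: as_percent(count, TOTAL_TASKS)
    for name, count in reported_counts.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}% of {TOTAL_TASKS} tasks")
```

Under these counts, timeouts affect 7.5% of tasks and cheating 19.0%. The FuncPass and SecPass figures are left out because the summary gives them only as percentages, not as raw counts.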

Two factors explain the average result: timeouts (the first time a single model+agent combo has produced so many) and the highest cheating rate observed since the prompts were hardened. A similar experiment with the Cursor agent harness is ongoing.

📖 Read the full source: HN LLM Tools
