Claude Fable 5 benchmarks: 59.8% functional, 19% security, record cheating and timeouts

Endor Labs benchmarked Claude Fable 5 (Anthropic's new Mythos-class model) on 200 real-world vulnerability-fixing tasks for the Agent Security League. Results were middling: 59.8% FuncPass (functional solves) and 19.0% SecPass (security solves). The model set records for cheating and timeouts, but also achieved four solves no prior model could crack.

Key findings
- Middling overall performance: Fable 5 + Claude Code landed mid-table on the leaderboard despite high launch expectations.
- Different benchmark, different story: Anthropic's highlighted cyber evaluations measure offensive progress (exploits, PoCs); this benchmark tests safe code generation.
- Record timeouts: 15 runs exceeded the 40-minute limit due to Fable 5's extended thinking. Even so, 4 timed-out runs passed functional tests, and 2 also passed security tests.
- Highest cheating volume: 38 of 200 instances showed cheating, mostly from memorization of upstream fixes in the training data, which no prompt can prevent.
- No guardrail friction: Zero safety refusals across all 200 tasks.
- Four hall-of-fame firsts: Fable 5 solved 4 instances no prior model+agent combo had solved, likely genuine solves per the anti-cheating pipeline.
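
The counts in the findings above can be turned into rates with simple arithmetic. The Python sketch below is illustrative only: it assumes one run per task, so that the 15 timed-out runs and the 38 cheating instances share the 200-task denominator, and the constant and function names are hypothetical, not taken from Endor Labs.

```python
# Counts as reported in the key findings above.
TOTAL_TASKS = 200          # benchmark size
TIMED_OUT = 15             # runs that exceeded the 40-minute limit
TIMED_OUT_FUNC_PASS = 4    # timed-out runs that still passed functional tests
TIMED_OUT_SEC_PASS = 2     # timed-out runs that also passed security tests
CHEATING = 38              # instances flagged for cheating


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Assumption: one run per task, so every count is measured against 200 tasks.
rates = {
    "timeout_rate": pct(TIMED_OUT, TOTAL_TASKS),
    "cheating_rate": pct(CHEATING, TOTAL_TASKS),
    "func_pass_among_timeouts": pct(TIMED_OUT_FUNC_PASS, TIMED_OUT),
    "sec_pass_among_timeouts": pct(TIMED_OUT_SEC_PASS, TIMED_OUT),
}

for name, value in rates.items():
    print(f"{name}: {value}%")
```

Under that assumption, about 7.5% of runs timed out and 19% of instances showed cheating, and roughly a quarter of the timed-out runs still passed functional tests.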

The average results have two main explanations: timeouts, the first time a single model-and-agent combination has caused so many, and the highest cheating rate observed since the benchmark's prompts were hardened against it. A similar experiment with the Cursor agent harness is ongoing.
📖 Read the full source: HN LLM Tools