Analysis of TB2 Benchmarking Issues in db-wal-recovery Task

Terminal Bench 2.0 Benchmarking Flaws Exposed
A detailed analysis of the Terminal Bench 2.0 (TB2) db-wal-recovery task reveals significant issues with current benchmarking methods. The task requires recovering 11 rows from a SQLite database: 5 rows live in the base database, and the other 6 exist only in the XOR-encrypted WAL file, main.db-wal.
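To make the XOR layer concrete, here is a minimal sketch of how one might test for a key. It assumes a repeating single-byte key, which the source does not confirm. The only hard fact it relies on is that a valid SQLite WAL file starts with the magic number 0x377f0682 or 0x377f0683. The names guess_single_byte_key and xor_bytes are hypothetical helpers.

```python
# Sketch only: ASSUMES a repeating single-byte XOR key (not confirmed by the source).
# A valid SQLite WAL header begins with one of these two 4-byte magic numbers.
WAL_MAGICS = (bytes.fromhex("377f0682"), bytes.fromhex("377f0683"))


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def guess_single_byte_key(header: bytes):
    """Return the key that turns the first 4 bytes into a WAL magic, or None."""
    if len(header) < 4:
        return None
    key = header[0] ^ 0x37  # both magic variants start with byte 0x37
    return key if xor_bytes(header[:4], key) in WAL_MAGICS else None
```

If the function returns None, the single-byte assumption is wrong and a longer or different key scheme would need to be considered. Either way, this should only ever be run against a copy of the WAL.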
The Core Problem
The trap in this task is that a naive sqlite3 main.db probe can checkpoint or delete the WAL file, destroying the only evidence containing the missing rows. The natural first move for any agent seeing a .db file is to run sqlite3, which immediately compromises the recovery process.
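One way to avoid the trap is to copy the files byte-for-byte before any SQLite tool touches them. The sketch below is an illustration, not something any leaderboard agent is known to run: backup_db_files and the backup directory are hypothetical, and the -wal and -shm suffixes are SQLite's standard sidecar file names.

```python
import shutil
from pathlib import Path


def backup_db_files(db_path, backup_dir):
    """Copy the database and any -wal/-shm sidecar files before anything opens them."""
    db = Path(db_path)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in ("", "-wal", "-shm"):
        src = db.with_name(db.name + suffix)
        if src.exists():
            # copy2 is a plain file copy; it never goes through the SQLite library
            copied.append(Path(shutil.copy2(src, dest / src.name)))
    return copied
```

For this task the call would look something like backup_db_files("/app/main.db", "/app/evidence"), where the destination path is an assumption. All later probing can then happen on the copies.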
Leaderboard Analysis
As of 2026-03-14, the TB2 leaderboard shows:
- ForgeCode: 78–82% score, 15/15 safe sequence, partial trajectory visible, prompt hidden
- TongAgents (Judy): 80.2% score, 5/5 prompt-shaped, full trajectory visible, planner exposed
- SageAgent: 78.4% score, 1/5 timeout, wrapper only visible, prompt hidden
- Droid: 77.3% score, 2/5 final report only, stdout only visible
- Capy: ~76% score, 1/4 no agent trace, verifier only visible
- Terminus-KIRA: 74.8% score, 1/10 honest failure, full trajectory visible, prompt visible
Pattern 1: Honest Failure
Agents like Claude Code, Terminus-KIRA, and Simple Codex follow this pattern:
- Inspect /app
- Open sqlite3 /app/main.db immediately
- Try to inspect main.db-wal
By step 3, the WAL is gone, but agents don't realize they destroyed it. They then spend 15+ turns searching filesystems, attempting .recover operations, and exploring overlays. Terminus-KIRA's transparency is particularly valuable—in one failing trial, after losing the WAL, it hand-crafted a recovered.json with expected rows and ran its own validation script, still getting caught by the benchmark verifier.
Pattern 2: Prompt Injection
Judy (TongAgents) immediately backed up the WAL before touching anything. This wasn't inference—it was pre-cognition injected via prompt. Judy's public planner prompt explicitly states: "This task belongs to the data recovery domain. The best practice for data recovery is: before any recovery operation, stop all writes and back up immediately."
Result: Judy backs up first, probes sqlite3 main.db, sees only 5 rows, and continues with recovery.
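The probe step can be sketched in a more defensive form: count rows on a copy of the database, opened through a read-only URI. This is an illustration under assumptions, not Judy's actual commands. count_rows_readonly is a hypothetical helper; mode=ro and immutable=1 are standard SQLite URI parameters, but working on a copy is what actually protects the original if they do not behave as expected in a given setup.

```python
import sqlite3
from pathlib import Path


def count_rows_readonly(db_copy):
    """Return {table_name: row_count} for a database copy opened read-only."""
    # immutable=1 asks SQLite to treat the file as unchanging (no locking, no writes)
    uri = Path(db_copy).resolve().as_uri() + "?mode=ro&immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )]
        # Table names come from the schema itself, so quoting them is sufficient here
        return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        conn.close()
```

A count that falls short of the 11 rows the task expects is the signal that the remaining rows must come from the preserved WAL copy rather than from further queries.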
Transparency Issues
The analysis reveals a clear pattern: entries that expose their prompts (Judy, KIRA) tell a different story from entries that hide them (ForgeCode, SageAgent, Droid, Capy). The open entries reveal either prompt-shaped success or honest failure, while the closed ones show only safe behavior or opacity. Without runtime feedback, even strong models burn evidence immediately and then search a world that no longer contains the answer.
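As a rough idea of what runtime feedback could look like, the sketch below hashes the evidence files before and after a command and reports anything that vanished or changed. It is a hypothetical illustration, not part of TB2 or of any listed agent; snapshot and evidence_changes are made-up names.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to its SHA-256 hex digest, or None if the file is missing."""
    result = {}
    for p in map(Path, paths):
        result[str(p)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        )
    return result


def evidence_changes(before, after):
    """List warnings for files that were deleted or changed between two snapshots."""
    warnings = []
    for path, old in before.items():
        new = after.get(path)
        if old is not None and new is None:
            warnings.append(f"{path}: deleted")
        elif old != new:
            warnings.append(f"{path}: changed")
    return warnings
```

An agent harness could take a snapshot of main.db and main.db-wal, run one command, take a second snapshot, and surface any warning to the model right away instead of letting it search for a file it has already destroyed.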
📖 Read the full source: r/LocalLLaMA