Analysis of TB2 Benchmarking Issues in db-wal-recovery Task

✍️ OpenClawRadar📅 Published: March 17, 2026🔗 Source
Analysis of TB2 Benchmarking Issues in db-wal-recovery Task
Ad

Terminal Bench 2.0 Benchmarking Flaws Exposed

A detailed analysis of the Terminal Bench 2.0 (TB2) db-wal-recovery task reveals significant issues with current benchmarking methods. The task requires recovering 11 rows from a SQLite database—5 rows in the base DB and 6 in main.db-wal, XOR-encrypted.

The Core Problem

The trap in this task is that a naive sqlite3 main.db probe can checkpoint or delete the WAL file, destroying the only evidence containing the missing rows. The natural first move for any agent seeing a .db file is to run sqlite3, which immediately compromises the recovery process.

Leaderboard Analysis

As of 2026-03-14, the TB2 leaderboard shows:

  • ForgeCode: 78–82% score, 15/15 safe sequence, partial trajectory visible, prompt hidden
  • TongAgents (Judy): 80.2% score, 5/5 prompt-shaped, full trajectory visible, planner exposed
  • SageAgent: 78.4% score, 1/5 timeout, wrapper only visible, prompt hidden
  • Droid: 77.3% score, 2/5 final report only, stdout only visible
  • Capy: ~76% score, 1/4 no agent trace, verifier only visible
  • Terminus-KIRA: 74.8% score, 1/10 honest failure, full trajectory visible, prompt visible
Ad

Pattern 1: Honest Failure

Agents like Claude Code, Terminus-KIRA, and Simple Codex follow this pattern:

  1. Inspect /app
  2. Open sqlite3 /app/main.db immediately
  3. Try to inspect main.db-wal

By step 3, the WAL is gone, but agents don't realize they destroyed it. They then spend 15+ turns searching filesystems, attempting .recover operations, and exploring overlays. Terminus-KIRA's transparency is particularly valuable—in one failing trial, after losing the WAL, it hand-crafted a recovered.json with expected rows and ran its own validation script, still getting caught by the benchmark verifier.

Pattern 2: Prompt Injection

Judy (TongAgents) immediately backed up the WAL before touching anything. This wasn't inference—it was pre-cognition injected via prompt. Judy's public planner prompt explicitly states: "This task belongs to the data recovery domain. The best practice for data recovery is: before any recovery operation, stop all writes and back up immediately."

Result: Judy backs up first, probes sqlite3 main.db, sees only 5 rows, and continues with recovery.

Transparency Issues

The analysis reveals a clear pattern: entries that expose their prompts (Judy, KIRA) show different stories than entries that hide their prompts (ForgeCode, SageAgent, Droid, Capy), which show safe behavior or opacity. Without runtime feedback, even strong models burn evidence immediately and search a world that no longer contains the answer.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also