Claude Code benchmark reveals AI judge blind spot: pipeline bugs misattributed to model capability

Benchmark setup and initial results
A developer ran a controlled benchmark across three coding-agent stacks using Claude Code (Opus 4.6) as an autonomous evaluator. The benchmark tested: OpenCode + MiniMax-M2.7, Gemini CLI + Gemini 3.1 Pro, and Codex CLI + GPT-5.4. Each retest was a fresh session with no cross-session memory, using the prompt: "execute the benchmark plan, collect artifacts, write a report."
In the first two runs, OpenCode + MiniMax scored 15/60 and 16/60 respectively. The auto-generated reports stated: "Consistent with previous results: fast execution but no meaningful code output" and "Consistent: MiniMax cannot implement the task. The model may lack the capability to read external files and produce code changes in this Rust codebase."
The bug discovery
After two sessions produced near-identical verdicts blaming the model, the developer sent one instruction to a fresh session: "go deeper, check the daemon logs before retrying." The new session traced the issue to a spill file at ~/.orchestratord/logs/<task_id>.txt. The plan step was producing 50KB of useful context, but OpenCode's sandbox only allowed reads inside the workspace directory by default. Because the spill file sat outside the workspace, the read failed silently and the implement step received an empty string instead of the plan.
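The failure described above could have surfaced immediately if the handoff had raised an error instead of passing an empty string downstream. Below is a minimal Python sketch of that idea. The names read_step_input and SandboxReadError are hypothetical and are not part of OpenCode or the orchestrator; the sketch only assumes that a step input is a text file and that the sandbox is defined by a single workspace directory.

```python
from pathlib import Path


class SandboxReadError(Exception):
    """Raised when a step input is unreadable from inside the workspace."""


def read_step_input(path: str, workspace: str) -> str:
    """Read a step input, failing loudly instead of returning "".

    Hypothetical helper for illustration; not the real orchestrator API.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    resolved = Path(path).expanduser().resolve()
    root = Path(workspace).expanduser().resolve()

    # Reject any path that resolves outside the workspace directory.
    if not resolved.is_relative_to(root):
        raise SandboxReadError(f"{resolved} is outside workspace {root}")

    text = resolved.read_text(encoding="utf-8")

    # An empty plan is far more likely a pipeline fault than a model answer.
    if not text.strip():
        raise SandboxReadError(f"{resolved} is empty")
    return text
```

With a check like this, a spill file under the home directory would stop the run with an explicit error at the handoff, rather than producing a low score that looks like a model failure.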
The session filed a one-line config fix (moving the spill path inside the workspace) and re-ran the benchmark. After the fix, MiniMax produced 219 lines of code including a RetryConfig struct and a connect_with_retry helper, scoring 18/60. The remaining issues were real model weaknesses: four type-mismatch compile errors in unit tests.
Implications for AI evaluation
The incident reveals a critical blind spot in autonomous AI judges: they don't ask "is my pipeline broken?" even when their own analysis identifies symptoms like "may lack the capability to read external files." The first two sessions ran the full benchmark end-to-end and produced comprehensive reports but never checked daemon logs on their own. Only when explicitly told to investigate did the third session discover the configuration bug.
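One way to build the "is my pipeline broken?" question into a judge is a preflight check that runs before any failure is attributed to the model. The Python sketch below is illustrative only: StepRecord, its fields, and both functions are assumptions, not something the benchmark harness is known to expose. It flags any handoff where the upstream step produced output but the downstream step received nothing.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """Hypothetical per-step record; field names are illustrative."""
    name: str
    input_chars: int   # size of the input the step actually received
    output_chars: int  # size of the output the step produced


def pipeline_suspects(steps: list[StepRecord]) -> list[str]:
    """List handoffs where output existed upstream but arrived empty."""
    warnings = []
    for prev, cur in zip(steps, steps[1:]):
        if prev.output_chars > 0 and cur.input_chars == 0:
            warnings.append(
                f"{prev.name} produced {prev.output_chars} chars "
                f"but {cur.name} received none"
            )
    return warnings


def attribute_failure(steps: list[StepRecord]) -> str:
    """Blame the model only when no handoff looks broken."""
    suspects = pipeline_suspects(steps)
    if suspects:
        return "inconclusive: check pipeline (" + "; ".join(suspects) + ")"
    return "model"


# Example shaped like the incident: a large plan, an empty implement input.
steps = [
    StepRecord("plan", input_chars=200, output_chars=50_000),
    StepRecord("implement", input_chars=0, output_chars=0),
]
verdict = attribute_failure(steps)
```

In this example the verdict is "inconclusive" rather than "model", which is the outcome the first two sessions failed to reach on their own.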
This failure mode is particularly relevant as LLM-as-judge has become the default eval methodology for many agent benchmarks, including arena-style auto-scoring, internal A/B harnesses, and reward modeling. The developer notes: "I came within one human keystroke of publishing a benchmark that confidently mis-attributed a sandbox bug to a model."
Other benchmark results
Codex + GPT-5.4 took the top spot at 50/60, even though its step_finished success rate was only 25% (three of four orchestrator steps reported failure). The developer notes this oddity but does not explain it.
📖 Read the full source: r/LocalLLaMA