Claude Code Benchmark Bug: AI Judge Blames Pipeline Errors on Model

Benchmark setup and initial results

A developer ran a controlled benchmark across three coding-agent stacks using Claude Code (Opus 4.6) as an autonomous evaluator. The benchmark tested: OpenCode + MiniMax-M2.7, Gemini CLI + Gemini 3.1 Pro, and Codex CLI + GPT-5.4. Each retest was a fresh session with no cross-session memory, using the prompt: "execute the benchmark plan, collect artifacts, write a report."

In the first two runs, OpenCode + MiniMax scored 15/60 and 16/60 respectively. The auto-generated reports stated: "Consistent with previous results: fast execution but no meaningful code output" and "Consistent: MiniMax cannot implement the task. The model may lack the capability to read external files and produce code changes in this Rust codebase."

The bug discovery

After two sessions producing identical verdicts blaming the model, the developer sent one instruction to a fresh session: "go deeper, check the daemon logs before retrying." The new session traced the issue to a spill file at ~/.orchestratord/logs/<task_id>.txt. The plan step was producing 50KB of useful context, but OpenCode's sandbox only allowed reads inside the workspace directory by default. Since the spill file was outside the workspace, the implement step received an empty string instead of the plan.

The session filed a one-line config fix (moving the spill path inside the workspace) and re-ran the benchmark. After the fix, MiniMax produced 219 lines of code including a RetryConfig struct and a connect_with_retry helper, scoring 18/60. The remaining issues were real model weaknesses: four type-mismatch compile errors in unit tests.

Implications for AI evaluation

The incident reveals a critical blind spot in autonomous AI judges: they don't ask "is my pipeline broken?" even when their own analysis identifies symptoms like "may lack the capability to read external files." The first two sessions ran the full benchmark end-to-end and produced comprehensive reports but never checked daemon logs on their own. Only when explicitly told to investigate did the third session discover the configuration bug.

This failure mode is particularly relevant as LLM-as-judge has become the default eval methodology for many agent benchmarks, including arena-style auto-scoring, internal A/B harnesses, and reward modeling. The developer notes: "I came within one human keystroke of publishing a benchmark that confidently mis-attributed a sandbox bug to a model."

Other benchmark results

Codex + GPT-5.4 took the top spot at 50/60, though it had a step_finished success rate of only 25% (three of four orchestrator steps reported failure). The developer notes this oddity without further explanation in the provided source text.

📖 Read the full source: r/LocalLLaMA