Opus 4.7 Reasoning Effort Benchmark: Medium Beats High and Max on Real Tasks

✍️ OpenClawRadar📅 Published: May 13, 2026🔗 Source
Ad

Reddit user ktane tested Claude Opus 4.7 in Claude Code across five reasoning effort settings (low, medium, high, xhigh, max) on 29 real tasks from the open-source GraphQL-go-tools repository. The result: medium reasoning effort consistently outperformed higher settings on test pass rate, semantic equivalence with human-authored patches, code-review pass rate, and aggregate craft/discipline scores.

Ad

Key Results

  • All-task pass rate: Medium 28/29, Max 27/29, High 26/29, Xhigh 25/29, Low 23/29
  • Equivalent patches: Medium 14/29, Max 13/29, High 12/29, Xhigh 11/29, Low 10/29
  • Code-review pass rate: Medium 10/29, High 7/29, Max 8/29, Xhigh 4/29, Low 5/29
  • Code-review rubric mean: Medium 2.716, High 2.509, Xhigh 2.482, Max 2.431, Low 2.426
  • Footprint risk (lower is better): Low 0.155, Medium 0.189, High 0.206, Max 0.227, Xhigh 0.238
  • Cost per task: Low $2.50, Medium $3.15, High $5.01, Xhigh $6.51, Max $8.84
  • Duration per task: Low 383.8s, Medium 450.7s, High 716.4s, Xhigh 803.8s, Max 996.9s
  • Equivalent passes per dollar: Low 4.0, Medium 4.4, High 2.4, Xhigh 1.7, Max 1.5

The author notes that Opus 4.7 uses adaptive thinking — it already allocates reasoning budget per task. The effort knob thus biases an already-adaptive policy rather than adding raw intelligence. Notably, in one PR (#1260), high and xhigh settings wasted extra reasoning on digging up commit hashes from prior PRs and concluded 'no work needed', while medium and max correctly read the control flow and produced a fix.

This contrasts with GPT-5.5 in Codex, which showed the intuitive monotonic curve where more reasoning improved quality. The full interactive report with per-task drilldowns is available at stet.sh.

📖 Read the full source: r/ClaudeAI

Ad

👀 See Also