Opus 4.6 excels at research, Gemini 3.1 Pro has better judgment in forecasting benchmark

A Reddit user posted results from a benchmark comparing four frontier models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20) on 1,417 binary forecasting questions from October–December 2025. The key innovation is scoring each model under two evaluation conditions: agentic (each model performs its own web research using tools) and fixed-evidence (all models receive the same ~12,000-character research dossier compiled via the Bosse et al. 2026 standardization methodology). Comparing the two conditions separates research skill from judgment.
Key findings
- Opus 4.6 performs dramatically better in the agentic condition: it is better at figuring out what to search for, deciding which pages to read, and extracting relevant details. However, when research is removed, its advantage disappears.
- Gemini 3.1 Pro delivers sharper judgment on fixed evidence: it weights the same information more accurately when forecasting. Its calibration actually improves when given the standardized dossier, while Opus's calibration worsens sharply.
- GPT-5.4 and Grok 4.20 barely change between conditions, suggesting their performance is less dependent on search strategy.
- The rank order of Opus and Gemini swaps between conditions, which the poster argues is evidence that the evaluation is not broken or biased (a biased eval would likely move all models in the same direction).
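The rank-swap argument can be sketched in a few lines of Python. This is a minimal illustration, not the poster's code: it assumes forecasts are scored with the Brier score (the post does not say which scoring rule the benchmark uses), and the model names and scores below are invented placeholders, not benchmark results.

```python
from itertools import combinations


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better. Using the Brier score here is an assumption.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def rank_models(scores):
    """Model names ordered best to worst, given {model: score}, lower = better."""
    return sorted(scores, key=scores.get)


def swapped_pairs(scores_a, scores_b):
    """Pairs of models whose relative order differs between two conditions."""
    pairs = []
    for m1, m2 in combinations(sorted(scores_a), 2):
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        # Opposite signs mean the pair changed order between conditions.
        if diff_a * diff_b < 0:
            pairs.append((m1, m2))
    return pairs


# Placeholder per-condition scores, invented purely for illustration.
agentic = {"model_a": 0.15, "model_b": 0.18, "model_c": 0.20}
fixed = {"model_a": 0.19, "model_b": 0.17, "model_c": 0.20}

print(rank_models(agentic))
print(rank_models(fixed))
print(swapped_pairs(agentic, fixed))
```

A uniform bias in one condition would shift every score in the same direction and leave `swapped_pairs` empty; a non-empty result is the pattern the poster points to.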
Interpretation
The asymmetry in calibration (Opus's calibration worsens when search is removed, while Gemini's improves) suggests Opus may be using its search trace as scaffolding for probability assignment. In other words, the act of conducting the search loop itself may do some of the epistemic work, separate from the information it surfaces. This is a hypothesis rather than an established result, but if it holds it could have implications for how AI research agents are evaluated and designed.
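For readers unfamiliar with the terms, one standard way to separate calibration from refinement is the two-part decomposition of the Brier score. The sketch below is an illustration under that assumption; the post does not specify how the benchmark computes its calibration and refinement scores, and the function name and rounding grid are hypothetical choices.

```python
from collections import defaultdict


def calibration_refinement(probs, outcomes, decimals=1):
    """Split the Brier score of rounded forecasts into two parts.

    calibration: how far each stated probability is from the observed
                 frequency among questions given that probability (0 = perfect).
    refinement:  outcome variance remaining within each probability group
                 (lower = forecasts sort questions more sharply).
    Their sum equals the Brier score of the rounded forecasts.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Group questions by forecast value, rounded to a coarse grid.
        groups[round(p, decimals)].append(o)
    n = len(probs)
    calibration = 0.0
    refinement = 0.0
    for forecast, outs in groups.items():
        base_rate = sum(outs) / len(outs)
        calibration += len(outs) * (forecast - base_rate) ** 2
        refinement += len(outs) * base_rate * (1 - base_rate)
    return calibration / n, refinement / n


# Toy forecasts, invented for illustration: 0.9 stated, only half resolve yes.
cal, ref = calibration_refinement([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(cal, ref)
```

Under this decomposition, a model can lose calibration without losing refinement, which is one way the asymmetry described above could show up in the scores.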
Limitations and resources
The fixed-evidence dossiers are themselves LM-produced, so the test may measure how well each model interprets one particular standardized version of the evidence rather than judgment in the abstract. The poster notes this as a limitation but argues that the divergent behavior across models reduces the concern.
Full calibration scores, refinement scores, and per-condition analysis are available at: futuresearch.ai/opus-research-gemini-judgment. The benchmark and leaderboard are at: evals.futuresearch.ai.
To the poster's knowledge, this is the first direct evaluation of frontier models that decomposes performance into research vs. judgment stages. They invite replication in other domains.
📖 Read the full source: r/ClaudeAI