Opus 4.6 vs Gemini 3.1 Pro: Forecasting Benchmark Results

A Reddit user posted results from a benchmark comparing four frontier models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20) on 1,417 binary forecasting questions from October–December 2025. The key design choice is scoring each model under two conditions so that research skill and judgment can be separated: agentic (each model performs its own web research using tools) and fixed-evidence (all models receive the same ~12,000-character research dossier, compiled via the Bosse et al. 2026 standardization methodology).

Key findings

Opus 4.6 performs dramatically better in the agentic condition: it is better at deciding what to search for, choosing which pages to read, and extracting relevant details. When every model is given the same evidence instead of doing its own research, that advantage disappears.
Gemini 3.1 Pro delivers sharper judgment on fixed evidence: it weights information more accurately on forecasting tasks. Its calibration actually improves when it is given the standardized dossier, while Opus's calibration drops sharply.
GPT-5.4 and Grok 4.20 barely change between conditions, suggesting their performance is less dependent on search strategy.
Opus and Gemini swap rank order across conditions. The poster argues this indicates the evaluation is not broken or biased, since a biased eval would likely move all models in the same direction.
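
The summary does not say which scoring rule produces the calibration and refinement scores it mentions. As a minimal sketch, assuming a Brier-style score, the Murphy decomposition is one common way to separate the two: the reliability term measures calibration (lower is better) and the resolution term measures refinement (higher is better). The function name, the bin count, and the toy data below are illustrative assumptions, not values or code from the benchmark.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes, n_bins=10):
    """Murphy decomposition of the Brier score for binary forecasts.

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1.
    Returns (brier, reliability, resolution, uncertainty).
    Hypothetical helper for illustration only.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")

    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group forecasts into equal-width probability bins.
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    reliability = 0.0  # calibration error: forecast vs. observed frequency
    resolution = 0.0   # refinement: how far bin frequencies sit from the base rate
    for members in bins.values():
        k = len(members)
        mean_p = sum(p for p, _ in members) / k
        freq = sum(y for _, y in members) / k
        reliability += k * (mean_p - freq) ** 2
        resolution += k * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty


if __name__ == "__main__":
    # Toy data, made up for illustration (not benchmark results).
    toy_forecasts = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
    toy_outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
    print(brier_decomposition(toy_forecasts, toy_outcomes))
```

When every bin holds a single probability value, as in the toy data, the identity Brier = reliability - resolution + uncertainty holds exactly; with wider bins it is only approximate. Running the same function on one model's forecasts under each condition would show whether a score change comes from calibration, refinement, or both.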

Interpretation

The asymmetry in calibration (Opus's calibration drops when search is removed, while Gemini's improves) suggests Opus may be using its search trace as scaffolding for probability assignment. In other words, the act of conducting the search loop itself does some of the epistemic work, separate from the information it surfaces. If that holds, it could have implications for how AI research agents are evaluated and designed.

Limitations and resources

The fixed-evidence dossiers are themselves LLM-generated, so the test may measure how well each model interprets one particular standardized version of the evidence rather than judgment in the abstract. The poster notes this as a limitation but argues that the divergent behavior across models reduces the concern.

Full calibration scores, refinement scores, and per-condition analysis are available at: futuresearch.ai/opus-research-gemini-judgment. The benchmark and leaderboard are at: evals.futuresearch.ai.

To the poster's knowledge, this is the first direct evaluation of frontier models that decomposes performance into research vs. judgment stages. They invite replication in other domains.

📖 Read the full source: r/ClaudeAI
