Claude Opus 4.1 scores 17.75% on SWE-Bench Pro's private dataset, highlighting memorization vs. reasoning gap

Benchmark results show significant performance gap
Claude Opus 4.1 achieved 80%+ on SWE-Bench Verified but scored only 17.75% on SWE-Bench Pro's private dataset. That dataset contains 276 tasks from 18 proprietary startup codebases that have never been on GitHub, a design meant to eliminate the data contamination that can occur with public repositories, including GPL-licensed ones.
Other model results on the same private dataset: GPT-5.2 scored 23.81% (topping the leaderboard) and Gemini 3 Pro scored 17.95%.
Trajectory analysis reveals memorization behavior
Scale AI's analysis found that, on familiar repositories, models could identify the correct file paths to modify before fully reading the problem descriptions. This suggests they were navigating from memory rather than reasoning through the problems.
The 80%+ score on SWE-Bench Verified was real, but it measured a different capability than most people assumed: primarily memory of training data rather than reasoning about novel code.
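The "path before problem" pattern described above can be illustrated with a small check over an agent's action log. This is only a minimal sketch: the source does not describe how Scale AI actually ran its analysis, and the `Step` record, the action names (`read_issue`, `open_file`, `edit_file`), and the file path in the usage example are all hypothetical. It also simplifies "before fully reading" to "before the first read step".

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in a hypothetical agent trajectory (format is an assumption)."""
    action: str  # assumed values: "read_issue", "open_file", "edit_file"
    target: str  # file path, or "" for actions that have no path


def touched_gold_file_before_reading(steps: list[Step], gold_files: set[str]) -> bool:
    """Return True if the agent opened or edited a file from the reference
    patch before its first "read_issue" step."""
    for step in steps:
        if step.action == "read_issue":
            # The problem description was read first: not the flagged pattern.
            return False
        if step.action in {"open_file", "edit_file"} and step.target in gold_files:
            # A correct file was reached before the description was read.
            return True
    return False


# Usage with a made-up trajectory and a made-up reference-patch file.
trajectory = [
    Step("open_file", "src/billing/invoice.py"),
    Step("read_issue", ""),
    Step("edit_file", "src/billing/invoice.py"),
]
print(touched_gold_file_before_reading(trajectory, {"src/billing/invoice.py"}))  # prints True
```

A high rate of flagged trajectories on public repositories, alongside a low rate on private ones, would be consistent with the memorization reading, though a heuristic like this cannot prove it on its own.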
Practical implications for AI coding tool deployment
For developers deciding where to deploy AI coding tools in their workflow, the distinction between memory and reasoning matters more than headline benchmark numbers. Models that perform well on contaminated benchmarks may struggle with codebases they did not see during training.
SWE-Bench Pro was created specifically to address this contamination issue by using code that has never been publicly available on GitHub or in training datasets.
📖 Read the full source: r/ClaudeAI