SWE-rebench Leaderboard Update: February 2026 Results Show Tight Competition

✍️ OpenClawRadar | 📅 Published: March 23, 2026 | 🔗 Source

SWE-rebench February 2026 Results

The SWE-rebench leaderboard has been updated with February 2026 runs on 57 fresh GitHub PR tasks. The setup follows standard SWE-bench methodology: models read real PR issues, edit code, run tests, and must make the full test suite pass. Tasks are restricted to PRs created in the previous month.
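The post does not spell out how the two headline metrics are computed. As a minimal sketch, assume each task is attempted several times per model, "resolved rate" is the mean success over all attempts, and pass@5 counts a task as solved if any of five attempts passes. The task IDs and outcomes below are made up for illustration and are not leaderboard data.

```python
from typing import Dict, List

# Hypothetical per-task outcomes: True means the full test suite passed on that attempt.
# These values are illustrative only, not taken from the leaderboard.
RunLog = Dict[str, List[bool]]


def resolved_rate(runs: RunLog) -> float:
    """Fraction of attempts that resolved their task, pooled over all tasks."""
    total_attempts = sum(len(attempts) for attempts in runs.values())
    resolved_attempts = sum(sum(attempts) for attempts in runs.values())
    return resolved_attempts / total_attempts


def pass_at_k(runs: RunLog, k: int) -> float:
    """Fraction of tasks resolved by at least one of the first k attempts."""
    solved_tasks = sum(any(attempts[:k]) for attempts in runs.values())
    return solved_tasks / len(runs)


example_runs: RunLog = {
    "task-a": [True, True, False, True, True],
    "task-b": [False, False, False, False, False],
    "task-c": [False, True, False, False, False],
    "task-d": [True, True, True, True, True],
}

print(f"resolved rate: {resolved_rate(example_runs):.1%}")
print(f"pass@5:        {pass_at_k(example_runs, 5):.1%}")
```

Under these assumptions pass@5 is always at least as high as the resolved rate, which is consistent with the pattern in the reported numbers (about 70% pass@5 against a 65.3% resolved rate for the leader).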

Key Results

Claude Opus 4.6 remains at the top with a 65.3% resolved rate and a strong pass@5 of roughly 70%.
The top tier is extremely tight: gpt-5.2-medium (64.4%), GLM-5 (62.8%), and gpt-5.4-medium (62.8%) are all within 2.5 points of the leader.
Gemini 3.1 Pro Preview (62.3%) and DeepSeek-V3.2 (60.9%) complete a tightly packed top six.
Open-weight/hybrid models keep improving: Qwen3.5-397B (59.9%), Step-3.5-Flash (59.6%), and Qwen3-Coder-Next (54.4%) are closing the gap, driven by improved long-context use and scaling.
MiniMax M2.5 (54.6%) continues to stand out as a cost-efficient option with competitive performance.

Overall, February shows a highly competitive frontier: the top six models are separated by just 4.4 percentage points.
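To make the spread concrete, the gaps to the leader can be computed directly from the resolved rates quoted in the key results. This is only arithmetic on the reported numbers; the helper name `gaps_to_leader` is an illustrative choice, not part of any SWE-rebench tooling.

```python
from typing import Dict

# Resolved rates (%) as quoted in the key results.
reported_scores: Dict[str, float] = {
    "Claude Opus 4.6": 65.3,
    "gpt-5.2-medium": 64.4,
    "GLM-5": 62.8,
    "gpt-5.4-medium": 62.8,
    "Gemini 3.1 Pro Preview": 62.3,
    "DeepSeek-V3.2": 60.9,
    "Qwen3.5-397B": 59.9,
    "Step-3.5-Flash": 59.6,
    "MiniMax M2.5": 54.6,
    "Qwen3-Coder-Next": 54.4,
}


def gaps_to_leader(scores: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point gap between each model and the top score, best first."""
    top_score = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Round to one decimal to match the precision of the reported figures.
    return {name: round(top_score - score, 1) for name, score in ranked}


for model, gap in gaps_to_leader(reported_scores).items():
    print(f"{model:<24} -{gap:.1f} pts")
```

For scale, with 57 tasks a single task resolved on every attempt is worth about 1.75 percentage points (100 / 57), so several of these gaps amount to a task or two.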

📖 Read the full source: r/LocalLLaMA
