Qwen 3.6 27B Benchmarked on DeepSWE: 2% Score, 70 Hours, 44k Avg Output Tokens

✍️ OpenClawRadar | 📅 Published: June 22, 2026

A Reddit user benchmarked Qwen 3.6 27B on the DeepSWE benchmark and scored 2% (1.79%, rounded up), placing 18th out of 20, ahead of Haiku 4.5 and Minimax M2.7. The full run took 70 hours, with an average task time of 32 minutes and an average of 44k output tokens per task. That token count is surprisingly on par with the larger Qwen 3.6 Plus, despite the 27B model's reputation for verbosity.

Methodology

Model: Qwen 3.6 27B FP8 with BF16 KV cache, reasoning enabled, 262k context window, served via vLLM
Hardware: 1x RTX6000 Pro Blackwell on RunPod
Agent harness: mini-swe on Modal sandboxes
Rollouts: 1 per task (instead of the official 4) to save time, so no score range is reported
Costs: calculated from the RunPod hourly rate for completed tasks
Orchestration: Codex 5.5xhigh monitored and managed the full run
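
The cost method in the list above (hourly rental rate applied to completed tasks) can be sketched in a few lines of Python. This is an illustrative sketch, not the author's script: the hourly rate is a hypothetical placeholder, since the post summary does not state it, and the function names are invented for this example. It also assumes each task is billed for its own wall-clock time.

```python
def cost_per_task(hourly_rate_usd: float, task_minutes: float) -> float:
    """GPU rental cost attributed to one task, billed by wall-clock minutes."""
    return hourly_rate_usd * task_minutes / 60.0


def run_cost(hourly_rate_usd: float, task_minutes_list: list[float]) -> float:
    """Total cost over completed tasks only (failed or skipped tasks are excluded by the caller)."""
    return sum(cost_per_task(hourly_rate_usd, m) for m in task_minutes_list)


# Hypothetical placeholder: the actual RunPod rate is not given in the article.
HYPOTHETICAL_RATE_USD_PER_HOUR = 2.00

# Figures reported in the article.
AVG_TASK_MINUTES = 32
TOTAL_RUN_HOURS = 70

avg_task_cost = cost_per_task(HYPOTHETICAL_RATE_USD_PER_HOUR, AVG_TASK_MINUTES)
# Upper bound if every hour of the run were billed, including idle time.
full_run_cost = HYPOTHETICAL_RATE_USD_PER_HOUR * TOTAL_RUN_HOURS

print(f"Average cost per task at the assumed rate: ${avg_task_cost:.2f}")
print(f"Full 70-hour run at the assumed rate: ${full_run_cost:.2f}")
```

Swapping in the real hourly rate and the per-task durations from the run logs would give the actual figures; the numbers printed here only show the shape of the calculation.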

Key Observations

The author notes that the score is suspiciously close to that of Qwen 3.6 Plus, which raises questions about how the two models differ architecturally. They argue that local models are falling further behind frontier closed-source offerings: K2.6 is the best open-source model, but most users cannot run it locally. Qwen 3.6 27B is positioned as a "poor man's SOTA" local option. The trend suggests that frontier performance requires large scale, which in turn tends to push models toward closed source, leaving local inference increasingly uncompetitive.

📖 Read the full source: r/LocalLLaMA
