Qwen 3.6 27B Benchmarked on DeepSWE: 2% Score, 70 Hours, 44k Avg Output Tokens

A Reddit user ran Qwen 3.6 27B on the DeepSWE benchmark and scored 2% (1.79%, rounded up), placing 18th out of 20, above Haiku 4.5 and Minimax M2.7. The full run took 70 hours, with an average task time of 32 minutes and an average of 44k output tokens per task. That token count is surprisingly on par with the larger Qwen 3.6 Plus, despite the 27B model's reputation for verbosity.
Methodology
- Model: Qwen 3.6 27B in FP8 with a BF16 KV cache, reasoning enabled, 262k context window, served via vLLM
- Hardware: 1x RTX PRO 6000 Blackwell on RunPod
- Agent harness: mini-swe on Modal sandboxes
- Rollouts: 1 per task (instead of the official 4) to save time, so no score range is reported
- Costs calculated from RunPod hourly rate for completed tasks
- Orchestration: Codex 5.5 xhigh monitored and managed the full run
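
The scoring and cost bookkeeping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the author's script: the `TaskResult` record, its field names, and the hourly rate passed in are all hypothetical, and the sample data is made up. It assumes the score is the share of all tasks resolved by their single rollout, and that cost is the wall-clock time of completed tasks multiplied by a flat hourly GPU rate.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One benchmark task run with a single rollout (hypothetical record)."""
    task_id: str
    resolved: bool          # did the single rollout solve the task?
    wall_minutes: float     # wall-clock time spent on the task
    completed: bool = True  # False if the task was aborted before finishing


def pass_rate_percent(results: list[TaskResult]) -> float:
    """Share of all tasks resolved, as a percentage.

    With one rollout per task this is a point estimate only; there is no
    spread across rollouts from which to report a score range.
    """
    if not results:
        return 0.0
    return 100.0 * sum(r.resolved for r in results) / len(results)


def run_cost_usd(results: list[TaskResult], hourly_rate_usd: float) -> float:
    """GPU cost attributed to completed tasks at a flat hourly rate."""
    hours = sum(r.wall_minutes for r in results if r.completed) / 60.0
    return hours * hourly_rate_usd


if __name__ == "__main__":
    # Made-up sample data and a placeholder rate, not figures from the run.
    sample = [
        TaskResult("task-a", resolved=True, wall_minutes=30),
        TaskResult("task-b", resolved=False, wall_minutes=90),
        TaskResult("task-c", resolved=False, wall_minutes=60, completed=False),
    ]
    print(f"pass rate: {pass_rate_percent(sample):.2f}%")
    print(f"cost: ${run_cost_usd(sample, hourly_rate_usd=2.0):.2f}")
```

Running the official 4 rollouts per task would allow a per-rollout pass rate to be computed four times, which is what makes a minimum-to-maximum score range possible; a single rollout gives only one number.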
Key Observations
The author notes that the score is suspiciously close to that of Qwen 3.6 Plus, which raises questions about how the two models actually differ architecturally. They argue that local models are falling further behind frontier closed-source offerings: K2.6 is the best open-source model, but most people can't run it locally. Qwen 3.6 27B is positioned as a "poor man's SOTA" local option. The trend suggests that frontier performance requires large scale, which often leads to closed-sourcing and makes local inference a losing game in terms of competitiveness.
📖 Read the full source: r/LocalLLaMA