Qwen 3.6 27B Benchmarked on DeepSWE: 2% Score, 70 Hours, 44k Avg Output Tokens

✍️ OpenClawRadar | 📅 Published: June 22, 2026 | 🔗 Source

A Reddit user benchmarked Qwen 3.6 27B on the DeepSWE benchmark, scoring 2% (1.79%, rounded up). That places it 18th out of 20, above Haiku 4.5 and Minimax M2.7. The full run took 70 hours, with an average task time of 32 minutes and an average of 44k output tokens per task. That token count is surprisingly on par with the larger Qwen 3.6 Plus, despite the 27B model's reputation for verbosity.

Methodology

  • Model: Qwen 3.6 27B FP8 with BF16 KV cache, reasoning enabled, 262k context window, served via vLLM
  • Hardware: 1x RTX6000 Pro Blackwell on RunPod
  • Agent harness: mini-swe on Modal sandboxes
  • 1 rollout per task (instead of the official 4) to save time, so no score range is reported
  • Costs calculated from RunPod hourly rate for completed tasks
  • Orchestration: Codex 5.5xhigh monitored and managed the full run
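
The cost bullet above can be illustrated with a minimal sketch. It assumes cost is simply wall-clock time multiplied by the pod's hourly rate, with one task running at a time and incomplete tasks excluded. The `TaskRun` type, the task IDs, the durations, and the hourly rate are all hypothetical placeholders, not figures from the benchmark run.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str          # hypothetical identifier
    wall_minutes: float   # wall-clock time spent on the task
    completed: bool       # whether the task ran to completion


def cost_per_task(runs, hourly_rate_usd):
    """Return {task_id: cost in USD}, counting completed tasks only."""
    return {
        r.task_id: round(r.wall_minutes / 60.0 * hourly_rate_usd, 4)
        for r in runs
        if r.completed
    }


# Hypothetical numbers for illustration only; not taken from the run.
HYPOTHETICAL_RATE_USD = 2.00
runs = [
    TaskRun("task-a", 32.0, True),
    TaskRun("task-b", 45.0, True),
    TaskRun("task-c", 10.0, False),  # incomplete, so excluded from cost
]

costs = cost_per_task(runs, HYPOTHETICAL_RATE_USD)
total = round(sum(costs.values()), 4)
print(costs, total)
```

If several tasks shared the GPU concurrently, the per-task figure would need to be divided by the concurrency level; the source does not say whether that was the case.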

Key Observations

The author notes the score is suspiciously close to Qwen 3.6 Plus, raising questions about architectural differences. They argue that local models are falling further behind frontier closed-source offerings: K2.6 is the best open-source model, but most users cannot run it locally. Qwen 3.6 27B is positioned as a "poor man's SOTA" local option. The trend suggests that frontier performance requires large scale, which often leads to closed sourcing and makes local inference a losing game in terms of competitiveness.

📖 Read the full source: r/LocalLLaMA
