Qwen3.6 Plus benchmark comparison against Western SOTA models

A Reddit post on r/LocalLLaMA compares Qwen3.6 Plus against three Western state-of-the-art models (GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro Preview) on four benchmarks: SWE-bench Verified, GPQA / GPQA Diamond, HLE (no tools), and MMMU-Pro.
Benchmark Results
The source provides these exact scores:
- Qwen3.6-Plus: SWE-bench Verified 78.8, GPQA / GPQA Diamond 90.4, HLE (no tools) 28.8, MMMU-Pro 78.8
- GPT‑5.4 (xhigh): SWE-bench Verified 78.2, GPQA / GPQA Diamond 93.0, HLE (no tools) 39.8, MMMU-Pro 81.2
- Claude Opus 4.6 (thinking heavy): SWE-bench Verified 80.8, GPQA / GPQA Diamond 91.3, HLE (no tools) 34.44, MMMU-Pro 77.3
- Gemini 3.1 Pro Preview: SWE-bench Verified 80.6, GPQA / GPQA Diamond 94.3, HLE (no tools) 44.7, MMMU-Pro 80.5
The post includes a visual comparison chart available at: https://preview.redd.it/6kq4tt07yrsg1.png?width=714&format=png&auto=webp&s=ad8b207fb13729ae84f5b74cec5fd84a81dcface
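To make the gaps easier to read, the scores listed above can be loaded into a small Python structure and compared per benchmark. This is only a sketch: the dictionary layout and the helper names (`leader`, `gap_to_leader`, `rank`) are illustrative assumptions and do not come from the original post. The numbers are copied as reported.

```python
# Scores transcribed from the post's list. The data layout and helper
# names are hypothetical, chosen only for this illustration.
SCORES = {
    "Qwen3.6-Plus": {
        "SWE-bench Verified": 78.8, "GPQA Diamond": 90.4,
        "HLE (no tools)": 28.8, "MMMU-Pro": 78.8,
    },
    "GPT-5.4 (xhigh)": {
        "SWE-bench Verified": 78.2, "GPQA Diamond": 93.0,
        "HLE (no tools)": 39.8, "MMMU-Pro": 81.2,
    },
    "Claude Opus 4.6 (thinking heavy)": {
        "SWE-bench Verified": 80.8, "GPQA Diamond": 91.3,
        "HLE (no tools)": 34.44, "MMMU-Pro": 77.3,
    },
    "Gemini 3.1 Pro Preview": {
        "SWE-bench Verified": 80.6, "GPQA Diamond": 94.3,
        "HLE (no tools)": 44.7, "MMMU-Pro": 80.5,
    },
}


def leader(benchmark):
    """Return the model with the highest score on the given benchmark."""
    return max(SCORES, key=lambda model: SCORES[model][benchmark])


def gap_to_leader(model, benchmark):
    """Return how many points the model trails the benchmark leader by."""
    best = SCORES[leader(benchmark)][benchmark]
    return round(best - SCORES[model][benchmark], 2)


def rank(model, benchmark):
    """Return the model's 1-based position on the benchmark (1 = best)."""
    ordered = sorted(SCORES, key=lambda m: SCORES[m][benchmark], reverse=True)
    return ordered.index(model) + 1


for benchmark in SCORES["Qwen3.6-Plus"]:
    print(
        f"{benchmark}: leader = {leader(benchmark)}, "
        f"Qwen3.6-Plus rank = {rank('Qwen3.6-Plus', benchmark)}, "
        f"gap = {gap_to_leader('Qwen3.6-Plus', benchmark)}"
    )
```

On these figures, Qwen3.6-Plus places third of four on SWE-bench Verified and MMMU-Pro and last on GPQA and HLE, with the widest gap (15.9 points) on HLE.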
User Assessment
The original poster notes that Qwen3.6 Plus is "competitive but not the bench" and states: "Will be my new model given how cheap it is, but whether it's actually good irl will depend more than benchmarks." They also observe that "Opus destroys all others despite being 3rd or 4th on artificalanalysis."