Gemma 4 vs Qwen 3.5 Blind Evaluation Results with Claude Opus as Judge

A Reddit user ran a three-way blind evaluation of Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B, using Claude Opus 4.6 as the scoring judge.
Evaluation Setup
The test used 30 questions across five categories: code, reasoning, analysis, communication, and meta-alignment (6 questions per category). All models answered the same questions blind, with identical system prompts and temperature settings. Claude Opus 4.6 judged each response independently on a 0-10 scale using a structured rubric, scoring each response on its own rather than comparing responses pairwise. A single judge (Opus 4.6) was used to prioritize consistency, though this means any systematic bias in that judge affects every score. Total cost was $4.50.
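The difference between absolute and pairwise scoring can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual harness: `score_absolute`, the `judge` callable, and `toy_judge` are hypothetical names, and the real rubric and the API calls to Opus 4.6 are not described in the source.

```python
from typing import Callable, Dict, Optional

Responses = Dict[str, Dict[str, Optional[str]]]   # question id -> model -> answer text
Scores = Dict[str, Dict[str, Optional[float]]]    # question id -> model -> 0-10 score


def score_absolute(responses: Responses, judge: Callable[[str, str], float]) -> Scores:
    """Score every response on its own, clamped to the 0-10 range.

    The judge sees only the question id and one answer at a time: no model
    name and no competing answers, which is what "absolute" means here.
    A missing answer (None) is recorded as None rather than as 0.
    """
    scores: Scores = {}
    for question_id, answers in responses.items():
        scores[question_id] = {}
        for model, text in answers.items():
            if text is None:
                scores[question_id][model] = None
                continue
            raw = float(judge(question_id, text))
            scores[question_id][model] = min(10.0, max(0.0, raw))
    return scores


def toy_judge(question_id: str, text: str) -> float:
    """Hypothetical stand-in for the rubric-based judge call (assumption)."""
    return 9.0 if text.strip() else 0.0


if __name__ == "__main__":
    demo = {"CODE-001": {"model_a": "def f(): ...", "model_b": "", "model_c": None}}
    print(score_absolute(demo, toy_judge))
```

Recording a missing answer as `None` instead of 0 keeps errored runs out of the averages, which is one plausible reading of why one model below has 28 evals rather than 30.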
Results
Win counts (highest score per question):
- Qwen 3.5 27B: 14 wins (46.7%)
- Gemma 4 31B: 12 wins (40.0%)
- Gemma 4 26B-A4B: 4 wins (13.3%)
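As a rough sketch of how such win counts and percentages could be derived from per-question scores: the function names below are hypothetical, and the tie rule (a tie awards no win) is an assumption, since the source does not say how ties were handled. Only the three reported win totals come from the post.

```python
from collections import Counter


def count_wins(scores_by_question):
    """Count, per model, the questions where it alone had the highest score.

    Missing scores (None) are skipped. Ties award no win, which is an
    assumption rather than the evaluator's documented rule.
    """
    wins = Counter()
    for scores in scores_by_question.values():
        valid = {m: s for m, s in scores.items() if s is not None}
        if not valid:
            continue
        best = max(valid.values())
        leaders = [m for m, s in valid.items() if s == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


def win_shares(wins, total_questions):
    """Convert win counts to percentages of all questions, to one decimal."""
    return {m: round(100 * w / total_questions, 1) for m, w in wins.items()}


# Win totals as reported in the post.
reported = {"Qwen 3.5 27B": 14, "Gemma 4 31B": 12, "Gemma 4 26B-A4B": 4}
print(win_shares(reported, 30))
```

Applied to the reported totals, this reproduces the 46.7%, 40.0%, and 13.3% shares listed above.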
Average scores:
- Gemma 4 31B: 8.82 (30 evals)
- Gemma 4 26B-A4B: 8.82 (28 evals)
- Qwen 3.5 27B: 8.17 (30 evals)
Qwen won more matchups but had a lower average score due to three 0.0 scores on CODE-001, REASON-004, and ANALYSIS-017, which appeared to be format failures or refusals rather than genuinely terrible answers. Without those three scores, Qwen's average jumps to approximately 9.08, which would be the highest of the three models.
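The adjusted figure can be checked from the reported numbers alone. A minimal sketch, assuming the 8.17 average is a plain mean over all 30 evals (the helper name is hypothetical):

```python
def mean_without_zeros(reported_mean, n_evals, n_zeros):
    """Recompute a mean after dropping scores of exactly 0.0.

    Zero scores add nothing to the sum, so the sum is unchanged and only
    the divisor shrinks.
    """
    total = reported_mean * n_evals
    return total / (n_evals - n_zeros)


adjusted = mean_without_zeros(8.17, 30, 3)
print(round(adjusted, 2))  # 9.08
```

Because the reported mean is itself rounded to two decimals, the recomputed value is only approximate, which matches the "approximately 9.08" wording.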
Category Breakdown
- Code: Tied between Gemma 4 31B and Qwen (3 wins each)
- Reasoning: Qwen dominated (5 of 6 wins)
- Analysis: Qwen dominated (4 of 6 wins)
- Communication: Gemma 4 31B dominated (5 of 6 wins)
- Meta-alignment: Three-way split (2-2-2 wins)
Observations
- Gemma 4 26B-A4B (the MoE variant) errored out on 2 questions entirely. When it worked, its scores matched the dense 31B almost exactly with the same 8.82 average.
- Gemma 4 31B had some absurdly long response times, including multiple 5-minute generations that appeared to involve heavy internal chain-of-thought, but this didn't correlate with better scores.
- Qwen 3.5 27B generates 3-5x more tokens per response on average, creating a verbosity tax, though the judge didn't seem to penalize or reward this consistently.
Methodology Caveats
- 30 questions is a small sample, and no statistical significance is claimed
- Single judge (Opus 4.6) means any systematic bias affects every score
- LLM-as-judge has known issues: verbosity bias, self-preference bias, positional bias
- Questions were original, not drawn from standard benchmarks, so they reflect the evaluator's own biases
📖 Read the full source: r/LocalLLaMA