Benchmark Comparison of Qwen 3.5 Models Against Major AI Models

A benchmark comparison website shared on r/LocalLLaMA provides head-to-head performance data for multiple large language models. According to the post, the site includes verified scores and comparative infographics for a range of models, with a focus on Alibaba's Qwen 3.5 series.
Models Included in the Comparison
The source lists the following models as being part of the full comparison:
- GPT-5.2
- Claude 4.5 Opus
- Gemini-3 Pro
- Qwen3-Max-Thinking
- K2.5-1T-A32B
- Qwen3.5-397B
- GPT-5-mini
- GPT-OSS-120B
- Qwen3-235B
- Qwen3.5-122B
- Qwen3.5-27B
- Qwen3.5-35B
What the Source Provides
The source states that the comparison includes "all verified scores and head-to-head infographics." This suggests the website aggregates performance metrics from standardized AI benchmarks, which typically measure capabilities in areas like reasoning, coding, and general knowledge. The link provided points to a dedicated comparison site at https://compareqwen35.tiiny.site.
For context, benchmark comparisons are a standard way for the AI community to evaluate model performance on common tasks. Alibaba releases many Qwen models with open weights, so comparing them against proprietary models from OpenAI (GPT), Anthropic (Claude), and Google (Gemini) gives developers practical data when choosing which model to use or fine-tune for specific tasks. Several model names include a parameter count (e.g., 122B, 397B), which indicates the comparison covers models of varying scales. That matters when weighing performance against computational cost.
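As a rough illustration of the scale point, the sketch below reads parameter counts out of the model names listed in the source and sorts the models by size. It assumes that a hyphen-separated token such as "397B" or "1T" denotes the total parameter count in billions or trillions; that is a common naming convention, not something the source confirms. The function name `total_params_billions` is hypothetical, and no benchmark scores are included because the source text does not quote any.

```python
import re

# Model names copied verbatim from the source's list.
MODELS = [
    "GPT-5.2", "Claude 4.5 Opus", "Gemini-3 Pro", "Qwen3-Max-Thinking",
    "K2.5-1T-A32B", "Qwen3.5-397B", "GPT-5-mini", "GPT-OSS-120B",
    "Qwen3-235B", "Qwen3.5-122B", "Qwen3.5-27B", "Qwen3.5-35B",
]

# Assumption: a token like "397B" or "1T" is a total parameter count.
# Tokens such as "A32B" are skipped because they do not start with a digit.
SIZE_TOKEN = re.compile(r"^(\d+(?:\.\d+)?)([BT])$")


def total_params_billions(name):
    """Return the size encoded in a model name, in billions, or None."""
    for token in name.split("-"):
        match = SIZE_TOKEN.match(token)
        if match:
            value = float(match.group(1))
            return value * 1000 if match.group(2) == "T" else value
    return None


# Keep only models whose names encode a size, smallest first.
sized = sorted(
    ((name, total_params_billions(name)) for name in MODELS
     if total_params_billions(name) is not None),
    key=lambda pair: pair[1],
)

for name, size in sized:
    print(f"{name}: {size:g}B parameters (parsed from name)")
```

Models whose names carry no size token (GPT-5.2, Claude 4.5 Opus, and so on) are left out instead of guessed at, so any cost comparison built on this would cover only the models with a parsed size.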
📖 Read the full source: r/LocalLLaMA