Benchmark Comparison of Qwen 3.5 Models Against Major AI Models

A benchmark comparison website offers head-to-head performance data for a range of large language models, with a focus on the Qwen 3.5 series from Alibaba. According to the source, the site includes verified scores and comparative infographics.

Models Included in the Comparison
The source lists the following models in the full comparison:
- GPT-5.2
- Claude 4.5 Opus
- Gemini-3 Pro
- Qwen3-Max-Thinking
- K2.5-1T-A32B
- Qwen3.5-397B
- GPT-5-mini
- GPT-OSS-120B
- Qwen3-235B
- Qwen3.5-122B
- Qwen3.5-27B
- Qwen3.5-35B

What the Source Provides
The source states that the comparison includes "all verified scores and head-to-head infographics." This suggests the website aggregates results from standardized AI benchmarks, which typically measure reasoning, coding, and general knowledge. The comparison is hosted on a dedicated site at https://compareqwen35.tiiny.site.
For context, benchmark comparisons are a standard way for the AI community to evaluate model performance on a common footing. Most Qwen models are released by Alibaba with open weights, and comparing them against models from OpenAI (GPT), Anthropic (Claude), and Google (Gemini), most of which are proprietary, gives developers practical data for choosing which model to use or fine-tune for a specific task. The parameter counts in the model names (e.g., 122B, 397B) show that the comparison covers models of varying scale, which matters when weighing performance against computational cost.
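As a rough illustration of how the size suffixes in the model list can be read, the sketch below extracts parameter counts from the names and orders the models by scale. It rests on several assumptions that the source does not state: that a token such as `397B` means total parameters in billions, that `1T` means one trillion, and that an `A`-prefixed token such as `A32B` denotes active parameters (a common mixture-of-experts naming convention). The helper `parse_sizes` and the `SIZE_TOKEN` pattern are hypothetical names introduced for this example. No benchmark scores appear here because the source text does not quote any.

```python
import re

# Model names exactly as listed in the source.
MODEL_NAMES = [
    "GPT-5.2", "Claude 4.5 Opus", "Gemini-3 Pro", "Qwen3-Max-Thinking",
    "K2.5-1T-A32B", "Qwen3.5-397B", "GPT-5-mini", "GPT-OSS-120B",
    "Qwen3-235B", "Qwen3.5-122B", "Qwen3.5-27B", "Qwen3.5-35B",
]

# A size token is a hyphen, an optional "A" (assumed: active parameters),
# a number, and a unit (B = billions, T = trillions), ending the name or
# followed by another hyphen.
SIZE_TOKEN = re.compile(r"-(A?)(\d+(?:\.\d+)?)([BT])(?=-|$)")


def parse_sizes(name):
    """Return (total, active) parameter counts in billions.

    Either value is None when the name does not encode it.
    """
    total = active = None
    for is_active, number, unit in SIZE_TOKEN.findall(name):
        value = float(number) * (1000 if unit == "T" else 1)
        if is_active:
            active = value
        elif total is None:
            total = value
    return total, active


# Keep only models whose names state a size, smallest first.
sized = sorted(
    ((name, parse_sizes(name)[0]) for name in MODEL_NAMES
     if parse_sizes(name)[0] is not None),
    key=lambda pair: pair[1],
)

if __name__ == "__main__":
    for name, total in sized:
        print(f"{name}: {total:g}B total parameters (from name)")
```

Names without a size token, such as GPT-5.2 or Claude 4.5 Opus, are skipped rather than guessed, so any performance-versus-size comparison built on this would cover only the models whose names disclose their scale.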
📖 Read the full source: r/LocalLLaMA