Qwen3-30B-A3B vs Qwen3.5-35B-A3B Performance Comparison on RTX 5090

Performance Comparison: Qwen3-30B-A3B vs Qwen3.5-35B-A3B
A detailed benchmark comparing Qwen3-30B-A3B and the newly released Qwen3.5-35B-A3B on an NVIDIA RTX 5090 reveals trade-offs between speed and context handling. Both models use the same Mixture of Experts architecture with 3B active parameters, with the 3.5 version adding 5B more total parameters and including a vision projector.
Hardware and Setup
- GPU: NVIDIA RTX 5090 (32 GB VRAM, Blackwell)
- Server: llama.cpp b8115 (Docker: ghcr.io/ggml-org/llama.cpp:server-cuda)
- Quantization: Q4_K_M for both models
- KV Cache: Q8_0 (-ctk q8_0 -ctv q8_0)
- Context: 32,768 tokens (-c 32768)
- Parameters: -ngl 999 -np 4 --flash-attn on -t 12
- Model A: Qwen3-30B-A3B-Q4_K_M (17 GB on disk)
- Model B: Qwen3.5-35B-A3B-Q4_K_M (21 GB on disk)
Both models were warmed up with a throwaway request before timing. Server-side timings came from API responses, not wall-clock measurements.
Raw Inference Speed Results
Direct llama.cpp /v1/chat/completions testing showed:
- Short prompts (8-9 tokens): 30B: 248.2 tok/s, 3.5: 169.5 tok/s
- Medium prompts (73-78 tokens): 30B: 236.1 tok/s, 3.5: 163.5 tok/s
- Long-form (800 tokens): 30B: 232.6 tok/s, 3.5: 116.3 tok/s
- Code generation (298-400 tokens): 30B: 233.9 tok/s, 3.5: 161.6 tok/s
- Reasoning (200 tokens): 30B: 234.8 tok/s, 3.5: 158.2 tok/s
Average generation speed: 30B: 237.1 tok/s, 3.5: 153.8 tok/s (30B is 35% faster)
Prompt processing averages: 30B: 773.5 tokens/s, 3.5: 518.1 tokens/s
The 3.5 model shows an interesting regression on long outputs (800 tokens), dropping to 116 tok/s versus ~160 tok/s on shorter outputs. Prompt processing is slower on the 3.5 due to its larger vocabulary (248K vs 152K tokens).
Memory Usage
VRAM usage: 30B uses 27.3 GB idle, 3.5 uses 29.0 GB idle. Both fit comfortably on the RTX 5090.
Response Quality Observations
Testing at temperature=0.7 showed both models produce competent output. Key observations:
- Creative writing: Both solid, with 3.5 showing slightly more atmospheric prose
- Haiku generation: Both produce valid 5-7-5 structures
- Coding tasks: Both correctly implement LRU cache with O(1) get/put operations
The 3.5 model handles long context significantly better with flat token scaling versus the 30B's 21% degradation. Quality differences are minimal with a slight edge to 3.5 in structure and formatting.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Pope Leo XIV's Encyclical on AI: Key Takeaways for Developers
The Vatican released an encyclical on AI ethics. The document highlights LLM interpretability issues, cultural biases in training data, and the environmental cost of AI.

Hollywood Writers Shift to AI Training: First-Person Account of Data Annotation Work
A Hollywood showrunner describes transitioning to AI training work at $52/hour after the 2023 strike, annotating conversations, images, and videos for companies like Mercor and Outlier.

Claude-Code v2.1.45 Enhancements and Fixes
Claude-Code v2.1.45 introduces support for Claude Sonnet 4.6 and various fixes for system stability.

Bloomberg Reports on AI Coding Agents and Productivity Concerns in 2026
A Bloomberg article from February 2026 discusses AI coding agents like Claude Code and reports on a 'productivity panic' in the tech industry. The article received 44 points and 14 comments on Hacker News.