Qwen3 vs Qwen3.5: 35% Speed Gain on RTX 5090

Performance Comparison: Qwen3-30B-A3B vs Qwen3.5-35B-A3B

A detailed benchmark comparing Qwen3-30B-A3B and the newly released Qwen3.5-35B-A3B on an NVIDIA RTX 5090 reveals trade-offs between speed and context handling. Both models use the same Mixture of Experts architecture with 3B active parameters, with the 3.5 version adding 5B more total parameters and including a vision projector.

Hardware and Setup

GPU: NVIDIA RTX 5090 (32 GB VRAM, Blackwell)
Server: llama.cpp b8115 (Docker: ghcr.io/ggml-org/llama.cpp:server-cuda)
Quantization: Q4_K_M for both models
KV Cache: Q8_0 (-ctk q8_0 -ctv q8_0)
Context: 32,768 tokens (-c 32768)
Parameters: -ngl 999 -np 4 --flash-attn on -t 12
Model A: Qwen3-30B-A3B-Q4_K_M (17 GB on disk)
Model B: Qwen3.5-35B-A3B-Q4_K_M (21 GB on disk)

Both models were warmed up with a throwaway request before timing. Server-side timings came from API responses, not wall-clock measurements.

Raw Inference Speed Results

Direct llama.cpp /v1/chat/completions testing showed:

Short prompts (8-9 tokens): 30B: 248.2 tok/s, 3.5: 169.5 tok/s
Medium prompts (73-78 tokens): 30B: 236.1 tok/s, 3.5: 163.5 tok/s
Long-form (800 tokens): 30B: 232.6 tok/s, 3.5: 116.3 tok/s
Code generation (298-400 tokens): 30B: 233.9 tok/s, 3.5: 161.6 tok/s
Reasoning (200 tokens): 30B: 234.8 tok/s, 3.5: 158.2 tok/s

Average generation speed: 30B: 237.1 tok/s, 3.5: 153.8 tok/s (30B is 35% faster)

Prompt processing averages: 30B: 773.5 tokens/s, 3.5: 518.1 tokens/s

The 3.5 model shows an interesting regression on long outputs (800 tokens), dropping to 116 tok/s versus ~160 tok/s on shorter outputs. Prompt processing is slower on the 3.5 due to its larger vocabulary (248K vs 152K tokens).

Memory Usage

VRAM usage: 30B uses 27.3 GB idle, 3.5 uses 29.0 GB idle. Both fit comfortably on the RTX 5090.

Response Quality Observations

Testing at temperature=0.7 showed both models produce competent output. Key observations:

Creative writing: Both solid, with 3.5 showing slightly more atmospheric prose
Haiku generation: Both produce valid 5-7-5 structures
Coding tasks: Both correctly implement LRU cache with O(1) get/put operations

The 3.5 model handles long context significantly better with flat token scaling versus the 30B's 21% degradation. Quality differences are minimal with a slight edge to 3.5 in structure and formatting.

📖 Read the full source: r/LocalLLaMA