Benchmark Results for Qwen3.5 Models with 2K to 400K Context on RTX 4090

Qwen3.5 Performance Testing on RTX 4090
A developer shared benchmark results for Qwen3.5 models running on an RTX 4090 GPU, testing context windows from 2,048 to 400,000 tokens. The tests were originally planned to stop at 262K context but were extended to 400K using YaRN and other methods.
Models Tested
The following Qwen3.5 model variants were benchmarked:
- Qwen3.5-0.8B-Q4_K_M
- Qwen3.5-0.8B-bf16
- Qwen3.5-2B-Q4_K_M
- Qwen3.5-2B-bf16
- Qwen3.5-4B-Q4_K_M
- Qwen3.5-4B-bf16
- Qwen3.5-9B-Q4_K_M
- Qwen3.5-9B-bf16
- Qwen3.5-27B-Q4_K_M
- Qwen3.5-35B-A3B-Q4_K_M
Context Windows Tested
The models were evaluated at these specific context lengths: 2048, 4096, 8192, 32768, 65536, 98304, 131072, 196608, 262144, 327680, 360448, 393216, and 400000 tokens.
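To make the extended lengths concrete, the sketch below computes how far each tested context stretches past the originally planned limit. It assumes that "262k" means a native window of 262,144 tokens and that the stretch is expressed as a simple target/native ratio, as YaRN-style scaling factors usually are. The names `NATIVE_CONTEXT` and `extension_factor` are illustrative and do not come from the developer's script.

```python
# Assumption: the "262k" planned limit corresponds to a 262,144-token native window.
NATIVE_CONTEXT = 262_144

# Context lengths listed in the benchmark.
CONTEXT_LENGTHS = [2048, 4096, 8192, 32768, 65536, 98304, 131072,
                   196608, 262144, 327680, 360448, 393216, 400000]


def extension_factor(target: int, native: int = NATIVE_CONTEXT) -> float:
    """Return target/native, clamped to 1.0 when no extension is needed."""
    return max(1.0, target / native)


def scaling_table(lengths=CONTEXT_LENGTHS):
    """Pair every tested length with its rounded extension factor."""
    return [(n, round(extension_factor(n), 3)) for n in lengths]


if __name__ == "__main__":
    for n, factor in scaling_table():
        print(f"{n:>7} tokens  factor {factor}")
```

Under these assumptions only the last four lengths need any extension, and the 400,000-token run stretches the window by a factor of roughly 1.53.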
Testing Methodology
The benchmark script was configured to achieve the best possible tokens/second speed by tuning the NGL (GPU layer offload) setting and using 8-bit and 4-bit quantized KV cache. The full context was intentionally loaded in the first interaction, so the initial time-to-first-token (TTFT) appears lengthy. The developer noted that the Warm TTFT Avg (s) column shows better performance once the KV cache is loaded.
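The difference between the first TTFT and the Warm TTFT Avg (s) column can be sketched as follows. This is a minimal illustration that assumes the warm average is the mean TTFT over every turn after the first; the developer's script may define it differently, and `summarize_ttft` is a hypothetical helper name.

```python
from statistics import mean


def summarize_ttft(ttft_seconds):
    """Split per-turn TTFT samples into the cold first turn and the warm average.

    Assumption: "warm" means every turn after the first, once the KV cache
    already holds the long context.
    """
    if not ttft_seconds:
        raise ValueError("need at least one TTFT sample")
    cold = ttft_seconds[0]
    warm = ttft_seconds[1:]
    warm_avg = mean(warm) if warm else None  # None when only one turn was run
    return {"cold_ttft_s": cold, "warm_ttft_avg_s": warm_avg}


if __name__ == "__main__":
    # Made-up sample values for illustration only, not benchmark results.
    print(summarize_ttft([42.0, 1.5, 2.5]))
```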
To test context capabilities, the models were given a one-sentence prompt asking them to summarize logs, followed by 2K to 400K tokens of log data. The developer reported some discrepancies but overall satisfactory performance.
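A prompt of this shape, one instruction followed by log lines up to a token budget, could be assembled roughly as below. This is a sketch, not the developer's script: `build_prompt` is a hypothetical helper, and the default token counter is a crude whitespace word count standing in for the model's real tokenizer, which should be passed in as `count_tokens` for accurate budgets.

```python
def build_prompt(instruction, log_lines, token_budget, count_tokens=None):
    """Append log lines to the instruction until the token budget is reached.

    Returns the prompt text and the number of tokens it is estimated to use.
    """
    if count_tokens is None:
        # Assumption: whitespace word count as a rough stand-in for a tokenizer.
        count_tokens = lambda text: len(text.split())

    used = count_tokens(instruction)
    kept = []
    for line in log_lines:
        cost = count_tokens(line)
        if used + cost > token_budget:
            break  # the next line would overshoot the budget
        kept.append(line)
        used += cost
    return "\n".join([instruction, *kept]), used


if __name__ == "__main__":
    # Hypothetical log content for illustration.
    logs = ["INFO service started ok"] * 10
    prompt, used = build_prompt("Summarize the following logs.", logs, token_budget=20)
    print(used)
    print(prompt)
```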
Current Status and Next Steps
Three models failed during testing and are undergoing KV offload tests: Qwen3.5-4B-bf16, Qwen3.5-27B-Q4_K_M, and Qwen3.5-35B-A3B-Q4_K_M. The developer had to restart these tests after a script issue wasted 24 hours of runtime.
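The source does not say why these runs failed. If VRAM was the limit, a rough KV cache estimate shows how quickly long contexts add up. The sketch below assumes llama.cpp-style cache formats, where q8_0 stores about 8.5 bits per element and q4_0 about 4.5 bits. The layer, head, and dimension values in the usage example are hypothetical placeholders, not the real Qwen3.5 architecture, and only layers that actually keep a KV cache should be counted.

```python
# Effective bits per cached element, assuming llama.cpp-style block formats
# (32 values per block plus one fp16 scale for the quantized types).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_tokens, n_kv_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size in GiB for a given context length."""
    bits = BITS_PER_ELEMENT[cache_type]
    # Factor 2 covers the separate K and V tensors in each layer.
    elements = 2 * n_tokens * n_kv_layers * n_kv_heads * head_dim
    return elements * bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical model shape: 8 KV layers, 4 KV heads, head dimension 128.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(262_144, 8, 4, 128, cache_type)
        print(f"{cache_type}: {size:.3f} GiB")
```

The estimate covers the cache only. Model weights, compute buffers, and any other runtime overhead come on top of it.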
Once the VRAM offloading tests complete, the developer plans to compare results against foundational models and has saved outputs for analysis. The developer expressed particular surprise at the performance of the 9B and 27B dense models.
The developer is seeking community input on which models to compare against and what grading methodology to use for evaluation.
📖 Read the full source: r/openclaw