Qwen3.5-122B on Blackwell SM120: fp8 KV Cache Corruption Issue and Performance Findings

Key Findings from Qwen3.5-122B Testing on Blackwell SM120
A detailed test of Qwen3.5-122B on 8x RTX PRO 6000 Blackwell hardware (AWS g7e.48xlarge, SM120) with SGLang revealed critical configuration issues and performance characteristics. The most significant finding: with an fp8_e4m3 KV cache the server does not crash, but it silently produces corrupt output. No errors or warnings are raised; responses consist of exclamation marks and repetition instead of proper answers. The only fix found was to use a bf16 KV cache instead.
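Because the failure is silent, a cheap sanity check on responses can catch it before a long benchmark run is wasted. The following is a minimal sketch, not something from the original post: the function name `looks_corrupt` and both thresholds are hypothetical, chosen only to flag the two symptoms described above (runs of exclamation marks and heavy repetition).

```python
from collections import Counter


def looks_corrupt(text: str,
                  max_bang_ratio: float = 0.3,
                  max_repeat_ratio: float = 0.5,
                  n: int = 3) -> bool:
    """Heuristically flag degenerate model output.

    Thresholds are illustrative assumptions, not tuned values.
    """
    stripped = text.strip()
    if not stripped:
        # An empty answer is treated as a failure too.
        return True

    # Symptom 1: output dominated by exclamation marks.
    bang_ratio = stripped.count("!") / len(stripped)
    if bang_ratio > max_bang_ratio:
        return True

    # Symptom 2: one word n-gram accounts for most of the output.
    words = stripped.split()
    if len(words) < 2 * n:
        return False  # too short to judge repetition
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_count = Counter(ngrams).most_common(1)[0][1]
    return top_count / len(ngrams) > max_repeat_ratio
```

Running a handful of known-answer prompts through a check like this after every KV cache or backend change is one way to notice corruption that the server itself never reports.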
Configuration Requirements
DeltaNet layers in Qwen3.5-122B add constraints that standard MoE models don't have. The SM120 setup required 6 specific Triton backend flags, which together enforce the following constraints:
- Attention backend forced to Triton (for DeltaNet layers)
- KV cache forced to bf16 (fp8 corrupts output)
- No CUDA graphs (due to Triton SMEM overflow)
- No HiCache (DeltaNet incompatible)
This contrasts with M2.5 testing on the same hardware, which only needed 2 Triton backend flags.
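The four constraints listed above might translate into a launch command along these lines. This is a sketch under assumptions, not the post's actual command: the exact six flags are not given here, the model path is a placeholder, and the flag spellings reflect SGLang's CLI as I understand it and should be checked against `python -m sglang.launch_server --help` for the nightly build in use.

```python
def build_launch_command(model_path: str,
                         tp_size: int = 8,
                         kv_cache_dtype: str = "auto") -> list[str]:
    """Build an SGLang launch command reflecting the listed constraints.

    Assumption: "auto" leaves the KV cache in the model's own dtype,
    which is bf16 for a bf16 checkpoint. Whether an explicit bf16 value
    is accepted depends on the SGLang build, so verify before relying on it.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,          # placeholder path, not from the post
        "--tp-size", str(tp_size),           # TP=8 in the reported tests
        "--attention-backend", "triton",     # Triton attention for DeltaNet layers
        "--kv-cache-dtype", kv_cache_dtype,  # avoid fp8_e4m3 (corrupts output)
        "--disable-cuda-graph",              # CUDA graphs hit Triton SMEM overflow
        # HiCache is left off simply by not passing any flag that enables it.
    ]


if __name__ == "__main__":
    print(" ".join(build_launch_command("path/to/Qwen3.5-122B")))
```

Keeping the command in a small function makes it easy to diff the Qwen3.5-122B configuration against the M2.5 one when comparing runs.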
Performance Benchmarks
All tests used the same hardware and methodology with SGLang nightly (cu13 20260219), TP=8. Each line lists the Qwen3.5-122B result first, then the M2.5 result:
- Burst tok/s: 1,985 vs 1,818
- Online 4 rps: 310 vs 404
- Online 8 rps: 514 vs 744
- Single-request tok/s: ~25 (with MTP) vs 72
- Arena-Hard quality: 6.99/10 vs 4.94/10 (judged by Claude Opus 4.6, not comparable to leaderboard results)
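To make the gaps easier to compare across rows, the figures above can be expressed as ratios of Qwen3.5-122B to M2.5. The sketch below only does arithmetic on the reported numbers; the dictionary keys are my own labels, and the online rows carry no unit because none is stated.

```python
# Reported results as (Qwen3.5-122B, M2.5) pairs, copied from the list above.
RESULTS = {
    "burst_tok_s": (1985, 1818),
    "online_4_rps": (310, 404),
    "online_8_rps": (514, 744),
    "single_request_tok_s": (25, 72),  # Qwen figure is approximate, with MTP
    "arena_hard_score": (6.99, 4.94),
}


def relative_to_m25(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return Qwen3.5-122B / M2.5 for each metric, rounded to 2 decimals."""
    return {name: round(qwen / m25, 2) for name, (qwen, m25) in results.items()}


if __name__ == "__main__":
    for name, ratio in relative_to_m25(RESULTS).items():
        print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 favors Qwen3.5-122B (burst throughput at about 1.09x, quality at about 1.41x), while the online and single-request rows fall below 1, with single-request throughput at roughly 0.35x of M2.5.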
Optimization Results
Of the optimization paths tested, MTP (Multi-Token Prediction) was the only one that materially improved performance, providing a 2.75x single-request speedup (~9 to ~25 tok/s). Other optimizations available on SM120 hardware - FP8 KV cache, CUDA graphs, and HiCache - were blocked by DeltaNet constraints in Qwen3.5-122B.
Qwen3.5-122B wins on burst throughput and quality, while M2.5 still wins on every sustained serving metric because it can use the optimizations that Qwen3.5-122B's DeltaNet layers block.
Full results, compatibility matrix, exact reproduction commands, and all JSONL artifacts are available in the GitHub issue linked below.
📖 Read the full source: r/LocalLLaMA