RTX 5080 16GB: Qwen3.6 35B MoE at 128k Context — 56 tok/s, and Why MTP Doesn't Help

Mainline llama.cpp commit b9190 merged MTP (Multi-Token Prediction). Benchmarks on an RTX 5080 16GB with Qwen3.6 35B MoE at 128k context show a clear result: MTP hurts performance when the model doesn't fully fit on the GPU.
The Best Config (No MTP)
Qwen3.6-35B-A3B Q4_K_XL with --fit-target 1536 at 131k context (131,072 tokens, the 128k of the title) yields:
- 56 tok/s generation
- 1,584 tok/s prompt processing at 128k context
No MTP flags needed.
Why MTP Slows Down 35B MoE on 16GB
Three configs tested at coding-agent context lengths:
- 27B IQ3+MTP: 12.45 GB, fully on GPU — avg 73 tok/s (MTP helps)
- 35B Q4_K_XL+MTP: ~22 GB, partial offload — avg 74 tok/s (MTP hurts)
- 35B Q8_0+MTP: ~36 GB, heavy offload — avg 46 tok/s
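Without MTP, the 35B Q4_K_XL achieves 97 tok/s at --fit-target 0 (15,815 MiB VRAM) and 86 tok/s at --fit-target 1536 (14,269 MiB). With MTP enabled at --fit-target 1536, speed drops to 74 tok/s (14,623 MiB). That is a 23% slowdown relative to the 97 tok/s --fit-target 0 run, and about 14% relative to the 86 tok/s run at the same --fit-target.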
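The slowdown figures can be checked with a few lines of arithmetic. The sketch below uses only the throughput numbers quoted above; the helper name slowdown_pct is made up for illustration. It suggests the 23% figure is measured against the 97 tok/s run at --fit-target 0.

```python
def slowdown_pct(baseline_tps: float, measured_tps: float) -> float:
    """Percent drop in throughput relative to a baseline (hypothetical helper)."""
    return (baseline_tps - measured_tps) / baseline_tps * 100.0


# Throughput figures quoted in the article for the 35B Q4_K_XL model (tok/s).
NO_MTP_FIT_0 = 97.0      # no MTP, --fit-target 0
NO_MTP_FIT_1536 = 86.0   # no MTP, --fit-target 1536
MTP_FIT_1536 = 74.0      # MTP enabled, --fit-target 1536

vs_fit_0 = slowdown_pct(NO_MTP_FIT_0, MTP_FIT_1536)
vs_fit_1536 = slowdown_pct(NO_MTP_FIT_1536, MTP_FIT_1536)

print(f"MTP vs no-MTP at --fit-target 0:    {vs_fit_0:.1f}% slower")
print(f"MTP vs no-MTP at --fit-target 1536: {vs_fit_1536:.1f}% slower")
```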
The root cause: MTP's compute buffer reserves ~1.5 GB (--fit-target 1536), pushing ~3 more MoE expert layers from GPU to CPU. Since MoE inference is bottlenecked by CPU-bound expert layers, MTP's 79% token acceptance rate can't compensate for the slower per-step speed.
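To see why a 79% acceptance rate may not be enough, it helps to split throughput into steps per second times tokens per step. The sketch below is a simplified model, not how llama.cpp accounts for it: it assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The function names are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, draft_n_max: int) -> float:
    """Expected tokens emitted per verification step.

    ASSUMPTION: every drafted token is accepted independently with probability
    accept_rate, and a rejection discards the remaining drafts. The main model
    always contributes one token per step.
    """
    return 1.0 + sum(accept_rate ** k for k in range(1, draft_n_max + 1))


def mtp_wins(baseline_steps_per_s: float, mtp_steps_per_s: float,
             accept_rate: float, draft_n_max: int) -> bool:
    """True if MTP throughput beats the one-token-per-step baseline."""
    mtp_tps = mtp_steps_per_s * expected_tokens_per_step(accept_rate, draft_n_max)
    return mtp_tps > baseline_steps_per_s


# Values from the article: 79% acceptance, --spec-draft-n-max 2.
tokens = expected_tokens_per_step(0.79, 2)
# Slowest MTP step rate that still matches the 86 tok/s no-MTP run.
break_even_steps = 86.0 / tokens

print(f"expected tokens per step: {tokens:.2f}")
print(f"break-even step rate vs 86 tok/s: {break_even_steps:.1f} steps/s")
```

Under these assumptions, MTP only pays off if each step slows down by less than the expected tokens-per-step factor. Pushing more expert layers to the CPU slows every step, which is how a high acceptance rate can still lose.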
For the 27B model (fits entirely on GPU), --fit-target 0 works with or without MTP, so no VRAM penalty — MTP boosts speed from ~56 to 73 tok/s.
Rule of Thumb
MTP helps when your model fits on GPU. It hurts when the MTP compute buffer forces more layers to CPU. On 16GB cards with 35B MoE, skip MTP.
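The rule of thumb can be written as a rough pre-flight check. This is a minimal sketch with a hypothetical helper: it takes the ~1.5 GB MTP buffer from the measurements above as a default and deliberately ignores the KV cache and other runtime buffers, so treat a True result as "worth benchmarking", not as a guarantee.

```python
def mtp_likely_helps(model_gb: float, vram_gb: float,
                     mtp_buffer_gb: float = 1.5) -> bool:
    """Rough check: does the model plus the MTP compute buffer fit in VRAM?

    ASSUMPTION: mtp_buffer_gb defaults to the ~1.5 GB observed in the article;
    KV cache and other buffers are not modeled here.
    """
    return model_gb + mtp_buffer_gb <= vram_gb


# Model sizes quoted in the article, checked against a 16 GB card.
configs = {
    "27B IQ3": 12.45,
    "35B Q4_K_XL": 22.0,
    "35B Q8_0": 36.0,
}

for name, size_gb in configs.items():
    verdict = "try MTP" if mtp_likely_helps(size_gb, 16.0) else "skip MTP"
    print(f"{name}: {verdict}")
```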
Full test system: RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, llama.cpp b9204 (mainline). Common MTP flags: -np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2.
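For reproducing the comparison, the flag list above can be assembled programmatically so the MTP and no-MTP runs differ only in the speculative flags. This sketch makes several assumptions the source does not state: that the binary is llama-server, that the model path is the placeholder shown, that context is set with -c 131072, and that only --spec-type and --spec-draft-n-max are MTP-specific. All other flags are copied verbatim from the list above.

```python
import shlex

# Flags copied from the article; assumed here to apply to both runs.
COMMON_FLAGS = [
    "-np", "1", "--fit", "on", "-fa", "on", "-t", "20",
    "--no-mmap", "--jinja", "-ctk", "q8_0", "-ctv", "q8_0",
]
# Flags assumed to be the MTP-specific part of the list.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]


def build_command(model_path: str, fit_target_mib: int, use_mtp: bool,
                  binary: str = "llama-server") -> list[str]:
    """Build an argument list for one benchmark run (hypothetical helper)."""
    cmd = [binary, "-m", model_path, "-c", "131072",
           "--fit-target", str(fit_target_mib), *COMMON_FLAGS]
    if use_mtp:
        cmd += MTP_FLAGS
    return cmd


# Placeholder path: substitute the actual GGUF file.
MODEL = "models/qwen3.6-35b-a3b-q4_k_xl.gguf"

no_mtp_cmd = build_command(MODEL, 1536, use_mtp=False)
mtp_cmd = build_command(MODEL, 1536, use_mtp=True)

print(shlex.join(no_mtp_cmd))
print(shlex.join(mtp_cmd))
```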
📖 Read the full source: r/LocalLLaMA