RTX 5080 16GB: Qwen3.6 35B MoE at 128k Context — 56 tok/s, and Why MTP Doesn't Help

✍️ OpenClawRadar | 📅 Published: May 20, 2026

Mainline llama.cpp merged MTP (Multi-Token Prediction) as of b9190. Benchmarks on an RTX 5080 16GB with Qwen3.6 35B MoE at 128k context reveal a clear finding: MTP hurts performance when the model doesn't fully fit on the GPU.

The Best Config (No MTP)

Qwen3.6-35B-A3B Q4_K_XL with --fit-target 1536 at 131k tokens (the 128k context) yields:

  • 56 tok/s generation
  • 1,584 tok/s prompt processing at 128k context

No MTP flags needed.

Why MTP Slows Down 35B MoE on 16GB

Three MTP configs were tested at coding-agent context lengths:

  • 27B IQ3+MTP: 12.45 GB, fully on GPU — avg 73 tok/s (MTP helps)
  • 35B Q4_K_XL+MTP: ~22 GB, partial offload — avg 74 tok/s (MTP hurts)
  • 35B Q8_0+MTP: ~36 GB, heavy offload — avg 46 tok/s

Without MTP, the 35B Q4_K_XL achieves 97 tok/s at --fit-target 0 (15,815 MiB VRAM) and 86 tok/s at --fit-target 1536 (14,269 MiB). With MTP enabled at --fit-target 1536, speed drops to 74 tok/s (14,623 MiB): about 14% slower than the no-MTP run at the same fit target, and roughly 23% slower than the no-MTP run at --fit-target 0.
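The slowdown depends on which no-MTP run is taken as the baseline. The short sketch below recomputes it from the throughput and VRAM figures quoted above. The helper name `slowdown_pct` is ours, introduced only for illustration.

```python
def slowdown_pct(baseline_tps: float, new_tps: float) -> float:
    """Percent drop in throughput relative to the baseline run."""
    return (baseline_tps - new_tps) / baseline_tps * 100.0


# Figures quoted in the article for 35B Q4_K_XL: (tok/s, VRAM in MiB).
runs = {
    "no MTP, --fit-target 0": (97.0, 15815),
    "no MTP, --fit-target 1536": (86.0, 14269),
    "MTP, --fit-target 1536": (74.0, 14623),
}

mtp_tps, mtp_vram = runs["MTP, --fit-target 1536"]

for name, (tps, vram) in runs.items():
    if name.startswith("MTP"):
        continue  # skip comparing the MTP run against itself
    print(
        f"vs {name}: MTP is {slowdown_pct(tps, mtp_tps):.1f}% slower, "
        f"VRAM difference {mtp_vram - vram:+d} MiB"
    )
```

Against the 97 tok/s run the drop works out to about 23.7%, and against the 86 tok/s run at the same fit target it is about 14.0%.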

The root cause: MTP's compute buffer needs about 1.5 GB of VRAM headroom (hence --fit-target 1536), which pushes roughly 3 more MoE expert layers from GPU to CPU. Because MoE inference in this setup is bottlenecked by the expert layers that run on the CPU, MTP's 79% token acceptance rate can't compensate for the slower per-step speed.
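The trade-off between acceptance rate and per-step speed can be illustrated with a back-of-the-envelope model. This is a sketch under a strong simplifying assumption that is not from the source: each drafted token is accepted independently with the quoted 79% probability, and drafting stops at the first rejection. The function names are hypothetical, and the real accounting inside llama.cpp may differ.

```python
def expected_tokens_per_step(accept_rate: float, draft_n_max: int) -> float:
    """Expected tokens emitted per verification step.

    ASSUMPTION (not from the source): drafted tokens are accepted
    independently with probability accept_rate, stopping at the first
    rejection. The main model always contributes one token per step.
    """
    return sum(accept_rate ** k for k in range(draft_n_max + 1))


def implied_step_slowdown(
    tps_without: float, tps_with: float, accept_rate: float, draft_n_max: int
) -> float:
    """How much slower each decoding step must be, under the model above,
    to explain the measured throughput with MTP."""
    steps_per_s_with = tps_with / expected_tokens_per_step(accept_rate, draft_n_max)
    return tps_without / steps_per_s_with


# Values quoted in the article: 79% acceptance, --spec-draft-n-max 2,
# 86 tok/s without MTP and 74 tok/s with MTP at --fit-target 1536.
tokens = expected_tokens_per_step(0.79, 2)
factor = implied_step_slowdown(86.0, 74.0, 0.79, 2)
print(f"expected tokens per step: {tokens:.2f}")
print(f"implied per-step slowdown: {factor:.2f}x")
```

Under these assumptions MTP only pays off while a step costs less than about 2.4 times a normal step. The measured numbers imply that steps became slower than that once more expert layers ran on the CPU.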

The 27B model fits entirely on the GPU, so --fit-target 0 works with or without MTP and there is no VRAM penalty. In that case MTP boosts speed from ~56 to 73 tok/s.

Rule of Thumb

MTP helps when your model fits on GPU. It hurts when the MTP compute buffer forces more layers to CPU. On 16GB cards with 35B MoE, skip MTP.
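As a rough illustration, the rule of thumb can be written as a one-line check. This is a deliberately crude sketch: the function `mtp_likely_helps` is hypothetical, the 1.5 GB buffer is the approximate figure quoted above, and the check ignores KV cache and other runtime buffers, so treat it as a first filter and not a substitute for benchmarking.

```python
def mtp_likely_helps(
    model_gb: float, vram_gb: float, mtp_buffer_gb: float = 1.5
) -> bool:
    """Crude check: does the model plus the MTP compute buffer fit in VRAM?

    ASSUMPTION: ignores KV cache and other buffers, which also need VRAM.
    """
    return model_gb + mtp_buffer_gb <= vram_gb


# Model sizes as quoted in the article (GB), tested on a 16 GB card.
configs = {"27B IQ3": 12.45, "35B Q4_K_XL": 22.0, "35B Q8_0": 36.0}

for name, size_gb in configs.items():
    verdict = "try MTP" if mtp_likely_helps(size_gb, 16.0) else "skip MTP"
    print(f"{name} ({size_gb} GB): {verdict}")
```

On the three tested configs this matches the benchmark outcome: only the 27B model clears the check.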

Full test system: RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, llama.cpp b9204 (mainline). Common MTP flags: -np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2.
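To compare runs with and without MTP, the flag list above can be assembled programmatically. The sketch below makes several assumptions that are not in the source: the binary name `llama-server`, the model path `model.gguf`, and the use of the standard llama.cpp options `-m` and `-c` for model and context size are placeholders. It also assumes that the two `--spec-*` flags are the ones that switch MTP on, while the rest are shared by both runs.

```python
import shlex

# Flags quoted in the article, assumed to be shared by all runs.
BASE_FLAGS = [
    "-np", "1", "--fit", "on", "-fa", "on", "-t", "20",
    "--no-mmap", "--jinja", "-ctk", "q8_0", "-ctv", "q8_0",
]
# Flags quoted in the article, assumed to be what enables MTP drafting.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]


def build_command(
    model_path: str,
    fit_target_mib: int,
    use_mtp: bool,
    ctx_size: int = 131072,        # 128k context
    binary: str = "llama-server",  # ASSUMPTION: binary not named in the source
) -> list[str]:
    """Return an argv list for one benchmark run."""
    cmd = [binary, "-m", model_path, "-c", str(ctx_size),
           "--fit-target", str(fit_target_mib)]
    cmd += BASE_FLAGS
    if use_mtp:
        cmd += MTP_FLAGS
    return cmd


# "model.gguf" is a hypothetical placeholder path.
print(shlex.join(build_command("model.gguf", 1536, use_mtp=False)))
print(shlex.join(build_command("model.gguf", 1536, use_mtp=True)))
```

Keeping the shared flags in one list makes it harder to change something other than MTP by accident between the two runs.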

📖 Read the full source: r/LocalLLaMA
