MTP + Unified Memory Boosts llama.cpp Inference 30% on RTX 5090

✍️ OpenClawRadar📅 Published: May 12, 2026🔗 Source
Ad

Combining GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 with Multi-Token Prediction (MTP) speculation in llama.cpp yields a ~30% throughput improvement — 64 tok/sec vs 49 tok/sec on a Qwen3.6-27B Q8_0 model. The benchmark was run on an RTX 5090 paired with 128GB DDR5 5600 CL36 and a Ryzen 9 9950X3D.

Command & Configuration

CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
    -m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
    --threads 16 \
    -c 262144 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 3 \
    --webui-mcp-proxy \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --host 0.0.0.0 \
    --port 8090 \
    --jinja

Key flags:

  • GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 — allows the GPU to directly access host memory, bypassing CUDA malloc for large contexts.
  • --spec-type mtp --spec-draft-n-max 3 — enables Multi-Token Prediction speculation with a draft depth of 3.
  • Qwen3.6-27B-Q8_0.gguf — a 27B parameter Qwen3.6 model quantized to Q8_0, prepared with Unsloth’s MTP support.
  • -c 262144 — 256K context window; -fa on for flash attention.
Ad

Results

  • Without MTP (only unified memory): 49 tok/sec
  • With MTP + unified memory: 64 tok/sec
  • Gain: 30% higher throughput

The draft-n-max of 3 means the model speculates up to 3 tokens ahead, reducing serial decoding overhead. Combined with unified memory, it avoids expensive PCIe transfers between CPU and GPU RAM.

Who This Is For

Developers running large-context local inference on high-end consumer GPUs (RTX 5090) with ample system RAM (≥128GB). Suitable for chatbots, code assistants, or any latency-sensitive LLM workload where speculative sampling is supported.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also