MTP + Unified Memory Boosts llama.cpp 30% on RTX 5090

Adding Multi-Token Prediction (MTP) speculation to llama.cpp on top of GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 yields a roughly 30% throughput improvement: 64 tok/sec versus 49 tok/sec with unified memory alone, on a Qwen3.6-27B Q8_0 model. The benchmark was run on an RTX 5090 paired with 128GB of DDR5-5600 CL36 RAM and a Ryzen 9 9950X3D.

Command & Configuration

CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
    -m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
    --threads 16 \
    -c 262144 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 3 \
    --webui-mcp-proxy \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --host 0.0.0.0 \
    --port 8090 \
    --jinja

Key flags:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: makes llama.cpp's CUDA backend allocate managed (unified) memory, so on Linux allocations that do not fit in VRAM can spill into system RAM instead of failing. This matters for large contexts.
--spec-type mtp --spec-draft-n-max 3: enables Multi-Token Prediction speculation, drafting at most 3 tokens per step.
Qwen3.6-27B-Q8_0.gguf: a 27B-parameter Qwen3.6 model quantized to Q8_0, prepared with Unsloth’s MTP support.
-c 262144: a 256K-token context window.
-fa on: enables flash attention.
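
The two configurations compared under Results differ only in the speculation flags, so an A/B run can reuse everything else. The helper below is a minimal sketch of that idea, not the author's script: build_args is a hypothetical name, MODEL is a placeholder, and the flag values are copied from the command above rather than from llama.cpp documentation.

```shell
# Hypothetical helper: print the llama-server flags for one benchmark run.
# Pass "mtp" to include the speculation flags from the command above;
# any other value gives the unified-memory-only baseline.
build_args() {
    args="--threads 16 -c 262144 -fa on -np 1 --port 8090 --jinja"
    if [ "$1" = "mtp" ]; then
        args="$args --spec-type mtp --spec-draft-n-max 3"
    fi
    printf '%s\n' "$args"
}

# Example launch (MODEL is a placeholder for the path to the GGUF file):
#   GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server -m "$MODEL" $(build_args mtp)
```

Keeping every other flag identical between the two runs makes the tok/sec difference attributable to speculation alone.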

Results

Without MTP (only unified memory): 49 tok/sec
With MTP + unified memory: 64 tok/sec
Gain: about 30% higher throughput (64 / 49 ≈ 1.31)
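
The gain follows directly from the two throughput figures. As a small sketch, the percentage can be recomputed with awk; speedup_pct is a hypothetical helper name, and the inputs are the two numbers reported above.

```shell
# Relative throughput gain in percent: (new / base - 1) * 100
speedup_pct() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}

speedup_pct 49 64   # prints 30.6
```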

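With --spec-draft-n-max 3, up to 3 tokens are drafted ahead and then verified together, so each accepted draft token saves one serial decoding step. Unified memory plays a different role: it lets allocations that do not fit in VRAM spill to system RAM, which is what makes a 256K context workable on a single card. Data that ends up in host memory is still reached over PCIe, so the setting does not eliminate transfers; it trades some speed for the ability to run at this context size.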
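
How much a draft depth of 3 can help depends on how often drafted tokens are accepted, which the source does not report. As a rough sketch, the standard speculative-decoding estimate assumes each drafted token is accepted independently with probability a, giving (1 - a^(k+1)) / (1 - a) expected tokens per verification pass for draft length k. The helper name expected_tokens and the acceptance rates used below are illustrative assumptions, not measurements from this benchmark.

```shell
# Expected tokens produced per verification pass, assuming each of the k
# drafted tokens is accepted independently with probability a:
#   E = (1 - a^(k+1)) / (1 - a)
# The count includes the one token every pass yields even if all drafts fail.
expected_tokens() {
    awk -v a="$1" -v k="$2" 'BEGIN {
        if (a >= 1) { printf "%.2f\n", k + 1; exit }   # every draft accepted
        printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a)
    }'
}

expected_tokens 0.8 3   # prints 2.95
expected_tokens 0 3     # prints 1.00 (no draft ever accepted)
```

This is an upper bound on tokens per pass, not on speedup: drafting and verifying extra tokens has its own cost, which is one reason a measured gain such as the 30% above is smaller than the token count alone would suggest.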

Who This Is For

This setup is aimed at developers running large-context local inference on high-end consumer GPUs such as the RTX 5090, with ample system RAM (128GB in this benchmark). It suits chatbots, code assistants, and other latency-sensitive LLM workloads where both the model and the llama.cpp build support MTP speculation.

📖 Read the full source: r/LocalLLaMA
