MTP + Unified Memory Boosts llama.cpp Inference 30% on RTX 5090
Combining GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 with Multi-Token Prediction (MTP) speculation in llama.cpp yields a roughly 30% throughput improvement over unified memory alone: 64 tok/sec versus 49 tok/sec on a Qwen3.6-27B Q8_0 model. The benchmark was run on an RTX 5090 paired with 128 GB of DDR5-5600 CL36 and a Ryzen 9 9950X3D.
Command & Configuration
CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
-m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
--threads 16 \
-c 262144 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 3 \
--webui-mcp-proxy \
--chat-template-kwargs '{"preserve_thinking": true}' \
--host 0.0.0.0 \
--port 8090 \
--jinja
Key flags:
- GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: makes the CUDA backend allocate unified (managed) memory, so allocations that do not fit in VRAM can spill into system RAM instead of failing. This is what helps with large contexts.
- --spec-type mtp --spec-draft-n-max 3: enables Multi-Token Prediction speculation with a draft depth of 3.
- Qwen3.6-27B-Q8_0.gguf: a 27B-parameter Qwen3.6 model quantized to Q8_0, prepared with Unsloth's MTP support.
- -c 262144: 256K context window.
- -fa on: enables flash attention.
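To see why spilling into system RAM matters for this setup, a rough size estimate helps. The sketch below assumes a nominal 27 billion parameters and 8.5 bits per weight for Q8_0 (each block of 32 weights stores 32 int8 values plus a 16-bit scale). The helper name weights_gb is hypothetical, and real GGUF files differ somewhat because some tensors are kept at other precisions.

```shell
# Rough weight footprint in GB (10^9 bytes):
#   parameters (in billions) * bits per weight / 8
# weights_gb is a hypothetical helper for illustration, not part of llama.cpp.
weights_gb() {
  awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

weights_gb 27 8.5   # nominal 27B parameters at Q8_0
```

Under these assumptions the weights alone come to about 28.7 GB. Against the RTX 5090's 32 GB of VRAM that leaves only a few GB for the KV cache and compute buffers, so a 256K context can plausibly exceed VRAM and depend on the unified-memory fallback.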
Results
- Without MTP (only unified memory): 49 tok/sec
- With MTP + unified memory: 64 tok/sec
- Gain: about 30% higher throughput (64 / 49 ≈ 1.31)
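The gain can be checked directly from the two reported rates. This is a small sketch using only the numbers above; the helper names speedup_pct and ms_per_token are made up for illustration.

```shell
# Relative throughput gain in percent: (new / base - 1) * 100.
speedup_pct() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}

# Average time per generated token in milliseconds for a given tok/sec rate.
ms_per_token() {
  awk -v tps="$1" 'BEGIN { printf "%.3f\n", 1000 / tps }'
}

speedup_pct 49 64   # gain of the MTP run over the baseline
ms_per_token 49     # baseline latency per token
ms_per_token 64     # MTP latency per token
```

With the reported figures this gives a 30.6% gain, which corresponds to the average time per token dropping from roughly 20.4 ms to 15.6 ms.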
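A --spec-draft-n-max of 3 means up to 3 tokens are drafted ahead per step and then verified together, so every accepted draft token saves one serial decoding step. Unified memory plays a different role: it lets allocations that do not fit in VRAM spill into system RAM instead of failing. It does not eliminate PCIe traffic, since any data that ends up in system RAM is still reached over the bus.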
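How much a draft depth of 3 can help depends on how often drafted tokens are accepted, which the source does not report. As a rough illustration, the sketch below uses the standard speculative-decoding estimate: if each drafted token is accepted independently with probability alpha, one verification pass yields (1 - alpha^(n+1)) / (1 - alpha) tokens on average for draft depth n. The function name and the alpha values are hypothetical, and the independence assumption is a simplification.

```shell
# Expected tokens produced per verification pass for draft depth n,
# assuming each drafted token is accepted independently with probability alpha.
# Hypothetical helper for illustration; the alpha values are not measured.
expected_tokens() {
  awk -v alpha="$1" -v n="$2" 'BEGIN {
    # With alpha = 1 every draft is accepted: n drafted tokens plus one more.
    if (alpha >= 1) { printf "%.3f\n", n + 1; exit }
    printf "%.3f\n", (1 - alpha ^ (n + 1)) / (1 - alpha)
  }'
}

expected_tokens 0.5 3   # half of the drafted tokens accepted
expected_tokens 0.8 3   # most drafted tokens accepted
```

Tokens per pass is only an upper bound on the speedup, because drafting and verification add their own cost. That fits the measured gain of about 1.31x sitting well below the ceiling of 4 tokens per pass for a depth of 3.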
Who This Is For
Developers running large-context local inference on high-end consumer GPUs (RTX 5090) with ample system RAM (≥128GB). Suitable for chatbots, code assistants, or any latency-sensitive LLM workload where speculative sampling is supported.
📖 Read the full source: r/LocalLLaMA