Qwen 3.6 27B hits 2.5x speed with MTP speculative decoding on llama.cpp

A Reddit user has compiled llama.cpp with a pending PR (#22673) that enables Multi-Token Prediction (MTP) for Qwen 3.6 27B. MTP uses the model's built-in tensor layers for speculative decoding, claiming a 2.5x speedup — from ~11 tok/s to 28 tok/s on a Mac M2 Max 96GB.
Key Details
- Model: Qwen 3.6 27B (Qwen2.5-3.0 architecture variant)
- Hardware tested: Mac M2 Max 96GB
- Results: 28 tok/s with MTP (vs ~11 tok/s without)
- Context support: Up to 262K tokens with turbo4 KV cache on 48GB Mac
- Quantizations: Pre-converted GGUF quants uploaded by the user at
froggeric/Qwen3.6-27B-MTP-GGUF
Compilation Instructions
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-serverServer Command
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--spec-type mtp --spec-draft-n-max 5 \
--cache-type-k turbo4 --cache-type-v turbo4 \
-c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081Three optimizations combined:
--spec-type mtp --spec-draft-n-max 5: enables MTP speculative decoding (2.5x faster)--cache-type-k turbo4 --cache-type-v turbo4: 4.25-bit KV cache (quarter memory vs 16-bit)-c 262144: 262K context window (fits 48GB with turbo4)
Hardware Recommendations
Apple Silicon and NVIDIA GPU quantization/KV cache tables are provided in the source for RAM-constrained setups (e.g., IQ2_M on 16GB Apple Silicon with 48K context). Vision (mmproj) support is available on 32GB+ configurations.
Additional Fixes
The user also published 7 fixes to the Qwen jinja chat template that were broken due to vLLM-specific formatting. These are now compatible with llama.cpp and other tools.
Note: Existing GGUF files on Hugging Face do not include MTP support — they require re-conversion with the PR applied. The user warns that initial uploads are incomplete; check the Hugging Face repo status.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Tessera: Open-Source GUI Workspace for Managing Multiple Claude Code Sessions
Tessera is an open-source GUI that lets you run multiple Claude Code sessions side by side with Git worktree isolation, Kanban task tracking, live diffs, and agent activity inspection.

Multi-Agent Content Pipeline for Claude Code with Quality Gates
A developer built a six-agent content pipeline for Claude Code that separates research, writing, editing, and SEO tasks with quality gates between stages. The system halts for manual approval before publishing and allows individual agent re-runs.

Selfware: Rust-based local AI agent framework with PDVR architecture
Selfware is an open-source AI agent framework built in Rust for local inference, implementing a PDVR cognitive cycle with 54 built-in tools and designed for long-running tasks on consumer hardware.

OpenClaw Developer Achieves AI Agent Breakthroughs with Uber and Restaurant Booking Automation
An OpenClaw developer has successfully created AI agents that autonomously complete Uber ride bookings and restaurant reservations on real websites, overcoming bot detection and CAPTCHAs using a stack with stealth browsers, residential proxies, and CAPTCHA solving.