Qwen 3.6 27B hits 2.5x speed with MTP speculative decoding on llama.cpp

✍️ OpenClawRadar📅 Published: May 6, 2026🔗 Source
Qwen 3.6 27B hits 2.5x speed with MTP speculative decoding on llama.cpp
Ad

A Reddit user has compiled llama.cpp with a pending PR (#22673) that enables Multi-Token Prediction (MTP) for Qwen 3.6 27B. MTP uses the model's built-in tensor layers for speculative decoding, claiming a 2.5x speedup — from ~11 tok/s to 28 tok/s on a Mac M2 Max 96GB.

Key Details

  • Model: Qwen 3.6 27B (Qwen2.5-3.0 architecture variant)
  • Hardware tested: Mac M2 Max 96GB
  • Results: 28 tok/s with MTP (vs ~11 tok/s without)
  • Context support: Up to 262K tokens with turbo4 KV cache on 48GB Mac
  • Quantizations: Pre-converted GGUF quants uploaded by the user at froggeric/Qwen3.6-27B-MTP-GGUF

Compilation Instructions

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
Ad

Server Command

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k turbo4 --cache-type-v turbo4 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081

Three optimizations combined:

  • --spec-type mtp --spec-draft-n-max 5: enables MTP speculative decoding (2.5x faster)
  • --cache-type-k turbo4 --cache-type-v turbo4: 4.25-bit KV cache (quarter memory vs 16-bit)
  • -c 262144: 262K context window (fits 48GB with turbo4)

Hardware Recommendations

Apple Silicon and NVIDIA GPU quantization/KV cache tables are provided in the source for RAM-constrained setups (e.g., IQ2_M on 16GB Apple Silicon with 48K context). Vision (mmproj) support is available on 32GB+ configurations.

Additional Fixes

The user also published 7 fixes to the Qwen jinja chat template that were broken due to vLLM-specific formatting. These are now compatible with llama.cpp and other tools.

Note: Existing GGUF files on Hugging Face do not include MTP support — they require re-conversion with the PR applied. The user warns that initial uploads are incomplete; check the Hugging Face repo status.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also