Speculative Decoding Benefit: MTP Acceptance Rate >50%

A Reddit user tested MTP (Multi-Token Prediction) using mlx-vlm on Gemma-4 (26B, 4-bit) and found that whether it helps depends on the draft-token acceptance rate. Measurements on an M4 Max Mac Studio point to a rough break-even point.

Workload Results

Code generation: 75 tok/s → 114.8 tok/s (1.53× faster); acceptance rate: 66% of slots
Long-form prose: 75 tok/s → 71.1 tok/s (0.95×, essentially a wash); acceptance rate: 31% of slots
JSON output: 51.3 tok/s → 25.6 tok/s (0.50×, about half speed); acceptance rate: 8% of slots

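The break-even point appears to be around 50% acceptance: the workload at 66% sped up, while the one at 31% came out slightly slower. Below that point, the overhead of drafting and verifying tokens that end up rejected outweighs the gains.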
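To make the arithmetic behind these figures explicit, here is a minimal sketch that recomputes the speedup ratios from the reported throughput numbers and checks each workload against the approximate 50% threshold. The names `RESULTS`, `THRESHOLD`, `speedup`, and `summarize` are illustrative and not part of mlx-vlm or the original post; the threshold is the post's rough estimate, not a measured constant.

```python
# Figures reported in the post: (baseline tok/s, MTP tok/s, acceptance rate).
RESULTS = {
    "code generation": (75.0, 114.8, 0.66),
    "long-form prose": (75.0, 71.1, 0.31),
    "JSON output": (51.3, 25.6, 0.08),
}

# Approximate break-even acceptance rate suggested by the post (an estimate).
THRESHOLD = 0.50


def speedup(baseline: float, with_mtp: float) -> float:
    """Throughput ratio: values above 1.0 mean MTP was faster."""
    return with_mtp / baseline


def summarize(results=RESULTS, threshold=THRESHOLD):
    """Return (workload, rounded speedup, acceptance >= threshold) per workload."""
    rows = []
    for name, (base, mtp, acceptance) in results.items():
        rows.append((name, round(speedup(base, mtp), 2), acceptance >= threshold))
    return rows


if __name__ == "__main__":
    for name, ratio, above in summarize():
        side = "above" if above else "below"
        print(f"{name}: {ratio}x, acceptance {side} threshold")
```

In this data, the only workload above the threshold is also the only one with a ratio above 1.0, which is consistent with the post's conclusion but rests on just three measurements.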

Test details: the code prompt was "write some python functions to do X"; the long-form prose prompt was "write an 800 word essay on paper money in the Tang Dynasty"; the JSON task involved grouping items by similarity into structured output.

Bonus tip: The user notes that Gemma follows JSON-structure instructions reasonably well, but enabling structured output (json_schema) adds ~20% overhead. They recommend accepting slightly sloppy JSON and repairing it at runtime. In any case, mlx-vlm does not support json_schema with speculative decoding.
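The post does not say how the user repairs the output, so the following is only a sketch of one possible runtime fix, using Python's standard library. The function name `parse_sloppy_json` is hypothetical, and the repairs it attempts (discarding text around the payload, dropping trailing commas) are assumptions about what "sloppy" means here.

```python
import json
import re


def parse_sloppy_json(text: str):
    """Best-effort parse of model output that is almost valid JSON."""
    # Fast path: the output is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Keep only the span from the first opening bracket to the last closing
    # bracket, which drops code fences and any chatter around the payload.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end == -1:
        raise ValueError("no JSON object or array found")
    candidate = text[min(starts):end + 1]

    # Drop trailing commas before a closing bracket. This is naive: it would
    # also rewrite the same pattern if it appeared inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    # Raises json.JSONDecodeError (a ValueError) if the text is still invalid.
    return json.loads(candidate)


if __name__ == "__main__":
    raw = 'Here is the grouping:\n{"groups": [["a", "b"], ["c"],],}\nDone.'
    print(parse_sloppy_json(raw))
```

A repair step like this only covers common surface errors; output that is structurally wrong still raises an exception, so callers may want a retry or fallback path.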

Bottom line: MTP works well for local code generation but can slow down structured-output or prose tasks where acceptance rates are low.

📖 Read the full source: r/LocalLLaMA
