MTP Acceptance Rate: 50% Threshold Determines Speculative Decoding Benefit

✍️ OpenClawRadar 📅 Published: May 9, 2026

A Reddit user tested MTP (Multi-Token Prediction) using mlx-vlm on Gemma-4 (26B, 4-bit) and found that performance depends entirely on the draft token acceptance rate. Measurements on an M4 Max Studio show where MTP helps and where it hurts.

Workload Results

  • Code generation: 75 tok/s → 114.8 tok/s (1.53× faster), acceptance rate: 66% of slots
  • Long-form prose: 75 tok/s → 71.1 tok/s (0.95×, essentially a wash), acceptance rate: 31% of slots
  • JSON output: 51.3 tok/s → 25.6 tok/s (0.50×, half the baseline speed), acceptance rate: 8% of slots

The break-even point appears to be around 50% acceptance. Below that, the overhead of drafting and verifying speculative tokens outweighs the gains.
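As a quick sanity check, the reported ratios can be recomputed from the raw throughput figures. The sketch below is illustrative only: the dictionary layout, the function names, and the hard-coded 0.50 cutoff are assumptions made for this example and are not part of mlx-vlm or the original post.

```python
# Figures as reported in the post:
# workload -> (baseline tok/s, tok/s with MTP, draft acceptance rate)
RESULTS = {
    "code generation": (75.0, 114.8, 0.66),
    "long-form prose": (75.0, 71.1, 0.31),
    "JSON output": (51.3, 25.6, 0.08),
}

# Approximate break-even acceptance rate suggested by the post (assumption: exactly 0.50).
THRESHOLD = 0.50


def speedup(baseline_tps, mtp_tps):
    """Ratio of MTP throughput to baseline throughput (>1 means MTP is faster)."""
    return mtp_tps / baseline_tps


def summarize(results, threshold=THRESHOLD):
    """Return (workload, rounded speedup, acceptance >= threshold) for each entry."""
    rows = []
    for name, (baseline_tps, mtp_tps, acceptance) in results.items():
        ratio = round(speedup(baseline_tps, mtp_tps), 2)
        rows.append((name, ratio, acceptance >= threshold))
    return rows


if __name__ == "__main__":
    for name, ratio, above in summarize(RESULTS):
        side = "above" if above else "below"
        print(f"{name}: {ratio}x, acceptance {side} threshold")
```

Only the workload above the cutoff shows a speedup. Three data points are too few to pin the break-even down precisely, so the 50% figure is best read as a rule of thumb.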

Test details: code was "write some python functions to do X"; long-form prose was "write an 800 word essay on paper money in the Tang Dynasty"; JSON output involved grouping items by similarity into structured output.

Bonus tip: The user notes that Gemma follows JSON structure instructions reasonably well, but enabling structured output (json_schema) adds ~20% overhead. They recommend accepting slightly sloppy JSON and fixing it at runtime. In any case, mlx-vlm does not support json_schema with speculative decoding.
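The post does not say how the user repairs the JSON, so the following is only one possible approach, sketched with the Python standard library. The function name `parse_sloppy_json` is hypothetical, and the two repairs it attempts (dropping text around the JSON body and removing trailing commas) are assumptions about what "slightly sloppy" output might look like.

```python
import json
import re


def parse_sloppy_json(text):
    """Best-effort parse of model output that is almost valid JSON."""
    # Keep only the span from the first opening bracket to the last closing
    # bracket, which drops any chatty preamble, code fence, or sign-off.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    candidate = text[min(starts):end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Naive repair: remove trailing commas before a closing bracket.
        # Caveat: this also rewrites ",]" or ",}" inside string values.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)  # still raises if the text is beyond repair


if __name__ == "__main__":
    raw = 'Here you go:\n{"groups": [["a", "b"], ["c"],],}\nDone.'
    print(parse_sloppy_json(raw))
```

Valid JSON goes through the first `json.loads` call untouched, so the repair step only costs anything when the output is actually malformed.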

Bottom line: MTP is great for local coding but can degrade performance for structured or prose tasks with low acceptance rates.

📖 Read the full source: r/LocalLLaMA
