MTP Acceptance Rate: 50% Threshold Determines Speculative Decoding Benefit

A Reddit user tested MTP (Multi-Token Prediction) using mlx-vlm on Gemma-4 (26B, 4-bit) and found that whether it helps depends almost entirely on the draft-token acceptance rate. The measurements, taken on an M4 Max Studio, give concrete numbers for three workloads.
Workload Results
- Code generation: 75 tok/s → 114.8 tok/s (1.53× faster); acceptance rate: 66% of slots
- Long-form prose: 75 tok/s → 71.1 tok/s (0.95×, essentially a wash); acceptance rate: 31% of slots
- JSON output: 51.3 tok/s → 25.6 tok/s (0.50×, i.e. half the speed); acceptance rate: 8% of slots
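The break-even point appears to be around 50% acceptance. Below that, the overhead of speculative decoding outweighs the gains: the prose run at 31% was already slightly slower than baseline, and the JSON run at 8% ran at half speed.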
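The ratios above can be reproduced from the reported throughput figures. The following is a minimal Python sketch; the helper names (`speedup`, `summarize`, `likely_worth_enabling`) are hypothetical, and the 0.5 cutoff is only the post's rough estimate, not a measured constant.

```python
# Figures as reported in the post:
# workload -> (baseline tok/s, tok/s with MTP, acceptance rate)
RESULTS = {
    "code generation": (75.0, 114.8, 0.66),
    "long-form prose": (75.0, 71.1, 0.31),
    "JSON output": (51.3, 25.6, 0.08),
}


def speedup(baseline_tps, mtp_tps):
    """Throughput with MTP divided by baseline throughput."""
    return mtp_tps / baseline_tps


def summarize(results):
    """Return (workload, speedup rounded to 2 places, acceptance rate) rows."""
    return [
        (name, round(speedup(base, mtp), 2), acc)
        for name, (base, mtp, acc) in results.items()
    ]


def likely_worth_enabling(acceptance_rate, threshold=0.5):
    """Rule of thumb from the post; the threshold is an assumption."""
    return acceptance_rate >= threshold


for name, ratio, acc in summarize(RESULTS):
    verdict = "enable" if likely_worth_enabling(acc) else "skip"
    print(f"{name}: {ratio:.2f}x at {acc:.0%} acceptance -> {verdict} MTP")
```

With only three data points, the cutoff is best treated as a starting guess to check against your own workload rather than a fixed rule.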
Test details: code was "write some python functions to do X"; long-form prose was "write an 800 word essay on paper money in the Tang Dynasty"; JSON output involved grouping items by similarity into structured output.
Bonus tip: The user notes that Gemma follows JSON-structure instructions reasonably well, but enabling structured output (json_schema) adds ~20% overhead. They recommend accepting slightly sloppy JSON and fixing it at runtime. In any case, mlx-vlm does not support json_schema together with speculative decoding.
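The post does not show how the user repairs sloppy JSON. One possible approach, sketched here with only the Python standard library, is to cut away any text around the outermost brackets and drop trailing commas before parsing. The function name `parse_sloppy_json` and the sample string are hypothetical.

```python
import json
import re


def parse_sloppy_json(text):
    """Best-effort parse of model output that is almost valid JSON."""
    # Keep only the span from the first opening bracket to the last closing
    # one, which discards surrounding chatter or Markdown code fences.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    cleaned = text[min(starts):end + 1]
    # Remove trailing commas before a closing bracket. Caveat: this naive
    # regex would also alter a string value that happens to contain ",}".
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)


# Hypothetical model output with extra text and trailing commas
sample = 'Here you go:\n{"groups": [["a", "b"], ["c"],],}\nDone.'
print(parse_sloppy_json(sample))
```

Output that is still invalid after these fixes raises `json.JSONDecodeError`, so the caller can retry the request or fall back to another path.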
Bottom line: MTP is great for local coding but can degrade performance for structured or prose tasks with low acceptance rates.
📖 Read the full source: r/LocalLLaMA