MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads

MTPLX is an inference engine for Apple Silicon that exploits a model's built-in Multi-Token Prediction (MTP) heads as speculative drafters. The key result: Qwen 3.6 27B 4-bit MLX goes from 28 tok/s to 63 tok/s (2.24× faster) on a MacBook Pro M5 Max at temperature 0.6, top_p 0.95, top_k 20 — the exact settings Qwen recommends for coding.
How It Works
Unlike DFlash or DDTree (which require an external drafter model and are greedy-only), MTPLX uses the model's own MTP heads. Each MTP head drafts sequentially, producing per-token probability distributions. This enables exact rejection sampling with temperature and residual correction. No external drafter means no extra memory usage.
For Qwen 3.6 27B (which ships MTP heads up to depth 5), the optimal depth was found to be D3 after sweeping D2–D5. Deeper depths (D4/D5) had good early acceptance but deeper positions cost more verify time than tokens saved.
Status vs. DFlash / DDTree
DFlash MLX achieves higher raw speed but is restricted to greedy (temperature 0) sampling only, severely limiting real-world use. DDTree inherits the same limitations. Both require an external drafter. MTPLX works with any model that retains its MTP heads and supports full temperature-sampled inference.
Installation & Usage
MTPLX ships as a full CLI with the following commands:
mtplx start wizard— guided setup- Model download and inspection with four-tier MTP compatibility detection
- Configurable depth 2–7+
- OpenAI/Anthropic compatible API server, browser chat UI, terminal chat
- Benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore
- A 562-test suite included
The engine is built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.
Who It's For
Developers running local LLMs on Apple Silicon who need high-throughput, temperature-sampled inference for coding or creative writing without sacrificing output quality.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenCawt: Open Source Judiciary System for AI Agent Disputes
OpenCawt is an open source judiciary system for autonomous agents that lets them lodge disputes, present evidence, receive structured decisions, and seal outcomes as verifiable public records. It includes a lightweight protocol layer called OCP for formalizing agreements and decisions within other applications.

OpenProphet: Open-Source Autonomous Trading Agent with Web UI
OpenProphet is an open-source, autonomous trading agent with a web interface that supports multiple Alpaca accounts simultaneously and runs on OpenCode. It allows configuration of agent personas and strategies, with the ability to use any LLM, not just Claude.

TEMM1E v3.1.0: AI Agent That Self-Fine-Tunes Using User Interactions
TEMM1E v3.1.0 introduces Eigen-Tune, a system that captures LLM interactions as training data, scores quality from user behavior, and fine-tunes local models via LoRA with zero added LLM cost. Tested on Apple M2, it corrected temperature conversions from 72°F = '150°C' to '21.2°C' after 10 conversations.

Lumyr: Dashboard Generation via Claude with Python and Streamlit Automation
Lumyr is a tool that generates live, shareable dashboards from plain English descriptions using Claude for dashboard generation and automating the Python and Streamlit layer. Users don't need to write Python, open Streamlit, deploy, set up hosting, or manage infrastructure.