MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads

✍️ OpenClawRadar📅 Published: May 5, 2026🔗 Source
MTPLX: 2.24x Faster Tokens on Apple Silicon Using Native MTP Heads
Ad

MTPLX is an inference engine for Apple Silicon that exploits a model's built-in Multi-Token Prediction (MTP) heads as speculative drafters. The key result: Qwen 3.6 27B 4-bit MLX goes from 28 tok/s to 63 tok/s (2.24× faster) on a MacBook Pro M5 Max at temperature 0.6, top_p 0.95, top_k 20 — the exact settings Qwen recommends for coding.

How It Works

Unlike DFlash or DDTree (which require an external drafter model and are greedy-only), MTPLX uses the model's own MTP heads. Each MTP head drafts sequentially, producing per-token probability distributions. This enables exact rejection sampling with temperature and residual correction. No external drafter means no extra memory usage.

For Qwen 3.6 27B (which ships MTP heads up to depth 5), the optimal depth was found to be D3 after sweeping D2–D5. Deeper depths (D4/D5) had good early acceptance but deeper positions cost more verify time than tokens saved.

Status vs. DFlash / DDTree

DFlash MLX achieves higher raw speed but is restricted to greedy (temperature 0) sampling only, severely limiting real-world use. DDTree inherits the same limitations. Both require an external drafter. MTPLX works with any model that retains its MTP heads and supports full temperature-sampled inference.

Ad

Installation & Usage

MTPLX ships as a full CLI with the following commands:

  • mtplx start wizard — guided setup
  • Model download and inspection with four-tier MTP compatibility detection
  • Configurable depth 2–7+
  • OpenAI/Anthropic compatible API server, browser chat UI, terminal chat
  • Benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore
  • A 562-test suite included

The engine is built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.

Who It's For

Developers running local LLMs on Apple Silicon who need high-throughput, temperature-sampled inference for coding or creative writing without sacrificing output quality.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

OpenCawt: Open Source Judiciary System for AI Agent Disputes
Tools

OpenCawt: Open Source Judiciary System for AI Agent Disputes

OpenCawt is an open source judiciary system for autonomous agents that lets them lodge disputes, present evidence, receive structured decisions, and seal outcomes as verifiable public records. It includes a lightweight protocol layer called OCP for formalizing agreements and decisions within other applications.

OpenClawRadar
OpenProphet: Open-Source Autonomous Trading Agent with Web UI
Tools

OpenProphet: Open-Source Autonomous Trading Agent with Web UI

OpenProphet is an open-source, autonomous trading agent with a web interface that supports multiple Alpaca accounts simultaneously and runs on OpenCode. It allows configuration of agent personas and strategies, with the ability to use any LLM, not just Claude.

OpenClawRadar
TEMM1E v3.1.0: AI Agent That Self-Fine-Tunes Using User Interactions
Tools

TEMM1E v3.1.0: AI Agent That Self-Fine-Tunes Using User Interactions

TEMM1E v3.1.0 introduces Eigen-Tune, a system that captures LLM interactions as training data, scores quality from user behavior, and fine-tunes local models via LoRA with zero added LLM cost. Tested on Apple M2, it corrected temperature conversions from 72°F = '150°C' to '21.2°C' after 10 conversations.

OpenClawRadar
Lumyr: Dashboard Generation via Claude with Python and Streamlit Automation
Tools

Lumyr: Dashboard Generation via Claude with Python and Streamlit Automation

Lumyr is a tool that generates live, shareable dashboards from plain English descriptions using Claude for dashboard generation and automating the Python and Streamlit layer. Users don't need to write Python, open Streamlit, deploy, set up hosting, or manage infrastructure.

OpenClawRadar