MLX Inference Performance Update: April 2026 Benchmarks and Features

✍️ OpenClawRadar | 📅 Published: April 14, 2026

Performance Benchmarks on M2 Ultra

The source benchmarks MLX inference on a Mac Studio M2 Ultra with 128GB unified memory, running large models locally for coding agent workloads. Decode throughput (tokens/second) was measured for four models at KV cache depths of 4K, 64K, and 128K, with 256 output tokens per run.

Model Performance Data

  • Qwen3.5-27B (dense, 8-bit): 20.2 tok/s at 4K, 16.4 tok/s at 64K, 13.1 tok/s at 128K
  • Qwen3.5-35B-A3B (MoE, 8-bit): 71.8 tok/s at 4K, 53.5 tok/s at 64K, 41.9 tok/s at 128K
  • Nemotron Super 120B (5-bit): 36.4 tok/s at 4K, 31.2 tok/s at 64K, 28.4 tok/s at 128K
  • Qwen3.5-122B-A10B (MoE, 5-bit): 40.6 tok/s at 4K, 29.4 tok/s at 64K, 23.1 tok/s at 128K

The 35B MoE achieves high throughput because only 3B of its 35B parameters are active per token. Nemotron Super 120B shows minimal degradation with context (14% drop from 4K to 64K) because 80 of its 88 layers use Mamba-2, which has constant cost per token.
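The degradation percentages can be reproduced from the 4K and 64K figures listed above. The following is a minimal sketch of that arithmetic; the helper name `drop_pct` is illustrative and not part of any benchmark tooling mentioned in the source.

```python
# Decode throughput (tok/s) at 4K and 64K KV cache depth, copied from the list above.
throughput = {
    "Qwen3.5-27B": (20.2, 16.4),
    "Qwen3.5-35B-A3B": (71.8, 53.5),
    "Nemotron Super 120B": (36.4, 31.2),
    "Qwen3.5-122B-A10B": (40.6, 29.4),
}


def drop_pct(short_ctx: float, long_ctx: float) -> float:
    """Percentage of throughput lost going from the short to the long context."""
    return (short_ctx - long_ctx) / short_ctx * 100.0


for model, (at_4k, at_64k) in throughput.items():
    print(f"{model}: {drop_pct(at_4k, at_64k):.1f}% drop from 4K to 64K")
```

The same function applied to the 128K column shows how each model degrades over the full range.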

Feature Speedups

Multi-Token Prediction (MTP): Qwen 3.5 models have a built-in draft head that predicts the next token in parallel. With probabilistic acceptance at a 90% acceptance rate, the 122B goes from ~17 tok/s to 38.8 tok/s (2.3x speedup). Server overhead is minimal: a short-prompt request through vllm-mlx generates at 39 tok/s, matching the speed measured without the server.
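The source does not spell out how "probabilistic acceptance" is implemented. A common choice in speculative decoding is to accept a drafted token with probability min(1, p_target / p_draft); the sketch below assumes that rule. The function names and the probability pairs are hypothetical and are not taken from MLX or vllm-mlx.

```python
import random


def accept_draft(p_target: float, p_draft: float, u: float) -> bool:
    """Accept a drafted token with probability min(1, p_target / p_draft).

    p_target and p_draft are the probabilities that the target model and the
    draft head assign to the drafted token; u is a uniform sample in [0, 1).
    This rule is an assumption, not the confirmed MTP implementation.
    """
    if p_draft <= 0.0:
        return False
    return u < min(1.0, p_target / p_draft)


def acceptance_rate(prob_pairs, seed: int = 0) -> float:
    """Fraction of drafted tokens accepted over (p_target, p_draft) pairs."""
    rng = random.Random(seed)
    accepted = sum(accept_draft(pt, pd, rng.random()) for pt, pd in prob_pairs)
    return accepted / len(prob_pairs)


# Hypothetical pairs: the draft head mostly agrees with the target model.
pairs = [(0.9, 0.8), (0.7, 0.7), (0.2, 0.6), (0.5, 0.4)]
print(f"acceptance rate: {acceptance_rate(pairs):.2f}")
```

A drafted token that the target model rates at least as likely as the draft head did is always accepted; otherwise it is accepted in proportion to how closely the two models agree.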

SpecPrefill: For long prompts, a 2B draft model scores token importance via attention, then the target only prefills the top 20%. On the 122B at 128K context, Time To First Token (TTFT) drops from 19.3 minutes to 3.5 minutes (5.5x speedup). This feature only activates for prompts above 8K tokens.
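The selection step can be sketched as follows, assuming the draft model has already produced one importance score per prompt token. The function name, the parameter names, and the exact 8192-token threshold are illustrative assumptions; the source only says "top 20%" and "above 8K tokens".

```python
def select_prefill_tokens(importance, keep_ratio=0.2, min_prompt_len=8192):
    """Return the prompt positions the target model should prefill.

    importance: one score per prompt token (higher means more important),
    assumed to come from the draft model's attention.
    Prompts at or below min_prompt_len are prefilled in full.
    """
    n = len(importance)
    if n <= min_prompt_len:
        return list(range(n))  # short prompt: no sparsification
    keep = max(1, int(n * keep_ratio))
    # Rank positions by score, keep the best ones, then restore prompt order.
    top = sorted(range(n), key=lambda i: importance[i], reverse=True)[:keep]
    return sorted(top)


# Toy example with a lowered threshold so the selection path is exercised.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.0]
print(select_prefill_tokens(scores, keep_ratio=0.2, min_prompt_len=4))
```

Returning the kept positions in their original order matters, since the target model still needs the surviving tokens in sequence.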

MLX vs. llama.cpp Comparison

Benchmarking Qwen3.5-35B-A3B on both stacks (512 tokens generated after filling KV cache):

  • 32K context: MLX 8-bit: 60.8 tok/s, llama.cpp FA ON (5-bit): 54.85 tok/s, llama.cpp FA OFF: 36.45 tok/s
  • 64K context: MLX 8-bit: 53.2 tok/s, llama.cpp FA ON (5-bit): 45.84 tok/s, llama.cpp FA OFF: 24.47 tok/s
  • 128K context: MLX 8-bit: 42.7 tok/s, llama.cpp FA ON (5-bit): 34.48 tok/s, llama.cpp FA OFF: 13.73 tok/s

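MLX uses a 2-pass split-K decode kernel (sdpa_vector_2pass) that dispatches up to 1024 threadgroups at 128K context. In this comparison MLX at 8-bit is faster than llama.cpp with FA ON (5-bit) at every depth tested, and the gap widens with context: roughly 11% at 32K, 16% at 64K, and 24% at 128K. The quantization levels differ, so the comparison is not strictly like-for-like.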
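The idea behind a 2-pass split-K decode can be illustrated in plain Python for a single query vector. This is a conceptual sketch only, not the MLX Metal kernel: pass 1 computes a partial softmax over each chunk of keys (the work that would be spread across threadgroups), and pass 2 merges the partials using their running maxima. The function names are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Single-query softmax attention computed in one pass over all keys."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_split_k(q, keys, values, num_splits):
    """Same result, computed as independent chunks plus a merge step."""
    n, dim = len(keys), len(values[0])
    chunk = math.ceil(n / num_splits)
    partials = []
    # Pass 1: each chunk produces (local max, local sum, unnormalized output).
    for start in range(0, n, chunk):
        ks, vs = keys[start:start + chunk], values[start:start + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(dim)]
        partials.append((m, sum(weights), out))
    # Pass 2: rescale every partial to the global max and combine.
    g = max(m for m, _, _ in partials)
    denom = sum(s * math.exp(m - g) for m, s, _ in partials)
    return [sum(o[d] * math.exp(m - g) for m, _, o in partials) / denom for d in range(dim)]


q = [0.5, -1.0]
keys = [[i * 0.1, 1 - i * 0.05] for i in range(10)]
values = [[float(i), float(-i)] for i in range(10)]
print(attention_split_k(q, keys, values, num_splits=4))
```

Because the chunks in pass 1 are independent, a long KV cache can be divided among many parallel workers, which is what makes the approach attractive at 128K context.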

Hybrid Architecture Impact

The models tested use hybrid architectures with fewer attention layers:

  • Qwen3.5-35B-A3B: 25% attention layers (10 of 40), 71.8 tok/s at 4K, 25% drop at 64K
  • Nemotron Super 120B: 9% attention layers (8 of 88), 36.4 tok/s at 4K, 14% drop at 64K

Qwen 3.5 uses GatedDeltaNet layers (linear recurrence) for most of the network, with standard attention in only 25% of layers. Fewer attention layers mean less KV cache to scan per token and less degradation at long context.
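The effect on KV cache size can be estimated with the usual formula: two tensors (keys and values) per attention layer, each of size sequence length × KV heads × head dimension. In the sketch below, the KV head count, head dimension, and 2-byte element size are placeholder assumptions, not published figures for these models; only the layer counts (10 of 40) come from the text above.

```python
def kv_cache_bytes(attn_layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: keys plus values for every attention layer."""
    return 2 * attn_layers * seq_len * kv_heads * head_dim * bytes_per_elem


SEQ_LEN = 128 * 1024  # 128K context
KV_HEADS = 8          # placeholder assumption, not a published model value
HEAD_DIM = 128        # placeholder assumption, not a published model value

hybrid = kv_cache_bytes(10, SEQ_LEN, KV_HEADS, HEAD_DIM)  # 10 attention layers
full = kv_cache_bytes(40, SEQ_LEN, KV_HEADS, HEAD_DIM)    # if all 40 used attention

print(f"hybrid: {hybrid / 2**30:.1f} GiB, all-attention: {full / 2**30:.1f} GiB")
print(f"reduction: {full / hybrid:.0f}x")
```

Whatever the real head configuration is, the ratio depends only on the layer counts: with attention in 10 of 40 layers, there is a quarter as much cache to read per generated token.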

Recent Improvements

The MLX ecosystem has three layers that have seen rapid development. MLX core received a thread safety overhaul (per-thread M... [source text truncated]. Combined with continuous batching and prefix cache, the 122B now serves coding agents interactively at context lengths that were previously impractical.

📖 Read the full source: r/LocalLLaMA