MLX Inference Performance Update: April 2026 Benchmarks and Features

Performance Benchmarks on M2 Ultra
The source benchmarks MLX inference on a Mac Studio M2 Ultra with 128GB unified memory, running large models locally for coding agent workloads. Generation speed was measured across four models with decode throughput in tokens/second at various KV cache depths (256 output tokens per run).
Model Performance Data
- Qwen3.5-27B (dense, 8-bit): 20.2 tok/s at 4K, 16.4 tok/s at 64K, 13.1 tok/s at 128K
- Qwen3.5-35B-A3B (MoE, 8-bit): 71.8 tok/s at 4K, 53.5 tok/s at 64K, 41.9 tok/s at 128K
- Nemotron Super 120B (5-bit): 36.4 tok/s at 4K, 31.2 tok/s at 64K, 28.4 tok/s at 128K
- Qwen3.5-122B-A10B (MoE, 5-bit): 40.6 tok/s at 4K, 29.4 tok/s at 64K, 23.1 tok/s at 128K
The 35B MoE achieves high throughput because only 3B of its 35B parameters are active per token. Nemotron Super 120B shows minimal degradation with context (14% drop from 4K to 64K) because 80 of its 88 layers use Mamba-2, which has constant cost per token.
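The degradation figures can be recomputed directly from the list above. The following is a minimal sketch; the dictionary layout and function names are illustrative and not part of the source benchmark.

```python
# Decode throughput (tok/s) copied from the list above, keyed by KV cache depth.
THROUGHPUT = {
    "Qwen3.5-27B": {"4K": 20.2, "64K": 16.4, "128K": 13.1},
    "Qwen3.5-35B-A3B": {"4K": 71.8, "64K": 53.5, "128K": 41.9},
    "Nemotron Super 120B": {"4K": 36.4, "64K": 31.2, "128K": 28.4},
    "Qwen3.5-122B-A10B": {"4K": 40.6, "64K": 29.4, "128K": 23.1},
}


def percent_drop(baseline: float, value: float) -> float:
    """Relative throughput loss versus the baseline, in percent."""
    return (baseline - value) / baseline * 100.0


def drops_from_4k(model: str) -> dict:
    """Percent drop at each deeper context, relative to the 4K figure."""
    rates = THROUGHPUT[model]
    return {
        depth: percent_drop(rates["4K"], rate)
        for depth, rate in rates.items()
        if depth != "4K"
    }


if __name__ == "__main__":
    for name in THROUGHPUT:
        summary = ", ".join(
            f"{depth}: {drop:.1f}%" for depth, drop in drops_from_4k(name).items()
        )
        print(f"{name}: {summary}")
```

Running it for Nemotron Super 120B reproduces the roughly 14% drop from 4K to 64K quoted in the text.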
Feature Speedups
Multi-Token Prediction (MTP): Qwen 3.5 models have a built-in draft head that predicts the next token in parallel. With probabilistic acceptance at a 90% rate, the 122B goes from ~17 tok/s to 38.8 tok/s (a 2.3x speedup). Server overhead is minimal: a short-prompt request through vllm-mlx generates at 39 tok/s, matching the baseline.
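The source does not spell out what "probabilistic acceptance" means here. A common rule in speculative decoding accepts a drafted token with probability min(1, p_target / q_draft), and the sketch below illustrates that rule only. It is an assumption about the general technique, not a description of the MTP implementation benchmarked above, and the function name is hypothetical.

```python
def accept_draft(p_target: float, q_draft: float, u: float) -> bool:
    """Standard speculative-sampling acceptance test (assumed, not from the source).

    p_target: probability the target model assigns to the drafted token
    q_draft:  probability the draft head assigned to the same token
    u:        a uniform random number in [0, 1)
    """
    if q_draft <= 0.0:
        raise ValueError("the draft head cannot have proposed a zero-probability token")
    # Always accept when the target agrees at least as strongly as the draft;
    # otherwise accept with probability p_target / q_draft.
    return u < min(1.0, p_target / q_draft)
```

Under this rule a drafted token the target model likes at least as much as the draft head is always kept, which is one way a well-matched draft head can reach a high acceptance rate.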
SpecPrefill: For long prompts, a 2B draft model scores token importance via attention, and the target model then prefills only the top 20% of tokens. On the 122B at 128K context, time to first token (TTFT) drops from 19.3 minutes to 3.5 minutes (a 5.5x speedup). The feature activates only for prompts above 8K tokens.
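The selection step can be sketched as follows, assuming the draft model has already produced one importance score per prompt token. The function name, the exact threshold of 8 * 1024 tokens, and the rounding of the kept count are illustrative assumptions; the source gives only "top 20%" and "above 8K tokens".

```python
MIN_PROMPT_TOKENS = 8 * 1024  # assumed reading of "above 8K tokens"
KEEP_FRACTION = 0.20          # "top 20%" from the text


def select_tokens_for_prefill(
    importance,
    keep_fraction=KEEP_FRACTION,
    min_prompt_tokens=MIN_PROMPT_TOKENS,
):
    """Return the positions of the prompt tokens the target model should prefill.

    importance: one score per prompt token, higher means more important.
    Short prompts are returned unchanged, since the feature is gated on length.
    """
    n = len(importance)
    if n <= min_prompt_tokens:
        return list(range(n))
    keep = max(1, int(n * keep_fraction))
    # Rank positions by score, keep the best ones, then restore prompt order.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.0, 0.4, 0.5, 0.6, 0.7]
    # Threshold lowered to 0 so the tiny demo prompt is actually pruned.
    print(select_tokens_for_prefill(scores, min_prompt_tokens=0))
```

Returning positions in their original order matters because the target model still needs the surviving tokens in sequence.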
MLX vs. llama.cpp Comparison
Benchmarking Qwen3.5-35B-A3B on both stacks (512 tokens generated after filling the KV cache; FA = flash attention):
- 32K context: MLX 8-bit: 60.8 tok/s, llama.cpp FA ON (5-bit): 54.85 tok/s, llama.cpp FA OFF: 36.45 tok/s
- 64K context: MLX 8-bit: 53.2 tok/s, llama.cpp FA ON (5-bit): 45.84 tok/s, llama.cpp FA OFF: 24.47 tok/s
- 128K context: MLX 8-bit: 42.7 tok/s, llama.cpp FA ON (5-bit): 34.48 tok/s, llama.cpp FA OFF: 13.73 tok/s
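MLX uses a 2-pass split-K decode kernel (sdpa_vector_2pass) that dispatches up to 1024 threadgroups at 128K context. In these runs MLX at 8-bit is faster than llama.cpp at 5-bit with flash attention at every tested depth, and the gap widens with context: about 11% at 32K, 16% at 64K, and 24% at 128K.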
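The idea behind a 2-pass split-K decode can be shown in plain Python. This is a sketch of the general technique, not MLX's Metal kernel: pass 1 computes a partial softmax over each slice of the KV cache independently (the part a GPU can spread across many threadgroups), and pass 2 merges the partial results with a log-sum-exp rescale. All function names are illustrative.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attention_single_pass(q, keys, values):
    """Reference: softmax attention for one query over the whole cache."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_two_pass(q, keys, values, num_splits):
    """Split-K attention: independent partials per slice, then one merge."""
    n = len(keys)
    dim = len(values[0])
    chunk = math.ceil(n / num_splits)
    partials = []
    # Pass 1: each slice keeps its own max, sum of exponentials, and
    # unnormalized weighted sum of values.
    for start in range(0, n, chunk):
        ks, vs = keys[start:start + chunk], values[start:start + chunk]
        scores = _scores(q, ks)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        acc = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(dim)]
        partials.append((m, sum(weights), acc))
    # Pass 2: rescale every partial to the global max and combine.
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * dim
    for m, s, acc in partials:
        rescale = math.exp(m - global_max)
        total += s * rescale
        for d in range(dim):
            out[d] += acc[d] * rescale
    return [x / total for x in out]
```

Because the slices in pass 1 do not depend on one another, the number of slices can grow with context length, which is consistent with the large threadgroup count reported at 128K.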
Hybrid Architecture Impact
The models tested use hybrid architectures with fewer attention layers:
- Qwen3.5-35B-A3B: 25% attention layers (10 of 40), 71.8 tok/s at 4K, 25% drop at 64K
- Nemotron Super 120B: 9% attention layers (8 of 88), 36.4 tok/s at 4K, 14% drop at 64K
Qwen 3.5 uses GatedDeltaNet layers (linear recurrence) for most of the network, with standard attention in only 25% of layers. Fewer attention layers mean less KV cache to scan per token and less degradation at long context.
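A back-of-the-envelope estimate shows why the attention-layer count matters. The sketch assumes only attention layers keep a per-token KV cache. The head count, head dimension, and bytes per value below are placeholder values chosen for illustration, not the real configuration of either model; only the layer counts come from the list above.

```python
def kv_cache_bytes(attn_layers, context_tokens, kv_heads, head_dim, bytes_per_value):
    """Approximate KV cache size: keys plus values for every attention layer."""
    return 2 * attn_layers * context_tokens * kv_heads * head_dim * bytes_per_value


# Placeholder shape parameters (hypothetical, for illustration only).
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # e.g. 16-bit cache entries
CONTEXT = 128 * 1024

# 10 of 40 layers use attention in the hybrid; compare with a hypothetical
# model of the same shape where all 40 layers use attention.
hybrid = kv_cache_bytes(10, CONTEXT, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
full = kv_cache_bytes(40, CONTEXT, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)

if __name__ == "__main__":
    print(f"hybrid: {hybrid / 2**30:.2f} GiB, all-attention: {full / 2**30:.2f} GiB")
    print(f"ratio: {full / hybrid:.1f}x")
```

Whatever the real head shapes are, the ratio depends only on the layer counts, so a 10-of-40 hybrid scans a quarter of the cache an all-attention model of the same shape would.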
Recent Improvements
The MLX ecosystem has three layers that have seen rapid development. MLX core received a thread safety overhaul (per-thread M... [source text truncated]. Combined with continuous batching and prefix cache, the 122B now serves coding agents interactively at context lengths that were previously impractical.
📖 Read the full source: r/LocalLLaMA