M5 Max vs M3 Max Inference Benchmarks for Qwen Models on oMLX

By OpenClawRadar. Published: March 28, 2026

Reddit user /u/onil_gova ran inference benchmarks comparing 16-inch MacBook Pros with M5 Max and M3 Max processors, both equipped with 40 GPU cores and 128GB unified memory. The tests used oMLX v0.2.23 and three Qwen 3.5 models: the 122B-A10B MoE, 35B-A3B MoE, and 27B dense.

Benchmark Results

At pp1024/tg128 (a 1,024-token prompt followed by 128 generated tokens), the M5 Max generated tokens 1.4x to 1.7x faster than the M3 Max. In each line below, the M5 Max figure is listed first:

  • 35B-A3B MoE: 134.5 vs 80.3 tg tok/s (1.7x faster)
  • 122B-A10B MoE: 65.3 vs 46.1 tg tok/s (1.4x faster)
  • 27B dense: 32.8 vs 23.0 tg tok/s (1.4x faster)
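
The speedup figures in the list above can be reproduced from the quoted tok/s values. The following is a minimal sketch; the dictionary layout and the `speedup` helper are illustrative names, not part of the original benchmark tooling.

```python
# Token-generation throughput (tok/s) at pp1024/tg128, as quoted in the article.
RESULTS = {
    "35B-A3B MoE": {"m5_max": 134.5, "m3_max": 80.3},
    "122B-A10B MoE": {"m5_max": 65.3, "m3_max": 46.1},
    "27B dense": {"m5_max": 32.8, "m3_max": 23.0},
}


def speedup(m5_tok_s: float, m3_tok_s: float) -> float:
    """Return how many times faster the M5 Max is than the M3 Max."""
    return m5_tok_s / m3_tok_s


speedups = {
    name: speedup(r["m5_max"], r["m3_max"]) for name, r in RESULTS.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster on M5 Max")
```

Rounded to one decimal place, these ratios match the 1.7x, 1.4x, and 1.4x figures reported in the list.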

The gap widens with longer contexts. At a 65K context length, the 27B dense model dropped to 6.8 tg tok/s on the M3 Max versus 19.6 tg tok/s on the M5 Max, a 2.9x difference compared with 1.4x at pp1024/tg128.
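
One way to read the long-context numbers is to ask how much of its short-context generation speed each machine retains at 65K. The sketch below uses only the 27B dense figures quoted in this article; it assumes the pp1024/tg128 and 65K results are directly comparable, and `retained_fraction` is an illustrative helper name.

```python
# 27B dense token-generation throughput (tok/s), as quoted in the article.
SHORT_CONTEXT = {"m5_max": 32.8, "m3_max": 23.0}  # pp1024/tg128
LONG_CONTEXT = {"m5_max": 19.6, "m3_max": 6.8}    # 65K context


def retained_fraction(short_tok_s: float, long_tok_s: float) -> float:
    """Fraction of short-context generation speed still available at long context."""
    return long_tok_s / short_tok_s


retained = {
    chip: retained_fraction(SHORT_CONTEXT[chip], LONG_CONTEXT[chip])
    for chip in SHORT_CONTEXT
}

# Ratio of M5 Max to M3 Max generation speed at 65K context.
long_context_speedup = LONG_CONTEXT["m5_max"] / LONG_CONTEXT["m3_max"]

for chip, fraction in retained.items():
    print(f"{chip}: retains {fraction:.0%} of short-context speed")
print(f"M5 Max vs M3 Max at 65K: {long_context_speedup:.1f}x")
```

On these figures the M5 Max keeps roughly 60% of its short-context speed at 65K, while the M3 Max keeps roughly 30%, which is where the wider gap comes from.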


Prefill and Batching Performance

The prefill (prompt processing) advantage was even larger: the M5 Max was up to 4x faster at long context lengths, which is attributed to the Neural Accelerators in the M5 Max's GPU.

Batching behavior also differed between the two chips, which matters for agentic workloads:

  • M5 Max scaled to 2.54x throughput at 4x batch size on the 35B-A3B model
  • M3 Max batching degraded throughput in at least one case (0.80x at 2x batch on the 122B model)
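
To put the batching figures on a common scale, the throughput multiplier can be divided by the batch size. This sketch assumes the quoted multipliers (2.54x, 0.80x) are aggregate throughput relative to a batch size of 1; the function name is illustrative.

```python
def batch_scaling_efficiency(throughput_scale: float, batch_size: int) -> float:
    """Fraction of ideal linear scaling achieved.

    Equivalently, each request's speed relative to running alone.
    Assumes throughput_scale is aggregate throughput relative to batch size 1.
    """
    return throughput_scale / batch_size


# Figures quoted in the article.
m5_efficiency = batch_scaling_efficiency(2.54, 4)  # M5 Max, 35B-A3B, 4x batch
m3_efficiency = batch_scaling_efficiency(0.80, 2)  # M3 Max, 122B, 2x batch

print(f"M5 Max 35B-A3B at 4x batch: {m5_efficiency:.1%} of linear scaling")
print(f"M3 Max 122B at 2x batch: {m3_efficiency:.1%} of linear scaling")
```

A multiplier below 1.0, as in the M3 Max case, means the batched run produced fewer total tokens per second than a single request would have.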

The memory bandwidth gap (614 GB/s on M5 Max vs 400 GB/s on M3 Max, about 1.5x) is relevant for multi-step agent loops or parallel tool calls.
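
Token generation is commonly limited by memory bandwidth, so the bandwidth ratio gives a rough first-order expectation for the generation speedup. The sketch below compares that ratio with the speedups quoted earlier in this article; treating bandwidth as the only factor is a simplifying assumption, not a claim made by the benchmark author.

```python
# Memory bandwidth figures quoted in the article (GB/s).
M5_MAX_BANDWIDTH_GB_S = 614.0
M3_MAX_BANDWIDTH_GB_S = 400.0

bandwidth_ratio = M5_MAX_BANDWIDTH_GB_S / M3_MAX_BANDWIDTH_GB_S

# Observed generation speedups at pp1024/tg128 (M5 Max tok/s / M3 Max tok/s).
OBSERVED_TG_SPEEDUP = {
    "35B-A3B MoE": 134.5 / 80.3,
    "122B-A10B MoE": 65.3 / 46.1,
    "27B dense": 32.8 / 23.0,
}

# Observed speedup expressed relative to the bandwidth ratio.
relative = {
    name: observed / bandwidth_ratio
    for name, observed in OBSERVED_TG_SPEEDUP.items()
}

print(f"Bandwidth ratio: {bandwidth_ratio:.3f}x")
for name, value in relative.items():
    print(f"{name}: observed speedup is {value:.2f}x the bandwidth ratio")
```

Values above 1.0 indicate a gain larger than the bandwidth ratio alone would predict, and values below 1.0 indicate a smaller one.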

MoE Efficiency Insights

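The benchmarks showed that the 122B-A10B MoE model (10B active parameters) generates faster than the 27B dense model on both machines: 65.3 vs 32.8 tg tok/s on the M5 Max and 46.1 vs 23.0 tg tok/s on the M3 Max at pp1024/tg128. This suggests that generation speed depends more on the number of parameters active per token than on total model size.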
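
As a rough check on this, the observed speed ratio between the two models can be compared with the inverse ratio of their active parameter counts. This sketch assumes "A10B" means 10B parameters active per token (as stated above) and that all 27B parameters of the dense model are active; the variable names are illustrative.

```python
# Active parameters per token, in billions.
ACTIVE_PARAMS_B = {"122B-A10B MoE": 10.0, "27B dense": 27.0}

# Token-generation throughput (tok/s) at pp1024/tg128, as quoted in the article.
TG_TOK_S = {
    "m5_max": {"122B-A10B MoE": 65.3, "27B dense": 32.8},
    "m3_max": {"122B-A10B MoE": 46.1, "27B dense": 23.0},
}

# Speed ratio expected if throughput scaled purely with 1 / active parameters.
ideal_ratio = ACTIVE_PARAMS_B["27B dense"] / ACTIVE_PARAMS_B["122B-A10B MoE"]

# Observed speed of the MoE model relative to the dense model on each machine.
observed_ratio = {
    chip: speeds["122B-A10B MoE"] / speeds["27B dense"]
    for chip, speeds in TG_TOK_S.items()
}

print(f"Ideal (active-parameter) ratio: {ideal_ratio:.1f}x")
for chip, ratio in observed_ratio.items():
    print(f"{chip}: MoE is {ratio:.2f}x faster than dense")
```

On both machines the MoE model is about 2x faster than the dense model, somewhat less than the 2.7x an active-parameter-only model would predict, so active parameter count looks like the main driver without being the only one.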

The full interactive breakdown with all charts and data is available at: https://claude.ai/public/artifacts/c9fba245-e734-4b3b-be44-a6cabdec6f8f

📖 Read the full source: r/LocalLLaMA
