M5 Max vs M3 Max Inference Benchmarks for Qwen Models on oMLX

Reddit user /u/onil_gova ran inference benchmarks comparing 16-inch MacBook Pros with M5 Max and M3 Max processors, both equipped with 40 GPU cores and 128GB unified memory. The tests used oMLX v0.2.23 and three Qwen 3.5 models: the 122B-A10B MoE, 35B-A3B MoE, and 27B dense.
Benchmark Results
At pp1024/tg128 (1,024 prompt tokens processed, 128 tokens generated), the M5 Max generated tokens 1.4x to 1.7x faster than the M3 Max. Token generation (tg) rates below are listed as M5 Max vs M3 Max:
- 35B-A3B MoE: 134.5 vs 80.3 tg tok/s (1.7x faster)
- 122B-A10B MoE: 65.3 vs 46.1 tg tok/s (1.4x faster)
- 27B dense: 32.8 vs 23.0 tg tok/s (1.4x faster)
The performance gap widens with longer contexts. At 65K context length, the 27B dense model dropped to 6.8 tg tok/s on M3 Max versus 19.6 tg tok/s on M5 Max (2.9x difference).
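The speedup figures above are simple ratios of the reported token generation rates. As a quick check, the following sketch recomputes them from the numbers quoted in this article; the dictionary keys and the `speedup` helper are illustrative names, not part of oMLX or the original post.

```python
# Reported tg tok/s as (M5 Max, M3 Max), copied from the figures above.
REPORTED = {
    "35B-A3B MoE @ pp1024/tg128": (134.5, 80.3),
    "122B-A10B MoE @ pp1024/tg128": (65.3, 46.1),
    "27B dense @ pp1024/tg128": (32.8, 23.0),
    "27B dense @ 65K context": (19.6, 6.8),
}


def speedup(m5: float, m3: float) -> float:
    """Return how many times faster the M5 Max rate is than the M3 Max rate."""
    return m5 / m3


# Round to one decimal place, matching the precision used in the article.
SPEEDUPS = {name: round(speedup(m5, m3), 1) for name, (m5, m3) in REPORTED.items()}

for name, ratio in SPEEDUPS.items():
    print(f"{name}: {ratio}x")
```

The 27B dense model goes from roughly 1.4x at short context to 2.9x at 65K context, which is the widening gap described above.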
Prefill and Batching Performance
Prefill advantages were even larger: the M5 Max was up to 4x faster at long context lengths, which the post attributes to the M5 Max's GPU Neural Accelerators.
Batching performance showed important differences for agentic workloads:
- M5 Max scaled to 2.54x throughput at 4x batch size on the 35B-A3B model
- M3 Max batching on dense models degraded performance (0.80x at 2x batch on the 122B model)
The bandwidth difference (614 GB/s on M5 Max vs 400 GB/s on M3 Max) is significant for multi-step agent loops or parallel tool calls.
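To see what the batching multipliers mean for an agent loop, it can help to convert them into scaling efficiency and relative wall-clock time. The sketch below does that with the two figures quoted above. It assumes all requests in a batch are the same length and that the multiplier is total throughput relative to a single request; both function names are hypothetical helpers, not oMLX APIs.

```python
def scaling_efficiency(throughput_multiplier: float, batch_size: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfectly linear).

    This is also each request's speed relative to running alone.
    """
    return throughput_multiplier / batch_size


def batched_time(n_requests: int, throughput_multiplier: float) -> float:
    """Time to finish n equal requests in one batch, in units of one solo request.

    Assumption: running them one after another would take n_requests units.
    """
    return n_requests / throughput_multiplier


# M5 Max, 35B-A3B, 4x batch at 2.54x throughput (figure from the article).
print(scaling_efficiency(2.54, 4))  # about 0.635 of linear scaling
print(batched_time(4, 2.54))        # about 1.57 units vs 4 units sequentially

# M3 Max, 122B model, 2x batch at 0.80x throughput (figure from the article).
print(scaling_efficiency(0.80, 2))  # about 0.40 of linear scaling
print(batched_time(2, 0.80))        # about 2.5 units vs 2 units sequentially
```

Under these assumptions, four parallel tool calls on the M5 Max finish in well under half the sequential time, while batching two requests on the M3 Max at 0.80x is slower than running them back to back.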
MoE Efficiency Insights
The benchmarks show the 122B model (with 10B active parameters) generating faster than the 27B dense model on both machines: 65.3 vs 32.8 tg tok/s on M5 Max and 46.1 vs 23.0 tg tok/s on M3 Max at pp1024/tg128. This suggests that generation speed tracks the active parameter count per token far more closely than total model size.
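One common way to reason about this is a memory-bandwidth bound: each generated token needs the active weights read from memory once, so the rate cannot exceed bandwidth divided by the bytes of active weights. The sketch below applies that rule of thumb to the bandwidth figures quoted above. The 4-bit weight size (0.5 bytes per parameter) is an assumption, since the article does not state the quantization used, and the estimate ignores KV cache reads and other overhead, so treat the outputs as rough ceilings rather than predictions.

```python
# ASSUMPTION: 4-bit quantized weights; the article does not state the quantization.
ASSUMED_BYTES_PER_PARAM = 0.5


def upper_bound_tok_per_s(
    bandwidth_gb_s: float,
    active_params_billions: float,
    bytes_per_param: float = ASSUMED_BYTES_PER_PARAM,
) -> float:
    """Rough ceiling on decode speed if every active weight is read once per token.

    Ignores KV cache traffic, routing overhead, and compute limits.
    """
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures (GB/s) are the ones quoted in the article.
for machine, bandwidth in (("M5 Max", 614.0), ("M3 Max", 400.0)):
    moe = upper_bound_tok_per_s(bandwidth, 10.0)    # 122B-A10B: 10B active
    dense = upper_bound_tok_per_s(bandwidth, 27.0)  # 27B dense: all 27B active
    print(f"{machine}: 122B-A10B ceiling {moe:.1f} tok/s, 27B dense ceiling {dense:.1f} tok/s")
```

Whatever quantization was actually used, the ordering holds as long as both models use the same one: fewer active bytes per token gives a higher ceiling, which matches the 122B-A10B model outpacing the 27B dense model.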
The full interactive breakdown with all charts and data is available at: https://claude.ai/public/artifacts/c9fba245-e734-4b3b-be44-a6cabdec6f8f
📖 Read the full source: r/LocalLLaMA