Flash-MOE Benchmark on M5 Max: 12.99 tok/s with Qwen3.5-397B

Performance Results
A user benchmarked the flash-moe implementation on an M5 Max MacBook Pro with 128GB unified memory, running the mlx-community/Qwen3.5-397B-A17B-4bit model. The original benchmark by Dan Woods, on an M3 Max with 48GB RAM, achieved 4.36 tokens per second. On the M5 Max, the baseline configuration (4-bit quantization, no cache-io-split) reached 12.48 tok/s. With --cache-io-split 4, throughput rose to 12.99 tok/s, roughly three times the original result (about 2.98x).
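The speedup figures can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted above; the variable and function names are illustrative and not part of flash-moe.

```python
# Throughput figures quoted in the post, in tokens per second.
M3_MAX_TOK_S = 4.36           # original benchmark (M3 Max, 48GB)
M5_MAX_BASELINE_TOK_S = 12.48  # M5 Max, no cache-io-split
M5_MAX_SPLIT4_TOK_S = 12.99    # M5 Max, --cache-io-split 4


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return how many times faster the new throughput is than the old one."""
    return new_tok_s / old_tok_s


baseline_vs_m3 = speedup(M5_MAX_BASELINE_TOK_S, M3_MAX_TOK_S)
split4_vs_m3 = speedup(M5_MAX_SPLIT4_TOK_S, M3_MAX_TOK_S)
# Gain of split 4 over the M5 Max baseline, as a percentage.
split4_gain_pct = (speedup(M5_MAX_SPLIT4_TOK_S, M5_MAX_BASELINE_TOK_S) - 1) * 100

print(f"baseline vs M3 Max: {baseline_vs_m3:.2f}x")
print(f"split 4 vs M3 Max:  {split4_vs_m3:.2f}x")
print(f"split 4 vs baseline: +{split4_gain_pct:.1f}%")
```

Most of the gain comes from the hardware change. The split setting itself adds only a few percent on top of the M5 Max baseline.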
Cache-IO-Split Analysis
The user ran a sweep of cache-io-split values using the Anemll fork of flash-moe, which adds Metal 4 NAX support for M5+ chips. Splits 2 and 3 reduce throughput, while split 4 gives the highest throughput:
- cache-io-split 1 (none): 12.48 tok/s, 28.4ms expert I/O per token
- cache-io-split 2: 9.94 tok/s, 28.2ms expert I/O per token
- cache-io-split 3: 9.99 tok/s, 36.1ms expert I/O per token
- cache-io-split 4: 12.99 tok/s, 25.9ms expert I/O per token
- cache-io-split 5: 12.64 tok/s, 27.5ms expert I/O per token
- cache-io-split 8: 12.90 tok/s, 26.4ms expert I/O per token
The user's interpretation is that split 4 aligns with the M5 Max SSD controller's internal parallelism, while higher values add scheduling overhead. The recommendation is to use --cache-io-split 4 or no split at all, and to avoid splits 2 and 3, which cut throughput by about 20% relative to the baseline.
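The sweep results above can be tabulated to make the comparison against the baseline explicit. The following is a minimal sketch that only re-derives numbers from the list; the dictionary layout and names are assumptions for illustration.

```python
# Sweep results from the post: split value -> (tok/s, expert I/O ms per token).
sweep = {
    1: (12.48, 28.4),  # split 1 means no split (baseline)
    2: (9.94, 28.2),
    3: (9.99, 36.1),
    4: (12.99, 25.9),
    5: (12.64, 27.5),
    8: (12.90, 26.4),
}

baseline_tok_s = sweep[1][0]

# Split value with the highest throughput.
best_split = max(sweep, key=lambda s: sweep[s][0])

# Throughput change relative to the baseline, in percent.
relative_pct = {
    split: (tok_s / baseline_tok_s - 1) * 100
    for split, (tok_s, _io_ms) in sweep.items()
}

# Splits that are slower than running with no split at all.
slower_than_baseline = sorted(s for s, pct in relative_pct.items() if pct < 0)

for split, (tok_s, io_ms) in sweep.items():
    print(f"split {split}: {tok_s:.2f} tok/s, {io_ms:.1f} ms I/O, "
          f"{relative_pct[split]:+.1f}% vs baseline")
print("best split:", best_split)
print("slower than baseline:", slower_than_baseline)
```

Note that split 2 reports slightly lower expert I/O time than the baseline yet much lower throughput, so the I/O figure alone does not explain the slowdown in these measurements.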
Quantization Comparison
Testing 2-bit versus 4-bit quantization showed that 2-bit offers no speed advantage on the M5 Max: the SSD is fast enough that smaller files do not help, and dequantization overhead cancels any gains. Quality also drops significantly with 2-bit:
- 4-bit: 12.99 tok/s, 3.64 perplexity on WikiText-2
- 2-bit: ~12.65 tok/s, 5.71 perplexity on WikiText-2 (57% higher, i.e. worse)
The conclusion is to use 4-bit quantization for better quality without sacrificing speed.
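The 57% figure follows directly from the two perplexity values. A short sketch of that calculation, using only the numbers reported above (names are illustrative):

```python
# Figures from the post: WikiText-2 perplexity and throughput per quantization.
ppl_4bit, tok_s_4bit = 3.64, 12.99
ppl_2bit, tok_s_2bit = 5.71, 12.65  # the 2-bit throughput is approximate in the source

# Relative perplexity increase going from 4-bit to 2-bit (lower perplexity is better).
ppl_increase_pct = (ppl_2bit / ppl_4bit - 1) * 100

# Relative throughput change going from 4-bit to 2-bit.
speed_delta_pct = (tok_s_2bit / tok_s_4bit - 1) * 100

print(f"perplexity: +{ppl_increase_pct:.0f}%")
print(f"throughput: {speed_delta_pct:+.1f}%")
```

In these measurements 2-bit is both slightly slower and substantially worse in perplexity, which is why the post recommends 4-bit.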
Technical Details
The benchmark used the Anemll fork, available at https://github.com/Anemll/flash-moe. Over a sustained 1000-token run, throughput held steady at 11.23 tok/s with no degradation. The user noted that background processes that use Metal or the GPU, such as LM Studio, can significantly reduce performance and should be closed during benchmarking.
📖 Read the full source: r/LocalLLaMA