Qwen3.5-397B Hits 20.34 tok/s on M5 Max via SSD Streaming

Hardware and Model Configuration

The experiment was conducted on a MacBook Pro M5 Max with 128GB unified memory and a 40-core GPU. The model used was Qwen3.5-397B-A17B with Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS mixed precision), Q8_0 embedding, and Q6_K LM head. The model occupies 209GB on disk—4x larger than the available RAM—requiring everything to stream from SSD.

Performance Results

Decode speed reached 20.34 tok/s with prefill at 5.52 tok/s. This represents a 2x improvement over the M5 Max starting point of 10.61 tok/s and a 4.67x improvement over Dan Woods' original baseline of 4.36 tok/s on M3 Max hardware.

Methodology

The researcher used the autoresearch loop methodology from Dan Woods' flash-moe project, running it with Claude Code (Anthropic) to systematically execute and evaluate 36 experiments. Each experiment was logged with results before proceeding, with automatic quality gating via perplexity thresholds to catch regressions. Human-AI collaboration involved the researcher directing the research and making scientific decisions while Claude Code implemented and benchmarked under direction.

Technical Foundation

The work builds on Dan Woods' original flash-moe paper and Anemll's fork, which is a pure C/Metal inference engine for running Qwen3.5-397B via SSD streaming on Apple Silicon. The Anemll fork added Q3-GGUF expert support essential to these results, with the researcher adding further Metal-level optimizations.

Effective Optimizations

16 IO threads + cache-io-split=4: Instead of reading each expert weight file as one sequential chunk, split into 4 parallel page-aligned reads hitting different SSD channels simultaneously. +1.5 tok/s
Temporal expert prediction: Discovered 27% cross-token routing correlation, overlapping SSD reads with GPU compute. +4.3 tok/s
Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS): Smaller payload with Q3 as the sweet spot. Better perplexity than 4-bit (5.58 vs 5.62) while being 23% smaller. +2.3 tok/s
CMD2 pre-encode: Eliminate 30μs per-layer submission gap. +0.44 tok/s
Fused Q/K/V projection kernel: Read input vector once instead of three times (Metal GPU optimization). +0.76 tok/s
CMD2 pre-encode extended to all full-attention layers: +0.47 tok/s

Note: Gains are not perfectly additive since some optimizations interact with each other.

Failed Approaches

The research had a 78% discard rate. Failed approaches included: 1-bit QJL quantization (perplexity 5647, catastrophic), ternary 2-bit with 84% weight sparsity (model collapsed), K=3 expert routing (quality collapse), cross-layer prediction (0% hit rate), NAX offloading (tile padding overhead cancelled gains), and 2-bit MLX experts (faster in isolation but worse perplexity and no speed advantage once temporal prediction was applied to Q3).

Limitations and Future Work

The research is limited to a single hardware platform, so results may not generalize. Q3 quantization at this scale degrades noticeably on long-form generation, producing artifacts on longer responses despite acceptable quality for short tasks. Quality was evaluated via perplexity only, not standardized benchmarks like MMLU or GPQA. This is a speed research project, not a production quality claim.

One surprising finding: Apple's Neural Engine (ANE) was completely idle during inference, drawing 0W despite offering 38 TOPS of compute. The problem is that MoE inference needs to decide which experts to activate dynamically, while ANE only works with static pre-compiled graphs. There may be an opportunity for batch prefill.

📖 Read the full source: r/LocalLLaMA