Autoresearch Pushes Qwen3.5-397B to 20.34 tok/s on M5 Max via SSD Streaming

Hardware and Model Configuration
The experiment ran on a MacBook Pro M5 Max with 128GB of unified memory and a 40-core GPU. The model was Qwen3.5-397B-A17B with Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS mixed precision), a Q8_0 embedding, and a Q6_K LM head. At 209GB on disk, the model is about 1.6x the size of the machine's 128GB of unified memory, so it cannot stay resident and its weights must stream from SSD.
Performance Results
Decode speed reached 20.34 tok/s, with prefill at 5.52 tok/s. That is a roughly 1.9x improvement over the M5 Max starting point of 10.61 tok/s and a 4.67x improvement over Dan Woods' original baseline of 4.36 tok/s on M3 Max hardware.
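The speedup ratios follow directly from the quoted throughput figures. The short calculation below only restates that arithmetic; the helper name `speedup` is introduced here for illustration.

```python
# Throughput figures quoted above, in tokens per second.
final_decode = 20.34
m5_start = 10.61
m3_baseline = 4.36


def speedup(new: float, old: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new / old


vs_m5 = speedup(final_decode, m5_start)
vs_m3 = speedup(final_decode, m3_baseline)
ms_per_token = 1000.0 / final_decode  # decode latency per generated token

print(f"vs M5 Max start:    {vs_m5:.2f}x")   # 1.92x
print(f"vs M3 Max baseline: {vs_m3:.2f}x")   # 4.67x
print(f"decode latency:     {ms_per_token:.1f} ms/token")  # 49.2 ms/token
```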
Methodology
The researcher used the autoresearch loop methodology from Dan Woods' flash-moe project, running it with Claude Code (Anthropic) to systematically execute and evaluate 36 experiments. Each experiment's results were logged before the next one started, and an automatic quality gate based on perplexity thresholds caught regressions. The researcher directed the work and made the scientific decisions; Claude Code implemented and benchmarked each change under that direction.
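The loop described above (run an experiment, log it, gate on perplexity, then move on) can be sketched as follows. This is a minimal illustration, not the flash-moe or Claude Code implementation: the names `Trial` and `run_experiments`, the 6.0 perplexity threshold, and the tok/s values in the sample run are all assumptions. Only the perplexities 5.58 and 5647 and the 10.61 tok/s starting point come from the article.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trial:
    name: str
    tok_per_s: float
    perplexity: float
    kept: bool


def run_experiments(
    candidates: List[Tuple[str, Callable[[], Tuple[float, float]]]],
    baseline_tok_per_s: float,
    max_perplexity: float,
) -> Tuple[float, List[Trial]]:
    """Try each candidate in order; keep it only if it is faster and passes the gate."""
    best = baseline_tok_per_s
    log: List[Trial] = []
    for name, benchmark in candidates:
        tok_per_s, perplexity = benchmark()
        passes_gate = perplexity <= max_perplexity  # quality gate
        kept = passes_gate and tok_per_s > best
        if kept:
            best = tok_per_s
        log.append(Trial(name, tok_per_s, perplexity, kept))  # log before moving on
    return best, log


# Illustrative run. The tok/s values and the threshold are hypothetical.
candidates = [
    ("q3_experts", lambda: (12.9, 5.58)),
    ("1bit_qjl", lambda: (15.0, 5647.0)),
]
best, log = run_experiments(candidates, baseline_tok_per_s=10.61, max_perplexity=6.0)
for trial in log:
    print(trial.name, "kept" if trial.kept else "discarded")
```

In this sketch a faster candidate is still discarded when its perplexity exceeds the threshold, which is the behaviour the quality gate is meant to guarantee.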
Technical Foundation
The work builds on Dan Woods' original flash-moe paper and on Anemll's fork, a pure C/Metal inference engine for running Qwen3.5-397B via SSD streaming on Apple Silicon. The Anemll fork added the Q3-GGUF expert support that these results depend on, and the researcher added further Metal-level optimizations on top.
Effective Optimizations
- 16 IO threads + cache-io-split=4: Instead of reading each expert weight file as one sequential chunk, the engine splits it into 4 parallel page-aligned reads that hit different SSD channels simultaneously. +1.5 tok/s
- Temporal expert prediction: Routing turned out to have a 27% cross-token correlation, which is used to overlap SSD reads with GPU compute. +4.3 tok/s
- Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS): Smaller payload with Q3 as the sweet spot. Better perplexity than 4-bit (5.58 vs 5.62) while being 23% smaller. +2.3 tok/s
- CMD2 pre-encode: Eliminates a 30μs per-layer submission gap. +0.44 tok/s
- Fused Q/K/V projection kernel: Read input vector once instead of three times (Metal GPU optimization). +0.76 tok/s
- CMD2 pre-encode extended to all full-attention layers: +0.47 tok/s
Note: Gains are not perfectly additive since some optimizations interact with each other.
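The first optimization in the list, splitting one file read into several page-aligned reads issued in parallel, can be sketched in Python with `os.pread` and a thread pool. This is an illustration of the idea only: the actual engine is C/Metal, the function names `split_ranges` and `read_split` are hypothetical, and `os.pread` is available on Unix-like systems only.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

PAGE = mmap.PAGESIZE  # OS page size; every chunk starts on a multiple of it


def split_ranges(size: int, parts: int, page: int = PAGE) -> List[Tuple[int, int]]:
    """Split [0, size) into at most `parts` (offset, length) ranges with page-aligned offsets."""
    if size == 0:
        return []
    pages = -(-size // page)            # ceiling division
    chunk = -(-pages // parts) * page   # pages per part, rounded up, in bytes
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def read_split(path: str, parts: int = 4) -> bytes:
    """Read a whole file as `parts` concurrent positional reads and reassemble it."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_range(rng: Tuple[int, int]) -> bytes:
            off, length = rng
            buf = bytearray()
            while len(buf) < length:    # pread may return fewer bytes than asked
                piece = os.pread(fd, length - len(buf), off + len(buf))
                if not piece:
                    break
                buf += piece
            return bytes(buf)

        with ThreadPoolExecutor(max_workers=parts) as pool:
            # map() preserves range order, so the chunks join back correctly
            return b"".join(pool.map(read_range, split_ranges(size, parts)))
    finally:
        os.close(fd)
```

Because `os.pread` takes an explicit offset, the threads can share one file descriptor without seeking, which is what makes the reads safe to issue concurrently.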
Failed Approaches
The research had a 78% discard rate. Failed approaches included:
- 1-bit QJL quantization: perplexity 5647, catastrophic
- Ternary 2-bit with 84% weight sparsity: model collapsed
- K=3 expert routing: quality collapse
- Cross-layer prediction: 0% hit rate
- NAX offloading: tile padding overhead cancelled the gains
- 2-bit MLX experts: faster in isolation, but worse perplexity and no speed advantage once temporal prediction was applied to Q3
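Both the 27% cross-token routing correlation behind temporal expert prediction and the 0% hit rate of cross-layer prediction are overlap measurements between two sets of selected experts. The source does not give its exact metric, so the definition below is an assumption: the fraction of experts chosen for a token that were also chosen for the previous token in the same layer. The function name and the expert IDs in the example are made up.

```python
from typing import Sequence, Set


def temporal_hit_rate(routing: Sequence[Set[int]]) -> float:
    """Fraction of expert selections at token t that were also selected at token t-1."""
    hits = 0
    total = 0
    for prev, cur in zip(routing, routing[1:]):
        hits += len(prev & cur)  # experts a "reuse last token" prefetch would have ready
        total += len(cur)
    return hits / total if total else 0.0


# Hypothetical routing decisions for one layer over three tokens (4 experts each).
routing = [{1, 2, 3, 4}, {2, 4, 7, 9}, {0, 5, 6, 8}]
rate = temporal_hit_rate(routing)
print(f"hit rate: {rate:.0%}")  # 25%
```

A predictor with a nonzero hit rate lets some SSD reads start before routing for the next token is known; at 0%, every prefetch is wasted bandwidth.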
Limitations and Future Work
The research is limited to a single hardware platform, so results may not generalize. Q3 quantization at this scale degrades noticeably on long-form generation, producing artifacts on longer responses despite acceptable quality for short tasks. Quality was evaluated via perplexity only, not standardized benchmarks like MMLU or GPQA. This is a speed research project, not a production quality claim.
One surprising finding: Apple's Neural Engine (ANE) sat completely idle during inference, drawing 0W despite offering 38 TOPS of compute. MoE inference has to decide dynamically which experts to activate, while the ANE only works with static, pre-compiled graphs. Batch prefill may still offer an opportunity to put the ANE to work.
📖 Read the full source: r/LocalLLaMA