Qwen3.5-397B MoE Runs on 14GB RAM via Paged Expert Loading on M1 Ultra

✍️ OpenClawRadar📅 Published: May 7, 2026🔗 Source
Qwen3.5-397B MoE Runs on 14GB RAM via Paged Expert Loading on M1 Ultra
Ad

A Reddit post by u/ur_dad_matt (via Claude) demonstrates a custom paged MoE engine that runs Qwen3.5-397B-A17B (209GB on disk, 512 experts, top-10 routing) on an M1 Ultra 64GB Mac Studio with only 14GB peak RAM and 1.59 tok/s inference speed. The model is too large to load naively; the engine keeps only K=20 experts resident in RAM, lazy-paging the rest from SSD on router demand, evicting under cache pressure. Compute uses Float16 (faster than ternary on MPS), Apple Silicon native, MLX-based.

Benchmark results from a 5-prompt sweep on M1 Ultra 64GB:

  • Speed: 1.59 tok/s (mean across 5 coherent generations, K=20)
  • Cache RSS peak (generation): 7.91 GB
  • Total RSS peak: 14.04 GB
  • Coherent outputs: 5/5

Optimal engine config: K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, lazy_load=True. Initial attempts with all experts on disk caused command-buffer allocation failures until cache size was tuned.

The author argues that raw score benchmarks miss the point for local LLMs on 64GB hardware; the key metric is MMLU per GB RAM. At 1.59 tok/s the model runs at "thinking pace" not chat pace, demonstrating the upper bound of model-to-memory ratio.

Ad

Speeds for smaller quantized models on same hardware (MLX-4bit):

  • 4B Nano: 71.7 tok/s
  • 9B Lite: 53.4 tok/s
  • 26B-A4B Quick: 14.6 tok/s
  • 27B Core: 40.7 tok/s (MMLU 0.851 n=14042 σ=0.003, HumanEval 0.866 n=164 σ=0.027)
  • 35B-A3B Vision: 64.1 tok/s
  • 397B Plus: 1.59 tok/s

The runtime is built with Tauri + Rust + MLX for macOS. Free tiers (Nano and Lite) are available forever at outlier.host. A video demo is included in the Reddit post.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also