Run 397B Qwen3.5 MoE on 14GB RAM: Paged Expert Loading

A Reddit post by u/ur_dad_matt (via Claude) demonstrates a custom paged MoE engine that runs Qwen3.5-397B-A17B (209GB on disk, 512 experts, top-10 routing) on an M1 Ultra Mac Studio with 64GB of RAM, peaking at only 14GB of RAM while generating 1.59 tok/s. The model is too large to load naively, so the engine keeps only K=20 experts resident in RAM, lazily pages the rest in from SSD when the router asks for them, and evicts experts under cache pressure. Compute runs in Float16 (reported as faster than ternary on MPS), and the engine is Apple Silicon native and MLX-based.
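The paging idea can be illustrated with a small least-recently-used cache that holds expert weights under a byte budget. This is a minimal sketch, not the author's engine: the class name `PagedExpertCache`, the `load_fn` callback, and the 4-byte stand-in "experts" are all hypothetical, and the post does not say that LRU is the eviction policy actually used.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class PagedExpertCache:
    """Illustrative LRU cache that keeps expert weights under a byte budget."""

    def __init__(self, budget_bytes: int, load_fn: Callable[[Hashable], bytes]):
        self.budget_bytes = budget_bytes
        self.load_fn = load_fn          # hypothetical: reads one expert from SSD
        self.resident = OrderedDict()   # expert_id -> weights, oldest first
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> bytes:
        # Cache hit: mark the expert as most recently used.
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]

        # Cache miss: page the expert in on demand.
        self.misses += 1
        weights = self.load_fn(expert_id)

        # Evict least recently used experts until the new one fits.
        while self.resident and self.used_bytes + len(weights) > self.budget_bytes:
            _, evicted = self.resident.popitem(last=False)
            self.used_bytes -= len(evicted)

        self.resident[expert_id] = weights
        self.used_bytes += len(weights)
        return weights


def fake_load(expert_id: Hashable) -> bytes:
    # Stand-in for an SSD read; every "expert" here is 4 bytes.
    return bytes(4)


# Budget fits three stand-in experts; the router asks for 0, 1, 2, 0, 3.
cache = PagedExpertCache(budget_bytes=12, load_fn=fake_load)
for eid in [0, 1, 2, 0, 3]:
    cache.get(eid)
```

In this toy run, expert 1 is the one evicted when expert 3 arrives, because expert 0 was touched again in between. A real engine would size the budget in gigabytes (the post reports cache_gb=8.0) and load tensors rather than byte strings.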

Benchmark results from a 5-prompt sweep on M1 Ultra 64GB:

Speed: 1.59 tok/s (mean across 5 coherent generations, K=20)
Cache RSS peak (generation): 7.91 GB
Total RSS peak: 14.04 GB
Coherent outputs: 5/5

Best-performing engine config reported: K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, lazy_load=True. Initial attempts with all experts on disk caused command-buffer allocation failures until the cache size was tuned.

The author argues that raw benchmark scores miss the point for local LLMs on 64GB hardware; the key metric is MMLU per GB of RAM. At 1.59 tok/s the model runs at "thinking pace", not chat pace, and is presented as a demonstration of the upper bound of the model-to-memory ratio.
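The model-to-memory ratio and the "thinking pace" claim can be made concrete with a little arithmetic on the figures quoted above. This is only a back-of-the-envelope sketch: the 200-token reply length is an assumption chosen for illustration, and no MMLU-per-GB value is computed because the post gives no MMLU score for the 397B model.

```python
# Figures quoted in the post.
disk_gb = 209.0        # model size on disk
peak_rss_gb = 14.04    # total RSS peak during the sweep
tok_per_s = 1.59       # mean generation speed at K=20

# Assumption for illustration only: a 200-token reply.
reply_tokens = 200

# How many times larger the model is on disk than the peak RAM it used.
ratio = disk_gb / peak_rss_gb

# Latency implied by the reported speed.
seconds_per_token = 1 / tok_per_s
seconds_per_reply = reply_tokens / tok_per_s

print(f"disk-to-RAM ratio: {ratio:.1f}x")
print(f"seconds per token: {seconds_per_token:.2f}")
print(f"seconds for {reply_tokens} tokens: {seconds_per_reply:.0f}")
```

Under these numbers the model is roughly 15 times larger on disk than its peak memory footprint, and a 200-token answer takes about two minutes.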

Speeds for the smaller quantized models on the same hardware (MLX-4bit), with the 397B result for comparison:

4B Nano: 71.7 tok/s
9B Lite: 53.4 tok/s
26B-A4B Quick: 14.6 tok/s
27B Core: 40.7 tok/s (MMLU 0.851 n=14042 σ=0.003, HumanEval 0.866 n=164 σ=0.027)
35B-A3B Vision: 64.1 tok/s
397B Plus: 1.59 tok/s

The runtime is built with Tauri + Rust + MLX for macOS. Free tiers (Nano and Lite) are available forever at outlier.host. A video demo is included in the Reddit post.

📖 Read the full source: r/LocalLLaMA
