Qwen3.5-397B MoE Runs on 14GB RAM via Paged Expert Loading on M1 Ultra

A Reddit post by u/ur_dad_matt (via Claude) demonstrates a custom paged MoE engine that runs Qwen3.5-397B-A17B (209 GB on disk, 512 experts, top-10 routing) on an M1 Ultra Mac Studio with 64 GB of memory, peaking at about 14 GB RSS and generating 1.59 tok/s. The model is too large to load naively, so the engine keeps only K=20 experts resident in RAM, lazily pages the rest in from SSD when the router selects them, and evicts experts under cache pressure. Compute runs in Float16, reported as faster than ternary on MPS, and the engine is Apple Silicon native and built on MLX.
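The paging behaviour described above (a fixed number of resident experts, load on router demand, evict under pressure) can be sketched as a small least-recently-used cache. This is only an illustration: the post does not say which eviction policy the engine uses, so LRU is an assumption, and `ExpertCache`, `load_fn`, and `fake_load` are hypothetical names rather than part of the author's runtime or of MLX.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used.

    Assumption: LRU eviction. The original post only says experts are
    evicted "under cache pressure".
    """

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: reads one expert's weights from SSD
        self._resident = OrderedDict()  # expert_id -> weights, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        """Return the expert's weights, paging them in on a miss."""
        if expert_id in self._resident:
            self.hits += 1
            self._resident.move_to_end(expert_id)  # mark as most recently used
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)          # lazy load on router demand
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)     # evict the least recently used
        return weights

    def resident_ids(self) -> list:
        """Expert ids currently in RAM, least recently used first."""
        return list(self._resident)


# Tiny demo with a stand-in loader that records which experts hit "disk".
loads = []

def fake_load(expert_id):
    loads.append(expert_id)
    return f"weights-{expert_id}"

cache = ExpertCache(capacity=2, load_fn=fake_load)
for expert_id in [1, 2, 1, 3, 2]:  # ids the router might pick, in order
    cache.get(expert_id)
```

With a capacity of 2, the third request for expert 1 is served from RAM, while the later request for expert 2 has to reload it because it was evicted to make room for expert 3. A real engine would presumably also bound the cache by bytes (the post mentions a `cache_gb` setting), which this sketch leaves out.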
Benchmark results from a 5-prompt sweep on M1 Ultra 64GB:
- Speed: 1.59 tok/s (mean across 5 coherent generations, K=20)
- Cache RSS peak (generation): 7.91 GB
- Total RSS peak: 14.04 GB
- Coherent outputs: 5/5
Optimal engine config: K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, lazy_load=True. Initial attempts with all experts on disk caused command-buffer allocation failures until cache size was tuned.
The author argues that raw benchmark scores miss the point for local LLMs on 64GB hardware; the key metric is MMLU per GB of RAM. At 1.59 tok/s the model runs at "thinking pace" rather than chat pace, and the run is presented as a demonstration of the upper bound of model-to-memory ratio.
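The figures quoted above allow a quick back-of-the-envelope check of the model-to-memory ratio and of what "thinking pace" means in wall-clock terms. The sketch below uses only numbers from the post; `score_per_gb` is one plausible reading of "MMLU per GB RAM" (score divided by peak RSS) and is an assumption, since the post does not define the metric. No value is computed for it here because the post gives no MMLU score for the 397B run.

```python
# Figures reported in the post for the 397B run on an M1 Ultra 64GB.
DISK_GB = 209.0        # model size on disk
PEAK_RSS_GB = 14.04    # total RSS peak
CACHE_PEAK_GB = 7.91   # expert-cache RSS peak during generation
TOK_PER_S = 1.59       # mean generation speed

# How many times larger the on-disk model is than the peak resident memory.
disk_to_ram = DISK_GB / PEAK_RSS_GB

# Peak memory not attributed to the expert cache.
non_cache_gb = PEAK_RSS_GB - CACHE_PEAK_GB


def seconds_for(n_tokens: int, tok_per_s: float = TOK_PER_S) -> float:
    """Wall-clock seconds to generate n_tokens at a steady rate."""
    return n_tokens / tok_per_s


def score_per_gb(score: float, peak_gb: float) -> float:
    """Assumed definition of 'MMLU per GB RAM': benchmark score / peak RSS."""
    return score / peak_gb


print(f"disk-to-RAM ratio: {disk_to_ram:.1f}x")
print(f"non-cache memory:  {non_cache_gb:.2f} GB")
print(f"100 tokens take:   {seconds_for(100):.0f} s")
```

On these numbers the model on disk is roughly 15 times larger than its peak memory footprint, and a 100-token reply takes about a minute.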
Speeds for the smaller quantized models on the same hardware (MLX 4-bit), with the 397B result listed last for comparison:
- 4B Nano: 71.7 tok/s
- 9B Lite: 53.4 tok/s
- 26B-A4B Quick: 14.6 tok/s
- 27B Core: 40.7 tok/s (MMLU 0.851 n=14042 σ=0.003, HumanEval 0.866 n=164 σ=0.027)
- 35B-A3B Vision: 64.1 tok/s
- 397B Plus: 1.59 tok/s
The runtime is built with Tauri + Rust + MLX for macOS. Free tiers (Nano and Lite) are available forever at outlier.host. A video demo is included in the Reddit post.
📖 Read the full source: r/LocalLLaMA