Qwen 35B-A3B Agent on 16GB M4 Mac: SSD I/O Bottleneck vs RAM

Running a Qwen 35B-A3B MoE model as an always-on agent on a base-spec 16GB M4 Mac Mini seemed plausible on paper. With llama.cpp's --mmap and --flash-attn, the IQ3_XXS quant (12GB on disk) keeps only 4–6GB resident in RAM by paging experts in on demand, and delivers ~17 tok/s with --threads 8 --ctx-size 4096. As a batch tool, it works on this box. Scaling it to a continuous agentic loop running alongside Claude Code (Opus/Sonnet) and Codex CLI is where it collapsed, and the bottleneck was disk, not RAM.

The setup that broke

Ollama daemon serving qwen3.5:9b + qwen3.5:4b (config: OLLAMA_MAX_LOADED_MODELS=2, OLLAMA_KEEP_ALIVE=10m, OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0)
llama-server for the 35B on its own port
LiteLLM bridge proxying everything as a Claude-compatible endpoint on :4000
One or two Claude Code sessions
A Codex CLI session
The usual home-server cron jobs, watchers, and mail queue
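
As a minimal sketch, the Ollama settings quoted in the list above can be collected into an environment for the daemon. The helper name ollama_env is hypothetical, and launching through subprocess is an assumption; the original post does not say how the daemon was started.

```python
import os

# Settings quoted in the setup list above.
OLLAMA_SETTINGS = {
    "OLLAMA_MAX_LOADED_MODELS": "2",
    "OLLAMA_KEEP_ALIVE": "10m",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def ollama_env(base=None):
    """Return a copy of the environment with the Ollama settings applied.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_SETTINGS)
    return env


# Possible usage (assumption, not run here):
#   subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

Keeping the settings in one mapping makes it easy to see at a glance what the daemon was told, which matters when several services compete for the same memory.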

What failed

Continuous mmap paging from the 35B, Claude Code's file watcher and indexer, and Codex holding context added up to constant SSD contention. The Mac started rebooting spontaneously, with no crash logs turning up in log show --predicate 'eventMessage CONTAINS "panic"'. Background cron jobs missed their windows by 5+ minutes, then quietly failed. Related known issues: the Claude Code and Codex CLIs have open bugs for memory growth in long sessions (#22968), idle CPU pegging (#19393), and accumulating processes (#11122). With one harness the problem is invisible; with two, plus a paging 35B running real loops, the disk gives out first.
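
One way to make the "quietly failed" symptom visible is to compare when jobs were due with when they actually started. The sketch below is hypothetical and not from the original post: the function name classify_runs and the idea of recording start times in a dict are assumptions, and only the 5-minute threshold comes from the text above.

```python
from datetime import datetime, timedelta

# The post reports jobs missing their windows by 5+ minutes.
LATE_THRESHOLD = timedelta(minutes=5)


def classify_runs(scheduled, started):
    """Label each scheduled run as "ok", "late", or "missing".

    scheduled: list of datetimes at which a job was due.
    started:   dict mapping a due datetime to the actual start time;
               a due time with no entry means the job never ran.
    """
    report = []
    for due in scheduled:
        actual = started.get(due)
        if actual is None:
            status = "missing"
        elif actual - due >= LATE_THRESHOLD:
            status = "late"
        else:
            status = "ok"
        report.append((due, status))
    return report
```

A run that started 7 minutes after its due time is reported as "late", and a due time with no recorded start is reported as "missing", which are the two stages of failure described above.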

Stable workaround

35B llama-server LaunchDaemon disabled (plist renamed .disabled)
24GB reclaimed by deleting the 35B GGUF and an old 26B Gemma
All Anthropic-shaped routes go to Ollama: qwen3.5:9b for opus/sonnet, qwen3.5:4b for haiku
Both stay Metal-resident via Ollama (~3GB GPU + 0.5GB CPU each) and evict cleanly on idle
LiteLLM moved to a proper user LaunchAgent (KeepAlive=true, ThrottleInterval=30); it had been running as a bare python -m litellm process for 7 days
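
A LaunchAgent with the two keys named above could be generated roughly as follows. This is a sketch under assumptions: the label com.example.litellm and the interpreter path are placeholders, the --port 4000 argument is inferred from the bridge port mentioned earlier, and the author's actual plist is not shown in the post.

```python
import plistlib


def litellm_launch_agent(label="com.example.litellm",
                         python_path="/usr/bin/python3",
                         port=4000):
    """Build a launchd job description for the LiteLLM bridge.

    label and python_path are placeholders (assumptions); point
    python_path at the interpreter that actually has litellm installed.
    """
    return {
        "Label": label,
        "ProgramArguments": [python_path, "-m", "litellm", "--port", str(port)],
        "RunAtLoad": True,       # start when the agent is loaded
        "KeepAlive": True,       # restart the process if it exits
        "ThrottleInterval": 30,  # wait at least 30 s between restarts
    }


def render_plist(agent):
    """Serialize the job description to XML plist text."""
    return plistlib.dumps(agent).decode("utf-8")
```

The rendered text would typically be saved under ~/Library/LaunchAgents/ with a .plist extension and loaded with launchctl. KeepAlive together with ThrottleInterval gives automatic restarts without a tight crash loop, which a bare background process does not provide.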

The takeaway

The dream of a 35B-A3B agent loop is still alive, just on a different class of box. On 16GB of unified memory, it's a single-purpose batch tool, not an always-on layer. The author estimates a minimum of 32GB of unified memory for sustained MoE agent inference without swap pain or daemon contention.
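
The figures quoted in this write-up can be put into a rough budget. The sketch below uses only numbers from the text (16GB total, a 12GB GGUF with 4–6GB resident, about 3.5GB per Ollama model) and assumes, as a simplification, that both Ollama models were loaded alongside the 35B with the footprint reported for the workaround. The footprints of macOS, Claude Code, and Codex are not given in the source, so they are left as whatever must fit in the remaining headroom.

```python
TOTAL_GB = 16.0              # unified memory on the Mac Mini
MODEL_FILE_GB = 12.0         # IQ3_XXS GGUF size on disk
RESIDENT_35B_GB = (4.0, 6.0) # reported resident range for the 35B
OLLAMA_MODEL_GB = 3.0 + 0.5  # ~3GB GPU + 0.5GB CPU per Ollama model
OLLAMA_MODELS = 2            # assumption: both loaded at the same time


def headroom_gb(resident_35b):
    """Memory left for macOS, the CLIs, and the file cache."""
    return TOTAL_GB - resident_35b - OLLAMA_MODELS * OLLAMA_MODEL_GB


def non_resident_gb(resident_35b):
    """Part of the GGUF that has to be read from SSD when it is touched."""
    return MODEL_FILE_GB - resident_35b


for resident in RESIDENT_35B_GB:
    print(
        f"35B resident {resident:.0f}GB: "
        f"headroom {headroom_gb(resident):.1f}GB, "
        f"non-resident weights {non_resident_gb(resident):.1f}GB"
    )
```

Under these assumptions the headroom is 3 to 5GB while 6 to 8GB of weights live only on disk, which is consistent with the observation that the SSD, not RAM capacity as such, became the contended resource.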

If you've got a trick for running it sustainably on 16GB without disk contention, the thread on r/LocalLLaMA is still active.

📖 Read the full source: r/LocalLLaMA