Qwen 35B-A3B as always-on agent on 16GB M4 Mac: disk I/O fails before RAM

Running a Qwen 35B-A3B MoE model as an always-on agent on a 16GB M4 Mac Mini (basic spec) seemed plausible on paper: with llama.cpp --mmap and --flash-attn, the IQ3_XXS quant (12GB on disk) keeps RAM resident at 4–6GB via expert paging, delivering ~17 tok/s with --threads 8 --ctx-size 4096. As a batch tool, it works on this box. But scaling to a continuous agentic loop, sitting alongside Claude Code (Opus/Sonnet) and Codex CLI, collapsed — and the bottleneck was disk, not RAM.
The setup that broke
- Ollama daemon serving
qwen3.5:9b+qwen3.5:4b(config:OLLAMA_MAX_LOADED_MODELS=2,OLLAMA_KEEP_ALIVE=10m,OLLAMA_FLASH_ATTENTION=1,OLLAMA_KV_CACHE_TYPE=q8_0) llama-serverfor the 35B on its own port- LiteLLM bridge proxying everything as a Claude-compatible endpoint on
:4000 - One or two Claude Code sessions
- Codex CLI session
- Usual home-server cron, watchers, mail queue
What failed
Continuous mmap paging from the 35B + Claude Code's file-watcher/indexer + Codex holding context = constant SSD contention. The Mac started rebooting spontaneously (no crash logs in log show --predicate 'eventMessage CONTAINS "panic"'), background cron jobs missed windows by 5+ minutes, then quietly failed. Known issues: Claude Code and Codex CLIs have open bugs for memory growth in long sessions (#22968), idle CPU pegging (#19393), and accumulating processes (#11122). With one harness it's invisible; with two plus a paging 35B doing real loops, disk dies first.
Stable workaround
- 35B
llama-serverLaunchDaemon disabled (plist renamed.disabled) - 24GB reclaimed by deleting the 35B GGUF and an old 26B Gemma
- All Anthropic-shaped routes go to Ollama:
qwen3.5:9bfor opus/sonnet,qwen3.5:4bfor haiku - Both Metal-resident via Ollama (~3GB GPU + 0.5GB CPU each), evict cleanly on idle
- LiteLLM moved to a proper user LaunchAgent (
KeepAlive=true,ThrottleInterval=30) — it had been a barepython -m litellmprocess for 7 days
The takeaway
The 35B-A3B-as-agent-loop dream is alive on a different class of box. On unified 16GB, it's a single-purpose batch tool, not an always-on layer. The author estimates 32GB unified memory minimum for sustained MoE agent inference without swap pain or daemon contention.
If you've got a trick for running it sustainably on 16GB without disk contention, the thread on r/LocalLLaMA is still active.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Anthropic Doubles Claude Code Rate Limits, Signs Compute Deal with SpaceX
Claude Code five-hour rate limits doubled for Pro/Max/Team/Enterprise plans, peak-hour reductions removed, and API rate limits raised for Opus models. SpaceX Colossus 1 adds 300+ MW capacity (220k NVIDIA GPUs) within a month.

Developer Prefers Qwen3.5-27B Over Proprietary Models for Its Failure Mode
A developer on r/LocalLLaMA reports preferring Qwen3.5-27B over Gemini 3.1 Pro and GPT-5.3 Codex because it gives up on problematic tasks rather than generating potentially dangerous code like unrestricted Perl or NodeJS scripts.

Sora AI Video Economics: $20 User Costs OpenAI $65 in Compute
OpenAI's Sora AI video generation app reportedly costs $65 in compute per $20/month user, with peak inference costs estimated at $15 million daily versus $2.1 million total lifetime revenue.

Merlin Research releases Qwen3.5-4B-Safety-Thinking model for structured reasoning
Merlin Research has released Qwen3.5-4B-Safety-Thinking, a 4 billion parameter safety-aligned reasoning model built on Qwen3.5. The model is designed for structured 'thinking' and safety in real-world scenarios including agent systems.