FOMOE Enables 397B Qwen3.5 Model Inference on $2,100 Desktop Hardware

✍️ OpenClawRadar | 📅 Published: March 29, 2026

What FOMOE Solves

Large Mixture of Experts (MoE) models require hundreds of gigabytes of weight storage, which on consumer hardware typically means flash storage such as an NVMe SSD. During inference, each token needs only a small fraction of the weights, but which ones cannot be predicted ahead of time. The resulting random access pattern makes flash latency too high for practical inference on consumer hardware.

How FOMOE Works

The system makes most expert weight reads from flash unnecessary through several techniques:

  • Keeps the most commonly used experts in GPU memory (VRAM) in a continuously updated rolling expert cache
  • Achieves a 60% VRAM hit rate with a warm start; a further 12% of expert reads are served from DRAM, leaving 28% to come from NVMe
  • Uses a dual-GPU ping-pong architecture to overlap weight loading with compute
  • Implements Cache-Aware Routing (CAR): when experts score similarly, the router picks the next-best-scoring expert already in the VRAM or DRAM cache, provided its score falls within an acceptable threshold
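
The Cache-Aware Routing idea can be illustrated with a minimal Python sketch. This is not FOMOE's actual C/HIP code: the function name `cache_aware_top_k`, its parameters, and the choice of an absolute score gap as the "acceptable threshold" are all assumptions made for illustration, since the source does not specify how the threshold is defined.

```python
from typing import List, Sequence, Set


def cache_aware_top_k(
    scores: Sequence[float],  # router score per expert (hypothetical input)
    k: int,                   # number of experts to activate per token
    cached: Set[int],         # expert ids currently resident in VRAM or DRAM
    threshold: float,         # assumed: maximum score gap tolerated for a swap
) -> List[int]:
    """Plain top-k routing, except uncached picks may be swapped for
    cached runners-up whose scores are within `threshold`."""
    # Rank all experts by router score, best first.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    chosen = ranked[:k]
    # Cached experts that just missed the top-k, best first.
    spare = [e for e in ranked[k:] if e in cached]

    # Walk the picks from weakest to strongest: the weakest pick has the
    # smallest gap to the runners-up, so it is the cheapest one to replace.
    for i in range(len(chosen) - 1, -1, -1):
        expert = chosen[i]
        if expert in cached or not spare:
            continue
        if scores[expert] - scores[spare[0]] <= threshold:
            chosen[i] = spare.pop(0)  # avoid a flash read for this slot
    return chosen


scores = [0.30, 0.28, 0.27, 0.10, 0.05]
print(cache_aware_top_k(scores, k=2, cached={2, 3}, threshold=0.05))  # [0, 2]
print(cache_aware_top_k(scores, k=2, cached={2, 3}, threshold=0.0))   # [0, 1]
```

In the first call, expert 1 is not cached but expert 2 is, and their scores differ by only 0.01, so the sketch swaps them and skips one NVMe read. With a threshold of zero the routing falls back to ordinary top-k. A larger threshold trades more cache hits for a larger deviation from the model's preferred routing.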

Performance Results

  • 5-9 tokens/second inference speed for Qwen3.5's 397B parameter model
  • NVMe reads reduced to 7% of expert reads with CAR enabled
  • Only a 3.5% drop in perplexity, measured on wikitext
  • Hardware requirements: two $500 GPUs, 32 GB of RAM, one NVMe drive
  • Uses Q4_K_M quantization
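
The cache figures quoted in this article can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 28% NVMe share is the baseline without CAR and that the 7% figure is measured on the same basis (share of expert reads); the function name `nvme_share_pct` is hypothetical.

```python
def nvme_share_pct(vram_hit_pct: int, dram_hit_pct: int) -> int:
    """Percentage of expert reads that miss both caches and go to NVMe."""
    return 100 - vram_hit_pct - dram_hit_pct


baseline_pct = nvme_share_pct(vram_hit_pct=60, dram_hit_pct=12)  # warm start, no CAR
with_car_pct = 7                                                 # reported with CAR enabled

print(f"NVMe share without CAR: {baseline_pct}%")                        # 28%
print(f"Reduction from CAR: {baseline_pct / with_car_pct:.1f}x fewer")   # 4.0x fewer
```

Under those assumptions, the 60% VRAM and 12% DRAM hit rates account for the 28% NVMe figure, and enabling CAR cuts flash reads by roughly a factor of four.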

The implementation consists of approximately 15,000 lines of Claude-driven C/HIP code with heavy human guidance.

📖 Read the full source: r/LocalLLaMA
