FOMOE Enables 397B Qwen3.5 Model Inference on $2,100 Desktop Hardware

✍️ OpenClawRadar | 📅 Published: March 29, 2026

What FOMOE Solves

Large Mixture of Experts (MoE) models require hundreds of gigabytes of weight storage, which in practice means keeping the weights on flash storage such as NVMe drives. During inference only a small fraction of the weights is needed for each token, but which ones cannot be predicted ahead of time. The resulting random access pattern makes flash latency too high for practical inference on consumer hardware.

How FOMOE Works

The system makes most expert weight reads unnecessary through several techniques:

Keeps the most frequently used experts in GPU memory (VRAM) in a rolling expert cache that stays up to date
Achieves a 60% VRAM hit rate with a warm start; a further 12% of reads are served from DRAM, leaving 28% to NVMe
Uses a dual-GPU ping-pong architecture to overlap weight loading with compute
Implements Cache-Aware Routing (CAR): when experts score similarly, the model picks the next-best-scoring expert that is already in the VRAM or DRAM cache, provided its score falls within an acceptable threshold
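
The Cache-Aware Routing idea in the list above can be sketched roughly as follows. This is a minimal Python illustration, not FOMOE's C/HIP implementation: the function name `cache_aware_top_k`, the `threshold` parameter, and the rule of comparing each uncached expert against the best-scoring cached runner-up are all assumptions made for the example.

```python
from typing import List, Sequence, Set


def cache_aware_top_k(
    scores: Sequence[float],  # router score per expert (hypothetical input)
    cached: Set[int],         # expert ids currently held in VRAM or DRAM
    k: int,                   # number of experts to activate
    threshold: float,         # max score gap tolerated for a swap (assumed)
) -> List[int]:
    """Pick k experts, swapping uncached picks for cached near-ties."""
    # Rank all experts from highest to lowest router score.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    top_k = ranked[:k]

    # Cached experts that missed the plain top-k, best score first.
    spare = [e for e in ranked[k:] if e in cached]

    chosen: List[int] = []
    for expert in top_k:
        if expert in cached:
            chosen.append(expert)  # already resident, no read needed
        elif spare and scores[expert] - scores[spare[0]] <= threshold:
            chosen.append(spare.pop(0))  # near-tie: use the cached expert
        else:
            chosen.append(expert)  # gap too large: accept the slow read
    return chosen


if __name__ == "__main__":
    scores = [0.30, 0.28, 0.27, 0.10, 0.05]
    # Expert 1 is uncached, expert 2 is cached and scores almost the same.
    print(cache_aware_top_k(scores, cached={0, 2}, k=2, threshold=0.05))
```

In this sketch the threshold is the knob that trades quality for fewer flash reads: at zero the routing is unchanged, and larger values substitute cached experts more aggressively.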

Performance Results

5-9 tokens/second inference speed on Qwen3.5's 397B-parameter model
NVMe reads drop to 7% (from 28%) with CAR enabled
Only a 3.5% perplexity penalty, measured on wikitext
Hardware requirements: two $500 GPUs, 32 GB RAM, one NVMe drive
Uses Q4_K_M quantization
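
The reported percentages fit together with simple arithmetic, shown here as a short Python sketch. The hit rates and the 7% figure come from the article; the helper names are hypothetical, and the 4.8 bits-per-weight value is an assumed approximation for Q4_K_M (the real average depends on the tensor mix), used only to show why the weights land in the hundreds of gigabytes.

```python
def nvme_share(vram_hit: float, dram_hit: float) -> float:
    """Fraction of expert reads that miss both VRAM and DRAM."""
    return 1.0 - vram_hit - dram_hit


def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    baseline = nvme_share(vram_hit=0.60, dram_hit=0.12)  # figures from the article
    with_car = 0.07                                      # figure from the article

    print(f"NVMe share without CAR: {baseline:.0%}")
    print(f"Reduction from CAR: {baseline / with_car:.1f}x")

    # ASSUMPTION: ~4.8 bits per weight for Q4_K_M; not stated in the article.
    print(f"Approximate weight size: {quantized_size_gb(397e9, 4.8):.0f} GB")
```

Under that assumption the weights are far larger than 32 GB of RAM plus the VRAM of two consumer GPUs, which is why most of them have to live on the NVMe drive.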

The implementation consists of approximately 15,000 lines of Claude-driven C/HIP code with heavy human guidance.

📖 Read the full source: r/LocalLLaMA
