FOMOE Enables 397B Qwen3.5 Model Inference on $2,100 Desktop Hardware

What FOMOE Solves
Large Mixture of Experts (MoE) models require hundreds of gigabytes of weight storage, which on consumer hardware typically means flash storage such as an NVMe SSD. During inference, each token needs only a small fraction of the weights, but which ones cannot be predicted ahead of time. The resulting random access pattern makes flash read latency too high for practical inference on consumer hardware.
How FOMOE Works
The system makes most expert weight reads unnecessary through several techniques:
- Keeps the most frequently used experts in GPU memory (VRAM) through a rolling expert cache that is continuously updated
- Achieves a 60% VRAM hit rate after a warm start, with a further 12% of expert reads served from DRAM, leaving 28% to come from NVMe
- Uses a dual-GPU ping-pong architecture to overlap weight loading with compute
- Implements Cache-Aware Routing (CAR): when two experts score similarly, the model picks the next-best-scoring expert already cached in VRAM or DRAM, provided its score falls within an acceptable threshold
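The cache-aware routing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the FOMOE implementation: the function name `cache_aware_route`, the use of raw router scores, and the absolute-difference `threshold` are all assumptions, since the source does not say how the threshold is defined.

```python
def cache_aware_route(scores, top_k, cached, threshold):
    """Pick top_k experts, swapping uncached picks for cached near-ties.

    scores    -- router score per expert (higher is better)
    cached    -- set of expert ids already resident in VRAM or DRAM
    threshold -- hypothetical maximum score gap allowed for a swap
    """
    # Plain top-k routing: experts ranked by score, best first.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    selected = ranked[:top_k]

    # Cached experts that missed the top-k, best score first.
    spares = [e for e in ranked[top_k:] if e in cached]

    result = []
    for expert in selected:
        swap_ok = (
            expert not in cached
            and spares
            and scores[expert] - scores[spares[0]] <= threshold
        )
        if swap_ok:
            # Use the cached near-tie instead of reading from NVMe.
            result.append(spares.pop(0))
        else:
            result.append(expert)
    return result


scores = [0.30, 0.28, 0.27, 0.10, 0.05]
cached = {0, 2}

print(cache_aware_route(scores, top_k=2, cached=cached, threshold=0.02))   # [0, 2]
print(cache_aware_route(scores, top_k=2, cached=cached, threshold=0.005))  # [0, 1]
```

With the looser threshold, expert 1 (uncached) is replaced by expert 2 (cached), avoiding a storage read at the cost of a slightly lower router score. With the tighter threshold the original top-k choice is kept, which is the trade-off between read traffic and model quality that the threshold controls.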
Performance Results
- 5-9 tokens/second inference speed for Qwen3.5's 397B parameter model
- NVMe reads reduced from 28% to 7% of expert reads with CAR enabled
- Only a 3.5% perplexity degradation, measured on wikitext
- Hardware requirements: two $500 GPUs, 32GB RAM, one NVMe drive
- Uses Q4_K_M quantization
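To see why the weights cannot simply live in RAM, a rough footprint estimate helps. The sketch below assumes an average of about 4.8 bits per weight for Q4_K_M; that figure is an assumption, not from the source, and the real average depends on which tensors are quantized at which precision. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 397e9         # parameter count from the article
BITS_PER_WEIGHT = 4.8    # ASSUMED average for Q4_K_M, not from the source
DRAM_GB = 32             # system RAM from the article

total_gb = weight_footprint_gb(N_PARAMS, BITS_PER_WEIGHT)
print(f"Estimated weights: {total_gb:.1f} GB")
print(f"Ratio to system RAM: {total_gb / DRAM_GB:.1f}x")
```

Under that assumption the weights come to a couple of hundred gigabytes, several times the 32 GB of system RAM, which is consistent with the article's point that most expert weights have to stay on NVMe.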
The implementation consists of approximately 15,000 lines of Claude-driven C/HIP code with heavy human guidance.
📖 Read the full source: r/LocalLLaMA