Flash-MoE: Running 397B Parameter Qwen Model on MacBook Pro with Pure C/Metal

Technical Implementation
Flash-MoE runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model with 60 transformer layers: 45 GatedDeltaNet (linear attention) layers and 15 standard full-attention layers. Each layer has 512 experts, with K=4 activated per token plus one shared expert. Hidden dimension is 4096.
Performance Benchmarks
- 4-bit experts, FMA kernel: 4.36 tokens/second, excellent quality, full tool calling, 209GB on disk (current best)
- 4-bit experts, baseline: 3.90 tokens/second, excellent quality
- 2-bit experts, trust OS: 5.74 tokens/second, good quality, 120GB on disk (breaks JSON/tool calling)
- 2-bit peak single token: 7.05 tokens/second, good quality (not suitable for tool use)
Note: 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.
Hardware Requirements
- Machine: MacBook Pro with Apple M3 Max
- Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
- Memory: 48 GB unified (~400 GB/s bandwidth)
- SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
- macOS: 26.2 (Darwin 25.2.0)
Key Techniques
SSD Expert Streaming
Expert weights (209GB at 4-bit) are read from the NVMe SSD on demand via parallel pread() calls issued through GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). Caching is left to the OS page cache ("Trust the OS" principle), which reaches a ~71% hit rate without any custom cache.
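
The on-demand read of a single expert can be sketched with plain POSIX calls. This is an illustration under stated assumptions, not the project's code: the file layout (equally sized expert blobs stored back to back) and the names `open_expert_file` and `read_expert` are hypothetical, and the GCD dispatch that issues the K=4 reads in parallel is left out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the expert file once; every later read reuses this descriptor. */
int open_expert_file(const char *path)
{
    return open(path, O_RDONLY);
}

/* Read expert number `expert` from a file of back-to-back, equally sized
 * expert blobs (hypothetical layout). Returns 0 on success, -1 on a read
 * error or if the file ends before the expert is complete. */
int read_expert(int fd, size_t expert, size_t expert_bytes, void *dst)
{
    assert(dst != NULL && expert_bytes > 0);

    off_t base = (off_t)expert * (off_t)expert_bytes;
    size_t done = 0;

    /* pread may return fewer bytes than requested, so loop until full. */
    while (done < expert_bytes) {
        ssize_t n = pread(fd, (uint8_t *)dst + done,
                          expert_bytes - done, base + (off_t)done);
        if (n <= 0)
            return -1; /* n < 0: error, n == 0: unexpected end of file */
        done += (size_t)n;
    }
    return 0;
}
```

Because pread takes an explicit offset and does not move the file position, several such reads can safely share one descriptor from different threads, which is what makes it a natural fit for parallel dispatch.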
FMA-Optimized Dequant Kernel
The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). The two forms are algebraically equal; pre-computing scale*x and bias*x lets the GPU's fused multiply-add unit do dequant and multiply in one instruction, which is 12% faster than the naive formulation.
Metal Compute Shaders
Hand-written Metal kernels include:
- 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
- Fused SwiGLU activation
- RMS normalization (two-pass: sum-of-squares reduction + apply)
- Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
- GPU RoPE (fused with Q deinterleave and K normalization)
- MoE combine + residual + sigmoid gate (fused kernel)
Deferred GPU Expert Compute
CMD3 (the expert forward pass) is submitted without waiting for it to complete. The GPU executes it while the CPU prepares the next layer. The combine, residual, and norm steps also run on the GPU and feed directly into the next layer's attention projections.
Accelerate BLAS for Linear Attention
The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the state update (64 heads, each with a 128×128 state matrix), which is 64% faster than scalar code.
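
One way the three BLAS calls can map onto a single recurrent step is sketched below with plain loops, so it runs without Accelerate. This is an assumption, not the project's code: it uses the gated delta rule as published for Gated DeltaNet, S <- alpha*S followed by S <- S + beta*(v - S k) k^T, and the name `gated_delta_step`, the row-major layout, and the scalar gates `alpha` and `beta` are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* One recurrent step for a single head. S is a d x d row-major state
 * matrix; scratch must hold d floats. Each loop is annotated with the
 * BLAS routine that performs the same operation. */
void gated_delta_step(float *S, const float *k, const float *v,
                      float alpha, float beta, size_t d, float *scratch)
{
    assert(d > 0);

    /* 1. Decay: S <- alpha * S  (what cblas_sscal does on d*d floats) */
    for (size_t i = 0; i < d * d; i++)
        S[i] *= alpha;

    /* 2. Prediction: scratch <- S k  (matrix-vector, cblas_sgemv) */
    for (size_t i = 0; i < d; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < d; j++)
            acc += S[i * d + j] * k[j];
        scratch[i] = acc;
    }

    /* 3. Rank-1 correction: S <- S + beta * (v - S k) k^T  (cblas_sger) */
    for (size_t i = 0; i < d; i++) {
        float e = beta * (v[i] - scratch[i]);
        for (size_t j = 0; j < d; j++)
            S[i * d + j] += e * k[j];
    }
}
```

In the configuration described above, d would be 128 and the step would run once per head for each of the 64 heads. Replacing each loop with its BLAS counterpart hands the inner work to vectorized library code.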
Pipeline Performance
Per layer average at 4-bit: 4.28ms
- CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
- CPU: flush results [0.01ms CPU]
- CMD2: o_proj + norm + routing + shared [0.55ms GPU]
- CPU: softmax + topK routing [0.003ms]
- I/O: parallel pread K=4 experts [2.41ms SSD]
- CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
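
The stage timings above can be cross-checked with a little arithmetic. The helpers below are hypothetical and only restate the listed numbers: the six stages sum to about 4.23ms against the reported 4.28ms average, 60 layers at 4.28ms come to roughly 3.89 tokens/second, and four ~6.75MB experts read in 2.41ms imply an effective rate of about 11.2 GB/s (treating 1 MB per ms as 1 GB/s). CMD3's GPU execution time is not listed because it is deferred, so the sum is a lower bound rather than a full accounting.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a list of per-stage times in milliseconds. */
double sum_ms(const double *stages, size_t n)
{
    assert(n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stages[i];
    return total;
}

/* Tokens per second if every layer takes ms_per_layer and one token
 * passes through all layers serially. */
double tokens_per_second(double ms_per_layer, int layers)
{
    assert(ms_per_layer > 0.0 && layers > 0);
    return 1000.0 / (ms_per_layer * (double)layers);
}

/* Effective read rate in GB/s for k experts of mb_per_expert megabytes
 * loaded in io_ms milliseconds (MB per ms equals GB per s). */
double read_rate_gb_per_s(double mb_per_expert, int k, double io_ms)
{
    assert(io_ms > 0.0 && k > 0);
    return mb_per_expert * (double)k / io_ms;
}
```

For example, `tokens_per_second(4.28, 60)` and `read_rate_gb_per_s(6.75, 4, 2.41)` give the throughput and read-rate figures quoted in the paragraph above.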
Architecture Constraints
On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration, requiring a serial pipeline.
📖 Read the full source: HN AI Agents