Flash-MoE: Run 397B Qwen Model on MacBook Pro at 4.4 tok/s

Technical Implementation

Flash-MoE is written in pure C and Metal. It runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model with 60 transformer layers: 45 GatedDeltaNet (linear attention) layers and 15 standard full-attention layers. Each layer has 512 experts, of which K=4 are activated per token, plus one shared expert. The hidden dimension is 4096.

Performance Benchmarks

4-bit experts, FMA kernel: 4.36 tokens/second, excellent quality, full tool calling, 209GB on disk (current best)
4-bit experts, baseline: 3.90 tokens/second, excellent quality
2-bit experts, trust OS: 5.74 tokens/second, good quality, 120GB on disk (breaks JSON/tool calling)
2-bit peak single token: 7.05 tokens/second, good quality (not suitable for tool use)

Note: 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Hardware Requirements

Machine: MacBook Pro with Apple M3 Max
Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
Memory: 48 GB unified (~400 GB/s bandwidth)
SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
macOS: 26.2 (Darwin 25.2.0)

Key Techniques

SSD Expert Streaming

Expert weights (209GB at 4-bit) are read from the NVMe SSD on demand via parallel pread() calls coordinated with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). Caching is left entirely to the OS page cache (the "Trust the OS" principle), with no custom cache; this reaches a hit rate of about 71% on its own.
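The parallel-read pattern can be sketched as follows. Flash-MoE is described as using GCD dispatch groups; this sketch swaps in POSIX threads so it also builds outside macOS. The expert_read struct and both function names are hypothetical, and the file layout (one contiguous byte range per expert) is an assumption.

```c
#define _XOPEN_SOURCE 700  /* exposes pread() under strict C standards */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define TOP_K 4  /* active experts per layer, from the text */

/* One read request: fill buf with len bytes starting at file offset off. */
typedef struct {
    int fd;
    off_t off;
    size_t len;
    unsigned char *buf;
    int ok;  /* set to 1 only if the full range was read */
} expert_read;

/* Open the packed expert file read-only; the path is supplied by the caller. */
int open_expert_file(const char *path)
{
    return open(path, O_RDONLY);
}

static void *read_worker(void *arg)
{
    expert_read *r = arg;
    size_t done = 0;
    while (done < r->len) {
        /* pread takes an explicit offset, so threads never race on a shared file position. */
        ssize_t n = pread(r->fd, r->buf + done, r->len - done, r->off + (off_t)done);
        if (n <= 0) break;  /* error or unexpected end of file */
        done += (size_t)n;
    }
    r->ok = (done == r->len);
    return NULL;
}

/* Issue all reads concurrently and wait for every one; returns 1 on success. */
int read_experts_parallel(expert_read reqs[], int count)
{
    assert(count > 0 && count <= TOP_K);
    pthread_t tid[TOP_K];
    int started = 0;
    for (int i = 0; i < count; i++) {
        reqs[i].ok = 0;
        if (pthread_create(&tid[i], NULL, read_worker, &reqs[i]) != 0) break;
        started++;
    }
    int all_ok = (started == count);
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        all_ok = all_ok && reqs[i].ok;
    }
    return all_ok;
}
```

Because the reads go through the normal file API, repeated requests for the same expert are served by the OS page cache without any extra bookkeeping in the caller.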

FMA-Optimized Dequant Kernel

The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU's fused multiply-add unit perform the dequantization and the multiplication in a single instruction, which is 12% faster than the naive formulation.

Metal Compute Shaders

Hand-written Metal kernels include:

4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
Fused SwiGLU activation
RMS normalization (two-pass: sum-of-squares reduction + apply)
Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
GPU RoPE (fused with Q deinterleave and K normalization)
MoE combine + residual + sigmoid gate (fused kernel)

Deferred GPU Expert Compute

CMD3 (the expert forward pass) is submitted without waiting for it to complete. The GPU executes it while the CPU prepares the next layer. The combine, residual, and norm steps also run on the GPU, so their output feeds directly into the next layer's attention projections.

Accelerate BLAS for Linear Attention

The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger to update the state across 64 heads, each with a 128×128 state matrix. This is 64% faster than scalar code.
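One plausible mapping of those three BLAS calls onto a single-head state update is sketched below. It uses the gated delta rule as it is commonly written, S ← αS followed by S ← S + β(v − Sk)kᵀ; the source does not spell out Flash-MoE's exact gating. Plain loops stand in for the BLAS calls so the sketch has no Accelerate dependency, and the function name and argument layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One recurrence step for one head. S is a dv x dk row-major state matrix,
 * u is caller-provided scratch of length dv. */
void gated_delta_step(float *S, float *u, const float *k, const float *v,
                      float alpha, float beta, size_t dv, size_t dk)
{
    assert(dv > 0 && dk > 0);

    /* Step 1, decay the state: S *= alpha (the role of cblas_sscal). */
    for (size_t i = 0; i < dv * dk; i++)
        S[i] *= alpha;

    /* Step 2, predict the value for this key: u = S k (the role of cblas_sgemv). */
    for (size_t r = 0; r < dv; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < dk; c++)
            acc += S[r * dk + c] * k[c];
        u[r] = acc;
    }

    /* Step 3, rank-1 correction: S += beta * (v - u) k^T (the role of cblas_sger). */
    for (size_t r = 0; r < dv; r++) {
        float coeff = beta * (v[r] - u[r]);
        for (size_t c = 0; c < dk; c++)
            S[r * dk + c] += coeff * k[c];
    }
}
```

With dv = dk = 128 this step would be called once per head, 64 times per layer in the configuration the text describes.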

Pipeline Performance

Per layer average at 4-bit: 4.28ms

CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
CPU: flush results [0.01ms CPU]
CMD2: o_proj + norm + routing + shared [0.55ms GPU]
CPU: softmax + topK routing [0.003ms]
I/O: parallel pread K=4 experts [2.41ms SSD]
CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
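The per-layer figures can be turned into a throughput estimate with a few lines of arithmetic. The sketch below assumes that time per token is simply the number of layers times the per-layer time, ignoring embedding, the output head, and sampling; the function names are hypothetical. Under that assumption the 4.28 ms average works out to roughly 3.89 tokens/second, close to the 3.90 baseline figure in the benchmarks. The six listed stages sum to about 4.23 ms, slightly under the stated average.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the serial stage times (in milliseconds) for one layer. */
double layer_time_ms(const double *stage_ms, size_t count)
{
    assert(count > 0);
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += stage_ms[i];
    return total;
}

/* Estimated tokens/second if a token costs exactly layers * per_layer_ms. */
double tokens_per_second(double per_layer_ms, int layers)
{
    assert(per_layer_ms > 0.0 && layers > 0);
    return 1000.0 / (per_layer_ms * (double)layers);
}
```

Calling layer_time_ms on {1.22, 0.01, 0.55, 0.003, 2.41, 0.04} and tokens_per_second(4.28, 60) reproduces the two numbers quoted above. The SSD read accounts for more than half of each layer's time.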

Architecture Constraints

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration, requiring a serial pipeline.

📖 Read the full source: HN AI Agents
