oMLX SSD KV Caching Cuts OpenClaw Response to 5s on Apple Silicon

What oMLX solves

Running OpenClaw locally typically means sending the same large system prompt (20-30k tokens covering tools, skills, and workspace context) on every request. Ollama and LM Studio do cache KV state, but they invalidate the entire cache and recompute from scratch when the context shifts mid-session, which results in response times of 30-90 seconds.

oMLX fixes this by persisting KV cache blocks to SSD in safetensors format. When a previously seen prefix returns, it is restored from disk instead of being recomputed, and this works across requests and server restarts. Because OpenClaw's system prompt is mostly static (only timestamps and runtime metadata shift), SSD caching means only the changed parts get recomputed.
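To make the idea of block-level prefix reuse concrete, here is a minimal sketch in Python. It is not oMLX's implementation: the block size, the chained SHA-256 keys, the `PrefixCache` class, and the in-memory dict standing in for the SSD store are all assumptions chosen for illustration. It only shows how a request can be split into "restorable from cache" and "still to compute".

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per block; tiny, hypothetical value for illustration


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each key depends on the block's tokens and on the previous key, so a key
    identifies the whole prefix up to and including that block.
    """
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a partial last block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class PrefixCache:
    """Hypothetical stand-in for an on-disk block store (a dict replaces the SSD)."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}  # key -> placeholder for KV tensors

    def save(self, tokens: List[int]) -> None:
        """Record every full block of this token sequence."""
        for i, key in enumerate(block_keys(tokens)):
            self.store.setdefault(key, f"kv-block-{i}")

    def match(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens restorable from cache, tokens left to compute)."""
        hit_blocks = 0
        for key in block_keys(tokens):
            if key not in self.store:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        cached = hit_blocks * BLOCK_SIZE
        return cached, len(tokens) - cached


cache = PrefixCache()
first = list(range(16))                      # a 16-token "system prompt"
cache.save(first)
second = first[:12] + [99, 98, 97, 96, 95]   # same 12-token prefix, new tail
print(cache.match(second))                   # (12, 5)
```

In this sketch, everything after the first changed block is recomputed, so the benefit is largest when the shifting parts of the prompt sit near the end.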

Performance benchmarks

Tested with Qwen3.5-122B-A10B-4bit on M3 Ultra 512GB:

Single request benchmarks:
- 1k context: 768 tok/s prompt processing, 56.6 tok/s generation, 65.5 GB peak memory
- 8k context: 940 tok/s prompt processing, 51.4 tok/s generation, 69.3 GB peak memory
- 32k context: 764 tok/s prompt processing, 42.4 tok/s generation, 73.4 GB peak memory

Continuous batching (pp1024/tg128):
- 1x batch: 56.6 tok/s, 1.00x speedup
- 2x batch: 92.1 tok/s, 1.63x speedup
- 4x batch: 135.1 tok/s, 2.39x speedup
- 8x batch: 190.2 tok/s, 3.36x speedup
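The figures above can be sanity-checked with a few lines of Python. This sketch assumes the batching numbers are aggregate generation throughput across all concurrent requests, and that the 32k-context prompt processing rate roughly holds for a 30k-token prompt; both are assumptions, not statements from the benchmark. The function names are hypothetical helpers.

```python
# Reported aggregate throughput (tok/s) per batch size, copied from the list above.
BATCHED_TOK_S = {1: 56.6, 2: 92.1, 4: 135.1, 8: 190.2}
BASELINE_TOK_S = BATCHED_TOK_S[1]


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Throughput relative to the single-request baseline, rounded to 2 places."""
    return round(tok_s / baseline, 2)


def per_request_tok_s(tok_s: float, batch: int) -> float:
    """Average rate each request sees if throughput is shared evenly (assumption)."""
    return tok_s / batch


def prefill_seconds(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Time to process a prompt from scratch at a given prompt-processing rate."""
    return prompt_tokens / prompt_tok_s


for batch, tok_s in BATCHED_TOK_S.items():
    print(f"{batch}x: speedup {speedup(tok_s):.2f}, "
          f"~{per_request_tok_s(tok_s, batch):.1f} tok/s per request")

# Uncached prefill of a 30k-token system prompt at the 32k-context rate.
print(f"30k prompt, no cache: ~{prefill_seconds(30_000, 764):.0f} s")
```

The speedups reproduce the reported 1.63x, 2.39x, and 3.36x. The prefill estimate of roughly 39 seconds for an uncached 30k-token prompt falls within the 30-90 second range described earlier, which is the cost that restoring a cached prefix avoids.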

Setup with OpenClaw

1. Download the DMG from releases and drag the app to Applications.
2. Point it at your model directory (it reuses LM Studio models, so no re-download is needed).
3. Add oMLX as a custom provider in openclaw.json.

The web dashboard generates the exact config, so no terminal is needed.