oMLX introduces SSD KV caching for Apple Silicon, reducing OpenClaw response times from 30-90 seconds to 5 seconds

What oMLX solves
Running OpenClaw locally typically means sending the same massive system prompt (20-30k tokens covering tools, skills, workspace context) on every request. While Ollama and LM Studio cache KV state, they invalidate the entire cache and recompute from scratch when context shifts mid-session, resulting in 30-90 second response times.
oMLX fixes this by persisting KV cache blocks to SSD in safetensors format. When a previously seen prefix returns, it's restored from disk instead of recomputed - working across requests and server restarts. Since OpenClaw's system prompt is mostly static (only timestamps and runtime metadata shift), SSD caching means only changed parts get recomputed.
Performance benchmarks
Tested with Qwen3.5-122B-A10B-4bit on M3 Ultra 512GB:
- Single request benchmarks:
- 1k context: 768 tok/s prompt processing, 56.6 tok/s generation, 65.5 GB peak memory
- 8k context: 940 tok/s prompt processing, 51.4 tok/s generation, 69.3 GB peak memory
- 32k context: 764 tok/s prompt processing, 42.4 tok/s generation, 73.4 GB peak memory
- Continuous batching (pp1024/tg128):
- 1x batch: 56.6 tok/s, 1.00x speedup
- 2x batch: 92.1 tok/s, 1.63x speedup
- 4x batch: 135.1 tok/s, 2.39x speedup
- 8x batch: 190.2 tok/s, 3.36x speedup
Setup with OpenClaw
- Download the DMG from releases and drag to Applications
- Point it at your model directory (reuses LM Studio models, no re-download needed)
- Add oMLX as a custom provider in openclaw.json
- The web dashboard generates the exact config - no terminal needed
Additional features
- Multi-model serving: LLM + embedding + reranker simultaneously
- Tool calling for all major formats (JSON, Qwen, Gemma, GLM) + MCP
- Tool result trimming - truncates oversized tool outputs
- OpenAI + Anthropic /v1/messages drop-in compatibility
- Native macOS menu bar app (not Electron)
- Apache 2.0 license, 100% open source
📖 Read the full source: r/openclaw
👀 See Also

Qwen Meetup Draft: Function Calling Harness 2 Boosts CoT Compliance from 9.91% to 100% via Structured Schemas
A follow-up to the earlier function-calling harness post extends the pattern to domains without a compiler (investment memos, legal opinions, clinical charts). The schema forces required fields — submission rejected if incomplete. Qwen3.6-27b achieves 100% CoT compliance on these schemas.

MCP Server Adds Persistent Memory with Retrieval Scoring to Claude Code
A developer built an MCP server called engram-mcp that gives Claude Code persistent memory across sessions and projects, featuring automatic retrieval scoring based on outcome success and drift detection for stale knowledge.

CLAUDE.md: Drop-in file reduces Claude output tokens by 63%
CLAUDE.md is a single file that cuts Claude output verbosity by approximately 63% without code changes. It targets sycophancy, verbosity, and formatting noise in Claude's responses.

Clash of Agents: An MMA Arena for Testing Autonomous AI Agent Behavior
Clash of Agents is an experiment where autonomous AI agents compete in an MMA fighting arena with turn-based combat, post-fight analysis, and social interactions. Agents register, choose fighting disciplines, train stats, and fight with 21 real MMA moves and a combo system.