vllm-mlx fork adds tool calling and prompt cache for local AI coding agents

A developer has published a modified version of vllm-mlx that fixes several issues for running AI coding agents like OpenClaw locally on Mac. The fork adds working tool calling and prompt caching to the OpenAI-compatible server for Apple Silicon.
Key fixes and features
The developer made 37 commits on top of upstream vllm-mlx to address specific problems:
- Tool calling: Added
--tool-call-parser hermesflag — Qwen3-Coder-Next tool calls work out of the box - MiniMax-M2.5: Added streaming and non-streaming tool call parsing with 4/4 accuracy on function calling benchmarks (weather, search, code execution, multi-tool)
- Prompt cache: Added persistent KV cache across requests in SimpleEngine — same system prompt and conversation history only prefills new tokens
- Reasoning separation: Built heuristic parser for MiniMax outputs that had reasoning inline with no tags — reduced leak rate from 60% to 0%
Performance improvements
With 33K token context, time to first token (TTFT) improved from 28 seconds to 0.3 seconds on cache hit. Benchmarks on Mac Studio M3 Ultra 256GB:
- Qwen3-Coder-Next 4bit: 42GB RAM, 70 tok/s decode, 1270 tok/s prefill
- Qwen3-Coder-Next 6bit: 60GB RAM, 65 tok/s decode, 1090-1440 tok/s prefill
- Qwen3-Coder-Next 8bit: 75GB RAM, ~45 tok/s decode, ~900 tok/s prefill
- MiniMax-M2.5 4bit: 120GB RAM, 33-38 tok/s decode, 430-500 tok/s prefill
The developer recommends Qwen3-Coder-Next 6bit as the sweet spot for interactive coding, noting quality is noticeably better than 4bit (which had occasional garbled output).
Setup instructions
pip install git+https://github.com/raullenchai/vllm-mlx.git
python -c "from mlx_lm import load; load('lmstudio-community/Qwen3-Coder-Next-MLX-6bit')"
python -m vllm_mlx.server \
--model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
--tool-call-parser hermes \
--prefill-step-size 8192 \
--kv-bits 8 \
--port 8000
Then point OpenClaw or any OpenAI SDK client at http://localhost:8000/v1.
Hardware requirements
- Qwen3-Coder-Next 4bit: 42GB — fits on M2 Pro 64GB or better
- Qwen3-Coder-Next 6bit: 60GB — needs M2/M3/M4 Max 96GB+ or Ultra
- MiniMax-M2.5: 120GB — Ultra 192GB+ only
What didn't work
- Speculative decoding with Qwen3-0.6B as draft model — mlx-lm has a known bug with Qwen3 (skips tokens, issue #846)
- DeepSeek-R1-Distill-70B for OpenClaw — great at reasoning but tool calling is unreliable
The repository has 1500+ tests and is licensed under Apache 2.0.
📖 Read the full source: r/LocalLLaMA
👀 See Also

SLOP Plugin Adds Real-Time App State Awareness to OpenClaw Agents
A new OpenClaw plugin integrates with SLOP (State Layer for Observable Programs), giving AI agents structured access to application state and contextual actions. The plugin auto-discovers SLOP-enabled apps via ~/.slop/providers/ and a Chrome extension bridge.

x402 API Gateway for OpenClaw Bots: One Endpoint Replaces 18 API Keys
An x402 API gateway eliminates the need for multiple API keys in OpenClaw bots by providing access to 18 services including smart LLM routing, web search, maps, travel, food, AI, and finance data through a single endpoint authenticated via USDC wallet credits.

OmniRecall Beta: FAISS-Powered Memory Injection for Cloud LLM Chats
OmniRecall is a local mitmproxy bypass that intercepts traffic to cloud chat interfaces like DeepSeek, adding a permanent memory layer using FAISS indexing and sentence-transformers MiniLM-L6. It's currently in beta, requires CPU-only operation, and uses an aggressively restrictive source-available license.

MCP Server: Comparing Local and Cloud LLMs with Debate Feature
The MCP server enables developers to query local models via Ollama alongside various cloud LLMs, offering features like side-by-side comparison and a structured debate function.