Claude 4.6 Opus Reasoning Distilled to 14GB for Apple Silicon via MLX Quantization

A developer has quantized a Claude 4.6 Opus reasoning distillation of Qwen 3.5 27B to run locally on Apple Silicon, cutting its memory footprint from 55.6GB to about 14GB while keeping the model's <think> reasoning block intact.
The Model and Its Origin
The work centers on Qwen 3.5 27B, specifically a version distilled from Claude 4.6 Opus reasoning trajectories. The developer sought a model that could "think" rather than just autocomplete code, describing Opus's signature as "deliberate, analytical, and catches the subtle architectural flaws that other models miss." This distilled version brings that "thinking" scaffold to an open-weight architecture.
The Quantization Process
The original model was 55.6GB in BF16 format, which the developer noted is a "non-starter" for most local setups as it consumes the entire memory pool. To address this, they used MLX to quantize the model for Apple Silicon, converting it to 4-bit precision. The goal was to maintain high-fidelity Opus reasoning while making it lean enough for daily use in technical planning and complex logic.
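To make "converting to 4-bit precision" concrete, here is a minimal pure-Python sketch of group-wise affine quantization, the general technique that 4-bit weight formats are built on. It is an illustration only, not MLX's implementation or the developer's conversion script; the function names `quantize_group` and `dequantize_group` are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map one group of float weights to integer codes in [0, 2**bits - 1].

    Returns the codes plus the scale and offset needed to reconstruct them.
    """
    levels = (1 << bits) - 1                      # 15 steps for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for a constant group
    codes = [
        min(levels, max(0, round((w - w_min) / scale)))  # clamp against float drift
        for w in weights
    ]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate float weights from the integer codes."""
    return [c * scale + offset for c in codes]


if __name__ == "__main__":
    group = [-0.5, -0.1, 0.0, 0.2, 1.0]           # made-up example values
    codes, scale, offset = quantize_group(group)
    restored = dequantize_group(codes, scale, offset)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(codes, f"max error = {worst:.4f}")
```

Each weight is stored as a small integer and reconstructed with a shared scale and offset, so the reconstruction error per weight is bounded by half a quantization step. That bounded error is the trade made for the smaller footprint.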
Results and Performance
- Footprint: Reduced from 55.6GB to 14GB
- Speed: ~16 tokens/second on an M4 Pro
- Reasoning: Maintains the full <think> block, allowing the model to "talk to itself" to verify logic, simulate edge cases, and self-correct before presenting final answers
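Because the model emits its reasoning inside a <think> block before the final answer, a client may want to separate the two. The sketch below shows one way to do that, assuming the block is closed with a matching `</think>` tag (the source mentions only the `<think>` block, so the closing tag is an assumption). The helper name `split_reasoning` is hypothetical.

```python
import re
from typing import Tuple

# Assumption: reasoning is wrapped as <think> ... </think> in the raw output.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output.

    If no complete <think> block is present, reasoning is empty and the
    whole output is treated as the answer.
    """
    match = THINK_RE.search(output)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = (output[:match.start()] + output[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>Check the n = 0 edge case first.</think>\nUse a guard clause."
    reasoning, answer = split_reasoning(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the reasoning separate makes it possible to log or inspect it without showing it alongside the final answer.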
Availability and Requirements
The developer has uploaded the weights to Hugging Face. Running the model requires a Mac with at least 24GB of RAM; on such a machine, reasoning and technical-planning work can run privately and completely offline.
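A back-of-the-envelope check shows how the reported sizes relate to bit width and why 24GB leaves working room. This is a rough sketch: the parameter count below is an assumption inferred from the 55.6GB BF16 figure (2 bytes per weight), and the estimate ignores the per-group scale metadata that quantized formats typically store, as well as memory for the context cache and the operating system.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~27.8B parameters, inferred from 55.6GB / 2 bytes per BF16 weight.
ASSUMED_PARAMS = 27.8e9
RAM_GB = 24  # minimum RAM stated in the article

bf16_gb = weight_footprint_gb(ASSUMED_PARAMS, 16)
q4_gb = weight_footprint_gb(ASSUMED_PARAMS, 4)
headroom_gb = RAM_GB - q4_gb

if __name__ == "__main__":
    print(f"BF16 weights:  {bf16_gb:.1f} GB")
    print(f"4-bit weights: {q4_gb:.1f} GB")
    print(f"Headroom on a {RAM_GB}GB Mac: {headroom_gb:.1f} GB")
```

Under these assumptions the 4-bit weights come to roughly a quarter of the BF16 size, which is consistent with the reported drop to about 14GB.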
📖 Read the full source: r/LocalLLaMA