Mac Studio local LLM loadout: GLM 5.1, Kimi K2.6, and what's working for coding with Claude Code

Over on r/LocalLLaMA, user ezyz posted their Mac Studio local LLM loadout as of May 2026, running on an M3 Ultra with 512GB of unified memory. The post is a day-to-day vibe check rather than a rigorous benchmark, but it's full of practical observations for anyone running large models locally for coding with Claude Code.
Current active models and performance
GLM 5.1 is the biggest winner. Quantized, it fits in ~380GB with max context, leaving room for other tasks. Decode speed is ~17 t/s and prefill is ~190 t/s. For coding via Claude Code, the author trusts it up to a 6/10 on task complexity (10 being 'brownfield legacy codebase + vague spec'). It handles self-contained, semi-scoped problems consistently, with occasional help from Claude via the API for planning or cleanup.
Kimi K2.6 is in the same tier, neither obviously better nor worse, but it is larger. Even aggressively quantized, it uses ~460GB, leaving little room for other experiments. It is faster: prefill ~220 t/s, decode ~21 t/s. The friction is having to unload it before running memory-heavy experiments.
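To put the figures for GLM 5.1 and Kimi K2.6 side by side, here is a minimal Python sketch of the arithmetic. The memory footprints and token rates come from the post; the request size (20,000 prompt tokens, 1,000 output tokens) is a made-up illustration, and the calculation ignores memory reserved by macOS and any slowdown at long context, so treat the results as rough estimates rather than measurements.

```python
from dataclasses import dataclass

TOTAL_MEMORY_GB = 512  # M3 Ultra unified memory, as stated in the post


@dataclass(frozen=True)
class ModelProfile:
    name: str
    footprint_gb: float  # approximate memory use reported in the post
    prefill_tps: float   # prompt-processing speed, tokens per second
    decode_tps: float    # generation speed, tokens per second

    def headroom_gb(self, total_gb: float = TOTAL_MEMORY_GB) -> float:
        """Memory left over for other experiments while the model is loaded."""
        return total_gb - self.footprint_gb

    def request_seconds(self, prompt_tokens: int, output_tokens: int) -> float:
        """Rough wall-clock time: prefill the prompt, then decode the reply."""
        return prompt_tokens / self.prefill_tps + output_tokens / self.decode_tps


MODELS = [
    ModelProfile("GLM 5.1", footprint_gb=380, prefill_tps=190, decode_tps=17),
    ModelProfile("Kimi K2.6", footprint_gb=460, prefill_tps=220, decode_tps=21),
]

if __name__ == "__main__":
    # Hypothetical request size, chosen only for illustration.
    for m in MODELS:
        secs = m.request_seconds(prompt_tokens=20_000, output_tokens=1_000)
        print(f"{m.name}: {m.headroom_gb():.0f} GB free, ~{secs:.0f} s per request")
```

Under these assumptions, the trade-off in the post shows up directly: Kimi K2.6 finishes the same request somewhat sooner, while GLM 5.1 leaves far more memory free for other work.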
Minimax 2.7 is impressive for its size and speed, but the author rates it only 3-4/10 for dev work. It's an awkward size — GLM and Kimi win on shipping usable code, while smaller models win on assistant tasks like 'summarize this web search'. It does quickly bail out of reasoning for simple requests.
Gemma 4 31B disappointed: MLX support is still messy a month post-release. The 31B dense model isn't much faster than the big MoEs, the official chat template has multiple unaddressed bugs, and patches are still trickling in. The author plans to revisit once MTP/draft support stabilizes.
Qwen 3.6 35B was replaced with Qwen 3.5 9B for multimodal tasks like translating screenshots. The smaller model is good enough and fast enough, handles Claude Code's Haiku background tasks with no noticeable difference, and saves ~14GB of memory.
Pending support and future watch
Neither Deepseek 4 Flash nor Mimo 2.5 has officially landed in llama.cpp or mlx-lm yet. The author will try the PRs when time permits. They guess the pro versions of both will be too large and slow for the M3 Ultra, since GLM's 40B active parameters is roughly their patience limit.
Eagerly watched projects:
- Exo and tinygrad for Mac + NVIDIA clustering and disaggregated prefill
- Stable Dflash / DDtree / MTP support
- Novel quantization formats (paroquant, JANGTQ) — see llama.cpp PR #21038
- Local music generation: Ace Step 1.5 is 'almost good', but the voices are not there yet
📖 Read the full source: r/LocalLLaMA