Qwen3.5-122B-A10B-MINT-MLX runs smoothly on M5 Pro with 64GB RAM

✍️ OpenClawRadar | 📅 Published: April 20, 2026

Local LLM Performance on Apple Silicon

A Reddit user has shared their experience running the Qwen3.5-122B-A10B-MINT-MLX model locally on an M5 Pro with 64GB RAM. The setup suggests that a 122B-parameter model at 3-bit quantization can run effectively on consumer hardware, provided the GPU memory limit and context window are configured carefully.

Configuration Details

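The user reported smooth performance after adjusting the GPU memory limit from the terminal. The first command below queries a setting, and the second raises the wired memory limit to 61440 MB (60 × 1024 MB) on the 64GB machine: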

sysctl iogpu.unified_memory_limit_percentage
sudo sysctl iogpu.wired_limit_mb=61440

In LM Studio, they set the context window to 16384 tokens. With this configuration, the system maintained stable performance while running Safari with multiple tabs, Messages, and Activity Monitor simultaneously.
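The 61440 figure above works out to 60 × 1024 MB, which leaves 4096 MB of the 64GB for everything else. As a minimal sketch of that arithmetic, the helper below derives a wired-limit value from total RAM and a reserve. The function name `wired_limit_mb` and the 4GB reserve are illustrative assumptions inferred from the numbers, not something the user stated.

```python
def wired_limit_mb(total_ram_gb: int, reserve_gb: int) -> int:
    """Return a wired-memory limit in MB, leaving reserve_gb for the OS and apps.

    Hypothetical helper: the reserve size is an assumption, not from the post.
    """
    if reserve_gb < 0 or reserve_gb >= total_ram_gb:
        raise ValueError("reserve_gb must be between 0 and total_ram_gb")
    return (total_ram_gb - reserve_gb) * 1024


# 64GB machine with an assumed 4GB reserve reproduces the value in the article.
limit = wired_limit_mb(64, 4)
print(f"sudo sysctl iogpu.wired_limit_mb={limit}")
# prints: sudo sysctl iogpu.wired_limit_mb=61440
```

A smaller reserve gives the model more room but leaves less for other applications running alongside it.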

Performance Benchmarks

The Qwen3.5-122B-A10B-MINT-MLX model delivered:

Time to First Token: 0.86 seconds
Token Generation Speed: 39.58 tokens/second

The user noted the model "solved a bunch of riddles correctly and did a bit of vibe coding" and had no complaints about the 3-bit MINT quantization. The only issue occurred when the context window filled up and VRAM usage approached 59GB, at which point the system locked up.
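To see why a 122B model fits at all, a rough back-of-the-envelope estimate of the weight footprint helps. The sketch below assumes "122B" in the model name means 122 billion total parameters and that every weight is stored at a uniform 3 bits; real quantization schemes typically add scales and may keep some layers at higher precision, so treat the result as a lower bound rather than a measured value.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight storage in decimal GB, ignoring quantization overhead.

    Assumes uniform bits_per_weight across all parameters (a simplification).
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


weights_gb = weight_footprint_gb(122, 3)
print(f"Estimated weights: {weights_gb:.2f} GB")
# prints: Estimated weights: 45.75 GB
```

Whatever remains under the wired limit has to hold the KV cache and runtime buffers, which is consistent with usage climbing toward 59GB as the context fills.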

Comparison with Other Models

The user also tested "Qwen3.5 40B Claude 4.6 Opus Deckard Heretic Uncensored Thinking Mxfp8," which they found to be more accurate than the 122B model but significantly slower:

Token Generation Speed: 6.93 tokens/second
Prompt processing remained fast despite slower generation

This demonstrates the trade-off between model size, quantization, and inference speed that developers face when choosing local LLM configurations.
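To put the two reported generation speeds in practical terms, the sketch below converts tokens per second into wait time for a response. The 500-token response length is an arbitrary assumption for illustration, and time to first token is left out because it was only reported for the 122B model.

```python
def generation_time_s(n_tokens: int, tokens_per_s: float) -> float:
    """Seconds needed to generate n_tokens at a steady tokens_per_s rate."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return n_tokens / tokens_per_s


# Speeds as reported in the article; labels are shortened for display.
reported_speeds = {"122B MINT (3-bit)": 39.58, "40B Mxfp8": 6.93}
response_tokens = 500  # assumed response length

for name, speed in reported_speeds.items():
    seconds = generation_time_s(response_tokens, speed)
    print(f"{name}: {seconds:.1f} s for {response_tokens} tokens")

speedup = reported_speeds["122B MINT (3-bit)"] / reported_speeds["40B Mxfp8"]
print(f"Speed ratio: {speedup:.1f}x")
```

At these rates the 122B model is roughly 5.7 times faster at generation, so the accuracy the user attributed to the 40B model comes with waits of over a minute for a response of this length.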

📖 Read the full source: r/LocalLLaMA
