Optimizing GLM-4.7-Flash on M4 Mac Mini with 24GB RAM

Practical Configuration for GLM-4.7-Flash on M4 Hardware
A developer testing OpenClaw and Ollama on an M4 Mac Mini with 24GB RAM has shared specific optimization details for running the GLM-4.7-Flash model. The post gives concrete memory measurements and configuration parameters that fit within the hardware's limits.
Memory Reality and Model Selection
The testing reveals that the effective GPU memory budget on the M4 Mini is approximately 17.8GB Metal (GPU-wired), not the full 24GB. The rest is consumed by macOS, applications, and CPU compute. This limitation affects model selection and context size.
- Q4_K_XL quantization (17.5GB GGUF) cannot handle 32k context: Model (14.4GB) + KV (2.8GB) + compute (1.4GB) = 18.6GB → Out of Memory
- Q3_K_XL quantization (13.8GB GGUF) works at 32k context: Model (12.7GB) + KV (3.2GB) + compute (1.4GB), reported as 16.1GB with 1.7GB headroom (the listed components sum to 17.3GB, which would leave about 0.5GB)
- Context ceiling is approximately 34k before OOM occurs
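The budget check behind these bullets can be sketched in a few lines of Python. This is a minimal illustration, not the developer's tooling: the `Footprint` class and its field names are hypothetical, and the figures are copied from the list above. It sums the components exactly as listed, so the Q3_K_XL total is the raw sum rather than the 16.1GB quoted in the source.

```python
from dataclasses import dataclass

# GPU-wired Metal budget reported for the 24GB M4 Mini.
METAL_BUDGET_GB = 17.8


@dataclass
class Footprint:
    """Hypothetical helper: one quantization's memory components, in GB."""
    name: str
    model_gb: float
    kv_gb: float
    compute_gb: float

    @property
    def total_gb(self) -> float:
        # Round to one decimal to match the precision of the source figures.
        return round(self.model_gb + self.kv_gb + self.compute_gb, 1)

    def headroom_gb(self, budget_gb: float = METAL_BUDGET_GB) -> float:
        # Negative headroom means the configuration exceeds the budget.
        return round(budget_gb - self.total_gb, 1)

    def fits(self, budget_gb: float = METAL_BUDGET_GB) -> bool:
        return self.total_gb <= budget_gb


# Component figures as listed in the source for 32k context.
q4 = Footprint("Q4_K_XL @ 32k", model_gb=14.4, kv_gb=2.8, compute_gb=1.4)
q3 = Footprint("Q3_K_XL @ 32k", model_gb=12.7, kv_gb=3.2, compute_gb=1.4)

for fp in (q4, q3):
    status = "fits" if fp.fits() else "OOM"
    print(f"{fp.name}: {fp.total_gb}GB total, {fp.headroom_gb()}GB headroom -> {status}")
```

Swapping in your own measured component sizes gives a quick sanity check before loading a model at a new context size.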
Configuration Details
The successful setup uses:
- Model: unsloth/GLM-4.7-Flash-GGUF from Hugging Face
- Quantization: Q3_K_XL
- Context size: 32k with MLA (Multi-Head Latent Attention)
- KV cache implementation: llama.cpp's v-less KV cache (PR #19067, Jan 2026) triggered by GGUF metadata (key_length_mla, kv_lora_rank)
- Build requirement: llama.cpp b7860+
The MLA implementation cuts KV memory usage substantially: at 32k context the KV cache is only 3.2GB instead of 13GB.
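To put the two KV cache figures on a per-token basis, the arithmetic can be sketched as follows. It assumes that "32k" means 32,768 tokens and that the sizes are decimal gigabytes; neither is stated in the source, and the function name is illustrative.

```python
CONTEXT_TOKENS = 32 * 1024  # assumption: "32k" means 32,768 tokens
GB = 1000 ** 3              # assumption: the source figures are decimal gigabytes


def bytes_per_token(kv_cache_gb: float, context_tokens: int = CONTEXT_TOKENS) -> float:
    """Average KV cache bytes consumed per token of context."""
    return kv_cache_gb * GB / context_tokens


mla = bytes_per_token(3.2)    # KV cache size with the MLA-aware cache
full = bytes_per_token(13.0)  # KV cache size without it

print(f"MLA KV cache:  {mla / 1000:.1f} KB per token")
print(f"Full KV cache: {full / 1000:.1f} KB per token")
print(f"Reduction:     {full / mla:.2f}x")
```

The ratio does not depend on either assumption, since both figures are divided by the same token count and use the same unit.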
Framework-Specific Considerations
Agentic frameworks like OpenClaw have internal context thresholds that affect performance:
- OpenClaw triggers aggressive compaction below 32k context
- Increasing context from 20k to 32k reduced startup time from 5 minutes to 2 minutes 17 seconds
- Compaction passes dropped from 2 to 1 when matching num_ctx to framework thresholds
- num_ctx must be set in the Ollama Modelfile: for OpenClaw and other orchestrators that use Ollama's OpenAI-compatible API, a request-level value is ignored
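A minimal sketch of baking the context size into a Modelfile is shown below. `FROM` and `PARAMETER num_ctx` are standard Modelfile instructions; the GGUF path is a hypothetical local filename, and 32768 is an assumed exact value for "32k". The source does not show the developer's actual Modelfile.

```python
def render_modelfile(gguf_path: str, num_ctx: int = 32768) -> str:
    """Return Modelfile text that pins the context window for a local GGUF."""
    return f"FROM {gguf_path}\nPARAMETER num_ctx {num_ctx}\n"


# Hypothetical path: point this at wherever the Q3_K_XL GGUF was downloaded.
modelfile_text = render_modelfile("./GLM-4.7-Flash-Q3_K_XL.gguf")
print(modelfile_text)
```

Saving the output as `Modelfile` and running `ollama create glm-flash-32k -f Modelfile` (the model name here is illustrative) would register a model whose context size applies regardless of what the client sends.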
Performance Testing Data
The developer provided specific timing data for various tasks:
| Task | Time | Input Tokens | Compactions | Result |
| --- | --- | --- | --- | --- |
| Personality intro | 119s | ~13,900 | 2 | ✅ |
| Profile recall | 60s | 13,247 | 2 | ✅ w/ caveat |
| Task creation | 61s | 13,375 | 2 | ✅ |
| Memory write | 165s | 14,448 | 2 | ✅ |
| Memory recall | 89s | 14,085 | 2 | ✅ |
| Web search + synthesis | 273s | 18,668 | 2 | ✅ |
MLX Considerations
The developer notes that MLX and GGUF are different formats, so Unsloth and bartowski GGUF files cannot run with mlx-lm. At the time of the post, the mlx-community repository had no 3-bit Flash model; only 4-bit models were available.
📖 Read the full source: r/openclaw