Optimizing GLM-4.7-Flash on M4 Mac Mini with 24GB RAM

OpenClawRadar · Published: February 24, 2026

Practical Configuration for GLM-4.7-Flash on M4 Hardware

A developer testing OpenClaw and Ollama on an M4 Mac Mini with 24GB RAM has shared specific optimization details for running the GLM-4.7-Flash model. The source gives concrete memory figures and configuration parameters that fit within the hardware constraints.

Memory Reality and Model Selection

The testing reveals that the effective GPU memory budget on the M4 Mini is approximately 17.8GB Metal (GPU-wired), not the full 24GB. The rest is consumed by macOS, applications, and CPU compute. This limitation affects model selection and context size.

  • Q4_K_XL quantization (17.5GB GGUF) cannot handle 32k context: Model (14.4GB) + KV (2.8GB) + compute (1.4GB) = 18.6GB, which exceeds the 17.8GB budget → Out of Memory
  • Q3_K_XL quantization (13.8GB GGUF) works at 32k context: Model (12.7GB) + KV (3.2GB) + compute (1.4GB), reported as 16.1GB with 1.7GB headroom (the listed components sum to 17.3GB, which would leave roughly 0.5GB)
  • The context ceiling is approximately 34k; beyond that, OOM occurs
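
The budget arithmetic above can be checked with a few lines of Python. This is only a sketch: the component sizes and the 17.8GB budget are the figures quoted in the source, while the function name `memory_check` is made up for illustration.

```python
# Effective Metal (GPU-wired) budget reported in the source, in GB.
METAL_BUDGET_GB = 17.8

def memory_check(model_gb, kv_gb, compute_gb, budget_gb=METAL_BUDGET_GB):
    """Return (total, headroom, fits) for one configuration, rounded to 0.1GB."""
    total = round(model_gb + kv_gb + compute_gb, 1)
    headroom = round(budget_gb - total, 1)
    return total, headroom, headroom >= 0

# Component figures as listed in the source for 32k context.
q4 = memory_check(model_gb=14.4, kv_gb=2.8, compute_gb=1.4)
q3 = memory_check(model_gb=12.7, kv_gb=3.2, compute_gb=1.4)

print("Q4_K_XL (total GB, headroom GB, fits):", q4)
print("Q3_K_XL (total GB, headroom GB, fits):", q3)
```

Summing the listed Q3_K_XL components this way gives 17.3GB rather than the quoted 16.1GB, so the real headroom may be tighter than reported. That would also be consistent with a context ceiling only slightly above 32k.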

Configuration Details

The successful setup uses:

  • Model: unsloth/GLM-4.7-Flash-GGUF from Hugging Face
  • Quantization: Q3_K_XL
  • Context size: 32k with MLA (Multi-Head Latent Attention)
  • KV cache implementation: llama.cpp's v-less KV cache (PR #19067, Jan 2026) triggered by GGUF metadata (key_length_mla, kv_lora_rank)
  • Build requirement: llama.cpp b7860+

The MLA implementation cuts KV memory usage substantially: at 32k context the KV cache is only 3.2GB instead of 13GB.

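
To get a rough feel for these numbers, the sketch below computes the reduction factor and scales the KV cache to other context sizes. It rests on two assumptions that are not from the source: that "32k" means 32,768 tokens, and that KV cache size grows linearly with context length. The helper name `kv_cache_gb` is illustrative.

```python
REF_CTX_TOKENS = 32_768   # assumption: "32k" means 32,768 tokens
REF_KV_GB = 3.2           # KV cache at 32k with MLA, from the source
NON_MLA_KV_GB = 13.0      # KV cache at 32k without MLA, from the source

def kv_cache_gb(ctx_tokens, ref_ctx=REF_CTX_TOKENS, ref_kv_gb=REF_KV_GB):
    """Estimate KV cache size, assuming it scales linearly with context."""
    return ref_kv_gb * ctx_tokens / ref_ctx

reduction = NON_MLA_KV_GB / REF_KV_GB
print(f"MLA reduction factor: {reduction:.2f}x")

for ctx in (20_480, 32_768):
    print(f"{ctx} tokens -> ~{kv_cache_gb(ctx):.1f}GB KV")
```

Under these assumptions, each additional 1,024 tokens of context costs about 0.1GB of KV cache. Treat that as an estimate, not a measurement.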

Framework-Specific Considerations

Agentic frameworks like OpenClaw have internal context thresholds that affect performance:

  • OpenClaw triggers aggressive compaction below 32k context
  • Increasing context from 20k to 32k reduced startup time from 5 minutes to 2 minutes 17 seconds
  • Compaction passes dropped from 2 to 1 when matching num_ctx to framework thresholds
  • num_ctx must be baked into the Ollama Modelfile, because it is ignored at the request level when OpenClaw and other orchestrators go through Ollama's OpenAI-compatible API
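
One way to bake the context size in is to generate a small Modelfile and build a named model from it. The sketch below renders such a file in Python. `FROM` and `PARAMETER num_ctx` are standard Modelfile directives, but the GGUF path, the model name in the comment, and the choice of 32,768 for "32k" are placeholders, not values confirmed by the source.

```python
def render_modelfile(gguf_path: str, num_ctx: int = 32_768) -> str:
    """Render a minimal Ollama Modelfile that pins the context window."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

# Hypothetical local path to the downloaded Q3_K_XL GGUF file.
modelfile_text = render_modelfile("./GLM-4.7-Flash-Q3_K_XL.gguf")
print(modelfile_text)

# Save the text as "Modelfile", then build a model from it, for example:
#   ollama create glm-flash-32k -f Modelfile
# ("glm-flash-32k" is an arbitrary name chosen for this sketch.)
```

The orchestrator is then pointed at the new model name, so every request inherits the 32k window without having to pass num_ctx itself.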

Performance Testing Data

The developer provided specific timing data for various tasks:

Task                     Time   Input Tokens  Compactions  Result
Personality intro        119s   ~13,900      2            ✅
Profile recall           60s    13,247       2            ✅ w/ caveat
Task creation            61s    13,375       2            ✅
Memory write             165s   14,448       2            ✅
Memory recall            89s    14,085       2            ✅
Web search + synthesis   273s   18,668       2            ✅

MLX Considerations

The developer notes that MLX and GGUF are different formats: Unsloth/bartowski GGUF files cannot run with mlx-lm. Currently, no 3-bit Flash model exists in the mlx-community repository; only 4-bit models are available.

📖 Read the full source: r/openclaw