Benchmarking 88 Small GGUF Models on a 16GB Mac Mini M4

By OpenClawRadar | Published: March 2, 2026

An automated pipeline downloads, benchmarks, uploads, and deletes GGUF models in waves on a Mac Mini M4 with 16GB of unified memory. It tested 88 models to find which local LLMs are practical on this hardware configuration.
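The wave structure can be sketched as follows. This is a minimal illustration, not the pipeline from the repository: the function names (`chunked`, `run_in_waves`) and the four injected callables are hypothetical placeholders, and it assumes that models are deleted after each wave so that only one wave of files is on disk at a time.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_waves(
    model_ids: List[str],
    wave_size: int,
    download: Callable[[str], str],            # hypothetical: model id -> local path
    benchmark: Callable[[str], dict],          # hypothetical: local path -> metrics
    upload: Callable[[Dict[str, dict]], None],  # hypothetical: publish one wave's metrics
    delete: Callable[[str], None],             # hypothetical: remove a local file
) -> Dict[str, dict]:
    """Process models wave by wave, deleting each wave's files before the next."""
    results: Dict[str, dict] = {}
    for wave in chunked(model_ids, wave_size):
        paths: Dict[str, str] = {}
        try:
            for model_id in wave:
                paths[model_id] = download(model_id)
            wave_results = {m: benchmark(p) for m, p in paths.items()}
            upload(wave_results)
            results.update(wave_results)
        finally:
            # Clean up even if a download or benchmark in this wave failed.
            for path in paths.values():
                delete(path)
    return results
```

Passing the four steps in as callables keeps the loop testable without touching the network or the disk.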

Key Findings

  • 9 out of 88 models are unusable on 16GB RAM - Any model whose weights plus KV cache exceed approximately 14GB causes memory thrashing, with time to first token (TTFT) above 10 seconds or throughput below 0.1 tokens/second. This includes all dense 27B+ models.
  • Only 4 models sit on the Pareto frontier of throughput vs quality - All four are quantizations of LFM2-8B-A1B, LiquidAI's MoE model. Because only about 1B parameters are active per token, they reach 12-20 tokens/second where dense 8B models top out at 5-7 tokens/second.
  • Context scaling from 1k to 4k is flat - Most models show zero throughput degradation, and some LFM2 variants actually speed up at 4k context.
  • Concurrency scaling is poor (0.57x at concurrency 2 vs an ideal 2.0x) - The Mac Mini is memory-bandwidth limited, so running one request at a time is recommended.
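The 14GB rule of thumb in the first finding can be checked with a rough estimate before downloading a model. The sketch below is an approximation under stated assumptions: it uses the standard transformer KV cache size (one K and one V tensor per layer, fp16 values), treats 1GB as 10^9 bytes, and uses hypothetical model dimensions and an assumed average of 4.5 bits per weight for a 4-bit quantization. Real GGUF files, and hybrid architectures such as LFM2, will differ.

```python
GB = 10**9  # assumption: decimal gigabytes


def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> int:
    """Approximate size of the weights for a given average bit width."""
    return int(n_params * bits_per_weight / 8)


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Standard transformer KV cache: K and V per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits_in_budget(weights: int, kv_cache: int, budget: int = 14 * GB) -> bool:
    """True if weights plus KV cache stay within the memory budget."""
    return weights + kv_cache <= budget


# Hypothetical dimensions: 32 layers, 8 KV heads, head size 128, 4k context.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=4096)

dense_27b = quantized_weight_bytes(27e9, 4.5)  # about 15.2GB of weights
dense_8b = quantized_weight_bytes(8e9, 4.5)    # about 4.5GB of weights

print(fits_in_budget(dense_27b, kv))  # False: weights alone exceed 14GB
print(fits_in_budget(dense_8b, kv))   # True
```

Under these assumptions, a dense 27B model is over budget from its weights alone, which is consistent with the finding that all dense 27B+ models thrash on this machine.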

Pareto Frontier Models

No other tested model beats any of these four on both speed and quality at the same time:

  • LFM2-8B-A1B-Q5_K_M (unsloth): 14.24 TPS average, 44.6 quality score
  • LFM2-8B-A1B-Q8_0 (unsloth): 12.37 TPS average, 46.2 quality score
  • LFM2-8B-A1B-UD-Q8_K_XL (unsloth): 12.18 TPS average, 47.9 quality score
  • LFM2-8B-A1B-Q8_0 (LiquidAI): 12.18 TPS average, 51.2 quality score
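For readers unfamiliar with the term, a model is on the Pareto frontier if no other model is at least as good on both axes and strictly better on one. The sketch below shows one way to compute that; the `Result` type and the data points are hypothetical and are not taken from the benchmark CSV.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    tps: float      # throughput in tokens/second, higher is better
    quality: float  # quality score, higher is better


def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse on both axes and better on at least one."""
    no_worse = a.tps >= b.tps and a.quality >= b.quality
    better = a.tps > b.tps or a.quality > b.quality
    return no_worse and better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results that no other result dominates, in input order."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical data for illustration only.
sample = [
    Result("model-a", 14.0, 44.0),
    Result("model-b", 12.0, 46.0),
    Result("model-c", 6.0, 40.0),   # slower and lower quality than model-a
    Result("model-d", 5.0, 50.0),
]

print([r.name for r in pareto_frontier(sample)])
# ['model-a', 'model-b', 'model-d']
```

The quadratic scan is fine at this scale, since 88 models means fewer than 8,000 comparisons.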

Quality scores come from compact subsets (20 GSM8K and 60 MMLU questions), so they are directionally useful for ranking but are not publication-grade absolute numbers.

Recommendations

For best quality: LFM2-8B-A1B-Q8_0. For speed: Q5_K_M. For balance: UD-Q6_K_XL.

Technical Details

  • Hardware: Mac Mini M4, 16GB unified memory, macOS 15.x
  • Software: llama-server (llama.cpp)
  • Methodology: Throughput numbers are the median (p50) over multiple requests
  • Data: All data is reproducible from artifacts in the repository
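A p50 throughput figure like the one described under Methodology can be computed from per-request measurements as shown below. This is a sketch, assuming each request is recorded as a pair of generated tokens and generation time in seconds; the function name and the sample values are hypothetical, and the repository's scripts may record and aggregate requests differently.

```python
from statistics import median
from typing import Iterable, Tuple


def p50_tokens_per_second(requests: Iterable[Tuple[int, float]]) -> float:
    """Median tokens/second over (generated_tokens, seconds) pairs."""
    rates = [tokens / seconds for tokens, seconds in requests if seconds > 0]
    if not rates:
        raise ValueError("no valid requests to summarize")
    return median(rates)


# Hypothetical measurements: three requests of 128 generated tokens each.
measurements = [(128, 10.0), (128, 8.0), (128, 12.8)]
print(round(p50_tokens_per_second(measurements), 2))  # 12.8
```

Reporting the median rather than the mean keeps a single slow request, for example one that hits a cold cache, from skewing the result.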

The full pipeline is automated and open source. CSV data with all 88 models and benchmark scripts are available in the repository.

📖 Read the full source: r/LocalLLaMA
