Benchmark Results: 331 GGUF Models Tested on Mac Mini M4 16GB

A comprehensive benchmark tested 331 GGUF models on a Mac Mini M4 with 16GB unified memory to identify viable options for local deployment. The testing pipeline ran for weeks, automating model evaluation to move beyond subjective selection.
Key Findings
31 of the 331 models were completely unusable on 16GB hardware, where "unusable" means a time-to-first-token (TTFT) above 10 seconds or throughput below 0.1 tokens/second. These models technically load but thrash memory once running. Every 27B+ dense model tested fell into this category; the worst performer was Qwen3.5-27B-heretic-v2-Q4_K_S, at a 97-second TTFT and 0.007 tokens/second.
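The cutoff above is simple enough to express as a filter. The following is a minimal sketch, not the benchmark's actual pipeline: the `RunResult` type and its field names are assumptions made for illustration, while the thresholds and the two sample measurements come from the figures reported in this article.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; the names are illustrative.
MAX_TTFT_S = 10.0      # time-to-first-token ceiling, in seconds
MIN_TOK_PER_S = 0.1    # throughput floor, in tokens/second


@dataclass
class RunResult:
    """Hypothetical record for one benchmarked model."""
    name: str
    ttft_s: float
    tok_per_s: float


def is_unusable(result: RunResult) -> bool:
    """A model is unusable if it misses either threshold."""
    return result.ttft_s > MAX_TTFT_S or result.tok_per_s < MIN_TOK_PER_S


# Measurements reported in this article.
worst = RunResult("Qwen3.5-27B-heretic-v2-Q4_K_S", ttft_s=97.0, tok_per_s=0.007)
best = RunResult("LFM2-8B-A1B-UD-Q6_K_XL", ttft_s=0.48, tok_per_s=13.9)

for r in (worst, best):
    print(f"{r.name}: {'unusable' if is_unusable(r) else 'usable'}")
```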
When model weights plus KV cache exceed approximately 14GB, performance "falls off a cliff." Dense models above 14B are memory-bandwidth-starved on this hardware.
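To see how weights and KV cache add up against that limit, here is a rough back-of-the-envelope sketch. It assumes a plain transformer with a 16-bit KV cache, where each token stores one key and one value vector per layer and per KV head; hybrid or otherwise non-standard architectures would need a different formula. The model shape and weight sizes in the example are hypothetical and do not describe any model in the benchmark, and the 14 GB budget is the approximate figure quoted above.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for a standard attention stack.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits_budget(weights_gb: float, kv_bytes: int, budget_gb: float = 14.0) -> bool:
    """Check weights plus KV cache against the approximate budget (decimal GB)."""
    return weights_gb + kv_bytes / 1e9 <= budget_gb


# Hypothetical shape: 48 layers, 8 KV heads, head dimension 128, 4096-token context.
kv = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_len=4096)
print(f"KV cache: {kv / 1e9:.2f} GB")

# Hypothetical weight file sizes, for illustration only.
print(fits_budget(weights_gb=8.0, kv_bytes=kv))
print(fits_budget(weights_gb=15.0, kv_bytes=kv))
```

The estimate ignores activation buffers and whatever the operating system itself needs, so the real headroom is likely smaller than this arithmetic suggests.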
Architecture Comparison
Mixture-of-Experts (MoE) models dominate on 16GB hardware:
- Median tokens/second: MoE 20.0 vs Dense 4.4
- Median TTFT: MoE 0.66s vs Dense 0.87s
- Maximum quality score: MoE 50.4 vs Dense 46.2
MoE models with 1-3B active parameters fit in GPU memory while achieving quality comparable to much larger dense models.
Pareto-Optimal Models
Only 11 models out of 331 sit on the Pareto frontier (no other model beats them on both speed and quality):
- Ling-mini-2.0 (Q4_K_S, abliterated): 50.3 tok/s, 24.2 quality
- Ling-mini-2.0 (IQ4_NL): 49.8 tok/s, 25.8 quality
- Ling-mini-2.0 (Q3_K_L): 46.3 tok/s, 26.2 quality
- Ling-mini-2.0 (Q3_K_L, abliterated): 46.0 tok/s, 28.3 quality
- Ling-Coder-lite (IQ4_NL): 24.3 tok/s, 29.2 quality
- Ling-Coder-lite (Q4_0): 23.6 tok/s, 31.3 quality
- LFM2-8B-A1B (Q5_K_M): 19.7 tok/s, 44.6 quality
- LFM2-8B-A1B (Q5_K_XL): 18.9 tok/s, 44.6 quality
- LFM2-8B-A1B (Q8_0): 15.1 tok/s, 46.2 quality
- LFM2-8B-A1B (Q8_K_XL): 14.9 tok/s, 47.9 quality
- LFM2-8B-A1B (Q6_K_XL): 13.9 tok/s, 50.4 quality
Every single Pareto-optimal model is an MoE architecture. Every other model in the 331 is strictly dominated by at least one of these eleven.
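The frontier test itself is a short pairwise comparison. The sketch below applies it to the eleven listed models only, using the rounded speed and quality figures from the list above; the function names are illustrative and this is not the benchmark's own code. It implements the definition given in this article (a model is dominated when another beats it on both axes) and, for comparison, the more common weak definition (at least as good on both axes, strictly better on one).

```python
# (name, tokens/second, quality) as reported in the list above.
MODELS = [
    ("Ling-mini-2.0 Q4_K_S abliterated", 50.3, 24.2),
    ("Ling-mini-2.0 IQ4_NL", 49.8, 25.8),
    ("Ling-mini-2.0 Q3_K_L", 46.3, 26.2),
    ("Ling-mini-2.0 Q3_K_L abliterated", 46.0, 28.3),
    ("Ling-Coder-lite IQ4_NL", 24.3, 29.2),
    ("Ling-Coder-lite Q4_0", 23.6, 31.3),
    ("LFM2-8B-A1B Q5_K_M", 19.7, 44.6),
    ("LFM2-8B-A1B Q5_K_XL", 18.9, 44.6),
    ("LFM2-8B-A1B Q8_0", 15.1, 46.2),
    ("LFM2-8B-A1B Q8_K_XL", 14.9, 47.9),
    ("LFM2-8B-A1B Q6_K_XL", 13.9, 50.4),
]


def dominates(a, b, strict=True):
    """Return True if model a dominates model b."""
    _, a_speed, a_quality = a
    _, b_speed, b_quality = b
    if strict:
        # Definition used in this article: better on both axes.
        return a_speed > b_speed and a_quality > b_quality
    # Weak definition: no worse on either axis, better on at least one.
    return (a_speed >= b_speed and a_quality >= b_quality
            and (a_speed > b_speed or a_quality > b_quality))


def pareto_frontier(models, strict=True):
    """Keep every model that no other model dominates."""
    return [m for m in models
            if not any(dominates(other, m, strict)
                       for other in models if other is not m)]


print(len(pareto_frontier(MODELS)))                 # 11
print(len(pareto_frontier(MODELS, strict=False)))   # 10
```

With the rounded figures, the weak definition drops LFM2-8B-A1B Q5_K_XL, because Q5_K_M matches its 44.6 quality at a higher speed. The unrounded scores may separate the two, so this is an observation about the published numbers rather than a correction.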
Context and Concurrency Performance
Context scaling shows surprisingly flat performance: median tokens/second ratio (4096 vs 1024 context) is 1.0x. Most models show zero degradation going from 1k to 4k context, with some MoE models actually speeding up at 4k. The memory bandwidth cliff hasn't hit yet at 4k on this hardware.
Concurrency is a net loss: at concurrency 2, per-request throughput drops to 0.55x (ideal would be 1.0x). Two concurrent requests fight for the same unified memory bus. The recommendation is to run one request at a time on 16GB hardware.
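The 0.55x figure can be turned into a combined-throughput estimate with one multiplication. This sketch assumes the ratio is each request's throughput relative to running alone, which is how the text reads; the function name is illustrative.

```python
def aggregate_speedup(concurrency: int, per_request_ratio: float) -> float:
    """Combined throughput relative to a single solo request."""
    return concurrency * per_request_ratio


combined = aggregate_speedup(concurrency=2, per_request_ratio=0.55)
print(f"combined throughput: {combined:.2f}x of one solo request")
```

Under that assumption, two concurrent requests produce only about 10% more total output than one, while each caller sees generation run at roughly half speed. Perfect scaling would give 2.0x combined, which is why running a single request at a time is the practical choice here.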
Top Recommendations
- LFM2-8B-A1B-UD-Q6_K_XL (unsloth) - Best overall: 50.4 quality composite (highest of all 331 models), 13.9 tokens/second, 0.48s TTFT. MoE with 1B active parameters - architecturally ideal for 16GB.
- LFM2-8B-A1B-Q5_K_M (unsloth) - Best speed among quality models: 19.7 tokens/second (fastest LFM2 variant), 44.6 quality (5.8 points below the top). As the smallest quant of the three, it leaves the most headroom for longer contexts.
- LFM2-8B-A1B-UD-Q8_K_XL (unsloth) - Balanced option: 47.9 quality at 14.9 tokens/second, sitting between the two picks above on quality.
📖 Read the full source: r/LocalLLaMA