331 GGUF Models Benchmarked on Mac Mini M4 16GB: Only 11 Pareto-Optimal

A comprehensive benchmark tested 331 GGUF models on a Mac Mini M4 with 16GB unified memory to identify viable options for local deployment. The automated testing pipeline ran for weeks, replacing subjective model selection with measured results.

Key Findings

31 of the 331 models were completely unusable on 16GB hardware, defined as a time-to-first-token (TTFT) above 10 seconds or throughput below 0.1 tokens/second. These models technically load but then thrash memory. Every 27B+ dense model tested fell into this category; the worst performer, Qwen3.5-27B-heretic-v2-Q4_K_S, had a 97-second TTFT and 0.007 tokens/second.
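The "unusable" rule is simple enough to express directly. The following is a minimal sketch, not the benchmark's actual code: the `Result` type and `is_unusable` function are hypothetical names, while the thresholds and the two sample measurements are the ones quoted in this article.

```python
from dataclasses import dataclass

# Thresholds quoted in the article.
MAX_TTFT_S = 10.0      # time-to-first-token above this is unusable
MIN_TOK_PER_S = 0.1    # throughput below this is unusable


@dataclass
class Result:
    """One benchmark measurement (hypothetical record layout)."""
    name: str
    ttft_s: float
    tok_per_s: float


def is_unusable(r: Result) -> bool:
    """A model is unusable if either threshold is violated."""
    return r.ttft_s > MAX_TTFT_S or r.tok_per_s < MIN_TOK_PER_S


# Figures taken from the article.
worst = Result("Qwen3.5-27B-heretic-v2-Q4_K_S", ttft_s=97.0, tok_per_s=0.007)
best = Result("LFM2-8B-A1B-UD-Q6_K_XL", ttft_s=0.48, tok_per_s=13.9)

print(is_unusable(worst), is_unusable(best))
```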

When model weights plus KV cache exceed approximately 14GB, performance "falls off a cliff." Dense models above 14B are memory-bandwidth-starved on this hardware.
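The 14GB threshold can be turned into a rough check before downloading a model. The sketch below assumes a standard transformer KV cache (keys and values stored per layer, per KV head, per token) at 16-bit precision. The layer and head counts are hypothetical and do not describe any model in the benchmark, and reading "14GB" as 14 GiB is also an assumption. Runtime overhead beyond weights and KV cache is not modeled.

```python
GIB = 1024 ** 3
BUDGET_BYTES = 14 * GIB  # approximate cliff from the article, read here as 14 GiB


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the KV cache: keys plus values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits(weights_bytes, kv_bytes, budget=BUDGET_BYTES):
    """True if weights plus KV cache stay within the budget."""
    return weights_bytes + kv_bytes <= budget


# Hypothetical model shape, for illustration only.
kv = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_len=4096)
print(kv / GIB)                 # KV cache size in GiB
print(fits(8 * GIB, kv))        # a small quant leaves plenty of headroom
print(fits(13.5 * GIB, kv))     # weights alone fit, but the KV cache tips it over
```

The last case is the interesting one: a file that looks like it fits on disk size alone can still cross the threshold once the cache for the chosen context length is counted.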

Architecture Comparison

Mixture-of-Experts (MoE) models dominate on 16GB hardware:

Median tokens/second: MoE 20.0 vs Dense 4.4
Median TTFT: MoE 0.66s vs Dense 0.87s
Maximum quality score: MoE 50.4 vs Dense 46.2

MoE models with 1-3B active parameters fit in GPU memory while achieving quality comparable to much larger dense models.

Pareto-Optimal Models

Only 11 models out of 331 sit on the Pareto frontier (no other model beats them on both speed and quality):

Ling-mini-2.0 (Q4_K_S, abliterated): 50.3 tok/s, 24.2 quality
Ling-mini-2.0 (IQ4_NL): 49.8 tok/s, 25.8 quality
Ling-mini-2.0 (Q3_K_L): 46.3 tok/s, 26.2 quality
Ling-mini-2.0 (Q3_K_L, abliterated): 46.0 tok/s, 28.3 quality
Ling-Coder-lite (IQ4_NL): 24.3 tok/s, 29.2 quality
Ling-Coder-lite (Q4_0): 23.6 tok/s, 31.3 quality
LFM2-8B-A1B (Q5_K_M): 19.7 tok/s, 44.6 quality
LFM2-8B-A1B (Q5_K_XL): 18.9 tok/s, 44.6 quality
LFM2-8B-A1B (Q8_0): 15.1 tok/s, 46.2 quality
LFM2-8B-A1B (Q8_K_XL): 14.9 tok/s, 47.9 quality
LFM2-8B-A1B (Q6_K_XL): 13.9 tok/s, 50.4 quality

Every single Pareto-optimal model is MoE architecture. Every other model in the 331 is strictly dominated by one of these eleven.
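For readers who want to reproduce the filter on their own results, here is a minimal sketch of a Pareto frontier using the definition given above (a model is kept unless some other model beats it on both speed and quality). The function names are hypothetical, the eleven data points are copied from the list above, and the extra "hypothetical-dense" row is invented purely to show a model being filtered out.

```python
def beats(a, b):
    """True if model a is strictly better than model b on both speed and quality."""
    return a[1] > b[1] and a[2] > b[2]


def pareto_front(models):
    """Keep every model that no other model beats on both axes."""
    return [m for m in models if not any(beats(o, m) for o in models)]


# (name, tokens/second, quality) as listed in the article.
models = [
    ("Ling-mini-2.0 Q4_K_S abliterated", 50.3, 24.2),
    ("Ling-mini-2.0 IQ4_NL", 49.8, 25.8),
    ("Ling-mini-2.0 Q3_K_L", 46.3, 26.2),
    ("Ling-mini-2.0 Q3_K_L abliterated", 46.0, 28.3),
    ("Ling-Coder-lite IQ4_NL", 24.3, 29.2),
    ("Ling-Coder-lite Q4_0", 23.6, 31.3),
    ("LFM2-8B-A1B Q5_K_M", 19.7, 44.6),
    ("LFM2-8B-A1B Q5_K_XL", 18.9, 44.6),
    ("LFM2-8B-A1B Q8_0", 15.1, 46.2),
    ("LFM2-8B-A1B Q8_K_XL", 14.9, 47.9),
    ("LFM2-8B-A1B Q6_K_XL", 13.9, 50.4),
    # Invented row, not from the benchmark: slower and lower quality than LFM2 Q5_K_M.
    ("hypothetical-dense", 4.4, 30.0),
]

front = pareto_front(models)
print(len(front))
```

With this definition all eleven listed models survive and the invented row is dropped. Under the stricter textbook definition (at least as good on both axes and better on one), the Q5_K_XL entry would also drop out, since it ties Q5_K_M on quality at a lower speed.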

Context and Concurrency Performance

Context scaling shows surprisingly flat performance: median tokens/second ratio (4096 vs 1024 context) is 1.0x. Most models show zero degradation going from 1k to 4k context, with some MoE models actually speeding up at 4k. The memory bandwidth cliff hasn't hit yet at 4k on this hardware.

Concurrency is a net loss: at concurrency 2, per-request throughput drops to 0.55x (ideal would be 1.0x). Two concurrent requests fight for the same unified memory bus. The recommendation is to run one request at a time on 16GB hardware.
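The 0.55x figure is worth unpacking with a little arithmetic. This sketch assumes both requests run at the same reduced rate and generate the same number of tokens; the function names are hypothetical.

```python
def aggregate_throughput_ratio(per_request_ratio, concurrency):
    """Total tokens/second across all requests, relative to a single request."""
    return per_request_ratio * concurrency


def per_request_slowdown(per_request_ratio):
    """How much longer each request takes to produce the same number of tokens."""
    return 1.0 / per_request_ratio


agg = aggregate_throughput_ratio(0.55, 2)
slow = per_request_slowdown(0.55)
print(round(agg, 2), round(slow, 2))
```

Under these assumptions the machine as a whole produces only about 10% more tokens per second, while each individual request takes roughly 1.8 times as long, which is why serial execution is the better trade on this hardware.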

Top Recommendations

LFM2-8B-A1B-UD-Q6_K_XL (unsloth) - Best overall: 50.4 quality composite (highest of all 331 models), 13.9 tokens/second, 0.48s TTFT. MoE with 1B active parameters - architecturally ideal for 16GB.
LFM2-8B-A1B-Q5_K_M (unsloth) - Best speed among quality models: 19.7 tokens/second (fastest LFM2 variant), 44.6 quality (about 6 points below the top). As the smallest quant of the three, it leaves the most headroom for longer contexts.
LFM2-8B-A1B-UD-Q8_K_XL (unsloth) - Balanced performance option.

📖 Read the full source: r/LocalLLaMA