Benchmarking 88 Small GGUF Models on a 16GB Mac Mini M4

An automated pipeline was developed to download, benchmark, upload, and delete GGUF models in waves on a Mac Mini M4 with 16GB unified memory. The pipeline tested 88 models to find suitable local LLMs for this hardware configuration.
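The wave-based loop described above might look roughly like the following. This is a minimal sketch, not the author's code: `waves`, `run_pipeline`, and the four callables are hypothetical names, and the post does not say what exactly gets uploaded, so the sketch assumes it is the benchmark results for each wave.

```python
from typing import Callable, Dict, Iterator, List


def waves(items: List[str], size: int) -> Iterator[List[str]]:
    """Split the model list into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_pipeline(
    models: List[str],
    wave_size: int,
    download: Callable[[str], str],        # model name -> local file path
    benchmark: Callable[[str], object],    # local file path -> result
    upload: Callable[[Dict[str, object]], None],
    delete: Callable[[str], None],
) -> Dict[str, object]:
    """Download, benchmark, upload, and delete one wave at a time."""
    results: Dict[str, object] = {}
    for wave in waves(models, wave_size):
        paths = {name: download(name) for name in wave}
        try:
            wave_results = {name: benchmark(path) for name, path in paths.items()}
            upload(wave_results)
            results.update(wave_results)
        finally:
            # Free disk space even if a benchmark or upload fails.
            for path in paths.values():
                delete(path)
    return results
```

Processing in waves keeps only a few model files on disk at once, which matters when the full set of 88 models would not fit.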
Key Findings
- 9 out of 88 models are unusable on 16GB RAM - Any model whose weights plus KV cache exceed approximately 14GB causes memory thrashing, which shows up as a time to first token (TTFT) above 10 seconds or throughput below 0.1 tokens/second. This includes all dense 27B+ models.
- Only 4 models sit on the Pareto frontier of throughput vs quality - All four are quantizations of LFM2-8B-A1B, LiquidAI's mixture-of-experts (MoE) model. Only about 1B of its parameters are active per token, so it reaches 12-20 tokens/second where dense 8B models top out at 5-7 tokens/second.
- Context scaling from 1k to 4k is flat - Most models show zero throughput degradation, with some LFM2 variants actually speeding up at 4k context.
- Concurrency scaling is poor - Aggregate throughput at concurrency 2 is 0.57x the single-request rate, against an ideal of 2.0x. The Mac Mini is memory-bandwidth limited, so running one request at a time is recommended.
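The 14GB rule from the first finding can be turned into a back-of-the-envelope check. The sketch below uses the usual KV-cache size estimate for a transformer with attention in every layer (one K and one V tensor per layer); architectures that do not use attention in every layer would cache less. The function names and the example model shape are hypothetical and not taken from the benchmark, and the budget is treated as 14 GiB, which is an assumption about the post's "approximately 14GB".

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV-cache size: K and V each hold n_kv_heads * head_dim
    values per token per layer (bytes_per_elem=2 assumes an fp16 cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits_in_budget(weights_bytes, kv_bytes, budget_bytes=14 * GIB):
    """True if weights plus KV cache stay within the assumed memory budget."""
    return weights_bytes + kv_bytes <= budget_bytes


# Hypothetical dense model shape, for illustration only.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
print(f"KV cache: {kv / GIB:.2f} GiB")
print("5 GiB of weights fits:", fits_in_budget(5 * GIB, kv))
print("14 GiB of weights fits:", fits_in_budget(14 * GIB, kv))
```

With these made-up dimensions the cache comes to 0.5 GiB at 4k context, so the weight file size is what decides whether a model fits.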
Pareto Frontier Models
None of the other models tested is both faster and higher scoring than any of these four, which trade speed against quality among themselves:
- LFM2-8B-A1B-Q5_K_M (unsloth): 14.24 TPS average, 44.6 quality score
- LFM2-8B-A1B-Q8_0 (unsloth): 12.37 TPS average, 46.2 quality score
- LFM2-8B-A1B-UD-Q8_K_XL (unsloth): 12.18 TPS average, 47.9 quality score
- LFM2-8B-A1B-Q8_0 (LiquidAI): 12.18 TPS average, 51.2 quality score
Quality evaluation used compact subsets (20 GSM8K + 60 MMLU questions) - directionally useful for ranking but not publication-grade absolute numbers.
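Frontier membership for the four rows above can be checked with a small dominance test. This is a sketch under one stated assumption: a model counts as dominated only when another model is strictly better on both axes. The post does not state its tie rule, and two of the rounded figures share 12.18 TPS, so the strict rule is used here. `Result` and `pareto_frontier` are illustrative names.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    tps: float
    quality: float


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep every result that no other result beats on BOTH tps and quality.

    Assumption: dominance is strict on both axes, so ties are kept.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.tps > r.tps and o.quality > r.quality for o in results
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Figures as listed in the post.
results = [
    Result("LFM2-8B-A1B-Q5_K_M (unsloth)", 14.24, 44.6),
    Result("LFM2-8B-A1B-Q8_0 (unsloth)", 12.37, 46.2),
    Result("LFM2-8B-A1B-UD-Q8_K_XL (unsloth)", 12.18, 47.9),
    Result("LFM2-8B-A1B-Q8_0 (LiquidAI)", 12.18, 51.2),
]

for r in pareto_frontier(results):
    print(f"{r.name}: {r.tps} TPS, quality {r.quality}")
```

Running the same function over the full 88-model CSV would show which models fall off the frontier under this rule.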
Recommendations
For best quality, use LFM2-8B-A1B-Q8_0 (the LiquidAI build scored highest, at 51.2). For speed, use Q5_K_M. For a balance of the two, use UD-Q6_K_XL.
Technical Details
- Hardware: Mac Mini M4, 16GB unified memory, macOS 15.x
- Software: llama-server (llama.cpp)
- Methodology: Throughput numbers are p50 over multiple requests
- Data: All data is reproducible from artifacts in the repository
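As an illustration of the methodology line above, a p50 throughput figure can be computed from per-request measurements like this. This is a sketch only: the `(generated_tokens, elapsed_seconds)` sample format and the sample values are assumptions, not the pipeline's actual data layout.

```python
from statistics import median
from typing import Iterable, Tuple


def p50_tps(samples: Iterable[Tuple[int, float]]) -> float:
    """Median tokens/second over (generated_tokens, elapsed_seconds) pairs."""
    rates = [tokens / seconds for tokens, seconds in samples]
    return median(rates)


# Hypothetical measurements from three requests.
samples = [(128, 10.0), (128, 8.0), (128, 16.0)]
print(f"p50 throughput: {p50_tps(samples):.2f} tokens/second")
```

Using the median rather than the mean keeps a single slow request, such as a cold first load, from skewing the reported number.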
The full pipeline is automated and open source. CSV data with all 88 models and benchmark scripts are available in the repository.
📖 Read the full source: r/LocalLLaMA