Homelab Developer Benchmarks 19 Local LLMs with 45 Practical Tests on AMD Strix Halo

Practical Benchmarking for Real-World LLM Use Cases
A developer with a homelab setup conducted extensive testing of local LLMs using a custom 45-test benchmark suite designed around actual use cases rather than generic academic benchmarks. The tests were run on an AMD Strix Halo system with Ryzen AI MAX+ 395, 128GB RAM, and 96GB shared VRAM using Vulkan/RADV with llama-server (kyuz0 Docker image).
Why Custom Benchmarks Matter
The developer uses Claude Opus for interactive coding but needs local models for 24/7 services including:
- Email classification running every 15 minutes to sort 50+ emails
- Camera notifications using vision models to describe motion alerts
- Meal planning with dietary constraints
- Finance analysis for tax scenarios and portfolio projections
- Home Assistant automation generation and validation
These tasks require fast, reliable models with strong structured-output capabilities, which generic benchmarks such as MMLU do not adequately measure.
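To illustrate why structured output matters for an unattended service such as the email classifier, here is a minimal sketch of a reply validator. It is not the developer's code: the category names, the `category` key, and the `parse_classification` helper are all hypothetical, chosen only to show the kind of check a 24/7 pipeline might apply before acting on a model reply.

```python
import json
from typing import Optional

# Hypothetical label set; the original post does not list the real categories.
ALLOWED_CATEGORIES = {"personal", "finance", "newsletter", "spam"}

def parse_classification(raw: str) -> Optional[dict]:
    """Return the parsed reply if it is usable by a pipeline, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # reply was not pure JSON (for example, wrapped in prose)
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None  # missing or unknown label
    return data

good = parse_classification('{"category": "finance", "reason": "bank statement"}')
bad = parse_classification('Sure! Here is the JSON: {"category": "finance"}')
print(good)
print(bad)
```

A model that adds a friendly preamble around otherwise correct JSON fails this check, which is the sort of failure a knowledge benchmark would not surface.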
The 45-Test Suite
The benchmark includes tests across 12 categories, each scored 0-10 by Claude Opus 4.6 against specific rubrics:
- Coding (4 tests): Docker Compose, systemd services, Python scripts, code review
- Homelab ops (6 tests): Memory analysis, OOM debugging, disk triage, network debug, log parsing
- Tool calling (5 tests): Proxmox pct/qm commands, SSH chains, Docker ops, git workflows
- Food/meal planning (6 tests): JSON meal plans, prep schedules, recipe scaling, shopping lists, nutrition
- Finance (5 tests): Tax calculations, portfolio analysis, FIRE projections, tax-loss harvesting
- Email classification (3 tests): Category assignment, ambiguous cases, unsubscribe decisions
- Home Assistant (3 tests): Automation YAML, template sensors, conditions
- Math (4 tests): Mortgage payoff, probability, number theory, tax optimization
- Reasoning (3 tests): Energy bills, statistics, logic constraints
- Instruction following (3 tests): Format compliance, JSON output, negative constraints
- Long context (1 test): Extract facts from 8K-token infrastructure doc
- Speed (2 tests): Time-to-first-token, sustained generation
Nine tests are weighted 2x as "critical" for the developer's most common use cases, with a maximum possible score of 540.
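The 540-point maximum follows from the numbers above: 36 regular tests at 10 points each plus 9 critical tests counted twice. The sketch below reproduces that arithmetic; the function names are illustrative, and the exact way the developer aggregates scores is an assumption.

```python
# Per-category test counts as listed in the article.
CATEGORY_COUNTS = {
    "coding": 4, "homelab_ops": 6, "tool_calling": 5, "food": 6,
    "finance": 5, "email": 3, "home_assistant": 3, "math": 4,
    "reasoning": 3, "instruction_following": 3, "long_context": 1, "speed": 2,
}
MAX_PER_TEST = 10      # each test is scored 0-10
CRITICAL_TESTS = 9     # tests weighted 2x
CRITICAL_WEIGHT = 2

def max_score(total_tests: int, critical: int) -> int:
    """Maximum weighted score when `critical` tests count CRITICAL_WEIGHT times."""
    regular = total_tests - critical
    return (regular + critical * CRITICAL_WEIGHT) * MAX_PER_TEST

def weighted_total(results) -> int:
    """Sum (score, is_critical) pairs, applying the critical weight."""
    return sum(score * (CRITICAL_WEIGHT if crit else 1) for score, crit in results)

total_tests = sum(CATEGORY_COUNTS.values())
print(total_tests)                             # 45
print(max_score(total_tests, CRITICAL_TESTS))  # 540
```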
Testing Methodology
Each test has a specific rubric defining what constitutes a good answer. For example, the memory analysis test requires the model to recognize that "available" memory (22G), not "free" (5.7G), is the meaningful measure of usable memory, and that the swap usage shown is non-critical. The tax calculation test checks for correct AGI, taxable income, and bracket math. All raw responses and rubrics are saved for cross-checking.
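The distinction in the memory analysis rubric can be seen by reading the two columns from `free`-style output. The sample below is illustrative only: the 5.7G and 22G figures come from the article, while every other value and the `mem_columns` helper are assumptions, since the actual test input is not shown.

```python
# Illustrative `free -h` style output; only "free" and "available" match the article.
SAMPLE = """\
               total        used        free      shared  buff/cache   available
Mem:             31G        8.6G        5.7G        412M         17G         22G
Swap:           8.0G        1.2G        6.8G
"""

def mem_columns(free_output: str) -> dict:
    """Map the header names to the values on the 'Mem:' row."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem_row = next(line for line in lines if line.startswith("Mem:")).split()
    return dict(zip(header, mem_row[1:]))  # skip the 'Mem:' label itself

cols = mem_columns(SAMPLE)
# "free" excludes reclaimable page cache; "available" estimates what new
# workloads can actually use, so it is the figure the rubric expects.
print(cols["free"], cols["available"])
```

On Linux, most of the gap between the two figures is page cache that the kernel can reclaim on demand, which is why a model that reports only 5.7G as usable would be marked down.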
Models Tested
The developer tested 19 model configurations across 6 families on Vulkan with llama-server, including:
- Qwen family: Qwen3.5-122B-A10B (MoE with 10B active parameters, previously the developer's production model) and Qwen3-Coder-Next 80B-A3B (3B active)
- Gemma 4 26B-A4B: ended up on top once two separate bugs that initially made it appear broken were fixed
The developer notes this isn't rigorous academic methodology but practical testing to determine which models work best for specific homelab tasks.
📖 Read the full source: r/LocalLLaMA