19 Local LLMs Benchmarked: 45 Tests on AMD Strix Halo

Practical Benchmarking for Real-World LLM Use Cases

A developer with a homelab setup conducted extensive testing of local LLMs using a custom 45-test benchmark suite designed around actual use cases rather than generic academic benchmarks. The tests were run on an AMD Strix Halo system with Ryzen AI MAX+ 395, 128GB RAM, and 96GB shared VRAM using Vulkan/RADV with llama-server (kyuz0 Docker image).

Why Custom Benchmarks Matter

The developer uses Claude Opus for interactive coding but needs local models for 24/7 services including:

Email classification running every 15 minutes to sort 50+ emails
Camera notifications using vision models to describe motion alerts
Meal planning with dietary constraints
Finance analysis for tax scenarios and portfolio projections
Home Assistant automation generation and validation

These tasks require fast, reliable models with good structured output capabilities that generic benchmarks like MMLU scores don't adequately measure.

The 45-Test Suite

The benchmark includes tests across 12 categories, each scored 0-10 by Claude Opus 4.6 against specific rubrics:

Coding (4 tests): Docker Compose, systemd services, Python scripts, code review
Homelab ops (6 tests): Memory analysis, OOM debugging, disk triage, network debug, log parsing
Tool calling (5 tests): Proxmox pct/qm commands, SSH chains, Docker ops, git workflows
Food/meal planning (6 tests): JSON meal plans, prep schedules, recipe scaling, shopping lists, nutrition
Finance (5 tests): Tax calculations, portfolio analysis, FIRE projections, tax-loss harvesting
Email classification (3 tests): Category assignment, ambiguous cases, unsubscribe decisions
Home Assistant (3 tests): Automation YAML, template sensors, conditions
Math (4 tests): Mortgage payoff, probability, number theory, tax optimization
Reasoning (3 tests): Energy bills, statistics, logic constraints
Instruction following (3 tests): Format compliance, JSON output, negative constraints
Long context (1 test): Extract facts from 8K-token infrastructure doc
Speed (2 tests): Time-to-first-token, sustained generation

Nine tests are weighted 2x as "critical" for the developer's most common use cases, with a maximum possible score of 540.

Testing Methodology

Each test has specific rubrics defining what constitutes a good answer. For example, the memory analysis test requires correctly identifying that "available" memory (22G) is the real free metric, not "free" (5.7G), and that swap usage is non-critical. The tax calculation test checks for correct AGI, taxable income, and bracket math. All raw responses and rubrics are saved for cross-checking.

Models Tested

The developer tested 19 model configurations across 6 families on Vulkan with llama-server, including:

Qwen family: Qwen3.5-122B-A10B (10B active MoE) - previously used in production, Qwen3-Coder-Next 80B-A3B (3B active)
Gemma 4 26B-A4B - ended up on top after fixing two separate bugs that made it appear broken initially

The developer notes this isn't rigorous academic methodology but practical testing to determine which models work best for specific homelab tasks.

📖 Read the full source: r/LocalLLaMA