Homelab Developer Benchmarks 19 Local LLMs with 45 Practical Tests on AMD Strix Halo

✍️ OpenClawRadar · 📅 Published: April 14, 2026

Practical Benchmarking for Real-World LLM Use Cases

A homelab developer ran extensive tests of local LLMs using a custom 45-test benchmark suite built around real use cases rather than generic academic benchmarks. The tests ran on an AMD Strix Halo system (Ryzen AI MAX+ 395, 128GB RAM, 96GB shared VRAM) using llama-server from the kyuz0 Docker image on the Vulkan/RADV backend.

Why Custom Benchmarks Matter

The developer uses Claude Opus for interactive coding but needs local models for 24/7 services including:

  • Email classification running every 15 minutes to sort 50+ emails
  • Camera notifications using vision models to describe motion alerts
  • Meal planning with dietary constraints
  • Finance analysis for tax scenarios and portfolio projections
  • Home Assistant automation generation and validation

These tasks require fast, reliable models with strong structured-output capabilities, which generic benchmarks such as MMLU do not adequately measure.
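
As a rough illustration of what "reliable structured output" means for a task like email classification, the sketch below validates a model reply before anything downstream acts on it. The category names, the JSON field names, and the `parse_classification` helper are assumptions made for this example; the source does not describe the developer's actual schema or code.

```python
import json

# Hypothetical label set; the source does not list the real categories.
ALLOWED_CATEGORIES = {"personal", "finance", "newsletter", "spam"}


def parse_classification(raw_reply: str) -> dict:
    """Parse a model reply and reject anything unusable downstream.

    Raises ValueError when the reply is not valid JSON, is not a JSON
    object, or carries a category outside ALLOWED_CATEGORIES.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    category = data.get("category")
    # Check the type first so a list or dict value cannot reach the set lookup.
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")

    return {"category": category, "reason": str(data.get("reason", ""))}
```

A model that wraps its JSON in conversational text, or invents a new label, fails this check. For an unattended job that runs every 15 minutes, that kind of failure matters more than a few points on a knowledge benchmark.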

The 45-Test Suite

The benchmark spans 12 categories. Each test is scored 0-10 by Claude Opus 4.6 against a test-specific rubric:

  • Coding (4 tests): Docker Compose, systemd services, Python scripts, code review
  • Homelab ops (6 tests): Memory analysis, OOM debugging, disk triage, network debug, log parsing
  • Tool calling (5 tests): Proxmox pct/qm commands, SSH chains, Docker ops, git workflows
  • Food/meal planning (6 tests): JSON meal plans, prep schedules, recipe scaling, shopping lists, nutrition
  • Finance (5 tests): Tax calculations, portfolio analysis, FIRE projections, tax-loss harvesting
  • Email classification (3 tests): Category assignment, ambiguous cases, unsubscribe decisions
  • Home Assistant (3 tests): Automation YAML, template sensors, conditions
  • Math (4 tests): Mortgage payoff, probability, number theory, tax optimization
  • Reasoning (3 tests): Energy bills, statistics, logic constraints
  • Instruction following (3 tests): Format compliance, JSON output, negative constraints
  • Long context (1 test): Extract facts from 8K-token infrastructure doc
  • Speed (2 tests): Time-to-first-token, sustained generation
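
The two speed metrics in the last category can be sketched as follows. This is a minimal illustration, not the developer's harness: it assumes the tokens arrive through any Python iterable (for example, a wrapper around a streaming response), and the `measure_stream` name and the injectable `clock` parameter are choices made for this example.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a token stream: seconds to first token, then tokens per second."""
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at  # arrival time of the first token
        count += 1

    if first_at is None:
        # The stream produced nothing, so neither metric is defined.
        return {"tokens": 0, "ttft_s": None, "tokens_per_s": None}

    generation_time = last_at - first_at
    # Sustained rate covers only the tokens that arrived after the first one.
    rate = (count - 1) / generation_time if generation_time > 0 else None
    return {"tokens": count, "ttft_s": first_at - start, "tokens_per_s": rate}
```

Separating the two numbers is useful because time-to-first-token is dominated by prompt processing, while the sustained rate reflects generation speed once output has started.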

Nine tests covering the developer's most common use cases are marked "critical" and weighted 2x, which brings the maximum possible score to 540 (36 tests × 10 points plus 9 tests × 20 points).
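
The arithmetic behind the 540-point maximum can be checked with a short sketch. The per-category counts come from the list above; the `weighted_total` helper and its test names are hypothetical, since the source does not say which nine tests are critical.

```python
# Test counts per category, as listed in the article.
TESTS_PER_CATEGORY = {
    "Coding": 4, "Homelab ops": 6, "Tool calling": 5, "Food/meal planning": 6,
    "Finance": 5, "Email classification": 3, "Home Assistant": 3, "Math": 4,
    "Reasoning": 3, "Instruction following": 3, "Long context": 1, "Speed": 2,
}
MAX_PER_TEST = 10
CRITICAL_TESTS = 9
CRITICAL_WEIGHT = 2

total_tests = sum(TESTS_PER_CATEGORY.values())
max_score = MAX_PER_TEST * (
    (total_tests - CRITICAL_TESTS) + CRITICAL_WEIGHT * CRITICAL_TESTS
)


def weighted_total(scores: dict[str, int], critical: set[str]) -> int:
    """Sum 0-10 scores, counting tests named in `critical` twice."""
    return sum(
        score * (CRITICAL_WEIGHT if name in critical else 1)
        for name, score in scores.items()
    )


print(total_tests, max_score)
```

The category counts sum to 45, and doubling nine of the tests yields the stated maximum of 540.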

Testing Methodology

Each test has a rubric defining what counts as a good answer. For example, the memory analysis test requires the model to recognize that "available" memory (22G), not "free" (5.7G), is the meaningful measure of usable memory, and that the swap usage shown is non-critical. The tax calculation test checks for correct AGI, taxable income, and bracket math. All raw responses and rubrics are saved for cross-checking.
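
To make "bracket math" concrete, here is a simplified sketch of the kind of calculation such a rubric would verify. The brackets, rates, and dollar amounts are invented for illustration and are not real tax tables or figures from the benchmark; the function names are likewise assumptions.

```python
# Illustrative brackets only: (upper bound of bracket, marginal rate in percent).
# None marks the open-ended top bracket. These are NOT real tax tables.
BRACKETS = [(10_000, 10), (40_000, 20), (None, 30)]


def adjusted_gross_income(gross: int, adjustments: int) -> int:
    """AGI in this simplified model: gross income minus adjustments."""
    return gross - adjustments


def taxable_income(agi: int, deduction: int) -> int:
    """Taxable income: AGI minus the deduction, never below zero."""
    return max(agi - deduction, 0)


def progressive_tax(income: int, brackets=BRACKETS) -> float:
    """Tax each slice of income at its own bracket's marginal rate."""
    tax = 0.0
    lower = 0
    for upper, rate_pct in brackets:
        if upper is None or income < upper:
            # Income ends inside this bracket: tax the remaining slice and stop.
            tax += (income - lower) * rate_pct / 100
            break
        # Income fills this bracket completely: tax the full slice.
        tax += (upper - lower) * rate_pct / 100
        lower = upper
    return tax


agi = adjusted_gross_income(gross=65_000, adjustments=3_000)
income = taxable_income(agi, deduction=12_000)
print(agi, income, progressive_tax(income))
```

A common model error this catches is applying the top marginal rate to the entire income instead of taxing each slice at its own rate.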

Models Tested

The developer tested 19 model configurations across 6 families on Vulkan with llama-server, including:

  • Qwen family: Qwen3.5-122B-A10B (MoE with 10B active parameters, previously the developer's production model) and Qwen3-Coder-Next 80B-A3B (3B active parameters)
  • Gemma 4 26B-A4B: finished on top once two separate bugs, which had initially made it appear broken, were fixed

The developer notes this isn't rigorous academic methodology but practical testing to determine which models work best for specific homelab tasks.

📖 Read the full source: r/LocalLLaMA
