15 LLMs Benchmarked on 38 Workflow Tasks

A developer built a benchmark harness to decide which LLMs to route work to, testing 15 models on 38 tasks drawn from their real workflow. Tasks included CSV transforms, letter counting, modular arithmetic, format compliance, and multi-step instructions. All tasks were scored programmatically with regex and exact-match checks; no LLM judge was involved.
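
The source does not include the harness code, so the following is only a minimal sketch of what regex and exact-match scoring can look like. The function names and the two sample tasks are hypothetical and chosen for illustration.

```python
import re


def score_exact(output: str, expected: str) -> float:
    """Return 1.0 if the trimmed output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def score_regex(output: str, pattern: str) -> float:
    """Return 1.0 if the whole trimmed output matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


# Hypothetical letter-counting task: the expected answer is "3".
print(score_exact(" 3\n", "3"))

# Hypothetical format-compliance task: the answer must be an ISO-style date.
iso_date = r"\d{4}-\d{2}-\d{2}"
print(score_regex("2024-05-01", iso_date))
print(score_regex("May 1, 2024", iso_date))
```

Because `re.fullmatch` requires the entire string to match, an answer wrapped in extra commentary fails the format check, which is usually what a compliance task is meant to catch.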

Benchmark Results

The benchmark made 570 API calls (15 models × 38 tasks) at a total cost of $2.29. Key findings:

Claude 3.5 Opus: 100% score, $0.69 per run, 14.2 seconds
Claude 3.5 Sonnet: 100% score, $0.20 per run, 5.1 seconds
MiniMax M2.5: 98.6% score, $0.02 per run, 2.3 seconds
Kimi K2.5: 98.6% score, $0.05 per run, 3.8 seconds
GPT-oss-20b (local): 98.3% score, $0 per run, 4.1 seconds
Gemini 2.5 Flash: 97.1% score, $0.003 per run, 1.1 seconds
Claude 3.5 Haiku: 96.9% score, $0.02 per run, 1.8 seconds
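
One way to turn results like these into a routing decision is to pick the cheapest model that clears a minimum score. The sketch below is an illustration only, not the developer's method: the `Result` type and `cheapest_at_least` helper are hypothetical, and the Gemini Flash cost uses the $0.003 figure quoted in the cost analysis.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float  # percent score on the benchmark
    cost: float   # USD per run


# Figures copied from the results list; not the full set of 15 models.
RESULTS = [
    Result("Claude 3.5 Opus", 100.0, 0.69),
    Result("Claude 3.5 Sonnet", 100.0, 0.20),
    Result("MiniMax M2.5", 98.6, 0.02),
    Result("Kimi K2.5", 98.6, 0.05),
    Result("GPT-oss-20b (local)", 98.3, 0.0),
    Result("Gemini 2.5 Flash", 97.1, 0.003),
    Result("Claude 3.5 Haiku", 96.9, 0.02),
]


def cheapest_at_least(results: list[Result], min_score: float) -> Optional[Result]:
    """Return the lowest-cost result scoring at least min_score, or None."""
    eligible = [r for r in results if r.score >= min_score]
    return min(eligible, key=lambda r: r.cost, default=None)


print(cheapest_at_least(RESULTS, 100.0).model)  # Claude 3.5 Sonnet
print(cheapest_at_least(RESULTS, 98.5).model)   # MiniMax M2.5
```

A real router would likely weigh latency and task type as well; this rule looks only at score and cost.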

Cost-Performance Analysis

Sonnet and Opus both scored 100%, but Opus costs about 3.5x as much per run ($0.69 versus $0.20). For the developer's day-to-day tasks, Sonnet handles everything Opus does. Gemini Flash at $0.003 per run versus Opus at $0.69 per run is reported as a 265x cost difference for a 2.9-point performance gap; the rounded per-run figures quoted here give roughly 230x.
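
The arithmetic behind these comparisons can be checked directly. This sketch uses only the rounded per-run figures quoted in this summary; the unrounded costs are not given, so the ratios may differ slightly from the source's own numbers. The `cost_ratio` helper is a hypothetical name.

```python
def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive option costs than the cheap one."""
    if cheap <= 0:
        raise ValueError("cheap cost must be positive to form a ratio")
    return expensive / cheap


# Rounded per-run costs in USD, as quoted in this summary.
opus, sonnet, flash = 0.69, 0.20, 0.003

print(round(cost_ratio(opus, sonnet), 2))  # Opus vs Sonnet
print(round(cost_ratio(opus, flash)))      # Opus vs Gemini Flash
print(round(100.0 - 97.1, 1))              # score gap in percentage points
```

The guard against a zero cost matters here because the local GPT-oss-20b run is listed at $0, for which a ratio is undefined.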

Surprising Findings

MiniMax M2.5 and Kimi K2.5 both achieved 98.6% with 100% format compliance; the developer had not used either model before running the benchmark. GPT-oss-20b running locally scored 98.3% for $0, outperforming Haiku and DeepSeek R1.

QA Process

Quality assurance on the harness itself uncovered scoring bugs. Initial results showed Haiku beating Sonnet, which turned out to be an artifact of a scorer bug that produced quality scores above 100%. Five QA passes were conducted, each with a different model, and each found bugs the previous ones had missed.
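
A bug like scores above 100% is the kind of thing a simple range check can surface before results are compared. The sketch below is an assumption about how such a check might look, not the developer's actual QA process; the function name and sample data are hypothetical.

```python
def out_of_range(scores: dict[str, float]) -> list[str]:
    """Return the names whose score is not within 0 to 100 inclusive.

    NaN values are also flagged, because every comparison with NaN is False.
    """
    return [name for name, s in scores.items() if not 0.0 <= s <= 100.0]


# Hypothetical raw scorer output; 103.5 mimics the kind of bug described above.
raw = {"model-a": 103.5, "model-b": 100.0, "model-c": 96.9}
print(out_of_range(raw))  # ['model-a']
```

Running a check like this on every scorer output makes an impossible ranking fail loudly instead of looking like a surprising result.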

The developer is changing their daily driver to Sonnet based on these results but plans to switch between models more frequently given the performance variations.

📖 Read the full source: r/ClaudeAI