Benchmark Results: 15 LLMs Tested on 38 Real Workflow Tasks

A developer built a benchmark harness to determine which LLMs to route work to, testing 15 models on 38 tasks from their real workflow. Tasks included CSV transforms, letter counting, modular arithmetic, format compliance, and multi-step instructions. All tasks were scored programmatically using regex and exact match—no LLM judge was involved.
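As an illustration of judge-free scoring, here is a minimal sketch of a programmatic scorer that uses exact match and regex checks. The `Task` fields and function names are hypothetical and are not taken from the developer's harness. Binary pass/fail scoring is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """One benchmark task; all field names here are hypothetical."""
    prompt: str
    expected: Optional[str] = None  # exact-match target, if any
    pattern: Optional[str] = None   # regex the whole output must match, if any


def score_output(task: Task, output: str) -> float:
    """Return 1.0 for a pass and 0.0 for a fail, with no LLM judge involved."""
    text = output.strip()
    if task.expected is not None:
        return 1.0 if text == task.expected.strip() else 0.0
    if task.pattern is not None:
        return 1.0 if re.fullmatch(task.pattern, text, flags=re.DOTALL) else 0.0
    raise ValueError("task needs either an expected string or a pattern")


def percent_score(results: list[float]) -> float:
    """Average pass rate expressed as a percentage."""
    return 100.0 * sum(results) / len(results)
```

A letter-counting task would use `expected`, while a format-compliance task (for example, an ISO date) would use `pattern`.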
Benchmark Results
The benchmark involved 570 API calls (15 models × 38 tasks) costing $2.29 in total. Key findings:
- Claude 3.5 Opus: 100% score, $0.69 per run, 14.2 seconds
- Claude 3.5 Sonnet: 100% score, $0.20 per run, 5.1 seconds
- MiniMax M2.5: 98.6% score, $0.02 per run, 2.3 seconds
- Kimi K2.5: 98.6% score, $0.05 per run, 3.8 seconds
- GPT-oss-20b (local): 98.3% score, $0 per run, 4.1 seconds
- Gemini 2.5 Flash: 97.1% score, $0.003 per run, 1.1 seconds
- Claude 3.5 Haiku: 96.9% score, $0.02 per run, 1.8 seconds
Cost-Performance Analysis
Sonnet and Opus both scored 100%, but Opus costs about 3.5x as much per run ($0.69 versus $0.20). For the developer's day-to-day tasks, Sonnet handles everything Opus does. Gemini Flash at $0.003 per run versus Opus at $0.69 per run represents a reported 265x cost difference for a 2.9-point score gap.
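The arithmetic behind these comparisons can be checked with a few lines of Python. The sketch below uses only the rounded per-run figures quoted in the text. With those rounded values the Opus-to-Flash ratio comes out near 230x, not 265x, so the reported 265x presumably reflects unrounded costs that are not shown here.

```python
# Per-run costs (USD) and scores as quoted in the text; all values are rounded.
opus_cost, sonnet_cost, flash_cost = 0.69, 0.20, 0.003
opus_score, flash_score = 100.0, 97.1

opus_vs_sonnet = opus_cost / sonnet_cost  # about 3.45
opus_vs_flash = opus_cost / flash_cost    # about 230 with the rounded $0.003
score_gap = opus_score - flash_score      # about 2.9 points

print(f"Opus/Sonnet cost ratio: {opus_vs_sonnet:.2f}x")
print(f"Opus/Flash cost ratio:  {opus_vs_flash:.0f}x")
print(f"Score gap:              {score_gap:.1f} points")
```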
Surprising Findings
MiniMax M2.5 and Kimi K2.5 both achieved 98.6% with 100% format compliance—the developer hadn't used either model before running the benchmark. GPT-oss-20b running locally scored 98.3% for $0, outperforming Haiku and DeepSeek R1.
QA Process
The quality assurance process revealed scoring bugs. Initial results showed Haiku beating Sonnet, an artifact of a scorer bug that produced quality scores above 100%. Five QA passes were conducted, each with a different model, and each found bugs the previous ones missed.
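A bug like the over-100% score can often be caught with a cheap range check on the final numbers. The following is a minimal sketch of such a guard. The function name and data shape are hypothetical and do not describe the developer's actual QA passes, which used other models as reviewers.

```python
def check_scores(scores: dict[str, float]) -> list[str]:
    """Return a list of problems found; an empty list means the scores look sane."""
    problems = []
    for model, score in scores.items():
        # A percentage score outside 0-100 (or NaN) signals a scorer bug.
        if not 0.0 <= score <= 100.0:
            problems.append(f"{model}: score {score} is outside 0-100")
    return problems
```

Running a check like this before ranking models would flag an impossible score instead of letting it reorder the leaderboard.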
The developer is changing their daily driver to Sonnet based on these results but plans to switch between models more frequently given the performance variations.
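One way to act on results like these is to route each task to the cheapest model that clears a minimum score. The sketch below is a hypothetical illustration built from the figures reported above, not the developer's routing logic. It ignores latency and assumes a single overall score is a fair proxy for every task type.

```python
# (model, score %, cost per run in USD) as reported in the results above.
RESULTS = [
    ("Claude 3.5 Opus", 100.0, 0.69),
    ("Claude 3.5 Sonnet", 100.0, 0.20),
    ("MiniMax M2.5", 98.6, 0.02),
    ("Kimi K2.5", 98.6, 0.05),
    ("GPT-oss-20b (local)", 98.3, 0.0),
    ("Gemini 2.5 Flash", 97.1, 0.003),
    ("Claude 3.5 Haiku", 96.9, 0.02),
]


def cheapest_model(min_score: float) -> str:
    """Pick the lowest-cost model whose score meets the threshold."""
    candidates = [r for r in RESULTS if r[1] >= min_score]
    if not candidates:
        raise ValueError(f"no model scores at least {min_score}%")
    # Sort key: cheapest first, ties broken in favor of the higher score.
    return min(candidates, key=lambda r: (r[2], -r[1]))[0]
```

With a 100% threshold this picks Sonnet, which matches the developer's choice of daily driver; lowering the threshold shifts the pick toward the cheaper or local models.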
📖 Read the full source: r/ClaudeAI