Benchmark Results for Small Local and OpenRouter Models on Agentic Text-to-SQL Task

A developer has published benchmark results for small local and OpenRouter models on an agentic text-to-SQL task. The benchmark takes English queries like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and converts them to SQL that is tested against database tables.
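As a rough illustration of what "tested against database tables" could look like, the sketch below runs a candidate query in an in-memory SQLite database and compares its rows with expected rows. The `order_lines` table, its columns, the sample data, and the `rows_match` helper are all hypothetical; the source does not describe the benchmark's actual schema or scoring method. The query covers only part of the example question and follows its parenthetical definition of revenue per unit as total revenue divided by total units sold.

```python
import sqlite3

# Hypothetical schema and data; the real benchmark tables are not described in the source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (subcategory TEXT, revenue REAL, units INTEGER);
INSERT INTO order_lines VALUES
  ('Helmets', 100.0, 4),
  ('Helmets', 50.0, 1),
  ('Gloves', 30.0, 3);
""")

# Candidate SQL such as a model might produce for part of the example question.
# Revenue per unit is SUM(revenue) / SUM(units), not the average of per-line ratios.
candidate_sql = """
SELECT subcategory,
       COUNT(*)                   AS order_lines,
       SUM(revenue)               AS revenue,
       SUM(units)                 AS units_sold,
       SUM(revenue) / SUM(units)  AS revenue_per_unit
FROM order_lines
GROUP BY subcategory
ORDER BY subcategory
"""

def rows_match(actual, expected, tol=1e-6):
    """Compare two row lists, allowing a small tolerance on numeric values."""
    if len(actual) != len(expected):
        return False
    for a_row, e_row in zip(actual, expected):
        if len(a_row) != len(e_row):
            return False
        for a, e in zip(a_row, e_row):
            both_numeric = isinstance(a, (int, float)) and isinstance(e, (int, float))
            if both_numeric:
                if abs(a - e) > tol:
                    return False
            elif a != e:
                return False
    return True

# Expected rows worked out by hand from the sample data above.
expected = [
    ("Gloves", 1, 30.0, 3, 10.0),
    ("Helmets", 2, 150.0, 5, 30.0),
]
actual = conn.execute(candidate_sql).fetchall()
passed = rows_match(actual, expected)
```

With this sample data, averaging the per-line ratios for Helmets would give 37.5 rather than 30.0, which is the kind of subtle mistake a result comparison can catch.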
Benchmark Details
The agent can see query results and modify its SQL to fix issues, with a limit on the number of debugging rounds. The benchmark is deliberately short, with 25 questions, and runs in well under 5 minutes for most models, making it practical for testing different configurations. It is designed to be tough enough to separate the best models from the rest.
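The debugging loop described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the benchmark's implementation: `run_agent`, `propose_sql`, the round limit of 3, and the stopping rule (the model repeats its previous SQL to signal it is satisfied) are all hypothetical, and `fake_model` stands in for a real LLM call.

```python
import sqlite3
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (sql attempted, observation shown to the model)

def run_agent(
    question: str,
    conn: sqlite3.Connection,
    propose_sql: Callable[[str, History], str],
    max_rounds: int = 3,  # hypothetical limit; the source does not state the real one
) -> Tuple[str, History]:
    """Let a model propose SQL, observe the result or error, and retry within a round limit."""
    history: History = []
    sql = ""
    for _ in range(max_rounds):
        candidate = propose_sql(question, history)
        if history and candidate == sql:
            break  # assumed convention: repeating the previous SQL means "final answer"
        sql = candidate
        try:
            rows = conn.execute(sql).fetchmany(5)  # show only a small preview of the result
            observation = f"ok: {rows!r}"
        except sqlite3.Error as exc:
            observation = f"error: {exc}"
        history.append((sql, observation))
    return sql, history

def fake_model(question: str, history: History) -> str:
    """Stand-in for an LLM: the first attempt has a typo, later attempts fix it."""
    if not history:
        return "SELECT nmae FROM products"
    return "SELECT name FROM products"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.execute("INSERT INTO products VALUES ('widget')")

final_sql, history = run_agent("List product names", conn, fake_model)
```

In this run the first attempt fails on the misspelled column, the second succeeds, and the third call repeats the same SQL, so the loop stops with two recorded attempts.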
Key Findings
- The best open models identified were kimi-k2.5, Qwen 3.5 397B-A17B, and Qwen 3.5 27B
- NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
- Mimo v2 Flash was described as "a gem of a model"
Self-Hosted Option
The benchmark can now be run by anyone against their own server, using the WASM version of llama.cpp. The developer is seeking feedback on what to change for version 2 and would like to see the scores others get with different configurations.
📖 Read the full source: r/LocalLLaMA