Benchmark Results for Small Local and OpenRouter Models on Agentic Text-to-SQL Task

By OpenClawRadar | Published: April 17, 2026

A developer has published benchmark results for small local and OpenRouter models on an agentic text-to-SQL task. The benchmark takes English queries like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and converts them to SQL that is tested against database tables.
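
To make the example question concrete, here is a minimal sketch of SQL that could answer it, run through Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical, since the source does not show the benchmark's actual schema. The correlated subquery is one way to average list price per product rather than per order line.

```python
import sqlite3

# Hypothetical schema and data; the benchmark's real tables are not shown in the source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY, subcategory TEXT,
    list_price REAL, unit_cost REAL);
CREATE TABLE order_lines (
    line_id INTEGER PRIMARY KEY, product_id INTEGER,
    quantity INTEGER, unit_price REAL);
INSERT INTO products VALUES
    (1, 'Helmets', 40.0, 25.0), (2, 'Helmets', 60.0, 35.0), (3, 'Gloves', 20.0, 8.0);
INSERT INTO order_lines VALUES
    (1, 1, 2, 38.0), (2, 2, 1, 60.0), (3, 3, 5, 18.0), (4, 1, 1, 40.0);
""")

QUERY = """
SELECT p.subcategory,
       COUNT(*)                                      AS order_lines,
       SUM(ol.quantity * ol.unit_price)              AS revenue,
       SUM(ol.quantity)                              AS units_sold,
       -- revenue per unit = total revenue / total units sold
       SUM(ol.quantity * ol.unit_price) / SUM(ol.quantity) AS revenue_per_unit,
       -- average over products, not over joined order lines
       (SELECT AVG(p2.list_price) FROM products p2
         WHERE p2.subcategory = p.subcategory)       AS avg_list_price,
       SUM(ol.quantity * (ol.unit_price - p.unit_cost)) AS gross_profit,
       100.0 * SUM(ol.quantity * (ol.unit_price - p.unit_cost))
             / SUM(ol.quantity * ol.unit_price)      AS margin_pct
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY p.subcategory
ORDER BY p.subcategory
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this sample data, a plain AVG(p.list_price) over the joined rows would weight product 1 twice in Helmets, which is the kind of subtle mistake a question like this can expose.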

Benchmark Details

The agent can see query results and modify its SQL to fix issues, with a limit on the number of debugging rounds. The benchmark is deliberately short, at 25 questions, and runs in well under 5 minutes for most models, which makes it practical for comparing configurations. It is designed to be hard enough to separate the best models from the rest.
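
The execute-and-revise loop described above can be sketched roughly as follows. This is not the benchmark's actual harness: run_agent, is_correct, stub_model, the max_rounds default, and the toy table are all hypothetical names and values chosen for illustration, and a real run would call an LLM where the stub is.

```python
import sqlite3

def run_agent(conn, question, model, max_rounds=3):
    """Return the final SQL after at most max_rounds of execute-and-revise."""
    sql = model(question, None)  # first attempt, no feedback yet
    for _ in range(max_rounds):
        try:
            observation = repr(conn.execute(sql).fetchmany(5))
        except sqlite3.Error as exc:
            observation = f"ERROR: {exc}"
        # Show the model its own SQL and what happened when it ran.
        revised = model(question, (sql, observation))
        if revised == sql:  # model keeps its answer, so stop early
            break
        sql = revised
    return sql

def is_correct(conn, sql, expected_rows):
    """Score a final answer by comparing result rows, ignoring row order."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False
    return sorted(rows) == sorted(expected_rows)

def stub_model(question, last_attempt):
    """Stand-in for an LLM call: wrong column first, fixed after an error."""
    if last_attempt is None:
        return "SELECT SUM(qty) FROM order_lines"
    sql, observation = last_attempt
    if observation.startswith("ERROR"):
        return "SELECT SUM(quantity) FROM order_lines"
    return sql

# Toy table (hypothetical) to exercise the loop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (line_id INTEGER, quantity INTEGER)")
conn.executemany("INSERT INTO order_lines VALUES (?, ?)", [(1, 3), (2, 5)])

final_sql = run_agent(conn, "How many units were sold?", stub_model)
print(final_sql, is_correct(conn, final_sql, [(8,)]))
```

Capping the loop keeps total runtime bounded, and scoring only the final SQL means a model is not penalized for a first attempt it manages to repair within the limit.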

Key Findings

  • The best open models identified were kimi-k2.5, Qwen 3.5 397B-A17B, and Qwen 3.5 27B
  • NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
  • Mimo v2 Flash was described as "a gem of a model"

Self-Hosted Option

The benchmark can now be run against your own server using the WASM version of llama.cpp. The developer is asking for feedback on what to change for version 2 and would like to see the scores others get with different configurations.

📖 Read the full source: r/LocalLLaMA
