Local LLM Benchmark: Backend Generation by Function Calling – GLM, Qwen, DeepSeek Compared

Five months after an initial uncontrolled measurement, AutoBe.dev has published a proper benchmark of local and frontier LLMs for backend code generation using function calling. The benchmark uses a controlled-variable setup with a real scoring rubric, and tests how well models generate recursive-union AST schemas through a function-calling harness.
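To make "recursive-union AST via function calling" concrete, here is a minimal sketch of the kind of check a harness could run on the JSON arguments a model returns. The three-variant expression AST (`literal`, `identifier`, `call`) and the function name `validate_expr` are hypothetical illustrations, not AutoBe's actual schema or API.

```python
import json
from typing import Any


def validate_expr(node: Any, path: str = "$") -> list[str]:
    """Validate a hypothetical tagged-union expression AST.

    Returns a list of error strings; an empty list means the node is valid.
    The "call" variant contains nested expressions, which makes the union recursive.
    """
    if not isinstance(node, dict):
        return [f"{path}: expected object, got {type(node).__name__}"]
    kind = node.get("type")
    if kind == "literal":
        value = node.get("value")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float, str)):
            return [f"{path}.value: expected number or string"]
        return []
    if kind == "identifier":
        if not isinstance(node.get("name"), str):
            return [f"{path}.name: expected string"]
        return []
    if kind == "call":
        errors: list[str] = []
        if not isinstance(node.get("callee"), str):
            errors.append(f"{path}.callee: expected string")
        args = node.get("args")
        if not isinstance(args, list):
            errors.append(f"{path}.args: expected array")
        else:
            for i, arg in enumerate(args):
                # Recurse into each argument, extending the error path.
                errors.extend(validate_expr(arg, f"{path}.args[{i}]"))
        return errors
    return [f"{path}.type: unknown variant {kind!r}"]


# Arguments as they might arrive from a function call (illustrative payload).
payload = json.loads(
    '{"type": "call", "callee": "add", "args": ['
    '{"type": "literal", "value": 1}, {"type": "identifier", "name": "x"}]}'
)
errors = validate_expr(payload)
```

Under this assumption, a harness could return the path-qualified errors to the model and ask it to retry, which is one plausible way a harness narrows the gap between model sizes.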
Key Findings
- The function calling harness has effectively closed the gap between frontier and local models on backend generation. Specifically, gpt-5.4's DB/API design scores are approximately equal to qwen3.5-35b-a3b, and claude-sonnet-4.6's logic scores match qwen3.5-27b.
- This is the last round including frontier models. Running them monthly costs ~200–300M tokens (~$1,000–$1,500 per model on GPT 5.5 pricing). From next month, only OpenRouter endpoints under $0.25/M tokens or models that fit on a 64GB unified-memory laptop will be included.
- Frontend automation will be added to the benchmark in the June/July round, using the SDK AutoBe already emits to drive end-to-end AI-built frontends (visuals rough, but all functions work).
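The cost figures in the findings above can be checked with simple arithmetic. The sketch below assumes a single blended price per million tokens (real pricing usually differs for input and output tokens), and the helper names are illustrative.

```python
def implied_price_per_million(cost_usd: float, tokens_millions: float) -> float:
    """Blended USD price per million tokens implied by a total cost."""
    return cost_usd / tokens_millions


def run_cost_usd(tokens_millions: float, usd_per_million: float) -> float:
    """Total cost of one benchmark run at a flat blended rate."""
    return tokens_millions * usd_per_million


# Both ends of the quoted range imply the same blended rate.
low_rate = implied_price_per_million(1_000, 200)   # 5.0 USD per million tokens
high_rate = implied_price_per_million(1_500, 300)  # 5.0 USD per million tokens

# The same token volume at the $0.25/M cap.
capped_low = run_cost_usd(200, 0.25)   # 50.0 USD
capped_high = run_cost_usd(300, 0.25)  # 75.0 USD
```

At the cap, the same 200–300M-token run would cost roughly $50–$75 per model, about a twentieth of the frontier figure.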
Unexpected Inversions
Several results are still under investigation:
- openai/gpt-5.4 scores below its own mini sibling.
- deepseek-v4-pro lands one notch below qwen3.5-35b-a3b and barely separates from its own Flash sibling.
- Within the Qwen family, dense 27B beats every MoE variant, including 397B-A17B.
Possible explanations being investigated include a CoT-compliance phenomenon (larger/frontier models tend to skip procedural instructions enforced by the harness) and benchmark defects (n=4 reference projects, a narrow score band, and the harness scoring its own pipeline).
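To see why n=4 reference projects and a narrow score band make rankings fragile, consider a rough sketch with made-up scores. The numbers below are hypothetical and are not taken from the benchmark. The band is a crude mean ± 2 standard errors; a proper t-based interval at n=4 would be wider still.

```python
from statistics import mean, stdev

# Hypothetical per-project scores for two models (NOT benchmark data).
model_a = [70, 74, 68, 72]
model_b = [72, 75, 70, 75]


def rough_band(scores: list[float]) -> tuple[float, float]:
    """Mean +/- 2 standard errors, a crude uncertainty band for a small sample."""
    m = mean(scores)
    sem = stdev(scores) / len(scores) ** 0.5
    return (m - 2 * sem, m + 2 * sem)


def bands_overlap(a: list[float], b: list[float]) -> bool:
    """True when the two rough bands intersect, so the ranking is not clear-cut."""
    lo_a, hi_a = rough_band(a)
    lo_b, hi_b = rough_band(b)
    return lo_a <= hi_b and lo_b <= hi_a


overlap = bands_overlap(model_a, model_b)
```

In this illustration, a two-point gap in mean score sits well inside the overlap of the two bands, so an apparent inversion between siblings could be noise and not a real capability difference.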
Recommended Models
Three locked-in candidates for next month:
- openai/gpt-5.4-nano: $0.25/M tokens
- qwen/qwen3.6-27b: $0.195/M tokens
- deepseek/deepseek-v4-flash: $0.14/M tokens
All are under $0.25/M on OpenRouter or runnable on a 64GB unified-memory laptop, and handle function calling cleanly.
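The inclusion rule can be written as a small filter. This is a sketch under stated assumptions: the cap is treated as inclusive because one listed model is priced at exactly $0.25/M, while the text says "under". The `fits_64gb_laptop` flag is a hypothetical field that would have to be filled in by hand, since the source gives no memory figures.

```python
from dataclasses import dataclass
from typing import Optional

PRICE_CAP_USD_PER_M = 0.25  # cap quoted in the article; inclusive here by assumption


@dataclass(frozen=True)
class Candidate:
    name: str
    usd_per_million: Optional[float] = None  # OpenRouter price, if an endpoint exists
    fits_64gb_laptop: bool = False           # hypothetical, manually assessed flag


def is_eligible(c: Candidate) -> bool:
    """Eligible if cheap enough on OpenRouter OR runnable on a 64GB laptop."""
    cheap = c.usd_per_million is not None and c.usd_per_million <= PRICE_CAP_USD_PER_M
    return cheap or c.fits_64gb_laptop


# Names and prices as listed in the article.
CANDIDATES = [
    Candidate("openai/gpt-5.4-nano", 0.25),
    Candidate("qwen/qwen3.6-27b", 0.195),
    Candidate("deepseek/deepseek-v4-flash", 0.14),
]

eligible_names = [c.name for c in CANDIDATES if is_eligible(c)]
```

With a strict "under" comparison, the $0.25/M model would qualify only through the laptop criterion, so the boundary choice matters for that entry.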
References
- Benchmark Dashboard: https://autobe.dev/benchmark/
- Generation Results: GitHub: autobe-examples
- GitHub Repository: https://github.com/wrtnlabs/autobe
📖 Read the full source: r/LocalLLaMA