Qwen3.5 vs GPT-5.4: Local LLM Backend Function Calling Benchmark

Five months after an initial uncontrolled measurement, AutoBe.dev has published a proper benchmark of local and frontier LLMs for backend code generation via function calling. The benchmark uses a controlled-variable setup with a real scoring rubric and tests how well models generate recursive-union AST schemas through a function calling harness.
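For readers unfamiliar with the term, a recursive-union AST schema is a tagged union whose variants can contain the union itself. The sketch below is a minimal illustration of that shape and of the kind of structural check a function calling harness could run on a model's arguments. The `Expr` type, its variants, and `validateExpr` are hypothetical names chosen for this example; they are not AutoBe's actual schema or validator.

```typescript
// Hypothetical recursive-union AST for illustration; not AutoBe's actual schema.
type Expr =
  | { type: "literal"; value: number | string | boolean }
  | { type: "identifier"; name: string }
  | { type: "binary"; operator: "+" | "-" | "==" | "&&"; left: Expr; right: Expr }
  | { type: "call"; callee: string; args: Expr[] };

const OPERATORS: string[] = ["+", "-", "==", "&&"];
const PRIMITIVES: string[] = ["number", "string", "boolean"];

// Walks an untrusted function-call argument and returns one message per
// violation, prefixed with its path. An empty array means a valid Expr.
function validateExpr(input: unknown, path: string = "$"): string[] {
  if (typeof input !== "object" || input === null) {
    return [path + ": expected object"];
  }
  const node = input as Record<string, unknown>;
  const kind = node["type"];
  if (kind === "literal") {
    const ok = PRIMITIVES.indexOf(typeof node["value"]) !== -1;
    return ok ? [] : [path + ".value: expected primitive"];
  }
  if (kind === "identifier") {
    return typeof node["name"] === "string" ? [] : [path + ".name: expected string"];
  }
  if (kind === "binary") {
    const errors: string[] = [];
    const operator = node["operator"];
    if (typeof operator !== "string" || OPERATORS.indexOf(operator) === -1) {
      errors.push(path + ".operator: unknown operator");
    }
    // Recursion: each operand must itself be a valid Expr.
    return errors
      .concat(validateExpr(node["left"], path + ".left"))
      .concat(validateExpr(node["right"], path + ".right"));
  }
  if (kind === "call") {
    let errors: string[] = [];
    if (typeof node["callee"] !== "string") {
      errors.push(path + ".callee: expected string");
    }
    const args = node["args"];
    if (!Array.isArray(args)) {
      errors.push(path + ".args: expected array");
      return errors;
    }
    args.forEach((arg: unknown, i: number) => {
      errors = errors.concat(validateExpr(arg, path + ".args[" + i + "]"));
    });
    return errors;
  }
  return [path + ".type: unknown variant"];
}

// A well-typed tree, checked at compile time against the union.
const sample: Expr = {
  type: "binary",
  operator: "+",
  left: { type: "literal", value: 1 },
  right: { type: "call", callee: "f", args: [{ type: "identifier", name: "x" }] },
};
console.log(validateExpr(sample));
```

Returning path-prefixed messages rather than a boolean is a design choice for this sketch: it lets a harness feed precise errors back to the model, which matters most for deeply nested variants.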

Key Findings

The function calling harness has effectively closed the gap between frontier and local models on backend generation. Specifically, gpt-5.4's DB/API design scores are approximately equal to those of qwen3.5-35b-a3b, and claude-sonnet-4.6's logic scores match those of qwen3.5-27b.
This is the last round including frontier models. Running them monthly costs ~200–300M tokens (~$1,000–$1,500 per model on GPT 5.5 pricing). From next month, only OpenRouter endpoints under $0.25/M tokens or models that fit on a 64GB unified-memory laptop will be included.
Frontend automation will be added to the benchmark in the June/July round, using the SDK AutoBe already emits to drive end-to-end AI-built frontends (visuals rough, but all functions work).
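The cost figures and the new inclusion rule above can be made concrete with a little arithmetic. Both ends of the quoted range ($1,000 for 200M tokens, $1,500 for 300M tokens) work out to about $5 per million tokens, roughly 20 times the $0.25/M cap. The sketch below assumes a single blended per-token price, which the source does not specify; the `Candidate` shape, its field names, and the sample entries are hypothetical.

```typescript
// Hypothetical record shape; field names are illustrative, not from the benchmark.
interface Candidate {
  name: string;
  openRouterUsdPerMTok?: number; // left undefined when no OpenRouter endpoint exists
  fitsOn64GbLaptop: boolean;
}

// Price cap stated in the text: endpoints must be under $0.25 per million tokens.
const PRICE_CAP_USD_PER_MTOK = 0.25;

// Blended price implied by a total spend and a total token count (in millions).
function impliedUsdPerMTok(totalUsd: number, totalMTok: number): number {
  return totalUsd / totalMTok;
}

// A model stays in the benchmark if either condition from the text holds.
function isEligible(c: Candidate): boolean {
  const cheapEndpoint =
    c.openRouterUsdPerMTok !== undefined &&
    c.openRouterUsdPerMTok < PRICE_CAP_USD_PER_MTOK;
  return cheapEndpoint || c.fitsOn64GbLaptop;
}

// Placeholder entries, not real models or real prices.
const candidates: Candidate[] = [
  { name: "hosted-cheap", openRouterUsdPerMTok: 0.2, fitsOn64GbLaptop: false },
  { name: "hosted-frontier", openRouterUsdPerMTok: 5, fitsOn64GbLaptop: false },
  { name: "local-only", fitsOn64GbLaptop: true },
];

console.log(impliedUsdPerMTok(1000, 200), impliedUsdPerMTok(1500, 300));
console.log(candidates.filter(isEligible).map((c) => c.name));
```

The cap is treated as a strict upper bound here because the text says "under"; an endpoint priced at exactly $0.25/M would be excluded under that reading.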

Unexpected Inversions

Several results are still under investigation:

openai/gpt-5.4 scores below its own mini sibling.
deepseek-v4-pro lands one notch below qwen3.5-35b-a3b and barely separates from its own Flash sibling.
Within the Qwen family, dense 27B beats every MoE variant, including 397B-A17B.

Possible explanations under investigation include a CoT-compliance phenomenon (larger and frontier models tend to skip procedural instructions that the harness enforces) and defects in the benchmark itself (only n=4 reference projects, a narrow score band, and the harness scoring its own pipeline).