How Weak Local LLMs Fixed an AI Backend Generator

What Happened

AutoBe is an open-source AI agent that generates complete backend applications using TypeScript, NestJS, and Prisma. Initially, it achieved 100% compilation success, but the code was unmaintainable: there was no code reuse, so every small change required regenerating everything. The team rebuilt the system around modular code generation, which immediately dropped the compilation success rate to 40%.

The Debugging Breakthrough

When the new architecture introduced dependencies between modules, the team used intentionally weak local LLMs to find bugs they didn't know existed. The qwen3-30b-a3b-thinking model had about a 10% success rate and exposed AST schema ambiguities and malformed structures. The qwen3-next-80b-a3b-instruct model had about a 20% success rate and revealed type mismatches and edge cases in nested relationships.

That low success rate was valuable: each failure pointed to an ambiguity, and each fix tightened the entire system. When a schema is precise enough that a 30B model can't misinterpret it, stronger models won't get it wrong either. This approach also highlights the cost advantage of local LLMs: discovering edge cases requires hundreds of generation-compile-diagnose cycles, which would be prohibitively expensive at cloud API prices.

Architectural Shift

The team moved from prompt engineering to schema design with validation feedback. They stripped system prompts to almost nothing and moved all constraints into function calling schemas, letting validation feedback do the teaching. AutoBe uses three AST types that are particularly challenging for LLMs to generate: AutoBeDatabase (Prisma models, relations, indexes), AutoBeOpenApi (OpenAPI schemas, endpoints, DTOs), and AutoBeTest (30+ expression types).
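The idea of letting validation feedback do the teaching can be sketched as a small retry loop. This is a minimal illustration, not AutoBe's implementation: ColumnDraft, ValidationError, validateColumn, callWithFeedback, and fakeModel are all hypothetical names, and the model call is replaced by a stub so the example runs on its own.

```typescript
// Hypothetical shapes for illustration; these are not AutoBe's actual types.
interface ColumnDraft {
  name: string;
  type: "int" | "string" | "boolean" | "datetime";
  nullable: boolean;
}

interface ValidationError {
  path: string;     // where in the arguments the problem is
  expected: string; // what the schema requires
  value: unknown;   // what the model actually produced
}

const COLUMN_TYPES: string[] = ["int", "string", "boolean", "datetime"];

// Checks untrusted function-call arguments against the schema and
// returns every violation instead of stopping at the first one.
function validateColumn(input: unknown): ValidationError[] {
  if (typeof input !== "object" || input === null) {
    return [{ path: "$input", expected: "ColumnDraft", value: input }];
  }
  const obj = input as Record<string, unknown>;
  const colName = obj["name"];
  const colType = obj["type"];
  const nullable = obj["nullable"];
  const errors: ValidationError[] = [];
  if (typeof colName !== "string" || !/^[a-z][a-z0-9_]*$/.test(colName)) {
    errors.push({ path: "$input.name", expected: "snake_case string", value: colName });
  }
  if (typeof colType !== "string" || COLUMN_TYPES.indexOf(colType) === -1) {
    errors.push({ path: "$input.type", expected: COLUMN_TYPES.join(" | "), value: colType });
  }
  if (typeof nullable !== "boolean") {
    errors.push({ path: "$input.nullable", expected: "boolean", value: nullable });
  }
  return errors;
}

// A model call is abstracted as a function from prior feedback to raw arguments.
type Generate = (feedback: ValidationError[]) => unknown;

// Retries until the arguments validate, feeding the errors back each time.
function callWithFeedback(
  generate: Generate,
  maxAttempts: number,
): { value: ColumnDraft | null; attempts: number } {
  let feedback: ValidationError[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = generate(feedback);
    feedback = validateColumn(raw);
    if (feedback.length === 0) {
      return { value: raw as ColumnDraft, attempts: attempt };
    }
  }
  return { value: null, attempts: maxAttempts };
}

// Stand-in for a weak model: wrong on the first try, corrected once it sees errors.
const fakeModel: Generate = (feedback) =>
  feedback.length === 0
    ? { name: "UserId", type: "integer", nullable: "no" }
    : { name: "user_id", type: "int", nullable: false };

const result = callWithFeedback(fakeModel, 3);
console.log(result.attempts, result.value);
```

The system prompt plays no role here: everything the stub model needs to correct itself arrives as structured errors naming the path, the expected type, and the offending value.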

These structures are difficult because they involve unlimited union types, unlimited depth, and recursive references. For example, the compiler AST includes types like IArrayLiteralExpression and IObjectLiteralExpression that contain recursive references to IExpression[].
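To make the recursion concrete, here is a heavily simplified sketch of such a union. Only the type names IExpression, IArrayLiteralExpression, and IObjectLiteralExpression come from the text above; the field names, the two literal types, and the findInvalid helper are assumptions made for illustration and will differ from AutoBe's real definitions.

```typescript
// Simplified, hypothetical shapes: the field names and the reduced
// four-member union are assumptions, not AutoBe's actual AST.
interface INumericLiteral { type: "numericLiteral"; value: number; }
interface IStringLiteral { type: "stringLiteral"; value: string; }
interface IArrayLiteralExpression {
  type: "arrayLiteralExpression";
  elements: IExpression[]; // recursive reference
}
interface IObjectLiteralExpression {
  type: "objectLiteralExpression";
  properties: { name: string; value: IExpression }[]; // recursive reference
}
type IExpression =
  | INumericLiteral
  | IStringLiteral
  | IArrayLiteralExpression
  | IObjectLiteralExpression;

// Walks an untrusted value and returns the path of every node that
// does not match a member of the union, at any depth.
function findInvalid(input: unknown, path: string = "$"): string[] {
  if (typeof input !== "object" || input === null) return [path];
  const node = input as Record<string, unknown>;
  const kind = node["type"];
  if (kind === "numericLiteral") {
    return typeof node["value"] === "number" ? [] : [path + ".value"];
  }
  if (kind === "stringLiteral") {
    return typeof node["value"] === "string" ? [] : [path + ".value"];
  }
  if (kind === "arrayLiteralExpression") {
    const elements = node["elements"];
    if (!Array.isArray(elements)) return [path + ".elements"];
    const out: string[] = [];
    elements.forEach((e, i) => out.push(...findInvalid(e, `${path}.elements[${i}]`)));
    return out;
  }
  if (kind === "objectLiteralExpression") {
    const props = node["properties"];
    if (!Array.isArray(props)) return [path + ".properties"];
    const out: string[] = [];
    props.forEach((p, i) => {
      const prop = (typeof p === "object" && p !== null ? p : {}) as Record<string, unknown>;
      if (typeof prop["name"] !== "string") out.push(`${path}.properties[${i}].name`);
      out.push(...findInvalid(prop["value"], `${path}.properties[${i}].value`));
    });
    return out;
  }
  return [path + ".type"]; // unknown or missing discriminator
}

// A well-typed expression, checked by the compiler against IExpression.
const wellFormed: IExpression = {
  type: "arrayLiteralExpression",
  elements: [{ type: "numericLiteral", value: 1 }],
};

// Untrusted output with one mistake buried two levels down.
const nested: unknown = {
  type: "objectLiteralExpression",
  properties: [
    {
      name: "tags",
      value: {
        type: "arrayLiteralExpression",
        elements: [
          { type: "stringLiteral", value: "a" },
          { type: "numericLiteral", value: "2" }, // string where a number is required
        ],
      },
    },
  ],
};

console.log(findInvalid(wellFormed)); // []
console.log(findInvalid(nested));     // ["$.properties[0].value.elements[1].value"]
```

Even with four union members, a single wrong leaf can sit arbitrarily deep, which is why a precise error path matters more than a generic "invalid arguments" message.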

Results

Through validation feedback alone, the team raised raw function calling success from 6.75% to 100%. Overall success is now back at 100% with GLM v5, and other local models are climbing in performance.

📖 Read the full source: r/LocalLLaMA