Developer shares hybrid AI coding workflow: Claude for planning, local models for execution

Hybrid AI coding workflow reduces cloud costs
A developer on r/LocalLLaMA shared a detailed workflow that combines cloud and local AI models to reduce token costs while maintaining coding quality. The approach rests on the observation that many coding tasks don't require expensive cloud models.
The workflow architecture
The system follows a "Reason in the cloud, Execute locally" logic:
- Planner (Claude 3.5 Sonnet): Receives the task and generates a precise task_context.md file containing instructions, file paths, and logic. This costs approximately 300-500 tokens.
- Coder (Local Qwen2.5-Coder 30B via Ollama): Takes the specification and actual file content to write the code. This runs locally with zero cost.
- Validator: A simple Bash script runs tsc --noEmit or mypy for type checking.
- Reviewer (Local Qwen2.5-Coder 7B): Runs in parallel to check for obvious logic flaws.
- Auto-fix: If the build fails, the error log goes back to the local coder for 2-3 iterations.
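The Validator and Auto-fix steps above can be sketched roughly as a bounded retry loop. This is a minimal illustration, not the developer's actual script: the function names (run_validator, request_fix, validate_with_autofix) and the MAX_FIX_ATTEMPTS variable are hypothetical, and the validator is assumed to be tsc --noEmit on a TypeScript project.

```shell
#!/usr/bin/env bash
# Sketch of the validate -> auto-fix loop. All names here are hypothetical.

MAX_FIX_ATTEMPTS=3   # the post mentions 2-3 iterations; 3 is an assumed cap

# run_validator LOG_FILE: runs the type checker, capturing its output in LOG_FILE.
# Assumption: a TypeScript project checked with `tsc --noEmit`.
run_validator() {
  tsc --noEmit >"$1" 2>&1
}

# request_fix LOG_FILE: placeholder for sending the error log back to the
# local coder model. A real script would call the model here.
request_fix() {
  echo "would send $(wc -l <"$1") lines of errors to the local coder" >&2
}

# validate_with_autofix: returns 0 as soon as validation passes, or 1 if the
# build still fails after MAX_FIX_ATTEMPTS fix requests.
validate_with_autofix() {
  local log_file attempt=0
  log_file=$(mktemp)
  while :; do
    if run_validator "$log_file"; then
      rm -f "$log_file"
      return 0
    fi
    # Stop once the fix budget is used up.
    if [ "$attempt" -ge "$MAX_FIX_ATTEMPTS" ]; then
      break
    fi
    request_fix "$log_file"
    attempt=$((attempt + 1))
  done
  rm -f "$log_file"
  return 1
}
```

Capping the number of fix requests keeps a model that cannot resolve an error from looping indefinitely; what happens after the cap is reached (for example, escalating to the planner) is not described in the post.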
Implementation details
The entire pipeline is wrapped into a set of Bash scripts using just jq and curl to communicate with the Ollama API. The system auto-detects language standards (TypeScript, Python, C++, etc.) based on the planner's output and doesn't require heavy Python/Node runtimes.
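As a rough illustration of what talking to Ollama with only jq and curl can look like, the sketch below posts a prompt to Ollama's /api/generate endpoint and maps a detected language to a type checker. The helper names, the OLLAMA_URL variable, and especially the "Language:" line in task_context.md are assumptions made for this example; the post does not say how the real scripts detect the language.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the "Language:" convention are hypothetical.

OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}   # Ollama's default local address

# build_payload MODEL PROMPT: builds the JSON request body.
# jq handles quoting, so prompts may contain quotes and newlines.
build_payload() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

# ollama_generate MODEL PROMPT: prints the model's reply text.
ollama_generate() {
  build_payload "$1" "$2" |
    curl -sS --fail "$OLLAMA_URL/api/generate" \
      -H 'Content-Type: application/json' -d @- |
    jq -r '.response'
}

# detect_language SPEC_FILE: reads an assumed "Language: <name>" line from the
# planner's spec and prints it in lowercase.
detect_language() {
  sed -n 's/^Language:[[:space:]]*//p' "$1" | head -n 1 | tr '[:upper:]' '[:lower:]'
}

# validator_for LANGUAGE: prints the type-check command for a language.
# Only the two checkers named in the post are listed; others would need entries.
validator_for() {
  case $1 in
    typescript) echo 'tsc --noEmit' ;;
    python) echo 'mypy .' ;;   # the "." target path is an assumption
    *) return 1 ;;
  esac
}
```

A call might then look like `ollama_generate "$CODER_MODEL" "$(cat task_context.md)"`, where CODER_MODEL holds whichever model tag has been pulled locally.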
The developer notes that local models (even 30B ones) often fail at complex architectural reasoning but are surprisingly good at execution when given crystal-clear specifications.
Results and savings
On a recent TypeScript project with 12 changed files:
- Claude usage was limited to the initial planning phase only
- Local models handled everything else: writing 12 files, linting, and reviewing
- Total savings: approximately 85% token reduction compared to doing everything inside the Claude Code CLI
The developer has made the scripts available in a repository called ai-orchestrator on GitHub (username: Mybono) for those interested in implementation details.
📖 Read the full source: r/LocalLLaMA