Applying Claude Code's Architecture to Local 9B Models: Key Findings and Optimizations

Experimental Setup and Key Discovery
The developer used an RTX 5070 Ti (16GB VRAM) with qwen3.5:9b via Ollama (6.6GB) and the OpenClaw local agent framework. After 18 tests and 10 optimizations, the key finding was that qwen3.5:9b has native structured tool_calls, while qwen2.5-coder:14b and qwen2.5:14b put JSON in the content field instead of proper tool_calls, requiring extra parsing.
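The difference between the two behaviors can be handled with a small normalizer. The sketch below is illustrative and not the developer's code. It assumes an Ollama-style assistant message where structured calls arrive under message["tool_calls"][i]["function"], and it assumes that a model writing its call into the content field uses a {"name": ..., "arguments": ...} object. That second shape is a guess and may differ per model. The function name extract_tool_calls is hypothetical.

```python
import json


def extract_tool_calls(message):
    """Return a list of (name, arguments) pairs from an assistant message dict."""
    calls = []
    # Preferred path: structured tool_calls on the message.
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):  # some servers send arguments as a JSON string
            args = json.loads(args)
        calls.append((fn.get("name"), args))
    if calls:
        return calls

    # Fallback path: the call was written as JSON text in the content field.
    # The {"name": ..., "arguments": ...} shape is an assumption.
    content = (message.get("content") or "").strip()
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        obj = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return []
    if isinstance(obj, dict) and "name" in obj:
        return [(obj["name"], obj.get("arguments", {}))]
    return []
```

With a normalizer like this the agent loop sees one format, but the fallback path stays fragile: any prose containing braces around the JSON can break the parse, which is one reason native tool_calls matter.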
Performance Comparison
Tool calling behavior, thinking support, and generation speed by model:
- qwen3.5:9b: Native tool_calls structure, thinking chain enabled, 39 tok/s
- qwen2.5-coder:14b: Tool calls emitted as JSON in the content field instead of tool_calls, no thinking chain, ~30 tok/s
- qwen2.5:14b: Tool calls emitted as JSON in the content field instead of tool_calls, no thinking chain, ~35 tok/s
10 Optimizations from Claude Code's Architecture
- Structured system prompt → roughly 6x more issues found in an A/B test (25+ with the structured prompt vs 4 without), reported as +600% output quality
- MicroCompact (tool result compression) → 80-93% compression reported, with one 11KB result reduced to 367 chars
- Hard cutoff (explore→produce forced transition) → Solved exploration loops where 9B models get stuck reading files without producing output
- think=false → 8-10x token efficiency, eliminates language contamination
- ToolSearch deferred loading → -60% prompt space (229 vs 568 tokens)
- Four-type memory system (user/feedback/project/reference) → Personalized responses
- KV cache forking → Minimal effect on single GPU (1.1x), needs vLLM
- Strict write discipline → Verify before updating memory, prevents memory corruption
- Parallel bootstrap → 9% faster cold start
- Cache break tracking → Ollama caches identical prompts (182ms→75ms)
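As an illustration of the MicroCompact idea from the list above, here is a minimal sketch of tool result compression. The source does not describe how the developer's version works, so the head/tail strategy, the function name micro_compact, and every threshold below are assumptions chosen for the example.

```python
def micro_compact(result, max_chars=400, head_lines=5, tail_lines=3, max_line=60):
    """Shrink a long tool result to a head/tail excerpt plus a size note."""
    if len(result) <= max_chars:
        return result  # short results pass through unchanged

    lines = result.splitlines()
    head = lines[:head_lines]
    # Only take a tail when it would not overlap the head.
    tail = lines[-tail_lines:] if len(lines) > head_lines + tail_lines else []
    omitted = len(lines) - len(head) - len(tail)

    # Trim very wide lines so a single long line cannot defeat the compression.
    def trim(line):
        return line if len(line) <= max_line else line[:max_line] + "..."

    note = f"[{omitted} lines omitted, {len(result)} chars originally]"
    return "\n".join([trim(l) for l in head] + [note] + [trim(l) for l in tail])
```

Keeping the first and last lines plus an explicit size note lets the model know that content was dropped, so it can re-read a narrower range if it needs the missing part.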
Core Finding: Self-Discipline as the Real Ceiling
The biggest finding was that the real ceiling for 9B models isn't reasoning ability or tool-use accuracy but self-discipline: knowing when to stop exploring and start producing output. Without a hard cutoff, the model used all 12 steps reading files and produced 0 bytes of report. With a hard cutoff, it spent 5 steps reading and 1 step writing, and produced a 6080-byte structured report.
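A hard cutoff of this kind can be sketched as a loop that enforces the exploration budget in code rather than in the prompt. This is a hypothetical reconstruction, not the developer's engine: model_step is an assumed callable that returns either {"tool": name, "args": {...}} or {"final": text}, and the allow_tools flag stands in for whatever mechanism withdraws tools from the request.

```python
def run_agent(model_step, tools, max_explore_steps=5):
    """Run an explore phase with a hard step cap, then force one produce step."""
    history = []
    for _ in range(max_explore_steps):
        action = model_step(history, allow_tools=True)
        if "final" in action:  # the model stopped exploring on its own
            return action["final"], history
        output = tools[action["tool"]](**action["args"])
        history.append((action["tool"], output))

    # Hard cutoff: the budget is exhausted, so tools are withdrawn and the
    # only remaining move is to write the report from what was gathered.
    action = model_step(history, allow_tools=False)
    return action.get("final", ""), history
```

For example, a stub model that always asks to read another file when tools are available still ends with a report after five reads, because the loop rather than the model decides when exploration ends.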
What qwen3.5:9b Can Actually Do
- Read 800-line bash scripts and find real bugs (race conditions, non-atomic operations) — 2 min
- Design a sales feedback system architecture — 8.7KB document in 2.5 min
- Build a complete project (calculator + tests + run tests) — 28 seconds
- 10-step autonomous execution: write web scraper → pip install fails → find workaround → retry → tests pass — zero human intervention
- Full mini-factory pipeline: search → write article → review → publish to HTML — 2.5 min
Complete Engine Performance
All 10 optimizations were packaged into a single Python engine (~280 lines). First run results:
- Bootstrap: 527ms (parallel memory + model warmup)
- Explore: 5 tool steps with MicroCompact (88% compression)
- Produce: 1947 chars structured report
- Total: 39.4s / zero API cost
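The parallel bootstrap step (memory loading alongside model warmup) might look like the sketch below. It uses only the standard library. The callables load_memory and warm_model are hypothetical placeholders for whatever the engine actually does at startup, for instance reading memory files and sending a short request so the model is loaded into VRAM.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def bootstrap(load_memory, warm_model):
    """Run both startup tasks concurrently; return (memory, warmup, elapsed_ms)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        memory_future = pool.submit(load_memory)
        warm_future = pool.submit(warm_model)
        # result() blocks until each task finishes and re-raises its errors.
        memory = memory_future.result()
        warmup = warm_future.result()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return memory, warmup, elapsed_ms
```

Threads suit this case because both tasks are expected to be I/O-bound (disk reads and an HTTP call), so total startup time approaches that of the slower task rather than the sum of both.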
What Didn't Work
- KV cache forking on single GPU (needs multi-GPU or vLLM)
- Step budget in system prompt (model ignores meta-instructions about its own behavior)
- qwen2.5 series for tool calling (format issues)
The developer ran this on WSL2 + Ubuntu 24.04 and is willing to share more details or the engine code.
📖 Read the full source: r/LocalLLaMA