Applying Claude Code's Architecture to Local 9B Models: Key Findings and Optimizations

✍️ OpenClawRadar📅 Published: April 4, 2026🔗 Source
Applying Claude Code's Architecture to Local 9B Models: Key Findings and Optimizations
Ad

Experimental Setup and Key Discovery

The developer used an RTX 5070 Ti (16GB VRAM) with qwen3.5:9b via Ollama (6.6GB) and the OpenClaw local agent framework. After 18 tests and 10 optimizations, the key finding was that qwen3.5:9b has native structured tool_calls, while qwen2.5-coder:14b and qwen2.5:14b put JSON in the content field instead of proper tool_calls, requiring extra parsing.

Performance Comparison

Model performance comparison:

  • qwen3.5:9b: Native tool_calls structure, thinking chain enabled, 39 tok/s
  • qwen2.5-coder:14b: Broken tool calling (in content field), no thinking chain, ~30 tok/s
  • qwen2.5:14b: Broken tool calling (in content field), no thinking chain, ~35 tok/s

10 Optimizations from Claude Code's Architecture

  • Structured system prompt → +600% output quality (A/B tested: 4 issues found vs 25+)
  • MicroCompact (tool result compression) → 80-93% compression, 11KB down to 367 chars
  • Hard cutoff (explore→produce forced transition) → Solved exploration loops where 9B models get stuck reading files without producing output
  • think=false → 8-10x token efficiency, eliminates language contamination
  • ToolSearch deferred loading → -60% prompt space (229 vs 568 tokens)
  • Four-type memory system (user/feedback/project/reference) → Personalized responses
  • KV cache forking → Minimal effect on single GPU (1.1x), needs vLLM
  • Strict write discipline → Verify before updating memory, prevents memory corruption
  • Parallel bootstrap → 9% faster cold start
  • Cache break tracking → Ollama caches identical prompts (182ms→75ms)
Ad

Core Finding: Self-Discipline as the Real Ceiling

The biggest finding was that the real ceiling for 9B models isn't reasoning ability or tool-use accuracy, but self-discipline—knowing when to stop exploring and start producing output. Without hard cutoff, the model used all 12 steps reading files and produced 0 bytes of report. With hard cutoff: 5 steps reading + 1 step writing = 6080 bytes structured report.

What qwen3.5:9b Can Actually Do

  • Read 800-line bash scripts and find real bugs (race conditions, non-atomic operations) — 2 min
  • Design a sales feedback system architecture — 8.7KB document in 2.5 min
  • Build a complete project (calculator + tests + run tests) — 28 seconds
  • 10-step autonomous execution: write web scraper → pip install fails → find workaround → retry → tests pass — zero human intervention
  • Full mini-factory pipeline: search → write article → review → publish to HTML — 2.5 min

Complete Engine Performance

All 10 optimizations were packaged into a single Python engine (~280 lines). First run results:

  • Bootstrap: 527ms (parallel memory + model warmup)
  • Explore: 5 tool steps with MicroCompact (88% compression)
  • Produce: 1947 chars structured report
  • Total: 39.4s / zero API cost

What Didn't Work

  • KV cache forking on single GPU (needs multi-GPU or vLLM)
  • Step budget in system prompt (model ignores meta-instructions about its own behavior)
  • qwen2.5 series for tool calling (format issues)

The developer ran this on WSL2 + Ubuntu 24.04 and is willing to share more details or the engine code.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also