Claude Code Architecture Applied to Qwen3.5:9B: 10 Key Optimizations

Experimental Setup and Key Discovery

The developer used an RTX 5070 Ti (16GB VRAM) with qwen3.5:9b via Ollama (6.6GB) and the OpenClaw local agent framework. After 18 tests and 10 optimizations, the key finding was that qwen3.5:9b has native structured tool_calls, while qwen2.5-coder:14b and qwen2.5:14b put JSON in the content field instead of proper tool_calls, requiring extra parsing.

Performance Comparison

Model performance comparison:

qwen3.5:9b: Native tool_calls structure, thinking chain enabled, 39 tok/s
qwen2.5-coder:14b: Broken tool calling (in content field), no thinking chain, ~30 tok/s
qwen2.5:14b: Broken tool calling (in content field), no thinking chain, ~35 tok/s

10 Optimizations from Claude Code's Architecture

Structured system prompt → +600% output quality (A/B tested: 4 issues found vs 25+)
MicroCompact (tool result compression) → 80-93% compression, 11KB down to 367 chars
Hard cutoff (explore→produce forced transition) → Solved exploration loops where 9B models get stuck reading files without producing output
think=false → 8-10x token efficiency, eliminates language contamination
ToolSearch deferred loading → -60% prompt space (229 vs 568 tokens)
Four-type memory system (user/feedback/project/reference) → Personalized responses
KV cache forking → Minimal effect on single GPU (1.1x), needs vLLM
Strict write discipline → Verify before updating memory, prevents memory corruption
Parallel bootstrap → 9% faster cold start
Cache break tracking → Ollama caches identical prompts (182ms→75ms)

Core Finding: Self-Discipline as the Real Ceiling

The biggest finding was that the real ceiling for 9B models isn't reasoning ability or tool-use accuracy, but self-discipline—knowing when to stop exploring and start producing output. Without hard cutoff, the model used all 12 steps reading files and produced 0 bytes of report. With hard cutoff: 5 steps reading + 1 step writing = 6080 bytes structured report.

What qwen3.5:9b Can Actually Do

Read 800-line bash scripts and find real bugs (race conditions, non-atomic operations) — 2 min
Design a sales feedback system architecture — 8.7KB document in 2.5 min
Build a complete project (calculator + tests + run tests) — 28 seconds
10-step autonomous execution: write web scraper → pip install fails → find workaround → retry → tests pass — zero human intervention
Full mini-factory pipeline: search → write article → review → publish to HTML — 2.5 min

Complete Engine Performance

All 10 optimizations were packaged into a single Python engine (~280 lines). First run results:

Bootstrap: 527ms (parallel memory + model warmup)
Explore: 5 tool steps with MicroCompact (88% compression)
Produce: 1947 chars structured report
Total: 39.4s / zero API cost

What Didn't Work

KV cache forking on single GPU (needs multi-GPU or vLLM)
Step budget in system prompt (model ignores meta-instructions about its own behavior)
qwen2.5 series for tool calling (format issues)

The developer ran this on WSL2 + Ubuntu 24.04 and is willing to share more details or the engine code.

📖 Read the full source: r/LocalLLaMA