Qwen3.6-27B Local Agent Test: 12% Tool-Call Gap vs Claude

A developer replaced Claude with Qwen3.6-27B in a multi-agent orchestrator for two weeks, running entirely on a single RTX 3090. The goal was straightforward: test whether a local model could serve as the reasoning layer — lead/manager/sub-agent loop — in real coding workflows. The results offer hard numbers for anyone considering cutting cloud costs.

Setup and Baseline

Hardware: RTX 3090, 24GB VRAM
Model: Qwen3.6-27B at Q6_K quantization (~22GB on-GPU), effective context 32k
Inference engine: Ollama
Orchestrator: Multi-agent system with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion
Workload: 47 multi-step coding workflows across two real repositories

What Worked (The Reasoning Layer)

Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative — fewer unsolicited refactoring suggestions — but coherent and schema-valid ~95% of the time after prompt tweaks. The remaining 5% were fixable with a single re-prompt.

Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same facts Claude does (e.g., "user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.

Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught ~60% of the bugs Claude's review caught on the same set. Less aggressive, still useful, and free.

Where It Broke

Tool-call reliability. Qwen3.6's JSON tool-call output had a ~12% format error rate across 47 tasks. Claude was ~0.5% on the same workload. Errors were not malformed JSON — they were wrong field names, wrong types, hallucinated tool signatures. Using Outlines or strict-output mode reduced errors but didn't eliminate them.

Long-context drift. Past ~14k tokens of accumulated session context, Qwen started misremembering decisions (e.g., "you said use Postgres" when the opposite was said). Effective practical limit is ~12k tokens, then aggressive summarize-and-reset.

Cascade-failure handling. When a sub-agent failed, Claude's planner usually noticed and re-planned. Qwen sometimes generated downstream steps assuming the sub-agent succeeded. Three cascading hallucinations in 47 runs — not catastrophic with plan gating, but would be without it.

Practical Implications

The developer's take: "Qwen3.6-27B is a viable reasoning layer for local multi-agent systems today. It is NOT a viable execution layer." If you're building local-only agents, you need:

Structured-output enforcement at the tool-call boundary (Outlines, lm-format-enforcer, or grammar mode of your inference engine)
Plan-approval gating so the 12% format errors never reach actual file writes
Re-plan-on-failure logic — the model itself can't be trusted to handle cascading failures

The 12% tool-call error gap is the metric to watch. Once Qwen3.6 or the next local model hits ~2% on this metric, the case for cloud reasoning in agent loops weakens considerably.

📖 Read the full source: r/LocalLLaMA

Qwen3.6-27B as a Local Reasoning Layer: 2-Week Multi-Agent Test Results

Setup and Baseline

What Worked (The Reasoning Layer)

Where It Broke

Practical Implications

👀 See Also

Memento Vault: Local Tool for Persistent Context in Claude Code Sessions

Open-source multi-account manager for Claude CLI enables profile switching

Mímir: A Python Memory System Built on 21 Neuroscience Mechanisms

Nudge: A local-first app that surfaces Claude-generated plans via contextual triggers