Using a smaller model as a runtime hygiene layer improves OpenClaw agent reliability

✍️ OpenClawRadar | 📅 Published: March 14, 2026

Problem: Sloppy outputs degrade long-running agents

A developer running OpenClaw locally on a Mac Studio M4 (36GB) with Qwen 3.5 27B (4-bit, oMLX) as a household agent found that the model did not become less capable over time; it became sloppy. Specific issues included:

  • Tool calls leaking as raw text instead of structured tool use
  • Planning thoughts bleeding into final replies
  • Parroting tool results and policy text back to the user
  • Malformed outputs poisoning the context, causing degradation with each subsequent turn

The core issue wasn't capability but runtime hygiene: the model knew what to do but failed at proper behavior within the OpenClaw runtime environment.

Solution: Four-layer architecture for runtime hygiene

The developer implemented a four-layer approach that proved more effective than simply using a larger model:

  • Summarization: Context compaction via lossless-claw (DAG-based, freshTailCount=12, contextThreshold=0.60). This provided the single biggest improvement.
  • Sheriff: Regex and heuristic checks that catch malformed replies before they enter OpenClaw. This prevents leaked tool markup, planner ramble, and raw JSON from becoming durable context.
  • Judge: A smaller, cheaper model that classifies borderline outputs as "valid final answer" or "junk." Its job is runtime hygiene, not intelligence: it acts as an immune system rather than a second brain. It also handles all summarization for lossless-claw.
  • Ozempic (internal name): Aggressive memory scrubbing, so that on future turns the model re-reads only user requests, final answers, and compact tool-derived facts, not planner rambling, raw tool JSON, retry artifacts, or policy self-talk.
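
The Sheriff and Ozempic layers could look roughly like the sketch below. It is an illustration only: the regex patterns, the function names `sheriff_check` and `scrub_history`, and the message shape (a dict with `kind` and `content` keys) are assumptions made for this example, not the developer's actual rules or OpenClaw's data model.

```python
import re

# Hypothetical patterns for replies that should never become durable context.
# These are illustrative guesses, not the developer's actual rule set.
SUSPECT_PATTERNS = {
    "leaked_tool_markup": re.compile(r"</?tool_call>|<function=", re.IGNORECASE),
    "raw_json": re.compile(r"^\s*[\[{].*[\]}]\s*$", re.DOTALL),
    "planner_ramble": re.compile(r"^\s*(okay, |let me |i need to |first, i)", re.IGNORECASE),
}


def sheriff_check(reply: str) -> list[str]:
    """Return the name of every hygiene rule the reply violates (empty if clean)."""
    return [name for name, pattern in SUSPECT_PATTERNS.items() if pattern.search(reply)]


# Message kinds worth re-reading on later turns (assumed labels).
KEEP_KINDS = {"user_request", "final_answer", "tool_fact"}


def scrub_history(messages: list[dict]) -> list[dict]:
    """Keep only messages whose kind is in KEEP_KINDS; drop everything else."""
    return [m for m in messages if m.get("kind") in KEEP_KINDS]
```

In this sketch a reply with any Sheriff violation would be rejected or retried before it is stored, and `scrub_history` would run before each new turn so that only the three kept message kinds are fed back to the main model.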

Why this beats using a bigger model

A single model must simultaneously solve tasks, maintain formatting discipline, manage context coherence, avoid poisoning itself with its own outputs, and recover from bad outputs, which is especially hard at local quantization levels. Splitting responsibilities so the main model does the work while a smaller model maintains runtime hygiene proved more effective than adding more parameters.
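
That division of labor can be sketched as a small gate around the main model. This is a minimal illustration under stated assumptions: both models are represented as plain callables, and the name `answer_with_hygiene`, the `"valid"`/`"junk"` labels, and the retry limit are hypothetical rather than taken from the developer's setup or from OpenClaw.

```python
from typing import Callable, Optional

# Models are plain callables so the sketch stays independent of any runtime.
MainModel = Callable[[str], str]  # prompt -> candidate reply
Judge = Callable[[str], str]      # candidate reply -> "valid" or "junk"


def answer_with_hygiene(prompt: str, main_model: MainModel, judge: Judge,
                        max_attempts: int = 3) -> Optional[str]:
    """Return the first candidate the judge accepts, or None if all are junk.

    Rejected candidates are discarded instead of being stored anywhere,
    so they cannot poison later turns.
    """
    for _ in range(max_attempts):
        candidate = main_model(prompt)
        if judge(candidate) == "valid":
            return candidate
    return None
```

The main model never has to police its own output here; the judge only answers a narrow yes/no question, which is the kind of task a smaller model can plausibly handle.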

Result: Sustained operation without resets

The setup went from needing /new resets every 20-30 minutes to sustained single-session operation on a Mac Studio M4 with 36GB RAM, fully local with no API calls.

📖 Read the full source: r/LocalLLaMA
