Using a smaller model as a runtime hygiene layer improves OpenClaw agent reliability

Problem: Sloppy outputs degrade long-running agents
A developer running OpenClaw locally on a Mac Studio M4 (36GB) with Qwen 3.5 27B (4-bit, oMLX) as a household agent found that the model did not become less capable over time; it became sloppy. Specific issues included:
- Tool calls leaking as raw text instead of structured tool use
- Planning thoughts bleeding into final replies
- Parroting tool results and policy text back to the user
- Malformed outputs poisoning the context, causing degradation with each subsequent turn
The core issue was runtime hygiene, not capability: the model knew what to do but failed to behave correctly inside the OpenClaw runtime.
Solution: Four-layer architecture for runtime hygiene
The developer implemented a four-layer approach that proved more effective than simply using a larger model:
- Summarization: Context compaction via lossless-claw (DAG-based, freshTailCount=12, contextThreshold=0.60). This provided the single biggest improvement.
- Sheriff: Regex and heuristic checks that catch malformed replies before they enter OpenClaw. This prevents leaked tool markup, planner ramble, and raw JSON from becoming durable context.
- Judge: A smaller, cheaper model that classifies borderline outputs as "valid final answer" versus "junk." This model is there for runtime hygiene, not intelligence: it acts as an immune system rather than a second brain. It also handles all summarization for lossless-claw.
- Ozempic (internal name): Aggressive memory scrubbing that ensures the model re-reads only user requests, final answers, and compact tool-derived facts on future turns—not planner rambling, raw tool JSON, retry artifacts, or policy self-talk.
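The Sheriff and Ozempic layers can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the developer's code: the regex patterns, the message shape (dictionaries with `kind` and `content` keys), and the names `sheriff_check` and `scrub_history` are all hypothetical, since the source does not publish the actual rules.

```python
import re

# Hypothetical patterns; the source does not list the developer's actual rules.
LEAK_PATTERNS = {
    "tool_markup": re.compile(r"</?tool_call>|<\|tool", re.IGNORECASE),
    "raw_json": re.compile(r"^\s*[\[{].*[\]}]\s*$", re.DOTALL),
    "planner_ramble": re.compile(
        r"^\s*(okay, |first, i (need|should)|let me think)", re.IGNORECASE
    ),
}


def sheriff_check(reply: str) -> list[str]:
    """Return the name of every pattern the reply trips (empty list = clean)."""
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(reply)]


# Assumed message kinds that are allowed to persist into future turns.
DURABLE_KINDS = {"user_request", "final_answer", "tool_fact"}


def scrub_history(messages: list[dict]) -> list[dict]:
    """Keep only durable messages; drop planner text, raw tool JSON, retries."""
    return [m for m in messages if m.get("kind") in DURABLE_KINDS]
```

In a setup like this, a reply that trips any Sheriff pattern would be rejected before it is stored, and `scrub_history` would run before each new turn so the model re-reads only the durable messages.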
Why this beats using a bigger model
A single model must simultaneously solve tasks, maintain formatting discipline, manage context coherence, avoid poisoning itself with its own outputs, and recover from bad outputs. That combination is especially hard at the quantization levels used for local inference. Splitting responsibilities, so that the main model does the work while a smaller model maintains runtime hygiene, proved more effective than adding more parameters.
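One way to picture this split is a gate that sits between the main model and the stored context. The sketch below is an assumption-laden illustration, not the developer's implementation: `gated_reply`, its parameters, the retry limit, and the "valid"/"junk" labels are hypothetical, and for simplicity the judge sees every reply that passes the cheap checks rather than only borderline ones.

```python
from typing import Callable, Optional


def gated_reply(
    generate: Callable[[], str],
    sheriff: Callable[[str], bool],
    judge: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first reply that passes both gates, or None if all fail.

    generate: calls the main model and returns a candidate reply.
    sheriff:  cheap deterministic check; True means the reply looks malformed.
    judge:    small-model classifier returning "valid" or "junk".
    """
    for _ in range(max_attempts):
        candidate = generate()
        if sheriff(candidate):
            continue  # rejected by cheap checks; never reaches the judge
        if judge(candidate) == "valid":
            return candidate
    return None  # caller decides what to do when every attempt fails
```

Because rejected candidates are discarded inside the loop, they never become part of the conversation history, which is the property the four layers are meant to protect.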
Result: Sustained operation without resets
With these layers in place, the setup went from needing /new resets every 20-30 minutes to sustained single-session operation on a Mac Studio M4 with 36GB RAM, fully local with no API calls.
📖 Read the full source: r/LocalLLaMA