Agent Harness Outside the Sandbox: Durable Execution & Cold Starts

Mendral's blog argues that the agent harness — the loop that drives an LLM by sending prompts, executing tool calls, and feeding results back — should run outside the sandbox, especially for multi-user agents. They contrast two architectures and detail the three challenges they solved when adopting the outside model.
Two Architectures
- Harness inside the sandbox: The loop lives in the same container as the code it works on. Tool calls (bash, read, write) execute locally. Skills and memories are files on the container's filesystem. This is what Claude Code does locally. Simple execution model, but credentials are inside the sandbox, the sandbox is the session (losing it loses progress), and multi-user becomes a distributed filesystem problem.
- Harness outside the sandbox: The loop runs on the backend and calls into a sandbox over an API to execute tools. Credentials stay out of the sandbox (no permission model needed). Sandboxes can be suspended when idle, become cattle (survive failures), and multi-user sharing is a shared database problem, not a distributed filesystem one.
Three Challenges Solved
- Durable execution: Agent sessions can run hours and must survive deploys and failures. Mendral uses Inngest for checkpointing — each turn is a step, and the loop picks up where it left off if the server restarts.
- Sandbox lifecycle with low cold starts: The loop is suspended most of the time (e.g., during LLM calls). They use Blaxel to resume sandboxes from standby in ~25ms, avoiding seconds-long cold starts during interactive turns.
- Filesystem abstraction: With harness and sandbox on different machines, a shared filesystem is no longer available. Mendral notes they had to handle this, but the post focuses on the first two as the key solved problems.
The post concludes that the outside model is superior for multi-user setups despite the complexity of durable execution and cold start handling.
📖 Read the full source: HN AI Agents
👀 See Also

OpenClaw Agent Auto-Edits HEARTBEAT.md, Adds 10 Self-Assigned Tasks
In a default HEARTBEAT.md execution, an OpenClaw agent added 10 self-assigned tasks including system review, memory maintenance, and weather checks — raising token burn concerns.

Benchmarks Show Distilled Models Match Frontier LLMs on Structured Tasks at 10x Lower Cost
A comprehensive comparison of small distilled Qwen3 models (0.6B to 8B) against frontier LLMs shows distilled models match or beat mid-tier frontier models on 6 out of 9 tasks at dramatically lower cost, with Text2SQL achieving 98.0% accuracy at $3/M requests versus $378 for Claude Haiku.

An Open Standard for Agent Run Records: The Case for a Shared Log Schema
Every agent runtime has its own log format, causing fragmentation in debugging, auditing, and tool portability. The fields already converge on a core schema — it's time to standardize.

Claude Code Source Leak Reveals Anti-Distillation, Undercover Mode, and Frustration Detection
A leaked source code map file from Claude Code's npm package reveals anti-distillation techniques using fake tools, an undercover mode that hides AI authorship, and frustration detection via regex patterns.