Phaselock: Open-Source AI Agent Control System with 4 Mechanisms

What Phaselock Does

Phaselock is an open-source Agent Skill that adapts parenting techniques used with autistic children to the problem of controlling AI agents. The developer created it after noticing parallels between the way AI agents fail when given vague instructions and the way autistic children process instructions.

Core Control Mechanisms

The system implements four specific control patterns:

Explicit gates before action: Uses a BeforeToolUse hook that checks for an approved gate file on disk. No file, no write. The AI cannot proceed until it has declared its architecture and that declaration has been approved.
Immediate feedback on mistakes: A PostToolUse hook runs static analysis after every file write (PHPStan, PHPCS, ESLint, ruff, or language-appropriate tools) and injects structured JSON results back into context. The AI sees exactly what broke and corrects itself before moving on.
Constrained choices, not open options: Complex features are broken into dependency-ordered slices. The AI works on one slice at a time, and each slice halts for human review before the next begins.
Rules that can't be rationalized away: Shell hooks either allow or block actions mechanically. The AI's opinion about its own output is not evidence.
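
To make the gate and slice ideas above concrete, here is a minimal Python sketch. It is not Phaselock's actual code: the gate file path, the set of write tool names, and the function names are all hypothetical, chosen only for illustration. It shows the "no file, no write" check as a pure function and uses the standard library's graphlib to order slices so that each one comes after the slices it depends on.

```python
from graphlib import TopologicalSorter
from pathlib import Path

# Hypothetical gate location; the source does not name the real file.
GATE_FILE = Path(".phaselock/gate.approved")

# Assumed names of tools that write files; real tool names depend on the host.
WRITE_TOOLS = {"Write", "Edit"}


def gate_decision(tool_name: str, gate_file: Path = GATE_FILE) -> dict:
    """Allow non-write tools; allow writes only if the gate file exists."""
    if tool_name not in WRITE_TOOLS:
        return {"allow": True, "reason": "not a write tool"}
    if gate_file.is_file():
        return {"allow": True, "reason": "gate approved"}
    # The check is mechanical: nothing the agent says can change the result.
    return {"allow": False, "reason": f"missing gate file: {gate_file}"}


def slice_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order slices so each one appears after everything it depends on.

    `depends_on` maps a slice name to the set of slices it requires.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

In a real hook, the returned decision would still have to be translated into whatever allow or block signal the host tool expects, which varies by tool and is not shown here. The slice list would be consumed one entry at a time, pausing for human review after each.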

Technical Implementation

Phaselock works with Claude Code, Cursor, Windsurf, and any other tool that supports hooks and agent skills. Its domain knowledge is shaped around Magento 2 and PHP, but the enforcement architecture is language-agnostic.

Scaling Challenge and Solution

Phaselock has a scaling problem: it loads every rule into context in every session. At 80 rules this is manageable, but at 500 rules the context budget is being spent before the task starts, and at 10,000 rules loading them all is impossible.
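
A rough back-of-the-envelope sketch shows why loading everything up front stops working. Both constants below are illustrative assumptions, not measurements from Phaselock: the source gives neither an average rule size nor a context window size.

```python
# Both constants are illustrative assumptions, not figures from the source.
TOKENS_PER_RULE = 150      # assumed average size of one rule, in tokens
CONTEXT_WINDOW = 200_000   # assumed model context window, in tokens


def rule_budget_share(rule_count: int,
                      tokens_per_rule: int = TOKENS_PER_RULE,
                      context_window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window consumed by loading every rule."""
    return rule_count * tokens_per_rule / context_window


for n in (80, 500, 10_000):
    print(f"{n:>6} rules -> {rule_budget_share(n):.1%} of context")
```

Under these assumed numbers, 10,000 rules would need several times the entire window, so no amount of trimming elsewhere would make them fit.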

The developer is building Writ as a solution: a hybrid retrieval system that works out which rules matter for the current task and returns only those. The reported figures are sub-10 ms retrieval and a 726x context reduction at 10,000 rules. Writ is still experimental and is undergoing stress-testing.
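
The source does not describe how Writ works internally. As one common reading of "hybrid retrieval", the sketch below ranks rules two ways (keyword overlap and cosine similarity over term counts) and merges the rankings with reciprocal rank fusion. Every function name here is hypothetical, and the term-count cosine is a stand-in for the learned embeddings a real system would more likely use.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_rank(query: str, rules: dict[str, str]) -> list[str]:
    """Rank rule ids by how many distinct query terms each rule contains."""
    q = set(tokens(query))
    score = {rid: len(q & set(tokens(body))) for rid, body in rules.items()}
    return sorted(rules, key=lambda rid: (-score[rid], rid))


def cosine_rank(query: str, rules: dict[str, str]) -> list[str]:
    """Rank rule ids by cosine similarity of term-count vectors."""
    qv = Counter(tokens(query))
    q_norm = math.sqrt(sum(v * v for v in qv.values()))

    def cosine(body: str) -> float:
        dv = Counter(tokens(body))
        d_norm = math.sqrt(sum(v * v for v in dv.values()))
        if q_norm == 0 or d_norm == 0:
            return 0.0
        return sum(qv[t] * dv[t] for t in qv) / (q_norm * d_norm)

    score = {rid: cosine(body) for rid, body in rules.items()}
    return sorted(rules, key=lambda rid: (-score[rid], rid))


def fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranking adds 1 / (k + rank) per item."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, rid in enumerate(ranking, start=1):
            fused[rid] += 1.0 / (k + rank)
    return sorted(fused, key=lambda rid: (-fused[rid], rid))


def retrieve(query: str, rules: dict[str, str], top_n: int = 2) -> list[str]:
    """Return only the top_n rule ids instead of the whole rule set."""
    rankings = [keyword_rank(query, rules), cosine_rank(query, rules)]
    return fuse(rankings)[:top_n]
```

The point of the sketch is the shape of the approach: only the top few rules reach the context, so the cost of a session no longer grows with the size of the rule set.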

Current Open Question

The developer is grappling with how to evaluate retrieval. At 80 rules the ground-truth queries are synthetic, and the developer does not yet know whether retrieval quality holds up on real queries from real sessions. The question posed to readers: "Has anyone tackled RAG evaluation at small corpus sizes where synthetic benchmarks might not reflect real usage? What did you learn?"
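
One way to frame the comparison is to score synthetic and real query sets with the same metrics and look at the gap. The sketch below computes recall@k and mean reciprocal rank, two standard retrieval metrics. It assumes each query has a hand-labeled set of relevant rule ids. The function names and data layout are hypothetical and not taken from Writ.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant rules that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, rid in enumerate(retrieved, start=1):
        if rid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one labeled query")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Running evaluate once on the synthetic queries and once on a small sample of labeled queries from real sessions would give two comparable score sets. A large gap between them would suggest the synthetic benchmark does not reflect real usage.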

📖 Read the full source: r/LocalLLaMA