IronBee: Open-source verification layer for Claude Code and Cursor

What IronBee does
IronBee is an open-source verification layer that installs hooks into Claude Code (and also works with Cursor) to prevent AI coding agents from shipping untested code. It addresses a common failure mode in which Claude Code confidently states "I've implemented the feature" without checking that the change actually works in the browser.
Key features
- Blocks task completion until the agent tests changes in a real browser
- Tracks every file edit, browser tool call, and verification attempt
- Forces the agent to submit structured verdicts (not just "looks good")
- Makes the agent fix and re-verify on failure
- Uses the browser-devtools MCP server so Claude Code can navigate pages, click buttons, fill forms, take screenshots, and check console errors
- Includes /ironbee-verify with different modes (default, full, visual, functional)
- Includes /ironbee-analyze for session analytics showing time spent coding vs fixing, problematic files, and agent improvement over time
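To make the "structured verdict" and "block completion" ideas above concrete, here is a minimal sketch of how such a gate could work. It is not IronBee's actual schema or code: the names Verdict, SessionEvent, and canComplete are hypothetical and chosen only for illustration. The assumption is that a task may finish only when, after the most recent file edit, the agent has made at least one browser tool call and then submitted a passing verdict that lists what it checked.

```typescript
// Hypothetical sketch: none of these names come from IronBee itself.
type VerdictStatus = "pass" | "fail";

interface Verdict {
  status: VerdictStatus;
  checks: string[]; // what the agent exercised in the browser
  issues: string[]; // problems observed (empty on a pass)
}

interface SessionEvent {
  kind: "edit" | "browser" | "verdict";
  detail: string;
  verdict?: Verdict; // only present on "verdict" events
}

// Completion is allowed only if, after the most recent file edit, there is
// at least one browser tool call followed by a passing, non-empty verdict.
function canComplete(events: SessionEvent[]): boolean {
  const lastEdit = events.map((e) => e.kind).lastIndexOf("edit");
  const afterEdit = events.slice(lastEdit + 1);

  const browserIdx = afterEdit.findIndex((e) => e.kind === "browser");
  if (browserIdx === -1) return false; // nothing was tested since the edit

  const verdicts = afterEdit
    .slice(browserIdx + 1)
    .filter((e) => e.kind === "verdict");
  const last = verdicts[verdicts.length - 1];
  if (!last || !last.verdict) return false; // no structured verdict submitted

  // A bare "looks good" with no listed checks does not count.
  return last.verdict.status === "pass" && last.verdict.checks.length > 0;
}

// Usage: an edit followed by an unverified claim is rejected, the same
// session is accepted after a browser check and a passing verdict, and a
// further edit invalidates the earlier verification.
const session: SessionEvent[] = [
  { kind: "edit", detail: "src/Login.tsx" },
];
const unverified = canComplete(session);

session.push({ kind: "browser", detail: "click #submit" });
session.push({
  kind: "verdict",
  detail: "login flow",
  verdict: { status: "pass", checks: ["form submits"], issues: [] },
});
const verified = canComplete(session);

session.push({ kind: "edit", detail: "src/Login.tsx" });
const staleAfterEdit = canComplete(session);
```

Tying the verdict to the latest edit is the design choice that matters in this sketch: any new change makes the previous verification stale, which matches the "fix and re-verify on failure" behavior described in the list.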
Performance data
According to the source, 82% of tracked sessions had bugs that Claude Code would have shipped without verification, which corresponds to a first-pass rate of only 18%. In testing, IronBee caught and fixed every bug before it shipped.
Setup
Installation requires two commands:
npm install -g @ironbee-ai/cli
cd your-project
ironbee install
Source information
Announcement blog post: https://medium.com/@serkan_ozal/introducing-ironbee-the-verification-and-intelligence-layer-for-ai-coding-agents-dd554279efa3
GitHub repository: https://github.com/ironbee-ai/ironbee-cli
📖 Read the full source: r/ClaudeAI