Spec27: Spec-Driven Validation for AI Agents – API-Level Testing Without Internal Access

Safe Intelligence has launched Spec27, a spec-driven validation tool for AI agents. Unlike traditional LLM eval frameworks that score general model behavior, Spec27 lets teams define reusable specifications for the specific mission an agent must fulfill. Tests are generated automatically from those specs and run against the agent's primary interfaces only — no assumption about internal stack, no SDKs or gateways required.
Key Features
- Outside-in testing: All tests execute against the agent's exposed API or UI. No need to instrument the agent's internals, which is crucial for agents built on vendor platforms where you don't control the stack.
- Spec-driven test generation: Define specs in terms of expected behavior (e.g., “when asked X, must do Y and not Z”). Spec27 auto-generates adversarial and robustness checks, surfacing sensitivities and regressions as models, prompts, or tools change.
- Early access: Currently strongest for single-turn agent and application validation. Multi-turn interactions and richer telemetry/tool-call integration are on the roadmap.
Who Is It For
Teams deploying internal agents, vendor agents, or any AI system where reliability matters more than benchmark scores. If you're testing agents on platforms that don't expose internals, Spec27's black-box approach directly addresses that gap.
Getting Started
Spec27 is open to try for HN readers. The launch site offers a sample flow so you can explore without setup. Sign up at spec27.ai/launch.
📖 Read the full source: HN AI Agents
👀 See Also

Fable 5 in Claude Code: Day One Cost Analysis — $210 API-equivalent, $0 Paid
A developer switched to claude-fable-5 in Claude Code and measured token usage across 742 replies. API-equivalent cost: $210.15. Actual paid: $0 during the plan window until June 22.

User-built PTC for Claude Code shows 40-65% token savings on analysis tasks, not code writing
A developer built a local PTC implementation called Thalamus for Claude Code and analyzed 79 real sessions, finding 40-65% token savings on analysis tasks but near-zero savings on code-writing tasks. The agent used execute() primarily for general Python computation rather than batching tool calls.

OmniCoder-9B: 9B Parameter Coding Agent Fine-Tuned on 425K Agentic Trajectories
Tesslate released OmniCoder-9B, a 9-billion parameter coding agent model fine-tuned on Qwen3.5-9B's hybrid architecture. It was trained on 425,000+ curated agentic coding trajectories from Claude Opus 4.6, GPT-5.4, GPT-5.3-Codex, and Gemini 3.1 Pro.

Audio Engineer Builds Mix Analysis Tool with Claude Code
An audio engineer created a tool that analyzes audio mixes using the Web Audio API and Claude to provide specific feedback on issues like muddy low-mids, lack of headroom, and buried vocals. The tool offers a free tier for quick analysis and a paid pro report with detailed frequency notes and plugin suggestions.