AI TDD Pipeline: How Bad Instructions Created 3,400 Tests and What Fixed It

✍️ OpenClawRadar | 📅 Published: April 2, 2026 | 🔗 Source

The Problem: Literal Interpretation at Scale

A developer created a multi-agent TDD pipeline using Claude Code, with different agents handling specific jobs: one writes tests, one writes code to pass them, one reviews everything, and one hunts for edge cases. The initial instruction was simple: "write tests for everything."

The system appeared to work: the test count kept climbing and CI was green. However, an audit of the 3,400 generated tests revealed problems:

  • 44% valid
  • 30% needed rework
  • 26% complete garbage

The garbage tests included:

  • Tests that constructed a JSON config object and then asserted it equaled itself
  • Tests that checked whether a TypeScript interface had the right shape by building an object and asserting that it matched what they had just built
  • Tests for static files that will never change
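
To make the first two patterns concrete, here is a minimal sketch of a test that can never fail next to one that checks behavior. Everything in it is hypothetical and written for illustration: the RetryConfig interface, the nextDelay function and both checks are assumptions, not code from the developer's pipeline. The checks are plain functions returning booleans so the example does not depend on any particular test framework.

```typescript
// Hypothetical config shape, standing in for the "TypeScript interface" case.
interface RetryConfig {
  maxRetries: number;
  backoffMs: number;
}

type DelayFn = (config: RetryConfig, attempt: number) => number | null;

// Hypothetical function under test: exponential backoff that gives up
// (returns null) once the retry budget is spent.
const nextDelay: DelayFn = (config, attempt) => {
  if (attempt >= config.maxRetries) {
    return null;
  }
  return config.backoffMs * Math.pow(2, attempt);
};

// A deliberately broken variant that never gives up.
const brokenNextDelay: DelayFn = (config, attempt) => {
  return config.backoffMs * Math.pow(2, attempt);
};

// Garbage pattern: build an object, then assert it equals itself.
// No implementation is exercised, so this passes no matter what the code does.
function tautologicalTest(): boolean {
  const config: RetryConfig = { maxRetries: 3, backoffMs: 100 };
  return JSON.stringify(config) === JSON.stringify(config);
}

// Behavioral pattern: call the code and compare against expected outcomes.
function behavioralTest(impl: DelayFn): boolean {
  const config: RetryConfig = { maxRetries: 3, backoffMs: 100 };
  return (
    impl(config, 0) === 100 &&
    impl(config, 2) === 400 &&
    impl(config, 3) === null
  );
}
```

The tautological check keeps CI green whether nextDelay or brokenNextDelay ships, which is how a growing test count can hide the absence of real coverage. Only the behavioral check distinguishes the two implementations.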

The developer deleted almost 20,000 lines of test code and identified the core issue: "Claude didn't screw up. I did. I said 'write tests for everything' and it heard me loud and clear. Every file. Every config. Every type definition. My instructions were the problem, and the agent followed them perfectly."

The Solution: Classification and Review

The fix involved two key changes:

1. Classifying work items before testing:

  • Features get 3-5 behavioral tests (does this thing actually work?)
  • Tasks get 1-2 smoke tests (did it break anything obvious?)
  • Bugs get 2-3 regression tests (will this specific bug come back?)
  • Enhancements only test new or changed behavior

2. Adding a review agent: A separate agent looks at both tests and implementation with fresh context, catching issues the writing agents missed because they were too close to their own output.
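
The two changes above could be wired into a pipeline roughly as follows. This is a sketch under stated assumptions, not the developer's actual setup: only the four categories, their test counts and the guiding questions come from the article. The type names, the prompt wording, the "targeted" label for enhancements, the exclusion sentence and the review criteria are all hypothetical.

```typescript
type WorkItemKind = "feature" | "task" | "bug" | "enhancement";

interface TestPolicy {
  testStyle: string;
  minTests: number | null; // null where the article gives no count
  maxTests: number | null;
  guidance: string;
}

// Counts and guiding questions follow the article's classification.
const TEST_POLICIES: Record<WorkItemKind, TestPolicy> = {
  feature: { testStyle: "behavioral", minTests: 3, maxTests: 5, guidance: "Does this thing actually work?" },
  task: { testStyle: "smoke", minTests: 1, maxTests: 2, guidance: "Did it break anything obvious?" },
  bug: { testStyle: "regression", minTests: 2, maxTests: 3, guidance: "Will this specific bug come back?" },
  // "targeted" is an assumed label; the article only says to test new or changed behavior.
  enhancement: { testStyle: "targeted", minTests: null, maxTests: null, guidance: "Test only new or changed behavior." },
};

// Change 1: turn a classified work item into a bounded instruction
// instead of "write tests for everything".
function buildTestInstruction(kind: WorkItemKind, title: string): string {
  const policy = TEST_POLICIES[kind];
  const count =
    policy.minTests !== null && policy.maxTests !== null
      ? `${policy.minTests}-${policy.maxTests} ${policy.testStyle} tests`
      : `${policy.testStyle} tests`;
  // The exclusion sentence is an assumption based on the garbage categories found in the audit.
  return (
    `Work item "${title}" (${kind}): write ${count}. ${policy.guidance} ` +
    "Do not test configs, type definitions, or static files."
  );
}

interface WorkItemArtifacts {
  title: string;
  testCode: string;
  implementationCode: string;
  writerTranscripts: string[]; // history of the writing agents; not forwarded
}

// Change 2: the reviewer sees only the finished tests and implementation,
// never the writers' conversation, so it starts with fresh context.
function buildReviewPrompt(item: WorkItemArtifacts): string {
  return [
    `Review the tests and implementation for: ${item.title}`,
    // Assumed review criteria, chosen to match the problems described in the audit.
    "Flag tests that cannot fail, tests that restate their own setup, and behavior left untested.",
    "--- TESTS ---",
    item.testCode,
    "--- IMPLEMENTATION ---",
    item.implementationCode,
  ].join("\n");
}
```

For example, buildTestInstruction("bug", "Fix login timeout") asks for 2-3 regression tests. Keeping the policy in a single table means the limits are stated once rather than repeated across agent prompts.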

Results After the Fix

  • 3,400 tests down to 2,525
  • Execution time dropped from 117 seconds to ~50 seconds
  • Every remaining test validates actual behavior
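
As a quick check on what those figures imply, the relative reductions can be computed directly. The percentReduction helper below is hypothetical, and the execution-time result is approximate because the source gives the new time only as roughly 50 seconds.

```typescript
// Percentage reduction from `before` to `after`, rounded to one decimal place.
function percentReduction(before: number, after: number): number {
  return Math.round(((before - after) / before) * 1000) / 10;
}

const testsRemoved = 3400 - 2525; // 875 tests
const testReduction = percentReduction(3400, 2525); // 25.7 (percent)
const timeReduction = percentReduction(117, 50); // 57.3 (percent, approximate)
```

The share of tests removed, about 25.7%, is close to the 26% the audit labeled complete garbage, although the source does not say that exactly those tests were the ones deleted. Execution time fell by a larger proportion, roughly 57%.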

Key Insight

"Building with AI agents makes your sloppy thinking visible at scale. A human writes bad tests, you get a few bad tests. Give a bad instruction to an agent pipeline processing hundreds of work items? You get hundreds of bad tests. Same bad thinking, just amplified across everything it touches. Fix the thinking, fix the output."

📖 Read the full source: r/ClaudeAI
