AI Coding Metrics Are Flawed: Codestrap Founders Cite a 3.7x Larger Codebase

Dorian Smiley and Connor Deeks, founders of AI advisory service Codestrap, argue that enterprises are struggling to implement AI effectively because no established playbook of reference architectures or use cases exists yet. They contend that many companies are pretending to have AI strategies while lacking the feedback loops needed to measure actual impact.

Problematic metrics and flawed outcomes

Smiley states that current AI coding evaluation focuses on the wrong metrics: "Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence." He identifies proper engineering metrics as deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity.
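As a rough illustration of how four of those metrics could be derived from deployment records, here is a minimal Python sketch. The `Deployment` record, its field names, and the sample data are all hypothetical and do not come from Codestrap. Incident severity is left out because it needs a severity scale that the article does not describe.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; the field names are assumptions for illustration.
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def delivery_metrics(deployments: list, window_days: int) -> dict:
    """Summarize delivery metrics over an observation window of `window_days`."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.caused_failure]
    # Only failures that have already been restored contribute to time-to-restore.
    restore_hours = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    lead_hours = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "mean_lead_time_hours": mean(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
    }


# Made-up sample data: four deployments over a two-day window.
sample = [
    Deployment(datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 12)),
    Deployment(datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 15),
               caused_failure=True, restored_at=datetime(2025, 1, 1, 17)),
    Deployment(datetime(2025, 1, 2, 8), datetime(2025, 1, 2, 9)),
    Deployment(datetime(2025, 1, 2, 10), datetime(2025, 1, 2, 13)),
]
metrics = delivery_metrics(sample, window_days=2)
print(metrics)
```

None of these numbers depend on how many lines of code or pull requests were produced, which is the distinction Smiley is drawing.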

To illustrate the consequences of poor measurement, Smiley cites a recent attempt to rewrite SQLite in Rust using AI: "It passed all the unit tests, the shape of the code looks right. It's 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product."

Foundational LLM limitations

Deeks points to fundamental problems with current LLM technology: "It's hard to teach them new facts. It's hard to reliably retrieve facts. The forward pass through the neural nets is non-deterministic, especially when you have reasoning models that engage an internal monologue to increase the efficiency of next token prediction, meaning you're going to get a different answer every time."

Smiley adds: "And they have no inductive reasoning capabilities. A model cannot check its own work. It doesn't know if the answer it gave you is right. Those are foundational problems no one has solved in LLM technology."

Proposed new measurement approach

The founders argue for developing new metrics specifically for AI-assisted engineering. Smiley suggests one potential metric: "measuring tokens burned to get to an approved pull request – a formally accepted change in software." He emphasizes that organizations need to experiment and iterate in feedback loops because "AI still doesn't work very well" even within coding contexts.
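One way to read the "tokens burned" idea is as total tokens spent divided by the number of approved pull requests. The sketch below follows that reading; the `PullRequest` record, its fields, and the decision to count tokens spent on rejected pull requests are assumptions made for illustration, not a definition given by Smiley.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    # Hypothetical record; the field names are illustrative assumptions.
    pr_id: str
    tokens_used: int
    approved: bool


def tokens_per_approved_pr(prs: list) -> Optional[float]:
    """Total tokens spent across all PRs divided by the number approved.

    Tokens spent on rejected PRs are included, on the assumption that
    they are part of the cost of reaching an accepted change.
    Returns None when nothing was approved, since the ratio is undefined.
    """
    approved_count = sum(1 for pr in prs if pr.approved)
    if approved_count == 0:
        return None
    total_tokens = sum(pr.tokens_used for pr in prs)
    return total_tokens / approved_count


# Made-up sample data: two approved PRs and one rejected PR.
sample_prs = [
    PullRequest("pr-1", tokens_used=120_000, approved=True),
    PullRequest("pr-2", tokens_used=80_000, approved=False),
    PullRequest("pr-3", tokens_used=40_000, approved=True),
]
cost = tokens_per_approved_pr(sample_prs)
print(cost)
```

Tracked over time, a falling value would suggest that less model effort is being spent per accepted change, while a rising one would point to more rework or rejected output.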

Deeks points to recent Amazon and AWS outages as a sign of problems to come, though Amazon has said those incidents were unrelated to AI.

📖 Read the full source: HN AI Agents
