Codestrap founders critique AI coding metrics and warn of quality issues

Dorian Smiley and Connor Deeks, founders of the AI advisory service Codestrap, argue that enterprises struggle to implement AI effectively because there is no established playbook of reference architectures or use cases to follow. They contend that many companies are pretending to have an AI strategy while lacking the feedback loops needed to measure actual impact.
Problematic metrics and flawed outcomes
Smiley states that current AI coding evaluation focuses on the wrong metrics: "Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence." He identifies proper engineering metrics as deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity.
To illustrate the consequences of poor measurement, Smiley cites a recent attempt to rewrite SQLite in Rust using AI: "It passed all the unit tests, the shape of the code looks right. It's 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product."
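The delivery metrics Smiley lists can be computed from ordinary deployment records. The following is a minimal sketch, not Codestrap's method: the `Deployment` record, its field names, and the `delivery_metrics` function are hypothetical, and incident severity is left out because it depends on how an organization classifies incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def _mean_hours(deltas):
    """Average a non-empty list of timedeltas, expressed in hours."""
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


def delivery_metrics(deployments, window_days):
    """Summarize deployments observed over a window of `window_days` days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at
                for d in failures if d.restored_at is not None]

    return {
        "deployments_per_day": len(deployments) / window_days,
        "mean_lead_time_hours": _mean_hours(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has a recorded restore time.
        "mean_time_to_restore_hours": _mean_hours(restores) if restores else None,
    }
```

None of these figures depends on how many lines of code or pull requests were produced, which is the distinction Smiley draws.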
Foundational LLM limitations
Deeks points to fundamental problems with current LLM technology: "It's hard to teach them new facts. It's hard to reliably retrieve facts. The forward pass through the neural nets is non-deterministic, especially when you have reasoning models that engage an internal monologue to increase the efficiency of next token prediction, meaning you're going to get a different answer every time."
Smiley adds: "And they have no inductive reasoning capabilities. A model cannot check its own work. It doesn't know if the answer it gave you is right. Those are foundational problems no one has solved in LLM technology."
Proposed new measurement approach
The founders argue for developing new metrics specifically for AI-assisted engineering. Smiley suggests one potential metric: "measuring tokens burned to get to an approved pull request – a formally accepted change in software." He emphasizes that organizations need to experiment and iterate in feedback loops because "AI still doesn't work very well" even within coding contexts.
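One way to read the "tokens burned to get to an approved pull request" idea is as a simple ratio. The sketch below is an illustration under stated assumptions rather than a definition from the founders: the `PullRequest` record and its fields are hypothetical, and it assumes tokens spent on rejected or abandoned pull requests still count toward the cost of the approved ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record of one AI-assisted pull request."""
    pr_id: str
    tokens_used: int   # input + output tokens spent on this PR (assumed to be tracked)
    approved: bool     # True once the change was formally accepted


def tokens_per_approved_pr(prs) -> Optional[float]:
    """Total tokens spent divided by the number of approved pull requests.

    Assumption: tokens spent on unapproved PRs are included in the numerator,
    since they were consumed without producing an accepted change.
    Returns None when nothing has been approved yet.
    """
    total_tokens = sum(pr.tokens_used for pr in prs)
    approved_count = sum(1 for pr in prs if pr.approved)
    if approved_count == 0:
        return None
    return total_tokens / approved_count
```

Tracked over time, a falling value would suggest that less model usage is needed per accepted change. It says nothing on its own about the quality of those changes, so it would sit alongside the delivery metrics rather than replace them.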
Deeks references recent Amazon and AWS outages as indicators of potential future problems, though Amazon has stated these incidents were unrelated to AI.
📖 Read the full source: HN AI Agents