Deterministic vs Probabilistic Code Generation: Why Bun's Vibe-Coded Rust Conversion Raises Red Flags

Noah Hall, writing for The Tech Enabler, draws a sharp line between deterministic and probabilistic code generation. He uses Bun's recent vibe-coded conversion of a million-line codebase from Zig to Rust as a cautionary tale. His core argument: deterministic systems produce consistent, reviewable results; LLMs introduce uncertainty that makes code review impossible at scale.
Deterministic Code Generation
Hall points to established deterministic tooling: Python's 2to3 for the Python 2→3 migration, and transpilers for languages like Elm, PureScript, and TypeScript that always produce the same JavaScript from the same source. His own languages work the same way: Derw can output JavaScript, TypeScript, or English; Tegan outputs JavaScript or Go; Mojie targets JavaScript, Python, or English. All are based on AST-to-AST transformation, so the same input always yields the same output. That consistency matters: "If a bug is consistent, we can fix it. If a bug is inconsistent, it becomes exponentially more difficult to fix."
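To make the "same input, same output" property concrete, here is a minimal sketch of a deterministic emitter. The `Expr` and `FunctionDecl` types and the `emitExpr`/`emitFunction` helpers are hypothetical, invented for this illustration; they are not taken from Derw, Tegan, Mojie, or any other tool Hall mentions.

```typescript
// Hypothetical mini-AST, invented for this sketch (not any real compiler's AST).
type Expr =
  | { kind: "num"; value: number }
  | { kind: "var"; name: string }
  | { kind: "add"; left: Expr; right: Expr };

type FunctionDecl = { name: string; params: string[]; body: Expr };

// Pure function: no randomness, clock, or I/O, so the output depends only on the input.
function emitExpr(expr: Expr): string {
  switch (expr.kind) {
    case "num":
      return String(expr.value);
    case "var":
      return expr.name;
    case "add":
      return `(${emitExpr(expr.left)} + ${emitExpr(expr.right)})`;
  }
}

function emitFunction(decl: FunctionDecl): string {
  return `function ${decl.name}(${decl.params.join(", ")}) { return ${emitExpr(decl.body)}; }`;
}

const addDecl: FunctionDecl = {
  name: "add",
  params: ["a", "b"],
  body: {
    kind: "add",
    left: { kind: "var", name: "a" },
    right: { kind: "var", name: "b" },
  },
};

// Two "compilations" of the same AST.
const firstRun = emitFunction(addDecl);
const secondRun = emitFunction(addDecl);

console.log(firstRun); // function add(a, b) { return (a + b); }
console.log(firstRun === secondRun); // true
```

Because the emitter is a pure function of the AST, a bug in it would show up identically on every run, which is the property Hall's quote about consistent bugs relies on.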
Probabilistic Code Generation
LLMs can vary their output from run to run: the same prompt may yield A one time and B the next. Hall created neuro-lingo three years ago as a parody: humans write only function signatures and comments, and LLMs generate the implementation fresh on each compilation. An example:
function add(a: number, b: number): number {
  // Add two numbers together
}
function main() {
  // Print "Hello World" to the console
  // Print the result of add(2, 3)
}
"Every time neuro-lingo is compiled, the code is generated from fresh by the LLMs. It's slightly different each time. Sometimes it introduces bugs. Sometimes it's clean and simple. Sometimes it's chaotic." Hall argues that fully AI-driven code flows are doing exactly this, but shipping to production with human accountability.
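The behaviour Hall describes can be imitated with a toy stand-in for the model. This is only a sketch under loud assumptions: the `candidates` list, `generateBody`, and `compileAdd` are hypothetical, and a real LLM samples token by token rather than choosing from a fixed list. It is not neuro-lingo's actual implementation.

```typescript
// Toy stand-in for an LLM (an assumption for illustration): it picks one of
// several plausible bodies for add() at random instead of sampling tokens.
const candidates: string[] = [
  "return a + b;",
  "const sum = a + b; return sum;",
  "return a - b; // plausible-looking bug",
];

// `random` is injectable so the sketch can also be exercised reproducibly.
function generateBody(random: () => number = Math.random): string {
  const index = Math.floor(random() * candidates.length);
  return candidates[index];
}

// One "compilation": the signature is fixed, the body is generated fresh.
function compileAdd(random?: () => number): string {
  return `function add(a, b) { ${generateBody(random)} }`;
}

// Compile the same signature 100 times and collect the distinct results.
const seen = new Set<string>();
for (let i = 0; i < 100; i++) {
  seen.add(compileAdd());
}

console.log(`${seen.size} distinct outputs from identical input`);
```

The input never changes between iterations, yet the set usually ends up holding more than one implementation, and one of the candidates is silently wrong. That is the inconsistency a reviewer or a test suite has to catch on every single build.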
The "There Are Tests" Fallacy
Tests alone can't guarantee quality. Hall cites SQLite as the most tested codebase: 155.8 KSLOC of C code vs. 92,053.1 KSLOC of test code (roughly 590 times as much). Despite 100% branch coverage, millions of test cases, and extensive harnesses, SQLite still relies on human review. Hall holds Bun's conversion to the same standard and finds that it falls short: "It is not possible for a human to review 1 million lines of changes in 9 days. Bun has not reviewed the code they have merged to master."
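The figures above can be checked with a few lines of arithmetic. The SQLite and Bun numbers come from the text; the team size and working hours below are assumptions chosen purely for illustration, since the source does not say how many people reviewed the change.

```typescript
// Figures quoted in the article.
const sqliteSourceKsloc = 155.8;
const sqliteTestKsloc = 92053.1;
const changedLines = 1000000;
const reviewDays = 9;

// Ratio of test code to source code in SQLite.
const testToSourceRatio = sqliteTestKsloc / sqliteSourceKsloc;

// Lines that would need review per calendar day.
const linesPerDay = changedLines / reviewDays;

// ASSUMPTIONS (not from the source): 10 reviewers, each reviewing 8 hours a day.
const assumedReviewers = 10;
const assumedHoursPerDay = 8;
const linesPerReviewerHour = linesPerDay / (assumedReviewers * assumedHoursPerDay);

console.log(`test/source ratio: ${testToSourceRatio.toFixed(1)}`);
console.log(`lines per day: ${Math.round(linesPerDay)}`);
console.log(`lines per reviewer-hour: ${Math.round(linesPerReviewerHour)}`);
```

Even under these generous assumed staffing numbers, each reviewer would have to read well over a thousand changed lines every hour for nine days straight, which is the scale problem behind Hall's quote.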
Hall concludes that deterministic code generation still needs validation, and probabilistic generation creates risk that scales with line count. The source article goes deeper on each example.
📖 Read the full source: HN AI Agents