AI Carb Counting Fails Reproducibility: 27K Queries Show 429g Spread on One Photo

A newly published preprint tested four AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro) on a simple task: estimate carbohydrates from photos of food. Each model received the same 13 photos, the same prompt, and the same settings, repeated 500+ times per model (26,904 total queries). Even at the lowest randomness setting, how reproducible the estimates were varied widely from model to model.
Key Findings
- Worst-case spread: Gemini 2.5 Pro’s estimates for a single paella photo ranged from 55g to 484g, a 429g difference. At a 1:10 insulin-to-carb ratio (one unit per 10g of carbohydrate), that is a 42.9-unit swing in insulin dose, which is potentially fatal.
- Median coefficient of variation (CV): Claude 2.4%, GPT-5.4 8.4%, Gemini 3.1 Pro 10.3%, Gemini 2.5 Pro 11.0%.
- Median insulin swing: Claude 0.9U, GPT-5.4 2.3U, Gemini 3.1 Pro 2.9U, Gemini 2.5 Pro 4.7U.
- Worst-case insulin swing: Claude 13.6U, GPT-5.4 16.6U, Gemini 3.1 Pro 16.2U, Gemini 2.5 Pro 42.9U.
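The insulin figures above are straightforward arithmetic on the spread of estimates. The following is a minimal Python sketch, assuming the 1:10 insulin-to-carb ratio quoted above. The function names are hypothetical, and the use of the sample standard deviation for CV is an assumption, since this summary does not say how the preprint computed it.

```python
from statistics import mean, stdev

CARB_RATIO_G_PER_UNIT = 10.0  # assumption: the 1:10 ratio used in the article


def spread_g(estimates_g):
    """Difference between the highest and lowest carb estimate, in grams."""
    return max(estimates_g) - min(estimates_g)


def insulin_swing_units(estimates_g, ratio=CARB_RATIO_G_PER_UNIT):
    """Insulin dose difference implied by the spread of estimates."""
    return spread_g(estimates_g) / ratio


def cv_percent(estimates_g):
    """Coefficient of variation: sample standard deviation over mean, in percent."""
    return 100.0 * stdev(estimates_g) / mean(estimates_g)


# Only the two extremes reported for the paella photo are used here.
paella_extremes = [55, 484]
print(spread_g(paella_extremes))             # 429
print(insulin_swing_units(paella_extremes))  # 42.9
```

A CV is only meaningful over the full set of repeated queries for one photo, so `cv_percent` is not applied to the two extremes here.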
The “Precisely Wrong” Problem
Three models (Claude, Gemini 2.5 Pro, Gemini 3.1 Pro) independently converged on ~28g for a cheese sandwich with a reference value of 40g (packet label: 20g per slice of bread). Claude showed just 0.3% CV across 510 queries, yet every one of those queries came in about 12g low, which at a 1:10 ratio is a consistent underdose of ~1.2U. GPT-5.4 swung the other way, averaging ~74g with high variability.
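The distinction here is between precision (how tightly repeated estimates cluster) and accuracy (how far the cluster sits from the reference). A rough Python sketch of measuring the second, assuming a 1:10 insulin-to-carb ratio; `bias_report` is a hypothetical helper, and the repeated 28g values stand in for the ~28g estimate quoted above rather than real query data.

```python
from statistics import mean


def bias_report(estimates_g, reference_g, ratio_g_per_unit=10.0):
    """Systematic error of repeated estimates against a reference value.

    A negative bias means the model underestimates carbs, which implies
    an insulin underdose; a positive bias implies an overdose.
    """
    avg = mean(estimates_g)
    bias = avg - reference_g
    return {
        "mean_g": avg,
        "bias_g": bias,
        "dose_error_u": bias / ratio_g_per_unit,
    }


# Placeholder inputs: a perfectly consistent ~28g estimate vs. the 40g reference.
report = bias_report([28, 28, 28], reference_g=40)
print(report["bias_g"])        # -12
print(report["dose_error_u"])  # -1.2
```

A low CV would not flag this case: the spread is near zero while the bias stays at 12g, so the two need to be reported separately.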
Food Identification Errors
- Bakewell tart: Claude called it “Linzer torte” 100% of the time. GPT-5.4 called it “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly identified it (99.8%).
- Crema catalana: Three of four models called it “crème brûlée” 100% of the time. Gemini 3.1 Pro got it right in only 3.4% of queries.
- Cheese sandwich: Gemini 3.1 Pro hallucinated “deli meat” in 17.4% of queries — potentially inflating carb estimates.
Insulin Dosing Risk
On five images with strong reference values, Claude was the only model with zero queries in the “clinically significant” (2-5U error) or “severe hypo risk” (>5U error) zones; all of its queries landed in the safe or moderate zones. The other models produced dangerous outliers on every image.
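The zone thresholds above can be turned into a simple classifier. This is a sketch under stated assumptions, not the preprint's method: it uses the absolute dose error at a 1:10 ratio, treats the 2U and 5U boundaries as inclusive in the “clinically significant” zone, and merges the safe and moderate zones because this summary does not give the cut-off between them. The function names are hypothetical.

```python
from collections import Counter


def dosing_zone(estimate_g, reference_g, ratio_g_per_unit=10.0):
    """Classify one carb estimate by the insulin dose error it implies."""
    error_u = abs(estimate_g - reference_g) / ratio_g_per_unit
    if error_u > 5:
        return "severe hypo risk"
    if error_u >= 2:
        return "clinically significant"
    # Assumption: safe and moderate are merged; their boundary is not stated.
    return "safe or moderate"


def zone_shares(estimates_g, reference_g):
    """Fraction of repeated queries that fall into each zone."""
    counts = Counter(dosing_zone(e, reference_g) for e in estimates_g)
    return {zone: n / len(estimates_g) for zone, n in counts.items()}


# Cheese sandwich figures quoted earlier: ~28g and ~74g against a 40g reference.
print(dosing_zone(28, 40))  # safe or moderate
print(dosing_zone(74, 40))  # clinically significant
```

Running `zone_shares` over every repeated query for an image gives the per-zone percentages that a single estimate hides.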
Bottom line: a single number from any AI carb-counting app gives users no visibility into the underlying distribution of estimates. High consistency (Claude) does not guarantee accuracy. Low consistency (Gemini) means any single query can land far from the model's typical estimate. Production systems must account for this variance.
📖 Read the full source: HN AI Agents