Ontario Audit: 60% of AI Scribe Systems Mix Up Drugs, 85% Miss Mental Health Details

The Office of the Auditor General of Ontario audited 20 approved AI Scribe systems used by physicians and nurse practitioners, testing their accuracy against simulated doctor-patient recordings. The results are stark:
- 12 of 20 systems inserted incorrect drug information into patient notes.
- 9 of 20 fabricated information that was never discussed, e.g., claiming “no masses found” or “patient anxious.”
- 17 of 20 missed key mental health details from the recording.
- 6 of 20 fully or partially omitted mental health issues.
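The headline percentages follow directly from these counts. As a quick arithmetic check (the dictionary labels below are shorthand chosen for this sketch, not the audit's own wording):

```python
# Counts reported in the audit findings above, out of 20 systems tested.
TOTAL_SYSTEMS = 20

findings = {
    "incorrect drug information": 12,
    "fabricated information": 9,
    "missed key mental health details": 17,
    "omitted mental health issues": 6,
}

# Convert each count into a share of all audited systems.
shares = {issue: 100 * count / TOTAL_SYSTEMS for issue, count in findings.items()}

for issue, pct in shares.items():
    print(f"{issue}: {pct:.0f}%")
```

The first and third rows give the 60% and 85% figures quoted in the headline.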
The audit also slammed the scoring methodology used to evaluate vendors. Accuracy of medical notes accounted for just 4% of the total score, while having a domestic presence in Ontario contributed 30%. Bias controls, threat/risk/privacy assessments, and SOC 2 Type 2 compliance each counted only 2–4%. As the report states, such weightings “could result in the selection of vendors whose AI tools may produce inaccurate or biased medical records.”
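To see why those weightings matter, here is a minimal sketch of a weighted score. Only the 4% and 30% weights come from the audit; the two vendors, their criterion scores, and the lumping of all remaining criteria into a single 66% "other" bucket are hypothetical assumptions made for illustration.

```python
# Weights as fractions of the total score. Only "accuracy" and
# "ontario_presence" are taken from the audit; "other" is a hypothetical
# bucket holding everything else so that the weights sum to 1.
WEIGHTS = {"accuracy": 0.04, "ontario_presence": 0.30, "other": 0.66}


def total_score(criteria: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each on a 0-to-1 scale."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)


# Hypothetical vendors, not drawn from the audit.
accurate_outsider = {"accuracy": 0.95, "ontario_presence": 0.0, "other": 0.8}
inaccurate_local = {"accuracy": 0.40, "ontario_presence": 1.0, "other": 0.8}

print(f"accurate, no Ontario presence: {total_score(accurate_outsider):.3f}")
print(f"inaccurate, Ontario presence:  {total_score(inaccurate_local):.3f}")
```

Under these assumed numbers the less accurate vendor scores higher, because a 55-point gap in accuracy moves the total by about two points while Ontario presence alone moves it by thirty.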
While OntarioMD has recommended manual review of AI notes, the audit noted that none of the approved systems has a mandatory attestation feature. Ontario’s Ministry of Health said over 5,000 physicians use these tools with no reported patient harm.
📖 Read the full source: HN AI Agents