Claude Opus 4.6 accuracy drops on BridgeBench hallucination test

BridgeMind AI reported on Twitter that Claude Opus 4.6's accuracy on the BridgeBench hallucination test has decreased from 83% to 68%. The tweet was shared on Hacker News where it received 58 points and 11 comments.
The BridgeBench hallucination test is a benchmark used to measure how often AI models generate incorrect or fabricated information. A drop from 83% to 68% accuracy represents a significant performance regression in this specific evaluation.
For developers using AI coding agents, hallucination tests like BridgeBench are important for understanding model reliability. When models hallucinate in coding contexts, they can generate incorrect code, suggest non-existent APIs, or provide misleading documentation references.
The Hacker News discussion around this tweet likely includes technical analysis from developers who work with AI models. These conversations typically cover practical implications for development workflows, testing strategies, and how to mitigate hallucination risks in production systems.
Accuracy drops in specific benchmarks don't necessarily reflect overall model performance degradation, but they highlight areas where recent updates may have introduced regressions. Developers should verify critical code suggestions and maintain testing protocols when working with updated AI models.
📖 Read the full source: HN AI Agents
👀 See Also

Sora AI Video Economics: $20 User Costs OpenAI $65 in Compute
OpenAI's Sora AI video generation app reportedly costs $65 in compute per $20/month user, with peak inference costs estimated at $15 million daily versus $2.1 million total lifetime revenue.
Amazon Employees 'Tokenmaxxing' with MeshClaw AI Agents to Meet Usage Targets
Amazon developers are automating unnecessary tasks with the internal MeshClaw tool to inflate AI token consumption, after the company set weekly usage targets for 80% of devs and introduced internal leaderboards.

Kimi k2.5: Breaking New Ground in AI Automation
Kimi k2.5 has set a new standard for AI automation, boasting advanced capabilities that are turning heads in the tech community. Discover how it is reshaping the landscape.

Claude Code v2.1.73: Model Overrides, Stability Fixes, and Performance Improvements
Claude Code v2.1.73 adds modelOverrides for custom provider IDs, fixes critical freezes and deadlocks, resolves subagent model downgrades, and improves voice mode stability. The release addresses 18 specific issues including bash command permission prompts, session corruption, and Linux sandbox failures.