Claude Code Skill /veracity-tweaked-555 Finds Document Hallucinations

Veracity-Checking Skill Architecture

A researcher with a sleep science background from the University of Miami built a Claude Code skill called /veracity-tweaked-555 that decomposes documents into atomic claims and verifies each one via web search. The tool uses 16 parallel agents across 4 waves per run. It was built in collaboration with Claude Code (Opus 4.6): Claude drafted the code, and the researcher designed the methodology.
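The decompose-then-verify flow described above might be sketched as follows. This is a minimal illustration, not the skill's actual code: the sentence-based claim splitter, the `run_agent` stub, and the even split of 4 agents per wave are all assumptions, and no real web search is performed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

AGENTS_PER_RUN = 16  # from the post
WAVES_PER_RUN = 4    # from the post
AGENTS_PER_WAVE = AGENTS_PER_RUN // WAVES_PER_RUN  # assumption: waves are equal in size


@dataclass
class Verdict:
    claim: str
    supported: bool
    note: str


def split_into_claims(document: str) -> list[str]:
    """Naive stand-in for claim decomposition: treat each sentence as one claim."""
    flat = document.replace("\n", " ")
    return [part.strip() for part in flat.split(".") if part.strip()]


def run_agent(batch: list[str]) -> list[Verdict]:
    """Hypothetical agent stub. A real agent would search the web for each claim."""
    return [Verdict(claim=c, supported=False, note="unverified (stub)") for c in batch]


def assign_batches(claims: list[str], n_agents: int = AGENTS_PER_RUN) -> list[list[str]]:
    """Deal claims round-robin so every agent gets a similar share."""
    return [claims[i::n_agents] for i in range(n_agents)]


def run_audit(claims: list[str]) -> list[Verdict]:
    """Run 16 agent batches as 4 sequential waves of 4 parallel agents."""
    batches = assign_batches(claims)
    verdicts: list[Verdict] = []
    for wave_index in range(WAVES_PER_RUN):
        start = wave_index * AGENTS_PER_WAVE
        wave = batches[start:start + AGENTS_PER_WAVE]
        with ThreadPoolExecutor(max_workers=AGENTS_PER_WAVE) as pool:
            for agent_verdicts in pool.map(run_agent, wave):
                verdicts.extend(agent_verdicts)
    return verdicts


if __name__ == "__main__":
    sample = "The sky is blue. Water is wet. Cats are mammals."
    for verdict in run_audit(split_into_claims(sample)):
        print(verdict)
```

Running waves sequentially bounds how many agents are active at once, which is one plausible reason to use waves at all; the post does not state the actual motivation.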

Self-Audit Results and Error Patterns

When the researcher ran the veracity checker on its own SKILL.md documentation, it scored 62 out of 100. The skill designed to catch hallucinations contained hallucinated facts in its own documentation, including:

A fabricated performance statistic ("3x more accurate" for SAFE, which the paper never claims)
An inflated improvement claim ("+35.5%", when the actual figure was +5.5% over SOTA)
A fabricated acronym expansion for a real technique

After initial fixes, the score rose to 80, then to 84 on a third run. A week later, a more rigorous convergence loop (6 runs, 19 agents, and 35 additional fixes) brought it to a stable 96.5 out of 100. However, the v3 audit dropped to 74 because fixes made in v1 had introduced new errors (an understated token cost and an incomplete tool list).

The errors follow consistent patterns: attribution inflation (slightly stronger language than the source warrants), plausible-but-fabricated identifiers (PMIDs, arXiv IDs that look real but point to different papers), and stale statistics presented as current.
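The "plausible-but-fabricated identifier" pattern can be illustrated with a short sketch: an identifier passing a format check says nothing about whether it points to the cited paper. The regex below follows the new-style arXiv scheme (YYMM.NNNN or YYMM.NNNNN, with an optional version suffix). The `resolve_title` callable and the in-memory catalog in the usage example are hypothetical stand-ins for a real lookup, and the catalog entry is invented for illustration.

```python
import re
from typing import Callable, Optional

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
ARXIV_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_id(identifier: str) -> bool:
    """Format check only: a fabricated ID can pass this."""
    return ARXIV_NEW_STYLE.match(identifier) is not None


def _normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def check_citation(
    identifier: str,
    cited_title: str,
    resolve_title: Callable[[str], Optional[str]],  # hypothetical lookup: id -> title
) -> str:
    """Return 'malformed', 'not found', 'different paper', or 'ok'."""
    if not looks_like_arxiv_id(identifier):
        return "malformed"
    base_id = re.sub(r"v\d+$", "", identifier)  # drop the version suffix for lookup
    actual_title = resolve_title(base_id)
    if actual_title is None:
        return "not found"
    if _normalize(actual_title) != _normalize(cited_title):
        return "different paper"
    return "ok"


if __name__ == "__main__":
    # Invented catalog for illustration; not real arXiv metadata.
    catalog = {"2401.00001": "Example Paper A"}
    print(check_citation("2401.00001", "Some Other Title", catalog.get))
```

The "different paper" branch is the case the post describes: the identifier is well formed and resolves, but to something other than what the citation claims.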

Context Engineering Challenge

A single audit run generates approximately 917K tokens across 16 agents, exceeding Claude Code's 200K context window. When Claude Code compacts conversations to stay within limits, it performs lossy compression. After a few compactions, the agent loses track of how findings relate to each other — which fix caused which regression, which claim contradicts which other claim. Individual facts (names, numbers, function signatures) survive better than the connections between them.

Claude's diagnosis was that relational information — causal chains, cross-references, multi-step dependencies — is harder to preserve in a summary than isolated facts.
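The token figures quoted above work out as follows. This is plain arithmetic on the numbers from the post; the simplifying assumption is that the 917K tokens are spread evenly across agents and would all have to pass through a single 200K window.

```python
import math

TOTAL_TOKENS_PER_RUN = 917_000  # approximate figure from the post
AGENT_COUNT = 16                # from the post
CONTEXT_WINDOW = 200_000        # from the post

# Average output per agent, assuming an even spread (an assumption).
tokens_per_agent = TOTAL_TOKENS_PER_RUN / AGENT_COUNT

# How many times larger the run is than one context window.
overflow_ratio = TOTAL_TOKENS_PER_RUN / CONTEXT_WINDOW

# Minimum number of full windows needed to hold everything uncompressed.
min_windows = math.ceil(overflow_ratio)

if __name__ == "__main__":
    print(f"tokens per agent: {tokens_per_agent:,.1f}")
    print(f"overflow ratio:   {overflow_ratio:.3f}")
    print(f"windows needed:   {min_windows}")
```

At roughly 4.6 times the window size, a run cannot fit without either compaction or moving state out of the conversation.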

Solution and Additional Skill Audits

The researcher solved this by building a companion skill called /context-engineer that predicts overflow before it happens and externalizes relational state to JSON files on disk. The design test: if you can /clear your entire conversation and resume from the state file alone, the architecture is correct.
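A minimal sketch of externalizing relational state, assuming a simple schema of findings plus typed relations between them. The schema, function names, and finding IDs are hypothetical; the post does not describe the actual file format used by /context-engineer. The usage example mimics the design test by discarding the in-memory state and resuming from the file alone.

```python
import json
import os
import tempfile
from pathlib import Path


def new_state() -> dict:
    """Empty state: findings by ID, plus a list of relations between them."""
    return {"findings": {}, "relations": []}


def add_finding(state: dict, finding_id: str, summary: str) -> None:
    state["findings"][finding_id] = {"summary": summary}


def relate(state: dict, source: str, kind: str, target: str) -> None:
    """Record an edge such as (fix, 'caused', regression)."""
    state["relations"].append({"from": source, "kind": kind, "to": target})


def related(state: dict, source: str, kind: str) -> list[str]:
    """IDs reachable from `source` through relations of the given kind."""
    return [
        r["to"]
        for r in state["relations"]
        if r["from"] == source and r["kind"] == kind
    ]


def save_state(path, state: dict) -> None:
    """Write to a temp file, then replace, so a crash cannot leave a half-written file."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def load_state(path) -> dict:
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        state_file = Path(workdir) / "audit_state.json"  # hypothetical file name

        state = new_state()
        # Illustrative IDs loosely based on the regression described in the post.
        add_finding(state, "fix-v1-token-cost", "v1 fix that touched the token cost")
        add_finding(state, "err-understated-cost", "token cost is now understated")
        relate(state, "fix-v1-token-cost", "caused", "err-understated-cost")
        save_state(state_file, state)

        del state  # stand-in for /clear: nothing survives in memory
        resumed = load_state(state_file)
        print(related(resumed, "fix-v1-token-cost", "caused"))
```

Storing relations as explicit edges is the point of the design: the link between a fix and the regression it caused lives in the file, so it does not depend on surviving a lossy summary.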

Running veracity checks on other Claude Code skills revealed:

One skill had a fabricated paper title in its attribution section: the citation looked perfect (authors, venue), but the title was invented and the year was wrong
The same skill attributed an audit framework to the wrong standards body, and the error appeared in multiple locations
The /context-engineer skill had internal inconsistencies — prose said "5-10K tokens" while a table said "5-15K tokens" for the same metric
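The internal-inconsistency finding in the list above is the kind of error a mechanical check can catch. The sketch below collects every "A-BK tokens" range on lines that mention a given metric and flags the document if more than one distinct range appears. The metric name and sample text are hypothetical, since the post does not say which metric was affected, and keyword matching is a crude stand-in for identifying "the same metric".

```python
import re

# Matches ranges such as "5-10K tokens" or "5 - 15k tokens".
RANGE_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)\s*K tokens", re.IGNORECASE)


def token_ranges(text: str, keyword: str) -> set[tuple[int, int]]:
    """Distinct (low, high) ranges found on lines that mention `keyword`."""
    found: set[tuple[int, int]] = set()
    for line in text.splitlines():
        if keyword.lower() in line.lower():
            for low, high in RANGE_PATTERN.findall(line):
                found.add((int(low), int(high)))
    return found


def is_consistent(text: str, keyword: str) -> bool:
    """True when the metric is stated with at most one distinct range."""
    return len(token_ranges(text, keyword)) <= 1


if __name__ == "__main__":
    # Hypothetical excerpt: prose and a table row disagree on the same metric.
    sample = (
        "The state file overhead is 5-10K tokens per session.\n"
        "| state file overhead | 5-15K tokens |\n"
    )
    print(sorted(token_ranges(sample, "state file overhead")))
    print(is_consistent(sample, "state file overhead"))
```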

In total, 12 fixes were needed across all skills. After the corrections, every skill scored 95 or higher on 3 consecutive runs.

📖 Read the full source: r/ClaudeAI
