Researcher Builds Veracity-Checking Skill for Claude Code, Finds Hallucinations in Own Documentation
Veracity-Checking Skill Architecture
A researcher with a sleep science background from the University of Miami built a Claude Code skill called /veracity-tweaked-555 that decomposes documents into atomic claims and verifies each one via web search. Each run uses 16 parallel agents across 4 waves. The tool was built in collaboration with Claude Code (Opus 4.6): Claude drafted the code while the researcher designed the methodology.
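The wave structure can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: it assumes the 16 agents are split evenly into waves of 4 and that each agent handles one claim, neither of which the post confirms. `verify_claim` is a hypothetical stand-in for an agent that would run a web search.

```python
import asyncio

AGENTS_PER_WAVE = 4  # assumption: 16 agents / 4 waves, split evenly


async def verify_claim(claim: str) -> dict:
    """Hypothetical stand-in for one agent checking one claim via web search."""
    await asyncio.sleep(0)  # placeholder for real search latency
    return {"claim": claim, "verdict": "unverified"}


def plan_waves(claims: list[str]) -> list[list[str]]:
    """Group claims into consecutive waves of at most AGENTS_PER_WAVE."""
    return [claims[i:i + AGENTS_PER_WAVE]
            for i in range(0, len(claims), AGENTS_PER_WAVE)]


async def run_audit(claims: list[str]) -> list[dict]:
    results: list[dict] = []
    for wave in plan_waves(claims):
        # Agents within a wave run concurrently; waves run one after another.
        results.extend(await asyncio.gather(*(verify_claim(c) for c in wave)))
    return results
```

Running waves sequentially keeps the number of concurrent agents bounded, at the cost of waiting for the slowest agent in each wave before the next one starts.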
Self-Audit Results and Error Patterns
When the researcher ran the veracity checker on its own SKILL.md documentation, it scored 62 out of 100. The documentation for a skill designed to catch hallucinations contained hallucinated facts of its own:
- A fabricated performance statistic ("3x more accurate" for SAFE, which the paper never claims)
- An inflated improvement claim ("+35.5%" where the paper reports +5.5% over SOTA)
- A fabricated acronym expansion for a real technique
After initial fixes, the score reached 80, then 84 after a third run. A week later, after a more rigorous convergence loop with 6 runs, 19 agents, and 35 additional fixes, it stabilized at 96.5/100. Progress was not monotonic, however: the v3 audit dropped to 74 because v1 fixes had introduced new errors (an understated token cost and an incomplete tool list).
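The post does not spell out what counts as "stabilized" for this loop. A minimal sketch of one possible stopping rule, assuming the same criterion the article reports for the other skills (95 or higher on 3 consecutive runs); the function name and defaults are hypothetical:

```python
def has_converged(scores: list[float], threshold: float = 95.0, runs: int = 3) -> bool:
    """Return True if the most recent `runs` scores all meet `threshold`."""
    if len(scores) < runs:
        return False
    return all(score >= threshold for score in scores[-runs:])
```

Requiring several consecutive passes rather than a single high score guards against the regression pattern described above, where a fix raises the score on one run and introduces new errors that surface on the next.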
The errors follow consistent patterns: attribution inflation (slightly stronger language than the source warrants), plausible-but-fabricated identifiers (PMIDs, arXiv IDs that look real but point to different papers), and stale statistics presented as current.
Context Engineering Challenge
A single audit run generates approximately 917K tokens across 16 agents, exceeding Claude Code's 200K context window. When Claude Code compacts conversations to stay within limits, it performs lossy compression. After a few compactions, the agent loses track of how findings relate to each other — which fix caused which regression, which claim contradicts which other claim. Individual facts (names, numbers, function signatures) survive better than the connections between them.
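The scale of the overflow is easy to work out from the figures above. The sketch below assumes, for illustration only, that all agent output would have to pass through a single conversation and that tokens are spread evenly across agents; the post gives only the totals.

```python
import math

TOTAL_TOKENS = 917_000    # approximate tokens per audit run (from the post)
AGENTS = 16               # agents per run (from the post)
CONTEXT_WINDOW = 200_000  # context window size (from the post)

# Assumption: even split across agents; the real distribution is not reported.
tokens_per_agent = TOTAL_TOKENS / AGENTS

# How many times over the run exceeds one context window.
overflow_ratio = TOTAL_TOKENS / CONTEXT_WINDOW

# Minimum number of full windows needed to hold everything without compaction.
min_windows = math.ceil(overflow_ratio)

print(f"{tokens_per_agent:,.0f} tokens per agent on average")
print(f"{overflow_ratio:.2f}x the context window, at least {min_windows} windows")
```

Under these assumptions a single run is roughly 4.6 times larger than the window, so several rounds of compaction are unavoidable if nothing is stored outside the conversation.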
Claude's diagnosis was that relational information — causal chains, cross-references, multi-step dependencies — is harder to preserve in a summary than isolated facts.
Solution and Additional Skill Audits
The researcher solved this by building a companion skill called /context-engineer that predicts overflow before it happens and externalizes relational state to JSON files on disk. The design test: if you can /clear your entire conversation and resume from the state file alone, the architecture is correct.
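A minimal sketch of what externalizing relational state could look like. The schema, identifiers, and function names here are hypothetical, since the post does not describe the real file format; the point is only that links between findings are stored as explicit records rather than left in conversation history.

```python
import json
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write the audit state to disk so it survives compaction or /clear."""
    path.write_text(json.dumps(state, indent=2))


def load_state(path: Path) -> dict:
    """Read the audit state back from disk."""
    return json.loads(path.read_text())


def related(state: dict, source: str, kind: str) -> list[str]:
    """Return the targets linked from `source` by a relation of type `kind`."""
    return [r["to"] for r in state["relations"]
            if r["from"] == source and r["kind"] == kind]


# Illustrative data only: the IDs and descriptions are made up for this sketch.
state = {
    "findings": {
        "fix-1": "Edited the token cost figure",
        "err-7": "Token cost is now understated",
        "claim-3": "Prose states one token range",
        "claim-9": "Table states a different token range",
    },
    "relations": [
        {"from": "fix-1", "kind": "caused", "to": "err-7"},
        {"from": "claim-3", "kind": "contradicts", "to": "claim-9"},
    ],
}
```

With a file like this, a fresh session can call `load_state` and then `related(state, "fix-1", "caused")` to recover which regression a fix introduced, without relying on anything that was said earlier in the conversation.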
Running veracity checks on other Claude Code skills revealed:
- One skill had a fabricated paper title in its attribution section — the citation looked perfect (authors, venue) but the title was fabricated and the year was wrong
- The same skill attributed an audit framework to the wrong standards body, and the error appeared in multiple locations
- The /context-engineer skill had internal inconsistencies: prose said "5-10K tokens" while a table said "5-15K tokens" for the same metric
Across all skills, 12 fixes were needed. After the corrections, every skill scored 95 or higher on 3 consecutive runs.
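Internal inconsistencies like the mismatched token ranges can sometimes be caught without any web search. The sketch below is a much cruder check than the agent-based audit described in the post: it assumes every "N-MK tokens" range in a document refers to the same metric, which will not hold in general, and the function names are hypothetical.

```python
import re

# Matches ranges such as "5-10K tokens" and captures both bounds.
RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)K tokens")


def token_ranges(text: str) -> set[tuple[int, int]]:
    """Collect every distinct (low, high) token range mentioned in the text."""
    return {(int(low), int(high)) for low, high in RANGE_RE.findall(text)}


def has_conflicting_ranges(text: str) -> bool:
    """Flag the text if it states more than one distinct range."""
    return len(token_ranges(text)) > 1
```

For example, a document containing both "5-10K tokens" and "5-15K tokens" would be flagged, while one that repeats the same range in prose and in a table would not.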
📖 Read the full source: r/ClaudeAI