Building an Agentic RAG for Obsidian with Claude and an Eval Harness to Detect Hallucinations

✍️ OpenClawRadar · 📅 Published: May 16, 2026 · 🔗 Source

A developer on r/ClaudeAI built an agentic RAG system over their Obsidian vault so that Claude could answer questions from engineering PDFs without burning through the weekly token limit. The workflow: convert the PDFs to Markdown, drop them into the vault, have a cheap agent (Kimi K2.5) run BM25 retrieval over the vault, and show Claude only the relevant chunks instead of whole books. This cut the token cost per question from roughly 50k to roughly 5k.
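To make the retrieval step concrete, here is a minimal sketch of BM25 ranking over a handful of chunks. It is not the developer's code: the function names, the sample chunks, the tokenizer, and the `k1`/`b` values (common defaults) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (no stemming; an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Return (score, chunk_index) pairs sorted best-first using Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long chunks need more hits to score the same.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, i))
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Hypothetical vault chunks, invented for this example.
chunks = [
    "Bolt preload torque tables for M8 and M10 fasteners.",
    "Thermal expansion coefficients for aluminium alloys.",
    "Torque sequence for flange bolts and gasket seating.",
]
ranked = bm25_rank("bolt torque", chunks)
top_chunks = [chunks[i] for score, i in ranked[:2] if score > 0]
```

Only `top_chunks` would be passed to the expensive model, which is where the token saving comes from. Because BM25 matches exact terms, it depends on queries and documents sharing vocabulary.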

The new problem: the agent was sometimes confidently wrong, for example saying "Marcus Aurelius wrote about death in Book IX section 3" when the canonical passage was in Book IV section 5. The errors were plausible enough that answers had to be verified by hand. So the developer built an eval harness with Claude Sonnet 4.6 as the LLM judge, deliberately chosen from a different model family than the Kimi agent so that no model grades its own output.

The initial rubric had four buckets, including a 0.7 for "thin but not wrong." When the developer hand-graded the same rows (blind, on a different day), they also collapsed everything borderline into 0.7. The agreement number looked respectable but was actually measuring shared bias. After four rubric iterations, the working version dropped the middle bucket entirely and added a 0.9 bucket for one specific case: "right answer, wrong chunk." Previously this case was scored either as a false positive (a 1.0 that papered over a retrieval miss) or as a false negative (a 0.4 that punished a correct answer). The split fixed it.
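The revised rubric can be sketched as a small scoring function. The summary only names the 1.0, 0.9, and 0.4 scores, so the field names, the meaning given to 0.4 here, and the absence of any further low buckets are assumptions and not the author's actual rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical judge output for one eval row."""
    answer_correct: bool  # does the answer match the reference answer?
    chunk_correct: bool   # did retrieval surface the passage that supports it?


def rubric_score(verdict):
    """Map a verdict to a bucket; there is deliberately no 0.7 middle bucket."""
    if verdict.answer_correct and verdict.chunk_correct:
        return 1.0  # correct and grounded in the right passage
    if verdict.answer_correct:
        return 0.9  # "right answer, wrong chunk": flags the retrieval miss
    return 0.4      # assumed catch-all for wrong or unsupported answers


scores = [
    rubric_score(Verdict(answer_correct=True, chunk_correct=True)),
    rubric_score(Verdict(answer_correct=True, chunk_correct=False)),
    rubric_score(Verdict(answer_correct=False, chunk_correct=True)),
]
```

Splitting the verdict into two separate yes/no questions is one way to keep a retrieval miss from being scored as either a perfect answer or a wrong one.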

Under the new rubric, the judge's agreement with the human grader on 18 rows went from 7/18 (39%) to 17/18 (94%). Caveats: 18 rows is a small sample; there was a single grader, so inter-grader reliability is not established; and BM25 isn't novel, though it works well for technical and literary corpora where query and document vocabulary overlap heavily. There was also a negative result: the same chunking technique that lifted one corpus by 33pp regressed another by 17pp on the same eval, and the harness caught it on the first run.
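The agreement figures are exact-match rates between the judge's bucket and the human's bucket for each row. Here is a minimal sketch of that calculation. The function name is hypothetical and the score lists are placeholder data built only to reproduce the 17/18 count, not the author's actual grades.

```python
def agreement(judge_scores, human_scores):
    """Return (matches, total, rate) for exact bucket agreement per row."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human must grade the same rows")
    if not judge_scores:
        raise ValueError("need at least one graded row")
    matches = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    total = len(judge_scores)
    return matches, total, matches / total


# Placeholder grades: 18 rows, with the judge disagreeing on exactly one.
human = [1.0] * 10 + [0.9] * 4 + [0.4] * 4
judge = list(human)
judge[0] = 0.9

matches, total, rate = agreement(judge, human)
summary = f"{matches}/{total} ({rate:.0%})"  # "17/18 (94%)"
before = f"{7 / 18:.0%}"                     # "39%", the figure under the old rubric
```

With a single grader and 18 rows, one changed grade moves the rate by more than five percentage points, which is why the small-sample caveat matters.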

The full writeup, with the four-iteration rubric story, the calibration worksheet, and the negative-result note, is on Medium. The author wants to hear from others who use Claude Sonnet as a judge for their RAG or agent setups: what rubric they landed on, and how they handle inter-grader reliability with a single human in the loop.

📖 Read the full source: r/ClaudeAI

