How to Build a YouTube MCP Server With Persistent Indexes

A developer has shared detailed architecture notes from building a YouTube MCP server that implements persistent local indexes, contrasting with the common "extract-and-forget" pattern observed in over 40 existing servers.

Architecture Decisions

Three-tier fallback on every tool: Each tool tries the YouTube Data API first, then yt-dlp, then page extraction. Every response includes a provenance field ({sourceTier, fallbackDepth, partial, fetchedAt, sourceNotes}) so that degradation is never silent. Quota exhaustion on tier 1 produces a degraded response with clear provenance instead of a failure.
Persistence model: SQLite + sqlite-vec for local vector storage in a single file, with no Docker or external database. Embeddings persist across sessions, so knowledge accumulates: the tenth query against an indexed playlist is richer and faster than the first.
Embedding provider abstraction: Uses Gemini text-embedding-004 (768d) when a Gemini key is present and falls back to all-MiniLM-L6-v2 (384d), run fully offline via local inference, when it is not. Both sit behind the same abstraction, so semantic search works with zero API keys at reduced quality and upgrades transparently when a key is added.
Visual search as a separate index: Built from three independent layers: Apple Vision VNGenerateImageFeaturePrintRequest for per-frame feature prints (image-to-image similarity), Gemini Vision for a natural-language scene description of each keyframe, and Gemini text-embedding-004 for 768d embeddings over OCR text plus descriptions (text→visual search). Results return actual frame paths on disk, timestamps, and match reasoning, and the whole index is separate from the transcript pipeline.
Token efficiency via strict output schemas: Responses are 75–87% smaller than raw YouTube API output, achieved by dropping thumbnails, eTags, and localization fields and by reporting normalized engagement ratios instead of raw counts.
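
The fallback chain and its provenance field can be illustrated with a short sketch. This is not the project's code: it is written in Python for brevity, the provenance field names come from the description above, and everything else (the `TierUnavailable` exception, the `fetch_with_fallback` function, the tier labels, the per-tier `partial` flag, and the stub fetchers) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Optional


class TierUnavailable(Exception):
    """Hypothetical error a tier raises when it cannot serve a request."""


@dataclass
class Provenance:
    # Field names follow the provenance object described in the notes.
    sourceTier: str
    fallbackDepth: int
    partial: bool
    fetchedAt: str
    sourceNotes: list[str] = field(default_factory=list)


@dataclass
class ToolResponse:
    data: Optional[dict[str, Any]]
    provenance: Provenance


# Assumed shape of a tier: (label, fetch function, whether its data is partial).
Tier = tuple[str, Callable[[str], dict[str, Any]], bool]


def fetch_with_fallback(video_id: str, tiers: list[Tier]) -> ToolResponse:
    """Try each tier in order; record which one answered and what failed first."""
    notes: list[str] = []
    for depth, (name, fetch, partial) in enumerate(tiers):
        try:
            data = fetch(video_id)
        except TierUnavailable as exc:
            notes.append(f"{name} failed: {exc}")
            continue
        fetched_at = datetime.now(timezone.utc).isoformat()
        return ToolResponse(data, Provenance(name, depth, partial, fetched_at, notes))
    # Every tier failed: still return a response that explains why.
    fetched_at = datetime.now(timezone.utc).isoformat()
    return ToolResponse(None, Provenance("none", len(tiers), True, fetched_at, notes))


# Stub tiers standing in for the real Data API, yt-dlp, and page scraper.
def data_api(video_id: str) -> dict[str, Any]:
    raise TierUnavailable("quota exceeded")


def yt_dlp(video_id: str) -> dict[str, Any]:
    return {"id": video_id, "title": "Example title"}


def page_extraction(video_id: str) -> dict[str, Any]:
    return {"id": video_id}


response = fetch_with_fallback("abc123", [
    ("youtube-data-api", data_api, False),
    ("yt-dlp", yt_dlp, False),
    ("page-extraction", page_extraction, True),
])
print(response.provenance)
```

In this run the first tier reports exhausted quota, so the caller still gets data, with `sourceTier` set to `yt-dlp`, `fallbackDepth` set to 1, and the tier 1 failure recorded in `sourceNotes`.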

Tradeoffs Encountered

Disk usage grows with persistence: Solved with TTL caches per tool category, a mediaStoreHealth diagnostic, and per-collection cleanup tools.
Visual indexing is expensive: Each video needs keyframe extraction, vision calls, OCR, and embeddings, so visual indexing is opt-in per video rather than automatic during import.
Three-tier fallback adds latency when earlier tiers fail: Considered worth it for reliability, as API quota exhaustion is a real problem in production, and yt-dlp/page extraction keep things working.
mcpName vs npm name collision risk: The MCP registry uses namespaced names of the form io.github.<user>/<name>, while npm names are flat. Solved by declaring both names explicitly and keeping them different.
Apple Vision locks the image-to-image similarity layer to macOS: Accepted tradeoff, as the Gemini-based layers work cross-platform.
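
The per-category TTL caches mentioned under disk usage could look roughly like the sketch below. It is an illustration only, in Python with the standard-library `sqlite3` module: the `TtlCache` class, the table layout, the category names, and the TTL values are all assumptions, since the notes do not describe the actual schema or expiry times.

```python
import json
import sqlite3
import time
from typing import Any, Optional

# Hypothetical TTLs in seconds per tool category; the notes give no real values.
TTL_SECONDS = {"metadata": 3600, "transcript": 7 * 24 * 3600}


class TtlCache:
    """Single-file (or in-memory) SQLite cache with a TTL per category."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            " category TEXT NOT NULL, key TEXT NOT NULL,"
            " value TEXT NOT NULL, stored_at REAL NOT NULL,"
            " PRIMARY KEY (category, key))"
        )

    def put(self, category: str, key: str, value: Any,
            now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (category, key, json.dumps(value), stored_at),
        )
        self.db.commit()

    def get(self, category: str, key: str,
            now: Optional[float] = None) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        current = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, stored_at FROM cache WHERE category = ? AND key = ?",
            (category, key),
        ).fetchone()
        if row is None or current - row[1] > TTL_SECONDS[category]:
            return None
        return json.loads(row[0])

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Delete expired rows in every category; return how many were removed."""
        current = time.time() if now is None else now
        removed = 0
        for category, ttl in TTL_SECONDS.items():
            cursor = self.db.execute(
                "DELETE FROM cache WHERE category = ? AND stored_at < ?",
                (category, current - ttl),
            )
            removed += cursor.rowcount
        self.db.commit()
        return removed
```

Passing `now` explicitly keeps expiry deterministic in tests. A cleanup tool along these lines would call `purge_expired` periodically so that short-lived categories are trimmed while long-lived ones, such as transcripts here, stay on disk.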

The code is open source, and the developer is open to discussing design decisions further, particularly on the persistence vs extraction tradeoff or the visual pipeline.

📖 Read the full source: r/LocalLLaMA

