Persistent Indexes Over Extraction: Architecture for a YouTube MCP Server

✍️ OpenClawRadar · 📅 Published: April 15, 2026

A developer has shared detailed architecture notes from building a YouTube MCP server that keeps persistent local indexes, in contrast to the "extract-and-forget" pattern observed in over 40 existing servers.

Architecture Decisions

  • Three-tier fallback on every tool: Each tool tries the YouTube Data API first, then yt-dlp, then page extraction. Every response includes a provenance field ({sourceTier, fallbackDepth, partial, fetchedAt, sourceNotes}) to prevent silent degradation. When tier 1 runs out of quota, the caller gets a degraded response with clear provenance instead of a failure.
  • Persistence model: SQLite + sqlite-vec for local vector storage in a single file, with no Docker or external database. Embeddings persist across sessions, so knowledge accumulates: the tenth query on an indexed playlist is richer and faster than the first.
  • Embedding provider abstraction: Uses Gemini text-embedding-004 (768d) when a Gemini key is present, and falls back to all-MiniLM-L6-v2 (384d), running fully offline via local inference, when it is not. Both sit behind the same abstraction, so semantic search works with zero API keys at reduced quality and upgrades transparently when a key is added.
  • Visual search as a separate index: Three independent layers. Apple Vision VNGenerateImageFeaturePrintRequest produces per-frame feature prints for image-to-image similarity; Gemini Vision writes a natural-language scene description per keyframe; and Gemini text-embedding-004 produces 768d embeddings over OCR text plus descriptions for text-to-visual search. Results include actual frame paths on disk, timestamps, and match reasoning, and the whole index is separate from the transcript pipeline.
  • Token efficiency via strict output schemas: Achieves 75–87% smaller responses than raw YouTube API output by removing thumbnails, eTags, and localization bloat, and using normalized engagement ratios instead of raw counts.
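
The three-tier fallback with provenance described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: only the provenance field names (sourceTier, fallbackDepth, partial, fetchedAt, sourceNotes) come from the notes. The field types, the TierError exception, the stub tier functions, and the idea that each tier declares up front whether its output is partial are all assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Callable


class TierError(Exception):
    """Hypothetical error for a tier that cannot serve the request."""


@dataclass
class Provenance:
    # Field names follow the notes; the types are assumptions.
    sourceTier: str
    fallbackDepth: int  # 0 = first tier answered, 1 = second, ...
    partial: bool       # True when the tier returns a reduced field set
    fetchedAt: str      # ISO 8601 UTC timestamp
    sourceNotes: list[str] = field(default_factory=list)


# (tier name, fetch function, whether this tier's output is partial)
Tier = tuple[str, Callable[[str], dict[str, Any]], bool]


def fetch_with_fallback(video_id: str, tiers: list[Tier]) -> dict[str, Any]:
    """Try each tier in order and record why earlier tiers were skipped."""
    notes: list[str] = []
    for depth, (name, fetch, is_partial) in enumerate(tiers):
        try:
            data = fetch(video_id)
        except TierError as exc:
            notes.append(f"{name} failed: {exc}")
            continue
        prov = Provenance(
            sourceTier=name,
            fallbackDepth=depth,
            partial=is_partial,
            fetchedAt=datetime.now(timezone.utc).isoformat(),
            sourceNotes=notes,
        )
        return {"data": data, "provenance": asdict(prov)}
    raise TierError("all tiers failed: " + "; ".join(notes))


# Stub tiers standing in for the real API client, yt-dlp, and page scraper.
def data_api(video_id: str) -> dict[str, Any]:
    raise TierError("quota exhausted")


def yt_dlp(video_id: str) -> dict[str, Any]:
    return {"id": video_id, "title": "Example title"}


def page_extraction(video_id: str) -> dict[str, Any]:
    return {"id": video_id}


TIERS: list[Tier] = [
    ("youtube_data_api", data_api, False),
    ("yt_dlp", yt_dlp, True),
    ("page_extraction", page_extraction, True),
]

result = fetch_with_fallback("abc123", TIERS)
print(result["provenance"]["sourceTier"], result["provenance"]["fallbackDepth"])
# prints: yt_dlp 1
```

Because the failure reason of each skipped tier is carried in sourceNotes, a caller can tell a quota-degraded answer from a first-tier one without parsing logs.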
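
The strict output schema idea can be illustrated with a small projection function. This is a sketch under assumptions: the input mimics the general shape of a YouTube Data API video resource with made-up sample values, and the output field names (likeRatio, commentRatio) and the choice of per-view ratios are hypothetical, since the notes do not say which ratios the server actually returns.

```python
import json
from typing import Any, Optional


def compact_video(item: dict[str, Any]) -> dict[str, Any]:
    """Project a raw video resource onto a small, fixed schema."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    # The Data API reports counts as strings, so convert before dividing.
    views = int(stats.get("viewCount", 0))
    likes = int(stats.get("likeCount", 0))
    comments = int(stats.get("commentCount", 0))

    def ratio(count: int) -> Optional[float]:
        return round(count / views, 4) if views else None

    # Thumbnails, etag, and localized copies are dropped on purpose.
    return {
        "id": item.get("id"),
        "title": snippet.get("title"),
        "channel": snippet.get("channelTitle"),
        "publishedAt": snippet.get("publishedAt"),
        "likeRatio": ratio(likes),
        "commentRatio": ratio(comments),
    }


# Made-up sample resource for illustration only.
raw = {
    "kind": "youtube#video",
    "etag": "sample-etag",
    "id": "abc123",
    "snippet": {
        "publishedAt": "2026-01-01T00:00:00Z",
        "channelTitle": "Example Channel",
        "title": "Example title",
        "thumbnails": {
            "default": {"url": "https://example.com/d.jpg", "width": 120, "height": 90},
            "high": {"url": "https://example.com/h.jpg", "width": 480, "height": 360},
        },
        "localized": {"title": "Example title", "description": "Example description"},
    },
    "statistics": {"viewCount": "2000", "likeCount": "150", "commentCount": "30"},
}

compact = compact_video(raw)
print(len(json.dumps(raw)), "->", len(json.dumps(compact)), "characters")
print(compact["likeRatio"], compact["commentRatio"])
```

The size reduction on this toy input says nothing about the 75–87% figure quoted above, which depends on real responses; the point is only that a fixed projection keeps the response shape predictable.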

Tradeoffs Encountered

  • Disk usage grows with persistence: Solved with TTL caches per tool category, a mediaStoreHealth diagnostic, and per-collection cleanup tools.
  • Visual indexing is expensive: Each video requires keyframe extraction, vision calls, OCR, and embeddings, so visual indexing is opt-in per video rather than automatic during import.
  • Three-tier fallback adds latency when earlier tiers fail: Considered worth it for reliability, as API quota exhaustion is a real problem in production, and yt-dlp/page extraction keep things working.
  • mcpName vs npm name collision risk: The MCP registry uses io.github.<user>/<name>, while npm names are flat. Solved by declaring both names explicitly and keeping them different.
  • Apple Vision locks the image-to-image similarity layer to macOS: Accepted tradeoff, as the Gemini-based layers work cross-platform.
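
The per-category TTL cache mentioned in the first tradeoff could look something like the sketch below, using Python's standard sqlite3 module. Everything here is an assumption for illustration: the table layout, the function names (open_cache, put, get, cleanup), and the TTL values are hypothetical and are not taken from the project.

```python
import json
import sqlite3
import time
from typing import Any, Optional

# Hypothetical TTLs in seconds per tool category.
TTL_SECONDS = {"search": 3600, "transcript": 7 * 24 * 3600}


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cache (
               category   TEXT NOT NULL,
               key        TEXT NOT NULL,
               value      TEXT NOT NULL,
               expires_at REAL NOT NULL,
               PRIMARY KEY (category, key)
           )"""
    )
    return conn


def put(conn: sqlite3.Connection, category: str, key: str, value: Any,
        now: Optional[float] = None) -> None:
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
        (category, key, json.dumps(value), now + TTL_SECONDS[category]),
    )
    conn.commit()


def get(conn: sqlite3.Connection, category: str, key: str,
        now: Optional[float] = None) -> Optional[Any]:
    """Return the cached value, or None if it is missing or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT value FROM cache WHERE category = ? AND key = ? AND expires_at > ?",
        (category, key, now),
    ).fetchone()
    return json.loads(row[0]) if row else None


def cleanup(conn: sqlite3.Connection, now: Optional[float] = None) -> int:
    """Delete expired rows and return how many were removed."""
    now = time.time() if now is None else now
    cur = conn.execute("DELETE FROM cache WHERE expires_at <= ?", (now,))
    conn.commit()
    return cur.rowcount


conn = open_cache()
put(conn, "search", "q:sqlite", {"hits": 3}, now=0)
put(conn, "transcript", "abc123", {"text": "hello"}, now=0)

fresh = get(conn, "search", "q:sqlite", now=10)
stale = get(conn, "search", "q:sqlite", now=4000)
removed = cleanup(conn, now=4000)
print(fresh, stale, removed)
# prints: {'hits': 3} None 1
```

Passing `now` explicitly keeps the expiry logic testable without sleeping; a real server would more likely rely on the wall clock and run cleanup from a diagnostic or maintenance tool.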

The code is open source, and the developer is open to discussing design decisions further, particularly on the persistence vs extraction tradeoff or the visual pipeline.

📖 Read the full source: r/LocalLLaMA
