Persistent Memory for Claude: Local Stack with MCP, 39ms Retrieval, 82% Token Reduction

✍️ OpenClawRadar📅 Published: May 8, 2026🔗 Source
Persistent Memory for Claude: Local Stack with MCP, 39ms Retrieval, 82% Token Reduction
Ad

A Reddit user built a local persistent memory layer for Claude that solves the zero-context problem between sessions. The stack runs entirely locally (no cloud, no API keys) and integrates via MCP. Key architecture: four layers (L0 append-only event log in SQLite, L1 structured facts deferred, L2/L3 wiki prose, L4 crystallized session nodes with summary + decisions + open threads), Qdrant Docker for vector search, llama.cpp with Qwen3-Embedding-4B on GPU and Qwen3.5-2B-Q4_K_M on CPU for embedding and chat, and a FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models).

Numbers

  • Token reduction vs grep+Read baseline: 82.7% mean, 86.2% median.
  • Retrieval F1: 0.50 vs 0.20 baseline.
  • Embed cold start ~4s; hot-path p95 39ms (was 2241ms before bug fix).
  • L4 session retrieval eval: 0.920 mean score (gate 0.6).
  • 738 chunks indexed across 104 markdown files.
Ad

Key Learned: Connection Reuse on Windows

The hot-path retrieve was stuck at 2241ms p95 even with GPU-resident embedding on a 4070 Ti Super. The cause: every httpx.post() opened a fresh TCP connection, and Windows localhost handshakes took ~2 seconds. Switching to a persistent httpx.Client with keep-alive dropped p95 to 39ms — a 57× speedup.

Other Surprises

  • Qwen3 thinking mode: If enable_thinking is not disabled via chat_template_kwargs: {enable_thinking: false} with --jinja on llama-server, the model spends all token budget on thinking blocks and outputs empty content.
  • MCP registration: Claude Desktop's agentic mode (Cowork) reads a plugin config file, not ~/.claude.json. The LKS service must be packaged as a proper Cowork .plugin bundle.

Who It's For

Developers who use Claude heavily and want a cost-effective, private, local memory layer that maintains context across sessions without cloud dependencies.

📖 Read the full source: r/ClaudeAI

Ad

👀 See Also