Persistent Memory for Claude: Local Stack with MCP, 39ms Retrieval, 82% Token Reduction

✍️ OpenClawRadar📅 Published: May 8, 2026🔗 Source

A Reddit user built a local persistent memory layer for Claude that solves the zero-context problem between sessions. The stack runs entirely locally (no cloud, no API keys) and integrates via MCP. Key architecture: four layers (L0 append-only event log in SQLite, L1 structured facts deferred, L2/L3 wiki prose, L4 crystallized session nodes with summary + decisions + open threads), Qdrant Docker for vector search, llama.cpp with Qwen3-Embedding-4B on GPU and Qwen3.5-2B-Q4_K_M on CPU for embedding and chat, and a FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models).

Numbers

Token reduction vs grep+Read baseline: 82.7% mean, 86.2% median.
Retrieval F1: 0.50 vs 0.20 baseline.
Embed cold start ~4s; hot-path p95 39ms (was 2241ms before bug fix).
L4 session retrieval eval: 0.920 mean score (gate 0.6).
738 chunks indexed across 104 markdown files.

Key Learned: Connection Reuse on Windows

The hot-path retrieve was stuck at 2241ms p95 even with GPU-resident embedding on a 4070 Ti Super. The cause: every httpx.post() opened a fresh TCP connection, and Windows localhost handshakes took ~2 seconds. Switching to a persistent httpx.Client with keep-alive dropped p95 to 39ms — a 57× speedup.

Other Surprises

Qwen3 thinking mode: If enable_thinking is not disabled via chat_template_kwargs: {enable_thinking: false} with --jinja on llama-server, the model spends all token budget on thinking blocks and outputs empty content.
MCP registration: Claude Desktop's agentic mode (Cowork) reads a plugin config file, not ~/.claude.json. The LKS service must be packaged as a proper Cowork .plugin bundle.

Who It's For

Developers who use Claude heavily and want a cost-effective, private, local memory layer that maintains context across sessions without cloud dependencies.

📖 Read the full source: r/ClaudeAI

👀 See Also

Tools

Caliber: Local CLI tool generates AI coding assistant configs from your repo

Caliber is a local-first CLI tool that scans repositories in languages like TypeScript, Python, Go, and Rust, then generates prompt and configuration files for AI coding assistants including Claude Code, Cursor, and Codex. It runs entirely on your machine with your own keys, has 13k npm installs, and is open source under MIT license.

Apr 7, 2026, 11:45 AM UTC

OpenClawRadar

Tools

OmniCoder-9B fine-tune shows strong performance for agentic coding on 8GB VRAM systems

A Reddit user tested OmniCoder-9B, a fine-tune of Qwen3.5-9B on Opus traces, with OpenCode and reported 40+ tokens per second speeds using Q4_K_M GGUF quantization at 100k context length on an 8GB VRAM system.

Mar 13, 2026, 05:45 AM UTC

OpenClawRadar

Tools

FixAI Dev: A Consumer Rights Game Using Claude Haiku with Strict JSON Contracts

A developer built a browser game where Claude Haiku acts as a corporate AI denying consumer requests; players argue using real consumer protection laws across 37 cases in EU, US, UK, and Australia. The architecture uses Haiku for language only, with server-side game logic and strict JSON contracts between components.

Mar 31, 2026, 07:45 PM UTC

OpenClawRadar

Tools

My Agent Built Himself an Interoception System — Now He Has Desires

Feb 7, 2026, 03:58 PM UTC

u/zerofucksleft