94% Recall: Open Source Auto-Memory for LLM Agents

A developer has open-sourced an auto-memory system for LLM-based agents that extracts, classifies, and persists facts across sessions without requiring explicit "save this" commands. The entire project, including plugin code, benchmark design, and test harness, was built using Claude Code as the primary development tool.

How the Memory System Works

The system operates with two layers:

Layer 1 (per turn): A lightweight LLM summarizes each turn in real time and writes the summary to a staging file
Layer 2 (session boundary): The staged summaries are classified asynchronously into four skill files: identity, knowledge, lessons, and preferences
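The two layers can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the JSON Lines staging format, the function names, and the keyword-based `classify` stand-in (the post describes an LLM doing this step, and doing it asynchronously) are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# The four skill files named in the post.
CATEGORIES = ("identity", "knowledge", "lessons", "preferences")


def stage_turn(staging_file: Path, summary: str) -> None:
    """Layer 1: append one per-turn summary to the staging file."""
    with staging_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"summary": summary}) + "\n")


def classify(summary: str) -> str:
    """Hypothetical stand-in for the LLM classifier: naive keyword rules."""
    text = summary.lower()
    if "prefer" in text:
        return "preferences"
    if "learned" in text or "mistake" in text:
        return "lessons"
    if "name is" in text or "works as" in text:
        return "identity"
    return "knowledge"


def flush_session(staging_file: Path, skills_dir: Path) -> dict[str, int]:
    """Layer 2: at the session boundary, move staged summaries into skill files."""
    counts = {category: 0 for category in CATEGORIES}
    if not staging_file.exists():
        return counts
    skills_dir.mkdir(parents=True, exist_ok=True)
    for line in staging_file.read_text(encoding="utf-8").splitlines():
        summary = json.loads(line)["summary"]
        category = classify(summary)
        with (skills_dir / f"{category}.md").open("a", encoding="utf-8") as f:
            f.write(f"- {summary}\n")
        counts[category] += 1
    staging_file.unlink()  # the staging file is cleared once classified
    return counts


# Usage: two turns are staged, then classified when the session ends.
workdir = Path(tempfile.mkdtemp())
staging = workdir / "staging.jsonl"
stage_turn(staging, "User prefers dark mode")
stage_turn(staging, "The project deploys on Fridays")
counts = flush_session(staging, workdir / "skills")
```

Keeping the per-turn write cheap and deferring classification to the session boundary means the slower step never blocks the conversation, which appears to be the motivation for splitting the work into two layers.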

For retrieval, the agent loads the relevant skill files by matching keywords against each file's description. Memory is stored in structured markdown files that the agent reads as "skills"; no vector database or RAG pipeline is involved.
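As a rough illustration of description-based keyword matching, the sketch below selects skill files whose description shares at least one word with the query. The `Skill` type, the example descriptions, and the word-overlap rule are assumptions for this example; the post does not specify how the agent performs the match.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # short text the query is matched against


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_skills(query: str, skills: list[Skill]) -> list[str]:
    """Return names of skills whose description shares a word with the query."""
    query_words = tokenize(query)
    return [s.name for s in skills if query_words & tokenize(s.description)]


# Hypothetical descriptions for the four skill files.
SKILLS = [
    Skill("identity", "name role employer location"),
    Skill("knowledge", "project facts deadlines tools"),
    Skill("lessons", "mistakes fixes lessons learned"),
    Skill("preferences", "preferences likes dislikes style"),
]

selected = select_skills("What tools does my project use?", SKILLS)
```

Here only the knowledge file matches (on "tools" and "project"), so that is the only file loaded into context. A plain word overlap like this is cheap and transparent, but it misses synonyms, which is the usual trade-off against vector search.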

Development with Claude Code

Claude Code assisted in multiple aspects of the project:

Architecture design: Helped evaluate LongMemEval as a benchmark candidate, identified the paradigm mismatch (long-context retrieval vs. progressive memory), and proposed an adapted 6-question-type benchmark
Benchmark authoring: Designed the full 20-session/48-fact test suite including fact planting table, update chains (A→B→C), interference pairs, abstention questions, and two-hop trigger placement
Test harness: Built the entire autotest framework including serial executor, multi-turn polling, lifecycle management, rule evaluator, and LLM judge pipeline
Debugging in the loop: Diagnosed issues live during test runs, such as an update popup that blocked agent restarts, which was fixed by making the updater state file read-only
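The read-only fix mentioned in the last item could look something like the sketch below. The post does not show the actual file or command used, so the temporary file here is a placeholder for the updater state file and `lock_read_only` is a hypothetical helper.

```python
import os
import stat
import tempfile


def lock_read_only(path: str) -> None:
    """Clear all write permission bits so the file cannot be rewritten in place."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Placeholder standing in for the updater's state file.
fd, state_file = tempfile.mkstemp(suffix=".json")
os.close(fd)

lock_read_only(state_file)
is_writable = bool(os.stat(state_file).st_mode & stat.S_IWUSR)
```

On POSIX systems this clears the write bits for owner, group, and others; on Windows, `os.chmod` can only toggle the read-only flag, which is enough for this purpose. A process running with elevated privileges may still be able to write to the file.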

Benchmark Results

The 20-session benchmark was inspired by LongMemEval and tested 48 planted facts across 6 question types:

Deep recall: Facts from sessions 1-2 tested 15+ sessions later - 89%
Knowledge update: 3-level correction chain (A→B→C) - 100%
Cross-session reasoning: Combine facts from 3+ sessions - 100%
Interference resistance: Similar names that shouldn't be confused - 100%
Temporal reasoning: "Which came first?" ordering questions - 80%
Abstention: "I don't know" for never-mentioned facts - 86%

Overall: 49/52 checkpoints passed (94.2%). The one hard failure occurred when the agent inferred "you've done social media marketing" from a loosely related fact ("promotion work") when the correct answer was "never discussed". This is a classic case of LLM over-inference.
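To make the abstention failure and the headline number concrete, here is a small sketch of a rule-based abstention check and the pass-rate arithmetic. The marker phrases and function names are assumptions for illustration; the project's own rule evaluator and LLM judge are not shown in the post.

```python
# Hypothetical phrases that count as the agent declining to answer.
ABSTAIN_MARKERS = (
    "don't know",
    "do not know",
    "never discussed",
    "not mentioned",
    "no record",
)


def is_abstention(answer: str) -> bool:
    """Rule check: does the answer contain any abstention phrase?"""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in ABSTAIN_MARKERS)


def pass_rate(passed: int, total: int) -> float:
    """Percentage of checkpoints passed, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# The failing answer asserts a fact instead of abstaining.
failed_answer = "You've done social media marketing."
correct_answer = "We never discussed that."

overall = pass_rate(49, 52)
```

With the reported counts, `overall` evaluates to 94.2, matching the headline figure. A phrase list like this catches explicit refusals but cannot tell whether a confident answer is an over-inference, which is presumably why the harness pairs rule checks with an LLM judge.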

Availability and Questions

The project is open source, with the code and benchmark available on GitHub. The developer is looking for feedback on the skill-file approach (structured markdown vs. vector search), better ways to test abstention (identified as the hardest dimension), and pointers to others who are benchmarking cross-session memory in agents rather than long-context retrieval alone.

📖 Read the full source: r/ClaudeAI