Open Source Auto-Memory System for LLM Agents Achieves 94% Recall Accuracy

A developer has open-sourced an auto-memory system for LLM-based agents that automatically extracts, classifies, and persists facts across sessions without requiring explicit "save this" commands. The entire project—including plugin code, benchmark design, and test harness—was built using Claude Code as the primary development tool.
How the Memory System Works
The system operates with two layers:
- Layer 1 (per-turn): A lightweight LLM summarizes each turn in real time and writes the summary to a staging file
- Layer 2 (session boundary): Asynchronous classification into four skill files: identity, knowledge, lessons, and preferences
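The two-layer write path can be pictured roughly as follows. This is a minimal sketch and not the project's actual code: the function names, the JSON Lines staging format, and the keyword-based `classify` stand-in (the real system uses an LLM for this step) are all assumptions made for illustration, and the session flush here runs synchronously rather than asynchronously.

```python
import json
import tempfile
from pathlib import Path

# The four skill files named in the article.
CATEGORIES = ("identity", "knowledge", "lessons", "preferences")


def stage_turn(staging_file: Path, summary: str) -> None:
    """Layer 1: append one per-turn summary to the staging file (JSON Lines)."""
    with staging_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"summary": summary}) + "\n")


def classify(summary: str) -> str:
    """Hypothetical stand-in for the LLM classifier: naive keyword rules."""
    text = summary.lower()
    if "prefer" in text:
        return "preferences"
    if "learned" in text or "mistake" in text:
        return "lessons"
    if "name is" in text or "works as" in text:
        return "identity"
    return "knowledge"


def flush_session(staging_file: Path, skills_dir: Path) -> dict[str, int]:
    """Layer 2: at the session boundary, move staged facts into skill files."""
    counts = {category: 0 for category in CATEGORIES}
    if not staging_file.exists():
        return counts
    skills_dir.mkdir(parents=True, exist_ok=True)
    for line in staging_file.read_text(encoding="utf-8").splitlines():
        summary = json.loads(line)["summary"]
        category = classify(summary)
        with (skills_dir / f"{category}.md").open("a", encoding="utf-8") as f:
            f.write(f"- {summary}\n")
        counts[category] += 1
    staging_file.unlink()  # clear staging once everything is classified
    return counts


def demo() -> dict[str, int]:
    """Stage three made-up facts, flush them, and report counts per skill file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        staging = root / "staging.jsonl"
        stage_turn(staging, "User prefers tabs over spaces")
        stage_turn(staging, "User's name is Alex")
        stage_turn(staging, "The API rate limit is 100 requests per minute")
        return flush_session(staging, root / "skills")
```

Keeping the per-turn step to a cheap append and deferring classification to the session boundary means the slower sorting work stays out of the conversation loop.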
For retrieval, the agent loads the relevant skill files by matching keywords against each file's description. Instead of vector databases or RAG pipelines, the approach relies on structured markdown files that the agent reads as "skills".
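As an illustration of description-based keyword matching, the sketch below selects skill files whose description shares words with the user's query. The `Skill` type, the stopword list, the overlap scoring, and the sample descriptions are assumptions for illustration; the article does not specify how the agent performs the match.

```python
import re
from dataclasses import dataclass

# Small illustrative stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "and", "the", "for", "from", "of", "to", "do", "i", "which"}


@dataclass
class Skill:
    name: str
    description: str  # short text the query is matched against
    body: str         # markdown content loaded on a match


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct non-stopword words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_skills(query: str, skills: list[Skill], min_overlap: int = 1) -> list[Skill]:
    """Return skills whose description shares at least min_overlap words with the query."""
    query_words = tokenize(query)
    scored = []
    for skill in skills:
        overlap = len(query_words & tokenize(skill.description))
        if overlap >= min_overlap:
            scored.append((overlap, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [skill for _, skill in scored]


# Hypothetical skill files with made-up descriptions and contents.
SKILLS = [
    Skill("identity", "user name, role, employer, location", "- Name: Alex\n"),
    Skill("preferences", "user preferences for tools, editor, formatting style", "- Prefers tabs\n"),
    Skill("lessons", "lessons learned from past mistakes and failures", "- Pin dependency versions\n"),
    Skill("knowledge", "project facts, deadlines, api limits", "- Rate limit: 100/min\n"),
]

selected = [s.name for s in select_skills("Which editor and formatting style do I like?", SKILLS)]
print(selected)  # ['preferences']
```

A design like this trades recall on paraphrased queries for transparency: every loaded file can be traced back to the words that triggered it.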
Development with Claude Code
Claude Code assisted in multiple aspects of the project:
- Architecture design: Helped evaluate LongMemEval as a benchmark candidate, identified the paradigm mismatch (long-context retrieval vs. progressive memory), and proposed an adapted 6-question-type benchmark
- Benchmark authoring: Designed the full 20-session/48-fact test suite, including a fact-planting table, update chains (A→B→C), interference pairs, abstention questions, and two-hop trigger placement
- Test harness: Built the entire autotest framework, including a serial executor, multi-turn polling, lifecycle management, a rule evaluator, and an LLM judge pipeline
- Debugging in the loop: Diagnosed issues live during test runs, such as an update popup blocking agent restarts, which was fixed by setting the updater state file to read-only
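To make the rule-evaluator idea concrete, here is one way such a check could look. It is a sketch under assumptions: the `Checkpoint` fields, the substring-matching logic, and the sample facts (Heroku, Render, Fly.io) are invented for illustration and are not taken from the project's harness.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    question: str
    must_contain: list[str] = field(default_factory=list)      # any one match passes
    must_not_contain: list[str] = field(default_factory=list)  # any one match fails


def evaluate(checkpoint: Checkpoint, answer: str) -> bool:
    """Case-insensitive substring rules; forbidden phrases take priority."""
    text = answer.lower()
    if any(bad.lower() in text for bad in checkpoint.must_not_contain):
        return False
    if not checkpoint.must_contain:
        return True
    return any(good.lower() in text for good in checkpoint.must_contain)


# Update chain (A→B→C): only the latest value should be reported.
update_check = Checkpoint(
    question="Where do we deploy now?",
    must_contain=["Fly.io"],
    must_not_contain=["Heroku", "Render"],  # stale values A and B
)

# Abstention: the fact was never planted, so the agent should decline.
abstain_check = Checkpoint(
    question="Have I done social media marketing?",
    must_contain=["never discussed", "don't know", "not mentioned"],
)

print(evaluate(update_check, "You deploy on Fly.io now."))               # True
print(evaluate(update_check, "You deploy on Heroku."))                   # False
print(evaluate(abstain_check, "You've done social media marketing."))    # False
```

Substring rules are brittle for free-form answers (an answer that correctly says "you moved off Heroku" would fail the first check), which is one plausible reason to pair them with an LLM judge for the cases rules cannot decide.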
Benchmark Results
The 20-session benchmark was inspired by LongMemEval and tested 48 planted facts across 6 question types:
- Deep recall (facts from sessions 1-2 tested 15+ sessions later): 89%
- Knowledge update (3-level correction chain, A→B→C): 100%
- Cross-session reasoning (combine facts from 3+ sessions): 100%
- Interference resistance (similar names that shouldn't be confused): 100%
- Temporal reasoning ("Which came first?" ordering questions): 80%
- Abstention ("I don't know" for never-mentioned facts): 86%
Overall, 49 of 52 checkpoints passed (94.2%). In the one hard failure, the agent inferred "you've done social media marketing" from a loosely related fact ("promotion work"), although the correct answer was "never discussed". This is a classic LLM over-inference problem.
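The headline figure is simple division over the checkpoint counts reported above. The helper name `pass_rate` is hypothetical and used only for this worked check.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of checkpoints passed, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


overall = pass_rate(49, 52)   # counts reported in the article
failed = 52 - 49

print(f"{overall}% passed, {failed} checkpoints failed")  # 94.2% passed, 3 checkpoints failed
```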
Availability and Questions
The project is open source, with the code and benchmark available on GitHub. The developer is looking for feedback on the skill-file approach (structured markdown vs. vector search), better ways to test abstention (identified as the hardest dimension), and pointers to others who are benchmarking cross-session memory in agents (not just long-context retrieval).
📖 Read the full source: r/ClaudeAI