Approach to Self-Improving Memory in Local AI Agents

Memory Architecture for Persistent AI Agents
A developer on r/LocalLLaMA has shared an approach to building AI agents that don't repeat mistakes across sessions. The core problem is that every session starts from zero: the context window resets and earlier corrections are lost.
Memory Implementation
The system uses markdown as the source of truth instead of a database. MEMORY.md is human-editable: delete a line in vim and the agent forgets it. SQLite and FAISS (HNSW, 768-dim) are derived caches that can be rebuilt from the markdown at any time. Because the memory is plain text, users can version-control it with git.
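To make the "derived cache" idea concrete, here is a minimal sketch of rebuilding a SQLite table from the markdown file. The post does not describe the layout of MEMORY.md or the cache schema, so the one-bullet-per-memory format, the `memory` table, and the function names are all assumptions for illustration.

```python
import sqlite3


def parse_memory(markdown_text: str) -> list[str]:
    """Return one entry per bullet line (assumed format: '- some memory')."""
    entries = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            entries.append(stripped[2:].strip())
    return entries


def rebuild_cache(markdown_text: str, conn: sqlite3.Connection) -> int:
    """Throw away the derived table and recreate it from the markdown source."""
    entries = parse_memory(markdown_text)
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP TABLE IF EXISTS memory")  # hypothetical table name
        conn.execute(
            "CREATE TABLE memory (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO memory (text) VALUES (?)", [(e,) for e in entries]
        )
    return len(entries)
```

Because the cache is always regenerated from the file, deleting a bullet from the markdown and rebuilding is enough to make the entry disappear from the database as well.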
Episode Scoring and Rule Learning
Each execution is scored +1 or -1 and saved as an episode. On similar future tasks, relevant episodes are pulled into context. When the same error signature (SHA256 of tool name + normalized error) shows up twice within 7 days, a rule learner generates a one-line prevention rule.
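The signature and the seven-day repeat check can be sketched as follows. The post only says the signature is a SHA256 over the tool name and a normalized error, so the normalization rules here (lowercasing, masking hex addresses and numbers, collapsing whitespace) and the `RepeatDetector` class are assumptions, not the author's code.

```python
import hashlib
import re
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)


def normalize_error(message: str) -> str:
    """Assumed normalization: strip details that vary between identical errors."""
    text = message.lower()
    text = re.sub(r"0x[0-9a-f]+", "<hex>", text)  # memory addresses
    text = re.sub(r"\d+", "<n>", text)            # line numbers, PIDs, codes
    return re.sub(r"\s+", " ", text).strip()


def error_signature(tool: str, message: str) -> str:
    """SHA256 hex digest over tool name plus normalized error text."""
    payload = f"{tool}\n{normalize_error(message)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RepeatDetector:
    """Hypothetical helper; expects errors to be recorded in time order."""

    def __init__(self) -> None:
        self._last_seen: dict[str, datetime] = {}

    def record(self, tool: str, message: str, when: datetime) -> bool:
        """Return True if this signature was also seen within the last 7 days."""
        sig = error_signature(tool, message)
        previous = self._last_seen.get(sig)
        self._last_seen[sig] = when
        return previous is not None and when - previous <= WINDOW
```

A `True` return is the point at which a rule learner would be asked to write a prevention rule. Masking every number is deliberately coarse, and a real normalizer would likely keep meaningful values such as exit codes.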
Rules start at 0.40 confidence and need 0.60 before they are injected into future prompts. Each success raises confidence by 0.03 and each failure lowers it by 0.05. Rules that don't help eventually decay away.
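The confidence bookkeeping is small enough to sketch directly. The four constants come from the post. The `Rule` class and the clamping of confidence to the range 0 to 1 are assumptions.

```python
from dataclasses import dataclass

START_CONFIDENCE = 0.40
INJECT_THRESHOLD = 0.60
SUCCESS_STEP = 0.03
FAILURE_STEP = 0.05


@dataclass
class Rule:
    text: str
    confidence: float = START_CONFIDENCE

    def update(self, success: bool) -> None:
        """Apply the asymmetric step; clamping to [0, 1] is an assumption."""
        delta = SUCCESS_STEP if success else -FAILURE_STEP
        self.confidence = min(1.0, max(0.0, self.confidence + delta))

    @property
    def injectable(self) -> bool:
        """Only rules at or above the threshold would reach the prompt."""
        return self.confidence >= INJECT_THRESHOLD
```

With these step sizes a fresh rule needs at least seven successes to cross 0.60, and a rule only gains confidence over time if it succeeds more than 62.5% of the time (0.05 / 0.08). That break-even rate is what the +0.03/-0.05 asymmetry effectively sets.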
Trust Escalation System
Instead of configuring permission levels upfront, the agent tracks approval patterns. Five approvals at a 90%+ approval rate trigger an automatic promotion, and a single revert demotes the agent back. There's also a shadow mode for auditing.
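One way to read the promotion rule is sketched below. The post gives only the thresholds, so several details are assumptions: trust is modeled as an integer level, the rate is approvals divided by all decisions since the last level change, counters reset on every level change, and a revert drops exactly one level.

```python
from dataclasses import dataclass

MIN_APPROVALS = 5
MIN_APPROVAL_RATE = 0.90


@dataclass
class TrustTracker:
    """Hypothetical tracker for one action type."""

    level: int = 0
    approvals: int = 0
    decisions: int = 0

    def record_decision(self, approved: bool) -> None:
        """Count a user decision and promote once both thresholds are met."""
        self.decisions += 1
        if approved:
            self.approvals += 1
        rate = self.approvals / self.decisions
        if self.approvals >= MIN_APPROVALS and rate >= MIN_APPROVAL_RATE:
            self.level += 1
            self._reset_counts()

    def record_revert(self) -> None:
        """A single revert demotes one level (never below zero)."""
        self.level = max(0, self.level - 1)
        self._reset_counts()

    def _reset_counts(self) -> None:
        self.approvals = 0
        self.decisions = 0
```

Under this reading, five straight approvals promote immediately, while a single early rejection means nine approvals are needed before the rate reaches 90%.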
Task Decomposition and Safety
Complex goals become a DAG (Directed Acyclic Graph). Circular dependencies are caught via topological sort, and failures cascade to dependents via DFS (depth-first search). A completion gate checks 18 requirements (R01-R18): did the agent actually read files, write changes, verify results, and stay in the workspace?
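Both graph operations can be sketched in a few lines. This version uses Kahn's algorithm for the topological sort and an iterative DFS for the cascade. The post does not say which variants the project uses, and the `deps` mapping (task name to list of prerequisites) is an assumed representation.

```python
from collections import defaultdict, deque


def _dependents(deps: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert task -> prerequisites into prerequisite -> dependent tasks."""
    result = defaultdict(list)
    for task, prereqs in deps.items():
        for prereq in prereqs:
            result[prereq].append(task)
    return result


def topological_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
    tasks = set(deps) | {p for prereqs in deps.values() for p in prereqs}
    indegree = {t: 0 for t in tasks}
    for task, prereqs in deps.items():
        indegree[task] += len(prereqs)
    dependents = _dependents(deps)
    queue = deque(sorted(t for t in tasks if indegree[t] == 0))
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for dep in dependents[current]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)
    if len(order) != len(tasks):  # leftover tasks are stuck on a cycle
        raise ValueError("circular dependency detected")
    return order


def cascade_failure(deps: dict[str, list[str]], failed: str) -> set[str]:
    """Iterative DFS: every task that transitively depends on `failed`."""
    dependents = _dependents(deps)
    blocked, stack = set(), [failed]
    while stack:
        current = stack.pop()
        for dep in dependents[current]:
            if dep not in blocked:
                blocked.add(dep)
                stack.append(dep)
    return blocked
```

For a plan such as `{"read": [], "edit": ["read"], "test": ["edit"]}`, a failure in `edit` blocks `test` but leaves `read` untouched.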
Safety features include 43 bash risk patterns, dual-pass analysis (raw + decoded), fail-closed design (Guardian crash = deny), and minimum writable depth of 3 to prevent rm -rf /.
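The safety ideas above can be combined into a rough sketch. The three regexes are illustrative stand-ins for the 43 patterns, which the post does not list. Treating "decoded" as base64 decoding, and "depth" as the number of path components below the root, are also assumptions.

```python
import base64
import posixpath
import re
from pathlib import PurePosixPath

# Illustrative patterns only; the real list is not given in the post.
RISK_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bmkfs\b"),
    re.compile(r"\bdd\s+if="),
]
MIN_WRITABLE_DEPTH = 3


def decoded_views(command: str) -> list[str]:
    """Second pass: decode base64-looking tokens (assumed meaning of 'decoded')."""
    views = []
    for token in re.findall(r"[A-Za-z0-9+/=]{8,}", command):
        try:
            views.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
    return views


def is_risky(command: str) -> bool:
    """Check the raw command and every decoded view against the patterns."""
    for view in [command, *decoded_views(command)]:
        if any(pattern.search(view) for pattern in RISK_PATTERNS):
            return True
    return False


def guardian_allows(command: str, analyzer=is_risky) -> bool:
    """Fail closed: any crash inside the analyzer is treated as a denial."""
    try:
        return not analyzer(command)
    except Exception:
        return False


def is_writable(path: str) -> bool:
    """Allow writes only to absolute paths at least three levels below '/'."""
    normalized = posixpath.normpath(path)  # collapses '..' segments
    if not normalized.startswith("/"):
        return False
    depth = len(PurePosixPath(normalized).parts) - 1  # exclude the root part
    return depth >= MIN_WRITABLE_DEPTH
```

The depth rule means `/home/user/project/file.txt` is writable while `/etc` is not, so a destructive command aimed at a shallow path is rejected even if no pattern catches it.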
The developer is seeking feedback on whether the confidence decay on rules feels right and whether the +0.03/-0.05 asymmetry is optimal. They're also wondering if there are better alternatives to HNSW for this scale (typically <10k episodes).
📖 Read the full source: r/LocalLLaMA