MemAware Benchmark: Test AI Memory Beyond Keyword Search

MemAware is an open-source benchmark that tests whether AI assistants with memory can surface relevant context from past conversations when the current query doesn't explicitly point to it.

How the Benchmark Works

The benchmark contains 900 questions across three difficulty levels. Each one tests a scenario where relevant context exists in memory but the current question contains no keywords that would trigger a search match. For example, suppose you told your AI assistant about your 45-minute commute months ago, and you later ask, "What time should I set my alarm for my 8:30 AM meeting?" The assistant should factor in your commute, but a search for "alarm 8:30 meeting" won't find conversations about commuting.
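
The commute example can be sketched with a toy keyword-overlap scorer. This is only an illustration of why keyword matching misses the relevant memory. It is not BM25 and not the benchmark's own code, and the stopword list, function names, and second memory entry are all made up for the example.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {
    "a", "an", "the", "my", "i", "me", "for", "to", "is", "it",
    "what", "should", "each", "way", "in", "of", "and", "at",
}

def keywords(text):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def overlap_score(query, memory):
    """Count the keywords shared by the query and a stored memory."""
    return len(keywords(query) & keywords(memory))

# The first memory paraphrases the commute example from the text;
# the second is an invented distractor.
memories = [
    "My commute to the office takes 45 minutes each way.",
    "Remind me about the team meeting agenda on Friday.",
]
query = "What time should I set my alarm for my 8:30 AM meeting?"

for memory in memories:
    print(overlap_score(query, memory), memory)
```

In this sketch the commute memory shares no keywords with the query, so it scores 0, while the unrelated meeting reminder scores 1 because of the shared word "meeting". A keyword-based retriever would therefore rank the distractor above the memory that actually matters.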

Key Findings

Search barely helps: BM25 search scored 2.8% vs 0.8% with no memory — a tiny improvement that costs 5x the tokens.
Vector search fails on hard questions: It helps when keywords overlap (6%) but drops to 0.7% on cross-domain connections — the same as no memory. Example hard question: "How should I bid at the charity auction?" should recall a past $800 handbag purchase as a spending baseline, but embedding similarity can't connect these concepts.
Searching when you shouldn't is expensive: The "always search" pattern reads ~4.7K tokens of results per question regardless of whether they help. Most of the time, the results are irrelevant noise.
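
To put the "always search" cost in perspective, here is a back-of-the-envelope calculation using the two figures quoted above (900 questions, roughly 4.7K tokens of results per question). The function name is hypothetical, and the result is a rough estimate of retrieved-result tokens only, not a number reported by the benchmark.

```python
QUESTIONS = 900            # benchmark size, from the text
TOKENS_PER_SEARCH = 4_700  # "~4.7K tokens of results per question", from the text

def always_search_tokens(questions=QUESTIONS, tokens_per_search=TOKENS_PER_SEARCH):
    """Total tokens of search results read if every question triggers a search."""
    return questions * tokens_per_search

total = always_search_tokens()
print(f"{total:,} tokens of search results across the benchmark")
```

Under these assumptions, a full run reads about 4.2 million tokens of search results, whether or not those results help.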

The Core Problem

Current AI memory implementations are essentially just search systems. True memory awareness — knowing what information is stored and proactively surfacing relevant context — is a different problem that search alone can't solve.

The benchmark is available for testing different approaches at: https://github.com/kevin-hs-sohn/memaware

📖 Read the full source: r/ClaudeAI