MemAware Benchmark Tests AI Memory Beyond Keyword Search

MemAware is an open-source benchmark designed to test whether AI assistants with memory can surface relevant context from past conversations when current queries don't explicitly hint at that information.
How the Benchmark Works
The benchmark contains 900 questions across three difficulty levels. It tests scenarios where relevant context exists in memory but the current question contains no keywords that would trigger a search match. For example, suppose you told your AI assistant about your 45-minute commute months ago, then later ask, "What time should I set my alarm for my 8:30 AM meeting?" The assistant should factor in your commute, but searching "alarm 8:30 meeting" won't find conversations about commuting.
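The alarm example can be made concrete with a toy keyword matcher. This is a minimal sketch, not the benchmark's code: the stopword list, the wording of the stored memories, and the `keyword_overlap` helper are all hypothetical, and real BM25 weights terms rather than just counting them. It only illustrates why a memory with no shared terms cannot be retrieved by keyword search.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "my", "i", "for", "to", "is", "what",
             "should", "at", "of", "in", "and", "each", "way"}

def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def keyword_overlap(query: str, memory: str) -> int:
    """Count distinct non-stopword tokens shared by query and memory."""
    return len(tokenize(query) & tokenize(memory))

# Hypothetical stored memories: the first is the one that matters,
# the second is a distractor that happens to share a keyword.
memories = [
    "My commute takes 45 minutes each way.",
    "Reminder: the team meeting moved to room 4.",
]
query = "What time should I set my alarm for my 8:30 AM meeting?"

scores = {m: keyword_overlap(query, m) for m in memories}
for memory, score in scores.items():
    print(score, memory)
```

In this toy setup the commute memory shares no terms with the query, so it scores zero, while the unrelated room-change note matches on "meeting" and would be ranked first.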
Key Findings
- Search barely helps: BM25 search scored 2.8%, versus 0.8% with no memory. That is a two-percentage-point gain that costs 5x the tokens.
- Vector search fails on hard questions: It helps when keywords overlap (6%) but drops to 0.7% on cross-domain connections — the same as no memory. Example hard question: "How should I bid at the charity auction?" should recall a past $800 handbag purchase as a spending baseline, but embedding similarity can't connect these concepts.
- Searching when you shouldn't is expensive: The "always search" pattern reads ~4.7K tokens of results per question regardless of whether they help. Most of the time, the results are irrelevant noise.
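For a rough sense of scale, the figures above can be combined in a back-of-the-envelope calculation. This sketch makes two assumptions the article does not confirm: that every one of the 900 questions triggers a search reading about 4.7K tokens, and that the 2.8% BM25 score can be read as the fraction of questions answered correctly. The variable names are illustrative.

```python
QUESTIONS = 900            # benchmark size, from the article
TOKENS_PER_SEARCH = 4_700  # ~4.7K tokens of results per question, from the article
BM25_SCORE = 0.028         # 2.8%; treated here as a fraction correct (assumption)

# Total search-result tokens read if every question triggers a search.
total_tokens = QUESTIONS * TOKENS_PER_SEARCH

# Under the assumption above, the expected number of correct answers
# and the search-result tokens read per correct answer.
expected_correct = BM25_SCORE * QUESTIONS
tokens_per_correct = total_tokens / expected_correct

print(f"Total tokens read: {total_tokens:,}")
print(f"Tokens per correct answer: {tokens_per_correct:,.0f}")
```

Under these assumptions, an always-search setup reads about 4.2 million tokens of results over the full benchmark, or roughly 168K tokens for each question it gets right.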
The Core Problem
Current AI memory implementations are essentially just search systems. True memory awareness — knowing what information is stored and proactively surfacing relevant context — is a different problem that search alone can't solve.
The benchmark is available for testing different approaches at: https://github.com/kevin-hs-sohn/memaware
📖 Read the full source: r/ClaudeAI