Merlin: Local-first LLM context dedup – measure up to 71% chunk overlap, free & open-core

The author has released Merlin, a local-first deduplication tool for LLM context windows. Benchmarks across 22 million passages from real agent sessions and RAG pipelines show 22% duplicate content in typical agent context and up to 71% on RAG-heavy queries. For local models with 8K/16K/32K context, stripping that redundancy means more useful tokens fit before truncation.
Three integration modes
1. HTTP proxy mode
Best for Ollama, vLLM, SGLang, OpenWebUI, llama.cpp server, or anything with an OpenAI-compatible endpoint. Run the proxy locally and point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before reaching the model.
Default is cache-aware: leaves the conversation prefix untouched (so vLLM/SGLang prefix-caching still hits) and only dedups the most recent user message. There's an opt-in aggressive mode if your cache hit rate is already low.
2. MCP server
For Claude Desktop, Claude Code, OpenClaw, Cursor. Exposes tools:
merlin_dedupe– dedup textmerlin_dedupe_file– dedup file contentsmerlin_savings_summary– show statsmerlin_status– check service
These tools are not auto-invoked; you must instruct the model to call them on chunky pastes.
3. Standalone CLI
For shell pipelines and preprocessing. Single-threaded, ~250 KB binary, no runtime dependencies, no network calls. Takes a positional input file and writes deduped lines via --output-dedup=path.txt.
Installation (one command per setup)
curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable
Where <integration> is claude_desktop, claude_code, openclaw, cursor, or proxy.
Measurements & tradeoffs
- Papers: arXiv:2605.09611 (architecture), arXiv:2605.09990 (22M-passage measurement), Zenodo: 10.5281/zenodo.20090991
- Community tier caps: 50 MB per run, 200 MB per day, 2 GB per month. Refuses oversized work cleanly (verified on 51 MB file). Hobby use is fine.
- Open-core: Public repo is the community edition; a separate closed-source Pro engine exists for high-throughput servers.
- Doesn't fix session fragmentation where the whole conversation is replayed every turn — that's an orchestration problem above this tool's scope.
- Binary availability: Windows x64 in v0.2.1. Linux + macOS CI pipeline pending.
Who it's for
Local LLM users running agents or RAG with Ollama, vLLM, SGLang, llama.cpp, or any OpenAI-compatible backend who want to pack more real tokens into limited context windows.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Using an MCP Server to Optimize React Native Apps with Claude Code
An MCP server streams live runtime data from a React Native app into Claude Code, identifying performance issues like Zustand store thrashing and unnecessary re-renders.

Mnemos: an MCP server for persistent Claude Code memory
Mnemos is an open-source MCP server that gives Claude Code persistent memory across sessions, recording corrections as structured patterns and pushing ranked context at startup. Single 15 MB Go binary, no Docker or vector DB needed.
Survey of Local-First Markdown Memory Servers for AI Agents: Mem0, Hindsight, Zep, and the Newcomer Engram
A user tested ~20 local agent memory systems for storing memories as editable files. Engram (by Obsidian68) was the only one that met all requirements: fully local, Markdown storage, smart dedup, importance decay, and standalone server.

UK Sovereign LLM Inference: Relax.ai Launches Public Docs
Relax.ai released docs for UK sovereign LLM inference, redirecting to /docs/getting-started/introduction. The service was shared on HN with 104 points.