Merlin LLM Context Dedup: 71% Chunk Overlap Reduction

The author has released Merlin, a local-first deduplication tool for LLM context windows. Benchmarks across 22 million passages from real agent sessions and RAG pipelines show 22% duplicate content in typical agent context and up to 71% on RAG-heavy queries. For local models with 8K/16K/32K context, stripping that redundancy means more useful tokens fit before truncation.

Three integration modes

1. HTTP proxy mode

Best for Ollama, vLLM, SGLang, OpenWebUI, llama.cpp server, or anything with an OpenAI-compatible endpoint. Run the proxy locally and point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before reaching the model.

Default is cache-aware: leaves the conversation prefix untouched (so vLLM/SGLang prefix-caching still hits) and only dedups the most recent user message. There's an opt-in aggressive mode if your cache hit rate is already low.

2. MCP server

For Claude Desktop, Claude Code, OpenClaw, Cursor. Exposes tools:

merlin_dedupe – dedup text
merlin_dedupe_file – dedup file contents
merlin_savings_summary – show stats
merlin_status – check service

These tools are not auto-invoked; you must instruct the model to call them on chunky pastes.

3. Standalone CLI

For shell pipelines and preprocessing. Single-threaded, ~250 KB binary, no runtime dependencies, no network calls. Takes a positional input file and writes deduped lines via --output-dedup=path.txt.

Installation (one command per setup)

curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable

Where <integration> is claude_desktop, claude_code, openclaw, cursor, or proxy.

Measurements & tradeoffs

Papers: arXiv:2605.09611 (architecture), arXiv:2605.09990 (22M-passage measurement), Zenodo: 10.5281/zenodo.20090991
Community tier caps: 50 MB per run, 200 MB per day, 2 GB per month. Refuses oversized work cleanly (verified on 51 MB file). Hobby use is fine.
Open-core: Public repo is the community edition; a separate closed-source Pro engine exists for high-throughput servers.
Doesn't fix session fragmentation where the whole conversation is replayed every turn — that's an orchestration problem above this tool's scope.
Binary availability: Windows x64 in v0.2.1. Linux + macOS CI pipeline pending.

Who it's for

Local LLM users running agents or RAG with Ollama, vLLM, SGLang, llama.cpp, or any OpenAI-compatible backend who want to pack more real tokens into limited context windows.

📖 Read the full source: r/LocalLLaMA