MCP Stack Benchmark: Fix Prompt Cache with 2 Lines of Code

When optimizing a Claude Code MCP stack, it's easy to focus on one metric: byte savings. But a new analysis by Greg Shevchenko shows that a single-axis benchmark can recommend a system that is strictly worse in production. The missing axis is cache friendliness: whether the same input produces byte-identical output across runs, so that Anthropic's prompt cache can hit.

Shevchenko's biggest byte-saver, a retrieval MCP that cut context by 60–70%, was defeating the prompt cache (5-minute TTL) on every call. Two runs of the same query produced different bytes because the output order of rg --files-with-matches leaked through a Map's insertion order into the final context. The fix was two lines: sort the rg hits before slicing, and sort the Map entries by path. After the change, byte savings were unchanged, but cache_friendly_score went from ~0% to 100%.
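To make the failure mode concrete, here is a minimal Python sketch of the same pattern. It is not Shevchenko's code: the original uses a Map, which suggests JavaScript or TypeScript, and the function names, the snippets dict, and the limit parameter are all hypothetical. A Python dict preserves insertion order, so it stands in for the Map.

```python
import hashlib


def build_context_unstable(hit_paths, snippets, limit):
    """Hypothetical 'before' version: output depends on arrival order."""
    chosen = hit_paths[:limit]            # slice depends on the order rg emitted
    by_path = {}
    for path in chosen:
        by_path[path] = snippets[path]    # insertion order = arrival order
    return "\n".join(f"{p}\n{s}" for p, s in by_path.items())


def build_context_stable(hit_paths, snippets, limit):
    """Hypothetical 'after' version with the two sorts applied."""
    chosen = sorted(hit_paths)[:limit]    # fix 1: sort the hits before slicing
    by_path = {p: snippets[p] for p in chosen}
    # fix 2: sort the entries by path before serializing
    return "\n".join(f"{p}\n{s}" for p, s in sorted(by_path.items()))


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same three hits, delivered in two different orders (made-up paths).
snippets = {"src/a.py": "def a(): ...", "src/b.py": "def b(): ...", "src/c.py": "def c(): ..."}
run_1 = ["src/b.py", "src/a.py", "src/c.py"]
run_2 = ["src/c.py", "src/a.py", "src/b.py"]

unstable_same = md5_hex(build_context_unstable(run_1, snippets, 2)) == md5_hex(
    build_context_unstable(run_2, snippets, 2)
)
stable_same = md5_hex(build_context_stable(run_1, snippets, 2)) == md5_hex(
    build_context_stable(run_2, snippets, 2)
)
print(unstable_same, stable_same)  # False True
```

Both sorts matter in the sketch: sorting only at serialization time would still let the slice pick a different subset of files on each run.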

What the Harness Measures

Shevchenko released an open-source benchmark harness (stdlib-only Python, offline) that measures:

Mean ratio + CV across N≥5 runs per fixture → byte-saving axis
Unique MD5 count == 1 check → cache-friendliness axis (0–100%)
12-anti-pattern audit on tool definitions (DSA reference)
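The first two measurements can be sketched in a few lines of stdlib Python. This is an illustration under assumptions, not the harness itself: the function names are hypothetical, "ratio" is assumed to mean output bytes divided by input bytes, and the score is assumed to be the share of fixtures whose runs all hash to a single MD5.

```python
import hashlib
import statistics


def measure(compress, fixture, runs=5):
    """Run one (str) -> str compressor repeatedly on one non-empty fixture."""
    ratios = []
    digests = set()
    in_bytes = len(fixture.encode("utf-8"))
    for _ in range(runs):
        out = compress(fixture)
        out_bytes = out.encode("utf-8")
        ratios.append(len(out_bytes) / in_bytes)         # assumed ratio definition
        digests.add(hashlib.md5(out_bytes).hexdigest())
    mean_ratio = statistics.mean(ratios)
    # Coefficient of variation: sample standard deviation over the mean.
    cv = statistics.stdev(ratios) / mean_ratio if mean_ratio else 0.0
    return {
        "mean_ratio": mean_ratio,
        "cv": cv,
        "cache_friendly": len(digests) == 1,             # unique MD5 count == 1
    }


def cache_friendly_score(compress, fixtures, runs=5):
    """Percentage of fixtures that produced byte-identical output on every run."""
    results = [measure(compress, f, runs) for f in fixtures]
    return 100.0 * sum(r["cache_friendly"] for r in results) / len(results)
```

A deterministic compressor such as `lambda s: s[:10]` scores 100.0 here, while one that embeds anything run-dependent (a timestamp, an unordered listing) scores lower even if its byte ratio barely moves.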

Any compressor exposed as a (str) -> str function can be plugged in. For its statistics, the harness uses cluster-bootstrap CIs, Wilson CIs, preregistration, and Cohen's κ computed on real data.
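For readers unfamiliar with Wilson CIs, the sketch below shows the standard Wilson score interval for a proportion, such as the fraction of fixtures that are cache friendly. The function name and the choice of a 95% level are assumptions for illustration, and the source does not say which quantities the harness applies it to.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(0, 10)   # e.g. 0 of 10 fixtures cache friendly
print(round(low, 4), round(high, 4))
```

Unlike the naive p ± z·sqrt(p(1−p)/n) interval, the Wilson interval stays informative when the observed proportion is exactly 0% or 100%, which is the regime a cache-friendliness score tends to sit in.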

Public Alternatives Surveyed

Shevchenko surveyed public docs for: Cursor codebase index, Sourcegraph Cody, Aider repo-map, Microsoft LLMLingua/LLMLingua-2, Firecrawl/Jina Reader, RouteLLM/Martian (as of May 2026). None disclosed cache-friendliness metrics.

Limitations

He hypothesized that the prep layer triggers more downstream cache hits on subsequent turns, but the effect did not reach significance (Welch p=0.32, Cohen's d≈0.18, N=137). Two-judge Cohen's κ on the corpus was 0.5955 (moderate, below the 0.7 threshold), with 4 of 5 disagreements falling on one ambiguous task; fixing the spec for that task would push κ to ~0.83.
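As a reminder of what the κ figure measures, here is a minimal sketch of two-rater Cohen's κ, defined as (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each judge's label frequencies. The function name and the toy labels are illustrative only and are not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (not the study's data): one disagreement out of four items.
judge_1 = ["pass", "pass", "fail", "fail"]
judge_2 = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_1, judge_2))  # 0.5
```

Because κ discounts chance agreement, a handful of disagreements concentrated on one task can pull it down sharply on a small corpus, which is consistent with the jump the author expects from fixing a single ambiguous spec.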

The harness is MIT-licensed. If you're running a Claude Code MCP stack, measuring cache_friendly_score is now a concrete, actionable step.

📖 Read the full source: r/ClaudeAI
