Measuring Claude Code MCP Stack: Cache Friendliness vs. Byte Savings, and a 2-Line Fix for Prompt Cache

✍️ OpenClawRadar | 📅 Published: June 7, 2026 | 🔗 Source

When optimizing a Claude Code MCP stack, it's easy to focus on one metric: byte savings. But Greg Shevchenko's new analysis shows that a single-axis benchmark can recommend a system that's strictly worse in production. The missing axis is cache friendliness: whether the same input produces byte-identical output across runs, so that Anthropic's prompt cache can hit.

Shevchenko's biggest byte-saver, a retrieval MCP that cut context by 60–70%, was actually defeating the prompt cache (5-minute TTL) on every call. Two runs of the same query produced different bytes because the output order of rg --files-with-matches leaked through a Map's insertion order into the final context. The fix was two lines: sort the rg hits before slicing, and sort the Map entries by path. After the change, byte savings were unchanged, but cache_friendly_score went from ~0% to 100%.
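The two-line fix described above can be sketched roughly as follows. This is a Python analogue rather than the author's code: the original appears to use a JavaScript Map, and the function name build_context, the grouping by directory, and the limit parameter are all hypothetical. A Python dict preserves insertion order in the same way, so the same leak and the same fix apply.

```python
from collections import defaultdict


def build_context(rg_hits, limit=3):
    """Hypothetical context builder: group matching files by directory."""
    # Fix 1: sort the hits BEFORE slicing, so the same `limit` files are kept
    # regardless of the order in which the search tool emitted them.
    hits = sorted(rg_hits)[:limit]

    # The dict remembers insertion order, just like a JavaScript Map.
    by_dir = defaultdict(list)
    for path in hits:
        directory = path.rsplit("/", 1)[0] if "/" in path else "."
        by_dir[directory].append(path)

    # Fix 2: emit the entries sorted by key instead of in insertion order.
    lines = [f"{d}: {', '.join(paths)}" for d, paths in sorted(by_dir.items())]
    return "\n".join(lines)


# Two runs of the "same query" whose hits arrive in different orders.
run_a = ["src/b.py", "lib/a.py", "src/a.py", "docs/x.md"]
run_b = list(reversed(run_a))
same_bytes = build_context(run_a) == build_context(run_b)
```

With both sorts in place, same_bytes is True for any permutation of the hits. Dropping the first sort can change which files survive the slice, which is enough to alter the final bytes.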

What the Harness Measures

Shevchenko released an open-source benchmark harness (stdlib-only Python, offline) that measures:

  • Mean ratio + CV across N≥5 runs per fixture → byte-saving axis
  • Unique MD5 count == 1 check → cache-friendliness axis (0–100%)
  • 12-anti-pattern audit on tool definitions (DSA reference)
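The first two measurements in the list above could be computed along these lines. This is a minimal sketch, not the harness itself: it assumes the ratio is output bytes over input bytes, and that the score is the percentage of fixtures whose runs all share one MD5 digest. The function names are hypothetical.

```python
import hashlib
import statistics


def byte_saving_stats(original, outputs):
    """Mean output/input byte ratio and its coefficient of variation."""
    in_bytes = len(original.encode("utf-8"))
    ratios = [len(o.encode("utf-8")) / in_bytes for o in outputs]
    mean = statistics.mean(ratios)
    # CV = sample standard deviation divided by the mean.
    cv = statistics.stdev(ratios) / mean if len(ratios) > 1 and mean else 0.0
    return mean, cv


def is_cache_friendly(outputs):
    """True when every run produced byte-identical output (one unique MD5)."""
    digests = {hashlib.md5(o.encode("utf-8")).hexdigest() for o in outputs}
    return len(digests) == 1


def cache_friendly_score(runs_by_fixture):
    """Percentage of fixtures whose runs were all byte-identical (assumed definition)."""
    stable = sum(is_cache_friendly(outs) for outs in runs_by_fixture.values())
    return 100.0 * stable / len(runs_by_fixture)


# Hypothetical data: fixture "a" is stable, fixture "b" drifts on one run.
runs = {"a": ["x"] * 5, "b": ["x", "y", "x", "x", "x"]}
score = cache_friendly_score(runs)
```

Keeping the two axes as separate functions mirrors the article's point: a compressor can score well on byte_saving_stats and still fail is_cache_friendly.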

Any compressor with the signature (str) -> str can be plugged in. For its statistics, the harness uses cluster-bootstrap CIs, Wilson CIs, preregistration, and Cohen's κ computed on real data.
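For readers unfamiliar with the Wilson interval mentioned above, here is a small sketch of the standard Wilson score formula. Which proportion the harness applies it to is not stated in the source; the example assumes a count of cache-stable fixtures, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 5 of 10 fixtures were byte-stable across runs.
low, high = wilson_interval(5, 10)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed proportion is 0% or 100%, which is the regime a cache-friendliness score tends to land in.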

Public Alternatives Surveyed

Shevchenko surveyed public docs for: Cursor codebase index, Sourcegraph Cody, Aider repo-map, Microsoft LLMLingua/LLMLingua-2, Firecrawl/Jina Reader, RouteLLM/Martian (as of May 2026). None disclosed cache-friendliness metrics.

Limitations

He hypothesized that the prep layer triggers more downstream cache hits on subsequent turns, but the effect didn't reach significance (Welch p=0.32, Cohen's d≈0.18, N=137). Two-judge Cohen's κ on the corpus was 0.5955 (moderate, below the 0.7 threshold), with 4 of 5 disagreements on one ambiguous task; fixing the spec would push κ to ~0.83.
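To make the κ figures above easier to interpret, this sketch computes two-judge Cohen's κ from the standard definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The labels are made up for illustration and are not the article's data; the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(judge_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        # Degenerate case: both judges used one identical label throughout.
        # Kappa is undefined here; returning 1.0 is a convention, not a fact.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: the judges disagree on one of four items.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the judges agree on 75% of items, but chance agreement is already 50%, so κ comes out at 0.5. This is why a handful of disagreements concentrated on a single ambiguous task can pull κ down noticeably in a small corpus.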

The harness is MIT-licensed. If you're running a Claude Code MCP stack, measuring cache_friendly_score is now a concrete, actionable step.

📖 Read the full source: r/ClaudeAI
