Fixing Claude Code's KV Cache Invalidation with Local Backends

✍️ OpenClawRadar | 📅 Published: March 31, 2026 | 🔗 Source

Claude Code versions 2.1.36 and above inject dynamic content into the system prompt on every request, which invalidates the KV cache on local inference backends such as llama.cpp's llama-server or LM Studio. The hardware then has to reprocess a 20K+ token system prompt from scratch, even for minor tool calls.

The Problem

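llama.cpp reuses its KV cache by prefix matching: only the leading part of a new prompt that is identical to the cached one can be reused. When something near the beginning of a prompt changes, everything after that point must be reprocessed, which for a change inside the system prompt means essentially the whole prompt. Claude Code introduces two dynamic elements that mutate the prompt between turns: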

  • Telemetry Hash: Injects a billing/telemetry header (x-anthropic-billing-header: cch=xxxxx) with a hash that changes on every request
  • Git Snapshot: Injects git status output into the environment block, changing the prompt whenever files are modified

This results in server logs showing "forcing full prompt re-processing due to lack of cache data" and 60+ second processing times for what should be minor operations.
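
To see why a tiny change near the top of the prompt is so expensive, here is a minimal sketch of prefix-based cache reuse. It is a toy model, not llama.cpp's actual implementation: integers stand in for token IDs, and the function names are made up for illustration.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return how many leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_reprocess(cached: Sequence[int], new: Sequence[int]) -> int:
    """Tokens of `new` that a prefix cache built from `cached` cannot cover."""
    return len(new) - common_prefix_len(cached, new)


# Toy prompts: a 20,000-token "system prompt" plus a short conversation tail.
system_prompt = list(range(20_000))
turn_1 = system_prompt + [90_001, 90_002]

# Case A: the next turn only appends tokens, so the old prompt is a pure prefix.
turn_2_stable = turn_1 + [90_003, 90_004]

# Case B: same turn, but one token near the start differs
# (standing in for a per-request hash or a changed git status line).
turn_2_mutated = list(turn_2_stable)
turn_2_mutated[10] = -1

print(tokens_to_reprocess(turn_1, turn_2_stable))   # 2
print(tokens_to_reprocess(turn_1, turn_2_mutated))  # 19994
```

In the stable case only the two new tokens need processing. In the mutated case a single changed token at position 10 leaves just 10 reusable tokens, so nearly the entire prompt is recomputed.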

The Solution

Configure Claude Code to disable the dynamic prompt elements and route requests to your local hardware. Open ~/.claude/settings.json (or your project's local config) and make sure it contains the following settings, merged with anything already in the file:

{
  "includeGitInstructions": false,
  "env": {
    "ANTHROPIC_BASE_URL": "<your-llama-server-here>",
    "ANTHROPIC_API_KEY": "<any-string>",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "DISABLE_TELEMETRY": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
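
If you already have a settings file and would rather not edit it by hand, the merge can be scripted. The following is a rough sketch, not an official tool: `merge_settings` and `apply_to_file` are hypothetical helper names, and it assumes the file is plain JSON. The two placeholder values are copied unchanged from the configuration above and still need to be replaced with your own.

```python
import json
from pathlib import Path

# Settings from the configuration above. The angle-bracket placeholders are
# intentionally left as-is; substitute your own server URL and key string.
CACHE_FRIENDLY_SETTINGS = {
    "includeGitInstructions": False,
    "env": {
        "ANTHROPIC_BASE_URL": "<your-llama-server-here>",
        "ANTHROPIC_API_KEY": "<any-string>",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
        "DISABLE_TELEMETRY": "1",
        "DISABLE_ERROR_REPORTING": "1",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    },
}


def merge_settings(existing: dict, updates: dict) -> dict:
    """Recursively merge `updates` into a copy of `existing`.

    Nested dicts (such as "env") are merged key by key, so unrelated
    entries already present in the file are preserved.
    """
    merged = dict(existing)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged


def apply_to_file(path: Path, updates: dict) -> dict:
    """Merge `updates` into the JSON file at `path`, creating it if missing."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_settings(existing, updates)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

To apply it to the user-level file, call `apply_to_file(Path.home() / ".claude" / "settings.json", CACHE_FRIENDLY_SETTINGS)`. Back up the file first, since the script rewrites it in place.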

After restarting Claude Code, the llama-server logs should show the cache being reused. Instead of reprocessing all 24,000 tokens, you'll see messages like "selected slot by LCP similarity, sim_best = 0.973" followed by "prompt processing progress, n_tokens = 24270, batch.n_tokens = 4", indicating that only roughly 600 tokens of new content are processed rather than the full prompt.
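
As a quick sanity check on those numbers, you can estimate the delta from the similarity score. This sketch assumes that sim_best is approximately the fraction of the prompt covered by the cached common prefix, and that the prompt is about 24,270 tokens long; both are readings of the log lines quoted above, not documented guarantees.

```python
def estimate_delta_tokens(prompt_tokens: int, similarity: float) -> int:
    """Estimate tokens left to process, assuming `similarity` is the share
    of the prompt already covered by the cached prefix."""
    return round(prompt_tokens * (1.0 - similarity))


print(estimate_delta_tokens(24270, 0.973))  # 655
```

Under that assumption the uncached remainder comes out at about 655 tokens, in the same ballpark as the 600-token figure quoted above.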

This reduces local tool call times from over a minute to approximately 4 seconds on hardware like a Turing-era Quadro RTX 8000.

📖 Read the full source: r/LocalLLaMA
