llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping

✍️ OpenClawRadar📅 Published: May 14, 2026🔗 Source
llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping
Ad

A developer on r/LocalLLaMA is hitting a serious performance issue with llama.cpp when running long-context coding agents (opencode + pi.dev) via llama-swap. Even with highly similar prompts (LCP similarity often >0.99), the system periodically discards the KV cache and reprocesses 40k+ tokens, causing TTFT of multiple minutes.

Observed Behavior

  • Context grows to 50k+ tokens.
  • After several normal reuses (e.g., prompt eval time = 473 ms / 19 tokens), n_past suddenly drops to ~4-5k.
  • llama.cpp then reprocesses the full prompt: n_tokens = 4750 prompt eval time = 222411 ms / 44016 tokens.
  • Cache usage hits 4676 MiB, exceeding the configured limit (2500 MiB).

Current Configuration

llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shift
Ad

Suspected Causes

  • Cache invalidation due to overflow of --cache-ram limit – the log shows 4676 MiB used vs 2500 MiB limit.
  • Bad KV reuse mechanism when early prompt tokens change (possibly frequent alterations by opencode).
  • Insufficient --ctx-checkpoints or --cache-reuse for the 150k context size.

Recommendations from the Community

The thread is thin on answers so far, but obvious first steps include increasing --cache-ram to match typical usage (e.g., 5000+ MiB), or reducing --ctx-size to stay under the cache limit. Also check if opencode is intentionally mutating prompt prefixes; if so, locking the system prompt or using a fixed prefix could improve reuse.

For developers running similar setups, share your working configs in the source thread.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also