llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping

A developer on r/LocalLLaMA reports a serious performance problem with llama.cpp when running long-context coding agents (opencode + pi.dev) via llama-swap. Even when consecutive prompts are nearly identical (LCP similarity often >0.99), the server periodically discards the KV cache and reprocesses 40k+ tokens, pushing time to first token (TTFT) to several minutes.
Observed Behavior
- Context grows to 50k+ tokens.
- After several normal reuses (e.g., prompt eval time = 473 ms / 19 tokens), n_past suddenly drops to ~4-5k.
- llama.cpp then reprocesses the full prompt: n_tokens = 4750 prompt eval time = 222411 ms / 44016 tokens.
- Cache usage hits 4676 MiB, exceeding the configured limit (2500 MiB).
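To put the two timing excerpts side by side, here is a minimal Python sketch that pulls the milliseconds and token counts out of such fragments and converts them to seconds and tokens per second. The regular expression is an assumption based only on the excerpts quoted above; real llama.cpp log lines may carry extra fields, so treat it as illustrative rather than a robust parser.

```python
import re

# Assumed pattern, derived only from the fragments quoted in this article,
# e.g. "prompt eval time = 473 ms / 19 tokens".
TIMING = re.compile(r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")


def parse_prompt_eval(line):
    """Return (milliseconds, tokens) from a timing fragment, or None if absent."""
    match = TIMING.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def tokens_per_second(ms, tokens):
    """Convert a (milliseconds, tokens) pair into prompt-processing throughput."""
    return tokens / (ms / 1000.0)


# The two excerpts reported in the thread.
samples = [
    "prompt eval time = 473 ms / 19 tokens",
    "n_tokens = 4750 prompt eval time = 222411 ms / 44016 tokens",
]

for sample in samples:
    ms, tokens = parse_prompt_eval(sample)
    rate = tokens_per_second(ms, tokens)
    print(f"{tokens:>6} tokens in {ms / 1000:7.1f} s ({rate:.0f} tok/s)")
```

On the reported numbers, the full reprocess takes about 222 seconds (roughly 3.7 minutes) for 44016 tokens, versus under half a second when only 19 new tokens need evaluating.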
Current Configuration
llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shift
Suspected Causes
- Cache invalidation due to overflow of the --cache-ram limit: the log shows 4676 MiB used against a 2500 MiB limit.
- KV reuse breaking down when early prompt tokens change (possibly because opencode frequently alters them).
- Insufficient --ctx-checkpoints or --cache-reuse for the 150k context size.
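The second suspected cause is easier to see with a toy example. The sketch below uses a naive longest-common-prefix comparison over hypothetical token ids; it is not llama.cpp's implementation and may not match how the server computes its similarity figure. It also ignores any recovery that --cache-reuse or --ctx-checkpoints may provide. It only shows why a single change near the start of a long prompt can leave most of the prompt to be reprocessed even though almost every token is unchanged.

```python
def common_prefix_len(cached, incoming):
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token ids standing in for a 50k-token cached prompt.
cached = list(range(50_000))

# Case 1: the agent only appends new tokens to the end.
appended = cached + [1, 2, 3]

# Case 2: same prompt, but one token near the start has changed
# (position 4700 is an arbitrary illustrative choice).
edited = list(appended)
edited[4_700] = -1

for name, prompt in [("append only", appended), ("early edit", edited)]:
    reused = common_prefix_len(cached, prompt)
    to_process = len(prompt) - reused
    print(f"{name:>11}: reuse {reused} tokens, process {to_process} tokens")
```

In the early-edit case only the tokens before the changed position count as a shared prefix, so under this simplified model everything after it would be evaluated again.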
Recommendations from the Community
The thread is thin on answers so far, but obvious first steps include increasing --cache-ram to match typical usage (e.g., 5000+ MiB), or reducing --ctx-size to stay under the cache limit. Also check if opencode is intentionally mutating prompt prefixes; if so, locking the system prompt or using a fixed prefix could improve reuse.
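As a small illustration of the first suggestion, the Python sketch below rebuilds the command line quoted in the article with a larger --cache-ram value. The flags are copied from that configuration as-is; the helper with_option is hypothetical, and 5000 is just the example figure mentioned above, not a tested recommendation.

```python
import shlex

# Flags copied verbatim from the configuration quoted in the article.
base_args = [
    "llama-server",
    "--ctx-size", "150000",
    "--parallel", "1",
    "--ctx-checkpoints", "32",
    "--cache-ram", "2500",
    "--cache-reuse", "256",
    "-no-kvu",
    "--no-context-shift",
]


def with_option(args, flag, value):
    """Return a copy of args with the value that follows `flag` replaced."""
    out = list(args)
    index = out.index(flag)  # raises ValueError if the flag is missing
    out[index + 1] = str(value)
    return out


# Example value from the article's suggestion; adjust to the cache usage you observe.
candidate = with_option(base_args, "--cache-ram", 5000)
print(shlex.join(candidate))
```

Keeping the arguments in a list makes it easy to try one change at a time and compare the resulting logs.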
If you run a similar setup, consider sharing your working config in the source thread.
📖 Read the full source: r/LocalLLaMA