llama.cpp Massive Prompt Reprocessing with Coding Agents: Debugging KV Cache and Context Swapping

✍️ OpenClawRadar · 📅 Published: May 14, 2026

A developer on r/LocalLLaMA reports a serious performance problem with llama.cpp when running long-context coding agents (opencode + pi.dev) via llama-swap. Even when consecutive prompts are nearly identical (longest-common-prefix, or LCP, similarity often >0.99), the server periodically discards the KV cache and reprocesses 40k+ tokens, pushing time to first token (TTFT) to several minutes.

Observed Behavior

Context grows to 50k+ tokens.
After several normal reuses (e.g., prompt eval time = 473 ms / 19 tokens), n_past suddenly drops to ~4-5k.
llama.cpp then reprocesses almost the entire prompt: the log shows n_tokens = 4750, followed by prompt eval time = 222411 ms / 44016 tokens (about 3.7 minutes).
Cache usage hits 4676 MiB, exceeding the configured limit (2500 MiB).
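
To spot these events in a long server log, one option is to scan for the prompt eval lines and flag the large ones. The following is a minimal sketch, assuming the log lines follow the "prompt eval time = X ms / Y tokens" shape quoted above; the function name `find_reprocessing` and the 10,000-token threshold are hypothetical choices for illustration, not part of llama.cpp.

```python
import re

# Matches lines such as "prompt eval time = 222411 ms / 44016 tokens";
# tolerant of extra whitespace and decimal milliseconds.
PROMPT_EVAL = re.compile(
    r"prompt eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens"
)


def find_reprocessing(log_text, min_tokens=10_000):
    """Return (tokens, seconds, tokens_per_second) for each large prompt eval.

    min_tokens is an arbitrary, hypothetical threshold separating normal
    incremental evals from near-full reprocessing.
    """
    events = []
    for match in PROMPT_EVAL.finditer(log_text):
        millis = float(match.group(1))
        tokens = int(match.group(2))
        if tokens >= min_tokens and millis > 0:
            seconds = millis / 1000.0
            events.append((tokens, seconds, tokens / seconds))
    return events


# The two measurements quoted in the list above.
sample_log = """
prompt eval time = 473 ms / 19 tokens
prompt eval time = 222411 ms / 44016 tokens
"""

for tokens, seconds, rate in find_reprocessing(sample_log):
    print(f"reprocessed {tokens} tokens in {seconds:.1f} s ({rate:.0f} tok/s)")
```

With the figures from the report, only the 44016-token eval is flagged; the 19-token eval is treated as a normal cache reuse.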

Current Configuration

llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shift

Suspected Causes

Cache invalidation due to overflow of --cache-ram limit – the log shows 4676 MiB used vs 2500 MiB limit.
KV reuse breaking down when early prompt tokens change (opencode may be altering the prompt prefix between requests).
Insufficient --ctx-checkpoints or --cache-reuse for the 150k context size.

Recommendations from the Community

The thread has few answers so far. The obvious first steps are to raise --cache-ram to cover the observed usage (e.g., 5000+ MiB) or to reduce --ctx-size so the cache stays under the limit. Also check whether opencode is mutating the prompt prefix between requests; if it is, locking the system prompt or using a fixed prefix could improve reuse.
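
One way to check for prefix mutation is to capture two consecutive requests and measure how much of the second is a verbatim continuation of the first. The sketch below assumes similarity is the shared-prefix length divided by the new prompt length; that definition, the helper names, and the token lists are illustrative assumptions and may not match how llama.cpp computes its own figure.

```python
def common_prefix_len(prev, curr):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def prefix_report(prev, curr):
    """Return (shared_prefix_len, fraction of curr covered by that prefix)."""
    shared = common_prefix_len(prev, curr)
    ratio = shared / len(curr) if curr else 0.0
    return shared, ratio


# Hypothetical token ID lists standing in for two consecutive agent requests.
stable_prev = list(range(1000))
stable_curr = stable_prev + [7, 8, 9]             # only new tokens appended
mutated_curr = [0, 1, 2, 999] + stable_prev[4:]   # one early token changed

print(prefix_report(stable_prev, stable_curr))
print(prefix_report(stable_prev, mutated_curr))
```

In the appended case nearly the whole prompt is a reusable prefix, while a single changed token near the start leaves only three tokens in common, so everything after it would need to be evaluated again under a pure prefix-matching scheme. The same comparison works on raw request strings if token IDs are not available.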

If you run a similar setup, share your working config in the source thread.

📖 Read the full source: r/LocalLLaMA
