AVP Protocol Enables LLM Agents to Share KV-Cache Instead of Text for Token Efficiency

✍️ OpenClawRadar📅 Published: February 28, 2026🔗 Source
AVP Protocol Enables LLM Agents to Share KV-Cache Instead of Text for Token Efficiency
Ad

What AVP Does

AVP (Agent Vector Protocol) is a protocol that enables LLM agents in multi-agent setups to pass KV-cache directly between agents instead of text. This eliminates redundant tokenization and forward passes that occur when each agent re-processes the entire conversation history.

How It Works

Instead of the traditional text-based approach where each agent re-tokenizes everything, AVP allows Agent A to serialize its key-value attention states after reasoning, and Agent B injects them directly. This means:

  • Same model on both sides: Direct KV-cache transfer with zero overhead
  • Same family, different size (e.g., Qwen2.5-7B talking to 1.5B): Vocabulary-mediated projection with no learned parameters or calibration data needed
  • Different families: Falls back to JSON
  • Transport-agnostic: Works alongside A2A, MCP, gRPC, or whatever you're already using
  • Binary wire format: Not JSON+Base64 (which has 33% overhead on tensor data)

Performance Results

Testing across Qwen2.5, Llama 3.2, and DeepSeek-R1-Distill models showed:

  • Token savings of 73-78%
  • 2-4x speedups
  • These results held consistent across all three model families
  • The gap widens with chain length: at 4 agents it's roughly 2x, at 16 agents (projected) it would be around 6x

The efficiency comes from text prompt sizes ballooning at each hop (186 → 545 → 1,073 → 1,397 tokens in a 4-agent GSM8K chain), while latent stays flat at ~164-207 tokens per hop because prior context arrives as pre-computed KV-cache.

Ad

Limitations

  • Sample sizes are n=20 per model (enough for token/speed claims but not for accuracy claims)
  • Tested on small models only (1.5B-3B on an RTX 3070 Ti) with 7B+ results pending
  • Requires 1 Gbps+ bandwidth minimum (KV-cache for a 3B model runs about 130 MB per sample)
  • Self-hosted only (requires KV-cache access, won't work with OpenAI/Anthropic/etc. APIs)
  • Same model only for now (cross-model implementation exists but not benchmarked)
  • Latent uses 17-54x more VRAM than text because you're holding KV-cache across hops

Getting Started

Install with: pip install avp

Two API levels available:

import avp
msg = avp.pack("Hello", model="Qwen/Qwen2.5-7B-Instruct", think_steps=20)
answer = avp.unpack(msg, model="Qwen/Qwen2.5-7B-Instruct")

Or with more control:

from avp import HuggingFaceConnector
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
context = connector.think("Analyze this problem", steps=20)
answer = connector.generate("Solve it.", context=context)

vLLM connector also available: pip install "avp[vllm]"

Project Links

  • SDK: github.com/VectorArc/avp-python (MIT, 377 tests, 7 benchmarks)
  • Spec: github.com/VectorArc/avp-spec
  • Benchmark details: BENCHMARKS.md

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also