AVP Protocol Cuts LLM Token Use 73-78% With KV-Cache Sharing

What AVP Does

AVP (Agent Vector Protocol) lets LLM agents in a multi-agent setup pass their KV-cache directly to one another instead of exchanging text. This eliminates the redundant tokenization and forward passes that occur when each agent re-processes the entire conversation history.
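To make the idea concrete, here is a toy, single-layer, single-head attention sketch in plain Python. It is not AVP code, and none of these function names come from the library; they are made up for illustration. It only shows why a receiving agent that is handed the cached keys and values gets the same attention output as one that re-projects the whole history, while doing far fewer projections.

```python
import math
import random

def project(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def extend_cache(cache, embeddings, w_k, w_v):
    """Return a new (keys, values) cache with the new tokens' projections appended."""
    keys, values = cache
    new_keys = [project(w_k, e) for e in embeddings]
    new_values = [project(w_v, e) for e in embeddings]
    return keys + new_keys, values + new_values

random.seed(0)
DIM = 4  # toy embedding size (hypothetical)

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]

def rand_matrix():
    return [rand_vec() for _ in range(DIM)]

w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
history = [rand_vec() for _ in range(6)]     # tokens Agent A already processed
new_tokens = [rand_vec() for _ in range(2)]  # tokens Agent B adds

# Text handoff: Agent B re-projects the whole history plus its own tokens.
full_cache = extend_cache(([], []), history + new_tokens, w_k, w_v)
recomputed = attend(project(w_q, new_tokens[-1]), *full_cache)

# Cache handoff: Agent A ships its cache; Agent B projects only its own tokens.
shared_cache = extend_cache(([], []), history, w_k, w_v)  # done once, by Agent A
b_cache = extend_cache(shared_cache, new_tokens, w_k, w_v)
reused = attend(project(w_q, new_tokens[-1]), *b_cache)

assert all(math.isclose(a, b) for a, b in zip(recomputed, reused))
print("K/V projections with text handoff: ", len(history) + len(new_tokens))  # 8
print("K/V projections with cache handoff:", len(new_tokens))                 # 2
```

A real transformer stacks many such layers and heads, but with causal attention the same reasoning applies at each layer, which is why a cache computed by one agent can stand in for re-reading the history.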

How It Works

Instead of the traditional text-based approach, where each agent re-tokenizes everything, AVP lets Agent A serialize its key-value attention states after reasoning so that Agent B can inject them directly. This means:

Same model on both sides: Direct KV-cache transfer with zero overhead
Same family, different size (e.g., Qwen2.5-7B talking to 1.5B): Vocabulary-mediated projection with no learned parameters or calibration data needed
Different families: Falls back to JSON
Transport-agnostic: Works alongside A2A, MCP, gRPC, or whatever you're already using
Binary wire format: Not JSON+Base64 (which has 33% overhead on tensor data)
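The three compatibility tiers above amount to a small decision rule. The sketch below is one way to write that rule down. The names TransferMode, ModelInfo, and choose_mode are hypothetical and are not part of the avp package, and the "family" label is an assumption about how models might be grouped.

```python
from dataclasses import dataclass
from enum import Enum

class TransferMode(Enum):
    DIRECT_KV = "direct_kv"                # same model on both sides
    VOCAB_PROJECTION = "vocab_projection"  # same family, different size
    JSON_FALLBACK = "json_fallback"        # different families

@dataclass(frozen=True)
class ModelInfo:
    family: str  # hypothetical grouping label, e.g. "qwen2.5"
    name: str    # full model identifier

def choose_mode(sender: ModelInfo, receiver: ModelInfo) -> TransferMode:
    """Pick a transfer mode following the tiers described in the text."""
    if sender == receiver:
        return TransferMode.DIRECT_KV
    if sender.family == receiver.family:
        return TransferMode.VOCAB_PROJECTION
    return TransferMode.JSON_FALLBACK

qwen_7b = ModelInfo("qwen2.5", "Qwen/Qwen2.5-7B-Instruct")
qwen_1_5b = ModelInfo("qwen2.5", "Qwen/Qwen2.5-1.5B-Instruct")
llama_3b = ModelInfo("llama3.2", "meta-llama/Llama-3.2-3B-Instruct")

print(choose_mode(qwen_7b, qwen_7b).value)    # direct_kv
print(choose_mode(qwen_7b, qwen_1_5b).value)  # vocab_projection
print(choose_mode(qwen_7b, llama_3b).value)   # json_fallback
```

How AVP actually detects model identity and family is not described here, so treat the equality checks as placeholders for whatever negotiation the protocol performs.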

Performance Results

Testing across Qwen2.5, Llama 3.2, and DeepSeek-R1-Distill models showed:

Token savings of 73-78%
2-4x speedups
Results were consistent across all three model families
The gap widens with chain length: at 4 agents it's roughly 2x, at 16 agents (projected) it would be around 6x

The efficiency comes from text prompts ballooning at each hop (186 → 545 → 1,073 → 1,397 tokens in a 4-agent GSM8K chain), while the latent path stays flat at roughly 164-207 tokens per hop because prior context arrives as pre-computed KV-cache.
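A quick back-of-the-envelope check of those per-hop numbers, using only the figures quoted above. It assumes every latent hop sits somewhere in the 164-207 range and that total prompt tokens are simply summed across hops; the benchmark may count tokens differently, so this is a sanity check and not a reproduction.

```python
# Per-hop prompt sizes for the text chain, taken from the 4-agent GSM8K example.
text_prompt_tokens = [186, 545, 1073, 1397]

# Reported per-hop range for the latent (KV-cache) chain.
latent_low, latent_high = 164, 207

hops = len(text_prompt_tokens)
text_total = sum(text_prompt_tokens)  # 3201
latent_best = hops * latent_low       # 656
latent_worst = hops * latent_high     # 828

# Fraction of prompt tokens saved, bracketed by the two latent extremes.
savings_low = 1 - latent_worst / text_total
savings_high = 1 - latent_best / text_total

print(f"text total:   {text_total} tokens")
print(f"latent total: {latent_best}-{latent_worst} tokens")
print(f"savings:      {savings_low:.1%} to {savings_high:.1%}")  # 74.1% to 79.5%
```

The bracket of roughly 74% to 80% lands in the same neighborhood as the reported 73-78% savings.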

Limitations

Sample sizes are n=20 per model (enough for token/speed claims but not for accuracy claims)
Tested on small models only (1.5B-3B on an RTX 3070 Ti) with 7B+ results pending
Requires at least 1 Gbps of bandwidth (the KV-cache for a 3B model runs about 130 MB per sample)
Self-hosted only (requires KV-cache access, so it won't work with OpenAI/Anthropic/etc. APIs)
Benchmarks cover same-model transfer only for now (a cross-model implementation exists but has not been benchmarked)
The latent path uses 17-54x more VRAM than text because the KV-cache is held across hops
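To put the bandwidth requirement in perspective, the sketch below estimates how long a 130 MB KV-cache payload takes to cross a 1 Gbps link, and what the same payload would cost if it were Base64-encoded inside JSON instead of sent as raw binary. It assumes MB means 10^6 bytes and an ideal link with no latency or protocol overhead; transfer_seconds is a helper made up for this example.

```python
import base64

def transfer_seconds(payload_mb: float, link_gbps: float) -> float:
    """Ideal transfer time, ignoring latency and framing (1 MB = 10**6 bytes)."""
    bits = payload_mb * 1e6 * 8
    return bits / (link_gbps * 1e9)

# Measure Base64 expansion on a stand-in payload of raw bytes.
raw = bytes(3_000_000)
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw) - 1  # Base64 emits 4 bytes per 3 input bytes

kv_cache_mb = 130  # per-sample figure quoted for a 3B model
binary_s = transfer_seconds(kv_cache_mb, 1.0)
base64_s = transfer_seconds(kv_cache_mb * (1 + overhead), 1.0)

print(f"Base64 overhead:  {overhead:.1%}")      # 33.3%
print(f"binary at 1 Gbps: {binary_s:.2f} s")    # 1.04 s
print(f"Base64 at 1 Gbps: {base64_s:.2f} s")    # 1.39 s
```

At about one second per sample on a 1 Gbps link, slower links would quickly eat into the 2-4x speedup, which is consistent with the stated minimum.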

Getting Started

Install with: pip install avp

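Two API levels are available. The high-level one:

import avp
# Sender side: pack the prompt into an AVP message after 20 think steps
msg = avp.pack("Hello", model="Qwen/Qwen2.5-7B-Instruct", think_steps=20)
# Receiver side: unpack the message with the same model to get the answer
answer = avp.unpack(msg, model="Qwen/Qwen2.5-7B-Instruct")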

Or with more control:

from avp import HuggingFaceConnector
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
context = connector.think("Analyze this problem", steps=20)
answer = connector.generate("Solve it.", context=context)

A vLLM connector is also available: pip install "avp[vllm]"