AVP Protocol Enables LLM Agents to Share KV-Cache Instead of Text for Token Efficiency

What AVP Does
AVP (Agent Vector Protocol) lets LLM agents in a multi-agent setup pass KV-cache directly to one another instead of text. This eliminates the redundant tokenization and forward passes that occur when each agent re-processes the entire conversation history.
How It Works
In the traditional text-based approach, each agent re-tokenizes everything it receives. With AVP, Agent A serializes its key-value attention states after reasoning, and Agent B injects them directly. How the transfer happens depends on how the two models relate, and the protocol is designed to slot into existing stacks:
- Same model on both sides: Direct KV-cache transfer with zero overhead
- Same family, different size (e.g., Qwen2.5-7B talking to 1.5B): Vocabulary-mediated projection with no learned parameters or calibration data needed
- Different families: Falls back to JSON
- Transport-agnostic: Works alongside A2A, MCP, gRPC, or whatever you're already using
- Binary wire format: Not JSON+Base64 (which has 33% overhead on tensor data)
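The mode selection and the Base64 figure above can be sketched in a few lines. This is only an illustration, not AVP's actual negotiation logic: the `TransferMode` enum, the `choose_mode` helper, and the `family/size` identifier convention are all hypothetical names introduced here. The overhead calculation uses only the Python standard library.

```python
import base64
import os
from enum import Enum


class TransferMode(Enum):
    """Hypothetical labels for the three cases described in the list above."""
    DIRECT_KV = "direct KV-cache transfer"
    PROJECTED = "vocabulary-mediated projection"
    JSON_FALLBACK = "JSON fallback"


def choose_mode(sender: str, receiver: str) -> TransferMode:
    """Pick a mode from two 'family/size' identifiers (an assumed naming scheme)."""
    if sender == receiver:
        return TransferMode.DIRECT_KV
    # Assumption: everything before the first '/' names the model family.
    if sender.split("/")[0] == receiver.split("/")[0]:
        return TransferMode.PROJECTED
    return TransferMode.JSON_FALLBACK


def base64_overhead(num_bytes: int) -> float:
    """Return the fractional size increase from Base64-encoding random bytes."""
    payload = os.urandom(num_bytes)
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload) - 1.0


if __name__ == "__main__":
    print(choose_mode("qwen2.5/7b", "qwen2.5/7b").value)
    print(choose_mode("qwen2.5/7b", "qwen2.5/1.5b").value)
    print(choose_mode("qwen2.5/7b", "llama3.2/3b").value)
    print(f"Base64 overhead: {base64_overhead(3_000_000):.1%}")
```

Base64 emits four output characters for every three input bytes, which is where the roughly 33% figure comes from. On payloads in the hundred-megabyte range, that cost is the motivation for a binary wire format.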
Performance Results
Testing across Qwen2.5, Llama 3.2, and DeepSeek-R1-Distill models showed:
- Token savings of 73-78%
- 2-4x speedups
- Results were consistent across all three model families
- The gap widens with chain length: at 4 agents it's roughly 2x, at 16 agents (projected) it would be around 6x
The efficiency comes from text prompt sizes ballooning at each hop (186 → 545 → 1,073 → 1,397 tokens in a 4-agent GSM8K chain), while latent-mode prompts stay flat at ~164-207 tokens per hop because prior context arrives as pre-computed KV-cache.
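A back-of-the-envelope check of those per-hop numbers is sketched below. It counts prompt tokens only and assumes every latent hop sits at either the low or the high end of the quoted range, so it is not expected to reproduce the reported 73-78% exactly; the variable and function names are illustrative.

```python
# Per-hop prompt sizes for the text-based 4-agent GSM8K chain (from the post).
text_prompt_tokens = [186, 545, 1073, 1397]

# Quoted per-hop range for latent mode; assumed constant across all four hops.
latent_low, latent_high = 164, 207


def savings(text_total: int, latent_total: int) -> float:
    """Fraction of prompt tokens avoided by latent mode relative to text mode."""
    return 1.0 - latent_total / text_total


hops = len(text_prompt_tokens)
text_total = sum(text_prompt_tokens)
best_case = savings(text_total, latent_low * hops)
worst_case = savings(text_total, latent_high * hops)

if __name__ == "__main__":
    print(f"text prompt tokens over {hops} hops: {text_total}")
    print(f"latent prompt tokens: {latent_low * hops} to {latent_high * hops}")
    print(f"estimated savings: {worst_case:.1%} to {best_case:.1%}")
```

Under these assumptions the estimate comes out at roughly 74% to 80%, in the same neighborhood as the reported savings. Because the text prompts grow at every hop while the latent ones do not, the gap keeps widening as the chain gets longer.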
Limitations
- Sample sizes are n=20 per model (enough for token/speed claims but not for accuracy claims)
- Tested on small models only (1.5B-3B on an RTX 3070 Ti) with 7B+ results pending
- Requires 1 Gbps+ bandwidth minimum (KV-cache for a 3B model runs about 130 MB per sample)
- Self-hosted only (requires KV-cache access, won't work with OpenAI/Anthropic/etc. APIs)
- Benchmarks cover same-model transfer only for now (a cross-model implementation exists but has not been benchmarked)
- Latent mode uses 17-54x more VRAM than text because the KV-cache is held across hops
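To put the bandwidth and memory limitations in perspective, the sketch below estimates the transfer time for a 130 MB payload and shows the standard formula for KV-cache size. The helper names are made up for illustration, and the model configuration in the usage lines is a placeholder, not the configuration of any model tested in the post. Protocol and network overheads are ignored.

```python
def transfer_seconds(payload_megabytes: float, link_gbps: float) -> float:
    """Ideal transfer time, taking 1 MB = 10**6 bytes and ignoring protocol overhead."""
    bits = payload_megabytes * 8e6
    return bits / (link_gbps * 1e9)


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of a KV-cache: one key and one value tensor per layer.

    bytes_per_value defaults to 2, which corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    for gbps in (1.0, 10.0):
        print(f"130 MB over {gbps:g} Gbps: {transfer_seconds(130, gbps):.3f} s")

    # Placeholder configuration, chosen only to exercise the formula.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=512)
    print(f"illustrative KV-cache size: {size / 1e6:.1f} MB")
```

At 1 Gbps a 130 MB sample takes about a second to move even in the ideal case, which is why slower links would erase much of the speedup. The size formula grows linearly with sequence length, consistent with latent mode's higher VRAM footprint.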
Getting Started
Install with: pip install avp
Two API levels available:
import avp
msg = avp.pack("Hello", model="Qwen/Qwen2.5-7B-Instruct", think_steps=20)
answer = avp.unpack(msg, model="Qwen/Qwen2.5-7B-Instruct")
Or with more control:
from avp import HuggingFaceConnector
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
context = connector.think("Analyze this problem", steps=20)
answer = connector.generate("Solve it.", context=context)
A vLLM connector is also available: pip install "avp[vllm]"
Project Links
- SDK: github.com/VectorArc/avp-python (MIT, 377 tests, 7 benchmarks)
- Spec: github.com/VectorArc/avp-spec
- Benchmark details: BENCHMARKS.md
📖 Read the full source: r/LocalLLaMA