Ctxpact: Compress 110k Tokens to 12k for Local LLMs

Ctxpact is a lightweight OpenAI-compatible proxy that sits between AI agents and local LLMs to intelligently compress oversized inputs before they hit models with limited context windows. It's designed for agentic workflows like OpenClaw and Hermes that send 100k+ token payloads to models with only 16k context windows, where truncation would lose critical information.

How It Works

The system uses a 3-stage compaction pipeline:

DCP (Dynamic Context Pruning): Dedups tool calls, strips superseded file writes, truncates error stack traces. Zero LLM calls, purely structural.
Summarize: Evicts old conversation turns, replaces with LLM-generated summaries. Keeps a sliding window of recent turns intact.
Extract: When input is still too large (like a 110k novel), uses one of 16 extraction strategies to pull the most relevant content within token budget.

Extraction Strategies

The extraction stage implements 16 strategies ranging from:

0 LLM calls: Embedding similarity (ChromaDB), section headers, heuristic keyword grep, LLMLingua compression
1 LLM call: LLM generates search terms, IDF-weighted word-level matching assembles context
2 LLM calls (best accuracy): readagent — embed + BM25 + RRF fusion, dual LLM term expansion, position-aware excerpting
N LLM calls: Multi-turn tool-calling loops, DSPy code generation, map-reduce chunking

Benchmark Results

Tested 12 strategies across 2 models (LFM2-8B-A1B and Qwen3.5-9B) on 331 GGUF models total:

Frankenstein test: 110k tokens compressed to 12k tokens, 8 reading comprehension questions; 8/8 correct, deterministic across 3 consecutive runs, 0% variance
LoCoMo-MC10: Multi-session conversation QA, 10-choice, random baseline is 10%; readagent + Qwen3.5-9B scores 15/20 (75%)
Combined performance: readagent + Qwen3.5-9B achieves 87.5%, rlm + Qwen3.5-9B achieves 80.0%

Key Findings

Model choice matters more than strategy choice: Switching from LFM2 to Qwen3.5 improved every single strategy by +25-50 percentage points. Median strategy went from 5/8 to 7/8 just by changing model.
NR-MMLU predicts context engineering performance: LFM2's 47% NR-MMLU vs Qwen3.5's 65% maps directly to accuracy differences.
2 LLM extraction calls is the sweet spot: Going from 0 to 1 call gives meaningful boost; 1 to 2 calls reaches peak accuracy. Beyond 2 calls, accuracy drops.
readagent and rlm are breakthrough strategies: Both achieve 8/8 on Frankenstein. Only strategies that solve Q4 (Ireland question). readagent leads cross-domain at 75% LoCoMo vs rlm's 60%.

Technical Details

Architecture: Standalone proxy (considered LiteLLM plugin and sidecar process) because breakthrough strategies need mid-pipeline LLM calls
Implementation: ~11k lines of Python, FastAPI server, 3 endpoints, OpenAI-compatible, no heavy frameworks
Compatibility: Drops in front of any llama-server / Ollama / vLLM backend. No API keys, no cloud, everything runs on your hardware

For developers running local LLMs with agentic workflows that exceed context windows, Ctxpact provides a practical solution to maintain information integrity while staying within hardware constraints.

📖 Read the full source: r/LocalLLaMA