Be brief beats caveman plugin in Claude Code compression benchmark

Max Taylor benchmarked the popular Claude Code compression plugin 'caveman' against a trivial baseline: prepending 'be brief.' to each prompt. The results are surprisingly flat — but reveal where the plugin actually delivers value.
Benchmark methodology
24 prompts across six categories (bug diagnosis, concept explanation, architecture tradeoffs, multi-step setup, security/destructive ops, error interpretation). Each prompt had a rubric with required key points, required terms, and forbidden claims. Five arms were tested: baseline (no instruction), 'be brief.', and caveman at three intensity levels (lite, full, ultra). All ran via claude -p on claude-opus-4-7. Responses were scored by claude-sonnet-4-6 against the rubric.
Quality results
All arms scored within 1.5% of each other:
- Baseline: 0.985
- Brief: 0.985
- Lite: 0.976
- Full: 0.975
- Ultra: 0.970
Every arm hit 100% of key points. Zero forbidden claims triggered across 120 responses. Compression did not drop substantive content.
Token counts
| Arm | Mean tokens |
|---|---|
| Baseline | 636 |
| Brief | 419 (34% cut) |
| Lite | 401 |
| Full | 404 |
| Ultra | 449 |
'Be brief.' cut tokens by 34% vs baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three — but the category split tells a different story.
The category split reveals caveman's design
On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied. Compression works as advertised. On multi-step setup and security warnings, all caveman modes show higher token counts. The reason: caveman's 'Auto-Clarity' rule explicitly disables compression for safety warnings, irreversible actions, and multi-step sequences. The safety escape engages, and compression stops — by design.
So what is caveman actually for?
If 'be brief.' matches on tokens and quality, the plugin's value is structural:
- Consistent output shape — every response follows the same pattern, useful for downstream tooling or uniform session feel.
- Intensity dial — slash commands to switch lite/full/ultra mid-session.
- Persistence across long sessions — caveman re-injects its ruleset via
SessionStartandUserPromptSubmithooks to prevent drift (not tested in this single-shot benchmark).
The full dataset and harness are open source.
📖 Read the full source: HN AI Agents
👀 See Also

Ouroboros 0.26.0-beta Combines Claude and Codex via MCP Server
Ouroboros 0.26.0-beta introduces a harness that runs Claude and Codex simultaneously, assigning Claude to clarify user intent and Codex to execute well-defined tasks via an MCP server architecture.

Node Control: Real-Time Multiplayer .io Game Built Entirely with Claude 4.6 and 4.7
Developer built a live competitive multiplayer .io game, Node Control, using Claude 4.6 and 4.7. Features server-authoritative netcode at 60Hz, 4-region deployment on fly.io, and neural-network aesthetic.

Claude Agent Teams UI: Desktop App for Visualizing Claude Code Agent Workflows
A developer built a free, open-source desktop app that adds a visual layer to Claude Code's experimental Agent Teams feature. The app provides a real-time kanban board where tasks move automatically as agents work, plus cross-team communication, built-in review workflows, and per-task code review.

Codegraph: Pre-indexed knowledge graph cuts Claude/Cursor tool calls by 94%
Codegraph uses a pre-indexed knowledge graph of symbol relationships, call graphs, and code structure to reduce API tool calls by up to 94% and speed up usage by ~77% for Claude, Cursor, Codex, and OpenCode agents.