Claude Code: 'Be Brief' Outperforms Caveman Plugin in 24-Prompt Test

Max Taylor benchmarked the popular Claude Code compression plugin 'caveman' against a trivial baseline: prepending 'be brief.' to each prompt. The results are surprisingly flat, but they reveal where the plugin actually delivers value.

Benchmark methodology

The benchmark used 24 prompts across six categories: bug diagnosis, concept explanation, architecture tradeoffs, multi-step setup, security/destructive ops, and error interpretation. Each prompt had a rubric with required key points, required terms, and forbidden claims. Five arms were tested: baseline (no instruction), 'be brief.', and caveman at three intensity levels (lite, full, ultra). All arms ran via claude -p on claude-opus-4-7, and claude-sonnet-4-6 scored the responses against the rubric.
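The setup described above can be sketched as a small data model. This is an illustration only: the names Rubric, ARMS, and build_prompt are hypothetical, the single-space separator after 'be brief.' is an assumption, and the real harness may be organized quite differently.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; not taken from the published harness.
ARMS = ["baseline", "brief", "lite", "full", "ultra"]


@dataclass
class Rubric:
    key_points: list[str]        # ideas the answer must cover
    required_terms: list[str]    # terms that must appear in the answer
    forbidden_claims: list[str] = field(default_factory=list)  # statements that must not appear


def build_prompt(arm: str, prompt: str) -> str:
    """Return the prompt text for one arm of the benchmark."""
    if arm not in ARMS:
        raise ValueError(f"unknown arm: {arm!r}")
    if arm == "brief":
        # Assumed separator: a single space between the instruction and the prompt.
        return f"be brief. {prompt}"
    # Assumption: the caveman arms are switched on through the plugin itself,
    # so the prompt text is unchanged for them, as it is for the baseline.
    return prompt
```

Under these assumptions, only the 'brief' arm changes the prompt text; the other four arms differ in how the session is configured.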

Quality results

All five arms scored within 0.015 of each other (0.970 to 0.985):

Baseline: 0.985
Brief: 0.985
Lite: 0.976
Full: 0.975
Ultra: 0.970

Every arm hit 100% of key points, and no forbidden claims were triggered across the 120 responses (24 prompts × 5 arms). Compression did not drop substantive content.
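The spread and the response count can be checked directly from the reported figures. The sketch below uses only the numbers listed in this section; the function names are illustrative.

```python
# Mean rubric scores as reported above.
scores = {"baseline": 0.985, "brief": 0.985, "lite": 0.976, "full": 0.975, "ultra": 0.970}


def score_spread(values: dict[str, float]) -> float:
    """Absolute gap between the best and worst arm, rounded to 3 decimals."""
    return round(max(values.values()) - min(values.values()), 3)


def relative_spread_percent(values: dict[str, float]) -> float:
    """The same gap expressed as a percentage of the best score."""
    best = max(values.values())
    return round(100 * (best - min(values.values())) / best, 1)


def total_responses(num_prompts: int, num_arms: int) -> int:
    """One response per prompt per arm."""
    return num_prompts * num_arms
```

With the reported scores, the absolute gap is 0.015, which is about 1.5% of the best score, and 24 prompts across 5 arms give 120 responses.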

Token counts

Arm	Mean tokens
Baseline	636
Brief	419 (34% cut)
Lite	401
Full	404
Ultra	449

'Be brief.' cut tokens by 34% vs baseline. Caveman lite and full landed close to brief, at 401 and 404 tokens. Ultra, the strictest mode, produced the longest answers of the three caveman modes at 449 tokens, but the category split tells a different story.
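The same reduction can be worked out for every arm from the mean token counts in the table. This is plain arithmetic on the reported means; the function name is illustrative.

```python
# Mean tokens per arm, as reported in the table above.
mean_tokens = {"baseline": 636, "brief": 419, "lite": 401, "full": 404, "ultra": 449}


def reduction_vs_baseline(tokens: dict[str, int], baseline_key: str = "baseline") -> dict[str, int]:
    """Percentage token reduction of each arm relative to the baseline, rounded to whole percent."""
    base = tokens[baseline_key]
    return {
        arm: round(100 * (base - count) / base)
        for arm, count in tokens.items()
        if arm != baseline_key
    }


reductions = reduction_vs_baseline(mean_tokens)
# reductions == {"brief": 34, "lite": 37, "full": 36, "ultra": 29}
```

By this measure lite and full each save two to three percentage points more than 'be brief.', while ultra saves about five points less.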

The category split reveals caveman's design

On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is the shortest arm or tied for shortest, so compression works as advertised there. On multi-step setup and security warnings, all caveman modes show higher token counts. The reason: caveman's 'Auto-Clarity' rule explicitly disables compression for safety warnings, irreversible actions, and multi-step sequences. When that safety escape engages, compression stops by design.
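To make the idea of a safety escape concrete, here is a toy gate that decides whether a draft answer is eligible for compression. It is not caveman's implementation: the plugin states its rule in prose inside its ruleset, and the marker list, the three-step threshold, and the name should_compress are all assumptions made up for this sketch.

```python
import re

# Hypothetical trigger phrases, invented for illustration only.
SAFETY_MARKERS = ("rm -rf", "drop table", "force push", "irreversible", "cannot be undone")

# Lines that look like numbered steps, e.g. "1. Install" or "2) Configure".
STEP_PATTERN = re.compile(r"^\s*\d+[.)]\s", re.MULTILINE)


def should_compress(draft: str) -> bool:
    """Return False when the draft looks like a safety warning or a multi-step sequence."""
    lowered = draft.lower()
    if any(marker in lowered for marker in SAFETY_MARKERS):
        return False  # safety warning or irreversible action: keep full wording
    if len(STEP_PATTERN.findall(draft)) >= 3:
        return False  # assumed threshold: three or more numbered steps
    return True
```

A gate like this would explain the pattern in the category split: answers in the setup and security categories skip compression, so their token counts rise regardless of the intensity level.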

So what is caveman actually for?

If 'be brief.' matches on tokens and quality, the plugin's value is structural:

Consistent output shape: every response follows the same pattern, useful for downstream tooling or uniform session feel.
Intensity dial: slash commands to switch lite/full/ultra mid-session.
Persistence across long sessions: caveman re-injects its ruleset via SessionStart and UserPromptSubmit hooks to prevent drift (not tested in this single-shot benchmark).
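The persistence mechanism in the last item can be sketched as a hook script. This is a minimal sketch, not caveman's code. It assumes the hook receives a JSON payload on stdin with a hook_event_name field and that text printed to stdout is added to the model's context for these two events. RULESET is placeholder text, not the plugin's real rules.

```python
import json
import sys

# Placeholder ruleset; caveman's actual rules are not reproduced here.
RULESET = "Respond tersely. Keep full sentences for safety warnings and multi-step instructions."

# The two hook events named in the article.
REINJECT_EVENTS = {"SessionStart", "UserPromptSubmit"}


def context_for(event: dict) -> str:
    """Return the text to re-inject for this hook event, or an empty string."""
    if event.get("hook_event_name") in REINJECT_EVENTS:
        return RULESET
    return ""


def main() -> None:
    """Entry point for use as a hook script; call main() when run that way."""
    event = json.load(sys.stdin)   # assumed: payload arrives as JSON on stdin
    text = context_for(event)
    if text:
        print(text)                # assumed: stdout is added to the context
```

Re-sending the rules on every prompt costs a few input tokens each turn. The presumed benefit is that the instructions stay recent in context during long sessions, which a single-shot benchmark cannot measure.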

The full dataset and harness are open source.

📖 Read the full source: HN AI Agents