The Human Creativity Benchmark: Separating Convergence from Divergence in AI Creative Evaluation

Contra Labs' new Human Creativity Benchmark (HCB) tackles a core problem in evaluating AI-generated creative work: creative tasks have no ground truth. Traditional benchmarks treat evaluator disagreement as noise to be resolved via majority voting or adjudication. The HCB instead separates convergence (agreement on shareable best practices) from divergence (genuine differences in aesthetic taste).
Key Findings
- Convergence is high on verifiable axes: prompt adherence, usability, and technical correctness (e.g., legibility, layout).
- Divergence dominates on taste-driven axes: visual appeal, mood, conceptual risk.
- Desktop Apps and Landing Pages show the highest convergence; Ad Video and Brand Assets remain the most divergent.
- No current generative model is reliably both correct (convergent) and steerable (divergent on request).
- Mode collapse is identified as a practical problem: models converge on safe, averaged aesthetics when given the same brief.
Methodology
The HCB defines evaluation axes on a spectrum from objectively verifiable to inherently subjective. For each axis, evaluator agreement is measured. Convergence reflects shared standards like visual hierarchy, color contrast, and rendering quality. Divergence captures personal taste—essential for creative workflows where professionals need multiple directions for exploration and iteration.
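As a rough illustration of what measuring agreement per axis could look like, here is a minimal sketch. It assumes each evaluator assigns a categorical label (for example, which of two outputs they prefer) to each item on each axis, and it uses simple mean pairwise agreement. The summary above does not say which agreement statistic the HCB actually uses, so the metric, the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(labels_by_item):
    """Fraction of evaluator pairs that gave the same label to the same item."""
    agree = 0
    total = 0
    for labels in labels_by_item.values():
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total if total else float("nan")


def agreement_by_axis(records):
    """Group (axis, item_id, label) records and score each axis separately."""
    by_axis = defaultdict(lambda: defaultdict(list))
    for axis, item_id, label in records:
        by_axis[axis][item_id].append(label)
    return {axis: pairwise_agreement(items) for axis, items in by_axis.items()}


# Hypothetical votes: three evaluators, two items, two axes.
records = [
    ("prompt_adherence", "item1", "A"),
    ("prompt_adherence", "item1", "A"),
    ("prompt_adherence", "item1", "A"),
    ("prompt_adherence", "item2", "B"),
    ("prompt_adherence", "item2", "B"),
    ("prompt_adherence", "item2", "B"),
    ("visual_appeal", "item1", "A"),
    ("visual_appeal", "item1", "B"),
    ("visual_appeal", "item1", "A"),
    ("visual_appeal", "item2", "B"),
    ("visual_appeal", "item2", "A"),
    ("visual_appeal", "item2", "A"),
]

scores = agreement_by_axis(records)
for axis, score in scores.items():
    print(f"{axis}: {score:.2f}")
```

In this toy data the verifiable axis scores near 1.0 while the taste-driven axis scores much lower, which is the pattern the benchmark describes. A chance-corrected statistic such as Fleiss' kappa would be a more defensible choice on real data; raw pairwise agreement is used here only to keep the sketch short.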
Implications for AI Agents
For developers using AI coding agents, this benchmark underscores that creative tools must offer both reliability (following instructions) and steerability (adjusting to personal taste). The HCB provides a framework for evaluating these dimensions separately, rather than collapsing divergence into a single quality score. Agents that cannot produce differentiated output risk being unusable for real creative work.
📖 Read the full source: HN AI Agents