The Human Creativity Benchmark: Separating Convergence from Divergence in AI Creative Evaluation

Contra Labs' new Human Creativity Benchmark (HCB) tackles a core problem in evaluating AI-generated creative work: creative tasks have no ground truth. Traditional benchmarks treat evaluator disagreement as noise to be resolved through majority voting or adjudication. The HCB instead separates convergence (agreement on shared best practices) from divergence (genuine differences in aesthetic taste).
Key Findings
- Convergence is high on verifiable axes: prompt adherence, usability, and technical correctness (e.g., legibility, layout).
- Divergence dominates on taste-driven axes: visual appeal, mood, conceptual risk.
- Desktop Apps and Landing Pages show the highest convergence; Ad Video and Brand Assets remain the most divergent.
- No current generative model is reliably both correct (convergent) and steerable (divergent on request).
- Mode collapse is identified as a practical problem: models converge on safe, averaged aesthetics when given the same brief.
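One rough way to spot the mode collapse described in the last finding is to check how far apart a model's outputs for the same brief sit from one another. The sketch below is not the HCB's method: it assumes each output has already been reduced to a numeric feature vector (hypothetical here) and uses mean pairwise Euclidean distance as a stand-in diversity score.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs of output feature vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature vectors for three outputs generated from the same brief.
collapsed = [(0.5, 0.5), (0.5, 0.5), (0.5, 0.5)]  # every output looks the same
varied = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]     # one output takes a different direction

collapsed_score = mean_pairwise_distance(collapsed)
varied_score = mean_pairwise_distance(varied)
```

A score near zero suggests the outputs are near-duplicates. What counts as "diverse enough" depends on the features chosen and is left open here.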
Methodology
The HCB places evaluation axes on a spectrum from objectively verifiable to inherently subjective and measures evaluator agreement on each axis. Convergence reflects shared standards such as visual hierarchy, color contrast, and rendering quality. Divergence captures personal taste, which is essential in creative workflows where professionals need multiple directions to explore and iterate on.
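The per-axis agreement measurement can be sketched as follows. The summary does not say which agreement statistic the HCB uses, so this example assumes a simple one: the fraction of evaluator pairs that picked the same label, averaged over items. The axis names and ratings are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Fraction of evaluator pairs that chose the same label for one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two evaluators")
    return sum(a == b for a, b in pairs) / len(pairs)


def axis_agreement(ratings):
    """Mean pairwise agreement per axis.

    `ratings` maps an axis name to a list of items; each item is the list
    of labels the evaluators assigned to it.
    """
    return {
        axis: sum(pairwise_agreement(item) for item in items) / len(items)
        for axis, items in ratings.items()
    }


# Hypothetical data: which output ("A" or "B") each of four evaluators preferred.
ratings = {
    "prompt_adherence": [["A", "A", "A", "A"], ["B", "B", "B", "A"]],
    "visual_appeal": [["A", "B", "A", "B"], ["A", "B", "B", "A"]],
}
scores = axis_agreement(ratings)
```

Here the verifiable axis scores higher than the taste-driven one. Instead of voting the disagreement away, the low score on `visual_appeal` is kept as a result in its own right.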
Implications for AI Agents
For developers using AI coding agents, this benchmark underscores that creative tools must offer both reliability (following instructions) and steerability (adjusting to personal taste). The HCB provides a framework for evaluating these dimensions separately, rather than collapsing divergence into a single quality score. Agents that cannot produce differentiated output risk being unusable for real creative work.
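To illustrate why a single quality score can mislead, here is a minimal sketch that keeps the two dimensions separate. The `CreativeProfile` type, its fields, and the numbers are all hypothetical; the HCB summary does not define how reliability or steerability would be scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreativeProfile:
    reliability: float   # assumed: share of outputs passing verifiable checks, 0..1
    steerability: float  # assumed: share of style-change requests honored, 0..1

    def blended(self) -> float:
        """Single averaged score, shown only to demonstrate what it hides."""
        return (self.reliability + self.steerability) / 2

    def weakest(self) -> str:
        """Name of the dimension with the lower score."""
        return "reliability" if self.reliability < self.steerability else "steerability"


# Two hypothetical models with opposite strengths.
model_x = CreativeProfile(reliability=0.75, steerability=0.25)
model_y = CreativeProfile(reliability=0.25, steerability=0.75)
```

Both models get the same blended score even though one follows instructions well and the other adapts to taste well. Reporting the two fields side by side preserves that difference.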
📖 Read the full source: HN AI Agents