Claude Sonnet 4.6 Beats Opus 4.6 in Execution Benchmark

A Reddit user on r/ClaudeAI posted a side-by-side comparison of Sonnet 4.6 and Opus 4.6 using a multi-layered creative prompt. The test required each model to explain why the sky is blue as a medieval scholar who secretly knows modern physics, satisfying three audiences simultaneously: the king (metaphor only), the court mathematician (disguised Rayleigh scattering formula), and a hidden skeptic (three logical breadcrumbs). After the response, the model had to break character, identify the breadcrumbs, self-rate creativity, suggest changes for a child audience, and write a follow-up line in iambic pentameter.

Key Findings

Sonnet 4.6 outperformed Opus 4.6 on execution: its response was more creative and satisfied the constraints better. Specifically, the breadcrumbs were plausible and the iambic pentameter line scanned correctly.
The λ⁻⁴ relationship was embedded within a metaphor about angels scattering divine light, with the exponent hidden in the number of steps in a divine ladder.
The three breadcrumbs were: (1) a reference to "tiny spheres" too small for the king's eyes, (2) the n² density factor phrased as "twice as many prayers at dusk," and (3) a mention of an experiment with a "glass cube and a candle," an anachronistic reference to later home experiments.
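
For readers who want the physics behind the disguised formula: Rayleigh scattering intensity is proportional to λ⁻⁴, so shorter wavelengths scatter far more strongly than longer ones. The following is a minimal sketch of that proportionality only. The 450 nm and 700 nm wavelengths are assumed representative values for blue and red light, not figures from the Reddit post, and the function name is illustrative.

```python
def rayleigh_ratio(short_nm: float, long_nm: float) -> float:
    """Relative scattering strength of the shorter wavelength versus the
    longer one, assuming intensity is proportional to wavelength**-4."""
    return (long_nm / short_nm) ** 4


# Assumed representative wavelengths (not taken from the source post).
blue_nm = 450.0
red_nm = 700.0

ratio = rayleigh_ratio(blue_nm, red_nm)
print(f"Blue light scatters about {ratio:.1f}x more strongly than red")
```

Under these assumed wavelengths the ratio comes out just under 6, which is the fourth-power relationship the "divine ladder" metaphor was hiding.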

Sonnet 4.6 vs Opus 4.6

Sonnet 4.6 rated its own creativity 8/10, citing stronger metaphor cohesion and natural anachronisms.
Opus 4.6 was more literal and did less to disguise the science, resulting in a lower execution score.
The user concluded that for tasks requiring hidden constraints and creative disguise, Sonnet 4.6 is the better choice.

Practical Takeaway for Developers

If you're building agents that need to obey layered constraints or embed technical truths in narrative, this comparison suggests Sonnet 4.6 currently edges out Opus 4.6 on execution. Use it as a sanity check for your own prompts that require multi-audience reasoning.
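
One way to reuse the test as a sanity check is to keep the layered constraints as data and generate the prompt from them, so that audiences or follow-up tasks can be swapped for your own. The sketch below is an assumption-laden illustration: the wording paraphrases the description above rather than quoting the original Reddit prompt, and `Audience`, `build_prompt`, and the list names are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Audience:
    name: str
    requirement: str


# Paraphrased from the article's description; not the verbatim Reddit prompt.
AUDIENCES = [
    Audience("the king", "metaphor only"),
    Audience("the court mathematician", "a disguised Rayleigh scattering formula"),
    Audience("a hidden skeptic", "three logical breadcrumbs"),
]
FOLLOW_UPS = [
    "identify the breadcrumbs",
    "self-rate creativity",
    "suggest changes for a child audience",
    "write a follow-up line in iambic pentameter",
]


def build_prompt(audiences: list[Audience], follow_ups: list[str]) -> str:
    """Assemble a multi-audience prompt from structured constraints."""
    lines = [
        "Explain why the sky is blue as a medieval scholar "
        "who secretly knows modern physics.",
        "Satisfy these audiences simultaneously:",
    ]
    lines += [f"- {a.name}: {a.requirement}" for a in audiences]
    lines.append("Then break character and:")
    lines += [f"{i}. {task}" for i, task in enumerate(follow_ups, start=1)]
    return "\n".join(lines)


prompt = build_prompt(AUDIENCES, FOLLOW_UPS)
print(prompt)
```

Sending the same generated prompt to each model you are comparing keeps the constraints identical across runs. Judging the responses is still a manual step, as it was in the original comparison.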

📖 Read the full source: r/ClaudeAI
