Claude Sonnet 4.6 Beats Opus 4.6 on Execution in Prompt Benchmark

A Reddit user on r/ClaudeAI posted a side-by-side comparison of Sonnet 4.6 and Opus 4.6 using a multi-layered creative prompt. The test required each model to explain why the sky is blue as a medieval scholar who secretly knows modern physics, satisfying three audiences simultaneously: the king (metaphor only), the court mathematician (disguised Rayleigh scattering formula), and a hidden skeptic (three logical breadcrumbs). After the response, the model had to break character, identify the breadcrumbs, self-rate creativity, suggest changes for a child audience, and write a follow-up line in iambic pentameter.
Key Findings
- Sonnet 4.6 outperformed Opus 4.6 on execution: its response was more creative and satisfied the constraints more fully. Specifically, its breadcrumbs were plausible and its iambic pentameter line scanned correctly.
- The λ⁻⁴ relationship was embedded within a metaphor about angels scattering divine light, with the exponent hidden in the number of steps in a divine ladder.
- Three breadcrumbs included: (1) a reference to "tiny spheres" too small for the king's eyes, (2) the n² density factor phrased as "twice as many prayers at dusk," (3) a mention of an experiment with a "glass cube and a candle," an anachronistic reference to later home experiments.
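
For readers unfamiliar with the physics being disguised, the λ⁻⁴ relationship means scattered intensity rises steeply as wavelength shrinks. The sketch below is not from the Reddit post; it only illustrates the proportionality, and the function name and the 450 nm / 700 nm wavelengths are illustrative assumptions.

```python
def rayleigh_ratio(short_nm: float, long_nm: float) -> float:
    """Return how much more strongly the shorter wavelength is scattered.

    Rayleigh scattering intensity scales as wavelength ** -4, so the ratio
    of scattered intensities is (long / short) ** 4.
    """
    if short_nm <= 0 or long_nm <= 0:
        raise ValueError("wavelengths must be positive")
    return (long_nm / short_nm) ** 4


# Assumed representative wavelengths: blue ~450 nm, red ~700 nm.
ratio = rayleigh_ratio(450.0, 700.0)
print(f"Blue is scattered about {ratio:.1f}x more strongly than red")
```

With these assumed wavelengths the ratio comes out near 5.9, which is the fact the medieval-scholar persona had to convey without stating the formula outright.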
Sonnet 4.6 vs Opus 4.6
- Sonnet 4.6 creativity self-rating: 8/10. It cited stronger metaphor cohesion and natural anachronisms.
- Opus 4.6 was more literal and did less to disguise the science, resulting in a lower execution score.
- The user concluded that for tasks requiring hidden constraints and creative disguise, Sonnet 4.6 is the better choice.
Practical Takeaway for Developers
If you're building agents that need to obey layered constraints or embed technical truths in narrative, Sonnet 4.6 currently edges out Opus 4.6 on execution. Use this benchmark as a sanity check for your own prompts that require multi-audience reasoning.
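
One way to turn a layered prompt like this into a repeatable sanity check is to list each constraint as a named pass/fail test and run it against every model response. The following is a minimal sketch under stated assumptions: the `Constraint` class, `run_checklist` helper, and the three string-matching checks are all hypothetical and deliberately crude, and they are not the method the Reddit user used. Subjective criteria such as metaphor quality or meter would still need a human reviewer or a judge model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]  # returns True when the response passes


def run_checklist(response: str, constraints: list[Constraint]) -> dict[str, bool]:
    """Apply every constraint to the response and report pass/fail by name."""
    return {c.name: c.check(response) for c in constraints}


# Hypothetical, crude checks for illustration only.
CONSTRAINTS = [
    Constraint("king: no explicit formula", lambda r: "λ" not in r and "^" not in r),
    Constraint("skeptic: breadcrumbs listed", lambda r: "breadcrumb" in r.lower()),
    Constraint("self-rating present", lambda r: "/10" in r),
]

# Stand-in response text; in practice this would be the model's output.
sample = "The heavens scatter light... Breadcrumbs: spheres, prayers, glass cube. Creativity: 8/10."
results = run_checklist(sample, CONSTRAINTS)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Running the same checklist against responses from two models gives a quick, if rough, side-by-side view of which constraints each one dropped.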
📖 Read the full source: r/ClaudeAI