Open Source vs Frontier Models: Single-File Canvas Car Scene Benchmark

✍️ OpenClawRadar | 📅 Published: May 17, 2026

A developer ran the same single-file Canvas prompt across 12 models to compare open-source and frontier model capabilities on a realistic side-view car driving scene. The task: one standalone HTML file, no libraries, no external assets, with parallax scenery, spinning wheels, subtle body motion, cinematic lighting, and seamless looping. The test harness is OpenCodeOrchestra, and results are live at oco-canvas-car-scene-compare.
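To make the animation requirements concrete, here is a minimal sketch of the math that parallax scrolling, wheel rotation, and seamless looping usually come down to. It is not taken from the benchmark prompt or from any model's output. Every function name and parameter below is hypothetical, and the rolling-without-slipping wheel model is an assumption chosen for illustration.

```typescript
// Illustrative helpers only: names and parameters are hypothetical,
// not part of the benchmark or of any tested model's output.

/** Distance traveled along the road at time t, restarting every loop. */
function loopDistance(tSeconds: number, loopSeconds: number, loopLength: number): number {
  const phase = (tSeconds % loopSeconds) / loopSeconds; // 0 <= phase < 1 for t >= 0
  return phase * loopLength;
}

/** Horizontal offset of a scenery layer, wrapped to the layer's tile width. */
function layerOffset(distance: number, depthFactor: number, tileWidth: number): number {
  const raw = distance * depthFactor; // far layers use a smaller depthFactor
  // Double modulo keeps the result in [0, tileWidth) even for negative input.
  return ((raw % tileWidth) + tileWidth) % tileWidth;
}

/** Wheel rotation in radians, assuming the wheel rolls without slipping. */
function wheelAngle(distance: number, wheelRadius: number): number {
  return distance / wheelRadius;
}

/** True when a layer is back at its starting offset after one full loop. */
function layerLoopsSeamlessly(loopLength: number, depthFactor: number, tileWidth: number): boolean {
  const tiles = (loopLength * depthFactor) / tileWidth;
  return Math.abs(tiles - Math.round(tiles)) < 1e-9;
}
```

Under these assumptions, a scene loops without a visible jump only when every layer scrolls a whole number of tiles per loop, which is what `layerLoopsSeamlessly` checks. In a real single-file page, a `requestAnimationFrame` callback would feed the elapsed time into `loopDistance` and draw each layer at its wrapped offset.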

Models Tested

Each model ran in an isolated Orchestrator at its highest available thinking/effort setting. The list includes GPT-5.5 xhigh, GPT-5.4 xhigh, Claude Opus 4.7 (max effort), Claude Opus 4.6 (max effort), Claude Sonnet 4.6 (high effort), Kimi K2.6, DeepSeek V4 Pro, DeepSeek V4 Flash, GLM-5.1, MiniMax M2.7, Qwen 3.6 Plus, and Grok 4.3. Tokens per second (tok/s) and generation time were not measured.

Key Findings

Some models used auditor models internally; some didn't.
Clear winners and ambiguous results are visible in the gallery.
MiMo V2.5 Pro was excluded due to billing issues with the OpenCode Go subscription.

The gallery page allows side-by-side comparison of each model's output. Source code is on GitHub at AidenGeunGeun/oco-canvas-car-scene-compare.

📖 Read the full source: r/LocalLLaMA
