YC-Bench Benchmark Tests LLMs as Startup CEOs, GLM-5 Shows Strong Cost-Efficiency

✍️ OpenClawRadar · 📅 Published: April 13, 2026

YC-Bench: A Long-Horizon Startup Simulation Benchmark

Researchers have developed YC-Bench, a benchmark where an LLM plays the role of CEO in a simulated startup environment over a full year, involving hundreds of decision turns. The simulation requires managing employees, selecting contracts, handling payroll, and navigating a market where approximately 35% of clients secretly inflate work requirements after task acceptance. Feedback is delayed and sparse, with no hand-holding provided to the models.

Benchmark Results and Key Findings

The benchmark tested 12 models with 3 seeds each. The leaderboard shows:

  • 🥇 Claude Opus 4.6 - $1.27M average final funds (~$86 per run in API cost)
  • 🥈 GLM-5 - $1.21M average final funds (~$7.62 per run)
  • 🥉 GPT-5.4 - $1.00M average final funds (~$23 per run)
  • All other models performed below the starting capital of $200K, with several going bankrupt

GLM-5 is the standout result: its average final funds are within 5% of Claude Opus 4.6's ($1.21M vs. $1.27M) at roughly one-eleventh of the API cost (~$7.62 vs. ~$86 per run). For production agentic pipelines, that is a substantial cost-efficiency gain. On revenue per API dollar, however, Kimi-K2.5 tops the chart, 2.5× better than the next model.
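The "within 5%" and "11×" figures can be checked directly from the leaderboard numbers. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and function names are illustrative assumptions, and the inputs are the rounded averages quoted above, not per-seed data. Note that final funds per API dollar is a different quantity from the revenue-per-API-dollar metric on which Kimi-K2.5 leads, so it is not computed here.

```python
# Rounded leaderboard figures quoted in the article:
# average final funds (USD) and approximate API cost per run (USD).
results = {
    "Claude Opus 4.6": {"final_funds": 1_270_000, "api_cost": 86.00},
    "GLM-5": {"final_funds": 1_210_000, "api_cost": 7.62},
    "GPT-5.4": {"final_funds": 1_000_000, "api_cost": 23.00},
}


def performance_gap(model, baseline):
    """Fraction by which `model` trails `baseline` in average final funds."""
    return 1 - results[model]["final_funds"] / results[baseline]["final_funds"]


def cost_ratio(model, baseline):
    """How many times more the baseline costs per run than `model`."""
    return results[baseline]["api_cost"] / results[model]["api_cost"]


gap = performance_gap("GLM-5", "Claude Opus 4.6")
ratio = cost_ratio("GLM-5", "Claude Opus 4.6")
print(f"GLM-5 trails Claude Opus 4.6 by {gap:.1%}")
print(f"Claude Opus 4.6 costs {ratio:.1f}x more per run than GLM-5")
```

With these inputs the gap comes out at about 4.7% and the cost ratio at about 11.3×, consistent with the rounded claims in the text.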

What the Benchmark Reveals About LLM Capabilities

The benchmark exposes long-horizon coherence under delayed feedback, a capability most evaluations miss. When immediate feedback isn't available to determine decision quality, most models collapse into loops, abandon recently established strategies, or continue accepting tasks from clients they've already identified as problematic.

The strongest predictor of success wasn't model size or traditional benchmark scores, but whether the model actively used a persistent scratchpad to record learned information. Top-performing models rewrote their notes approximately 34 times per run, while bottom-performing models averaged 0–2 entries.
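To make the scratchpad idea concrete, here is a minimal sketch of a persistent notes store that an agent loop could rewrite each turn and prepend to its next prompt. This is not YC-Bench's actual interface: the `Scratchpad` class, its method names, the JSON file layout, and the example note are all hypothetical, chosen only to illustrate the pattern of rewriting notes and counting rewrites.

```python
import json
import os
import tempfile
from pathlib import Path


class Scratchpad:
    """Hypothetical persistent notes store (not the benchmark's real API)."""

    def __init__(self, path):
        self.path = Path(path)
        self.notes = ""
        self.rewrites = 0
        # Reload earlier notes so lessons survive across turns or restarts.
        if self.path.exists():
            saved = json.loads(self.path.read_text())
            self.notes = saved["notes"]
            self.rewrites = saved["rewrites"]

    def rewrite(self, new_notes):
        """Replace the notes wholesale and persist them to disk."""
        self.notes = new_notes
        self.rewrites += 1
        payload = {"notes": self.notes, "rewrites": self.rewrites}
        self.path.write_text(json.dumps(payload))

    def as_prompt_prefix(self):
        """Text to prepend to each turn's prompt."""
        body = self.notes or "(empty)"
        return f"NOTES FROM EARLIER TURNS:\n{body}\n"


def demo():
    """Write one note, then reload it from disk as a later turn would."""
    with tempfile.TemporaryDirectory() as tmp:
        notes_file = os.path.join(tmp, "notes.json")
        pad = Scratchpad(notes_file)
        pad.rewrite("Client A inflated scope after acceptance; decline it.")
        reloaded = Scratchpad(notes_file)
        return reloaded.as_prompt_prefix(), reloaded.rewrites


if __name__ == "__main__":
    prefix, count = demo()
    print(prefix)
    print(f"rewrites so far: {count}")
```

The design choice worth noting is that notes are replaced as a whole rather than appended, which matches the article's description of models "rewriting" their notes and keeps the prompt prefix from growing without bound over hundreds of turns.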

Resources and Implementation

The benchmark is fully open-source with code available on GitHub. The paper provides detailed methodology and results, while the leaderboard shows current model rankings. Researchers encourage others to run their own models and are available to answer queries.

📖 Read the full source: r/LocalLLaMA
