YC-Bench Benchmark Tests LLMs as Startup CEOs, GLM-5 Shows Strong Cost-Efficiency

YC-Bench: A Long-Horizon Startup Simulation Benchmark
Researchers have developed YC-Bench, a benchmark in which an LLM acts as CEO of a simulated startup for a full simulated year, spanning hundreds of decision turns. The model must manage employees, select contracts, handle payroll, and navigate a market where approximately 35% of clients secretly inflate work requirements after a task has been accepted. Feedback is delayed and sparse, and the models receive no guidance along the way.
Benchmark Results and Key Findings
The benchmark tested 12 models with 3 seeds each. The leaderboard shows:
- 🥇 Claude Opus 4.6 - $1.27M average final funds (~$86 per run in API cost)
- 🥈 GLM-5 - $1.21M average final funds (~$7.62 per run)
- 🥉 GPT-5.4 - $1.00M average final funds (~$23 per run)
- All other models performed below the starting capital of $200K, with several going bankrupt
GLM-5 is the standout result: its average final funds land within 5% of Claude Opus 4.6 ($1.21M vs. $1.27M) while costing approximately 11× less per run (~$7.62 vs. ~$86). For production agentic pipelines, that is a substantial cost-efficiency gain. Separately, Kimi-K2.5 tops the revenue-per-API-dollar chart, at 2.5× better than the next-best model.
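The "within 5%" and "11×" figures can be checked directly from the leaderboard numbers. The short Python sketch below does that arithmetic. The metric `funds_per_api_dollar` is an illustrative assumption (average final funds divided by per-run API cost) and is not necessarily how the benchmark's own revenue-per-API-dollar chart is defined. Kimi-K2.5 is omitted because its figures are not listed here.

```python
# Leaderboard figures quoted above:
# (average final funds in USD, approximate API cost per run in USD)
results = {
    "Claude Opus 4.6": (1_270_000, 86.00),
    "GLM-5": (1_210_000, 7.62),
    "GPT-5.4": (1_000_000, 23.00),
}


def funds_per_api_dollar(final_funds: float, api_cost: float) -> float:
    """Illustrative efficiency metric (an assumption, not the paper's definition)."""
    return final_funds / api_cost


opus_funds, opus_cost = results["Claude Opus 4.6"]
glm_funds, glm_cost = results["GLM-5"]

# Relative shortfall of GLM-5 versus Claude Opus 4.6 on final funds.
performance_gap = (opus_funds - glm_funds) / opus_funds

# How many times more a Claude Opus 4.6 run costs than a GLM-5 run.
cost_ratio = opus_cost / glm_cost

for name, (funds, cost) in results.items():
    print(f"{name}: ${funds_per_api_dollar(funds, cost):,.0f} final funds per API dollar")

print(f"GLM-5 gap to Opus: {performance_gap:.1%}")
print(f"Opus cost / GLM-5 cost: {cost_ratio:.1f}x")
```

Under these numbers the gap comes out a little under 5% and the cost ratio a little over 11, consistent with the claims above.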
What the Benchmark Reveals About LLM Capabilities
The benchmark exposes long-horizon coherence under delayed feedback, a capability most evaluations miss. When there is no immediate signal about whether a decision was good, most models collapse into loops, abandon strategies they have only just established, or keep accepting tasks from clients they have already identified as problematic.
The strongest predictor of success wasn't model size or traditional benchmark scores, but whether the model actively used a persistent scratchpad to record learned information. Top-performing models rewrote their notes approximately 34 times per run, while bottom-performing models averaged 0–2 entries.
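To make the scratchpad idea concrete, here is a minimal Python sketch of persistent notes carried across decision turns. Everything in it is hypothetical: the `Scratchpad` class, the `build_prompt` helper, and the example note are illustrations only and are not taken from the YC-Bench code.

```python
from dataclasses import dataclass


@dataclass
class Scratchpad:
    """Notes that persist across decision turns (hypothetical design)."""
    notes: str = ""
    rewrites: int = 0

    def rewrite(self, new_notes: str) -> None:
        # Replace the notes wholesale and count the rewrite, mirroring the
        # "rewrites per run" statistic described above.
        self.notes = new_notes
        self.rewrites += 1


def build_prompt(observation: str, pad: Scratchpad) -> str:
    """Prepend the saved notes so earlier lessons reach the current turn."""
    saved = pad.notes or "(empty)"
    return f"NOTES SO FAR:\n{saved}\n\nTHIS TURN:\n{observation}"


pad = Scratchpad()

# Turn N: the agent discovers a client inflated the work after acceptance.
pad.rewrite("Client A inflated scope after acceptance; decline future contracts from A.")

# Many turns later: the same client offers a new contract.
prompt = build_prompt("New contract offered by Client A.", pad)
print(prompt)
print(f"Scratchpad rewrites so far: {pad.rewrites}")
```

The point of the design is that the lesson about a problematic client is written down once and then re-injected on every later turn, so the decision does not depend on the model recalling an event from hundreds of turns earlier.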
Resources and Implementation
The benchmark is fully open-source, with code available on GitHub. The paper details the methodology and results, and the leaderboard shows current model rankings. The researchers encourage others to run their own models and are available to answer questions.
📖 Read the full source: r/LocalLLaMA