TL;DR
YC-Bench is a benchmark that tests whether AI agents can maintain strategic coherence and adapt over long horizons by having them run a simulated startup for one year across hundreds of turns. The results expose clear limitations in current models.
Contribution
Introduces YC-Bench, a long-horizon planning benchmark for AI agents, evaluates 12 proprietary and open-source models on it, and analyzes their failure modes and the factors that predict success.
Findings
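Only three of the 12 evaluated models consistently finished above the $200K starting capital; Claude Opus 4.6 achieved the highest average final funds ($1.27M), followed by GLM-5 ($1.21M).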
Scratchpad usage strongly correlates with success in long-term planning.
Failure to detect adversarial clients is the main failure mode, accounting for nearly half of bankruptcies.
Abstract
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M…
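The headline numbers in the abstract (average final funds per model, and whether a model "consistently" surpasses the starting capital) come from aggregating several seeded runs per model. A minimal sketch of one plausible aggregation follows. The function name `summarize`, the input layout, and the reading of "consistently" as "every seed finishes above the starting capital" are assumptions made for illustration, not the paper's stated procedure, and the example figures are placeholders rather than reported results.

```python
from statistics import mean

STARTING_CAPITAL = 200_000  # starting capital stated in the abstract ($200K)


def summarize(final_funds_by_model, starting_capital=STARTING_CAPITAL):
    """Aggregate per-seed final funds into per-model summary statistics.

    final_funds_by_model maps a model name to a list of final funds,
    one entry per seed (hypothetical input layout).
    """
    summary = {}
    for model, funds in final_funds_by_model.items():
        summary[model] = {
            # Average final funds across seeds.
            "mean_final_funds": mean(funds),
            # Assumed reading of "consistently": every seed ends above the start.
            "consistently_above_start": all(f > starting_capital for f in funds),
        }
    return summary


# Placeholder numbers for illustration only; NOT results from the paper.
example_runs = {
    "model_a": [900_000, 1_100_000, 1_000_000],
    "model_b": [250_000, 150_000, 0],
}
summary = summarize(example_runs)
```

Under this reading, a model whose mean is high but that goes bankrupt on one seed would not count as consistently profitable, which is why the per-seed check is kept separate from the mean.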
