$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Muyu He; Adit Jain; Anand Kumar; Vincent Tu; Soumyadeep Bakshi; Sachin Patro; Nazneen Rajani

arXiv:2604.01212 · cs.CL · April 2, 2026

TL;DR

YC-Bench is a benchmark that tests whether AI agents can maintain strategic coherence and adapt over long horizons by having them run a simulated startup for one simulated year. The results expose clear limitations in current models.

Contribution

Introduces YC-Bench, a long-horizon planning benchmark for AI agents, evaluates 12 models on it, and analyzes their failure modes and the predictors of success.

Findings

01

Only three models consistently surpassed the $200K starting capital; Claude Opus 4.6 performed best, with average final funds of $1.27M.

02

Scratchpad usage strongly correlates with success in long-term planning.

03

Failure to detect adversarial clients is the main failure mode, accounting for nearly half of bankruptcies.

Abstract

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M…
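To make the setup in the abstract concrete, here is a minimal toy sketch of how growing payroll and hidden adversarial clients could compound over an episode, and how final funds might be averaged across seeds. It is not the benchmark's implementation: only the $200K starting capital and the 3-seed averaging come from the abstract. The function names, the 12-turn horizon, the payout range, the payroll figures, and the 30% adversarial rate are all hypothetical choices made for illustration.

```python
import random

STARTING_CAPITAL = 200_000  # from the abstract; every other number is hypothetical


def run_episode(accept_policy, seed, turns=12, base_payroll=30_000, payroll_growth=1_000):
    """Toy episode. Returns (final_funds, went_bankrupt)."""
    rng = random.Random(seed)
    funds = STARTING_CAPITAL
    for turn in range(turns):
        payout = rng.randint(20_000, 80_000)
        # Hidden from the policy: a stand-in for partial observability.
        adversarial = rng.random() < 0.3
        # An adversarial client never pays, so accepted work earns nothing.
        if accept_policy(payout, funds) and not adversarial:
            funds += payout
        # Payroll is due every turn and grows over time.
        funds -= base_payroll + payroll_growth * turn
        if funds < 0:
            return funds, True
    return funds, False


def mean_final_funds(accept_policy, seeds=(0, 1, 2)):
    """Average final funds over several seeds, mirroring the 3-seed protocol."""
    finals = [run_episode(accept_policy, s)[0] for s in seeds]
    return sum(finals) / len(finals)


def never_accept(payout, funds):
    return False


def always_accept(payout, funds):
    return True
```

In this toy, `never_accept` goes bankrupt on the seventh turn with funds of -31,000 regardless of seed, because payroll alone drains the starting capital. `mean_final_funds(always_accept)` varies with which offers happen to be adversarial, which is the kind of variance that averaging over seeds is meant to smooth out.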

Code & Models

Repositories

collinear-ai/yc-bench (GitHub)
