RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
Xinbo Xu, Ruihan Yang, Haiyang Shen, Wendong Xu, Bofei Gao, Ruoyu Wu, Kean Shi, Weichu Xie, Xuanzhong Chen, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

TL;DR
RoadmapBench is a benchmark that evaluates AI coding agents on long-horizon, real-world software development tasks drawn from open-source version upgrades across 17 repositories and 5 programming languages. Even the strongest model evaluated solves only 39.1% of tasks.
Contribution
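The paper introduces RoadmapBench, a benchmark of 115 real-world, long-horizon coding tasks built from open-source version upgrades. It targets a gap left by existing benchmarks, which mostly cover single-issue bug fixes in Python repositories and report only coarse pass/fail outcomes.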
Findings
Strongest model solves 39.1% of tasks
Weakest model solves 5.2% of tasks
Long-horizon development remains largely unsolved
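To put the percentages above in concrete terms, the sketch below converts them into approximate task counts out of the 115 tasks. It assumes each reported rate is simply the share of tasks counted as solved; the summary does not spell out how a task is scored as solved, and the abstract suggests the benchmark moves beyond coarse pass/fail, so treat the counts as a rough reading rather than figures from the paper. The function names are hypothetical.

```python
# Assumption: each reported rate is (tasks solved) / (total tasks), with 115 tasks in total.
TOTAL_TASKS = 115


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to the reported percentage."""
    return round(rate_percent / 100 * total)


def rate_from_count(count: int, total: int = TOTAL_TASKS) -> float:
    """Return the solve rate in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, rate in [("strongest", 39.1), ("weakest", 5.2)]:
        count = implied_task_count(rate)
        # Recompute the rate from the count to check it matches the reported figure.
        print(f"{label}: about {count} of {TOTAL_TASKS} tasks "
              f"({rate_from_count(count)}% when recomputed)")
```

Under this assumption, the two reported rates correspond to roughly 45 and 6 of the 115 tasks, and both round back to the published percentages.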
Abstract
Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and…
