RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

Xinbo Xu; Ruihan Yang; Haiyang Shen; Wendong Xu; Bofei Gao; Ruoyu Wu; Kean Shi; Weichu Xie; Xuanzhong Chen; Ming Wu; Jason Zeng; Michael Heinrich; Elvis Zhang; Liang Chen; Kuan Li; Baobao Chang

arXiv:2605.15846 · cs.SE · May 20, 2026

TL;DR

RoadmapBench is a benchmark for evaluating AI coding agents on long-horizon, real-world software development tasks drawn from version upgrades in 17 open-source repositories across 5 programming languages. The strongest evaluated model solves only 39.1% of tasks.

Contribution

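The paper introduces RoadmapBench, a benchmark of 115 real-world, long-horizon coding tasks based on open-source version upgrades. It targets a gap left by existing benchmarks, which mostly cover single-issue bug fixes in Python repositories and report only coarse pass/fail outcomes.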

Findings

01  Strongest model solves 39.1% of tasks
02  Weakest model solves 5.2% of tasks
03  Long-horizon development remains largely unsolved
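
As a rough sanity check on the figures above, the reported rates can be converted into task counts. This is a minimal sketch that assumes each rate is a simple fraction of the 115 tasks; the summary does not say how rates are computed, and the helper names `implied_solved` and `consistent` are hypothetical.

```python
TOTAL_TASKS = 115  # benchmark size stated in the summary


def implied_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks matching a reported solve rate.

    Assumption: the rate is solved_tasks / total, with no partial credit.
    """
    return round(rate_percent / 100 * total)


def consistent(rate_percent: float, total: int = TOTAL_TASKS) -> bool:
    """True if the implied count rounds back to the reported one-decimal rate."""
    solved = implied_solved(rate_percent, total)
    return round(solved / total * 100, 1) == rate_percent


for label, rate in [("strongest", 39.1), ("weakest", 5.2)]:
    print(label, implied_solved(rate), consistent(rate))
```

Under that assumption, 39.1% corresponds to 45 of 115 tasks and 5.2% to 6 of 115, and both counts round back to the reported percentages.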

Abstract

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and…
