NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Jingzhe Ding; Shengda Long; Changxin Pu; Huan Zhou; Hongwan Gao; Xiang Gao; Chao He; Yue Hou; Fei Hu; Zhaojian Li; Weiran Shi; Zaiyuan Wang; Daoguang Zan; Chenchen Zhang; Xiaoxu Zhang; Qizhi Chen; Xianfu Cheng; Bo Deng; Qingshui Gu; Kai Hua; Juntao Lin; Pai Liu; Mingchen Li; Xuanguang Pan; Zifan Peng; Yujia Qin; Yong Shan; Zhewen Tan; Weihao Xie; Zihan Wang; Yishuo Yuan; Jiayu Zhang; Enduo Zhao; Yunfei Zhao; He Zhu; Liya Zhu; Chenyang Zou; Ming Ding; Jianpeng Jiao; Jiaheng Liu; Minghao Liu; Qian Liu; Chongyang Tao; Jian Yang; Tong Yang; Zhaoxiang Zhang; Xinjie Chen; Wenhao Huang; Ge Zhang

arXiv:2512.12730 · cs.CL · January 9, 2026

TL;DR

NL2Repo-Bench introduces a new benchmark to evaluate the ability of coding agents to generate complete software repositories from natural language, emphasizing long-horizon reasoning and planning over multiple steps.

Contribution

The paper presents NL2Repo-Bench, a benchmark designed to assess long-horizon repository generation, and uses it to expose the current limitations and failure modes of state-of-the-art coding agents.

Findings

01. Most agents achieve below 40% test pass rate.

02. Long-horizon failure modes include premature termination and loss of coherence.

03. Current models struggle with sustained planning over many interaction steps.
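
To make the "test pass rate" figure in the first finding concrete, here is a minimal sketch of how such a metric could be computed. It assumes each generated repository is run against a fixed test suite and scored by the fraction of tests that pass, with scores then averaged across tasks. The `RepoResult` record, the unweighted averaging, and the sample numbers are all hypothetical illustrations; this page does not describe the paper's actual scoring procedure.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Hypothetical per-task record: test counts for one generated repository."""
    name: str
    passed: int
    total: int


def pass_rate(result: RepoResult) -> float:
    """Fraction of tests passed; a task with no collected tests scores 0.0."""
    if result.total == 0:
        return 0.0
    return result.passed / result.total


def mean_pass_rate(results: list[RepoResult]) -> float:
    """Unweighted mean of per-task pass rates (an assumed aggregation)."""
    if not results:
        return 0.0
    return sum(pass_rate(r) for r in results) / len(results)


# Made-up numbers for illustration only; these are not results from the paper.
example = [
    RepoResult("task_a", passed=30, total=40),
    RepoResult("task_b", passed=0, total=25),  # e.g. the package failed to install
    RepoResult("task_c", passed=10, total=40),
]
score = mean_pass_rate(example)
print(f"mean pass rate: {score:.1%}")
```

Under this assumed scheme, a repository that cannot be installed or imported contributes a score of zero, so a single early failure (such as premature termination) weighs heavily on the average.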

Abstract

Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our…

Taxonomy

Topics: Software Engineering Research · Advanced Software Engineering Methodologies · Scientific Computing and Data Management