RepoZero: Can LLMs Generate a Code Repository from Scratch?
Zhaoxi Zhang, Yiming Xu, Jiahui Liang, Weikang Li, Xiaoshuai Chen, Liwei Qian, Xin Pei, Jizhou Huang, Run Sun, Yunfang Wu

TL;DR
RepoZero is an automated benchmark that uses execution-based verification to evaluate whether LLMs can generate complete software repositories from scratch.
Contribution
This work presents RepoZero, the first fully automated, execution-based benchmark for repository-level code generation, and proposes the ACE framework for iterative test-driven refinement.
Findings
State-of-the-art LLMs achieve only 30-55% pass rates on RepoZero.
RepoZero exposes significant gaps in current LLM capabilities for full repository synthesis.
The ACE framework improves code generation through iterative testing and error correction.
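As a rough illustration of iterative test-driven refinement, the sketch below shows a generic generate, test, and repair loop. It is not the ACE framework itself, whose details are not given in this summary. The names `refine`, `stub_generate`, and `stub_run_tests` are hypothetical, and the stub generator stands in for an LLM call.

```python
from typing import Callable, List, Tuple


def refine(generate: Callable[[str, List[str]], str],
           run_tests: Callable[[str], List[str]],
           spec: str,
           max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Regenerate code until tests pass or the round budget is spent.

    `generate` and `run_tests` are hypothetical callables: the first maps
    (spec, failure messages) to source code, the second maps source code
    to a list of failure messages (empty when everything passes).
    """
    feedback: List[str] = []
    code = generate(spec, feedback)
    for _ in range(max_rounds):
        failures = run_tests(code)
        if not failures:
            return code, []
        # Feed the failures back so the next attempt can correct them.
        feedback = failures
        code = generate(spec, feedback)
    return code, run_tests(code)


def stub_generate(spec: str, feedback: List[str]) -> str:
    # Stand-in for a model: buggy first draft, fixed once feedback arrives.
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def stub_run_tests(code: str) -> List[str]:
    # Execute the candidate source and check a single behavior.
    namespace: dict = {}
    exec(code, namespace)
    return [] if namespace["add"](2, 3) == 5 else ["add(2, 3) != 5"]


final_code, remaining = refine(stub_generate, stub_run_tests, "add two integers")
```

With these stubs the first draft fails, the failure message is passed back, and the second draft passes, so `remaining` ends up empty.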
Abstract
Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale…
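The output-equivalence idea described in the abstract can be sketched as follows. This is a minimal illustration, assuming the original and the re-implemented code can both be called on the same inputs. The function `outputs_match` and the two `slugify` variants are hypothetical and are not taken from RepoZero.

```python
from typing import Any, Callable, Iterable


def outputs_match(reference: Callable[..., Any],
                  candidate: Callable[..., Any],
                  test_inputs: Iterable[tuple]) -> float:
    """Return the fraction of inputs on which candidate reproduces reference."""
    passed = 0
    total = 0
    for args in test_inputs:
        total += 1
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            continue  # a crash in the candidate counts as a mismatch
        if actual == expected:
            passed += 1
    return passed / total if total else 0.0


def reference_slugify(text: str) -> str:
    # Plays the role of the original implementation.
    return "-".join(text.lower().split())


def candidate_slugify(text: str) -> str:
    # Plays the role of a re-implementation; it mishandles repeated spaces.
    return text.lower().replace(" ", "-")


inputs = [("Hello World",), ("a  b",), ("RepoZero",)]
rate = outputs_match(reference_slugify, candidate_slugify, inputs)
```

The check treats both implementations as black boxes: only their outputs are compared, so no human or LLM judge is needed to score the candidate.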
