Commit0: Library Generation from Scratch
Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, Alexander M Rush

TL;DR
Commit0 is a new benchmark for AI code generation that challenges models to create entire libraries from scratch based on specifications and interactive feedback, moving beyond simple code snippets.
Contribution
It introduces a benchmark with interactive feedback in which AI agents generate complex libraries, advancing beyond static code generation tasks.
Findings
Current models can pass some unit tests but cannot fully reproduce entire libraries.
Interactive feedback improves code quality and test pass rates.
The benchmark facilitates development of more capable AI code generation systems.
Abstract
With the goal of benchmarking generative systems beyond expert software development ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests, with the goal of producing an implementation of this API accordingly. The implementation is validated through running these unit tests. As a benchmark, Commit0 is designed to move beyond static one-shot code generation towards agents that must process long-form natural language specifications, adapt to multi-stage feedback, and generate code with complex dependencies. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate. Our experiments demonstrate that while current agents can pass some unit tests, none…
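The interactive setup described in the abstract can be illustrated with a small sketch: a candidate implementation is executed against unit tests, and the failure messages are handed back for the next attempt. This is a minimal illustration under stated assumptions, not Commit0's actual harness. The `clamp` specification, the tests, `stub_agent`, `run_tests`, and `feedback_loop` are all hypothetical names; the stub returns canned drafts in place of a model call, and the lint and static-analysis feedback mentioned in the abstract is omitted.

```python
import unittest
from typing import Callable, List, Tuple

# Hypothetical one-function "library" specification (not taken from Commit0).
SPEC = "clamp(x, lo, hi) returns x limited to the closed interval [lo, hi]."


def make_suite(namespace: dict) -> unittest.TestSuite:
    """Build unit tests that call whatever `clamp` the namespace defines."""

    class ClampTests(unittest.TestCase):
        def test_inside(self):
            self.assertEqual(namespace["clamp"](5, 0, 10), 5)

        def test_below(self):
            self.assertEqual(namespace["clamp"](-3, 0, 10), 0)

        def test_above(self):
            self.assertEqual(namespace["clamp"](42, 0, 10), 10)

    return unittest.defaultTestLoader.loadTestsFromTestCase(ClampTests)


def run_tests(source: str) -> Tuple[int, int, List[str]]:
    """Execute candidate source and return (passed, total, feedback lines)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy setting only; a real harness would sandbox this
    except Exception as exc:
        total = make_suite({}).countTestCases()
        return 0, total, [f"load error: {exc!r}"]
    result = unittest.TestResult()
    make_suite(namespace).run(result)
    problems = result.failures + result.errors
    # Keep the test name and the final traceback line as execution feedback.
    feedback = [
        f"{test.id().split('.')[-1]}: {tb.strip().splitlines()[-1]}"
        for test, tb in problems
    ]
    return result.testsRun - len(problems), result.testsRun, feedback


def stub_agent(spec: str, feedback: List[str], attempt: int) -> str:
    """Stand-in for a model call: returns canned drafts and ignores feedback."""
    drafts = [
        "def clamp(x, lo, hi):\n    return x\n",
        "def clamp(x, lo, hi):\n    return max(lo, x)\n",
        "def clamp(x, lo, hi):\n    return min(hi, max(lo, x))\n",
    ]
    return drafts[min(attempt, len(drafts) - 1)]


def feedback_loop(
    agent: Callable[[str, List[str], int], str], max_rounds: int = 5
) -> List[float]:
    """Iterate draft -> test -> feedback and record the pass rate per round."""
    feedback: List[str] = []
    history: List[float] = []
    for attempt in range(max_rounds):
        source = agent(SPEC, feedback, attempt)
        passed, total, feedback = run_tests(source)
        history.append(passed / total)
        if passed == total:
            break
    return history


if __name__ == "__main__":
    print(feedback_loop(stub_agent))
```

In a real run the agent would condition its next draft on the returned feedback lines; the stub ignores them so that the example stays deterministic.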
Peer Reviews
Decision·ICLR 2025 Poster
1. **Ambitious benchmark even if not for current gen models.** I particularly like that the benchmark pushes beyond current codegen evaluation setups by targeting full library implementations. Although this might seem too difficult a task currently, having such a benchmark could produce interesting solutions in the space (a la swebench), esp. for such long-horizon tasks.
2. **Multiple sources of feedback.** It's nice to see the benchmark itself integrate several sources of feedback such as lint
**Missing actionable insights.** While commit0 tackles a newer task compared to related work like HumanEval, SWEBench, and R2E, I'm not convinced the insights gathered are interesting or new. While the community can attempt to extract more, it would be nice for the paper to suggest a few directions even with primitive experiments. For instance, the diminishing returns on iterating with execution feedback has already been shown in prior work. I'm wondering if there's any evidence that this task w
- The authors curated a new benchmark commit0 that could assess agents' abilities of implementing various python functions. This benchmark could help with evaluating and thus further improving agents' code and repo generation abilities.
- The authors introduced a new framework SDE-I which could assist the repo generation process for agents.
- The authors perform some ablation studies that might provide insights into how agents utilize additional information of specifications/tests
- The authors did not provide comprehensive evaluation of their proposed agent SDE-I on the full benchmark commit0 that they curated. They only provide full results of SDE-I of stage 1 on the full benchmark. Thus, their results might not fully reflect the actual abilities of SDE-I on commit0. It would be good to have the results of all stages of SDE-I on commit0.
- The authors did not evaluate other existing agents on commit0, which makes it less clear how current agents perform on such tasks.
Commit0 is a much more realistic measure of how well an LLM will do software development on a task that matches what real human software developers have to do in their jobs. Including the iterative nature of software development by passing back lint and test errors to the LLM to allow it to attempt further edits and improvements to the code is great. Having a whole set of functions to implement to support a scenario is great to force the LLM to handle the complexity of tasks real human softwar
W0: The computational burden to solve the full Commit0 test set is immense, so the cost to run this benchmark might be prohibitive for many labs to participate. What does it cost currently?
W1: Real developers have to produce specs for the high-level design and shared data structures, and also unit tests and docs for all the private functions they are writing. That sort of thing I guess could be all done inside the SDE-I agent, but the ability for the agent to ask itself, what is the spec for t
Taxonomy
Topics: Recommender Systems and Techniques
