ProgramBench: Can Language Models Rebuild Programs From Scratch?
John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

TL;DR
ProgramBench is a new benchmark that evaluates whether language models can rebuild complete software projects from scratch, given only a program and its documentation. Results reveal current models' limitations in holistic software engineering tasks.
Contribution
The paper introduces ProgramBench, a comprehensive benchmark for assessing language models' capacity to architect and implement entire software systems from scratch.
Findings
No model fully solves any task in the benchmark.
The best model passes 95% of tests on only 3% of tasks.
Models tend to produce monolithic, single-file implementations.
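To make the first two findings concrete, here is a minimal sketch of how such task-level figures could be computed from per-task test pass rates. It assumes that "passes 95% of tests" means a per-task pass rate of at least 0.95, which this summary does not spell out. The function name `summarize` and the example rates are hypothetical and are not results from the paper.

```python
def summarize(pass_rates, threshold=0.95):
    """Return (share of tasks fully solved, share of tasks at or above threshold).

    pass_rates: one value in [0, 1] per task, the fraction of that task's
    behavioral tests the model's implementation passes (assumed definition).
    """
    n = len(pass_rates)
    fully_solved = sum(1 for r in pass_rates if r == 1.0) / n
    near_solved = sum(1 for r in pass_rates if r >= threshold) / n
    return fully_solved, near_solved


# Made-up pass rates for four imaginary tasks, for illustration only.
example_rates = [0.97, 0.60, 0.31, 0.88]
fully, near = summarize(example_rates)
print(f"fully solved: {fully:.0%}, at least 95% of tests: {near:.0%}")
```

Under this reading, a model can clear the 95% bar on a task without fully solving it, which is why the two findings are reported separately.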
Abstract
Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to…
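The idea of an end-to-end behavioral test can be sketched as follows: run the reference executable and the candidate implementation on the same input and compare only what is observable from outside, so the candidate's internal structure is never inspected. This is an illustrative sketch, not the benchmark's harness. The helper names, the choice to compare exit code and stdout, and the tiny stand-in programs are all assumptions.

```python
import subprocess
import sys


def observe(cmd, stdin_text=""):
    """Run a command and return the behavior a black-box test can see."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout


def behaves_the_same(reference_cmd, candidate_cmd, stdin_text=""):
    """True if both programs give the same exit code and stdout for this input."""
    return observe(reference_cmd, stdin_text) == observe(candidate_cmd, stdin_text)


# Hypothetical stand-ins: two differently written programs that upper-case stdin,
# and one that does the wrong thing.
REFERENCE = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
CANDIDATE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
BROKEN = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().lower())"]

print(behaves_the_same(REFERENCE, CANDIDATE, "Hello\n"))
print(behaves_the_same(REFERENCE, BROKEN, "Hello\n"))
```

Because the check looks only at inputs and outputs, the two stand-in programs count as equivalent even though their source differs.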
