CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose

TL;DR
CodeBenchGen is a framework that creates scalable, execution-based code generation benchmarks from real-world code, enabling more comprehensive evaluation of code generation systems across diverse scenarios.
Contribution
It introduces a method that uses a large language model to convert naturally occurring code into execution-based evaluation examples with test cases, extending evaluation beyond code that is easily executable or already has human-written tests.
Findings
Created the Exec-CSN dataset: 1,931 examples involving 293 libraries, converted from code in 367 GitHub repositories taken from CodeSearchNet.
81.3% of the examples are solvable by humans, indicating that the generated tasks are feasible.
Ran code generation experiments on Exec-CSN to demonstrate the framework's utility.
Abstract
To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the…
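Illustration
The abstract's notion of execution-based evaluation can be sketched as follows: a candidate solution counts as correct only if the example's test cases run against it without error. This is a minimal sketch, not the authors' implementation. The function name passes_tests, the timeout value, and the use of a plain subprocess in a temporary directory are all assumptions made for illustration; a subprocess is not a real security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Hypothetical helper: True if test_code runs cleanly against candidate_code.

    Assumption: tests are plain Python statements (e.g. asserts) that raise
    on failure, so a non-zero exit code means the candidate failed.
    """
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "example.py"
        script.write_text(program, encoding="utf-8")
        try:
            # Run in a separate interpreter so crashes or infinite loops in
            # generated code cannot take down the evaluator itself.
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang as a failed example.
            return False
    return result.returncode == 0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    print(passes_tests("def add(a, b):\n    return a + b", tests))
    print(passes_tests("def add(a, b):\n    return a - b", tests))
```

Running the candidate in a separate process with a timeout is one common design choice for this kind of check; a production benchmark would typically add stronger isolation, such as a container with the required libraries installed.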
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Software Testing and Debugging Techniques
