Execution-Based Evaluation for Open-Domain Code Generation
Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig

TL;DR
This paper introduces ODEX, a new open-domain code generation dataset with execution-based evaluation, covering multiple languages and libraries, revealing behavioral differences among top models and aiming to advance research in practical code generation.
Contribution
The paper presents ODEX, the first execution-based open-domain code generation dataset with multilingual support and diverse libraries, enabling more realistic evaluation of code models.
Findings
CODEX outperforms CODEGEN overall
Scaling improves CODEGEN's performance
Open vs. closed domain gaps vary with model size
Abstract
To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports four natural languages as intents, in English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences among top-performing code language models (LM). While CODEX achieves better overall results, CODEGEN improves effectively via scaling -- CODEGEN 6.1B performs comparably with CODEX 12B. Both models show substantial gaps between open and closed domains, but CODEGEN gaps tend to decrease with model size while CODEX gaps increase. We release ODEX…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Topic Modeling · Machine Learning and Data Classification
MethodsTest · CodeGen
