CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, Tao Xie

TL;DR
CoderEval is a benchmark for evaluating code generation models on pragmatic, context-dependent functions drawn from real-world projects. It shows that current models perform markedly worse on non-standalone code than on standalone functions.
Contribution
The paper introduces CoderEval, a benchmark of 460 code generation tasks drawn from real-world projects, together with a platform for assessing how effectively models generate context-dependent code.
Findings
Models perform better on standalone functions than on context-dependent functions.
Current models struggle with non-standalone, pragmatic code generation scenarios.
Analysis suggests leveraging contextual information can improve model performance.
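To make the standalone versus context-dependent distinction concrete, here is a minimal Python sketch. It takes "standalone" to mean what the abstract says: a function that touches only built-in functions and standard libraries. The helper `external_names`, the two sample functions, and the names `db_session`, `User`, and `format_user` are all hypothetical and invented for illustration. This is a rough heuristic, not the classification procedure used by the CoderEval authors.

```python
import ast
import builtins
import sys

# Standard-library module names (available in Python 3.10+). The fallback is a
# tiny illustrative subset, not a complete list.
STDLIB = getattr(sys, "stdlib_module_names", frozenset({"math", "os", "re", "json"}))


def external_names(func_source: str) -> set[str]:
    """Return names a function reads that are not builtins, parameters,
    local assignments, or standard-library modules (heuristic only)."""
    func = ast.parse(func_source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected source containing a single function definition")

    # Names the function can resolve without any project context.
    known = set(dir(builtins)) | {func.name}
    args = func.args
    for arg in args.posonlyargs + args.args + args.kwonlyargs:
        known.add(arg.arg)
    if args.vararg:
        known.add(args.vararg.arg)
    if args.kwarg:
        known.add(args.kwarg.arg)

    loaded = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                known.add(node.id)   # assigned locally
            else:
                loaded.add(node.id)  # read from some scope
        elif isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in STDLIB:
                    known.add(alias.asname or root)
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in STDLIB:
                for alias in node.names:
                    known.add(alias.asname or alias.name)

    # A free name that matches a standard-library module is assumed to be that module.
    return loaded - known - set(STDLIB)


# Hypothetical standalone function: only built-ins and the standard library.
STANDALONE = """
def hypotenuse(a, b):
    return math.sqrt(a * a + b * b)
"""

# Hypothetical context-dependent function: relies on project-level names.
NON_STANDALONE = """
def load_user(user_id):
    row = db_session.query(User).get(user_id)
    return format_user(row)
"""
```

Calling `external_names(STANDALONE)` yields an empty set, while `external_names(NON_STANDALONE)` reports `db_session`, `User`, and `format_user`, the kind of surrounding context a model would need in order to generate the function correctly. The heuristic has blind spots: it does not flag methods that depend on other class members through `self`, for example.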
Abstract
Code generation models based on the pre-training and fine-tuning paradigm have been increasingly pursued by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple benchmarks have been proposed, but they include only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, so evaluating models on standalone functions cannot reflect their effectiveness in pragmatic code generation scenarios. To help bridge this gap, in this paper, we propose a benchmark named CoderEval, consisting of 230 Python and 230…
