CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Hao Yu; Bo Shen; Dezhi Ran; Jiaxin Zhang; Qi Zhang; Yuchi Ma; Guangtai Liang; Ying Li; Qianxiang Wang; Tao Xie

arXiv:2302.00288·cs.SE·February 26, 2024

TL;DR

CoderEval is a benchmark for evaluating code generation models on pragmatic, context-dependent functions drawn from real-world projects. It shows that current models perform significantly worse on non-standalone functions than on standalone ones.

Contribution

The paper introduces CoderEval, a benchmark of 460 code generation tasks taken from real-world projects, together with a platform for assessing how effectively models generate context-dependent code.

Findings

01

Models perform better on standalone functions than on context-dependent functions.

02

Current models struggle with non-standalone, pragmatic code generation scenarios.

03

Analysis suggests leveraging contextual information can improve model performance.

Abstract

Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple existing benchmarks are proposed, including only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, and evaluating models' effectiveness on standalone functions cannot reflect these models' effectiveness on pragmatic code generation scenarios. To help bridge the preceding gap, in this paper, we propose a benchmark named CoderEval, consisting of 230 Python and 230…
