DevEval: Evaluating Code Generation in Practical Software Projects

Jia Li; Ge Li; Yunfei Zhao; Yongmin Li; Zhi Jin; Hao Zhu; Huanyu Liu; Kaibo Liu; Lecheng Wang; Zheng Fang; Lanshen Wang; Jiazheng Ding; Xuanming Zhang; Yihong Dong; Yuqi Zhu; Bin Gu; Mengfei Yang

arXiv:2401.06401·cs.SE·March 7, 2024·2 cites

Open Access

TL;DR

DevEval is a benchmark for evaluating large language models' code generation in realistic software projects. It targets three gaps in earlier benchmarks: unrealistic program distributions, insufficient dependencies, and small-scale project contexts.

Contribution

The paper introduces DevEval, a benchmark of 2,690 samples drawn from 119 practical projects and covering 10 domains, and evaluates five popular LLMs on it.

Findings

01. GPT-3.5-turbo achieves a Pass@1 of 42

02. DevEval reveals current LLMs' limitations in practical code generation

03. Open-sourced benchmark facilitates future research
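
The findings quote a Pass@1 score without saying how it is computed, and this page does not describe the authors' exact procedure. As a minimal sketch, assuming the widely used unbiased pass@k estimator (1 - C(n-c, k) / C(n, k), averaged over problems), the metric could be computed as below. The function names and the toy results are hypothetical and are not taken from DevEval.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: number of attempts the metric allows.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every k-subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random k-subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: four problems, one sample each, two of them pass.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
toy_score = benchmark_pass_at_k(toy_results, k=1)  # 0.5 for this toy data
```

With one sample per problem (n = 1, k = 1), the estimator reduces to the fraction of problems whose single generated solution passes its tests, which is the usual reading of Pass@1.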

Abstract

How to evaluate Large Language Models (LLMs) in code generation is an open question. Many benchmarks have been proposed but are inconsistent with practical software projects, e.g., unreal program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects are still unclear. In this paper, we propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects. DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects and covering 10 domains. Compared to previous benchmarks, DevEval aligns to practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and enough-scale project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual…

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Engineering Techniques and Practices