EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li

TL;DR
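EvoCodeBench is a periodically updated, domain-annotated benchmark for evaluating large language models on code generation. It targets two problems in existing benchmarks, data leakage and the lack of domain-specific evaluation, so that practitioners can pick suitable models for a given programming domain.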
Contribution
The paper introduces EvoCodeBench, a novel evolving benchmark with domain annotations and domain-specific evaluation metrics for more accurate LLM assessment.
Findings
GPT-4's Pass@1 is only 20.74% on EvoCodeBench-2403.
LLM performance varies across programming domains.
StarCoder 2-15B excels in the Database domain, outperforming larger models.
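
The findings above are stated in terms of Pass@1 and per-domain comparisons. As a rough illustration of how such numbers are commonly computed, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and then averages it within each domain label. This summary does not say which estimator or aggregation the paper uses, so both are assumptions here; the function names, the "OtherDomain" label, and the toy results are hypothetical and are not taken from EvoCodeBench.

```python
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_by_domain(results, k):
    """Average per-task pass@k within each domain label.

    results: iterable of (domain, n, c) tuples, one tuple per task.
    Grouping by domain label is an assumed aggregation, not the paper's.
    """
    scores = defaultdict(list)
    for domain, n, c in results:
        scores[domain].append(pass_at_k(n, c, k))
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}


# Made-up toy results for illustration only; not numbers from the paper.
toy_results = [
    ("Database", 10, 3),
    ("Database", 10, 0),
    ("OtherDomain", 10, 5),
]

if __name__ == "__main__":
    print(pass_at_k_by_domain(toy_results, k=1))
```

With k = 1 the estimator reduces to c / n for each task, so a benchmark-level Pass@1 under this sketch is simply the mean fraction of passing samples per task.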
Abstract
How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter hinders practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark - EvoCodeBench, which has the following advances: (1) Evolving data. EvoCodeBench will be dynamically updated every period (e.g., 6 months) to avoid data leakage. This paper releases the first version - EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on the statistics of open-source communities, we design a programming domain taxonomy consisting of 10 popular domains. Based on the taxonomy, we annotate each sample in EvoCodeBench with a…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Software Testing and Debugging Techniques
Methods: Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Attention Is All You Need · Multi-Head Attention · Softmax
