EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

TL;DR
EvoCodeBench is a new evolving benchmark aligned with real-world code repositories, designed to evaluate large language models' coding abilities more accurately and comprehensively, addressing limitations of existing benchmarks.
Contribution
The paper introduces EvoCodeBench, a benchmark aligned with real-world repositories, with comprehensive annotations and an automatic updating pipeline, enabling more realistic evaluation of LLMs in code generation.
Findings
GPT-4 achieves only 20.73% Pass@1 on EvoCodeBench.
Existing LLMs show significant shortcomings in real-world code generation.
EvoCodeBench provides a more realistic and evolving evaluation environment.
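The Pass@1 figure above is an instance of the Pass@k metric. As a minimal sketch, assuming the widely used unbiased estimator (generate n programs per sample, count the c that pass all tests, and estimate the chance that at least one of k drawn programs passes), it could be computed as follows. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not taken from the EvoCodeBench codebase; the paper's exact evaluation settings are not given in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one benchmark sample.

    n: number of programs generated for the sample
    c: number of those programs that pass all test cases
    k: budget of programs considered (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k programs fail, any draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn programs fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over samples, each given as an (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per sample, so a benchmark-level Pass@1 is the average fraction of generated programs that pass their tests.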
Abstract
How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks align poorly with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems. It has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark to avoid data leakage. We build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world…
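The abstract names Recall@k alongside the reference dependencies but this excerpt does not define it. The sketch below shows one plausible reading, stated here as an assumption: for each of k generated programs, compute the fraction of reference dependencies that the program invokes, then keep the best value. The helper names, the representation of dependencies as sets of qualified-name strings, and the handling of an empty reference set are all hypothetical; extracting dependencies from generated code is out of scope here.

```python
from __future__ import annotations


def dependency_recall(predicted: set[str], reference: set[str]) -> float:
    """Fraction of reference dependencies that appear in the prediction."""
    if not reference:
        # Assumption: a sample with no reference dependencies counts as
        # fully recalled; the benchmark may treat this case differently.
        return 1.0
    return len(predicted & reference) / len(reference)


def recall_at_k(candidates: list[set[str]], reference: set[str]) -> float:
    """Best dependency recall among the k candidate programs.

    candidates: one set of invoked dependency names per generated program
    reference:  dependency names used by the reference code
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(dependency_recall(p, reference) for p in candidates)
```

For example, with reference dependencies `{"a.b", "c.d", "e.f"}` and two candidates that invoke `{"a.b"}` and `{"a.b", "c.d"}`, this sketch would report a Recall@2 of 2/3.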
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Software Testing and Debugging Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
