EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

Jia Li; Ge Li; Xuanming Zhang; Yihong Dong; Zhi Jin

arXiv:2404.00599 · cs.CL · April 2, 2024 · 5 cites

TL;DR

EvoCodeBench is a new evolving benchmark aligned with real-world code repositories, designed to evaluate large language models' coding abilities more accurately and comprehensively, addressing limitations of existing benchmarks.

Contribution

The paper introduces EvoCodeBench, a benchmark aligned with real-world repositories, with comprehensive annotations and an automatic updating pipeline, enabling more realistic evaluation of LLMs in code generation.

Findings

01

GPT-4 achieves only 20.73% Pass@1 on EvoCodeBench.

02

Existing LLMs show significant shortcomings in real-world code generation.

03

EvoCodeBench provides a more realistic and evolving evaluation environment.

Abstract

How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark - EvoCodeBench to address the preceding problems, which has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies), and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark to avoid data leakage. We build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version - EvoCodeBench-2403, containing 275 samples from 25 real-world…
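The two metrics named in the abstract can be illustrated with a minimal sketch. The listing does not give their formulas, so both definitions below are assumptions: `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n programs are sampled and c of them pass the tests, and `recall_at_k` takes the best dependency recall among k generated programs. The function names and the empty-reference convention are hypothetical and are not taken from the EvoCodeBench repository.

```python
from math import comb
from typing import Iterable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled programs, c of which pass all tests.

    Assumed formula (the common unbiased estimator):
    1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def recall_at_k(
    reference_deps: Iterable[str],
    generated_deps: Sequence[Iterable[str]],
) -> float:
    """Best dependency recall among k generated programs.

    Assumption: each program's recall is the fraction of reference
    dependencies it uses, and the score is the maximum over the k programs.
    """
    reference = set(reference_deps)
    if not reference:
        # Convention chosen for this sketch only: nothing to recall.
        return 1.0
    return max(
        (len(reference & set(deps)) / len(reference) for deps in generated_deps),
        default=0.0,
    )


if __name__ == "__main__":
    # 10 samples, 3 passing, k = 1.
    print(pass_at_k(n=10, c=3, k=1))
    # Two generated programs; the second uses both reference dependencies.
    print(recall_at_k(["utils.load", "Model.fit"], [["utils.load"], ["utils.load", "Model.fit"]]))
```

With k = 1 the estimator reduces to the fraction of passing samples, which is how a single Pass@1 percentage such as the one reported in the findings is usually read.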

Code & Models

Repositories

seketeam/evocodebench
Official

Datasets

LJ0815/EvoCodeBench
dataset · 232 dl

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Software Testing and Debugging Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing