Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
Chunqiu Steven Xia, Yinlin Deng, Lingming Zhang

TL;DR
EvoEval is a benchmark suite for code generation built by evolving existing coding benchmarks. On it, LLMs show significant performance drops and ranking shifts, which suggests that existing static benchmarks are limited in how well they measure true coding proficiency.
Contribution
The paper presents EvoEval, a program synthesis benchmark suite that uses LLMs to evolve problems from existing code benchmarks into different targeted domains, giving a broader evaluation of LLMs' coding abilities than the original handcrafted problems alone.
Findings
Performance drops of up to 47.7% on EvoEval compared to standard benchmarks.
Significant ranking changes among LLMs when evaluated with EvoEval.
Existing benchmarks may overestimate LLMs' true coding capabilities.
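The two quantities behind these findings, a per-model performance drop and a change in leaderboard position, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the model names, and the scores are all hypothetical, and it assumes the drop is measured relative to the original benchmark score, which this summary does not specify.

```python
from typing import Dict


def relative_drop(base: float, evolved: float) -> float:
    """Percentage drop from the original score to the evolved-benchmark score.

    Assumption: the drop is relative to the original score, not an
    absolute difference in percentage points.
    """
    if base <= 0:
        raise ValueError("base score must be positive")
    return (base - evolved) / base * 100.0


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, highest score first.

    Ties keep the insertion order of the input dict (sorted() is stable).
    """
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(base: Dict[str, float], evolved: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) on the evolved benchmark."""
    base_ranks, evolved_ranks = ranks(base), ranks(evolved)
    return {model: base_ranks[model] - evolved_ranks[model] for model in base}


if __name__ == "__main__":
    # Hypothetical pass rates, invented for illustration only.
    original = {"model_a": 80.0, "model_b": 75.0, "model_c": 70.0}
    evolved = {"model_a": 40.0, "model_b": 60.0, "model_c": 49.0}

    for model in original:
        drop = relative_drop(original[model], evolved[model])
        print(f"{model}: {drop:.1f}% drop")
    print(rank_shifts(original, evolved))
```

In this made-up example the model that leads the original leaderboard has the largest drop and falls to last place, which is the kind of ranking change the findings describe.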
Abstract
LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval -- a program synthesis benchmark suite created by evolving existing benchmarks into different targeted…
Taxonomy
Topics: Rough Sets and Fuzzy Logic · Data Mining Algorithms and Applications
Methods: Sparse Evolutionary Training
