Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

Chunqiu Steven Xia; Yinlin Deng; Lingming Zhang

arXiv:2403.19114 · cs.SE · March 29, 2024 · 3 cites

TL;DR

EvoEval introduces an evolving benchmark suite for code generation, revealing significant performance drops and ranking shifts among LLMs, highlighting limitations of existing static benchmarks in measuring true coding proficiency.

Contribution

The paper presents EvoEval, a novel dynamic benchmarking framework that evolves existing code benchmarks to better evaluate LLMs' coding abilities across diverse and changing problem domains.

Findings

1. Performance drops of up to 47.7% on EvoEval compared to standard benchmarks.

2. Significant ranking changes among LLMs when evaluated with EvoEval.

3. Existing benchmarks may overestimate LLMs' true coding capabilities.
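
The first two findings, a performance drop and a change in ranking, can be illustrated with a small calculation. The sketch below is not taken from the paper or the EvoEval repository: the model names and scores are hypothetical, and it assumes the drop is measured relative to the score on the original benchmark (this summary does not state whether the reported figure is relative or absolute).

```python
from typing import Dict


def relative_drop(base: float, evolved: float) -> float:
    """Percentage drop from the original-benchmark score to the evolved one.

    Assumption: the drop is relative to the original score.
    """
    return 100.0 * (base - evolved) / base


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, with the highest score ranked 1."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Hypothetical pass rates (in percent) for three made-up models.
base_scores = {"model_a": 80.0, "model_b": 75.0, "model_c": 60.0}
evolved_scores = {"model_a": 44.0, "model_b": 54.0, "model_c": 42.0}

# Per-model relative drop when moving to the evolved benchmark.
drops = {m: relative_drop(base_scores[m], evolved_scores[m]) for m in base_scores}

# Positive shift = the model moved up the leaderboard on the evolved benchmark.
base_rank = ranks(base_scores)
evolved_rank = ranks(evolved_scores)
rank_shift = {m: base_rank[m] - evolved_rank[m] for m in base_scores}

for model in base_scores:
    print(f"{model}: drop={drops[model]:.1f}%  rank {base_rank[model]} -> {evolved_rank[model]}")
```

In this made-up data, the model that leads on the original benchmark loses the most and falls to second place, which is the kind of ranking shift the findings describe.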

Abstract

LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval -- a program synthesis benchmark suite created by evolving existing benchmarks into different targeted…

Code & Models

Repositories

evo-eval/evoeval (Official)

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Data Mining Algorithms and Applications

Methods: Sparse Evolutionary Training