PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Ankit Yadav; Himanshu Beniwal; Mayank Singh

arXiv:2401.03855 · cs.CL · July 8, 2024 · 1 citation

TL;DR

PythonSaga is a new benchmark with diverse, balanced Python coding tasks designed to provide a more accurate evaluation of code-generating large language models, revealing their current limitations.

Contribution

The paper introduces PythonSaga, a comprehensive benchmark addressing biases and difficulty imbalances in existing Python code generation evaluations.

Findings

01

Existing benchmarks are biased towards certain concepts.

02

Current models perform poorly on the new benchmark.

03

Existing benchmarks contain a high share of easy tasks, which may inflate estimates of model performance.

Abstract

Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks, potentially inflating model performance estimations. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts on a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs.
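The kind of diversity analysis the abstract describes can be illustrated with a minimal sketch. It assumes each benchmark task has already been assigned one concept label by a human annotator. The function name `concept_coverage`, the concept names, and the sample labels are all hypothetical and are not taken from the paper or its data.

```python
from collections import Counter


def concept_coverage(task_labels, all_concepts):
    """Return the share of tasks per concept and the concepts with no tasks.

    task_labels: one concept label per task (hypothetical annotations).
    all_concepts: the full list of concepts a balanced benchmark should cover.
    """
    counts = Counter(task_labels)
    total = len(task_labels)
    # Share of tasks that fall under each concept (0.0 if there are no tasks).
    shares = {c: (counts[c] / total if total else 0.0) for c in all_concepts}
    # Concepts that no task exercises at all.
    missing = [c for c in all_concepts if counts[c] == 0]
    return shares, missing


# Illustrative, made-up labels: three of four tasks test the same concept.
concepts = ["loops", "recursion", "hashing", "graphs"]
labels = ["loops", "loops", "loops", "recursion"]
shares, missing = concept_coverage(labels, concepts)
```

In this toy input, one concept accounts for most of the tasks and two concepts are never exercised, which is the shape of imbalance the abstract reports for existing benchmarks.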

Videos

PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs · Underline

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Educational Assessment and Pedagogy

Methods: Sparse Evolutionary Training