LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain; King Han; Alex Gu; Wen-Ding Li; Fanjia Yan; Tianjun Zhang; Sida Wang; Armando Solar-Lezama; Koushik Sen; Ion Stoica

arXiv:2403.07974 · cs.SE · June 7, 2024 · 23 cites

TL;DR

LiveCodeBench is a comprehensive, contamination-free benchmark that evaluates large language models on a broad range of coding tasks, including self-repair and execution, using continuously updated problems from multiple contest platforms.

Contribution

It introduces a new holistic evaluation benchmark for LLMs on code, addressing limitations of existing benchmarks and including diverse capabilities beyond code generation.

Findings

01 Identifies contamination issues in existing benchmarks.

02 Provides performance comparisons of 18 base and 34 instruction-tuned LLMs.

03 Highlights potential overfitting in current evaluation methods.

Abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have…
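One way to picture how dated problems can support contamination-free evaluation is a release-date filter: a model is scored only on problems published after its training cutoff. The sketch below is illustrative only. The `Problem` record, its field names, the sample problems, and the cutoff date are all hypothetical and are not taken from the benchmark's actual data format or tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    title: str
    platform: str
    released: date


def problems_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after a model's training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Made-up sample problems inside the May 2023 to May 2024 window from the abstract.
problems = [
    Problem("two-sum-variant", "LeetCode", date(2023, 6, 10)),
    Problem("grid-paths", "AtCoder", date(2023, 11, 4)),
    Problem("tree-queries", "CodeForces", date(2024, 3, 17)),
]

# Assumed training cutoff for some hypothetical model.
cutoff = date(2023, 9, 1)
fresh = problems_after_cutoff(problems, cutoff)
print([p.title for p in fresh])
```

Under these assumptions, the problem released before the cutoff is dropped, so the model is evaluated only on material it could not have seen during training.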

Taxonomy

Topics: Software Engineering Research · Model-Driven Software Engineering Techniques

Methods: Balanced Selection