LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

TL;DR
LiveCodeBench is a comprehensive, contamination-free benchmark that evaluates large language models on a broad range of coding tasks, including self-repair and execution, using continuously updated problems from multiple contest platforms.
Contribution
It introduces a new holistic evaluation benchmark for LLMs on code, addressing limitations of existing benchmarks and including diverse capabilities beyond code generation.
Findings
Identifies contamination issues in existing benchmarks.
Provides performance comparisons of 18 base and 34 instruction-tuned LLMs.
Highlights potential overfitting in current evaluation methods.
Abstract
Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have…
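One way to picture how continuously collected, dated problems support contamination-free evaluation is to keep only the problems published after a model's training cutoff. The sketch below is a minimal illustration under stated assumptions, not the benchmark's actual code: the `Problem` record, its field names, the `released_after` helper, the three sample problems, and the cutoff date are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    title: str
    platform: str
    release_date: date


def released_after(problems, cutoff):
    """Keep only problems published strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Made-up placeholder problems, one per platform named in the abstract.
problems = [
    Problem("two-pointers-warmup", "LeetCode", date(2023, 5, 14)),
    Problem("grid-paths", "AtCoder", date(2023, 11, 4)),
    Problem("interval-merge", "CodeForces", date(2024, 3, 21)),
]

# Assumed training cutoff for an imaginary model.
cutoff = date(2023, 9, 1)
fresh = released_after(problems, cutoff)
print([p.title for p in fresh])
```

Scoring a model only on the `fresh` subset means it cannot have seen those problems during training, provided the assumed cutoff date is accurate.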
Code & Models
- 🤗 google/gemma-3-270m (model) · 83k dl · ♡ 1003
- 🤗 google/gemma-3n-E2B-it-litert-lm (model) · 5.7k dl · ♡ 386
- 🤗 google/gemma-3n-E4B-it-litert-lm (model) · 4.9k dl · ♡ 384
- 🤗 google/gemma-3n-E2B-it (model) · 272k dl · ♡ 290
- 🤗 google/gemma-3-270m-it (model) · 111k dl · ♡ 569
- 🤗 google/gemma-3n-E4B-it-litert-preview (model) · ♡ 1479
- 🤗 google/gemma-3n-E4B (model) · 3.8k dl · ♡ 136
- 🤗 google/gemma-3n-E4B-it (model) · 50k dl · ♡ 890
- 🤗 unsloth/gemma-3n-E2B-it-GGUF (model) · 19k dl · ♡ 60
- 🤗 unsloth/gemma-3-270m-it (model) · 24k dl · ♡ 23
Taxonomy
Topics: Software Engineering Research · Model-Driven Software Engineering Techniques
Methods: Balanced Selection
