LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Zihan Zheng; Zerui Cheng; Zeyu Shen; Shang Zhou; Kaiyuan Liu; Hansen He; Dongruixuan Li; Stanley Wei; Hangyi Hao; Jianzhu Yao; Peiyao Sheng; Zixuan Wang; Wenhao Chai; Aleksandra Korolova; Peter Henderson; Sanjeev Arora; Pramod Viswanath; Jingbo Shang; Saining Xie

arXiv:2506.11928 · cs.SE · June 16, 2025

TL;DR

This paper examines the limitations of current large language models in competitive programming by introducing a new benchmark and an expert-led analysis, showing that the models still lag behind human experts, especially on problems that demand complex reasoning.

Contribution

The paper presents LiveCodeBench Pro, a new benchmark with expert annotations, and provides detailed diagnostics of LLMs' performance gaps in competitive programming.

Findings

01

Without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems

02

The best model scores 0% on hard problems, where expert humans still excel

03

Model performance appears to be driven by implementation skill rather than reasoning

Abstract

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also…
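The pass@1 figures quoted above can be made concrete with a small calculation. The excerpt does not state how the benchmark computes pass@1, so the sketch below assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems; the function names and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled solutions, c of them correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact protocol used by the benchmark.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k draw contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("results must contain at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three problems, 10 samples each, with 10, 5, and 0 correct.
    toy_results = [(10, 10), (10, 5), (10, 0)]
    print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.2f}")
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so a benchmark-level pass@1 is simply the mean of those fractions.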
