Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?

Xiangyang Li; Xiaopeng Li; Kuicai Dong; Quanhu Zhang; Rongju Ruan; Xinyi Dai; Xiaoshuang Liu; Shengchun Xu; Yasheng Wang; Ruiming Tang

arXiv:2506.12713 · cs.SE · October 21, 2025


TL;DR

This paper introduces Humanity's Last Code Exam (HLCE), a benchmark of 235 problems drawn from the ICPC World Finals and the IOI (2010-2024). Even the strongest reasoning LLMs solve only a small fraction of them, which points to substantial room for improvement in reasoning and code generation.

Contribution

The paper presents HLCE, a new challenging benchmark for code generation based on top programming contests, along with a reproducible evaluation framework and insights into LLMs' capabilities and self-awareness.

Findings

01

Even the strongest reasoning LLMs, o4-mini(high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4% on HLCE, respectively.

02

LLMs' self-recognition abilities are not strongly correlated with code performance.

03

Test-time scaling laws suggest significant room for improvement in LLMs' complex programming skills.

Abstract

Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPS and LiveCodeBench) contain questions of medium-level difficulty and pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC World Finals) and the International Olympiad in Informatics (IOI), spanning 2010-2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini(high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of…
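
The pass@1 rates quoted in the abstract measure how often a single sampled solution passes all tests for a problem. The sketch below shows one common way to compute pass@k from n samples per problem, using the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of passing samples. This is an assumption for illustration: the abstract does not say whether HLCE computes its scores exactly this way, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Toy data (not from the paper): two problems, five samples each.
    toy_results = [(5, 1), (5, 0)]
    print(benchmark_pass_at_k(toy_results, k=1))
```

For k = 1 the estimator reduces to c / n per problem, so a benchmark-level pass@1 is simply the mean fraction of passing samples across problems.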


Code & Models

Repositories

humanity-s-last-code-exam/hlce (Official)


Taxonomy

Topics: Comparative and International Law Studies · Artificial Intelligence in Law