AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Zihan Wang; Jiaze Chen; Zhicheng Liu; Markus Mak; Yidi Du; Geonsik Moon; Luoqi Xu; Aaron Tua; Kunshuo Peng; Jiayi Lu; Mingfei Xia; Boqian Zou; Chenyang Ran; Guang Tian; Shoutai Zhu; Yeheng Duan; Zhenghui Kang; Zhenxing Lin; Shangshu Li; Qiang Luo; Qingshen Long; Zhiyong Chen; Yihan Xiao; Yurong Wu; Daoguang Zan; Yuyi Fu; Mingxuan Wang; Ming Ding

arXiv:2508.16402 · cs.SE · August 25, 2025

TL;DR

AetherCode is a new, more challenging benchmark for evaluating LLMs' programming skills. It addresses the limitations of earlier benchmarks by drawing difficult problems from top competitions and pairing them with high-quality test cases.

Contribution

The paper presents AetherCode, a benchmark that pairs harder problems from premier programming contests with expert-validated test suites, so that LLMs' coding abilities can be assessed more reliably.

Findings

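01. AetherCode offers a more rigorous evaluation framework than existing competitive programming benchmarks.

02. On AetherCode, LLMs show a larger gap to human experts than existing benchmarks suggest.

03. The benchmark is positioned as a new standard for future code reasoning research.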

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem…


Code & Models

Datasets

m-a-p/AetherCode
dataset · 552 downloads
