When Elo Lies: Hidden Biases in Codeforces-Based Evaluation of Large Language Models
Shenyu Zheng, Ximing Dong, Xiaoshuang Liu, Gustavo Oliva, Chong Chun Yong, Dayi Lin, Boyuan Chen, Shaowei Wang, Ahmed E. Hassan

TL;DR
This paper critically examines the biases and unreliability of Codeforces-based Elo ratings for evaluating large language models, revealing significant sensitivities to experimental conditions and advocating for standardized evaluation protocols.
Contribution
It systematically analyzes hidden factors affecting Elo scores, demonstrating their high sensitivity and instability, and highlights the need for transparent, standardized evaluation methods.
Findings
Elo scores vary by up to 1,122 points due to contest selection.
Submission order can shift scores by approximately 394 points.
Run-to-run variability can reach 349 points under identical conditions.
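To give a sense of why such swings are possible, here is a minimal sketch of how an Elo-style performance rating can be backed out of a contest rank. It uses the standard Elo logistic win probability and is not the paper's actual pipeline. Codeforces applies further adjustments that are omitted here. The function names, the bisection bounds, and all ratings in the usage note are hypothetical.

```python
def win_probability(r_a: float, r_b: float) -> float:
    """Standard Elo logistic: probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def expected_rank(rating: float, opponent_ratings: list[float]) -> float:
    """Expected rank = 1 + expected number of opponents finishing ahead."""
    return 1.0 + sum(win_probability(r, rating) for r in opponent_ratings)


def performance_rating(rank: float, opponent_ratings: list[float],
                       lo: float = 0.0, hi: float = 4000.0, iters: int = 60) -> float:
    """Bisect for the rating whose expected rank matches the observed rank.

    Assumption: the answer lies in [lo, hi]; these bounds are illustrative.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > rank:
            lo = mid  # expected rank is worse than observed, so the rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def spread(ratings: list[float]) -> float:
    """Max-minus-min gap across repeated runs or evaluation settings."""
    return max(ratings) - min(ratings)
```

For example, calling performance_rating(2, [1200, 1500, 1800]) and performance_rating(3, [1200, 1500, 1800]) shows how a one-place change in rank moves the estimate, and spread applied to the ratings from several repeated runs gives the kind of gap reported in the findings. Because the estimate depends on both the observed rank and the opponent field, the choice of contests and the handling of submissions feed directly into the final number.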
Abstract
As Large Language Models (LLMs) achieve breakthroughs in complex reasoning, Codeforces-based Elo ratings have emerged as a prominent metric for evaluating competitive programming capabilities. However, these ratings are often reported without critical experimental details, leading to significant discrepancies, as illustrated by recent reports in which the score of the same model version fluctuated by nearly 500 points. This paper presents a systematic empirical study of the hidden factors biasing Elo evaluations: (1) the temporal ordering of submissions, (2) contest difficulty selection, and (3) run-to-run stochastic variability of LLMs. Utilizing a controlled benchmark of 37 recent Codeforces contests and 13,691 generated test cases, we demonstrate that Elo scores are highly sensitive to these parameters. Our findings reveal that varying submission orders can shift scores by 394 points, while…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI · Text Readability and Simplification
