Rigor, Reliability, and Reproducibility Matter: A Decade-Scale Survey of 572 Code Benchmarks
Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung

TL;DR
This paper surveys 572 code benchmarks spanning 2014-2025, identifies gaps between growing awareness of benchmark quality and actual practice, and argues for rigorous, reliable, and reproducible benchmarks, supported by the HOW2BENCH guideline and a human study.
Contribution
It provides a comprehensive decade-scale survey of code benchmarks, identifies quality gaps, and introduces the HOW2BENCH guideline to improve benchmark standards.
Findings
Growing neglect of code coverage in recent benchmarks
Significant effort needed to improve benchmark quality
Lack of awareness about benchmark importance
Abstract
Code-related benchmarks play a critical role in evaluating large language models (LLMs), yet their quality fundamentally shapes how the community interprets model capabilities. In the past few years, awareness of benchmark quality has grown. Yet, after a decade-scale (2014-2025) survey of 572 code benchmarks, we observed a lag between growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total count accumulated across the previous ten years. In response, we take a clear position: code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce a code benchmark guideline, HOW2BENCH, with 55 checklists. Finally, our further human study also exposed that the current issues not…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Computational and Text Analysis Methods
Methods: Sparse Evolutionary Training
