Rigor, Reliability, and Reproducibility Matter: A Decade-Scale Survey of 572 Code Benchmarks

Jialun Cao; Yuk-Kit Chan; Zixuan Ling; Wenxuan Wang; Shuqing Li; Mingwei Liu; Ruixi Qiao; Yuting Han; Chaozheng Wang; Boxi Yu; Pinjia He; Shuai Wang; Zibin Zheng; Michael R. Lyu; Shing-Chi Cheung

arXiv:2501.10711·cs.SE·February 10, 2026

TL;DR

This paper surveys 572 code benchmarks published over a decade (2014-2025), finds that benchmark practice lags behind growing awareness of benchmark quality, and argues for rigorous construction, reliable evaluation, and reproducible release, supported by a new guideline (HOW2BENCH) and a human study on awareness.

Contribution

It provides a comprehensive decade-scale survey of code benchmarks, identifies quality gaps, and introduces the HOW2BENCH guideline to improve benchmark standards.

Findings

01 Growing neglect of code coverage in recent benchmarks

02 Significant effort needed to improve benchmark quality

03 Lack of awareness about benchmark importance
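
To make the code coverage finding concrete, the sketch below measures which lines of a reference solution a benchmark's test inputs actually execute. It is an illustration only, not the survey's methodology: `reference_solution`, `line_coverage`, and both test suites are hypothetical names and data, and the measurement relies on Python's standard `sys.settrace` and `dis.findlinestarts`.

```python
import dis
import sys


def reference_solution(x):
    # Hypothetical benchmark task: return the absolute value of x.
    if x < 0:
        return -x
    return x


def line_coverage(func, test_inputs):
    """Fraction of func's body lines executed by at least one test input."""
    code = func.__code__
    # Lines that carry bytecode; the `def` line itself is excluded.
    body_lines = {ln for _, ln in dis.findlinestarts(code) if ln is not None}
    body_lines.discard(code.co_firstlineno)

    hit = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if frame.f_code is code and event == "line":
            hit.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for value in test_inputs:
            func(value)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook

    return len(hit & body_lines) / len(body_lines)


weak_suite = [1, 2, 3]    # hypothetical tests: never take the negative branch
fuller_suite = [1, -1]    # hypothetical tests: exercise both branches
print(line_coverage(reference_solution, weak_suite))
print(line_coverage(reference_solution, fuller_suite))
```

The first suite never runs the `return -x` line, so a generated solution could get that branch wrong and still pass. Reporting coverage alongside the test cases makes such gaps visible.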

Abstract

Code-related benchmarks play a critical role in evaluating large language models (LLMs), yet their quality fundamentally shapes how the community interprets model capabilities. In the past few years, awareness of benchmark quality has grown. Yet, after a decade-scale (2014-2025) survey of 572 code benchmarks, we observed a lag between growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total count accumulated across the previous ten years. In response, we take a clear position: code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce a code benchmark guideline, HOW2BENCH, with 55 checklists. Finally, our further human study also exposed that the current issues not…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Computational and Text Analysis Methods

Methods: Sparse Evolutionary Training