Questionable practices in machine learning
Gavin Leech, Juan J. Vazquez, Niclas Kupper, Misha Yagudin, Laurence Aitchison

TL;DR
This paper identifies and discusses 44 questionable practices in machine learning research, especially in evaluating large language models, highlighting issues that undermine result integrity and reproducibility.
Contribution
It provides a list of questionable research practices (QRPs) and irreproducible research practices, emphasizing the need for improved evaluation standards in ML research.
Findings
Identification of 44 questionable practices
Highlighting challenges in LLM evaluation on benchmarks
Discussion of irreproducible research issues
Abstract
Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.
Taxonomy
Topics: Machine Learning and Data Classification
