Questionable practices in machine learning

Gavin Leech; Juan J. Vazquez; Niclas Kupper; Misha Yagudin; Laurence Aitchison

arXiv:2407.12220·cs.LG·October 31, 2024·2 cites

Open Access

TL;DR

This paper identifies and discusses 44 questionable practices in machine learning research, especially in evaluating large language models, highlighting issues that undermine result integrity and reproducibility.

Contribution

It catalogues questionable research practices (QRPs) and irreproducible research practices, and argues that ML research needs better evaluation standards.

Findings

01. Identification of 44 questionable practices

02. Highlighting challenges in LLM evaluation on benchmarks

03. Discussion of irreproducible research issues

Abstract

Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.

Taxonomy

Topics: Machine Learning and Data Classification