Leakage and the Reproducibility Crisis in ML-based Science
Sayash Kapoor, Arvind Narayanan

TL;DR
This paper highlights the widespread issue of data leakage in ML-based science, demonstrating its impact on reproducibility and proposing model info sheets to improve methodological transparency and error detection.
Contribution
It provides a fine-grained taxonomy of eight leakage types, shows how leakage causes reproducibility failures, and introduces model info sheets as a tool for better reporting and error prevention.
Findings
Data leakage is widespread: the survey finds errors in 17 fields, collectively affecting 329 papers.
Leakage has led to severe reproducibility failures, in some cases producing wildly overoptimistic conclusions.
Model info sheets are proposed as a way to detect and prevent leakage errors.
Abstract
The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can…
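To make the idea of an overoptimistic result concrete, here is a minimal sketch of one widely known textbook form of leakage: choosing features with the full dataset (test rows included) before a held-out evaluation. It is an illustration only and is not taken from the paper; this summary does not enumerate the eight leakage types, and every function name and parameter below is hypothetical. The data is pure noise, so an honest accuracy estimate should sit near 50%.

```python
import random

def make_noise_dataset(n_samples, n_features, rng):
    """Features are independent of the labels, so true accuracy is 50%."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, alternating labels
    return X, y

def top_features(X, y, rows, k):
    """Pick the k features with the largest class-mean gap on `rows` only."""
    ones = [i for i in rows if y[i] == 1]
    zeros = [i for i in rows if y[i] == 0]

    def gap(j):
        m1 = sum(X[i][j] for i in ones) / len(ones)
        m0 = sum(X[i][j] for i in zeros) / len(zeros)
        return abs(m1 - m0)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_accuracy(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train`, score it on `test`."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for i in test:
        d0 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c1))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == y[i])
    return correct / len(test)

def run_trials(n_trials=20, n_samples=40, n_features=1000, k=10, seed=0):
    """Average held-out accuracy with leaky vs. train-only feature selection."""
    rng = random.Random(seed)
    leaky, clean = [], []
    for _ in range(n_trials):
        X, y = make_noise_dataset(n_samples, n_features, rng)
        all_rows = range(n_samples)
        train = range(0, n_samples // 2)          # rows are i.i.d. noise,
        test = range(n_samples // 2, n_samples)   # so no shuffle is needed
        # Leaky: feature selection sees the test rows as well.
        leaky_feats = top_features(X, y, all_rows, k)
        leaky.append(centroid_accuracy(X, y, train, test, leaky_feats))
        # Clean: feature selection sees only the training rows.
        clean_feats = top_features(X, y, train, k)
        clean.append(centroid_accuracy(X, y, train, test, clean_feats))
    return sum(leaky) / n_trials, sum(clean) / n_trials

leaky_acc, clean_acc = run_trials()
print(f"leaky selection: {leaky_acc:.2f}  train-only selection: {clean_acc:.2f}")
```

Under these assumptions, the leaky estimate should land well above chance even though there is nothing to learn, while the train-only estimate should stay close to 50%. The only difference between the two runs is which rows the feature-selection step is allowed to see.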
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications
Methods: INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors · Logistic Regression
