
TL;DR
This study systematically evaluates four types of data leakage in machine learning, revealing that selection leakage significantly inflates performance metrics, while normalization leakage is negligible.
Contribution
It provides empirical evidence ranking leakage types by severity, challenging the traditional emphasis on normalization leakage and showing that selection leakage matters more.
Findings
About 90% of the measured selection-leakage effect is noise exploitation that inflates reported scores.
Normalization leakage has negligible impact.
Memorization leakage scales with model capacity, and selection leakage matters most at practical dataset sizes.
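The selection-leakage finding can be illustrated with a minimal Python sketch of seed cherry-picking. This is not the study's protocol: the function `noisy_score`, the 0.70 true accuracy, the 200-example test set, and the 50 seeds are hypothetical values chosen only for illustration. The simulated model has the same skill under every seed, so any gap between the best seed and the average is pure noise.

```python
import random
import statistics


def noisy_score(seed, n_test=200, true_accuracy=0.70):
    """Simulate the test accuracy of one training run (hypothetical setup).

    Each test example is classified correctly with probability
    `true_accuracy`; the seed changes only the noise, not the skill.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < true_accuracy for _ in range(n_test))
    return hits / n_test


# Evaluate the same "model" under 50 different seeds.
scores = [noisy_score(seed) for seed in range(50)]

honest = statistics.mean(scores)  # report the average over all seeds
cherry_picked = max(scores)       # report only the luckiest seed

print(f"mean over seeds: {honest:.3f}")
print(f"best seed:       {cherry_picked:.3f}")
print(f"inflation:       {cherry_picked - honest:+.3f}")
```

Reporting `cherry_picked` instead of `honest` is the kind of noise exploitation the findings describe: the reported number improves although the model does not.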
Abstract
Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce . Class II (selection: peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
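The d_z values quoted above are effect sizes for a within-subject design. A minimal sketch of how such a statistic is conventionally computed follows, assuming the abstract uses the standard paired definition (mean of the per-dataset differences divided by their sample standard deviation). The helper name `cohens_dz` and all scores below are hypothetical, invented for illustration, and are not data from the study.

```python
import statistics


def cohens_dz(leaky, clean):
    """Paired effect size: mean of the differences over their sample SD.

    Assumes both lists hold scores for the same datasets in the same order.
    """
    diffs = [a - b for a, b in zip(leaky, clean)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Hypothetical per-dataset scores with and without leakage (illustration only).
leaky_scores = [0.81, 0.77, 0.90, 0.68, 0.85]
clean_scores = [0.79, 0.76, 0.86, 0.69, 0.82]

dz = cohens_dz(leaky_scores, clean_scores)
print(f"d_z = {dz:.2f}")
```

Because each dataset serves as its own control, d_z standardizes by the spread of the differences rather than by the spread of the raw scores, so a small but consistent inflation can still yield a large effect size.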
