
TL;DR
This paper introduces a formal grammar with constraints and a terminal assessment gate to structurally prevent data leakage in machine learning workflows, supported by empirical landscape analysis and reference implementations.
Contribution
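The paper presents a grammar and a call-time enforcement mechanism that make the most damaging types of data leakage structurally unrepresentable in ML workflows. The constraints are grounded in a landscape study of 2,047 datasets, and reference implementations are available in Python and R.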
Findings
Data leakage was identified in 648 published machine learning papers across 30 scientific fields
The constraints are grounded in measured effect sizes from 2,047 datasets
Two reference implementations in Python and R are provided
Abstract
Data leakage has been identified in 648 published machine learning papers across 30 scientific fields. The knowledge to prevent it exists; the tools do not enforce it. This paper presents a grammar - eight typed primitives, a directed acyclic graph, and four hard constraints - that makes the most damaging leakage types structurally unrepresentable. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary in an ML framework, backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.
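To make the idea of a call-time-enforced evaluate/assess boundary concrete, here is a minimal Python sketch. It is not the paper's reference implementation: the names `GatedWorkflow`, `LeakageError`, `evaluate`, and `assess` are hypothetical. The assumed semantics (validation scoring may repeat, test scoring happens exactly once and closes the workflow) are one plausible reading of "terminal assessment gate", not something the abstract specifies.

```python
class LeakageError(RuntimeError):
    """Raised when a call would cross the evaluate/assess boundary (hypothetical)."""


class GatedWorkflow:
    """Hypothetical wrapper: validation scoring is repeatable, test scoring is terminal."""

    def __init__(self, score_fn, validation, test):
        self._score_fn = score_fn
        self._validation = validation
        self._test = test
        self._assessed = False  # flips to True after the single permitted assess()

    def evaluate(self, model):
        """Score on validation data; allowed any number of times before assess()."""
        if self._assessed:
            raise LeakageError("evaluate() called after assess(): workflow is closed")
        return self._score_fn(model, self._validation)

    def assess(self, model):
        """Score on held-out test data exactly once, then close the workflow."""
        if self._assessed:
            raise LeakageError("assess() may be called only once")
        self._assessed = True
        return self._score_fn(model, self._test)


def accuracy(model, data):
    """Fraction of examples where the model's prediction matches the label."""
    xs, ys = data
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def threshold_model(x):
    """Toy classifier used only for illustration."""
    return int(x > 0.5)


workflow = GatedWorkflow(
    accuracy,
    validation=([0.2, 0.9, 0.4, 0.7], [0, 1, 0, 1]),
    test=([0.1, 0.8, 0.6, 0.3], [0, 1, 0, 0]),
)
val_score = workflow.evaluate(threshold_model)  # repeatable during model selection
test_score = workflow.assess(threshold_model)   # terminal: the gate closes here
```

In this sketch the check runs when the method is called, not in a later audit, so a second `assess()` or a post-assessment `evaluate()` raises `LeakageError` instead of silently letting test-set performance feed back into model selection.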
