A Grammar of Machine Learning Workflows

Simon Roth

arXiv:2603.10742 · cs.LG · April 7, 2026

TL;DR

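This paper introduces a grammar for machine learning workflows (eight typed primitives, a directed acyclic graph, and four hard constraints) with a terminal assessment gate that makes the most damaging types of data leakage structurally unrepresentable. It is supported by a landscape study across 2,047 datasets and by reference implementations in Python and R.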

Contribution

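It presents a grammar and a call-time enforcement mechanism that make the most damaging leakage types structurally unrepresentable in ML workflows. The constraints are grounded in a companion landscape study, and reference implementations are available in Python and R.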

Findings

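01 Data leakage has been identified in 648 published machine learning papers across 30 scientific fields.

02 The grammar's constraints are grounded in effect sizes measured in a companion landscape study across 2,047 datasets.

03 Two reference implementations are available, one in Python and one in R.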

Abstract

Data leakage has been identified in 648 published machine learning papers across 30 scientific fields. The knowledge to prevent it exists; the tools do not enforce it. This paper presents a grammar (eight typed primitives, a directed acyclic graph, and four hard constraints) that makes the most damaging leakage types structurally unrepresentable. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary in an ML framework, backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.
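The abstract does not show what a call-time-enforced evaluate/assess boundary looks like in code. The sketch below is one possible reading, not the paper's API. It assumes that `evaluate` may be called repeatedly on validation data, that `assess` may be called exactly once on held-out test data, and that nothing may run after it. The names `Workflow`, `GateError`, `evaluate`, `assess` and `is_positive` are hypothetical and chosen only for illustration.

```python
class GateError(RuntimeError):
    """Raised when a call would illegally cross the evaluate/assess boundary."""


class Workflow:
    """Hypothetical workflow with a validation split and a sealed test split."""

    def __init__(self, valid, test):
        self._valid = valid      # list of (input, label) pairs
        self._test = test        # list of (input, label) pairs
        self._assessed = False   # becomes True after the single assess() call

    def _accuracy(self, model, rows):
        hits = sum(1 for x, y in rows if model(x) == y)
        return hits / len(rows)

    def evaluate(self, model):
        """Repeatable scoring on validation data; refused once assess() has run."""
        if self._assessed:
            raise GateError("workflow is closed: assess() has already run")
        return self._accuracy(model, self._valid)

    def assess(self, model):
        """One-shot scoring on test data; closes the workflow (assumed semantics)."""
        if self._assessed:
            raise GateError("assess() may be called only once")
        self._assessed = True
        return self._accuracy(model, self._test)


def is_positive(x):
    """Toy stand-in for a fitted model."""
    return x > 0


wf = Workflow(
    valid=[(1, True), (-2, False), (3, False)],
    test=[(5, True), (-1, False)],
)
valid_score = wf.evaluate(is_positive)  # may be repeated while tuning
test_score = wf.assess(is_positive)     # terminal: the gate closes here

try:
    wf.assess(is_positive)              # a second look at the test set
    second_call_blocked = False
except GateError:
    second_call_blocked = True
```

The check happens when the method is called, not in documentation or convention, so a second look at the test set fails loudly instead of silently inflating the reported score. The specification in the paper may define the boundary differently, for example through typed primitives in the workflow graph.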
