From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
Zhiwei Li, Carl Kesselman, Tran Huy Nguyen, Benjamin Yixing Xu, Kyle Bolo, Kimberley Yu

TL;DR
This paper presents a data-centric framework for improving reproducibility and transparency in collaborative machine learning projects by formalizing data, features, workflows, and decisions, demonstrated through a clinical glaucoma detection case study.
Contribution
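It introduces a structured, lifecycle-aware approach built on six artifacts (Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary) that formalize the relationships between data, code, and decisions, improving reproducibility in collaborative ML workflows.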
Findings
Supports iterative exploration and decision tracking
Enhances reproducibility and provenance preservation
Demonstrated effectiveness in a clinical ML use case
Abstract
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative…
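To make the six artifacts concrete, the following is a minimal Python sketch of how they might relate to one another. It is an illustration only, not the paper's schema or implementation: every field name, the `trace` helper, the `build_example` function, and all example values (dataset name, versions, checksum, file name) are hypothetical assumptions introduced here. The sketch shows one plausible reading of the abstract, in which an Execution ties a versioned Workflow to its input Datasets and output Assets, so that an output can later be traced back to what produced it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class VocabularyTerm:
    """A term from a controlled vocabulary (hypothetical shape)."""
    vocabulary: str
    name: str


@dataclass(frozen=True)
class Dataset:
    """A named, versioned collection of record identifiers."""
    name: str
    version: str
    members: Tuple[str, ...]


@dataclass(frozen=True)
class Feature:
    """A value attached to one record, drawn from a controlled vocabulary."""
    record_id: str
    name: str
    value: VocabularyTerm


@dataclass(frozen=True)
class Workflow:
    """A versioned piece of code; code_ref is an assumed pointer to it."""
    name: str
    version: str
    code_ref: str


@dataclass(frozen=True)
class Asset:
    """A file produced or consumed by an execution."""
    name: str
    checksum: str


@dataclass
class Execution:
    """One run of a workflow, linking inputs to outputs."""
    execution_id: str
    workflow: Workflow
    inputs: List[Dataset]
    outputs: List[Asset] = field(default_factory=list)


def trace(asset_name: str, executions: List[Execution]) -> Optional[dict]:
    """Find the first execution that produced the named asset.

    Returns the workflow and input dataset versions, or None if no
    execution lists the asset among its outputs.
    """
    for ex in executions:
        if any(a.name == asset_name for a in ex.outputs):
            return {
                "execution": ex.execution_id,
                "workflow": (ex.workflow.name, ex.workflow.version),
                "inputs": [(d.name, d.version) for d in ex.inputs],
            }
    return None


def build_example() -> List[Execution]:
    """Assemble a toy glaucoma-style record set (all values are made up)."""
    label = VocabularyTerm(vocabulary="diagnosis", name="suspected_glaucoma")
    images = Dataset("fundus_images", "2.0.0", ("img-001", "img-002"))
    # A Feature annotates one record with a controlled-vocabulary value.
    _annotation = Feature(record_id="img-001", name="diagnosis", value=label)
    train = Workflow("train_classifier", "1.2.0", code_ref="hypothetical-commit-id")
    model = Asset("glaucoma_model.pt", checksum="hypothetical-checksum")
    return [Execution("exec-1", train, inputs=[images], outputs=[model])]
```

Calling `trace("glaucoma_model.pt", build_example())` returns a dictionary naming the execution, the workflow version, and the input dataset versions, which is the kind of provenance question the framework is meant to answer. Freezing the descriptive artifacts while leaving Execution mutable is a design choice of this sketch, reflecting that a run accumulates outputs while the things it references stay fixed.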
Taxonomy
Topics: Scientific Computing and Data Management · Cell Image Analysis Techniques · Research Data Management Practices
Methods: High-Order Consensuses · Fragmentation
