Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models
Zhiwei Li, Carl Kesselman, Mike D'Arch, Michael Pazzani, Benjamin, Yizing Xu

TL;DR
This paper introduces Deriva-ML, a FAIRness-based data management approach that enhances reproducibility and quality in machine learning models for eScience by applying data-centric principles throughout the ML lifecycle.
Contribution
It presents an architecture and tools for applying FAIR data principles to ML workflows, improving reproducibility and data quality in collaborative eScience projects.
Findings
FAIR data management improves ML reproducibility.
Tools enable better data handling across the ML lifecycle.
Use cases demonstrate practical benefits in eScience investigations.
Abstract
Increasingly, artificial intelligence (AI) and machine learning (ML) are used in eScience applications [9]. While these approaches have great potential, the literature has shown that ML-based approaches frequently suffer from results that are either incorrect or unreproducible due to mismanagement or misuse of data used for training and validating the models [12, 15]. Recognition of the necessity of high-quality data for correct ML results has led to data-centric ML approaches that shift the central focus from model development to creation of high-quality data sets to train and validate the models [14, 20]. However, there are limited tools and methods available for data-centric approaches to explore and evaluate ML solutions for eScience problems, which often require collaborative multidisciplinary teams working with models and data that will rapidly evolve as an investigation unfolds…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices
