Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

Zhiwei Li; Carl Kesselman; Mike D'Arcy; Michael Pazzani; Benjamin Yizing Xu

arXiv:2407.01608 · cs.LG · July 3, 2024

Open Access

TL;DR

This paper introduces Deriva-ML, a FAIRness-based data management approach that enhances reproducibility and quality in machine learning models for eScience by applying data-centric principles throughout the ML lifecycle.

Contribution

It presents an architecture and tools for applying FAIR data principles to ML workflows, improving reproducibility and data quality in collaborative eScience projects.

Findings

01 FAIR data management improves ML reproducibility.

02 Tools enable better data handling across the ML lifecycle.

03 Use cases demonstrate practical benefits in eScience investigations.

Abstract

Increasingly, artificial intelligence (AI) and machine learning (ML) are used in eScience applications [9]. While these approaches have great potential, the literature has shown that ML-based approaches frequently suffer from results that are either incorrect or unreproducible due to mismanagement or misuse of data used for training and validating the models [12, 15]. Recognition of the necessity of high-quality data for correct ML results has led to data-centric ML approaches that shift the central focus from model development to creation of high-quality data sets to train and validate the models [14, 20]. However, there are limited tools and methods available for data-centric approaches to explore and evaluate ML solutions for eScience problems which often require collaborative multidisciplinary teams working with models and data that will rapidly evolve as an investigation unfolds…

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices
