Workflow Provenance in the Lifecycle of Scientific Machine Learning

Renan Souza; Leonardo G. Azevedo; Vítor Lourenço; Elton Soares; Raphael Thiago; Rafael Brandão; Daniel Civitarese; Emilio Vital Brazil; Marcio Moreno; Patrick Valduriez; Marta Mattoso; Renato Cerqueira; Marco A. S. Netto

arXiv:2010.00330·cs.DB·August 26, 2021

TL;DR

This paper introduces a workflow provenance approach for scientific machine learning, enabling comprehensive, scalable, and low-overhead data analysis to improve reproducibility and understanding in multidisciplinary domains.

Contribution

It presents a novel provenance-based framework with a W3C PROV data model, lifecycle characterization, and architecture, validated through a large-scale Oil & Gas HPC case study.
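As a rough illustration of what a W3C PROV style data model makes possible, the sketch below records entities (data), activities (steps), and the standard PROV relations `used` and `wasGeneratedBy`, then walks them backwards from a trained model. It is not the paper's framework or schema: the `ProvGraph` class, the node names, and every attribute value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A PROV-style node: an entity (data) or an activity (a step)."""
    id: str
    kind: str  # "entity" or "activity"
    attrs: dict = field(default_factory=dict)


class ProvGraph:
    """Hypothetical in-memory store for PROV-style records and relations."""

    def __init__(self):
        self.records = {}
        self.used = []       # (activity_id, entity_id): activity used entity
        self.generated = []  # (entity_id, activity_id): entity wasGeneratedBy activity

    def add(self, id, kind, **attrs):
        self.records[id] = Record(id, kind, attrs)

    def lineage(self, entity_id):
        """Walk wasGeneratedBy and used edges backwards from an entity."""
        seen, stack = [], [entity_id]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.append(node)
            stack += [a for e, a in self.generated if e == node]
            stack += [e for a, e in self.used if a == node]
        return [self.records[n] for n in seen]


# Illustrative (made-up) workflow: domain data -> preprocessing -> training -> model.
g = ProvGraph()
g.add("seismic_raw", "entity", domain="geoscience")
g.add("preprocess", "activity")
g.add("tiles", "entity", tile_size=64)
g.add("train", "activity", learning_rate=0.001, epochs=10)
g.add("model", "entity")
g.used += [("preprocess", "seismic_raw"), ("train", "tiles")]
g.generated += [("tiles", "preprocess"), ("model", "train")]

lineage = g.lineage("model")
ids = [r.id for r in lineage]

# One pass over the lineage returns ML attributes and domain attributes together.
merged = {k: v for r in lineage for k, v in r.attrs.items()}
```

Because domain data and ML steps live in one graph, a single traversal answers a question that spans both, such as which hyperparameters and which source dataset led to a given model.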

Findings

01. Supports integrated domain and ML queries with low overhead

02. Achieves high scalability and query acceleration

03. Enables better reproducibility and data understanding in scientific ML

Abstract

Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV…
