Workflow Provenance in the Lifecycle of Scientific Machine Learning
Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

TL;DR
This paper introduces a workflow provenance approach for scientific machine learning, enabling comprehensive, scalable, and low-overhead data analysis to improve reproducibility and understanding in multidisciplinary domains.
Contribution
It presents a novel provenance-based framework with a W3C PROV data model, lifecycle characterization, and architecture, validated through a large-scale Oil & Gas HPC case study.
Findings
Supports integrated domain and ML queries with low overhead
Achieves high scalability and query acceleration
Enables better reproducibility and data understanding in scientific ML
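To make the idea of an integrated domain and ML query concrete, here is a minimal Python sketch. It is not the paper's framework or data model: it borrows only the W3C PROV notions of entity, activity, `used`, and `wasGeneratedBy`. The class names, attribute names (`survey`, `learning_rate`, `accuracy`), and all values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A PROV-style entity (data) or activity (task) with free-form attributes."""
    node_id: str
    kind: str  # "entity" or "activity"
    attrs: dict = field(default_factory=dict)


class ProvStore:
    """Hypothetical in-memory store; a real system would use a database."""

    def __init__(self):
        self.nodes = {}
        self.used = []              # (activity_id, entity_id) pairs
        self.was_generated_by = []  # (entity_id, activity_id) pairs

    def add(self, node_id, kind, **attrs):
        self.nodes[node_id] = Node(node_id, kind, attrs)
        return self.nodes[node_id]

    def models_with_context(self, min_accuracy):
        """Join ML data (accuracy, learning rate) with domain data (survey)."""
        rows = []
        for entity_id, activity_id in self.was_generated_by:
            model = self.nodes[entity_id]
            if model.attrs.get("accuracy", 0.0) < min_accuracy:
                continue
            training = self.nodes[activity_id]
            # Follow the "used" relation back to the training inputs.
            inputs = [self.nodes[e] for a, e in self.used if a == activity_id]
            rows.append({
                "model": model.node_id,
                "learning_rate": training.attrs.get("learning_rate"),
                "datasets": [d.attrs.get("survey") for d in inputs],
            })
        return rows


# Illustrative records only; none of these values come from the paper.
store = ProvStore()
store.add("seismic_a", "entity", survey="hypothetical_survey_A")
store.add("train_1", "activity", learning_rate=0.01)
store.add("train_2", "activity", learning_rate=0.1)
store.add("model_1", "entity", accuracy=0.91)
store.add("model_2", "entity", accuracy=0.62)
store.used += [("train_1", "seismic_a"), ("train_2", "seismic_a")]
store.was_generated_by += [("model_1", "train_1"), ("model_2", "train_2")]

result = store.models_with_context(min_accuracy=0.8)
print(result)
```

The query walks from each trained model back through the activity that generated it to the datasets that activity used, so a single answer combines a model metric, a hyperparameter, and a domain attribute.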
Abstract
Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to meet critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV…
