Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering
Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

TL;DR
This paper introduces PROV-ML, a new provenance data representation tailored for the ML lifecycle in CSE, enabling better tracking and querying of data provenance across complex workflows with low overhead.
Contribution
It provides a detailed characterization of provenance data in CSE ML workflows, proposes PROV-ML built on W3C PROV and ML Schema, and extends existing systems for improved provenance capture and querying.
Findings
Provenance data can be effectively represented with PROV-ML.
The extended system supports provenance queries with a standard vocabulary.
Evaluation shows the approach scales with 48 GPUs in parallel.
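To make the idea of provenance capture and querying concrete, here is a minimal sketch in plain Python. It borrows only the W3C PROV notions of entities, activities, and the "used" and "wasGeneratedBy" relations. It is not the PROV-ML schema or the authors' system: the `ProvStore` class, its methods, and the workflow names (`data_curation`, `training`, `raw_data`, `training_set`, `model_v1`) are hypothetical, chosen for illustration.

```python
from collections import defaultdict


class ProvStore:
    """Toy in-memory store for W3C PROV-style relations (hypothetical, not PROV-ML)."""

    def __init__(self):
        self.used = defaultdict(set)  # activity -> entities it consumed ("used")
        self.generated_by = {}        # entity -> activity that produced it ("wasGeneratedBy")
        self.attrs = {}               # activity -> recorded attributes (e.g. hyperparameters)

    def record(self, activity, inputs, outputs, **attrs):
        """Record one activity together with its input and output entities."""
        self.attrs[activity] = attrs
        for entity in inputs:
            self.used[activity].add(entity)
        for entity in outputs:
            self.generated_by[entity] = activity

    def lineage(self, entity):
        """Return every upstream entity that the given entity was derived from."""
        seen, stack = set(), [entity]
        while stack:
            current = stack.pop()
            activity = self.generated_by.get(current)
            if activity is None:
                continue  # raw input: no recorded activity produced it
            for source in self.used[activity]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Two hypothetical workflows in the lifecycle: data curation, then model training.
store = ProvStore()
store.record("data_curation", inputs=["raw_data"], outputs=["training_set"])
store.record("training", inputs=["training_set"], outputs=["model_v1"],
             learning_rate=0.01, epochs=10)

# Provenance query: which data does the trained model ultimately depend on?
upstream = store.lineage("model_v1")
```

Because both workflows write to the same store, the lineage query crosses the boundary between data curation and training, which is the kind of integrated question the paper argues provenance tracking in the ML lifecycle should be able to answer.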
Abstract
Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that…
