Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Renan Souza; Leonardo Azevedo; Vítor Lourenço; Elton Soares; Raphael Thiago; Rafael Brandão; Daniel Civitarese; Emilio Vital Brazil; Marcio Moreno; Patrick Valduriez; Marta Mattoso; Renato Cerqueira; Marco A. S. Netto

arXiv:1910.04223 · cs.DC · October 22, 2019

TL;DR

This paper introduces PROV-ML, a new provenance data representation tailored to the machine learning (ML) lifecycle in Computational Science and Engineering (CSE). It supports tracking and querying provenance across the lifecycle's multiple workflows while keeping capture overhead low.

Contribution

The paper provides a detailed characterization of provenance data in the ML lifecycle in CSE, proposes PROV-ML, a representation built on top of W3C PROV and ML Schema, and extends an existing system to improve provenance capture and querying.
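To make the idea of lifecycle provenance concrete, here is a minimal sketch in Python. It is not PROV-ML itself: it only borrows the W3C PROV core notions of entities, activities, and the `used` / `wasGeneratedBy` relations, and every identifier and attribute in it (`raw_domain_data`, `training_set`, `model_v1`, `learning_rate`, `gpus`, the `ProvGraph` class) is a hypothetical name chosen for illustration. It shows how recording those two relations is enough to answer a lineage question such as "which data did this model ultimately come from?".

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    """Tiny in-memory provenance store (hypothetical; not the paper's system)."""
    entities: dict = field(default_factory=dict)          # id -> attributes
    activities: dict = field(default_factory=dict)        # id -> attributes
    used: list = field(default_factory=list)              # (activity, entity)
    was_generated_by: list = field(default_factory=list)  # (entity, activity)

    def entity(self, eid, **attrs):
        self.entities[eid] = attrs

    def activity(self, aid, **attrs):
        self.activities[aid] = attrs

    def lineage(self, eid):
        """Return every entity that `eid` transitively depends on."""
        seen, stack = set(), [eid]
        while stack:
            current = stack.pop()
            for ent, act in self.was_generated_by:
                if ent != current:
                    continue
                # Every input of the generating activity is an ancestor.
                for act2, src in self.used:
                    if act2 == act and src not in seen:
                        seen.add(src)
                        stack.append(src)
        return seen


g = ProvGraph()
# Domain data and ML data are recorded in the same graph (assumed names).
g.entity("raw_domain_data", kind="domain_data")
g.entity("training_set", kind="ml_data")
g.entity("model_v1", kind="model", learning_rate=0.01)
g.activity("preprocess")
g.activity("train", gpus=4)
g.used += [("preprocess", "raw_domain_data"), ("train", "training_set")]
g.was_generated_by += [("training_set", "preprocess"), ("model_v1", "train")]

model_sources = g.lineage("model_v1")
```

Calling `g.lineage("model_v1")` walks back from the model through the training activity to the training set, and from there through preprocessing to the raw domain data. A real deployment would persist these records and query them with a standard vocabulary rather than Python lists.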

Findings

1. Provenance data can be effectively represented with PROV-ML.

2. The extended system supports provenance queries with a standard vocabulary.

3. Evaluation shows the approach scales with 48 GPUs in parallel.

Abstract

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that…
