Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

Sheeba Samuel; Frank Löffler; Birgitta König-Ries

arXiv:2006.12117·cs.LG·June 23, 2020

TL;DR

This paper addresses the reproducibility crisis in machine learning by proposing methods to enhance provenance tracking, applying FAIR data principles, and demonstrating a tool to improve reproducibility in ML workflows.

Contribution

It introduces approaches for end-to-end reproducibility in ML pipelines, emphasizing provenance, FAIR data practices, and the use of ProvBook for capturing experiment provenance.

Findings

01

ProvBook helps capture and compare ML experiment provenance.

02

The paper proposes ways to apply FAIR data practices to ML workflows to improve reproducibility.

03

Preliminary results suggest the approach supports the reproducibility of ML experiments run in Jupyter Notebooks.

Abstract

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that often is not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing provenance of ML experiments and their reproducibility using Jupyter Notebooks.
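
To make the idea of capturing and comparing the provenance of experiment runs concrete, here is a minimal sketch in Python. It is not ProvBook's implementation or API. The function names `run_with_provenance` and `compare_runs`, the fields recorded, and the convention that a snippet stores its output in a variable named `result` are all assumptions made for illustration.

```python
import hashlib
import json
import time


def run_with_provenance(source: str, env: dict) -> dict:
    """Execute a code snippet and return a provenance record for that run.

    Hypothetical helper: the recorded fields are illustrative only.
    """
    started = time.time()
    exec(source, env)  # run the snippet (a stand-in for a notebook cell)
    ended = time.time()
    return {
        # Hash of the source, so two runs can be checked for identical code.
        "source_sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "started": started,
        "duration_s": ended - started,
        # Assumption: the snippet leaves its output in a variable `result`.
        "result": json.dumps(env.get("result"), sort_keys=True),
    }


def compare_runs(first: dict, second: dict) -> dict:
    """Report whether source and result match between two recorded runs."""
    return {
        "same_source": first["source_sha256"] == second["source_sha256"],
        "same_result": first["result"] == second["result"],
    }


# A seeded snippet: rerunning it should reproduce the same result.
cell = (
    "import random\n"
    "random.seed(42)\n"
    "result = [random.random() for _ in range(3)]"
)
run_a = run_with_provenance(cell, {})
run_b = run_with_provenance(cell, {})
comparison = compare_runs(run_a, run_b)
```

In this sketch, two runs of the seeded snippet compare as identical in both source and result. Removing the `random.seed(42)` line would keep the source hash stable across runs while the results diverge, which is the kind of difference a provenance comparison is meant to surface.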
