Provenance and data differencing for workflow reproducibility analysis
Paolo Missier, Simon Woodman, Hugo Hiden, Paul Watson

TL;DR
This paper introduces a new framework and algorithm, PDIFF, for analyzing workflow reproducibility by comparing provenance traces and identifying divergence points, supported by a cloud-based implementation.
Contribution
It presents a novel algorithm for provenance comparison that identifies specific divergence points and supports semantic data comparison, enhancing reproducibility analysis.
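To make the idea of a divergence point concrete, the following is a minimal Python sketch and not the authors' PDIFF algorithm. It assumes a hypothetical trace format (a dictionary of `Node` records holding a service name, a version, upstream inputs, and the produced value) and a hypothetical `values_match` function that stands in for semantic data comparison. It walks two traces backwards from their outputs in lockstep and reports where they stop agreeing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One step of a provenance trace (hypothetical structure, for illustration)."""
    service: str          # name of the service or workflow block
    version: str          # version of the service that ran
    inputs: tuple = ()    # keys of upstream nodes, in port order
    value: object = None  # data produced by this step


def values_match(a, b, tol=1e-9):
    """Stand-in for semantic comparison: numbers are equal within a tolerance."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= tol
    return a == b


def find_divergence(trace_a, trace_b, key_a, key_b, path=()):
    """Return a list of (path, reason) pairs explaining why two outputs differ."""
    a, b = trace_a[key_a], trace_b[key_b]
    here = path + (a.service,)
    if values_match(a.value, b.value):
        return []  # the data agrees here, so nothing upstream needs explaining
    if a.service != b.service:
        return [(here, "different service")]
    if len(a.inputs) != len(b.inputs):
        return [(here, "different number of inputs")]
    found = []
    if a.version != b.version:
        found.append((here, f"version {a.version} vs {b.version}"))
    # Keep walking upstream: the inputs may also have changed.
    for in_a, in_b in zip(a.inputs, b.inputs):
        found += find_divergence(trace_a, trace_b, in_a, in_b, here)
    if not found:
        found.append((here, "outputs differ although inputs and version match"))
    return found


# Two runs of the same three-step workflow; only the middle service was updated.
run_1 = {
    "load": Node("load", "1.0", (), [1, 2, 3]),
    "norm": Node("normalise", "1.0", ("load",), 0.5),
    "out": Node("report", "1.0", ("norm",), "r-0.5"),
}
run_2 = {
    "load": Node("load", "1.0", (), [1, 2, 3]),
    "norm": Node("normalise", "1.1", ("load",), 0.6),
    "out": Node("report", "1.0", ("norm",), "r-0.6"),
}
result = find_divergence(run_1, run_2, "out", "out")
```

In this example `result` holds a single entry that points at the `normalise` step and its version change, because the `load` step produced identical data in both runs. A realistic comparison function would depend on the data type (for example, comparing models or documents rather than plain numbers).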
Findings
PDIFF effectively detects differences in workflow provenance.
The framework clarifies various meanings of reproducibility.
Implementation demonstrates practical utility in cloud environments.
Abstract
One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. The reproduction of results is often not straightforward, as the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of "reproducibility". Secondly, it describes a new algorithm, PDIFF, that uses a comparison of workflow provenance traces to determine whether an experiment has been…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Distributed and Parallel Computing Systems
