Building Containerized Environments for Reproducibility and Traceability of Scientific Workflows
Paula Olaya, Jay Lofstead, and Michela Taufer

TL;DR
This paper presents a containerized system that enhances reproducibility and traceability of scientific workflows by automatically capturing provenance metadata and building record trails with minimal overhead.
Contribution
It introduces a novel container-based environment that annotates workflows and captures provenance metadata, improving trust in simulation results.
Findings
Effective provenance metadata collection with low overhead
Built-in record trails improve transparency and reproducibility
Applicable to various scientific workflows
Abstract
Scientists rely on simulations to study natural phenomena. Trusting the simulation results is vital to advancing science in any field. One approach to building trust is to ensure the reproducibility and traceability of simulations through system-level annotation of executions and the generation of record trails of data moving through the simulation workflow. In this work, we present a system-level solution that leverages the intrinsic characteristics of containers (i.e., portability, isolation, encapsulation, and unique identifiers). Our solution consists of a containerized environment capable of annotating workflows, capturing provenance metadata, and building record trails. We assess our environment on four different workflows and measure containerization costs in terms of time and space. Our solution, built with a tolerable time and space overhead, enables transparent and…
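The idea of a record trail can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `content_id`, `StepRecord`, and `RecordTrail` are hypothetical, and a SHA-256 digest of each data artifact stands in for the unique identifiers that containers provide. Each workflow step is annotated with the identifiers of its inputs and outputs, so an output can later be traced back through the steps that produced it.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def content_id(data: bytes) -> str:
    """Assumption: a SHA-256 digest serves as the artifact's unique identifier."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


@dataclass
class StepRecord:
    """Hypothetical annotation for one workflow step."""
    step: str
    inputs: list
    outputs: list


@dataclass
class RecordTrail:
    """Hypothetical record trail: an ordered list of step annotations."""
    records: list = field(default_factory=list)

    def annotate(self, step: str, inputs: list, outputs: list) -> StepRecord:
        # Store identifiers rather than the data itself.
        rec = StepRecord(
            step,
            [content_id(d) for d in inputs],
            [content_id(d) for d in outputs],
        )
        self.records.append(rec)
        return rec

    def lineage(self, artifact_id: str) -> list:
        """Walk backwards from an artifact to the steps that produced it."""
        steps, seen, frontier = [], set(), [artifact_id]
        while frontier:
            current = frontier.pop()
            for rec in self.records:
                if current in rec.outputs and rec.step not in seen:
                    seen.add(rec.step)
                    steps.append(rec.step)
                    frontier.extend(rec.inputs)
        return steps

    def to_json(self) -> str:
        return json.dumps([asdict(r) for r in self.records], indent=2)


# Toy two-step workflow: the byte strings stand in for real simulation data.
raw = b"raw input"
cleaned = raw.upper()
result = cleaned[::-1]

trail = RecordTrail()
trail.annotate("preprocess", [raw], [cleaned])
trail.annotate("simulate", [cleaned], [result])
print(trail.lineage(content_id(result)))
```

Storing identifiers instead of the data keeps the trail small, which is one plausible way to keep space overhead low; the paper's own cost measurements are not reproduced here.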
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Research Data Management Practices
