PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems
Runzhou Han, Mai Zheng, Suren Byna, Houjun Tang, Bin Dong, Dong Dai, Yong Chen, Dongkyun Kim, Joseph Hassoun, David Thorsley, Matthew Wolf

TL;DR
PROV-IO+ is a cross-platform provenance framework for scientific data on HPC systems. It captures data lineage with less than 3.5% tracking overhead in the reported experiments, supports both containerized and non-containerized workflows, and outperforms the state-of-the-art system ProvLake.
Contribution
The paper introduces PROV-IO+, a novel I/O-centric provenance model and framework that enables end-to-end provenance support across different HPC platforms with minimal manual effort.
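To make the idea of an I/O-centric provenance model more concrete, the following is a minimal sketch of how a record could tie together a data object, the I/O operation applied to it, and the environment in which it ran. It is an illustration only and not the PROV-IO+ implementation or API: the names `IORecord`, `TrackedFile`, `writers_of`, and `demo` are hypothetical, and wrapping Python file handles is an assumption chosen for brevity.

```python
import os
import platform
import tempfile
import time
from dataclasses import dataclass


@dataclass
class IORecord:
    """One hypothetical provenance record: data object, I/O operation, environment."""
    data_object: str   # path of the file that was touched
    operation: str     # "open", "read", "write", or "close"
    nbytes: int        # characters/bytes moved by this operation (0 for open/close)
    program: str       # logical name of the workflow step doing the I/O
    user: str          # environment: who ran it
    host: str          # environment: where it ran
    timestamp: float   # when the operation was recorded


class TrackedFile:
    """Thin wrapper around a file handle that appends an IORecord per operation."""

    def __init__(self, path, mode, log, program="unknown"):
        self._path = path
        self._log = log
        self._program = program
        self._fh = open(path, mode)
        self._record("open", 0)

    def _record(self, operation, nbytes):
        self._log.append(IORecord(
            data_object=self._path,
            operation=operation,
            nbytes=nbytes,
            program=self._program,
            user=os.environ.get("USER", "unknown"),
            host=platform.node(),
            timestamp=time.time(),
        ))

    def write(self, data):
        written = self._fh.write(data)
        self._record("write", written)
        return written

    def read(self, size=-1):
        data = self._fh.read(size)
        self._record("read", len(data))
        return data

    def close(self):
        self._fh.close()
        self._record("close", 0)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


def writers_of(log, path):
    """Lineage query: which programs wrote to the given data object?"""
    return sorted({r.program for r in log
                   if r.data_object == path and r.operation == "write"})


def demo():
    """Run a two-step toy workflow and return (log, path of the data product)."""
    log = []
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "result.txt")
    with TrackedFile(path, "w", log, program="simulate") as f:
        f.write("abc")
    with TrackedFile(path, "r", log, program="analyze") as f:
        f.read()
    return log, path
```

Calling `demo()` yields a log from which a question such as "which step produced this file?" can be answered with `writers_of(log, path)`. A real framework would capture this information transparently below the application, rather than through an explicit wrapper as assumed here.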
Findings
PROV-IO+ achieves less than 3.5% tracking overhead in experiments.
It effectively supports both containerized and non-containerized workflows.
PROV-IO+ outperforms the state-of-the-art system ProvLake.
Abstract
Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for describing scientific data and the associated I/O operations and environments precisely. Moreover, we build a prototype of PROV-IO+ to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework can…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Distributed and Parallel Computing Systems
