Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability
Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva

TL;DR
This paper introduces MIDA, a lightweight runtime approach for multi-workflow data integration in scientific discovery, leveraging data observability and provenance to enable efficient, cross-facility analysis with minimal overhead.
Contribution
MIDA provides a novel, adaptable framework for real-time multi-workflow data integration using data observability without instrumentation, suitable for heterogeneous scientific environments.
Findings
Successfully integrated data from Dask and MLflow in a distributed deep learning case study.
Achieved near-zero overhead for up to 100,000 tasks on 1,680 CPU cores.
Demonstrated scalability on the Summit supercomputer with up to 276 GPUs.
Abstract
Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in…
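The adapter idea mentioned in the abstract can be illustrated with a minimal sketch. This is not MIDA's implementation, and it does not use the Dask or MLflow APIs. The event shapes, the `ProvenanceRecord` schema, and the names `scheduler_adapter`, `tracker_adapter`, and `Integrator` are all hypothetical, chosen only to show how observed events from two different tools might be normalized into one queryable provenance view without modifying the tools' own code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ProvenanceRecord:
    """Hypothetical common schema that every adapter maps its events into."""
    workflow_id: str
    task_id: str
    source: str
    used: Dict[str, Any] = field(default_factory=dict)
    generated: Dict[str, Any] = field(default_factory=dict)


def scheduler_adapter(event: Dict[str, Any]) -> ProvenanceRecord:
    # Assumed event shape for a task scheduler; not a real tool's format.
    return ProvenanceRecord(
        workflow_id=event["run"],
        task_id=event["key"],
        source="scheduler",
        used=event.get("args", {}),
        generated={"result": event.get("result")},
    )


def tracker_adapter(event: Dict[str, Any]) -> ProvenanceRecord:
    # Assumed event shape for an experiment tracker; not a real tool's format.
    return ProvenanceRecord(
        workflow_id=event["experiment"],
        task_id=event["run_id"],
        source="tracker",
        used=event.get("params", {}),
        generated=event.get("metrics", {}),
    )


class Integrator:
    """Collects observed events from several tools into one record list."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[Dict[str, Any]], ProvenanceRecord]] = {}
        self.records: List[ProvenanceRecord] = []

    def register(self, source: str, adapter: Callable[[Dict[str, Any]], ProvenanceRecord]) -> None:
        self._adapters[source] = adapter

    def observe(self, source: str, event: Dict[str, Any]) -> ProvenanceRecord:
        # Translate the tool-specific event into the common schema and store it.
        record = self._adapters[source](event)
        self.records.append(record)
        return record

    def by_workflow(self, workflow_id: str) -> List[ProvenanceRecord]:
        # Integrated view: all records for one workflow, regardless of source tool.
        return [r for r in self.records if r.workflow_id == workflow_id]


integrator = Integrator()
integrator.register("scheduler", scheduler_adapter)
integrator.register("tracker", tracker_adapter)
integrator.observe("scheduler", {"run": "wf-1", "key": "train-0",
                                 "args": {"lr": 0.01}, "result": "model.pt"})
integrator.observe("tracker", {"experiment": "wf-1", "run_id": "train-0",
                               "params": {"lr": 0.01}, "metrics": {"loss": 0.42}})
joined = integrator.by_workflow("wf-1")
```

In this sketch, supporting a further tool only requires registering one more adapter function; the integrated query in `by_workflow` stays unchanged.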
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
Methods: Adapter
