HyProv: Hybrid Provenance Management for Scientific Workflows
Vasilis Bountris, Lauritz Thamsen, Ulf Leser

TL;DR
HyProv is a hybrid provenance management system designed for scientific workflows that combines centralized and federated approaches to enable scalable, real-time, and workflow-aware provenance queries with low latency.
Contribution
HyProv introduces a novel hybrid architecture that manages and queries workflow provenance data at scale, integrating centralized and federated components for improved performance.
Findings
Scales to large workflows with low latency
Answers provenance queries with sub-second response times
Adds modest CPU and memory overhead
Abstract
Provenance plays a crucial role in scientific workflow execution, for instance by providing data for failure analysis, real-time monitoring, or statistics on resource utilization for right-sizing allocations. The workflows themselves, however, are becoming increasingly complex in terms of the components involved. Furthermore, they are executed on distributed cluster infrastructures, which makes the real-time collection, integration, and analysis of provenance data challenging. Existing provenance systems struggle to balance scalability, real-time processing, online provenance analytics, and integration across different components and compute resources. Moreover, most provenance solutions are not workflow-aware; by focusing on arbitrary workloads, they miss opportunities for workflow systems where optimization and analysis can exploit the availability of a workflow specification that dictates, to…
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
