Efficiently Processing Workflow Provenance Queries on SPARK
Rajmohan C, Pranay Lohia, Himanshu Gupta, Siddhartha Brahma, Mauricio Hernandez, Sameep Mehta

TL;DR
This paper presents a novel framework leveraging Spark to efficiently process fine-grained workflow provenance queries on large datasets by computing and partitioning weakly connected components, enabling real-time query responses.
Contribution
The paper introduces a weakly connected component based framework for provenance query processing that significantly improves efficiency on large-scale workflow provenance data.
Findings
Answers provenance queries in real time on graphs with up to 500 million nodes and edges.
Outperforms naive approaches in processing large-scale provenance data.
Effectively partitions large components into weakly connected sets for faster query processing.
Abstract
In this paper, we investigate how to leverage the Spark platform to efficiently process provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel weakly connected component based framework that is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and further partitions the large components into a collection of weakly connected sets, exploiting the workflow dependency graph to do so effectively. We…
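The core idea in the abstract (label the provenance graph by weakly connected component, then answer a query by touching only the component that contains the queried attribute-value) can be illustrated with a small single-machine sketch. This is plain Python, not Spark, and it is not the authors' implementation: the function names, the edge-list representation, and the toy node ids are all assumptions made for illustration, and the further splitting of large components into weakly connected sets is not shown.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Map each node to a component id via union-find, ignoring edge direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for src, dst in edges:
        root_s, root_d = find(src), find(dst)
        if root_s != root_d:
            parent[root_s] = root_d
    return {node: find(node) for node in list(parent)}


def partition_edges(edges, component_of):
    """Group edges by the component they belong to (hypothetical storage layout)."""
    by_component = defaultdict(list)
    for src, dst in edges:
        by_component[component_of[src]].append((src, dst))
    return by_component


def lineage(node, component_of, by_component):
    """Return every ancestor of `node`, scanning only the edges of its component."""
    comp = component_of.get(node)
    if comp is None:
        return set()  # node never appears in the provenance graph
    parents = defaultdict(list)
    for src, dst in by_component[comp]:
        parents[dst].append(src)
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for p in parents[current]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen


# Toy provenance graph: an edge (u, v) means value v was derived from value u.
edges = [("a1", "b1"), ("b1", "c1"), ("a2", "c1"), ("x1", "y1")]
component_of = weakly_connected_components(edges)
by_component = partition_edges(edges, component_of)

print(sorted(lineage("c1", component_of, by_component)))  # ['a1', 'a2', 'b1']
```

In this sketch a query for `c1` never reads the `x1 -> y1` edge, because that edge lives in a different component. Restricting the scan this way is what keeps the volume of data processed per query small.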
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Data Quality and Management
