Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization
Hui Miao, Amol Deshpande

TL;DR
This paper introduces graph segmentation and summarization operators to improve understanding and querying of complex, evolving provenance graphs in data science platforms, enabling more efficient and insightful analysis.
Contribution
It proposes novel high-level graph query operators with efficient algorithms for segmentation and summarization of provenance graphs, addressing their verbosity and evolution.
Findings
Segmentation operator efficiently queries derivation relationships.
Summarization operator effectively combines similar segments.
Proposed methods outperform existing approaches in speed and effectiveness.
Abstract
Increasingly, modern data science platforms have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the "provenance graph" is verbose, evolving, and hard to understand; with a standard graph query model, it is difficult to compose queries and make use of this valuable information. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation…
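To make the idea of querying derivation relationships over a versioned provenance graph concrete, here is a minimal Python sketch. It is not the paper's segmentation operator or algorithm. It is a naive illustration that assumes a toy graph whose vertex names (data_v1, train_run_2, and so on) and helper functions are hypothetical. Each edge points from a derived vertex back to what it was derived from, and the query returns every vertex that lies on some derivation path between a chosen source and destination.

```python
from collections import deque

# Hypothetical toy provenance graph (all names are illustrative assumptions).
# Each edge is (derived_vertex, origin_vertex): the first was derived from,
# generated by, or used the second.
EDGES = [
    ("model_v1", "train_run_1"),
    ("train_run_1", "data_v1"),
    ("train_run_1", "train.py_v1"),
    ("model_v2", "train_run_2"),
    ("train_run_2", "data_v2"),
    ("train_run_2", "train.py_v1"),
    ("data_v2", "clean_run_1"),
    ("clean_run_1", "data_v1"),
]


def build_adjacency(edges):
    """Return (toward_origins, toward_derived) adjacency maps."""
    toward_origins, toward_derived = {}, {}
    for derived, origin in edges:
        toward_origins.setdefault(derived, set()).add(origin)
        toward_derived.setdefault(origin, set()).add(derived)
    return toward_origins, toward_derived


def reachable(start, adjacency):
    """Breadth-first search: all vertices reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def derivation_vertices(source, destination, edges):
    """Vertices on some derivation path leading from source to destination.

    A vertex qualifies if the destination was (transitively) derived from it
    and it was (transitively) derived from the source, or is one of the two.
    """
    toward_origins, toward_derived = build_adjacency(edges)
    ancestors_of_destination = reachable(destination, toward_origins)
    descendants_of_source = reachable(source, toward_derived)
    return ancestors_of_destination & descendants_of_source


if __name__ == "__main__":
    print(sorted(derivation_vertices("data_v1", "model_v2", EDGES)))
```

In this toy graph, asking how model_v2 derives from data_v1 keeps the cleaning and training steps in between and drops train.py_v1, which model_v2 depends on but which was not itself derived from data_v1. An unrelated pair, such as data_v2 as source and model_v1 as destination, yields an empty set.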
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Data Quality and Management
