OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs [Technical Report]
Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan, Khaled Ibrahim, Konstantinos Karanasos, Jesús Camacho-Rodríguez, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan

TL;DR
OneProvenance is a system that efficiently extracts coarse-grained provenance from database query logs, significantly reducing overhead and noise, and is already deployed at scale by Microsoft for improved data governance.
Contribution
It introduces novel event transformations and filtering techniques for log-based provenance extraction, achieving up to 18X performance improvement over existing methods.
Findings
Up to 18X faster provenance extraction compared to baselines.
Reduces noise and improves accuracy of provenance graphs.
Deployed at scale by Microsoft Purview for real-world use.
Abstract
Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query or when). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity and sheer volume of database workflows. Provenance extraction from query event logs has recently been proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial overhead to the database and provenance extraction workflows and (b) extract provenance that is noisy, omits query execution dependencies, and is not rich enough for upstream applications. To address these problems, we introduce…
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Data Quality and Management
