Optimizing Provenance Computations
Xing Niu, Boris Glavic

TL;DR
This paper introduces provenance-specific optimizations and a cost-based framework to significantly improve the efficiency of provenance computations in databases, enabling faster and more scalable data provenance analysis.
Contribution
It presents algebraic equivalences and an extensible optimization framework for provenance queries, implemented in the GProM system, to enhance performance without modifying the underlying DBMS.
Findings
Performance improved by several orders of magnitude
Effective for diverse provenance tasks
Optimization framework is easily retrofitted into existing systems
Abstract
Data provenance is essential for debugging query results, auditing data in cloud environments, and explaining outputs of Big Data analytics. A well-established technique is to represent provenance as annotations on data and to instrument queries to propagate these annotations to produce results annotated with provenance. However, even sophisticated optimizers are often incapable of producing efficient execution plans for instrumented queries, because of their inherent complexity and unusual structure. Thus, while instrumentation enables provenance support for databases without requiring any modification to the DBMS, the performance of this approach is far from optimal. In this work, we develop provenance-specific optimizations to address this problem. Specifically, we introduce algebraic equivalences targeted at instrumented queries and discuss alternative, equivalent ways of…
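To make the idea of annotation propagation concrete, here is a minimal sketch of instrumented relational operators. It assumes a simple lineage-style model in which each row carries the set of source-tuple identifiers it was derived from. This is an illustration only, not the representation or rewrite rules used by GProM, and all names (`annotate`, `select`, `join`, `project_distinct`, the sample relations) are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Row = Tuple                      # a plain tuple of attribute values
Prov = FrozenSet[str]            # identifiers of contributing source tuples
Annotated = List[Tuple[Row, Prov]]


def annotate(name: str, rows: List[Row]) -> Annotated:
    """Tag each input row with a singleton provenance set, e.g. {'C1'}."""
    return [(row, frozenset({f"{name}{i}"})) for i, row in enumerate(rows, start=1)]


def select(rel: Annotated, pred: Callable[[Row], bool]) -> Annotated:
    """Selection keeps a row's annotation unchanged."""
    return [(row, prov) for row, prov in rel if pred(row)]


def join(left: Annotated, right: Annotated,
         pred: Callable[[Row, Row], bool]) -> Annotated:
    """Join concatenates rows and unions the annotations of both inputs."""
    return [(l + r, lp | rp)
            for l, lp in left
            for r, rp in right
            if pred(l, r)]


def project_distinct(rel: Annotated, cols: Sequence[int]) -> Annotated:
    """Duplicate-eliminating projection merges annotations of collapsed rows."""
    merged: Dict[Row, Prov] = {}
    for row, prov in rel:
        key = tuple(row[c] for c in cols)
        merged[key] = merged.get(key, frozenset()) | prov
    return list(merged.items())


# Hypothetical sample data: customers(name, city) and orders(name, amount).
customers = annotate("C", [("alice", "NY"), ("bob", "LA")])
orders = annotate("O", [("alice", 30), ("alice", 70), ("bob", 5)])

# Cities of customers with at least one order of 30 or more.
big_orders = select(orders, lambda o: o[1] >= 30)
joined = join(customers, big_orders, lambda c, o: c[0] == o[0])
result = project_distinct(joined, [1])

for row, prov in result:
    print(row, sorted(prov))
```

Every operator here does extra work to carry annotations alongside the data, which hints at why instrumented queries end up larger and more unusual in shape than the queries they were derived from.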
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Research Data Management Practices
