Provenance for Large-scale Datalog

David Zhao; Pavle Subotic; Bernhard Scholz

arXiv:1907.05045 · cs.PL · July 18, 2019 · 1 citation

Open Access

TL;DR

This paper presents a scalable provenance debugging technique for large-scale Datalog programs, introducing a new evaluation strategy and proof annotations that efficiently handle millions of tuples with minimal overhead.

Contribution

It introduces a novel bottom-up evaluation strategy with a new provenance lattice and fixed-point semantics, enabling scalable debugging of large Datalog programs.

Findings

1. Achieves high performance with a 1.27x overhead on average
2. Handles tens of millions of output tuples effectively
3. More flexible than existing provenance debugging techniques

Abstract

Logic programming languages such as Datalog have become popular as Domain Specific Languages (DSLs) for solving large-scale, real-world problems, in particular static program analysis and network analysis. The logic specifications that model analysis problems process millions of tuples of data and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug. While the database community has proposed several data-provenance techniques that address the Declarative Debugging Challenge for Databases, in the case of analysis problems these state-of-the-art techniques do not scale. In this paper, we introduce a novel bottom-up Datalog evaluation strategy for debugging: our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naive evaluation. A debugging query…
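To make the idea of proof annotations under semi-naive evaluation more concrete, the following is a minimal Python sketch, not the authors' implementation. It evaluates a two-rule transitive-closure program bottom-up and attaches to each derived tuple a pair (rule id, proof height). The rule numbering, the convention that input facts have height 0, and the function name `eval_path` are all assumptions made for illustration; the paper's actual lattice and data structures may differ.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str]
# Hypothetical annotation: (id of the rule that derived the tuple, proof height).
Annotation = Tuple[int, int]


def eval_path(edges: Set[Fact]) -> Dict[Fact, Annotation]:
    """Semi-naive evaluation of the illustrative program

        path(x, y) :- edge(x, y).              (rule 1)
        path(x, z) :- path(x, y), edge(y, z).  (rule 2)

    Input edge facts are assumed to have height 0, so a rule's head has
    height max(body heights) + 1.
    """
    # Rule 1: every edge is a path whose proof has height 1.
    path: Dict[Fact, Annotation] = {e: (1, 1) for e in edges}
    delta = dict(path)  # tuples discovered in the previous iteration

    while delta:
        new: Dict[Fact, Annotation] = {}
        # Rule 2, joined only against the delta (the semi-naive step).
        for (x, y), (_, height) in delta.items():
            for (src, z) in edges:
                if src != y:
                    continue
                head = (x, z)
                candidate = (2, height + 1)
                # Skip tuples that already carry an annotation from an
                # earlier iteration; within this iteration keep the
                # candidate with the smaller height.
                if head in path:
                    continue
                if head not in new or candidate[1] < new[head][1]:
                    new[head] = candidate
        path.update(new)
        delta = new
    return path


if __name__ == "__main__":
    demo = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")}
    for fact, (rule, height) in sorted(eval_path(demo).items()):
        print(fact, "rule", rule, "height", height)
```

In this sketch, `path("a", "c")` keeps the annotation from rule 1 because the direct edge gives a shorter proof than going through `b`. Annotations of this kind record which rule produced a tuple and how tall its proof is, which is the information a later debugging step would need in order to rebuild a proof without re-running the whole evaluation.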

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Research Data Management Practices