Making massive probabilistic databases practical
Andrei Todor, Alin Dobra, Tamer Kahveci, Christopher Dudley

TL;DR
This paper introduces efficient methods for answering aggregate and non-aggregate queries on massive probabilistic databases using Probability Generating Functions, enabling scalable analysis of large uncertain datasets.
Contribution
It presents a novel PGF-based approach for exact and approximate query answering, significantly improving scalability over existing systems.
Findings
The proposed methods are orders of magnitude faster than MayBMS and SPROUT.
They scale to several terabytes of data on TPC-H queries.
They achieve efficient query processing on large probabilistic datasets.
Abstract
Existence of incomplete and imprecise data has moved the database paradigm from deterministic to probabilistic information. Probabilistic databases contain tuples that may or may not exist with some probability. As a result, the number of possible deterministic database instances that can be observed from a probabilistic database grows exponentially with the number of probabilistic tuples. In this paper, we consider the problem of answering both aggregate and non-aggregate queries on massive probabilistic databases. We adopt the tuple independence model, in which each tuple is assigned a probability value. We develop a method that exploits Probability Generating Functions (PGF) to answer such queries efficiently. Our method maintains a polynomial for each tuple. It incrementally builds a master polynomial that expresses the distribution of the possible result values precisely. We…
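The idea of one polynomial per tuple combined into a master polynomial can be illustrated with a minimal sketch for a SUM aggregate under tuple independence. This is not the authors' implementation: the function names (`tuple_pgf`, `multiply`, `sum_distribution`) and the sample data are hypothetical, and the sketch assumes each tuple contributes a non-negative integer value with its existence probability, so its PGF is (1 - p) + p·x^v. Multiplying the per-tuple polynomials yields a polynomial whose coefficient on x^s is the probability that the SUM equals s.

```python
from collections import defaultdict


def tuple_pgf(prob, value):
    """PGF of one independent tuple, stored as {exponent: coefficient}.

    Represents (1 - prob) + prob * x**value: the tuple is absent with
    probability 1 - prob (contributing 0) or present (contributing value).
    """
    pgf = defaultdict(float)
    pgf[0] += 1.0 - prob
    pgf[value] += prob
    return dict(pgf)


def multiply(p, q):
    """Multiply two polynomials given as {exponent: coefficient} dicts."""
    out = defaultdict(float)
    for exp_p, coef_p in p.items():
        for exp_q, coef_q in q.items():
            out[exp_p + exp_q] += coef_p * coef_q
    return dict(out)


def sum_distribution(tuples):
    """Fold per-tuple PGFs into a master polynomial, one tuple at a time.

    `tuples` is an iterable of (probability, value) pairs (hypothetical
    input format). The result maps each possible SUM to its probability.
    """
    master = {0: 1.0}  # PGF of the empty sum
    for prob, value in tuples:
        master = multiply(master, tuple_pgf(prob, value))
    return master


# Hypothetical example: three uncertain tuples with integer values.
tuples = [(0.5, 10), (0.5, 20), (0.2, 10)]
dist = sum_distribution(tuples)
for total in sorted(dist):
    print(total, round(dist[total], 4))
```

The master polynomial here has at most one term per distinct sum, whereas enumerating possible worlds would require 2^n cases for n tuples. A COUNT aggregate is the special case where every value is 1.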
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Bayesian Modeling and Causal Inference
