Poisson Sampling over Acyclic Joins
Liese Bekkers, Frank Neven, Lorrens Pantelis, Stijn Vansummeren

TL;DR
This paper presents a nearly instance-optimal algorithm for Poisson sampling over acyclic joins, running in time O(N + k log N), together with practical implementation strategies for column stores and experimental validation on real data.
Contribution
It introduces a novel algorithm for Poisson sampling over acyclic joins that is nearly instance-optimal and integrates a random-access index for efficient sampling without materializing the full join.
Findings
The proposed algorithm outperforms traditional repeated Bernoulli trials in practice.
The random-access index can be used to implement Yannakakis' join algorithm with sampling.
Practical implementation choices significantly improve performance in column store environments.
Abstract
We introduce the problem of Poisson sampling over joins: compute a sample of the result of a join query by conceptually performing a Bernoulli trial for each join tuple, using a non-uniform and tuple-specific probability. We propose an algorithm for Poisson sampling over acyclic joins that is nearly instance-optimal, running in time O(N + k \log N), where N is the size of the input database and k is the size of the resulting sample. Our algorithm hinges on two building blocks: (1) the construction of a random-access index that, given a number i, returns the i-th join tuple without fully materializing the (possibly large) join result; (2) the probing of this index to construct the result sample. We study the engineering trade-offs required to make both components practical, focusing on their implementation in column stores, and identify best-performing alternatives for…
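To make the two building blocks concrete, here is a minimal Python sketch under strong simplifying assumptions. It is not the authors' implementation. It handles only a two-relation join R(a, b) ⋈ S(b, c) rather than general acyclic joins, uses plain Python lists instead of a column store, and replaces the paper's probing strategy with a standard thinning trick: draw candidate positions with a uniform upper bound `p_max` using geometric skips, then accept each candidate with probability `prob(t) / p_max`. All names (`build_index`, `access`, `sample_positions`, `poisson_sample`, `p_max`) are hypothetical.

```python
import bisect
import math
import random
from collections import defaultdict
from itertools import accumulate


def build_index(R, S):
    """Random-access index for R(a, b) JOIN S(b, c) on b (illustrative only)."""
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    # Keep only R tuples that have at least one join partner in S.
    rows = [(a, b) for a, b in R if b in s_by_b]
    # prefix[j] = number of join tuples produced by rows[0..j].
    prefix = list(accumulate(len(s_by_b[b]) for _, b in rows))
    return rows, prefix, s_by_b


def join_size(index):
    _, prefix, _ = index
    return prefix[-1] if prefix else 0


def access(index, i):
    """Return the i-th join tuple (0-based) by binary search on prefix sums."""
    rows, prefix, s_by_b = index
    if not 0 <= i < join_size(index):
        raise IndexError(i)
    j = bisect.bisect_right(prefix, i)
    offset = i - (prefix[j - 1] if j > 0 else 0)
    a, b = rows[j]
    return (a, b, s_by_b[b][offset])


def sample_positions(n, p, rng):
    """Positions kept by n independent Bernoulli(p) trials, via geometric skips."""
    if p <= 0:
        return []
    if p >= 1:
        return list(range(n))
    log_q = math.log(1.0 - p)
    positions, i = [], -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        i += 1 + int(math.log(u) / log_q)  # skip a geometric number of failures
        if i >= n:
            return positions
        positions.append(i)


def poisson_sample(index, prob, p_max, rng):
    """Poisson sample of the join; assumes prob(t) <= p_max for every tuple t."""
    sample = []
    for i in sample_positions(join_size(index), p_max, rng):
        t = access(index, i)
        # Thinning: keep t with probability prob(t) / p_max.
        if rng.random() * p_max < prob(t):
            sample.append(t)
    return sample
```

For example, `poisson_sample(build_index(R, S), lambda t: 0.1, 0.1, random.Random(0))` touches only the candidate positions instead of running one trial per join tuple. The thinning step is cheap when `p_max` is close to the typical tuple probability and wasteful when a few tuples have much higher probabilities than the rest.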
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Distributed systems and fault tolerance
