Subset Sampling over Joins
Aryan Esmailpour, Xiao Hu, Jinchao Huang, Stavros Sintos

TL;DR
This paper introduces efficient algorithms for subset sampling over relational joins. They draw a random subset of the join results without materializing the full join, which matters for scalable data analytics and machine learning over large relational datasets.
Contribution
It presents the first algorithms for subset sampling over acyclic joins, including static, one-shot, and dynamic indexing methods with near-optimal complexity.
Findings
The algorithms support drawing multiple independent samples efficiently.
The dynamic techniques handle data that changes through insertions.
The algorithms achieve near-optimal time and space complexity.
Abstract
Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by processing representative subsets rather than massive datasets. While sampling from explicit lists is well understood, modern applications, such as machine learning over relational data, often require sampling from a set defined implicitly by a relational join. In this paper, we study the problem of subset sampling over joins: drawing a random subset from the join results, where each join result is included independently with some probability. We address the general setting where the probability is derived from input tuple weights via decomposable functions (e.g., product, sum, min, max). Since the join size can be exponentially larger than the…
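To make the problem definition concrete, here is a minimal Python sketch of the naive baseline: it enumerates every result of a two-table join and flips an independent coin for each one. This is not the paper's algorithm; it is the full-materialization approach the paper aims to avoid. The relation layout (`R(a, b)`, `S(b, c)`), the function name, and the choice of the product as the decomposable function are assumptions made for illustration, with tuple weights assumed to lie in [0, 1].

```python
import random
from collections import defaultdict


def naive_subset_sample_join(R, S, rng):
    """Naive baseline (hypothetical helper, not from the paper).

    R: list of (a, b, weight) tuples; S: list of (b, c, weight) tuples.
    Each join result (a, b, c) is included independently with probability
    weight_R * weight_S (the product function, assumed here).
    """
    # Hash S on the join attribute b so matches can be looked up directly.
    s_by_b = defaultdict(list)
    for b, c, w in S:
        s_by_b[b].append((c, w))

    sample = []
    for a, b, w_r in R:
        for c, w_s in s_by_b.get(b, []):
            p = w_r * w_s  # inclusion probability of this join result
            # Independent coin flip per join result (Poisson sampling).
            if rng.random() < p:
                sample.append((a, b, c))
    return sample


# Weights of 1.0 and 0.0 make this small example deterministic.
R = [("a1", 1, 1.0), ("a2", 1, 0.0), ("a3", 2, 1.0)]
S = [(1, "c1", 1.0), (1, "c2", 1.0), (3, "c3", 1.0)]
result = naive_subset_sample_join(R, S, random.Random(0))
```

The running time of this sketch is proportional to the full join size, regardless of how few results end up in the sample. That cost is what motivates sampling without materializing the join.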
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Advanced Database Systems and Queries
