Subset Sampling over Joins

Aryan Esmailpour; Xiao Hu; Jinchao Huang; Stavros Sintos

arXiv:2512.16321·cs.DB·December 19, 2025

Open Access

TL;DR

This paper introduces efficient algorithms for subset sampling over relational joins. They select representative subsets of the join results in large, complex datasets without materializing the full join, which matters for scalable data analytics and machine learning.

Contribution

The paper presents the first algorithms for subset sampling over acyclic joins, including static, one-shot, and dynamic indexing methods with near-optimal complexity.

Findings

01 The algorithms efficiently support drawing multiple independent samples.

02 The techniques handle dynamic data with insertions.

03 The algorithms achieve near-optimal time and space complexity.

Abstract

Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by processing representative subsets rather than massive datasets. While sampling from explicit lists is well understood, modern applications, such as machine learning over relational data, often require sampling from a set defined implicitly by a relational join. In this paper, we study the problem of subset sampling over joins: drawing a random subset from the join results, where each join result is included independently with some probability. We address the general setting where the probability is derived from input tuple weights via decomposable functions (e.g., product, sum, min, max). Since the join size can be exponentially larger than the…
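To make the problem statement concrete, the following is a minimal Python sketch of the naive baseline: materialize every join result and flip an independent coin for each. It is not the paper's algorithm, which avoids full materialization. The two-relation schema R(a, b), S(b, c), the function names, and the assumption that tuple weights lie in [0, 1] (so that product, min, or max yields a valid probability) are all illustrative choices, not taken from the paper.

```python
import math
import random


def join_with_weights(R, S):
    """Yield (join_result, (w_r, w_s)) for R(a, b) joined with S(b, c) on b.

    R and S are dicts mapping tuples to weights (hypothetical layout).
    This enumerates every join result, i.e. it materializes the join.
    """
    s_by_b = {}
    for (b, c), w_s in S.items():
        s_by_b.setdefault(b, []).append((c, w_s))
    for (a, b), w_r in R.items():
        for c, w_s in s_by_b.get(b, []):
            yield (a, b, c), (w_r, w_s)


def subset_sample_naive(R, S, combine=math.prod, rng=None):
    """Include each join result independently with probability combine(weights).

    `combine` stands in for a decomposable function such as product, min,
    or max; weights are assumed to lie in [0, 1].
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for result, weights in join_with_weights(R, S):
        p = combine(weights)
        # Independent Bernoulli trial per join result (Poisson sampling).
        if rng.random() < p:
            sample.append(result)
    return sample


if __name__ == "__main__":
    R = {(1, 10): 0.9, (2, 10): 0.5, (3, 20): 0.8}
    S = {(10, "x"): 0.7, (10, "y"): 0.2, (30, "z"): 1.0}
    print(subset_sample_naive(R, S, combine=math.prod, rng=random.Random(7)))
```

The running time of this baseline is proportional to the full join size, even when the expected sample is small. That cost is the motivation for sampling without materializing the join.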

Taxonomy

Topics: Data Quality and Management · Data Management and Algorithms · Advanced Database Systems and Queries