Exact PPS Sampling with Bounded Sample Size
Brian Hentschel, Peter J. Haas, Yuanyuan Tian

TL;DR
This paper introduces a new PPS sampling method that never exceeds a target sample size n while always preserving the probability-proportional-to-size property, which avoids storage overflows and helps control the time required for analytics over the sample.
Contribution
The paper presents a novel PPS sampling scheme that enforces the PPS property with a bounded sample size, balancing accuracy and resource constraints.
Findings
Ensures sample size never exceeds target n
Maintains PPS property at all times
Operates with O(1) amortized processing time per item
Abstract
Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified "weight" (also called its "size"). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value n. The sample size is exactly equal to n if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics…
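To illustrate why the two objectives in the abstract can conflict, the following Python sketch computes the largest inclusion probabilities that are exactly proportional to the weights while keeping the expected sample size at most n. It is not the paper's sampling algorithm, only the arithmetic behind the trade-off. The function names are hypothetical, and the formula rests on two assumptions derived from the abstract's constraints: every inclusion probability must be at most 1, and a sample that never exceeds n items has expected size at most n.

```python
from fractions import Fraction


def pps_inclusion_probabilities(weights, n):
    """Return inclusion probabilities pi_i = c * w_i with the largest feasible c.

    Illustrative helper (not from the paper). The scale factor c is capped by
    two constraints:
      * c <= n / sum(weights), so the expected sample size is at most n;
      * c <= 1 / max(weights), so no probability exceeds 1.
    """
    if n <= 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need n > 0 and a non-empty list of positive weights")
    exact = [Fraction(w) for w in weights]  # exact arithmetic, no rounding
    total = sum(exact)
    c = min(Fraction(n) / total, 1 / max(exact))
    return [c * w for w in exact]


def expected_sample_size(weights, n):
    """Expected sample size implied by the probabilities above."""
    return sum(pps_inclusion_probabilities(weights, n))


if __name__ == "__main__":
    # Equal weights: a sample of exactly n = 2 is compatible with PPS.
    print(expected_sample_size([1, 1, 1, 1], 2))
    # One dominant weight: exact PPS forces the expected size below n = 2.
    print(expected_sample_size([10, 1, 1], 2))
```

In the second case the heavy item would need an inclusion probability above 1 to reach an expected size of 2, so a scheme that keeps exact PPS has to accept a smaller expected sample size, while a scheme that insists on the target size has to give up exact proportionality.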
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Database Systems and Queries · Data Management and Algorithms
