Variance-Optimal Offline and Streaming Stratified Random Sampling
Trong Duc Nguyen, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, Bojian Xu

TL;DR
This paper introduces VOILA, a variance-optimal offline stratified sampling method that handles bounded strata, and S-VOILA, a streaming algorithm that approximates this optimal allocation with minimal variance increase.
Contribution
It proposes the first variance-optimal offline stratified sampling method for bounded strata and a streaming algorithm that closely approximates this optimal allocation.
Findings
VOILA achieves 1.4 to 50 times lower variance than Neyman allocation.
S-VOILA's variance is typically close to the offline optimal VOILA.
A theoretical lower bound shows that, in the worst case, any streaming algorithm must have variance at least an Omega(r) factor worse than the optimal, where r is the number of strata.
Abstract
Stratified random sampling (SRS) is a fundamental sampling technique that provides accurate estimates for aggregate queries using a small size sample, and has been used widely for approximate query processing. A key question in SRS is how to partition a target sample size among different strata. While Neyman allocation provides a solution that minimizes the variance of an estimate using this sample, it works under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. This assumption may not hold in general: one or more strata may be bounded, and may not contain a large number of data points, even though the total data size may be large. We first present VOILA, an offline method for allocating sample sizes to strata in a variance-optimal manner, even for the case when one or more strata may be bounded. We next consider SRS on streaming…
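To make the bounded-strata problem concrete, here is a minimal sketch, not the paper's VOILA procedure. It assumes the textbook form of Neyman allocation (sample size proportional to stratum size times standard deviation) and a simple cap-and-redistribute heuristic for strata too small to supply their share. The function names and example numbers are hypothetical.

```python
def neyman_allocation(sizes, stds, budget):
    """Textbook Neyman allocation: s_i proportional to n_i * sigma_i.

    Returns real-valued sample sizes and ignores stratum capacity.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    return [budget * w / total for w in weights]


def capped_allocation(sizes, stds, budget):
    """Illustrative heuristic (an assumption, not taken from the paper):
    cap any stratum whose Neyman share exceeds its size, then re-split
    the remaining budget among the uncapped strata until none overflow.
    """
    alloc = [0.0] * len(sizes)
    free = set(range(len(sizes)))
    remaining = float(budget)
    while free:
        total = sum(sizes[i] * stds[i] for i in free)
        if total == 0:
            break  # remaining strata have zero spread; leave them unsampled
        share = {i: remaining * sizes[i] * stds[i] / total for i in free}
        over = [i for i in free if share[i] > sizes[i]]
        if not over:
            for i in free:
                alloc[i] = share[i]
            break
        for i in over:
            alloc[i] = float(sizes[i])  # take the whole stratum
            remaining -= sizes[i]
            free.remove(i)
    return alloc


# Hypothetical example: the third stratum is small but highly variable.
sizes = [1000, 1000, 10]
stds = [1.0, 2.0, 50.0]
plain = neyman_allocation(sizes, stds, budget=100)
capped = capped_allocation(sizes, stds, budget=100)
```

In this example, plain Neyman allocation asks the third stratum for about 14 samples although it holds only 10 records, which is the situation the abstract calls a bounded stratum. The capped variant takes all 10 and splits the other 90 between the first two strata.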
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Data Stream Mining Techniques
