Stream sampling for variance-optimal estimation of subset sums
Edith Cohen, Nick Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup

TL;DR
This paper introduces $\text{varopt}_k$, a reservoir sampling scheme that gives variance-optimal unbiased estimates of subset sums from streaming data. It dominates previous schemes in estimation quality and handles each stream item in $O(\log k)$ time.
Contribution
The paper presents a new reservoir sampling scheme, $\text{varopt}_k$, that is variance-optimal for unbiased subset-sum estimation, with tighter worst-case variance bounds and $O(\log k)$ processing time per item.
Findings
Minimizes the average variance over all subsets of any given size
Provides tighter worst-case variance bounds
Processes each stream item in $O(\log k)$ time
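To see why a single scheme can be optimal for every subset size at once, the average variance can be split into two parts. The following is a sketch under assumed notation: $n$ items have been seen, $\hat{w}_i$ is the unbiased estimate of item $i$'s weight (zero if the item is not sampled), $I$ ranges over subsets of size $m$, and $\Sigma V$ and $V\Sigma$ are shorthand for the sum of per-item variances and the variance of the estimated total.

$$
V_m \;=\; \mathop{\mathbb{E}}_{|I| = m}\!\left[\operatorname{Var}\Big(\sum_{i \in I} \hat{w}_i\Big)\right],
\qquad
\Sigma V = \sum_{i=1}^{n} \operatorname{Var}(\hat{w}_i),
\qquad
V\Sigma = \operatorname{Var}\Big(\sum_{i=1}^{n} \hat{w}_i\Big)
$$

A uniformly random subset of size $m$ contains a given item with probability $m/n$ and a given pair with probability $\frac{m(m-1)}{n(n-1)}$, and the covariances sum to $V\Sigma - \Sigma V$, so

$$
V_m \;=\; \frac{m}{n}\,\Sigma V + \frac{m(m-1)}{n(n-1)}\,\big(V\Sigma - \Sigma V\big)
\;=\; \frac{m}{n}\left(\frac{n-m}{n-1}\,\Sigma V + \frac{m-1}{n-1}\,V\Sigma\right).
$$

Both coefficients are non-negative for $1 \le m \le n$, so a scheme that minimizes $\Sigma V$ and $V\Sigma$ simultaneously minimizes $V_m$ for every $m$.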
Abstract
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\text{varopt}_k$, that dominates all previous schemes in terms of estimation quality. $\text{varopt}_k$ provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the…
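The reservoir idea in the abstract can be illustrated with a minimal Python sketch. It assumes strictly positive weights and a reservoir size of at least 1, and it recomputes the threshold by sorting on every arrival, so it costs roughly $O(k \log k)$ per item and is not the $O(\log k)$ data structure the abstract refers to. The names `VarOptSketch`, `threshold`, and `estimate` are hypothetical, chosen for this illustration only.

```python
import random


def threshold(weights):
    """Return tau such that sum(min(1, w / tau)) == len(weights) - 1.

    Assumes at least two strictly positive weights.
    """
    ws = sorted(weights)
    prefix = ws[0]  # running sum of the t smallest weights
    for t in range(2, len(ws) + 1):
        prefix += ws[t - 1]
        # If exactly the t smallest weights are below tau, each contributes
        # w / tau and the rest contribute 1, which forces tau = prefix / (t - 1).
        tau = prefix / (t - 1)
        if t == len(ws) or ws[t] >= tau:
            return tau


class VarOptSketch:
    """Illustrative reservoir of k items with adjusted weights (assumption:
    a simplified variant, not the paper's optimized implementation)."""

    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.items = []  # list of (key, adjusted_weight)

    def add(self, key, weight):
        self.items.append((key, float(weight)))
        if len(self.items) <= self.k:
            return
        tau = threshold([w for _, w in self.items])
        # Item i is dropped with probability 1 - min(1, w_i / tau);
        # these probabilities sum to 1 over the k + 1 candidates.
        r = self.rng.random()
        acc = 0.0
        drop = None
        for i, (_, w) in enumerate(self.items):
            acc += 1.0 - min(1.0, w / tau)
            if r < acc:
                drop = i
                break
        if drop is None:
            # Floating-point guard: fall back to the lightest item,
            # which always has a positive drop probability.
            drop = min(range(len(self.items)), key=lambda i: self.items[i][1])
        # Survivors below the threshold have their adjusted weight raised to tau.
        self.items = [
            (key, max(w, tau))
            for i, (key, w) in enumerate(self.items)
            if i != drop
        ]

    def estimate(self, predicate):
        """Estimate the total weight of all seen items whose key matches."""
        return sum(w for key, w in self.items if predicate(key))


if __name__ == "__main__":
    rng = random.Random(0)
    sketch = VarOptSketch(k=5, rng=rng)
    for key in range(200):
        sketch.add(key, rng.uniform(0.1, 10.0))
    print(sketch.estimate(lambda key: key % 2 == 0))
```

In this sketch the adjusted weights in the reservoir always sum to the true total weight of the stream (up to rounding), so the estimate for the full set has no variance; estimates for proper subsets vary from run to run.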
Taxonomy
Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Stochastic Gradient Optimization Techniques
