Sketch-Based Estimation of Subpopulation-Weight
Edith Cohen, Haim Kaplan

TL;DR
This paper introduces novel unbiased estimators and confidence bounds for subpopulation weights using bottom-k sketches, enhancing approximate query processing for massive data sets across various applications.
Contribution
It presents new estimators and bounds tailored for different data applications, improving accuracy and efficiency in subpopulation weight estimation using bottom-k sketches.
Findings
Effective estimators for subpopulation weights are derived using the Horvitz-Thompson approach.
Confidence bounds are tailored to different data scenarios.
Benefits are demonstrated on Pareto-distributed data.
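For readers unfamiliar with the Horvitz-Thompson approach mentioned above, the generic textbook form can be written as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: w(i) is the weight of item i, p(i) > 0 is its inclusion probability, S is the sample, and J is the subpopulation selected by the predicate. The paper's own estimators for bottom-k sketches are not reproduced here.

```latex
% Generic Horvitz-Thompson adjusted weight (textbook form, notation assumed)
a(i) =
\begin{cases}
  \dfrac{w(i)}{p(i)} & \text{if } i \in S, \\[4pt]
  0                  & \text{otherwise,}
\end{cases}
\qquad
\mathbb{E}[a(i)] = p(i) \cdot \frac{w(i)}{p(i)} = w(i).

% Subpopulation estimate: sum the adjusted weights of sampled items in J
\hat{w}(J) = \sum_{i \in J \cap S} a(i),
\qquad
\mathbb{E}[\hat{w}(J)] = \sum_{i \in J} w(i) = w(J).
```

Because each adjusted weight is unbiased for its item, linearity of expectation makes the subpopulation sum unbiased for any predicate, including one chosen after the sample was drawn.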
Abstract
Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records' attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data including distributed databases and data streams. We derive novel unbiased estimators and efficient confidence bounds for subpopulation weight. Our estimators and bounds are tailored by distinguishing between applications (such as data streams) where the total weight of the sketched set can be computed by the summarization algorithm without a significant use of additional resources, and applications (such as sketches of network neighborhoods) where…
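To make the abstract concrete, here is a minimal sketch of one well-known bottom-k scheme, priority sampling, together with its classic subpopulation-weight estimator. It is an illustration under stated assumptions, not the paper's new estimators or confidence bounds: the function names, the in-memory list input, and the sort-based selection are all hypothetical choices made for this example.

```python
import random


def bottom_k_priority_sketch(items, k, rng):
    """Build a bottom-k sketch using priority-sampling ranks.

    items: iterable of (key, weight) pairs with weight > 0 (assumed input shape).
    Returns (sample, tau): the k items of smallest rank, and tau = 1 / r_{k+1},
    where r_{k+1} is the (k+1)-th smallest rank (tau = 0 if there are at most k items).
    """
    ranked = []
    for key, weight in items:
        u = 1.0 - rng.random()            # uniform in (0, 1], so the rank is never 0
        ranked.append((u / weight, key, weight))
    ranked.sort(key=lambda t: t[0])       # smaller rank = higher priority
    sample = [(key, weight) for _, key, weight in ranked[:k]]
    tau = 1.0 / ranked[k][0] if len(ranked) > k else 0.0
    return sample, tau


def estimate_subpopulation_weight(sample, tau, predicate):
    """Classic priority-sampling estimate: each sampled item that satisfies
    the predicate contributes the adjusted weight max(weight, tau)."""
    return sum(max(weight, tau) for key, weight in sample if predicate(key))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 100 items, item i has weight i + 1.
    items = [(i, float(i + 1)) for i in range(100)]
    sample, tau = bottom_k_priority_sketch(items, 20, rng)
    estimate = estimate_subpopulation_weight(sample, tau, lambda key: key % 2 == 0)
    exact = sum(w for key, w in items if key % 2 == 0)
    print("estimate:", estimate, "exact:", exact)
```

The predicate is applied only at query time, which is the point of a sketch: the same k-item summary can answer many different subpopulation queries. A single estimate can be far from the exact value; unbiasedness is a statement about the average over many independent sketches.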
Taxonomy
Topics: Data Visualization and Analytics · Graph Theory and Algorithms · Data Management and Algorithms
