Sketch-Based Estimation of Subpopulation-Weight

Edith Cohen; Haim Kaplan

arXiv:0802.3448·cs.DB·February 26, 2008·2 cites

Open Access

TL;DR

This paper introduces novel unbiased estimators and confidence bounds for subpopulation weights using bottom-k sketches, enhancing approximate query processing for massive data sets across various applications.

Contribution

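It presents new unbiased estimators and confidence bounds for subpopulation weight computed from bottom-k sketches. The estimators and bounds are tailored to the application: some (such as data streams) can compute the total weight of the sketched set without significant additional resources, while others (such as sketches of network neighborhoods) are treated separately.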

Findings

01. Effective estimators for subpopulation weight, derived using the Horvitz-Thompson approach.

02. Confidence bounds tailored to different data scenarios.

03. Benefits demonstrated on Pareto-distributed data.

Abstract

Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records' attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data including distributed databases and data streams. We derive novel unbiased estimators and efficient confidence bounds for subpopulation weight. Our estimators and bounds are tailored by distinguishing between applications (such as data streams) where the total weight of the sketched set can be computed by the summarization algorithm without a significant use of additional resources, and applications (such as sketches of network neighborhoods) where…
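To make the setting concrete, here is a minimal sketch of a bottom-k sketch built with priority-sampling ranks, followed by a Horvitz-Thompson style estimate of a subpopulation's weight. It illustrates the well-known priority-sampling threshold estimator, not the novel estimators or confidence bounds derived in the paper. The function names, the `(key, weight, attrs)` record layout, and the `seed` parameter are assumptions made for this illustration.

```python
import heapq
import random


def bottom_k_sketch(items, k, seed=0):
    """Build a bottom-k sketch using priority-sampling ranks.

    items: iterable of (key, weight, attrs) with weight > 0 (hypothetical layout).
    Returns (sample, threshold_rank), where sample holds the k records with
    the smallest ranks and threshold_rank is the (k+1)-th smallest rank,
    or None when the whole set fits in the sketch.
    """
    rng = random.Random(seed)
    ranked = []
    for key, weight, attrs in items:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids a zero rank
        ranked.append((u / weight, key, weight, attrs))
    smallest = heapq.nsmallest(k + 1, ranked, key=lambda t: t[0])
    if len(smallest) <= k:
        return smallest, None
    return smallest[:k], smallest[k][0]


def estimate_subpopulation_weight(sample, threshold_rank, predicate):
    """Sum adjusted weights of sampled records whose attributes match.

    Conditioned on the threshold rank t, a record of weight w is included
    with probability min(1, w * t), so its adjusted weight is max(w, 1 / t).
    """
    total = 0.0
    for _rank, _key, weight, attrs in sample:
        if predicate(attrs):
            if threshold_rank is None:
                total += weight  # nothing was dropped, so the weight is exact
            else:
                total += max(weight, 1.0 / threshold_rank)
    return total


# Usage: estimate the total weight of records with port 80 from a size-4 sketch.
records = [(i, float(i), {"port": 80 if i % 2 == 0 else 443}) for i in range(1, 11)]
sample, tau = bottom_k_sketch(records, k=4, seed=1)
estimate = estimate_subpopulation_weight(sample, tau, lambda a: a["port"] == 80)
```

The predicate is applied only after the sketch is built, so one sketch can answer queries about subpopulations that were not known when the data was summarized. A single estimate can be far from the true weight (30 for the even keys here); the estimator is unbiased, so accuracy is a statement about its average over the random ranks.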

Taxonomy

Topics: Data Visualization and Analytics · Graph Theory and Algorithms · Data Management and Algorithms