Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates

Daniel Kang; John Guibas; Peter Bailis; Tatsunori Hashimoto; Yi Sun; Matei Zaharia

arXiv:2107.12525·math.ST·July 30, 2021

Open Access

TL;DR

This paper introduces ABae, a stratified sampling method with proxy models for faster approximate aggregation queries with expensive predicates, and provides a theoretical analysis of its mean squared error decay rate.
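The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `two_stage_mean`, the `(matches, value)` record layout, sampling with replacement, and the Stage 2 rule that allocates samples in proportion to $\sqrt{\hat{p}_k} \, \hat{\sigma}_k$ are all assumptions made for illustration. The strata are assumed to have been formed beforehand by bucketing records on proxy-model scores.

```python
import math
import random
import statistics


def two_stage_mean(strata, n1, n2, seed=0):
    """Estimate the mean value over records that match the predicate.

    strata: list of non-empty lists of (matches, value) pairs. Reading
    `matches` stands in for one call to the expensive predicate.
    n1: Stage 1 samples drawn from every stratum.
    n2: Stage 2 samples shared across all strata.
    """
    rng = random.Random(seed)
    total = sum(len(s) for s in strata)

    def stats(sample):
        # Positive rate, mean, and standard deviation of matching values.
        hits = [v for m, v in sample if m]
        p = len(hits) / len(sample)
        mu = statistics.fmean(hits) if hits else 0.0
        sigma = statistics.stdev(hits) if len(hits) > 1 else 0.0
        return p, mu, sigma

    # Stage 1: pilot sample of n1 records per stratum (with replacement).
    draws = [rng.choices(s, k=n1) for s in strata]

    # Stage 2: assumed allocation, proportional to sqrt(p_k) * sigma_k.
    weights = [math.sqrt(p) * sigma for p, _, sigma in map(stats, draws)]
    if sum(weights) == 0:
        weights = [1.0] * len(strata)  # fall back to a uniform split
    for k, s in enumerate(strata):
        extra = int(n2 * weights[k] / sum(weights))
        draws[k].extend(rng.choices(s, k=extra))

    # Combine per-stratum estimates, weighting by stratum size and
    # estimated positive rate.
    num = den = 0.0
    for s, sample in zip(strata, draws):
        p, mu, _ = stats(sample)
        w = len(s) / total
        num += w * p * mu
        den += w * p
    return num / den if den > 0 else float("nan")
```

In this sketch the predicate is evaluated on `len(strata) * n1` records in Stage 1 and at most `n2` records in Stage 2, which mirrors the budget split $N = K \cdot N_{1} + N_{2}$ used in the abstract.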

Contribution

It offers a theoretical analysis of ABae, showing its MSE decay rate and demonstrating it can match optimal stratified sampling performance with proper sample allocation.

Findings

01 MSE of ABae decays at rate $O(N^{-1})$ with proper sample allocation

02 ABae achieves near-optimal performance compared to known-stratum algorithms

03 Theoretical bounds validate ABae's efficiency for approximate aggregation

Abstract

Given a dataset $D$, we are interested in computing the mean of a subset of $D$ which matches a predicate. ABae leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze ABae and show that the MSE of the estimate decays at rate $O(N_{1}^{-1} + N_{2}^{-1} + N_{1}^{1/2} N_{2}^{-3/2})$, where $N = K \cdot N_{1} + N_{2}$ for some integer constant $K$ and $K \cdot N_{1}$ and $N_{2}$ represent the number of samples used in Stage 1 and Stage 2 of ABae respectively. Hence, if a constant fraction of the total sample budget $N$ is allocated to each stage, we will achieve a mean squared error of $O(N^{-1})$ which matches the rate of mean squared error of the optimal stratified sampling algorithm given a priori knowledge of the predicate positive rate and standard deviation per stratum.
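The step from the three-term bound to $O(N^{-1})$ can be checked directly. As a sketch, assume a fixed fraction $\alpha \in (0, 1)$ of the budget goes to Stage 1, so that $K \cdot N_{1} = \alpha N$ and $N_{2} = (1 - \alpha) N$. The symbol $\alpha$ is introduced here for illustration, and integer rounding is ignored.

$$
\begin{aligned}
N_{1}^{-1} &= \frac{K}{\alpha} \, N^{-1}, \\
N_{2}^{-1} &= \frac{1}{1 - \alpha} \, N^{-1}, \\
N_{1}^{1/2} N_{2}^{-3/2} &= \left(\frac{\alpha N}{K}\right)^{1/2} \bigl((1 - \alpha) N\bigr)^{-3/2} = \frac{\sqrt{\alpha / K}}{(1 - \alpha)^{3/2}} \, N^{-1}.
\end{aligned}
$$

Each term is a constant multiple of $N^{-1}$ because $K$ and $\alpha$ do not depend on $N$, so the sum is $O(N^{-1})$. The constant-fraction condition matters: if, for example, Stage 2 received only $N_{2} = N^{1/2}$ samples, the term $N_{2}^{-1} = N^{-1/2}$ would dominate and the bound would decay more slowly.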

Taxonomy

Topics: Statistical Methods and Inference · Machine Learning and Algorithms · Machine Learning and Data Classification