Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates
Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, Yi Sun, Matei Zaharia

TL;DR
This paper introduces ABae, a stratified sampling method with proxy models for faster approximate aggregation queries with expensive predicates, and provides a theoretical analysis of its mean squared error decay rate.
Contribution
The paper gives a theoretical analysis of ABae: it bounds the decay rate of the estimator's MSE and shows that, when a constant fraction of the sampling budget goes to each stage, ABae matches the MSE rate of optimal stratified sampling.
Findings
MSE of ABae decays at rate O(N^{-1}) with proper sample allocation
ABae achieves near-optimal performance compared to known-stratum algorithms
Theoretical bounds validate ABae's efficiency for approximate aggregation
Abstract
Given a dataset D, we are interested in computing the mean of the subset of D that matches a predicate. ABae leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget N. In this document, we theoretically analyze ABae and show that the MSE of the estimate decays at rate O(N_1^{-1} + N_2^{-1} + N_1^{1/2} N_2^{-3/2}), where N = K·N_1 + N_2 for some integer constant K, and K·N_1 and N_2 represent the number of samples used in Stage 1 and Stage 2 of ABae respectively. Hence, if a constant fraction of the total sample budget N is allocated to each stage, we will achieve a mean squared error of O(N^{-1}), which matches the rate of mean squared error of the optimal stratified sampling algorithm given a priori knowledge of the predicate positive rate and standard deviation per stratum.
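The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes K strata formed by sorting records on a proxy score, n1 oracle calls per stratum in Stage 1 (K·n1 in total) and roughly n2 oracle calls in Stage 2. The Stage 2 allocation proportional to sqrt(p_k)·sigma_k, the reuse of Stage 1 samples in the final estimate, the function name abae_sketch and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def abae_sketch(proxy, oracle, values, K=5, n1=100, n2=500, seed=0):
    """Estimate the mean of values[i] over records i where oracle(i) is True."""
    rng = np.random.default_rng(seed)
    # Stratify: order records by proxy score and cut into K near-equal strata.
    strata = np.array_split(np.argsort(proxy), K)

    def draw(pool, n):
        # Sample up to n records without replacement; label each with the oracle.
        n = min(n, len(pool))
        if n == 0:
            return pool[:0], np.zeros(0, dtype=bool)
        idx = rng.choice(pool, size=n, replace=False)
        return idx, np.array([oracle(i) for i in idx], dtype=bool)

    # Stage 1: n1 oracle calls per stratum to estimate the predicate positive
    # rate p_k and the standard deviation sigma_k among matching records.
    stage1 = [draw(s, n1) for s in strata]
    p_hat = np.array([hits.mean() for _, hits in stage1])
    sigma_hat = np.array([
        values[idx[hits]].std() if hits.sum() > 1 else 0.0
        for idx, hits in stage1
    ])

    # Stage 2 allocation (assumed rule): proportional to sqrt(p_k) * sigma_k.
    # Flooring means slightly fewer than n2 samples may be used.
    weights = np.sqrt(p_hat) * sigma_hat
    if weights.sum() == 0:
        weights = np.ones(K)
    alloc = np.floor(n2 * weights / weights.sum()).astype(int)

    # Stage 2: draw fresh records per stratum, pool them with the Stage 1
    # samples, and combine per-stratum means weighted by estimated matches.
    num, den = 0.0, 0.0
    for s, (idx1, hits1), n_k in zip(strata, stage1, alloc):
        idx2, hits2 = draw(np.setdiff1d(s, idx1), int(n_k))
        idx = np.concatenate([idx1, idx2])
        hits = np.concatenate([hits1, hits2])
        p_k = hits.mean()
        mu_k = values[idx[hits]].mean() if hits.any() else 0.0
        num += len(s) * p_k * mu_k
        den += len(s) * p_k
    return num / den if den > 0 else float("nan")


# Synthetic demo: the predicate is more likely to hold when the proxy is high.
rng = np.random.default_rng(1)
proxy = rng.random(20_000)
labels = rng.random(20_000) < proxy          # stand-in for the expensive predicate
values = rng.normal(10 + 5 * proxy, 1.0)
estimate = abae_sketch(proxy, lambda i: bool(labels[i]), values, K=5, n1=200, n2=2000)
exact = values[labels].mean()
print(estimate, exact)
```

In this sketch the oracle is called at most K·n1 + n2 times, mirroring the budget split N = K·N_1 + N_2 from the abstract. Comparing estimate with exact on the synthetic data gives a quick sanity check of the estimator.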
Taxonomy
Topics: Statistical Methods and Inference · Machine Learning and Algorithms · Machine Learning and Data Classification
