MISS: Finding Optimal Sample Sizes for Approximate Analytics
Xuebin Su, Hongzhi Wang, Jianzhong Li, Hong Gao

TL;DR
This paper introduces MISS, a framework for efficiently determining minimal sample sizes in approximate query processing, balancing accuracy, efficiency, and broad applicability.
Contribution
It proposes a linear error model and a novel iterative sample selection framework, L^2Miss, to optimize sample sizes for various error metrics in AQP.
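To make the idea of model-guided iterative sample selection concrete, here is a minimal sketch, not the authors' algorithm. It assumes that the error model is linear in log-log space (log error = a + b · log n), that the error is measured as a bootstrap standard error, and that the model is refit by ordinary least squares after every probe. None of these choices are confirmed by the summary above, and the names `bootstrap_error`, `fit_line`, and `find_sample_size` are hypothetical.

```python
import math
import random
import statistics


def bootstrap_error(sample, estimator, rng, rounds=200):
    """Bootstrap standard error of `estimator` on `sample` (an assumed error metric)."""
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(rounds)]
    return statistics.pstdev(estimates)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def find_sample_size(data, estimator, target_error,
                     init_sizes=(100, 200, 400), max_iters=10, seed=0):
    """Iteratively probe sample sizes until the measured error meets the target.

    Assumes at least two distinct initial sizes, each no larger than len(data).
    Returns (sample_size, measured_error) for the last size probed.
    """
    rng = random.Random(seed)
    log_n, log_err = [], []

    def probe(n):
        # Draw a fresh sample without replacement and measure its error.
        return bootstrap_error(rng.sample(data, n), estimator, rng)

    n, err = None, None
    for n in init_sizes:
        err = probe(n)
        if err <= target_error:
            return n, err
        log_n.append(math.log(n))
        log_err.append(math.log(err))

    for _ in range(max_iters):
        a, b = fit_line(log_n, log_err)
        if b < 0:
            # Solve a + b * log(n) = log(target_error) for n, capped at len(data).
            log_next = (math.log(target_error) - a) / b
            n_next = math.ceil(math.exp(min(log_next, math.log(len(data)))))
        else:
            # The fitted model is not decreasing, so fall back to doubling.
            n_next = 2 * n
        n = max(2, min(n_next, len(data)))
        err = probe(n)
        if err <= target_error or n == len(data):
            return n, err
        log_n.append(math.log(n))
        log_err.append(math.log(err))
    return n, err
```

For example, calling `find_sample_size(values, statistics.fmean, target_error=0.2)` on a list of numeric `values` returns a sample size together with the error measured at that size. Because the measured errors are themselves noisy, the result is an estimate of the minimal size rather than a guarantee.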
Findings
L^2Miss achieves accurate sample size estimation across multiple query types.
The framework balances statistical accuracy and computational efficiency.
Empirical results demonstrate significant improvements over existing methods.
Abstract
Sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query under given error constraints in the general case, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally, a solution to the SSO problem achieves statistical accuracy, computational efficiency, and broad applicability all at the same time. Existing approaches either make idealistic assumptions about the statistical properties of the query or disregard them completely. This may result in overemphasizing one of the three goals while neglecting the others. To overcome these limitations, we first carefully examine the statistical properties shared by common analytical queries. Then, based on these properties, we propose a linear model…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Data Stream Mining Techniques
