Index-Assisted Stratified Sampling for Online Aggregation

Yunnan Yu; Zhuoyue Zhao

arXiv:2604.28141·cs.DB·May 1, 2026

TL;DR

This paper introduces an index-assisted stratified sampling method for online aggregation that reduces sampling costs and improves query latency, especially for high-variance data.

Contribution

It proposes a two-phase sampling framework with optimal stratification and allocation strategies, addressing challenges in applying classic stratified sampling to index-assisted systems.

Findings

1. Achieves up to 3x speedup over uniform sampling.

2. Demonstrates up to 98708x speedup compared to scan-based stratified sampling.

3. Proves optimal stratification and sample size allocation strategies.

Abstract

Ad-hoc queries over frequently updated data in a flat schema are common in real-time data analysis applications and often require very low latency. Online aggregation can achieve this by providing approximate aggregation answers with confidence bound guarantees. It relies on the ability to draw samples online in time linear in the sample size rather than the database size, which can be supported by index-assisted Sampling-based Approximate Query Processing (S-AQP) systems. However, query latencies of approximate queries in these systems can still suffer from the excessive sampling cost required to achieve a desired confidence bound, because data with high variance in value distribution and selectivity requires a larger sample size. Classic stratified sampling methods with Neyman allocation can minimize sample size in theory, but several challenges prevent them from being applicable in index-assisted…
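
For readers unfamiliar with the Neyman allocation mentioned in the abstract, the following is a minimal sketch of the classic textbook rule, not the paper's method. It assumes the stratum sizes and per-stratum standard deviations are already known, ignores the finite population correction, and returns fractional sample sizes that a real system would have to round. The function names and the example strata are hypothetical.

```python
def neyman_allocation(sizes, stddevs, total_samples):
    """Classic Neyman allocation: stratum h gets a share proportional to N_h * S_h.

    sizes         -- number of rows N_h in each stratum (assumed known)
    stddevs       -- standard deviation S_h of the value in each stratum (assumed known)
    total_samples -- overall sample budget n
    Returns fractional sample sizes; rounding is left to the caller.
    """
    weights = [n * s for n, s in zip(sizes, stddevs)]
    total = sum(weights)
    if total == 0:
        # Every stratum is constant: fall back to proportional allocation.
        population = sum(sizes)
        return [total_samples * n / population for n in sizes]
    return [total_samples * w / total for w in weights]


def proportional_allocation(sizes, total_samples):
    """Baseline: sample each stratum in proportion to its size."""
    population = sum(sizes)
    return [total_samples * n / population for n in sizes]


def stratified_mean_variance(sizes, stddevs, allocation):
    """Variance of the stratified mean estimator, sum of W_h^2 * S_h^2 / n_h.

    Assumes sampling with replacement (no finite population correction).
    """
    population = sum(sizes)
    variance = 0.0
    for n, s, a in zip(sizes, stddevs, allocation):
        if s == 0:
            continue  # a constant stratum contributes no variance
        variance += (n / population) ** 2 * s ** 2 / a
    return variance


# Hypothetical example: a large low-variance stratum and a small high-variance one.
sizes = [9000, 1000]
stddevs = [1.0, 50.0]
budget = 1000

neyman = neyman_allocation(sizes, stddevs, budget)
proportional = proportional_allocation(sizes, budget)

print("Neyman allocation:      ", neyman)
print("Proportional allocation:", proportional)
print("Variance (Neyman):      ", stratified_mean_variance(sizes, stddevs, neyman))
print("Variance (proportional):", stratified_mean_variance(sizes, stddevs, proportional))
```

In this example the rule shifts most of the budget to the small, high-variance stratum, which lowers the estimator variance for the same total number of samples. That is the sense in which Neyman allocation minimizes the sample size needed for a given confidence bound.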
