Approximating quantiles in very large datasets
Reza Hosseini

TL;DR
This paper introduces a new algorithm for approximating quantiles in extremely large datasets, overcoming memory and computational limitations, with adjustable precision and demonstrated effectiveness.
Contribution
The paper develops a novel quantile approximation algorithm based on data partitioning, improving accuracy and customization over existing median-of-medians methods.
Findings
The proposed algorithm approximates quantiles in datasets too large to load and sort in memory, up to petabyte scale.
It offers deterministic precision and can be tailored to specific accuracy requirements.
The median-of-medians approach performs poorly for large datasets, motivating the new method.
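A small illustration of why the median of medians can mislead, assuming partitions of unequal size. The data and the helper name `median_of_medians` are made up for this example and are not taken from the paper.

```python
import statistics

def median_of_medians(partitions):
    """Take the median of each partition, then the median of those medians."""
    return statistics.median([statistics.median(p) for p in partitions])

# Hypothetical data: the integers 1..21 split into three unequal partitions.
partitions = [[1, 2, 3], [4, 5, 6], list(range(7, 22))]

estimate = median_of_medians(partitions)  # per-partition medians are 2, 5, 14
all_values = [x for p in partitions for x in p]
true_median = statistics.median(all_values)

# Fraction of the data at or below the estimate, i.e. the quantile it really hits.
achieved_quantile = sum(x <= estimate for x in all_values) / len(all_values)

print(estimate, true_median, round(achieved_quantile, 2))  # 5 11 0.24
```

Here the estimate lands near the 24th percentile rather than the 50th, because the small partitions carry as much weight as the large one.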
Abstract
Very large datasets are often encountered in climatology, arising either from a multiplicity of observations over time and space or from the outputs of deterministic models (sometimes in petabytes; 1 petabyte = 1 million gigabytes). Loading a large data vector and sorting it is sometimes impossible because of limits on memory or computing power. We show that a previously proposed algorithm for approximating the median, the "median of medians", performs poorly. Instead, we develop an algorithm to approximate quantiles of very large datasets that works by partitioning the data or by using existing partitions (possibly of unequal size). We show the deterministic precision of this algorithm and how it can be adjusted to obtain customized precision.
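The abstract does not spell out the algorithm, so the following is only a minimal sketch of one generic partition-based approach, not the paper's method. It assumes each partition fits in memory on its own, summarizes it by a few weighted order statistics, and reads the quantile off the merged summaries. The names `summarize`, `approx_quantile`, and the parameter `m` (summary points kept per partition) are hypothetical.

```python
def summarize(partition, m):
    """Sort one partition, cut it into at most m chunks of near-equal size,
    and keep (largest value in chunk, number of values in chunk)."""
    values = sorted(partition)
    n = len(values)
    m = min(m, n)
    summary, start = [], 0
    for k in range(1, m + 1):
        end = (k * n) // m          # exclusive end index of the k-th chunk
        summary.append((values[end - 1], end - start))
        start = end
    return summary

def approx_quantile(partitions, p, m=100):
    """Approximate the p-quantile from per-partition summaries only."""
    points = sorted(pt for part in partitions for pt in summarize(part, m))
    total = sum(weight for _, weight in points)
    if total == 0:
        raise ValueError("no data")
    target = p * total
    cumulative = 0
    for value, weight in points:
        cumulative += weight
        if cumulative >= target:
            return value
    return points[-1][0]

# Hypothetical data: the integers 1..21 in three partitions of unequal size.
partitions = [[1, 2, 3], [4, 5, 6], list(range(7, 22))]
print(approx_quantile(partitions, 0.5, m=3))  # 11
```

In this sketch, each summary point undercounts the rank of its value by less than one chunk per partition, so raising `m` tightens the approximation at the cost of larger summaries. With `m=1` the same call returns 21, which shows how coarse summaries degrade the answer.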
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Bayesian Modeling and Causal Inference
