A Quick and Exact Method for Distributed Quantile Computation
Ivan Cao, Jaromir J. Saloni, David A. G. Harrison

TL;DR
GK Select is an exact distributed quantile algorithm for Spark that avoids full-data shuffles. It matches the executor-side time complexity of the approximate GK Sketch and outperforms Spark's full sort by roughly 10.5x on 10^9 values.
Contribution
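Introduces GK Select, a novel exact quantile algorithm for Spark. It uses GK Sketch to find a near-target pivot, extracts the candidate values around that pivot from each partition in linear time, and so avoids the shuffle cost of a global sort.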
Findings
Achieves an approximately 10.5x speedup over Spark's full sort on 10^9 values (120 partitions, 30-core AWS EMR cluster)
Matches the executor-side time complexity of GK Sketch while returning the exact quantile
Completes in a constant number of Spark actions
Abstract
Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster.
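The pivot-then-extract idea in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: partitions are plain Python lists rather than Spark partitions, the function names (`approx_pivot`, `exact_quantile`) are hypothetical, a sorted random sample stands in for the GK Sketch, `heapq` selection (O(n log d), not strictly linear) stands in for the linear-time extraction step, and a pairwise fold stands in for a tree reduction. The quantile is taken as the element at 0-based rank `int(q * n)`, which is also an assumption.

```python
import heapq
import random
from functools import reduce


def approx_pivot(partitions, q, sample_size=1000, seed=0):
    # Assumption: a sorted uniform sample stands in for the GK Sketch.
    rng = random.Random(seed)
    flat = [x for part in partitions for x in part]
    sample = sorted(rng.sample(flat, min(sample_size, len(flat))))
    return sample[min(int(q * len(sample)), len(sample) - 1)]


def exact_quantile(partitions, q):
    n = sum(len(part) for part in partitions)
    k = min(int(q * n), n - 1)  # 0-based target rank (assumed convention)
    pivot = approx_pivot(partitions, q)

    # Pass 1: each partition counts values below and equal to the pivot.
    less = sum(sum(1 for x in part if x < pivot) for part in partitions)
    equal = sum(sum(1 for x in part if x == pivot) for part in partitions)

    # The pivot itself occupies ranks [less, less + equal).
    if less <= k < less + equal:
        return pivot

    if k < less:
        # Target is the d-th largest value strictly below the pivot.
        d = less - k
        select = heapq.nlargest
        keep = lambda x: x < pivot
    else:
        # Target is the d-th smallest value strictly above the pivot.
        d = k - (less + equal) + 1
        select = heapq.nsmallest
        keep = lambda x: x > pivot

    # Pass 2: each partition keeps at most d candidates next to the pivot.
    candidates = [select(d, (x for x in part if keep(x))) for part in partitions]

    # Pairwise merge of candidate sets, truncating to d after every merge.
    merged = reduce(lambda a, b: select(d, a + b), candidates)
    return merged[-1]  # d-th closest value to the pivot on the target side


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.randint(0, 10**6) for _ in range(10_000)]
    parts = [data[i::4] for i in range(4)]
    print(exact_quantile(parts, 0.5) == sorted(data)[len(data) // 2])
```

The better the pivot, the smaller `d` becomes, so each partition contributes only a handful of candidates and no global sort or full shuffle of the data is needed.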
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
