Chasing Similarity: Distribution-aware Aggregation Scheduling (Extended Version)
Feilong Liu, Ario Salmasi, Spyros Blanas, Anastasios Sidiropoulos

TL;DR
This paper introduces GRASP, a distribution-aware aggregation scheduling protocol. GRASP splits parallel aggregation into phases and aggregates similar partitions together, which reduces network communication, especially for high-cardinality data.
Contribution
The paper formulates a performance model for parallel aggregation, proves that finding optimal aggregation plans is NP-hard (assuming the Small Set Expansion conjecture), and proposes GRASP, a distribution-aware scheduling protocol that outperforms both repartition-based aggregation and LOOM.
Findings
GRASP reduces data transmission by aggregating similar partitions.
GRASP outperforms repartition-based aggregation by 3.5x.
GRASP outperforms LOOM by 2.0x.
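The first finding can be illustrated with a small sketch. It is not the authors' algorithm: Jaccard similarity over key sets, the greedy pairing rule, and the names `jaccard`, `greedy_phase`, and `run_phase` are all assumptions made for illustration, and the sketch ignores the network bandwidth that a real scheduler would need to consider.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two key sets (an assumed metric, not taken from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def greedy_phase(partitions):
    """Pair partitions in descending order of similarity; each node is used at most once.

    `partitions` maps a (hypothetical) node name to the set of group-by keys it holds.
    Returns a list of (sender, receiver) pairs for one phase.
    """
    ranked = sorted(
        combinations(sorted(partitions), 2),
        key=lambda pair: jaccard(partitions[pair[0]], partitions[pair[1]]),
        reverse=True,
    )
    busy, schedule = set(), []
    for a, b in ranked:
        if a not in busy and b not in busy:
            schedule.append((a, b))
            busy.update((a, b))
    return schedule

def run_phase(partitions, schedule):
    """Apply one phase: the receiver absorbs the sender's keys and the sender drops out."""
    merged = dict(partitions)
    for sender, receiver in schedule:
        merged[receiver] = merged[receiver] | merged.pop(sender)
    return merged

partitions = {
    "n0": {1, 2, 3, 4},
    "n1": {1, 2, 3, 5},
    "n2": {7, 8, 9},
    "n3": {7, 8, 10},
}
schedule = greedy_phase(partitions)
after = run_phase(partitions, schedule)
print(schedule)
print(sum(len(keys) for keys in after.values()))
```

In this toy input, pairing the similar partitions leaves 9 distinct keys across the two surviving nodes. Pairing `n0` with `n2` and `n1` with `n3` instead would leave 14, so more data would have to move in the next phase.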
Abstract
Parallel aggregation is a ubiquitous operation in data analytics that is expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow. Parallel aggregation starts with an optional local pre-aggregation step and then repartitions the intermediate result across the network. While local pre-aggregation works well for low-cardinality aggregations, the network communication cost remains significant for high-cardinality aggregations even after local pre-aggregation. The problem is that the repartition-based algorithm for high-cardinality aggregation does not fully utilize the network. In this work, we first formulate a mathematical model that captures the performance of parallel aggregation. We prove that finding optimal aggregation plans from a known data distribution is NP-hard, assuming the Small Set Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling…
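The baseline the abstract describes, local pre-aggregation followed by repartitioning, can be sketched as follows. This is a minimal single-process simulation under stated assumptions: integer keys, a SUM aggregate, and routing by `key % num_nodes` are illustrative choices, and `pre_aggregate` and `repartition` are hypothetical names rather than anything from the paper.

```python
from collections import Counter

def pre_aggregate(rows):
    """Local step: collapse (key, value) rows on one node into one partial sum per key."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def repartition(local_results, num_nodes):
    """Network step: route each partial sum to node key % num_nodes and merge there.

    Returns the final per-node results and the number of tuples that crossed the network.
    """
    final = [Counter() for _ in range(num_nodes)]
    sent = 0
    for src, partial in enumerate(local_results):
        for key, value in partial.items():
            dst = key % num_nodes  # assumed routing rule for this sketch
            if dst != src:
                sent += 1
            final[dst][key] += value
    return [dict(c) for c in final], sent

# Low cardinality: 2 distinct keys, 8 input rows.
low = [[(0, 1), (0, 1), (1, 1), (1, 1)], [(0, 1), (1, 1), (1, 1), (0, 1)]]
# High cardinality: 8 distinct keys, 8 input rows.
high = [[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1)]]

low_final, low_sent = repartition([pre_aggregate(r) for r in low], 2)
high_final, high_sent = repartition([pre_aggregate(r) for r in high], 2)
print(low_sent, high_sent)
```

With the same number of input rows, the low-cardinality input sends 2 tuples after pre-aggregation while the high-cardinality input sends 4, because pre-aggregation cannot combine rows whose keys are all distinct.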
Taxonomy
Topics: Cloud Computing and Resource Management · Data Management and Algorithms · Data Stream Mining Techniques
