The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines
Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, David, Garc\'ia-Soriano, Nicolas Kourtellis, Marco Serafini

TL;DR
This paper introduces Partial Key Grouping (PKG), a novel load balancing scheme for distributed stream processing that significantly improves load distribution, throughput, and latency by adapting the 'power of two choices' technique.
Contribution
The paper presents PKG, a scalable partitioning scheme that combines key splitting and local load estimation to outperform traditional methods in distributed stream processing.
Findings
PKG reduces load imbalance by up to several orders of magnitude.
PKG achieves nearly-perfect load balance in experiments.
Deployment on a Storm cluster shows up to 60% throughput improvement and 45% latency reduction.
Abstract
We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
