Building Graphs at a Large Scale: Union Find Shuffle
Saigopal Thota, Mridul Jain, Nishad Kamat, Saikiran Malikireddy,, Pruthvi Raj Eranti, Albin Kuruvilla

TL;DR
This paper introduces Union Find Shuffle, a scalable distributed algorithm for building large-scale connected components in graphs, capable of handling billions of nodes and links efficiently on cloud infrastructure.
Contribution
The paper presents a novel distributed algorithm, Union Find Shuffle, with path compression, optimized for large-scale graph connectivity tasks on commodity cloud systems.
Findings
Successfully scaled to 75 billion nodes and 60 billion links.
Achieved seamless performance with skewed data and large components.
Demonstrated efficiency compared to similar approaches.
Abstract
Large scale graph processing using distributed computing frameworks is becoming pervasive and efficient in the industry. In this work, we present a highly scalable and configurable distributed algorithm for building connected components, called Union Find Shuffle (UFS) with Path Compression. The scale and complexity of the algorithm are a function of the number of partitions into which the data is initially partitioned, and the size of the connected components. We discuss the complexity and the benchmarks compared to similar approaches. We also present current benchmarks of our production system, running on commodity out-of-the-box cloud Hadoop infrastructure, where the algorithm was deployed over a year ago, scaled to around 75 Billion nodes and 60 Billions linkages (and growing). We highlight the key aspects of our algorithm which enable seamless scaling and performance even in the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
