PAT: a new algorithm for all-gather and reduce-scatter operations at scale
Sylvain Jeaugey

TL;DR
The paper introduces PAT, an algorithm for all-gather and reduce-scatter operations that needs only a logarithmic number of network transfers for small operations, improving on the ring algorithm's linear latency at large scale and for small message sizes.
Contribution
It presents the Parallel Aggregated Trees (PAT) algorithm, which works on any number of ranks and requires only a logarithmic amount of internal buffers.
Findings
Logarithmic number of network transfers for small operations
Reduced long-distance communication
Buffer requirements independent of the total operation size
Abstract
This paper describes a new algorithm called PAT, for Parallel Aggregated Trees, which can be used to implement all-gather and reduce-scatter operations. The algorithm works on any number of ranks, uses a logarithmic number of network transfers for small operations, minimizes long-distance communication, and requires a logarithmic amount of internal buffers, independent of the total operation size. It is aimed at improving the performance of the NCCL library in cases where the ring algorithm is inefficient, since the ring's linear latency performs poorly for small sizes and/or at scale.
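To make the linear-versus-logarithmic contrast concrete, here is a minimal Python sketch that counts communication steps. It does not implement PAT, whose details are not given in this summary. As a stand-in it simulates the classic Bruck all-gather, a well-known algorithm that also works on any number of ranks in a logarithmic number of steps. The function names `ring_steps` and `bruck_allgather` are illustrative assumptions, not NCCL APIs.

```python
def ring_steps(n_ranks):
    """Steps for a ring all-gather: each block travels one hop per step."""
    return max(n_ranks - 1, 0)


def bruck_allgather(n_ranks):
    """Simulate a Bruck all-gather (a stand-in, NOT the PAT algorithm).

    Returns (data, steps), where data[r] lists the block ids held by
    rank r at the end, and steps is the number of communication steps.
    """
    # Rank r starts with only its own block.
    data = [[r] for r in range(n_ranks)]
    steps = 0
    dist = 1
    while dist < n_ranks:
        # Each rank already holds `dist` blocks; the last step may be partial
        # when n_ranks is not a power of two.
        count = min(dist, n_ranks - dist)
        new_data = []
        for r in range(n_ranks):
            src = (r + dist) % n_ranks  # peer at distance `dist`
            new_data.append(data[r] + data[src][:count])
        data = new_data
        dist *= 2
        steps += 1
    return data, steps


if __name__ == "__main__":
    for n in (8, 100, 1024):
        _, steps = bruck_allgather(n)
        print(n, ring_steps(n), steps)
```

For 1024 ranks the ring needs 1023 steps while the logarithmic scheme needs 10, which is the gap the abstract refers to for small, latency-bound operations. Note that in this stand-in the amount of data sent per step doubles, so it illustrates the step count only, not PAT's buffer or long-distance-traffic properties.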
Taxonomy
Topics: Radar Systems and Signal Processing · Neural Networks and Applications · Target Tracking and Data Fusion in Sensor Networks
