Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

TL;DR
This paper introduces a bandwidth-optimal Allgather algorithm that uses hardware multicast and SmartNIC offloading to make distributed AI training more efficient, cutting traffic by 2x on a 188-node testbed and scaling to 1.6 Tbit/s links.
Contribution
The paper presents a novel multicast-based Allgather algorithm, built on a constant-time reliable Broadcast protocol, and offloads it to a SmartNIC so that the host no longer runs the protocol.
Findings
2x traffic reduction on 188-node testbed
Scales to 1.6 Tbit/s links
Uses multicast for optimal Allgather scheduling
Abstract
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective…
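To illustrate why multicast relieves injection bandwidth, here is a minimal sketch that compares how many bytes each node injects under a ring Allgather and under an idealized multicast Allgather. The function names and the cost model (lossless multicast, replication done entirely in the switches, one equal-sized shard per node) are assumptions made for illustration and are not taken from the paper. The sketch does not attempt to reproduce the 2x traffic reduction reported for the 188-node testbed.

```python
from typing import List


def ring_allgather_injected_bytes(num_nodes: int, shard_bytes: int) -> int:
    """Bytes one node injects in a ring Allgather.

    Each node forwards one shard per step for (num_nodes - 1) steps.
    """
    return (num_nodes - 1) * shard_bytes


def multicast_allgather_injected_bytes(num_nodes: int, shard_bytes: int) -> int:
    """Bytes one node injects when every shard is sent once via multicast.

    Idealized assumption: the network replicates the packet, so the sender
    injects its shard a single time regardless of num_nodes, and no
    retransmissions are needed.
    """
    return shard_bytes


def simulate_multicast_allgather(shards: List[bytes]) -> List[List[bytes]]:
    """Functional model: every node broadcasts its shard to all nodes.

    Returns one receive buffer per node, with shards stored in rank order.
    """
    num_nodes = len(shards)
    buffers: List[List[bytes]] = [[b""] * num_nodes for _ in range(num_nodes)]
    for src, shard in enumerate(shards):
        for dst in range(num_nodes):
            buffers[dst][src] = shard  # switch-side replication, modeled as a copy
    return buffers


if __name__ == "__main__":
    nodes, shard = 188, 1 << 20  # 188 nodes, 1 MiB shard per node (illustrative)
    print("ring injects per node:     ", ring_allgather_injected_bytes(nodes, shard))
    print("multicast injects per node:", multicast_allgather_injected_bytes(nodes, shard))
```

Under this model the per-node injected volume of the multicast variant stays constant as the node count grows, which leaves injection bandwidth for a concurrent operation such as Reduce-Scatter. Every node still receives all remote shards in both cases.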
Taxonomy
Topics: IoT and Edge/Fog Computing
