Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov; Salvatore Di Girolamo; Marcin Chrapek; Rami Nudelman; Gil Bloch; Torsten Hoefler

arXiv:2408.13356 · cs.DC · November 12, 2024 · 2 cites

Open Access

TL;DR

This paper introduces a bandwidth-optimal Allgather algorithm that uses hardware multicast and SmartNIC offloading to make communication in distributed AI training more efficient, reducing traffic by 2x on a 188-node testbed and scaling to 1.6 Tbit/s links.

Contribution

The paper presents a novel multicast-based Allgather algorithm, built on a constant-time reliable Broadcast protocol, and offloads it to a SmartNIC so that the host does not have to run the protocol.

Findings

01. 2x traffic reduction on a 188-node testbed
02. Scales to 1.6 Tbit/s links
03. Uses multicast for optimal Allgather scheduling
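
The 2x figure above is a measured result. As a rough intuition for where a factor near 2 can come from, the following sketch compares a textbook ring Allgather with an idealized multicast Allgather. It assumes that traffic is counted per node as bytes sent plus bytes received, that all shards have equal size, and that multicast needs no retransmissions. These assumptions and all function names are illustrative and are not taken from the paper, whose accounting may differ.

```python
def ring_allgather_traffic(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node bytes for a textbook ring Allgather.

    The ring runs for (num_nodes - 1) steps; in each step every node
    forwards one shard to its neighbour and receives one shard.
    """
    steps = num_nodes - 1
    return {"sent": steps * shard_bytes, "received": steps * shard_bytes}


def multicast_allgather_traffic(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node bytes for an idealized multicast Allgather (assumption).

    Each node injects its own shard once; the network replicates it,
    so the node still receives the other (num_nodes - 1) shards.
    """
    return {"sent": shard_bytes, "received": (num_nodes - 1) * shard_bytes}


def total_bytes(traffic: dict) -> int:
    """Sum of sent and received bytes for one node."""
    return traffic["sent"] + traffic["received"]


def traffic_ratio(num_nodes: int, shard_bytes: int) -> float:
    """Ring traffic divided by multicast traffic under this simple model."""
    ring = total_bytes(ring_allgather_traffic(num_nodes, shard_bytes))
    mcast = total_bytes(multicast_allgather_traffic(num_nodes, shard_bytes))
    return ring / mcast


if __name__ == "__main__":
    print(traffic_ratio(num_nodes=188, shard_bytes=1 << 20))
```

Under this model the ratio is 2(P - 1)/P for P nodes, which is 374/188 for 188 nodes, just under 2. The injected (sent) bytes alone drop by a factor of P - 1, which is the part that matters when several collectives compete for injection bandwidth.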

Abstract

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective…
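
The idea of using Broadcast as a building block for Allgather can be illustrated with a small in-memory simulation. This is a minimal sketch, not the paper's protocol: it does not model reliability, scheduling, or SmartNIC offloading, and `multicast_broadcast` and `allgather_via_broadcast` are hypothetical names introduced only for this example.

```python
from typing import List


def multicast_broadcast(root: int, payload: bytes, num_ranks: int) -> List[bytes]:
    """Model a broadcast: every rank ends up holding the root's payload.

    Hypothetical stand-in for a hardware multicast send, where the root
    transmits once and the network delivers a copy to each rank. The
    root index is unused here; it only documents who is sending.
    """
    return [payload for _ in range(num_ranks)]


def allgather_via_broadcast(shards: List[bytes]) -> List[List[bytes]]:
    """Compose an Allgather from one broadcast per rank.

    shards[r] is the local shard of rank r. The result holds one receive
    buffer per rank, and each buffer contains all shards in rank order.
    """
    num_ranks = len(shards)
    recv_buffers = [[b""] * num_ranks for _ in range(num_ranks)]
    for root, shard in enumerate(shards):
        delivered = multicast_broadcast(root, shard, num_ranks)
        for rank, data in enumerate(delivered):
            # Each rank stores the shard in the slot indexed by its sender.
            recv_buffers[rank][root] = data
    return recv_buffers


if __name__ == "__main__":
    buffers = allgather_via_broadcast([b"shard0", b"shard1", b"shard2"])
    for rank, buf in enumerate(buffers):
        print(rank, buf)
```

After the loop, every rank holds the same ordered list of shards, which is the Allgather postcondition. In this sketch the broadcasts run one after another; how they are ordered or overlapped in practice is a scheduling decision that the simulation leaves out.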

Taxonomy

Topics: IoT and Edge/Fog Computing