Load Balancing for AI Training Workloads
Sarah McClure, Evyatar Cohen, Alex Shpiner, Mark Silberstein, Sylvia Ratnasamy, Scott Shenker, Isaac Keslassy

TL;DR
This paper systematically evaluates load-balancing strategies for AI training workloads, revealing that packet spraying approaches outperform traditional methods and introducing Ofan, a switch-based implementation that improves performance under network failures.
Contribution
It provides a comprehensive evaluation of load-balancing designs, compares host- and switch-based approaches, and introduces Ofan, a novel switch-based implementation of destination rotation.
Findings
Packet spraying outperforms traditional approaches that load balance at flow, flowlet, or subflow granularity.
Host-based packet spraying is more resilient to link failures.
Ofan, a switch-based implementation of destination rotation, improves performance under network failures.
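To make the granularity distinction in these findings concrete, the following is a minimal toy sketch, not the paper's evaluation setup and not Ofan. It compares per-path packet counts when each flow is pinned to one path by a hash (flow-level balancing) versus when each flow's packets are spread round-robin over all paths (packet spraying). The function names, the CRC32 hash, the flow sizes, and the four-path topology are all illustrative assumptions.

```python
import zlib


def flow_hash_loads(flows, num_paths):
    """Flow-level balancing: all packets of a flow take the path chosen by hashing its ID."""
    loads = [0] * num_paths
    for flow_id, num_packets in flows.items():
        path = zlib.crc32(flow_id.encode()) % num_paths  # hypothetical hash choice
        loads[path] += num_packets
    return loads


def spray_loads(flows, num_paths):
    """Packet spraying: each flow's packets are spread round-robin across all paths."""
    loads = [0] * num_paths
    for flow_id, num_packets in flows.items():
        start = zlib.crc32(flow_id.encode()) % num_paths  # per-flow starting path
        for i in range(num_packets):
            loads[(start + i) % num_paths] += 1
    return loads


# Hypothetical workload: four equal-sized flows over four equal-cost paths.
flows = {f"flow-{i}": 1000 for i in range(4)}
hashed = flow_hash_loads(flows, num_paths=4)
sprayed = spray_loads(flows, num_paths=4)

print("flow-hash per-path load:", hashed)
print("spraying per-path load: ", sprayed)
```

In this toy setting, spraying places exactly the average load on every path, while hashing can do no better than the average on its busiest path and does worse whenever two flows collide. The sketch ignores packet reordering, congestion control, and link failures, all of which matter in the designs the paper evaluates.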
Abstract
The extreme bandwidth demands of AI training have made load-balancing a critical component in AI fabrics, and a variety of load-balancing designs have emerged in recent work from both industry and research. However, there is currently little consensus on which design approach dominates or the conditions under which an approach dominates. We also lack an understanding of how far these approaches are from optimal. We provide a technical foundation for answering these questions by systematically evaluating leading load-balancing designs, while decoupling them from specific congestion control and loss recovery stacks. We find that load-balancing based on packet spraying dominates traditional approaches that load balance traffic at flow, flowlet, or subflow granularities. When comparing host- vs switch-based approaches to packet spraying, we find that they perform similarly in failure-free…
