Load Balancing for AI Training Workloads

Sarah McClure; Evyatar Cohen; Alex Shpiner; Mark Silberstein; Sylvia Ratnasamy; Scott Shenker; Isaac Keslassy

arXiv:2507.21372 · cs.NI · February 16, 2026

TL;DR

This paper systematically evaluates load-balancing strategies for AI training workloads, revealing that packet spraying approaches outperform traditional methods and introducing Ofan, a switch-based implementation that improves performance under network failures.

Contribution

The paper systematically evaluates leading load-balancing designs, compares host- and switch-based approaches to packet spraying, and introduces Ofan, a novel switch-based implementation of destination rotation.

Findings

01. Packet spraying outperforms traditional load balancing at flow, flowlet, or subflow granularity.

02. Host-based packet spraying is more resilient to link failures.

03. Ofan, a switch-based implementation of destination rotation, improves performance under network failures.
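
To illustrate why per-packet spraying can balance load more evenly than flow-granularity hashing, here is a minimal toy sketch. It is not the paper's evaluation setup: the function names, the CRC32 hash, the round-robin spraying order, and the flow and path counts are all assumptions chosen for illustration. It also ignores packet reordering, congestion control, and failures.

```python
import zlib


def flow_hash_loads(flows, num_paths, packets_per_flow):
    """Flow-level balancing: hash the flow id once, send every packet on that path."""
    loads = [0] * num_paths
    for flow_id in flows:
        # CRC32 is a stand-in hash (assumption), not what any real switch uses.
        path = zlib.crc32(flow_id.encode()) % num_paths
        loads[path] += packets_per_flow
    return loads


def packet_spray_loads(flows, num_paths, packets_per_flow):
    """Packet spraying: each packet of a flow takes the next path in round-robin order."""
    loads = [0] * num_paths
    for index, _flow_id in enumerate(flows):
        for seq in range(packets_per_flow):
            # Offset by the flow index so flows do not all start on path 0.
            loads[(index + seq) % num_paths] += 1
    return loads


if __name__ == "__main__":
    # Hypothetical workload: 8 equal-sized flows over 8 equal-cost paths.
    flows = [f"flow-{i}" for i in range(8)]
    hashed = flow_hash_loads(flows, num_paths=8, packets_per_flow=1000)
    sprayed = packet_spray_loads(flows, num_paths=8, packets_per_flow=1000)
    print("flow hashing, busiest path:", max(hashed))
    print("packet spraying, busiest path:", max(sprayed))
```

In this toy setting, spraying places the same number of packets on every path, while flow hashing can only match that if no two flows hash to the same path. The busiest path under hashing is therefore never lighter than under spraying.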

Abstract

The extreme bandwidth demands of AI training have made load balancing a critical component in AI fabrics, and a variety of load-balancing designs have emerged in recent work from both industry and research. However, there is currently little consensus on which design approach dominates or the conditions under which an approach dominates. We also lack an understanding of how far these approaches are from optimal. We provide a technical foundation for answering these questions by systematically evaluating leading load-balancing designs, while decoupling them from specific congestion control and loss recovery stacks. We find that load balancing based on packet spraying dominates traditional approaches that load balance traffic at flow, flowlet, or subflow granularities. When comparing host- vs switch-based approaches to packet spraying, we find that they perform similarly in failure-free…
