Congestion-Aware Path Selection for Load Balancing in AI Clusters
Erfan Nosrati, Majid Ghaderi

TL;DR
Hopper is a host-level load balancing technique for RDMA networks in AI clusters that dynamically switches paths to reduce congestion, improving training efficiency without requiring specialized hardware.
Contribution
This paper introduces Hopper, a novel congestion-aware load balancing method optimized for RDMA traffic in AI clusters, operating without specialized hardware or switch modifications.
Findings
Reduces average flow completion time by up to 20%
Decreases 99th-percentile tail flow completion time by up to 14%
Requires no specialized hardware or switch modifications
Abstract
Fast training of large machine learning models requires distributed training on AI clusters consisting of thousands of GPUs. The efficiency of distributed training depends crucially on the efficiency of the network interconnecting the GPUs in the cluster. These networks are commonly built using RDMA and follow a Clos-like datacenter topology. To utilize the network bandwidth efficiently, load balancing is employed to distribute traffic across multiple redundant paths. While numerous load-balancing techniques exist for traditional datacenters, they are often either optimized for TCP traffic or require specialized network hardware, which limits their utility in AI clusters. This paper presents the design and evaluation of Hopper, a new load-balancing technique optimized for RDMA traffic in AI clusters. Operating entirely at the host level, Hopper requires no specialized hardware…
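To make the idea of host-level, congestion-aware path switching concrete, here is a minimal Python sketch. It is not Hopper's algorithm, which this summary does not specify. It assumes a host can obtain per-path RTT samples and treats a path as congested when its smoothed RTT exceeds a multiple of an uncongested baseline. The names PathStats and PathSelector, the EWMA weight alpha, and the threshold switch_factor are all hypothetical choices made for illustration. How a host would actually steer an RDMA flow onto a chosen path is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathStats:
    """Smoothed RTT estimate for one path (hypothetical bookkeeping)."""
    ewma_rtt_us: Optional[float] = None

    def update(self, sample_us: float, alpha: float = 0.25) -> None:
        # The first sample initializes the estimate; later samples are blended in.
        if self.ewma_rtt_us is None:
            self.ewma_rtt_us = sample_us
        else:
            self.ewma_rtt_us = (1 - alpha) * self.ewma_rtt_us + alpha * sample_us


class PathSelector:
    """Keeps a flow on its current path until that path looks congested.

    Assumption: congestion is inferred from RTT inflation relative to
    base_rtt_us. This is an illustrative signal, not the paper's stated one.
    """

    def __init__(self, num_paths: int, base_rtt_us: float,
                 switch_factor: float = 1.5) -> None:
        if num_paths < 1:
            raise ValueError("num_paths must be at least 1")
        self.paths: List[PathStats] = [PathStats() for _ in range(num_paths)]
        self.base_rtt_us = base_rtt_us
        self.switch_factor = switch_factor
        self.current = 0

    def record_rtt(self, path: int, sample_us: float) -> None:
        self.paths[path].update(sample_us)

    def _estimate(self, path: int) -> float:
        # Paths with no measurements yet are assumed to be uncongested.
        rtt = self.paths[path].ewma_rtt_us
        return self.base_rtt_us if rtt is None else rtt

    def is_congested(self, path: int) -> bool:
        return self._estimate(path) > self.switch_factor * self.base_rtt_us

    def select(self) -> int:
        """Return the path index the flow should use next."""
        if not self.is_congested(self.current):
            return self.current
        # Move only if another path has a strictly lower estimated RTT.
        best = min(range(len(self.paths)), key=self._estimate)
        if self._estimate(best) < self._estimate(self.current):
            self.current = best
        return self.current
```

A caller would feed measurements with record_rtt and call select before sending the next batch of data. Staying on the current path until it crosses the threshold avoids switching on every small RTT fluctuation, which is one simple way to limit path flapping.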
Taxonomy
Topics: Software-Defined Networks and 5G · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
