Congestion-Aware Path Selection for Load Balancing in AI Clusters

Erfan Nosrati; Majid Ghaderi

arXiv:2506.08132·cs.NI·June 11, 2025

TL;DR

Hopper is a host-level load balancing technique for RDMA networks in AI clusters that dynamically switches paths to reduce congestion, improving training efficiency without requiring specialized hardware.

Contribution

This paper introduces Hopper, a novel congestion-aware load balancing method optimized for RDMA traffic in AI clusters, operating without specialized hardware or switch modifications.

Findings

01  Reduces average flow completion time by up to 20%

02  Decreases 99th-percentile tail flow completion time by up to 14%

03  Operates effectively with no need for hardware changes

Abstract

Fast training of large machine learning models requires distributed training on AI clusters consisting of thousands of GPUs. The efficiency of distributed training depends crucially on the efficiency of the network interconnecting the GPUs in the cluster. These networks are commonly built using RDMA and follow a Clos-like datacenter topology. To utilize the network bandwidth efficiently, load balancing is employed to distribute traffic across multiple redundant paths. While there exist numerous techniques for load balancing in traditional datacenters, these are often either optimized for TCP traffic or require specialized network hardware, thus limiting their utility in AI clusters. This paper presents the design and evaluation of Hopper, a new load-balancing technique optimized for RDMA traffic in AI clusters. Operating entirely at the host level, Hopper requires no specialized hardware…
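
To make the idea of host-level, congestion-aware path switching concrete, here is a minimal Python sketch. It is not Hopper's algorithm, which this summary does not describe in detail. It is a generic illustration under stated assumptions: the host can attribute a round-trip time (RTT) sample to each candidate path, a flow stays on its current path while that path's RTT is at or below a threshold, and once the threshold is exceeded the flow moves to the measured path with the lowest RTT. The class `PathSelector`, its methods, and the parameter `rtt_threshold_us` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PathSelector:
    """Illustrative host-side path selector (hypothetical, not Hopper itself)."""

    num_paths: int              # number of candidate paths (assumed known to the host)
    rtt_threshold_us: float     # assumed congestion threshold, in microseconds
    current: int = 0            # index of the path currently in use
    rtt_us: Dict[int, float] = field(default_factory=dict)  # latest RTT per path

    def record_rtt(self, path: int, rtt_us: float) -> None:
        """Store the most recent RTT sample observed on a path."""
        if not 0 <= path < self.num_paths:
            raise ValueError(f"unknown path {path}")
        self.rtt_us[path] = rtt_us

    def select(self) -> int:
        """Return the path to use next, switching only when congestion is seen."""
        cur_rtt = self.rtt_us.get(self.current)
        # No sample yet, or the current path looks uncongested: stay put.
        if cur_rtt is None or cur_rtt <= self.rtt_threshold_us:
            return self.current
        # Current path looks congested: move to the lowest-RTT measured path,
        # but only if it is actually better than the current one.
        best = min(self.rtt_us, key=self.rtt_us.get)
        if self.rtt_us[best] < cur_rtt:
            self.current = best
        return self.current


selector = PathSelector(num_paths=4, rtt_threshold_us=50.0)
selector.record_rtt(0, 20.0)
first_choice = selector.select()    # path 0 is below the threshold

selector.record_rtt(0, 120.0)       # path 0 becomes congested
selector.record_rtt(1, 30.0)
selector.record_rtt(2, 25.0)
second_choice = selector.select()   # lowest measured RTT wins
```

Staying on the current path until it looks congested avoids needless switching, which matters for RDMA traffic because moving a flow between paths can reorder packets. How a real system maps a path index onto the network, and how it obtains per-path RTT samples, is outside the scope of this sketch.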

Taxonomy

Topics: Software-Defined Networks and 5G · Cloud Computing and Resource Management · IoT and Edge/Fog Computing