When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance
Dinesh Gopalan, Ratul Ali

TL;DR
This paper investigates why distributed GPU training often fails to scale predictably, highlighting network and fabric effects such as topology, congestion, and locality that impact performance beyond small clusters.
Contribution
It provides an empirical analysis of real-world scaling issues, identifying key network and fabric factors affecting distributed GPU training performance.
Findings
Network topology and congestion significantly impact scaling.
Fabric design and communication patterns influence performance variability.
Common failure modes include synchronization issues and contention.
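To illustrate how fabric parameters can come to dominate step time as nodes are added, the sketch below uses the textbook latency-bandwidth (alpha-beta) cost model for a ring all-reduce. It is not the paper's methodology. The function names, the `congestion_factor` knob, the assumption that communication is not overlapped with compute, and all numeric values are hypothetical and chosen only for illustration.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate ring all-reduce time with the standard alpha-beta model.

    A ring all-reduce takes 2 * (p - 1) steps (reduce-scatter, then all-gather),
    and each step moves one chunk of size message_bytes / p.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronize
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def scaling_efficiency(num_gpus, compute_s, message_bytes, latency_s,
                       bandwidth_bytes_per_s, congestion_factor=1.0):
    """Fraction of each step spent on useful compute.

    Assumes communication is NOT overlapped with compute (a pessimistic
    simplification). congestion_factor >= 1 is a hypothetical knob that
    divides the nominal link bandwidth to mimic a contended fabric.
    """
    effective_bw = bandwidth_bytes_per_s / congestion_factor
    comm_s = ring_allreduce_seconds(num_gpus, message_bytes, latency_s, effective_bw)
    return compute_s / (compute_s + comm_s)


if __name__ == "__main__":
    # Illustrative values only: 1 GB of gradients, 10 us per-step latency,
    # 25 GB/s per link, 200 ms of compute per training step.
    for gpus in (1, 8, 64, 512):
        clean = scaling_efficiency(gpus, 0.2, 1e9, 10e-6, 25e9)
        congested = scaling_efficiency(gpus, 0.2, 1e9, 10e-6, 25e9, congestion_factor=4.0)
        print(f"{gpus:4d} GPUs  efficiency={clean:.3f}  congested={congested:.3f}")
```

Even in this simplified model, the latency term grows linearly with the number of GPUs and a congested link lowers efficiency at every scale, which is consistent in spirit with the findings listed above. Real fabrics add topology and locality effects that this model does not capture.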
Abstract
Scaling distributed GPU training is commonly assumed to yield predictable performance gains as additional nodes are added. In practice, many large-scale deployments encounter diminishing returns and unstable behavior well before theoretical limits are reached. This paper examines why scaling fails in real systems, with a focus on the role of network and fabric effects that are often overlooked by higher-level training frameworks. We present an empirical study of distributed GPU training performance across multiple production-scale clusters. Our results show that network topology, congestion dynamics, collective synchronization behavior, and GPU locality frequently dominate end-to-end training performance once workloads move beyond a small number of nodes. Identical models and software stacks can exhibit sharply different scaling characteristics depending on fabric design and runtime…
Taxonomy
Topics: Software System Performance and Reliability · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
