When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance

Dinesh Gopalan; Ratul Ali

arXiv:2603.04424·cs.NI·March 6, 2026

Open Access

TL;DR

This paper investigates why distributed GPU training often fails to scale predictably, highlighting network and fabric effects such as topology, congestion, and locality that impact performance beyond small clusters.

Contribution

It provides an empirical analysis of real-world scaling issues, identifying key network and fabric factors affecting distributed GPU training performance.

Findings

01 Network topology and congestion significantly impact scaling.

02 Fabric design and communication patterns influence performance variability.

03 Common failure modes include synchronization issues and contention.

Abstract

Scaling distributed GPU training is commonly assumed to yield predictable performance gains as additional nodes are added. In practice, many large-scale deployments encounter diminishing returns and unstable behavior well before theoretical limits are reached. This paper examines why scaling fails in real systems, with a focus on the role of network and fabric effects that are often overlooked by higher-level training frameworks. We present an empirical study of distributed GPU training performance across multiple production-scale clusters. Our results show that network topology, congestion dynamics, collective synchronization behavior, and GPU locality frequently dominate end-to-end training performance once workloads move beyond a small number of nodes. Identical models and software stacks can exhibit sharply different scaling characteristics depending on fabric design and runtime…
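
To make the diminishing-returns claim concrete, the sketch below uses the textbook alpha-beta cost model for a ring all-reduce. This is an illustrative assumption, not the paper's methodology: the function names, the parameter values, and the simplification that communication does not overlap with compute are all hypothetical. It only shows how per-step efficiency can fall as GPUs are added and as effective link bandwidth drops under congestion.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, link_bytes_per_s, latency_s):
    """Alpha-beta estimate of one ring all-reduce (hypothetical helper).

    A ring all-reduce takes 2 * (N - 1) steps, and each step moves
    message_bytes / N over one link.
    """
    if num_gpus <= 1:
        return 0.0  # nothing to synchronize on a single GPU
    steps = 2 * (num_gpus - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps * (message_bytes / num_gpus) / link_bytes_per_s
    return latency_term + bandwidth_term


def scaling_efficiency(num_gpus, compute_s, message_bytes, link_bytes_per_s, latency_s):
    """Fraction of a training step spent on compute.

    Assumes no compute/communication overlap, which is pessimistic
    compared with real training frameworks.
    """
    comm_s = ring_allreduce_seconds(num_gpus, message_bytes, link_bytes_per_s, latency_s)
    return compute_s / (compute_s + comm_s)


if __name__ == "__main__":
    # All numbers below are made up for illustration only.
    gradient_bytes = 2e9      # assumed gradient payload per step
    compute_s = 0.5           # assumed compute time per step
    latency_s = 20e-6         # assumed per-step link latency
    for bandwidth in (25e9, 12.5e9):  # nominal vs. congested link (assumed)
        for n in (1, 8, 64, 512):
            eff = scaling_efficiency(n, compute_s, gradient_bytes, bandwidth, latency_s)
            print(f"bandwidth={bandwidth:.3g} B/s  gpus={n:4d}  efficiency={eff:.3f}")
```

In this model the bandwidth term approaches a constant as the GPU count grows while the latency term keeps growing linearly, so efficiency drifts downward with scale; topology and locality effects of the kind the abstract describes are not captured and would need measurement on a real fabric.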

Taxonomy

Topics: Software System Performance and Reliability · Parallel Computing and Optimization Techniques · Distributed Systems and Fault Tolerance