Isolated Scheduling for Distributed Training Tasks in GPU Clusters
Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao

TL;DR
This paper introduces vClos, a novel network topology and communication pattern optimization for GPU clusters to eliminate network contention in distributed machine learning, improving training efficiency and fairness.
Contribution
The paper proposes vClos, a new approach to optimize network topology and communication in GPU clusters, and introduces OCS-vClos with optical switches to further reduce resource fragmentation.
Findings
vClos reduces network contention in GPU clusters.
Experimental results show vClos outperforms existing strategies.
OCS-vClos further improves resource utilization.
Abstract
Distributed machine learning (DML) makes it possible to train large neural networks in a reasonable amount of time. However, because computing power is growing much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters suffer from network contention caused by the hash-collision problem, which not only increases communication overhead but also creates unfairness among jobs and degrades the user experience. In this paper, we first analyze how network contention affects training time in a cluster with 32 NVIDIA V100 GPUs. We then propose vClos, which eliminates network contention by jointly optimizing the network topology and the communication pattern in distributed training. We also propose OCS-vClos, which introduces a layer of optical circuit switches (OCSs) into the leaf-spine network to reduce potential network…
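The hash-collision problem mentioned in the abstract can be illustrated with a small sketch. It is not the paper's method or measurement setup. It assumes a leaf switch that picks an uplink by hashing each flow's identifier, in the style of ECMP, and that the hash behaves uniformly and independently. The names pick_uplink, uplink_loads, and collision_free_probability are hypothetical helpers introduced only for this illustration.

```python
import hashlib
from collections import Counter


def pick_uplink(flow, num_uplinks):
    """Map a flow tuple to an uplink index by hashing it (assumed ECMP-style)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_uplinks


def uplink_loads(flows, num_uplinks):
    """Count how many flows land on each uplink."""
    counts = Counter(pick_uplink(flow, num_uplinks) for flow in flows)
    return [counts.get(i, 0) for i in range(num_uplinks)]


def collision_free_probability(num_flows, num_uplinks):
    """Probability that every flow gets its own uplink,
    assuming uniform, independent hashing."""
    if num_flows > num_uplinks:
        return 0.0
    prob = 1.0
    for i in range(num_flows):
        prob *= (num_uplinks - i) / num_uplinks
    return prob


if __name__ == "__main__":
    # Four hypothetical training flows: (src_ip, dst_ip, src_port, dst_port).
    flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 50000 + i, 4791) for i in range(4)]
    print("flows per uplink:", uplink_loads(flows, 4))
    print("P(no collision), 4 flows on 4 uplinks:", collision_free_probability(4, 4))
```

Under these assumptions, four flows spread over four equal uplinks avoid sharing a link with probability 4!/4^4 = 0.09375, so some link is usually oversubscribed even though aggregate capacity is sufficient. This is the kind of contention that an isolated scheduling scheme aims to rule out by construction rather than by chance.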
Taxonomy
Topics: Advanced Memory and Neural Computing · IoT and Edge/Fog Computing · Brain Tumor Detection and Classification
