Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui

TL;DR
This paper reviews communication optimization in distributed deep neural network training, analyzing current architectures and proposing expanded co-design paradigms to better utilize heterogeneous resources and improve training efficiency.
Contribution
It introduces an extended five-layer paradigm and co-design strategies for communication optimization in distributed training, highlighting cross-layer collaboration opportunities.
Findings
Layers in the current paradigm are relatively independent, which leaves room for cross-layer optimization.
Proposes a five-layer paradigm to improve communication efficiency.
Highlights the potential of heterogeneous resource utilization.
Abstract
The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, making communication a larger portion of the overall training time. Consequently, optimizing communication for distributed training has become crucial. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that…
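The abstract's point that faster computation makes communication a larger share of training time can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes a bandwidth-only ring all-reduce cost model (each worker transfers 2(N-1)/N of the payload), no overlap between computation and communication, and hypothetical payload, link, and compute numbers chosen only for illustration.

```python
def ring_allreduce_seconds(payload_bytes: float, workers: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce.

    Each worker sends (and receives) 2 * (N - 1) / N of the payload.
    Latency and protocol overhead are ignored (simplifying assumption).
    """
    if workers < 2:
        return 0.0  # a single worker has nothing to exchange
    return 2 * (workers - 1) / workers * payload_bytes / link_bytes_per_s


def communication_share(compute_s: float, comm_s: float) -> float:
    """Fraction of one training step spent communicating, assuming no overlap."""
    return comm_s / (compute_s + comm_s)


if __name__ == "__main__":
    payload = 4e9   # hypothetical: 1e9 gradients stored as 4-byte floats
    link = 12.5e9   # hypothetical: 100 Gbit/s link, expressed in bytes/s
    comm = ring_allreduce_seconds(payload, workers=8, link_bytes_per_s=link)

    # Hypothetical per-step compute times, shrinking as accelerators get faster.
    for compute in (2.0, 1.0, 0.5):
        share = communication_share(compute, comm)
        print(f"compute={compute:.1f}s comm={comm:.2f}s share={share:.0%}")
```

With the communication time held fixed, each halving of the compute time raises the communication share, which is the trend the abstract describes. Real systems overlap communication with computation, so these shares are upper bounds under the stated assumptions.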
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
Methods: Lib
