Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu

TL;DR
This survey comprehensively reviews recent algorithms, frameworks, and infrastructures for improving communication efficiency in large-scale distributed deep learning, addressing challenges like fault tolerance, scalability, and heterogeneity.
Contribution
It provides a detailed overview of recent advancements in communication-efficient methods, including algorithms, resource strategies, and infrastructure technologies, with a case study on large language model training.
Findings
Efficient algorithms for model synchronization and data compression.
Strategies for resource allocation and task scheduling.
Impact analysis of communication overhead in large-scale heterogeneous settings.
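To make "data compression" in the findings above concrete, here is a minimal sketch of top-k gradient sparsification, one widely known way to shrink what each worker sends during synchronization. It is an illustration only and is not taken from the survey: the function names `top_k_sparsify` and `densify`, the list-based gradient, and the sample values are all assumptions chosen for the example.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep only the k largest-magnitude gradient entries (hypothetical helper).

    Returns (indices, values); a worker would send these instead of the
    full dense gradient.
    """
    if not 0 < k <= len(grad):
        raise ValueError("k must be between 1 and len(grad)")
    # Rank positions by absolute value, keep the top k, then restore index order.
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    indices = sorted(ranked[:k])
    return indices, [grad[i] for i in indices]


def densify(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild a dense gradient on the receiving side; dropped entries become 0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Illustrative values only: 6 entries compressed to 2 (index, value) pairs.
grad = [0.02, -1.5, 0.3, 0.0, 2.1, -0.01]
idx, vals = top_k_sparsify(grad, k=2)
restored = densify(idx, vals, len(grad))
```

In practice, schemes of this kind usually accumulate the dropped entries locally and add them to later gradients so that the information is delayed rather than lost; that step is omitted here for brevity.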
Abstract
With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and…
