On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers
Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li

TL;DR
This paper provides a comprehensive empirical and theoretical analysis of the performance and memory challenges in distributed training of Transformer models, highlighting the impact of various strategies and architectural considerations.
Contribution
It introduces a specialized analytical framework for Transformer training, compares distributed strategies, and reveals insights into pipeline parallelism advantages and memory overhead issues.
Findings
Pipeline parallelism outperforms data parallelism for Transformers.
Memory overhead can increase with suboptimal model partitioning.
Communication block size and waiting time significantly affect training performance.
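The second finding, on partitioning and memory, can be illustrated with a back-of-the-envelope sketch. This is not the paper's analytical framework. It assumes that only model states count (activations and communication buffers are ignored), that each parameter costs 16 bytes (a common estimate for mixed-precision Adam), and that all layers are equal in size. The helper names `stage_param_counts` and `peak_stage_bytes` and the layer sizes are hypothetical.

```python
from typing import List, Sequence


def stage_param_counts(layer_params: Sequence[int], boundaries: Sequence[int]) -> List[int]:
    """Sum parameters per pipeline stage.

    `boundaries` lists the index of the first layer of every stage after the
    first, e.g. [2, 4, 6] splits 8 layers into [0:2], [2:4], [4:6], [6:8].
    """
    edges = [0, *boundaries, len(layer_params)]
    return [sum(layer_params[a:b]) for a, b in zip(edges, edges[1:])]


def peak_stage_bytes(layer_params: Sequence[int], boundaries: Sequence[int],
                     bytes_per_param: int = 16) -> int:
    """Model-state memory of the most heavily loaded stage.

    bytes_per_param=16 is an assumption (weights, gradients, and optimizer
    states under mixed-precision Adam); activations are ignored.
    """
    return max(stage_param_counts(layer_params, boundaries)) * bytes_per_param


# Hypothetical model: 8 identical layers of 1M parameters each, 4 devices.
layers = [1_000_000] * 8

balanced = peak_stage_bytes(layers, [2, 4, 6])   # stages of 2, 2, 2, 2 layers
skewed = peak_stage_bytes(layers, [1, 2, 3])     # stages of 1, 1, 1, 5 layers
replicated = peak_stage_bytes(layers, [])        # full copy per device, as in plain data parallelism

print(balanced, skewed, replicated)
```

Under these assumptions, the skewed split puts five of the eight layers on one device, so its peak model-state memory is 2.5 times that of the balanced split even though the total parameter count is unchanged.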
Abstract
Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. Yet their deployment is significantly hindered by extensive computational and memory requirements, which necessitate efficient distributed training methods. Prior research has examined the performance bottlenecks of distributed training in order to explain them and suggest optimization directions. However, such analyses often overlook three aspects unique to Transformer models: the specialized architecture, the dependency on various distributed strategies, and the requirement to balance computational and memory overhead. This paper aims to bridge this gap by offering a comprehensive examination of the performance bottlenecks inherent in distributed training of Transformer models, leveraging both…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Privacy-Preserving Technologies in Data
