Rethinking Memory and Communication Cost for Efficient Large Language Model Training
Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao, Zhaoxin Huan, Siyuan Li, Fanzhuang Meng, Lei Liang, Xiaolu Zhang, Jun Zhou

TL;DR
This paper introduces PaRO, a strategy that balances memory consumption against communication cost, and the HO-Ring communication topology. Together they improve the efficiency and scalability of large language model training.
Contribution
The paper proposes PaRO, a novel fine-grained sharding strategy, and the HO-Ring communication topology, to improve training speed and communication efficiency in large language model training.
Findings
PaRO improves training throughput by 1.19x-2.50x over SOTA methods.
HO-Ring enhances communication efficiency by 36.5%.
PaRO achieves near-linear scalability in large language model training.
Abstract
Recently, various distributed strategies for large language model training have been proposed. However, these methods provide limited solutions for the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of large language models, and propose a memory-communication balanced strategy set, the Partial Redundancy Optimizer (PaRO). PaRO provides comprehensive options that reduce the amount and frequency of inter-group communication, at the cost of minor memory redundancy, through a fine-grained sharding strategy, thereby improving training efficiency in various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments…
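The trade-off the abstract describes, accepting some memory redundancy in exchange for less inter-group communication, can be illustrated with a rough per-GPU memory estimate. The sketch below is not the paper's implementation. It assumes the mixed-precision Adam accounting popularized by ZeRO (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and the scope names "none", "intra", and "global" are hypothetical labels chosen here for illustration. It ignores activations and does not model communication volume.

```python
# Rough per-GPU model-state memory under different sharding scopes.
# Assumption: ZeRO-style mixed-precision Adam accounting, in bytes per parameter.
BYTES_PER_PARAM = {"params": 2, "grads": 2, "optim": 12}


def shard_degree(scope: str, world_size: int, group_size: int) -> int:
    """Number of GPUs a state is split across for a (hypothetical) scope label."""
    if scope == "none":    # fully replicated on every GPU
        return 1
    if scope == "intra":   # split inside one group, replicated across groups
        return group_size
    if scope == "global":  # split across every GPU in the job
        return world_size
    raise ValueError(f"unknown sharding scope: {scope!r}")


def per_gpu_bytes(num_params: int, scopes: dict, world_size: int, group_size: int) -> float:
    """Sum the per-GPU share of parameters, gradients, and optimizer states."""
    return sum(
        BYTES_PER_PARAM[state] * num_params / shard_degree(scopes[state], world_size, group_size)
        for state in BYTES_PER_PARAM
    )


if __name__ == "__main__":
    n, world, group = 7_000_000_000, 64, 8  # illustrative sizes, not from the paper
    configs = {
        "everything global": {"params": "global", "grads": "global", "optim": "global"},
        "params/grads intra, optim global": {"params": "intra", "grads": "intra", "optim": "global"},
        "no sharding": {"params": "none", "grads": "none", "optim": "none"},
    }
    for name, scopes in configs.items():
        gib = per_gpu_bytes(n, scopes, world, group) / 2**30
        print(f"{name}: {gib:.2f} GiB per GPU")
```

Sharding a state only within a group keeps a full copy in every group, so it costs more memory than global sharding, but collectives on that state can stay inside the group. Which states to keep partially redundant is the kind of choice a fine-grained strategy set exposes.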
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques
Methods: ZeRO · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
