Rethinking Memory and Communication Cost for Efficient Large Language Model Training

Chan Wu; Hanxiao Zhang; Lin Ju; Jinjing Huang; Youshao Xiao; Zhaoxin Huan; Siyuan Li; Fanzhuang Meng; Lei Liang; Xiaolu Zhang; Jun Zhou

arXiv:2310.06003·cs.LG·October 31, 2023

TL;DR

This paper introduces PaRO, a balanced memory-communication strategy, and HO-Ring topology, significantly enhancing large language model training efficiency and scalability by optimizing communication and memory trade-offs.

Contribution

The paper proposes PaRO, a novel fine-grained sharding strategy, and HO-Ring topology, to improve training speed and communication efficiency in large language models.

Findings

01  PaRO improves training throughput by 1.19x-2.50x over SOTA methods.

02  HO-Ring enhances communication efficiency by 36.5%.

03  Achieves near-linear scalability in training large language models.

Abstract

Recently, various distributed strategies for large language model training have been proposed. However, these methods provide limited solutions for the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication costs on the training speed of large language models, and propose a memory-communication balanced strategy set, the Partial Redundancy Optimizer (PaRO). PaRO provides comprehensive options that reduce the amount and frequency of inter-group communication, at the cost of minor memory redundancy, through a fine-grained sharding strategy, thereby improving training efficiency in various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments…
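The memory side of the trade-off described in the abstract can be illustrated with a small calculation. The sketch below is not the paper's implementation. It assumes the mixed-precision Adam accounting commonly used for ZeRO (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and it assumes each of the three model states can be replicated, sharded within a group of GPUs, or sharded across all GPUs. The function names, the level labels `none`/`intra`/`global`, and the cluster shape are hypothetical choices made for illustration.

```python
# Illustrative sketch only: per-GPU model-state memory under different
# sharding granularities. Byte counts follow the usual ZeRO accounting for
# mixed-precision Adam; all names and labels here are hypothetical.

PARAM_BYTES = 2   # fp16 weights
GRAD_BYTES = 2    # fp16 gradients
OPTIM_BYTES = 12  # fp32 master weights + Adam momentum + Adam variance


def shard_factor(level: str, gpus_per_group: int, num_groups: int) -> int:
    """Return how many GPUs one copy of a model state is split across."""
    factors = {
        "none": 1,                              # fully replicated
        "intra": gpus_per_group,                # sharded inside one group
        "global": gpus_per_group * num_groups,  # sharded across all GPUs
    }
    if level not in factors:
        raise ValueError(f"unknown sharding level: {level!r}")
    return factors[level]


def per_gpu_bytes(num_params: int, p_level: str, g_level: str, os_level: str,
                  gpus_per_group: int, num_groups: int) -> float:
    """Model-state bytes held by one GPU (activations and buffers excluded)."""
    states = (
        (PARAM_BYTES, p_level),
        (GRAD_BYTES, g_level),
        (OPTIM_BYTES, os_level),
    )
    return sum(
        size * num_params / shard_factor(level, gpus_per_group, num_groups)
        for size, level in states
    )


if __name__ == "__main__":
    # Hypothetical cluster: 4 groups of 8 GPUs, 7B-parameter model.
    configs = {
        "all global (ZeRO-3 style)": ("global", "global", "global"),
        "params/grads intra, optimizer global": ("intra", "intra", "global"),
        "all intra": ("intra", "intra", "intra"),
    }
    for name, (p, g, o) in configs.items():
        gib = per_gpu_bytes(7_000_000_000, p, g, o, 8, 4) / 2**30
        print(f"{name}: {gib:.2f} GiB per GPU")
```

Keeping parameters and gradients sharded only within a group costs extra memory per GPU compared with sharding everything globally, but those states can then be gathered without crossing group boundaries. That is the kind of "minor memory redundancy" for less inter-group communication the abstract refers to. The communication cost itself is not modeled here.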

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques

Methods: ZeRO · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings