SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Jinda Jia; Cong Xie; Hanlin Lu; Daoce Wang; Hao Feng; Chengming Zhang; Baixi Sun; Haibin Lin; Zhi Zhang; Xin Liu; Dingwen Tao

arXiv:2410.15526·cs.LG·November 26, 2024

TL;DR

SDP4Bit introduces a novel 4-bit communication quantization method for sharded data parallelism in large language model training, significantly reducing communication overhead with minimal accuracy loss and improving training speed.

Contribution

The paper proposes a 4-bit communication quantization technique, built through algorithm-system co-design, that addresses the communication bottleneck of sharded data parallelism in LLM training.

Findings

01

Achieves up to 4.08× speedup in training throughput.

02

Maintains negligible impact on training loss with 4-bit quantization.

03

Demonstrates effectiveness on GPT models with up to 6.7 billion parameters.

Abstract

Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth…
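The idea of quantizing weight differences rather than the weights themselves can be illustrated with a small sketch. This is not the authors' implementation: the abstract above is truncated, so the group size, the symmetric rounding scheme, and every name here (`quantize_4bit`, `dequantize_4bit`, `max_error`) are assumptions made for illustration only. It shows why a small update tends to survive 4-bit quantization better than the full weight values do: the difference has a much narrower range, so each quantization step is finer.

```python
import random

def quantize_4bit(values, group_size=4):
    """Symmetric per-group quantization to signed codes in [-7, 7] (fits in 4 bits).

    Hypothetical helper: one floating-point scale is kept per group.
    """
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(v / scale))) for v in group)
    return codes, scales

def dequantize_4bit(codes, scales, group_size=4):
    """Map each code back to a float using its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def max_error(approx, exact):
    return max(abs(a - e) for a, e in zip(approx, exact))

random.seed(0)
# Toy data (assumed): weights in [-1, 1] and a small optimizer update.
old_weights = [random.uniform(-1.0, 1.0) for _ in range(16)]
new_weights = [w + random.uniform(-0.01, 0.01) for w in old_weights]

# Option A: send the new weights themselves in 4 bits.
direct = dequantize_4bit(*quantize_4bit(new_weights))

# Option B: send only the 4-bit difference; the receiver, which already
# holds old_weights, adds the dequantized difference to its copy.
diffs = [n - o for n, o in zip(new_weights, old_weights)]
codes, scales = quantize_4bit(diffs)
via_diff = [o + d for o, d in zip(old_weights, dequantize_4bit(codes, scales))]

err_direct = max_error(direct, new_weights)
err_diff = max_error(via_diff, new_weights)
print("max error, direct 4-bit weights:", err_direct)
print("max error, 4-bit weight diffs:  ", err_diff)
```

In this toy setting the reconstruction error of the difference-based path is bounded by half a quantization step of the update range, while the direct path pays half a step of the full weight range. The sketch assumes the receiver keeps an exact copy of the previous weights, which is a simplification.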

Videos

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training· slideslive

Taxonomy

Topics: Advanced Data Storage Technologies

Methods: Dense Connections · Layer Normalization · Residual Connection · Cosine Annealing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Weight Decay · Linear Layer · Softmax · Multi-Head Attention