SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao

TL;DR
SDP4Bit introduces a novel 4-bit communication quantization method for sharded data parallelism in large language model training, significantly reducing communication overhead with minimal accuracy loss and improving training speed.
Contribution
It proposes a new 4-bit quantization technique with algorithm-system co-design for efficient LLM training, addressing communication bottlenecks in sharded data parallelism.
Findings
Achieves up to 4.08× speedup in training throughput.
Maintains negligible impact on training loss with 4-bit quantization.
Demonstrates effectiveness on GPT models with up to 6.7 billion parameters.
Abstract
Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth…
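To illustrate why quantizing weight differences rather than the weights themselves can help at 4 bits, here is a minimal sketch in plain Python. It is not the authors' implementation: the symmetric per-group quantizer, the group size of 4, the update magnitude, and all function names are assumptions chosen for illustration. The idea it shows is that an update between two iterations typically has a much smaller range than the weights, so the same 15 quantization levels cover it with a much finer step.

```python
import random

def quantize_4bit(values, group_size=4):
    """Symmetric per-group 4-bit quantization (hypothetical scheme).

    Each group shares one scale; codes are integers in [-7, 7].
    """
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(v / scale))) for v in group)
    return codes, scales

def dequantize_4bit(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

random.seed(0)
# Assumed toy data: weights in [-1, 1], one small optimizer update.
old_weights = [random.uniform(-1.0, 1.0) for _ in range(16)]
new_weights = [w + random.uniform(-0.01, 0.01) for w in old_weights]

# Option A: quantize the new weights directly before communicating them.
codes, scales = quantize_4bit(new_weights)
direct = dequantize_4bit(codes, scales)

# Option B: quantize only the difference; the receiver adds it to the
# copy of the old weights it already holds.
diffs = [n - o for n, o in zip(new_weights, old_weights)]
codes_d, scales_d = quantize_4bit(diffs)
recon = [o + d for o, d in zip(old_weights, dequantize_4bit(codes_d, scales_d))]

err_direct = max_error(new_weights, direct)
err_diff = max_error(new_weights, recon)
print(f"max error, direct 4-bit weights:     {err_direct:.6f}")
print(f"max error, 4-bit weight differences: {err_diff:.6f}")
```

In this toy setting the reconstruction error of the difference-based path is bounded by half a quantization step of the update range, which is far smaller than half a step of the weight range. Option B does require every receiver to keep a copy of the previous weights.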
Taxonomy
Topics: Advanced Data Storage Technologies
Methods: Dense Connections · Layer Normalization · Residual Connection · Cosine Annealing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Weight Decay · Linear Layer · Softmax · Multi-Head Attention
