Quantized Distributed Training of Large Models with Convergence Guarantees
Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

TL;DR
This paper introduces QSDP, a quantized variant of fully-sharded data parallel training that maintains convergence guarantees, enabling scalable training of large models with reduced communication overhead and preserved accuracy.
Contribution
QSDP adds both gradient and weight quantization to FSDP with theoretical convergence guarantees. It is simple to implement, has essentially no overheads, and targets the communication bottlenecks of large-scale model training.
Findings
QSDP achieves up to 2.2x speedup in training large models.
QSDP maintains model accuracy comparable to unquantized FSDP.
The paper proves that a natural modification of SGD converges even when only quantized weights are maintained.
Abstract
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model's weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights,…
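To make the idea of communicating quantized weights and gradients concrete, here is a minimal sketch of one common quantizer: uniform stochastic rounding onto a fixed grid. This is an illustration only. The text above does not specify which quantizer QSDP uses, and the function names `stochastic_quantize` and `dequantize`, the `bits` parameter, and the min/max grid are all assumptions made for this example.

```python
import random


def stochastic_quantize(values, bits=8, rng=random):
    """Map floats to integer codes on a uniform grid with 2**bits levels.

    Hypothetical helper for illustration; not the paper's implementation.
    Each value is rounded up or down at random, with the probability of
    rounding up equal to its fractional position between the two nearest
    grid points, so the dequantized value is unbiased in expectation.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # highest integer code
    if hi == lo:
        # Constant input: a single grid point represents it exactly.
        return [0] * len(values), lo, 0.0
    scale = (hi - lo) / levels  # spacing between adjacent grid points
    codes = []
    for v in values:
        pos = (v - lo) / scale      # position on the grid, in [0, levels]
        base = int(pos)             # grid point at or below the value
        frac = pos - base           # distance to that grid point, in [0, 1)
        code = base + (1 if rng.random() < frac else 0)
        codes.append(min(code, levels))  # guard against float round-off
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
    codes, lo, scale = stochastic_quantize(weights, bits=4)
    print(codes)
    print(dequantize(codes, lo, scale))
```

In a sharded setting, a worker would send the integer codes together with `lo` and `scale` instead of the full-precision values, and the receiver would call `dequantize`. With `bits=8`, each value needs one byte on the wire rather than the four bytes of a 32-bit float, at the cost of a per-value error of at most one grid step.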
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing · Residual Connection · Weight Decay · Discriminative Fine-Tuning · Dropout · Dense Connections
