Quantized Distributed Training of Large Models with Convergence Guarantees

Ilia Markov; Adrian Vladu; Qi Guo; Dan Alistarh

arXiv:2302.02390 · cs.LG · February 7, 2023 · 5 cites

TL;DR

This paper introduces QSDP, a quantized variant of fully-sharded data parallel training that maintains convergence guarantees, enabling scalable training of large models with reduced communication overhead and preserved accuracy.

Contribution

QSDP supports both gradient and weight quantization with theoretical convergence guarantees, is simple to implement, and has essentially no overheads, which addresses the communication bottleneck of FSDP in large-scale model training.

Findings

1. QSDP achieves up to 2.2x speedup in training large models.
2. QSDP maintains model accuracy comparable to unquantized FSDP.
3. Theoretical proof of convergence for quantized SGD variants.

Abstract

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model's weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights,…
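The idea of running SGD while keeping only quantized weights can be illustrated with a small sketch. The code below is not the paper's algorithm or implementation: it assumes a uniform quantization grid with unbiased stochastic rounding, a toy quadratic objective, and hypothetical names (`stochastic_round`, `quantized_sgd`, `grad`) chosen for illustration. It quantizes weights only and leaves gradients in full precision.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`.

    Rounds up with probability equal to the fractional position of x
    between its two neighbouring grid points, so the result is an
    unbiased estimate of x.
    """
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step


def quantized_sgd(grad, w0, lr, step, iters, seed=0):
    """Gradient descent in which the stored weights always lie on the grid."""
    rng = random.Random(seed)
    w = [stochastic_round(v, step, rng) for v in w0]
    for _ in range(iters):
        g = grad(w)
        # Take a full-precision step, then immediately re-quantize,
        # so only quantized weights are ever kept between iterations.
        w = [stochastic_round(wi - lr * gi, step, rng) for wi, gi in zip(w, g)]
    return w


# Toy objective (an assumption for this sketch): f(w) = 0.5 * ||w - target||^2.
target = [1.0, -2.0, 0.5]


def grad(w):
    return [wi - ti for wi, ti in zip(w, target)]


w = quantized_sgd(grad, [0.0, 0.0, 0.0], lr=0.1, step=1 / 16, iters=500)
print(w)
```

In this toy setting the iterates stay on the grid and settle in a neighbourhood of the minimizer whose size depends on the grid spacing `step` and the learning rate; the paper's guarantees concern its own modified SGD and are not reproduced by this sketch.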

Videos

Quantized Distributed Training of Large Models with Convergence Guarantees (SlidesLive)

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing · Residual Connection · Weight Decay · Discriminative Fine-Tuning · Dropout · Dense Connections