AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi

TL;DR
AGoQ introduces activation and gradient quantization techniques that reduce memory usage and accelerate the training of large language models while maintaining comparable accuracy.
Contribution
The paper presents AGoQ, a quantization method for efficient LLM training that combines layer-aware, near 4-bit activation quantization with 8-bit gradient storage and 8-bit All-Reduce communication.
Findings
Reduces memory by up to 52% during training.
Achieves up to 1.34× faster training speed.
Maintains comparable accuracy and convergence on LLaMA models.
Abstract
Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet current approaches are ineffective with 4-bit activations and 8-bit gradients, where they easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, which incorporates two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths to the activations of different layers based on their types and pipeline stages, achieving near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to…
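To make the idea of 8-bit gradient storage concrete, here is a minimal sketch of generic symmetric absmax quantization in plain Python. It is not AGoQ's algorithm: the abstract does not specify the quantizer, and the precision-preserving All-Reduce is not reproduced here. The names `quantize_int8` and `dequantize_int8` and the sample gradient values are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    absmax = max((abs(v) for v in values), default=0.0)
    # Assumption: a single per-tensor scale; real systems often use per-block scales.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the 8-bit codes."""
    return [q * scale for q in quantized]


# Hypothetical gradient values, for illustration only.
grads = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(grads)
restored = dequantize_int8(q, scale)

# Rounding to the nearest code bounds the per-element error by half a step.
max_error = max(abs(a - b) for a, b in zip(grads, restored))
```

Each value is stored in one byte plus a shared scale, instead of two bytes (FP16/BF16) or four bytes (FP32). The cost is a rounding error of at most half a quantization step per element.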
