LoCo: Low-Bit Communication Adaptor for Large-scale Model Training
Xingyu Xie, Zhijie Lin, Kim-Chuan Toh, Pan Zhou

TL;DR
LoCo is a low-bit communication adaptor that compensates gradients before compression, enabling efficient large-scale model training with minimal loss in training quality and significant speed improvements.
Contribution
LoCo introduces a gradient compensation mechanism based on historical error estimates. The mechanism is compatible with common optimizers and sharding strategies, so training remains efficient without sacrificing quality.
Findings
Improves Adam training speed by 14% to 40% on large models
Maintains training quality while reducing communication overhead
Compatible with popular training frameworks like Megatron-LM and FSDP
Abstract
To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to information loss during compression. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compression errors to stably estimate the current compression error, and then uses this estimate to compensate the current gradient before it is compressed, yielding a less lossy compression. This mechanism makes LoCo compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating…
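The compensation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the uniform 4-bit quantizer, the class name ErrorCompensator, the smoothing factor beta, and the exact form of the moving-average update are all assumptions chosen for clarity, and the real method runs on GPU tensors inside a distributed training framework.

```python
def quantize(values, bits=4):
    """Uniform symmetric quantization to signed integer codes (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


class ErrorCompensator:
    """Hypothetical helper: keeps a moving average of past compression errors."""

    def __init__(self, size, beta=0.9, bits=4):
        self.error = [0.0] * size   # running estimate of the compression error
        self.beta = beta            # assumed smoothing factor
        self.bits = bits

    def compress(self, grad):
        # 1. Compensate the gradient with the error estimate before compressing.
        compensated = [g + e for g, e in zip(grad, self.error)]
        # 2. Compress the compensated gradient to low-bit codes.
        codes, scale = quantize(compensated, self.bits)
        # 3. Measure what this compression step lost.
        restored = dequantize(codes, scale)
        new_error = [c - r for c, r in zip(compensated, restored)]
        # 4. Fold the new error into the moving average for the next step.
        self.error = [
            self.beta * e + (1.0 - self.beta) * n
            for e, n in zip(self.error, new_error)
        ]
        return codes, scale


compensator = ErrorCompensator(size=4)
codes, scale = compensator.compress([0.5, -0.25, 0.1, 0.0])
approx_grad = dequantize(codes, scale)  # what a receiving node would reconstruct
```

In this sketch only the integer codes and one scale value would be communicated, while the error state stays local to each node and never needs to be synchronized.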
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Stochastic Gradient Descent · Mixture of Experts · Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
