LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

Xingyu Xie; Zhijie Lin; Kim-Chuan Toh; Pan Zhou

arXiv:2407.04480·cs.LG·December 2, 2024·1 cites

TL;DR

LoCo is a low-bit communication adaptor that compensates gradients before compression, enabling efficient large-scale model training with minimal loss in training quality and significant speed improvements.

Contribution

LoCo introduces a gradient compensation mechanism based on historical error estimates. The mechanism is compatible with common optimizers and sharding strategies, so training stays efficient without sacrificing quality.

Findings

01

Improves Adam training speed by 14% to 40% on large models

02

Maintains training quality while reducing communication overhead

03

Compatible with popular training frameworks like Megatron-LM and FSDP

Abstract

To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to the information lost in compression. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compensation errors to stably estimate the current compression error, and then uses this estimate to compensate the current gradient before compression, yielding a less lossy compression. This mechanism makes it compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating…
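The compensation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `quantize` and `CompensatedCompressor`, the uniform symmetric quantizer, and the parameter `beta` are all assumptions made for illustration, and the paper's exact update rule may differ. The sketch only shows the general pattern of adding a moving-average error estimate to the gradient before low-bit compression.

```python
def quantize(values, bits=4):
    """Hypothetical uniform symmetric quantizer; returns dequantized values."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels for 4 bits
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


class CompensatedCompressor:
    """Illustrative error-compensated compression (assumed form, not LoCo's exact rule)."""

    def __init__(self, size, beta=0.9, bits=4):
        self.error = [0.0] * size   # moving average of past compression errors
        self.beta = beta            # assumed moving-average coefficient
        self.bits = bits

    def compress(self, grad):
        # 1. Compensate the local gradient with the running error estimate.
        compensated = [g + e for g, e in zip(grad, self.error)]
        # 2. Compress the compensated gradient to low precision.
        sent = quantize(compensated, self.bits)
        # 3. Measure what the compression lost on this step.
        residual = [c - s for c, s in zip(compensated, sent)]
        # 4. Fold that loss into the moving average for the next step.
        self.error = [
            self.beta * e + (1 - self.beta) * r
            for e, r in zip(self.error, residual)
        ]
        return sent


if __name__ == "__main__":
    comp = CompensatedCompressor(size=3)
    for step in range(3):
        print(step, comp.compress([7.0, -7.0, 3.4]), comp.error)
```

In a real distributed setup the value returned by `compress` would be what each GPU node sends during gradient synchronization, while `error` stays local to the node.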

Code & Models

Repositories

XingyuXie/LoCo
Official

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Stochastic Gradient Descent · Mixture of Experts · Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings