Distributed Sign Momentum with Local Steps for Training Transformers
Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi Zhang, Xin Liu, Soummya Kar

TL;DR
This paper introduces a communication-efficient distributed sign momentum method with local steps for training large-scale Transformer models, achieving faster convergence and better empirical performance.
Contribution
It proposes a novel distributed sign momentum algorithm with multiple local steps, providing convergence analysis and demonstrating empirical improvements in training GPT-2 models.
Findings
Significantly improves training efficiency over existing methods.
Achieves an $O(1/\sqrt[4]{T})$ convergence rate for stochastic gradient descent.
Empirically outperforms other distributed training methods on GPT-2 pre-training.
Abstract
Pre-training Transformer models is resource-intensive, and recent studies have shown that sign momentum is an efficient technique for training large-scale deep learning models, particularly Transformers. However, its application in distributed training remains underexplored. This paper investigates a novel communication-efficient distributed sign momentum method with multiple local steps, to cope with the scenarios where communicating at every step is prohibitive. Our proposed method allows for a broad class of base optimizers for local steps, and uses sign momentum in the global step, where momentum is generated from differences accumulated during local steps. For generic base optimizers, by approximating the sign operator with a randomized version that acts as a continuous analog in expectation, we present a general convergence analysis, which specializes to an rate…
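The structure described in the abstract (local steps with a base optimizer, then a global sign-momentum step driven by the accumulated differences) can be illustrated with a minimal sketch. This is not the authors' exact algorithm: the single momentum buffer, the plain-SGD base optimizer, and every name here (`global_round`, `local_sgd`, `randomized_sign`, `beta`, `bound`) are assumptions made for illustration. The `randomized_sign` helper shows one possible randomized sign whose expectation is a scaled copy of its input; the construction used in the paper's analysis may differ.

```python
import random


def sign(v):
    """Elementwise sign of a list of floats, with sign(0) = 0."""
    return [(u > 0) - (u < 0) for u in v]


def randomized_sign(v, bound, rng):
    """Sample a +/-1 vector whose expectation is v / bound.

    Assumed construction (not taken from the paper): each entry is +1 with
    probability (1 + v_i / bound) / 2, which requires |v_i| <= bound.
    """
    out = []
    for u in v:
        p_plus = 0.5 * (1.0 + max(-1.0, min(1.0, u / bound)))
        out.append(1.0 if rng.random() < p_plus else -1.0)
    return out


def local_sgd(x, grad_fn, lr, steps):
    """Base optimizer for the local phase; plain SGD is only one possible choice."""
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def global_round(x, m, worker_grad_fns, local_lr, local_steps, global_lr, beta):
    """One communication round: local steps per worker, then a sign-momentum step."""
    n = len(worker_grad_fns)
    avg_delta = [0.0] * len(x)
    for grad_fn in worker_grad_fns:
        # Every worker starts from the shared parameters and runs local steps.
        x_local = local_sgd(list(x), grad_fn, local_lr, local_steps)
        # Accumulate the averaged difference; this is the only communicated quantity.
        for i in range(len(x)):
            avg_delta[i] += (x[i] - x_local[i]) / n
    # Momentum is built from the accumulated differences, not from raw gradients.
    m = [beta * mi + (1.0 - beta) * di for mi, di in zip(m, avg_delta)]
    # The global update moves each coordinate by a fixed step in the sign direction.
    x = [xi - global_lr * si for xi, si in zip(x, sign(m))]
    return x, m
```

Calling `global_round` repeatedly with one gradient function per worker simulates the scheme on a single machine; workers synchronize once per round rather than once per gradient step, which is where the communication saving comes from.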
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Discriminative Fine-Tuning · Linear Layer · Cosine Annealing · Attention Dropout · Layer Normalization
