Distributed Sign Momentum with Local Steps for Training Transformers

Shuhua Yu; Ding Zhou; Cong Xie; An Xu; Zhi Zhang; Xin Liu; Soummya Kar

arXiv:2411.17866 · cs.LG · March 11, 2025

TL;DR

This paper introduces a communication-efficient distributed sign momentum method with local steps for training large-scale Transformer models, supported by a convergence analysis and by empirical gains over other distributed training methods on GPT-2 pre-training.

Contribution

It proposes a novel distributed sign momentum algorithm with multiple local steps, providing convergence analysis and demonstrating empirical improvements in training GPT-2 models.

Findings

01

Significantly improves training efficiency over existing methods.

02

Achieves an $O(1/\sqrt[4]{T})$ convergence rate for stochastic gradient descent.

03

Empirically outperforms other distributed training methods on GPT-2 pre-training.

Abstract

Pre-training Transformer models is resource-intensive, and recent studies have shown that sign momentum is an efficient technique for training large-scale deep learning models, particularly Transformers. However, its application in distributed training remains underexplored. This paper investigates a novel communication-efficient distributed sign momentum method with multiple local steps, to cope with the scenarios where communicating at every step is prohibitive. Our proposed method allows for a broad class of base optimizers for local steps, and uses sign momentum in the global step, where momentum is generated from differences accumulated during local steps. For generic base optimizers, by approximating the sign operator with a randomized version that acts as a continuous analog in expectation, we present a general convergence analysis, which specializes to an $O (1/ T)$ rate…
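The structure described in the abstract (local steps with a base optimizer, then a global sign-momentum step driven by the accumulated local differences) can be sketched on a toy problem. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: plain SGD stands in for the base optimizer, a single momentum coefficient `beta` is used, and every function name and hyperparameter value is hypothetical. The `randomized_sign` helper shows one common way to build a random sign whose expectation is continuous in its input; the paper's exact construction may differ.

```python
import numpy as np


def randomized_sign(v, bound, rng):
    """Random +/-1 vector whose expectation is clip(v / bound, -1, 1)."""
    p_plus = 0.5 * (1.0 + np.clip(v / bound, -1.0, 1.0))
    return np.where(rng.random(v.shape) < p_plus, 1.0, -1.0)


def local_sgd(x, grad_fn, lr, steps, rng):
    """Base optimizer for the local phase (plain SGD is an assumption here)."""
    x = x.copy()
    for _ in range(steps):
        x -= lr * grad_fn(x, rng)
    return x


def global_round(x, m, grad_fns, rngs, local_lr, local_steps, global_lr, beta):
    """One communication round: local steps, then a sign-momentum global step."""
    # Each worker starts from the shared iterate and reports how far it moved.
    diffs = [x - local_sgd(x, g, local_lr, local_steps, r)
             for g, r in zip(grad_fns, rngs)]
    avg_diff = np.mean(diffs, axis=0)        # the only communication in the round
    m = beta * m + (1.0 - beta) * avg_diff   # momentum built from local differences
    x = x - global_lr * np.sign(m)           # sign step on the global momentum
    return x, m


def demo(rounds=200, workers=4, seed=0):
    """Toy problem: every worker sees noisy gradients of 0.5 * ||x - target||^2."""
    target = np.array([1.0, -2.0, 3.0])

    def grad_fn(x, rng):
        return (x - target) + 0.1 * rng.standard_normal(x.shape)

    rngs = [np.random.default_rng(seed + i) for i in range(workers)]
    x, m = np.zeros(3), np.zeros(3)
    start = np.linalg.norm(x - target)
    for _ in range(rounds):
        x, m = global_round(x, m, [grad_fn] * workers, rngs,
                            local_lr=0.1, local_steps=4,
                            global_lr=0.05, beta=0.9)
    return start, np.linalg.norm(x - target)
```

Calling `demo()` returns the distance to the optimum before and after training. In this sketch each round averages the workers' differences once, so communication happens every `local_steps` gradient evaluations rather than at every step.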

Code & Models

Repositories

shuhuayu/dist-sign-momentum (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Discriminative Fine-Tuning · Linear Layer · Cosine Annealing · Attention Dropout · Layer Normalization