No More Adam: Learning Rate Scaling at Initialization is All You Need
Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen

TL;DR
This paper introduces SGD-SaI, a simple learning rate scaling method at initialization that replaces adaptive optimizers like AdamW, achieving comparable or better performance in training Transformers while reducing memory usage.
Contribution
The authors propose SGD-SaI, a novel learning rate scaling technique at initialization that eliminates the need for adaptive gradient methods in training deep neural networks.
Findings
SGD-SaI matches or outperforms AdamW in Transformer training tasks.
SGD-SaI halves the optimizer's memory usage compared to AdamW.
SGD-SaI demonstrates robustness across various tasks and hyperparameters.
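The memory finding can be illustrated with simple arithmetic. AdamW keeps two state buffers per parameter (first and second moment estimates), while SGD with momentum keeps one. The sketch below is only a back-of-the-envelope estimate: the parameter count and the 4-byte (fp32) state size are illustrative assumptions, not figures from the paper, and `optimizer_state_bytes` is a hypothetical helper.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_value=4):
    """Estimate optimizer state size, assuming one fp32 value per buffer entry."""
    return num_params * buffers_per_param * bytes_per_value


# Hypothetical model size, chosen only for illustration.
num_params = 124_000_000

adamw_bytes = optimizer_state_bytes(num_params, buffers_per_param=2)  # m and v
sgdm_bytes = optimizer_state_bytes(num_params, buffers_per_param=1)   # momentum only

print(f"AdamW state: {adamw_bytes / 2**20:.0f} MiB")
print(f"SGDM state:  {sgdm_bytes / 2**20:.0f} MiB")
```

Any per-group learning rate scales add only one scalar per parameter group, which is negligible next to the per-parameter buffers.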
Abstract
In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2…
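The idea in the abstract, fixing one learning rate scale per parameter group from its g-SNR at initialization and then running plain SGDM, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not give the g-SNR formula, so the ratio used here (gradient L2 norm over gradient standard deviation) is an assumption, and the names `g_snr`, `scaled_learning_rates`, and `sgdm_step` are hypothetical.

```python
import math


def g_snr(grad, eps=1e-8):
    """Assumed g-SNR: L2 norm of a gradient block divided by its standard deviation."""
    n = len(grad)
    mean = sum(grad) / n
    var = sum((g - mean) ** 2 for g in grad) / n
    norm = math.sqrt(sum(g * g for g in grad))
    return norm / (math.sqrt(var) + eps)


def scaled_learning_rates(first_step_grads, base_lr):
    """Compute one fixed learning rate per parameter group from the first gradients."""
    return {name: base_lr * g_snr(g) for name, g in first_step_grads.items()}


def sgdm_step(params, grads, momentum_buf, lrs, beta=0.9):
    """One SGD-with-momentum update using the fixed per-group learning rates."""
    for name in params:
        buf = momentum_buf[name]
        for i, g in enumerate(grads[name]):
            buf[i] = beta * buf[i] + g            # single state buffer per parameter
            params[name][i] -= lrs[name] * buf[i]


# Toy example with two hypothetical parameter groups.
params = {"attn": [0.1, 0.2, 0.3, 0.4], "mlp": [1.0, -1.0]}
grads = {"attn": [0.5, -0.5, 0.5, -0.5], "mlp": [3.0, 1.0]}
momentum = {name: [0.0] * len(p) for name, p in params.items()}

lrs = scaled_learning_rates(grads, base_lr=0.01)  # computed once, at initialization
sgdm_step(params, grads, momentum, lrs)
print(lrs)
```

The scales are computed once and never updated, which is why no second-moment buffer is needed during training.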
Taxonomy
Topics: Teaching and Learning Programming
Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning
