No More Adam: Learning Rate Scaling at Initialization is All You Need

Minghao Xu; Lichuan Xiang; Xu Cai; Hongkai Wen

arXiv:2412.11768 · cs.LG · December 18, 2024

TL;DR

This paper introduces SGD-SaI, which scales the learning rate of each parameter group once, at initialization, so that SGD with momentum can replace adaptive optimizers like AdamW. It achieves comparable or better performance when training Transformers while halving the optimizer's memory usage.

Contribution

The authors propose SGD-SaI, an enhancement to SGD with momentum that sets a per-parameter-group learning rate at initialization from each group's gradient signal-to-noise ratio (g-SNR), removing the reliance on adaptive second-order momentum during training.

Findings

01. SGD-SaI matches or outperforms AdamW in Transformer training tasks.

02. SGD-SaI halves the optimizer's memory usage compared to AdamW.

03. SGD-SaI demonstrates robustness across various tasks and hyperparameters.
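The memory finding above can be made concrete with a back-of-the-envelope count. AdamW keeps two state tensors per parameter (first and second moment estimates), while SGD with momentum keeps one (the momentum buffer). The sketch below assumes that SGD-SaI stores nothing beyond the momentum buffer apart from a negligible per-group scalar, that state is held in 32-bit floats, and that the model has a purely illustrative 124 million parameters; the helper name `optimizer_state_bytes` is hypothetical.

```python
def optimizer_state_bytes(num_params, state_tensors_per_param, bytes_per_value=4):
    """Bytes of optimizer state, given how many state tensors shadow each parameter."""
    return num_params * state_tensors_per_param * bytes_per_value


# Illustrative parameter count only; not a figure taken from the paper.
NUM_PARAMS = 124_000_000

# AdamW: first-moment and second-moment estimates (two tensors per parameter).
adamw_bytes = optimizer_state_bytes(NUM_PARAMS, state_tensors_per_param=2)

# SGD with momentum (assumed to match SGD-SaI's state): one momentum buffer.
sgdm_bytes = optimizer_state_bytes(NUM_PARAMS, state_tensors_per_param=1)

print(f"AdamW state: {adamw_bytes / 2**20:.0f} MiB")
print(f"SGDM state:  {sgdm_bytes / 2**20:.0f} MiB")
print(f"ratio:       {sgdm_bytes / adamw_bytes:.2f}")
```

The count covers optimizer state only; weights, gradients, and activations are the same under either optimizer and are left out.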

Abstract

In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) for distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW on a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2…
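The idea described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the g-SNR formula, so the sketch assumes g-SNR is the L2 norm of a group's first-step gradient divided by the standard deviation of its entries, and that each group's learning rate is the base rate multiplied by that value. The function names, group names, and toy gradients are all hypothetical.

```python
import math


def g_snr(grad, eps=1e-8):
    """Assumed g-SNR: L2 norm of the gradient over the std of its entries."""
    n = len(grad)
    norm = math.sqrt(sum(g * g for g in grad))
    mean = sum(grad) / n
    var = sum((g - mean) ** 2 for g in grad) / n
    return norm / math.sqrt(var + eps)


def scaled_learning_rates(first_grads, base_lr):
    """Fix one learning rate per parameter group from the first-step gradients."""
    return {name: base_lr * g_snr(g) for name, g in first_grads.items()}


def sgdm_step(params, grads, bufs, lrs, momentum=0.9):
    """One SGD-with-momentum update using the fixed per-group learning rates."""
    for name in params:
        bufs[name] = [momentum * b + g for b, g in zip(bufs[name], grads[name])]
        params[name] = [p - lrs[name] * b for p, b in zip(params[name], bufs[name])]


# Toy gradients (made up): one group with consistent signs, one that is noisy.
grads = {
    "attn.weight": [0.5, 0.4, 0.6, 0.5],
    "mlp.weight": [1.0, -1.0, 1.0, -1.0],
}
params = {name: [0.0] * len(g) for name, g in grads.items()}
bufs = {name: [0.0] * len(g) for name, g in grads.items()}

# Scaling happens once, at initialization; no second-moment state is kept.
lrs = scaled_learning_rates(grads, base_lr=0.01)
sgdm_step(params, grads, bufs, lrs)

for name in params:
    print(name, "lr =", round(lrs[name], 4))
```

Under this assumed definition, the group whose gradient entries agree in sign receives a larger learning rate than the group whose entries cancel out, and the rates stay fixed for the rest of training. The only per-parameter state is the momentum buffer in `bufs`.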

Code & Models

Repositories

anonymousalethiometer/sgd_sai (PyTorch, official)

Taxonomy

Topics: Teaching and Learning Programming

Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning