Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Abulikemu Abuduweili; Changliu Liu

arXiv:2412.02153 · cs.LG · February 12, 2025

Open Access

TL;DR

This paper identifies the zero initialization of second-order moments in Adam as a key issue and proposes non-zero initialization strategies that improve stability and performance in training deep neural networks.

Contribution

The paper introduces simple non-zero initialization methods for Adam's second-order moment estimate, improving the optimizer's stability and generalization when training deep models.
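To see why the zero initialization matters, it helps to write out the first step of the standard Adam update. The notation below ($g_t$ for the gradient, $m_t$ and $v_t$ for the moment estimates, $\alpha$, $\beta_1$, $\beta_2$, $\epsilon$ for the hyperparameters) is the usual Adam notation and is assumed here rather than taken from this page; it is a sketch of the mechanism, not the paper's own derivation.

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{m_t / (1-\beta_1^t)}{\sqrt{v_t / (1-\beta_2^t)} + \epsilon}
$$

With $m_0 = 0$ and $v_0 = 0$, the bias-corrected estimates at $t = 1$ are $g_1$ and $g_1^2$, so

$$
\theta_1 - \theta_0 = -\alpha\, \frac{g_1}{|g_1| + \epsilon} \approx -\alpha\, \operatorname{sign}(g_1).
$$

The first step therefore has magnitude close to $\alpha$ in every coordinate, however small the gradient is. With a non-zero $v_0$, the first second-moment estimate becomes $v_1 = \beta_2 v_0 + (1-\beta_2)\, g_1^2$, so the denominator no longer cancels the gradient magnitude and the first step scales with $g_1$.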

Findings

01. Non-zero initialization of the second-order moment stabilizes Adam training.

02. The new initialization improves final performance.

03. With the new initialization, Adam matches recent adaptive optimizer variants.

Abstract

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we identify the standard initialization of the second-order moment estimation ($v_0 = 0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the…
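The effect of the initial value can be illustrated with a minimal scalar Adam loop. This is a sketch under stated assumptions, not the authors' implementation: the function name `adam_steps`, the default hyperparameters, and the choice to skip the second-moment bias correction when `v0` is non-zero are all hypothetical choices made for this example, and the paper's actual data-driven and random initialization formulas are not reproduced here.

```python
import math


def adam_steps(grads, v0=0.0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter update produced by scalar Adam at each step.

    grads: sequence of scalar gradients g_1, g_2, ...
    v0:    initial second-moment estimate (0.0 is the standard choice).
    """
    m, v = 0.0, v0
    updates = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        # Assumption of this sketch: the usual correction compensates for
        # v0 = 0, so it is applied only in that case. If v0 already
        # approximates E[g^2], v needs no rescaling.
        v_hat = v / (1 - beta2 ** t) if v0 == 0.0 else v
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


if __name__ == "__main__":
    # With v0 = 0 the first step is about lr in size for any gradient scale.
    print(adam_steps([1e-4])[0], adam_steps([10.0])[0])
    # With a non-zero v0 the first step shrinks with the gradient.
    print(adam_steps([1e-4], v0=1.0)[0])
```

With `v0=0.0`, a gradient of `1e-4` and a gradient of `10.0` both produce a first update of roughly `-lr`, which is the sign-descent behaviour at step one. Passing a non-zero `v0` makes the first update depend on the gradient magnitude instead.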

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Photoacoustic and Ultrasonic Imaging

Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Softmax · Label Smoothing · Dropout · Dense Connections · Layer Normalization · Linear Layer · Multi-Head Attention