Extending AdamW by Leveraging Its Second Moment and Magnitude
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn

TL;DR
This paper introduces Aida, an extension of AdamW that leverages the second moment and magnitude to relax learning rate constraints, improving stability and performance in optimization tasks.
Contribution
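Aida extends AdamW in two aspects, the first being to track the second moment r_t of the pth power of the gradient magnitudes, with the aim of relaxing the small-learning-rate requirement for local stability. The method is supported by theoretical analysis and empirical validation.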
Findings
Aida with specific (p,q) settings outperforms AdamW in various tasks.
Theoretical analysis shows local stability depends on non-zero weight decay.
Empirical results demonstrate improved optimization performance with Aida.
Abstract
Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution for a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution. The above convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, which we refer to as Aida, by extending AdamW in two aspects with the aim of relaxing the requirement of a small learning rate for local stability. Firstly, we consider tracking the 2nd moment r_t of the pth power of the gradient magnitudes; r_t reduces to v_t of AdamW when p=2. Suppose {m_t} is the first moment of AdamW. It is known that the update direction m_{t+1}/(v_{t+1}+epsilon)^0.5 of AdamW (or m_{t+1}/(v_{t+1}^0.5+epsilon) of Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of…
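The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes that r_t is an exponential moving average of |g_t|^p (which is consistent with r_t reducing to v_t at p=2) and it omits bias correction and weight decay. The function names (`track_moments`, `adamw_direction`, `decomposed_direction`) and the default hyperparameters are hypothetical, chosen for illustration only. This is not the full Aida update, which the truncated abstract does not specify. The decomposition shown relies only on the elementary identity x = sign(x)·|x|.

```python
import math


def ema_update(prev, value, beta):
    """One exponential-moving-average step."""
    return beta * prev + (1.0 - beta) * value


def track_moments(grads, beta1=0.9, beta2=0.999, p=2.0):
    """Track m_t (EMA of g) and r_t (assumed here: EMA of |g|^p).

    grads is a list of gradient vectors (lists of floats).
    No bias correction is applied in this sketch.
    """
    dim = len(grads[0])
    m = [0.0] * dim
    r = [0.0] * dim
    for g in grads:
        m = [ema_update(mi, gi, beta1) for mi, gi in zip(m, g)]
        r = [ema_update(ri, abs(gi) ** p, beta2) for ri, gi in zip(r, g)]
    return m, r


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def adamw_direction(m, v, eps=1e-8):
    """AdamW-style direction m / (v + eps)^0.5, elementwise."""
    return [mi / math.sqrt(vi + eps) for mi, vi in zip(m, v)]


def decomposed_direction(m, v, eps=1e-8):
    """Same direction, written as sign(m) times a non-negative magnitude."""
    return [sign(mi) * (abs(mi) / math.sqrt(vi + eps)) for mi, vi in zip(m, v)]


if __name__ == "__main__":
    grads = [[0.5, -1.0, 0.0], [0.25, -2.0, 0.1]]
    m, r = track_moments(grads, p=2.0)  # with p=2, r plays the role of v
    print(adamw_direction(m, r))
    print(decomposed_direction(m, r))
```

Both printed vectors should agree up to floating-point rounding, since the second form only separates the sign of m from its magnitude.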
Taxonomy
Topics: Numerical methods in inverse problems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Absolute Position Encodings · Adam · AdamW
