Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise
Enea Monzio Compagnoni, Tianlin Liu, Rustem Islamov, Frank Norbert Proske, Antonio Orvieto, Aurelien Lucchi

TL;DR
This paper develops novel stochastic differential equations to accurately model adaptive optimizers like SignSGD, RMSprop, and AdamW, revealing their dynamics, robustness, and the complex role of noise in training deep neural networks.
Contribution
It introduces new SDE models for adaptive optimizers, providing a deeper theoretical understanding and empirical validation of their behavior in deep learning.
Findings
SignSGD converges faster and is more robust to heavy-tail noise than SGD.
In AdamW and RMSpropW, the role of noise is considerably more complex than in SignSGD.
The SDE models accurately predict optimizer behavior across various neural network architectures.
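To make the SignSGD versus SGD contrast concrete, here is a minimal sketch of the two discrete update rules on a toy problem. Everything in it is an illustrative assumption rather than the paper's setup: the objective f(x) = 0.5 x^2, the Cauchy gradient noise standing in for "heavy-tail noise", and the helper names `sgd_step`, `signsgd_step`, `noisy_grad`, and `run`.

```python
import math
import random


def sgd_step(x, grad, lr):
    """Plain SGD: the step scales with the (noisy) gradient magnitude."""
    return x - lr * grad


def signsgd_step(x, grad, lr):
    """SignSGD: only the sign of the gradient is used, so |step| <= lr."""
    sign = (grad > 0) - (grad < 0)  # evaluates to -1, 0, or +1
    return x - lr * sign


def noisy_grad(x, rng, scale=1.0):
    """Gradient of the toy objective f(x) = 0.5 * x**2 plus Cauchy noise.

    The Cauchy distribution is an assumed stand-in for heavy-tailed
    gradient noise; it has no finite mean or variance.
    """
    cauchy = math.tan(math.pi * (rng.random() - 0.5))
    return x + scale * cauchy


def run(step_fn, x0=5.0, lr=0.01, n_steps=1000, seed=0):
    """Iterate one optimizer from x0 and return the whole trajectory."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step_fn(xs[-1], noisy_grad(xs[-1], rng), lr))
    return xs


sgd_path = run(sgd_step)
sign_path = run(signsgd_step)
# Largest single-step displacement of each method under the same noise draws.
sgd_max_jump = max(abs(b - a) for a, b in zip(sgd_path, sgd_path[1:]))
sign_max_jump = max(abs(b - a) for a, b in zip(sign_path, sign_path[1:]))
```

The only property this sketch is meant to show is structural: a sign-based step is bounded by the learning rate no matter how large a noise sample is, whereas an SGD step inherits the full magnitude of the noisy gradient. It does not reproduce the paper's convergence or stationary-distribution results.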
Abstract
Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama…
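For readers unfamiliar with the Euler-Maruyama scheme mentioned in the abstract, the following is a minimal sketch of it for a generic scalar SDE dX = b(X, t) dt + s(X, t) dW. The function name `euler_maruyama`, its parameters, and the Ornstein-Uhlenbeck example are assumptions chosen for illustration; the paper's own SDEs for SignSGD, RMSprop(W), and Adam(W) are not reproduced here.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, seed=0):
    """Integrate dX = drift(X, t) dt + diffusion(X, t) dW with step dt.

    Each step draws a Brownian increment dW ~ N(0, dt) and applies
    X_{k+1} = X_k + drift(X_k, t_k) * dt + diffusion(X_k, t_k) * dW.
    """
    rng = random.Random(seed)
    x = x0
    path = [x0]
    for k in range(n_steps):
        t = k * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # std dev of the increment is sqrt(dt)
        x = x + drift(x, t) * dt + diffusion(x, t) * dw
        path.append(x)
    return path


# Illustrative example (an assumption, not the paper's model):
# an Ornstein-Uhlenbeck process dX = -theta * X dt + sigma dW.
theta, sigma = 1.0, 0.5
ou_path = euler_maruyama(
    drift=lambda x, t: -theta * x,
    diffusion=lambda x, t: sigma,
    x0=2.0,
    dt=0.01,
    n_steps=500,
)
```

Passing the drift and diffusion as callables keeps the integrator independent of any particular optimizer model, so the same loop could be reused with other coefficient functions.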
Taxonomy
Topics: Complex Systems and Decision Making
Methods: Adam · AdamW · Stochastic Gradient Descent
