Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates
Steffen Dereich, Robin Graeber, Arnulf Jentzen

TL;DR
This paper proves that popular adaptive stochastic gradient descent methods such as Adam do not converge when the learning rates are non-vanishing, that is, when they do not decay to zero, which limits the convergence guarantees available for these optimizers.
Contribution
It provides a rigorous proof that Adam and similar adaptive optimizers fail to converge when the learning rates are non-vanishing (do not decay to zero), addressing an open question about these methods.
Findings
Adam and similar adaptive methods do not converge when the learning rates are non-vanishing
Pathwise bounds are established for a class of adaptive SGD methods
Adaptive optimizers are limited in training scenarios where the learning rates are not decayed to zero
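The first finding can be illustrated with a toy experiment. The sketch below is not taken from the paper and does not reproduce its proof or its exact setting: the objective f(theta) = E[(theta - X)^2] / 2 with X uniform on {-1, +1}, the function name `adam_tail_spread`, the step counts, and the two learning rate schedules are all assumptions chosen for illustration. It runs a textbook Adam update once with a constant learning rate and once with a learning rate decaying like 1/sqrt(t), and measures how far the iterates still move over the last steps.

```python
import math
import random


def adam_tail_spread(lr_schedule, steps=20000, tail=1000, seed=0,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on the toy objective f(theta) = E[(theta - X)^2] / 2.

    X is uniform on {-1, +1}, so the minimizer is theta = 0.
    Returns max - min of theta over the last `tail` steps
    (a hypothetical "spread" measure used only in this sketch).
    """
    rng = random.Random(seed)
    theta, m, v = 1.0, 0.0, 0.0
    tail_values = []
    for t in range(1, steps + 1):
        x = rng.choice((-1.0, 1.0))
        g = theta - x                       # stochastic gradient of the toy loss
        m = beta1 * m + (1.0 - beta1) * g   # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)      # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr_schedule(t) * m_hat / (math.sqrt(v_hat) + eps)
        if t > steps - tail:
            tail_values.append(theta)
    return max(tail_values) - min(tail_values)


# Non-vanishing (constant) learning rate versus a vanishing one.
constant_spread = adam_tail_spread(lambda t: 0.1)
decaying_spread = adam_tail_spread(lambda t: 0.1 / math.sqrt(t))
print(f"tail spread, constant lr: {constant_spread:.4f}")
print(f"tail spread, decaying lr: {decaying_spread:.4f}")
```

In this toy setting the iterates under the constant learning rate keep fluctuating around the minimizer, while the decaying schedule shrinks the fluctuations. This is only a numerical illustration of the phenomenon on one example; the paper's result is a proof for a class of methods and problems.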
Abstract
Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stopping problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
Methods: Diffusion · Stochastic Gradient Descent · RMSProp · Adam
