Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Steffen Dereich; Robin Graeber; Arnulf Jentzen

arXiv:2407.08100·cs.LG·July 12, 2024·1 cites

Open Access

TL;DR

This paper proves that popular adaptive stochastic gradient descent methods such as Adam fail to converge when the learning rates are non-vanishing, that is, when they do not decay to zero during training.

Contribution

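It gives a rigorous proof that Adam and similar adaptive optimizers fail to converge when the learning rates are non-vanishing (a constant non-zero learning rate is one example), settling whether adapting the learning rates during training is enough to restore convergence.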

Findings

01. Adam and similar adaptive SGD methods do not converge with non-vanishing learning rates

02. Establishes pathwise bounds for a class of adaptive SGD methods

03. Highlights limitations of adaptive optimizers in training scenarios where the learning rates are not decayed to zero
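
The first finding can be illustrated numerically with a toy experiment. The sketch below is not taken from the paper and does not reproduce its proof or its setting. It runs the standard Adam recursion on a one-dimensional stochastic quadratic problem and compares a constant learning rate with a decaying one. The objective, the noise distribution, the hyperparameters, and the helper name `adam_tail_spread` are all assumptions chosen for illustration.

```python
import math
import random


def adam_tail_spread(learning_rate, steps=20000, tail=2000, seed=0,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on theta -> E[(theta - X)^2 / 2] with X ~ N(0, 1).

    `learning_rate` maps the step index n (starting at 1) to the step size.
    Returns max - min over the last `tail` iterates, a crude measure of
    how much the iterates are still moving at the end of the run.
    All settings here are illustrative assumptions, not the paper's setup.
    """
    rng = random.Random(seed)
    theta, m, v = 1.0, 0.0, 0.0
    tail_values = []
    for n in range(1, steps + 1):
        # Unbiased stochastic gradient of the toy objective.
        grad = theta - rng.gauss(0.0, 1.0)
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        # Bias-corrected estimates, as in the usual Adam formulation.
        m_hat = m / (1.0 - beta1 ** n)
        v_hat = v / (1.0 - beta2 ** n)
        theta -= learning_rate(n) * m_hat / (math.sqrt(v_hat) + eps)
        if n > steps - tail:
            tail_values.append(theta)
    return max(tail_values) - min(tail_values)


# Constant (non-vanishing) learning rate versus a rate decaying like 1/sqrt(n).
spread_constant = adam_tail_spread(lambda n: 0.1)
spread_decaying = adam_tail_spread(lambda n: 0.1 / math.sqrt(n))

print("tail spread, constant learning rate:", spread_constant)
print("tail spread, decaying learning rate:", spread_decaying)
```

In this toy run the iterates under the constant learning rate keep fluctuating around the minimizer instead of settling, while the decaying schedule shrinks the fluctuations. A simulation like this only suggests the behavior; the paper's contribution is the proof.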

Abstract

Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stopping problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM

Methods: Diffusion · Stochastic Gradient Descent · RMSProp · Adam