Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

Thang Do; Sonja Hannibal; Arnulf Jentzen

arXiv:2410.10533 · cs.LG · February 18, 2025

Open Access

TL;DR

This paper proves that a large class of stochastic gradient descent (SGD) methods, including Adam and RMSProp, fail with high probability to converge to global minimizers when training deep neural networks with ReLU activation, and that the probability of non-convergence grows toward one as network size increases.

Contribution

It answers negatively the previously open question of whether SGD methods converge with high probability to global minimizers when training deep ReLU networks, giving a rigorous proof for a large class of SGD methods and exposing a fundamental limitation of these optimizers.

Findings

01. The considered SGD methods fail to converge to global minimizers with high probability.

02. The probability of non-convergence approaches one exponentially fast as network size grows.

03. The results apply to a wide class of SGD-based optimizers.
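
The exponential dependence on network size can be pictured with a toy calculation. The sketch below is not the paper's proof or setting. The mechanism (a ReLU unit that outputs zero on every training input at initialization), the Gaussian initialization, the one-dimensional toy data, and all function names are assumptions made purely for illustration. It only shows the elementary fact that if each of `width` independent units has a bad event with probability p, then at least one unit has it with probability 1 - (1 - p)^width, which tends to one exponentially fast in the width.

```python
import random


def neuron_is_inactive(weight, bias, inputs):
    """Return True if the ReLU unit max(0, w*x + b) is zero on every input."""
    return all(weight * x + bias <= 0.0 for x in inputs)


def prob_some_neuron_inactive(width, inputs, trials=2000, seed=0):
    """Monte Carlo estimate of P(at least one of `width` units is inactive).

    Assumption (hypothetical, not from the paper): weights and biases are
    drawn independently from a standard normal distribution.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(
            neuron_is_inactive(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), inputs)
            for _ in range(width)
        ):
            hits += 1
    return hits / trials


# Toy one-dimensional training inputs on a grid in [0, 1] (an assumption).
inputs = [i / 10 for i in range(11)]

# Estimated probability for several layer widths.
estimates = {w: prob_some_neuron_inactive(w, inputs) for w in (1, 4, 16, 64)}

for w, p in estimates.items():
    print(f"width={w:3d}  estimated P(some unit inactive) = {p:.3f}")
```

For this toy data a single unit is inactive exactly when b <= 0 and w + b <= 0, which for independent standard normal w and b has probability 3/8, so the estimates should track 1 - (5/8)^width. How such events relate to non-convergence in the actual result is established in the paper itself, not by this sketch.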

Abstract

Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM

Methods: Stochastic Gradient Descent · RMSProp · AMSGrad · NADAM · Adam