Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation
Thang Do, Sonja Hannibal, Arnulf Jentzen

TL;DR
This paper proves that a large class of stochastic gradient descent (SGD) methods, including Adam and RMSProp, fail with high probability to converge to global minimizers when training deep neural networks with ReLU activation, and that the probability of failure grows with network size.
Contribution
The paper provides the first rigorous proof that popular SGD variants fail with high probability to reach global minimizers when training deep ReLU networks, which exposes a fundamental limitation of these optimization methods.
Findings
High probability of non-convergence to global minimizers
Probability of non-convergence approaches one exponentially fast as network size grows
Results apply to a wide class of SGD-based optimizers
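This summary does not say which mechanism the proof relies on. Purely as an illustration, and not as the authors' argument, the sketch below shows one well-known way an SGD-type method can get stuck with ReLU: hidden neurons that are inactive on all training inputs receive zero gradient, so their parameters never move. The one-hidden-layer network, the target y = x, the initialization, and the names `adam_step` and `train_dead_relu` are all assumptions made for this example. The Adam update itself is the standard rule with bias correction.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; returns the new parameter and both moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


def train_dead_relu(steps=2000, width=4, batch=32, seed=0):
    """Train x -> sum_j v_j * relu(w_j * x + b_j) + c on the target y = x."""
    rng = np.random.default_rng(seed)
    # Hypothetical initialization (an assumption of this sketch): negative
    # weights and biases make every hidden neuron inactive for x in [0, 1].
    params = {
        "w": -rng.uniform(0.1, 1.0, width),
        "b": -rng.uniform(0.1, 1.0, width),
        "v": rng.normal(size=width),
        "c": np.zeros(1),
    }
    init = {k: p.copy() for k, p in params.items()}
    m = {k: np.zeros_like(p) for k, p in params.items()}
    s = {k: np.zeros_like(p) for k, p in params.items()}

    for t in range(1, steps + 1):
        x = rng.uniform(0.0, 1.0, batch)
        y = x  # a single ReLU neuron can represent this target exactly
        pre = np.outer(x, params["w"]) + params["b"]  # shape (batch, width)
        hid = np.maximum(pre, 0.0)
        err = hid @ params["v"] + params["c"][0] - y
        act = (pre > 0).astype(float)  # ReLU derivative (0 at the kink)
        # Gradients of the mean squared error over the mini-batch.
        grads = {
            "c": np.array([2 * err.mean()]),
            "v": 2 * (err[:, None] * hid).mean(axis=0),
            "b": 2 * (err[:, None] * act * params["v"]).mean(axis=0),
            "w": 2 * (err[:, None] * act * params["v"] * x[:, None]).mean(axis=0),
        }
        for k in params:
            params[k], m[k], s[k] = adam_step(params[k], grads[k], m[k], s[k], t)

    # Evaluate the mean squared error on a fixed grid over [0, 1].
    xs = np.linspace(0.0, 1.0, 1001)
    pred = np.maximum(np.outer(xs, params["w"]) + params["b"], 0.0) @ params["v"]
    loss = float(np.mean((pred + params["c"][0] - xs) ** 2))
    return init, params, loss


if __name__ == "__main__":
    init, final, loss = train_dead_relu()
    print("hidden weights unchanged:", np.array_equal(init["w"], final["w"]))
    print("final loss:", loss)
```

In this toy setup only the output bias `c` is trained, so the network stays a constant function and its loss cannot fall below the variance of the target on the grid, even though a loss of zero is attainable with other parameter values.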
Abstract
Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
Methods: Stochastic Gradient Descent · RMSProp · AMSGrad · NADAM · Adam
