Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Patrick Cheridito; Arnulf Jentzen; Florian Rossmannek

arXiv:2208.02083 · cs.LG · September 12, 2024 · 1 citation

Open Access

TL;DR

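This paper extends a dynamical systems result, the center-stable manifold theorem, to loss functions that fail the usual regularity conditions. It uses the result to show that gradient descent bypasses most saddle points when training shallow ReLU and leaky ReLU networks with scalar input and, under certain initialization conditions, converges to global minima.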

Contribution

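The paper proves a variant of the center-stable manifold theorem with relaxed regularity requirements. Building on an examination of the critical points of the square integral loss relative to an affine target function, it then shows that gradient descent bypasses most saddle points of shallow ReLU and leaky ReLU networks.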
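The idea behind a center-stable manifold argument can be illustrated on a toy problem. The sketch below is not taken from the paper: it uses the smooth function f(x, y) = (x^2 - y^2) / 2, which has a strict saddle at the origin, and the names `grad` and `gradient_descent`, the step size, and the iteration count are all illustrative assumptions. Gradient descent reaches the saddle only from initial points with y = 0, a set of measure zero, and any perturbation in y is amplified at every step. The paper's contribution concerns losses that are less regular than this one.

```python
def grad(x, y):
    # Gradient of the toy function f(x, y) = (x**2 - y**2) / 2 (an assumption,
    # chosen because the origin is a strict saddle of f).
    return x, -y


def gradient_descent(x, y, step=0.1, iters=400):
    # Plain gradient descent: x shrinks by (1 - step) and y grows by (1 + step)
    # in every iteration.
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


# Start exactly on the stable manifold {y = 0}: the iterates approach the saddle.
on_manifold = gradient_descent(1.0, 0.0)

# Start a tiny distance off the manifold: the y-coordinate is driven away.
perturbed = gradient_descent(1.0, 1e-12)

print("on manifold:", on_manifold)
print("perturbed:  ", perturbed)
```

Because the toy function is unbounded below, the perturbed run moves away from the saddle without settling at a minimum. It only shows that the saddle repels almost every initialization.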

Findings

01 Gradient descent bypasses most saddle points in shallow ReLU networks.

02 Convergence to global minima is achieved under certain initialization conditions.

03 The results hold even when regularity conditions are relaxed.

Abstract

Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms bypass so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. We explore its relevance for various machine learning tasks, with a particular focus on shallow rectified linear unit (ReLU) and leaky ReLU networks with scalar input. Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks relative to an affine target function, we show that gradient descent circumvents most saddle points. Furthermore, we prove convergence to global minima…
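The setting in the abstract can be made concrete with a small numerical sketch. Everything specific here is an assumption for illustration rather than the paper's construction: the parametrization N(x) = c + sum_j v_j max(w_j x + b_j, 0), the integration interval [0, 1], the midpoint-rule quadrature, the convention that the ReLU derivative at 0 is 0, the target 2x - 1, and the hand-picked initialization, step size, and iteration count.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def network(w, b, v, c, x):
    # Shallow ReLU network with scalar input x and scalar output.
    return c + sum(vj * relu(wj * x + bj) for wj, bj, vj in zip(w, b, v))


def loss_and_grad(w, b, v, c, alpha, beta, grid=200):
    # Midpoint-rule approximation of the integral over [0, 1] of
    # (network(x) - (alpha * x + beta))**2, together with its gradient.
    n = len(w)
    gw, gb, gv, gc = [0.0] * n, [0.0] * n, [0.0] * n, 0.0
    loss, h = 0.0, 1.0 / grid
    for i in range(grid):
        x = (i + 0.5) * h
        err = network(w, b, v, c, x) - (alpha * x + beta)
        loss += h * err * err
        gc += 2.0 * h * err
        for j in range(n):
            pre = w[j] * x + b[j]
            if pre > 0.0:  # assumed convention: ReLU derivative at 0 is 0
                gv[j] += 2.0 * h * err * pre
                gw[j] += 2.0 * h * err * v[j] * x
                gb[j] += 2.0 * h * err * v[j]
    return loss, gw, gb, gv, gc


def train(w, b, v, c, alpha=2.0, beta=-1.0, lr=0.05, steps=2000):
    # Plain gradient descent on all parameters; returns the loss history.
    w, b, v = list(w), list(b), list(v)
    losses = []
    for _ in range(steps):
        loss, gw, gb, gv, gc = loss_and_grad(w, b, v, c, alpha, beta)
        losses.append(loss)
        w = [p - lr * g for p, g in zip(w, gw)]
        b = [p - lr * g for p, g in zip(b, gb)]
        v = [p - lr * g for p, g in zip(v, gv)]
        c -= lr * gc
    return losses


# Hypothetical initialization with two hidden neurons.
losses = train(w=[1.0, -0.5], b=[0.1, 0.6], v=[0.3, -0.2], c=0.0)
print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

A single run like this says nothing about saddle avoidance in general. It only shows the objects the abstract refers to: a shallow ReLU network, an affine target, and a square integral loss minimized by gradient descent.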

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Model Reduction and Neural Networks