Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations
Alexandru Crăciun, Debarghya Ghoshdastidar

TL;DR
This paper proves that the gradient descent map is non-singular for realistic neural networks with piecewise analytic activations, which supports theoretical guarantees for avoiding saddle points and convergence to global minima.
Contribution
It establishes the non-singularity of the GD map for practical neural network architectures with common activation functions, extending prior theoretical results.
Findings
GD map is non-singular for almost all step-sizes in realistic networks
Supports theoretical guarantees for avoiding saddle points
Extends convergence analysis to practical neural network settings
Abstract
The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD has appeared in several recent works: the GD map is non-singular, that is, it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines the convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for deep ReLU networks with the cross-entropy loss) and restricts…
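To make the non-singularity condition concrete, the following is a minimal sketch on a toy quadratic loss, not the neural-network setting the paper treats. It assumes the loss L(theta) = 0.5 * theta^T A theta, for which the GD map is the linear map G(theta) = (I - eta * A) theta. The matrix A, the step sizes, and the function names are all illustrative assumptions. In this toy case the map is singular only when eta equals 1/lambda for an eigenvalue lambda of A, so it is non-singular for all but finitely many step sizes.

```python
import numpy as np

def gd_map(theta, A, eta):
    """One GD step on the toy loss L(theta) = 0.5 * theta^T A theta (illustrative)."""
    return theta - eta * (A @ theta)

def gd_jacobian(A, eta):
    """Jacobian of the toy GD map: I - eta * A (constant, since the map is linear)."""
    return np.eye(A.shape[0]) - eta * A

# Hypothetical Hessian with eigenvalues 1 and 4.
A = np.diag([1.0, 4.0])

# Step sizes at which I - eta * A loses rank: eta = 1 / eigenvalue.
singular_etas = 1.0 / np.linalg.eigvalsh(A)

# Generic step size: the Jacobian is invertible, so the linear map is a
# bijection and preimages of measure-zero sets have measure zero.
det_generic = np.linalg.det(gd_jacobian(A, 0.1))

# Exceptional step size eta = 1/4: the Jacobian is rank deficient.
det_bad = np.linalg.det(gd_jacobian(A, 0.25))

# At eta = 1/4 every point lands on the line {theta_2 = 0}, a measure-zero
# set whose preimage is the whole plane, so the map is singular.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 2))
images = np.array([gd_map(p, A, 0.25) for p in points])
```

For a linear map, an invertible Jacobian is equivalent to non-singularity in the measure-theoretic sense. For the piecewise analytic networks discussed in the abstract the map is nonlinear and the argument is more delicate, so this sketch only illustrates the definition and why exceptional step sizes can exist.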
