Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Samet Oymak, Mahdi Soltanolkotabi

TL;DR
This paper studies gradient descent on overparameterized nonlinear models and shows that it converges at a geometric rate to a global minimum close to the initial point, following a nearly direct path, with implications for neural network training.
Contribution
The paper introduces a new theoretical framework showing gradient descent converges efficiently to a near-initial global minimum in overparameterized settings, with novel potential functions and martingale techniques.
Findings
Gradient descent converges geometrically to a global minimum.
Iterates follow a nearly shortest path from the initialization to the solution.
Results apply across various domains like matrix recovery and neural networks.
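The first two findings can be illustrated in the simplest overparameterized setting: linear least squares with more parameters than data points. This is only a sketch of a linear special case, not the paper's nonlinear analysis. The problem sizes, the step size 1/sigma_max(X)^2, the random data, and the helper name `gradient_descent_least_squares` are all assumptions chosen for illustration. In this linear case, gradient descent started at `theta0` is known to converge to the global minimizer closest to `theta0`, so the script compares the final iterate with that closest minimizer and the length of the path travelled with the straight-line distance.

```python
import numpy as np


def gradient_descent_least_squares(X, y, theta0, num_steps=500):
    """Plain gradient descent on the loss 0.5 * ||X @ theta - y||^2.

    Returns the final iterate, the loss recorded before each step, and the
    total length of the path travelled by the iterates.
    """
    # Assumed step size: 1 / sigma_max(X)^2, the inverse of the gradient's
    # Lipschitz constant for this quadratic loss.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    theta = theta0.copy()
    losses = []
    path_length = 0.0
    for _ in range(num_steps):
        residual = X @ theta - y
        losses.append(0.5 * residual @ residual)
        update = step * (X.T @ residual)
        theta = theta - update
        path_length += np.linalg.norm(update)
    return theta, np.array(losses), path_length


# Hypothetical overparameterized problem: n = 20 samples, p = 100 parameters.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)

theta_gd, losses, path_length = gradient_descent_least_squares(X, y, theta0)

# Among all theta with X @ theta = y, the one closest to theta0 is the
# projection of theta0 onto the solution set.
theta_closest = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)
straight_line = np.linalg.norm(theta_closest - theta0)

print("final training loss:", losses[-1])
print("gap to closest global minimizer:", np.linalg.norm(theta_gd - theta_closest))
print("path length / straight-line distance:", path_length / straight_line)
```

A ratio close to 1 in the last line means the iterates took a nearly direct route. The paper's contribution is to establish comparable guarantees for nonlinear models, where no such closed-form projection is available.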
Abstract
Many modern learning tasks involve fitting nonlinear models to data, where the models are trained in an overparameterized regime in which the number of model parameters exceeds the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima, and it is critical to understand the properties of the solutions found by first-order optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first-order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optimum even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: Stochastic Gradient Descent
