On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Romain Petit, Clarice Poon, Gabriel Peyré

TL;DR
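This paper proves that, for wide shallow neural networks with bounded nonlinearities, all non-global minimizers of the training loss are unstable under gradient descent. As a result, in the many-neurons limit and with a full-support initialization, continuous-time gradient descent can only converge to global minimizers, which extends previous results.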
Contribution
The paper extends global convergence results for wide shallow models to a class that includes multi-head attention layers and two-layer sigmoid networks with vector output weights, building on the framework of [Chizat and Bach, 2018].
Findings
Non-global minimizers are unstable under gradient descent.
In the many hidden neurons or attention heads limit, with a full-support initialization, continuous-time gradient descent can only converge to global minimizers.
The mean field training dynamic is well-posed and stable for sub-Gaussian initializations.
Abstract
A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global…
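To make the setting concrete, here is a minimal sketch of one model in the class the abstract mentions: a two-layer sigmoid network with vector output weights, written in mean-field form (the output is an average over m hidden units) and trained by plain gradient descent on a squared loss. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy data, the width m = 50, the step size, the Gaussian initialization, and the function names. It uses finite width and discrete steps, whereas the paper's result concerns continuous-time dynamics in the many-neurons limit, so it illustrates the setup rather than the theorem.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(params, x):
    """Mean-field output: average over hidden units of a_i * sigmoid(w_i . x + b_i)."""
    m = len(params)
    k = len(params[0][0])
    out = [0.0] * k
    for a, w, b in params:
        s = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for c in range(k):
            out[c] += a[c] * s / m
    return out


def loss(params, data):
    """Mean over samples of half the squared error."""
    total = 0.0
    for x, y in data:
        f = predict(params, x)
        total += 0.5 * sum((fc - yc) ** 2 for fc, yc in zip(f, y))
    return total / len(data)


def gd_step(params, data, lr):
    """One gradient descent step with the step size scaled by m.

    The gradient with respect to each unit carries a factor 1/m; using a
    step of lr * m cancels it, so each unit moves at a rate independent of m.
    """
    n = len(data)
    residuals = [[fc - yc for fc, yc in zip(predict(params, x), y)]
                 for x, y in data]
    new_params = []
    for a, w, b in params:
        ga = [0.0] * len(a)
        gw = [0.0] * len(w)
        gb = 0.0
        for (x, _), r in zip(data, residuals):
            s = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Chain rule through the sigmoid: (a . r) * sigmoid'(z)
            ds = sum(ac * rc for ac, rc in zip(a, r)) * s * (1.0 - s)
            for c in range(len(a)):
                ga[c] += s * r[c] / n
            for j in range(len(w)):
                gw[j] += ds * x[j] / n
            gb += ds / n
        new_params.append((
            [ac - lr * g for ac, g in zip(a, ga)],
            [wj - lr * g for wj, g in zip(w, gw)],
            b - lr * gb,
        ))
    return new_params


def train(m=50, steps=300, lr=0.2, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy data: 8 points in [-1, 1]^2 with 2-dimensional targets.
    data = []
    for _ in range(8):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, [math.sin(x[0]), x[0] * x[1]]))
    # Gaussian initialization, which has full support on parameter space.
    params = [([rng.gauss(0, 1), rng.gauss(0, 1)],
               [rng.gauss(0, 1), rng.gauss(0, 1)],
               rng.gauss(0, 1)) for _ in range(m)]
    initial = loss(params, data)
    for _ in range(steps):
        params = gd_step(params, data, lr)
    return initial, loss(params, data)


initial_loss, final_loss = train()
print(initial_loss, final_loss)
```

Increasing m and shrinking lr moves this sketch closer to the regime the paper analyzes, though no finite run can confirm a statement about the limit.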
