On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

Romain Petit; Clarice Poon; Gabriel Peyré

arXiv:2605.10775·math.OC·May 12, 2026

TL;DR

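This paper proves that, for wide shallow neural networks with bounded nonlinearities, every non-global minimizer of the training loss is unstable under gradient descent dynamics. Consequently, when the initial parameter distribution has full support and in the many-neuron limit, continuous-time gradient descent can only converge to global minimizers.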

Contribution

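Existing global convergence results for wide shallow models essentially cover ReLU activations and sigmoid activations with scalar output weights. Building on [Chizat and Bach, 2018], the paper extends them to a larger class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights.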

Findings

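01. Non-global minimizers of the training loss are unstable under gradient descent dynamics.

02. With a full-support initialization, and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers.

03. The mean-field training dynamics are well-posed and stable for sub-Gaussian initializations.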
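
To make the setting concrete, here is a minimal NumPy sketch of one model in the class the paper studies: a two-layer sigmoid network with vector output weights, trained by plain gradient descent. Everything specific in it is an assumption made for illustration and is not taken from the paper: the 1/m mean-field normalization of the output, the squared loss, the Gaussian (hence full-support) initialization, the small teacher network that generates the targets, the step size, and the function names `forward`, `gradients`, and `train`. Discrete gradient steps stand in for the continuous-time dynamics the paper analyzes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(A, B, X):
    """f(x) = (1/m) * sum_j B[j] * sigmoid(<A[j], x>), with B[j] a vector in R^d_out."""
    m = A.shape[0]
    return sigmoid(X @ A.T) @ B / m


def loss(A, B, X, Y):
    """Squared loss averaged over samples (an assumed choice of training loss)."""
    R = forward(A, B, X) - Y
    return 0.5 * np.mean(np.sum(R ** 2, axis=1))


def gradients(A, B, X, Y):
    """Gradients of the loss with respect to the inner weights A and output weights B."""
    n, m = X.shape[0], A.shape[0]
    H = sigmoid(X @ A.T)                 # hidden activations, shape (n, m)
    R = H @ B / m - Y                    # residuals, shape (n, d_out)
    grad_B = H.T @ R / (n * m)           # shape (m, d_out)
    grad_Z = (R @ B.T) * H * (1.0 - H) / (n * m)  # gradient w.r.t. pre-activations
    grad_A = grad_Z.T @ X                # shape (m, d_in)
    return grad_A, grad_B


def train(m=200, d_in=3, d_out=2, n=64, steps=300, lr=0.2, seed=0):
    """Run gradient descent and return the list of loss values, one per iterate."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d_in))

    # Hypothetical teacher: a narrow network of the same form generates the targets.
    A_star = rng.standard_normal((5, d_in))
    B_star = rng.standard_normal((5, d_out))
    Y = forward(A_star, B_star, X)

    # Gaussian initialization of the student (a full-support distribution).
    A = rng.standard_normal((m, d_in))
    B = rng.standard_normal((m, d_out))

    losses = [loss(A, B, X, Y)]
    for _ in range(steps):
        grad_A, grad_B = gradients(A, B, X, Y)
        # The factor m compensates the 1/m output normalization, so each neuron
        # moves at a rate that does not vanish as the width m grows.
        A -= lr * m * grad_A
        B -= lr * m * grad_B
        losses.append(loss(A, B, X, Y))
    return losses


if __name__ == "__main__":
    history = train()
    print(f"initial loss: {history[0]:.6f}, final loss: {history[-1]:.6f}")
```

A decreasing loss in a run like this illustrates the setup only. It is not evidence for the theorem, which concerns the infinite-width limit and the instability of non-global minimizers.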

Abstract

A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global…
