# Analysis of a Two-Layer Neural Network via Displacement Convexity

**Authors:** Adel Javanmard, Marco Mondelli, Andrea Montanari

arXiv: 1901.01375 · 2019-08-20

## TL;DR

This paper studies the global convergence of gradient descent when training a two-layer neural network with bump-like components to fit a concave function. In the limit of many neurons and vanishing bump width, the training dynamics become a Wasserstein gradient flow whose cost function is displacement convex, which implies exponential convergence rates.

## Contribution

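It proves that, as the number of neurons $N$ diverges, gradient descent on the bump centers converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. As the bump width $\delta$ tends to $0$, this flow has a limit that is a viscous porous medium equation, and the cost function it optimizes is displacement convex.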

## Key findings

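- As the number of neurons $N$ diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$.
- When the bump width $\delta$ tends to $0$, this gradient flow has a limit that is a viscous porous medium equation.
- The cost function optimized by the flow is displacement convex, which implies exponential convergence rates for $N\to\infty$, $\delta\to 0$.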
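
The role of concavity in the last two findings can be seen from a heuristic calculation. What follows is a formal sketch, not the paper's proof. It assumes the population risk is normalized as $R(\rho) = \tfrac12\int_\Omega (f - K^\delta * \rho)^2\,dx$, with $f$ the concave target and $K^\delta$ a symmetric bump that tends to a Dirac mass as $\delta\to 0$. The symbol $\Psi(\cdot;\rho)$ for the first variation of $R$ is notation introduced here, boundary conditions on $\partial\Omega$ are ignored, and constants and time scaling may differ from the paper's.

$$
\begin{aligned}
\partial_t \rho_t &= \nabla\cdot\big(\rho_t\,\nabla \Psi(\cdot\,;\rho_t)\big), \qquad \Psi(\cdot\,;\rho) = -K^\delta * \big(f - K^\delta * \rho\big) \;\xrightarrow{\;\delta\to 0\;}\; \rho - f,\\
\partial_t \rho_t &= \nabla\cdot\big(\rho_t\,\nabla(\rho_t - f)\big) = \tfrac12\,\Delta(\rho_t^2) - \nabla\cdot(\rho_t\,\nabla f),\\
R_0(\rho) &= \tfrac12\int_\Omega \rho^2\,dx \;-\; \int_\Omega f\,\rho\,dx \;+\; \tfrac12\int_\Omega f^2\,dx .
\end{aligned}
$$

In the limiting cost $R_0$, the first term is an internal energy that is displacement convex and generates the porous-medium term $\tfrac12\Delta(\rho^2)$. The second term is a potential energy with potential $-f$, which is displacement convex when $f$ is concave. The third term does not depend on $\rho$.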

## Abstract

Fitting a function by using linear combinations of a large number $N$ of "simple" components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches.

Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subseteq {\mathbb R}^d$, using linear combinations of "bump-like" components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. Further, when the bump width $\delta$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates for $N\to\infty$, $\delta\to 0$.

Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $\delta, N$. Explaining this phenomenon, and understanding the dependence on $\delta,N$ in a quantitative manner remains an outstanding challenge.
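
The finite-$N$ problem described in the abstract (fit $f$ by an average of $N$ bumps, with the bump centers as the only parameters) can be illustrated with a small one-dimensional experiment. This is a minimal sketch under assumptions that are not taken from the paper: the target $f(x) = 6x(1-x)$ on $\Omega = [0,1]$, a Gaussian bump of width $\delta$, a grid approximation of the risk, clipping of the centers to $\Omega$, a step size scaled by $N$, and the hypothetical function names `bump`, `risk_and_grad`, and `train`.

```python
import numpy as np


def bump(u, delta):
    """Gaussian bump of width delta (an assumed kernel, chosen for simplicity)."""
    return np.exp(-0.5 * (u / delta) ** 2) / (delta * np.sqrt(2.0 * np.pi))


def risk_and_grad(w, x, fx, delta):
    """Squared-error risk on the grid x and its gradient w.r.t. the centers w."""
    diff = x[:, None] - w[None, :]            # (n_grid, N): x - w_i
    k = bump(diff, delta)                     # bump values K(x - w_i)
    resid = fx - k.mean(axis=1)               # f(x) - (1/N) sum_i K(x - w_i)
    risk = np.mean(resid ** 2)
    # d/dw_i K(x - w_i) = K(x - w_i) * (x - w_i) / delta^2
    dk_dw = k * diff / delta ** 2
    grad = -(2.0 / w.size) * np.mean(resid[:, None] * dk_dw, axis=0)
    return risk, grad


def train(n_neurons=50, delta=0.1, lr=0.005, steps=400):
    """Plain gradient descent on the bump centers (illustrative settings)."""
    x = np.linspace(0.0, 1.0, 201)            # grid on Omega = [0, 1]
    fx = 6.0 * x * (1.0 - x)                  # assumed concave target
    w = np.linspace(0.05, 0.95, n_neurons)    # deterministic initial centers
    risks = []
    for _ in range(steps):
        risk, grad = risk_and_grad(w, x, fx, delta)
        risks.append(risk)
        # Step scaled by N so each center moves at an N-independent speed;
        # clipping keeps centers inside Omega (a crude boundary treatment).
        w = np.clip(w - lr * n_neurons * grad, 0.0, 1.0)
    risks.append(risk_and_grad(w, x, fx, delta)[0])
    return w, np.array(risks)
```

Calling `train()` returns the final centers and the risk recorded before each step plus once at the end. With these settings the final risk is lower than the initial one, but a toy run like this says nothing about the rates proved in the paper.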

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.01375/full.md

## Figures

35 figures with captions in the complete paper: https://tomesphere.com/paper/1901.01375/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/1901.01375/full.md

---
Source: https://tomesphere.com/paper/1901.01375