Analysis of a Two-Layer Neural Network via Displacement Convexity

Adel Javanmard; Marco Mondelli; Andrea Montanari

arXiv:1901.01375·math.ST·August 20, 2019

TL;DR

This paper studies the global convergence of gradient descent in training two-layer neural networks with bump-like components, revealing a connection to Wasserstein gradient flows and displacement convexity that ensures exponential convergence.

Contribution

It demonstrates that as the number of neurons grows and bump width shrinks, the training dynamics converge to a Wasserstein gradient flow with displacement convexity, providing new theoretical insights.

Findings

01. Gradient descent converges to a Wasserstein gradient flow as the number of neurons grows.

02. As the bump width tends to zero, this flow converges to a viscous porous medium equation.

03. Displacement convexity of the limiting cost function ensures exponential convergence.

Abstract

Fitting a function by using linear combinations of a large number $N$ of ‘simple’ components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches. Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subseteq{\mathbb{R}}^{d}$, using linear combinations of ‘bump-like’ components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient…

Equations

\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{i})\,.

\displaystyle R_{N}({\boldsymbol{a}},{\boldsymbol{w}})={\mathbb{E}}\Big{\{}\Big{[}y-\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{i})\Big{]}^{2}\Big{\}}\,.

{\mathbb{E}}(y_{j}|{\boldsymbol{x}}_{j})=f({\boldsymbol{x}}_{j})\,,

\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\,,

\displaystyle R_{N}({\boldsymbol{w}})=R_{\#}+{\mathbb{E}}\big{\{}\big{[}f({\boldsymbol{x}})-\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\big{]}^{2}\big{\}}\,,

R_{N}({\boldsymbol{w}})={\mathbb{E}}\big{\{}\big{[}f({\boldsymbol{x}})-\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\big{]}^{2}\big{\}}\,.

\lim_{N\to\infty,\,{\varepsilon}\to 0}\hat{\rho}^{(N)}_{t/{\varepsilon}}=\rho^{\delta}_{t}\,,

R^{\delta}(\rho)

R(\rho)

R_{N}({\boldsymbol{w}}^{k})\leq R_{N}({\boldsymbol{w}}^{0})\,e^{-2\alpha k{\varepsilon}}+{\sf err}(N,d,{\varepsilon},\delta)\,,

\langle{\boldsymbol{y}},\nabla^{2}f({\boldsymbol{x}}){\boldsymbol{y}}\rangle\leq-\alpha|{\boldsymbol{y}}|^{2}\,,\;\;\;\forall{\boldsymbol{x}}\in\Omega,\;{\boldsymbol{y}}\in{\mathbb{R}}^{d}\,,

(A4)

K(-{\boldsymbol{x}})=K({\boldsymbol{x}})\,,\;\;\;{\rm supp}(K)\subseteq{\sf B}({\boldsymbol{0}},c_{0})\,.

\Omega^{\delta}=\lambda_{\delta}\Omega\,,

\lambda_{\delta}=\sup\big{\{}\lambda\geq 0:\;\lambda\Omega\oplus{\sf B}({\boldsymbol{0}},c_{0}\,\delta)\subseteq\Omega\big{\}}\,.

{\boldsymbol{w}}_{i}^{k+1}={\sf P}\Big({\boldsymbol{w}}_{i}^{k}-{\varepsilon}\,\nabla K^{\delta}({\boldsymbol{x}}_{k+1}-{\boldsymbol{w}}_{i}^{k})\big(y_{k+1}-\hat{f}({\boldsymbol{x}}_{k+1};{\boldsymbol{w}}^{k})\big)+\sqrt{2{\varepsilon}\tau}\,{\boldsymbol{g}}_{i}^{k+1}\Big)\,.

\displaystyle{\sf P}({\boldsymbol{z}})=\arg\min\big{\{}|{\boldsymbol{z}}-{\boldsymbol{x}}|:\;\;\;{\boldsymbol{x}}\in\Omega^{\delta}\big{\}}\,.

\displaystyle{\sf P}({\boldsymbol{z}})=\begin{cases}{\boldsymbol{z}}&\mbox{ if $|{\boldsymbol{z}}|\leq r-c_{0}\delta$,}\\ (r-c_{0}\delta){\boldsymbol{z}}/|{\boldsymbol{z}}|&\mbox{ if $|{\boldsymbol{z}}|>r-c_{0}\delta$.}\end{cases}

\displaystyle\inf_{\rho}R^{\delta}(\rho)\leq R^{\delta}(f)=\nu_{0}\int_{\Omega}\big{[}f({\boldsymbol{x}})-K^{\delta}*f({\boldsymbol{x}})\big{]}^{2}{\rm d}{\boldsymbol{x}}\,.

\begin{split}\partial_{t}\rho_{t}({\boldsymbol{w}})&=\nabla\cdot\big{(}\rho_{t}({\boldsymbol{w}})\nabla\Psi({\boldsymbol{w}};\rho_{t})\big{)}+\tau\Delta\rho_{t}({\boldsymbol{w}})\,,\\ \Psi({\boldsymbol{w}};\rho)&\equiv-\nu_{0}\,K^{\delta}\ast f({\boldsymbol{w}})+\nu_{0}\,K^{\delta}\ast K^{\delta}\ast\rho({\boldsymbol{w}})\,,\end{split}

\rho_{0}=\rho^{\delta}_{\rm init}\,,\;\;\;\langle{\boldsymbol{n}}({\boldsymbol{w}}),\rho_{t}({\boldsymbol{w}})\nabla\Psi({\boldsymbol{w}};\rho_{t})+\tau\nabla\rho_{t}({\boldsymbol{w}})\rangle=0\;\;\;\forall{\boldsymbol{w}}\in\partial\Omega^{\delta}\,,

F^{\delta}(\rho)=\frac{1}{2}R^{\delta}(\rho)-\tau S(\rho)\,,\;\;\;S(\rho)=-\int\rho({\boldsymbol{w}})\log\rho({\boldsymbol{w}})\,{\rm d}{\boldsymbol{w}}\,.

\partial_{t}\rho_{t}({\boldsymbol{w}})

\rho_{0}=\rho_{\rm init}\,,\;\;\;\langle{\boldsymbol{n}}({\boldsymbol{w}}),\nu_{0}\,\rho_{t}({\boldsymbol{w}})\nabla(f({\boldsymbol{w}})-\rho_{t}({\boldsymbol{w}}))-\tau\nabla\rho_{t}({\boldsymbol{w}})\rangle=0\;\;\;\forall{\boldsymbol{w}}\in\partial\Omega\,.

W_{2}(\rho_{0},\rho_{1})^{2}=\inf_{\gamma\in\Gamma(\rho_{0},\rho_{1})}\int\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_{2}^{2}\,\gamma({\rm d}{\boldsymbol{x}},{\rm d}{\boldsymbol{y}})\,,

(1-t)F(\rho_{0})+tF(\rho_{1})-F(\rho_{t})\geq\frac{1}{2}\lambda\,t(1-t)\,W_{2}(\rho_{0},\rho_{1})^{2}\,.

\displaystyle K(x)=C_{d}\kappa(|x|)\,,\;\;\;\kappa(t)=\begin{cases}1-t^{2}-2t^{3}+2t^{4}&\mbox{ for $t\leq c_{0}=1$,}\\ 0&\mbox{ otherwise,}\end{cases}

f({\boldsymbol{x}})=\frac{c_{1}-\log\big(e^{\langle{\boldsymbol{q}}_{1},{\boldsymbol{x}}\rangle}+e^{\langle{\boldsymbol{q}}_{2},{\boldsymbol{x}}\rangle}\big)}{c_{2}}\,,

\hat{f}({\boldsymbol{x}};{\boldsymbol{w}},{\boldsymbol{a}})=\sum_{i=1}^{N}a_{i}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\,,

\hat{{\boldsymbol{a}}}=({\boldsymbol{Z}}^{\sf T}{\boldsymbol{Z}}+\lambda{\boldsymbol{I}})^{-1}{\boldsymbol{Z}}^{\sf T}{\boldsymbol{y}}\,,

\begin{split}&\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}}\bigg{|}\sum_{i=1}^{N}g({\boldsymbol{w}}_{i}^{k})-\int g({\boldsymbol{w}})\rho_{k\varepsilon}({\rm d}{\boldsymbol{w}})\bigg{|}\leq z\,{\sf err}(N,d,{\varepsilon},\delta)\,\,e^{C_{*}p\delta^{-(d+2)}\,T},\\ &\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}}|R_{N}({\boldsymbol{w}}^{k})-R^{\delta}(\rho_{k\epsilon})|\leq z\,{\sf err}(N,d,{\varepsilon},\delta)\,\,e^{C_{*}p\delta^{-(d+2)}\,T},\end{split}

Full text

Adel Javanmard (Data Science and Operations Department, Marshall School of Business, University of Southern California), Marco Mondelli (Department of Electrical Engineering, Stanford University), and Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University)

Abstract

Fitting a function by using linear combinations of a large number $N$ of ‘simple’ components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches.

Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subset{\mathbb{R}}^{d}$, using linear combinations of ‘bump-like’ components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. Further, when the bump width $\delta$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates for $N\to\infty$, $\delta\to 0$.

Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $\delta,N$. Explaining this phenomenon, and understanding the dependence on $\delta,N$ in a quantitative manner, remains an outstanding challenge.

1 Introduction

In supervised learning, we are given data $\{(y_{j},{\boldsymbol{x}}_{j})\}_{j\leq n}$ which are often assumed to be independent and identically distributed from a common law ${\mathbb{P}}$ on ${\mathbb{R}}\times{\mathbb{R}}^{d}$ (here ${\boldsymbol{x}}_{j}\in{\mathbb{R}}^{d}$ is a feature vector, and $y_{j}\in{\mathbb{R}}$ is a label or response variable). We would like to find a function $\hat{f}:{\mathbb{R}}^{d}\to{\mathbb{R}}$ to predict the labels at new points ${\boldsymbol{x}}\in{\mathbb{R}}^{d}$ . Throughout this paper, we will quantify the quality of our prediction by square loss, hence we are interested in minimizing $R(\hat{f})={\mathbb{E}}\{(y-\hat{f}({\boldsymbol{x}}))^{2}\}$ .

One of the most fruitful ideas in this context is to use functions that are linear combinations of simple components:

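$$\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{i})\,.\tag{1.1}$$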

Here $\sigma:{\mathbb{R}}^{d}\times{\mathbb{R}}^{D}\to{\mathbb{R}}$ is a component function (a ‘neuron’ or ‘unit’ in the neural network parlance), and ${\boldsymbol{w}}=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N})\in{\mathbb{R}}^{D\times N}$, ${\boldsymbol{a}}=(a_{1},\dots,a_{N})\in{\mathbb{R}}^{N}$ are parameters to be learnt from data. Standard choices for the activation function are $\sigma({\boldsymbol{x}};{\boldsymbol{w}})=(1+\exp(-\langle{\boldsymbol{w}},{\boldsymbol{x}}\rangle))^{-1}$ (sigmoid) or $\sigma({\boldsymbol{x}};{\boldsymbol{w}})=\max(\langle{\boldsymbol{w}},{\boldsymbol{x}}\rangle,0)$ (ReLU). In this paper we will instead study a class of activations that depend on the difference ${\boldsymbol{x}}-{\boldsymbol{w}}$. The objective is to minimize the population (prediction) risk

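$$R_{N}({\boldsymbol{a}},{\boldsymbol{w}})={\mathbb{E}}\Big\{\Big[y-\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{i})\Big]^{2}\Big\}\,.\tag{1.2}$$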

Special instantiations of this idea include (we provide only pointers to the immense literature on each topic):

• Two-layer neural networks [Ros62, AB09];

• Sparse deconvolution [Don92, CFG14];

• Kernel ridge regression and related random feature methods [CST00, RR08];

• Boosting [Sch03, Fri01, BY03].

Despite the impressive practical success of these methods, the risk function $R_{N}({\boldsymbol{w}})$ is highly non-convex and little is known about global convergence of algorithms that try to minimize it (we refer to Section 2 for further discussion of the related literature).

Notable exceptions to the last statement are provided by random features and by boosting algorithms. In random feature methods, the parameters ${\boldsymbol{w}}_{i}$ are not optimized over (they are drawn i.i.d. from some common distribution), and the resulting risk function becomes convex in the weights $(a_{1},\dots,a_{N})$ to be learnt. While this is a fruitful idea, it gives up the degrees of freedom afforded by the ${\boldsymbol{w}}_{i}$ ’s.

Boosting overcomes non-convexity by fitting the components ${\boldsymbol{w}}_{1}$, …, ${\boldsymbol{w}}_{N}$ one at a time, sequentially. The underlying assumption is that the problem of minimizing $R_{N}({\boldsymbol{w}})$ with respect to one of the hidden units ${\boldsymbol{w}}_{i}$ is tractable. However, this is generally not the case when the parameters ${\boldsymbol{w}}_{i}$ belong to a high-dimensional space.

The risk function (1.2) crystallizes a central conundrum in statistical learning. In a number of applications (especially at low noise), it is rarely the case that low prediction error can be achieved through a function that is linear in the raw covariates, e.g. $\hat{f}({\boldsymbol{x}})=\langle{\boldsymbol{w}},{\boldsymbol{x}}\rangle$. In a classical setting, the statistician would craft nonlinear features out of the covariates on the basis of expert knowledge. For the model of Eq. (1.1), this amounts to constructing vectors ${\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N}$. Statistical methods would then be confined to the convex task of fitting the coefficients $a_{1},\dots,a_{N}$. This step is well understood from a statistical and computational perspective.

Modern machine learning approaches (boosting, neural networks, etc.) hold the promise of automating feature extraction, hence producing superior performance in a wide variety of applications. Unfortunately, we are still far from understanding in which cases optimizing over the ${\boldsymbol{w}}_{i}$’s yields a significant improvement over, say, choosing them randomly. This central challenge intertwines statistical and computational aspects. It is not hard to see that varying the weights ${\boldsymbol{w}}_{i}$ produces a significantly larger function class [Bac17]. The relevant question is what part of this class can be accessed using gradient descent or other practical algorithms.

The main objective of this paper is to introduce a nonparametric regression model in which these questions can be addressed rigorously. The model is interesting for at least two reasons: $(i)$ From a theoretical point of view, global convergence can be proved in the limit of a large number of neurons. The proof relies on a mathematical mechanism that has not been explored in the statistics or machine learning literature before. $(ii)$ From a practical point of view, the model is nontrivial enough to illustrate the potential advantage of fitting the features ${\boldsymbol{w}}_{i}$ (we demonstrate this numerically in Section 4).

Let $\Omega\subset{\mathbb{R}}^{d}$ be a compact convex set with $\mathscrsfs{C}^{2}$ boundary. We assume $\{(y_{j},{\boldsymbol{x}}_{j})\}_{j\geq 1}$ to be i.i.d. where ${\boldsymbol{x}}_{j}\sim{\sf Unif}(\Omega)$ and

$${\mathbb{E}}(y_{j}|{\boldsymbol{x}}_{j})=f({\boldsymbol{x}}_{j})\,,\tag{1.3}$$

with $f:\Omega\to{\mathbb{R}}$ a smooth function. We try to fit these data using a combination of bumps, namely

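$$\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\,,\tag{1.4}$$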

where $K^{\delta}({\boldsymbol{x}})=\delta^{-d}K({\boldsymbol{x}}/\delta)$, $K:{\mathbb{R}}^{d}\to{\mathbb{R}}_{\geq 0}$ is a first order kernel with compact support, and ${\boldsymbol{w}}_{i}\in\Omega^{\delta}$ for $i\leq N$. Here $\Omega^{\delta}$ is a slightly smaller compact set, with $\Omega^{\delta}\to\Omega$ as $\delta\to 0$. (Note that in our setting the hidden units ${\boldsymbol{w}}_{i}$ and input data ${\boldsymbol{x}}_{j}$ have the same dimension, i.e., $d=D$.) We refer to Section 5 for a formal statement of our assumptions. From Eq. (1.2), we have

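$$R_{N}({\boldsymbol{w}})=R_{\#}+{\mathbb{E}}\Big\{\Big[f({\boldsymbol{x}})-\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\Big]^{2}\Big\}\,,$$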

where $R_{\#}={\mathbb{E}}[(y-f({\boldsymbol{x}}))^{2}]$ and we use the fact that ${\mathbb{E}}[y-f({\boldsymbol{x}})|{\boldsymbol{x}}]=0$ . Since the constant $R_{\#}$ does not depend on parameters ${\boldsymbol{w}}$ , it does not matter in optimizing $R_{N}({\boldsymbol{w}})$ over ${\boldsymbol{w}}$ and henceforth we write, with a slight abuse of notation,

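$$R_{N}({\boldsymbol{w}})={\mathbb{E}}\Big\{\Big[f({\boldsymbol{x}})-\frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})\Big]^{2}\Big\}\,.$$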
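To make the bump model concrete, the following minimal Python sketch evaluates a network of the form $\hat{f}(x;{\boldsymbol{w}})=\frac{1}{N}\sum_{i}K^{\delta}(x-w_{i})$ in dimension $d=1$. It uses the polynomial profile $\kappa(t)=1-t^{2}-2t^{3}+2t^{4}$ that appears among the paper's equations; the restriction to $d=1$, the constant $C_{1}=15/17$ (chosen here so that $K$ integrates to one), and all function names are assumptions of this illustration, not the authors' code.

```python
def kappa(t):
    """Bump profile kappa(t) = 1 - t^2 - 2t^3 + 2t^4 on [0, 1], zero beyond."""
    return 1 - t**2 - 2 * t**3 + 2 * t**4 if t <= 1.0 else 0.0

# Assumed normalisation for d = 1: the integral of kappa(|x|) over [-1, 1]
# equals 17/15, so C1 = 15/17 makes K integrate to one.
C1 = 15.0 / 17.0

def K(x):
    """Kernel K(x) = C1 * kappa(|x|), supported on [-1, 1]."""
    return C1 * kappa(abs(x))

def K_delta(x, delta):
    """Rescaled kernel K^delta(x) = K(x / delta) / delta (case d = 1)."""
    return K(x / delta) / delta

def f_hat(x, centers, delta):
    """Network output: the average of N bumps centred at the given points."""
    return sum(K_delta(x - w, delta) for w in centers) / len(centers)

# Hypothetical centres: with delta = 0.25 only the bump at 0 overlaps x = 0.
centers = [-0.5, 0.0, 0.5]
print(f_hat(0.0, centers, delta=0.25))
```

Because each bump has support of radius $c_{0}\delta$, a point $x$ only feels the centers within that distance, which is why the positions ${\boldsymbol{w}}_{i}$ (rather than any output weights) carry all the fitting capacity in this model.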

The model (1.4) is general enough to include a broad class of radial-basis function (RBF) networks which are known to be universal function approximators [PS91]. To the best of our knowledge, there is no result on the global convergence of stochastic gradient descent for learning RBF networks, and this paper establishes the first result of this type.

It is important to emphasize a few differences with respect to standard RBF networks. First of all, we do not require the kernel $K({\boldsymbol{x}})$ to be radial, i.e. to depend only on the norm $|{\boldsymbol{x}}|$. Second, we require $K$ to have compact support. This is mainly a technical requirement that simplifies some arguments: we expect our results to be generalizable to kernels that decay rapidly enough. Finally, and most crucially, the form (1.4) does not include non-uniform weights for the $N$ components. A more standard formulation would posit $\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\sum_{i=1}^{N}a_{i}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})$ and learn the weights $a_{i}$ from data, see Eq. (1.1). We deliberately set the weights to a fixed value because the risk function is convex in ${\boldsymbol{a}}=(a_{i})_{i\leq N}$, and hence fitting ${\boldsymbol{a}}$ to global optimality is ‘easy.’ Indeed, universal approximation could be achieved by keeping the centers ${\boldsymbol{w}}_{i}$ fixed (and sufficiently dense in $\Omega$) and only adjusting ${\boldsymbol{a}}$. As discussed above, our focus is on the role of the ${\boldsymbol{w}}_{i}$’s.

Our main result is a proof that, for sufficiently large $N$ and small $\delta$, gradient descent algorithms converge to weights ${\boldsymbol{w}}$ with nearly optimal prediction error, provided $f$ is strongly concave. Let us emphasize that the resulting population risk $R_{N}({\boldsymbol{w}})$ is non-convex regardless of the concavity properties of $f$. Our proof unveils a novel mechanism by which global convergence takes place. Convergence results for non-convex empirical risk minimization are generally proved by carefully ruling out local minima in the cost function (see Section 2 for pointers to this literature). Instead we prove that, as $N\to\infty$, $\delta\to 0$, the gradient descent dynamics converges to a gradient flow in Wasserstein space, and that the corresponding cost function is ‘displacement convex.’ Breakthrough results in optimal transport theory guarantee dimension-free convergence rates for this limiting dynamics [CJM+01, CMV03, CMV06]. In particular, we expect the cost function $R_{N}({\boldsymbol{w}})$ to have many local minima, which are however completely neglected by the gradient descent dynamics.

More specifically, our first step is to show that, for large $N$, the evolution of the weights ${\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N}$ under gradient descent can be replaced by the evolution of a probability distribution $\rho^{\delta}\in\mathscrsfs{P}_{2}(\Omega)$, which approximates their empirical distribution. (Throughout, $\mathscrsfs{P}_{2}({\cal X})$ denotes the space of probability distributions on ${\cal X}$, endowed with the Wasserstein metric $W_{2}$.) Namely, if $({\boldsymbol{w}}^{k}_{1},\dots,{\boldsymbol{w}}^{k}_{N})$ denote the weights after $k$ iterations with step size ${\varepsilon}$, and $\hat{\rho}^{(N)}_{k}=\sum_{i=1}^{N}\delta_{{\boldsymbol{w}}_{i}^{k}}/N$ is their empirical distribution, then we have

$$\lim_{N\to\infty,\,{\varepsilon}\to 0}\hat{\rho}^{(N)}_{t/{\varepsilon}}=\rho^{\delta}_{t}\,,\tag{1.5}$$

where the limit holds in the sense of weak convergence or in $W_{1}$ distance (the two are equivalent since $\Omega$ is compact). The limit evolution $(\rho^{\delta}_{t})_{t\geq 0}$ satisfies a partial differential equation (PDE) that can also be described as the Wasserstein $W_{2}$ gradient flow (i.e. gradient flow in $\mathscrsfs{P}_{2}(\Omega)$ ), for the following effective risk

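$$R^{\delta}(\rho)=\nu_{0}\int_{\Omega}\big[f({\boldsymbol{x}})-K^{\delta}\ast\rho({\boldsymbol{x}})\big]^{2}{\rm d}{\boldsymbol{x}}\,,\tag{1.6}$$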

where $\nu_{0}=1/|\Omega|$ and $|\Omega|$ denotes the volume of the set $\Omega$ . Here $\ast$ denotes the usual convolution. Let us emphasize that the convergence to Wasserstein gradient flow holds regardless of the concavity of $f$ .
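The link between the finite-$N$ risk and the effective risk can be spelled out in two lines. The derivation below is a sketch: it assumes the effective risk has the form $R^{\delta}(\rho)=\nu_{0}\int_{\Omega}[f-K^{\delta}\ast\rho]^{2}$ (the form consistent with the expression for $R^{\delta}(f)$ listed among the paper's equations) and uses only that ${\boldsymbol{x}}\sim{\sf Unif}(\Omega)$ and that $\hat{\rho}^{(N)}$ is the empirical distribution of the centers.

```latex
\begin{align*}
K^{\delta}\ast\hat{\rho}^{(N)}({\boldsymbol{x}})
  &= \int K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}})\,\hat{\rho}^{(N)}({\rm d}{\boldsymbol{w}})
   = \frac{1}{N}\sum_{i=1}^{N}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})
   = \hat{f}({\boldsymbol{x}};{\boldsymbol{w}})\,,\\
R_{N}({\boldsymbol{w}})
  &= {\mathbb{E}}\big\{[f({\boldsymbol{x}})-\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})]^{2}\big\}
   = \frac{1}{|\Omega|}\int_{\Omega}\big[f({\boldsymbol{x}})-K^{\delta}\ast\hat{\rho}^{(N)}({\boldsymbol{x}})\big]^{2}{\rm d}{\boldsymbol{x}}
   = R^{\delta}(\hat{\rho}^{(N)})\,.
\end{align*}
```

Under this assumption, the finite-$N$ risk is simply the effective risk evaluated at the empirical distribution of the centers, which is what makes a description in terms of $\rho$ natural.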

The use of $W_{2}$ gradient flows to analyze two-layer neural networks was recently developed in several papers [MMN18, RVE18, CB18, SS18]. However, we cannot rely on earlier results because of the specific boundary conditions in our problem. We constrain the ${\boldsymbol{w}}_{i}\in\Omega^{\delta}$ by running projected stochastic gradient descent (SGD): at each step ${\boldsymbol{w}}_{i}$ moves in the direction of a stochastic gradient of $R_{N}({\boldsymbol{w}})$ and is then projected back to $\Omega^{\delta}$. This results in a PDE with Neumann boundary condition on $\Omega^{\delta}$, which is not covered by previous theory. We establish a quantitative version of the limit (1.5) via propagation-of-chaos techniques.
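The projected SGD step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it works in $d=1$ with $\Omega=[-r,r]$, so that the projection onto $\Omega^{\delta}=[-(r-c_{0}\delta),\,r-c_{0}\delta]$ is a clamp (the one-dimensional case of the ball projection listed among the paper's equations); it uses the polynomial kernel profile from those equations with the assumed normalisation $C_{1}=15/17$; and it omits any noise term, so it corresponds to the noise-free update only. The helper names are hypothetical.

```python
C0 = 1.0           # support radius c_0 of the kernel
C1 = 15.0 / 17.0   # assumed d = 1 normalisation so that K integrates to one

def K(x):
    """Kernel K(x) = C1 * (1 - t^2 - 2t^3 + 2t^4) with t = |x| <= c_0."""
    t = abs(x)
    return C1 * (1 - t**2 - 2 * t**3 + 2 * t**4) if t <= C0 else 0.0

def K_prime(x):
    """Derivative of K; it vanishes at 0 and outside the support."""
    t = abs(x)
    if t > C0:
        return 0.0
    sign = (x > 0) - (x < 0)
    return C1 * sign * (-2 * t - 6 * t**2 + 8 * t**3)

def K_delta(x, delta):
    return K(x / delta) / delta

def K_delta_prime(x, delta):
    return K_prime(x / delta) / delta**2

def project(z, r, delta):
    """Projection onto Omega^delta = [-(r - c_0 delta), r - c_0 delta]."""
    bound = r - C0 * delta
    return max(-bound, min(bound, z))

def sgd_step(w, x, y, eps, delta, r):
    """One noise-free projected SGD step on all centres for a sample (x, y)."""
    residual = y - sum(K_delta(x - wi, delta) for wi in w) / len(w)
    return [
        project(wi - eps * K_delta_prime(x - wi, delta) * residual, r, delta)
        for wi in w
    ]

# Hypothetical sample: the network overshoots at x = 0.2 (y = 0 there),
# so the single centre at 0 is pushed away from x.
print(sgd_step([0.0], x=0.2, y=0.0, eps=0.01, delta=0.5, r=1.0))
```

The projection is what produces the reflecting (Neumann) boundary behaviour in the limiting PDE: centers that would leave $\Omega^{\delta}$ are returned to its boundary rather than discarded.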

Even if the cost (1.6) is quadratic and convex in $\rho$, its $W_{2}$ gradient flow can have multiple fixed points, and hence global convergence cannot be guaranteed. Global convergence results were proven in [MMN18] and in [CB18] by showing that, for all $t\geq 0$, $\rho^{\delta}_{t}$ has a density that is either smooth, or strictly positive everywhere. However, these convergence results are non-quantitative, and do not provide convergence rates. (An argument indicating convergence in a time polynomial in $d$ was put forward in [WLLM18], but for a different type of continuous flow.)

Indeed, the mathematical property that controls global convergence of $W_{2}$ gradient flow is not ordinary convexity but displacement convexity. Roughly speaking, displacement convexity is convexity along geodesics of the $W_{2}$ metric, see Section 3.5. The risk function (1.6) is not displacement convex. Indeed, its quadratic term reads $\nu_{0}\int K^{\delta}\ast K^{\delta}({\boldsymbol{x}}-{\boldsymbol{x}}^{\prime})\rho({\boldsymbol{x}})\rho({\boldsymbol{x}}^{\prime}){\rm d}{\boldsymbol{x}}{\rm d}{\boldsymbol{x}}^{\prime}$, which is not displacement convex unless $K^{\delta}\ast K^{\delta}$ is convex (see Lemma H.1), which cannot be the case in our setting. However, for small $\delta$, we can formally approximate $K^{\delta}\ast\rho\approx\rho$, and hence hope to replace the risk function (1.6) with a simpler one

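$$R(\rho)=\nu_{0}\int_{\Omega}\big[f({\boldsymbol{x}})-\rho({\boldsymbol{x}})\big]^{2}{\rm d}{\boldsymbol{x}}\,.$$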

Most of our technical work is devoted to making rigorous this $\delta\to 0$ approximation. Namely, we prove that, as $\delta\to 0$ , $\rho^{\delta}_{t}\Rightarrow\rho_{t}$ where $\rho_{t}$ follows the $W_{2}$ gradient flow for the risk $R(\rho)$ .
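The formal approximation $K^{\delta}\ast\rho\approx\rho$ can be checked numerically in a simple case. The sketch below is only an illustration: it works in $d=1$, uses the polynomial kernel profile from the paper's equations with the assumed normalisation $C_{1}=15/17$, and takes a hypothetical smooth density $\rho(u)=\tfrac{3}{4}(1-u^{2})$ on $[-1,1]$; the convolution is approximated by a midpoint rule.

```python
C1 = 15.0 / 17.0  # assumed d = 1 normalisation so that K integrates to one

def K(x):
    """Kernel K(x) = C1 * (1 - t^2 - 2t^3 + 2t^4) with t = |x| <= 1."""
    t = abs(x)
    return C1 * (1 - t**2 - 2 * t**3 + 2 * t**4) if t <= 1.0 else 0.0

def K_delta(x, delta):
    return K(x / delta) / delta

def smooth(rho, x, delta, n=4000):
    """Midpoint-rule approximation of (K^delta * rho)(x) over |x - u| <= delta."""
    h = 2.0 * delta / n
    total = 0.0
    for j in range(n):
        u = x - delta + (j + 0.5) * h
        total += K_delta(x - u, delta) * rho(u)
    return total * h

def rho(u):
    """Hypothetical smooth density on [-1, 1], used only for this check."""
    return 0.75 * (1.0 - u * u)

# Gap between the smoothed and the original density at an interior point.
for delta in (0.2, 0.1, 0.05):
    print(delta, abs(smooth(rho, 0.2, delta) - rho(0.2)))
```

In this example the gap shrinks by roughly a factor of four each time $\delta$ is halved, as expected for a symmetric kernel, whose smoothing bias is of order $\delta^{2}$ at interior points. Near the boundary of the domain the approximation is more delicate, which is part of what the rigorous argument has to handle.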

Remarkably, the risk function $R(\rho)$ is strongly displacement convex (provided $f$ is strongly concave). A long line of work in PDE and optimal transport theory establishes dimension-free convergence rates for its $W_{2}$ gradient flow [CJM+01, CMV03, CMV06]. Namely, if $f$ is $\alpha$-strongly concave, then $R(\rho_{t})\leq R(\rho_{0})\,e^{-2\alpha t}$. By using the approximation results outlined above, we obtain global convergence for SGD. With high probability,

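$$R_{N}({\boldsymbol{w}}^{k})\leq R_{N}({\boldsymbol{w}}^{0})\,e^{-2\alpha k{\varepsilon}}+{\sf err}(N,d,{\varepsilon},\delta)\,,$$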

where the error term ${\sf err}$ vanishes as $N\to\infty$ , ${\varepsilon},\delta\to 0$ in a suitable order.

This result implies that SGD converges exponentially fast to a near-global optimum with a rate that is controlled by the convexity parameter $\alpha$ .
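As a small worked example of what this rate means, the exponential factor $e^{-2\alpha k{\varepsilon}}$ drops below a target fraction $\eta$ once $k\geq\log(1/\eta)/(2\alpha{\varepsilon})$. The Python sketch below computes this iteration count; the numerical values of $\alpha$, ${\varepsilon}$ and $\eta$ are hypothetical, and the calculation says nothing about the additive error term ${\sf err}$.

```python
import math

def iterations_for_decay(alpha, eps, eta):
    """Smallest integer k with exp(-2 * alpha * k * eps) <= eta, for 0 < eta < 1."""
    return math.ceil(math.log(1.0 / eta) / (2.0 * alpha * eps))

# Hypothetical values: alpha = 1, step size 0.01, target fraction 1%.
k = iterations_for_decay(alpha=1.0, eps=0.01, eta=0.01)
print(k, math.exp(-2.0 * 1.0 * k * 0.01))
```

The count depends on the step size only through the product $k{\varepsilon}$, that is, through the elapsed time of the limiting flow, and it does not involve the dimension $d$.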

Our bounds are not sharp enough to provide quantitative control on the error term ${\sf err}(N,d,{\varepsilon},\delta)$ , especially in high dimension. Nevertheless, the convergence rate predicted by our asymptotic theory is in excellent agreement with numerical simulations, cf. Section 4. Explaining this surprising quantitative agreement is an outstanding challenge.

2 Related literature

The present work ties in with several lines of research, some of which were already mentioned in the introduction. A substantial amount of work has been devoted to analyzing two-layer neural networks and developing algorithms with convergence guarantees, see e.g. [ZSJ+17, Tia17, BJW18]. However, these approaches are typically based on tensor factorization or similar initialization steps that are not used in practice, and do not scale well (although polynomially) in high dimension.

The landscape of empirical risk minimization was also studied in a number of papers, see e.g. [LY17, SJL18]. However, global convergence was only proved in the extremely overparametrized regime in which the neural network essentially behaves as kernel ridge regression [DZPS18].

Classical theory of neural networks was largely devoted to the two-layer case [AB09], although the focus was on representation and approximation questions [Cyb89, Bar93], as well as on generalization error. It was already clear in that context that a two-layer network is conveniently characterized by the empirical distribution of the hidden neurons, and that it is useful to relax this from a distribution with $N$ atoms, to a general probability measure. This representation plays an important role, for instance, in [Bar98], and was exploited again under the label of ‘convex neural networks’ in [BRV+06].

Over the last year, several groups independently revisited this connection, with the objective of understanding the landscape structure of two-layer networks, and the dynamics of gradient descent methods [NS17, MMN18, RVE18, SS18, CB18, MMM19]. In particular, it was proven in [MMN18] that, under certain smoothness conditions on the underlying data distribution, the gradient descent evolution is well approximated by a Wasserstein gradient flow, provided that the number of neurons exceeds the data dimension. As mentioned above, the algorithm treated here differs from the ones analyzed in earlier work, because the weights ${\boldsymbol{w}}_{i}$ are constrained to lie in the convex set $\Omega^{\delta}$. We enforce this constraint by using projected SGD, i.e. projecting at each step the weights onto the set $\Omega^{\delta}$. We generalize the analysis of [MMN18], obtaining convergence to a PDE with Neumann (reflecting) boundary conditions. As in [MMN18], we build on ideas that were first developed in the context of interacting particle systems [Dob79, Szn91].

The Wasserstein gradient flow approach was used in [MMN18, CB18] to establish global convergence results. However, these results fall short of our objectives for several reasons:

• The global convergence result of [CB18] relies on certain homogeneity properties of the neurons that are lacking here. We could obtain homogeneity by adding coefficients to Eq. (1.4), i.e. considering $\hat{f}({\boldsymbol{x}};{\boldsymbol{w}})=\sum_{i=1}^{N}a_{i}K^{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})$ and minimizing the risk with respect to the coefficients $a_{i}$. As mentioned above, we refrain from introducing coefficients so as not to oversimplify the problem: when $N\to\infty$, it is sufficient to fit the coefficients $a_{i}$ to achieve vanishing risk, and fitting the $a_{i}$’s is a least squares problem.

• Most importantly, the techniques of [MMN18, CB18] do not establish any convergence rates. This is not surprising, as those results hold under weak assumptions on the data distribution and the activation function. In particular, [MMN18, CB18, MMM19] cover general risk functions of the form (1.2) under certain smoothness and boundedness conditions on $\sigma$ and on the functions $V({\boldsymbol{w}})=-{\mathbb{E}}\{f({\boldsymbol{x}})\sigma({\boldsymbol{x}};{\boldsymbol{w}})\}$, $U({\boldsymbol{w}}_{1},{\boldsymbol{w}}_{2})={\mathbb{E}}\{\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{1})\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{2})\}$. In such a general setting [MMN18] provides examples in which the Wasserstein gradient flow has multiple fixed points, which are singular with respect to the Lebesgue measure. Global convergence is established in [MMN18, CB18] by proving that the PDE solution $\rho_{t}$ has a strictly positive density. However, it is difficult to imagine this condition holding in a quantitative, dimension-independent manner.

In contrast, our results are a first step towards dimension-independent convergence rates, in a more restricted setting than [MMN18, CB18, MMM19].

In summary, our results do not subsume earlier work, which assumes a more general setting, but rather establish stronger results in a narrower context. Indeed, we believe that specific structural conditions must be imposed on the data distribution and activation function for the Wasserstein gradient flow approach to yield quantitative convergence rates. This paper presents one specific set of assumptions. Although our results are not strong enough to establish non-asymptotic convergence rates, they point clearly in that direction.

3 Model and assumptions

3.1 Notations

We will use lowercase boldface for vectors, e.g. ${\boldsymbol{x}},{\boldsymbol{y}},\dots$ , uppercase for random variables, e.g. $X,Y,\dots$ , and uppercase boldface for random vectors, e.g. $\boldsymbol{X},\boldsymbol{Y},\dots$ . The scalar product of two vectors is denoted by $\langle{\boldsymbol{x}},{\boldsymbol{y}}\rangle=\sum_{i=1}^{d}x_{i}y_{i}$ , and the $\ell_{2}$ norm of a vector is denoted by $|{\boldsymbol{x}}|$ . The Euclidean ball in ${\mathbb{R}}^{d}$ with center ${\boldsymbol{x}}$ and radius $r$ is denoted by ${\sf B}({\boldsymbol{x}};r)$ . Given a set $\Omega\subseteq{\mathbb{R}}^{d}$ , we denote by $|\Omega|$ its volume.

We will refer to several function spaces in what follows. The most common is the space of $p$-th power integrable functions $\mathscrsfs{L}^{p}({\cal X})$ on a measure space $({\cal X},{\mathcal{F}},\mu)$. Given a function $f:{\cal X}\to{\mathbb{R}}$, we denote by $\|f\|_{\mathscrsfs{L}^{p}({\cal X})}$ its $\mathscrsfs{L}^{p}$ norm, namely $\|f\|_{\mathscrsfs{L}^{p}({\cal X})}^{p}=\int_{{\cal X}}|f(x)|^{p}\,\mu({\rm d}x)$. For $S\subseteq{\mathbb{R}}^{m}$, $\mathscrsfs{C}^{k}(S)$ denotes the space of continuous functions $f:S\to{\mathbb{R}}$ with continuous derivatives up to order $k$. In particular, $\mathscrsfs{C}(S)$ denotes the space of continuous real-valued functions defined on $S$. In addition, for $T\in{\mathbb{R}}_{+}$ and a metric space $\mathcal{M}$ (with distance $d_{\mathcal{M}}$), $\mathscrsfs{C}([0,T],\mathcal{M})$ denotes the set of continuous functions $f:[0,T]\to\mathcal{M}$, endowed with the distance between two functions $f,g\in\mathscrsfs{C}([0,T],\mathcal{M})$ defined as $d_{\mathscrsfs{C}([0,T],\mathcal{M})}(f,g)\equiv\sup_{t\in[0,T]}d_{\mathcal{M}}(f(t),g(t))$. For a function $f:S\to{\mathbb{R}}$, we let $\|f\|_{\rm Lip}\equiv\sup_{{\boldsymbol{x}}\neq{\boldsymbol{y}}\in S}|f({\boldsymbol{x}})-f({\boldsymbol{y}})|/|{\boldsymbol{x}}-{\boldsymbol{y}}|$ be the Lipschitz constant of the function $f$. Finally, as mentioned above, $\mathscrsfs{P}_{2}({\cal X})$ denotes the space of probability distributions on ${\cal X}$, endowed with the Wasserstein metric $W_{2}$.

Throughout the paper, we use $C$ to denote finite constants, which can vary from point to point. When these constants can depend on some of the problem parameters, e.g. $a,b,c$ , we will write $C(a,b,c)$ . When they are absolute numerical constants, we will emphasize this by writing $C_{*}$ .

3.2 Data

As mentioned above, we are given data $(y_{j},{\boldsymbol{x}}_{j})\sim_{\rm i.i.d.}{\mathbb{P}}$ where ${\boldsymbol{x}}_{j}\sim{\sf Unif}(\Omega)$ , with $\Omega\subset{\mathbb{R}}^{d}$ a compact convex set, and $y_{j}=f({\boldsymbol{x}}_{j})+{\varepsilon}_{j}$ , with $f:\Omega\to{\mathbb{R}}_{\geq 0}$ . We assume the ${\varepsilon}_{j}$ to be i.i.d. $\sigma^{2}$ -subgaussian random variables with ${\mathbb{E}}({\varepsilon}_{j}|{\boldsymbol{x}}_{j})=0$ . We assume the function $f$ to be concave and smooth.

Our formal assumptions on the set $\Omega$ and the function $f$ are as follows:

(A1)

$\Omega\supseteq{\sf B}({\boldsymbol{0}};r)$ , with $r>0$ , is a compact convex set with $\mathscrsfs{C}^{2}$ boundary.

(A2)

$f:\Omega\to{\mathbb{R}}_{\geq 0}$ is uniformly concave, i.e., there exists $\alpha>0$ such that

$$\langle{\boldsymbol{y}},\nabla^{2}f({\boldsymbol{x}}){\boldsymbol{y}}\rangle\leq-\alpha|{\boldsymbol{y}}|^{2}\quad\text{for all }{\boldsymbol{x}}\in\Omega,\ {\boldsymbol{y}}\in{\mathbb{R}}^{d},\tag{3.1}$$

where $\nabla^{2}f$ denotes the Hessian of $f$ .

(A3)

$f\in\mathscrsfs{C}^{\infty}(\Omega)$ , with $\|f\|_{\mathscrsfs{L}^{\infty}(\Omega)},\|\nabla f\|_{\mathscrsfs{L}^{\infty}(\Omega)}\leq C_{*}$ for an absolute constant $C_{*}$ .

Without loss of generality, we can also assume that $\int_{\Omega}f({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}=1$ . As a running example, we will use $\Omega={\sf B}({\boldsymbol{0}};r)$ , where we recall that $r$ is defined in Assumption (A1).

Remark 3.1.

The assumption ${\boldsymbol{x}}_{j}\sim{\sf Unif}(\Omega)$ is quite strong but simplifies our analysis. We believe our approach can be generalized to a broader family of probability distributions for the covariates ${\boldsymbol{x}}_{j}$ , but defer these generalizations to future work.

3.3 Neural network and SGD

Let $K\in\mathscrsfs{C}^{2}({\mathbb{R}}^{d})$ be a non-negative symmetric first order kernel with compact support. Formally, we assume that

[TABLE]

The assumptions of symmetry and compact support are not crucial, but simplify some of the technical details later. We will further assume $\|\nabla K\|_{\mathscrsfs{L}^{\infty}(\mathbb{R}^{d})}$ , $\|\nabla^{2}K\|_{\mathscrsfs{L}^{\infty}(\mathbb{R}^{d})}$ and $c_{0}$ to be independent of the ambient dimension $d$ . Notice that this requirement follows from the differentiability and compact support assumptions if $K({\boldsymbol{x}})=\kappa(\|{\boldsymbol{x}}\|_{2})$ is a radial function.

For $\delta>0$ , let $K^{\delta}({\boldsymbol{x}})=\delta^{-d}K({\boldsymbol{x}}/\delta)$ . We try to fit the function (1.4) with parameters ${\boldsymbol{w}}=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N})$ . These parameters are constrained to ${\boldsymbol{w}}_{i}\in\Omega^{\delta}$ , which is a suitable scaling of $\Omega$ defined as follows. Given $\delta<r/c_{0}$ , with $r$ defined in (A1), define

[TABLE]

where

[TABLE]

For two sets $A,B\subseteq{\mathbb{R}}^{d}$ , their Minkowski sum is defined as $A\oplus B=\{{\boldsymbol{x}}+{\boldsymbol{y}}:{\boldsymbol{x}}\in A,{\boldsymbol{y}}\in B\}$ . Note that $\lambda_{\delta}\in[0,1]$ for all $\delta$ . Furthermore, $\Omega\supseteq{\sf B}({\boldsymbol{0}};r)$ implies $\lambda_{\delta}>0$ for all $\delta<r/c_{0}$ . Finally, $\lambda_{\delta=0}=1$ , whence $\Omega^{\delta=0}=\Omega$ . In our running example, $\Omega^{\delta}={\sf B}({\mathbf{0}};r-c_{0}\delta)$ is a ball of slightly smaller radius. Clearly, since $\Omega$ is convex, $\Omega^{\delta}$ is convex as well.
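The rescaling $K^{\delta}({\boldsymbol{x}})=\delta^{-d}K({\boldsymbol{x}}/\delta)$ and the shrunken domain of the running example can be illustrated with a minimal sketch in $d=1$ . The specific bump `bump`, the choice $c_{0}=1$ (taken here to be the support radius of $K$ ), and the helper names are assumptions made for illustration only; they are not the kernel or constants used in the paper.

```python
import numpy as np

def bump(u):
    """Unnormalized smooth bump supported on [-1, 1] (hypothetical stand-in for K)."""
    u = np.asarray(u, dtype=float)
    inside = np.abs(u) < 1.0
    safe = np.where(inside, 1.0 - u ** 2, 1.0)  # avoid division by zero outside the support
    return np.where(inside, np.exp(-1.0 / safe), 0.0)

# Normalizing constant so that K integrates to 1 (midpoint rule on [-1, 1]).
_EDGES = np.linspace(-1.0, 1.0, 20001)
_MID = 0.5 * (_EDGES[1:] + _EDGES[:-1])
_Z = np.sum(bump(_MID)) * (_EDGES[1] - _EDGES[0])

def K(x):
    """Normalized kernel: non-negative, symmetric, compactly supported."""
    return bump(x) / _Z

def K_delta(x, delta, d=1):
    """Rescaled kernel K^delta(x) = delta^{-d} K(x / delta); only d = 1 is handled here."""
    return K(np.asarray(x, dtype=float) / delta) / delta ** d

def lambda_delta_ball(r, delta, c0=1.0):
    """Scaling factor for Omega = B(0; r), for which Omega^delta = B(0; r - c0 * delta)."""
    assert delta < r / c0, "the construction requires delta < r / c0"
    return 1.0 - c0 * delta / r
```

The rescaling preserves the total mass of the kernel while shrinking its support to radius $\delta$ (under the assumed $c_{0}=1$ ), which is why the centers must stay at distance at least $c_{0}\delta$ from $\partial\Omega$ .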

We use stochastic gradient descent to minimize the population risk (1.2). At each step, we use a new data point $(y_{k},{\boldsymbol{x}}_{k})$ , thus the sample size is equal to the number of iterations of the algorithm. Assuming for simplicity constant step size ${\varepsilon}>0$ , we update the parameters by

[TABLE]

Here ${\boldsymbol{g}}_{i}^{k+1}\sim{\sf N}(0,{\boldsymbol{I}}_{d})$ is Gaussian noise which we take to be i.i.d. across time and neuron indices, $k$ and $i$ , and ${\sf P}$ is the orthogonal projector onto $\Omega^{\delta}$ :

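$${\sf P}({\boldsymbol{x}})=\arg\min_{{\boldsymbol{z}}\in\Omega^{\delta}}|{\boldsymbol{x}}-{\boldsymbol{z}}|.\tag{3.6}$$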

The noise term $\sqrt{2{\varepsilon}\tau}\,{\boldsymbol{g}}_{i}^{k+1}$ is added mainly for technical reasons. Namely, it allows us to control the smoothness of the solutions of the resulting PDE. In simulations we do not find it useful, and we believe that a more careful analysis would be able to establish smoothness without the noise term.

Again, in our running example, we have

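$${\sf P}({\boldsymbol{x}})=\begin{cases}{\boldsymbol{x}}&\text{if }|{\boldsymbol{x}}|\leq r-c_{0}\delta,\\ (r-c_{0}\delta)\,{\boldsymbol{x}}/|{\boldsymbol{x}}|&\text{otherwise.}\end{cases}\tag{3.7}$$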
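A single projected, noisy SGD step can be sketched as follows in $d=1$ , where the ball ${\sf B}(0;r-c_{0}\delta)$ is an interval and the projection is a clip. This is a sketch under stated assumptions, not the paper's implementation: the triweight kernel is a hypothetical stand-in for $K$ , the squared loss with the factor $2$ and the mean-field step scaling are assumed, and the exact constants in Eq. (3.5) may differ.

```python
import numpy as np

def K(u):
    """Triweight kernel on [-1, 1]: C^2, symmetric, integrates to 1 (stand-in for K)."""
    return np.where(np.abs(u) < 1.0, (35.0 / 32.0) * (1.0 - u ** 2) ** 3, 0.0)

def dK(u):
    """Derivative of the triweight kernel."""
    return np.where(np.abs(u) < 1.0, -(105.0 / 16.0) * u * (1.0 - u ** 2) ** 2, 0.0)

def project_interval(w, radius):
    """Orthogonal projection onto [-radius, radius], i.e. the ball B(0; radius) in d = 1."""
    return np.clip(w, -radius, radius)

def sgd_step(w, x, y, delta, eps, tau, radius, rng):
    """One projected, noisy SGD step on the squared loss for a single sample (x, y)."""
    fhat = np.mean(K((x - w) / delta)) / delta      # (1/N) sum_i K^delta(x - w_i)
    grad_w = -dK((x - w) / delta) / delta ** 2      # d/dw_i of K^delta(x - w_i)
    noise = np.sqrt(2.0 * eps * tau) * rng.standard_normal(w.shape[0])
    # Assumed update: descent direction of (y - fhat)^2, plus Gaussian noise, then projection.
    return project_interval(w + 2.0 * eps * (y - fhat) * grad_w + noise, radius)

# Usage with illustrative values (r, c0, delta, y are arbitrary choices).
rng = np.random.default_rng(0)
r, c0, delta = 1.0, 1.0, 0.1
radius = r - c0 * delta
w = rng.uniform(-radius, radius, size=200)
x, y = rng.uniform(-r, r), 0.5
w_next = sgd_step(w, x, y, delta, eps=1e-6, tau=0.0, radius=radius, rng=rng)
w_noisy = sgd_step(w, x, y, delta, eps=1e-2, tau=10.0, radius=radius, rng=rng)
```

Whatever the size of the noise term, the projection keeps every center inside $\Omega^{\delta}$ , so each bump remains supported in $\Omega$ .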

We initialize SGD with $({\boldsymbol{w}}_{i}^{0})_{i\leq N}\sim_{\rm i.i.d.}\rho^{\delta}_{\mbox{\tiny\rm init}}\in\mathscrsfs{P}_{2}(\Omega^{\delta})$ , where $\rho^{\delta}_{\mbox{\tiny\rm init}}$ is a scaling of a fixed distribution $\rho_{\mbox{\tiny\rm init}}\in\mathscrsfs{P}_{2}(\Omega)$ , i.e. $\rho^{\delta}_{\mbox{\tiny\rm init}}(S)=\rho_{\mbox{\tiny\rm init}}(S/\lambda_{\delta})$ . We assume that the initialization is smooth:

(A5)

$\rho_{\mbox{\tiny\rm init}}\in\mathscrsfs{C}^{\infty}(\Omega^{\delta})$ .

3.4 PDE Model, $\delta>0$

In the $N\to\infty$ limit the population risk is approximated by the effective risk $R^{\delta}:\mathscrsfs{P}_{2}(\Omega^{\delta})\to{\mathbb{R}}$ defined in Eq. (1.6). We emphasize that $\rho$ is a probability distribution supported on $\Omega^{\delta}$ . Note that

[TABLE]

In particular $\lim_{\delta\to 0}\inf_{\rho\in\mathscrsfs{P}_{2}(\Omega)}R^{\delta}(\rho)=0$ .

Our first main result is that the dynamics of SGD is well approximated by the following PDE (see Section 5.1 for a formal statement):

[TABLE]

with initial and boundary conditions

[TABLE]

where ${\boldsymbol{n}}({\boldsymbol{x}})$ denotes the inward normal vector to $\partial\Omega^{\delta}$ at ${\boldsymbol{x}}$ .

A rigorous definition of solutions of this PDE, along with some of their properties, is given in Appendix B. In Appendix C, we discuss the connection between the PDE (3.9) and the so-called “nonlinear dynamics”, i.e. a stochastic differential equation that captures the trajectories of the weights ${\boldsymbol{w}}_{i}^{k}$ . Using this connection, we prove existence and uniqueness of weak solutions of Eq. (3.9). In the proofs, we will often assume $\nu_{0}=1$ , which amounts to a rescaling of time $t$ .

For $\tau=0$ , the evolution defined by Eq. (3.9) corresponds to the gradient flow in Wasserstein metric for the risk function $R^{\delta}(\rho)$ . For $\tau>0$ , it is the gradient flow for the free energy functional $F^{\delta}(\rho)$ defined below

[TABLE]

3.5 Limit PDE, $\delta=0$

As mentioned above, in the limit $\delta\to 0$ the risk function $R^{\delta}(\rho)$ is well approximated by $R:\mathscrsfs{L}^{2}(\Omega)\to{\mathbb{R}}$ , where $R(\rho)=\nu_{0}\|f-\rho\|_{\mathscrsfs{L}^{2}(\Omega)}^{2}$ , cf. Eq. (1.7).

The corresponding Wasserstein gradient flow is known as the viscous porous medium equation [Váz07], and is given by

[TABLE]

with initial and boundary conditions

[TABLE]

In Appendix A, we give the definition of a weak solution for the PDE (3.12) with initial and boundary conditions (3.13). We also prove that the weak solution of the PDE (3.12) is unique, under a mild integrability condition. Again, in proofs we will assume without loss of generality $\nu_{0}=1$ .

As in the $\delta>0$ case, the evolution defined by Eq. (3.12) is the gradient flow for the free energy $F(\rho)=(1/2)R(\rho)-\tau S(\rho)$ . Our analysis uses a key property of the risk function $R(\rho)=\nu_{0}\|f-\rho\|_{\mathscrsfs{L}^{2}(\Omega)}^{2}$ (and the free energy): displacement convexity [McC97]. For the reader’s convenience, we recall its definition here, referring to [AGS08, Vil08, San15] for further background. Given two probability measures $\rho_{0},\rho_{1}\in\mathscrsfs{P}_{2}(\Omega)$ , their $W_{2}$ distance is defined by

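$$W_{2}(\rho_{0},\rho_{1})^{2}=\inf_{\gamma\in\Gamma(\rho_{0},\rho_{1})}\int_{\Omega\times\Omega}|{\boldsymbol{x}}_{0}-{\boldsymbol{x}}_{1}|^{2}\,\gamma({\rm d}{\boldsymbol{x}}_{0},{\rm d}{\boldsymbol{x}}_{1}),\tag{3.14}$$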

where the infimum is taken over the set $\Gamma(\rho_{0},\rho_{1})$ of couplings of $\rho_{0}$ , $\rho_{1}$ (i.e. probability measures on $\Omega\times\Omega$ whose first marginal coincides with $\rho_{0}$ , and second with $\rho_{1}$ ). The infimum is achieved by weak compactness of $\mathscrsfs{P}_{2}(\Omega)$ .

The metric space $(\mathscrsfs{P}_{2}(\Omega),W_{2})$ is a ‘length space,’ and in particular it is possible to construct geodesics, i.e. paths of minimum length connecting any two probability measures $\rho_{0},\rho_{1}$ . Geodesics have a simple description. Let $\gamma_{*}$ be the coupling achieving the infimum in the definition of $W_{2}(\rho_{0},\rho_{1})$ . Letting $(\boldsymbol{X}_{0},\boldsymbol{X}_{1})\sim\gamma_{*}$ , we define $\rho_{t}$ to be the distribution of $\boldsymbol{X}_{t}=(1-t)\boldsymbol{X}_{0}+t\boldsymbol{X}_{1}$ . The curve $t\mapsto\rho_{t}$ , indexed by $t\in[0,1]$ turns out to be the geodesic between $\rho_{0}$ and $\rho_{1}$ in $(\mathscrsfs{P}_{2}(\Omega),W_{2})$ .
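In one dimension the optimal coupling for the quadratic cost is the monotone rearrangement, so for two empirical measures with equally many atoms both $W_{2}$ and the geodesic can be computed by sorting. The sketch below illustrates the construction of $\rho_{t}$ in that special case; the function names and the sample distributions are illustrative choices.

```python
import numpy as np

def w2_1d(x0, x1):
    """W_2 distance between two empirical measures on R with equally many atoms."""
    a, b = np.sort(x0), np.sort(x1)
    return np.sqrt(np.mean((a - b) ** 2))

def geodesic_1d(x0, x1, t):
    """Atoms of rho_t: X_t = (1 - t) X_0 + t X_1 under the monotone (optimal) coupling."""
    a, b = np.sort(x0), np.sort(x1)
    return (1.0 - t) * a + t * b

rng = np.random.default_rng(0)
x0 = rng.normal(-0.5, 0.1, size=1000)   # samples from rho_0
x1 = rng.uniform(0.0, 1.0, size=1000)   # samples from rho_1
xt = geodesic_1d(x0, x1, 0.3)           # samples from rho_t at t = 0.3
```

Along this curve $W_{2}(\rho_{0},\rho_{t})=t\,W_{2}(\rho_{0},\rho_{1})$ , which is the constant-speed property that characterizes a geodesic.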

Displacement convexity is convexity along geodesics. Namely, a function ${\mathcal{F}}:\mathscrsfs{P}_{2}(\Omega)\to{\mathbb{R}}$ is $\lambda$ -strongly displacement convex if

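$${\mathcal{F}}(\rho_{t})\leq(1-t)\,{\mathcal{F}}(\rho_{0})+t\,{\mathcal{F}}(\rho_{1})-\frac{\lambda}{2}\,t(1-t)\,W_{2}(\rho_{0},\rho_{1})^{2}\quad\text{for all }t\in[0,1],\tag{3.15}$$

along the geodesic $(\rho_{t})_{t\in[0,1]}$ between any two $\rho_{0},\rho_{1}\in\mathscrsfs{P}_{2}(\Omega)$ .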

A useful observation is that displacement convexity implies that all local minima of ${\mathcal{F}}$ are global minimizers. Indeed, by (3.15) it is straightforward to see that ${\mathcal{F}}$ has at most one global minimizer $\rho_{*}$ . Also, for every other point $\rho$ , the geodesic between $\rho$ and $\rho_{*}$ is a strictly decreasing path for the function ${\mathcal{F}}$ . Now, suppose that $\bar{\rho}\neq\rho_{*}$ is a local minimum. Then, there exists a neighborhood $U$ of $\bar{\rho}$ such that, for any $\rho\in U$ , ${\mathcal{F}}(\rho)\geq{\mathcal{F}}(\bar{\rho})$ . However, the strictly decreasing path from $\bar{\rho}$ to $\rho_{*}$ passes through the neighborhood $U$ , which leads to a contradiction, and so $\bar{\rho}=\rho_{*}$ .

It follows from [McC97] that the risk function $R(\rho)$ and the free energy $F(\rho)$ are strongly displacement convex.
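As orientation only, the following sketch indicates where the convexity comes from; the precise statements and conditions are those of [McC97]. For $\rho$ with a density, expanding the square in $R(\rho)=\nu_{0}\|f-\rho\|_{\mathscrsfs{L}^{2}(\Omega)}^{2}$ gives three terms:

```latex
R(\rho)
  = \underbrace{\nu_0 \,\|f\|_{L^2(\Omega)}^2}_{\text{constant}}
  \;+\; \underbrace{\int_\Omega V(x)\,\rho(x)\,\mathrm{d}x}_{\text{potential energy},\ V = -2\nu_0 f}
  \;+\; \underbrace{\nu_0 \int_\Omega \rho(x)^2\,\mathrm{d}x}_{\text{internal energy}} .
```

By Assumption (A2), $\nabla^{2}V\succeq 2\nu_{0}\alpha\,{\boldsymbol{I}}_{d}$ , so the potential energy is $2\nu_{0}\alpha$ -strongly displacement convex: evaluate $V$ along $\boldsymbol{X}_{t}=(1-t)\boldsymbol{X}_{0}+t\boldsymbol{X}_{1}$ and take expectations under the optimal coupling. The internal energy $\int\rho^{2}$ and the negative entropy $\int\rho\log\rho$ are displacement convex on the convex set $\Omega$ by McCann's criterion. Under this reading, $R$ is strongly displacement convex with constant $2\nu_{0}\alpha$ , and $F=(1/2)R-\tau S$ with constant $\nu_{0}\alpha$ .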

Remark 3.2.

The concavity assumption on the regression function $f$ (Assumption (A2)) defines a nonparametric class under which global convergence can be established, with convergence rates determined solely by the curvature $\alpha$ (in the limit $N\to\infty$ , $\delta\to 0$ ). Nonparametric estimation of concave functions has attracted considerable attention over recent years, see e.g. [HD13, CS16], and is, by itself, an interesting domain of applicability.

However, our projected SGD algorithm is potentially applicable to any data set, and will return a meaningful estimate $\hat{f}$ regardless of whether $f$ is concave or not. Indeed, in the next section we present numerical simulations indicating convergence to a near-global optimum even for non-concave functions $f$ .

From a mathematical point of view, Assumption (A2) is only used to show the convergence of the solution of the viscous porous medium equation (limit PDE, $\delta=0$ ) to the unique global minimizer of the free energy $F(\rho)=(1/2)R(\rho)-\tau S(\rho)$ , as formally stated in Theorem F.8. Concavity is not needed for the other results in the paper, namely approximating the SGD trajectory with the solution of the PDE ( $\delta>0$ ), see Theorem 5.1, and the convergence of the solution of the PDE ( $\delta>0$ ) to the solution of the viscous porous medium equation, see Theorem 5.2. A more general analysis that relaxes the concavity assumption is therefore foreseeable.

4 Numerical illustrations

In this section we provide some simple numerical illustrations of our setting, and compare numerical results with the predictions of the Wasserstein gradient flow theory.

It is easy to construct examples of strongly concave functions satisfying our assumptions. One can start from any smooth strongly concave function $f_{0}$ on a compact convex set $\Omega$ , add a constant to make it non-negative, and multiply it by a constant to normalize its integral. The resulting function $f({\boldsymbol{x}})=(c_{1}+f_{0}({\boldsymbol{x}}))/c_{2}$ satisfies our conditions. Notable examples of concave functions are given by negative log-moment generating functions $f_{0}({\boldsymbol{x}})=-\log{\mathbb{E}}_{{\boldsymbol{Z}}}\exp\{\langle{\boldsymbol{x}},{\boldsymbol{Z}}\rangle\}$ , where the random vector ${\boldsymbol{Z}}$ satisfies mild assumptions (e.g., it is bounded and its distribution is not supported on a proper subspace of ${\mathbb{R}}^{d}$ ). In general, given any twice differentiable function $g_{0}$ with bounded Hessian on $\Omega$ , the function $f_{0}({\boldsymbol{x}})=g_{0}({\boldsymbol{x}})-c_{*}\|{\boldsymbol{x}}\|^{2}_{2}$ is strongly concave for $c_{*}$ large enough.

4.1 A one-dimensional concave function

We set $\Omega=[-1,1]$ and $f(x)=(1-e^{x-1})/(1+e^{-2})$ (we choose the normalization so that $\int_{-1}^{1}f(x){\rm d}x=1$ ; indeed $\int_{-1}^{1}(1-e^{x-1})\,{\rm d}x=1+e^{-2}$ ). Note that $f$ is uniformly concave in $[-1,1]$ . We set the kernel $K$ as follows:

[TABLE]

where $C_{d}$ is a normalization constant ensuring that $\int_{-1}^{1}K(x)\,{\rm d}x=1$ . The initialization $\rho_{\mbox{\tiny\rm init}}$ is a truncated Gaussian: $\rho_{\mbox{\tiny\rm init}}(x)=c\cdot\exp(-x^{2}/(2\sigma^{2}))\,{\boldsymbol{1}}_{[-1,1]}(x)$ , with $\sigma=1/3$ .
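The normalization, the curvature, and the initialization of this example can be checked numerically. The sketch below takes the normalizing constant of $f$ to be $1+e^{-2}$ , which is the value of $\int_{-1}^{1}(1-e^{x-1})\,{\rm d}x$ ; the helper names and the use of rejection sampling for the truncated Gaussian are illustrative choices, not taken from the paper.

```python
import numpy as np

def f(x):
    """Target of the 1D example, normalized so that its integral over [-1, 1] is 1."""
    return (1.0 - np.exp(x - 1.0)) / (1.0 + np.exp(-2.0))

def f_second_derivative(x):
    """Second derivative of f; strictly negative on [-1, 1]."""
    return -np.exp(x - 1.0) / (1.0 + np.exp(-2.0))

# Midpoint-rule check of the normalization and of the concavity parameter alpha.
edges = np.linspace(-1.0, 1.0, 20001)
mid = 0.5 * (edges[1:] + edges[:-1])
integral = np.sum(f(mid)) * (edges[1] - edges[0])
alpha = -np.max(f_second_derivative(mid))   # close to e^{-2} / (1 + e^{-2})

def sample_init(n, sigma=1.0 / 3.0, rng=None):
    """Rejection sampling from N(0, sigma^2) truncated to [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(0)
    while out.size < n:
        z = rng.normal(0.0, sigma, size=2 * n)
        out = np.concatenate([out, z[np.abs(z) <= 1.0]])
    return out[:n]

w0 = sample_init(200, rng=np.random.default_rng(0))   # initial centers of N = 200 bumps
```

Since $f''(x)=-e^{x-1}/(1+e^{-2})$ is least negative at $x=-1$ , the concavity parameter of this example is $\alpha=e^{-2}/(1+e^{-2})$ .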

We find empirically that standard stochastic gradient descent (SGD) without the projection ${\sf P}$ onto $\Omega^{\delta}$ works well in this example, and consider this algorithm for simplicity in our first illustrations. We pick $N=200$ , $\tau=0$ (noiseless SGD), and constant step size $\varepsilon=10^{-6}$ . In Figure 1, left column, we plot the true function $f(\,\cdot\,)$ together with the neural network estimate $\hat{f}(\,\cdot\,;{\boldsymbol{w}}^{k})$ at several points in time $t$ (time is related to the number of iterations $k$ via $t=k{\varepsilon}$ ). Different plots correspond to different values of $\delta$ with $\delta\in\{1/5,1/10,1/20\}$ . We observe that the network estimates $\hat{f}(\,\cdot\,;{\boldsymbol{w}}^{k})$ seem to converge to a limit curve which is an approximation of the true function $f$ . As expected, the quality of the approximation improves as $\delta$ gets smaller.

In the right column, we report the evolution of the population risk (1.2) normalized by $\|f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ . For comparison, we plot the evolution of the risk (1.7) as predicted by the limit PDE (3.12) with $\tau=0$ . We solve the PDE (3.12) numerically using a finite difference scheme that enforces the conservation law $\int\rho(x,t){\rm d}x=1$ , see, e.g., [Tho13]. In the finite difference scheme, we choose time step and spatial step $\Delta t=10^{-5}$ and $\Delta x=10^{-2}$ , respectively. The curve obtained by this numerical solution appears to capture well the evolution of SGD towards optimality. The main difference is that, while the PDE (3.12) corresponds to $\delta=0$ , and hence evolves towards a global optimum at zero risk, SGD converges to a non-zero risk value, which can be interpreted as the approximation error, decreasing with $\delta$ .
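A mass-conserving explicit scheme of the kind mentioned above can be sketched as follows. This is not the authors' solver: it assumes that, for $\tau=0$ and with constants normalized, the limit flow takes the form $\partial_{t}\rho=\partial_{x}\big(\rho\,\partial_{x}(\rho-f)\big)$ with zero flux at $x=\pm 1$ , and constant factors may differ from Eq. (3.12). The target $f$ is normalized by $1+e^{-2}$ so that it integrates to one.

```python
import numpy as np

def f_target(x):
    """1D target, normalized to integrate to 1 on [-1, 1]."""
    return (1.0 - np.exp(x - 1.0)) / (1.0 + np.exp(-2.0))

def pme_step(rho, f_cells, dx, dt):
    """One explicit finite-volume step of d_t rho = d_x(rho d_x(rho - f)), zero-flux boundary."""
    mu = rho - f_cells                              # first variation of (1/2)|f - rho|^2
    rho_face = 0.5 * (rho[1:] + rho[:-1])           # density at interior cell interfaces
    flux = rho_face * (mu[1:] - mu[:-1]) / dx       # rho * d_x mu at interior interfaces
    flux = np.concatenate(([0.0], flux, [0.0]))     # no flux through x = -1 and x = +1
    return rho + dt * (flux[1:] - flux[:-1]) / dx   # telescoping sum conserves total mass

def risk(rho, f_cells, dx):
    """Discretized risk: integral of (f - rho)^2 over [-1, 1]."""
    return np.sum((f_cells - rho) ** 2) * dx

n_cells = 200
dx, dt = 2.0 / n_cells, 1e-5                        # same steps as quoted in the text
x = -1.0 + dx * (np.arange(n_cells) + 0.5)          # cell centers
f_cells = f_target(x)
rho = np.exp(-x ** 2 / (2.0 * (1.0 / 3.0) ** 2))    # truncated Gaussian, sigma = 1/3
rho /= np.sum(rho) * dx                             # normalize to unit mass
risk_start = risk(rho, f_cells, dx)
for _ in range(1000):                               # integrate up to t = 0.01
    rho = pme_step(rho, f_cells, dx, dt)
risk_end = risk(rho, f_cells, dx)
mass_end = np.sum(rho) * dx
```

Because the update is written in flux form with zero boundary flux, the discrete mass $\sum_{i}\rho_{i}\,\Delta x$ is preserved up to rounding, which is the discrete counterpart of the conservation law $\int\rho(x,t)\,{\rm d}x=1$ .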

In Figure 2, we illustrate the numerical solution of the PDE (3.12) by plotting (i) the regression function $f$ together with the PDE solution $\rho_{t}$ (which coincides with the prediction $\hat{f}$ at $\delta=0$ ) at several times $t$ , and (ii) the PDE prediction for the risk $R(\rho_{t})$ (1.7) normalized with respect to $\|f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ (this plot aggregates data from Figs. 1.(b), (d), (f)). We also compare the risk (1.7) to the population risk $R_{N}({\boldsymbol{w}}^{k})$ achieved by SGD for different values of $\delta$ . Note that, as $\delta$ becomes smaller, the risk converges to the predicted curve. The risk of the limit PDE (3.12) converges to $0$ exponentially fast in $t$ , as predicted by the strong displacement convexity of $R(\rho)$ .

In Figure 3, we consider the SGD algorithm with projection ${\sf P}$ , see (3.5). We pick $N=200$ , $\tau=0$ , ${\varepsilon}=10^{-6}$ and $\delta=1/20$ . On the left, we illustrate the evolution of the value of $40$ weights chosen at random; and on the right, we plot the histogram of their empirical distribution at $t=5$ . Note that this histogram matches well the regression function $f$ plotted in black.

4.2 A two-dimensional concave example

Next, we consider a two-dimensional example. We set $\Omega=[-1,1]^{2}$ and

[TABLE]

with ${\boldsymbol{q}}_{1}=(2.5127,-2.4490)$ , ${\boldsymbol{q}}_{2}=(0.0596,1.9908)$ and where $c_{1}$ and $c_{2}$ are chosen so that $f$ is non-negative and $\int_{\Omega}f({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}=1$ . The kernel $K$ is given by $K({\boldsymbol{x}})=C_{d}\kappa(|{\boldsymbol{x}}|)$ , where $\kappa$ is defined in (4.1) and $C_{d}$ is a normalization constant ensuring that $\int_{{\sf B}({\boldsymbol{0}};1)}K({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}=1$ . Again, the initialization $\rho_{\mbox{\tiny\rm init}}$ is a truncated Gaussian: $\rho_{\mbox{\tiny\rm init}}({\boldsymbol{x}})=c\cdot\exp(-|{\boldsymbol{x}}|^{2}/(2\sigma^{2}))\,{\boldsymbol{1}}_{[-1,1]^{2}}({\boldsymbol{x}})$ , with $\sigma=1/3$ . We compare the normalized risk of SGD with no projection ${\sf P}$ ( $N=2000$ , $\tau=0$ and ${\varepsilon}=10^{-6}$ ) for $\delta\in\{1/3,1/5,1/10\}$ with that of the limit PDE (3.12). Figure 4 shows that, already at $\delta=1/10$ , the risk of SGD converges to the predicted curve and the risk of the limit PDE (3.12) tends to $0$ exponentially fast in $t$ .

4.3 Comparing feature learning to random features

As discussed in the introduction, it is useful to consider the more general model

[TABLE]

with parameters ${\boldsymbol{a}}=(a_{1},\ldots,a_{N})$ as well as ${\boldsymbol{w}}=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N})$ . This setting allows us to compare two different approaches:

$(i)$

Random feature regression: the weights ${\boldsymbol{w}}$ are chosen independently of the labels $y_{i}$ (we allow for dependence on the covariates ${\boldsymbol{x}}_{i}$ ).

$(ii)$

Feature learning: the weights ${\boldsymbol{w}}$ depend on the data $(y_{i},{\boldsymbol{x}}_{i})$ .

In order to compare these two approaches, we assume that we are given i.i.d. data $\{(y_{i},{\boldsymbol{x}}_{i})\}_{i\leq n}$ , with ${\boldsymbol{x}}_{i}\sim{\sf Unif}(\Omega)$ and $y_{i}=f({\boldsymbol{x}}_{i})$ , and determine the parameters ${\boldsymbol{a}}$ by the same method in both cases, namely ridge regression. More explicitly, define the matrix ${\boldsymbol{Z}}\in\mathbb{R}^{n\times N}$ as $({\boldsymbol{Z}})_{i,j}=K^{\delta}({\boldsymbol{x}}_{i}-{\boldsymbol{w}}_{j})$ . Then, we estimate ${\boldsymbol{a}}$ via

[TABLE]

where $\lambda$ is chosen via cross-validation on a hold-out set, comprising $10\%$ of the samples.
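The second stage (fitting ${\boldsymbol{a}}$ with ${\boldsymbol{w}}$ held fixed) can be sketched as a plain ridge regression on the feature matrix ${\boldsymbol{Z}}$ . The sketch assumes $d=1$ , a triweight kernel as a hypothetical stand-in for $K$ , and the objective $|{\boldsymbol{y}}-{\boldsymbol{Z}}{\boldsymbol{a}}|^{2}+\lambda|{\boldsymbol{a}}|^{2}$ ; the scaling conventions of Eq. (4.3), such as factors of $1/N$ or $1/n$ , may differ, and the data below are synthetic.

```python
import numpy as np

def K(u):
    """Triweight kernel on [-1, 1] (stand-in for the paper's kernel)."""
    return np.where(np.abs(u) < 1.0, (35.0 / 32.0) * (1.0 - u ** 2) ** 3, 0.0)

def feature_matrix(x, w, delta):
    """Z[i, j] = K^delta(x_i - w_j) for scalar covariates (d = 1)."""
    return K((x[:, None] - w[None, :]) / delta) / delta

def ridge_coefficients(Z, y, lam):
    """Minimizer of |y - Z a|^2 + lam |a|^2, via the regularized normal equations."""
    n_features = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n_features), Z.T @ y)

# Synthetic usage: random centers w ('random features') and noiseless labels.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)     # covariates, n = 200
w = rng.uniform(-1.0, 1.0, size=20)      # centers, N = 20
y = 1.0 - x ** 2                         # illustrative concave target
lam = 1e-2
Z = feature_matrix(x, w, delta=0.3)
a_hat = ridge_coefficients(Z, y, lam)
y_fit = Z @ a_hat                        # fitted values on the training covariates
```

In the comparison described in the text, only the construction of `w` changes between methods (random, data points, or SGD output), while this fitting step is shared.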

In Figure 5, we compare the performance of three different ways to construct the weights ${\boldsymbol{w}}$ : ‘random ${\boldsymbol{w}}$ ,’ we choose the weights ${\boldsymbol{w}}_{i}$ independently and uniformly at random in $\Omega$ (blue triangles pointing down); ‘ ${\boldsymbol{w}}=$ data points,’ we choose the weights ${\boldsymbol{w}}_{i}$ uniformly at random among the data points (green circles); ‘optimized ${\boldsymbol{w}}$ ,’ we use the output of the projected SGD algorithm of the previous sections (red triangles pointing up). The first two can be regarded as ‘random features’ approaches, while the latter is a ‘feature learning’ method.

For the optimized ${\boldsymbol{w}}$ , we use exactly the same algorithm as in (3.5) (without coefficients ${\boldsymbol{a}}$ in the SGD update), with the only difference that each SGD step is carried out with respect to an independent sample from the empirical data, with replacement. SGD is stopped after $k_{\max}$ iterations, and the coefficients $\hat{{\boldsymbol{a}}}$ are computed according to (4.3). Notice that this procedure is probably suboptimal, and it would be better to optimize ${\boldsymbol{a}}$ and ${\boldsymbol{w}}$ jointly: we choose this simpler two-stage procedure to have a more direct application of the algorithm analyzed in the paper, and a comparison with the random feature methods. We set $\tau=0$ (noiseless SGD), and constant step size $\varepsilon=5\cdot 10^{-4}$ . The number of iterations $k_{\max}\in\{5\cdot 10^{3},15\cdot 10^{3},5\cdot 10^{4},15\cdot 10^{4},5\cdot 10^{5},15\cdot 10^{5}\}$ is chosen via cross-validation, by using the same hold-out set employed to optimize $\lambda$ .

We set $\Omega=[-1,1]^{4}$ and define $y_{j}=f({\boldsymbol{x}}_{j})$ , where $f({\boldsymbol{x}})$ takes the form (4.2) with ${\boldsymbol{q}}_{1}=(-0.3832,0.3074,-0.3198,0.4792)$ and ${\boldsymbol{q}}_{2}=(0.3502,-0.1471,0.1685,0.0546)$ . Again, $c_{1}$ and $c_{2}$ are chosen so that $f$ is non-negative and $\int_{\Omega}f({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}=1$ ; the kernel $K$ is given by $K({\boldsymbol{x}})=C_{d}\kappa(|{\boldsymbol{x}}|)$ , where $\kappa$ is defined in Eq. (4.1) and $C_{d}$ ensures that $\int_{{\sf B}({\boldsymbol{0}};1)}K({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}=1$ .

After estimating ${\boldsymbol{w}}_{i}$ and $a_{i}$ by either method, we generate a test set of $10,000$ samples and use it to estimate the generalization error. We perform $20$ independent trials of the experiment, and we plot the average risk normalized by $\|f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ together with the error bar at 1 standard deviation. In Figure 5-(a), we fix the number of neurons $N=200$ and we plot the normalized risk as a function of the number of data points $n$ . In Figure 5-(b), we fix the number of samples $n$ to $2000$ and we plot the normalized risk as a function of the number of neurons $N$ . The data set used for cross-validation has size $\max(n/10,40)$ . Note that feature learning leads to improved performance in both settings. The improvement becomes more pronounced with the sample size $n$ , presumably because a better set of weights ${\boldsymbol{w}}_{i}$ can be learnt. On the other hand, when the number of neurons $N$ becomes very large, random ${\boldsymbol{w}}_{i}$ ’s are already covering $\Omega$ densely enough, and there is no significant advantage in feature learning.

4.4 A non-concave one-dimensional example

We set $\Omega=[-1,1]$ and $f(x)=(x+\sin(5x-\pi/2)-c_{1})/c_{2}$ , where $c_{1}$ and $c_{2}$ are chosen so that $f$ is non-negative and $\int_{\Omega}f(x)\,{\rm d}x=1$ . Note that the target function $f$ is bimodal, thus it is not concave. We perform the same numerical experiment described in Section 4.1. In Figure 6, left column, we plot the true function $f(\,\cdot\,)$ together with the neural network estimate $\hat{f}(\,\cdot\,;{\boldsymbol{w}}^{k})$ at several points in time $t$ , where different plots correspond to different values of $\delta\in\{1/5,1/10,1/20\}$ . In the right column, we report the evolution of the population risk (1.2) normalized by $\|f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ . In Figure 7, we plot (i) the regression function $f$ together with the PDE solution $\rho_{t}$ at several times $t$ , and (ii) the PDE prediction for the risk $R(\rho_{t})$ (1.7) (normalized with respect to $\|f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ ) compared with the population risk $R_{N}({\boldsymbol{w}}^{k})$ achieved by SGD for different values of $\delta$ . Even if the target function is not concave, the results are similar to those presented in the concave case: (i) the network estimates $\hat{f}(\,\cdot\,;{\boldsymbol{w}}^{k})$ seem to converge to a limit curve which is an approximation of the true function $f$ , (ii) the quality of the approximation improves as $\delta$ gets smaller, and (iii) the risk of the limit PDE (3.12) converges to $0$ exponentially fast in $t$ .

4.5 Failure for small $N$

We repeat the same experiment described in Section 4.1 for a smaller number of neurons, $N=20$ . As can be seen in Figures 8 and 9, the quality of the approximation becomes worse as $\delta$ gets smaller. This is expected: with a small number of activations, reducing their bandwidth $\delta$ degrades performance, because all of them vanish on a large part of the space. Put differently, the number of neurons is too small to guarantee convergence of SGD to the predictions of the Wasserstein gradient flow theory.

5 Main results

5.1 Convergence of SGD to the PDE (3.9) at $\delta>0$ fixed

We now state our result concerning the convergence of the SGD dynamics (3.5) to the PDE (3.9). Note that this result does not require concavity of $f$ . Its proof is presented in Appendix D.

Theorem 5.1.

Assume that conditions (A1), (A3)-(A5) hold. Consider the SGD update (3.5) with initialization $({\boldsymbol{w}}_{i}^{0})_{i\leq N}\sim_{\rm i.i.d.}\rho_{\mbox{\tiny\rm init}}^{\delta}$ and constant step size $\varepsilon$ . For $t\geq 0$ , let $\rho_{t}$ be the unique solution of the PDE (3.9) with initial and boundary conditions (3.10), and assume ${\rm supp}(\rho^{\delta}_{\mbox{\tiny\rm init}})\subseteq{\sf B}({\boldsymbol{0}};r)$ . Then, for any fixed $t\geq 0$ , $\rho^{(N)}_{\lfloor t/\varepsilon\rfloor}\Rightarrow\rho_{t}$ almost surely along any sequence ( $N,\varepsilon=\varepsilon_{N}$ ) such that $N\to\infty$ , $\varepsilon_{N}\to 0$ .

Furthermore, for any $\delta\leq 1$ , $T\geq 1$ , ${\varepsilon}\leq 1$ , $p\in{\mathbb{N}}$ , and for any $g:\mathbb{R}^{d}\to\mathbb{R}$ with $\|g\|_{\rm Lip}\leq 1$ , the following happens with probability at least $1-z^{-2p}$ ,

[TABLE]

where

[TABLE]

Our proof is based on the same approach developed in [MMN18]. We prove that solutions of the PDE (3.9) are in correspondence with distributions over trajectories $(\boldsymbol{X}_{t})_{t\geq 0}$ in $\Omega$ satisfying the following stochastic differential equation

[TABLE]

where $(\boldsymbol{B}_{t})_{t\geq 0}$ is a standard Brownian motion and ${\rm d}{\boldsymbol{\Phi}}_{t}$ is the boundary reflection (in the sense of a Skorokhod problem). The density $\rho_{t}$ is determined, self consistently, via $\rho_{t}={\rm Law}(\boldsymbol{X}_{t})$ . We prove existence and uniqueness of solutions to this problem, and refer to the corresponding stochastic process $(\boldsymbol{X}_{t})_{t\geq 0}$ as nonlinear dynamics. This in turn implies existence and uniqueness of the solutions of the PDE (3.9).

We next construct a coupling between the network weights $({\boldsymbol{w}}^{k}_{1},\dots,{\boldsymbol{w}}^{k}_{N})\in(\Omega^{\delta})^{N}$ , and $N$ i.i.d. trajectories of the nonlinear dynamics $(\boldsymbol{X}^{t}_{1},\dots,\boldsymbol{X}^{t}_{N})\in(\Omega^{\delta})^{N}$ . Controlling the expected distance in this coupling yields Theorem 5.1.

Remark 5.1.

The error term in Eq. (5.1) is completely analogous to the error in a similar theorem proved in [MMN18]. The constant $\delta^{-d}$ appearing here is obtained by bounding the Lipschitz constant of $\nabla\Psi({\boldsymbol{w}};\rho)$ . As already mentioned, the main technical difficulty with respect to [MMN18] is posed by the Neumann (reflecting) boundary conditions. Indeed, even if we are given a solution of the PDE (3.9), existence and uniqueness of solutions of the Skorokhod problem (5.3) is a highly non-trivial fact first established in [Tan79, LS84]. As a consequence, while the main proof idea is similar to the one in [MMN18], its implementation is significantly different.

Remark 5.2.

As discussed in Appendix D, our proof applies to a more general version of the PDE (3.9) and correspondingly of the SGD dynamics (3.5), where $\Psi$ takes the form $\Psi({\boldsymbol{w}},\rho)=V({\boldsymbol{w}})+\int U({\boldsymbol{w}},{\boldsymbol{w}}^{\prime})\,\rho({\rm d}{\boldsymbol{w}}^{\prime})$ , for $V:\Omega\to{\mathbb{R}}$ , $U:\Omega\times\Omega\to{\mathbb{R}}$ two smooth functions. The SGD update (3.5) is generalized as in [MMN18], and Theorem 5.1 holds with the terms containing $\delta$ (i.e., $\delta^{-2d-1}$ and $\delta^{-(d+2)}$ ) replaced by a constant that depends only on $\|\nabla V\|_{\mathscrsfs{L}^{\infty}(\Omega)}$ , $\|\nabla U\|_{\mathscrsfs{L}^{\infty}(\Omega\times\Omega)}$ , $\|\nabla^{2}V\|_{\mathscrsfs{L}^{\infty}(\Omega)}$ , $\|\nabla^{2}U\|_{\mathscrsfs{L}^{\infty}(\Omega\times\Omega)}$ .

5.2 Convergence to the solutions of porous medium equation

We next prove that the solution of the PDE (3.9) converges, as $\delta\to 0$ , to the unique solution of the porous medium equation (3.12). As for Theorem 5.1, this result does not rely on the concavity assumption for $f$ .

Theorem 5.2.

Assume that conditions (A1) and (A3)-(A5) hold. Denote by $\rho^{\delta}$ the unique solution of the PDE (3.9) with initial condition $\rho^{\delta}_{0}=\rho_{\mbox{\tiny\rm init}}$ . Then

$(a)$

The porous medium equation (3.12) admits a weak solution $\rho:(t,{\boldsymbol{x}})\mapsto\rho_{t}({\boldsymbol{x}})$ with initial and boundary conditions (3.13). Further, this solution is unique under the additional condition $\rho\in\mathscrsfs{L}^{4}([0,T]\times\Omega)$ .

$(b)$

For almost all $t\in[0,T]$ , we have $\rho_{t}^{\delta}\to\rho_{t}$ in $\mathscrsfs{L}^{2}(\Omega)$ as $\delta\to 0$ .

While this statement is very natural at a heuristic level, its proof is actually the bulk of our technical work. Similar approximation results have been proved in the past by Oelschläger, Philipowski, Figalli [Oel02, Phi07, FP08], but they do not apply directly to the present case unless $f=0$ (also, we have to deal with different boundary conditions).

Our proof follows a classical compactness argument, generalizing the approach of [FP08]. Namely, we consider the family of trajectories $(\rho^{\delta}_{t})_{t\in[0,T]}$ indexed by the width $\delta$ . We prove that this family is bounded and equicontinuous in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ , and hence admits converging subsequences $(\rho^{\delta_{n}}_{t})_{t\in[0,T]}\to(\rho_{t})_{t\in[0,T]}$ . We next prove that any such converging subsequence converges in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ and that the limit is a weak solution of the porous medium equation (3.12). Unfortunately, uniqueness of weak solutions of the PME (3.12) is, to the best of our knowledge, an open problem in general. However, we generalize methods from [Oel02] to show that any subsequential limit is actually in $\mathscrsfs{L}^{4}(\Omega\times[0,T])$ , and prove that the weak solution is unique under this condition. This allows us to conclude that $(\rho^{\delta}_{t})_{t\in[0,T]}$ converges to this unique weak solution $(\rho_{t})_{t\in[0,T]}$ .

5.3 Global convergence of SGD

Let us now state the main result of this paper: SGD converges to a model with nearly optimal risk.

Theorem 5.3.

Assume that conditions (A1)-(A5) hold, and recall that $\alpha>0$ is the concavity parameter of the function $f$ , i.e., $\langle{\boldsymbol{y}},\nabla^{2}f({\boldsymbol{x}}){\boldsymbol{y}}\rangle\leq-\alpha|{\boldsymbol{y}}|^{2}$ for all ${\boldsymbol{x}}\in\Omega$ , ${\boldsymbol{y}}\in{\mathbb{R}}^{d}$ .

Consider the SGD update (3.5) with initialization $({\boldsymbol{w}}_{i}^{0})_{i\leq N}\sim_{\rm i.i.d.}\rho_{\mbox{\tiny\rm init}}$ and constant step size $\varepsilon$ . Assume ${\rm supp}(\rho_{\mbox{\tiny\rm init}})\subseteq{\sf B}({\boldsymbol{0}};r)$ . Then, for any $k\leq T/{\varepsilon}$ , the following holds with probability at least $1-1/z$ ,

[TABLE]

where

[TABLE]

Remark 5.3.

The error term $2\tau\,\Delta^{\prime}(k,{\varepsilon},d)$ in Eq. (5.4) is always non-negative. In fact, $\Delta^{\prime}(k,{\varepsilon},d)\geq 0$ as $S(\rho)\leq\log|\Omega|$ for any $\rho\in\mathscrsfs{P}_{2}(\Omega)$ . Furthermore, by applying Jensen’s inequality, we have that, for any $\rho\in\mathscrsfs{P}_{2}(\Omega)$ ,

[TABLE]

which gives the following upper bound

[TABLE]

Recall that $\tau$ controls the variance of the noise, which is added at each step of the SGD algorithm for technical purposes. Thus, we can take $\tau$ sufficiently small so that the term $2\tau\Delta^{\prime}(k,{\varepsilon},d)$ is arbitrarily small.

Remark 5.4.

The proof of Theorem 5.3 provides a somewhat more explicit expression for the error term $\Delta(N,{\varepsilon},T,d,\delta,z)$ in Eq. (5.4). Namely, for an arbitrary but fixed $p\in{\mathbb{N}}$ ,

[TABLE]

The term $\Delta_{1}$ bounds the error due to describing the SGD dynamics using the PDE (3.9). It vanishes when $N\to\infty$ , ${\varepsilon}\to 0$ , under the stated conditions. The term $\Delta_{2}$ captures the error due to approximating the PDE (3.9) with the porous medium equation (3.12). Finally, the term $e^{-2\alpha k\varepsilon}$ describes the convergence to equilibrium of the solution of the porous medium equation.
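To get a feel for the last term, the following sketch computes how many SGD steps are needed before $e^{-2\alpha k\varepsilon}$ drops below a target tolerance. It evaluates this one term only and ignores $\Delta_{1}$ , $\Delta_{2}$ and any prefactors in Eq. (5.4); the function name and the numerical values are illustrative assumptions.

```python
import math

def steps_for_exponential_term(alpha, eps, tol):
    """Smallest integer k with exp(-2 * alpha * k * eps) <= tol.

    alpha: concavity parameter of f, eps: SGD step size, tol: target in (0, 1).
    Hypothetical helper: it looks only at the exponential term of the bound.
    """
    if not (alpha > 0 and eps > 0 and 0 < tol < 1):
        raise ValueError("need alpha > 0, eps > 0 and 0 < tol < 1")
    return math.ceil(math.log(1.0 / tol) / (2.0 * alpha * eps))

# Illustrative values: alpha = 1, step size 1e-3, tolerance 1e-2.
k = steps_for_exponential_term(alpha=1.0, eps=1e-3, tol=1e-2)
```

The step count returned by this helper depends on $\alpha$ , ${\varepsilon}$ and the tolerance only, and not on $d$ or $\delta$ .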

The proof of Theorem 5.3 is presented in Appendix F and relies crucially on regularity results for the PDE (3.9) which are established in Appendix E.

More specifically, the proof is based on three steps, which we spell out once more:

$(i)$

We approximate the dynamics of SGD by the PDE (3.9) at $\delta>0$ fixed. In doing so, we incur an error $\Delta_{1}$ which is controlled using Theorem 5.1.

$(ii)$

We approximate the solution $\rho^{\delta}_{t}$ of the PDE (3.9) at $\delta>0$ using the solution $\rho_{t}$ of the porous medium equation (3.12), as stated in Theorem 5.2.

$(iii)$

We use results from [CJM+01, CMV03, CMV06] to prove that the latter solution converges exponentially fast to the global optimum, with rate $O(e^{-2\alpha t})$ .

Given Theorems 5.1, 5.2, and the results of [CJM+01, CMV03, CMV06], this proof is relatively direct. We emphasize that, unlike Theorems 5.1, 5.2, the proof of Theorem 5.3 relies in a crucial way on our structural assumptions, namely the concavity of $f$ , and the structure of the bump-like activation $K_{\delta}({\boldsymbol{x}}-{\boldsymbol{w}}_{i})$ .

Remark 5.5.

If we settle for the less ambitious goal of proving global convergence without the explicit dimension-independent rate $e^{-2\alpha k{\varepsilon}}$ , and there are no boundary conditions ( $\Omega={\mathbb{R}}^{d}$ ), we can achieve this goal using [MMN18, Theorem 5]. This result guarantees convergence in a number of SGD steps that potentially depends on $\tau$ (the noise injected in SGD) as well as the dimension $d$ and the width $\delta$ , but does not require strong concavity of $f$ . On the other hand, numerical experiments are consistent with the conclusion that rates are independent of these parameters, cf. e.g. Fig. 1 where dependence on $\delta$ is explored.

6 Discussion

It is instructive to compare the general strategy followed in this paper (and in related work, e.g. [MMN18, MMM19]) and the results we obtain, to a more classical approach in theoretical statistics. For the sake of clarity, we will abstract away most of the details of the present problem, and focus on the most important differences.

Consider a general setting in which we want to minimize the population risk $R({\boldsymbol{w}})={\mathbb{E}}_{y,{\boldsymbol{x}}}L({\boldsymbol{w}};y,{\boldsymbol{x}})$ , where $L$ is a non-convex loss function and ${\boldsymbol{w}}\in{\mathbb{R}}^{D}$ are parameters (in our problem ${\boldsymbol{w}}=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N})$ are the first-layer weights and $D=dN$ ). We are given $n$ i.i.d. samples $\{(y_{j},{\boldsymbol{x}}_{j})\}_{j\leq n}$ .

A standard theoretical analysis of this problem uses empirical risk minimization. Namely, we define the empirical risk $\widehat{R}_{n}({\boldsymbol{w}})=\widehat{\mathbb{E}}_{y,{\boldsymbol{x}}}L({\boldsymbol{w}};y,{\boldsymbol{x}})$ (with $\widehat{\mathbb{E}}_{n}$ denoting the empirical average), and compute the minimizer $\hat{\boldsymbol{w}}_{n}\in\arg\min_{{\boldsymbol{w}}}\widehat{R}_{n}({\boldsymbol{w}})$ , for instance by gradient descent. Theoretical analysis proceeds –conceptually– in two steps. First, one proves that the empirical risk minimizer is a near-minimizer of the population risk. Namely

[TABLE]

This is normally proved through a uniform convergence argument to establish a bound $\sup_{{\boldsymbol{w}}}|\widehat{R}_{n}({\boldsymbol{w}})-R({\boldsymbol{w}})|\leq{\sf err}(D,n)/2$ . Here ${\sf err}(D,n)$ is an error term that (hopefully) vanishes as $n\to\infty$ for $D$ fixed. Second, one proves that gradient descent (with respect to the cost function $\widehat{R}_{n}$ ) converges to a minimizer $\hat{\boldsymbol{w}}_{n}$ . This is achieved by showing that, with high probability, the landscape ${\boldsymbol{w}}\mapsto\widehat{R}_{n}({\boldsymbol{w}})$ satisfies some strong conditions that guarantee convergence of gradient descent (or other algorithms). For instance, one desirable (although not sufficient) property is that $\widehat{R}_{n}$ does not have local minima other than the global minima, provided that the sample size is large enough. A substantial literature applies this general scheme (with significant refinements) to a variety of non-convex problems in high-dimensional statistics, including phase retrieval, clustering, matrix completion, error-in-variables models, and so on. We refer to [MBM+18] for examples and a more detailed survey.

Unfortunately this approach runs into substantial difficulties when treating complex models such as multi-layer neural networks. We can name at least two sources of difficulties. First of all, the number of parameters $D$ in the model is often comparable with the sample size $n$ , and therefore uniform convergence of the empirical risk to population risk does not hold. For instance, in the present model, we could use a number of parameters $Nd\gtrsim n$ : indeed, such an example is considered in Figure 5-(a), where $Nd=800$ and $n\in\{100,\dots,2000\}$ . Of course this problem can be addressed by constraining other measures of complexity than the number of parameters [Bar98], but the common practice is not to add such regularizers in the training.

The second source of difficulties is that studying the risk landscape, and ruling out local minima, is extremely difficult, even if we limit ourselves to the $n=\infty$ limit, i.e. the population risk $R({\boldsymbol{w}})$ . In two-layer neural networks, part of this difficulty is due to the fact that the risk (1.2) is invariant under permutations of the $N$ neurons, and hence it has (generically) at least $N!$ global minima related by permutations, and a large number of saddle points connecting them.
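The permutation invariance is easy to check numerically. The sketch below is a one-dimensional illustration under assumptions not taken from the paper: unit output weights, an Epanechnikov kernel standing in for the bump $K$ , and hypothetical function names. It evaluates the model on a grid before and after shuffling the centers.

```python
import math
import random

def bump(u, delta):
    """Illustrative compactly supported bump K_delta(u) = K(u / delta) / delta in d = 1."""
    s = abs(u) / delta
    return 0.75 * (1.0 - s * s) / delta if s < 1.0 else 0.0

def f_hat(x, centers, delta):
    """Two-layer model: average of N bumps centered at the weights w_i."""
    return sum(bump(x - w, delta) for w in centers) / len(centers)

rng = random.Random(0)
centers = [rng.uniform(0.0, 1.0) for _ in range(6)]
shuffled = centers[:]
rng.shuffle(shuffled)  # relabel the neurons

# Largest difference between the two models over a grid of inputs.
grid = [j / 50.0 for j in range(51)]
max_gap = max(abs(f_hat(x, centers, 0.2) - f_hat(x, shuffled, 0.2)) for x in grid)

# Number of relabelings of N = 6 distinct centers that give the same function.
n_relabelings = math.factorial(len(centers))
```

The two models agree up to floating-point rounding, so every one of the $N!$ relabelings of a minimizer with distinct centers is again a minimizer.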

The approach pursued in this paper builds on two simple remarks, which are connected to the previous difficulties:

$(i)$

Uniform convergence of the empirical risk $\widehat{R}_{n}({\boldsymbol{w}})$ to the population risk $R({\boldsymbol{w}})$ is not necessary, nor is it necessary to control the random deviations of the whole landscape of the empirical risk. What is instead important is to control the landscape of the empirical risk along the trajectory of gradient descent from a given initialization.

A convenient way to implement this idea is to consider SGD in a one-pass setting in which each sample is used only once. In the limit of small step size, this converges to gradient flow with respect to $R({\boldsymbol{w}})$ .

$(ii)$

Absence of local minima in the population landscape $R({\boldsymbol{w}})$ is not necessary either. What is instead important is absence of local minima along the gradient flow trajectory for $R({\boldsymbol{w}})$ or, more precisely, the fact that the gradient flow trajectory converges to a global minimum.
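Remark $(i)$ can be illustrated on a toy problem that is much simpler than the model of this paper. The sketch below assumes the scalar loss $L(w;y)=(w-y)^{2}/2$ with $y\sim{\sf N}(m,1)$ , for which the gradient flow of the population risk is available in closed form, and compares it with one-pass SGD at a fixed time horizon. All names and constants are illustrative.

```python
import math
import random

def one_pass_sgd(w0, eps, n_steps, mean, rng):
    """One-pass SGD on L(w; y) = (w - y)^2 / 2, drawing a fresh sample y at every step."""
    w = w0
    for _ in range(n_steps):
        y = rng.gauss(mean, 1.0)
        w -= eps * (w - y)  # stochastic gradient of the loss at the new sample
    return w

def gradient_flow(w0, t, mean):
    """Exact solution of dw/dt = -R'(w) for the population risk R(w) = (w - mean)^2 / 2 + 1/2."""
    return mean + (w0 - mean) * math.exp(-t)

rng = random.Random(0)
T, mean, w0 = 2.0, 1.0, 5.0
gaps = {}
for eps in (1e-1, 1e-2, 1e-3):
    k = round(T / eps)  # number of SGD steps, equal to the number of samples used
    gaps[eps] = abs(one_pass_sgd(w0, eps, k, mean, rng) - gradient_flow(w0, k * eps, mean))
```

In this toy model the gap between the two trajectories at time $T$ is typically of order $\sqrt{{\varepsilon}}$ , so a smaller step size buys a better approximation at the cost of $T/{\varepsilon}$ samples.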

These remarks suggest the following proof strategy. Let ${\boldsymbol{w}}(t)$ denote the gradient flow trajectory from a given initialization ${\boldsymbol{w}}(0)={\boldsymbol{w}}_{0}$ (namely $\dot{{\boldsymbol{w}}}(t)=-\nabla R({\boldsymbol{w}}(t))$ ), and ${\boldsymbol{w}}^{k}$ be the (random) parameters produced after $k$ SGD steps. We first prove that gradient flow converges to a global optimum, possibly with explicit convergence rate $\Delta(t)$ :

[TABLE]

where $\Delta(t)\to 0$ as $t\to\infty$ . We then show that the SGD trajectory, after $k$ steps, is well approximated by the gradient flow for $R({\boldsymbol{w}})$ provided the step size ${\varepsilon}$ is small. For instance we might prove that there exists a numerical constant $c_{0}$ such that, for any $k{\varepsilon}\leq T$ , with high probability

[TABLE]

The reader might recognize that the last estimate is analogous to the one obtained in Theorem 5.1, while the estimate (6.2) is what we obtain from displacement convexity (after taking the limit $\delta\to 0$ using Theorem 5.2). Putting the two estimates together, and recalling that we can run a total of $n$ SGD steps (in the one-pass setting), we get

[TABLE]

where we set $\hat{\boldsymbol{w}}={\boldsymbol{w}}^{k}$ . The error is reminiscent of a bias-variance tradeoff: the first term is a bias due to early stopping; the second is instead the stochastic approximation error. We can now optimize $n$ so as to minimize this error. For instance, if $\Delta(t)=e^{-c_{1}t}$ , and ${\sf err}(T)=e^{c_{2}T}$ , we can choose ${\varepsilon}\propto(\log n/n)$ , yielding $R(\hat{\boldsymbol{w}})\leq\min_{{\boldsymbol{w}}}R({\boldsymbol{w}})+C(\log n)^{c_{0}}/n^{c^{\prime}}$ where $c^{\prime}=c_{0}c_{1}/(c_{1}+c_{2})$ .
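The arithmetic behind this rate can be checked numerically. The sketch below assumes that the combined bound has the form $\Delta(n{\varepsilon})+{\sf err}(n{\varepsilon})\,{\varepsilon}^{c_{0}}$ with $T=n{\varepsilon}$ , which is a reading of the two estimates above and an assumption of this example. The constants $c_{0},c_{1},c_{2}$ are arbitrary illustrative values.

```python
import math

def excess_risk_bound(n, T, c0, c1, c2):
    """Assumed form of the bound: exp(-c1*T) + exp(c2*T) * eps**c0 with eps = T / n."""
    eps = T / n
    return math.exp(-c1 * T) + math.exp(c2 * T) * eps ** c0

def balanced_horizon(n, c0, c1, c2):
    """Time horizon T = c0 * log(n) / (c1 + c2) that balances the two terms up to logs."""
    return c0 * math.log(n) / (c1 + c2)

c0, c1, c2 = 0.5, 1.0, 2.0      # illustrative constants, not taken from the paper
c_prime = c0 * c1 / (c1 + c2)   # predicted polynomial rate n**(-c_prime)
for n in (10**3, 10**6, 10**9):
    T = balanced_horizon(n, c0, c1, c2)
    bound = excess_risk_bound(n, T, c0, c1, c2)
    # The rescaled bound n**c_prime * bound equals 1 + T**c0, i.e. a polylog factor.
    print(n, T / n, bound, bound * n ** c_prime)
```

With $T\propto\log n$ the step size is ${\varepsilon}=T/n\propto\log n/n$ , and the bound is $n^{-c^{\prime}}$ times a factor that grows like $(\log n)^{c_{0}}$ , in agreement with the rate stated above.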

In summary, within the present approach, the generalization error is bounded via a tradeoff between the convergence rate of gradient flow in the population risk, and the error of approximating the gradient flow by SGD. A side benefit of this proof strategy is that it guarantees the existence of an efficient algorithm to compute the weights $\hat{\boldsymbol{w}}$ .

As mentioned, the above discussion omits several challenges that are posed by the model treated in this paper. Most notably: $(1)$ We are trying to optimize $N$ weight vectors ${\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{N}\in{\mathbb{R}}^{d}$ , but the loss only depends on the empirical distribution of these vectors $\hat{\rho}^{(N)}=N^{-1}\sum_{i=1}^{N}\delta_{{\boldsymbol{w}}_{i}}$ . It is therefore natural to define a gradient flow in the space of probability distributions, which is nothing but the PDE (3.9). This also helps address the challenge posed by the fact that, as $N$ increases, the dimension of the parameter space increases and convergence to the population behavior might fail. We are embedding all the values of $N$ in the space $\mathscrsfs{P}({\mathbb{R}}^{d})$ . $(2)$ We cannot prove a bound of the form (6.2) for the original PDE (3.9) and have to approximate this by the porous medium equation (3.12).

Because of these additional challenges, our bounds are not nearly as neat as in Eqs. (6.2), (6.3) and depend on the additional parameters $d,\delta$ : in particular, the approximation by the porous medium equation in Theorem 5.2 is non-quantitative. We therefore refrain from optimizing the tradeoff between convergence rate of gradient flow, and error in stochastic approximation, which would result in suboptimal statistical guarantees, and defer this objective to future work.

Acknowledgements

A. Javanmard was partially supported by an Outlier Research in Business (iORB) grant from the USC Marshall School of Business, a Google Faculty Research award and the NSF CAREER award DMS-1844481. M. Mondelli was supported by an Early Postdoc.Mobility fellowship from the Swiss National Science Foundation and by the Simons Institute for the Theory of Computing. A. Montanari was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162 and ONR N00014-18-1-2729. This work was carried out in part while the authors were visiting the Simons Institute for the Theory of Computing.

Appendix A Uniqueness of weak solutions of limit PDE ( $\delta=0$ )

In this appendix, we prove that the limit PDE obtained for $\delta\to 0$ , namely the porous medium equation (3.12) has at most one solution in $\mathscrsfs{L}^{4}(\Omega\times[0,T])$ . Existence of such solutions will follow from the results of Appendix F, and in particular from Lemma F.4.

For the sake of clarity, we repeat the definitions of Section 3.5. Let $\Omega\subseteq{\mathbb{R}}^{d}$ be a compact convex set with $\mathscrsfs{C}^{2}$ boundary. We denote by $\mathscrsfs{P}_{2}(\Omega)$ the space of probability measures on $\Omega$ endowed with Wasserstein’s $W_{2}$ distance. Since $\Omega$ is compact, the induced topology is equivalent to weak convergence. We consider the following PDE:

[TABLE]

with initial and boundary conditions

[TABLE]

Throughout this appendix, we adopt the notation $\Phi(\rho)=\tau\rho+\nu_{0}\,\rho^{2}/2$ . Let us formally define the concept of weak solutions for the PDE (A.1).

For the next statement, it is useful to recall that $\mathscrsfs{C}^{2,1}(\Omega\times[0,T])$ denotes the class of functions $f:\Omega\times[0,T]\to{\mathbb{R}}$ with continuous partial derivatives $D_{t}f({\boldsymbol{x}},t)$ , $D_{{\boldsymbol{x}}}^{{\boldsymbol{\alpha}}}f({\boldsymbol{x}},t)$ for all $\|{\boldsymbol{\alpha}}\|_{2}\leq 2$ .

Definition A.1 (Weak solution of limit PDE).

We say that $\rho\in\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ is a weak solution of the PDE (A.1), with initial and boundary conditions (A.2) if

1. $\rho_{t}$ has density $\rho(\,\cdot\,,t)$ with respect to Lebesgue measure, and $\rho\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ .

2. For any test function $h\in\mathscrsfs{C}^{2,1}(\Omega\times[0,T])$ , satisfying $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial\Omega,t\in[0,T]$ , we have

[TABLE]

We now prove a uniqueness result, under a mild integrability condition.

Lemma A.2 (Uniqueness of limit PDE).

Let $\rho,\tilde{\rho}\in\mathscrsfs{L}^{4}(\Omega\times[0,T])$ be two weak solutions of the PDE (A.1) with initial and boundary conditions (A.2), in the sense of Definition A.1. Then, $\rho=\tilde{\rho}$ , almost everywhere.

Proof.

Note that setting $\nu_{0}=1$ corresponds to scaling time by a factor $\nu_{0}$ and to substituting $\tau$ with $\tau\,\nu_{0}$ . Since the proof holds for any $\tau>0$ , without loss of generality we can set $\nu_{0}=1$ .

The proof follows ideas from [Váz07, Theorem 6.5]. We write the identity (A.3) for $\rho$ and $\tilde{\rho}$ and subtract them to get

[TABLE]

where we use the shorthand $\rho_{t}({\boldsymbol{x}})\equiv\rho({\boldsymbol{x}},t)$ and $h_{t}({\boldsymbol{x}})\equiv h({\boldsymbol{x}},t)$ . Define $u_{t}=\rho_{t}-\tilde{\rho}_{t}$ and $\eta_{t}=\tau+(\rho_{t}+\tilde{\rho}_{t})/2$ . Then,

[TABLE]

Note that $\eta_{t}({\boldsymbol{x}})\geq\tau$ and define the truncated function $\eta_{t}^{M}=\min(M,\eta_{t})$ . We next choose a smooth test function $\theta:\Omega\times[0,T]\to{\mathbb{R}}_{\geq 0}$ , $({\boldsymbol{x}},t)\mapsto\theta_{t}({\boldsymbol{x}})$ and consider the following backward problem:

[TABLE]

Here, $\hat{\eta}_{t}$ is a smooth approximation of $\eta_{t}^{M}$ , such that $\tau\leq\hat{\eta}_{t}({\boldsymbol{x}})\leq M$ . (We will make precise below in what sense $\hat{\eta}_{t}$ has to approximate $\eta_{t}^{M}$ . For the moment, it can be a general smooth function satisfying the bounds $\tau\leq\hat{\eta}_{t}({\boldsymbol{x}})\leq M$ .) Note that (A.6) is a backward parabolic problem with smooth coefficients and with Neumann boundary conditions. Hence, by classical results on quasilinear parabolic PDEs [LSU88], it admits a solution $h_{t}\in\mathscrsfs{C}^{2,1}(\Omega\times[0,T])$ . Rewriting (A.5) for such a test function $h_{t}$ , we get

[TABLE]

This immediately implies that

[TABLE]

By applying Cauchy-Schwarz inequality, we have that

[TABLE]

To bound the first term on the right-hand side of (A), we consider a smooth positive bounded function $\mu(t)$ , defined on $[0,T]$ , whose properties will be discussed later. Define the shorthand $\tilde{\theta}_{t}({\boldsymbol{x}})\equiv\theta_{t}({\boldsymbol{x}})+\langle\nabla f({\boldsymbol{x}}),\nabla h_{t}({\boldsymbol{x}})\rangle$ . We multiply the parabolic PDE (A.6) by $\mu(t)\Delta h_{t}({\boldsymbol{x}})$ and integrate to obtain

[TABLE]

We next write

[TABLE]

Here $(a)$ follows from integration by parts in the integral over $\Omega$ and using the fact that $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h_{t}({\boldsymbol{x}})\rangle=0$ for ${\boldsymbol{x}}\in\partial\Omega$ and $t\in[0,T]$ . Also, $(b)$ follows from integration by parts in the integral over $t$ . Finally $(c)$ holds because $h_{T}({\boldsymbol{x}})=0$ for ${\boldsymbol{x}}\in\Omega$ and $\mu(0)\geq 0$ .

Getting back to (A) and using the properties of function $\mu(t)$ , we have

[TABLE]

The penultimate step follows from integration by parts and the constraint $\langle\nabla h_{t}({\boldsymbol{x}}),{\boldsymbol{n}}({\boldsymbol{x}})\rangle=0$ , for ${\boldsymbol{x}}\in\partial\Omega$ and $t\in[0,T]$ , and the last step follows by applying the Cauchy-Schwarz inequality. We continue by applying the Cauchy-Schwarz inequality again to get

[TABLE]

where $C=\sup_{{\boldsymbol{x}}\in\Omega}|\nabla f({\boldsymbol{x}})|$ . Combining Equations (A.11) and (A.12), we get

[TABLE]

where $\mu_{\max}=\sup_{t\in[0,T]}\mu(t)$ . We find a smooth function $\mu(t)$ such that

1. $\mu(t)\geq\mu_{\min}>0$ , for $t\in[0,T]$ ,

2. $\mu^{\prime}(t)-\frac{2C^{2}}{\tau}\mu(t)\geq 0$ .

A particular choice is

[TABLE]

We then obtain from (A) that

[TABLE]

Now by employing (A.14) in bound (A) combined with (A.7) we get

[TABLE]

Next we note that

[TABLE]

Call the first integral $I_{1}$ and denote the second one by $I_{2}$ . The integrand in $I_{2}$ is pointwise bounded by

[TABLE]

Since $\rho_{t},\tilde{\rho}_{t}\in\mathscrsfs{L}^{4}$ , we have that $(\Phi(\rho_{t})-\Phi(\tilde{\rho}_{t}))^{2}$ has bounded integral. Hence, we can choose $M$ large enough such that $I_{2}$ is arbitrarily small. Moreover we can choose the smooth approximation $\hat{\eta}_{t}$ such that $I_{1}$ is also arbitrarily small. Putting everything together, we obtain that

[TABLE]

where ${\varepsilon}$ is an arbitrarily small fixed constant.

In addition, since $\hat{\eta}_{t}({\boldsymbol{x}})\geq\tau$ , invoking (A.15) we have

[TABLE]

Since $\frac{\mu_{\max}}{\mu_{\min}}=e^{\frac{2C^{2}}{\tau}T}<\infty$ and $\theta$ are independent of ${\varepsilon}$ , by choosing ${\varepsilon}$ arbitrarily small, we conclude that

[TABLE]

Since $\theta_{t}({\boldsymbol{x}})\geq 0$ was an arbitrary smooth function supported on $\Omega\times[0,T]$ , this implies that $u\leq 0$ , almost everywhere. By repeating a similar argument, we get $u\geq 0$ , almost everywhere. The result follows. ∎

Appendix B General results on the PDE (3.9) ( $\delta>0$ )

This appendix contains some basic results on the PDE (3.9). Although these facts are standard, we collect them here for the reader’s convenience.

In fact, we will consider a more general PDE, which also includes as a special case the one studied in [MMN18]. We consider a compact convex domain $D$ , with a non-empty interior. The general PDE is parametrized by two functions $V\in\mathscrsfs{C}^{2}(D)$ and $U\in\mathscrsfs{C}^{2}(D\times D)$ , with $U({\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2})=U({\boldsymbol{x}}_{2},{\boldsymbol{x}}_{1})$ . (Unlike in [MMN18], we consider the case of a compact domain with Neumann boundary conditions.) Given $\rho\in\mathscrsfs{P}_{2}(D)$ , we define

[TABLE]

and consider the PDE

[TABLE]

with initial and boundary conditions

[TABLE]

We will typically write $\rho_{t}(\,\cdot\,)$ for a solution of this equation, in order to emphasize that it is a function of $t$ that takes values in $\mathscrsfs{P}_{2}(D)$ , and $\rho({\boldsymbol{x}},t)$ for the corresponding density, viewed as a function on $D\times[0,T]$ . Let us formally define the concept of weak solutions for the PDE (B.2).

Note that the PDE (3.9) is a special case of this setting with $D=\Omega^{\delta}$ , and $V({\boldsymbol{w}})$ and $U({\boldsymbol{w}}_{1},{\boldsymbol{w}}_{2})=U({\boldsymbol{w}}_{1}-{\boldsymbol{w}}_{2})$ defined as follows:

[TABLE]

Remark B.1.

For the special choice of $V$ and $U$ given by (B.4) the following properties hold:

1. $V:\Omega^{\delta}\to{\mathbb{R}}$ is convex for any $\delta>0$ .

2. $\lim_{\delta\to 0}\sup_{{\boldsymbol{w}}\in\Omega^{\delta}}|V({\boldsymbol{w}})+\nu_{0}\,f({\boldsymbol{w}})|=0$ .

3. $U({\boldsymbol{w}})=\nu_{0}\,\delta^{-2d}K^{(2)}({\boldsymbol{w}}/\delta)$ , where $K^{(2)}=K*K$ .

Proof.

We have $V({\boldsymbol{w}})=-\nu_{0}\int K^{\delta}({\boldsymbol{x}})f({\boldsymbol{w}}-{\boldsymbol{x}}){\rm d}{\boldsymbol{x}}$ . Hence,

[TABLE]

This proves that $V({\boldsymbol{w}})$ is convex. The next two properties are straightforward. ∎
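The convexity of $V$ can also be observed numerically in one dimension. The sketch below is illustrative only: it assumes an Epanechnikov kernel for $K$ , the scaling $K^{\delta}(x)=\delta^{-1}K(x/\delta)$ , $\nu_{0}=1$ , and a particular strongly concave test function $f$ , none of which is prescribed by the paper. It approximates the integral defining $V$ by a midpoint rule and evaluates second differences on a grid.

```python
import math

def kernel(u):
    """Illustrative nonnegative kernel K on [-1, 1] (Epanechnikov), integrating to one."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def V(w, f, delta, nu0=1.0, m=400):
    """Midpoint-rule approximation of V(w) = -nu0 * int K_delta(x) f(w - x) dx in d = 1."""
    h = 2.0 * delta / m
    total = 0.0
    for j in range(m):
        x = -delta + (j + 0.5) * h
        total += kernel(x / delta) / delta * f(w - x) * h
    return -nu0 * total

f = lambda x: -x * x - math.exp(x)  # a strongly concave test function (assumption)
delta, step = 0.1, 0.05
# Second differences V(w + step) - 2 V(w) + V(w - step) on a grid of centers w.
second_diffs = [
    V(w + step, f, delta) - 2.0 * V(w, f, delta) + V(w - step, f, delta)
    for w in [i / 10.0 for i in range(-10, 11)]
]
```

Since the quadrature is a nonnegative combination of the convex functions ${\boldsymbol{w}}\mapsto-f({\boldsymbol{w}}-{\boldsymbol{x}})$ , all second differences are positive, mirroring the argument in the proof.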

Definition B.1 (Weak solution of PDE).

We say that $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ is a weak solution of (B.2) with initial and boundary conditions (B.3) if $\rho\in\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(D))$ and, for any test function $h\in\mathscrsfs{C}^{2,1}(D\times[0,T])$ , satisfying $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial D,t\in[0,T]$ , we have

[TABLE]

We now state and prove Duhamel’s principle for the PDE (B.2). Duhamel’s principle follows from the fact that the right-hand side of (B.2) contains the linear diffusion term $\tau\Delta\rho$ , and it will be crucial for the proofs that will follow.

Lemma B.2 (Duhamel’s principle).

Assume $\tau>0$ . Let $G^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)$ denote the heat kernel with Neumann boundary conditions, defined in (G.1)-(G.3). Let $\rho$ be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, for any $t>0$ , $\rho_{t}({\rm d}{\boldsymbol{x}})$ has a density, denoted by $\rho(\;\cdot\;,t)$ , which satisfies, for any $t>0$ ,

[TABLE]

Proof.

By rescaling time, without loss of generality, we set $\tau=1$ . Let $\varphi\in\mathscrsfs{C}^{2}(D)$ , and define

[TABLE]

By the properties of the heat kernel, we have:

[TABLE]

Let $\rho_{t}$ be a weak solution. We choose the test function $h({\boldsymbol{x}},s)=G_{\varphi}({\boldsymbol{x}};t-s)$ in (B.5) with $T=t$ . Note that by (B.8), this test function satisfies the Neumann boundary condition. In addition, by (B.9) we obtain

[TABLE]

By an application of Fubini’s theorem, this implies

[TABLE]

Since $\varphi\in\mathscrsfs{C}^{2}(D)$ is arbitrary, we obtain that $\rho_{t}$ admits a density and (B.6) follows. ∎

As an intermediate step towards proving existence and uniqueness, we consider a linearized problem

[TABLE]

with initial and boundary conditions

[TABLE]

Here, $\Psi_{*}:D\times{\mathbb{R}}\to{\mathbb{R}}$ is independent of $\rho$ , and weak solutions are defined as for the original problem (with Neumann boundary conditions).

Corollary B.3 (Uniqueness of linearized problem).

Assume that $\tau>0$ and also that

[TABLE]

Then, the PDE (B.13) with initial and boundary conditions (B.14) has at most one weak solution.

Proof.

Without loss of generality, we will set $\tau=1$ . Assume by contradiction that $\rho^{(1)}$ , $\rho^{(2)}$ are two solutions. Fix arbitrary $0\leq t^{\prime}\leq t$ . Then, by an application of (B.6) to $\Psi_{*}({\boldsymbol{x}},t)$ , we have

[TABLE]

where we used the estimates of Theorem G.1. By taking the supremum over $0\leq t^{\prime}\leq t$ on both sides, we obtain that for $t<1/(C(D)^{2}\|\nabla\Psi_{*}\|_{\mathscrsfs{L}^{\infty}(D\times[0,T])}^{2})$ ,

[TABLE]

Therefore, the two solutions coincide if we fix the initial condition $\rho^{(1)}(\,\cdot\,,0)=\rho^{(2)}(\,\cdot\,,0)=\rho_{\mbox{\tiny\rm init}}$ . For larger $t$ , the claim follows by iterating the above argument. ∎

Appendix C Nonlinear dynamics

The ‘nonlinear dynamics’ plays an important role in our proof of Theorem 5.1. In this section we adopt the same general setting as in Appendix B, remembering that for our application we set $D=\Omega^{\delta}$ and $U,V$ as per Eq. (B.4).

Given $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ , consider the following stochastic differential equation for a process $(\boldsymbol{X}_{t})_{t\in[0,T]}$ , with a reflecting boundary condition (known as ‘Skorokhod problem’)

[TABLE]

where $(\boldsymbol{B}_{t})_{t\geq 0}$ is a standard $d$ -dimensional Brownian motion and $({\boldsymbol{\Phi}}_{t})_{t\geq 0}$ enforces the reflecting boundary by satisfying the following constraints (recall that ${\boldsymbol{n}}({\boldsymbol{x}})$ is the normal to $\partial D$ at ${\boldsymbol{x}}\in\partial D$ , directed inside):

$(i)$

$({\boldsymbol{\Phi}}_{t})_{t\geq 0}$ is adapted (and hence so is $(\boldsymbol{X}_{t})_{t\geq 0}$ ).

$(ii)$

$t\mapsto{\boldsymbol{\Phi}}_{t}$ has (almost surely) bounded variation. Denoting by $\|{\boldsymbol{\Phi}}\|_{\mbox{\tiny\rm TV}}(t)$ the total variation of ${\boldsymbol{\Phi}}$ on the interval $[0,t]$ , we define the measure $\mu_{\Phi}$ on $[0,T]$ by $\mu_{\Phi}([0,t])=\|{\boldsymbol{\Phi}}\|_{\mbox{\tiny\rm TV}}(t)$ .

$(iii)$

$\mu_{\Phi}(\{t:\,\boldsymbol{X}_{t}\in D^{\circ}\})=0$ , where $D^{\circ}$ denotes the interior of $D$ .

$(iv)$

We have that, for $t\in[0,T]$ ,

[TABLE]

where ${\boldsymbol{N}}_{s}={\boldsymbol{n}}(\boldsymbol{X}_{s})$ , for $\mu_{\Phi}$ -almost every $s$ .

Then, $(\boldsymbol{X}_{t},{\boldsymbol{\Phi}}_{t})_{t\in[0,T]}$ is said to solve the Skorokhod problem.

Lemma C.1 (Existence, uniqueness and continuity of Skorokhod problem).

Fix $\rho_{\mbox{\tiny\rm init}}\in\mathscrsfs{P}_{2}(D)$ and let $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ with $\rho_{0}=\rho_{\mbox{\tiny\rm init}}$ . Then, the Skorokhod problem (C.1), (C.2) admits a unique solution $(\boldsymbol{X}_{t})_{t\geq 0}$ with continuous paths. Define $\mathscrsfs{F}(\rho)_{t}\in\mathscrsfs{P}_{2}(D)$ , for $t\in[0,T]$ , by letting $\mathscrsfs{F}(\rho)_{t}={\rm Law}(\boldsymbol{X}_{t})$ . Then, $\mathscrsfs{F}(\rho)\in\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(D))$ .

Proof.

Let ${\boldsymbol{b}}({\boldsymbol{x}},t)\equiv-\nabla\Psi({\boldsymbol{x}},\rho_{t})$ and notice that, by the smoothness of $U,V$ , and compactness of $D$ , this is a Lipschitz continuous function of ${\boldsymbol{x}}$ . Hence the problem (C.1), (C.2) admits a unique solution by [Tan79, Theorem 4.1].

We are left with the task of proving that $t\mapsto\mathscrsfs{F}(\rho)_{t}$ is continuous in $W_{2}$ metric. Notice that

[TABLE]

By [Tan79, Lemma 2.2], we have, for any $s\leq t$ ,

[TABLE]

Taking expectation, we get

[TABLE]

whence the continuity follows. ∎
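The step from a bound on ${\mathbb{E}}|\boldsymbol{X}_{t}-\boldsymbol{X}_{s}|^{2}$ to a bound on $W_{2}$ uses the fact that any coupling of two laws gives an upper bound on their $W_{2}$ distance. The sketch below illustrates this for empirical measures on the real line, where the optimal coupling matches sorted atoms. The sample values and function names are illustrative assumptions.

```python
import math

def w2_empirical_1d(xs, ys):
    """W_2 distance between two empirical measures on the real line with equally many atoms.

    In one dimension the optimal coupling matches sorted samples (quantile coupling).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of atoms")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))

def coupling_cost(xs, ys):
    """Root mean squared gap under the index coupling (x_i, y_i); an upper bound on W_2."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

xs = [0.9, 0.1, 0.5]
ys = [0.2, 0.6, 1.0]
w2 = w2_empirical_1d(xs, ys)   # cost of the optimal (sorted) coupling
upper = coupling_cost(xs, ys)  # cost of one particular coupling
```

In the proof, the role of the particular coupling is played by the pair $(\boldsymbol{X}_{s},\boldsymbol{X}_{t})$ defined on the same probability space.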

Definition C.2 (Solution of nonlinear dynamics).

We say that $\rho\in\mathscrsfs{C}([0,T];\mathscrsfs{P}_{2}(D))$ is a solution of the nonlinear dynamics if $\mathscrsfs{F}(\rho)=\rho$ , namely

[TABLE]

Lemma C.3.

Assume $\tau>0$ . If $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ is a weak solution of the PDE (B.2) with initial and boundary conditions (B.3), then it is a solution of the nonlinear dynamics. Vice versa, if $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ is a solution of the nonlinear dynamics, then it is a weak solution of PDE (B.2) with initial and boundary conditions (B.3).

Proof.

Let $\rho$ be a weak solution of the PDE (B.2), and assume $\tau>0$ . Let $(\boldsymbol{X}_{t})_{t\geq 0}$ be the unique solution of the Skorokhod problem (C.1), (C.2), cf. Lemma C.1. Let $\tilde{\rho}_{t}\equiv{\rm Law}(\boldsymbol{X}_{t})$ , $t\geq 0$ , i.e. $\tilde{\rho}\equiv\mathscrsfs{F}(\rho)$ . For $g\in\mathscrsfs{C}^{2}(D)$ , satisfying $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla g({\boldsymbol{x}})\rangle=0$ for all ${\boldsymbol{x}}\in\partial D$ , compute

[TABLE]

Here $(a)$ follows from Ito’s formula for continuous semimartingales [RW94], $(b)$ since $\boldsymbol{X}_{s}\in\partial D$ and ${\boldsymbol{N}}_{s}={\boldsymbol{n}}(\boldsymbol{X}_{s})$ for $\mu_{\Phi}$ -almost every $s$ , and $(c)$ by the definition of $\tilde{\rho}$ . We conclude that $\tilde{\rho}$ is a weak solution of the linearized PDE (B.13), with $\Psi_{*}({\boldsymbol{x}},t)=\Psi({\boldsymbol{x}},\rho_{t})$ . Since $\rho$ also solves the same linearized PDE, we conclude by Corollary B.3 that $\tilde{\rho}_{t}=\rho_{t}$ for all $t\in[0,T]$ , and therefore $\rho$ is a solution of the nonlinear dynamics.

Next, assume that $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ is a solution of the nonlinear dynamics. Then by the same application of Ito’s formula to the process $\boldsymbol{X}_{t}$ , we have

[TABLE]

which coincides with the claim that $\rho$ is a weak solution of the PDE (B.2). ∎

Theorem C.4 (Existence and uniqueness of nonlinear dynamics).

For any initial condition $\rho_{\mbox{\tiny\rm init}}\in\mathscrsfs{P}_{2}(D)$ , and any $T>0$ , the nonlinear dynamics (C.6) admits a unique solution $\rho:[0,T]\to\mathscrsfs{P}_{2}(D)$ with $\rho_{0}=\rho_{\mbox{\tiny\rm init}}$ . As a consequence, the PDE (B.2) with initial and boundary conditions (B.3) has a unique solution.

Proof.

Note that it is sufficient to prove the claim for $T\leq T_{0}$ , where $T_{0}>0$ is a small enough constant, since this implies the claim for arbitrary $T$ by breaking $[0,T]$ into intervals of size smaller than $T_{0}$ .

We claim that $\mathscrsfs{F}$ is a contraction on $\mathscrsfs{C}([0,T],{\mathcal{P}}_{2}(D))$ endowed with the metric $d(\rho,\tilde{\rho})\equiv\sup_{t\in[0,T]}W_{2}(\rho_{t},\tilde{\rho}_{t})$ . To show that this is the case, define ${\boldsymbol{b}}({\boldsymbol{x}},t)\equiv-\nabla\Psi({\boldsymbol{x}},\rho_{t})$ , $\tilde{\boldsymbol{b}}({\boldsymbol{x}},t)\equiv-\nabla\Psi({\boldsymbol{x}},\tilde{\rho}_{t})$ . By the smoothness of $U$ , $V$ and by the compactness of $D$ , we have that ${\boldsymbol{b}}$ and $\tilde{\boldsymbol{b}}$ are Lipschitz continuous in ${\boldsymbol{x}}$ , with Lipschitz constant $L$ independent of $t,\rho,\tilde{\rho}$ . Further,

[TABLE]

Let $(\boldsymbol{X}_{t},{\boldsymbol{\Phi}}_{t})$ and $(\tilde{\boldsymbol{X}}_{t},\tilde{\boldsymbol{\Phi}}_{t})$ be solutions of the Skorokhod problem (C.2), with drift coefficients ${\boldsymbol{b}}({\boldsymbol{x}},t)$ , $\tilde{\boldsymbol{b}}({\boldsymbol{x}},t)$ . We couple the processes $\boldsymbol{X}_{t}$ and $\tilde{\boldsymbol{X}}_{t}$ by using the same initial condition $\boldsymbol{X}_{0}$ and the same Brownian motion $\boldsymbol{B}_{t}$ :

[TABLE]

Define

[TABLE]

and notice that, by the above remarks,

[TABLE]

Further, by [Tan79, Remark 2.2], we have

[TABLE]

Define $\Delta(t)\equiv{\mathbb{E}}\{|\boldsymbol{X}_{t}-\tilde{\boldsymbol{X}}_{t}|^{2}\}$ and $\overline{\Delta}(t)\equiv\sup_{s\leq t}\Delta(s)$ . By taking the expectation of the last inequality and using Jensen’s inequality, we get

[TABLE]

which immediately implies

[TABLE]

Hence, for $T_{0}<(2L)^{-1/2}$ ,

[TABLE]

Selecting $T_{0}$ small enough, so that $(2CT_{0}^{2})/(1-2LT_{0}^{2})\leq 1/2$ , we obtain

[TABLE]

This proves that $\mathscrsfs{F}$ is a contraction as claimed. By Lemma C.1, $\mathscrsfs{F}$ maps $\mathscrsfs{C}([0,T],{\mathcal{P}}_{2}(D))$ into itself. Furthermore, $\mathscrsfs{C}([0,T],{\mathcal{P}}_{2}(D))$ is complete with respect to the metric $d$ . As a result, there exists a unique fixed point. ∎

We conclude this section by stating a result about the discretization of the nonlinear dynamics. Fix a solution $(\rho_{t})_{t\geq 0}$ of the PDE (B.2) with initial condition $\rho_{0}=\rho_{\mbox{\tiny\rm init}}$ and a step size ${\varepsilon}>0$ , and define recursively the random variables $(\boldsymbol{X}^{{\varepsilon}}_{k})_{k\in{\mathbb{N}}}$ by

[TABLE]

This can be viewed as an Euler discretization of the stochastic differential equation (C.1), (C.2), and the next theorem establishes that this is indeed a close approximation of the original process. It is just an immediate consequence of a result of Slomiński [Slo94, Slo01].
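A projected Euler scheme of this kind can be sketched in one dimension as follows. The update rule, the noise scaling $\sqrt{2\tau{\varepsilon}}$ (chosen to match a diffusion term $\tau\Delta\rho$ ), the interval $[0,1]$ standing in for $D$ , and the drift are all assumptions of this example and are not meant to reproduce Eqs. (C.19), (C.20) exactly.

```python
import math
import random

def project_interval(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi], standing in for the domain D."""
    return min(max(x, lo), hi)

def projected_euler(x0, drift, tau, eps, n_steps, rng):
    """Projected Euler scheme for a reflected SDE with drift b and noise level tau.

    The update X <- P(X + eps * b(X, t) + sqrt(2 * tau * eps) * g) and its noise
    scaling are assumptions of this sketch.
    """
    path = [x0]
    x = x0
    for k in range(n_steps):
        g = rng.gauss(0.0, 1.0)  # standard Gaussian increment
        x = project_interval(x + eps * drift(x, k * eps) + math.sqrt(2.0 * tau * eps) * g)
        path.append(x)
    return path

rng = random.Random(1)
drift = lambda x, t: -(x - 0.5)  # hypothetical Lipschitz drift playing the role of -grad Psi
path = projected_euler(x0=0.9, drift=drift, tau=0.1, eps=1e-3, n_steps=5000, rng=rng)
```

The projection replaces the boundary process ${\boldsymbol{\Phi}}_{t}$ : whenever a step would leave the domain, the iterate is moved back to the nearest point of $D$ .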

Theorem C.5 (Theorem 3.2 in [Slo01]).

Consider the nonlinear dynamics defined by Eqs. (C.1), (C.2). Assume ${\sf B}({\boldsymbol{0}};r)\subseteq D$ , and $\|\nabla V\|_{\mathscrsfs{L}^{\infty}(D)}$ , $\|\nabla U\|_{\mathscrsfs{L}^{\infty}(D\times D)}$ , $\|\nabla V\|_{{\rm Lip}}$ , $\|\nabla U\|_{{\rm Lip}}\leq L$ . Also assume that ${\rm supp}(\rho_{\mbox{\tiny\rm init}})\subseteq{\sf B}({\boldsymbol{0}},r)$ . Construct the Euler scheme (C.19), (C.20) on the same probability space by letting $\boldsymbol{X}^{{\varepsilon}}_{0}=\boldsymbol{X}_{0}$ and ${\boldsymbol{g}}_{k}=(\boldsymbol{B}((k+1){\varepsilon})-\boldsymbol{B}(k{\varepsilon}))/\sqrt{{\varepsilon}}$ . Then, for any $p\in{\mathbb{N}}$ , $T\in{\mathbb{R}}_{\geq 0}$ ,

[TABLE]

Proof.

The proof is obtained simply by chasing the constants in the proof of Theorem 3.2 (part (ii)) of [Slo01], and using the optimal constant in the Burkholder-Davis-Gundy inequality (which yields $C(p)\leq(C_{*}p)^{2p}$ in [Slo01, Eq. (2.7)]). ∎
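To make the construction concrete, the following is a minimal one-dimensional sketch of a projected Euler scheme driven by rescaled Brownian increments. It assumes the standard projected form $X_{k+1}={\sf P}(X_{k}+{\varepsilon}\,b(X_{k})+\sqrt{2{\varepsilon}\tau}\,g_{k})$; the drift `drift`, the noise level `tau`, and the interval domain are hypothetical choices made for illustration and are not taken from (C.19), (C.20).

```python
import math
import random

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto [lo, hi], a stand-in for the convex domain D."""
    return min(max(x, lo), hi)

def drift(x):
    """Hypothetical Lipschitz drift, playing the role of -grad Psi."""
    return -x

def projected_euler(x0, eps, n_steps, tau, rng):
    """Iterate X_{k+1} = P(X_k + eps*b(X_k) + sqrt(2*eps*tau)*g_k), g_k ~ N(0, 1)."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        # g plays the role of (B((k+1)*eps) - B(k*eps)) / sqrt(eps)
        g = rng.gauss(0.0, 1.0)
        x = project(x + eps * drift(x) + math.sqrt(2.0 * eps * tau) * g)
        path.append(x)
    return path

rng = random.Random(0)
path = projected_euler(x0=0.5, eps=0.01, n_steps=1000, tau=0.1, rng=rng)
```

The projection step is what keeps every iterate inside the domain; in the continuous-time limit it corresponds to the reflection term of the Skorokhod problem.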

Appendix D Convergence of SGD to the PDE: Proof of Theorem 5.1

The proof is a ‘propagation of chaos’ argument [Szn91]. While the basic idea is similar to the one used in [MMN18], implementing it requires different estimates because of the reflecting boundary conditions. In particular, we rely on tools developed in the study of discretizations of reflecting stochastic differential equations.

We will prove a more general theorem that implies Theorem 5.1 as a special case, and also applies to the setting of [MMN18]. Namely, we consider data $\{{\boldsymbol{z}}_{i}=(y_{i},{\boldsymbol{x}}_{i})\}_{i\geq 1}$ i.i.d. with common distribution ${\mathbb{P}}$ on ${\mathbb{R}}\times{\mathbb{R}}^{d_{0}}$ , and parameters ${\boldsymbol{w}}_{i}\in D\subseteq{\mathbb{R}}^{d}$ . These parameters are initially sampled independently from distribution $\rho_{0}\in\mathscrsfs{P}_{2}(D)$ , and then evolve according to

[TABLE]

Here ${\sf P}$ is the projection on the closed convex domain $D\subseteq{\mathbb{R}}^{d}$ with non-empty interior. The setting of Theorem 5.1 is recovered by taking $\sigma({\boldsymbol{x}};{\boldsymbol{w}})=K_{\delta}({\boldsymbol{x}}-{\boldsymbol{w}})$ , $D=\Omega^{\delta}$ , ${\boldsymbol{x}}_{k}\sim{\sf Unif}(\Omega)$ , ${\mathbb{E}}\{y_{k}|{\boldsymbol{x}}_{k}\}=f({\boldsymbol{x}}_{k})$ .
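As an illustration of this setting, here is a minimal sketch of projected SGD for bump-like neurons in dimension $d=1$. Everything specific in it is an assumption: the Epanechnikov kernel for $K$, the interval $[\delta,1-\delta]$ standing in for $\Omega^{\delta}$, the concave target `target`, unit output weights, and a noiseless update with step size `step`. It is not a transcription of the update (D.1).

```python
import random

def bump(u):
    """Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def bump_grad(u):
    """Derivative K'(u) of the Epanechnikov kernel."""
    return -1.5 * u if abs(u) < 1.0 else 0.0

def sigma(x, w, delta):
    """sigma(x; w) = K_delta(x - w) = K((x - w)/delta) / delta for d = 1."""
    return bump((x - w) / delta) / delta

def sigma_grad_w(x, w, delta):
    """d/dw K_delta(x - w) = -K'((x - w)/delta) / delta^2."""
    return -bump_grad((x - w) / delta) / (delta * delta)

def predict(x, ws, delta):
    """f_hat(x; w) = (1/N) * sum_i sigma(x; w_i), with all output weights set to 1."""
    return sum(sigma(x, w, delta) for w in ws) / len(ws)

def target(x):
    """Hypothetical concave target on [0, 1]."""
    return 1.0 + 0.5 * x * (1.0 - x)

def sgd_step(ws, x, y, step, delta, lo, hi):
    """One projected step: w_i <- P(w_i + step * (y - f_hat(x)) * d/dw sigma(x; w_i))."""
    residual = y - predict(x, ws, delta)
    return [min(max(w + step * residual * sigma_grad_w(x, w, delta), lo), hi)
            for w in ws]

rng = random.Random(0)
delta, n_neurons = 0.2, 50
lo, hi = delta, 1.0 - delta
ws = [rng.uniform(lo, hi) for _ in range(n_neurons)]
for _ in range(2000):
    x = rng.uniform(0.0, 1.0)  # x_k ~ Unif(Omega) with Omega = [0, 1]
    ws = sgd_step(ws, x, target(x), step=1e-3, delta=delta, lo=lo, hi=hi)
```

The step uses the mean-field scaling, in which the factor $1/N$ from $\hat{f}$ is absorbed into the step size, so each neuron moves by an amount of order `step` per sample.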

We make the following assumptions:

(G1)

$\|y\|_{\infty}$ , $\|\sigma\|_{\infty}={\rm ess}\sup_{{\boldsymbol{w}}\in D,{\boldsymbol{x}}}|\sigma({\boldsymbol{x}};{\boldsymbol{w}})|\leq\sigma_{\infty}$ , and $\nabla_{{\boldsymbol{w}}}\sigma({\boldsymbol{x}};{\boldsymbol{w}})$ is $\gamma$ -subgaussian.

(G2)

Letting $V({\boldsymbol{w}})=-{\mathbb{E}}\{y\sigma({\boldsymbol{x}};{\boldsymbol{w}})\}$ , $U({\boldsymbol{w}}_{1},{\boldsymbol{w}}_{2})\equiv{\mathbb{E}}\{\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{1})\sigma({\boldsymbol{x}};{\boldsymbol{w}}_{2})\}$ , both $V$ and $U$ are differentiable with Lipschitz continuous derivative, namely $\|\nabla V\|_{{\rm Lip}},\|\nabla U\|_{{\rm Lip}}\leq L$ . Further, we assume $\|\nabla U\|_{\mathscrsfs{L}^{\infty}(D\times D)}<\infty$ .

Theorem D.1.

Consider the general update (D.1) with initialization $({\boldsymbol{w}}_{i}^{0})_{i\leq N}\sim_{iid}\rho_{0}=\rho_{\mbox{\tiny\rm init}}$ , under the conditions ${\sf(G1)}$ , ${\sf(G2)}$ above. For $t\geq 0$ , let $\rho_{t}$ be the unique solution of the PDE (B.2) with initial and boundary conditions (B.3). Assume ${\rm supp}(\rho_{\mbox{\tiny\rm init}})\subseteq{\sf B}({\boldsymbol{0}},r)$ .

Then, for $T\geq 0$ with $TL\geq 1$ , any $g:\mathbb{R}^{d}\to\mathbb{R}$ with $\|g\|_{\rm Lip}\leq 1$ and for $\varepsilon\leq 1$ , $p\in{\mathbb{N}}$ , the following holds with probability at least $1-z^{-2p}$ :

[TABLE]

where

[TABLE]

Theorem 5.1 follows as a special case of Theorem D.1 by considering $\sigma({\boldsymbol{x}};{\boldsymbol{w}})=K_{\delta}({\boldsymbol{x}}-{\boldsymbol{w}})$ and letting $\sigma_{\infty}\leq C_{*}\delta^{-d}$ , $\gamma=C_{*}\delta^{-d-1}$ and $L=C_{*}\delta^{-2d-1}$ .
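The first two scalings can be read off from the rescaling of the kernel. The computation below is a sketch under two assumptions that are not spelled out in this appendix: the normalization $K_{\delta}({\boldsymbol{x}})=\delta^{-d}K({\boldsymbol{x}}/\delta)$ with $K$ and $\nabla K$ bounded, and the fact that a bounded random vector is sub-Gaussian with parameter proportional to its sup norm.

```latex
\begin{align*}
\sup_{\boldsymbol{x},\boldsymbol{w}}\,|K_{\delta}(\boldsymbol{x}-\boldsymbol{w})|
  &= \delta^{-d}\,\sup_{\boldsymbol{u}}|K(\boldsymbol{u})| \le C_{*}\,\delta^{-d},\\
\nabla_{\boldsymbol{w}}K_{\delta}(\boldsymbol{x}-\boldsymbol{w})
  &= -\delta^{-d-1}\,(\nabla K)\big((\boldsymbol{x}-\boldsymbol{w})/\delta\big),
\qquad
\sup_{\boldsymbol{x},\boldsymbol{w}}\,|\nabla_{\boldsymbol{w}}K_{\delta}(\boldsymbol{x}-\boldsymbol{w})|
  \le C_{*}\,\delta^{-d-1}.
\end{align*}
```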

Proof.

Let ${\mathcal{F}}_{k}$ denote the sigma algebra generated by $({\boldsymbol{z}}_{j})_{j\leq k}$ and denote the empirical distribution of $({\boldsymbol{w}}_{i}^{k})_{i\leq N}$ by $\rho^{(N)}_{k}\equiv N^{-1}\sum_{i=1}^{N}\delta_{{\boldsymbol{w}}_{i}^{k}}$ . Note that

[TABLE]

We introduce two auxiliary processes $(\overline{\boldsymbol{w}}_{i}^{k})_{i\leq N}$ and $(\hat{\boldsymbol{w}}_{i}^{k})_{i\leq N}$ , with initial conditions $\overline{\boldsymbol{w}}_{i}^{0}=\hat{\boldsymbol{w}}_{i}^{0}={\boldsymbol{w}}_{i}^{0}$ , as follows:

•

The trajectories $(\hat{\boldsymbol{w}}_{i}^{k})_{k\geq 0}$ are i.i.d. copies of the nonlinear dynamics introduced in Appendix C, sampled at times $t=k{\varepsilon}$ . Namely, for any $k\in{\mathbb{N}}$

[TABLE]

In particular, for any $k$ , $(\hat{\boldsymbol{w}}_{i}^{k})_{i\leq N}\sim_{iid}\rho_{k{\varepsilon}}$ .

•

The trajectories $(\overline{\boldsymbol{w}}_{i}^{k})_{k\geq 0}$ are obtained by the Euler discretization of the non-linear dynamics:

[TABLE]

As above, $(\rho_{s})_{s\geq 0}$ is the solution of the PDE (B.2). Note that, again, the $(\overline{\boldsymbol{w}}_{i}^{k})_{i\leq N}$ are i.i.d. although their distribution does not coincide with $\rho_{k{\varepsilon}}$ .

We construct these three processes on the same probability space by letting $\boldsymbol{B}_{i}((k+1){\varepsilon})=\boldsymbol{B}_{i}(k{\varepsilon})+\sqrt{{\varepsilon}}{\boldsymbol{g}}_{i}^{k+1}$ , and define the distances (for $q\geq 1$ )

[TABLE]

Theorem C.5 yields, for $p\in{\mathbb{N}}$ ,

[TABLE]

Note that ${\boldsymbol{w}}_{i}^{k}$ , $\overline{\boldsymbol{w}}_{i}^{k}$ take the form

[TABLE]

where ${\boldsymbol{M}}_{i}^{k},\overline{\boldsymbol{M}}_{i}^{k}$ are martingales with respect to the filtration ${\mathcal{F}}_{k}$ : ${\mathbb{E}}\{{\boldsymbol{M}}_{i}^{k}|{\mathcal{F}}_{k-1}\}={\boldsymbol{M}}_{i}^{k-1}$ , ${\mathbb{E}}\{\overline{\boldsymbol{M}}_{i}^{k}|{\mathcal{F}}_{k-1}\}=\overline{\boldsymbol{M}}_{i}^{k-1}$ , and ${\boldsymbol{V}}_{i}^{k}$ , $\overline{\boldsymbol{V}}_{i}^{k}$ are ${\mathcal{F}}_{k-1}$ -measurable. Explicitly

[TABLE]

Finally, ${\boldsymbol{\varphi}}^{k}_{i}$ , $\overline{\boldsymbol{\varphi}}^{k}_{i}$ are corrections to satisfy the constraint ${\boldsymbol{w}}_{i}^{k},\overline{\boldsymbol{w}}_{i}^{k}\in D$ . Indeed the above can be viewed as Skorokhod problems with unknowns $({\boldsymbol{w}}_{i},{\boldsymbol{\varphi}}_{i})$ and $(\overline{\boldsymbol{w}}_{i},\overline{\boldsymbol{\varphi}}_{i})$ .

Using [Slo94, Theorem 1] (where we can set $C_{p}=(C_{*}p)^{2p}$ which is the tight constant in the Burkholder-Davis-Gundy inequality), we get

[TABLE]

where $[{\boldsymbol{M}}]_{k}$ denotes the quadratic variation of the martingale ${\boldsymbol{M}}$ , and $|{\boldsymbol{V}}|_{k}$ is the total variation of the process ${\boldsymbol{V}}$ . We then have

[TABLE]

Note that under the stated assumptions the martingale increments ${\boldsymbol{Z}}^{\ell}_{i}$ are sub-Gaussian with variance proxy upper bounded by $v^{2}=C_{*}{\varepsilon}^{2}\sigma_{\infty}^{2}\gamma^{2}$ . Therefore, by using the moment generating function of the $\chi^{2}_{d}$ distribution, we have

[TABLE]

Hence,

[TABLE]

By using the inequality $x^{p}\leq e^{x}p!$ , this implies, for $\alpha\leq\sqrt{d/2}$ ,

[TABLE]

Equivalently,

[TABLE]

By taking $\alpha=\sqrt{p/k}$ (which is allowed provided $p\leq\sqrt{kd/2}$ ), we obtain that

[TABLE]
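Two elementary facts are used in the chain of bounds above. They are recorded here as a reminder, with the normalization of the variance proxy left implicit as in the text.

```latex
\begin{align*}
% Moment generating function of a chi-squared variable with d degrees of freedom
\mathbb{E}\big\{e^{\lambda\chi^{2}_{d}}\big\} &= (1-2\lambda)^{-d/2}, \qquad \lambda<1/2,\\
% Keep only the p-th term of the exponential series, for x >= 0
e^{x}=\sum_{j\ge 0}\frac{x^{j}}{j!}\;\ge\;\frac{x^{p}}{p!}
\quad &\Longrightarrow\quad x^{p}\le p!\,e^{x}.
\end{align*}
```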

We next consider the total variation of the process ${\boldsymbol{V}}_{i}$ in Eq. (D.17). We have

[TABLE]

Using the Lipschitz property of $\nabla V$ , $\nabla U$ , we get

[TABLE]

For the second term, we get, by the triangle inequality,

[TABLE]

We next use the expression ${\boldsymbol{G}}({\boldsymbol{w}};\rho)=\nabla V({\boldsymbol{w}})+\int\nabla U({\boldsymbol{w}},{\boldsymbol{w}}^{\prime})\,\rho({\rm d}{\boldsymbol{w}}^{\prime})$ , and the fact that $\hat{\boldsymbol{w}}_{i}^{\ell}\sim\rho_{\ell{\varepsilon}}$ , to get

[TABLE]

Using once more the Lipschitz property of $\nabla U$ , and the symmetry of the distributions of $({\boldsymbol{w}}_{i}^{\ell})_{i\leq N}$ , $(\hat{\boldsymbol{w}}_{i}^{\ell})_{i\leq N}$ under permutations, we obtain

[TABLE]

Finally, $|\nabla U(\overline{\boldsymbol{w}}_{i}^{\ell},\hat{\boldsymbol{w}}_{j}^{\ell})|\leq L$ and therefore the vector

[TABLE]

is sub-Gaussian, with variance proxy upper bounded by $v^{2}=L^{2}/N$ . This implies that ${\mathbb{E}}\{|\boldsymbol{W}|^{2p}\}^{1/(2p)}\leq C_{*}\sqrt{dp}\,v$ , and therefore

[TABLE]

Substituting (D.23), (D.24), (D.27), (D.28) in Eq. (D.17), we obtain

[TABLE]

Using Eq. (D.10) and Gronwall’s inequality, along with the fact that $k{\varepsilon}\leq T$ , this yields

[TABLE]

By using Eq. (D.10) again, we get

[TABLE]

By Markov’s inequality, together with Jensen’s inequality applied to the convex function $x^{2p}$ , we have

[TABLE]

where in the third step we used (D.8) and (D.9). Set $\Delta=z\,e^{C_{*}pLT}{\sf err}(N,d,{\varepsilon})$ . Thus, we obtain

[TABLE]

with probability at least $1-z^{-2p}$ .

The bounds in Eq. (D.3) follow straightforwardly from Eq. (D.31) as in the proofs of Lemma 3.3 and 3.4 in the supplementary material of [MMN18]. ∎
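The probability bound at the end of the proof is an instance of Markov's inequality applied to a $2p$-th moment. As a sketch, let $X\geq 0$ be a generic random variable (a placeholder for the distance being controlled, not notation from the text) with ${\mathbb{E}}\{X^{2p}\}^{1/(2p)}\leq M$. Then:

```latex
\mathbb{P}\big(X\ge zM\big)
=\mathbb{P}\big(X^{2p}\ge (zM)^{2p}\big)
\le\frac{\mathbb{E}\{X^{2p}\}}{(zM)^{2p}}
\le z^{-2p}.
```

If the moment bound takes the form $M=e^{C_{*}pLT}{\sf err}(N,d,{\varepsilon})$, this corresponds to the choice $\Delta=zM$ made above.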

Appendix E Regularity of the solutions of the PDE (3.9) ( $\delta>0$ )

In this section we prove some standard regularity properties of the solutions of the PDE (3.9), for $\delta>0$ , and indeed for the more general PDE (B.2). First of all, we show that the weak solution of the PDE (B.2) is in fact strong, i.e., $\rho\in\mathscrsfs{C}^{2,1}(\Omega^{\delta},[0,T])$ and the equation (B.2) holds pointwise. We will then prove upper bounds on $\nabla K^{\delta}\ast\rho$ and $\nabla U^{\delta}\ast\rho$ that are uniform in $\delta$ . These will be crucial in order to take the $\delta\to 0$ limit in the next section.

We start by proving a bound on the $\mathscrsfs{L}^{\infty}$ norm of $\rho$ . In the proofs of the two lemmas that follow, we assume without loss of generality that $\tau=1$ .

Lemma E.1 (Bound on $\mathscrsfs{L}^{\infty}$ norm).

Let $\rho_{t}$ be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Recall that $\rho_{t}$ has a density with respect to Lebesgue measure, denoted by $\rho(\,\cdot\,,t)$ . Then, there exists a constant $C(\Omega)$ such that, by letting $L=(\|\nabla V\|_{\mathscrsfs{L}^{\infty}(\Omega)}\vee\|\nabla U\|_{\mathscrsfs{L}^{\infty}(\Omega\times\Omega)})$ , we have

[TABLE]

Proof.

Any solution of the PDE (B.2) satisfies Eq. (B.6). Given a measurable (Borel) function $\rho\in m{\mathcal{B}}(\Omega\times[0,T])$ , denote by $\mathscrsfs{D}(\rho)\in m{\mathcal{B}}(\Omega\times[0,T])$ the function given by the right-hand side of (B.2). Let $C(\Omega)$ be the constant in the statement of Theorem G.1 (part 3) and let $C_{U,V}\equiv C(\Omega)(\|\nabla V\|_{\mathscrsfs{L}^{\infty}(\Omega)}+\|\nabla U\|_{\mathscrsfs{L}^{\infty}(\Omega\times\Omega)})$ . We then have

[TABLE]

Hence

[TABLE]

Proceeding analogously for two different densities $\rho,\tilde{\rho}$ , we get

[TABLE]

Hence $\mathscrsfs{D}$ maps $\mathscrsfs{L}^{\infty}(\Omega\times[0,T])$ into itself, and is a contraction for $C_{U,V}\sqrt{T}<1$ . Therefore, it must have a unique fixed point in $\mathscrsfs{L}^{\infty}$ that coincides with the unique solution of PDE (B.2). Let $T_{0}=1/(4C_{U,V}^{2})$ . Then for that fixed point $\rho\in\mathscrsfs{L}^{\infty}(\Omega\times[0,T])$ we have from Eq. (E.3)

[TABLE]

The desired claim follows by iterating this inequality $\lceil t/T_{0}\rceil$ times. ∎

Lemma E.2 (Strong solutions of PDE).

Let $\rho_{t}$ be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3), and recall that, for any $t\leq T<\infty$ , this has a density $\rho(\,\cdot\,,t)$ , with $\rho\in\mathscrsfs{L}^{\infty}(\Omega\times[0,T])$ . Fix $q\in\mathbb{N}$ . If $\rho_{\mbox{\tiny\rm init}}\in\mathscrsfs{C}^{q}(\Omega)$ , then $\rho\in\mathscrsfs{C}^{q,1}(\Omega,[0,T])$ .

Proof.

We prove the claim for $q=2$ . For larger values of $q$ , the proof is similar and only requires iterating the argument.

The proof uses the same bootstrap technique as [MMN18][Supplementary material, Lemma 6.7]. The only difference is that the Duhamel formula of Eq. (B.6) involves the Neumann heat kernel in $\Omega$ instead of the heat kernel in ${\mathbb{R}}^{d}$ .

Let $S=\Omega\times[0,T]$ and consider a function $u:\Omega\times[0,T]\to{\mathbb{R}}$ . For $r\in{\mathbb{N}}$ , ${\boldsymbol{\alpha}}=(\alpha_{1},\dots,\alpha_{d})\in{\mathbb{N}}^{d}$ , let $D^{r}_{t}D^{{\boldsymbol{\alpha}}}_{{\boldsymbol{x}}}u$ be the generalized derivative of $u$ , and define the parabolic seminorm

[TABLE]

The proof of [MMN18][Supplementary material, Lemma 6.7] uses the following inequality from [LSU88][Chapter IV, Section 3, Eq. (3.1)]

[TABLE]

Furthermore, (G.11) of Theorem G.1 yields

[TABLE]

Since $G_{R}^{\Omega}\in\mathscrsfs{C}^{\infty}(\Omega\times\Omega\times[0,T])$ , we have that

[TABLE]

which immediately implies that

[TABLE]

The proof of [MMN18][Supplementary material, Lemma 6.7] can be repeated verbatim with (E.7) replaced by (E.10). ∎

As a consequence of the last lemma, the PDE (B.2) admits a unique strong solution $\rho\in\mathscrsfs{C}^{2,1}(\Omega,[0,T])$ with initial condition $\rho_{\mbox{\tiny\rm init}}$ and Neumann boundary condition. We will use $\rho(t)$ as shorthand for $\rho(\,\cdot\,,t)$ . The rest of this appendix is devoted to proving further regularity results for $\rho(t)$ , which will be crucial in the proofs provided in Appendix F. To emphasize the dependence of $\rho$ on $\delta$ , we will denote this solution by $\rho^{\delta}$ .

In what follows, we will set the initial condition $\rho^{\delta}(0)\equiv\rho_{\mbox{\tiny\rm init}}^{\delta}$ at $\delta>0$ to be defined via $\rho_{\mbox{\tiny\rm init}}^{\delta}({\boldsymbol{w}})=\lambda_{\delta}^{-d}\rho_{\mbox{\tiny\rm init}}({\boldsymbol{w}}/\lambda_{\delta})$ , with $\lambda_{\delta}$ given by Eq. (3.4).
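The rescaling preserves total mass and shrinks the support by the factor $\lambda_{\delta}$. The short numerical check below illustrates this in dimension $d=1$; the density `rho_init` and the value `lam = 0.8` are hypothetical stand-ins, since neither $\rho_{\mbox{\tiny\rm init}}$ nor the formula (3.4) for $\lambda_{\delta}$ is specified here.

```python
def rho_init(w):
    """Hypothetical initial density on [-1, 1]: 0.75*(1 - w^2), zero elsewhere."""
    return 0.75 * (1.0 - w * w) if abs(w) <= 1.0 else 0.0

def rho_init_rescaled(w, lam, d=1):
    """Rescaled density lam^{-d} * rho_init(w / lam), here with d = 1."""
    return rho_init(w / lam) / lam ** d

def midpoint_integral(f, a, b, n=20000):
    """Midpoint Riemann sum of f over [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

lam = 0.8  # hypothetical value standing in for lambda_delta
mass = midpoint_integral(lambda w: rho_init_rescaled(w, lam), -1.0, 1.0)
outside = rho_init_rescaled(0.9, lam)  # the point 0.9 lies outside lam * [-1, 1]
```

Here `mass` approximates the integral of the rescaled density, which equals one by the change of variables ${\boldsymbol{u}}={\boldsymbol{w}}/\lambda_{\delta}$, and `outside` confirms that the rescaled density vanishes outside the shrunken support.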

It is useful to recall the definition of free energy, which is given by

[TABLE]

The following lemma provides an expression for the derivative of the free energy with respect to time. Such an expression immediately yields an upper bound on the $\mathscrsfs{L}^{2}(\Omega)$ norm of $K^{\delta}*\rho^{\delta}(t)$ which is independent of $\delta$ .

Lemma E.3.

Let $\rho^{\delta}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta},[0,T])$ be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then,

[TABLE]

Proof.

By definition

[TABLE]

By differentiating $F^{\delta}(\rho^{\delta}(t))$ along the solution of (B.2), we obtain

[TABLE]

∎

Corollary E.4.

Let $\rho^{\delta}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta},[0,T])$ be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then,

[TABLE]

where $|\Omega^{\delta}|$ denotes the volume of the set $\Omega^{\delta}$ .

Proof.

By Lemma E.3 we have $F^{\delta}(\rho^{\delta}(t))\leq F^{\delta}(\rho^{\delta}(0))$ . The claim follows by substituting the definition of $F^{\delta}(\rho^{\delta})$ and using $S(\rho^{\delta})\leq\log|\Omega^{\delta}|$ . ∎
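The entropy bound used in this proof follows from Jensen's inequality. The sketch below assumes the usual convention $S(\rho)=-\int\rho\log\rho$ for a probability density $\rho$ supported in $\Omega^{\delta}$; if the free energy in the text uses a different normalization, the constants change accordingly.

```latex
S(\rho)
=\int_{\{\rho>0\}}\rho(\boldsymbol{x})\,\log\frac{1}{\rho(\boldsymbol{x})}\,{\rm d}\boldsymbol{x}
\;\le\;\log\int_{\{\rho>0\}}\rho(\boldsymbol{x})\,\frac{1}{\rho(\boldsymbol{x})}\,{\rm d}\boldsymbol{x}
=\log\big|\{\rho>0\}\big|
\;\le\;\log|\Omega^{\delta}|.
```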

Remark E.1.

By Corollary E.4, we are able to provide a $\delta$ -free upper bound on $\nu_{0}\|K^{\delta}{*}\rho^{\delta}(t)-f\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}$ . Specifically, $\Omega^{\delta}\subseteq\Omega$ and hence $|\Omega^{\delta}|\leq|\Omega|$ . We also have

[TABLE]

Note that

[TABLE]

Since $\lambda_{\delta}\to 1$ as $\delta\to 0$ , there exists a $C_{*}>0$ such that for $\delta<C_{*}$ , $\lambda_{\delta}\geq 1/2$ . Thus, the term $S(\rho^{\delta}(0))$ has a $\delta$ -free upper bound.

By Young’s inequality it only remains to give a $\delta$ -free upper bound on the quantity $\|\rho^{\delta}(0)\|_{\mathscrsfs{L}^{2}(\Omega)}$ . Let us write

[TABLE]

Again, for $\delta<C_{*}$ , $\lambda_{\delta}\geq 1/2$ . Also, by Assumption (A5) and the fact that $\Omega$ is compact, we have $\|\rho_{\mbox{\tiny\rm init}}\|^{2}_{\mathscrsfs{L}^{2}(\Omega)}<\infty$ , which proves the claim.

We next prove a $\delta$ -free upper bound on the gradient $\nabla K^{\delta}{*}\rho^{\delta}$ .

Lemma E.5.

Let $\rho^{\delta}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta},[0,T])$ be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, the following bound holds:

[TABLE]

Proof.

Denote by $\langle f,g\rangle=\int f({\boldsymbol{x}})g({\boldsymbol{x}}){\rm d}{\boldsymbol{x}}$ the standard scalar product in $\mathscrsfs{L}^{2}$ . Then,

[TABLE]

By integrating (E.16) between $0$ and $T$ , we obtain

[TABLE]

Hence, (E.15) follows from Corollary E.4. ∎

Remark E.2.

Note that by virtue of Lemma E.5, we are able to get a $\delta$ -free upper bound on the left-hand side of (E.15). Indeed, by definition of $\nabla V$ as per (B.4) and using Assumption (A3), we have the $\delta$ -free bound:

[TABLE]

In addition, by Remark E.1, $\|K^{\delta}{*}\rho^{\delta}(0)\|_{\mathscrsfs{L}^{2}(\Omega)}^{2}$ has a $\delta$ -free bound.

Appendix F Global convergence: Proof of Theorems 5.2 and 5.3

We start by showing that $\rho^{\delta}$ admits a limit in a suitable functional space as $\delta\to 0$ .

Lemma F.1 (Existence of converging subsequence).

Let $\rho^{\delta}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta},[0,T])$ be the unique solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, the family $(\rho^{\delta})_{\delta>0}$ is relatively compact in the space $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ . In particular, any sequence $(\rho^{\delta_{n}})_{n\geq 1}$ admits a converging subsequence.

Proof.

This follows from the Arzelà-Ascoli theorem. Notice that $\mathscrsfs{P}_{2}(\Omega)$ is compact due to the compactness of $\Omega$ . Therefore, it is sufficient to prove that the family is equicontinuous. Using the representation in terms of nonlinear dynamics (cf. Appendix C), we have

[TABLE]

Note that we omit for simplicity the dependence on $\delta$ . Recall that the nonlinear dynamics satisfy (for ${\boldsymbol{b}}({\boldsymbol{x}},t)\equiv-\nabla\Psi({\boldsymbol{x}},\rho_{t})$ )

[TABLE]

By [Slo01, Theorem 2.2], we have

[TABLE]

where $[\boldsymbol{B}]_{s}^{t}$ denotes the quadratic variation of $\boldsymbol{B}$ , and $|{\boldsymbol{V}}|_{s}^{t}$ the total variation of ${\boldsymbol{V}}$ between times $s$ and $t$ . We thus have

[TABLE]

Hence, in order to prove equicontinuity, it is sufficient to show that, for $s,t\leq T$ , $\int_{s}^{t}{\mathbb{E}}\big{\{}|{\boldsymbol{b}}(\boldsymbol{X}_{r},r)|^{2}\big{\}}\,{\rm d}r\leq C$ where $C$ is bounded uniformly in $\delta$ . In order to show that this is the case, notice that

[TABLE]

and the claim follows from Lemma E.5. ∎

We have now proved that the sequence $(\rho^{\delta_{n}})_{n\geq 1}$ admits a converging subsequence, where $\delta_{n}\to 0$ as $n\to\infty$ . Fix such a convergent subsequence and, with an abuse of notation, also denote it by $(\rho^{\delta_{n}})_{n\geq 1}$ . Let $\rho^{\infty}\in\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ be its limit.

Recall that $\rho^{\delta_{n}}$ is supported in $\Omega^{\delta_{n}}$ . Hence, $K^{\delta_{n}}\ast\rho^{\delta_{n}}$ is supported in $\Omega$ and $K^{\delta_{n}}\ast\rho^{\delta_{n}}\in\mathscrsfs{P}_{2}(\Omega)$ . We will now show that $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ has the same limit as $(\rho^{\delta_{n}})_{n\geq 1}$ in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ .

Lemma F.2.

The sequence $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ also converges in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ to $\rho^{\infty}$ .

Proof.

By Lemma F.1, the result is implied by the following claim:

[TABLE]

Note that, for bounded $\Omega$ ,

[TABLE]

for any coupling $\gamma$ of the probability distributions of ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ . Hence,

[TABLE]

As an application,

[TABLE]

Thus, it suffices to show that $\sup_{0\leq t\leq T}W_{1}(K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}},\rho_{t}^{\delta_{n}})\to 0$ as $n\to\infty$ .

Note that

[TABLE]

where the random variables $\boldsymbol{K}_{\delta_{n}}$ and $\boldsymbol{X}_{\delta_{n}}$ have distributions $K^{\delta_{n}}$ and $\rho_{t}^{\delta_{n}}$ , respectively. The quantity $\mathbb{E}\{|\boldsymbol{K}_{\delta_{n}}|\}$ is $O(\delta_{n})$ , since $K$ has finite absolute first moment, which completes the proof. ∎
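The coupling argument in this proof can be checked numerically in one dimension. The sketch below draws $X\sim\rho_{t}$ and an independent $Z\sim K$, so that $Y=X+\delta Z$ has law $K^{\delta}\ast\rho_{t}$, and compares the empirical $W_{1}$ distance with the cost of this coupling. The Beta(2,2) law for $\rho_{t}$, the Epanechnikov kernel for $K$, and the values of `delta` and `n` are assumptions made for illustration.

```python
import random

def sample_kernel(rng):
    """Draw Z with density 0.75*(1 - z^2) on [-1, 1] by rejection sampling."""
    while True:
        z = rng.uniform(-1.0, 1.0)
        if rng.random() <= 1.0 - z * z:
            return z

def empirical_w1(xs, ys):
    """W_1 between two equal-size empirical measures on the line (sorted matching)."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rng = random.Random(0)
delta, n = 0.05, 5000
xs = [rng.betavariate(2.0, 2.0) for _ in range(n)]   # hypothetical rho_t on [0, 1]
zs = [sample_kernel(rng) for _ in range(n)]          # kernel noise Z ~ K
ys = [x + delta * z for x, z in zip(xs, zs)]         # Y = X + delta*Z ~ K^delta * rho_t
coupling_cost = delta * sum(abs(z) for z in zs) / n  # E|X - Y| under this coupling
w1 = empirical_w1(xs, ys)
```

Because the sorted matching is optimal on the line, `w1` never exceeds `coupling_cost`, which is itself at most `delta` since $|Z|\leq 1$. This mirrors the bound $W_{1}(K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}},\rho_{t}^{\delta_{n}})\leq{\mathbb{E}}\{|\boldsymbol{K}_{\delta_{n}}|\}$ used above.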

We will now prove a stronger convergence result.

Lemma F.3 (Convergence in $\mathscrsfs{L}^{2}$ ).

The measure $\rho^{\infty}$ has a density, which is the limit in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ of the sequence $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ .

Proof.

By Corollary E.4, we have that, for any $n\geq 1$ , $K^{\delta_{n}}\ast\rho^{\delta_{n}}\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ . Let us show that $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ is a Cauchy sequence in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ .

As $K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}}\in\mathscrsfs{L}^{2}(\Omega)$ for every $t\in[0,T]$ , its Fourier transform exists and we denote it by $\widehat{K^{\delta_{n}}\ast\rho^{\delta_{n}}}$ . Hence, by applying Parseval’s theorem, we have

[TABLE]

Fix $\Lambda>1$ and decompose the integral in the right-hand side of (F.11) as

[TABLE]

Consider the first term of (F.12). By Lemma F.2, and since by Jensen’s inequality $W_{1}(\rho_{1},\rho_{2})\leq W_{2}(\rho_{1},\rho_{2})$ for any two distributions $\rho_{1},\rho_{2}$ , we have $W_{1}(K^{\delta_{n}}*\rho_{t}^{\delta_{n}},K^{\delta_{n^{\prime}}}*\rho_{t}^{\delta_{n^{\prime}}})\to 0$ , as $n,n^{\prime}\to\infty$ . Since the complex exponentials satisfy $\|e^{i\langle{\boldsymbol{\lambda}},{\boldsymbol{x}}\rangle}\|_{{\rm Lip}}\leq|{\boldsymbol{\lambda}}|$ , the definition of the 1-Wasserstein distance implies that the integrand in the first term converges pointwise to $0$ . Furthermore, the integrand is upper bounded by an integrable function, since $|\widehat{K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}}}({\boldsymbol{\lambda}})|\leq\|K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}}\|_{\mathscrsfs{L}^{2}(\Omega)}\leq C$ for all $n$ and every $t\in[0,T]$ . Hence, by dominated convergence, the first integral in (F.12) converges to $0$ .

As for the second term of (F.12), the following chain of inequalities holds:

[TABLE]

where in the last equality we have applied again Parseval’s theorem. By Lemma E.5, the integral in the right-hand side of (F.13) is upper bounded by a constant independent of $n$ . Therefore, as $\Lambda\to\infty$ , the second term of (F.12) converges to $0$ .

As a result, $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ is a Cauchy sequence in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ . Let $\tilde{\rho}^{\infty}\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ be its limit. Furthermore, by Lemma F.2, $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ has limit $\rho^{\infty}$ in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ . Therefore, the measures $\rho^{\infty}_{t}({\rm d}{\boldsymbol{x}}){\rm d}t$ and $\tilde{\rho}^{\infty}_{t}({\boldsymbol{x}}){\rm d}{\boldsymbol{x}}\,{\rm d}t$ coincide. This implies that the measure $\rho^{\infty}_{t}$ has for almost every $t\in[0,T]$ the density $\tilde{\rho}^{\infty}_{t}\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ , and the proof is complete. ∎
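The high-frequency part of the argument rests on a standard Fourier tail estimate. As a sketch, write $u_{t}$ for the difference $K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}}-K^{\delta_{n^{\prime}}}\ast\rho_{t}^{\delta_{n^{\prime}}}$ (this shorthand and the constant $c_{d}$, which depends only on the Fourier normalization, are introduced here for illustration):

```latex
\int_{|\boldsymbol{\lambda}|>\Lambda}|\hat{u}_{t}(\boldsymbol{\lambda})|^{2}\,{\rm d}\boldsymbol{\lambda}
\;\le\;\frac{1}{\Lambda^{2}}\int|\boldsymbol{\lambda}|^{2}\,|\hat{u}_{t}(\boldsymbol{\lambda})|^{2}\,{\rm d}\boldsymbol{\lambda}
\;=\;\frac{c_{d}}{\Lambda^{2}}\,\|\nabla u_{t}\|_{2}^{2}
\;\le\;\frac{2c_{d}}{\Lambda^{2}}\Big(\|\nabla K^{\delta_{n}}\ast\rho_{t}^{\delta_{n}}\|_{2}^{2}
+\|\nabla K^{\delta_{n^{\prime}}}\ast\rho_{t}^{\delta_{n^{\prime}}}\|_{2}^{2}\Big).
```

After integrating over $t\in[0,T]$, a gradient bound that is uniform in $n$ makes the right-hand side $O(\Lambda^{-2})$, which is why the tail vanishes as $\Lambda\to\infty$.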

From now on, with an abuse of notation, we will use $\rho^{\infty}$ to denote also the density which is the limit in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ of the sequence $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ .

Lemma F.4 (Convergence to a weak solution of the limit PDE).

Let $\rho^{\infty}$ be the limit in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ of the converging sequence $(\rho^{\delta_{n}})_{n\geq 1}$ . Then, $\rho^{\infty}$ is a weak solution of the PDE (A.1) with initial and boundary conditions (A.2).

Proof.

By Lemma F.3, we have that $\rho^{\infty}\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ . Choose a test function $h\in\mathscrsfs{C}^{2,1}(\Omega\times[0,T])$ , satisfying $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial\Omega,t\in[0,T]$ . In order to prove the claim, we need to show that (A.3) holds. Throughout the proof, we will let $\lambda_{n}\equiv\lambda_{\delta_{n}}$ .

Recall that, for any $n\geq 1$ , $\rho^{\delta_{n}}$ is a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Hence, by Definition B.1, we have that

[TABLE]

for any $h^{\delta_{n}}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta_{n}}\times[0,T])$ satisfying $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h^{\delta_{n}}({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial\Omega^{\delta_{n}},t\in[0,T]$ . Now, we set

[TABLE]

By definition of $\Omega^{\delta_{n}}$ , we have that $h^{\delta_{n}}\in\mathscrsfs{C}^{2,1}(\Omega^{\delta_{n}}\times[0,T])$ since $h\in\mathscrsfs{C}^{2,1}(\Omega\times[0,T])$ . Furthermore, $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial\Omega$ , $t\in[0,T]$ immediately implies that $\langle{\boldsymbol{n}}({\boldsymbol{x}}),\nabla h^{\delta_{n}}({\boldsymbol{x}},t)\rangle=0$ for all ${\boldsymbol{x}}\in\partial\Omega^{\delta_{n}}$ , $t\in[0,T]$ .

Recall that

[TABLE]

Thus, (F.14) can be rewritten as

[TABLE]

Since $(\rho^{\delta_{n}})_{n\geq 1}$ converges in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ to $\rho^{\infty}$ by Lemma F.1, we have that

[TABLE]

Furthermore, since $\rho_{0}^{\delta_{n}}({\boldsymbol{x}})=\lambda_{n}^{-d}\rho_{\mbox{\tiny\rm init}}({\boldsymbol{x}}/\lambda_{n})$ , we have that

[TABLE]

Let us use the notation $h_{t}({\boldsymbol{x}})=h({\boldsymbol{x}},t)$ and $h_{t}^{\delta_{n}}({\boldsymbol{x}})=h^{\delta_{n}}({\boldsymbol{x}},t)$ . Again, we set $\rho_{t}^{\delta_{n}}({\boldsymbol{x}})=0$ for ${\boldsymbol{x}}\not\in\Omega^{\delta_{n}}$ . We further define $\tilde{\rho}^{\delta_{n}}_{t}({\boldsymbol{x}})=\lambda_{n}^{d}\rho^{\delta_{n}}_{t}(\lambda_{n}{\boldsymbol{x}})$ , which is a probability density on $\Omega$ . Since $\rho^{\delta_{n}}_{t}(\,\cdot)\to\rho^{\infty}_{t}(\,\cdot)$ in $\mathscrsfs{P}_{2}(\Omega)$ and $\lambda_{n}\to 1$ , we have $\tilde{\rho}^{\delta_{n}}_{t}(\,\cdot)\to\rho^{\infty}_{t}(\,\cdot)$ in $\mathscrsfs{P}_{2}(\Omega)$ as well. Hence

[TABLE]

where the last equality follows since $\lambda_{n}\to 1$ , and $\nabla K^{\delta_{n}}\ast f(\lambda_{n}{\boldsymbol{x}})\to\nabla f({\boldsymbol{x}})$ uniformly in $\Omega$ .

Furthermore, we have that

[TABLE]

The second term in the right-hand side of (F.22) is equal to $0$ by integration by parts. The third integral in the right-hand side of (F.22) is upper bounded as follows:

[TABLE]

which converges to $0$ , as $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ converges in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ to $\rho^{\infty}$ . The first term in the right-hand side of (F.22) is upper bounded as follows:

[TABLE]

The first term is upper bounded using

[TABLE]

Notice that

[TABLE]

where $(a)$ follows from an application of Cauchy-Schwarz. By Lemma E.5, we deduce that the right-hand side of (F.26) is bounded uniformly in $\delta_{n}$ . Thus, the first term of (F.24) converges to $0$ because of Eq. (F.25). As concerns the second term of (F.24), we have that

[TABLE]

Recall that $\rho^{\delta_{n}}_{t}$ is supported on $\Omega^{\delta_{n}}\subseteq\Omega$ , and $\Omega$ is bounded. In addition, since the kernel $K$ has bounded support, the diameter of the support of $K^{\delta_{n}}$ is at most $\delta_{n}$ times a constant. Consequently, the last term in the right-hand side of (F.27) is upper bounded by

[TABLE]

By using that $K^{\delta_{n}}\ast\rho^{\delta_{n}}\in\mathscrsfs{L}^{2}(\Omega\times[0,T])$ and the result of Lemma E.5, we have that the last two integrals are bounded uniformly in $\delta_{n}$ . As a result, the right-hand side of (F.28) converges to $0$ , which implies that the right-hand side of (F.22) also converges to $0$ . By putting this fact together with (F.18) and (F.21), the desired result follows. ∎

We have now proved that $(\rho^{\delta_{n}})_{n\geq 1}$ converges to a weak solution of the limit PDE (A.1). In order to prove the uniqueness of the weak solutions of the limit PDE, we next prove a bound on $\|\rho^{\delta_{n}}_{t}\|_{\mathscrsfs{L}^{4}(\Omega)}$ , which along with Lemma A.2 proves the uniqueness claim.

Lemma F.5 (Uniform bound in $\mathscrsfs{L}^{4}$ ).

Assume that $\rho_{\mbox{\tiny\rm init}},f\in\mathscrsfs{C}^{\infty}(\Omega)$ and consider the sequence $(\rho^{\delta_{n}})_{n\geq 1}$ . Then,

[TABLE]

where

[TABLE]

for some bounded constant $0<C(\Omega)<\infty$ .

Proof.

For simplicity, we indicate the norms $\mathscrsfs{L}^{p}(\Omega)$ by $\|\cdot\|_{p}$ . For a function $g\in\mathscrsfs{C}^{m}(\Omega)$ , we let $\nabla^{\otimes m}g$ be the vector with coordinates ${\partial^{m}}g/{(\partial_{i_{1}}\dotsc\partial_{i_{m}})}$ , with $1\leq i_{1},i_{2},\dotsc,i_{m}\leq d$ . The strategy is to first bound $\|\nabla^{\otimes m}\rho^{\delta_{n}}_{t}\|_{2}$ , for some $m\geq d/4$ , and then apply the Gagliardo-Nirenberg interpolation inequality (cf. Lemma H.3) to bound $\|\rho^{\delta_{n}}_{t}\|_{4}$ . Throughout this proof, we will use $C$ , $C_{k}$ and so on to denote constants that can depend on the domain $\Omega$ , but do not depend on $t$ or $\delta$ .

Before proceeding, we need to establish some notations and definitions.

For a function $g$ and an integer $k\geq 0$ , we denote its Sobolev norms by

[TABLE]

We will use the following relations on Sobolev norms (see [Oel01, Equation (1.14)]):

[TABLE]

Instead of bounding $\|\nabla^{\otimes m}\rho^{\delta_{n}}_{t}\|_{2}$ , we will bound the dominating quantity $\|\rho^{\delta_{n}}_{t}\|_{(m)}$ . To this end, we follow a similar strategy as in [Oel01]. Namely, we derive descriptions of the evolution of $\|(-\Delta)^{m}\rho^{\delta_{n}}_{t}\|_{2}$ and $\|(-\Delta)^{m}(\rho^{\delta_{n}}_{t}\ast K^{\delta_{n}}-f)\|_{2}$ . More precisely, we derive a recursive equation (on $m$ ) for the evolution of a suitably chosen linear combination of these two quantities.

Since $\rho^{\delta_{n}}$ is a solution of the PDE (B.2), we have

[TABLE]

Following the same lines as in the derivation of [Oel01, Equation (3.12)], we obtain

[TABLE]

where $C_{m}$ and $\tilde{C}_{m}$ are positive constants that depend on $m$ and $C>0$ is a constant which can be chosen arbitrarily.

We set $m=\lceil 1+d/2\rceil$ for which we can upper bound the right-hand side of (F.34) as

[TABLE]

We now turn to the second quantity. Write

[TABLE]

where the last step follows from (F.33). Note that the first term on the right-hand side can be bounded as

[TABLE]

where the last step follows from Young’s convolution inequality and the fact that $\|K^{\delta_{n}}\|_{1}=1$ .

The second term in (F.36) can be bounded following the same lines as in the derivation of [Oel01, Equations (3.3) and (3.16)], which along with (F.37) gives

[TABLE]

Since $f\in\mathscrsfs{C}^{\infty}(\Omega)$ , there exists a constant $M>0$ , such that $\|(-\Delta)^{m+1}f\|_{2}\leq M$ , $\|\nabla(-\Delta)^{m}f\|_{2}\leq M$ . Using the particular choice of $m$ , we can upper bound the right-hand side of (F.38) as

[TABLE]

Define $C_{1}\equiv 2\|\rho_{\mbox{\tiny\rm init}}\|_{(2m)}$ and let

[TABLE]

for $n\geq 1$ . Clearly, $T_{n}>0$ by choice of $C_{1}$ . In addition, by applying Sobolev’s inequality (see e.g. [Oel01, Equation (1.12)]), we have

[TABLE]

where $C_{2}>0$ is a constant depending on $d$ . We let $C_{\ast}\equiv C_{1}C_{2}/C$ . Recall that the constant $C>0$ in (F.35) and (F.39) was arbitrary. We choose it in a way that $C<\tau/(2C_{m})$ . We then consider the evolution of the following linear combination of the two quantities we analyzed above. Note that by Equations (F.35) and (F.39), we have for $t\in[0,T_{n}]$ ,

[TABLE]

where in $(a)$ we use the fact that $\|(-\Delta)^{m}\rho^{\delta_{n}}_{t}\|_{2}\leq\|\rho^{\delta_{n}}_{t}\|_{(2m)}$ , which follows immediately from (F.31); $(b)$ follows from the fact that for any function $g\in\mathscrsfs{L}^{2}(\Omega)$ , $\|g\ast K^{\delta_{n}}\|_{2}\leq\|K^{\delta_{n}}\|_{1}\|g\|_{2}=\|g\|_{2}$ , by Young’s inequality for convolution.
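Young's inequality for convolution, used in step $(b)$, has a discrete analogue that is easy to check numerically: $\|g\ast K\|_{\ell^{2}}\leq\|K\|_{\ell^{1}}\|g\|_{\ell^{2}}$ for finite sequences. The sketch below verifies it for a random signal and a normalized bump; the signal, the kernel shape, and the sizes are arbitrary choices made for illustration.

```python
import math
import random

def convolve(g, k):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(g) + len(k) - 1)
    for i, gi in enumerate(g):
        for j, kj in enumerate(k):
            out[i + j] += gi * kj
    return out

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def l1_norm(v):
    return sum(abs(x) for x in v)

rng = random.Random(0)
g = [rng.gauss(0.0, 1.0) for _ in range(200)]                    # arbitrary signal
raw = [max(0.0, 1.0 - ((j - 10) / 10.0) ** 2) for j in range(21)]
total = sum(raw)
kernel = [r / total for r in raw]                                # nonnegative, l1 norm 1

lhs = l2_norm(convolve(g, kernel))
rhs = l1_norm(kernel) * l2_norm(g)
```

Since the kernel is normalized to have unit $\ell^{1}$ norm, the inequality reduces to `lhs <= l2_norm(g)`, the discrete counterpart of $\|g\ast K^{\delta_{n}}\|_{2}\leq\|g\|_{2}$.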

Another observation that will be used later is that

[TABLE]

This claim follows by repeating the same argument we had to derive (F.40), for $m=0$ . In this case, we have analogous equations to (F.35) and (F.39), where only the first two terms appear.

Next note that by (F.32), we have for $t\in[0,T_{n}]$ ,

[TABLE]

where the last step is a result of (F.41) and (F.40). Let us stress that $\bar{C}_{m}$ , $C_{\ast}$ , $C_{3}$ are constants that are independent of $n$ .

We further note that

[TABLE]

for $n\geq 1$ . Here, the first step follows from the triangle inequality and Young’s inequality for convolution, along with the fact that $\|K^{\delta_{n}}\|_{1}=1$ . The second step follows from the definition of $C_{1}$ . Since $f\in\mathscrsfs{C}^{\infty}(\Omega)$ , $\|f\|_{(2m)}^{2}$ is uniformly bounded over $\Omega$ . We denote the right-hand side of (F.43) by the constant $C_{4}$ . Substituting the bound (F.43) into (F.42) results in

[TABLE]

for $t\in[0,T_{n}]$ . By employing a generalization of Gronwall’s inequality (cf. Lemma H.2 and Remark H.1) we get

[TABLE]

Therefore, for $t\in[0,T_{0}]$ , with

[TABLE]

we have that

[TABLE]

with $C_{5}\equiv\bar{C}_{m}C_{4}$ and $C_{6}\equiv\bar{C}_{m}\tau M^{2}C_{\ast}$ . Note that $C_{5}$ , $C_{6}$ and $T_{0}$ are independent of $n$ , but depend on $d$ . Let $m_{0}=\lceil d/4\rceil$ . Then, by the choice of $m=1+\lceil d/2\rceil$ we have $\|\nabla^{\otimes{m_{0}}}\rho^{\delta_{n}}_{t}\|_{2}\leq\|\rho^{\delta_{n}}_{t}\|_{(2m)}$ , and hence as a result of (F.47), we obtain

[TABLE]

Finally, by applying Gagliardo-Nirenberg interpolation inequality (cf. Lemma H.3) we get

[TABLE]

for some constants $C_{7},C_{8}>0$ , which completes the proof. ∎

Lemma F.6 (Convergence to the unique weak solution of limit PDE).

Let $\rho^{\infty}$ be the limit in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ of the converging sequence $(\rho^{\delta_{n}})_{n\geq 1}$ . Then, $\rho^{\infty}$ is the unique weak solution of the PDE (A.1) in $\mathscrsfs{L}^{4}(\Omega\times[0,T])$ with initial and boundary conditions (A.2).

Proof.

From Lemma F.3, we have that the sequence $(K^{\delta_{n}}\ast\rho^{\delta_{n}})_{n\geq 1}$ converges in $\mathscrsfs{L}^{2}(\Omega\times[0,T])$ to $\rho^{\infty}$ . Furthermore, by Lemma F.5, $\|\rho^{\delta}_{t}\|_{\mathscrsfs{L}^{4}(\Omega)}\leq C(1+T)$ for any $t\in[0,T_{0}]$ , where $C$ is a universal constant. By using Young’s convolution inequality, we also deduce that $\|K^{\delta_{n}}\ast\rho^{\delta}_{t}\|_{\mathscrsfs{L}^{4}(\Omega)}\leq C(1+T)$ for any $t\in[0,T_{0}]$ .

Note that $\mathscrsfs{L}^{4}(\Omega)$ is a reflexive Banach space. Thus, by applying the Banach-Alaoglu theorem, every bounded sequence in $\mathscrsfs{L}^{4}(\Omega)$ has a weakly convergent subsequence. This means that there exist a subsequence $K^{\delta_{n_{k}}}\ast\rho^{\delta_{n_{k}}}$ and a function $\tilde{\rho}\in\mathscrsfs{L}^{4}(\Omega)$ such that, for any $g\in\mathscrsfs{L}^{4/3}(\Omega)$ , we have

[TABLE]

Now, since $\Omega$ is bounded, $K^{\delta_{n_{k}}}\ast\rho^{\delta_{n_{k}}}$ and $\tilde{\rho}$ are also in $\mathscrsfs{L}^{2}(\Omega)$ (as they are in $\mathscrsfs{L}^{4}(\Omega)$ ). Thus, $K^{\delta_{n_{k}}}\ast\rho^{\delta_{n_{k}}}-\tilde{\rho}$ is in $\mathscrsfs{L}^{2}(\Omega)$ , hence it is also in $\mathscrsfs{L}^{4/3}(\Omega)$ . As a result, we can pick $g=K^{\delta_{n_{k}}}\ast\rho^{\delta_{n_{k}}}-\tilde{\rho}$ and obtain

[TABLE]

Therefore, $\tilde{\rho}$ is the limit in $\mathscrsfs{L}^{2}(\Omega)$ of the sequence $K^{\delta_{n_{k}}}\ast\rho^{\delta_{n_{k}}}$ . By uniqueness of the limit, we conclude that $\tilde{\rho}=\rho^{\infty}$ . As a result, $\rho^{\infty}\in\mathscrsfs{L}^{4}(\Omega)$ for any $t\in[0,T_{0}]$ , which implies that $\rho^{\infty}\in\mathscrsfs{L}^{4}(\Omega\times[0,T_{0}])$ . Thus, by Lemma F.4 and Lemma A.2, $\rho^{\infty}$ is the unique weak solution of the PDE (A.1) for $t\in[0,T_{0}]$ . Note that $T_{0}$ is decreasing with $T$ . Thus, we can repeat the same argument with $T-T_{0}$ instead of $T$ and obtain that $\rho^{\infty}$ is the unique weak solution of the PDE (A.1) for $t\in[T_{0},2\,T_{0}]$ . By iterating this procedure $T/T_{0}$ times, the result follows. ∎

At this point, we state and prove a lemma showing that the sequence $(\rho_{t}^{\delta_{n}})_{n\geq 1}$ converges in $\mathscrsfs{L}^{2}(\Omega)$ to $\rho^{\infty}_{t}$ .

Lemma F.7.

For almost all $t\in[0,T]$ , the measure $\rho^{\infty}_{t}$ is the limit in $\mathscrsfs{L}^{2}(\Omega)$ of the sequence $(\rho_{t}^{\delta_{n}})_{n\geq 1}$ .

Proof.

The proof is similar to that of Lemma F.3. Suppose that $t\in[0,T_{0}]$ , where $T_{0}$ is defined in the statement of Lemma F.5. Note that, for any $n\geq 1$ , $\rho^{\delta_{n}}\in\mathscrsfs{L}^{2}(\Omega)$ . Let us show that $(\rho^{\delta_{n}})_{n\geq 1}$ is a Cauchy sequence in $\mathscrsfs{L}^{2}(\Omega)$ .

As $\rho_{t}^{\delta_{n}}\in\mathscrsfs{L}^{2}(\Omega)$ for every $t\in[0,T_{0}]$ , its Fourier transform exists and we denote it by $\widehat{\rho^{\delta_{n}}}$ . Hence, by applying Parseval’s theorem, we have

[TABLE]

Fix $\Lambda>1$ and decompose the integral in the right-hand side of (F.51) as

[TABLE]

Consider the first term of (F.52). By Lemma F.1, and since by Jensen’s inequality $W_{1}(\rho_{1},\rho_{2})\leq W_{2}(\rho_{1},\rho_{2})$ for any two distributions $\rho_{1},\rho_{2}$ , we have $W_{1}(\rho_{t}^{\delta_{n}},\rho_{t}^{\delta_{n^{\prime}}})\to 0$ , as $n,n^{\prime}\to\infty$ . Since for the complex exponential functions $\|e^{i\langle{\boldsymbol{\lambda}},{\boldsymbol{x}}\rangle}\|_{{\rm Lip}}\leq|{\boldsymbol{\lambda}}|$ , by definition of 1-Wasserstein distance, the integrand in the first term converges pointwise to $0$ . Furthermore, the integrand is upper bounded by an integrable function, since $|\widehat{\rho_{t}^{\delta_{n}}}({\boldsymbol{\lambda}})|\leq\|\rho_{t}^{\delta_{n}}\|_{\mathscrsfs{L}^{2}(\Omega)}\leq C$ for all $n$ and every $t\in[0,T_{0}]$ . Hence, by dominated convergence, the first integral in (F.52) converges to $0$ .
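The Lipschitz argument for the low-frequency term can be illustrated in one dimension: for two measures $\rho_{1},\rho_{2}$ , the Fourier transforms differ at frequency $\lambda$ by at most $|\lambda|\,W_{1}(\rho_{1},\rho_{2})$ . The following is a minimal sketch with two hypothetical empirical measures (the point clouds are illustrative assumptions, not data from the paper); in one dimension $W_{1}$ between equal-size empirical measures is the mean distance between sorted atoms.

```python
import cmath

def w1_empirical(xs, ys):
    """W_1 between two empirical measures on R with the same number of atoms."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def char_fn(xs, lam):
    """Fourier transform of the empirical measure: mean of exp(i * lam * x)."""
    return sum(cmath.exp(1j * lam * x) for x in xs) / len(xs)

def clamp01(v):
    return max(0.0, min(1.0, v))

# Two hypothetical point clouds on [0, 1]; the second is a small perturbation of the first.
xs = [i / 20 for i in range(20)]
ys = [clamp01(x + 0.03 * ((i % 3) - 1) + 0.01) for i, x in enumerate(xs)]

w1 = w1_empirical(xs, ys)
# Gap between Fourier transforms at several frequencies; each is at most lam * w1.
gaps = {lam: abs(char_fn(xs, lam) - char_fn(ys, lam)) for lam in (0.5, 1.0, 5.0, 20.0)}
```

For a fixed frequency the gap is controlled by `w1`, which is why convergence in $W_{1}$ gives pointwise convergence of the integrand on the bounded frequency range.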

As for the second term of (F.52), the following chain of inequalities holds:

[TABLE]

where in the last equality we have applied again Parseval’s theorem. In the proof of Lemma F.5, we provide an upper bound, which does not depend on $n$ , on the Sobolev norm of $\rho_{t}^{\delta_{n}}$ (see (F.47)). Thus, as $\Lambda\to\infty$ , the second term of (F.52) converges to $0$ .
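For orientation, the high-frequency estimate is of the following standard form. This is a sketch only: the generic Sobolev index $s>0$ and the shorthand $g$ are notational assumptions, and the exact chain of inequalities is the one displayed in (F.53).

```latex
\int_{|\boldsymbol{\lambda}|>\Lambda} |\hat{g}(\boldsymbol{\lambda})|^{2}\,{\rm d}\boldsymbol{\lambda}
\;\leq\; \Lambda^{-2s}\int_{|\boldsymbol{\lambda}|>\Lambda} |\boldsymbol{\lambda}|^{2s}\,|\hat{g}(\boldsymbol{\lambda})|^{2}\,{\rm d}\boldsymbol{\lambda}
\;\leq\; \Lambda^{-2s}\int_{\mathbb{R}^{d}} |\boldsymbol{\lambda}|^{2s}\,|\hat{g}(\boldsymbol{\lambda})|^{2}\,{\rm d}\boldsymbol{\lambda},
\qquad g=\rho_{t}^{\delta_{n}}-\rho_{t}^{\delta_{n^{\prime}}} .
```

The last integral is a Sobolev-type seminorm of $g$ , so a bound on it that is uniform in $n,n^{\prime}$ makes the tail vanish at rate $\Lambda^{-2s}$ .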

By iterating the argument $T/T_{0}$ times, we obtain that $(\rho_{t}^{\delta_{n}})_{n\geq 1}$ is a Cauchy sequence in $\mathscrsfs{L}^{2}(\Omega)$ for $t\in[0,T]$ . Let $\tilde{\rho}_{t}^{\infty}\in\mathscrsfs{L}^{2}(\Omega)$ be its limit. Furthermore, by Lemma F.1, $(\rho^{\delta_{n}})_{n\geq 1}$ has limit $\rho^{\infty}$ in $\mathscrsfs{C}([0,T],\mathscrsfs{P}_{2}(\Omega))$ . Therefore, the measure $\rho^{\infty}_{t}$ has for almost every $t\in[0,T]$ the density $\tilde{\rho}^{\infty}_{t}\in\mathscrsfs{L}^{2}(\Omega)$ , and the proof is complete. ∎

Theorem 5.2 follows from Lemma A.2, Lemma F.6 and Lemma F.7.

Let us define the free energy associated to the PDE (A.1) as

[TABLE]

As explained in Section 3.5, this limit free energy is displacement convex, and hence its $W_{2}$ gradient flow converges to the unique minimizer of (F.54). These facts are stated and proved formally in the theorem that follows.

Theorem F.8.

Assume that the initial condition $\rho^{\infty}(0)\in\mathscrsfs{C}^{\infty}(\Omega)$ . Then, the following results hold:

1. There exists a unique minimizer in $\mathscrsfs{P}_{2}(\Omega)$ , call it $\rho^{*}$ , of the free energy $F$ defined in (F.54).

2. For any $t\geq 0$ , we have

[TABLE]

where $\alpha$ is defined in (3.1).

3. For any $n\geq 1$ and for almost any $t\geq 0$ , we have

[TABLE]

where $\alpha$ is defined in (3.1) and $\Delta(\delta,T,d)\to 0$ as $\delta\to 0$ .

Proof.

The proof follows from the results of [CJM+01]. The technical assumptions required by [CJM+01] are satisfied by the PDE (A.1), since $\Omega$ is convex and bounded, the initial condition $\rho^{\infty}(0)\in\mathscrsfs{L}^{\infty}(\Omega)$ , and $f$ satisfies the assumptions (A2) and (A3). Note also that the condition $\inf_{\Omega}V=0$ coming from assumption (HV3) of [CJM+01] can be relaxed. In fact, adding a constant to $V$ does not change the entropy functional in [CJM+01, Eq. (3)] (which corresponds to the free energy (F.54)) and the PDE in [CJM+01, Eq. (46)] (which corresponds to the PDE (A.1)).

The uniqueness of the minimizer $\rho^{*}$ follows from [CJM+01, Lemma 6], which proves the first result. Since $\rho^{\infty}$ is the unique weak solution of the PDE (A.1) with initial and boundary conditions (A.2), it coincides with the unique, non-negative mass-preserving solution of [CJM+01, Theorem 16]. Thus, the inequality (F.55) readily follows from [CJM+01, Theorem 16].

It remains to prove inequality (F.56). By definition of free energy, we obtain

[TABLE]

Recall that, by Lemma F.7, $\rho^{\delta}(t)$ converges to $\rho^{\infty}(t)$ in $\mathscrsfs{L}^{2}(\Omega)$ . Consequently, by using the triangle inequality, we have that the term $R(\rho^{\delta}(t))-R(\rho^{\infty}(t))$ tends to $0$ as $\delta\to 0$ .

In order to complete the proof, it remains to show that $S(\rho^{\delta}(t))-S(\rho^{\infty}(t))$ tends to $0$ as $\delta\to 0$ . To do so, define

[TABLE]

Note that $A\cup B\cup C=\Omega$ . In fact, suppose that ${\boldsymbol{x}}\not\in B$ and ${\boldsymbol{x}}\not\in C$ . Then, one of $\rho^{\delta}({\boldsymbol{x}},t)$ and $\rho^{\infty}({\boldsymbol{x}},t)$ lies in $[0,1/4]$ and the other is $>1/2$ . Consequently, $|\rho^{\delta}({\boldsymbol{x}},t)-\rho^{\infty}({\boldsymbol{x}},t)|>1/4$ and ${\boldsymbol{x}}\in A$ . This immediately implies that

[TABLE]
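The covering claim $A\cup B\cup C=\Omega$ is a pointwise statement about pairs of density values, so it can be checked exhaustively on a grid. The set definitions below are inferred from the surrounding text and should be read as assumptions: $A$ is where the two densities differ by more than $1/4$ , $B$ is where both are at most $1/2$ , and $C$ is where both exceed $1/4$ .

```python
def in_A(a, b):
    return abs(a - b) > 0.25          # densities differ by more than 1/4 (assumed definition)

def in_B(a, b):
    return a <= 0.5 and b <= 0.5      # both densities at most 1/2

def in_C(a, b):
    return a > 0.25 and b > 0.25      # both densities above 1/4

# Grid of non-negative value pairs (a, b) standing in for the two densities at a point.
# Multiples of 1/64 are exact in floating point, so the thresholds are tested exactly.
grid = [i / 64 for i in range(0, 193)]      # values in [0, 3]
uncovered = [(a, b) for a in grid for b in grid
             if not (in_A(a, b) or in_B(a, b) or in_C(a, b))]
```

An empty `uncovered` list reflects the argument in the text: a pair outside $B$ has a value above $1/2$ , a pair outside $C$ has a value at most $1/4$ , and the two must then differ by more than $1/4$ .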

We will now upper bound the three integrals in the RHS of (F.59). As for the first term, note that

[TABLE]

where $|A|$ denotes the volume of $A$ . Furthermore,

[TABLE]

Note that $|t\log t|\leq 1$ for $t\in[0,1]$ and $|\log t|\leq t$ for $t\geq 1$ . Thus, the RHS of (F.61) is upper bounded by

[TABLE]

By Lemma F.7, for almost all $t\in[0,T]$ , $\rho^{\delta}(t)$ converges to $\rho^{\infty}(t)$ in $\mathscrsfs{L}^{2}(\Omega)$ . Thus, by (F.60), $|A|$ tends to $0$ as $\delta\to 0$ . By Lemma F.6, $\rho^{\infty}(t)\in\mathscrsfs{L}^{4}(\Omega)$ for almost all $t\in[0,T]$ . Furthermore, by Lemma F.5, the quantity $\|\rho^{\delta}(t)\|_{\mathscrsfs{L}^{4}(\Omega)}$ has a $\delta$ -free upper bound for $t\in[0,T_{0}]$ . As a result, for almost all $t\in[0,T_{0}]$ , the first integral in (F.59) tends to $0$ as $\delta\to 0$ . By iterating this argument $T/T_{0}$ times, we conclude that for almost all $t\in[0,T]$ , the first integral in (F.59) tends to $0$ as $\delta\to 0$ .

In order to bound the second integral in (F.59), we write

[TABLE]

where in the last inequality we have applied [CT06, Theorem 17.3.3], since $\rho^{\delta}({\boldsymbol{x}},t)$ , $\rho^{\infty}({\boldsymbol{x}},t)\in[0,1/2]$ by definition of $B$ . Note that

[TABLE]

Thus, the RHS of (F.63) is upper bounded by

[TABLE]

where in the last step we have used the Cauchy-Schwarz inequality. By Lemma F.7, for almost all $t\in[0,T]$ , $\rho^{\delta}(t)$ converges to $\rho^{\infty}(t)$ in $\mathscrsfs{L}^{2}(\Omega)$ . As a result, the second integral in (F.59) also tends to $0$ as $\delta\to 0$ .

Finally, let us bound the third integral in (F.59). Define $h(x)=x\log x$ . Then, for $x>1/4$ ,

[TABLE]

Thus,

[TABLE]

where in the last step we have used the Cauchy-Schwarz inequality. By Lemma F.7, for almost all $t\in[0,T]$ , $\rho^{\delta}(t)$ converges to $\rho^{\infty}(t)$ in $\mathscrsfs{L}^{2}(\Omega)$ . By Lemma F.6, $\rho^{\infty}(t)\in\mathscrsfs{L}^{2}(\Omega)$ for almost all $t\in[0,T]$ . Furthermore, by Lemma F.5, the quantity $\|\rho^{\delta}(t)\|_{\mathscrsfs{L}^{2}(\Omega)}$ has a $\delta$ -free upper bound for $t\in[0,T_{0}]$ . As a result, for almost all $t\in[0,T_{0}]$ , the third integral in (F.59) tends to $0$ as $\delta\to 0$ . By iterating this argument $T/T_{0}$ times, we conclude that for almost all $t\in[0,T]$ , the third integral in (F.59) tends to $0$ as $\delta\to 0$ , and the proof is complete. ∎

At this point, we are ready to provide the proof of Theorem 5.3.

Proof of Theorem 5.3.

By substituting $z$ with $z^{1/2p}$ in Theorem 5.1, we have that with probability at least $1-1/z$

[TABLE]

where ${\sf err}(N,d,{\varepsilon},\delta)$ is defined in (5.2). The risk $R^{\delta}(\rho^{\delta}_{k{\varepsilon}})$ can be upper bounded as

[TABLE]

where $\Delta_{0}(\delta,T,d)\to 0$ as $\delta\to 0$ , since both $K^{\delta}\ast\rho^{\delta}_{t}$ and $\rho^{\delta}_{t}$ converge in $\mathscrsfs{L}^{2}(\Omega)$ to $\rho^{\infty}_{t}$ . Furthermore, by Theorem F.8,

[TABLE]

where $\Delta(\delta,T,d)\to 0$ as $\delta\to 0$ and we recall that $|\Omega|$ denotes the volume of the set $\Omega$ .

Note that

[TABLE]

since $\rho^{*}$ is the minimizer of $F$ . By combining (F.70) with (F.69), we deduce that

[TABLE]

where in the last step we use again the result of Theorem 5.1 and the fact that $R(\rho^{\infty}(0))-R^{\delta}(\rho^{\infty}(0))$ tends to $0$ as $\delta\to 0$ .

By optimizing over $p$ in (F.67), we will set $\Delta_{1}(N,{\varepsilon},T,d,z)$ as in (5.8). We also let $\Delta_{2}(\delta,T,d)=\Delta_{0}(\delta,T,d)+\Delta(\delta,T,d)$ . Then, the result follows by combining (F.67), (F.68) and (F.71). ∎

Appendix G Heat kernel in bounded domains with Neumann boundary

Given the domain $D\subseteq{\mathbb{R}}^{d}$ (compact, with $\mathscrsfs{C}^{2}$ boundary $\partial D$ ), we denote by $G^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)$ the associated heat kernel, with Neumann boundary conditions. We collect here a few well known facts about this kernel (see, e.g., [Tay13, Section 6.1]).

The heat kernel can be defined as a function $G^{D}:D\times D\times{\mathbb{R}}_{>0}\to{\mathbb{R}}$ satisfying

[TABLE]

We will also denote by $G({\boldsymbol{x}},{\boldsymbol{y}};t)$ the heat kernel on ${\mathbb{R}}^{d}$ , namely

[TABLE]

The probabilistic interpretation of $G^{D}$ is as follows (see, e.g., [BGL13]). Let ${\mathbb{E}}_{{\boldsymbol{x}}}$ denote expectation with respect to a Brownian motion $\boldsymbol{X}_{t}$ , with initial condition $\boldsymbol{X}_{0}={\boldsymbol{x}}$ , and reflected at $\partial D$ (see Section C for definitions of this process, following [Tan79]). Then, for any bounded continuous function $\varphi:D\to{\mathbb{R}}$ ,

[TABLE]
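The probabilistic representation can be illustrated by Monte Carlo on the interval $D=[0,1]$ , where reflected Brownian motion at a fixed time can be sampled exactly by folding a free Brownian motion into $[0,1]$ . The sketch below is an illustration under these assumptions (one-dimensional domain, fixed seed, hypothetical values of $x$ and $t$ ). It uses the test function $\varphi(y)=\cos(\pi y)$ , which has zero derivative at both endpoints and satisfies $-\varphi^{\prime\prime}=\pi^{2}\varphi$ , so that the expectation equals $e^{-\pi^{2}t/2}\cos(\pi x)$ .

```python
import math
import random

def reflect_into_unit_interval(y):
    """Fold a real number into [0, 1] by reflecting at 0 and 1."""
    z = y % 2.0
    return 2.0 - z if z > 1.0 else z

def sample_reflected_bm(x, t, rng):
    """Sample at time t of Brownian motion on [0, 1], started at x, reflected at 0 and 1."""
    return reflect_into_unit_interval(x + math.sqrt(t) * rng.gauss(0.0, 1.0))

def phi(y):
    # Test function with zero derivative at y = 0 and y = 1.
    return math.cos(math.pi * y)

rng = random.Random(0)                      # fixed seed for reproducibility
x0, t, n_samples = 0.2, 0.1, 40000          # hypothetical starting point and time
monte_carlo = sum(phi(sample_reflected_bm(x0, t, rng)) for _ in range(n_samples)) / n_samples
exact = math.exp(-math.pi ** 2 * t / 2.0) * phi(x0)
```

The factor $1/2$ in the exponent of `exact` reflects the generator $\Delta/2$ of standard Brownian motion.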

Finally, $G^{D}$ can be viewed as the kernel representation of the bounded operator $e^{t\Delta/2}$ in $\mathscrsfs{L}^{2}(D,{\sf Unif})$ . We have

[TABLE]

Hence $G^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)$ can be represented in terms of the eigenfunctions $\phi_{k}$ , and eigenvalues $\lambda_{k}$ , of $-\Delta$ ,

[TABLE]

Here $0=\lambda_{0}<\lambda_{1}\leq\lambda_{2}\leq\dots$ , with $\lim_{k\to\infty}\lambda_{k}=\infty$ , and $\phi_{0}({\boldsymbol{x}})={\boldsymbol{1}}_{D}({\boldsymbol{x}})/{\rm Vol}(D)^{1/2}$ .
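As a concrete instance of the eigenfunction expansion, take $D=[0,1]$ , where the Neumann eigenpairs of $-\Delta$ are $\lambda_{k}=(k\pi)^{2}$ , $\phi_{0}=1$ and $\phi_{k}(x)=\sqrt{2}\cos(k\pi x)$ for $k\geq 1$ . The sketch below assumes the expansion takes the form $\sum_{k}e^{-\lambda_{k}t/2}\phi_{k}({\boldsymbol{x}})\phi_{k}({\boldsymbol{y}})$ , consistent with the operator $e^{t\Delta/2}$ ; the truncation level and quadrature rule are illustrative choices.

```python
import math

def neumann_kernel(x, y, t, n_modes=200):
    """Truncated eigen-expansion of the Neumann heat kernel of exp(t * Laplacian / 2) on [0, 1]."""
    total = 1.0  # k = 0 term: phi_0 = 1, lambda_0 = 0
    for k in range(1, n_modes + 1):
        lam = (k * math.pi) ** 2
        # phi_k(x) * phi_k(y) = 2 cos(k pi x) cos(k pi y)
        total += math.exp(-lam * t / 2.0) * 2.0 * math.cos(k * math.pi * x) * math.cos(k * math.pi * y)
    return total

def integrate_in_y(x, t, n_cells=400):
    """Midpoint-rule approximation of the integral of G(x, y; t) over y in [0, 1]."""
    h = 1.0 / n_cells
    return sum(neumann_kernel(x, (i + 0.5) * h, t) for i in range(n_cells)) * h

t = 0.05
sym_gap = abs(neumann_kernel(0.3, 0.7, t) - neumann_kernel(0.7, 0.3, t))  # symmetry in (x, y)
mass = integrate_in_y(0.3, t)                                             # total mass in y
```

The unit mass corresponds to taking $\varphi\equiv 1$ in the probabilistic representation: only the constant mode $\phi_{0}$ contributes to the integral.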

Remark G.1.

Since $\Delta$ is self-adjoint in $\mathscrsfs{L}^{2}(D,{\sf Unif})$ , it follows that $G^{D}$ is symmetric, namely $G^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)=G^{D}({\boldsymbol{y}},{\boldsymbol{x}};t)$ , and therefore it satisfies

[TABLE]

Theorem G.1.

The Neumann heat kernel satisfies the following properties:

1. We have that

[TABLE]

where $G_{R}^{D}\in\mathscrsfs{C}^{\infty}(D\times D\times{\mathbb{R}}_{\geq 0})$ .

2. For any $t>0$ , $G^{D}(\;\cdot\;,\;\cdot\;;t)\in\mathscrsfs{C}^{\infty}(D\times D)$ .

3. We have that, for a constant $C(D)$ ,

[TABLE]

Proof.

Substituting $G^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)=G({\boldsymbol{x}},{\boldsymbol{y}};t)+G_{R}^{D}({\boldsymbol{x}},{\boldsymbol{y}};t)$ into Eqs. (G.1) to (G.3) yields, for ${\boldsymbol{x}}\in D$ ,

[TABLE]

Thus $G^{D}_{R}$ satisfies the heat equation in $D\times[0,T]$ and hence $({\boldsymbol{y}},t)\mapsto G^{D}_{R}({\boldsymbol{x}},{\boldsymbol{y}};t)$ is $\mathscrsfs{C}^{\infty}$ inside this domain (see, e.g., [Eva09, Chapter 2, Theorem 8], which refers to Dirichlet boundary condition, but applies equally well to the Neumann case). By symmetry, we have the claimed continuity in $({\boldsymbol{x}},{\boldsymbol{y}})$ , thus proving point 1.

Claim 2 follows by the same decomposition.

Finally, claim 3 follows from Lemma 3.1 in [WY13]. ∎

Appendix H Some useful technical lemmas

Lemma H.1 (Displacement convexity of quadratic functionals).

Let $U:{\mathbb{R}}^{d}\to{\mathbb{R}}$ be twice differentiable with $|U({\boldsymbol{x}})|\leq C(1+|{\boldsymbol{x}}|^{2})$ , $U({\boldsymbol{x}})=U(-{\boldsymbol{x}})$ , and define $\mathscrsfs{U}:\mathscrsfs{P}_{2}({\mathbb{R}}^{d})\to{\mathbb{R}}$ by $\mathscrsfs{U}(\rho)\equiv\int U({\boldsymbol{x}}-{\boldsymbol{x}}^{\prime})\,\rho({\rm d}{\boldsymbol{x}})\,\rho({\rm d}{\boldsymbol{x}}^{\prime})$ . Then $\mathscrsfs{U}$ is displacement convex if and only if $U$ is convex.

Proof.

Proposition 7.4 in [San15] proves that convexity of $U$ implies displacement convexity of $\mathscrsfs{U}$ . To prove the converse implication, let ${\boldsymbol{x}},{\boldsymbol{\delta}}\in{\mathbb{R}}^{d}$ , ${\boldsymbol{x}}\neq{\boldsymbol{0}}$ and consider the two probability distributions $\rho_{0}=(\delta_{{\boldsymbol{0}}}+\delta_{{\boldsymbol{x}}})/2$ and $\rho_{1}=(\delta_{{\boldsymbol{0}}}+\delta_{{\boldsymbol{x}}+{\boldsymbol{\delta}}})/2$ . For $|{\boldsymbol{\delta}}|<|{\boldsymbol{x}}|$ , the geodesic path connecting these distributions is $\rho_{t}=(\delta_{{\boldsymbol{0}}}+\delta_{{\boldsymbol{x}}+t{\boldsymbol{\delta}}})/2$ , $t\in[0,1]$ . Substituting in the definition of $\mathscrsfs{U}$ , we get

[TABLE]

Hence, displacement convexity implies $\langle{\boldsymbol{\delta}},\nabla^{2}U({\boldsymbol{x}}){\boldsymbol{\delta}}\rangle\geq 0$ . Since this holds for all $|{\boldsymbol{\delta}}|<|{\boldsymbol{x}}|$ , we obtain $\nabla^{2}U({\boldsymbol{x}})\succeq{\boldsymbol{0}}$ for all ${\boldsymbol{x}}\neq{\boldsymbol{0}}$ , which in turn implies that $U$ is convex (by a continuity argument, it is sufficient to lower bound the Hessian everywhere except at a point). ∎
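The two-atom construction in the proof is easy to evaluate numerically. The sketch below works in dimension one (an assumption made for illustration) and compares the interaction energy at the midpoint of the path $\rho_{t}$ with the average of its endpoint values, for an even convex $U$ and for an even non-convex $U$ ; both choices of $U$ are hypothetical examples satisfying the growth condition of the lemma.

```python
import math

def interaction_energy(atoms, U):
    """Sum over i, j of w_i * w_j * U(x_i - x_j) for rho = sum_i w_i * delta_{x_i} on the real line."""
    return sum(wi * wj * U(xi - xj) for xi, wi in atoms for xj, wj in atoms)

def geodesic(x, delta, t):
    """rho_t = (delta_0 + delta_{x + t * delta}) / 2, the path used in the proof (needs |delta| < |x|)."""
    return [(0.0, 0.5), (x + t * delta, 0.5)]

def midpoint_gap(U, x, delta):
    """Average of the endpoint energies minus the midpoint energy; non-negative here when U is convex."""
    e0 = interaction_energy(geodesic(x, delta, 0.0), U)
    e1 = interaction_energy(geodesic(x, delta, 1.0), U)
    em = interaction_energy(geodesic(x, delta, 0.5), U)
    return 0.5 * (e0 + e1) - em

def convex_U(z):
    return z * z            # even and convex

def nonconvex_U(z):
    return math.cos(z)      # even, bounded, concave near z = 0

gap_convex = midpoint_gap(convex_U, x=0.5, delta=0.2)
gap_nonconvex = midpoint_gap(nonconvex_U, x=0.3, delta=0.2)
```

A negative `gap_nonconvex` exhibits a failure of convexity along this path, which is the mechanism behind the converse implication.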

Lemma H.2 (A Gronwall type inequality [Bih56]).

Let $u:[0,T]\to{\mathbb{R}}_{+}$ be a continuous function that satisfies the inequality

[TABLE]

where $A\geq 0$ , $\Psi:[0,T]\to{\mathbb{R}}_{+}$ is continuous and $\omega:{\mathbb{R}}_{+}\to{\mathbb{R}}_{+}$ is continuous and monotone-increasing. Then, the following holds

[TABLE]

with $\Phi:{\mathbb{R}}\to{\mathbb{R}}$ given by

[TABLE]

Remark H.1.

To derive Equation (F.45), we use Lemma H.2 with $\omega(u)=u^{2}$ , $\Psi(s)=\bar{C}_{m}C_{3}$ , $A=\bar{C}_{m}C_{4}+\bar{C}_{m}\tau M^{2}C_{\ast}T_{n}$ .
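For the choice $\omega(u)=u^{2}$ with constant $\Psi$ , the bound of Lemma H.2 has a closed form. Assuming the standard Bihari form $u(t)\leq\Phi^{-1}\big(\Phi(A)+\int_{0}^{t}\Psi(s)\,{\rm d}s\big)$ with $\Phi$ an antiderivative of $1/\omega$ , one can take $\Phi(u)=-1/u$ and obtain $u(t)\leq A/(1-A\Psi t)$ , valid for $t<1/(A\Psi)$ . The sketch below evaluates this bound and compares it with an explicit Euler solution of the extremal equation $u^{\prime}=\Psi u^{2}$ ; the numerical values are hypothetical.

```python
def bihari_bound(A, psi, t):
    """Closed-form bound A / (1 - A * psi * t) for omega(u) = u^2 and constant Psi = psi.

    The bound is finite only for t < 1 / (A * psi).
    """
    assert A > 0 and psi > 0 and t >= 0
    if A * psi * t >= 1.0:
        raise ValueError("bound blows up: need t < 1 / (A * psi)")
    return A / (1.0 - A * psi * t)

def euler_solution(A, psi, t, n_steps=10000):
    """Explicit Euler for u' = psi * u^2, u(0) = A, the extremal case of the integral inequality."""
    u, h = A, t / n_steps
    for _ in range(n_steps):
        u += h * psi * u * u
    return u

A, psi = 2.0, 0.5          # hypothetical values; blow-up time is 1 / (A * psi) = 1
t = 0.5
bound = bihari_bound(A, psi, t)        # 2 / (1 - 0.5) = 4
numeric = euler_solution(A, psi, t)    # slightly below the bound
```

The finite blow-up time of the bound is the reason the estimate is first obtained on a short interval and then iterated.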

Lemma H.3 (Gagliardo-Nirenberg interpolation inequality, cf. Theorem 1.5.2 of [CM12]).

Fix $1\leq q,r\leq\infty$ and $m$ a positive integer. Let $u\in\mathscrsfs{L}^{q}(\Omega)\cap\mathscrsfs{L}^{r}(\Omega)$ and $\nabla^{\otimes m}u\in\mathscrsfs{L}^{p}(\Omega)$ . For integer $j$ , $0\leq j\leq m$ , and $\theta\in[j/m,1]$ (with the exception $\theta\neq 1$ if $m-j-d/2$ is a non-negative integer), define $p$ by

[TABLE]

Then $\nabla^{\otimes j}u\in\mathscrsfs{L}^{p}(\Omega)$ and satisfies

[TABLE]

for arbitrary finite $s$ with $1\leq s\leq\max(r,q)$ , where the constants $C>0$ and $C_{1}\geq 0$ are independent of $u$ . The constant $C$ is independent of $\Omega$ , while $C_{1}\to 0$ as $|\Omega|\to\infty$ . In particular, the choice $C_{1}=0$ is admissible if $\Omega={\mathbb{R}}^{d}$ .

Bibliography

[AB09] Martin Anthony and Peter L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 2009.
[AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.
[Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
[Bar93] Andrew R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), no. 3, 930–945.
[Bar98] Peter L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (1998), no. 2, 525–536.
[BGL13] Dominique Bakry, Ivan Gentil, and Michel Ledoux, Analysis and geometry of Markov diffusion operators, vol. 348, Springer Science & Business Media, 2013.
[Bih56] Imre Bihari, A generalization of a lemma of Bellman and its application to uniqueness problems of differential equations, Acta Mathematica Hungarica 7 (1956), no. 1, 81–94.
[BJW18] Ainesh Bakshi, Rajesh Jayaram, and David P. Woodruff, Learning two layer rectified neural networks in polynomial time, arXiv:1811.01885 (2018).
