Analysis of a Two-Layer Neural Network via Displacement Convexity
Adel Javanmard, Marco Mondelli, Andrea Montanari

TL;DR
This paper studies the global convergence of gradient descent in training two-layer neural networks with bump-like components, revealing a connection to Wasserstein gradient flows and displacement convexity that ensures exponential convergence.
Contribution
It demonstrates that as the number of neurons grows and bump width shrinks, the training dynamics converge to a Wasserstein gradient flow with displacement convexity, providing new theoretical insights.
Findings
Gradient descent converges to Wasserstein gradient flow as neurons increase.
Limit of the flow is a viscous porous medium equation when bump width tends to zero.
Displacement convexity of the cost function ensures exponential convergence.
Adel Javanmard (Data Science and Operations Department, Marshall School of Business, University of Southern California), Marco Mondelli (Department of Electrical Engineering, Stanford University), and Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University)
Abstract
Fitting a function by using linear combinations of a large number of ‘simple’ components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches.
Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega \subseteq \mathbb{R}^d$, using linear combinations of ‘bump-like’ components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. Further, when the bump width $\delta$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates for $N \to \infty$, $\delta \to 0$.
Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $\delta$, $N$. Explaining this phenomenon, and understanding the dependence on $\delta$, $N$ in a quantitative manner remains an outstanding challenge.
1 Introduction
In supervised learning, we are given data which are often assumed to be independent and identically distributed from a common law on (here is a feature vector, and is a label or response variable). We would like to find a function to predict the labels at new points . Throughout this paper, we will quantify the quality of our prediction by square loss, hence we are interested in minimizing .
One of the most fruitful ideas in this context is to use functions that are linear combinations of simple components:
[TABLE]
Here is a component function (a ‘neuron’ or ‘unit’ in the neural network parlance), and , are parameters to be learnt from data. Standard choices for the activation function are (sigmoid) or (ReLU). In this paper we will instead study a class of activation that depends on the difference . The objective is to minimize the population (prediction) risk
[TABLE]
Special instantiations of this idea include (we provide only pointers to the immense literature on each topic):
- Two-layer neural networks [Ros62, AB09];
- Sparse deconvolution [Don92, CFG14];
- Kernel ridge regression and related random feature methods [CST00, RR08];
- Boosting [Sch03, Fri01, BY03].
Despite the impressive practical success of these methods, the risk function is highly non-convex and little is known about global convergence of algorithms that try to minimize it (we refer to Section 2 for further discussion of the related literature).
Notable exceptions to the last statement are provided by random features and by boosting algorithms. In random feature methods, the parameters are not optimized over (they are drawn i.i.d. from some common distribution), and the resulting risk function becomes convex in the weights to be learnt. While this is a fruitful idea, it gives up the degrees of freedom afforded by the ’s.
Boosting overcomes non-convexity by fitting the components one at a time, sequentially. The underlying assumption is that the problem of minimizing the risk with respect to one of the hidden units is tractable. However, this is generally not the case when the parameters belong to a high-dimensional space.
The risk function (1.2) crystalizes a central conundrum in statistical learning. In a number of applications (especially at low noise), it is rarely the case that low prediction error can be achieved through a function that is linear in the raw covariates, e.g. . In a classical setting, the statistician would craft nonlinear features out of the covariates on the basis of expert knowledge. For the model of Eq. (1.1), this amounts to constructing vectors . Statistical methods would then be confined to the convex task of fitting the coefficients . This step is well understood from a statistical and computational perspective.
Modern machine learning approaches (boosting, neural networks, etc.) hold the promise of automating feature extraction, hence producing superior performance in a wide variety of applications. Unfortunately, we are still far from understanding in which cases optimizing over the $w_i$’s yields a significant improvement over, say, choosing them randomly. This central challenge intertwines statistical and computational aspects. It is not hard to see that varying the weights $w_i$ produces a significantly larger function class [Bac17]. The relevant question is what part of this class can be accessed using gradient descent or other practical algorithms.
The main objective of this paper is to introduce a nonparametric regression model in which these questions can be addressed rigorously. The model is interesting for at least two reasons. From a theoretical point of view, global convergence can be proved in the limit of a large number of neurons. The proof relies on a mathematical mechanism that has not been explored in the statistics or machine learning literature before. From a practical point of view, the model is nontrivial enough to illustrate the potential advantage of fitting the features (we demonstrate this numerically in Section 4).
Let be a compact convex set with boundary. We assume to be i.i.d. where and
[TABLE]
with a smooth function. We try to fit these data using a combination of bumps, namely
[TABLE]
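where $K^\delta(x) = \delta^{-d} K(x/\delta)$, $K$ is a first order kernel with compact support, and $w_i \in \Omega^\delta$ for $i \le N$. Here $\Omega^\delta \subseteq \Omega$ is a slightly smaller compact set, with $\Omega^\delta \to \Omega$ as $\delta \to 0$. (Note that in our setting the hidden units and input data have the same dimension, i.e., $D = d$.) We refer to Section 5 for a formal statement of our assumptions. From Eq. (1.2), we have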
[TABLE]
where and we use the fact that . Since the constant does not depend on parameters , it does not matter in optimizing over and henceforth we write, with a slight abuse of notation,
[TABLE]
The model (1.4) is general enough to include a broad class of radial-basis function (RBF) networks which are known to be universal function approximators [PS91]. To the best of our knowledge, there is no result on the global convergence of stochastic gradient descent for learning RBF networks, and this paper establishes the first result of this type.
It is important to emphasize a few differences with respect to standard RBF networks. First of all, we do not require the kernel to be radial, i.e. to depend uniquely on the norm . Second, we require to have compact support. This is mainly a technical requirement that simplifies some arguments: we expect our results to be generalizable to kernels that decay rapidly enough. Finally, and most crucially, the form (1.4) does not include non-uniform weights for the components. A more standard formulation would posit and learn the weights from data, see Eq. (1.1). We deliberately set the weights to a fixed value because the risk function is convex in , and hence fitting ’s to global optimality is ‘easy.’ Indeed, universal approximation could be achieved by keeping the centers fixed (and sufficiently dense in ) and only adjusting . As discussed above, our focus is on the role of the ’s.
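To make the form (1.4) concrete, here is a minimal one-dimensional sketch, not the authors' code. It assumes the uniform-weight estimator is the average $\frac{1}{N}\sum_i K^\delta(x - w_i)$ with rescaling $K^\delta(u) = K(u/\delta)/\delta$, and it uses a smooth bump as a stand-in for the kernel. The names `bump`, `kernel_delta` and `f_hat` are hypothetical.

```python
import numpy as np

def bump(u):
    """Unnormalized smooth bump supported on (-1, 1); an illustrative kernel choice."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

# Normalizing constant so that the kernel integrates to one (trapezoid rule).
_grid = np.linspace(-1.0, 1.0, 20001)
_vals = bump(_grid)
_Z = np.sum(0.5 * (_vals[1:] + _vals[:-1]) * np.diff(_grid))

def kernel_delta(u, delta):
    """Assumed rescaling K^delta(u) = K(u / delta) / delta in one dimension."""
    return bump(np.asarray(u, dtype=float) / delta) / (_Z * delta)

def f_hat(x, w, delta):
    """Uniform-weight estimate: the average of N bumps centered at w_1, ..., w_N."""
    x = np.atleast_1d(x)
    return kernel_delta(x[:, None] - w[None, :], delta).mean(axis=1)
```

Because every bump integrates to one, the estimate integrates to one as long as the bumps lie inside the domain, so only the placement of the centers can change its shape.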
Our main result is a proof that, for sufficiently large $N$ and small $\delta$, gradient descent algorithms converge to weights with nearly optimal prediction error, provided $f$ is strongly concave. Let us emphasize that the resulting population risk is non-convex regardless of the concavity properties of $f$. Our proof unveils a novel mechanism by which global convergence takes place. Convergence results for non-convex empirical risk minimization are generally proved by carefully ruling out local minima in the cost function (see Section 2 for pointers to this literature). Instead we prove that, as $N \to \infty$, $\delta \to 0$, the gradient descent dynamics converges to a gradient flow in Wasserstein space, and that the corresponding cost function is ‘displacement convex.’ Breakthrough results in optimal transport theory guarantee dimension-free convergence rates for this limiting dynamics [CJM+01, CMV03, CMV06]. In particular, we expect the cost function to have many local minima, which are however completely neglected by the gradient descent dynamics.
More specifically, our first step is to show that, for large $N$, the evolution of the weights under gradient descent can be replaced by the evolution of a probability distribution which approximates their empirical distribution. (Throughout, $\mathscr{P}(\Omega)$ denotes the space of probability distributions on $\Omega$, endowed with the Wasserstein metric $W_2$.) Namely, if denote the weights after iterations with step size , and is their empirical distribution, then we have
[TABLE]
where the limit holds in the sense of weak convergence or in distance (the two are equivalent since is compact). The limit evolution satisfies a partial differential equation (PDE) that can also be described as the Wasserstein gradient flow (i.e. gradient flow in ), for the following effective risk
[TABLE]
where and denotes the volume of the set . Here denotes the usual convolution. Let us emphasize that the convergence to Wasserstein gradient flow holds regardless of the concavity of .
The use of gradient flows to analyze two-layer neural networks was recently developed in several papers [MMN18, RVE18, CB18, SS18]. However, we cannot rely on earlier results because of the specific boundary conditions in our problem. We constrain the weights by running projected stochastic gradient descent (SGD): at each step, each weight moves in the direction of a stochastic gradient of the risk and is then projected back onto the constraint set. This results in a PDE with Neumann boundary condition on the boundary of the constraint set, which is not covered by previous theory. We establish a quantitative version of the limit (1.5) via propagation-of-chaos techniques.
Even if the cost (1.6) is quadratic and convex in , its gradient flow can have multiple fixed points, and hence global convergence cannot be guaranteed. Global convergence results were proven in [MMN18] and in [CB18] by showing that, for all has a density that is either smooth, or strictly positive everywhere. However, these convergence results are non-quantitative, and do not provide convergence rates. (An argument indicating convergence in a time polynomial in was put forward in [WLLM18], but for a different type of continuous flow.)
Indeed, the mathematical property that controls global convergence of gradient flow is not ordinary convexity but displacement convexity. Roughly speaking, displacement convexity is convexity along geodesics of the metric, see Section 3.5. The risk function (1.6) is not displacement convex. Indeed, its quadratic term reads which is not displacement convex unless is convex (see Lemma H.1), which cannot be in our setting. However, for small , we can formally approximate , and hence hope to replace the risk function (1.6) with a simpler one
[TABLE]
Most of our technical work is devoted to making this approximation rigorous. Namely, we prove that, as , where follows the gradient flow for the risk .
Remarkably, the risk function is strongly displacement convex (provided is strongly concave). A long line of work in PDE and optimal transport theory establishes dimension-free convergence rates for its gradient flow [CJM+01, CMV03, CMV06]. Namely, if is -strongly concave, then . By using the approximation results outlined above, we obtain global convergence for SGD. With high probability,
[TABLE]
where the error term vanishes as , in a suitable order.
This result implies that SGD converges exponentially fast to a near-global optimum with a rate that is controlled by the convexity parameter .
Our bounds are not sharp enough to provide quantitative control on the error term , especially in high dimension. Nevertheless, the convergence rate predicted by our asymptotic theory is in excellent agreement with numerical simulations, cf. Section 4. Explaining this surprising quantitative agreement is an outstanding challenge.
2 Related literature
The present work ties in several lines of research, some of which were already mentioned in the introduction. A substantial amount of work has been devoted to analyzing two-layer neural networks and developing algorithms with convergence guarantees, see e.g. [ZSJ+17, Tia17, BJW18]. However, these approaches are typically based on tensor factorization or similar initialization steps that are not used in practice, and do not scale well (although polynomially) in high dimension.
The landscape of empirical risk minimization was also studied in a number of papers, see e.g. [LY17, SJL18]. However, global convergence was only proved in the extremely overparametrized regime in which the neural network essentially behaves as kernel ridge regression [DZPS18].
Classical theory of neural networks was largely devoted to the two-layer case [AB09], although the focus was on representation and approximation questions [Cyb89, Bar93], as well as on generalization error. It was already clear in that context that a two-layer network is conveniently characterized by the empirical distribution of the hidden neurons, and that it is useful to relax this from a distribution with atoms, to a general probability measure. This representation plays an important role, for instance, in [Bar98], and was exploited again under the label of ‘convex neural networks’ in [BRV+06].
Over the last year, several groups independently revisited this connection, with the objective of understanding the landscape structure of two-layer networks, and the dynamics of gradient descent methods [NS17, MMN18, RVE18, SS18, CB18, MMM19]. In particular, it was proven in [MMN18] that, under certain smoothness conditions on the underlying data distribution, the gradient descent evolution is well approximated by a Wasserstein gradient flow, provided that the number of neurons exceeds the data dimensions. As mentioned above, the algorithm treated here differs from the ones analyzed in earlier work, because the weights are constrained to lie in the convex set . We enforce this constraint by using projected SGD, i.e. projecting at each step the weights onto the set . We generalize the analysis of [MMN18], obtaining convergence to a PDE with Neumann (reflecting) boundary conditions. As in [MMN18], we build on ideas that were first developed in the context of interacting particle systems [Dob79, Szn91].
The Wasserstein gradient flow approach was used in [MMN18, CB18] to establish global convergence results. However, these results fall short of our objectives for several reasons:
- The global convergence result of [CB18] relies on certain homogeneity properties of the neurons that are lacking here. We could obtain homogeneity by adding coefficients to Eq. (1.4), i.e. considering and minimizing the risk with respect to the coefficients . As mentioned above, we refrain from introducing coefficients so as not to oversimplify the problem: when , it is sufficient to fit the coefficients to achieve vanishing risk. Fitting the ’s is a least squares problem.
- Most importantly, the techniques of [MMN18, CB18] do not establish any convergence rates. This is not surprising, as those results hold under weak assumptions on the data distribution and the activation function. In particular, [MMN18, CB18, MMM19] cover general risk functions of the form (1.2) under certain smoothness and boundedness conditions on and on the functions , . In such a general setting [MMN18] provides examples in which the Wasserstein gradient flow has multiple fixed points, which are singular with respect to the Lebesgue measure. Global convergence is established in [MMN18, CB18] by proving that the PDE solution has a strictly positive density. However, it is difficult to imagine this condition holding in a quantitative dimension-independent manner.
In contrast, our results are a first step towards dimension-independent convergence rates, in a more restricted setting than [MMN18, CB18, MMM19].
In summary, our results do not subsume earlier work, which assumes a more general setting, but rather establish stronger results in a narrower context. Indeed, we believe that specific structural conditions must be imposed on the data distribution and activation function for the Wasserstein gradient flow approach to yield quantitative convergence rates. This paper presents one specific set of assumptions. Although our results are not strong enough to establish non-asymptotic convergence rates, they point clearly in that direction.
3 Model and assumptions
3.1 Notations
We will use lowercase boldface for vectors, e.g. , uppercase for random variables, e.g. , and uppercase boldface for random vectors, e.g. . The scalar product of two vectors is denoted by , and the norm of a vector is denoted by . The Euclidean ball in with center and radius is denoted by . Given a set , we denote by its volume.
We will refer to several function spaces in what follows. The most common is the space of -th integrable functions on a measure space . Given a function , we denote by its norm, namely . For , denotes the space of continuous functions with continuous derivatives up to order . In particular, denotes the space of continuous real-valued functions defined on . In addition, for and a metric space (with distance ), denotes the set of continuous functions , endowed with the distance between two functions defined as . For a function , we let be the Lipschitz constant of the function . Finally, as mentioned above, denotes the space of probability distributions on , endowed with the Wasserstein metric
Throughout the paper, we use to denote finite constants, which can vary from point to point. When these constants can depend on some of the problem parameters, e.g. , we will write . When they are absolute numerical constants, we will emphasize this by writing .
3.2 Data
As mentioned above, we are given data where , with a compact convex set, and , with . We assume the to be i.i.d. -subgaussian random variables with . We assume the function to be concave and smooth.
Our formal assumptions on the set and the function are as follows:
- (A1) , with , is a compact convex set with boundary.
- (A2) uniformly concave, i.e., there exists such that
[TABLE]
where denotes the Hessian of .
- (A3) , with for an absolute constant .
Without loss of generality, we can also assume that . As a running example, we will use , where we recall that is defined in Assumption (A1).
Remark 3.1.
The assumption is quite strong but simplifies our analysis. We believe our approach can be generalized to a broader family of probability distributions for the covariates , but defer these generalizations to future work.
3.3 Neural network and SGD
Let be a non-negative symmetric first order kernel with compact support. Formally, we assume that
[TABLE]
The assumptions of symmetry and compact support are not crucial, but simplify some of the technical details later. We will further assume , and to be independent of the ambient dimension . Notice that this requirement follows from the differentiability and compact support assumptions if is a radial function.
For , let . We try to fit the function (1.4) with parameters . These parameters are constrained to which is a suitable scaling of , as defined in the following. Given , with defined in (A1), define
[TABLE]
where
[TABLE]
For two sets , their Minkowski sum is defined as . Note that for all . Furthermore, implies for all . Finally, , whence . In our running example, is a ball of slightly smaller radius. Clearly, since is convex, is convex as well.
We use stochastic gradient descent to minimize the population risk (1.2). At each step, we use a new data point , thus the sample size is equal to the number of iterations of the algorithm. Assuming for simplicity constant step size , we update the parameters by
[TABLE]
Here is Gaussian noise which we take to be i.i.d. across time and neuron indices, and , and is the orthogonal projector onto :
[TABLE]
The noise term is added mainly for technical reasons. Namely, it allows us to control the smoothness of the solutions of the resulting PDE. In simulations we do not find it useful, and we believe that a more careful analysis would be able to establish smoothness without the noise term.
Again, in our running example, we have
[TABLE]
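The projected update in the running example can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the constraint set is taken to be a centered ball (an interval in one dimension), a biweight kernel stands in for $K$, the factor $1/N$ in the gradient is absorbed in the step size, and the noise scale `sqrt(2 * step * tau)` is an assumed normalization. All function names are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection of each row of w onto the centered ball of given radius."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-300))
    return w * scale

def biweight(u, delta):
    """Stand-in kernel K^delta(u) = K(u/delta)/delta, K(v) = (15/16)(1 - v^2)^2 on [-1, 1]."""
    v = u / delta
    return np.where(np.abs(v) < 1.0, (15.0 / 16.0) * (1.0 - v**2) ** 2, 0.0) / delta

def biweight_grad(u, delta):
    """Derivative of K^delta with respect to its argument."""
    v = u / delta
    return np.where(np.abs(v) < 1.0, -(15.0 / 4.0) * v * (1.0 - v**2), 0.0) / delta**2

def projected_sgd_step(w, x, y, delta, step, tau, radius, rng):
    """One noisy projected SGD step on (y - f_hat(x))^2 for centers w of shape (N, 1)."""
    u = x - w[:, 0]                                  # sample position relative to each center
    residual = y - biweight(u, delta).mean()         # y - f_hat(x; w)
    # d/dw_i of the loss is (2/N) * residual * K^delta'(x - w_i); 1/N is absorbed in `step`.
    grad = 2.0 * residual * biweight_grad(u, delta)
    noise = np.sqrt(2.0 * step * tau) * rng.standard_normal(w.shape)
    return project_ball(w - step * grad[:, None] + noise, radius)
```

A call such as `projected_sgd_step(w, x, y, 0.2, 1e-4, 0.0, 1.0, np.random.default_rng(0))` performs one noiseless step; centers within distance `delta` of the sample move towards it when the network underestimates the label and away from it otherwise.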
We initialize SGD with , where is a scaling of a fixed distribution , i.e. . We assume that the initialization is smooth:
- (A5)
.
3.4 PDE Model, $\delta > 0$
In the limit the population risk is approximated by the effective risk defined in Eq. (1.6). We emphasize that is a probability distribution supported on . Note that
[TABLE]
In particular .
Our first main result is that the dynamics of SGD is well approximated by the following PDE (see Section 5.1 for a formal statement):
[TABLE]
with initial and boundary conditions
[TABLE]
where denotes the inward normal vector to at .
A rigorous definition of solutions of this PDE, along with some of their properties, is given in Appendix B. In Appendix C, we discuss the connection between the PDE (3.9) and the so-called “nonlinear dynamics”, i.e. a stochastic differential equation that captures the trajectories of the weights . Using this connection, we prove existence and uniqueness of weak solutions of Eq. (3.9). In the proofs, we will often assume , which amounts to a rescaling of time .
For , the evolution defined by Eq. (3.9) corresponds to the gradient flow in Wasserstein metric for the risk function . For , it is the gradient flow for the free energy functional defined below
[TABLE]
3.5 Limit PDE, $\delta \to 0$
As mentioned above, in the limit the risk function is well approximated by , where , cf. Eq. (1.7).
The corresponding Wasserstein gradient flow is also known as viscous porous medium equation [Váz07] and it is given by
[TABLE]
with initial and boundary conditions
[TABLE]
In Appendix A, we give the definition of a weak solution for the PDE (3.12) with initial and boundary conditions (3.13). We also prove that the weak solution of the PDE (3.12) is unique, under a mild integrability condition. Again, in proofs we will assume without loss of generality .
As in the case, the evolution defined by Eq. (3.12) is the gradient flow for the free energy . Our analysis uses a key property of the risk function (and the free energy): displacement convexity [McC97]. For the reader’s convenience, we recall its definition here, referring to [AGS08, Vil08, San15] for further background. Given two probability measures , their distance is defined by
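$$W_2(\rho_0, \rho_1)^2 = \inf_{\gamma \in \Gamma(\rho_0, \rho_1)} \int \|x - y\|_2^2 \, \gamma(\mathrm{d}x, \mathrm{d}y), \tag{3.14}$$
where the infimum is taken over the set $\Gamma(\rho_0, \rho_1)$ of couplings of $\rho_0$, $\rho_1$ (i.e. probability measures on the product space whose first marginal coincides with $\rho_0$, and second with $\rho_1$). The infimum is achieved by weak compactness of $\Gamma(\rho_0, \rho_1)$.
The space of probability measures endowed with $W_2$ is a ‘length space,’ and in particular it is possible to construct geodesics, i.e. paths of minimum length connecting any two probability measures $\rho_0$, $\rho_1$. Geodesics have a simple description. Let $\gamma_*$ be the coupling achieving the infimum in the definition of $W_2(\rho_0, \rho_1)$. Letting $(X_0, X_1) \sim \gamma_*$, we define $\rho_t$ to be the distribution of $X_t = (1-t) X_0 + t X_1$. The curve $(\rho_t)$, indexed by $t \in [0,1]$, turns out to be the geodesic between $\rho_0$ and $\rho_1$.
Displacement convexity is convexity along geodesics. Namely, a function $F$ on the space of probability measures is $\lambda$-strongly displacement convex if, along every such geodesic and for every $t \in [0,1]$,
$$F(\rho_t) \le (1-t)\, F(\rho_0) + t\, F(\rho_1) - \frac{\lambda}{2}\, t(1-t)\, W_2(\rho_0, \rho_1)^2. \tag{3.15}$$
A useful observation is that strong displacement convexity implies that every local minimum of $F$ is a global minimizer. Indeed, by (3.15) it is straightforward to see that $F$ has at most one global minimizer $\rho_*$. Also, for every other point $\rho \neq \rho_*$, the geodesic from $\rho$ to $\rho_*$ is a strictly decreasing path for the function $F$. Now, suppose that $\rho$ is a local minimum. Then, there exists a neighborhood $\mathcal{N}$ of $\rho$ such that $F(\rho') \ge F(\rho)$ for any $\rho' \in \mathcal{N}$. However, the strictly decreasing path between $\rho$ and $\rho_*$ passes through the neighborhood $\mathcal{N}$, which leads to a contradiction, and so $\rho = \rho_*$.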
It follows from [McC97] that the risk function and the free energy are strongly displacement convex.
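The definition of displacement convexity can be checked numerically in the simplest case. The sketch below is only an illustration under assumptions: it works in one dimension, where the optimal coupling between two equal-size empirical measures matches sorted atoms, and it considers only a potential energy $\int V \,\mathrm{d}\rho$ with a $2$-strongly convex $V$ (for instance $V = -f$ with $f$ strongly concave), not the full risk studied in the paper. All names are hypothetical.

```python
import numpy as np

def w2_squared_1d(a, b):
    """Squared W_2 distance between two equal-size empirical measures on the line."""
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

def geodesic_1d(a, b, t):
    """Atoms of the displacement interpolation rho_t between the two empirical measures."""
    return (1.0 - t) * np.sort(a) + t * np.sort(b)

def potential_energy(atoms, potential):
    """Integral of `potential` against the empirical measure with the given atoms."""
    return np.mean(potential(atoms))

rng = np.random.default_rng(0)
a = rng.normal(-0.5, 0.2, size=1000)
b = rng.normal(0.4, 0.3, size=1000)
potential = lambda x: x**2          # 2-strongly convex potential
lam = 2.0

for t in (0.25, 0.5, 0.75):
    lhs = potential_energy(geodesic_1d(a, b, t), potential)
    rhs = ((1 - t) * potential_energy(a, potential)
           + t * potential_energy(b, potential)
           - 0.5 * lam * t * (1 - t) * w2_squared_1d(a, b))
    # The strong displacement convexity inequality, up to rounding error.
    print(t, lhs <= rhs + 1e-12)
```

For a quadratic potential the inequality holds with equality, which makes it a convenient sanity check before trying other convex potentials.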
Remark 3.2.
The concavity assumption on the regression function $f$ (Assumption (A2)) defines a nonparametric class under which global convergence can be established, with convergence rates uniquely determined by the curvature (in the limit $N \to \infty$, $\delta \to 0$). Nonparametric estimation of concave functions has attracted considerable attention over recent years, see e.g. [HD13, CS16], and is, by itself, an interesting domain of applicability.
However, our projected SGD algorithm is potentially applicable to any data set, and will return a meaningful estimate regardless of whether $f$ is concave or not. Indeed, in the next section we present numerical simulations indicating convergence to a near-global optimum even for non-concave functions $f$.
From a mathematical point of view, Assumption (A2) is only used to show the convergence of the solution of the viscous porous medium equation (limit PDE, $\delta \to 0$) to the unique global minimizer of the free energy, as formally stated in Theorem F.8. Concavity is not needed for the other results in the paper, namely approximating the SGD trajectory with the solution of the PDE ($\delta > 0$), see Theorem 5.1, and the convergence of the solution of the PDE ($\delta > 0$) to the solution of the viscous porous medium equation, see Theorem 5.2. A more general analysis that relaxes the concavity assumption is therefore foreseeable.
4 Numerical illustrations
In this section we provide some simple numerical illustrations of our setting, and compare numerical results with the predictions of the Wasserstein gradient flow theory.
It is easy to construct examples of strongly concave functions, satisfying our assumptions. One can start from any strongly concave continuous function on a compact convex set , add a constant to make it non-negative, and multiply it by a constant to normalize its integral. The resulting function satisfies our conditions. Notable examples of concave functions are given by log-moment generating functions , where the random variable satisfies mild assumptions (e.g., it is bounded and its distribution is not supported on a proper subspace of ). In general, given any twice differentiable function , the function is strongly concave for large enough.
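The shift-and-rescale recipe described above can be sketched on a grid. This is a minimal illustration assuming a one-dimensional domain $[-1, 1]$ and trapezoid-rule integration; the function `normalize_concave` and the particular starting function are hypothetical choices.

```python
import numpy as np

def normalize_concave(g, grid):
    """Shift g to be non-negative on the grid, then rescale it to integrate to one."""
    values = g(grid)
    shifted = values - values.min()                  # add a constant: minimum becomes 0
    mass = np.sum(0.5 * (shifted[1:] + shifted[:-1]) * np.diff(grid))
    return shifted / mass                            # multiply by a constant: unit integral

grid = np.linspace(-1.0, 1.0, 2001)
f_values = normalize_concave(lambda x: -x**2 - 0.3 * x, grid)
```

Adding a constant and multiplying by a positive constant both preserve strong concavity, so the normalized function remains strongly concave with a rescaled curvature parameter.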
4.1 A one-dimensional concave function
We set and (we choose the normalization so that ). Note that is uniformly concave in . We set the kernel as follows:
[TABLE]
where is a normalization constant ensuring that . The initialization is a truncated Gaussian: , with .
We find empirically that standard stochastic gradient descent (SGD) without the projection onto works well in this example, and consider this algorithm for simplicity in our first illustrations. We pick , (noiseless SGD), and constant step size . In Figure 1, left column, we plot the true function together with the neural network estimate at several points in time (time is related to the number of iterations via ). Different plots correspond to different values of with . We observe that the network estimates seem to converge to a limit curve which is an approximation of the true function . As expected, the quality of the approximation improves as gets smaller.
In the right column, we report the evolution of the population risk (1.2) normalized by . For comparison, we plot the evolution of the risk (1.7) as predicted by the limit PDE (3.12) with . We solve the PDE (3.12) numerically using a finite difference scheme that enforces the conservation law , see, e.g., [Tho13]. In the finite difference scheme, we choose time step and spatial step and , respectively. The curve obtained by this numerical solution appears to capture well the evolution of SGD towards optimality. The main difference is that, while the PDE (3.12) corresponds to , and hence evolves towards a global optimum at zero risk, SGD converges to a non-zero risk value, which can be interpreted as the approximation error, decreasing with .
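A mass-conserving scheme of the kind mentioned above can be sketched in flux form. The sketch assumes a one-dimensional equation of the type $\partial_t \rho = \partial_x(\rho\, \partial_x(\rho - f)) + \tau\, \partial_{xx}\rho$ with no-flux boundaries, which is the gradient-flow structure described in Section 3.5 up to constant factors that are not reproduced here. The upwind choice of the face density and all names are assumptions made for this illustration and are not taken from [Tho13].

```python
import numpy as np

def conservative_step(rho, f, dx, dt, tau=0.0):
    """One explicit flux-form step; the total mass sum(rho) * dx is preserved exactly."""
    grad_u = np.diff(rho - f) / dx                        # d_x(rho - f) at interior faces
    # Mass moves against grad_u, so the donor (upwind) cell is the right one when grad_u > 0.
    rho_face = np.where(grad_u > 0.0, rho[1:], rho[:-1])
    flux = rho_face * grad_u + tau * np.diff(rho) / dx
    flux = np.concatenate(([0.0], flux, [0.0]))           # no-flux (Neumann) boundaries
    return rho + dt * np.diff(flux) / dx

n_cells = 100
edges = np.linspace(-1.0, 1.0, n_cells + 1)
x = 0.5 * (edges[1:] + edges[:-1])                        # cell centers
dx = edges[1] - edges[0]

f = 0.75 * (1.0 - x**2)                                   # concave target density
f /= f.sum() * dx                                         # unit mass on the grid
rho = np.exp(-x**2 / (2 * 0.3**2))                        # Gaussian-shaped initial density
rho /= rho.sum() * dx

dt = 0.1 * dx**2                                          # small step for explicit stability
risk = [np.sum((rho - f) ** 2) * dx]
for _ in range(5000):
    rho = conservative_step(rho, f, dx, dt)
    risk.append(np.sum((rho - f) ** 2) * dx)
```

Writing the update as a difference of face fluxes makes the sum over cells telescope, so conservation holds to rounding error regardless of the step size; stability and positivity still require a sufficiently small `dt`.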
In Figure 2, we illustrate the numerical solution of the PDE (3.12) by plotting (i) the regression function together with the PDE solution (which coincides with the prediction at ) at several times, and (ii) the PDE prediction for the risk (1.7) normalized with respect to (this plot aggregates data from Figs. 1(b), (d), (f)). We also compare the risk (1.7) to the population risk achieved by SGD for different values of $\delta$. Note that, as $\delta$ becomes smaller, the risk converges to the predicted curve. The risk of the limit PDE (3.12) converges to $0$ exponentially fast in $t$, as predicted by the strong displacement convexity of the risk.
In Figure 3, we consider the SGD algorithm with projection , see (3.5). We pick , , and . On the left, we illustrate the evolution of the value of weights chosen at random; and on the right, we plot the histogram of their empirical distribution at . Note that this histogram matches well the regression function plotted in black.
4.2 A two-dimensional concave example
Next, we consider a two-dimensional example. We set and
[TABLE]
with , and where and are chosen so that is non-negative and . The kernel is given by , where is defined in (4.1) and is a normalization constant ensuring that . Again, the initialization is a truncated Gaussian: , with . We compare the normalized risk of SGD with no projection (, and ) for with that of the limit PDE (3.12). Figure 4 shows that, already at , the risk of SGD converges to the predicted curve and the risk of the limit PDE (3.12) tends to $0$ exponentially fast in $t$.
4.3 Comparing feature learning to random features
As discussed in the introduction, it is useful to consider the more general model
[TABLE]
with parameters as well as . This setting allows us to compare two different approaches:
Random feature regression: the weights are chosen independently of the labels (we allow for dependence on the covariates ).
Feature learning: the weights depend on the data .
In order to compare these two approaches, we assume to be given i.i.d. data , with , and determine the parameters by the same method, ridge regression. More explicitly, define the matrix as . Then, we estimate via
[TABLE]
where is chosen via cross-validation on a hold-out set, comprising of the samples.
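A minimal sketch of this ridge-regression step, assuming the feature matrix has entries given by a bump kernel evaluated at the difference between a data point and a weight vector. The bump profile below, the absence of any normalization by the number of samples or neurons, and the names `bump`, `feature_matrix`, and `ridge_fit` are illustrative assumptions; they stand in for the kernel of (4.1) and the estimator (4.3), which may be normalized differently.

```python
import numpy as np

def bump(z, delta):
    """Hypothetical compactly supported bump of width delta (not the kernel of (4.1))."""
    r = np.linalg.norm(z, axis=-1) / delta
    out = np.zeros_like(r)
    inside = r < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - r[inside] ** 2))
    return out

def feature_matrix(x, w, delta):
    """Matrix with entry (j, i) equal to bump(x_j - w_i); x is (n, d), w is (N, d)."""
    return bump(x[:, None, :] - w[None, :, :], delta)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X a||^2 + lam * ||a||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Usage with synthetic data: random weights, then coefficients by ridge regression.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 2))
w = rng.uniform(-1.0, 1.0, size=(5, 2))
y = rng.normal(size=50)
X = feature_matrix(x, w, delta=1.0)
a_hat = ridge_fit(X, y, lam=0.1)
```

In practice `lam` would be selected on the hold-out set by repeating the fit over a grid of values and keeping the one with the smallest hold-out error.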
In Figure 5, we compare the performance of three different ways to construct the weights : ‘random ,’ we choose the weights independently and uniformly at random in (blue triangles pointing down); ‘ data points,’ we choose the weights uniformly at random among the data points (green circles); ‘optimized ,’ we use the output of the projected SGD algorithm of the previous sections (red triangles pointing up). The first two can be regarded as ‘random features’ approaches, while the latter is a ‘feature learning’ method.
For the optimized , we use exactly the same algorithm as in (3.5) (without coefficients in the SGD update), with the only difference that each SGD step is carried out with respect to an independent sample from the empirical data, with replacement. SGD is stopped after iterations, and the coefficients are computed according to (4.3). Notice that this procedure is probably suboptimal, and it would be better to optimize and jointly: we choose this simpler two-stage procedure to have a more direct application of the algorithm analyzed in the paper, and a comparison with the random feature methods. We set (noiseless SGD), and constant step size . The number of iterations is chosen via cross-validation, by using the same hold-out set employed to optimize .
We set and define , where takes the form (4.2) with and . Again, and are chosen so that is non-negative and ; the kernel is given by , where is defined in Eq. (4.1) and ensures that .
After estimating and by either method, we generate a test set of samples and use it to estimate the generalization error. We perform independent trials of the experiment, and we plot the average risk normalized by together with the error bar at 1 standard deviation. In Figure 5-(a), we fix the number of neurons and we plot the normalized risk as a function of the number of data points . In Figure 5-(b), we fix the number of samples to and we plot the normalized risk as a function of the number of neurons . The data set used for cross-validation has size . Note that feature learning leads to improved performance in both settings. The improvement becomes more pronounced with the sample size , presumably because a better set of weights can be learnt. On the other hand, when the number of neurons becomes very large, random ’s are already covering densely enough, and there is no significant advantage in feature learning.
4.4 A non-concave one-dimensional example
We set and , where and are chosen so that is non-negative and . Note that the target function is bimodal, thus it is not concave. We perform the same numerical experiment described in Section 4.1. In Figure 6, left column, we plot the true function together with the neural network estimate at several points in time , where different plots correspond to different values of . In the right column, we report the evolution of the population risk (1.2) normalized by . In Figure 7, we plot (i) the regression function together with the PDE solution at several times , and (ii) the PDE prediction for the risk (1.7) (normalized with respect to ) compared with the population risk achieved by SGD for different values of . Even if the target function is not concave, the results are similar to those presented in the concave case: (i) the network estimates seem to converge to a limit curve which is an approximation of the true function , (ii) the quality of the approximation improves as gets smaller, and (iii) the risk of the limit PDE (3.12) converges to zero exponentially fast in .
4.5 Failure for small
We repeat the same experiment described in Section 4.1 for a smaller number of neurons . As can be seen in Figures 8 and 9, the quality of the approximation becomes worse as gets smaller. This is expected because, with a small number of activations, reducing their bandwidth leads to worse performance, as they are all zero on a large part of the space. Put differently, the number of neurons is too small to guarantee convergence of SGD to the predictions of the Wasserstein gradient flow theory.
5 Main results
5.1 Convergence of SGD to the PDE (3.9) at fixed
We now state our result concerning the convergence of the SGD dynamics (3.5) to the PDE (3.9). Note that this result does not require concavity of . Its proof is presented in Appendix D.
Theorem 5.1.
Assume that conditions (A1), (A3)-(A5) hold. Consider the SGD update (3.5) with initialization and constant step size . For , let be the unique solution of the PDE (3.9) with initial and boundary conditions (3.10), and assume Then, for any fixed , almost surely along any sequence () such that , .
Furthermore, for any , , , , and for any with , the following happens with probability at least ,
[TABLE]
where
[TABLE]
Our proof is based on the same approach developed in [MMN18]. We prove that solutions of the PDE (3.9) are in correspondence with distributions over trajectories in satisfying the following stochastic differential equation
[TABLE]
where is a standard Brownian motion and is the boundary reflection (in the sense of a Skorokhod problem). The density is determined, self consistently, via . We prove existence and uniqueness of solutions to this problem, and refer to the corresponding stochastic process as nonlinear dynamics. This in turn implies existence and uniqueness of the solutions of the PDE (3.9).
We next construct a coupling between the network weights , and i.i.d. trajectories of the nonlinear dynamics . Controlling the expected distance in this coupling yields Theorem 5.1.
Remark 5.1.
The error term in Eq. (5.1) is completely analogous to the error in a similar theorem proved in [MMN18]. The constant appearing here is obtained by bounding the Lipschitz constant of . As already mentioned, the main technical difficulty with respect to [MMN18] is posed by the Neumann (reflecting) boundary conditions. Indeed, even if we are given a solution of the PDE (3.9), existence and uniqueness of solutions of the Skorokhod problem (5.3) is a highly non-trivial fact first established in [Tan79, LS84]. As a consequence, while the main proof idea is similar to the one in [MMN18], its implementation is significantly different.
Remark 5.2.
As discussed in Appendix D, our proof applies to a more general version of the PDE (3.9) and correspondingly of the SGD dynamics (3.5), where takes the form , for , two smooth functions. The SGD update (3.5) is generalized as in [MMN18], and Theorem 5.1 holds with the terms containing (i.e., and ) replaced by a constant that depends uniquely on , , , .
5.2 Convergence to the solutions of porous medium equation
We next prove that the solution of the PDE (3.9) converges, as , to the unique solution of the porous medium equation (3.12). As for Theorem 5.1, this result does not rely on the concavity assumption for .
Theorem 5.2.
Assume that conditions (A1) and (A3)-(A5) hold. Denote by the unique solution of the PDE (3.9) with initial condition . Then
1. The porous medium equation (3.12) admits a weak solution with initial and boundary conditions (3.13). Further, this solution is unique under the additional condition .
2. For almost all , we have in as .
While this statement is very natural at a heuristic level, its proof is actually the bulk of our technical work. Similar approximation results have been proved in the past by Oelschläger, Philipowski, Figalli [Oel02, Phi07, FP08], but they do not apply directly to the present case unless (also, we have to deal with different boundary conditions).
Our proof follows a classical compactness argument, generalizing the approach of [FP08]. Namely we consider the sequence of trajectories indexed by the width . We prove that this family is bounded and equicontinuous in , and hence admits converging subsequences . We next prove that any such converging subsequence converges in and that the limit is a weak solution of the porous medium equation (3.12). Unfortunately, uniqueness of weak solutions of the PME (3.12) is –to the best of our knowledge– an open problem. However, we generalize methods from [Oel02] to show that any subsequential limit is actually in , and prove that the weak solution is unique under this condition. This allows us to conclude that converges to this unique weak solution .
5.3 Global convergence of SGD
Let us now state the main result of this paper: SGD converges to a model with nearly optimal risk.
Theorem 5.3.
Assume that conditions (A1)-(A5) hold, and recall that is the concavity parameter of the function , i.e., for all , .
Consider the SGD update (3.5) with initialization and constant step size . Assume . Then, for any , the following holds with probability at least ,
[TABLE]
where
[TABLE]
Remark 5.3.
The error term in Eq. (5.4) is always non-negative. In fact, as for any . Furthermore, by applying Jensen’s inequality, we have that, for any ,
[TABLE]
which gives the following upper bound
[TABLE]
Recall that controls the variance of the noise, which is added at each step of the SGD algorithm for technical purposes. Thus, we can take sufficiently small so that the term is arbitrarily small.
Remark 5.4.
The proof of Theorem 5.3 provides a somewhat more explicit expression for the error term in Eq. (5.4). Namely, for an arbitrary but fixed ,
[TABLE]
The term bounds the error due to describing the SGD dynamics using the PDE (3.9). It vanishes when , , under the stated conditions. The term captures the error due to approximating the PDE (3.9) with the porous medium equation (3.12). Finally, the term describes the convergence to equilibrium of the solution of the porous medium equation.
The proof of Theorem 5.3 is presented in Appendix F and relies crucially on regularity results for the PDE (3.9) which are established in Appendix E.
More specifically, the proof is based on three steps, which we spell out once more:
1. We approximate the dynamics of SGD by the PDE (3.9) at fixed. In doing so, we incur an error which is controlled using Theorem 5.1.
2. We approximate the solution of the PDE (3.9) at using the solution of the porous medium equation (3.12), as stated in Theorem 5.2.
3. We use results from [CJM+01, CMV03, CMV06] to prove that the latter solution converges exponentially fast to the global optimum, with rate .
Given Theorems 5.1, 5.2, and the results of [CJM+01, CMV03, CMV06], this proof is relatively direct. We emphasize that, unlike Theorems 5.1, 5.2, the proof of Theorem 5.3 relies in a crucial way on our structural assumptions, namely the concavity of , and the structure of the bump-like activation .
Remark 5.5.
If we settle for the less ambitious goal of proving global convergence without the explicit dimension-independent rate , and there are no boundary conditions (), we can achieve this goal using [MMN18, Theorem 5]. This result guarantees convergence in a number of SGD steps that potentially depends on (the noise injected in SGD) as well as the dimensions , and the width , but does not require assuming strong concavity of . On the other hand, numerical experiments are consistent with the conclusion that rates are independent of these parameters, cf. e.g. Fig. 1 where dependence on is explored.
6 Discussion
It is instructive to compare the general strategy followed in this paper (and in related work, e.g. [MMN18, MMM19]) and the results we obtain, to a more classical approach in theoretical statistics. For the sake of clarity, we will abstract away most of the details of the present problem, and focus on the most important differences.
Consider a general setting in which we want to minimize the population risk , where is a non-convex loss function and are parameters (in our problem are the first-layer weights and ). We are given i.i.d. samples .
A standard theoretical analysis of this problem uses empirical risk minimization. Namely, we define the empirical risk (with denoting the empirical average), and compute the minimizer , for instance by gradient descent. Theoretical analysis proceeds –conceptually– in two steps. First, one proves that the empirical risk minimizer is a near-minimizer of the population risk. Namely
[TABLE]
This is normally proved through a uniform convergence argument to establish a bound . Here is an error term that (hopefully) vanishes as for fixed. Second, one proves that gradient descent (with respect to the cost function ) converges to a minimizer . This is achieved by showing that, with high probability, the landscape satisfies some strong conditions that guarantee convergence of gradient descent (or other algorithms). For instance, one desirable (although not sufficient) property is that does not have local minima other than the global minima, provided that the sample size is large enough. A substantial literature applies this general scheme (with significant refinements) to a variety of non-convex problems in high-dimensional statistics, including phase retrieval, clustering, matrix completion, error-in-variables models, and so on. We refer to [MBM+18] for examples and a more detailed survey.
Unfortunately this approach runs into substantial difficulties when treating complex models such as multi-layer neural networks. We can name at least two sources of difficulties. First of all, the number of parameters in the model is often comparable with the sample size , and therefore uniform convergence of the empirical risk to population risk does not hold. For instance, in the present model, we could use a number of parameters : indeed, such an example is considered in Figure 5-(a), where and . Of course this problem can be addressed by constraining other measures of complexity than the number of parameters [Bar98], but the common practice is not to add such regularizers in the training.
The second source of difficulties is that studying the risk landscape, and ruling out local minima is extremely difficult, even if we limit ourselves to the limit, i.e. the population risk . In two-layer neural networks, part of this difficulty is due to the fact that the risk (1.2) is invariant under permutations of the neurons, and hence it has (generically) at least global minima related by permutations, and a large number of saddle points connecting them.
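The permutation invariance mentioned above is easy to check numerically. The sketch below uses a hypothetical Gaussian bump and a one-dimensional target chosen only for illustration; the function names `network` and `risk` are not from the paper.

```python
import numpy as np

def network(x, w, delta=0.2):
    """Average of Gaussian bumps centered at the weights w (hypothetical kernel)."""
    return np.exp(-((x[:, None] - w[None, :]) ** 2) / (2 * delta**2)).mean(axis=1)

def risk(w, x, y):
    """Empirical squared-error risk of the network on the pairs (x, y)."""
    return np.mean((y - network(x, w)) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 6.0 * x * (1.0 - x)
w = rng.uniform(0.0, 1.0, 8)
w_perm = rng.permutation(w)          # same neurons, listed in a different order

risk_original = risk(w, x, y)
risk_permuted = risk(w_perm, x, y)
```

The two risks agree up to floating-point rounding, since the network output depends on the weights only through their empirical distribution.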
The approach pursued in this paper builds on two simple remarks, which are connected to the previous difficulties:
1. Uniform convergence of the empirical risk to the population risk is not necessary, nor is it necessary to control the random deviations of the whole landscape of the empirical risk. What is instead important is to control the landscape of the empirical risk along the trajectory of gradient descent from a given initialization.
A convenient way to implement this idea is to consider SGD in a one-pass setting in which each sample is used only once. In the limit of small step size, this converges to gradient flow with respect to .
2. Absence of local minima in the population landscape is not necessary either. What is instead important is absence of local minima along the gradient flow trajectory for or, more precisely, the fact that the gradient flow trajectory converges to a global minimum.
These remarks suggest the following proof strategy. Let denote the gradient flow trajectory from a given initialization (namely ), and be the (random) parameters produced after SGD steps. We first prove that gradient flow converges to a global optimum, possibly with explicit convergence rate :
[TABLE]
where as . We then show that the SGD trajectory, after steps, is well approximated by the gradient flow for provided the step size is small. For instance we might prove that there exists a numerical constant such that, for any , with high probability
[TABLE]
The reader might recognize that the last estimate is analogous to the one obtained in Theorem 5.1, while the estimate (6.2) is what we obtain from displacement convexity (after taking the limit using Theorem 5.2). Putting the two estimates together, and recalling that we can run a total of SGD steps (in the one-pass setting), we get
[TABLE]
where we set . The error is reminiscent of a bias-variance tradeoff: the first term is a bias due to early stopping; the second is instead the stochastic approximation error. We can now optimize as to minimize this error. For instance, if , and , we can choose , yielding where .
In summary, within the present approach, the generalization error is bounded via a tradeoff between the convergence rate of gradient flow in the population risk, and the error of approximating the gradient flow by SGD. A side benefit of this proof strategy is that it guarantees the existence of an efficient algorithm to compute the weights .
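The comparison between one-pass SGD and gradient flow on the population risk can be illustrated on a toy problem. The sketch below assumes a scalar parameter, the quadratic loss `0.5 * (theta - z)**2`, and Gaussian samples; these choices, and the function names, are hypothetical and much simpler than the model treated in this paper.

```python
import numpy as np

def one_pass_sgd(theta0, samples, eps):
    """One-pass SGD on the loss 0.5 * (theta - z)^2, using each sample exactly once."""
    theta = np.empty(samples.size + 1)
    theta[0] = theta0
    for k, z in enumerate(samples):
        theta[k + 1] = theta[k] - eps * (theta[k] - z)   # stochastic gradient step
    return theta

def gradient_flow(theta0, mu, t):
    """Exact gradient flow of the population risk 0.5 * E[(theta - z)^2], E[z] = mu."""
    return mu + (theta0 - mu) * np.exp(-t)

rng = np.random.default_rng(0)
mu, theta0, eps, n = 1.0, 4.0, 1e-3, 5000
samples = rng.normal(mu, 1.0, size=n)

theta_sgd = one_pass_sgd(theta0, samples, eps)
theta_flow = gradient_flow(theta0, mu, eps * np.arange(n + 1))
max_gap = np.max(np.abs(theta_sgd - theta_flow))   # deviation over the whole trajectory
```

For this toy loss the fluctuations of SGD around the flow have standard deviation of order the square root of the step size, so `max_gap` shrinks as `eps` decreases while the time horizon `n * eps` is held fixed.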
As mentioned, the above discussion omits several challenges that are posed by the model treated in this paper. Most notably: We are trying to optimize weight vectors , but the loss only depends on the empirical distribution of these vectors . It is therefore natural to define a gradient flow in the space of probability distributions, which is nothing but the PDE (3.9). This also helps address the challenge posed by the fact that, as increases, the dimension of the parameter space increases and convergence to the population behavior might fail. We are embedding all the values of in the space . We cannot prove a bound of the form (6.2) for the original PDE (3.9) and have to approximate this by the porous medium equation (3.12).
Because of these additional challenges, our bounds are not nearly as neat as in Eqs. (6.2), (6.3) and depend on the additional parameters : in particular, the approximation by the porous medium equation in Theorem 5.2 is non-quantitative. We therefore refrain from optimizing the tradeoff between convergence rate of gradient flow, and error in stochastic approximation, which would result in suboptimal statistical guarantees, and defer this objective to future work.
Acknowledgements
A. Javanmard was partially supported by an Outlier Research in Business (iORB) grant from the USC Marshall School of Business, a Google Faculty Research award and the NSF CAREER award DMS-1844481. M. Mondelli was supported by an Early Postdoc.Mobility fellowship from the Swiss National Science Foundation and by the Simons Institute for the Theory of Computing. A. Montanari was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162 and ONR N00014-18-1-2729. This work was carried out in part while the authors were visiting the Simons Institute for the Theory of Computing.
Appendix A Uniqueness of weak solutions of limit PDE ()
In this appendix, we prove that the limit PDE obtained for , namely the porous medium equation (3.12) has at most one solution in . Existence of such solutions will follow from the results of Appendix F, and in particular from Lemma F.4.
For the sake of clarity, we repeat the definitions of Section 3.5. Let be a compact convex set with boundary. We denote by the space of probability measures on endowed with Wasserstein’s distance. Since is compact, the induced topology is equivalent to weak convergence. We consider the following PDE:
[TABLE]
with initial and boundary conditions
[TABLE]
Throughout this appendix, we adopt the notation . Let us formally define the concept of weak solutions for the PDE (A.1).
For the next statement, it is useful to recall that denotes the class of functions with continuous partial derivatives , for all .
Definition A.1 (Weak solution of limit PDE).
We say that is a weak solution of the PDE (A.1), with initial and boundary conditions (A.2) if
1. has density with respect to Lebesgue measure, and .
2. For any test function , satisfying for all , we have
[TABLE]
We now prove a uniqueness result, under a mild integrability condition.
Lemma A.2 (Uniqueness of limit PDE).
Let be two weak solutions of the PDE (A.1) with initial and boundary conditions (A.2), in the sense of Definition A.1. Then, , almost everywhere.
Proof.
Note that setting corresponds to scaling time by a factor and to substituting with . Since the proof holds for any , without loss of generality we can set .
The proof follows ideas from [Váz07, Theorem 6.5]. We write the identity (A.3) for and and subtract them to get
[TABLE]
where we use the shorthand and . Define and . Then,
[TABLE]
Note that and define the truncated function . We next choose a smooth test function , and consider the following backward problem:
[TABLE]
Here, is a smooth approximation of , such that . (We will make precise below in what sense has to approximate . For the moment, it can be a general smooth function satisfying the bounds .) Note that (A.6) is a backward parabolic problem with smooth coefficients and with Neumann boundary conditions. Hence, by classical results on quasilinear parabolic PDEs [LSU88], it admits a solution . Rewriting (A.5) for such a test function , we get
[TABLE]
This immediately implies that
[TABLE]
By applying Cauchy-Schwarz inequality, we have that
[TABLE]
To bound the first term on the right-hand side of (A), we consider a smooth positive bounded function , defined on , whose properties will be discussed later. Define the shorthand . We multiply the parabolic PDE (A.6) by and integrate to obtain
[TABLE]
We next write
[TABLE]
Here follows from integration by parts in the integral over and using the fact that for and . Also, follows from integration by parts in the integral over . Finally holds because for and .
Getting back to (A) and using the properties of function , we have
[TABLE]
The penultimate step follows from integration by parts and the constraint , for and , and the last step follows by applying the Cauchy-Schwarz inequality. We continue by applying the Cauchy-Schwarz inequality again to get
[TABLE]
where . Combining Equations (A.11) and (A.12), we get
[TABLE]
where . We find a smooth function such that
1. , for ,
2. .
A particular choice is
[TABLE]
We then obtain from (A) that
[TABLE]
Now by employing (A.14) in bound (A) combined with (A.7) we get
[TABLE]
Next we note that
[TABLE]
Call the first integral and denote the second one by . The integrand in is pointwise bounded by
[TABLE]
Since , we have that has bounded integral. Hence, we can choose large enough such that is arbitrarily small. Moreover we can choose the smooth approximation such that is also arbitrarily small. Putting everything together, we obtain that
[TABLE]
where is an arbitrary small fixed constant.
In addition, since , invoking (A.15) we have
[TABLE]
Since and are independent of , by choosing arbitrarily small, we conclude that
[TABLE]
Since was an arbitrary smooth function supported on , this implies that , almost everywhere. By repeating a similar argument, we get , almost everywhere. The result follows. ∎
Appendix B General results on the PDE (3.9) ()
This appendix contains some basic results on the PDE (3.9). Although these facts are standard, we collect them here for the reader’s convenience.
In fact, we will consider a more general PDE, which also includes as a special case the one studied in [MMN18]. We consider a compact convex domain , with a non-empty interior. The general PDE is parametrized by two functions and , with . (Unlike in [MMN18], we consider the case of a compact domain with Neumann boundary conditions.) Given , we define
[TABLE]
and consider the PDE
[TABLE]
with initial and boundary conditions
[TABLE]
We will typically write for a solution of this equation, in order to emphasize that it is a function of that takes values in , and for the corresponding density, viewed as a function on . Let us formally define the concept of weak solutions for the PDE (B.2).
Note that the PDE (3.9) is a special case of this setting with , and and defined as follows:
[TABLE]
Remark B.1.
For the special choice of and given by (B.4) the following properties hold:
1. is convex for any .
2. .
3. , where .
Proof.
We have . Hence,
[TABLE]
This proves that is convex. The next two properties are straightforward. ∎
Definition B.1 (Weak solution of PDE).
We say that is a weak solution of (B.2) with initial and boundary conditions (B.3) if and, for any test function , satisfying for all , we have
[TABLE]
We now state and prove Duhamel’s principle for the PDE (B.2). Duhamel’s principle follows from the fact that the right-hand side of (B.2) contains the linear diffusion term , and it will be crucial for the proofs that will follow.
Lemma B.2 (Duhamel’s principle).
Assume . Let denote the heat kernel with Neumann boundary conditions, defined in (G.1)-(G.3). Let be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, for any , has a density, denoted by , which satisfies, for any ,
[TABLE]
Proof.
By rescaling time, without loss of generality, we set . Let , and define
[TABLE]
By the properties of the heat kernel, we have:
[TABLE]
Let be a weak solution. We choose the test function in (B.5) with . Note that by (B.8), this test function satisfies the Neumann boundary condition. In addition, by (B.9) we obtain
[TABLE]
By an application of Fubini’s theorem, this implies
[TABLE]
Since is arbitrary, we obtain that admits a density and (B.6) follows. ∎
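To make the role of the Neumann heat kernel concrete, the following sketch evaluates it on the unit interval, where it has an explicit cosine-series expansion, and applies it to an initial density. This covers only the homogeneous (pure diffusion) part of Duhamel's formula. The one-dimensional domain, the unit diffusion coefficient, the truncation at `n_modes` terms, and the function names are assumptions made for this example; the kernel of (G.1)-(G.3) is defined on a general convex domain.

```python
import numpy as np

def neumann_heat_kernel(x, y, t, n_modes=200):
    """Heat kernel of d/dt u = d^2/dx^2 u on [0, 1] with zero-flux boundaries.

    Uses the expansion G(x, y, t) = 1 + 2 * sum_k exp(-(k pi)^2 t) cos(k pi x) cos(k pi y).
    """
    k = np.arange(1, n_modes + 1)
    decay = np.exp(-((k * np.pi) ** 2) * t)
    cx = np.cos(np.pi * np.outer(x, k))         # shape (len(x), n_modes)
    cy = np.cos(np.pi * np.outer(y, k))         # shape (len(y), n_modes)
    return 1.0 + 2.0 * (cx * decay) @ cy.T      # shape (len(x), len(y))

def heat_semigroup(rho0, grid, t):
    """Integrate the kernel against rho0 with a midpoint rule on a uniform grid."""
    dx = grid[1] - grid[0]
    return neumann_heat_kernel(grid, grid, t) @ rho0 * dx

n_grid = 400
grid = (np.arange(n_grid) + 0.5) / n_grid       # midpoints of a uniform partition
dx = 1.0 / n_grid
rho0 = np.where(np.abs(grid - 0.3) < 0.1, 5.0, 0.0)   # box density with unit mass
rho_t = heat_semigroup(rho0, grid, t=0.01)
```

The evolved density keeps the total mass of `rho0`, reflecting the zero-flux boundary condition, while its maximum decreases as the box is smoothed out.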
As an intermediate step towards proving existence and uniqueness, we consider a linearized problem
[TABLE]
with initial and boundary conditions
[TABLE]
Here, is independent of , and weak solutions are defined as for the original problem (with Neumann boundary conditions).
Corollary B.3 (Uniqueness of linearized problem).
Assume that and also that
[TABLE]
Then, the PDE (B.13) with initial and boundary conditions (B.14) has at most one weak solution.
Proof.
Without loss of generality, we will set . Assume by contradiction that , are two solutions. Fix arbitrary . Then, by an application of (B.6) to , we have
[TABLE]
where we used the estimates of Theorem G.1. By taking the supremum over on both sides, we obtain that for ,
[TABLE]
Therefore, the two solutions coincide if we fix the initial condition . For larger , the claim follows by iterating the above argument. ∎
Appendix C Nonlinear dynamics
The ‘nonlinear dynamics’ plays an important role in our proof of Theorem 5.1. In this section we adopt the same general setting as in Appendix B, remembering that for our application we set and as per Eq. (B.4).
Given , consider the following stochastic differential equation for a process , with a reflecting boundary condition (known as ‘Skorokhod problem’)
[TABLE]
where is a standard -dimensional Brownian motion and enforces the reflecting boundary by satisfying the following constraints (recall that is the normal to at , directed inside):
is adapted (and hence so is ).
has (almost surely) bounded variation. Denoting by the total variation of on the interval , we define the measure on by .
, where denotes the interior of .
We have that, for ,
[TABLE]
where , for -almost every .
Then, is said to solve the Skorokhod problem.
Lemma C.1 (Existence, uniqueness and continuity of Skorokhod problem).
Fix and let with . Then, the Skorokhod problem (C.1), (C.2) admits a unique solution with continuous paths. Define , for , by letting . Then, .
Proof.
Let and notice that, by the smoothness of , and compactness of , this is a Lipschitz continuous function of . Hence the problem (C.1), (C.2) admits a unique solution by [Tan79, Theorem 4.1].
We are left with the task of proving that is continuous in metric. Notice that
[TABLE]
By [Tan79, Lemma 2.2], we have, for any ,
[TABLE]
Taking expectation, we get
[TABLE]
whence the continuity follows. ∎
Definition C.2 (Solution of nonlinear dynamics).
We say that is a solution of the nonlinear dynamics if , namely
[TABLE]
Lemma C.3.
Assume . If is a weak solution of the PDE (B.2) with initial and boundary conditions (B.3), then it is a solution of the nonlinear dynamics. Vice versa, if is a solution of the nonlinear dynamics, then it is a weak solution of PDE (B.2) with initial and boundary conditions (B.3).
Proof.
Let be a weak solution of the PDE (B.2), and assume . Let be the unique solution of the Skorokhod problem (C.1), (C.2), cf. Lemma C.1. Let , , i.e. . For , satisfying for all , compute
[TABLE]
Here follows from Ito’s formula for continuous semimartingales [RW94], since and for -almost every , and by the definition of . We conclude that is a weak solution of the linearized PDE (B.13), with . Since also solves the same linearized PDE, we conclude by Corollary B.3 that for all , and therefore is a solution of the nonlinear dynamics.
Next, assume that is a solution of the nonlinear dynamics. Then by the same application of Ito’s formula to the process , we have
[TABLE]
which coincides with the claim that is a weak solution of the PDE (B.2). ∎
Theorem C.4 (Existence and uniqueness of nonlinear dynamics).
For any initial condition , and any , the nonlinear dynamics (C.6) admits a unique solution with . As a consequence, the PDE (B.2) with initial and boundary conditions (B.3) has a unique solution.
Proof.
Note that it is sufficient to prove the claim for , where is a small enough constant, since this implies the claim for arbitrary by breaking into intervals of size smaller than .
We claim that is a contraction on endowed with the metric . To show that this is the case, define , . By the smoothness of , and by the compactness of , we have that and are Lipschitz continuous in , with Lipschitz constant independent of . Further,
[TABLE]
Let and be solutions of the Skorokhod problem (C.2), with drift coefficients , . We couple the processes and by using the same initial condition and same Brownian motion :
[TABLE]
Define
[TABLE]
and notice that, by the above remarks,
[TABLE]
Further, by [Tan79, Remark 2.2], we have
[TABLE]
Define and . By taking the expectation of the last inequality and using Jensen’s inequality, we get
[TABLE]
which immediately implies
[TABLE]
Hence, for ,
[TABLE]
Selecting small enough, so that , we obtain
[TABLE]
This proves that is a contraction as claimed. By Lemma C.1, maps into itself. Furthermore, is complete with respect to the metric . As a result, there exists a unique fixed point. ∎
We conclude this section by stating a result about the discretization of the nonlinear dynamics. Fix a solution of the PDE (B.2) with initial condition , a step size and define recursively the random variables by
[TABLE]
This can be viewed as an Euler discretization of the stochastic differential equation (C.1), (C.2), and the next theorem establishes that this is indeed a close approximation of the original process. It is just an immediate consequence of a result of Slomiński [Slo94, Slo01].
Theorem C.5 (Theorem 3.2 in [Slo01]).
Consider the nonlinear dynamics defined by Eqs. (C.1), (C.2). Assume , and , , , . Also assume that . Construct the Euler scheme (C.19), (C.20) on the same probability space by letting and . Then, for any , ,
[TABLE]
Proof.
The proof is obtained simply by chasing the constants in the proof of Theorem 3.2 (part (ii)) of [Slo01], and using the optimal constant in the Burkholder-Davis-Gundy inequality (which yields in [Slo01, Eq. (2.7)]). ∎
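An Euler scheme for a reflected stochastic differential equation can be sketched as follows: take an unconstrained Euler step, then project back onto the domain. The example below assumes a one-dimensional interval (so that the projection is a clip), a noise term scaled as `sqrt(2 * tau * eps)`, and a hypothetical drift; it illustrates the general technique and is not a transcription of (C.19), (C.20).

```python
import numpy as np

def projected_euler(x0, drift, tau, eps, n_steps, rng, lo=0.0, hi=1.0):
    """Euler scheme for a reflected SDE on [lo, hi]: step freely, then project back."""
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps + 1,) + x.shape)
    path[0] = x
    for k in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + eps * drift(x, k * eps) + np.sqrt(2.0 * tau * eps) * noise
        x = np.clip(x, lo, hi)               # projection onto the interval
        path[k + 1] = x
    return path

def example_drift(x, t):
    """Hypothetical drift pulling particles towards the midpoint of [0, 1]."""
    return -4.0 * (x - 0.5)

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=1000)        # independent particles
path = projected_euler(x0, example_drift, tau=0.1, eps=1e-3, n_steps=1000, rng=rng)
```

Each row of `path` is the particle configuration at one time step; by construction every entry stays inside the interval, with the projection playing the role of the boundary reflection term.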
Appendix D Convergence of SGD to the PDE: Proof of Theorem 5.1
The proof is a ‘propagation of chaos’ argument [Szn91]. While the basic idea is similar to the one used in [MMN18], implementing it requires different estimates because of the reflecting boundary conditions. In particular, we rely on tools developed in the study of discretizations of reflecting stochastic differential equations.
We will prove a more general theorem that implies Theorem 5.1 as a special case, and also applies to the setting of [MMN18]. Namely, we consider data i.i.d. with common distribution on , and parameters . These parameters are initially sampled independently from distribution , and then evolve according to
[TABLE]
Here is the projection on the closed convex domain with non-empty interior. The setting of Theorem 5.1 is recovered by taking , , , .
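As a rough illustration of a projected update of this kind, the sketch below performs one noisy SGD step on the squared loss and then projects onto the domain. It is not a transcription of (D.1): the Gaussian bump `bump`, the box domain [0, 1]^d, the noise scaling `sqrt(2 * step * tau)`, and all parameter names are assumptions chosen for the example.

```python
import numpy as np

def bump(x, centers, delta):
    """Hypothetical bump sigma(x; w) = exp(-|x - w|^2 / (2 delta^2)), one value per center."""
    sq = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sq / (2.0 * delta ** 2))

def projected_sgd_step(centers, x, y, step, delta, tau, rng):
    """One noisy SGD step on (y - mean_i sigma(x; w_i))^2, then projection onto [0, 1]^d."""
    act = bump(x, centers, delta)                      # shape (N,)
    residual = y - act.mean()                          # y minus the network output
    # Gradient of sigma(x; w_i) with respect to the center w_i.
    grad = act[:, None] * (x - centers) / delta ** 2   # shape (N, d)
    noise = np.sqrt(2.0 * step * tau) * rng.standard_normal(centers.shape)
    # Mean-field scaling: the 1/N factor of the loss gradient is absorbed in the step size.
    proposal = centers + 2.0 * step * residual * grad + noise
    return np.clip(proposal, 0.0, 1.0)

rng = np.random.default_rng(0)
centers = rng.uniform(size=(50, 2))
target = lambda x: 1.0 - np.sum((x - 0.5) ** 2)        # a concave target on [0, 1]^2
for _ in range(200):
    x = rng.uniform(size=2)
    centers = projected_sgd_step(centers, x, target(x), step=1e-3, delta=0.2, tau=1e-3, rng=rng)
```

Each neuron sees the others only through the residual, which is what makes the empirical distribution of the parameters the natural object to track as the number of neurons grows.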
We make the following assumptions:
- (G1)
, , and is -subgaussian.
- (G2)
Letting , , both and are differentiable with Lipschitz continuous derivative, namely . Further, we assume .
Theorem D.1.
Consider the general update (D.1) with initialization , under the conditions , above. For , let be the unique solution of the PDE (B.2) with initial and boundary conditions (B.3). Assume .
Then, for , any with and for , , the following holds with probability at least :
[TABLE]
where
[TABLE]
Theorem 5.1 follows as a special case of Theorem D.1 by considering and letting , and .
Proof.
Let denote the sigma algebra generated by and denote the empirical distribution of by . Note that
[TABLE]
We introduce two auxiliary processes , with initial conditions , as follows:
- •
The trajectories are i.i.d. copies of the nonlinear dynamics introduced in Appendix C, sampled at times . Namely, for any
[TABLE]
In particular, for any , .
- •
The trajectories are obtained by the Euler discretization of the non-linear dynamics:
[TABLE]
As above, is the solution of the PDE (B.2). Note that, again, the are i.i.d. although their distribution does not coincide with .
We construct these three processes on the same space by letting , and define the distances (for )
[TABLE]
Theorem C.5 yields, for ,
[TABLE]
Note that , take the form
[TABLE]
where are martingales with respect to the filtration : , , and , are -measurable. Explicitly
[TABLE]
Finally, , are corrections to satisfy the constraint . Indeed the above can be viewed as Skorokhod problems with unknowns and .
Using [Slo94, Theorem 1] (where we can set which is the tight constant in the Burkholder-Davis-Gundy inequality), we get
[TABLE]
where denotes the quadratic variation of the martingale , and is the total variation of the process . We then have
[TABLE]
Note that under the stated assumption the martingale increments are sub-Gaussian with variance proxy upper bounded by . Therefore, by using the moment generating function of distribution, we have
[TABLE]
Hence,
[TABLE]
By using the inequality , this implies, for ,
[TABLE]
Equivalently,
[TABLE]
By taking (which is allowed provided ), we obtain that
[TABLE]
We next consider the total variation of the process in Eq. (D.17). We have
[TABLE]
Using the Lipschitz property of , , we get
[TABLE]
For the second term, we get, by the triangle inequality,
[TABLE]
We next use the expression , and the fact that , to get
[TABLE]
Using once more the Lipschitz property of , and the symmetry of the distributions of , under permutations, we obtain
[TABLE]
Finally, and therefore the vector
[TABLE]
is sub-Gaussian, with variance proxy upper bounded by . This implies that , and therefore
[TABLE]
Substituting (D.23), (D.24), (D.27), (D.28) in Eq. (D.17), we obtain
[TABLE]
Using Eq. (D.10) and Gronwall's inequality, along with the fact that , this yields
[TABLE]
By using Eq. (D.10) again, we get
[TABLE]
By Markov's inequality, along with Jensen's inequality applied to the convex function , we have
[TABLE]
where in the third step we used (D.8) and (D.9). Set . Thus, we obtain
[TABLE]
with probability at least .
The bounds in Eq. (D.3) follow straightforwardly from Eq. (D.31) as in the proofs of Lemma 3.3 and 3.4 in the supplementary material of [MMN18]. ∎
Appendix E Regularity of the solutions of the PDE (3.9) ()
In this section we prove some standard regularity properties of the solutions of the PDE (3.9), for , and indeed for the more general PDE (B.2). First of all, we show that the weak solution of the PDE (B.2) is in fact strong, i.e., and the equation (B.2) holds pointwise. We will then prove upper bounds on and that are uniform in . These will be crucial in order to take the limit in the next section.
We start by proving a bound on the norm of . In the proofs of the two lemmas that follow, we assume without loss of generality that .
Lemma E.1 (Bound on norm).
Let be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Recall that has a density with respect to Lebesgue measure, denoted by . Then, there exists a constant such that, by letting , we have
[TABLE]
Proof.
Any solution of the PDE (B.2) satisfies Eq. (B.6). Given a measurable (Borel) function , denote by the function given by the right-hand side of (B.2). Let be the constant in the statement of Theorem G.1 (part 3) and let . We then have
[TABLE]
Hence
[TABLE]
Proceeding analogously for two different densities , we get
[TABLE]
Hence maps into itself, and is a contraction for . Therefore, it must have a unique fixed point in that coincides with the unique solution of PDE (B.2). Let . Then for that fixed point we have from Eq. (E.3)
[TABLE]
The desired claim follows by iterating this inequality times. ∎
Lemma E.2 (Strong solutions of PDE).
Let be a weak solution of the PDE (B.2) with initial and boundary conditions (B.3), and recall that, for any , this has a density , with . Fix . If , then .
Proof.
We prove the claim for . For larger values of , the proof is similar and it only requires to iterate the argument.
The proof uses the same bootstrap technique of [MMN18][Supplementary material, Lemma 6.7]. The only difference is that the Duhamel formula of Eq. (B.6) involves the Neumann heat kernel in instead of the heat kernel in .
Let and, for . For , , let be the generalized derivative of , and define the parabolic seminorm
[TABLE]
The proof of [MMN18][Supplementary material, Lemma 6.7] uses the following inequality from [LSU88][Chapter IV, Section 3, Eq. (3.1)]
[TABLE]
Furthermore, (G.11) of Theorem G.1 yields
[TABLE]
Since , we have that
[TABLE]
which immediately implies that
[TABLE]
The proof of [MMN18][Supplementary material, Lemma 6.7] can be repeated verbatim with (E.7) replaced by (E.10). ∎
As a consequence of the last lemma, the PDE (B.2) admits unique strong solutions with initial condition and Neumann boundary condition. We will use as a shortcut for . The rest of this appendix is devoted to proving further regularity results for , which will be crucial in the proofs provided in Appendix F. To emphasize the dependence of on , we will denote this solution by .
In what follows, we will set the initial condition at to be defined via , with given by Eq. (3.4).
It is useful to recall the definition of free energy, which is given by
[TABLE]
The following lemma provides an expression for the derivative of the free energy with respect to time. Such an expression immediately yields an upper bound on the norm of which is independent of .
Lemma E.3.
Let be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then,
[TABLE]
Proof.
By definition
[TABLE]
By differentiating along the solution of (B.2), we obtain
[TABLE]
∎
Corollary E.4.
Let be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then,
[TABLE]
where denotes the volume of the set .
Proof.
By Lemma E.3 we have . The claim follows by substituting the definition of and using . ∎
Remark E.1.
By Corollary E.4, we are able to provide a -free upper bound on . Specifically, and hence . We also have
[TABLE]
Note that
[TABLE]
Since as , there exists a such that for , . Thus, the term has a -free upper bound.
By Young’s inequality it only remains to give a -free upper bound on the quantity . Let us write
[TABLE]
Again, for , . Also, by Assumption (A5) and the fact that is compact, we have , which concludes the claim.
We next prove a -free upper bound on the gradient of .
Lemma E.5.
Let be the solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, the following bound holds:
[TABLE]
Proof.
Denote by the standard scalar product in . Then,
[TABLE]
By integrating (E.16) between 0 and , we obtain
[TABLE]
Hence, (E.15) follows from Corollary E.4. ∎
Remark E.2.
Note that by virtue of Lemma E.5, we are able to get a -free upper bound on the left-hand side of (E.15). Indeed, by definition of as per (B.4) and using Assumption (A3), we have the -free bound:
[TABLE]
In addition, by Remark E.1, has -free bound.
Appendix F Global convergence: Proof of Theorems 5.2 and 5.3
We start by showing that admits a limit in a suitable functional space as .
Lemma F.1 (Existence of converging subsequence).
Let be the unique solution of the PDE (B.2) with initial and boundary conditions (B.3). Then, the family is relatively compact in the space . In particular any sequence , admits a converging subsequence.
Proof.
This follows from the Arzelà-Ascoli theorem. Notice that is compact due to the compactness of . Therefore, it is sufficient to prove that the family is equicontinuous. Using the representation in terms of nonlinear dynamics (cf. Appendix C), we have
[TABLE]
Note that we omit for simplicity the dependence on . Recall that the nonlinear dynamic satisfies (for )
[TABLE]
By [Slo01, Theorem 2.2], we have
[TABLE]
where denotes the quadratic variation of , and the total variation of between times and . We thus have
[TABLE]
Hence, in order to prove uniform continuity, it is sufficient to show that, for , $\int_{s}^{t}\mathbb{E}\big\{|\boldsymbol{b}(\boldsymbol{X}_{r},r)|^{2}\big\}\,\mathrm{d}r\leq C$, where $C$ is bounded uniformly in . To see that this is the case, notice that
[TABLE]
and the claim follows from Lemma E.5. ∎
We have now proved that the sequence admits a converging subsequence, where as . Fix such a convergent subsequence and, with an abuse of notation, also denote it by . Let be its limit.
Recall that is supported in . Hence, is supported in and . We will now show that has the same limit as in .
Lemma F.2.
The sequence also converges in to .
Proof.
By Lemma F.1, the result is implied by the following claim:
[TABLE]
Note that, for bounded ,
[TABLE]
for any coupling of the probability distributions of and . Hence,
[TABLE]
As an application,
[TABLE]
Thus, it suffices to show that as .
Note that
[TABLE]
where the random variables and have distributions and , respectively. The quantity is , since has bounded absolute first moment, which completes the proof. ∎
We will now prove a stronger convergence result.
Lemma F.3 (Convergence in ).
The measure has a density, which is the limit in of the sequence .
Proof.
By Corollary E.4, we have that, for any , . Let us show that is a Cauchy sequence in .
As for every , its Fourier transform exists and we denote it by . Hence, by applying Parseval’s theorem, we have
[TABLE]
Fix and decompose the integral in the right-hand side of (F.11) as
[TABLE]
Consider the first term of (F.12). By Lemma F.2, and since by Jensen’s inequality for any two distributions , we have , as . Since for the complex exponential functions , by definition of 1-Wasserstein distance, the integrand in the first term converges pointwise to 0. Furthermore, the integrand is upper bounded by an integrable function, since for all and every . Hence, by dominated convergence, the first integral in (F.12) converges to 0.
As for the second term of (F.12), the following chain of inequalities holds:
[TABLE]
where in the last equality we have applied again Parseval’s theorem. By Lemma E.5, the integral in the right-hand side of (F.13) is upper bounded by a constant independent of . Therefore, as , the second term of (F.12) converges to 0.
As a result, is a Cauchy sequence in . Let be its limit. Furthermore, by Lemma F.2, has limit in . Therefore, the measures and coincide. This implies that the measure has for almost every the density , and the proof is complete. ∎
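The frequency-splitting step used in this proof can be summarized schematically as follows. This is only a sketch under stated assumptions: $f_m$ and $f_n$ are placeholder names for two elements of the sequence, viewed as square-integrable functions on $\mathbb{R}^d$ with square-integrable gradient, the time variable is suppressed, $R>0$ is the cutoff, and the unitary Fourier convention is assumed so that Parseval's identity holds without constants.

```latex
% Assumed convention: \hat f(\xi) = (2\pi)^{-d/2} \int e^{-i \xi \cdot x} f(x) \, dx,
% so that \|f\|_{L^2} = \|\hat f\|_{L^2} and
% \|\nabla f\|_{L^2}^2 = \int |\xi|^2 |\hat f(\xi)|^2 \, d\xi.
\begin{align*}
\|f_m - f_n\|_{L^2}^2
  &= \int_{|\xi| \le R} \big|\hat f_m(\xi) - \hat f_n(\xi)\big|^2 \,\mathrm{d}\xi
   + \int_{|\xi| > R} \big|\hat f_m(\xi) - \hat f_n(\xi)\big|^2 \,\mathrm{d}\xi , \\
\int_{|\xi| > R} \big|\hat f_m(\xi) - \hat f_n(\xi)\big|^2 \,\mathrm{d}\xi
  &\le \frac{1}{R^2} \int |\xi|^2 \big|\hat f_m(\xi) - \hat f_n(\xi)\big|^2 \,\mathrm{d}\xi
   = \frac{1}{R^2} \, \big\|\nabla (f_m - f_n)\big\|_{L^2}^2 .
\end{align*}
```

The low-frequency part is controlled by weak convergence of the measures, and the high-frequency part by a uniform bound on the gradient, so choosing the cutoff large first and then passing to the limit along the sequence makes both terms small.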
From now on, with an abuse of notation, we will use to denote also the density which is the limit in of the sequence .
Lemma F.4 (Convergence to a weak solution of the limit PDE).
Let be the limit in of the converging sequence . Then, is a weak solution of the PDE (A.1) with initial and boundary conditions (A.2).
Proof.
By Lemma F.3, we have that . Choose a test function , satisfying for all . In order to prove the claim, we need to show that (A.3) holds. Throughout the proof, we will let .
Recall that, for any , is a weak solution of the PDE (B.2) with initial and boundary conditions (B.3). Hence, by Definition B.1, we have that
[TABLE]
for any satisfying for all . Now, we set
[TABLE]
By definition of , we have that since . Furthermore, for all , immediately implies that for all , .
Recall that
[TABLE]
Thus, (F.14) can be rewritten as
[TABLE]
Since converges in to by Lemma F.1, we have that
[TABLE]
Furthermore, since , we have that
[TABLE]
Let us use the notation and . Again, we set for . We further define , which is a probability density on . Since in and , we have in as well. Hence
[TABLE]
where the last equality follows since , and uniformly in .
Furthermore, we have that
[TABLE]
The second term in the right-hand side of (F.22) is equal to 0 by integration by parts. The third integral in the right-hand side of (F.22) is upper bounded as follows:
[TABLE]
which converges to 0, as converges in to . The first term in the right-hand side of (F.22) is upper bounded as follows:
[TABLE]
The first term is upper bounded using
[TABLE]
Notice that
[TABLE]
where follows from an application of the Cauchy-Schwarz inequality. By Lemma E.5, we deduce that the right-hand side of (F.26) is bounded uniformly in . Thus, the first term of (F.24) converges to 0 because of Eq. (F.25). As for the second term of (F.24), we have that
[TABLE]
Recall that is supported on , and is bounded. In addition, since the kernel has bounded support, the diameter of the support of is at most times a constant. Consequently, the last term in the right-hand side of (F.27) is upper bounded by
[TABLE]
By using that and the result of Lemma E.5, we have that the last two integrals are bounded uniformly in . As a result, the right-hand side of (F.28) converges to 0, which implies that the right-hand side of (F.22) also converges to 0. By putting this fact together with (F.18) and (F.21), the desired result follows. ∎
We have now proved that converges to a weak solution of the limit PDE (A.1). In order to prove the uniqueness of the weak solutions of the limit PDE, we next prove a bound on , which along with Lemma A.2 proves the uniqueness claim.
Lemma F.5 (Uniform bound in ).
Assume that and consider the sequence . Then,
[TABLE]
where
[TABLE]
for some bounded constant .
Proof.
For simplicity, we indicate the norms by . For a function , we let be the vector with coordinates , with . The strategy is to first bound , for some , and then apply the Gagliardo-Nirenberg interpolation inequality (cf. Lemma H.3) to bound . Throughout this proof, we will use , and so on to denote constants that can depend on the domain , but do not depend on or .
Before proceeding, we need to establish some notation and definitions.
For a function and an integer , we denote its Sobolev norms by
[TABLE]
We will use the following relations on Sobolev norms (see [Oel01, Equation (1.14)]):
[TABLE]
Instead of bounding , we will bound the dominating quantity . To this end, we follow a similar strategy as in [Oel01]. Namely, we derive descriptions of the evolution of and . More precisely, we derive a recursive equation (on ) for the evolution of a suitably chosen linear combination of these two quantities.
Since is a solution of the PDE (B.2), we have
[TABLE]
Following the same lines as in the derivation of [Oel01, Equation (3.12)], we obtain
[TABLE]
where and are positive constants that depend on and is a constant which can be chosen arbitrarily.
We set for which we can upper bound the right-hand side of (F.34) as
[TABLE]
We now turn to the second quantity. Write
[TABLE]
where the last step follows from (F.33). Note that the first term on the right-hand side can be bounded as
[TABLE]
where the last step follows from Young’s convolution inequality and the fact that .
The second term in (F.36) can be bounded following the same lines as in the derivation of [Oel01, Equations (3.3) and (3.16)], which along with (F.37) gives
[TABLE]
Since , there exists a constant such that , . Using the particular choice of , we can upper bound the right-hand side of (F.38) as
[TABLE]
Define and let
[TABLE]
for . Clearly, by choice of . In addition, by applying Sobolev’s inequality (see e.g. [Oel01, Equation (1.12)]), we have
[TABLE]
where is a constant depending on . We let . Recall that the constant in (F.35) and (F.39) was arbitrary. We choose it in a way that . We then consider the evolution of the following linear combination of the two quantities we analyzed above. Note that by Equations (F.35) and (F.39), we have for ,
[TABLE]
where in we use the fact that , which follows immediately from (F.31); follows from the fact that for any function , , by Young’s inequality for convolution.
Another observation that will be used later is that
[TABLE]
This claim follows by repeating the same argument we had to derive (F.40), for . In this case, we have analogous equations to (F.35) and (F.39), where only the first two terms appear.
Next note that by (F.32), we have for ,
[TABLE]
where the last step is a result of (F.41) and (F.40). Let us stress that , , are constants that are independent of .
We further note that
[TABLE]
for . Here, the first step follows from the triangle inequality and Young’s inequality for convolution, along with the fact that . The second step follows from the definition of . Since , is uniformly bounded over . We denote the right-hand side of (F.43) by the constant . Substituting the bound (F.43) into (F.42) yields
[TABLE]
for . By employing a generalization of Gronwall’s inequality (cf. Lemma H.2 and Remark H.1) we get
[TABLE]
Therefore, for , with
[TABLE]
we have that
[TABLE]
with and . Note that , and are independent of , but depend on . Let . Then, by the choice of we have , and hence as a result of (F.47), we obtain
[TABLE]
Finally, by applying the Gagliardo-Nirenberg interpolation inequality (cf. Lemma H.3), we get
[TABLE]
for some constant , which completes the proof. ∎
Lemma F.6 (Convergence to the unique weak solution of limit PDE).
Let be the limit in of the converging sequence . Then, is the unique weak solution of the PDE (A.1) in with initial and boundary conditions (A.2).
Proof.
From Lemma F.3, we have that the sequence converges in to . Furthermore, by Lemma F.5, for any , where is a universal constant. By using Young’s convolution inequality, we also deduce that for any .
Note that is a reflexive Banach space. Thus, by applying the Banach-Alaoglu theorem, every bounded sequence in has a weakly convergent subsequence. This means that there exist a subsequence and a function such that, for any , we have
[TABLE]
Now, since is bounded, and are also in (as they are in ). Thus, is in , hence it is also in . As a result, we can pick and obtain
[TABLE]
Therefore, is the limit in of the sequence . By uniqueness of the limit, we conclude that . As a result, for any , which implies that . Thus, by Lemma F.4 and Lemma A.2, is the unique weak solution of the PDE (A.1) for . Note that is decreasing with . Thus, we can repeat the same argument with instead of and obtain that is the unique weak solution of the PDE (A.1) for . By iterating this procedure times, the result follows. ∎
At this point, we state and prove a lemma showing that the sequence converges in to .
Lemma F.7.
For almost all , the measure is the limit in of the sequence .
Proof.
The proof is similar to that of Lemma F.3. Suppose that , where is defined in the statement of Lemma F.5. Note that, for any , . Let us show that is a Cauchy sequence in .
As for every , its Fourier transform exists and we denote it by . Hence, by applying Parseval’s theorem, we have
[TABLE]
Fix and decompose the integral in the right-hand side of (F.51) as
[TABLE]
Consider the first term of (F.52). By Lemma F.1, and since by Jensen’s inequality for any two distributions , we have , as . Since for the complex exponential functions , by definition of 1-Wasserstein distance, the integrand in the first term converges pointwise to 0. Furthermore, the integrand is upper bounded by an integrable function, since for all and every . Hence, by dominated convergence, the first integral in (F.52) converges to 0.
As for the second term of (F.52), the following chain of inequalities holds:
[TABLE]
where in the last equality we have applied again Parseval’s theorem. In the proof of Lemma F.5, we provide an upper bound, which does not depend on , on the Sobolev norm of (see (F.47)). Thus, as , the second term of (F.52) converges to 0.
By iterating the argument times, we obtain that is a Cauchy sequence in for . Let be its limit. Furthermore, by Lemma F.1, has limit in . Therefore, the measure has for almost every the density , and the proof is complete. ∎
Theorem 5.2 follows from Lemma A.2, Lemma F.6 and Lemma F.7.
Let us define the free energy associated to the PDE (A.1) as
[TABLE]
As explained in Section 3.5, this limit free energy is displacement convex, and hence its gradient flow converges to the unique minimizer of (F.54). These facts are stated and proved formally in the theorem that follows.
Theorem F.8.
Assume that the initial condition . Then, the following results hold:
1. There exists a unique minimizer in , call it , of the free energy defined in (F.54).
2. For any , we have
[TABLE]
where is defined in (3.1).
3. For any and for almost any , we have
[TABLE]
where is defined in (3.1) and as .
Proof.
The proof follows from the results of [CJM+01]. The technical assumptions required by [CJM+01] are satisfied by the PDE (A.1), since is convex and bounded, the initial condition , and satisfies the assumptions (A2) and (A3). Note also that the condition coming from assumption (HV3) of [CJM+01] can be relaxed. In fact, adding a constant to does not change the entropy functional in [CJM+01, Eq. (3)] (which corresponds to the free energy (F.54)) and the PDE in [CJM+01, Eq. (46)] (which corresponds to the PDE (A.1)).
The uniqueness of the minimizer follows from [CJM+01, Lemma 6], which proves the first result. Since is the unique weak solution of the PDE (A.1) with initial and boundary conditions (A.2), it coincides with the unique, non-negative mass-preserving solution of [CJM+01, Theorem 16]. Thus, the inequality (F.55) readily follows from [CJM+01, Theorem 16].
It remains to prove inequality (F.56). By definition of free energy, we obtain
[TABLE]
Recall that, by Lemma F.7, converges to in . Consequently, by using the triangle inequality, we have that the term tends to 0 as .
In order to complete the proof, it remains to show that tends to 0 as . To do so, define
[TABLE]
Note that . In fact, suppose that and . Then, one of and is and the other is . Consequently, and . This immediately implies that
[TABLE]
We will now upper bound the three integrals in the RHS of (F.59). As for the first term, note that
[TABLE]
where denotes the volume of . Furthermore,
[TABLE]
Note that for and for . Thus, the RHS of (F.61) is upper bounded by
[TABLE]
By Lemma F.7, for almost all , converges to in . Thus, by (F.60), tends to 0 as . By Lemma F.6, for almost all . Furthermore, by Lemma F.5, the quantity has a -free upper bound for . As a result, for almost all , the first integral in (F.59) tends to 0 as . By iterating this argument times, we conclude that for almost all , the first integral in (F.59) tends to 0 as .
In order to bound the second integral in (F.59), we write
[TABLE]
where in the last inequality we have applied [CT06, Theorem 17.3.3], since , by definition of . Note that
[TABLE]
Thus, the RHS of (F.63) is upper bounded by
[TABLE]
where in the last step we have used the Cauchy-Schwarz inequality. By Lemma F.7, for almost all , converges to in . As a result, the second integral in (F.59) also tends to 0 as .
Finally, let us bound the third integral in (F.59). Define . Then, for ,
[TABLE]
Thus,
[TABLE]
where in the last step we have used the Cauchy-Schwarz inequality. By Lemma F.7, for almost all , converges to in . By Lemma F.6, for almost all . Furthermore, by Lemma F.5, the quantity has a -free upper bound for . As a result, for almost all , the third integral in (F.59) tends to 0 as . By iterating this argument times, we conclude that for almost all , the third integral in (F.59) tends to 0 as , and the proof is complete. ∎
At this point, we are ready to provide the proof of Theorem 5.3.
Proof of Theorem 5.3.
By substituting with in Theorem 5.1, we have that with probability at least
[TABLE]
where is defined in (5.2). The risk can be upper bounded as
[TABLE]
where as , since both and converge in to . Furthermore, by Theorem F.8,
[TABLE]
where as and we recall that denotes the volume of the set .
Note that
[TABLE]
since is the minimizer of . By combining (F.70) with (F.69), we deduce that
[TABLE]
where in the last step we use again the result of Theorem 5.1 and the fact that tends to 0 as .
By optimizing over in (F.67), we will set as in (5.8). We also let . Then, the result follows by combining (F.67), (F.68) and (F.71). ∎
Appendix G Heat kernel in bounded domains with Neumann boundary
Given the domain (compact, with boundary ), we denote by the associated heat kernel, with Neumann boundary conditions. We collect here a few well known facts about this kernel (see, e.g., [Tay13, Section 6.1]).
The heat kernel can be defined as a function satisfying
[TABLE]
We will also denote by the heat kernel on , namely
[TABLE]
The probabilistic interpretation of is as follows (see, e.g., [BGL13]). Let denote expectation with respect to a Brownian motion , with initial condition , and reflected at (see Section C for definitions of this process, following [Tan79]). Then, for any bounded continuous function ,
[TABLE]
Finally, can be viewed as the kernel representation of the bounded operator in . We have
[TABLE]
Hence can be represented in terms of the eigenfunctions , and eigenvalues , of ,
[TABLE]
Here , with , and .
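To make the eigenfunction representation concrete, here is a minimal numerical sketch in the simplest case, which is an assumption made for illustration rather than the paper's setting: the one-dimensional domain [0, 1], the heat equation written as u_t = u_xx, and the standard Neumann eigenpairs phi_0 = 1, phi_k(x) = sqrt(2) cos(k pi x) with eigenvalues (k pi)^2. The function name and the truncation level `n_modes` are hypothetical.

```python
import numpy as np

def neumann_heat_kernel(x, y, t, n_modes=200):
    """Truncated eigen-expansion of the Neumann heat kernel on [0, 1] for u_t = u_xx.

    Returns the matrix K[a, b] = 1 + sum_k exp(-(k pi)^2 t) phi_k(x_a) phi_k(y_b),
    with phi_k(z) = sqrt(2) cos(k pi z).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # shape (A, 1)
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]   # shape (B, 1)
    k = np.arange(1, n_modes + 1)[None, :]                   # shape (1, n_modes)
    weights = np.exp(-((k * np.pi) ** 2) * t)
    phi_x = np.sqrt(2.0) * np.cos(k * np.pi * x)
    phi_y = np.sqrt(2.0) * np.cos(k * np.pi * y)
    return 1.0 + (phi_x * weights) @ phi_y.T

grid = np.linspace(0.0, 1.0, 401)
K = neumann_heat_kernel(grid, grid, t=0.01)

# Trapezoid rule in y for each fixed x: total mass is conserved under Neumann conditions.
h = grid[1] - grid[0]
mass = h * (K.sum(axis=1) - 0.5 * (K[:, 0] + K[:, -1]))
```

In this example the kernel matrix is symmetric and each row integrates to one, reflecting the self-adjointness of the generator and the fact that reflected Brownian motion never leaves the domain.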
Remark G.1.
Since is self-adjoint in , it follows that is symmetric, namely , and therefore it satisfies
[TABLE]
Theorem G.1.
The Neumann heat kernel satisfies the following properties:
1. We have that
[TABLE]
where .
2. For any , .
3. We have that, for a constant ,
[TABLE]
Proof.
Substituting into Eqs. (G.1) to (G.3) yields, for ,
[TABLE]
Thus satisfies the heat equation in and hence is inside this domain (see, e.g., [Eva09, Chapter 2, Theorem 8], which refers to Dirichlet boundary condition, but applies equally well to the Neumann case). By symmetry, we have the claimed continuity in , thus proving point 1.
Claim 2 follows by the same decomposition.
Finally, claim 3 follows from Lemma 3.1 in [WY13]. ∎
Appendix H Some useful technical lemmas
Lemma H.1 (Displacement convexity of quadratic functionals).
Let be twice differentiable with , , and define by . Then is displacement convex if and only if is convex.
Proof.
Proposition 7.4 in [San15] proves that convexity of implies displacement convexity of . To prove the converse implication, let , and consider the two probability distributions and . For , the geodesic path connecting these distributions is , . Substituting in the definition of , we get
[TABLE]
Hence, displacement convexity implies . Since this holds for all , we obtain for all , which in turn implies that is convex (by a continuity argument, it is sufficient to lower bound the Hessian everywhere except at a point). ∎
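The two directions of the lemma can be checked numerically in one dimension. The sketch below assumes, for illustration only, that the functional is the pairwise interaction energy of an empirical measure, namely the average of U(x_i - x_j) over all pairs, and it uses the fact that for equal-weight atoms on the line the optimal coupling is the sorted (monotone) one. The function names and the two test potentials are hypothetical.

```python
import numpy as np

def interaction_energy(points, U):
    """Average of U(x_i - x_j) over all ordered pairs of atoms."""
    diffs = points[:, None] - points[None, :]
    return U(diffs).mean()

def energy_along_geodesic(x, y, U, ts):
    """Energy along the displacement interpolation between two 1D empirical measures."""
    x, y = np.sort(x), np.sort(y)   # monotone coupling, optimal in one dimension
    return np.array([interaction_energy((1.0 - t) * x + t * y, U) for t in ts])

def second_differences(values):
    """Discrete second derivative (up to a factor h^2); nonnegative for convex data."""
    return values[:-2] - 2.0 * values[1:-1] + values[2:]

ts = np.linspace(0.0, 1.0, 21)
rng = np.random.default_rng(1)

# Convex potential: the energy is convex along the geodesic.
x = rng.normal(size=40)
y = rng.normal(loc=2.0, scale=0.5, size=40)
sd_convex = second_differences(energy_along_geodesic(x, y, lambda z: z ** 2, ts))

# Non-convex (double-well) potential with two-atom measures, as in the converse argument:
# the separation of the atoms moves through the region where U'' < 0.
x2 = np.array([-0.05, 0.05])
y2 = np.array([-0.25, 0.25])
sd_doublewell = second_differences(
    energy_along_geodesic(x2, y2, lambda z: (z ** 2 - 1.0) ** 2, ts)
)
```

With the quadratic potential all second differences are nonnegative, while the symmetric two-atom example with the double-well potential produces strictly negative ones, which mirrors the role of two-point distributions in the converse direction of the proof.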
Lemma H.2 (A Gronwall type inequality [Bih56]).
Let be a continuous function that satisfies the inequality
[TABLE]
where , is continuous and is continuous and monotone-increasing. Then, the following holds
[TABLE]
with given by
[TABLE]
Remark H.1.
To derive Equation (F.45), we use Lemma H.2 with , , .
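A small numerical check may help to see how a bound of this type behaves. The sketch assumes the classical Bihari form, in which u(t) <= alpha + int_0^t beta(s) omega(u(s)) ds implies u(t) <= G^{-1}(G(alpha) + int_0^t beta(s) ds) with G an antiderivative of 1/omega. The specific choices beta = 1 and omega(u) = u^2 are made for the example only and are not the ones used in Remark H.1; with them, G(x) = -1/x and the bound equals alpha / (1 - alpha t) for t < 1/alpha.

```python
import numpy as np

def bihari_bound(alpha, t):
    """Closed-form Bihari bound for beta = 1, omega(u) = u^2 (illustrative choice).

    G(x) = -1/x, hence G^{-1}(G(alpha) + t) = alpha / (1 - alpha * t), valid for t < 1/alpha.
    """
    return alpha / (1.0 - alpha * t)

def euler_extremal(alpha, t_max, n_steps):
    """Forward Euler for the extremal ODE u' = u^2, u(0) = alpha."""
    h = t_max / n_steps
    u = np.empty(n_steps + 1)
    u[0] = alpha
    for k in range(n_steps):
        u[k + 1] = u[k] + h * u[k] ** 2
    return np.linspace(0.0, t_max, n_steps + 1), u

ts, u = euler_extremal(alpha=1.0, t_max=0.5, n_steps=1000)
bound = bihari_bound(1.0, ts)
```

Because the exact solution is convex and increasing, the forward Euler iterates stay below it, so the computed trajectory sits under the closed-form bound and approaches it as the step size shrinks.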
Lemma H.3 (Gagliardo-Nirenberg interpolation inequality, cf. Theorem 1.5.2 of [CM12]).
Fix and a positive integer. Let and . For integer , , and (with the exception if is a non-negative integer), define by
[TABLE]
Then and satisfies
[TABLE]
with finite arbitrary and and are independent of . The constant is independent of , while as . In particular, the choice is admissible if .
References
- [AB09] Martin Anthony and Peter L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 2009.
- [AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.
- [Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
- [Bar93] Andrew R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), no. 3, 930–945.
- [Bar98] Peter L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (1998), no. 2, 525–536.
- [BGL13] Dominique Bakry, Ivan Gentil, and Michel Ledoux, Analysis and geometry of Markov diffusion operators, vol. 348, Springer Science & Business Media, 2013.
- [Bih56] Imre Bihari, A generalization of a lemma of Bellman and its application to uniqueness problems of differential equations, Acta Mathematica Hungarica 7 (1956), no. 1, 81–94.
- [BJW18] Ainesh Bakshi, Rajesh Jayaram, and David P. Woodruff, Learning two layer rectified neural networks in polynomial time, arXiv:1811.01885 (2018).
