Convergence rates for the stochastic gradient descent method for non-convex objective functions
Benjamin Fehrman, Benjamin Gess, Arnulf Jentzen

Abstract
We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective functions. In particular, the results are applicable to simple objective functions arising in machine learning.
Benjamin Fehrman (1), Benjamin Gess (2), and Arnulf Jentzen (3)
(1) Mathematical Institute, University of Oxford,
Oxford, United Kingdom,
e-mail: [email protected]
(2) Max Planck Institute for Mathematics in the Sciences,
Leipzig, Germany,
Fakultät für Mathematik, Universität Bielefeld,
Bielefeld, Germany,
e-mail: [email protected]
(3) Seminar for Applied Mathematics, Department of Mathematics,
ETH Zurich, Zurich, Switzerland,
e-mail: [email protected]
Contents
- 7.1 A four-parameter network with a linear activation function
- 7.2 A two parameter network with the ReLU activation function
1 Introduction
Stochastic gradient descent algorithms (SGD), going back to [46], are the most common way to train neural networks. Despite their relevance to machine learning and much recent interest, estimates on their rate of convergence have so far only been shown under global contraction or convexity assumptions on the objective function that are often not satisfied by examples arising in machine learning. Indeed, citing from [52], “While SGD has been rigorously analyzed only for convex loss functions […], in deep learning the loss is a non-convex function of the network parameters, hence there are no guarantees that SGD finds the global minimizer.” In the present work, we prove the local convergence of SGD to the set of global minima of the objective function while avoiding such a global convexity or contractivity assumption. The relevance of the obtained results is demonstrated by the application to the training of (simple) neural networks.
Stochastic gradient descent methods are used to numerically minimize functions of the form
[TABLE]
for some product measurable function and some random variable on some probability space . The analysis of SGD has attracted considerable attention in the literature (cf., e.g., [2, 4, 8, 13, 24, 35, 51] and the references therein). In [13, 24], the convergence of SGD with rates assuming the following contraction property for the objective function , which is classical in stochastic approximation theory, was analyzed: There is an and a zero of such that for every it holds that
[TABLE]
In particular, this contraction property implies the uniqueness of the zero of and thus the uniqueness of local minima of . This is in stark contrast to actual objective functions arising in the training of neural networks, which are expected to show rich sets of local minima and saddle points/plateaus. Consequently, it is vital for the application to machine learning to avoid such global contraction assumptions. In addition, for example due to the positive homogeneity of the ReLU function, the objective functions typically satisfy certain symmetries, implying that global (and local) minima are neither isolated nor unique, but form (possibly non-compact) manifolds. Indeed, this is demonstrated for simple neural networks in Section 7 below. We are therefore led to the task of analyzing the convergence properties of SGD locally at sets of minima. (Footnote 1: We emphasize that this is distinct from the recent works [8, 30, 53], where the global convergence of the gradient of the objective function to zero has been shown for SGD and AdaGrad. This does not imply the local convergence to minima, since the gradient also vanishes at saddle points and on plateaus.) In the present work we provide estimates on the rate of convergence for SGD under assumptions avoiding a contraction property like (1.2).
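The symmetry remark can be made concrete with a small Python sketch. The two-parameter model `theta2 * relu(theta1 * x)`, the target function, and the sample points are assumptions chosen for illustration and are not necessarily the network analyzed in Section 7. Because the ReLU function is positively homogeneous, rescaling `(theta1, theta2)` to `(scale * theta1, theta2 / scale)` with `scale > 0` leaves the loss unchanged, so no minimizer of this toy loss is isolated.

```python
def relu(z):
    return max(z, 0.0)

def network(theta, x):
    """Hypothetical two-parameter ReLU model: theta2 * relu(theta1 * x)."""
    theta1, theta2 = theta
    return theta2 * relu(theta1 * x)

def target(x):
    """Illustrative target function (an assumption of this sketch)."""
    return relu(x)

def loss(theta, points):
    """Mean squared error over a fixed list of sample points."""
    return sum((target(x) - network(theta, x)) ** 2 for x in points) / len(points)

points = [-1.0, -0.5, 0.25, 0.5, 1.0]

# A minimizer: theta1 * theta2 = 1 with theta1 > 0 reproduces the target exactly.
theta = (2.0, 0.5)
scale = 4.0  # any positive rescaling factor
theta_rescaled = (scale * theta[0], theta[1] / scale)
loss_original = loss(theta, points)
loss_rescaled = loss(theta_rescaled, points)

# The invariance also holds away from the minima.
other = (1.5, 0.3)
other_rescaled = (scale * other[0], other[1] / scale)
loss_other = loss(other, points)
loss_other_rescaled = loss(other_rescaled, points)
```

In this toy model the set of minimizers with `theta1 > 0` is the curve `theta1 * theta2 = 1`, a one-dimensional, non-compact set rather than a single point.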
Theorem 1.1.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that $\mathbb{E}\big[|F(\theta,X_{1,1,1})|^{2}\big]<\infty$, let be the function which satisfies for every that $f(\theta)=\mathbb{E}\big[F(\theta,X_{1,1,1})\big]$, let satisfy that
[TABLE]
assume for every that is a continuously differentiable function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big[|F(\theta,X_{1,1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1,1})|^{2}\big]<\infty$, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , let , , be i.i.d. random variables, assume for every , that is continuous uniformly distributed on , assume for every , that and are independent, assume for every , that
[TABLE]
and for every , let be a random variable which satisfies that
[TABLE]
(cf. Lemma 5.11 below). Then there exist , such that for every , , it holds that
[TABLE]
Theorem 1.1 is an immediate consequence of Theorem 5.12 in Section 5 below. The statement of Theorem 1.1 should be interpreted in the following way. We aim to minimize an objective function , where we assume that the set of minima
[TABLE]
is somewhere locally smooth in the sense that there exists an open set such that
[TABLE]
We furthermore assume that is locally in a neighborhood of and that the Hessian is maximally nondegenerate on in the sense that for every it holds that
[TABLE]
Let be a probability space, let be a measurable space, and let , , be i.i.d. random variables. We assume that there exists a measurable function which satisfies for every that
[TABLE]
In particular, since it is oftentimes the case in practice that the deterministic gradient cannot be computed or cannot be efficiently computed, the random gradient provides an efficiently computable stochastic approximation.
The initial data of SGD is sampled from a bounded open set which satisfies that . That is, for every mini-batch size and , the initial data , , are uniformly distributed on , independent, and independent of the driving noise , . We then compute independent solutions to SGD in the sense that for every it holds that
[TABLE]
For a fixed terminal time , for a sampling size , the output of the algorithm at this point is the collection of values , . It remains to identify the value , , that minimizes the objective function.
Much as in the case of the gradient, since the objective function cannot be practically computed, for a terminal time , for a mini-batch size , we introduce the mini-batch approximation which satisfies for every that
[TABLE]
We then identify the value , , that minimizes in the sense that we compute a random variable which satisfies that
[TABLE]
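The procedure described above (independent mini-batch SGD runs from uniformly sampled initial data, followed by selection of the run with the smallest mini-batch estimate of the objective) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the toy random objective `F(theta, x) = ((theta1 * theta2 - 1) * x) ** 2` with `x` uniform on [0, 1], the sampling box for the initial data, and the step sizes `r / n ** rho` with the hypothetical values `r = 0.2` and `rho = 0.75`. The precise recursion and parameter ranges are those of the theorem statement, not of this sketch.

```python
import random

def F(theta, x):
    """Toy random objective; its mean is minimized on the curve theta1 * theta2 = 1."""
    a, b = theta
    return ((a * b - 1.0) * x) ** 2

def grad_F(theta, x):
    """Gradient of F with respect to theta."""
    a, b = theta
    residual = (a * b - 1.0) * x
    return (2.0 * residual * b * x, 2.0 * residual * a * x)

def sgd_run(rng, theta0, n_steps, batch_size, r=0.2, rho=0.75):
    """One SGD trajectory with mini-batch gradients and step sizes r / n**rho."""
    theta = theta0
    for n in range(1, n_steps + 1):
        g0 = g1 = 0.0
        for _ in range(batch_size):
            d0, d1 = grad_F(theta, rng.random())
            g0 += d0
            g1 += d1
        step = r / (n ** rho)
        theta = (theta[0] - step * g0 / batch_size,
                 theta[1] - step * g1 / batch_size)
    return theta

def select_best(rng, candidates, batch_size):
    """Pick the candidate with the smallest mini-batch estimate of the objective."""
    samples = [rng.random() for _ in range(batch_size)]
    def estimate(theta):
        return sum(F(theta, x) for x in samples) / batch_size
    return min(candidates, key=estimate)

rng = random.Random(0)
K, N, M = 8, 2000, 16  # number of runs, number of steps, mini-batch size
starts = [(rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)) for _ in range(K)]
finals = [sgd_run(rng, start, N, M) for start in starts]
best = select_best(rng, finals, M)
```

The selection step uses fresh samples rather than the exact objective, mirroring the role of the mini-batch approximation in the text.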
The conclusion of Theorem 1.1 estimates the probability that is an minimizer of the objective function. Precisely, there exist , such that for every , , it holds that
[TABLE]
The limit corresponds to computing the minimizer of exactly. If this can be done efficiently, then the first term on the right-hand side of (1.14) vanishes.
The constant , which we compute precisely in Theorem 5.12 below, quantifies two sources of error: the probability that the initial condition lies outside of a basin of attraction and a portion of the probability that SGD beginning in a basin of attraction fails to converge. In Remark 5.61 below and Section 6, we prove that the restriction can be extended to under the additional assumption that is a compact subset of . Finally, it is not necessary to assume that is continuously differentiable, and this assumption can be replaced with the assumption that for every we have that is a locally Lipschitz continuous function of .
We observe that the computational efficiency of the algorithm can be estimated using Theorem 1.1. In particular, it follows from Corollary 5.13 below that there exist constants , , such that for every , for which satisfy that
[TABLE]
it holds that
[TABLE]
For every bounded open set which satisfies that is non-empty, for every , the computational efficiency of the algorithm satisfies that
[TABLE]
It follows from (1.15) that there exists which satisfies for every that
[TABLE]
where the constant depends on the computational cost of computing and but not on the running time , mini-batch size , or sampling size . Furthermore, we prove in Corollary 6.5 below that the computational efficiency can be improved in the case that the local manifold of minima is compact.
The estimate of Theorem 1.1 quantifies two sources of error. The first term on the right-hand side of (1.6) quantifies the error introduced by the mini-batch approximation of the objective function. In the case that the objective function can be efficiently computed, this error can be avoided by computing which satisfies that
[TABLE]
for which it follows from Corollary 5.10 below that
[TABLE]
The second term on the right-hand side of (1.6) quantifies the failure of the solutions , , to converge to within distance of the local manifold of minima at time . We quantify this error in Corollary 5.9 below, where we prove that
[TABLE]
The methods of Corollary 5.13 below prove that there exist constants , , such that for every , for which satisfy that
[TABLE]
it holds that
[TABLE]
For every bounded open set with , for every , the computational efficiency of (1.21) satisfies that
[TABLE]
It follows from (1.22) that for every bounded open set with there exists such that for every it holds that
[TABLE]
In particular, the computational efficiency yields a significant improvement when compared with a random sampling algorithm. Precisely, suppose that is a bounded open subset with . Then, since is a -dimensional, -submanifold of , for the Lebesgue-Borel measure , there exists which satisfies that
[TABLE]
If , , are i.i.d. random variables that are continuous uniformly distributed on , it follows from (1.26) that for every it holds that
[TABLE]
For every , , in order to ensure that
[TABLE]
it is necessary to choose which satisfies that
[TABLE]
In particular, there exists which satisfies for every that
[TABLE]
The computational efficiency of the random sampling algorithm is therefore worse than whenever the codimension is greater than . This condition is expected to be satisfied in all practical machine learning applications, where the dimension is large, since for we have . In particular, this condition is satisfied for any if there exists a unique minimum and .
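The dependence of random sampling on the codimension can be illustrated with a short Monte Carlo sketch in Python. The setting is an assumption chosen for simplicity: samples are uniform on the square [-1, 1]^2, and the set of minima is either the line segment {y = 0} (codimension 1) or the single point (0, 0) (codimension 2). The probability that one sample lands within distance `eps` of the set is exactly `eps` in the first case and `pi * eps**2 / 4` in the second, so the number of samples needed for a hit grows like `eps` raised to minus the codimension.

```python
import math
import random

def hit_fraction(rng, n_samples, eps, distance):
    """Fraction of uniform samples on [-1, 1]^2 within eps of the target set."""
    hits = 0
    for _ in range(n_samples):
        point = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if distance(point) < eps:
            hits += 1
    return hits / n_samples

def dist_to_line(p):
    """Distance to the segment {y = 0} inside the square (codimension 1)."""
    return abs(p[1])

def dist_to_point(p):
    """Distance to the origin (codimension 2)."""
    return math.hypot(p[0], p[1])

rng = random.Random(1)
eps = 0.1
n_samples = 200_000
frac_line = hit_fraction(rng, n_samples, eps, dist_to_line)
frac_point = hit_fraction(rng, n_samples, eps, dist_to_point)

exact_line = eps                        # strip of area 4 * eps in a square of area 4
exact_point = math.pi * eps ** 2 / 4.0  # disc of area pi * eps^2 in a square of area 4
```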
In a non-globally stable setting, i.e. when (1.2) is not satisfied, several obstacles appear in the proof of local convergence to minima and in the estimation of the rate for SGD. In particular, even if one supposes that a local minimum is isolated and that (1.2) holds in a neighborhood of the minimum, the global analysis put forward in [24] is not immediately localizable, since deterministic bounded sets are not invariant under the dynamics of SGD. On the contrary, with probability one each realization of SGD will eventually leave the basin of attraction , outside of which no control on the dynamics can be expected. Therefore, it becomes necessary to provide estimates on the probability that SGD leaves favorable neighborhoods. Second, as pointed out above, (local) minima are not expected to appear in an isolated manner, but as (local) manifolds. This needs to be accounted for in the mathematical analysis, giving rise to a quantitative analysis inspired by the center manifold theorem, which in turn relies on estimates on the probability of SGD leaving favorable neighborhoods in the normal and tangential directions separately. In order to derive estimates on the rate of convergence, these steps are performed in a quantitative way in the proofs of this work. An intriguing observation is that the mathematical analysis of the rate of convergence relies on the use of mini-batches in order to control the loss of iterates in non-attracted regions.
In Sections 3 and 4 we provide an analysis of the deterministic gradient descent algorithm in continuous and discrete time in order to highlight the relevance of the assumptions in simplified settings. We emphasize again that, while the deterministic algorithms converge quickly, the computational costs of computing typically make the implementation of such algorithms infeasible. This is particularly the case when takes the form (1.33) below for a measure that is the empirical measure of a large training set. An advantage of the stochastic algorithm is that, provided is not too large, the mini-batch gradient can be computed efficiently in the case of (1.34) below. The disadvantage is that, inside an attracting set, the algebraic convergence of SGD in expectation is much slower than the exponential convergence of its deterministic counterpart.
1.1 Literature
The stochastic gradient descent algorithm has attracted considerable interest in the literature, and a complete account of the existing results would go beyond the scope of this article. We will therefore restrict ourselves to works that seem most relevant to the current results and refer to the following works and the references therein for further details: See, for example, [2, 3, 4, 6, 7, 9, 10, 14, 23, 28, 34, 39, 40, 42, 43, 44, 49, 50, 51, 54, 56] and the references mentioned therein for numerical simulations and proofs of convergence rates for SGD type optimization algorithms, [5, 8, 47] and the references mentioned therein for overview articles on SGD type optimization algorithms, and [11, 12, 18, 19, 21, 22, 26, 27, 48] and the references mentioned therein for applications involving neural networks and SGD type optimization algorithms.
The case of a convex loss function is well understood under mild further assumptions; for example, rates of convergence of the order for SGD have been established in [8, 56]. In the case of a strongly convex objective function, these can be improved to ; see [20, 37, 38].
The case of a non-convex objective function is considerably less well understood. In this case we have to distinguish two classes of results: The first class proves the convergence to zero (with or without rates) of the gradient of the objective function, thus implying the convergence to a critical point. The second class of results proves the convergence of the values of the loss function to their global minimum. The second class of results is stronger and is not implied by the first class, since results of the first class do not exclude convergence to saddle points or local minima. In the case of a non-convex loss function, rather complete results are known concerning the minimization of the gradient of the loss function. For example, the convergence of the gradient to zero with rates was shown in Lei, Hu, Li, & Tang [29] assuming a Hölder-regularity condition on the gradient of the loss function. This generalizes previous work of Ghadimi, Lan, & Zhang [17] which required a second moment boundedness condition, which in turn was generalized by previous works of Ghadimi & Lan [16] and Reddi, Hefny, Sra, Poczos, & Smola [45]. We note that while convergence to the global minimum with rates was obtained in [17] for the convex case, no results on the convergence of the value of the loss function have been shown in the non-convex case.
The convergence of the stochastic gradient descent method has been analysed in the literature under several additional assumptions replacing (strong) convexity, such as the error bounds condition in Luo & Tseng [33], essential strong convexity [31], weak strong convexity [36], the restricted secant inequality [55], and the quadratic growth condition in Anitescu [1]. In these works, linear convergence rates are shown. In a notable contribution, Karimi, Nutini, & Schmidt [25] have shown that all of these conditions imply the Polyak-Lojasiewicz (PL) inequality, introduced in Lojasiewicz [32] and Polyak [41], under which linear convergence of SGD is proven in [25], thus generalizing these previous works. Recently, further progress was made in Lei, Hu, Li, & Tang [29], where a boundedness assumption on the gradient of the objective function, required in [25], was relaxed. We note that, while the PL condition requires neither convexity nor the uniqueness of global minimizers, it does exclude the existence of non-global local minima, that is, assuming the PL condition each local minimum is a global minimum. Therefore, it is not implied by the assumptions made in the current work.
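For reference, the PL inequality is commonly stated as follows; the constant \mu and the symbol f^{*} are notation introduced here for illustration and do not appear in the text above. The second display records why the inequality forces every critical point, and in particular every local minimum, to be a global minimum.

```latex
% PL inequality with constant \mu > 0 and f^{*} = \inf_{\theta} f(\theta):
\[
  \tfrac{1}{2}\,|\nabla f(\theta)|^{2} \;\ge\; \mu\,\bigl(f(\theta) - f^{*}\bigr)
  \qquad \text{for all } \theta \in \mathbb{R}^{d}.
\]
% Consequence at a critical point, i.e. when \nabla f(\theta) = 0:
\[
  0 \;=\; \tfrac{1}{2}\,|\nabla f(\theta)|^{2}
    \;\ge\; \mu\,\bigl(f(\theta) - f^{*}\bigr) \;\ge\; 0
  \quad\Longrightarrow\quad f(\theta) = f^{*}.
\]
```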
1.2 Structure of the work
The paper is organized as follows. We will use the local smoothness of , the local smoothness of the objective function , and the maximal nondegeneracy of the Hessian to identify a basin of attraction for SGD. In Section 2, we present the geometric preliminaries that are used to identify this set. In particular, in Proposition 2.1 below we recall the existence of projections in local neighborhoods of , in Proposition 2.7 below we recall the existence of local tubular neighborhoods about , in Lemma 2.8 below we prove a useful decomposition of into components normal and tangential to , and in Lemma 2.9 below we prove a contraction estimate that will be used to obtain a convergence rate for the gradient descent algorithms in discrete time.
In Section 3, for objective functions that satisfy the conditions of Theorem 1.1, we analyze the convergence of the deterministic gradient descent algorithm in continuous time , , that satisfies for every that
[TABLE]
We prove in Proposition 3.1 below that the local smoothness of , the local smoothness of , and the nondegeneracy of the Hessian imply the existence of a neighborhood such that for every the solution , , converges exponentially fast to . However, since in general neither nor are practically computable, and since continuous gradient descent cannot be implemented, the purpose of this section is to explain in a simplified setting the role of the assumptions and the geometric arguments from Section 2.
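A one-line computation explains why continuous gradient descent can only decrease the objective. The display below is a sketch under the assumption that the continuous-time dynamics take the standard gradient-flow form \dot\theta_t = -\nabla f(\theta_t); it uses only the chain rule.

```latex
\[
  \frac{d}{dt}\, f(\theta_t)
  \;=\; \bigl\langle \nabla f(\theta_t),\, \dot\theta_t \bigr\rangle
  \;=\; -\,|\nabla f(\theta_t)|^{2}
  \;\le\; 0 .
\]
```

The monotonicity alone does not give a rate. The rate comes from the nondegeneracy of the Hessian in directions normal to the manifold of minima, which is the role of the geometric arguments from Section 2.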
In Section 4, for objective functions that satisfy the conditions of Theorem 1.1, we analyze the convergence of the deterministic gradient descent algorithm in discrete time , , that satisfies for , , for every that
[TABLE]
We prove in Proposition 4.1 below that there exists a neighborhood such that for every the solution , , converges exponentially quickly to . However, while discrete gradient descent yields an implementable algorithm, the computational costs of and in general make it practically infeasible. The purpose of this section is instead to explain how the geometric preliminaries of Section 2, and in particular Lemma 2.8 and Lemma 2.9, are applied in a simplified discrete setting.
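The fast local convergence of deterministic gradient descent to a manifold of minima can be observed on a toy example. The Python sketch below is an illustration under stated assumptions: the objective `f(a, b) = (a * b - 1) ** 2 / 2`, whose minima form the curve `a * b = 1`, the starting point, and the constant step size are all hypothetical choices, and the recursion in the text may use a different step-size schedule.

```python
def grad_f(theta):
    """Gradient of the toy objective f(a, b) = (a * b - 1)^2 / 2."""
    a, b = theta
    residual = a * b - 1.0
    return (residual * b, residual * a)

def gradient_descent(theta0, step, n_steps):
    """Plain gradient descent; also records |a * b - 1| after every step."""
    theta = theta0
    residuals = [abs(theta[0] * theta[1] - 1.0)]
    for _ in range(n_steps):
        g = grad_f(theta)
        theta = (theta[0] - step * g[0], theta[1] - step * g[1])
        residuals.append(abs(theta[0] * theta[1] - 1.0))
    return theta, residuals

theta_final, residuals = gradient_descent((1.5, 1.2), step=0.1, n_steps=100)

# Ratios of successive residuals; values bounded away from 1 indicate
# geometric (exponential) decay of the distance-like quantity |a * b - 1|.
ratios = [residuals[i + 1] / residuals[i] for i in range(20)]
```

The iterates converge to some point of the curve `a * b = 1`, not to a prescribed one: the limit depends on the starting point, which reflects the fact that the minima are not isolated.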
In Section 5, we analyze the convergence of SGD to the manifold of local minima . In Proposition 5.3 below, we prove the convergence of (1.4) to in directions normal to the manifold. Precisely, we identify a basin of attraction such that, on the event that SGD remains in , SGD converges to in expectation with an algebraic rate. It remains to estimate the probability that SGD remains in the basin of attraction .
The first step is contained in Proposition 5.4 below, which estimates the maximal excursion of SGD in expectation. Then, in Proposition 5.7 below, we estimate the probability that SGD remains in a basin of attraction by separating this event into the event that SGD leaves in a direction normal to and the event that SGD leaves in a direction tangential to . Proposition 5.3 is used to estimate the first of these events, and Proposition 5.4 is used to estimate the second. In Theorem 5.8, we combine Proposition 5.3 and Proposition 5.7 to estimate the probability that SGD converges to within distance of .
In Corollary 5.9 below, we estimate the probability that independent copies of SGD fail to converge to within distance of . In Theorem 5.12 below we prove Theorem 1.1, which relies on Lemma 5.11 below and estimates for the mini-batch approximation of the objective function. Finally, in Corollary 5.13 below, we estimate the computational efficiency of the algorithm introduced in Theorem 1.1.
In Section 6, we prove that the estimates of Section 5 can be improved under the additional assumption that is compact. These estimates apply, in particular, to the case when the objective function has a unique minimum. The reason for the improved estimate of Theorem 6.4 below and the improved computational efficiency of Corollary 6.5 below is that, in the compact case, SGD cannot escape a basin of attraction in directions tangential to the manifold. It is therefore sufficient to take a smaller mini-batch approximation of the gradient.
In Section 7, we prove that the assumptions of Theorem 1.1 are satisfied by simple loss functions arising in machine learning applications. In particular, we show that the assumptions are satisfied by objective functions which satisfy that
[TABLE]
where , , a measurable function on a measurable space , and is a jointly-measurable artificial neural network. In this case, the function satisfies for every that
[TABLE]
and, for a probability space , the random variables , , are i.i.d. with distribution . For the objective functions considered in Section 7.1 and Section 7.2 below, the global minima are non-unique and form locally smooth, non-compact manifolds of on which the Hessian of the objective function is maximally nondegenerate.
2 Geometric preliminaries
In this section, for an objective function that satisfies the conditions of Theorem 1.1, we will characterize the local geometry of the local manifold of minima . The analysis will rely on the notion of a projection to which is, however, only well-defined in local neighborhoods of the local manifold.
In the following proposition, we prove that the projection map to the local manifold of minima is locally well-defined and smooth. The proof is a consequence of Foote [15, Lemma] and the smoothness of .
Proposition 2.1.
Let , , let be the standard norm on , and let be a non-empty -dimensional -submanifold of . Then for every there exists an open neighborhood such that
(i) is a neighborhood of : it holds that .
(ii) projections exist in : there exists a unique function which satisfies for every that
[TABLE]
(iii) the projection map is locally -smooth: the map is once continuously differentiable.
Proof of Proposition 2.1.
The proof is an immediate consequence of [15, Lemma] and the -regularity of . ∎
The family of subsets satisfying for a fixed the conclusion of Proposition 2.1 will play an important role in the arguments to follow. We therefore make a global definition, and define the projection map on a global neighborhood of . The existence of the projection map is an immediate consequence of Proposition 2.1.
Definition 2.2.
Let , , let be a non-empty -dimensional -submanifold of .
(i) For every let satisfy that
[TABLE]
(ii) Let be the unique function which satisfies for every that
[TABLE]
The following proposition proves that for every the tangent space and the normal space $\big(T_{x}(\mathcal{M}\cap U)\big)^{\perp}$ to at are characterized, respectively, by the null space of the Hessian of and the space on which the Hessian of is positive definite.
Proposition 2.3.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
assume that is a non-empty -dimensional -submanifold of and assume for every that . Then for every there exist a -dimensional subvectorspace and a -dimensional subvectorspace such that
(i) it holds that
[TABLE]
(ii) it holds for every that
[TABLE]
(iii) it holds that
[TABLE]
(iv) it holds that
[TABLE]
(v) it holds that
[TABLE]
Proof of Proposition 2.3.
Let . Since , the symmetry of the Hessian implies that there exist subspaces such that , that , that
[TABLE]
that , and that
[TABLE]
Let and suppose that is a smooth curve which satisfies . Since , it follows from the chain rule that
[TABLE]
It follows that and therefore, since , it holds that . Since $\mathbb{R}^{d}=T_{x}(\mathcal{M}\cap U)\oplus\big(T_{x}(\mathcal{M}\cap U)\big)^{\perp}$, it holds that $P_{x}=\big(T_{x}(\mathcal{M}\cap U)\big)^{\perp}$, which completes the proof of Proposition 2.3. ∎
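The characterization of tangent and normal directions through the Hessian can be checked numerically on a simple example. The Python sketch below assumes the illustrative objective `f(x, y) = (x^2 + y^2 - 1)^2 / 4`, whose minima form the unit circle (dimension 1 inside a space of dimension 2). The objective, the base point, and the finite-difference Hessian are choices made for this sketch. On the circle the exact Hessian is `2 * p * p^T`, so it should annihilate the tangent direction and act as multiplication by 2 on the normal direction.

```python
import math

def grad_f(x, y):
    """Gradient of the illustrative objective f(x, y) = (x^2 + y^2 - 1)^2 / 4."""
    s = x * x + y * y - 1.0
    return (s * x, s * y)

def hessian(x, y, h=1e-5):
    """Hessian of f, approximated by central finite differences of the gradient."""
    gxp, gxm = grad_f(x + h, y), grad_f(x - h, y)
    gyp, gym = grad_f(x, y + h), grad_f(x, y - h)
    return [
        [(gxp[0] - gxm[0]) / (2 * h), (gyp[0] - gym[0]) / (2 * h)],
        [(gxp[1] - gxm[1]) / (2 * h), (gyp[1] - gym[1]) / (2 * h)],
    ]

def matvec(matrix, v):
    """Product of a 2x2 matrix with a vector."""
    return (matrix[0][0] * v[0] + matrix[0][1] * v[1],
            matrix[1][0] * v[0] + matrix[1][1] * v[1])

angle = 0.7
point = (math.cos(angle), math.sin(angle))  # a point on the circle of minima
tangent = (-point[1], point[0])             # unit tangent to the circle at that point
normal = point                              # unit normal to the circle at that point

H = hessian(*point)
H_tangent = matvec(H, tangent)  # expected to be close to (0, 0)
H_normal = matvec(H, normal)    # expected to be close to 2 * normal
```

The Hessian has rank 1 here, which equals the codimension of the circle, matching the nondegeneracy assumption in the proposition.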
In the following lemma, for a point such that the projection is well-defined, we prove that the difference lies in the space normal to at . This fact will be used to obtain a rate of convergence for the discrete gradient descent algorithms.
Lemma 2.4.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every , for every (cf. Definition 2.2), it holds for every that
[TABLE]
Proof of Lemma 2.4.
Let , let , and let denote the projection map. Let . If , the claim is immediate since then . If , for some suppose that is a smooth path which satisfies . It holds that
[TABLE]
Therefore, since the curve was arbitrary, it holds that , which completes the proof of Lemma 2.4. ∎
In the following lemma, we derive a formula for the derivative of the distance function to the manifold in a neighborhood of . The regularity of the distance function and the formula for its differential will be used to prove the convergence of the deterministic gradient descent algorithm in continuous time.
Lemma 2.5.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every , for every (cf. Definition 2.2), it holds for every that
[TABLE]
Proof of Lemma 2.5.
Let and let . It follows from Proposition 2.1 that
[TABLE]
The chain rule implies for every that
[TABLE]
Since and since it follows from Lemma 2.4 that
[TABLE]
Since for every it holds that
[TABLE]
it holds for every that
[TABLE]
which completes the proof of Lemma 2.5. ∎
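Both Lemma 2.4 and the derivative formula for the distance function can be sanity-checked on the unit circle, where the projection is explicit. The Python sketch below is an illustration under assumptions: the manifold is taken to be the unit circle, the projection is `x / |x|`, and the formula being tested is the standard identity that the gradient of the distance equals `(x - p(x)) / |x - p(x)|` away from the manifold. The exact statement proved in the text is the one in the lemma.

```python
import math

def project(x):
    """Nearest point on the unit circle (defined for x != 0)."""
    norm = math.hypot(x[0], x[1])
    return (x[0] / norm, x[1] / norm)

def dist(x):
    """Distance from x to the unit circle."""
    p = project(x)
    return math.hypot(x[0] - p[0], x[1] - p[1])

def numerical_grad(func, x, h=1e-6):
    """Central finite-difference gradient of a scalar function on the plane."""
    return ((func((x[0] + h, x[1])) - func((x[0] - h, x[1]))) / (2 * h),
            (func((x[0], x[1] + h)) - func((x[0], x[1] - h))) / (2 * h))

x = (1.2, 0.9)                           # a point off the circle, with |x| = 1.5
p = project(x)                           # its projection, (0.8, 0.6)
offset = (x[0] - p[0], x[1] - p[1])      # x - p(x)
tangent = (-p[1], p[0])                  # tangent to the circle at p(x)

# Lemma 2.4 in this example: x - p(x) is orthogonal to the tangent space at p(x).
normality = offset[0] * tangent[0] + offset[1] * tangent[1]

# Derivative of the distance: compare the closed formula with finite differences.
d = dist(x)
formula_grad = (offset[0] / d, offset[1] / d)
fd_grad = numerical_grad(dist, x)
```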
We will now quantify what are essentially local tubular neighborhoods of the local manifold . The following definition will play an important role throughout the paper.
Definition 2.6.
Let , , let be a non-empty -dimensional -submanifold of . For every , let satisfy that
[TABLE]
A useful feature of the sets defined in Definition 2.6 is that the parameter can be used to quantify distance in directions tangential to the manifold , and the parameter can be used to quantify distance in directions normal to the manifold . The following technical proposition will be used to prove Proposition 4.1 below and Lemma 5.6 below.
Proposition 2.7.
Let , , let be the standard norm on , let be a non-empty -dimensional -submanifold of , and let be the function which satisfies for every that
[TABLE]
Then for every , for every (cf. Definition 2.2), there exist such that for every , ,
(i) it holds that (cf. Definition 2.6),
(ii) it holds that
[TABLE]
(iii) it holds for every and $v\in\big(T_{x}(\mathcal{M}\cap U)\big)^{\perp}$ with that
[TABLE]
Proof of Proposition 2.7.
Let . For every let satisfy that
[TABLE]
Let . Since are open, there exist such that for every it holds that
[TABLE]
and for every , that
[TABLE]
Following [15, Lemma], the normal bundle satisfies that
[TABLE]
Since is a -dimensional -submanifold, it follows that is a -dimensional -submanifold. Furthermore, the map which satisfies for every that satisfies for every that
[TABLE]
It follows from the inverse function theorem that there exists such that for every , it holds that
[TABLE]
Let , . We will first prove that . Let . If then it holds by definition that . If , since implies that and since the choice of implies that
[TABLE]
it holds that . Since and since it holds that
[TABLE]
for by Lemma 2.4, it holds that . This completes the proof that . It remains to prove that . Let . It is necessary to show that . The definition of implies that there exist and with which satisfy that . We will prove that . By contradiction, suppose that . This implies that
[TABLE]
It follows from the triangle inequality that
[TABLE]
which proves that
[TABLE]
for by Lemma 2.4 with . Since , it follows from (2.37) that . Since and since , equation (2.38) contradicts (2.33), which states that is injective on the set
[TABLE]
We conclude that , which implies that
[TABLE]
Therefore, it holds that , which completes the proof that . The final claim follows from a repetition of the arguments leading to (2.37) and (2.38). This completes the proof of Proposition 2.7. ∎
The following two lemmas contain the primary use of the nondegeneracy assumption, which states for every that
[TABLE]
The first of these proves that can be split into a component that is approximately normal to the local manifold of minima , and into a component that is approximately tangential to . We will use the normal component to obtain a rate of convergence for the gradient descent algorithms. The contribution of the tangential component will create errors that will need to be controlled.
Lemma 2.8.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every there exist and (cf. Definition 2.2) such that for every , it holds that (cf. Definition 2.6)
[TABLE]
and for every there exists which satisfies such that
[TABLE]
Proof of Lemma 2.8.
Let and . Since is an open set, there exists which satisfies that . Since is open, fix such that for every , it holds that
[TABLE]
Due to the compactness of and the regularity of , there exists which satisfies for every , that
[TABLE]
Let . By integration, since , it holds that
[TABLE]
It follows from (2.47), the local regularity of , and the definition of the projection that there exists which satisfies that
[TABLE]
After defining which satisfies that
[TABLE]
equation (2.48) and estimate (2.49) complete the proof of Lemma 2.8. ∎
The following lemma will play an important role in the analysis of the deterministic and stochastic gradient descent algorithms in discrete time. In the context of Lemma 2.8, for every with well-defined, the following lemma quantifies the convergence of gradient descent to .
Lemma 2.9.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every there exist , such that
[TABLE]
and (cf. Definition 2.2) such that for every , , , it holds that
[TABLE]
that
[TABLE]
and that
[TABLE]
Proof of Lemma 2.9.
Let . Since is an open subset, there exists which satisfies that . Fix such that for every , it holds that (cf. Definition 2.6)
[TABLE]
Due to the compactness of and the regularity of , there exists which satisfies for every , that
[TABLE]
Let . For the first claim, using (2.58), fix which satisfies that
[TABLE]
Let . The definition of the distance to implies that
[TABLE]
Since the nondegeneracy assumption states that
[TABLE]
Lemma 2.4 and (2.58) prove that there exists for which satisfies that
[TABLE]
for which we have that
[TABLE]
where the choice of and (2.62) guarantee that . In combination, estimates (2.60), (2.62), and (2.63) complete the proof of the first claim. The proof of the second claim is similar. For every , the nondegeneracy assumption, Lemma 2.4, and (2.58) prove that there exists which satisfies (2.62) such that
[TABLE]
which completes the proof of Lemma 2.9. ∎
3 Continuous deterministic gradient descent
In this section, for an objective function which satisfies the conditions of Theorem 1.1, we will analyze the local convergence to the local manifold of minima of the deterministic gradient descent algorithm in continuous time , , which satisfies for every that
[TABLE]
We will prove that the solution of (3.1) converges to the local manifold of minima , provided the initial condition is chosen in a sufficiently small neighborhood of . The proof can be outlined as follows. Given any , we first fix an open neighborhood which satisfies the conclusions of Lemma 2.8 and Lemma 2.9. Then, for initial data in this neighborhood, we quantify the convergence of the solution (3.1) to in directions normal to the manifold, using the decomposition of from Lemma 2.8. Finally, after fixing a smaller neighborhood about , we prove that the tangential components of the gradient of do not take the trajectory from the basin of attraction.
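The outline above can be illustrated with a minimal Python sketch. It approximates the gradient flow by an explicit Euler scheme with a small step size, which is a numerical stand-in and not the continuous-time object analyzed in this section. The toy objective f(x, y) = ½(y − x²)², the residual g(x, y) = y − x² used as a proxy for the distance to the parabola of minima, and all function names are assumptions made for illustration.

```python
import numpy as np

def grad_f(theta):
    """Gradient of the toy objective f(x, y) = 0.5 * (y - x**2)**2."""
    x, y = theta
    return (y - x**2) * np.array([-2.0 * x, 1.0])

def residual(theta):
    """g(theta) = y - x**2, which vanishes exactly on the toy manifold of minima."""
    x, y = theta
    return y - x**2

def gradient_flow(theta0, t_end, dt=1e-3):
    """Explicit Euler approximation of d/dt theta_t = -grad f(theta_t)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(int(round(t_end / dt))):
        theta = theta - dt * grad_f(theta)
    return theta

theta0 = np.array([1.0, 1.5])
for t_end in (0.0, 1.0, 2.0, 4.0):
    # The residual shrinks as the flow approaches the parabola y = x**2.
    print(t_end, abs(residual(gradient_flow(theta0, t_end))))
```

For this toy objective the exact flow satisfies d/dt g = −g(1 + 4x²), so the residual decays at least like e^{−t}. This mirrors the exponential convergence in the normal directions, while the motion along the parabola stays bounded.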
Proposition 3.1.
Let , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every there exist such that for every , , (cf. Definition 2.6), for , , which satisfies for every that
[TABLE]
it holds for every that
[TABLE]
Proof of Proposition 3.1.
Let . Since is an open set, fix (cf. Definition 2.2) which satisfies that . In view of Proposition 2.7, fix such that for every , the set (cf. Definition 2.6) satisfies that and that
[TABLE]
In particular, the compactness of and the regularity of imply that there exists which satisfies that
[TABLE]
Let , . Let , let , , satisfy for every that
[TABLE]
and let denote the exit time
[TABLE]
Lemma 2.5 and the chain rule prove that
[TABLE]
where the local regularity of and the stopping time guarantee the well-posedness of this equation. Let . It follows from Lemma 2.8 and Lemma 2.9 that there exist which satisfy that
[TABLE]
Proposition 2.1, (3.7), and prove that there exists which satisfies that
[TABLE]
Returning to (3.10), it follows from (3.11) and (3.12) that
[TABLE]
Let satisfy that
[TABLE]
Let . For every it follows from (3.13) and (3.14) that
[TABLE]
Therefore, for every , it holds that
[TABLE]
For every , it follows from (3.13) and (3.16) that
[TABLE]
Fix which satisfies that
[TABLE]
Let . In combination, (3.16), (3.17), , and the triangle inequality prove that for every . This is to say that . Since was arbitrary, this completes the proof of Proposition 3.1. ∎
4 Discrete deterministic gradient descent
In this section, for an objective function which satisfies the conditions of Theorem 1.1, we will analyze the convergence of the following deterministic gradient descent algorithm , , in discrete time which satisfies for a learning rate and that
[TABLE]
The proof is similar to the case of the deterministic gradient descent algorithm in continuous time. However, in the discrete setting, care must be taken to choose the learning rate sufficiently small: if the learning rate is too large, then for small values of the jump may be an overcorrection that causes the solution to overshoot the local manifold of minima and to leave the basin of attraction.
In the proof, we first identify a basin of attraction using Proposition 2.1 and Proposition 2.7. In the second step, we prove that the solution (4.1) converges along the normal directions to the manifold of local minima provided the solution remains in the basin of attraction. For this, we use the normal component of from Lemma 2.8 and the quantification of the convergence from Lemma 2.9. Finally, after fixing a perhaps smaller basin of attraction, we prove that the tangential component of the gradient from Lemma 2.8 does not cause the solution (4.1) to leave the basin of attraction.
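The role of the learning rate can be illustrated with a minimal Python sketch. It assumes a step size of the form r·n^(−ρ) at step n and reuses the toy objective f(x, y) = ½(y − x²)², whose minima form the parabola y = x². The objective, the parameter values, and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def grad_f(theta):
    """Gradient of the toy objective f(x, y) = 0.5 * (y - x**2)**2."""
    x, y = theta
    return (y - x**2) * np.array([-2.0 * x, 1.0])

def residual(theta):
    """g(theta) = y - x**2, a proxy for the distance to the parabola of minima."""
    x, y = theta
    return y - x**2

def gradient_descent(theta0, r, rho, n_steps):
    """Iterate theta_n = theta_{n-1} - (r / n**rho) * grad f(theta_{n-1})."""
    theta = np.array(theta0, dtype=float)
    history = [abs(residual(theta))]
    for n in range(1, n_steps + 1):
        theta = theta - (r / n**rho) * grad_f(theta)
        history.append(abs(residual(theta)))
    return np.array(history)

theta0 = [1.0, 1.5]
# Small r: the residual decreases along the iteration.
small = gradient_descent(theta0, r=0.1, rho=0.75, n_steps=200)
# Large r: the first jump overshoots the parabola and the iterates leave the basin.
large = gradient_descent(theta0, r=2.0, rho=0.75, n_steps=5)
print(small[0], small[-1])
print(large[0], large[-1])
```

The second run is kept to a few steps because the iterates grow very quickly once they have left the basin of attraction.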
Proposition 4.1.
Let , , , let be the standard norm on , let be an open set, let be a three times continuously differentiable function, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume that is a non-empty -dimensional -submanifold of , and assume for every that . Then for every there exists such that for every , , , (cf. Definition 2.6), for , , which satisfies for every that
[TABLE]
it holds for every that
[TABLE]
Proof of Proposition 4.1.
Let and . Since is open, fix (cf. Definition 2.2) which satisfies that . In view of Proposition 2.7, fix such that for every , the set (cf. Definition 2.6) satisfies that and that
[TABLE]
The regularity of and the compactness of prove that there exists which satisfies that
[TABLE]
Fix which satisfies the conclusion of Lemma 2.9 for the set . Let , , . Let , let , , satisfy that
[TABLE]
and let be the exit time which satisfies that
[TABLE]
Since for every the projection of is well-defined, we have that
[TABLE]
Lemma 2.8 proves that there exists such that for every there exists which satisfies that
[TABLE]
such that
[TABLE]
The triangle inequality, (4.10), (4.11), and (4.12) prove that there exists such that for every it holds that
[TABLE]
Finally, the choice of , Lemma 2.9, and (4.13) prove that there exists such that for every it holds that
[TABLE]
where the choice of guarantees that . Fix which satisfies that
[TABLE]
Let . It follows from (4.14) and (4.15) that for every it holds that
[TABLE]
After iterating this inequality, we have for every that
[TABLE]
Since there exists which satisfies for every that
[TABLE]
it follows from (4.17) that there exists which satisfies for every that
[TABLE]
It remains only to show that, provided is chosen sufficiently small, we have that . It follows from (4.7), (4.19), and that there exists which satisfies that
[TABLE]
The triangle inequality therefore implies that there exists such that for every it holds that
[TABLE]
Fix which satisfies that
[TABLE]
Let . The choice of , (4.21), and the triangle inequality prove for every that
[TABLE]
In combination, (4.19) and (4.23) prove for every that
[TABLE]
The triangle inequality therefore implies for every that
[TABLE]
It follows from Proposition 2.7, the choice of , and that for every it holds that . This is to say that , which completes the proof of Proposition 4.1. ∎
Remark 4.2.
The conclusion of Proposition 4.1 can be extended to the case of using the same techniques. In this case, in the setting of Proposition 4.1, there exists such that for every , , , (cf. Definition 2.6), for , , which satisfies for every that
[TABLE]
it holds for every that
[TABLE]
The logarithm appears in estimate (4.18) in the case . The remainder of the proof is then the same, where the only additional observation is that the analogue of (4.21) is finite in the case as well.
5 Stochastic gradient descent
In this section, in the setting of Theorem 1.1, for a learning rate , for , , for a bounded open subset , for a probability space , for a measurable space , for a jointly measurable function , for , , i.i.d. random variables, we will analyze the convergence of the mini-batch stochastic gradient descent algorithm , , which satisfies that is continuous uniformly distributed on and for every that
[TABLE]
The role of the mini-batch size is to reduce the variance of the random gradient
[TABLE]
The variance reduction is quantified by the following well-known lemma, where the function plays the role of .
Lemma 5.1.
Let , let be the standard norm on , let be a non-empty open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables, and assume for every non-empty compact set that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big[|G(\theta,X_{1})|^{2}\big]<\infty$. Then for every non-empty compact set there exists which satisfies for every that
[TABLE]
Proof of Lemma 5.1.
Let be a compact set. It holds for every , that
[TABLE]
Since the , , are i.i.d. and since , , is locally bounded in , there exists which satisfies for every that
[TABLE]
This completes the proof of Lemma 5.1. ∎
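The variance reduction can be observed in a small simulation. The following minimal Python sketch estimates the mean squared deviation of a mini-batch average from its expectation for several batch sizes M. The choice G(θ, X) = θ + X with standard normal X, the dimension, and the function name `minibatch_mse` are assumptions made for illustration only.

```python
import numpy as np

def minibatch_mse(M, d=3, n_trials=20000, seed=0):
    """Monte Carlo estimate of E| (1/M) sum_m G(theta, X_m) - E[G(theta, X_1)] |^2."""
    rng = np.random.default_rng(seed)
    theta = np.ones(d)
    # Toy random field: G(theta, X) = theta + X, so that E[G(theta, X)] = theta.
    X = rng.standard_normal((n_trials, M, d))
    batch_mean = (theta + X).mean(axis=1)
    return float(np.mean(np.sum((batch_mean - theta) ** 2, axis=1)))

for M in (1, 4, 16, 64):
    # M times the mean squared error should stay roughly constant in M.
    print(M, M * minibatch_mse(M))
```

In this toy setting the mean squared error equals d/M exactly, so the printed products fluctuate around d.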
In the following proposition, much like the first step of the proofs of Proposition 3.1 and Proposition 4.1, we establish the convergence of (5.1) in directions normal to the local manifold of minima. We first identify a basin of attraction for (5.1) using Proposition 2.1 and Proposition 2.7 and prove, using the gradient decomposition of Lemma 2.8 and the quantification of convergence from Lemma 2.9, that on the event that SGD does not escape this basin of attraction SGD converges to the manifold of minima in expectation.
Remark 5.2.
We emphasize that the events , , defined in Proposition 5.3 below depend upon the quantifiers , , , and . However, in order to simplify the presentation, we will oftentimes suppress this dependence in the notation. For every , we will write for the indicator function of the set .
Proposition 5.3.
Let , , , let be the standard norm on , let be an open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that $\mathbb{E}\big[|F(\theta,X_{1,1})|^{2}\big]<\infty$, let be the function which satisfies for every that $f(\theta)=\mathbb{E}\big[F(\theta,X_{1,1})\big]$, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big[|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big]<\infty$, assume that is a non-empty -dimensional -submanifold of , assume for every that , for every , , let satisfy for every that , for every , , let satisfy that
[TABLE]
and for every , , , let satisfy that
[TABLE]
Then for every there exist such that for every , , , , (cf. Definition 2.6) it holds that
[TABLE]
Proof of Proposition 5.3.
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix which satisfy the conclusion of Proposition 2.7 for this set . Finally, fix which satisfies the conclusion of Lemma 2.9. Let , , , . To simplify the notation, and by a small abuse of notation, let , , be the functions which satisfy for every that
[TABLE]
Let , let satisfy for every that , and for every let satisfy that
[TABLE]
We will analyze the solution of (5.12) on the event . We observe that
[TABLE]
Since the event implies that , the projection of is well-defined and it holds by definition of the distance to that
[TABLE]
The three terms on the righthand side of (LABEL:sgd_000) will be treated separately. For the first term on the righthand side of (LABEL:sgd_000), the choice of , Lemma 2.8, and Lemma 2.9 prove, following identically the proof leading from (4.10) to (4.14), that there exist such that
[TABLE]
Therefore, there exist which satisfy that
[TABLE]
The remaining two terms of (LABEL:sgd_000) and the righthand side of (5.16) will be handled after taking the expectation on the event which satisfies that
[TABLE]
After returning to (LABEL:sgd_000), it follows from (5.16) that there exists which satisfies that
[TABLE]
For every let be the sigma algebra which satisfies that
[TABLE]
For the penultimate term of (5.18), since is -measurable, properties of the conditional expectation imply that
[TABLE]
Therefore, it holds that
[TABLE]
where the final equality follows from the fact that the , , are independent and therefore satisfy for every that
[TABLE]
The final term of (5.18) is handled using Lemma 5.1. Since is compact, the independence of the , , and Lemma 5.1 prove that there exists such that
[TABLE]
Returning to (5.18), it follows from (5.21) and (5.23) that there exists such that
[TABLE]
Fix which satisfies that
[TABLE]
Let . We claim that inequality (5.24) implies that there exists some which satisfies for every that
[TABLE]
The proof of (5.26) will proceed by induction. Since , there exists such that for every it holds that
[TABLE]
where the first inequality follows from the mean value theorem and and the second inequality is obtained by choosing sufficiently large. Fix which satisfies (5.27) and define which satisfies that
[TABLE]
For the base case, the definition of guarantees for every that
[TABLE]
For the induction step, suppose that for we have that
[TABLE]
Since the event implies that
[TABLE]
it follows from an -estimate, the inclusion , and the induction hypothesis that for every it holds that
[TABLE]
Returning to (5.24), it holds that
[TABLE]
After adding and subtracting , it holds that
[TABLE]
Since , it follows from (5.34) that
[TABLE]
Since , the choice , (5.27), and (5.35) prove that
[TABLE]
Therefore, we have that
[TABLE]
which completes the induction step. Since the base case is (5.29), this completes the proof of Proposition 5.3. ∎
Proposition 5.3 proves the convergence of SGD to on the event that SGD remains in a basin of attraction. It remains necessary to prove that, provided the mini-batch size is chosen to be sufficiently large, SGD remains in the basin of attraction for large times. We prove the first step toward this goal in the proposition below, which estimates the maximal excursion of SGD on the event that the dynamics do not leave a basin of attraction.
Proposition 5.4.
Let , , , let be the standard norm on , let be an open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a non-empty -dimensional -submanifold of , assume for every that , for every , , let satisfy for every that , for every , , let satisfy that
[TABLE]
and for every , , , let satisfy that
[TABLE]
Then for every there exist such that for every , , , , (cf. Definition 2.6) it holds that
[TABLE]
Proof of Proposition 5.4.
Let be the function which satisfies for every that
[TABLE]
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix which satisfies the conclusion of Proposition 2.7 for this set . We observe that the regularity of and the compactness of imply that
[TABLE]
Finally, fix which satisfies the conclusion of Lemma 2.9. Let , , , . As in Proposition 5.3, let , , be the functions which satisfy for every that
[TABLE]
Let , let satisfy for every that , and for every let satisfy that
[TABLE]
We will first prove that there exists which satisfies that
[TABLE]
where we observe that the constant can be absorbed by fixing sufficiently small. It holds that
[TABLE]
Lemma 2.8 proves that there exists and which satisfy that
[TABLE]
such that on the event it holds that
[TABLE]
Therefore, on the event it holds that
[TABLE]
Let satisfy that
[TABLE]
After taking the norm-squared of (5.50), on the event it holds that
[TABLE]
We will estimate (5.52) by taking the expectation on the event . The first term on the righthand side of (5.52) is handled using Proposition 5.3 and (5.48). For the second term, from (5.19) we recall the sigma algebras , , which satisfy that
[TABLE]
Since is -measurable, it follows identically to (5.21) and (5.22) that
[TABLE]
For the final term on the righthand side of (5.52), the compactness of , the independence of the , , and Lemma 5.1 prove that there exists which satisfies that
[TABLE]
In combination, Proposition 5.3 and estimates (5.48), (5.52), (5.54), and (5.55) prove that there exists which satisfies that
[TABLE]
It follows from the definition of , (5.43), and the definition of the projection that, on the event there exists which satisfies that
[TABLE]
Proposition 5.3 proves that there exists such that
[TABLE]
It follows from the triangle inequality, (5.56), and (5.58) that there exists which satisfies that
[TABLE]
which completes the proof of (5.46). Since for every we have , it follows from (5.59), the triangle inequality, and Hölder’s inequality that there exists which satisfies for every that
[TABLE]
where we have used the fact that, since , there exists a such that
[TABLE]
This completes the proof of Proposition 5.4. ∎
Remark 5.5.
We emphasize that the assumption is only used to ensure the boundedness in of the first sum appearing on the lefthand side of (5.61), which cannot be countered by the mini-batch size . Every other argument in the paper applies without change to the case . In particular, because the result of Proposition 5.4 is not needed if is compact, since SGD cannot leave the basin of attraction in tangential directions, the results of Section 6 apply for under this additional compactness assumption.
We will next obtain a lower bound in probability for the events , . For this, we will first establish sufficient conditions for containment in the set . Effectively, these conditions split the normal and tangential movement of SGD in the sense that, in order to be outside the set , a point must be either distance greater than from or be of distance roughly greater than from .
Lemma 5.6.
Let , , let be the standard norm on , and let be a -dimensional -submanifold, let be the function which satisfies for every that
[TABLE]
Then for every there exists such that for every , , for which satisfies that
[TABLE]
it holds that
[TABLE]
Proof of Lemma 5.6.
Let , let (cf. Definition 2.2), and let satisfy the conclusion of Proposition 2.7. That is, for every , it holds that and that
[TABLE]
Suppose that satisfies that
[TABLE]
The definition of the distance to and imply that there exists a possibly non-unique which satisfies that
[TABLE]
The triangle inequality implies that
[TABLE]
It follows that , and therefore that
[TABLE]
It follows from (5.66) and (5.69) that , which completes the proof of Lemma 5.6. ∎
In the following proposition, we obtain a lower bound in probability for the sets , . The interesting observation is that Proposition 5.3 and Proposition 5.4, which obtain estimates for the solution of (5.1) conditioned on the events , , can be used together and inductively to obtain a lower bound in probability for the events , . Namely, Proposition 5.3 implies that, on the event , the process is converging to in the normal directions with high probability, and Proposition 5.4 can be used to estimate the probability that the solution (5.1) escapes the basin of attraction along the tangential directions. We first introduce some convenient notation.
Proposition 5.7.
Let , , , let be the standard norm on , let be an open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a non-empty -dimensional -submanifold of , assume for every that , for every , , let satisfy for every that , for every , , let satisfy that
[TABLE]
and for every , , , let satisfy that
[TABLE]
Then for every there exist such that for every , , , , (cf. Definition 2.6) it holds that
[TABLE]
Proof of Proposition 5.7.
Let be the function which satisfies for every that
[TABLE]
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix which satisfy the conclusion of Proposition 2.7 for this set . Fix which satisfies the conclusion of Lemma 2.9. Let , , , . As in Proposition 5.3, let , , be the functions which satisfy for every that
[TABLE]
Let , let satisfy for every that , and for every let satisfy that
[TABLE]
Since it holds that
[TABLE]
it follows that
[TABLE]
The two terms on the righthand side of (5.79) will be handled separately. We will first prove that there exists which satisfies that
[TABLE]
On the event , it follows from Lemma 2.8 that there exists , such that
[TABLE]
and such that on the event it holds that
[TABLE]
Therefore, on the event , we have that
[TABLE]
Lemma 2.9, (5.81), the choice of , the definition of the projection, and the triangle inequality prove that there exist such that on the event it holds that
[TABLE]
Fix which satisfies that
[TABLE]
Let . On the event , it follows from (5.84) and the choice of that
[TABLE]
We therefore conclude that
[TABLE]
Similarly to (5.21) and computation (5.22), it follows from the independence of the random variables , , that
[TABLE]
and that
[TABLE]
The definition of , Chebyshev’s inequality, Lemma 5.1, and (5.88) prove that there exists which satisfies that
[TABLE]
In the case of (5.89), Proposition 5.3 and Chebyshev’s inequality prove that, for the indicator function of the event , there exists which satisfies that
[TABLE]
where we have used the fact that, since , there exists such that for every it holds that . Furthermore, Chebyshev’s inequality and Lemma 5.1 prove that there exists which satisfies that
[TABLE]
Returning to (5.89), the previous two inequalities prove that there exists which satisfies that
[TABLE]
Combining (5.87), (5.90), and (5.93), there exists such that
[TABLE]
which completes the proof of (5.80). Returning to (5.79), it follows from (5.94) that there exists such that
[TABLE]
Therefore, there exists which satisfies that
[TABLE]
We will prove inductively that (5.96) implies that there exists such that for every it holds that
[TABLE]
The base case follows immediately from . For the inductive step, suppose that (5.101) is satisfied for some . It follows from (5.96) that
[TABLE]
It then follows from the inductive hypothesis (5.101) that
[TABLE]
which proves that
[TABLE]
Finally, since implies that , it holds that
[TABLE]
which completes the induction step, and the proof of (5.101). It remains only to estimate the final term on the righthand side of inequality (5.101). The definition of the events , , implies that
[TABLE]
Therefore, it holds that
[TABLE]
Lemma 5.6 proves that
[TABLE]
Since , the triangle inequality proves for every that
[TABLE]
Therefore, for every , on the event $\big\{\big|\Theta^{M,r}_{k,\theta}-x_{0}\big|>R-\delta\big\}$ it holds that
[TABLE]
This implies that
[TABLE]
In combination, (5.103), (5.104), and (5.107) prove that
[TABLE]
It follows from Proposition 5.4, (5.108), and Chebyshev’s inequality that there exists which satisfies that
[TABLE]
Returning to (5.101), it follows from (5.109) that there exists which satisfies that
[TABLE]
where we have used the fact that, since , there exists which satisfies that
[TABLE]
This completes the proof of Proposition 5.7. ∎
We will now use Proposition 5.3 and Proposition 5.7 to estimate the probability that SGD of mini-batch size converges to within distance of the manifold of local minima at time . In the theorem, we assume that the initial condition is continuous uniformly distributed on a bounded open subset which satisfies that .
Theorem 5.8.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be the Lebesgue-Borel measure, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , let be continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, for every , let , , be random variables which satisfy that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Theorem 5.8.
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix that satisfy the conclusion of Proposition 2.7 for this set . Fix that satisfies the conclusions of Lemma 2.9 and Proposition 5.7. Let , , , . As in Proposition 5.3, let , , be the functions which satisfy for every that
[TABLE]
For every let satisfy for every that and for every let satisfy that
[TABLE]
Let be a random variable which is continuous uniformly distributed on , assume that and are independent, and for every let satisfy that . Let , . It holds that
[TABLE]
For the second term on the righthand side of (5.116), it follows from the continuous uniform distribution of on that
[TABLE]
We will now estimate the first term on the righthand side of (5.119). For every , let be the event which satisfies that
[TABLE]
and for every let satisfy that
[TABLE]
It holds that
[TABLE]
For the second term on the righthand side of (5.123), it follows from Proposition 5.7 that there exists such that
[TABLE]
where we have used the fact that implies that there exists that satisfies for every that . For the first term on the righthand side of (5.123), since the random variables and $\big(X_{n,m}\big)_{n,m\in\mathbb{N}}$ are independent, it holds that
[TABLE]
Proposition 5.3 and Chebyshev’s inequality prove that there exists such that for every it holds that
[TABLE]
In combination (LABEL:gss_5) and (5.126) prove that there exists such that
[TABLE]
Returning to (5.123), it follows from (5.124) and (5.127) that there exists such that
[TABLE]
Returning finally to (5.119), it follows from (5.120) and (5.128) that there exists such that
[TABLE]
which completes the proof of Theorem 5.8. ∎
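The quantity estimated in Theorem 5.8 can be approximated by simulation in a toy setting. The following minimal Python sketch assumes a mini-batch update of the form Θ_n = Θ_{n−1} − (r/(n^ρ M)) Σ_m (∇_θ F)(Θ_{n−1}, X_{n,m}), a random initial value that is uniformly distributed on a box, and the toy random objective F((x, y), X) = ½(y − x² − X)² with centered Gaussian X, so that the expected objective is minimized on the parabola y = x². The objective, the box, the parameter defaults, the use of |y − x²| as a proxy for the distance to the parabola, and the function name are all assumptions made for illustration.

```python
import numpy as np

def sgd_success_fraction(M, n_steps, eps, r=0.1, rho=0.75, sigma=1.0,
                         n_runs=500, seed=0):
    """Fraction of independent SGD runs with |y - x^2| <= eps after n_steps steps."""
    rng = np.random.default_rng(seed)
    # Initial values: continuous uniform on the box (-1, 1) x (-0.5, 1.5).
    x = rng.uniform(-1.0, 1.0, size=n_runs)
    y = rng.uniform(-0.5, 1.5, size=n_runs)
    for n in range(1, n_steps + 1):
        noise = sigma * rng.standard_normal((n_runs, M))
        # Mini-batch average of (y - x^2 - X_m); grad_theta F = (y - x^2 - X) * (-2x, 1).
        scale = (y[:, None] - x[:, None] ** 2 - noise).mean(axis=1)
        lr = r / n**rho
        # Tuple assignment evaluates both updates with the old value of x.
        x, y = x - lr * scale * (-2.0 * x), y - lr * scale
    return float(np.mean(np.abs(y - x**2) <= eps))

for M in (1, 16):
    print(M, sgd_success_fraction(M=M, n_steps=300, eps=0.3))
```

The returned fraction is only a Monte Carlo estimate for this toy problem; it does not reproduce the constants or rates of the theorem.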
The next corollary estimates the probability that independent samples of SGD with mini-batch size fail to converge to within distance of the manifold of local minima at time . The proof is a straightforward consequence of Theorem 5.8 and the independence of the random variables.
Corollary 5.9.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, and assume for every , that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Corollary 5.9.
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix which satisfy the conclusion of Proposition 2.7 for this set . Fix which satisfy the conclusions of Lemma 2.9 and Proposition 5.7. Let , , , . Since the , , are i.i.d. it holds that
[TABLE]
Theorem 5.8 and (5.135) prove estimate (LABEL:mp_0), which completes the proof of Corollary 5.9. ∎
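The independence argument has a simple quantitative reading: if a single run fails with probability p, then all of K independent runs fail with probability p^K. The following minimal Python sketch computes this joint failure probability and the smallest number of runs needed to push it below a target. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math

def joint_failure_probability(p_fail, K):
    """Probability that K independent runs all fail, given single-run failure p_fail."""
    return p_fail ** K

def runs_needed(p_fail, target):
    """Smallest K with p_fail ** K <= target, for p_fail and target in (0, 1)."""
    if not (0.0 < p_fail < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_fail and target must lie strictly between 0 and 1")
    return max(1, math.ceil(math.log(target) / math.log(p_fail)))

K = runs_needed(0.5, 0.01)
print(K, joint_failure_probability(0.5, K))
```

Because the joint failure probability decays geometrically in K, a modest single-run success probability already suffices when several independent initializations are used.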
The following corollary translates the convergence of , , to the local manifold of minima into a statement concerning the minimization of the objective function. The proof is a consequence of Corollary 5.9 and the local regularity of the objective function.
Corollary 5.10.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, and assume for every , that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Corollary 5.10.
The proof is an immediate consequence of Corollary 5.9 and the local regularity of the objective function.∎
Under the assumptions and notation of Corollary 5.10, since a random variable which satisfies that
[TABLE]
is either computationally inefficient or computationally impossible to obtain, we will prove that such a minimizer can be efficiently computed using mini-batch averages. In the following lemma, we prove that there exists a measurable selection that minimizes a mini-batch approximation.
Lemma 5.11.
Let , let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables, and let , , be i.i.d. random variables. Then for every there exists a random variable such that
[TABLE]
Proof of Lemma 5.11.
Let . Let satisfy for every that
[TABLE]
Let satisfy for every that
[TABLE]
It follows from (5.142) and (5.143) that is measurable and satisfies (5.141), which completes the proof of Lemma 5.11. ∎
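The selection step can be sketched in a few lines of Python. Given K candidate parameters and a batch of fresh samples, the sketch returns the index of the candidate with the smallest mini-batch average of the random objective, breaking ties in favor of the smallest index. The function name `select_best`, the tie-breaking rule, and the toy objective in the usage lines are assumptions made for illustration.

```python
import numpy as np

def select_best(candidates, F, samples):
    """Index of the candidate minimizing the mini-batch average of F, and all averages."""
    averages = np.array([
        np.mean([F(theta, x) for x in samples]) for theta in candidates
    ])
    # np.argmin returns the first index that attains the minimum.
    return int(np.argmin(averages)), averages

# Toy usage: F(theta, x) = (theta - x)^2 with all samples equal to zero.
F = lambda theta, x: (theta - x) ** 2
index, averages = select_best([3.0, 1.0, -1.0, 2.0], F, np.zeros(3))
print(index, averages)
```

Only evaluations of the random objective on the sampled data are needed, not the expected objective itself.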
In the following theorem, we prove that the minimum appearing on the left-hand side of (LABEL:mp_0) can be efficiently computed using mini-batch averages of the type appearing in Lemma 5.11.
Theorem 5.12.
Let , , , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that and are independent, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, assume for every , that
[TABLE]
and for every , let be a random variable which satisfies that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Theorem 5.12.
Let . Since is open, fix (cf. Definition 2.2) which satisfies that . Fix which satisfy the conclusion of Proposition 2.7 for this set . Fix which satisfy the conclusions of Lemma 2.9 and Proposition 5.7. Let , , , . For every let satisfy that
[TABLE]
and let satisfy that and for every let satisfy that . Since the events , , are disjoint, it holds that
[TABLE]
For the first term on the right-hand side of (5.150), Corollary 5.10 proves that there exists which satisfies that
[TABLE]
We will now estimate the second term on the right-hand side of (5.150). Let , , be disjoint events which satisfy that and that
[TABLE]
Since the events , , are disjoint, the final term of (5.150) satisfies that
[TABLE]
Let be the function which satisfies for every , that
[TABLE]
For every , since it holds for every that
[TABLE]
it holds for every that
[TABLE]
It follows from (5.153) and (5.156) that
[TABLE]
For the first term on the right-hand side of (5.157), it holds that
[TABLE]
Since the random variables and are independent, since the are identically distributed, and since the distribution of has bounded support on , for the distribution of on , Lemma 5.1, Chebyshev’s inequality, and the definition of prove that there exists which satisfies for every that
[TABLE]
Therefore, it holds that
[TABLE]
For the second term on the right-hand side of (5.157), it is sufficient to apply the same argument, which proves that there exists which satisfies that
[TABLE]
Returning to (5.153), it follows from (5.157) and (5.160) that there exists which satisfies that
[TABLE]
Returning finally to (5.150), it follows from (5.151) and (5.162) that there exists which satisfies that
[TABLE]
which completes the proof of Theorem 5.12. ∎
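The procedure analyzed in Theorem 5.12 (several independent SGD runs from uniformly sampled initial data, followed by a selection based on a mini-batch average) can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy loss F(theta, x) = (theta_1 theta_2 x - x)^2, the decaying step size r / n^rho, the initialization box, and all function names and parameter values are hypothetical choices and are not taken from this section. The toy objective is non-convex and its minima form the non-compact curve theta_1 theta_2 = 1.

```python
import numpy as np


def F(theta, x):
    # Hypothetical loss: F(theta, x) = (theta_1 * theta_2 * x - x)^2.
    return (theta[0] * theta[1] * x - x) ** 2


def minibatch_gradient(theta, x):
    # Mini-batch average of the gradient of F with respect to theta.
    scale = np.mean(2.0 * (theta[0] * theta[1] - 1.0) * x ** 2)
    return scale * np.array([theta[1], theta[0]])


def exact_objective(theta):
    # f(theta) = E[F(theta, X)] = (theta_1 * theta_2 - 1)^2 / 3 for X uniform on [0, 1].
    return (theta[0] * theta[1] - 1.0) ** 2 / 3.0


def sgd_run(rng, n_steps, batch_size, r, rho, low, high):
    theta = rng.uniform(low, high, size=2)          # uniform initial data on a box (assumed)
    for step in range(1, n_steps + 1):
        x = rng.uniform(0.0, 1.0, size=batch_size)  # fresh mini-batch at every step
        theta = theta - (r / step ** rho) * minibatch_gradient(theta, x)
    return theta


def best_of_k(rng, k_runs, n_steps, batch_size, eval_batch_size, r=0.2, rho=0.75):
    candidates = [sgd_run(rng, n_steps, batch_size, r, rho, 0.5, 1.5) for _ in range(k_runs)]
    x_eval = rng.uniform(0.0, 1.0, size=eval_batch_size)  # fresh evaluation samples
    scores = [np.mean(F(theta, x_eval)) for theta in candidates]
    return candidates[int(np.argmin(scores))]


rng = np.random.default_rng(0)
theta_star = best_of_k(rng, k_runs=5, n_steps=2000, batch_size=8, eval_batch_size=64)
print(exact_objective(theta_star))
```

The sketch evaluates all candidates on the same evaluation mini-batch, which is a simplifying assumption. The exact objective is used only to inspect the result and plays no role in the selection.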
In the final corollary of this section, we will compute the computational efficiency of the algorithm proposed in Theorem 5.12. The constant implicitly depends on the computational cost of computing and and initializing the random variable , but it does not depend upon the running time , the sampling size , or the mini-batch sizes .
Corollary 5.13.
Let , , , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that and are independent, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, assume for every , that
[TABLE]
and for every , let be a random variable which satisfies that
[TABLE]
Then for every there exist such that for every , , there exist , , such that for every , for which satisfy that
[TABLE]
it holds that
[TABLE]
Proof of Corollary 5.13.
Let . Let satisfy the conclusion of Theorem 5.12. Theorem 5.12 proves that there exists such that for every , , , , it holds that
[TABLE]
Fix , which satisfy that
[TABLE]
Since , it holds that
[TABLE]
For every which satisfies that , since there exists which satisfies that
[TABLE]
and therefore for every there exists which satisfies that
[TABLE]
It follows from (5.170) that there exists which satisfies that
[TABLE]
Returning to (5.169), it follows from (5.173) and (5.174) that there exists which satisfies that
[TABLE]
Let . It follows from (5.170), (5.171) and an explicit computation that there exist , , and such that for which satisfy that
[TABLE]
it holds that
[TABLE]
and for every that
[TABLE]
Returning to (5.175), it follows for every that
[TABLE]
which completes the proof of Corollary 5.13. ∎
6 Stochastic gradient descent - The compact case
In this section, we will analyze the convergence of SGD to the manifold of local minima under the additional assumption that the manifold of local minima is compact. The essential difference in this case is that SGD cannot leave a basin of attraction along directions tangential to the manifold. We first observe the convergence of SGD in directions normal to the manifold.
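As a toy illustration of convergence in the normal direction to a compact manifold of minima, the following Python sketch runs mini-batch SGD on a hypothetical objective whose minima form the unit circle. The loss F(theta, x) = x (|theta|^2 - 1)^2 with x uniformly distributed on [0, 2], the step size r / n^rho, the initialization box, and all names and parameter values are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def minibatch_gradient(theta, x):
    # Gradient of the hypothetical loss F(theta, x) = x * (|theta|^2 - 1)^2, averaged over x.
    return 4.0 * np.mean(x) * (theta @ theta - 1.0) * theta


def distance_to_circle(theta):
    # Distance to the unit circle, the set of minima of f(theta) = (|theta|^2 - 1)^2.
    return abs(np.linalg.norm(theta) - 1.0)


def sgd_distances(rng, n_steps=5000, batch_size=8, r=0.02, rho=0.7):
    theta = rng.uniform(0.5, 1.2, size=2)           # initial data on a box away from the origin
    start = distance_to_circle(theta)
    for step in range(1, n_steps + 1):
        x = rng.uniform(0.0, 2.0, size=batch_size)  # E[x] = 1, so E[F(theta, x)] = f(theta)
        theta = theta - (r / step ** rho) * minibatch_gradient(theta, x)
    return start, distance_to_circle(theta)


rng = np.random.default_rng(1)
results = [sgd_distances(rng) for _ in range(3)]
final_distances = [end for _, end in results]
```

In this example the stochastic gradient is always radial, so the iterates move only in the direction normal to the circle and the angular position of the initial data is preserved.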
The following proposition is an immediate consequence of Proposition 5.3 and the compactness of , where the essential difference in the compact case is that can be chosen arbitrarily large. In particular, by compactness, for every there exists such that for every , it holds that . Furthermore, it follows from Remark 5.5 that the results apply to .
Proposition 6.1.
Let , , , let be the standard norm on , let be an open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a non-empty compact -dimensional -submanifold of , assume for every that , for every , , let satisfy for every that , for every , , let satisfy that
[TABLE]
and for every , , , let satisfy that
[TABLE]
Then for every there exist such that for every , , , , (cf. Definition 2.6) it holds that
[TABLE]
Proof of Proposition 6.1.
The proof is an immediate consequence of Proposition 5.3 and the compactness of . ∎
We will now obtain a lower bound in probability for the events , . It follows from Proposition 5.7 and the compactness of that for every there exist such that the conclusion of Proposition 5.7 is satisfied for every , , and for this constant . That is, since for every sufficiently large we have , it holds that the constant can be chosen independently of .
The proof of the following proposition is then an immediate consequence of Proposition 5.7, after using the fact that the constant is independent of and passing to the limit . The improvement in the estimate, when compared to Proposition 5.7, is a result of the fact that SGD cannot leave the basin of attraction along the directions tangential to the manifold.
Proposition 6.2.
Let , , , let be the standard norm on , let be an open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a non-empty compact -dimensional -submanifold of , assume for every that , for every , , let satisfy for every that , for every , , let satisfy that
[TABLE]
and for every , , , let satisfy that
[TABLE]
Then for every there exist such that for every , , , , (cf. Definition 2.6) it holds that
[TABLE]
Proof of Proposition 6.2.
The proof is an immediate consequence of Proposition 5.7 and the compactness of .∎
The following theorem proves the convergence of SGD with initial data sampled from a uniform distribution on a bounded open set which satisfies that . The proof is an immediate consequence of Theorem 5.8, Proposition 6.1, and Proposition 6.2.
Theorem 6.3.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be the Lebesgue-Borel measure, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a compact -dimensional -submanifold of , assume that , assume for every that , for every , let be continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, and for every , let , , be random variables which satisfy that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Theorem 6.3.
The proof is an immediate consequence of Theorem 5.8, Proposition 6.1, and Proposition 6.2.∎
The following theorem estimates the probability that independent solutions of SGD with initial data sampled from a uniform distribution on a compact set which satisfies that is non-empty fail to converge to within distance to the local manifold of minima at time . The convergence is measured by minimizing a mini-batch average of the objective function. The proof is a consequence of Theorem 6.3 and the arguments leading from Theorem 5.8 to Theorem 5.12.
Theorem 6.4.
Let , , , let be the standard norm on , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
let be the function which satisfies for every that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a compact -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that and are independent, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, assume for every , that
[TABLE]
and for every , let be a random variable which satisfies that
[TABLE]
Then for every there exist such that for every , , , , it holds that
[TABLE]
Proof of Theorem 6.4.
The proof is an immediate consequence of Theorem 6.3, Theorem 5.8, and Theorem 5.12.∎
In the final corollary of this section, we prove that the computational efficiency of the SGD algorithm proposed in Theorem 6.4 is improved by the compactness of . The improvement is due to the fact that the mini-batch size can be chosen smaller in the compact case, since the mini-batch size no longer needs to account for the possibility that SGD leaves a basin of attraction along directions tangential to the local manifold of minima.
Corollary 6.5.
Let , , , let be an open set, let be a bounded open set, let be a probability space, let be a measurable space, let be a measurable function, let , , be i.i.d. random variables which satisfy for every that \mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty, let be the function which satisfies for every that f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}, let satisfy that
[TABLE]
assume for every that is a locally Lipschitz continuous function, assume that is a three times continuously differentiable function, assume for every non-empty compact set that \sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty, assume that is a compact -dimensional -submanifold of , assume that , assume for every that , for every , , let , , be i.i.d. random variables, assume for every , that and are independent, assume for every , that is continuous uniformly distributed on , assume for every , that and \big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}} are independent, assume for every , that
[TABLE]
and for every , let be a random variable which satisfies that
[TABLE]
Then for every there exist such that for every , , there exist , , such that for every , for which satisfy that
[TABLE]
it holds that
[TABLE]
Proof of Corollary 6.5.
The proof is an immediate consequence of Theorem 6.4 and the proof of Corollary 5.13.∎
7 Applications
In this section, we prove that the conditions of Theorem 1.1 are satisfied for some (simple) objective functions of the type (1.33) that arise in the training of neural networks. We will consider the case of a four-parameter affine-linear network with a linear activation function and the case of a two-parameter network with the ReLU activation function. We will prove that the sets of global minima are, respectively, a codimension submanifold of the parameter space and a codimension submanifold. This implies, in particular, that the global minima are not locally unique, and that established convergence results, such as those proven in [13, 24], do not apply.
7.1 A four-parameter network with a linear activation function
In this section, we show that the conditions of Theorem 1.1 are satisfied by a four-parameter affine-linear network with a linear activation function.
Proposition 7.1.
Let be finite, let be a probability space, let , , be i.i.d. random variables that are continuous uniformly distributed on , let be the function which satisfies for every that
[TABLE]
and let be the function that satisfies for every , that
[TABLE]
Then the functions , and the random variables , , satisfy the conditions of Theorem 1.1.
Proof of Proposition 7.1.
Let be finite. The finiteness of proves that, for every , we have . It follows by the uniform distribution of the , , on that , and it follows from the -integrability of that for every compact subset it holds that
[TABLE]
It follows by the definition of and that . It remains to characterize the set of minima of . We first observe that when minimizing , it is sufficient to minimize the potential over the set . To see this, suppose that . Then for it holds that
[TABLE]
Therefore, it holds that
[TABLE]
Let be fixed but arbitrary. An explicit computation proves the critical points of satisfy that
[TABLE]
For , , which satisfy that
[TABLE]
it follows that satisfies equation (7.6) if and only if it holds that
[TABLE]
For which satisfies that , an explicit computation proves that satisfies system (7.8) if and only if it holds that
[TABLE]
For which satisfies that
[TABLE]
for which satisfies that
[TABLE]
we claim that
[TABLE]
Let satisfy (7.9) and . Proceeding by contradiction, suppose that there exists which satisfies such that
[TABLE]
Since an explicit computation proves for every that
[TABLE]
the identical considerations leading to (7.9) prove that
[TABLE]
is uniquely minimized, owing to , by which satisfies that
[TABLE]
We conclude that satisfies that
[TABLE]
satisfies (7.9) and . Therefore, it holds that
[TABLE]
which contradicts the fact that on the connected set of which satisfies (7.9) and . This proves (7.12). It is immediate from (7.9) that is a non-empty, -dimensional, -submanifold of . It remains only to prove the nondegeneracy assumption. For every , after computing the Hessian (due to the symmetry of the Hessian, we only write the upper diagonal), it holds that
[TABLE]
where this equality relies upon the fact that, due to (7.6) and on , we have that
[TABLE]
A column-reduction, which relies on the fact that for every we have , proves for every that
[TABLE]
This completes the proof of Proposition 7.1. ∎
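The two properties established in the proof (non-unique minima and a nondegenerate Hessian in the normal directions) can be checked numerically on a hypothetical instance. The following Python sketch assumes the network realization x -> theta_3 (theta_1 x + theta_2) + theta_4, the target function 2x + 1, and inputs uniformly distributed on [0, 1]. These choices, the finite-difference Hessian, and all names are assumptions made for this illustration, and the exact setting of Proposition 7.1 is not reproduced here. In this instance the minima are the parameters with theta_1 theta_3 = 2 and theta_2 theta_3 + theta_4 = 1, a two-dimensional subset of the four-dimensional parameter space.

```python
import numpy as np


def objective(theta):
    # Expected squared error E[(theta_3 * (theta_1 * X + theta_2) + theta_4 - (2 X + 1))^2]
    # for X uniform on [0, 1], written in closed form using E[X] = 1/2 and E[X^2] = 1/3.
    t1, t2, t3, t4 = theta
    a = t1 * t3 - 2.0          # error in the effective slope
    b = t2 * t3 + t4 - 1.0     # error in the effective intercept
    return a * a / 3.0 + a * b + b * b


def hessian(fun, theta, h=1e-3):
    # Central finite-difference approximation of the Hessian of fun at theta.
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = h
            ej[j] = h
            H[i, j] = (fun(theta + ei + ej) - fun(theta + ei - ej)
                       - fun(theta - ei + ej) + fun(theta - ei - ej)) / (4.0 * h * h)
    return H


# Two different parameter vectors realizing the same function 2x + 1.
minimum = np.array([1.0, 0.5, 2.0, 0.0])
other_minimum = np.array([4.0, -2.0, 0.5, 2.0])

# Numerical rank of the Hessian at a minimum, with an absolute tolerance on singular values.
rank = int(np.linalg.matrix_rank(hessian(objective, minimum), tol=1e-4))
print(objective(minimum), objective(other_minimum), rank)
```

In this instance the numerical rank of the Hessian equals the number of normal directions (four parameters minus a two-dimensional set of minima), which is the kind of nondegeneracy verified in the proof above.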
7.2 A two-parameter network with the ReLU activation function
In this section, we show that the conditions of Theorem 1.1 are satisfied by a two-parameter affine-linear network with the ReLU activation function.
Proposition 7.2.
Let be a probability space, let , , be i.i.d. random variables that are continuous uniformly distributed on , let be the function which satisfies for every that
[TABLE]
and let be the function which satisfies for every , that
[TABLE]
Then the functions , and the random variables , , satisfy the conditions of Theorem 1.1.
Proof of Proposition 7.2.
It is immediate that . Since the , are uniformly distributed on , for every it holds that
[TABLE]
and, furthermore, a straightforward computation proves for every compact set that
[TABLE]
It remains only to characterize the minima of the objective function, and to verify the nondegeneracy condition. An explicit computation proves that, when minimizing , it is sufficient to restrict to the set . Let satisfy that
[TABLE]
We observe for every that
[TABLE]
and for every that
[TABLE]
Therefore, for it holds that if and only if it holds that
[TABLE]
Let satisfy that
[TABLE]
We claim that
[TABLE]
Suppose that satisfies (7.29). Proceeding by contradiction, suppose that there exists such that
[TABLE]
Since an explicit computation proves that
[TABLE]
the arguments leading from (7.27) to (7.29) prove that (7.33) is uniquely minimized when
[TABLE]
Therefore, for which satisfies that
[TABLE]
we have that , that satisfies (7.29), and that
[TABLE]
This contradicts the fact that on the connected set of that satisfy (7.29). This proves (7.31). Since it is clear that is a non-empty, -dimensional, -submanifold of , it remains only to establish the nondegeneracy assumption. For every it holds that
[TABLE]
A column reduction and prove for every that
[TABLE]
This completes the proof of Proposition 7.2. ∎
Acknowledgements
The first author acknowledges financial support from the National Science Foundation Mathematical Sciences Postdoctoral Research Fellowship under Grant Number 1502731.
The second author acknowledges financial support by the DFG through the CRC 1283 “Taming uncertainty and profiting from randomness and low regularity in analysis, stochastics and their applications.”
References
[1] M. Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim., 10(4):1116–1135, 2000.
[2] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res., 15:595–627, 2014.
[3] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems (NIPS), 2011.
[4] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
[5] B. Bercu and J.-C. Fort. Generic stochastic gradient methods. Wiley Encyclopedia of Operations Research and Management Science, pages 1–8, 2013.
[6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Physica-Verlag/Springer, Heidelberg, 2010.
[7] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Optimization for Machine Learning, MIT Press, pages 351–368, 2011.
[8] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
