Convergence rates for the stochastic gradient descent method for non-convex objective functions

Benjamin Fehrman; Benjamin Gess; Arnulf Jentzen

arXiv:1904.01517·math.NA·November 2, 2021

TL;DR

This paper establishes local convergence and rate estimates for stochastic gradient descent on non-convex functions, relevant to machine learning applications, expanding understanding beyond convex scenarios.

Contribution

It provides the first local convergence and rate results for SGD on objective functions that are not necessarily globally convex or contracting, applicable in machine learning.

Findings

1. Proves local convergence to minima for non-convex functions.
2. Provides estimates on the rate of convergence.
3. Applies to simple objective functions in machine learning.

Abstract

We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective functions. In particular, the results are applicable to simple objective functions arising in machine learning.

Equations

$$f(\theta)=\mathbb{E}[F(\theta,X)],\tag{1.1}$$

$$\big(-\nabla_{\theta}f(\theta),\theta-\theta^{*}\big)\leq-L\|\theta-\theta^{*}\|^{2}.\tag{1.2}$$

$$\mathcal{M}=\big\{\theta\in\mathbb{R}^{d}\colon[f(\theta)=\inf\nolimits_{\vartheta\in\mathbb{R}^{d}}f(\vartheta)]\big\},\tag{1.3}$$

$$\Theta^{k,M,r}_{n}=\Theta^{k,M,r}_{n-1}-\frac{r}{n^{\rho}M}\bigg[\sum_{m=1}^{M}(\nabla_{\theta}F)(\Theta^{k,M,r}_{n-1},X_{k,n,m})\bigg],\tag{1.4}$$

$$\sum_{m=1}^{\mathfrak{M}}F(\varTheta^{K,M,\mathfrak{M},r}_{n},X_{1,n+1,m})=\min_{k\in\{1,2,\ldots,K\}}\bigg[\sum_{m=1}^{\mathfrak{M}}F(\Theta^{k,M,r}_{n},X_{1,n+1,m})\bigg],\tag{1.5}$$

$$\mathbb{P}\Big(\Big[f(\varTheta^{K,M,\mathfrak{M},r}_{n})-\inf\nolimits_{\theta\in\mathbb{R}^{d}}f(\theta)\Big]\geq\varepsilon\Big)\leq\frac{cK}{\varepsilon^{2}\mathfrak{M}}+\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.6}$$

$$\mathcal{M}=\{\theta\in\mathbb{R}^{d}\colon f(\theta)=\big[\inf\nolimits_{\vartheta\in\mathbb{R}^{d}}f(\vartheta)\big]\},\tag{1.7}$$

$$\mathcal{M}\cap U\text{ is a non-empty }\mathfrak{d}\text{-dimensional }\operatorname{C}^{1}\text{-submanifold of }\mathbb{R}^{d}.\tag{1.8}$$

$$\operatorname{rank}\big((\operatorname{Hess}f)(\theta)\big)=d-\mathfrak{d}=\operatorname{codim}(\mathcal{M}\cap U).\tag{1.9}$$

$$f(\theta)=\mathbb{E}\big[F(\theta,X_{1,1,1})\big].\tag{1.10}$$

$$\Theta^{k,M,r}_{n}=\Theta^{k,M,r}_{n-1}-\frac{r}{n^{\rho}M}\bigg[\sum_{m=1}^{M}(\nabla_{\theta}F)(\Theta^{k,M,r}_{n-1},X_{k,n,m})\bigg].\tag{1.11}$$

$$F^{K,\mathfrak{M},n}(\theta,\omega)=\frac{1}{\mathfrak{M}}\sum_{m=1}^{\mathfrak{M}}F(\theta,X_{1,n+1,m}(\omega)).\tag{1.12}$$

$$\sum_{m=1}^{\mathfrak{M}}F(\varTheta^{K,M,\mathfrak{M},r}_{n},X_{1,n+1,m})=\min_{k\in\{1,2,\ldots,K\}}\bigg[\sum_{m=1}^{\mathfrak{M}}F(\Theta^{k,M,r}_{n},X_{1,n+1,m})\bigg].\tag{1.13}$$

$$\mathbb{P}\Big(\Big[f(\varTheta^{K,M,\mathfrak{M},r}_{n})-\inf\nolimits_{\theta\in\mathbb{R}^{d}}f(\theta)\Big]\geq\varepsilon\Big)\leq\frac{cK}{\varepsilon^{2}\mathfrak{M}}+\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.14}$$

$$n(\varepsilon)=c_{1}\varepsilon^{-\nicefrac{2}{\rho}},\quad M(\varepsilon)=c_{2}\varepsilon^{-\nicefrac{4}{\rho}+4},\quad\mathfrak{M}(\varepsilon,\eta)=c_{3}\varepsilon^{-2}\eta^{-1}|\log(\eta)|,\quad\text{and}\quad K(\eta)=c_{4}|\log(\eta)|,\tag{1.15}$$

$$\mathbb{P}\Big(\big[f(\varTheta^{K(\eta),M(\varepsilon),\mathfrak{M}(\varepsilon,\eta),r}_{n(\varepsilon)})-\inf_{\theta\in\mathbb{R}^{d}}f(\theta)\big]\geq\varepsilon\Big)\leq\eta.\tag{1.16}$$

$$\textrm{Eff}(\varepsilon,\eta;A)=\#\;\textrm{computations sufficient to ensure (1.16)}.\tag{1.17}$$

$$\textrm{Eff}(\varepsilon,\eta;A)\leq c\big(\varepsilon^{-2}\eta^{-1}|\log(\eta)|+\varepsilon^{-\nicefrac{6}{\rho}+4}|\log(\eta)|\big),\tag{1.18}$$

$$f(\varTheta^{K,M,\infty,r}_{n})=\Big[\min_{k\in\{1,2,\ldots,K\}}f(\Theta^{k,M,r}_{n})\Big],\tag{1.19}$$

$$\mathbb{P}\Big(\big[f(\varTheta^{K,M,\infty,r}_{n})-\inf_{\theta\in\mathbb{R}^{d}}f(\theta)\big]\geq\varepsilon\Big)\leq\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.20}$$

$$\mathbb{P}\Big(\big[\min_{k\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{k,M,r}_{n}-\theta\big|\big]\geq\varepsilon\Big)\leq\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.21}$$

$$n(\varepsilon)=c_{1}\varepsilon^{-\nicefrac{2}{\rho}},\quad M(\varepsilon)=c_{2}\varepsilon^{-\nicefrac{4}{\rho}+4},\quad\text{and}\quad K(\eta)=c_{3}|\log(\eta)|,\tag{1.22}$$

$$\mathbb{P}\Big(\big[\min_{k\in\{1,2,\ldots,K(\eta)\}}\inf_{\vartheta\in(\mathcal{M}\cap U)}\big|\Theta^{k,M(\varepsilon),r}_{n(\varepsilon)}-\vartheta\big|\big]\geq\varepsilon\Big)\leq\eta.\tag{1.23}$$

$$\textrm{Eff}_{\textrm{SGD}}(\varepsilon,\eta;A)=\#\;\textrm{computations sufficient to ensure (1.23)}.\tag{1.24}$$

$$\textrm{Eff}_{\textrm{SGD}}(\varepsilon,\eta;A)\leq c\big(\varepsilon^{-\nicefrac{6}{\rho}+4}|\log(\eta)|\big).\tag{1.25}$$

$$\frac{\lambda\big(\{\theta\in A\colon\inf_{\vartheta\in(\mathcal{M}\cap U)}|\theta-\vartheta|\geq\varepsilon\}\big)}{\lambda(A)}\geq1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}.\tag{1.26}$$

$$\mathbb{P}\Big(\min_{i\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{i}-\theta\big|\geq\varepsilon\Big)\geq\Big(1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}\Big)^{K}.\tag{1.27}$$

$$\mathbb{P}\Big(\min_{i\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{i}-\theta\big|\geq\varepsilon\Big)\leq\eta,\tag{1.28}$$

$$K(\varepsilon,\eta)\geq\log\Big(1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}\Big)^{-1}|\log(\eta)|.\tag{1.29}$$

$$K(\varepsilon,\eta)\geq c\varepsilon^{-(d-\mathfrak{d})}|\log(\eta)|.\tag{1.30}$$

Full text

Convergence rates for the stochastic gradient descent method for non-convex objective functions

Benjamin Fehrman$^{1}$, Benjamin Gess$^{2}$, and Arnulf Jentzen$^{3}$

$^{1}$ Mathematical Institute, University of Oxford, Oxford, United Kingdom, e-mail: [email protected]

$^{2}$ Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany; Fakultät für Mathematik, Universität Bielefeld, Bielefeld, Germany, e-mail: [email protected]

$^{3}$ Seminar for Applied Mathematics, Department of Mathematics, ETH Zurich, Zurich, Switzerland, e-mail: [email protected]

Abstract

We prove the local convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily globally convex nor contracting objective functions. In particular, the results are applicable to simple objective functions arising in machine learning.

1 Introduction
1.1 Literature
1.2 Structure of the work
2 Geometric preliminaries
3 Continuous deterministic gradient descent
4 Discrete deterministic gradient descent
5 Stochastic gradient descent
6 Stochastic gradient descent - The compact case
7 Applications
7.1 A four-parameter network with a linear activation function
7.2 A two parameter network with the ReLU activation function

1 Introduction

Stochastic gradient descent algorithms (SGD), going back to [46], are the most common way to train neural networks. Despite their relevance to machine learning and much recent interest, estimates on their rate of convergence have so far only been shown under global contraction or convexity assumptions on the objective function that are often not satisfied by examples arising in machine learning. Indeed, citing from [52], “While SGD has been rigorously analyzed only for convex loss functions […], in deep learning the loss is a non-convex function of the network parameters, hence there are no guarantees that SGD finds the global minimizer.” In the present work, we prove the local convergence of SGD to the set of global minima of the objective function while avoiding such a global convexity or contractivity assumption. The relevance of the obtained results is demonstrated by the application to the training of (simple) neural networks.

Stochastic gradient descent methods are used to numerically minimize functions $f\colon\mathbb{R}^{d}\to\mathbb{R}$ of the form

$$f(\theta)=\mathbb{E}[F(\theta,X)],\tag{1.1}$$

for some product measurable function $F\colon\mathbb{R}^{d}\times\mathbb{R}^{m}\rightarrow\mathbb{R}$ and some random variable $X\colon\Omega\rightarrow\mathbb{R}^{m}$ on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. The analysis of SGD has attracted considerable attention in the literature (cf., e.g., [2, 4, 8, 13, 24, 35, 51] and the references therein). In [13, 24], the convergence of SGD with rates was analyzed under the following contraction property for the objective function $f$, which is classical in stochastic approximation theory: There is an $L>0$ and a zero $\theta^{*}$ of $\nabla_{\theta}f$ such that for every $\theta\in\mathbb{R}^{d}$ it holds that

$$\big(-\nabla_{\theta}f(\theta),\theta-\theta^{*}\big)\leq-L\|\theta-\theta^{*}\|^{2}.\tag{1.2}$$

In particular, this contraction property implies the uniqueness of the zero $\theta^{*}$ of $\nabla_{\theta}f$ and thus the uniqueness of local minima of $f$. This is in stark contrast to actual objective functions arising in the training of neural networks, which are expected to show rich sets of local minima and saddle points/plateaus. Consequently, it is vital for the application to machine learning to avoid such global contraction assumptions. In addition, for example due to the positive homogeneity of the ReLU function, the objective functions typically satisfy certain symmetries, implying that global (and local) minima are neither isolated points nor unique, but form (possibly non-compact) manifolds. Indeed, this is demonstrated for simple neural networks in Section 7 below. We are therefore led to the task of analyzing the convergence properties of SGD locally at sets of minima. (Footnote 1: We emphasize that this is distinct from the recent works [8, 30, 53], where the global convergence of the gradient of the objective function to zero has been shown for SGD and AdaGrad. This does not imply the local convergence to minima, since the gradient also vanishes in saddles/plateaus.) In the present work we provide estimates on the rate of convergence for SGD under assumptions avoiding a contraction property like (1.2).
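The uniqueness statement can be checked directly. The following short derivation is a restatement, under the sole assumption that (1.2) holds for every $\theta\in\mathbb{R}^{d}$; the symbol $\vartheta$ for a second zero is introduced here only for illustration. Suppose $\vartheta\in\mathbb{R}^{d}$ satisfies $\nabla_{\theta}f(\vartheta)=0$. Applying (1.2) with $\theta=\vartheta$ gives

$$0=\big(-\nabla_{\theta}f(\vartheta),\vartheta-\theta^{*}\big)\leq-L\|\vartheta-\theta^{*}\|^{2}\leq0,$$

so that $\|\vartheta-\theta^{*}\|=0$ and hence $\vartheta=\theta^{*}$. Since every local minimum of a differentiable function is a zero of its gradient, at most one local minimum can exist.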

Theorem 1.1.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{k,n,m}\colon\Omega\rightarrow S$ , $k,n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

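$$\mathcal{M}=\big\{\theta\in\mathbb{R}^{d}\colon[f(\theta)=\inf\nolimits_{\vartheta\in\mathbb{R}^{d}}f(\vartheta)]\big\},\tag{1.3}$$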

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a continuously differentiable function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $(\Theta^{k,M,r}_{0})_{k\in\mathbb{N}}$ and $(X_{k,n,m})_{k,n,m\in\mathbb{N}}$ are independent, assume for every $k,n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

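$$\Theta^{k,M,r}_{n}=\Theta^{k,M,r}_{n-1}-\frac{r}{n^{\rho}M}\bigg[\sum_{m=1}^{M}(\nabla_{\theta}F)(\Theta^{k,M,r}_{n-1},X_{k,n,m})\bigg],\tag{1.4}$$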

and for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\infty)$ let $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which satisfies that

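$$\sum_{m=1}^{\mathfrak{M}}F(\varTheta^{K,M,\mathfrak{M},r}_{n},X_{1,n+1,m})=\min_{k\in\{1,2,\ldots,K\}}\bigg[\sum_{m=1}^{\mathfrak{M}}F(\Theta^{k,M,r}_{n},X_{1,n+1,m})\bigg]\tag{1.5}$$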

(cf. Lemma 5.11 below). Then there exist $\mathfrak{r},c\in(0,\infty)$ , $\kappa\in[0,1)$ such that for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\mathfrak{r}]$ , $\varepsilon\in(0,1]$ it holds that

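$$\mathbb{P}\Big(\Big[f(\varTheta^{K,M,\mathfrak{M},r}_{n})-\inf\nolimits_{\theta\in\mathbb{R}^{d}}f(\theta)\Big]\geq\varepsilon\Big)\leq\frac{cK}{\varepsilon^{2}\mathfrak{M}}+\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.6}$$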

Theorem 1.1 is an immediate consequence of Theorem 5.12 in Section 5 below. The statement of Theorem 1.1 should be interpreted in the following way. We aim to minimize an objective function $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ , where we assume that the set of minima

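$$\mathcal{M}=\{\theta\in\mathbb{R}^{d}\colon f(\theta)=\big[\inf\nolimits_{\vartheta\in\mathbb{R}^{d}}f(\vartheta)\big]\},\tag{1.7}$$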

is somewhere locally smooth in the sense that there exists an open set $U\subseteq\mathbb{R}^{d}$ such that

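$$\mathcal{M}\cap U\text{ is a non-empty }\mathfrak{d}\text{-dimensional }\operatorname{C}^{1}\text{-submanifold of }\mathbb{R}^{d}.\tag{1.8}$$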

We furthermore assume that $f$ is locally $\operatorname{C}^{3}$ in a neighborhood of $\mathcal{M}\cap U$ and that the Hessian is maximally nondegenerate on $\mathcal{M}\cap U$ in the sense that for every $\theta\in(\mathcal{M}\cap U)$ it holds that

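$$\operatorname{rank}\big((\operatorname{Hess}f)(\theta)\big)=d-\mathfrak{d}=\operatorname{codim}(\mathcal{M}\cap U).\tag{1.9}$$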

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, and let $X_{k,n,m}\colon\Omega\rightarrow S$, $k,n,m\in\mathbb{N}$, be i.i.d. random variables. We assume that there exists a measurable function $F\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ which satisfies for every $\theta\in\mathbb{R}^{d}$ that

$$f(\theta)=\mathbb{E}\big[F(\theta,X_{1,1,1})\big].\tag{1.10}$$

In particular, since it is oftentimes the case in practice that the deterministic gradient $\nabla f(\theta)$ cannot be computed or cannot be efficiently computed, the random gradient $\nabla_{\theta}F(\theta,X_{1,1,1})$ provides an efficiently computable stochastic approximation.

The initial data of SGD is sampled from a bounded open set $A\subseteq\mathbb{R}^{d}$ which satisfies that $\mathcal{M}\cap U\cap A\neq\emptyset$ . That is, for every mini-batch size $M\in\mathbb{N}$ and $r\in(0,\infty)$ , the initial data $\Theta^{k,M,r}_{0}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , are uniformly distributed on $A$ , independent, and independent of the driving noise $X_{k,n,m}$ , $k,n,m\in\mathbb{N}$ . We then compute independent solutions to SGD in the sense that for every $k,n\in\mathbb{N}$ it holds that

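$$\Theta^{k,M,r}_{n}=\Theta^{k,M,r}_{n-1}-\frac{r}{n^{\rho}M}\bigg[\sum_{m=1}^{M}(\nabla_{\theta}F)(\Theta^{k,M,r}_{n-1},X_{k,n,m})\bigg].\tag{1.11}$$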

For a fixed terminal time $n\in\mathbb{N}$ , for a sampling size $K\in\mathbb{N}$ , the output of the algorithm at this point is the collection of values $\Theta^{k,M,r}_{n}$ , $k\in\{1,2,\ldots,K\}$ . It remains to identify the value $\Theta^{k,M,r}_{n}$ , $k\in\{1,2,\ldots,K\}$ , that minimizes the objective function.
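The procedure described so far (uniform initial data on $A$, step size $r/n^{\rho}$, mini-batches of size $M$, and $K$ independent runs) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation. The toy loss $F(\theta,x)=(\theta_{1}\theta_{2}x-x)^{2}$ with standard normal $x$, the set $A=(0.5,1.5)^{2}$, and the names `grad_F`, `sgd_run`, and `independent_runs` are all assumptions made here; for this toy loss, $f(\theta)=(\theta_{1}\theta_{2}-1)^{2}$ and the minima form the curve $\theta_{1}\theta_{2}=1$.

```python
import random

def grad_F(theta, x):
    """Gradient in theta of the hypothetical toy loss F(theta, x) = (theta1*theta2*x - x)**2."""
    t1, t2 = theta
    residual = (t1 * t2 - 1.0) * x * x
    return (2.0 * residual * t2, 2.0 * residual * t1)

def sgd_run(theta0, n_steps, M, r, rho, rng):
    """One SGD trajectory: step size r / n**rho, mini-batch of M fresh samples per step."""
    theta = theta0
    for n in range(1, n_steps + 1):
        g1 = g2 = 0.0
        for _ in range(M):
            x = rng.gauss(0.0, 1.0)  # assumed data distribution
            d1, d2 = grad_F(theta, x)
            g1 += d1
            g2 += d2
        step = r / (n ** rho * M)  # r / (n^rho * M), applied to the summed gradients
        theta = (theta[0] - step * g1, theta[1] - step * g2)
    return theta

def independent_runs(K, n_steps, M, r, rho, seed=0):
    """K independent trajectories started uniformly on the assumed set A = (0.5, 1.5)^2."""
    rng = random.Random(seed)
    results = []
    for _ in range(K):
        theta0 = (rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5))
        results.append(sgd_run(theta0, n_steps, M, r, rho, rng))
    return results
```

A call such as `independent_runs(K=5, n_steps=500, M=8, r=0.1, rho=0.75)` returns the $K$ terminal values; choosing among them is a separate step.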

Much as in the case of the gradient, since the objective function cannot be practically computed, for a terminal time $n\in\mathbb{N}$ , for a mini-batch size $\mathfrak{M}\in\mathbb{N}$ , we introduce the mini-batch approximation $F^{K,\mathfrak{M},n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}$ which satisfies for every $(\theta,\omega)\in\mathbb{R}^{d}\times\Omega$ that

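$$F^{K,\mathfrak{M},n}(\theta,\omega)=\frac{1}{\mathfrak{M}}\sum_{m=1}^{\mathfrak{M}}F(\theta,X_{1,n+1,m}(\omega)).\tag{1.12}$$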

We then identify the value $\Theta^{k,M,r}_{n}$ , $k\in\{1,\ldots,K\}$ , that minimizes $F^{K,\mathfrak{M},n}$ in the sense that we compute a random variable $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ which satisfies that

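$$\sum_{m=1}^{\mathfrak{M}}F(\varTheta^{K,M,\mathfrak{M},r}_{n},X_{1,n+1,m})=\min_{k\in\{1,2,\ldots,K\}}\bigg[\sum_{m=1}^{\mathfrak{M}}F(\Theta^{k,M,r}_{n},X_{1,n+1,m})\bigg].\tag{1.13}$$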
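The selection step can be sketched as follows: all $K$ candidates are scored on one shared evaluation batch of size $\mathfrak{M}$, and the candidate with the smallest empirical mean is returned. This is a minimal sketch under assumptions, not the paper's code. The toy loss `F`, the standard normal data, and the name `select_candidate` are hypothetical.

```python
import random

def F(theta, x):
    """Hypothetical toy loss F(theta, x) = (theta1*theta2*x - x)**2."""
    return ((theta[0] * theta[1] - 1.0) * x) ** 2

def select_candidate(candidates, n_eval, rng):
    """Return the candidate minimizing the empirical mean of F over one shared batch.

    n_eval plays the role of the evaluation mini-batch size (fraktur M in the text).
    """
    batch = [rng.gauss(0.0, 1.0) for _ in range(n_eval)]  # same samples for every candidate

    def empirical_loss(theta):
        return sum(F(theta, x) for x in batch) / n_eval

    return min(candidates, key=empirical_loss)

best = select_candidate([(2.0, 2.0), (1.0, 1.0), (0.5, 0.5)], 32, random.Random(0))
```

Here `(1.0, 1.0)` lies on the toy minimum set $\theta_{1}\theta_{2}=1$, so its empirical loss is exactly zero and it is the one selected.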

The conclusion of Theorem 1.1 estimates the probability that $\varTheta^{K,M,\mathfrak{M},r}_{n}$ is an $\varepsilon\in(0,1]$ minimizer of the objective function. Precisely, there exist $\mathfrak{r},c\in(0,\infty)$ , $\kappa\in[0,1)$ such that for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\mathfrak{r}]$ , $\varepsilon\in(0,1]$ it holds that

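$$\mathbb{P}\Big(\Big[f(\varTheta^{K,M,\mathfrak{M},r}_{n})-\inf\nolimits_{\theta\in\mathbb{R}^{d}}f(\theta)\Big]\geq\varepsilon\Big)\leq\frac{cK}{\varepsilon^{2}\mathfrak{M}}+\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.14}$$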

The limit $\mathfrak{M}\rightarrow\infty$ corresponds to computing the minimizer of $f$ exactly. If this can be done efficiently, then the first term on the right-hand side of (1.14) vanishes.
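To see how the two terms of the bound in (1.14) trade off, its right-hand side can be evaluated numerically. The sketch below is only an illustration: the theorem asserts that constants $c$ and $\kappa$ exist but does not give their values, so the arguments `c` and `kappa` are placeholders to be supplied by the reader, and the function name `failure_bound` is hypothetical.

```python
def failure_bound(eps, n, M, frak_M, K, rho, c, kappa):
    """Right-hand side of the probability bound for given (placeholder) constants c, kappa.

    Returns c*K/(eps^2*frak_M) + [kappa + c*(1/(eps^2*n^rho) + n^(1-rho)/sqrt(M))]^K.
    """
    selection_term = c * K / (eps ** 2 * frak_M)
    per_run = kappa + c * (1.0 / (eps ** 2 * n ** rho) + n ** (1.0 - rho) / M ** 0.5)
    return selection_term + per_run ** K

# With eps = n = M = frak_M = K = 1 the expression reduces to c + (kappa + 2*c).
value = failure_bound(eps=1.0, n=1, M=1, frak_M=1, K=1, rho=0.75, c=0.1, kappa=0.5)
```

The second term decays in $K$ only when the bracket is below $1$, which is why $n$ and $M$ must be large enough relative to $\varepsilon$ before additional independent runs help.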

The constant $\kappa\in[0,1)$ , which we compute precisely in Theorem 5.12 below, quantifies two sources of error: the probability that the initial condition lies outside of a basin of attraction and a portion of the probability that SGD beginning in a basin of attraction fails to converge. In Remark 5.61 below and Section 6, we prove that the restriction $\rho\in(\nicefrac{{2}}{{3}},1)$ can be extended to $\rho\in(0,1)$ under the additional assumption that $\mathcal{M}\cap U$ is a compact subset of $\mathbb{R}^{d}$ . Finally, it is not necessary to assume that $F$ is continuously differentiable, and this assumption can be replaced with the assumption that for every $x\in S$ we have that $F(\cdot,x)$ is a locally Lipschitz continuous function of $\theta\in\mathbb{R}^{d}$ .

We observe that the computational efficiency of the algorithm can be estimated using Theorem 1.1. In particular, it follows from Corollary 5.13 below that there exist constants $c_{i}\in(0,\infty)$ , $i\in\{1,2,3,4\}$ , such that for every $\varepsilon,\eta\in(0,1]$ , for $n(\varepsilon)\in\mathbb{N}_{0},M(\varepsilon),\mathfrak{M}(\varepsilon,\eta),K(\eta)\in\mathbb{N}$ which satisfy that

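$$n(\varepsilon)=c_{1}\varepsilon^{-\nicefrac{2}{\rho}},\quad M(\varepsilon)=c_{2}\varepsilon^{-\nicefrac{4}{\rho}+4},\quad\mathfrak{M}(\varepsilon,\eta)=c_{3}\varepsilon^{-2}\eta^{-1}|\log(\eta)|,\quad\text{and}\quad K(\eta)=c_{4}|\log(\eta)|,\tag{1.15}$$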

it holds that

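$$\mathbb{P}\Big(\big[f(\varTheta^{K(\eta),M(\varepsilon),\mathfrak{M}(\varepsilon,\eta),r}_{n(\varepsilon)})-\inf_{\theta\in\mathbb{R}^{d}}f(\theta)\big]\geq\varepsilon\Big)\leq\eta.\tag{1.16}$$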

For every bounded open set $A\subseteq\mathbb{R}^{d}$ which satisfies that $\mathcal{M}\cap U\cap A$ is non-empty, for every $\varepsilon,\eta\in(0,1]$ , the computational efficiency of the algorithm $\textrm{Eff}(\varepsilon,\eta;A)\in\mathbb{N}$ satisfies that

$$\textrm{Eff}(\varepsilon,\eta;A)=\#\;\textrm{computations sufficient to ensure (1.16)}.\tag{1.17}$$

It follows from (1.15) that there exists $c\in(0,\infty)$ which satisfies for every $\varepsilon,\eta\in(0,1]$ that

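$$\textrm{Eff}(\varepsilon,\eta;A)\leq c\big(\varepsilon^{-2}\eta^{-1}|\log(\eta)|+\varepsilon^{-\nicefrac{6}{\rho}+4}|\log(\eta)|\big),\tag{1.18}$$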

where the constant $c\in(0,\infty)$ depends on the computational cost of computing $F$ and $\nabla_{\theta}F$ but not on the running time $n\in\mathbb{N}$, mini-batch size $M\in\mathbb{N}$, or sampling size $K\in\mathbb{N}$. Furthermore, we prove in Corollary 6.5 below that the computational efficiency can be improved in the case that the local manifold of minima is compact.
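The exponent $-\nicefrac{6}{\rho}+4$ in the efficiency bound can be traced to the product of the number of runs, the number of steps, and the gradient mini-batch size. The sketch below checks this arithmetic numerically. It assumes the parameter choices $n(\varepsilon)=c_{1}\varepsilon^{-2/\rho}$, $M(\varepsilon)=c_{2}\varepsilon^{-4/\rho+4}$, $\mathfrak{M}(\varepsilon,\eta)=c_{3}\varepsilon^{-2}\eta^{-1}|\log(\eta)|$, and $K(\eta)=c_{4}|\log(\eta)|$; it sets all constants to $1$ for illustration and leaves the values real-valued (in practice they would be rounded up to integers). The name `schedule` is hypothetical.

```python
import math

def schedule(eps, eta, rho, c1=1.0, c2=1.0, c3=1.0, c4=1.0):
    """Real-valued parameter choices as functions of accuracy eps and failure probability eta."""
    log_eta = abs(math.log(eta))
    n = c1 * eps ** (-2.0 / rho)            # number of SGD steps
    M = c2 * eps ** (-4.0 / rho + 4.0)      # gradient mini-batch size
    frak_M = c3 * eps ** -2 * log_eta / eta  # evaluation mini-batch size
    K = c4 * log_eta                         # number of independent runs
    return n, M, frak_M, K

eps, eta, rho = 0.1, 0.01, 0.75
n, M, frak_M, K = schedule(eps, eta, rho)

# Gradient evaluations used by K runs of n steps with M samples each.
sgd_cost = K * n * M
predicted = eps ** (-6.0 / rho + 4.0) * abs(math.log(eta))
```

With these choices `sgd_cost` and `predicted` agree up to floating-point error, since $-\nicefrac{2}{\rho}+(-\nicefrac{4}{\rho}+4)=-\nicefrac{6}{\rho}+4$.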

The estimate of Theorem 1.1 quantifies two sources of error. The first term on the right-hand side of (1.6) quantifies the error introduced by the mini-batch approximation of the objective function. In the case that the objective function $f$ can be efficiently computed, this error can be avoided by computing $\varTheta^{K,M,\infty,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ which satisfies that

$$f(\varTheta^{K,M,\infty,r}_{n})=\Big[\min_{k\in\{1,2,\ldots,K\}}f(\Theta^{k,M,r}_{n})\Big],\tag{1.19}$$

for which it follows from Corollary 5.10 below that

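$$\mathbb{P}\Big(\big[f(\varTheta^{K,M,\infty,r}_{n})-\inf_{\theta\in\mathbb{R}^{d}}f(\theta)\big]\geq\varepsilon\Big)\leq\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.20}$$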

The second term on the right-hand side of (1.6) quantifies the failure of the solutions $\Theta^{k,M,r}_{n}$, $k\in\{1,2,\ldots,K\}$, to converge to within distance $\varepsilon\in(0,1]$ of the local manifold of minima at time $n\in\mathbb{N}$. We quantify this error in Corollary 5.9 below, where we prove that

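$$\mathbb{P}\Big(\big[\min_{k\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{k,M,r}_{n}-\theta\big|\big]\geq\varepsilon\Big)\leq\left[\kappa+c\left(\frac{1}{\varepsilon^{2}n^{\rho}}+\frac{n^{1-\rho}}{M^{\nicefrac{1}{2}}}\right)\right]^{K}.\tag{1.21}$$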

The methods of Corollary 5.13 below prove that there exist constants $c_{i}\in(0,\infty)$ , $i\in\{1,2,3\}$ , such that for every $\varepsilon,\eta\in(0,1]$ , for $n(\varepsilon)\in\mathbb{N},M(\varepsilon),K(\eta)\in\mathbb{N}$ which satisfy that

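$$n(\varepsilon)=c_{1}\varepsilon^{-\nicefrac{2}{\rho}},\quad M(\varepsilon)=c_{2}\varepsilon^{-\nicefrac{4}{\rho}+4},\quad\text{and}\quad K(\eta)=c_{3}|\log(\eta)|,\tag{1.22}$$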

it holds that

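$$\mathbb{P}\Big(\big[\min_{k\in\{1,2,\ldots,K(\eta)\}}\inf_{\vartheta\in(\mathcal{M}\cap U)}\big|\Theta^{k,M(\varepsilon),r}_{n(\varepsilon)}-\vartheta\big|\big]\geq\varepsilon\Big)\leq\eta.\tag{1.23}$$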

For every bounded open set $A\subseteq\mathbb{R}^{d}$ with $\mathcal{M}\cap U\cap A\neq\emptyset$ , for every $\varepsilon,\eta\in(0,1]$ , the computational efficiency $\textrm{Eff}_{\textrm{SGD}}(\varepsilon,\eta;A)\in\mathbb{N}$ of (1.21) satisfies that

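$$\textrm{Eff}_{\textrm{SGD}}(\varepsilon,\eta;A)=\#\;\textrm{computations sufficient to ensure (1.23)}.\tag{1.24}$$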

It follows from (1.22) that for every bounded open set $A\subseteq\mathbb{R}^{d}$ with $\mathcal{M}\cap U\cap A\neq\emptyset$ there exists $c\in(0,\infty)$ such that for every $\varepsilon,\eta\in(0,1]$ it holds that

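$$\textrm{Eff}_{\textrm{SGD}}(\varepsilon,\eta;A)\leq c\big(\varepsilon^{-\nicefrac{6}{\rho}+4}|\log(\eta)|\big).\tag{1.25}$$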

In particular, the computational efficiency $\textrm{Eff}_{\textrm{SGD}}$ yields a significant improvement when compared with a random sampling algorithm. Precisely, suppose that $A\subseteq\mathbb{R}^{d}$ is a bounded open subset with $\mathcal{M}\cap U\cap A\neq\emptyset$ . Then, since $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional, $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , for the Lebesgue-Borel measure $\lambda\colon\mathcal{B}(\mathbb{R}^{d})\rightarrow[0,\infty]$ , there exists $c\in(0,\infty)$ which satisfies that

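$$\frac{\lambda\big(\{\theta\in A\colon\inf_{\vartheta\in(\mathcal{M}\cap U)}|\theta-\vartheta|\geq\varepsilon\}\big)}{\lambda(A)}\geq1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}.\tag{1.26}$$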

If $\Theta^{i}\colon\Omega\rightarrow A$ , $i\in\mathbb{N}$ , are i.i.d. random variables that are continuous uniformly distributed on $A$ , it follows from (1.26) that for every $K\in\mathbb{N}$ it holds that

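$$\mathbb{P}\Big(\min_{i\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{i}-\theta\big|\geq\varepsilon\Big)\geq\Big(1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}\Big)^{K}.\tag{1.27}$$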

For every $\varepsilon,\eta\in(0,1]$ , $K\in\mathbb{N}$ , in order to ensure that

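$$\mathbb{P}\Big(\min_{i\in\{1,2,\ldots,K\}}\inf_{\theta\in(\mathcal{M}\cap U)}\big|\Theta^{i}-\theta\big|\geq\varepsilon\Big)\leq\eta,\tag{1.28}$$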

it is necessary to choose $K(\varepsilon,\eta)\in\mathbb{N}$ which satisfies that

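$$K(\varepsilon,\eta)\geq\log\Big(1-\frac{c\varepsilon^{d-\mathfrak{d}}}{\lambda(A)}\Big)^{-1}|\log(\eta)|.\tag{1.29}$$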

In particular, there exists $c\in(0,\infty)$ which satisfies for every $\varepsilon\in(0,(\nicefrac{{\lambda(A)}}{{2r}})^{\nicefrac{{1}}{{d-\mathfrak{d}}}}]$ that

$$K(\varepsilon,\eta)\geq c\varepsilon^{-(d-\mathfrak{d})}|\log(\eta)|.\tag{1.30}$$

The computational efficiency of the random sampling algorithm is therefore worse than $\textrm{Eff}_{\textrm{SGD}}$ whenever the codimension $d-\mathfrak{d}$ is greater than $\nicefrac{{6}}{{\rho}}-4$ . This condition is expected to be satisfied in all practical machine learning applications, where the dimension $d\in\mathbb{N}$ is large, since for $\rho\in(\nicefrac{{2}}{{3}},1)$ we have $\nicefrac{{6}}{{\rho}}-4<5$ . In particular, this condition is satisfied for any $\rho\in(\nicefrac{{2}}{{3}},1)$ if there exists a unique minimum and $d\geq 5$ .
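The passage from the lower bound on the miss probability to the sample count for random sampling can be spelled out as follows. This is a sketch of the standard argument, not a quotation from the paper: the abbreviation $x$ is introduced here, the reciprocal logarithm is read in absolute value, and the restriction $x\leq\nicefrac{1}{2}$ is assumed to be what the stated range of $\varepsilon$ encodes. Write $x=c\varepsilon^{d-\mathfrak{d}}/\lambda(A)\in(0,\nicefrac{1}{2}]$. Requiring $(1-x)^{K}\leq\eta$ and using $|\log(1-x)|\leq2x$ for $x\in(0,\nicefrac{1}{2}]$ gives

$$K\geq\frac{|\log(\eta)|}{|\log(1-x)|}\geq\frac{|\log(\eta)|}{2x}=\frac{\lambda(A)}{2c}\,\varepsilon^{-(d-\mathfrak{d})}\,|\log(\eta)|.$$

Comparing powers of $\varepsilon$, random sampling needs at least of order $\varepsilon^{-(d-\mathfrak{d})}$ samples, while the SGD bound is of order $\varepsilon^{-(\nicefrac{6}{\rho}-4)}$, so for small $\varepsilon$ random sampling is more expensive exactly when $d-\mathfrak{d}>\nicefrac{6}{\rho}-4$.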

In a setting that is not globally stable, i.e., when (1.2) is not satisfied, several obstacles appear in the proof of local convergence to minima and in the estimation of the rate for SGD. First, even if a local minimum is assumed to be isolated and (1.2) is assumed to hold in a neighborhood $V$ of the minimum, the global analysis put forward in [24] cannot be localized immediately, since deterministic bounded sets are not invariant under the dynamics of SGD. On the contrary, with probability one each realization of SGD will eventually leave the basin of attraction $V$, outside of which no control on the dynamics can be expected. Therefore, it becomes necessary to provide estimates on the probability that SGD leaves favorable neighborhoods. Second, as pointed out above, (local) minima are not expected to appear in an isolated manner, but as (local) manifolds. This needs to be accounted for in the mathematical analysis, giving rise to a quantitative analysis inspired by the center manifold theorem, which in turn relies on estimates on the probability of SGD leaving favorable neighborhoods in the normal and tangential directions separately. In order to derive estimates on the rate of convergence, these steps are performed in a quantitative way in the proofs of this work. An intriguing observation is that the mathematical analysis of the rate of convergence relies on the use of mini-batches in order to control the loss of iterates in non-attracted regions.

In Sections 3 and 4 we provide an analysis of the deterministic gradient descent algorithm in continuous and discrete time in order to highlight the relevance of the assumptions in simplified settings. We emphasize again that, while the deterministic algorithms converge quickly, the computational costs of computing $\nabla f$ typically make the implementation of such algorithms infeasible. This is particularly the case when $f$ takes the form (1.33) below for a measure $\mu$ that is the empirical measure of a large training set. An advantage of the stochastic algorithm is that, provided $M\in\mathbb{N}$ is not too large, the mini-batch gradient can be computed efficiently in the case of (1.34) below. The disadvantage is that, inside an attracting set, the algebraic convergence of SGD in expectation is much slower than the exponential convergence of its deterministic counterpart.

1.1 Literature

The stochastic gradient descent algorithm has attracted considerable interest in the literature, and a complete account of the existing results would go beyond the scope of this article. We therefore restrict ourselves to works that seem most relevant to the current results and refer to the following works and the references therein for further details: see, for example, [2, 3, 4, 6, 7, 9, 10, 14, 23, 28, 34, 39, 40, 42, 43, 44, 49, 50, 51, 54, 56] and the references mentioned therein for numerical simulations and proofs of convergence rates for SGD type optimization algorithms, [5, 8, 47] and the references mentioned therein for overview articles on SGD type optimization algorithms, and [11, 12, 18, 19, 21, 22, 26, 27, 48] and the references mentioned therein for applications involving neural networks and SGD type optimization algorithms.

The case of a convex loss function is well understood under mild further assumptions. For example, rates of convergence of the order $O(1/\sqrt{n})$ for SGD have been established in [8, 56]. In the case of a strongly convex objective function, these can be improved to $O(1/n)$; see [20, 37, 38].

The case of a non-convex objective function is considerably less well understood. In this case we have to distinguish two classes of results: The first class proves the convergence to zero (with or without rates) for the gradient of the objective function, thus implying the convergence to a critical point. The second class of results proves the convergence of the values of the loss function to their global minimum. The results of the second class are stronger and are not implied by those of the first class, since the latter do not exclude convergence to saddle points or local minima. In the case of a non-convex loss function, rather complete results are known concerning the minimization of the gradient of the loss function. For example, the convergence of the gradient to zero with rates was shown in Lei, Hu, Li, & Tang [29] assuming a Hölder-regularity condition on the gradient of the loss function. This generalizes the previous work of Ghadimi, Lan, & Zhang [17], which required a second moment boundedness condition, which in turn was generalized by previous works of Ghadimi & Lan [16] and Reddi, Hefny, Sra, Poczos, & Smola [45]. We note that while convergence to the global minimum with rates was obtained in [17] for the convex case, no results on the convergence of the value of the loss function have been shown in the non-convex case.

The convergence of the stochastic gradient descent method has been analysed in the literature under several additional assumptions replacing (strong) convexity, such as the error bounds condition in Luo & Tseng [33], essential strong convexity [31], weak strong convexity [36], the restricted secant inequality [55], and the quadratic growth condition in Anitescu [1]. In these works, linear convergence rates are shown. In a notable contribution, Karimi, Nutini, & Schmidt [25] have shown that all of these conditions imply the Polyak-Lojasiewicz (PL) inequality, introduced in Lojasiewicz [32] and Polyak [41], under which linear convergence of SGD is proven in [25], thus generalizing these previous works. Recently, further progress was made in Lei, Hu, Li, & Tang [29], where a boundedness assumption on the gradient of the objective function, required in [25], was relaxed. We note that, while the PL condition requires neither convexity nor the uniqueness of global minimizers, it does exclude the existence of local minima that are not global; that is, assuming the PL condition, each local minimum is a global minimum. Therefore, it is not implied by the assumptions made in the current work.

1.2 Structure of the work

The paper is organized as follows. We will use the local smoothness of $\mathcal{M}\cap U$, the local smoothness of the objective function $f$, and the maximal nondegeneracy of the Hessian to identify a basin of attraction for SGD. In Section 2, we present the geometric preliminaries that are used to identify this set. In particular, in Proposition 2.3 below we recall the existence of projections in local neighborhoods of $\mathcal{M}\cap U$, in Proposition 2.7 below we recall the existence of local tubular neighborhoods about $\mathcal{M}\cap U$, in Lemma 2.8 below we prove a useful decomposition of $\nabla f$ into components normal and tangential to $\mathcal{M}\cap U$, and in Lemma 2.9 below we prove a contraction estimate that will be used to obtain a convergence rate for the gradient descent algorithms in discrete time.

In Section 3, for objective functions $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ that satisfy the conditions of Theorem 1.1, we analyze the convergence of the deterministic gradient descent algorithm in continuous time $\theta_{t}\in\mathbb{R}^{d}$ , $t\in[0,\infty)$ , that satisfies for every $t\in(0,\infty)$ that

[TABLE]

We prove in Proposition 3.1 below that the local smoothness of $\mathcal{M}\cap U$ , the local smoothness of $f$ , and the nondegeneracy of the Hessian imply the existence of a neighborhood $V\subseteq\mathbb{R}^{d}$ such that for every $\theta_{0}\in V$ the solution $\theta_{t}$ , $t\in[0,\infty)$ , converges exponentially fast to $\mathcal{M}\cap U$ . However, since in general neither $f$ nor $\nabla f$ are practically computable, and since continuous gradient descent cannot be implemented, the purpose of this section is to explain in a simplified setting the role of the assumptions and the geometric arguments from Section 2.

In Section 4, for objective functions $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ that satisfy the conditions of Theorem 1.1, we analyze the convergence of the deterministic gradient descent algorithm in discrete time $\theta_{n}\in\mathbb{R}^{d}$ , $n\in\mathbb{N}_{0}$ , that satisfies for $\rho\in(0,1)$ , $r\in(0,\infty)$ , for every $n\in\mathbb{N}$ that

[TABLE]

We prove in Proposition 4.1 below that there exists a neighborhood $V\subseteq\mathbb{R}^{d}$ such that for every $\theta_{0}\in V$ the solution $\theta_{n}$ , $n\in\mathbb{N}_{0}$ , converges exponentially quickly to $\mathcal{M}\cap U$ . However, while discrete gradient descent yields an implementable algorithm, the computational costs of $f$ and $\nabla f$ in general make it practically infeasible. The purpose of this section is instead to explain how the geometric preliminaries of Section 2, and in particular Lemma 2.8 and Lemma 2.9, are applied in a simplified discrete setting.

In Section 5, we analyze the convergence of SGD to the manifold of local minima $\mathcal{M}\cap U$ . In Proposition 5.3 below, we prove the convergence of (1.4) to $\mathcal{M}\cap U$ in directions normal to the manifold. Precisely, we identify a basin of attraction $V\subseteq\mathbb{R}^{d}$ such that, on the event that SGD remains in $V$ , SGD converges to $\mathcal{M}\cap U$ in expectation with an algebraic rate. It remains to estimate the probability that SGD remains in the basin of attraction $V$ .

The first step is contained in Proposition 5.4 below, which estimates the maximal excursion of SGD in expectation. Then, in Proposition 5.7 below, we estimate the probability that SGD remains in a basin of attraction $V$ by separating this event into the event that SGD leaves $V$ in a direction normal to $\mathcal{M}\cap U$ and the event that SGD leaves $V$ in a direction tangential to $\mathcal{M}\cap U$ . Proposition 5.3 is used to estimate the first of these events, and Proposition 5.4 is used to estimate the second. In Theorem 5.8, we combine Proposition 5.3 and Proposition 5.7 to estimate the probability that SGD converges to within distance $\varepsilon\in(0,1]$ of $\mathcal{M}\cap U$ .
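To make the mini-batch recursion analysed in this section concrete, the following is a minimal sketch of the update $\Theta_{n}=\Theta_{n-1}-\frac{r}{n^{\rho}M}\sum_{m=1}^{M}(\nabla_{\theta}F)(\Theta_{n-1},X_{n,m})$ on a toy problem that is not taken from this paper. We assume $F(\theta,x)=\frac{1}{2}(\theta_{1}\theta_{2}-x)^{2}$ with $X$ uniformly distributed on $[0.5,1.5]$ , so that the minima of $f(\theta)=\mathbb{E}[F(\theta,X)]$ near the starting point form the non-compact curve $\{\theta_{1}\theta_{2}=1\}$ . The helper names, the parameter values, and the use of $|\theta_{1}\theta_{2}-1|$ as a proxy for the distance to the minima are illustrative assumptions.

```python
import random


def grad_F(theta, x):
    """Gradient in theta of the toy loss F(theta, x) = 0.5 * (theta[0] * theta[1] - x) ** 2."""
    residual = theta[0] * theta[1] - x
    return (residual * theta[1], residual * theta[0])


def sgd(theta0, r, rho, batch_size, num_steps, rng):
    """Mini-batch SGD with learning rates r / n**rho (hypothetical helper, toy setting)."""
    theta = (float(theta0[0]), float(theta0[1]))
    for n in range(1, num_steps + 1):
        # Draw a fresh i.i.d. mini-batch for step n.
        batch = [rng.uniform(0.5, 1.5) for _ in range(batch_size)]
        grads = [grad_F(theta, x) for x in batch]
        g0 = sum(g[0] for g in grads) / batch_size
        g1 = sum(g[1] for g in grads) / batch_size
        step = r / n ** rho
        theta = (theta[0] - step * g0, theta[1] - step * g1)
    return theta


rng = random.Random(0)  # fixed seed so that the run is reproducible
theta = sgd((1.2, 1.0), r=0.1, rho=0.75, batch_size=16, num_steps=2000, rng=rng)
# |theta_1 * theta_2 - 1| vanishes exactly on the curve of minima; it is used
# here only as a cheap proxy for the distance to that curve.
residual = abs(theta[0] * theta[1] - 1.0)
print(theta, residual)
```

In this toy run the iterate starts close to the curve of minima, which mimics the assumption that SGD starts in, and remains in, a basin of attraction.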

In Corollary 5.9 below, we estimate the probability that $K\in\mathbb{N}$ independent copies of SGD fail to converge to within distance $\varepsilon\in(0,1]$ of $\mathcal{M}\cap U$ . In Theorem 5.12 below we prove Theorem 1.1, which relies on Lemma 5.11 below and estimates for the mini-batch approximation of the objective function. Finally, in Corollary 5.13 below, we estimate the computational efficiency of the algorithm introduced in Theorem 1.1.
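The selection among $K$ independent copies can be sketched as follows: each copy is run independently, and the copy with the smallest empirical loss on one further mini-batch is returned. This is only an illustration on the toy loss $F(\theta,x)=\frac{1}{2}(\theta_{1}\theta_{2}-x)^{2}$ with $X$ uniform on $[0.5,1.5]$ ; the uniform initial distribution on $[0.5,1.5]^{2}$ , the parameter values, and all function names are assumptions made for this example and are not prescribed by the results of this paper.

```python
import random


def F(theta, x):
    """Toy loss F(theta, x) = 0.5 * (theta[0] * theta[1] - x) ** 2."""
    return 0.5 * (theta[0] * theta[1] - x) ** 2


def grad_F(theta, x):
    residual = theta[0] * theta[1] - x
    return (residual * theta[1], residual * theta[0])


def sgd(theta0, r, rho, batch_size, num_steps, rng):
    """Mini-batch SGD with learning rates r / n**rho (hypothetical helper)."""
    theta = (float(theta0[0]), float(theta0[1]))
    for n in range(1, num_steps + 1):
        batch = [rng.uniform(0.5, 1.5) for _ in range(batch_size)]
        g0 = sum(grad_F(theta, x)[0] for x in batch) / batch_size
        g1 = sum(grad_F(theta, x)[1] for x in batch) / batch_size
        step = r / n ** rho
        theta = (theta[0] - step * g0, theta[1] - step * g1)
    return theta


def best_of_k(num_copies, r, rho, batch_size, num_steps, rng):
    """Run independent SGD copies and select one by a fresh mini-batch loss."""
    copies = []
    for _ in range(num_copies):
        # Assumed initial law: uniform on the box [0.5, 1.5]^2.
        theta0 = (rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5))
        copies.append(sgd(theta0, r, rho, batch_size, num_steps, rng))
    # One further mini-batch, shared by all copies, is used for the comparison.
    fresh = [rng.uniform(0.5, 1.5) for _ in range(batch_size)]
    losses = [sum(F(theta, x) for x in fresh) for theta in copies]
    best_index = min(range(num_copies), key=lambda k: losses[k])
    return copies, losses, best_index


rng = random.Random(1)
copies, losses, best_index = best_of_k(
    5, r=0.2, rho=0.75, batch_size=16, num_steps=2000, rng=rng
)
best = copies[best_index]
print(best_index, best, abs(best[0] * best[1] - 1.0))
```

Comparing the copies on a common fresh mini-batch avoids evaluating $f$ itself, which is in general not computable.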

In Section 6, we prove that the estimates of Section 5 can be improved under the additional assumption that $\mathcal{M}\cap U$ is compact. These estimates apply, in particular, to the case when the objective function has a unique minimum. The reason for the improved estimate of Theorem 6.4 below and the improved computational efficiency of Corollary 6.5 below is that, in the compact case, SGD cannot escape a basin of attraction in directions tangential to the manifold. It is therefore sufficient to take a smaller mini-batch approximation of the gradient.

In Section 7, we prove that the assumptions of Theorem 1.1 are satisfied by simple loss functions arising in machine learning applications. In particular, we show that the assumptions are satisfied by objective functions $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ which satisfy that

[TABLE]

where $\theta\in\mathbb{R}^{d}$ , $p\in[1,\infty)$ , $\varphi$ is a measurable function on a measurable space $(S,\mathcal{S})$ , and $(u_{\theta}\colon S\rightarrow\mathbb{R})_{\theta\in\mathbb{R}^{d}}$ is a jointly-measurable artificial neural network. In this case, the function $F\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ satisfies for every $(\theta,x)\in\mathbb{R}^{d}\times S$ that

[TABLE]

and, for a probability space $(\Omega,\mathcal{F},\mathbb{P})$ , the random variables $X_{k,n,m}\colon\Omega\rightarrow S$ , $k,n,m\in\mathbb{N}$ , are i.i.d. with distribution $\mu$ . For the objective functions considered in Section 7.1 and Section 7.2 below, the global minima are non-unique and form locally smooth, non-compact submanifolds of $\mathbb{R}^{d}$ on which the Hessian of the objective function is maximally nondegenerate.

2 Geometric preliminaries

In this section, for an objective function $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ that satisfies the conditions of Theorem 1.1, we will characterize the local geometry of the local manifold of minima $\mathcal{M}\cap U$ . The analysis will rely on the notion of a projection to $\mathcal{M}\cap U$ which is, however, only well-defined in local neighborhoods of the local manifold.

In the following proposition, we prove that the projection map to the local manifold of minima is locally well-defined and smooth. The proof is a consequence of Foote [15, Lemma] and the smoothness of $\mathcal{M}\cap U$ .

Proposition 2.1.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , and let $\mathcal{M}\cap U\subseteq\mathbb{R}^{d}$ be a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exists an open neighborhood $V\subset\mathbb{R}^{d}$ such that

(i) $V$ is a neighborhood of $x_{0}$ : it holds that $x_{0}\in V$ ,

(ii) projections exist in $V$ : there exists a unique function $p\colon V\rightarrow(\mathcal{M}\cap U)$ which satisfies for every $x\in V$ that

[TABLE]

(iii) the projection map is locally $\operatorname{C}^{1}$ -smooth: the map $p\colon V\rightarrow(\mathcal{M}\cap U)$ is once continuously differentiable.

Proof of Proposition 2.1.

The proof is an immediate consequence of [15, Lemma] and the $\operatorname{C}^{1}$ -regularity of $\mathcal{M}\cap U$ . ∎

The family of subsets satisfying for a fixed $x_{0}\in(\mathcal{M}\cap U)$ the conclusion of Proposition 2.1 will play an important role in the arguments to follow. We therefore make a global definition, and define the projection map on a global neighborhood of $\mathcal{M}\cap U$ . The existence of the projection map is an immediate consequence of Proposition 2.1.

Definition 2.2.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,\ldots,d-1\}$ , let $\mathcal{M}\cap U\subseteq\mathbb{R}^{d}$ be a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ .

(i) For every $x\in(\mathcal{M}\cap U)$ let $\operatorname{Proj}(x)\subseteq\mathcal{B}(\mathbb{R}^{d})$ satisfy that

[TABLE]

(ii) Let $p\colon\cup_{x\in(\mathcal{M}\cap U)}\left(\cup_{V\in\operatorname{Proj}(x)}V\right)\rightarrow(\mathcal{M}\cap U)$ be the unique function which satisfies for every $x\in\cup_{x\in(\mathcal{M}\cap U)}\left(\cup_{V\in\operatorname{Proj}(x)}V\right)$ that

[TABLE]

The following proposition proves that for every $x\in(\mathcal{M}\cap U)$ the tangent space $T_{x}(\mathcal{M}\cap U)$ and normal space $\big{(}T_{x}(\mathcal{M}\cap U)\big{)}^{\perp}$ to $\mathcal{M}\cap U$ at $x$ are characterized respectively by the null space of the Hessian of $f$ and the space on which the Hessian of $f$ is positive definite.

Proposition 2.3.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x\in(\mathcal{M}\cap U)$ there exist a $(d-\mathfrak{d})$ -dimensional subvectorspace $P_{x}\subseteq\mathbb{R}^{d}$ and a $\mathfrak{d}$ -dimensional subvectorspace $N_{x}\subseteq\mathbb{R}^{d}$ such that

(i) it holds that

[TABLE]

(ii) it holds for every $v\in P_{x}\backslash\{0\}$ that

[TABLE]

(iii) it holds that

[TABLE]

(iv) it holds that

[TABLE]

(v) it holds that

[TABLE]

Proof of Proposition 2.3.

Let $x\in(\mathcal{M}\cap U)$ . Since $\operatorname{rank}((\operatorname{Hess}f)(x))=d-\mathfrak{d}$ , the symmetry of the Hessian implies that there exist subspaces $N_{x},P_{x}\subseteq\mathbb{R}^{d}$ such that $\mathbb{R}^{d}=P_{x}\oplus N_{x}$ , that $\dim(P_{x})=d-\mathfrak{d}$ , that

[TABLE]

that $\dim(N_{x})=\mathfrak{d}$ , and that

[TABLE]

Let $\varepsilon\in(0,1)$ and suppose that $\gamma\colon(-\varepsilon,\varepsilon)\rightarrow\mathcal{M}\cap U$ is a smooth curve which satisfies $\gamma(0)=x$ . Since $\nabla f|_{\mathcal{M}\cap U}=0$ , it follows from the chain rule that

[TABLE]

It follows that $T_{x}(\mathcal{M}\cap U)\subseteq N_{x}$ and therefore, since $\dim(T_{x}(\mathcal{M}\cap U))=\mathfrak{d}$ , it holds that $T_{x}(\mathcal{M}\cap U)=N_{x}$ . Since $\mathbb{R}^{d}=T_{x}(\mathcal{M}\cap U)\oplus\big{(}T_{x}(\mathcal{M}\cap U)\big{)}^{\perp}$ , it holds that $P_{x}=\big{(}T_{x}(\mathcal{M}\cap U)\big{)}^{\perp}$ , which completes the proof of Proposition 2.3. ∎
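The statement of Proposition 2.3 can be checked by hand in a simple example. The sketch below uses the toy objective $f(\theta)=\frac{1}{2}(\theta_{1}\theta_{2}-1)^{2}$ on $\mathbb{R}^{2}$ , which is an assumption made for illustration and does not appear in the paper. Near the point considered below, its minima form the curve $\{\theta_{1}\theta_{2}=1\}$ , so that $d=2$ and $\mathfrak{d}=1$ . At a point of this curve, the code verifies that the Hessian annihilates the tangent vector, is positive in the normal direction, and has vanishing determinant, consistent with rank $d-\mathfrak{d}=1$ .

```python
def hessian(theta):
    """Hessian of the toy objective f(theta) = 0.5 * (theta[0] * theta[1] - 1) ** 2."""
    t1, t2 = theta
    mixed = 2.0 * t1 * t2 - 1.0
    return ((t2 * t2, mixed), (mixed, t1 * t1))


def mat_vec(h, v):
    return (h[0][0] * v[0] + h[0][1] * v[1], h[1][0] * v[0] + h[1][1] * v[1])


def dot(v, w):
    return v[0] * w[0] + v[1] * w[1]


a = 2.0
x = (a, 1.0 / a)                # a point on the curve {theta_1 * theta_2 = 1}
tangent = (1.0, -1.0 / a ** 2)  # derivative of the curve t -> (t, 1/t) at t = a
normal = (1.0 / a, a)           # orthogonal to the tangent vector

h = hessian(x)
h_tangent = mat_vec(h, tangent)                      # should vanish
curvature_normal = dot(normal, mat_vec(h, normal))   # should be positive
determinant = h[0][0] * h[1][1] - h[0][1] * h[1][0]  # should vanish (rank 1)
print(h_tangent, curvature_normal, determinant)
```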

In the following lemma, for a point $x\in\mathbb{R}^{d}$ such that the projection $p(x)\in(\mathcal{M}\cap U)$ is well-defined, we prove that the difference $x-p(x)\in\mathbb{R}^{d}$ lies in the space normal to $\mathcal{M}\cap U$ at $p(x)$ . This fact will be used to obtain a rate of convergence for the discrete gradient descent algorithms.

Lemma 2.4.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ , for every $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2), it holds for every $x\in V$ that

[TABLE]

Proof of Lemma 2.4.

Let $x_{0}\in(\mathcal{M}\cap U)$ , let $V\in\operatorname{Proj}(x_{0})$ , and let $p:V\rightarrow(\mathcal{M}\cap U)$ denote the projection map. Let $x\in V$ . If $x\in(\mathcal{M}\cap U)$ , the claim is immediate since then $x-p(x)=0$ . If $x\notin\mathcal{M}\cap U$ , for some $\varepsilon\in(0,1)$ suppose that $\gamma\colon(-\varepsilon,\varepsilon)\rightarrow\mathcal{M}\cap U$ is a smooth path which satisfies $\gamma(0)=p(x)$ . It holds that

[TABLE]

Therefore, since the curve $\gamma$ was arbitrary, it holds that $x-p(x)\in T_{p(x)}\left(\mathcal{M}\cap U\right)^{\perp}$ , which completes the proof of Lemma 2.4. ∎

In the following lemma, we derive a formula for the derivative of the distance function to the manifold in a neighborhood of $\mathcal{M}\cap U$ . The regularity of the distance function and the formula for its differential will be used to prove the convergence of the deterministic gradient descent algorithm in continuous time.

Lemma 2.5.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ , for every $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2), it holds for every $x\in V\setminus\mathcal{M}\cap U$ that

[TABLE]

Proof of Lemma 2.5.

Let $x_{0}\in(\mathcal{M}\cap U)$ and let $V\in\operatorname{Proj}(x_{0})$ . It follows from Proposition 2.1 that

[TABLE]

The chain rule implies for every $i\in\{1,\ldots,d\}$ that

[TABLE]

Since $\frac{\partial}{\partial x_{i}}p(x)\in N_{p(x)}$ and since $x-p(x)\in P_{p(x)}$ it follows from Lemma 2.4 that

[TABLE]

Since for every $x\in V\setminus\mathcal{M}\cap U$ it holds that

[TABLE]

it holds for every $x\in V\setminus\mathcal{M}\cap U$ that

[TABLE]

which completes the proof of Lemma 2.5. ∎
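The two preceding lemmas can be illustrated numerically when the manifold is the unit circle in $\mathbb{R}^{2}$ , a choice made here purely for illustration. In this case the projection is $p(x)=x/|x|$ and the distance is $\big||x|-1\big|$ . The sketch checks that $x-p(x)$ is orthogonal to the tangent space at $p(x)$ , and compares a finite-difference gradient of the distance function with the unit vector $(x-p(x))/|x-p(x)|$ . We assume that this standard identity for the gradient of the distance function is the one intended in Lemma 2.5; the helper names are hypothetical.

```python
import math


def project(x):
    """Nearest-point projection onto the unit circle (well defined for x != 0)."""
    norm = math.hypot(x[0], x[1])
    return (x[0] / norm, x[1] / norm)


def dist(x):
    """Euclidean distance from x to the unit circle."""
    return abs(math.hypot(x[0], x[1]) - 1.0)


def grad_dist_fd(x, h=1e-6):
    """Central finite-difference approximation of the gradient of dist at x."""
    gx = (dist((x[0] + h, x[1])) - dist((x[0] - h, x[1]))) / (2.0 * h)
    gy = (dist((x[0], x[1] + h)) - dist((x[0], x[1] - h))) / (2.0 * h)
    return (gx, gy)


x = (0.9, 1.2)
p = project(x)                       # approximately (0.6, 0.8)
offset = (x[0] - p[0], x[1] - p[1])  # the vector x - p(x)
tangent = (-p[1], p[0])              # spans the tangent space of the circle at p(x)
inner = offset[0] * tangent[0] + offset[1] * tangent[1]

unit_normal = (offset[0] / dist(x), offset[1] / dist(x))
fd = grad_dist_fd(x)
print(inner, unit_normal, fd)
```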

We will now quantify what are essentially local tubular neighborhoods of the local manifold $\mathcal{M}\cap U$ . The following definition will play an important role throughout the paper.

Definition 2.6.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,\ldots,d-1\}$ , let $\mathcal{M}\cap U\subseteq\mathbb{R}^{d}$ be a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ . For every $x\in(\mathcal{M}\cap U)$ , $R,\delta\in(0,\infty)$ let $V_{R,\delta}(x)\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]
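As an illustration of these sets, consider again the unit circle in $\mathbb{R}^{2}$ as the manifold (an assumption made only for this example). Reading the definition as it is used in the proof of Proposition 2.7, a point $x$ belongs to $V_{R,\delta}(x_{0})$ if $x=\tilde{x}+\tilde{v}$ for a base point $\tilde{x}$ on the manifold with $|\tilde{x}-x_{0}|\leq R$ and a normal vector $\tilde{v}$ at $\tilde{x}$ with $|\tilde{v}|<\delta$ . For the circle and $\delta<1$ this reduces to two explicit inequalities, which the hypothetical helper `in_tube` below evaluates.

```python
import math


def in_tube(x, x0, R, delta):
    """Membership test for V_{R,delta}(x0) when the manifold is the unit circle.

    Assumes 0 < delta < 1, so that every point of the tube has the unique
    nearest point x / |x| on the circle.
    """
    norm = math.hypot(x[0], x[1])
    if norm == 0.0:
        return False
    foot = (x[0] / norm, x[1] / norm)  # base point on the circle
    tangential = math.hypot(foot[0] - x0[0], foot[1] - x0[1])
    normal = abs(norm - 1.0)           # length of the normal offset
    return tangential <= R and normal < delta


x0 = (1.0, 0.0)
inside = in_tube((1.05, 0.0), x0, R=0.5, delta=0.1)       # small normal offset above x0
too_far_normal = in_tube((1.2, 0.0), x0, R=0.5, delta=0.1)    # normal offset exceeds delta
too_far_tangent = in_tube((0.0, 1.05), x0, R=0.5, delta=0.1)  # base point (0, 1) is far from x0
print(inside, too_far_normal, too_far_tangent)
```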

A useful feature of the sets defined in Definition 2.6 is that the parameter $R\in(0,\infty)$ can be used to quantify distance in directions tangential to the manifold $\mathcal{M}\cap U$ , and the parameter $\delta\in(0,\infty)$ can be used to quantify distance in directions normal to the manifold $\mathcal{M}\cap U$ . The following technical proposition will be used to prove Proposition 4.1 below and Lemma 5.6 below.

Proposition 2.7.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $\mathcal{M}\cap U\subseteq\mathbb{R}^{d}$ be a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ , for every $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2), there exist $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ ,

(i) it holds that $\overline{V}_{R,\delta}(x_{0})\subseteq V$ (cf. Definition 2.6),

(ii) it holds that

[TABLE]

(iii) it holds for every $x\in(\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)$ and $v\in\big{(}T_{x}(\mathcal{M}\cap U)\big{)}^{\perp}$ with $\left|v\right|<\delta$ that

[TABLE]

Proof of Proposition 2.7.

Let $x_{0}\in(\mathcal{M}\cap U)$ . For every $R,\delta\in(0,\infty)$ let $\tilde{V}_{R,\delta}(x_{0})\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

Let $V\in\operatorname{Proj}(x_{0})$ . Since $U,V\subseteq\mathbb{R}^{d}$ are open, there exist $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ it holds that

[TABLE]

and for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ that

[TABLE]

Following [15, Lemma], the normal bundle $T\left(\mathcal{M}\cap U\right)^{\perp}\subseteq\mathbb{R}^{2d}$ satisfies that

[TABLE]

Since $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold, it follows that $T(\mathcal{M}\cap U)^{\perp}\subseteq\mathbb{R}^{2d}$ is a $d$ -dimensional $\operatorname{C}^{1}$ -submanifold. Furthermore, the map $\Psi\colon T(\mathcal{M}\cap U)^{\perp}\rightarrow\mathbb{R}^{d}$ which satisfies for every $(x,v)\in T\left(\mathcal{M}\cap U\right)^{\perp}$ that $\Psi(x,v)=x+v$ satisfies for every $x\in(\mathcal{M}\cap U)$ that

[TABLE]

It follows from the inverse function theorem that there exists $\delta_{1}\in(0,(\delta_{0}\wedge\nicefrac{{R_{0}}}{{4}}))$ such that for every $R\in(0,\nicefrac{{R_{0}}}{{2}}]$ , $\delta\in(0,\delta_{1}]$ it holds that

[TABLE]

Let $R\in(0,\nicefrac{{R_{0}}}{{2}}]$ , $\delta\in(0,\delta_{1}]$ . We will first prove that $\tilde{V}_{R,\delta}(x_{0})\subseteq V_{R,\delta}(x_{0})$ . Let $x\in\tilde{V}_{R,\delta}(x_{0})$ . If $x\in\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U$ then it holds by definition that $x\in V_{R,\delta}(x_{0})$ . If $x\notin\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U$ , since $x\in\tilde{V}_{R,\delta}(x_{0})$ implies that $\mathbf{d}(x,\mathcal{M}\cap U)=\mathbf{d}(x,\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)$ and since the choice of $R_{0}\in(0,\infty)$ implies that

[TABLE]

it holds that $p(x)\in\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U$ . Since $\mathbf{d}(x,\mathcal{M}\cap U)=\mathbf{d}(x,\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)=\left|x-p(x)\right|<\delta$ and since it holds that

[TABLE]

for $\frac{x-p(x)}{\left|x-p(x)\right|}\in T_{p(x)}\left(\mathcal{M}\cap U\right)^{\perp}$ by Lemma 2.4, it holds that $x\in V_{R,\delta}(x_{0})$ . This completes the proof that $\tilde{V}_{R,\delta}(x_{0})\subseteq V_{R,\delta}(x_{0})$ . It remains to prove that $V_{R,\delta}(x_{0})\subseteq\tilde{V}_{R,\delta}(x_{0})$ . Let $x\in V_{R,\delta}(x_{0})$ . It is necessary to show that $\mathbf{d}(x,\mathcal{M}\cap U)=\mathbf{d}(x,\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)<\delta$ . The definition of $V_{R,\delta}(x_{0})$ implies that there exist $\tilde{x}\in(\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)$ and $\tilde{v}\in T_{\tilde{x}}\left(\mathcal{M}\cap U\right)^{\perp}$ with $\left|\tilde{v}\right|<\delta$ which satisfy that $x=\tilde{x}+\tilde{v}$ . We will prove that $p(x)=\tilde{x}$ . Suppose, for contradiction, that $p(x)\neq\tilde{x}$ . This implies that

[TABLE]

It follows from the triangle inequality that

[TABLE]

which proves that

[TABLE]

for $x-p(x)\in T_{p(x)}\left(\mathcal{M}\cap U\right)^{\perp}$ by Lemma 2.4 with $\left|x-p(x)\right|<\delta$ . Since $\tilde{x}\in(\overline{B}_{R}(x_{0})\cap\mathcal{M}\cap U)$ , it follows from (2.37) that $p(x)\in(\overline{B}_{R+2\delta_{1}}(x_{0})\cap\mathcal{M}\cap U)$ . Since $R\in(0,\nicefrac{{R_{0}}}{{2}}]$ and since $\delta\in(0,\delta_{1}]$ , equation (2.38) contradicts (2.33), which states that $\Psi$ is injective on the set

[TABLE]

We conclude that $p(x)=\tilde{x}$ , which implies that

[TABLE]

Therefore, it holds that $V_{R,\delta}(x_{0})\subseteq\tilde{V}_{R,\delta}(x_{0})$ , which completes the proof that $\tilde{V}_{R,\delta}(x_{0})=V_{R,\delta}(x_{0})$ . The final claim follows from a repetition of the arguments leading to (2.37) and (2.38). This completes the proof of Proposition 2.7. ∎

The following two lemmas contain the primary use of the nondegeneracy assumption, which states for every $\theta\in(\mathcal{M}\cap U)$ that

[TABLE]

The first of these proves that $\nabla f$ can be split into a component that is approximately normal to the local manifold of minima $\mathcal{M}\cap U$ , and into a component that is approximately tangential to $\mathcal{M}\cap U$ . We will use the normal component to obtain a rate of convergence for the gradient descent algorithms. The contribution of the tangential component will create errors that will need to be controlled.

Lemma 2.8.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},c\in(0,\infty)$ and $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ it holds that (cf. Definition 2.6)

[TABLE]

and for every $x\in V_{R,\delta}(x_{0})$ there exists $\varepsilon_{x}\in\mathbb{R}^{d}$ which satisfies $\left|\varepsilon_{x}\right|\leq c\mathbf{d}(x,\mathcal{M}\cap U)^{2}$ such that

[TABLE]

Proof of Lemma 2.8.

Let $x_{0}\in(\mathcal{M}\cap U)$ and $R\in(0,\infty)$ . Since $U\subseteq\mathbb{R}^{d}$ is an open set, there exists $V\in\operatorname{Proj}(x_{0})$ which satisfies that $V\subseteq U$ . Since $V$ is open, fix $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ it holds that

[TABLE]

Due to the compactness of $\overline{V}_{R,\delta}(x_{0})$ and the regularity of $f$ , there exists $c\in(0,\infty)$ which satisfies for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ that

[TABLE]

Let $x\in V_{R,\delta}(x_{0})$ . By integration, since $\left.\nabla f\right|_{\mathcal{M}\cap U}=0$ , it holds that

[TABLE]

It follows from (2.47), the local regularity of $f$ , and the definition of the projection that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

After defining $\varepsilon_{x}\in\mathbb{R}^{d}$ which satisfies that

[TABLE]

equation (2.48) and estimate (2.49) complete the proof of Lemma 2.8. ∎
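The decomposition can be made explicit in a simple example. We assume here, as the integration step in the proof suggests, that the decomposition has the first-order Taylor form $\nabla f(x)=(\operatorname{Hess}f)(p(x))\,(x-p(x))+\varepsilon_{x}$ . For the toy objective $f(x)=\frac{1}{4}(|x|^{2}-1)^{2}$ on $\mathbb{R}^{2}$ , which is chosen for illustration and is not taken from the paper, the minima form the unit circle and the Hessian at a point $p$ of the circle equals $2pp^{T}$ . A direct computation gives $|\varepsilon_{x}|=(|x|+2)\,\mathbf{d}(x,\mathcal{M})^{2}$ in this example, and the sketch below confirms numerically that the ratio $|\varepsilon_{x}|/\mathbf{d}(x,\mathcal{M})^{2}$ stays bounded on a tubular neighborhood.

```python
import math


def grad_f(x):
    """Gradient of the toy objective f(x) = 0.25 * (|x|^2 - 1)^2."""
    s = x[0] * x[0] + x[1] * x[1] - 1.0
    return (s * x[0], s * x[1])


def remainder_ratio(radius, angle):
    """Return |eps_x| / dist(x)^2 for x = radius * (cos(angle), sin(angle))."""
    p = (math.cos(angle), math.sin(angle))       # projection of x onto the circle
    x = (radius * p[0], radius * p[1])
    v = (x[0] - p[0], x[1] - p[1])               # x - p(x), normal to the circle
    pv = p[0] * v[0] + p[1] * v[1]
    hess_v = (2.0 * p[0] * pv, 2.0 * p[1] * pv)  # Hess f(p) = 2 p p^T on the circle
    g = grad_f(x)
    eps = (g[0] - hess_v[0], g[1] - hess_v[1])   # remainder of the decomposition
    d = abs(radius - 1.0)
    return math.hypot(eps[0], eps[1]) / (d * d)


radii = [0.7, 0.8, 0.9, 0.95, 1.05, 1.1, 1.2, 1.3]
angles = [0.0, 0.7, 2.0, 4.0]
ratios = [remainder_ratio(rad, ang) for rad in radii for ang in angles]
print(min(ratios), max(ratios))
```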

The following lemma will play an important role in the analysis of the deterministic and stochastic gradient descent algorithms in discrete time. In the context of Lemma 2.8, for every $x\in\mathbb{R}^{d}$ with $p(x)\in(\mathcal{M}\cap U)$ well-defined, the following lemma quantifies the convergence of gradient descent to $\mathcal{M}\cap U$ .

Lemma 2.9.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\mathfrak{r},\lambda\in(0,\infty)$ such that

[TABLE]

and $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $x\in V_{R,\delta}(x_{0})$ it holds that

[TABLE]

that

[TABLE]

and that

[TABLE]

Proof of Lemma 2.9.

Let $x_{0}\in(\mathcal{M}\cap U)$ . Since $U\subseteq\mathbb{R}^{d}$ is an open subset, there exists $V\in\operatorname{Proj}(x_{0})$ which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ it holds that (cf. Definition 2.6)

[TABLE]

Due to the compactness of $\overline{V}_{R_{0},\delta_{0}}(x_{0})$ and the regularity of $f$ , there exists $c\in(0,\infty)$ which satisfies for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ that

[TABLE]

Let $x\in V_{R,\delta}(x_{0})$ . For the first claim, using (2.58), fix $\mathfrak{r}\in(0,\infty)$ which satisfies that

[TABLE]

Let $r\in(0,\mathfrak{r}]$ . The definition of the distance to $\mathcal{M}\cap U$ implies that

[TABLE]

Since the nondegeneracy assumption states that

[TABLE]

Lemma 2.4 and (2.58) prove that there exists $\lambda\in(0,\infty)$ which satisfies that

[TABLE]

for which we have that

[TABLE]

where the choice of $\mathfrak{r}$ and (2.62) guarantee that $(1-r\lambda)\geq 0$ . In combination, estimates (2.60), (2.62), and (2.63) complete the proof of the first claim. The proof of the second claim is similar. For every $x\in V_{R,\delta}(x_{0})$ , the nondegeneracy assumption, Lemma 2.4, and (2.58) prove that there exists $\lambda\in(0,\infty)$ which satisfies (2.62) such that

[TABLE]

which completes the proof of Lemma 2.9. ∎

3 Continuous deterministic gradient descent

In this section, for an objective function $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ which satisfies the conditions of Theorem 1.1, we will analyze the local convergence to the local manifold of minima $\mathcal{M}\cap U$ of the deterministic gradient descent algorithm in continuous time $\theta_{t}\in\mathbb{R}^{d}$ , $t\in[0,\infty)$ , which satisfies for every $t\in(0,\infty)$ that

[TABLE]

We will prove that the solution of (3.1) converges to the local manifold of minima $\mathcal{M}\cap U$ , provided the initial condition is chosen in a sufficiently small neighborhood of $\mathcal{M}\cap U$ . The proof can be outlined as follows. Given any $x_{0}\in\mathcal{M}\cap U$ , we first fix an open neighborhood of $x_{0}$ which satisfies the conclusions of Lemma 2.8 and Lemma 2.9. Then, for initial data $\theta_{0}$ in this neighborhood, we quantify the convergence of the solution of (3.1) to $\mathcal{M}\cap U$ in directions normal to the manifold, using the decomposition of $\nabla f$ from Lemma 2.8. Finally, after fixing a smaller neighborhood about $x_{0}$ , we prove that the tangential components of $\nabla f$ do not cause the trajectory to leave the basin of attraction.

Proposition 3.1.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\lambda\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6), for $\theta_{t}\in\mathbb{R}^{d}$ , $t\in[0,\infty)$ , which satisfies for every $t\in(0,\infty)$ that

[TABLE]

it holds for every $t\in[0,\infty)$ that

[TABLE]

Proof of Proposition 3.1.

Let $x_{0}\in(\mathcal{M}\cap U)$ . Since $U\subseteq\mathbb{R}^{d}$ is an open set, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . In view of Proposition 2.7, fix $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ the set $V_{R,\delta}(x_{0})$ (cf. Definition 2.6) satisfies that $\overline{V}_{R,\delta}(x_{0})\subseteq V$ and that

[TABLE]

In particular, the compactness of $\overline{V}_{R_{0},\delta_{0}}(x_{0})$ and the regularity of $f$ imply that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ . Let $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , let $\theta_{t}\in\mathbb{R}^{d}$ , $t\in[0,\infty)$ , satisfy for every $t\in(0,\infty)$ that

[TABLE]

and let $\tau\in(0,\infty)$ denote the exit time

[TABLE]

Lemma 2.5 and the chain rule prove that

[TABLE]

where the local regularity of $f$ and the stopping time $\tau$ guarantee the well-posedness of this equation. Let $t\in(0,\tau)$ . It follows from Lemma 2.8 and Lemma 2.9 that there exist $\lambda,c_{1}\in(0,\infty)$ which satisfy that

[TABLE]

Proposition 2.1, (3.7), and $\left.\nabla f\right|_{\mathcal{M}\cap U}=0$ prove that there exists $c_{2}\in(0,\infty)$ which satisfies that

[TABLE]

Returning to (3.10), it follows from (3.11) and (3.12) that

[TABLE]

Let $\delta_{1}\in(0,\delta_{0}]$ satisfy that

[TABLE]

Let $\delta\in(0,\delta_{1}]$ . For every $t\in(0,\tau)$ it follows from (3.13) and (3.14) that

[TABLE]

Therefore, for every $\delta\in(0,\delta_{1}]$ , $t\in[0,\tau)$ it holds that

[TABLE]

For every $t\in[0,\tau)$ , it follows from (3.13) and (3.16) that

[TABLE]

Fix $\delta_{2}\in(0,\delta_{1}]$ which satisfies that

[TABLE]

Let $\delta\in(0,\delta_{2}]$ . In combination (3.16), (3.17), $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , and the triangle inequality prove that $\theta_{t}\in V_{R,\delta}(x_{0})$ for every $t\in(0,\infty)$ . This is to say that $\tau=\infty$ . Since $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ was arbitrary, this completes the proof of Proposition 3.1. ∎
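The exponential convergence can be observed in closed form in a simple example, assuming that (3.1) is the gradient flow $\frac{d}{dt}\theta_{t}=-\nabla f(\theta_{t})$ . For the toy objective $f(\theta)=\frac{1}{4}(|\theta|^{2}-1)^{2}$ on $\mathbb{R}^{2}$ (an illustrative choice that is not taken from the paper), the flow is purely radial, the radius $r_{t}=|\theta_{t}|$ solves $\frac{d}{dt}r_{t}=-r_{t}(r_{t}^{2}-1)$ , and the substitution $s=r^{-2}$ yields $1-r_{t}^{-2}=(1-r_{0}^{-2})e^{-2t}$ . The sketch below checks this formula against the differential equation by finite differences and verifies, for the starting radius $r_{0}=1.2$ , the bound $\mathbf{d}(\theta_{t},\mathcal{M})\leq e^{-2t}\,\mathbf{d}(\theta_{0},\mathcal{M})$ .

```python
import math


def radius(t, r0):
    """Closed-form |theta_t| for the gradient flow of f(theta) = 0.25 * (|theta|^2 - 1)^2."""
    return (1.0 - (1.0 - r0 ** -2) * math.exp(-2.0 * t)) ** -0.5


def ode_residual(t, r0, h=1e-5):
    """Finite-difference check of the radial equation r' = -r * (r^2 - 1)."""
    derivative = (radius(t + h, r0) - radius(t - h, r0)) / (2.0 * h)
    r = radius(t, r0)
    return abs(derivative + r * (r * r - 1.0))


r0 = 1.2
times = [0.5 * k for k in range(11)]  # t = 0, 0.5, ..., 5
distances = [abs(radius(t, r0) - 1.0) for t in times]
bounds = [abs(r0 - 1.0) * math.exp(-2.0 * t) for t in times]
residuals = [ode_residual(t, r0) for t in times]
print(distances[-1], bounds[-1], max(residuals))
```

The rate $2$ in this example equals the nonzero eigenvalue of the Hessian $2pp^{T}$ on the circle, which plays the role of $\lambda$ here.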

4 Discrete deterministic gradient descent

In this section, for an objective function $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ which satisfies the conditions of Theorem 1.1, we will analyze the convergence of the following deterministic gradient descent algorithm $\theta_{n}\in\mathbb{R}^{d}$ , $n\in\mathbb{N}_{0}$ , in discrete time which satisfies for $\rho\in(0,1)$ and a learning rate $r\in(0,\infty)$ that

[TABLE]

The proof is similar to the case of the deterministic gradient descent algorithm in continuous time. However, in the discrete setting, care must be taken to choose the learning rate $r\in(0,\infty)$ sufficiently small. Indeed, if the learning rate is too large, then for small values of $n$ the jump $-\frac{r}{n^{\rho}}\nabla f$ may be an overcorrection that causes the solution to overshoot the local manifold of minima and to leave the basin of attraction.

In the proof, we first identify a basin of attraction using Proposition 2.1 and Proposition 2.7. In the second step, we prove that the solution (4.1) converges along the normal directions to the manifold of local minima provided the solution remains in the basin of attraction. For this, we use the normal component of $\nabla f$ from Lemma 2.8 and the quantification of the convergence from Lemma 2.9. Finally, after fixing a perhaps smaller basin of attraction, we prove that the tangential component of the gradient from Lemma 2.8 does not cause the solution (4.1) to leave the basin of attraction.
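The role of the learning rate can be seen in a minimal sketch of the recursion $\theta_{n}=\theta_{n-1}-\frac{r}{n^{\rho}}\nabla f(\theta_{n-1})$ . We use the toy objective $f(\theta)=\frac{1}{4}(|\theta|^{2}-1)^{2}$ on $\mathbb{R}^{2}$ , whose minima form the unit circle; this objective, the parameter values, and the function names are assumptions chosen for illustration. With a small learning rate the distance to the circle decreases at every step, whereas with a large learning rate the very first step overshoots the circle and increases the distance.

```python
import math


def grad_f(x):
    """Gradient of the toy objective f(x) = 0.25 * (|x|^2 - 1)^2."""
    s = x[0] * x[0] + x[1] * x[1] - 1.0
    return (s * x[0], s * x[1])


def gradient_descent(theta0, r, rho, num_steps):
    """Iterate theta_n = theta_{n-1} - (r / n**rho) * grad_f(theta_{n-1}).

    Returns the list of distances of theta_0, ..., theta_N to the unit circle.
    """
    theta = (float(theta0[0]), float(theta0[1]))
    distances = [abs(math.hypot(theta[0], theta[1]) - 1.0)]
    for n in range(1, num_steps + 1):
        g = grad_f(theta)
        step = r / n ** rho
        theta = (theta[0] - step * g[0], theta[1] - step * g[1])
        distances.append(abs(math.hypot(theta[0], theta[1]) - 1.0))
    return distances


small = gradient_descent((1.2, 0.0), r=0.2, rho=0.5, num_steps=500)
large = gradient_descent((1.2, 0.0), r=3.0, rho=0.5, num_steps=1)
print(small[0], small[-1])  # distance before and after 500 small steps
print(large[0], large[1])   # distance before and after one large step
```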

Proposition 4.1.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{1,2,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $f\colon U\rightarrow\mathbb{R}$ be a three times continuously differentiable function, let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , and assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ . Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6), for $\theta_{n}\in\mathbb{R}^{d}$ , $n\in\mathbb{N}_{0}$ , which satisfies for every $n\in\mathbb{N}$ that

[TABLE]

it holds for every $n\in\mathbb{N}_{0}$ that

[TABLE]

Proof of Proposition 4.1.

Let $x_{0}\in(\mathcal{M}\cap U)$ and $\rho\in(0,1)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . In view of Proposition 2.7, fix $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ the set $V_{R,\delta}(x_{0})$ (cf. Definition 2.6) satisfies that $\overline{V}_{R,\delta}(x_{0})\subseteq V$ and that

[TABLE]

The regularity of $f$ and the compactness of $\overline{V}_{R_{0},\delta_{0}}(x_{0})$ prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusion of Lemma 2.9 for the set $V_{R_{0},\delta_{0}}(x_{0})$ . Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ . Let $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , let $\theta_{n}\in\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , satisfy that

[TABLE]

and let $\tau\in\mathbb{N}$ be the exit time which satisfies that

[TABLE]

Since for every $n\in\{1,\ldots,\tau\}$ the projection of $\theta_{n-1}$ is well-defined, we have that

[TABLE]

Lemma 2.8 proves that there exists $c\in(0,\infty)$ such that for every $n\in\{1,\ldots,\tau\}$ there exists $\varepsilon_{n}\in\mathbb{R}^{d}$ which satisfies that

[TABLE]

such that

[TABLE]

The triangle inequality, (4.10), (4.11), and (4.12) prove that there exists $c_{1}\in(0,\infty)$ such that for every $n\in\{1,\ldots,\tau\}$ it holds that

[TABLE]

Finally, the choice of $\mathfrak{r}\in(0,\infty)$ , Lemma 2.9, and (4.13) prove that there exists $\lambda\in(0,\infty)$ such that for every $n\in\{1,\ldots,\tau\}$ it holds that

[TABLE]

where the choice of $\mathfrak{r}\in(0,\infty)$ guarantees that $(1-r\lambda)\geq 0$ . Fix $\delta_{1}\in(0,\delta_{0}]$ which satisfies that

[TABLE]

Let $\delta\in(0,\delta_{1}]$ . It follows from (4.14) and (4.15) that for every $n\in\{1,\ldots,\tau\}$ it holds that

[TABLE]

After iterating this inequality, we have for every $n\in\{1,\ldots,\tau\}$ that

[TABLE]

Since there exists $c\in(0,\infty)$ which satisfies for every $n\in\mathbb{N}$ that

[TABLE]

it follows from (4.17) that there exists $c_{2}\in(0,\infty)$ which satisfies for every $n\in\{1,\ldots,\tau\}$ that

[TABLE]

It remains only to show that, provided $\delta\in(0,\delta_{1}]$ is chosen sufficiently small, we have that $\tau=\infty$ . It follows from (4.7), (4.19), and $\left.\nabla f\right|_{\mathcal{M}\cap U}=0$ that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

The triangle inequality therefore implies that there exists $c_{3}\in(0,\infty)$ such that for every $n\in\{1,\ldots,\tau\}$ it holds that

[TABLE]

Fix $\delta_{2}\in(0,\delta_{1}]$ which satisfies that

[TABLE]

Let $\delta\in(0,\delta_{2}]$ . The choice of $\delta_{2}\in(0,\delta_{1}]$ , (4.21), and the triangle inequality prove for every $n\in\{1,\ldots,\tau\}$ that

[TABLE]

In combination, (4.19) and (4.23) prove for every $n\in\{1,\ldots,\tau\}$ that

[TABLE]

The triangle inequality therefore implies for every $n\in\{1,\ldots,\tau\}$ that

[TABLE]

It follows from Proposition 2.7, the choice of $R_{0},\delta_{0}\in(0,\infty)$ , and $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ that for every $n\in\mathbb{N}$ it holds that $\theta_{n}\in V_{R,\delta}(x_{0})$ . This is to say that $\tau=\infty$ , which completes the proof of Proposition 4.1. ∎

Remark 4.2.

The conclusion of Proposition 4.1 can be extended to the case of $\rho=1$ using the same techniques. In this case, in the setting of Proposition 4.1, there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $\theta_{0}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6), for $\theta_{n}\in\mathbb{R}^{d}$ , $n\in\mathbb{N}_{0}$ , which satisfies for every $n\in\mathbb{N}$ that

[TABLE]

it holds for every $n\in\mathbb{N}_{0}$ that

[TABLE]

The logarithm appears in estimate (4.18) in the case $\rho=1$ . The remainder of the proof is then the same, where the only additional observation is that the analogue of (4.21) is finite in the case $\rho=1$ as well.
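One way to see where the logarithm comes from is the following sketch. It assumes, as a simplification that is not taken verbatim from the proof, that the normal distance is contracted at step $k$ by a factor $(1-r\lambda k^{-\rho})\in[0,1]$, and it uses only the elementary bounds $1-x\leq e^{-x}$ and the comparison of a sum of a decreasing function with an integral.

```latex
\prod_{k=1}^{n}\Big(1-\frac{r\lambda}{k^{\rho}}\Big)
  \;\le\; \exp\Big(-r\lambda\sum_{k=1}^{n}k^{-\rho}\Big),
\qquad
\sum_{k=1}^{n}k^{-\rho}
  \;\ge\; \int_{1}^{n+1}x^{-\rho}\,\mathrm{d}x
  \;=\;
  \begin{cases}
    \dfrac{(n+1)^{1-\rho}-1}{1-\rho}, & \rho\in(0,1),\\[1ex]
    \log(n+1), & \rho=1.
  \end{cases}
```

Under this assumption the product is bounded by $(n+1)^{-r\lambda}$ when $\rho=1$, that is, by a power of $n$ rather than by a stretched exponential as in the case $\rho\in(0,1)$.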

5 Stochastic gradient descent

In this section, in the setting of Theorem 1.1, for a learning rate $\rho\in(\nicefrac{{2}}{{3}},1)$ , for $r\in(0,\infty)$ , $M\in\mathbb{N}$ , for a bounded open subset $A\subseteq\mathbb{R}^{d}$ , for a probability space $(\Omega,\mathcal{F},\mathbb{P})$ , for a measurable space $(S,\mathcal{S})$ , for a jointly measurable function $F\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ , for $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , i.i.d. random variables, we will analyze the convergence of the mini-batch stochastic gradient descent algorithm $\Theta_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}_{0}$ , which satisfies that $\Theta_{0}$ is continuous uniformly distributed on $A$ and for every $n\in\mathbb{N}$ that

[TABLE]

The role of the mini-batch size $M\in\mathbb{N}$ is to reduce the variance of the random gradient

[TABLE]

The variance reduction is quantified by the following well-known lemma, where the function $G$ plays the role of $\nabla_{\theta}F$ .

Lemma 5.1.

Let $d_{1},d_{2}\in\mathbb{N}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d_{2}}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d_{2}}$ , let $U\subseteq\mathbb{R}^{d_{1}}$ be a non-empty open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $G=(G(\theta,x))_{(\theta,x)\in\mathbb{R}^{d_{1}}\times S}\colon\mathbb{R}^{d_{1}}\times S\rightarrow\mathbb{R}^{d_{2}}$ be a measurable function, let $X_{m}\colon\Omega\rightarrow S$ , $m\in\mathbb{N}$ , be i.i.d. random variables, and assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|G(\theta,X_{1})|^{2}\big{]}<\infty$ . Then for every non-empty compact set $\mathfrak{C}\subseteq U$ there exists $c\in(0,\infty)$ which satisfies for every $M\in\mathbb{N}$ that

[TABLE]

Proof of Lemma 5.1.

Let $\mathfrak{C}\subseteq U$ be a compact set. It holds for every $\theta\in\mathfrak{C}$ , $M\in\mathbb{N}$ that

[TABLE]

Since the $X_{m}$ , $m\in\mathbb{N}$ , are i.i.d. and since $G(\theta,X_{1})$ , $\theta\in\mathbb{R}^{d_{1}}$ , is locally bounded in $L^{2}(\Omega;\mathbb{R}^{d_{2}})$ , there exists $c\in(0,\infty)$ which satisfies for every $M\in\mathbb{N}$ that

[TABLE]

This completes the proof of Lemma 5.1. ∎
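The $\nicefrac{1}{M}$ scaling behind this kind of variance reduction can be checked numerically. The following Monte Carlo sketch assumes a hypothetical scalar example that is not taken from the paper, namely $G(\theta,x)=\theta-x$ with standard normal samples, for which the mean squared deviation of the mini-batch average from its expectation equals $\nicefrac{1}{M}$. The names `grad_F` and `minibatch_mse` are illustrative.

```python
import random

def grad_F(theta, x):
    """Hypothetical G(theta, x): gradient in theta of 0.5 * (theta - x)**2."""
    return theta - x

def minibatch_mse(theta, M, trials, rng):
    """Monte Carlo estimate of E[|mean_m G(theta, X_m) - E G(theta, X_1)|^2]."""
    exact = theta  # E[G(theta, X_1)] = theta because X_1 has mean zero
    total = 0.0
    for _ in range(trials):
        batch_mean = sum(grad_F(theta, rng.gauss(0.0, 1.0)) for _ in range(M)) / M
        total += (batch_mean - exact) ** 2
    return total / trials

rng = random.Random(0)  # fixed seed for reproducibility
mse_1 = minibatch_mse(theta=0.3, M=1, trials=20000, rng=rng)
mse_16 = minibatch_mse(theta=0.3, M=16, trials=20000, rng=rng)
print(mse_1, mse_16, mse_1 / mse_16)
```

Up to sampling error, the printed ratio should be close to the ratio of the two batch sizes.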

In the following proposition, much like the first step of the proofs of Proposition 3.1 and Proposition 4.1, we establish the convergence of (5.1) in directions normal to the local manifold of minima. We first identify a basin of attraction for (5.1) using Proposition 2.1 and Proposition 2.7 and prove, using the gradient decomposition of Lemma 2.8 and the quantification of convergence from Lemma 2.9, that on the event that SGD does not escape this basin of attraction SGD converges to the manifold of minima in expectation.

Remark 5.2.

We emphasize that the events $A_{n}$ , $n\in\mathbb{N}_{0}$ , defined in Proposition 5.3 below depend upon the quantifiers $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , and $x_{0}\in(\mathcal{M}\cap U)$ . However, in order to simplify the presentation, we will oftentimes suppress this dependence in the notation. For every $n\in\mathbb{N}$ , we will write $\mathbf{1}_{A_{n}}\colon\Omega\rightarrow\{0,1\}$ for the indicator function of the set $A_{n}\subseteq\Omega$ .

Proposition 5.3.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

and for every $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , $x_{0}\in(\mathcal{M}\cap U)$ let $A_{n}(M,r,R,\delta,\theta,x_{0})\in\mathcal{F}$ satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\theta\in V_{R,\delta}(x_{0})$ (cf. Definition 2.6) it holds that

[TABLE]

Proof of Proposition 5.3.

Let $x_{0}\in(\mathcal{M}\cap U)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ which satisfy the conclusion of Proposition 2.7 for this set $V$ . Finally, fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusion of Lemma 2.9. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ . To simplify the notation, and by a small abuse of notation, let $\nabla_{\theta}F^{M,n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be the functions which satisfy for every $(\theta,\omega)\in\mathbb{R}^{d}\times\Omega$ that

[TABLE]

Let $\theta\in V_{R,\delta}(x_{0})$ , let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , and for every $n\in\mathbb{N}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

We will analyze the solution $\Theta^{M,r}_{n,\theta}$ of (5.12) on the event $A_{n-1}$ . We observe that

[TABLE]

Since the event $A_{n-1}$ implies that $\Theta^{M,r}_{n-1,\theta}\in V_{R,\delta}(x_{0})\subseteq V$ , the projection of $\Theta^{M,r}_{n-1,\theta}$ is well-defined and it holds by definition of the distance to $\mathcal{M}\cap U$ that

[TABLE]

The three terms on the righthand side of (5.14) will be treated separately. For the first term on the righthand side of (5.14), the choice of $\mathfrak{r}\in(0,\infty)$ , Lemma 2.8, and Lemma 2.9 prove, following identically the proof leading from (4.10) to (4.14), that there exist $\lambda,c\in(0,\infty)$ such that

[TABLE]

Therefore, there exist $\lambda,c\in(0,\infty)$ which satisfy that

[TABLE]

The remaining two terms of (5.14) and the righthand side of (5.16) will be handled after taking the expectation on the event $A_{n-1}\subseteq\Omega$ which satisfies that

[TABLE]

After returning to (5.14), it follows from (5.16) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

For every $m\in\mathbb{N}$ let $\mathcal{F}_{m}\subseteq\mathcal{F}$ be the sigma algebra which satisfies that

[TABLE]

For the penultimate term of (5.18), since $\mathbf{1}_{A_{n-1}}$ is $\mathcal{F}_{n-1}$ -measurable, properties of the conditional expectation imply that

[TABLE]

Therefore, it holds that

[TABLE]

where the final equality follows from the fact that the $X_{m,k}$ , $m,k\in\mathbb{N}$ , are independent and therefore satisfy for every $x\in\mathbb{R}^{d}$ that

[TABLE]

The final term of (5.18) is handled using Lemma 5.1. Since $\overline{V}_{R,\delta}(x_{0})$ is compact, the independence of the $X_{m,k}$ , $m,k\in\mathbb{N}$ , and Lemma 5.1 prove that there exists $c\in(0,\infty)$ such that

[TABLE]

Returning to (5.18), it follows from (5.21) and (5.23) that there exists $c_{1}\in(0,\infty)$ such that

[TABLE]

Fix $\delta_{1}\in(0,\delta_{0}]$ which satisfies that

[TABLE]

Let $\delta\in(0,\delta_{1}]$ . We claim that inequality (5.24) implies that there exists some $c\in(0,\infty)$ which satisfies for every $n\in\mathbb{N}$ that

[TABLE]

The proof of (5.26) will proceed by induction. Since $\rho\in(\nicefrac{{2}}{{3}},1)$ , there exists $n_{0}\geq 1$ such that for every $n\geq n_{0}$ it holds that

[TABLE]

where the first inequality follows from the mean value theorem and $\rho\in(\nicefrac{{2}}{{3}},1)$ and the second inequality is obtained by choosing $n\in\mathbb{N}$ sufficiently large. Fix $n_{0}\geq 1$ which satisfies (5.27) and define $\overline{c}\in(0,\infty)$ which satisfies that

[TABLE]

For the base case, the definition of $\overline{c}$ guarantees for every $n\in\{1,\ldots,n_{0}-1\}$ that

[TABLE]

For the induction step, suppose that for $n\geq n_{0}$ we have that

[TABLE]

Since the event $A_{n-1}$ implies that

[TABLE]

it follows from an $L^{\infty}$ -estimate, the inclusion $A_{n-1}\subseteq A_{n-2}$ , and the induction hypothesis that for every $m\in\{2,3,4\}$ it holds that

[TABLE]

Returning to (5.24), it holds that

[TABLE]

After adding and subtracting $\overline{c}n^{-\rho}$ , it holds that

[TABLE]

Since $\delta\in(0,\delta_{1}]$ , it follows from (5.34) that

[TABLE]

Since $n\geq n_{0}$ , the choice $\overline{c}\geq\frac{2c_{1}r}{M\lambda}$ , (5.27), and (5.35) prove that

[TABLE]

Therefore, we have that

[TABLE]

which completes the induction step. Since the base case is (5.29), this completes the proof of Proposition 5.3. ∎

Proposition 5.3 proves the convergence of SGD to $\mathcal{M}\cap U$ on the event that SGD remains in a basin of attraction. It remains necessary to prove that, provided the mini-batch size is chosen to be sufficiently large, SGD remains in the basin of attraction for large times. We prove the first step toward this goal in the proposition below, which estimates the maximal excursion of SGD on the event that the dynamics do not leave a basin of attraction.
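The statement "convergence on the event that SGD remains in a basin of attraction" can be made concrete with a small simulation. The sketch below assumes a hypothetical loss $F(\theta,x)=\tfrac{1}{4}(|\theta|^{2}-1)^{2}+\langle x,\theta\rangle$ on $\mathbb{R}^{2}$ with centered Gaussian $x$, so that $f(\theta)=\tfrac{1}{4}(|\theta|^{2}-1)^{2}$ and the minima form the unit circle. It also assumes that the event $A_{n-1}$ means that the iterates up to time $n-1$ stayed within distance `delta` of the circle. Neither the loss nor the names `sgd_run` and `mean_sq_dist_on_event` come from the paper.

```python
import math
import random

def sgd_run(theta0, r, rho, M, steps, delta, sigma, rng):
    """One run of mini-batch SGD for the hypothetical loss described above.

    Returns (squared distance of the last iterate to the unit circle, stayed),
    where stayed is True if all iterates before the last one kept a distance
    smaller than delta to the circle (the analogue of the event A_{n-1}).
    """
    x, y = theta0
    for n in range(1, steps + 1):
        if abs(math.hypot(x, y) - 1.0) >= delta:
            return (math.hypot(x, y) - 1.0) ** 2, False  # Theta_{n-1} left the tube
        s = x * x + y * y - 1.0
        # Mini-batch average of grad F(theta, x) = s * theta + x over M samples.
        gx = s * x + sum(rng.gauss(0.0, sigma) for _ in range(M)) / M
        gy = s * y + sum(rng.gauss(0.0, sigma) for _ in range(M)) / M
        step = r / n ** rho
        x, y = x - step * gx, y - step * gy
    return (math.hypot(x, y) - 1.0) ** 2, True

def mean_sq_dist_on_event(steps, trials, rng):
    """Monte Carlo estimate of E[dist(Theta_n, circle)^2 * 1_{A_{n-1}}] for n = steps."""
    total = 0.0
    for _ in range(trials):
        dist2, stayed = sgd_run((1.1, 0.0), r=0.1, rho=0.75, M=4,
                                steps=steps, delta=0.3, sigma=0.5, rng=rng)
        total += dist2 if stayed else 0.0
    return total / trials

rng = random.Random(1)  # fixed seed for reproducibility
early = mean_sq_dist_on_event(steps=5, trials=200, rng=rng)
late = mean_sq_dist_on_event(steps=400, trials=200, rng=rng)
print(early, late)
```

The estimate after many steps should be markedly smaller than the estimate after a few steps, in line with the qualitative content of Proposition 5.3; the sketch makes no attempt to reproduce its rate or constants.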

Proposition 5.4.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

and for every $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , $x_{0}\in(\mathcal{M}\cap U)$ let $A_{n}(M,r,R,\delta,\theta,x_{0})\in\mathcal{F}$ satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6) it holds that

[TABLE]

Proof of Proposition 5.4.

Let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

Let $x_{0}\in(\mathcal{M}\cap U)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ which satisfy the conclusion of Proposition 2.7 for this set $V$ . We observe that the regularity of $f$ and the compactness of $\overline{V}_{R_{0},\delta_{0}}(x_{0})$ imply that

[TABLE]

Finally, fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusion of Lemma 2.9. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ . As in Proposition 5.3, let $\nabla_{\theta}F^{M,n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be the functions which satisfy for every $(\theta,\omega)\in\mathbb{R}^{d}\times\Omega$ that

[TABLE]

Let $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , and for every $n\in\mathbb{N}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

We will first prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

where we observe that the constant $c\in(0,\infty)$ can be absorbed by fixing $r\in(0,\mathfrak{r}]$ sufficiently small. It holds that

[TABLE]

Lemma 2.8 proves that there exist $c_{1}\in(0,\infty)$ and $\varepsilon_{n}\colon A_{n-1}\rightarrow\mathbb{R}^{d}$ which satisfy that

[TABLE]

such that on the event $A_{n-1}$ it holds that

[TABLE]

Therefore, on the event $A_{n-1}$ it holds that

[TABLE]

Let $\tilde{\Theta}^{M,r}_{n-1,\theta}\colon A_{n-1}\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

After taking the norm-squared of (5.50), on the event $A_{n-1}$ it holds that

[TABLE]

We will estimate (5.52) by taking the expectation on the event $A_{n-1}$ . The first term on the righthand side of (5.52) is handled using Proposition 5.3 and (5.48). For the second term, from (5.19) we recall the sigma algebras $\mathcal{F}_{m}\subseteq\mathcal{F}$ , $m\in\mathbb{N}$ , which satisfy that

[TABLE]

Since $\varepsilon_{n}\colon A_{n-1}\rightarrow\mathbb{R}^{d}$ is $\mathcal{F}_{n-1}$ -measurable, it follows identically to (5.21) and (5.22) that

[TABLE]

For the final term on the righthand side of (5.52), the compactness of $\overline{V}_{R_{0},\delta_{0}}(x_{0})$ , the independence of the $X_{m,k}$ , $m,k\in\mathbb{N}$ , and Lemma 5.1 prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

In combination, Proposition 5.3 and estimates (5.48), (5.52), (5.54), and (5.55) prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

It follows from the definition of $\tilde{\Theta}^{M,r}_{n-1,\theta}$ , (5.43), and the definition of the projection that, on the event $A_{n-1}$ there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Proposition 5.3 proves that there exists $c\in(0,\infty)$ such that

[TABLE]

It follows from the triangle inequality, (5.56), and (5.58) that there exists $c_{1}\in(0,\infty)$ which satisfies that

[TABLE]

which completes the proof of (5.46). Since for every $r\leq s\in\mathbb{N}_{0}$ we have $\mathbf{1}_{A_{s}}\leq\mathbf{1}_{A_{r}}$ , it follows from (5.59), the triangle inequality, and Hölder’s inequality that there exists $c_{2}\in(0,\infty)$ which satisfies for every $r\in(0,\mathfrak{r}]$ that

[TABLE]

where we have used the fact that, since $\rho\in(\nicefrac{{2}}{{3}},1)$ , there exists a $c\in(0,\infty)$ such that

[TABLE]

This completes the proof of Proposition 5.4. ∎

Remark 5.5.

We emphasize that the assumption $\rho\in(\nicefrac{{2}}{{3}},1)$ is only used to ensure the boundedness in $n\in\mathbb{N}$ of the first sum appearing on the lefthand side of (5.61), which cannot be countered by the mini-batch size $M\in\mathbb{N}$ . Every other argument in the paper applies without change to the case $\rho\in(0,1)$ . In particular, because the result of Proposition 5.4 is not needed if $\mathcal{M}\cap U$ is compact, since SGD cannot leave the basin of attraction in tangential directions, the results of Section 6 apply for $\rho\in(0,1)$ under this additional compactness assumption.
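The threshold $\nicefrac{2}{3}$ can be illustrated numerically under an explicit assumption about which sum is meant. Since (5.61) is not reproduced here, the sketch below assumes that the relevant sum is $\sum_{k=1}^{n}k^{-\nicefrac{3\rho}{2}}$, which is what one obtains heuristically from a step size of order $k^{-\rho}$ multiplied by a distance to the manifold of order $k^{-\nicefrac{\rho}{2}}$. This series converges exactly when $\rho>\nicefrac{2}{3}$. The function name `partial_sum` is illustrative.

```python
def partial_sum(rho, n):
    """Partial sum of k**(-1.5 * rho) for k = 1, ..., n (assumed form of the sum)."""
    return sum(k ** (-1.5 * rho) for k in range(1, n + 1))

# For rho <= 2/3 the partial sums keep growing; for rho > 2/3 they level off.
for rho in (0.5, 2.0 / 3.0, 0.75, 0.9):
    print(rho, [round(partial_sum(rho, n), 3) for n in (10**2, 10**3, 10**4)])
```

For $\rho=0.75$ the partial sums are bounded by $1+\int_{1}^{\infty}x^{-1.125}\,\mathrm{d}x=9$, whereas for $\rho=0.5$ they grow like $4n^{\nicefrac{1}{4}}$.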

We will next obtain a lower bound in probability for the events $A_{n}$ , $n\in\mathbb{N}_{0}$ . For this, we will first establish sufficient conditions for containment in the set $V_{R,\delta}(x_{0})$ . Effectively, these conditions split the normal and tangential movement of SGD in the sense that, in order to be outside the set $V_{R,\delta}(x_{0})$ , a point must either be at distance greater than $\delta$ from $\mathcal{M}\cap U$ or be at distance roughly greater than $R$ from $x_{0}$ .

Lemma 5.6.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , and let $\mathcal{N}\subseteq\mathbb{R}^{d}$ be a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold, let $\mathbf{d}(\cdot,\mathcal{N}):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

Then for every $x_{0}\in\mathcal{N}$ there exist $R_{0},\delta_{0}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , for $V_{R,\delta}(x_{0})\subseteq\mathbb{R}^{d}$ which satisfies that

[TABLE]

it holds that

[TABLE]

Proof of Lemma 5.6.

Let $x_{0}\in\mathcal{N}$ , let $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2), and let $R_{0},\delta_{0}\in(0,\infty)$ satisfy the conclusion of Proposition 2.7. That is, for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ it holds that $\overline{V}_{R,\delta}(x_{0})\subseteq V$ and that

[TABLE]

Suppose that $x\in\mathbb{R}^{d}$ satisfies that

[TABLE]

The definition of the distance to $\mathcal{N}$ and $\left|x-x_{0}\right|\leq R-\delta$ imply that there exists a possibly non-unique $\tilde{x}\in\overline{\mathcal{N}}$ which satisfies that

[TABLE]

The triangle inequality implies that

[TABLE]

It follows that $\tilde{x}\in\overline{\overline{B}_{R}(x_{0})\cap\mathcal{N}}$ , and therefore that

[TABLE]

It follows from (5.66) and (5.69) that $x\in V_{R,\delta}(x_{0})$ , which completes the proof of Lemma 5.6. ∎
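The containment argument can be checked numerically in the simplest flat case. The sketch below assumes that $\mathcal{N}$ is the horizontal axis in $\mathbb{R}^{2}$, that $x_{0}$ is the origin, and that $V_{R,\delta}(x_{0})$ consists of the points at distance less than $\delta$ from $\overline{B}_{R}(x_{0})\cap\mathcal{N}$. This reading of the definition is inferred from the proof above and is an assumption, as are the names `dist_to_axis`, `dist_to_segment`, and `in_V`.

```python
import math
import random

def dist_to_axis(p):
    """Distance from p to the manifold N = {(t, 0)} (the horizontal axis)."""
    return abs(p[1])

def dist_to_segment(p, R):
    """Distance from p to the closed ball of radius R around 0 intersected with N."""
    t = min(max(p[0], -R), R)  # nearest point of the segment [-R, R] x {0}
    return math.hypot(p[0] - t, p[1])

def in_V(p, R, delta):
    """Membership in the assumed set V_{R,delta}(0)."""
    return dist_to_segment(p, R) < delta

rng = random.Random(2)  # fixed seed for reproducibility
R, delta = 1.0, 0.25
checked = 0
violations = 0
for _ in range(10000):
    p = (rng.uniform(-1.5, 1.5), rng.uniform(-0.5, 0.5))
    # Hypotheses of the lemma: close to N and not too far from x_0.
    if dist_to_axis(p) < delta and math.hypot(p[0], p[1]) <= R - delta:
        checked += 1
        if not in_V(p, R, delta):
            violations += 1
print(checked, violations)
```

Every sampled point that satisfies the two hypotheses should lie in the assumed set, so the number of violations should be zero.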

In the following proposition, we obtain a lower bound in probability for the sets $A_{n}$ , $n\in\mathbb{N}_{0}$ . The interesting observation is that Proposition 5.3 and Proposition 5.4, which obtain estimates for the solution of (5.1) conditioned on the events $A_{n}$ , $n\in\mathbb{N}_{0}$ , can be used together and inductively to obtain a lower bound in probability for the events $A_{n}$ , $n\in\mathbb{N}_{0}$ . Namely, Proposition 5.3 implies that, on the event $A_{n-1}$ , the process is converging to $\mathcal{M}\cap U$ in the normal directions with high probability, and Proposition 5.4 can be used to estimate the probability that the solution (5.1) escapes the basin of attraction along the tangential directions. We first introduce some convenient notation.

Proposition 5.7.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a non-empty $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

and for every $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , $x_{0}\in(\mathcal{M}\cap U)$ let $A_{n}(M,r,R,\delta,\theta,x_{0})\in\mathcal{F}$ satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6) it holds that

[TABLE]

Proof of Proposition 5.7.

Let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

Let $x_{0}\in(\mathcal{M}\cap U)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ which satisfy the conclusion of Proposition 2.7 for this set $V$ . Fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusion of Lemma 2.9. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ . As in Proposition 5.3, let $\nabla_{\theta}F^{M,n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be the functions which satisfy for every $(\theta,\omega)\in\mathbb{R}^{d}\times\Omega$ that

[TABLE]

Let $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , and for every $n\in\mathbb{N}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

Since it holds that

[TABLE]

it follows that

[TABLE]

The two terms on the righthand side of (5.79) will be handled separately. We will first prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

On the event $A_{n-1}$ , it follows from Lemma 2.8 that there exist $\varepsilon_{n}\colon A_{n-1}\rightarrow\mathbb{R}^{d}$ , $c_{1}\in(0,\infty)$ such that

[TABLE]

and such that on the event $A_{n-1}$ it holds that

[TABLE]

Therefore, on the event $A_{n-1}$ , we have that

[TABLE]

Lemma 2.9, (5.81), the choice of $\mathfrak{r}\in(0,\infty)$ , the definition of the projection, and the triangle inequality prove that there exist $c_{1},\lambda\in(0,\infty)$ such that on the event $A_{n-1}$ it holds that

[TABLE]

Fix $\delta_{1}\in(0,\delta_{0}]$ which satisfies that

[TABLE]

Let $\delta\in(0,\delta_{1}]$ . On the event $A_{n-1}$ , it follows from (5.84) and the choice of $\delta_{1}\in(0,\delta_{0}]$ that

[TABLE]

We therefore conclude that

[TABLE]

Similarly to (5.21) and computation (5.22), it follows from the independence of the random variables $X_{m,k}$ , $m,k\in\mathbb{N}$ , that

[TABLE]

and that

[TABLE]

The definition of $A_{n-1}$ , Chebyshev’s inequality, Lemma 5.1, and (5.88) prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

In the case of (5.89), Proposition 5.3 and Chebyshev’s inequality prove that, for the indicator function $\mathbf{1}_{A_{n-2}}$ of the event $A_{n-2}$ , there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

where we have used the fact that, since $\rho\in(\nicefrac{{2}}{{3}},1)$ , there exists $c\in(0,\infty)$ such that for every $n\in\mathbb{N}$ it holds that $(n-1)^{-\rho}\leq cn^{-\rho}$ . Furthermore, Chebyshev’s inequality and Lemma 5.1 prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Returning to (5.89), the previous two inequalities prove that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Combining (5.87), (5.90), and (5.93), there exists $c\in(0,\infty)$ such that

[TABLE]

which completes the proof of (5.80). Returning to (5.79), it follows from (5.94) that there exists $c\in(0,\infty)$ such that

[TABLE]

Therefore, there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

We will prove inductively that (5.96) implies that there exists $c\in(0,\infty)$ such that for every $n\in\mathbb{N}$ it holds that

[TABLE]

The base case $n=0$ follows immediately from $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ . For the inductive step, suppose that (5.101) is satisfied for some $n\in\mathbb{N}$ . It follows from (5.96) that

[TABLE]

It then follows from the inductive hypothesis (5.101) that

[TABLE]

which proves that

[TABLE]

Finally, since $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})\subseteq V_{R,\delta}(x_{0})$ implies that $\mathbb{P}(A_{0})=1$ , it holds that

[TABLE]

which completes the induction step, and the proof of (5.101). It remains only to estimate the final term on the righthand side of inequality (5.101). The definition of the events $A_{m}$ , $m\in\mathbb{N}_{0}$ , implies that

[TABLE]

Therefore, it holds that

[TABLE]

Lemma 5.6 proves that

[TABLE]

Since $\Theta^{M,r}_{0,\theta}\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ , the triangle inequality proves for every $k\in\{1,2,\ldots,n\}$ that

[TABLE]

Therefore, for every $k\in\{1,\ldots,n\}$ , on the event $\big{\{}\big{|}\Theta^{M,r}_{k,\theta}-x_{0}\big{|}>R-\delta\big{\}}$ it holds that

[TABLE]

This implies that

[TABLE]

In combination, (5.103), (5.104), and (5.107) prove that

[TABLE]

It follows from Proposition 5.4, (5.108), and Chebyshev’s inequality that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Returning to (5.101), it follows from (5.109) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

where we have used the fact that, since $\rho\in(\nicefrac{{2}}{{3}},1)$ , there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

This completes the proof of Proposition 5.7. ∎
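The elementary comparison between $(n-1)^{-\rho}$ and $n^{-\rho}$ used in the proof above can be made explicit. The following short derivation is a sketch stated for $n\geq 2$ (so that $(n-1)^{-\rho}$ is finite); the explicit constant $2^{\rho}$ is chosen here for illustration and is not taken from the original text.

```latex
n \geq 2
\;\Longrightarrow\; n-1 \geq \tfrac{n}{2}
\;\Longrightarrow\; (n-1)^{-\rho} \leq \big(\tfrac{n}{2}\big)^{-\rho}
= 2^{\rho}\, n^{-\rho} \leq 2\, n^{-\rho}
\qquad \text{for every } \rho \in (0,1).
```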

We will now use Proposition 5.3 and Proposition 5.7 to estimate the probability that SGD of mini-batch size $M\in\mathbb{N}$ converges to within distance $\varepsilon\in(0,1]$ of the manifold of local minima at time $n\in\mathbb{N}$ . In the theorem, we assume that the initial condition $\Theta^{M,r}_{0}$ is continuous uniformly distributed on a bounded open subset $A\subseteq\mathbb{R}^{d}$ which satisfies that $\mathcal{M}\cap U\cap A\neq\emptyset$ .

Theorem 5.8.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $\lambda\colon\mathcal{B}(\mathbb{R}^{d})\rightarrow[0,\infty]$ be the Lebesgue-Borel measure, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{M,r}_{0}\colon\Omega\rightarrow\mathbb{R}^{d}$ be continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be random variables which satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Theorem 5.8.

Let $x_{0}\in(\mathcal{M}\cap U\cap A)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ that satisfy the conclusion of Proposition 2.7 for this set $V$ . Fix $\mathfrak{r}\in(0,\infty)$ that satisfies the conclusions of Lemma 2.9 and Proposition 5.7. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $M\in\mathbb{N}$ . As in Proposition 5.3, let $\nabla_{\theta}F^{M,n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be the functions which satisfy for every $(\theta,\omega)\in\mathbb{R}^{d}\times\Omega$ that

[TABLE]

For every $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ and for every $n\in\mathbb{N}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

Let $\Theta^{M,r}_{0}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which is continuous uniformly distributed on $A$ , assume that $\Theta^{M,r}_{0}$ and $(X_{n,m})_{n,m\in\mathbb{N}}$ are independent, and for every $n\in\mathbb{N}$ let $\Theta^{M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that $\Theta^{M,r}_{n}=\Theta^{M,r}_{n,\Theta^{M,r}_{0}}$ . Let $n\in\mathbb{N}$ , $\varepsilon\in(0,1]$ . It holds that

[TABLE]

For the second term on the righthand side of (5.116), it follows from the continuous uniform distribution of $\Theta^{M,r}_{0}$ on $A$ that

[TABLE]

We will now estimate the first term on the righthand side of (5.119). For every $m\in\mathbb{N}_{0}$ , $\theta\in\mathbb{R}^{d}$ let $A_{m,\theta}\subseteq\Omega$ be the event which satisfies that

[TABLE]

and for every $m\in\mathbb{N}_{0}$ let $A_{m}\in\mathcal{F}$ satisfy that

[TABLE]

It holds that

[TABLE]

For the second term on the righthand side of (5.123), it follows from Proposition 5.7 that there exists $c\in(0,\infty)$ such that

[TABLE]

where we have used the fact that $\rho\in(\nicefrac{{2}}{{3}},1)$ implies that there exists $c\in(0,\infty)$ that satisfies for every $n\in\{2,3,\ldots\}$ that $n^{1-\rho}\leq c(n-1)^{1-\rho}$ . For the first term on the righthand side of (5.123), since the random variables $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, it holds that

[TABLE]

Proposition 5.3 and Chebyshev’s inequality prove that there exists $c\in(0,\infty)$ such that for every $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ it holds that

[TABLE]

In combination, (LABEL:gss_5) and (5.126) prove that there exists $c\in(0,\infty)$ such that

[TABLE]

Returning to (5.123), it follows from (5.124) and (5.127) that there exists $c\in(0,\infty)$ such that

[TABLE]

Returning finally to (5.119), it follows from (5.120) and (5.128) that there exists $c\in(0,\infty)$ such that

[TABLE]

which completes the proof of Theorem 5.8. ∎
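To make the setting of Theorem 5.8 concrete, the following is a minimal sketch of mini-batch SGD with learning rates $r\,n^{-\rho}$ and an initial value drawn uniformly from a bounded open set $A$. The toy objective, the box $A=(-2,2)^{2}$, the parameter values, and all function names are assumptions made for illustration and do not come from the paper. The toy objective is not globally convex, and its set of minima consists of two lines, so it is a non-compact one-dimensional manifold on which the Hessian has rank $d-\mathfrak{d}=1$.

```python
import random


def minibatch_sgd(grad_F, sample_x, theta0, r, rho, M, n_steps, rng):
    """Iterate Theta_n = Theta_{n-1} - r / (n**rho * M) * sum_m grad_F(Theta_{n-1}, X_{n,m})."""
    theta = list(theta0)
    for n in range(1, n_steps + 1):
        grad_sum = [0.0] * len(theta)
        for _ in range(M):  # fresh i.i.d. samples X_{n,1}, ..., X_{n,M}
            g = grad_F(theta, sample_x(rng))
            grad_sum = [a + b for a, b in zip(grad_sum, g)]
        step = r / (n ** rho * M)
        theta = [t - step * g for t, g in zip(theta, grad_sum)]
    return theta


# Toy objective (an illustrative assumption, not from the paper):
# F(theta, x) = (theta_1**2 - 1)**2 + x * theta_1 with x ~ N(0, 1), so that
# f(theta) = (theta_1**2 - 1)**2 on R^2 and the minima are the lines theta_1 = +1, -1.
def grad_F(theta, x):
    t1 = theta[0]
    return [4.0 * t1 * (t1 * t1 - 1.0) + x, 0.0]


def sample_x(rng):
    return rng.gauss(0.0, 1.0)


def dist_to_minima(theta):
    """Distance from theta to the set {theta_1 = +1} union {theta_1 = -1}."""
    return abs(abs(theta[0]) - 1.0)


rng = random.Random(0)
theta0 = [rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)]  # uniform on A = (-2, 2)^2
theta_n = minibatch_sgd(grad_F, sample_x, theta0,
                        r=0.05, rho=0.75, M=16, n_steps=2000, rng=rng)
print(dist_to_minima(theta0), dist_to_minima(theta_n))
```

In this toy run the second coordinate never moves, which reflects that the objective is flat along the manifold of minima; only the normal direction is contracted.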

The next corollary estimates the probability that $K\in\mathbb{N}$ independent samples of SGD with mini-batch size $M\in\mathbb{N}$ fail to converge to within distance $\varepsilon\in(0,1]$ of the manifold of local minima $\mathcal{M}\cap U$ at time $n\in\mathbb{N}$ . The proof is a straightforward consequence of Theorem 5.8 and the independence of the random variables.

Corollary 5.9.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, and assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,K\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Corollary 5.9.

Let $x_{0}\in(\mathcal{M}\cap U\cap A)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ which satisfy the conclusion of Proposition 2.7 for this set $V$ . Fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusions of Lemma 2.9 and Proposition 5.7. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,K\in\mathbb{N}$ . Since the $\Theta^{k,M,r}_{n}$ , $k\in\mathbb{N}$ , are i.i.d., it holds that

[TABLE]

Theorem 5.8 and (5.135) prove estimate (LABEL:mp_0), which completes the proof of Corollary 5.9. ∎
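The independence argument in this proof has a simple quantitative reading: if a single run lands within distance $\varepsilon$ of $\mathcal{M}\cap U$ with probability at least $p$, then all $K$ i.i.d. runs fail with probability at most $(1-p)^{K}$. The sketch below computes the smallest $K$ that pushes this failure probability below a tolerance $\eta$. The single-run probability `p_single` is a hypothetical input; in the paper it would be supplied by the lower bound of Theorem 5.8, which is not reproduced here.

```python
import math


def restarts_needed(p_single, eta):
    """Smallest K >= 1 with (1 - p_single)**K <= eta, for i.i.d. runs."""
    if not (0.0 < p_single <= 1.0 and 0.0 < eta <= 1.0):
        raise ValueError("need 0 < p_single <= 1 and 0 < eta <= 1")
    if p_single == 1.0:
        return 1  # a single run already succeeds almost surely
    # (1 - p)^K <= eta  <=>  K >= log(eta) / log(1 - p), both logarithms being <= 0
    return max(1, math.ceil(math.log(eta) / math.log(1.0 - p_single)))


print(restarts_needed(0.5, 0.01))  # seven runs suffice: 0.5**7 < 0.01 < 0.5**6
```

The number of restarts grows only logarithmically in $1/\eta$, which is why the sampling size $K$ can be chosen as a function of $\eta$ alone.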

The following corollary translates the convergence of $\Theta^{k,M,r}_{n}$ , $k\in\{1,2,\ldots,K\}$ , to the local manifold of minima $\mathcal{M}\cap U$ into a statement concerning the minimization of the objective function. The proof is a consequence of Corollary 5.9 and the local regularity of the objective function.

Corollary 5.10.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, and assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,K\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Corollary 5.10.

The proof is an immediate consequence of Corollary 5.9 and the local regularity of the objective function. ∎

Under the assumptions and notation of Corollary 5.10, since a random variable $\varTheta^{K,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ which satisfies that

[TABLE]

is either computationally inefficient or computationally impossible to obtain, we will prove that such a minimizer can be efficiently computed using mini-batch averages. In the following lemma, we prove that there exists a measurable selection that minimizes a mini-batch approximation.

Lemma 5.11.

Let $d\in\mathbb{N}$ , let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{k}\colon\Omega\rightarrow S$ , $k\in\mathbb{N}$ , be i.i.d. random variables, and let $\Theta^{k}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables. Then for every $K,\mathfrak{M}\in\mathbb{N}$ there exists a random variable $\varTheta^{K,\mathfrak{M}}\colon\Omega\rightarrow\mathbb{R}^{d}$ such that

[TABLE]

Proof of Lemma 5.11.

Let $K,\mathfrak{M}\in\mathbb{N}$ . Let $\mathfrak{K}\colon\Omega\rightarrow\{1,2,\ldots,K\}$ satisfy for every $\omega\in\Omega$ that

[TABLE]

Let $\varTheta^{K,\mathfrak{M}}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that

[TABLE]

It follows from (5.142) and (5.143) that $\varTheta^{K,\mathfrak{M}}$ is measurable and satisfies (5.141), which completes the proof of Lemma 5.11. ∎
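The selection constructed in Lemma 5.11 amounts to an argmin over finitely many candidates, evaluated on a mini-batch sum of $F$. A minimal sketch follows; the function name, the toy loss in the usage lines, and the convention that ties go to the smallest index are assumptions made for illustration (the tie-breaking rule is one natural way to obtain a well-defined, measurable index).

```python
def select_by_minibatch(candidates, F, samples):
    """Return the candidate with the smallest mini-batch sum of F.

    Ties are resolved in favour of the smallest index (an assumed convention).
    """
    best_index, best_value = 0, None
    for k, theta in enumerate(candidates):
        value = sum(F(theta, x) for x in samples)
        if best_value is None or value < best_value:  # strict '<' keeps the first minimizer
            best_index, best_value = k, value
    return candidates[best_index]


# Usage with a hypothetical scalar loss F(theta, x) = (theta - x)**2.
def F(theta, x):
    return (theta - x) ** 2


chosen = select_by_minibatch([0.0, 2.0, 1.0], F, samples=[1.0, 1.0])
print(chosen)  # 1.0, since its mini-batch sum is 0 while the other two sums are 2
```

Only function evaluations of $F$ are needed for the selection, not evaluations of the expectation $f$, which is what makes the selection computable in practice.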

In the following theorem, we prove that the minimum appearing on the lefthand side of (LABEL:mp_0) can be efficiently computed using mini-batch averages of the type appearing in Lemma 5.11.

Theorem 5.12.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that $(\Theta^{k,M,r}_{n-1})_{k\in\{2,3,\ldots\}}$ and $(X_{n,k})_{k\in\mathbb{N}}$ are independent, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

and for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\infty)$ let $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which satisfies that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,\mathfrak{M},K\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Theorem 5.12.

Let $x_{0}\in(\mathcal{M}\cap U\cap A)$ . Since $U\subseteq\mathbb{R}^{d}$ is open, fix $V\in\operatorname{Proj}(x_{0})$ (cf. Definition 2.2) which satisfies that $V\subseteq U$ . Fix $R_{0},\delta_{0}\in(0,\infty)$ which satisfy the conclusion of Proposition 2.7 for this set $V$ . Fix $\mathfrak{r}\in(0,\infty)$ which satisfies the conclusions of Lemma 2.9 and Proposition 5.7. Let $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,\mathfrak{M},K\in\mathbb{N}$ . For every $i\in\{1,2,\ldots,K\}$ let $B^{\prime}_{i}\subseteq\Omega$ satisfy that

[TABLE]

and let $B_{1}\subseteq\Omega$ satisfy that $B_{1}=B^{\prime}_{1}$ and for every $i\in\{2,3,\ldots,K\}$ let $B_{i}\subseteq\Omega$ satisfy that $B_{i}=B^{\prime}_{i}\backslash\cup_{m=1}^{i-1}B_{m}$ . Since the events $B_{i}$ , $i\in\{1,2,\ldots,K\}$ , are disjoint, it holds that

[TABLE]

For the first term on the righthand side of (5.150), Corollary 5.10 proves that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

We will now estimate the second term on the righthand side of (5.151). Let $\tilde{B}_{j}\subseteq\Omega$ , $j\in\{1,2,\ldots,K\}$ , be disjoint events which satisfy that $\Omega=\coprod_{j\in\{1,2,\ldots,K\}}\tilde{B}_{j}$ and that

[TABLE]

Since the events $\tilde{B}_{j}$ , $j\in\{1,2,\ldots,K\}$ , are disjoint, the final term of (5.150) satisfies that

[TABLE]

Let $F^{\mathfrak{M},n}\colon\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ , $\omega\in\Omega$ that

[TABLE]

For every $i,j\in\{1,2,\ldots,K\}$ , since it holds for every $\omega\in B_{i}\cap\tilde{B}_{j}$ that

[TABLE]

it holds for every $i,j\in\{1,2,\ldots,K\}$ that

[TABLE]

It follows from (5.153) and (5.156) that

[TABLE]

For the first term on the righthand side of (5.157), it holds that

[TABLE]

Since the random variables $(\Theta^{k,M,r}_{n})_{k\in\mathbb{N}}$ and $(X_{n+1,k})_{k\in\mathbb{N}}$ are independent, since the $(\Theta^{k,M,r}_{n})_{k\in\mathbb{N}}$ are identically distributed, and since the distribution of $\Theta^{1,M,r}_{n}$ has bounded support on $\mathbb{R}^{d}$ , for the distribution $\mu_{n}$ of $\Theta^{1,M,r}_{n}$ on $\mathbb{R}^{d}$ , Lemma 5.1, Chebyshev’s inequality, and the definition of $F^{\mathfrak{M},n}$ prove that there exists $c\in(0,\infty)$ which satisfies for every $j\in\{1,\ldots,K\}$ that

[TABLE]

Therefore, it holds that

[TABLE]

For the second term on the righthand side of (5.157), it is sufficient to apply the same argument, which proves that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Returning to (5.153), it follows from (5.157) and (5.160) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Returning finally to (5.150), it follows from (5.151) and (5.162) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

which completes the proof of Theorem 5.12. ∎

In the final corollary of this section, we will quantify the computational efficiency of the algorithm proposed in Theorem 5.12. The constant implicitly depends on the computational cost of computing $F$ and $\nabla_{\theta}F$ and initializing the random variable $X_{1,1}$ , but it does not depend upon the running time $n\in\mathbb{N}$ , the sampling size $K\in\mathbb{N}$ , or the mini-batch sizes $M,\mathfrak{M}\in\mathbb{N}$ .

Corollary 5.13.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(\nicefrac{{2}}{{3}},1)$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that $(\Theta^{k,M,r}_{n-1})_{k\in\{2,3,\ldots\}}$ and $(X_{n,k})_{k\in\mathbb{N}}$ are independent, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

and for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\infty)$ let $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which satisfies that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ there exist $c_{i}\in(0,\infty)$ , $i\in\{1,2,3,4\}$ , such that for every $\varepsilon,\eta\in(0,1]$ , for $n(\varepsilon),M(\varepsilon),K(\eta),\mathfrak{M}(\varepsilon,\eta)\in\mathbb{N}$ which satisfy that

[TABLE]

it holds that

[TABLE]

Proof of Corollary 5.13.

Let $x_{0}\in(\mathcal{M}\cap U\cap A)$ . Let $R_{0},\delta_{0},\mathfrak{r}\in(0,\infty)$ satisfy the conclusion of Theorem 5.12. Theorem 5.12 proves that there exists $\overline{c}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,\mathfrak{M},K\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Fix $\overline{R}\in(0,R_{0}]$ , $\overline{\delta}\in(0,\delta_{0}]$ which satisfy that

[TABLE]

Since $\mathcal{M}\cap U\cap A\neq\emptyset$ , it holds that

[TABLE]

For every $M\in\mathbb{N}$ which satisfies that $M\geq 2\overline{c}$ , since $\rho\in(\nicefrac{{2}}{{3}},1)$ there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

and therefore for every $M\geq 2\overline{c}$ there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

It follows from (5.170) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Returning to (5.169), it follows from (5.173) and (5.174) that there exists $c\in(0,\infty)$ which satisfies that

[TABLE]

Let $\eta\in(0,1]$ . It follows from (5.170), (5.171) and an explicit computation that there exist $c_{i}\in(0,\infty)$ , $i\in\{1,2,3,4\}$ , and $\mathfrak{r}_{1}\in(0,\mathfrak{r}]$ such that for $n(\varepsilon),M(\varepsilon),\mathfrak{M}(\varepsilon,\eta),K(\eta)\in\mathbb{N}$ which satisfy that

[TABLE]

it holds that

[TABLE]

and for every $r\in(0,\mathfrak{r}_{1}]$ that

[TABLE]

Returning to (5.175), it follows for every $r\in(0,\mathfrak{r}_{1}]$ that

[TABLE]

which completes the proof of Corollary 5.13. ∎
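The cost of the algorithm can be tallied directly from its structure: $K$ independent SGD runs of $n$ steps with mini-batch size $M$, followed by one selection step over a mini-batch of size $\mathfrak{M}$. The accounting below is a plausible sketch under the assumptions that every run draws its own samples and that the selection step reuses one batch for all candidates; it is not the cost estimate displayed in Corollary 5.13, and the function and key names are hypothetical.

```python
def evaluation_counts(n, M, K, M_sel):
    """Count elementary operations for K SGD runs plus one mini-batch selection.

    n: number of SGD steps, M: SGD mini-batch size,
    K: number of independent runs, M_sel: mini-batch size of the selection step.
    """
    gradient_evals = K * n * M       # one gradient of F per sample, step, and run
    function_evals = K * M_sel       # F evaluated at each of the K candidates on the selection batch
    samples_drawn = K * n * M + M_sel  # assumed: separate samples per run, one shared selection batch
    return {
        "gradient_evals": gradient_evals,
        "function_evals": function_evals,
        "samples_drawn": samples_drawn,
    }


counts = evaluation_counts(n=100, M=8, K=5, M_sel=50)
print(counts)
```

Every count is a product of the four sizes with constants that do not depend on them, which matches the remark that the constant in the corollary is independent of $n$, $K$, $M$, and $\mathfrak{M}$.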

6 Stochastic gradient descent - The compact case

In this section, we will analyze the convergence of SGD to the manifold of local minima under the additional assumption that the manifold of local minima is compact. The essential difference in this case is that SGD cannot leave a basin of attraction along directions tangential to the manifold. We first observe the convergence of SGD in directions normal to the manifold.
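A simple instance of the compact case is an objective whose minima form the unit circle in $\mathbb{R}^{2}$. The sketch below is a toy illustration under assumed choices (objective, noise model, step parameters, and the starting point are not from the paper): it uses $f(\theta)=(|\theta|^{2}-1)^{2}$, for which the Hessian on the circle equals $8\,\theta\theta^{\top}$ and hence has rank $1=d-\mathfrak{d}$. Noise may move the iterate along the circle, but tangential motion cannot carry it out of the basin of attraction.

```python
import math
import random


def sgd_near_circle(theta0, r, rho, M, n_steps, rng):
    """Mini-batch SGD for the toy loss F(theta, x) = (|theta|^2 - 1)^2 + <x, theta>, x ~ N(0, I_2)."""
    t1, t2 = theta0
    for n in range(1, n_steps + 1):
        s = t1 * t1 + t2 * t2 - 1.0
        # Mini-batch average of the noise part of the gradient (one Gaussian per coordinate and sample).
        x1 = sum(rng.gauss(0.0, 1.0) for _ in range(M)) / M
        x2 = sum(rng.gauss(0.0, 1.0) for _ in range(M)) / M
        step = r / n ** rho
        # Averaged gradient: 4 (|theta|^2 - 1) theta + mean of x.
        t1, t2 = t1 - step * (4.0 * s * t1 + x1), t2 - step * (4.0 * s * t2 + x2)
    return t1, t2


def dist_to_circle(theta):
    """Distance from theta to the unit circle."""
    return abs(math.hypot(theta[0], theta[1]) - 1.0)


rng = random.Random(1)
start = (1.5, 0.0)
end = sgd_near_circle(start, r=0.05, rho=0.5, M=16, n_steps=2000, rng=rng)
print(dist_to_circle(start), dist_to_circle(end))
```

The exponent `rho=0.5` lies outside $(\nicefrac{2}{3},1)$, which is in line with the statement in this section that the compact-case results apply to $\rho\in(0,1)$.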

The following proposition is an immediate consequence of Proposition 5.3 and the compactness of $\mathcal{M}\cap U$ , where the essential difference in the compact case is that $R\in(0,\infty)$ can be chosen arbitrarily large. In particular, by compactness, for every $x_{0}\in(\mathcal{M}\cap U)$ there exists $R_{0}\in(0,\infty)$ such that for every $R_{1},R_{2}\in[R_{0},\infty)$ , $\delta\in(0,\infty)$ it holds that $V_{R_{1},\delta}(x_{0})=V_{R_{2},\delta}(x_{0})$ . Furthermore, it follows from Remark 5.5 that the results apply to $\rho\in(0,1)$ .

Proposition 6.1.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a non-empty compact $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

and for every $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , $x_{0}\in(\mathcal{M}\cap U)$ let $A_{n}(M,r,R,\delta,\theta,x_{0})\in\mathcal{F}$ satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,\infty)$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\theta\in V_{R,\delta}(x_{0})$ (cf. Definition 2.6) it holds that

[TABLE]

Proof of Proposition 6.1.

The proof is an immediate consequence of Proposition 5.3 and the compactness of $\mathcal{M}\cap U$ . ∎

We will now obtain a lower bound in probability for the events $A_{m}$ , $m\in\mathbb{N}$ . It follows from Proposition 5.7 and the compactness of $\mathcal{M}\cap U$ that for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that the conclusion of Proposition 5.7 is satisfied for every $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , and $R\in(0,\infty)$ for this constant $c\in(0,\infty)$ . That is, since for every $R_{1},R_{2}\in(0,\infty)$ sufficiently large we have $V_{R_{1},\delta}(x_{0})=V_{R_{2},\delta}(x_{0})$ , it holds that the constant can be chosen independently of $R\in(0,\infty)$ .

The proof of the following proposition is then an immediate consequence of Proposition 5.7, after using the fact that the constant $c\in(0,\infty)$ is independent of $R\in(0,\infty)$ and passing to the limit $R\rightarrow\infty$ . The improvement in the estimate, when compared to Proposition 5.7, is a result of the fact that SGD cannot leave the basin of attraction along the directions tangential to the manifold.

Proposition 6.2.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

$$\mathbf{d}(x,\mathcal{M}\cap U)=\inf\nolimits_{\theta\in(\mathcal{M}\cap U)}|x-\theta|,$$

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

$$(x)_{+}=\max\{x,0\},$$

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a non-empty compact $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{0,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy for every $\omega\in\Omega$ that $\Theta^{M,r}_{0,\theta}(\omega)=\theta$ , for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ satisfy that

[TABLE]

and for every $n,M\in\mathbb{N}$ , $r,R,\delta\in(0,\infty)$ , $\theta\in\mathbb{R}^{d}$ , $x_{0}\in(\mathcal{M}\cap U)$ let $A_{n}(M,r,R,\delta,\theta,x_{0})\in\mathcal{F}$ satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U)$ there exist $\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,\infty)$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\theta\in V_{\nicefrac{{R}}{{2}},\delta}(x_{0})$ (cf. Definition 2.6) it holds that

[TABLE]

Proof of Proposition 6.2.

The proof is an immediate consequence of Proposition 5.7 and the compactness of $\mathcal{M}\cap U$ .∎

The following theorem proves the convergence of SGD with initial data sampled from a uniform distribution on a bounded open set $A\subseteq\mathbb{R}^{d}$ which satisfies that $\mathcal{M}\cap U\cap A\neq\emptyset$ . The proof is an immediate consequence of Theorem 5.8, Proposition 6.1, and Proposition 6.2.

Theorem 6.3.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $\lambda\colon\mathcal{B}(\mathbb{R}^{d})\rightarrow[0,\infty]$ be the Lebesgue-Borel measure, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $\mathbf{d}(\cdot,\mathcal{M}\cap U):\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}^{d}$ that

$$\mathbf{d}(x,\mathcal{M}\cap U)=\inf\nolimits_{\theta\in(\mathcal{M}\cap U)}|x-\theta|,$$

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

$$(x)_{+}=\max\{x,0\},$$

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a compact $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{M,r}_{0}\colon\Omega\rightarrow\mathbb{R}^{d}$ be continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, and for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{M,r}_{n,\theta}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $n\in\mathbb{N}$ , be random variables which satisfy that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,\infty)$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Theorem 6.3.

The proof is an immediate consequence of Theorem 5.8, Proposition 6.1, and Proposition 6.2.∎
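The setting of Theorem 6.3 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it runs mini-batch SGD with learning rates $r/n^{\rho}$ from an initial value drawn uniformly from a bounded open set $A$. The toy loss $F(\theta,x)=(|\theta|^{2}-1-x)^{2}$ with $x$ uniform on $[-\nicefrac{1}{2},\nicefrac{1}{2})$, the set $A=(-1.5,1.5)^{2}$, and all numerical parameter values are assumptions chosen for illustration only. For this loss the minima of $f$ form the unit circle, a compact $1$-dimensional submanifold of $\mathbb{R}^{2}$ on which the Hessian of $f$ has rank $1$.

```python
import numpy as np

def grad_F(theta, xs):
    # Gradient in theta of the hypothetical toy loss
    # F(theta, x) = (|theta|^2 - 1 - x)^2, one row per sample x.
    residual = theta @ theta - 1.0 - xs               # shape (M,)
    return 4.0 * residual[:, None] * theta[None, :]   # shape (M, d)

def sgd(theta0, n_steps, M, r, rho, rng):
    # Mini-batch SGD with batch size M and learning rate r / n**rho.
    theta = np.array(theta0, dtype=float)
    for n in range(1, n_steps + 1):
        xs = rng.uniform(-0.5, 0.5, size=M)           # fresh i.i.d. samples
        theta = theta - (r / n**rho) * grad_F(theta, xs).mean(axis=0)
    return theta

def dist_to_minima(theta):
    # Distance to the unit circle, the manifold of minima of the toy problem.
    return abs(np.linalg.norm(theta) - 1.0)

rng = np.random.default_rng(0)
theta0 = rng.uniform(-1.5, 1.5, size=2)               # uniform on A = (-1.5, 1.5)^2
theta_n = sgd(theta0, n_steps=2000, M=8, r=0.1, rho=0.5, rng=rng)
print("distance to manifold of minima:", dist_to_minima(theta_n))
```

In this sketch the iterates may drift along the circle, but only the distance to the circle is measured, which mirrors the role of $\mathbf{d}(\cdot,\mathcal{M}\cap U)$ in the statement.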

The following theorem estimates the probability that $K\in\mathbb{N}$ independent solutions of SGD, with initial data sampled from a uniform distribution on a bounded open set $A\subseteq\mathbb{R}^{d}$ which satisfies that $\mathcal{M}\cap U\cap A$ is non-empty, fail to converge to within distance $\varepsilon\in(0,1]$ of the local manifold of minima at time $n\in\mathbb{N}$ . The convergence is measured by minimizing a mini-batch average of the objective function. The proof is a consequence of Theorem 6.3 and the arguments leading from Theorem 5.8 to Theorem 5.12.

Theorem 6.4.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $\left|\cdot\right|\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the standard norm on $\mathbb{R}^{d}$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

let $(\cdot)_{+}\colon\mathbb{R}\rightarrow\mathbb{R}$ be the function which satisfies for every $x\in\mathbb{R}$ that

$$(x)_{+}=\max\{x,0\},$$

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a compact $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that $(\Theta^{k,M,r}_{n-1})_{k\in\{2,3,\ldots\}}$ and $(X_{n,k})_{k\in\mathbb{N}}$ are independent, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

and for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\infty)$ let $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which satisfies that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $\delta_{0},\mathfrak{r},c\in(0,\infty)$ such that for every $R\in(0,\infty)$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ , $n,M,K\in\mathbb{N}$ , $\varepsilon\in(0,1]$ it holds that

[TABLE]

Proof of Theorem 6.4.

The proof is an immediate consequence of Theorem 6.3, Theorem 5.8, and Theorem 5.12.∎
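The selection step in Theorem 6.4 can be sketched as follows: run $K$ independent SGD trajectories and keep the one whose mini-batch average of $F$ over $\mathfrak{M}$ fresh samples is smallest. This is a hedged illustration only. The toy loss $F(\theta,x)=(|\theta|^{2}-1-x)^{2}$, the sampling interval, the set $A=(-1.5,1.5)^{2}$, the helper names (`sgd`, `select_best`, `M_select` standing in for $\mathfrak{M}$), and all parameter values are assumptions and do not come from the paper.

```python
import numpy as np

def F(theta, xs):
    # Hypothetical toy loss F(theta, x) = (|theta|^2 - 1 - x)^2, vectorized in x.
    return (theta @ theta - 1.0 - xs) ** 2

def grad_F(theta, xs):
    # Gradient of F in theta, one row per sample.
    residual = theta @ theta - 1.0 - xs
    return 4.0 * residual[:, None] * theta[None, :]

def sgd(theta0, n_steps, M, r, rho, rng):
    # Mini-batch SGD with batch size M and learning rate r / n**rho.
    theta = np.array(theta0, dtype=float)
    for n in range(1, n_steps + 1):
        xs = rng.uniform(-0.5, 0.5, size=M)
        theta = theta - (r / n**rho) * grad_F(theta, xs).mean(axis=0)
    return theta

def select_best(K, n_steps, M, M_select, r, rho, rng):
    # K independent runs, each started uniformly on A = (-1.5, 1.5)^2.
    runs = [sgd(rng.uniform(-1.5, 1.5, size=2), n_steps, M, r, rho, rng)
            for _ in range(K)]
    # One shared batch of M_select fresh samples scores every run.
    xs = rng.uniform(-0.5, 0.5, size=M_select)
    scores = [float(F(theta, xs).mean()) for theta in runs]
    return runs[int(np.argmin(scores))], runs, scores

rng = np.random.default_rng(2)
best, runs, scores = select_best(K=5, n_steps=500, M=8, M_select=64,
                                 r=0.1, rho=0.5, rng=rng)
print("selected run:", best, "score:", min(scores))
```

In this toy problem every run is expected to end near the circle of minima, so the selection matters little; it becomes relevant when some initial values lie outside the basin of attraction, which is the situation the probability estimate accounts for.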

In the final result of this section, we prove that the computational efficiency of the SGD algorithm proposed in Theorem 6.4 is improved by the compactness of $\mathcal{M}\cap U$ . The improvement is due to the fact that the mini-batch size $M\in\mathbb{N}$ can be chosen smaller in the compact case, since the mini-batch size no longer needs to account for the possibility that SGD leaves a basin of attraction along directions tangential to the local manifold of minima.

Corollary 6.5.

Let $d\in\mathbb{N}$ , $\mathfrak{d}\in\{0,1,\ldots,d-1\}$ , $\rho\in(0,1)$ , let $U\subseteq\mathbb{R}^{d}$ be an open set, let $A\subseteq\mathbb{R}^{d}$ be a bounded open set, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $(S,\mathcal{S})$ be a measurable space, let $F=(F(\theta,x))_{(\theta,x)\in\mathbb{R}^{d}\times S}\colon\mathbb{R}^{d}\times S\rightarrow\mathbb{R}$ be a measurable function, let $X_{n,m}\colon\Omega\rightarrow S$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables which satisfy for every $\theta\in\mathbb{R}^{d}$ that $\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}\big{]}<\infty$ , let $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{d}$ that $f(\theta)=\mathbb{E}\big{[}F(\theta,X_{1,1})\big{]}$ , let $\mathcal{M}\subseteq\mathbb{R}^{d}$ satisfy that

[TABLE]

assume for every $x\in S$ that $\mathbb{R}^{d}\ni\theta\mapsto F(\theta,x)\in\mathbb{R}$ is a locally Lipschitz continuous function, assume that $f|_{U}\colon U\rightarrow\mathbb{R}$ is a three times continuously differentiable function, assume for every non-empty compact set $\mathfrak{C}\subseteq U$ that $\sup\nolimits_{\theta\in\mathfrak{C}}\mathbb{E}\big{[}|F(\theta,X_{1,1})|^{2}+|(\nabla_{\theta}F)(\theta,X_{1,1})|^{2}\big{]}<\infty$ , assume that $\mathcal{M}\cap U$ is a compact $\mathfrak{d}$ -dimensional $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{d}$ , assume that $\mathcal{M}\cap U\cap A\neq\emptyset$ , assume for every $\theta\in(\mathcal{M}\cap U)$ that $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}$ , for every $n\in\mathbb{N}_{0}$ , $M\in\mathbb{N}$ , $r\in(0,\infty)$ let $\Theta^{k,M,r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ , $k\in\mathbb{N}$ , be i.i.d. random variables, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that $(\Theta^{k,M,r}_{n-1})_{k\in\{2,3,\ldots\}}$ and $(X_{n,k})_{k\in\mathbb{N}}$ are independent, assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{1,M,r}_{0}$ is continuous uniformly distributed on $A$ , assume for every $M\in\mathbb{N}$ , $r\in(0,\infty)$ that $\Theta^{M,r}_{0}$ and $\big{(}X_{n,m}\big{)}_{n,m\in\mathbb{N}}$ are independent, assume for every $n,M\in\mathbb{N}$ , $r\in(0,\infty)$ that

[TABLE]

and for every $n,M,\mathfrak{M},K\in\mathbb{N}$ , $r\in(0,\infty)$ let $\varTheta^{K,M,\mathfrak{M},r}_{n}\colon\Omega\rightarrow\mathbb{R}^{d}$ be a random variable which satisfies that

[TABLE]

Then for every $x_{0}\in(\mathcal{M}\cap U\cap A)$ there exist $R_{0},\delta_{0},\mathfrak{r}\in(0,\infty)$ such that for every $R\in(0,R_{0}]$ , $\delta\in(0,\delta_{0}]$ , $r\in(0,\mathfrak{r}]$ there exist $c_{i}\in(0,\infty)$ , $i\in\{1,2,3,4\}$ , such that for every $\varepsilon,\eta\in(0,1]$ , for $n(\varepsilon),M(\varepsilon),K(\eta),\mathfrak{M}(\varepsilon,\eta)\in\mathbb{N}$ which satisfy that

[TABLE]

it holds that

[TABLE]

Proof of Corollary 6.5.

The proof is an immediate consequence of Theorem 6.4 and the proof of Corollary 5.13.∎

7 Applications

In this section, we prove that the conditions of Theorem 1.1 are satisfied for some (simple) objective functions $f\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ of the type (1.33) that arise in the training of neural networks. We will consider the case of a four-parameter affine-linear network with a linear activation function and the case of a two-parameter network with the ReLU activation function. We will prove that the sets of global minima are, respectively, a codimension $2$ submanifold and a codimension $1$ submanifold of the parameter space. This implies, in particular, that the global minima are not locally unique, and that the established convergence results, such as those proven in [13, 24], do not apply.

7.1 A four-parameter network with a linear activation function

In this section, we show that the conditions of Theorem 1.1 are satisfied by a four-parameter affine-linear network with a linear activation function.

Proposition 7.1.

Let $\varphi\in L^{2}([0,1])$ be finite, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X_{n,m}\colon\Omega\rightarrow[0,1]$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables that are continuous uniformly distributed on $[0,1]$ , let $f\colon\mathbb{R}^{4}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})\in\mathbb{R}^{4}$ that

[TABLE]

and let $F\colon\mathbb{R}^{4}\times[0,1]\rightarrow\mathbb{R}$ be the function that satisfies for every $\theta\in\mathbb{R}^{4}$ , $x\in[0,1]$ that

[TABLE]

Then the functions $f$ , $F$ and the random variables $X_{n,m}$ , $n,m\in\mathbb{N}$ , satisfy the conditions of Theorem 1.1.

Proof of Proposition 7.1.

Let $\varphi\in L^{2}([0,1])$ be finite. The finiteness of $\varphi$ proves that, for every $x\in[0,1]$ , we have $F(\cdot,x)\in\operatorname{C}^{0,1}_{\textrm{loc}}(\mathbb{R}^{4})$ . It follows by the uniform distribution of the $X_{n,m}$ , $n,m\in\mathbb{N}$ , on $[0,1]$ that $f(\cdot)=\mathbb{E}[F(\cdot,X_{1,1})]$ , and it follows from the $L^{2}$ -integrability of $\varphi$ that for every compact subset $\mathfrak{C}\subseteq\mathbb{R}^{4}$ it holds that

[TABLE]

It follows by the definition of $f$ and $\varphi\in L^{2}([0,1])$ that $f\in\operatorname{C}^{3}_{\textrm{loc}}(\mathbb{R}^{4})$ . It remains to characterize the set of minima of $f$ . We first observe that when minimizing $f$ , it is sufficient to minimize the potential over the set $\{\theta_{3}\neq 0\}$ . To see this, suppose that $\theta=(\theta_{1},\theta_{2},0,\theta_{4})$ . Then for $\tilde{\theta}=(0,0,1,\theta_{4})$ it holds that

[TABLE]

Therefore, it holds that

[TABLE]

Let $\theta\in\mathbb{R}^{4}\cap\{\theta_{3}\neq 0\}$ be fixed but arbitrary. An explicit computation proves the critical points of $f$ satisfy that

[TABLE]

For $r_{k}\in\mathbb{R}$ , $k\in\{0,1\}$ , which satisfy that

[TABLE]

it follows that $\theta\in\mathbb{R}^{4}$ satisfies equation (7.6) if and only if it holds that

[TABLE]

For $\theta\in\mathbb{R}^{4}$ which satisfies that $\theta_{3}\neq 0$ , an explicit computation proves that $\theta$ satisfies system (7.8) if and only if it holds that

[TABLE]

For $U\subseteq\mathbb{R}^{4}$ which satisfies that

[TABLE]

for $\mathcal{M}\subseteq\mathbb{R}^{4}$ which satisfies that

[TABLE]

we claim that

[TABLE]

Let $\theta\in\mathbb{R}^{4}$ satisfy (7.9) and $\theta_{3}\neq 0$ . Proceeding by contradiction, suppose that there exists $\theta_{0}=(\theta_{1,0},\theta_{2,0},\theta_{3,0},\theta_{4,0})$ which satisfies $\theta_{3,0}\neq 0$ such that

[TABLE]

Since an explicit computation proves for every $(\theta_{1},\theta_{4})\in\mathbb{R}^{2}$ that

[TABLE]

the identical considerations leading to (7.9) prove that

[TABLE]

is uniquely minimized, owing to $\theta_{3,0}\neq 0$ , by $(\theta_{1},\theta_{4})\in\mathbb{R}^{2}$ which satisfies that

[TABLE]

We conclude that $\tilde{\theta}_{0}\in\mathbb{R}^{4}$ satisfies that

[TABLE]

satisfies (7.9) and $\tilde{\theta}_{3,0}\neq 0$ . Therefore, it holds that

[TABLE]

which contradicts the fact that $\nabla f=0$ on the connected set of $\theta\in\mathbb{R}^{4}$ which satisfies (7.9) and $\theta_{3}\neq 0$ . This proves (7.12). It is immediate from (7.9) that $\mathcal{M}\cap U$ is a non-empty, $2$ -dimensional, $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{4}$ . It remains only to prove the nondegeneracy assumption. For every $\theta\in(\mathcal{M}\cap U)$ , after computing the Hessian (due to the symmetry of the Hessian, we only write the upper diagonal), it holds that

[TABLE]

where this equality relies upon the fact that, due to (7.6) and $\theta_{3}\neq 0$ on $\mathcal{M}\cap U$ , we have that

[TABLE]

A column-reduction, which relies on the fact that for every $\theta\in(\mathcal{M}\cap U)$ we have $\theta_{3}\neq 0$ , proves for every $\theta\in(\mathcal{M}\cap U)$ that

[TABLE]

This completes the proof of Proposition 7.1. ∎
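The rank condition $\operatorname{rank}((\operatorname{Hess}f)(\theta))=d-\mathfrak{d}=2$ can be checked numerically in a simplified instance. The sketch below rests on assumptions that are not taken from the displayed formulas: the network is assumed to have the form $x\mapsto\theta_{3}(\theta_{1}x+\theta_{2})+\theta_{4}$, the loss is assumed to be the squared $L^{2}([0,1])$ distance to the target, and the target is the hypothetical affine function $\varphi(x)=2x+1$, so that the minimum value is zero. Under these assumptions $f$ depends on $\theta$ only through the slope $a=\theta_{1}\theta_{3}$ and the intercept $b=\theta_{2}\theta_{3}+\theta_{4}$, and at a minimum the chain rule gives $\operatorname{Hess}f=J^{\top}HJ$, where $J$ is the Jacobian of $(a,b)$ and $H$ is the Hessian of $(a,b)\mapsto\int_{0}^{1}(ax+b)^{2}\,dx$.

```python
import numpy as np

# Hessian of g(a, b) = int_0^1 (a x + b)^2 dx = a^2/3 + a b + b^2 in (a, b).
H_AB = np.array([[2.0 / 3.0, 1.0],
                 [1.0, 2.0]])

def effective_params(theta):
    # Slope and intercept realized by the assumed network
    # x -> theta3 * (theta1 * x + theta2) + theta4.
    t1, t2, t3, t4 = theta
    return np.array([t1 * t3, t2 * t3 + t4])

def jacobian(theta):
    # Jacobian of (slope, intercept) with respect to (theta1, ..., theta4).
    t1, t2, t3, t4 = theta
    return np.array([[t3, 0.0, t1, 0.0],
                     [0.0, t3, t2, 1.0]])

def hessian_at_minimum(theta):
    # Valid only where the gradient of g vanishes, i.e. on the set of minima,
    # because the second-derivative terms of the chain rule drop out there.
    J = jacobian(theta)
    return J.T @ H_AB @ J

# Hypothetical target phi(x) = 2 x + 1: minima satisfy slope = 2, intercept = 1.
theta = np.array([1.0, 0.25, 2.0, 0.5])
assert np.allclose(effective_params(theta), [2.0, 1.0])
rank = np.linalg.matrix_rank(hessian_at_minimum(theta), tol=1e-9)
print("rank of Hessian at the minimum:", rank)
```

The two vanishing eigenvalues correspond to the directions tangential to the two-dimensional set of minima; the explicit tolerance guards against round-off in the numerically zero singular values.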

7.2 A two-parameter network with the ReLU activation function

In this section, we show that the conditions of Theorem 1.1 are satisfied by a two-parameter affine-linear network with the ReLU activation function.

Proposition 7.2.

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X_{n,m}\colon\Omega\rightarrow[0,1]$ , $n,m\in\mathbb{N}$ , be i.i.d. random variables that are continuous uniformly distributed on $[0,1]$ , let $f\colon\mathbb{R}^{2}\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta=(\theta_{1},\theta_{2})\in\mathbb{R}^{2}$ that

[TABLE]

and let $F\colon\mathbb{R}^{2}\times[0,1]\rightarrow\mathbb{R}$ be the function which satisfies for every $\theta\in\mathbb{R}^{2}$ , $x\in[0,1]$ that

[TABLE]

Then the functions $f$ , $F$ and the random variables $X_{n,m}$ , $n,m\in\mathbb{N}$ , satisfy the conditions of Theorem 1.1.

Proof of Proposition 7.2.

It is immediate that $F(\cdot,x)\in\operatorname{C}^{0,1}_{\textrm{loc}}(\mathbb{R}^{2})$ . Since the $X_{n,m}$ , $n,m\in\mathbb{N}$ , are uniformly distributed on $[0,1]$ , for every $\theta\in\mathbb{R}^{2}$ it holds that

[TABLE]

and, furthermore, a straightforward computation proves for every compact set $\mathfrak{C}\subseteq\mathbb{R}^{2}$ that

[TABLE]

It remains only to characterize the minima of the objective function, and to verify the nondegeneracy condition. An explicit computation proves that, when minimizing $f$ , it is sufficient to restrict to the set $\{\theta_{1}>0,\theta_{2}>0\}$ . Let $U\subseteq\mathbb{R}^{2}$ satisfy that

[TABLE]

We observe for every $\theta\in U$ that

[TABLE]

and for every $\theta\in U$ that

[TABLE]

Therefore, for $\theta\in U$ it holds that $\nabla f(\theta)=0$ if and only if it holds that

[TABLE]

Let $\mathcal{M}\subseteq\mathbb{R}^{2}$ satisfy that

[TABLE]

We claim that

[TABLE]

Suppose that $\theta\in U$ satisfies (7.29). Proceeding by contradiction, suppose that there exists $\theta_{0}=(\theta_{1,0},\theta_{2,0})\in\{\theta_{1}>0,\theta_{2}>0\}$ such that

[TABLE]

Since $\theta_{1,0}>0$ an explicit computation proves that

[TABLE]

The arguments leading from (7.27) to (7.29) prove that (7.33) is uniquely minimized when

[TABLE]

Therefore, for $\tilde{\theta}_{0}\in\mathbb{R}^{2}$ which satisfies that

[TABLE]

we have that $\tilde{\theta}_{0}\in U$ , that $\tilde{\theta}_{0}$ satisfies (7.29), and that

[TABLE]

This contradicts the fact that $\nabla f=0$ on the connected set of $\theta\in U$ that satisfy (7.29). This proves (7.31). Since it is clear that $\mathcal{M}\cap U$ is a non-empty, $1$ -dimensional, $\operatorname{C}^{1}$ -submanifold of $\mathbb{R}^{2}$ , it remains only to establish the nondegeneracy assumption. For every $\theta\in(\mathcal{M}\cap U)$ it holds that

[TABLE]

A column reduction and $\theta_{2}\neq 0$ prove for every $\theta\in(\mathcal{M}\cap U)$ that

[TABLE]

This completes the proof of Proposition 7.2. ∎

Acknowledgements

The first author acknowledges financial support from the National Science Foundation Mathematical Sciences Postdoctoral Research Fellowship under Grant Number 1502731.

The second author acknowledges financial support by the DFG through the CRC 1283 “Taming uncertainty and profiting from randomness and low regularity in analysis, stochastics and their applications.”

Bibliography

[1] M. Anitescu. Degenerate Nonlinear Programming with a Quadratic Growth Condition. 10(4):1116–1135.
[2] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res., 15:595–627, 2014.
[3] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems (NIPS), 2011.
[4] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
[5] B. Bercu and J.-C. Fort. Generic stochastic gradient methods. Wiley Encyclopedia of Operations Research and Management Science, pages 1–8, 2013.
[6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Physica-Verlag/Springer, Heidelberg, 2010.
[7] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Optimization for Machine Learning, MIT Press, pages 351–368, 2011.
[8] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. 60(2):223–311.
