Online estimation of the asymptotic variance for averaged stochastic gradient algorithms

Antoine Godichon-Baggioni

arXiv:1702.00931·math.ST·October 17, 2017

TL;DR

This paper proves a Central Limit Theorem for stochastic gradient algorithms in Hilbert spaces, introduces a recursive method to estimate their asymptotic variance, and demonstrates its effectiveness through logistic regression and geometric quantile examples.

Contribution

It establishes a CLT for averaged stochastic gradient estimates in Hilbert spaces and proposes a new recursive algorithm for asymptotic variance estimation.

Findings

01. Proves asymptotic normality of stochastic gradient estimates.

02. Introduces a recursive algorithm for variance estimation.

03. Demonstrates the method on logistic regression and geometric quantiles.

Abstract

Stochastic gradient algorithms are increasingly studied since they can deal efficiently and online with large samples in high dimensional spaces. In this paper, we first establish a Central Limit Theorem for these estimates as well as for their averaged version in general Hilbert spaces. Moreover, since the asymptotic normality of the estimates is often unusable without an estimate of the asymptotic variance, we introduce a new recursive algorithm for estimating the latter, and we establish its almost sure rate of convergence as well as its rate of convergence in quadratic mean. Finally, two examples are given: estimating the parameters of the logistic regression and estimating geometric quantiles.

Equations

$$m:=\arg\min_{h\in H}\mathbb{E}\left[g(X,h)\right],$$

$$G(h):=\mathbb{E}\left[g(X,h)\right],$$

$$\left\langle A,B\right\rangle_{F}:=\sum_{j\in J}\left\langle A(e_{j}),B(e_{j})\right\rangle,\quad\forall A,B\in\mathcal{S}(H),$$

$$\nabla G(m)=0.$$

$$\left\|\Gamma_{h}\right\|_{op}\leq C_{A},$$

$$\left\|\nabla G(h)-\Gamma_{m}(h-m)\right\|\leq C_{\epsilon}\left\|h-m\right\|^{2}.$$

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2}\right]\leq L_{1}\left(1+\left\|h-m\right\|^{2}\right).$$

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{4}\right]\leq L_{2}\left(1+\left\|h-m\right\|^{4}\right).$$

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2q}\right]\leq L_{q}\left(1+\left\|h-m\right\|^{2q}\right).$$

$$\varphi(h):=\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right].$$

$$\lim_{h\to m}\left\|\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)\right]-\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}=0.$$

$$\left\|\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)-\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}\leq C_{\epsilon}^{\prime}\left\|h-m\right\|.$$

$$\left\|\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}\leq\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2}\right]\leq L_{1}\left(1+\left\|h-m\right\|^{2}\right).$$

$$\mathcal{B}(h,A)=\left\{h^{\prime}\in H,\ \left\|h-h^{\prime}\right\|<A\right\}.$$

$$\left\|h\otimes h^{\prime}\right\|_{F}=\left\|h\right\|\left\|h^{\prime}\right\|.$$

$$m_{n+1}=m_{n}-\gamma_{n}\nabla_{h}g\left(X_{n+1},m_{n}\right),$$

$$m_{n+1}=m_{n}-\gamma_{n}\Phi(m_{n})+\gamma_{n}\xi_{n+1},$$

$$\left\|m_{n}-m\right\|^{2}=o\left(\frac{(\ln n)^{\delta}}{n^{\alpha}}\right)\quad a.s.$$

$$\mathbb{E}\left[\left\|m_{n}-m\right\|^{2p}\right]\leq\frac{C_{p}}{n^{p\alpha}}.$$

$$\lim_{n\to\infty}\frac{1}{\sqrt{\gamma_{n}}}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,\Sigma_{RM}\right),$$

$$\Sigma_{RM}:=\int_{0}^{+\infty}e^{-s\Gamma_{m}}\Sigma^{\prime}e^{-s\Gamma_{m}}\,ds,\quad\text{and}\quad\Sigma^{\prime}:=\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)\right].$$

$$\lim_{n\to\infty}n^{\alpha/2}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,c_{\gamma}\Sigma_{RM}\right),$$

$$e^{M}=\sum_{k=0}^{\infty}\frac{1}{k!}M^{k}.$$

$$\left\|\Sigma_{RM}\right\|_{F}$$

$$\lim_{n\to\infty}\sqrt{n}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,c\Sigma^{\prime}\right).$$

$$\overline{m}_{n}=\frac{1}{n}\sum_{k=1}^{n}m_{k}.$$

$$\overline{m}_{n+1}=\overline{m}_{n}+\frac{1}{n+1}\left(m_{n+1}-\overline{m}_{n}\right).$$

$$\left\|\overline{m}_{n}-m\right\|^{2}=o\left(\frac{(\ln n)^{1+\delta}}{n}\right)\quad a.s.$$

$$\mathbb{E}\left[\left\|\overline{m}_{n}-m\right\|^{2p}\right]\leq\frac{C_{p}^{\prime}}{n^{p}}.$$

$$\lim_{n\to\infty}\sqrt{n}\left(\overline{m}_{n}-m\right)\sim\mathcal{N}\left(0,\Sigma\right),$$

Full text

Online estimation of the asymptotic variance for averaged stochastic gradient algorithms

Antoine Godichon-Baggioni

Institut de Mathématiques de Toulouse,

Université Paul Sabatier, 31000 Toulouse, France

email: [email protected]

Keywords: Stochastic Gradient Algorithm, Averaging, Central Limit Theorem, Asymptotic Variance.

1 Introduction

High dimensional and functional data analysis are fields that have grown steadily for many years. To handle these kinds of data, it is increasingly important to design methods which take into account the high dimension as well as the possibility of having large samples. In this paper, we focus on a usual stochastic optimization problem which consists in estimating

$$m:=\arg\min_{h\in H}\mathbb{E}\left[g(X,h)\right],$$

where $X$ is a random variable taking values in a space $\mathcal{X}$ and $g:\mathcal{X}\times H\longrightarrow\mathbb{R}$, where $H$ is a separable Hilbert space. A usual way to build an estimator of $m$ is to consider the solution of the problem generated by the sample, i.e. to consider $M$-estimates (see Huber and Ronchetti (2009) and Maronna et al. (2006) among others). In order to build these estimates, deterministic convex optimization algorithms (see Boyd and Vandenberghe (2004)) are often used (see Vardi and Zhang (2000) and Oja and Niinimaa (1985) in the case of the median), and these methods are very efficient in low dimensional spaces.

Nevertheless, in high dimensional spaces, this kind of method can encounter many computational problems. The main ones are that it needs to store all the data, which can be expensive in terms of memory, and that it cannot deal online with the data. In order to overcome this, stochastic gradient algorithms (Robbins and Monro (1951)) are efficient candidates since they do not need to store the data in memory, and they can be easily updated, which is crucial if the data arrive sequentially (see Duflo (1996), Duflo (1997), Kushner and Yin (2003a) or Nemirovski et al. (2009) among others). In order to improve the convergence, Ruppert (1988) and Polyak and Juditsky (1992) introduced the averaged version of the algorithm (see also Dippon and Renz (1997) for a weighted version). These algorithms have become crucial to statistics and modern machine learning (Bach and Moulines (2013), Bach (2014), Juditsky et al. (2014)). There are already many results on these algorithms in the literature, which can be split into two parts: asymptotic results, such as almost sure rates of convergence (Schwabe and Walk, 1996; Duflo, 1997; Walk, 1992; Pelletier, 1998, 2000), and non asymptotic ones, such as rates of convergence in quadratic mean (Cardot et al., 2017; Godichon-Baggioni, 2016a; Bach and Moulines, 2013; Bach, 2014; Nemirovski et al., 2009).

In a recent work, Godichon-Baggioni (2016b) introduces a new framework, with only local strong convexity assumptions, in general Hilbert spaces, which makes it possible to obtain almost sure and $L^{p}$ rates of convergence. In keeping with it, and in order to study more deeply the stochastic gradient algorithm as well as its averaged version (up to a new assumption), we first give the asymptotic normality of the estimates. Second, since a Central Limit Theorem is often unusable without an estimate of the variance, we introduce a recursive algorithm, inspired by Gahbiche and Pelletier (2000), to estimate the asymptotic variance of the averaged estimator and we establish its rates of convergence. As far as we know, there was not yet an efficient and recursive estimate of the asymptotic variance in the literature. Finally, two examples of application are given. The first, usual one consists in estimating the parameters of the logistic regression (Bach, 2014), while the second one consists in estimating geometric quantiles (see Chaudhuri (1996) and Chakraborty and Chaudhuri (2014)), which are useful robust indicators in statistics. Indeed, they are often used in data depth and outlier detection (Serfling (2006), Hallin and Paindaveine (2006)), as well as for robust estimation of the mean and variance (see Minsker et al. (2014)), or for Robust Principal Component Analysis (Gervini (2008), Kraus and Panaretos (2012), Cardot and Godichon-Baggioni (2017)).

The paper is organized as follows. Section 2 recalls the framework introduced by Godichon-Baggioni (2016b) before giving two new assumptions which make it possible to get the rate of convergence of the estimators of the asymptotic variance. In Section 3, the stochastic gradient algorithm as well as its averaged version are introduced and their asymptotic normality is given. The recursive estimator of the asymptotic variance is given in Section 4 and its almost sure as well as its quadratic mean rates of convergence are established. Applications, consisting in estimating the logistic regression parameters and in the recursive estimation of geometric quantiles, are given in Section 5 together with a short simulation study. Finally, the proofs are postponed to Section 6 and to the Appendix.

2 Assumptions

Let $H$ be a separable Hilbert space such as $\mathbb{R}^{d}$ or $L^{2}(I)$ (for some closed interval $I\subset\mathbb{R}$). We denote by $\left\langle.,.\right\rangle$ its inner product and by $\left\|.\right\|$ the associated norm. Let $X$ be a random variable taking values in a space $\mathcal{X}$, and let $G:H\longrightarrow\mathbb{R}$ be the function we would like to minimize, defined for all $h\in H$ by

$$G(h):=\mathbb{E}\left[g(X,h)\right],$$

where $g:\mathcal{X}\times H\longrightarrow\mathbb{R}$ . Moreover, let us suppose that the functional $G$ is convex. Finally, let us introduce the space of linear operators on $H$ , denoted by $\mathcal{S}(H)$ , equipped with the Frobenius (or Hilbert-Schmidt) inner product, which is defined by

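$$\left\langle A,B\right\rangle_{F}:=\sum_{j\in J}\left\langle A(e_{j}),B(e_{j})\right\rangle,\quad\forall A,B\in\mathcal{S}(H),$$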

where $\left(e_{j}\right)_{j\in J}$ is an orthonormal basis of $H$. We denote by $\left\|.\right\|_{F}$ the associated norm, and $\mathcal{S}(H)$ is then a separable Hilbert space. Let us recall the framework introduced by Godichon-Baggioni (2016b):

(A1) The functional $g$ is Fréchet-differentiable with respect to the second variable almost everywhere. Moreover, $G$ is differentiable and there exists $m\in H$ such that

$$\nabla G(m)=0.$$

(A2) The functional $G$ is twice continuously differentiable almost everywhere and for every positive constant $A$, there is a positive constant $C_{A}$ such that for all $h\in\mathcal{B}\left(m,A\right)$,

$$\left\|\Gamma_{h}\right\|_{op}\leq C_{A},$$

where $\Gamma_{h}$ is the Hessian of the functional $G$ at $h$ and $\left\|.\right\|_{op}$ is the usual spectral norm for linear operators.

(A3) There exists a positive constant $\epsilon$ such that for all $h\in\mathcal{B}\left(m,\epsilon\right)$, there is an orthonormal basis of $H$ composed of eigenvectors of $\Gamma_{h}$. Moreover, denoting by $\lambda_{\min}$ the limit inferior of the eigenvalues of $\Gamma_{m}$, $\lambda_{\min}$ is positive. Finally, for all $h\in\mathcal{B}\left(m,\epsilon\right)$ and for every eigenvalue $\lambda_{h}$ of $\Gamma_{h}$, we have $\lambda_{h}\geq\frac{\lambda_{\min}}{2}>0$.

(A4) There are positive constants $\epsilon,C_{\epsilon}$ such that for all $h\in\mathcal{B}\left(m,\epsilon\right)$,

$$\left\|\nabla G(h)-\Gamma_{m}(h-m)\right\|\leq C_{\epsilon}\left\|h-m\right\|^{2}.$$

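(A5)

(a) There is a positive constant $L_{1}$ such that for all $h\in H$,

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2}\right]\leq L_{1}\left(1+\left\|h-m\right\|^{2}\right).$$

(a’) There is a positive constant $L_{2}$ such that for all $h\in H$,

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{4}\right]\leq L_{2}\left(1+\left\|h-m\right\|^{4}\right).$$

(b) For every integer $q$, there is a positive constant $L_{q}$ such that for all $h\in H$,

$$\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2q}\right]\leq L_{q}\left(1+\left\|h-m\right\|^{2q}\right).$$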

Let us now comment on these assumptions. First, Assumption (A1) ensures the existence of a solution and makes it possible to use a stochastic gradient descent, while (A2) gives some smoothness properties of the objective function. Assumption (A3) ensures the uniqueness of the minimizer of $G$, and (A4), (A5) give bounds on the gradient and on the remainder term of its Taylor expansion. The main difference between this framework and the usual one for strongly convex objectives is that we only assume the local strong convexity of the objective function and, in return, the $p$-th moments of the gradient of the functional $g$ have to be bounded. Note also that the Hessian of the functional $G$ is not supposed to be compact, so that its smallest eigenvalue does not necessarily converge to $0$ when the dimension tends to infinity (a counter example is given in Section 5). Assumptions (A1) to (A5b) are discussed in depth in Godichon-Baggioni (2016b). Let us now introduce two new assumptions.

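(A6) Let $\varphi:H\longrightarrow\mathcal{S}(H)$ be the functional defined for all $h\in H$ by

$$\varphi(h):=\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right].$$

(a) The functional $\varphi$ is continuous at $m$ with respect to the Frobenius norm:

$$\lim_{h\to m}\left\|\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)\right]-\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}=0.$$

(b) The functional $\varphi$ is locally Lipschitz on a neighborhood of $m$: there are positive constants $\epsilon,C_{\epsilon}^{\prime}$ such that for all $h\in\mathcal{B}\left(m,\epsilon\right)$,

$$\left\|\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)-\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}\leq C_{\epsilon}^{\prime}\left\|h-m\right\|.$$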

Assumption (A6a) makes it possible to establish the asymptotic normality of the stochastic gradient descent as well as of its averaged version. Note that under (A5a), the functional $\varphi$ is bounded, and more precisely

$$\left\|\mathbb{E}\left[\nabla_{h}g(X,h)\otimes\nabla_{h}g(X,h)\right]\right\|_{F}\leq\mathbb{E}\left[\left\|\nabla_{h}g(X,h)\right\|^{2}\right]\leq L_{1}\left(1+\left\|h-m\right\|^{2}\right).$$

Assumption (A6b) can be verified by bounding, on a neighborhood of $m$, the derivative of the functional $\varphi$. This last assumption makes it possible to give the rate of convergence of the estimators of the asymptotic variance. An example is given for the special case of the geometric median in the Appendix.

Remark 2.1.

For all $h\in H$ and $A>0$ ,

$$\mathcal{B}(h,A)=\left\{h^{\prime}\in H,\ \left\|h-h^{\prime}\right\|<A\right\}.$$

Remark 2.2.

Let $h,h^{\prime}\in H$. The linear operator $h\otimes h^{\prime}:H\longrightarrow H$ is defined for all $h^{\prime\prime}\in H$ by $h\otimes h^{\prime}(h^{\prime\prime}):=\left\langle h,h^{\prime\prime}\right\rangle h^{\prime}$. Moreover,

$$\left\|h\otimes h^{\prime}\right\|_{F}=\left\|h\right\|\left\|h^{\prime}\right\|.$$
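As an illustration of Remark 2.2 and of the Frobenius inner product, here is a minimal finite dimensional sketch. It assumes $H=\mathbb{R}^{d}$ with the Euclidean inner product and the canonical basis; the helper names (`inner`, `tensor`, `apply`, `frobenius_inner`, `frobenius_norm`) are ours and purely illustrative.

```python
import math

def inner(u, v):
    """Euclidean inner product on R^d (standing in for <., .> on H)."""
    return sum(a * b for a, b in zip(u, v))

def tensor(h, h_prime):
    """Matrix of h (x) h', the operator mapping h'' to <h, h''> h'.

    Entry (i, j) is h'_i * h_j, so row i applied to h'' gives h'_i * <h, h''>.
    """
    return [[hp_i * h_j for h_j in h] for hp_i in h_prime]

def apply(op, v):
    """Apply a matrix (list of rows) to a vector."""
    return [inner(row, v) for row in op]

def frobenius_inner(a, b):
    """<A, B>_F = sum_j <A e_j, B e_j>, computed entrywise in the canonical basis."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def frobenius_norm(a):
    """Norm associated with the Frobenius inner product."""
    return math.sqrt(frobenius_inner(a, a))

h = [3.0, 4.0, 0.0]        # ||h|| = 5
h_prime = [1.0, 2.0, 2.0]  # ||h'|| = 3
op = tensor(h, h_prime)

# The identity ||h (x) h'||_F = ||h|| ||h'|| can be checked numerically.
lhs = frobenius_norm(op)
rhs = math.sqrt(inner(h, h)) * math.sqrt(inner(h_prime, h_prime))
```

In this example both `lhs` and `rhs` equal 15, and `apply(op, [1.0, 1.0, 1.0])` returns $7h^{\prime}$ since $\left\langle h,(1,1,1)\right\rangle=7$.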

3 The stochastic gradient algorithm and its averaged version

3.1 The Robbins-Monro algorithm

In what follows, let $X_{1},...,X_{n}$ be independent random variables with the same law as $X$ . The stochastic gradient algorithm is defined recursively for all $n\geq 1$ by

$$m_{n+1}=m_{n}-\gamma_{n}\nabla_{h}g\left(X_{n+1},m_{n}\right),$$

with $m_{1}$ bounded, where $\left(\gamma_{n}\right)$ is a step sequence of the form $\gamma_{n}:=c_{\gamma}n^{-\alpha}$, with $c_{\gamma}>0$ and $\alpha\in\left(\frac{1}{2},1\right)$. Moreover, let $\left(\mathcal{F}_{n}\right)_{n\geq 1}$ be the sequence of $\sigma$-algebras defined for all $n\geq 1$ by $\mathcal{F}_{n}:=\sigma\left(X_{1},...,X_{n}\right)$. Then, the algorithm can be considered as a noisy (or stochastic) gradient algorithm since it can be written as

$$m_{n+1}=m_{n}-\gamma_{n}\Phi(m_{n})+\gamma_{n}\xi_{n+1},$$

where $\Phi\left(m_{n}\right):=\nabla G\left(m_{n}\right)$, and $\left(\xi_{n}\right)$, defined for all $n\geq 1$ by $\xi_{n+1}:=\Phi\left(m_{n}\right)-\nabla_{h}g\left(X_{n+1},m_{n}\right)$, is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{n}\right)$. Finally, note that under assumptions (A1) to (A5a), it was proven in Godichon-Baggioni (2016b) that for every positive constant $\delta$,

$$\left\|m_{n}-m\right\|^{2}=o\left(\frac{(\ln n)^{\delta}}{n^{\alpha}}\right)\quad a.s.$$

Moreover, assuming that (A5b) is also fulfilled, for every positive integer $p$, there is a constant $C_{p}$ such that for all $n\geq 1$,

$$\mathbb{E}\left[\left\|m_{n}-m\right\|^{2p}\right]\leq\frac{C_{p}}{n^{p\alpha}}.$$

In order to get a deeper study of this estimate, we now give its asymptotic normality.

Theorem 3.1.

Suppose assumptions (A1) to (A5a’) and (A6a) hold. Then, we have the convergence in law

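$$\lim_{n\to\infty}\frac{1}{\sqrt{\gamma_{n}}}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,\Sigma_{RM}\right),$$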

with

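$$\Sigma_{RM}:=\int_{0}^{+\infty}e^{-s\Gamma_{m}}\Sigma^{\prime}e^{-s\Gamma_{m}}\,ds,\quad\text{and}\quad\Sigma^{\prime}:=\mathbb{E}\left[\nabla_{h}g(X,m)\otimes\nabla_{h}g(X,m)\right].$$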

The proof is given in the Appendix. Note that the variance $\Sigma_{RM}$ does not depend on the step sequence $\left(\gamma_{n}\right)$, but Theorem 3.1 could be written as

$$\lim_{n\to\infty}n^{\alpha/2}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,c_{\gamma}\Sigma_{RM}\right),$$
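The passage between the two normalizations only uses the form of the step sequence. As a short check, with $\gamma_{n}=c_{\gamma}n^{-\alpha}$,

$$\frac{1}{\sqrt{\gamma_{n}}}\left(m_{n}-m\right)=\frac{n^{\alpha/2}}{\sqrt{c_{\gamma}}}\left(m_{n}-m\right)\quad\Longrightarrow\quad n^{\alpha/2}\left(m_{n}-m\right)=\sqrt{c_{\gamma}}\,\frac{1}{\sqrt{\gamma_{n}}}\left(m_{n}-m\right),$$

and multiplying a sequence converging in law to $\mathcal{N}\left(0,\Sigma_{RM}\right)$ by the constant $\sqrt{c_{\gamma}}$ multiplies the limit covariance by $c_{\gamma}$.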

Remark 3.1.

Let $M$ be a square matrix. Then $e^{M}$ is defined by (see Horn and Johnson (2012) among others)

$$e^{M}=\sum_{k=0}^{\infty}\frac{1}{k!}M^{k}.$$

Thanks to assumptions (A2),(A3), $0<\lambda_{\min}\left(\Gamma_{m}\right)\leq\lambda_{\max}\left(\Gamma_{m}\right)<\infty$ , while under (A5a) and by dominated convergence,

[TABLE]

so that $\Sigma_{RM}$ is well defined.
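To make the integral defining $\Sigma_{RM}$ more concrete, here is a sketch of its expression in an eigenbasis. It assumes, as allowed by (A3) taken at $h=m$, an orthonormal basis $\left(e_{j}\right)_{j\in J}$ of eigenvectors of $\Gamma_{m}$ with eigenvalues $\lambda_{j}\geq\frac{\lambda_{\min}}{2}$, and that the integral and the inner product can be interchanged. Since $e^{-s\Gamma_{m}}e_{j}=e^{-s\lambda_{j}}e_{j}$ and $e^{-s\Gamma_{m}}$ is self-adjoint,

$$\left\langle\Sigma_{RM}e_{i},e_{j}\right\rangle=\int_{0}^{+\infty}e^{-s\left(\lambda_{i}+\lambda_{j}\right)}\,ds\,\left\langle\Sigma^{\prime}e_{i},e_{j}\right\rangle=\frac{\left\langle\Sigma^{\prime}e_{i},e_{j}\right\rangle}{\lambda_{i}+\lambda_{j}}.$$

Since $\lambda_{i}+\lambda_{j}\geq\lambda_{\min}$, summing the squares over $i,j$ gives $\left\|\Sigma_{RM}\right\|_{F}\leq\frac{1}{\lambda_{\min}}\left\|\Sigma^{\prime}\right\|_{F}$, which is finite under (A5a). The same computation shows that $\Sigma_{RM}$ solves $\Gamma_{m}\Sigma_{RM}+\Sigma_{RM}\Gamma_{m}=\Sigma^{\prime}$.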

Remark 3.2.

Note that analogous results are given by Fabian (1968) and Pelletier (1998) in the particular case of finite dimensional spaces while, for analogous results in Banach and Hilbert spaces, one can also see Walk (1992), Ljung et al. (2012) and Kushner and Yin (2003b).

Remark 3.3.

Note that taking a step sequence of the form $\gamma_{n}=\frac{c}{n}$ with $c>\frac{2}{\lambda_{\min}}$ is possible, and one can obtain the following asymptotic normality (see Pelletier (2000) among others for the case of finite dimensional spaces)

$$\lim_{n\to\infty}\sqrt{n}\left(m_{n}-m\right)\sim\mathcal{N}\left(0,c\Sigma^{\prime}\right).$$

Nevertheless, not only does this require some information on the Hessian $\Gamma_{m}$, but $c\Sigma^{\prime}$ is also not the optimal variance (see Duflo (1997) and Pelletier (2000) for instance).

3.2 The averaged algorithm

As mentioned in Remark 3.3, the parametric rate of convergence ($O\left(\frac{1}{n}\right)$) can be obtained with the Robbins-Monro algorithm through a good choice of the step sequence $\left(\gamma_{n}\right)$. Nevertheless, this choice is often complicated and the resulting asymptotic variance is not optimal. In order to improve the convergence, let us now introduce the averaged algorithm (see Ruppert (1988) and Polyak and Juditsky (1992)) defined for all $n\geq 1$ by

$$\overline{m}_{n}=\frac{1}{n}\sum_{k=1}^{n}m_{k}.$$

This can be written recursively for all $n\geq 1$ as

$$\overline{m}_{n+1}=\overline{m}_{n}+\frac{1}{n+1}\left(m_{n+1}-\overline{m}_{n}\right).$$
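The two recursions (the Robbins-Monro step and the online average) can be sketched together as follows. This is only an illustration under stated assumptions: $H=\mathbb{R}^{d}$, the step sequence $\gamma_{n}=c_{\gamma}n^{-\alpha}$ with the illustrative defaults $c_{\gamma}=1$ and $\alpha=2/3$, and a toy loss $g(x,h)=\frac{1}{2}\left\|x-h\right\|^{2}$ whose minimizer is $\mathbb{E}[X]$. The function names `averaged_sgd` and `quadratic_grad` are hypothetical.

```python
import random

def averaged_sgd(grad, samples, m1, c_gamma=1.0, alpha=2.0 / 3.0):
    """Run m_{n+1} = m_n - gamma_n * grad(X_{n+1}, m_n) and average online.

    Returns (m_n, mbar_n) after consuming all samples; only the current
    iterate and the current average are kept in memory.
    """
    m = list(m1)
    mbar = list(m1)              # mbar_1 = m_1
    n = 1
    for x in samples:            # x plays the role of X_{n+1}
        gamma = c_gamma * n ** (-alpha)
        g = grad(x, m)
        m = [mi - gamma * gi for mi, gi in zip(m, g)]
        # mbar_{n+1} = mbar_n + (m_{n+1} - mbar_n) / (n + 1)
        mbar = [bi + (mi - bi) / (n + 1) for bi, mi in zip(mbar, m)]
        n += 1
    return m, mbar

def quadratic_grad(x, h):
    """Gradient in h of the illustrative loss g(x, h) = ||x - h||^2 / 2."""
    return [hi - xi for hi, xi in zip(h, x)]

# Toy data with mean (1, -2): the averaged iterate should end up close to it.
rng = random.Random(0)
data = [[rng.gauss(1.0, 1.0), rng.gauss(-2.0, 1.0)] for _ in range(20000)]
m_last, m_avg = averaged_sgd(quadratic_grad, data, m1=[0.0, 0.0])
```

Any other gradient $\nabla_{h}g$ can be passed as `grad`; the averaging step itself does not depend on the loss.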

It was proven in Godichon-Baggioni (2016b) that under assumptions (A1) to (A5a), for all $\delta>0$,

$$\left\|\overline{m}_{n}-m\right\|^{2}=o\left(\frac{(\ln n)^{1+\delta}}{n}\right)\quad a.s.$$

If assumption (A5b) is also fulfilled, then for every positive integer $p$, there is a positive constant $C_{p}^{\prime}$ such that for all $n\geq 1$,

$$\mathbb{E}\left[\left\|\overline{m}_{n}-m\right\|^{2p}\right]\leq\frac{C_{p}^{\prime}}{n^{p}}.$$

Finally, in order to have a deeper study of this estimate, we now give its asymptotic normality.

Theorem 3.2.

Suppose assumptions (A1) to (A5a’) and (A6a) are verified. Then, we have the convergence in law

$$\lim_{n\to\infty}\sqrt{n}\left(\overline{m}_{n}-m\right)\sim\mathcal{N}\left(0,\Sigma\right),$$

with $\Sigma:=\Gamma_{m}^{-1}\Sigma^{\prime}\Gamma_{m}^{-1}$ , and $\Sigma^{\prime}:=\mathbb{E}\left[\nabla_{h}g\left(X,m\right)\otimes\nabla_{h}g\left(X,m\right)\right]$ .

The proof is given in Section 6. For analogous results, one can also see Schwabe and Walk (1996), Pelletier (2000) and Dippon and Walk (2006).

4 Recursive estimation of the asymptotic variance

4.1 Some existing estimators

A first naive method to estimate the asymptotic variance could be to estimate the Hessian $\Gamma_{m}$ and the variance $\Sigma^{\prime}$ as follows

[TABLE]

but the main problem is that under assumptions (A2), (A3) and (A5a), if $H$ is an infinite dimensional space, then

[TABLE]

Another problem is that a recursive version of this estimator requires inverting a matrix at each iteration, which is computationally expensive in high dimensional spaces. A second estimator of the asymptotic variance was introduced in Pelletier (2000), defined for all $n\geq 1$ by

[TABLE]

and under (A1) to (A6b),

[TABLE]

Thus, this estimator faces two main problems: it is not recursive and it converges very slowly. Finally, in order to solve the second problem, a faster algorithm was introduced by Gahbiche and Pelletier (2000), defined for all $n\geq 1$ by

[TABLE]

with $(1+\alpha)/2<s<1$, $\mu\geq 0$ and $s/2<\delta<(1+s)/2$. This algorithm is first based on a usual decomposition of the stochastic gradient algorithm (see equation (18)), which exhibits a martingale term that carries the convergence rate (see equation (27)). Second, the objective is to find step sequences which improve the rate of convergence of the variance estimate (see Gahbiche and Pelletier (2000) for technical details on the assumptions on the step sequences). In the case of finite dimensional spaces, the following convergence in probability is given (under some assumptions)

[TABLE]

with $c>0$ . A first technical problem is that only the convergence in probability is given, in the case of finite dimensional spaces, and for the usual spectral norm. A second one is that it is not recursive and it cannot be easily updated.

4.2 A recursive and fast estimate

We now give a recursive version of the algorithm defined by (11) to estimate the asymptotic variance in separable Hilbert spaces, before establishing its rates of convergence (almost sure and in quadratic mean). This algorithm is defined by

[TABLE]

with

[TABLE]

The difference with the previous algorithm is the replacement of $\overline{m}_{n}$ by $\overline{m}_{j}$, which enables the estimates to be written recursively for all $n\geq 1$ as

[TABLE]

with $V_{1}=\Sigma_{1}=0$. Thus, contrary to the previous algorithms, this one does not need to store all the estimates in memory and can be easily updated. Finally, the following theorem ensures that it is quite fast.

Theorem 4.1.

Suppose assumptions (A1) to (A5a’) and (A6b) hold. Then, the sequence $\left(\Sigma_{n}\right)$ defined by (12) satisfies, for every positive constant $\gamma$,

[TABLE]

Moreover, if (A5b) also holds, there is a positive constant $C$ such that for all $n\geq 1$,

[TABLE]

The proof is given in Section 6.

Corollary 4.1.

Suppose assumptions (A1) to (A5a’) and (A6b) hold. Then, for every positive constant $\gamma$,

[TABLE]

Moreover, if (A5b) also holds, there is a positive constant $C$ such that for all $n\geq 1$,

[TABLE]

Remark 4.1.

The constant $C$ in Theorem 4.1 depends on the constants introduced in assumptions, on the initialization of the stochastic gradient descent, and on $\alpha,\delta,\mu,s,c_{\gamma}$ .

Remark 4.2.

Estimating the asymptotic variance recursively, coupled with Theorem 3.2, can be useful to build online asymptotic confidence balls. Moreover, in the recent literature, non asymptotic convergence rates are often given in the form

[TABLE]

where $R_{n}$ is a remainder term. Then, using the recursive variance estimates could give, in practice, a precise bound on the quadratic mean error and, in the short term, could make it possible to get precise non asymptotic confidence balls.
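As a simple illustration of how a variance estimate can be combined with Theorem 3.2, here is a sketch of a coordinate-wise asymptotic confidence interval (a simpler object than a confidence ball). It assumes that $\sqrt{n}\left\langle\overline{m}_{n}-m,e_{j}\right\rangle$ is approximately $\mathcal{N}\left(0,\Sigma_{jj}\right)$ with $\Sigma_{jj}=\left\langle\Sigma e_{j},e_{j}\right\rangle$, and that `sigma_jj` is some consistent estimate of $\Sigma_{jj}$, however it was obtained. The function name `coordinate_interval` and the default quantile `z=1.96` (approximately the 97.5% standard normal quantile) are illustrative choices.

```python
import math

def coordinate_interval(mbar_j, sigma_jj, n, z=1.96):
    """Asymptotic interval mbar_j +/- z * sqrt(sigma_jj / n) for one coordinate.

    mbar_j   : j-th coordinate of the averaged estimate after n observations
    sigma_jj : estimate of the j-th diagonal entry of the asymptotic variance
    z        : standard normal quantile (1.96 gives roughly 95% coverage)
    """
    half_width = z * math.sqrt(sigma_jj / n)
    return mbar_j - half_width, mbar_j + half_width

# Example: estimate 0.0, variance estimate 4.0, 100 observations.
lower, upper = coordinate_interval(0.0, 4.0, 100)
```

The half width shrinks like $1/\sqrt{n}$, in line with the parametric rate of the averaged estimator.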

Remark 4.3.

In order to get a faster algorithm (in terms of computational time), one can consider a parallelized version of the previous estimates. This consists in splitting the sample into $p$ parts and running the algorithm on each subsample to get $p$ estimates $\Sigma_{n/p,i}$, before taking the mean of these $p$ estimates.
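The splitting and averaging step of Remark 4.3 can be sketched as follows. The sketch is generic: `estimator` stands for any function returning a matrix estimate from a subsample, and the stand-in `second_moment` used here is purely hypothetical (it is not the recursive estimator of this section). The subsamples are processed sequentially in this sketch; since the calls are independent, they could be dispatched to separate workers.

```python
def split_into_parts(samples, p):
    """Split a list into p contiguous parts of (almost) equal length."""
    n = len(samples)
    bounds = [(i * n) // p for i in range(p + 1)]
    return [samples[bounds[i]:bounds[i + 1]] for i in range(p)]

def parallelized_estimate(estimator, samples, p):
    """Average the p matrices returned by `estimator` on each subsample."""
    parts = split_into_parts(samples, p)
    estimates = [estimator(part) for part in parts]  # independent calls
    rows, cols = len(estimates[0]), len(estimates[0][0])
    return [[sum(e[i][j] for e in estimates) / p for j in range(cols)]
            for i in range(rows)]

def second_moment(part):
    """Hypothetical stand-in estimator: empirical second-moment matrix."""
    d = len(part[0])
    return [[sum(x[i] * x[j] for x in part) / len(part) for j in range(d)]
            for i in range(d)]

sample = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
averaged = parallelized_estimate(second_moment, sample, p=2)
```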

5 Applications

5.1 Application to the logistic regression

Let $d$ be a positive integer, and let $Y\in\left\{-1,1\right\}$ and $X\in\mathbb{R}^{d}$ be random variables. In order to get the parameter $m^{l}\in\mathbb{R}^{d}$ of the logistic regression, the aim is to minimize the functional $G_{l}$ defined for all $h\in\mathbb{R}^{d}$ by

[TABLE]

Under usual assumptions (see Bach (2014) among others), the functional $G_{l}$ is locally strongly convex and twice Fréchet-differentiable with, for all $h\in\mathbb{R}^{d}$,

[TABLE]

Then, the parameters of the logistic regression and the asymptotic variance can be estimated simultaneously as:

[TABLE]
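For the stochastic gradient part, a minimal sketch of one update is given below. It assumes the usual logistic loss $g\left((x,y),h\right)=\log\left(1+\exp\left(-y\left\langle x,h\right\rangle\right)\right)$ for labels $y\in\left\{-1,1\right\}$, whose gradient in $h$ is $-\frac{y\,x}{1+\exp\left(y\left\langle x,h\right\rangle\right)}$, and the step sequence $\gamma_{n}=c_{\gamma}n^{-\alpha}$. The function names and default constants are illustrative, and the variance recursion is not reproduced here.

```python
import math

def logistic_grad(x, y, h):
    """Gradient in h of log(1 + exp(-y <x, h>)) for a label y in {-1, 1}."""
    margin = y * sum(xi * hi for xi, hi in zip(x, h))
    # 1 / (1 + exp(margin)), written so that exp never receives a large
    # positive argument (avoids OverflowError for large |margin|).
    if margin >= 0:
        e = math.exp(-margin)
        weight = -y * e / (1.0 + e)
    else:
        weight = -y / (1.0 + math.exp(margin))
    return [weight * xi for xi in x]

def sgd_step(m, x, y, n, c_gamma=1.0, alpha=2.0 / 3.0):
    """One update m_{n+1} = m_n - gamma_n * gradient, with gamma_n = c_gamma * n^(-alpha)."""
    gamma = c_gamma * n ** (-alpha)
    g = logistic_grad(x, y, m)
    return [mi - gamma * gi for mi, gi in zip(m, g)]

# At h = 0 the gradient is -y * x / 2.
g0 = logistic_grad([1.0, 2.0], 1, [0.0, 0.0])
m2 = sgd_step([0.0, 0.0], [1.0, 2.0], 1, n=1)
```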

5.2 Application to the geometric median and geometric quantiles

Let $H$ be a separable Hilbert space and let $X$ be a random variable taking values in $H$. Let $v\in H$ be such that $\left\|v\right\|<1$. The geometric quantile $m^{v}$ corresponding to the direction $v$ (see Chaudhuri (1996)) is defined by

[TABLE]

and, in particular, the geometric median $m$ (see Haldane (1948)) corresponds to the case where $v=0$. Under usual assumptions (see Kemperman (1987) and Cardot et al. (2013) among others), the functional $G_{v}$ is locally strongly convex and twice Fréchet-differentiable with, for all $h\in H$,

[TABLE]

Then, it is possible to estimate simultaneously and recursively the geometric quantile $m^{v}$ as well as the asymptotic variance of the averaged estimator as follows:

[TABLE]

Note that under usual assumptions, the asymptotic variance obtained is the same as the one obtained with non-recursive estimates (Maronna et al., 2006; Gervini, 2008) in the special case of the geometric median.
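For the special case of the geometric median ($v=0$), the averaged stochastic gradient estimate can be sketched as follows. The sketch assumes $H=\mathbb{R}^{d}$ and the standard gradient $\nabla_{h}g(x,h)=-\frac{x-h}{\left\|x-h\right\|}$ of $h\mapsto\left\|x-h\right\|$; the names `median_grad`, `averaged_median` and `uniform_on_sphere`, the initial point and the constants are illustrative, and the variance recursion is not reproduced.

```python
import math
import random

def median_grad(x, h):
    """Gradient in h of ||x - h||, i.e. -(x - h) / ||x - h||."""
    diff = [xi - hi for xi, hi in zip(x, h)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm == 0.0:              # gradient undefined at h = x: skip the update
        return [0.0] * len(h)
    return [-c / norm for c in diff]

def averaged_median(samples, m1, c_gamma=1.0, alpha=2.0 / 3.0):
    """Averaged stochastic gradient estimate of the geometric median."""
    m, mbar = list(m1), list(m1)
    for n, x in enumerate(samples, start=1):
        gamma = c_gamma * n ** (-alpha)
        g = median_grad(x, m)
        m = [mi - gamma * gi for mi, gi in zip(m, g)]
        mbar = [bi + (mi - bi) / (n + 1) for bi, mi in zip(mbar, m)]
    return mbar

def uniform_on_sphere(rng, d):
    """Draw a point uniformly on the unit sphere by normalizing a Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    r = math.sqrt(sum(c * c for c in z))
    return [c / r for c in z]

# Data symmetric around the origin: the geometric median is 0.
rng = random.Random(1)
data = [uniform_on_sphere(rng, 3) for _ in range(5000)]
m_avg = averaged_median(data, m1=[0.5, 0.0, 0.0])
```

With this symmetric sample, the norm of `m_avg` should be small compared with the unit radius of the sphere.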

5.3 A short simulation study

We focus here on the estimation of the geometric median. We consider from now on that $X$ is a random variable taking values in $\mathbb{R}^{d}$, with $d\geq 3$, and following a uniform law on the unit sphere $\mathcal{S}^{d}$. Then, the geometric median $m$ is equal to $0$ and the Hessian of the functional $G_{0}$ at $m$ verifies

[TABLE]

Note that assumptions (A1) and (A6b) are then verified (see Section 3 in Godichon-Baggioni (2016b), Lemma A.1 in Godichon-Baggioni et al. (2017) and the Appendix). Finally, the asymptotic variances of the stochastic gradient estimate and of its averaged version satisfy

[TABLE]
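A sketch of the computation behind these variances, assuming the standard expressions of the gradient and of the Hessian for the geometric median, namely $\nabla_{h}g(X,h)=-\frac{X-h}{\left\|X-h\right\|}$ and $\Gamma_{h}=\mathbb{E}\left[\frac{1}{\left\|X-h\right\|}\left(I_{d}-\frac{(X-h)\otimes(X-h)}{\left\|X-h\right\|^{2}}\right)\right]$. Since $\left\|X\right\|=1$ and, by symmetry, $\mathbb{E}\left[X\otimes X\right]=\frac{1}{d}I_{d}$, taking $h=m=0$ gives

$$\Sigma^{\prime}=\mathbb{E}\left[X\otimes X\right]=\frac{1}{d}I_{d},\qquad\Gamma_{m}=I_{d}-\mathbb{E}\left[X\otimes X\right]=\frac{d-1}{d}I_{d}.$$

Plugging these into the definitions of $\Sigma_{RM}$ and $\Sigma$ then yields

$$\Sigma_{RM}=\int_{0}^{+\infty}e^{-2s\frac{d-1}{d}}\,ds\,\frac{1}{d}I_{d}=\frac{1}{2(d-1)}I_{d},\qquad\Sigma=\Gamma_{m}^{-1}\Sigma^{\prime}\Gamma_{m}^{-1}=\frac{d}{(d-1)^{2}}I_{d}.$$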

First, let us consider a step sequence $\gamma_{n}=n^{-2/3}$ and let us study the quality of the Gaussian approximation of $Q_{n},Q_{n}^{\prime}$, where

[TABLE]

Figure 1 (respectively Figure 2) seems to confirm Theorem 3.1 (respectively Theorem 3.2): the estimated density of a component of $Q_{n}$ (respectively $Q_{n}^{\prime}$) is close to the density of $\mathcal{N}\left(0,1\right)$, even for small sample sizes ($n=200$), which is also confirmed by a Kolmogorov-Smirnov test.

In Figure 3, we consider the evolution of the quadratic mean error, with respect to the Frobenius norm, of the estimates $\left(\Sigma_{n}\right)$ of $\Sigma$ defined by (12), as a function of the sample size. For this, we generate $100$ samples and use the parallelized version of the algorithms. Figure 3 tends to confirm that for low dimensional spaces ($d=10$), the estimates of the asymptotic variance converge quite quickly, and that this is still the case for moderate dimensional spaces ($d=5000$).

6 Proofs

6.1 Some decompositions of the algorithms

In order to simplify the proofs, let us now give some decompositions of the algorithms.

6.1.1 The Robbins-Monro algorithm

Let us recall that the stochastic gradient algorithm can be written as

[TABLE]

Linearizing the gradient gives

[TABLE]

where $\delta_{n}:=\Gamma_{m}\left(m_{n}-m\right)-\Phi\left(m_{n}\right)$ is the remainder term in the Taylor expansion of the gradient. Thanks to the previous decomposition and by induction (see Duflo (1996) or Duflo (1997) for instance), one can check that for all $n\geq 1$,

[TABLE]

with $\beta_{n}:=\prod_{k=1}^{n}\left(I_{H}-\gamma_{k}\Gamma_{m}\right)$ for all $n\geq 1$ and $\beta_{0}:=I_{H}$. Finally, the asymptotic variance can be seen as the almost sure limit of the sequence of random variables $\left(\Gamma_{m}^{-1}\xi_{n}\otimes\Gamma_{m}^{-1}\xi_{n}\right)_{n}$ (see the proof of Theorem 3.2). Then, in order to prove the convergence of the estimates, we need to exhibit this sequence. To this end, one can rewrite equation (16) as

[TABLE]

with

[TABLE]

6.1.2 The averaged algorithm

Summing equalities (18) and dividing by $n$ , we obtain the following decomposition of the averaged estimator

[TABLE]

Finally, by linearity and applying an Abel transform to the first term on the right-hand side of the previous equality (see Delyon and Juditsky (1992) or Delyon and Juditsky (1993) for instance),

[TABLE]

6.1.3 The recursive estimator of the asymptotic variance

In order to simplify the proof of Theorem 4.1, we introduce a new estimator of the variance. To this end, let us introduce the sequences $\left(a_{n}\right)_{n\geq 1}$ and $\left(b_{n}\right)_{n\geq 1}$ defined for all $n\geq 1$ by $a_{n}:=\exp\left(\frac{n^{1-s}}{2(1-s)}\right)$ and $b_{n}:=\sum_{k=1}^{n}a_{k}^{2}$. Then, thanks to decomposition (18), let

[TABLE]

In order to simplify several proofs, we now give $L^{p}$ upper bounds of the terms on the right-hand side of the previous equality.

Lemma 6.1.

Suppose assumptions (A1) to (A5b) hold. Then, for all positive integer $p$ ,

[TABLE]

The proof of this lemma, as well as an analogous lemma which gives the asymptotic almost sure behavior of these terms, is given in the Appendix. We can now introduce the following estimator

[TABLE]

and one can decompose $\Sigma_{n}$ as follows:

[TABLE]

6.2 Proof of Theorem 3.2

Proof of Theorem 3.2.

Let us recall that the averaged algorithm can be written as

[TABLE]

It is proven in Godichon-Baggioni (2016b) that

[TABLE]

In order to get the asymptotic normality of the martingale term $\left(\frac{1}{n}\sum_{k=1}^{n}\xi_{k+1}\right)$, let us check that the assumptions of Theorem 5.1 in Jakubowski (1988) are fulfilled, i.e., letting $\left(e_{i}\right)_{i\in I}$ be an orthonormal basis of $H$ and $\psi_{i,j}:=\left\langle\Sigma^{\prime}e_{i},e_{j}\right\rangle$ for all $i,j\in I$, we have to verify

[TABLE]

Proof of (23). Let $\eta>0$. Applying Markov’s inequality,

[TABLE]

Then, applying Lemma H.1, there is a positive constant $C$ such that

[TABLE]

Proof of (24). First, note that

[TABLE]

with $\epsilon_{k+1}:=\xi_{k+1}\otimes\xi_{k+1}-\mathbb{E}\left[\xi_{k+1}\otimes\xi_{k+1}|\mathcal{F}_{k}\right]$ . Remark that $\left(\epsilon_{n}\right)$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{n}\right)$ , and one can check that

[TABLE]

Let us now prove that the sequence of operators $\left(\mathbb{E}\left[\xi_{k+1}\otimes\xi_{k+1}|\mathcal{F}_{k}\right]\right)$ converges almost surely to $\Sigma^{\prime}$ , with respect to the Frobenius norm. Note that

[TABLE]

Then, thanks to assumption (A6a), since $\left\|\Phi(m_{k})\right\|\leq C\left\|m_{k}-m\right\|$ and since $\left(m_{k}\right)$ converges to $m$ almost surely (see Godichon-Baggioni (2016b)),

[TABLE]

In a particular case, for all $i,j\in I$ ,

[TABLE]

Thus, applying Toeplitz’s lemma,

[TABLE]
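The Toeplitz step can be illustrated numerically in its simplest form, the Cesàro case: if $u_{k}\to u$, then $\frac{1}{n}\sum_{k=1}^{n}u_{k}\to u$. The sketch below is only an illustration; the sequence $u_{k}=2+1/k$ and the helper name `cesaro_means` are hypothetical choices made here, not objects from the proof.

```python
def cesaro_means(seq):
    """Return the running averages (1/n) * sum_{k<=n} seq[k] for n = 1, 2, ..."""
    total = 0.0
    means = []
    for n, value in enumerate(seq, start=1):
        total += value
        means.append(total / n)
    return means


# Illustrative sequence converging to 2.
u = [2.0 + 1.0 / k for k in range(1, 100001)]
means = cesaro_means(u)
print(means[9], means[999], means[-1])
```

The averages approach the same limit as the sequence itself, only more slowly, which is all the proof uses.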

Finally, for all $i,j\in I$ ,

[TABLE]

Proof of (25). Let $\epsilon>0$ , applying Markov’s inequality,

[TABLE]

Since for all $j\in I$, $\left\langle\xi_{k+1},e_{j}\right\rangle^{2}=\left\langle\xi_{k+1}\otimes\xi_{k+1}(e_{j}),e_{j}\right\rangle$, and by linearity

[TABLE]
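The identity $\left\langle\xi,e_{j}\right\rangle^{2}=\left\langle\xi\otimes\xi(e_{j}),e_{j}\right\rangle$ used above can be checked in finite dimension. The sketch below assumes the convention $(a\otimes b)(h)=\left\langle a,h\right\rangle b$ (for $\xi\otimes\xi$ the convention does not matter); the vector `xi` and the helper names are hypothetical.

```python
def inner(u, v):
    """Euclidean inner product of two lists of floats."""
    return sum(ui * vi for ui, vi in zip(u, v))


def tensor_apply(a, b, h):
    """Apply the rank-one operator a (x) b to h, assuming (a (x) b)(h) = <a, h> b."""
    scale = inner(a, h)
    return [scale * bi for bi in b]


xi = [3.0, -1.0, 2.0]
# Canonical orthonormal basis of R^3.
basis = [[1.0 if i == j else 0.0 for i in range(3)] for j in range(3)]

lhs = [inner(xi, e) ** 2 for e in basis]
rhs = [inner(tensor_apply(xi, xi, e), e) for e in basis]
trace = sum(rhs)  # summing over the basis recovers ||xi||^2
print(lhs, rhs, trace)
```

Summing the identity over the basis gives $\sum_{j}\left\langle\xi,e_{j}\right\rangle^{2}=\left\|\xi\right\|^{2}$, which is how a tail sum over basis elements is controlled by a trace.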

Since $\mathbb{E}\left[\xi_{k+1}\otimes\xi_{k+1}|\mathcal{F}_{k}\right]$ converges almost surely to $\Sigma^{\prime}$ and by dominated convergence,

[TABLE]

Moreover, since $\Sigma^{\prime}=\mathbb{E}\left[\nabla_{h}g\left(X,m\right)\otimes\nabla_{h}g\left(X,m\right)\right]$ , thanks to assumption (A5a),

[TABLE]

Thus, since for all $j\in I$ , $\left\langle\Sigma^{\prime}(e_{j}),e_{j}\right\rangle\geq 0$ ,

[TABLE]

which concludes the proof. ∎

6.3 Proof of Theorem 4.1

For the sake of simplicity, the proof is given for $\mu=0$ (the case where $\mu>0$ is strictly analogous). Let us recall that equation (12) can be written as

[TABLE]

In order to prove Theorem 4.1, we just have to give the rates of convergence of the terms on the right-hand side of the previous equality. The following lemma gives the almost sure rate of convergence and the rate of convergence in quadratic mean of the first term on the right-hand side of the previous equality.

Lemma 6.2.

Suppose assumptions (A1) to (A5a’) and (A6b) hold. Then, for all $\gamma>0$ ,

[TABLE]

Moreover, suppose assumption (A5b) holds too. Then,

[TABLE]

The proof is given in the Appendix. The following lemma gives the almost sure rate of convergence and the rate of convergence in quadratic mean of the second term on the right-hand side of equality (26).

Lemma 6.3.

Suppose assumptions (A1) to (A5a’) and (A6b) hold. Then, for all $\gamma>0$ ,

[TABLE]

Moreover, suppose assumption (A5b) holds too. Then

[TABLE]

The proof is given in the Appendix. Finally, the following proposition gives the almost sure rate of convergence and the rate of convergence in quadratic mean of the last term on the right-hand side of equality (26).

Proposition 6.1.

Suppose assumptions (A1) to (A5a’) and (A6b) hold. Then, there is a positive constant $\gamma$ such that

[TABLE]

Suppose assumption (A5b) holds too. Then, there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

Proof of Proposition 6.1.

Applying equality (2), one can check that

[TABLE]

where $A_{1,k},A_{2,k},M_{k+1}$ are defined in (21). The following Lemma gives the rate of convergence in quadratic mean of the first terms on the right-hand side of previous inequality.

Lemma 6.4.

Suppose Assumptions (A1) to (A6b) hold. Then, for all $i,j\in\left\{1,2\right\}$ ,

[TABLE]

The proof of this lemma as well as its "almost sure version" are given in Appendix.

Then, we just have to bound the last term on the right-hand side of inequality (27). First let us decompose $M_{k+1}\otimes M_{k+1}$ as

[TABLE]

Note that for all $j$ , $M_{j}$ is $\mathcal{F}_{j}$ -measurable and $\mathbb{E}\left[\Xi_{j+1}\otimes M_{j}|\mathcal{F}_{j}\right]=0$ . Moreover,

[TABLE]

The end of the proof consists in bounding the quadratic mean of each term on the right-hand side of the previous equality. Note that the almost sure rates of convergence are not proven since their proofs are quite analogous.

Bounding $\mathbb{E}\left[\left\|\frac{1}{\sum_{k=1}^{n}k^{-\delta}}\sum_{k=1}^{n}\frac{1}{k^{\delta}}\frac{1}{b_{k}}\sum_{j=1}^{k}a_{j}\Xi_{j+1}\otimes M_{j}\right\|_{F}^{2}\right]$ . First, note that

[TABLE]

Moreover, with the help of an integral test for convergence, one can check that there is a positive constant $C$ such that for all positive integers $k\leq n$ ,

[TABLE]

Furthermore, since $\left(\Xi_{j+1}\otimes M_{j}\right)_{j}$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{j}\right)$ , let

[TABLE]

Then, applying equality (2) and Cauchy-Schwarz’s inequality,

[TABLE]

Finally, applying Lemmas 6.1 and H.1 as well as inequality (28),

[TABLE]

With analogous calculus, one can check

[TABLE]

Bounding $\mathbb{E}\left[\left\|\frac{1}{\sum_{k=1}^{n}k^{-\delta}}\sum_{k=1}^{n}\frac{1}{k^{\delta}}\frac{1}{b_{k}}\sum_{j=1}^{k}a_{j}\Xi_{j+1}\otimes\left(M_{k+1}-M_{j+1}\right)\right\|_{F}^{2}\right]$ . First, note that

[TABLE]

Note that $\left(\sum_{j=1}^{j^{\prime}-1}a_{j}a_{j^{\prime}}\Xi_{j+1}\otimes\Xi_{j^{\prime}+1}\right)_{j^{\prime}}$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{j^{\prime}}\right)$ . Furthermore,

[TABLE]

The end of the proof consists in bounding the two terms on the right-hand side of the previous equality. First, since $\left(\sum_{j=1}^{j^{\prime}-1}a_{j}a_{j^{\prime}}\Xi_{j+1}\otimes\Xi_{j^{\prime}+1}\right)_{j^{\prime}}$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{j^{\prime}}\right)$, let

[TABLE]

Then, applying equality (2) and Cauchy-Schwarz’s inequality,

[TABLE]

Finally, applying Lemmas H.1, H.2 and 6.1,

[TABLE]

Then, since $\delta<(1+s)/2$ ,

[TABLE]

In the same way, by linearity, let

[TABLE]

Since $\left(\Xi_{i^{\prime\prime}}\right)$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{i^{\prime\prime}}\right)$ ,

[TABLE]

Furthermore, since $\left(\sum_{j^{\prime\prime}=2}^{j}\sum_{j^{\prime}=1}^{j^{\prime\prime}-1}a_{j^{\prime}}a_{j^{\prime\prime}}\Xi_{j^{\prime}+1}\otimes\Xi_{j^{\prime\prime}+1}\right)_{j^{\prime\prime}}$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{j^{\prime\prime}}\right)$ and applying equality (2),

[TABLE]

Applying Cauchy-Schwarz’s inequality as well as Lemmas H.1 and 6.1,

[TABLE]

Finally, applying Lemma H.2,

[TABLE]

Thus,

[TABLE]

Moreover, with analogous calculus, one can check

[TABLE]

Bounding $\frac{1}{\sum_{k=1}^{n}k^{-\delta}}\sum_{k=1}^{n}\frac{1}{k^{\delta}b_{k}}\sum_{j=1}^{k}a_{k}^{2}\left(\Xi_{k+1}\otimes\Xi_{k+1}-\Sigma\right)$. First, note that

[TABLE]

The end of the proof consists in bounding the quadratic mean of the terms on the right-hand side of previous equality. First, applying Lemma H.4, let

[TABLE]

Then, applying inequality (6) and Corollary H.1,

[TABLE]

Furthermore, thanks to Lemma H.2,

[TABLE]

Thus, since $\alpha>1/2$ ,

[TABLE]

Moreover, applying Lemma H.4, let

[TABLE]

Furthermore, since $\left(\mathbb{E}\left[\Xi_{k+1}\otimes\Xi_{k+1}|\mathcal{F}_{k}\right]-\Xi_{k+1}\otimes\Xi_{k+1}\right)$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{k}\right)$ and applying Lemma H.1,

[TABLE]

Then, applying Lemma H.2,

[TABLE]

Finally,

[TABLE]

which concludes the proof. ∎

Appendix A Proof of Theorem 3.1

Let us recall that the Robbins-Monro algorithm can be written for all $n\geq 1$ as (see (17))

[TABLE]

It was proven in Godichon-Baggioni, 2016b that under assumptions (A1) to (A5a), for all $\gamma>0$ ,

[TABLE]

Then, we just have to apply Theorem 5.1 in Jakubowski, (1988) to the last term on the right-hand side of equality (17). More precisely, let $\left(e_{i}\right)_{i\in I}$ be an orthonormal basis of $H$ composed of eigenvectors of $\Gamma_{m}$ and let $\psi_{i,j}^{\prime}:=\left\langle\Sigma_{RM}e_{i},e_{j}\right\rangle$ for all $i,j\in I$ , we have to prove that the following equalities are verified.

[TABLE]

Proof of (29). Let $\eta>0$ , applying Markov’s inequality,

[TABLE]

First, since each eigenvalue $\lambda$ of $\Gamma_{m}$ verifies $0<\lambda_{\min}\leq\lambda\leq C$, there is a rank $n_{\alpha}$ such that for all positive integers $k,n$ verifying $n_{\alpha}\leq k\leq n$,

[TABLE]

For the sake of simplicity, we consider from now that $n_{\alpha}=1$ (one can see the proof of Lemma 3.1 in Cardot et al., (2017) for an analogous and more detailed proof). Then, applying Lemmas H.1 and H.3, there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

which concludes the proof of (29).

Proof of (30). Since

[TABLE]

we just have to prove that

[TABLE]

First, note that by linearity

[TABLE]

with $\epsilon_{k+1}=\xi_{k+1}\otimes\xi_{k+1}-\mathbb{E}\left[\xi_{k+1}\otimes\xi_{k+1}|\mathcal{F}_{k}\right]$. Note that $\left(\epsilon_{k}\right)$ is a sequence of martingale differences adapted to the filtration $\left(\mathcal{F}_{k}\right)$. We now prove that the last two terms on the right-hand side of the previous equality converge almost surely to $0$. First, as in Godichon-Baggioni, 2016b and Cardot et al., (2017), one can check that

[TABLE]

Let us now rewrite $\mathbb{E}\left[\xi_{k+1}\otimes\xi_{k+1}|\mathcal{F}_{k}\right]$ as

[TABLE]

Then, let

[TABLE]

Moreover, since there is a positive constant $C$ such that for all $n\geq 1$ , $\left\|\Phi(m_{n})\right\|\leq C\|m_{n}-m\|$ ,

[TABLE]

Thus, applying inequalities (5) and (32) as well as Lemma H.3, for all $\beta<\alpha$ ,

[TABLE]

In the same way,

[TABLE]

Then, with the help of assumption (A6a), Lemma H.3 and Toeplitz’s lemma, one can check that

[TABLE]

In order to verify equality (33), we have to prove

[TABLE]

Let $\left(e_{i}\right)_{i\in I}$ be an orthonormal basis of $H$ composed of eigenvectors of $\Gamma_{m}$ , and let $\left(\lambda_{i}\right)_{i\in I}$ be the set of the associated eigenvalues. Then, let us rewrite $\nabla_{h}g\left(X,m\right)$ as

[TABLE]

and it comes, by linearity and by dominated convergence,

[TABLE]

In the same way,

[TABLE]

In order to conclude the proof, let us now introduce the following lemma, which allows us to bound $\left\|\frac{1}{\gamma_{n}}\sum_{k=1}^{n}\beta_{n}\beta_{k}^{-1}\gamma_{k}\Sigma^{\prime}\beta_{n}\beta_{k}^{-1}\gamma_{k}-\Sigma_{RM}\right\|_{F}$.

Lemma A.1.

There is a positive sequence $\left(a_{n}\right)$ such that for all $n\geq 1$ and for all $i,i^{\prime}\in I$ ,

[TABLE]

and $\lim_{n\to\infty}a_{n}=0$ .

Proof.

The proof is given in Appendix. ∎

Thanks to previous lemma, let

[TABLE]

Under assumption (A5a),

[TABLE]

Since $a_{n}$ converges to $0$, this concludes the proof of (30).

Proof of (31). Let $\epsilon>0$. Applying Markov’s inequality,

[TABLE]

Since $\frac{1}{\gamma_{n}}\sum_{k=1}^{n}\left(\beta_{n}\beta_{k}^{-1}\gamma_{k}\xi_{k+1}\right)\otimes\left(\beta_{n}\beta_{k}^{-1}\gamma_{k}\xi_{k+1}\right)$ converges almost surely to $\Sigma_{RM}$ with respect to the Frobenius norm and by dominated convergence,

[TABLE]

Moreover, since

[TABLE]

and since $\left\langle\Sigma_{RM}\left(e_{j}\right),e_{j}\right\rangle\geq 0$ for all $j\in I$ ,

[TABLE]

which concludes the proof.

Appendix B Proof of Lemma 6.1

Proof.

Bounding $\mathbb{E}\left[\left\|\sum_{k=1}^{n}\frac{a_{k}}{\gamma_{k}}\left(T_{k}-T_{k+1}\right)\right\|^{2p}\right]$ . Applying an Abel’s transform,

[TABLE]
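For reference, the Abel transform (summation by parts) used here can be written, for generic weights $c_{k}$, as $\sum_{k=1}^{n}c_{k}\left(T_{k}-T_{k+1}\right)=c_{1}T_{1}-c_{n}T_{n+1}+\sum_{k=2}^{n}\left(c_{k}-c_{k-1}\right)T_{k}$. The sketch below checks this identity numerically with $c_{k}=a_{k}/\gamma_{k}$; the scalar sequence `T` and the values $s=0.75$, $\alpha=0.66$, $c_{\gamma}=1$ are hypothetical choices for illustration only.

```python
import math
import random


def abel_lhs(c, T):
    """sum_{k=1}^{n} c_k (T_k - T_{k+1}), with c[i] = c_{i+1} and T[i] = T_{i+1}."""
    return sum(c[i] * (T[i] - T[i + 1]) for i in range(len(c)))


def abel_rhs(c, T):
    """c_1 T_1 - c_n T_{n+1} + sum_{k=2}^{n} (c_k - c_{k-1}) T_k."""
    n = len(c)
    boundary = c[0] * T[0] - c[n - 1] * T[n]
    interior = sum((c[i] - c[i - 1]) * T[i] for i in range(1, n))
    return boundary + interior


s, alpha, c_gamma, n = 0.75, 0.66, 1.0, 50  # illustrative values
a = [math.exp(k ** (1.0 - s) / (2.0 * (1.0 - s))) for k in range(1, n + 1)]
gamma = [c_gamma * k ** (-alpha) for k in range(1, n + 1)]
c = [ak / gk for ak, gk in zip(a, gamma)]

rng = random.Random(0)
T = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]  # stands in for T_1, ..., T_{n+1}

left, right = abel_lhs(c, T), abel_rhs(c, T)
print(left, right)
```

The point of the transform is that the increments $c_{k}-c_{k-1}$ are easier to control than the differences $T_{k}-T_{k+1}$.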

First, $\mathbb{E}\left[\left\|\frac{a_{1}}{\gamma_{1}}T_{1}\right\|^{2p}\right]=O\left(1\right)$ . Moreover, applying inequality (6) ,

[TABLE]

Furthermore, one can check that there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

and applying Lemma H.4 and inequality (6),

[TABLE]

Finally, applying Lemma H.2,

[TABLE]

Bounding $\mathbb{E}\left[\left\|\sum_{k=1}^{n}a_{k}\Delta_{k}\right\|^{2p}\right]$ . Since there is a positive constant $C_{m}$ (see Godichon-Baggioni, 2016b ) such that for all $n\geq 1$ , $\left\|\Delta_{n}\right\|\leq C_{m}\left\|T_{n}\right\|^{2}$ , applying Lemma H.4 and inequality (6),

[TABLE]

Applying Lemma H.2,

[TABLE]

Bounding $\mathbb{E}\left[\left\|\sum_{k=1}^{n}a_{k}\Xi_{k+1}\right\|^{2p}\right]$ . First, since $\left(\Xi_{n}\right)$ is a sequence of martingale differences, and thanks to Lemma H.2,

[TABLE]

With the help of an induction on $p$ (see the proof of Theorem 4.2 in Godichon-Baggioni, 2016a for instance), one can check that for all integer $p\geq 1$ ,

[TABLE]

which concludes the proof. ∎

Appendix C Proof of Lemma A.1

Let $\left(\lambda_{i}\right)_{i\in I}$ be the eigenvalues of the Hessian $\Gamma_{m}$ . First, let

[TABLE]

Let us recall that there is a positive constant $C$ such that for all $i\in I$, $\lambda_{i}\leq C$. Then, let $n_{\alpha}$ be an integer such that for all $k\geq n_{\alpha}$, $C\gamma_{k}<1$, and it follows that, for all $k\geq n_{\alpha}$, $\lambda_{i}\gamma_{k}\leq C\gamma_{k}<1$. Then, with the help of the Taylor expansion of the function $x\longmapsto\ln(1-x)$, one can check that for all $i\in I$ and for all $k\geq n_{\alpha}$,

[TABLE]

with $c:=\frac{1}{1-C\gamma_{n_{\alpha}}}$ . Then, for all $n,k\geq n_{\alpha}$ ,

[TABLE]
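The Taylor argument above rests on a two-sided bound for $\ln(1-x)$. One bound of this type, which we assume is close to the one intended, is $-x-cx^{2}\leq\ln(1-x)\leq-x$ for all $x\in[0,x_{0}]$ with $x_{0}<1$ and $c=\frac{1}{1-x_{0}}$, where $x$ plays the role of $\lambda_{i}\gamma_{k}$ and $x_{0}$ that of $C\gamma_{n_{\alpha}}$. The sketch below checks it on a grid; the function name and the grid size are hypothetical.

```python
import math


def log_bounds_hold(x0, grid=1000, tol=1e-15):
    """Check -x - c*x^2 <= ln(1 - x) <= -x on a grid of [0, x0], with c = 1 / (1 - x0)."""
    c = 1.0 / (1.0 - x0)
    for i in range(grid + 1):
        x = x0 * i / grid
        value = math.log1p(-x)  # ln(1 - x), accurate for small x
        if not (-x - c * x * x - tol <= value <= -x + tol):
            return False
    return True


for x0 in (0.1, 0.5, 0.9):
    print(x0, log_bounds_hold(x0))
```

The lower bound follows from $-\ln(1-x)-x=\sum_{k\geq 2}x^{k}/k\leq\frac{x^{2}}{2(1-x)}$, so the constant $c$ is not sharp.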

With the help of an integral test for convergence,

[TABLE]

Then,

[TABLE]

We now give an upper bound of $\sum_{k=1}^{n}\gamma_{k}^{2}c_{n,k}$ . Since $0<\lambda_{\min}\leq\lambda_{i}\leq C$ for all $i\in I$ , there is a rank $n_{\alpha}$ , only depending on $\lambda_{\min},C,c_{\gamma}$ and $\alpha$ , such that the functional $\varphi:\mathbb{R}\longrightarrow\mathbb{R}$ defined for all $t\in\mathbb{R}$ by

[TABLE]

is increasing on $[n_{\alpha},+\infty)$. For the sake of simplicity, let us consider that $n_{\alpha}=0$. Then, with the help of an integral test for convergence,

[TABLE]

Then, since for all $i\in I$ , $0<\lambda_{\min}\leq\lambda_{i}\leq C$ , one can check that there is a positive sequence $\left(\epsilon_{n}\right)_{n\geq 1}$ only depending on $\alpha,c_{\gamma},\lambda_{\min},C$ such that

[TABLE]

With analogous calculations, one can check that there is a positive sequence $\left(\epsilon_{n}^{\prime}\right)_{n\geq 1}$ only depending on $\alpha,c_{\gamma},\lambda_{\min},C$ such that

[TABLE]

which concludes the proof.

Appendix D Proof of Lemma 6.2

We only give the bound of the quadratic mean error since the almost sure rate of convergence is quite straightforward. First, since

[TABLE]

and by linearity, let

[TABLE]

Then, we have to bound the three terms on the right-hand side of previous equality.

Bounding $\mathbb{E}\left[\left\|\frac{1-\delta}{n^{1-\delta}}\sum_{k=1}^{n}\frac{1}{k^{\delta+s}}\exp\left(-\frac{k^{1-s}}{1-s}\right)\left(\sum_{j=1}^{k}e^{\frac{j^{1-s}}{2(1-s)}}\left(m_{j}-m\right)\right)\otimes\left(\sum_{j=1}^{k}e^{\frac{j^{1-s}}{2(1-s)}}\left(\overline{m}_{j}-m\right)\right)\right\|_{F}^{2}\right]$ . First, applying Lemma H.4 and equality (2), let

[TABLE]

Applying Cauchy-Schwarz’s inequality,

[TABLE]

First, note that thanks to Lemma 6.1

[TABLE]

Furthermore, applying Lemmas H.4 and H.2 as well as inequality (9),

[TABLE]

Then, applying Lemma H.2,

[TABLE]

With analogous calculus, one can check that

[TABLE]

Bounding $\mathbb{E}\left[\left\|\frac{1-\delta}{n^{1-\delta}}\sum_{k=1}^{n}\frac{1}{k^{\delta+s}}\exp\left(-\frac{k^{1-s}}{1-s}\right)\left(\sum_{j=1}^{k}e^{\frac{j^{1-s}}{2(1-s)}}\left(\overline{m}_{j}-m\right)\right)\otimes\left(\sum_{j=1}^{k}e^{\frac{j^{1-s}}{2(1-s)}}\left(\overline{m}_{j}-m\right)\right)\right\|_{F}^{2}\right]$ . First, applying Lemma H.4 and equality (2), let

[TABLE]

Then, applying inequality (35) and Corollary H.2,

[TABLE]

which concludes the proof.

Appendix E Proof of Lemma 6.3

We just give the proof for the rate of convergence in quadratic mean, since the proof of the almost sure rate of convergence is quite straightforward. Let

[TABLE]

We now bound the quadratic mean of each term on the right-hand side of previous equality. First, note that with the help of an integral test for convergence,

[TABLE]

Then,

[TABLE]

Then, applying Lemma H.4, there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

Furthermore, applying equality (2)

[TABLE]

Finally, applying Lemma 6.1 and since $\delta<(1+s)/2$,

[TABLE]

In the same way, with the help of an integral test for convergence,

[TABLE]

Thus, one can check that there is a positive constant $c$ such that for all $n\geq 1$ ,

[TABLE]

Then,

[TABLE]

Thus, applying Lemma H.4, there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

Finally, applying equality (2) and Lemma 6.1,

[TABLE]

which concludes the proof.

Appendix F Proof of Lemma 6.4

Proof of Lemma 6.4.

This proof is a direct application of Lemma 6.1. In order to convince the reader, we give just one proof; the other ones are analogous. Applying Lemmas H.4 and 6.1 as well as Corollary H.2,

[TABLE]

which concludes the proof. ∎

We now give the "almost sure version" of Lemma 6.4.

Lemma F.1.

Suppose Assumptions (A1) to (A5a’) hold. Then, for all $i,j\in\left\{1,2\right\}$ , and for all $\gamma>0$ ,

[TABLE]

The proof is not given since it is quite close to that of Lemma 6.4.

Appendix G Dealing with Assumption (A6) for the geometric median

In what follows, we consider that assumption (H2) in Godichon-Baggioni, 2016b is fulfilled, i.e:

(H2)

The random variable $X$ is not concentrated around single points: for all positive constant $A$ , there is a positive constant $C_{A}$ such that for all $h\in\mathcal{B}\left(0,A\right)$ ,

[TABLE]

Then, for all $h\in H$ , let us define the function $\varphi_{h}:[0,1]\longrightarrow\mathcal{S}(H)$ , defined for all $t\in[0,1]$ by

[TABLE]

In what follows, we will denote $A(t):=X-m+t\left(h-m\right)$ . Note that

[TABLE]

and that the functional $\varphi_{h}$ is differentiable, and its derivative is defined for all $t\in[0,1]$ by

[TABLE]

Then, applying Cauchy-Schwarz’s inequality,

[TABLE]

Thus, let $\epsilon>0$ , thanks to Assumption (H2), there is a positive constant $C_{\left\|m\right\|+\epsilon}$ such that for all $t\in[0,1]$ and for all $h\in\mathcal{B}\left(m,\epsilon\right)$ ,

[TABLE]

Finally,

[TABLE]

Appendix H Technical lemmas

In order to simplify the proof, we recall or give some technical lemmas. The following one ensures that the sequence $\left(\xi_{n}\right)$ admits uniformly bounded $2p$ -moments.

Lemma H.1 (Godichon-Baggioni, 2016b ).

Suppose assumptions (A1) to (A5a’) hold, there is a positive constant $K$ such that for all $n\geq 1$ ,

[TABLE]

Moreover, suppose assumption (A5b) holds too. Then, for all positive integer $p$ , there is a positive constant $K_{p}$ such that for all $n\geq 1$ ,

[TABLE]

As a particular case, since for every eigenvalue $\lambda$ of $\Gamma_{m}$, $0<\lambda_{\min}\leq\lambda\leq C$, for all $n\geq 1$,

[TABLE]

Corollary H.1.

Suppose assumptions (A1) to (A6b) hold. Then, there is a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

The proof is not given since it is a direct application of assumption (A6b) and Lemma H.1. The following lemma gives upper bounds of the sums of exponential terms which appear in several proofs.

Lemma H.2.

For all constants $a,b,c$ such that $a\in(0,1)$ , there is a positive constant $C_{a,b,c}$ such that

[TABLE]

The proof is not given since it is a direct application of an integral test for convergence. As a corollary, one can obtain the following lower and upper bounds of $b_{n}$.

Corollary H.2.

There are positive constants $c,C$ such that for all $n\geq 1$ ,

[TABLE]
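The integral-test heuristic behind such bounds suggests that $b_{n}=\sum_{k=1}^{n}a_{k}^{2}$ is of the same order as $n^{s}a_{n}^{2}=n^{s}\exp\left(\frac{n^{1-s}}{1-s}\right)$; this reading of the corollary is an assumption, since the display is not reproduced here. The sketch below computes the ratio $b_{n}/\left(n^{s}a_{n}^{2}\right)$ in log scale to avoid overflow; the value $s=0.75$ and the function names are hypothetical.

```python
import math


def log_a(n, s):
    """log a_n, where a_n = exp(n^(1-s) / (2 (1-s)))."""
    return n ** (1.0 - s) / (2.0 * (1.0 - s))


def log_b(n, s):
    """log b_n, where b_n = sum_{k<=n} a_k^2, computed by a log-sum-exp trick."""
    top = 2.0 * log_a(n, s)  # largest exponent, factored out for stability
    acc = sum(math.exp(2.0 * log_a(k, s) - top) for k in range(1, n + 1))
    return top + math.log(acc)


def ratio(n, s):
    """b_n / (n^s a_n^2), evaluated through logarithms."""
    return math.exp(log_b(n, s) - s * math.log(n) - 2.0 * log_a(n, s))


s = 0.75  # illustrative value
ratios = {n: ratio(n, s) for n in (10 ** 3, 10 ** 4, 10 ** 5)}
for n, r in ratios.items():
    print(n, r)
```

In this experiment the ratio stays bounded away from $0$ and $\infty$, consistent with two-sided bounds of the form $c\,n^{s}a_{n}^{2}\leq b_{n}\leq C\,n^{s}a_{n}^{2}$.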

The following lemma is really useful in the proof of Theorem 3.1.

Lemma H.3 (Cardot and Godichon-Baggioni, (2017)).

Let $\alpha,\beta$ be non-negative constants such that $0<\alpha<1$ , and $\left(u_{n}\right)$ , $\left(v_{n}\right)$ be two sequences defined for all $n\geq 1$ by

[TABLE]

with $c_{u},c_{v}>0$. Then, there is a positive constant $c_{0}$ such that for all $n\geq 1$,

[TABLE]

Finally, we recall the following result, which enables us to upper bound the $L^{p}$ moments of a sum of random variables in normed vector spaces.

Lemma H.4 (Godichon-Baggioni, 2016a ).

Let $Y_{1},...,Y_{n}$ be random variables taking values in a normed vector space such that for every positive constant $q$ and for all $k\geq 1$, $\mathbb{E}\left[\left\|Y_{k}\right\|^{q}\right]<\infty$. Then, for all constants $a_{1},...,a_{n}$ and for all integers $p$,

[TABLE]
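One classical inequality of this kind is Minkowski's inequality in $L^{p}$, $\mathbb{E}\left[\left\|\sum_{k}a_{k}Y_{k}\right\|^{p}\right]\leq\left(\sum_{k}\left|a_{k}\right|\left(\mathbb{E}\left[\left\|Y_{k}\right\|^{p}\right]\right)^{1/p}\right)^{p}$; whether it matches the exact display of Lemma H.4 is an assumption. The sketch below checks it with expectations replaced by averages over simulated draws. Since an empirical average is itself an expectation, the inequality must hold up to rounding. All names and numerical values are hypothetical.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def lp_moment_check(draws, coeffs, p):
    """Return (lhs, rhs) of the Minkowski-type bound under the empirical measure.

    draws[i][k] is the i-th simulated value of Y_k, given as a list of floats.
    """
    n_draws = len(draws)
    dim = len(draws[0][0])
    lhs = 0.0
    for draw in draws:
        combo = [sum(a * y[d] for a, y in zip(coeffs, draw)) for d in range(dim)]
        lhs += norm(combo) ** p
    lhs /= n_draws
    rhs = 0.0
    for k, a in enumerate(coeffs):
        moment = sum(norm(draw[k]) ** p for draw in draws) / n_draws
        rhs += abs(a) * moment ** (1.0 / p)
    return lhs, rhs ** p


rng = random.Random(1)
coeffs = [0.5, -2.0, 1.5, 3.0]
draws = [[[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in coeffs] for _ in range(2000)]
lhs, rhs = lp_moment_check(draws, coeffs, p=4)
print(lhs, rhs)
```

For independent centered $Y_{k}$ the left-hand side is much smaller than the right-hand side; the bound is useful precisely because it requires no independence or martingale structure.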

Bibliography

Bach, F. (2014). Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. The Journal of Machine Learning Research, 15(1):595–627.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781.
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Cardot, H., Cénac, P., Godichon-Baggioni, A., et al. (2017). Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. The Annals of Statistics, 45(2):591–614.
Cardot, H., Cénac, P., and Zitt, P.-A. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1):18–43.
Cardot, H. and Godichon-Baggioni, A. (2017). Fast estimation of the median covariation matrix with application to online robust principal components analysis. Test, 26(3):461–480.
Chakraborty, A. and Chaudhuri, P. (2014). The spatial distribution in infinite dimensional spaces and related quantiles and depths. The Annals of Statistics, 42:1203–1231.
Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc., 91(434):862–872.
