Making the Last Iterate of SGD Information Theoretically Optimal

Prateek Jain; Dheeraj Nagaraj; Praneeth Netrapalli

arXiv:1904.12443·math.OC·May 30, 2019

TL;DR

This paper introduces new step size sequences for SGD and GD that achieve information-theoretic optimal bounds on the suboptimality of the last iterate, addressing a longstanding gap in theoretical understanding and practical performance.

Contribution

It proposes a novel modification scheme for step size sequences that ensures the last iterate's suboptimality matches the optimal bounds, improving theoretical guarantees and practical results.

Findings

01 New step size sequences achieve optimal last iterate bounds

02 Modified sequences match average-case guarantees for last point

03 Simulations confirm significant improvement over standard step sizes

Abstract

Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for last point of SGD [1] however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. [2] shows that in fact, this additional $\log T$ factor is tight for standard step size sequences of $\Theta\left(\frac{1}{\sqrt{t}}\right)$ and $\Theta\left(\frac{1}{t}\right)$ for non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) when applied to non-smooth,…

Equations

$$\min_{x\in\mathcal{W}}F(x),$$

$$k:=\inf\{i:T\cdot 2^{-i}\leq 1\},\quad T_{i}:=T-\lceil T\cdot 2^{-i}\rceil,\ 0\leq i\leq k,\quad\text{and}\quad T_{k+1}:=T.$$

$$\alpha_{t}=\frac{C\cdot 2^{-i}}{\sqrt{T}}\quad\text{when } T_{i}<t\leq T_{i+1},\ 0\leq i\leq k.$$

$$\mathbb{E}[F(x_{T})]\leq F(x^{*})+\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}.$$

$$F(x_{T})=F(x^{*})+O\left(\frac{D^{2}}{C\sqrt{T}}+\frac{CG^{2}}{\sqrt{T}}\log\left(\frac{1}{\delta}\right)\right)\leq F(x^{*})+O\left(DG\sqrt{\frac{\log\frac{1}{\delta}}{T}}\right).$$

$$F(x_{T})\leq F(x^{*})+\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}.$$

$$\alpha_{t}=2^{-i}\frac{1}{\lambda t},\quad\forall\, T_{i}<t\leq T_{i+1},\ 0\leq i\leq k.$$

$$\mathbb{E}[F(x_{T})]\leq F(x^{*})+\frac{130G^{2}}{\lambda T}.$$

$$\mathbb{E}[F(x_{T})]=F(x^{*})+O\left(\frac{G^{2}\log\left(\frac{1}{\delta}\right)}{\lambda T}\right).$$

$$F(x_{T})\leq F(x^{*})+\frac{130G^{2}}{\lambda T}.$$

$$\alpha_{t}:=2^{-i}\gamma_{t}\quad\forall\, T_{i}<t\leq T_{i+1}\ \text{and}\ 0\leq i\leq k.$$

$$\mathbb{E}[F(x_{T})]\leq 5G^{2}\gamma_{T}\left(\frac{1}{\beta^{2}}+\frac{1}{\beta^{4}}\right)+\inf_{\lceil\frac{T}{4}\rceil\leq t\leq T_{1}}\mathbb{E}[F(y_{t})].$$

$$F(x_{T})$$

$$\sum_{t=t_{0}}^{t_{1}}2\alpha_{t}\mathbb{E}[F(x_{t})-F(x_{t_{0}})]\leq\sum_{t=t_{0}}^{t_{1}}G^{2}\alpha_{t}^{2}.$$

$$\|x_{t+1}-x_{t_{0}}\|$$

$$\|x_{t+1}-x_{t_{0}}\|^{2}$$

$$\mathbb{E}[\|x_{t+1}-x_{t_{0}}\|^{2}]$$

$$L_{t_{1}}=\frac{1}{e\cdot r},\quad t\in[t_{0},t_{1}],\quad L_{t-1}=L_{t}+L_{t}^{2},\quad t_{0}\leq t-1<t_{1}.$$

$$A(l,t_{1}):=\sum_{t=l}^{t_{1}}L_{t}\left[2\alpha_{t}(F(x_{t})-F(x_{l}))-\alpha_{t}^{2}G^{2}\right],\quad A^{*}(t_{0},t_{1}):=\sum_{t=t_{0}}^{t_{1}}L_{t}\left[2\alpha_{t}(F(x_{t})-F(x^{*}))-\alpha_{t}^{2}G^{2}\right].$$

$$\mathbb{P}[(p.A)(t_{0},t_{1})>\eta]\leq\exp\left(-\frac{\eta}{8\alpha_{t_{0}}^{2}G^{2}}\right).$$

$$\mathbb{P}[A^{*}(t_{0},t_{1})>\eta]\leq\exp\left(\frac{2D^{2}L_{t_{0}}}{8\alpha_{t_{0}}^{2}G^{2}}\right)\exp\left(\frac{-\eta}{8\alpha_{t_{0}}^{2}G^{2}}\right).$$

$$4(T_{i+2}-T_{i+1})\geq T_{i+1}-T_{i}.$$

$$\tau_{i}:=\arg\inf_{T_{i}<t\leq T_{i+1}}\mathbb{E}[F(x_{t})],\ i\in[k+1],\quad\text{and}\quad\tau_{0}:=\arg\inf_{\lceil\frac{T}{4}\rceil\leq t\leq T_{1}}\mathbb{E}[F(x_{t})].$$

$$\mathbb{E}[F(x_{\tau_{i+1}})-F(x_{\tau_{i}})]\leq\frac{5G^{2}\gamma_{T}}{\beta^{2}}2^{-i},\quad\mathbb{E}[F(x_{\tau_{1}})-F(x_{\tau_{0}})]\leq\frac{5G^{2}\gamma_{T}}{\beta^{4}}.$$

$$\frac{\sum_{t=\tau_{i}}^{T_{i+2}}2\alpha_{t}\mathbb{E}[F(x_{t})-F(x_{\tau_{i}})]}{T_{i+2}-\tau_{i}+1}$$

$$G^{2}2^{-2i}\gamma_{T_{i}+1}^{2}$$

$$=\frac{2^{-i}\gamma_{T_{i+2}}}{5}\mathbb{E}[F(x_{\tau_{i+1}})-F(x_{\tau_{i}})]\geq\frac{2^{-i}\beta\gamma_{T_{i+1}}}{5}\mathbb{E}[F(x_{\tau_{i+1}})-F(x_{\tau_{i}})],$$

$$\sum_{t=T_{i+1}+1}^{T_{i+2}}p^{(i+1)}(t)F(x_{t})$$

$$\sum_{t=T_{1}+1}^{T_{2}}p^{(1)}(t)F(x_{t})$$

Taxonomy

Methods: Stochastic Gradient Descent

Full text

Making the Last Iterate of SGD Information Theoretically Optimal

Prateek Jain

Microsoft Research

Bengaluru, India

Dheeraj Nagaraj

Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology

Cambridge, USA 02139

Praneeth Netrapalli

Microsoft Research

Bengaluru, India

Abstract

Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for last point of SGD [1] however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. [2] shows that in fact, this additional $\log T$ factor is tight for standard step size sequences of $\Theta\left(\frac{1}{\sqrt{t}}\right)$ and $\Theta\left(\frac{1}{t}\right)$ for non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) when applied to non-smooth, convex functions, the best known step-size sequences still lead to $O(\log T)$ -suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of last point of SGD as well as GD. We achieve this by designing a modification scheme, that converts one sequence of step sizes to another so that the last point of SGD/GD with modified sequence has the same suboptimality guarantees as the average of SGD/GD with original sequence. We also show that our result holds with high-probability. We validate our results through simulations which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.

Accepted for presentation at the Conference on Learning Theory (COLT) 2019

Keywords: Stochastic Gradient Descent $\cdot$ Machine Learning $\cdot$ Convex Optimization

1 Introduction

Stochastic Gradient Descent (SGD) is one of the most popular algorithms for solving large-scale empirical risk minimization (ERM) problems [3, 4, 5]. The algorithm updates the iterates using stochastic gradients obtained by sampling data points uniformly at random. The algorithm has been studied for several decades [6] but there are still significant gaps between practical implementations and theoretical analyses. In particular, the standard analyses hold only for some kind of average of iterates, but most practitioners just use the final iterate of SGD. So, [7] asked the natural question of whether the final iterate of SGD, as opposed to the average of iterates, is provably good. This was partly answered in [1], which gave a sub-optimality bound for the last point of SGD, but the obtained sub-optimality rates are a factor of $O(\log T)$ worse than the information theoretically optimal rates, where $T$ is the number of iterations.

[2] showed that the above result is tight for the standard step-size sequence used by most existing theoretical results. The extra logarithmic factor is not due to the stochastic nature of SGD. In fact, even for subgradient descent (GD) when applied to general non-smooth, convex functions, the last point’s convergence rates are sub-optimal by an $O(\log T)$ factor.

So, this work addresses the following two fundamental questions:

“Does there exist a step-size sequence for which the last point of SGD when applied to general convex functions as well as to strongly-convex functions has optimal error (sub-optimality) rate?”, and,

“Does there exist a step-size sequence for which the last point of GD when applied to general non-smooth convex functions has optimal error (sub-optimality) rate?”

In this paper, we answer both questions in the affirmative. That is, we provide novel step size sequences and show that the final iterate of SGD run with these step size sequences has the information theoretically optimal error (suboptimality) rate. In particular, for general non-smooth convex functions, our results ensure an error rate of $O(\frac{1}{\sqrt{T}})$ and for strongly-convex functions, the error rate is $O(\frac{1}{T})$ . We also present high-probability versions, i.e., we show that with probability at least $1-\delta$ , the suboptimality is $O\left(\sqrt{\tfrac{\log{\tfrac{1}{\delta}}}{T}}\right)$ and $O\left(\tfrac{\log{\tfrac{1}{\delta}}}{T}\right)$ respectively (see Theorems 1 and 2). For GD, we show that a similarly modified step-size sequence leads to suboptimality of $O(\frac{1}{{T}})$ and $O(\frac{1}{\sqrt{T}})$ for non-smooth convex functions, with and without strong convexity respectively, which is optimal.

In general, SGD takes the iterates near the optimum value but since the objective isn’t smooth near the optimizer $x^{*}$ , the gradients don’t become small even when the points are close to $x^{*}$ . Standard step sizes don’t decay appreciably with time to ensure fast enough convergence to $x^{*}$ . Therefore the iterates $x_{t}$ , after going close to $x^{*}$ , start oscillating around it without actually approaching it (See Section 4 for concrete examples). Our new step sizes, given in Section 2.1 ensure that the step sizes decay fast enough after a certain point, making the iterates go closer to the optimum $x^{*}$ . The exact mode of this decay ensures that the last iterate approaches the optimum at the information theoretic rate.

Our results utilize a general step size modification scheme which ensures that the upper bounds for the average function value with the original step sizes get transferred to the last iterate when the modified step sizes are used (see Theorems 3 and 4). A key technical contribution of the paper is the proof of Theorem 2, which constructs a sequence of averaging schemes that are ‘good’ with high probability such that the last averaging scheme consists only of the last iterate, and hence lets us conclude that the last iterate is ‘good’ with high probability.

Our new step-size sequence requires that the number of iterations or horizon $T$ is known a priori. In contrast, standard step-size sequences do not require $T$ a priori, and hence guarantee any-time results. Knowing $T$ a priori helps us ensure that we do not drop the step size too early; only after we are close to the optimum does the step size drop rapidly. In fact, we conjecture that in the absence of a priori information about $T$ , no step-size sequence can ensure the information theoretically optimal error rates for the final iterate of SGD. As a step towards proving this, we show that in the case of strongly convex objectives, any choice of step sizes with infinite horizon (i.e., without the knowledge of the total number of iterations) is either suboptimal almost surely or suboptimal in expectation for infinitely many points. We show this in Theorem 5.

Related Work: Averaging was used first in the stochastic approximation setting by [8] to show optimal rates of convergence. Gradient Descent type methods have been shown to achieve information theoretically optimal error rates in the convex and strongly convex settings when averaging of iterates is used ([9],[10],[11], [12], Epoch GD in [13] , SGD [14] and [15]). The question of the last iterate was first considered in [1] and it gives a bound of $O(\frac{\log{T}}{\sqrt{T}})$ and $O(\frac{\log{T}}{T})$ in expectation for the general case and strongly convex case respectively. [2] show matching high probability bounds and show that for the standard step sizes ( $O\left(\frac{1}{\sqrt{t}}\right)$ in the general case and $O\left(\frac{1}{t}\right)$ in the strongly convex case), the logarithmic-suboptimal bounds are tight.

Organization: The setting and main results are presented in Section 2. In particular, Section 2.1 describes the general step size modification considered and states key results regarding this modification and the lower bound is presented in Section 2.2. Key technical ideas are developed in Section 3 and the main theorems are proved. We present some experimental results in Section 4 and conclude in Section 5. Skipped proofs of technical lemmas are given in the appendix.

2 Problem Setup and Main Results

Consider the following optimization problem:

$$\min_{x\in\mathcal{W}}F(x),\tag{1}$$

where the objective function $F:\mathbb{R}^{d}\to\mathbb{R}$ is a convex function and $\mathcal{W}\subset\mathbb{R}^{d}$ is a closed convex set. Let the global minimizer of $F(\cdot)$ be $x^{*}\in\mathcal{W}$ . We start the SGD algorithm at a point $x_{1}\in\mathcal{W}$ and iteratively obtain estimates $x_{t}$ for the minimizer of $F(\cdot)$ . We assume that at each time step, we have access to an independent, unbiased estimate $\hat{g}_{t}$ of a subgradient $g_{t}\in\partial F$ . That is, $\mathbb{E}[\hat{g}_{t}(x)]=g_{t}(x)\in\partial F(x)$ for every $x\in\mathcal{W}$ and $(\hat{g}_{t}-g_{t})_{t=1}^{T}$ are independent. We pick step sizes $(\alpha_{t})_{t=1}^{T}\geq 0$ . Let $\Pi_{\mathcal{W}}$ be the projection operator to the set $\mathcal{W}$ . The SGD algorithm (Algorithm 1) iterates the update $x_{t+1}=\Pi_{\mathcal{W}}\left(x_{t}-\alpha_{t}\hat{g}_{t}(x_{t})\right)$ .
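The projected SGD iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch: the choice of $\mathcal{W}$ as a Euclidean ball, the names `project_ball`, `projected_sgd` and `noisy_l1_subgradient`, and the test objective $F(x)=\|x\|_{1}$ with uniform noise are all assumptions made for the example and are not taken from the paper.

```python
import math
import random


def project_ball(x, radius):
    """Euclidean projection onto {x : ||x|| <= radius} (an assumed choice of W)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def projected_sgd(x1, step_sizes, subgradient_oracle, radius):
    """Apply x <- Pi_W(x - alpha * g_hat(x)) once per step size; return all iterates."""
    iterates = [list(x1)]
    for alpha in step_sizes:
        x = iterates[-1]
        g_hat = subgradient_oracle(x)
        moved = [xi - alpha * gi for xi, gi in zip(x, g_hat)]
        iterates.append(project_ball(moved, radius))
    return iterates


rng = random.Random(0)


def noisy_l1_subgradient(x):
    """Sign subgradient of ||x||_1 plus zero-mean bounded noise (hypothetical oracle)."""
    return [(v > 0) - (v < 0) + rng.uniform(-0.5, 0.5) for v in x]


T = 100
iterates = projected_sgd([1.0, -1.0], [0.1 / math.sqrt(T)] * T,
                         noisy_l1_subgradient, radius=2.0)
last_point = iterates[-1]
```

The function returns every iterate so that either the last point or any average of iterates can be inspected afterwards.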

Henceforth, we will retain the assumptions made above. Whenever we use $g_{t}(x)$ , it is implied that $g_{t}(x)\in\partial F(x)$ . Throughout the paper, we assume that $F$ is a Lipschitz continuous convex function.

Assumption 1 (Lipschitz Continuity).

$F:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a $G$-Lipschitz continuous convex function over the closed convex set $\mathcal{W}$ , i.e., $\|g(x)\|\leq G$ for every $x\in\mathcal{W}$ and every $g(x)\in\partial F(x)$ . Furthermore, the stochastic gradients $\hat{g}$ satisfy: $\|\hat{g}(x)\|\leq G$ almost surely for every $x\in\mathcal{W}$ .

Assumption 2 (Closed and bounded set).

Diameter of closed convex set $\mathcal{W}$ is bounded by $D$ , i.e., $\mathrm{diam}(\mathcal{W})\leq D$ .

Assumption 3 (Strong convexity).

Let $\lambda>0$ . A convex function $F$ is said to be $\lambda$ strongly convex over $\mathcal{W}$ iff $F(y)\geq F(x)+\langle\nabla F(x),y-x\rangle+\frac{\lambda}{2}\|y-x\|^{2}\;\forall\;x,y\in\mathcal{W}$ .

Step size sequence for general convex functions: we first define,

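$$k:=\inf\{i:T\cdot 2^{-i}\leq 1\},\quad T_{i}:=T-\lceil T\cdot 2^{-i}\rceil,\ 0\leq i\leq k,\quad\text{and}\quad T_{k+1}:=T.\tag{2}$$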

Clearly, $0=T_{0}<T_{1}<\dots<T_{k}=T-1<T_{k+1}=T$ . We note in particular that $T_{1}\approx\frac{T}{2}$ . Let $C>0$ be arbitrary. Then, we choose the step size $\alpha_{t}$ as follows:

$$\alpha_{t}=\frac{C\cdot 2^{-i}}{\sqrt{T}}\quad\text{when } T_{i}<t\leq T_{i+1},\ 0\leq i\leq k.\tag{3}$$
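To make the phase structure concrete, the following Python sketch computes the boundaries $T_{0},\dots,T_{k+1}$ and the resulting step sizes. It assumes $T_{i}=T-\lceil T\cdot 2^{-i}\rceil$ and a step size of $C\cdot 2^{-i}/\sqrt{T}$ on the phase $T_{i}<t\leq T_{i+1}$ (the $1/\sqrt{T}$ scaling is inferred from the $\frac{15GD}{\sqrt{T}}$ rate in Theorem 1). The function names are hypothetical.

```python
import math


def phase_boundaries(T):
    """Return [T_0, ..., T_{k+1}] with T_i = T - ceil(T * 2^-i) and T_{k+1} = T."""
    k = 0
    while T > 2 ** k:                      # smallest k with T * 2^-k <= 1
        k += 1
    # -(-T // m) is integer ceiling division, which avoids floating point error
    bounds = [T - (-(-T // 2 ** i)) for i in range(k + 1)]
    bounds.append(T)
    return bounds


def general_convex_step_sizes(T, C):
    """Return [alpha_1, ..., alpha_T] with alpha_t = C * 2^-i / sqrt(T) on phase i."""
    bounds = phase_boundaries(T)
    alphas = []
    for i in range(len(bounds) - 1):
        phase_length = bounds[i + 1] - bounds[i]
        alphas.extend([C * 2.0 ** (-i) / math.sqrt(T)] * phase_length)
    return alphas


print(phase_boundaries(16))                # [0, 8, 12, 14, 15, 16]
print(general_convex_step_sizes(16, 1.0)[-3:])
```

For $T=16$ the first half of the run uses the constant step $C/\sqrt{T}$, and each later phase is half as long as the previous one and uses half its step size.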

The theorem below provides suboptimality guarantee for the SGD algorithm with the step-size sequence mentioned above.

Theorem 1 (SGD/GD Last Point for General Convex Functions).

Let Assumptions 1 and 2 hold. Given $T\geq 4$ , let $x_{1},\dots,x_{T}$ be the iterates of SGD (Algorithm 1) with step size $\alpha_{t}$ as defined in Equation (3). Then, the following holds for all $T\geq 4$ :

$$\mathbb{E}[F(x_{T})]\leq F(x^{*})+\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}.$$

In particular, if we choose $C=\frac{D}{G}$ , we have: $\mathbb{E}F(x_{T})\leq F(x^{*})+\frac{15GD}{\sqrt{T}}\,.$ Furthermore, the following holds w.p. $\geq 1-\delta$ for any $0<\delta<\frac{1}{e}$ :

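$$F(x_{T})=F(x^{*})+O\left(\frac{D^{2}}{C\sqrt{T}}+\frac{CG^{2}}{\sqrt{T}}\log\left(\frac{1}{\delta}\right)\right)\leq F(x^{*})+O\left(DG\sqrt{\frac{\log\frac{1}{\delta}}{T}}\right).$$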

Finally, under the same assumptions, GD update $(x_{t+1}=\Pi_{\mathcal{W}}(x_{t}-\alpha_{t}\nabla F(x_{t})))$ with the same step-size sequence given in (3) also ensures the following after $T$ iterations:

$$F(x_{T})\leq F(x^{*})+\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}.$$

We will prove this theorem in Section 3 after developing some general ideas.
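As a quick check of the constant in Theorem 1, assume the expectation bound has the form $F(x^{*})+\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}$ (this form is inferred from the stated $\frac{15GD}{\sqrt{T}}$ rate). Substituting $C=\frac{D}{G}$ then gives:

```latex
\frac{4D^{2}}{C\sqrt{T}}+\frac{11G^{2}C}{\sqrt{T}}\;\Bigg|_{C=D/G}
  =\frac{4DG}{\sqrt{T}}+\frac{11GD}{\sqrt{T}}
  =\frac{15GD}{\sqrt{T}}.
% Minimizing a/C + bC over C>0 with a=4D^2 and b=11G^2 gives C=\sqrt{a/b}:
C_{\min}=\frac{2}{\sqrt{11}}\cdot\frac{D}{G},
\qquad
\frac{4D^{2}}{C_{\min}\sqrt{T}}+\frac{11G^{2}C_{\min}}{\sqrt{T}}
  =\frac{4\sqrt{11}\,GD}{\sqrt{T}}\approx\frac{13.27\,GD}{\sqrt{T}}.
```

So the simple choice $C=\frac{D}{G}$ is within a small constant factor of the best value of $C$ for this bound.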

Remarks: (1) Note that the bounds on sub-optimality (for SGD and GD) are information theoretically optimal up to constants.

(2) Our result on the expected sub-optimality improves upon that of [1] by a multiplicative $\log T$ factor and our result on the high probability sub-optimality improves upon [2] by a multiplicative factor of $\log T\sqrt{\log\frac{1}{\delta}}$ . On the other hand, our step-size sequence requires a priori knowledge of $T$ . We conjecture that for any any-time algorithm (i.e., without a priori knowledge of $T$ ), an expected error rate of $\frac{GD\log T}{\sqrt{T}}$ is information theoretically optimal.

(3) The rate obtained above for last point of GD (in the deterministic setting) is also optimal in the gradient oracle model and to the best of our knowledge, is the first such result for last point of GD.

Step size sequence for strongly-convex functions: Let $F(\cdot)$ be $\lambda$ strongly convex (Assumption 3). Let $k:=\inf\{i:T\cdot 2^{-i}\leq 1\}$ . We pick $\alpha_{t}$ as follows:

$$\alpha_{t}=2^{-i}\frac{1}{\lambda t},\quad\forall\, T_{i}<t\leq T_{i+1},\ 0\leq i\leq k.\tag{4}$$

We now present our result for last point of SGD with strong-convexity assumption.

Theorem 2 (SGD Last Point for Strongly Convex Functions).

Let $F$ satisfy Assumptions 1 and 3. Then the following holds for the $T$ -th iterate of the SGD algorithm (Algorithm 1) when run with the step size sequence given in Equation (4):

$$\mathbb{E}[F(x_{T})]\leq F(x^{*})+\frac{130G^{2}}{\lambda T}.$$

Furthermore, the following holds for all $0<\delta\leq 1/e$ with probability at least $1-\delta$ :

[TABLE]

Under the same assumptions, GD update $(x_{t+1}=\Pi_{\mathcal{W}}(x_{t}-\alpha_{t}\nabla F(x_{t})))$ with the same step-size sequence given in (4) also ensures the following after $T$ iterations:

$$F(x_{T})\leq F(x^{*})+\frac{130G^{2}}{\lambda T}.$$

Here again, we note that the result is information theoretically optimal up to $\log(1/\delta)$ factor.

2.1 General Step Size Modification

Theorems 1 and 2 are consequences of our general results on step size modification that we present below. Consider SGD step size sequence $(\gamma_{t})_{t=1}^{T}$ . We obtain modified step size sequence $(\alpha_{t})_{t=1}^{T}$ as follows:

$$\alpha_{t}:=2^{-i}\gamma_{t}\quad\forall\, T_{i}<t\leq T_{i+1}\ \text{and}\ 0\leq i\leq k.\tag{5}$$

Under certain mild conditions, we will show that the last iterate of SGD with step size $\alpha_{t}$ is as good as the average iterate of SGD with step size $\gamma_{t}$ . We make these notions precise below:
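The modification scheme can be written as a generic transformation of any step size list. The sketch below assumes the rule $\alpha_{t}=2^{-i}\gamma_{t}$ for $T_{i}<t\leq T_{i+1}$ with $T_{i}=T-\lceil T\cdot 2^{-i}\rceil$ ; the helper names `phase_index` and `modify_step_sizes`, and the example values $\lambda=0.1$ and $T=16$ , are chosen only for illustration.

```python
def phase_index(t, T):
    """Largest i with T_i < t, where T_i = T - ceil(T / 2^i); t = T gives i = k."""
    i = 0
    # 2**i < T means i < k; the second test checks whether T_{i+1} < t
    while 2 ** i < T and T - (-(-T // 2 ** (i + 1))) < t:
        i += 1
    return i


def modify_step_sizes(gammas):
    """Map (gamma_1, ..., gamma_T) to alpha_t = 2^-i * gamma_t on phase i."""
    T = len(gammas)
    return [gamma * 2.0 ** (-phase_index(t, T))
            for t, gamma in enumerate(gammas, start=1)]


# Example: the strongly convex schedule gamma_t = 1 / (lambda * t)
lam, T = 0.1, 16
gammas = [1.0 / (lam * t) for t in range(1, T + 1)]
alphas = modify_step_sizes(gammas)
```

On the first phase ( $t\leq T_{1}$ ) the two sequences coincide, so the first half of the run is exactly the run with the original step sizes; only the later phases are damped.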

Assumption 4 (Slowly Decreasing Step Size Sequence).

We call a step size sequence $(\gamma_{t})$ ‘decreasing’ if $\gamma_{t+1}\leq\gamma_{t}$ . We say that step size sequence $\gamma_{t}$ has ‘at most polynomial decay’ with decay constant $0<\beta\leq 1$ if $\gamma_{2t}\geq\beta\gamma_{t}$ for every $t\geq 1$ .
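A few worked cases may help in reading Assumption 4. The sequences below are standard examples chosen for illustration; the decay constants follow directly from the definition $\gamma_{2t}\geq\beta\gamma_{t}$ .

```latex
\gamma_{t}=\frac{c}{\sqrt{t}}:\quad \frac{\gamma_{2t}}{\gamma_{t}}=\frac{1}{\sqrt{2}}
  \;\Longrightarrow\; \beta=\frac{1}{\sqrt{2}},
\qquad
\gamma_{t}=\frac{1}{\lambda t}:\quad \frac{\gamma_{2t}}{\gamma_{t}}=\frac{1}{2}
  \;\Longrightarrow\; \beta=\frac{1}{2},
\qquad
\gamma_{t}\equiv c:\quad \frac{\gamma_{2t}}{\gamma_{t}}=1
  \;\Longrightarrow\; \beta=1.
% A geometric sequence fails the assumption:
\gamma_{t}=\rho^{t},\ 0<\rho<1:\quad \frac{\gamma_{2t}}{\gamma_{t}}=\rho^{t}\to 0,
  \text{ so no fixed } \beta>0 \text{ works.}
```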

We have the following general theorem:

Theorem 3.

Let $(\gamma_{t})_{t=1}^{T}$ be a decreasing step size sequence with at most polynomial decay with decay constant $0<\beta\leq 1$ . Let the iterates of SGD with step size $\gamma_{t}$ be $y_{1},\dots,y_{T}$ . Let $\alpha_{t}$ be the modification of $\gamma_{t}$ as defined in Equation (5). Let the iterates of SGD with step size $\alpha_{t}$ be $x_{1},\dots,x_{T}$ . Then, for all $T\geq 4$ , we have:

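$$\mathbb{E}[F(x_{T})]\leq 5G^{2}\gamma_{T}\left(\frac{1}{\beta^{2}}+\frac{1}{\beta^{4}}\right)+\inf_{\lceil\frac{T}{4}\rceil\leq t\leq T_{1}}\mathbb{E}[F(y_{t})].$$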

We also give a high probability version of Theorem 3.

Theorem 4.

Let $T\geq 4$ . Let $q^{(0)}$ be any arbitrary fixed probability distribution over the set $\{\lceil\tfrac{T}{4}\rceil,\dots,T_{1}\}$ . With probability at least $1-\frac{\delta}{2}$ , we have:

[TABLE]

That is, the above theorems show that compared to any weighted average of function values of iterates in the $[T/4,T_{1}]$ iterations, the error is not significantly larger if $\beta$ is reasonably large and $\gamma_{T}$ is small. Now, using standard analysis, we can ensure small average function value for iterates in $[T/4,T_{1}]$ iterations. Small value of $\gamma_{T}$ and bound on $\beta$ hold trivially for standard step-size sequences.

See Section 3 for detailed proofs of the above theorems. We first develop general technique and prove key lemmas in the next section, and then present proofs for all the theorems.

2.2 Lower Bounds

The step size modification procedure described above assumed the knowledge of the last iterate $T$ (this is not a setback in practice). We study the case of infinite horizon SGD. In this section we state our bounds on the last iterate of ‘any time’ (infinite horizon) SGD in the case of strongly convex objectives. We will first introduce the notion of suboptimality that we consider. In particular, we look at two kinds of ‘bad performance’ in infinite horizon SGD for non-smooth strongly convex optimization. Consider any infinite step size sequence $\gamma_{t}$ .

1. The sequence $\gamma_{t}$ is said to be ‘bad in expectation’ if for an objective $F$ satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates $(x_{t})_{t\in\mathbb{N}}$ with step size $\gamma_{t}$ , there is a fixed subsequence $\{t_{k}\}_{k\in\mathbb{N}}$ such that $\lim_{k\to\infty}t_{k}\mathbb{E}[F(x_{t_{k}})-F(x^{*})]=\infty$ .

2. The sequence $\gamma_{t}$ is said to be ‘bad almost surely’ if for an objective $F$ satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates $(x_{t})_{t\in\mathbb{N}}$ with step size $\gamma_{t}$ , with probability $1$ there exists a random infinite sequence of times $\{t_{k}\}$ such that $\lim_{k\to\infty}t_{k}[F(x_{t_{k}})-F(x^{*})]=\infty$ .

We give a ‘no free lunch’ theorem: that is we show that infinite horizon step-size sequence for non-smooth strongly convex optimization is either ‘bad in expectation’ or ‘bad almost surely’. More precisely, we will show that if any infinite horizon SGD is good in ‘expectation’ for every $t$ for every strongly convex function, then it is ‘bad almost surely’ for some function $F$ .

Theorem 5.

Consider infinite horizon SGD with step size $\gamma_{t}$ such that Assumptions 1, 2 and 3 hold for the objective function. Then, for any choice of $\gamma_{t}>0$ , the algorithm is either bad in expectation or bad almost surely.

We give the proof in Section B.

3 Technical Ideas and Proofs

Recall the definition of $T_{i}$ from Section 2. The rough idea behind the proof is as follows: we will find a ‘good point’ in the range $\left[\lceil\tfrac{T}{4}\rceil,T_{1}\right]$ and then show that this implies that there is a ‘good point’ between $T_{1}\approx T/2$ and $T_{2}\approx 3T/4$ and so on, until we conclude that $x_{T}$ is a good point.

To this end, we first provide a key lemma that bounds the total weighted deviation of SGD iterates from a given iterate $x_{t_{0}}$ (in terms of function value), i.e., it intuitively shows that once we find an iterate with small function value, the remaining iterates cannot deviate from it significantly. The lemma uses a trick that was first used in [16] and then also in [1].

Lemma 1.

Let $x_{1},\dots,x_{T}$ be the output of SGD algorithm (Algorithm 1) with step size sequence $\alpha_{t}$ defined by (3). Then, given any $1<t_{0}<t_{1}\leq T$ ,

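$$\sum_{t=t_{0}}^{t_{1}}2\alpha_{t}\mathbb{E}[F(x_{t})-F(x_{t_{0}})]\leq\sum_{t=t_{0}}^{t_{1}}G^{2}\alpha_{t}^{2}.$$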

Proof.

By convexity of $\mathcal{W}$ , we have:

[TABLE]

Taking squares and expanding on both sides,

[TABLE]

Taking expectation on both sides, and realizing that $\hat{g}_{t}$ is independent of $x_{t}$ and $x_{t_{0}}$ , we conclude,

[TABLE]

Here we have used the fact that $\mathbb{E}\left[\hat{g}_{t}(x_{t})|x_{t},x_{t_{0}}\right]=g_{t}(x_{t})$ . Using convexity, $\langle g_{t},x_{t}-x_{t_{0}}\rangle$ is lower bounded by $F(x_{t})-F(x_{t_{0}})$ . We conclude that:

[TABLE]

The result now follows by summing the above term from $t=t_{0}$ to $t=t_{1}$ . ∎
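The chain of inequalities in the proof of Lemma 1 can be sketched as follows. This is a reconstruction from the surrounding text under the standard projected SGD analysis (non-expansiveness of $\Pi_{\mathcal{W}}$ , $\|\hat{g}_{t}\|\leq G$ , and convexity of $F$ ), not a verbatim copy of the paper's displays.

```latex
\begin{align*}
\|x_{t+1}-x_{t_{0}}\|
  &\leq \|x_{t}-\alpha_{t}\hat{g}_{t}(x_{t})-x_{t_{0}}\|
  && \text{(projection onto convex } \mathcal{W},\ x_{t_{0}}\in\mathcal{W}) \\
\|x_{t+1}-x_{t_{0}}\|^{2}
  &\leq \|x_{t}-x_{t_{0}}\|^{2}
     -2\alpha_{t}\langle \hat{g}_{t}(x_{t}),x_{t}-x_{t_{0}}\rangle
     +\alpha_{t}^{2}\|\hat{g}_{t}(x_{t})\|^{2} \\
\mathbb{E}\|x_{t+1}-x_{t_{0}}\|^{2}
  &\leq \mathbb{E}\|x_{t}-x_{t_{0}}\|^{2}
     -2\alpha_{t}\mathbb{E}\langle g_{t}(x_{t}),x_{t}-x_{t_{0}}\rangle
     +\alpha_{t}^{2}G^{2} \\
  &\leq \mathbb{E}\|x_{t}-x_{t_{0}}\|^{2}
     -2\alpha_{t}\mathbb{E}[F(x_{t})-F(x_{t_{0}})]
     +\alpha_{t}^{2}G^{2}
  && \text{(convexity of } F) \\
0\leq \mathbb{E}\|x_{t_{1}+1}-x_{t_{0}}\|^{2}
  &\leq \underbrace{\|x_{t_{0}}-x_{t_{0}}\|^{2}}_{=0}
     -\sum_{t=t_{0}}^{t_{1}}2\alpha_{t}\mathbb{E}[F(x_{t})-F(x_{t_{0}})]
     +\sum_{t=t_{0}}^{t_{1}}G^{2}\alpha_{t}^{2}
  && \text{(telescoping sum)}
\end{align*}
```

Rearranging the last line gives the statement of the lemma.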

We now provide a high probability version of Lemma 1. To this end, we construct an exponential super-martingale that when combined with a Chernoff bound leads to exponential concentration bound. The method used is somewhat similar to the one used in [2], but our technique is specifically for Lemma 1 and is more concise.

For simplicity of exposition, we first define a few key quantities. Let $1<t_{0}<t_{1}\leq T$ and $r=t_{1}-t_{0}+1$ . We define the sequence $L_{t}$ for $t_{0}\leq t\leq t_{1}$ as follows:

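$$L_{t_{1}}=\frac{1}{e\cdot r},\qquad L_{t-1}=L_{t}+L_{t}^{2}\quad\text{for } t_{0}\leq t-1<t_{1}.\tag{6}$$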

Using Lemma 3, $\frac{1}{e\cdot r}\leq L_{t}\leq\frac{1}{r}$ . Now, for any $l$ such that $t_{0}\leq l\leq t_{1}$ , we define the following random variables:

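$$A(l,t_{1}):=\sum_{t=l}^{t_{1}}L_{t}\left[2\alpha_{t}(F(x_{t})-F(x_{l}))-\alpha_{t}^{2}G^{2}\right],\quad A^{*}(t_{0},t_{1}):=\sum_{t=t_{0}}^{t_{1}}L_{t}\left[2\alpha_{t}(F(x_{t})-F(x^{*}))-\alpha_{t}^{2}G^{2}\right].\tag{7}$$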

We note the difference between $A^{*}(t_{0},t_{1})$ and $A(l,t_{1})$ : $A(l,t_{1})$ considers suboptimality with respect to $x_{l}$ whereas $A^{*}(t_{0},t_{1})$ considers the suboptimality with respect to the optimizer $x^{*}$ .

Lemma 2.

Let $A$ and $A^{*}$ be as defined by (7). Let $p(t_{0}),\dots,p(t_{1})$ be any probability distribution over $\{t_{0},\dots,t_{1}\}$ . We let $(p.A)(t_{0},t_{1}):=\sum_{l=t_{0}}^{t_{1}}p(l)A(l,t_{1})$ . Also, let $\alpha_{t}$ be a decreasing step size sequence. Then,

$$\mathbb{P}[(p.A)(t_{0},t_{1})>\eta]\leq\exp\left(-\frac{\eta}{8\alpha_{t_{0}}^{2}G^{2}}\right).$$

Additionally, if $\mathrm{diam}(\mathcal{W})\leq D$ almost surely, we have:

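$$\mathbb{P}[A^{*}(t_{0},t_{1})>\eta]\leq\exp\left(\frac{2D^{2}L_{t_{0}}}{8\alpha_{t_{0}}^{2}G^{2}}\right)\exp\left(\frac{-\eta}{8\alpha_{t_{0}}^{2}G^{2}}\right).$$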

Lemma 3.

Let $\Gamma>0$ be fixed. Let $\lambda_{0}=\frac{1}{re\Gamma}$ , $\lambda_{1}=\lambda_{0}+\Gamma\lambda_{0}^{2}$ , $\dots$ , $\lambda_{i+1}=\lambda_{i}+\Gamma\lambda^{2}_{i}$ . Then, for every $i\leq r$ , $\lambda_{i}\leq(1+\tfrac{1}{r})^{i}\lambda_{0}$
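The recursion in Lemma 3 is easy to check numerically. The sketch below assumes the statement exactly as written ( $\lambda_{0}=\frac{1}{re\Gamma}$ , $\lambda_{i+1}=\lambda_{i}+\Gamma\lambda_{i}^{2}$ , and the claim $\lambda_{i}\leq(1+\tfrac{1}{r})^{i}\lambda_{0}$ for $i\leq r$ ); the function name and the small relative tolerance for floating point rounding are choices made for this example.

```python
import math


def lemma3_holds(r, Gamma):
    """Check lambda_i <= (1 + 1/r)^i * lambda_0 for every 0 <= i <= r."""
    lam0 = 1.0 / (r * math.e * Gamma)
    lam = lam0
    for i in range(r + 1):
        bound = (1.0 + 1.0 / r) ** i * lam0
        if lam > bound * (1.0 + 1e-12):    # tolerance for rounding only
            return False
        lam = lam + Gamma * lam ** 2       # lambda_{i+1} = lambda_i + Gamma * lambda_i^2
    return True


checks = [lemma3_holds(r, Gamma) for r in range(1, 60) for Gamma in (0.1, 1.0, 10.0)]
print(all(checks))
```

Since $(1+\tfrac{1}{r})^{r}\leq e$ , the claim also yields $\lambda_{i}\leq\frac{1}{r\Gamma}$ for $i\leq r$ , which with $\Gamma=1$ is the upper bound $L_{t}\leq\frac{1}{r}$ used above.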

See Section A for proofs of the above given lemmata. We also require the following technical lemma:

Lemma 4.

Let $T_{i}$ be as defined in Section 2. Then, for all $0\leq i\leq k-1$ :

$$4(T_{i+2}-T_{i+1})\geq T_{i+1}-T_{i}.$$

Proof.

Lemma follows from the fact that $2\lceil a\rceil-1\leq\lceil 2a\rceil\leq 2\lceil a\rceil$ . ∎
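Lemma 4 can also be sanity-checked by brute force. The sketch below assumes the inequality $4(T_{i+2}-T_{i+1})\geq T_{i+1}-T_{i}$ for $0\leq i\leq k-1$ with $T_{i}=T-\lceil T\cdot 2^{-i}\rceil$ and $T_{k+1}=T$ ; the function names and the tested range of $T$ are arbitrary choices for illustration.

```python
def phase_boundaries(T):
    """Return [T_0, ..., T_{k+1}] with T_i = T - ceil(T / 2^i) and T_{k+1} = T."""
    k = 0
    while T > 2 ** k:                      # smallest k with T * 2^-k <= 1
        k += 1
    bounds = [T - (-(-T // 2 ** i)) for i in range(k + 1)]  # integer ceiling
    bounds.append(T)
    return bounds


def lemma4_holds(T):
    """Check 4 * (T_{i+2} - T_{i+1}) >= T_{i+1} - T_i for all 0 <= i <= k - 1."""
    b = phase_boundaries(T)
    k = len(b) - 2
    return all(4 * (b[i + 2] - b[i + 1]) >= b[i + 1] - b[i] for i in range(k))


print(all(lemma4_holds(T) for T in range(1, 3000)))
```

All arithmetic is on integers, so the check is exact for the tested values of $T$ .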

3.1 Step Size Modification

Henceforth, we will assume that $\gamma_{t}$ is a decreasing step size sequence with at most polynomial decay (decay constant being $\beta$ ). We let $\alpha_{t}$ be the modification of $\gamma_{t}$ as defined in Equation 5. Let,

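$$\tau_{i}:=\arg\inf_{T_{i}<t\leq T_{i+1}}\mathbb{E}[F(x_{t})],\ i\in[k+1],\quad\text{and}\quad\tau_{0}:=\arg\inf_{\lceil\frac{T}{4}\rceil\leq t\leq T_{1}}\mathbb{E}[F(x_{t})].\tag{8}$$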

Note that $\tau_{k+1}=T$ . We note that $\tau_{i}$ are completely deterministic and only used as part of the proof. The ability to compute $\tau_{i}$ is not necessary.

Lemma 5.

Let $x_{t}$ ’s be iterates of SGD (Algorithm 1) with modified step size sequence $\alpha_{t}$ of $\gamma_{t}$ defined in (5); $\gamma_{t}$ sequence satisfies Assumption 4. Let $T_{i},\ k$ be as defined by (2), and $\tau_{i}$ , $0\leq i\leq k+1$ be as defined in (8). Also, let $T\geq 4$ . Then, the following holds for all $i\in[k]$ :

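$$\mathbb{E}[F(x_{\tau_{i+1}})-F(x_{\tau_{i}})]\leq\frac{5G^{2}\gamma_{T}}{\beta^{2}}2^{-i},\quad\mathbb{E}[F(x_{\tau_{1}})-F(x_{\tau_{0}})]\leq\frac{5G^{2}\gamma_{T}}{\beta^{4}}.$$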

Proof.

We first consider $i\geq 1$ . If $\mathbb{E}[F(x_{\tau_{i+1}})]\leq\mathbb{E}[F(x_{\tau_{i}})]$ , the proof is done. Else, using Lemma 1 with $t_{0}=\tau_{i}$ and $t_{1}=T_{i+2}$ , and the fact that $\alpha_{t}$ is a decreasing sequence, we get:

[TABLE]

By definition of $\tau_{i}$ , $\mathbb{E}[F(x_{\tau_{i}})]\leq\mathbb{E}[F(x_{t})]$ whenever $T_{i}<t\leq T_{i+1}$ . Hence,

[TABLE]

where the first equality follows from the definition of $\alpha_{t}$ in (5), first inequality follows from Equation (9), and the final inequality follows from the fact that $\mathbb{E}[F(x_{t})-F(x_{\tau_{i}})]\geq 0$ when $T_{i}<t\leq T_{i+1}$ (see definition of $\tau_{i}$ in (8)).

Now, by using the above inequality with the assumption $\mathbb{E}[F(x_{\tau_{i+1}})]\geq\mathbb{E}[F(x_{\tau_{i}})]$ , and the fact that $T_{i+2}-T_{i}\geq T_{i+2}-\tau_{i}+1$ , we have:

[TABLE]

where $\zeta_{1}$ follows from Lemma 4. The equality follows from the definition of $\alpha_{t}$ and the last inequality follows from the $\beta$ -slowly decaying assumption for $\gamma_{t}$ (Assumption 4). That is, we obtain the result for the case $i\geq 1$ . The proof for the case when $i=0$ follows with minor modifications to the arguments given above. ∎

We now present a high probability version of Lemma 5.

Lemma 6.

Consider the setting of Lemma 5. Let $0\leq i\leq k$ and define $t_{0}=T_{i}+1$ for $1\leq i$ and $t_{0}=\lceil\tfrac{T}{4}\rceil$ for $i=0$ . Let $q^{(i)}$ be any probability distribution over $\{t_{0},\dots,T_{i+1}\}$ . Let $p^{(i+1)}(t):=\frac{L_{t}\alpha_{t}}{\sum_{s=T_{i+1}+1}^{T_{i+2}}L_{s}\alpha_{s}}$ , where $t\in[T_{i+1}+1,T_{i+2}]$ and the sequence $(L_{t})_{T_{i+1}+1}^{T_{i+2}}$ is defined by (6). Then, for any $\delta_{i}\in(0,1)$ and $i\in[1,k-1]$ , the following holds with probability at least $1-\delta_{i}$ :

[TABLE]

For $i=0$ , the following holds with probability at least $1-\delta_{0}$ :

[TABLE]

Proof.

We will only show the case $1\leq i\leq k-1$ . The $i=0$ case follows by a similar proof. For $T_{i}\leq t\leq T_{i+1}$ , we define $\Gamma(t)=\sum_{s=t+1}^{T_{i+2}}\alpha_{s}L_{s}$ . We let $\kappa$ be defined as follows over $\{T_{i}+1,\dots,T_{i+2}\}$ :

[TABLE]

From Lemma 7, we conclude that $\kappa$ is a probability distribution over $\{T_{i}+1,\dots,T_{i+2}\}$ . From Lemma 2, we conclude that with probability at least $1-\delta_{i}$ :

[TABLE]

We will show that when this event happens, the inequality in the statement of the lemma holds. If $\sum_{t=T_{i+1}+1}^{T_{i+2}}p^{(i+1)}(t)F(x_{t})\leq\sum_{s=T_{i}+1}^{T_{i+1}}q^{(i)}(s)F(x_{s})$ , then the statement of the lemma holds trivially. Now assume $\sum_{t=T_{i+1}+1}^{T_{i+2}}p^{(i+1)}(t)F(x_{t})>\sum_{s=T_{i}+1}^{T_{i+1}}q^{(i)}(s)F(x_{s})$ . We use the fact that $\kappa$ is supported over $\{T_{i}+1,\dots,T_{i+1}\}$ and hence:

[TABLE]

We exchange summation and collect the coefficients of the term $F(x_{t})$ to conclude:

[TABLE]

where $\sigma(s):=\sum_{t=T_{i}+1}^{s}\kappa(t)$ (empty sum being $0$ by definition). By definition, $\sigma(s)=\kappa(s)+\sigma(s-1)$ , $\Gamma(s-1)=\alpha_{s}L_{s}+\Gamma(s)$ and $\kappa(s)=\tfrac{\Gamma(T_{i+1})}{\Gamma(s)}q^{(i)}(s)+\tfrac{\alpha_{s}L_{s}}{\Gamma(s)}\sigma(s-1)$ . Therefore, we conclude:

[TABLE]

We recall that $p^{(i+1)}(t)\cdot\left(\sum_{s=T_{i+1}+1}^{T_{i+2}}\alpha_{s}L_{s}\right)=\alpha_{t}L_{t}$ whenever $T_{i+1}<t\leq T_{i+2}$ . The rest of the proof is similar to Equation (11) in Lemma 5. We use the facts that $\alpha_{t}$ is the modification of $\gamma_{t}$ , that $\gamma_{t}$ has at most polynomial decay, and that $\frac{1}{e(T_{i+2}-T_{i})}\leq L_{t}\leq\frac{1}{T_{i+2}-T_{i}}$ , together with Lemma 4, in Equation (14) to conclude the result. ∎

Lemma 7.

Let $\kappa$ be as defined in (12). Then, $\kappa$ is a probability distribution over $\{T_{i}+1,\dots,T_{i+2}\}$ .

The proof of this lemma is given in Appendix A.3.

3.2 Proof of Theorem 3

The result now follows by using Theorem 4 with $\beta=2$ and $q^{(0)}(t)=\frac{1}{T_{1}-\lceil\frac{T}{4}\rceil+1}$ .∎

4 Experiments

We now empirically compare the last point of SGD under our step-size sequence (Our Method) with the last point under the standard step-size sequence (Standard), as well as with the averaged iterates of SGD (Averaged). We apply these methods to two non-smooth problems: a) Lasso regression, b) linear SVM training.

Lasso Regression: We consider gradient descent for $F(x)=\frac{1}{n}\sum_{i=1}^{n}\|\langle a_{i},x\rangle-b_{i}\|^{2}+\lambda\|x\|_{1}$ for $x\in\mathbb{R}^{d}$ . Here $a_{i}\sim\mathcal{N}(0,I_{d})$ and $b_{i}=\langle a_{i},x^{*}\rangle+z_{i}$ for some $s$-sparse vector $x^{*}$ and $z_{i}\sim\mathcal{N}(0,\sigma^{2})$ . The $a_{i}$ and $z_{i}$ are all independent. We use step sizes $\gamma_{t}=\frac{C}{\sqrt{T}}$ and let $\alpha_{t}$ be the modification of $\gamma_{t}$ as given in Section 2 for a total of $T$ iterations.

Since the objective is not smooth, the gradient doesn’t vanish near the optimum. Therefore, when the standard step size was picked, the iterate $x_{t}$ kept oscillating around the optimum but never really reached it. In contrast, our method decreased the step size after some time, which allowed better convergence to the optimum (see Figure 1(a)).
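The oscillation described above can be reproduced on a one-dimensional stand-in, $F(x)=|x|$ on $[-1,1]$ ; this is a minimal sketch and not the Lasso experiment itself. It assumes that the modification scales the base step by $2^{-i}$ on the phase $T_{i}<t\leq T_{i+1}$ with $T_{i}=T-\lceil T2^{-i}\rceil$ . The helper names and the values $T=1024$ , $C=1$ , $x_{0}=0.7$ are illustrative choices, not taken from the paper.

```python
import math


def phase_boundaries(T):
    """Return [T_0, ..., T_k, T_{k+1}] with T_i = T - ceil(T * 2**-i) and T_{k+1} = T."""
    k = 0
    while T * 2.0 ** (-k) > 1:  # k = inf{i : T * 2**-i <= 1}
        k += 1
    bounds = [T - math.ceil(T * 2.0 ** (-i)) for i in range(k + 1)]
    bounds.append(T)
    return bounds


def modified_steps(gamma, T):
    """Assumed modification: alpha_t = 2**-i * gamma(t) for T_i < t <= T_{i+1}."""
    bounds = phase_boundaries(T)
    steps, i = [], 0
    for t in range(1, T + 1):
        while t > bounds[i + 1]:  # advance to the phase that contains t
            i += 1
        steps.append(gamma(t) * 2.0 ** (-i))
    return steps


def last_iterate_error(steps, x0=0.7):
    """Projected subgradient descent on F(x) = |x| over [-1, 1]; returns F(x_last)."""
    x = x0
    for a in steps:
        g = (x > 0) - (x < 0)  # a subgradient of |x| (taken to be 0 at x = 0)
        x = max(-1.0, min(1.0, x - a * g))
    return abs(x)


T, C = 1024, 1.0
gamma = lambda t: C / math.sqrt(T)  # standard constant step C / sqrt(T)
err_standard = last_iterate_error([gamma(t) for t in range(1, T + 1)])
err_modified = last_iterate_error(modified_steps(gamma, T))
print(err_standard, err_modified)
```

With the constant step the iterate ends up bouncing across $0$ at a distance comparable to $\gamma_{t}$ , while the halved steps of the later phases shrink that distance phase by phase.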

Training SVMs: We consider training SVMs which is a typical example where non-smooth SGD is heavily used [4]. For our experiments, we generate data as follows. Let $a_{i}\sim\mathcal{N}(0,\sigma^{2}I_{d})$ and the label $b_{i}=\mathrm{sgn}(a_{i}(1)+z_{i})$ where $z_{i}\sim\mathcal{N}(0,\eta^{2})$ . We generate $n=500$ points in $d=30$ dimensions. The SVM training problem is now:

[TABLE]

where $\lambda=0.1$ . Since the objective is $\lambda$-strongly convex, we consider step sizes of $\gamma_{t}:=\frac{1}{\lambda t}$ for the standard method and the modified step sizes given in Equation (4) for our method. Figure 1 (b) plots the loss during a typical run of SGD and Figure 1 (c) plots the loss averaged over $100$ independent runs of SGD for the same problem with the same initial point. The last point of SGD with the modified step size sequence (Our Method), in blue, consistently outperformed the standard SGD (Standard), in red. The green line denotes the loss of the average of the last $\frac{T}{4}$ iterates.

5 Conclusions and Discussion

We studied the fundamental question of sub-optimality of the last point of SGD/GD for general non-smooth convex functions as well as for strongly-convex functions. We proposed a novel step-size sequence that leads to information theoretically optimal rates in both of the above settings. We proved a more general result for any “modified step-size” of a decaying standard step-size, using a novel technique of tracking the best iterate in each time interval and ensuring that the later iterates do not significantly deviate from the best iterate in the previous time interval. We also provide a high-probability bound using a super-martingale technique from [2]. Simulations show that our step-size indeed leads to a better last point than the standard step-size sequences.

Our approach fundamentally exploits the assumption that we know the total number of iterations $T$ a priori. Hence, our result does not provide an any-time algorithm. In contrast, existing any-time results have an extra $\log T$ multiplicative factor in the sub-optimality. We conjecture that this gap is fundamental and every any-time algorithm would suffer from the extra $\log T$ factor. We give lower bounds for the strongly convex case to show that for any choice of step sizes, the algorithm is either sub-optimal in expectation or almost surely so infinitely often.

Acknowledgements

This research was partially supported by ONR N00014-17-1-2147 and MIT-IBM Watson AI Lab.

Appendix A Proofs of Technical Lemmas

A.1 Proof of Lemma 2

Proof of Lemma 2.

We fix $l$ such that $t_{0}\leq l\leq t_{1}$ . In this proof, we will freely use the fact that $\alpha_{t_{0}}\geq\alpha_{t}$ whenever $t_{0}\leq t$ . Let $l\leq t\leq t_{1}$ . Define $\Delta_{t}=\langle\hat{g}_{t}(x_{t})-g_{t}(x_{t}),x_{t}-x_{l}\rangle$ . We note that the $x_{t}$ are random variables and are functions of $\hat{g}_{1},\dots,\hat{g}_{t-1}$ only. We define the sigma-field $\mathcal{F}_{t}:=\sigma(\hat{g}_{1},\dots,\hat{g}_{t})$ .

We use the following notation for the sake of convenience: $D_{t}:=\|x_{t}-x_{l}\|^{2}$ . Clearly, $D_{t}$ is $\mathcal{F}_{t-1}$ measurable and $\Delta_{t}$ is $\mathcal{F}_{t}$ measurable. It is clear from the definition of $\Delta_{t}$ that $\mathbb{E}[\Delta_{t}|\mathcal{F}_{t-1}]=0$ and $|\Delta_{t}|\leq 2G\|x_{t}-x_{l}\|=2G\sqrt{D_{t}}$ .

By Hoeffding’s lemma, we conclude that for any $\mu\in\mathbb{R}$ :

[TABLE]

Let $\lambda=\frac{1}{8\alpha_{t_{0}}^{2}G^{2}}$ . For $t_{0}\leq t\leq t_{1}$ , consider

[TABLE]

Clearly, $M_{t}$ is $\mathcal{F}_{t}$ measurable. $M_{l}=1$ since $D_{l}=\Delta_{l}=0$ almost surely. We will show that $M_{t}$ is a super martingale:

[TABLE]

Therefore,

[TABLE]

From the proof of Lemma 1, for $l\leq t\leq t_{1}$ we have:

[TABLE]

In the third step, we have used the convexity of $F(\cdot)$ . Reordering Equation (18) and using the notation defined above:

[TABLE]

Multiplying the equation above by $L_{t}$ and adding from $t=l$ to $t=t_{1}$ , noting the fact that $D_{l}=0$ and $D_{t_{1}+1}\geq 0$ , we conclude:

[TABLE]

We recall the random variable

[TABLE]

From equations (17) and (19), we conclude that for every $l$ such that $t_{0}\leq l\leq t_{1}$ :

[TABLE]

By convexity of the exponential function, we have:

[TABLE]

By Chernoff Bound, we conclude:

[TABLE]

The case for $A^{*}(t_{0},t_{1})$ proceeds similarly but this time we use $x^{*}$ in place of $x_{t_{0}}$ . We define $D_{t}^{*}:=\|x_{t}-x^{*}\|^{2}$ , $\Delta_{t}^{*}:=\langle\hat{g}_{t}(x_{t})-g_{t}(x_{t}),x_{t}-x^{*}\rangle$ and

[TABLE]

We note that for $t_{0}<t\leq t_{1}$ , $\mathbb{E}\left[M^{*}_{t}\bigr{|}\mathcal{F}_{t-1}\right]\leq M^{*}_{t-1}$ and $D^{*}_{t_{0}}\leq D^{2}$ . Therefore,

[TABLE]

Here we have used the fact that $L_{t_{0}}\leq\frac{1}{t_{1}-t_{0}+1}\leq 1$ . Noting that $\exp(\lambda A^{*}(t_{0},t_{1}))\leq M^{*}_{t_{1}}$ , we use the Chernoff bound to conclude the result. ∎

A.2 Proof of Lemma 3

Proof of Lemma 3.

We prove this by induction. The assertion is true for $i=0$ . Suppose it is true for $i=k\leq r-1$ . Then,

[TABLE]

Thus we have proved the assertion through induction. ∎

A.3 Proof of Lemma 7

Proof of Lemma 7.

We take the definitions of the terms used from the proof of Lemma 6. It is clear from the definition that $\kappa(t)\geq 0$ . Since $\kappa(t)=0$ for $t>T_{i+1}$ , it is sufficient to show that $\sum_{s=T_{i}+1}^{T_{i+1}}\kappa(s)=1$ .

We define $\sigma(t)=\sum_{s=T_{i}+1}^{t}\kappa(s)$ (an empty sum denotes 0). By definition of $\kappa$ , for $T_{i}+1\leq t\leq T_{i+1}$

[TABLE]

Continuing the above recursion, we conclude:

[TABLE]

Since $q^{(i)}$ is a probability distribution over $\{T_{i}+1,\dots,T_{i+1}\}$ , we conclude

[TABLE]


∎
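The telescoping argument can be checked in exact arithmetic. The sketch below builds $\kappa$ from the recursive form quoted in the proof of Lemma 6, $\kappa(s)=\tfrac{\Gamma(T_{i+1})}{\Gamma(s)}q^{(i)}(s)+\tfrac{\alpha_{s}L_{s}}{\Gamma(s)}\sigma(s-1)$ , and confirms that it sums to $1$ . The weights standing in for $\alpha_{s}L_{s}$ and the distribution $q$ are arbitrary illustrative values; the function name is hypothetical.

```python
from fractions import Fraction


def kappa_from_recursion(q, w_first, w_second):
    """q: distribution on {T_i+1, ..., T_{i+1}}.
    w_first: weights alpha_s * L_s on that block; w_second: weights on
    {T_{i+1}+1, ..., T_{i+2}}. Returns kappa on the first block."""
    tail = sum(w_second)  # Gamma(T_{i+1})
    # Gamma(s) = sum of the weights strictly after s, up to T_{i+2}
    gammas, running = [], tail
    for w in reversed(w_first):
        gammas.append(running)
        running += w
    gammas.reverse()

    kappa, sigma = [], Fraction(0)  # sigma(s - 1), empty sum is 0
    for q_s, w_s, gamma_s in zip(q, w_first, gammas):
        k_s = tail / gamma_s * q_s + w_s / gamma_s * sigma
        kappa.append(k_s)
        sigma += k_s
    return kappa


q = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
w_first = [Fraction(3), Fraction(2), Fraction(1)]
w_second = [Fraction(1), Fraction(1)]
kappa = kappa_from_recursion(q, w_first, w_second)
print(kappa, sum(kappa))
```

Using `Fraction` avoids rounding, so the total is exactly $1$ for any positive weights and any distribution $q$ .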

Appendix B Proofs of Lower Bounds

We will prove Theorem 5 for $G=5$ and $\mu=1$ for the sake of convenience. We can handle the general case by considering the transformation $F_{0}(x)=\frac{25\mu}{G^{2}}F(\frac{G}{5\mu}x)$ . We scale the domain as $D_{0}:=\frac{5\mu}{G}D$ . If $F$ is $\mu$ strongly convex and $G$ Lipschitz, then $F_{0}$ is $1$ strongly convex and $5$ Lipschitz. We take the subgradient oracle for $F_{0}$ to be $\hat{g}^{0}_{t}(x):=\frac{5}{G}\hat{g}_{t}(\frac{Gx}{5\mu})$ . It is easy to check that if the iterates of SGD for $F(\cdot)$ with step sizes $\alpha_{t}$ are $x_{t}$ , then starting from $x^{0}_{0}:=\frac{5\mu}{G}x_{0}$ and using step sizes $\alpha^{0}_{t}:=\mu\alpha_{t}$ and the subgradient oracle defined above, the iterates for $F_{0}$ are $x_{t}^{0}=\frac{5\mu}{G}x_{t}$ . Therefore, $F_{0}(x_{t}^{0})=\frac{25\mu}{G^{2}}F(x_{t})$ and the proof below goes through seamlessly. This is similar to the rescaling used for the lower bounds in [2].
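The rescaling can be sanity-checked numerically. The following sketch runs projected SGD for an illustrative $F(x)=\frac{\mu}{2}x^{2}+c|x|$ with Rademacher noise, runs it again with the rescaled oracle, step sizes and domain, and compares the two trajectories. The constants, the noise model and the function names are assumptions made for the illustration only.

```python
import random


def run_sgd(x0, steps, oracle, radius):
    """Projected SGD on [-radius, radius]; returns the list of iterates."""
    xs = [x0]
    for t, a in enumerate(steps):
        x = xs[-1] - a * oracle(t, xs[-1])
        xs.append(max(-radius, min(radius, x)))
    return xs


mu, c, D = 2.0, 1.0, 3.0   # illustrative constants
G = mu * D + c + 1.0       # a bound on |g_hat| over [-D, D], noise included
T = 50
rng = random.Random(0)
noise = [rng.choice((-1.0, 1.0)) for _ in range(T)]
sign = lambda x: (x > 0) - (x < 0)

# Oracle for F(x) = mu * x**2 / 2 + c * |x|, and the rescaled oracle for F_0.
g_hat = lambda t, x: mu * x + c * sign(x) + noise[t]
g_hat0 = lambda t, x: (5.0 / G) * g_hat(t, G * x / (5.0 * mu))

scale = 5.0 * mu / G
alpha = [0.3 / (t + 1) for t in range(T)]
xs = run_sgd(1.3, alpha, g_hat, D)
xs0 = run_sgd(scale * 1.3, [mu * a for a in alpha], g_hat0, scale * D)

# Largest deviation between x_t^0 and (5 mu / G) * x_t along the run.
max_gap = max(abs(x0_t - scale * x_t) for x_t, x0_t in zip(xs, xs0))
print(max_gap)
```

The gap is at the level of floating-point rounding, which is what the identity $x_{t}^{0}=\frac{5\mu}{G}x_{t}$ predicts.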

Without loss of generality, we will restrict our attention to strictly positive step size sequences: $\gamma_{t}>0$ . We further restrict the possible values of $\gamma_{t}$ in the following lemma:

Lemma 8.

If the step size sequence $\gamma_{t}$ is such that there is an infinite sequence of times $t_{k}$ such that $\lim_{k\to\infty}t_{k}\gamma_{t_{k}}=\infty$ , then SGD is bad in expectation. Therefore, we can restrict our consideration to step size sequences of the form $\gamma_{t}=O(\frac{1}{t})$ .

Proof.

Consider the function $F:[-1,1]\to\mathbb{R}$ defined by $F(x)=|x|+\frac{x^{2}}{2}$ . $F$ has a global optimum at $x=0$ and it is $1$ strongly convex. Let $\epsilon_{t}$ be a sequence of i.i.d. Rademacher random variables (i.e., uniform over $\{-1,1\}$ ). We let the subgradient oracle return $\hat{g}_{t}(x)=\mathsf{sgn}(x)+x+3\epsilon_{t}$ . Clearly,

[TABLE]

$\epsilon_{t}$ is independent of $x_{t}$ and, conditioned on the value of $x_{t}$ , with probability at least $\frac{1}{2}$ , $\epsilon_{t}$ has the opposite sign to $x_{t}$ . When this happens, $(\mathsf{sgn}(x_{t})+x_{t}+3\epsilon_{t})$ has the opposite sign to $x_{t}$ and $|(\mathsf{sgn}(x_{t})+x_{t}+3\epsilon_{t})|\geq 1$ . Therefore, under this event, $|x_{t}-\gamma_{t}(\mathsf{sgn}(x_{t})+x_{t}+3\epsilon_{t})|\geq|x_{t}|+\gamma_{t}\geq\gamma_{t}$ .

Therefore, we conclude:

[TABLE]

Considering the fact that $(t_{k}+1)\mathbb{E}\left(F(x_{t_{k}+1})-F(0)\right)\geq t_{k}\mathbb{E}|x_{t_{k}+1}|\geq\frac{1}{2}\min(t_{k},\gamma_{t_{k}}t_{k})\to\infty$ , we conclude that SGD with this step size is bad in expectation.

∎
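The one-step claim used in this proof can be checked exhaustively on a grid. The sketch below applies one projected SGD step with the noise sign chosen opposite to $x_{t}$ and counts how often $|x_{t+1}|<\min(1,\gamma_{t})$ ; the $\min$ with $1$ accounts for the projection onto $[-1,1]$ . The grid and the list of step sizes are illustrative choices.

```python
def sgd_step(x, gamma, eps):
    """One projected SGD step on [-1, 1] for F(x) = |x| + x**2 / 2 with noise 3 * eps."""
    sgn = (x > 0) - (x < 0)
    g_hat = sgn + x + 3 * eps
    return max(-1.0, min(1.0, x - gamma * g_hat))


violations = 0
for i in range(-100, 101):
    x = i / 100.0
    for gamma in (0.01, 0.1, 0.5, 1.0, 2.0, 10.0):
        # Noise sign opposite to x; at x == 0 both signs qualify.
        opposite = (-1,) if x > 0 else (1,) if x < 0 else (-1, 1)
        for eps in opposite:
            if abs(sgd_step(x, gamma, eps)) < min(1.0, gamma) - 1e-12:
                violations += 1
print(violations)
```

No violations occur, in line with the bound $\mathbb{E}|x_{t+1}|\geq\frac{1}{2}\min(1,\gamma_{t})$ that the final display of the proof relies on.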

Henceforth, we will restrict our attention without loss of generality to step size sequences such that $\gamma_{t}=O(\frac{1}{t})$ . We will first consider the function $F_{1}(x)=\frac{1}{2}x^{2}$ over the set $[-1,1]$ . Let the infinite horizon learning rate be $(\gamma_{t})_{t\in\mathbb{N}}$ . At each time instant, the subgradient oracle returns $x+\epsilon_{t}$ , where $\epsilon_{t}$ is a sequence of i.i.d. Rademacher random variables (that is, uniform over $\{-1,1\}$ ). Let the iterates of SGD be $z_{t}$ , with $z_{1}=1$ .

Lemma 9.

Let $T_{0}$ be the smallest time such that $\gamma_{t}<1$ for all $t\geq T_{0}$ . Then, for every $t\geq T_{0}$

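1. $\mathbb{E}|z_{t+1}|^{2}=(1-\gamma_{t})^{2}\mathbb{E}|z_{t}|^{2}+\gamma_{t}^{2}$

2. $\mathbb{E}|z_{t}|^{2}\geq\frac{1}{t-T_{0}+1}$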

Proof.

Suppose $\gamma_{t}\geq 1$ . Then:

[TABLE]

Therefore, when $\gamma_{t}\geq 1$ , $|z_{t+1}|=1$ . Therefore, $z_{T_{0}}=1$ almost surely. When $t\geq T_{0}$ ,

[TABLE]

Therefore, when $t\geq T_{0}$ , the iteration of SGD won’t leave the set $[-1,1]$ almost surely, so there is no need for the projection step to obtain the next iterate. That is, for $t\geq T_{0}$ , $z_{t+1}=z_{t}(1-\gamma_{t})+\epsilon_{t}\gamma_{t}$ . Squaring and taking expectations, we conclude:

[TABLE]

Clearly, $\mathbb{E}|z_{T_{0}}|^{2}=1=\frac{1}{T_{0}-T_{0}+1}$ . Using induction in the equation above, we conclude: $\mathbb{E}|z_{t}|^{2}\geq\frac{1}{t-T_{0}+1}$ for every $t\geq T_{0}$ .

∎
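The induction step rests on the elementary fact that, for $a\geq 0$ , the map $\gamma\mapsto(1-\gamma)^{2}a+\gamma^{2}$ is minimized at $\gamma=\frac{a}{1+a}$ with minimum value $\frac{a}{1+a}$ , so $a\geq\frac{1}{m}$ gives a value of at least $\frac{1}{m+1}$ . The sketch below iterates the second-moment recursion exactly for the illustrative schedule $\gamma_{t}=\frac{1}{t}$ (for which $T_{0}=2$ ) and compares it with the lower bound $\frac{1}{t-T_{0}+1}$ . The function name is hypothetical.

```python
from fractions import Fraction


def second_moments(gammas, T0, T):
    """Exact E|z_t|^2 for T0 <= t <= T via
    E z_{t+1}^2 = (1 - g_t)^2 * E z_t^2 + g_t^2, starting from E z_{T0}^2 = 1.
    `gammas(t)` must return a Fraction in [0, 1)."""
    m = {T0: Fraction(1)}
    for t in range(T0, T):
        g = gammas(t)
        m[t + 1] = (1 - g) ** 2 * m[t] + g ** 2
    return m


T0, T = 2, 50  # with gamma_t = 1/t, gamma_t < 1 exactly for t >= 2
moments = second_moments(lambda t: Fraction(1, t), T0, T)
lower = {t: Fraction(1, t - T0 + 1) for t in moments}
print(moments[T], lower[T])
```

For $\gamma_{t}=\frac{1}{t}$ the recursion gives $\mathbb{E}|z_{t}|^{2}=\frac{1}{t-1}$ , so the bound of the lemma is attained with equality; any other schedule with $\gamma_{t}<1$ can only give a larger second moment.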

We divide $\mathbb{N}$ into time intervals of the form $\{2^{k},2^{k}+1,\dots,2^{k+1}-1\}:=I_{k}$ . We have the following lemma:

Lemma 10.

If $\gamma_{t}\leq\frac{C}{t}$ for some constant $C\geq 0$ and there exist positive infinite sequences $c_{k}$ and $d_{k}$ such that $\lim_{k\to\infty}c_{k}=\infty$ , $\lim_{k\to\infty}d_{k}=0$ and, for every $k$ , at least one of the two conditions below holds:

1. $\sum_{t\in I_{k}}\gamma_{t}^{2}\geq c_{k}2^{-k}\left(\sum_{t\in I_{k}}\gamma_{t}\right)^{2}$

2. $\sum_{t\in I_{k}}\gamma_{t}\leq d_{k}$

Then, SGD with step size $\gamma_{t}$ is bad in expectation.

Proof.

We consider the optimization problem considered in Lemma 9, i.e., optimizing $F_{1}(x)=\frac{1}{2}x^{2}$ . Let $T_{k}=2^{k}$ . We assume the contrary, that is, $\mathbb{E}|z_{T_{k}}|^{2}\leq\frac{L}{T_{k}}$ for every $k$ , for some $L>0$ . As shown in the second inequality of Lemma 9, irrespective of the choice of $\gamma_{1},\dots,\gamma_{T_{k}-1}$ ,

[TABLE]

From the first equality in Lemma 9, we conclude that for $t\in I_{k}$ , $\mathbb{E}|z_{t+1}|^{2}=(1-\gamma_{t})^{2}\mathbb{E}z_{t}^{2}+\gamma_{t}^{2}$ . Since $\gamma_{t}\leq\frac{C}{t}$ , we can take $k$ large enough so that $\gamma_{t}\leq\frac{1}{2}$ for every $t\in I_{k}$ . For such $t$ , we have $(1-\gamma_{t})\geq\exp\left(-\frac{\gamma_{t}}{1-\gamma_{t}}\right)\geq\exp(-2\gamma_{t})$ . Therefore,

[TABLE]

Unravelling the recursion above, we conclude:

[TABLE]

We define $S_{k}:=\sum_{t\in I_{k}}\gamma_{t}$ .

Suppose for a particular $k$ , the first item in the statement of the lemma holds.

By assumption, $\sum_{t\in I_{k}}\gamma_{t}^{2}\geq\frac{c_{k}}{T_{k}}S_{k}^{2}$ . Using this in Equation (20), we conclude:

[TABLE]

Now, since $\gamma_{t}\leq\frac{C}{t}$ , we have $S_{k}\leq C$ . Therefore,

[TABLE]

In the third step, we have used the fact that $T_{k}\mathbb{E}|z_{T_{k}}|^{2}\leq L$ . We now consider the function $h:\mathbb{R}^{+}\to\mathbb{R}^{+}$ given by $h(x)=e^{-4x}+\kappa x^{2}$ for some $\kappa>0$ . Clearly, $h$ is convex, bounded below and tends to infinity as $x\to\infty$ . Therefore, it has a unique minimizer $t^{*}$ , the unique point such that $h^{\prime}(t^{*})=0$ . That is, $t^{*}$ is the unique point which satisfies $\kappa t^{*}=2e^{-4t^{*}}\leq 2$ . Therefore, $t^{*}\leq\frac{2}{\kappa}$ and $h(t^{*})\geq e^{-4t^{*}}\geq e^{-8/\kappa}\geq 1-\frac{8}{\kappa}$ . Taking $\kappa=\frac{c_{k}e^{-4C}}{L}$ in Equation (23), we conclude:

[TABLE]

where $C^{\prime}$ is a constant depending only on $L$ and $C$ .

Suppose for a particular $k$ , the second item in the statement of the lemma holds. Then, by Equation (20), we have:

[TABLE]

From Equations (21) and (22), we conclude that there exists a constant $\bar{C}$ depending only on $C$ and $L$ such that:

[TABLE]

Since $\max\left(d_{k},\tfrac{1}{c_{k}}\right)\to 0$ , we can choose $k$ large enough so that $\sup_{s>k}\bar{C}\max\left(d_{s},\tfrac{1}{c_{s}}\right)\leq 1-e^{-\epsilon}$ for arbitrary $\epsilon>0$ .

From Equation (23), it follows that for arbitrary $K\in\mathbb{N}$ ,

[TABLE]

By Lemma 9, $\mathbb{E}|z_{T_{k}}|^{2}\geq\frac{1}{T_{k}-1}\geq 2^{-k}$ . By our assumption, $\mathbb{E}|z_{T_{k+K}}|^{2}\leq L2^{-k-K}$ . Therefore, we conclude: $L2^{-k-K}\geq e^{-\epsilon K}2^{-k}$ for every $K\in\mathbb{N}$ . This cannot hold for any finite $L$ when we take $\epsilon<\log{2}$ . This contradicts our assumption. Therefore, SGD with step size $\gamma_{t}$ is bad in expectation.

∎

We will show that if the conditions for $\gamma_{t}$ in Lemma 8 and those in Lemma 10 don’t hold, then SGD is bad almost surely. We recall the definition of the interval $I_{k}=\{2^{k},\dots,2^{k+1}-1\}$ . We prove the following lemma to inspect how frequently long, contiguous segments of $\epsilon_{t}$ are all equal to $1$ for $t\in I_{k}$ . We take $\tau_{k}:=2^{\lfloor\log_{2}(k/2)\rfloor}$ . We note that $\frac{k}{4}\leq\tau_{k}\leq\frac{k}{2}$ . We can divide $I_{k}$ into $|I_{k}|/\tau_{k}$ contiguous, disjoint intervals, each of size $\tau_{k}$ . We call these intervals $J_{k}(i)$ for $i\in\{1,\dots,|I_{k}|/\tau_{k}\}$ . We let $A_{k}$ be the event that for some $i\in\{1,\dots,|I_{k}|/\tau_{k}\}$ , $\epsilon_{t}=1$ for all $t\in J_{k}(i)$ . In particular, the event $A_{k}$ implies that there is a contiguous $\tau_{k}$ length sequence of $\epsilon_{t}$ of all $1$ s in $I_{k}$ .

Lemma 11.

$\mathbb{P}(A_{k}^{c})\leq Ck2^{-k/2}$ for some absolute constant $C$ .

Proof.

We subdivide the interval $I_{k}$ into disjoint subintervals of length $\tau_{k}$ . There are $\tfrac{2^{k}}{\tau_{k}}$ such intervals. The event $A_{k}$ holds if over one such subinterval, the random signs are all $1$ . The probability of a given subinterval having all signs equal to $1$ is $p_{\tau_{k}}:=\tfrac{1}{2^{\tau_{k}}}$ . Therefore, we conclude:

[TABLE]

Here we have used the inequality $xe^{-x}\leq\frac{1}{e}$ for $x>0$ .

Therefore, we conclude that: $\mathbb{P}(A_{k}^{c})\leq Ck2^{-k/2}$ for some absolute constant $C$ . ∎
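The bound can be compared with the exact probability, which is $(1-2^{-\tau_{k}})^{2^{k}/\tau_{k}}$ because the $2^{k}/\tau_{k}$ blocks are disjoint and the signs are independent. The sketch below evaluates both sides; the constant $C=\frac{1}{2e}$ is one admissible value that falls out of $e^{-x}\leq\frac{1}{ex}$ together with $\tau_{k}\leq\frac{k}{2}$ , and is an assumption of this illustration rather than a constant stated in the paper.

```python
import math


def prob_no_all_ones_block(k):
    """Exact P(A_k^c): none of the 2**k / tau_k disjoint blocks of length tau_k is all +1."""
    tau = 1 << ((k // 2).bit_length() - 1)  # tau_k = 2**floor(log2(k / 2)), for k >= 2
    n_blocks = 2 ** k // tau
    # (1 - 2**-tau) ** n_blocks, computed through logs for numerical stability
    return math.exp(n_blocks * math.log1p(-2.0 ** (-tau)))


for k in (2, 4, 8, 16, 32):
    bound = k * 2.0 ** (-k / 2) / (2 * math.e)  # C * k * 2**(-k/2) with C = 1/(2e)
    print(k, prob_no_all_ones_block(k), bound)
```

The exact probability decays doubly exponentially in $k$ , far faster than the stated bound, which is all that the subsequent argument needs.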

We now consider the same function which was considered in Lemma 8, i.e., $F:[-1,1]\to\mathbb{R}$ defined by $F(x)=|x|+\frac{x^{2}}{2}$ . $F$ has a global optimum at $x=0$ and it is $1$ strongly convex. Let $\epsilon_{t}$ be a sequence of i.i.d. Rademacher random variables (i.e., uniform over $\{-1,1\}$ ). We let the subgradient oracle return $\hat{g}_{t}(x)=\mathsf{sgn}(x)+x+3\epsilon_{t}$ . Let the iterates of SGD for $F$ with step sizes $\gamma_{t}$ be $y_{t}$ .

Lemma 12.

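Suppose $\gamma_{t}\leq\frac{C}{t}$ and there exist an infinite sequence $(k_{r})_{r\in\mathbb{N}}$ and fixed constants $c_{0},d_{0}>0$ such that both of the following conditions hold for every $r$ :

1. $\sum_{t\in I_{k_{r}}}\gamma_{t}^{2}\leq c_{0}2^{-k_{r}}\left(\sum_{t\in I_{k_{r}}}\gamma_{t}\right)^{2}$

2. $\sum_{t\in I_{k_{r}}}\gamma_{t}\geq d_{0}$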

We note that these conditions are the negations of the conditions for $\gamma_{t}$ in Lemma 8 and Lemma 10. Then SGD with step size $\gamma_{t}$ is bad almost surely.

Proof.

We will show that there exists a sequence of independent events $B_{k_{r}}$ for $r\in\mathbb{N}$ such that $\mathbb{P}(B_{k_{r}})\geq p_{0}>0$ uniformly and whenever $B_{k_{r}}$ holds,

[TABLE]

for some constant $\delta_{0}>0$ . We note that $p_{0}$ and $\delta_{0}$ depend only on $C,d_{0}$ and $c_{0}$ . We define random times $T_{\mathsf{max}},T_{\mathsf{min}}\in I_{k}$ as follows:

1. If the event $A_{k}^{c}$ holds, pick a uniformly random element $i_{0}$ from $\{1,\dots,|I_{k}|/\tau_{k}\}$ independent of everything else. Set $T_{\mathsf{max}}:=\max J_{k}(i_{0})$ and $T_{\mathsf{min}}:=\min J_{k}(i_{0})$ .

2. If the event $A_{k}$ holds, pick a uniformly random element $i_{0}$ from $\{i:\text{ for all }t\in J_{k}(i),\epsilon_{t}=1\}$ , independent of everything else. Set $T_{\mathsf{max}}:=\max J_{k}(i_{0})$ and $T_{\mathsf{min}}:=\min J_{k}(i_{0})$ .

We note that by symmetry, $i_{0}$ is uniformly distributed over the set $\{1,\dots,|I_{k}|/\tau_{k}\}$ . We will show that, when the event $A_{k}$ holds, then one of the following is true:

1. $y\left(T_{\mathsf{max}}\right)=-1$ .

2. $y\left(T_{\mathsf{min}}\right)-y\left(T_{\mathsf{max}}\right)\geq\sum_{t\in J_{k}(i_{0})}\gamma_{t}$ .

Suppose the event $A_{k}$ holds. Then for $T_{\mathsf{min}}\leq t<T_{\mathsf{max}}$ , $y_{t+1}=\max(y_{t}-\gamma_{t}(y_{t}+\mathsf{sgn}(y_{t})+3),-1)$ . Since under the event $A_{k}$ , $\epsilon_{t}=1$ for every $t\in J_{k}(i_{0})$ , we conclude that $\gamma_{t}(y_{t}+\mathsf{sgn}(y_{t})+3)\geq\gamma_{t}$ . That is, SGD drifts in the negative direction irrespective of the value of the iterate. It is therefore clear that if for some $T_{\mathsf{min}}\leq t\leq T_{\mathsf{max}}$ , $y_{t}$ hits $-1$ , then $y\left(T_{\mathsf{max}}\right)=-1$ . Now suppose that $y_{t}>-1$ for every $t$ in this range. Then, $y_{t+1}\leq y_{t}-\gamma_{t}$ . Unraveling this recursion, it follows that $y\left(T_{\mathsf{min}}\right)-y\left(T_{\mathsf{max}}\right)\geq\sum_{t\in J_{k}(i_{0})}\gamma_{t}$ . Therefore, it follows that when the event $A_{k}$ holds:

[TABLE]

It is clear that since $\gamma_{t}\leq C/t$ , for $k$ large enough, $\tfrac{1}{2}\sum_{t\in J_{k}(i_{0})}\gamma_{t}\leq 1$ . Therefore, we conclude that for $k$ large enough, when the event $A_{k}$ holds,

[TABLE]
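The drift dichotomy under an all-ones block can be simulated directly. The sketch below iterates $y\leftarrow\max(y-\gamma_{t}(y+\mathsf{sgn}(y)+3),-1)$ over a block of consecutive times and checks that the iterate either ends at $-1$ or has dropped by at least the sum of the step sizes applied. The block, the constant $C=1$ and the starting points are illustrative, and the function name is hypothetical.

```python
def run_all_ones(y_start, gammas):
    """Projected SGD for F(y) = |y| + y**2 / 2 when every noise sign is +1.
    The drift is always negative, so only the lower projection at -1 can bind."""
    y = y_start
    for g in gammas:
        sgn = (y > 0) - (y < 0)
        y = max(y - g * (y + sgn + 3), -1.0)
    return y


C = 1.0
block = range(1024, 1024 + 8)  # a stand-in for J_k(i_0) with tau_k = 8
gammas = [C / t for t in block]

checks = []
for y_start in (-0.999, -0.5, 0.0, 0.3, 1.0):
    y_end = run_all_ones(y_start, gammas)
    hit_boundary = y_end == -1.0
    dropped = y_start - y_end >= sum(gammas) - 1e-12
    checks.append(hit_boundary or dropped)
    print(y_start, y_end, hit_boundary, dropped)
```

Starting very close to $-1$ triggers the first alternative, while the other starting points fall under the second one.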

Fix $0<\beta<1$ . We now consider $E_{k}$ to be the event $\bigr{\{}\sum_{t\in J_{k}(i_{0})}\gamma_{t}\geq\frac{\beta\tau_{k}}{|I_{k}|}\sum_{t\in I_{k}}\gamma_{t}\bigr{\}}$ .

By symmetry, $i_{0}$ is uniformly distributed over $\{1,\dots,|I_{k}|/\tau_{k}\}$ . Therefore,

[TABLE]

and

[TABLE]

Now, when $k$ is part of the infinite sequence $(k_{r})$ , by assumption we have:

[TABLE]

Therefore, by the Paley-Zygmund inequality, whenever $k$ is part of the infinite sequence $(k_{r})$ , for every $\beta<1$ ,

[TABLE]

Recalling the definition of $E_{k}$ , we conclude, $\mathbb{P}(E_{k_{r}})\geq\frac{(1-\beta)^{2}}{c_{0}}$ .
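The Paley-Zygmund step can be illustrated on a concrete sequence. With $Z=\sum_{t\in J_{k}(i_{0})}\gamma_{t}$ and $i_{0}$ uniform over the blocks, the inequality reads $\mathbb{P}(Z\geq\beta\mathbb{E}Z)\geq(1-\beta)^{2}\frac{(\mathbb{E}Z)^{2}}{\mathbb{E}Z^{2}}$ . The sketch below computes both sides exactly over the finite set of blocks; the schedule $\gamma_{t}=\frac{1}{t}$ , the values $k=10$ , $\tau=4$ , $\beta=\frac{1}{2}$ and the function name are illustrative assumptions.

```python
def paley_zygmund_check(gammas, tau, beta):
    """gammas: step sizes on I_k (length divisible by tau).
    Returns (P(Z >= beta * E Z), Paley-Zygmund lower bound), where Z is the
    sum of gammas over a uniformly random block of length tau."""
    blocks = [sum(gammas[j:j + tau]) for j in range(0, len(gammas), tau)]
    n = len(blocks)
    mean = sum(blocks) / n
    second = sum(z * z for z in blocks) / n
    prob = sum(z >= beta * mean for z in blocks) / n
    return prob, (1 - beta) ** 2 * mean ** 2 / second


k, tau = 10, 4
gammas = [1.0 / t for t in range(2 ** k, 2 ** (k + 1))]
prob, bound = paley_zygmund_check(gammas, tau, beta=0.5)
print(prob, bound)

# A spiky schedule, where most blocks carry no mass, makes the bound less slack.
spiky_prob, spiky_bound = paley_zygmund_check([1.0] + [0.0] * 15, 4, beta=0.5)
print(spiky_prob, spiky_bound)
```

The ratio $\frac{(\mathbb{E}Z)^{2}}{\mathbb{E}Z^{2}}$ is what the assumption on $\sum_{t\in I_{k}}\gamma_{t}^{2}$ controls: a sequence whose mass is spread evenly across $I_{k}$ keeps it bounded away from zero.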

We will now define the event $B_{k}:=E_{k}\cap A_{k}$ . The events $B_{k}$ are all independent by definition. When the event $B_{k}$ holds, clearly, from equation (24), we conclude:

[TABLE]

The second inequality follows from the definition of $E_{k}$ . Using the fact that any $t\in I_{k_{r}}$ is such that $t\leq 2|I_{k_{r}}|$ and $\tau_{k_{r}}=\Theta(k_{r})$ , we conclude that for some fixed $\delta_{0}>0$ , the following holds whenever the event $B_{k_{r}}$ holds.

[TABLE]

It is clear that we can find a $p_{0}>0$ such that for all $k_{r}$ large enough, $\mathbb{P}(B_{k_{r}})>p_{0}$ .

Since the events $B_{k_{r}}$ are independent and each has probability at least $p_{0}>0$ , the second Borel-Cantelli lemma implies that infinitely many of them occur with probability $1$ . From equation (25), we conclude that SGD with step sizes $\gamma_{t}$ is bad almost surely. ∎

Proof of Theorem 5.

We will conclude this from Lemmas 8, 10 and 12. It is sufficient to show that any strictly positive infinite sequence $\gamma_{t}$ is such that at least one of the following conditions holds:

1. There is an infinite sequence of times $t_{k}$ such that $\lim_{k\to\infty}t_{k}\gamma_{t_{k}}=\infty$ . In this case, by Lemma 8, we conclude that it is bad in expectation.

2. There exists a $C$ such that $\gamma_{t}\leq\frac{C}{t}$ and there exist infinite sequences $c_{k}\to\infty$ and $d_{k}\to 0$ such that for every $k$ , either $\sum_{t\in I_{k}}\gamma_{t}^{2}\geq c_{k}2^{-k}\left(\sum_{t\in I_{k}}\gamma_{t}\right)^{2}$ or $\sum_{t\in I_{k}}\gamma_{t}\leq d_{k}$ . In this case, by Lemma 10, we conclude that it is bad in expectation.

3. There exists a $C$ such that $\gamma_{t}\leq\frac{C}{t}$ and there exist fixed positive constants $c_{0}$ and $d_{0}$ such that for some infinite sub-sequence $(k_{r})$ , $\sum_{t\in I_{k_{r}}}\gamma_{t}^{2}\leq c_{0}2^{-k_{r}}\left(\sum_{t\in I_{k_{r}}}\gamma_{t}\right)^{2}$ and $\sum_{t\in I_{k_{r}}}\gamma_{t}\geq d_{0}$ . In this case, by Lemma 12, we conclude that the algorithm is bad almost surely.

It is therefore sufficient to show that if conditions 1 and 2 don’t hold then condition 3 holds. The negation of condition 1 is that $\gamma_{t}\leq\frac{C}{t}$ for some $C>0$ . Now, we denote by

[TABLE]

and

[TABLE]

Therefore, $\eta_{k}\geq c_{k}$ or $\lambda_{k}\leq d_{k}$ for some $c_{k}\to\infty$ and $d_{k}\to 0$ is equivalent to $\eta_{k}+\frac{1}{\lambda_{k}}\to\infty$ , which is equivalent to the statement that for every subsequence $k_{r}$ , $\eta_{k_{r}}+\frac{1}{\lambda_{k_{r}}}\to\infty$ . Therefore the negation of condition 2 is equivalent to at least one of the following conditions being true:

1. There exists an infinite sequence $(t_{k})$ such that $t_{k}\gamma_{t_{k}}\to\infty$ .

2. There exists an infinite subsequence $k_{r}$ such that $\eta_{k_{r}}+\frac{1}{\lambda_{k_{r}}}\leq M$ for some $M>0$ . That is, $\eta_{k_{r}}\leq M:=c_{0}$ and $\lambda_{k_{r}}\geq\frac{1}{M}:=d_{0}$ .

Therefore we conclude that when neither of the conditions 1 and 2 holds, condition 3 holds. This proves our result.
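To see which case a familiar schedule falls under, the two block statistics can be evaluated numerically. The sketch below assumes the definitions $\lambda_{k}:=\sum_{t\in I_{k}}\gamma_{t}$ and $\eta_{k}:=2^{k}\sum_{t\in I_{k}}\gamma_{t}^{2}/\lambda_{k}^{2}$ , which is our reading of how conditions 2 and 3 are phrased, and takes $I_{k}=\{2^{k},\dots,2^{k+1}-1\}$ . The schedule $\gamma_{t}=\frac{1}{t}$ and the function name are illustrative.

```python
def eta_lambda(gamma, k):
    """Assumed definitions: lambda_k = sum_{t in I_k} gamma_t and
    eta_k = 2**k * sum_{t in I_k} gamma_t**2 / lambda_k**2,
    with I_k = {2**k, ..., 2**(k+1) - 1}."""
    block = range(2 ** k, 2 ** (k + 1))
    lam = sum(gamma(t) for t in block)
    eta = 2 ** k * sum(gamma(t) ** 2 for t in block) / lam ** 2
    return eta, lam


for k in (5, 10, 15):
    eta, lam = eta_lambda(lambda t: 1.0 / t, k)
    print(k, round(eta, 3), round(lam, 3))
```

For $\gamma_{t}=\frac{1}{t}$ , $\lambda_{k}$ stays above $\log 2$ and $\eta_{k}$ stays bounded, so under these assumed definitions the schedule satisfies condition 3 with fixed $c_{0}$ and $d_{0}$ . By the Cauchy-Schwarz inequality $\eta_{k}\geq 1$ for any positive schedule, so $\eta_{k}$ can only fail to be bounded by growing.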

∎

Bibliography


[1] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
[2] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. arXiv preprint arXiv:1812.05217, 2018.
[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[4] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[5] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
[6] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
[7] Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47–1, 2012.
[8] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
