Making the Last Iterate of SGD Information Theoretically Optimal
Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli

TL;DR
This paper introduces new step size sequences for SGD and GD that achieve information-theoretic optimal bounds on the suboptimality of the last iterate, addressing a longstanding gap in theoretical understanding and practical performance.
Contribution
It proposes a novel modification scheme for step size sequences that ensures the last iterate's suboptimality matches the optimal bounds, improving theoretical guarantees and practical results.
Findings
New step size sequences achieve optimal last iterate bounds
Modified sequences match average-case guarantees for last point
Simulations confirm significant improvement over standard step sizes
Making the Last Iterate of SGD
Information Theoretically Optimal
Prateek Jain
Microsoft Research
Bengaluru, India
Dheeraj Nagaraj
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, USA 02139
Praneeth Netrapalli
Microsoft Research
Bengaluru, India
Abstract
Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for the last point of SGD [1], however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. [2] shows that this additional $\log T$ factor is in fact tight for the standard step size sequences of $\Theta(1/\sqrt{t})$ and $\Theta(1/t)$ for the non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) applied to non-smooth, convex functions, the best known step-size sequences still lead to $O(\log T)$-suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes to another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence. We also show that our result holds with high probability. We validate our results through simulations which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.
Accepted for presentation at the Conference on Learning Theory (COLT) 2019
Keywords: Stochastic Gradient Descent, Machine Learning, Convex Optimization
1 Introduction
Stochastic Gradient Descent (SGD) is one of the most popular algorithms for solving large-scale empirical risk minimization (ERM) problems [3, 4, 5]. The algorithm updates the iterates using stochastic gradients obtained by sampling data points uniformly at random. The algorithm has been studied for several decades [6], but there are still significant gaps between practical implementations and theoretical analyses. In particular, the standard analyses hold only for some kind of average of iterates, but most practitioners just use the final iterate of SGD. So, [7] asked the natural question of whether the final iterate of SGD, as opposed to an average of iterates, is provably good. It was partly answered in [1], which gave a sub-optimality bound for the last point of SGD, but the obtained sub-optimality rates are worse than the information theoretically optimal rates by an $O(\log T)$ factor, where $T$ is the number of iterations.
[2] showed that the above result is tight for the standard step-size sequence used by most existing theoretical results. The extra logarithmic factor is not due to the stochastic nature of SGD. In fact, even for subgradient descent (GD) applied to general non-smooth, convex functions, the last point's convergence rates are sub-optimal by an $O(\log T)$ factor.
So, this work addresses the following two fundamental questions:
“Does there exist a step-size sequence for which the last point of SGD when applied to general convex functions as well as to strongly-convex functions has optimal error (sub-optimality) rate?”, and,
“Does there exist a step-size sequence for which the last point of GD when applied to general non-smooth convex functions has optimal error (sub-optimality) rate?”
In this paper, we answer both questions in the affirmative. That is, we provide novel step size sequences and show that the final iterate of SGD run with these step size sequences has the information theoretically optimal error (suboptimality) rate. In particular, for general non-smooth convex functions, our results ensure an error rate of $O(1/\sqrt{T})$, and for strongly-convex functions, the error rate is $O(1/T)$. We also present high-probability versions, i.e., we show that with probability at least $1-\delta$, the suboptimality is $O(\sqrt{\log(1/\delta)/T})$ and $O(\log(1/\delta)/T)$ respectively (see Theorems 1 and 2). For GD, we show that a similarly modified step-size sequence leads to suboptimality of $O(1/\sqrt{T})$ for non-smooth convex functions and $O(1/T)$ for non-smooth strongly convex functions, which is optimal.
In general, SGD takes the iterates near the optimum value, but since the objective isn’t smooth near the optimizer $x^*$, the gradients don’t become small even when the points are close to $x^*$. Standard step sizes don’t decay fast enough with time to ensure fast convergence to $x^*$. Therefore the iterates $x_t$, after getting close to $x^*$, start oscillating around it without actually approaching it (see Section 4 for concrete examples). Our new step sizes, given in Section 2.1, decay fast enough after a certain point, making the iterates go closer to the optimum $x^*$. The exact mode of this decay ensures that the last iterate approaches the optimum at the information theoretic rate.
Our results utilize a general step size modification scheme which ensures that the upper bounds for the average function value with the original step sizes gets transferred to the last iterate when the modified step sizes are used (see Theorems 3 and 4). A key technical contribution of the paper is the proof of Theorem 2 that constructs a sequence of averaging schemes which are ‘good’ with high probability such that the last averaging scheme consists only of the last iterate and hence lets us conclude that the last iterate is ‘good’ with high probability.
Our new step-size sequence requires that the number of iterations or horizon $T$ is known a priori. In contrast, standard step-size sequences do not require $T$ a priori, and hence guarantee any-time results. Knowing $T$ a priori helps us ensure that we do not drop the step size too early; only after we are close to the optimum does the step size drop rapidly. In fact, we conjecture that in the absence of a priori information about $T$, no step-size sequence can ensure the information theoretically optimal error rates for the final iterate of SGD. As a step towards proving this, we show that in the case of strongly convex objectives, any choice of step sizes with infinite horizon (i.e., without knowledge of the total number of iterations) is either suboptimal almost surely or suboptimal in expectation for infinitely many points. We show this in Theorem 5.
Related Work: Averaging was first used in the stochastic approximation setting by [8] to show optimal rates of convergence. Gradient descent type methods have been shown to achieve information theoretically optimal error rates in the convex and strongly convex settings when averaging of iterates is used ([9], [10], [11], [12], Epoch GD in [13], SGD in [14] and [15]). The question of the last iterate was first considered in [1], which gives bounds of $O(\log T/\sqrt{T})$ and $O(\log T/T)$ in expectation for the general case and the strongly convex case respectively. [2] show matching high probability bounds and show that for the standard step sizes ($1/\sqrt{t}$ in the general case and $1/t$ in the strongly convex case), the logarithmic-suboptimal bounds are tight.
Organization: The setting and main results are presented in Section 2. In particular, Section 2.1 describes the general step size modification considered and states key results regarding this modification and the lower bound is presented in Section 2.2. Key technical ideas are developed in Section 3 and the main theorems are proved. We present some experimental results in Section 4 and conclude in Section 5. Skipped proofs of technical lemmas are given in the appendix.
2 Problem Setup and Main Results
Consider the following optimization problem:
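$$\min_{x \in \mathcal{W}} F(x), \tag{1}$$
where the objective function $F : \mathbb{R}^d \to \mathbb{R}$ is a convex function and $\mathcal{W} \subseteq \mathbb{R}^d$ is a closed convex set. Let the global minimizer of $F$ over $\mathcal{W}$ be $x^*$. We start the SGD algorithm at a point $x_1 \in \mathcal{W}$ and iteratively obtain estimates $x_t$ for the minimizer of $F$. We assume that at each time step $t$, we have access to an independent, unbiased estimate $\hat{g}_t$ of a subgradient $g_t \in \partial F(x_t)$. That is, $\mathbb{E}[\hat{g}_t \mid x_t] = g_t$ for every $t$, with the estimates drawn independently across time steps. We pick step sizes $\alpha_t > 0$. Let $\Pi_{\mathcal{W}}$ be the projection operator onto the set $\mathcal{W}$. The SGD algorithm (Algorithm 1) is given by the update
$$x_{t+1} = \Pi_{\mathcal{W}}\left(x_t - \alpha_t \hat{g}_t\right).$$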
Henceforth, we will retain the assumptions made above. Whenever we use , it is implied that . Throughout the paper, we assume that is a Lipschitz continuous convex function.
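The projected update described above can be illustrated with a small sketch. The snippet below assumes, purely for illustration, that the feasible set is a Euclidean ball (so the projection has a closed form) and that the objective is the $\ell_1$ norm with bounded zero-mean noise added to a subgradient. The names `project_onto_ball`, `projected_sgd`, and `make_noisy_l1_oracle` are hypothetical and do not come from the paper.

```python
import math
import random


def project_onto_ball(x, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius} (assumed feasible set)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def projected_sgd(x1, step_sizes, oracle, radius):
    """Apply x <- Proj(x - alpha * g_hat) once per step size; return all iterates."""
    iterates = [list(x1)]
    for alpha in step_sizes:
        x = iterates[-1]
        g_hat = oracle(x)  # stochastic subgradient estimate at x
        moved = [xi - alpha * gi for xi, gi in zip(x, g_hat)]
        iterates.append(project_onto_ball(moved, radius))
    return iterates


def make_noisy_l1_oracle(seed):
    """Subgradient of ||x||_1 plus bounded, zero-mean noise (illustrative choice)."""
    rng = random.Random(seed)

    def oracle(x):
        return [((v > 0) - (v < 0)) + rng.uniform(-0.5, 0.5) for v in x]

    return oracle


iterates = projected_sgd([1.0, -1.0], [0.1] * 50, make_noisy_l1_oracle(0), radius=2.0)
print(len(iterates), iterates[-1])
```

The noise is kept bounded so that the stochastic subgradients have bounded norm, in the spirit of the Lipschitz assumption stated next; any other bounded, unbiased oracle could be substituted.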
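Assumption 1 (Lipschitz Continuity).
$F$ is a $G$-Lipschitz continuous convex function over the closed convex set $\mathcal{W}$, i.e., $\|g\| \le G$ for every $x \in \mathcal{W}$ and every $g \in \partial F(x)$. Furthermore, the stochastic gradients satisfy $\|\hat{g}_t\| \le G$ almost surely for every $t$.
Assumption 2 (Closed and bounded set).
The diameter of the closed convex set $\mathcal{W}$ is bounded by $D$, i.e., $\sup_{x, y \in \mathcal{W}} \|x - y\| \le D$.
Assumption 3 (Strong convexity).
Let $\lambda > 0$. A convex function $F$ is said to be $\lambda$-strongly convex over $\mathcal{W}$ iff $F(y) \ge F(x) + \langle g, y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for every $x, y \in \mathcal{W}$ and every $g \in \partial F(x)$.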
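Step size sequence for general convex functions: we first define,
$$k := \inf\{i : T \cdot 2^{-i} \le 1\}, \qquad T_i := T - \lceil T \cdot 2^{-i} \rceil \ \text{ for } 0 \le i \le k, \qquad T_{k+1} := T. \tag{2}$$
Clearly, $0 = T_0 < T_1 < \dots < T_k < T_{k+1} = T$. We note in particular that $T_k = T - 1$. Let $C > 0$ be arbitrary. Then, we choose the step size $\alpha_t$ as follows:
$$\alpha_t := \frac{C \cdot 2^{-i}}{\sqrt{T}} \quad \text{when } T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{3}$$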
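As a concrete illustration of this schedule, the sketch below computes the phase boundaries and the resulting step sizes. It assumes the reading $k = \inf\{i : T \cdot 2^{-i} \le 1\}$, $T_i = T - \lceil T \cdot 2^{-i} \rceil$ for $0 \le i \le k$, $T_{k+1} = T$, and $\alpha_t = C \cdot 2^{-i}/\sqrt{T}$ for $T_i < t \le T_{i+1}$; the function names are hypothetical.

```python
import math


def phase_boundaries(T):
    """Return [T_0, ..., T_k, T_{k+1}] with T_i = T - ceil(T / 2^i) and T_{k+1} = T."""
    k = 0
    while T > 2 ** k:  # smallest k with T * 2^{-k} <= 1
        k += 1
    # -(-T // m) is integer ceiling division, which avoids floating-point rounding
    return [T - (-(-T // 2 ** i)) for i in range(k + 1)] + [T]


def last_iterate_step_sizes(T, C):
    """Step sizes alpha_1, ..., alpha_T: C / sqrt(T), halved at every phase boundary."""
    bounds = phase_boundaries(T)
    alphas = []
    for i in range(len(bounds) - 1):
        phase_length = bounds[i + 1] - bounds[i]
        alphas.extend([C * 2.0 ** (-i) / math.sqrt(T)] * phase_length)
    return alphas


bounds = phase_boundaries(10)
alphas = last_iterate_step_sizes(10, C=1.0)
print(bounds)
print(alphas)
```

Roughly the first half of the run uses the constant step $C/\sqrt{T}$; each later phase is about half as long as the previous one and uses half its step size.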
The theorem below provides suboptimality guarantee for the SGD algorithm with the step-size sequence mentioned above.
Theorem 1 (SGD/GD Last Point for General Convex Functions).
Let Assumptions 1 and 2 hold. Given , let be the iterates of SGD (Algorithm 1) with step size as defined in Equation (3). Then, the following holds for all :
[TABLE]
In particular, if we choose , we have: Furthermore, the following holds w.p. for any :
[TABLE]
Finally, under the same assumptions, GD update with the same step-size sequence given in (3) also ensures the following after iterations:
[TABLE]
We will prove this theorem in Section 3 after developing some general ideas.
Remarks: (1) Note that the bounds on sub-optimality (for SGD and GD) are information theoretically optimal up to constants.
(2) Our result on the expected sub-optimality improves upon that of [1] by a multiplicative factor and our result on the high probability sub-optimality improves upon [2] by a multiplicative factor of . On the other hand, our step-size sequence requires apriori knowledge of . We conjecture that for any-time algorithm (i.e., without apriori knowledge of ) expected error rate of is information theoretically optimal.
(3) The rate obtained above for last point of GD (in the deterministic setting) is also optimal in the gradient oracle model and to the best of our knowledge, is the first such result for last point of GD.
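Step size sequence for strongly-convex functions: Let $F$ be $\lambda$-strongly convex (Assumption 3). With $k$ and $T_i$ as defined in (2), we pick $\alpha_t$ as follows:
$$\alpha_t := \frac{2^{-i}}{\lambda t} \quad \text{when } T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{4}$$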
We now present our result for last point of SGD with strong-convexity assumption.
Theorem 2 (SGD Last Point for Strongly Convex Functions).
Let $F$ satisfy Assumptions 1 and 3. Then the following holds for the $T$-th iterate of the SGD algorithm (Algorithm 1) when run with the step size sequence given in Equation (4):
[TABLE]
Furthermore, the following holds for all with probability at least :
[TABLE]
Under the same assumptions, GD update with the same step-size sequence given in (4) also ensures the following after iterations:
[TABLE]
Here again, we note that the result is information theoretically optimal up to factor.
2.1 General Step Size Modification
Theorems 1 and 2 are consequences of our general results on step size modification that we present below. Consider an SGD step size sequence $\gamma_t$. We obtain the modified step size sequence $\alpha_t$ as follows:
$$\alpha_t := 2^{-i} \gamma_t \quad \text{when } T_i < t \le T_{i+1},\ 0 \le i \le k. \tag{5}$$
Under certain mild conditions, we will show that the last iterate of SGD with step size $\alpha_t$ is as good as the average iterate of SGD with step size $\gamma_t$. We make these notions precise below:
Assumption 4 (Slowly Decreasing Step Size Sequence).
We call a step size sequence ‘decreasing’ if . We say that step size sequence has ‘at most polynomial decay’ with decay constant if for every .
We have the following general theorem:
Theorem 3.
Let be a decreasing step size sequence with at most polynomial decay with decay constant . Let the iterates of SGD with step size be . Let be the modification of as defined in Equation (5). Let the iterates of SGD with step size be . Then, for all , we have:
[TABLE]
We also give a high probability version of Theorem 3.
Theorem 4.
Let . Let be any arbitrary fixed probability distribution over the set . With probability at least , we have:
[TABLE]
That is, the above theorems show that compared to any weighted average of function values of iterates in the iterations, the error is not significantly larger if is reasonably large and is small. Now, using standard analysis, we can ensure small average function value for iterates in iterations. Small value of and bound on hold trivially for standard step-size sequences.
See Section 3 for detailed proofs of the above theorems. We first develop general technique and prove key lemmas in the next section, and then present proofs for all the theorems.
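The modification scheme of this section can be sketched in a few lines. The snippet assumes the scheme multiplies the original step size at time $t$ by $2^{-i}$ whenever $T_i < t \le T_{i+1}$, with $T_i = T - \lceil T \cdot 2^{-i} \rceil$ and $T_{k+1} = T$; `modify_step_sizes` is a hypothetical helper name, and the usage example applies it to the $1/(\lambda t)$ sequence commonly used for strongly convex objectives.

```python
def phase_boundaries(T):
    """Return [T_0, ..., T_k, T_{k+1}] with T_i = T - ceil(T / 2^i) and T_{k+1} = T."""
    k = 0
    while T > 2 ** k:
        k += 1
    return [T - (-(-T // 2 ** i)) for i in range(k + 1)] + [T]


def modify_step_sizes(gamma):
    """gamma[t - 1] is the original step size at time t; returns the modified list."""
    T = len(gamma)
    bounds = phase_boundaries(T)
    alpha = []
    for i in range(len(bounds) - 1):
        # times t with T_i < t <= T_{i+1} belong to phase i
        for t in range(bounds[i] + 1, bounds[i + 1] + 1):
            alpha.append(gamma[t - 1] * 2.0 ** (-i))
    return alpha


lam = 1.0  # illustrative strong convexity parameter
gamma = [1.0 / (lam * t) for t in range(1, 9)]
alpha = modify_step_sizes(gamma)
print(alpha)
```

The first half of the sequence is left untouched, so any guarantee for the original sequence over that range carries over unchanged; only the tail is damped.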
2.2 Lower Bounds
The step size modification procedure described above assumed the knowledge of the last iterate (this is not a setback in practice). We study the case of infinite horizon SGD. In this section we state our bounds on the last iterate of ‘any time’ (infinite horizon) SGD in the case of strongly convex objectives. We will first introduce the notion of suboptimality that we consider. In particular, we look at two kinds of ‘bad performance’ in infinite horizon SGD for non-smooth strongly convex optimization. Consider any infinite step size sequence .
1. The sequence is said to be ‘bad in expectation’ if for an objective satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates with step size , there is a fixed subsequence such that .
2. The sequence is said to be ‘bad almost surely’ if for an objective satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates with step size , with probability there exists a random infinite sequence of times such that
We give a ‘no free lunch’ theorem: that is, we show that any infinite horizon step-size sequence for non-smooth strongly convex optimization is either ‘bad in expectation’ or ‘bad almost surely’. More precisely, we will show that if an infinite horizon SGD is good ‘in expectation’ for every strongly convex function, then it is ‘bad almost surely’ for some function $F$.
Theorem 5.
Consider infinite horizon SGD with a step size sequence such that Assumptions 1, 2 and 3 hold for the objective function. Then, for any choice of the step size sequence, the algorithm is either bad in expectation or bad almost surely.
We give the proof in Section B.
3 Technical Ideas and Proofs
Recall the definition of from Section 2. The rough idea behind the proof is as follows: we will find a ‘good point’ in the range and then show that this implies that there is a ‘good point’ between and and so on, until we conclude that is a good point.
To this end, we first provide a key lemma that bounds the total weighted deviation of SGD iterates from a given iterate (in terms of function value), i.e., it intuitively shows that once we find an iterate with small function value, the remaining iterates cannot deviate from it significantly. The lemma uses a trick that was first used in [16] and then also in [1].
Lemma 1.
Let be the output of SGD algorithm (Algorithm 1) with step size sequence defined by (3). Then, given any ,
[TABLE]
Proof.
By convexity of , we have:
[TABLE]
Taking squares and expanding on both sides,
[TABLE]
Taking expectation on both sides, and realizing that is independent of and , we conclude,
[TABLE]
Here we have used the fact that . Using convexity, is lower bounded by . We conclude that:
[TABLE]
The result now follows by summing the above term from to . ∎
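For readers who want the chain of inequalities in one place, the steps of this proof can be written out as follows. This is a sketch of the standard argument under assumed notation ($x_{t_0}$ for the reference iterate, $\hat{g}_t$ for the stochastic subgradient, $g_t = \mathbb{E}[\hat{g}_t \mid x_t]$, and $\|\hat{g}_t\| \le G$); the exact indices and constants in the lemma's displayed equations may differ.

```latex
\begin{align*}
\|x_{t+1} - x_{t_0}\|^2
  &\le \|x_t - \alpha_t \hat{g}_t - x_{t_0}\|^2
  && \text{(projection onto a convex set is non-expansive)} \\
  &= \|x_t - x_{t_0}\|^2 - 2\alpha_t \langle \hat{g}_t,\, x_t - x_{t_0} \rangle
     + \alpha_t^2 \|\hat{g}_t\|^2 , \\
\mathbb{E}\|x_{t+1} - x_{t_0}\|^2
  &\le \mathbb{E}\|x_t - x_{t_0}\|^2
     - 2\alpha_t\, \mathbb{E}\langle g_t,\, x_t - x_{t_0} \rangle + \alpha_t^2 G^2 \\
  &\le \mathbb{E}\|x_t - x_{t_0}\|^2
     - 2\alpha_t\, \mathbb{E}\left[F(x_t) - F(x_{t_0})\right] + \alpha_t^2 G^2
  && \text{(convexity of } F\text{)} , \\
\sum_{t=t_0}^{T} \alpha_t\, \mathbb{E}\left[F(x_t) - F(x_{t_0})\right]
  &\le \frac{G^2}{2} \sum_{t=t_0}^{T} \alpha_t^2
  && \text{(telescoping, using } \|x_{t_0} - x_{t_0}\| = 0\text{)} .
\end{align*}
```

The last line drops the non-negative term $\mathbb{E}\|x_{T+1} - x_{t_0}\|^2$ that remains after telescoping.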
We now provide a high probability version of Lemma 1. To this end, we construct an exponential super-martingale that when combined with a Chernoff bound leads to exponential concentration bound. The method used is somewhat similar to the one used in [2], but our technique is specifically for Lemma 1 and is more concise.
For simplicity of exposition, we first define a few key quantities. Let and . We define the sequence as follows: for as follows:
[TABLE]
Using Lemma 3, . Now, for any such that , we define the following random variables :
[TABLE]
We note the difference between and : considers suboptimality with respect to whereas considers the suboptimality with respect to the optimizer .
Lemma 2.
Let and be as defined by (7). Let be any probability distribution over . We let . Also, let be a decreasing step size sequence. Then,
[TABLE]
Additionally, if almost surely, we have:
[TABLE]
Lemma 3.
Let be fixed. Let , , , . Then, for every ,
See Section A for proofs of the above given lemmata. We also require the following technical lemma:
Lemma 4.
Let be as defined in Section 2. Then, for all :
[TABLE]
Proof.
Lemma follows from the fact that . ∎
3.1 Step Size Modification
Henceforth, we will assume that is a decreasing step size sequence with at most polynomial decay (decay constant being ). We let be the modification of as defined in Equation 5. Let,
[TABLE]
Note that . We note that are completely deterministic and only used as part of the proof. The ability to compute is not necessary.
Lemma 5.
Let ’s be iterates of SGD (Algorithm 1) with modified step size sequence of defined in (5); sequence satisfies Assumption 4. Let be as defined by (2), and , be as defined in (8). Also, let . Then, the following holds for all :
[TABLE]
Proof.
We first consider . If , the proof is done. Else, using Lemma 1 with and , and the fact that is a decreasing sequence, we get:
[TABLE]
By definition of , whenever . Hence,
[TABLE]
where the first equality follows from the definition of in (5), first inequality follows from Equation (9), and the final inequality follows from the fact that when (see definition of in (8)).
Now, by using the above inequality with the assumption , and the fact that , we have:
[TABLE]
where follows from Lemma 4. The equality follows from definition of and the last inequality follows from the -slowly decaying assumption for (Assumption 4). That is, we obtain the result for the case . The proof for the case when follows with minor modifications to the arguments given above. ∎
We now present a high probability version of Lemma 5.
Lemma 6.
Consider the setting of Lemma 5. Let and define for and for . Let be any probability distribution over . Let , where and the sequence is defined by (6). Then, for any and , the following holds with probability at least :
[TABLE]
For , the following holds with probability at least :
[TABLE]
Proof.
We will only show the case . The case follows by a similar proof. For , we define . We let be defined as follows over :
[TABLE]
From Lemma 7, we conclude that is a probability distribution over . From Lemma 2, we conclude that with probability at least :
[TABLE]
We will show that when this event happens, the inequality in the statement of the lemma holds. If , then the statement of the lemma holds trivially. Now assume . We use the fact that is supported over and hence:
[TABLE]
We exchange summation and collect the coefficients of the term to conclude:
[TABLE]
where (empty sum being $0$ by definition). By definition of , and . Therefore, we conclude:
[TABLE]
We recall that whenever . The rest of the proof is similar to Equation (11) in Lemma 5. We use the fact that is the modification of , has at most polynomial decay, and Lemma 4 in Equation (14) to conclude the result. ∎
Lemma 7.
Let be as defined in (12). Then, is a probability distribution over .
The proof of this lemma is given in Section A.
3.2 Proof of Theorem 3
Proof.
Recall the definition of in (8). Clearly, . Summing the bounds in Lemma 5 we conclude:
[TABLE]
We conclude the result by noting that for all . ∎
3.3 Proof of Theorem 4
Proof.
This proof is similar to the proof of Theorem 3, but instead of Lemma 5 we use Lemma 6. In Lemma 6, we pick for and we let be arbitrary. We let . By union bound, the inequalities in the statement of Lemma 6 hold for all simultaneously with probability at least . Summing all these inequalities, we conclude:
[TABLE]
We note that the distribution has unit mass over the point and that when to conclude the result. ∎
3.4 Proof of Theorem 1
Proof.
We note that the step size defined in Equation (3) is the modification of the standard step size . Let be the output of SGD under the assumptions of the theorem when step size is used. Using the fact that infimum is smaller than any weighted average, we have:
[TABLE]
where the second line follows from . follows from the standard analysis [6]. Final inequality follows from the fact that . We note that satisfies the conditions for Theorem 3 with . We invoke Theorem 3 to conclude the bound on expectation. The above proof in expectation also works for GD . We take and SGD is the same as GD. Here each and is a deterministic point mass. Therefore, the expectation bound for the last iterate of SGD holds for the last iterate of GD.
We will now prove the high probability bound. Let , , and for . Then using Lemma 2, the following holds with probability at least :
[TABLE]
Using and proceeding similarly as above, we have w.p. ,
[TABLE]
Theorem now follows by using Theorem 4 with , , and union bound. ∎
3.5 Proof of Theorem 2
Proof.
We note that the step size defined in Equation (4) is the modification of the standard step size used for strongly convex functions (see [14]). Let be the output of SGD when step size is used. From Theorem 5 in [14], we conclude that:
[TABLE]
The expectation bound follows from using above equation with Theorem 3 and noting that satisfies the required conditions with . We get high probability bounds by invoking high probability bounds for suffix averaging from [2], i.e., w.p. at least ,
[TABLE]
The result now follows by using Theorem 4 with and .∎
4 Experiments
We now empirically compare SGD last point with our step-size sequence (Our Method) with the standard steps size sequence (Standard) as well as the averaged iterates of SGD (Averaged). We apply these methods on two non-smooth problems: a) Lasso regression, b) linear SVM training.
Lasso Regression: We consider gradient descent for for . Here and for some sparse vector and . and are all independent. We use the step sizes of and let be the modification of as given in Section 2 for total iterations.
Since the objective is not smooth, the gradient doesn’t vanish near the optimum. Therefore, when the standard step size was used, the iterates kept oscillating around the minimizer without ever really reaching it. In contrast, our method decreased the step size after some time, which allows better convergence to the optimum (see Figure 1(a)).
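The oscillation described above can be reproduced on a toy problem. The sketch below is not the Lasso experiment; it is an illustrative stand-in that runs deterministic subgradient descent on $F(x) = |x|$ in one dimension, once with a constant step $1/\sqrt{T}$ and once with that step halved at each phase boundary $T_i = T - \lceil T \cdot 2^{-i} \rceil$ (our assumed reading of the modification). The starting point, horizon, and function names are arbitrary choices for illustration.

```python
import math


def sign(v):
    """A subgradient of |x| at v (0 is chosen at the kink)."""
    return (v > 0) - (v < 0)


def run_subgradient_descent(x1, step_sizes):
    """Apply x <- x - alpha * sign(x) for each step size and return the final point."""
    x = x1
    for alpha in step_sizes:
        x = x - alpha * sign(x)
    return x


def halving_schedule(T, base):
    """Constant step `base`, halved at each boundary T_i = T - ceil(T / 2^i)."""
    k = 0
    while T > 2 ** k:
        k += 1
    bounds = [T - (-(-T // 2 ** i)) for i in range(k + 1)] + [T]
    steps = []
    for i in range(k + 1):
        steps.extend([base * 2.0 ** (-i)] * (bounds[i + 1] - bounds[i]))
    return steps


T = 100
base = 1.0 / math.sqrt(T)
constant_final = run_subgradient_descent(0.95, [base] * T)
modified_final = run_subgradient_descent(0.95, halving_schedule(T, base))
print(abs(constant_final), abs(modified_final))
```

With the constant step, the iterate ends up bouncing across the kink at a distance comparable to the step size, whereas under the halving schedule each phase roughly halves the attainable distance, so the final point is within the last (much smaller) step size of the minimizer.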
Training SVMs: We consider training SVMs which is a typical example where non-smooth SGD is heavily used [4]. For our experiments, we generate data as follows. Let and the label where . We generate points in dimensions. The SVM training problem is now:
[TABLE]
where . Since the objective is strongly convex, we consider step sizes of for the standard method and the modified step sizes given in Equation (4) for our method. Figure 1 (b) plots loss during a typical run of SGD and Figure 1 (c) for the loss averaged over independent runs of SGD for the same problem with the same initial point. The last point of SGD with modified step size sequence (Our Method) in blue consistently outperformed the standard SGD (Standard) in red. The green line denotes the loss of the average of the last iterates.
5 Conclusions and Discussion
We studied the fundamental question of sub-optimality of the last point of SGD/GD for general non-smooth convex functions as well as for strongly-convex functions. We proposed a novel step-size sequence that leads to information theoretically optimal rates in both the above mentioned settings. Our result proves a more general result for any “modified step-size” of a decaying standard step-size, and uses a novel technique of tracking best iterate in each time-interval and ensuring that the later iterates do not significantly deviate from the best iterate in the previous time interval. We also provide a high-probability bound using a super-martingale technique from [2]. Simulations show that our step-size indeed leads to better last point than the standard step-size sequences.
Our approach fundamentally exploits the assumption that we know the total number of iterations $T$ a priori. Hence, our result does not provide an any-time algorithm. In contrast, existing any-time results have an extra multiplicative $\log T$ factor in the sub-optimality. We conjecture that this gap is fundamental and that every any-time algorithm would suffer from the extra $\log T$ factor. We give lower bounds for the strongly convex case to show that for any choice of step sizes, the algorithm is sub-optimal infinitely often, either in expectation or almost surely.
Acknowledgements
This research was partially supported by ONR N00014-17-1-2147 and MIT-IBM Watson AI Lab.
Appendix A Proofs of Technical Lemmas
A.1 Proof of Lemma 2
Proof of Lemma 2.
We fix such that . In this proof, we will freely use the fact that whenever . Let Define . We note that are random variables and are functions of only. We define the sigma-field .
We use the following notation for the sake of convenience: . Clearly, is measurable and is measurable. It is clear from the definition of that and .
By Hoeffding’s lemma, for any , we conclude:
[TABLE]
Let . For , consider
[TABLE]
Clearly, is measurable. since almost surely. We will show that is a super martingale:
[TABLE]
Therefore,
[TABLE]
From the proof of Lemma 1, for we have:
[TABLE]
In the third step, we have used the convexity of . Reordering Equation (18) and using the notation defined above:
[TABLE]
Multiplying the equation above by and adding from to , noting the fact that and , we conclude:
[TABLE]
We recall the random variable
[TABLE]
From equations (17) and (19), we conclude that for every such that :
[TABLE]
By convexity of the exponential function, we have:
[TABLE]
By Chernoff Bound, we conclude:
[TABLE]
The case for proceeds similarly but this time we use in place of . We define , and
[TABLE]
We note that for , $\mathbb{E}\left[M^{*}_{t} \,\big|\, \mathcal{F}_{t-1}\right] \leq M^{*}_{t-1}$ and . Therefore,
[TABLE]
Here we have used the fact that . Noting that we use Chernoff bound to conclude the result. ∎
A.2 Proof of Lemma 3
Proof of Lemma 3.
We prove this by induction. The assertion is true for . Suppose it is true for . Then,
[TABLE]
Thus we have proved the assertion by induction. ∎
A.3 Proof of Lemma 7
Proof of Lemma 7.
We take the definitions of the terms used from the proof of Lemma 6. It is clear from the definition that . Since for , it is sufficient to show that .
We define (an empty sum denotes 0). By definition of , for
[TABLE]
Continuing the above recursion, we conclude:
[TABLE]
Since is a probability distribution over , we conclude
[TABLE]
∎
Appendix B Proofs of Lower Bounds
We will prove Theorem 5 for and for the sake of convenience. We can handle the general case by considering the transformation . We scale the domain as . If is strongly convex and Lipschitz, then is strongly convex and Lipschitz. We take the subgradient oracle for to be . It is easy to check that if the iterates of SGD for with step sizes are , then starting from and using step sizes and the subgradient oracle defined above, the iterates for are . Therefore, and the proof below goes through seamlessly. This is similar to the rescaling used for the lower bounds in [2].
Without loss of generality, we will restrict our attention to strictly positive step size sequences: . We further restrict the possible values of in the following lemma:
Lemma 8.
If the step size sequence is such that there is an infinite sequence of times such that , then SGD is bad in expectation. Therefore, we can restrict our consideration to step size sequences of the form .
Proof.
Consider the function defined by . has a global optimum at and it is strongly convex. Let be a sequence of i.i.d. Rademacher random variables (i.e., uniform over ). We let the subgradient oracle return . Clearly,
[TABLE]
is independent of and conditioned on the value of , with probability at least , has the opposite sign to . When this happens, has the opposite sign to and . Therefore under this event, .
Therefore, we conclude:
[TABLE]
Considering the fact that , we conclude that SGD with this step size is bad in expectation.
∎
Henceforth, we will restrict our attention without loss of generality to step size sequences such that . We will first consider the function over the set . Let the infinite horizon learning rate be . At each time instant, the subgradient oracle returns where is a sequence of i.i.d. random variables uniform over (that is, Rademacher random variables). Let the iterates of SGD be and .
Lemma 9.
Let be the smallest time such that for all . Then, for every
1. [TABLE]
2. [TABLE]
Proof.
Suppose . Then:
[TABLE]
Therefore, when , . Therefore, almost surely. When ,
[TABLE]
Therefore, when , the iteration of SGD won’t leave the set almost surely, so there is no need for the projection step to obtain the next iterate. That is, for , . Squaring and taking expectations, we conclude:
[TABLE]
Clearly, . Using induction in the equation above, we conclude: for every .
∎
We divide into time intervals of the form . We have the following lemma:
Lemma 10.
If for some constant and there exist positive infinite sequences and such that , and every , either one of the two conditions below hold:
** 2. 2.
**
Then, SGD with step size is bad in expectation.
Proof.
We consider the optimization problem considered in Lemma 9 i.e, optimizing . Let . We assume the contrary - that is, for every , for some . As shown in the second inequality of Lemma 9, irrespective of the choice of ,
[TABLE]
From the first equality in Lemma 9, we conclude that for , Since , we can take large enough so that for every . Using the fact that . Therefore,
[TABLE]
Unravelling the recursion above, we conclude:
[TABLE]
We define .
1. Suppose for a particular , the first item in the statement of the lemma holds.
By assumption, . Using this in Equation (20), we conclude:
[TABLE]
Now, since , we have . Therefore,
[TABLE]
In the third step, we have used the fact that . We now consider the function given by for some . Clearly, is convex, bounded below and tends to infinity as . Therefore, it has a unique minimizer, namely the unique point such that . That is, is the unique point which satisfies: . Therefore, . Therefore, . Taking in Equation (23), we conclude:
[TABLE]
where is a constant depending only on and .
2. Suppose for a particular , the second item in the statement of the lemma holds. Then, by Equation (20), we have:
[TABLE]
From Equations (21) and (22), we conclude that there exists a constant depending only on and such that:
[TABLE]
Since , we can choose large enough so that for arbitrary .
From Equation (23), it follows that for arbitrary ,
[TABLE]
By Lemma 9, . By our assumption, . Therefore, we conclude: for every . This cannot hold for any finite when we take . This contradicts our assumption. Therefore, SGD with step size is bad in expectation.
∎
We will show that if the conditions for in Lemma 8 and those in Lemma 10 do not hold, then SGD is bad almost surely. We recall the definition of the interval . We prove the following lemma to inspect how frequently long, contiguous segments of are all equal to for . We take . We note that . We can divide into contiguous, disjoint intervals, each of size . We call these intervals for . We let be the event that for some , for all . In particular, the event implies that there is a contiguous length sequence of of all s in .
Lemma 11.
 for some absolute constant .
Proof.
We subdivide the interval into disjoint subintervals of length . There are such intervals. The event holds if over one such subinterval, the random signs are all . The probability of a given subinterval having all signs equal to is . Therefore, we conclude:
[TABLE]
Here we have used the inequality for .
Therefore, we conclude that: for some absolute constant . ∎
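The counting argument in this proof can be made concrete. The sketch below computes the exact probability that at least one of `num_blocks` disjoint blocks of `block_len` independent fair signs is all $+1$, and compares it with the exponential lower bound obtained from $1 - x \le e^{-x}$. We assume this standard inequality is the one invoked in the proof; the function names and the parameter values in the usage loop are ours.

```python
import math


def prob_some_block_all_plus(num_blocks, block_len):
    """Exact P(at least one of `num_blocks` disjoint blocks is all +1).

    Each block of `block_len` i.i.d. fair signs is all +1 with
    probability 2^(-block_len), and disjoint blocks are independent.
    """
    p_block = 0.5 ** block_len
    return 1.0 - (1.0 - p_block) ** num_blocks


def exp_lower_bound(num_blocks, block_len):
    """Lower bound 1 - exp(-num_blocks * 2^(-block_len)), via 1 - x <= e^(-x)."""
    return 1.0 - math.exp(-num_blocks * 0.5 ** block_len)


if __name__ == "__main__":
    for num_blocks, block_len in [(4, 2), (64, 6), (1024, 10)]:
        exact = prob_some_block_all_plus(num_blocks, block_len)
        bound = exp_lower_bound(num_blocks, block_len)
        print(num_blocks, block_len, exact, bound)
```

When the number of blocks is of the same order as $2^{\text{block length}}$, the bound stays bounded away from zero, which is the kind of uniform lower bound the lemma asks for.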
We now consider the same function which was considered in Lemma 8, i.e., defined by . has a global optimum at and it is strongly convex. Let be a sequence of i.i.d. Rademacher random variables (i.e., uniform over ). We let the subgradient oracle return . Let the iterates of SGD for with step sizes be .
Lemma 12.
Suppose and there exist an infinite sequence and fixed constants such that both of the following conditions hold:
1. [TABLE]
2. [TABLE]
We note that these conditions are the negations of the conditions for in Lemma 8 and Lemma 10. Then SGD with step size is bad almost surely.
Proof.
We will show that there exists a sequence of independent events for such that uniformly and whenever holds,
[TABLE]
for some constant . We note that and depend only on and . We consider a random time defined as follows:
1. If the event holds, pick a uniformly random element from , independent of everything else. Set and
2. If the event holds, pick a uniformly random element from , independent of everything else. Set and
We note that by symmetry, is uniformly distributed over the set . We will show that, when the event holds, then one of the following is true:
1. .
2.
Suppose the event holds. Then for , . Since under the event , for every , we conclude that . That is, SGD drifts in the negative direction irrespective of the value of the iterate. It is therefore clear that if for some , hits , then . Now suppose that for every in this range. Then, . Unraveling this recursion, it follows that . Therefore, it follows that when the event holds:
[TABLE]
It is clear that since , for large enough, . Therefore, we conclude that for large enough, when the event holds,
[TABLE]
Fix . We now consider to be the event $\bigl\{\sum_{t\in J_{k}(i_{0})}\gamma_{t}\geq\frac{\beta\tau}{|I_{k}|}\sum_{t\in I_{k}}\gamma_{t}\bigr\}$.
By symmetry, is uniformly distributed over . Therefore,
[TABLE]
and
[TABLE]
Now, when is part of the infinite sequence , by assumption we have:
[TABLE]
Therefore, by the Paley-Zygmund inequality, whenever is part of the infinite sequence , for every ,
[TABLE]
Recalling the definition of , we conclude, .
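The Paley-Zygmund step can be illustrated numerically. For a nonnegative random variable $Z$ with finite second moment and $\theta \in [0, 1]$, the inequality states $P(Z > \theta E[Z]) \ge (1-\theta)^2 E[Z]^2 / E[Z^2]$. The sketch below takes $Z$ to be one of finitely many nonnegative values, each chosen with equal probability, mirroring a sum of step sizes over a uniformly random index. The function name and the sample values are hypothetical.

```python
def paley_zygmund_check(values, theta):
    """Return (exact, bound) for Z uniform over the nonnegative `values`.

    exact = P(Z >= theta * E[Z])
    bound = (1 - theta)^2 * E[Z]^2 / E[Z^2]   (Paley-Zygmund lower bound)
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if any(v < 0 for v in values):
        raise ValueError("values must be nonnegative")
    n = len(values)
    mean = sum(values) / n
    second_moment = sum(v * v for v in values) / n
    if second_moment == 0.0:
        raise ValueError("Z must not be identically zero")
    exact = sum(1 for v in values if v >= theta * mean) / n
    bound = (1.0 - theta) ** 2 * mean ** 2 / second_moment
    return exact, bound


if __name__ == "__main__":
    # Hypothetical window sums of step sizes, one per equally likely index.
    print(paley_zygmund_check([0.0, 0.0, 0.0, 4.0], 0.5))
    print(paley_zygmund_check([1.0, 1.0, 1.0, 1.0], 0.5))
```

The bound is informative exactly when $E[Z^2]$ is at most a constant multiple of $E[Z]^2$, which is the role played by the second-moment assumption along the infinite sequence.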
We will now define the event . The events are all independent by definition. When the event holds, clearly, from equation (24), we conclude:
[TABLE]
The second inequality follows from the definition of . Using the fact that any is such that and , we conclude that for some fixed , the following holds whenever the event holds.
[TABLE]
[TABLE]
It is clear that we can find a such that for all large enough, .
Since are independent events, it follows that infinitely many of them hold with probability 1. From equation (25), we conclude that SGD with step sizes is bad almost surely. ∎
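The final step of this proof is an instance of the second Borel-Cantelli lemma. As a short worked derivation, write $B_k$ for the independent events and $c > 0$ for a uniform lower bound on their probabilities along the indices under consideration; both symbols are our notation and are assumed here for illustration.

```latex
% Assumed notation: B_k independent events with P(B_k) >= c > 0.
\[
\mathbb{P}\Bigl(\bigcap_{k \ge n} B_k^{c}\Bigr)
  = \prod_{k \ge n} \bigl(1 - \mathbb{P}(B_k)\bigr)
  \le \lim_{N \to \infty} (1 - c)^{N - n + 1} = 0
  \quad \text{for every } n,
\]
\[
\text{so}\quad
\mathbb{P}\bigl(B_k \text{ holds for infinitely many } k\bigr)
  = 1 - \mathbb{P}\Bigl(\bigcup_{n} \bigcap_{k \ge n} B_k^{c}\Bigr) = 1 .
\]
```

Independence is what allows the probability of the intersection to factor into a product; without it, a uniform lower bound on the individual probabilities would not suffice.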
Proof of Theorem 5.
We will conclude this from Lemmas 8, 10 and 12. Therefore, it is sufficient to show that any strictly positive infinite sequence is such that at least one of the following conditions holds:
1. There is an infinite sequence of times such that . In this case, by Lemma 8, we conclude that it is bad in expectation.
2. There exists a such that and there exist infinite sequences and such that for every k, either or . In this case, by Lemma 10, we conclude that it is bad in expectation.
3. There exists a such that and there exist fixed positive constants and such that for some infinite sub-sequence , and . In this case, by Lemma 12, we conclude that the algorithm is bad almost surely.
It is therefore sufficient to show that if conditions 1 and 2 don’t hold then condition 3 holds. The negation of condition 1 is that for some . Now, we denote by
[TABLE]
and
[TABLE]
. Therefore, or for some and is equivalent to which is equivalent to the statement that for every subsequence , . Therefore the negation of condition 2 is equivalent to at least one of the following conditions being true:
1. There exists an infinite sequence such that
2. There exists and an infinite subsequence such that for some . That is, and
Therefore we conclude that when neither of the conditions 1 and 2 hold, then condition 3 holds. This proves our result.
∎
References
- [1] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
- [2] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. arXiv preprint arXiv:1812.05217, 2018.
- [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [4] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
- [5] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
- [6] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- [7] Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47–1, 2012.
- [8] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
