First-order algorithms converge faster than $O(1/k)$ on convex problems
Ching-pei Lee, Stephen J. Wright

TL;DR
This paper proves that first-order algorithms like gradient descent and coordinate descent can achieve convergence rates faster than $O(1/k)$ for convex problems, improving known bounds.
Contribution
It establishes that several first-order methods attain an $o(1/k)$ convergence rate, surpassing the traditional $O(1/k)$ rate, and shows this is the best possible improvement.
Findings
Gradient descent achieves $o(1/k)$ convergence rate.
Proximal methods also attain $o(1/k)$ rate.
The $o(1/k)$ rate is tight and cannot be improved to $O(1/k^{1+\theta})$ for any $\theta>0$.
Abstract
It is well known that both gradient descent and stochastic coordinate descent achieve a global convergence rate of $O(1/k)$ in the objective value, when applied to a scheme for minimizing a Lipschitz-continuously differentiable, unconstrained convex function. In this work, we improve this rate to $o(1/k)$. We extend the result to proximal gradient and proximal coordinate descent on regularized problems to show similar $o(1/k)$ convergence rates. The result is tight in the sense that a rate of $O(1/k^{1+\epsilon})$ is not generally attainable for any $\epsilon>0$, for any of these methods.
Department of Computer Sciences and Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI
Email: {ching-pei,swright}@cs.wisc.edu
This work was supported by NSF Awards IIS-1447449, 1628384, 1634597, and 1740707; Subcontract 8F-30039 from Argonne National Laboratory; and Award N660011824020 from the DARPA Lagrange Program.
Keywords: Gradient descent methods · Coordinate descent methods · Proximal gradient methods · Convex optimization · Complexity
1 Introduction
Consider the unconstrained optimization problem
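$$\min_{x} \; f(x), \tag{1}$$
where $f$ has domain in an inner-product space and is convex and $L$-Lipschitz continuously differentiable for some $L>0$. We assume throughout that the solution set $\Omega$ is non-empty. (Elementary arguments based on the convexity and continuity of $f$ show that $\Omega$ is a closed convex set.) Classical convergence theory for gradient descent on this problem indicates a global $O(1/k)$ convergence rate in the function value. Specifically, if
$$x_{k+1} := x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, 2, \dots, \tag{2}$$
and $\alpha_k \equiv \bar{\alpha} \in (0, 1/L]$, we have
$$f(x_k) - f^* \le \frac{\mathrm{dist}(x_0, \Omega)^2}{2\bar{\alpha} k}, \quad k = 1, 2, \dots, \tag{3}$$
where $f^*$ is the optimal objective value and $\mathrm{dist}(x, \Omega)$ denotes the distance from $x$ to the solution set. The proof of (3) relies on showing that
$$k\,(f(x_k) - f^*) \le \sum_{T=1}^{k} (f(x_T) - f^*) \le \frac{1}{2\bar{\alpha}}\,\mathrm{dist}(x_0, \Omega)^2, \quad k = 1, 2, \dots, \tag{4}$$
where the first inequality utilizes the fact that gradient descent is a descent method (yielding a nonincreasing sequence of function values $\{f(x_k)\}$). We demonstrate in this paper that the bound (3) is not tight, in the sense that $k\,(f(x_k) - f^*) \to 0$, and thus $f(x_k) - f^* = o(1/k)$. This result is a consequence of the following technical lemma.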
Lemma 1
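Let $\{\Delta_k\}$ be a nonnegative sequence satisfying the following conditions:
1. $\{\Delta_k\}$ is monotonically decreasing;
2. $\{\Delta_k\}$ is summable, that is, $\sum_{k=0}^{\infty} \Delta_k < \infty$.
Then $k\Delta_k \to 0$, so that $\Delta_k = o(1/k)$.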
Proof
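The proof uses simplified elements of the proofs of Lemmas 2 and 9 of Section 2.2.1 from Pol87a. Define $s_k := k\Delta_k$ and $u_k := s_k + \sum_{i=k}^{\infty} \Delta_i$. Note that
$$s_{k+1} = (k+1)\Delta_{k+1} \le k\Delta_k + \Delta_{k+1} = s_k + \Delta_{k+1}. \tag{5}$$
From (5) we have
$$u_{k+1} = s_{k+1} + \sum_{i=k+1}^{\infty} \Delta_i \le s_k + \Delta_{k+1} + \sum_{i=k+1}^{\infty} \Delta_i \le s_k + \Delta_k + \sum_{i=k+1}^{\infty} \Delta_i = u_k,$$
so that $\{u_k\}$ is a monotonically decreasing nonnegative sequence. Thus there is $u \ge 0$ such that $u_k \to u$, and since $\lim_{k\to\infty} \sum_{i=k}^{\infty} \Delta_i = 0$, we have $s_k \to u$ also.
Assuming for contradiction that $u > 0$, there exists $k_0 > 0$ such that $s_k \ge u/2 > 0$ for all $k \ge k_0$, so that $\Delta_k \ge u/(2k)$ for all $k \ge k_0$. This contradicts the summability of $\{\Delta_k\}$. Therefore we have $u = 0$, so that $k\Delta_k = s_k \to 0$, proving the result. ∎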
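As a numerical illustration of Lemma 1 (not part of the original argument), the sketch below uses a hypothetical sequence $\Delta_k = 1/((k+2)\log^2(k+2))$, chosen here only because it is positive, decreasing, and summable by the integral test. The scaled values $k\Delta_k$ shrink toward zero, although slowly.

```python
import math


def delta(k):
    # Hypothetical test sequence (an assumption of this sketch):
    # positive, monotonically decreasing, and summable by the integral test.
    return 1.0 / ((k + 2) * math.log(k + 2) ** 2)


# Partial sums stay bounded, as summability requires.
partial_sum = sum(delta(k) for k in range(100000))

# k * delta(k) evaluated along k = 10, 100, ..., 10^6.
ks = [10 ** j for j in range(1, 7)]
scaled = [k * delta(k) for k in ks]

for k, s in zip(ks, scaled):
    print(f"k = {k:>8d}   k * Delta_k = {s:.5f}")
print(f"partial sum over k < 1e5: {partial_sum:.4f}")
```

For this particular sequence, $k\Delta_k$ behaves like $1/\log^2 k$, so it tends to zero more slowly than any negative power of $k$.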
Our claim about the fixed-step gradient descent method follows immediately by setting $\Delta_k = f(x_k) - f^*$ in Lemma 1. We state the result formally as follows, and prove it at the start of Section 2.
Theorem 1.1
Consider (1) with $f$ convex and $L$-Lipschitz continuously differentiable and nonempty solution set $\Omega$. If the step sizes satisfy $\alpha_k \equiv \bar{\alpha} \in (0, 1/L]$ for all $k$, then gradient descent (2) generates objective values $f(x_k)$ that converge to $f^*$ at an asymptotic rate of $o(1/k)$.
This result shows that the $o(1/k)$ rate for gradient descent with a fixed short step size is universal on convex problems, without any additional requirements such as the boundedness of $\Omega$ assumed in (Ber16a, Proposition 1.3.3). In the remainder of the paper, we show that this faster rate holds for several other smooth optimization algorithms, including gradient descent with fixed steps in the larger range $(0, 2/L)$, gradient descent with various line-search strategies, and stochastic coordinate descent with arbitrary sampling strategies. We then extend the result to algorithms for regularized convex optimization problems, including proximal gradient and stochastic proximal coordinate descent.
Except for the cases of coordinate descent and proximal coordinate descent which require a finite-dimensional space so that all the coordinates can be processed, our results apply to any inner-product spaces. Assumptions such as bounded solution set, bounded level set, or bounded distance to the solution set, which are commonly assumed in the literature, are all unnecessary. We can remove these assumptions because an implicit regularization property causes the iterates to stay within a bounded area.
In our description, the Euclidean norm is used for simplicity, but our results can be extended directly to any norm induced by an inner product (that is, given an inner product $\langle\cdot,\cdot\rangle$, the norm $\|\cdot\|$ is defined as $\|x\| := \sqrt{\langle x, x\rangle}$), provided that Lipschitz continuity of $\nabla f$ is defined with respect to the corresponding norm and its dual norm.
Related Work.
Our work was inspired by (PenZZ18a, Corollary 2) and (Ber16a, Proposition 1.3.3), which improve convergence for certain algorithms and problems on convex problems in a Euclidean space from $O(1/k)$ to $o(1/k)$ when the level set is compact. Our paper develops improved convergence rates of several algorithms on convex problems without the assumption on the level set, with most of our results applying to non-Euclidean Hilbert spaces. The main proof techniques in this work are somewhat different from those in the works cited here.
For an accelerated version of proximal gradient on convex problems, it is proved in AttP16a that the convergence rate can be improved from $O(1/k^2)$ to $o(1/k^2)$. Accelerated proximal gradient is a more complicated algorithm than the nonaccelerated versions we discuss, and thus AttP16a requires a more complicated analysis that is quite different from ours.
DenLPY17a have stated a version of Lemma 1 with a proof different from the one we present, using it to show the convergence rate of the quantity $\|x_k - x_{k+1}\|^2$ for a version of the alternating-directions method of multipliers (ADMM). Our work differs in the range of algorithms considered and the nature of the convergence. We also provide a discussion of the tightness of the $o(1/k)$ convergence rate.
2 Main Results on Unconstrained Smooth Problems
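We start by detailing the procedure for obtaining (4), to complete the proof of Theorem 1.1. First, we define
$$M(\alpha) := \alpha - \frac{L\alpha^2}{2}. \tag{6}$$
From the Lipschitz continuity of $\nabla f$, we have for any point $x$ and any real number $\alpha$ that
$$f(x - \alpha\nabla f(x)) \le f(x) - \langle \nabla f(x), \alpha\nabla f(x)\rangle + \frac{L}{2}\|\alpha\nabla f(x)\|^2 = f(x) - M(\alpha)\|\nabla f(x)\|^2. \tag{7}$$
Clearly,
$$\alpha \in \left(0, \tfrac{1}{L}\right] \;\Rightarrow\; M(\alpha) \ge \frac{\alpha}{2} > 0, \tag{8}$$
so in this case, we have by rearranging (7) that
$$\|\nabla f(x)\|^2 \le \frac{1}{M(\alpha)}\bigl(f(x) - f(x - \alpha\nabla f(x))\bigr) \le \frac{2}{\alpha}\bigl(f(x) - f(x - \alpha\nabla f(x))\bigr). \tag{9}$$
Considering any solution $\bar{x} \in \Omega$ and any $T \ge 0$, we have for gradient descent (2) that
$$\|x_{T+1} - \bar{x}\|^2 = \|x_T - \alpha_T\nabla f(x_T) - \bar{x}\|^2 = \|x_T - \bar{x}\|^2 + \alpha_T^2\|\nabla f(x_T)\|^2 - 2\alpha_T\langle \nabla f(x_T), x_T - \bar{x}\rangle. \tag{10}$$
Since $\alpha_T \in (0, 1/L]$ in (10), from (9) and the convexity of $f$ (implying $\langle \nabla f(x_T), \bar{x} - x_T\rangle \le f^* - f(x_T)$), we have
$$\|x_{T+1} - \bar{x}\|^2 \le \|x_T - \bar{x}\|^2 + 2\alpha_T\bigl(f(x_T) - f(x_{T+1})\bigr) + 2\alpha_T\bigl(f^* - f(x_T)\bigr). \tag{11}$$
By rearranging (11) and using $\alpha_T \equiv \bar{\alpha}$,
$$f(x_{T+1}) - f^* \le \frac{1}{2\bar{\alpha}}\bigl(\|x_T - \bar{x}\|^2 - \|x_{T+1} - \bar{x}\|^2\bigr). \tag{12}$$
We then obtain (4) by summing (12) from $T = 0$ to $T = k-1$ and noticing that $\bar{x}$ is arbitrary in $\Omega$.
Theorem 1.1 applies to step sizes in the range $(0, 1/L]$ only, but it is known that gradient descent converges at the rate of $O(1/k)$ for both the fixed step size scheme with $\bar{\alpha} \in (0, 2/L)$ and line-search schemes. Next, we show that $o(1/k)$ rates hold for these variants too. We then extend the result to stochastic coordinate descent with arbitrary sampling of coordinates.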
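The classical bound and the faster rate can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-dimensional test function (a quartic near the origin, extended linearly so that the gradient is globally Lipschitz with $L = 3$), a step size $\bar{\alpha} = 1/L$, and a starting point $x_0 = 2$. None of these choices come from the paper. It verifies that $f(x_k) - f^* \le \mathrm{dist}(x_0,\Omega)^2/(2\bar{\alpha}k)$ at every iteration and records the scaled gap $k\,(f(x_k) - f^*)$.

```python
def h(t):
    # Hypothetical convex test function: t^4/4 on [-1, 1], linear outside.
    # Its minimizer is t = 0 with optimal value 0.
    return t ** 4 / 4.0 if abs(t) <= 1.0 else abs(t) - 0.75


def grad_h(t):
    # Derivative of h; it is Lipschitz continuous with constant 3.
    if abs(t) <= 1.0:
        return t ** 3
    return 1.0 if t > 0 else -1.0


L = 3.0
alpha = 1.0 / L          # fixed short step, alpha in (0, 1/L]
x0 = 2.0
x = x0
bound_holds = True
scaled_gaps = []         # entries are k * (f(x_k) - f*)

for k in range(1, 20001):
    x = x - alpha * grad_h(x)
    gap = h(x)           # f* = 0 for this test function
    # Classical O(1/k) bound with dist(x0, Omega) = |x0|.
    bound_holds = bound_holds and gap <= x0 ** 2 / (2.0 * alpha * k) + 1e-15
    scaled_gaps.append(k * gap)

print("classical bound holds at every iteration:", bound_holds)
for k in (10, 100, 1000, 20000):
    print(f"k = {k:>6d}   k * gap = {scaled_gaps[k - 1]:.3e}")
```

On this example the scaled gap keeps shrinking rather than levelling off at a positive constant, which is the behaviour an $o(1/k)$ rate predicts.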
2.1 Gradient Descent with Longer Steps
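In this subsection, we allow the steplengths $\alpha_k$ for (2) to vary from iteration to iteration, according to the following conditions, for some $\gamma \in (0, 1]$:
$$\alpha_k \in [C_2, C_1], \quad C_2 \in \left(0, \tfrac{2-\gamma}{L}\right], \quad C_1 \ge C_2, \tag{13a}$$
$$f(x_k - \alpha_k\nabla f(x_k)) \le f(x_k) - \frac{\gamma\alpha_k}{2}\|\nabla f(x_k)\|^2. \tag{13b}$$
Note that these conditions encompass a fixed-steplength strategy with $\alpha_k \equiv (2-\gamma)/L$ as a special case, by setting $C_1 = C_2 = (2-\gamma)/L$, and noting that condition (13b) is a consequence of (7). (Note too that this steplength can be almost twice as large as the bound $1/L$ considered above.)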
The main result for this subsection is as follows.
Theorem 2.1
Consider (1) with $f$ convex and $L$-Lipschitz continuously differentiable and nonempty solution set $\Omega$. If the step sizes $\alpha_k$ satisfy (13), then gradient descent (2) generates objective values $f(x_k)$ converging to $f^*$ at an asymptotic rate of $o(1/k)$.
We give two alternative proofs of this result to provide different insights. The first proof is similar to the one we presented for Theorem 1.1 at the start of this section. The second proof holds only for Euclidean spaces. This proof improves the standard proof of (Nes04a, Section 2.1.5).
We start from the following lemma, which verifies that the iterates remain in a bounded set and is used in both proofs.
Lemma 2
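Consider algorithm (2) with any initial point $x_0$, and assume that $f$ is convex and $L$-Lipschitz-continuously differentiable for some $L > 0$. Then when the sequence of steplengths $\{\alpha_k\}$ is chosen to satisfy (13), all iterates $x_k$ lie in a bounded set. In particular, for any $\bar{x} \in \Omega$ and any $k \ge 0$, we have that
$$\|x_{k+1} - \bar{x}\|^2 \le \|x_0 - \bar{x}\|^2 + \frac{2C_1}{\gamma}\bigl(f(x_0) - f(x_{k+1})\bigr) - 2C_2\sum_{T=0}^{k}\bigl(f(x_T) - f^*\bigr) \tag{14}$$
$$\le \|x_0 - \bar{x}\|^2 + \frac{2C_1}{\gamma}\bigl(f(x_0) - f^*\bigr). \tag{15}$$
Proof
By (13b) and the convexity of $f$, (10) further implies that for any $T \ge 0$,
$$\|x_{T+1} - \bar{x}\|^2 - \|x_T - \bar{x}\|^2 \le \frac{2\alpha_T}{\gamma}\bigl(f(x_T) - f(x_{T+1})\bigr) + 2\alpha_T\bigl(f^* - f(x_T)\bigr). \tag{16}$$
We know that the first term is nonnegative from (13b), while the second term is nonpositive from the optimality of $f^*$. Therefore, (16) and (13a) imply
$$\|x_{T+1} - \bar{x}\|^2 - \|x_T - \bar{x}\|^2 \le \frac{2C_1}{\gamma}\bigl(f(x_T) - f(x_{T+1})\bigr) + 2C_2\bigl(f^* - f(x_T)\bigr). \tag{17}$$
We then obtain (14) by summing (17) for $T = 0, 1, \dots, k$ and telescoping. By noting that $f(x_T) \ge f^*$ for all $T$, (15) follows. ∎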
The first proof of Theorem 2.1 is as follows.
Proof (First Proof of Theorem 2.1)
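We again consider Lemma 1 with $\Delta_k := f(x_k) - f^*$, which is always nonnegative from the optimality of $f^*$. Monotonicity is clear from (13b), so we just need to show summability. By rearranging (14) and noting $f(x_{k+1}) \ge f^*$, we obtain
$$2C_2\sum_{T=0}^{k}\bigl(f(x_T) - f^*\bigr) \le \|x_0 - \bar{x}\|^2 + \frac{2C_1}{\gamma}\bigl(f(x_0) - f^*\bigr) < \infty.$$
Since the right-hand side does not depend on $k$, the sequence $\{\Delta_k\}$ is summable. ∎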
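To see the longer-step regime in action, here is a small sketch under stated assumptions. It takes the sufficient-decrease condition to have the form $f(x - \alpha\nabla f(x)) \le f(x) - \tfrac{\gamma\alpha}{2}\|\nabla f(x)\|^2$ and uses the fixed step $\alpha = (2-\gamma)/L$ with $\gamma = 0.1$, on the same kind of hypothetical test function used for illustration elsewhere (a quartic near the origin with linear growth outside, $L = 3$). The code checks the decrease condition at each step and that the iterates never leave the ball suggested by the boundedness lemma.

```python
import math


def h(t):
    # Hypothetical convex test function with 3-Lipschitz derivative.
    return t ** 4 / 4.0 if abs(t) <= 1.0 else abs(t) - 0.75


def grad_h(t):
    if abs(t) <= 1.0:
        return t ** 3
    return 1.0 if t > 0 else -1.0


L = 3.0
gamma = 0.1
alpha = (2.0 - gamma) / L      # almost twice the short step 1/L
x0 = 2.0
x = x0

# Radius from the boundedness estimate, with x_bar = 0 and f* = 0 (assumed form):
# |x_k|^2 <= |x0|^2 + (2 * C1 / gamma) * (f(x0) - f*), where C1 = alpha here.
radius = math.sqrt(x0 ** 2 + (2.0 * alpha / gamma) * h(x0))

decrease_ok = True
max_dist = abs(x0)
for k in range(2000):
    g = grad_h(x)
    x_new = x - alpha * g
    # Sufficient-decrease check (small tolerance for rounding).
    decrease_ok = decrease_ok and h(x_new) <= h(x) - 0.5 * gamma * alpha * g * g + 1e-12
    max_dist = max(max_dist, abs(x_new))
    x = x_new

print("sufficient decrease held at every step:", decrease_ok)
print(f"largest |x_k| = {max_dist:.4f}, radius from the estimate = {radius:.4f}")
```

The radius is typically very loose in practice; the point of the check is only that no compactness assumption was needed to keep the iterates bounded.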
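For the second proof of Theorem 2.1, we first outline the analysis from (Nes04a, Section 2.1.5) and then show how it can be modified to produce the desired $o(1/k)$ rate. Denote by $\bar{x}_T$ the projection of $x_T$ onto $\Omega$ (which is well defined because $\Omega$ is nonempty, closed, and convex). We can utilize the convexity of $f$ to obtain
$$\Delta_T := f(x_T) - f^* \le \langle \nabla f(x_T), x_T - \bar{x}_T\rangle \le \|\nabla f(x_T)\|\,\mathrm{dist}(x_T, \Omega),$$
so that
$$\|\nabla f(x_T)\| \ge \frac{\Delta_T}{\mathrm{dist}(x_T, \Omega)}. \tag{18}$$
By subtracting $f^*$ from both sides of (13b) and using $\alpha_T \ge C_2$ and (18), we obtain
$$\Delta_{T+1} \le \Delta_T - \frac{C_2\gamma}{2}\,\frac{\Delta_T^2}{\mathrm{dist}(x_T, \Omega)^2}.$$
By dividing both sides of this expression by $\Delta_T\Delta_{T+1}$ and using $\Delta_{T+1} \le \Delta_T$, we obtain
$$\frac{1}{\Delta_{T+1}} \ge \frac{1}{\Delta_T} + \frac{C_2\gamma}{2\,\mathrm{dist}(x_T, \Omega)^2}\,\frac{\Delta_T}{\Delta_{T+1}} \ge \frac{1}{\Delta_T} + \frac{C_2\gamma}{2\,\mathrm{dist}(x_T, \Omega)^2}. \tag{19}$$
By summing (19) over $T = 0, 1, \dots, k-1$, we obtain
$$\frac{1}{\Delta_k} \ge \frac{1}{\Delta_0} + \frac{C_2\gamma}{2}\sum_{T=0}^{k-1}\frac{1}{\mathrm{dist}(x_T, \Omega)^2} \;\Rightarrow\; \Delta_k \le \frac{2}{C_2\gamma\sum_{T=0}^{k-1}\mathrm{dist}(x_T, \Omega)^{-2}}. \tag{20}$$
An $O(1/k)$ rate is obtained by noting from Lemma 2 that $\mathrm{dist}(x_T, \Omega) \le R_0$ for some $R_0 > 0$ and all $T$, so that
$$\sum_{T=0}^{k-1}\frac{1}{\mathrm{dist}(x_T, \Omega)^2} \ge \frac{k}{R_0^2} \;\Rightarrow\; \Delta_k \le \frac{2R_0^2}{C_2\gamma k}. \tag{21}$$
Our alternative proof uses the fact that (21) is a loose bound for Euclidean spaces and that an improved result can be obtained by working directly with (20). We first use the Bolzano-Weierstrass theorem (a bounded and closed set is sequentially compact in a Euclidean space) together with Lemma 2, to show that the sequence $\{x_k\}$ approaches the solution set $\Omega$.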
Lemma 3
Assume the conditions in Lemma 2 and in addition that $f$ has domain in a Euclidean space. We have
$$\lim_{k\to\infty} \mathrm{dist}(x_k, \Omega) = 0. \tag{22}$$
Proof
The proof is similar to (PenZZ18a, Proposition 1). Assume for contradiction that (22) does not hold. Then there are $\epsilon > 0$ and an infinite increasing sequence $\{k_i\}$, $i = 0, 1, \dots$, such that
$$\mathrm{dist}(x_{k_i}, \Omega) \ge \epsilon, \quad i = 0, 1, \dots. \tag{23}$$
From Lemma 2, we can see that the sequence $\{x_{k_i}\}$ lies in a compact set and therefore has an accumulation point $\hat{x}$. From (19), we have
[TABLE]
so that $\Delta_k \to 0$ and hence $f(x_k) \to f^*$. By continuity of $f$, it follows that $f(\hat{x}) = f^*$, so that $\hat{x} \in \Omega$ by definition, contradicting (23). ∎
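We note that a result similar to Lemma 3 has been given in BurGIS95a using a more complicated argument with more restricted choices of $\alpha_k$.
Proof (Second Proof of Theorem 2.1, for Euclidean Spaces)
We start with (20) and show that
$$\lim_{k\to\infty} \frac{k}{\sum_{T=0}^{k-1}\mathrm{dist}(x_T, \Omega)^{-2}} = 0, \tag{24}$$
or, equivalently,
$$\lim_{k\to\infty} \frac{1}{\frac{1}{k}\sum_{T=0}^{k-1}\mathrm{dist}(x_T, \Omega)^{-2}} = 0.$$
From the arithmetic-mean / harmonic-mean inequality (which says that for any positive real numbers $a_1, \dots, a_n$, their harmonic mean does not exceed their arithmetic mean, namely $n / \sum_{i=1}^{n} a_i^{-1} \le \frac{1}{n}\sum_{i=1}^{n} a_i$), we have that
$$0 \le \frac{k}{\sum_{T=0}^{k-1}\mathrm{dist}(x_T, \Omega)^{-2}} \le \frac{1}{k}\sum_{T=0}^{k-1}\mathrm{dist}(x_T, \Omega)^2. \tag{25}$$
Lemma 3 shows that $\mathrm{dist}(x_T, \Omega) \to 0$, so by the Stolz-Cesàro theorem (see, for example, Mur09a), the right-hand side of (25) converges to $0$. Therefore, from the sandwich lemma, (24) holds. ∎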
2.2 Coordinate Descent
We now extend Theorem 1.1 to the case of randomized coordinate descent. Our results can extend immediately to block-coordinate descent with fixed blocks. Our analysis for coordinate descent requires Euclidean spaces so that coordinate descent can go through all coordinates.
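The standard short-step coordinate descent procedure requires knowledge of coordinate-wise Lipschitz constants. Denoting by $e_i$ the $i$th unit vector, we denote by $L_i \ge 0$ the constants such that:
$$|\nabla_i f(x) - \nabla_i f(x + h e_i)| \le L_i |h|, \quad \text{for all } x \text{ and all } h \in \mathbb{R}, \quad i = 1, \dots, n, \tag{26}$$
where $\nabla_i f(\cdot)$ denotes the $i$th coordinate of the gradient. Note that if $\nabla f$ is $L$-Lipschitz continuous, there always exist $L_1, \dots, L_n \in [0, L]$ such that (26) holds. Without loss of generality, we assume $L_i > 0$ for all $i$. Given parameters $\{\bar{L}_i\}_{i=1}^{n}$ such that $\bar{L}_i \ge L_i$ for all $i$, the coordinate descent update is
$$x_{k+1} \leftarrow x_k - \frac{\nabla_{i_k} f(x_k)}{\bar{L}_{i_k}}\, e_{i_k}, \tag{27}$$
where $i_k$ is the coordinate selected for updating at the $k$th iteration. We consider the general case of stochastic coordinate descent in which each $i_k$ is independently identically distributed following a fixed prespecified probability distribution $p_1, \dots, p_n$ satisfying
$$p_i \ge p_{\min}, \quad i = 1, \dots, n; \qquad \sum_{i=1}^{n} p_i = 1, \tag{28}$$
for some constant $p_{\min} > 0$. Nesterov Nes12a proves that stochastic coordinate descent has an $O(1/k)$ convergence rate (in expectation of $f$) on convex problems. We show below that this rate can be improved to $o(1/k)$.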
Theorem 2.2
Consider (1) with $f$ convex and nonempty solution set $\Omega$, and suppose that componentwise-Lipschitz continuous differentiability (26) holds with some $L_1, \dots, L_n > 0$. If we apply coordinate descent (27) and at each iteration, $i_k$ is independently picked at random following a probability distribution satisfying (28), then the expected objective $\mathbb{E}[f(x_k)]$ converges to $f^*$ at an asymptotic rate of $o(1/k)$.
Proof
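From (26) and $\bar{L}_i \ge L_i$, by treating all other coordinates as non-variables, we have that for any $T \ge 0$,
$$f(x_{T+1}) - f(x_T) \le -\frac{1}{2\bar{L}_{i_T}}\bigl(\nabla_{i_T} f(x_T)\bigr)^2, \tag{29}$$
showing that the algorithm decreases $f$ at each iteration. Consider any $\bar{x} \in \Omega$. By defining
$$r_k^2 := \sum_{i=1}^{n} \frac{\bar{L}_i}{p_i}\,(x_k - \bar{x})_i^2, \tag{30}$$
we have from (27) that
$$r_{T+1}^2 = r_T^2 + \frac{1}{p_{i_T}\bar{L}_{i_T}}\bigl(\nabla_{i_T} f(x_T)\bigr)^2 - \frac{2}{p_{i_T}}\nabla_{i_T} f(x_T)\,(x_T - \bar{x})_{i_T}.$$
By taking expectation over $i_T$ on both sides of the above expression, we obtain from the convexity of $f$ and (29) that
$$\mathbb{E}_{i_T}\bigl[r_{T+1}^2\bigr] - r_T^2 \le \frac{2}{p_{\min}}\bigl(f(x_T) - \mathbb{E}_{i_T}[f(x_{T+1})]\bigr) + 2\bigl(f^* - f(x_T)\bigr). \tag{31}$$
By taking expectation over $i_0, \dots, i_{T-1}$ on (31) and summing (31) over $T = 0, 1, \dots, k$, we obtain
$$2\sum_{T=0}^{k}\bigl(\mathbb{E}[f(x_T)] - f^*\bigr) \le r_0^2 + \frac{2}{p_{\min}}\bigl(f(x_0) - f^*\bigr).$$
The result now follows from Lemma 1 applied to $\Delta_k := \mathbb{E}[f(x_k)] - f^*$, which is monotonically decreasing by (29). ∎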
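A stochastic coordinate descent step with non-uniform sampling can be sketched as follows. The quadratic objective, the sampling probabilities, and all names below are hypothetical choices made for illustration. The update divides the selected partial derivative by a coordinate-wise constant $\bar{L}_i$, taken here to be the diagonal entry of the Hessian, and the script records that the objective never increases.

```python
import random

# Hypothetical convex quadratic f(x) = 0.5 * x^T Q x - c^T x (illustration only).
Q = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
c = [1.0, 0.0, 1.0]
n = 3


def f(x):
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(c[i] * x[i] for i in range(n))


def partial(x, i):
    # i-th partial derivative of f at x.
    return sum(Q[i][j] * x[j] for j in range(n)) - c[i]


L_bar = [Q[i][i] for i in range(n)]   # coordinate-wise Lipschitz constants
probs = [0.5, 0.3, 0.2]               # arbitrary sampling; every entry is positive
rng = random.Random(0)

x = [5.0, -3.0, 4.0]
values = [f(x)]
for k in range(5000):
    i = rng.choices(range(n), weights=probs)[0]   # sample coordinate i_k
    x[i] -= partial(x, i) / L_bar[i]              # short coordinate step
    values.append(f(x))

print(f"f(x_0) = {values[0]:.6f}, f(x_5000) = {values[-1]:.6f}")
print("largest |partial derivative| at the end:",
      max(abs(partial(x, i)) for i in range(n)))
```

Because this quadratic is strongly convex, convergence here is much faster than the worst-case rate; the sketch is meant to show the update and the descent property, not the rate itself.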
3 Regularized Problems
We turn now to regularized optimization in an inner-product space:
$$\min_{x} \; F(x) := f(x) + \psi(x), \tag{32}$$
where both terms are convex, $f$ is $L$-Lipschitz-continuously differentiable, and $\psi$ is extended-valued, proper, and closed, but possibly nondifferentiable. We also assume that $\psi$ is such that the prox-operator can be applied easily, by solving the following problem for any given $y$ and any $\lambda > 0$:
$$\min_{x} \; \psi(x) + \frac{1}{2\lambda}\|x - y\|^2.$$
We assume further that the solution set $\Omega$ of (32) is nonempty, and denote by $F^*$ the value of $F(x)$ for all $x \in \Omega$. We discuss two algorithms to show how our techniques can be extended to regularized problems. They are proximal gradient (both with and without line search) and stochastic proximal coordinate descent with arbitrary sampling.
3.1 Short-Step Proximal Gradient
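Given $\bar{L} \ge L$, the $k$th step of the proximal gradient algorithm is defined as follows:
$$x_{k+1} \leftarrow x_k + d_k, \quad d_k := \arg\min_{d} \; \langle \nabla f(x_k), d\rangle + \frac{\bar{L}}{2}\|d\|^2 + \psi(x_k + d). \tag{33}$$
Note that $d_k$ is uniquely defined here, since the subproblem is strongly convex. It is shown in BecT09a; Nes13a that $F(x_k)$ converges to $F^*$ at a rate of $O(1/k)$ for this algorithm, under our assumptions. We prove that an $o(1/k)$ rate can be attained.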
Theorem 3.1
Consider (32) with $f$ convex and $L$-Lipschitz continuously differentiable, $\psi$ convex, and nonempty solution set $\Omega$. Given any $\bar{L} \ge L$, the proximal gradient method (33) generates iterates whose objective value converges to $F^*$ at an $o(1/k)$ rate.
Proof
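The method (33) can be shown to be a descent method from the Lipschitz continuity of $\nabla f$ and the fact that $\bar{L} \ge L$. From the optimality of the solution to (33) and $x_{k+1} = x_k + d_k$,
$$-\bigl(\nabla f(x_k) + \bar{L} d_k\bigr) \in \partial\psi(x_{k+1}), \tag{34}$$
where $\partial\psi$ denotes the subdifferential of $\psi$. Consider any $\bar{x} \in \Omega$. We have from (33) that for any $T \ge 0$, the following chain of relationships holds:
$$\begin{aligned}
\|x_{T+1} - \bar{x}\|^2 - \|x_T - \bar{x}\|^2 &= 2\langle d_T, x_{T+1} - \bar{x}\rangle - \|d_T\|^2 \\
&\le \frac{2}{\bar{L}}\Bigl(\psi(\bar{x}) - \psi(x_{T+1}) + \langle \nabla f(x_T), \bar{x} - x_{T+1}\rangle\Bigr) - \|d_T\|^2 \\
&\le \frac{2}{\bar{L}}\Bigl(F^* - \psi(x_{T+1}) - f(x_T) - \langle \nabla f(x_T), d_T\rangle - \frac{\bar{L}}{2}\|d_T\|^2\Bigr) \\
&\le \frac{2}{\bar{L}}\bigl(F^* - F(x_{T+1})\bigr),
\end{aligned} \tag{35}$$
where the first inequality uses (34) and the convexity of $\psi$, the second uses the convexity of $f$, and in the last inequality, we have used
$$f(x_{T+1}) \le f(x_T) + \langle \nabla f(x_T), d_T\rangle + \frac{L}{2}\|d_T\|^2 \le f(x_T) + \langle \nabla f(x_T), d_T\rangle + \frac{\bar{L}}{2}\|d_T\|^2. \tag{36}$$
By rearranging (35) we obtain
$$F(x_{T+1}) - F^* \le \frac{\bar{L}}{2}\bigl(\|x_T - \bar{x}\|^2 - \|x_{T+1} - \bar{x}\|^2\bigr).$$
The result follows by summing both sides of this expression over $T = 0, 1, \dots, k$ and applying Lemma 1. ∎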
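For an $\ell_1$ regularizer, the proximal gradient subproblem has a closed-form solution given by soft-thresholding. The sketch below is an illustration under assumptions, not the authors' code: it uses a hypothetical two-variable least-squares term $f(x) = \tfrac12\|Ax - b\|^2$, the regularizer $\psi(x) = \lambda\|x\|_1$, and $\bar{L}$ set to the trace of $A^\top A$ (an upper bound on the Lipschitz constant of $\nabla f$).

```python
def soft_threshold(v, tau):
    # Proximal operator of tau * |.| evaluated at v.
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


# Hypothetical problem data (illustration only).
A = [[1.0, 2.0],
     [0.0, 1.0]]
b = [1.0, 1.0]
lam = 0.1
L_bar = 6.0   # trace of A^T A = 1 + 4 + 0 + 1, which bounds its largest eigenvalue


def residual(x):
    return [A[r][0] * x[0] + A[r][1] * x[1] - b[r] for r in range(2)]


def grad_f(x):
    res = residual(x)
    return [sum(A[r][j] * res[r] for r in range(2)) for j in range(2)]


def F(x):
    res = residual(x)
    return 0.5 * sum(r * r for r in res) + lam * sum(abs(v) for v in x)


x = [3.0, -2.0]
values = [F(x)]
step = None
for k in range(3000):
    g = grad_f(x)
    # x_{k+1} = prox_{(lam / L_bar) * |.|_1}(x_k - grad f(x_k) / L_bar)
    x_new = [soft_threshold(x[j] - g[j] / L_bar, lam / L_bar) for j in range(2)]
    step = max(abs(x_new[j] - x[j]) for j in range(2))
    x = x_new
    values.append(F(x))

print(f"F(x_0) = {values[0]:.6f}, F(x_3000) = {values[-1]:.6f}")
print("final iterate:", x, " last step size:", step)
```

The recorded objective values are nonincreasing, consistent with the descent property used at the start of the proof.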
3.2 Proximal Gradient with Line Search
We discuss a line-search variant of proximal gradient, where the update is defined as follows:
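$$x_{k+1} \leftarrow x_k + d_k, \quad d_k := \arg\min_{d} \; \langle \nabla f(x_k), d\rangle + \frac{1}{2\alpha_k}\|d\|^2 + \psi(x_k + d), \tag{37}$$
where $\alpha_k$ is chosen such that for given $\gamma \in (0, 1]$ and $C_1 \ge C_2 > 0$ defined as in (13a), we have
$$\alpha_k \in [C_2, C_1], \quad F(x_k + d_k) \le F(x_k) - \frac{\gamma}{2\alpha_k}\|d_k\|^2. \tag{38}$$
This framework is a generalization of that in Section 2.1, and includes the SpaRSA algorithm of WriNF09a, which obtains an initial choice of $\alpha_k$ from a Barzilai-Borwein approach and adjusts it until (38) holds. The approach of the previous subsection can also be seen as a special case of (37)-(38) through the following elementary result, whose proof is omitted.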
Lemma 4
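Consider a convex function $\psi$, a positive scalar $a > 0$ and two vectors $b$ and $x$. If $d$ is the unique solution of the strictly convex problem
$$\min_{d} \; \langle b, d\rangle + \frac{a}{2}\|d\|^2 + \psi(x + d),$$
then
$$\langle b, d\rangle + \frac{a}{2}\|d\|^2 + \psi(x + d) - \psi(x) \le -\frac{a}{2}\|d\|^2.$$
By setting $a = \bar{L}$, $b = \nabla f(x_k)$ (where $\bar{L} \ge L$), this lemma together with (36) implies that (38) holds with $\alpha_k = 1/\bar{L}$ for any $\gamma \in (0, 1]$. Moreover, it also implies that for any $\alpha_k > 0$,
$$F(x_k + d_k) - F(x_k) \le -\left(\frac{1}{\alpha_k} - \frac{L}{2}\right)\|d_k\|^2.$$
Therefore, for any $\gamma \in (0, 1]$, (38) holds whenever
$$\frac{1}{\alpha_k} - \frac{L}{2} \ge \frac{\gamma}{2\alpha_k},$$
or equivalently
$$\alpha_k \le \frac{2 - \gamma}{L},$$
which is how the upper bound for $C_2$ is set.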
We show now that this approach also has an $o(1/k)$ convergence rate on convex problems.
Theorem 3.2
Consider (32) with $f$ convex and $L$-Lipschitz continuously differentiable, $\psi$ convex, and nonempty solution set $\Omega$. Given some $\gamma \in (0, 1]$ and $C_1$ and $C_2$ such that $C_1 \ge C_2$ and $C_2 \in (0, (2-\gamma)/L]$, the algorithm (37) with $\alpha_k$ satisfying (38) generates iterates $\{x_k\}$ whose objective values converge to $F^*$ at a rate of $o(1/k)$. Moreover, the sequence of iterates $\{x_k\}$ is bounded.
Proof
From the optimality conditions of (37), we have
[TABLE]
Now consider any $\bar{x} \in \Omega$. We have from (37) that for any $T \ge 0$, the following chain of relationships holds:
[TABLE]
By rearranging this inequality, we obtain
[TABLE]
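and by summing both sides and using telescoping sums, we find that $\sum_{k=0}^{\infty}(F(x_k) - F^*) < \infty$. Thus the conditions of Lemma 1 are satisfied by $\Delta_k := F(x_k) - F^*$, and the $o(1/k)$ rate follows.
By summing the inequality above over the finite range $T = 0, 1, \dots, k$, we obtain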
[TABLE]
By rearranging this inequality, we obtain a uniform upper bound on $\|x_k - \bar{x}\|$, thus showing that the sequence $\{x_k\}$ is bounded. ∎
3.3 Proximal Coordinate Descent
We now discuss the extension of coordinate descent to (32), with the assumption (26) on $f$, Euclidean domain of dimension $n$, sampling weighted according to (28) as in Section 2.2, and the additional assumption of separability of the regularizer $\psi$, that is,
$$\psi(x) = \sum_{i=1}^{n} \psi_i(x_i), \tag{42}$$
where each $\psi_i$ is convex, extended valued, and possibly nondifferentiable. As in our discussion of Section 2.2, the results in this subsection can be extended directly to the case of block-coordinate descent.
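Given the component-wise Lipschitz constants $L_1, \dots, L_n$ and algorithmic parameters $\bar{L}_1, \dots, \bar{L}_n$ with $\bar{L}_i \ge L_i$ for all $i$, proximal coordinate descent updates have the form
$$x_{k+1} \leftarrow x_k + d_{i_k}^k e_{i_k}, \quad d_{i_k}^k := \arg\min_{d \in \mathbb{R}} \; \nabla_{i_k} f(x_k)\, d + \frac{\bar{L}_{i_k}}{2} d^2 + \psi_{i_k}\bigl((x_k)_{i_k} + d\bigr). \tag{43}$$
With $p_i \equiv 1/n$ for all $i$, LuX15a showed that the expected objective value converges to $F^*$ at an $O(1/k)$ rate. When arbitrary sampling (28) is considered, (43) is a special case of the general algorithmic framework described in LeeW18b. The latter paper shows the same rate for convex problems under the additional assumption that for any $x_0$, we have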
[TABLE]
We show here that with arbitrary sampling according to (28), (43) produces $o(1/k)$ convergence rates for the expected objective on convex problems, without the assumption (44).
The following result makes use of the quantity $r_k$ defined in (30).
Theorem 3.3
Consider (32) with $f$ and $\psi$ convex and nonempty solution set $\Omega$. Assume further that (42) is true, and that (26) holds with some $L_1, \dots, L_n > 0$. Given $\{\bar{L}_i\}_{i=1}^{n}$ with $\bar{L}_i \ge L_i$ for all $i$, suppose that proximal coordinate descent defines iterates according to (43), with $i_k$ chosen i.i.d. according to a probability distribution satisfying (28). Then $\mathbb{E}[F(x_k)]$ converges to $F^*$ at an asymptotic rate of $o(1/k)$. Moreover, given any $\bar{x} \in \Omega$, the sequence of $\mathbb{E}[r_k^2]$ is bounded.
Proof
From (26), we first notice that in the update (43),
[TABLE]
From Lemma 4, the method defined by (43) is a descent method. Optimality of the subproblem in (43) yields
[TABLE]
By taking any $\bar{x} \in \Omega$, and using the definition (30), we have:
[TABLE]
By taking expectation over $i_T$ on both sides of (47) and using the convexity of $f$ together with (45), we obtain
[TABLE]
where in (48a) we used the fact that (43) is a descent method. By taking expectation over $i_0, \dots, i_{T-1}$ on (48b), summing over $T = 0, 1, \dots, k$, and applying Lemma 1, we obtain the result.
Boundedness of $\mathbb{E}[r_k^2]$ follows from the same telescoping sum and the fact that $F(x_k)$ decreases monotonically with $k$. ∎
Our result shows that, similar to gradient descent and proximal gradient, proximal coordinate descent and coordinate descent also provide a form of implicit regularization in that the expected value of $r_k^2$ is bounded. Since $r_k$ can be viewed as a weighted Euclidean norm of $x_k - \bar{x}$, this observation implies that the iterates are also in a sense expected to lie within a bounded region.
Our analysis here improves the rates in LuX15a; LeeW18b in terms of the dependency on $k$ and removes the assumption of (13a) in LeeW18b. Even aside from the improvement from $O(1/k)$ to $o(1/k)$, Theorem 3.3 is the first time that a convergence rate for proximal stochastic coordinate descent with arbitrary sampling for the coordinates is proven without additional assumptions such as (44). By manipulating (48b), one can also observe how different probability distributions affect the upper bound, and it might also be possible to get better upper bounds by using norms different from (30).
4 Tightness of the Estimate
We demonstrate that the $o(1/k)$ estimate of convergence of $\{f(x_k)\}$ is tight by showing that for any $\epsilon > 0$, there is a convex smooth function for which the sequence of function values generated by gradient descent with a fixed step size converges slower than $O(1/k^{1+\epsilon})$. The example problem we provide is a simple one-dimensional function, so it serves also as a special case of stochastic coordinate descent and the proximal methods (where $\psi \equiv 0$) as well. Thus, this example shows tightness of our analysis for all methods without line search considered in this paper.
Consider the one-dimensional real convex function
[TABLE]
where is an even integer greater than . The minimizer of this function is clearly at , for which . Suppose that the gradient descent method is applied starting from . For any descent method, the iterates are confined to and we have
[TABLE]
so we set . Suppose that as above. Then the iteration formula is
[TABLE]
and by Lemma 2, all iterates lie in a bounded set: the level set defined by . In fact, since and , we have that
[TABLE]
so that and the value of remains valid for all iterates.
We show by an informal argument that there exists a constant such that
[TABLE]
From (50) we have
[TABLE]
By substituting the hypothesis (51) into (52), and taking to be large, we obtain the following sequence of equivalent approximate equalities:
[TABLE]
This last expression is approximately satisfied for large if satisfies the expression
[TABLE]
Stated another way, our result (51) indicates that a convergence rate faster than $O(1/k^{1+\epsilon})$ is not possible when steepest descent with fixed steplength is applied to the function $f$ provided that
[TABLE]
that is,
[TABLE]
We follow AttCPR18a to provide a continuous-time analysis of the same objective function, using a gradient flow argument. For the function $f$ defined by (49), consider the following differential equation:
[TABLE]
Suppose that
[TABLE]
for some , which indicates that starting from any , lies in a bounded area. Substituting (54) into (53), we obtain
[TABLE]
which holds true if and only if the following equations are satisfied:
[TABLE]
from which we obtain
[TABLE]
Since decreases monotonically to zero, for all ,
[TABLE]
is an appropriate value for a bound on . These values of and satisfy , making a valid step size. The objective value is , matching the rate of (51).
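The tightness claim can be probed numerically. The sketch below rests on assumptions that should be read as illustrative: the test function is taken to be $f(x) = x^p/p$ with $p$ an even integer greater than $2$, the Lipschitz constant on $[-1, 1]$ is $L = p - 1$, the step is $1/L$, and the start is $x_0 = 1$. The exact constants of the original example are not reproduced here. The code estimates the empirical decay exponent of $f(x_k)$ from a log-log slope and compares it with $p/(p-2)$, the exponent suggested by the gradient-flow heuristic $\dot{x} = -x^{p-1}$.

```python
import math


def decay_exponent(p, k1=1000, k2=10000):
    """Estimate s such that f(x_k) behaves like k^(-s) for f(x) = x^p / p."""
    alpha = 1.0 / (p - 1)        # step 1/L with L = p - 1 on [-1, 1] (assumed)
    x = 1.0                      # starting point (assumed)
    f_at = {}
    for k in range(1, k2 + 1):
        x -= alpha * x ** (p - 1)    # gradient step; f'(x) = x^(p-1)
        if k == k1 or k == k2:
            f_at[k] = x ** p / p
    # Slope of log f(x_k) against log k between the two checkpoints.
    return math.log(f_at[k1] / f_at[k2]) / math.log(k2 / k1)


exponents = {p: decay_exponent(p) for p in (4, 6, 10)}
for p, s in exponents.items():
    print(f"p = {p:>2d}: empirical exponent {s:.3f}, heuristic p/(p-2) = {p / (p - 2):.3f}")
```

As $p$ grows, the exponent $p/(p-2) = 1 + 2/(p-2)$ approaches $1$ from above, which is why no rate of the form $O(1/k^{1+\epsilon})$ with a fixed $\epsilon > 0$ can hold for every convex smooth function.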
Acknowledgment
The authors thank Yixin Tao for a discussion that helped us to improve the clarity of this work.
References
- (1) Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming 168(1-2), 123–175 (2018)
- (2) Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than $1/k^2$. SIAM Journal on Optimization 26(3), 1824–1834 (2016)
- (3) Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
- (4) Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific, Belmont (2016)
- (5) Burachik, R., Graña Drummond, L., Iusem, A.N., Svaiter, B.: Full convergence of the steepest descent method with inexact line searches. Optimization 32(2), 137–146 (1995)
- (6) Deng, W., Lai, M.J., Peng, Z., Yin, W.: Parallel multi-block ADMM with $o(1/k)$ convergence. Journal of Scientific Computing 71(2), 712–736 (2017)
- (7) Lee, C.p., Wright, S.J.: Inexact variable metric stochastic block-coordinate descent for regularized optimization. Tech. rep. (2018). URL http://www.optimization-online.org/DB_HTML/2018/08/6753.html
- (8) Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming 152(1-2), 615–642 (2015)
