Polyak Steps for Adaptive Fast Gradient Method
Mathieu Barré, Alexandre d'Aspremont

TL;DR
This paper introduces a new adaptive method for accelerated gradient algorithms that estimates the strong convexity parameter online, eliminating the need for restart schemes and maintaining optimal convergence rates.
Contribution
It proposes a novel approach to adaptively estimate the strong convexity parameter during optimization, removing the necessity for restart strategies.
Findings
Achieves optimal linear convergence without restarts.
Demonstrates robustness of the method when run with a sequence of upper bounds on the strong convexity parameter $\mu$.
Provides empirical evidence of effectiveness.
Abstract
Accelerated algorithms for minimizing smooth strongly convex functions usually require knowledge of the strong convexity parameter $\mu$. In the case of an unknown $\mu$, current adaptive techniques are based on restart schemes. When the optimal value is known, these strategies recover the accelerated linear convergence bound without additional grid search. In this paper we propose a new approach that has the same bound without any restart, using an online estimation of the strong convexity parameter. We show the robustness of the Fast Gradient Method when using a sequence of upper bounds on $\mu$. We also present a good candidate for this estimate sequence and detail consistent empirical results.
| Dataset | Regularization (Logit) | Regularization (Lasso) | Regularization (SVM) |
|---|---|---|---|
| Musk | | | |
| Madelon | | | |
1 Introduction
We focus on solving a generic optimization problem written
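$$\text{minimize} \quad F(x) \triangleq f(x) + h(x) \qquad (1)$$
in the variable $x \in \mathbb{R}^n$, where $f$ is an $L$-smooth, $\mu$-strongly convex function and $h$ a convex penalty term. In the deterministic setting, classical convergence bounds show
$$F(x_k) - F^* \leq O\left(\left(1 - \frac{\mu}{L}\right)^k\right) \qquad (2)$$
after $k$ iterations of gradient descent with fixed step size, while accelerated proximal gradient descent methods yield iterates satisfying
$$F(x_k) - F^* \leq O\left(\left(1 - \sqrt{\frac{\mu}{L}}\right)^k\right) \qquad (3)$$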
after $k$ iterations, showing a significantly weaker dependence on the problem’s condition number (see (Nesterov, 2005) for a complete discussion). Similar rates have been obtained in the stochastic setting under the assumption that $f$ is a finite sum. Early work in (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014) produced algorithms with a slow rate roughly matching (2) in its dependence on the condition number. Improved algorithms (Lin et al., 2014; Allen-Zhu et al., 2016; Shalev-Shwartz and Zhang, 2014; Lan and Zhou, 2018) obtain an accelerated rate similar to that in (3), with (Lan and Zhou, 2018) in particular showing that these bounds are unimprovable. All these results rely on a strong convexity assumption, with (Arjevani, 2017) showing that explicit knowledge of the strong convexity constant is required to get the fast rate using simple step size strategies. This remains a key limitation since the strong convexity constant is either unknown or poorly approximated in practice.
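To make the difference in condition-number dependence concrete, the sketch below counts how many iterations a per-iteration contraction factor of $1 - \mu/L$ versus $1 - \sqrt{\mu/L}$ needs to shrink an error by a given factor. These are the textbook forms of the slow and fast rates, taken here as an assumption, and the condition number `kappa` and tolerance `eps` are hypothetical values chosen for illustration.

```python
import math

def iterations_to_reach(eps, rate):
    """Smallest integer k with rate**k <= eps, for a contraction factor 0 < rate < 1."""
    return math.ceil(math.log(eps) / math.log(rate))

kappa = 1e4   # hypothetical condition number L / mu
eps = 1e-6    # hypothetical target reduction of the initial error

k_gd = iterations_to_reach(eps, 1 - 1 / kappa)              # slow rate: 1 - mu/L
k_agd = iterations_to_reach(eps, 1 - 1 / math.sqrt(kappa))  # fast rate: 1 - sqrt(mu/L)

print(f"gradient descent: {k_gd} iterations, accelerated: {k_agd} iterations")
```

The ratio of the two counts scales like $\sqrt{L/\mu}$, which is why losing the fast rate for lack of a good value of $\mu$ is costly on ill-conditioned problems.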
The situation is more favorable in the deterministic setting, with (Nesterov, 2013; Lin and Xiao, 2014a; Fercoq and Qu, 2016; Roulet and d’Aspremont, 2017; Renegar and Grimmer, 2018) showing that the fast rate can be achieved up to a factor , using a restart strategy (the first three references have an extra factor in the bound). The results in (Roulet and d’Aspremont, 2017) also show that the factor can be removed when the value of $F^*$ is known, so that restarted accelerated methods are fully adaptive to the strong convexity constant (and other types of growth conditions for that matter). This assumption is often reduced to assuming (see e.g. (Asi and Duchi, 2019) for a more complete discussion), and was used early on to devise better step size strategies for gradient methods, known as Polyak steps (Polyak, 1969; Nedic, 2002).
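As a reminder of what a Polyak step looks like, here is a minimal sketch of gradient descent with the classical Polyak step size $(f(x) - f^*) / \|\nabla f(x)\|^2$ for a smooth problem with known optimal value. The function name `polyak_gradient_descent` and the diagonal quadratic test problem are illustrative assumptions, not taken from the paper.

```python
def polyak_gradient_descent(f, grad, x0, f_star, n_iters):
    """Gradient descent whose step size is set from the known optimal value f_star."""
    x = list(x0)
    for _ in range(n_iters):
        g = grad(x)
        sq_norm = sum(gi * gi for gi in g)
        if sq_norm == 0.0:
            break  # stationary point reached
        step = (f(x) - f_star) / sq_norm  # classical Polyak step
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical toy problem: f(x) = 0.5 * (x_1**2 + 10 * x_2**2), with f* = 0.
eigs = [1.0, 10.0]
f = lambda x: 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))
grad = lambda x: [a * xi for a, xi in zip(eigs, x)]

x_final = polyak_gradient_descent(f, grad, [1.0, 1.0], f_star=0.0, n_iters=2000)
```

No smoothness or strong convexity constant is passed to the routine: the only problem-dependent input is the optimal value, which is the same information the estimates studied in this paper rely on.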
Our objective here is to remove the need for restart. From a practical point of view, while the theoretical bound in (Roulet and d’Aspremont, 2017) is optimal, empirical performance can vary significantly with residual parameter settings. From a theoretical perspective, the need to use a restart scheme highlights the fact that current algorithms and/or convergence analyses fail to capture some key aspects of the problem’s regularity properties. Restart schemes are a hack that achieves nearly optimal convergence rates; we seek better methods that remove the need for these schemes.
We make the following contributions.
- We bound the precision required in estimating the strong convexity parameter to get the fast convergence rate in (3). In particular, we show that sublinear convergence in the estimate of $\mu$ is enough to guarantee fast linear convergence of the iterates.
- Assuming $F^*$ is known, we detail an efficient strategy to produce local estimates of the strong convexity parameter $\mu$. This estimate has the added benefit of being local, hence better adapts to the geometry of the problem, further speeding up convergence compared to methods given a fixed initial bound on $\mu$.
- We test our strategy on a variety of learning problems and show that our method often significantly outperforms restart schemes in practice.
Notation
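In what follows, $f$ will denote an $L$-smooth and $\mu$-strongly convex function, and $h$ a lower semicontinuous proper convex function. $F = f + h$ is then a $\mu$-strongly convex function and $x^*$ will denote the unique minimizer of $F$ on $\mathbb{R}^n$. Let $F^* = F(x^*)$ be the optimal value of $F$. $h$ will be supposed simple enough so that for $y \in \mathbb{R}^n$ the gradient mapping
$$T_L(y) = \operatorname{argmin}_{x \in \mathbb{R}^n} \left\{ \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2 + h(x) \right\}$$
can be computed explicitly. Finally the reduced gradient is defined as
$$g_L(y) = L\,(y - T_L(y)).$$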
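For a concrete instance of a "simple" penalty, the sketch below takes $h(x) = \lambda \|x\|_1$, for which the gradient mapping reduces to a gradient step followed by soft thresholding. It assumes the usual proximal-gradient form of the mapping, with the reduced gradient equal to $L$ times the displacement. The function names and the parameter `lam` are hypothetical.

```python
import math

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def gradient_mapping(y, grad_f, L, lam):
    """One proximal gradient step from y with step 1/L, for h = lam * ||.||_1."""
    g = grad_f(y)
    forward = [yi - gi / L for yi, gi in zip(y, g)]
    return soft_threshold(forward, lam / L)

def reduced_gradient(y, grad_f, L, lam):
    """L * (y - T_L(y)); coincides with grad_f(y) when lam == 0."""
    t = gradient_mapping(y, grad_f, L, lam)
    return [L * (yi - ti) for yi, ti in zip(y, t)]

# Hypothetical smooth part f(x) = ||x||^2, so grad f(x) = 2x and L = 2.
grad_f = lambda x: [2.0 * xi for xi in x]
print(reduced_gradient([1.0, -3.0], grad_f, L=2.0, lam=0.0))
```

With `lam = 0` the penalty vanishes and the reduced gradient is just the gradient, which is the sense in which it generalizes $\nabla f$ to the composite setting.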
2 Nesterov Acceleration of Smooth and Strongly Convex Functions
In the following we seek to solve the optimization problem
$$\text{minimize} \quad F(x) = f(x) + h(x)$$
in the variable $x \in \mathbb{R}^n$.
2.1 APG with Known Strong Convexity Parameter
A classical method for smooth and strongly convex minimization, when the strong convexity parameter is known, is the Accelerated Proximal Gradient (APG) described in Algorithm 1.
It can be derived from the generic formulation of the Optimal Gradient Method in (Nesterov, 2018, §2.2.12-13), using a good choice of estimate sequences and coefficients in order to get only two iterate sequences, $x_k$ and $y_k$, with simple updates. Algorithm 2 describes Algorithm 1 using an estimate sequence formulation that will prove useful when introducing an estimated strong convexity parameter in the algorithm. A proof of this statement can be found in Appendix A.2.
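A minimal sketch of a two-sequence accelerated scheme with known $\mu$ is given below for the smooth case $h = 0$. It uses the common constant-momentum form of Nesterov's method with coefficient $(1 - \sqrt{\mu/L}) / (1 + \sqrt{\mu/L})$. This is an assumption about the shape of the updates and may differ in details from Algorithm 1. The toy quadratic and all function names are hypothetical.

```python
import math

def apg_known_mu(grad, x0, L, mu, n_iters):
    """Constant-momentum accelerated gradient for an L-smooth, mu-strongly convex f."""
    q = math.sqrt(mu / L)
    beta = (1.0 - q) / (1.0 + q)  # momentum coefficient
    x = list(x0)
    y = list(x0)
    for _ in range(n_iters):
        g = grad(y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]            # gradient step at y
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]  # extrapolation
        x = x_next
    return x

def gradient_descent(grad, x0, L, n_iters):
    """Plain gradient descent with step 1/L, for comparison."""
    x = list(x0)
    for _ in range(n_iters):
        x = [xi - gi / L for xi, gi in zip(x, grad(x))]
    return x

# Hypothetical toy problem: f(x) = 0.5 * (x_1**2 + 100 * x_2**2), so mu = 1, L = 100.
eigs = [1.0, 100.0]
f = lambda x: 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))
grad = lambda x: [a * xi for a, xi in zip(eigs, x)]

gap_apg = f(apg_known_mu(grad, [1.0, 1.0], L=100.0, mu=1.0, n_iters=200))
gap_gd = f(gradient_descent(grad, [1.0, 1.0], L=100.0, n_iters=200))
print(f"APG gap: {gap_apg:.3e}, GD gap: {gap_gd:.3e}")
```

Only the two sequences `x` and `y` are stored, which is the practical appeal of this formulation. The estimate-sequence view carries the same iterates plus the quadratic lower models used in the analysis.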
We start with the following lemma from (Lin and Xiao, 2014b), which is an extension of (Nesterov, 2018, Th 2.2.13), and will be used in the analysis.
Lemma 2.1
The following inequality holds for .
[TABLE]
Proof. See Appendix B.1.
Corollary 2.2
[TABLE]
Lemma 2.1 guarantees that the components of of Algorithm 2 are lower bounds on $F$. In particular, we have . These estimate sequences also have the significant advantage of being strongly convex quadratic functions. Proposition 2.3 now recalls the convergence bound of APG.
Proposition 2.3
After $k$ iterations the output of Algorithm 2 satisfies
[TABLE]
and
[TABLE]
Proof. A complete proof using estimate sequences is given in Appendix B.2.
This result shows a linear convergence rate in $1 - \sqrt{\mu/L}$. A line search on the smoothness parameter $L$ can be added to the algorithm without losing the convergence bound (Lin and Xiao, 2014b). In Algorithms 1 and 2 the strong convexity parameter $\mu$ is given as an input, and is typically hard to estimate. When a misspecified value $\tilde{\mu}$ is given, two cases are to be distinguished. In the case where $\tilde{\mu}$ is a lower bound on $\mu$, the proof of Proposition 2.3 still applies because $\mu$ is only used in lower bounds. Linear convergence is preserved and the rate of convergence becomes $1 - \sqrt{\tilde{\mu}/L}$. When $\tilde{\mu}$ is only an upper bound on $\mu$, the previous results only guarantee that the iterates of APG will not blow up (see for instance (Lin and Xiao, 2014b, Lemma 10)). In what follows we present a robustness result on APG, when using an upper bounding sequence $\mu_k$ that converges to $\mu$ at a sublinear rate.
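The effect of a misspecified strong convexity value can be observed on a toy problem. The sketch below runs a constant-momentum accelerated scheme (an assumed stand-in for APG with $h = 0$) on a two-dimensional diagonal quadratic, supplying in turn the exact $\mu$, an underestimate and an overestimate. The problem data and the name `apg_gap` are hypothetical, and a single instance says nothing about general behavior.

```python
import math

def apg_gap(eigs, x0, mu_used, n_iters):
    """Final gap f(x_k) - f* of constant-momentum APG on f(x) = 0.5 * sum(a_i * x_i**2),
    run with the supplied strong convexity value mu_used (possibly misspecified)."""
    L = max(eigs)
    q = math.sqrt(mu_used / L)
    beta = (1.0 - q) / (1.0 + q)
    x, y = list(x0), list(x0)
    for _ in range(n_iters):
        x_next = [yi - a * yi / L for a, yi in zip(eigs, y)]      # gradient step at y
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]  # extrapolation
        x = x_next
    return 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))

eigs = [1.0, 100.0]   # true mu = 1, L = 100
x0 = [1.0, 1.0]
gap_exact = apg_gap(eigs, x0, mu_used=1.0, n_iters=200)
gap_lower = apg_gap(eigs, x0, mu_used=0.25, n_iters=200)  # underestimate of mu
gap_upper = apg_gap(eigs, x0, mu_used=4.0, n_iters=200)   # overestimate of mu
print(f"exact: {gap_exact:.3e}, lower: {gap_lower:.3e}, upper: {gap_upper:.3e}")
```

On this instance both misspecified runs still decrease the objective but end with a larger gap than the run given the exact $\mu$, which illustrates why the quality of the supplied value matters.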
2.2 APG with Estimates of Strong Convexity Parameter
The main result of this section is that for all , a sequence such that
[TABLE]
for so converges at a sublinear rate towards , allows us to compute such that
[TABLE]
i.e. converges at a linear rate towards .
Let $(\mu_k)_{k \geq 0}$ be a positive real sequence such that $\mu_k \geq \mu$. Suppose that the $\mu_k$ are available in an online setting, meaning that the $k$-th term can be used at the $k$-th iteration of the algorithm. In the formulation of Algorithm 2, two properties have to be satisfied at each iteration to obtain the convergence bound of Proposition 2.3.
[TABLE]
The are modified in order to incorporate the strong convexity estimator.
[TABLE]
Adding these estimate sequences in the APG scheme yields Algorithm 3.
With this choice of recurrence for , the proximal update for is preserved. However in this case can no longer be expressed as a combination of and . In addition, the algorithm keeps the same form of updates as before, ensuring the property to be preserved at each iteration. However, relied on the strong convexity lower bounds induced by , and these bounds do not hold anymore with , introducing additional error terms. Proposition 2.4 below thus gives a preliminary bound on the primal gap depending on the distance between the and .
Proposition 2.4
Given a nonincreasing sequence of estimates $\mu_k$ such that $\mu_k \geq \mu$, the output of Algorithm 3 after $k$ iterations satisfies
[TABLE]
and
[TABLE]
Proof. The proof of this result is essentially the same as that of Proposition 2.3 and is given in full in Appendix B.3.
Our goal now is to control the right hand side given sufficient conditions on the gaps . In the strongly convex case, the behaviour of the distance to the optimum of the second iterate sequence can be controlled. The following lemma uses the form of the update in as a convex combination of and to bound .
Lemma 2.5
Given a nonincreasing upper-bounding sequence of . is a sequence defined as in Algorithm 3 on using .
[TABLE]
Proof. See Appendix B.4
The recurrence equation that defines the allows for a simple bound on the ratio .
Lemma 2.6
For and defined as in Algorithm 3
[TABLE]
Proof. This follows from the sequence $\mu_k$ being nonincreasing.
In the next Lemma, we show that when converges to at a summable rate, then converges to 0 with the same speed as .
Lemma 2.7
Given a non increasing sequence satisfying
[TABLE]
with with defined as in Lemma 2.6. Then for and defined as in Algorithm 3
[TABLE]
with
Proof. The proof of this statement can be found in Appendix B.5.
Now we can prove our main result on robustness of the fast gradient method using upper estimates of the strong convexity parameter.
Proposition 2.8
Given a non increasing sequence satisfying
[TABLE]
with , the output of Algorithm 3 satisfies
[TABLE]
where and
[TABLE]
Proof.
Combine Proposition 2.4 and Lemma 2.7. The bound on is true because .
This result can be extended to the case where the $\mu_k$ converge at a summable rate to $\mu$. Note also that the constant is bounded by since the $\mu_k$ will never be taken larger than in our case of interest.
3 Estimation of Strong Convexity Parameter
In this section we propose an estimate of the strong convexity parameter that can be computed online with the iterations of the algorithm. We do not prove the convergence of our estimate in the general case but we present hints that support its performance. The optimum function value is required to compute these estimates, as for Polyak steps. We set $\mu_0$ to a rough upper bound on $\mu$, for instance is suitable for problems that need to be solved with accelerated methods. Then for $k \geq 1$, $\mu_k$ is defined as follows
[TABLE]
In the following we restrict our study to the case $h = 0$, where $\mu_k$ becomes
[TABLE]
Lemma A.1 in the Appendix ensures that the $\mu_k$ are lower bounded by the strong convexity parameter $\mu$. The following lemma shows that $\mu_k$ indeed converges to $\mu$ when the $x_k$ are iterates of a gradient descent on $f$, a strongly convex quadratic.
Lemma 3.1
Let , , and suppose . Let be the iterates of a gradient descent procedure starting at with constant step where is the largest eigenvalue of . We get
[TABLE]
where is the smallest eigenvalue of , the second smallest and the component of on the eigenspace associated with .
Proof. Decompose the iterates on the eigenvectors of .
The same kind of convergence with an accelerated rate can be obtained when the $x_k$ are the iterates of an APG with a constant momentum on a strongly convex quadratic. The key in these two examples is that the component of the iterates associated with the smallest eigenvalue of the Hessian of $f$ has the slowest convergence rate. This is the combined effect of a gradient step that first decreases the components associated with the highest eigenvalues and of a small extrapolation step that preserves the order of convergence between the different components.
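The mechanism described above can be observed numerically. The sketch below runs gradient descent with step $1/L$ on a diagonal quadratic and tracks a running minimum of the ratio $\|\nabla f(x_k)\|^2 / (2(f(x_k) - f^*))$. This particular ratio is an assumption about the form of the estimate for $h = 0$, chosen because strong convexity bounds it below by $\mu$. The initial value $\mu_0 = L$ and the eigenvalues are hypothetical choices.

```python
def strong_convexity_estimates(eigs, x0, n_iters):
    """Running estimate of mu along gradient descent on f(x) = 0.5 * sum(a_i * x_i**2).

    The quadratic is diagonal, so eigs are the Hessian eigenvalues, f* = 0
    and the minimizer is the origin.
    """
    L = max(eigs)
    x = list(x0)
    mu_k = L  # hypothetical initial upper bound on mu
    history = []
    for _ in range(n_iters):
        grad = [a * xi for a, xi in zip(eigs, x)]
        gap = 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))  # f(x) - f*
        if gap == 0.0:
            break  # already at the optimum, nothing left to estimate
        ratio = sum(g * g for g in grad) / (2.0 * gap)
        mu_k = min(mu_k, ratio)  # keep the sequence nonincreasing
        history.append(mu_k)
        x = [xi - g / L for xi, g in zip(x, grad)]  # gradient step of size 1/L
    return history

eigs = [0.1, 1.0, 10.0]  # smallest eigenvalue 0.1 plays the role of mu
estimates = strong_convexity_estimates(eigs, [1.0, 1.0, 1.0], 200)
print(estimates[0], estimates[-1])
```

The ratio is a weighted average of the eigenvalues, with weights concentrating on the smallest one as the faster components die out, so the estimate decreases towards the smallest eigenvalue from above. The starting point must have a nonzero component along the corresponding eigenvector for this to happen.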
4 Numerical Experiments
In this section we present numerical experiments on Algorithm 3. We also show results of Algorithm 4, a very simple modification of APG for which we did not prove robustness but that appears to work very well in practice.
Both Algorithms 3 and 4 compute and use the strong convexity estimates defined in (22) during their execution. In order to get the values of $F^*$ in the experiments we run APG for a sufficient amount of time to reach machine precision. We compare our two algorithms (APG adapt) and (APG adapt v2) with Proximal Gradient Descent (PGD), Accelerated Proximal Gradient for smooth functions (APG), Accelerated Proximal Gradient with known strong convexity parameter (APG Optimal ) (for square loss and regularized logistic loss) and restarted Accelerated Proximal Gradient using $F^*$ in a stopping criterion with decay parameter (APG Restart ) tuned to give the best result. The restart scheme is described in Appendix C. Even though the theoretical complexity bound is optimal, the tuning step for the restart strategy still has a significant impact on empirical performance, as shown in Figure 3 in the Appendix. In terms of computational cost, our algorithms require one more call to the gradient oracle per iteration than the restarted algorithm but there is no parameter to tune: indeed $\mu_0$ is always chosen as and has no impact in practice.
Figure 1 shows the convergence of the primal gap when solving the matrix completion problem on synthetic data using the nuclear norm penalization formulation. Our adaptive algorithms exhibit linear convergence meaning that they successfully estimate the local strong convexity of the problem.
Figure 2 gathers the results of experiments on two real-world datasets of different sizes using 4 different classical losses. In all cases, our algorithms perform well and display the fast convergence rate. Figure 4 in Appendix C shows additional experiments and Figure 5 the convergence of our online estimate of the strong convexity parameter during the execution of the algorithm.
Appendix A Useful Lemmas
Lemma A.1
Since $f$ is $L$-smooth and $\mu$-strongly convex, the following bounds hold
[TABLE]
[TABLE]
Proof. See (Nesterov, 2018, Th 2.1.5, Th 2.1.10).
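For orientation, two standard consequences of $L$-smoothness and $\mu$-strong convexity are sketched below, written for the unconstrained case $h = 0$ with minimizer $x^*$ and optimal value $f^*$. They are textbook inequalities and an assumption about the kind of bound this lemma collects, not a transcription of its statement.

```latex
\begin{align*}
\frac{1}{2L}\|\nabla f(x)\|^2 \;\le\; f(x) - f^* \;&\le\; \frac{1}{2\mu}\|\nabla f(x)\|^2, \\
\frac{\mu}{2}\|x - x^*\|^2 \;\le\; f(x) - f^* \;&\le\; \frac{L}{2}\|x - x^*\|^2 .
\end{align*}
```

Rearranging the right-hand inequality of the first line gives $\mu \le \|\nabla f(x)\|^2 / \big(2 (f(x) - f^*)\big)$ for every $x \neq x^*$, which is the sense in which a ratio of this form yields an upper bound on the strong convexity parameter.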
Lemma A.2
The sequence follows the same updates in Algorithms 1 and 2.
Proof. Note that . Let , is a quadratic function. Since is the of
[TABLE]
[TABLE]
reinjecting in the expression of ,
[TABLE]
which is the update of Algorithm 1.
Appendix B Proofs of Lemmas and Propositions
B.1 Proof of Lemma 2.1
The optimality condition of can be written with . By strong convexity of we have
[TABLE]
B.2 Proof of Proposition 2.3
Recall that with this update of we have .
We have and Lemma 2.1 implies . This leads to the useful bound
[TABLE]
Then we show by induction that .
At rank , , and thus .
Then suppose the property is true at rank . Denote
[TABLE]
We conclude by combining the formulae defining and .
[TABLE]
finally since we get . In addition, and leads to .
B.3 Proof of Proposition 2.4
We follow the proof of Proposition 2.3. However here we have a different bound on .
. Which leads to
[TABLE]
Now we show by induction that . At rank , and , so the property is true. Suppose it is true at rank . Let .
[TABLE]
We conclude by combining the formulae defining and .
[TABLE]
finally since we get , re-injecting in (26) gives the right bound. In addition, and leads to .
B.4 Proof of Lemma 2.5
From the definition of in Algorithm 3, with . By convexity of
[TABLE]
We denote , we have that . Note that is -strongly convex, which gives
[TABLE]
We can bound the same way using Corollary 2.2
[TABLE]
Combining these inequalities in (27) gives the result.
B.5 Proof of Lemma 2.7
We prove our result by induction. For this is true since . Now suppose the property is true up to some rank .
By Lemma 2.5,
[TABLE]
Thus
[TABLE]
which concludes the proof.
Appendix C Numerical Experiments
In the quadratic case we have a natural strong convexity parameter, namely the smallest eigenvalue of the Hessian. However when the loss has a more complex structure we do not know a priori which quantity our estimates of strong convexity should be compared to. When looking at the proof of the convergence rate of Algorithm 3, the exact error term due to the fact that $\mu_k$ upper bounds $\mu$ is
[TABLE]
where is an iterate in Algorithm 3. We then define
[TABLE]
C.1 Parameters of the losses in Figure 2
References
- Allen-Zhu et al. [2016] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110–1119, 2016.
- Arjevani [2017] Yossi Arjevani. Limitations on variance-reduction and acceleration schemes for finite sums optimization. In Advances in Neural Information Processing Systems, pages 3540–3549, 2017.
- Asi and Duchi [2019] H. Asi and J. Duchi. The importance of better models in stochastic optimization. arXiv:1903.08619, 2019.
- Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. arXiv preprint arXiv:1407.0202, 2014.
- Fercoq and Qu [2016] Olivier Fercoq and Zheng Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358, 2016.
- Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Lan and Zhou [2018] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167–215, 2018.
