Polyak Steps for Adaptive Fast Gradient Method

Mathieu Barré; Alexandre d'Aspremont

arXiv:1906.03056·math.OC·June 10, 2019

TL;DR

This paper introduces a new adaptive method for accelerated gradient algorithms that estimates the strong convexity parameter online, eliminating the need for restart schemes and maintaining optimal convergence rates.

Contribution

It proposes a novel approach to adaptively estimate the strong convexity parameter during optimization, removing the necessity for restart strategies.

Findings

01 Achieves optimal linear convergence without restarts.

02 Demonstrates the robustness of the method when $μ$ is replaced by a sequence of estimated upper bounds.

03 Provides empirical evidence of effectiveness.

Abstract

Accelerated algorithms for minimizing smooth strongly convex functions usually require knowledge of the strong convexity parameter $μ$ . In the case of an unknown $μ$ , current adaptive techniques are based on restart schemes. When the optimal value $f^{*}$ is known, these strategies recover the accelerated linear convergence bound without additional grid search. In this paper we propose a new approach that has the same bound without any restart, using an online estimation of strong convexity parameter. We show the robustness of the Fast Gradient Method when using a sequence of upper bounds on $μ$ . We also present a good candidate for this estimate sequence and detail consistent empirical results.

Tables

Table 1: Parameters used in Figure 2.

Dataset	Logit regularization	Lasso regularization	SVM regularization
Musk	$λ ∥ \cdot ∥^{2}, λ = 100$	$λ ∥ \cdot ∥_{1}, λ = 100$	$\frac{1}{C} ∥ \cdot ∥^{2}, C = 1$
Madelon	$λ ∥ \cdot ∥^{2}, λ = 1000$	$λ ∥ \cdot ∥_{1}, λ = 800$	$\frac{1}{C} ∥ \cdot ∥^{2}, C = 1$

Equations

\min f(x) \triangleq h(x) + \psi(x)

f(x_{k}) - f^{*} \leq \frac{L}{2}\left(1 - \frac{\mu}{L}\right)^{k}\|x^{*} - x_{0}\|_{2}^{2}

f(x_{k}) - f^{*} \leq \frac{L}{2}\left(1 - \sqrt{\frac{\mu}{L}}\right)^{k}\|x^{*} - x_{0}\|_{2}^{2}

T_{\alpha}(y) = \underset{x \in \mathbb{R}^{n}}{\mathrm{argmin}}\; h(y) + \nabla h(y)^{T}(x - y) + \frac{\alpha}{2}\|x - y\|^{2} + \psi(x)

g_{\alpha}(y) = \alpha(y - T_{\alpha}(y)).

\min f(x) := h(x) + \psi(x)

f(x) \geq f(T_{L}(y)) + g_{L}(y)^{T}(x - y) + \frac{1}{2L}\|g_{L}(y)\|^{2} + \frac{\mu}{2}\|x - y\|^{2}

f(x) - f^{*} \geq \frac{\mu}{2}\|x - x^{*}\|^{2}, \quad \forall x \in \mathbb{R}^{n}

f(y_{k}) - f^{*} \leq \frac{(f(x_{0}) - f^{*}) + \frac{\mu}{2}\|x_{0} - x^{*}\|^{2}}{A_{k}}

A_{k} = \left(1 - \sqrt{\frac{\mu}{L}}\right)^{-k}, \quad \forall k \geq 0

0 \leq \mu_{i} - \mu \leq \frac{C}{(i+1)^{2}}

f(y_{k}) - f^{*} \leq C_{0}\left(1 - \sqrt{\frac{\mu}{L}}\right)^{k},

\left.\begin{array}[]{ll}(P_{1}^{k}):&m_{k}(x^{*})\leq A_{k}f^{*}\\ (P_{2}^{k}):&A_{k}f(y_{k})\leq f(x_{0})-f^{*}+\underset{x\in{\mathbb{R}}^{n}}{\min\;}m_{k}(x)+\frac{a_{0}\mu}{2}\|x-x_{0}\|^{2}\end{array}\right\}k\geq 0

\left\{\begin{array}[]{ll}m_{0}(x)=a_{0}f^{*}&\\ m_{k+1}(x)=m_{k}(x)+a_{k+1}\left(l_{L}(x,x_{k+1})+\frac{\mu_{k+1}}{2}\|x-x_{k+1}\|^{2}\right),k\geq 0\end{array}\right.

f(y_{k}) - f^{*} \leq \frac{f(x_{0}) - f^{*} + \frac{a_{0}\mu}{2}\|x_{0} - x^{*}\|^{2}}{A_{k}} + \sum_{i=0}^{k}\frac{a_{i}}{2A_{k}}(\mu_{i} - \mu)\|x_{i} - x^{*}\|^{2}

A_{k} = \prod_{i=1}^{k}\left(1 - \sqrt{\frac{\mu_{i}}{L}}\right)^{-1}

\|x_{k+1} - x^{*}\|^{2} \leq \frac{2(f(x_{0}) - f^{*}) + \frac{a_{0}\mu}{2}\|x_{0} - x^{*}\|^{2}}{A_{k}\mu} + \sum_{i=0}^{k}\frac{a_{i}}{A_{k}}\frac{\mu_{i} - \mu}{\mu}\|x_{i} - x^{*}\|^{2}, \quad \forall k \geq 0

\frac{a_{k+1}}{A_{k}} = \frac{\sqrt{\frac{\mu_{k+1}}{L}}}{1 - \sqrt{\frac{\mu_{k+1}}{L}}} \leq C_{1} = \frac{\sqrt{\frac{\mu_{0}}{L}}}{1 - \sqrt{\frac{\mu_{0}}{L}}}

0 \leq \mu_{k} - \mu \leq \frac{C}{(k+1)^{2}}, \quad \forall k \geq 1

a_{k}(\mu_{k} - \mu)\|x_{k} - x^{*}\|^{2} \leq \frac{C_{0}}{(k+1)^{2}}, \quad \forall k \geq 0

0 \leq \mu_{k} - \mu \leq \frac{C}{(k+1)^{2}}, \quad \forall k \geq 1

f(y_{k}) - f^{*} \leq \frac{5C_{0}}{2A_{k}}, \quad \forall k \geq 0

A_{k} \geq \left(1 - \sqrt{\frac{\mu}{L}}\right)^{-k}, \quad \forall k \geq 1

\left.\begin{array}{l}\hat{\mu}_{k+1}=\frac{\|g_{L}(y_{k})\|^{2}}{2(f(y_{k})-f^{*})}\\ \mu_{k+1}=\underset{i=0..k+1}{\min\;}\hat{\mu}_{i}\end{array}\right\}\forall k\geq 0

\hat{\mu}_{k+1} = \frac{\|\nabla h(y_{k})\|^{2}}{2(h(y_{k}) - h^{*})}, \quad \forall k \geq 0

\frac{\|\nabla h(y_{k})\|^{2}}{2(h(y_{k}) - h^{*})} - \mu \leq \frac{\|y_{0} - x^{*}\|^{2}}{\omega_{1}^{2}}(\lambda_{2} - \mu)\frac{\lambda_{2}}{\mu}\left(\frac{1 - \frac{\lambda_{2}}{L}}{1 - \frac{\mu}{L}}\right)^{2k}

h(x) \leq h(y) + \nabla h(y)^{T}(x - y) + \frac{L}{2}\|x - y\|^{2}

h(x) \geq h(y) + \nabla h(y)^{T}(x - y) + \frac{\mu}{2}\|x - y\|^{2}

\phi_{k+1}(x)

+ a_{k + 1} (f (T_{L} (x_{k + 1})) + g_{L} (x_{k + 1})^{T} (x - x_{k + 1}) + \frac{1}{2 L} ∥ g_{L} (x_{k + 1}) ∥^{2})

v_{k + 1}

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Advanced Optimization Algorithms Research

Full text

1 Introduction

We focus on solving a generic optimization problem written

$\min f(x)\triangleq h(x)+\psi(x)\qquad(1)$

in the variable $x\in{\mathbb{R}}^{n}$ , where $h$ is a $L$ -smooth, $\mu$ -strongly convex function and $\psi(x)$ a convex penalty term. In the deterministic setting, classical convergence bounds show

$f(x_{k})-f^{*}\leq\frac{L}{2}\left(1-\frac{\mu}{L}\right)^{k}\|x^{*}-x_{0}\|_{2}^{2}\qquad(2)$

after $k$ iterations of gradient descent with fixed step size, while accelerated proximal gradient descent methods yield iterates satisfying

$f(x_{k})-f^{*}\leq\frac{L}{2}\left(1-\sqrt{\frac{\mu}{L}}\right)^{k}\|x^{*}-x_{0}\|_{2}^{2}\qquad(3)$

after $k$ iterations, showing a significantly weaker dependence on the problem’s condition number $\kappa=L/\mu$ (see (Nesterov, 2005) for a complete discussion). Similar rates have been obtained in the stochastic setting under the assumption that $h$ is a finite sum. Early work in (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014) produced algorithms with a slow rate roughly matching (2) in its dependence on the condition number. Improved algorithms (Lin et al., 2014; Allen-Zhu et al., 2016; Shalev-Shwartz and Zhang, 2014; Lan and Zhou, 2018) obtain an accelerated rate similar to that in (3), with (Lan and Zhou, 2018) in particular showing that these bounds are unimprovable. All these results rely on a strong convexity assumption, with (Arjevani, 2017) showing that explicit knowledge of the strong convexity constant is required to get the fast rate using simple step size strategies. This remains a key limitation since the strong convexity constant is either unknown or poorly approximated in practice.
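To make the gap between the two rates concrete, the following minimal sketch counts how many iterations each contraction factor needs to shrink its bound by a fixed factor. The condition number and the target reduction are illustrative values chosen here, not figures from the paper.

```python
import math

def iterations_needed(rate: float, eps: float) -> int:
    """Smallest k such that rate**k <= eps, for a contraction factor 0 < rate < 1."""
    return math.ceil(math.log(eps) / math.log(rate))

kappa = 1e4   # illustrative condition number L / mu (assumed value)
eps = 1e-6    # illustrative target reduction of the bound (assumed value)

# Gradient descent contracts by (1 - mu/L) per iteration, as in (2).
k_gd = iterations_needed(1.0 - 1.0 / kappa, eps)
# Accelerated methods contract by (1 - sqrt(mu/L)) per iteration, as in (3).
k_apg = iterations_needed(1.0 - 1.0 / math.sqrt(kappa), eps)

print(k_gd, k_apg)
```

The iteration count scales roughly like $\kappa\log(1/\epsilon)$ for the first rate and like $\sqrt{\kappa}\log(1/\epsilon)$ for the second.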

The situation is more favorable in the deterministic setting, with (Nesterov, 2013; Lin and Xiao, 2014a; Fercoq and Qu, 2016; Roulet and d’Aspremont, 2017; Renegar and Grimmer, 2018) showing that the fast rate can be achieved up to a factor $\log(\kappa)$, using a restart strategy (the first three references have an extra $1/\mu$ factor in the bound). The results in (Roulet and d’Aspremont, 2017) also show that the $\log(\kappa)$ factor can be removed when the value of $f^{*}$ is known, so that restarted accelerated methods are fully adaptive to the strong convexity constant (and other types of growth conditions for that matter). This assumption is often reduced to assuming $f^{*}=0$ (see e.g. (Asi and Duchi, 2019) for a more complete discussion), and was used early on to devise better step size strategies for gradient methods, known as Polyak steps (Polyak, 1969; Nedic, 2002).

Our objective here is to remove the need for restart. From a practical point of view, while the theoretical bound in (Roulet and d’Aspremont, 2017) is optimal, empirical performance can vary significantly with residual parameter settings. From a theoretical perspective, the need to use a restart scheme highlights the fact that current algorithms and/or convergence analyses fail to capture some key aspects of the problem’s regularity properties. Restart schemes are a hack that achieves nearly optimal convergence rates; we seek better methods that remove the need for these schemes.

We make the following contributions.

• We bound the precision required in estimating the strong convexity parameter $\mu$ to get the fast convergence rate in (3). In particular, we show that sublinear convergence in the estimate of $\mu$ is enough to guarantee fast linear convergence of the iterates.

• Assuming $f^{*}$ is known, we detail an efficient strategy to produce local estimates of the strong convexity parameter $\mu$. This estimate has the added benefit of being local, hence better adapts to the geometry of the problem, further speeding up convergence compared to methods given a fixed initial bound on $\mu$.

• We test our strategy on a variety of learning problems and show that our method often significantly outperforms restart schemes in practice.

Notation

In what follows, $h$ denotes an $L$-smooth and $\mu$-strongly convex function and $\psi$ a lower semicontinuous proper convex function. Then $f(x):=h(x)+\psi(x)$ is a $\mu$-strongly convex function and $x^{*}$ denotes the unique minimizer of $f$ on ${\mathbb{R}}^{n}$. Let $f^{*}=f(x^{*})$ be the optimal value of $f$. We assume $\psi$ is simple enough that, for $\alpha>0$, the gradient mapping $T_{\alpha}$

$T_{\alpha}(y)=\underset{x\in{\mathbb{R}}^{n}}{\mathop{\rm argmin}\;}h(y)+\nabla h(y)^{T}(x-y)+\frac{\alpha}{2}\|x-y\|^{2}+\psi(x)$

can be computed explicitly. Finally the reduced gradient is defined as

$g_{\alpha}(y)=\alpha(y-T_{\alpha}(y)).$
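As an illustration of a penalty that is "simple enough", the sketch below evaluates the gradient mapping and the reduced gradient for the assumed choice $\psi(x)=\lambda\|x\|_{1}$, where the argmin reduces to soft-thresholding of a gradient step. The function names and the toy objective $h(x)=\frac{1}{2}\|x-b\|^{2}$ are hypothetical and chosen here for illustration only.

```python
def soft_threshold(v, t):
    """Componentwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def gradient_mapping(grad_h, y, alpha, lam):
    """T_alpha(y) for psi = lam * ||.||_1 (an assumed penalty).

    Completing the square in the definition of T_alpha shows that it equals
    the prox of psi / alpha evaluated at the gradient step y - grad_h(y) / alpha.
    """
    g = grad_h(y)
    z = [yi - gi / alpha for yi, gi in zip(y, g)]
    return soft_threshold(z, lam / alpha)

def reduced_gradient(grad_h, y, alpha, lam):
    """g_alpha(y) = alpha * (y - T_alpha(y))."""
    t = gradient_mapping(grad_h, y, alpha, lam)
    return [alpha * (yi - ti) for yi, ti in zip(y, t)]

# Toy smooth part h(x) = 0.5 * ||x - b||^2, so grad h(x) = x - b and L = 1.
b = [3.0, 0.5]
grad_h = lambda x: [xi - bi for xi, bi in zip(x, b)]

y = [0.0, 0.0]
print(gradient_mapping(grad_h, y, alpha=1.0, lam=1.0))
print(reduced_gradient(grad_h, y, alpha=1.0, lam=0.0))
```

With $\lambda=0$ the penalty vanishes and the reduced gradient coincides with $\nabla h(y)$, which is why it plays the role of a gradient in the composite setting.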

2 Nesterov Acceleration of Smooth and Strongly Convex Functions

In the following we seek to solve the optimization problem

$\min f(x):=h(x)+\psi(x)$

in the variable $x\in{\mathbb{R}}^{n}$ .

2.1 APG with Known Strong Convexity Parameter

A classical method for smooth and strongly convex minimization, when the strong convexity parameter is known, is the Accelerated Proximal Gradient (APG) described in Algorithm 1.

It can be derived from the generic formulation of the Optimal Gradient Method in (Nesterov, 2018, §2.2.12-13), using a good choice of estimate sequences and coefficients in order to get only two iterate sequences, $(x_{i})_{i\in\mathbb{N}}$ and $(y_{i})_{i\in\mathbb{N}}$ , with simple updates. Algorithm 2 describes Algorithm 1 using an estimate sequence formulation that will prove useful when introducing an estimated strong convexity in the algorithm. A proof of this statement can be found in Appendix A.2.

We start with the following lemma from (Lin and Xiao, 2014b), which is an extension of (Nesterov, 2018, Th 2.2.13), and will be used in the analysis.

Lemma 2.1

The following inequality holds for $x,y\in{\mathbb{R}}^{n}$.

$f(x)\geq f(T_{L}(y))+g_{L}(y)^{T}(x-y)+\frac{1}{2L}\|g_{L}(y)\|^{2}+\frac{\mu}{2}\|x-y\|^{2}$

Proof. See Appendix B.1.

Corollary 2.2

$f(x)-f^{*}\geq\frac{\mu}{2}\|x-x^{*}\|^{2},\quad\forall x\in{\mathbb{R}}^{n}$

Lemma 2.1 guarantees that the components of $m_{k}(x)$ in Algorithm 2 are lower bounds on $f(x)$. In particular, we have $m_{k}(x^{*})\leq A_{k}f^{*}$. These estimate sequences also have the major advantage of being strongly convex quadratic functions. Proposition 2.3 now recalls the convergence bound of APG.

Proposition 2.3

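After $k$ iterations the output $y_{k}$ of Algorithm 2 satisfies

$f(y_{k})-f^{*}\leq\frac{(f(x_{0})-f^{*})+\frac{\mu}{2}\|x_{0}-x^{*}\|^{2}}{A_{k}}$

and

$A_{k}=\left(1-\sqrt{\frac{\mu}{L}}\right)^{-k},\quad\forall k\geq 0$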

Proof. A complete proof using estimate sequence is given in Appendix B.2.

This result shows a linear convergence rate in $\left(1-\sqrt{\frac{\mu}{L}}\right)^{k}$. A line search on the smoothness parameter $L$ can be added to the algorithm without losing the convergence bound (Lin and Xiao, 2014b). In Algorithms 1 and 2 the strong convexity parameter is given as an input, and is typically hard to estimate. When a misspecified $\hat{\mu}\neq\mu$ is given, two cases must be distinguished. When $\hat{\mu}$ is a lower bound on $\mu$, the proof of Proposition 2.3 still applies because $\mu$ is only used in lower bounds. Linear convergence is preserved and the rate of convergence becomes $(1-\sqrt{\frac{\hat{\mu}}{L}})^{k}$. When $\hat{\mu}$ is only an upper bound on $\mu$, the previous results only guarantee that the iterates of APG will not blow up (see for instance (Lin and Xiao, 2014b, Lemma 10)). In what follows we present a robustness result on APG when using an upper bounding sequence that converges to $\mu$ at a sublinear rate.

2.2 APG with Estimates of Strong Convexity Parameter

The main result of this section is that for all $k\geq 0$ , a sequence $(\mu_{i})_{i\in\mathbb{N}}$ such that

$0\leq\mu_{i}-\mu\leq\frac{C}{(i+1)^{2}}$

for $i\in[|1,k|]$ so $\mu_{i}$ converges at a sublinear rate towards $\mu$ , allows us to compute $y_{k}\in{\mathbb{R}}^{n}$ such that

$f(y_{k})-f^{*}\leq C_{0}\left(1-\sqrt{\frac{\mu}{L}}\right)^{k},$

i.e. $f(y_{k})$ converges at a linear rate towards $f^{*}$ .

Let $(\mu_{i})_{i\in\mathbb{N}}$ be a positive real sequence such that $\mu_{i}\geq\mu,\forall i\geq 0$ . Suppose that the $\mu_{i}$ are available in an online setting, meaning that the $i$ -th term can be used at the $i$ -th iteration of the algorithm. In the formulation of Algorithm 2, two properties have to be satisfied at each iteration to obtain the convergence bound of Proposition 2.3.

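$\left.\begin{array}{ll}(P_{1}^{k}):&m_{k}(x^{*})\leq A_{k}f^{*}\\ (P_{2}^{k}):&A_{k}f(y_{k})\leq f(x_{0})-f^{*}+\underset{x\in{\mathbb{R}}^{n}}{\min\;}m_{k}(x)+\frac{a_{0}\mu}{2}\|x-x_{0}\|^{2}\end{array}\right\}k\geq 0$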

The $m_{k}$ are modified in order to incorporate the strong convexity estimator.

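$\left\{\begin{array}{ll}m_{0}(x)=a_{0}f^{*}&\\ m_{k+1}(x)=m_{k}(x)+a_{k+1}\left(l_{L}(x,x_{k+1})+\frac{\mu_{k+1}}{2}\|x-x_{k+1}\|^{2}\right),k\geq 0\end{array}\right.$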

Adding these estimate sequences in the APG scheme yields Algorithm 3.

With this choice of recurrence for $a_{k}$ , the proximal update for $y_{k+1}$ is preserved. However in this case $x_{k+1}$ can no longer be expressed as a combination of $y_{k}$ and $y_{k-1}$ . In addition, the algorithm keeps the same form of updates as before, ensuring the property $(P_{2}^{k})$ to be preserved at each iteration. However, $(P_{1}^{k})$ relied on the strong convexity lower bounds induced by $\mu$ , and these bounds do not hold anymore with $\mu_{i}$ , introducing additional error terms. Proposition 2.4 below thus gives a preliminary bound on the primal gap depending on the distance between the $\mu_{i}$ and $\mu$ .

Proposition 2.4

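Given a nonincreasing sequence of estimates $(\mu_{i})_{i\in\mathbb{N}}$ such that $\mu_{i}\geq\mu,\forall i>0$, the output of Algorithm 3 after $k$ iterations satisfies

$f(y_{k})-f^{*}\leq\frac{f(x_{0})-f^{*}+\frac{a_{0}\mu}{2}\|x_{0}-x^{*}\|^{2}}{A_{k}}+\sum_{i=0}^{k}\frac{a_{i}}{2A_{k}}(\mu_{i}-\mu)\|x_{i}-x^{*}\|^{2}$

and

$A_{k}=\prod_{i=1}^{k}\left(1-\sqrt{\frac{\mu_{i}}{L}}\right)^{-1}$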

Proof. The proof of this result is essentially the same as of Proposition 2.3 and is completely detailed in Appendix B.3.

Our goal now is to control the right hand side given sufficient conditions on the gaps $\mu_{i}-\mu$. In the strongly convex case, the behaviour of the distance to the optimum of the second iterate sequence $(x_{i})$ can be controlled. The following lemma uses the form of the update in $x_{k+1}$ as a convex combination of $v_{k}$ and $y_{k}$ to bound $\|x_{k+1}-x^{*}\|^{2}$.

Lemma 2.5

Let $(\mu_{i})_{i\in\mathbb{N}}$ be a nonincreasing sequence of upper bounds on $\mu$, and let $(x_{i})_{i\in\mathbb{N}}$ be the sequence defined as in Algorithm 3 on $f$ using $(\mu_{i})_{i\in\mathbb{N}}$. Then

$\|x_{k+1}-x^{*}\|^{2}\leq\frac{2(f(x_{0})-f^{*})+\frac{a_{0}\mu}{2}\|x_{0}-x^{*}\|^{2}}{A_{k}\mu}+\sum_{i=0}^{k}\frac{a_{i}}{A_{k}}\frac{\mu_{i}-\mu}{\mu}\|x_{i}-x^{*}\|^{2},\quad\forall k\geq 0$

Proof. See Appendix B.4.

The recurrence equation that defines the $a_{k}$ allows for a simple bound on the ratio $\frac{a_{k+1}}{A_{k}}$ .

Lemma 2.6

For $(a_{i})$ and $(A_{i})$ defined as in Algorithm 3

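$\frac{a_{k+1}}{A_{k}}=\frac{\sqrt{\frac{\mu_{k+1}}{L}}}{1-\sqrt{\frac{\mu_{k+1}}{L}}}\leq C_{1}=\frac{\sqrt{\frac{\mu_{0}}{L}}}{1-\sqrt{\frac{\mu_{0}}{L}}}$

Proof. The inequality holds because $(\mu_{i})$ is nonincreasing and $t\mapsto\frac{t}{1-t}$ is increasing on $[0,1)$.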
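The ratio bound of Lemma 2.6 can be checked numerically. The sketch below assumes the recurrence $A_{k+1}=A_{k}/(1-\sqrt{\mu_{k+1}/L})$ stated in Appendix B.3 together with $a_{k+1}=A_{k+1}-A_{k}$, which is the usual estimate-sequence convention and is taken here as an assumption since Algorithm 3 is not reproduced in this text. The sequence `mus` is an arbitrary nonincreasing example.

```python
import math

def ratio_sequence(mus, L):
    """Return the ratios a_{k+1} / A_k for k = 0, 1, ... with A_0 = a_0 = 1."""
    A = 1.0
    ratios = []
    for mu_next in mus[1:]:
        A_next = A / (1.0 - math.sqrt(mu_next / L))
        a_next = A_next - A   # assumption: A_k is the running sum of the a_i
        ratios.append(a_next / A)
        A = A_next
    return ratios

L = 1.0
mus = [0.5, 0.4, 0.2, 0.1]   # arbitrary nonincreasing estimates, mus[0] = mu_0
ratios = ratio_sequence(mus, L)

# Constant C_1 of Lemma 2.6, built from mu_0 only.
C1 = math.sqrt(mus[0] / L) / (1.0 - math.sqrt(mus[0] / L))
print(ratios, C1)
```

Each ratio depends only on the current estimate $\mu_{k+1}$, so a nonincreasing sequence keeps all of them below the value obtained at $\mu_{0}$.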

In the next Lemma, we show that when $\mu_{i}$ converges to $\mu$ at a summable rate, then $\|x_{k}-x^{*}\|^{2}$ converges to 0 with the same speed as $f(y_{k})-f^{*}$ .

Lemma 2.7

Given a nonincreasing sequence $(\mu_{i})_{i\in\mathbb{N}}$ satisfying

$0\leq\mu_{k}-\mu\leq\frac{C}{(k+1)^{2}},\quad\forall k\geq 1$

with $C\leq\frac{\mu}{3C_{1}}$ and $C_{1}$ defined as in Lemma 2.6, then for $(a_{i})$ and $(x_{i})$ defined as in Algorithm 3

$a_{k}(\mu_{k}-\mu)\|x_{k}-x^{*}\|^{2}\leq\frac{C_{0}}{(k+1)^{2}},\quad\forall k\geq 0$

with $C_{0}=\max(a_{0}(\mu_{0}-\mu)\|x_{0}-x^{*}\|^{2},2(f(x_{0})-f^{*})+a_{0}\mu\|x_{0}-x^{*}\|^{2})$.

Proof. The proof of this statement can be found in Appendix B.5.

Now we can prove our main result on robustness of the fast gradient method using upper estimates of the strong convexity parameter.

Proposition 2.8

Given a nonincreasing sequence $(\mu_{i})$ satisfying

$0\leq\mu_{k}-\mu\leq\frac{C}{(k+1)^{2}},\quad\forall k\geq 1$

with $C\leq\frac{\mu(1-\sqrt{\frac{\mu_{0}}{L}})}{3\sqrt{\frac{\mu_{0}}{L}}}$, the output of Algorithm 3 satisfies

$f(y_{k})-f^{*}\leq\frac{5C_{0}}{2A_{k}},\quad\forall k\geq 0$

where $C_{0}=\max((\mu_{0}-\mu)\|x_{0}-x^{*}\|^{2},2(f(x_{0})-f^{*})+\mu\|x_{0}-x^{*}\|^{2})$ and

$A_{k}\geq\left(1-\sqrt{\frac{\mu}{L}}\right)^{-k},\quad\forall k\geq 1$

Proof.

Combine Proposition 2.4 and Lemma 2.7. The bound on $A_{k}$ holds because $\mu_{i}\geq\mu$, so each factor $\left(1-\sqrt{\frac{\mu_{i}}{L}}\right)^{-1}$ of $A_{k}$ is at least $\left(1-\sqrt{\frac{\mu}{L}}\right)^{-1}$.

This result can be extended to the case where the $\mu_{i}$ converge to $\mu$ at a summable rate. Note also that the constant $C_{0}$ is bounded by $\max((\frac{L}{2}-\mu)\|x_{0}-x^{*}\|^{2},2(f(x_{0})-f^{*})+\mu\|x_{0}-x^{*}\|^{2})$ since $\mu_{0}$ will never be taken larger than $\frac{L}{2}$ in our case of interest.
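The lower bound on $A_{k}$ can be illustrated with a synthetic sequence of upper estimates. In the sketch below, the values of $L$ and $\mu$ and the sequence $\mu_{i}=\mu+C/(i+1)^{2}$ are assumptions chosen to satisfy the hypotheses of Proposition 2.8, not quantities produced by the algorithm.

```python
import math

def accumulate_A(mus, L):
    """A_0 = 1 and A_k = prod_{i=1..k} (1 - sqrt(mu_i / L))^{-1}."""
    A = [1.0]
    for mu_i in mus[1:]:
        A.append(A[-1] / (1.0 - math.sqrt(mu_i / L)))
    return A

L, mu = 1.0, 0.01   # illustrative constants (assumed values)
mu0 = L / 2         # rough initial upper bound on mu
C1 = math.sqrt(mu0 / L) / (1.0 - math.sqrt(mu0 / L))
C = mu / (3.0 * C1)  # largest constant allowed by the proposition

# Synthetic upper estimates converging to mu at the rate C / (i + 1)^2.
mus = [mu0] + [mu + C / (i + 1) ** 2 for i in range(1, 60)]

A = accumulate_A(mus, L)
exact = [(1.0 - math.sqrt(mu / L)) ** (-k) for k in range(len(A))]
print(all(a >= e * (1 - 1e-12) for a, e in zip(A, exact)))
```

Overestimating $\mu$ only makes $A_{k}$ grow faster, so the price of the estimates appears in the error terms of Proposition 2.4 rather than in $A_{k}$ itself.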

3 Estimation of Strong Convexity Parameter

In this section we propose an estimate of the strong convexity parameter, that can be computed online with the iterations of the algorithm. We do not prove the convergence of our estimate in the general case but we present hints that support its performance. The optimum function value $f^{*}$ is required to compute these estimates, as for Polyak steps. We set $\mu_{0}$ to a rough upper bound on $\mu$ , for instance $\frac{L}{2}$ is suitable for problems that need to be solved with accelerated methods. Then $\mu_{k+1}$ for $k\geq 0$ is defined as follows

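$\left.\begin{array}{l}\hat{\mu}_{k+1}=\frac{\|g_{L}(y_{k})\|^{2}}{2(f(y_{k})-f^{*})}\\ \mu_{k+1}=\underset{i=0..k+1}{\min\;}\hat{\mu}_{i}\end{array}\right\}\forall k\geq 0\qquad(22)$

In the following we restrict our study to the case $\psi(x)=0$, where $\hat{\mu}$ becomes

$\hat{\mu}_{k+1}=\frac{\|\nabla h(y_{k})\|^{2}}{2(h(y_{k})-h^{*})},\quad\forall k\geq 0$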

Lemma A.1 in the Appendix ensures that the $\mu_{k}$ are lower bounded by the strong convexity $\mu$ . The following lemma shows that $\mu_{k}$ is effectively converging to $\mu$ when the $y_{k}$ are iterates of a gradient descent on $h$ , a strongly convex quadratic.
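The lower bound $\hat{\mu}_{k+1}\geq\mu$ in the case $\psi=0$ follows in one step from the strong convexity inequality of Lemma A.1. The short derivation below is a reader's sketch of that step, not a reproduction of the paper's argument.

```latex
% Strong convexity of h (Lemma A.1), for all x, y:
%   h(x) \geq h(y) + \nabla h(y)^{T}(x - y) + \frac{\mu}{2}\|x - y\|^{2}.
% Minimize both sides over x; the right-hand side is minimized at
% x = y - \nabla h(y)/\mu, which gives
h^{*} \;\geq\; h(y) - \frac{1}{2\mu}\,\|\nabla h(y)\|^{2}
\quad\Longrightarrow\quad
\mu \;\leq\; \frac{\|\nabla h(y)\|^{2}}{2\,\bigl(h(y) - h^{*}\bigr)}
\qquad \text{for every } y \neq x^{*}.
```

Taking $y=y_{k}$ gives $\hat{\mu}_{k+1}\geq\mu$, and the running minimum $\mu_{k+1}$ inherits the bound as long as the initial value $\mu_{0}$ is itself an upper bound on $\mu$.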

Lemma 3.1

Let $h^{*}\in{\mathbb{R}},x^{*}\in{\mathbb{R}}^{n}$ , $A\in S_{n}^{++}({\mathbb{R}})$ , and suppose $h(x)=h^{*}+\frac{1}{2}(x-x^{*})^{T}A(x-x^{*})$ . Let $y_{k}$ be the iterates of a gradient descent procedure starting at $y_{0}$ with constant step $\frac{1}{L}$ where $L$ is the largest eigenvalue of $A$ . We get

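$\frac{\|\nabla h(y_{k})\|^{2}}{2(h(y_{k})-h^{*})}-\mu\leq\frac{\|y_{0}-x^{*}\|^{2}}{\omega_{1}^{2}}(\lambda_{2}-\mu)\frac{\lambda_{2}}{\mu}\left(\frac{1-\frac{\lambda_{2}}{L}}{1-\frac{\mu}{L}}\right)^{2k}$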

where $\mu$ is the smallest eigenvalue of $A$ , $\lambda_{2}$ the second smallest and $\omega_{1}$ the component of $y_{0}-x^{*}$ on the eigenspace associated with $\mu$ .

Proof. Decompose the iterates on the eigenvectors of $A$ .
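The behaviour described in Lemma 3.1 is easy to observe on a diagonal quadratic. The sketch below runs gradient descent with step $\frac{1}{L}$ and tracks the running minimum of the estimate; the eigenvalues, the starting point and the iteration count are illustrative assumptions, and the setup takes $x^{*}=0$ and $h^{*}=0$ for simplicity.

```python
def polyak_mu_estimates(eigs, y0, num_iters):
    """Gradient descent on h(y) = 0.5 * sum(eigs[j] * y[j]**2), with h* = 0.

    Returns the running minimum of ||grad h(y_k)||^2 / (2 * (h(y_k) - h*)).
    """
    L = max(eigs)
    y = list(y0)
    best = float("inf")
    history = []
    for _ in range(num_iters):
        grad = [lam * yj for lam, yj in zip(eigs, y)]
        gap = 0.5 * sum(lam * yj * yj for lam, yj in zip(eigs, y))
        if gap == 0.0:
            break  # exact optimum reached, the estimate is undefined
        mu_hat = sum(g * g for g in grad) / (2.0 * gap)
        best = min(best, mu_hat)
        history.append(best)
        # Gradient step with constant step size 1 / L.
        y = [yj - g / L for yj, g in zip(y, grad)]
    return history

eigs = [0.1, 1.0, 10.0]   # illustrative spectrum: mu = 0.1, lambda_2 = 1, L = 10
history = polyak_mu_estimates(eigs, y0=[1.0, 1.0, 1.0], num_iters=200)
print(history[0], history[-1])
```

The estimate is a weighted average of the eigenvalues, with weights $\lambda_{j}y_{j}^{2}$, so it always lies between $\mu$ and $L$ and drifts toward $\mu$ as the components along larger eigenvalues vanish first.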

The same kind of convergence with an accelerated rate can be obtained when the $y_{k}$ are the iterates of an APG with a constant momentum $\beta\leq\frac{1-\sqrt{\mu/L}}{1+\sqrt{\mu/L}}$ on a strongly convex quadratic. The key in these two examples is that the component of $y_{k}$ associated with the smallest eigenvalue of the Hessian of $f$ has the slowest convergence rate. This is the combined effect of a gradient step that first decreases the components associated with the largest eigenvalues and of a small extrapolation step that preserves the order of convergence between the different components.

4 Numerical Experiments

In this section we present numerical experiments on Algorithm 3. We also show results of Algorithm 4, a very simple modification of APG for which we did not prove robustness but that appears to work very well in practice.

Both Algorithms 3 and 4 compute and use the strong convexity estimates defined in (22) during their execution. To obtain the values of $f^{*}$ in the experiments, we run APG long enough to reach machine precision. We compare our two algorithms (APG adapt) and (APG adapt v2) with Proximal Gradient Descent (PGD), Accelerated Proximal Gradient for smooth functions (APG), Accelerated Proximal Gradient with known strong convexity parameter (APG Optimal $\mu$) (for square loss and regularized logistic loss) and restarted Accelerated Proximal Gradient using $f^{*}$ in a stopping criterion with decay parameter $\gamma$ (APG Restart $\gamma=\cdot$) tuned to give the best result. The restart scheme is described in Appendix C. Even though the theoretical complexity bound is optimal, the $\gamma$ tuning step for the restart strategy still has a significant impact on empirical performance, as shown in Figure 3 in the Appendix. In terms of computational cost, our algorithms require one more call to the gradient oracle per iteration than the restarted algorithm, but there is no parameter to tune: $\mu_{0}$ is always chosen as $\frac{L}{2}$ and has no impact in practice.

Figure 1 shows the convergence of the primal gap when solving the matrix completion problem on synthetic data using the nuclear norm penalization formulation. Our adaptive algorithms exhibit linear convergence meaning that they successfully estimate the local strong convexity of the problem.

Figure 2 gathers the results of experiments on two real-world datasets of different sizes using four classical losses. In all cases, our algorithms perform well and display the fast convergence rate. Figure 4 in Appendix C shows additional experiments and Figure 5 shows the convergence of our online estimate of the strong convexity parameter during the execution of the algorithm.

Appendix A Useful Lemmas

Lemma A.1

Since $h$ is $L$ -smooth and $\mu$ -strongly convex, the following bounds hold

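$h(x)\leq h(y)+\nabla h(y)^{T}(x-y)+\frac{L}{2}\|x-y\|^{2}$

$h(x)\geq h(y)+\nabla h(y)^{T}(x-y)+\frac{\mu}{2}\|x-y\|^{2}$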

$\forall x,y\in{\mathbb{R}}^{n}$ .

Proof. [Nesterov, 2018, Th 2.1.5, Th 2.1.10]

Lemma A.2

The sequence $(x_{i})_{i\in\mathbb{N}}$ follows the same updates in Algorithms 1 and 2.

Proof. Note that $\tau_{k}=\tau=\sqrt{\frac{\mu}{L}}$. Let $\phi_{k}(x)=m_{k}(x)+\frac{a_{0}\mu}{2}\|x-x_{0}\|^{2}$; $\phi_{k}$ is a quadratic function. Since $v_{k+1}$ is the $\mathop{\rm argmin}$ of

[TABLE]

substituting into the expression of $x_{k+2}$,

[TABLE]

which is the update of Algorithm 1.

Appendix B Proofs of Lemmas and Propositions

B.1 Proof of Lemma 2.1

The optimality condition of $T_{L}(y)$ can be written $\nabla h(y)-g_{L}(y)+\xi_{L}(y)=0$ with $\xi_{L}(y)\in\partial\psi(T_{L}(y))$ . By strong convexity of $f$ we have

[TABLE]

B.2 Proof of Proposition 2.3

Recall that with this update of $a_{k}$ we have $\tau_{k}=\sqrt{\frac{\mu}{L}},\forall k\geq 0$.

We have $m_{0}(x)=a_{0}f^{*}$ and Lemma 2.1 implies $m_{k}(x)\leq a_{0}f^{*}+(A_{k}-a_{0})f^{*}$ . This leads to the useful bound

[TABLE]

Then we show by induction that $A_{k}f(y_{k})\leq\left(f(x_{0})-f^{*}\right)+\underset{x\in{\mathbb{R}}^{n}}{\min}m_{k}(x)+\frac{a_{0}\mu}{2}\|x-x_{0}\|^{2}$ .

At rank $k=0$ , $a_{0}=1$ , $m_{0}(x)=f^{*}$ and $y_{0}=x_{0}$ thus $A_{0}f(y_{0})=f(x_{0})-f^{*}+f^{*}$ .

Then suppose the property is true at rank $k$. Denote $\phi_{k+1}(x)=m_{k+1}(x)+\frac{a_{0}\mu}{2}\|x-x_{0}\|^{2}$.

[TABLE]

We conclude by combining the formulae defining $z_{k+1}$ and $x_{k+1}$ .

[TABLE]

Finally, since $y_{k+1}=T_{L}(x_{k+1})$ we get $(f(x_{0})-f^{*})+\underset{x\in{\mathbb{R}}^{n}}{\min\;}\phi_{k+1}(x)\geq A_{k+1}f(y_{k+1})$. In addition, $A_{k+1}=\frac{1}{1-\sqrt{\frac{\mu}{L}}}A_{k}$ and $a_{0}=1$ lead to $A_{k}=\left(1-\sqrt{\frac{\mu}{L}}\right)^{-k}$.

B.3 Proof of Proposition 2.4

We follow the proof of Proposition 2.3. However here we have a different bound on $m_{k}(x)$ .

$m_{k}(x)\leq a_{0}f^{*}+(A_{k}-a_{0})f(x)+\displaystyle\sum_{i=1}^{k}\frac{a_{i}}{2}(\mu_{i}-\mu)\|x_{i}-x\|^{2},\forall k\geq 0$ . Which leads to

[TABLE]

Now we show by induction that $\boxed{A_{k}f(y_{k})\leq f(x_{0})-f^{*}+\underset{x\in{\mathbb{R}}^{n}}{\min}m_{k}(x)+\frac{a_{0}\mu_{0}}{2}\|x-x_{0}\|^{2}}$. At rank $k=0$, $A_{0}=a_{0}=1,y_{0}=x_{0}$ and $m_{0}(x)=a_{0}f^{*}$, so the property is true. Suppose it is true at rank $k$. Let $\phi_{k+1}(x)=m_{k+1}(x)+\frac{a_{0}\mu_{0}}{2}\|x-x_{0}\|^{2}$.

[TABLE]

We conclude by combining the formulae defining $z_{k+1}$ and $x_{k+1}$ .

[TABLE]

Finally, since $y_{k+1}=T_{L}(x_{k+1})$ we get $(f(x_{0})-f^{*})+\underset{x\in{\mathbb{R}}^{n}}{\min\;}\phi_{k+1}(x)\geq A_{k+1}f(y_{k+1})$; substituting into (26) gives the stated bound. In addition, $A_{k+1}=\frac{1}{1-\sqrt{\frac{\mu_{k+1}}{L}}}A_{k}$ and $a_{0}=1$ lead to $A_{k}=\prod_{i=1}^{k}\left(1-\sqrt{\frac{\mu_{i}}{L}}\right)^{-1}$.

B.4 Proof of Lemma 2.5

From the definition of $x_{k+1}$ in Algorithm 3, $x_{k+1}=\alpha_{k}v_{k}+(1-\alpha_{k})y_{k}$ with $\alpha_{k}=\frac{\tau_{k}}{1+\tau_{k}}\in[0,1]$ . By convexity of $\|\cdot-x^{*}\|^{2}$

[TABLE]

We denote $\phi_{k}(x)=m_{k}(x)+\frac{a_{0}\mu_{0}}{2}\|x-x_{0}\|^{2}$ , we have that $v_{k}=\underset{x\in{\mathbb{R}}^{n}}{\mathop{\rm argmin}\;}\phi_{k}(x)$ . Note that $\phi_{k}(x)$ is $\left(\sum_{i=0}^{k}a_{i}\mu_{i}\right)$ -strongly convex, which gives

[TABLE]

We can bound $\|y_{k}-x^{*}\|^{2}$ the same way using Corollary 2.2

[TABLE]

Combining these inequalities in (27) gives the result.

B.5 Proof of Lemma 2.7

We prove our result by induction. For $k=0$ this is true since $C_{0}\geq a_{0}(\mu_{0}-\mu)\|x_{0}-x^{*}\|^{2}$ . Now suppose the property is true until a rank $k\geq 0$ .

By Lemma 2.5,

[TABLE]

Thus

[TABLE]

which concludes the proof.

Appendix C Numerical Experiments

In the quadratic case we have a natural strong convexity parameter, namely the smallest eigenvalue of the Hessian. However, when the loss has a more complex structure, we do not know a priori which quantity our estimates of strong convexity should be compared to. When looking at the proof of the convergence rate of Algorithm 3, the exact error term due to the fact that $\mu_{k}$ upper bounds $\mu$ is

[TABLE]

where $x_{k}$ is an iterate in Algorithm 3. We then define

[TABLE]

C.1 Parameters of the losses in Figure 2

Bibliography

1 Allen-Zhu et al. [2016] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110–1119, 2016.
2 Arjevani [2017] Yossi Arjevani. Limitations on variance-reduction and acceleration schemes for finite sums optimization. In Advances in Neural Information Processing Systems, pages 3540–3549, 2017.
3 Asi and Duchi [2019] H. Asi and J. Duchi. The importance of better models in stochastic optimization. arXiv:1903.08619, 2019.
4 Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
5 Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. arXiv preprint arXiv:1407.0202, 2014.
6 Fercoq and Qu [2016] Olivier Fercoq and Zheng Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358, 2016.
7 Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
8 Lan and Zhou [2018] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167–215, 2018.