Restart FISTA with Global Linear Convergence

Teodoro Alamo; Pablo Krupa; Daniel Limon

arXiv:1906.09126 · math.OC · December 30, 2019 · ECC


TL;DR

This paper introduces a new restart scheme for FISTA that guarantees global linear convergence in non-strongly convex problems satisfying quadratic growth, without needing prior knowledge of certain parameters.

Contribution

The paper proposes a novel restart scheme for FISTA that achieves global linear convergence without requiring prior parameter knowledge.

Findings

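01 In the reported Lasso simulations, needs substantially fewer iterations than non-restart FISTA and than the function-value ($E_{c}^{f}$) and optimal-value ($E_{c}^{*}$) restart schemes, with iteration counts comparable to the gradient-based scheme ($E_{c}^{g}$)

02 Achieves global linear convergence for non-strongly convex problems

03 Does not require prior knowledge of the objective's optimal value or growth parameter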

Abstract

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is a popular fast gradient method (FGM) in the field of large-scale convex optimization. However, it can exhibit undesirable periodic oscillatory behaviour in some applications, which slows its convergence. Restart schemes seek to improve the convergence of FGM algorithms by suppressing the oscillatory behaviour. Recently, a restart scheme for FGM has been proposed that provides linear convergence for non-strongly convex optimization problems that satisfy a quadratic functional growth condition. However, the proposed algorithm requires prior knowledge of the optimal value of the objective function or of the quadratic functional growth parameter. In this paper we present a restart scheme for the FISTA algorithm, with global linear convergence, for non-strongly convex optimization problems that satisfy the quadratic growth…


Tables

Table I: Test 1. Comparison between restart schemes

Exit Cond.	$E_{c}^{l}$	No restart	$E_{c}^{f}$	$E_{c}^{g}$	$E_{c}^{*}$
Avg. Iter.	$670.6$	$8207.2$	$1648.7$	$687.5$	$1569.5$
Median Iter.	$676$	$8241$	$1608.5$	$666.5$	$1571$
Max. Iter.	$783$	$10109$	$2156$	$930$	$2053$
Min. Iter.	$570$	$6737$	$1192$	$567$	$917$

Table II: Test 2. Comparison between restart schemes

Exit Cond.	$E_{c}^{l}$	No restart	$E_{c}^{f}$	$E_{c}^{g}$	$E_{c}^{*}$
Avg. Iter.	$1683.7$	$34116.4$	$7743.3$	$1606.7$	$4601.9$
Median Iter.	$1659$	$33127.5$	$7242$	$1594$	$4503$
Max. Iter.	$2162$	$51201$	$14080$	$2201$	$7266$
Min. Iter.	$1406$	$24539$	$3894$	$1306$	$2499$

Table III: Test 3. Comparison between restart schemes

Exit Cond.	$E_{c}^{l}$	No restart	$E_{c}^{f}$	$E_{c}^{g}$	$E_{c}^{*}$
Avg. Iter.	$705.9$	$8379.5$	$1786.3$	$686$	$1709.4$
Median Iter.	$704.5$	$8135.5$	$1773$	$680.5$	$1703$
Max. Iter.	$873$	$12055$	$3218$	$892$	$2512$
Min. Iter.	$547$	$5943$	$987$	$529$	$1042$

Equations

$f^{*}=\min\limits_{x\in{\mathcal{X}}}f(x)=\min\limits_{x\in{\mathcal{X}}}\Psi(x)+h(x),$

$h(x)\leq h(y)+\langle\nabla h(y),x-y\rangle+\frac{1}{2}\|x-y\|_{R}^{2},$

$\min\limits_{x\in{\mathcal{X}}}f(x)$

$h(x)\leq h(y)+\langle\nabla h(y),x-y\rangle+\frac{L}{2}\|x-y\|_{S}^{2},$

$\frac{L}{2}\|x-y\|_{S}^{2}=\frac{1}{2}\|x-y\|_{LS}^{2},$

$\Omega\doteq\{\;x\;:\;x\in{\mathcal{X}},\;f(x)=f^{*}\;\}.$

$\bar{x}\doteq\arg\min\limits_{z\in\Omega}\|x-z\|_{R}.$

$\min\limits_{x\in{\mathcal{X}}}\Psi(x)+\langle\nabla h(y),x-y\rangle+\frac{1}{2}\|x-y\|_{R}^{2}.$

$y^{+}$

$g(y)$

$f(y^{+})-f(x)\leq\langle g(y),y-x\rangle-\frac{1}{2}\|g(y)\|_{*}^{2}=-\frac{1}{2}\|y^{+}-x\|_{R}^{2}+\frac{1}{2}\|y-x\|_{R}^{2}.$

$\frac{1}{2}\|g(y)\|_{*}^{2}\leq f(y)-f(y^{+})\leq f(y)-f^{*}.$

$y\in\Omega\Leftrightarrow g(y)=0.$

$\|g(x_{k})\|_{*}\leq\epsilon,$

$f(x_{k})-f^{*}\leq\frac{2}{(k+1)^{2}}\|x_{0}-\bar{x}_{0}\|_{R}^{2},\;\forall k\geq 1,$

$\|g(y_{k})\|_{*}\leq\frac{4\|x_{0}-\bar{x}_{0}\|_{R}}{k+2},\;\forall k\geq 0.$

$E_{c}^{f}=\mbox{True}\Leftrightarrow f(x_{k})\geq f(x_{k-1}).$

$E_{c}^{g}=\mbox{True}\Leftrightarrow\langle g(y_{k-1}),x_{k-1}-x_{k}\rangle\leq 0.$

$f^{*}=\min\limits_{x\in{\mathcal{X}}}f(x)$

$f(x)-f^{*}\geq\frac{\mu}{2}\|x-\bar{x}\|_{R}^{2},\;\forall x\in{\mathcal{X}},$

$E_{c}^{*}=\mbox{True}\Leftrightarrow f(x_{k})-f^{*}\leq\frac{f(x_{0})-f^{*}}{e^{2}},$

$E_{c}^{l}=\mbox{True}\Leftrightarrow\left\{\begin{array}{l}f(x_{m})-f(x_{k})\ldots\\ f(x_{k})\ldots\end{array}\right.$

$\frac{16}{\sqrt{\mu}}\left\lceil\ln\left(1+\frac{2(f(r_{0})-f^{*})}{\epsilon^{2}}\right)\right\rceil.$

$\min\limits_{x}\frac{1}{2N}\|Ax-b\|_{2}^{2}+\|Wx\|_{1},$

$R_{i,i}=\sum\limits_{j=1}^{n}|H_{i,j}|,$

$I_{{\mathcal{X}}}(x)=\left\{\begin{array}{rl}0&\mbox{if }x\in{\mathcal{X}}\\ \infty&\mbox{otherwise}.\end{array}\right.$

$\Psi_{{\mathcal{X}}}(x)\doteq\Psi(x)+I_{{\mathcal{X}}}(x),\;\forall x\in{\rm\,I\!R}^{n}.$


Full text

Restart FISTA with Global Linear Convergence

Teodoro Alamo, Pablo Krupa and Daniel Limon

T. Alamo, P. Krupa and D. Limon are at the Systems Engineering and Automation Department, University of Seville, Spain. E-mail: {talamo,pkrupa,dlm}@us.es. The authors acknowledge MINERCO and FEDER funds for funding project DPI2016-76493-C3-1-R, and MCIU and FSE for the FPI-2017 grant. This paper constitutes an extended and revised version of [1]. Some of the technical results presented in this paper are used in [2].

Abstract

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is a popular fast gradient method (FGM) in the field of large-scale convex optimization. However, it can exhibit undesirable periodic oscillatory behaviour in some applications, which slows its convergence. Restart schemes seek to improve the convergence of FGM algorithms by suppressing the oscillatory behaviour. Recently, a restart scheme for FGM has been proposed that provides linear convergence for non-strongly convex optimization problems that satisfy a quadratic functional growth condition. However, the proposed algorithm requires prior knowledge of the optimal value of the objective function or of the quadratic functional growth parameter. In this paper we present a restart scheme for the FISTA algorithm, with global linear convergence, for non-strongly convex optimization problems that satisfy the quadratic growth condition without requiring the aforementioned values. We present some numerical simulations that suggest that the proposed approach outperforms other restart FISTA schemes.

Keywords

Fast gradient method, restart FISTA, convex optimization, linear convergence, quadratic functional growth condition.

I Introduction

Fast gradient methods (FGM) were introduced by Yurii Nesterov in [3], [4], where it was shown that these methods provide a convergence rate $O(1/k^{2})$ for smooth convex optimization problems with non-strongly convex objective functions [4], where $k$ is the iteration counter. These methods were generalized to composite non-smooth convex optimization problems in [5], [6], [7]. The resulting algorithm is commonly known as FISTA [5]. Because of its complexity certification, it is often used in the context of embedded model predictive control [8], [9], [10]. Another possibility to address composite convex optimization problems is to use splitting methods like ADMM [11], [12], [13].

FISTA algorithms can be applied in a primal setting (as in the Lasso problem [5]), or in a dual one [14], [15]. They can be thought of as momentum methods, since the linearization point at each iteration depends on the previous iterations. Since the momentum grows with the iteration counter, the algorithm can exhibit undesirable periodic oscillating behavior for certain applications, which slows convergence. To mitigate this, restart schemes have been proposed in the literature which stop the algorithm when a certain criterion is met. The algorithm is then restarted using the last value provided by the stopped algorithm as the new initial condition [16], [17], [18].

In [16] two heuristic restart schemes for FGM are proposed which exhibit improved convergence rates over non-restart FGM schemes. These restart schemes reset the momentum of the FGM in order to eliminate the undesirable oscillations whenever the periodic behavior is detected. A restart scheme similar to the ones in [16] with $O(1/k^{2})$ convergence rate for smooth convex optimization is presented in [18]. In [19], an algorithm is proposed that uses the restart schemes from [16]. Numerical results show improvements over previous restart schemes for FGM, but no theoretical results on convergence rates are provided.

Recently, linear convergence rates have been derived for several first order methods applied to convex optimization problems with non-strongly convex objective functions that satisfy a relaxation of strong convexity known as the quadratic functional growth condition [20].

In [20, Subsection 5.2.2] a restarting scheme for FGM is presented with global linear convergence rate for convex optimization problems that satisfy the quadratic functional growth condition with parameter $\mu$. However, in order to implement this strategy, prior knowledge is needed of either the optimal value of the objective function or the value of $\mu$, which can be challenging to compute.

In this paper we propose a novel restart scheme for the FISTA algorithm applied to solving convex constrained problems. We show that the algorithm guarantees global linear convergence rate $O(1/\sqrt{\mu})$ for convex optimization problems with non-strongly convex objective functions that satisfy the quadratic functional growth condition with parameter $\mu$. The proposed algorithm does not require prior knowledge of the value of $\mu$ or of the optimal value of the objective function. We provide theoretical upper bounds on the number of iterations of the algorithm needed to achieve a given accuracy.

Additionally, we show numerical results comparing the proposed algorithm with the heuristic restart schemes from [16] and the restart scheme from [20] for Lasso problems.

In Section II we introduce the problem formulation. Section III presents the FISTA algorithm and some restart schemes. The convergence rate of the non-restart FISTA algorithm under the quadratic functional growth condition is presented in Section IV. In Section V we present the proposed restart scheme for FISTA and state its global linear convergence. Numerical results comparing the proposed algorithm with other restart schemes applied to FISTA are shown in Section VI. Finally, conclusions are presented in Section VII.

Notation

Given vectors $x$ and $y$ , we denote by $\langle x,y\rangle$ their scalar product, i.e. $\langle x,y\rangle\doteq x^{\top}y$ . Given vector $x$ , $\|x\|_{2}$ denotes its Euclidean norm ( $\|x\|_{2}\doteq\sqrt{x^{\top}x}$ ), and $\|\cdot\|_{1}$ denotes its $l_{1}$ -norm (sum of the absolute values of the components of $x$ ). Given $R\succ 0$ we denote by $\|\cdot\|_{R}$ the weighted Euclidean norm $\|x\|_{R}\doteq\sqrt{x^{\top}Rx}$ , and by $\|x\|_{*}\doteq\|x\|_{R^{-1}}$ its dual norm. $\ln(\cdot)$ is the natural logarithm and $e$ is Euler’s number. $\lfloor x\rfloor$ denotes the largest integer smaller than or equal to $x$ ; $\lceil x\rceil$ denotes the smallest integer greater than or equal to $x$ . Given a set ${\mathcal{X}}\subseteq{\rm\,I\!R}^{n}$ we denote by $I_{{\mathcal{X}}}$ its indicator function. That is, ${I_{{\mathcal{X}}}(x)=0}$ if ${x\in{\mathcal{X}}}$ , and ${I_{{\mathcal{X}}}(x)=\infty}$ if $x\not\in{\mathcal{X}}$ . The relative interior of set ${\mathcal{X}}$ is denoted by ${\rm ri}({\mathcal{X}})$ . Given the extended real valued function ${f:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ we denote by ${\rm dom}(f)$ its effective domain. That is, ${{\rm dom}(f)\doteq\{\;x\in{\rm\,I\!R}^{n}\;:\;f(x)<\infty\;\}}$ . We denote by ${\rm epi}(f)$ the epigraph of $f$ . That is, ${{\rm epi}(f)\doteq\{\;(x,t)\in{\rm\,I\!R}^{n}\times{\rm\,I\!R}\;:\;f(x)\leq t\;\}}$ . We say that function ${f:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ is closed if its epigraph is a closed set. We say that ${f:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ is proper if its effective domain is not empty. That is, if $f$ is not identically equal to $\infty$ . We say that a vector ${d\in{\rm\,I\!R}^{n}}$ is a subgradient of $f$ at a point ${x\in{\rm dom}(f)}$ if ${f(y)\geq f(x)+\langle d,y-x\rangle}$ , ${\forall y\in{\rm\,I\!R}^{n}}$ . The set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$ and is denoted by $\partial f(x)$ .

II Problem Formulation

We address the problem of solving the composite convex minimization problem

$f^{*}=\min\limits_{x\in{\mathcal{X}}}f(x)=\min\limits_{x\in{\mathcal{X}}}\Psi(x)+h(x),$ (1)

under the following assumption.

Assumption 1.

We assume that

(i) $h:{\rm\,I\!R}^{n}\to{\rm\,I\!R}$ is a smooth differentiable convex function. That is, there is $R\succ 0$ such that the inequality

$h(x)\leq h(y)+\langle\nabla h(y),x-y\rangle+\frac{1}{2}\|x-y\|_{R}^{2},$ (2)

is satisfied for every $x\in{\rm\,I\!R}^{n}$ and $y\in{\rm\,I\!R}^{n}$.

(ii) $\Psi:{\rm\,I\!R}^{n}\to(-\infty,\infty]$ is a closed convex function and ${\mathcal{X}}\subseteq{\rm\,I\!R}^{n}$ is a closed convex set.

(iii) Denote $f\doteq\Psi+h$. The minimization problem

$\min\limits_{x\in{\mathcal{X}}}f(x)$

is solvable. That is, there is $x^{*}\in{\mathcal{X}}\bigcap{\rm dom}(\Psi)$ such that $f^{*}=f(x^{*})=\inf\limits_{x\in{\mathcal{X}}}f(x)$.

We notice that it is standard to write down the first point of Assumption 1 as

$h(x)\leq h(y)+\langle\nabla h(y),x-y\rangle+\frac{L}{2}\|x-y\|_{S}^{2},$ (3)

where parameter $L$ serves to characterize the smoothness of $h$ and $S$ is a positive definite matrix. Constant $L$ provides a bound on the Lipschitz constant of the gradient $\nabla h(\cdot)$ [4, Subsection 2.1]. Since

$\frac{L}{2}\|x-y\|_{S}^{2}=\frac{1}{2}\|x-y\|_{LS}^{2},$

we have that (3) implies (2) if we take $R=LS$ . This simplifies the algebraic expressions needed to analyze the convergence of the proposed algorithm.

We notice that Assumption 1 guarantees that the minimization problem (1) is solvable. The optimal set $\Omega$ is defined as

$\Omega\doteq\{\;x\;:\;x\in{\mathcal{X}},\;f(x)=f^{*}\;\}.$

This set is a singleton if $f(x)$ is strictly convex. Given ${x\in{\rm\,I\!R}^{n}}$ we will denote $\bar{x}$ its closest element in the optimal set $\Omega$ (with respect to the norm $\|\cdot\|_{R}$ ). That is,

$\bar{x}\doteq\arg\min\limits_{z\in\Omega}\|x-z\|_{R}.$ (4)

Given $y\in{\rm\,I\!R}^{n}$ , one could use the local information given by ${\nabla h}(y)$ to minimize the value of $f=\Psi+h$ around $y$ . Under Assumption 1, this can be done obtaining the minimizer of the strictly convex optimization problem

$\min\limits_{x\in{\mathcal{X}}}\Psi(x)+\langle\nabla h(y),x-y\rangle+\frac{1}{2}\|x-y\|_{R}^{2}.$

It is well known that this problem is solvable and has a unique solution if Assumption 1 holds (see, for example, Subsection 6.1 in [21] for an analogous result). For completeness we provide a proof of this statement in Appendix -A (see Property 5).

The solution to this optimization problem leads to the notion of composite gradient mapping [6], which constitutes a generalization of the gradient mapping that can be found in [4, Subsection 2.2] for the particular case $\Psi(\cdot)=0$ . See also [5] for the particular case ${\mathcal{X}}={\rm\,I\!R}^{n}$ .

Definition 1 (Composite Gradient Mapping $g(y)$ ).

Under Assumption 1, and given $y\in{\rm\,I\!R}^{n}$ , we define

[TABLE]

We notice that the composite gradient mapping is closely related to the notion of proximal operator [22], [21, Chapter 6]. For example, after some manipulations, the computation of the composite gradient mapping can be stated as the computation of a proximal operator. In the context of optimal gradient methods, it is assumed that the computation of $y^{+}$ is cheap. This is the case when ${\mathcal{X}}$ is a simple set (box, ${\rm\,I\!R}^{n}$, etc.), $R$ is diagonal, and $\Psi(\cdot)$ is a separable function. For example, in the well known Lasso optimization problem, the computation of $y^{+}$ reduces to the computation of the shrinkage operator [5]. See [23], Section 6 of [22], Chapter 28 in [24], or Chapter 6 in [21], for numerous examples in which the computation of the composite gradient mapping is simple.
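The Lasso case mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: ${\mathcal{X}}={\rm\,I\!R}^{n}$, $R$ diagonal with positive entries, and $\Psi(x)=\sum_{i}w_{i}|x_{i}|$ with $w_{i}\geq 0$. The function names `soft_threshold` and `composite_step` are hypothetical and do not come from the paper. Under these assumptions the minimization that defines $y^{+}$ separates per coordinate, and each coordinate is solved by the shrinkage operator.

```python
def soft_threshold(v, t):
    """Shrinkage operator: pull v toward zero by t >= 0, stopping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def composite_step(y, grad_h_y, r_diag, w_diag):
    """Return y^+ for Psi(x) = sum_i w_i |x_i|, X = R^n and R = diag(r_diag).

    Coordinate i minimizes
        w_i |x_i| + grad_i (x_i - y_i) + (r_i / 2) (x_i - y_i)^2,
    whose minimizer is soft_threshold(y_i - grad_i / r_i, w_i / r_i).
    """
    return [
        soft_threshold(yi - gi / ri, wi / ri)
        for yi, gi, ri, wi in zip(y, grad_h_y, r_diag, w_diag)
    ]


# Small usage example with made-up numbers.
y_plus = composite_step(
    y=[1.0, -0.2, 3.0],
    grad_h_y=[0.5, 0.1, -1.0],
    r_diag=[2.0, 1.0, 4.0],
    w_diag=[1.0, 0.5, 2.0],
)
print(y_plus)
```

The cost is linear in $n$, which is what makes the assumption of a cheap $y^{+}$ reasonable for this class of problems.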

The following property gathers well-known properties of the composite gradient mapping $g(y)$ and its dual norm $\|g(y)\|_{*}=\|g(y)\|_{R^{-1}}$ [5], [6]. For completeness, we include the proof in Appendix -B.

Property 1.

Suppose that Assumption 1 holds. Then,

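(i) For every $y\in{\rm\,I\!R}^{n}$ and $x\in{\mathcal{X}}$:

$f(y^{+})-f(x)\leq\langle g(y),y-x\rangle-\frac{1}{2}\|g(y)\|_{*}^{2}=-\frac{1}{2}\|y^{+}-x\|_{R}^{2}+\frac{1}{2}\|y-x\|_{R}^{2}.$

(ii) For every $y\in{\mathcal{X}}$:

$\frac{1}{2}\|g(y)\|_{*}^{2}\leq f(y)-f(y^{+})\leq f(y)-f^{*}.$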

The composite gradient serves to characterize optimality [6]. That is, under Assumption 1 we have the following equivalence

$y\in\Omega\Leftrightarrow g(y)=0.$

This fact is proved in Appendix -C.

III Restart FISTA Schemes

For a given initial condition $z\in{\rm\,I\!R}^{n}$ , a minimum number of iterations $k_{min}\geq 0$ , and an exit condition $E_{c}$ , the non restart FISTA algorithm [5] is shown in Algorithm 1. This algorithm solves $\min\limits_{x\in{\mathcal{X}}}\;h(x)+\Psi(x)$ under Assumption 1.

Since the optimality of $x_{k}$ is equivalent to $g(x_{k})=0$ (see Property 7 in Appendix -C), a typical choice for non restart FISTA schemes is to choose $k_{min}$ equal to zero and codify the exit condition

$\|g(x_{k})\|_{*}\leq\epsilon,$

where $\epsilon>0$ is an accuracy parameter. It is also common to use the exit condition $\|g(y_{k-1})\|_{*}\leq\epsilon$ , since this exit condition requires $y_{k-1}^{+}$ , which has already been computed in step 1 of the algorithm.

It is well known that under Assumption 1, see also (3), the iterations of non-restart FISTA satisfy [5, 6],

$f(x_{k})-f^{*}\leq\frac{2}{(k+1)^{2}}\|x_{0}-\bar{x}_{0}\|_{R}^{2},\;\forall k\geq 1,$

where $\bar{x}_{0}$ represents the point in the optimal set $\Omega$ closest to the initial condition $x_{0}$ of the algorithm (see (4)). For the sake of completeness, we present a detailed proof of this claim in Appendix -D. We also prove in the same appendix that the sequence $\{y_{k}\}$ generated by Algorithm 1 (FISTA) satisfies

$\|g(y_{k})\|_{*}\leq\frac{4\|x_{0}-\bar{x}_{0}\|_{R}}{k+2},\;\forall k\geq 0.$

In restart schemes, one invokes the FISTA algorithm several times with a relaxed exit condition. Typical choices are (see [16]),

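(i) Function scheme:

$E_{c}^{f}=\mbox{True}\Leftrightarrow f(x_{k})\geq f(x_{k-1}).$ (6)

(ii) Gradient scheme:

$E_{c}^{g}=\mbox{True}\Leftrightarrow\langle g(y_{k-1}),x_{k-1}-x_{k}\rangle\leq 0.$ (7)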

Given initial condition $r_{0}\in{\mathcal{X}}$ , a minimum number of iterations $k_{min}\geq 0$ , an exit condition $E_{c}$ , and an accuracy parameter $\epsilon>0$ , the standard restart FISTA algorithm is shown in Algorithm 2.

The implementation of Algorithm 2 usually provides better performance than the original non restart version [16], [18].
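The interplay between a FISTA run and the two heuristic exit conditions can be sketched as follows. Algorithms 1 and 2 are not reproduced in this text, so the sketch is built on assumptions rather than on the paper's listings: it uses the standard FISTA momentum update of [5], takes the composite gradient mapping to be $g(y)=R(y-y^{+})$, and specializes to a weighted Lasso objective with diagonal $R$ and ${\mathcal{X}}={\rm\,I\!R}^{n}$. All function names (`lasso_pieces`, `fista`, `restart_fista`) and the `budget` argument are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Shrinkage operator used to compute y^+ for an l1 term."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_pieces(A, b, w):
    """Return f and grad h for f(x) = ||Ax - b||^2 / (2N) + sum_i w_i |x_i|."""
    N, n = len(A), len(A[0])

    def residual(x):
        return [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(N)]

    def f(x):
        r = residual(x)
        return sum(v * v for v in r) / (2 * N) + sum(
            wi * abs(xi) for wi, xi in zip(w, x))

    def grad_h(x):
        r = residual(x)
        return [sum(A[i][j] * r[i] for i in range(N)) / N for j in range(n)]

    return f, grad_h


def fista(z, f, grad_h, r_diag, w, scheme, eps, budget):
    """One FISTA run from z. Returns (point, iterations, accuracy_reached)."""
    x_prev, y, t_prev = list(z), list(z), 1.0
    for k in range(1, budget + 1):
        gh = grad_h(y)
        x = [soft_threshold(yi - gi / ri, wi / ri)
             for yi, gi, ri, wi in zip(y, gh, r_diag, w)]
        # Assumed composite gradient mapping: g(y) = R (y - y^+).
        g = [ri * (yi - xi) for ri, yi, xi in zip(r_diag, y, x)]
        if math.sqrt(sum(gi * gi / ri for gi, ri in zip(g, r_diag))) <= eps:
            return x, k, True
        if scheme == "function":      # f(x_k) >= f(x_{k-1})
            stop = f(x) >= f(x_prev)
        elif scheme == "gradient":    # <g(y_{k-1}), x_{k-1} - x_k> <= 0
            stop = sum(gi * (xp - xi)
                       for gi, xp, xi in zip(g, x_prev, x)) <= 0.0
        else:                         # plain FISTA, no restart test
            stop = False
        if stop:
            return x, k, False
        t = (1.0 + math.sqrt(1.0 + 4.0 * t_prev * t_prev)) / 2.0
        y = [xi + (t_prev - 1.0) / t * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, t_prev = x, t
    return x_prev, budget, False


def restart_fista(r0, f, grad_h, r_diag, w, scheme, eps=1e-8, max_iter=10000):
    """Call fista repeatedly, each time starting from the last returned point."""
    r, total, done = list(r0), 0, False
    while not done and total < max_iter:
        r, k, done = fista(r, f, grad_h, r_diag, w, scheme, eps, max_iter - total)
        total += k
    return r, total, done


# Tiny usage example (made-up data; H = A^T A / N is diagonal here).
A, b, w = [[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0], [0.1, 3.0]
f, grad_h = lasso_pieces(A, b, w)
r, total, done = restart_fista([0.0, 0.0], f, grad_h, [0.5, 2.0], w, "gradient")
print(r, total, done)
```

Restarting from the returned point resets the momentum term (`t_prev` goes back to 1), which is the mechanism the heuristic schemes rely on to suppress the oscillations.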

IV Convergence of Restart FISTA under a quadratic functional growth condition

It has been recently shown in [20] that some relaxations of the strong convexity conditions of the objective function are sufficient for obtaining linear convergence for several first order methods. In particular, the following relaxation of strong convexity suffices to guarantee linear convergence of different gradient optimization schemes for smooth functions ( $\Psi(\cdot)=0$ ). See [20, Subsection 5.2.2].

Assumption 2 (Quadratic Functional Growth).

We assume that the optimization problem

$f^{*}=\min\limits_{x\in{\mathcal{X}}}f(x)$

is solvable and satisfies the following quadratic functional growth condition with parameter $\mu>0$ :

$f(x)-f^{*}\geq\frac{\mu}{2}\|x-\bar{x}\|_{R}^{2},\;\forall x\in{\mathcal{X}},$

where $\bar{x}$ denotes the closest element to $x$ in the optimal set $\Omega$ (see (4)).

As can be seen in [20, Subsection 3.4], strong convexity implies quadratic functional growth. This means that the quadratic functional growth setting encompasses a broad family of convex functions.
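The converse does not hold, as a simple illustration shows (a hypothetical example, not taken from the paper). Take $n=2$, ${\mathcal{X}}={\rm\,I\!R}^{2}$, $\Psi(\cdot)=0$, $h(x)=\frac{1}{2}x_{1}^{2}$ and $R$ equal to the identity matrix. Then $f^{*}=0$, $\Omega=\{\;x\;:\;x_{1}=0\;\}$ and $\bar{x}=(0,x_{2})$, so that

$f(x)-f^{*}=\frac{1}{2}x_{1}^{2}=\frac{1}{2}\|x-\bar{x}\|_{R}^{2},\;\forall x\in{\rm\,I\!R}^{2}.$

That is, the quadratic functional growth condition holds with $\mu=1$, although $f$ is not strongly convex because it is constant along the $x_{2}$ direction.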

It is also shown in [20, Subsection 5.2.2] that if the value of $f^{*}$ is known and $\Psi(\cdot)=0$ , then a restart FISTA based on the exit condition

$E_{c}^{*}=\mbox{True}\Leftrightarrow f(x_{k})-f^{*}\leq\frac{f(x_{0})-f^{*}}{e^{2}},$ (8)

exhibits global linear convergence. This exit condition is easily implementable if the optimal value $f^{*}$ is known. This is the case, for example, in some formulations of feasibility optimization problems, in which the optimal value $f^{*}$ is equal to zero for every feasible solution. This restart scheme corresponds to an optimal restart rate of $\frac{2e}{\sqrt{\mu}}$ [20, Subsection 5.2.2].

We present now a novel result that further characterizes the convergence properties of the non restart FISTA algorithm under Assumption 2.

Property 2.

Under Assumptions 1 and 2, the iterations of FISTA algorithm satisfy

(i) $f(x_{k})-f^{*}\leq{\displaystyle{\frac{4(f(x_{0})-f^{*})}{\mu(k+1)^{2}}}}$, for all $k\geq 1$.

(ii) $f(x_{k})\leq f(x_{0})$, for all $k\geq\left\lfloor\frac{2}{\sqrt{\mu}}\right\rfloor$.

(iii) $f(x_{k})-f^{*}\leq{\displaystyle{\frac{f(x_{0})-f(x_{k})}{e}}}$, for all $k\geq\left\lfloor\frac{2\sqrt{e+1}}{\sqrt{\mu}}\right\rfloor$.

Proof.

See Appendix -F.
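A brief outline of where the three claims come from may be helpful. It is only a sketch that assumes $x_{0}\in{\mathcal{X}}$; the complete argument is the one given in the appendix. The convergence rate of non-restart FISTA stated in Section III reads

$f(x_{k})-f^{*}\leq\frac{2}{(k+1)^{2}}\|x_{0}-\bar{x}_{0}\|_{R}^{2},\;\forall k\geq 1,$

and the quadratic functional growth condition evaluated at $x_{0}$ gives $\|x_{0}-\bar{x}_{0}\|_{R}^{2}\leq\frac{2}{\mu}(f(x_{0})-f^{*})$. Combining both inequalities,

$f(x_{k})-f^{*}\leq\frac{4(f(x_{0})-f^{*})}{\mu(k+1)^{2}},\;\forall k\geq 1,$

which is the first claim. The right-hand side is no larger than $f(x_{0})-f^{*}$ as soon as $k+1\geq\frac{2}{\sqrt{\mu}}$, which every $k\geq\left\lfloor\frac{2}{\sqrt{\mu}}\right\rfloor$ satisfies; this yields the second claim. Finally, the inequality of the third claim can be rearranged as $(e+1)(f(x_{k})-f^{*})\leq f(x_{0})-f^{*}$, which follows from the first claim as soon as $k+1\geq\frac{2\sqrt{e+1}}{\sqrt{\mu}}$.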

V Restart FISTA with global linear convergence

In this section we propose a novel restart FISTA algorithm (Algorithm 3) that exhibits global linear convergence under the quadratic functional growth condition. The algorithm uses exit condition $E_{c}^{l}$ , which is defined to be true if the following two conditions are satisfied,

[TABLE]

with $m=\lfloor\frac{k}{2}\rfloor+1$ .

Inequality (9b) guarantees that the output of the FISTA algorithm is no larger than the one corresponding to its initial condition.

As stated in the following property, one of the main features of the proposed algorithm is that the number of iterations $n_{j}$ required by each FISTA call ${[r_{j},n_{j}]=FISTA(r_{j-1},n_{j-1},E_{c}^{l})}$ is upper bounded by ${\frac{4\sqrt{e+1}}{\sqrt{\mu}}\approx\frac{7.72}{\sqrt{\mu}}}$. Moreover, the number of iterations required by the proposed algorithm to attain a given accuracy $\epsilon$ is upper bounded by

$\frac{16}{\sqrt{\mu}}\left\lceil\ln\left(1+\frac{2(f(r_{0})-f^{*})}{\epsilon^{2}}\right)\right\rceil.$

Property 3.

Suppose that Assumptions 1 and 2 hold. Then, the sequences $\{r_{j}\}$ , $\{n_{j}\}$ provided by Algorithm 3 satisfy

(i) ${\displaystyle{\frac{1}{2}}}\|g(r_{j-1})\|_{*}^{2}\leq f(r_{j-1})-f(r_{j})$, $\forall j\geq 1$.

(ii) $n_{j}\leq{\displaystyle{\frac{4\sqrt{e+1}}{\sqrt{\mu}}}}$, $\forall j\geq 0$.

(iii) The number of iterations ($\sum\limits_{i=0}^{j}n_{i}$) required to guarantee $\|g(r_{j})\|_{*}\leq\epsilon$ is no larger than

$\frac{16}{\sqrt{\mu}}\left\lceil\ln\left(1+\frac{2(f(r_{0})-f^{*})}{\epsilon^{2}}\right)\right\rceil.$

Proof.

See Appendix -G.
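To get a feeling for the magnitudes involved, the two $\mu$-dependent quantities that appear in Property 2 (iii) and Property 3 (ii) can be tabulated for a few values of $\mu$. The helper names below are hypothetical, and the snippet is only a numerical illustration: in practice $\mu$ is usually unknown, which is precisely why the proposed scheme avoids using it. The constant $4\sqrt{e+1}$ evaluates to roughly $7.71$, consistent with the rounded value $7.72$ quoted above.

```python
import math


def fista_call_cap(mu):
    """Bound 4*sqrt(e+1)/sqrt(mu) on n_j, as stated in Property 3 (ii)."""
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    return 4.0 * math.sqrt(math.e + 1.0) / math.sqrt(mu)


def decrease_threshold(mu):
    """Iteration index floor(2*sqrt(e+1)/sqrt(mu)) from Property 2 (iii)."""
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    return math.floor(2.0 * math.sqrt(math.e + 1.0) / math.sqrt(mu))


for mu in (1.0, 1e-2, 1e-4):
    print(mu, decrease_threshold(mu), round(fista_call_cap(mu), 1))
```

Both quantities scale as $1/\sqrt{\mu}$, so a hundredfold decrease in $\mu$ increases them by a factor of about ten.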

We notice that the factor 16 in the worst case complexity analysis is conservative. The authors claim that a better factor might be obtained at the expense of a more involved proof.

VI Numerical results

We consider a weighted Lasso problem of the form

$\min\limits_{x}\frac{1}{2N}\|Ax-b\|_{2}^{2}+\|Wx\|_{1},$ (9j)

where $x\in\mathbb{R}^{n}$, $A\in\mathbb{R}^{N\times n}$ is sparse with an average of $90\%$ of its entries being zero (sparsity was generated by setting a $0.9$ probability for each element of the matrix to be $0$), $n>N$, and $b\in\mathbb{R}^{N}$. Each nonzero element in $A$ and $b$ is obtained from a Gaussian distribution with zero mean and variance 1. $W\in\mathbb{R}^{n\times n}$ is a diagonal matrix with elements obtained from a uniform distribution on the interval $[0,\alpha]$.

We note that Lasso problems (9j) can be reformulated in such a way that they satisfy the quadratic growth condition [20, Section 6.3]. For this problem, inequality (2) of Assumption 1 is satisfied, for instance, for a matrix $R$ chosen as

$R_{i,i}=\sum\limits_{j=1}^{n}|H_{i,j}|,$

with $H=\frac{1}{N}A^{\top}A$ . This is due to the Gershgorin Circle Theorem [25, Subsection 7.2]. See also [6, Section 6].
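The choice of $R$ as the diagonal matrix of absolute row sums of $H$ can be checked numerically on a small instance. Since $h(x)=\frac{1}{2N}\|Ax-b\|_{2}^{2}$ is quadratic with Hessian $H$, inequality (2) holds whenever $R-H$ is positive semidefinite, and a symmetric diagonally dominant matrix with nonnegative diagonal is positive semidefinite by the Gershgorin Circle Theorem. The sketch below uses plain Python lists and hypothetical function names (`gershgorin_metric`, `dominates`), and assumes that no column of $A$ is identically zero so that every $R_{i,i}$ is positive.

```python
def gershgorin_metric(A):
    """Return H = A^T A / N and the diagonal of R with R_ii = sum_j |H_ij|."""
    N, n = len(A), len(A[0])
    H = [[sum(A[i][p] * A[i][q] for i in range(N)) / N for q in range(n)]
         for p in range(n)]
    r_diag = [sum(abs(v) for v in row) for row in H]
    return H, r_diag


def dominates(H, r_diag, tol=1e-12):
    """Check that R - H is diagonally dominant with nonnegative diagonal."""
    for i, row in enumerate(H):
        off_diag = sum(abs(v) for j, v in enumerate(row) if j != i)
        # Row i of R - H has diagonal r_i - H_ii and off-diagonal entries -H_ij.
        if r_diag[i] - row[i] < off_diag - tol:
            return False
    return True


# Small made-up instance with N = 2 rows and n = 3 columns.
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
H, r_diag = gershgorin_metric(A)
print(r_diag, dominates(H, r_diag))
```

A diagonal $R$ of this kind keeps the computation of $y^{+}$ separable, at the price of being more conservative than the smallest admissible matrix.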

We show the results of applying Algorithms 2 and 3 with an accuracy parameter $\epsilon=10^{-11}$ using different restart schemes and values of $N$, $n$ and $\alpha$. We take $r_{0}=0$.

The restart schemes shown are $E_{c}^{f}$ (6) and $E_{c}^{g}$ (7) from [16], restart condition $E_{c}^{*}$ (8) [20], and the restart condition $E_{c}^{l}$ (9) proposed in this paper (using Algorithm 3). Additionally, we show the results of applying FISTA algorithm without using a restart scheme. In order to provide a fair comparison between the performance of the restart schemes, the algorithms are exited as soon as a value of $y_{k}$ that satisfies $\|g(y_{k-1})\|_{*}\leq\epsilon$ is found. We note that, in order to implement the restart scheme based on $E_{c}^{*}$ , we had to previously compute the optimal value $f^{*}$ , which was done by using Algorithm 3 with $\epsilon=10^{-12}$ .

Tables I to III show results of performing $100$ tests with different randomized problems (9j) that share common values of parameters $N$ , $n$ and $\alpha$ . Tables show the average, median, maximum and minimum number of iterations.

Figures 1 to 3 show the value of $\|g(x_{k})\|_{*}$ for a randomly selected problem out of the randomized problems used to compute the results shown in Tables I to III, respectively.

Figure 4 shows the value of $n_{j}$ at each iteration $j$ of Algorithm 3 for the three examples whose results are shown in Figures 1 to 3. Note that the final value of $n_{j}$ is lower than the previous one in all three instances due to the algorithm exiting as soon as the condition $\|g(y_{k-1})\|_{*}\leq\epsilon$ is satisfied.

VII Conclusions

In this paper we have presented a novel restart scheme with guaranteed global linear convergence. The algorithm relies on a quadratic functional growth condition. One of the advantages of the proposed algorithm is that it does not require the knowledge of the parameter $\mu$ that characterizes the quadratic functional growth condition, or the optimal value of the minimization problem. We provide an upper bound of the required number of iterations equal to

$\frac{16}{\sqrt{\mu}}\left\lceil\ln\left(1+\frac{2(f(r_{0})-f^{*})}{\epsilon^{2}}\right)\right\rceil.$

We have presented numerical evidence of the good performance of the algorithm when compared with other restart schemes. It outperforms the restart scheme based on the knowledge of the optimal value $f^{*}$.

-A Existence and Uniqueness of Composite Gradient

We present in this appendix some well known facts about convex analysis that are required to analyze the properties of the composite gradient.

Property 4.

Suppose that

(i) $\Psi:{\rm\,I\!R}^{n}\to(-\infty,\infty]$ is a closed convex function.

(ii) ${\mathcal{X}}\subseteq{\rm\,I\!R}^{n}$ is a closed convex set.

(iii) The set ${\rm dom}(\Psi)\bigcap{\mathcal{X}}$ is non empty.

(iv) $I_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to\{0,\infty\}$ is the indicator function of ${\mathcal{X}}$. That is,

$I_{{\mathcal{X}}}(x)=\left\{\begin{array}{rl}0&\mbox{if }x\in{\mathcal{X}}\\ \infty&\mbox{otherwise}.\end{array}\right.$

(v) The function $\Psi_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to(-\infty,\infty]$ is defined as

$\Psi_{{\mathcal{X}}}(x)\doteq\Psi(x)+I_{{\mathcal{X}}}(x),\;\forall x\in{\rm\,I\!R}^{n}.$

Then

(i) The function $\Psi_{{\mathcal{X}}}$ is proper, closed, and convex.

(ii) The relative interior of ${\rm dom}(\Psi_{{\mathcal{X}}})$ is non empty.

(iii) There is $z\in{\mathcal{X}}$ and $d\in{\rm\,I\!R}^{n}$ such that $\Psi_{{\mathcal{X}}}(z)<\infty$ and

[TABLE]

Proof.

From ${\rm dom}(\Psi)\bigcap{\mathcal{X}}\neq\emptyset$ we have that both ${\rm dom}(\Psi)$ and ${\mathcal{X}}$ are non empty. The epigraph of the indicator function $I_{{\mathcal{X}}}$ is, by definition,

[TABLE]

Since ${\mathcal{X}}$ and ${{\cal{T}}\doteq\{\;t\in{\rm\,I\!R}\;:\;t\geq 0\;\}}$ are non empty closed sets, ${{\rm epi}(I_{{\mathcal{X}}})={\mathcal{X}}\times{\cal{T}}}$ is also a non empty closed convex set. Thus, by definition, ${I_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to\{0,\infty\}}$ is a closed convex function. Since both $\Psi$ and $I_{{\mathcal{X}}}$ are closed convex functions, ${\Psi_{{\mathcal{X}}}\doteq\Psi+I_{{\mathcal{X}}}}$ is also a closed convex function (the sum of closed convex functions provides closed convex functions [26, Proposition 1.1.5]). Since ${{\rm dom}(\Psi_{{\mathcal{X}}})={\rm dom}(\Psi)\bigcap{\mathcal{X}}\neq\emptyset}$ , we infer that the domain of $\Psi_{{\mathcal{X}}}$ is non empty. This implies that $\Psi_{{\mathcal{X}}}$ is not identically equal to $\infty$ . Moreover, since ${\Psi:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ we have that ${\Psi_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ . We conclude that ${\Psi_{{\mathcal{X}}}(x)>-\infty}$ for every ${x\in{\rm\,I\!R}^{n}}$ . From this and the fact that $\Psi_{{\mathcal{X}}}$ is not identically equal to $\infty$ we have that $\Psi_{{\mathcal{X}}}$ is proper.

Since ${\rm dom}(\Psi_{{\mathcal{X}}})$ is a non empty convex set, it has a non empty relative interior ${\rm ri}({\rm dom}(\Psi_{{\mathcal{X}}}))$ (see [26, Proposition 1.3.2]).

It is a well known fact from convex analysis that the subdifferential of a proper convex function at a point in the relative interior of its domain is non empty [26, Proposition 5.4.1]. Suppose now that $z\in{\rm ri}({\rm dom}(\Psi_{{\mathcal{X}}}))$. Since $\Psi_{{\mathcal{X}}}$ is a proper convex function we have that the subdifferential of $\Psi_{{\mathcal{X}}}$ at $z$ is non empty. This means, by definition, that there is $d\in{\rm\,I\!R}^{n}$ such that

[TABLE]

∎

Property 5.

Suppose that Assumption 1 holds. Given any $y\in{\rm\,I\!R}^{n}$ , consider the quadratic function $h_{y}:{\rm\,I\!R}^{n}\to{\rm\,I\!R}$ defined as

[TABLE]

Then, the minimization problem

[TABLE]

is solvable and has a unique solution. That is, there exists a unique point $y^{+}\in{\mathcal{X}}$ such that

[TABLE]

Proof.

Notice that the minimization problem (9k) is equivalent to

[TABLE]

where $I_{{\mathcal{X}}}$ is the indicator function of ${\mathcal{X}}$ . If we define ${\Psi_{{\mathcal{X}}}\doteq\Psi+I_{{\mathcal{X}}}}$ we can rewrite the original problem (9k) as

[TABLE]

We notice that the assumptions of Property 4 are satisfied if Assumption 1 holds. Thus, we infer from Property 4 that $\Psi_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to(-\infty,\infty]$ is a proper closed convex function. We also have that the quadratic function $h_{y}:{\rm\,I\!R}^{n}\to{\rm\,I\!R}$ is also proper and closed because it is a real valued continuous function (see [26, Proposition 1.1.3]). Since the sum of closed functions is closed (see [26, Proposition 1.1.5]), we infer that $F_{y}\doteq\Psi_{{\mathcal{X}}}+h_{y}$ is a closed function. Moreover, from Property 4 we also have that there is $z\in{\mathcal{X}}$ and $d\in{\rm\,I\!R}^{n}$ such that

(i) $\Psi_{{\mathcal{X}}}(z)<\infty$.

(ii) $\Psi_{{\mathcal{X}}}(x)\geq\Psi_{{\mathcal{X}}}(z)+\langle d,x-z\rangle$, $\forall x\in{\rm\,I\!R}^{n}$.

Therefore,

[TABLE]

We infer from (9l) that the closed function ${F_{y}:{\rm\,I\!R}^{n}\to(-\infty,\infty]}$ is not identically equal to $\infty$ and is therefore proper. We conclude that $F_{y}$ is a proper closed convex function. From Weierstrass' Theorem (see Proposition 3.2.1 in [26]) we have that the set of minima of $F_{y}$ over ${\rm\,I\!R}^{n}$ is nonempty and compact if there is a scalar $\bar{\gamma}$ such that the level set ${\Phi(\bar{\gamma})=\{\;x\;:\;F_{y}(x)\leq\bar{\gamma}\;\}}$ is nonempty and bounded. From (9l) we have that $\Phi(\gamma_{z})$ is nonempty. Moreover, we also infer from (9l) that $\Phi(\gamma_{z})$ is a bounded set because $F_{y}$ is lower bounded by a strictly convex quadratic function of $x$. We conclude that

[TABLE]

is a solvable optimization problem. That is, there is $y^{+}\in{\mathcal{X}}$ such that

[TABLE]

The set of minimizers consists of a single element $y^{+}$ because of the strictly convex nature of $F_{y}$ ( $h_{y}$ is a strictly convex function). ∎

-B Proof of Property 1.

In this appendix we prove Property 1, which is restated here for the reader’s convenience.

Property 6.

Suppose that Assumption 1 holds. Then,

(i) For every $y\in{\rm\,I\!R}^{n}$ and $x\in{\mathcal{X}}$:

[TABLE]

(ii) For every $y\in{\mathcal{X}}$:

[TABLE]

Proof.

From Property 5 we have that there is a (unique) $y^{+}\in{\mathcal{X}}$ such that

[TABLE]

where $h_{y}(x)\doteq\langle{\nabla h}(y),x-y\rangle+\frac{1}{2}\|x-y\|^{2}_{R}$ . Denote now $\Psi_{{\mathcal{X}}}=\Psi+I_{{\mathcal{X}}}$ , where $I_{{\mathcal{X}}}:{\rm\,I\!R}^{n}\to\{0,\infty\}$ is the indicator function of ${\mathcal{X}}$ . Since $y^{+}\in{\mathcal{X}}$ we have $I_{{\mathcal{X}}}(y^{+})=0$ . Therefore, inequality (9n) implies

[TABLE]

Denote now $F_{y}=\Psi_{{\mathcal{X}}}+h_{y}$ . From last inequality we have

[TABLE]

By definition of subdifferential at a point, we have that the previous inequality implies

[TABLE]

We have that $\Psi_{{\mathcal{X}}}$ is a proper closed function and ${\rm ri}({\rm dom}(\Psi_{{\mathcal{X}}}))\neq\emptyset$ (see the first two claims of Property 4). The domain of the quadratic function $h_{y}:{\rm\,I\!R}^{n}\to{\rm\,I\!R}$ is ${\rm\,I\!R}^{n}$. Since $h_{y}$ is a continuous real-valued function on ${\rm\,I\!R}^{n}$, it is also closed (see [26, Proposition 1.1.3]). We have that

[TABLE]

Since $F_{y}=\Psi_{{\mathcal{X}}}+h_{y}$ is equal to the sum of two closed convex functions and

[TABLE]

we have $\partial F_{y}(y^{+})=\partial\Psi_{{\mathcal{X}}}(y^{+})+\partial h_{y}(y^{+})$ (see Proposition 5.4.6 in [26]). The subdifferential of the differentiable function $h_{y}$ at $y^{+}$ is ${\nabla h}_{y}(y^{+})={\nabla h}(y)+R(y^{+}-y)$ . Thus, we obtain from (9o)

[TABLE]

Since $g(y)$ is defined as $R(y-y^{+})$ we obtain

[TABLE]

By definition of $\partial\Psi_{{\mathcal{X}}}(\cdot)$ we have

[TABLE]

Obviously, since ${\mathcal{X}}\subseteq{\rm\,I\!R}^{n}$ , this implies

[TABLE]

Since $y^{+}\in{\mathcal{X}}$ and $\Psi_{{\mathcal{X}}}=\Psi$ for every $x\in{\mathcal{X}}$ , we obtain

[TABLE]

The convexity of $h(\cdot)$ implies

[TABLE]

Adding this inequality to (9p) yields

[TABLE]

From Assumption 1 we have

[TABLE]

Adding this inequality to (9q) yields

[TABLE]

From this inequality we have

[TABLE]

This proves (9ma). We now prove (9mb) and (9mc) by means of simple algebraic manipulations.

[TABLE]

This proves (9mb). From this inequality, and the definition of $g(y)$ , we obtain

[TABLE]

This proves (9mc). Suppose now that $y\in{\mathcal{X}}$ . Particularizing inequality (9r) to $x=y$ yields

[TABLE]

The inequality $f(y)-f(y^{+})\leq f(y)-f^{*}$ trivially follows from $f^{*}\leq f(y^{+})$ . ∎
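The chain of inequalities for feasible $y$ can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical separable quadratic $h$, $\Psi=0$, a box ${\mathcal{X}}$, and $R=L\,I$, and it assumes the second claim has the standard sufficient-decrease form $\frac{1}{2}\|g(y)\|_{*}^{2}\leq f(y)-f(y^{+})\leq f(y)-f^{*}$ with $\|g\|_{*}^{2}=g^{\top}R^{-1}g$.

```python
# Hypothetical toy instance (not from the paper):
# h(x) = sum_i (0.5*q_i*x_i^2 - b_i*x_i), Psi = 0, X = [lo, hi]^n, R = L*I with L >= max(q_i).

def f(x, q, b):
    """Objective value on the box (Psi = 0, so f = h there)."""
    return sum(0.5 * qi * xi * xi - bi * xi for qi, xi, bi in zip(q, x, b))

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def composite_step(y, q, b, L, lo, hi):
    """Return (y_plus, g): y_plus minimizes <grad h(y), x-y> + (L/2)||x-y||^2 over the box,
    and g = R (y - y_plus) is the composite gradient mapping for R = L*I."""
    grad = [qi * yi - bi for qi, yi, bi in zip(q, y, b)]
    y_plus = [clip(yi - gi / L, lo, hi) for yi, gi in zip(y, grad)]
    g = [L * (yi - ypi) for yi, ypi in zip(y, y_plus)]
    return y_plus, g

def dual_norm_sq(g, L):
    """||g||_*^2 = g^T R^{-1} g for R = L*I."""
    return sum(gi * gi for gi in g) / L

q, b, L, lo, hi = [1.0, 10.0], [2.0, -3.0], 10.0, -1.0, 1.0
# The problem is separable, so the minimizer is the componentwise clipped stationary point.
x_star = [clip(bi / qi, lo, hi) for qi, bi in zip(q, b)]
f_star = f(x_star, q, b)

checks = []  # tuples (0.5*||g(y)||_*^2, f(y) - f(y_plus), f(y) - f_star)
for y in ([-1.0, 1.0], [0.0, 0.0], [0.5, -0.9], [1.0, -0.3]):
    y_plus, g = composite_step(y, q, b, L, lo, hi)
    checks.append((0.5 * dual_norm_sq(g, L),
                   f(y, q, b) - f(y_plus, q, b),
                   f(y, q, b) - f_star))
```

Each tuple in `checks` should be nondecreasing from left to right, up to rounding, if the assumed form of the inequality is the one intended.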

-C Characterization of optimality

The following property serves to characterize the optimality of a given point $y\in{\rm\,I\!R}^{n}$ .

Property 7.

Suppose that Assumption 1 holds. Then $y\in{\rm\,I\!R}^{n}$ belongs to the optimal set

[TABLE]

if and only if $g(y)=0$ .

Proof.

We first show that $g(y)=0$ implies $y\in\Omega$ . Since $R\succ 0$ , we infer from equality $g(y)=R(y-y^{+})$ that ${g(y)=0}$ is equivalent to $y=y^{+}$ . Suppose that ${x^{*}\in\Omega\subseteq{\mathcal{X}}}$ . Then, we obtain from $g(y)=0$ , $y=y^{+}\in{\mathcal{X}}$ , and the first claim of Property 1, the following inequality

[TABLE]

That is, $f^{*}=f(x^{*})\geq f(y)$ . Since $y=y^{+}\in{\mathcal{X}}$ , this is possible only if $y$ is also optimal ( $f(y)=f^{*}$ ). This proves that ${g(y)=0}$ implies $y\in\Omega$ . We now prove that $y\in\Omega$ implies $g(y)=0$ . Suppose that $y\in\Omega$ . Then, $f(y)=f^{*}$ and we obtain from the second claim of Property 1

[TABLE]

This implies $g(y)=0$ . ∎
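As a small illustration of this characterization, the following sketch evaluates the composite gradient mapping for a hypothetical scalar problem, $f(x)=\frac{q}{2}(x-c)^{2}$ on ${\mathcal{X}}=[lo,hi]$ with $R=L$. The problem data and function names are assumptions chosen for illustration only.

```python
def g_scalar(y, q, c, L, lo, hi):
    """Composite gradient mapping g(y) = R (y - y_plus) for the scalar toy problem, R = L."""
    y_plus = max(lo, min(hi, y - q * (y - c) / L))  # projected gradient step
    return L * (y - y_plus)

q, L, lo, hi = 2.0, 2.0, 0.0, 1.0

# Case 1: unconstrained minimizer c = 3 lies outside the box, so the optimum is x* = 1.
g_at_opt_boundary = g_scalar(1.0, q, 3.0, L, lo, hi)
g_at_feasible_nonopt = g_scalar(0.5, q, 3.0, L, lo, hi)
g_at_infeasible = g_scalar(2.0, q, 3.0, L, lo, hi)

# Case 2: unconstrained minimizer c = 0.4 lies inside the box, so x* = 0.4.
g_at_opt_interior = g_scalar(0.4, q, 0.4, L, lo, hi)
```

In this toy setting the mapping vanishes at the two optimal points and is nonzero at the other test points, including the infeasible one, which is consistent with the property holding for every $y\in{\rm\,I\!R}^{n}$.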

-D Convergence of non restart FISTA

Property 8.

Suppose that Assumption 1 holds. Then, the sequences $\{x_{k}\}$ and $\{y_{k}\}$ generated by Algorithm 1 (FISTA) satisfy

(i) $f(x_{k})-f^{*}\leq{\displaystyle{\frac{2\|x_{0}-\bar{x}_{0}\|_{R}^{2}}{(k+1)^{2}}}}$, for all $k\geq 1$,

(ii) $\|g(y_{k})\|_{*}\leq{\displaystyle{\frac{4\|x_{0}-\bar{x}_{0}\|_{R}}{k+2}}}$, for all $k\geq 0$,

where $\bar{x}_{0}$ represents the point in the optimal set $\Omega$ closest to the initial condition $x_{0}$ of the algorithm.

Proof.

First claim:

We denote $g_{k}\doteq g(y_{k})$ , $\forall k\geq 0$ . Additionally, we recall that ${\|\cdot\|_{*}\doteq\|\cdot\|_{R^{-1}}}$ .

From step 4 of FISTA algorithm we have

[TABLE]

This implies that

[TABLE]

Particularizing inequality (9mc) of the first claim of Property 6 to $y=y_{0}\in{\rm\,I\!R}^{n}$ , and $x=\bar{x}_{0}\in\Omega\subseteq{\mathcal{X}}$ , we obtain

[TABLE]

By construction we have that $x_{0}=y_{0}$ and $x_{1}=y_{0}^{+}$ . Furthermore, by definition of $\bar{x}_{0}$ , we have $f(\bar{x}_{0})=f^{*}$ . Therefore we can rewrite previous inequality as

[TABLE]

This proves the claim of the property for $k=1$ . We now proceed to prove the claim for $k\geq 2$ . From equality (9s) we have

[TABLE]

Therefore, from inequality (9mb) of Property 6 we obtain that for every $x\in{\mathcal{X}}$ and every $k\geq 1$

[TABLE]

We notice that, by construction, $x_{k}\in{\mathcal{X}}$ , $k\geq 1$ . Particularizing at $x_{k}$ and $\bar{x}_{0}$ , we obtain from last inequality

[TABLE]

In order to write down the proof in a compact way, we introduce the following incremental notation, valid for all $k\geq 0$ ,

[TABLE]

Inequalities (9ua) and (9ub) in an incremental notation, are

[TABLE]

We introduce now the auxiliary variable $\Gamma_{k}$ , defined as

[TABLE]

From Property 9 in appendix -E we have

[TABLE]

We now use this identity to obtain

[TABLE]

In view of Property 9, $t_{k}\geq 1$ , $\forall k\geq 0$ . This implies that we can replace, in inequality (9w), $\delta f_{k}-\delta f_{k+1}$ and $-\delta f_{k+1}$ by the lower bounds given by inequalities (9va) and (9vb). In this way we obtain

[TABLE]

From step 6 of the algorithm we have for all $k\geq 1$ that ${y_{k}=x_{k}+{\displaystyle{\frac{t_{k-1}-1}{t_{k}}}}(x_{k}-x_{k-1})}$ . This can be rewritten in incremental notation as

[TABLE]

We now define, for every $k\geq 1$

[TABLE]

From the definition of $s_{k}$ and (9y) we obtain

[TABLE]

From (9x) and (9aa) we obtain

[TABLE]

Using (9z) and (9aa) we now show that $g_{k}$ can be written in terms of $s_{k}$ and $s_{k+1}$ .

[TABLE]

With this expression for $t_{k}g_{k}$ we obtain from (9ab)

[TABLE]

Thus, for every $k\geq 1$ ,

[TABLE]

Equivalently

[TABLE]

Since this inequality holds for every $k\geq 1$ we can apply it in a recursive way to obtain

[TABLE]

From (9t) we have

[TABLE]

Thus,

[TABLE]

Therefore,

[TABLE]

From this inequality, and taking now into account that ${t_{k}\geq{\displaystyle{\frac{k+2}{2}}}}$ for all $k\geq 0$ (second claim of Property 9), we conclude

[TABLE]

That is,

[TABLE]

∎

Second claim:

We first prove the claim for $k=0$ .

[TABLE]

From (9t) we derive

[TABLE]

Thus,

[TABLE]

We now prove the claim for $k>0$ . From (9ad) we also have

[TABLE]

We also have that

[TABLE]

From (9ae) we derive ${\|s_{1}\|_{R}=\|x_{1}-\bar{x}_{0}\|_{R}\leq\|x_{0}-\bar{x}_{0}\|_{R}}$ . From this and (9af) we obtain

[TABLE]

From here we derive, for every $k\geq 1$ ,

[TABLE]

From (9ac) we have

[TABLE]

Therefore, for every $k\geq 1$

[TABLE]

We notice that the last inequality is due to the second claim of Property 9. This proves the second claim of the property. ∎
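Both bounds of Property 8 can be monitored on a small example. The sketch below is a minimal FISTA loop following the update $y_{k}=x_{k}+\frac{t_{k-1}-1}{t_{k}}(x_{k}-x_{k-1})$ quoted in the proof. It assumes a hypothetical box-constrained separable quadratic with $R=L\,I$, the step $x_{k+1}=y_{k}^{+}$, and the usual recursion $t_{k}=\frac{1}{2}(1+\sqrt{1+4t_{k-1}^{2}})$. It is an illustration under these assumptions, not the authors' implementation.

```python
import math

def f(x, q, b):
    return sum(0.5 * qi * xi * xi - bi * xi for qi, xi, bi in zip(q, x, b))

def composite_step(y, q, b, L, lo, hi):
    """y_plus = projected gradient step with R = L*I, g = L*(y - y_plus)."""
    y_plus = [max(lo, min(hi, yi - (qi * yi - bi) / L)) for qi, yi, bi in zip(q, y, b)]
    g = [L * (yi - ypi) for yi, ypi in zip(y, y_plus)]
    return y_plus, g

def fista_trace(x0, q, b, L, lo, hi, iters):
    """Run FISTA and record f(x_k) for k = 1..iters and ||g(y_k)||_* for k = 0..iters-1."""
    x_prev, y, t_prev = list(x0), list(x0), 1.0
    f_vals, g_norms = [], []
    for _ in range(iters):
        x, g = composite_step(y, q, b, L, lo, hi)           # x_{k+1} = y_k^+
        g_norms.append(math.sqrt(sum(gi * gi for gi in g) / L))
        f_vals.append(f(x, q, b))
        t = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t_prev * t_prev))
        beta = (t_prev - 1.0) / t                           # momentum weight
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t_prev = x, t
    return f_vals, g_norms

# Hypothetical problem data; the optimum is unique, so it plays the role of \bar{x}_0.
q, b, L, lo, hi = [1.0, 10.0], [2.0, -3.0], 10.0, -1.0, 1.0
x0 = [-1.0, 1.0]
x_star = [max(lo, min(hi, bi / qi)) for qi, bi in zip(q, b)]
f_star = f(x_star, q, b)
dist_R = math.sqrt(L * sum((a - c) ** 2 for a, c in zip(x0, x_star)))  # ||x0 - x*||_R

f_vals, g_norms = fista_trace(x0, q, b, L, lo, hi, 200)
claim_i = all(fk - f_star <= 2.0 * dist_R ** 2 / (k + 1) ** 2 + 1e-12
              for k, fk in enumerate(f_vals, start=1))
claim_ii = all(gk <= 4.0 * dist_R / (k + 2) + 1e-12
               for k, gk in enumerate(g_norms))
```

The flags `claim_i` and `claim_ii` report whether the recorded sequences stay below the two bounds on this instance. A passing check is only a sanity test of the statement, not a proof.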

-E Properties of the sequence $\{t_{k}\}$

Property 9.

Let us suppose that $t_{0}=1$ and that

$$t_{k}=\frac{1}{2}\left(1+\sqrt{1+4t_{k-1}^{2}}\right),\quad\forall k\geq 1.$$

Then

(i) $t_{k-1}^{2}=t_{k}^{2}-t_{k}$, for all $k\geq 1$.

(ii) $t_{k}\geq{\displaystyle{\frac{k+2}{2}}}\geq 1$, for all $k\geq 0$.

Proof.

(i) For every $k\geq 1$, $t_{k}$ is defined as one of the roots of

$$t_{k}^{2}-t_{k}-t_{k-1}^{2}=0.$$

Therefore we obtain $t_{k-1}^{2}=t_{k}^{2}-t_{k}$.

(ii) The claim is trivially satisfied for $k=0$. We now show that if the claim is satisfied for $k-1$ then it is also satisfied for $k$.

[TABLE]

Since the claim is assumed to be satisfied for $k-1$ we have $t_{k-1}\geq\frac{k+1}{2}$ and consequently

[TABLE]

∎
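The two claims of Property 9 are easy to confirm numerically. The sketch below assumes the recursion is the usual FISTA one, $t_{k}=\frac{1}{2}(1+\sqrt{1+4t_{k-1}^{2}})$ with $t_{0}=1$, that is, the positive root of $t_{k}^{2}-t_{k}-t_{k-1}^{2}=0$. The function name is hypothetical.

```python
import math

def t_sequence(n):
    """Return [t_0, ..., t_n] with t_0 = 1 and t_k the positive root of t^2 - t - t_{k-1}^2 = 0."""
    t = [1.0]
    for _ in range(n):
        t.append(0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)))
    return t

t = t_sequence(1000)

# Claim (i): t_{k-1}^2 = t_k^2 - t_k, checked with a relative tolerance.
identity_ok = all(abs(t[k - 1] ** 2 - (t[k] ** 2 - t[k])) <= 1e-9 * t[k] ** 2
                  for k in range(1, len(t)))

# Claim (ii): t_k >= (k + 2) / 2 >= 1.
lower_bound_ok = all(t[k] >= (k + 2) / 2.0 for k in range(len(t)))
```

Since $\sqrt{1+4t_{k-1}^{2}}>2t_{k-1}$, each step increases $t_{k}$ by more than $\frac{1}{2}$, which is the mechanism behind the lower bound in the second claim.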

-F Proof of Property 2

From equation (5) we have

[TABLE]

Due to Assumption 2 we also have

[TABLE]

Therefore,

[TABLE]

This proves the first claim. Denote

[TABLE]

With this notation we rewrite (9ai) as

[TABLE]

Suppose now that $k\geq\left\lfloor{\displaystyle{\frac{2}{\sqrt{\mu}}}}\right\rfloor$ . Then,

[TABLE]

Therefore,

[TABLE]

This, along with inequality (9aj), yields

[TABLE]

Equivalently,

[TABLE]

This proves the second claim of the property. In view of inequality (9aj) we have

[TABLE]

Therefore,

[TABLE]

Suppose now that $k\geq\left\lfloor\frac{2\sqrt{e+1}}{\sqrt{\mu}}\right\rfloor$ . This implies $k\geq\left\lfloor\frac{2}{\sqrt{\mu}}\right\rfloor$ and consequently $1-\alpha_{k}>0$ (see (9ak)). Dividing both terms of inequality (9al) by $1-\alpha_{k}$ , we get

[TABLE]

∎
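The two iteration thresholds used in this proof can be explored numerically. The sketch below assumes that $\alpha_{k}=\frac{4}{\mu(k+1)^{2}}$, a reconstruction consistent with the bound of Property 8 combined with quadratic growth, but not confirmed by the text shown here. Under that assumption it checks that $k\geq\lfloor 2/\sqrt{\mu}\rfloor$ gives $\alpha_{k}<1$, and that $k\geq\lfloor 2\sqrt{e+1}/\sqrt{\mu}\rfloor$ gives $\frac{\alpha_{k}}{1-\alpha_{k}}<\frac{1}{e}$. The values of $\mu$ are hypothetical.

```python
import math

def alpha(k, mu):
    """Assumed contraction factor alpha_k = 4 / (mu * (k + 1)^2)."""
    return 4.0 / (mu * (k + 1) ** 2)

def k_decrease(mu):
    """Threshold floor(2 / sqrt(mu)) from the second claim."""
    return math.floor(2.0 / math.sqrt(mu))

def k_contraction(mu):
    """Threshold floor(2 * sqrt(e + 1) / sqrt(mu)) from the third claim."""
    return math.floor(2.0 * math.sqrt(math.e + 1.0) / math.sqrt(mu))

results = []  # tuples (mu, k1, alpha at k1, k2, alpha/(1 - alpha) at k2)
for mu in (1e-3, 0.05, 0.3, 0.9):
    k1, k2 = k_decrease(mu), k_contraction(mu)
    a2 = alpha(k2, mu)
    results.append((mu, k1, alpha(k1, mu), k2, a2 / (1.0 - a2)))
```

The key step is that $k\geq\lfloor c\rfloor$ implies $k+1>c$, so both thresholds give strict inequalities on $(k+1)^{2}$ and therefore on $\alpha_{k}$.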

-G Proof of Property 3

By construction, $r_{j-1}\in{\mathcal{X}}$ , for all $j\geq 1$ . Therefore, we have from the second claim of Property 1, that

[TABLE]

We also notice that $r_{j}$ is computed by invoking the FISTA algorithm with $r_{j-1}$ as initial condition ( $z=r_{j-1}$ ). That is,

[TABLE]

Since the output value $f(r_{j})$ is forced to be no larger than the one corresponding to ${x_{0}=z^{+}=r_{j-1}^{+}}$ , we have ${f(r_{j})\leq f(r_{j-1}^{+})}$ . Therefore, we obtain from inequality (9am) that

[TABLE]

This proves the first claim of the property. We now show that if $n_{j-1}\leq\frac{4\sqrt{e+1}}{\sqrt{\mu}}$ , then the value $n_{j}$ obtained from

[TABLE]

also satisfies

[TABLE]

Denote

[TABLE]

Since $\bar{m}\geq\left\lfloor\frac{2\sqrt{e+1}}{\sqrt{\mu}}\right\rfloor$ , we infer, from the third claim of Property 2, that

[TABLE]

From this inequality, we obtain

[TABLE]

Therefore, the first exit condition is satisfied for $m=\bar{m}$ . Since $m=\lfloor\frac{k}{2}\rfloor+1$ we have $m\geq\frac{k}{2}$ . This means that for $m=\bar{m}$ , the corresponding value for $k$ is no larger than

[TABLE]

We also notice that, in view of the second claim of Property 2, the additional exit condition $f(x_{k})\leq f(x_{0})$ is satisfied for every

[TABLE]

Therefore, $n_{j-1}\leq\frac{4\sqrt{e+1}}{\sqrt{\mu}}$ implies that $n_{j}$, obtained from $[r_{j},n_{j}]=FISTA(r_{j-1},n_{j-1},E_{c}^{l})$, also satisfies (9an). We now prove, by contradiction, that $n_{j}$ cannot be larger than $\frac{4\sqrt{e+1}}{\sqrt{\mu}}$. Suppose that

[TABLE]

Because of the previous discussion, the previous inequality could be forced only by the doubling step ${n_{j}=2n_{j-1}}$ of the algorithm. That is, inequality (9ao) is possible only if there is $s$ such that $n_{s-1}>\frac{2\sqrt{e+1}}{\sqrt{\mu}}$ and

[TABLE]

Since

[TABLE]

we have that $r_{s-1}$ is obtained from $r_{s-2}$ applying

[TABLE]

iterations of FISTA algorithm. However, we have from the third claim of Property 2 that this number of iterations implies

[TABLE]

From the second claim of Property 1 we also have ${f(r_{s-2}^{+})\leq f(r_{s-2})}$ . Thus,

[TABLE]

That is, there is no doubling step if $n_{s-1}\geq\frac{2\sqrt{e+1}}{\sqrt{\mu}}$ . This proves the second claim of the property.

We now show that there is a doubling step at least every

[TABLE]

steps of the algorithm. Suppose that there is no doubling step from iteration $j=s+1$ to $j=s+T$ , where $s\geq 1$ . That is,

[TABLE]

From this, and the first claim of the property, we obtain the following sequence of inequalities

[TABLE]

We conclude that $T$ consecutive iterations without a doubling step imply that the exit condition is satisfied ( $\|g(r_{s+T-1})\|_{*}\leq\epsilon$ ). Hence there must be at least one doubling step every $T$ iterations. This implies that there exists $j\in[s+1,s+T]$ such that

[TABLE]

Therefore, $n_{j}=2n_{j-1}$ . Moreover, since $\{n_{j}\}$ is a non decreasing sequence, we get ${n_{s+T}\geq n_{j}=2n_{j-1}\geq 2n_{s}}$ , ${\forall s\geq 1}$ . That is,

[TABLE]

Suppose that $j$ is rewritten as $j=m+nT$ , where $0\leq m<T$ and $n\geq 0$ . From the non decreasing nature of $\{n_{j}\}$ ,

[TABLE]

Also, from inequality (9ap), we have $n_{j-T}\leq\frac{n_{j}}{2}$ . Using this inequality in a recursive manner we obtain

[TABLE]

This allows us to infer from (9aq) that

[TABLE]

The last claim of the property follows directly from this one and the bound $n_{j}\leq\frac{4\sqrt{e+1}}{\sqrt{\mu}}$ of the second claim. That is, if $j$ denotes the first index for which $\|g(r_{j})\|_{*}\leq\epsilon$ , we get that the number of total iterations is bounded by

[TABLE]

∎

Bibliography

[1] T. Alamo, P. Krupa, and D. Limon, “Restart FISTA with global linear convergence,” in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 1969–1974.
[2] ——, “Gradient based restart FISTA,” in Proceedings of the 58th IEEE Conference on Decision and Control (CDC). IEEE, 2019, pp. 3936–3941.
[3] Y. Nesterov, “A method of solving a convex programming problem with convergence rate $O(1/k^{2})$,” Sov. Math. Dokl., vol. 27, no. 2, pp. 372–376, 1983.
[4] ——, Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[5] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[6] Y. Nesterov, “Gradient methods for minimizing composite functions,” Mathematical Programming, vol. 140, pp. 125–161, 2013.
[7] P. Tseng, “On accelerated proximal gradient methods for convex-concave optimization,” Dept. Math., Univ. Washington, Seattle, WA, USA, Tech. Rep., 2008.
[8] M. Kögel and R. Findeisen, “A fast gradient method for embedded linear predictive control,” in 18th IFAC World Congress, 2011, pp. 1362–1367.
