A New Exact Worst-Case Linear Convergence Rate of the Proximal Gradient Method

Xiaoya Zhang; Hui Zhang

arXiv:1902.09181·math.OC·March 13, 2019


TL;DR

This paper establishes a new exact worst-case linear convergence rate for the proximal gradient method based on the proximal gradient norm, refining existing results and improving convergence rate estimates under the Polyak-Lojasiewicz inequality.

Contribution

It introduces a new exact worst-case linear convergence rate for the proximal gradient method, enhancing understanding of its convergence behavior.

Findings

1. New exact worst-case linear convergence rate established

2. Improved convergence rate of objective function under Polyak-Łojasiewicz inequality

3. Refined descent lemma for the proximal gradient method

Abstract

In this note, we establish a new exact worst-case linear convergence rate of the proximal gradient method in terms of the proximal gradient norm, which complements the recent results in [1] and implies a refined descent lemma. Based on the new lemma, we improve the linear convergence rate of the objective function accuracy under the Polyak-Łojasiewicz inequality.

Equations

$$\iota_{C}:\mathbb{R}^{n}\to[-\infty,+\infty]:\ x\mapsto\begin{cases}0, & x\in C;\\ +\infty, & \text{otherwise}.\end{cases}$$

$$\min_{x\in\mathbb{R}^{n}}\{\varphi(x):=f(x)+g(x)\}$$

$$x^{+}={\mathbf{prox}}_{tg}(x-t\nabla f(x))=x-t\cdot\mathcal{G}_{t}(x),\quad t>0$$

$$x^{+}=x-t(\nabla f(x)+s^{+}).$$

$$\mu\|x-y\|\leq\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$$

$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq\frac{\mu L}{\mu+L}\|x-y\|^{2}+\frac{1}{\mu+L}\|\nabla f(x)-\nabla f(y)\|^{2},$$

$$f(x)\geq f(y)+\langle\nabla f(y),x-y\rangle+\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^{2}+\frac{\mu L}{2(L-\mu)}\left\|x-y-\frac{1}{L}(\nabla f(x)-\nabla f(y))\right\|^{2}.$$

$$\|\mathcal{G}_{t}(x)\|\leq d(0,\partial\varphi(x)).$$

$$\|\mathcal{G}_{t}(x^{+})\|\leq d(0,\partial\varphi(x^{+}))\leq\rho(t)\|\mathcal{G}_{t}(x)\|\leq\rho(t)\,d(0,\partial\varphi(x)).$$

$$\|\mathcal{G}_{t}(x^{+})\|\leq d(0,\partial\varphi(x^{+}))\leq\|\mathcal{G}_{t}(x)\|\leq d(0,\partial\varphi(x)).$$

$$\|\nabla f(x^{+})+s^{+}\|^{2}\leq\rho^{2}(t)\|\nabla f(x)+s\|^{2},\quad\forall s\in\partial g(x).$$

$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2(1-\mu t)}\|\mathcal{G}_{t}(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}.$$

$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2}\|\mathcal{G}_{t}(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}.$$

$$f(x)\geq f(x^{+})+\frac{t}{2}\|\nabla f(x)\|^{2}+\frac{t}{2}\|\nabla f(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}.$$

$$f(x)\geq f(x^{+})+\langle\nabla f(x^{+}),x-x^{+}\rangle+\frac{t}{2}\|\nabla f(x)-\nabla f(x^{+})\|^{2}+\frac{\mu}{2(1-\mu t)}\|x-x^{+}-t(\nabla f(x)-\nabla f(x^{+}))\|^{2}.$$

$$\|s^{+}+\nabla f(x^{+})\|\geq d(0,\partial\varphi(x^{+}))\geq\|\mathcal{G}_{t}(x^{+})\|.$$

$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|g_{Q}(x,t)\|^{2},\quad 0<t\leq\frac{1}{L}.$$

$$\varphi(x)\geq\varphi(x^{+})+\frac{L}{2}\|x^{+}-x\|^{2}.$$

$$\forall x\in\text{dom}\,f,\quad\frac{1}{2}\|\nabla f(x)\|^{2}\geq\eta\,(f(x)-\min f).$$



Full text


Xiaoya Zhang, Hui Zhang: Department of Mathematics, National University of Defense Technology, Changsha, 410073, Hunan, China.

Email: [email protected], [email protected].


Abstract

In this note, we establish a new exact worst-case linear convergence rate of the proximal gradient method in terms of the proximal gradient norm, which complements the recent results in [1] and implies a refined descent lemma. Based on the new lemma, we improve the linear convergence rate of the objective function accuracy under the Polyak-Łojasiewicz inequality.

Keywords: linear convergence, proximal gradient method, strongly convex, Polyak-Łojasiewicz inequality

MSC: 90C25, 90C22, 90C20

Journal: JOTA

1 Introduction

A well-known algorithm for minimizing the sum of a smooth function and a non-smooth convex one is the proximal gradient (PG) method. Recently, the authors of [1] studied the exact worst-case linear convergence rates of the PG method for three standard performance measures: objective function accuracy, distance to optimality, and residual gradient norm. However, the first and third measures rely on the minimizers and the optimal value, which are in general unknown, while the second measure is usually difficult to compute. On the other hand, the proximal gradient norm (the proximal gradient is also called the stepsize in [2]) is suggested in [2] as a more appealing stopping criterion. This motivates us to consider the proximal gradient norm as an alternative to the three existing performance measures.

As a result, we derive an exact worst-case linear convergence rate for the PG method in terms of the proximal gradient norm. The proof is in the same spirit as that of Theorem 2 in [3] but is quite different from the one in [1]. Our result not only complements the recent results in [1], but also helps us refine the classic descent lemma for the PG method, which in turn yields an improved linear convergence rate of the objective function accuracy in the non-strongly convex case.

2 Notations and preliminaries

2.1 Notations and definitions

Throughout the paper, $\mathbb{R}^{n}$ will denote an $n$ -dimensional Euclidean space associated with inner-product $\langle\cdot,\cdot\rangle$ and induced norm $\|\cdot\|$ . For any nonempty $S\subset\mathbb{R}^{n}$ , we define the distance function by $d(x,S):=\inf_{y\in S}\|x-y\|$ . Besides, we define the indicator function of a set $C\subset\mathbb{R}^{n}$ as

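$$\iota_{C}:\mathbb{R}^{n}\to[-\infty,+\infty]:\ x\mapsto\begin{cases}0, & x\in C;\\ +\infty, & \text{otherwise}.\end{cases}$$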

We recall some basic notions. The domain of a function $f:\mathbb{R}^{n}\rightarrow(-\infty,+\infty]$ is defined by $\text{dom}\,f=\{x\in\mathbb{R}^{n}:f(x)<+\infty\}$. We say that $f$ is proper if $\text{dom}\,f\neq\emptyset$.

$L$-smoothness and $\mu$-strong convexity are defined as follows:

• $L$-smoothness: $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$ holds for all $x,y\in\mathbb{R}^{n}$.

• $\mu$-strong convexity: $f(x)-\frac{\mu}{2}\|x\|^{2}$ is convex on $\mathbb{R}^{n}$.

For simplicity, we use the following notation:

• ${\mathcal{F}}_{L}^{1,1}(\mathbb{R}^{n})$: the class of $L$-smooth convex functions from $\mathbb{R}^{n}$ to $\mathbb{R}$;

• ${\mathcal{S}}_{\mu,L}^{1,1}(\mathbb{R}^{n})$: the class of $L$-smooth and $\mu$-strongly convex functions from $\mathbb{R}^{n}$ to $\mathbb{R}$;

• $\Gamma_{0}(\mathbb{R}^{n})$: the class of proper closed and convex functions from $\mathbb{R}^{n}$ to $(-\infty,+\infty]$.

Obviously, we have ${\mathcal{S}}_{\mu,L}^{1,1}(\mathbb{R}^{n})\subseteq{\mathcal{F}}_{L}^{1,1}(\mathbb{R}^{n})$ .

2.2 The proximal gradient algorithm

In this note, we consider the composite convex minimization:

$$\min_{x\in\mathbb{R}^{n}}\{\varphi(x):=f(x)+g(x)\}, \tag{1}$$

where $f\in{\mathcal{F}}_{L}^{1,1}(\mathbb{R}^{n})$ and $g\in\Gamma_{0}(\mathbb{R}^{n})$ .

We focus on the PG method with constant step size $t$ to solve (1). For simplicity, we use the superscript "+" to denote the subsequent iterate. The PG method can be simply expressed by

$$x^{+}={\mathbf{prox}}_{tg}(x-t\nabla f(x))=x-t\cdot\mathcal{G}_{t}(x),\quad t>0,$$

where ${\mathbf{prox}}_{tg}(x):=\arg\min_{u\in\mathbb{R}^{n}}\left\{tg(u)+\frac{1}{2}\|u-x\|^{2}\right\}$ and $\mathcal{G}_{t}(x)=t^{-1}\left(x-{\mathbf{prox}}_{tg}(x-t\nabla f(x))\right)$ is defined as the proximal gradient. By the equality ${\mathbf{prox}}_{tg}=(I+t\partial g)^{-1}$ , we have $x-t\nabla f(x)\in x^{+}+t\partial g(x^{+}),$ which implies that there exists $s^{+}\in\partial g(x^{+})$ such that

$$x^{+}=x-t(\nabla f(x)+s^{+}).$$
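To make the iteration concrete, the following is a minimal Python sketch of one PG step. It assumes, purely for illustration, the diagonal quadratic $f(x)=\frac{1}{2}\sum_{i}a_{i}x_{i}^{2}-\sum_{i}b_{i}x_{i}$ and $g=\lambda\|\cdot\|_{1}$, whose proximal map is soft-thresholding. The names `soft_threshold`, `grad_f` and `pg_step` are hypothetical and do not come from the paper. The sketch also recovers the subgradient $s^{+}=\mathcal{G}_{t}(x)-\nabla f(x)$ from the identity above.

```python
def soft_threshold(v, tau):
    """Proximal map of tau * |.| evaluated at the scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def grad_f(x, a, b):
    """Gradient of the assumed model f(x) = 0.5 * sum(a_i x_i^2) - sum(b_i x_i)."""
    return [ai * xi - bi for ai, xi, bi in zip(a, x, b)]


def pg_step(x, a, b, lam, t):
    """One proximal gradient step for f + lam * ||.||_1 with step size t > 0.

    Returns the next iterate x_plus, the proximal gradient
    G_t(x) = (x - x_plus) / t, and s_plus = G_t(x) - grad f(x),
    which is a subgradient of lam * ||.||_1 at x_plus.
    """
    g = grad_f(x, a, b)
    x_plus = [soft_threshold(xi - t * gi, t * lam) for xi, gi in zip(x, g)]
    G = [(xi - xp) / t for xi, xp in zip(x, x_plus)]
    s_plus = [Gi - gi for Gi, gi in zip(G, g)]
    return x_plus, G, s_plus
```

For instance, `pg_step([1.0, -2.0], [1.0, 4.0], [0.5, 1.0], 0.1, 0.25)` moves both coordinates to positive values, so `s_plus` should equal $\lambda$ in each coordinate up to rounding.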

2.3 Two important lemmas

Our analysis will rely on the following two lemmas.

Lemma 1 ([4, Theorem 2.1.12]; [5, Theorem 4])

If $f\in{\mathcal{S}}^{1,1}_{\mu,L}(\mathbb{R}^{n})$ , then for any $x,y\in\mathbb{R}^{n}$ we have

$$\mu\|x-y\|\leq\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$$

and

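$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq\frac{\mu L}{\mu+L}\|x-y\|^{2}+\frac{1}{\mu+L}\|\nabla f(x)-\nabla f(y)\|^{2},$$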

and the smooth strongly convex interpolation formula

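$$f(x)\geq f(y)+\langle\nabla f(y),x-y\rangle+\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^{2}+\frac{\mu L}{2(L-\mu)}\left\|x-y-\frac{1}{L}(\nabla f(x)-\nabla f(y))\right\|^{2}.$$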

Lemma 2 ([6, Theorem 3.5])

Let $\varphi=f+g$ , where $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$ and $g\in\Gamma_{0}(\mathbb{R}^{n})$ . For any $x\in\mathbb{R}^{n}$ , it holds that

$$\|\mathcal{G}_{t}(x)\|\leq d(0,\partial\varphi(x)).$$

3 Main result and implications

In this section, we present two new results for the PG method: one is an exact worst-case linear convergence rate in terms of the proximal gradient norm, and the other is a refined sufficient decrease property of the objective function value.

3.1 Main result

Now, we are ready to present the main result of this note.

Theorem 3.1

Let $\varphi=f+g$ , where $f\in{\mathcal{S}}^{1,1}_{\mu,L}(\mathbb{R}^{n})$ and $g\in\Gamma_{0}(\mathbb{R}^{n})$ . Denote $\rho(t):=\max\{|1-Lt|,|1-\mu t|\}$ . Then, the PG method for minimizing $\varphi$ achieves the exact worst-case linear convergence rate in terms of the proximal gradient norm:

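$$\|\mathcal{G}_{t}(x^{+})\|\leq d(0,\partial\varphi(x^{+}))\leq\rho(t)\|\mathcal{G}_{t}(x)\|\leq\rho(t)\,d(0,\partial\varphi(x)). \tag{2}$$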

In particular, for $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$ , $g\in\Gamma_{0}(\mathbb{R}^{n})$ , and $0<t\leq\frac{2}{L}$ , it holds that

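$$\|\mathcal{G}_{t}(x^{+})\|\leq d(0,\partial\varphi(x^{+}))\leq\|\mathcal{G}_{t}(x)\|\leq d(0,\partial\varphi(x)).$$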

Proof

Note that $s^{+}\in\partial g(x^{+})$ and hence $d(0,\partial\varphi(x^{+}))\leq\|\nabla f(x^{+})+s^{+}\|$ . Therefore, to show (2), it suffices to show that $\|\nabla f(x^{+})+s^{+}\|^{2}\leq\rho^{2}(t)\|\mathcal{G}_{t}(x)\|^{2}$ in view of Lemma 2. Using Lemma 1, we derive that

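$$\begin{aligned}
\|\nabla f(x^{+})+s^{+}\|^{2}
&=\|t^{-1}(x-x^{+})-(\nabla f(x)-\nabla f(x^{+}))\|^{2}\\
&=t^{-2}\|x-x^{+}\|^{2}-2t^{-1}\langle\nabla f(x)-\nabla f(x^{+}),x-x^{+}\rangle+\|\nabla f(x)-\nabla f(x^{+})\|^{2}\\
&\leq\left(t^{-2}-\frac{2\mu L}{t(\mu+L)}\right)\|x-x^{+}\|^{2}+\left(1-\frac{2}{t(\mu+L)}\right)\|\nabla f(x)-\nabla f(x^{+})\|^{2}\\
&\leq\max\{(t^{-1}-\mu)^{2},(t^{-1}-L)^{2}\}\,\|x-x^{+}\|^{2}\\
&=\rho^{2}(t)\|\mathcal{G}_{t}(x)\|^{2},
\end{aligned}$$

where the first equality uses $\nabla f(x)+s^{+}=t^{-1}(x-x^{+})$, the first inequality uses the second estimate in Lemma 1, and the second inequality uses $\mu\|x-x^{+}\|\leq\|\nabla f(x)-\nabla f(x^{+})\|\leq L\|x-x^{+}\|$, taking the lower bound when $t\leq\frac{2}{\mu+L}$ and the upper bound otherwise.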

Here, the factor $\rho(t)$ cannot be improved; otherwise, it would contradict the following exact worst-case convergence rate, which was recently established in [1]:

$$\|\nabla f(x^{+})+s^{+}\|^{2}\leq\rho^{2}(t)\|\nabla f(x)+s\|^{2},\quad\forall s\in\partial g(x).$$
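The chain of inequalities in Theorem 3.1 can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: $f$ is a diagonal quadratic with curvatures `a` (so $\mu=\min_{i}a_{i}$ and $L=\max_{i}a_{i}$), $g=\lambda\|\cdot\|_{1}$, and $d(0,\partial\varphi(x))$ is computed coordinate-wise from the subdifferential of the $\ell_{1}$ norm. All function names are hypothetical.

```python
import math


def soft_threshold(v, tau):
    """Proximal map of tau * |.| at the scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def grad_f(x, a, b):
    """Gradient of the assumed model f(x) = 0.5 * sum(a_i x_i^2) - sum(b_i x_i)."""
    return [ai * xi - bi for ai, xi, bi in zip(a, x, b)]


def prox_grad(x, a, b, lam, t):
    """Return the next PG iterate and the proximal gradient G_t(x)."""
    g = grad_f(x, a, b)
    x_plus = [soft_threshold(xi - t * gi, t * lam) for xi, gi in zip(x, g)]
    return x_plus, [(xi - xp) / t for xi, xp in zip(x, x_plus)]


def dist_to_subdiff(x, a, b, lam):
    """d(0, subdifferential of phi at x) for phi = f + lam * ||.||_1."""
    residual = []
    for xi, gi in zip(x, grad_f(x, a, b)):
        if xi > 0:
            residual.append(gi + lam)
        elif xi < 0:
            residual.append(gi - lam)
        else:  # the subdifferential of |.| at 0 is the interval [-1, 1]
            residual.append(max(abs(gi) - lam, 0.0))
    return norm(residual)


def rho(t, mu, L):
    return max(abs(1.0 - L * t), abs(1.0 - mu * t))


def chain_holds(x, a, b, lam, t, steps=20, tol=1e-10):
    """Check ||G(x+)|| <= d(x+) <= rho ||G(x)|| <= rho d(x) along PG iterates."""
    r = rho(t, min(a), max(a))
    for _ in range(steps):
        x_plus, G = prox_grad(x, a, b, lam, t)
        _, G_plus = prox_grad(x_plus, a, b, lam, t)
        d, d_plus = dist_to_subdiff(x, a, b, lam), dist_to_subdiff(x_plus, a, b, lam)
        if not (norm(G_plus) <= d_plus + tol
                and d_plus <= r * norm(G) + tol
                and norm(G) <= d + tol):
            return False
        x = x_plus
    return True
```

With curvatures $\{1,4\}$ and $t=\frac{2}{\mu+L}=0.4$, the middle inequality appears to hold with equality on this model, which is consistent with the claim that $\rho(t)$ cannot be improved.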

3.2 Implied result

The second result is a refined version of the classic descent lemma (see [4, Corollary 2.2.1] and [7, Lemma 2.3]).

Lemma 3

Let $\varphi=f+g$ , where $f\in{\mathcal{S}}^{1,1}_{\mu,L}(\mathbb{R}^{n})$ and $g\in\Gamma_{0}(\mathbb{R}^{n})$ . Then, the PG method for minimizing $\varphi$ has the refined sufficient decrease property

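$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2(1-\mu t)}\|\mathcal{G}_{t}(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}.$$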

In particular,

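• for $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$, $g\in\Gamma_{0}(\mathbb{R}^{n})$, it holds that

$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2}\|\mathcal{G}_{t}(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}. \tag{4}$$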

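• for $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$, $g\equiv 0$, it holds that

$$f(x)\geq f(x^{+})+\frac{t}{2}\|\nabla f(x)\|^{2}+\frac{t}{2}\|\nabla f(x^{+})\|^{2},\quad 0<t\leq\frac{1}{L}. \tag{5}$$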

Proof

Note that $0<t\leq L^{-1}$ implies $t^{-1}\geq L$ and hence ${\mathcal{S}}^{1,1}_{\mu,L}(\mathbb{R}^{n})\subset{\mathcal{S}}^{1,1}_{\mu,t^{-1}}(\mathbb{R}^{n})$. We can therefore use the smooth strongly convex interpolation formula in Lemma 1 with $t^{-1}$ in place of $L$ and $y=x^{+}$ to get

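$$f(x)\geq f(x^{+})+\langle\nabla f(x^{+}),x-x^{+}\rangle+\frac{t}{2}\|\nabla f(x)-\nabla f(x^{+})\|^{2}+\frac{\mu}{2(1-\mu t)}\|x-x^{+}-t(\nabla f(x)-\nabla f(x^{+}))\|^{2}.$$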

The convexity of $g$ gives $g(x)\geq g(x^{+})+\langle s^{+},x-x^{+}\rangle$ since $s^{+}\in\partial g(x^{+})$ . Adding these two inequalities, we derive that

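$$\begin{aligned}
\varphi(x)&\geq f(x^{+})+g(x^{+})+\langle\nabla f(x^{+})+s^{+},x-x^{+}\rangle+\frac{t}{2}\|\nabla f(x)-\nabla f(x^{+})\|^{2}\\
&\quad+\frac{\mu}{2(1-\mu t)}\|x-x^{+}-t(\nabla f(x)-\nabla f(x^{+}))\|^{2}\\
&=\varphi(x^{+})+\langle\nabla f(x^{+})+s^{+},x-x^{+}\rangle\\
&\quad+\frac{t}{2}\|\nabla f(x)-\nabla f(x^{+})\|^{2}+\frac{\mu}{2(1-\mu t)}\|x-x^{+}-t(\nabla f(x)-\nabla f(x^{+}))\|^{2}.
\end{aligned}$$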

Using the expression $x^{+}=x-t(\nabla f(x)+s^{+})$ , we can further derive that

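$$\begin{aligned}
\varphi(x)&\geq\varphi(x^{+})+t\langle\nabla f(x^{+})+s^{+},\nabla f(x)+s^{+}\rangle\\
&\quad+\frac{t}{2}\|\nabla f(x)-\nabla f(x^{+})\|^{2}+\frac{\mu t^{2}}{2(1-\mu t)}\|s^{+}+\nabla f(x^{+})\|^{2}\\
&=\varphi(x^{+})+\frac{t}{2}\|s^{+}+\nabla f(x^{+})\|^{2}\\
&\quad+\frac{1}{2t}\|x-x^{+}\|^{2}+\frac{\mu t^{2}}{2(1-\mu t)}\|s^{+}+\nabla f(x^{+})\|^{2}\\
&=\varphi(x^{+})+\frac{1}{2t}\|x-x^{+}\|^{2}+\frac{t}{2(1-\mu t)}\|s^{+}+\nabla f(x^{+})\|^{2}.
\end{aligned}$$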

Note that $x-x^{+}=t\mathcal{G}_{t}(x)$ and

$$\|s^{+}+\nabla f(x^{+})\|\geq d(0,\partial\varphi(x^{+}))\geq\|\mathcal{G}_{t}(x^{+})\|.$$

We finally obtain

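$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2(1-\mu t)}\|\mathcal{G}_{t}(x^{+})\|^{2}.$$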

This completes the proof.
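As a sanity check of Lemma 3, the following Python sketch evaluates the slack in the refined sufficient decrease inequality on a toy instance. It assumes a diagonal quadratic $f$ with curvatures `a` and $g=\lambda\|\cdot\|_{1}$, which are illustrative choices rather than the setting of any experiment in the paper. The helper names are hypothetical, and the step size is assumed to satisfy $0<t\leq\frac{1}{L}$ and $\mu t<1$.

```python
import math


def soft_threshold(v, tau):
    """Proximal map of tau * |.| at the scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def sq_norm(v):
    return sum(vi * vi for vi in v)


def phi(x, a, b, lam):
    """Objective of the assumed model: diagonal quadratic plus lam * ||x||_1."""
    smooth = sum(0.5 * ai * xi * xi - bi * xi for ai, xi, bi in zip(a, x, b))
    return smooth + lam * sum(abs(xi) for xi in x)


def prox_grad(x, a, b, lam, t):
    """Return the next PG iterate and the proximal gradient G_t(x)."""
    g = [ai * xi - bi for ai, xi, bi in zip(a, x, b)]
    x_plus = [soft_threshold(xi - t * gi, t * lam) for xi, gi in zip(x, g)]
    return x_plus, [(xi - xp) / t for xi, xp in zip(x, x_plus)]


def refined_descent_slack(x, a, b, lam, t):
    """phi(x) minus the right-hand side of the refined descent inequality.

    Lemma 3 predicts a nonnegative value when 0 < t <= 1/L and mu * t < 1.
    """
    mu = min(a)
    x_plus, G = prox_grad(x, a, b, lam, t)
    _, G_plus = prox_grad(x_plus, a, b, lam, t)
    rhs = (phi(x_plus, a, b, lam)
           + 0.5 * t * sq_norm(G)
           + t / (2.0 * (1.0 - mu * t)) * sq_norm(G_plus))
    return phi(x, a, b, lam) - rhs
```

A nonnegative return value is what the lemma predicts. Replacing the factor $\frac{t}{2(1-\mu t)}$ by $0$ recovers the slack of the classic descent lemma, so the difference between the two slacks is exactly the extra term gained.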

Remark 1

In [4, Corollary 2.2.1], for $\varphi=f+g$ with $f\in{\mathcal{S}}^{1,1}_{\mu,L}(\mathbb{R}^{n})$ and $g$ being the indicator function of a set $Q$, the descent lemma of the projected gradient method can be stated as

$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|g_{Q}(x,t)\|^{2},\quad 0<t\leq\frac{1}{L},$$

where $g_{Q}(x,t):=t^{-1}(x-x^{+})$ is the gradient mapping of $f$ on $Q$ .

In [7, Lemma 2.3], for $\varphi=f+g$ with $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$ and $g\in\Gamma_{0}(\mathbb{R}^{n})$, the corresponding descent lemma of the PG method is:

$$\varphi(x)\geq\varphi(x^{+})+\frac{L}{2}\|x^{+}-x\|^{2}.$$

Remarkably, our result improves these existing descent lemmas.

With the refined descent lemma, we can show a better linear convergence rate in terms of the objective function accuracy for the gradient descent method under the classic Polyak-Łojasiewicz inequality [8], [lojasiewicz1963topological].

Corollary 1

Let $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$ , $g\equiv 0$ . Assume that $f$ satisfies the Polyak-Łojasiewicz inequality for some $\eta>0$ :

$$\forall x\in\text{dom}\,f,\quad\frac{1}{2}\|\nabla f(x)\|^{2}\geq\eta\,(f(x)-\min f).$$

Let $x^{+}=x-t\nabla f(x)$ , $0<t\leq\frac{1}{L}$ , then it holds that

$$f(x^{+})-\min f\leq\frac{1-t\eta}{1+t\eta}\,(f(x)-\min f). \tag{8}$$

Proof

Using the Polyak-Łojasiewicz inequality and (5) in Lemma 3, we have

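$$f(x)\geq f(x^{+})+\frac{t}{2}\|\nabla f(x)\|^{2}+\frac{t}{2}\|\nabla f(x^{+})\|^{2}\geq f(x^{+})+t\eta\,(f(x)-\min f)+t\eta\,(f(x^{+})-\min f).$$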

Rearranging and subtracting $\min f$ from both sides yields

$$f(x^{+})-\min f\leq\frac{1-t\eta}{1+t\eta}\,(f(x)-\min f).$$

Remark 2

The result (8) with $t=\frac{1}{L}$ improves the existing linear convergence rate in [karimi2016linear, Theorem 1] from $1-\frac{\eta}{L}$ to $\frac{L-\eta}{L+\eta}$.
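The two rates in Remark 2 can be compared numerically. The sketch below is an illustration under assumptions that are not in the paper: it uses the quadratic $f(x)=\frac{1}{2}\sum_{i}a_{i}x_{i}^{2}$ with $\min f=0$, for which $\eta=\min_{i}a_{i}$ is a valid Polyak-Łojasiewicz constant because $\frac{1}{2}\sum_{i}a_{i}^{2}x_{i}^{2}\geq\min_{i}a_{i}\cdot\frac{1}{2}\sum_{i}a_{i}x_{i}^{2}$. The function names are hypothetical.

```python
def refined_rate(t, eta):
    """Contraction factor (1 - t*eta) / (1 + t*eta) from the refined descent lemma."""
    return (1.0 - t * eta) / (1.0 + t * eta)


def classic_rate(L, eta):
    """Contraction factor 1 - eta/L for step size t = 1/L."""
    return 1.0 - eta / L


def f(x, a):
    """Assumed model f(x) = 0.5 * sum(a_i x_i^2), whose minimum value is 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def gd_decrease_ratio(x, a, t):
    """Observed ratio f(x+) / f(x) after one gradient step x+ = x - t * grad f(x)."""
    x_plus = [xi - t * ai * xi for ai, xi in zip(a, x)]
    return f(x_plus, a) / f(x, a)
```

For curvatures $\{1,4\}$ and $t=\frac{1}{L}=0.25$, the refined factor is $0.6$ and the classic one is $0.75$. Starting from `[1.0, 0.0]`, the observed ratio is $0.5625$, which lies below both bounds.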

Finally, we extend the result above to the PG method.

Corollary 2

Let $f\in{\mathcal{F}}^{1,1}_{L}(\mathbb{R}^{n})$ , $g\in\Gamma_{0}(\mathbb{R}^{n})$ . Assume that $\varphi=f+g$ satisfies the generalized Polyak-Łojasiewicz inequality for some $\eta>0$ :

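$$\forall x\in\text{dom}\,\varphi,\quad\frac{1}{2}\|\mathcal{G}_{t}(x)\|^{2}\geq\eta\,(\varphi(x)-\min\varphi).$$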

Let $x^{+}=x-t\mathcal{G}_{t}(x)$ , $0<t\leq\frac{1}{L}$ , then it holds that

$$\varphi(x^{+})-\min\varphi\leq\frac{1-t\eta}{1+t\eta}\,(\varphi(x)-\min\varphi).$$

Proof

Using the generalized Polyak-Łojasiewicz inequality and (4) in Lemma 3, we have

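$$\varphi(x)\geq\varphi(x^{+})+\frac{t}{2}\|\mathcal{G}_{t}(x)\|^{2}+\frac{t}{2}\|\mathcal{G}_{t}(x^{+})\|^{2}\geq\varphi(x^{+})+t\eta\,(\varphi(x)-\min\varphi)+t\eta\,(\varphi(x^{+})-\min\varphi).$$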

Rearranging and subtracting $\min\varphi$ from both sides give us

$$\varphi(x^{+})-\min\varphi\leq\frac{1-t\eta}{1+t\eta}\,(\varphi(x)-\min\varphi).$$

Acknowledgements

This work is supported by the National Science Foundation of China (No.61571008).

Bibliography

[1] Adrien B Taylor, Julien M Hendrickx, and François Glineur. Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. Journal of Optimization Theory and Applications, pages 1–22, 2018.
[2] Dmitriy Drusvyatskiy, Alexander D Ioffe, and Adrian S Lewis. Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. arXiv preprint arXiv:1610.03446, 2016.
[3] Julie Nutini, Mark Schmidt, and Warren Hare. "Active-set complexity" of proximal gradient: How long does it take to find the sparsity pattern? Optimization Letters, 2018.
[4] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[5] Adrien B Taylor, Julien M Hendrickx, and François Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
[6] Dmitriy Drusvyatskiy and Adrian S Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
[7] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[8] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.