A New Exact Worst-Case Linear Convergence Rate of the Proximal Gradient Method
Xiaoya Zhang, Hui Zhang

TL;DR
This paper establishes a new exact worst-case linear convergence rate for the proximal gradient method based on the proximal gradient norm, refining existing results and improving convergence rate estimates under the Polyak-Lojasiewicz inequality.
Contribution
It introduces a new exact worst-case linear convergence rate for the proximal gradient method, enhancing understanding of its convergence behavior.
Findings
New exact worst-case linear convergence rate established
Improved convergence rate of objective function under Polyak-Lojasiewicz inequality
Refined descent lemma for the proximal gradient method
Abstract
In this note, we establish a new exact worst-case linear convergence rate of the proximal gradient method in terms of the proximal gradient norm, which complements the recent results in [1] and implies a refined descent lemma. Based on the new lemma, we improve the linear convergence rate of the objective function accuracy under the Polyak-Lojasiewicz inequality.
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Numerical methods in inverse problems
Xiaoya Zhang, Hui Zhang: Department of Mathematics, National University of Defense Technology, Changsha, 410073, Hunan, China.
Keywords: linear convergence · proximal gradient method · strongly convex · Polyak-Łojasiewicz inequality
MSC: 90C25 · 90C22 · 90C20
Journal: JOTA
1 Introduction
A well-known algorithm for minimizing the sum of a smooth function and a non-smooth convex one is the proximal gradient (PG) method. Recently, the authors of [1] studied the exact worst-case linear convergence rates of the PG method for three standard performance measures: objective function accuracy, distance to optimality, and residual gradient norm. However, objective function accuracy and distance to optimality rely on the optimal value and the minimizers, which are in general unknown, while the residual gradient norm is usually difficult to compute. On the other hand, the norm of the proximal gradient (also called the stepsize in [2]) is suggested in [2] as a more appealing stopping criterion. This motivates us to consider the proximal gradient norm as an alternative to the three existing performance measures.
As a result, we derive an exact worst-case linear convergence rate for the PG method in terms of the proximal gradient norm. The proof is in the same spirit as that of Theorem 2 in [3] but is quite different from the one in [1]. Our result not only complements the recent results in [1], but also helps us refine the classic descent lemma for the PG method and further yields an improved linear convergence rate of the objective function accuracy in the non-strongly convex case.
2 Notations and preliminaries
2.1 Notations and definitions
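Throughout the paper, $\mathbb{R}^n$ will denote an $n$-dimensional Euclidean space associated with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\|\cdot\|$. For any nonempty $Q \subseteq \mathbb{R}^n$, we define the distance function by $\operatorname{dist}(x, Q) := \inf_{y \in Q} \|x - y\|$. Besides, we define the indicator function of a set $Q$ as
$$\delta_Q(x) := \begin{cases} 0, & x \in Q, \\ +\infty, & \text{otherwise}. \end{cases}$$
We recall some basic notions. The domain of a function $f: \mathbb{R}^n \to (-\infty, +\infty]$ is defined by $\operatorname{dom} f := \{x \in \mathbb{R}^n : f(x) < +\infty\}$. We say that $f$ is proper if $\operatorname{dom} f \neq \emptyset$.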
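The $L$-smoothness and $\mu$-strong convexity of a function $f$ are defined as:
$L$-smoothness: $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ holds for all $x, y \in \mathbb{R}^n$.
$\mu$-strong convexity: $f(x) - \frac{\mu}{2}\|x\|^2$ is convex on $\mathbb{R}^n$.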
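For simplicity, we use the following notation:
$\mathcal{F}_{0,L}(\mathbb{R}^n)$: the class of $L$-smooth convex functions from $\mathbb{R}^n$ to $\mathbb{R}$;
$\mathcal{F}_{\mu,L}(\mathbb{R}^n)$: the class of $L$-smooth and $\mu$-strongly convex functions from $\mathbb{R}^n$ to $\mathbb{R}$;
$\Gamma_0(\mathbb{R}^n)$: the class of proper closed and convex functions from $\mathbb{R}^n$ to $(-\infty, +\infty]$.
Obviously, we have $\mathcal{F}_{\mu,L}(\mathbb{R}^n) \subseteq \mathcal{F}_{0,L}(\mathbb{R}^n)$.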
2.2 The proximal gradient algorithm
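In this note, we consider the composite convex minimization problem
$$\min_{x \in \mathbb{R}^n} \varphi(x) := f(x) + g(x), \tag{1}$$
where $f$ is a smooth convex function and $g$ is a proper closed convex function.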
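We focus on the PG method with constant step size $t > 0$ to solve (1). For simplicity, we use the superscript "+" to denote the subsequent iterate. The PG method can be simply expressed by
$$x^+ = \operatorname{prox}_{tg}\big(x - t \nabla f(x)\big) = x - t\, G_t(x),$$
where $\operatorname{prox}_{tg}(y) := \arg\min_{u} \big\{ g(u) + \frac{1}{2t}\|u - y\|^2 \big\}$ and $G_t(x) := \frac{1}{t}(x - x^+)$ is defined as the proximal gradient. By the equality $x^+ = \operatorname{prox}_{tg}(x - t \nabla f(x))$, we have $G_t(x) - \nabla f(x) \in \partial g(x^+)$, which implies that there exists $s^+ \in \partial g(x^+)$ such that
$$x^+ = x - t\,\big(\nabla f(x) + s^+\big).$$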
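As an illustration of the update just described, the following minimal Python sketch performs one PG step on a hypothetical test problem: a separable quadratic $f$ and $g = \lambda \|\cdot\|_1$, whose proximal operator is soft-thresholding. The problem data, the choice of $g$, and the helper names `soft_threshold` and `pg_step` are assumptions made for illustration and do not come from the paper. The proximal gradient is taken as $(x - x^+)/t$, and the sketch recovers the subgradient $s^+$ as the proximal gradient minus $\nabla f(x)$, then checks that it lies in $\partial g(x^+)$.

```python
import math

def soft_threshold(v, tau):
    """Proximal operator of tau * |.| evaluated at the scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def pg_step(x, grad_f, t, lam):
    """One PG step for f + lam * ||.||_1 with step size t.

    Returns the next iterate x_plus and the proximal gradient
    G_t(x) = (x - x_plus) / t.
    """
    g = grad_f(x)
    x_plus = [soft_threshold(xi - t * gi, t * lam) for xi, gi in zip(x, g)]
    G = [(xi - xpi) / t for xi, xpi in zip(x, x_plus)]
    return x_plus, G

# Hypothetical data: f(x) = 0.5 * sum_i a_i * (x_i - b_i)^2, so L = max(a) = 4.
a = [1.0, 4.0]
b = [3.0, -0.1]

def grad_f(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

x = [0.0, 0.0]
t, lam = 0.25, 0.5
x_plus, G = pg_step(x, grad_f, t, lam)

# Subgradient of g at x_plus implied by the optimality condition of the prox.
s_plus = [Gi - gi for Gi, gi in zip(G, grad_f(x))]

# s_plus must lie in lam * (subdifferential of |.|) at each coordinate of x_plus:
# equal to lam * sign(x_plus_i) if x_plus_i != 0, and in [-lam, lam] otherwise.
in_subdiff = all(
    abs(si - lam * math.copysign(1.0, xpi)) < 1e-12 if xpi != 0.0
    else abs(si) <= lam + 1e-12
    for si, xpi in zip(s_plus, x_plus)
)
```

In this example the second coordinate is thresholded to zero, so the corresponding entry of `s_plus` is only required to lie in $[-\lambda, \lambda]$.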
2.3 Two important lemmas
Our analysis will rely on the following two lemmas.
Lemma 1 (Theorem 2.1.12 in [4]; Theorem 4 in [5])
If $f$ is $L$-smooth and $\mu$-strongly convex, then for any $x, y \in \mathbb{R}^n$ we have
[TABLE]
and
[TABLE]
and the smooth strongly convex interpolation formula
[TABLE]
Lemma 2 (Theorem 3.5 in [6])
Let , where and . For any , it holds that
[TABLE]
3 Main result and implications
In this section, we present two new results for the PG method: one is an exact worst-case linear convergence rate in terms of the proximal gradient norm, and the other is a refined sufficient decrease property of the objective function value.
3.1 Main result
Now, we are ready to present the main result of this note.
Theorem 3.1
Let , where and . Denote . Then, the PG method for minimizing achieves the exact worst-case linear convergence rate in terms of the proximal gradient norm:
[TABLE]
In particular, for , , and , it holds that
[TABLE]
Proof
Note that and hence . Therefore, to show (2), it suffices to show that in view of Lemma 2. Using Lemma 1, we derive that
[TABLE]
Here, the factor cannot be improved; otherwise, it would contradict the following exact worst-case convergence rate, which was recently established in [1]:
[TABLE]
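The contraction of the proximal gradient norm can be probed numerically. The sketch below assumes the rate factor has the standard form $\rho(t) = \max\{|1 - Lt|, |1 - \mu t|\}$ used for the PG method in [1], and that the claim under test is $\|G_t(x^+)\| \le \rho(t)\,\|G_t(x)\|$ with $G_t(x) = (x - x^+)/t$. Both the form of the factor and the test problem (a separable strongly convex quadratic plus $\lambda\|\cdot\|_1$, with hypothetical data) are assumptions for illustration, not the paper's own experiment.

```python
import math
import random

def soft_threshold(v, tau):
    """Proximal operator of tau * |.| evaluated at the scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def prox_grad_norm(x, a, b, lam, t):
    """Return (x_plus, ||G_t(x)||) for f(x) = 0.5 * sum_i a_i (x_i - b_i)^2
    and g = lam * ||.||_1."""
    x_plus = [soft_threshold(xi - t * ai * (xi - bi), t * lam)
              for xi, ai, bi in zip(x, a, b)]
    norm = math.sqrt(sum(((xi - xpi) / t) ** 2 for xi, xpi in zip(x, x_plus)))
    return x_plus, norm

random.seed(0)
a = [0.5, 1.0, 2.0, 4.0]            # Hessian eigenvalues of f
mu, L = min(a), max(a)              # strong convexity and smoothness constants
b = [random.uniform(-2, 2) for _ in a]
lam = 0.3
t = 2.0 / (L + mu)
rho = max(abs(1 - L * t), abs(1 - mu * t))   # assumed rate factor

x = [random.uniform(-5, 5) for _ in a]
worst_ratio = 0.0
for _ in range(30):
    x_next, n0 = prox_grad_norm(x, a, b, lam, t)
    _, n1 = prox_grad_norm(x_next, a, b, lam, t)
    if n0 > 1e-6:                   # skip ratios dominated by rounding error
        worst_ratio = max(worst_ratio, n1 / n0)
    x = x_next
```

If the assumed bound holds, `worst_ratio` never exceeds `rho` (up to rounding), which for this step size equals $(L - \mu)/(L + \mu)$.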
3.2 Implied result
The second result is a refined version of the classic descent lemma (see [4, Corollary 2.2.1] and [7, Lemma 2.3]).
Lemma 3
Let , where and . Then, the PG method for minimizing has the refined sufficient decrease property
[TABLE]
In particular,
- for , , it holds that
[TABLE]
- for , , it holds that
[TABLE]
Proof
Note that implies and the fact that . We can use the smooth strongly convex interpolation formula with and in Lemma 1 to get
[TABLE]
The convexity of gives since . Adding these two inequalities, we derive that
[TABLE]
Using the expression , we can further derive that
[TABLE]
Note that and
[TABLE]
We finally obtain
[TABLE]
This completes the proof.
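To see where the extra term in a refined descent estimate comes from, consider the simplest special case $g = 0$, $\mu = 0$, $t = 1/L$, that is, gradient descent $x^+ = x - \frac{1}{L}\nabla f(x)$ on an $L$-smooth convex $f$. The following worked case is a sketch under those assumptions rather than the general statement of Lemma 3. It uses the interpolation inequality for $L$-smooth convex functions, $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2$, with $y = x^+$:

```latex
% Special case g = 0, mu = 0, t = 1/L, so that x - x^+ = (1/L) \nabla f(x).
\begin{align*}
f(x) &\ge f(x^+) + \langle \nabla f(x^+),\, x - x^+ \rangle
        + \frac{1}{2L}\,\|\nabla f(x) - \nabla f(x^+)\|^2 \\
     &= f(x^+) + \frac{1}{L}\langle \nabla f(x^+),\, \nabla f(x) \rangle
        + \frac{1}{2L}\Big( \|\nabla f(x)\|^2
        - 2\langle \nabla f(x),\, \nabla f(x^+) \rangle
        + \|\nabla f(x^+)\|^2 \Big) \\
     &= f(x^+) + \frac{1}{2L}\,\|\nabla f(x)\|^2
        + \frac{1}{2L}\,\|\nabla f(x^+)\|^2 .
\end{align*}
```

Compared with the classic estimate $f(x^+) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2$, the right-hand side gains the additional term $\frac{1}{2L}\|\nabla f(x^+)\|^2$ in this special case.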
Remark 1
In [4, Corollary 2.2.1], for with and being the indicator function of a set , the descent lemma of the projected gradient method can be stated as
[TABLE]
where is the gradient mapping of on .
In [7, Lemma 2.3], for with and , the corresponding descent lemma of the PG method is:
[TABLE]
Remarkably, our result improves these existing descent lemmas.
With the refined descent lemma, we can show a better linear convergence rate in terms of the objective function accuracy for the gradient descent method under the classic Polyak-Łojasiewicz inequality [8, lojasiewicz1963topological].
Corollary 1
Let , . Assume that satisfies the Polyak-Łojasiewicz inequality for some :
[TABLE]
Let , , then it holds that
[TABLE]
Proof
Using the Polyak-Łojasiewicz inequality and (5) in Lemma 3, we have
[TABLE]
Rearranging and subtracting from both sides yield
[TABLE]
Remark 2
The result (8) with improves the existing linear convergence rate in [karimi2016linear, Theorem 1] from to .
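The effect of a refined descent estimate on the rate can be checked on a small example. The sketch below makes two assumptions that are not quoted from the paper: that for gradient descent with $t = 1/L$ the refined descent inequality reads $f(x^+) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2 - \frac{1}{2L}\|\nabla f(x^+)\|^2$, and that the Polyak-Łojasiewicz inequality is $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$. Applying the latter at both $x$ and $x^+$ gives $f(x^+) - f^* \le \frac{L - \mu}{L + \mu}\,(f(x) - f^*)$, which is smaller than the factor $1 - \mu/L$. The test function is a hypothetical quadratic with one flat direction, so it is convex and satisfies the Polyak-Łojasiewicz inequality without being strongly convex.

```python
# Hypothetical data: f(x) = 0.5 * sum_i a_i * (x_i - b_i)^2 with one zero
# curvature direction, so f is convex and L-smooth but not strongly convex.
a = [1.0, 4.0, 0.0]
b = [1.0, -2.0, 0.5]
L = max(a)                            # smoothness constant
mu = min(ai for ai in a if ai > 0)    # PL constant of this quadratic
f_star = 0.0                          # optimal value

def f(x):
    return 0.5 * sum(ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))

def grad(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

def sq_norm(v):
    return sum(vi * vi for vi in v)

old_rate = 1.0 - mu / L               # factor from the classic descent lemma
new_rate = (L - mu) / (L + mu)        # factor derived in the lead-in (assumption)

x = [3.0, 1.0, -1.0]
ratios, refined_ok = [], True
for _ in range(10):
    g = grad(x)
    x_next = [xi - gi / L for xi, gi in zip(x, g)]   # gradient step, t = 1/L
    # Assumed refined descent: both gradient norms appear on the right.
    bound = f(x) - (sq_norm(g) + sq_norm(grad(x_next))) / (2.0 * L)
    refined_ok = refined_ok and f(x_next) <= bound + 1e-12
    ratios.append((f(x_next) - f_star) / (f(x) - f_star))
    x = x_next
```

On this example every per-step ratio in `ratios` stays below `new_rate`, which is itself strictly smaller than `old_rate`.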
Finally, we extend the result above to the PG method.
Corollary 2
Let , . Assume that satisfies the generalized Polyak-Łojasiewicz inequality for some :
[TABLE]
Let , , then it holds that
[TABLE]
Proof
Using the generalized Polyak-Łojasiewicz inequality and (4) in Lemma 3, we have
[TABLE]
Rearranging and subtracting from both sides give us
[TABLE]
Acknowledgements
This work is supported by the National Science Foundation of China (No.61571008).
References
- [1] Adrien B Taylor, Julien M Hendrickx, and François Glineur. Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. Journal of Optimization Theory and Applications, pages 1–22, 2018.
- [2] Dmitriy Drusvyatskiy, Alexander D Ioffe, and Adrian S Lewis. Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. arXiv preprint arXiv:1610.03446, 2016.
- [3] Julie Nutini, Mark Schmidt, and Warren Hare. "Active-set complexity" of proximal gradient: How long does it take to find the sparsity pattern? Optimization Letters, 2018.
- [4] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- [5] Adrien B Taylor, Julien M Hendrickx, and François Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
- [6] Dmitriy Drusvyatskiy and Adrian S Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
- [7] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- [8] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
