Proximal extrapolated gradient methods with prediction and correction for monotone variational inequalities

Xiaokai Chang; Sanyang Liu; Jianchao Bai; Jun Yang

arXiv:1812.04876·math.OC·December 5, 2019


TL;DR

This paper introduces a proximal extrapolated gradient method with prediction and correction for monotone variational inequalities. The method admits larger step sizes, comes with convergence guarantees, and shows improved numerical efficiency in experiments.

Contribution

It extends proximal gradient methods by allowing larger step sizes via prediction and correction, with proven convergence and enhanced numerical performance.

Findings

1. The method converges under a very weak condition, which always holds when the operator is the gradient of a convex function.

2. Larger step sizes improve numerical efficiency.

3. Numerical experiments confirm theoretical advantages.

Abstract

An efficient proximal-gradient-based method, called the proximal extrapolated gradient method, is designed for solving monotone variational inequalities in Hilbert space. The proposed method extends the acceptable range of parameters to obtain larger step sizes. The step size is predicted based on local information of the operator and corrected by linesearch procedures to satisfy a very weak condition, which is even weaker than the boundedness of the generated sequence and always holds when the operator is the gradient of a convex function. We establish its convergence and ergodic convergence rate under the larger range of parameters. Furthermore, we improve numerical efficiency by employing the proposed method with nonmonotonic step sizes, and obtain an upper bound on the parameter related to the step size from an extremely simple example. Related numerical experiments illustrate the…

Tables (4)

Table 1: Results for Problem 1 with different $d$ and $C$.

$C$	$d$	TFBF Iter	TFBF #prox	TFBF #F	TFBF Time	PEG Iter	PEG #F	PEG Time	MPG Iter	MPG Time	IPEG ($\delta=1.01$) Iter	IPEG ($\delta=1.01$) Time	IPEG ($\delta=0.73$) Iter	IPEG ($\delta=0.73$) Time
$C_{1}$	$10^{3}$	141	294	435	0.05	73	143	0.02	243	0.03	62	0.01	48	0.01
$C_{1}$	$10^{4}$	163	341	504	0.1	76	149	0.04	262	0.03	66	0.02	50	0.01
$C_{1}$	$10^{5}$	174	365	539	2.21	80	157	0.76	284	1.23	70	0.32	53	0.31
$C_{2}$	$10^{3}$	139	292	431	0.05	78	154	0.02	229	0.03	77	0.01	63	0.01
$C_{2}$	$10^{4}$	145	305	450	0.31	83	164	0.09	249	0.24	83	0.08	67	0.07
$C_{2}$	$10^{5}$	170	359	529	4.89	88	174	1.44	270	3.41	88	1.13	71	1.01

Table 2: Results for Problem 2 with different $x_{0}$.

$x_{0}$	TFBF Iter	TFBF #prox	TFBF #F	TFBF Time	PEG Iter	PEG #F	PEG Time	IPEG ($\delta=1.01$) Iter	IPEG ($\delta=1.01$) Time	IPEG ($\delta=0.73$) Iter	IPEG ($\delta=0.73$) Time
$(0, 0, 0, 0)$	81	173	254	0.02	82	164	0.1	72	0.01	58	0.01
$(1, 1, 1, 1)$	84	177	261	0.02	79	156	0.1	70	0.01	56	0.01
$(0.5, 0.5, 2, 1)$	88	186	274	0.02	85	169	0.1	75	0.01	59	0.01

Table 3: Results for Problem 3 with different cases.

seed	$m$	TFBF Iter	TFBF #prox	TFBF #F	TFBF Time	PEG Iter	PEG Time	IPEG ($\delta=1.01$) Iter	IPEG ($\delta=1.01$) Time	IPEG ($\delta=0.73$) Iter	IPEG ($\delta=0.73$) Time
1	500	1066	2279	3345	0.25	1185	0.17	1185	0.13	972	0.09
1	1000	1155	2469	3624	1.75	1323	0.93	1268	0.42	1033	0.39
1	5000	1389	2969	4358	56.87	1575	27.89	1630	26.43	1326	22.86
2	500	1270	2715	3985	0.29	1447	0.19	1480	0.14	1165	0.12
2	1000	1134	2424	3558	1.56	1274	0.86	1262	0.41	1028	0.40
2	5000	1365	2918	4283	55.74	1554	33.91	1603	29.94	1303	25.64

Table 4: Results for Problem 4.

data	TFBF Iter	TFBF #prox	TFBF #F	TFBF Time	PEG Iter	PEG #F	PEG Time	IPEG ($\delta=1.01$) Iter	IPEG ($\delta=1.01$) Time	IPEG ($\delta=0.73$) Iter	IPEG ($\delta=0.73$) Time
w7a	971	1950	2867	4.1	968	1933	2.9	827	1.6	716	1.4
a9a	6758	14439	21197	27.8	4241	8601	12.2	3498	6.1	2844	5.0
real-sim	3984	8510	12494	153.8	2651	5312	70.9	2230	35.1	1796	32.8

Equations (177)

$\mbox{find}\ x^{*}\in\mathcal{H}\ \mbox{s.t.}\ \langle F(x^{*}),y-x^{*}\rangle+g(y)-g(x^{*})\geq 0,\ \forall y\in\mathcal{H},$

$\min\limits_{x\in\mathcal{H}}\ f(x)+g(x).$

$\mbox{find}\ x^{*}\in C\ \mbox{s.t.}\ \langle F(x^{*}),y-x^{*}\rangle\geq 0,\ \forall y\in C.$

$\|F(x)-F(y)\|\leq L\|x-y\|,\ \forall x,y\in\mathcal{H};$

$x_{n+1}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(x_{n})),$

$y_{n}=P_{C}(x_{n}-\lambda_{n}F(x_{n})),\quad x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n})),$

$\left.\begin{array}{l}y_{n}=P_{C}(x_{n}-\lambda F(x_{n})),\\ T_{n}=\{w\in\mathcal{H}\mid\langle x_{n}-\lambda F(x_{n})-y_{n},\ w-y_{n}\rangle\leq 0\},\\ x_{n+1}=P_{T_{n}}(x_{n}-\lambda F(y_{n})),\end{array}\right\}$

$y_{n}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(x_{n})),\quad x_{n+1}=y_{n}+\lambda(F(x_{n})-F(y_{n})),$

$x_{n+1}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(y_{n})),\quad y_{n+1}=x_{n+1}+\delta_{n}(x_{n+1}-x_{n}),$

$x_{n+1}=P_{C}(x_{n}-\lambda F(2x_{n}-x_{n-1})),\quad\lambda\in]0,\frac{\sqrt{2}-1}{L}[,$

$\left\{\begin{array}{l}\mbox{Choose}\ x_{0}=y_{0}\in\mathcal{H},\ \lambda_{0}>0,\ \alpha\in]0,\sqrt{2}-1[\\ \mbox{Choose}\ \lambda_{n}\ \mbox{s.t.}\ \lambda_{n}\|F(y_{n})-F(y_{n-1})\|\leq\alpha\|y_{n}-y_{n-1}\|\\ x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n}))\\ y_{n+1}=2x_{n+1}-x_{n}.\end{array}\right.$

$\theta_{n}=\frac{\lambda_{n}}{\delta\lambda_{n-1}},\quad y_{n}=x_{n}+\theta_{n}(x_{n}-x_{n-1}),\quad x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n})),$

$\kappa(\delta):=\max\limits_{\varepsilon_{1}>0,\,\varepsilon_{2}>0}\min\left\{\frac{\varepsilon_{1}}{\delta(\varepsilon_{1}^{2}+\varepsilon_{2}+1)},\ \frac{(\delta^{2}+\delta-1)\varepsilon_{1}\varepsilon_{2}}{\delta^{3}(1+\varepsilon_{2})}\right\}$

$\delta\in]\frac{\sqrt{5}-1}{2},+\infty[\ =\ ]\frac{\sqrt{5}-1}{2},1[\ \cup\ [1,+\infty[$

$\mbox{prox}_{\lambda g}(x):=\mathop{\mathrm{argmin}}\limits_{y\in\mathcal{H}}\left\{g(y)+\frac{1}{2\lambda}\|x-y\|^{2}\right\},\ \forall x\in\mathcal{H},\ \lambda>0.$

$\Phi(x,y):=\langle F(x),y-x\rangle+g(y)-g(x),$

$\langle p-x,y-p\rangle\geq\lambda[g(p)-g(y)],\ \forall y\in\mathcal{H}.$

$a_{n+1}\leq a_{n}-b_{n},\ \forall n>N.$

$ab\leq\frac{a^{2}}{2\varepsilon}+\frac{\varepsilon b^{2}}{2}.$

$\langle x-y,x-z\rangle=\frac{1}{2}\|x-y\|^{2}+\frac{1}{2}\|x-z\|^{2}-\frac{1}{2}\|y-z\|^{2}.$

$y_{n}$

$\lambda_{n}$

$x_{n+1}=\mbox{prox}_{\lambda_{n}g}(x_{n}-\lambda_{n}F(y_{n})),$

$\|x_{n+1}-x_{n}\|\leq\zeta_{n},$

$\zeta_{n}=\max\{\zeta_{\min},\ \min\{\mu\|x_{n}-x_{n-1}\|,\ \nu\|x_{1}-x_{0}\|\}\},$

$A:=\partial g\ \mbox{and}\ x_{n+1}(\lambda):=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(y_{n})).$

$\|x_{n+1}(\lambda)-P_{\overline{\operatorname*{dom}A}}[x_{n+1}(0)]\|$

$\frac{\alpha\|y_{n}-y_{n-1}\|}{\|F(y_{n})-F(y_{n-1})\|}\geq\frac{\alpha\|y_{n}-y_{n-1}\|}{L\|y_{n}-y_{n-1}\|}=\frac{\alpha}{L}$

$\kappa(\delta)=\max\limits_{\varepsilon_{2}>0}\frac{\sqrt{a\varepsilon_{2}+(a-1)\varepsilon_{2}^{2}-\varepsilon_{2}^{3}}}{\delta(a+a\varepsilon_{2})}.$

$\max\limits_{\delta\in]\frac{\sqrt{5}-1}{2},+\infty[}\kappa(\delta)=\kappa(\sqrt{3}-1)=\frac{1}{2}.$

$\|x_{n+1}-x\|^{2}$


Taxonomy

Topics: Optimization and Variational Analysis · Advanced Optimization Algorithms Research · Sparse and Compressive Sensing Techniques

Full text

✉ Xiaokai Chang: [email protected]

✉ Jianchao Bai: [email protected]

1 School of Science, Lanzhou University of Technology, Lanzhou, P. R. China.

2 School of Mathematics and Statistics, Xidian University, Xi’an, P. R. China.

3 Department of Applied Mathematics, Northwestern Polytechnical University, Xi’an, P. R. China.

4 School of Mathematics and Information Science, Xianyang Normal University, Xianyang, P. R. China.

Proximal extrapolated gradient methods with prediction and correction for monotone variational inequalities

Xiaokai Chang1,2

Sanyang Liu1

Jianchao Bai3

Jun Yang4


Abstract

An efficient proximal-gradient-based method, called the proximal extrapolated gradient method, is designed for solving monotone variational inequalities in Hilbert space. The proposed method extends the acceptable range of parameters to obtain larger step sizes. The step size is predicted based on local information of the operator and corrected by linesearch procedures to satisfy a very weak condition, which is even weaker than the boundedness of the generated sequence and always holds when the operator is the gradient of a convex function. We establish its convergence and ergodic convergence rate under the larger range of parameters. Furthermore, we improve numerical efficiency by employing the proposed method with nonmonotonic step sizes, and obtain an upper bound on the parameter related to the step size from an extremely simple example. Related numerical experiments illustrate the improvements in efficiency from the larger step size.

Keywords: Variational inequalities · proximal gradient method · convex optimization · nonmonotonic step size

MSC: 47J20 · 65C10 · 65C15 · 90C33

Journal: COAP

1 Introduction

Let $\mathcal{H}$ be a real Hilbert space equipped with inner product $\langle\cdot,\cdot\rangle$ and its induced norm $\parallel\cdot\parallel$ . We consider the variational inequality problem:

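$$\mbox{find}\ x^{*}\in\mathcal{H}\ \ \mbox{s.t.}\ \ \langle F(x^{*}),y-x^{*}\rangle+g(y)-g(x^{*})\geq 0,\quad\forall y\in\mathcal{H},\tag{1}$$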

where $F:{\mathcal{H}}\rightarrow{\mathcal{H}}$ is an operator and $g:{\mathcal{H}}\rightarrow]-\infty,+\infty]$ is a proper lower semicontinuous convex function. We use $\operatorname*{dom}g$ to denote the domain of $g$, defined by $\operatorname*{dom}g:=\{x\in{\mathcal{H}}:g(x)<+\infty\}$. If $F=\nabla f$ for a continuously differentiable convex function $f:{\mathcal{H}}\rightarrow]-\infty,+\infty[$, then problem (1) is equivalent to

$$\min_{x\in\mathcal{H}}\ f(x)+g(x).\tag{2}$$

Let $C$ be a closed and convex subset of ${\mathcal{H}}$ . Let $l_{C}$ be the indicator function of the set $C$ , that is, $l_{C}(x)=0$ if $x\in C$ and $\infty$ otherwise. When $g(x)=l_{C}(x)$ , variational inequality (1) reduces to

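$$\mbox{find}\ x^{*}\in C\ \ \mbox{s.t.}\ \ \langle F(x^{*}),y-x^{*}\rangle\geq 0,\quad\forall y\in C.\tag{3}$$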

Problem (1) and its special cases (2) and (3) have wide applications in disciplines including mechanics, signal and image processing, and economics [app1; app2; app3; app4; statistical_learning; 11], to name a few. Throughout the paper, the solution set ${\mathcal{S}}$ of problem (1) is assumed to be nonempty, and the following assumptions hold:

(A1) $F$ is monotone, i.e.,

$\langle F(x)-F(y),x-y\rangle\geq 0,\ \ \forall x,y\in\mathcal{H};$

(A2) $F$ is $L$ -Lipschitz continuous ( $L>0$ ), that is,

$$\|F(x)-F(y)\|\leq L\|x-y\|,\quad\forall x,y\in\mathcal{H};$$

(A3) $g|_{\operatorname*{dom}g}$, the restriction of $g$ to its domain, is a continuous function.

Many efficient methods have been proposed for solving problem (1) and its special cases, for instance, the alternating direction method of multipliers (ADMM) [statistical_learning; PC-ADMM; He_ADMM-based; ADMM], the extragradient method [6; extragradient; extragradient-type; 13], and the proximal (projected) gradient method [8; FBS; FBS_P; FBS2015; modified-FB; New_properties] together with its accelerated versions [FISTA-CD; Nesterov1983]. Here, we concentrate on the simplest of these approaches: the forward-backward splitting (FBS) method. Under the assumption that $F$ is $L$-Lipschitz continuous, the iterative scheme of the classical FBS method for problem (1) reads

$$x_{n+1}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(x_{n})),\tag{4}$$

where $\lambda$ is some positive number and can be viewed as a step size of the forward step, and the proximal operator $\mbox{prox}_{\lambda g}:{\mathcal{H}}\rightarrow{\mathcal{H}}$ is defined in Section 2.
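As an illustration of the iteration $x_{n+1}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(x_{n}))$, the following minimal sketch runs FBS on a small problem of type (2). The test problem (a random least-squares term plus an $\ell_1$ penalty), the helper names `soft_threshold` and `forward_backward`, and all numerical values are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(F, prox, x0, lam, iters=500):
    """Iterate x_{n+1} = prox_{lam g}(x_n - lam F(x_n))."""
    x = x0.copy()
    for _ in range(iters):
        x = prox(x - lam * F(x), lam)
    return x

# Illustrative problem: min 0.5*||A x - b||^2 + mu*||x||_1, so F = grad f.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
mu = 0.1
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of F
F = lambda x: A.T @ (A @ x - b)
prox = lambda v, lam: soft_threshold(v, lam * mu)

x_fbs = forward_backward(F, prox, np.zeros(5), lam=1.0 / L)
# Fixed-point residual: it vanishes exactly at a solution of (1).
residual = np.linalg.norm(x_fbs - prox(x_fbs - F(x_fbs) / L, 1.0 / L))
```

Here the step size $1/L$ is computed from the known matrix `A`; the methods discussed later in the paper are motivated precisely by situations where $L$ is unknown or pessimistic.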

Establishing convergence of iteration (4) often requires restrictive assumptions: $F$ must be $L$-Lipschitz continuous and strongly (or inverse strongly) monotone, with $\lambda\in]0,\frac{2}{L}[$. To overcome this drawback, Korpelevich [extragradient] and Antipin [13] proposed the following extragradient method for (3), which uses two projections per iteration:

$$y_{n}=P_{C}(x_{n}-\lambda_{n}F(x_{n})),\qquad x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n})),$$

where $P_{C}:\mathcal{H}\rightarrow C$ denotes the (metric) projection onto $C$, and $\{\lambda_{n}\}$ is any positive sequence satisfying $\lambda_{n}\in[l,u]$ for some values $l,u\in]0,\frac{1}{L}[$. The extragradient method has received much attention and has been improved in various ways [5; M-extra; Non-Lip; Low-cost; 9], for example by adding linesearch procedures, dropping the Lipschitz-continuity assumption, or reducing the number of metric projections. For instance, Censor, Gibali and Reich [5] introduced

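$$\left.\begin{array}{l}y_{n}=P_{C}(x_{n}-\lambda F(x_{n})),\\ T_{n}=\{w\in\mathcal{H}\mid\langle x_{n}-\lambda F(x_{n})-y_{n},\ w-y_{n}\rangle\leq 0\},\\ x_{n+1}=P_{T_{n}}(x_{n}-\lambda F(y_{n})),\end{array}\right\}$$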

where the step size satisfies $\lambda\in]0,\frac{1}{L}[$. Since the second projection $P_{T_{n}}$ in (8) has a closed form, this method is more applicable when projecting onto the closed convex set $C$ is nontrivial. For the more general problem (1), Tseng [modified-FB] modified the iteration (4) and proposed the following forward-backward-forward (FBF) method involving one proximal operator and two values of $F$ per iteration:

$$y_{n}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(x_{n})),\qquad x_{n+1}=y_{n}+\lambda(F(x_{n})-F(y_{n})),$$

where $\lambda\in]0,\frac{1}{L}[$. Since then, Tseng's method has attracted much interest owing to its simplicity and generality; see [FB-Tseng; Tseng; inertial-FBF] for more details.
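The difference between FBS and Tseng's FBF correction can be seen on a monotone operator that is not strongly monotone. The sketch below uses a planar rotation $F(x)=Mx$ with $M$ skew-symmetric and $g=0$, so that $L=1$ and the solution is $x^{*}=0$. This test operator, the step size $0.5$, and the function names are assumptions made for illustration only.

```python
import numpy as np

M = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: monotone, L = 1
F = lambda x: M @ x
prox = lambda v, lam: v                    # g = 0, so the prox is the identity

def fbs_step(x, lam):
    """One forward-backward step."""
    return prox(x - lam * F(x), lam)

def fbf_step(x, lam):
    """One forward-backward-forward step (Tseng)."""
    y = prox(x - lam * F(x), lam)
    return y + lam * (F(x) - F(y))

x_fbs = x_fbf = np.array([1.0, 1.0])
for _ in range(200):
    x_fbs = fbs_step(x_fbs, 0.5)
    x_fbf = fbf_step(x_fbf, 0.5)
norm_fbs = np.linalg.norm(x_fbs)           # grows: FBS spirals outward here
norm_fbf = np.linalg.norm(x_fbf)           # shrinks toward the solution 0
```

For this operator each FBS step multiplies the norm by $\sqrt{1+\lambda^{2}}$, whereas each FBF step multiplies it by $\sqrt{1-\lambda^{2}+\lambda^{4}}<1$ for $\lambda\in]0,1[$, which is why the extra forward evaluation matters.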

In the literature, inertial extrapolation has been used to accelerate proximal gradient methods in the spirit of Nesterov's extrapolation techniques [Nesterov1983; Nesterov2004], whose basic idea is to make full use of historical information at each iteration. A typical scheme of the proximal gradient method with extrapolation for solving (1) is

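$$x_{n+1}=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(y_{n})),\qquad y_{n+1}=x_{n+1}+\delta_{n}(x_{n+1}-x_{n}),\tag{9}$$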

where $\delta_{n}>0$. Recently, using a fixed parameter $\delta=1$ in (9), Malitsky [9] introduced the iteration

$$x_{n+1}=P_{C}(x_{n}-\lambda F(2x_{n}-x_{n-1})),\qquad\lambda\in]0,\tfrac{\sqrt{2}-1}{L}[,$$

for solving (3). However, the step size ($\lambda$ or $\lambda_{n}$) requires knowledge of the Lipschitz constant $L$, which is a main drawback of the algorithms introduced above. In fact, a large value of $L$ forces a very small step size, which may result in slow convergence [10]. To obtain a proper step size, Armijo-type line search and outer approximation techniques were used in [Khobotov; search-strategy; Solodov; extragradient-type]. Because they require extra proximal operators and extra evaluations of $F$, these algorithms become computationally expensive when the proximal operator or $F$ is costly to compute.
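The reflected iteration $x_{n+1}=P_{C}(x_{n}-\lambda F(2x_{n}-x_{n-1}))$ with $\delta=1$ needs only one projection and one evaluation of $F$ per step. A minimal sketch follows, assuming a rotation operator with $L=1$ and a box $C=[-5,5]^{2}$ (both chosen for illustration; for these values the box constraint stays inactive along the trajectory, and $x^{*}=0$ solves the problem).

```python
import numpy as np

M = np.array([[0.0, 1.0], [-1.0, 0.0]])      # monotone rotation, Lipschitz constant L = 1
F = lambda v: M @ v
project = lambda v: np.clip(v, -5.0, 5.0)     # P_C for the box C = [-5, 5]^2

lam = 0.4                                     # inside ]0, (sqrt(2) - 1)/L[, about ]0, 0.414[
x_prev = x = np.array([1.0, 1.0])
for _ in range(300):
    # Evaluate F at the reflected point 2*x_n - x_{n-1}, then project once.
    x_prev, x = x, project(x - lam * F(2.0 * x - x_prev))
dist_to_solution = np.linalg.norm(x)          # distance to the solution x* = 0
```

The admissible step size here still depends on $L$, which is the drawback the adaptive rules discussed next are meant to remove.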

To obtain a proper step size without using the Lipschitz constant $L$, Malitsky [9] introduced an efficient method whose main updates are

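$$\left\{\begin{array}{l}\mbox{Choose}\ x_{0}=y_{0}\in\mathcal{H},\ \lambda_{0}>0,\ \alpha\in]0,\sqrt{2}-1[\\ \mbox{Choose}\ \lambda_{n}\ \mbox{s.t.}\ \lambda_{n}\|F(y_{n})-F(y_{n-1})\|\leq\alpha\|y_{n}-y_{n-1}\|\\ x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n}))\\ y_{n+1}=2x_{n+1}-x_{n}.\end{array}\right.$$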

With the step size $\lambda_{n}$ updated by a specific procedure according to the progress of the algorithm, weak convergence was proved, but this procedure requires computing additional projections onto $C$. Later, Mainge and Gobinddass [10] introduced a more general framework:

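$$\theta_{n}=\frac{\lambda_{n}}{\delta\lambda_{n-1}},\qquad y_{n}=x_{n}+\theta_{n}(x_{n}-x_{n-1}),\qquad x_{n+1}=P_{C}(x_{n}-\lambda_{n}F(y_{n})),\tag{10}$$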

where the step size $\lambda_{n}$ must satisfy several inequality constraints and can be obtained by a linesearch procedure; see [10, Sections 3.1 and 3.2.2]. Based on the scheme (10), local information of the operator, and some linesearch procedures, Malitsky [Proximal-extrapolated] proposed simpler schemes that do not require Lipschitz continuity of the operator. Furthermore, the linesearch procedure involved needs no extra prox or projection, and it can be applied to the more general problem (1). By avoiding both the estimation of $L$ and the linesearch procedure for the scheme (10), Yang and Liu [yang] proposed an extragradient method with lower computational complexity but nonincreasing step sizes. To guarantee convergence, the important parameter $\alpha$ related to the step size $\lambda_{n}$ was restricted to $\alpha\in]0,\frac{\sqrt{2}-1}{\delta}[$ with $\delta\in]1,+\infty[$ in [yang], and to $\alpha\in]0,\sqrt{2}-1[$ with variable $\delta_{n}$ from the linesearch in [9].

The aim of this paper is to propose a proximal gradient algorithm with larger step sizes, to extend the range of $\delta$ to values less than or equal to 1, and thereby to enlarge the range of $\alpha$. The proposed methods do not require the Lipschitz constant; their step size is predicted using two previous iterates and corrected by linesearch to satisfy a very weak condition, which always holds when $F=\nabla f$ for a convex function $f$. Specifically, guided by the key inequalities in the convergence proof, we first introduce a function $\kappa(\delta)$ defined as

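$$\kappa(\delta):=\max_{\varepsilon_{1}>0,\,\varepsilon_{2}>0}\min\left\{\frac{\varepsilon_{1}}{\delta(\varepsilon_{1}^{2}+\varepsilon_{2}+1)},\ \frac{(\delta^{2}+\delta-1)\varepsilon_{1}\varepsilon_{2}}{\delta^{3}(1+\varepsilon_{2})}\right\}\tag{11}$$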

for any $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ to ensure some convergence properties. Then we get $\max\limits_{\delta\in]\frac{\sqrt{5}-1}{2},+\infty[}\kappa(\delta)=\kappa(\sqrt{3}-1)=\frac{1}{2}$, and use $\alpha\in]0,\kappa(\delta)[$ to control the step size. Our range of $\alpha$ is larger than those presented in [9; yang]; see Lemma 2 for more explanation. Secondly, the region of $\delta$ is partitioned as

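$$\delta\in\left]\tfrac{\sqrt{5}-1}{2},+\infty\right[\ =\ \left]\tfrac{\sqrt{5}-1}{2},1\right[\ \cup\ [1,+\infty[$$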

to explore convergence of the proposed method, and the ${\mathcal{O}}(1/n)$ ergodic convergence rate is established. Finally, we obtain an upper bound on $\alpha$ from an extremely simple example, and improve numerical efficiency by introducing a nonmonotonic step size $\lambda_{n}$ that still satisfies $\frac{\lambda_{n}}{\lambda_{n-1}}\rightarrow 1$. The nonmonotonic step size reduces the dependence on the initial point, but it must eventually become monotonic to guarantee convergence.

The paper is organized as follows. In Section 2, we provide some useful facts and notations. In Section 3, we introduce our algorithm and explore the properties of the function $\kappa(\delta)$. A weak convergence theorem for our method is proved in Section 3.1. In Section 3.2, we establish the ergodic convergence rate of the proposed algorithms, and in Section 3.3 we improve the algorithms to avoid the adverse effects of the nonincreasing step size. In Section 4, we show by an example that any value of $\alpha\in]\frac{2}{2\delta+1},+\infty[$ with $\delta\in]0,+\infty[$ does not guarantee convergence of our algorithm. Numerical experiments on problems tested in the literature are reported and analyzed in Section 5. We conclude the paper in Section 6.

2 Preliminaries

In this section, we introduce some notations and facts on the well-known properties of the proximal operator, the Opial condition and Young's inequality, which are used in the subsequent convergence analysis.

The proximal operator $\mbox{prox}_{\lambda g}:{\mathcal{H}}\rightarrow{\mathcal{H}}$ with $\mbox{prox}_{\lambda g}(x)=(I+\lambda\partial g)^{-1}(x)$, $\lambda>0$, $x\in{\mathcal{H}}$, is defined by

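$$\mbox{prox}_{\lambda g}(x):=\mathop{\mathrm{argmin}}_{y\in\mathcal{H}}\left\{g(y)+\frac{1}{2\lambda}\|x-y\|^{2}\right\},\quad\forall x\in\mathcal{H},\ \lambda>0.$$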

Setting

$$\Phi(x,y):=\langle F(x),y-x\rangle+g(y)-g(x),\tag{12}$$

it is clear that problem (1) is equivalent to finding $x^{*}\in{\mathcal{H}}$ such that $\Phi(x^{*},y)\geq 0$ for all $y\in{\mathcal{H}}$ .

Fact 1

[Bauschke2011Convex] *Let $g:{\mathcal{H}}\rightarrow(-\infty,+\infty]$ be a convex function, $\lambda>0$ and $x\in{\mathcal{H}}$. Then $p=\mbox{prox}_{\lambda g}(x)$ if and only if*

$$\langle p-x,y-p\rangle\geq\lambda[g(p)-g(y)],\quad\forall y\in\mathcal{H}.$$
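The characterization in Fact 1, $\langle p-x,y-p\rangle\geq\lambda[g(p)-g(y)]$ for all $y$, can be checked numerically for a prox with a known closed form. The sketch below assumes $g=\|\cdot\|_{1}$ on $\mathbb{R}^{4}$, whose prox is componentwise soft-thresholding; the choice of $g$, the dimension, and the random test points are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: shrink each component toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
lam = 0.7
g = lambda v: np.abs(v).sum()
x = rng.standard_normal(4)
p = soft_threshold(x, lam)                  # p = prox_{lam g}(x)

# Smallest slack of <p - x, y - p> - lam*(g(p) - g(y)) over random test points y.
ys = 3.0 * rng.standard_normal((1000, 4))
slack = (ys - p) @ (p - x) - lam * (g(p) - np.abs(ys).sum(axis=1))
min_slack = slack.min()

# Objective from the definition of the prox; p should minimize it.
objective = lambda y: g(y) + np.sum((x - y) ** 2) / (2 * lam)
gap_vs_perturbed = objective(p + 0.1) - objective(p)
```

A nonnegative `min_slack` (up to rounding) is consistent with the inequality, and a positive `gap_vs_perturbed` is consistent with `p` minimizing the objective in the definition of the proximal operator.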

Fact 2

[Opial] *(Opial 1967) Let ${\mathcal{S}}$ be a nonempty subset of ${\mathcal{H}}$ and $\{x_{n}\}_{n\in{\mathbb{N}}}$ be a sequence in ${\mathcal{H}}$ such that the following two conditions hold:

(1) for every $x^{*}\in{\mathcal{S}}$, $\lim\limits_{n\rightarrow+\infty}\|x_{n}-x^{*}\|$ exists;

(2) every sequential weak cluster point of $\{x_{n}\}_{n\in{\mathbb{N}}}$ is in ${\mathcal{S}}$.

Then $\{x_{n}\}_{n\in{\mathbb{N}}}$ converges weakly to a point in ${\mathcal{S}}$.*

Fact 3

Let $\{a_{n}\}$, $\{b_{n}\}$ be two nonnegative real sequences, and suppose there exists $N>0$ such that

$$a_{n+1}\leq a_{n}-b_{n},\quad\forall n>N.$$

Then $\{a_{n}\}$ is convergent and $\lim\limits_{n\rightarrow\infty}b_{n}=0$ .

Fact 4

(Young’s inequality) For all $a,b\geq 0$ and $\varepsilon>0$ , we have

$$ab\leq\frac{a^{2}}{2\varepsilon}+\frac{\varepsilon b^{2}}{2}.$$

The following identity (cosine rule) appears many times, and we use it to simplify the convergence analysis. For all $x,y,z\in{\mathcal{H}}$,

$$\langle x-y,x-z\rangle=\frac{1}{2}\|x-y\|^{2}+\frac{1}{2}\|x-z\|^{2}-\frac{1}{2}\|y-z\|^{2}.$$
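Both Young's inequality and the cosine rule follow from expanding a square. The short derivation below is a standard argument, included here only as a reminder; it is not quoted from the paper.

```latex
% Young's inequality: expand a nonnegative square.
0 \le \frac{1}{2}\Bigl(\frac{a}{\sqrt{\varepsilon}} - \sqrt{\varepsilon}\,b\Bigr)^{2}
  = \frac{a^{2}}{2\varepsilon} - ab + \frac{\varepsilon b^{2}}{2}
\quad\Longrightarrow\quad
ab \le \frac{a^{2}}{2\varepsilon} + \frac{\varepsilon b^{2}}{2}.

% Cosine rule: write y - z = (x - z) - (x - y) and expand the norm.
\|y-z\|^{2} = \|(x-z)-(x-y)\|^{2}
            = \|x-z\|^{2} - 2\langle x-y,\,x-z\rangle + \|x-y\|^{2}
\quad\Longrightarrow\quad
\langle x-y,\,x-z\rangle
  = \tfrac{1}{2}\|x-y\|^{2} + \tfrac{1}{2}\|x-z\|^{2} - \tfrac{1}{2}\|y-z\|^{2}.
```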

3 Proximal Extrapolated Gradient Method with Prediction and Correction

In this section, we state our proximal extrapolated gradient method with prediction and correction (PEG), by using the step size function $\kappa(\delta)$ defined in (11).

Algorithm 1 (PEG for solving (1))

Step 0.

Take $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ , choose $x_{0}\in\mathcal{H},$ $\lambda_{0}>0$ , $\gamma\in(0,1)$ , $\alpha\in]0,\kappa(\delta)[$ and a bounded sequence $\{\zeta_{n}>0\}$ . Set $y_{0}=x_{0}$ , $x_{1}=\mbox{prox}_{\lambda_{0}g}(x_{0}-\lambda_{0}F(x_{0}))$ and $n=1$ .

Step 1.

**Prediction:**

1.a. Compute

[TABLE]

1.b. Compute

$$x_{n+1}=\mbox{prox}_{\lambda_{n}g}(x_{n}-\lambda_{n}F(y_{n})),$$

if $x_{n+1}=x_{n}=y_{n}$ , then stop: $x_{n+1}$ is a solution.

Step 2.

Correction when $\delta<1$ :

Check

$$\|x_{n+1}-x_{n}\|\leq\zeta_{n},$$

If this does not hold, set $\lambda_{n}\leftarrow\gamma\lambda_{n}$ and return to Step 1.b.

Step 3.

Set $n\leftarrow n+1$ and return to Step 1.

The aim of the Correction step is to bound $\{\|x_{n}-x_{n-1}\|\}$ by the given sequence $\{\zeta_{n}\}$ when $\delta<1$, as the convergence analysis requires $\|x_{n+1}-x_{n}\|<+\infty$. In practice, the sequence $\{\zeta_{n}\}$ need not be given in advance; it can be generated adaptively by

$$\zeta_{n}=\max\{\zeta_{\min},\ \min\{\mu\|x_{n}-x_{n-1}\|,\ \nu\|x_{1}-x_{0}\|\}\},\tag{16}$$

for given $1<\mu\leq\nu$ and a small $\zeta_{\min}$ (e.g., $\zeta_{\min}=10^{-6}$). Then $\zeta_{n}\geq\zeta_{\min}$ and, provided $\zeta_{\min}\leq\nu\|x_{1}-x_{0}\|$, also $\zeta_{n}\leq\nu\|x_{1}-x_{0}\|$ for all $n\geq 1$, so $\{\zeta_{n}\}$ is bounded. Moreover, the term $\mu\|x_{n}-x_{n-1}\|$ enforces $\|x_{n+1}-x_{n}\|\leq\mu\|x_{n}-x_{n-1}\|$, a tighter bound that is natural because $\|x_{n+1}-x_{n}\|\rightarrow 0$.

For a convex function $f$, if $F=\nabla f$ we observe $\|x_{n+1}-x_{n}\|<+\infty$, see (40), so the Correction step is not necessary. In other cases, however, one needs to apply the linesearch to ensure $\|x_{n+1}-x_{n}\|<+\infty$. Interestingly, for all the tested problems shown in Section 5, the linesearch in the Correction step is never triggered when using (16) with $\mu=\nu=10$. Namely, the predicted step is good enough to obtain a convergent sequence for the tested problems, though convergence without the Correction step is unknown in general.
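The prediction and correction steps described above can be sketched as follows. Two points are assumptions rather than statements from the text: the extrapolation $y_{n}=x_{n}+\delta(x_{n}-x_{n-1})$ is taken from the proof of Lemma 3, and the prediction rule $\lambda_{n}=\min\{\lambda_{n-1},\alpha\|y_{n}-y_{n-1}\|/\|F(y_{n})-F(y_{n-1})\|\}$ is a guess consistent with Remark 1, since the displayed formula of Step 1.a is not available here. The function name `peg`, the tolerance-based stopping test, and the rotation test operator are likewise illustrative.

```python
import numpy as np

def peg(F, prox, x0, lam0, delta, alpha, gamma=0.5, mu=10.0, nu=10.0,
        zeta_min=1e-6, max_iter=5000, tol=1e-10):
    """Sketch of Algorithm 1; returns the last iterate and the final step size."""
    x_prev = y_prev = np.asarray(x0, dtype=float)
    lam = lam0
    x = prox(x_prev - lam * F(x_prev), lam)          # x_1 from Step 0
    first_move = np.linalg.norm(x - x_prev)          # ||x_1 - x_0||
    Fy_prev = F(y_prev)
    for _ in range(max_iter):
        # Step 1.a (prediction): extrapolate, then predict the step size.
        y = x + delta * (x - x_prev)
        Fy = F(y)
        dF = np.linalg.norm(Fy - Fy_prev)
        if dF > 0.0:
            # ASSUMED rule: nonincreasing estimate from local information of F.
            lam = min(lam, alpha * np.linalg.norm(y - y_prev) / dF)
        # Step 1.b: proximal step using F at the extrapolated point.
        x_new = prox(x - lam * Fy, lam)
        # Step 2 (correction, only when delta < 1): enforce ||x_{n+1}-x_n|| <= zeta_n.
        if delta < 1.0:
            zeta = max(zeta_min, min(mu * np.linalg.norm(x - x_prev), nu * first_move))
            while np.linalg.norm(x_new - x) > zeta:
                lam *= gamma
                x_new = prox(x - lam * Fy, lam)
        # Tolerance-based version of the stopping test x_{n+1} = x_n = y_n.
        if max(np.linalg.norm(x_new - x), np.linalg.norm(y - x)) <= tol:
            return x_new, lam
        x_prev, x, y_prev, Fy_prev = x, x_new, y, Fy
    return x, lam

# Illustrative monotone test operator (a rotation, L = 1) with g = 0; solution x* = 0.
M = np.array([[0.0, 1.0], [-1.0, 0.0]])
x_peg, lam_final = peg(lambda v: M @ v, lambda v, lam: v, [1.0, 1.0],
                       lam0=1.0, delta=np.sqrt(3.0) - 1.0, alpha=0.49)
```

The call uses $\delta=\sqrt{3}-1<1$ and $\alpha=0.49$, just below the bound $\kappa(\sqrt{3}-1)=\frac{1}{2}$ stated in the introduction, so the correction branch is active in principle even if the check is rarely violated.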

The following lemma shows that the correction procedure described in Algorithm 1 is well-defined.

Lemma 1

The correction procedure always terminates, i.e., $\{\lambda_{n}\}$ is well defined when $\delta\in]\frac{\sqrt{5}-1}{2},1[$.

Proof. Denote

$$A:=\partial g\quad\mbox{and}\quad x_{n+1}(\lambda):=\mbox{prox}_{\lambda g}(x_{n}-\lambda F(y_{n})).$$

From [Bauschke2011Convex, Theorem 23.47], we have that $\mbox{prox}_{\lambda g}[x_{n+1}(0)]\rightarrow P_{\overline{\operatorname*{dom}A}}[x_{n+1}(0)]$ as $\lambda\rightarrow 0$ ($\overline{\operatorname*{dom}A}$ denotes the closure of $\operatorname*{dom}A$), which together with the nonexpansivity of $\mbox{prox}_{\lambda g}$ yields

[TABLE]

By taking the limit as $\lambda\rightarrow 0$, we deduce that $x_{n+1}(\lambda)\rightarrow P_{\overline{\operatorname*{dom}A}}[x_{n+1}(0)]$. Since $x_{n+1}(0)=x_{n}$ and $x_{n}$, being the output of a proximal step, lies in $\operatorname*{dom}A$, we obtain $P_{\overline{\operatorname*{dom}A}}[x_{n+1}(0)]=x_{n}$.

Suppose, to the contrary, that the correction procedure in Algorithm 1 fails to terminate at the $n$-th iteration. Then, for all $\lambda=\gamma^{i}\lambda_{n}$ with $i=0,1,\cdots$, we have $\|x_{n+1}(\lambda)-x_{n}\|>\zeta_{n}$. Since $\gamma^{i}\rightarrow 0$ as $i\rightarrow\infty$, we have $\lambda\rightarrow 0$ and hence $\|x_{n+1}(\lambda)-x_{n}\|\rightarrow 0$, which gives the contradiction $0\geq\zeta_{n}>0$. This completes the proof. $\Box$
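The termination argument of Lemma 1 can be watched on a small instance: as $\lambda=\gamma^{i}\lambda_{n}$ shrinks, $x_{n+1}(\lambda)$ approaches $x_{n}$, so the check $\|x_{n+1}(\lambda)-x_{n}\|\leq\zeta_{n}$ eventually passes. The sketch assumes $g=\|\cdot\|_{1}$ (so $\operatorname*{dom}A$ is the whole space) and uses made-up vectors in place of $x_{n}$ and $F(y_{n})$.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1, used here as prox_{lam g} with g = ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x_n = np.array([1.0, -2.0, 0.5])           # hypothetical current iterate
Fy = np.array([3.0, 1.0, -4.0])            # hypothetical value standing in for F(y_n)
zeta, gamma, lam = 0.1, 0.5, 1.0

trials = 0
# Backtrack lam <- gamma*lam until the correction check ||x_{n+1}(lam) - x_n|| <= zeta holds.
while np.linalg.norm(soft_threshold(x_n - lam * Fy, lam) - x_n) > zeta:
    lam *= gamma
    trials += 1
final_move = np.linalg.norm(soft_threshold(x_n - lam * Fy, lam) - x_n)
```

The number of backtracking trials depends on $\zeta_{n}$ and on the size of $F(y_{n})$; a strictly positive $\zeta_{n}$ is what guarantees the loop ends.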

Remark 1

Note that the sequence $\{\lambda_{n}\}$ is monotonically decreasing. Since $F$ is an $L$-Lipschitz continuous mapping ($L>0$), we have

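$$\frac{\alpha\|y_{n}-y_{n-1}\|}{\|F(y_{n})-F(y_{n-1})\|}\geq\frac{\alpha\|y_{n}-y_{n-1}\|}{L\|y_{n}-y_{n-1}\|}=\frac{\alpha}{L}$$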

for $F(y_{n})\neq F(y_{n-1})$. Thus the predicted step sequence $\{\lambda_{n}\}_{n\in{\mathbb{N}}}$ has the lower bound $\tau:=\min\{{\frac{\alpha}{L},\lambda_{0}}\}$; hence, when $\delta\geq 1$, its limit exists and $\lim\limits_{n\rightarrow\infty}\lambda_{n}\geq\tau>0$. If $\delta<1$, $\{\lambda_{n}\}$ is well defined by Lemma 1 and has the lower bound $\tau:=\min\{{\frac{\gamma^{i_{0}}\alpha}{L},\lambda_{0}}\}$ for some $i_{0}\geq 0$, which implies $\lim\limits_{n\rightarrow\infty}\lambda_{n}>0$ as well.

Below, we derive the analytical expression of $\kappa(\delta)$ .

Lemma 2

For the function $\kappa(\delta)$ defined in (11), we have $\kappa(\delta)=\frac{\sqrt{a+1}}{\delta(a+1+\sqrt{a+1})}$ with $a=\frac{\delta^{2}}{\delta^{2}+\delta-1}$ for $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ .

Proof. Fix $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$, then $\delta^{2}+\delta-1>0$. Given the structure of (11) and the fact that $\kappa(\delta)$ is the maximum of a minimum of two terms, the two terms coincide at the optimum, that is, $\frac{\varepsilon_{1}}{\delta(\varepsilon_{1}^{2}+\varepsilon_{2}+1)}=\frac{(\delta^{2}+\delta-1)\varepsilon_{1}\varepsilon_{2}}{\delta^{3}(1+\varepsilon_{2})}$, which together with $a=\frac{\delta^{2}}{\delta^{2}+\delta-1}$ and $\varepsilon_{1}=\sqrt{a+1}$ shows

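$$\kappa(\delta)=\max_{\varepsilon_{2}>0}\frac{\sqrt{a\varepsilon_{2}+(a-1)\varepsilon_{2}^{2}-\varepsilon_{2}^{3}}}{\delta(a+a\varepsilon_{2})}.\tag{17}$$

By the first-order optimality condition of the optimization problem (17), we have $\varepsilon_{2}=\sqrt{a+1}-1$. Substituting this into (17) yields the result. $\Box$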

By Lemma 2 and Fig. 1, the maximum value of $\kappa(\delta)$ is $\frac{1}{2}$ when $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ , and in fact,

$$\max_{\delta\in]\frac{\sqrt{5}-1}{2},+\infty[}\kappa(\delta)=\kappa(\sqrt{3}-1)=\frac{1}{2}.$$

In this case, we have $a=2$ , $\varepsilon_{1}=\sqrt{3}$ and $\varepsilon_{2}=\sqrt{3}-1$ .

Remark 2

Note that the method proposed in [yang] is a special case of Algorithm 1 with $g(x)=l_{C}(x)$ and $\delta\in]1,+\infty[$, whereas Lemma 2 gives $\kappa(\delta)>\frac{\sqrt{2}-1}{\delta}$. Namely, we extend the range of $\delta$ and thereby improve the upper bound of $\alpha$ when the operator is the gradient of a convex function or when the linesearch is used, see Fig. 1. This allows a larger step size $\lambda_{n}$, which is more efficient in numerical experiments.
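The closed form of Lemma 2 and the comparison in Remark 2 are easy to check numerically. The sketch below evaluates $\kappa(\delta)=\frac{\sqrt{a+1}}{\delta(a+1+\sqrt{a+1})}$ with $a=\frac{\delta^{2}}{\delta^{2}+\delta-1}$, compares it with a brute-force max-min over a grid of $(\varepsilon_{1},\varepsilon_{2})$ taken from the definition (11), and compares it with the bound $\frac{\sqrt{2}-1}{\delta}$ for $\delta>1$. The grid ranges and resolutions are arbitrary choices, and the function names are illustrative.

```python
import numpy as np

def kappa_closed_form(delta):
    """kappa(delta) from Lemma 2; valid for delta > (sqrt(5) - 1)/2."""
    a = delta**2 / (delta**2 + delta - 1.0)
    s = np.sqrt(a + 1.0)
    return s / (delta * (a + 1.0 + s))

def kappa_grid(delta, n=400):
    """Brute-force max-min from definition (11) on a grid of (eps1, eps2)."""
    e1, e2 = np.meshgrid(np.linspace(0.01, 6.0, n), np.linspace(0.01, 6.0, n))
    t1 = e1 / (delta * (e1**2 + e2 + 1.0))
    t2 = (delta**2 + delta - 1.0) * e1 * e2 / (delta**3 * (1.0 + e2))
    return np.minimum(t1, t2).max()

delta_star = np.sqrt(3.0) - 1.0
k_star = kappa_closed_form(delta_star)          # value at the claimed maximizer
k_grid = kappa_grid(delta_star)                 # grid estimate (a lower bound on the max-min)

deltas = np.linspace(0.62, 3.0, 2000)           # samples of ](sqrt(5)-1)/2, +inf[
k_max = kappa_closed_form(deltas).max()

big = np.linspace(1.01, 3.0, 200)               # Remark 2: compare with (sqrt(2)-1)/delta
improves = bool(np.all(kappa_closed_form(big) > (np.sqrt(2.0) - 1.0) / big))
```

Because the grid only samples feasible pairs $(\varepsilon_{1},\varepsilon_{2})$, `k_grid` can only underestimate the true max-min, so it serves as a sanity check rather than a proof.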

3.1 Convergence Analysis

This section is devoted to studying convergence properties of Algorithm 1. For $\delta\in[1,+\infty[$, its convergence and convergence rate can be obtained by combining the methods in [yang; Proximal-extrapolated] with the basic theory of limits. However, the situation is completely different for $\delta\in]\frac{\sqrt{5}-1}{2},1[$, since the desired properties (such as monotonicity and nonnegativity) are no longer valid in this case, although we can adopt a larger value of $\alpha$.

We next give a basic lemma about the iterates generated by Algorithm 1 for any $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$, which plays a crucial role in proving the main convergence results.

Lemma 3

Let $\{x_{n}\}$ and $\{y_{n}\}$ be two sequences generated by Algorithm 1. For any $x\in{\mathcal{H}}$ , we have

[TABLE]

Proof. It follows from $x_{n+1}=\mbox{prox}_{\lambda_{n}g}(x_{n}-\lambda_{n}F(y_{n}))$ and Fact 1 that

[TABLE]

which shows

[TABLE]

Substituting $x:=x_{n+1}$ and $x:=x_{n-1}$ into the above inequality respectively, we obtain

[TABLE]

Multiplying (21) by $\delta$, adding it to (20), and using $y_{n}=x_{n}+\delta(x_{n}-x_{n-1})$, we obtain

[TABLE]

Multiplying (22) by $\frac{\lambda_{n}}{\lambda_{n-1}}$ and using $y_{n}=x_{n}+\delta(x_{n}-x_{n-1})$ again, we get

[TABLE]

Finally, adding (19) to (23) gives us

[TABLE]

Then, using (13), the updating of $\lambda_{n}$ and Cauchy-Schwarz inequality, we obtain

[TABLE]

The proof is completed. $\Box$

Lemma 4

Let $\{x_{n}\}$ , $\{y_{n}\}$ be two sequences generated by Algorithm 1 and $\bar{x}\in{\mathcal{S}}$ (the solution set of problem (1)). Then, for any $\varepsilon_{1},\varepsilon_{2}>0$ , we have

[TABLE]

where $\Phi$ is defined as in (12).

Proof. Using Fact 4, for any $\varepsilon_{1}>0$ we have

[TABLE]

Meanwhile, for any $\varepsilon_{2}>0$ we deduce

[TABLE]

Combining the above inequalities we have

[TABLE]

In addition, the monotonicity of $F$ implies for any $x\in{\mathcal{H}}$

[TABLE]

Substituting (24) and (25) into (18), we deduce, with the aid of $\Phi(x,y)$ in (12), that

[TABLE]

Since $\delta>0$ and $\{\lambda_{n}\}_{n\in{\mathbb{N}}}$ is a monotone decreasing sequence, we have $\lambda_{n}\delta\leq\lambda_{n-1}\delta\leq(1+\delta)\lambda_{n-1}$ . Note that $\Phi(\bar{x},x_{n-1})\geq 0$ for any $\bar{x}\in{\mathcal{S}}$ , then

[TABLE]

This completes the proof.∎

Rearranging the terms in Lemma 4 directly yields the following result.

Lemma 5

Let $\{x_{n}\}$ , $\{y_{n}\}$ be two sequences generated by Algorithm 1 and $\bar{x}\in{\mathcal{S}}$ . Then, for any $\varepsilon_{1},\varepsilon_{2}>0$ , we have

[TABLE]

where

[TABLE]

or

[TABLE]

Because the sequence $\{\lambda_{n}\}_{n\in{\mathbb{N}}}$ is monotonically decreasing, we have $1-\frac{\lambda_{n}}{\delta\lambda_{n-1}}\geq 0$ for any $\delta\geq 1$. But for $\delta\in]\frac{\sqrt{5}-1}{2},1[$, we have $\lim\limits_{n\rightarrow+\infty}\left(1-\frac{\lambda_{n}}{\delta\lambda_{n-1}}\right)=1-\frac{1}{\delta}<0$. So the convergence of Algorithm 1 with $\delta<1$ differs from that with $\delta\geq 1$, and hence cannot be established by methods similar to those in [yang; Proximal-extrapolated].

Since $a_{n}\geq 0$ in (32) when $\delta\geq 1$, we use (32) to study the convergence of Algorithm 1 with $\delta\geq 1$. Consequently, a larger upper bound $\kappa(\delta)$ of $\alpha$ is obtained than that in [yang]. For the case $\delta<1$, we use (36), for which $a_{n}\geq 0$ for all $n\geq 1$, and further investigate the properties of $b_{n}$ to ensure convergence of Algorithm 1.

Below we state and prove our main convergence results for Algorithm 1 in the above two regions: $\delta\in]\frac{\sqrt{5}-1}{2},1[$ and $\delta\in[1,+\infty[$.

Theorem 1

Let $\{x_{n}\}$ be the sequence generated by Algorithm 1 with $\delta\in[1,+\infty[$ . Then, $\{x_{n}\}$ converges weakly to a solution of problem (1).

Proof. From Remark 1, we have $\lim\limits_{n\rightarrow\infty}\lambda_{n}=\lambda>0$ . Then for any $\delta\in[1,+\infty[$ and $\alpha<\kappa(\delta)$ , we have

[TABLE]

Thus, there exists an integer $N>2,$ such that for any $n>N$ ,

[TABLE]

which implies that $b_{n}\geq 0$ in (32) when $n>N$. Recalling that $1-\frac{\lambda_{n-1}}{\delta\lambda_{n-2}}\geq 0$ for any $\delta\geq 1$, we deduce $a_{n}\geq 0$ in (32). Hence, by Lemma 5 and Fact 3, $\{a_{n}\}_{n\in{\mathbb{N}}}$ is convergent and $\lim\limits_{n\rightarrow\infty}b_{n}=0$. This means that $\{\|x_{n}-\bar{x}\|^{2}\}$ is bounded and so is $\{x_{n}\}_{n\in{\mathbb{N}}}$. Also, we have $\lim\limits_{n\rightarrow\infty}\|x_{n+1}-y_{n}\|=0$ and $\lim\limits_{n\rightarrow\infty}\|x_{n}-y_{n}\|=0.$ By $\|x_{n+1}-x_{n}\|=\frac{1}{\delta}\|x_{n+1}-y_{n+1}\|$, we also have that $\lim\limits_{n\rightarrow\infty}\|x_{n+1}-x_{n}\|=0$ and $\{y_{n}\}_{n\in{\mathbb{N}}}$ is bounded.

In what follows, we prove that the sequence $\{x_{n}\}$ converges weakly to a solution of problem (1). For any weak cluster point $x^{*}\in{\mathcal{H}}$ of $\{x_{n}\}$, there exists a subsequence $\{x_{n_{k}}\}$ that converges weakly to $x^{*}$, namely $x_{n_{k}}\rightharpoonup x^{*}$. Since $\|x_{n}-y_{n}\|\rightarrow 0$, $\{y_{n_{k}}\}$ also converges weakly to $x^{*}$. Next we verify that $x^{*}\in{\mathcal{S}}$. Applying Fact 1, we deduce

[TABLE]

Letting $k\rightarrow\infty$ in (38) and using the facts $\lim\limits_{k\rightarrow\infty}\|x_{n_{k}+1}-x_{n_{k}}\|=0$ , $g(x)$ is lower semicontinuous and $\lim\limits_{n\rightarrow\infty}\lambda_{n}=\lambda>0,$ we obtain

[TABLE]

which confirms $x^{*}\in{\mathcal{S}}$ .

Finally, we prove that $x_{n}\rightharpoonup x^{*}$. We take $\bar{x}=x^{*}$ in the definition (32) of $a_{n}$ and denote the result by $a_{n}^{*}$. Since $\{\lambda_{n}\}$ is bounded and $\Phi(x^{*},\cdot)$ is continuous by (A3), we observe

[TABLE]

Therefore, $\lim\limits_{k\rightarrow\infty}\|x_{n}-x^{*}\|=0$ , which by Fact 2 shows $x_{n}\rightharpoonup x^{*}$ . $\Box$

Now, we focus on the convergence analysis of Algorithm 1 with $\delta\in]\frac{\sqrt{5}-1}{2},1[$ and use (36). In this case, we cannot establish the nonnegativity of $\{b_{n}\}$ or the monotone decrease of $\{a_{n}\}$, because $\frac{1}{\delta}-1>0$. Consequently, convergence of $\{a_{n}\}_{n\in{\mathbb{N}}}$ cannot be obtained from (27). We thus need to investigate the sequence $\{a_{n}\}$ further to establish convergence, using the boundedness of $\{\|x_{n}-x_{n-1}\|\}$ guaranteed by the Correction step.

First, we show that $\|x_{n+1}-x_{n}\|<+\infty$ when $\delta\in]\frac{\sqrt{5}-1}{2},1[$ and the operator $F$ is the gradient of a convex function $f:{\mathcal{H}}\rightarrow{\mathbb{R}}$ , i.e., $F=\nabla f$ . From (26) with $x=x_{n}$ and $\Phi(x_{n},x_{n})=0$ , we deduce

[TABLE]

Using $F=\nabla f$ and the convexity of $f$ yields

[TABLE]

where $\phi=f+g$ . This together with $\lim\limits_{n\rightarrow\infty}\lambda_{n}=\lambda>0$ , $\alpha<\kappa(\delta)$ and $\lambda_{n}\leq\lambda_{n-1}$ gives us

[TABLE]

where $\bar{x}\in{\mathcal{S}}$ , which implies from $\phi(x_{n})-\phi(\bar{x})\geq 0$ that

[TABLE]

That is to say, the Correction step is not necessary when $F=\nabla f$ for a convex function $f$.

Theorem 2

Let $\{x_{n}\}$ be the sequence generated by Algorithm 1 with $\delta\in]\frac{\sqrt{5}-1}{2},1[$ . Then, $\{x_{n}\}$ converges weakly to a solution of problem (1).

Proof. First, $\delta\in]\frac{\sqrt{5}-1}{2},1[$ gives $\frac{1}{\delta}>\frac{\delta^{2}+\delta-1}{\delta^{3}}>0$. Since $\lim\limits_{n\rightarrow\infty}\lambda_{n}=\lambda>0$, taking the limit and using $\alpha<\kappa(\delta)$, we have

[TABLE]

for any $\delta\in]\frac{\sqrt{5}-1}{2},1[$ . Thus, there exists an integer $N>2,$ such that for any $n>N$ ,

[TABLE]

By $x_{n+1}-x_{n}=\frac{y_{n+1}-x_{n+1}}{\delta}$ , Remark 1 and Lemma 5, for any $\varepsilon_{1},\varepsilon_{2}>0$ and $M>N+1$ , we have

[TABLE]

where $\xi_{M}=\frac{1}{\delta^{2}}\left(\frac{\lambda_{M}}{\delta\lambda_{M-1}}-1\right)\|x_{M+1}-y_{M+1}\|^{2}<+\infty$ by Remark 1 and the Correction step. This together with $a_{n}\geq 0$ in (36) implies that $\{a_{n}\}_{n\in{\mathbb{N}}}$ is bounded and

[TABLE]

so $\lim\limits_{n\rightarrow\infty}\|x_{n+1}-y_{n}\|=0$ and $\lim\limits_{n\rightarrow\infty}\|x_{n}-y_{n}\|=0.$ By the fact $\|x_{n+1}-x_{n}\|=\frac{1}{\delta}\|x_{n+1}-y_{n+1}\|$ , we have $\lim\limits_{n\rightarrow\infty}\|x_{n+1}-x_{n}\|=0$ .

Since $\|x_{n}-\bar{x}\|^{2}\leq a_{n}$, the sequence $\{x_{n}\}_{n\in{\mathbb{N}}}$ is bounded. The proof is completed by Remark 1 and arguments similar to those in the proof of Theorem 1. $\Box$
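The opening inequality of this proof can be checked directly. The short derivation below is our own verification and uses only $\delta\in]\frac{\sqrt{5}-1}{2},1[$:

$$
\frac{1}{\delta}-\frac{\delta^{2}+\delta-1}{\delta^{3}}
=\frac{\delta^{2}-(\delta^{2}+\delta-1)}{\delta^{3}}
=\frac{1-\delta}{\delta^{3}}>0\ \Longleftrightarrow\ \delta<1,
\qquad
\delta^{2}+\delta-1>0\ \Longleftrightarrow\ \delta>\frac{\sqrt{5}-1}{2}\quad(\delta>0).
$$

Both bounds of the interval are therefore used: the lower bound gives positivity of the middle term, and the upper bound gives the strict gap to $\frac{1}{\delta}$.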

Remark 3

By the above analysis, it seems that convergence of the proposed algorithm could still be ensured without the assumption (A3), but as far as we know it is not clear how to prove this. In fact, the assumption (A3) is not restrictive: $g$ is continuous on $\operatorname*{dom}g$ when $\operatorname*{dom}g$ is an open set (this includes all finite-valued functions) or when $g=\delta_{C}$ for any closed convex set $C$. Moreover, (A3) holds for any separable lower semicontinuous convex function by (Bauschke2011Convex, Corollary 9.15).

3.2 Ergodic Convergence Rate for $\delta\in]\frac{\sqrt{5}-1}{2},1]$

Since the convergence rate for $\delta\geq 1$ has been studied extensively, we focus on the case $\delta\in]\frac{\sqrt{5}-1}{2},1]$. Recall that the optimal rate of convergence for the extragradient method is ${\mathcal{O}}(1/n)$ rate-convergence . In this subsection, we investigate the ergodic convergence rate of the sequence $\{y_{n}\}_{n\in{\mathbb{N}}}$ for the general case (1).

From [11] and (Proximal-extrapolated, Lemma 2.12), $x^{*}\in{\mathcal{S}}$ if and only if $x^{*}\in\operatorname*{dom}g$ and

[TABLE]

The following theorem shows that the above criterion can be used to find $x^{*}$ to a desired accuracy.

Theorem 3

Let $\{x_{n}\}$ and $\{y_{n}\}$ be generated by Algorithm 1. For any $n_{1}>N$ and a sufficiently large $J\in{\mathbb{N}}$ related to $n_{1}$ , we define

[TABLE]

for any $j>J$ , then $\hat{x}_{j}\in\operatorname*{dom}g$ and

[TABLE]

Proof. First of all, we have by (26) that

[TABLE]

Since $\|x_{n}-y_{n}\|\rightarrow 0$ as $n\rightarrow+\infty$, there exists a sufficiently large $J$ such that $\|x_{j}-y_{j}\|\leq\|x_{n_{1}}-y_{n_{1}}\|\neq 0$ for any $j>J$ (if $\|x_{n_{1}}-y_{n_{1}}\|=0$, then $\|x_{n_{1}+1}-y_{n_{1}+1}\|\neq 0$, since otherwise $x_{n_{1}+1}$ is a solution). So we may assume $\|x_{n_{1}}-y_{n_{1}}\|\neq 0$ with $n_{1}>N$. Recalling (48), we deduce for any $j>J$ that

[TABLE]

Note that the function $\Phi(\bar{x},\cdot)$ is convex. Now, applying Jensen’s inequality to the left-hand side of the above inequality and taking

[TABLE]

into account, we have

[TABLE]

where

[TABLE]

Evidently, $\hat{x}_{j}\in\operatorname*{dom}g$ which ends the proof. $\Box$

Notice that $\{\lambda_{n}\}$ has a lower bound $\tau>0$ by Remark 1. Fixing $n_{1}>N$, we get $\hat{\lambda}_{j}\geq(j-n_{1})\tau$, so $\hat{\lambda}_{j}\rightarrow\infty$ as $j\rightarrow\infty$. Hence Algorithm 1 has the ergodic convergence rate ${\mathcal{O}}(1/j)$ when $j>J$.
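To make the role of the lower bound $\tau$ concrete, here is a minimal Python sketch of a step-size-weighted ergodic average. It assumes, purely for illustration, that the averaged point is a convex combination of iterates weighted by the step sizes and that the normalizing constant is their sum; the exact definitions of $\hat{x}_{j}$ and $\hat{\lambda}_{j}$ are those of Theorem 3 and may differ in detail. The function name `ergodic_average` is hypothetical.

```python
def ergodic_average(points, weights):
    """Return (weighted average of points, total weight).

    points  : list of vectors (lists of floats), e.g. iterates y_n
    weights : list of positive floats, e.g. step sizes lambda_n
    Assumption: the average is sum(w_i * p_i) / sum(w_i).
    """
    total = 0.0
    avg = [0.0] * len(points[0])
    for p, w in zip(points, weights):
        total += w
        # Incremental update keeps avg equal to sum(w_i * p_i) / sum(w_i).
        step = w / total
        avg = [a + step * (pi - a) for a, pi in zip(avg, p)]
    return avg, total
```

If every weight is at least $\tau$, the returned total is at least $\tau$ times the number of averaged points, which is the mechanism behind the ${\mathcal{O}}(1/j)$ rate.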

3.3 Heuristics on Nonmonotonic Step Sizes

Generally speaking, a variable step size is more beneficial than a fixed one for proximal gradient methods. In Algorithm 1, the step size $\{\lambda_{n}\}_{n\in{\mathbb{N}}}$ is updated, but only in a nonincreasing way, which can be harmful if the algorithm starts in a region where $F$ has large curvature. In other words, the step size in Algorithm 1 is overly dependent on the initial point. To obtain nonmonotonic step sizes, we present an improved algorithm as follows:

Algorithm 2 (Improved PEG with nonmonotonic step size.)

Step 0.

Take $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ , choose $x_{0}\in\mathcal{H},$ $\lambda_{0}>0$ , $\gamma\in(0,1)$ , $\alpha\in]0,\kappa(\delta)[$ and a bounded sequence $\{\zeta_{n}\}$ . Set $y_{0}=x_{0}$ , $x_{1}=\mbox{prox}_{\lambda_{0}g}(x_{0}-\lambda_{0}F(x_{0}))$ and $n=1$ . Choose $\widehat{\lambda}>0$ and a sequence $\{\phi_{n}\}$ with $\phi_{n}\in[1,\frac{1+\delta}{\delta}]$ and $\phi_{n}=1$ when $n\geq n_{0}$ for given $n_{0}$ .

Step 1.

Prediction:

1.a. Compute

[TABLE]

1.b. Compute

[TABLE]

if $x_{n+1}=x_{n}=y_{n}$ , then stop: $x_{n+1}$ is a solution.

Step 2.

Correction:

Check

[TABLE]

If this condition does not hold, set $\lambda_{n}\leftarrow\gamma\lambda_{n}$ and return to Step 1.b.

Step 3.

Set $n\leftarrow n+1$ and return to Step 1.

Since the step size is no longer monotonically decreasing, $a_{n}\geq 0$ in (32) is not necessarily valid when $\delta\geq 1$, so Algorithm 2 applies the Correction step for any $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$. From $\phi_{n}\in[1,\frac{1+\delta}{\delta}]$ and $\lambda_{n+1}\leq\phi_{n}\lambda_{n}$, we deduce $\delta\lambda_{n+1}\leq(1+\delta)\lambda_{n}$. Then Lemmas 3, 4 and 5 with (36) remain valid for the sequences $\{x_{n}\}$ and $\{y_{n}\}$ generated by Algorithm 2.
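The step-size bound used here follows in one line from the two stated conditions:

$$
\delta\lambda_{n+1}\leq\delta\phi_{n}\lambda_{n}\leq\delta\cdot\frac{1+\delta}{\delta}\,\lambda_{n}=(1+\delta)\lambda_{n}.
$$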

The constant $\widehat{\lambda}$ in Algorithm 2 is given only to ensure that $\{\lambda_{n}\}$ is bounded above. Hence, it makes sense to choose $\widehat{\lambda}$ quite large. In this case, the step sizes are allowed to increase but remain bounded by Remark 1. Moreover, since $\phi_{n}=1$ when $n\geq n_{0}$ for the given $n_{0}$, the sequence $\{\lambda_{n}\}_{n>n_{0}}$ generated by Algorithm 2 is monotonically decreasing and hence convergent,

[TABLE]

and $\frac{1}{\delta^{2}}\left(\frac{\lambda_{n}}{\delta\lambda_{n-1}}-1\right)\|x_{n+1}-y_{n+1}\|^{2}<+\infty$ . Under these conditions, it is not difficult to prove the following convergence theorem by using Lemma 5 with (36), though we do not know how to choose a proper $n_{0}$ .

Theorem 4

Let $\{x_{n}\}_{n\in{\mathbb{N}}}$ be a sequence generated by Algorithm 2 with $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ . Then, $\{x_{n}\}_{n\in{\mathbb{N}}}$ converges weakly to a solution of problem (1).

4 Further Discussion

From the analysis above, the condition $\alpha\in]0,\kappa(\delta)[$ for any $\delta\in]\frac{\sqrt{5}-1}{2},+\infty[$ is sufficient to ensure convergence of the proposed method. In this section, we show by an extremely simple example that Algorithm 1 is not convergent when $\alpha\in]\frac{2}{2\delta+1},+\infty[$ for any $\delta\in]0,+\infty[$. That is to say, we derive an upper bound on $\alpha$ beyond which convergence of Algorithm 1 fails; the behavior of Algorithm 1 with $(\delta,\alpha)$ in some regions remains to be studied further, see Fig. 2.

Consider the simplest optimization problem

[TABLE]

Obviously, it can be formulated as a special case of problem (3) with $F=I$ (the identity operator), $L=1$ and $C={\mathbb{R}}^{m}$. Following the updates of Algorithm 1, we have

[TABLE]

For any $\widetilde{d}_{n}$ , $\widehat{d}_{n}\in{\mathbb{R}}$ , if

[TABLE]

then we can rewrite (54) as

[TABLE]

By (55) and Vieta’s Theorem, we have

[TABLE]

If $\max\left\{|\widetilde{d}_{n}|,~{}|\widehat{d}_{n}|\right\}>1$, then the iteration (56) is not convergent. As a result, (54) is not convergent either. Namely, if

[TABLE]

then the iteration (54) is not convergent. By Remark 1 and $L=1$, the convergence of Algorithm 1 cannot be guaranteed if $\lambda_{0}>\frac{2}{2\delta+1}$ and $\alpha\in]\frac{2}{2\delta+1},+\infty[$ for any $\delta\in]0,+\infty[$.
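The threshold $\frac{2}{2\delta+1}$ can be reproduced numerically with a minimal Python sketch. It rests on assumptions that we state explicitly: a fixed step size $\lambda$, the scalar case, the extrapolation $y_{n}=x_{n}+\delta(x_{n}-x_{n-1})$ (consistent with the relation $\|x_{n+1}-x_{n}\|=\frac{1}{\delta}\|x_{n+1}-y_{n+1}\|$ used in the proofs), and the update $x_{n+1}=x_{n}-\lambda F(y_{n})$ with $F=I$. Under these assumptions the recursion is $x_{n+1}=(1-\lambda(1+\delta))x_{n}+\lambda\delta x_{n-1}$. The function names are hypothetical and this is not the authors' code.

```python
import math

def characteristic_roots(delta, lam):
    """Roots of d^2 - (1 - lam*(1+delta))*d - lam*delta = 0."""
    s = 1.0 - lam * (1.0 + delta)       # sum of the roots (Vieta)
    p = -lam * delta                    # product of the roots (Vieta)
    disc = math.sqrt(s * s - 4.0 * p)   # real, because p < 0
    return (s + disc) / 2.0, (s - disc) / 2.0

def is_divergent(delta, lam):
    """True if some root has modulus greater than 1."""
    return max(abs(r) for r in characteristic_roots(delta, lam)) > 1.0

def iterate(delta, lam, x0=1.0, steps=200):
    """Run the assumed scalar recursion and return the last iterate."""
    x_prev, x = x0, x0 - lam * x0       # y_0 = x_0, so x_1 = x_0 - lam*F(x_0)
    for _ in range(steps):
        y = x + delta * (x - x_prev)    # extrapolation step
        x_prev, x = x, x - lam * y      # gradient step with F = I
    return x
```

For $\delta=1$ the threshold is $\frac{2}{3}$: `is_divergent(1.0, 0.7)` is true while `is_divergent(1.0, 0.6)` is false, and `iterate` grows or decays accordingly.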

5 Numerical Experiments

In this section, we apply Algorithm 2 (denoted by “IPEG”) to some randomly generated minimization problems over difficult nonlinear constraints. All codes are available at http://www.escience.cn/people/changxiaokai/Codes.html. The following state-of-the-art algorithms are compared to investigate the computational efficiency of IPEG:

• Tseng’s forward-backward-forward splitting method, used as in (Proximal-extrapolated, Section 4) (denoted by “TFBF”), with $\beta=0.7,\theta=0.99$;

• Proximal extrapolated gradient method (Proximal-extrapolated, Algorithm 2) (denoted by “PEG”), with line search and $\alpha=0.41,\sigma=0.7$;

• Modified projected gradient method yang (denoted by “MPG”), with $\alpha=0.41,\delta=1.01$;

• FISTA Nesterov1983 with standard linesearch (denoted by “FISTA”), with $\beta=0.7,\lambda_{0}=1$.

We denote the seed of the random number generator by $seed$, so that the data can be regenerated in Python 3.8. All experiments are performed on a PC with an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz and 8GB of RAM, running a 64-bit Windows operating system.

Since solutions of (1) coincide with zeros of the residual function

[TABLE]

for some positive number $\lambda$, and since $r_{n}:=r(x_{n},y_{n})=\|x_{n+1}-y_{n}\|+\|x_{n}-y_{n}\|=0$ implies $x_{n+1}=x_{n}=y_{n}$, we use $r_{n}<\epsilon$ with $\epsilon=10^{-6}$ to terminate our algorithms; the same $\epsilon$ is used to terminate PEG, MPG, FB and FISTA. For TFBF in particular, we use

[TABLE]

as in Proximal-extrapolated .
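The stopping rule $r_{n}<\epsilon$ can be sketched in a few lines of Python. This is an illustrative sketch using plain lists for vectors; the helper names `norm`, `residual` and `should_stop` are our own and are not taken from the authors' code.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))

def residual(x_next, x, y):
    """r_n = ||x_{n+1} - y_n|| + ||x_n - y_n||."""
    return (norm([a - b for a, b in zip(x_next, y)])
            + norm([a - b for a, b in zip(x, y)]))

def should_stop(x_next, x, y, eps=1e-6):
    """True when the residual falls below the tolerance eps."""
    return residual(x_next, x, y) < eps
```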

We generate $\lambda_{0}$ as in M-extra : we choose $y_{-1}$ as a small perturbation of $y_{0}$ and take $\lambda_{0}=\frac{\|y_{-1}-y_{0}\|}{\|F(y_{-1})-F(y_{0})\|}$. This gives an approximation of the inverse of the local Lipschitz constant of $F$ at $y_{0}$. There are many possible choices of the sequence $\{\phi_{n}\}_{n\in{\mathbb{N}}}$; since a large range of $\lambda_{n}$ in the early iterations is beneficial for selecting a proper step size, we use

[TABLE]

for a given $\hat{n}\in{\mathbb{N}}$ . In this section, we fix $\hat{n}=500$ and $n_{0}=1000$ . For applying Correction step, we use $\gamma=0.7$ and (16) with $\zeta_{\min}=10^{-6}$ and $\mu=\nu=10$ .
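The initial step size estimate $\lambda_{0}=\frac{\|y_{-1}-y_{0}\|}{\|F(y_{-1})-F(y_{0})\|}$ described above can be sketched as follows. The perturbation size, the way $y_{-1}$ is built (adding a constant to every coordinate) and the fallback value for a zero denominator are all assumptions made for this sketch; the source only says that $y_{-1}$ is a small perturbation of $y_{0}$.

```python
import math

def initial_step(F, y0, perturbation=1e-4, fallback=1.0):
    """Estimate lambda_0 = ||y_{-1} - y_0|| / ||F(y_{-1}) - F(y_0)||.

    F            : callable mapping a list of floats to a list of floats
    perturbation : assumed size of the shift that defines y_{-1}
    fallback     : assumed value returned if F(y_{-1}) == F(y_0)
    """
    y_prev = [t + perturbation for t in y0]        # hypothetical y_{-1}
    dy = [a - b for a, b in zip(y_prev, y0)]
    dF = [a - b for a, b in zip(F(y_prev), F(y0))]
    num = math.sqrt(sum(t * t for t in dy))
    den = math.sqrt(sum(t * t for t in dF))
    return num / den if den > 0.0 else fallback
```

For a linear map such as $F(x)=2x$, whose Lipschitz constant is 2, the estimate is approximately $0.5$.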

We report the number of iterations (Iter), the number of proximal operator evaluations ( $\#$ prox), the number of evaluations of $F$ ( $\#F$ ) and the computing time (Time) measured in seconds. Since the number of iterations equals the number of proximal operator evaluations for PEG and IPEG, and is smaller by 2 than the number of evaluations of $F$ for IPEG, we report the number of iterations and the number of evaluations of $F$ for PEG, and only the number of iterations for IPEG. Bold entries indicate the best results in the following tables.

Problem 1

The first problem (called Sun’s problem) was considered in [9], [15] and yang , where the Lipschitz-continuous and monotone operator is generated by

[TABLE]

where

[TABLE]

and $H(x)=Ex+c.$ Here $E$ is an $m\times m$ square matrix defined by

[TABLE]

and $c=(-1,-1,\ldots,-1).$ We choose the feasible set $C$ as $C_{1}={\mathbb{R}}_{+}^{m}$ and $C_{2}=\{x\in{\mathbb{R}}^{m}_{+}~{}|~{}\sum_{i=1}^{m}x_{i}=m\}$ .
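When $g$ is the indicator function of the feasible set, the proximal step reduces to a Euclidean projection. The following Python sketch shows projections onto the two sets above: componentwise clipping for $C_{1}$, and the standard sort-based simplex projection (scaled so that the coordinates sum to $s$) for $C_{2}$ with $s=m$. This is a well-known technique that we substitute for illustration; the source does not state which projection routine the authors use, and the function names are hypothetical.

```python
def project_nonneg(v):
    """Projection onto the nonnegative orthant R_+^m."""
    return [max(x, 0.0) for x in v]

def project_scaled_simplex(v, s):
    """Projection onto {x >= 0, sum(x) = s} for s > 0 (sort-based method)."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - s) / j
        # The test holds exactly for the first rho sorted entries,
        # so the last accepted t is the correct shift theta.
        if uj - t > 0.0:
            theta = t
    return [max(x - theta, 0.0) for x in v]
```

For example, projecting $(3,1,-1)$ onto $C_{2}$ with $m=3$ gives $(2.5,0.5,0)$.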

For Problem 1, the initial point $x_{0}$ is generated uniformly randomly from $[-10,10]^{d}$ . For every $d=10^{3},10^{4},10^{5}$ and every $C$ above, the test results are listed in Table 1. In addition, we show the evolutions of $r_{n}$ and $\lambda_{n}$ with respect to Iter for solving Problem 1 with $C=C_{1}$ , $d=10^{3}$ in Fig. 3.

Problem 2

The second test problem is the so-called Kojima-Shindo Nonlinear Complementarity Problem (NCP), considered in [10] and P-G , where $m=4$ and the mapping $F$ is defined by

[TABLE]

The feasible set is $C=\{x\in{\mathbb{R}}^{4}_{+}~{}|~{}x_{1}+x_{2}+x_{3}+x_{4}=4\}$ and $g(x)=l_{C}(x)$ .

We choose three particular starting points: $(0,0,0,0)$, $(1,1,1,1)$ and $(0.5,0.5,2,1)$. The numerical results are reported in Table 2, and the evolutions of $r_{n}$ and $\lambda_{n}$ with respect to Iter for solving Problem 2 with $x_{0}=(1,1,1,1)$ are shown in Fig. 4.

Problem 3

The third problem is the HpHard problem, considered in yang ; Proximal-extrapolated . Let $F(x)=Mx+q$ with $M=NN^{T}+S+D$ and $q\in{\mathbb{R}}^{m}$, where $N$, $D$, $S\in{\mathbb{R}}^{m\times m}$ and $S$ is a skew-symmetric matrix. Every entry of $N$ and $S$ is uniformly generated from $(-5,5)$. The matrix $D$ is diagonal and each of its diagonal entries is uniformly generated from $(0,0.3)$. Every entry of $q$ is uniformly generated from $(-500,0)$. The feasible set is $C=\{x\in{\mathbb{R}}^{m}_{+}~{}|~{}\sum_{i=1}^{m}x_{i}=m\}$ and $g(x)=l_{C}(x)$.

For every $m$, as shown in Table 3, we have randomly generated two different pairs $M$ and $q$ with $seed=1$ and $2$. For all tests, we take $x_{0}=(1,1,\cdots,1)$. Since $F$ is an affine operator, the number of iterations is smaller by 2 than the number of evaluations of $F$ for PEG, so we report only the number of iterations.

Problem 4

The fourth example is a sparse logistic regression problem for binary classification. Let $(h_{i},l_{i})\in{\mathbb{R}}^{n}\times\{\pm 1\},i=1,\cdots,m$ be the training set, where $h_{i}\in{\mathbb{R}}^{n}$ is the feature vector of each data sample, and $l_{i}$ is the binary label. The formulation of sparse logistic regression reads

[TABLE]

where $\mu>0$ and is set to be $0.005\|H^{T}l\|_{\infty}$ in the numerical test.

Let $K_{ij}=-l_{i}h_{ij}$ and set $\hat{f}(y)=\sum^{m}_{i=1}\log(1+\exp(y_{i}))$. Then the objective in (61) is $\phi(x)=f(x)+g(x)$ with $g(x)=\mu\|x\|_{1}$ and $f(x)=\hat{f}(Kx)$. It is easy to derive that $L_{\nabla\hat{f}}=\frac{1}{4}$. Thus, $L_{\nabla f}=\frac{1}{4}\|K^{T}K\|$. We take three popular datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): w7a with $m=24692$, $n=300$; a9a with $m=32561$, $n=123$; and real-sim with $m=72309$, $n=20958$.
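The two ingredients of this splitting can be sketched in Python. The proximal operator of $t\|\cdot\|_{1}$ is the standard soft-thresholding map, and $f(x)=\hat{f}(Kx)$ is evaluated with a numerically stable form of $\log(1+e^{z})$. The function names are our own, and $K$ is represented as a list of rows for illustration only.

```python
import math

def soft_threshold(v, t):
    """prox of t*||.||_1: shrink each entry toward zero by t."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def logistic_loss(K, x):
    """f(x) = sum_i log(1 + exp((Kx)_i)), evaluated stably."""
    total = 0.0
    for row in K:
        z = sum(k * xi for k, xi in zip(row, x))
        # log(1 + e^z) = max(z, 0) + log(1 + e^{-|z|}) avoids overflow.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total

def objective(K, x, mu):
    """phi(x) = f(x) + mu * ||x||_1."""
    return logistic_loss(K, x) + mu * sum(abs(xi) for xi in x)
```

With these pieces, a proximal gradient step with step size $\lambda$ applies `soft_threshold` with threshold $\lambda\mu$ to the point obtained after the gradient step.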

Since $f$ is convex and $F=\nabla f$, we apply IPEG to (61) without the Correction step. To obtain more accurate solutions, we use $\epsilon=10^{-10}$ to terminate all the algorithms, and we take the smallest objective value among all methods as $\phi(x^{*})$. The results are shown in Table 4. To illustrate how the values $\phi(x_{n})-\phi(x^{*})$ and $r_{n}$ change over time, we give two convergence plots for the dataset “a9a” in Fig. 6.

To summarize our numerical experiments on Problems 1-4, we want to make some observations. Firstly, the advantage of IPEG in comparison with other algorithms is a larger interval for possible step size $\lambda_{n}$ , see Fig. 3(b), Fig. 4(b) and Fig. 5(b), which resulted from the proper choice of $\delta$ and the larger value of $\alpha$ .

Secondly, we observed that for the majority of the test problems, IPEG is more efficient than the other algorithms in both the number of iterations and the CPU time. Furthermore, IPEG with $\delta=0.73$ performs more efficiently than with $\delta=1.01$, as seen from the convergence plots of $r_{n}$ in Fig. 3(a), Fig. 4(a) and Fig. 5(a); this is largely due to the larger step size $\lambda_{n}$ and to the fact that only one evaluation of the mapping is required per iteration. Although a linesearch is involved in the Correction step, the required condition is so weak that the linesearch is not triggered for many problems.

In addition, since MPG yang adopts nonincreasing step sizes, it performs poorly when started in a region where $F$ has large curvature; see Fig. 3(b) and the results of MPG for Problem 1. From Fig. 5, the step sizes generated by IPEG fluctuate within a range during the first 500 iterations, after which the range shrinks because we use (59) with $\widehat{n}=500$ to control the increase of the step sizes.

6 Conclusions

Without knowledge of the Lipschitz constant, we have proposed a proximal extrapolated gradient method that uses a prediction-correction procedure to determine step sizes, and improved it numerically with nonmonotonic step sizes. The method extends the range of parameters (covering the case $\delta<1$) and obtains a larger step size than existing methods by using the correction step. Finally, a number of experiments illustrate that the proposed method is efficient and that the improvement results from the larger step size.

In addition, we have shown by an extremely simple example that our method is not convergent if $\lambda_{0},~{}\alpha\in]\frac{2}{2\delta+1},+\infty[$ for any $\delta>0$. As shown in Fig. 2, the convergence of the proposed method remains unknown for $(\delta,\alpha)$ in some regions. In particular, for $\delta\in]0,\frac{\sqrt{5}-1}{2}]$ it remains to be explored whether there is any (larger) $\alpha>0$ such that Algorithms 1 and 2 are convergent. Perhaps our method without the correction step is convergent as well, and can be generalized to other methods that need to estimate the Lipschitz constant. We leave this as an interesting topic for future research.

Acknowledgements.

The research of Xiaokai Chang was supported by the Hongliu Foundation of First-class Disciplines of Lanzhou University of Technology. The project was supported by the National Natural Science Foundation of China under Grant 61877046 and the Natural Science Basic Research Plan in Shaanxi Province of China (2017JM1014).

Bibliography

1. Antipin, A.S.: On a method for convex programs using a symmetrical modification of the Lagrange function. Ekonomika i Matematicheskie Metody, 12(6), 1164–1173 (1976)
2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Berlin, New York (2011)
3. Bertsekas, D.P., Gafni, E.M.: Projection methods for variational inequalities with applications to the traffic assignment problem. Math. Program. Study, 17, 139–159 (1982)
4. Burachik, R.S., Lopes, J.O., Svaiter, B.F.: An outer approximation method for the variational inequality problem. SIAM J. Control Optim. 43(6), 2071–2088 (2005)
5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
6. Bot, R.I., Csetnek, E.R.: Forward-backward and Tseng’s type penalty schemes for monotone inclusion problems. Set-Valued Var. Anal. 22, 313–331 (2014)
7. Bot, R.I., Csetnek, E.R.: An inertial forward-backward-forward primal-dual splitting algorithm for solving monotone inclusion problems. Numer. Algor. 71, 519–540 (2016)
8. Chang, X., Liu, S., Zhao, P., Li, X.: Convergent prediction-correction-based ADMM for multi-block separable convex programming. J. Comput. Appl. Math. 335, 270–288 (2018)
