An efficient adaptive accelerated inexact proximal point method for solving linearly constrained nonconvex composite problems

Weiwei Kong; Jefferson G. Melo; Renato D.C. Monteiro

arXiv:1812.06352·math.OC·December 9, 2019·Comput. Optim. Appl.

TL;DR

This paper introduces an adaptive accelerated inexact proximal point method for efficiently solving linearly constrained nonconvex composite optimization problems, improving upon previous methods with adaptive strategies and nonconvex subproblem handling.

Contribution

It develops a novel adaptive variant of the quadratic penalty accelerated inexact proximal point method that handles nonconvex subproblems more efficiently.

Findings

1. The proposed methods outperform existing approaches in numerical tests.

2. Adaptive stepsize adjustment improves convergence speed.

3. The methods effectively solve large-scale nonconvex constrained problems.

Abstract

This paper proposes an efficient adaptive variant of a quadratic penalty accelerated inexact proximal point (QP-AIPP) method proposed earlier by the authors. Both the QP-AIPP method and its variant solve linearly set constrained nonconvex composite optimization problems using a quadratic penalty approach where the generated penalized subproblems are solved by a variant of the underlying AIPP method. The variant, in turn, solves a given penalized subproblem by generating a sequence of proximal subproblems which are then solved by an accelerated composite gradient algorithm. The main difference between AIPP and its variant is that the proximal subproblems in the former are always convex while the ones in the latter are not necessarily convex due to the fact that their prox parameters are chosen as aggressively as possible so as to improve efficiency. The possibly nonconvex proximal…

Equations

$$\min\left\{f(z)+h(z):Az=b,\ z\in\Re^{n}\right\},$$

$$f(u)\geq f(z)+\left\langle\nabla f(z),u-z\right\rangle-\frac{m}{2}\|u-z\|^{2}\quad\forall z,u\in\mathrm{dom}\,h.$$

$$\min\left\{f(z)+h(z)+\frac{c}{2}\|Az-b\|^{2}:z\in\Re^{n}\right\},$$

$$\phi_{*}:=\min\left\{\phi(z):=g(z)+h(z):z\in\Re^{n}\right\}$$

$$-\frac{m}{2}\|u-z\|^{2}\leq g(u)-\left[g(z)+\left\langle\nabla g(z),u-z\right\rangle\right]\leq\frac{M}{2}\|u-z\|^{2}\quad\forall z,u\in\mathrm{dom}\,h,$$

$$\min\left\{g(z)+h(z)+\frac{1}{2\lambda_{k}}\|z-z_{k-1}\|^{2}:z\in\Re^{n}\right\}$$

$$\ell_{\psi}(z;\bar{z}):=\psi(\bar{z})+\left\langle\nabla\psi(\bar{z}),z-\bar{z}\right\rangle\quad\forall z\in\Re^{n}.$$

$$\partial_{\varepsilon}\psi(z):=\left\{v\in\Re^{n}:\psi(u)\geq\psi(z)+\left\langle v,u-z\right\rangle-\varepsilon,\ \forall u\in\Re^{n}\right\}.$$

$$N_{X}(x):=\left\{u\in\Re^{n}:\left\langle u,x^{\prime}-x\right\rangle\leq 0,\ \forall x^{\prime}\in X\right\}=\partial\delta_{X}(x).$$

$$\|\nabla g(u)-\nabla g(z)\|\leq M\|u-z\|\quad\forall u,z\in\mathrm{dom}\,h;$$

$$\underline{m}:=\inf\left\{m\in\Re_{++}:g(u)\geq\ell_{g}(u;z)-\frac{m}{2}\|u-z\|^{2}\quad\forall u,z\in\mathrm{dom}\,h\right\},$$

$$\hat{v}\in\nabla g(\hat{z})+\partial h(\hat{z}),\qquad\|\hat{v}\|\leq\hat{\rho}.$$

$$M_{\lambda}:=\lambda M+1,\qquad f_{\lambda}:=\lambda g+\frac{1}{2}\|\cdot-z^{-}\|^{2}-\left\langle v,\cdot\right\rangle,\qquad h_{\lambda}:=\lambda h;$$

$$\hat{z}:=\operatorname*{argmin}_{u}\left\{\left\langle\nabla f_{\lambda}(z),u-z\right\rangle+\frac{M_{\lambda}}{2}\|u-z\|^{2}+h_{\lambda}(u)\right\},$$

$$\hat{v}:=\frac{1}{\lambda}\left[(v+z^{-}-z)+M_{\lambda}(z-\hat{z})\right]+\nabla g(\hat{z})-\nabla g(z),$$

$$\Delta:=(f_{\lambda}+h_{\lambda})(z)-(f_{\lambda}+h_{\lambda})(\hat{z});$$

$$\hat{v}\in\nabla g(\hat{z})+\partial h(\hat{z}),\qquad\lambda\|\hat{v}\|\leq\|v+z^{-}-z\|+2\sqrt{2M_{\lambda}\Delta}$$

$$(\hat{z}_{k},\hat{v}_{k},\Delta_{k})=RP(\lambda_{k},z_{k-1},z_{k},v_{k})$$

$$\|v_{k}+z_{k-1}-z_{k}\|^{2}\leq\theta\lambda_{k}\left[\phi(z_{k-1})-\phi(z_{k})\right],$$

$$2(\lambda_{k}M+1)\Delta_{k}\leq\tau\|v_{k}+z_{k-1}-z_{k}\|^{2};$$

$$\hat{v}_{k}\in\nabla g(\hat{z}_{k})+\partial h(\hat{z}_{k}),\qquad\min_{i\leq k}\|\hat{v}_{i}\|^{2}\leq\theta(1+2\sqrt{\tau})^{2}\,\frac{\phi(z_{0})-\phi_{*}}{\Lambda_{k}},$$

$$\phi(z_{0})-\phi_{*}\geq\sum_{i=1}^{k}\left[\phi(z_{i-1})-\phi(z_{i})\right]\geq\sum_{i=1}^{k}\frac{\|v_{i}+z_{i-1}-z_{i}\|^{2}}{\theta\lambda_{i}}\geq\frac{\Lambda_{k}}{\theta}\min_{i\leq k}\frac{1}{\lambda_{i}^{2}}\|v_{i}+z_{i-1}-z_{i}\|^{2}.$$

$$\|\hat{v}_{i}\|\leq(1+2\sqrt{\tau})\frac{\|v_{i}+z_{i-1}-z_{i}\|}{\lambda_{i}}.$$

$$v_{k}\in\partial_{\varepsilon_{k}}\left(\lambda_{k}\phi+\frac{1}{2}\|\cdot-z_{k-1}\|^{2}\right)(z_{k}),\qquad\|v_{k}\|^{2}+2\varepsilon_{k}\leq\sigma\|v_{k}+z_{k-1}-z_{k}\|^{2},$$

$$v\in\partial_{\varepsilon}\left(\lambda\phi+\frac{1}{2}\|\cdot-z^{-}\|^{2}\right)(z).$$

$$\lambda\phi(z^{\prime})+\frac{1}{2}\|z^{\prime}-z^{-}\|^{2}\geq\lambda\phi(z)+\frac{1}{2}\|z-z^{-}\|^{2}+\left\langle v,z^{\prime}-z\right\rangle-\varepsilon\quad\forall z^{\prime}\in\Re^{n}.$$

$$\varepsilon\geq\left[\lambda\phi(z)+\frac{1}{2}\|z-z^{-}\|^{2}-\left\langle v,z\right\rangle\right]-\left[\lambda\phi(\hat{z})+\frac{1}{2}\|\hat{z}-z^{-}\|^{2}-\left\langle v,\hat{z}\right\rangle\right]=\Delta,$$

$$\theta\geq\frac{2}{1-\sigma},\qquad\tau\geq\sup\left\{\sigma(\lambda_{k}M+1):k\geq 1\right\}.$$

$$\widetilde{\phi}^{(s)}(u)\leq\ell_{\widetilde{\phi}^{(s)}}(u;x)+\frac{\widetilde{M}}{2}\|u-x\|^{2}\quad\forall u,x\in\mathrm{dom}\,\widetilde{\phi}^{(n)}.$$

$$\|x_{0}-x+u\|^{2}\leq\theta\left[\widetilde{\phi}(x_{0})-\widetilde{\phi}(x)\right],$$

$$u\in\partial_{\eta}\left(\widetilde{\phi}+\frac{1}{2}\|\cdot-x_{0}\|^{2}\right)(x),\qquad 2(\widetilde{M}+1)\eta\leq\tau\|x_{0}-x+u\|^{2}.$$

$$\|z_{k-1}-z_{k}+v_{k}\|^{2}\leq\theta\left[\widetilde{\phi}(z_{k-1})-\widetilde{\phi}(z_{k})\right]=\theta\lambda_{k}\left[\phi(z_{k-1})-\phi(z_{k})\right]$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Optimization Algorithms Research · Optimization and Variational Analysis

Full text

Weiwei Kong, Renato D.C. Monteiro: School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205 (email: [email protected] & [email protected]). The works of these authors were partially supported by ONR Grant N00014-18-1-2077.

Jefferson G. Melo: Institute of Mathematics and Statistics, Federal University of Goias, Campus II- Caixa Postal 131, CEP 74001-970, Goiânia-GO, Brazil (email: [email protected]). The work of this author was supported in part by CNPq Grant 406975/2016-7.

(March 17, 2024)

Abstract

This paper proposes an efficient adaptive variant of a quadratic penalty accelerated inexact proximal point (QP-AIPP) method proposed earlier by the authors. Both the QP-AIPP method and its variant solve linearly set constrained nonconvex composite optimization problems using a quadratic penalty approach where the generated penalized subproblems are solved by a variant of the underlying AIPP method. The variant, in turn, solves a given penalized subproblem by generating a sequence of proximal subproblems which are then solved by an accelerated composite gradient algorithm. The main difference between AIPP and its variant is that the proximal subproblems in the former are always convex while the ones in the latter are not necessarily convex due to the fact that their prox parameters are chosen as aggressively as possible so as to improve efficiency. The possibly nonconvex proximal subproblems generated by the AIPP variant are also tentatively solved by a novel adaptive accelerated composite gradient algorithm based on the validity of some key convergence inequalities. As a result, the variant generates a sequence of proximal subproblems where the stepsizes are adaptively changed according to the responses obtained from the calls to the accelerated composite gradient algorithm. Finally, numerical results are given to demonstrate the efficiency of the proposed AIPP and QP-AIPP variants.

2000 Mathematics Subject Classification: 47J22, 90C26, 90C30, 90C60, 65K10.

Key words: quadratic penalty method, nonconvex program, iteration-complexity, proximal point method, first-order accelerated methods.

1 Introduction

This paper presents a computationally efficient variant of the quadratic penalty accelerated inexact proximal point (QP-AIPP) method studied in WJRproxmet1 .

The QP-AIPP method of WJRproxmet1 is designed for solving the linearly–constrained nonconvex composite optimization problem

$$\min\left\{f(z)+h(z):Az=b,\ z\in\Re^{n}\right\},\tag{1}$$

where $A:\Re^{n}\mapsto\Re^{p}$ is a linear operator, $b\in\Re^{p}$ , $h:\Re^{n}\to(-\infty,\infty]$ is a closed proper convex function, and $f$ is a real-valued differentiable (possibly nonconvex) function whose gradient is $L$ –Lipschitz and which, for some $0<m\leq L$ , satisfies

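$$f(u)\geq f(z)+\left\langle\nabla f(z),u-z\right\rangle-\frac{m}{2}\|u-z\|^{2}\quad\forall z,u\in\mathrm{dom}\,h.\tag{2}$$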

The QP-AIPP method solves (1) via a quadratic penalty method, i.e., a sequence of penalty subproblems of the form

$$\min\left\{f(z)+h(z)+\frac{c}{2}\|Az-b\|^{2}:z\in\Re^{n}\right\},\tag{3}$$

for an increasing sequence of positive penalty parameters $c$, is solved by the accelerated inexact proximal point (AIPP) method (discussed below), with every penalty subproblem started from a common point $z_{0}\in\mathrm{dom}\,h$ (i.e., a cold-start strategy is adopted).

We briefly outline the AIPP method of WJRproxmet1 . First, note that (3) is a special case of

$$\phi_{*}:=\min\left\{\phi(z):=g(z)+h(z):z\in\Re^{n}\right\}\tag{4}$$

where $g(z):=f(z)+c\|Az-b\|^{2}/2$ is a function satisfying

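$$-\frac{m}{2}\|u-z\|^{2}\leq g(u)-\left[g(z)+\left\langle\nabla g(z),u-z\right\rangle\right]\leq\frac{M}{2}\|u-z\|^{2}\quad\forall z,u\in\mathrm{dom}\,h,\tag{5}$$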

where $M=L+c\|A\|^{2}$ . In the general setting of (4)–(5), the AIPP method generates a sequence $\{z_{k}\}$ using an inexact proximal point (IPP) framework (see for example Rock:ppa ; hpe_svaiter99 ), i.e., given $z_{k-1}\in\mathrm{dom}\,h$ , it computes $z_{k}$ as a suitable approximate solution of the proximal subproblem

$$\min\left\{g(z)+h(z)+\frac{1}{2\lambda_{k}}\|z-z_{k-1}\|^{2}:z\in\Re^{n}\right\}\tag{6}$$

for some prox-parameter $\lambda_{k}>0$ . Note that the first inequality in (5) implies that the objective function of (6) is convex as long as $\lambda_{k}$ is not larger than $1/m$ . The AIPP method sets $\lambda_{k}=1/(2m)$ for every $k$ and uses an accelerated composite gradient (ACG) variant (see for example beck2009fast ; MontSvaiter_fista ; Nesterov1983 ) to approximately solve (6).
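The convexity claim above can be checked directly. The following is a short sketch that uses only the first inequality in (5) and the fact that $\mathrm{dom}\,h$ is convex (because $h$ is convex). That inequality is equivalent to convexity of $g+\frac{m}{2}\|\cdot\|^{2}$ on $\mathrm{dom}\,h$, and expanding the prox term gives

$$g(z)+\frac{1}{2\lambda_{k}}\|z-z_{k-1}\|^{2}=\left[g(z)+\frac{m}{2}\|z\|^{2}\right]+\frac{1}{2}\left(\frac{1}{\lambda_{k}}-m\right)\|z\|^{2}-\frac{1}{\lambda_{k}}\left\langle z,z_{k-1}\right\rangle+\frac{1}{2\lambda_{k}}\|z_{k-1}\|^{2}.$$

The bracketed term is convex, the last two terms are affine in $z$, and the remaining quadratic term is convex exactly when $1/\lambda_{k}\geq m$. With the choice $\lambda_{k}=1/(2m)$ its coefficient is $m/2>0$, so the objective of (6) is in fact strongly convex.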

Since the larger $\lambda_{k}$ is, the faster the above IPP framework converges to a desirable approximate solution, the main goal of this paper is to develop an aggressive AIPP variant, and subsequently an aggressive QP-AIPP variant, which possibly chooses $\lambda_{k}$ substantially larger than $1/m$ despite the potential loss of convexity of (6). An important ingredient in obtaining this aggressive AIPP variant is the development of a relaxed ACG (R-ACG) algorithm that approximately solves (6) according to a more relaxed termination criterion. More specifically, within a reasonable number of iterations, the algorithm: (i) either solves the possibly nonconvex subproblem (6) according to the relaxed criterion or stops with failure due to $\lambda_{k}$ being too large; and (ii) always solves (6) according to the relaxed criterion when its objective function is convex. The aforementioned relaxed AIPP (R-AIPP) variant starts with a relatively large initial prox parameter and, in each one of its steps, calls the R-ACG algorithm to solve the corresponding prox subproblem. If a key descent inequality fails, then the prox parameter $\lambda_{k}$ is halved, the prox center $z_{k-1}$ is maintained, and the R-ACG algorithm is invoked once again to solve the resulting prox subproblem; otherwise, the prox parameter $\lambda_{k}$ is preserved and $z_{k}$ takes the place of $z_{k-1}$.
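The halving strategy described above can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption made for illustration: $h\equiv 0$, $g(z)=-\cos z$ (so $m=M=1$), and the inner solver is a plain gradient method with a fixed budget, which is only a stand-in and not the R-ACG algorithm of the paper. The function names are hypothetical. The descent test has the form $\|v+z_{k-1}-z_{k}\|^{2}\leq\theta\lambda_{k}[\phi(z_{k-1})-\phi(z_{k})]$, where $v$ is the gradient residual of the prox objective at the computed point.

```python
import math

M = 1.0  # Lipschitz constant of the gradient of g(z) = -cos(z)

def g(z):
    # illustrative smooth nonconvex function (assumption, not from the paper)
    return -math.cos(z)

def grad_g(z):
    return math.sin(z)

def inexact_prox_step(lam, center, inner_steps):
    """Stand-in inner solver (NOT R-ACG): a few gradient steps on
    P(u) = lam*g(u) + 0.5*(u - center)**2, started at the prox center."""
    u = center
    step = 1.0 / (lam * M + 1.0)
    for _ in range(inner_steps):
        u -= step * (lam * grad_g(u) + (u - center))
    v = lam * grad_g(u) + (u - center)  # gradient residual of P at u
    return u, v

def adaptive_prox_loop(z0, lam0, theta=4.0, rho=1e-6, inner_steps=1, max_calls=500):
    """Halve lam when the descent test fails; otherwise accept the new point."""
    z, lam, halvings = z0, lam0, 0
    history = [(lam, z, g(z))]
    for _ in range(max_calls):
        if abs(grad_g(z)) <= rho:
            break
        z_new, v = inexact_prox_step(lam, z, inner_steps)
        lhs = (v + z - z_new) ** 2
        rhs = theta * lam * (g(z) - g(z_new))
        if lhs <= rhs:
            z = z_new                      # keep lam, move the prox center
            history.append((lam, z, g(z)))
        else:
            lam *= 0.5                     # keep the prox center, shrink lam
            halvings += 1
    return z, lam, halvings, history
```

Calling `adaptive_prox_loop(1.0, 256.0)` starts from a deliberately large prox parameter; the first trial point fails the descent test, so the parameter is halved before any point is accepted, and the objective values along accepted points are nonincreasing by construction.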

This paper also considers a more general version of (1) in which the linear constraint $Az=b$ is replaced by the linear set constraint $Az\in S$ , where $S\subseteq\Re^{p}$ is a closed convex set. Clearly, when $S=\{b\}$ , the more general problem reduces to (1). Under the assumption that $\mathrm{dom}\,h$ is bounded and all penalty subproblems are solved by the AIPP variant using the aforementioned cold–start strategy, it turns out that the iteration complexity of the QP-AIPP variant for finding the desired approximate solution is considerably worse than that of the QP-AIPP method of WJRproxmet1 . If, on the other hand, the QP-AIPP variant adopts the warm–start strategy in which the R-AIPP method for solving the current penalty subproblem starts from the approximate solution found for the previous subproblem, then the iteration complexity of this relaxed QP-AIPP (R-QP-AIPP) variant is shown to be the same as that of the QP-AIPP method of WJRproxmet1 , up to a logarithmic factor.

The proposed AIPP and QP-AIPP variants are compared with three state-of-the-art optimization methods on five different optimization problems. The computational results obtained show that the variants can substantially outperform most of the competing methods on many problem instances.

Related works. We first discuss papers dealing with related algorithms for solving the convex version of (1) and other related monotone problems. Iteration-complexity analysis of quadratic penalty methods for solving (1) under the assumption that $f$ is convex and $h$ is a convex indicator function was first studied in LanRen2013PenMet and further explored in Aybatpenalty ; IterComplConicprog . Iteration-complexity of first-order augmented Lagrangian methods for solving the latter class of linearly constrained convex programs was studied in AybatAugLag ; LanMonteiroAugLag ; ShiqiaMaAugLag16 ; zhaosongAugLag18 ; Patrascu2017 ; YangyangAugLag17 . Inexact proximal point methods using accelerated gradient algorithms to solve their prox-subproblems were previously considered in GlanPDaccel2014 ; YHe2 ; YheMoneiroNash ; OliverMonteiro ; MonteiroSvaiterAcceleration in the setting of convex-concave saddle point problems and monotone variational inequalities.

We now discuss papers dealing with related algorithms for solving (1) when $f$ is nonconvex and the assumptions mentioned after (1) hold. Paper WJRproxmet1 is, to the best of our knowledge, the first one to consider a proximal method with an acceleration strategy for solving (1). Previous works using acceleration strategies were concerned with the unconstrained problem (4). Namely, nonconv_lan16 proposed an accelerated gradient framework to solve (4) with better iteration complexity than the usual composite gradient method. Since then, many authors have proposed other accelerated frameworks for solving (4) under different assumptions on the functions $g$ and $h$ (see, for example, Aaronetal2017; Paquette2017; Ghadimi2019; Li_Lin2015; CatalystNC). In particular, by exploiting the lower curvature $m$, Aaronetal2017; Paquette2017; CatalystNC proposed some algorithms which improve the iteration-complexity bound of nonconv_lan16 in terms of the dependence on the upper curvature $M$. Finally, there has been a growing interest in the iteration complexity of methods for solving optimization problems using second order information (see, for example, Aaronetal2017; MonteiroSvaiterNewton; NesterovSec_ord; CartToint).

Organization of the paper. Subsection 1.1 provides some basic definitions and notation. Section 2 begins with presenting some background materials and transitions into defining a general descent (GD) framework for solving the nonconvex optimization problem (4). Section 3 presents and derives the complexity of an R-ACG algorithm which attempts to solve (6) even when it is not convex. Section 4 presents a relaxed variant of the AIPP method proposed in WJRproxmet1 . Section 5 presents a relaxed variant of the QP-AIPP method proposed in WJRproxmet1 . Section 6 presents numerical results to illustrate the efficiency of the AIPP and QP-AIPP variants. Finally, Section 7 presents some concluding remarks.

1.1 Basic definitions and notation

This subsection provides some basic definitions and notation used in this paper.

The set of natural numbers is denoted by $\mathbb{N}$. The set of real numbers is denoted by $\Re$. The set of non-negative real numbers and the set of positive real numbers are denoted by $\Re_{+}$ and $\Re_{++}$, respectively. Let $\Re^{n}$ denote an $n$-dimensional real inner product space, whose inner product and associated induced norm are denoted by $\left\langle\cdot,\cdot\right\rangle$ and $\|\cdot\|$, respectively. Let $\left\langle\cdot,\cdot\right\rangle_{F}$ denote the Frobenius inner product. Let $S_{+}^{n}$ denote the cone of positive semidefinite $n$-by-$n$ matrices. For $t>0$, define $\log^{+}_{1}(t):=\max\{\log t,1\}$. The set of proper lower semi-continuous convex functions defined on $\Re^{n}$ is denoted by $\overline{\text{Conv}}(\Re^{n})$. Given a linear operator $A:\Re^{n}\mapsto\Re^{p}$, the operator norm of $A$ is denoted by $\|A\|:=\sup\{\|Az\|/\|z\|:z\in\Re^{n},z\neq 0\}$.

Let $\psi:\Re^{n}\rightarrow(-\infty,+\infty]$ be given. The effective domain of $\psi$ is denoted by $\mathrm{dom}\,\psi:=\{x\in\Re^{n}:\psi(x)<\infty\}$ and $\psi$ is proper if $\mathrm{dom}\,\psi\neq\emptyset$ . If $\psi$ is differentiable at $\bar{z}\in\Re^{n}$ , then its affine approximation $\ell_{\psi}(\cdot;\bar{z})$ at $\bar{z}$ is denoted by

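$$\ell_{\psi}(z;\bar{z}):=\psi(\bar{z})+\left\langle\nabla\psi(\bar{z}),z-\bar{z}\right\rangle\quad\forall z\in\Re^{n}.\tag{7}$$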

Also, for $\varepsilon\geq 0$ , its $\varepsilon$ -subdifferential at $z\in\mathrm{dom}\,\psi$ is denoted by

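$$\partial_{\varepsilon}\psi(z):=\left\{v\in\Re^{n}:\psi(u)\geq\psi(z)+\left\langle v,u-z\right\rangle-\varepsilon,\ \forall u\in\Re^{n}\right\}.\tag{8}$$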

The subdifferential of $\psi$ at $z\in\mathrm{dom}\,\psi$ , denoted by $\partial\psi(z)$ , corresponds to $\partial_{0}\psi(z)$ .

For a given $X\subseteq\Re^{n}$, the closure of the set $X$ is denoted by $\mathrm{cl}\,X$. The indicator function of $X$, denoted by $\delta_{X}$, is defined as $\delta_{X}(x)=0$ if $x\in X$ and $\delta_{X}(x)=\infty$ if $x\notin X$. Moreover, the normal cone of $X$ at a point $x\in X$ is denoted by

$$N_{X}(x):=\left\{u\in\Re^{n}:\left\langle u,x^{\prime}-x\right\rangle\leq 0,\ \forall x^{\prime}\in X\right\}=\partial\delta_{X}(x).$$

2 A general descent framework

As discussed in Section 1, all the penalized subproblems (see (3)) that arise during the execution of the QP-AIPP method, as well as the R-QP-AIPP method, are of the form (4). Hence, efficiently obtaining a solution of (4) is of paramount importance for both the QP-AIPP and R-QP-AIPP methods. While the QP-AIPP method uses the AIPP method to solve (4), the R-QP-AIPP method uses the R-AIPP method, which will be discussed in Section 4. The discussion of this section (as well as Section 3) will essentially pave the way towards the presentation of the R-AIPP method.

More specifically, this section presents and analyzes a GD framework for solving (4) that makes use of a black box (see step 1 of the GD framework below). In addition, it describes: the assumptions and relevant quantities underlying problem (4), the notion of approximate stationary point of (4) adopted in this section and Section 4, and the relationship between the GD framework and the GIPP framework of WJRproxmet1, of which the AIPP method is an instance.

Our problem of interest in this section and Section 4 is (4) which is assumed to satisfy the following assumptions:

(A1)

$h\in\overline{\text{Conv}}(\Re^{n})$ ;

(A2)

$g$ is a nonconvex differentiable function on $\mathrm{dom}\,h$ and there exists a scalar $M>0$ such that

$$\|\nabla g(u)-\nabla g(z)\|\leq M\|u-z\|\quad\forall u,z\in\mathrm{dom}\,h;\tag{9}$$

(A3)

$\phi_{*}>-\infty$ .

In addition, the analysis in Section 4 makes use of the quantity

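$$\underline{m}:=\inf\left\{m\in\Re_{++}:g(u)\geq\ell_{g}(u;z)-\frac{m}{2}\|u-z\|^{2}\quad\forall u,z\in\mathrm{dom}\,h\right\},\tag{10}$$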

which is positive in view of assumption (A2). While it is generally difficult to compute the above quantity, it is well known that assumption (A2) implies that $\underline{m}\in(0,M]$. Moreover, it is shown in Proposition 6 below that the smaller $\underline{m}$ is, the better the iteration complexity of the R-AIPP method in Section 4 becomes.

It is well-known that a necessary condition for $z^{*}\in\mathrm{dom}\,h$ to be a local minimum of (4) is that $z^{*}$ be a stationary point of $\phi$ , i.e., $0\in\nabla g(z^{*})+\partial h(z^{*})$ . A relaxation of this inclusion leads to the following definition of an approximate stationary point of (4): given a tolerance $\hat{\rho}>0$ , a pair $(\hat{z},\hat{v})$ is said to be a $\hat{\rho}$ –approximate stationary point of (4) if

$$\hat{v}\in\nabla g(\hat{z})+\partial h(\hat{z}),\qquad\|\hat{v}\|\leq\hat{\rho}.\tag{11}$$

Given a general quadruple $(\lambda,z^{-},z,v)\in\Re_{++}\times\Re^{n}\times\mathrm{dom}\,h\times\Re^{n}$ , the following simple refinement procedure shows how to obtain a pair $(\hat{z},\hat{v})$ satisfying the inclusion in (11) with a technically useful bound on the residual $\hat{v}$ (see Proposition 1 below).

Refinement procedure.

Input: a scalar $M>0$ , a pair of functions $(g,h)$ satisfying assumptions (A1) and (A2), and a quadruple $(\lambda,z^{-},z,v)\in\Re_{++}\times\Re^{n}\times\mathrm{dom}\,h\times\Re^{n}$ ;

Output: a triple $(\hat{z},\hat{v},\Delta)\in\mathrm{dom}\,h\times\Re^{n}\times\Re_{+}$ satisfying (16);

(0)

set

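$$M_{\lambda}:=\lambda M+1,\qquad f_{\lambda}:=\lambda g+\frac{1}{2}\|\cdot-z^{-}\|^{2}-\left\langle v,\cdot\right\rangle,\qquad h_{\lambda}:=\lambda h;\tag{12}$$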

(1)

compute

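$$\hat{z}:=\operatorname*{argmin}_{u}\left\{\left\langle\nabla f_{\lambda}(z),u-z\right\rangle+\frac{M_{\lambda}}{2}\|u-z\|^{2}+h_{\lambda}(u)\right\},\tag{13}$$

$$\hat{v}:=\frac{1}{\lambda}\left[(v+z^{-}-z)+M_{\lambda}(z-\hat{z})\right]+\nabla g(\hat{z})-\nabla g(z),\tag{14}$$

$$\Delta:=(f_{\lambda}+h_{\lambda})(z)-(f_{\lambda}+h_{\lambda})(\hat{z});\tag{15}$$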

(2)

return the triple $(\hat{z},\hat{v},\Delta)$ .

For the sake of brevity, we write $(\hat{z},\hat{v},\Delta)=RP(\lambda,z^{-},z,v)$ to indicate that the triple $(\hat{z},\hat{v},\Delta)$ is the output of the above refinement procedure with inputs $M$ , $(g,h)$ , and $(\lambda,z^{-},z,v)$ . We now state an important property of this procedure, whose proof can be found in Appendix A.

Proposition 1

Let a pair of functions $(g,h)$ satisfying (A1)–(A3) and a quadruple $(\lambda,z^{-},z,v)\in\Re_{++}\times\Re^{n}\times\mathrm{dom}\,h\times\Re^{n}$ be given and let $(\hat{z},\hat{v},\Delta)=RP(\lambda,z^{-},z,v)$ . Then, $\Delta\geq 0$ and

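$$\hat{v}\in\nabla g(\hat{z})+\partial h(\hat{z}),\qquad\lambda\|\hat{v}\|\leq\|v+z^{-}-z\|+2\sqrt{2M_{\lambda}\Delta}\tag{16}$$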

where $M_{\lambda}$ is as in (12).
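The refinement procedure can be sketched in a few lines for one concrete choice of $h$. The sketch below assumes that $h$ is the indicator function of a box $[\mathrm{lo},\mathrm{hi}]^{n}$, so that the composite gradient step reduces to a projected gradient step, and that the input point `z` lies in the box. The function name `refine`, its argument order, and the example function $g(z)=-\sum_{i}\cos z_{i}$ (for which $M=1$) are illustrative assumptions and are not taken from the paper.

```python
import math

def refine(lam, z_prev, z, v, g, grad_g, M, lo, hi):
    """Sketch of RP(lam, z_prev, z, v) when h is the indicator of [lo, hi]^n.

    z is assumed to lie in the box. Returns (z_hat, v_hat, delta)."""
    n = len(z)
    M_lam = lam * M + 1.0

    def f_lam(x):
        # f_lam = lam*g + 0.5*||. - z_prev||^2 - <v, .>
        return (lam * g(x)
                + 0.5 * sum((x[i] - z_prev[i]) ** 2 for i in range(n))
                - sum(v[i] * x[i] for i in range(n)))

    gz = grad_g(z)
    grad_f = [lam * gz[i] + (z[i] - z_prev[i]) - v[i] for i in range(n)]
    # composite gradient step; for a box the prox of h_lam is a projection
    z_hat = [min(hi, max(lo, z[i] - grad_f[i] / M_lam)) for i in range(n)]
    g_hat = grad_g(z_hat)
    v_hat = [((v[i] + z_prev[i] - z[i]) + M_lam * (z[i] - z_hat[i])) / lam
             + g_hat[i] - gz[i] for i in range(n)]
    # h_lam vanishes at both z and z_hat because both points lie in the box
    delta = f_lam(z) - f_lam(z_hat)
    return z_hat, v_hat, delta

def g_example(x):
    # illustrative nonconvex function with M = 1 (assumption)
    return -sum(math.cos(t) for t in x)

def grad_g_example(x):
    return [math.sin(t) for t in x]
```

For inputs such as `refine(2.0, [1.0, -0.5], [0.8, -0.2], [0.1, -0.3], g_example, grad_g_example, 1.0, -1.0, 1.0)`, the returned `delta` should be nonnegative and `lam` times the norm of `v_hat` should not exceed $\|v+z^{-}-z\|+2\sqrt{2M_{\lambda}\Delta}$, in line with Proposition 1.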

The above proposition shows that the pair $(\hat{z},\hat{v})$ , computed as in (13) and (14), clearly satisfies the inclusion in (11) and that the quantity $\lambda\|\hat{v}\|$ has an upper bound expressed in terms of the two quantities: $\|v+z^{-}-z\|$ and $\sqrt{M_{\lambda}\Delta}$ . Given a tolerance $\hat{\rho}>0$ , it will be shown in Proposition 2 below that the GD framework stated next generates a sequence of iterates $\{(\lambda_{k},z_{k},v_{k})\}$ whose corresponding refined sequence $\{(\hat{z}_{k},\hat{v}_{k})\}$ obtained as $(\hat{z}_{k},\hat{v}_{k})=RP(\lambda_{k},z_{k-1},z_{k},v_{k})$ for every $k\geq 1$ yields a $\hat{\rho}$ –approximate stationary point of (4).

GD framework.

Input: a scalar $M>0$ , a function pair $(g,h)$ satisfying assumptions (A1)–(A3), an initial point $z_{0}\in\mathrm{dom}\,h$ , and a scalar pair $(\theta,\tau)\in\Re_{++}^{2}$ ;

(0)

set $\phi:=g+h$ and $k=1$ ;

(1)

find a triple $(\lambda_{k},z_{k},v_{k})\in\Re_{++}\times\mathrm{dom}\,h\times\Re^{n}$ such that its corresponding refined triple

$$(\hat{z}_{k},\hat{v}_{k},\Delta_{k})=RP(\lambda_{k},z_{k-1},z_{k},v_{k})$$

satisfies

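$$\|v_{k}+z_{k-1}-z_{k}\|^{2}\leq\theta\lambda_{k}\left[\phi(z_{k-1})-\phi(z_{k})\right],\tag{17}$$

$$2(\lambda_{k}M+1)\Delta_{k}\leq\tau\|v_{k}+z_{k-1}-z_{k}\|^{2};\tag{18}$$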

(2)

set $k=k+1$ and go to step 1.

We now make four remarks about the GD framework. First, no termination criterion is added to the GD framework so as to be able to discuss convergence rate results about its generated sequence. A discussion of how to terminate it is given after Proposition 2 below. Second, step 1 should be viewed as an oracle in that it does not specify how to compute the triple $(\lambda_{k},z_{k},v_{k})$. Third, Corollary 1 below shows that if the stepsize $\lambda_{k}$ is chosen so that the prox subproblem (6) is a strongly convex composite problem, i.e., $\lambda_{k}\in(0,1/\underline{m})$ where $\underline{m}$ is as in (10), the point $z_{k}$ is chosen as its unique optimal solution, and $v_{k}$ is set to zero, then the triple $(\lambda_{k},z_{k},v_{k})$ satisfies (17) and (18) with $\theta=2$ and $\tau=0$. Thus, when $(\theta,\tau)\in[2,\infty)\times[0,\infty)$, we conclude that: (i) there always exists a triple satisfying (17) and (18); and (ii) the GD framework can be viewed as an IPP method. Fourth, the R-AIPP method of Section 4, being a special instance of the GD framework, can also be viewed as an IPP method which chooses $(\theta,\tau)$ in the open rectangle $(2,\infty)\times(0,\infty)$ and applies an ACG variant, such as the one described in Section 3, to problem (6) in order to obtain a triple $(\lambda_{k},z_{k},v_{k})$ satisfying (17) and (18).

The following result shows an important property about the sequence of iterates $\{(\lambda_{k},\hat{z}_{k},\hat{v}_{k})\}$ .

Proposition 2

The sequences of stepsizes $\{\lambda_{k}\}$ and iterate pairs $\{(\hat{z}_{k},\hat{v}_{k})\}$ satisfy

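$$\hat{v}_{k}\in\nabla g(\hat{z}_{k})+\partial h(\hat{z}_{k}),\qquad\min_{i\leq k}\|\hat{v}_{i}\|^{2}\leq\theta(1+2\sqrt{\tau})^{2}\,\frac{\phi(z_{0})-\phi_{*}}{\Lambda_{k}},\tag{19}$$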

for every $k\geq 1$ , where $\Lambda_{k}:=\sum_{i=1}^{k}\lambda_{i}$ .

Proof

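Let $k\geq 1$ be fixed. The inclusion in (19) follows from Proposition 1 with $(\hat{z},\hat{v})=(\hat{z}_{k},\hat{v}_{k})$ and the definitions of $\hat{z}_{k}$ and $\hat{v}_{k}$ in step 1 of the GD framework. To show the inequality in (19), first observe that (17) and the definition of $\phi_{*}$ in (4) imply that

$$\phi(z_{0})-\phi_{*}\geq\sum_{i=1}^{k}\left[\phi(z_{i-1})-\phi(z_{i})\right]\geq\sum_{i=1}^{k}\frac{\|v_{i}+z_{i-1}-z_{i}\|^{2}}{\theta\lambda_{i}}\geq\frac{\Lambda_{k}}{\theta}\min_{i\leq k}\frac{1}{\lambda_{i}^{2}}\|v_{i}+z_{i-1}-z_{i}\|^{2}.\tag{20}$$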

Now, let $i\geq 1$ be arbitrary. In view of step 1 of the GD framework we have $(\hat{z}_{i},\hat{v}_{i},\Delta_{i})=RP(\lambda_{i},z_{i-1},z_{i},v_{i})$ . Hence Proposition 1 with $(\lambda,z^{-},z,v,\hat{v})=(\lambda_{i},z_{i-1},z_{i},v_{i},\hat{v}_{i})$ and (18) with $k=i$ imply that

$$\|\hat{v}_{i}\|\leq(1+2\sqrt{\tau})\frac{\|v_{i}+z_{i-1}-z_{i}\|}{\lambda_{i}}.\tag{21}$$

The inequality in (19) now follows by combining (20) and (21).

We now make three remarks about the GD framework in light of Proposition 2. First, if the GD framework stops when a pair $(\hat{z}_{k},\hat{v}_{k})$ such that $\|\hat{v}_{k}\|\leq\hat{\rho}$ is found, then it follows from (11) and the inclusion in (19) that $(\hat{z}_{k},\hat{v}_{k})$ is a $\hat{\rho}$ –approximate stationary point of (4). Second, if the sequence of stepsizes $\{\lambda_{i}\}$ satisfies $\lim_{k\to\infty}\Lambda_{k}=\infty$ , then it follows from the inequality in (19) and assumption (A3) that the GD framework indeed stops according to the above termination criterion. Third, (19) indicates that the larger the stepsizes $\lambda_{k}$ are, the faster the quantity $\min_{i\leq k}\|\hat{v}_{i}\|$ approaches zero.
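The third remark can be made concrete with a small calculation. The sketch below assumes the bound in (19) reads $\theta(1+2\sqrt{\tau})^{2}[\phi(z_{0})-\phi_{*}]/\Lambda_{k}$, which is what one obtains by combining the bound of Proposition 1 with (18), and it assumes a constant stepsize so that $\Lambda_{k}=k\lambda$. The function names are hypothetical.

```python
import math

def stationarity_bound(theta, tau, gap, Lambda_k):
    """Right-hand side of the bound on min_i ||v_hat_i||^2, where
    gap = phi(z_0) - phi_* and Lambda_k = lambda_1 + ... + lambda_k."""
    return theta * (1.0 + 2.0 * math.sqrt(tau)) ** 2 * gap / Lambda_k

def iterations_needed(theta, tau, gap, lam, rho_hat):
    """Smallest k with stationarity_bound <= rho_hat**2 when every
    stepsize equals lam (so Lambda_k = k * lam)."""
    return math.ceil(theta * (1.0 + 2.0 * math.sqrt(tau)) ** 2 * gap
                     / (lam * rho_hat ** 2))
```

For example, with $\theta=4$, $\tau=1$, $\phi(z_{0})-\phi_{*}=1$ and $\hat{\rho}=0.25$, a constant stepsize $\lambda=0.5$ gives 1152 iterations, while doubling the stepsize to $\lambda=1$ halves this to 576.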

For the remainder of this section, our goal is to show that the GD framework can be seen as a relaxation of the GIPP framework studied in WJRproxmet1. The proof of this fact is not essential in establishing any results pertaining to the R-AIPP method in Section 4 or the R-QP-AIPP method in Section 5 and may be skipped without any loss of continuity.

Recall that, for a given $z_{0}\in\mathrm{dom}\,h$ and $\sigma\in[0,1)$ , the GIPP framework in WJRproxmet1 considers a sequence $\{(\lambda_{k},z_{k},v_{k},\varepsilon_{k})\}\subseteq\Re_{++}\times\mathrm{dom}\,\phi\times\Re^{n}\times\Re_{+}$ satisfying

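$$v_{k}\in\partial_{\varepsilon_{k}}\left(\lambda_{k}\phi+\frac{1}{2}\|\cdot-z_{k-1}\|^{2}\right)(z_{k}),\qquad\|v_{k}\|^{2}+2\varepsilon_{k}\leq\sigma\|v_{k}+z_{k-1}-z_{k}\|^{2},\tag{22}$$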

for every $k\geq 1$ . We now state a simple technical result which will not only be used in this section but also later in the analysis of the R-ACG algorithm (see Section 3).

Lemma 1

Assume that $\varepsilon\geq 0$ and $(\lambda,z^{-},z,v)\in\Re_{++}\times\Re^{n}\times\mathrm{dom}\,h\times\Re^{n}$ satisfy

$$v\in\partial_{\varepsilon}\left(\lambda\phi+\frac{1}{2}\|\cdot-z^{-}\|^{2}\right)(z).\tag{23}$$

Then, the quantity $\Delta$ defined in (15) satisfies $\Delta\leq\varepsilon$ .

Proof

Let $(\hat{z},\Delta)$ be computed as in (13) and (15). It follows from (8) and (23) that

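$$\lambda\phi(z^{\prime})+\frac{1}{2}\|z^{\prime}-z^{-}\|^{2}\geq\lambda\phi(z)+\frac{1}{2}\|z-z^{-}\|^{2}+\left\langle v,z^{\prime}-z\right\rangle-\varepsilon\quad\forall z^{\prime}\in\Re^{n}.$$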

Considering the above inequality at the point $z^{\prime}=\hat{z}$ , along with some algebraic manipulation, we have

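$$\varepsilon\geq\left[\lambda\phi(z)+\frac{1}{2}\|z-z^{-}\|^{2}-\left\langle v,z\right\rangle\right]-\left[\lambda\phi(\hat{z})+\frac{1}{2}\|\hat{z}-z^{-}\|^{2}-\left\langle v,\hat{z}\right\rangle\right]=\Delta,$$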

where the last equality is due to the definitions of $\phi$ and $\Delta$ given in (4) and (15), respectively.

The following result shows the relationship between the GIPP framework of WJRproxmet1 and the GD framework of this section.

Proposition 3

If, for some $z_{k-1}\in\mathrm{dom}\,h$ , constant $\sigma\in[0,1)$ , and index $k\geq 1$ , the quadruple $(\lambda_{k},z_{k},v_{k},\varepsilon_{k})$ satisfies (22), then $(\lambda_{k},z_{k},v_{k})$ satisfies (17) and (18) for any $\theta\geq 2/(1-\sigma)$ and $\tau\geq\sigma(\lambda_{k}M+1)$ . As a consequence, if $\sup\{\lambda_{k}:k\geq 1\}<\infty$ , then every instance of the GIPP framework is an instance of the GD framework for any $(\theta,\tau)$ satisfying

$$\theta\geq\frac{2}{1-\sigma},\qquad\tau\geq\sup\left\{\sigma(\lambda_{k}M+1):k\geq 1\right\}.\tag{24}$$

Proof

The proof that $(\lambda_{k},z_{k},v_{k})$ satisfies (17) with $\theta=2/(1-\sigma)$ can be found in (WJRproxmet1, Proposition 5(a)). Now, let $k\geq 1$ and observe that from Lemma 1 with $(\lambda,z^{-},z,v)=(\lambda_{k},z_{k-1},z_{k},v_{k})$ and $\varepsilon=\varepsilon_{k}$ we have $\Delta\leq\varepsilon_{k}$. It follows from the last inequality and the inequality in (22) that $2\Delta\leq\sigma\|v_{k}+z_{k-1}-z_{k}\|^{2}$. Combining the previous inequality with the assumption on $\tau$ now shows that $(\lambda_{k},z_{k},v_{k})$ satisfies (18). The second part of the proposition follows immediately from the first part and condition (24).

The above proposition shows that if $\{\lambda_{k}\}$ is bounded and the parameter triple $(\sigma,\theta,\tau)$ satisfies (24), then the condition for finding an iterate $(\lambda_{k},z_{k},v_{k})$ in the GD framework is more relaxed than the condition for finding an iterate $(\lambda_{k},z_{k},v_{k},\varepsilon_{k})$ in the GIPP framework. As a consequence, under the conditions in (24), an optimization algorithm (such as the R-ACG algorithm of Section 3) applied to (6) is expected to find the triple $(\lambda_{k},z_{k},v_{k})$ for the GD framework faster than the quadruple $(\lambda_{k},z_{k},v_{k},\varepsilon_{k})$ for the GIPP framework.

The following corollary justifies the third remark following the GD framework.

Corollary 1

Let $z_{k-1}\in\mathrm{dom}\,h$ and $\lambda_{k}\in(0,1/\underline{m})$ be given, where $\underline{m}$ is as in (10). Then, (6) has a unique global minimum $z_{k}$ and the triple $(\lambda_{k},z_{k},v_{k})\in\Re_{++}\times\mathrm{dom}\,h\times\Re^{n}$ where $v_{k}=0$ satisfies (17) and (18) with $\theta=2$ and $\tau=0$ .

Proof

The existence and uniqueness of $z_{k}$ follow from the fact that $\phi+\|\cdot-z_{k-1}\|^{2}/(2\lambda_{k})$ is strongly convex. Moreover, the fact that $z_{k}$ is the unique global minimum of (6) implies that the quadruple $(\lambda_{k},z_{k},v_{k},\varepsilon_{k})$, where $(v_{k},\varepsilon_{k})=(0,0)$, satisfies (22) with $\sigma=0$. The conclusion of the corollary now follows immediately from the first part of Proposition 3 with $\sigma=0$.
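Corollary 1 can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: $n=1$, $h\equiv 0$, and $\phi(z)=g(z)=-\cos z$, for which $\underline{m}=1$, so any $\lambda\in(0,1)$ makes the prox subproblem strongly convex. The helper names are hypothetical, and the prox subproblem is solved by bisection on its derivative rather than by any method from the paper.

```python
import math

def phi(z):
    # illustrative objective phi = g + h with g(z) = -cos(z), h = 0 (assumption)
    return -math.cos(z)

def prox_minimizer(lam, center, tol=1e-12):
    """Minimize phi(u) + (u - center)**2 / (2*lam) for 0 < lam < 1 by bisection
    on the derivative sin(u) + (u - center)/lam, which is strictly increasing."""
    def d(u):
        return math.sin(u) + (u - center) / lam
    a, b = center - lam, center + lam  # d(a) <= 0 <= d(b) because |sin| <= 1
    while b - a > tol:
        mid = 0.5 * (a + b)
        if d(mid) > 0.0:
            b = mid
        else:
            a = mid
    return 0.5 * (a + b)

def descent_holds(theta, lam, z_prev, z_new, v=0.0):
    """Check ||v + z_prev - z_new||^2 <= theta*lam*(phi(z_prev) - phi(z_new))."""
    return (v + z_prev - z_new) ** 2 <= theta * lam * (phi(z_prev) - phi(z_new))
```

With `z = prox_minimizer(0.5, 1.0)`, the call `descent_holds(2.0, 0.5, 1.0, z)` is expected to return `True`, matching inequality (17) with $\theta=2$ and $v_{k}=0$. Since the gradient of $f_{\lambda}$ vanishes at the exact minimizer, the refinement step does not move the point, which is consistent with $\Delta_{k}=0$ and hence with (18) for $\tau=0$.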

3 A relaxed accelerated composite gradient algorithm

This section presents and analyzes an ACG variant, namely, the R-ACG algorithm, which is used as an important tool in the development of the R-AIPP method of Section 4. More specifically, the R-AIPP method can be viewed as a special instance of the GD framework where step 1 is implemented by repeatedly calling the ACG variant of this section.

Before describing the variant, we consider its assumptions as well as the problem that it solves. First, we describe the assumptions. Let $\widetilde{\phi}:\Re^{n}\to(-\infty,\infty]$ be given and assume that it can be decomposed as $\widetilde{\phi}=\widetilde{\phi}^{(s)}+\widetilde{\phi}^{(n)}$ where:

(B1)

$\widetilde{\phi}^{(n)}\in\overline{\text{Conv}}(\Re^{n})$ ;

(B2)

$\widetilde{\phi}^{(s)}$ is a differentiable function on $\mathrm{dom}\,\widetilde{\phi}^{(n)}$ such that for some $\widetilde{M}>0$ ,

[TABLE]

We now describe our problem of interest in this section.

Problem A: Given $\widetilde{\phi}:\Re^{n}\to(-\infty,+\infty]$ satisfying the above assumptions, a point $x_{0}\in\Re^{n}$ , and a pair of parameters $(\theta,\tau)\in(2,\infty)\times(0,\infty)$ , the problem is to find a triple $(x,u,\eta)\in\Re^{n}\times\Re^{n}\times\Re_{+}$ such that

[TABLE]

The following simple result shows how the ability to solve Problem A allows us to implement the “step 1” oracle in the GD framework.

Proposition 4

Assume that $\phi=g+h$ satisfies conditions (A1) and (A2), and let $z_{k-1}\in\mathrm{dom}\,h$ be given. Then the following statements hold:

(a)

if $(x,u)$ satisfies (25) with $(\widetilde{\phi},\widetilde{M},x_{0})=(\lambda\phi,\lambda M,z_{k-1})$ for some $\lambda>0$ , then the triple $(\lambda_{k},z_{k},v_{k}):=(\lambda,x,u)$ satisfies (17);

(b)

if $(x,u,\eta)$ solves Problem A with input $(\widetilde{\phi},\widetilde{M},x_{0})=(\lambda\phi,\lambda M,z_{k-1})$ for some $\lambda>0$ , then the triple $(\lambda_{k},z_{k},v_{k})=(\lambda,x,u)$ solves step 1 of the GD framework.

Proof

(a) Assume that $(x,u)$ satisfies (25). It follows from the fact that $(\lambda,x,u)=(\lambda_{k},z_{k},v_{k})$ and the definition of $\widetilde{\phi}$ that

[TABLE]

and thus the triple $(\lambda_{k},z_{k},v_{k})$ satisfies (17).

(b) Assume that $(x,u,\eta)$ satisfies (26) and define $\varepsilon:=\eta$ and $(z^{-},z,v):=(x_{0},x,u)$ . Moreover, let $\Delta$ be computed as in (15) with $\hat{z}$ as in (13). It follows from Lemma 1, the definition of $\widetilde{\phi}$ , the fact that $\eta=\varepsilon$ , and the inclusion in (26) that $\Delta\leq\eta$ . Using the inequality in (26) and the fact that $(x_{0},x,u)=(z_{k-1},z_{k},v_{k})$ gives $2(\widetilde{M}+1)\Delta\leq\tau\|z_{k-1}-z_{k}+v_{k}\|^{2}$ and thus the pair $(z_{k},v_{k})$ satisfies (18) in view of the definition of $\widetilde{M}$ . As a consequence, the triple $(\lambda_{k},z_{k},v_{k})$ solves step 1 of the GD framework.

The R-ACG algorithm presented below, which is a modified ACG variant for minimizing the function $\psi:=\widetilde{\phi}+\|\cdot-x_{0}\|^{2}/2$ , solves Problem A under the assumption that $\psi$ is convex (see Proposition 5(c) below). As a consequence, it can be used to implement step 1 of the GD framework whenever $\lambda_{k}$ is sufficiently small. More specifically, since $\lambda_{k}\phi+\|\cdot-z_{k-1}\|^{2}/2$ is clearly convex whenever $\lambda_{k}$ is chosen in $(0,1/\underline{m}]$ , where $\underline{m}$ is as in (10), we can use the R-ACG algorithm to solve Problem A with $\widetilde{\phi}=\lambda_{k}\phi$ and $x_{0}=z_{k-1}$ , and hence implement the “step 1” oracle in the GD framework in view of Proposition 4(b). In fact, the AIPP method developed in WJRproxmet1 is an instance of the GIPP framework (and hence an instance of the GD framework) in which, given an upper bound $m$ on $\underline{m}$ , it chooses $\lambda_{k}=1/(2m)$ for all $k$ and in which step 1 is implemented with a single call to the R-ACG algorithm presented below.
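As a simple illustration of the convexity threshold $1/\underline{m}$ , consider the following worked example. It rests on assumptions that are not part of the text: $h\equiv 0$ and $\phi$ is the concave quadratic $\phi(z)=-(\underline{m}/2)\|z\|^{2}$ , a toy choice that is not meant to satisfy every standing assumption of the paper. In this case the proximal subproblem objective $\psi_{\lambda}:=\lambda\phi+\|\cdot-z_{k-1}\|^{2}/2$ and its Hessian are

$$\psi_{\lambda}(z)=-\frac{\lambda\underline{m}}{2}\|z\|^{2}+\frac{1}{2}\|z-z_{k-1}\|^{2},\qquad\nabla^{2}\psi_{\lambda}(z)=(1-\lambda\underline{m})\,I.$$

Hence, in this example, $\psi_{\lambda}$ is convex exactly when $\lambda\leq 1/\underline{m}$ and strongly convex when $\lambda<1/\underline{m}$ , while any larger stepsize produces a nonconvex proximal subproblem.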

However, our main goal in this paper is the development of an instance of the GD framework which aggressively chooses $\lambda_{k}$ (possibly) much larger than $1/\underline{m}$ since, according to Proposition 2, this strategy can potentially reduce its number of iterations. In this regard, the R-ACG algorithm presented below accepts as input a function $\widetilde{\phi}$ of the form $\widetilde{\phi}=\lambda\phi$ for some $\lambda>0$ in which $\widetilde{\phi}+\|\cdot-x_{0}\|^{2}/2$ is not necessarily convex, and terminates with either failure or by finding a triple $(x,u,\eta)$ satisfying (25) within ${\cal O}(\widetilde{M}^{1/2}\log^{+}_{1}\widetilde{M})$ iterations (see statements (a) and (b) of Proposition 5 below). Clearly, in the second case, the triple $(\lambda_{k},z_{k},v_{k})=(\lambda,x,u)$ is guaranteed to satisfy (17) but not necessarily (18) (see Proposition 4(a)). If (18) is satisfied then the R-ACG algorithm clearly provides a solution of the “step 1” oracle of the GD framework; otherwise, the stepsize $\lambda$ is considered large. The R-AIPP method of Section 4 is an instance of the GD framework which attempts to provide a solution of its “step 1” oracle in this manner and adaptively reduces $\lambda$ whenever it is found to be large.

R-ACG algorithm.

Input: a scalar $\widetilde{M}>0$ , a function pair $(\widetilde{\phi}^{(s)},\widetilde{\phi}^{(n)})$ satisfying assumptions (B1) and (B2), an initial point $x_{0}\in\mathrm{dom}\,\widetilde{\phi}^{(n)}$ , and a pair of parameters $(\theta,\tau)\in(2,\infty)\times(0,\infty)$ ;

Output: a triple $(x,u,\eta)\in\mathrm{dom}\,\widetilde{\phi}^{(n)}\times\Re^{n}\times\Re_{+}$ satisfying (25) or a failure status;

(0)

set $y_{0}=x_{0}$ , $A_{0}=0$ , $\Gamma_{0}\equiv 0$ , $j=1$ , and define

[TABLE]

(1)

compute

[TABLE]

and set

[TABLE]

(2)

if both inequalities

[TABLE]

hold, then go to step 3; otherwise, stop with failure;

(3)

if both inequalities

[TABLE]

hold, then return $(x,u,\eta)=(x_{j},u_{j},\eta_{j})$ ; otherwise, increment $j=j+1$ and go to step 1.

Some comments about the above algorithm are in order. First, step 1 is essentially a standard step of an ACG variant (see, for example, YHe2 ; WJRproxmet1 ) applied to the problem $\min\{\widetilde{\phi}(x)+\|x-x_{0}\|^{2}/2:x\in\Re^{n}\}$ with the exception that it also computes in (33) the quantities $u_{j}$ and $\eta_{j}$ which, together with $x_{j}$ , determine the termination criteria for the method. Second, it is shown in (WJRproxmet1, , Lemma 9) that a simplified version of the above algorithm, namely, one that does not include the two tests performed in step 2 and stops whenever the inequality in (22) is satisfied with $(z_{k-1},z_{k},{v}_{k},{\varepsilon}_{k})=(x_{0},x_{j},u_{j},\eta_{j})$ , implements step 1 of the GIPP framework in WJRproxmet1 . Finally, it is well-known (see, for example, (YHe2, , Proposition 2.3)) that the scalar $A_{j}$ updated according to (29) satisfies

[TABLE]

The next result establishes the iteration-complexity bound and some properties of the R-ACG algorithm.

Proposition 5

The R-ACG algorithm satisfies the following statements:

(a)

it stops (either with success or failure) in at most

[TABLE]

iterations, where

[TABLE]

(b)

if it stops with success then its output $(x,u,\eta)$ satisfies

[TABLE]

(c)

if $\widetilde{\phi}^{(s)}+\|\cdot-x_{0}\|^{2}/2$ is convex then it always terminates with success and its output $(x,u,\eta)$ solves Problem A.

Proof

(a) See Appendix A.2.

(b) This follows from the fact that when the R-ACG algorithm stops with success, the last iterate $(x,u)=(x_{j},u_{j})$ satisfies (37).

(c) It follows from (WJRproxmet1, , Proposition 8(c)) that if $\widetilde{\phi}^{(s)}+\|\cdot-x_{0}\|^{2}/2$ is convex, then the iterate $(x_{j},u_{j},\eta_{j},A_{j})$ satisfies (34) and the inclusion $u_{j}\in\partial_{\eta_{j}}(\widetilde{\phi}+\|\cdot-x_{0}\|^{2}/2)(x_{j})$ for every $j\geq 1$ . Hence, since the aforementioned inclusion and the definition of $\psi$ in (27) imply (35), we conclude that the R-ACG algorithm does not terminate with failure (see step 2). As a consequence, it follows from statement (a) that it must terminate with success. It then follows from the previous inclusion, and the fact that the last iterate $(x,u,\eta):=(x_{j},u_{j},\eta_{j})$ satisfies (36), that $\eta$ fulfills (26).

4 A relaxed accelerated inexact proximal point method

This section states and analyzes a relaxed variant of the AIPP method proposed in WJRproxmet1 , namely, the R-AIPP method, for computing an approximate stationary point of (4) as in (11).

The R-AIPP method stated below is an instance of the GD framework which implements its step 1 by repeatedly invoking the ACG variant in Section 3 and thereby generates the method’s iteration sequence. More specifically, if $z_{k-1}$ denotes the previous iterate in the GD framework and $\lambda:=\lambda_{k}$ then the R-ACG algorithm is invoked to attempt to solve Problem A with curvature $\widetilde{M}$ , function pair $(\widetilde{\phi}^{(s)},\widetilde{\phi}^{(n)})$ , and initial point $x_{0}$ given by

[TABLE]

If it succeeds, it obtains a pair $(x,u)$ which will satisfy condition (25) of Problem A. Consequently, if the triple $(\lambda_{k},z_{k},v_{k})=(\lambda,x,u)$ satisfies (18), then it is a solution of step 1 of the GD framework. If the R-ACG algorithm declares failure or the triple does not satisfy (18), then the stepsize $\lambda$ is reduced and the above procedure is repeated.

R-AIPP method.

Input: a tolerance $\hat{\rho}>0$ , a scalar $M>0$ , a function pair $(g,h)$ satisfying assumptions (A1)–(A3), an initial point $z_{0}\in\mathrm{dom}\,h$ , a scalar $\lambda_{0}>0$ , and a pair of parameters $(\theta,\tau)\in(2,\infty)\times(0,\infty)$ ;

Output: a pair $(\hat{z},\hat{v})\in\mathrm{dom}\,h\times\Re^{n}$ satisfying (11);

(0)

set $\lambda=\lambda_{0}$ and $k=1$ ;

(1)

apply the R-ACG algorithm to Problem A in Section 3 with inputs $\widetilde{M}$ , $(\widetilde{\phi}^{(s)},\widetilde{\phi}^{(n)})$ , $x_{0}$ , and $(\theta,\tau)$ , where

[TABLE]

if the R-ACG algorithm stops with failure then set $\lambda=\lambda/2$ and repeat this step; otherwise, let $(x,u,\eta)$ denote its output triple and go to step 2;

(2)

compute $(\hat{z},\hat{v},\Delta)=RP(\lambda,z_{k-1},x,u)$ through the refinement procedure; if

[TABLE]

then set $\lambda=\lambda/2$ and go to step 1; otherwise, set

[TABLE]

and go to step 3;

(3)

if $\hat{v}_{k}$ satisfies

[TABLE]

then return $(\hat{z},\hat{v})=(\hat{z}_{k},\hat{v}_{k})$ ; otherwise, increment $k=k+1$ and go to step 1;

We now give some comments about the above method. First, it performs two types of iterations, namely, the outer iterations which are indexed by $k$ and the inner ones which are performed by the R-ACG algorithm every time it is called in step 1. Second, if the call to the R-ACG algorithm in step 1 does not stop with failure then, by Proposition 5(b), the triple $(x,u,\eta)$ output by the R-ACG algorithm together with the stepsize $\lambda$ will satisfy (41) where $\widetilde{\phi}=\lambda(g+h)$ . Hence, by Proposition 4(a), the triple $(\lambda_{k},z_{k},v_{k}):=(\lambda,x,u)$ will satisfy (17). If $\lambda$ is also not halved in step 2 then the definition of $\widetilde{M}$ and Proposition 4(b) imply that the triple $(\lambda_{k},z_{k},v_{k})$ also satisfies (18). As a consequence, a single iteration of the R-AIPP method implements step 1 of the GD framework. Third, the termination condition (43) and Proposition 1, with $(\lambda,z^{-},z,v)=(\lambda_{k},z_{k-1},z_{k},v_{k})$ , imply that the required solution, i.e., a pair $(\hat{z},\hat{v})$ satisfying (11), is obtained when the R-AIPP method terminates. Fourth, since the R-AIPP iterates implement step 1 of the GD framework, and the sequence $\{\lambda_{k}\}$ is bounded below (see Lemma 2(b) below), Proposition 2 implies that the sequence $\{\hat{v}_{k}\}$ generated by the R-AIPP method has a subsequence approaching zero, and thus the method must terminate in step 3. Fifth, although the R-AIPP method does not necessarily generate proximal subproblems with convex objective functions, it is shown in Proposition 6 below that it has an iteration complexity similar to that of the AIPP method of WJRproxmet1 . Finally, in contrast to the aforementioned AIPP method, the R-AIPP method neither requires an upper bound on the quantity $\underline{m}$ in (10) as part of its input nor places any restriction on the initial stepsize $\lambda_{0}$ .
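The control flow of steps 0 to 3 can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: the R-ACG call, the test in step 2, and the termination test in step 3 are replaced by hypothetical callables (`inner_solver`, `accept`, `should_stop`), the iterates are scalars, and the toy solver uses an artificial failure rule ($\lambda>1$) instead of the actual tests performed by the R-ACG algorithm.

```python
from typing import Callable, Optional, Tuple


def r_aipp_skeleton(
    inner_solver: Callable[[float, float], Optional[float]],
    accept: Callable[[float, float, float], bool],
    should_stop: Callable[[float, float, float], bool],
    z0: float,
    lam0: float,
    max_calls: int = 1000,
) -> Tuple[float, float, int, int]:
    """Return (last iterate, final stepsize, kept calls, halving events)."""
    lam, z_prev = lam0, z0
    kept = halved = 0
    while kept + halved < max_calls:
        x = inner_solver(lam, z_prev)        # step 1: None signals failure
        if x is None or not accept(lam, z_prev, x):
            lam /= 2.0                       # steps 1-2: stepsize judged too large
            halved += 1
            continue
        kept += 1                            # candidate becomes the next iterate
        if should_stop(lam, z_prev, x):      # step 3: termination test
            return x, lam, kept, halved
        z_prev = x
    raise RuntimeError("call budget exhausted")


# Hypothetical toy problem: phi(z) = z**4/4 - z**2/2, whose second
# derivative 3*z**2 - 1 is bounded below by -1.
def toy_solver(lam: float, z_prev: float) -> Optional[float]:
    if lam > 1.0:
        return None  # artificial failure rule, NOT the tests of the R-ACG algorithm

    # Derivative of lam*phi(z) + (z - z_prev)**2/2; nondecreasing when lam <= 1.
    def dpsi(z: float) -> float:
        return lam * (z**3 - z) + (z - z_prev)

    lo, hi = -10.0, 10.0
    for _ in range(200):                     # bisection on the monotone derivative
        mid = 0.5 * (lo + hi)
        if dpsi(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


x, lam, kept, halved = r_aipp_skeleton(
    toy_solver,
    accept=lambda lam, z_prev, x: True,      # stand-in for the step 2 test
    should_stop=lambda lam, z_prev, x: abs(x - z_prev) / lam <= 1e-6,
    z0=2.0,
    lam0=8.0,
)
print(x, lam, kept, halved)
```

In this toy run the stepsize is halved from 8 to 1 before any candidate is kept, after which it stays constant and the iterates approach the stationary point $z=1$ of the toy function.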

Each iteration of the R-AIPP method may call the R-ACG algorithm multiple times (possibly just one time). Invocations of the R-ACG algorithm that stop with success are said to be of type $S$ while the other invocations are said to be of type $O$ . Let $K_{S}$ (resp., $K_{O}$ ) denote the total number of R-ACG calls of type $S$ (resp., type $O$ ). The following technical result provides some basic facts about $K_{S}$ , $K_{O}$ and the sequence of stepsizes $\{\lambda_{k}\}$ .

Lemma 2

The following statements hold for the R-AIPP method:

(a)

if the stepsize $\lambda_{\bar{k}}\leq 1/(2\underline{m})$ for some $\bar{k}\geq 1$ , then every iteration $k\geq\bar{k}$ is of type $S$ and, as a consequence, $\lambda_{k}=\lambda_{\bar{k}}$ for every $k>\bar{k}$ ;

(b)

$K_{O}$ can be bounded as $2^{K_{O}}\leq\max\{1,4{\lambda_{0}}\underline{m}\}$ ;

(c)

$\{\lambda_{k}\}$ is non-increasing and satisfies $1/\lambda_{k}\leq\max\{1/\lambda_{0},4\underline{m}\}$ for all $k\geq 1$ .

Proof

(a) Since $\lambda_{\bar{k}}\leq 1/(2\underline{m})$ , the definition of $\underline{m}$ in (10) implies that $\widetilde{\phi}^{(s)}+\|\cdot-z_{k-1}\|^{2}/2$ is convex, where $\widetilde{\phi}^{(s)}$ is as defined in (42) with $\lambda:=\lambda_{\bar{k}}$ . Hence, Proposition 5(c) together with Proposition 4(b) imply that step 1 and step 2 do not halve $\lambda$ at the $\bar{k}^{\rm th}$ iteration, which is to say that this iteration is of type $S$ . Since $\{\lambda_{k}\}$ is clearly nonincreasing, the same conclusion holds true for every iteration $k\geq\bar{k}$ . Moreover, as $\lambda$ is not halved for subsequent iterations following $\bar{k}$ , it follows that $\lambda_{k}=\lambda_{\bar{k}}$ for every $k>\bar{k}$ .

(b) Using the fact that immediately after each iteration of type $O$ , the stepsize $\lambda$ is halved, we see that the condition $\lambda_{\bar{k}}\leq 1/(2\underline{m})$ in part (a) would eventually be satisfied for some iteration $\bar{k}\geq 1$ , and hence $K_{O}$ is finite. Now, note that if $K_{O}=0$ then the inequality in part (b) follows immediately. Assume then that $K_{O}\geq 1$ . It now follows from part (a) and the definition of $K_{O}$ that $\lambda_{0}/2^{K_{O}-1}>1/(2\underline{m})$ , which clearly implies the inequality in part (b).

(c) The first statement follows trivially from the update rule of $\lambda_{k}$ in the R-AIPP method. Now, note that the definition of $K_{O}$ together with the update rule for $\lambda_{k}$ imply, for every $k\geq 1$ , that ${\lambda_{0}}/{2^{K_{O}}}\leq\lambda_{k}.$ The inequality in part (c) then follows from the inequality in part (b).
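Spelled out, the chain of inequalities behind part (c) is the following restatement of the argument above, which uses only the lower bound ${\lambda_{0}}/{2^{K_{O}}}\leq\lambda_{k}$ and part (b):

$$\frac{1}{\lambda_{k}}\leq\frac{2^{K_{O}}}{\lambda_{0}}\leq\frac{\max\{1,4\lambda_{0}\underline{m}\}}{\lambda_{0}}=\max\left\{\frac{1}{\lambda_{0}},4\underline{m}\right\}.$$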

In view of Lemma 2(a), choosing an initial stepsize $\lambda_{0}$ satisfying $\lambda_{0}\leq 1/(2\underline{m})$ results in an R-AIPP variant with constant stepsize, which resembles the AIPP method described in WJRproxmet1 .
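The bounds in Lemma 2(b) and 2(c) can be checked numerically with a short sketch. It simulates the worst case allowed by Lemma 2(a), namely that the stepsize is halved for as long as $\lambda>1/(2\underline{m})$ ; an actual run of the R-AIPP method may halve fewer times. The function name `worst_case_halvings` and the sample values are hypothetical.

```python
from typing import Tuple


def worst_case_halvings(lam0: float, m_low: float) -> Tuple[int, float]:
    """Halve lam until lam <= 1/(2*m_low); return (number of halvings, final lam)."""
    lam, count = lam0, 0
    while lam > 1.0 / (2.0 * m_low):
        lam /= 2.0
        count += 1
    return count, lam


for lam0, m_low in [(1.0, 10.0), (0.01, 10.0), (64.0, 0.5)]:
    k_o, lam = worst_case_halvings(lam0, m_low)
    # Lemma 2(b): 2**K_O <= max{1, 4*lam0*m_low}
    assert 2 ** k_o <= max(1.0, 4.0 * lam0 * m_low)
    # Lemma 2(c): 1/lam_k <= max{1/lam0, 4*m_low}
    assert 1.0 / lam <= max(1.0 / lam0, 4.0 * m_low)
    print(lam0, m_low, k_o, lam)
```

The second sample pair starts below the threshold $1/(2\underline{m})$ , so no halving occurs and the stepsize stays constant, in line with the remark above.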

The next proposition presents a worst-case iteration complexity bound on the number of inner iterations of the R-AIPP method with respect to the inputs $M,\lambda_{0},$ and $z_{0}$ , the quantity $\underline{m}$ in (10), and the tolerance $\hat{\rho}$ .

Proposition 6

Defining $\xi_{0}:=\max\{1/\lambda_{0},4\underline{m}\}$ , the R-AIPP method outputs a $\hat{\rho}$ –approximate stationary point $(\hat{z},\hat{v})$ of (4) in at most

[TABLE]

inner iterations.

Proof

Let ${\rm TI}_{S}$ (resp. ${\rm TI}_{O}$ ) denote the total number of inner iterations performed during all calls of type $S$ (resp. type $O$ ) (see the paragraph preceding Lemma 2). Clearly, the total number of inner iterations is ${\rm TI}:={\rm TI}_{S}+{\rm TI}_{O}$ . We now bound each one of the quantities ${\rm TI}_{S}$ and ${\rm TI}_{O}$ separately by using the fact that assumption (A2), (42), and Proposition 5(a) imply that the number of inner iterations performed during each call to the R-ACG algorithm is bounded by

[TABLE]

where $\bar{\lambda}$ is the value of $\lambda$ just before the call and $C$ is as in (40) with $\widetilde{M}=\bar{\lambda}M$ .

We first consider ${\rm TI}_{O}$ . Note that Lemma 2(b) implies that $K_{O}$ is finite. Since ${\rm TI}_{O}=0$ when $K_{O}=0$ , we may assume without loss of generality that $K_{O}\geq 1$ . Note that the values of $\lambda$ just before the $K_{O}$ calls of type O are exactly $\lambda_{0},\lambda_{0}/2,\ldots,\lambda_{0}/2^{K_{O}-1}$ . Hence, we conclude that

[TABLE]

where the second inequality is due to the fact that Lemma 2(b) implies $2^{i-1}\leq 2^{K_{O}-1}\leq 2\lambda_{0}\xi_{0}$ for every $i\leq K_{O}$ . Thus, we obtain

[TABLE]

We now bound ${\rm TI}_{S}$ . Suppose that $K_{S}>1$ and observe that the termination criterion (43) is not satisfied in the first $K_{S}-1$ iterations. Since the R-AIPP method is an instance of the GD framework, it follows from Proposition 2 that

[TABLE]

Using the fact that Lemma 2(c) implies $1/\lambda_{j}\leq\max\{1/\lambda_{0},4\underline{m}\}=\xi_{0}$ and $\lambda_{j}\leq\lambda_{0}$ for every $j\geq 1$ , we obtain

[TABLE]

Hence, we conclude that

[TABLE]

It can be easily seen that the bound in (47) trivially holds when $K_{S}\leq 1$ in view of the last term in it. Indeed, to prove this, just assume that $\sum_{j=1}^{K_{S}-1}\lambda_{j}=0$ in the above argument bounding ${\rm TI}_{S}$ . Now, since ${\rm TI}={\rm TI}_{O}+{\rm TI}_{S}$ , the bound in (44) follows by adding (45) and (47).

The last statement of the proposition follows due to Proposition 1 and the termination condition in step 3 of the R-AIPP method.

Observe that, unless $\lambda_{0}$ is large or $\underline{m}$ is small, the first term in (44) dominates the second one.

The numerical experiments in Section 6 consider three variants of the R-AIPP method, two of which are R-AIPP instances with different choices of $\lambda_{0}$ . More specifically, given an upper bound $m$ on $\underline{m}$ , one of the R-AIPP instances chooses $\lambda_{0}=0.9/(2m)$ while the other one chooses $\lambda_{0}=1$ . For the problem instances considered, the former choice of $\lambda_{0}$ is relatively small, while the latter choice is relatively large.

We now end this section by discussing some possible choices of the initial stepsize $\lambda_{0}$ and how the corresponding R-AIPP instances compare to the AIPP method of WJRproxmet1 . First, the AIPP method requires knowledge of an upper bound $m$ on $\underline{m}$ such that $m={\cal O}(M)$ , and, as a consequence of a more general iteration complexity bound derived in (WJRproxmet1, , Corollary 14), its inner iteration complexity can be shown to be

[TABLE]

Now, if $m$ as above is also known to the R-AIPP and the input $\lambda_{0}$ is set to $1/(4m)$ , then its inner iteration complexity (44) reduces to

[TABLE]

which is the same as (48) up to a logarithmic factor. On the other hand, if $\lambda_{0}$ is chosen so that $1/\lambda_{0}={\cal O}(\underline{m})$ then (44) reduces to

[TABLE]

whose dominant first term is as good as the dominant first term in (48) whenever $\sqrt{\underline{m}}\log_{1}^{+}(\lambda_{0}M)={\cal O}(\sqrt{m})$ .

5 A relaxed quadratic penalty AIPP method

This section presents the R-QP-AIPP method for solving a class of linearly–set–constrained nonconvex composite optimization problems. Similar to the QP-AIPP method of WJRproxmet1 , the R-QP-AIPP method is a quadratic penalty–based method that solves a sequence of penalized subproblems, for increasing values of the penalty parameter, using the R-AIPP method of Section 4. The section contains two subsections. The first one describes the main problem of interest, its underlying assumptions, and the notion of a corresponding approximate stationary point which the R-QP-AIPP method will provably obtain, and briefly outlines a cold–started quadratic penalty–based method for obtaining such a point. The second one presents a warm–started quadratic penalty–based method, namely, the R-QP-AIPP method, for obtaining the desired stationary point and establishes its ACG iteration complexity.

5.1 The linearly–set–constrained problem

This subsection describes the main problem of interest in this section, namely, the linearly–set–constrained nonconvex composite optimization problem (51), its underlying assumptions, and a notion of an approximate stationary point of it. Moreover, it describes the quadratic penalty subproblem (parameterized by a penalty parameter) associated with (51) and discusses the relationship between their corresponding approximate stationary points. It then outlines a (static and dynamic) cold–started quadratic penalty–based method and its corresponding iteration-complexity bound, which turns out to be larger than that of the QP-AIPP method of WJRproxmet1 .

The main problem of interest for this section is the linearly–set–constrained nonconvex composite optimization problem

[TABLE]

where the closed convex set $S\subseteq\Re^{p}$ , the linear operator $A:\Re^{n}\mapsto\Re^{p}$ , and the functions $f,h:\Re^{n}\mapsto(-\infty,\infty]$ satisfy the following assumptions:

(C1)

$h\in\overline{\text{Conv}}(\Re^{n})$ and its diameter

[TABLE]

is finite;

(C2)

$A\neq 0$ and ${\cal F}:=\left\{z\in\mathrm{dom}\,h:Az\in S\right\}\neq\emptyset$ ;

(C3)

$f$ is a nonconvex differentiable function on $\mathrm{dom}\,h$ and there exists a scalar $L>0$ such that

[TABLE]

(C4)

${\varphi_{0}^{*}}:=\inf\{\varphi(z):z\in\Re^{n}\}>-\infty$ .

We make two remarks about the above assumptions. First, Lemma 4 in Appendix A.3 shows that (C1), (C3), and the additional assumption that $f$ be lower semicontinuous on $\mathrm{cl}\,(\mathrm{dom}\,h)$ imply (C4). Second, denoting $\underline{m}$ as the quantity (10) with $g=f$ , assumption (C3) implies that $\underline{m}\in(0,L]$ . Moreover, it is shown in Theorem 5.1 below that the smaller $\underline{m}$ is, the better the iteration complexity of the R-QP-AIPP method becomes.

We now discuss a notion of approximate stationary point for (51). Clearly, (51) is equivalent to the problem

[TABLE]

Moreover, a necessary condition for a point $(\hat{z},\hat{s})\in\mathrm{dom}\,h\times S$ to be a local minimum to the above problem is that there exists a multiplier $\hat{q}\in\Re^{p}$ such that

[TABLE]

Given a tolerance pair $(\hat{\rho},\hat{\eta})\in\Re^{2}_{++}$ , a triple $([\hat{z},\hat{s}],\hat{q},\hat{v})\in[\mathrm{dom}\,h\times S]\times\Re^{p}\times\Re^{n}$ is said to be a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51) if it satisfies

[TABLE]

Clearly, if $([\hat{z},\hat{s}],\hat{q},\hat{v})$ is a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51) with $(\hat{\rho},\hat{\eta})=(0,0)$ , then the pair $(\hat{z},\hat{s})$ and the multiplier $\hat{q}$ satisfy (55).

We now describe the quadratic penalty subproblem (parameterized by a penalty parameter) with respect to (51). Defining the quadratic penalty function $p_{S}:\Re^{p}\mapsto\Re_{+}$ as

[TABLE]

where

[TABLE]

for every $x\in\Re^{p}$ , the quadratic penalty subproblem parameterized by a penalty parameter $c>0$ with respect to (51) is

[TABLE]

We now make four remarks regarding (59). First, (3) is an instance of (59) in which $S=\{b\}$ . Second, when $c=0$ , the optimal value of (59) coincides with $\varphi^{*}_{0}$ in (C4), and hence there is no abuse of notation made here. Third, it is easily seen that

[TABLE]

where $\varphi^{*}$ is as in (51). Finally, (59) is a penalty subproblem involving only the original variable $z$ of formulation (51) rather than the one associated with (54) (constructed as in Section 1 with $Az=b$ replaced by $Az-s=0$ ), which involves the pair of variables $(z,s)$ .
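To make the penalty term concrete, the sketch below evaluates a quadratic penalty and its gradient for a box-shaped set $S$ . It assumes, since the displayed definitions are not reproduced here, that $p_{S}(x)$ is half the squared distance from $x$ to $S$ and that $\nabla p_{S}(x)=x-\Pi_{S}(x)$ , where $\Pi_{S}$ denotes the projection onto $S$ ; this is the standard gradient formula for half the squared distance to a closed convex set. The box and the helper names are hypothetical.

```python
from typing import List, Sequence


def project_box(x: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Projection onto the box S = {s : lo <= s <= hi}, componentwise clipping."""
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]


def penalty(x: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> float:
    """Assumed form p_S(x) = 0.5 * dist(x, S)**2."""
    s = project_box(x, lo, hi)
    return 0.5 * sum((xi - si) ** 2 for xi, si in zip(x, s))


def penalty_grad(x: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Assumed gradient formula: grad p_S(x) = x - Pi_S(x)."""
    s = project_box(x, lo, hi)
    return [xi - si for xi, si in zip(x, s)]


lo, hi = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
x = [2.0, -3.0, 0.5]
value = penalty(x, lo, hi)
grad = penalty_grad(x, lo, hi)

# Central finite differences as a sanity check of the gradient formula.
step = 1e-6
fd = []
for i in range(len(x)):
    xp = list(x); xp[i] += step
    xm = list(x); xm[i] -= step
    fd.append((penalty(xp, lo, hi) - penalty(xm, lo, hi)) / (2.0 * step))

print(value, grad, fd)
```

Points inside the box incur no penalty and have zero gradient, which is why only the violated components contribute.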

The following result shows how a $\hat{\rho}$ –approximate stationary point of (59) yields a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51) when the penalty parameter $c$ is sufficiently large.

Proposition 7

Let $(\hat{\rho},\hat{\eta})\in\Re_{++}^{2}$ and $c\geq 0$ be given and suppose that $(\hat{z},\hat{v})$ is a $\hat{\rho}$ –approximate stationary point of (59) as in (11) with $g=f+c\cdot(p_{S}\circ A)$ . Moreover, let $\underline{m}$ be as in (10) with $g=f$ and define

[TABLE]

where ${\varphi}^{*}$ and ${\varphi}^{*}_{0}$ are as in (51) and (C4), respectively. Then, the following statements hold:

(a)

for every $u,z\in\mathrm{dom}\,h$ , the pair $(g,M)=(g_{c},M_{c})$ satisfies (9);

(b)

the triple $([\hat{z},\hat{s}],\hat{q},\hat{v})$ satisfies the inclusions and the first inequality of (56) and

[TABLE]

(c)

if, in addition, the penalty parameter $c$ satisfies

[TABLE]

then $\|A\hat{z}-\hat{s}\|\leq\hat{\eta}$ , and hence $([\hat{z},\hat{s}],\hat{q},\hat{v})$ is a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51).

Proof

Throughout this proof, we will make use of the well known fact (see, for example, (beck2017first, , Theorems 6.39 & 6.60)) that $p_{S}$ is convex, differentiable, its gradient is $1$ –Lipschitz, and, for every $x\in\Re^{p}$ ,

[TABLE]

(a) This follows immediately from the definition of $g_{c}$ in (61), assumption (C3), and the fact that $\nabla p_{S}$ is 1–Lipschitz continuous.

(b) Using the definitions of $\hat{q}$ and $\hat{s}$ given in (62), and the fact that (65) at $x=A\hat{z}$ implies $c\nabla p_{S}(A\hat{z})=\hat{q}$ , observe that: (i) $c\nabla(p_{S}\circ A)\hat{z}=cA^{*}\nabla p_{S}(A\hat{z})=A^{*}\hat{q}$ ; and (ii) $\hat{q}\in N_{S}(\hat{s})$ . It now follows from the definition of a $\hat{\rho}$ –approximate stationary point of (59) with $g=f+c\cdot(p_{S}\circ A)$ and the previous observations that

[TABLE]

Hence, with the additional fact that $\|\hat{v}\|\leq\hat{\rho}$ from (11), it follows that the inclusions and first inequality of (56) hold. Next, observe that the convexity of $p_{S}$ and the first inclusion in (5.1) imply that $\hat{v}\in\nabla f(\hat{z})+\partial\left[h+c\cdot(p_{S}\circ A)\right](\hat{z})$ or equivalently,

[TABLE]

Considering (67) at any $u\in{\cal F}$ and using the fact that $p_{S}(Au)=0$ for any $u\in{\cal F}$ , the definition of $\underline{m}$ in (10), and the definitions of $p_{S}$ and $\hat{s}$ , we conclude that

[TABLE]

Taking the infimum over $u\in{\cal F}$ immediately implies (63).

(c) Using (64), the fact that $\varphi(\hat{z})\geq{\varphi}^{*}_{0}$ , and the definition of $T$ , it follows from part (b) that

[TABLE]

In view of the above proposition, we now outline a static penalty method for obtaining a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51). First, let $z_{0}\in\mathrm{dom}\,h$ be given and select a penalty parameter $c={\cal O}(\hat{\eta}^{-2})$ satisfying (64). Second, obtain a $\hat{\rho}$ –approximate stationary point $(\hat{z},\hat{v})$ of (59) using the R-AIPP method of Section 4 with starting point $z_{0}$ and inputs $M=M_{c}$ and $(g,h)=(g_{c},h)$ , which satisfy assumptions (A1)–(A3) in view of Proposition 7(a) and assumptions (C1) and (C3). Finally, compute the pair $(\hat{s},\hat{q})$ according to (62) and output the triple $([\hat{z},\hat{s}],\hat{q},\hat{v})$ , which is a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (51) in view of Proposition 7(c). Using (60) with $(c,\bar{c})=(0,c)$ , the definitions in (61), the fact that $c={\cal O}(\hat{\eta}^{-2})$ , and the complexity bound for the R-AIPP method described in Proposition 6 with $M=M_{c}$ , it is easy to see that the ACG iteration complexity of the outlined method is

[TABLE]

where $\xi_{0}:=\max\{1/\lambda_{0},4\underline{m}\}$ and the last quantity ignores any constants aside from the tolerances. A drawback of this static penalty method is that it requires in its first step the selection of a single parameter $c$ , which is generally difficult to obtain. This issue can be circumvented by considering a dynamic cold–started penalty method in which the static penalty method is repeated for a sequence of increasing values of $c$ and common starting point $z_{0}$ . It can be shown that the resulting cold–started dynamic penalty method has an ACG iteration complexity that is still on the same order as (68). Note that the bound (68) is actually ${\cal O}(\hat{\rho}^{-2}\hat{\eta}^{-1}\log_{1}^{+}\hat{\eta}^{-1})$ when $z_{0}\in{\cal F}$ (see (C2)) but our interest lies in the case where $z_{0}\notin{\cal F}$ since an initial point $z_{0}\in{\cal F}$ is generally not known.

The QP-AIPP method of WJRproxmet1 is a modified cold–started dynamic penalty method like the one just outlined, but which replaces the R-AIPP method called in step 2 of the static penalty method with the AIPP method of WJRproxmet1 . It has been shown in (WJRproxmet1, , Theorem 18) that its ACG iteration complexity bound for finding a $(\hat{\rho},\hat{\eta})$ –approximate stationary point of (1) is ${\cal O}(\hat{\rho}^{-2}\hat{\eta}^{-1})$ . This bound is established without assuming that $\mathrm{dom}\,h$ is bounded and is clearly better than the one in (68).

The next subsection considers a warm–started dynamic penalty method, similar to the one described immediately after Proposition 7, in which the input $z_{0}$ to the R-AIPP call for solving the next penalty subproblem is chosen to be the output $\hat{z}$ from the R-AIPP call for solving the current one. It is shown in Theorem 5.1 of Subsection 5.2 that its ACG iteration complexity is ${\cal O}(\hat{\rho}^{-2}\hat{\eta}^{-1}\log_{1}^{+}\hat{\eta}^{-1})$ , which is the same as the one for the QP-AIPP method up to a logarithmic factor. As a side remark, we note that although a warm–started version of the QP-AIPP method in WJRproxmet1 can be also considered, the aforementioned ${\cal O}(\hat{\rho}^{-2}\hat{\eta}^{-1})$ ACG iteration complexity bound was derived for its cold–started version.

5.2 The R-QP-AIPP method

The goal of this subsection is to describe the R-QP-AIPP method, i.e., the warm–started dynamic penalty method mentioned at the end of Subsection 5.1, and establish its corresponding ACG iteration complexity.

We start by describing the R-QP-AIPP method.

R-QP-AIPP method.

Input: a problem instance of the form in (51), a scalar $L>0$ , a tolerance pair $(\hat{\rho},\hat{\eta})\in\Re_{++}^{2}$ , an initial point $\hat{z}_{0}\in\mathrm{dom}\,h$ , a scalar $\lambda_{0}>0$ , and a pair of parameters $(\theta,\tau)\in(2,\infty)\times(0,\infty)$ ;

Output: a triple $([\hat{z},\hat{s}],\hat{q},\hat{v})\in[\mathrm{dom}\,h\times S]\times\Re^{p}\times\Re^{n}$ satisfying (56);

(0)

set $c_{0}:=L/\|A\|^{2}$ and $l=1$ ;

(1)

set $(c,z_{0}):=(c_{l-1},\hat{z}_{l-1})$ and

[TABLE]

call the R-AIPP method on (4) with inputs $\hat{\rho}$ , $M_{c}$ , $(g_{c},h)$ , $z_{0}$ , $\lambda_{0}$ , and $(\theta,\tau)$ , to obtain a $\hat{\rho}$ -approximate stationary point $(\hat{z},\hat{v})$ of (4), and set

[TABLE]

(2)

if the residual

[TABLE]

then return $([\hat{z},\hat{s}],\hat{q},\hat{v})=([\hat{z}_{l},\hat{s}_{l}],\hat{q}_{l},\hat{v}_{l})$ ; otherwise, set $c_{l}=2c_{l-1}$ , increment $l=l+1$ , and go to step 1.

Before giving some remarks about the above method, we discuss its general structure. Every loop of the R-QP-AIPP method invokes in its step 1 the R-AIPP method of Section 4 to compute a $\hat{\rho}$ -approximate stationary point of the current penalty subproblem (59). The latter method in turn uses the R-ACG algorithm of Section 3 as a subroutine in its implementation (see step 1 of the R-AIPP method). Moreover, step 1 of the R-QP-AIPP implements a warm–start strategy, namely, the input point $z_{0}$ of the current R-AIPP call is set to be the output point $\hat{z}_{l-1}$ of the previous R-AIPP call.
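The outer loop just described can be sketched as follows for a scalar problem. This is a hedged illustration rather than the method itself: `solve_penalized` is a hypothetical stand-in for the R-AIPP call, the multiplier estimate is taken to be $c(A\hat{z}-\hat{s})$ with $\hat{s}$ the projection of $A\hat{z}$ onto $S$ (an assumed form of the displayed definitions), and the toy instance uses the convex function $f(z)=(z-2)^{2}/2$ with $A=1$ and $S=\{0\}$ only because its penalized subproblem has the closed-form minimizer $2/(1+c)$ .

```python
from typing import Callable, Tuple


def penalty_loop(
    solve_penalized: Callable[[float, float], float],
    project: Callable[[float], float],
    a: float,
    c0: float,
    z_init: float,
    eta_hat: float,
    max_loops: int = 60,
) -> Tuple[float, float, float, float, int]:
    """Return (z, s, q, final c, number of loops) for a scalar instance."""
    c, z = c0, z_init
    for loop in range(1, max_loops + 1):
        z = solve_penalized(c, z)        # step 1: warm start from the previous output
        s = project(a * z)               # closest point of S to A z
        residual = a * z - s
        if abs(residual) <= eta_hat:     # step 2: feasibility test
            return z, s, c * residual, c, loop
        c *= 2.0                         # otherwise double the penalty parameter
    raise RuntimeError("penalty parameter budget exhausted")


# Toy instance (hypothetical): minimize (z - 2)**2/2 + (c/2)*z**2 exactly.
# The exact solver ignores its warm start; an iterative solver would use it.
z, s, q, c, loops = penalty_loop(
    solve_penalized=lambda c, z_start: 2.0 / (1.0 + c),
    project=lambda y: 0.0,               # S = {0}
    a=1.0,
    c0=1.0,                              # c_0 = L/||A||^2 with L = 1 and A = 1
    z_init=5.0,
    eta_hat=0.1,
)
print(z, s, q, c, loops)
```

In this toy run the residual $2/(1+c)$ first drops below the tolerance $0.1$ at $c=32$ , that is, after six loops.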

We now make three remarks about the R-QP-AIPP method. First, it follows from Proposition 7(b) that, for every $l\geq 1$ , the triple $([\hat{z},\hat{s}],\hat{q},\hat{v})=([\hat{z}_{l},\hat{s}_{l}],\hat{q}_{l},\hat{v}_{l})$ satisfies the inclusions and the first inequality in (56). Second, since every loop of the R-QP-AIPP method doubles $c$ , the condition (64) will be eventually satisfied. Hence, in view of Proposition 7(c), the pair $(\hat{z},\hat{s})$ corresponding to this $c$ will satisfy the condition $\|A\hat{z}-\hat{s}\|\leq\hat{\eta}$ and the R-QP-AIPP method will stop in view of its stopping criterion in step 2. Finally, in view of the first and second remarks, we conclude that the R-QP-AIPP method outputs a triple $([\hat{z},\hat{s}],\hat{q},\hat{v})$ satisfying (56).

Before deriving the ACG iteration complexity of the R-QP-AIPP method, we note that the number of ACG iterations needed in the $(l+1)^{\rm th}$ execution of its step 1 depends on the quantity $\varphi_{c_{l}}(\hat{z}_{l})-{\varphi}^{*}_{c_{l}}$ (see the left–hand–side of (68) with $(c,z_{0})=(c_{l},\hat{z}_{l})$ ). The result below shows that the warm–start strategy in step 1 of the method together with the boundedness of $\mathrm{dom}\,h$ imply that the aforementioned quantity has an upper bound that is independent of the size of the parameter $c_{l}$ .

Lemma 3

Let $c_{0}$ and $\hat{z}_{0}$ be as in step 0 and the input of the R-QP-AIPP method, respectively, and define

[TABLE]

where ${\varphi}^{*}_{c}$ and $T$ are as in (59) and (62), respectively. Then, for every $l\geq 0$ , we have

[TABLE]

Proof

The case in which $l=0$ follows trivially from the definition of $Q_{0}$ in (69). Consider now the case in which $l\geq 1$. Note that $c_{l}=2c_{l-1}$ due to step 2 of the R-QP-AIPP method and that $(\hat{z}_{l},\hat{v}_{l})$ is a $\hat{\rho}$–approximate stationary point of (59) with $c=c_{l-1}$ due to step 1 of the R-QP-AIPP method. It now follows from the aforementioned remarks, the last inequality in (60) with $c=c_{l}$, and Proposition 7(b) with $(\hat{z},c)=(\hat{z}_{l},c_{l-1})$, that

[TABLE]

Grouping terms in the last expression together, using the definition of $Q_{0}$ given in (69), and the fact that $\varphi(\hat{z}_{l})\geq\varphi_{0}^{*}$ , we conclude that

[TABLE]

Combining (71) and (72) yields (70).

The following result establishes the iteration complexity of the R-QP-AIPP method with respect to the inputs $L,\lambda_{0},$ and $z_{0}$ , the quantity $\underline{m}$ in (10) with $g=f$ , and the tolerance pair $(\hat{\rho},\hat{\eta})$ .

Theorem 5.1

Given a tolerance pair $(\hat{\rho},\hat{\eta})\in\Re_{+}^{2}$ , define

[TABLE]

where $T$ is given in (62). Then, defining $\xi_{0}:=\max\{1/\lambda_{0},4\underline{m}\}$ , the R-QP-AIPP method outputs a $(\hat{\rho},\hat{\eta})$ –approximate stationary point $([\hat{z},\hat{s}],\hat{q},\hat{v})$ of (51) in at most

[TABLE]

ACG iterations, where $Q_{0}$ is as in (69).

Proof

Define $T_{\hat{\eta}}:=T/\hat{\eta}^{2}$ and let $\bar{l}$ be the smallest index such that $c_{\bar{l}-1}\geq T_{\hat{\eta}}$ . Since the R-QP-AIPP invokes the R-AIPP method with $(M,g)=(M_{c_{l-1}},g_{c_{l-1}})$ , it follows from Lemma 3 and Proposition 6, with $M=M_{c_{l-1}}$ , that the total number of ACG iterations at the $l^{\rm th}$ iteration of the R-QP-AIPP method is on the order of

[TABLE]

Hence, the R-QP-AIPP method stops in a total number of ACG iterations bounded above by the sum of the quantity in (75) over $l=1,\ldots,\bar{l}$ .

We now focus on simplifying some of the quantities in the aforementioned sum. Using the fact that $L=c_{0}\|A\|^{2}$ , we obtain the bound

[TABLE]

Now, if $\bar{l}=1$ , then the above inequality implies that $M_{c_{\bar{l}-1}}\leq 2c_{0}\|A\|^{2}=2L={\cal O}\left(\Xi_{\hat{\eta}}\right)$ . Assume then that $\bar{l}\geq 2$ . Observe that the definition of $\bar{l}$ implies that $2^{\bar{l}-1}c_{0}\leq 2T_{\hat{\eta}}$ or, equivalently, $\sqrt{c_{0}}\sqrt{2}^{\bar{l}}\leq 2\sqrt{T_{\hat{\eta}}}$ . Combining the previous inequality with (76), we conclude that

[TABLE]

and also

[TABLE]

It now follows from (75), (77), and (78) that the R-QP-AIPP method stops in a total number of ACG iterations bounded by the quantity in (74).
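To see why summing over $l=1,\ldots,\bar{l}$ costs only a constant factor more than the last call, the following worked bound may help. It is a sketch that assumes $M_{c}=L+c\|A\|^{2}$ (consistent with the bound $M_{c_{0}}\leq 2c_{0}\|A\|^{2}=2L$ used above for $\bar{l}=1$) and that tracks only a $\sqrt{M_{c_{l-1}}}$ factor, which is an assumption about the shape of the per-call bound rather than a restatement of (75).

```latex
\begin{align*}
M_{c_{l-1}} &= L + 2^{l-1}c_{0}\|A\|^{2} = \left(1+2^{l-1}\right)L \leq 2^{l}L, \\
\sum_{l=1}^{\bar{l}} \sqrt{M_{c_{l-1}}}
  &\leq \sqrt{L}\sum_{l=1}^{\bar{l}} \sqrt{2}^{\,l}
   = \sqrt{L}\,\frac{\sqrt{2}\left(\sqrt{2}^{\,\bar{l}}-1\right)}{\sqrt{2}-1}
   \leq (2+\sqrt{2})\,\sqrt{L}\,\sqrt{2}^{\,\bar{l}}, \\
\sqrt{L}\,\sqrt{2}^{\,\bar{l}}
  &= \|A\|\sqrt{c_{0}}\,\sqrt{2}^{\,\bar{l}} \leq 2\|A\|\sqrt{T_{\hat{\eta}}}
  \qquad (\bar{l}\geq 2).
\end{align*}
```

The geometric growth of $c_{l}$ is what makes the sum comparable to its last term, so the doubling schedule does not change the order of the bound.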

The statement that $([\hat{z},\hat{s}],\hat{q},\hat{v})$ is a $(\hat{\rho},\hat{\eta})$ –approximate stationary point follows from Proposition 7(b) and the termination condition in step 2 of the R-QP-AIPP method.

We now make two remarks about the complexity bound in (74). First, in terms of the tolerance pair $(\hat{\rho},\hat{\eta})$, it is ${\cal O}(\hat{\rho}^{-2}\hat{\eta}^{-1}\log_{1}^{+}\hat{\eta}^{-1})$, which improves upon the complexity in (68) by a $\Theta(\hat{\eta}^{-2})$ factor. Second, unless $\lambda_{0}$ is large or $\underline{m}$ is small, the first term in (74) dominates the second one.

We now end this section by discussing some possible choices of the initial stepsize $\lambda_{0}$ and how the corresponding R-QP-AIPP instances compare to the QP-AIPP method of WJRproxmet1. First, recall that the QP-AIPP method requires the knowledge of an upper bound $m$ on $\underline{m}$ such that $m={\cal O}(L)$, and remark that, under the same assumptions of this paper, it can be shown using (WJRproxmet1, Theorem 18) that its ACG iteration complexity is

[TABLE]

Now, if $m$ as above is also known to the R-AIPP and the input $\lambda_{0}$ is set to $1/(4m)$ , then the ACG iteration complexity (74) reduces to

[TABLE]

which is the same as (79) up to a logarithmic factor. On the other hand, if $\lambda_{0}$ is chosen so that $1/\lambda_{0}={\cal O}(\underline{m})$, then (74) reduces to

[TABLE]

whose dominant first term is as good as the dominant first term in (79) when $\sqrt{\underline{m}}\log_{1}^{+}(\lambda_{0}\Xi_{\hat{\eta}})={\cal O}(\sqrt{m})$ .

6 Numerical experiments

This section presents computational results that highlight the performance of the R-AIPP and R-QP-AIPP methods. It contains three subsections. The first subsection compares three variants of the R-AIPP method against three state-of-the-art nonconvex composite optimization algorithms. The second subsection uses the six algorithms in the first subsection as subroutines in a quadratic penalty method similar to the one in Section 5. More specifically, given an algorithm $A$ out of the six algorithms in the first subsection, a corresponding quadratic penalty method is considered in which steps 0 to 2 of the R-QP-AIPP method in Section 5 are executed with algorithm $A$ replacing the R-AIPP method. The third subsection presents a summary of the numerical experiments.

We first describe the three different R-AIPP variants considered. While the second variant does not assume knowledge of an upper bound $m$ on the quantity $\underline{m}$ in (10), the first and third variants do in order to determine their initial stepsize $\lambda_{0}$ . More specifically, the first variant, referred to as R-AIPPc, is the R-AIPP method with initial stepsize chosen to be $\lambda_{0}=0.9/(2m)$ . As opposed to the two algorithms explained below, which can adaptively change $\lambda_{k}$ between iterations, this algorithm is a constant stepsize method (see Lemma 2 and the paragraph following it). The second variant, referred to as R-AIPPv1, is the R-AIPP method with initial stepsize chosen to be $\lambda_{0}=1$ . Since $\lambda_{0}$ is relatively large in the experiments considered, $\lambda$ is halved in some of its outer iterations. The third variant, referred to as R-AIPPv2, is a variant of the R-AIPP method with initial stepsize chosen to be $\lambda_{0}=1/(5m)$ . This variant modifies the R-AIPP method by adding conditions that allow the stepsize $\lambda$ to increase between subproblems. More specifically, the R-AIPPv2 method doubles the value of $\lambda$ at the end of iteration $k$ when: (a) $\lambda$ has never been halved in step 1 or 2 and (b) the number of inner iterations performed by the R-ACG algorithm in step 1 is less than 250. All R-AIPP variants are run with $\theta=4$ , a problem–specific value of $\tau$ , and adaptively estimate the constant $\widetilde{M}$ that is used in each iteration of the R-ACG algorithm.
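The stepsize rules of the three variants can be summarized in a short sketch. The function names and the `ever_halved` flag are hypothetical bookkeeping introduced here for illustration; only the numerical constants come from the description above, and the halving logic of steps 1 and 2 of the R-AIPP method is not reproduced.

```python
def initial_stepsize(variant, m=None):
    """Initial stepsize lambda_0 for the three variants described in the text."""
    if variant == "R-AIPPc":
        return 0.9 / (2.0 * m)      # constant-stepsize variant, needs m
    if variant == "R-AIPPv1":
        return 1.0                  # no knowledge of m required
    if variant == "R-AIPPv2":
        return 1.0 / (5.0 * m)      # needs m, may later increase lambda
    raise ValueError(f"unknown variant: {variant}")

def next_stepsize_v2(lam, ever_halved, inner_iters, max_inner=250):
    """R-AIPPv2 increase rule: double lambda only if it was never halved
    and the last R-ACG call used fewer than max_inner inner iterations."""
    if not ever_halved and inner_iters < max_inner:
        return 2.0 * lam
    return lam
```

For example, `next_stepsize_v2(0.5, ever_halved=False, inner_iters=100)` doubles the stepsize, while the same call with `ever_halved=True` leaves it unchanged.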

We now make three remarks about the above R-AIPP variants and the AIPP method of WJRproxmet1. First, while both the R-AIPPc and AIPP methods choose the stepsizes $\{\lambda_{k}\}$ to be constant, the former method differs from the latter one in that it uses a more relaxed criterion, i.e., (17) and (18), for solving the $k^{\rm th}$ prox subproblem (6). Moreover, the limited numerical experiments in Appendix A.4 show that this relaxation drastically improves upon the efficiency of the AIPP method, regardless of the magnitude of the ratio $M/m$. As we believe that this effect would be observed in the other problem instances of this section, we choose not to include the AIPP method as part of our suite of benchmark algorithms for the sake of brevity. Second, the R-AIPPv1 and R-AIPPv2 methods differ from the R-AIPPc method in that they permit the stepsizes $\{\lambda_{k}\}$ to be significantly larger than the constant ones chosen for the R-AIPPc method. As will be observed in the numerical experiments below, this can drastically improve the efficiency of the adaptive stepsize R-AIPP variants. Third, in view of the descriptions of the R-AIPP variants in the previous paragraph, both the R-AIPPc and R-AIPPv1 methods are instances of the R-AIPP method while the R-AIPPv2 method is not. However, the R-AIPPv2 method is clearly an instance of the GD framework, and hence an analysis similar to the one in Section 4 may be used to establish its ACG iteration complexity. For the sake of brevity, we omit its analysis in this paper.

We now describe the three other nonconvex composite optimization algorithms considered. The first algorithm is an implementation of the unified problem-parameter free accelerated gradient (UPFAG) method that is proposed and analyzed in Ghadimi2019. The particular implementation considered is the UPFAG-fullBB method, which utilizes a Barzilai–Borwein type stepsize selection strategy and is described in (Ghadimi2019, Section 4). Its input parameters include $(\gamma_{1},\gamma_{2},\gamma_{3})=(0.4,0.4,1.0)$ and $(\delta,\sigma)=(10^{-2},10^{-10})$. The second algorithm is an implementation of the NC-FISTA method in liang2019fistatype. The particular implementation considered uses input parameters $(\xi,\lambda)=(1.05m,0.99/M)$. The third algorithm is an implementation of the accelerated gradient (AG) method that is proposed and analyzed in nonconv_lan16. The particular implementation considered is Algorithm 2, which is described in (nonconv_lan16, Section 2).

Finally, we state some additional details about the numerical experiments. First, for each linearly–set–constrained problem of the form given in (51), the quadratic penalty method used to solve it starts with the initial penalty parameter chosen to be $c_{0}=\max\{10^{-10},(1000m-L)/\|A\|^{2}\}$. Second, each algorithm is run with a time limit of 4000 seconds. If an algorithm does not terminate with a solution for a particular problem instance, we do not report any details about its iteration count or function value at the point of termination, and the runtime for that instance is marked with a [*] symbol. Third, the iterations listed in the tables of this section include backtracking iterations if a parameter line search method is used as part of the algorithm. Finally, all algorithms described at the beginning of this section are implemented in MATLAB 2019a and are run on Linux 64-bit machines, each containing Xeon E5520 processors and at least 8 GB of memory.

6.1 Unconstrained problems

This subsection examines the performance of the R-AIPP method as a nonconvex composite optimization solver for solving problems of the form given in (4). Given a function pair $(g,h)$ satisfying assumptions (A1)–(A3) with $\phi=g+h$ , tolerance $\hat{\rho}>0$ , and an initial point $z_{0}\in\mathrm{dom}\,h$ , each algorithm seeks a pair $(\hat{z},\hat{v})$ satisfying

[TABLE]

Two problems are considered, namely: (i) the quadratic matrix problem; and (ii) the support vector machine problem in Ghadimi2019 .

All methods that terminated within 4000 seconds converged to the same objective value, which, for each table in this subsection, is given in a column labeled $\phi(\hat{z})$ . The bold numbers in each of the aforementioned tables highlight the algorithm that performed the most efficiently in terms of iteration count or total runtime.

6.1.1 Quadratic matrix problem

Given a pair of dimensions $(l,n)\in\mathbb{N}^{2}$ , scalar pair $(\alpha_{1},\alpha_{2})\in\Re_{++}^{2}$ , linear operators ${\cal B}:S_{+}^{n}\mapsto\Re^{n}$ and ${\cal C}:S_{+}^{n}\mapsto\Re^{l}$ defined by

[TABLE]

for matrices $\{B_{j}\}_{j=1}^{n},\{C_{i}\}_{i=1}^{l}\subseteq\Re^{n\times n}$ , positive diagonal matrix $D\in\Re^{n\times n}$ , and vector $d\in\Re^{l}$ , this sub–subsection considers the following quadratic matrix (QM) problem:

[TABLE]

where $P_{n}=\{z\in S_{+}^{n}:\operatorname*{tr}z=1\}$ denotes the $n$ –dimensional spectraplex.

We now describe the experiment parameters for the instances considered. First, the dimensions were set to be $(l,n)=(50,200)$, with only 2.5% of the entries of the submatrices $B_{j}$ and $C_{i}$ being nonzero. Second, the entries of $B_{j},C_{i}$, and $d$ (resp., $D$) are generated by sampling from the uniform distribution ${\cal U}[0,1]$ (resp., ${\cal U}[1,1000]$). Third, the initial starting point is $z_{0}=I_{n}/n$, where $I_{n}$ is the $n$-dimensional identity matrix. Fourth, with respect to the termination criterion (82), the inputs, for every $z\in S_{+}^{n}$, are

[TABLE]

Fifth, the R-AIPP variants used a parameter value of $\tau=10000$ . Finally, each problem instance considered is based on a specific curvature pair $(m,M)\in\Re_{++}^{2}$ for which the scalar pair $(\alpha_{1},\alpha_{2})\in\Re_{++}^{2}$ is selected so that $M=\lambda_{\max}(\nabla^{2}g)$ and $-m=\lambda_{\min}(\nabla^{2}g)$ .
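The data generation described above can be sketched as follows. This is an illustrative NumPy sketch rather than the MATLAB code used in the experiments: the helper names are hypothetical, sparsity is imposed with an independent Bernoulli mask (so the fraction of nonzeros is only approximately 2.5%), and the choice of $(\alpha_{1},\alpha_{2})$ from the curvature pair $(m,M)$ is not reproduced.

```python
import numpy as np

def sparse_uniform(rng, shape, density):
    """Dense array whose entries are U[0,1] with probability `density`, else 0."""
    mask = rng.random(shape) < density
    return rng.random(shape) * mask

def make_qm_data(l=50, n=200, density=0.025, seed=0):
    """Random data for a quadratic matrix instance (sketch, see assumptions above)."""
    rng = np.random.default_rng(seed)
    B = sparse_uniform(rng, (n, n, n), density)    # B[j] plays the role of B_j
    C = sparse_uniform(rng, (l, n, n), density)    # C[i] plays the role of C_i
    D = np.diag(rng.uniform(1.0, 1000.0, size=n))  # positive diagonal matrix
    d = rng.random(l)                              # vector d with U[0,1] entries
    z0 = np.eye(n) / n                             # starting point I_n / n
    return B, C, D, d, z0
```

Note that $z_{0}=I_{n}/n$ is positive semidefinite with unit trace, so it lies in the spectraplex $P_{n}$.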

We now present the numerical tables for this set of problem instances. We start with instances in which $m$ is fixed.

We now present instances where $m=M$ .

6.1.2 Support vector machine problem

Given a pair of dimensions $(n,k)\in\mathbb{N}^{2}$ , matrix $U\in\Re^{n\times k},$ and vector $v\in\{-1,+1\}^{n},$ this sub–subsection considers the following (sigmoid) support vector machine (SVM) problem

[TABLE]

where $u_{i}$ denotes the $i^{\rm th}$ column of $U$ .

We now describe the experiment parameters for the instances considered. First, the entries of $U$ are generated by sampling from the uniform distribution ${\cal U}[0,1]$ , with only 5% of the entries being nonzero, and $v=\mathrm{sgn}(U^{T}x)$ where the entries of $x$ are sampled from the uniform distribution over the $k$ –dimensional ball centered at 0 with radius 50. Second, the initial starting point is $z_{0}=0$ . Third, the curvature parameters for each problem instance are $m=M=(4\sqrt{3}\|U\|_{F}^{2})/(9k)+1/k.$ Fourth, with respect to the termination criterion (82), the inputs, for every $z\in\Re^{n}$ , are

[TABLE]

Fifth, the R-AIPP variants used a parameter value of $\tau=5000$ . Finally, each problem instance considered is based on a specific dimension pair $(n,k)\in\mathbb{N}^{2}$ .

We now present the numerical tables for this set of problem instances.

6.2 Linearly constrained problems

This subsection examines the performance of the R-QP-AIPP method as a nonconvex linearly–set–constrained composite optimization solver for solving problems of the form given in (51). Given a linear operator $A$ , convex set $S$ , function pair $(f,h)$ satisfying assumptions (C1)–(C3), tolerance pair $(\hat{\rho},\hat{\eta})\in\Re_{++}^{2}$ , and an initial point $z_{0}\in\mathrm{dom}\,h$ , each algorithm seeks a triple $([\hat{z},\hat{s}],\hat{p},\hat{v})$ satisfying

[TABLE]

Three problems are considered, namely: (i) the linearly–constrained quadratic matrix problem; (ii) the sparse principal component analysis problem in NIPS2014_5615 ; and (iii) the bounded matrix completion problem in yao2017efficient .

The bold numbers in each of the tables in this subsection highlight the algorithm that performed the most efficiently in terms of iteration count or total runtime.

6.2.1 Linearly–constrained quadratic matrix problem

Given a pair of dimensions $(l,n)\in\mathbb{N}^{2}$ , scalar pair $(\alpha_{1},\alpha_{2})\in\Re_{++}^{2}$ , linear operators ${\cal A}:S_{+}^{n}\mapsto\Re^{l}$ , ${\cal B}:S_{+}^{n}\mapsto\Re^{n}$ , and ${\cal C}:S_{+}^{n}\mapsto\Re^{l}$ defined by

[TABLE]

for matrices $\{A_{i}\}_{i=1}^{l},\{B_{j}\}_{j=1}^{n},\{C_{i}\}_{i=1}^{l}\subseteq\Re^{n\times n}$ , positive diagonal matrix $D\in\Re^{n\times n}$ , and vector pair $(b,d)\in\Re^{l}\times\Re^{l}$ , this sub–subsection considers the following linearly–constrained quadratic matrix (LCQM) problem:

[TABLE]

where $P_{n}=\{z\in S_{+}^{n}:\operatorname*{tr}z=1\}$ denotes the $n$ –dimensional spectraplex.

We now describe the experiment parameters for the instances considered. First, the dimensions were set to be $(l,n)=(50,200)$, with only 1.0% of the entries of the submatrices $A_{i},B_{j},$ and $C_{i}$ being nonzero. Second, the entries of $A_{i},B_{j},C_{i},b$, and $d$ (resp., $D$) were generated by sampling from the uniform distribution ${\cal U}[0,1]$ (resp., ${\cal U}[1,1000]$). Third, the initial starting point $z_{0}$ was chosen to be a random point in $S_{+}^{n}$. More specifically, three unit vectors $\nu_{1},\nu_{2},\nu_{3}\in\Re^{n}$ and three scalars $e_{1},e_{2},e_{3}\in\Re_{+}$ are first generated by sampling vectors $\widetilde{\nu}_{i}\sim{\cal U}^{n}[0,1]$ and scalars $\widetilde{e}_{i}\sim{\cal U}[0,1]$ and setting $\nu_{i}=\widetilde{\nu}_{i}/\|\widetilde{\nu}_{i}\|$ and $e_{i}=\widetilde{e}_{i}/(\sum_{j=1}^{3}\widetilde{e}_{j})$ for $i=1,2,3$. The initial iterate for the first subproblem is then set to $z_{0}=\sum_{i=1}^{3}e_{i}\nu_{i}\nu_{i}^{T}$. Fourth, with respect to the termination criterion (82), the inputs, for every $z\in S_{+}^{n}$, are

[TABLE]

Fifth, the R-AIPP variants used a parameter value of $\tau=5000$ . Finally, each problem instance considered is based on a specific curvature pair $(m,M)\in\Re_{++}^{2}$ for which the scalar pair $(\alpha_{1},\alpha_{2})\in\Re_{++}^{2}$ is selected so that $M=\lambda_{\max}(\nabla^{2}f)$ and $-m=\lambda_{\min}(\nabla^{2}f)$ .
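The random starting point described above is a convex combination of three rank-one projectors, so it is positive semidefinite with unit trace. A minimal NumPy sketch (the function name is hypothetical and the random generator is an assumption; the experiments themselves were run in MATLAB) is:

```python
import numpy as np

def random_spectraplex_point(n, rng):
    """Return z0 = sum_i e_i * nu_i nu_i^T with unit vectors nu_i and weights summing to 1."""
    nu = rng.random((3, n))                           # rows sampled from U[0,1]^n
    nu /= np.linalg.norm(nu, axis=1, keepdims=True)   # normalize to unit vectors
    e = rng.random(3)                                 # scalars sampled from U[0,1]
    e /= e.sum()                                      # nonnegative weights summing to 1
    return sum(e[i] * np.outer(nu[i], nu[i]) for i in range(3))

z0 = random_spectraplex_point(10, np.random.default_rng(0))
```

Since $\operatorname{tr}(\nu_{i}\nu_{i}^{T})=\|\nu_{i}\|^{2}=1$, the trace of $z_{0}$ equals $\sum_{i}e_{i}=1$.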

We now present the numerical tables for this set of problem instances.

6.2.2 Sparse principal component analysis problem

Given integer $k$ , positive scalar pair $(\nu,b)\in\Re_{++}^{2}$ , and matrix $\Sigma\in S_{+}^{n}$ , this sub–subsection considers the following sparse principal component analysis (PCA) problem:

[TABLE]

where ${\cal F}^{k}=\{z\in S_{+}^{n}:0\preceq z\preceq I,\operatorname*{tr}z=k\}$ denotes the $k$–Fantope and $q_{\nu}$ is the minimax concave penalty (MCP) function given by

[TABLE]

We now describe the experiment parameters for the instances considered. First, the scalar parameters are chosen to be $(\nu,b)=(100,100,0.1)$ . Second, the matrix $\Sigma$ is generated according to an eigenvalue decomposition $\Sigma=P\Lambda P^{T}$ , based on a parameter pair $(s,k)$ , where $k$ is as in the problem description and $s$ is a positive integer. In particular, we choose $\Lambda=(100,1,...,1)$ , the first column of $P$ to be a sparse vector whose first $s$ entries are $1/\sqrt{s}$ , and the other entries of $P$ to be sampled randomly from the standard Gaussian distribution. Third, the initial starting point is $(\Pi_{0},\Phi_{0})=(D_{k},0)$ where $D_{k}$ is a diagonal matrix whose first $k$ entries are 1 and whose remaining entries are 0. Fourth, the curvature parameters for each problem instance are $m=M=1/b.$ Fifth, with respect to the termination criterion (82), the inputs, for every $(\Pi,\Phi)\in S_{+}^{n}\times\Re^{n\times n}$ , are

[TABLE]

Sixth, the R-AIPP variants used a parameter value of $\tau=100000$ . Finally, each problem instance considered is based on a specific parameter pair $(s,k)\in\mathbb{N}^{2}$ where $s$ is part of the process of generating $\Sigma$ (see the second description above).
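The construction of $\Sigma$ described above can be sketched as follows. The sketch rests on one interpretation that should be read as an assumption: the first column of $P$ is the sparse vector and all remaining columns are filled with standard Gaussian entries, with no orthogonalization step. The function name is hypothetical.

```python
import numpy as np

def make_sigma(n, s, rng, top=100.0):
    """Build Sigma = P diag(Lambda) P^T with Lambda = (top, 1, ..., 1)."""
    P = rng.standard_normal((n, n))   # remaining columns: standard Gaussian entries
    P[:, 0] = 0.0
    P[:s, 0] = 1.0 / np.sqrt(s)       # sparse first column with s nonzeros, unit norm
    Lam = np.ones(n)
    Lam[0] = top                      # leading eigenvalue parameter
    return (P * Lam) @ P.T            # scales column j of P by Lam[j], then multiplies by P^T

Sigma = make_sigma(8, 3, np.random.default_rng(0))
```

Because all entries of $\Lambda$ are positive, the resulting matrix is symmetric positive semidefinite regardless of how $P$ is sampled.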

We now present the numerical tables for this set of problem instances.

6.2.3 Bounded matrix completion problem

Given a dimension pair $(p,q)\in\mathbb{N}^{2}$ , positive scalar triple $(\beta,\mu,\theta)\in\Re_{++}^{3}$ , scalar pair $(u,l)\in\Re^{2}$ , matrix $A\in\Re^{p\times q}$ , and indices $\Omega$ , this sub–subsection considers the following bounded matrix completion (BMC) problem:

[TABLE]

where $\|\cdot\|_{*}$ denotes the nuclear norm, the function $P_{\Omega}$ is the linear operator that zeros out any entry not in $\Omega$ , the function $\sigma_{i}(X)$ denotes the $i^{\rm th}$ largest singular value of $X$ , and

[TABLE]

We now describe the experiment parameters for the instances considered. First, the matrix $A$ is the user–movie ratings data matrix of the MovieLens 100K dataset (containing 610 users and 9724 movies and available at https://grouplens.org/datasets/movielens/), the index set $\Omega$ is the set of nonzero entries in $A$, and the dimension pair is set to be $(p,q)=(610,9724)$. Second, the initial starting point was chosen to be $X_{0}=0$. Third, the curvature parameters for each problem instance are $m=2\beta\mu/\theta^{2}$ and $M=\max\left\{1,m\right\}$ and the bounds are set to $(l,u)=(0,5)$. Fourth, with respect to the termination criterion (82), the inputs, for every $X\in\Re^{p\times q}$, are

[TABLE]

Fifth, the R-AIPP variants used a parameter value of $\tau=1000$ . Finally, each problem instance considered is based on a specific parameter triple $(\beta,\mu,\theta)\in\Re_{++}^{3}$ .
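Two ingredients of this setup are simple enough to sketch directly: the sampling operator $P_{\Omega}$ and the curvature pair $(m,M)$. The function names below are hypothetical, and representing $\Omega$ as a boolean mask is an implementation choice made here for illustration.

```python
import numpy as np

def P_Omega(X, mask):
    """Zero out every entry of X whose index is not in Omega (mask is True on Omega)."""
    return np.where(mask, X, 0.0)

def bmc_curvatures(beta, mu, theta):
    """Curvature pair m = 2*beta*mu/theta^2 and M = max{1, m} stated in the text."""
    m = 2.0 * beta * mu / theta**2
    return m, max(1.0, m)

# Omega is the set of nonzero entries of the (toy) data matrix A.
A_toy = np.array([[0.0, 3.0], [4.0, 0.0]])
mask = A_toy != 0
X = np.ones((2, 2))
PX = P_Omega(X, mask)
```

With the bounds $(l,u)=(0,5)$, any entrywise box constraint on $X$ would be handled separately from $P_{\Omega}$, which only selects the observed entries.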

We now present the numerical tables for this set of problem instances.

6.3 Summary of the numerical experiments

All three variants of the R-AIPP method perform well (relative to the other methods) in the numerical experiments of this section. The R-AIPPv2 method, in particular, is the best performing method in a large proportion of both the unconstrained and constrained problem instances. A potential explanation is that the stepsizes $\{\lambda_{k}\}$ generated by this method may become significantly larger than the initial stepsize parameters ${\lambda_{0}}=1$ and ${\lambda_{0}}=0.9/(2m)$ used in the R-AIPPv1 and R-AIPPc methods, respectively, which in view of the third remark following Proposition 2, speeds up the convergence of the quantity $\min_{i\leq k}\|\hat{v}_{i}\|$ to zero.

Moreover, the adaptive stepsize R-AIPP variants, namely, the R-AIPPv1 and R-AIPPv2 methods, have been shown to perform well regardless of the size of the ratio $M/m$ (see, for example, Tables 1–4). This is a significant improvement over the AIPP method of WJRproxmet1 which has only been shown to perform well when the ratio $M/m$ is large (see, for example, Table 13).

7 Concluding remarks

Observing the arguments used in the proofs of Proposition 7, Lemma 3, and Theorem 5.1, it is straightforward to see that the assumption of $\mathrm{dom}\,h$ being bounded can be relaxed to assuming that the iterates $\{\hat{z}_{l}\}$ generated by the R-QP-AIPP method of Section 5 are bounded. Explicitly assuming that the iterates satisfy $\|\hat{z}_{l}\|\leq B$, for every $l\geq 1$ and some $B>0$, the resulting ACG iteration complexity of the R-QP-AIPP method is (74) with $Q_{0}$ replaced by the quantity

[TABLE]

where $c_{0}$ is as in step 0 of the method, $d_{0}:=\inf\{\|u-\hat{z}_{0}\|:u\in{\cal F}\}$, the quantity $\underline{m}$ is as in (10) with $g=f$, and the quantities $\hat{z}_{0},\varphi_{c},$ and $\varphi_{c}^{*}$ are from the input of the R-QP-AIPP method and (59). It should be noted, however, that we were not able to show that the iterates $\{\hat{z}_{l}\}$ are bounded. Hence, it is still an open problem to establish the iteration complexity of R-QP-AIPP when $\mathrm{dom}\,h$ is unbounded.

Note that the description of the R-AIPP (resp. R-QP-AIPP) method of Section 4 (resp. Section 5) does not actually require knowledge of an upper bound $m$ on the parameter $\underline{m}$ in (10). This is in contrast to the AIPP (resp. QP-AIPP) method of WJRproxmet1, which requires $m$ in order to establish its validity and iteration complexity. In addition, one could consider an R-AIPP (resp. R-QP-AIPP) variant in which the quantity $M$ (resp. $L$) is adaptively inferred from its iterates rather than requiring knowledge of its value beforehand. While for the sake of brevity we omit the formal description and analysis of such a variant in this paper, we conjecture that the iteration complexity of the R-AIPP (resp. R-QP-AIPP) variant is as in (44) (resp. (74)) with $M$ (resp. $L$) replaced with a quantity that lower bounds it, e.g., the maximum of the lower estimates of $M$ (resp. $L$) which are inferred by the generated iterates.

Appendix A Appendix

This appendix contains proofs and statements of several technical results used in the main body of the paper, together with an additional numerical comparison. It contains four subsections. The first subsection consists of proofs about the refinement procedure of Section 2; the second subsection consists of proofs about the R-ACG algorithm of Section 3; the third subsection consists of technical results related to Section 5; and the fourth subsection compares the R-AIPPc method with the AIPP method.

A.1 Properties of the refinement procedure

Proof (of Proposition 1)

It follows from (WJRproxmet1, Lemma 19) with $(f,h,L)=(f_{\lambda},h_{\lambda},M_{\lambda})$ that $\Delta\geq 0$ and

[TABLE]

Dividing by $\lambda$ and rearranging terms yields

[TABLE]

Adding $\nabla g(\hat{z})$ to both sides and using the definition of $\hat{v}$ gives

[TABLE]

which is the inclusion in (16).

We now bound $\lambda\|\hat{v}\|$. Since (WJRproxmet1, Lemma 19) implies that $\|z-\hat{z}\|\leq\sqrt{2M_{\lambda}^{-1}\Delta}$ and $\nabla g$ is $M$–Lipschitz continuous, we have

[TABLE]

which is the inequality in (16).

A.2 Properties of the R-ACG algorithm

Proof (of Proposition 5(a))

Let $\ell$ denote the quantity in (39). Assume that the R-ACG algorithm has performed $\ell$ iterations without declaring failure. In view of step 2 of the R-ACG algorithm, it follows that both (34) and (35) hold for every $1\leq j\leq\ell$. We will show that it must stop successfully at the end of the $\ell^{\rm th}$ iteration, and hence that the conclusion of the proposition holds. Indeed, note that (38), (39), and the fact that $\log(1+t)\leq t$ for all $t\geq 0$ imply that

[TABLE]

Combining the triangle inequality, (34), the fact that $2/A_{\ell}\leq 1/C$ and $(2/A_{\ell})^{2}<2/A_{\ell}<1$ from (86), and the relation $(a+b)^{2}\leq 2(a^{2}+b^{2})$ for all $a,b\in\Re$ , we obtain

[TABLE]

On the other hand, using the triangle inequality and the fact that $(a+b)^{2}\leq(1+s)a^{2}+(1+1/s)b^{2}$ for every $(a,b,s)\in\Re\times\Re\times\Re_{++}$ (under the choice of $s=1/(\sqrt{C}-1)$), we obtain

[TABLE]
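The elementary inequality invoked in this step follows from completing a square. The short verification below is included as a reader aid.

```latex
\begin{align*}
(1+s)a^{2}+\left(1+\tfrac{1}{s}\right)b^{2}-(a+b)^{2}
  = s\,a^{2}-2ab+\tfrac{1}{s}\,b^{2}
  = \left(\sqrt{s}\,a-\tfrac{1}{\sqrt{s}}\,b\right)^{2}\geq 0,
  \qquad s>0.
\end{align*}
```

Taking $s=1$ recovers the relation $(a+b)^{2}\leq 2(a^{2}+b^{2})$ used earlier in the proof.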

Combining the previous estimates, we then conclude that

[TABLE]

which, after a simple algebraic manipulation, easily implies that

[TABLE]

Using the first term in the maximum of (40) together with the second inequality of (88) immediately implies that (36) holds with $j=\ell$ . To show that (37) holds at $j=\ell$ , observe that the definition of $\psi$ in (27), (35) with $j=\ell$ , the second inequality of (88), and the second term in the maximum of (40) imply that

[TABLE]

A.3 Results related to Section 5

Lemma 4

Assume that $f,h:\Re^{n}\mapsto(-\infty,\infty]$ satisfy assumptions (C1) and (C3) in Section 5, and that, in addition, $f$ is lower semicontinuous on $\mathrm{cl}\,(\mathrm{dom}\,h)$ . Then, $\varphi:=f+h$ is a proper lower semicontinuous function which has a global minimum over $\Re^{n}$ .

Proof

Suppose $\bar{z}\in\Re^{n}\backslash\mathrm{cl}\,(\mathrm{dom}\,h)$ . Since $\mathrm{cl}\,(\mathrm{dom}\,h)$ is closed, there exists $\varepsilon>0$ such that $h(u)=\infty$ for every $u\in\Re^{n}\backslash\mathrm{cl}\,(\mathrm{dom}\,h)$ satisfying $\|u-\bar{z}\|<\varepsilon$ . Hence, $\liminf_{u\to\bar{z}}\varphi(u)=\infty=\varphi(\bar{z})$ . Now suppose $\bar{z}\in\mathrm{cl}\,(\mathrm{dom}\,h)$ . By the lower semicontinuity of $f$ and $h$ we have

[TABLE]

and, since $f$ is differentiable on $\mathrm{dom}\,h$, the function $\varphi$ is proper lower semicontinuous with $\mathrm{dom}\,\varphi=\mathrm{dom}\,h$. The last statement of the lemma follows from the well-known fact that the infimum of a lower semicontinuous function over a bounded set, namely, $\mathrm{dom}\,\varphi$, is always attained.

A.4 Comparison with the AIPP method

This subsection presents some computational results that compare the AIPP method of WJRproxmet1 with the R-AIPPc method described at the beginning of Section 6. The main problem of interest for this subsection is the quadratic matrix problem described in Sub-subsection 6.1.1.

We now describe the particular implementation of the AIPP method used in this subsection, which differs from its description in WJRproxmet1 in two ways. First, its innermost subroutine, namely, the ACG method, stops immediately when a quadruple $(\lambda_{k},z_{k},v_{k},\varepsilon_{k})$ satisfying (22) is found. Second, for each iteration $k$ of the method, a triple $(\hat{z},\hat{v},\Delta)$ is generated from the refinement procedure in Section 2 by assigning $(\hat{z},\hat{v},\Delta)=RP(\lambda_{k},z_{k-1},z_{k},v_{k})$, and the method stops with the desired output when $\hat{v}$ satisfies condition (82).

All experiment parameters for the R-AIPPc method and the problem instances are as described in Sub-subsection 6.1.1, while the AIPP method uses a parameter input of $(\sigma,\lambda)=(0.3,1/(2m))$ for its results.

We now present the numerical tables for this set of problem instances.

Bibliography

[1] A. Beck. First-order methods in optimization, volume 25. SIAM, 2017.
[2] N.S. Aybat and G. Iyengar. A first-order smoothed penalty method for compressed sensing. SIAM J. Optim., 21(1):287–313, 2011.
[3] N.S. Aybat and G. Iyengar. A first-order augmented Lagrangian method for compressed sensing. SIAM J. Optim., 22(2):429–459, 2012.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[5] Y. Carmon, J.C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM J. Optim., 28(2):1751–1772, 2018.
[6] C. Cartis, N. Gould, and P. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim., 20(6):2833–2852, 2010.
[7] Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim., 24(4):1779–1814, 2014.
[8] D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program., pages 1–56, 2018.