Single-Forward-Step Projective Splitting: Exploiting Cocoercivity

Patrick R. Johnstone; Jonathan Eckstein

arXiv:1902.09025·math.OC·August 24, 2020

TL;DR

This paper introduces a novel single-forward-step projective splitting algorithm that efficiently exploits cocoercivity, enabling larger stepsizes and improved convergence in solving maximal monotone inclusions and convex optimization problems.

Contribution

The paper presents a new variant of projective splitting that processes cocoercive operators with a single forward step, matching the stepsize bounds of classical forward-backward splitting.

Findings

1. Allows larger stepsizes for cocoercive operators

2. Establishes a symmetry with classical splitting methods

3. Demonstrates competitive computational performance

Tables (4)

Table 1: Tuning parameters for the portfolio problem (ada3op does not have a tuning parameter).

	$δ_{r}$
	$0.5$	$0.8$	$1$	$1.5$
ps1fbt ( $γ$ )	$0.01$	$0.01$	$0.5$	$5$
ps2fbt ( $γ$ )	$0.1$	$0.1$	$10$	$10$
cp-bt ( $β^{- 1}$ )	$1$	$1$	$2$	$2$
tseng-pd ( $γ_{p d}$ )	$1$	$1$	$1$	$10$
frb-pd ( $γ_{p d}$ )	$1$	$1$	$10$	$10$

Table 2: For the portfolio problem, average running times in seconds and iterations (in parentheses) for each method until $c(x)<10^{-5}$ for all subsequent iterations, across 10 trials. The best time in each column is in bold.

	$δ_{r}$
	$0.5$	$0.8$	$1$	$1.5$
ps1fbt	3.6 (102)	4.7 (102)	16.3 (583)	8.5 (255.2)
ps2fbt	5.0 (151.1)	7.9 (155)	24.3 (523.4)	9.2 (222.9)
ada3op	5.3 (180.8)	9.2 (180.8)	6.8 (174.3)	3.4 (89.2)
cp-bt	6.2 (136)	8.3 (134.3)	11.8 (218.4)	5.6 (113.6)
tseng-pd	15.9 (387.1)	21 (387.8)	25.7 (525.3)	11.1 (245.4)
frb-pd	10.5 (559.9)	16.4 (560.4)	22.8 (1074.8)	6.3 (350.8)

Table 3: The number of nonzeros and nonzero groups in the solution, along with the training error, for each value of $\lambda$.

	$λ$ (breast cancer)			$λ$ (IBD)
	0.05	0.5	0.85	0.1	0.5	1.0
# Nonzeros	114	50	20	135	40	18
# Nonzero groups	16	7	3	13	4	2
Training error	0%	5%	35%	0%	5.5%	26.8%

Table 4: Tuning parameters for sparse group LR (ada3op does not have a tuning parameter).

	$λ$ (breast cancer)			$λ$ (IBD)
	0.05	0.5	0.85	0.1	0.5	1.0
ps1fbt ( $γ$ )	$0.05$	$10^{2}$	$10^{2}$	0.1	1	1
ps2fbt ( $γ$ )	$1$	$10^{2}$	$10^{5}$	1	1	1
cp-bt ( $β^{- 1}$ )	$10$	$10^{3}$	$10^{4}$	$10^{4}$	$10^{3}$	$10^{5}$
tseng-pd ( $γ_{p d}$ )	$10^{3}$	$10^{5}$	$10^{5}$	$10^{4}$	$10^{6}$	$10^{6}$
frb-pd ( $γ_{p d}$ )	$10^{3}$	$10^{5}$	$10^{5}$	$10^{4}$	$10^{6}$	$10^{6}$

Equations (266)

$$\min_{x\in\mathcal{H}_{0}}\sum_{i=1}^{n}\big(f_{i}(G_{i}x)+h_{i}(G_{i}x)\big),$$

$$0\in\sum_{i=1}^{n}G_{i}^{*}(A_{i}+B_{i})G_{i}z$$

$$L_{i}\langle B_{i}x_{1}-B_{i}x_{2},x_{1}-x_{2}\rangle\geq\|B_{i}x_{1}-B_{i}x_{2}\|^{2}$$

$$0\in\sum_{i=1}^{n}G_{i}^{*}T_{i}G_{i}z.$$

$$x=J_{\rho A}(t)\iff x+\rho a=t\text{ and }a\in Ax,$$

$$\operatorname{prox}_{\rho f}(t)=\operatorname*{arg\,min}_{z}\left\{\rho f(z)+\tfrac{1}{2}\|z-t\|^{2}\right\}.$$

$$G_{n}:\mathcal{H}_{n}\to\mathcal{H}_{n}$$

$$\mathcal{S}\triangleq\Big\{(z,w_{1},\ldots,w_{n-1})\in\boldsymbol{\mathcal{H}}\;\Big|\;(\forall i\in\{1,\ldots,n-1\})\;w_{i}\in T_{i}G_{i}z,\;-\sum_{i=1}^{n-1}G_{i}^{*}w_{i}\in T_{n}z\Big\}.$$

$$(z^{*},w_{1}^{*},\ldots,w_{n-1}^{*})\in\mathcal{S}.$$

$$\varphi_{k}(z,w_{1},\ldots,w_{n-1})$$

$$=\Big\langle z,\sum_{i=1}^{n}G_{i}^{*}y_{i}^{k}\Big\rangle+\sum_{i=1}^{n-1}\langle x_{i}^{k}-G_{i}x_{n}^{k},w_{i}\rangle-\sum_{i=1}^{n}\langle x_{i}^{k},y_{i}^{k}\rangle,$$

$$\varphi_{k}(p)$$

$$w_{n}\triangleq-\sum_{i=1}^{n-1}G_{i}^{*}w_{i},$$

$$\varphi_{k}(z,w_{1},\ldots,w_{n-1})$$

$$\varphi_{i,k}(z,w_{i})\triangleq\langle G_{i}z-x_{i}^{k},y_{i}^{k}-w_{i}\rangle.$$

$$x_{i}^{k}$$

$$y_{i}^{k}$$

$$x_{i}^{k}+\rho_{i}^{k}a_{i}^{k}$$

$$y_{i}^{k}$$

$$x_{i}^{k}+\rho_{i}a_{i}^{k}$$

$$b_{i}^{k}$$

$$y_{i}^{k}$$

$$t$$

$$x_{i}^{k}$$

$$a_{i}^{k}$$

$$x_{i}^{k}+\rho_{i}y_{i}^{k}=G_{i}z^{k}+\rho_{i}w_{i}^{k}\;:\;y_{i}^{k}\in T_{i}x_{i}^{k}$$

$$x_{i}^{k}+\frac{\rho_{i}}{\alpha_{i}}y_{i}^{k}=G_{i}z^{k}+\frac{\rho_{i}}{\alpha_{i}}w_{i}^{k}\;:\;y_{i}^{k}\in T_{i}x_{i}^{k}.$$

$$0$$

$$\implies 0$$

$$\tilde{B}_{i}v=B_{i}v+\frac{\alpha_{i}}{\rho_{i}}\Big(v-G_{i}z^{k}-\frac{\rho_{i}}{\alpha_{i}}w_{i}^{k}\Big).$$

$$x_{i}^{k}$$

$$=J_{\rho_{i}A_{i}}\Big(x_{i}^{k-1}-\rho_{i}\Big(B_{i}x_{i}^{k-1}+\frac{\alpha_{i}}{\rho_{i}}\Big(x_{i}^{k-1}-G_{i}z^{k}-\frac{\rho_{i}}{\alpha_{i}}w_{i}^{k}\Big)\Big)\Big)$$

$$=J_{\rho_{i}A_{i}}\Big((1-\alpha_{i})x_{i}^{k-1}+\alpha_{i}G_{i}z^{k}-\rho_{i}\big(B_{i}x_{i}^{k-1}-w_{i}^{k}\big)\Big),$$

$$\|(z,{\bf w})\|^{2}$$

$$\nabla\varphi_{k}=\Big(\sum_{i=1}^{n-1}G_{i}^{*}y_{i}^{k}+y_{n}^{k},\;x_{1}^{k}-G_{1}x_{n}^{k},\;x_{2}^{k}-G_{2}x_{n}^{k},\;\ldots,\;x_{n-1}^{k}-G_{n-1}x_{n}^{k}\Big).$$

Full text

Patrick R. Johnstone∗, Jonathan Eckstein∗

∗Department of Management Science and Information Systems, Rutgers Business School Newark and New Brunswick, Rutgers University. Contact: [email protected], [email protected]

Abstract

This work describes a new variant of projective splitting for solving maximal monotone inclusions and complicated convex optimization problems. In the new version, cocoercive operators can be processed with a single forward step per iteration. In the convex optimization context, cocoercivity is equivalent to Lipschitz differentiability. Prior forward-step versions of projective splitting did not fully exploit cocoercivity and required two forward steps per iteration for such operators. Our new single-forward-step method establishes a symmetry between projective splitting algorithms, the classical forward-backward splitting method (FB), and Tseng’s forward-backward-forward method (FBF). The new procedure allows for larger stepsizes for cocoercive operators: the stepsize bound is $2\beta$ for a $\beta$-cocoercive operator, the same bound as has been established for FB. We show that FB corresponds to an unattainable boundary case of the parameters in the new procedure. Unlike FB, the new method allows for a backtracking procedure when the cocoercivity constant is unknown. Proving convergence of the algorithm requires some departures from the prior proof framework for projective splitting. We close with some computational tests establishing competitive performance for the method.

1 Introduction

1.1 Problem Statement

For a collection of real Hilbert spaces $\{\mathcal{H}_{i}\}_{i=0}^{n}$ consider the finite-sum convex minimization problem:

$$\min_{x\in\mathcal{H}_{0}}\sum_{i=1}^{n}\big(f_{i}(G_{i}x)+h_{i}(G_{i}x)\big),\tag{1}$$

where every $f_{i}:\mathcal{H}_{i}\to(-\infty,+\infty]$ and $h_{i}:\mathcal{H}_{i}\to\mathbb{R}$ is closed, proper, and convex, every $h_{i}$ is also differentiable with $L_{i}$-Lipschitz-continuous gradients, and the operators $G_{i}:\mathcal{H}_{0}\to\mathcal{H}_{i}$ are linear and bounded. Under appropriate constraint qualifications, (1) is equivalent to the monotone inclusion problem of finding $z\in\mathcal{H}_{0}$ such that

$$0\in\sum_{i=1}^{n}G_{i}^{*}(A_{i}+B_{i})G_{i}z,\tag{2}$$

where all $A_{i}:\mathcal{H}_{i}\to 2^{\mathcal{H}_{i}}$ and $B_{i}:\mathcal{H}_{i}\to\mathcal{H}_{i}$ are maximal monotone and each $B_{i}$ is $L_{i}^{-1}$-cocoercive, meaning that it is single-valued and

$$L_{i}\langle B_{i}x_{1}-B_{i}x_{2},x_{1}-x_{2}\rangle\geq\|B_{i}x_{1}-B_{i}x_{2}\|^{2}$$

for some $L_{i}\geq 0$. (When $L_{i}=0$, $B_{i}$ must be a constant operator, that is, there is some $v_{i}\in\mathcal{H}_{i}$ such that $B_{i}x=v_{i}$ for all $x\in\mathcal{H}_{i}$.) In particular, if we set $A_{i}=\partial f_{i}$ (the subgradient map of $f_{i}$) and $B_{i}=\nabla h_{i}$ (the gradient of $h_{i}$), then the solution sets of the two problems coincide under a special case of the constraint qualification of [9, Prop. 5.3].

Defining $T_{i}=A_{i}+B_{i}$ for all $i$ , problem (2) may be written as

$$0\in\sum_{i=1}^{n}G_{i}^{*}T_{i}G_{i}z.\tag{3}$$

This more compact problem statement will be used occasionally in our analysis below.

1.2 Background

Operator splitting algorithms are an effective way to solve structured convex optimization problems and monotone inclusions such as (1), (2), and (3). Their defining feature is that they decompose a problem into a set of manageable pieces. Each iteration consists of relatively easy calculations confined to each individual component of the decomposition, in conjunction with some simple coordination operations orchestrated to converge to a solution. Arguably the three most popular classes of operator splitting algorithms are the forward-backward splitting (FB) [11], Douglas/Peaceman-Rachford splitting (DR) [26], and forward-backward-forward (FBF) [40] methods. Indeed, many algorithms in convex optimization and monotone inclusions are in fact instances of one of these methods. The popular Alternating Direction Method of Multipliers (ADMM), in its standard form, can be viewed as a dual implementation of DR [20].

Projective splitting is a relatively recent and currently less well-known class of operator splitting methods, operating in a primal-dual space. Each iteration $k$ of these methods explicitly constructs an affine “separator” function $\varphi_{k}$ for which $\varphi_{k}(p)\leq 0$ for every $p$ in the set $\mathcal{S}$ of primal-dual solutions. The next iterate $p^{k+1}$ is then obtained by projecting the current iterate $p^{k}$ onto the halfspace defined by $\varphi_{k}(p)\leq 0$, possibly with some over- or under-relaxation. Crucially, $\varphi_{k}$ is obtained by performing calculations that consider each operator $T_{i}$ separately, so that the procedures are indeed operator splitting algorithms. In the original formulations of projective splitting [18, 19], the calculation applied to each operator $T_{i}$ was a standard resolvent operation, also known as a “backward step”. Resolvent operations remained the only way to process individual operators as projective splitting was generalized to cover compositions of maximal monotone operators with bounded linear maps [1] (as in the $G_{i}$ in (3)) and block-iterative (incremental) or asynchronous calculation patterns [10, 17]. Convergence rate and other theoretical results regarding projective splitting may be found in [22, 23, 28, 29].

The algorithms in [39, 21] were the first to construct projective splitting separators by applying calculations other than resolvent steps to the operators $T_{i}$. In particular, [21] developed a procedure that could instead use two forward (explicit or gradient) steps for operators $T_{i}$ that are Lipschitz continuous. However, that result raised a question: if projective splitting can exploit Lipschitz continuity, can it further exploit the presence of cocoercive operators? Cocoercivity is in general a stronger property than Lipschitz continuity. However, when an operator is the gradient of a closed proper convex function (such as $h_{i}$ in (1)), the Baillon-Haddad theorem [2, 3] establishes that the two properties are equivalent: $\nabla h_{i}$ is $L_{i}$-Lipschitz continuous if and only if it is $L_{i}^{-1}$-cocoercive.
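As a small numerical illustration of this distinction (a sketch, not part of the paper), consider two operators chosen here purely as examples. The scalar map `sigmoid` is the gradient of the convex function $\log(1+e^{x})$ and is $1/4$-Lipschitz, so it should satisfy the cocoercivity inequality with $L=1/4$. The planar rotation `rotate` is monotone and $1$-Lipschitz but is not a gradient, and it fails the inequality for every $L$. All function names below are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Gradient of h(x) = log(1 + exp(x)); this map is (1/4)-Lipschitz."""
    return 1.0 / (1.0 + math.exp(-x))


def rotate(v):
    """Skew map (x, y) -> (-y, x): monotone and 1-Lipschitz, but not a gradient."""
    return (-v[1], v[0])


def gap_scalar(B, L, x1, x2):
    """L*<B x1 - B x2, x1 - x2> - |B x1 - B x2|^2 for a scalar operator B."""
    d = B(x1) - B(x2)
    return L * d * (x1 - x2) - d * d


def gap_plane(B, L, u, v):
    """The same quantity for an operator on R^2, with points stored as tuples."""
    bu, bv = B(u), B(v)
    d = (bu[0] - bv[0], bu[1] - bv[1])
    inner = d[0] * (u[0] - v[0]) + d[1] * (u[1] - v[1])
    return L * inner - (d[0] ** 2 + d[1] ** 2)


random.seed(0)
pairs = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]
# Smallest gap over the samples; nonnegative (up to rounding) when the inequality holds.
min_sigmoid_gap = min(gap_scalar(sigmoid, 0.25, a, b) for a, b in pairs)
# For the rotation the inner product vanishes, so the gap is -|u - v|^2 for any L.
rotation_gap = gap_plane(rotate, 1.0, (1.0, 0.0), (0.0, 0.0))
```

A nonnegative `min_sigmoid_gap` (up to rounding) is consistent with the Baillon-Haddad equivalence for gradients, while the negative `rotation_gap` shows a Lipschitz monotone operator that is not cocoercive.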

Operator splitting methods that exploit cocoercivity rather than mere Lipschitz continuity typically have lower per-iteration computational complexity and a larger range of permissible stepsizes. For example, both FBF and the extragradient (EG) method [25] only require Lipschitz continuity, but need two forward steps per iteration and limit the stepsize to $L^{-1}$, where $L$ is the Lipschitz constant. If one strengthens the assumption to $L^{-1}$-cocoercivity, one can instead use FB, which only needs one forward step per iteration and allows stepsizes that approach, but remain bounded away from, $2L^{-1}$. One departure from this pattern is the recently developed method of [31], which only requires Lipschitz continuity but uses just one forward step per iteration. While this property is remarkable, it should be noted that its stepsizes must be bounded by $(1/2)L^{-1}$, which is half the allowable stepsize for EG or FBF and just a fourth of FB’s stepsize range.
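The $2L^{-1}$ threshold for FB can be observed on a one-dimensional toy problem, $\min_{x}\lambda|x|+\tfrac{L}{2}(x-c)^{2}$, whose smooth part has an $L$-Lipschitz gradient. The following sketch is illustrative only: the problem data, the default parameter values, and the function names are assumptions made here, not taken from the paper.

```python
def soft_threshold(t, tau):
    """Proximal operator of tau*|x| (the backward step for the nonsmooth term)."""
    if t > tau:
        return t - tau
    if t < -tau:
        return t + tau
    return 0.0


def forward_backward(x0, rho, L=2.0, c=3.0, lam=1.0, iters=200):
    """Run FB on lam*|x| + (L/2)*(x - c)^2 with a fixed stepsize rho."""
    x = x0
    for _ in range(iters):
        grad = L * (x - c)                                # forward (gradient) step
        x = soft_threshold(x - rho * grad, rho * lam)     # backward (prox) step
    return x


# With L = 2 the threshold is 2/L = 1. The minimizer is c - lam/L = 2.5.
x_small_step = forward_backward(0.0, rho=0.9)   # stepsize below 2/L
x_large_step = forward_backward(0.0, rho=1.1)   # stepsize above 2/L
```

In this example the run with `rho=0.9` settles at the minimizer $2.5$, whereas the run with `rho=1.1` does not settle there.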

Much like EG and FBF, the projective splitting computation in [21] requires Lipschitz continuity and two forward steps per iteration, and limits the stepsize to be less than $L^{-1}$ (when not using backtracking; if backtracking is used, then all three of these methods can converge under weaker local continuity assumptions). Considering the relationship between FB and FBF/EG leads to the following question: is there a variant of projective splitting which converges under the stronger assumption of $L^{-1}$-cocoercivity, while processing each cocoercive operator with a single forward step per iteration and allowing stepsizes bounded above by $2L^{-1}$?

This paper shows that the answer to this question is “yes”. Referring to (2), the new procedure analyzed here requires one forward step on $B_{i}$ and one resolvent for $A_{i}$ at each iteration. In the context of (1), the new procedure requires one forward step on $\nabla h_{i}$ and one proximal operator evaluation on $f_{i}$ . When the resolvent is easily computable (for example, when $A_{i}$ is the zero map and its resolvent is simply the identity), the new procedure can effectively halve the computation necessary to run the same number of iterations as the previous procedure of [21]. This advantage is equivalent to that of FB over FBF and EG when cocoercivity is present. Another advantage of the proposed method is that it allows for a backtracking linesearch when the cocoercivity constant is unknown, whereas no such variant of general cocoercive FB is currently known.

The analysis of this new method is significantly different from our previous work in [21], using a novel “ascent lemma” (Lemma 17) regarding the separators generated by the algorithm. The new procedure also has an interesting connection to the original resolvent calculation used in the projective splitting papers [18, 19, 1, 10]: in Section 2.2 below, we show that the new procedure is equivalent to one iteration of FB applied to evaluating the resolvent of $T_{i}=A_{i}+B_{i}$ . That is, we can use a single forward-backward step to approximate the operator-processing procedure of [18, 19, 1, 10], but still obtain convergence.

The new procedure has significant potential for asynchronous and incremental implementation following the ideas and techniques of previous projective splitting methods [10, 17, 21]. To keep the analysis relatively manageable, however, we plan to develop such generalizations in a follow-up paper. Here, we will simply assume that every operator is processed once per iteration.

1.3 The Optimization Context

For optimization problems of the form (1), our proposed method is a first-order proximal splitting method that “fully splits” the problem: at each iteration, it utilizes the proximal operator for each nonsmooth function $f_{i}$, a single evaluation of the gradient $\nabla h_{i}$ for each smooth function $h_{i}$, and matrix-vector multiplications involving $G_{i}$ and $G_{i}^{*}$. There is no need for any form of matrix inversion, nor to use resolvents of composed functions like $f_{i}\circ G_{i}$, which may in general be much more challenging to evaluate than resolvents of the $f_{i}$. Thus, the method achieves the maximum possible decoupling of the elements of (1). There are also no assumptions on the rank, row spaces, or column spaces of the $G_{i}$. Beyond the basic resolvent, gradient, and matrix-vector multiplication operations invoked by our algorithm, the only computations at each iteration are a constant number of inner products, norms, scalar multiplications, and vector additions, all of which can be carried out within flop counts linear in the dimension of each Hilbert space.

Besides projective splitting approaches, there are a few first-order proximal splitting methods that can achieve full splitting on (1). The most similar to projective splitting are those in the family of primal-dual (PD) splitting methods; see [13, 12, 7, 35] and references therein. In fact, projective splitting is also a kind of primal-dual method, since it produces primal and dual sequences jointly converging to a primal-dual solution. However, the convergence mechanisms are different: PD methods are usually constructed by applying an established operator splitting technique such as FB, FBF, or DR to an appropriately formulated primal-dual inclusion in a primal-dual product space, possibly with a specially chosen metric. Projective splitting methods instead work by projecting onto (or through) explicitly constructed separating hyperplanes in the primal-dual space.

There are several potential advantages of our proposed method over the more established PD schemes. First, unlike the PD methods, the norms $\|G_{i}\|$ do not affect the stepsize constraints of our proposed method, making such constraints easier to satisfy. Furthermore, projective splitting’s stepsizes may vary at each iteration and may differ for each operator. In general, projective splitting methods allow for asynchronous parallel and incremental implementations in an arguably simpler way than PD methods (although we do not develop this aspect of projective splitting in this paper). Projective splitting methods can incorporate deterministic block-iterative and asynchronous assumptions [10, 17], resulting in deterministic convergence guarantees, with the analysis being similar to the synchronous case. In contrast, existing asynchronous and block-coordinate analyses of PD methods require stochastic assumptions which only lead to probabilistic convergence guarantees [35].

1.4 Notation and a Simplifying Assumption

We use the same general notation as in [21, 23, 22]. Summations of the form $\sum_{i=1}^{n-1}a_{i}$ will appear throughout this paper. To deal with the case $n=1$ , we use the standard convention that $\sum_{i=1}^{0}a_{i}=0.$

We will use a boldface ${\bf w}=(w_{1},\ldots,w_{n-1})$ for elements of $\mathcal{H}_{1}\times\ldots\times\mathcal{H}_{n-1}$ . Let $\boldsymbol{\mathcal{H}}\triangleq\mathcal{H}_{0}\times\mathcal{H}_{1}\times\cdots\times\mathcal{H}_{n-1}$ , which we refer to as the “collective primal-dual space”, and note that the assumption on $G_{n}$ implies that $\mathcal{H}_{n}=\mathcal{H}_{0}$ . We use $p$ to refer to points in $\boldsymbol{\mathcal{H}}$ , so $p\triangleq(z,{\bf w})=(z,w_{1},\ldots,w_{n-1})$ .

Throughout, we will simply write $\|\cdot\|_{i}=\|\cdot\|$ as the norm for $\mathcal{H}_{i}$ and let the subscript be inferred from the argument. In the same way, we will write $\langle\cdot,\cdot\rangle_{i}$ as $\langle\cdot,\cdot\rangle$ for the inner product of $\mathcal{H}_{i}$ . For the collective primal-dual space we will use a special norm and inner product with its own subscript defined in (16).

We use the standard “ $\rightharpoonup$ ” notation to denote weak convergence, which is of course equivalent to ordinary convergence in finite-dimensional settings.

For the definition of maximal monotone operators and their basic properties, we refer to [4]. For any maximal monotone operator $A$ and scalar $\rho>0$, we will use the notation $J_{\rho A}\triangleq(I+\rho A)^{-1}$ to denote the resolvent operator, also known as the backward or implicit step with respect to $A$. Thus,

$$x=J_{\rho A}(t)\iff x+\rho a=t\text{ and }a\in Ax,\tag{4}$$

the $x$ and $a$ satisfying this relation being unique. Furthermore, $J_{\rho A}$ is defined everywhere and $\text{range}(J_{\rho A})=\text{dom}(A)$ [4, Prop. 23.2].

If $A=\partial f$ for a closed, convex, and proper function $f$ , the resolvent is often referred to as the proximal operator and written as $J_{\rho\partial f}={\text{prox}}_{\rho f}$ . Computing the proximal operator requires solving

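$$\operatorname{prox}_{\rho f}(t)=\operatorname*{arg\,min}_{z}\left\{\rho f(z)+\tfrac{1}{2}\|z-t\|^{2}\right\}.$$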

Many functions encountered in applications to machine learning and signal processing have proximal operators which can be computed exactly with low computational complexity. In this paper, for a single-valued maximal monotone operator $A$ , a forward step (also known as an explicit step) refers to the direct evaluation of $Ax$ (or $\nabla f(x)$ in convex optimization) as part of an algorithm.
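To make the resolvent characterization (4) concrete, the sketch below uses $f=|\cdot|$ on the real line, for which the proximal operator is the familiar soft-thresholding map. It computes $x=\operatorname{prox}_{\rho f}(t)$, recovers $a=(t-x)/\rho$, and checks that $a$ lies in the subdifferential of $f$ at $x$. The choice of $f$ and all function names are assumptions made for illustration.

```python
def prox_abs(t, rho):
    """prox_{rho*f}(t) for f = |.|, i.e. soft-thresholding with threshold rho."""
    if t > rho:
        return t - rho
    if t < -rho:
        return t + rho
    return 0.0


def resolvent_pair(t, rho):
    """Return (x, a) with x + rho*a = t, where x is the prox of t."""
    x = prox_abs(t, rho)
    a = (t - x) / rho
    return x, a


def in_subdifferential_abs(x, a, tol=1e-12):
    """Check a in d|.|(x): {1} for x > 0, {-1} for x < 0, [-1, 1] at x = 0."""
    if x > 0:
        return abs(a - 1.0) <= tol
    if x < 0:
        return abs(a + 1.0) <= tol
    return -1.0 - tol <= a <= 1.0 + tol


# Every pair produced by the prox should satisfy the inclusion in (4).
checks = [in_subdifferential_abs(*resolvent_pair(t, 0.7))
          for t in (-3.0, -0.5, 0.0, 0.2, 4.0)]
```

The point of the check is that a backward step delivers not just the point $x$ but also, at no extra cost, an element $a$ of the operator at $x$.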

For the rest of the paper, we will impose the simplifying assumption

$$G_{n}:\mathcal{H}_{n}\to\mathcal{H}_{n}\triangleq I\quad\text{(the identity operator).}$$

As noted in [21], the requirement that $G_{n}=I$ is not a very restrictive assumption. For example, one can always enlarge the original problem by one operator, setting $A_{n}=B_{n}=0$ .

2 Projective Splitting

The goal of our algorithm will be to find a point in

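$$\mathcal{S}\triangleq\Big\{(z,w_{1},\ldots,w_{n-1})\in\boldsymbol{\mathcal{H}}\;\Big|\;(\forall i\in\{1,\ldots,n-1\})\;w_{i}\in T_{i}G_{i}z,\;-\sum_{i=1}^{n-1}G_{i}^{*}w_{i}\in T_{n}z\Big\}.\tag{5}$$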

It is clear that $z^{*}$ solves (2)–(3) if and only if there exist $w_{1}^{*},\ldots,w_{n-1}^{*}$ such that

$$(z^{*},w_{1}^{*},\ldots,w_{n-1}^{*})\in\mathcal{S}.$$

Under reasonable assumptions, the set $\mathcal{S}$ is closed and convex; see Lemma 2. $\mathcal{S}$ is often called the Kuhn-Tucker solution set of problem (3).

A separator-projector algorithm for finding a point in $\mathcal{S}$ (and hence a solution to (3)) will, at each iteration $k$ , find a closed and convex set $H_{k}$ which separates $\mathcal{S}$ from the current point, meaning $\mathcal{S}$ is entirely in the set (preferably, the current point is not). One can then attempt to “move closer” to the solution set by projecting the current point onto the set $H_{k}$ . This general setup guarantees that the sequence generated by the method is Fejér monotone [8] with respect to $\mathcal{S}$ . This alone is not sufficient to guarantee that the iterates actually converge to a point in the solution set. To establish this, one needs to show that the set $H_{k}$ “sufficiently separates” the current point from the solution set, or at least does so sufficiently often. Such “sufficient separation” allows one to establish that any weakly convergent subsequence of the iterates must have its limit in the set $\mathcal{S}$ , from which overall weak convergence follows from [8, Prop. 2].

With $\mathcal{S}$ as in (5), the separator formulation presented in [10] constructs the halfspace $H_{k}$ using the function $\varphi_{k}:\boldsymbol{\mathcal{H}}\to\mathbb{R}$ defined as

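$$\varphi_{k}(z,w_{1},\ldots,w_{n-1})=\sum_{i=1}^{n-1}\langle G_{i}z-x_{i}^{k},y_{i}^{k}-w_{i}\rangle+\Big\langle z-x_{n}^{k},y_{n}^{k}+\sum_{i=1}^{n-1}G_{i}^{*}w_{i}\Big\rangle\tag{6}$$

$$\phantom{\varphi_{k}(z,w_{1},\ldots,w_{n-1})}=\Big\langle z,\sum_{i=1}^{n}G_{i}^{*}y_{i}^{k}\Big\rangle+\sum_{i=1}^{n-1}\langle x_{i}^{k}-G_{i}x_{n}^{k},w_{i}\rangle-\sum_{i=1}^{n}\langle x_{i}^{k},y_{i}^{k}\rangle,\tag{7}$$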

for some auxiliary points $(x_{i}^{k},y_{i}^{k})\in\mathcal{H}_{i}^{2}$. These points $(x_{i}^{k},y_{i}^{k})$ will be specified later and must be chosen at each iteration in a specific manner guaranteeing the validity of the separator and convergence to $\mathcal{S}$. Among other properties, they must be chosen so that $y^{k}_{i}\in T_{i}x^{k}_{i}$ for $i=1,\ldots,n$. Under this condition, it follows readily that $\varphi_{k}$ has the promised separator properties:

Lemma 1.

The function $\varphi_{k}$ defined in (6) is affine, and if $y^{k}_{i}\in T_{i}x^{k}_{i}$ for all $i=1,\ldots,n$ , then $\varphi_{k}(z,w_{1},\ldots,w_{n-1})\leq 0$ for all $(z,w_{1},\ldots,w_{n-1})\in\mathcal{S}$ .

Proof.

That $\varphi_{k}$ is affine is clear from its expression in (7). Now suppose that $y^{k}_{i}\in T_{i}x^{k}_{i}$ for all $i=1,\ldots,n$ and $p=(z,w_{1},\ldots,w_{n-1})\in\mathcal{S}$ . Then

$$\varphi_{k}(p)=-\sum_{i=1}^{n}\langle G_{i}z-x_{i}^{k},w_{i}-y_{i}^{k}\rangle,\tag{8}$$

where $w_{n}\triangleq-\sum_{i=1}^{n-1}G_{i}^{*}w_{i}$. From $(z,w_{1},\ldots,w_{n-1})\in\mathcal{S}$ and the definition of $\mathcal{S}$, one has that $w_{i}\in T_{i}G_{i}z$ for all $i=1,\ldots,n-1$, as well as $w_{n}\in T_{n}z$. Since $y_{i}^{k}\in T_{i}x_{i}^{k}$ for $i=1,\ldots,n$, it follows from the monotonicity of $T_{1},\ldots,T_{n}$ that every inner product displayed in (8) is nonnegative, and so $\varphi_{k}(p)\leq 0$. ∎

Figure 1 presents a rough depiction of the current algorithm iterate $p^{k}=(z^{k},w_{1}^{k},\ldots,w_{n-1}^{k})$ and the separator $\varphi_{k}$ in the case that $\varphi_{k}(p^{k})>0$ . The basic iterative cycle pursued by projective splitting methods is:

1. For each operator $T_{i}$, identify a pair $(x_{i}^{k},y_{i}^{k})\in\operatorname{gra}T_{i}$. These pairs define an affine function $\varphi_{k}$ such that $\varphi_{k}(p)\leq 0$ for all $p\in\mathcal{S}$, using the construction (6) (or related constructions for variations of the basic problem formulation).

2. Obtain the next iterate $p^{k+1}$ by projecting the current iterate $p^{k}$ onto the halfspace $H_{k}\triangleq\left\{p\;\left|\;\;\varphi_{k}(p)\leq 0\right.\right\}$, with possible over- or under-relaxation.

Figure 2 presents a rough depiction of two iterations of this process in the absence of over- or under-relaxation. The projection operation in part 2 of the cycle is a straightforward application of standard formulas for projecting onto a halfspace. For the particular formulation (3), the necessary calculations are derived in [21] and displayed in Algorithm 3 below. This projection is a low-complexity operation involving only inner products, norms, matrix multiplication by $G_{i}$ , and sums of scalars. For example, when $\mathcal{H}_{i}=\mathbb{R}^{d}$ for $i=1,\ldots,n$ and each $G_{i}=I$ , then the projection step has computational complexity $\operatorname{O}(nd)$ .
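The halfspace projection in step 2 of the cycle can be sketched generically. Writing an affine separator as $\varphi(p)=\langle g,p\rangle-c$, the relaxed projection moves $p$ along $-g$ by $\beta\,\varphi(p)/\|g\|^{2}$ whenever $\varphi(p)>0$. The sketch below assumes the plain Euclidean norm on a finite-dimensional space; the paper's own projection uses a special norm on the collective primal-dual space and the formulas derived in [21], which are not reproduced here. The function name and arguments are hypothetical.

```python
def project_onto_halfspace(p, g, c, relax=1.0):
    """Relaxed Euclidean projection of p onto {q : <g, q> - c <= 0}.

    p and g are equal-length lists of floats; relax = 1 gives the exact
    projection, other values give under- or over-relaxation.
    """
    phi = sum(gi * pi for gi, pi in zip(g, p)) - c     # separator value at p
    gnorm2 = sum(gi * gi for gi in g)                  # squared gradient norm
    if phi <= 0.0 or gnorm2 == 0.0:
        return list(p)                                 # p already in the halfspace
    step = relax * phi / gnorm2
    return [pi - step * gi for pi, gi in zip(p, g)]


inside = project_onto_halfspace([0.0, 0.0], [1.0, 0.0], 1.0)    # unchanged
outside = project_onto_halfspace([3.0, 4.0], [1.0, 0.0], 1.0)   # moved to the boundary
```

The cost is a few inner products and vector additions, in line with the low per-iteration complexity described above.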

The key question in the design of algorithms in this class therefore concerns step 1 in the cycle: how might one select the points $(x_{i}^{k},y_{i}^{k})\in\operatorname{gra}T_{i}$ so that convergence to $\mathcal{S}$ may be established? The usual approach has been to choose $(x^{k}_{i},y^{k}_{i})\in\operatorname{gra}T_{i}$ to be some function of $(z^{k},w_{i}^{k})$ such that $\varphi_{k}(p^{k})$ is positive and “sufficiently large” whenever $p^{k}\not\in\mathcal{S}$ . Then projecting the current point onto this hyperplane makes progress toward the solution and can be shown to lead (with some further analysis) to overall convergence. In the original versions of projective splitting, the calculation of $(x^{k}_{i},y^{k}_{i})$ involved (perhaps approximately) evaluating a resolvent; later [21] introduced the alternative of a two-forward-step calculation for Lipschitz continuous operators that achieved essentially the same sufficient separation condition.

Here, we introduce a one-forward-step calculation for the case of cocoercive operators. A principal difference between this analysis and earlier work on projective splitting is that processing all the operators $T_{1},\ldots,T_{n}$ at iteration $k$ need not result in $\varphi_{k}(p^{k})$ being positive. Instead, we establish an “ascent lemma” that relates the values $\varphi_{k}(p^{k})$ and $\varphi_{k-1}(p^{k-1})$ in such a way that overall convergence may still be proved, even though it is possible that $\varphi_{k}(p^{k})\leq 0$ at some iterations $k$ . In particular, $\varphi_{k}(p^{k})$ will be larger than the previous value $\varphi_{k-1}(p^{k-1})$ , up to some error term that vanishes as $k\to\infty$ .

When $\varphi_{k}(p^{k})\leq 0$, projection onto $H_{k}=\left\{p\;\left|\;\;\varphi_{k}(p)\leq 0\right.\right\}$ results in $p^{k+1}=p^{k}$. In this case, the algorithm continues to compute new points $(x_{i}^{k+1},y_{i}^{k+1})$, $(x_{i}^{k+2},y_{i}^{k+2}),\ldots$ until, for some $\ell\geq 0$, it constructs a hyperplane $H_{k+\ell}$ such that $\varphi_{k+\ell}(p^{k})>0$ and projection results in $p^{k+\ell+1}\neq p^{k+\ell}=p^{k}$.
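This behaviour can be observed on a toy instance with a single operator ($n=1$, $G_{1}=I$, no dual variables, so $p=z$ and $\varphi_{k}(z)=\langle z-x^{k},y^{k}\rangle$). The sketch below is a simplified reading of the scheme under stated assumptions, not the paper's algorithm: it takes $A=\partial|\cdot|$ and $B(x)=L(x-c)$ on the real line, uses the one-forward-step update with $t=(1-\alpha)x^{k-1}+\alpha z^{k}-\rho\,b^{k-1}$, and projects without relaxation. All names and parameter values are hypothetical.

```python
def soft_threshold(t, tau):
    """Resolvent of tau * d|.|, i.e. the prox of tau*|x|."""
    if t > tau:
        return t - tau
    if t < -tau:
        return t + tau
    return 0.0


def run_single_operator_sketch(iters=60, L=2.0, c=3.0, alpha=0.1, rho=0.5):
    """Toy n = 1 run with T = d|.| + B, B(x) = L*(x - c); rho <= 2*(1 - alpha)/L."""
    B = lambda x: L * (x - c)
    z, x = 0.0, 0.0
    b = B(x)                      # b^0 = B x^0
    skipped = 0                   # iterations where phi_k(z^k) <= 0
    for _ in range(iters):
        t = (1.0 - alpha) * x + alpha * z - rho * b   # reuses the old forward value
        x = soft_threshold(t, rho)                    # backward step on A
        a = (t - x) / rho                             # a in A x
        b = B(x)                                      # the single forward step
        y = a + b                                     # y in T x
        phi = (z - x) * y                             # separator value at z
        if phi > 0.0 and y * y > 0.0:
            z = z - (phi / (y * y)) * y               # project onto {phi_k <= 0}
        else:
            skipped += 1                              # iterate is left unchanged
    return z, skipped


z_final, skipped = run_single_operator_sketch()
```

In this run the separator value is not positive at some iterations, so the iterate is left unchanged there, yet `z_final` still approaches the minimizer $c-1/L=2.5$ of $|x|+\tfrac{L}{2}(x-c)^{2}$.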

Additional Notation for Projective Splitting

For an arbitrary $(w_{1},w_{2},\ldots,w_{n-1})\in\mathcal{H}_{1}\times\mathcal{H}_{2}\times\ldots\times\mathcal{H}_{n-1}$ we use the notation

$$w_{n}\triangleq-\sum_{i=1}^{n-1}G_{i}^{*}w_{i},$$

as in the proof of Lemma 1. Note that when $n=1$ , $w_{1}=0$ . Under the above convention, we may write $\varphi_{k}:\boldsymbol{\mathcal{H}}\to\mathbb{R}$ in the more compact form

$$\varphi_{k}(z,w_{1},\ldots,w_{n-1})=\sum_{i=1}^{n}\langle G_{i}z-x_{i}^{k},y_{i}^{k}-w_{i}\rangle.$$

We also use the following notation for $i=1,\ldots,n$ :

$$\varphi_{i,k}(z,w_{i})\triangleq\langle G_{i}z-x_{i}^{k},y_{i}^{k}-w_{i}\rangle.$$

Note that $\varphi_{k}(z,w_{1},\ldots,w_{n-1})=\sum_{i=1}^{n}\varphi_{i,k}(z,w_{i})$ .
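The separator property of Lemma 1 can be checked numerically in a scalar setting. The sketch below evaluates $\varphi_{k}$ as the sum of the terms $\langle G_{i}z-x_{i}^{k},y_{i}^{k}-w_{i}\rangle$, with $w_{n}=-\sum_{i=1}^{n-1}G_{i}^{*}w_{i}$ and $G_{n}=1$, for $n=2$ and two monotone affine operators on the real line. The operators, the constants, and the function names are assumptions chosen for illustration.

```python
import random


def phi(z, w, xs, ys, Gs):
    """Sum over i of (G_i*z - x_i)*(y_i - w_i) for scalar spaces, with G_n = 1.

    w and Gs hold the first n-1 dual variables and linear maps; xs and ys
    hold all n auxiliary points.
    """
    w_n = -sum(G * wi for G, wi in zip(Gs, w))     # w_n = -sum_i G_i^* w_i
    all_w = list(w) + [w_n]
    all_G = list(Gs) + [1.0]
    return sum((G * z - x) * (y - wi)
               for G, x, y, wi in zip(all_G, xs, ys, all_w))


g, a1, a2, b = 2.0, 1.5, 0.5, -3.0
T1 = lambda x: a1 * x          # monotone since a1 >= 0
T2 = lambda x: a2 * x + b      # monotone since a2 >= 0
z_star = -b / (g * g * a1 + a2)        # solves 0 = g*T1(g*z) + T2(z)
w_star = [T1(g * z_star)]              # so (z_star, w_star) lies in S

random.seed(1)
worst = float("-inf")
for _ in range(1000):
    xs = [random.uniform(-10, 10), random.uniform(-10, 10)]
    ys = [T1(xs[0]), T2(xs[1])]        # y_i in T_i x_i, as Lemma 1 requires
    worst = max(worst, phi(z_star, w_star, xs, ys, [g]))
```

Whatever points $x_{i}$ are sampled, the value of `phi` at the primal-dual solution stays nonpositive (up to rounding), which is exactly what makes $\varphi_{k}$ a valid separator.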

2.1 The New Procedure

Suppose $A_{i}=0$ for some $i\in\{1,\ldots,n\}$ . Since $B_{i}$ is cocoercive, it is also Lipschitz continuous. In [21] we introduced the following two-forward-step update for Lipschitz continuous $B_{i}$ :

$$x_{i}^{k}=G_{i}z^{k}-\rho_{i}^{k}\big(B_{i}G_{i}z^{k}-w_{i}^{k}\big),$$

$$y_{i}^{k}=B_{i}x_{i}^{k}.$$

Under $L_{i}$-Lipschitz continuity and the condition $\rho_{i}^{k}<1/L_{i}$, it is possible to show that updating $(x_{i}^{k},y_{i}^{k})$ in this way leads to $\varphi_{i,k}(z^{k},w_{i}^{k})$ being sufficiently positive to establish overall convergence. Although we did not discuss it in [21], this two-forward-step procedure can be extended to handle nonzero $A_{i}$ in the following manner:

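$$x_{i}^{k}+\rho_{i}^{k}a_{i}^{k}=G_{i}z^{k}-\rho_{i}^{k}\big(B_{i}G_{i}z^{k}-w_{i}^{k}\big),\quad a_{i}^{k}\in A_{i}x_{i}^{k},\tag{9}$$

$$y_{i}^{k}=a_{i}^{k}+B_{i}x_{i}^{k}.\tag{10}$$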

Following (4), it is clear that (9) is essentially a resolvent calculation applied to its right-hand side $G_{i}z^{k}-\rho_{i}^{k}(B_{i}G_{i}z^{k}-w_{i}^{k})$ . This type of update, with forward steps and backward steps together, was introduced in [39] for a more limited form of projective splitting.

An obvious drawback of (9)–(10) is that it requires two forward steps per iteration, one to compute $B_{i}G_{i}z^{k}$ and another to compute $B_{i}x_{i}^{k}$ . The initial motivation for the current paper was the following question: is there a way to reuse $B_{i}x_{i}^{k-1}$ so as to avoid computing $B_{i}G_{i}z^{k}$ at each iteration, perhaps under the stronger assumption of cocoercivity? With some effort we arrived at the following update for each block $i=1,\ldots,n$ at each iteration $k\geq 0$ :

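$$x_{i}^{k}+\rho_{i}a_{i}^{k}=(1-\alpha_{i})x_{i}^{k-1}+\alpha_{i}G_{i}z^{k}-\rho_{i}\big(b_{i}^{k-1}-w_{i}^{k}\big),\quad a_{i}^{k}\in A_{i}x_{i}^{k},\tag{11}$$

$$b_{i}^{k}=B_{i}x_{i}^{k},\tag{12}$$

$$y_{i}^{k}=a_{i}^{k}+b_{i}^{k},\tag{13}$$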

where $\alpha_{i}\in(0,1)$ , $\rho_{i}\leq 2(1-\alpha_{i})/L_{i}$ , and $b_{i}^{0}=B_{i}x_{i}^{0}$ . Condition (11) is readily satisfied by some simple linear algebra calculations and a resolvent calculation involving $A_{i}$ .

In particular, referring to (4), one may see that (11) is equivalent to computing

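$$t=(1-\alpha_{i})x_{i}^{k-1}+\alpha_{i}G_{i}z^{k}-\rho_{i}\big(b_{i}^{k-1}-w_{i}^{k}\big),$$

$$x_{i}^{k}=J_{\rho_{i}A_{i}}(t),$$

$$a_{i}^{k}=\frac{1}{\rho_{i}}\big(t-x_{i}^{k}\big).$$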

Following this resolvent calculation, (12) requires only an evaluation (forward step) on $B_{i}$ , and (13) is a simple vector addition. In comparison to (9), we have replaced $B_{i}G_{i}z^{k}$ with the previously computed point $B_{i}x_{i}^{k-1}$ . However, in order to establish convergence, it turns out that we also need to replace $G_{i}z^{k}$ with a convex combination of $x_{i}^{k-1}$ and $G_{i}z^{k}$ .

The parameter $\rho_{i}$ plays the role of the stepsize in the resolvent calculation. It also plays the role of a forward (gradient) stepsize, since it multiplies $-b_{i}^{k-1}$ in (11), and $b_{i}^{k-1}=B_{i}x_{i}^{k-1}$ by (12). From the assumptions on $\alpha_{i}$ and $\rho_{i}$ immediately following (13), it follows that $\rho_{i}$ may be made arbitrarily close to $2/L_{i}$ by setting $\alpha_{i}$ close to $0$. However, in practice it may be better to use an intermediate value, such as $\alpha_{i}=0.1$, since doing so causes the update to make significant use of the information in $z^{k}$, a point computed more recently than $x_{i}^{k-1}$.

Computing $(x_{i}^{k},y_{i}^{k})$ as proposed in (11)-(13) does not guarantee that the quantity $\varphi_{i,k}(z^{k},w_{i}^{k})$ is positive. In the next section, we give some intuition as to why (11)-(13) nevertheless leads to convergence to $\mathcal{S}$ .
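As a minimal sketch of the update (11)–(13) for a single block, the following NumPy code assumes the illustrative choices $A_{i}=\lambda\,\partial\|\cdot\|_{1}$ (whose resolvent is soft-thresholding) and $B_{i}=\nabla\big(\tfrac{1}{2}\|M\cdot-c\|^{2}\big)$. The names `one_forward_step`, `soft_threshold`, `M`, `c`, and `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_threshold(t, tau):
    """Resolvent of tau * (subdifferential of the l1 norm); illustrative choice of A."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def one_forward_step(z, x_prev, b_prev, w, G, B, resolvent, alpha, rho):
    """One update of the form (11)-(13) for a single block.

    b_prev is the cached value B(x_prev), so only one new evaluation of B is made.
    """
    t = (1.0 - alpha) * x_prev + alpha * (G @ z) - rho * (b_prev - w)
    x = resolvent(t, rho)      # backward step: x + rho * a = t with a in A(x)
    a = (t - x) / rho
    b = B(x)                   # the single forward step, as in (12)
    y = a + b                  # plain vector addition, as in (13)
    return x, y, b

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((8, d))
c = rng.standard_normal(8)
B = lambda x: M.T @ (M @ x - c)        # gradient of 0.5 * ||M x - c||^2
L = np.linalg.norm(M, 2) ** 2          # Lipschitz constant of B
lam = 0.1
resolvent = lambda t, rho: soft_threshold(t, rho * lam)

alpha = 0.1
rho = 2.0 * (1.0 - alpha) / L          # largest stepsize allowed by the bound
G = np.eye(d)
z = rng.standard_normal(d)
w = rng.standard_normal(d)
x_prev = rng.standard_normal(d)
b_prev = B(x_prev)

x_new, y_new, b_new = one_forward_step(z, x_prev, b_prev, w, G, B, resolvent, alpha, rho)
a_new = y_new - b_new                  # the element of A(x_new) produced by the resolvent
```

The returned `b_new` is what a caller would pass as `b_prev` on the next iteration, which is how the second forward step of (9)–(10) is avoided.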

2.2 A Connection with the Forward-Backward Method

In the projective splitting literature preceding [21], the pairs $(x_{i}^{k},y_{i}^{k})$ are solutions of

$$x_{i}^{k}+\rho_{i}y_{i}^{k}=G_{i}z^{k}+\rho_{i}w_{i}^{k},\qquad y_{i}^{k}\in T_{i}x_{i}^{k}\tag{14}$$

for some $\rho_{i}>0$ , which — again following (4) — is a resolvent calculation. It can be shown that the resulting $(x_{i}^{k},y_{i}^{k})\in\operatorname{gra}T_{i}$ are such that $\varphi_{i,k}(z^{k},w_{i}^{k})$ is positive and sufficiently large to guarantee overall convergence to a solution of (3). Since the stepsize $\rho_{i}$ in (14) can be any positive number, let us replace $\rho_{i}$ with $\rho_{i}/\alpha_{i}$ for some $\alpha_{i}\in(0,1)$ and rewrite (14) as

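$$x_{i}^{k}+\frac{\rho_{i}}{\alpha_{i}}y_{i}^{k}=G_{i}z^{k}+\frac{\rho_{i}}{\alpha_{i}}w_{i}^{k},\qquad y_{i}^{k}\in T_{i}x_{i}^{k}.\tag{15}$$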

The reason for this reparameterization will become apparent below.

In this paper, $T_{i}=A_{i}+B_{i}$ , with $B_{i}$ being cocoercive and $A_{i}$ maximal monotone. For $T_{i}$ in this form, computing the resolvent as in (14) exactly may be impossible, even when the resolvent of $A_{i}$ is available. With this structure, $x_{i}^{k}$ in (15) satisfies:

$$0\in\frac{\rho_{i}}{\alpha_{i}}\big(A_{i}x_{i}^{k}+B_{i}x_{i}^{k}-w_{i}^{k}\big)+x_{i}^{k}-G_{i}z^{k},$$

which can be rearranged to $0\in A_{i}x_{i}^{k}+\tilde{B}_{i}x_{i}^{k},$ where

$$\tilde{B}_{i}v\triangleq B_{i}v+\frac{\alpha_{i}}{\rho_{i}}\big(v-G_{i}z^{k}\big)-w_{i}^{k}.$$

Since $B_{i}$ is $L_{i}^{-1}$ -cocoercive, $\tilde{B}_{i}$ is $(L_{i}+\alpha_{i}/\rho_{i})^{-1}$ -cocoercive [4, Prop. 4.12]. Consider the generic monotone inclusion problem $0\in A_{i}x+\tilde{B}_{i}x$ : $A_{i}$ is maximal and $\tilde{B}_{i}$ is cocoercive, and thus one may solve the problem with the forward-backward (FB) method [4, Theorem 26.14]. If one applies a single iteration of FB initialized at $x_{i}^{k-1}$ , with stepsize $\rho_{i}$ , to the inclusion $0\in A_{i}x+\tilde{B}_{i}x$ , one obtains the calculation:

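$$x_{i}^{k}=(I+\rho_{i}A_{i})^{-1}\big(x_{i}^{k-1}-\rho_{i}\tilde{B}_{i}x_{i}^{k-1}\big)=(I+\rho_{i}A_{i})^{-1}\Big((1-\alpha_{i})x_{i}^{k-1}+\alpha_{i}G_{i}z^{k}-\rho_{i}\big(B_{i}x_{i}^{k-1}-w_{i}^{k}\big)\Big),$$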

which is precisely the update (11). So, our proposed calculation is equivalent to one iteration of FB initialized at the previous point $x_{i}^{k-1}$ , applied to the subproblem of computing the resolvent in (15). Prior versions of projective splitting require computing this resolvent either exactly or to within a certain relative error criterion, which may be time consuming. Here, we simply make a single FB step toward computing the resolvent, which we will prove is sufficient for the projective splitting method to converge to $\mathcal{S}$ . However, our stepsize restriction on $\rho_{i}$ will be slightly stronger than the natural stepsize limit that would arise when applying FB to $0\in A_{i}x+\tilde{B}_{i}x$ .
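The equivalence just described can be checked numerically. The sketch below assumes $A=\lambda\,\partial\|\cdot\|_{1}$, a least-squares gradient for $B$, and takes $\tilde{B}x=Bx+(\alpha/\rho)(x-Gz)-w$, the operator obtained by rearranging the inclusion for (15). All variable names are hypothetical.

```python
import numpy as np

def soft_threshold(t, tau):
    """Resolvent of tau * (subdifferential of the l1 norm)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((9, d))
c = rng.standard_normal(9)
lam, alpha = 0.2, 0.3
B = lambda x: M.T @ (M @ x - c)
L = np.linalg.norm(M, 2) ** 2
rho = 2.0 * (1.0 - alpha) / L

Gz = rng.standard_normal(d)        # stands in for G_i z^k
w = rng.standard_normal(d)         # stands in for w_i^k
x_prev = rng.standard_normal(d)    # stands in for x_i^{k-1}

# Update (11): resolvent of A applied to an averaged point minus a forward step.
t = (1 - alpha) * x_prev + alpha * Gz - rho * (B(x_prev) - w)
x_update = soft_threshold(t, rho * lam)

# One FB step on 0 in A x + B_tilde x, started from x_prev with stepsize rho.
B_tilde = lambda x: B(x) + (alpha / rho) * (x - Gz) - w
x_fb = soft_threshold(x_prev - rho * B_tilde(x_prev), rho * lam)

gap = np.linalg.norm(x_update - x_fb)   # difference between the two computations
```

The two points agree up to rounding because the term $-\rho(\alpha/\rho)(x_{i}^{k-1}-G_{i}z^{k})$ in the FB step is exactly what turns $x_{i}^{k-1}$ into the convex combination appearing in (11).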

3 The Algorithm

3.1 Main Problem Assumptions and Preliminary Results

Assumption 1.

Problem (2) conforms to the following:

1. $\mathcal{H}_{0}=\mathcal{H}_{n}$ and $\mathcal{H}_{1},\ldots,\mathcal{H}_{n-1}$ are real Hilbert spaces.

2. For $i=1,\ldots,n$, the operators $A_{i}:\mathcal{H}_{i}\to 2^{\mathcal{H}_{i}}$ and $B_{i}:\mathcal{H}_{i}\to\mathcal{H}_{i}$ are monotone. Additionally each $A_{i}$ is maximal.

3. Each operator $B_{i}$ is either $L_{i}^{-1}$-cocoercive for some $L_{i}>0$ (and thus single-valued) and $\operatorname{dom}B_{i}=\mathcal{H}_{i}$, or $L_{i}=0$ and $B_{i}x=v_{i}$ for all $x\in\mathcal{H}_{i}$ and some $v_{i}\in\mathcal{H}_{i}$ (that is, $B_{i}$ is a constant function).

4. Each $G_{i}:\mathcal{H}_{0}\to\mathcal{H}_{i}$ for $i=1,\ldots,n-1$ is linear and bounded.

5. Problem (2) has a solution, so the set $\mathcal{S}$ defined in (5) is nonempty.

Problem (1) will be equivalent to an instance of Problem (2) satisfying Assumption 1 if each $f_{i}$ and $h_{i}$ is closed, convex, and proper, each $h_{i}$ has $L_{i}$ -Lipschitz continuous gradients, and a special case of the constraint qualification in [9, Prop. 5.3] holds.

In order to apply a separator-projector algorithm, the target set must be closed and convex. Establishing this for $\mathcal{S}$ closely follows the argument in our previous work [21], which in turn follows many earlier results.

Lemma 2.

Suppose Assumption 1 holds. The set $\mathcal{S}$ defined in (5) is closed and convex.

Proof.

By [4, Cor. 20.28] each $B_{i}$ is maximal. Furthermore, since $\operatorname{dom}(B_{i})=\mathcal{H}_{i}$ , $T_{i}=A_{i}+B_{i}$ is maximal monotone by [4, Cor. 25.5(i)]. The rest of the proof is identical to [21, Lemma 3]. ∎

Throughout, we will use $p=(z,{\bf w})=(z,w_{1},\ldots,w_{n-1})$ for a generic point in $\boldsymbol{\mathcal{H}}$ , the collective primal-dual space. For $\boldsymbol{\mathcal{H}}$ , we adopt the following (standard) norm and inner product:

[TABLE]

Lemma 3.

[21, Lemma 4] Let $\varphi_{k}$ be defined as in (6). Then:

1. $\varphi_{k}$ is affine on $\boldsymbol{\mathcal{H}}$.

2. With respect to the inner product $\langle\cdot,\cdot\rangle$ on $\boldsymbol{\mathcal{H}}$, the gradient of $\varphi_{k}$ is

[TABLE]

3.2 Abstract One-Forward-Step Update

We sharpen the notation for the one-forward-step update introduced in (11)–(13) as follows:

Definition 1.

Suppose $\mathcal{H}$ and $\mathcal{H}^{\prime}$ are real Hilbert spaces, $A:\mathcal{H}\to 2^{\mathcal{H}}$ is maximal-monotone with nonempty domain, $B:\mathcal{H}\to\mathcal{H}$ is $L^{-1}$ -cocoercive, and $G:\mathcal{H}^{\prime}\to\mathcal{H}$ is bounded and linear. For $\alpha\in[0,1]$ and $\rho>0$ , define the mapping $\mathcal{F}_{\alpha,\rho}(z,x,w;A,B,G):\mathcal{H}^{\prime}\times\mathcal{H}^{2}\to\mathcal{H}^{2}$ , with additional parameters $A,B$ , and $G$ , as

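$$\mathcal{F}_{\alpha,\rho}(z,x,w;A,B,G)\triangleq(x^{+},y^{+}),\quad\text{where}\quad\begin{cases}t=(1-\alpha)x+\alpha Gz-\rho(Bx-w),\\ x^{+}=(I+\rho A)^{-1}(t),\\ y^{+}=\rho^{-1}(t-x^{+})+Bx^{+}.\end{cases}\tag{22}$$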

To simplify the presentation, we will also use the notation

$$\mathcal{F}^{i}(z,x,w)\triangleq\mathcal{F}_{\alpha_{i},\rho_{i}}(z,x,w;A_{i},B_{i},G_{i}).$$

With this notation, (11)–(13) may be written as $(x_{i}^{k},y_{i}^{k})=\mathcal{F}^{i}(z^{k},x_{i}^{k-1},w_{i}^{k}).$

3.3 Algorithm Definition

Algorithms 1–3 define the main method proposed in this work. They produce a sequence of primal-dual iterates $p^{k}=(z^{k},w_{1}^{k},\ldots,w_{n-1}^{k})\in\boldsymbol{\mathcal{H}}$ and, implicitly, $w_{n}^{k}\triangleq-\sum_{i=1}^{n-1}G_{i}^{*}w_{i}^{k}$ . Algorithm 1 gives the basic outline of our method; for each operator, it invokes either our new one-forward-step update with a user-defined stepsize (through line 1) or its backtracking variant given in Algorithm 2 (through line 1). Together, algorithms 1–2 specify how to update the points $(x_{i}^{k},y_{i}^{k})$ used to define the separating affine function $\varphi_{k}$ in (6). Algorithm 3, called from line 1 of Algorithm 1, defines the projectToHplane function that performs the projection step to obtain the next iterate.

Taken together, algorithms 1–3 are essentially the same as Algorithm 2 of [21], except that the update of $(x_{i}^{k},y_{i}^{k})$ uses the new procedure given in (11)–(13). For simplicity, the algorithm also lacks the block-iterative and asynchronous features of [10, 17, 21], which we plan to combine with algorithms 1–3 in a follow-up paper.

The computations in projectToHplane are all straightforward and of relatively low complexity. They consist of matrix multiplies by $G_{i}$ , inner products, norms, and sums of scalars. In particular, there are no potentially difficult minimization problems involved. If $G_{i}=I$ and $\mathcal{H}_{i}=\mathbb{R}^{d}$ for $i=1,\ldots,n$ , then the computational complexity of projectToHplane is $\operatorname{O}(nd)$ .
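To illustrate why such a projection is cheap, here is a minimal sketch of projecting a point onto the halfspace defined by a generic affine function $\varphi(p)=\langle g,p\rangle-c$ in $\mathbb{R}^{m}$ under the standard Euclidean metric. This is not Algorithm 3 itself: the function `project_to_halfspace` and the flat vector representation of $p$ are assumptions made for illustration, and the $\gamma$-weighted metric and the linear maps $G_{i}$ are omitted.

```python
import numpy as np

def project_to_halfspace(p, g, c):
    """Project p onto {q : <g, q> - c <= 0} under the standard Euclidean metric."""
    phi = g @ p - c                  # value of the affine function at p
    gnorm2 = g @ g
    if phi <= 0.0 or gnorm2 == 0.0:
        return p.copy()              # already on the correct side: no move
    return p - (phi / gnorm2) * g    # O(m) work: inner products and one vector update

rng = np.random.default_rng(2)
m = 7
g = rng.standard_normal(m)
c = 0.5
p_out = g * (c + 1.0) / (g @ g)      # constructed so that phi(p_out) = 1 > 0
p_in = g * (c - 1.0) / (g @ g)       # constructed so that phi(p_in) = -1 < 0

p_proj = project_to_halfspace(p_out, g, c)
p_same = project_to_halfspace(p_in, g, c)
```

Only inner products, a squared norm, and a scaled vector addition are involved, which is consistent with the low per-iteration cost described above.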

3.4 Algorithm Parameters

The method allows two ways to select the stepsizes $\rho_{i}$ . One may either choose them manually or invoke the backTrack procedure. If one decides to select the stepsizes manually, the upper bound condition $\rho_{i}\leq 2(1-\alpha_{i})/L_{i}$ is required whenever $L_{i}>0$ . However, it may be difficult to ensure that this condition is satisfied when the cocoercivity constant is hard to estimate. The global cocoercivity constant $L_{i}$ may also be conservative in parts of the domain of $B_{i}$ , leading to unnecessarily small stepsizes in some cases. We developed the backtracking linesearch technique for these reasons. The set $\mathcal{B}$ holds the indices of operators for which backtracking is to be used.

For a trial stepsize $\tilde{\rho}_{j}$ , Algorithm 2 generates candidate points $(\tilde{x}_{j},\tilde{y}_{j})$ using the single-forward-step procedure of (22). For these candidates, Algorithm 2 checks two conditions on lines 2–2. If both of these inequalities are satisfied, then backtracking terminates and returns the successful candidate points. If either condition is not satisfied, the stepsize is reduced by the factor $\delta\in(0,1)$ and the process is repeated. These two conditions arise in the analysis in Section 5.

The parameter $\hat{\rho}$ is a global upper bound on the stepsizes (both backtracked and fixed) and must be chosen to satisfy Assumption 2. In backTrack, one must choose an initial trial stepsize within a specified interval (line 2 of Algorithm 2). This interval arises in the analysis (see lemmas 16 and 17). Written in terms of the parameters passed into backTrack in the call on line 1 of Algorithm 1, and assuming the global upper bound $\hat{\rho}$ is sufficiently large to not be active on line 2, the interval is

[TABLE]

An obvious choice is to set the initial stepsize to be at the upper limit of the interval. In practice we have observed that $\|y_{i}^{k}-w_{i}^{k}\|$ and $\|\hat{y}_{i}^{k}-w_{i}^{k}\|$ tend to be approximately equal, so this allows for an increase in the trial stepsize by up to a factor of approximately $1+\alpha_{i}$ over the previous stepsize.

Note that backTrack returns the chosen stepsize $\tilde{\rho}_{j}$ as well as the quantity $\eta$, both of which are needed to compute the available interval in the call to backTrack during the next iteration.

In the analysis it will be convenient to let $\tilde{\rho}^{(i,k)}$ be the initial trial stepsize chosen during iteration $k$ of Algorithm 1, when backTrack has been called through line 1 for some $i\in\mathcal{B}$ .

We call the stepsize returned by backTrack $\rho_{i}^{k}$ . Assuming that backTrack always terminates finitely (which we will show to be the case), we may write for $i\in\mathcal{B}$

[TABLE]

The only difference between the update for $i\in\mathcal{B}$ on line 1 and this update for $i\notin\mathcal{B}$ is that in the former, the stepsize $\rho_{i}^{k}$ is discovered by backtracking, while in the latter it is directly user-supplied.

The backTrack procedure computes several auxiliary quantities used to check the two backtracking termination conditions. The point $\hat{y}_{j}$ is calculated to be the same as $\hat{y}$ given in Definition 2. The quantity $\varphi_{j}^{+}=\langle Gz-\tilde{x}_{j},\tilde{y}_{j}-w\rangle$ is the value of $\varphi_{i,k}(z^{k},w_{i}^{k})$ corresponding to the candidate points $(\tilde{x}_{j},\tilde{y}_{j})$ . The quantity $\varphi$ computed on line 2 is equal to $\varphi_{i,k-1}(z^{k},w_{i}^{k})=\langle G_{i}z^{k}-x_{i}^{k-1},y_{i}^{k-1}-w_{i}^{k}\rangle$ . Typically, we want $\varphi_{j}^{+}$ to be as large as possible to get a bigger cut with the separating hyperplane, but the condition checked on line 2 will ultimately suffice to prove convergence.
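The backtracking loop can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 2: the two termination tests are taken to be the inequalities of Lemmas 10 and 11 as they follow from the proofs in Section 5, with the trial stepsize in place of $\rho$, and the initial trial stepsize `rho0` is simply supplied by the caller instead of being drawn from the interval involving $\eta$. The operator choices ($\ell_{1}$ subdifferential, least-squares gradient) and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(t, tau):
    """Resolvent of tau * (subdifferential of the l1 norm)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def F(Gz, x, Bx, w, alpha, rho, B, resolvent):
    """Single-forward-step map; also returns y_hat = a_plus + Bx."""
    t = (1 - alpha) * x + alpha * Gz - rho * (Bx - w)
    x_plus = resolvent(t, rho)
    a_plus = (t - x_plus) / rho
    return x_plus, a_plus + B(x_plus), a_plus + Bx

def backtrack(Gz, x, y, w, rho0, alpha, delta, theta_hat, w_hat, B, resolvent,
              tol=1e-9, max_iter=200):
    Bx = B(x)                        # in a full implementation this would be cached
    phi = (Gz - x) @ (y - w)
    rho = rho0
    for _ in range(max_iter):
        x_t, y_t, y_hat = F(Gz, x, Bx, w, alpha, rho, B, resolvent)
        phi_plus = (Gz - x_t) @ (y_t - w)
        # Assumed form of the first test (the bound of Lemma 10).
        cond1 = np.linalg.norm(x_t - theta_hat) <= (
            (1 - alpha) * np.linalg.norm(x - theta_hat)
            + alpha * np.linalg.norm(Gz - theta_hat)
            + rho * np.linalg.norm(w - w_hat) + tol)
        # Assumed form of the second test (the bound of Lemma 11).
        rhs = (rho / (2 * alpha)) * (np.sum((y_t - w) ** 2)
                                     + alpha * np.sum((y_hat - w) ** 2)) \
            + (1 - alpha) * (phi - (rho / (2 * alpha)) * np.sum((y - w) ** 2))
        cond2 = phi_plus + tol >= rhs
        if cond1 and cond2:
            return x_t, y_t, rho
        rho *= delta                 # shrink the trial stepsize and retry
    raise RuntimeError("backtracking did not terminate within max_iter trials")

rng = np.random.default_rng(7)
d, lam = 6, 0.2
M = rng.standard_normal((10, d))
c = rng.standard_normal(10)
B = lambda v: M.T @ (M @ v - c)
L = np.linalg.norm(M, 2) ** 2
resolvent = lambda t, rho: soft_threshold(t, rho * lam)
alpha, delta = 0.1, 0.7

u = rng.standard_normal(d)
x0 = soft_threshold(u, lam)
y0 = (u - x0) + B(x0)                # (u - x0) lies in A(x0), so y0 is in A(x0) + B(x0)
Gz = rng.standard_normal(d)
w = rng.standard_normal(d)
rho0 = 50.0 / L                      # deliberately too large, to force some backtracking
x1, y1, rho_bt = backtrack(Gz, x0, y0, w, rho0, alpha, delta, x0, y0, B, resolvent)
```

Here $(\hat{\theta},\hat{w})$ is set to the initial pair `(x0, y0)`. Because both tests hold whenever the trial stepsize is at most $2(1-\alpha)/L$, the returned stepsize in this sketch cannot fall below $\delta\cdot 2(1-\alpha)/L$, even though $L$ is never passed to `backtrack`.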

Algorithm 1 has several additional parameters.

$(\hat{\theta}_{i},\hat{w}_{i})$ :

these are used in the backtracking procedure for $i\in\mathcal{B}$. An obvious choice, which we used in our numerical experiments, is $(\hat{\theta}_{i},\hat{w}_{i})=(x_{i}^{0},y_{i}^{0})$, i.e., the initial point.

$\gamma>0$ :

allows for the projection to be performed using a slightly more general primal-dual metric than (16). In effect, this parameter changes the relative size of the primal and dual updates in lines 3–3 of Algorithm 3. As $\gamma$ increases, a smaller step is taken in the primal and a larger step in the dual. As $\gamma$ decreases, a smaller step is taken in the dual update and a larger step is taken in the primal. See [19, Sec. 5.1] and [18, Sec. 4.1] for more details.

In Algorithm 1, the averaging parameters $\alpha_{i}$ and user-selected stepsizes $\rho_{i}$ are fixed across all iterations. In the preprint version of this paper [24], we instead allow these parameters to vary by iteration, subject to certain restrictions. Doing so complicates the notation and the analysis, so for relative simplicity we consider only fixed values of these parameters here. This simplification also accords with the parameter choices in our computational tests below. For the full, more complicated analysis, please refer to [24].

As written, Algorithm 1 is not as efficient as it could be. On the surface, it seems that we need to recompute $B_{i}x_{i}^{k-1}$ in order to evaluate $\mathcal{F}$ on line 1. However, $B_{i}x_{i}^{k-1}$ was already computed in the previous iteration and can obviously be reused, so only one evaluation of $B_{i}$ is needed per iteration. Similarly, within backTrack, each invocation of $\mathcal{F}$ on line 2 may reuse the quantity $Bx=B_{i}x_{i}^{k-1}$ which was computed in the previous iteration of Algorithm 1. Thus, each iteration of the loop within backTrack requires one new evaluation of $B$ , to compute $B\tilde{x}_{j}$ within $\mathcal{F}$ .
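The reuse of $B_{i}x_{i}^{k-1}$ can be made concrete with a small counting experiment. In this sketch $z$ and $w$ are held fixed purely for illustration (in Algorithm 1 they change at every iteration), $B$ is a linear map $x\mapsto Qx$ with $Q$ symmetric positive semidefinite, and $A$ is the normal cone of a box, whose resolvent is a projection. The class `CountingOperator` is a hypothetical helper.

```python
import numpy as np

class CountingOperator:
    """Wraps an operator and counts how many times it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

rng = np.random.default_rng(3)
d = 4
R = rng.standard_normal((d, d))
Q = R @ R.T                                   # symmetric PSD, so x -> Qx is cocoercive
L = np.linalg.eigvalsh(Q)[-1]                 # largest eigenvalue
B = CountingOperator(lambda x: Q @ x)
project_box = lambda t: np.clip(t, -1.0, 1.0) # resolvent of the normal cone of [-1, 1]^d

alpha = 0.1
rho = 2.0 * (1.0 - alpha) / L
z = rng.standard_normal(d)                    # held fixed here purely for illustration
w = rng.standard_normal(d)

x = rng.standard_normal(d)
b = B(x)                                      # b^0 = B x^0: the only evaluation outside the loop
num_iters = 25
for _ in range(num_iters):
    t = (1 - alpha) * x + alpha * z - rho * (b - w)   # uses the cached b = B x^{k-1}
    x = project_box(t)
    b = B(x)                                  # one new evaluation, reused at the next iteration
```

After the loop, `B.calls` equals `num_iters + 1`: one evaluation for the initial point and exactly one per iteration.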

We now precisely state our stepsize assumption for the manually chosen stepsizes, as well as the stepsize upper bound $\hat{\rho}$ .

Assumption 2.

For $i\notin\mathcal{B}$ : If $L_{i}>0$ , then $0<\rho_{i}\leq 2(1-\alpha_{i})/L_{i},$ otherwise $\rho_{i}>0$ . The parameter $\hat{\rho}$ must satisfy

[TABLE]

Note that if $L_{i}>0$, Assumption 2 effectively limits $\alpha_{i}$ to be strictly less than $1$, since otherwise the stepsize $\rho_{i}$ would be forced to $0$, which is prohibited. In this case $\alpha_{i}$ must be chosen in $(0,1)$. On the other hand, if $L_{i}=0$, there is no constraint on $\rho_{i}$ other than that it is positive, and in this case $\alpha_{i}$ may be chosen in $(0,1]$.

3.5 Separator-Projector Properties

Lemma 4 details the key results for Algorithm 1 that stem from it being a separator-projector algorithm. While these properties alone do not guarantee convergence, they are important to all of the arguments that follow.

Lemma 4.

Suppose that Assumption 1 holds. Then for Algorithm 1:

1. The sequence $\{p^{k}\}=\{(z^{k},w_{1}^{k},\ldots,w_{n-1}^{k})\}$ is bounded.

2. If the algorithm never terminates via line 1, $p^{k}-p^{k+1}\to 0$. Furthermore $z^{k}-z^{k-1}\to 0$ and $w_{i}^{k}-w_{i}^{k-1}\to 0$ for $i=1,\ldots,n$.

3. If the algorithm never terminates via line 1 and $\|\nabla\varphi_{k}\|$ remains bounded for all $k\geq 1$, then $\limsup_{k\to\infty}\varphi_{k}(p^{k})\leq 0$.

Proof.

Parts 1–2 are proved in lemmas 2 and 6 of [21]. Part 3 can be found in Part 1 of the proof of Theorem 1 in [21]. The analysis in [21] uses a different procedure to construct the pairs $(x_{i}^{k},y_{i}^{k})$ , but the result is generic and not dependent on that particular procedure. Note also that [21] establishes the results in a more general setting allowing asynchrony and block-iterativeness, which we do not analyze here. ∎

4 The Special Case $n=1$

Before starting the analysis, we consider the important special case $n=1$. In this case, we have by assumption that $G_{1}=I$, $w_{1}^{k}=0$, and we are solving the problem $0\in Az+Bz,$ where both operators are maximal monotone and $B$ is $L^{-1}$-cocoercive. Algorithm 1 then reduces to a method which is similar to FB. Let $x^{k}\triangleq x_{1}^{k}$, $y^{k}\triangleq y_{1}^{k}$, $\alpha\triangleq\alpha_{1}$, and $\rho\triangleq\rho_{1}$. Assuming for simplicity that $\mathcal{B}=\emptyset$, meaning backtracking is not being used, the updates carried out by the algorithm are

[TABLE]

If $\alpha=0$ , then for all $k\geq 2$ , the iterates computed in (25) reduce simply to

$$x^{k}=(I+\rho A)^{-1}\big(x^{k-1}-\rho Bx^{k-1}\big),$$

which is exactly FB. However, $\alpha=0$ is not allowed in our analysis. Thus, FB is a forbidden boundary case which may be approached by setting $\alpha$ arbitrarily close to $0$. As $\alpha$ approaches $0$, the stepsize constraint $\rho\leq 2(1-\alpha)/L$ approaches the classical stepsize constraint for FB: $\rho\leq 2/L-\epsilon$ for some arbitrarily small constant $\epsilon>0$. A potential benefit of Algorithm 1 over FB in the $n=1$ case is that it does allow for backtracking when $L$ is unknown or only a conservative estimate is available.
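The boundary case can be illustrated with a small sketch of the $x$-update alone for $n=1$ (so $G=I$ and $w=0$), assuming a lasso-type instance with $A=\lambda\,\partial\|\cdot\|_{1}$ and $B$ a least-squares gradient. The projection step that produces $z^{k}$ is not modeled, and the names `x_update` and `fb_step` are hypothetical.

```python
import numpy as np

def soft_threshold(t, tau):
    """Resolvent of tau * (subdifferential of the l1 norm)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

rng = np.random.default_rng(5)
d, lam = 5, 0.1
M = rng.standard_normal((20, d))
c = rng.standard_normal(20)
B = lambda x: M.T @ (M @ x - c)
L = np.linalg.norm(M, 2) ** 2

def x_update(x_prev, z, alpha, rho):
    """x-update for n = 1 (G = I, w = 0), following the form of update (11)."""
    t = (1 - alpha) * x_prev + alpha * z - rho * B(x_prev)
    return soft_threshold(t, rho * lam)

def fb_step(x_prev, rho):
    """Classical forward-backward (proximal gradient) step."""
    return soft_threshold(x_prev - rho * B(x_prev), rho * lam)

rho = 1.0 / L
x0 = rng.standard_normal(d)
z_any = rng.standard_normal(d)       # arbitrary: it has no effect when alpha = 0
same_as_fb = np.allclose(x_update(x0, z_any, 0.0, rho), fb_step(x0, rho))

# With alpha = 0 the iteration no longer depends on z and is plain FB.
x = x0
for _ in range(3000):
    x = x_update(x, z_any, 0.0, rho)
fixed_point_residual = np.linalg.norm(x - fb_step(x, rho))
```

Setting `alpha = 0.0` removes all dependence on `z_any`, so the recursion is the classical proximal gradient iteration and settles at one of its fixed points.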

5 Main Proof

The core of the proof strategy will be to establish (26) below. If this can be done, then weak convergence to a solution follows from part 3 of Theorem 1 in [21].

Lemma 5.

Suppose Assumption 1 holds and Algorithm 1 produces an infinite sequence of iterations without terminating via Line 1. If

[TABLE]

then there exists $(\overline{z},\overline{{\bf w}})\in\mathcal{S}$ such that $(z^{k},{\bf w}^{k})\rightharpoonup(\overline{z},\overline{{\bf w}})$ . Furthermore, we also have $x_{i}^{k}\rightharpoonup G_{i}\bar{z}$ and $y_{i}^{k}\rightharpoonup\overline{w}_{i}$ for all $i=1,\ldots,n-1$ , $x_{n}^{k}\rightharpoonup\bar{z}$ , and $y_{n}^{k}\rightharpoonup-\sum_{i=1}^{n-1}G_{i}^{*}\overline{w}_{i}$ .

Proof.

Equivalent to part 3 of the proof of Theorem 1 in [21].∎

Lemma 5 can be intuitively understood as follows. If we define, for all $k\geq 1$ ,

[TABLE]

then (26) is equivalent to saying that $\epsilon_{k}\to 0$. For all $k\geq 1$, we have $(x_{i}^{k},y_{i}^{k})\in\operatorname{gra}T_{i}$. If $\epsilon_{k}=0$, then $w_{i}^{k}=y_{i}^{k}\in T_{i}x_{i}^{k}=T_{i}G_{i}z^{k}$ and since $\sum_{i=1}^{n}G_{i}^{*}w_{i}^{k}=0$, it follows that $(z^{k},{\bf w}^{k})\in\mathcal{S}$ and $z^{k}$ solves (3). Thus $\epsilon_{k}$ can be thought of as the “residual” of the algorithm which measures how far it is from finding a point in $\mathcal{S}$ and a solution to (3). In finite dimension, it is straightforward to show that if $\epsilon_{k}\to 0$, then $(z^{k},{\bf w}^{k})$ must converge to some element of $\mathcal{S}$. This can be done using Fejér monotonicity [4, Theorem 5.5] combined with the fact that the graph of a maximal-monotone operator in a finite-dimensional Hilbert space is closed [4, Proposition 20.38]. However, in the general Hilbert space setting the proof is more delicate, since the graph of a maximal-monotone operator is not in general closed in the weak-to-weak topology [4, Example 20.39]. Nevertheless, the overall result was established in the general Hilbert space setting in part 3 of Theorem 1 of [21], which is itself an instance of [1, Proposition 2.4] (see also [4, Proposition 26.5]). An arguably more transparent proof can be found in [16] (this proof is only for the case $n=2$, but it can be extended).

In order to establish (26), we start by establishing certain contractive and “ascent” properties for the mapping $\mathcal{F}$ , and also show that the backtracking procedure terminates finitely. Then, we prove the boundedness of $x_{i}^{k}$ and $y_{i}^{k}$ , in turn yielding the boundedness of the gradients $\nabla\varphi_{k}$ and hence the result that $\limsup_{k\to\infty}\{\varphi_{k}(p^{k})\}\leq 0$ by Lemma 4. Next we establish a “Lyapunov-like” recursion for $\varphi_{i,k}(z^{k},w_{i}^{k})$ , relating $\varphi_{i,k}(z^{k},w_{i}^{k})$ to $\varphi_{i,k-1}(z^{k-1},w_{i}^{k-1})$ . Eventually this result will allow us to establish that $\liminf_{k}\varphi_{k}(p^{k})\geq 0$ and hence that $\varphi_{k}(p^{k})\to 0$ , which will in turn allow an argument that $y_{i}^{k}-w_{i}^{k}\to 0$ . The proof that $G_{i}z^{k}-x_{i}^{k}\to 0$ will then follow fairly elementary arguments.

The primary innovations of the upcoming proof are the ascent lemma and the way that it is used in Lemma 18 to establish $\varphi_{k}(p^{k})\to 0$ and $y_{i}^{k}-w_{i}^{k}\to 0$ . This technique is a significant deviation from previous analyses in the projective splitting family. In previous work, the strategy was to show that $\varphi_{i,k}(z^{k},w_{i}^{k})\geq C\max\{\|G_{i}z^{k}-x_{i}^{k}\|^{2},\|y_{i}^{k}-w_{i}^{k}\|^{2}\}$ for a constant $C>0$ , which may be combined with $\limsup\varphi_{k}(p^{k})\leq 0$ to imply (26). In contrast, in the algorithm of this paper we cannot establish such a result and in fact $\varphi_{i,k}(z^{k},w_{i}^{k})$ may be negative. Instead, we relate $\varphi_{k}(p^{k})$ to $\varphi_{k-1}(p^{k-1})$ to show that the separation improves at each iteration in a way which still leads to overall convergence.

5.1 Some Basic Results

We begin by stating three elementary results on sequences, which may be found in [36], and a basic, well known nonexpansivity property for forward steps with cocoercive operators.

Lemma 6.

[36, Lemma 1, Ch. 2] Suppose that $a_{k}\geq 0$ for all $k\geq 1$, $b\geq 0$, $0\leq\tau<1$, and $a_{k+1}\leq\tau a_{k}+b$ for all $k\geq 1$. Then $\{a_{k}\}$ is a bounded sequence.

Lemma 7.

[36, Lemma 3, Ch. 2] Suppose that $a_{k}\geq 0,b_{k}\geq 0$ for all $k\geq 1$, $b_{k}\to 0$, and there is some $0\leq\tau<1$ such that $a_{k+1}\leq\tau a_{k}+b_{k}$ for all $k\geq 1$. Then $a_{k}\to 0$.

Lemma 8.

Suppose that $0\leq\tau<1$ and $\{r_{k}\},\{b_{k}\}$ are sequences in $\mathbb{R}$ with the properties $b_{k}\to 0$ and $r_{k+1}\geq\tau r_{k}+b_{k}$ for all $k\geq 1$ . Then $\lim\inf_{k\to\infty}\{r_{k}\}\geq 0$ .

Proof.

Negating the assumed inequality yields $-r_{k+1}\leq\tau(-r_{k})-b_{k}$ . Applying [36, Lemma 3, Ch. 2] then yields $\lim\sup\{-r_{k}\}\leq 0$ .∎
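The behavior described by Lemmas 6 and 7 is easy to observe numerically. The sketch below iterates the recursions with equality, the extreme case allowed by the assumed inequalities; the helper `iterate` and the specific values of $\tau$, $b$, and $b_{k}=1/k$ are illustrative assumptions.

```python
def iterate(a1, tau, b_seq):
    """Return the sequence a_1, a_2, ... with a_{k+1} = tau * a_k + b_k."""
    a = [a1]
    for b in b_seq:
        a.append(tau * a[-1] + b)
    return a

tau = 0.5
n = 10000

# Lemma 6 setting: constant perturbation b, so the sequence stays bounded.
const_seq = iterate(10.0, tau, [3.0] * n)
bound = max(10.0, 3.0 / (1.0 - tau))   # a_k is a convex combination of a_1 and b / (1 - tau)

# Lemma 7 setting: perturbations b_k = 1/k tend to zero, so a_k tends to zero.
vanishing_seq = iterate(10.0, tau, [1.0 / k for k in range(1, n + 1)])
```

In the first case every term stays below `bound`; in the second the final term behaves roughly like $2/k$, in line with $a_{k}\to 0$.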

Lemma 9.

Suppose $B$ is $L^{-1}$ -cocoercive and $0\leq\rho\leq 2/L$ . Then for all $x,y\in\operatorname{dom}(B)$

$$\|x-y-\rho(Bx-By)\|\leq\|x-y\|.\tag{27}$$

Proof.

Squaring the left hand side of (27) yields

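$$\|x-y-\rho(Bx-By)\|^{2}=\|x-y\|^{2}-2\rho\langle x-y,Bx-By\rangle+\rho^{2}\|Bx-By\|^{2}\leq\|x-y\|^{2}-\rho\Big(\frac{2}{L}-\rho\Big)\|Bx-By\|^{2}\leq\|x-y\|^{2}.$$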

∎
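A quick numerical check of this nonexpansivity property, assuming the illustrative operator $Bx=Qx$ with $Q$ symmetric positive semidefinite, which is $L^{-1}$-cocoercive with $L$ the largest eigenvalue of $Q$:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
R = rng.standard_normal((d, d))
Q = R @ R.T                            # symmetric positive semidefinite
L = np.linalg.eigvalsh(Q)[-1]          # largest eigenvalue
B = lambda x: Q @ x

worst_ratio = 0.0
for rho in np.linspace(0.0, 2.0 / L, 21):   # stepsizes covering [0, 2/L]
    for _ in range(50):
        x, y = rng.standard_normal(d), rng.standard_normal(d)
        lhs = np.linalg.norm(x - y - rho * (B(x) - B(y)))
        worst_ratio = max(worst_ratio, lhs / np.linalg.norm(x - y))
```

Over all sampled pairs and stepsizes, `worst_ratio` should not exceed $1$ beyond rounding error; for stepsizes larger than $2/L$ the forward step can be expansive.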

5.2 A Contractive Result

We begin the main proof with a result on the one-forward-step mapping $\mathcal{F}$ from Definition 1. The following lemma will ultimately be used to show that the iterates remain bounded.

Lemma 10.

Suppose $(x^{+},y^{+})=\mathcal{F}_{\alpha,\rho}(z,x,w;A,B,G)$ , where $\mathcal{F}_{\alpha,\rho}$ is given in Definition 1. Recall that $B$ is $L^{-1}$ -cocoercive. If $L=0$ or $\rho\leq 2(1-\alpha)/L$ , then

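$$\|x^{+}-\hat{\theta}\|\leq(1-\alpha)\|x-\hat{\theta}\|+\alpha\|Gz-\hat{\theta}\|+\rho\|w-\hat{w}\|\tag{28}$$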

for any $\hat{\theta}\in\operatorname{dom}(A)$ and $\hat{w}\in A\hat{\theta}+B\hat{\theta}$ .

Proof.

Select any $\hat{\theta}\in\operatorname{dom}(A)$ and $\hat{w}\in A\hat{\theta}+B\hat{\theta}$ . Let $\hat{a}=\hat{w}-B\hat{\theta}\in A\hat{\theta}$ . It follows immediately from (4) that

[TABLE]

Therefore, (22) and (29) yield

[TABLE]

To obtain (a), one uses the nonexpansivity of the resolvent [4, Prop. 23.8(ii)]. To obtain (b), one regroups terms and adds and subtracts $B\hat{\theta}$. Then (c) follows from the triangle inequality. Finally we consider (d): If $L>0$, apply Lemma 9 to the first term on the right-hand side of (30) with the stepsize $\rho/(1-\alpha)$, which satisfies

$$\frac{\rho}{1-\alpha}\leq\frac{2}{L}$$

by the assumption $\rho\leq 2(1-\alpha)/L$. Alternatively, if $L=0$, implying that $B$ is a constant-valued operator, then $Bx=B\hat{\theta}$ and (d) is just an equality. ∎
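For reference, the chain of steps (a)–(d) described in this proof can be written out as follows when $\alpha<1$. This is a reconstruction from the verbal description of each step, writing $J_{\rho A}=(I+\rho A)^{-1}$ for the resolvent and using $\hat{\theta}=J_{\rho A}(\hat{\theta}+\rho\hat{a})$ with $\hat{a}=\hat{w}-B\hat{\theta}$; the exact layout of the original display may differ.

$$\begin{aligned}
\|x^{+}-\hat{\theta}\| &= \big\|J_{\rho A}\big((1-\alpha)x+\alpha Gz-\rho(Bx-w)\big)-J_{\rho A}\big(\hat{\theta}+\rho\hat{a}\big)\big\| \\
&\overset{(a)}{\leq} \big\|(1-\alpha)x+\alpha Gz-\rho(Bx-w)-\hat{\theta}-\rho\hat{a}\big\| \\
&\overset{(b)}{=} \Big\|(1-\alpha)\Big(x-\hat{\theta}-\tfrac{\rho}{1-\alpha}\big(Bx-B\hat{\theta}\big)\Big)+\alpha\big(Gz-\hat{\theta}\big)+\rho\big(w-\hat{w}\big)\Big\| \\
&\overset{(c)}{\leq} (1-\alpha)\Big\|x-\hat{\theta}-\tfrac{\rho}{1-\alpha}\big(Bx-B\hat{\theta}\big)\Big\|+\alpha\|Gz-\hat{\theta}\|+\rho\|w-\hat{w}\| \\
&\overset{(d)}{\leq} (1-\alpha)\|x-\hat{\theta}\|+\alpha\|Gz-\hat{\theta}\|+\rho\|w-\hat{w}\|.
\end{aligned}$$

Step (b) uses $-\rho(Bx-w)-\rho\hat{a}=-\rho(Bx-B\hat{\theta})+\rho(w-\hat{w})$, which is where $B\hat{\theta}$ is added and subtracted. When $\alpha=1$ the hypotheses force $L=0$, so $Bx=B\hat{\theta}$ and the term carrying the factor $\rho/(1-\alpha)$ is absent.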

We now prove the key “ascent lemma”. It shows that, while the one-forward-step update is not guaranteed to find a separating hyperplane at each iteration, it does make a certain kind of progress toward separation.

Lemma 11.

Suppose $(x^{+},y^{+})=\mathcal{F}_{\alpha,\rho}(z,x,w;A,B,G)$ , where $\mathcal{F}_{\alpha,\rho}$ is given in Definition 1. Recall $B$ is $L^{-1}$ -cocoercive. Let $y\in Ax+Bx$ and define $\varphi\triangleq\langle Gz-x,y-w\rangle$ . Further, define $\varphi^{+}\triangleq\langle Gz-x^{+},y^{+}-w\rangle$ , $t$ as in (22), and $\hat{y}\triangleq\rho^{-1}(t-x^{+})+Bx$ . If $\alpha\in(0,1]$ and $\rho\leq 2(1-\alpha)/L$ whenever $L>0$ , then

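$$\varphi^{+}\geq\frac{\rho}{2\alpha}\Big(\|y^{+}-w\|^{2}+\alpha\|\hat{y}-w\|^{2}\Big)+(1-\alpha)\Big(\varphi-\frac{\rho}{2\alpha}\|y-w\|^{2}\Big).\tag{31}$$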

Proof.

Since $y\in Ax+Bx$ , there exists $a\in Ax$ such that $y=a+Bx$ . Let $a^{+}\triangleq\rho^{-1}(t-x^{+})$ . Note by (4) that $a^{+}\in Ax^{+}$ . With this notation, $\hat{y}=a^{+}+Bx$ .

We may write the $x^{+}$ -update in (22) as

[TABLE]

which rearranges to

[TABLE]

Adding $Gz$ to both sides yields

[TABLE]

Substituting this equation into the definition of $\varphi^{+}$ yields

[TABLE]

We now focus on the second term in (33). Assume for now that $L>0$ (we will deal with the $L=0$ case below). We write

[TABLE]

To derive (34) we substituted $(y^{+},y)=(a^{+}+Bx^{+},a+Bx)$ and for the following inequality we used the monotonicity of $A$ and $L^{-1}$ -cocoercivity of $B$ (recall that $a\in Ax$ and $a^{+}\in Ax^{+}$ ). Substituting the resulting inequality back into (33) yields

[TABLE]

Subtracting $(1-\alpha)\varphi^{+}$ from both sides of the above inequality produces

[TABLE]

Applying (32) once again, this time to the third term on the right-hand side of (36), we write

[TABLE]

Substituting this equation back into (36) yields

[TABLE]

We next use the identity $\langle x_{1},x_{2}\rangle=\frac{1}{2}\|x_{1}\|^{2}+\frac{1}{2}\|x_{2}\|^{2}-\frac{1}{2}\|x_{1}-x_{2}\|^{2}$ on both inner products in (38), as follows:

[TABLE]

and

[TABLE]

Here we have used the identities

[TABLE]

Using (39)–(40) in (38) yields

[TABLE]

Consider this last expression: since $\alpha\leq 1$, the coefficient $(1-\alpha)\rho/2$ multiplying $\|a^{+}-a\|^{2}$ is nonnegative. Furthermore, since $\rho\leq 2(1-\alpha)/L$, the coefficient multiplying $\|Bx^{+}-Bx\|^{2}$ is also nonnegative. Therefore we may drop these two terms from the above inequality and divide by $\alpha$ to obtain (31).

Finally, we deal with the case in which $L=0$ , which implies that $Bx=v$ for some $v\in\mathcal{H}$ for all $x\in\mathcal{H}$ . The main difference is that the $\|Bx^{+}-Bx\|^{2}$ terms are no longer present since $Bx^{+}=Bx$ . The analysis is the same up to (33). In this case $Bx^{+}=v$ so instead of (35) we may deduce from (34) that

[TABLE]

Since $Bx^{+}=Bx=v$ is constant we also have that

[TABLE]

Thus, instead of (36) in this case we have the simpler inequality

[TABLE]

The term $\langle Gz-x^{+},w-y\rangle$ in (41) is dealt with just as in (36), by substitution of (32). This step now leads via (37) to

[TABLE]

Once again using $\langle x_{1},x_{2}\rangle=\frac{1}{2}\|x_{1}\|^{2}+\frac{1}{2}\|x_{2}\|^{2}-\frac{1}{2}\|x_{1}-x_{2}\|^{2}$ on the second term on the r.h.s. above yields

[TABLE]

We can lower-bound the $\|y^{+}-y\|^{2}$ term by $0$. Dividing through by $\alpha$ and rearranging, we obtain

[TABLE]

Since $y^{+}=\hat{y}$ in the $L=0$ case, this is equivalent to (31).∎
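The ascent inequality can be checked numerically on random instances. The sketch below assumes that (31) takes the form that results from carrying out the steps of the proof above, namely $\varphi^{+}\geq\frac{\rho}{2\alpha}\big(\|y^{+}-w\|^{2}+\alpha\|\hat{y}-w\|^{2}\big)+(1-\alpha)\big(\varphi-\frac{\rho}{2\alpha}\|y-w\|^{2}\big)$, and uses the illustrative operators $A=\lambda\,\partial\|\cdot\|_{1}$ and $B$ a least-squares gradient, with $G=I$. The function `ascent_gap` is hypothetical.

```python
import numpy as np

def soft_threshold(t, tau):
    """Resolvent of tau * (subdifferential of the l1 norm)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

rng = np.random.default_rng(4)
d, lam = 6, 0.3
M = rng.standard_normal((10, d))
c = rng.standard_normal(10)
B = lambda v: M.T @ (M @ v - c)
L = np.linalg.norm(M, 2) ** 2

def ascent_gap(alpha, rho):
    """Left-hand side minus right-hand side of the assumed form of (31)."""
    u = rng.standard_normal(d)
    x = soft_threshold(u, lam)
    y = (u - x) + B(x)               # (u - x) lies in A(x), so y is in A(x) + B(x)
    Gz = rng.standard_normal(d)
    w = rng.standard_normal(d)
    t = (1 - alpha) * x + alpha * Gz - rho * (B(x) - w)
    x_plus = soft_threshold(t, rho * lam)
    a_plus = (t - x_plus) / rho
    y_plus = a_plus + B(x_plus)
    y_hat = a_plus + B(x)
    phi = (Gz - x) @ (y - w)
    phi_plus = (Gz - x_plus) @ (y_plus - w)
    rhs = (rho / (2 * alpha)) * (np.sum((y_plus - w) ** 2)
                                 + alpha * np.sum((y_hat - w) ** 2)) \
        + (1 - alpha) * (phi - (rho / (2 * alpha)) * np.sum((y - w) ** 2))
    return phi_plus - rhs

# Test at the largest admissible stepsize rho = 2 (1 - alpha) / L for several alpha.
gaps = [ascent_gap(alpha, 2.0 * (1.0 - alpha) / L)
        for alpha in (0.1, 0.5, 0.9) for _ in range(200)]
min_gap = min(gaps)
```

A nonnegative `min_gap` (up to rounding) is consistent with the lemma. Note that $\varphi^{+}$ itself may still be negative, since the right-hand side contains the term $-(1-\alpha)\frac{\rho}{2\alpha}\|y-w\|^{2}$.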

5.3 Finite Termination of Backtracking

In all the following lemmas in sections 5.3 and 5.4 regarding algorithms 1–3, assumptions 1 and 2 are in effect and will not be explicitly stated in each lemma. We start by proving that backTrack terminates in a finite number of iterations, and that the stepsizes it returns are bounded away from $0$.

Lemma 12.

For $i\in\mathcal{B}$ , Algorithm 2 terminates in a finite number of iterations for all $k\geq 1$ . There exists $\underline{\rho}_{i}>0$ such that $\rho_{i}^{k}\geq\underline{\rho}_{i}$ for all $k\geq 1$ , where $\rho_{i}^{k}$ is the stepsize returned by Algorithm 2 on line 1. Furthermore $\rho_{i}^{k}\leq\hat{\rho}$ for all $k\geq 1$ .

Proof.

Assume we are at iteration $k\geq 1$ in Algorithm 1 and backTrack has been called through line 1 for some $i\in\mathcal{B}$ . The internal variables within backTrack are defined in terms of the variables passed from Algorithm 1 as follows: $z=z^{k}$ , $x=x_{i}^{k-1}$ , $w=w_{i}^{k}$ , $y=y_{i}^{k-1}$ , $\rho=\rho_{i}^{k-1}$ and $\eta=\eta_{i}^{k-1}$ . Furthermore $\alpha=\alpha_{i}$ , $\hat{\theta}=\hat{\theta}_{i}$ , $\hat{w}=\hat{w}_{i}$ , $A=A_{i}$ , $B=B_{i}$ , and $G=G_{i}$ . The calculation on line 2 of Algorithm 2 yields $\varphi=\varphi_{i,k-1}(z^{k},w_{i}^{k})$ . In the following argument, we mostly refer to the internal name of the variables within backTrack without explicitly making the above substitutions. With that in mind, let $L=L_{i}$ be the cocoercivity constant of $B=B_{i}$ .

Recall that $\tilde{\rho}^{(i,k)}$ is the initial trial stepsize $\tilde{\rho}_{1}$ chosen on line 2 of backTrack. We must establish that the interval on line 2 is always nonempty and so a valid initial stepsize can be chosen. Since $\eta\alpha\geq 0$ , this will be true if $\hat{\rho}\geq\rho=\rho_{i}^{k-1}$ , which we will prove by induction. Note that by Assumption 2, $\hat{\rho}\geq\rho_{i}^{0}$ for all $i\in\mathcal{B}$ . Therefore for $k=1$ , $\hat{\rho}\geq\rho=\rho_{i}^{0}$ . We will prove the induction step below.

Observe that backtracking terminates via line 2 if two conditions are met. The first condition,

[TABLE]

is identical to (28) of Lemma 10, with $\tilde{x}_{j}$ and $\tilde{\rho}_{j}$ respectively in place of $x^{+}$ and $\rho$ . The initialization step of Algorithm 2 provides us with $\hat{w}\in A\hat{\theta}+B\hat{\theta}$ for some $\hat{\theta}\in\operatorname{dom}(A)$ . Furthermore, since

[TABLE]

the findings of Lemma 10 may be applied. In particular, if $L>0$ and $\tilde{\rho}_{j}\leq 2(1-\alpha)/L$ , then (42) will be met. Alternatively, if $L=0$ , (42) will hold for any value of the stepsize $\tilde{\rho}_{j}>0$ .

Next, consider the second termination condition,

[TABLE]

This relation is identical to (31) of Lemma 11, with $(\tilde{y}_{j},\hat{y}_{j},\tilde{\rho}_{j})$ in place of $(y^{+},\hat{y},\rho)$ . However, to apply the lemma we must show that $y=y_{i}^{k-1}\in Ax_{i}^{k-1}+Bx_{i}^{k-1}=Ax+Bx$ . We will also prove this by induction.

For $k=1$, $y=y_{i}^{k-1}\in Ax_{i}^{k-1}+Bx_{i}^{k-1}=Ax+Bx$ holds by the initialization step of Algorithm 1. Now assume that at iteration $k\geq 2$ it holds that $y=y_{i}^{k-1}\in Ax_{i}^{k-1}+Bx_{i}^{k-1}=Ax+Bx$ and furthermore that $\hat{\rho}\geq\rho=\rho_{i}^{k-1}$, so that the interval on line 2 is nonempty. We may then apply the findings of Lemma 11 to conclude that if $L>0$ and $\tilde{\rho}_{j}\leq 2(1-\alpha)/L$, then condition (43) is satisfied. Or, if $L=0$, condition (43) is satisfied for any $\tilde{\rho}_{j}>0$.

Combining the above observations, we conclude that if $L>0$ and $\tilde{\rho}_{j}\leq 2(1-\alpha)/L$ , backtracking will terminate for that iteration $j$ of backTrack via line 2. Or, if $L=0$ , it will terminate in the first iteration of backTrack. The stepsize decrement condition on line 2 of the backtracking procedure implies that $\tilde{\rho}_{j}\leq 2(1-\alpha)/L$ will eventually hold for large enough $j$ , and hence that the two backtracking termination conditions must eventually hold.

Let $j^{*}\geq 1$ be the iteration at which backtracking terminates when called for operator $i$ at iteration $k$ of Algorithm 1. For the pair $(x_{i}^{k},y_{i}^{k})$ returned by backTrack on line 1 of Algorithm 1, we may write

[TABLE]

Thus, by the definition of $\mathcal{F}$ in (22), $y_{i}^{k}\in A_{i}x_{i}^{k}+B_{i}x_{i}^{k}$ . Therefore, induction establishes that $y_{i}^{k}\in A_{i}x_{i}^{k}+B_{i}x_{i}^{k}$ holds for all $k\geq 1$ .

Now the returned stepsize must satisfy $\rho_{i}^{k}=\tilde{\rho}_{j^{*}}\leq\tilde{\rho}^{(i,k)}\leq\hat{\rho}$ . In the next iteration, $\rho=\rho_{i}^{k}\leq\hat{\rho}$ . Thus we have also established by induction that $\hat{\rho}\geq\rho=\rho_{i}^{k}$ and therefore that the interval on line 2 is nonempty for all iterations $k\geq 1$ . Finally, we now also infer by induction that backTrack terminates in a finite number of iterations for all $k\geq 1$ and $i\in\mathcal{B}$ .

Now $\tilde{\rho}^{(i,k)}$ must be chosen in the range

[TABLE]

Since we have established that this interval remains nonempty, it holds trivially that $\tilde{\rho}^{(i,k)}\geq\rho_{i}^{k-1}$ . For all $k\geq 1$ and $i\in\mathcal{B}$ , the returned stepsize $\rho_{i}^{k}=\tilde{\rho}_{j^{*}}$ must satisfy

[TABLE]

Therefore for all $k\geq 1$ and all $i\in\mathcal{B}$ such that $L_{i}>0$ , one has

[TABLE]

where the first inequality uses (44) and $\tilde{\rho}^{(i,k)}\geq\rho_{i}^{k-1}$ , the second inequality recurses, and the final inequality is just (44) for $k=1$ . If $L_{i}=0$ , the argument is simply

[TABLE]

∎

5.4 Boundedness Results and their Direct Consequences

Lemma 13.

For all $i=1,\ldots,n$ , the sequences $\{x_{i}^{k}\}$ and $\{y_{i}^{k}\}$ are bounded.

Proof.

To prove this, we first establish that for $i=1,\ldots,n$ and $k\geq 1$

[TABLE]

For $i\in\mathcal{B}$ , Lemma 12 establishes that backTrack terminates for finite $j\geq 1$ for all $k\geq 1$ . For fixed $k\geq 1$ and $i\in\mathcal{B}$ , let $j^{*}\geq 1$ be the iteration of backTrack that terminates. At termination, the following condition is satisfied via line 2:

[TABLE]

Into this inequality, substitute the following variables from Algorithm 1, as passed to and from backTrack: $x_{i}^{k}=\tilde{x}_{j^{*}}$ , $\hat{\theta}_{i}=\hat{\theta}$ , $\alpha_{i}=\alpha$ , $x_{i}^{k-1}=x$ , $G_{i}=G$ , $z^{k}=z$ , $\rho_{i}^{k}=\tilde{\rho}_{j^{*}}$ , $w_{i}^{k}=w$ , and $\hat{w}_{i}=\hat{w}$ . Further noting that $\rho_{i}^{k}\leq\hat{\rho}$ , the result is (45).

For $i\notin\mathcal{B}$ , we note that line 1 of Algorithm 1 reads as

[TABLE]

and since Assumption 2 holds, we may apply Lemma 10. Further noting that $\rho_{i}\leq\hat{\rho}$ by Assumption 2, we arrive at (45).

Since $\{z^{k}\}$ and $\{w_{i}^{k}\}$ are bounded by Lemma 4 and $\|G_{i}\|$ is bounded by Assumption 1, boundedness of $\{x_{i}^{k}\}$ now follows by applying Lemma 6 with $\tau=1-\alpha_{i}<1$ to (45).

Next, boundedness of $B_{i}x_{i}^{k}$ follows from the continuity of $B_{i}$ . Since Lemma 12 established that backTrack terminates in a finite number of iterations we have for any $k\geq 2$ that

[TABLE]

where for $i\notin\mathcal{B}$ $\rho_{i}^{k}\triangleq\rho_{i}$ . Expanding the $y^{+}$ -update in the definition of $\mathcal{F}$ in (22), we may write

[TABLE]

Since $G_{i}$ , $z^{k}$ , and $w_{i}^{k}$ are bounded, for $i\in\mathcal{B}$ $\rho_{i}^{k}\leq\hat{\rho}$ , and $\rho_{i}^{k}\geq\underline{\rho}_{i}$ (using Lemma 12 for $i\in\mathcal{B}$ ), and for $i\notin\mathcal{B}$ $\rho_{i}^{k}=\rho_{i}$ is constant, we conclude that $y_{i}^{k}$ remains bounded.∎

With $\{x_{i}^{k}\}$ and $\{y_{i}^{k}\}$ bounded for all $i=1,\ldots,n$ , the boundedness of $\nabla\varphi_{k}$ follows immediately:

Lemma 14.

The sequence $\{\nabla\varphi_{k}\}$ is bounded. If Algorithm 1 never terminates via line 1, $\limsup_{k\to\infty}\varphi_{k}(p^{k})\leq 0$ .

Proof.

By Lemma 3, $\nabla_{z}\varphi_{k}=\sum_{i=1}^{n}G_{i}^{*}y_{i}^{k}$ , which is bounded since each $G_{i}$ is bounded by assumption and each $\{y_{i}^{k}\}$ is bounded by Lemma 13. Furthermore, $\nabla_{w_{i}}\varphi_{k}=x_{i}^{k}-G_{i}x_{n}^{k}$ is bounded using the same two lemmas. That $\limsup_{k\to\infty}\varphi_{k}(p^{k})\leq 0$ then immediately follows from Lemma 4(3).∎

Using the boundedness of $\{x_{i}^{k}\}$ and $\{y_{i}^{k}\}$ , we can next derive the following simple bound relating $\varphi_{i,k-1}(z^{k},w_{i}^{k})$ to $\varphi_{i,k-1}(z^{k-1},w_{i}^{k-1})$ :

Lemma 15.

There exist $M_{1},M_{2}\geq 0$ such that for all $k\geq 2$ and $i=1,\ldots,n$ ,

[TABLE]

Proof.

For each $i\in\{1,\ldots,n\}$ , let $M_{1,i},M_{2,i}\geq 0$ be respective bounds on $\big\{\|G_{i}z^{k-1}-x_{i}^{k-1}\|\big\}$ and $\big\{\|y_{i}^{k-1}-w_{i}^{k}\|\big\}$ , which must exist by Lemma 4, the boundedness of $\{x_{i}^{k}\}$ and $\{y_{i}^{k}\}$ , and the boundedness of $G_{i}$ . Let $M_{1}=\max_{i=1,\ldots,n}\{M_{1,i}\}$ and $M_{2}=\max_{i=1,\ldots,n}\{M_{2,i}\}$ . Then, for any $k\geq 2$ and $i\in\{1,\ldots,n\}$ , we may write

[TABLE]

where the last step uses the Cauchy-Schwarz inequality and the definitions of $M_{1}$ and $M_{2}$ .∎

5.5 A Lyapunov-Like Recursion for the Hyperplane

We now establish a Lyapunov-like recursion for the hyperplane. For this purpose, we need two more definitions.

Definition 2.

For all $k\geq 1$ , since Lemma 12 establishes that Algorithm 2 terminates in a finite number of iterations, we may write for $i=1,\ldots,n$ :

[TABLE]

where for $i\notin\mathcal{B}$ $\rho_{i}^{k}=\rho_{i}$ are actually fixed. Using (4) and the $x^{+}$ -update in (22), there exists $a_{i}^{k}\in A_{i}x_{i}^{k}$ such that

[TABLE]

Define $\hat{y}_{i}^{k}\triangleq a_{i}^{k}+B_{i}x_{i}^{k-1}$ .

Definition 3.

For $i\notin\mathcal{B}$ we will use $\rho_{i}^{k}\triangleq\rho_{i}$ , even though these stepsizes are fixed, so that we can use the same statements as for $i\in\mathcal{B}$ . Similarly we will use $\underline{\rho}_{i}\triangleq\rho_{i}$ for $i\notin\mathcal{B}$ .

Lemma 16.

For all $k\geq 1$ , and $i=1,\ldots,n$

[TABLE]

Proof.

For $i\in\mathcal{B}$ , recall that $\tilde{\rho}^{(i,k)}$ is the initial trial stepsize chosen on line 2 of backTrack at iteration $k$ . The condition on line 2 of backTrack guarantees that

[TABLE]

Multiplying through by $\alpha_{i}^{-1}\|y_{i}^{k}-w_{i}^{k}\|^{2}$ and noting that $\rho_{i}^{k+1}\leq\tilde{\rho}^{(i,k+1)}$ proves the lemma.

For $i\notin\mathcal{B}$ the expression holds trivially because $\rho_{i}^{k+1}=\rho_{i}^{k}=\rho_{i}$ . ∎

Lemma 17.

For all $k\geq 2$ and $i=1,\ldots,n$ ,

[TABLE]

and

[TABLE]

Proof.

Take any $i\in\mathcal{B}$ . Lemma 12 guarantees the finite termination of backTrack. Now consider the backtracking termination condition

[TABLE]

Fix some $k\geq 2$ , and let $j^{*}\geq 1$ be the iteration at which backTrack terminates. In the above inequality, make the following substitutions for the internal variables of backTrack by those passed in/out of the function: $\varphi_{i,k}(z^{k},x_{i}^{k})=\varphi_{j^{*}}^{+}$ , $\rho_{i}^{k}=\tilde{\rho}_{j^{*}}$ , $\alpha_{i}=\alpha$ , $y_{i}^{k}=\tilde{y}_{j^{*}}$ , $w_{i}^{k}=w$ , $\varphi_{i,k-1}(z^{k},w_{i}^{k})=\varphi$ . Furthermore, $\hat{y}_{i}^{k}=\hat{y}_{j^{*}}$ where $\hat{y}_{i}^{k}$ is defined in Definition 2. Together, these substitutions yield (47). We can then apply Lemma 16 to get (48).

Now take any $i\in\{1,\ldots,n\}\backslash\mathcal{B}$ . From line 1 of Algorithm 1, Assumption 2, and Lemma 11, we directly deduce (47). Combining this relation with (46) we obtain (48).∎

5.6 Finishing the Proof

We now work toward establishing the conditions of Lemma 5. Unless otherwise specified, we henceforth assume that Algorithm 1 runs indefinitely and does not terminate at line 1. Termination at line 1 is dealt with in Theorem 1 to come.

Lemma 18.

For all $i=1,\ldots,n$ , we have $y_{i}^{k}-w_{i}^{k}\to 0$ and $\varphi_{k}(p^{k})\to 0$ .

Proof.

Fix any $i\in\{1,\ldots,n\}$ . First, note that for all $k\geq 2$ ,

[TABLE]

where $d_{i}^{k}\triangleq M_{3}\|w_{i}^{k}-w_{i}^{k-1}\|+\|w^{k}_{i}-w^{k-1}_{i}\|^{2}$ and $M_{3}\geq 0$ is a bound on $2\|y_{i}^{k-1}-w_{i}^{k-1}\|$ , which must exist because both $\{y_{i}^{k}\}$ and $\{w_{i}^{k}\}$ are bounded by Lemmas 4 and 13. Note that $d_{i}^{k}\to 0$ as a consequence of Lemma 4.

Second, recall Lemma 15, which states that there exist $M_{1},M_{2}\geq 0$ such that for all $k\geq 2$ ,

[TABLE]

Now let, for all $k\geq 1$ ,

[TABLE]

so that

[TABLE]

Using (49) and (50) in (48) yields

[TABLE]

where

[TABLE]

Note that $\rho_{i}^{k}$ is bounded, $0<\alpha_{i}\leq 1$ , $\|G_{i}\|$ is finite, $\|z^{k}-z^{k-1}\|\to 0$ and $\|w_{i}^{k}-w_{i}^{k-1}\|\to 0$ by Lemma 4, and $d_{i}^{k}\to 0$ . Thus $e_{i}^{k}\to 0$ .

Since $0<\alpha_{i}\leq 1$ , we may apply Lemma 8 to (53) with $\tau=1-\alpha_{i}<1$ , which yields $\liminf_{k\to\infty}\{r^{k}_{i}\}\geq 0$ . Therefore

[TABLE]

On the other hand, $\limsup_{k\to\infty}\varphi_{k}(p^{k})\leq 0$ by Lemma 14. Therefore, using (52) and (55),

[TABLE]

Therefore $\lim_{k\to\infty}\big{\{}\varphi_{k}(p^{k})\big{\}}=0$ . Consider any $i\in\{1,\ldots,n\}$ . Combining $\lim_{k\to\infty}\big{\{}\varphi_{k}(p^{k})\big{\}}=0$ with $\liminf_{k\to\infty}\sum_{i=1}^{n}r_{i}^{k}\geq 0,$ we have

[TABLE]

Since $\rho_{i}^{k}\geq\underline{\rho}_{i}>0$ (using Lemma 12 for $i\in\mathcal{B}$ ) we conclude that $y_{i}^{k}-w_{i}^{k}\to 0$ .∎

We have already proved the first requirement of Lemma 5, that $y_{i}^{k}-w_{i}^{k}\to 0$ for all $i\in\{1,\ldots,n\}$ . We now work to establish the second requirement, that $G_{i}z^{k}-x_{i}^{k}\to 0$ . In the upcoming lemmas we continue to use the quantity $\hat{y}_{i}^{k}$ which is given in Definition 2.

Lemma 19.

Recall $\{\hat{y}_{i}^{k}\}_{k\in\mathbb{N}}$ from Definition 2. For all $i=1,\ldots,n$ , $\hat{y}_{i}^{k}-w_{i}^{k}\to 0$ .

Proof.

Fix any $k\geq 1$ . For all $i=1,\ldots,n$ , repeating (47) from Lemma 17, we have

[TABLE]

where we have used $r_{i}^{k}$ defined in (51) along with (49)–(50), and $e_{i}^{k}$ is defined in (54). This is the same argument used in Lemma 18, but now we apply (49)–(50) to (47), rather than (48), so that we can upper bound the $\|\hat{y}_{i}^{k}-w_{i}^{k}\|^{2}$ term. Summing over $i=1,\ldots,n$ yields

[TABLE]

Since $\varphi_{k}(p^{k})\to 0$ , $e_{i}^{k}\to 0$ , $\liminf_{k\to\infty}\{r_{i}^{k}\}\geq 0$ , and $\rho_{i}^{k}\geq\underline{\rho}_{i}>0$ for all $k$ , the above inequality implies that $\hat{y}_{i}^{k}-w_{i}^{k}\to 0$ .∎

Lemma 20.

For $i=1,\ldots,n$ , $x_{i}^{k}-x_{i}^{k-1}\to 0$ .

Proof.

Fix $i\in\{1,\ldots,n\}$ . Using the definition of $a_{i}^{k}$ in Definition 2, we have for $k\geq 1$ that

[TABLE]

Using the definition of $\hat{y}_{i}^{k}$ , also in Definition 2, this implies that

[TABLE]

Subtracting the second of these equations from the first yields, for all $k\geq 2$ ,

[TABLE]

Taking norms and using the triangle inequality yields, for all $k\geq 2$ , that

[TABLE]

where

[TABLE]

Since $\rho_{i}^{k}$ is bounded from above, $\tilde{e}_{i}^{k}\to 0$ using Lemma 19, the finiteness of $\|G_{i}\|$ , and Lemma 4. Furthermore, $\alpha_{i}>0$ , so we may apply Lemma 7 to (57) to conclude that $x_{i}^{k}-x_{i}^{k-1}\to 0$ .∎

Lemma 21.

For $i=1,\dots,n$ , $G_{i}z^{k}-x_{i}^{k}\to 0$ .

Proof.

Recalling (56), we first write

[TABLE]

Lemma 20 implies that the first term on the right-hand side of (58) converges to zero. Since $\{\rho_{i}^{k}\}$ is bounded, Lemma 19 implies that the second term on the right-hand side also converges to zero. Since $\alpha_{i}>0$ , we conclude that $\|G_{i}z^{k}-x_{i}^{k}\|\to 0$ . ∎

Finally, we can state our convergence result for Algorithm 1:

Theorem 1.

Suppose that Assumptions 1-2 hold. If Algorithm 1 terminates by reaching line 1, then its final iterate is a member of the extended solution set $\mathcal{S}$ . Otherwise, the sequence $\{(z^{k},{\bf w}^{k})\}$ generated by Algorithm 1 converges weakly to some point $(\bar{z},\overline{{\bf w}})$ in the extended solution set $\mathcal{S}$ of (2) defined in (5). Furthermore, $x_{i}^{k}\rightharpoonup G_{i}\bar{z}$ and $y_{i}^{k}\rightharpoonup\overline{w}_{i}$ for all $i=1,\ldots,n-1$ , $x_{n}^{k}\rightharpoonup\bar{z}$ , and $y_{n}^{k}\rightharpoonup-\sum_{i=1}^{n-1}G_{i}^{*}\overline{w}_{i}$ .

Proof.

For the finite termination result we refer to Lemma 5 of [21]. Otherwise, Lemmas 18 and 21 imply that the hypotheses of Lemma 5 hold, and the result follows.∎

6 Numerical Experiments

All our numerical experiments were implemented in Python (using numpy and scipy) on an Intel Xeon workstation running Linux with 16 cores and 64 GB of RAM. The code is available via github at https://github.com/projective-splitting/coco. We restricted our attention to algorithms with comparable features and benefits to our proposed method. Thus we only considered methods that:

1. Are first-order and “fully split” the problem (that is, separate the linear operators $G_{i}$ from the resolvent calculations, and use gradient-type steps for smooth functions),

2. Do not (either approximately or exactly) solve a linear system of equations at each iteration or before the first iteration,

3. Avoid having to apply “smoothing” to nonsmooth operators,

4. Incorporate a backtracking linesearch in a manner that avoids the need for bounds on Lipschitz or cocoercivity constants, and

5. Do not use iterative approximation of resolvents.

The last property we include for reasons of simplicity, while the rest contribute to making algorithms scalable and easy to apply. For a given application, there may of course be effective algorithms which could have been considered but do not satisfy all of the above requirements. However, because of the general desirability of properties 1-4 and the relative simplicity of algorithms with property 5, we only considered methods having all of them.

We compared this paper’s backtracking one-forward-step projective splitting algorithm given in Algorithm 1 (which we call ps1fbt) with the following methods:

•

The two-forward-step projective splitting algorithm with backtracking we developed in [21] (ps2fbt). This method requires only Lipschitz continuity of single-valued operators, as opposed to cocoercivity.

•

The adaptive three-operator splitting algorithm of [34] (ada3op) (where “adaptive” is used to mean “backtracking linesearch”); this method is a backtracking adaptation of the fixed-stepsize method proposed in [14]. This method requires $G_{i}=I$ in problem (2) and hence can only be readily applied to two of the three test applications described below.

•

The backtracking linesearch variant of the Chambolle-Pock primal-dual splitting method [30] (cp-bt).

•

The algorithm of [12]. This is essentially Tseng’s method applied to a product-space “monotone + skew” inclusion in the following way: Assuming $T_{n}$ is Lipschitz monotone, problem (3) is equivalent to finding $p\triangleq(z,w_{1},\ldots,w_{n-1})$ such that $w_{i}\in T_{i}G_{i}z$ (which is equivalent to $G_{i}z\in T_{i}^{-1}w_{i}$ ) for $i=1,\ldots,n-1$ , and $\sum_{i=1}^{n-1}G_{i}^{*}w_{i}=-T_{n}z$ . In other words, we wish to solve $0\in\tilde{A}p+\tilde{B}p$ , where $\tilde{A}$ and $\tilde{B}$ are defined by

[TABLE]

$\tilde{A}$ is maximal monotone, while $\tilde{B}$ is the sum of two Lipschitz monotone operators (the second being skew linear), and therefore also Lipschitz monotone. The algorithm in [12] is essentially Tseng’s forward-backward-forward method [40] applied to this inclusion, using resolvent steps for $\tilde{A}$ and forward steps for $\tilde{B}$ . Thus, we call this method tseng-pd. In order to achieve good performance with tseng-pd we had to incorporate a diagonal preconditioner as proposed in [41].

•

The recently proposed forward-reflected-backward method [31], applied to this same primal-dual inclusion $0\in\tilde{A}p+\tilde{B}p$ specified by (59)-(72). We call this method frb-pd.
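To make the two baseline iterations above concrete, the following is a minimal sketch of one forward-backward-forward step (the building block of tseng-pd) and one forward-reflected-backward step (the building block of frb-pd) for a generic inclusion $0\in Ax+Bx$. It is not the implementation used in the experiments: the function names, the toy operators, and the fixed stepsizes are illustrative assumptions, and neither the primal-dual product-space structure nor the preconditioner is included.

```python
import numpy as np

def fbf_step(x, resolvent_A, B, rho):
    """One Tseng forward-backward-forward step for 0 in A(x) + B(x)."""
    Bx = B(x)
    x_bar = resolvent_A(x - rho * Bx, rho)   # forward step, then backward (resolvent) step
    return x_bar - rho * (B(x_bar) - Bx)     # second forward (correction) step

def frb_step(x, x_prev, resolvent_A, B, rho):
    """One forward-reflected-backward step; reuses B at the previous iterate."""
    return resolvent_A(x - 2.0 * rho * B(x) + rho * B(x_prev), rho)

# Toy problem (hypothetical): A is the normal cone of the nonnegative orthant,
# whose resolvent is the projection max(., 0), and B(x) = x - c is 1-Lipschitz.
# The solution of 0 in A(x) + B(x) is max(c, 0).
c = np.array([1.0, -1.0, 0.5])
B = lambda x: x - c
project_orthant = lambda v, rho: np.maximum(v, 0.0)

x_fbf = np.ones(3)
for _ in range(200):
    x_fbf = fbf_step(x_fbf, project_orthant, B, rho=0.5)    # rho < 1/L

x_frb, x_prev = np.ones(3), np.ones(3)
for _ in range(200):
    x_frb, x_prev = frb_step(x_frb, x_prev, project_orthant, B, rho=0.4), x_frb  # rho < 1/(2L)
```

Note that the forward-backward-forward step evaluates $B$ twice per iteration, whereas the forward-reflected-backward step needs only one new evaluation if the previous value of $B$ is cached (the sketch recomputes it for brevity).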

Recently there have been several stochastic extensions of ada3op and cp-bt [42, 43, 33]. The method of [43] requires estimates of the Lipschitz constants and matrix norms, and so does not satisfy our experimental requirements. Since one of our problems is not in “finite-sum” format, and another includes a matrix $G_{i}$ which is not equal to the identity, the methods of [42, 33] could only be applied to one of our three test problems. Even for this problem, the numbers of training examples in the two datasets were $60$ and $127$ , respectively, while the feature dimensions were $7,\!705$ and $19,\!806$ , so finite-sum methods are not particularly suitable. For these reasons we did not include these methods in our experiments.

6.1 Portfolio Selection

Consider the optimization problem:

[TABLE]

where $Q\succeq 0$ , $r>0$ , and $m\in\mathbb{R}^{d}_{+}$ . This model arises in Markowitz portfolio theory. We chose this particular problem because it features two constraint sets (a general halfspace and a simplex) onto which it is easy to project individually, but whose intersection poses a more difficult projection problem. This property makes it difficult to apply first-order methods such as ISTA/FISTA [5] as they can only perform one projection per iteration and thus cannot fully split the problem. On the other hand, projective splitting can handle an arbitrary number of constraint sets so long as one can compute projections onto each of them. We consider a fairly large instance of this problem so that standard interior point methods (for example, those in the CVXPY [15] package) are disadvantaged by their high per-iteration complexity and thus not generally competitive with first-order methods. Furthermore, backtracking variants of first-order methods are preferable for large problems as they avoid the need to estimate the largest eigenvalue of $Q$ .

To convert (73) to a monotone inclusion, we set $A_{1}=N_{C_{1}}$ where $N_{C_{1}}$ is the normal cone of the simplex $C_{1}=\{x\in\mathbb{R}^{d}:\sum_{i=1}^{d}x_{i}=1,x_{i}\geq 0\}$ . We set $B_{1}x=2Qx$ , which is the gradient of the objective function and is cocoercive (and Lipschitz-continuous). Finally, we set $A_{2}=N_{C_{2}}$ , where $C_{2}=\{x:m^{\top}x\geq r\}$ , and let $B_{2}$ be the zero operator. Note that the resolvents of $N_{C_{1}}$ and $N_{C_{2}}$ (that is, the projections onto $C_{1}$ and $C_{2}$ ) are easily computed in $\operatorname{O}(d)$ operations [32]. With this notation, one may write (73) as the problem of finding $z\in\mathbb{R}^{d}$ such that

[TABLE]

which is an instance of (2) with $n=2$ and $G_{1}=G_{2}=I$ .
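The two resolvents in this formulation are plain Euclidean projections. The sketch below shows one possible way to compute them with numpy; it is an illustration rather than the code used in the experiments. The function names are hypothetical, $C_{1}$ is taken to be the unit simplex (entries summing to one), and the simplex projection uses the familiar sort-based scheme, which costs $\operatorname{O}(d\log d)$ rather than the $\operatorname{O}(d)$ of the method cited in [32].

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : sum(x) = 1, x >= 0} (sort-based)."""
    d = v.size
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    k = np.arange(1, d + 1)
    last = np.nonzero(u - css / k > 0)[0][-1] # last index that stays positive after the shift
    tau = css[last] / (last + 1.0)            # common shift applied to all entries
    return np.maximum(v - tau, 0.0)

def project_halfspace(v, m, r):
    """Euclidean projection of v onto {x : m^T x >= r}."""
    gap = r - m @ v
    if gap <= 0:                              # already feasible
        return v.copy()
    return v + (gap / (m @ m)) * m            # move along m until m^T x = r
```

Each projection is cheap on its own, which is what makes a splitting method that handles $C_{1}$ and $C_{2}$ separately attractive compared with projecting onto their intersection.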

To terminate each method in our comparisons, we used the following common criterion incorporating both the objective function and the constraints of (73):

[TABLE]

where $F^{*}$ is the optimal value of the problem. Note that $c(x)=0$ if and only if $x$ solves (73). To estimate $F^{*}$ , we used the best feasible value returned by any method after $1000$ iterations.

We generated random instances of (73) as follows. We set $d=10,000$ to obtain a relatively large instance of the problem. We then generated a $d\times d$ matrix $Q_{0}$ with each entry drawn from $\mathcal{N}(0,1)$ . The matrix $Q$ was then formed as $(1/d)\cdot Q_{0}Q_{0}^{\top}$ , which is guaranteed to be positive semidefinite. We then generated the vector $m\in\mathbb{R}^{d}$ with entries uniformly distributed between $0$ and $100$ . The constant $r$ was set to $\delta_{r}\sum_{i=1}^{d}m_{i}/d$ for various values of $\delta_{r}>0$ . We solved the problem for $\delta_{r}\in\{0.5,0.8,1,1.5\}$ .
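The instance generation just described can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' script: the function name, the seed handling, and the small default dimension are assumptions, and the lower end of the uniform range for $m$ is taken to be $0$.

```python
import numpy as np

def make_portfolio_instance(d=200, delta_r=0.8, seed=0):
    """Generate (Q, m, r) for a random portfolio instance (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Q0 = rng.standard_normal((d, d))       # entries drawn from N(0, 1)
    Q = (Q0 @ Q0.T) / d                    # positive semidefinite by construction
    m = rng.uniform(0.0, 100.0, size=d)    # lower end 0 is an assumption
    r = delta_r * m.sum() / d              # delta_r times the mean entry of m
    return Q, m, r

Q, m, r = make_portfolio_instance(d=50, delta_r=0.5)
```

Because $r$ is a multiple of the mean of $m$, larger values of $\delta_{r}$ shrink the set of simplex points that also satisfy $m^{\top}x\geq r$.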

All methods were initialized at the same point $[1~{}1~{}\ldots~{}1]^{\top}/d$ . For all the backtracking linesearch procedures except cp-bt , the initial stepsize estimate is the previously discovered stepsize; at the first iteration, the initial stepsize is $1$ . For cp-bt we allowed the stepsize to increase in accordance with [30, Algorithm 4], as performance was poor otherwise. The backtracking stepsize decrement factor ( $\delta$ in Algorithm 2) was $0.7$ for all algorithms.

For ps1fbt and ps2fbt, $\rho_{1}^{k}$ was discovered via backtracking. We also set the other stepsize $\rho_{2}^{k}$ equal to $\rho_{1}^{k}$ at each iteration. While this is not necessary, this heuristic performed well and eliminated $\rho_{2}^{k}$ as a separately tunable parameter. For the averaging parameters in ps1fbt, we used $\alpha_{1}=0.1$ and $\alpha_{2}=1$ (which is possible because $L_{2}=0$ ). For ps1fbt we set $\hat{\theta}_{1}=x_{1}^{0}$ and $\hat{w}_{1}=2Qx_{1}^{0}$ .

For tseng-pd and frb-pd, we used the following preconditioner:

[TABLE]

where $U$ is used as in [41, Eq. (3.2)] for tseng-pd ( $M^{-1}$ on [31, p. 7] for frb-pd). In this case, the “monotone + skew” primal-dual inclusion described in (59)-(72) features two $d$ -dimensional dual variables in addition to the $d$ -dimensional primal variable. The parameter $\gamma_{pd}$ changes the relative size of the steps taken in the primal and dual spaces, and plays a similar role to $\gamma$ in our algorithm (see Algorithm 3). The parameter $\beta$ in [30, Algorithm 4] plays a similar role for cp-bt. For all of these methods, we have found that performance is highly sensitive to this parameter: the primal and dual stepsizes need to be balanced. The only method not requiring such tuning is ada3op, which is a purely primal method. With this setup, all the methods have one tuning parameter except ada3op , which has none. For each method, we manually tuned the parameter for each $\delta_{r}$ ; Table 1 shows the final choices.

We calculated the criterion $c(x)$ in (74) for $x_{1}^{k}$ computed by ps1fbt and ps2fbt, $x_{t}$ computed on Line 3 of [34, Algorithm 1] for ada3op, $y^{k}$ computed in [30, Algorithm 4] for cp-bt, and the primal iterate for tseng-pd and frb-pd. Table 2 displays the average number of iterations and running time, over $10$ random trials, until $c(x)$ falls (and stays) below $10^{-5}$ for each method. Examining the table,

•

For all four problems, ps1fbt outperforms ps2fbt. This behavior is not surprising, as ps1fbt only requires one forward step per iteration, rather than two. Since the matrix $Q$ is large and dense, reducing the number of forward steps should have a sizable impact.

•

For $\delta_{r}<1$ , ps1fbt is the best-performing method. However, for $\delta_{r}\geq 1$ , ada3op is the quickest.

6.2 Sparse Group Logistic Regression

Consider the following problem:

[TABLE]

where $a_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\{\pm 1\}$ for $i=1,\ldots,n$ are given data, $\lambda_{1},\lambda_{2}\geq 0$ are regularization parameters, and $\mathcal{G}$ is a set of subsets of $\{1,\ldots,d\}$ such that no element is in more than one group $g\in\mathcal{G}$ . This is the non-overlapping group-sparse logistic regression problem, which has applications in bioinformatics, image processing, and statistics [37]. It is well understood that the $\ell_{1}$ penalty encourages sparsity in the solution vector. On the other hand, the group-sparse penalty encourages group sparsity, meaning that as $\lambda_{2}$ increases more groups in the solution will be set entirely to $0$ . The group-sparse penalty can be used when the features/predictors can be put into correlated groups in a meaningful way. As with the portfolio experiment, this problem features two nonsmooth regularizers and so methods like FISTA cannot easily be applied.

Problem (76) may be treated as a special case of (1) with $n=2$ , $G_{1}=G_{2}=I$ , and

[TABLE]

Since the logistic regression loss has a Lipschitz-continuous gradient and the $\ell_{1}$ -norm and non-overlapping group-lasso penalties both have computationally simple proximal operators, all our candidate methods may be applied.
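The two proximal operators referred to here have well-known closed forms: soft-thresholding for the $\ell_{1}$ norm and block soft-thresholding for the non-overlapping group norm. The sketch below illustrates them under stated assumptions: the function names are hypothetical, groups are given as lists of indices, and the group penalty is taken to be an unweighted sum of Euclidean norms, which may differ from the exact weighting in (76).

```python
import numpy as np

def prox_l1(v, t):
    """Prox of t * ||x||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_group_l2(v, t, groups):
    """Prox of t * sum_g ||x_g||_2 for non-overlapping index groups.

    Coordinates that belong to no group are left unchanged.
    """
    out = v.copy()
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        # Shrink the whole block toward zero; zero it out if its norm is at most t.
        scale = max(0.0, 1.0 - t / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * v[g]
    return out
```

The block form explains the group-sparsity effect: an entire group is set to zero as soon as its norm falls below the threshold, rather than coordinate by coordinate.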

We applied (76) to two bioinformatics classification problems with real data. Following [37], we use the breast cancer dataset of [27] and the inflammatory bowel disease (IBD) dataset of [6]. (The breast cancer dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379. The IBD dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3365.) The breast cancer dataset contains gene expression levels for 60 patients with estrogen-positive breast cancer. The patients were treated with tamoxifen for 5 years and classified based on whether the cancer recurred (there were 28 recurrences). The goal is to use the gene expression values to predict recurrence. The IBD data set contains gene expression levels for 127 patients, 85 of whom have IBD. The IBD data set actually features three classes: ulcerative colitis (UC), Crohn’s disease (CD), and normal, and so the most natural goal would be to perform three-way classification. For simplicity, we considered a two-way classification problem of UC/CD patients versus normal patients.

For both datasets, as in [37], the group structure $\mathcal{G}$ was extracted from the C1 dataset [38], which groups genes based on cytogenetic position data. (The C1 dataset is available at http://software.broadinstitute.org/gsea/index.jsp.) Genes that are in multiple C1 groups were removed from the dataset. (Overlapping group norms can also be handled with our method, but using a different problem formulation than (76).) We also removed genes that could not be found in the C1 dataset, although doing so was not strictly necessary. After these steps, the breast cancer data had 7,705 genes in 324 groups, with each group having an average of 23.8 genes. For the IBD data there were 19,836 genes in 325 groups, with an average of 61.0 genes per group. Let $A$ be the data matrix whose rows are $a_{i}^{\top}\in\mathbb{R}^{d}$ for $i=1,\ldots,n$ ; as a final preprocessing step, we normalized the columns of $A$ to have unit $\ell_{2}$ -norm, which tended to improve the performance of the first-order methods, especially the primal-dual ones.
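The final preprocessing step, scaling each column of the data matrix to unit $\ell_{2}$ norm, can be sketched as follows. The helper name is hypothetical, and leaving all-zero columns untouched is an assumption made here only to avoid division by zero.

```python
import numpy as np

def normalize_columns(A):
    """Scale each column of A to unit Euclidean norm (zero columns stay zero)."""
    norms = np.linalg.norm(A, axis=0)   # one norm per column (feature)
    norms[norms == 0] = 1.0             # assumption: do not rescale empty columns
    return A / norms
```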

For simplicity we set the regularization parameters to be equal: $\lambda_{1}=\lambda_{2}\triangleq\lambda$ . In practice, one would typically solve (76) for various values of $\lambda$ and then choose the final model based on cross-validation performance combined with other criteria such as sparsity. Therefore, to give an overall sense of the performance of each algorithm, we solved (76) for three values of $\lambda$ : large, medium, and small, corresponding to decreasing the amount of regularization and moving from a relatively sparse solution to a dense solution. For the breast cancer data, we selected $\lambda\in\{0.05,0.5,0.85\}$ and for IBD we chose $\lambda\in\{0.1,0.5,1\}$ . The corresponding number of non-zero entries, non-zero groups, and training error of the solution are reported in Table 3. Since the goal of these experiments is to assess the computational performance of the optimization solvers, we did not break up the data into training and test sets, instead treating the entire dataset as training data.

We initialized all the methods to the zero vector. As in the portfolio problem, all stepsizes were initially set to $1$ . Since the logistic regression function does not have uniform curvature, we allowed the initial trial stepsize in the backtracking linesearch to increase: it was set to $1.1$ times the previously discovered stepsize. The methods ps1fbt, cp-bt, and ada3op have an upper bound on the trial stepsize at each iteration, so for these methods the trial stepsize was taken to be the minimum of $1.1$ times the previous stepsize and this upper bound.
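The rule for choosing the initial trial stepsize of each backtracking search can be written as a small helper. This is only a sketch of the rule as described in the text; the function and argument names are hypothetical, and how each method computes its upper bound is not shown.

```python
def next_trial_stepsize(prev_stepsize, growth=1.1, upper_bound=None):
    """Initial trial stepsize for the next backtracking search.

    Grows the previously accepted stepsize by `growth` and, for methods that
    impose one, caps the result at `upper_bound`.
    """
    trial = growth * prev_stepsize
    if upper_bound is not None:
        trial = min(trial, upper_bound)
    return trial
```

Letting the trial stepsize grow allows the linesearch to recover larger steps when the iterates move into regions of lower curvature, at the cost of occasionally triggering extra backtracking steps.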

Otherwise, the setup was the same as the portfolio experiment. tseng-pd and frb-pd use the same preconditioner as given in (75). For ps1fbt and ps2fbt we set $\rho_{2}^{k}$ to be equal to the discovered backtracked stepsize $\rho_{1}^{k}$ at each iteration. For ps1fbt we again set $\hat{\theta}_{1}=x_{1}^{0}$ , $\hat{w}_{1}=\nabla h_{1}(x_{1}^{0})$ , and $\alpha_{1}^{k}$ fixed to $0.1$ . As such, all methods (except ada3op) have one tuning parameter which was hand-picked for each method; the chosen values are given in Table 4.

Figure 3 shows the results of the experiments, plotting $(F(x_{0},x)-F^{*})/F^{*}$ against time for each algorithm, where $F$ is the objective function in (76) and $F^{*}$ is the estimated optimal value. To approximate $F^{*}$ , we ran each algorithm for 4,000 iterations and took the lowest value obtained. Overall, ps1fbt and ada3op were much faster than the other methods. For the highly regularized cases (the right column of the figure), ps1fbt was faster than all other methods. For middle and low regularization, ps1fbt and ada3op are comparable, and for $\lambda=0.05$ ada3op is slightly faster for the breast cancer data. The methods ps1fbt and ada3op may be successful because they exploit the cocoercivity of the gradient, while ps2fbt, tseng-pd, and frb-pd only treat it as Lipschitz continuous. cp-bt also exploits cocoercivity, but its convergence was slow nonetheless. We discuss the performance of ps1fbt versus ps2fbt more in Section 6.3.

6.3 Final Comments: ps1fbt versus ps2fbt

On the portfolio problem, ps1fbt and ps2fbt have fairly comparable performance, with ps1fbt being slightly faster. However, for the group logistic regression problem, ps1fbt is significantly faster. Given that both methods are based on the same projective splitting framework but use different forward-step procedures to update $(x_{1}^{k},y_{1}^{k})$ , this difference may be somewhat surprising. Since ps1fbt only requires one forward step per iteration while ps2fbt requires two, one might expect ps1fbt to be about twice as fast as ps2fbt. But for the group logistic regression problem, ps1fbt significantly outpaces this level of performance.

Examining the stepsizes returned by backtracking for both methods reveals that ps1fbt returns much larger stepsizes for the logistic regression problem, typically $2$ - $3$ orders of magnitude larger; see Figure 4. For the portfolio problem, where the performance of the two methods is more similar, this is not the case: the ps1fbt stepsizes are typically about twice as large as the ps2fbt stepsizes, in keeping with their theoretical upper bounds of $2(1-\alpha_{i})/L_{i}$ and $1/L_{i}$ , respectively.
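As a quick check of the factor of two, one can take the ratio of the two bounds, attributing $2(1-\alpha_{i})/L_{i}$ to ps1fbt (the threshold appearing in the proof of Lemma 12) and $1/L_{i}$ to ps2fbt, and plugging in the value $\alpha_{1}=0.1$ used in the experiments:

$$\frac{2(1-\alpha_{1})/L_{1}}{1/L_{1}}=2(1-\alpha_{1})=2(1-0.1)=1.8 .$$

The bounds alone therefore account for a stepsize ratio of roughly $1.8$ , far short of the two to three orders of magnitude observed on the logistic regression problem.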

Note that the portfolio problem has a smooth function which is quadratic and hence has the same curvature everywhere, while group logistic regression does not. We hypothesize that the backtracking scheme in ps1fbt does a better job adapting to nonuniform curvature. A possible reason for this behavior is that the termination criterion for the backtracking search in ps1fbt may be weaker than for ps2fbt. For example, while ps2fbt requires $\varphi_{i,k}$ to be positive at each iteration $k$ and operator $i$ , ps1fbt does not.

Acknowledgments

This research was supported by the National Science Foundation grant CCF-1617617.

Bibliography

[1] Alotaibi, A., Combettes, P.L., Shahzad, N.: Solving coupled composite monotone inclusions by successive Fejér approximations of their Kuhn–Tucker set. SIAM Journal on Optimization 24 (4), 2076–2095 (2014)
[2] Baillon, J.B., Haddad, G.: Quelques propriétés des opérateurs angle-bornés et $n$-cycliquement monotones. Israel Journal of Mathematics 26 (2), 137–150 (1977)
[3] Bauschke, H.H., Combettes, P.L.: The Baillon-Haddad Theorem Revisited. Journal of Convex Analysis 17 (3-4, SI), 781–787 (2010)
[4] Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces, 2nd edn. Springer (2017)
[5] Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. Image Processing, IEEE Transactions on 18 (11), 2419–2434 (2009)
[6] Burczynski, M.E., Peterson, R.L., Twine, N.C., Zuberek, K.A., Brodeur, B.J., Casciotti, L., Maganti, V., Reddy, P.S., Strahs, A., Immermann, F., et al.: Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics 8 (1), 51–61 (2006)
[7] Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), 120–145 (2011)
[8] Combettes, P.L.: Fejér monotonicity in convex optimization. In: Encyclopedia of optimization, vol. 2, pp. 106–114. Springer Science & Business Media (2001)
