TL;DR
This paper introduces a novel single-forward-step projective splitting algorithm that efficiently exploits cocoercivity, enabling larger stepsizes and improved convergence in solving maximal monotone inclusions and convex optimization problems.
Contribution
The paper presents a new variant of projective splitting that processes cocoercive operators with a single forward step, matching the stepsize bounds of classical forward-backward splitting.
Findings
Allows larger stepsizes for cocoercive operators
Establishes a symmetry with classical splitting methods
Demonstrates competitive computational performance
| Method | | | | |
|---|---|---|---|---|
| ps1fbt | 3.6 (102) | 4.7 (102) | 16.3 (583) | 8.5 (255.2) |
| ps2fbt | 5.0 (151.1) | 7.9 (155) | 24.3 (523.4) | 9.2 (222.9) |
| ada3op | 5.3 (180.8) | 9.2 (180.8) | 6.8 (174.3) | 3.4 (89.2) |
| cp-bt | 6.2 (136) | 8.3 (134.3) | 11.8 (218.4) | 5.6 (113.6) |
| tseng-pd | 15.9 (387.1) | 21 (387.8) | 25.7 (525.3) | 11.1 (245.4) |
| frb-pd | 10.5 (559.9) | 16.4 (560.4) | 22.8 (1074.8) | 6.3 (350.8) |

| | (breast cancer) | | | (IBD) | | |
|---|---|---|---|---|---|---|
| | 0.05 | 0.5 | 0.85 | 0.1 | 0.5 | 1.0 |
| # Nonzeros | 114 | 50 | 20 | 135 | 40 | 18 |
| # Nonzero groups | 16 | 7 | 3 | 13 | 4 | 2 |
| Training error | 0% | 5% | 35% | 0% | 5.5% | 26.8% |

| | (breast cancer) | | | (IBD) | | |
|---|---|---|---|---|---|---|
| | 0.05 | 0.5 | 0.85 | 0.1 | 0.5 | 1.0 |
| ps1fbt () | 0.1 | 1 | 1 | | | |
| ps2fbt () | 1 | 1 | 1 | | | |
| cp-bt () | | | | | | |
| tseng-pd () | | | | | | |
| frb-pd () | | | | | | |

Single-Forward-Step Projective Splitting: Exploiting Cocoercivity
Patrick R. Johnstone, Jonathan Eckstein
Department of Management Science and Information Systems, Rutgers Business School Newark and New Brunswick, Rutgers University. Contact: [email protected], [email protected]
Abstract
This work describes a new variant of projective splitting for solving maximal monotone inclusions and complicated convex optimization problems. In the new version, cocoercive operators can be processed with a single forward step per iteration. In the convex optimization context, cocoercivity is equivalent to Lipschitz differentiability. Prior forward-step versions of projective splitting did not fully exploit cocoercivity and required two forward steps per iteration for such operators. Our new single-forward-step method establishes a symmetry between projective splitting algorithms, the classical forward-backward splitting method (FB), and Tseng’s forward-backward-forward method (FBF). The new procedure allows for larger stepsizes for cocoercive operators: the stepsize bound is $2\beta$ for a $\beta$-cocoercive operator, the same bound as has been established for FB. We show that FB corresponds to an unattainable boundary case of the parameters in the new procedure. Unlike FB, the new method allows for a backtracking procedure when the cocoercivity constant is unknown. Proving convergence of the algorithm requires some departures from the prior proof framework for projective splitting. We close with some computational tests establishing competitive performance for the method.
1 Introduction
1.1 Problem Statement
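For a collection of real Hilbert spaces $\{\mathcal{H}_i\}_{i=0}^{n}$, consider the finite-sum convex minimization problem:
$$\min_{x \in \mathcal{H}_0} \; \sum_{i=1}^{n} \big( f_i(G_i x) + h_i(G_i x) \big), \tag{1}$$
where every $f_i : \mathcal{H}_i \to (-\infty, +\infty]$ and $h_i : \mathcal{H}_i \to \mathbb{R}$ is closed, proper, and convex, every $h_i$ is also differentiable with $L_i$-Lipschitz-continuous gradients, and the operators $G_i : \mathcal{H}_0 \to \mathcal{H}_i$ are linear and bounded. Under appropriate constraint qualifications, (1) is equivalent to the monotone inclusion problem of finding $z \in \mathcal{H}_0$ such that
$$0 \in \sum_{i=1}^{n} G_i^* (A_i + B_i) G_i z, \tag{2}$$
where all $A_i : \mathcal{H}_i \to 2^{\mathcal{H}_i}$ and $B_i : \mathcal{H}_i \to \mathcal{H}_i$ are maximal monotone and each $B_i$ is $L_i^{-1}$-cocoercive, meaning that it is single-valued and
$$L_i \langle B_i x_1 - B_i x_2,\, x_1 - x_2 \rangle \;\ge\; \| B_i x_1 - B_i x_2 \|^2 \quad \text{for all } x_1, x_2 \in \mathcal{H}_i$$
for some $L_i \ge 0$. (When $L_i = 0$, $B_i$ must be a constant operator, that is, there is some $v_i \in \mathcal{H}_i$ such that $B_i x = v_i$ for all $x \in \mathcal{H}_i$.) In particular, if we set $A_i = \partial f_i$ (the subgradient map of $f_i$) and $B_i = \nabla h_i$ (the gradient of $h_i$), then the solution sets of the two problems coincide under a special case of the constraint qualification of [9, Prop. 5.3].
Defining $T_i = A_i + B_i$ for all $i$, problem (2) may be written as
$$0 \in \sum_{i=1}^{n} G_i^* T_i G_i z. \tag{3}$$
This more compact problem statement will be used occasionally in our analysis below.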
1.2 Background
Operator splitting algorithms are an effective way to solve structured convex optimization problems and monotone inclusions such as (1), (2), and (3). Their defining feature is that they decompose a problem into a set of manageable pieces. Each iteration consists of relatively easy calculations confined to each individual component of the decomposition, in conjunction with some simple coordination operations orchestrated to converge to a solution. Arguably the three most popular classes of operator splitting algorithms are the forward-backward splitting (FB) [11], Douglas/Peaceman-Rachford splitting (DR) [26], and forward-backward-forward (FBF) [40] methods. Indeed, many algorithms in convex optimization and monotone inclusions are in fact instances of one of these methods. The popular Alternating Direction Method of Multipliers (ADMM), in its standard form, can be viewed as a dual implementation of DR [20].
Projective splitting is a relatively recent and currently less well-known class of operator splitting methods, operating in a primal-dual space. Each iteration of these methods explicitly constructs an affine “separator” function $\varphi_k$ for which $\varphi_k(p) \le 0$ for every $p$ in the set of primal-dual solutions. The next iterate $p^{k+1}$ is then obtained by projecting the current iterate $p^k$ onto the halfspace $\{p : \varphi_k(p) \le 0\}$, possibly with some over- or under-relaxation. Crucially, $\varphi_k$ is obtained by performing calculations that consider each operator $T_i$ separately, so that the procedures are indeed operator splitting algorithms. In the original formulations of projective splitting [18, 19], the calculation applied to each operator was a standard resolvent operation, also known as a “backward step”. Resolvent operations remained the only way to process individual operators as projective splitting was generalized to cover compositions of maximal monotone operators with bounded linear maps [1] (as in the $G_i$ in (3)) and block-iterative (incremental) or asynchronous calculation patterns [10, 17]. Convergence rate and other theoretical results regarding projective splitting may be found in [22, 23, 28, 29].
The algorithms in [39, 21] were the first to construct projective splitting separators by applying calculations other than resolvent steps to the operators $T_i$. In particular, [21] developed a procedure that could instead use two forward (explicit or gradient) steps for operators that are Lipschitz continuous. However, that result raised a question: if projective splitting can exploit Lipschitz continuity, can it further exploit the presence of cocoercive operators? Cocoercivity is in general a stronger property than Lipschitz continuity. However, when an operator is the gradient of a closed proper convex function (such as $\nabla h_i$ in (1)), the Baillon-Haddad theorem [2, 3] establishes that the two properties are equivalent: $\nabla h_i$ is $L_i$-Lipschitz continuous if and only if it is $L_i^{-1}$-cocoercive.
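As a rough numerical illustration of the Baillon-Haddad equivalence, the following sketch checks the cocoercivity inequality $L\langle \nabla h(x_1) - \nabla h(x_2), x_1 - x_2\rangle \ge \|\nabla h(x_1) - \nabla h(x_2)\|^2$ on random pairs of points. The diagonal quadratic $h$, the curvature values `q`, and all function names are assumptions chosen for illustration; they do not come from the paper.

```python
import random

def grad_h(x, q):
    """Gradient of h(x) = 0.5 * sum_j q[j] * x[j]**2 (convex when every q[j] >= 0)."""
    return [qj * xj for qj, xj in zip(q, x)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cocoercivity_gap(x1, x2, q, L):
    """Return L*<g1 - g2, x1 - x2> - ||g1 - g2||^2; nonnegative when the inequality holds."""
    dg = [a - b for a, b in zip(grad_h(x1, q), grad_h(x2, q))]
    dx = [a - b for a, b in zip(x1, x2)]
    return L * dot(dg, dx) - dot(dg, dg)

random.seed(0)
q = [0.5, 2.0, 3.0]   # hypothetical curvatures of the quadratic
L = max(q)            # Lipschitz constant of grad_h for this diagonal quadratic

min_gap = min(
    cocoercivity_gap([random.uniform(-1, 1) for _ in q],
                     [random.uniform(-1, 1) for _ in q], q, L)
    for _ in range(1000)
)
print("inequality holds on all samples:", min_gap >= -1e-12)
```

Using a constant smaller than the true Lipschitz constant (for example `L = 1.0` here) makes the gap negative for some pairs, which is one way to see that the cocoercivity constant cannot be improved beyond $1/L$ for this function.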
Operator splitting methods that exploit cocoercivity rather than mere Lipschitz continuity typically have lower per-iteration computational complexity and a larger range of permissible stepsizes. For example, both FBF and the extragradient (EG) method [25] only require Lipschitz continuity, but need two forward steps per iteration and limit the stepsize to $L^{-1}$, where $L$ is the Lipschitz constant. If one strengthens the assumption to $L^{-1}$-cocoercivity, one can instead use FB, which only needs one forward step per iteration and allows stepsizes bounded away from $2L^{-1}$. One departure from this pattern is the recently developed method of [31], which only requires Lipschitz continuity but uses just one forward step per iteration. While this property is remarkable, it should be noted that its stepsizes must be bounded by $(1/2)L^{-1}$, which is half the allowable stepsize for EG or FBF and just a fourth of FB’s stepsize range.
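To make the forward-backward pattern concrete, here is a minimal sketch of FB on a toy problem: minimizing a separable quadratic subject to nonnegativity, where the backward step is a projection. The problem data, the function name `forward_backward`, and the choice of stepsize $1.9/L$ (inside the range allowed under cocoercivity) are illustrative assumptions, not the paper's experiments.

```python
def forward_backward(c, q, rho, iters=500):
    """Minimize 0.5*sum_j q[j]*(x[j]-c[j])**2 subject to x >= 0 by forward-backward splitting."""
    x = [0.0] * len(c)
    for _ in range(iters):
        # Forward (gradient) step on the smooth term: one gradient evaluation per iteration.
        t = [xj - rho * qj * (xj - cj) for xj, qj, cj in zip(x, q, c)]
        # Backward step: the resolvent of the normal cone of {x >= 0} is the projection.
        x = [max(tj, 0.0) for tj in t]
    return x

q = [0.5, 2.0, 3.0]            # hypothetical curvatures; the gradient is L-Lipschitz with L = max(q)
c = [1.0, -0.2, 0.5]           # hypothetical unconstrained minimizer
L = max(q)

x_fb = forward_backward(c, q, rho=1.9 / L)   # stepsize close to, but below, 2/L
x_exact = [max(cj, 0.0) for cj in c]         # closed-form solution of this separable problem
print(max(abs(u - v) for u, v in zip(x_fb, x_exact)) < 1e-9)
```

The stepsize $1.9/L$ would not be admissible for FBF or EG under their $1/L$ limit, which is the practical gap the paragraph above describes.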
Much like EG and FBF, the projective splitting computation in [21] requires Lipschitz continuity (if backtracking is used, then all three of these methods can converge under weaker local continuity assumptions), two forward steps per iteration, and limits the stepsize to be less than $L^{-1}$ (when not using backtracking). Considering the relationship between FB and FBF/EG leads to the following question: is there a variant of projective splitting which converges under the stronger assumption of $L^{-1}$-cocoercivity, while processing each cocoercive operator with a single forward step per iteration and allowing stepsizes bounded above by $2L^{-1}$?
This paper shows that the answer to this question is “yes”. Referring to (2), the new procedure analyzed here requires one forward step on $B_i$ and one resolvent for $A_i$ at each iteration. In the context of (1), the new procedure requires one forward step on $\nabla h_i$ and one proximal operator evaluation on $f_i$. When the resolvent is easily computable (for example, when $A_i$ is the zero map and its resolvent is simply the identity), the new procedure can effectively halve the computation necessary to run the same number of iterations as the previous procedure of [21]. This advantage is equivalent to that of FB over FBF and EG when cocoercivity is present. Another advantage of the proposed method is that it allows for a backtracking linesearch when the cocoercivity constant is unknown, whereas no such variant of general cocoercive FB is currently known.
The analysis of this new method is significantly different from our previous work in [21], using a novel “ascent lemma” (Lemma 17) regarding the separators generated by the algorithm. The new procedure also has an interesting connection to the original resolvent calculation used in the projective splitting papers [18, 19, 1, 10]: in Section 2.2 below, we show that the new procedure is equivalent to one iteration of FB applied to evaluating the resolvent of $T_i$. That is, we can use a single forward-backward step to approximate the operator-processing procedure of [18, 19, 1, 10], but still obtain convergence.
The new procedure has significant potential for asynchronous and incremental implementation following the ideas and techniques of previous projective splitting methods [10, 17, 21]. To keep the analysis relatively manageable, however, we plan to develop such generalizations in a follow-up paper. Here, we will simply assume that every operator is processed once per iteration.
1.3 The Optimization Context
For optimization problems of the form (1), our proposed method is a first-order proximal splitting method that “fully splits” the problem: at each iteration, it utilizes the proximal operator for each nonsmooth function $f_i$, a single evaluation of the gradient for each smooth function $h_i$, and matrix-vector multiplications involving $G_i$ and $G_i^*$. There is no need for any form of matrix inversion, nor to use resolvents of composed functions like $f_i \circ G_i$, which may in general be much more challenging to evaluate than resolvents of the $f_i$. Thus, the method achieves the maximum possible decoupling of the elements of (1). There are also no assumptions on the rank, row spaces, or column spaces of the $G_i$. Beyond the basic resolvent, gradient, and matrix-vector multiplication operations invoked by our algorithm, the only computations at each iteration are a constant number of inner products, norms, scalar multiplications, and vector additions, all of which can be carried out within flop counts linear in the dimension of each Hilbert space.
Besides projective splitting approaches, there are a few first-order proximal splitting methods that can achieve full splitting on (1). The most similar to projective splitting are those in the family of primal-dual (PD) splitting methods; see [13, 12, 7, 35] and references therein. In fact, projective splitting is also a kind of primal-dual method, since it produces primal and dual sequences jointly converging to a primal-dual solution. However, the convergence mechanisms are different: PD methods are usually constructed by applying an established operator splitting technique such as FB, FBF, or DR to an appropriately formulated primal-dual inclusion in a primal-dual product space, possibly with a specially chosen metric. Projective splitting methods instead work by projecting onto (or through) explicitly constructed separating hyperplanes in the primal-dual space.
There are several potential advantages of our proposed method over the more established PD schemes. First, unlike the PD methods, the norms $\|G_i\|$ do not affect the stepsize constraints of our proposed method, making such constraints easier to satisfy. Furthermore, projective splitting’s stepsizes may vary at each iteration and may differ for each operator. In general, projective splitting methods allow for asynchronous parallel and incremental implementations in an arguably simpler way than PD methods (although we do not develop this aspect of projective splitting in this paper). Projective splitting methods can incorporate deterministic block-iterative and asynchronous assumptions [10, 17], resulting in deterministic convergence guarantees, with the analysis being similar to the synchronous case. In contrast, existing asynchronous and block-coordinate analyses of PD methods require stochastic assumptions which only lead to probabilistic convergence guarantees [35].
1.4 Notation and a Simplifying Assumption
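We use the same general notation as in [21, 23, 22]. Summations of the form $\sum_{i=1}^{n-1} a_i$ will appear throughout this paper. To deal with the case $n = 1$, we use the standard convention that $\sum_{i=1}^{0} a_i = 0$.
We will use a boldface $\mathbf{w} = (w_1, \ldots, w_{n-1})$ for elements of $\mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$. Let $\boldsymbol{\mathcal{H}} := \mathcal{H}_0 \times \mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$, which we refer to as the “collective primal-dual space”, and note that the assumption on $G_n$ implies that $\mathcal{H}_n = \mathcal{H}_0$. We use $p$ to refer to points in $\boldsymbol{\mathcal{H}}$, so $p := (z, \mathbf{w}) = (z, w_1, \ldots, w_{n-1})$.
Throughout, we will simply write $\|\cdot\|_i = \|\cdot\|$ as the norm for $\mathcal{H}_i$ and let the subscript be inferred from the argument. In the same way, we will write $\langle \cdot, \cdot \rangle_i$ as $\langle \cdot, \cdot \rangle$ for the inner product of $\mathcal{H}_i$. For the collective primal-dual space we will use a special norm and inner product with its own subscript defined in (16).
We use the standard “$\rightharpoonup$” notation to denote weak convergence, which is of course equivalent to ordinary convergence in finite-dimensional settings.
For the definition of maximal monotone operators and their basic properties, we refer to [4]. For any maximal monotone operator $A$ and scalar $\rho > 0$, we will use the notation $J_{\rho A} := (I + \rho A)^{-1}$ to denote the resolvent operator, also known as the backward or implicit step with respect to $A$. Thus,
$$x = J_{\rho A}(t) \quad \Longleftrightarrow \quad x + \rho a = t \;\text{ and }\; a \in Ax, \tag{4}$$
the $x$ and $a$ satisfying this relation being unique. Furthermore, $J_{\rho A}$ is defined everywhere and $\operatorname{range}(J_{\rho A}) = \operatorname{dom}(A)$ [4, Prop. 23.2].
If $A = \partial f$ for a closed, convex, and proper function $f$, the resolvent is often referred to as the proximal operator and written as $J_{\rho \partial f} = \operatorname{prox}_{\rho f}$. Computing the proximal operator requires solving
$$\operatorname{prox}_{\rho f}(t) = \arg\min_{x} \Big\{ \rho f(x) + \tfrac{1}{2} \|x - t\|^2 \Big\}.$$
Many functions encountered in applications to machine learning and signal processing have proximal operators which can be computed exactly with low computational complexity. In this paper, for a single-valued maximal monotone operator $A$, a forward step (also known as an explicit step) refers to the direct evaluation of $Ax$ (or $\nabla f(x)$ in convex optimization) as part of an algorithm.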
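As one concrete instance of a cheap proximal operator, the sketch below evaluates the resolvent of the subdifferential of $\lambda\|\cdot\|_1$ (soft-thresholding) and checks the defining resolvent relation: the point $x$ and the implied element $a = (t - x)/\rho$ must satisfy $x + \rho a = t$ with $a$ a subgradient at $x$. The function names and test data are illustrative assumptions.

```python
def prox_l1(t, rho, lam=1.0):
    """Proximal operator of f(x) = lam*||x||_1 with stepsize rho (soft-thresholding)."""
    thr = rho * lam
    return [max(abs(tj) - thr, 0.0) * (1.0 if tj > 0 else -1.0) for tj in t]

def implied_subgradient(t, x, rho):
    """The element a determined by the resolvent relation x + rho*a = t."""
    return [(tj - xj) / rho for tj, xj in zip(t, x)]

t = [1.5, -0.2, 0.7]          # hypothetical input point
rho, lam = 0.5, 1.0

x = prox_l1(t, rho, lam)
a = implied_subgradient(t, x, rho)

# a must be a subgradient of lam*|.| at x, coordinate by coordinate:
# a_j = lam*sign(x_j) when x_j != 0, and |a_j| <= lam when x_j == 0.
is_subgradient = all(
    abs(aj - lam * (1 if xj > 0 else -1)) < 1e-12 if xj != 0 else abs(aj) <= lam + 1e-12
    for aj, xj in zip(a, x)
)
print(is_subgradient)
```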
For the rest of the paper, we will impose the simplifying assumption
$$G_n : \mathcal{H}_n \to \mathcal{H}_n := I \quad \text{(the identity operator)}.$$
As noted in [21], the requirement that $G_n = I$ is not a very restrictive assumption. For example, one can always enlarge the original problem by one operator, setting $A_n = B_n = 0$.
2 Projective Splitting
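The goal of our algorithm will be to find a point in
$$\mathcal{S} := \Big\{ (z, w_1, \ldots, w_{n-1}) \in \boldsymbol{\mathcal{H}} \;\Big|\; w_i \in T_i G_i z,\; i = 1, \ldots, n-1,\;\; -\sum_{i=1}^{n-1} G_i^* w_i \in T_n z \Big\}. \tag{5}$$
It is clear that $z^*$ solves (2)–(3) if and only if there exist $w_1^*, \ldots, w_{n-1}^*$ such that
$$(z^*, w_1^*, \ldots, w_{n-1}^*) \in \mathcal{S}.$$
Under reasonable assumptions, the set $\mathcal{S}$ is closed and convex; see Lemma 2. $\mathcal{S}$ is often called the Kuhn-Tucker solution set of problem (3).
A separator-projector algorithm for finding a point in $\mathcal{S}$ (and hence a solution to (3)) will, at each iteration $k$, find a closed and convex set $H_k$ which separates $\mathcal{S}$ from the current point, meaning $\mathcal{S}$ is entirely in the set $H_k$ (preferably, the current point is not). One can then attempt to “move closer” to the solution set by projecting the current point onto the set $H_k$. This general setup guarantees that the sequence generated by the method is Fejér monotone [8] with respect to $\mathcal{S}$. This alone is not sufficient to guarantee that the iterates actually converge to a point in the solution set. To establish this, one needs to show that the set $H_k$ “sufficiently separates” the current point from the solution set, or at least does so sufficiently often. Such “sufficient separation” allows one to establish that any weakly convergent subsequence of the iterates must have its limit in the set $\mathcal{S}$, from which overall weak convergence follows from [8, Prop. 2].
With $\mathcal{S}$ as in (5), the separator formulation presented in [10] constructs the halfspace $H_k$ using the function $\varphi_k : \boldsymbol{\mathcal{H}} \to \mathbb{R}$ defined as
$$\varphi_k(z, w_1, \ldots, w_{n-1}) := \sum_{i=1}^{n-1} \langle G_i z - x_i^k,\, y_i^k - w_i \rangle + \Big\langle z - x_n^k,\; y_n^k + \sum_{i=1}^{n-1} G_i^* w_i \Big\rangle \tag{6}$$
$$= \Big\langle z,\; \sum_{i=1}^{n-1} G_i^* y_i^k + y_n^k \Big\rangle + \sum_{i=1}^{n-1} \langle x_i^k - G_i x_n^k,\, w_i \rangle - \sum_{i=1}^{n} \langle x_i^k, y_i^k \rangle \tag{7}$$
for some auxiliary points $(x_i^k, y_i^k) \in \mathcal{H}_i^2$. These points $(x_i^k, y_i^k)$ will be specified later and must be chosen at each iteration in a specific manner guaranteeing the validity of the separator and convergence to $\mathcal{S}$. Among other properties, they must be chosen so that $y_i^k \in T_i x_i^k$ for $i = 1, \ldots, n$. Under this condition, it follows readily that $\varphi_k$ has the promised separator properties: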
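Lemma 1.
The function $\varphi_k$ defined in (6) is affine, and if $y_i^k \in T_i x_i^k$ for all $i = 1, \ldots, n$, then $\varphi_k(p) \le 0$ for all $p \in \mathcal{S}$.
Proof.
That $\varphi_k$ is affine is clear from its expression in (7). Now suppose that $y_i^k \in T_i x_i^k$ for all $i = 1, \ldots, n$ and $p = (z, w_1, \ldots, w_{n-1}) \in \mathcal{S}$. Then
$$-\varphi_k(p) = \sum_{i=1}^{n} \langle G_i z - x_i^k,\, w_i - y_i^k \rangle, \tag{8}$$
where $w_n := -\sum_{i=1}^{n-1} G_i^* w_i$. From $p \in \mathcal{S}$ and the definition of $w_n$, one has that $w_i \in T_i G_i z$ for all $i = 1, \ldots, n-1$, as well as $w_n \in T_n z = T_n G_n z$. Since $y_i^k \in T_i x_i^k$ for $i = 1, \ldots, n$, it follows from the monotonicity of $T_1, \ldots, T_n$ that every inner product displayed in (8) is nonnegative, and so $\varphi_k(p) \le 0$. ∎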
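The separator property can be checked numerically on a small instance. The sketch below assumes the standard two-operator form of the separator from the projective splitting literature, $\varphi(z, w_1) = \langle z - x_1, y_1 - w_1\rangle + \langle z - x_2, y_2 + w_1\rangle$, with both linear maps equal to the identity. The operators $T_1 x = x - a$ and $T_2 x = x - b$ and all names in the code are hypothetical choices made so that the primal-dual solution is known in closed form.

```python
import random

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def sub(u, v):
    return [s - t for s, t in zip(u, v)]

def add(u, v):
    return [s + t for s, t in zip(u, v)]

# Toy instance with n = 2 and identity linear maps (assumed): T1 x = x - a, T2 x = x - b.
a, b = [1.0, -2.0], [3.0, 0.5]

def T1(x):
    return sub(x, a)

def T2(x):
    return sub(x, b)

# Primal-dual solution: T1 z + T2 z = 0 gives z = (a + b)/2, and the dual variable is w1 = T1 z.
z_star = [(ai + bi) / 2 for ai, bi in zip(a, b)]
w1_star = T1(z_star)

def phi(z, w1, x1, y1, x2, y2):
    """Separator value <z - x1, y1 - w1> + <z - x2, y2 + w1> for n = 2."""
    return dot(sub(z, x1), sub(y1, w1)) + dot(sub(z, x2), add(y2, w1))

random.seed(1)
worst = float("-inf")
for _ in range(1000):
    x1 = [random.uniform(-5, 5) for _ in a]
    x2 = [random.uniform(-5, 5) for _ in a]
    # Any pairs (x_i, T_i x_i) on the operator graphs should give phi <= 0 at the solution.
    worst = max(worst, phi(z_star, w1_star, x1, T1(x1), x2, T2(x2)))
print(worst <= 1e-12)
```

For these particular operators the value at the solution works out to $-\|z^* - x_1\|^2 - \|z^* - x_2\|^2$, so the check also shows how the separator becomes strictly negative as the auxiliary points move away from the solution.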
Figure 1 presents a rough depiction of the current algorithm iterate $p^k = (z^k, \mathbf{w}^k)$ and the separator in the case that $\varphi_k(p^k) > 0$. The basic iterative cycle pursued by projective splitting methods is:
1. For each operator $T_i$, identify a pair $(x_i^k, y_i^k)$ with $y_i^k \in T_i x_i^k$. These pairs define an affine function $\varphi_k$ such that $\varphi_k(p) \le 0$ for all $p \in \mathcal{S}$, using the construction (6) (or related constructions for variations of the basic problem formulation).
2. Obtain the next iterate $p^{k+1}$ by projecting the current iterate $p^k$ onto the halfspace $\{p : \varphi_k(p) \le 0\}$, with possible over- or under-relaxation.
Figure 2 presents a rough depiction of two iterations of this process in the absence of over- or under-relaxation. The projection operation in part 2 of the cycle is a straightforward application of standard formulas for projecting onto a halfspace. For the particular formulation (3), the necessary calculations are derived in [21] and displayed in Algorithm 3 below. This projection is a low-complexity operation involving only inner products, norms, matrix multiplication by $G_i$, and sums of scalars. For example, when $\mathcal{H}_i = \mathbb{R}^d$ for $i = 1, \ldots, n$ and each $G_i = I$, then the projection step has computational complexity $O(nd)$.
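The standard halfspace projection formula mentioned above can be sketched as follows for an affine function $\varphi(p) = \langle g, p\rangle - c$ in a flat Euclidean space. This is a generic illustration, not the paper's Algorithm 3: the function name, the plain Euclidean metric, and the relaxation parameter `beta` are assumptions for the example.

```python
def project_onto_halfspace(p, g, c, beta=1.0):
    """Relaxed projection of p onto {q : <g, q> - c <= 0} in the Euclidean metric.

    beta = 1 is the exact projection; other values in (0, 2) under- or over-relax.
    """
    phi = sum(gj * pj for gj, pj in zip(g, p)) - c
    if phi <= 0.0:
        return list(p)                       # p already lies in the halfspace
    norm_sq = sum(gj * gj for gj in g)       # squared norm of the gradient of phi
    if norm_sq == 0.0:
        raise ValueError("empty halfspace: phi is a positive constant")
    step = beta * phi / norm_sq
    return [pj - step * gj for pj, gj in zip(p, g)]

g, c = [1.0, 2.0], 1.0                       # hypothetical separator phi(p) = p1 + 2*p2 - 1
outside = project_onto_halfspace([3.0, 4.0], g, c)   # phi = 10 > 0, so the point moves
inside = project_onto_halfspace([0.0, 0.0], g, c)    # phi = -1 <= 0, so the point stays
print(outside, inside)
```

The cost is two inner products and one vector update, which is consistent with the linear-in-dimension complexity described above.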
The key question in the design of algorithms in this class therefore concerns step 1 in the cycle: how might one select the points $(x_i^k, y_i^k)$ so that convergence to $\mathcal{S}$ may be established? The usual approach has been to choose $(x_i^k, y_i^k)$ to be some function of $(z^k, w_i^k)$ such that $\varphi_k(p^k)$ is positive and “sufficiently large” whenever $p^k \notin \mathcal{S}$. Then projecting the current point onto this hyperplane makes progress toward the solution and can be shown to lead (with some further analysis) to overall convergence. In the original versions of projective splitting, the calculation of $(x_i^k, y_i^k)$ involved (perhaps approximately) evaluating a resolvent; later [21] introduced the alternative of a two-forward-step calculation for Lipschitz continuous operators that achieved essentially the same sufficient separation condition.
Here, we introduce a one-forward-step calculation for the case of cocoercive operators. A principal difference between this analysis and earlier work on projective splitting is that processing all the operators at iteration $k$ need not result in $\varphi_k(p^k)$ being positive. Instead, we establish an “ascent lemma” that relates the values $\varphi_k(p^k)$ and $\varphi_{k-1}(p^{k-1})$ in such a way that overall convergence may still be proved, even though it is possible that at some iterations $\varphi_k(p^k) \le 0$. In particular, $\varphi_k(p^k)$ will be larger than the previous value $\varphi_{k-1}(p^{k-1})$, up to some error term that vanishes as $k \to \infty$.
When $\varphi_k(p^k) \le 0$, projection onto $\{p : \varphi_k(p) \le 0\}$ results in $p^{k+1} = p^k$. In this case, the algorithm continues to compute new points $(x_i^{k+1}, y_i^{k+1})$ until, for some $k' > k$, it constructs a hyperplane such that $\varphi_{k'}(p^{k'}) > 0$ and projection results in $p^{k'+1} \ne p^{k'}$.
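Additional Notation for Projective Splitting
For an arbitrary $(w_1, \ldots, w_{n-1}) \in \mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$ we use the notation
$$w_n := -\sum_{i=1}^{n-1} G_i^* w_i,$$
as in the proof of Lemma 1. Note that when $n = 1$, $w_1 = 0$. Under the above convention, we may write $\varphi_k$ in the more compact form
$$\varphi_k(z, \mathbf{w}) = \sum_{i=1}^{n} \langle G_i z - x_i^k,\, y_i^k - w_i \rangle.$$
We also use the following notation for $i = 1, \ldots, n$:
$$\varphi_{i,k}(z, w_i) := \langle G_i z - x_i^k,\, y_i^k - w_i \rangle.$$
Note that $\varphi_k(z, \mathbf{w}) = \sum_{i=1}^{n} \varphi_{i,k}(z, w_i)$.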
2.1 The New Procedure
Suppose $A_i = 0$ for some $i$. Since $B_i$ is cocoercive, it is also Lipschitz continuous. In [21] we introduced the following two-forward-step update for Lipschitz continuous $B_i$:
$$x_i^k = G_i z^k - \rho_i (B_i G_i z^k - w_i^k), \qquad y_i^k = B_i x_i^k.$$
Under $L_i$-Lipschitz continuity and the condition $\rho_i < 1/L_i$, it is possible to show that updating $(x_i^k, y_i^k)$ in this way leads to $\varphi_{i,k}(z^k, w_i^k)$ being sufficiently positive to establish overall convergence. Although we did not discuss it in [21], this two-forward-step procedure can be extended to handle nonzero $A_i$ in the following manner:
$$x_i^k + \rho_i a_i^k = G_i z^k - \rho_i (B_i G_i z^k - w_i^k), \qquad a_i^k \in A_i x_i^k, \tag{9}$$
$$y_i^k = a_i^k + B_i x_i^k. \tag{10}$$
Following (4), it is clear that (9) is essentially a resolvent calculation applied to its right-hand side $G_i z^k - \rho_i (B_i G_i z^k - w_i^k)$. This type of update, with forward steps and backward steps together, was introduced in [39] for a more limited form of projective splitting.
An obvious drawback of (9)–(10) is that it requires two forward steps per iteration, one to compute $B_i G_i z^k$ and another to compute $B_i x_i^k$. The initial motivation for the current paper was the following question: is there a way to reuse $B_i x_i^{k-1}$ so as to avoid computing $B_i G_i z^k$ at each iteration, perhaps under the stronger assumption of cocoercivity? With some effort we arrived at the following update for each block $i$ at each iteration $k$:
$$x_i^k + \rho_i a_i^k = (1 - \alpha_i) x_i^{k-1} + \alpha_i G_i z^k - \rho_i (b_i^{k-1} - w_i^k), \qquad a_i^k \in A_i x_i^k, \tag{11}$$
$$b_i^k = B_i x_i^k, \tag{12}$$
$$y_i^k = a_i^k + b_i^k, \tag{13}$$
where $\alpha_i \in (0, 1)$, $\rho_i \le 2(1 - \alpha_i)/L_i$, and $b_i^0 = B_i x_i^0$. Condition (11) is readily satisfied by some simple linear algebra calculations and a resolvent calculation involving $A_i$.
In particular, referring to (4), one may see that (11) is equivalent to computing
$$t = (1 - \alpha_i) x_i^{k-1} + \alpha_i G_i z^k - \rho_i (b_i^{k-1} - w_i^k), \qquad x_i^k = J_{\rho_i A_i}(t), \qquad a_i^k = \rho_i^{-1} (t - x_i^k).$$
Following this resolvent calculation, (12) requires only an evaluation (forward step) on $B_i$, and (13) is a simple vector addition. In comparison to (9), we have replaced $B_i G_i z^k$ with the previously computed point $B_i x_i^{k-1}$. However, in order to establish convergence, it turns out that we also need to replace $G_i z^k$ with a convex combination of $x_i^{k-1}$ and $G_i z^k$.
The parameter $\rho_i$ plays the role of the stepsize in the resolvent calculation. It also plays the role of a forward (gradient) stepsize, since it multiplies $-b_i^{k-1}$ in (11), and $b_i^{k-1} = B_i x_i^{k-1}$ by (12). From the assumptions on $\alpha_i$ and $\rho_i$ immediately following (13), it follows that $\rho_i$ may be made arbitrarily close to $2/L_i$ by setting $\alpha_i$ close to $0$. However, in practice it may be better to use an intermediate value, such as $\alpha_i = 0.1$, since doing so causes the update to make significant use of the information in $z^k$, a point computed more recently than $x_i^{k-1}$.
Computing $(x_i^k, y_i^k)$ as proposed in (11)–(13) does not guarantee that the quantity $\varphi_{i,k}(z^k, w_i^k)$ is positive. In the next section, we give some intuition as to why (11)–(13) nevertheless leads to convergence to $\mathcal{S}$.
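A single block update of this kind can be sketched in a few lines. The code below assumes the update has the form described in the surrounding text: a resolvent step on $A$ applied to a convex combination of the previous point and the current primal iterate, shifted by a forward step that reuses the previously stored value of $B$, followed by one fresh evaluation of $B$. The specific operators ($A$ the subdifferential of $\lambda\|\cdot\|_1$, $B$ a diagonal linear map), the identity linear map, the stepsize rule `2*(1-alpha)/L`, and all names are illustrative assumptions rather than the paper's implementation.

```python
def soft_threshold(t, thr):
    """Resolvent of rho*A for A the subdifferential of lam*||.||_1, with thr = rho*lam."""
    return [max(abs(tj) - thr, 0.0) * (1.0 if tj > 0 else -1.0) for tj in t]

def one_forward_step(z, w, x_prev, b_prev, alpha, rho, lam, q):
    """One update of (x, a, b, y) for a single block, assuming the linear map G is the identity.

    B x = q * x componentwise, so B is (1/max(q))-cocoercive. b_prev must equal B applied
    to x_prev; it is reused here rather than recomputed.
    """
    # Right-hand side of the resolvent step: averaging plus a forward step that uses b_prev.
    t = [(1 - alpha) * xp + alpha * zj - rho * (bp - wj)
         for xp, zj, bp, wj in zip(x_prev, z, b_prev, w)]
    x = soft_threshold(t, rho * lam)                 # backward step on A
    a = [(tj - xj) / rho for tj, xj in zip(t, x)]    # element of A(x), from x + rho*a = t
    b = [qj * xj for qj, xj in zip(q, x)]            # the single new forward step on B
    y = [aj + bj for aj, bj in zip(a, b)]
    return x, a, b, y

q, lam = [0.5, 2.0], 0.1          # hypothetical problem data
alpha = 0.5
rho = 2 * (1 - alpha) / max(q)    # stepsize at the assumed upper limit
x0 = [0.0, 0.0]
b0 = [qj * xj for qj, xj in zip(q, x0)]

x1, a1, b1, y1 = one_forward_step(z=[1.0, -1.0], w=[0.3, 0.0], x_prev=x0, b_prev=b0,
                                  alpha=alpha, rho=rho, lam=lam, q=q)
print(x1, y1)
```

In a full iteration, `b1` would be stored and passed back as `b_prev` at the next call, which is what keeps the cost at one evaluation of $B$ per iteration.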
2.2 A Connection with the Forward-Backward Method
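In the projective splitting literature preceding [21], the pairs $(x_i^k, y_i^k)$ are solutions of
$$x_i^k + \rho_i y_i^k = G_i z^k + \rho_i w_i^k, \qquad y_i^k \in T_i x_i^k \tag{14}$$
for some $\rho_i > 0$, which (again following (4)) is a resolvent calculation. It can be shown that the resulting $(x_i^k, y_i^k)$ are such that $\varphi_{i,k}(z^k, w_i^k)$ is positive and sufficiently large to guarantee overall convergence to a solution of (3). Since the stepsize $\rho_i$ in (14) can be any positive number, let us replace $\rho_i$ with $\rho_i/\alpha_i$ for some $\alpha_i \in (0, 1)$ and rewrite (14) as
$$x_i^k + \frac{\rho_i}{\alpha_i} y_i^k = G_i z^k + \frac{\rho_i}{\alpha_i} w_i^k, \qquad y_i^k \in T_i x_i^k. \tag{15}$$
The reason for this reparameterization will become apparent below.
In this paper, $T_i = A_i + B_i$, with $B_i$ being cocoercive and $A_i$ maximal monotone. For $T_i$ in this form, computing the resolvent as in (14) exactly may be impossible, even when the resolvent of $A_i$ is available. With this structure, $x_i^k$ in (15) satisfies:
$$0 \in \frac{\alpha_i}{\rho_i} (x_i^k - G_i z^k) + A_i x_i^k + B_i x_i^k - w_i^k,$$
which can be rearranged to $0 \in A_i x_i^k + \tilde{B}_i x_i^k$, where
$$\tilde{B}_i(v) := B_i v + \frac{\alpha_i}{\rho_i} (v - G_i z^k) - w_i^k.$$
Since $B_i$ is $L_i^{-1}$-cocoercive, $\tilde{B}_i$ is $(L_i + \alpha_i/\rho_i)^{-1}$-cocoercive [4, Prop. 4.12]. Consider the generic monotone inclusion problem $0 \in A_i x + \tilde{B}_i x$: $A_i$ is maximal and $\tilde{B}_i$ is cocoercive, and thus one may solve the problem with the forward-backward (FB) method [4, Theorem 26.14]. If one applies a single iteration of FB initialized at $x_i^{k-1}$, with stepsize $\rho_i$, to the inclusion $0 \in A_i x + \tilde{B}_i x$, one obtains the calculation:
$$x_i^k = J_{\rho_i A_i}\big( x_i^{k-1} - \rho_i \tilde{B}_i x_i^{k-1} \big) = J_{\rho_i A_i}\big( (1 - \alpha_i) x_i^{k-1} + \alpha_i G_i z^k - \rho_i (B_i x_i^{k-1} - w_i^k) \big),$$
which is precisely the update (11). So, our proposed calculation is equivalent to one iteration of FB initialized at the previous point $x_i^{k-1}$, applied to the subproblem of computing the resolvent in (15). Prior versions of projective splitting require computing this resolvent either exactly or to within a certain relative error criterion, which may be time consuming. Here, we simply make a single FB step toward computing the resolvent, which we will prove is sufficient for the projective splitting method to converge to $\mathcal{S}$. However, our stepsize restriction on $\rho_i$ will be slightly stronger than the natural stepsize limit that would arise when applying FB to $0 \in A_i x + \tilde{B}_i x$.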
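The relationship between a single FB step and the exact resolvent can be illustrated on a scalar toy problem. The sketch below assumes $A = \partial(\lambda|\cdot|)$, $Bx = qx$, and an identity linear map, so the exact resolvent with stepsize $\rho/\alpha$ has a closed form, and it repeats the FB step with $z$ and $w$ held fixed to show that the steps head toward that resolvent. The reparameterized subproblem and every name in the code are assumptions made for illustration; the actual algorithm takes only one such step per iteration and then changes $z$ and $w$.

```python
def soft(t, thr):
    """Scalar soft-thresholding: the resolvent of thr times the subdifferential of |.|."""
    return max(abs(t) - thr, 0.0) * (1.0 if t > 0 else -1.0)

def fb_step_on_resolvent(x, z, w, alpha, rho, lam, q):
    """One FB step (stepsize rho) on 0 in A x + Bt(x), with Bt(x) = q*x + (alpha/rho)*(x - z) - w."""
    b_tilde = q * x + (alpha / rho) * (x - z) - w
    return soft(x - rho * b_tilde, rho * lam)

def exact_resolvent(z, w, alpha, rho, lam, q):
    """Closed-form solution of x + (rho/alpha)*y = z + (rho/alpha)*w with y in lam*sign(x) + q*x."""
    c = rho / alpha
    return soft(z + c * w, c * lam) / (1.0 + c * q)

# Hypothetical data; rho = 0.4 respects the assumed bound 2*(1 - alpha)/q = 0.5.
z, w, alpha, rho, lam, q = 1.0, 0.3, 0.5, 0.4, 0.1, 2.0

first = fb_step_on_resolvent(0.0, z, w, alpha, rho, lam, q)   # what one iteration would use
x = 0.0
for _ in range(100):                                          # repeated steps, z and w frozen
    x = fb_step_on_resolvent(x, z, w, alpha, rho, lam, q)
target = exact_resolvent(z, w, alpha, rho, lam, q)
print(first, x, target)
```

Expanding `x - rho * b_tilde` gives `(1 - alpha) * x + alpha * z - rho * (q * x - w)`, which is the averaged right-hand side with a reused forward evaluation described in the text above.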
3 The Algorithm
3.1 Main Problem Assumptions and Preliminary Results
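Assumption 1.
Problem (2) conforms to the following:
1. $\mathcal{H}_0 = \mathcal{H}_n$ and $\mathcal{H}_1, \ldots, \mathcal{H}_{n-1}$ are real Hilbert spaces.
2. For $i = 1, \ldots, n$, the operators $A_i : \mathcal{H}_i \to 2^{\mathcal{H}_i}$ and $B_i : \mathcal{H}_i \to \mathcal{H}_i$ are monotone. Additionally each $A_i$ is maximal.
3. Each operator $B_i$ is either $L_i^{-1}$-cocoercive for some $L_i > 0$ (and thus single-valued) and $\operatorname{dom} B_i = \mathcal{H}_i$, or $L_i = 0$ and $B_i x = v_i$ for all $x \in \mathcal{H}_i$ and some $v_i \in \mathcal{H}_i$ (that is, $B_i$ is a constant function).
4. Each $G_i : \mathcal{H}_0 \to \mathcal{H}_i$ for $i = 1, \ldots, n-1$ is linear and bounded.
5. Problem (2) has a solution, so the set $\mathcal{S}$ defined in (5) is nonempty.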
Problem (1) will be equivalent to an instance of Problem (2) satisfying Assumption 1 if each $f_i$ and $h_i$ is closed, convex, and proper, each $h_i$ has $L_i$-Lipschitz continuous gradients, and a special case of the constraint qualification in [9, Prop. 5.3] holds.
In order to apply a separator-projector algorithm, the target set must be closed and convex. Establishing this for $\mathcal{S}$ is very similar to the corresponding argument in our previous work [21], which in turn follows many earlier results.
Lemma 2.
Suppose Assumption 1 holds. The set $\mathcal{S}$ defined in (5) is closed and convex.
Proof.
By [4, Cor. 20.28] each $B_i$ is maximal. Furthermore, since $\operatorname{dom}(B_i) = \mathcal{H}_i$, $T_i = A_i + B_i$ is maximal monotone by [4, Cor. 25.5(i)]. The rest of the proof is identical to [21, Lemma 3]. ∎
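Throughout, we will use $p = (z, \mathbf{w}) = (z, w_1, \ldots, w_{n-1})$ for a generic point in $\boldsymbol{\mathcal{H}}$, the collective primal-dual space. For $\boldsymbol{\mathcal{H}}$, we adopt the following (standard) norm and inner product: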
[TABLE]
Lemma 3.
[21, Lemma 4] Let $\varphi_k$ be defined as in (6). Then:
1. $\varphi_k$ is affine on $\boldsymbol{\mathcal{H}}$.
2. With respect to the inner product (16) on $\boldsymbol{\mathcal{H}}$, the gradient of $\varphi_k$ is
[TABLE]
3.2 Abstract One-Forward-Step Update
We sharpen the notation for the one-forward-step update introduced in (11)–(13) as follows:
Definition 1**.**
Suppose and are real Hilbert spaces, is maximal-monotone with nonempty domain, is -cocoercive, and is bounded and linear. For and , define the mapping , with additional parameters , and , as
[TABLE]
To simplify the presentation, we will also use the notation
[TABLE]
With this notation, (11)–(13) may be written as
3.3 Algorithm Definition
Algorithms 1–3 define the main method proposed in this work. They produce a sequence of primal-dual iterates and, implicitly, . Algorithm 1 gives the basic outline of our method; for each operator, it invokes either our new one-forward-step update with a user-defined stepsize (through line 1) or its backtracking variant given in Algorithm 2 (through line 1). Together, algorithms 1–2 specify how to update the points used to define the separating affine function in (6). Algorithm 3, called from line 1 of Algorithm 1, defines the projectToHplane function that performs the projection step to obtain the next iterate.
Taken together, algorithms 1–3 are essentially the same as Algorithm 2 of [21], except that the update of uses the new procedure given in (11)–(13). For simplicity, the algorithm also lacks the block-iterative and asynchronous features of [10, 17, 21], which we plan to combine with algorithms 1–3 in a follow-up paper.
The computations in projectToHplane are all straightforward and of relatively low complexity. They consist of matrix multiplies by $G_i$, inner products, norms, and sums of scalars. In particular, there are no potentially difficult minimization problems involved. If $\mathcal{H}_i = \mathbb{R}^d$ and $G_i = I$ for $i = 1, \ldots, n$, then the computational complexity of projectToHplane is $O(nd)$.
3.4 Algorithm Parameters
The method allows two ways to select the stepsizes . One may either choose them manually or invoke the backTrack procedure. If one decides to select the stepsizes manually, the upper bound condition is required whenever . However, it may be difficult to ensure that this condition is satisfied when the cocoercivity constant is hard to estimate. The global cocoercivity constant may also be conservative in parts of the domain of , leading to unnecessarily small stepsizes in some cases. We developed the backtracking linesearch technique for these reasons. The set holds the indices of operators for which backtracking is to be used.
For a trial stepsize , Algorithm 2 generates candidate points using the single-forward-step procedure of (22). For these candidates, Algorithm 2 checks two conditions on lines 2–2. If both of these inequalities are satisfied, then backtracking terminates and returns the successful candidate points. If either condition is not satisfied, the stepsize is reduced by the factor and the process is repeated. These two conditions arise in the analysis in Section 5.
The parameter is a global upper bound on the stepsizes (both backtracked and fixed) and must be chosen to satisfy Assumption 2. In backTrack, one must choose an initial trial stepsize within a specified interval (line 2 of Algorithm 2). This interval arises in the analysis (see lemmas 16 and 17). Written in terms of the parameters passed into backTrack in the call on line 1 of Algorithm 1, and assuming the global upper bound is sufficiently large to not be active on line 2, the interval is
[TABLE]
An obvious choice is to set the initial stepsize to be at the upper limit of the interval. In practice we have observed that and tend to be approximately equal, so this allows for an increase in the trial stepsize by up to a factor of approximately over the previous stepsize.
Note that backTrack returns the chosen stepsize as well as the quantity which are needed to compute the available interval in the call to backTrack during the next iteration.
In the analysis it will be convenient to let be the initial trial stepsize chosen during iteration of Algorithm 1, when backTrack has been called through line 1 for some .
We call the stepsize returned by backTrack . Assuming that backTrack always terminates finitely (which we will show to be the case), we may write for
[TABLE]
The only difference between the update for on line 1 and this update for is that in the former, the stepsize is discovered by backtracking, while in the latter it is directly user-supplied.
The backTrack procedure computes several auxiliary quantities used to check the two backtracking termination conditions. The point is calculated to be the same as given in Definition 2. The quantity is the value of corresponding to the candidate points . The quantity computed on line 2 is equal to . Typically, we want to be as large as possible to get a bigger cut with the separating hyperplane, but the condition checked on line 2 will ultimately suffice to prove convergence.
Algorithm 1 has several additional parameters.
these are used in the backtracking procedure for . An obvious choice which we used in our numerical experiments was , i.e. the initial point.
:
allows for the projection to be performed using a slightly more general primal-dual metric than (16). In effect, this parameter changes the relative size of the primal and dual updates in lines 3–3 of Algorithm 3. As increases, a smaller step is taken in the primal and a larger step in the dual. As decreases, a smaller step is taken in the dual update and a larger step is taken in the primal. See [19, Sec. 5.1] and [18, Sec. 4.1] for more details.
In Algorithm 1, the averaging parameters and user-selected stepsizes are fixed across all iterations. In the preprint version of this paper [24], we instead allow these parameters to vary by iteration, subject to certain restrictions. Doing so complicates the notation and the analysis, so for relative simplicity we consider only fixed values of these parameters here. This simplification also accords with the parameter choices in our computational tests below. For the full, more complicated analysis, please refer to [24].
As written, Algorithm 1 is not as efficient as it could be. On the surface, it seems that we need to recompute in order to evaluate on line 1. However, was already computed in the previous iteration and can obviously be reused, so only one evaluation of is needed per iteration. Similarly, within backTrack, each invocation of on line 2 may reuse the quantity which was computed in the previous iteration of Algorithm 1. Thus, each iteration of the loop within backTrack requires one new evaluation of , to compute within .
We now precisely state our stepsize assumption for the manually chosen stepsizes, as well as the stepsize upper bound .
Assumption 2**.**
For : If , then otherwise . The parameter must satisfy
[TABLE]
Note that if , Assumption 2 effectively limits to be strictly less than , otherwise the stepsize would be forced to [math], which is prohibited. In this case must be chosen in . On the other hand, if , there is no constraint on other than that it is positive and nonzero, and in this case may be chosen in .
3.5 Separator-Projector Properties
Lemma 4 details the key results for Algorithm 1 that stem from it being a separator-projector algorithm. While these properties alone do not guarantee convergence, they are important to all of the arguments that follow.
Lemma 4**.**
Suppose that Assumption 1 holds. Then for Algorithm 1
1. The sequence is bounded.
2. If the algorithm never terminates via line 1, . Furthermore and for .
3. If the algorithm never terminates via line 1 and remains bounded for all , then .
Proof.
Parts 1–2 are proved in lemmas 2 and 6 of [21]. Part 3 can be found in Part 1 of the proof of Theorem 1 in [21]. The analysis in [21] uses a different procedure to construct the pairs , but the result is generic and not dependent on that particular procedure. Note also that [21] establishes the results in a more general setting allowing asynchrony and block-iterativeness, which we do not analyze here. ∎
4 The Special Case
Before starting the analysis, we consider the important special case . In this case, we have by assumption that , , and we are solving the problem where both operators are maximal monotone and is -cocoercive. In this case, Algorithm 1 reduces to a method which is similar to FB. Let , , , and . Assuming for simplicity that , meaning backtracking is not being used, then the updates carried out by the algorithm are
[TABLE]
If , then for all , the iterates computed in (25) reduce simply to
[TABLE]
which is exactly FB. However, is not allowed in our analysis. Thus, FB is a forbidden boundary case which may be approached by setting arbitrarily close to [math]. As approaches [math], the stepsize constraint approaches the classical stepsize constraint for FB: for some arbitrarily small constant . A potential benefit of Algorithm 1 over FB in the case is that it does allow for backtracking when is unknown or only a conservative estimate is available.
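To make the boundary case concrete, the following is a minimal sketch of classical FB itself, not of Algorithm 1. It assumes a hypothetical lasso-type test problem (the names `M`, `b`, and `lam` are illustrative and do not come from the paper), for which the smooth part has a gradient that is cocoercive with constant equal to the reciprocal of its Lipschitz constant, and it uses a stepsize strictly inside the classical FB range of at most twice the cocoercivity constant.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the resolvent of t times the l1 subdifferential)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(grad, prox, x0, rho, iters=500):
    """Classical FB: one forward (gradient) step followed by one resolvent step."""
    x = x0.copy()
    for _ in range(iters):
        x = prox(x - rho * grad(x), rho)
    return x

# Hypothetical test problem: min 0.5*||M x - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.1

L = np.linalg.norm(M, 2) ** 2   # Lipschitz constant of the gradient
beta = 1.0 / L                  # cocoercivity constant of the gradient
rho = 1.9 * beta                # strictly inside (0, 2*beta)

grad = lambda x: M.T @ (M @ x - b)
prox = lambda v, t: soft_threshold(v, t * lam)

x_fb = forward_backward(grad, prox, np.zeros(10), rho, iters=5000)
# A solution is a fixed point of the FB map, so this residual should be tiny.
fixed_point_residual = np.linalg.norm(x_fb - prox(x_fb - rho * grad(x_fb), rho))
```

In this sketch the cocoercivity constant is computed exactly from `M`; the backtracking variant discussed in the text is aimed at situations where such a constant is unknown or only conservatively estimated.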
5 Main Proof
The core of the proof strategy will be to establish (26) below. If this can be done, then weak convergence to a solution follows from part 3 of Theorem 1 in [21].
Lemma 5**.**
Suppose Assumption 1 holds and Algorithm 1 produces an infinite sequence of iterations without terminating via Line 1. If
[TABLE]
then there exists such that . Furthermore, we also have and for all , , and .
Proof.
Equivalent to part 3 of the proof of Theorem 1 in [21].∎
Lemma 5 can be intuitively understood as follows. If we define, for all ,
[TABLE]
then (26) is equivalent to saying that . For all , we have . If , then and since , it follows that and solves (3). Thus can be thought of as the “residual” of the algorithm which measures how far it is from finding a point in and a solution to (3). In finite dimension, it is straightforward to show that if , must converge to some element of . This can be done using Fejér monotonicity [4, Theorem 5.5] combined with the fact that the graph of a maximal-monotone operator in a finite-dimensional Hilbert space is closed [4, Proposition 20.38]. However, in the general Hilbert space setting the proof is more delicate, since the graph of a maximal-monotone operator is not, in general, closed in the weak-to-weak topology [4, Example 20.39]. Nevertheless, the overall result was established in the general Hilbert space setting in part 3 of Theorem 1 of [21], which is itself an instance of [1, Proposition 2.4] (see also [4, Proposition 26.5]). An arguably more transparent proof can be found in [16] (this proof is only for the case , but it can be extended).
In order to establish (26), we start by establishing certain contractive and “ascent” properties for the mapping , and also show that the backtracking procedure terminates finitely. Then, we prove the boundedness of and , in turn yielding the boundedness of the gradients and hence the result that by Lemma 4. Next we establish a “Lyapunov-like” recursion for , relating to . Eventually this result will allow us to establish that and hence that , which will in turn allow an argument that . The proof that will then follow fairly elementary arguments.
The primary innovations of the upcoming proof are the ascent lemma and the way that it is used in Lemma 18 to establish and . This technique is a significant deviation from previous analyses in the projective splitting family. In previous work, the strategy was to show that for a constant , which may be combined with to imply (26). In contrast, in the algorithm of this paper we cannot establish such a result and in fact may be negative. Instead, we relate to to show that the separation improves at each iteration in a way which still leads to overall convergence.
5.1 Some Basic Results
We begin by stating three elementary results on sequences, which may be found in [36], and a basic, well known nonexpansivity property for forward steps with cocoercive operators.
Lemma 6**.**
[36, Lemma 1, Ch. 2]* Suppose that for all , , , and for all . Then is a bounded sequence.*
Lemma 7**.**
[36, Lemma 3, Ch. 2]* Suppose that for all , , and there is some such that for all . Then .*
Lemma 8**.**
Suppose that and are sequences in with the properties and for all . Then .
Proof.
Negating the assumed inequality yields . Applying [36, Lemma 3, Ch. 2] then yields .∎
Lemma 9**.**
Suppose is -cocoercive and . Then for all
[TABLE]
Proof.
Squaring the left hand side of (27) yields
[TABLE]
∎
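The property in Lemma 9 can be checked numerically. The sketch below assumes the standard form of the result, namely that a forward step with a cocoercive operator is nonexpansive whenever the stepsize is at most twice the cocoercivity constant. It uses a hypothetical linear operator given by a symmetric positive semidefinite matrix `Q`, which is cocoercive with constant equal to the reciprocal of its largest eigenvalue; none of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((8, 8))
Q = G @ G.T                              # symmetric positive semidefinite
beta = 1.0 / np.linalg.eigvalsh(Q)[-1]   # cocoercivity constant of x -> Q x

def worst_forward_step_ratio(rho, trials=1000):
    """Largest observed ||(x - rho*Qx) - (y - rho*Qy)|| / ||x - y|| over random pairs."""
    worst = 0.0
    for _ in range(trials):
        x, y = rng.standard_normal(8), rng.standard_normal(8)
        d = x - y
        ratio = np.linalg.norm(d - rho * (Q @ d)) / np.linalg.norm(d)
        worst = max(worst, ratio)
    return worst

ratio_inside = worst_forward_step_ratio(2.0 * beta)    # at the boundary of the range
ratio_outside = worst_forward_step_ratio(4.0 * beta)   # well outside the range
```

With the stepsize inside the admissible range the observed ratio never exceeds one, while doubling the stepsize beyond the boundary lets the forward step expand distances along the leading eigenvector of `Q`.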
5.2 A Contractive Result
We begin the main proof with a result on the one-forward-step mapping: from Definition 1. The following lemma will ultimately be used to show that the iterates remain bounded.
Lemma 10**.**
Suppose , where is given in Definition 1. Recall that is -cocoercive. If or , then
[TABLE]
for any and .
Proof.
Select any and . Let . It follows immediately from (4) that
[TABLE]
Therefore, (22) and (29) yield
[TABLE]
To obtain (a), one uses the nonexpansivity of the resolvent [4, Prop. 23.8(ii)]. To obtain (b), one regroups terms and adds and subtracts . Then (c) follows from the triangle inequality. Finally we consider (d): If , apply Lemma 9 to the first term on the right-hand side of (30) with the stepsize which by assumption satisfies
[TABLE]
by Assumption 2. Alternatively, if , implying that is a constant-valued operator, then and (d) is just an equality. ∎
We now prove the key “ascent lemma”. It shows that, while the one-forward-step update is not guaranteed to find a separating hyperplane at each iteration, it does make a certain kind of progress toward separation.
Lemma 11**.**
Suppose , where is given in Definition 1. Recall is -cocoercive. Let and define . Further, define , as in (22), and . If and whenever , then
[TABLE]
Proof.
Since , there exists such that . Let . Note by (4) that . With this notation, .
We may write the -update in (22) as
[TABLE]
which rearranges to
[TABLE]
Adding to both sides yields
[TABLE]
Substituting this equation into the definition of yields
[TABLE]
We now focus on the second term in (33). Assume for now that (we will deal with the case below). We write
[TABLE]
To derive (34) we substituted and for the following inequality we used the monotonicity of and -cocoercivity of (recall that and ). Substituting the resulting inequality back into (33) yields
[TABLE]
Subtracting from both sides of the above inequality produces
[TABLE]
Using (32) once again, this time to the third term on the right-hand side of (36), we write
[TABLE]
Substituting this equation back into (36) yields
[TABLE]
We next use the identity on both inner products in (38), as follows:
[TABLE]
and
[TABLE]
Here we have used the identities
[TABLE]
Using (39)–(40) in (38) yields
[TABLE]
Consider this last expression: since , the coefficient multiplying is nonnegative. Furthermore, since , the coefficient multiplying is positive. Therefore we may drop these two terms from the above inequality and divide by to obtain (31).
Finally, we deal with the case in which , which implies that for some for all . The main difference is that the terms are no longer present since . The analysis is the same up to (33). In this case so instead of (35) we may deduce from (34) that
[TABLE]
Since is constant we also have that
[TABLE]
Thus, instead of (36) in this case we have the simpler inequality
[TABLE]
The term in (41) is dealt with just as in (36), by substitution of (32). This step now leads via (37) to
[TABLE]
Once again using on the second term on the r.h.s. above yields
[TABLE]
We can lower-bound the term by [math]. Dividing through by and rearranging, we obtain
[TABLE]
Since in the case, this is equivalent to (31).∎
5.3 Finite Termination of Backtracking
In all the following lemmas in sections 5.3 and 5.4 regarding algorithms 1–3, assumptions 1 and 2 are in effect and will not be explicitly stated in each lemma. We start by proving that backTrack terminates in a finite number of iterations, and that the stepsizes it returns are bounded away from [math].
Lemma 12**.**
For , Algorithm 2 terminates in a finite number of iterations for all . There exists such that for all , where is the stepsize returned by Algorithm 2 on line 1. Furthermore for all .
Proof.
Assume we are at iteration in Algorithm 1 and backTrack has been called through line 1 for some . The internal variables within backTrack are defined in terms of the variables passed from Algorithm 1 as follows: , , , , and . Furthermore , , , , , and . The calculation on line 2 of Algorithm 2 yields . In the following argument, we mostly refer to the internal name of the variables within backTrack without explicitly making the above substitutions. With that in mind, let be the cocoercivity constant of .
Recall that is the initial trial stepsize chosen on line 2 of backTrack. We must establish that the interval on line 2 is always nonempty and so a valid initial stepsize can be chosen. Since , this will be true if , which we will prove by induction. Note that by Assumption 2, for all . Therefore for , . We will prove the induction step below.
Observe that backtracking terminates via line 2 if two conditions are met. The first condition,
[TABLE]
is identical to (28) of Lemma 10, with and respectively in place of and . The initialization step of Algorithm 2 provides us with for some . Furthermore, since
[TABLE]
the findings of Lemma 10 may be applied. In particular, if and , then (42) will be met. Alternatively, if , (42) will hold for any value of the stepsize .
Next, consider the second termination condition,
[TABLE]
This relation is identical to (31) of Lemma 11, with in place of . However, to apply the lemma we must show that . We will also prove this by induction.
For , holds by the initialization step of Algorithm 1. Now assume that at iteration it holds that and furthermore that , therefore the interval on line 2 is nonempty. We may then apply the findings of Lemma 11 to conclude that if and , then condition (43) is satisfied. Or, if , condition (43) is satisfied for any .
Combining the above observations, we conclude that if and , backtracking will terminate for that iteration of backTrack via line 2. Or, if , it will terminate in the first iteration of backTrack. The stepsize decrement condition on line 2 of the backtracking procedure implies that will eventually hold for large enough , and hence that the two backtracking termination conditions must eventually hold.
Let be the iteration at which backtracking terminates when called for operator at iteration of Algorithm 1. For the pair returned by backTrack on line 1 of Algorithm 1, we may write
[TABLE]
Thus, by the definition of in (22), . Therefore, induction establishes that holds for all .
Now the returned stepsize must satisfy . In the next iteration, . Thus we have also established by induction that and therefore that the interval on line 2 is nonempty for all iterations . Finally, we now also infer by induction that backTrack terminates in a finite number of iterations for all and .
Now must be chosen in the range
[TABLE]
Since we have established that this interval remains nonempty, it holds trivially that . For all and , the returned stepsize must satisfy
[TABLE]
Therefore for all and all such that , one has
[TABLE]
where the first inequality uses (44) and , the second inequality recurses, and the final inequality is just (44) for . If , the argument is simply
[TABLE]
∎
5.4 Boundedness Results and their Direct Consequences
Lemma 13**.**
For all , the sequences and are bounded.
Proof.
To prove this, we first establish that for and
[TABLE]
For , Lemma 12 establishes that backTrack terminates for finite for all . For fixed and , let be the iteration of backTrack that terminates. At termination, the following condition is satisfied via line 2:
[TABLE]
Into this inequality, now substitute in the following variables from Algorithm 1, as passed to and from backTrack: , , , , , , , , and . Further noting that , the result is (45).
For , we note that line 1 of Algorithm 1 reads as
[TABLE]
and since Assumption 2 holds, we may apply Lemma 10. Further noting that by Assumption 2, we arrive at (45).
Since , and are bounded by Lemma 4 and is bounded by Assumption 1, boundedness of now follows by applying Lemma 6 with to (45).
Next, boundedness of follows from the continuity of . Since Lemma 12 established that backTrack terminates in a finite number of iterations we have for any that
[TABLE]
where for . Expanding the -update in the definition of in (22), we may write
[TABLE]
Since , , and are bounded, for , and (using Lemma 12 for ), and for is constant, we conclude that remains bounded.∎
With and bounded for all , the boundedness of follows immediately:
Lemma 14**.**
The sequence is bounded. If Algorithm 1 never terminates via line 1, .
Proof.
By Lemma 3, , which is bounded since each is bounded by assumption and each is bounded by Lemma 13. Furthermore, is bounded using the same two lemmas. That then immediately follows from Lemma 4(3).∎
Using the boundedness of and , we can next derive the following simple bound relating to :
Lemma 15**.**
There exists such that for all and ,
[TABLE]
Proof.
For each , let be respective bounds on $\{\|G_i z^{k-1} - x_i^{k-1}\|\}$ and $\{\|y_i^{k-1} - w_i^{k}\|\}$, which must exist by Lemma 4, the boundedness of and , and the boundedness of . Let and . Then, for any and , we may write
[TABLE]
where the last step uses the Cauchy-Schwarz inequality and the definitions of and .∎
5.5 A Lyapunov-Like Recursion for the Hyperplane
We now establish a Lyapunov-like recursion for the hyperplane. For this purpose, we need two more definitions.
Definition 2**.**
For all , since Lemma 12 establishes that Algorithm 2 terminates in a finite number of iterations, we may write for :
[TABLE]
where for are actually fixed. Using (4) and the -update in (22), there exists such that
[TABLE]
Define .
Definition 3**.**
For we will use , even though these stepsizes are fixed, so that we can use the same statements as for . Similarly we will use for .
Lemma 16**.**
For all , and
[TABLE]
Proof.
For , recall that is the initial trial stepsize chosen on line 2 of backTrack at iteration for some . The condition on line 2 of backTrack guarantees that
[TABLE]
Multiplying through by and noting that proves the lemma.
For the expression holds trivially because . ∎
Lemma 17**.**
For all and ,
[TABLE]
and
[TABLE]
Proof.
Take any . Lemma 12 guarantees the finite termination of backTrack. Now consider the backtracking termination condition
[TABLE]
Fix some , and let be the iteration at which backTrack terminates. In the above inequality, make the following substitutions for the internal variables of backTrack by those passed in/out of the function: , , , , , . Furthermore, where is defined in Definition 2. Together, these substitutions yield (47). We can then apply Lemma 16 to get (48).
Now take any . From line 1 of Algorithm 1, Assumption 2, and Lemma 11, we directly deduce (47). Combining this relation with (46) we obtain (48).∎
5.6 Finishing the Proof
We now work toward establishing the conditions of Lemma 5. Unless otherwise specified, we henceforth assume that Algorithm 1 runs indefinitely and does not terminate at line 1. Termination at line 1 is dealt with in Theorem 1 to come.
Lemma 18**.**
For all , we have and .
Proof.
Fix any . First, note that for all ,
[TABLE]
where and is a bound on , which must exist because both and are bounded by lemmas 4 and 13. Note that as a consequence of Lemma 4.
Second, recall Lemma 15, which states that there exists such that for all ,
[TABLE]
Now let, for all ,
[TABLE]
so that
[TABLE]
Using (49) and (50) in (48) yields
[TABLE]
where
[TABLE]
Note that is bounded, , is finite, and by Lemma 4, and . Thus .
Since , we may apply Lemma 8 to (53) with , which yields . Therefore
[TABLE]
On the other hand, by Lemma 14. Therefore, using (52) and (55),
[TABLE]
Therefore $\lim_{k\to\infty} \varphi_k(p^k) = 0$. Consider any . Combining $\lim_{k\to\infty} \varphi_k(p^k) = 0$ with we have
[TABLE]
Since (using Lemma 12 for ) we conclude that .∎
We have already proved the first requirement of Lemma 5, that for all . We now work to establish the second requirement, that . In the upcoming lemmas we continue to use the quantity which is given in Definition 2.
Lemma 19**.**
Recall from Definition 2. For all , .
Proof.
Fix any . For all , repeating (47) from Lemma 17, we have
[TABLE]
where we have used defined in (51) along with (49)–(50), and is defined in (54). This is the same argument used in Lemma 18, but now we apply (49)–(50) to (47), rather than (48), so that we can upper bound the term. Summing over , yields
[TABLE]
Since , , , and for all , the above inequality implies that .∎
Lemma 20**.**
For , .
Proof.
Fix . Using the definition of in Definition 2, we have for that
[TABLE]
Using the definition of , also in Definition 2, this implies that
[TABLE]
Subtracting the second of these equations from the first yields, for all ,
[TABLE]
Taking norms and using the triangle inequality yields, for all , that
[TABLE]
where
[TABLE]
Since is bounded from above, using Lemma 19, the finiteness of , and Lemma 4. Furthermore, , so we may apply Lemma 7 to (57) to conclude that .∎
Lemma 21**.**
For , .
Proof.
Recalling (56), we first write
[TABLE]
Lemma 20 implies that the first term on the right-hand side of (58) converges to zero. Since is bounded, Lemma 19 implies that the second term on the right-hand side also converges to zero. Since , we conclude that . ∎
Finally, we can state our convergence result for Algorithm 1:
Theorem 1**.**
Suppose that assumptions 1-2 hold. If Algorithm 1 terminates by reaching line 1, then its final iterate is a member of the extended solution set . Otherwise, the sequence generated by Algorithm 1 converges weakly to some point in the extended solution set of (2) defined in (5). Furthermore, and for all , , and .
Proof.
For the finite termination result we refer to Lemma 5 of [21]. Otherwise, lemmas 18 and 21 imply that the hypotheses of Lemma 5 hold, and the result follows.∎
6 Numerical Experiments
All our numerical experiments were implemented in Python (using NumPy and SciPy) on an Intel Xeon workstation running Linux with 16 cores and 64 GB of RAM. The code is available on GitHub at https://github.com/projective-splitting/coco. We restricted our attention to algorithms with comparable features and benefits to our proposed method. Thus we only considered methods that:
1. Are first-order and “fully split” the problem (that is, separate the linear operators from the resolvent calculations, and use gradient-type steps for smooth functions),
2. Do not (either approximately or exactly) solve a linear system of equations at each iteration or before the first iteration,
3. Avoid having to apply “smoothing” to nonsmooth operators,
4. Incorporate a backtracking linesearch in a manner that avoids the need for bounds on Lipschitz or cocoercivity constants, and
5. Do not use iterative approximation of resolvents.
The last property we include for reasons of simplicity, while the rest contribute to making algorithms scalable and easy to apply. For a given application, there may of course be effective algorithms which could have been considered but do not satisfy all of the above requirements. However, because of the general desirability of properties 1-4 and the relative simplicity of algorithms with property 5, we only considered methods having all of them.
We compared this paper’s backtracking one-forward-step projective splitting algorithm given in Algorithm 1 (which we call ps1fbt) with the following methods:
- The two-forward-step projective splitting algorithm with backtracking we developed in [21] (ps2fbt). This method requires only Lipschitz continuity of single-valued operators, as opposed to cocoercivity.
- The adaptive three-operator splitting algorithm of [34] (ada3op) (where “adaptive” is used to mean “backtracking linesearch”); this method is a backtracking adaptation of the fixed-stepsize method proposed in [14]. This method requires in problem (2) and hence can only be readily applied to two of the three test applications described below.
- The backtracking linesearch variant of the Chambolle-Pock primal-dual splitting method [30] (cp-bt).
- The algorithm of [12]. This is essentially Tseng’s method applied to a product-space “monotone + skew” inclusion in the following way: Assuming is Lipschitz monotone, problem (3) is equivalent to finding such that (which is equivalent to ) for , and . In other words, we wish to solve , where and are defined by
[TABLE]
is maximal monotone, while is the sum of two Lipschitz monotone operators (the second being skew linear), and therefore also Lipschitz monotone. The algorithm in [12] is essentially Tseng’s forward-backward-forward method [40] applied to this inclusion, using resolvent steps for and forward steps for . Thus, we call this method tseng-pd. In order to achieve good performance with tseng-pd we had to incorporate a diagonal preconditioner as proposed in [41].
- The recently proposed forward-reflected-backward method [31], applied to this same primal-dual inclusion specified by (59)-(72). We call this method frb-pd.
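For readers unfamiliar with the forward-backward-forward iteration underlying tseng-pd, the following is a minimal fixed-stepsize sketch of Tseng's method on a small hypothetical "monotone + skew" inclusion. It is not the preconditioned, backtracking primal-dual implementation used in the experiments; the operator `B`, the box constraint, and the stepsize are illustrative assumptions.

```python
import numpy as np

def tseng_fbf(B, resolvent, x0, rho, iters=200):
    """Tseng's FBF: a forward-backward step followed by a correcting forward step."""
    x = x0.copy()
    for _ in range(iters):
        Bx = B(x)
        x_bar = resolvent(x - rho * Bx, rho)   # forward step, then resolvent step
        x = x_bar - rho * (B(x_bar) - Bx)      # second forward step (correction)
    return x

# Hypothetical inclusion: 0 in N_C(x) + (I + K) x - q, with K skew-symmetric
# and C the box [0, 1]^2, whose normal-cone resolvent is a clip.
K = np.array([[0.0, 1.0], [-1.0, 0.0]])
M = np.eye(2) + K
q = np.array([0.6, 0.2])

B = lambda x: M @ x - q
resolvent = lambda v, rho: np.clip(v, 0.0, 1.0)

lipschitz = np.linalg.norm(M, 2)   # Lipschitz constant of B
rho = 0.5                          # must be below 1 / lipschitz
x_fbf = tseng_fbf(B, resolvent, np.zeros(2), rho)
```

Note that each iteration evaluates `B` twice, which is the per-iteration cost that a single-forward-step method avoids for cocoercive operators.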
Recently there have been several stochastic extensions of ada3op and cp-bt [42, 43, 33]. The method of [43] requires estimates of the Lipschitz constants and matrix norms, and so does not satisfy our experimental requirements. Since one of our problems is not in “finite-sum” format, and another includes a matrix which is not equal to the identity, the methods of [42, 33] could only be applied to one of our three test problems. Even for this problem, the number of training examples in the two datasets were and , respectively, while the feature dimensions were and , so finite-sum methods are not particularly suitable. For these reasons we did not include these methods in our experiments.
6.1 Portfolio Selection
Consider the optimization problem:
[TABLE]
where , , and . This model arises in Markowitz portfolio theory. We chose this particular problem because it features two constraint sets (a general halfspace and a simplex) onto which it is easy to project individually, but whose intersection poses a more difficult projection problem. This property makes it difficult to apply first-order methods such as ISTA/FISTA [5] as they can only perform one projection per iteration and thus cannot fully split the problem. On the other hand, projective splitting can handle an arbitrary number of constraint sets so long as one can compute projections onto each of them. We consider a fairly large instance of this problem so that standard interior point methods (for example, those in the CVXPY [15] package) are disadvantaged by their high per-iteration complexity and thus not generally competitive with first-order methods. Furthermore, backtracking variants of first-order methods are preferable for large problems as they avoid the need to estimate the largest eigenvalue of .
To convert (73) to a monotone inclusion, we set where is the normal cone of the simplex . We set , which is the gradient of the objective function and is cocoercive (and Lipschitz-continuous). Finally, we set , where , and let be the zero operator. Note that the resolvents of and (that is, the projections onto and ) are easily computed in operations [32]. With this notation, one may write (73) as the problem of finding such that
[TABLE]
which is an instance of (2) with and .
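The two resolvents in this formulation are Euclidean projections, which can be sketched as follows. The sketch assumes the simplex is the set of nonnegative vectors summing to one and the halfspace has the form {x : mᵀx ≥ r}; both forms, and the function names, are assumptions made for illustration. The simplex projection shown is the simple sort-based variant, which costs O(d log d), rather than the method of [32].

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    u = np.sort(v)[::-1]                        # entries in decreasing order
    css = np.cumsum(u) - 1.0                    # shifted cumulative sums
    k = np.arange(1, v.size + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]    # last index with a positive gap
    tau = css[idx] / (idx + 1)                  # optimal shift
    return np.maximum(v - tau, 0.0)

def project_halfspace(v, m, r):
    """Euclidean projection of v onto {x : m @ x >= r}."""
    gap = r - m @ v
    if gap <= 0:
        return v.copy()                         # already feasible
    return v + (gap / (m @ m)) * m              # move along m to the boundary

p_simplex = project_simplex(np.array([0.5, 1.5, -1.0]))
p_half = project_halfspace(np.zeros(2), np.array([1.0, 1.0]), 2.0)
```

Each projection is cheap on its own, whereas projecting onto the intersection of the two sets has no comparably simple closed form, which is what makes a fully split method attractive here.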
To terminate each method in our comparisons, we used the following common criterion incorporating both the objective function and the constraints of (73):
[TABLE]
where is the optimal value of the problem. Note that if and only if solves (73). To estimate , we used the best feasible value returned by any method after iterations.
We generated random instances of (73) as follows: we set to obtain a relatively large instance of the problem. We then generated a matrix with each entry drawn from . The matrix is then formed as , which is guaranteed to be positive semidefinite. We then generate the vector of length to have entries uniformly distributed between [math] and . The constant is set to for various values of . We solved the problem for .
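A small-scale sketch of this instance generation is given below. The dimension, the entry distribution of the generating matrix, and the range of the vector entries are all placeholder assumptions chosen for illustration; they are not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                 # hypothetical dimension, far smaller than in the experiments
G = rng.standard_normal((d, d))        # assumed: standard normal entries
Q = G @ G.T                            # a Gram matrix, hence positive semidefinite
m = rng.uniform(0.0, 1.0, size=d)      # hypothetical range for the entries of the vector

eigs = np.linalg.eigvalsh(Q)           # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
```

The largest eigenvalue `lam_max` is the quantity a fixed-stepsize method would need to estimate; the backtracking methods compared here avoid that computation.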
All methods were initialized at the same point . For all the backtracking linesearch procedures except cp-bt, the initial stepsize estimate is the previously discovered stepsize; at the first iteration, the initial stepsize is . For cp-bt we allowed the stepsize to increase in accordance with [30, Algorithm 4], as performance was poor otherwise. The backtracking stepsize decrement factor ( in Algorithm 2) was for all algorithms.
For ps1fbt and ps2fbt, was discovered via backtracking. We also set the other stepsize equal to at each iteration. While this is not necessary, this heuristic performed well and eliminated as a separately tunable parameter. For the averaging parameters in ps1fbt, we used and (which is possible because ). For ps1fbt we set and .
For tseng-pd and frb-pd, we used the following preconditioner:
[TABLE]
where is used as in [41, Eq. (3.2)] for tseng-pd ( on [31, p. 7] for frb-pd). In this case, the “monotone + skew” primal-dual inclusion described in (59)-(72) features two -dimensional dual variables in addition to the -dimensional primal variable. The parameter changes the relative size of the steps taken in the primal and dual spaces, and plays a similar role to in our algorithm (see Algorithm 3). The parameter in [30, Algorithm 4] plays a similar role for cp-bt. For all of these methods, we have found that performance is highly sensitive to this parameter: the primal and dual stepsizes need to be balanced. The only method not requiring such tuning is ada3op, which is a purely primal method. With this setup, all the methods have one tuning parameter except ada3op, which has none. For each method, we manually tuned the parameter for each ; Table 1 shows the final choices.
We calculated the criterion in (74) for computed by ps1fbt and ps2fbt, computed on Line 3 of [34, Algorithm 1] for ada3op, computed in [30, Algorithm 4] for cp-bt, and the primal iterate for tseng-pd and frb-pd. Table 2 displays the average number of iterations and running time, over random trials, until falls (and stays) below for each method. Examining the table, we observe the following:
- For all four problems, ps1fbt outperforms ps2fbt. This behavior is not surprising, as ps1fbt only requires one forward step per iteration, rather than two. Since the matrix is large and dense, reducing the number of forward steps should have a sizable impact.
- For , ps1fbt is the best-performing method. However, for , ada3op is the quickest.
6.2 Sparse Group Logistic Regression
Consider the following problem:
[TABLE]
where and for are given data, are regularization parameters, and is a set of subsets of such that no element is in more than one group . This is the non-overlapping group-sparse logistic regression problem, which has applications in bioinformatics, image processing, and statistics [37]. It is well understood that the penalty encourages sparsity in the solution vector. On the other hand the group-sparse penalty encourages group sparsity, meaning that as increases more groups in the solution will be set entirely to [math]. The group-sparse penalty can be used when the features/predictors can be put into correlated groups in a meaningful way. As with the portfolio experiment, this problem features two nonsmooth regularizers and so methods like FISTA cannot easily be applied.
Problem (76) may be treated as a special case of (1) with , , and
[TABLE]
Since the logistic regression loss has a Lipschitz-continuous gradient and the -norm and non-overlapping group-lasso penalties both have computationally simple proximal operators, all our candidate methods may be applied.
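The two proximal operators mentioned here have simple closed forms, sketched below. The sketch assumes an unweighted group penalty (a plain sum of Euclidean norms over the groups) and represents groups as lists of indices; the paper's formulation may include group weights, so treat the function names and conventions as illustrative.

```python
import numpy as np

def prox_l1(v, t):
    """Prox of t * ||.||_1: entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_group_lasso(v, t, groups):
    """Prox of t * sum_g ||v_g||_2 for non-overlapping groups (block soft-thresholding)."""
    out = v.copy()                      # indices outside every group are left unchanged
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        # Shrink the whole block toward zero; zero it out if its norm is at most t.
        scale = max(0.0, 1.0 - t / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * v[g]
    return out

v = np.array([3.0, 4.0, 0.1, -0.1])
groups = [[0, 1], [2, 3]]
p_group = prox_group_lasso(v, 1.0, groups)
p_l1 = prox_l1(np.array([3.0, -0.5, 0.2]), 1.0)
```

In the example, the first group has norm 5 and is scaled by 0.8, while the second group has norm below the threshold and is set entirely to zero, which illustrates how the group penalty removes whole groups at once.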
We applied (76) to two bioinformatics classification problems with real data. Following [37], we used the breast cancer dataset of [27] and the inflammatory bowel disease (IBD) dataset of [6] (the breast cancer dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379 and the IBD dataset at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3365). The breast cancer dataset contains gene expression levels for 60 patients with estrogen-positive breast cancer. The patients were treated with tamoxifen for 5 years and classified based on whether the cancer recurred (there were 28 recurrences). The goal is to use the gene expression values to predict recurrence. The IBD data set contains gene expression levels for 127 patients, 85 of whom have IBD. The IBD data set actually features three classes: ulcerative colitis (UC), Crohn’s disease (CD), and normal, and so the most natural goal would be to perform three-way classification. For simplicity, we considered a two-way classification problem of UC/CD patients versus normal patients.
For both datasets, as in [37], the group structure was extracted from the C1 dataset [38], which groups genes based on cytogenetic position data (the C1 dataset is available at http://software.broadinstitute.org/gsea/index.jsp). Genes that are in multiple C1 groups were removed from the dataset; overlapping group norms can also be handled with our method, but using a different problem formulation than (76). We also removed genes that could not be found in the C1 dataset, although doing so was not strictly necessary. After these steps, the breast cancer data had 7,705 genes in 324 groups, with each group having an average of 23.8 genes. For the IBD data there were 19,836 genes in 325 groups, with an average of 61.0 genes per group. Let be the data matrix with each row equal to for ; as a final preprocessing step, we normalized the columns of to have unit -norm, which tended to improve the performance of the first-order methods, especially the primal-dual ones.
For simplicity we set the regularization parameters to be equal: . In practice, one would typically solve (76) for various values of and then choose the final model based on cross-validation performance combined with other criteria such as sparsity. Therefore, to give an overall sense of the performance of each algorithm, we solved (76) for three values of : large, medium, and small, corresponding to decreasing the amount of regularization and moving from a relatively sparse solution to a dense solution. For the breast cancer data, we selected and for IBD we chose . The corresponding number of non-zero entries, non-zero groups, and training error of the solution are reported in Table 3. Since the goal of these experiments is to assess the computational performance of the optimization solvers, we did not break up the data into training and test sets, instead treating the entire dataset as training data.
We initialized all the methods to the [math] vector. As in the portfolio problem, all stepsizes were initially set to . Since the logistic regression function does not have uniform curvature, we allowed the initial trial stepsize in the backtracking linesearch to increase by a factor of multiplied by the previously discovered stepsize. The methods ps1fbt, cp-bt, and ada3op have an upper bound on the trial stepsize at each iteration, so the trial stepsize was taken to be the minimum of multiplied by the previous stepsize and this upper bound.
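The rule for choosing the initial trial stepsize of each linesearch can be sketched as below. This is an illustrative sketch only: the function name `next_trial_stepsize` is hypothetical, and `growth_factor` is a placeholder parameter whose value is not taken from the text.

```python
def next_trial_stepsize(prev_stepsize, growth_factor, upper_bound=None):
    """Initial trial stepsize for the next backtracking linesearch.

    prev_stepsize: stepsize accepted by the previous linesearch.
    growth_factor: multiplier (>= 1) that lets the trial stepsize grow;
                   its value is an assumption of this sketch.
    upper_bound:   optional cap on the trial stepsize, for methods that
                   impose one (ps1fbt, cp-bt, and ada3op in the text).
    """
    if growth_factor < 1.0:
        raise ValueError("growth_factor must be at least 1")
    trial = growth_factor * prev_stepsize
    if upper_bound is not None:
        trial = min(trial, upper_bound)
    return trial
```

Methods without a cap simply omit `upper_bound`, so the trial stepsize is the previous stepsize scaled up; the linesearch then shrinks it as needed.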
Otherwise, the setup was the same as the portfolio experiment. tseng-pd and frb-pd use the same preconditioner as given in (75). For ps1fbt and ps2fbt we set to be equal to the discovered backtracked stepsize at each iteration. For ps1fbt we again set , , and fixed to . As such, all methods (except ada3op) have one tuning parameter which was hand-picked for each method; the chosen values are given in Table 4.
Figure 3 shows the results of the experiments, plotting against time for each algorithm, where is the objective function in (76) and is the estimated optimal value. To approximate , we ran each algorithm for 4,000 iterations and took the lowest value obtained. Overall, ps1fbt and ada3op were much faster than the other methods. For the highly regularized cases (the right column of the figure), ps1fbt was faster than all other methods. For middle and low regularization, ps1fbt and ada3op are comparable, and for ada3op is slightly faster on the breast cancer data. The methods ps1fbt and ada3op may be successful because they exploit the cocoercivity of the gradient, while ps2fbt, tseng-pd, and frb-pd only treat it as Lipschitz continuous. cp-bt also exploits cocoercivity, but its convergence was slow nonetheless. We discuss the performance of ps1fbt versus ps2fbt more in Section 6.3.
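The estimate of the optimal value described above can be sketched as follows. The helper names are hypothetical, and the relative gap computed by `relative_gaps` is one common choice of plotted quantity, assumed here for illustration rather than taken from the text.

```python
def estimate_optimal_value(histories):
    """Lowest objective value seen by any algorithm at any iteration.

    histories: dict mapping an algorithm name to the list of objective
               values it produced (one value per iteration).
    """
    return min(min(values) for values in histories.values())


def relative_gaps(values, f_star):
    """Relative optimality gap (F(x_k) - F*) / |F*| for each iterate.

    This normalization is an assumption of the sketch; the small floor
    on the denominator only guards against division by zero.
    """
    scale = max(abs(f_star), 1e-16)
    return [(v - f_star) / scale for v in values]
```

Because the estimate is the minimum over all runs, every computed gap is nonnegative, and the run that attained the minimum reaches a gap of exactly zero at that iterate.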
6.3 Final Comments: ps1fbt versus ps2fbt
On the portfolio problem, ps1fbt and ps2fbt have fairly comparable performance, with ps1fbt being slightly faster. However, for the group logistic regression problem, ps1fbt is significantly faster. Given that both methods are based on the same projective splitting framework but use different forward-step procedures to update , this difference may be somewhat surprising. Since ps1fbt only requires one forward step per iteration while ps2fbt requires two, one might expect ps1fbt to be about twice as fast as ps2fbt. But for the group logistic regression problem, ps1fbt significantly outpaces this level of performance.
Examining the stepsizes returned by backtracking for both methods reveals that ps1fbt returns much larger stepsizes for the logistic regression problem, typically - orders of magnitude larger; see Figure 4. For the portfolio problem, where the performance of the two methods is more similar, this is not the case: the ps1fbt stepsizes are typically about twice as large as the ps2fbt stepsizes, in keeping with their theoretical upper bounds of and , respectively.
Note that the portfolio problem has a smooth function which is quadratic and hence has the same curvature everywhere, while group logistic regression does not. We hypothesize that the backtracking scheme in ps1fbt does a better job adapting to nonuniform curvature. A possible reason for this behavior is that the termination criterion for the backtracking search in ps1fbt may be weaker than for ps2fbt. For example, while ps2fbt requires to be positive at each iteration and operator , ps1fbt does not.
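To make the contrast in curvature concrete, the sketch below evaluates the second derivative of the scalar logistic loss log(1 + exp(-t)), which equals s(t)(1 - s(t)) for the sigmoid s. It peaks at 1/4 at t = 0 and decays rapidly as |t| grows, whereas a quadratic has constant second derivative. The function name is hypothetical and the snippet is only an illustration of nonuniform curvature, not part of the experiments.

```python
import math


def logistic_curvature(t):
    """Second derivative of log(1 + exp(-t)) with respect to t.

    Uses the identity s(t) * (1 - s(t)) = e / (1 + e)**2 with
    e = exp(-|t|); the expression is even in t, so taking |t| keeps
    the exponential from overflowing for large negative t.
    """
    e = math.exp(-abs(t))
    return e / (1.0 + e) ** 2


# Curvature at a few margins: largest at 0, tiny far from 0.
curvatures = {t: logistic_curvature(t) for t in (0.0, 2.0, 10.0)}
```

A global Lipschitz constant reflects only the worst case near t = 0, so in regions where the margins are large a linesearch can in principle accept stepsizes far above the global bound.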
Acknowledgments
This research was supported by the National Science Foundation grant CCF-1617617.
References
- [1] Alotaibi, A., Combettes, P.L., Shahzad, N.: Solving coupled composite monotone inclusions by successive Fejér approximations of their Kuhn–Tucker set. SIAM Journal on Optimization 24 (4), 2076–2095 (2014)
- [2] Baillon, J.B., Haddad, G.: Quelques propriétés des opérateurs angle-bornés et $n$-cycliquement monotones. Israel Journal of Mathematics 26 (2), 137–150 (1977)
- [3] Bauschke, H.H., Combettes, P.L.: The Baillon-Haddad Theorem Revisited. Journal of Convex Analysis 17 (3-4, SI), 781–787 (2010)
- [4] Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces, 2nd edn. Springer (2017)
- [5] Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing 18 (11), 2419–2434 (2009)
- [6] Burczynski, M.E., Peterson, R.L., Twine, N.C., Zuberek, K.A., Brodeur, B.J., Casciotti, L., Maganti, V., Reddy, P.S., Strahs, A., Immermann, F., et al.: Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics 8 (1), 51–61 (2006)
- [7] Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), 120–145 (2011)
- [8] Combettes, P.L.: Fejér monotonicity in convex optimization. In: Encyclopedia of optimization, vol. 2, pp. 106–114. Springer Science & Business Media (2001)
