Testing and non-linear preconditioning of the proximal point method
Tuomo Valkonen

Tuomo Valkonen ModeMat, Escuela Politécnica Nacional, Quito, Ecuador; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom. [email protected]
(2017-03-16 (revised 2018-08-23))
Abstract
Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we formalise common arguments in convergence rate and convergence proofs of optimisation methods to the verification of a simple iteration-wise inequality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or the α-averaged property. The main purpose of this work is to provide the abstract background theory for our companion paper “Block-proximal methods with spatially adapted acceleration”. In the present account we demonstrate the effectiveness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these methods include the gradient descent, forward–backward splitting, Douglas–Rachford splitting, Newton’s method, as well as several methods for saddle-point problems, such as the Alternating Directions Method of Multipliers, and the Chambolle–Pock method.
Get the version from http://tuomov.iki.fi/publications/; citations are broken in this one due to arXiv being stuck in the 70s and not supporting biblatex (or 80s bibtex for that matter), hence no modern bibliography styles or utf8.
1 Introduction
The proximal point method for monotone operators [21, 27], while infrequently used by itself, can be found as a building block of many popular optimisation algorithms. Indeed, many important application problems can be written in the form
[TABLE]
for convex and , and a linear operator , with and non-smooth and smooth. Examples abound in image processing and data science. The problem (P) can often be solved by methods such as forward–backward splitting, ADMM (alternating directions method of multipliers) and their variants [2, 19, 13, 7]. They all involve a proximal point step.
The equivalent saddle point form of (P) is
[TABLE]
In particular within mathematical image processing and computer vision, a popular algorithm for solving (S) with is the primal–dual method of Chambolle and Pock [7]. As discovered in [14], the method can most concisely be written as a preconditioned proximal point method, solving on each iteration for the variational inclusion
[TABLE]
where the monotone operator
[TABLE]
encodes the optimality condition for (S). In the standard proximal point method [27], one would take the identity. With this choice, (PP0) is generally difficult to solve. In the Chambolle–Pock method the preconditioning operator is given for suitable step length parameters by
[TABLE]
This choice of decouples the primal and dual updates, making the solution of (PP0) feasible in a wide range of problems. If is strongly convex, the step length parameters can be chosen to yield convergence rates of an ergodic duality gap and the quadratic distance .
In our earlier work [31], we have modified as well as the condition (PP0) to still allow a level of mixed-rate acceleration when is strongly convex only on sub-spaces. Our convergence proofs were based on testing the abstract proximal point method by a suitable operator, which encodes the desired and achievable convergence rates on relevant subspaces.
In the present paper, we extend this theoretical approach to non-linear preconditioning, non-invertible step-length operators, and arbitrary monotone operators . Our main purpose is to provide the abstract background theory for our companion paper [30]. Here, within these pages, we demonstrate that several classical optimisation methods, including the second-order Newton’s method, can also be seen as variants of the proximal point method, and that their common convergence rate and convergence proofs reduce to the verification of a simple iteration-wise inequality. Through application of our theory to Browder’s fixed point theorem [4] in section 2.6, we see that our inequality generalises the concepts of firm non-expansivity or the α-averaged property. Our theory also covers stochastic variants of the considered algorithms.
In section 2, we start by developing our theory for general monotone operators . This extends, simplifies, and clarifies the more disconnected results from [31] that concentrated on saddle-point problems with preconditioners derived from (1). We demonstrate our results on the basic proximal point method, gradient descent, forward–backward splitting, Douglas–Rachford splitting, and Newton’s method. The proximal step in forward–backward splitting and proximal Newton’s method can be introduced completely “free”, without any additional proof effort, in our approach. In section 3 we demonstrate the further flexibility of our techniques by application to stochastic block coordinate methods. We refer to [33] for a review of this class of methods. In the final sections 4 and 5 we specialise our work to saddle-point problems, and demonstrate the results on variants of the Chambolle–Pock method, and the Generalised Iterative Soft Thresholding (GIST) algorithm of [19]. Some of the derivations in these last two sections are quite abstract and general, as we will need this for our companion paper [30] where we develop stochastic primal-dual methods with coordinate-wise adapted step lengths.
Besides already cited works, other previous work related to ours includes that on generalised proximal point methods, such as [6, 9], as well as inertial methods for variational inclusions [18].
2 An abstract preconditioned proximal point iteration
2.1 Notation and general setup
We use to denote the space of convex, proper, lower semicontinuous functions from to the extended reals , and to denote the space of bounded linear operators between Hilbert spaces and . We denote the identity operator by . For , we write when is positive semidefinite. Also for possibly non-self-adjoint , we introduce the inner product and norm-like notations
[TABLE]
For a set , we write if every element satisfies .
Our overall wish is to find some , on a Hilbert space , solving for a given set-valued map the variational inclusion
[TABLE]
Throughout the manuscript, stands for an arbitrary root of a relevant map . In the present section 2, will be arbitrary, but in sections 4 and 5, where we specialise the results, we concentrate on arising from the saddle point problem (S).
Our strategy towards finding a solution is to introduce an arbitrary non-linear iteration-dependent preconditioner and a step length operator . With these, we define the generalised proximal point method, which on each iteration solves from
[TABLE]
We assume that splits into , and as
[TABLE]
More generally, to rigorously extend our approach to cases that would otherwise involve set-valued , we also consider for the iteration
[TABLE]
We say that (PP) or (PP∼) is solvable for the iterates if given any , we can solve the corresponding inclusion to iteratively calculate from for each .
2.2 Basic estimates
We analyse the preconditioned proximal point methods (PP) and (PP∼) by applying a testing operator , following the ideas introduced in [31]. The product with the linear part of the preconditioner, will, as we soon demonstrate, be an indicator of convergence rates. In essence, as seen in the descent inequality (DI) of the next result, the operator forms a local metric (in the differential geometric sense) that measures closeness to a solution.
Theorem 2.1.
On a Hilbert space , let , and for . Suppose (PP∼) is solvable for . If for all , is self-adjoint, and for some and the fundamental condition
[TABLE]
holds, then so do the quantitative -Fejér monotonicity
[TABLE]
as well as the descent inequality
[TABLE]
The main condition (CI∼) of theorem 2.1 essentially writes in abstract and step-dependent form the three-point formulas that hold for convex smooth functions (see appendix B). The term is able to measure the strong monotonicity of or the approximation . Indeed, if we have the estimate
[TABLE]
then this suggests to update the local metrics as
[TABLE]
where we write to indicate that only the norm induced by the two operators has to be the same: might not be self-adjoint, while has to be self-adjoint. As we will see in section 4.2, these metric update and self-adjointness conditions effectively give popular primal–dual optimisation methods their necessary forms. The term , on the other hand, as we shall see in more detail in section 2.3, gives the necessary leeway for taking a forward step instead of a proximal step with respect to some components of . The term can model function value differences or duality gaps, as will be the case in this work, but in other contexts, such as the stochastic methods of our companion paper [30], it will be a penalty for the dissatisfaction of the metric update; hence the negated sign and the right-hand position in (DI).
Specialised to (PP), we obtain the following result. The condition (CI) is often more practical to verify than (CI∼) thanks to the additional structure introduced by . Indeed, in many of our examples, we can eliminate through monotonicity. To derive gap and function value estimates in section 5, we will however need (CI∼).
Corollary 2.2.
On a Hilbert space , let . Also let , and for . Suppose (PP) is solvable for with as in (4). Let . If for all , is self-adjoint, and for some and the condition
[TABLE]
holds, then (CI∼), (QF), and (DI) hold for .
Proof 2.3 (Proof of theorem 2.1).
Inserting (PP∼) into (CI∼), we obtain
[TABLE]
We recall for general self-adjoint the three-point formula
[TABLE]
Using this with , we rewrite (5) as the quantitative -Fejér monotonicity (QF). Summing this over , we obtain the descent inequality (DI).
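The three-point formula used in this proof can be checked numerically. The sketch below assumes the standard form of the identity for a self-adjoint operator A, namely ⟨x⁺ − x, x⁺ − x̂⟩_A = ½‖x⁺ − x‖²_A − ½‖x − x̂‖²_A + ½‖x⁺ − x̂‖²_A, together with the convention ⟨x, z⟩_A := ⟨Ax, z⟩. The function names and the random test data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def inner_A(A, x, z):
    """Inner-product-like notation <x, z>_A := <A x, z> (assumed convention)."""
    return float((A @ x) @ z)

def three_point_gap(A, x_next, x_cur, x_hat):
    """Left-hand side minus right-hand side of the three-point identity."""
    lhs = inner_A(A, x_next - x_cur, x_next - x_hat)
    rhs = (0.5 * inner_A(A, x_next - x_cur, x_next - x_cur)
           - 0.5 * inner_A(A, x_cur - x_hat, x_cur - x_hat)
           + 0.5 * inner_A(A, x_next - x_hat, x_next - x_hat))
    return lhs - rhs

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A_sym = B + B.T  # self-adjoint, but not necessarily positive semidefinite
x_next, x_cur, x_hat = rng.standard_normal((3, 4))

# The gap vanishes (up to rounding) because A_sym is self-adjoint.
gap_sym = three_point_gap(A_sym, x_next, x_cur, x_hat)
```

Note that the identity only needs self-adjointness, not positivity, which is consistent with the theorem requiring the tested preconditioner to be self-adjoint.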
Remark 2.4 (Bregman divergences and Banach spaces).
Let be a Banach space and . Then for and one can define the asymmetric Bregman divergence (or distance)
[TABLE]
where denotes the dual product. This is non-negative, but not a true distance, as it can happen that for . However with and , we deduce [9]
[TABLE]
Therefore, the Bregman distance satisfies an analogue of the standard three-point identity (6). It allows generalising our techniques to Banach spaces and the algorithm
[TABLE]
where for each now has been replaced by . The convergence will, however, be with respect to . Indeed, if is, in fact, a Hilbert space and we take , then .
Proximal point methods based on general Bregman divergences in place of the squared norm are studied in, e.g., [6, 9, 15, 16].
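As a hedged illustration of this remark, the following sketch checks one common form of the Bregman three-point identity, ⟨∇J(y) − ∇J(z), x − y⟩ = D_J(x, z) − D_J(x, y) − D_J(y, z), with D_J(x, z) := J(x) − J(z) − ⟨∇J(z), x − z⟩. The exact form stated in the paper is not reproduced here; the choice of the negative entropy as J and all names are assumptions made for the example.

```python
import numpy as np

def neg_entropy(x):
    """J(x) = sum_j (x_j log x_j - x_j), convex on the positive orthant."""
    return float(np.sum(x * np.log(x) - x))

def grad_neg_entropy(x):
    return np.log(x)

def bregman(J, grad_J, x, z):
    """Asymmetric Bregman divergence D_J(x, z)."""
    return J(x) - J(z) - float(grad_J(z) @ (x - z))

rng = np.random.default_rng(1)
x, y, z = rng.uniform(0.1, 2.0, size=(3, 5))

D = lambda a, c: bregman(neg_entropy, grad_neg_entropy, a, c)
lhs = float((grad_neg_entropy(y) - grad_neg_entropy(z)) @ (x - y))
rhs = D(x, z) - D(x, y) - D(y, z)

# With J = (1/2)||.||^2 the divergence reduces to half the squared distance.
half_sq = lambda v: 0.5 * float(v @ v)
D_quadratic = bregman(half_sq, lambda v: v, x, z)
```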
The next two results demonstrate how the estimate of theorem 2.1 can be used to prove convergence with or without rates.
Proposition 2.5 (Convergence with a rate).
Suppose the descent inequality (DI) holds with , and that for all . Then at the rate .
Proof 2.6.
Immediate from (DI).
We can also obtain superlinear convergence from (QF), a form of quantitative Fejér monotonicity when .
Proposition 2.7 (Superlinear convergence).
Suppose (QF) holds with , and that for some for all . If , then superlinearly.
Proof 2.8.
Immediate from (QF).
The scalar has its index off-by-one intentionally; the reason will become more apparent once we get to primal–dual methods. It is also possible to obtain superlinear convergences of different orders from (DI) or (QF). However, the conventional notions cannot be characterised without involving the iterates. Indeed, assuming , eqrefeq:convergence-result-main-h characterises superlinear convergence of order . It would also be possible to introduce new notions of the order of superlinear convergence, not involving the iterates and more in spirit with the testing approach, such as , if such a notion would turn out to be useful.
To obtain weak convergence, we do not need to grow, but we need some additional technical assumptions. First of all, some of the leeway that the fundamental condition (CI∼) included for the forward steps is now required to obtain convergence. Secondly, we need some weak-to-strong outer semicontinuity from , which we write more abstractly in terms of . It would be possible to improve this requirement based on the Brezis–Crandall–Pazy property [3].
Proposition 2.9 (Weak convergence).
Suppose for all that is self-adjoint, and that the iterates of the preconditioned proximal point method (PP∼) satisfy the fundamental condition (CI∼) with for all and some . Suppose either that has a bounded inverse, or that is bounded on bounded sets. If is strong-to-strong outer semicontinuous and
[TABLE]
then weakly in for some .
For the proof, we use the next lemma. Its earliest version is contained in the proof of [22, Theorem 1], but can be found more explicitly stated as [5, Lemma 6].
Lemma 2.10.
On a Hilbert space , let be closed and convex, and . If the following conditions hold, then weakly in for some :
- (i) is non-increasing for all (Fejér monotonicity).
- (ii) All weak limit points of belong to .
Proof 2.11 (Proof of proposition 2.9).
To use lemma 2.10, we need a closed and convex solution set. However, may generally be non-convex and not closed. Since , using the strong-to-strong outer semicontinuity of , it is easy to see that (CI∼) holds for all . Consequently the descent inequality (DI) holds for all .
We apply theorem 2.1 on any . From the quantitative -Fejér monotonicity (QF), since and , we have
[TABLE]
This implies the condition lemma 2.10(i) for the sequence .
Let then as in (7). From (8), we deduce that as . By (PP∼) and (7), any weak limit point of the sequence then satisfies . Let then be any weak limit point of . We need to show that . If has a bounded inverse, then this is clear as the weak convergence of implies the weak convergence of . Otherwise, when is bounded on bounded sets, since , we see that is bounded. Hence a subsequence converges to some . But this implies that as required.
By lemma 2.10 now . This implies weakly for some .
2.3 Examples of first-order methods
We now look at several concrete examples.
Example 2.12 (The proximal point method).
For all , take , , and for some . Then (PP) is the standard proximal point method . If the operator is maximal monotone, converges weakly to some for any starting point .
Proof 2.13 (Proof of convergence).
We take for some . Then the fundamental condition (CI) reads
[TABLE]
As long as , the monotonicity of clearly proves (9), thus (CI), with . Using the maximal monotonicity, Minty’s theorem guarantees the solvability of (PP). Thus the conditions of corollary 2.2 are satisfied. Maximal monotonicity also guarantees that is weak-to-strong outer semicontinuous; see lemma A.1. This establishes the iteration outer semicontinuity condition (7). Taking for constant , so that , it remains to refer to proposition 2.9.
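To make the example concrete, here is a minimal sketch of the standard proximal point iteration x^{i+1} = (I + τH)^{-1}(x^i) for an affine monotone operator H(x) = Ax − b. The skew-symmetric matrix A, the data b, the step length τ = 1 and the function name are all assumptions chosen for illustration; the operator is monotone but not a gradient, so it is a case where the resolvent step is the natural tool.

```python
import numpy as np

def proximal_point(A, b, x0, tau, iters):
    """Iterate x_{i+1} = (I + tau*H)^{-1}(x_i) for H(x) = A x - b.

    The inclusion 0 = tau*(A x_{i+1} - b) + (x_{i+1} - x_i) is a linear
    system, so each resolvent step is a single solve.
    """
    x = np.asarray(x0, dtype=float)
    identity = np.eye(len(x))
    history = [x.copy()]
    for _ in range(iters):
        x = np.linalg.solve(identity + tau * A, x + tau * b)
        history.append(x.copy())
    return np.array(history)

# Skew-symmetric A: <A x, x> = 0, so H is monotone (and maximal, being continuous).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
b = np.array([1.0, 2.0])
x_hat = np.linalg.solve(A, b)  # the unique root of H

hist = proximal_point(A, b, x0=[5.0, 5.0], tau=1.0, iters=60)
dists = np.linalg.norm(hist - x_hat, axis=1)  # distances to the root
```

In this run the distances to the root are non-increasing, which is the Fejér monotonicity that the convergence proof relies on.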
Suppose is strongly monotone, that is, for some holds
[TABLE]
Then from (9), we immediately also derive convergence rates as follows. Letting will obviously give the fastest convergence; however, the step length rule will be useful later on with splitting methods, combining the simple proximal step with other algorithmic elements.
Example 2.14 (Acceleration and linear convergence of the proximal point method).
Suppose is strongly monotone for some factor . If we choose , then the proximal point method satisfies at the rate . If we keep constant, we get linear convergence of the iterates. If , we get superlinear convergence.
Proof 2.15 (Proof of convergence).
Clearly (9) holds with provided we update
[TABLE]
Then theorem 2.1 gives the descent inequality (DI), which now reads
[TABLE]
If we take , this reads . Since is of the order [7, 31], we get the claimed convergence from (2.15). If, on the other hand, we keep fixed, then clearly . Since this is exponential when , we get linear convergence from (2.15). Finally, if , we see from (2.15) that . We now obtain superlinear convergence from corollary 2.2 and proposition 2.7.
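The difference between a constant and a growing step length can be observed on a small strongly monotone example. The sketch below uses H(x) = Ax with A = γI + K for a skew-symmetric K, whose unique root is 0; the resolvent of τH is then a contraction with factor at most 1/(1 + γτ), a standard fact. The matrix, the step length sequences and the function name are illustrative assumptions, and the step rule of the accelerated variant in the example is not reproduced.

```python
import numpy as np

def prox_point_distances(A, x0, taus):
    """Distances to the root 0 of H(x) = A x along x_{i+1} = (I + tau_i*A)^{-1} x_i."""
    x = np.asarray(x0, dtype=float)
    identity = np.eye(len(x))
    dists = [np.linalg.norm(x)]
    for tau in taus:
        x = np.linalg.solve(identity + tau * A, x)
        dists.append(np.linalg.norm(x))
    return np.array(dists)

gamma = 1.0  # strong monotonicity factor (assumed value)
A = gamma * np.eye(2) + np.array([[0.0, 1.0], [-1.0, 0.0]])
x0 = [3.0, -4.0]

d_const = prox_point_distances(A, x0, taus=[1.0] * 8)
d_incr = prox_point_distances(A, x0, taus=[2.0 ** i for i in range(8)])

# Per-iteration contraction ratios: constant for fixed tau (linear rate),
# decreasing towards zero for tau_i -> infinity (superlinear rate).
ratios_const = d_const[1:] / d_const[:-1]
ratios_incr = d_incr[1:] / d_incr[:-1]
```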
The next lemma starts our analysis of gradient descent and forward–backward splitting. It relies on the three-point smoothness inequalities of appendix B, which the reader may want to study at this point.
Lemma 2.16.
Let for such that is -Lipschitz. For all , take and with as well as for some . Then the fundamental condition (CI) holds if
- (i) is constant, , and . In this case the iteration outer semicontinuity condition (7) moreover holds provided .
If is strongly convex with factor , alternatively:
- (ii) , , or , and .
Proof 2.17.
We expand the fundamental condition (CI) as
[TABLE]
By the monotonicity of , this holds if
[TABLE]
(i) The three-point inequality (76) in lemma B.1 states
[TABLE]
This clearly reduces (10) to
[TABLE]
which holds under the conditions of (i). The satisfaction of (7) is immediate from the weak-to-strong outer semicontinuity of (lemma A.1), the Lipschitz continuity of , and the bounds on .
(ii) The three-point smoothness inequality (79) in lemma B.3 gives
[TABLE]
Inserting this into (10), we see it to hold with if
[TABLE]
Clearly our two alternative choices of are non-increasing. Therefore, (11) follows from the initialisation condition and the update rule in (ii).
Remark 2.18.
It is also possible to exploit the strong convexity of instead of for acceleration.
Example 2.19 (Gradient descent).
Let for with -Lipschitz. Taking and constant in lemma 2.16, (PP) reads
[TABLE]
This is the gradient descent method. Direct application of lemma 2.16(i) with and together with corollary 2.2 and proposition 2.9 now verifies the well-known weak convergence of the method to a root of when .
Observe that for
[TABLE]
Each step of (PP) therefore minimises the surrogate objective [11]
[TABLE]
The function on one hand penalises long steps, and on the other hand allows longer steps when the local linearisation error is large. In this example, is, in fact, a Bregman divergence.
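For readers who prefer code, the gradient descent iteration x^{i+1} = x^i − τ∇J(x^i) can be sketched as follows. The quadratic objective, the step length τ = 1.5/L (so that τL < 2, the classical condition assumed here), and the function name are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, x0, tau, iters):
    """Plain gradient descent with a constant step length tau."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(iters):
        x = x - tau * grad(x)
        history.append(x.copy())
    return np.array(history)

# J(x) = (1/2) x^T Q x - b^T x with Q positive definite, so grad J is L-Lipschitz
# with L the largest eigenvalue of Q.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_J = lambda x: Q @ x - b

L = np.linalg.eigvalsh(Q).max()
tau = 1.5 / L

hist = gradient_descent(grad_J, [4.0, -3.0], tau, iters=100)
x_hat = np.linalg.solve(Q, b)  # the root of grad J
dists = np.linalg.norm(hist - x_hat, axis=1)
```

Even with a step longer than 1/L, the distances to the minimiser do not increase in this run, in line with the Fejér-type estimates of section 2.2.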
Under strong convexity, we again get rates via lemma 2.16(ii). Minding our remarks before example 2.14, we only state the case . Due to the upper bound , we cannot get superlinear convergence as in example 2.14.
Example 2.20 (Acceleration and linear convergence of gradient descent).
Continuing from example 2.19, if is strongly convex with factor and is -Lipschitz, and we keep fixed, we get linear convergence.
Now comes the full power of lemma 2.16: we can easily bolt on a proximal step to gradient descent.
Example 2.21 (Forward–backward splitting).
Let for with Lipschitz. Taking , , and as in lemma 2.16, the preconditioned proximal point method (PP) becomes
[TABLE]
This is the forward–backward splitting method
[TABLE]
By lemma 2.16, convergence and acceleration work exactly as for gradient descent in examples 2.19 and 2.20.
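A minimal sketch of forward–backward splitting, assuming the familiar form x^{i+1} = prox_{τG}(x^i − τ∇J(x^i)), is given below for G = λ‖·‖₁ and J(x) = ½‖Ax − b‖². The problem data, the step length τ = 1/L and the helper names are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(A, b, lam, tau, iters):
    """Forward (gradient) step on J, backward (proximal) step on G."""
    x = np.zeros(A.shape[1])
    objective = []
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - tau * grad, tau * lam)
        objective.append(0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)))
    return x, np.array(objective)

A = np.array([[1.0, 0.5], [0.0, 1.0]])
b = np.array([2.0, 0.1])
lam = 0.5
L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz factor of grad J
tau = 1.0 / L

x, objective = forward_backward(A, b, lam, tau, iters=300)

# A point is optimal exactly when it is a fixed point of the iteration.
residual = np.linalg.norm(
    x - soft_threshold(x - tau * (A.T @ (A @ x - b)), tau * lam))
```

Compared with the gradient descent sketch, only the soft-thresholding line is new, which mirrors how the proximal step comes for free in the analysis.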
We can also do fully non-smooth splitting methods by a lifting approach:
Example 2.22 (Douglas–Rachford splitting).
Let be maximal monotone operators. Consider the problem of finding with . For , let
[TABLE]
Then if and only if , where . The preconditioned proximal point method (PP∼) becomes the Douglas–Rachford splitting [12]
[TABLE]
We work with (PP∼) since in (PP), would have to be set-valued. If and are maximal monotone, the variables converge weakly to .
Proof 2.23 (Proof of convergence).
Write and . Observe that
[TABLE]
Using the monotonicity of and , with , we have
[TABLE]
Thus the fundamental condition (CI∼) holds with . Using (13) and the weak-to-strong outer semicontinuity of and (see lemma A.1), we easily verify (7). Since is non-invertible, we also have to verify that is bounded on bounded sets. This is to say that (14) bounds in terms of . This is an easy consequence of the Lipschitz-continuity of the resolvent of maximal monotone operators [1, Corollary 23.10]. Weak convergence now follows from theorem 2.1 and proposition 2.9.
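The following sketch runs a textbook form of Douglas–Rachford splitting, x = J_{λB}(z), y = J_{λA}(2x − z), z ← z + y − x, where J denotes the resolvent. It is not claimed to match the lifted formulation of the example term by term. The operators are assumptions chosen so that the answer is known in closed form: A = ∂‖·‖₁ and B the subdifferential of ½‖· − b‖² plus the indicator of the non-negative orthant, whose sum has the root max(b − 1, 0).

```python
import numpy as np

def soft_threshold(v, t):
    """Resolvent of t * (subdifferential of the 1-norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def douglas_rachford(res_B, res_A, z0, iters):
    """Textbook Douglas-Rachford iteration on the auxiliary variable z."""
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        x = res_B(z)
        y = res_A(2.0 * x - z)
        z = z + y - x
    return res_B(z)

b = np.array([3.0, -2.0, 0.5])
lam = 1.0

res_A = lambda v: soft_threshold(v, lam)
# Resolvent of lam*B: minimise (1/2)||x - v||^2 + (lam/2)||x - b||^2 over x >= 0.
# The problem is separable, so clipping the unconstrained minimiser is exact.
res_B = lambda v: np.maximum((v + lam * b) / (1.0 + lam), 0.0)

x = douglas_rachford(res_B, res_A, np.zeros(3), iters=60)
expected = np.maximum(b - 1.0, 0.0)
```

Both operators are handled only through their resolvents, so neither needs to be smooth.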
2.4 Examples of second-order methods
We now look at how our techniques are applicable to Newton’s method. Through the three-point inequalities of lemma B.5 for functions, the analysis turns out to be very close to that of gradient descent. Our analysis is not as short as the conventional analysis of Newton’s method, but has its advantages. Indeed, the convergence of proximal Newton’s method will be an automatic corollary of our approach, exactly as the convergence of forward–backward splitting was a corollary of the convergence of gradient descent.
Example 2.24 (Newton’s method).
Suppose for . Take
[TABLE]
Then the preconditioned proximal point method (PP) reads
[TABLE]
This is Newton’s method. By lemma 2.26 (below) and proposition 2.5, we obtain local linear convergence if . By lemma 2.28 (below), this convergence is, further, superlinear (quadratic if is locally Lipschitz near ).
Observe that now is the gradient of
[TABLE]
In the surrogate objective (12), this allows longer steps when the second-order Taylor expansion under-approximates, and forces shorter steps when it over-approximates.
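A short sketch of the undamped Newton iteration x^{i+1} = x^i − [∇²J(x^i)]^{-1}∇J(x^i) follows. The smooth, strongly convex test function J(x) = Σ_j cosh(x_j) + ½xᵀQx − bᵀx, the starting point and the function name are assumptions for illustration; the starting point is simply close enough for the local theory to apply in this instance.

```python
import numpy as np

def newton(grad, hess, x0, iters):
    """Undamped Newton's method; records the gradient norm at every iterate."""
    x = np.asarray(x0, dtype=float)
    grad_norms = [np.linalg.norm(grad(x))]
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
        grad_norms.append(np.linalg.norm(grad(x)))
    return x, np.array(grad_norms)

Q = np.array([[1.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -0.5])

grad_J = lambda x: np.sinh(x) + Q @ x - b    # gradient of J
hess_J = lambda x: np.diag(np.cosh(x)) + Q   # Hessian of J, positive definite

x, grad_norms = newton(grad_J, hess_J, np.zeros(2), iters=8)
```

Printing `grad_norms` shows the rapid decay typical of a locally Lipschitz Hessian: only a handful of iterations are needed before rounding error dominates.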
Again, we can easily bolt on a proximal step:
Example 2.25 (Proximal Newton’s method).
Let for , and . Taking , , and as in example 2.24, the preconditioned proximal point method (PP) becomes
[TABLE]
This is the proximal Newton’s method [see, e.g., lee2014proximal]
[TABLE]
where solves Convergence and acceleration work exactly as for Newton’s method in example 2.24, based on the same lemmas that we state next.
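As a hedged illustration, the sketch below implements a proximal Newton step in the special case of a diagonal Hessian, where the proximal map in the Hessian metric has a closed form. The assumptions are: J(x) = Σ_j (exp(x_j) − b_j x_j), G = λ‖·‖₁, and the scaled proximal map of G in the metric diag(h) being soft-thresholding with the coordinate-wise threshold λ/h_j. For a general Hessian the scaled proximal step requires an inner solver, which is not shown.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_newton_diag(b, lam, x0, iters):
    """Proximal Newton for J(x) = sum(exp(x) - b*x) and G = lam*||x||_1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        h = np.exp(x)                   # diagonal of the Hessian of J
        v = x - (np.exp(x) - b) / h     # Newton step on the smooth part
        x = soft_threshold(v, lam / h)  # prox of G in the metric diag(h)
    return x

b = np.array([4.0, 1.5, -0.5])
x = proximal_newton_diag(b, lam=1.0, x0=np.zeros(3), iters=20)

# Coordinate-wise optimality 0 in exp(x) - b + lam*sign(x) gives:
#   b = 4    -> exp(x) = 3   (x > 0)
#   b = 1.5  -> x = 0        (since |1 - 1.5| <= 1)
#   b = -0.5 -> exp(x) = 0.5 (x < 0)
expected = np.array([np.log(3.0), 0.0, np.log(0.5)])
```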
Lemma 2.26.
Let for and . Take
[TABLE]
For an initial iterate , let be defined through (PP). If , there exists such that if , then the fundamental condition (CI) holds with and for all . Moreover, we can take such that for some . In particular, at the linear rate .
Proof 2.27.
We set and for some . Then imply that is positive and self-adjoint for close to .
By assumption, for some , we have
[TABLE]
For a fixed , let us assume that . Since is monotone, similarly to the proof of lemma 2.16, the fundamental condition (CI) holds if
[TABLE]
where we use (81) in lemma B.5 with to estimate
[TABLE]
for
[TABLE]
Consequently, (15) holds with if we take such that
[TABLE]
This can always be satisfied for some for small enough because then implies .
Now corollary 2.2 shows the quantitative -Fejér monotonicity (QF), which with (17) implies
[TABLE]
If , this implies by (16) that . Consequently, if is small enough, that is, if is small enough due to the continuity of , we obtain so that also . In particular, our assumption guarantees . Consequently also for all . We can now take in (16), so that (17) gives
[TABLE]
Since is increasing within , and , we see that . Taking we now get . This implies the convergence rate claim.
We can also show superlinear convergence; however, this is somewhat more elaborate as we need to make use of .
Lemma 2.28.
With everything as in lemma 2.26, the convergence rate claim can be improved to superlinear. If is locally Lipschitz near , for example, if , then this convergence is quadratic (superlinear convergence of order ).
Proof 2.29.
We continue with the initial setup of the proof of lemma 2.26 until (15). Now, for given by (16), (86) in lemma B.7 gives
[TABLE]
With this, (15), hence the fundamental condition (CI), holds if
[TABLE]
This holds for
[TABLE]
provided
[TABLE]
This can always be satisfied for some if is small enough because then due to .
By corollary 2.2 we now obtain the quantitative -Fejér monotonicity (QF), which with (19) gives
[TABLE]
Due to (16), we have . Hence, also using (20), (21) implies
[TABLE]
If , this and imply , hence our assumption implies . Consequently also for all . If now , which is guaranteed by small enough and the continuity of , then (22) implies . Consequently .
Let . From (22), we get superlinear convergence if , which follows from . Superlinear convergence of order occurs if for some . From (22), we see this to hold if . If is Lipschitz near , then for some constant . Therefore we get superlinear convergence of order .
2.5 Convergence of function values
We now study how our framework can be used to derive the convergence, or ergodic convergence, of function values. We concentrate on algorithms that are variants of forward–backward splitting, including gradient descent and the proximal point method, although other algorithms can be handled similarly. We again use the three-point inequalities of appendix B.
Lemma 2.30.
Let for with -Lipschitz. For all , take and with as well as for some . Then the fundamental condition (CI∼) holds if
- (i) is constant, , and
[TABLE]
If is strongly convex with factor , alternatively:
- (ii) , , or , and
[TABLE]
Proof 2.31.
We follow the proof of lemma 2.16, where we start by expanding (CI∼) (instead of (CI)) as
[TABLE]
Note that we have not inserted here. Now, as the next step, we do not eliminate through monotonicity of , but use the definition of the convex subdifferential. Then we use the value three-point inequality (77) in place of the non-value inequality (76) and the value inequality (80) in place of the non-value inequality (79). From here the claims follow as in the proof of lemma 2.16. Note the factor-of-two differences between these formulas, which are reflected in the step length rules: instead of ; instead of ; and instead of .
We now obtain the convergence to zero of a weighted function value difference over the history of iterates, and as a consequence, for an ergodic sequence formed from the iterates:
Corollary 2.32.
Suppose the conditions of lemma 2.30 hold. Then
[TABLE]
In consequence, if we define the ergodic sequence
[TABLE]
then
[TABLE]
In particular, if lemma 2.30(i) holds, then at the rate . If, instead, lemma 2.30(ii) holds, then this convergence is linear.
Proof 2.33.
The basic inequality (23) is a consequence of the fundamental theorem 2.1. The ergodic estimate (24) follows from there by Jensen’s inequality. The first convergence rate estimate follows from (24) and the fact that under lemma 2.30(i) is a constant, so . Under lemma 2.30(ii) we recall from example 2.14 that the rule for shows that grows exponentially with constant. Then also is exponential, so we obtain linear rates.
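The Jensen step behind the ergodic estimate can be illustrated numerically. The sketch below forms a weighted ergodic average of gradient descent iterates and compares the function value at the average with the weighted average of function values. The constant weights, the quadratic objective with minimum value 0, and the classical bound Σ_{i=1}^N (J(x^i) − J(x̂)) ≤ ‖x⁰ − x̂‖²/(2τ) for τ ≤ 1/L are assumptions made for this example; the exact weights and constants of the corollary are not reproduced.

```python
import numpy as np

Q = np.diag([1.0, 0.01])
J = lambda x: 0.5 * float(x @ Q @ x)  # convex, minimum value 0 at x = 0
L = 1.0                               # Lipschitz factor of grad J
tau = 1.0 / L
x0 = np.array([2.0, 5.0])

# Gradient descent iterates x^1, ..., x^N.
x, iterates = x0.copy(), []
for _ in range(50):
    x = x - tau * (Q @ x)
    iterates.append(x.copy())
iterates = np.array(iterates)

N = len(iterates)
weights = np.full(N, tau)  # constant weights (illustrative choice)

x_ergodic = weights @ iterates / weights.sum()
mean_gap = weights @ np.array([J(v) for v in iterates]) / weights.sum()
ergodic_gap = J(x_ergodic)                    # Jensen: at most mean_gap
bound = float(x0 @ x0) / (2.0 * tau * N)      # O(1/N) bound on mean_gap
```

Since the bound decays like 1/N, the ergodic function value gap inherits the same rate through Jensen's inequality.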
The following three examples follow from corollary 2.32. For the proximal point method, additionally, since we can still let due to , we can also get superlinear convergence. Also, in the case of the proximal point method, we use the strong convexity of , which is for simplicity not considered in lemma 2.30, but can easily be added.
Example 2.34 (Proximal point method ergodic function value).
For the proximal point method of examples 2.12 and 2.14, applied to with , we have at the rate when and no strong convexity is present. If is strongly convex, and , the convergence is linear; if , the convergence is superlinear.
Example 2.35 (Gradient descent ergodic function value).
For the gradient descent method of examples 2.19 and 2.20, applied to with -Lipschitz gradient, if with , we have at the rate . If is strongly convex, , and we update , then this convergence is .
Example 2.36 (Forward–backward splitting ergodic function value).
For the forward–backward splitting of example 2.21, at exactly the same rates and conditions as for gradient descent in example 2.35.
For Newton’s method, we can use similar arguments: we can replace (81) by (83) in lemma 2.26, and (86) by (87) in lemma 2.28. This can be done because the preceding non-value lemmas show that . In lemma 2.26 the effect of the change is to replace by everywhere, and in lemma 2.28, to replace by . With these changes, the main arguments go through, although the exact value of and the upper bounds for in the final paragraphs are changed.
Example 2.37 (Newton’s method function value).
For Newton’s method in example 2.24, we have and for some . We have (super)linearly.
We can also obtain non-ergodic convergence for monotone methods. We demonstrate the idea only for the unaccelerated () proximal point method, but unaccelerated forward–backward splitting and gradient descent can be handled analogously.
Example 2.38 (Proximal point method function value).
For the proximal point method of examples 2.12 and 2.14, applied to with , we have at the rate when and no strong convexity is present. If is strongly convex, and , the convergence is linear; if , the convergence is superlinear.
Proof 2.39 (Proof of convergence).
From (PP), that is , we have
[TABLE]
That is, the proximal point method is monotone: Now we use corollary 2.32. Using (25) to unroll the function value sum in (23) gives . The rates follow as in corollary 2.32 and example 2.34.
2.6 Connections to fixed point theorems
We demonstrate connections of our approach to established fixed point theorems. The following result in its modern form, stated for firmly non-expansive or more generally α-averaged maps, can be first found in [5]. Similar results for what are now known as Krasnoselski–Mann iterations, closely related to α-averaged maps, were, however, stated earlier for more limited settings in [20, 28, 23, 17, 22].
Example 2.40 (Browder’s fixed point theorem).
Let be α-averaged, that is for some non-expansive and . Suppose there exists a fixed point . Let . Then for some fixed point of .
Proof 2.41 (Proof).
Let us set , as well as and . We have
[TABLE]
where the last step follows by observing from the previous steps that (PP) says . The expression (26) easily gives the iteration outer semicontinuity condition (7), and reduces the fundamental condition (CI∼) to
[TABLE]
Using and , and taking , (CI∼) therefore holds for
[TABLE]
provided
[TABLE]
Using the α-averaged property and , we expand
[TABLE]
We take . Then . Cauchy’s inequality and non-expansivity of thus give
[TABLE]
This verifies (CI∼). From (27), . We now obtain the claimed convergence from corollaries 2.2 and 2.9.
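The role of averaging can be illustrated with a small numerical experiment. The sketch below is not from the paper: it assumes the plane with the quarter-turn rotation as the non-expansive map, whose only fixed point is the origin, and the averaging parameter 0.5. Iterating the rotation alone never converges, whereas the averaged map does. All function names are hypothetical.

```python
import math


def rotate_quarter_turn(x):
    """A non-expansive map of the plane whose only fixed point is the origin."""
    return (-x[1], x[0])


def averaged_step(x, alpha, nonexpansive_map):
    """One application of the alpha-averaged map (1 - alpha) * I + alpha * J."""
    jx = nonexpansive_map(x)
    return tuple((1.0 - alpha) * xi + alpha * ji for xi, ji in zip(x, jx))


def iterate(step, x0, iterations):
    """Apply `step` repeatedly, starting from x0."""
    x = x0
    for _ in range(iterations):
        x = step(x)
    return x


x0 = (1.0, 0.0)
plain = iterate(rotate_quarter_turn, x0, 101)
averaged = iterate(lambda x: averaged_step(x, 0.5, rotate_quarter_turn), x0, 101)
plain_norm = math.hypot(*plain)        # stays on the unit circle
averaged_norm = math.hypot(*averaged)  # shrinks towards the fixed point
```

Here the averaged map contracts the norm by the factor sqrt(0.5) per step, so convergence is even linear; in general the theorem only guarantees convergence to some fixed point.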
3 Stochastic methods
We now exploit the fact that the step length can be a non-invertible operator. We do this in the context of stochastic block-coordinate methods. To this end, we introduce the following probabilistic notation:
Definition 3.1**.**
We write if is an -valued random variable: for some (in the present work fixed) probability space , where is a -algebra on . We denote by the expectation with respect to a probability measure on . As is common, we abuse notation and write for the unknown random realisation . We also write for the conditional expectation with respect to random variable realisations up to and including iteration .
We refer to [29] for more details on measure-theoretic probability.
The following is an immediate corollary of theorem 2.1, obtained by taking the expectation of both (CI∼) and (DI). Requiring these inequalities to hold only in expectation may produce more lenient step length and other conditions. In this section, we demonstrate the flexibility of our techniques on stochastic methods with a few basic examples. We refer to the review article [33] for an introduction and further references to stochastic coordinate descent, and to our companion paper [30] for primal–dual methods based on the work here.
Corollary 3.2**.**
On a Hilbert space and a probability space , let , and for . Suppose (PP∼) is solvable for . If for all and almost all random events , is self-adjoint, and for some and the expected fundamental condition
[TABLE]
holds, then so does the expected descent inequality
[TABLE]
In block-coordinate descent methods, we write for some mutually orthogonal projection operators, and on each step of the method, only update some of the “blocks” . We assume functions with respect to which we take a proximal step to be separable with respect to these projections or subspaces: . To perform forward steps, we introduce a blockwise version of standard smoothness conditions of convex functions. The idea is that the factor for the subset of blocks can be better than the global smoothness or Lipschitz factor .
Definition 3.3**.**
We write if are projection operators in with , and for . For random and an iteration , we then set
[TABLE]
For smooth , we let be the -relative smoothness factor, satisfying
[TABLE]
and consequently (see lemma C.1)
[TABLE]
Example 3.4** (Stochastic block-coordinate descent).**
Let for with Lipschitz gradient. Also let . For each , take random , and set
[TABLE]
Then (PP) says that we take a forward step on the random subspace :
[TABLE]
If the step lengths are deterministic and satisfy and for all for some , we have at the rate for the ergodic sequence
[TABLE]
Through the use of the “local” smoothness factors , the method may be able to take larger steps than those allowed by the global factor in example 2.19.
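To make the benefit of local smoothness factors concrete, here is a minimal Python sketch of randomised coordinate descent. It is an illustration under stated assumptions, not the method of example 3.4 verbatim: the objective is the quadratic 0.5 * x^T A x - b^T x, each block is a single coordinate sampled uniformly, and the step length 1 / A[i][i] is the classical coordinate-wise choice rather than the step length rule of the example. For a symmetric matrix the largest eigenvalue is at least as large as every diagonal entry, so these steps are never shorter than the global step.

```python
import random

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]   # symmetric, strictly diagonally dominant (assumed data)
b = [1.0, 2.0, 3.0]


def objective(x):
    """J(x) = 0.5 * x^T A x - b^T x."""
    quad = sum(x[i] * A[i][j] * x[j] for i in range(3) for j in range(3))
    return 0.5 * quad - sum(bi * xi for bi, xi in zip(b, x))


def partial_gradient(x, i):
    """i-th component of the gradient A x - b."""
    return sum(A[i][j] * x[j] for j in range(3)) - b[i]


def stochastic_coordinate_descent(x0, iterations, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    values = [objective(x)]
    for _ in range(iterations):
        i = rng.randrange(3)     # sample one block uniformly at random
        step = 1.0 / A[i][i]     # local smoothness factor of block i is A[i][i]
        x[i] -= step * partial_gradient(x, i)
        values.append(objective(x))
    return x, values


x_final, values = stochastic_coordinate_descent([0.0, 0.0, 0.0], 2000)
gradient_norm = max(abs(partial_gradient(x_final, i)) for i in range(3))
```

On a quadratic, each such step minimises the objective exactly along the sampled coordinate, so the recorded values cannot increase.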
The smoothness of limits the usefulness of example 3.4. However, it forms the basis for popular stochastic forward–backward splitting methods, of which we now provide an example.
Example 3.5** (Stochastic forward–backward splitting).**
Let . Suppose for , where has Lipschitz gradient, and is separable: . Take , , and as in example 3.4. Then (PP) describes the stochastic forward–backward splitting method
[TABLE]
With , this can be written
[TABLE]
The method has exactly the same convergence properties as the stochastic gradient descent of example 3.4.
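A minimal sketch of a stochastic forward–backward iteration follows. It assumes a quadratic smooth part, the separable non-smooth part LAM * |x|_1, single-coordinate blocks sampled uniformly, and the coordinate-wise step 1 / A[i][i]; these choices and all names are illustrative assumptions rather than the setup of example 3.5.

```python
import random

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]   # smooth part J(x) = 0.5 * x^T A x - b^T x (assumed data)
b = [0.1, 2.0, 3.0]
LAM = 0.5               # weight of the separable term G(x) = LAM * |x|_1


def soft_threshold(v, t):
    """Proximal map of t * |.| at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def partial_gradient(x, i):
    """i-th component of the gradient A x - b of the smooth part."""
    return sum(A[i][j] * x[j] for j in range(len(x))) - b[i]


def stochastic_forward_backward(x0, iterations, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        i = rng.randrange(len(x))                      # random block
        tau = 1.0 / A[i][i]                            # block-wise step length
        forward = x[i] - tau * partial_gradient(x, i)  # forward step on J
        x[i] = soft_threshold(forward, tau * LAM)      # backward step on G
    return x


def optimality_residual(x):
    """Fixed-point residual of the full proximal gradient map (zero at minimisers)."""
    return max(abs(x[i] - soft_threshold(x[i] - partial_gradient(x, i), LAM))
               for i in range(len(x)))


x_final = stochastic_forward_backward([0.0, 0.0, 0.0], 3000)
residual = optimality_residual(x_final)
```

The residual measures how far `x_final` is from satisfying the first-order optimality condition of the composite problem, independently of the random order in which blocks were visited.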
Remark 3.6**.**
Following example 2.20, if or is strongly convex, it is also possible to construct accelerated versions of both examples 3.4 and 3.5. Then we can obtain from (D) convergence rates for .
Proof 3.7** (Proof of convergence of stochastic gradient descent and forward–backward splitting).**
We take as the testing operator . Then, since , (C) expands as
[TABLE]
From the decomposition and the convexity of , we observe that
[TABLE]
Since is deterministic and , such that for all , by Jensen’s inequality, therefore,
[TABLE]
If we show the ergodic three-point smoothness condition
[TABLE]
then using our assumption and (33), we verify (32), hence (C), for some such that
[TABLE]
Since by our assumption , corollary 3.2 now shows the convergence of function values for the ergodic sequence .
To prove (34), from (28) with and we have
[TABLE]
By convexity, we also have
[TABLE]
Summing (35) and (36), multiplying by , and taking the expectation,
[TABLE]
Since , Jensen’s inequality shows
[TABLE]
Therefore, summing (37) over verifies (34).
Example 3.8** (Stochastic Newton’s method).**
Suppose and . Take , , and
[TABLE]
Then (PP) reads
[TABLE]
where we abbreviate . We get
[TABLE]
where we define to satisfy and . This is a variant of stochastic Newton’s method and “sketching” [25, 24]. Notice how can be significantly cheaper to compute than .
Let
[TABLE]
as well as
[TABLE]
If , then at a linear rate.
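The idea of a Newton step restricted to a random subspace can be sketched as follows. This is an illustration only, under assumptions that are not in the text: the objective is a quadratic (so the Hessian is constant and each restricted step is an exact block minimisation), the random subspace is spanned by two distinct coordinates chosen uniformly, and the exact update displayed in the example is not reproduced. Only a 2-by-2 system is solved per iteration, which is the source of the computational savings.

```python
import random

A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 3.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]   # Hessian of J(x) = 0.5 * x^T A x - b^T x (assumed data)
b = [1.0, 2.0, 3.0, 4.0]
N = 4


def gradient(x):
    """Gradient A x - b of the quadratic objective."""
    return [sum(A[i][j] * x[j] for j in range(N)) - b[i] for i in range(N)]


def block_newton_step(x, i, j):
    """Newton step restricted to the coordinates i and j (in place)."""
    g = gradient(x)
    h_ii, h_ij, h_jj = A[i][i], A[i][j], A[j][j]
    det = h_ii * h_jj - h_ij * h_ij   # positive: principal minor of A
    # Solve the 2x2 system [[h_ii, h_ij], [h_ij, h_jj]] d = (g_i, g_j).
    d_i = (h_jj * g[i] - h_ij * g[j]) / det
    d_j = (h_ii * g[j] - h_ij * g[i]) / det
    x[i] -= d_i
    x[j] -= d_j


def stochastic_newton(x0, iterations, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        i, j = rng.sample(range(N), 2)   # random two-dimensional coordinate subspace
        block_newton_step(x, i, j)
    return x


x_final = stochastic_newton([0.0] * N, 500)
gradient_norm = max(abs(g) for g in gradient(x_final))
```

For non-quadratic objectives the Hessian block would have to be re-evaluated at each iterate, and only local convergence should be expected.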
Remark 3.9**.**
If for some self-adjoint positive definite and , then , so the upper bound on is satisfied for any . If for some , then due to
[TABLE]
An advantage of our techniques is the immediate convergence of:
Example 3.10** (Stochastic proximal Newton’s method).**
Let . Let for and with . Take , , and as in example 3.8. Then we obtain the algorithm
[TABLE]
We have at a linear rate under the same conditions as in example 3.8.
Proof 3.11** (Proof of convergence of stochastic Newton’s and proximal Newton’s methods).**
We set as the preconditioner and as the test for some . Clearly we have the following simpler non-value version of the value estimate (33):
[TABLE]
Therefore, since , the expected fundamental condition (C) becomes
[TABLE]
for
[TABLE]
Adapting the argumentation of lemmas B.5 and B.7 to the present projected setting, by the mean value theorem, for some between and , and using the definition of in (38) and the three-point identity (6), we rearrange
[TABLE]
By the definition of in (39) and by Cauchy’s inequality, for any , we obtain the expected three-point inequality
[TABLE]
We take . Then (41) holds when
[TABLE]
This is the case for some with provided and is small enough that . Due to (38), we can take for
[TABLE]
In particular, we obtain exponential growth of provided , which holds when , which is the case under our assumption . Consequently, we can take for . By corollary 3.2 we have
[TABLE]
Since , we obtain the claimed linear expected convergence of iterates.
Remark 3.12** (Variance estimates).**
From an estimate of the type , as above, Jensen’s inequality gives . From this, with the application of the triangle and Cauchy’s inequalities, it is easy to derive the variance estimate .
4 Saddle point problems
We now momentarily forget the stochastic setting and ergodic estimates to which we will return in section 5, and introduce our overall approach to primal–dual methods for saddle-point problems. With ; ; and on Hilbert spaces and , we now wish to solve the following version of (S). The first-order necessary optimality conditions read
[TABLE]
Setting and introducing the variable splitting notation , , etc., this can succinctly be written as in terms of the operator
[TABLE]
In this section, concentrating on this specific , we specialise the theory of section 2.2 to saddle point problems. Throughout, for some primal and dual step length and testing operators , and , we take
[TABLE]
To work with arbitrary step length operators, which will be necessary for stochastic algorithms in section 3, as well as the partially accelerated algorithms of [31], we will need abstract forms of partial strong monotonicity of and . As a first step, we take subspaces of operators
[TABLE]
We suppose that is partially (strongly) -monotone, which we take to mean
[TABLE]
for some linear operator . The operator acts as a testing operator. Observe that we have already proven this in (40) for the setting of the stochastic Newton’s method. Similarly, we assume that is -monotone in the sense
[TABLE]
Regarding , we assume that exists and is partially -co-coercive in the sense that for some holds
[TABLE]
(We allow for the case .)
We also introduce
[TABLE]
which are operator measures of strong monotonicity and smoothness of . Finally, we introduce the forward–step preconditioner with respect to , familiar from example 2.19 as
[TABLE]
Example 4.1** (Block-separable structure, monotonicity).**
Let be projection operators in with and if . Suppose are (strongly) convex with factors . Then the partial strong monotonicity (G-PM) holds with for
[TABLE]
4.1 Estimates
Using the (strong) -monotonicity of , and the -co-coercivity of , the next lemma simplifies corollary 2.2 for given by (42). We introduce to facilitate later gap estimates that will require the conditions in the lemma to hold for instead of .
Theorem 4.2**.**
Let have the structure (42) and assume . Suppose satisfies the partial strong monotonicity (G-PM) for some , similarly satisfies (F∗-PM), and satisfies the partial co-coercivity (J-PC) for some . For each , let and be such that and . Define and through (43). Also take , and . Suppose (PP) is solvable for . Then the fundamental conditions (CI), (CI∼) and the descent inequality (DI) hold if for all , the operator is self-adjoint and for and we have the fundamental inequality for saddle-point problems
[TABLE]
We have introduced and for later gap estimates, where the specific choices of these will differ by a factor of two, similarly to the differences in the step length bounds for the function value estimates of section 2.5 compared to the non-value estimates of section 2.3.
Proof 4.3**.**
Note that being self-adjoint implies that so is . Using (J-PC), similarly to lemma B.1 we derive
[TABLE]
Using (45), therefore
[TABLE]
With this, (G-PM), and (F∗-PM), we observe (47) to imply
[TABLE]
Note here that (48) employs while (47) employs . If we show that (CI) follows from (48), then the descent inequality (DI) follows from corollary 2.2. Indeed, using the expansion
[TABLE]
we expand for any that
[TABLE]
With the help of we then obtain
[TABLE]
Inserting this into (48), we obtain the fundamental inequality (CI). It implies (CI∼) via corollary 2.2. Finally, theorem 2.1 gives (DI).
4.2 Examples of primal–dual methods
We now look at several known methods for the saddle point problem (S). The fundamental idea in all of them is to design such that the primal variable and the dual variable can be updated independently, unlike in the standard proximal point method with . To help verify the conditions of theorem 4.2 for these methods, we reformulate the result for scalar step length and testing parameters: we will only use the full power of the operator setup in our companion paper [30].
If for each , we pick and , and define , , , and , then (43), (44), and (50c) reduce to
[TABLE]
Then we have the following corollary of theorem 4.2.
Corollary 4.4**.**
Let have the structure (42) and assume . Assume that is (-strongly) convex and is -Lipschitz for some and . For each , assume the structure (50) for . Also take and . Suppose (PP) is solvable for . Suppose for all that is self-adjoint, and that the fundamental condition for saddle-point problems (47) holds for and . Then the fundamental conditions (CI), (CI∼) and the descent inequality (DI) hold.
Proof 4.5**.**
Clearly and . Moreover, satisfies the partial monotonicity condition (F∗-PM) and satisfies the partial monotonicity condition (G-PM) with by the corresponding (strong) monotonicity of the subdifferentials. The rest follows from theorem 4.2.
Example 4.6** (The primal–dual method of Chambolle and Pock [7]).**
[TABLE]
In the basic version of the algorithm, , , and , assuming the step length parameters to satisfy
[TABLE]
If is compact, the iterates converge weakly, and the method has rate for the ergodic duality gap, to which we will return in section 5. If is strongly convex with factor , we may accelerate
[TABLE]
This yields convergence of to zero.
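For orientation, the basic (unaccelerated) iteration can be sketched in a few lines of Python. The sketch makes several assumptions that are not part of the example: the problem is one-dimensional total variation denoising, min_x 0.5 |x - f|^2 + lam * |K x|_1 with K the forward difference operator, the step lengths are constant with tau * sigma * |K|^2 < 1 (using the bound |K|^2 <= 4), and the duality gap computed at the end is the conventional one for this particular problem. All names are hypothetical.

```python
def diff(x):
    """Forward differences: (K x)_i = x_{i+1} - x_i."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def diff_adjoint(y, n):
    """Adjoint K^T of the forward difference operator."""
    out = [0.0] * n
    for i, yi in enumerate(y):
        out[i] -= yi
        out[i + 1] += yi
    return out


def chambolle_pock_tv(f, lam, tau, sigma, iterations):
    n = len(f)
    x, y = list(f), [0.0] * (n - 1)
    for _ in range(iterations):
        kty = diff_adjoint(y, n)
        # Primal proximal step on G(x) = 0.5 * |x - f|^2.
        x_next = [(xi - tau * ki + tau * fi) / (1.0 + tau)
                  for xi, ki, fi in zip(x, kty, f)]
        # Over-relaxation of the primal variable.
        x_bar = [2.0 * xn - xi for xn, xi in zip(x_next, x)]
        # Dual proximal step: projection onto the box [-lam, lam].
        y = [min(lam, max(-lam, yi + sigma * di)) for yi, di in zip(y, diff(x_bar))]
        x = x_next
    return x, y


def primal_dual_gap(x, y, f, lam):
    """Primal objective minus dual objective for the denoising problem."""
    primal = (0.5 * sum((xi - fi) ** 2 for xi, fi in zip(x, f))
              + lam * sum(abs(d) for d in diff(x)))
    kty = diff_adjoint(y, len(f))
    dual = sum(fi * ki for fi, ki in zip(f, kty)) - 0.5 * sum(ki * ki for ki in kty)
    return primal - dual


f = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
x, y = chambolle_pock_tv(f, lam=0.3, tau=0.45, sigma=0.45, iterations=3000)
gap = primal_dual_gap(x, y, f, lam=0.3)
```

For this step signal one can check from the optimality conditions that the minimiser is (0.1, 0.1, 0.1, 0.9, 0.9, 0.9), which gives a simple sanity check on the iterates. The over-relaxation step is what allows the primal and dual updates to be computed one after the other rather than jointly.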
Proof 4.7** (Proof of convergence of iterates).**
We formulate the method in our proximal point framework with and following [31, 14] by taking as the preconditioner
[TABLE]
For the rest of the operators, we use the setup of (50). Taking , we now reduce (47) to
[TABLE]
We may expand
[TABLE]
We have (but not , as the former depends on the off-diagonals cancelling out), and is self-adjoint, if for some constant we take
[TABLE]
This gives the acceleration scheme (53). Moreover, for any holds
[TABLE]
Thus if . By (56), . Since this fixes the ratio of to , we need to take as well as . Through the positivity of , we recover the initialisation condition (52).
Recall that subdifferentials are weak-to-strong outer-semicontinuous. By the continuity of , we thus deduce the strong-to-strong outer semicontinuity of . To verify (7), we use the assumed compactness of , which implies for a further unrelabelled subsequence of that satisfy . Corollaries 4.4 and 2.9 now show weak convergence of the iterates without a rate.
If is strongly convex with factor , the results in [7, 31] show that is of the order , and consequently is of the order . By proposition 2.5, converges to zero at the rate .
Remark 4.8** (Brezis–Crandall–Pazy property).**
It is possible to show that satisfies the Brezis–Crandall–Pazy property [3] without a compactness assumption on . With a corresponding improvement to proposition 2.5, the assumption could be dropped.
Remark 4.9** (Linear convergence).**
If is strongly convex with factor , the last equation of (56) takes a form similar to the first, . From here, if both and are strongly convex, it is possible to show linear convergence.
We can also add an additional forward step to the method. With that, the method resembles the method of Vũ–Condat [10, 32], which also incorporates an additional outer over-relaxation step on the whole algorithm.
Example 4.10** (Chambolle–Pock with a forward step).**
Suppose is (strongly) convex with factor , and Lipschitz with factor . In [8], the Chambolle–Pock method was extended to take forward steps with respect to . With everything else as in example 4.6, take . Then the preconditioned proximal point method (PP) can be rearranged as
[TABLE]
The method inherits the convergence properties of example 4.6 if we use the step length update rules (53), and initialise subject to (52), and
[TABLE]
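The effect of the additional forward step can be sketched on a small problem. The following Python illustration rests on assumptions that are not in the example: the smooth function is 0.5 |x - f|^2 (Lipschitz factor 1), the proximal primal function is the indicator of the box [0, 1]^n, the dual function is the conjugate of lam * |.|_1, the operator K is the forward difference, and the constant step lengths satisfy the sufficient condition tau * (L + sigma * |K|^2) < 1 with |K|^2 <= 4, which stands in for the bound (61) without reproducing it.

```python
def diff(x):
    """Forward differences: (K x)_i = x_{i+1} - x_i."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def diff_adjoint(y, n):
    """Adjoint K^T of the forward difference operator."""
    out = [0.0] * n
    for i, yi in enumerate(y):
        out[i] -= yi
        out[i + 1] += yi
    return out


def chambolle_pock_forward(f, lam, tau, sigma, iterations):
    n = len(f)
    x, y = [0.5] * n, [0.0] * (n - 1)
    for _ in range(iterations):
        kty = diff_adjoint(y, n)
        # Forward step on the smooth term 0.5 * |x - f|^2 and the coupling term.
        forward = [xi - tau * ((xi - fi) + ki) for xi, fi, ki in zip(x, f, kty)]
        # Backward step on the indicator of [0, 1]^n: a projection.
        x_next = [min(1.0, max(0.0, v)) for v in forward]
        x_bar = [2.0 * xn - xi for xn, xi in zip(x_next, x)]
        # Dual proximal step: projection onto [-lam, lam].
        y = [min(lam, max(-lam, yi + sigma * di)) for yi, di in zip(y, diff(x_bar))]
        x = x_next
    return x, y


f = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
# tau * (L + sigma * 4) = 0.3 * (1 + 2) = 0.9 < 1
x, y = chambolle_pock_forward(f, lam=0.3, tau=0.3, sigma=0.5, iterations=5000)
```

Since the box constraint is inactive at the minimiser of the unconstrained denoising problem, the iterates approach the same point (0.1, 0.1, 0.1, 0.9, 0.9, 0.9); the price of the forward step is the smaller admissible primal step length.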
Proof 4.11** (Proof of convergence).**
With as in (54), the fundamental condition for saddle-point problems (47) becomes
[TABLE]
The rules (56) force . We take for some , and deduce using Cauchy’s inequality that (62) holds if
[TABLE]
Recalling (57), this is true if and . Further recalling (56), and observing that is non-increasing, we only have to satisfy . Otherwise put, we obtain (61).
Finally, we have the following Generalised Iterative Soft Thresholding (GIST) method from [19].
Example 4.12** (GIST).**
Suppose , , , and . Take
[TABLE]
With and , we obtain the method
[TABLE]
If is compact, the iterates converge weakly to .
Proof 4.13** (Proof of convergence).**
Observe that the partial co-coercivity (J-PC) holds with . Clearly is positive semi-definite self-adjoint. If we take and , then
[TABLE]
Thus . Eliminating by monotonicity, the fundamental condition for saddle-point problems (47) thus holds if
[TABLE]
Expanding , we see this to hold when and , which are exactly our assumptions. Using corollaries 4.4 and 2.5, and reasoning as in example 4.6 to verify the outer-semicontinuity properties of , we obtain weak convergence.
5 An ergodic duality gap
We now study the extension of the testing approach of section 2.2 to produce the convergence of an ergodic duality gap. Throughout this section, we are in the saddle point setup of section 4. In particular, the operator is as in (42), and the step length and testing operators and as in (43).
5.1 Preliminary gap estimates
Our first lemma demonstrates how to obtain a “preliminary” gap from . If the step lengths and tests are scalar, , and , etc., and satisfy , it is easy to bound this preliminary gap from below by times the “relaxed” duality gap
[TABLE]
To do the same for more general step length operators, we will in section 5.3 introduce abstract notions of convexity that incorporate ergodicity and stochasticity.
Observe that the “relaxed” gap (63) satisfies
[TABLE]
where the right-hand side is the conventional duality gap guaranteed to be non-zero for a non-solution .
Lemma 5.1**.**
For a fixed , suppose and are self-adjoint. Then for as in (42), we have
[TABLE]
where the “preliminary gap”
[TABLE]
Proof 5.2**.**
Similarly to the proof of theorem 4.2, we have
[TABLE]
A little bit of reorganisation gives (64). Indeed
[TABLE]
The next lemma extends theorem 4.2 to estimate the preliminary gap.
Lemma 5.3**.**
Let have the structure (42) and assume . For each , let and , as well as and . Define and through (43). Suppose (PP) is solvable for . If for all , is self-adjoint, and
[TABLE]
then
[TABLE]
Proof 5.4**.**
Inserting (64) from lemma 5.1 into (66) shows that
[TABLE]
Hence the fundamental condition (CI∼) holds for . Now we use theorem 2.1.
5.2 General conversion formulas of preliminary gaps to ergodic gaps
The “preliminary gaps” are not as such very useful. To go further, the abstract partial monotonicity assumptions (G-PM) and (F∗-PM) are not enough, and we need analogous convexity formulations. We formulate these conditions directly in the stochastic setting (recall section 3).
For the moment, we assume for all that whenever and for each with , then for some holds
[TABLE]
Analogously, we assume for and for each with that for some holds
[TABLE]
These conditions can of course always be satisfied for some and . After a few general lemmas, we will replace these placeholder values by more meaningful ones.
To state those lemmas, we also assume for some scalars , (), either of the primal–dual coupling conditions
[TABLE]
As we will see in example 5.15, (C) is satisfied by the accelerated Chambolle–Pock method of example 4.6. In our companion paper [30], we will however see that (C) is required to develop doubly-stochastic methods.
Lemma 5.5**.**
Assume (68), (69), and the first primal–dual coupling condition (C). Given iterates , for all set
[TABLE]
and define the ergodic sequences
[TABLE]
Then
[TABLE]
Proof 5.6**.**
Let be fixed. With over , (68) implies
[TABLE]
Likewise, with , (69) shows that
[TABLE]
From the definition of the preliminary gap in (65), applying (C), we obtain
[TABLE]
Recalling the definition of the gap in (63), and using the estimates (71), (72), as well as the definition (70) of the ergodic sequences, we obtain the claim.
Lemma 5.7**.**
Suppose and satisfy with the corresponding partial monotonicities (G-PM) and (F∗-PM). Also assume (68), (69), and the second primal–dual coupling condition (C). Given , for all set
[TABLE]
and define the ergodic sequences
[TABLE]
Then
[TABLE]
Proof 5.8**.**
Shifting indices of by one compared to , we define
[TABLE]
Reorganising terms, therefore
[TABLE]
By virtue of , we have , and . Estimating with (G-PM) and (F∗-PM), and afterwards taking the expectation, we therefore obtain
[TABLE]
From here we may proceed analogously to the proof of lemma 5.5.
5.3 Final gap estimates
We now convert the abstract ergodic conditions (68) and (69) into ergodic strong convexity and smoothness conditions that can be derived from the corresponding standard properties in block-separable cases.
Recall the spaces of operators and from section 4. We assume for all that whenever and for each with , then for some we have the ergodic strong convexity
[TABLE]
Analogously, we assume for and for each with the ergodic convexity
[TABLE]
Finally, we assume is differentiable and satisfies for some parameters the 3-point ergodic smoothness condition
[TABLE]
The shifting refers to uses of , where a typical definition of smoothness would use .
Example 5.9** (Block-separable structure, ergodic convexity).**
Let and have the separable structure of example 4.1. We claim that the ergodic strong convexity (G-EC) holds. Indeed, let us introduce , satisfying for each . Splitting (G-EC) into separate inequalities over all , and using the strong convexity of , we see (G-EC) to be true with if for all holds
[TABLE]
The right hand side can also be written as for the measure on the domain . Using our assumption , we deduce . An application of Jensen’s inequality now shows (73). Therefore (G-EC) is satisfied for .
Example 5.10** (Ergodic smoothness for smooth ).**
If has -Lipschitz gradient, then lemma B.1 shows the three-point inequality
[TABLE]
If for scalar , then proceeding as in (73) in example 5.9, we deduce the 3-point ergodic smoothness (J-ES) with . Similarly, we can treat the block-separable case when each individually has Lipschitz gradient.
The next theorem is our main result for saddle point problems. To clarify the statement of the theorem, which depends on various different combinations of several conditions in the definition of , we recall here the rough meaning of each:
- (47) Fundamental condition (CI∼) for saddle point problems.
- (G-PM) Partial (testing and step length operator relative) strong monotonicity of .
- (F∗-PM) Partial monotonicity of .
- (J-PC) Partial co-coercivity of .
- (G-EC) Partial strong ergodic convexity of .
- (F∗-EC) Partial ergodic convexity of .
- (J-ES) Partial 3-point ergodic smoothness of .
- (C) First alternative primal–dual coupling condition.
- (C) Second alternative primal–dual coupling condition.
Theorem 5.11**.**
Let have the structure (42) and assume . For each , let and be such that and . Define and through (43). Also take and . Suppose (PP) is solvable for . Assuming one of the following cases to hold with and , let
[TABLE]
If for all , is self-adjoint and (47) holds for given above, then so does the following ergodic gap descent inequality:
[TABLE]
Proof 5.12**.**
The case is simply the result of taking the expectation in the claim of theorem 4.2; compare how corollary 3.2 follows from theorem 2.1. Regarding the remaining two cases, clearly (47) implies (66) for
[TABLE]
Thus lemma 5.3 shows the descent estimate (67).
The ergodic strong convexity (G-EC) and (J-ES) imply (68) for
[TABLE]
where . Likewise the ergodic convexity (F∗-EC) implies (69) for . When the first primal–dual coupling condition (C) holds, we take above , which we have assumed to belong to . If the alternative second primal–dual coupling condition (C) holds, we take . Therefore, (67) can be rewritten
[TABLE]
for
[TABLE]
Now we just take the expectation in (74), and apply lemma 5.5 or lemma 5.7.
5.4 Primal–dual examples revisited
We now study gap estimates for several of the examples from section 4. We start by verifying partial monotonicity and ergodic convexity and smoothness conditions for in the case of simple deterministic scalar step length and testing operators: the block-separable and stochastic case we leave to the companion paper [30].
Similarly to corollary 4.4 of theorem 4.2, we now have the following non-stochastic scalar corollary of theorem 5.11. From the corollary, if , we clearly get the convergence of or to zero at the respective rate or .
Corollary 5.13**.**
Let have the structure (42) and assume . Assume that is (-strongly) convex and is -Lipschitz for some and . For each , assume the structure (50) for . Also take and . Suppose (PP) is solvable for . Suppose for all that , that is self-adjoint, and that the fundamental condition for saddle-point problems (47) holds for and . Then
[TABLE]
If, instead, , then the gap expression is replaced by .
Proof 5.14**.**
As in the proof of corollary 4.4, clearly and , so that the partial monotonicities (F∗-PM) and (G-PM) (with ) hold by the monotonicity of the subdifferentials of and . Similarly, the ergodic (strong) convexity (G-EC) of with and (F∗-EC) of hold by a Jensen argument similar to example 5.9. Likewise, the ergodic smoothness (J-ES) holds by the three-point inequality (77) and a Jensen argument similar to example 5.10. Note that with everything deterministic, the expectations disappear.
With this, the result follows immediately from theorem 5.11 for the second and third cases of . The primal–dual coupling conditions (C) and (C) reduce to our respective conditions and .
In examples 4.6 and 4.12, we proved (47) for the Chambolle–Pock method and the GIST with and . Now we have to do the same but with the factor-of-two different and . The different will merely change the acceleration factor of the method. The larger , on the other hand, will change the step length bound (61) of the forward-step Chambolle–Pock, example 4.10, to
[TABLE]
and the bound of the GIST of example 4.12 to .
Example 5.15** (Gap for Chambolle–Pock with a forward step).**
In the demonstration of examples 4.6 and 4.10, we have seen the Chambolle–Pock method to satisfy and the self-adjointness of . As discussed above, (47) holds with subject to the conditions and (75). We now have . In the unaccelerated case (), we get . Therefore, we get from corollary 5.13 the convergence of to zero. In the accelerated case (), is of the order . Therefore also is of the order , so we get convergence of to zero.
Example 5.16** (Gap for GIST).**
In example 4.12 we have seen the GIST to satisfy and the self-adjointness of . Moreover, as discussed above, (47) holds with if . It therefore has and . Consequently, corollary 5.13 yields the convergence of both and to zero.
Conclusion
We have unified common convergence proofs of optimisation methods, employing the ideas of non-linear preconditioning and testing of the classical proximal point method. We have demonstrated that popular classical and modern algorithms can be presented in this framework, and their convergence, including convergence rates, proved with little effort. The theory was, however, not developed with existing algorithms in mind. It was developed to allow the development of new spatially adapted block-proximal methods in [30]. We will demonstrate the full power of the theory there and in other works to follow. For one, we did not yet fully exploit the fact that and are operators, to construct step-wise step lengths and acceleration.
Appendix A Outer semicontinuity of maximal monotone operators
We could not find the following result explicitly stated in the literature, although it is hidden in, e.g., the proof of [27, Theorem 1].
Lemma A.1**.**
Let be maximal monotone on a Hilbert space . Then is weak-to-strong outer semicontinuous: for any sequence , and any such that weakly, and strongly, we have .
Proof A.2**.**
By monotonicity, for any and holds . Since a weakly convergent sequence is bounded, we have for some independent of . Taking the limit, we therefore have . If we had , this would contradict that is maximal, i.e., that its graph is not properly contained in the graph of any monotone operator.
Appendix B Three-point inequalities
The following three-point formulas are central to handling forward steps with respect to smooth functions.
Lemma B.1**.**
Suppose has -Lipschitz gradient. Then
[TABLE]
as well as
[TABLE]
Proof B.2**.**
Regarding the “three-point hypomonotonicity” (76), the -Lipschitz gradient implies co-coercivity (see [1] or appendix C)
[TABLE]
Thus using Cauchy’s inequality
[TABLE]
To prove (77), the Lipschitz gradient implies the smoothness or “descent inequality” (again, [1] or appendix C)
[TABLE]
By convexity . Summed, we obtain (77).
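Both three-point inequalities are easy to test numerically. The sketch below assumes their standard forms for a convex J with L-Lipschitz gradient, namely <grad J(z) - grad J(x_bar), x - x_bar> >= -(L/4) |x - z|^2 and <grad J(z), x - x_bar> >= J(x) - J(x_bar) - (L/2) |x - z|^2, which we take to coincide with (76) and (77); this identification, the diagonal quadratic test function, and all names are assumptions made for illustration.

```python
import random

a = [0.5, 1.0, 3.0]   # J(x) = 0.5 * sum_i a_i * x_i^2 is convex (assumed data)
L = max(a)            # Lipschitz factor of the gradient of J


def J(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad_J(x):
    return [ai * xi for ai, xi in zip(a, x)]


def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]


def three_point_slacks(x, z, x_bar):
    """Left-hand side minus right-hand side of both inequalities (should be >= 0)."""
    d = sub(x, x_bar)
    dist2 = inner(sub(x, z), sub(x, z))
    hypomonotonicity = inner(sub(grad_J(z), grad_J(x_bar)), d) + 0.25 * L * dist2
    smoothness = inner(grad_J(z), d) - (J(x) - J(x_bar)) + 0.5 * L * dist2
    return hypomonotonicity, smoothness


rng = random.Random(0)
worst_hypomonotonicity = float("inf")
worst_smoothness = float("inf")
for _ in range(1000):
    # Three independent random points x, z, x_bar in [-5, 5]^3.
    points = [[rng.uniform(-5.0, 5.0) for _ in a] for _ in range(3)]
    s1, s2 = three_point_slacks(*points)
    worst_hypomonotonicity = min(worst_hypomonotonicity, s1)
    worst_smoothness = min(worst_smoothness, s2)
```

A random test of this kind does not replace the proof, but it is a quick way to catch a wrong constant or sign when adapting the inequalities to a new setting.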
Lemma B.3**.**
Suppose has -Lipschitz gradient and is -strongly convex. Then for any , we have
[TABLE]
as well as
[TABLE]
Proof B.4**.**
To prove (80), using strong convexity, the Lipschitz gradient, and Cauchy’s inequality, we have
[TABLE]
Regarding (79), using the -strong monotonicity of , we estimate completely analogously
[TABLE]
Since smooth functions with a positive Hessian are locally convex, the above lemmas readily extend to this case, locally. In fact, we have the following more precise result:
Lemma B.5**.**
Suppose with at given . Then for any and all , we have
[TABLE]
with
[TABLE]
If , then also
[TABLE]
Proof B.6**.**
By Taylor expansion, for some between and , and any , we have
[TABLE]
Since , by the definition of , we obtain (81).
Similarly, by Taylor expansion, for some between and , we have
[TABLE]
Using (84) we obtain
[TABLE]
Using the assumption , we have . Hence we obtain (83) by the definition of and .
We can also derive the following alternate result:
Lemma B.7**.**
Suppose with at given . Then for all we have
[TABLE]
for given by (82). If , then also
[TABLE]
Proof B.8**.**
By Taylor expansion, for some between and , we have
[TABLE]
In the last step we have used Cauchy’s inequality, and the definition of following . The standard three-point or Pythagoras’ identity states
[TABLE]
Applying this in (88), we obtain (86).
To prove (87), we use (85), the definition of , and (86).
Appendix C Projected gradients and smoothness
The next lemma generalises well-known properties (see, e.g., [1]) of smooth convex functions to projected gradients, when we take as a projection operator. With a random projection, taking the expectation in (91), we in particular obtain a connection to the Expected Separable Over-approximation property in the stochastic coordinate descent literature [26].
Lemma C.1**.**
Let , and be self-adjoint and positive semi-definite on a Hilbert space . Suppose has a pseudo-inverse satisfying . Consider the properties:
- (i) -relative Lipschitz continuity of with factor :
[TABLE]
- (ii) The -relative property
[TABLE]
- (iii) -relative smoothness of with factor :
[TABLE]
- (iv) The -relative property
[TABLE]
- (v) -relative co-coercivity of with factor :
[TABLE]
We have (i) ⟹ (ii) ⟺ (iii) ⟹ (iv) ⟹ (v). If is invertible, all are equivalent.
Proof C.2**.**
(i) ⟹ (ii): Take and multiply (89) by . Then use Cauchy–Schwarz.
(ii) ⟹ (iii): Using the mean value theorem and (90), we compute (91):
[TABLE]
(iii) ⟹ (ii): Add together (91) for and .
(iii) ⟹ (iv): Adding on both sides of (91), we get
[TABLE]
The left hand side is minimised with respect to by taking . Taking on the right-hand side therefore gives (92).
(iv) ⟹ (v): Summing the estimate (92) with the same estimate with and exchanged, we obtain (93).
(v) ⟹ (i) when is invertible: Cauchy–Schwarz.
- [1] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2nd edition, 2017.
- [2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- [3] H. Brezis, M. G. Crandall, and A. Pazy. Perturbations of nonlinear maximal monotone sets in Banach space. Communications on Pure and Applied Mathematics, 23(1):123–144, 1970.
- [4] Felix E. Browder. Nonexpansive nonlinear operators in a Banach space. Proceedings of the National Academy of Sciences of the United States of America, 54(4):1041, 1965.
- [5] Felix E. Browder. Convergence theorems for sequences of nonlinear operators in Banach spaces. Mathematische Zeitschrift, 100(3):201–225, 1967.
- [6] Y. Censor and S. A. Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451–464, 1992.
- [7] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
- [8] Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, pages 1–35, 2015.
