Testing and non-linear preconditioning of the proximal point method

Tuomo Valkonen

arXiv:1703.05705·math.OC·October 6, 2020


TL;DR

This paper develops a unified theoretical framework for analyzing the convergence of various optimization algorithms using non-linear preconditioning and testing, applicable to classical and stochastic methods.

Contribution

It formalizes a simple iteration-wise inequality approach for convergence proofs, generalizing properties like firm non-expansivity to a broad class of algorithms.

Findings

01 Effective application to classical algorithms and their stochastic variants

02 Unified convergence analysis framework for multiple methods

03 Demonstrates the approach's versatility across different algorithms

Abstract

Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we formalise common arguments in convergence rate and convergence proofs of optimisation methods to the verification of a simple iteration-wise inequality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or the $α$-averaged property. The main purpose of this work is to provide the abstract background theory for our companion paper "Block-proximal methods with spatially adapted acceleration". In the present account we demonstrate the effectiveness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these methods include the gradient descent, forward--backward splitting, Douglas--Rachford splitting, Newton's method, as well as several…

Equations

\min_{x}~G(x)+J(x)+F(Kx)

\min_{x}\max_{y}~G(x)+J(x)+\langle Kx,y\rangle-F^{*}(y).

0\in H(u^{i+1})+M_{i+1}(u^{i+1}-u^{i}),

H(u):=\begin{pmatrix}\partial G(x)+K^{*}y\\ \partial F^{*}(y)-Kx\end{pmatrix}

M_{i+1}:=\begin{pmatrix}\tau_{i}^{-1}I & -K^{*}\\ -\theta_{i}K & \sigma_{i+1}^{-1}I\end{pmatrix}.

\langle x,z\rangle_{T}:=\langle Tx,z\rangle,\quad\text{and}\quad\|x\|_{T}:=\sqrt{\langle x,x\rangle_{T}}.

0\in H(\widehat{u}).

0\in W_{i+1}H(u^{i+1})+V_{i+1}(u^{i+1}).

V_{i+1}(u)=V^{\prime}_{i+1}(u)+M_{i+1}(u-u^{i}).

0\in\widetilde{H}_{i+1}(u^{i+1})+M_{i+1}(u^{i+1}-u^{i}).

\langle\widetilde{H}_{i+1}(u^{i+1}),u^{i+1}-\widehat{u}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{Z_{i+2}M_{i+2}-Z_{i+1}M_{i+1}}^{2}-\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}-\Delta_{i+1}(\widehat{u}),

\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{Z_{i+2}M_{i+2}}^{2}\leq\frac{1}{2}\|u^{i}-\widehat{u}\|_{Z_{i+1}M_{i+1}}^{2}+\Delta_{i+1}(\widehat{u})\quad(i\in\mathbb{N})

\frac{1}{2}\|u^{N}-\widehat{u}\|_{Z_{N+1}M_{N+1}}^{2}\leq\frac{1}{2}\|u^{0}-\widehat{u}\|_{Z_{1}M_{1}}^{2}+\sum_{i=0}^{N-1}\Delta_{i+1}(\widehat{u})\quad(N\geq 1).

\langle\widetilde{H}_{i+1}(u^{i+1}),u^{i+1}-\widehat{u}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{Z_{i+1}\Gamma}^{2},

Z_{i+2}M_{i+2}\simeq Z_{i+1}(M_{i+1}+\Gamma),

\langle W_{i+1}[H(u^{i+1})-H(\widehat{u})]+V^{\prime}_{i+1}(u^{i+1}),u^{i+1}-\widehat{u}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{Z_{i+2}M_{i+2}-Z_{i+1}M_{i+1}}^{2}-\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}-\Delta_{i+1}(\widehat{u}),

\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}+\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{Z_{i+1}M_{i+1}-Z_{i+2}M_{i+2}}^{2}-\langle u^{i+1}-u^{i},u^{i+1}-\widehat{u}\rangle_{Z_{i+1}M_{i+1}}\geq-\Delta_{i+1}(\widehat{u}).

\langle u^{i+1}-u^{i},u^{i+1}-\widehat{u}\rangle_{M}=\frac{1}{2}\|u^{i+1}-u^{i}\|_{M}^{2}-\frac{1}{2}\|u^{i}-\widehat{u}\|_{M}^{2}+\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{M}^{2}.

D_{J}^{p}(z,x):=J(z)-J(x)-\langle p\,|\,z-x\rangle_{X},\quad(x\in X),

D_{J}^{p}(\widehat{x},x)-D_{J}^{q}(\widehat{x},z)+D_{J}^{q}(x,z)=[J(\widehat{x})-J(x)-\langle p\,|\,\widehat{x}-x\rangle_{X}]-[J(\widehat{x})-J(z)-\langle q\,|\,\widehat{x}-z\rangle_{X}]+[J(x)-J(z)-\langle q\,|\,x-z\rangle_{X}]=\langle p-q\,|\,x-\widehat{x}\rangle_{X}.

0\in Z_{i+1}\widetilde{H}_{i+1}(u^{i+1})+(p^{i+1}-q^{i})\quad\text{with}\quad q^{i}\in\partial J_{i+1}(u^{i})\quad\text{and}\quad p^{i+1}\in\partial J_{i+1}(u^{i+1})

w^{i+1}:=-Z_{0}M_{0}(u^{i+1}-u^{i})\to 0,\quad w^{i_{k}}\in Z_{0}\widetilde{H}_{i_{k}}(u^{i_{k}}),\quad u^{i_{k}}\rightharpoonup u\implies 0\in H(u),

\frac{1}{2}\|u^{i+1}-\widehat{u}\|_{A}^{2}+\frac{\delta}{2}\|u^{i+1}-u^{i}\|_{A}^{2}\leq\frac{1}{2}\|u^{i}-\widehat{u}\|_{A}^{2}

\phi_{i}\tau_{i}\langle H(u^{i+1})-H(\widehat{u}),u^{i+1}-\widehat{u}\rangle\geq\frac{\phi_{i+1}-\phi_{i}}{2}\|u^{i+1}-\widehat{u}\|^{2}-\frac{\phi_{i}}{2}\|u^{i+1}-u^{i}\|^{2}-\Delta_{i+1}(\widehat{u}).

\langle H(u)-H(u^{\prime}),u-u^{\prime}\rangle\geq\gamma\|u-u^{\prime}\|^{2}\quad(u,u^{\prime}\in U).

\phi_{i+1}:=\phi_{i}(1+2\gamma\tau_{i}).

\frac{\phi_{N}}{2}\|u^{N}-\widehat{u}\|^{2}\leq\frac{\phi_{0}}{2}\|u^{0}-\widehat{u}\|^{2}\quad(N\geq 1).

\frac{\phi_{i}}{2}\|u^{i+1}-u^{i}\|^{2}+\frac{\phi_{i}-\phi_{i+1}}{2}\|u^{i+1}-\widehat{u}\|^{2}+\phi_{i}\tau_{i}\langle H(u^{i+1})-H(\widehat{u}),u^{i+1}-\widehat{u}\rangle\geq 0.

\frac{\phi_{i}}{2}\|u^{i+1}-u^{i}\|^{2}+\frac{\phi_{i}-\phi_{i+1}}{2}\|u^{i+1}-\widehat{u}\|^{2}+\phi_{i}\tau_{i}\langle\nabla J(u^{i})-\nabla J(\widehat{u}),u^{i+1}-\widehat{u}\rangle\geq 0.

\langle\nabla J(u^{i})-\nabla J(\widehat{u}),u^{i+1}-\widehat{u}\rangle\geq-\frac{L}{4}\|u^{i+1}-u^{i}\|^{2}.


Full text


Testing and non-linear preconditioning of the proximal point method

Tuomo Valkonen ModeMat, Escuela Politécnica Nacional, Quito, Ecuador; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom. [email protected]

(2017-03-16 (revised 2018-08-23))

Abstract

Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we formalise common arguments in convergence rate and convergence proofs of optimisation methods to the verification of a simple iteration-wise inequality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or the $\alpha$-averaged property. The main purpose of this work is to provide the abstract background theory for our companion paper “Block-proximal methods with spatially adapted acceleration”. In the present account we demonstrate the effectiveness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these methods include the gradient descent, forward–backward splitting, Douglas–Rachford splitting, Newton’s method, as well as several methods for saddle-point problems, such as the Alternating Directions Method of Multipliers, and the Chambolle–Pock method.

Get the version from http://tuomov.iki.fi/publications/; citations are broken in this one due to arXiv being stuck in the 70s and not supporting biblatex (or 80s bibtex for that matter), hence no modern bibliography styles or utf8.

1 Introduction

The proximal point method for monotone operators [21, 27], while infrequently used by itself, can be found as a building block of many popular optimisation algorithms. Indeed, many important application problems can be written in the form

$$\min_{x}~G(x)+J(x)+F(Kx) \tag{P}$$

for convex $G$, $J$ and $F$, and a linear operator $K$, with $G$ and $F$ non-smooth and $J$ smooth. Examples abound in image processing and data science. The problem (P) can often be solved by methods such as forward–backward splitting, ADMM (alternating directions method of multipliers) and their variants [2, 19, 13, 7]. They all involve a proximal point step.

The equivalent saddle point form of (P) is

$$\min_{x}\max_{y}~G(x)+J(x)+\langle Kx,y\rangle-F^{*}(y). \tag{S}$$

In particular within mathematical image processing and computer vision, a popular algorithm for solving (S) with $J=0$ is the primal–dual method of Chambolle and Pock [7]. As discovered in [14], the method can most concisely be written as a preconditioned proximal point method, solving on each iteration for $u^{i+1}=(x^{i+1},y^{i+1})$ the variational inclusion

$$0\in H(u^{i+1})+M_{i+1}(u^{i+1}-u^{i}), \tag{PP0}$$

where the monotone operator

$$H(u):=\begin{pmatrix}\partial G(x)+K^{*}y\\ \partial F^{*}(y)-Kx\end{pmatrix}$$

encodes the optimality condition $0\in H({\widehat{u}})$ for (S). In the standard proximal point method [27], one would take $M_{i+1}=I$ the identity. With this choice, (PP0) is generally difficult to solve. In the Chambolle–Pock method the preconditioning operator is given for suitable step length parameters $\tau_{i},\sigma_{i+1},\theta_{i}>0$ by

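$$M_{i+1}:=\begin{pmatrix}\tau_{i}^{-1}I & -K^{*}\\ -\theta_{i}K & \sigma_{i+1}^{-1}I\end{pmatrix}. \tag{1}$$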

This choice of $M_{i+1}$ decouples the primal $x$ and dual $y$ updates, making the solution of (PP0) feasible in a wide range of problems. If $G$ is strongly convex, the step length parameters $\tau_{i},\sigma_{i+1},\theta_{i}$ can be chosen to yield $O(1/N^{2})$ convergence rates of an ergodic duality gap and the quadratic distance $\|x^{i}-{\widehat{x}}\|^{2}$ .
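The decoupling can be made concrete on a toy problem. The following is a minimal sketch, not taken from the paper: it assumes a scalar instance $G(x)=\frac{1}{2}(x-a)^{2}$, $J=0$, $F=|\cdot|$ and $K=k\in\mathbb{R}$, for which both proximal maps are explicit. The function name, the constant step lengths and the problem data are all illustrative assumptions. The first row of the preconditioned inclusion gives a proximal step on $G$ at $x^{i}-\tau K^{*}y^{i}$, and the second row a proximal step on $F^{*}$ at the over-relaxed point produced by the $-\theta_{i}K$ entry.

```python
def chambolle_pock_scalar(a=3.0, k=1.0, tau=0.5, sigma=0.5, theta=1.0, iters=200):
    """Toy Chambolle-Pock iteration for min_x 0.5*(x-a)^2 + |k*x|.

    Assumed setting (illustrative only): G(x) = 0.5*(x-a)^2, J = 0,
    F = |.|, K = k (a scalar). Then F^* is the indicator of [-1, 1],
    whose proximal map is a clamp.
    """
    assert tau * sigma * k * k < 1.0  # usual step length condition
    x, y = 0.0, 0.0
    for _ in range(iters):
        x_old = x
        # Primal row: proximal step on G at x - tau*K^* y (uses the old y only).
        x = (x_old - tau * k * y + tau * a) / (1.0 + tau)
        # Dual row: the -theta*K entry yields the over-relaxed primal point.
        x_bar = x + theta * (x - x_old)
        y = max(-1.0, min(1.0, y + sigma * k * x_bar))
    return x, y


x, y = chambolle_pock_scalar()
print(x, y)
```

For these data the minimiser is the soft-thresholded value $\widehat{x}=2$ with dual solution $\widehat{y}=1$, and the iterates approach this pair. Neither update requires solving a coupled system in $(x,y)$, which is the point of the off-diagonal entries of $M_{i+1}$.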

In our earlier work [31], we have modified $M_{i+1}$ as well as the condition (PP0) to still allow a level of mixed-rate acceleration when $G$ is strongly convex only on sub-spaces. Our convergence proofs were based on testing the abstract proximal point method by a suitable operator, which encodes the desired and achievable convergence rates on relevant subspaces.

In the present paper, we extend this theoretical approach to non-linear preconditioning, non-invertible step-length operators, and arbitrary monotone operators $H$ . Our main purpose is to provide the abstract background theory for our companion paper [30]. Here, within these pages, we demonstrate that several classical optimisation methods—including the second-order Newton’s method—can also be seen as variants of the proximal point method, and that their common convergence rate and convergence proofs reduce to the verification of a simple iteration-wise inequality. Through application of our theory to Browder’s fixed point theorem [4] in section 2.6, we see that our inequality generalises the concepts of firm non-expansivity or the $\alpha$ -averaged property. Our theory also covers stochastic variants of the considered algorithms.

In section 2, we start by developing our theory for general monotone operators $H$. This extends, simplifies, and clarifies the more disconnected results from [31] that concentrated on saddle-point problems with preconditioners derived from (1). We demonstrate our results on the basic proximal point method, gradient descent, forward–backward splitting, Douglas–Rachford splitting, and Newton’s method. The proximal step in forward–backward splitting and proximal Newton’s method can be introduced completely “free”, without any additional proof effort, in our approach. In section 3 we demonstrate the further flexibility of our techniques by application to stochastic block coordinate methods. We refer to [33] for a review of this class of methods. In the final sections 4 and 5 we specialise our work to saddle-point problems, and demonstrate the results on variants of the Chambolle–Pock method, and the Generalised Iterative Soft Thresholding (GIST) algorithm of [19]. Some of the derivations in these last two sections are quite abstract and general, as we will need this for our companion paper [30] where we develop stochastic primal-dual methods with coordinate-wise adapted step lengths.

Besides already cited works, other previous work related to ours includes that on generalised proximal point methods, such as [6, 9], as well as inertial methods for variational inclusions [18].

2 An abstract preconditioned proximal point iteration

2.1 Notation and general setup

We use $\mathrm{cpl}(X)$ to denote the space of convex, proper, lower semicontinuous functions from $X$ to the extended reals $\overline{\mathbb{R}}:=[-\infty,\infty]$ , and $\mathcal{L}(X;Y)$ to denote the space of bounded linear operators between Hilbert spaces $X$ and $Y$ . We denote the identity operator by $I$ . For $T,S\in\mathcal{L}(X;X)$ , we write $T\geq S$ when $T-S$ is positive semidefinite. Also for possibly non-self-adjoint $T$ , we introduce the inner product and norm-like notations

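$$\langle x,z\rangle_{T}:=\langle Tx,z\rangle,\quad\text{and}\quad\|x\|_{T}:=\sqrt{\langle x,x\rangle_{T}}. \tag{2}$$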

For a set $A\subset\mathbb{R}$ , we write $A\geq 0$ if every element $t\in A$ satisfies $t\geq 0$ .

Our overall wish is to find some ${\widehat{u}}\in U$ , on a Hilbert space $U$ , solving for a given set-valued map $H:U\rightrightarrows U$ the variational inclusion

$$0\in H({\widehat{u}}). \tag{3}$$

Throughout the manuscript, ${\widehat{u}}$ stands for an arbitrary root of a relevant map $H$. In the present section 2, $H$ will be arbitrary, but in sections 4 and 5, where we specialise the results, we concentrate on $H$ arising from the saddle point problem (S).

Our strategy towards finding a solution ${\widehat{u}}$ is to introduce an arbitrary non-linear iteration-dependent preconditioner $V_{i+1}:U\to U$ and a step length operator $W_{i+1}\in\mathcal{L}(U;U)$ . With these, we define the generalised proximal point method, which on each iteration $i\in\mathbb{N}$ solves $u^{i+1}$ from

$$0\in W_{i+1}H(u^{i+1})+V_{i+1}(u^{i+1}). \tag{PP}$$

We assume that $V_{i+1}$ splits into $M_{i+1}\in\mathcal{L}(U;U)$ , and $V^{\prime}_{i+1}:U\to U$ as

$$V_{i+1}(u)=V^{\prime}_{i+1}(u)+M_{i+1}(u-u^{i}). \tag{4}$$

More generally, to rigorously extend our approach to cases that would otherwise involve set-valued $V_{i+1}$ , we also consider for $\widetilde{H}_{i+1}:U\rightrightarrows U$ the iteration

$$0\in\widetilde{H}_{i+1}(u^{i+1})+M_{i+1}(u^{i+1}-u^{i}). \tag{PP$\sim$}$$

We say that (PP) or (PP∼) is solvable for the iterates $\{u^{i+1}\}_{i\in\mathbb{N}}\subset U$ if given any $u^{0}\in U$ , we can solve the corresponding inclusion to iteratively calculate $u^{i+1}$ from $u^{i}$ for each $i\in\mathbb{N}$ .

2.2 Basic estimates

We analyse the preconditioned proximal point methods (PP) and (PP∼) by applying a testing operator $Z_{i+1}\in\mathcal{L}(U;U)$ , following the ideas introduced in [31]. The product $Z_{i+1}M_{i+1}$ with the linear part of the preconditioner, will, as we soon demonstrate, be an indicator of convergence rates. In essence, as seen in the descent inequality (DI) of the next result, the operator forms a local metric (in the differential geometric sense) that measures closeness to a solution.

Theorem 2.1.

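On a Hilbert space $U$, let $\widetilde{H}_{i+1}:U\rightrightarrows U$, and $M_{i+1},Z_{i+1}\in\mathcal{L}(U;U)$ for $i\in\mathbb{N}$. Suppose (PP∼) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset U$. If for all $i\in\mathbb{N}$, $Z_{i+1}M_{i+1}$ is self-adjoint, and for some $\Delta_{i+1}\in\mathbb{R}$ and ${\widehat{u}}\in U$ the fundamental condition

$$\langle\widetilde{H}_{i+1}(u^{i+1}),u^{i+1}-{\widehat{u}}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+2}M_{i+2}-Z_{i+1}M_{i+1}}^{2}-\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}-\Delta_{i+1}({\widehat{u}}), \tag{CI$\sim$}$$

holds, then so do the quantitative $\Delta$-Fejér monotonicity

$$\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+2}M_{i+2}}^{2}\leq\frac{1}{2}\|u^{i}-{\widehat{u}}\|_{Z_{i+1}M_{i+1}}^{2}+\Delta_{i+1}({\widehat{u}})\quad(i\in\mathbb{N}) \tag{QF}$$

as well as the descent inequality

$$\frac{1}{2}\|u^{N}-{\widehat{u}}\|_{Z_{N+1}M_{N+1}}^{2}\leq\frac{1}{2}\|u^{0}-{\widehat{u}}\|_{Z_{1}M_{1}}^{2}+\sum_{i=0}^{N-1}\Delta_{i+1}({\widehat{u}})\quad(N\geq 1). \tag{DI}$$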


The main condition (CI∼) of theorem 2.1 essentially writes in abstract and step-dependent form the three-point formulas that hold for convex smooth functions (see appendix B). The term $\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+2}M_{i+2}-Z_{i+1}M_{i+1}}^{2}$ is able to measure the strong monotonicity of $H$ or the approximation $\widetilde{H}_{i+1}$ . Indeed, if we have the estimate

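$$\langle\widetilde{H}_{i+1}(u^{i+1}),u^{i+1}-{\widehat{u}}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+1}\Gamma}^{2},$$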

then this suggests to update the local metrics as

$$Z_{i+2}M_{i+2}\simeq Z_{i+1}(M_{i+1}+\Gamma),$$

where we write $\simeq$ to indicate that only the norm induced by the two operators has to be the same: $Z_{i+1}\Gamma$ might not be self-adjoint, while $Z_{i+2}M_{i+2}$ has to be self-adjoint. As we will see in section 4.2, these metric update and self-adjointness conditions effectively give popular primal–dual optimisation methods their necessary forms. The term $\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ , on the other hand, as we shall see in more detail in section 2.3, gives the necessary leeway for taking a forward step instead of a proximal step with respect to some components of $H$ . The term $\Delta_{i+1}$ can model function value differences or duality gaps, as will be the case in this work, but in other contexts, such as the stochastic methods of our companion paper [30], it will be a penalty for the dissatisfaction of the metric update; hence the negated sign and the right-hand position in (DI).

Specialised to (PP), we obtain the following result. The condition (CI) is often more practical to verify than (CI∼) thanks to the additional structure introduced by $H({\widehat{u}})\ni 0$. Indeed, in many of our examples, we can eliminate $H$ through monotonicity. To derive gap and function value estimates in section 5, we will however need (CI∼).

Corollary 2.2.

On a Hilbert space $U$, let $H:U\rightrightarrows U$. Also let $Z_{i+1},W_{i+1},M_{i+1}\in\mathcal{L}(U;U)$, and $V^{\prime}_{i+1}:U\to U$ for $i\in\mathbb{N}$. Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset U$ with $V_{i+1}$ as in (4). Let ${\widehat{u}}\in H^{-1}(0)$. If for all $i\in\mathbb{N}$, $Z_{i+1}M_{i+1}$ is self-adjoint, and for some $\Delta_{i+1}\in\mathbb{R}$ the condition

$$\langle W_{i+1}[H(u^{i+1})-H({\widehat{u}})]+V^{\prime}_{i+1}(u^{i+1}),u^{i+1}-{\widehat{u}}\rangle_{Z_{i+1}}\geq\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+2}M_{i+2}-Z_{i+1}M_{i+1}}^{2}-\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}-\Delta_{i+1}({\widehat{u}}), \tag{CI}$$

holds, then (CI∼), (QF), and (DI) hold for $\widetilde{H}_{i+1}(u):=W_{i+1}H(u)+V^{\prime}_{i+1}(u)$.

Proof 2.3 (Proof of theorem 2.1).

Inserting (PP∼) into (CI∼), we obtain

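$$\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}+\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{Z_{i+1}M_{i+1}-Z_{i+2}M_{i+2}}^{2}-\langle u^{i+1}-u^{i},u^{i+1}-{\widehat{u}}\rangle_{Z_{i+1}M_{i+1}}\geq-\Delta_{i+1}({\widehat{u}}). \tag{5}$$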

We recall for general self-adjoint $M$ the three-point formula

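$$\langle u^{i+1}-u^{i},u^{i+1}-{\widehat{u}}\rangle_{M}=\frac{1}{2}\|u^{i+1}-u^{i}\|_{M}^{2}-\frac{1}{2}\|u^{i}-{\widehat{u}}\|_{M}^{2}+\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{M}^{2}. \tag{6}$$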

Using this with $M=Z_{i+1}M_{i+1}$, we rewrite (5) as the quantitative $\Delta$-Fejér monotonicity (QF). Summing this over $i=0,\ldots,N-1$, we obtain the descent inequality (DI).
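For readers who want the two omitted algebraic steps spelled out, the following sketch expands the three-point formula and then carries out the substitution that leads to (QF). It only uses the self-adjointness of the operators involved. The shorthand $A_{i+1}:=Z_{i+1}M_{i+1}$ is introduced here for brevity and is not notation from the text.

```latex
% Step 1: for self-adjoint M, expand the square of u^i - \widehat{u}.
\begin{align*}
\|u^{i}-\widehat{u}\|_{M}^{2}
  &= \|(u^{i+1}-\widehat{u})-(u^{i+1}-u^{i})\|_{M}^{2} \\
  &= \|u^{i+1}-\widehat{u}\|_{M}^{2}
     -2\langle u^{i+1}-u^{i},u^{i+1}-\widehat{u}\rangle_{M}
     +\|u^{i+1}-u^{i}\|_{M}^{2}.
\end{align*}
% Solving for the inner product gives the three-point formula.
%
% Step 2: with A_{i+1} := Z_{i+1}M_{i+1} (shorthand assumed here), use the
% formula with M = A_{i+1} in the inequality obtained by inserting the
% iteration into the fundamental condition, and note that
% \|x\|_{A_{i+1}-A_{i+2}}^2 = \|x\|_{A_{i+1}}^2 - \|x\|_{A_{i+2}}^2.
\begin{align*}
-\Delta_{i+1}(\widehat{u})
  &\le \tfrac{1}{2}\|u^{i+1}-u^{i}\|_{A_{i+1}}^{2}
     +\tfrac{1}{2}\|u^{i+1}-\widehat{u}\|_{A_{i+1}-A_{i+2}}^{2}
     -\langle u^{i+1}-u^{i},u^{i+1}-\widehat{u}\rangle_{A_{i+1}} \\
  &= \tfrac{1}{2}\|u^{i}-\widehat{u}\|_{A_{i+1}}^{2}
     -\tfrac{1}{2}\|u^{i+1}-\widehat{u}\|_{A_{i+2}}^{2}.
\end{align*}
```

Rearranging the last line gives the one-step estimate, and the sum over $i=0,\ldots,N-1$ telescopes because the weight $A_{i+2}$ on the left at step $i$ is the weight on the right at step $i+1$.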

Remark 2.4 (Bregman divergences and Banach spaces).

Let $X$ be a Banach space and $J\in\mathrm{cpl}(X)$. Then for $x\in\operatorname{dom}J$ and $p\in\partial J(x)$ one can define the asymmetric Bregman divergence (or distance)

$$D_{J}^{p}(z,x):=J(z)-J(x)-\langle p\,|\,z-x\rangle_{X},\quad(x\in X),$$

where $\langle\,\boldsymbol{\cdot}\,\,|\,\,\boldsymbol{\cdot}\,\rangle_{X}:X^{*}\times X\to\mathbb{R}$ denotes the dual product. This is non-negative, but not a true distance, as it can happen that $D_{J}^{p}(z,x)=0$ for $z\neq x$ . However with ${\widehat{x}},z\in\operatorname{dom}J$ and $q\in\partial J(z)$ , we deduce [9]

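$$D_{J}^{p}({\widehat{x}},x)-D_{J}^{q}({\widehat{x}},z)+D_{J}^{q}(x,z)=[J({\widehat{x}})-J(x)-\langle p\,|\,{\widehat{x}}-x\rangle_{X}]-[J({\widehat{x}})-J(z)-\langle q\,|\,{\widehat{x}}-z\rangle_{X}]+[J(x)-J(z)-\langle q\,|\,x-z\rangle_{X}]=\langle p-q\,|\,x-{\widehat{x}}\rangle_{X}.$$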

Therefore, the Bregman distance satisfies an analogue of the standard three-point identity (6). It allows generalising our techniques to Banach spaces and the algorithm

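$$0\in Z_{i+1}\widetilde{H}_{i+1}(u^{i+1})+(p^{i+1}-q^{i})\quad\text{with}\quad q^{i}\in\partial J_{i+1}(u^{i})\quad\text{and}\quad p^{i+1}\in\partial J_{i+1}(u^{i+1})$$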

where for each $i\in\mathbb{N}$ now $Z_{i+1}M_{i+1}$ has been replaced by $J_{i+1}\in\mathrm{cpl}(X)$ . The convergence will, however, be with respect to $D_{J_{i+1}}$ . Indeed, if $X$ is, in fact, a Hilbert space and we take $J_{i+1}(x)=\frac{1}{2}\|x\|_{Z_{i+1}M_{i+1}}^{2}$ , then $D_{J_{i+1}}^{x-z}(z,x)=\frac{1}{2}\|z-x\|_{Z_{i+1}M_{i+1}}^{2}$ .

Proximal point methods based on general Bregman divergences in place of the squared norm are studied in, e.g., [6, 9, 15, 16].

The next two results demonstrate how the estimate of theorem 2.1 can be used to prove convergence with or without rates.

Proposition 2.5 (Convergence with a rate).

Suppose the descent inequality (DI) holds with $\Delta_{i+1}({\widehat{u}})\leq 0$ , and that $Z_{N+1}M_{N+1}\geq\mu(N)I$ for all $N\geq 1$ . Then $\|u^{N}-{\widehat{u}}\|^{2}\to 0$ at the rate $O(1/\mu(N))$ .

Proof 2.6.

Immediate from (DI).

We can also obtain superlinear convergence from (QF), a form of quantitative Fejér monotonicity when $\Delta_{i+1}({\widehat{u}})\leq 0$.

Proposition 2.7 (Superlinear convergence).

Suppose (QF) holds with $\Delta_{i+1}({\widehat{u}})\leq 0$ , and that $Z_{i+1}M_{i+1}=\phi_{i}I$ for some $\phi_{i}$ for all $i\in\mathbb{N}$ . If $\phi_{i}/\phi_{i+1}\to 0$ , then $u^{N}\to{\widehat{u}}$ superlinearly.

Proof 2.8.

Immediate from (QF).

The scalar $\phi_{N}$ has its index off-by-one intentionally; the reason will become more apparent once we get to primal–dual methods. It is also possible to obtain superlinear convergences of different orders $q>1$ from (DI) or (QF). However, the conventional notions $\|u^{i+1}-{\widehat{u}}\|/\|u^{i}-{\widehat{u}}\|^{q}\to c\in\mathbb{R}$ cannot be characterised without involving the iterates. Indeed, assuming $\phi_{i+1}\geq C/\|x^{i}-{\widehat{x}}\|^{2q}$, \eqref{eq:convergence-result-main-h} characterises superlinear convergence of order $q$. It would also be possible to introduce new notions of the order of superlinear convergence, not involving the iterates and more in spirit with the testing approach, such as $\phi_{i}^{q}/\phi_{i+1}\to c$, if such a notion would turn out to be useful.

To obtain weak convergence, we do not need $Z_{i+1}M_{i+1}$ to grow, but we need some additional technical assumptions. First of all, some of the leeway that the fundamental condition (CI∼) included for the forward steps is now required to obtain convergence. Secondly, we need some weak-to-strong outer semicontinuity from $H$, which we write more abstractly in terms of $\widetilde{H}_{i+1}$. It would be possible to improve this requirement based on the Brezis–Crandall–Pazy property [3].

Proposition 2.9 (Weak convergence).

Suppose for all $i\in\mathbb{N}$ that $Z_{i}M_{i}=Z_{0}M_{0}\geq 0$ is self-adjoint, and that the iterates of the preconditioned proximal point method (PP∼) satisfy the fundamental condition (CI∼) with $\Delta_{i+1}({\widehat{u}})\leq-\frac{\delta}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ for all ${\widehat{u}}\in H^{-1}(0)$ and some $\delta>0$. Suppose either that $Z_{0}M_{0}$ has a bounded inverse, or that $(Z_{0}H+Z_{0}M_{0})^{-1}\circ Z_{0}M_{0}$ is bounded on bounded sets. If $H$ is strong-to-strong outer semicontinuous and

$$w^{i+1}:=-Z_{0}M_{0}(u^{i+1}-u^{i})\to 0,\quad w^{i_{k}}\in Z_{0}\widetilde{H}_{i_{k}}(u^{i_{k}}),\quad u^{i_{k}}\rightharpoonup u\implies 0\in H(u), \tag{7}$$

then $Z_{0}M_{0}(u^{i}-{\widehat{u}})\mathrel{\rightharpoonup}0$ weakly in $U$ for some ${\widehat{u}}\in H^{-1}(0)$.

For the proof, we use the next lemma. Its earliest version is contained in the proof of [22, Theorem 1], but can be found more explicitly stated as [5, Lemma 6].

Lemma 2.10.

On a Hilbert space $X$ , let $\hat{X}\subset X$ be closed and convex, and $\{x^{i}\}_{i\in\mathbb{N}}\subset X$ . If the following conditions hold, then $x^{i}\mathrel{\rightharpoonup}x^{*}$ weakly in $X$ for some $x^{*}\in\hat{X}$ :

(i) $i\mapsto\|x^{i}-x^{*}\|$ is non-increasing for all $x^{*}\in\hat{X}$ (Fejér monotonicity).

(ii) All weak limit points of $\{x^{i}\}_{i\in\mathbb{N}}$ belong to $\hat{X}$.

Proof 2.11 (Proof of proposition 2.9).


To use lemma 2.10, we need a closed and convex solution set. However, $H^{-1}(0)$ may generally be non-convex and not closed. Since $Z_{i+1}M_{i+1}=Z_{i+2}M_{i+2}$ , using the strong-to-strong outer semicontinuity of $H$ , it is easy to see that (CI∼) holds for all ${\widehat{u}}\in\hat{U}:=\operatorname{cl}\operatorname{conv}H^{-1}(0)$ . Consequently the descent inequality (DI) holds for all ${\widehat{u}}\in\hat{U}$ .

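We apply theorem 2.1 on any ${\widehat{u}}\in\hat{U}$. From the quantitative $\Delta$-Fejér monotonicity (QF), since $\Delta_{i+1}({\widehat{u}})\leq-\frac{\delta}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ and $Z_{i+1}M_{i+1}\equiv Z_{0}M_{0}=:A$, we have

$$\frac{1}{2}\|u^{i+1}-{\widehat{u}}\|_{A}^{2}+\frac{\delta}{2}\|u^{i+1}-u^{i}\|_{A}^{2}\leq\frac{1}{2}\|u^{i}-{\widehat{u}}\|_{A}^{2}. \tag{8}$$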

This implies the condition lemma 2.10 (i) for the sequence $\{x^{i}:=A^{1/2}u^{i}\}_{i\in\mathbb{N}}$ .

Let then $w^{i+1}:=-A(u^{i+1}-u^{i})$ as in (7). From (8), we deduce that $w^{i+1}\to 0$ as $i\to\infty$. By (PP∼) and (7), any weak limit point $u^{*}$ of the sequence $\{u^{i}\}_{i\in\mathbb{N}}$ then satisfies $u^{*}\in H^{-1}(0)\subset\hat{U}$. Let then $x^{*}$ be any weak limit point of $\{x^{i}\}_{i\in\mathbb{N}}$. We need to show that $x^{*}\in\hat{X}:=A^{1/2}\hat{U}$. If $Z_{0}M_{0}=A$ has a bounded inverse, then this is clear as the weak convergence of $\{x^{i_{k}}\}$ implies the weak convergence of $\{u^{i_{k}}=A^{-1/2}x^{i_{k}}\}$. Otherwise, when $(Z_{0}H+Z_{0}M_{0})^{-1}\circ Z_{0}M_{0}$ is bounded on bounded sets, since $u^{i+1}\in(Z_{0}H+Z_{0}M_{0})^{-1}(Z_{0}M_{0}u^{i})=(Z_{0}H+A)^{-1}(A^{1/2}x^{i})$, we see that $\{u^{i+1}\}_{i\in\mathbb{N}}$ is bounded. Hence a subsequence converges to some $u^{*}\in H^{-1}(0)$. But this implies that $x^{*}=A^{1/2}u^{*}$ as required.

By lemma 2.10 now $x^{i}\mathrel{\rightharpoonup}x^{*}\in A^{1/2}\hat{U}$. This implies $Z_{0}M_{0}(u^{i}-u^{*})\mathrel{\rightharpoonup}0$ weakly for some $u^{*}\in H^{-1}(0)$.

2.3 Examples of first-order methods

We now look at several concrete examples.


Example 2.12 (The proximal point method).

For all $i\in\mathbb{N}$ , take $M_{i}=I$ , $V^{\prime}_{i}=0$ , and $W_{i+1}=\tau_{i}I$ for some $\tau_{i}>0$ . Then (PP) is the standard proximal point method $u^{i+1}\in(I+\tau_{i}H)^{-1}(u^{i})$ . If the operator $H:U\rightrightarrows U$ is maximal monotone, $\{u^{i}\}_{i\in\mathbb{N}}$ converges weakly to some ${\widehat{u}}\in H^{-1}(0)$ for any starting point $u^{0}\in U$ .

Proof 2.13 (Proof of convergence).

We take $Z_{i+1}=\phi_{i}I$ for some $\phi_{i}>0$ . Then the fundamental condition (CI) reads

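$$\phi_{i}\tau_{i}\langle H(u^{i+1})-H({\widehat{u}}),u^{i+1}-{\widehat{u}}\rangle\geq\frac{\phi_{i+1}-\phi_{i}}{2}\|u^{i+1}-{\widehat{u}}\|^{2}-\frac{\phi_{i}}{2}\|u^{i+1}-u^{i}\|^{2}-\Delta_{i+1}({\widehat{u}}). \tag{9}$$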

As long as $\phi_{i}\geq\phi_{i+1}$ , the monotonicity of $H$ clearly proves (9), thus (CI), with $\Delta_{i+1}({\widehat{u}})=-\frac{\phi_{i}}{2}\|u^{i+1}-u^{i}\|^{2}$ . Using the maximal monotonicity, Minty’s theorem guarantees the solvability of (PP). Thus the conditions of corollary 2.2 are satisfied. Maximal monotonicity also guarantees that $H$ is weak-to-strong outer semicontinuous; see lemma A.1. This establishes the iteration outer semicontinuity condition (7). Taking $\phi_{i}\equiv\phi_{0}$ for constant $\phi_{0}>0$ , so that $Z_{i+1}M_{i+1}=Z_{0}M_{0}=\phi_{0}I$ , it remains to refer to proposition 2.9.
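As a small illustration of the monotone but not strongly monotone case, the sketch below runs the proximal point method on the skew-symmetric operator $H(u)=(-u_{2},u_{1})$ on $\mathbb{R}^{2}$. This operator, the step length and the starting point are illustrative assumptions, not an example from the paper. The resolvent is a $2\times 2$ linear solve written out by hand, and the recorded distances to the unique root $0$ let one observe the Fejér monotonicity.

```python
import math


def prox_point_rotation(u0=(1.0, 0.0), tau=0.5, iters=50):
    """Proximal point steps u+ = (I + tau*H)^{-1}(u) for H(u) = (-u2, u1).

    H is linear and monotone (skew-symmetric, so <H(u), u> = 0) but not
    strongly monotone; its only root is the origin.
    """
    u1, u2 = u0
    norms = [math.hypot(u1, u2)]
    for _ in range(iters):
        # Solve (I + tau*H) v = u, i.e. v1 - tau*v2 = u1 and tau*v1 + v2 = u2.
        det = 1.0 + tau * tau
        u1, u2 = (u1 + tau * u2) / det, (u2 - tau * u1) / det
        norms.append(math.hypot(u1, u2))
    return (u1, u2), norms


u, norms = prox_point_rotation()
print(norms[0], norms[-1])
```

Each step shrinks the distance to the root by the factor $1/\sqrt{1+\tau^{2}}$. An explicit step $u-\tau H(u)$ on the same operator would instead enlarge it by $\sqrt{1+\tau^{2}}$, which is one way to see why the implicit step matters here.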

Suppose $H$ is strongly monotone, that is, for some $\gamma>0$ holds

$$\langle H(u)-H(u^{\prime}),u-u^{\prime}\rangle\geq\gamma\|u-u^{\prime}\|^{2}\quad(u,u^{\prime}\in U).$$

Then from (9), we immediately also derive convergence rates as follows. Letting $\phi_{i}\nearrow\infty$ will obviously give the fastest convergence, however, the $O(1/N^{2})$ step length rule will be useful later on with splitting methods, combining the simple proximal step with other algorithmic elements.

Example 2.14 (Acceleration and linear convergence of the proximal point method).

Suppose $H$ is strongly monotone for some factor $\gamma>0$ . If we choose $\tau_{i+1}:=\tau_{i}/\sqrt{1+2\gamma\tau_{i}}$ , then the proximal point method satisfies $\|u^{N}-{\widehat{u}}\|^{2}\to 0$ at the rate $O(1/N^{2})$ . If we keep $\tau_{i}=\tau_{0}>0$ constant, we get linear convergence of the iterates. If $\tau_{i}\nearrow\infty$ , we get superlinear convergence.

Proof 2.15 (Proof of convergence).

Clearly (9) holds with $\Delta_{i+1}({\widehat{u}})=0$ provided we update

$$\phi_{i+1}:=\phi_{i}(1+2\gamma\tau_{i}).$$

Then theorem 2.1 gives the descent inequality (DI), which now reads

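$$\frac{\phi_{N}}{2}\|u^{N}-{\widehat{u}}\|^{2}\leq\frac{\phi_{0}}{2}\|u^{0}-{\widehat{u}}\|^{2}\quad(N\geq 1).$$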

If we take $\tau_{i}=\phi_{i}^{-1/2}$, this reads $\phi_{i+1}:=\phi_{i}+2\gamma\phi_{i}^{1/2}$, which is consistent with the step length rule $\tau_{i+1}=\tau_{i}/\sqrt{1+2\gamma\tau_{i}}$. Since $\phi_{N}$ is of the order $\Theta(N^{2})$ [7, 31], we get the claimed $O(1/N^{2})$ convergence from (2.15). If, on the other hand, we keep $\tau_{i}\equiv\tau_{0}$ fixed, then clearly $\phi_{N}=(1+2\gamma\tau_{0})^{N}\phi_{0}$. Since this is exponential when $\gamma>0$, we get linear convergence from (2.15). Finally, if $\tau_{i}\nearrow\infty$, we see from (2.15) that $\phi_{i}/\phi_{i+1}\searrow 0$. We now obtain superlinear convergence from corollary 2.2 and proposition 2.7.
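The accelerated step length rule can be checked numerically on the simplest strongly monotone operator. The sketch below assumes the scalar operator $H(u)=\gamma u$ with root $0$; the function name and the parameter values are illustrative choices, not from the paper. It tracks the testing variable alongside the iterate, initialised as $\phi_{0}=\tau_{0}^{-2}$, and verifies the descent inequality at every step.

```python
import math


def accelerated_prox_point(gamma=1.0, tau0=1.0, u0=1.0, iters=1000):
    """Proximal point method for the scalar operator H(u) = gamma*u (root 0).

    Uses tau_{i+1} = tau_i / sqrt(1 + 2*gamma*tau_i) together with the
    testing variable phi_{i+1} = phi_i * (1 + 2*gamma*tau_i), phi_0 = tau_0**-2.
    """
    tau, phi, u = tau0, tau0 ** -2, u0
    bound = 0.5 * phi * u * u  # (phi_0 / 2) * |u^0 - 0|^2
    ok = True
    for _ in range(iters):
        u = u / (1.0 + tau * gamma)            # resolvent (I + tau*H)^{-1}
        phi = phi * (1.0 + 2.0 * gamma * tau)  # testing variable update
        tau = tau / math.sqrt(1.0 + 2.0 * gamma * tau)
        # Descent inequality: (phi_N / 2)|u^N|^2 <= (phi_0 / 2)|u^0|^2.
        ok = ok and 0.5 * phi * u * u <= bound * (1.0 + 1e-12)
    return u, phi, ok


u, phi, ok = accelerated_prox_point()
print(ok, phi / 1000 ** 2)
```

With these updates the relation $\phi_{i}=\tau_{i}^{-2}$ is preserved along the iteration, so $\sqrt{\phi_{i}}$ grows by slightly less than $\gamma$ per step and $\phi_{N}/N^{2}$ stays bounded above and below, in line with the $\Theta(N^{2})$ growth used in the proof.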

The next lemma starts our analysis of gradient descent and forward–backward splitting. It relies on the three-point smoothness inequalities of appendix B, which the reader may want to study at this point.

Lemma 2.16.

Let $H=\partial G+\nabla J$ for $G,J\in\mathrm{cpl}(X)$ such that $\nabla J$ is $L$-Lipschitz. For all $i\in\mathbb{N}$, take $M_{i+1}\equiv I$ and $V^{\prime}_{i+1}(u):=\tau_{i}(\nabla J(u^{i})-\nabla J(u))$ with $W_{i+1}=\tau_{i}I$ as well as $Z_{i+1}\equiv\phi_{i}I$ for some $\tau_{i},\phi_{i}>0$. Then the fundamental condition (CI) holds if

(i)

$\phi_{i}=\phi_{0}$ is constant, $\tau_{i}L<2$, and $\Delta_{i+1}({\widehat{u}}):=-\phi_{i}(1-\tau_{i}L/2)\|u^{i+1}-u^{i}\|^{2}/2$. In this case the iteration outer semicontinuity condition (7) moreover holds provided $\inf_{i}\tau_{i}>0$.

If $J$ is strongly convex with factor $\gamma>0$ , alternatively:

(ii)

$\tau_{0}L^{2}<\gamma$ , $\phi_{i+1}:=\phi_{i}(1+\tau_{i}(2\gamma-\tau_{i}L^{2}))$ , $\tau_{i}:=\phi_{i}^{-1/2}$ or $\tau_{i}:=\tau_{0}$ , and $\Delta_{i+1}({\widehat{u}})=0$ .

Proof 2.17.


We expand the fundamental condition (CI) as

[TABLE]

By the monotonicity of $\partial G$ , this holds if

[TABLE]

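(i) The three-point inequality (76) in lemma B.1 states

$$\langle\nabla J(u^{i})-\nabla J({\widehat{u}}),u^{i+1}-{\widehat{u}}\rangle\geq-\frac{L}{4}\|u^{i+1}-u^{i}\|^{2}.$$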

This clearly reduces (10) to

[TABLE]

which holds under the conditions of (i). The satisfaction of (7) is immediate from the weak-to-strong outer semicontinuity of $\partial G$ (lemma A.1), the Lipschitz continuity of $\nabla J$, and the bounds on $\tau_{i}$.

(ii) The three-point smoothness inequality (79) in lemma B.3 gives

[TABLE]


Inserting this into (10), we see it to hold with $\Delta_{i+1}({\widehat{u}})=0$ if

[TABLE]

Clearly our two alternative choices of $\{\tau_{i}\}_{i\in\mathbb{N}}$ are non-increasing. Therefore, (11) follows from the initialisation condition $\tau_{0}L^{2}<\gamma$ and the update rule $\phi_{i+1}:=\phi_{i}+\phi_{i}\tau_{i}(2\gamma-\tau_{i}L^{2})$ in (ii).

\cbstart

Remark 2.18.

It is also possible to exploit the strong convexity of $G$ instead of $J$ for acceleration.

\cbend

Example 2.19 (Gradient descent).

\cbstart

Let $H=\nabla J$ for $J\in\mathrm{cpl}(U)$ with $\nabla J$ $L$ -Lipschitz. \cbend Taking $\tau_{i}=\tau$ constant and $G=0$ in lemma 2.16, (PP) reads

$$u^{i+1}:=u^{i}-\tau\nabla J(u^{i}).$$

This is the gradient descent method. Direct application of lemma 2.16 (i) with $u=u^{i+1}$ and $u^{*}={\widehat{u}}$ together with corollaries 2.2 and 2.9 now verifies the well-known weak convergence of the method to a root ${\widehat{u}}$ of $H$ when $\tau L<2$ .

Observe that $V_{i+1}=\nabla Q_{i+1}$ for

[TABLE]

Each step of (PP) therefore minimises the surrogate objective [11]

[TABLE]

The function $Q_{i+1}$ on one hand penalises long steps, and on the other hand allows longer steps when the local linearisation error is large. In this example, $Q_{i+1}$ is, in fact, a Bregman divergence.
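As a numerical illustration of the step length bound $\tau L<2$ , the following minimal Python sketch runs the gradient descent iteration on a diagonal quadratic. The test function, the helper name `gradient_descent`, and the iteration counts are illustrative assumptions and are not part of the analysis above.

```python
def gradient_descent(grad, u0, tau, iters):
    """Iterate u^{i+1} = u^i - tau * grad(u^i) with a fixed step length tau."""
    u = list(u0)
    for _ in range(iters):
        g = grad(u)
        u = [ui - tau * gi for ui, gi in zip(u, g)]
    return u


# Hypothetical test problem: J(u) = (u_1^2 + 4 u_2^2) / 2, so that grad J is
# L-Lipschitz with L = 4 and the unique root of grad J is the origin.
L = 4.0


def grad_J(u):
    return [u[0], 4.0 * u[1]]


u_ok = gradient_descent(grad_J, [1.0, 1.0], tau=1.9 / L, iters=500)   # tau * L < 2
u_bad = gradient_descent(grad_J, [1.0, 1.0], tau=2.1 / L, iters=500)  # tau * L > 2
```

In this sketch `u_ok` approaches the root, while the second coordinate of `u_bad` grows in magnitude, since the step length violates the bound on the stiffest coordinate.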

\cbstart

Under strong convexity, we again get rates via lemma 2.16 (ii). Minding our remarks before example 2.14, we only state the case $\tau_{i}=\tau_{0}$ . Due to the upper bound $\tau_{0}<\gamma/L^{2}$ , we cannot get superlinear convergence as in example 2.14.

Example 2.20 (Acceleration and linear convergence of gradient descent).

Continuing from example 2.19, if $J$ is strongly convex with factor $\gamma>0$ and $\nabla J$ is $L$ -Lipschitz, and we keep $\tau_{i}=\tau_{0}<\gamma/L^{2}$ fixed, we get linear convergence.

\cbend

Now comes the full power of lemma 2.16: we can easily bolt on a proximal step to gradient descent.

Example 2.21 (Forward–backward splitting).

Let $H=\partial G+\nabla J$ for $G,J\in\mathrm{cpl}(X)$ with $\nabla J$ Lipschitz. Taking $M_{i+1}$ , $W_{i+1}$ , and $V^{\prime}_{i+1}$ as in lemma 2.16, the preconditioned proximal point method (PP) becomes

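$$0\in\tau_{i}\partial G(u^{i+1})+\tau_{i}\nabla J(u^{i})+u^{i+1}-u^{i}.$$

This is the forward–backward splitting method

$$u^{i+1}:=(I+\tau_{i}\partial G)^{-1}(u^{i}-\tau_{i}\nabla J(u^{i})).$$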

By lemma 2.16, convergence and acceleration work exactly as for gradient descent in examples 2.19 and 2.20.
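To make the forward and backward steps concrete, the following Python sketch applies forward–backward splitting to a small separable problem with $G=\lambda\|\cdot\|_{1}$ and $J(u)=\frac{1}{2}\sum_{j}a_{j}(u_{j}-b_{j})^{2}$ . The problem data, the function names, and the choice $\tau L=1.5$ are assumptions made for illustration only; for this particular problem the minimiser is available in closed form, which allows a direct comparison.

```python
def soft_threshold(v, t):
    """Proximal map of t * |.|, that is, (I + t * d|.|)^{-1}(v)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def forward_backward(a, b, lam, tau, u0, iters):
    """Minimise lam * |u|_1 + sum_j a_j * (u_j - b_j)^2 / 2."""
    u = list(u0)
    for _ in range(iters):
        # forward (gradient) step on J ...
        v = [uj - tau * aj * (uj - bj) for uj, aj, bj in zip(u, a, b)]
        # ... followed by a backward (proximal) step on G
        u = [soft_threshold(vj, tau * lam) for vj in v]
    return u


a, b, lam = [1.0, 4.0], [3.0, 0.1], 1.0
L = max(a)  # Lipschitz factor of grad J for this diagonal quadratic
u_fb = forward_backward(a, b, lam, tau=1.5 / L, u0=[0.0, 5.0], iters=300)
# Closed-form minimiser of this separable problem, used only for comparison.
u_exact = [soft_threshold(bj, lam / aj) for aj, bj in zip(a, b)]
```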

\cbstart

We can also do fully non-smooth splitting methods by a lifting approach: \cbend

Example 2.22 (Douglas–Rachford splitting).

Let $A,B:U\rightrightarrows U$ be maximal monotone operators. Consider the problem of finding ${\widehat{u}}$ with $0\in A({\widehat{u}})+B({\widehat{u}})$ . For $\lambda>0$ , let

[TABLE]

Then $0\in A({\widehat{u}})+B({\widehat{u}})$ if and only if $0\in H({\widehat{u}},{\widehat{v}})$ , where ${\widehat{v}}\in({\widehat{u}}-\lambda A({\widehat{u}}))\cap({\widehat{u}}+\lambda B({\widehat{u}}))$ . The preconditioned proximal point method (PP∼) becomes the Douglas–Rachford splitting [12]

[TABLE]

We work with (PP∼) since in (PP), $V^{\prime}_{i+1}$ would have to be set-valued. If $A$ and $B$ are maximal monotone, the variables $\{v^{i}\}_{i\in\mathbb{N}}$ converge weakly to ${\widehat{v}}$ .

Proof 2.23 (Proof of convergence).

Write $\bar{u}^{i}:=(u^{i},v^{i})$ and $\widehat{\bar{u}}:=({\widehat{u}},{\widehat{v}})$ . Observe that

[TABLE]

Using the monotonicity of $A$ and $B$ , with $Z_{i+1}:=I$ , we have

[TABLE]

Thus the fundamental condition (CI∼) holds with $\Delta_{i+1}(\widehat{\bar{u}}):=-\frac{1}{2}\|\bar{u}^{i+1}-\bar{u}^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ . Using (13) and the weak-to-strong outer semicontinuity of $A$ and $B$ (see lemma A.1), we easily verify (7). \cbstart Since $Z_{i+1}M_{i+1}\equiv Z_{0}M_{0}$ is non-invertible, we also have to verify that $(Z_{0}H+Z_{0}M_{0})^{-1}\circ Z_{0}M_{0}$ is bounded on bounded sets. This is to say that (14) bounds $\bar{u}^{i+1}=(u^{i+1},v^{i})$ in terms of $v^{i}$ . This is an easy consequence of the Lipschitz-continuity of the resolvent of maximal monotone operators [1, Corollary 23.10]. \cbend Weak convergence now follows from theorems 2.1 and 2.9.
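The following Python sketch illustrates Douglas–Rachford splitting on a scalar problem. It assumes the standard form $u^{i+1}:=(I+\lambda B)^{-1}(v^{i})$ , $v^{i+1}:=v^{i}+(I+\lambda A)^{-1}(2u^{i+1}-v^{i})-u^{i+1}$ , which is consistent with the characterisation ${\widehat{v}}\in({\widehat{u}}-\lambda A({\widehat{u}}))\cap({\widehat{u}}+\lambda B({\widehat{u}}))$ above. The operators $A=\partial|\cdot|$ and $B(u)=u-3$ , as well as all function names, are hypothetical choices for illustration.

```python
def resolvent_A(v, lam):
    """(I + lam * A)^{-1}(v) for A = d|.|, i.e. soft thresholding."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def resolvent_B(v, lam, c=3.0):
    """(I + lam * B)^{-1}(v) for B(u) = u - c: solve u + lam * (u - c) = v."""
    return (v + lam * c) / (1.0 + lam)


def douglas_rachford(v0, lam, iters):
    v = v0
    u = resolvent_B(v, lam)
    for _ in range(iters):
        u = resolvent_B(v, lam)
        w = resolvent_A(2.0 * u - v, lam)
        v = v + w - u
    return u, v


lam = 1.0
u_dr, v_dr = douglas_rachford(v0=10.0, lam=lam, iters=100)
# For this problem 0 in A(u) + B(u) has the solution u = 2, and v = 2 - lam.
```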

2.4 Examples of second-order methods

\cbstart

We now look at how our techniques are applicable to Newton’s method. Through the three-point inequalities of lemma B.5 for $C^{2}$ functions, the analysis turns out to be very close to that of gradient descent. Our analysis is not as short as the conventional analysis of Newton’s method, but has its advantages. Indeed, the convergence of proximal Newton’s method will be an automatic corollary of our approach, exactly as the convergence of forward–backward splitting was a corollary of the convergence of gradient descent. \cbend

Example 2.24 (Newton’s method).

Suppose $H=\nabla J$ for $J\in C^{2}(U)$ . Take

[TABLE]

Then the preconditioned proximal point method (PP) reads

[TABLE]

This is Newton’s method. \cbstart By lemma 2.26 (below) and proposition 2.5, we obtain local linear convergence if $\nabla^{2}J({\widehat{u}})>0$ . By lemma 2.28 (below), this convergence is, further, superlinear (quadratic if $\nabla^{2}J$ is locally Lipschitz near ${\widehat{u}}$ ). \cbend

Observe that now $V_{i+1}(u)$ is the gradient of

[TABLE]

In the surrogate objective (12), this allows longer steps when the second-order Taylor expansion under-approximates, and forces shorter steps when it over-approximates.
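The local quadratic convergence can be observed on a scalar example. The sketch below assumes the classical Newton update $u^{i+1}=u^{i}-[\nabla^{2}J(u^{i})]^{-1}\nabla J(u^{i})$ and the hypothetical test function $J(u)=e^{u}-2u$ , whose minimiser is $\ln 2$ . The names `newton`, `trace`, and `errors` are illustrative.

```python
import math


def newton(grad, hess, u0, iters):
    """Scalar Newton iteration u <- u - grad(u) / hess(u); returns all iterates."""
    u = u0
    trace = [u]
    for _ in range(iters):
        u = u - grad(u) / hess(u)
        trace.append(u)
    return trace


# Hypothetical test function J(u) = exp(u) - 2u with minimiser ln 2.
trace = newton(lambda u: math.exp(u) - 2.0, lambda u: math.exp(u), u0=1.0, iters=6)
errors = [abs(u - math.log(2.0)) for u in trace]
```

In this example each of the first few errors is smaller than the square of the previous one, which is the behaviour expected of a quadratically convergent method.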

\cbstart

Again, we can easily bolt on a proximal step: \cbend

Example 2.25 (Proximal Newton’s method).

Let $H=\partial G+\nabla J$ for $J\in C^{2}(X)$ , and $G\in\mathrm{cpl}(X)$ . Taking $M_{i+1}$ , $W_{i+1}$ , and $V^{\prime}_{i+1}$ as in example 2.24, the preconditioned proximal point method (PP) becomes

[TABLE]

This is the proximal Newton’s method (see, e.g., [lee2014proximal])

[TABLE]

where $(I+A^{-1}\partial G)^{-1}(v)$ solves $\min_{u}\frac{1}{2}\|u-v\|_{A}^{2}+G(u).$ Convergence and acceleration work exactly as for Newton’s method in example 2.24, based on the same lemmas that we state next.
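In one dimension the scaled proximal map has a closed form, which makes the proximal Newton step easy to sketch. The Python code below assumes $G=\lambda|\cdot|$ , for which $(I+a^{-1}\partial G)^{-1}(v)$ with a scalar $a>0$ is soft thresholding with threshold $\lambda/a$ , and the hypothetical smooth part $J(u)=e^{u}-2u$ . With $\lambda=1/2$ the root of $H$ is $\ln(3/2)$ . All names are illustrative.

```python
import math


def soft_threshold(v, t):
    """Minimiser of a * (u - v)^2 / 2 + lam * |u| when t = lam / a."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_newton_1d(grad, hess, lam, u0, iters):
    u = u0
    for _ in range(iters):
        a = hess(u)                      # scalar "metric" A = J''(u^i)
        v = u - grad(u) / a              # Newton step on the smooth part J
        u = soft_threshold(v, lam / a)   # proximal step on G in the A-norm
    return u


lam = 0.5
u_pn = proximal_newton_1d(lambda u: math.exp(u) - 2.0,
                          lambda u: math.exp(u), lam, u0=1.0, iters=8)
u_hat = math.log(1.5)  # solves lam + exp(u) - 2 = 0 on the branch u > 0
```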

Lemma 2.26.

Let $H=\partial G+\nabla J$ for $G\in\mathrm{cpl}(U)$ and $J\in C^{2}(U)$ . Take

[TABLE]

\cbstart

For an initial iterate $u^{0}\in U$ , let $\{u^{i+1}\}_{i\in\mathbb{N}}$ be defined through (PP). \cbend If $\nabla^{2}J({\widehat{u}})>0$ , there exists $\epsilon>0$ such that if $\|u^{0}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}\leq\epsilon$ , then the fundamental condition (CI) holds with $\Delta_{i+1}({\widehat{u}})=0$ and $M_{i+1}=\nabla^{2}J(u^{i})$ for all $i\in\mathbb{N}$ . Moreover, we can take $Z_{i+1}=\phi_{i}I$ such that \cbstart $Z_{N}M_{N}\geq\kappa^{N}\nabla^{2}J({\widehat{u}})$ for some $\kappa>1$ . In particular, $\|u^{N}-{\widehat{u}}\|^{2}\to 0$ at the linear rate $O(1/\kappa^{N})$ .

Proof 2.27.

We set $M_{i+1}:=\nabla^{2}J(u^{i})$ and $Z_{i+1}:=\phi_{i}I$ for some $\phi_{i}>0$ . Then $\nabla^{2}J({\widehat{u}})>0$ implies that $Z_{i+1}M_{i+1}=\phi_{i}\nabla^{2}J(u^{i})$ is positive and self-adjoint for $u^{i}$ close to ${\widehat{u}}$ .

By assumption, for some $\epsilon>0$ , we have

[TABLE]

For a fixed $i\in\mathbb{N}$ , let us assume that $u^{i}\in\widehat{B}(\epsilon)$ . Since $\partial G$ is monotone, similarly to the proof of lemma 2.16, the fundamental condition (CI) holds if

[TABLE]

where we use (81) in lemma B.5 with $\tau=1+\delta_{i}$ to estimate

[TABLE]

for

[TABLE]

Consequently, (15) holds with $\Delta_{i+1}({\widehat{u}})=0$ if we take $\phi_{i+1}>0$ such that

[TABLE]

This can always be satisfied for some $\phi_{i+1}>0$ for $\epsilon>0$ small enough because $\nabla^{2}J({\widehat{u}})>0$ then implies $\nabla^{2}J(u^{i})>0$ .

Now corollary 2.2 shows the quantitative $\Delta$ -Fejér monotonicity (QF), which with (17) implies

[TABLE]

If $\delta_{i}\in(0,1)$ , this implies by (16) that $\|u^{i+1}-{\widehat{u}}\|_{[1+(1-\delta_{i}^{2})]/(1+\delta_{i})\nabla^{2}J({\widehat{u}})}^{2}\leq\|u^{i}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})/(1-\delta_{i})}^{2}$ . Consequently, if $\delta_{i}\in(0,1)$ is small enough, that is, if $\epsilon>0$ is small enough due to the continuity of $\nabla^{2}J$ , we obtain $\|u^{i+1}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}\leq\|u^{i}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}$ so that also $u^{i+1}\in\widehat{B}(\epsilon)$ . In particular, our assumption $u^{0}\in\widehat{B}(\epsilon)$ guarantees $\{u^{i}\}_{i\in\mathbb{N}}\subset\widehat{B}(\epsilon)$ . Consequently also $\delta_{i+1}\leq\delta_{i}\leq\delta_{0}$ for all $i\in\mathbb{N}$ . We can now take $\zeta=u^{i+1}$ in (16), so that (17) gives

[TABLE]

Since $\kappa(\delta):=(1+(1-\delta)^{2})/(1-\delta)$ is increasing within $(0,1)$ , and $\kappa:=\kappa(0)=2$ , we see that $\phi_{i+1}\geq\kappa\phi_{i}$ . Taking $\phi_{0}:=1+\delta_{0}$ we now get $Z_{N}M_{N}\geq\kappa^{N}(1+\delta_{0})\nabla^{2}J(u^{N})\geq\kappa^{N}\nabla^{2}J({\widehat{u}})$ . This implies the convergence rate claim.

We can also show superlinear convergence. This is, however, somewhat more elaborate, as we need to make use of $\Delta_{i+1}({\widehat{u}})$ .

Lemma 2.28.

With everything as in lemma 2.26, the convergence rate claim can be improved to superlinear. If $\nabla^{2}J$ is locally Lipschitz near ${\widehat{u}}$ , for example, if $J\in C^{3}(U)$ , then this convergence is quadratic (superlinear convergence of order $q=2$ ).

Proof 2.29.

We continue with the initial setup of the proof of lemma 2.26 until (15). Now, for $\delta_{i}$ given by (16), (86) in lemma B.7 gives

[TABLE]

With this, (15), hence the fundamental condition (CI), holds if

[TABLE]

This holds for

[TABLE]

provided

[TABLE]

This can always be satisfied for some $\phi_{i+1}>0$ if $\epsilon>0$ is small enough because then $\nabla^{2}J(u^{i})>0$ due to $\nabla^{2}J({\widehat{u}})>0$ .

By corollary 2.2 we now obtain the quantitative $\Delta$ -Fejér monotonicity (QF), which with (19) gives

[TABLE]

Due to (16), we have $(1-\delta_{i})\nabla^{2}J({\widehat{u}})\leq\nabla^{2}J(u^{i})\leq(1+\delta_{i})\nabla^{2}J({\widehat{u}})$ . Hence, also using (20), (21) implies

[TABLE]

If $\delta_{i}\in(0,1/2]$ , this and $u^{i}\in\widehat{B}(\epsilon)$ imply $u^{i+1}\in\widehat{B}(\epsilon)$ , hence our assumption $u^{0}\in\widehat{B}(\epsilon)$ implies $\{u^{i}\}_{i\in\mathbb{N}}\subset\widehat{B}(\epsilon)$ . Consequently also $\delta_{i+1}\leq\delta_{i}\leq\delta_{0}$ for all $i\in\mathbb{N}$ . If now $\delta_{0}<1/2$ , which is guaranteed by $\epsilon>0$ small enough and the continuity of $\nabla^{2}J$ , then (22) implies $\|u^{i}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}\to 0$ . Consequently $\delta_{i}\to 0$ .

Let $\widetilde{\delta}_{i}:=\delta_{i}(1+\delta_{i})/[(2-\delta_{i})(1-\delta_{i})]$ . From (22), we get superlinear convergence if $\widetilde{\delta}_{i}\to 0$ , which follows from $\delta_{i}\to 0$ . Superlinear convergence of order $q>1$ occurs if $\|u^{i+1}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}/\|u^{i}-{\widehat{u}}\|_{\nabla^{2}J({\widehat{u}})}^{q}\to c$ for some $c\geq 0$ . From (22), we see this to hold if $\widetilde{\delta}_{i}/\|u^{i}-{\widehat{u}}\|^{2(q-1)}\to c\in\mathbb{R}$ . If $\nabla^{2}J$ is Lipschitz near ${\widehat{u}}$ , then $\delta_{i}\leq C\|u^{i}-{\widehat{u}}\|$ for some constant $C>0$ . Therefore we get superlinear convergence of order $q=2$ .

\cbend

\cbstart

2.5 Convergence of function values

We now study how our framework can be used to derive the convergence, or ergodic convergence, of function values. We concentrate on algorithms that are variants of forward–backward splitting, including gradient descent and the proximal point method, although other algorithms can be handled similarly. We again use the three-point inequalities of appendix B.

Lemma 2.30.

Let $H=\partial G+\nabla J$ for $G,J\in\mathrm{cpl}(X)$ with $\nabla J$ $L$ -Lipschitz. For all $i\in\mathbb{N}$ , take $M_{i+1}\equiv I$ and $V^{\prime}_{i+1}(u):=\tau_{i}(\nabla J(u^{i})-\nabla J(u))$ with $W_{i+1}=\tau_{i}I$ as well as $Z_{i+1}\equiv\phi_{i}I$ for some $\tau_{i},\phi_{i}>0$ . Then the fundamental condition (CI∼) holds if

(i)

$\phi_{i}\equiv\phi_{0}$ * is constant, $\tau_{i}L<1$ , and*

[TABLE]

If $J$ is strongly convex with factor $\gamma>0$ , alternatively:

(ii)

$\tau_{0}L^{2}<\gamma$ , $\phi_{i+1}:=\phi_{i}(1+\tau_{i}(\gamma-\tau_{i}L^{2}))$ , $\tau_{i}:=\phi_{i}^{-1/2}$ or $\tau_{i}:=\tau_{0}$ , and

[TABLE]

Proof 2.31.

We follow the proof of lemma 2.16, where we start by expanding (CI∼) (instead of (CI)) as

[TABLE]

Note that we have not inserted $H({\widehat{u}})\ni 0$ here. Now, as the next step, we do not eliminate $G$ through monotonicity of $\partial G$ , but use the definition of the convex subdifferential. Then we use the value three-point inequality (77) in place of the non-value inequality (76) and the value inequality (80) in place of the non-value inequality (79). From here the claims follow as in the proof of lemma 2.16. Note the factor-of-two differences between these formulas, which are reflected in the step length rules: $\tau_{i}L<1$ instead of $\tau_{i}L<2$ ; $\tau_{0}L^{2}<\gamma$ instead of $\tau_{0}L^{2}<2\gamma$ ; and $\phi_{i+1}:=\phi_{i}(1+\tau_{i}(\gamma-\tau_{i}L^{2}))$ instead of $\phi_{i+1}:=\phi_{i}(1+\tau_{i}(2\gamma-\tau_{i}L^{2}))$ .

We now obtain the convergence to zero of a weighted function value difference over the history of iterates, and as a consequence, for an ergodic sequence formed from the iterates:

Corollary 2.32.

Suppose the conditions of lemma 2.30 hold. Then

[TABLE]

In consequence, if we define the ergodic sequence

[TABLE]

then

[TABLE]

In particular, if lemma 2.30 (i) holds, then $[G+J](\widetilde{u}_{N})\to[G+J]({\widehat{u}})$ at the rate $O(1/N)$ . If, instead, lemma 2.30 (ii) holds, then this convergence is linear.

Proof 2.33.

The basic inequality (23) is a consequence of the fundamental theorem 2.1. The ergodic estimate (24) follows from there by Jensen’s inequality. The first convergence rate estimate follows from (24) and the fact that under lemma 2.30 (i) $\phi_{i}\tau_{i}=\phi_{0}\tau_{0}$ is a constant, so $\zeta_{N}=N\phi_{0}\tau_{0}$ . Under lemma 2.30 (ii) we recall from example 2.14 that the rule for $\phi_{i+1}$ shows that $\phi_{i+1}$ grows exponentially with $\tau_{i}=\tau_{0}$ constant. Then also $\zeta_{N}$ is exponential, so we obtain linear rates.
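The $O(1/N)$ ergodic rate can be checked numerically for gradient descent. The sketch below assumes that, with $\phi_{i}\tau_{i}$ constant, the ergodic sequence reduces to the uniform average $\widetilde{u}_{N}=\frac{1}{N}\sum_{i=0}^{N-1}u^{i+1}$ , and compares the function value gap with the classical bound $\|u^{0}-{\widehat{u}}\|^{2}/(2\tau N)$ for $\tau L\leq 1$ . The test function, the constants, and the names are hypothetical.

```python
def ergodic_gradient_descent(grad, u0, tau, iters):
    """Return the last iterate u^N and the uniform average of u^1, ..., u^N."""
    u = list(u0)
    total = [0.0] * len(u)
    for _ in range(iters):
        g = grad(u)
        u = [ui - tau * gi for ui, gi in zip(u, g)]
        total = [t + ui for t, ui in zip(total, u)]
    return u, [t / iters for t in total]


# Hypothetical test problem: J(u) = (u_1^2 + 0.01 * u_2^2) / 2, L = 1, minimiser 0.
def J(u):
    return 0.5 * (u[0] ** 2 + 0.01 * u[1] ** 2)


def grad_J(u):
    return [u[0], 0.01 * u[1]]


u0, tau, N = [1.0, 10.0], 0.9, 100
u_last, u_avg = ergodic_gradient_descent(grad_J, u0, tau, N)
gap = J(u_avg) - 0.0                              # J at the minimiser is 0
bound = sum(x * x for x in u0) / (2.0 * tau * N)  # |u^0 - 0|^2 / (2 tau N)
```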

The following three examples follow from corollary 2.32. For the proximal point method, additionally, since we can still let $\tau_{i}\nearrow\infty$ due to $L=0$ , we can also get superlinear convergence. Also, in the case of the proximal point method, we use the strong convexity of $G$ , which is for simplicity not considered in lemma 2.30, but can easily be added.

Example 2.34 (Proximal point method ergodic function value).

For the proximal point method of examples 2.12 and 2.14, applied to $H=\partial G$ with $G\in \mathrm{cpl}(U)$ , we have $G(\widetilde{u}_{N})\to G({\widehat{u}})$ at the rate $O(1/N)$ when $\tau_{i}\equiv\tau_{0}$ and no strong convexity is present. If $G$ is strongly convex, and $\tau_{i}\equiv\tau_{0}$ , the convergence is linear; if $\tau_{i}\nearrow\infty$ , the convergence is superlinear.

Example 2.35 (Gradient descent ergodic function value).

For the gradient descent method of examples 2.19 and 2.20, applied to $J\in\mathrm{cpl}(U)$ with $L$ -Lipschitz gradient, if $\tau_{i}\equiv\tau_{0}$ with $\tau_{0}L\leq 1$ , we have $J(\widetilde{u}_{N})\to J({\widehat{u}})$ at the rate $O(1/N)$ . If $J$ is strongly convex, $\tau_{0}L^{2}<\gamma$ , and we update $\tau_{i+1}:=\tau_{i}/\sqrt{1+\tau_{i}(\gamma-\tau_{i}L^{2})}$ , corresponding to $\tau_{i}=\phi_{i}^{-1/2}$ in lemma 2.30 (ii), then this convergence is $O(1/N^{2})$ .

Example 2.36 (Forward–backward splitting ergodic function value).

For the forward–backward splitting of example 2.21, $[G+J](\widetilde{u}_{N})\to[G+J]({\widehat{u}})$ at exactly the same rates and under the same conditions as for gradient descent in example 2.35.

For Newton’s method, we can use similar arguments: we can replace (81) by (83) in lemma 2.26, and (86) by (87) in lemma 2.28. This can be done because the preceding non-value lemmas show that $\{u^{i}\}_{i\in\mathbb{N}}\subset\widehat{B}(\epsilon)$ . In lemma 2.26 the effect of the change is to replace $(1-\delta_{i})^{2}$ by $\delta_{i}^{2}-3\delta_{i}$ everywhere, and in lemma 2.28, to replace $2-\delta_{i}$ by $1-2\delta_{i}$ . With these changes, the main arguments go through, although the exact value of $\kappa$ and the upper bounds for $\delta_{i}$ in the final paragraphs are changed.

Example 2.37 (Newton’s method function value).

For Newton’s method in example 2.24, we have $\tau_{i}=1$ and $\phi_{N}:=\kappa^{N}\phi_{0}$ for some $\kappa>1$ . We have $J(\widetilde{u}_{N})\to J({\widehat{u}})$ (super)linearly.

We can also obtain non-ergodic convergence for monotone methods. We demonstrate the idea only for the unaccelerated ( $\phi_{i}\tau_{i}=\phi_{0}\tau_{0}$ ) proximal point method, but unaccelerated forward–backward splitting and gradient descent can be handled analogously.

Example 2.38 (Proximal point method function value).

For the proximal point method of examples 2.12 and 2.14, applied to $H=\partial G$ with $G\in\mathrm{cpl}(U)$ , we have $G(u^{N})\to G({\widehat{u}})$ at the rate $O(1/N)$ when $\tau_{i}\equiv\tau_{0}$ and no strong convexity is present. If $G$ is strongly convex, and $\tau_{i}\equiv\tau_{0}$ , the convergence is linear; if $\tau_{i}\nearrow\infty$ , the convergence is superlinear.

Proof 2.39 (Proof of convergence).

From (PP), that is $0\in\partial G(u^{i+1})+\tau_{i}(u^{i+1}-u^{i})$ , we have

[TABLE]

That is, the proximal point method is monotone. Now we use corollary 2.32. Using (25) to unroll the function value sum in (23) gives $\zeta_{N}[G(u^{N})-G({\widehat{u}})]\leq C_{0}$ . The rates follow as in corollary 2.32 and example 2.34.
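Both the monotonicity and the non-ergodic $O(1/N)$ rate are easy to observe numerically. The Python sketch below assumes the proximal point iteration in the form $u^{i+1}:=(I+\tau\partial G)^{-1}(u^{i})$ for the hypothetical function $G(u)=|u_{1}|+\frac{1}{2}u_{2}^{2}$ , whose minimiser is the origin, and compares $G(u^{N})$ with the classical bound $\|u^{0}-{\widehat{u}}\|^{2}/(2\tau N)$ . All names and constants are illustrative.

```python
def G(u):
    return abs(u[0]) + 0.5 * u[1] ** 2


def prox_G(u, tau):
    """Proximal map (I + tau * dG)^{-1}(u) for G(u) = |u_1| + u_2^2 / 2."""
    sign = 1.0 if u[0] >= 0 else -1.0
    u1 = sign * max(abs(u[0]) - tau, 0.0)  # soft thresholding
    u2 = u[1] / (1.0 + tau)                # solves u2 + tau * u2 = u[1]
    return [u1, u2]


def proximal_point(u0, tau, iters):
    """Return the last iterate and the values G(u^1), ..., G(u^N)."""
    u, values = list(u0), []
    for _ in range(iters):
        u = prox_G(u, tau)
        values.append(G(u))
    return u, values


u0, tau, N = [5.0, 5.0], 0.5, 20
u_N, values = proximal_point(u0, tau, N)
dist2 = sum(x * x for x in u0)  # squared distance to the minimiser 0
bounds = [dist2 / (2.0 * tau * (k + 1)) for k in range(N)]
```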

\cbend

2.6 Connections to fixed point theorems

\cbstart

We demonstrate connections of our approach to established fixed point theorems. The following result in its modern form, stated for firmly non-expansive or more generally $\alpha$ -averaged maps, can be first found in [5]. Similar results for what are now known as Krasnoselski–Mann iterations, closely related to $\alpha$ -averaged maps, were, however, stated earlier for more limited settings in [20, 28, 23, 17, 22]. \cbend

Example 2.40 (Browder’s fixed point theorem).

Let $T:U\to U$ be $\alpha$ -averaged, that is $T=(1-\alpha)J+\alpha I$ for some non-expansive $J$ and $\alpha\in(0,1)$ . Suppose there exists a fixed point ${\widehat{u}}=T({\widehat{u}})$ . Let $u^{i+1}:=T(u^{i})$ . Then $u^{i}\mathrel{\rightharpoonup}u^{*}$ for some fixed point $u^{*}$ of $T$ .

Proof 2.41 (Proof).

Let us set $H(u):=T(u)-u$ , as well as $Z_{i+1}:=W_{i+1}:=M_{i+1}:=I$ and $V^{\prime}_{i+1}(u):=T(u^{i})+u^{i}-T(u)-u$ . We have

[TABLE]

where the last step follows by observing from the previous steps that (PP) says $u^{i+1}=T(u^{i})$ . The expression (26) easily gives the \cbstart iteration outer semicontinuity condition (7), \cbend and reduces the fundamental condition (CI∼) to

[TABLE]

Using $u^{i+1}=T(u^{i})$ and ${\widehat{u}}=T({\widehat{u}})$ , and taking $\beta>0$ , (CI∼) therefore holds for

[TABLE]

provided

[TABLE]

Using the $\alpha$ -averaged property and ${\widehat{u}}=J({\widehat{u}})$ , we expand

[TABLE]

We take $\beta:=\max\{0,1/2-\alpha\}$ . Then $2\alpha+2\beta\geq 1$ . Cauchy’s inequality and non-expansivity of $J$ thus give

[TABLE]

This verifies (CI∼). From (27), $\Delta_{i+1}({\widehat{u}})\leq-\frac{1}{2}\min\{1,\alpha/(1-\alpha)\}\|u^{i+1}-u^{i}\|^{2}$ . We now obtain the claimed convergence from corollaries 2.2 and 2.9.
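The need for averaging can be seen on a simple planar example. In the following Python sketch the non-expansive map $J$ is assumed to be the rotation by $90^{\circ}$ , whose only fixed point is the origin. Iterating $J$ itself never converges, while iterating the $\alpha$ -averaged map $T=(1-\alpha)J+\alpha I$ does. The function names and the choice $\alpha=1/2$ are illustrative.

```python
import math


def rotate90(u):
    """A non-expansive (in fact isometric) map with the unique fixed point 0."""
    return [-u[1], u[0]]


def averaged(J, alpha):
    """Build the alpha-averaged map T = (1 - alpha) * J + alpha * I."""
    def T(u):
        return [(1.0 - alpha) * ju + alpha * ui for ju, ui in zip(J(u), u)]
    return T


def iterate(T, u0, iters):
    u = list(u0)
    for _ in range(iters):
        u = T(u)
    return u


def norm(u):
    return math.sqrt(sum(x * x for x in u))


u_plain = iterate(rotate90, [1.0, 0.0], 101)               # keeps rotating
u_avg = iterate(averaged(rotate90, 0.5), [1.0, 0.0], 100)  # contracts to 0
```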

3 Stochastic methods

\cbstart

We now exploit the fact that the step length $W_{i+1}$ can be a non-invertible operator. We do this in the context of stochastic block-coordinate methods. Towards this end we introduce the following probabilistic notations: \cbend

Definition 3.1.

We write $x\in\mathcal{R}(X)$ if $x$ is an $X$ -valued random variable: $x:\Omega\to X$ for some (in the present work fixed) probability space $(\Omega,\mathcal{O})$ , where $\mathcal{O}$ is a $\sigma$ -algebra on $\Omega$ . We denote by $\mathbb{E}$ the expectation with respect to a probability measure $\mathbb{P}$ on $\Omega$ . As is common, we abuse notation and write $x=x(\omega)$ for the unknown random realisation $\omega\in\Omega$ . We also write $\mathbb{E}[\cdot|i]$ for the conditional expectation with respect to random variable realisations up to and including iteration $i$ .

We refer to [29] for more details on measure-theoretic probability.

\cbstart

The following is an immediate corollary of theorem 2.1, obtained by taking the expectation of both (CI∼) and (DI). Only requiring these inequalities to hold in expectation may produce more lenient step length and other conditions. In this section, we demonstrate the applicability of our techniques to stochastic methods with a few basic examples. We refer to the review article [33] for an introduction and further references to stochastic coordinate descent, and to our companion paper [30] for primal–dual methods based on the work here.

Corollary 3.2.

On a Hilbert space $U$ and a probability space $(\Omega,\mathcal{O})$ , let $\widetilde{H}_{i+1}\in\mathcal{R}(U\rightrightarrows U)$ , and $M_{i+1},Z_{i+1}\in\mathcal{R}(\mathcal{L}(U;U))$ for $i\in\mathbb{N}$ . Suppose (PP∼) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset\mathcal{R}(U)$ . If for all $i\in\mathbb{N}$ and almost all random events $\omega\in\Omega$ , $(Z_{i+1}M_{i+1})(\omega)$ is self-adjoint, and for some $\Delta_{i+1}\in\mathcal{R}(\mathbb{R})$ and ${\widehat{u}}\in U$ the expected fundamental condition

[TABLE]

holds, then so does the expected descent inequality

[TABLE]

In block-coordinate descent methods, we write $u=\sum_{j=1}^{m}P_{j}u$ for some mutually orthogonal projection operators, and on each step of the method, only update some of the “blocks” $P_{j}u$ . Functions with respect to which we take a proximal step, we assume separable with respect to these projections or subspaces: $G=\sum_{j=1}^{m}G_{j}\circ P_{j}$ . To perform forward steps, we introduce a blockwise version of standard smoothness conditions of convex functions. The idea is that the factor $L_{S(i)}$ for the subset of blocks $S(i)$ can be better than the global smoothness or Lipschitz factor $L$ .

\cbend

Definition 3.3.

We write $(P_{1},\ldots,P_{m})\in\mathcal{P}(U)$ if $P_{1},\ldots,P_{m}$ are projection operators in $U$ with $\sum_{j=1}^{m}P_{j}=I$ , and $P_{j}P_{i}=0$ for $i\neq j$ . For random $S(i)\subset\{1,\ldots,m\}$ and an iteration $i\in\mathbb{N}$ , we then set

[TABLE]

For smooth $J\in\mathrm{cpl}(U)$ , we let $L_{S(i)}>0$ be the $\Pi_{S(i)}$ -relative smoothness factor, satisfying

[TABLE]

and consequently (see lemma C.1)

[TABLE]

Example 3.4 (Stochastic block-coordinate descent).

Let $H=\nabla J$ for $J\in\mathrm{cpl}(U)$ with Lipschitz gradient. Also let $(P_{1},\ldots,P_{m})\in\mathcal{P}(U)$ . For each $i\in\mathbb{N}$ , take random $S(i)\subset\{1,\ldots,m\}$ , and set

[TABLE]

Then (PP) says that we take a forward step on the random subspace $\mathop{\mathrm{range}}(\Pi_{S(i)})$ :

[TABLE]

If the step lengths are deterministic and satisfy $\epsilon\leq\tau_{i}$ and $\tau_{i}L_{S(i)}\leq\pi_{j,i}$ for all $j\in S(i)$ for some $\epsilon>0$ , we have $\mathbb{E}[J(\widetilde{u}_{N})]\to J({\widehat{u}})$ at the rate $O(1/N)$ \cbstart for the ergodic sequence

[TABLE]

\cbend

Through the use of the “local” smoothness factors $L_{S(i)}$ , the method may be able to take larger steps $\tau_{i}$ than those allowed by the global factor $L$ in example 2.19.
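A minimal Python sketch of stochastic block-coordinate descent follows. It assumes single-coordinate blocks sampled uniformly, so that $\pi_{j,i}=1/m$ , and it assumes that the forward step takes the form $u^{i+1}=u^{i}-\tau_{i}\pi_{j,i}^{-1}P_{j}\nabla J(u^{i})$ for the sampled $j$ . The diagonal quadratic $J$ and the step length are hypothetical; the step is chosen conservatively so that every coordinate update is a contraction, and it is not derived from the relative smoothness factor $L_{S(i)}$ .

```python
import random


def stochastic_coordinate_descent(a, u0, tau, iters, seed=0):
    """Coordinate descent for J(u) = sum_j a_j * u_j^2 / 2 with uniform sampling."""
    rng = random.Random(seed)
    m = len(a)
    pi = 1.0 / m  # probability that a given coordinate is selected
    u = list(u0)
    for _ in range(iters):
        j = rng.randrange(m)              # the random block S(i) = {j}
        u[j] -= (tau / pi) * a[j] * u[j]  # forward step on coordinate j only
    return u


def J(a, u):
    return 0.5 * sum(aj * uj * uj for aj, uj in zip(a, u))


a = [1.0, 2.0, 4.0]
u0 = [1.0, 1.0, 1.0]
tau = (1.0 / len(a)) / max(a)  # conservative: (tau / pi) * a_j <= 1 for all j
u_final = stochastic_coordinate_descent(a, u0, tau, iters=2000)
```

Only one coordinate of the gradient is evaluated per iteration, which is the source of the savings in block-coordinate methods.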

The smoothness requirement on $J$ limits the usefulness of example 3.4. However, the example forms the basis for popular stochastic forward–backward splitting methods, of which we now provide an example.

Example 3.5 (Stochastic forward–backward splitting).

Let $(P_{1},\ldots,P_{m})\in\mathcal{P}(U)$ . Suppose $H=\partial G+\nabla J$ for $J,G\in\mathrm{cpl}(U)$ , where $J$ has Lipschitz gradient, and $G$ is separable: $G=\sum_{j=1}^{m}G_{j}\circ P_{j}$ . Take $M_{i+1}$ , $W_{i+1}$ , and $V^{\prime}_{i+1}$ as in example 3.4. Then (PP) describes the stochastic forward–backward splitting method

[TABLE]

With $u_{j}:=P_{j}u$ , this can be written

[TABLE]

The method has exactly the same convergence properties as the stochastic gradient descent of example 3.4.

Remark 3.6.

Following example 2.20, if $G$ or $J$ is strongly convex, it is also possible to construct accelerated versions of both examples 3.4 and 3.5. Then we can obtain from (D $\mathbb{E}$ ) convergence rates for $\mathbb{E}[\|u^{i+1}-{\widehat{u}}\|^{2}]$ .

Proof 3.7 (Proof of convergence of stochastic gradient descent and forward–backward splitting).

\cbstart

We take as the testing operator $Z_{i+1}:=I$ . Then, since $Z_{i+1}M_{i+1}\equiv I$ , (C $\mathbb{E}{\sim}$ ) expands as

[TABLE]

From the decomposition $G=\sum_{j=1}^{m}G_{j}\circ P_{j}$ and the convexity of $G_{j}$ , we observe that

[TABLE]

Since $\tau_{i}$ is deterministic and $\mathbb{E}[\pi^{-1}_{j,i}\chi_{S(i)}(j)P_{j}]=\mathbb{E}[\Pi_{S(i)}]=I$ , such that $\sum_{i=0}^{N-1}\mathbb{E}[\tau_{i}\pi^{-1}_{j,i}\chi_{S(i)}(j)P_{j}]=\zeta_{N}$ for all $j=1,\ldots,m$ , by Jensen’s inequality, therefore,

[TABLE]

If we show the ergodic three-point smoothness condition

[TABLE]

then using our assumption $\tau_{i}L_{S(i)}\leq\pi_{j,i}$ and (33), we verify (32), hence (C $\mathbb{E}{\sim}$ ), for some $\Delta_{i+1}({\widehat{u}})$ such that

[TABLE]

Since $\zeta_{N}\geq\epsilon N$ by our assumption $\tau_{i}\geq\epsilon$ , corollary 3.2 now shows the $O(1/N)$ convergences of function values for the ergodic sequence $\{\widetilde{u}_{N}\}_{N\geq 1}$ . \cbend

To prove (34), from (28) with $h:=u^{i+1}-u^{i}$ and $\bar{u}^{i+1}:=(I-\Pi_{S(i)})u^{i}+\Pi_{S(i)}u^{i+1}$ we have

[TABLE]

By convexity, we also have

[TABLE]

Summing (35) and (36), multiplying by $\widetilde{\tau}_{i}$ , and taking the expectation,

[TABLE]

Since $\sum_{i=0}^{N-1}\tau_{i}=\zeta_{N}$ , Jensen’s inequality shows

[TABLE]

Therefore, summing (37) over $i=0,\ldots,N-1$ verifies (34).

Example 3.8 (Stochastic Newton’s method).

Suppose $(P_{1},\ldots,P_{m})\in\mathcal{P}(U)$ and $J\in C^{2}(U)$ . Take $H=\nabla J$ , $W_{i+1}:=P_{S(i)}$ , and

[TABLE]

Then (PP) reads

[TABLE]

where we abbreviate $A_{S(i)}:=P_{S(i)}AP_{S(i)}$ . We get

[TABLE]

where we define $A^{\dagger}_{S(i)}$ to satisfy $A^{\dagger}_{S(i)}=P_{S(i)}A^{\dagger}_{S(i)}P_{S(i)}$ and $A_{S(i)}A^{\dagger}_{S(i)}=A^{\dagger}_{S(i)}A_{S(i)}=P_{S(i)}$ . This is a variant of stochastic Newton’s method and “sketching” [25, 24]. Notice how $[\nabla^{2}J(u)]_{S(i)}^{\dagger}$ can be significantly cheaper to compute than $[\nabla^{2}J(u)]^{-1}$ .
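For a quadratic $J$ and single-coordinate blocks, the restricted pseudo-inverse is simply the reciprocal of a diagonal entry of the Hessian. The Python sketch below assumes the update $u^{i+1}=u^{i}-[\nabla^{2}J(u^{i})]^{\dagger}_{S(i)}\nabla J(u^{i})$ with $S(i)=\{j\}$ sampled uniformly, applied to the hypothetical quadratic $J(u)=\frac{1}{2}\langle u,Au\rangle-\langle c,u\rangle$ with a symmetric positive definite $2\times 2$ matrix $A$ . Names and data are illustrative.

```python
import random


def stochastic_newton_quadratic(A, c, u0, iters, seed=0):
    """Coordinate-sketched Newton steps for J(u) = <u, A u> / 2 - <c, u>."""
    rng = random.Random(seed)
    u = list(u0)
    n = len(u)
    for _ in range(iters):
        j = rng.randrange(n)  # the random block S(i) = {j}
        grad_j = sum(A[j][k] * u[k] for k in range(n)) - c[j]
        # [A]_S^dagger acts as 1 / A_jj on coordinate j and as zero elsewhere
        u[j] -= grad_j / A[j][j]
    return u


A = [[2.0, 1.0], [1.0, 3.0]]
c = [1.0, 2.0]
u_sn = stochastic_newton_quadratic(A, c, [0.0, 0.0], iters=400)
u_exact = [0.2, 0.6]  # solves A u = c for this data
```

Each step inverts only a $1\times 1$ block instead of the full Hessian, which illustrates why the restricted pseudo-inverse can be much cheaper to compute.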

\cbstart

Let

[TABLE]

as well as

[TABLE]

If $0\leq\delta_{J}<\frac{3-\sqrt{9-8\bar{p}}}{4}$ , then $\mathbb{E}[\|u^{N}-{\widehat{u}}\|^{2}]\to 0$ at a linear rate. \cbend

Remark 3.9.

If $J(u)=\langle u,Au-c\rangle$ for some self-adjoint positive definite $A\in\mathcal{L}(U;U)$ and $c\in U$ , then $\delta_{J}=0$ , so the upper bound on $\delta_{J}$ is satisfied for any $\bar{p}\in(0,1]$ . If $\mathbb{E}[P_{S(i)}|i]\equiv pI$ for some $p>1/2$ , then $\bar{p}>0$ due to

[TABLE]

An advantage of our techniques is the immediate convergence of:

Example 3.10 (Stochastic proximal Newton’s method).

Let $(P_{1},\ldots,P_{m})\in\mathcal{P}(U)$ . Let $H=\partial G+\nabla J$ for $G\in\mathrm{cpl}(U)$ and $J\in C^{2}(X)$ with $G=\sum_{j=1}^{m}G_{j}\circ P_{j}$ . Take $M_{i+1}$ , $W_{i+1}$ , and $V^{\prime}_{i+1}$ as in example 3.8. Then we obtain the algorithm

[TABLE]

We have $\mathbb{E}[\|u^{N}-{\widehat{u}}\|^{2}]\to 0$ at a linear rate under the same conditions as in example 3.8.

Proof 3.11 (Proof of convergence of stochastic Newton’s and proximal Newton’s methods).

\cbstart

We set as the preconditioner $M_{i+1}:=\nabla^{2}J(u^{i})$ and as the test $Z_{i}:=\phi_{i}I$ for some $\phi_{i}>0$ . Clearly we have the following simpler non-value version of the value estimate (33):

[TABLE]

Therefore, since $0\in \partial G({\widehat{u}})+\nabla J({\widehat{u}})$ , the expected fundamental condition (C $\mathbb{E}{\sim}$ ) becomes

[TABLE]

for

[TABLE]

Adapting the argumentation of lemmas B.5 and B.7 to the present projected setting, by the mean value theorem, for some $\zeta$ between $u^{i}$ and ${\widehat{u}}$ , and using the definition of $\delta_{J}$ in (38) and the three-point identity (6), we rearrange

[TABLE]

By the definition of $\bar{p}$ in (39) and by Cauchy’s inequality, for any $\alpha>0$ , we obtain the expected three-point inequality

[TABLE]

We take $\alpha=(1-\delta_{J})/(1-\bar{p})$ . Then (41) holds when

[TABLE]

This is the case for some $\Delta_{i+1}({\widehat{u}})\in\mathcal{R}(\mathbb{R})$ with $\mathbb{E}[\Delta_{i+1}({\widehat{u}})]=0$ provided $2>\delta_{J}+\alpha^{-1}$ and $\phi_{i+1}>0$ is small enough that $\phi_{i+1}\nabla^{2}J(u^{i+1})\leq\phi_{i}(2-\delta_{J}-\alpha^{-1})\nabla^{2}J(u^{i})$ . Due to (38), we can take $\phi_{i+1}\geq\phi_{i}\kappa$ for

[TABLE]

In particular, we obtain exponential growth of $\{\phi_{i}\}_{i\in\mathbb{N}}$ provided $\kappa>1$ , which holds when $-3\delta_{J}+2\delta_{J}^{2}+\bar{p}>0$ , which is the case under our assumption $0\leq\delta_{J}<\frac{3-\sqrt{9-8\bar{p}}}{4}$ . Consequently, we can take $\phi_{i}:=\kappa^{i}/(1-\delta_{J})$ for $\kappa>1$ . By corollary 3.2 we have

[TABLE]

Since $Z_{N+1}M_{N+1}=\phi_{N}\nabla^{2}J(u^{N})\geq\kappa^{N}\nabla^{2}J({\widehat{u}})$ , we obtain the claimed linear expected convergence of iterates.

\cbend

Remark 3.12 (Variance estimates).

From an estimate of the type $\mathbb{E}[\|u^{N}-{\widehat{u}}\|^{2}]\leq C_{N}$ , as above, Jensen’s inequality gives $\|\mathbb{E}[u^{N}]-{\widehat{u}}\|^{2}\leq C_{N}$ . From this, with the application of the triangle and Cauchy’s inequalities, it is easy to derive the variance estimate $\mathbb{E}[\|\mathbb{E}[u^{N}]-u^{N}\|^{2}]\leq 4C_{N}$ .

4 Saddle point problems

\cbstart

We now momentarily forget the stochastic setting and ergodic estimates to which we will return in section 5, and introduce our overall approach to primal–dual methods for saddle-point problems. \cbend With $K\in\mathcal{L}(X;Y)$ ; $G,J\in\mathrm{cpl}(X)$ ; and $F^{*}\in\mathrm{cpl}(Y)$ on Hilbert spaces $X$ and $Y$ , we now wish to solve the following version of (S). The first-order necessary optimality conditions read

[TABLE]

Setting $U:=X\times Y$ and introducing the variable splitting notation $u=(x,y)$ , ${\widehat{u}}=({\widehat{x}},{\widehat{y}})$ , etc., this can succinctly be written as $0\in H({\widehat{u}})$ in terms of the operator

[TABLE]

In this section, concentrating on this specific $H$ , we specialise the theory of section 2.2 to saddle point problems. Throughout, for some primal and dual step length and testing operators $T_{i},\Phi_{i}\in\mathcal{L}(X;X)$ , and $\Sigma_{i+1},\Psi_{i+1}\in\mathcal{L}(Y;Y)$ , we take

[TABLE]

To work with arbitrary step length operators, which will be necessary for stochastic algorithms in section 3, as well as the partially accelerated algorithms of [31], we will need abstract forms of partial strong monotonicity of $G$ and $F^{*}$ . As a first step, we take subspaces of operators

[TABLE]

We suppose that $\partial G$ is partially (strongly) $\mathcal{T}$ -monotone, which we take to mean

[TABLE]

for some linear operator $0\leq\Gamma\in\mathcal{L}(X;X)$ . The operator ${\widetilde{T}}\in\mathcal{T}$ acts as a testing operator. \cbstart Observe that we have already proven this in (40) for the setting of the stochastic Newton’s method. \cbend Similarly, we assume that $\partial F^{*}$ is $\mathcal{S}$ -monotone in the sense

[TABLE]

Regarding $J$ , we assume that $\nabla J$ exists and is partially $\mathcal{T}$ -co-coercive in the sense that, for some $L\geq 0$ ,

[TABLE]

(We allow $L=0$ for the case $J=0$ .)

We also introduce

[TABLE]

which are operator measures of strong monotonicity and smoothness of $H$ . Finally, we introduce the forward–step preconditioner with respect to $J$ , familiar from example 2.19 as

[TABLE]

Example 4.1 (Block-separable structure, monotonicity).

Let $P_{1},\ldots,P_{m}$ be projection operators in $X$ with $\sum_{j=1}^{m}P_{j}=I$ and $P_{j}P_{i}=0$ if $i\neq j$ . Suppose $G_{1},\ldots,G_{m}\in\mathrm{cpl}(X)$ are (strongly) convex with factors $\gamma_{1},\ldots,\gamma_{m}\geq 0$ . Then the partial strong monotonicity (G-PM) holds with $\Gamma=\sum_{j=1}^{m}\gamma_{j}P_{j}$ for

[TABLE]
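As a numerical illustration of example 4.1, the following minimal sketch checks an inequality of the form $\langle q^{\prime}-q,{\widetilde{T}}(x^{\prime}-x)\rangle\geq\|x^{\prime}-x\|^{2}_{{\widetilde{T}}\Gamma}$ for $q\in\partial G(x)$ , $q^{\prime}\in\partial G(x^{\prime})$ and ${\widetilde{T}}=\sum_{j}t_{j}P_{j}$ with $t_{j}\geq 0$ . Since the displays are not reproduced here, this form of (G-PM) and of $\mathcal{T}$ is an assumption, as are the coordinate blocks, the factors $\gamma_{j}$ , and the choice $G_{j}(x)=\frac{\gamma_{j}}{2}\|P_{j}x\|^{2}+\|P_{j}x\|_{1}$ .

```python
import random

# Hypothetical block structure: P_j projects onto the coordinates in BLOCKS[j].
BLOCKS = [[0, 1], [2], [3, 4]]
GAMMAS = [0.0, 1.0, 2.5]  # assumed strong convexity factors gamma_j


def subgradient(x):
    """One subgradient of G(x) = sum_j (gamma_j/2)|P_j x|^2 + |P_j x|_1."""
    q = [0.0] * len(x)
    for gamma, block in zip(GAMMAS, BLOCKS):
        for k in block:
            sign = (x[k] > 0) - (x[k] < 0)
            q[k] = gamma * x[k] + sign
    return q


def monotonicity_slack(x, x_prime, weights):
    """<q' - q, T(x' - x)> - |x' - x|^2_{T Gamma} for T = sum_j t_j P_j."""
    q, q_prime = subgradient(x), subgradient(x_prime)
    slack = 0.0
    for gamma, t, block in zip(GAMMAS, weights, BLOCKS):
        for k in block:
            d = x_prime[k] - x[k]
            slack += t * ((q_prime[k] - q[k]) * d - gamma * d * d)
    return slack


def worst_slack(trials=1000, seed=0):
    """Smallest slack over random points and random weights t_j >= 0."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-2, 2) for _ in range(5)]
        x_prime = [rng.uniform(-2, 2) for _ in range(5)]
        weights = [rng.uniform(0, 3) for _ in BLOCKS]
        worst = min(worst, monotonicity_slack(x, x_prime, weights))
    return worst
```

Up to rounding, `worst_slack()` should be non-negative: block by block, the slack reduces to $t_{j}$ times the monotonicity of the sign mapping.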

4.1 Estimates

Using the (strong) $\mathcal{T}$ -monotonicity of $\partial G$ , and the $\mathcal{T}$ -co-coercivity of $\nabla J$ , the next theorem simplifies corollary 2.2 for $H$ given by (42). We introduce $\widetilde{\Gamma}=\Gamma$ to facilitate later gap estimates that will require the conditions in the theorem to hold for $\widetilde{\Gamma}=\Gamma/2$ instead of $\widetilde{\Gamma}=\Gamma$ .

Theorem 4.2.

Let $H$ have the structure (42) and assume ${\widehat{u}}\in H^{-1}(0)$ . Suppose $G$ satisfies the partial strong monotonicity (G-PM) for some $0\leq\Gamma\in\mathcal{L}(X;X)$ , $F^{*}$ similarly satisfies (F∗-PM), and $J$ satisfies the partial co-coercivity (J-PC) for some $L\geq 0$ . For each $i\in\mathbb{N}$ , let $T_{i},\Phi_{i}\in\mathcal{L}(X;X)$ and $\Sigma_{i+1},\Psi_{i+1}\in\mathcal{L}(Y;Y)$ be such that $\Phi_{i}T_{i}\in\mathcal{T}$ and $\Psi_{i+1}\Sigma_{i+1}\in\mathcal{S}$ . Define $Z_{i+1}$ and $W_{i+1}$ through (43). Also take $V^{\prime}_{i+1}:X\times Y\to X\times Y$ , and $M_{i+1}\in\mathcal{L}(X\times Y;X\times Y)$ . Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset X\times Y$ . Then the fundamental conditions (CI), (CI∼) and the descent inequality (DI) hold if, for all $i\in\mathbb{N}$ , the operator $Z_{i+1}M_{i+1}$ is self-adjoint and for $\widetilde{\Gamma}=\Gamma$ and $L_{i}\equiv L/2$ we have the fundamental inequality for saddle-point problems

[TABLE]

We have introduced $\widetilde{\Gamma}$ and $L_{i}$ for later gap estimates, where the specific choices of these will differ by a factor of two, similarly to the differences in the step length bounds for the function value estimates of section 2.5 compared to the non-value estimates of section 2.3.

Proof 4.3.

Note that $Z_{i+1}M_{i+1}$ being self-adjoint implies that so is $\Phi_{i}T_{i}$ . Using (J-PC), similarly to lemma B.1 we derive

[TABLE]

Using (45), therefore

[TABLE]

With this, (G-PM), and (F∗-PM), we observe (47) to imply

[TABLE]

Here pay attention to the fact that (48) employs $\Xi_{i+1}(0)$ while (47) employs $\Xi_{i+1}(\widetilde{\Gamma})$ . If we show that (CI) follows from (48), then the descent inequality (DI) follows from corollary 2.2. Indeed, using the expansion

[TABLE]

we expand for any ${\widetilde{u}}=({\widetilde{x}},{\widetilde{y}})$ that

[TABLE]

With the help of $\Xi_{i+1}(0)$ we then obtain

[TABLE]

Inserting this into (48), we obtain the fundamental inequality (CI). It implies (CI∼) via corollary 2.2. Finally, theorem 2.1 gives (DI).

4.2 Examples of primal–dual methods

We now look at several known methods for the saddle point problem (S). The fundamental idea in all of them is to design $M_{i+1}$ such that the primal variable $x^{i+1}$ and the dual variable $y^{i+1}$ can be updated independently, unlike in the standard proximal point method with $M_{i+1}=I$ . To help verify the conditions of theorem 4.2 for these methods, we reformulate the result for scalar step length and testing parameters: we will only use the full power of the operator setup in our companion paper [30].

If for each $i\in\mathbb{N}$ , we pick $\tau_{i},\phi_{i},\sigma_{i+1},\psi_{i+1}>0$ and $\gamma\geq 0$ , and define $T_{i}=\tau_{i}I$ , $\Phi_{i}=\phi_{i}I$ , $\Sigma_{i+1}=\sigma_{i+1}I,\Psi_{i+1}=\psi_{i+1}I$ , and $\Gamma:=\gamma I$ , then (43), (44), and (50c) reduce to

[TABLE]

Then we have the following corollary of theorem 4.2.

Corollary 4.4.

Let $H$ have the structure (42) and assume ${\widehat{u}}\in H^{-1}(0)$ . Assume that $G$ is ( $\gamma$ -strongly) convex and $\nabla J$ is $L$ -Lipschitz for some $\gamma\geq 0$ and $L>0$ . For each $i\in\mathbb{N}$ , assume the structure (50) for $\tau_{i},\phi_{i},\sigma_{i+1},\psi_{i+1}>0$ . Also take $V^{\prime}_{i+1}:X\times Y\to X\times Y$ and $M_{i+1}\in\mathcal{L}(X\times Y;X\times Y)$ . Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset X\times Y$ . Suppose for all $i\in\mathbb{N}$ that $Z_{i+1}M_{i+1}$ is self-adjoint, and that the fundamental condition for saddle-point problems (47) holds for $\widetilde{\Gamma}=\gamma I$ and $L_{i}\equiv L/2$ . Then the fundamental conditions (CI), (CI∼) and the descent inequality (DI) hold.

Proof 4.5.

Clearly $\Phi_{i}T_{i}\in\mathcal{T}:=[0,\infty)I$ and $\Psi_{i+1}\Sigma_{i+1}\in\mathcal{S}:=[0,\infty)I$ . Moreover, $F^{*}$ satisfies the partial monotonicity condition (F∗-PM) and $G$ satisfies the partial monotonicity condition (G-PM) with $\Gamma=\gamma I$ by the corresponding (strong) monotonicity of the subdifferentials. The rest follows from theorem 4.2.

Example 4.6 (The primal–dual method of Chambolle and Pock [7]).

[TABLE]

In the basic version of the algorithm, $\omega_{i}=1$ , $\tau_{i}\equiv\tau_{0}>0$ , and $\sigma_{i}\equiv\sigma_{0}>0$ , assuming the step length parameters to satisfy

[TABLE]

If $K$ is compact, the iterates converge weakly, and the method has $O(1/N)$ rate for the ergodic duality gap, to which we will return in section 5. If $G$ is strongly convex with factor $\gamma>0$ , we may accelerate

[TABLE]

This yields $O(1/N^{2})$ convergence of $\|x^{N}-{\widehat{x}}\|^{2}$ to zero.
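To make the method concrete, the following minimal sketch runs the accelerated iteration on the toy problem $\min_{x}\frac{1}{2}\|x-f\|^{2}+\|x\|_{1}$ , that is, $G(x)=\frac{1}{2}\|x-f\|^{2}$ (so $\gamma=1$ ), $F=\|\cdot\|_{1}$ and $K=I$ . Since the displays are not reproduced here, the update formulas are an assumption: we use the standard primal-first form of the method of [7] with $\omega_{i}=1/\sqrt{1+2\gamma\tau_{i}}$ , $\tau_{i+1}=\tau_{i}\omega_{i}$ and $\sigma_{i+1}=\sigma_{i}/\omega_{i}$ . The function name, test data and default parameters are hypothetical.

```python
import math


def chambolle_pock_l1_denoise(f, tau0=0.5, sigma0=0.5, gamma=1.0, iterations=2000):
    """Accelerated primal-dual iteration for min_x 0.5*|x - f|^2 + |x|_1 with K = I."""
    # Assumed initialisation condition tau0 * sigma0 * |K|^2 < 1, here with |K| = 1.
    assert tau0 * sigma0 < 1.0
    n = len(f)
    x, y = [0.0] * n, [0.0] * n
    tau, sigma = tau0, sigma0
    for _ in range(iterations):
        # Primal step: proximal map of tau*G at x - tau*K^* y for G = 0.5*|. - f|^2.
        x_new = [(xk - tau * yk + tau * fk) / (1.0 + tau)
                 for xk, yk, fk in zip(x, y, f)]
        # Acceleration: shrink tau, grow sigma, keeping tau*sigma constant.
        omega = 1.0 / math.sqrt(1.0 + 2.0 * gamma * tau)
        tau, sigma = tau * omega, sigma / omega
        # Over-relaxation of the primal variable.
        x_bar = [xn + omega * (xn - xk) for xn, xk in zip(x_new, x)]
        # Dual step: F^* is the indicator of [-1, 1]^n, so its proximal map is a clip.
        y = [min(1.0, max(-1.0, yk + sigma * xb)) for yk, xb in zip(y, x_bar)]
        x = x_new
    return x, y
```

For this toy problem the minimiser is the soft-thresholding of $f$ at level one, which gives a simple way to check the output. Setting `gamma=0.0` gives $\omega_{i}=1$ and constant step lengths, which corresponds to the basic version of the algorithm.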

Proof 4.7 (Proof of convergence of iterates).

We formulate the method in our proximal point framework with $J=0$ and $G=G$ following [31, 14] by taking as the preconditioner

[TABLE]

For the rest of the operators, we use the setup of (50). Taking $\Delta_{i+1}({\widehat{u}}):=-\frac{1}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ , we now reduce (47) to

[TABLE]

We may expand

[TABLE]

We have $\|\,\boldsymbol{\cdot}\,\|_{D_{i+2}}=0$ (but not $D_{i+2}=0$ , as the former depends on the off-diagonals cancelling out), and $Z_{i+1}M_{i+1}$ is self-adjoint, if for some constant $\psi$ we take

[TABLE]

This gives the acceleration scheme (53). Moreover, for any $\delta\in(0,1)$ holds

[TABLE]

Thus $Z_{i+1}M_{i+1}\geq 0$ if $\psi\geq(1-\delta)^{-1}\phi_{i}\tau_{i}^{2}\|K\|^{2}$ . By (56), $\sigma_{i}\tau_{i}=1/\psi$ . Since this fixes the ratio of $\sigma_{i}$ to $\tau_{i}$ , we need to take $\psi:=1/(\sigma_{0}\tau_{0})$ as well as $\delta:=1-\sigma_{0}\tau_{0}\|K\|^{2}$ . Through the positivity of $\delta$ , we recover the initialisation condition (52).

Recall that subdifferentials are weak-to-strong outer-semicontinuous. By the continuity of $K$ , we thus deduce the strong-to-strong outer semicontinuity of $H$ . To verify (7), we use the assumed compactness of $K$ , which implies for a further unrelabelled subsequence of $\{u^{i_{k}}\}_{k\in\mathbb{N}}$ that $w^{i_{k}}\in H(u^{i_{k}})$ satisfy $0=\lim_{k\to\infty}w^{i_{k}}\in H({\widetilde{u}})$ . Corollaries 4.4 and 2.9 now show weak convergence of the iterates without a rate.

If $G$ is strongly convex with factor $\gamma>0$ , the results in [7, 31] show that $\tau_{N}$ is of the order $O(1/N)$ , and consequently $\phi_{N}$ is of the order $\Theta(N^{2})$ . By proposition 2.5, $\|x^{N}-{\widehat{x}}\|^{2}$ converges to zero at the rate $O(1/N^{2})$ .
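The growth orders of $\tau_{N}$ and $\phi_{N}$ are easy to observe numerically. The sketch below assumes the recursions $\phi_{i+1}=\phi_{i}(1+2\gamma\tau_{i})$ and $\tau_{i+1}=\tau_{i}/\sqrt{1+2\gamma\tau_{i}}$ ; the displays defining the rules are not reproduced here, so these formulas, the function name and the initial values are assumptions based on the standard acceleration scheme of [7].

```python
import math


def accelerated_step_lengths(tau0, gamma, iterations, phi0=1.0):
    """Return the sequences (tau_i) and (phi_i) for i = 0, ..., iterations."""
    tau, phi = tau0, phi0
    taus, phis = [tau], [phi]
    for _ in range(iterations):
        factor = 1.0 + 2.0 * gamma * tau   # uses tau_i before it is updated
        phi *= factor                      # phi_{i+1} = phi_i * (1 + 2*gamma*tau_i)
        tau /= math.sqrt(factor)           # tau_{i+1} = tau_i / sqrt(1 + 2*gamma*tau_i)
        taus.append(tau)
        phis.append(phi)
    return taus, phis
```

Under these recursions the product $\phi_{i}\tau_{i}^{2}$ is constant, so $\tau_{N}=O(1/N)$ is equivalent to $\phi_{N}=\Theta(N^{2})$ ; numerically, $N\gamma\tau_{N}$ approaches one.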

Remark 4.8 (Brezis–Crandall–Pazy property).

It is possible to show that $H$ satisfies the Brezis–Crandall–Pazy property [3] without a compactness assumption on $K$ . With a corresponding improvement to proposition 2.5, the assumption could be dropped.

Remark 4.9 (Linear convergence).

If $F^{*}$ is strongly convex with factor $\rho>0$ , the last equation of (56) takes a form similar to the first, $\psi_{i+1}:=\psi_{i}(1+2\rho\sigma_{i})$ . From here, if both $G$ and $F^{*}$ are strongly convex, it is possible to show linear convergence.

We can also add an additional forward step to the method. With that, the method resembles the method of Vũ–Condat [10, 32], which also incorporates an additional outer over-relaxation step on the whole algorithm.

Example 4.10 (Chambolle–Pock with a forward step).

Suppose $G$ is (strongly) convex with factor $\gamma\geq 0$ , and $\nabla J$ Lipschitz with factor $L$ . In [8], the Chambolle–Pock method was extended to take forward steps with respect to $J$ . With everything else as in example 4.6, take $V^{\prime}_{i+1}(u):=(\tau_{i}(\nabla J(x^{i})-\nabla J(x)),0)$ . Then the preconditioned proximal point method (PP) can be rearranged as

[TABLE]

The method inherits the convergence properties of example 4.6 if we use the step length update rules (53), and initialise $\tau_{0},\sigma_{0}>0$ subject to (52), and

[TABLE]
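As an illustration, the following minimal sketch applies an unaccelerated forward-step variant to $\min_{x}\frac{1}{2}\sum_{k}(a_{k}x_{k}-f_{k})^{2}+\|x\|_{1}$ , that is, $G=0$ , $J(x)=\frac{1}{2}\|Ax-f\|^{2}$ with a diagonal $A$ , $F=\|\cdot\|_{1}$ and $K=I$ . The explicit update $x^{i+1}=x^{i}-\tau(\nabla J(x^{i})+K^{*}y^{i})$ and the asserted step length condition $L\tau_{0}+\tau_{0}\sigma_{0}\|K\|^{2}<1$ are assumptions: the former is the usual effect of replacing $\nabla J(x^{i+1})$ by $\nabla J(x^{i})$ , the latter is of the type derived in the proof below, and neither is copied from the displays. All names and data are hypothetical.

```python
def forward_step_primal_dual(a, f, tau=0.3, sigma=0.5, iterations=500):
    """Primal-dual iteration with a forward step on J; G = 0, K = I, F = |.|_1."""
    lipschitz = max(ak * ak for ak in a)        # L = |A|^2 for diagonal A
    # Assumed step length condition with |K| = 1 (see the lead-in).
    assert lipschitz * tau + tau * sigma < 1.0
    n = len(f)
    x, y = [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # Forward (gradient) step on J combined with the dual coupling term.
        grad = [ak * (ak * xk - fk) for ak, xk, fk in zip(a, x, f)]
        x_new = [xk - tau * (gk + yk) for xk, gk, yk in zip(x, grad, y)]
        # Over-relaxation with omega = 1 (no acceleration since gamma = 0).
        x_bar = [2.0 * xn - xk for xn, xk in zip(x_new, x)]
        # Dual proximal step: projection onto [-1, 1]^n.
        y = [min(1.0, max(-1.0, yk + sigma * xb)) for yk, xb in zip(y, x_bar)]
        x = x_new
    return x, y
```

For diagonal $A$ the problem decouples, and the minimiser is $x_{k}=\operatorname{sign}(a_{k}f_{k})\max\{|a_{k}f_{k}|-1,0\}/a_{k}^{2}$ , which can be compared against the output.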

Proof 4.11 (Proof of convergence).

With $D_{i+2}$ as in (54), the fundamental condition for saddle-point problems (47) becomes

[TABLE]

The rules (56) force $\|\,\boldsymbol{\cdot}\,\|_{D_{i+2}}=0$ . We take $\Delta_{i+1}({\widehat{u}})=-\frac{\theta}{2}\|u^{i+1}-u^{i}\|_{Z_{i+1}M_{i+1}}^{2}$ for some $\theta>0$ , and deduce using Cauchy’s inequality that (62) holds if

[TABLE]

Recalling (57), this is true if $(1-\theta)\delta\phi_{i}\geq\tau_{i}\phi_{i}L$ and $\psi\geq(1-\delta)^{-1}\phi_{i}\tau_{i}^{2}\|K\|^{2}$ . Further recalling (56), and observing that $\{\tau_{i}\}$ is non-increasing, we only have to satisfy $(1-\theta)(1-\tau_{0}\sigma_{0}\|K\|^{2})\geq L\tau_{0}$ . Otherwise put, we obtain (61).

Finally, we have the following Generalised Iterative Soft Thresholding (GIST) method from [19].

Example 4.12 (GIST).

Suppose $G=0$ , $J(x)=\frac{1}{2}\|f-Ax\|^{2}$ , $\|A\|<\sqrt{2}$ , and $\|K\|\leq 1$ . Take

[TABLE]

With $T_{i}:=I$ and $\Sigma_{i+1}:=I$ , we obtain the method

[TABLE]

If $K$ is compact, the iterates $\{x^{i}\}_{i\in\mathbb{N}}$ converge weakly to ${\widehat{x}}$ .

Proof 4.13 (Proof of convergence).

Observe that the partial co-coercivity (J-PC) holds with $L=\|A\|^{2}$ . Clearly $Z_{i+1}M_{i+1}$ is positive semi-definite self-adjoint. If we take $\Phi_{i}=I$ and $\Psi_{i+1}=I$ , then

[TABLE]

Thus $\frac{1}{2}\|u\|_{D_{i+2}}^{2}=0$ . Eliminating $\partial F^{*}$ by monotonicity, the fundamental condition for saddle-point problems (47) thus holds if

[TABLE]

Expanding $Z_{i+1}M_{i+1}$ , we see this to hold when $\|K\|<1$ and $L<2$ , which are exactly our assumptions. Using corollaries 4.4 and 2.5, and reasoning as in example 4.6 to verify the outer-semicontinuity properties of $H$ , we obtain weak convergence.

5 An ergodic duality gap

We now study the extension of the testing approach of section 2.2 to produce the convergence of an ergodic duality gap. Throughout this section, we are in the saddle point setup of section 4. In particular, the operator $H$ is as in (42), and the step length and testing operators $W_{i+1}$ and $Z_{i+1}$ as in (43).

5.1 Preliminary gap estimates

Our first lemma demonstrates how to obtain a “preliminary” gap $\mathcal{G}^{\prime}_{i+1}(u)$ from $H$ . If the step lengths and tests are scalar, $T_{i}=\tau_{i}I$ , and $\Phi_{i}=\phi_{i}I$ , etc., and satisfy $\tau_{i}\phi_{i}=\sigma_{i}\psi_{i+1}$ , it is easy to bound this preliminary gap from below by $\tau_{i}\phi_{i}$ times the “relaxed” duality gap

[TABLE]

To do the same for more general step length operators, we will in section 5.3 introduce abstract notions of convexity that incorporate ergodicity and stochasticity.

Observe that the “relaxed” gap (63) satisfies

[TABLE]

where the right-hand side is the conventional duality gap guaranteed to be non-zero for a non-solution $x$ .

Lemma 5.1.

For a fixed $i\in\mathbb{N}$ , suppose $\Phi_{i}T_{i}$ and $\Psi_{i+1}\Sigma_{i+1}$ are self-adjoint. Then for $H$ as in (42), we have

[TABLE]

where the “preliminary gap”

[TABLE]

Proof 5.2.

Similarly to the proof of theorem 4.2, we have

[TABLE]

A little bit of reorganisation gives (64). Indeed

[TABLE]

The next lemma extends theorem 4.2 to estimate the preliminary gap.

Lemma 5.3.

Let $H$ have the structure (42) and assume ${\widehat{u}}\in H^{-1}(0)$ . For each $i\in\mathbb{N}$ , let $T_{i},\Phi_{i}\in\mathcal{L}(X;X)$ and $\Sigma_{i+1},\Psi_{i+1}\in\mathcal{L}(Y;Y)$ , as well as $V^{\prime}_{i+1}:X\times Y\to X\times Y$ and $M_{i+1}\in\mathcal{L}(X\times Y;X\times Y)$ . Define $Z_{i+1}$ and $W_{i+1}$ through (43). Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset X\times Y$ . If, for all $i\in\mathbb{N}$ , $Z_{i+1}M_{i+1}$ is self-adjoint, and

[TABLE]

then

[TABLE]

Proof 5.4.

Inserting (64) from lemma 5.1 into (66) shows that

[TABLE]

Hence the fundamental condition (CI∼) holds for $\Delta_{i+1}({\widehat{u}}):=\widetilde{\Delta}_{i+1}({\widehat{u}})-\mathcal{G}^{\prime}_{i+1}(u^{i+1})$ . Now we use theorem 2.1.

5.2 General conversion formulas of preliminary gaps to ergodic gaps

The “preliminary gaps” are not as such very useful. To go further, the abstract partial monotonicity assumptions (G-PM) and (F∗-PM) are not enough, and we need analogous convexity formulations. We formulate these conditions directly in the stochastic setting (recall section 3).

For the moment, we assume for all $N\geq 1$ that whenever ${\widetilde{T}}_{i}\,(:=\Phi_{i}T_{i})\in\mathcal{R}(\mathcal{T})$ and $x^{i+1}\in\mathcal{R}(X)$ for each $i=0,\ldots,N-1$ with $\sum_{i=0}^{N-1}\mathbb{E}[{\widetilde{T}}_{i}]=I$ , then for some $\delta_{G}^{i+1}\in\mathcal{R}(\mathbb{R})$ holds

[TABLE]

Analogously, we assume for ${\widetilde{\Sigma}}_{i+1}\,(:=\Psi_{i+1}\Sigma_{i+1})\in\mathcal{R}(\mathcal{S})$ and $y^{i+1}\in\mathcal{R}(Y)$ for each $i=0,\ldots,N-1$ with $\sum_{i=0}^{N-1}\mathbb{E}[{\widetilde{\Sigma}}_{i+1}]=I$ that for some $\delta_{F^{*}}^{i+1}\in\mathcal{R}(\mathbb{R})$ holds

[TABLE]

These conditions can of course always be satisfied for some $\delta_{G}^{i+1}$ and $\delta_{F^{*}}^{i+1}$ . After a few general lemmas, we will replace these placeholder values by more meaningful ones.

To state those lemmas, we also assume for some scalars $\bar{\eta}_{i}\in\mathbb{R}$ , ( $i\in\mathbb{N}$ ), either of the primal–dual coupling conditions

[TABLE]

As we will see in example 5.15, (C $\mathcal{G}_{*}$ ) is satisfied by the accelerated Chambolle–Pock method of example 4.6. In our companion paper [30], we will however see that (C $\mathcal{G}$ ) is required to develop doubly-stochastic methods.

Lemma 5.5.

Assume (68), (69), and the first primal–dual coupling condition (C $\mathcal{G}$ ). Given iterates $\{(x^{i},y^{i})\}_{i=1}^{\infty}\subset X\times Y$ , for all $N\geq 1$ set

[TABLE]

and define the ergodic sequences

[TABLE]

Then

[TABLE]

Proof 5.6.

Let $N$ be fixed. With ${\widetilde{T}}_{i}:=\zeta^{-1}_{N}\Phi_{i}T_{i}$ over $i=0,\ldots,N-1$ , (68) implies

[TABLE]

Likewise, with ${\widetilde{\Sigma}}_{i+1}:=\zeta^{-1}_{N}\Psi_{i+1}\Sigma_{i+1}$ , (69) shows that

[TABLE]

From the definition of the preliminary gap in (65), applying (C $\mathcal{G}$ ), we obtain

[TABLE]

Recalling the definition of the gap $\mathcal{G}$ in (63), and using the estimates (71), (72), as well as the definition (70) of the ergodic sequences, we obtain the claim.

Lemma 5.7.

Suppose $G$ and $F^{*}$ satisfy with $\Gamma=0$ the corresponding partial monotonicities (G-PM) and (F∗-PM). Also assume (68), (69), and the second primal–dual coupling condition (C $\mathcal{G}_{*}$ ). Given $\{(x^{i},y^{i})\}_{i=1}^{\infty}\subset X\times Y$ , for all $N\geq 1$ set

[TABLE]

and define the ergodic sequences

[TABLE]

Then

[TABLE]

Proof 5.8.

Shifting indices of $y^{i}$ by one compared to $\mathcal{G}^{\prime}_{i+1}$ , we define

[TABLE]

Reorganising terms, therefore

[TABLE]

By virtue of $0\in H({\widehat{u}})$ , we have $K^{*}{\widehat{y}}\in\partial G({\widehat{x}})$ , and $-K{\widehat{x}}\in\partial F^{*}({\widehat{y}})$ . Estimating with (G-PM) and (F∗-PM), and afterwards taking the expectation, we therefore obtain

[TABLE]

From here we may proceed analogously to the proof of lemma 5.5.

5.3 Final gap estimates

We now convert the abstract ergodic conditions (68) and (69) into ergodic strong convexity and smoothness conditions that can be derived from the corresponding standard properties in block-separable cases.

Recall the spaces of operators $\mathcal{T}$ and $\mathcal{S}$ from section 4. We assume for all $N\geq 1$ that whenever ${\widetilde{T}}_{i}\,(:=\Phi_{i}T_{i})\in\mathcal{R}(\mathcal{T})$ and $x^{i+1}\in\mathcal{R}(X)$ for each $i=0,\ldots,N-1$ with $\sum_{i=0}^{N-1}\mathbb{E}[{\widetilde{T}}_{i}]=I$ , then for some $0\leq\Gamma\in\mathcal{L}(X;X)$ we have the ergodic strong convexity

[TABLE]

Analogously, we assume for ${\widetilde{\Sigma}}_{i+1}\,(:=\Psi_{i+1}\Sigma_{i+1})\in\mathcal{R}(\mathcal{S})$ and $y^{i+1}\in\mathcal{R}(Y)$ for each $i=0,\ldots,N-1$ with $\sum_{i=0}^{N-1}\mathbb{E}[{\widetilde{\Sigma}}_{i+1}]=I$ the ergodic convexity

[TABLE]

Finally, we assume $J$ is differentiable and satisfies for some parameters $L_{i}\geq 0$ the 3-point ergodic smoothness condition

[TABLE]

The shifting refers to uses of $x^{i}$ , where a typical definition of smoothness would use ${\widehat{x}}$ .

Example 5.9 (Block-separable structure, ergodic convexity).

Let $G$ and $\mathcal{T}$ have the separable structure of example 4.1. We claim that the ergodic strong convexity (G-EC) holds. Indeed, let us introduce ${\widetilde{T}}_{i}:=\sum_{j=1}^{m}\widetilde{\tau}_{j,i}P_{j}\geq 0$ , satisfying $\sum_{i=0}^{N-1}\mathbb{E}[\widetilde{\tau}_{j,i}]=1$ for each $j=1,\ldots,m$ . Splitting (G-EC) into separate inequalities over all $j=1,\ldots,m$ , and using the strong convexity of $G_{j}$ , we see (G-EC) to be true with $\Gamma=\sum_{j=1}^{m}\gamma_{j}P_{j}$ if for all $j=1,\ldots,m$ holds

[TABLE]

The right hand side can also be written as $\int_{\Omega^{N}}G_{j}(P_{j}{\widehat{x}})-G_{j}(P_{j}x^{i}(\omega))\,d\mu^{N}(i,\omega)$ for the measure $\mu^{N}:=\widetilde{\tau}_{j}\sum_{i=0}^{N-1}\delta_{i}\times\mathbb{P}$ on the domain $\Omega^{N}:=\{0,\ldots,N-1\}\times\Omega$ . Using our assumption $\sum_{i=0}^{N-1}\mathbb{E}[\widetilde{\tau}_{j,i}]=1$ , we deduce $\mu^{N}(\Omega^{N})=1$ . An application of Jensen’s inequality now shows (73). Therefore (G-EC) is satisfied for $G=G$ .
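In the deterministic scalar case the Jensen argument can be checked numerically. The sketch below forms the weights $\widetilde{\tau}_{i}=\phi_{i}\tau_{i}/\zeta_{N}$ with $\zeta_{N}=\sum_{i}\phi_{i}\tau_{i}$ and verifies, for a $\gamma$ -strongly convex example $G$ , the inequality $G(\widetilde{x}_{N})+\frac{\gamma}{2}\sum_{i}\widetilde{\tau}_{i}\|x^{i}-\widetilde{x}_{N}\|^{2}\leq\sum_{i}\widetilde{\tau}_{i}G(x^{i})$ . This is the standard strongly convex Jensen inequality rather than a transcription of (G-EC); the function $G$ , the weights and the random points are hypothetical.

```python
import random

GAMMA = 0.5  # assumed strong convexity factor of the example function G


def G(x):
    """Example G(x) = (GAMMA/2)|x|^2 + |x|_1, which is GAMMA-strongly convex."""
    return sum(0.5 * GAMMA * v * v + abs(v) for v in x)


def ergodic_average(points, phis, taus):
    """Weighted average of the points with weights phi_i * tau_i / zeta_N."""
    raw = [phi * tau for phi, tau in zip(phis, taus)]
    zeta = sum(raw)
    weights = [r / zeta for r in raw]
    dim = len(points[0])
    average = [sum(w * p[k] for w, p in zip(weights, points)) for k in range(dim)]
    return average, weights


def jensen_slack(points, phis, taus):
    """sum_i w_i G(x^i) - G(x~) - (GAMMA/2) sum_i w_i |x^i - x~|^2."""
    average, weights = ergodic_average(points, phis, taus)
    mean_value = sum(w * G(p) for w, p in zip(weights, points))
    spread = sum(w * sum((a - b) ** 2 for a, b in zip(p, average))
                 for w, p in zip(weights, points))
    return mean_value - G(average) - 0.5 * GAMMA * spread


rng = random.Random(1)
points = [[rng.uniform(-3, 3) for _ in range(4)] for _ in range(50)]
phis = [1.0 + i * i for i in range(50)]   # hypothetical testing parameters
taus = [1.0 / (i + 1) for i in range(50)]  # hypothetical step lengths
slack = jensen_slack(points, phis, taus)
```

The value `slack` should be non-negative up to rounding; for the quadratic part of $G$ the inequality holds with equality, so the slack comes entirely from the $\ell^{1}$ term.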

Example 5.10 (Ergodic smoothness for smooth $J$ ).

If $J\in\mathrm{cpl}(X)$ has $L$ -Lipschitz gradient, then lemma B.1 shows the three-point inequality

[TABLE]

If $\widetilde{T}_{i}=\widetilde{\tau}_{i}I$ for scalar $\widetilde{\tau}_{i}$ , then proceeding as in (73) in example 5.9, we deduce the 3-point ergodic smoothness (J-ES) with $L_{i}=L$ . Similarly, we can treat the block-separable case $J(x)=\sum_{j=1}^{m}J_{j}(P_{j}x)$ when each $J_{j}$ individually has Lipschitz gradient.

The next theorem is our main result for saddle point problems. To clarify the statement of the theorem, which depends on various different combinations of several conditions in the definition of $\widetilde{g}_{N}$ , we recall here the rough meaning of each:

(47): Fundamental condition (CI∼) for saddle point problems.

(G-PM): Partial (testing and step length operator relative) strong monotonicity of $G$ .

(F∗-PM): Partial monotonicity of $F^{*}$ .

(J-PC): Partial co-coercivity of $J$ .

(G-EC): Partial strong ergodic convexity of $G$ .

(F∗-EC): Partial ergodic convexity of $F^{*}$ .

(J-ES): Partial 3-point ergodic smoothness of $J$ .

(C $\mathcal{G}$ ): First alternative primal–dual coupling condition.

(C $\mathcal{G}_{*}$ ): Second alternative primal–dual coupling condition.

Theorem 5.11.

Let $H$ have the structure (42) and assume ${\widehat{u}}\in H^{-1}(0)$ . For each $i\in\mathbb{N}$ , let $T_{i},\Phi_{i}\in\mathcal{R}(\mathcal{L}(X;X))$ and $\Sigma_{i+1},\Psi_{i+1}\in\mathcal{R}(\mathcal{L}(Y;Y))$ be such that $\Phi_{i}T_{i}\in\mathcal{R}(\mathcal{T})$ and $\Psi_{i+1}\Sigma_{i+1}\in\mathcal{R}(\mathcal{S})$ . Define $Z_{i+1}$ and $W_{i+1}$ through (43). Also take $V^{\prime}_{i+1}\in\mathcal{R}(X\times Y\to X\times Y)$ and $M_{i+1}\in\mathcal{R}(\mathcal{L}(X\times Y;X\times Y))$ . Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset X\times Y$ . Assuming one of the following cases to hold with $0\leq\Gamma\in\mathcal{L}(X;X)$ and $L_{i}\geq 0$ , let

[TABLE]

If, for all $i\in\mathbb{N}$ , $Z_{i+1}M_{i+1}$ is self-adjoint and (47) holds for $\widetilde{\Gamma}$ given above, then so does the following ergodic gap descent inequality:

[TABLE]

Proof 5.12.

The case $\widetilde{g}_{N}=0$ is simply the result of taking the expectation in the claim of theorem 4.2; compare how corollary 3.2 follows from theorem 2.1. Regarding the remaining two cases, clearly (47) implies (66) for

[TABLE]

Thus lemma 5.3 shows the descent estimate (67).

The ergodic strong convexity (G-EC) and (J-ES) imply (68) for

[TABLE]

where ${\widetilde{T}}_{i}\in\mathcal{R}(\mathcal{T})$ . Likewise the ergodic convexity (F∗-EC) implies (69) for $\delta_{F^{*},N}^{i+1}:=0$ . When the first primal–dual coupling condition (C $\mathcal{G}$ ) holds, we take above ${\widetilde{T}}_{i}=\zeta^{-1}_{N}\Phi_{i}T_{i}$ , which we have assumed to belong to $\mathcal{R}(\mathcal{T})$ . If the alternative second primal–dual coupling condition (C $\mathcal{G}_{*}$ ) holds, we take ${\widetilde{T}}_{i}=\zeta^{-1}_{*,N}\Phi_{i}T_{i}$ . Therefore, (67) can be rewritten

[TABLE]

for

[TABLE]

Now we just take the expectation in (74), and apply lemma 5.5 or lemma 5.7.

5.4 Primal–dual examples revisited

We now study gap estimates for several of the examples from section 4. We start by verifying partial monotonicity and ergodic convexity and smoothness conditions in the case of simple deterministic scalar step length and testing operators; the block-separable and stochastic case we leave to the companion paper [30].

Similarly to corollary 4.4 of theorem 4.2, we now have the following non-stochastic scalar corollary of theorem 5.11. From the corollary, if $\Delta_{i+1}\leq 0$ , we clearly get the convergence of $\mathcal{G}(\widetilde{x}_{*,N},\widetilde{y}_{*,N})$ or $\mathcal{G}(\widetilde{x}_{N},\widetilde{y}_{N})$ to zero at the respective rate $O(1/\zeta_{*,N})$ or $O(1/\zeta_{N})$ .

Corollary 5.13.

Let $H$ have the structure (42) and assume ${\widehat{u}}\in H^{-1}(0)$ . Assume that $G$ is ( $\gamma$ -strongly) convex and $\nabla J$ is $L$ -Lipschitz for some $\gamma\geq 0$ and $L>0$ . For each $i\in\mathbb{N}$ , assume the structure (50) for $\tau_{i},\phi_{i},\sigma_{i+1},\psi_{i+1}>0$ . Also take $V^{\prime}_{i+1}:X\times Y\to X\times Y$ and $M_{i+1}\in\mathcal{L}(X\times Y;X\times Y)$ . Suppose (PP) is solvable for $\{u^{i+1}\}_{i\in\mathbb{N}}\subset X\times Y$ . Suppose for all $i\in \mathbb{N}$ that $\phi_{i}\tau_{i}=\psi_{i}\sigma_{i}$ , that $Z_{i+1}M_{i+1}$ is self-adjoint, and that the fundamental condition for saddle-point problems (47) holds for $\widetilde{\Gamma}=(\gamma/2)I$ and $L_{i}\equiv L$ . Then

[TABLE]

If, instead, $\phi_{i}\tau_{i}=\psi_{i+1}\sigma_{i+1}$ , then the gap expression is replaced by $\zeta_{N}\mathcal{G}(\widetilde{x}_{N},\widetilde{y}_{N})$ .

Proof 5.14.

As in the proof of corollary 4.4, clearly $\Phi_{i}T_{i}\in\mathcal{T}:=[0,\infty)I$ and $\Psi_{i+1}\Sigma_{i+1}\in\mathcal{S}:=[0,\infty)I$ , so that the partial monotonicities (F∗-PM) and (G-PM) (with $\Gamma=0$ ) hold by the monotonicity of the subdifferentials of $G$ and $F^{*}$ . Similarly, the ergodic (strong) convexity (G-EC) of $G$ with $\Gamma=\gamma I$ and (F∗-EC) of $F^{*}$ hold by a Jensen argument similar to example 5.9. Likewise, the ergodic smoothness (J-ES) holds by the three-point inequality (77) and a Jensen argument similar to example 5.10. Note that with everything deterministic, the expectations disappear.

With this, the result follows immediately from theorem 5.11 for the second and third cases of $\widetilde{g}_{N}$ . The primal–dual coupling conditions (C $\mathcal{G}_{*}$ ) and (C $\mathcal{G}$ ) reduce to our respective conditions $\phi_{i}\tau_{i}=\psi_{i}\sigma_{i}$ and $\phi_{i}\tau_{i}=\psi_{i+1}\sigma_{i+1}$ .

In examples 4.6 and 4.12, we proved (47) for the Chambolle–Pock method and the GIST with $\widetilde{\Gamma}=\gamma I$ and $L_{i}\equiv L/2$ . Now we have to do the same but with the factor-of-two different $\widetilde{\Gamma}=(\gamma/2)I$ and $L_{i}\equiv L$ . The different $\widetilde{\Gamma}$ will merely change the acceleration factor of the method. The larger $L_{i}$ , on the other hand, will change the step length bound (61) of the forward-step Chambolle–Pock, example 4.10, to

[TABLE]

and the bound $\|A\|\leq\sqrt{2}$ of the GIST of example 4.12 to $\|A\|\leq 1$ .

Example 5.15 (Gap for Chambolle–Pock with a forward step).

In the demonstration of examples 4.6 and 4.10, we have seen the Chambolle–Pock method to satisfy $\phi_{i}\tau_{i}=\psi_{i}\sigma_{i}$ and the self-adjointness of $Z_{i+1}M_{i+1}$ . As discussed above, (47) holds with $\Delta_{i+1}\leq 0$ subject to the conditions $\widetilde{\gamma}\in[0,\gamma/2]$ and (75). We now have $\zeta_{*,N}=\sum_{i=1}^{N-1}\phi_{i}^{1/2}$ . In the unaccelerated case ( $\gamma=0$ ), we get $\zeta_{*,N}=N\phi_{0}^{1/2}$ . Therefore, we get from corollary 5.13 the $O(1/N)$ convergence of $\mathcal{G}(\widetilde{x}_{*,N},\widetilde{y}_{*,N})$ to zero. In the accelerated case ( $\gamma>0$ ), $\phi_{i}$ is of the order $\Theta(i^{2})$ . Therefore also $\zeta_{*,N}$ is of the order $\Theta(N^{2})$ , so we get $O(1/N^{2})$ convergence of $\mathcal{G}(\widetilde{x}_{*,N},\widetilde{y}_{*,N})$ to zero.

Example 5.16 (Gap for GIST).

In example 4.12 we have seen the GIST to satisfy $\tau_{i}=\phi_{i}=\sigma_{i+1}=\psi_{i+1}=1$ and the self-adjointness of $Z_{i+1}M_{i+1}$ . Moreover, as discussed above, (47) holds with $\Delta_{i+1}\leq 0$ if $\|A\|\leq 1$ . It therefore has $\zeta_{N}=N-1$ and $\zeta_{*,N}=N$ . Consequently, corollary 5.13 yields the $O(1/N)$ convergence of both $\mathcal{G}(\widetilde{x}_{*,N},\widetilde{y}_{*,N})$ and $\mathcal{G}(\widetilde{x}_{N},\widetilde{y}_{N})$ to zero.

Conclusion

We have unified common convergence proofs of optimisation methods, employing the ideas of non-linear preconditioning and testing of the classical proximal point method. We have demonstrated that popular classical and modern algorithms can be presented in this framework, and their convergence, including convergence rates, proved with little effort. The theory was, however, not developed with existing algorithms in mind. It was developed to allow the development of new spatially adapted block-proximal methods in [30]. We will demonstrate there and in other works to follow, the full power of the theory. For one, we did not yet fully exploit the fact that $W_{i+1}$ and $Z_{i+1}$ are operators, to construct step-wise step lengths and acceleration.

Appendix A Outer semicontinuity of maximal monotone operators

We could not find the following result explicitly stated in the literature, although it is hidden in, e.g., the proof of [27, Theorem 1].

Lemma A.1.

Let $H:U\rightrightarrows U$ be maximal monotone on a Hilbert space $U$ . Then $H$ is weak-to-strong outer semicontinuous: for any sequence $\{u^{i}\}_{i\in\mathbb{N}}$ , and any $z^{i}\in H(u^{i})$ such that $u^{i}\mathrel{\rightharpoonup}u$ weakly, and $z^{i}\to z$ strongly, we have $z\in H(u)$ .

Proof A.2.

By monotonicity, for any $u^{\prime}\in U$ and $z^{\prime}\in H(u^{\prime})$ holds $D_{i}:=\langle u^{\prime}-u^{i},z^{\prime}-z^{i}\rangle\geq 0$ . Since a weakly convergent sequence is bounded, we have $0\leq D_{i}\leq\langle u^{\prime}-u^{i},z^{\prime}-z\rangle+C\|z-z^{i}\|$ for some $C>0$ independent of $i$ . Taking the limit, we therefore have $\langle u^{\prime}-u,z^{\prime}-z\rangle\geq 0$ . If we had $z\not\in H(u)$ , this would contradict that $H$ is maximal, i.e., that its graph is not properly contained in the graph of any other monotone operator.

Appendix B Three-point inequalities

The following three-point formulas are central to handling forward steps with respect to smooth functions.

Lemma B.1.

Suppose $J\in\mathrm{cpl}(X)$ has $L$ -Lipschitz gradient. Then

[TABLE]

as well as

[TABLE]

Proof B.2.

Regarding the “three-point hypomonotonicity” (76), the $L$ -Lipschitz gradient implies co-coercivity (see [1] or appendix C)

[TABLE]

Thus using Cauchy’s inequality

[TABLE]

To prove (77), the Lipschitz gradient implies the smoothness or “descent inequality” (again, [1] or appendix C)

[TABLE]

By convexity $J({\widehat{x}})-J(z)\geq\langle\nabla J(z),{\widehat{x}}-z\rangle$ . Summed, we obtain (77).
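Both three-point estimates are easy to test numerically. The sketch below assumes the standard forms $\langle\nabla J(z)-\nabla J({\widehat{x}}),x-{\widehat{x}}\rangle\geq-\frac{L}{4}\|x-z\|^{2}$ and $\langle\nabla J(z),x-{\widehat{x}}\rangle\geq J(x)-J({\widehat{x}})-\frac{L}{2}\|x-z\|^{2}$ , which is what the arguments of the proof produce; they are not copied from the displays (76) and (77). The quadratic $J(x)=\frac{1}{2}\|Ax-f\|^{2}$ with a diagonal $A$ , and all data, are hypothetical.

```python
import random

A_DIAG = [1.0, 0.5, 2.0]          # hypothetical diagonal operator A
F_DATA = [3.0, -1.0, 0.5]         # hypothetical data f
L = max(a * a for a in A_DIAG)    # Lipschitz factor of grad J, equal to |A|^2


def J(x):
    return 0.5 * sum((a * v - f) ** 2 for a, v, f in zip(A_DIAG, x, F_DATA))


def grad_J(x):
    return [a * (a * v - f) for a, v, f in zip(A_DIAG, x, F_DATA)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def three_point_slacks(z, x, x_hat):
    """Left minus right-hand sides of the two assumed inequalities."""
    gz, gh = grad_J(z), grad_J(x_hat)
    diff = [a - b for a, b in zip(x, x_hat)]
    hypomonotone = dot([a - b for a, b in zip(gz, gh)], diff) + 0.25 * L * sqdist(x, z)
    smooth = dot(gz, diff) - (J(x) - J(x_hat)) + 0.5 * L * sqdist(x, z)
    return hypomonotone, smooth


def worst_slacks(trials=1000, seed=0):
    """Smallest slacks over random triples (z, x, x_hat)."""
    rng = random.Random(seed)
    worst = [float("inf"), float("inf")]
    for _ in range(trials):
        z, x, x_hat = ([rng.uniform(-5, 5) for _ in range(3)] for _ in range(3))
        h, s = three_point_slacks(z, x, x_hat)
        worst = [min(worst[0], h), min(worst[1], s)]
    return worst
```

Both entries of `worst_slacks()` should be non-negative up to rounding. Taking $x=z$ reduces the first inequality to the monotonicity of $\nabla J$ and the second to the convexity of $J$ .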

Lemma B.3.

Suppose $J\in\mathrm{cpl}(X)$ has $L$ -Lipschitz gradient and is $\gamma$ -strongly convex. Then for any $\tau>0$ holds

[TABLE]

as well as

[TABLE]

Proof B.4.

To prove (80), using strong convexity, the Lipschitz gradient, and Cauchy’s inequality, we have

[TABLE]

Regarding (79), using the $\gamma$ -strong monotonicity of $\nabla J$ , we estimate completely analogously

[TABLE]

Since smooth functions with a positive Hessian are locally convex, the above lemmas readily extend to this case, locally. In fact, we have the following more precise result:

Lemma B.5.

Suppose $J\in C^{2}(X)$ with $\nabla^{2}J({\widehat{x}})>0$ at given ${\widehat{x}}\in X$ . Then for any $\tau\in(0,2]$ and all $z,x,\eta\in X$ , we have

[TABLE]

with

[TABLE]

If $x\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ , then also

[TABLE]

Proof B.6.

By Taylor expansion, for some $\zeta$ between $z$ and ${\widehat{x}}$ , and any $\tau>0$ , we have

[TABLE]

Since $\zeta\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ , by the definition of $\delta_{z,\eta}$ , we obtain (81).

Similarly, by Taylor expansion, for some $\zeta_{0}$ between $x$ and ${\widehat{x}}$ , we have

[TABLE]

Using (84) we obtain

[TABLE]

Using the assumption $x\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ , we have $\zeta_{0}\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ . Hence we obtain (83) by the definition of $\delta_{z,\eta}$ and $(1-\delta_{z,\eta})(2-\tau)-(1+\delta_{z,\eta})=(1-\delta_{z,\eta})(1-\tau)-2\delta_{z,\eta}$ .

We can also derive the following alternate result:

Lemma B.7.

Suppose $J\in C^{2}(X)$ with $\nabla^{2}J({\widehat{x}})>0$ at given ${\widehat{x}}\in X$ . Then for all $z,x,\eta\in X$ we have

[TABLE]

for $\delta_{z,\eta}$ given by (82). If $x\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ , then also

[TABLE]

Proof B.8.

By Taylor expansion, for some $\zeta$ between $z$ and ${\widehat{x}}$ , we have

[TABLE]

In the last step we have used Cauchy’s inequality, and the definition of $\delta_{z,\eta}$ following $\zeta\in\operatorname{cl}B(\|z-{\widehat{x}}\|,{\widehat{x}})$ . The standard three-point or Pythagoras’ identity states

[TABLE]

Applying this in (88), we obtain (86).

To prove (87), we use (85), the definition of $\delta_{z,\eta}$ , and (86).

Appendix C Projected gradients and smoothness

The next lemma generalises well-known properties [see, e.g., 1] of smooth convex functions to projected gradients, when we take $P$ as a projection operator. With $P$ a random projection, taking the expectation in (91), we in particular obtain a connection to the Expected Separable Over-approximation property in the stochastic coordinate descent literature [26].

Lemma C.1.

Let $J\in\mathrm{cpl}(X)$ , and $P\in\mathcal{L}(X;X)$ be self-adjoint and positive semi-definite on a Hilbert space $X$ . Suppose $P$ has a pseudo-inverse $P^{\dagger}$ satisfying $PP^{\dagger}P=P$ . Consider the properties:

(i) $P$ -relative Lipschitz continuity of $\nabla J$ with factor $L$ :

[TABLE]

(ii) The $P$ -relative property

[TABLE]

(iii) $P$ -relative smoothness of $J$ with factor $L$ :

[TABLE]

(iv) The $P$ -relative property

[TABLE]

(v) $P$ -relative co-coercivity of $\nabla J$ with factor $L^{-1}$ :

[TABLE]

We have (i) $\implies$ (ii) $\iff$ (iii) $\implies$ (iv) $\implies$ (v). If $P$ is invertible, all are equivalent.

Proof C.2.

(i) $\implies$ (ii): Take $y=x+Ph$ and multiply (89) by $\|h\|_{P}$. Then use Cauchy–Schwarz.

(ii) $\implies$ (iii): Using the mean value theorem and (90), we compute (91):

[TABLE]

(iii) $\implies$ (ii): Add together (91) for $x=x^{\prime}$ and $x=x^{\prime}+Ph$.

(iii) $\implies$ (iv): Adding $-\langle\nabla J(y),x+Ph\rangle$ on both sides of (91), we get

[TABLE]

The left-hand side is minimised with respect to $x$ by taking $x=y-Ph$. Taking $h=L^{-1}(\nabla J(y)-\nabla J(x))$ on the right-hand side therefore gives (92).

(iv) $\implies$ (v): Summing the estimate (92) with the same estimate with $x$ and $y$ exchanged, we obtain (93).

(v) $\implies$ (i) when $P$ is invertible: Cauchy–Schwarz.
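As a numerical illustration of the lemma, the following sketch tests properties (ii) to (v) on a convex quadratic $J(x)=\tfrac{1}{2}\langle Ax,x\rangle$ with $P$ a coordinate projection. Since the displayed inequalities did not survive in this copy, the forms tested are the standard ones suggested by the proof steps, and they are assumptions: $\langle\nabla J(x+Ph)-\nabla J(x),Ph\rangle\le L\|h\|_{P}^{2}$ for (ii), $J(x+Ph)\le J(x)+\langle\nabla J(x),Ph\rangle+\tfrac{L}{2}\|h\|_{P}^{2}$ for (iii), $J(x)-J(y)-\langle\nabla J(y),x-y\rangle\ge\tfrac{1}{2L}\|\nabla J(x)-\nabla J(y)\|_{P}^{2}$ for (iv), and $\langle\nabla J(x)-\nabla J(y),x-y\rangle\ge L^{-1}\|\nabla J(x)-\nabla J(y)\|_{P}^{2}$ for (v), with $\|h\|_{P}^{2}=\langle Ph,h\rangle$. The matrix `A`, the index set `block`, and the Gershgorin bound used for `L` are likewise illustrative choices.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def make_psd(n, rng):
    """Random symmetric positive semi-definite matrix A = B^T B."""
    B = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(B[k][i] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def project(block, v):
    """Coordinate projection P: keep entries indexed by `block`, zero the rest."""
    return [vi if i in block else 0.0 for i, vi in enumerate(v)]


def check_relative_properties(A, block, L, rng, trials=200, tol=1e-9):
    """Test the assumed forms of (ii)-(v) for J(x) = 0.5 <Ax, x> on random points."""
    n = len(A)
    J = lambda x: 0.5 * dot(x, matvec(A, x))
    grad = lambda x: matvec(A, x)
    ok = {"ii": True, "iii": True, "iv": True, "v": True}
    for _ in range(trials):
        x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        y = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        h = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        Ph = project(block, h)
        h_P2 = dot(Ph, h)  # ||h||_P^2 = <Ph, h>
        x_plus = add(x, Ph)
        ok["ii"] &= dot(sub(grad(x_plus), grad(x)), Ph) <= L * h_P2 + tol
        ok["iii"] &= J(x_plus) <= J(x) + dot(grad(x), Ph) + 0.5 * L * h_P2 + tol
        g = sub(grad(x), grad(y))
        g_P2 = dot(project(block, g), g)  # ||grad J(x) - grad J(y)||_P^2
        ok["iv"] &= J(x) - J(y) - dot(grad(y), sub(x, y)) >= g_P2 / (2.0 * L) - tol
        ok["v"] &= dot(g, sub(x, y)) >= g_P2 / L - tol
    return ok


rng = random.Random(1)
n = 5
block = {0, 2}
A = make_psd(n, rng)

# Gershgorin row-sum bound: an upper bound on the largest eigenvalue of the
# sub-block of A indexed by `block`, used here as the factor L.
L = max(sum(abs(A[i][j]) for j in block) for i in block)

results = check_relative_properties(A, block, L, rng)
print(L, results)
```

For a quadratic, any `L` at least as large as the largest eigenvalue of $PAP$ works, which is typically much smaller than the global Lipschitz constant of $\nabla J$ when the block is small. This is the kind of gain that block-coordinate methods exploit.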

Bibliography

[1] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2nd edition, 2017.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] H. Brezis, M. G. Crandall, and A. Pazy. Perturbations of nonlinear maximal monotone sets in Banach space. Communications on Pure and Applied Mathematics, 23(1):123–144, 1970.
[4] Felix E. Browder. Nonexpansive nonlinear operators in a Banach space. Proceedings of the National Academy of Sciences of the United States of America, 54(4):1041, 1965.
[5] Felix E. Browder. Convergence theorems for sequences of nonlinear operators in Banach spaces. Mathematische Zeitschrift, 100(3):201–225, June 1967.
[6] Y. Censor and S. A. Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451–464, 1992.
[7] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
[8] Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, pages 1–35, 2015.

