Primal-dual proximal splitting and generalized conjugation in non-smooth non-convex optimization

Christian Clason; Stanislav Mazurenko; Tuomo Valkonen

arXiv:1901.02746·math.OC·August 27, 2021

TL;DR

This paper extends primal-dual proximal splitting methods to solve challenging non-convex, non-smooth problems by leveraging generalized conjugates and saddle-point reformulations, with proven local linear convergence under certain conditions.

Contribution

It introduces a novel framework applying primal-dual proximal splitting to non-convex, non-smooth problems using generalized conjugates and saddle-point formulations, with convergence analysis.

Findings

01

Method successfully applied to Nash equilibrium and Potts segmentation problems.

02

Proven local linear convergence under strong convexity assumptions.

03

Numerical experiments confirm theoretical convergence results.

Abstract

We demonstrate that difficult non-convex non-smooth optimization problems, such as Nash equilibrium problems and anisotropic as well as isotropic Potts segmentation models, can be written in terms of generalized conjugates of convex functionals. These, in turn, can be formulated as saddle-point problems involving convex non-smooth functionals and a general smooth but non-bilinear coupling term. We then show through detailed convergence analysis that a conceptually straightforward extension of the primal–dual proximal splitting method of Chambolle and Pock is applicable to the solution of such problems. Under sufficient local strong convexity assumptions of the functionals – but still with a non-bilinear coupling term – we even demonstrate local linear convergence of the method. We illustrate these theoretical results numerically on the aforementioned example problems.

Tables

Table 1. Results for the elliptic NEP example for different $N$

$i$	$N = 64$	$N = 128$	$N = 256$	$N = 512$	$N = 1024$
$1$	$1.298\cdot 10^{-1}$	$1.319\cdot 10^{-1}$	$1.330\cdot 10^{-1}$	$1.335\cdot 10^{-1}$	$1.338\cdot 10^{-1}$
$2$	$3.889\cdot 10^{-6}$	$4.048\cdot 10^{-6}$	$4.074\cdot 10^{-6}$	$4.088\cdot 10^{-6}$	$4.097\cdot 10^{-6}$
$3$	$3.835\cdot 10^{-10}$	$3.977\cdot 10^{-10}$	$4.010\cdot 10^{-10}$	$4.026\cdot 10^{-10}$	$4.032\cdot 10^{-10}$
$4$	$3.811\cdot 10^{-14}$	$3.952\cdot 10^{-14}$	$3.986\cdot 10^{-14}$	$4.001\cdot 10^{-14}$	$4.008\cdot 10^{-14}$
$5$	$3.787\cdot 10^{-18}$	$3.928\cdot 10^{-18}$	$3.963\cdot 10^{-18}$	$3.977\cdot 10^{-18}$	$3.985\cdot 10^{-18}$

Equations

$\min_{x\in X}\max_{y\in Y}\ G(x)+K(x,y)-F^{*}(y),$

$x^{i+1}$

$\widebar{x}^{i+1}$

$y^{i+1}$

$\min_{x\in X}\max_{y\in Y}\ G(x)+\langle Ax,y\rangle-F^{*}(y)$

$F(x)=\sup_{y\in Y}K(x,y)-F^{*}(y)$

$(x_{-k}\mid z):=(x_{1},\dots,x_{k-1},z,x_{k+1},\dots,x_{n})\quad(1\leq k\leq n,\ z\in\mathbb{R})$

$\phi_{k}(x^{*})=\phi_{k}(x_{-k}^{*}\mid x_{k}^{*})=\min_{z\in\mathbb{R}}\phi_{k}(x_{-k}^{*}\mid z)\quad(1\leq k\leq n).$

$\Psi(x,y)=\sum_{k=1}^{n}\bigl(\phi_{k}(x_{-k}\mid x_{k})-\phi_{k}(x_{-k}\mid y_{k})\bigr)\quad(x,y\in X)$

$V(x)=\max_{y\in X}\Psi(x,y)\quad(x\in X).$

$\delta_{X}(x)=\begin{cases}0&\text{if }x\in X,\\ \infty&\text{if }x\notin X,\end{cases}$

$\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{n}}\ \delta_{X}(x)+\Psi(x,y)-\delta_{X}(y).$

$K(x,y)=\Psi(x,y),\quad F^{*}=G=\delta_{X}.$

$X_{k}(x_{-k})=\{x_{k}\in\mathbb{R}^{n}:(x_{-k}\mid x_{k})\in Z\}\quad(1\leq k\leq n)$

$\min_{x\in\mathbb{R}^{N_{1}\times N_{2}}}\ \frac{1}{2\alpha}\|x-f\|^{2}+\|D_{h}x\|_{p,0},$

$\|z\|_{p,0}:=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl|(|z_{ij1}|_{0},|z_{ij2}|_{0})\bigr|_{p},\quad\text{where}\quad|t|_{0}=\begin{cases}0&\text{if }t=0,\\ 1&\text{if }t\neq 0,\end{cases}$

$\chi_{(0,\infty)}(t)=\begin{cases}0&\text{if }t\leq 0,\\ 1&\text{if }t>0.\end{cases}$

$\chi_{(0,\infty)}(t)=\sup_{s\geq 0}\rho(st)=\sup_{s\in\mathbb{R}}\rho(st)-\delta_{[0,\infty)}(s).$

$\rho(t)=2t-t^{2}\quad(t\in\mathbb{R}),$

$|t|_{0}=\sup_{s\in\mathbb{R}}\rho(st)=\sup_{s\in\mathbb{R}}\rho(st)-0,$

$|t|_{\gamma}:=\sup_{s\in\mathbb{R}}\rho(st)-\frac{\gamma}{2}|s|^{2}=\frac{2t^{2}}{2t^{2}+\gamma},$

$\|z\|_{1,0}=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}|z_{ijk}|_{0},$

$\kappa_{1}(z,y)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}\rho(z_{ijk}y_{ijk})$

$F_{\gamma}(z)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}|z_{ijk}|_{\gamma}.$

$\|z\|_{\infty,0}=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\max\{|z_{ij1}|_{0},|z_{ij2}|_{0}\}.$

$|\!|\!|z|\!|\!|_{0,p}:=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl||(z_{ij1},z_{ij2})|_{p}\bigr|_{0}$

$\bigl||t|_{2}\bigr|_{0}=\sup_{s\in\mathbb{R}^{2}}\rho(\langle s,t\rangle)=\sup_{s\in\mathbb{R}^{2}}\rho(s_{1}t_{1}+s_{2}t_{2})$

$\kappa_{\infty}(z,y)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\rho(z_{ij1}y_{ij1}+z_{ij2}y_{ij2})$

$F_{\gamma}(z)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl||(z_{ij1},z_{ij2})|_{2}\bigr|_{\gamma}.$

$\sup_{s\in\mathbb{R}^{2}}\rho_{p}(s,t)=\begin{cases}0&\text{if }t=0,\\ 1&\text{if }t\neq 0,\ t_{1}t_{2}=0,\\ 2^{1/p}&\text{if }t\neq 0,\ t_{1}t_{2}\neq 0,\end{cases}$

$K(x,y)=\kappa_{p}(D_{h}x,y),\quad G(x)=\frac{1}{2\alpha}\|x-f\|^{2},\quad F_{\gamma}^{*}(y)=\frac{\gamma}{2}\|y\|^{2}$

Code & Models

Repositories

https://zenodo.org/record/3647615 (official)

Full text

arXiv:1901.02746v4

Primal–dual proximal splitting and generalized conjugation in non-smooth non-convex optimization

Christian Clason, Faculty of Mathematics, University Duisburg-Essen, 45117 Essen, Germany (ORCID 0000-0002-9948-8426)

Stanislav Mazurenko, Loschmidt Laboratories, Masaryk University, Brno, Czechia; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom (ORCID 0000-0003-3659-4819)

Tuomo Valkonen, ModeMat, Escuela Politécnica Nacional, Quito, Ecuador and Department of Mathematics and Statistics, University of Helsinki, Finland; previously Department of Mathematical Sciences, University of Liverpool, United Kingdom (ORCID 0000-0001-6683-3572)

(2020-03-19)

1 Introduction

This work is concerned with the numerical solution of non-smooth non-convex saddle-point problems of the form

$\min_{x\in X}\max_{y\in Y}\ G(x)+K(x,y)-F^{*}(y),$ (1)

where $G:X\to\overline{\mathbb{R}}$ and $F^{*}:Y\to\overline{\mathbb{R}}$ are (possibly non-smooth) proper, convex and lower semicontinuous functionals on Hilbert spaces $X$ and $Y$ , and $K:X\times Y\to\mathbb{R}$ is smooth but may be non-convex-concave. Such problems arise in many areas of optimal control, inverse problems, and imaging; we will treat two specific examples below. To find a critical point for (1), we propose the generalized primal–dual proximal splitting (GPDPS) method:

Algorithm 1.1 (GPDPS).

Given a starting point $(x^{0},y^{0})$ and step lengths $\tau_{i},\omega_{i},\sigma_{i}>0$ , iterate:

[TABLE]

where $\mathrm{prox}_{\tau_{i}G}(v)=(I+\tau_{i}\partial G)^{-1}(v)$ is the proximal mapping for $G$ ; and $K_{x},K_{y}$ are the partial Fréchet derivatives of $K$ with respect to $x$ and $y$ . A main result of this work is that under suitable conditions on the step length parameters $\tau_{i}$ , $\sigma_{i}$ , and $\omega_{i}$ , this algorithm converges weakly to a critical point of (1); see Theorem 6.1. Furthermore, if $\partial G$ and/or $\partial F^{*}$ is strongly metrically subregular at the saddle point (in particular, if $G$ and/or $F^{*}$ are strongly convex), we show optimal convergence rates for the standard acceleration strategies; see Theorems 6.4 and 6.6.
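As a rough illustration of the structure described here (a proximal step on $G$, an over-relaxation of the primal iterate, and a proximal step on $F^{*}$, each combined with an explicit step on $K$), the following Python sketch runs such a loop on a scalar toy problem. The exact update formulas, the constant step lengths, and the toy data `a`, `f` are assumptions made for this example and are not taken from the algorithm statement above; the coupling is deliberately bilinear so that the critical point is known in closed form.

```python
def gpdps(prox_G, prox_Fstar, K_x, K_y, x0, y0, tau, sigma, omega, iterations):
    """Sketch of a GPDPS-type loop for scalar x and y (assumed update order)."""
    x, y = x0, y0
    for _ in range(iterations):
        # Primal step: explicit gradient step on K, then proximal step on G.
        x_next = prox_G(x - tau * K_x(x, y), tau)
        # Over-relaxation of the primal iterate.
        x_bar = (1.0 + omega) * x_next - omega * x
        # Dual step: explicit ascent step on K at x_bar, then proximal step on F*.
        y = prox_Fstar(y + sigma * K_y(x_bar, y), sigma)
        x = x_next
    return x, y


# Hypothetical toy data: G(x) = (x - f)^2 / 2, F*(y) = y^2 / 2, K(x, y) = a * x * y.
a, f = 2.0, 3.0
x_hat, y_hat = gpdps(
    prox_G=lambda v, tau: (v + tau * f) / (1.0 + tau),
    prox_Fstar=lambda v, sigma: v / (1.0 + sigma),
    K_x=lambda x, y: a * y,
    K_y=lambda x, y: a * x,
    x0=0.0, y0=0.0, tau=0.4, sigma=0.4, omega=1.0, iterations=200,
)

# Critical point from x - f + a*y = 0 and a*x - y = 0.
x_exact = f / (1.0 + a * a)
y_exact = a * x_exact
```

For this toy problem both functionals are strongly convex, so the iterates approach `(x_exact, y_exact)` quickly; a genuinely non-bilinear `K` only changes the two derivative callbacks.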

In addition, we demonstrate in this work how through a suitable reformulation this method can be applied to the following two non-trivial applications:

(i) elliptic Nash equilibrium problems, where $K(x,y)$ is the so-called Nikaido–Isoda function encoding the Nash equilibrium [29, 25, 38]; see Section 2.1 for details.

(ii) (Huber-regularized) $\ell^{0}$-TV denoising (also referred to as the Potts model) [18, 33, 34], where $K(x,y)$ is used to express the non-convex Potts functional as the generalized $K$-conjugate of a convex indicator function; see Section 2.2 for details.

In particular, the second example demonstrates how the proposed method can be used to solve (some) non-convex non-smooth problems by reformulating them in terms of a convex but non-smooth functional and a smooth but non-convex coupling term. (We stress, however, that we do not claim that this approach is superior to state-of-the-art problem-specific approaches such as the ones mentioned in the cited works for the specific problems; such an investigation is left for the future.)

Related literature.

Our approach is obviously motivated by the well-known primal–dual proximal splitting (PDPS) method of Chambolle and Pock [8] for convex optimization problems of the form $\min_{x\in X}F(Ax)+G(x)$ for $F:Y\to\overline{\mathbb{R}}$ proper, convex, and lower semicontinuous and $A:X\to Y$ linear. The method is based on the equivalent reformulation as the saddle-point problem

$\min_{x\in X}\max_{y\in Y}\ G(x)+\langle Ax,y\rangle-F^{*}(y),$ (2)

where $F^{*}$ is the Fenchel conjugate of $F$. Several alternative techniques for such optimization problems have also been developed, e.g., using smoothing schemes [28] or a proximal alternating predictor corrector [13]. The PDPS method was extended to allow for nonlinear but Fréchet differentiable $A$ in [35]. Later work [12, 10] applied this extension to non-convex PDE-constrained optimization problems and derived accelerated variants.

In a broader context, generalized convex conjugation has been studied for many decades with applications in economics, see, e.g., [26, 32, 15] and the references therein. Algorithms for the solution of general saddle-point problems $\min_{x}\max_{y}f(x,y)$ have been considered in several seminal papers. In particular, a prox-type method was suggested in [27] for $C^{1,1}$ convex–concave functions yielding an $O(1/N)$ rate of convergence for an ergodic version of the gap $\max_{y^{\prime}\in Y}f(x,y^{\prime})-\min_{x^{\prime}\in X}f(x^{\prime},y)$. These results were further extended to allow non-smooth functions in the Mirror Descent method [22], demonstrating an $O(1/\sqrt{N})$ rate of convergence for the ergodic gap, although with a vanishing step size for large $N$. The authors also considered an acceleration of the Mirror Proximal method for the case when the gradient map of $f$ can be split into a Lipschitz-continuous part and a monotone operator [23]. The latter was assumed “simple” in the sense that a solution to a specific variational inequality could be found relatively efficiently. As a result, the authors obtained an $O(1/N)$ rate of convergence with a possibility for improvement to $O(1/N^{2})$ for a strongly concave $f$. Finally, the reformulation of (1) with a bilinear $K$ as a monotone inclusion problem was considered in [21].

Algorithms applicable to (1) with a genuinely nonlinear $K$ have only started to appear in the literature relatively recently. An abstract convergence result was obtained for an inexact regularized Gauss–Seidel method in [3]. In [20], the authors considered saddle-point representable functions and arrived at a very similar structure to (1); specifically, they reformulated this problem as a smooth linearly-constrained saddle point problem by moving the non-smooth terms into the problem domain and applied the Mirror Proximal algorithm mentioned earlier, with a smooth cost function and the $O(1/N)$ convergence rate [27]. Following [21], Kolossoski and Monteiro [24] developed a non-Euclidean hybrid proximal extragradient method for the case where $G$ and $F^{*}$ are Bregman distances and $K$ is general convex–concave. The case of a general convex–concave $K$ in (1) (which therefore becomes an overall convex–concave problem) has been recently studied in [19]. Besides being restricted to convex–concave problems, their algorithm differs from Algorithm 1.1 in applying the overrelaxation to $K_{y}(x^{i+1},y^{i})$ instead of to $x^{i+1}$ in the third step. Finally, problems for general sufficiently smooth $K(x,y)$ were considered in [5] in conjunction with a variant of ADMM; however, no proofs of convergence were given in the general case.

Organization.

To motivate our approach, we start with a more detailed description of the above-mentioned example problems and their reformulation as a saddle-point problem of the form (1) in the next Section 2. (This section can be skipped by readers only interested in the convergence analysis for the general Algorithm 1.1.) The following Section 3 then collects basic notation and definitions as well as the fundamental assumptions that will be used throughout the following. We then study the convergence and convergence rates of Algorithm 1.1 in Sections 4, 5 and 6. More precisely, in Section 4 we derive a basic convergence estimate using the “testing” framework introduced in [36, 37] for the study of preconditioned proximal point methods. The results and assumptions depend on the iterates staying in a local neighborhood of a solution. In Section 5 we therefore derive conditions on the step length parameters and initial iterate that ensure that the iterates do not escape from a local neighborhood. Afterwards, we provide in Section 6 exact step length rules for Algorithm 1.1 together with respective weak convergence or convergence rate results: linear under sufficient strong convexity of $G$ and $F^{*}$ , and “accelerated” $O(1/N)$ or $O(1/N^{2})$ rates with somewhat lesser assumptions. Finally, we illustrate the applicability and performance of the proposed approach applied to our two example problems in Section 7. Appendices A to C contain further technical results on the assumptions required for convergence, in particular verifying them for the Huber-regularized $\ell^{0}$ -TV denoising example.

2 Applications

Before we begin our analysis of the convergence of Algorithm 1.1, we motivate its generality by discussing two examples of practically relevant problems that can be cast in the form (1) and which will be used to numerically illustrate the behavior of the algorithm in Section 7. The idea in each case is to write a non-convex functional $F$ as the generalized $K$ -conjugate of a convex functional $F^{*}$ , i.e.,

$F(x)=\sup_{y\in Y}K(x,y)-F^{*}(y)$

for a suitable $K$ (depending on $F$ ).

2.1 Elliptic Nash equilibrium problems

Our first example is the reformulation of Nash equilibrium problems using the Nikaido–Isoda function following [38]. Consider a non-cooperative game of $n\in\mathbb{N}$ players, each of which has a strategy $x_{k}\in X_{k}\subset\mathbb{R}$ and a payout function $\phi_{k}:\mathbb{R}^{n}\to\mathbb{R}$ . For convenience, we introduce the vector $x\in\mathbb{R}^{n}$ of strategies and the notation

$(x_{-k}\mid z):=(x_{1},\dots,x_{k-1},z,x_{k+1},\dots,x_{n})\quad(1\leq k\leq n,\ z\in\mathbb{R})$

for the vector where player $k$ changes their strategy $x_{k}$ to $z$ . We also set $X:=X_{1}\times\dots\times X_{n}$ . A vector $x^{*}\in X$ of strategies is then a Nash equilibrium if

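$\phi_{k}(x^{*})=\phi_{k}(x_{-k}^{*}\mid x_{k}^{*})=\min_{z\in\mathbb{R}}\phi_{k}(x_{-k}^{*}\mid z)\quad(1\leq k\leq n).$ (3)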

We now introduce the Nikaido–Isoda function [29] (also called the Ky Fan function [17])

$\Psi(x,y)=\sum_{k=1}^{n}\bigl(\phi_{k}(x_{-k}\mid x_{k})-\phi_{k}(x_{-k}\mid y_{k})\bigr)\quad(x,y\in X)$

as well as the optimum response function

$V(x)=\max_{y\in X}\Psi(x,y)\quad(x\in X).$

It follows from [38, Thm. 2.2] that $x^{*}\in X$ is a Nash equilibrium if and only if it is a minimizer of $V$ . Using the indicator function of the set $X\subset\mathbb{R}^{n}$ defined by

$\delta_{X}(x)=\begin{cases}0&\text{if }x\in X,\\ \infty&\text{if }x\notin X,\end{cases}$

we see that the generally non-convex response function $V$ is the $\Psi$ -preconjugate of the convex functional $\delta_{X}$ and can characterize a Nash equilibrium $x^{*}\in X$ as the solution to the saddle-point problem

$\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{n}}\ \delta_{X}(x)+\Psi(x,y)-\delta_{X}(y).$

We can therefore solve the Nash equilibrium problem (3) by applying Algorithm 1.1 to

$K(x,y)=\Psi(x,y),\quad F^{*}=G=\delta_{X}.$

In Section 7.1, we illustrate this for the two-player elliptic Nash equilibrium problem from [6].
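To make the Nikaido–Isoda construction concrete, the following Python sketch evaluates $\Psi(x,y)=\sum_{k}\bigl(\phi_{k}(x_{-k}\mid x_{k})-\phi_{k}(x_{-k}\mid y_{k})\bigr)$ and a grid approximation of $V(x)=\max_{y\in X}\Psi(x,y)$ for a small two-player game. The quadratic cost functions, the strategy set $X=[0,1]^{2}$, and the grid are hypothetical choices for this example only; they are not the elliptic problem from [6].

```python
def replace(x, k, z):
    """Return (x_{-k} | z): the strategy tuple x with entry k replaced by z."""
    return x[:k] + (z,) + x[k + 1:]


def nikaido_isoda(phis, x, y):
    """Psi(x, y) = sum_k [phi_k(x_{-k} | x_k) - phi_k(x_{-k} | y_k)]."""
    return sum(phi(x) - phi(replace(x, k, y[k])) for k, phi in enumerate(phis))


def response_value(phis, x, grid):
    """V(x) = max over y in X of Psi(x, y), approximated on a grid (n = 2 only)."""
    return max(nikaido_isoda(phis, x, (y1, y2)) for y1 in grid for y2 in grid)


# Hypothetical two-player game on X = [0, 1]^2; each player minimizes its own cost.
phis = [
    lambda x: (x[0] - 0.5 * x[1]) ** 2,
    lambda x: (x[1] - 0.5 * x[0] - 0.5) ** 2,
]
grid = [k / 30 for k in range(31)]

# The best responses x1 = x2 / 2 and x2 = x1 / 2 + 1 / 2 intersect at (1/3, 2/3).
x_star = (1 / 3, 2 / 3)
v_star = response_value(phis, x_star, grid)
v_other = response_value(phis, (0.0, 0.0), grid)
```

In this toy game `v_star` vanishes (up to rounding) while `v_other` is strictly positive, which matches the characterization of Nash equilibria as minimizers of $V$.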

Remark 2.1.

If the set $X_{k}$ of feasible strategies for each player depends on the strategies of the other players (i.e., $X_{k}=X_{k}(x_{-k})$ ), (3) becomes a generalized Nash equilibrium problem (GNEP); see the survey [16] and the literature cited therein. If for all $k$

$X_{k}(x_{-k})=\{x_{k}\in\mathbb{R}^{n}:(x_{-k}\mid x_{k})\in Z\}\quad(1\leq k\leq n)$

for some closed and convex set $Z\subset\mathbb{R}^{n}$, the GNEP is called jointly convex. In this case, minimization of (4) is no longer an equivalent characterization but defines a variational equilibrium [31]; every variational equilibrium is a generalized Nash equilibrium but not vice versa, see, e.g., [16, Thm. 3.9]. Hence Algorithm 1.1 can also be applied to compute (some if not all) solutions to jointly convex GNEPs.

2.2 Huber–Potts denoising

Our next example is concerned with (Huber-regularized) $\ell^{0}$-TV denoising or segmentation, also referred to as the Potts model. Let $f\in\mathbb{R}^{N_{1}\times N_{2}}$, $N_{1},N_{2}\in\mathbb{N}$, be a given image that is noisy or to be segmented. We then search for the denoised or segmented image as the solution to

$\min_{x\in\mathbb{R}^{N_{1}\times N_{2}}}\ \frac{1}{2\alpha}\|x-f\|^{2}+\|D_{h}x\|_{p,0},$

for a regularization parameter $\alpha\geq 0$ (which we write in front of the discrepancy term to simplify the computations), the discrete gradient $D_{h}:\mathbb{R}^{N_{1}\times N_{2}}\to\mathbb{R}^{N_{1}\times N_{2}\times 2}$ , and the vectorial $\ell^{0}$ -seminorm

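$\|z\|_{p,0}:=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl|(|z_{ij1}|_{0},|z_{ij2}|_{0})\bigr|_{p},\quad\text{where}\quad|t|_{0}=\begin{cases}0&\text{if }t=0,\\ 1&\text{if }t\neq 0,\end{cases}$ (6)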

and $|\cdot|_{p}$ for $p\in[1,\infty]$ is the usual $p$ -norm on $\mathbb{R}^{2}$ ; we will discuss the choice of $p$ in detail below. Clearly, $\|\,\boldsymbol{\cdot}\,\|_{p,0}$ is a non-convex functional for any $p\in[1,\infty]$ . Let us briefly comment on the use of $\ell^{0}$ -TV as a regularizer in imaging. Intuitively, the functional in (6) applied to the discrete gradient counts the number of jumps of the image value between neighboring pixels; it can therefore be expected that minimizers are piecewise constant, and that jumps are penalized even more strongly than by the (convex) total variation model.

To motivate our approach, we first consider a simple scalar (lower semicontinuous) step function, i.e., we consider for $(0,\infty)\subset\mathbb{R}$ the corresponding characteristic function

$\chi_{(0,\infty)}(t)=\begin{cases}0&\text{if }t\leq 0,\\ 1&\text{if }t>0.\end{cases}$

To write this non-convex function as the generalized preconjugate of a convex function, let $\rho:\mathbb{R}\to\mathbb{R}$ satisfy $\rho(0)=0$ , $\sup_{t\leq 0}\rho(t)=0$ , and $\sup_{t>0}\rho(t)=1$ . Then a simple case distinction shows that

$\chi_{(0,\infty)}(t)=\sup_{s\geq 0}\rho(st)=\sup_{s\in\mathbb{R}}\rho(st)-\delta_{[0,\infty)}(s).$ (8)

Setting $\kappa(s,t):=\rho(st)$ , we thus obtain that $\chi_{(0,\infty)}$ is the $\kappa$ -preconjugate of the convex indicator function $\delta_{[0,\infty)}$ . One possible choice for $\rho$ is $\rho=\chi_{(0,\infty)}$ ; however, we require $\rho$ to be smooth in order to apply Algorithm 1.1. A better choice is therefore

$\rho(t)=2t-t^{2}\quad(t\in\mathbb{R}),$ (9)

see Fig. 2, which has the advantage that the supremum in (8) is always attained at a finite $s\geq 0$ . We will use this choice from now on.
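The representation of the step function as a supremum can be checked numerically. The sketch below assumes the choice $\rho(t)=2t-t^{2}$ and compares the value at the candidate maximizer ($s=1/t$ for $t>0$, $s=0$ otherwise) with a brute-force maximum over a grid of $s\geq 0$; the function names, the grid bounds, and the sample points are illustrative only.

```python
def rho(t):
    """Smooth generating function rho(t) = 2t - t^2 (maximal value 1 at t = 1)."""
    return 2.0 * t - t * t


def step_via_conjugate(t):
    """Value of sup_{s >= 0} rho(s * t), using the maximizer s = 1/t or s = 0."""
    s_star = 1.0 / t if t > 0 else 0.0
    return rho(s_star * t)


def sup_on_grid(t, s_max=50.0, n=5001):
    """Brute-force approximation of sup_{0 <= s <= s_max} rho(s * t)."""
    return max(rho((s_max * k / (n - 1)) * t) for k in range(n))


samples = [-3.0, -0.5, 0.0, 0.5, 2.0, 7.0]
closed_form = [step_via_conjugate(t) for t in samples]
brute_force = [sup_on_grid(t) for t in samples]
```

For $t\leq 0$ the product $st$ is non-positive for every $s\geq 0$, so $\rho(st)\leq 0$ with equality at $s=0$; for $t>0$ the value $1$ is reached at the finite point $s=1/t$.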

Noting that $|t|_{0}=\chi_{\mathbb{R}\setminus\{0\}}(t)$, we can proceed similarly by case distinction to write

$|t|_{0}=\sup_{s\in\mathbb{R}}\rho(st)=\sup_{s\in\mathbb{R}}\rho(st)-0,$

i.e., for $\kappa(s,t)=\rho(st)$ as above, $|\cdot|_{0}$ is the $\kappa$ -preconjugate of the zero function $f^{*}\equiv 0$ . In practice, it may be useful to add Huber regularization, i.e., replace $f^{*}$ by $f_{\gamma}^{*}:=f^{*}+\frac{\gamma}{2}|\cdot|^{2}=\frac{\gamma}{2}|\cdot|^{2}$ for some $\gamma>0$ . Using the fact that $f_{\gamma}^{*}$ and our choice (9) are differentiable, an elementary calculus argument shows that the corresponding preconjugate is

$|t|_{\gamma}:=\sup_{s\in\mathbb{R}}\rho(st)-\frac{\gamma}{2}|s|^{2}=\frac{2t^{2}}{2t^{2}+\gamma},$

which is a still non-convex approximation of $|t|_{0}$ , see Fig. 2.
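The calculus argument can be spelled out as follows. This is a sketch that uses only $\rho(t)=2t-t^{2}$ and $f_{\gamma}^{*}=\frac{\gamma}{2}|\cdot|^{2}$ with $\gamma>0$; the symbols $h$ and $s^{*}$ are introduced here for the function being maximized and its maximizer.

```latex
\begin{align*}
  h(s) &:= \rho(st) - \tfrac{\gamma}{2}s^{2}
        = 2st - \bigl(t^{2} + \tfrac{\gamma}{2}\bigr)s^{2}, \\
  h'(s) &= 2t - (2t^{2} + \gamma)s = 0
        \quad\Longleftrightarrow\quad
        s^{*} = \frac{2t}{2t^{2} + \gamma}, \\
  h(s^{*}) &= \frac{4t^{2}}{2t^{2} + \gamma}
        - \frac{2t^{2} + \gamma}{2}\cdot\frac{4t^{2}}{(2t^{2} + \gamma)^{2}}
        = \frac{2t^{2}}{2t^{2} + \gamma}.
\end{align*}
```

Since $h''(s)=-(2t^{2}+\gamma)<0$, the function $h$ is strictly concave and $s^{*}$ is its unique maximizer. The resulting value lies in $[0,1)$ and tends to $1$ as $\gamma\to 0$ for every fixed $t\neq 0$.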

We now turn to the vectorial $\ell^{0}$ seminorm, where we distinguish between $p\in[1,\infty]$ .

The case $p=1$ .

With this choice, (6) reduces to

$\|z\|_{1,0}=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}|z_{ijk}|_{0},$

which is the most common choice for the Potts model found in the literature. Here, the Potts functional $\|D_{h}x\|_{1,0}$ counts for each pixel $(i,j)$ the jumps across each edge of the pixel separately, i.e., the contribution of each pixel is either $0$ (no jump), $1$ (jump in either horizontal or vertical direction), or $2$ (jump in both directions). We thus refer (in a slight abuse of terminology) to this case as the anisotropic Potts model.

Since this functional is completely separable, we can apply the above scalar approach componentwise by taking

$\kappa_{1}(z,y)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}\rho(z_{ijk}y_{ijk})$

such that $F=\|\,\boldsymbol{\cdot}\,\|_{1,0}$ is the $\kappa_{1}$ -preconjugate of the zero function $F^{*}\equiv 0$ . Correspondingly, the Huber regularization of $F$ is given by

$F_{\gamma}(z)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\sum_{k=1}^{2}|z_{ijk}|_{\gamma}.$

The case $p=\infty$ .

Now (6) reduces to

$\|z\|_{\infty,0}=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\max\{|z_{ij1}|_{0},|z_{ij2}|_{0}\}.$

Here, each pixel contributes to the Potts functional only once, even if there is a jump across both edges. Since a simple case distinction shows that $\max\{|a|_{0},|b|_{0}\}=||(a,b)|_{p}|_{0}$ for any $a,b\in\mathbb{R}$ and $p\in[1,\infty]$ , this case is equivalent to

$|\!|\!|z|\!|\!|_{0,p}:=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl||(z_{ij1},z_{ij2})|_{p}\bigr|_{0}$

for any $p\in[1,\infty]$ , which leads to an alternate definition of the Potts functional sometimes found in the literature. We refer to this case as the isotropic Potts model.
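The difference between the anisotropic ($p=1$) and isotropic ($p=\infty$) jump counts is easy to see on a tiny image. The sketch below assumes a forward-difference discretization of $D_{h}$ with zero differences at the last row and column; this boundary treatment, the function names, and the $2\times 2$ test image are assumptions for illustration and are not specified in the text.

```python
def forward_differences(x):
    """Assumed discrete gradient: z[i][j] = (vertical diff, horizontal diff)."""
    n1, n2 = len(x), len(x[0])
    z = [[[0.0, 0.0] for _ in range(n2)] for _ in range(n1)]
    for i in range(n1):
        for j in range(n2):
            if i + 1 < n1:
                z[i][j][0] = x[i + 1][j] - x[i][j]
            if j + 1 < n2:
                z[i][j][1] = x[i][j + 1] - x[i][j]
    return z


def l0(t):
    """Scalar counting function |t|_0."""
    return 0 if t == 0 else 1


def potts_anisotropic(z):
    """||z||_{1,0}: each pixel contributes 0, 1 or 2."""
    return sum(l0(p[0]) + l0(p[1]) for row in z for p in row)


def potts_isotropic(z):
    """||z||_{infty,0}: each pixel contributes 0 or 1."""
    return sum(max(l0(p[0]), l0(p[1])) for row in z for p in row)


# Hypothetical 2x2 image with a jump in both directions at pixel (0, 0).
image = [[0.0, 1.0],
         [1.0, 1.0]]
z = forward_differences(image)
aniso = potts_anisotropic(z)
iso = potts_isotropic(z)
```

Here pixel $(0,0)$ has a jump across both of its edges, so it is counted twice by `potts_anisotropic` and once by `potts_isotropic`; all other pixels contribute nothing.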

This functional is only separable with respect to the pixel coordinates $(i,j)$ but not with respect to $k$ . We thus extend our preconjugation approach to $\mathbb{R}^{2}$ by observing for $t\in\mathbb{R}^{2}$ that

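$\bigl||t|_{2}\bigr|_{0}=\sup_{s\in\mathbb{R}^{2}}\rho(\langle s,t\rangle)=\sup_{s\in\mathbb{R}^{2}}\rho(s_{1}t_{1}+s_{2}t_{2})$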

since for $t=0$ , $\rho(\langle s,t\rangle)=0$ for all $s\in\mathbb{R}^{2}$ , while for $t_{1}\neq 0$ or $t_{2}\neq 0$ , the supremum will be attained at $1$ by the choice of $\rho$ . Setting

$\kappa_{\infty}(z,y)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\rho(z_{ij1}y_{ij1}+z_{ij2}y_{ij2})$

makes $F=\|\,\boldsymbol{\cdot}\,\|_{\infty,0}$ again the $\kappa_{\infty}$ -preconjugate of the zero function $F^{*}\equiv 0$ . The corresponding Huber regularization can be once more computed by elementary calculus as

$F_{\gamma}(z)=\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\bigl||(z_{ij1},z_{ij2})|_{2}\bigr|_{\gamma}.$

The case $p\in(1,\infty)$ .

In principle, one could proceed as for $p=\infty$ by constructing a function $\rho_{p}:\mathbb{R}^{2}\times\mathbb{R}^{2}\to\mathbb{R}$ with

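$\sup_{s\in\mathbb{R}^{2}}\rho_{p}(s,t)=\begin{cases}0&\text{if }t=0,\\ 1&\text{if }t\neq 0,\ t_{1}t_{2}=0,\\ 2^{1/p}&\text{if }t\neq 0,\ t_{1}t_{2}\neq 0,\end{cases}$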

and setting $\kappa_{p}(s,t)=\rho_{p}(s,t)$ . However, since the corresponding Potts functional only differs from the case $p=1$ by the relative contribution of pixels with jumps in both directions and $2^{1/p}\to 1$ for $p\to\infty$ , we will only consider the extremal cases $p=1$ and $p=\infty$ .

In all cases, we can apply Algorithm 1.1 to

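$K(x,y)=\kappa_{p}(D_{h}x,y),\quad G(x)=\frac{1}{2\alpha}\|x-f\|^{2},\quad F_{\gamma}^{*}(y)=\frac{\gamma}{2}\|y\|^{2}$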

for $p\in[1,\infty]$ and $\gamma\geq 0$ . We illustrate the application of Algorithm 1.1 for $p\in\{1,\infty\}$ and $\gamma>0$ in Section 7.2.
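To show how a non-bilinear coupling enters the iteration, the following Python sketch treats a scalar analogue of the Huber–Potts setting: $G(x)=\frac{1}{2\alpha}(x-f)^{2}$, $K(x,y)=\rho(xy)$ with $\rho(t)=2t-t^{2}$, and $F_{\gamma}^{*}(y)=\frac{\gamma}{2}y^{2}$, i.e., the discrete gradient is replaced by the identity. The update order, the constant step lengths, and the data `f`, `alpha`, `gamma` are assumptions chosen for this example; convergence is only local, so the iteration is started at the data.

```python
def rho_prime(t):
    """Derivative of rho(t) = 2t - t^2."""
    return 2.0 - 2.0 * t


def huber_potts_scalar(f, alpha, gamma, tau=0.1, sigma=0.1, omega=1.0,
                       iterations=3000):
    """Sketch of a GPDPS-type loop for a scalar Huber-Potts toy problem."""
    x, y = f, 0.0
    for _ in range(iterations):
        # K(x, y) = rho(x * y), hence K_x = rho'(x*y) * y and K_y = rho'(x*y) * x.
        v = x - tau * rho_prime(x * y) * y
        # prox of tau * G with G(x) = (x - f)^2 / (2 * alpha)
        x_next = (v + (tau / alpha) * f) / (1.0 + tau / alpha)
        x_bar = x_next + omega * (x_next - x)
        w = y + sigma * rho_prime(x_bar * y) * x_bar
        # prox of sigma * F* with F*(y) = gamma * y^2 / 2
        y = w / (1.0 + sigma * gamma)
        x = x_next
    return x, y


def primal_residual(x, f, alpha, gamma):
    """Derivative of (x - f)^2 / (2 alpha) + 2 x^2 / (2 x^2 + gamma) at x."""
    return (x - f) / alpha + 4.0 * gamma * x / (2.0 * x * x + gamma) ** 2


f, alpha, gamma = 2.0, 1.0, 1.0
x_hat, y_hat = huber_potts_scalar(f, alpha, gamma)
residual = primal_residual(x_hat, f, alpha, gamma)
```

At a fixed point of this loop the dual variable satisfies $y=2x/(2x^{2}+\gamma)$, and substituting this into the primal condition gives exactly the stationarity condition of the Huber-regularized objective, which is what `primal_residual` measures.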

Remark 2.2.

We can also apply this approach for $|t|^{q}$ with $q\in(0,1)$ using the same $\rho$ as above, writing

[TABLE]

as $\rho(st)=0$ if $t=0$ and attains the maximal value $1$ otherwise. However, $\kappa(t,s)$ is not $C^{2}$ ; we can achieve that by instead writing

[TABLE]

3 Notation and assumptions

We start the development of our proposed method by introducing the necessary notation and overall assumptions. Throughout the rest of this paper, we write $\mathcal{L}(X;Y)$ for the space of bounded linear operators between Hilbert spaces $X$ and $Y$ . In what follows, we let $x$ and $y$ denote elements of $X$ and $Y$ , respectively, and denote by $u$ a pair $(x,y)\in X\times Y$ . For brevity, we will also use this notation for similar tuples, e.g., $u^{i}:=(x^{i},y^{i})$ , without explicit introduction in each case.

For any Hilbert space, $I$ is the identity operator, $\langle x,x^{\prime}\rangle$ is the inner product in the corresponding space, and $\mathbb{B}(x,r)$ is the closed ball of radius $r$ centered at $x$. If $H:X\rightrightarrows X$ is a set-valued map, we will frequently use the concise notation

[TABLE]

as well as, e.g.,

[TABLE]

if the corresponding relation holds for all $w\in H(x)$ .

For self-adjoint $T,S\in\mathcal{L}(X;X)$, the inequality $T\geq S$ means $T-S$ is positive semidefinite. If $T\in\mathcal{L}(X;X)$ is self-adjoint, we further set $\langle x,x^{\prime}\rangle_{T}:=\langle Tx,x^{\prime}\rangle$, and $\|x\|_{T}:=\sqrt{\langle x,x\rangle_{T}}$ (which define an inner product and a norm in $X$, respectively, if $T$ is in addition positive definite). In this case, $T\geq S$ implies that $\|x\|_{T}\geq\|x\|_{S}$ for all $x\in X$.
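In finite dimensions the weighted inner product and norm can be written out directly. The sketch below uses symmetric $2\times 2$ matrices as stand-ins for self-adjoint operators; the matrices `T` and `S` and the helper names are hypothetical and serve only to illustrate the ordering $T\geq S$.

```python
import math


def apply(T, x):
    """Matrix-vector product T x for a matrix given as a list of rows."""
    return [sum(T[r][c] * x[c] for c in range(len(x))) for r in range(len(T))]


def inner(x, x_prime):
    """Euclidean inner product <x, x'>."""
    return sum(a * b for a, b in zip(x, x_prime))


def inner_T(T, x, x_prime):
    """Weighted inner product <x, x'>_T := <T x, x'>."""
    return inner(apply(T, x), x_prime)


def norm_T(T, x):
    """Weighted norm ||x||_T := sqrt(<x, x>_T)."""
    return math.sqrt(inner_T(T, x, x))


# T - S = [[1, 1], [1, 1]] is positive semidefinite, so T >= S.
T = [[2.0, 1.0], [1.0, 2.0]]
S = [[1.0, 0.0], [0.0, 1.0]]
points = [[1.0, 1.0], [1.0, -1.0], [0.3, -2.0]]
norm_pairs = [(norm_T(T, x), norm_T(S, x)) for x in points]
```

For the direction $(1,-1)$, which lies in the kernel of $T-S$, the two norms coincide; in every other direction $\|x\|_{T}$ is strictly larger.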

We also recall that $K_{x}$ and $K_{y}$ denote the partial Fréchet derivatives of a continuously differentiable operator $K$ with respect to the given variable.

Throughout this paper, we make the following fundamental assumptions on (1).

Assumption 3.1.

The functionals $G:X\to\overline{\mathbb{R}}$ and $F^{*}:Y\to\overline{\mathbb{R}}$ are convex, proper, and lower semicontinuous. Furthermore,

(i) there exist a constant $\gamma_{G}\in\mathbb{R}$ and a neighborhood $\mathcal{X}_{G}$ of ${\widehat{x}}$ such that

[TABLE]

(ii) there exist a constant $\gamma_{F^{*}}\in\mathbb{R}$ and a neighborhood $\mathcal{Y}_{F^{*}}$ of ${\widehat{y}}$ such that

[TABLE]

Let us comment on this assumption. First, since the subgradients $\partial G$ and $\partial F^{*}$ of convex, proper, and lower semicontinuous functionals are maximally monotone operators [4, Theorem 20.25], Assumption 3.1 always holds with $\gamma_{G}=\gamma_{F^{*}}=0$. This is already sufficient for showing weak convergence of Algorithm 1.1; see Theorem 6.1. For strong convergence with rates, however, we (as usual in nonlinear optimization) need a local superlinear growth condition near the solution that requires taking $\gamma_{G}$ and/or $\gamma_{F^{*}}$ strictly positive (unless we can compensate by better properties of $K$ through Assumption 3.2 below); see Theorems 6.4 and 6.6. In this case, Assumption 3.1 (i), for example, coincides with strong metric subregularity of $\partial G$; see [1, 2]. This property holds (at any $\hat{x}$ and $\hat{w}\in\partial G(\hat{x})$) whenever $G$ is strongly convex; however, it is a strictly weaker property since we only require it to hold at a specific $\hat{x}$ and $\hat{w}=-K_{x}({\widehat{x}},{\widehat{y}})$ arising from the first-order necessary optimality conditions (17) below. (For example, $\partial g$ for $g(x)=|x|$ is strongly metrically subregular at $x=0$ for $w\in(-1,1)$ – but not at $w\in\{-1,1\}$ – although $g$ is not strongly convex.)

Assumption 3.2.

The functional $K$ belongs to $C^{1}(X\times Y)$ and there exist $\rho_{x},\rho_{y}>0$ such that for all

[TABLE]

the following properties hold:

(i) (second partial derivatives) The second partial derivatives $K_{xy}(u)$ and $K_{yx}(u)$ exist and satisfy $K_{xy}(u)=[K_{yx}(u)]^{*}$.

(ii) (locally Lipschitz gradients) For some functions $L_{x}(y),L_{y}(x)\geq 0$ and a constant $L_{yx}\geq 0$,

[TABLE]

(iii) (locally bounded gradient) There exists $R_{K}>0$ with $\sup_{u\in\mathcal{U}(\rho_{x},\rho_{y})}\|K_{xy}(x,y)\|\leq R_{K}$.

(iv) (three-point condition) There exist $\theta_{x},\theta_{y}>0$, $\lambda_{x},\lambda_{y}\geq 0$, $\xi_{x},\xi_{y}\in\mathbb{R}$ such that

[TABLE]

We again elaborate on this assumption. Assumption 3.2 (i)–(iii) are standard in nonlinear optimization of smooth functions. Apart from the estimates in Assumption 3.2 (ii), we make use of the following inequality that is an immediate consequence:

[TABLE]

The constants $\xi_{x}$ and $\xi_{y}$ in Assumption 3.2 (iv) can typically be taken positive by exploiting the strong monotonicity factors $\gamma_{G}$ and $\gamma_{F^{*}}$ of $\partial G$ and $\partial F^{*}$. Indeed, further on in Theorem 4.1, we will require that $\gamma_{G}-\widetilde{\gamma}_{G}\geq\xi_{x}$ and $\gamma_{F^{*}}-\widetilde{\gamma}_{F^{*}}\geq\xi_{y}$, where $\widetilde{\gamma}_{G}$ and $\widetilde{\gamma}_{F^{*}}$ will be acceleration factors employed to update the step length parameters $\tau_{i}$, $\omega_{i}$, and $\sigma_{i}$ in the algorithm.

In Appendix A we demonstrate that Assumption 3.2 (iv) is closely related to standard second-order optimality conditions, i.e., a positive definite Hessian at the solution ${\widehat{u}}$. In particular, if the primal problem for the saddle-point functional is strongly convex and the dual problem is strongly concave, the constants that ensure Assumption 3.2 (iv) can be found explicitly. Nonetheless, Assumption 3.2 (iv) is more general than simple strong convex-concavity. Indeed, in Appendix C we verify Assumption 3.2 for $K$ arising from combinations of a linear operator with generalized conjugate representations of the step function and the $\ell^{0}$ function from Section 2.2.

Since the part of the three-point condition (15) involving $K_{y}$ holds for any $\xi_{y},\lambda_{y}\geq 0$ when $K(x,y)=\langle A(x),y\rangle$ for some $A\in C^{1}(X)$, the conditions (15) reduce to the three-point condition for $A$ from [10] with the exponent $p=1$. In the present work, such an exponent would correspond to exponents $p_{x},p_{y}\in[1,2]$ over the norms with the factors $\theta_{x}$ and $\theta_{y}$ that we consider in Appendix B (iv). These can sometimes be useful: The exponent $p=2$ was needed in [35, Appendix B] to show the three-point condition for $A$ for a phase and amplitude reconstruction problem. For the sake of readability, in the main part of the present work we focus on the case $p_{x}=p_{y}=1$, i.e., Assumption 3.2 (iv), and discuss the changes needed for $p_{x},p_{y}\in(1,2]$ in Appendix B.

4 An abstract convergence result

We want to find a critical point ${\widehat{u}}=({\widehat{x}},{\widehat{y}})\in X\times Y$ of the saddle point functional $(x,y)\mapsto G(x)+K(x,y)-F^{*}(y)$ , i.e., satisfying

[TABLE]

Since $G$ and $F^{*}$ are proper, convex, and lower semicontinuous, and $K$ is continuously differentiable, using the definition of the saddle-point, the Fréchet derivative, and the convex subdifferential, an elementary limiting argument as in, e.g., [9, Prop. 2.2] shows that the inclusion (17) is a first-order necessary optimality condition for a saddle point. If $K(x,y)=\langle Ax,y\rangle$ for $A\in\mathcal{L}(X;Y)$ , (17) reduces to $-A^{*}{\widehat{y}}\in\partial G({\widehat{x}})$ and $A{\widehat{x}}\in\partial F^{*}({\widehat{y}})$ , which coincides with the well-known Fenchel–Rockafellar extremality conditions for (2); see [14, Remark 4.2].

To study Algorithm 1.1, we reformulate it in the preconditioned proximal point and testing framework of [36]. Specifically, we write Algorithm 1.1 in implicit proximal point form as solving in each iteration for $u^{i+1}=(x^{i+1},y^{i+1})\in X\times Y$ in

[TABLE]

where the linearization $\widetilde{H}_{i+1}$ of $H$ , the linear preconditioner $M_{i+1}$ , and the step length operator $W_{i+1}$ are defined as

[TABLE]

Inserting these definitions into (IPP) and rearranging, we can rewrite inclusion (IPP) as

[TABLE]

Therefore, based on the definitions of the proximal point mapping $\mathrm{prox}_{\tau G}(v)=(I+\tau\partial G)^{-1}(v)$ and of $\widebar{x}^{i+1}=(1+\omega_{i})x^{i+1}-\omega_{i}x^{i}$ , solving (IPP) for $u^{i+1}$ is equivalent to performing one step of Algorithm 1.1. Since proximal mappings of proper, convex and lower semicontinuous functionals are well-defined, single-valued, and Lipschitz continuous [4, Proposition 12.15], and $K$ is twice Fréchet differentiable on $X\times Y$ , this also shows that (IPP) always admits a unique solution $u^{i+1}$ .
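The equivalence between (IPP) and one step of the algorithm can be made concrete with a minimal sketch. It assumes the update order described above: a primal proximal step using $K_{x}(x^{i},y^{i})$ , the over-relaxation $\widebar{x}^{i+1}=x^{i+1}+\omega_{i}(x^{i+1}-x^{i})$ , and a dual proximal step using $K_{y}(\widebar{x}^{i+1},y^{i})$ . The function name `pdps_step`, the callback signatures, and the one-dimensional toy problem are illustrative assumptions and are not taken from the authors' Julia implementation.

```python
def pdps_step(x, y, tau, sigma, omega, prox_G, prox_Fstar, K_x, K_y):
    """One primal-dual step with a general (possibly non-bilinear) coupling K.

    prox_G(v, tau) and prox_Fstar(v, sigma) are the proximal maps of tau*G
    and sigma*F^*; K_x and K_y are the partial derivatives of K.
    """
    # Primal step: x^{i+1} = prox_{tau G}(x^i - tau K_x(x^i, y^i))
    x_next = prox_G(x - tau * K_x(x, y), tau)
    # Over-relaxation: xbar^{i+1} = x^{i+1} + omega (x^{i+1} - x^i)
    x_bar = x_next + omega * (x_next - x)
    # Dual step: y^{i+1} = prox_{sigma F^*}(y^i + sigma K_y(xbar^{i+1}, y^i))
    y_next = prox_Fstar(y + sigma * K_y(x_bar, y), sigma)
    return x_next, y_next


def run_toy(n_iter=200, tau=0.5, sigma=0.5, omega=1.0):
    """Hypothetical toy problem: G(x) = (x-2)^2/2, F^*(y) = y^2/2, K(x,y) = x*y.

    The critical point solves x - 2 + y = 0 and x - y = 0, i.e. x = y = 1.
    """
    prox_G = lambda v, t: (v + 2.0 * t) / (1.0 + t)   # prox of t*(x-2)^2/2
    prox_Fstar = lambda v, s: v / (1.0 + s)           # prox of s*y^2/2
    K_x = lambda x, y: y                              # d/dx of x*y
    K_y = lambda x, y: x                              # d/dy of x*y
    x, y = 0.0, 0.0
    for _ in range(n_iter):
        x, y = pdps_step(x, y, tau, sigma, omega, prox_G, prox_Fstar, K_x, K_y)
    return x, y
```

In the toy problem the coupling is bilinear, so the sketch reduces to the classical Chambolle–Pock iteration; for a non-bilinear $K$ only the two callbacks `K_x` and `K_y` change.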

The next step is to “test” the inclusion (IPP) by application of $\langle\,\boldsymbol{\cdot}\,,u^{i+1}-{\widehat{u}}\rangle_{Z_{i+1}}$ for the testing operator

[TABLE]

This testing operator and the respective primal and dual testing variables $\phi_{i}$ and $\psi_{i+1}$ will be seen to encode convergence rates after some rearrangements of the tested inclusions for $i=0,\ldots,N-1$ .

We will base our convergence analysis on the following abstract estimate, where $\|\,\boldsymbol{\cdot}\,\|_{Z_{N+1}M_{N+1}}^{2}$ forms a local metric that measures the convergence of the iterates while $\Delta_{i+1}$ can potentially be used to measure function value or gap convergence. In particular, we therefore want $\|u\|_{Z_{N+1}M_{N+1}}\to\infty$ as $N\to\infty$ with a certain rate such that boundedness of $\|u^{N}-{\widehat{u}}\|_{Z_{N+1}M_{N+1}}$ implies the convergence of $u^{N}\to{\widehat{u}}$ at the reciprocal rate (see Theorems 6.4 and 6.6).

Theorem 4.1 ([36, Theorem 2.1]).

Suppose (IPP) is solvable, and denote the iterates by $\{u^{i}\}_{i\in\mathbb{N}}$ . If $Z_{i+1}M_{i+1}$ is self-adjoint and for some ${\widehat{u}}\in U$ and $\Delta_{i+1}=\Delta_{i+1}({\widehat{u}})\in\mathbb{R}$ , for all $i\leq N-1$ ,

[TABLE]

The next theorem specializes Theorem 4.1 to our specific setup, converting the abstract condition (22) into several step length and testing parameter update rules and bounds. Specifically, (24a) below couples the primal and dual step lengths $\tau_{i}$ and $\sigma_{i}$ and the over-relaxation parameter $\omega_{i}$ with the testing parameters. Condition (24b) determines convergence rates by limiting how fast the testing parameters can grow. This rate is limited through the available strong monotonicity or second-order behavior ( $\gamma_{G}-\xi_{x}$ and $\gamma_{F^{*}}-\xi_{y}$ ) through (24d) and (24e) as well as additional step length bounds from (24c). We point out that only the latter are specific to our non-convex setting; the remaining conditions are present in the convex setting as well, see [36]. We will further develop these rules and conditions in the next section to obtain specific convergence results; an explicit example for a set of parameters satisfying these rules and conditions will be provided for the $\ell^{0}$ -TV denoising in Section 7.2 and Appendix C. Here and in the following, we use the notation $\widebar{x}^{i+1}:=x^{i+1}+\omega_{i}(x^{i+1}-x^{i})$ from Algorithm 1.1 for brevity.

Theorem 4.2.

Suppose Sections 3 and 3 hold with the constants $\theta_{x},\theta_{y}>0$ ; $\xi_{x},\xi_{y}\in\mathbb{R}$ ; $\lambda_{x},\lambda_{y}\geq 0$ ; $L_{yx}\geq 0$ and $R_{K}>0$ . For all $i\in\mathbb{N}$ , let $\widebar{u}^{i+1}:=(\widebar{x}^{i+1},y^{i})$ , and suppose $u^{i},u^{i+1},{\widehat{u}},\widebar{u}^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ for some $\rho_{x},\rho_{y}\geq 0$ . Assume for all $i\in\mathbb{N}$ that $\overline{\omega}\geq\omega_{i}\geq\underline{\omega}>0$ and that for some $0<\delta\leq\mu<1$ ; $\eta_{i}>0$ ; and $\widetilde{\gamma}_{G},\widetilde{\gamma}_{F^{*}}\geq 0$ ,

[TABLE]

Then (22) is satisfied for any $\Delta_{i+1}\leq 0$ .

Proof 4.3.

We split the proof into several steps.

Step 1 (estimation of $Z_{i+1}M_{i+1}$ )

By (24a), $\phi_{i}\tau_{i}=\eta_{i}$ and $\psi_{i+1}\sigma_{i+1}\omega_{i}=\eta_{i}$ , so (19) yields

[TABLE]

which is clearly self-adjoint. Applying Cauchy’s and Young’s inequalities, we further obtain for any $\delta>0$ , $x\in X$ , and $y\in Y$ that

[TABLE]

implying that

[TABLE]

Step 2 (estimation of $Z_{i+1}M_{i+1}-Z_{i+2}M_{i+2}$ )

Expanding $Z_{i+1}M_{i+1}-Z_{i+2}M_{i+2}$ according to (25) and then applying (24b), we obtain

[TABLE]

Step 3 (estimation of $\widetilde{H}_{i+1}(u^{i+1})$ )

By (18) we have

[TABLE]

Since $0\in H({\widehat{u}})$ , we have $-K_{x}({\widehat{x}},{\widehat{y}})\in\partial G({\widehat{x}})$ and $K_{y}({\widehat{x}},{\widehat{y}})\in\partial F^{*}({\widehat{y}})$ . Using (21) multiplied by $Z_{i+1}$ , Section 3, and (24a), we can thus estimate

[TABLE]

Combining (28), (27), and (26), we arrive at

[TABLE]

for

[TABLE]

The claim of the theorem is established if we prove that $S_{i+1}\geq 0$ .

Step 4 (estimation of $D$ )

With

[TABLE]

we can rewrite

[TABLE]

We rearrange

[TABLE]

Since $\eta_{i+1}=\eta_{i}\omega_{i}^{-1}$ , setting

[TABLE]

we can write

[TABLE]

As for the estimate for $D_{\omega}$ , using Section 3 (ii) and (16) we obtain

[TABLE]

using in the last inequality the expansion $\widebar{x}^{i+1}:=x^{i+1}+\omega_{i}(x^{i+1}-x^{i})$ and the bound $\|y^{i+1}-{\widehat{y}}\|\leq\rho_{y}$ that follows from the assumed inclusion $u^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ .

We now use Section 3 (iv) to further bound $D_{x}$ and $D_{y}$ . From (15a), we obtain

[TABLE]

using in the last two inequalities that $u^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ for some $\rho_{x},\rho_{y}\geq 0$ , $\omega_{i}^{-1}\leq\underline{\omega}^{-1}$ and $\theta_{x}\geq\rho_{y}\underline{\omega}^{-1}$ from (24e). Analogously, from (15b) and Cauchy’s inequality,

[TABLE]

where in the last two inequalities we again used $u^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ , $\omega_{i}\leq\overline{\omega}$ , and $\theta_{y}\geq\overline{\omega}\rho_{x}$ from (24d). Therefore, combining (30), (31), and (32), we obtain

[TABLE]

where we have also used the first bounds of (24d) and (24e) in the final step. Further using (24c) and $\eta_{i+1}=\eta_{i}\omega_{i}^{-1}$ , we deduce that $D\geq-\frac{1}{2}\|u^{i+1}-u^{i}\|_{\hat{Q}_{i+1}}^{2}$ . Recalling (29), we obtain $S_{i+1}\geq 0$ , i.e., (22) holds with $\Delta_{i+1}\leq 0$ as claimed.

In the subsequent sections, we will also need the following corollary.

Corollary 4.4.

Suppose that Section 3 (iii) and the conditions (24) hold. Then

[TABLE]

and

[TABLE]

Proof 4.5.

Observe that due to (24),

[TABLE]

This is our first claim. As for the second claim, from Section 3 (iii) we have

[TABLE]

Inserting this bound into (26) in the proof of Theorem 4.2 establishes (34).

5 Local step length bounds

In the previous section, we derived step length conditions that we will further develop in Section 6 to prove convergence and convergence rates. However, we implicitly required that all the iterates $\{u^{i}\}_{i\in\mathbb{N}}$ belong to $\mathcal{U}(\rho_{x},\rho_{y})$ . In this section, we derive additional step length restrictions to ensure that this holds.

We start with a lemma that bounds the next iterate $u^{i+1}$ given bounds on the current iterate $u^{i}$ and the step lengths for the current iteration. Afterwards, we chain these estimates to only require bounds on the initial iterates and the step lengths.

Lemma 5.1.

Fix $i\in\mathbb{N}$ . Suppose Section 3, Section 3 (ii), and (iii) hold in $\mathcal{U}(\rho_{x},\rho_{y})$ , and that $u^{i+1}$ solves (IPP). For simplicity, assume $\omega_{i}\leq 1$ . Suppose $r_{x,i},r_{y,i},\delta_{x},\delta_{y}>0$ and ${\widehat{u}}\in H^{-1}(0)$ are such that $\mathbb{B}({\widehat{x}},r_{x,i}+\delta_{x})\times\mathbb{B}({\widehat{y}},r_{y,i}+\delta_{y})\subseteq\mathcal{U}(\rho_{x},\rho_{y})$ and $u^{i}\in\mathbb{B}({\widehat{x}},r_{x,i})\times\mathbb{B}({\widehat{y}},r_{y,i})$ . If

[TABLE]

then $u^{i+1}\in\mathbb{B}({\widehat{x}},r_{x,i}+\delta_{x})\times\mathbb{B}({\widehat{y}},r_{y,i}+\delta_{y})$ and $\|\widebar{x}^{i+1}-{\widehat{x}}\|\leq r_{x,i}+\delta_{x}$ .

Proof 5.2.

We want to show that the step length conditions (35) are sufficient for

[TABLE]

We do this by applying the testing argument on the primal and dual variables separately. Multiplying (IPP) by $Z_{i+1}^{*}(u^{i+1}-{\widehat{u}})$ with $\phi_{i}=1$ and $\psi_{i+1}=0$ , we obtain

[TABLE]

Using the three-point identity

[TABLE]

we obtain

[TABLE]

Using further $0\in\partial G({\widehat{x}})+K_{x}({\widehat{x}},{\widehat{y}})$ and the monotonicity of $\partial G$ , we arrive at

[TABLE]

With $C_{x}:=\tau_{i}\|K_{x}(x^{i},y^{i})-K_{x}({\widehat{x}},{\widehat{y}})\|$ , this implies that

[TABLE]

After rearranging the terms and using $\|x^{i+1}-{\widehat{x}}\|\leq\|x^{i+1}-x^{i}\|+\|x^{i}-{\widehat{x}}\|$ , we thus have

[TABLE]

which leads to

[TABLE]

To estimate the dual variable, we multiply (IPP) by $Z_{i+1}^{*}(u^{i+1}-{\widehat{u}})$ with $\phi_{i}=0$ and $\psi_{i+1}=1$ . This gives

[TABLE]

Using $0\in\partial F^{*}({\widehat{y}})-K_{y}({\widehat{x}},{\widehat{y}})$ and following the steps leading to (38), we deduce

[TABLE]

with $C_{y}:=\sigma_{i+1}\|K_{y}({\widehat{x}},{\widehat{y}})-K_{y}(\widebar{x}^{i+1},y^{i})\|$ .

We now proceed to derive bounds on $C_{x}$ and $C_{y}$ with the goal of bounding both (38) and (39) from above. Using Section 3 (ii), (iii), and the mean value theorem applied to $K_{x}(x^{i},\cdot)$ and $K_{y}(\cdot,y^{i})$ ,

[TABLE]

the latter under the assumption that $\|\widebar{x}^{i+1}-{\widehat{x}}\|\leq r_{x,i}+\delta_{x}$ , which we now verify. First, by definition,

[TABLE]

Applying (37) and (38), we obtain

[TABLE]

The bound (35) on $\tau_{i}$ implies that $C_{x}\leq R_{x}\leq\delta_{x}/2$ and hence that $\|\widebar{x}^{i+1}-{\widehat{x}}\|\leq r_{x,i}+\delta_{x}$ . From (38) we thus obtain $\|x^{i+1}-{\widehat{x}}\|\leq r_{x,i}+\delta_{x}$ . The bound (35) on $\sigma_{i}$ then implies that $C_{y}\leq R_{y}\leq\delta_{y}$ , which together with (39) completes the proof.

To chain the applications of Lemma 5.1 on each iteration $i\in\mathbb{N}$ , we introduce the following assumption, for which we recall the notations in Section 3 as well as the definition of $\mathcal{U}(\rho_{x},\rho_{y})$ from (14).

Assumption 5.2.

Suppose Section 3 holds near a solution ${\widehat{u}}\in H^{-1}(0)$ . Given an initial iterate $u^{0}\in X\times Y$ , and initial step length parameters $\tau_{0},\sigma_{1},\omega_{0}>0$ as well as $0<\delta\leq\mu<1$ (to satisfy (24)), define the weighted distance

[TABLE]

We then assume that there exist $\delta_{x},\delta_{y}>0$ and $r_{y}\geq r_{\max}\sqrt{\nu(1-\delta)\delta(\mu-\delta)^{-1}}$ such that

[TABLE]

and that for all $i\in\mathbb{N}$ the step lengths $\tau_{i},\sigma_{i}>0$ satisfy

[TABLE]

Lemma 5.3.

For all $i\in\mathbb{N}$ , suppose $u^{i+1}$ solves (IPP) and that all the conditions of Theorem 4.2 are satisfied for some $\rho_{x},\rho_{y}>0$ and $\widetilde{\gamma}_{G},\widetilde{\gamma}_{F^{*}}\geq 0$ except for the requirement $u^{i},u^{i+1},\widebar{u}^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ . Then if Section 5 holds, $\{u^{i}\}_{i\in\mathbb{N}},\{\widebar{u}^{i+1}\}_{i\in\mathbb{N}}\subset\mathcal{U}(\rho_{x},\rho_{y})$ .

Proof 5.4.

We define $r_{x,i}:=\frac{1}{\sqrt{\delta\phi_{i}}}\|u^{0}-{\widehat{u}}\|_{Z_{1}M_{1}}$ and

[TABLE]

Since the conditions (24) hold, we can apply Corollary 4.4 and the estimate (34) on $Z_{i+1}M_{i+1}$ to deduce that

[TABLE]

From (24b), we also deduce that $\phi_{i+1}\geq\phi_{i}$ and hence that $r_{x,i+1}\leq r_{x,i}$ . Consequently, if $r_{x,0}\leq r_{\max}$ , then

[TABLE]

so it will suffice to show that $u^{i}\in\mathbb{B}({\widehat{x}},r_{x,i}+\delta_{x})\times\mathbb{B}({\widehat{y}},r_{y}+\delta_{y})$ for each $i\in\mathbb{N}$ to prove the claim. We do this in two steps. In the first step, we show that $r_{x,i}\leq r_{\max}$ and

[TABLE]

In the second step, we show by induction that $u^{i}\in\mathcal{U}_{i}$ as well as $\widebar{u}^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ for $i\in\mathbb{N}$ .

Step 1

We first prove (44). Since $\mathcal{U}_{i}\subseteq\mathbb{B}({\widehat{x}},r_{x,i})\times Y$ , we only have to show that $\mathcal{U}_{i}\subseteq X\times\mathbb{B}({\widehat{y}},r_{y})$ . First, note that (24) and $\widetilde{\gamma}_{G},\widetilde{\gamma}_{F^{*}}\geq 0$ imply $\psi_{i+1}\geq\psi_{i}\geq\psi_{1}$ as well as $\phi_{i+1}\geq\phi_{i}\geq\phi_{0}=\eta_{1}\omega_{0}\tau_{0}^{-1}=\nu\psi_{1}$ for $\nu$ defined in (40). We then obtain from the definition of $r_{x,i}$ substituting $Z_{1}M_{1}$ from (25) that

[TABLE]

Using Cauchy’s and Young’s inequalities, the fact that $\phi_{i}\geq\nu\psi_{1}$ , and the assumption that $\|K_{xy}(x^{0},y^{0})\|\leq R_{K}$ , we arrive at

[TABLE]

We obtain from Corollary 4.4 that $\eta_{0}^{2}\phi_{0}^{-1}R_{K}^{2}\leq(1-\mu)\psi_{1}\leq\psi_{1}$ and hence that $r_{x,i}^{2}\leq r_{\max}^{2}$ . The assumption on $r_{y}$ then yields for all $i\in\mathbb{N}$ that

[TABLE]

Thus (44) follows from the definition of $\mathcal{U}_{i}$ .

Step 2

We next show by induction that $u^{i}\in\mathcal{U}_{i}$ and $\widebar{u}^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ for all $i\in\mathbb{N}$ . Since (42) holds for $i=0$ , we have that $u^{0}\in\mathcal{U}_{0}$ . Moreover, since in Step 1 we have $r_{x,0}\leq r_{\max}$ , the bound (35) for $i=0$ follows from (41). This gives the induction basis.

Suppose now that $u^{N}\in\mathcal{U}_{N}$ . By (44), we have that $u^{N}\in\mathbb{B}({\widehat{x}},r_{x,N})\times\mathbb{B}({\widehat{y}},r_{y})$ . Since again the bound (35) for $i=N$ follows from (41) and the bound $r_{x,N}\leq r_{\max}$ follows from Step 1, we can apply Lemma 5.1 to obtain

[TABLE]

By (43), we have $\mathbb{B}({\widehat{x}},r_{x,N}+\delta_{x})\times\mathbb{B}({\widehat{y}},r_{y}+\delta_{y})\subseteq\mathcal{U}(\rho_{x},\rho_{y})$ and thus $u^{N+1},\widebar{u}^{N+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ . Theorem 4.2 now implies that (22) is satisfied for $i\leq N$ with $\Delta_{N+1}\leq 0$ , which together with (23) and (42) yields that $u^{N+1}\in\mathcal{U}_{N+1}$ . This completes the induction step and hence the proof.

6 Convergence estimates

We are now ready to formulate the main convergence results of this paper based on the estimates derived above. First, based on (24d) and (24e), strong convexity may be required if $\xi_{x}$ and $\xi_{y}$ have to be positive for Section 3 to be satisfied. Moreover, the neighborhood $\mathcal{U}(\rho_{x},\rho_{y})$ has to be small enough, as determined by the assumptions $\theta_{x}\geq\rho_{y}{\underline{\omega}}^{-1}$ and $\theta_{y}\geq\overline{\omega}\rho_{x}$ in the next results. This affects the admissible step lengths and how close we have to initialize $u^{0}$ via Section 5. After the next three main convergence results, we show that Section 5 is satisfied if we initialize close enough to a root ${\widehat{u}}\in H^{-1}(0)$ . Hence, to apply the theorems in practice, we have to find constants for which Sections 3 and 3 are satisfied, use these constants to bound and compute the step lengths as described in the theorems, and initialize close enough to ${\widehat{u}}$ . In Appendix B we consider some relaxation of Section 3 (iv), which in turn requires larger $\gamma_{G}$ and $\gamma_{F^{*}}$ instead of $\theta_{x}\geq\rho_{y}\underline{\omega}^{-1}$ and $\theta_{y}\geq\overline{\omega}\rho_{x}$ .

The following theorem provides conditions sufficient for weak convergence of the sequence $\{u^{i}\}_{i\in\mathbb{N}}$ generated by Algorithm 1.1. Apart from technical requirements of Theorem 4.2, we require additional weak-to-strong continuity of the mapping $u\mapsto K_{yx}(u)x$ . While its verification depends on the particular choice of $K$ , it is trivially satisfied in two cases: (i) $X$ and $Y$ are finite-dimensional and $K_{yx}$ is continuous; or (ii) the mapping $u\mapsto K_{yx}(u)x$ is linear and compact.

Theorem 6.1 (weak convergence: $\omega_{i}=1$ ).

Suppose Sections 3, 3 and 5 hold for some $R_{K}>0$ ; $L_{yx}\geq 0$ ; $\lambda_{x},\lambda_{y},\theta_{x},\theta_{y}\geq 0$ ; and $\xi_{x},\xi_{y}\in\mathbb{R}$ such that

[TABLE]

For some $0<\delta<\mu<1$ , choose

[TABLE]

Furthermore, suppose that

(i)

$u^{i}\mathrel{\rightharpoonup}\widebar{u}$ implies that $K_{yx}(u^{i})x\rightarrow K_{yx}(\widebar{u})x$ for all $x\in X$ ,

and either

(iia)

the mapping $u\mapsto(K_{x}(u),K_{y}(u))$ is weak-to-strong continuous in $\mathcal{U}(\rho_{x},\rho_{y})$ ; or

(iib)

the mapping $u\mapsto(K_{x}(u),K_{y}(u))$ is weak-to-weak continuous, but Section 3 (monotone $\partial G$ and $\partial F^{*}$ ) and Section 3 (iv) (three-point condition on $K$ ) hold at any weak limit $\widebar{u}=(\widebar{x},\widebar{y})$ of $\{u^{i}\}_{i\in\mathbb{N}}$ for the same choices of $\theta_{x}$ and $\theta_{y}$ .

Then the sequence $\{u^{i}\}_{i\in\mathbb{N}}$ generated by Algorithm 1.1 converges weakly to some $\widebar{u}\in H^{-1}(0)$ (possibly different from ${\widehat{u}}$ ).

Since it is assumed that $\theta_{x}\geq 2\rho_{y}$ , we can replace $\rho_{y}$ by $\theta_{x}/2$ in the bound on $\tau$ in (47) if the latter is more readily available.

For constant $\tau$ , $\sigma$ , and $\omega=1$ , we have to set $\psi_{i}\equiv\psi$ and $\phi_{i}\equiv\phi$ to satisfy (24a). Consequently, applying Corollary 4.4 to bound $Z_{i+1}M_{i+1}$ from below will not help to prove Theorem 6.1. We instead will make use of the following enhanced version of Opial’s lemma.

Lemma 6.2 ([10, Lemma A.2]).

Let $U$ be a Hilbert space, $\hat{U}\subset U$ (not necessarily closed or convex), and $\{u^{i}\}_{i\in\mathbb{N}}\subset U$ . Also let $A_{i}\in\mathcal{L}(U;U)$ be self-adjoint and $A_{i}\geq\hat{\epsilon}^{2}I$ for some $\hat{\epsilon}\neq 0$ for all $i\in\mathbb{N}$ . If the following conditions hold, then $u^{i}\mathrel{\rightharpoonup}\widebar{u}$ in $U$ for some $\widebar{u}\in\hat{U}$ :

(i)

The sequence $\{\|u^{i}-\hat{u}\|_{A_{i}}\}_{i\in\mathbb{N}}$ is nonincreasing for some $\hat{u}\in\hat{U}$ .

(ii)

All weak limit points of $\{u^{i}\}_{i\in\mathbb{N}}$ belong to $\hat{U}$ .

(iii)

There exists $C>0$ such that $\|A_{i}\|\leq C^{2}$ for all $i$ , and for any weakly convergent subsequence $\{u^{i_{k}}\}_{k\in\mathbb{N}}$ there exists $A_{\infty}\in\mathcal{L}(U;U)$ such that $A_{i_{k}}u\to A_{\infty}u$ strongly in $U$ for all $u\in U$ .

Proof 6.3 (Proof of Theorem 6.1).

We first verify (24) so that we can apply Theorem 4.2 and Lemma 5.3. We set $\psi_{N}\equiv 1$ , $\phi_{N}\equiv\sigma\tau^{-1}$ , $\widetilde{\gamma}_{G}=\widetilde{\gamma}_{F^{*}}=0$ to satisfy (24a), (24b), (24d) and (24e) for $\omega=\underline{\omega}=\overline{\omega}=1$ and $\xi_{x}$ , $\xi_{y}$ , $\theta_{x}$ , $\theta_{y}$ satisfying (46). With the choice $\omega=1$ , the bounds (47) thus ensure (24c).

Hence (24) holds, which together with Section 5 and $\psi_{1}=1$ enables us to use Lemma 5.3 to obtain $\{u^{i}\}_{i\in\mathbb{N}}\subset\mathcal{U}(\rho_{x},\rho_{y})$ and $\{\widebar{x}^{i+1}\}_{i\in\mathbb{N}}\subset\mathbb{B}({\widehat{x}},\rho_{x})$ . Therefore there exists at least one weak limit point of $\{u^{i}\}_{i\in\mathbb{N}}$ . Moreover, (25) yields self-adjointness of $Z_{i+1}M_{i+1}$ and since the bounds (47) are strict, Theorem 4.2 holds with $\Delta_{i+1}\leq-\hat{\delta}\sum_{i=0}^{N}\|u^{i+1}-u^{i}\|^{2}$ for some $\hat{\delta}>0$ .

We now verify the conditions of Lemma 6.2 with $\hat{U}=H^{-1}(0)$ and $A_{i}=Z_{i+1}M_{i+1}$ . Estimate (23) is valid for any starting iterate; thus setting $N=1$ and taking $u^{i}$ instead of $u^{0}$ , we obtain $\|u^{i+1}-{\widehat{u}}\|^{2}_{Z_{i+2}M_{i+2}}\leq\|u^{i}-{\widehat{u}}\|^{2}_{Z_{i+1}M_{i+1}}+\Delta_{i+1}$ for any $\Delta_{i+1}\leq 0$ due to Theorem 4.2. This verifies (i). Moreover, (iii) follows from the assumed constant step lengths, Section 3 (iii), and the assumption that $K_{yx}(u^{i})x\rightarrow K_{yx}(\widebar{u})x$ for all $x\in X$ if $u^{i}\mathrel{\rightharpoonup}\widebar{u}$ .

Hence we only need to verify (ii), i.e., if a subsequence of $\{u^{i}\}_{i\in\mathbb{N}}$ converges weakly to some $\widebar{u}$ , then $\widebar{u}\in H^{-1}(0)$ . We note that $W_{i+1}\equiv W$ , and (IPP) implies that $v_{i+1}\in WA(u^{i+1})$ for

[TABLE]

Therefore it suffices to show that if $u^{i_{k}}\mathrel{\rightharpoonup}\widebar{u}=(\widebar{x},\widebar{y})$ for a subsequence, then

[TABLE]

which by construction is equivalent to $\widebar{u}\in H^{-1}(0)$ . Note that $A$ is maximally monotone since it only involves subgradient mappings of proper convex lower semicontinuous functions due to Section 3. Moreover, further use of (23) shows that $\sum_{i=0}^{\infty}\frac{\hat{\delta}}{2}\|u^{i+1}-u^{i}\|^{2}<\infty$ and hence that $\|u^{i+1}-u^{i}\|\to 0$ . The last two terms in (51) thus converge strongly to zero. We therefore only have to consider the first term, for which we make a case distinction.

(a)

If assumption (iia) holds, we obtain that $v_{i_{k}}\to\widebar{v}$ , and the required inclusion $\widebar{v}\in A(\widebar{u})$ follows from the fact that the graph of the maximally monotone operator $A$ is sequentially weakly–strongly closed; see [4, Proposition 16.36].

(b)

If assumption (iib) holds, then only $v_{i_{k}}\mathrel{\rightharpoonup}\widebar{v}$ . In this case, we can apply the Brezis–Crandall–Pazy Lemma [4, Corollary 20.59 (iii)] to obtain the required inclusion under the additional condition that $\limsup_{k\to\infty}\,\langle u^{i_{k}}-\widebar{u},v_{i_{k}}-\widebar{v}\rangle\leq 0$ . In our case, recalling that the last two terms of (51) converge strongly to zero, we have that

[TABLE]

for

[TABLE]

Defining

[TABLE]

we rearrange and estimate

[TABLE]

Using $\xi_{x}=\gamma_{G}$ , $\xi_{y}=\gamma_{F^{*}}$ , (16), and both Section 3 and Section 3 (iv) at $\widebar{u}$ , we estimate $q_{i}\leq O(\|u^{i+1}-u^{i}\|)$ as

[TABLE]

In the last bounds we used $\theta_{x}\geq 2\rho_{y}$ , $\theta_{y}\geq 2\rho_{x}$ , and $\|y^{i+1}-\widebar{y}\|\leq 2\rho_{y}$ because both $\|y^{i+1}-{\widehat{y}}\|\leq\rho_{y}$ and $\|{\widehat{y}}-\widebar{y}\|\leq\rho_{y}$ ; likewise, $\|x^{i+1}-\widebar{x}\|\leq 2\rho_{x}$ . Since $\|u^{i+1}-u^{i}\|\to 0$ , we obtain that $\limsup_{i\rightarrow\infty}\,q_{i}\leq 0$ . The Brezis–Crandall–Pazy Lemma thus yields the desired inclusion $\widebar{v}\in A(\widebar{u})$ .

Hence in both cases, $\widebar{u}\in H^{-1}(0)$ and the condition (ii) of Lemma 6.2 is satisfied. Applying Lemma 6.2, we obtain the claim.

We now provide convergence rates under additional assumptions of strong convexity of $G$ and/or $F^{*}$ , although we still allow non-convexity of the overall problem through $K$ . To be specific, we require that we can take the acceleration or step length update factors $\widetilde{\gamma}_{G}>0$ and/or $\widetilde{\gamma}_{F^{*}}>0$ in (24d) and (24e), respectively. Let us start with $\widetilde{\gamma}_{G}>0$ , which is the case, for instance, when $G$ is strongly convex and (15a) holds with $\xi_{x}=0$ . Since we obtain a fortiori strong convergence from the rates, we do not require the additional assumptions on $K$ introduced in Theorem 6.1; on the other hand, we only obtain convergence of the primal iterates. Similar to the linear case of [10], the step length choice follows directly from having to satisfy (24b) and the desire to keep the right-hand side of the $\sigma$ -rule (24c) constant.

Theorem 6.4 (convergence rates under acceleration: $\omega_{i}=1$ ).

Suppose Sections 3, 3 and 5 hold for some $R_{K}>0$ ; $L_{yx}\geq 0$ ; $\lambda_{x},\lambda_{y},\theta_{x},\theta_{y}\geq 0$ ; and $\xi_{x},\xi_{y}\in\mathbb{R}$ such that for some $\widetilde{\gamma}_{G}>0$ ,

[TABLE]

Choose

[TABLE]

satisfying for some $0<\delta\leq\mu<1$ the bounds

[TABLE]

Then $\|x^{N}-{\widehat{x}}\|^{2}$ converges to zero at the rate $O(1/N)$ .

Proof 6.5.

We again first verify (24) so that we can apply Theorem 4.2 and Lemma 5.3. Setting $\psi_{i}\equiv 1$ , $\eta_{i}\equiv\sigma$ , $\phi_{i}:=\sigma\tau_{i}^{-1}$ , and $\widetilde{\gamma}_{F^{*}}=0$ , (24a) follows from the $\sigma$ -rule of (54) and the choice of $\psi_{i}$ , $\eta_{i}$ , and $\phi_{i}$ . Using (54) and $\tau_{i}:=\sigma\phi_{i}^{-1}$ , we obtain $\phi_{i+1}=(1+2\widetilde{\gamma}_{G}\tau_{i})\phi_{i}$ , and hence (24b) follows. Since $\tau_{i}\leq\tau_{0}$ and $\lambda_{y}\geq 0$ , (24c) follows from (55) and $\omega_{i}=1$ . Furthermore, (24d) and (24e) are satisfied due to the assumed bounds (53) on $\xi_{x}$ , $\xi_{y}$ , $\theta_{x}$ , and $\theta_{y}$ taking $\overline{\omega}=\underline{\omega}=1$ .

We can thus apply Theorem 4.2 and Lemma 5.3 to arrive at (23) for $\Delta_{i+1}=0$ . We now estimate the convergence rate from (23) by bounding $Z_{N+1}M_{N+1}$ from below. Using Corollary 4.4, we obtain $\delta\phi_{N}\|x^{N}-{\widehat{x}}\|^{2}\leq\|u^{0}-{\widehat{u}}\|^{2}_{Z_{1}M_{1}}$ . Moreover,

[TABLE]

which yields the claim.
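The linear growth of $\phi_{N}$ behind the $O(1/N)$ rate can be checked numerically. The following sketch only uses the relations stated in the proof, namely $\tau_{i}=\sigma\phi_{i}^{-1}$ and $\phi_{i+1}=(1+2\widetilde{\gamma}_{G}\tau_{i})\phi_{i}$ , which together give $\phi_{i+1}=\phi_{i}+2\widetilde{\gamma}_{G}\sigma$ . The function name and the parameter values in the usage note are hypothetical, and the sketch does not check the step length bounds (55).

```python
def accelerated_testing_parameters(tau0, sigma, gamma_tilde_G, n_iter):
    """Iterate phi_{i+1} = (1 + 2*gamma_tilde_G*tau_i)*phi_i with tau_i = sigma/phi_i.

    Returns the lists [tau_0, ..., tau_N] and [phi_0, ..., phi_N].
    """
    phi = sigma / tau0              # phi_0 = sigma / tau_0
    taus, phis = [tau0], [phi]
    for _ in range(n_iter):
        tau = sigma / phi           # current primal step length tau_i
        phi = (1.0 + 2.0 * gamma_tilde_G * tau) * phi
        phis.append(phi)
        taus.append(sigma / phi)    # next primal step length tau_{i+1}
    return taus, phis
```

For instance, with `tau0 = 0.5`, `sigma = 0.5`, and `gamma_tilde_G = 0.1`, the recursion gives $\phi_{N}=\phi_{0}+2\widetilde{\gamma}_{G}\sigma N=1+0.1N$ , so the bound $\delta\phi_{N}\|x^{N}-{\widehat{x}}\|^{2}\leq\|u^{0}-{\widehat{u}}\|^{2}_{Z_{1}M_{1}}$ decays like $1/N$ .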

Theorem 6.6 (linear convergence: $\omega_{i}<1$ ).

Suppose Sections 3, 3 and 5 hold for some $R_{K}>0$ ; $L_{yx}\geq 0$ ; $\lambda_{x},\lambda_{y}\geq 0$ ; and $\widetilde{\gamma}_{G},\widetilde{\gamma}_{F^{*}}>0$ as well as

[TABLE]

with

[TABLE]

Assume for some $0<\delta\leq\mu<1$ the bound

[TABLE]

Then $\|u^{N}-{\widehat{u}}\|^{2}$ converges to zero with the linear rate $O(\omega^{N})$ .

Proof 6.7.

We will use Theorem 4.2 and Lemma 5.3, for both of which we need to verify (24) first. We set $\overline{\omega}:=\underline{\omega}:=\omega$ ,

[TABLE]

Then $\psi_{1}=1$ and $\psi_{N}\sigma=\phi_{N}\tau$ , verifying (24a) and (24b). We next observe that substituting $\sigma_{i}=\tau\widetilde{\gamma}_{G}\widetilde{\gamma}_{F^{*}}^{-1}$ , the first bound of (24c) is tantamount to requiring

[TABLE]

Substituting $\omega=(1+2\widetilde{\gamma}_{G}\tau)^{-1}$ , this in turn is equivalent to

[TABLE]

which after solving a quadratic inequality for $\tau$ yields the second bound of (58). Since $\omega\leq 1$ , the first bound of (58) gives the second bound of (24c). Finally, (24d) and (24e) follow directly from (56) with $\underline{\omega}=\overline{\omega}=\omega$ .

Since Section 5 and (24) hold, we can apply Lemma 5.3 to obtain $\{u^{i}\}_{i\in\mathbb{N}}\subset\mathcal{U}(\rho_{x},\rho_{y})$ and $\{\widebar{x}^{i+1}\}_{i\in\mathbb{N}}\subset\mathbb{B}({\widehat{x}},\rho_{x})$ . Moreover, (25) yields self-adjointness of $Z_{i+1}M_{i+1}$ . Consequently, we can apply Theorem 4.2 and Lemma 5.3 to arrive at (23) for any $\Delta_{i+1}\leq 0$ .

We now estimate the convergence rate from (23) by bounding $Z_{N+1}M_{N+1}$ from below. Using Corollary 4.4, we obtain that

[TABLE]

Since $\omega\in(0,1)$ , this gives the claimed linear convergence rate through the exponential growth of $1/\omega^{N}$ .
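To get a feeling for the linear rate, the following sketch computes the over-relaxation parameter and the dual step length from the relations used in the proof, $\omega=(1+2\widetilde{\gamma}_{G}\tau)^{-1}$ and $\sigma=\tau\widetilde{\gamma}_{G}\widetilde{\gamma}_{F^{*}}^{-1}$ , and counts how many iterations are needed for $\omega^{N}$ to drop below a given factor. The function names are hypothetical, and the sketch assumes that $\tau$ has already been chosen to satisfy the bound (58); it does not verify that bound.

```python
import math


def linear_rate_parameters(tau, gamma_tilde_G, gamma_tilde_Fstar):
    """Return (omega, sigma) for a given primal step length tau."""
    omega = 1.0 / (1.0 + 2.0 * gamma_tilde_G * tau)
    sigma = tau * gamma_tilde_G / gamma_tilde_Fstar
    return omega, sigma


def iterations_for_reduction(omega, factor):
    """Smallest integer N with omega**N <= factor, for 0 < omega, factor < 1."""
    return math.ceil(math.log(factor) / math.log(omega))
```

With the illustrative values `tau = 0.5`, `gamma_tilde_G = 1.0`, and `gamma_tilde_Fstar = 2.0`, this gives $\omega=0.5$ and $\sigma=0.25$ , and ten iterations suffice to reduce the factor $\omega^{N}$ below $10^{-3}$ .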

Remark 6.8.

If $K(x,y)=\langle A(x),y\rangle$ for some $A\in C^{1}(X)$ , then $K_{x}(x,y)=[\nabla A(x)]^{*}y$ and $K_{y}(x,y)=A(x)$ with $L_{y}(x)=0$ and $L_{yx}=L$ for $L$ a local Lipschitz factor of $\nabla A$ . Furthermore, Section 3, the step length bounds, and the update rules required in Theorem 6.1 or 6.6 reduce to the corresponding ones introduced in [10] for this case. As for acceleration, Theorem 6.4 now gives a weaker convergence rate of $O(1/N)$ compared to $O(1/N^{2})$ in [10, Theorem 4.3]. This is due to (24c) requiring $\sigma_{i}$ to be bounded whenever $\lambda_{y}>0$ , even when $\tau_{i}$ goes to zero.
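The formulas $K_{x}(x,y)=[\nabla A(x)]^{*}y$ and $K_{y}(x,y)=A(x)$ from the remark can be illustrated in finite dimensions. In the sketch below, the map `A`, its Jacobian `grad_A`, and the test point are hypothetical choices made only for illustration; the finite-difference routine is a sanity check and not part of the method.

```python
import numpy as np


def A(x):
    """Hypothetical smooth nonlinear map from R^2 to R^2."""
    return np.array([x[0] ** 2, x[0] * x[1]])


def grad_A(x):
    """Jacobian of A at x (rows index the components of A)."""
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]]])


def K(x, y):
    """Coupling term K(x, y) = <A(x), y>."""
    return float(A(x) @ y)


def K_x(x, y):
    """Partial derivative in x: [grad A(x)]^* y."""
    return grad_A(x).T @ y


def K_y(x, y):
    """Partial derivative in y: A(x)."""
    return A(x)


def finite_difference_K_x(x, y, h=1e-6):
    """Central finite-difference approximation of K_x(x, y)."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (K(x + e, y) - K(x - e, y)) / (2.0 * h)
    return g
```

Passing `K_x` and `K_y` built this way to a primal-dual step recovers the setting of [10], where the dual variable enters the coupling linearly.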

Before we conclude this section, we refine Section 5 by showing that its implicit requirements do not impose additional step length bounds provided the starting point is sufficiently close to ${\widehat{u}}$ .

Proposition 6.9.

Under the assumptions of Theorem 6.1, 6.4, or 6.6, suppose that $\rho_{x},\rho_{y}>0$ . Then there exists $\varepsilon>0$ such that Section 5 holds whenever the initial iterate $u^{0}=(x^{0},y^{0})$ satisfies

[TABLE]

Proof 6.10.

We take $\mu$ , $\delta$ , $\sigma_{i}$ , $\tau_{i}$ , and $\omega_{i}$ as they are defined in the corresponding Theorem 6.1, 6.4, or 6.6, and $L_{x}(\widehat{y})$ , $L_{y}(\widehat{x}),R_{K}$ from Section 3. We need to show that there exist $\delta_{x},\delta_{y}>0$ and $r_{y}\geq r_{\max}\sqrt{\nu(1-\delta)\delta(\mu-\delta)^{-1}}$ such that (41) holds and

[TABLE]

Let $\varepsilon>0$ and set $r_{y}:=\varepsilon\sqrt{\nu(1-\delta)\delta(\mu-\delta)^{-1}}$ as well as $\delta_{x}:=\sqrt{\varepsilon}$ and $\delta_{y}:=\rho_{y}-r_{y}$ . Observing (60), we then see both that $\delta_{y}>0$ and that (61) holds for $\varepsilon>0$ sufficiently small. Furthermore, (60) yields that $r_{\max}\leq\varepsilon$ in Lemma 5.3. Let

[TABLE]

Since $r_{y},r_{\max}=O(\varepsilon)$ , $\delta_{x}=\sqrt{\varepsilon}$ , and $\delta_{y}>\rho_{y}/2>0$ for $\varepsilon>0$ small enough, we see that $c_{\varepsilon}\to\infty$ as $\varepsilon\to 0$ . Comparing the definition of $c_{\varepsilon}$ to (41), we therefore see that the latter holds for any given $\tau_{0}>0$ and $\sigma_{i}\equiv\sigma>0$ by taking $\varepsilon>0$ sufficiently small. Since in Theorems 6.1, 6.4 and 6.6 we have $\tau_{i}\leq\tau_{0}$ , the inequalities (41) hold.

7 Numerical examples

Finally, we illustrate the applicability of the proposed approach for the example applications described in Section 2. The Julia implementation used to generate the following results is on Zenodo [11].

7.1 An elliptic Nash equilibrium problem

Our first example illustrates the reformulation from Section 2.1 for the two-player elliptic Nash equilibrium problem from [6]. Here the action space of each player is $L^{2}(\Omega)$ for a bounded domain $\Omega\subset\mathbb{R}^{d}$ with boundary $\partial\Omega$ . To avoid confusion with the spatial variable, we will in this subsection denote the primal variable with $u$ and the dual variable with $v$ . The set of admissible strategies is

[TABLE]

For a set of strategies $u:=(u_{1},u_{2})\in X=X_{1}\times X_{2}$ , the payout function for each player is

[TABLE]

where $\alpha_{k}>0$ , $z_{k}\in L^{2}(\Omega)$ are given target states, $S:L^{2}(\Omega)^{2}\to L^{2}(\Omega)$ maps $u=(u_{1},u_{2})$ to the solution $y$ to the elliptic boundary value problem

[TABLE]

$B_{k}:L^{2}(\Omega)\to L^{2}(\Omega)$ are control operators which are here chosen as

[TABLE]

for some control domains $\omega_{k}\subset\Omega$ , and $f$ is a common source term. Following Section 2.1, the corresponding Nash equilibrium problem (3) can then be solved by applying Algorithm 1.1 to

[TABLE]

To implement the algorithm, we need explicit forms of the proximal mappings for $G$ and $F^{*}$ and of the partial derivatives of $K$ . Since $G=F^{*}=\delta_{X}$ for $X=X_{1}\times X_{2}$ , we have

[TABLE]

for the metric projections onto the convex sets $X_{k}$ given pointwise almost everywhere by

[TABLE]
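In a discretized setting, such pointwise projections amount to clipping. The sketch below assumes that each admissible set $X_{k}$ consists of functions bounded pointwise between constants $a$ and $b$ , which is an assumption about the elided definition above; the default values $a=-0.5$ and $b=0.5$ are the ones used in the numerical tests later in this subsection. The function names are hypothetical.

```python
import numpy as np


def project_box(u, a=-0.5, b=0.5):
    """Pointwise metric projection of a grid function u onto {a <= u <= b}."""
    return np.clip(u, a, b)


def prox_indicator_product(u1, u2, a=-0.5, b=0.5):
    """Proximal map of the indicator of X_1 x X_2 (assumed box constraints).

    The proximal map of an indicator function is the projection onto the
    set, independently of the step length, so no step length is passed.
    """
    return project_box(u1, a, b), project_box(u2, a, b)
```

Because $G=F^{*}=\delta_{X}$ , the same routine would serve for both the primal and the dual proximal step under this assumption.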

It remains to address the computation of $K_{u}(u,v)$ and $K_{v}(u,v)$ . Using adjoint calculus and the linearity of the adjoint equation, we have that

[TABLE]

where $p_{1}(u,v)=:p_{1}$ and $p_{2}(u,v)=:p_{2}$ are the solutions to the equations

[TABLE]

all with homogeneous Dirichlet conditions. Hence, every iteration of Algorithm 1.1 requires nine solutions of a partial differential equation (recall that $K_{v}$ is evaluated at $(\widebar{u}^{i+1},v^{i})$ , while $K_{u}$ is evaluated at $(u^{i},v^{i})$ ). Since $S$ and hence $K_{u}$ and $K_{v}$ are affine in $u$ and $v$ , the assumptions of Theorem 6.1 are satisfied for sufficiently small step lengths. Since neither $F^{*}$ nor $G$ is strongly convex, no acceleration is possible.
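The evaluation pattern just described (primal gradient at the current iterate, dual gradient at the over-relaxed primal point) can be sketched generically as follows. This is an illustration of the iteration as read from the text, written for scalar variables named $x$ and $y$ instead of $u$ and $v$; the function `gpdps` and the toy problem with $G(x)=x^{2}/2$, $F^{*}(y)=y^{2}/2$ and bilinear $K(x,y)=xy$ are assumptions chosen for testability, not the elliptic example.

```python
def gpdps(x, y, prox_G, prox_Fstar, K_x, K_y, tau, sigma, omega, iters):
    """Sketch of a primal-dual proximal splitting loop with general coupling K."""
    for _ in range(iters):
        # Primal step: gradient of K in x taken at the current iterate (x, y).
        x_new = prox_G(x - tau * K_x(x, y), tau)
        # Over-relaxation of the primal variable.
        x_bar = x_new + omega * (x_new - x)
        # Dual step: gradient of K in y taken at (x_bar, y).
        y = prox_Fstar(y + sigma * K_y(x_bar, y), sigma)
        x = x_new
    return x, y


# Toy saddle-point problem with saddle point (0, 0).
x_end, y_end = gpdps(
    1.0, -2.0,
    prox_G=lambda z, t: z / (1.0 + t),      # prox of t * x^2 / 2
    prox_Fstar=lambda z, s: z / (1.0 + s),  # prox of s * y^2 / 2
    K_x=lambda x, y: y,
    K_y=lambda x, y: x,
    tau=0.5, sigma=0.5, omega=1.0, iters=200,
)
```

For the elliptic problem, `K_x` and `K_y` would each hide several PDE solves, which is where the per-iteration cost comes from.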

For our numerical tests we follow [6] and consider a finite-difference discretization of (62) on $\Omega=(0,1)^{2}$ with $N$ nodes in each direction,

[TABLE]

as well as $a=-0.5$ , $b=0.5$ , and $\alpha_{i}=1$ . Using the method of manufactured solutions, $z_{1}$ , $z_{2}$ , and $f$ are chosen such that the solution $u^{*}=(u_{1}^{*},u_{2}^{*})$ of the Nash equilibrium problem is known a priori; see Fig. 3. By construction, the saddle point then satisfies $v^{*}=u^{*}$ and hence $\Psi(u^{*},v^{*})=0$ .

Since the Lipschitz constants for $K$ and its derivatives are not available, we simply take the parameters in Algorithm 1.1 as $\sigma_{i+1}\equiv\sigma=1.0$ , $\tau_{i}\equiv\tau=0.99$ , and $\omega=1.0$ . The results of the algorithm for different values of $N\in\{64,128,256,512,1024\}$ are shown in Table 1, which reports the distance of the primal-dual iterates $(u^{i},v^{i})$ to the exact solution. As can be seen, the iteration converges in each case to machine precision within $5$ iterations, and the convergence behavior is virtually identical. This demonstrates the mesh independence expected from an algorithm for which convergence can be shown in function spaces.

7.2 $\ell^{0}$ -TV denoising

Our next example concerns the $\ell^{0}$ -TV denoising or segmentation problem from Section 2.2. Recall that we can solve the (Huber-regularized) $\ell^{0}$ -TV problem (5) by applying Algorithm 1.1 to

[TABLE]

for $p\in\{1,\infty\}$ and $\gamma\geq 0$ , where $D_{h}:\mathbb{R}^{N_{1}\times N_{2}}\to\mathbb{R}^{N_{1}\times N_{2}\times 2}$ is the discrete gradient. We write $H_{\gamma}$ for $H$ defined in (17) corresponding to $F^{*}=F_{\gamma}^{*}$ . Since $G$ and $F_{\gamma}^{*}$ are quadratic, a simple computation shows that

[TABLE]

where all operations are to be understood componentwise. For the derivatives of $K_{p}$ , we have by the chain rule

[TABLE]

where $D_{h}^{T}$ is the discrete (negative) divergence. For the partial derivatives $\kappa_{p,z}(z,y)$ and $\kappa_{p,y}(z,y)$ , we again distinguish the cases $p=1$ and $p=\infty$ :

[TABLE]

It remains to choose valid step sizes for Algorithm 1.1, for which the next result gives useful estimates. We recall from [7] that a forward differences discretization of the gradient operator satisfies $\|D_{h}\|_{2}\leq\sqrt{8}/h$ . Recalling (63) and the definitions of $G$ and $F_{\gamma}^{*}$ , a critical point $({\widehat{x}},{\widehat{y}})\in H^{-1}_{\gamma}(0)$ satisfies

[TABLE]
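The bound $\|D_{h}\|_{2}\leq\sqrt{8}/h$ recalled above can be checked numerically on a small grid. The sketch below assumes a forward-difference gradient whose last row and column differences are set to zero, and estimates $\|D_{h}\|_{2}^{2}$ by power iteration on $D_{h}^{T}D_{h}$; all function names are hypothetical, and pure Python is used only to keep the example self-contained.

```python
import math
import random


def grad(x, h=1.0):
    """Forward differences; differences across the last row/column are zero."""
    n1, n2 = len(x), len(x[0])
    gx = [[0.0] * n2 for _ in range(n1)]
    gy = [[0.0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(n2):
            if i + 1 < n1:
                gx[i][j] = (x[i + 1][j] - x[i][j]) / h
            if j + 1 < n2:
                gy[i][j] = (x[i][j + 1] - x[i][j]) / h
    return gx, gy


def grad_adjoint(gx, gy, h=1.0):
    """Transpose of grad, obtained by scattering each difference back."""
    n1, n2 = len(gx), len(gx[0])
    out = [[0.0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(n2):
            if i + 1 < n1:
                out[i + 1][j] += gx[i][j] / h
                out[i][j] -= gx[i][j] / h
            if j + 1 < n2:
                out[i][j + 1] += gy[i][j] / h
                out[i][j] -= gy[i][j] / h
    return out


def norm_sq_estimate(n1, n2, h=1.0, iters=200, seed=0):
    """Power iteration on D^T D; returns a lower estimate of ||D||_2^2."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(n2)] for _ in range(n1)]
    est = 0.0
    for _ in range(iters):
        nrm = math.sqrt(sum(v * v for row in x for v in row))
        x = [[v / nrm for v in row] for row in x]
        gx, gy = grad(x, h)
        x = grad_adjoint(gx, gy, h)
        est = math.sqrt(sum(v * v for row in x for v in row))
    return est


est = norm_sq_estimate(16, 16)
```

Because the estimate is the norm of $D_{h}^{T}D_{h}$ applied to a unit vector, it can only approach $\|D_{h}\|_{2}^{2}$ from below, so it stays under $8/h^{2}$.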

For brevity, we set

[TABLE]

Using the results of Appendix C, we verify the fundamental assumption of Section 3.

Corollary 7.1.

Let $K=K_{p}$ for either $p=1$ or $p=\infty$ . Choose $L\geq\|D_{h}\|_{2}$ and $R_{K}>2L$ . Then Section 3 holds for some $\theta_{x},\theta_{y}>0$ and $\rho_{x},\rho_{y}>0$ with

[TABLE]

and the constants $\xi_{x},\xi_{y}>0$ , $\lambda_{x},\lambda_{y}\geq 0$ satisfying

[TABLE]

Proof 7.2.

We consider only $p=\infty$ as the proof for $p=1$ is similar. Taking $\widetilde{R}_{K}>2$ , Lemma C.1 applied componentwise shows that the operator $\kappa_{p}$ satisfies Section 3 for some $\widetilde{\theta}_{z},\widetilde{\theta}_{y}>0$ and $\widetilde{\rho}_{x},\widetilde{\rho}_{y}>0$ (depending on $\widetilde{R}_{K}$ ) when we take

[TABLE]

Moreover, the constants $\widetilde{\xi}_{z},\widetilde{\xi}_{y}\in\mathbb{R}$ and $\widetilde{\lambda}_{z},\widetilde{\lambda}_{y}\geq 0$ need to satisfy $\widetilde{\xi}_{z}\widetilde{\lambda}_{z}>\max_{ij}2(\lambda_{z}+\|{\widehat{y}}_{\,\boldsymbol{\cdot}\,ij}\|^{2})\|{\widehat{y}}_{\,\boldsymbol{\cdot}\,ij}\|^{2}$ as well as $\widetilde{\xi}_{y}>0$ and $\widetilde{\lambda}_{y}>\max_{ij}\|{\widehat{z}}_{\,\boldsymbol{\cdot}\,ij}\|^{2}$ for ${\widehat{z}}=D_{h}{\widehat{x}}$ .

By Lemma C.3 on compositions with a linear operator, we can now take

[TABLE]

These give the claim.

We now obtain from Theorem 6.6 the following estimate.

Corollary 7.3.

Suppose Section 3 holds. Choose $L\geq\|D_{h}\|_{2}$ . For some $\widetilde{\gamma}_{G}\in(0,\alpha^{-1})$ and $\widetilde{\gamma}_{F^{*}}\in(0,\gamma)$ , take $\xi_{x}=\alpha^{-1}-\widetilde{\gamma}_{G}$ and $\xi_{y}=\gamma-\widetilde{\gamma}_{F^{*}}$ as well as $\lambda_{x},\lambda_{y}\geq 0$ such that (65) holds. For some $0<\delta\leq\mu<1$ , take $\sigma=\tau\widetilde{\gamma}_{G}\widetilde{\gamma}_{F^{*}}^{-1}$ and $\omega:=(1+2\widetilde{\gamma}_{G}\tau)^{-1}$ as well as

[TABLE]

Then $\|u^{N}-{\widehat{u}}\|^{2}$ converges to zero with the linear rate $O(\omega^{N})$ provided $u^{0}$ is close enough to ${\widehat{u}}$ .

Proof 7.4.

The assumptions $\widetilde{\gamma}_{G}\in(0,\alpha^{-1})$ and $\widetilde{\gamma}_{F^{*}}\in(0,\gamma)$ ensure $\xi_{x},\xi_{y}>0$ . Since we have assumed (65), Corollary 7.1 yields Section 3 for any $R_{K}>2L$ and some $\theta_{x},\theta_{y}>0$ . We next use Theorem 6.6, whose conditions we need to verify. First, taking $\rho_{x},\rho_{y}>0$ sufficiently small ensures that $\theta_{x}\geq\rho_{y}\omega^{-1}$ and $\theta_{y}\geq\omega\rho_{x}$ . Furthermore, the strict inequality in (66) implies (58) for sufficiently small $\rho_{y}>0$ . Finally, Proposition 6.9 ensures that we can satisfy Section 5 by taking $u^{0}$ sufficiently close to ${\widehat{u}}$ . The remaining conditions we have assumed explicitly, so we can apply Theorem 6.6 to finish the proof.

Recall that Section 3 is a second-order growth condition at the critical point $({\widehat{x}},{\widehat{y}})$ , which is a common assumption needed to show convergence of algorithms for non-convex optimization problems. To calculate the upper bounds on $\tau$ in (66), we need to find $\lambda_{x},\lambda_{y}\geq 0$ satisfying (65). For this, in turn, we need to estimate $\widehat{m}_{x}$ and $\widehat{m}_{y}$ . To do this, note that the critical point conditions (64) imply

[TABLE]

Since $t\mapsto t/(t+\gamma)$ is increasing, we can estimate $\widehat{m}_{y}$ based on $\widehat{m}_{x}$ . Since any solution of the Potts problem should be piecewise constant with very few intensity quantization levels, we can estimate $\widehat{m}_{x}$ as the expected maximal jump between neighboring pixels. We take this as 100% of the dynamic range for safety. In practice, as a practical choice of $\gamma>0$ will likely not satisfy $\xi_{x}>2L\widehat{m}_{y}^{2}$ , we use an over-approximation $\widebar{\gamma}:=10\geq\gamma$ in (67). We remark that we thus cannot guarantee convergence of Algorithm 1.1 for small $\gamma>0$ ; however, we demonstrate below that these estimates can still lead to useful step sizes for such cases. Similarly, we do not have an estimate for the unknown local neighborhood of convergence; we compensate for this by taking small $\delta=0.1$ in (66). As the results below demonstrate, with these parameters we nevertheless observe convergence for the reasonable starting point $u^{0}=(x^{0},y^{0})$ with $x^{0}=f$ and $y^{0}\equiv 0$ .

We illustrate the performance of the algorithm and the effects of the choice of $p$ . As a test image, we choose “blobs” from the ImageJ framework [30] with size $N_{1}\times N_{2}=256\times 254$ , see Fig. 4(a). We set $\alpha=1$ and $\gamma=10^{-3}$ (cf. Fig. 2) and use the accelerated step size rule from Theorem 6.6. To do this, we need to satisfy (66) for the primal step length $\tau$ . We discretize the problem such that $h=1$ and hence $L=\sqrt{8}$ . Furthermore, we set $\widetilde{\gamma}_{F^{*}}=\gamma/100$ and $\widetilde{\gamma}_{G}=\widetilde{\alpha}^{-1}$ for $\widetilde{\alpha}=10\alpha$ . The above estimates then lead to the step length parameters

$p=1$ :

$\tau=1.04085\cdot 10^{-3}$ , $\sigma=1.04085$ , $\omega=0.99480$ ;

$p=\infty$ :

$\tau=5.51922\cdot 10^{-4}$ , $\sigma=0.551922$ , $\omega=0.99724$ .

Since the exact solution $({\widehat{x}},{\widehat{y}})$ is not available here, we instead use $x^{\max}:=x^{N_{\max}}$ for $N_{\max}=10^{6}$ and similarly $y^{\max}$ as references for computing errors. The corresponding reference images $x^{\max}$ obtained from Algorithm 1.1 after $N_{\max}=10^{6}$ iterations are shown in Figs. 4(b) and 4(c) for $p=1$ and $p=\infty$ , respectively. While the evaluation of the formulation and the algorithm in the context of image processing is outside of the scope of this work, we briefly comment on the difference between $p=1$ and $p=\infty$ . As can be seen by comparing the two images, the results are very similar. However, since diagonal jumps are penalized less for $p=\infty$ , the isotropic Huber–Potts model is better able to preserve small light blobs such as the one indicated by the red circles. The edges of the blobs are also noticeably smoother.

The convergence behavior of the method for both choices of $p$ over $N_{\max}/2=5\cdot 10^{5}$ iterations is given in Fig. 5. For the function values, we observe in Fig. 5(a) the usual fast decrease in the beginning of the iteration, after which the values stagnate. Nevertheless, the errors continue to decrease down to machine precision at the predicted linear rate. The convergence behavior for $p=1$ and $p=\infty$ is similar, although the linear convergence for $p=\infty$ is with a significantly smaller constant. We remark that visually, the iterates in both cases are indistinguishable from the reference images already after $N=10^{4}$ iterations. This is consistent with Fig. 5(b) since the total error is dominated by the dual component, which acts as an edge indicator; small changes of the boundaries of the blobs during the iteration will, even for small gray value changes, lead to large differences in the dual variable.
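To compare an observed error curve with a predicted rate $O(\omega^{N})$ , one option is to fit a line to the logarithm of the errors. The following sketch does this by ordinary least squares; the function `estimate_linear_rate` and the synthetic data are illustrative assumptions and do not reproduce the errors of Fig. 5.

```python
import math


def estimate_linear_rate(errors):
    """Fit log(error_i) ~ c + i * log(rate) by least squares; return rate."""
    n = len(errors)
    logs = [math.log(e) for e in errors]
    mean_i = (n - 1) / 2.0
    mean_l = sum(logs) / n
    num = sum((i - mean_i) * (l - mean_l) for i, l in enumerate(logs))
    den = sum((i - mean_i) ** 2 for i in range(n))
    return math.exp(num / den)


# Synthetic errors decaying exactly like 3 * 0.995^i (made-up data).
synthetic = [3.0 * 0.995 ** i for i in range(1000)]
rate = estimate_linear_rate(synthetic)
```

In practice the fit should be restricted to the iterations after the initial fast phase and before the errors reach machine precision, since both ends distort the slope.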

8 Conclusion

Using generalized conjugation, some non-smooth non-convex optimization problems can be transformed into saddle-point problems involving non-smooth convex functionals and a smooth non-convex-concave coupling term. For such problems, a generalized primal–dual proximal splitting method can be applied that converges weakly under step length conditions if a local quadratic growth condition is satisfied near a saddle-point. Under additional strong convexity assumptions on the functionals (but not the coupling term and hence the problem), convergence rates for accelerated algorithms can be shown. This approach can be applied to elliptic Nash equilibrium problems and for the anisotropic and isotropic Huber-regularized Potts models, as the numerical examples illustrate. Future work is concerned with further evaluating and comparing the performance of the proposed algorithm for these examples.

Acknowledgments

In the first stages of the research T. Valkonen and S. Mazurenko were supported by the EPSRC First Grant EP/P021298/1, “PARTIAL Analysis of Relations in Tasks of Inversion for Algorithmic Leverage”. Later T. Valkonen was supported by the Academy of Finland grants 314701 and 320022. C. Clason was supported by the German Science Foundation (DFG) under grant Cl 487/2-1. We thank the anonymous reviewers for insightful comments.

A data statement for the EPSRC

The source codes for the numerical experiments are on Zenodo at [11].

Appendix A Reductions of the three-point condition

The following two propositions demonstrate that Section 3 (iv) is closely related to standard second-order optimality conditions, i.e., that the Hessian is positive definite at the solution ${\widehat{u}}$ .

Proposition A.1.

Suppose Section 3 (ii) (locally Lipschitz gradients of $K$ ) holds in some neighborhood $\mathcal{U}$ of ${\widehat{u}}$ , and for some $\xi_{x}\in\mathbb{R}$ , $\gamma_{x}>0$ ,

[TABLE]

Then (15a) holds in $\mathcal{U}$ with $\theta_{x}=2(\gamma_{x}-\alpha)L_{yx}^{-1}$ , and $\lambda_{x}=L_{x}({\widehat{y}})^{2}(2\alpha)^{-1}$ for any $\alpha\in(0,\gamma_{x}]$ .

Proof A.2.

An application of Cauchy’s and Young’s inequalities with any factor $\alpha>0$ , Section 3 (ii), and (68) yields the estimate

[TABLE]

At the same time, using (16),

[TABLE]

Therefore (15a) holds if we take $\theta_{x}\leq 2(\gamma_{x}-\alpha)L_{yx}^{-1}$ and $\lambda_{x}=L_{x}({\widehat{y}})^{2}(2\alpha)^{-1}$ .

Proposition A.3.

Suppose Section 3 (ii) (locally Lipschitz gradients of $K$ ) holds in some neighborhood $\mathcal{U}$ of ${\widehat{u}}$ with $L_{y}(x)\leq\widebar{L}_{y}$ , and that

[TABLE]

for some constant $L_{xy}\geq 0$ . Assume, moreover, for some $\xi_{y}\in\mathbb{R}$ , $\gamma_{y}>0$ that

[TABLE]

Then (15b) holds in $\mathcal{U}$ with $\theta_{y}=2(\gamma_{y}-\alpha_{1})(1+\alpha_{2})^{-1}L_{xy}^{-1}$ , and $\lambda_{y}=(\widebar{L}_{y}^{2}(2\alpha_{1})^{-1}+(1+\alpha_{2}^{-1})L_{xy}\theta_{y})$ for any $\alpha_{1}\in(0,\gamma_{y}]$ , $\alpha_{2}>0$ .

Proof A.4.

An application of Cauchy’s and Young’s inequalities with any factor $\alpha_{1}>0$ , Section 3 (ii), and (69) yields the estimate

[TABLE]

At the same time, using (16) and Young’s inequality for any $\alpha_{2}>0$ ,

[TABLE]

Therefore (15b) holds if we take $\theta_{y}\leq 2\frac{\gamma_{y}-\alpha_{1}}{(1+\alpha_{2})L_{xy}}$ and $\lambda_{y}=\frac{\widebar{L}_{y}^{2}}{2\alpha_{1}}+(1+\alpha_{2}^{-1})L_{xy}\theta_{y}$ .

Appendix B Relaxations of the three-point condition

In all the results of this paper, Section 3 (iv) can be generalized to the following three-point condition similar to the one used in [10].

Assumption B.0.

The functional $K(x,y)\in C^{1}(X\times Y)$ and there exists a neighborhood

[TABLE]

for some $\rho_{x},\rho_{y}>0$ such that for all $u^{\prime},u\in\mathcal{U}(\rho_{x},\rho_{y})$ , the following property holds:

(iv)

(three-point condition) There exist $\theta_{x},\theta_{y}>0$ , $\lambda_{x},\lambda_{y}\geq 0$ , $\xi_{x},\xi_{y}\in\mathbb{R}$ , and $p_{x},p_{y}\in[1,2]$ such that

[TABLE]

This assumption introduces $p_{x}$ and $p_{y}$ in $[1,2]$ , while in Section 3 (iv) we had $p_{x}=p_{y}=1$ . For instance, in [10, Appendix B] we verified Appendix B with $p_{x}=2$ for the case $K(x,y)=\langle A(x),y\rangle$ for the reconstruction of the phase and amplitude of a complex number. This relaxation mainly affects the proof of Step 4 in Theorem 4.2, which now requires a few intermediate derivations.

Corollary B.1.

The results of Theorem 4.2 continue to hold if Section 3 (iv) is replaced with Appendix B (iv) for some $p_{x},p_{y}\in[1,2]$ , where in case $p_{y}\in(1,2]$ , (24d) is replaced by

[TABLE]

Proof B.2.

The beginning of the proof follows the exact same steps as in the proof of Theorem 4.2 up until (30). We now use Appendix B (iv) to further bound $D_{x}$ and $D_{y}$ similarly to (31) and (32). From (71a),

[TABLE]

The following generalized Young’s inequality for any positive $a,b,p$ and $q$ such that $q^{-1}+p^{-1}=1$ allows for our choice of varying $p_{x}\in[1,2]$ :

[TABLE]

Applying this inequality with $p=p_{x}$ ,

[TABLE]

for any $\zeta_{x}>0$ to the last term of (73), we arrive at the estimate

[TABLE]

We now use $u^{i+1}\in\mathcal{U}(\rho_{x},\rho_{y})$ for some $\rho_{x},\rho_{y}\geq 0$ , and $\omega_{i}^{-1}\leq\underline{\omega}^{-1}$ to obtain

[TABLE]

If $p_{x}=1$ , we use the assumed inequality $\theta_{x}\geq\rho_{y}\underline{\omega}^{-1}$ from (24e) to show that the right-hand side of (75) is non-negative for any $\zeta_{x}>0$ . Otherwise we take $\zeta_{x}:=(\underline{\omega}\theta_{x}p_{x}^{p_{x}}\rho_{y}^{p_{x}-2})^{1/(1-p_{x})}$ to ensure the right-hand side of (75) is zero. In either case, $\theta_{x}-\rho_{y}^{2-p_{x}}(p_{x}^{p_{x}}\underline{\omega}\zeta_{x}^{p_{x}-1})^{-1}\geq 0$ and hence

[TABLE]
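For $p_{x}\in(1,2]$ and $\rho_{y}>0$ , the stated choice of $\zeta_{x}$ can be checked by direct substitution. The short computation below uses only the expressions appearing in the preceding paragraph.

```latex
\zeta_{x}^{p_{x}-1}
  =\bigl(\underline{\omega}\theta_{x}p_{x}^{p_{x}}\rho_{y}^{p_{x}-2}\bigr)^{-1}
\quad\Longrightarrow\quad
\rho_{y}^{2-p_{x}}\bigl(p_{x}^{p_{x}}\underline{\omega}\zeta_{x}^{p_{x}-1}\bigr)^{-1}
  =\rho_{y}^{2-p_{x}}\,
   \frac{\underline{\omega}\theta_{x}p_{x}^{p_{x}}\rho_{y}^{p_{x}-2}}
        {p_{x}^{p_{x}}\underline{\omega}}
  =\theta_{x}.
```

Hence $\theta_{x}-\rho_{y}^{2-p_{x}}(p_{x}^{p_{x}}\underline{\omega}\zeta_{x}^{p_{x}-1})^{-1}=0$ for this $\zeta_{x}$ , which is the equality case of the non-negativity used above.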

Analogously, from (71b) and Cauchy’s inequality,

[TABLE]

This has a structure similar to (73) with $\omega_{i}$ now as a multiplier. Hence, we apply a similar generalized Young’s inequality to the last term with any $\zeta_{y}>0$ . Noting that $\omega_{i}\leq\overline{\omega}$ , we use the following bound similar to (75):

[TABLE]

The last inequality holds for any $\zeta_{y}>0$ if $p_{y}=1$ due to the assumed $\theta_{y}\geq\overline{\omega}\rho_{x}$ from (24d); otherwise, we set $\zeta_{y}:=(\theta_{y}p_{y}^{p_{y}}\rho_{x}^{p_{y}-2}\overline{\omega}^{-1})^{1/(1-p_{y})}$ . We then obtain that

[TABLE]

Combining (30), (76), and (77), we can thus bound

[TABLE]

where in the final step, we have also used (72) and the selected $\zeta_{x}$ and $\zeta_{y}$ if $p_{x}>1$ or $p_{y}>1$ or both. Thus, we obtained exactly the same lower bound as in (33). We then continue along the rest of the proof of Theorem 4.2 to obtain the claim.

It is worth observing that when $p_{x}\in(1,2]$ or $p_{y}\in(1,2]$ , the inequalities (72) do not directly bound the respective $\rho_{y}$ or $\rho_{x}$ . Hence, we do not need to initialize the corresponding variable locally, unlike when $p_{x}=1$ or $p_{y}=1$ . On the other hand, sufficient strong convexity is required from the corresponding $G$ and $F^{*}$ .

We start with the lemma ensuring that the iterates stay in the initial neighborhood of the saddle point.

Corollary B.3.

The results of Lemma 5.3 continue to hold if the corresponding conditions of Theorem 4.2 are replaced with those in Corollary B.1.

Proof B.4.

The proof repeats that of Lemma 5.3, applying Corollary B.1 instead of Theorem 4.2 in Step 2.

We next extend the results of Section 6 to arbitrary choices of both $p_{x}\in[1,2]$ and $p_{y}\in[1,2]$ . This mainly consists of verifying (72a) when $p_{y}\neq 1$ and (72b) when $p_{x}\neq 1$ . Note that it is possible to take $p_{x}=1$ and $p_{y}\neq 1$ , or vice versa, as long as the corresponding conditions are satisfied.

Corollary B.5.

The results of Theorem 6.1 continue to hold if Section 3 (iv) is replaced with Appendix B (iv) for some $p_{x},p_{y}\in[1,2]$ , where in case $p_{y}\in(1,2]$ , (46a) is replaced with

[TABLE]

Proof B.6.

Since conditions (79) are sufficient for (72) with $\overline{\omega}=\underline{\omega}=1$ to hold, we can repeat the proof of Theorem 6.1 replacing the references to Theorem 4.2 by references to Corollary B.1 up until (52). If $p_{x}>1$ , we now obtain a lower bound on $d_{i}^{x}$ by arguing as in (73)–(75) with ${\widehat{u}}$ replaced by $\widebar{u}$ . Specifically, using (16), Appendix B (iv) at $\widebar{u}$ , and the generalized Young’s inequality (74), we obtain for any $\zeta_{x}>0$ that

[TABLE]

Inserting $\zeta_{x}=(\theta_{x}p_{x}^{p_{x}}(2\rho_{y})^{p_{x}-2})^{1/(1-p_{x})}$ and $\|y^{i+1}-\widebar{y}\|\leq 2\rho_{y}$ , we eliminate the first term on the right-hand side. Likewise, if $p_{y}>1$ , similar steps applied to $d_{i}^{y}$ result in

[TABLE]

for $\zeta_{y}=(\theta_{y}p_{y}^{p_{y}}(2\rho_{x})^{p_{y}-2})^{1/(1-p_{y})}$ . Using $\|u^{i+1}-u^{i}\|\to 0$ and the selection of $\zeta_{x}$ and $\zeta_{y}$ , we then obtain the desired estimate $\limsup_{i\rightarrow\infty}q_{i}:=\limsup_{i\rightarrow\infty}(d_{i}^{x}+d_{i}^{y}+O(\|u^{i+1}-u^{i}\|))\leq 0$ .

Corollary B.7.

The results of Theorem 6.4 continue to hold if Section 3 (iv) is replaced with Appendix B (iv) for some $p_{x},p_{y}\in[1,2]$ , where in case $p_{y}\in(1,2]$ , (53a) is replaced for some $\widetilde{\gamma}_{G}>0$ with

[TABLE]

Proof B.8.

Conditions (80) are sufficient for (72) with $\overline{\omega}=\underline{\omega}=1$ to hold; therefore, we can repeat the proof of Theorem 6.4 replacing the references to Theorem 4.2 by references to Corollary B.1.

Corollary B.9.

The results of Theorem 6.6 continue to hold if Section 3 (iv) is replaced with Appendix B (iv) for some $p_{x},p_{y}\in[1,2]$ , where in case $p_{y}\in(1,2]$ , (56a) is replaced for some $\widetilde{\gamma}_{G}>0$ with

[TABLE]

Proof B.10.

Conditions (81) are sufficient for (72) with $\overline{\omega}=\underline{\omega}=\omega$ to hold; therefore, we can repeat the proof of Theorem 6.6 replacing the references to Theorem 4.2 by references to Corollary B.1.

Corollary B.11.

The results of Proposition 6.9 continue to hold if the corresponding conditions of Theorem 6.1, 6.4, or 6.6 are replaced with those in Corollary B.5, B.7, or B.9.

Proof B.12.

The proof repeats that of Proposition 6.9.

Appendix C Verification of conditions for step function presentation and Potts model

Throughout this section, we set $\rho(t):=2t-t^{2}$ and $\kappa(x,y):=\rho(\langle x,y\rangle)$ for $x,y\in\mathbb{R}^{m}$ . Then $\rho^{\prime}(t)=2(1-t)$ so that

[TABLE]

where $a\otimes b\in\mathbb{R}^{m\times m}$ is the tensor product between two vectors $a,b\in\mathbb{R}^{m}$ , producing a matrix of all the combinations of products between the entries.
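By the chain rule with $\rho^{\prime}(t)=2(1-t)$ , the first-order partial derivatives are $\kappa_{x}(x,y)=2(1-\langle x,y\rangle)y$ and $\kappa_{y}(x,y)=2(1-\langle x,y\rangle)x$ . The following sketch checks these formulas against central finite differences; the function names and the test vectors are illustrative choices.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def rho(t):
    return 2.0 * t - t * t


def kappa(x, y):
    return rho(dot(x, y))


def kappa_x(x, y):
    # rho'(<x, y>) * y with rho'(t) = 2 (1 - t)
    c = 2.0 * (1.0 - dot(x, y))
    return [c * yi for yi in y]


def kappa_y(x, y):
    # rho'(<x, y>) * x
    c = 2.0 * (1.0 - dot(x, y))
    return [c * xi for xi in x]


def fd_grad_x(x, y, eps=1e-6):
    """Central finite-difference approximation of the gradient in x."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((kappa(xp, y) - kappa(xm, y)) / (2.0 * eps))
    return g


x0 = [0.3, -0.2]
y0 = [0.5, 0.4]
analytic = kappa_x(x0, y0)
numeric = fd_grad_x(x0, y0)
```

Since $\kappa$ is symmetric in its arguments, $\kappa_{y}(x,y)=\kappa_{x}(y,x)$ , so checking one partial derivative covers both.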

The following lemma verifies Section 3 for $K=\kappa$ .

Lemma C.1.

Let $R_{K}>2$ , and suppose ${\widehat{x}},{\widehat{y}}\in\mathbb{R}^{m}$ for $m\geq 1$ with

[TABLE]

Then the function $K=\kappa$ defined above satisfies Section 3 for some $\theta_{x},\theta_{y}>0$ and some $\rho_{x},\rho_{y}>0$ dependent on $R_{K}$ with

[TABLE]

as well as the constants $\xi_{x},\xi_{y}\in\mathbb{R}$ , $\lambda_{x},\lambda_{y}\geq 0$ satisfying $\lambda_{x}\xi_{x}>2(\lambda_{x}+|{\widehat{y}}|_{2}^{2})|{\widehat{y}}|_{2}^{2}$ , $\xi_{y}>0$ , and $\lambda_{y}>|{\widehat{x}}|_{2}^{2}$ .

Proof C.2.

First, Section 3 (i) holds everywhere since $K\in C^{\infty}(\mathbb{R}^{m})$ . To verify Section 3 (ii), we observe using (82) that

[TABLE]

Hence $L_{x}$ , $L_{y}$ , and $L_{yx}$ are as claimed.

To verify Section 3 (iii), we first of all observe using (83) that

[TABLE]

Therefore $\sup_{(x,y)\in\vmathbb{B}({\widehat{x}},\rho_{x})\times\vmathbb{B}({\widehat{y}},\rho_{y})}|\kappa_{xy}(x,y)|_{2}\leq R_{K}$ for some $\rho_{x},\rho_{y}>0$ dependent on $R_{K}>2$ .

Finally, to verify Section 3 (iv), we start with (15a), i.e.,

[TABLE]

Expanding the equation using (82), (84), and

[TABLE]

we require that

[TABLE]

Taking any $\alpha>0$ , this will hold by Cauchy’s and Young’s inequalities if $\xi_{x}\geq(2+\alpha)|{\widehat{y}}|_{2}^{2}+2\theta_{x}|y|_{2}$ and $\lambda_{x}/2\geq\alpha^{-1}|{\widehat{y}}|_{2}^{2}$ . If $|{\widehat{y}}|_{2}=0$ , clearly these hold for some $\alpha,\theta_{x}>0$ . Otherwise, solving $\alpha$ from the latter as an equality, i.e., taking $\alpha=2\lambda^{-1}_{x}|{\widehat{y}}|_{2}^{2}$ , the former holds if $\xi_{x}\geq 2(1+\lambda^{-1}_{x}|{\widehat{y}}|_{2}^{2})|{\widehat{y}}|_{2}^{2}+2\theta_{x}|y|_{2}$ . If $\lambda_{x}\xi_{x}>2(\lambda_{x}+|{\widehat{y}}|_{2}^{2})|{\widehat{y}}|_{2}^{2}$ , this holds for some $\theta_{x},\rho_{x},\rho_{y}>0$ in a neighborhood $\vmathbb{B}({\widehat{x}},\rho_{x})\times\vmathbb{B}({\widehat{y}},\rho_{y})$ of $({\widehat{x}},{\widehat{y}})$ .

It remains to verify (15b), i.e.,

[TABLE]

Again, using (82) and (84) we expand this as

[TABLE]

Rearranging the $\theta_{y}$ -term, we see that this holds if

[TABLE]

Rearranging and estimating the first term as

[TABLE]

and then using Young’s inequality on both parts, we obtain the condition

[TABLE]

If $\xi_{y}>0$ and $\lambda_{y}>|{\widehat{x}}|_{2}^{2}$ , this holds for some $\theta_{y},\rho_{y},\rho_{x}>0$ in $\vmathbb{B}({\widehat{x}},\rho_{x})\times\vmathbb{B}({\widehat{y}},\rho_{y})$ .

We comment on the condition (83) on the primal–dual solution pair ${\widehat{x}},{\widehat{y}}\in\mathbb{R}^{m}$ . First, for $m=1$ , this condition reduces to ${\widehat{x}}{\widehat{y}}\in[0,1]$ . This is necessarily satisfied in the case of the step function (where $f^{*}=\delta_{[0,\infty)}$ ) and in the case of the $\ell^{0}$ function (where $f^{*}=0$ ) as in both cases, ${\widehat{x}}{\widehat{y}}\in\{0,1\}$ by the dual optimality condition $\kappa_{y}({\widehat{x}},{\widehat{y}})\in\partial f^{*}({\widehat{y}})$ . Furthermore, if we take $f^{*}_{\gamma}=\frac{\gamma}{2}|\,\boldsymbol{\cdot}\,|_{2}^{2}$ for some $\gamma\geq 0$ , then for any $m\geq 1$ the dual optimality condition reads $2{\widehat{x}}(1-\langle{\widehat{x}},{\widehat{y}}\rangle)=\gamma{\widehat{y}}$ , i.e., ${\widehat{y}}=2{\widehat{x}}(\gamma+2|{\widehat{x}}|_{2}^{2})^{-1}$ , for which (83) is easily verified.
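The last claim can be spelled out as follows, reading (83) as the requirement $\langle{\widehat{x}},{\widehat{y}}\rangle\in[0,1]$ (which is what it reduces to for $m=1$ ) and assuming $\gamma>0$ so that the denominator is positive.

```latex
\widehat{y}=\frac{2\widehat{x}}{\gamma+2|\widehat{x}|_{2}^{2}}
\quad\Longrightarrow\quad
\langle\widehat{x},\widehat{y}\rangle
  =\frac{2|\widehat{x}|_{2}^{2}}{\gamma+2|\widehat{x}|_{2}^{2}}\in[0,1),
\qquad
2\widehat{x}\bigl(1-\langle\widehat{x},\widehat{y}\rangle\bigr)
  =\frac{2\gamma\,\widehat{x}}{\gamma+2|\widehat{x}|_{2}^{2}}
  =\gamma\widehat{y}.
```

For $\gamma=0$ and ${\widehat{x}}\neq 0$ the same formula gives $\langle{\widehat{x}},{\widehat{y}}\rangle=1$ , which still lies in $[0,1]$ .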

The following lemma shows that Section 3 remains valid if we include a linear operator in the primal component.

Lemma C.3.

Let $K(x,y)=\widetilde{K}(Ax,y)$ for some $A\in\mathcal{L}(X;Z)$ and $\widetilde{K}\in C^{1}(Z\times Y)$ on Hilbert spaces $X,Y,Z$ . Suppose $\widetilde{K}$ satisfies Section 3 at $({\widehat{z}},{\widehat{y}}):=(A{\widehat{x}},{\widehat{y}})$ . Mark the corresponding constants with a tilde: $\widetilde{L}_{z}$ , $\widetilde{R}_{K}$ , and so on. Then $K$ satisfies Section 3 with $R_{K}:=\widetilde{R}_{K}\|A\|$ ; $\xi_{x}=\|A\|\widetilde{\xi}_{z}$ , $\xi_{y}=\widetilde{\xi}_{y}$ ; $\lambda_{x}=\|A\|\widetilde{\lambda}_{z}$ , $\lambda_{y}=\widetilde{\lambda}_{y}$ ; $\theta_{x}=\widetilde{\theta}_{z}$ , $\theta_{y}=\widetilde{\theta}_{y}\|A\|^{-1}$ ; $\rho_{x}=\|A\|^{-1}\widetilde{\rho}_{x}$ , and $\rho_{y}=\widetilde{\rho}_{y}$ as well as

[TABLE]

Proof C.4.

Observe first of all that by the chain rule,

[TABLE]

and hence Section 3 (i) holds for $K$ if it holds for $\widetilde{K}$ .

Let now Section 3 (ii) hold for $\widetilde{K}$ with $\widetilde{L}_{x}$ , $\widetilde{L}_{y}$ , and $\widetilde{L}_{yx}$ . Observing that

[TABLE]

Section 3 (ii) thus also holds with the function of (86). Similarly in Section 3 (iii), we can take $R_{K}:=\widetilde{R}_{K}\|A\|$ .

Finally, we expand Section 3 (iv) for $K$ as

[TABLE]

where $z=Ax$ , $z^{\prime}=Ax^{\prime}$ , and ${\widehat{z}}=A{\widehat{x}}$ . Since $\|z-z^{\prime}\|\leq\|A\|\|x-x^{\prime}\|$ , etc., this follows from Section 3 (iv) for $\widetilde{K}$ with the constants as claimed.

Applying this lemma to $\widetilde{K}(z,y)=\sum_{k=1}^{n}\kappa(z_{k},y_{k})$ , we can thus lift the scalar estimates for $K=\kappa$ as in (82) to the corresponding estimates on $K(x,y):=\sum_{k=1}^{n}\kappa([D_{h}x]_{k},y_{k})$ as used in the Potts model example.
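As an illustration of this composition, the sketch below evaluates $K(x,y)=\sum_{k}\kappa([Dx]_{k},y_{k})$ and its primal gradient $D^{T}\kappa_{z}(Dx,y)$ for a one-dimensional signal. It assumes scalar components ( $m=1$ ), unit grid size, and a plain forward difference $D$ without boundary row; these simplifications and all names are assumptions made for the example.

```python
def forward_diff(x):
    """D: R^n -> R^(n-1), [Dx]_k = x_{k+1} - x_k."""
    return [x[k + 1] - x[k] for k in range(len(x) - 1)]


def forward_diff_adjoint(z, n):
    """D^T: R^(n-1) -> R^n, obtained by scattering each difference back."""
    out = [0.0] * n
    for k, zk in enumerate(z):
        out[k + 1] += zk
        out[k] -= zk
    return out


def K(x, y):
    # Sum of kappa(z_k, y_k) = 2 z_k y_k - (z_k y_k)^2 with z = D x.
    return sum(2.0 * zk * yk - (zk * yk) ** 2
               for zk, yk in zip(forward_diff(x), y))


def K_x(x, y):
    # Chain rule: K_x(x, y) = D^T kappa_z(D x, y).
    z = forward_diff(x)
    kz = [2.0 * (1.0 - zk * yk) * yk for zk, yk in zip(z, y)]
    return forward_diff_adjoint(kz, len(x))


x1 = [0.0, 0.1, 0.5, 0.4]
y1 = [0.3, -0.2, 0.7]
g1 = K_x(x1, y1)
```

The two-dimensional case replaces `forward_diff` by the discrete gradient $D_{h}$ and the scalar products $z_{k}y_{k}$ by inner products of the per-pixel gradient and dual vectors.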

Bibliography

[1] F. J. Aragón Artacho and M. H. Geoffroy, Characterization of metric regularity of subdifferentials, Journal of Convex Analysis 15 (2008), 365–380.
[2] F. J. Aragón Artacho and M. H. Geoffroy, Metric subregularity of the convex subdifferential in Banach spaces, J. Nonlinear Convex Anal. 15 (2014), 35–47.
[3] H. Attouch, J. Bolte, and B. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods, Mathematical Programming 137 (2013), 91–129, doi:10.1007/s10107-011-0484-9.
[4] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, 2 edition, 2017, doi:10.1007/978-3-319-48311-5.
[5] M. Benning, F. Knoll, C. B. Schönlieb, and T. Valkonen, Preconditioned ADMM with nonlinear operator constraint, in System Modeling and Optimization: 27th IFIP TC 7 Conference, CSMO 2015, Sophia Antipolis, France, June 29–July 3, 2015, Revised Selected Papers, L. Bociu, J. A. Désidéri, and A. Habbal (eds.), Springer International Publishing, 2016, 117–126, doi:10.1007/978-3-319-55795-3_10, arXiv:1511.00425, https://tuomov.iki.fi/m/nonlinearADMM.pdf.
[6] A. Borzì and C. Kanzow, Formulation and numerical solution of Nash equilibrium multiobjective elliptic control problems, SIAM Journal on Control and Optimization 51 (2013), 718–744, doi:10.1137/120864921.
[7] A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision 20 (2004), 89–97, doi:10.1023/b:jmiv.0000011325.36760.1e.
[8] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision 40 (2011), 120–145, doi:10.1007/s10851-010-0251-1.
