Convergence of the Forward-Backward Algorithm: Beyond the Worst Case with the Help of Geometry

Guillaume Garrigos; Lorenzo Rosasco; Silvia Villa

arXiv:1703.09477·math.OC·December 25, 2023

TL;DR

This paper investigates the convergence of the forward-backward algorithm using geometric conditions, extending classical notions to more general sets and infinite-dimensional spaces, with applications in inverse problems and signal processing.

Contribution

It extends geometric convergence analysis of the forward-backward algorithm to arbitrary sets and infinite dimensions, introducing new inequalities and connections to inverse problem conditions.

Findings

01 First Łojasiewicz inequality for a quadratic function with a compact operator

02 New linear convergence rates for inverse problems with low-complexity priors

03 Unified framework connecting geometry and inverse problem conditions

Abstract

We provide a comprehensive study of the convergence of the forward-backward algorithm under suitable geometric conditions, such as conditioning or Łojasiewicz properties. These geometrical notions are usually local by nature, and may fail to describe the fine geometry of objective functions relevant in inverse problems and signal processing, that have a nice behaviour on manifolds, or sets open with respect to a weak topology. Motivated by this observation, we revisit those geometric notions over arbitrary sets. In turn, this allows us to present several new results as well as collect in a unified view a variety of results scattered in the literature. Our contributions include the analysis of infinite dimensional convex minimization problems, showing the first Łojasiewicz inequality for a quadratic function associated to a compact operator, and the derivation of new linear rates…

Equations

$$x\mapsto\mbox{\rm dist\,}^{p}(x,\mbox{\rm argmin\,}f),$$

$$(\forall\lambda>0)(\forall x\in X)\quad\mbox{\rm prox}_{\lambda g}(x)=\underset{u\in X}{\mbox{\rm argmin\,}}\left\{g(u)+\frac{1}{2\lambda}\|u-x\|^{2}\right\}.$$

$$T_{\lambda}\colon x\in X\longmapsto T_{\lambda}x:=\mbox{\rm prox}_{\lambda g}(x-\lambda\nabla h(x))\in X,$$

$$f(x_{n})-\inf f\leq C\,\frac{\mbox{\rm dist\,}(x_{0},\mbox{\rm argmin\,}f)^{2}}{2\lambda n},\quad\text{with}\quad C=\begin{cases}1&\text{if }\lambda\leq L^{-1},\\1+2(\lambda L-1)(2-\lambda L)^{-1}&\text{otherwise.}\end{cases}$$

$$(\forall n\geq 1)\quad|x_{n}|\geq C_{p}n^{-1/(p-2)},\quad\text{where}\quad\lim_{p\to+\infty}\frac{1}{p-2}=0.$$

$$f_{p}(x_{n})-\inf f_{p}\geq C_{p}^{p}n^{-p/(p-2)}.$$

$$f_{\alpha}\colon\mathbb{R}\to\left]-\infty,+\infty\right],\qquad f_{\alpha}(x)=\begin{cases}|x|^{-\alpha}&\text{if }x<0,\\+\infty&\text{otherwise.}\end{cases}$$

$$f_{\alpha}(x_{n})-\inf f_{\alpha}\geq C_{\alpha}^{-\alpha}n^{-\alpha/(2+\alpha)},\quad\text{where}\quad\lim_{\alpha\to 0}\frac{\alpha}{2+\alpha}=0\quad\text{and}\quad\lim_{\alpha\to+\infty}\frac{\alpha}{2+\alpha}=1.$$

$$\forall x\in\Omega\cap\mbox{\rm dom\,}f,\quad\frac{\gamma_{f,\Omega}}{p}\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f)^{p}\leq f(x)-\inf f.$$

$$\forall x\in\Omega\cap\mbox{\rm dom}^{*}f,\quad\gamma_{\partial f,\Omega}\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f)^{p-1}\leq\|\partial f(x)\|_{\_}.$$

$$\forall x\in\Omega\cap\mbox{\rm dom}^{*}f,\quad(f(x)-\inf f)^{1-\frac{1}{p}}\leq c_{f,\Omega}\|\partial f(x)\|_{\_}.$$

$$(\forall(x_{1},x_{2})\in(\mbox{\rm dom\,}\partial f)^{2})(\forall x_{1}^{*}\in\partial f(x_{1}))\quad f(x_{2})-f(x_{1})-\langle x_{1}^{*},x_{2}-x_{1}\rangle\geq\frac{\gamma}{p}\|x_{2}-x_{1}\|^{p}.$$

$$\gamma_{f,X}\|x\|^{2}\leq\langle A^{*}Ax,x\rangle,\quad\gamma_{\partial f,X}\|x\|\leq\|A^{*}Ax\|,\quad\text{and}\quad\langle A^{*}Ax,x\rangle\leq 2c_{f,X}^{2}\|A^{*}Ax\|^{2}.$$

$$\gamma_{f,X}=\gamma_{\partial f,X}=1/(2c_{f,X}^{2})=\sigma_{inf}(A^{*}A),$$

$$u\in\mbox{\rm argmin\,}\tilde{g}\Leftrightarrow 0\in\partial(g-\langle A^{*}\bar{v},\cdot\rangle)(u)=\partial g(u)-A^{*}\bar{v}\Leftrightarrow A^{*}\bar{v}\in\partial g(u)\Leftrightarrow u\in\partial g^{*}(A^{*}\bar{v}),$$

$$f(x)-\inf f\geq\frac{\gamma_{f,\Omega}}{p}\mbox{\rm dist\,}^{p}(x,\mbox{\rm argmin\,}f),$$

$$\bar{x}\in\mbox{\rm argmin\,}f=\partial g^{*}(A^{*}\bar{v})\cap A^{-1}\partial h^{*}(-\bar{v}).$$

$$\bar{v}\in\mbox{\rm argmin\,}\psi=-\partial h(A\bar{x})\cap(A^{*})^{-1}\partial g(\bar{x}).$$

$$(\forall x\in\Omega\cap\delta\mathbb{B}_{X}\cap\mbox{\rm dom\,}f)\quad f(x)-\inf f\geq\gamma\,\mbox{\rm dist\,}^{p}\left(x,\partial g^{*}(A^{*}\bar{v})\cap A^{-1}\partial h^{*}(-\bar{v})\right).$$

$$f(x)-\inf f\geq C_{1}\left(\mbox{\rm dist\,}^{p}(x,\partial g^{*}(A^{*}\bar{v}))+\mbox{\rm dist\,}^{p}(Ax,\partial h^{*}(-\bar{v}))\right),$$

$$f(x)-\inf f\geq C_{1}\max\left\{\mbox{\rm dist\,}(x,\partial g^{*}(A^{*}\bar{v})),\mbox{\rm dist\,}(Ax,\partial h^{*}(-\bar{v}))\right\}^{p},$$

$$\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f)\leq C_{2}\max\left\{\mbox{\rm dist\,}(x,\partial g^{*}(A^{*}\bar{v})),\mbox{\rm dist\,}(x,A^{-1}\partial h^{*}(-\bar{v}))\right\}.$$

$$(\forall u\in X)\quad\phi_{y}(u)\geq(\sigma_{inf}(A^{*}A)/2)\,\mbox{\rm dist\,}^{2}(u,\mbox{\rm argmin\,}\phi_{y}).$$

$$\mbox{\rm dist\,}(Ax,R(A)\cap\partial h^{*}(-\bar{v}))\geq\sigma_{inf}(A)\,\mbox{\rm dist\,}(x,A^{-1}\partial h^{*}(-\bar{v})).$$

$$\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f)\leq C_{3}\max\left\{\mbox{\rm dist\,}(x,\partial g^{*}(A^{*}\bar{v})),\mbox{\rm dist\,}(Ax,R(A)\cap\partial h^{*}(-\bar{v}))\right\},$$

Full text

Convergence of the Forward-Backward algorithm:

Beyond the worst-case with the help of geometry

Guillaume Garrigos$^{1}$, Lorenzo Rosasco$^{2,3}$, and Silvia Villa$^{4}$

$^{1}$ LPSM, Université de Paris. 75205 Paris CEDEX 13, France.

$^{2}$ DIBRIS, Università degli Studi di Genova. Via Dodecaneso 35, 16146, Genova, Italy.

$^{3}$ LCSL, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology. Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.

$^{4}$ Dipartimento di Matematica, Università degli Studi di Genova. Via Dodecaneso 35, 16146, Genova, Italy.

Abstract

We provide a comprehensive study of the convergence of the forward-backward algorithm under suitable geometric conditions, such as conditioning or Łojasiewicz properties. These geometrical notions are usually local by nature, and may fail to describe the fine geometry of objective functions relevant in inverse problems and signal processing, that have a nice behaviour on manifolds, or sets open with respect to a weak topology. Motivated by this observation, we revisit those geometric notions over arbitrary sets. In turn, this allows us to present several new results as well as collect in a unified view a variety of results scattered in the literature. Our contributions include the analysis of infinite dimensional convex minimization problems, showing the first Łojasiewicz inequality for a quadratic function associated to a compact operator, and the derivation of new linear rates for problems arising from inverse problems with low-complexity priors. Our approach allows to establish unexpected connections between geometry and a priori conditions in inverse problems, such as source conditions, or restricted isometry properties.

Contact: G. Garrigos [email protected], L. Rosasco [email protected], S. Villa [email protected]

Acknowledgements: This material is supported by the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216, and the Air Force project FA9550-17-1-0390. L. Rosasco acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC. S. Villa is supported by the INDAM GNAMPA research project 2017 “Algoritmi di ottimizzazione ed equazioni di evoluzione ereditarie”.

1 Introduction

Splitting algorithms based on first-order descent methods are widely used to solve high-dimensional convex optimization problems in signal and image processing [28], compressed sensing [31], and machine learning [84]. Their main advantages are their simplicity and a complexity that is independent of the dimension of the problem. The worst-case convergence rates of these methods have been intensively investigated over the last twenty years. The simplest example is the gradient method applied to a smooth convex function, which is known to converge in values as $o(n^{-1})$ [32, 94]. Analogous results are known for the forward-backward splitting algorithm. We refer to these results as worst case since no particular assumption is made on the objective function aside from convexity and existence of a solution. Note that these rates are sharp, meaning that there are functions for which these rates are arbitrarily accurate. Clearly, such a large class of convex functions allows for functions with wild behaviors around the minimizers [16], behaviors that hardly appear in practice. It is then natural to ask whether improved rates can be proved under further regularity assumptions.

Previous work on optimization rates with geometry. One classical geometrical assumption is strong convexity, which indeed guarantees linear convergence rates [50, 95]. In practice, strong convexity is often too restrictive, and one would wish to relax it, while retaining fast rates. A relaxation of this condition is given by geometric conditions that, roughly speaking, describe convex functions $f\in\Gamma_{0}(X)$ that behave like

$$x\mapsto\mbox{\rm dist\,}^{p}(x,\mbox{\rm argmin\,}f),\tag{1}$$

for some $p\geq 1$ and on some subset $\Omega\subset X$, which is typically a neighborhood of the minimizers and/or a sublevel set. The intuition behind this kind of assumption, required on a neighborhood of the solution, is clear: the larger $p$ is, the “flatter” the function is around its minimizers, which in turn means that a gradient descent algorithm will converge slowly. The idea of exploiting geometric conditions to derive convergence rates has a long history dating back to [89, 91], and many similar convergence rate results have been derived under different yet related geometrical properties.

The optimization community has focused on several different but related geometric assumptions, namely $p$-conditioning, $p$-metric subregularity and the $p$-Łojasiewicz property (see Section 3 for their definitions). The first result exploiting geometry to derive fast convergence rates (if we discard the “classic” strong convexity assumption) dates back to Polyak [89, Theorem 4], who showed that the gradient method converges linearly (in terms of the values and iterates) when the objective function verifies the $2$-Łojasiewicz inequality. Improved convergence rates for first-order descent methods were then obtained in [91], considering notions slightly stronger than $p$-metric subregularity, and proving finite convergence of the proximal algorithm for $p=1$, and linear convergence for $p=2$. These results are improved and extended in [82], which analyzes for the first time convergence rates for the iterates of the proximal algorithm using metric subregularity for general $p\in[1,+\infty[$. The results in [82] recover those in [91] (see also [96, 97]), but also derive superlinear rates for $p\in\left]1,2\right[$, and sublinear rates for $p>2$. Roughly speaking, the results in [82] show that the larger $p$ is, the slower the algorithm. A related notion, nowadays called the Luo-Tseng error bound condition, was considered in the seminal paper [81], and implies the linear convergence of several first-order methods. Recently, this condition has been shown to be equivalent to $2$-conditioning [40, 74]. In the early 90’s, some attention was devoted to the study of $p$-conditioned functions, in particular for $p=1$ (some authors call this property superlinear conditioning, sharp growth or sharp minima property). In this context, [45, 64, 23] showed that the proximal algorithm terminates after a finite number of iterations. For $p=1$, Polyak [90, Theorem 7.2.1] obtained finite termination for the projected gradient method.

The $2$-conditioning was also used to obtain linear rates for the proximal algorithm in [70]. In [3], it was observed that the $p$-Łojasiewicz property could be used to derive precise rates for the iterates of the proximal algorithm. The authors obtain finite convergence when $p=1$, linear rates when $p\in\left]1,2\right]$, and sublinear rates when $p\in\left]2,+\infty\right[$. Similar results can be found in [4, 83]. Such convergence rates for the iterates have been extended to the forward-backward algorithm (and its alternating versions) in [18], and similar rates also hold for the convergence of the values in [27, 46]. More recently, various papers focused on conditions equivalent to (or stronger than) the $2$-conditioning to derive linear rates [67, 75, 41, 78, 40, 61]. Some effort has also been made to show that the Łojasiewicz property and conditioning are equivalent [16, 17], and to relate them to other error bounds appearing in the literature [61]. See also [85] for a refined analysis of linear rates for the projected gradient algorithm under conditions that interpolate between strong convexity and $2$-conditioning (see also Subsection 4.3).
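The finite, linear and sublinear regimes recalled above can be observed on one-dimensional model functions. The following Python sketch is only an illustration under assumptions that are not taken from the paper: it runs the proximal algorithm $x_{n+1}=\mbox{prox}_{\lambda f}(x_{n})$ on $f(x)=|x|^{p}/p$ for $p\in\{1,2,4\}$, with the arbitrary choices $\lambda=0.5$ and $x_{0}=1$. The helper names `prox_abs_power` and `proximal_iterates` are hypothetical.

```python
def prox_abs_power(x, lam, p):
    """Proximal map of f(u) = |u|**p / p on the real line, for p in {1, 2, 4}."""
    s = 1.0 if x >= 0 else -1.0
    a = abs(x)
    if p == 1:
        # Soft-thresholding: the iterate hits the minimizer 0 exactly.
        return s * max(a - lam, 0.0)
    if p == 2:
        # Linear contraction by the factor 1 / (1 + lam).
        return x / (1.0 + lam)
    if p == 4:
        # Solve u + lam * u**3 = a on [0, a] by bisection (the left side is increasing).
        lo, hi = 0.0, a
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if mid + lam * mid ** 3 < a:
                lo = mid
            else:
                hi = mid
        return s * 0.5 * (lo + hi)
    raise ValueError("only p in {1, 2, 4} is handled in this sketch")


def proximal_iterates(x0, lam, p, n_steps):
    """Return [x_0, ..., x_{n_steps}] for the proximal algorithm applied to |.|**p / p."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(prox_abs_power(xs[-1], lam, p))
    return xs


lam = 0.5
finite = proximal_iterates(1.0, lam, 1, 10)      # p = 1: finite termination
linear = proximal_iterates(1.0, lam, 2, 10)      # p = 2: linear rate
sublinear = proximal_iterates(1.0, lam, 4, 10)   # p = 4: sublinear rate
```

With these choices the $p=1$ run reaches the minimizer after two steps, the $p=2$ run contracts by the constant factor $1/(1+\lambda)=2/3$ at each step, and the $p=4$ run decays much more slowly.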

A key observation. Our study starts from a basic observation that enables a number of developments. Motivated by several relevant examples described in Section 5, we require condition (1) to hold on an arbitrary set $\Omega$, which in general is neither a neighborhood of the solution nor a sublevel set. This extension allows us to establish a connection with modeling assumptions considered in different contexts and to unveil their role in optimization. As we explain below, modeling assumptions such as source conditions in inverse problems [42] or the restricted injectivity property in sparse recovery [25] correspond to conditioning assumptions on specific subsets. This ensures global convergence rates for the forward-backward algorithm that are faster than those given by a worst-case analysis, and that are indeed often observed in practice.

Geometry and inverse problems. As a first example of the importance of considering arbitrary sets $\Omega$ to define geometrical properties, consider linear inverse problems $Ax=y$ for which the operator $A$ is an infinite dimensional compact operator, making the problem severely ill-posed. A common modeling assumption is to suppose that the minimal norm solution of the problem satisfies a source condition, which can be seen as a measure of its regularity (see Section 5.1 for a definition). Under this condition, it is shown that the sublinear rate of the gradient algorithm is faster than the worst-case one [42]. However, such a behavior apparently cannot be explained in terms of classical geometrical conditions satisfied by the least squares function: indeed, it was shown in [53] that such a least squares function cannot verify any Łojasiewicz inequality (1) in a neighborhood of its minimizers. On the contrary, thanks to the extension of the definition considered in this paper, we show that geometric assumptions are indeed satisfied, but only on specific subsets. More precisely, we show in Theorem 5.9 that the source condition guarantees that the least squares function $\|Ax-y\|^{2}$ is $p$-Łojasiewicz ($p>2$) on a dense affine subspace having empty interior. This explains the faster global rates of the gradient algorithm which are typically observed in this context.

As a second example, consider linear inverse problems with a low-complexity prior, such as sparse inverse problems. For these problems, the restricted injectivity condition [25] is a key modeling assumption to guarantee stable recovery: it means that, even if a linear measurement is corrupted by noise, we can hope to reconstruct an approximate solution by solving a regularized optimization problem. In Section 5.2, we show that this assumption implies a $2$-conditioning of the problem over a (nonconvex) cone of sparse vectors. Since this set is active, in the sense that it is reached by the algorithm after a finite time, it immediately gives us an asymptotic linear rate for the algorithm. For problems with more general low-complexity priors the situation is similar: an active set will be identified by the iterates of the algorithm, and we show that a restricted injectivity condition on the tangent cone to this active set induces a $2$-conditioning of the problem on this set. Depending on the application or on the hypotheses made on the problem, this set can be a low-dimensional manifold, or a set with less structure, and can be computed within the partial smoothness framework [54] or the mirror stratification one [43].

Paper contents. Motivated by the estimation problems presented in Section 5, the goal of this paper is to provide a comprehensive study of the convergence rates of the forward-backward algorithm for convex minimization problems satisfying geometric conditions on arbitrary sets. We collect in a unified view a variety of results scattered in the literature, and we extend them to this more general setting. In addition, we derive several novel results along the way. The paper is organized as follows.

After reviewing and discussing worst-case convergence results for the forward-backward algorithm in Section 2, we give in Section 3 the definition of different geometric conditions for a proper convex lower semicontinuous function $f$: $p$-conditioning, $p$-metric subregularity, and the $p$-Łojasiewicz property on general subsets $\Omega\subset X$, rather than sublevel sets or open sets, as typically done in the literature. We show that those geometrical notions are equivalent, provided that the set $\Omega$ is stable under the semigroup generated by $\partial f$ (see Proposition 3.3). Since establishing $p$-conditioning of a function may be hard in general, we provide two sum rules for conditioned functions in Theorem 3.15 and Theorem 3.17. The first one establishes that if a strictly convex function remains $p$-conditioned under linear perturbations, then it is also $p$-conditioned under convex perturbations. The second one gives conditions under which the sum of two conditioned functions is conditioned. It allows us to show in particular that the ROF model (minimization of the total variation and the Kullback-Leibler divergence) is $2$-conditioned on every bounded set.

Section 4 exploits the $p$-Łojasiewicz property on general sets to study the convergence of the forward-backward algorithm. In Theorem 4.1, we recover results from the literature and extend them to our more general setting, obtaining finite / superlinear / linear / sublinear convergence rates depending on the value of $p\in[1,+\infty[$. Along the way, we extend the sharp superlinear rate known for the proximal method to the forward-backward algorithm. In addition, our approach allows us to derive in a unified setting both nonasymptotic/global and asymptotic/local convergence results, see Corollaries 4.11 and 4.12. We go beyond the classical analysis by introducing a $p$-Łojasiewicz property with $p$ taking nonpositive values. This allows us to study convex functions that are bounded from below but have no minimizers, a case which has drawn little attention so far, but which can arise for instance in function approximation [35] or in statistical learning theory [34, Theorem 9] (see also Section 5.1). For such ill-posed problems, we derive new and sharp sublinear rates for the values in Theorem 4.6, interpolating between $o(n^{-1})$ and $o(1)$. We further show in Section 4.3 that the $2$-conditioning is essentially equivalent to the linear convergence of the forward-backward algorithm, illustrating the importance of this notion for convergence rate analysis.

In Section 5, we apply the aforementioned results to optimization problems arising from inverse problems, and discuss the interaction between geometry and modeling assumptions. The key results of this section are Theorem 5.9 and Theorem 5.20. Theorem 5.9 establishes that classical source conditions in inverse problems guarantee the Łojasiewicz property on special sets, and therefore give better convergence rates for the gradient method than the worst-case ones. Theorem 5.20 says that if we have an a priori assumption about the minimizer, which is assumed to belong to a set $C$, then a restricted injectivity property of the Hessian of the smooth component of the objective function implies that $f$ is $2$-conditioned on this set $C$ around the minimizer. This guarantees asymptotic linear rates for the forward-backward algorithm when combined with Corollary 4.15.

2 The forward-backward algorithm: notation and background

2.1 Notation and basic definitions

We recall a few classic notions and introduce some notation. Throughout the paper, $X$ is a Hilbert space. Given $\Omega\subset X$, we denote by $\mbox{int}\,\Omega$ and $\mbox{cl}\,\Omega$ its interior and closure. We say that $\Omega$ is a cone if $\Omega=\left]0,+\infty\right[\Omega$. We denote by $\mbox{cone}(\Omega)$ (resp. $\mbox{span}(\Omega)$) the smallest cone (resp. linear subspace) in $X$ containing $\Omega$. Let $x\in X$, $\delta\in\left]0,+\infty\right[$, and let $\mathbb{B}_{X}(x,\delta)$ and $\overline{\mathbb{B}}_{X}(x,\delta)$ denote respectively the open and closed balls of radius $\delta$ centered at $x$. We also use $\mathbb{B}_{X}$ and $\overline{\mathbb{B}}_{X}$ to denote $\mathbb{B}_{X}(0,1)$ and $\overline{\mathbb{B}}_{X}(0,1)$, and $\mathbb{S}_{X}$ to denote the unit sphere $\overline{\mathbb{B}}_{X}\setminus\mathbb{B}_{X}$. The distance of $x\in X$ from a set $\Omega\subset X$ is $\mbox{\rm dist\,}(x,\Omega)=\inf\{\|x-y\|\colon y\in\Omega\}$, and $\|\Omega\|_{\_}$ stands for $\mbox{\rm dist\,}(0,\Omega)$, so, in particular, $\|\emptyset\|_{\_}=+\infty$. If $\Omega$ is closed and convex, $\mbox{\rm proj}(x,\Omega)$ is the projection of $x$ onto $\Omega$, and the relative interior and the strong relative interior of $\Omega$ are respectively defined as [11, Definition 6.9]: $\mbox{\rm ri\,}\Omega=\{x\in\Omega\ |\ \mbox{cone}(\Omega-x)=\mbox{span}(\Omega-x)\}$, $\mbox{\rm sri\,}\Omega=\{x\in\Omega\ |\ \mbox{cone}(\Omega-x)=\mbox{cl}\,\mbox{span}(\Omega-x)\}$. Given a bounded linear operator $A$ between two Hilbert spaces, its spectrum, denoted by $\mbox{\rm spec}(A)$, is the set of spectral values $\lambda\in\mathbb{R}$ such that $A-\lambda I$ is not boundedly invertible. We also write $\mbox{\rm spec}^{*}(A):=\mbox{\rm spec}(A)\setminus\{0\}$. The set of singular values of $A$, denoted by $\sigma(A)$, is defined as $\sigma(A):=\sqrt{\mbox{\rm spec}^{*}(AA^{*})}$, and we write $\sigma_{inf}(A):=\inf\sigma(A)$.

Let $\Gamma_{0}(X)$ be the class of convex, lower semi-continuous, and proper functions from $X$ to $\left]-\infty,+\infty\right]$. For $f\in\Gamma_{0}(X)$ and $x\in X$, $\partial f(x)\subset X$ denotes the (Fenchel) subdifferential of $f$ at $x$ [11, Definition 16.1], and $\mbox{\rm dom\,}f$ (resp. $\mbox{\rm dom\,}\partial f$) denotes the effective domain of $f$ (resp. of $\partial f$). Moreover, $f^{*}$ is the Fenchel conjugate of $f$, namely $f^{*}(v)=\sup_{x\in X}\{\langle x,v\rangle-f(x)\}$ for all $v\in X$. We introduce the shorthand notation $\mbox{\rm dom}^{*}f:=\mbox{\rm dom\,}f\setminus\mbox{\rm argmin\,}f$. We also introduce the following notation for the (strict) sublevel sets of $f\in\Gamma_{0}(X)$: for every $r\in\left]-\infty,+\infty\right]$, $[f<r]:=\{x\in X\ |\ f(x)<r\}$.

The following assumption will be made throughout this paper.

Assumption 2.1.

Let $X$ be a Hilbert space, $g\in\Gamma_{0}(X)$ , and $h\colon X\to\mathbb{R}$ be differentiable and convex, with $L$ -Lipschitz continuous gradient for some $L\in\left]0,+\infty\right[$ and set $f=g+h$ .

Splitting methods, such as the forward-backward algorithm, are extremely popular for minimizing an objective function as in Assumption 2.1. To have an implementable procedure, we implicitly assume that the proximal operator of $g$ can be easily computed (see e.g. [28]):

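$$(\forall\lambda>0)(\forall x\in X)\quad\mbox{\rm prox}_{\lambda g}(x)=\underset{u\in X}{\mbox{\rm argmin\,}}\left\{g(u)+\frac{1}{2\lambda}\|u-x\|^{2}\right\}.\tag{2}$$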
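As an illustration of an easily computed proximal operator, consider the standard example $g=|\cdot|$ on $\mathbb{R}$ (chosen here for illustration, not taken from the text), whose proximal operator is the soft-thresholding map with threshold $\lambda$. The Python sketch below compares this closed form with a brute-force minimization of $u\mapsto g(u)+\frac{1}{2\lambda}(u-x)^{2}$ over a fine grid. The function names, the grid and the value $\lambda=0.7$ are assumptions made for the example.

```python
import numpy as np


def soft_threshold(x, t):
    """Closed-form prox of t * |.|, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def prox_by_grid_search(g, x, lam, grid):
    """Brute-force minimizer of g(u) + (u - x)**2 / (2 * lam) over a 1-D grid."""
    values = g(grid) + (grid - x) ** 2 / (2.0 * lam)
    return grid[np.argmin(values)]


lam = 0.7
grid = np.linspace(-5.0, 5.0, 200001)            # grid spacing 5e-5
points = np.array([-3.0, -0.5, 0.0, 0.2, 1.5])

closed_form = soft_threshold(points, lam)
brute_force = np.array([prox_by_grid_search(np.abs, x, lam, grid) for x in points])
```

The two arrays agree up to the grid resolution, which is a quick sanity check before plugging a closed-form proximal operator into an iterative scheme.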

Recalling that Assumption 2.1 is in force, we introduce the Forward-Backward (FB) map for $\lambda\in\left]0,2L^{-1}\right[$:

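$$T_{\lambda}\colon x\in X\longmapsto T_{\lambda}x:=\mbox{\rm prox}_{\lambda g}(x-\lambda\nabla h(x))\in X,\tag{3}$$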

so that the FB algorithm can be simply written as $x_{n+1}=T_{\lambda}x_{n}$ .
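The iteration $x_{n+1}=T_{\lambda}x_{n}$ can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not part of the text: the objective is a small $\ell^{1}$-regularized least squares problem with $g=\mu\|\cdot\|_{1}$ and $h=\frac{1}{2}\|A\cdot-y\|^{2}$, the data `A`, `y`, `mu` are arbitrary, and the helper names are hypothetical. The step size is $\lambda=1/L$ with $L=\|A\|^{2}$.

```python
import numpy as np


def soft_threshold(x, t):
    """Prox of t * ||.||_1, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def forward_backward(grad_h, prox_g, x0, lam, n_iter):
    """Iterate x_{n+1} = prox_{lam g}(x_n - lam * grad_h(x_n)) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(prox_g(x - lam * grad_h(x), lam))
    return xs


# Illustrative data (assumed): g = mu * ||.||_1 and h = 0.5 * ||A x - y||^2.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
y = np.array([1.0, -0.2])
mu = 0.1
L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad h
lam = 1.0 / L                      # any step size in ]0, 2/L[ is admissible


def f(x):
    return mu * np.abs(x).sum() + 0.5 * np.sum((A @ x - y) ** 2)


def grad_h(x):
    return A.T @ (A @ x - y)


def prox_g(z, step):
    return soft_threshold(z, step * mu)


iterates = forward_backward(grad_h, prox_g, np.zeros(3), lam, 200)
values = [f(x) for x in iterates]
```

Keeping `grad_h` and `prox_g` as separate arguments mirrors the splitting structure $f=g+h$: only the gradient of the smooth part and the proximal operator of the nonsmooth part are ever evaluated.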

2.2 The Forward-Backward algorithm: worst-case analysis

The following theorem collects known results about the convergence of the FB algorithm. This is a “worst-case” analysis, in the sense that it holds for every $f\in\Gamma_{0}(X)$ satisfying Assumption 2.1. The main goal of Section 4 is to show how these results can be improved taking into account the geometry of $f$ at its infimum.

Theorem 2.2 (Forward-Backward - convex case).

Suppose that Assumption 2.1 is in force, and let $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm with $\lambda\in]0,2L^{-1}[$ . Then:

i) (Descent property) The sequence $(f(x_{n}))_{n\in\mathbb{N}}$ is decreasing, and converges to $\inf f$.

ii) (Fejér property) For all $\bar{x}\in\mbox{\rm argmin\,}f$, the sequence $\left(\|x_{n}-\bar{x}\|\right)_{n\in\mathbb{N}}$ is decreasing.

iii) (Boundedness) The sequence $(x_{n})_{n\in\mathbb{N}}$ is bounded if and only if $\mbox{\rm argmin\,}f$ is nonempty.

Suppose in addition that $f$ is bounded from below. Then:

4. (Subgradients convergence) The sequence $\left(\|\partial f(x_{n})\|_{\_}\right)_{n\in\mathbb{N}}$ converges decreasingly to zero, with $\|\partial f(x_{n+1})\|_{\_}^{2}=O\left(f(x_{n})-\inf f\right).$

Moreover, if $\mbox{\rm argmin\,}f\neq\emptyset$, we have:

5. (Weak convergence) The sequence $(x_{n})_{n\in\mathbb{N}}$ converges weakly to a minimizer of $f$.

6. (Global rates for function values) For all ${n\in\mathbb{N}}$,

$$f(x_{n})-\inf f\leq C\,\frac{\mbox{\rm dist\,}(x_{0},\mbox{\rm argmin\,}f)^{2}}{2\lambda n},\quad\text{with}\quad C=\begin{cases}1&\text{if }\lambda\leq L^{-1},\\1+2(\lambda L-1)(2-\lambda L)^{-1}&\text{otherwise.}\end{cases}$$

7. (Asymptotic rates for function values) When $n\to+\infty$, $f(x_{n})-\inf f=o\left(n^{-1}\right).$

Theorem 2.2 collects various convergence results on the FB algorithm. Item i appears in [94, Theorem 3.22] (see also [52]). Item ii is a consequence of the nonexpansiveness of the FB map (see (3)) [65, Lemma 3.2]. Item iii, which is a consequence of Opial’s Lemma [87, Lem. 5.2], can be found in [94, Theorem 3.12]. Item 4 follows from Lemma A.9.ii in the Annex. Item 5 is also a consequence of Opial’s Lemma, see [65, Proposition 3.1]. Items 6 and 7 are proved in [32, Theorem 3] (see also [20, Proposition 2] and [12, Theorem 3.1]).
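The global rate for function values in Theorem 2.2, namely $f(x_{n})-\inf f\leq C\,\mbox{dist}(x_{0},\mbox{argmin}f)^{2}/(2\lambda n)$ with $C=1$ if $\lambda\leq L^{-1}$ and $C=1+2(\lambda L-1)(2-\lambda L)^{-1}$ otherwise, can be checked numerically on a toy problem. The following Python sketch is an illustration under assumed data, not an experiment from the paper: it takes $X=\mathbb{R}$, $g=|\cdot|$ and $h(x)=\frac{1}{2}(x-2)^{2}$, so that $L=1$, the unique minimizer is $x=1$ and $\inf f=3/2$. The helper names `fb_step`, `rate_constant` and `gaps_and_bounds` are hypothetical.

```python
import math


def fb_step(x, lam):
    """One forward-backward step for g = |.| and h(x) = 0.5 * (x - 2)**2."""
    z = x - lam * (x - 2.0)                            # forward (gradient) step on h
    return math.copysign(max(abs(z) - lam, 0.0), z)    # backward (proximal) step on g


def rate_constant(lam, L):
    """Constant C appearing in the global rate for function values."""
    if lam <= 1.0 / L:
        return 1.0
    return 1.0 + 2.0 * (lam * L - 1.0) / (2.0 - lam * L)


def f(x):
    return abs(x) + 0.5 * (x - 2.0) ** 2


L, x_star, f_min = 1.0, 1.0, 1.5   # grad h is 1-Lipschitz; the minimizer is x = 1


def gaps_and_bounds(x0, lam, n_iter):
    """Return the pairs (f(x_n) - inf f, theoretical bound) for n = 1, ..., n_iter."""
    x, rows = x0, []
    for n in range(1, n_iter + 1):
        x = fb_step(x, lam)
        bound = rate_constant(lam, L) * (x0 - x_star) ** 2 / (2.0 * lam * n)
        rows.append((f(x) - f_min, bound))
    return rows


short_step = gaps_and_bounds(5.0, 0.5, 30)   # lam <= 1/L, so C = 1
long_step = gaps_and_bounds(5.0, 1.5, 30)    # lam in ]1/L, 2/L[, so C = 3
```

On this example the observed gaps decay geometrically and stay far below the $O(n^{-1})$ bound, which illustrates that the worst-case rate can be very pessimistic for functions with favorable geometry.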

Remark 2.3 (Sharpness of the results in the worst-case).

The convergence results in Theorem 2.2 are sharp, in the following sense. First, the iterates may not converge strongly: see [8, 52] for a counterexample in $\Gamma_{0}(\ell^{2}(\mathbb{N}))$ . Even in finite dimension, no sublinear rates should be expected for the iterates. To see this, apply the proximal algorithm to the function $x\in\mathbb{R}\mapsto f_{p}(x)=|x|^{p}$ , whose unique minimizer is zero. When $p\in\left]2,+\infty\right[$ , there exists a constant $C_{p}>0$ depending on $(\|x_{0}\|,\lambda,p)$ such that (see e.g. the discussion following [83, Proposition 2.5], or Lemma A.1):

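$$(\forall n\geq 1)\quad|x_{n}|\geq C_{p}n^{-1/(p-2)},\quad\text{where}\quad\lim_{p\to+\infty}\frac{1}{p-2}=0.\tag{4}$$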

The estimate (4) also provides a lower bound for the rates on the objective values:

$$f_{p}(x_{n})-\inf f_{p}\geq C_{p}^{p}n^{-p/(p-2)}.$$

The above lower bounds imply that the rate in Theorem 2.2.7 cannot be improved into a rate $O(n^{-\delta})$ for some $\delta>1$, because we can always find a $p$ large enough verifying $p/(p-2)<\delta$. It also means that no polynomial rates can be expected for $\|x_{n}-\bar{x}\|$. This fact was also observed in [32, Theorem 12] on an infinite dimensional counterexample. When $f$ is bounded from below, but has no minimizers, the values $f(x_{n})-\inf f$ go to zero but no rates can be obtained in general. To see this, consider for any $\alpha>0$ the function $f_{\alpha}\in\Gamma_{0}(\mathbb{R})$ defined by

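$$f_{\alpha}\colon\mathbb{R}\to\left]-\infty,+\infty\right],\qquad f_{\alpha}(x)=\begin{cases}|x|^{-\alpha}&\text{if }x<0,\\+\infty&\text{otherwise.}\end{cases}$$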

If $(x_{n})_{n\in\mathbb{N}}$ is obtained by applying the proximal algorithm to this function, then (see Lemma A.1) there exists $C_{\alpha}>0$ such that:

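$$f_{\alpha}(x_{n})-\inf f_{\alpha}\geq C_{\alpha}^{-\alpha}n^{-\alpha/(2+\alpha)},\quad\text{where}\quad\lim_{\alpha\to 0}\frac{\alpha}{2+\alpha}=0\quad\text{and}\quad\lim_{\alpha\to+\infty}\frac{\alpha}{2+\alpha}=1.$$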

Observe that this lower bound on the objective function values implies that the convergence for those functions is slower than the usual $O(n^{-1})$ rate obtained in Theorem 2.2.6. It also shows that no polynomial rates can be proven for the values when $\mbox{\rm argmin\,}f=\emptyset$.
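The lower bound on the iterates for $f_{p}(x)=|x|^{p}$ can be observed numerically. The Python sketch below is an illustration with assumed parameters ($p=4$, $\lambda=1$, $x_{0}=1$) and a hypothetical helper `prox_quartic`. For this choice the proximal step solves $u+4\lambda u^{3}=x$, and a direct computation gives $u^{-2}-x^{-2}\leq 8\lambda$, hence $x_{n}\geq(1+8n)^{-1/2}\geq\frac{1}{3}n^{-1/2}$ for all $n\geq 1$. This is the announced rate $n^{-1/(p-2)}$ with the explicit constant $1/3$, which is specific to this example.

```python
def prox_quartic(x, lam, n_bisect=200):
    """Prox of lam * |.|**4 at x >= 0: the root u in [0, x] of u + 4 * lam * u**3 = x."""
    lo, hi = 0.0, x
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mid + 4.0 * lam * mid ** 3 < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam, x = 1.0, 1.0
scaled_iterates = []   # x_n * n**(1/(p-2)) with p = 4
scaled_values = []     # f_4(x_n) * n**(p/(p-2)) with p = 4
for n in range(1, 2001):
    x = prox_quartic(x, lam)
    scaled_iterates.append(x * n ** 0.5)
    scaled_values.append(x ** 4 * n ** 2)
```

Both rescaled sequences stay bounded away from zero, so neither the iterates nor the values can decay faster than the stated polynomial rates on this example.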

3 Identifying the geometry of a function

3.1 Definitions

In this section we introduce the main geometrical concepts that will be used throughout the paper to derive precise rates for the FB method. Roughly speaking, these notions characterize functions which behave like (1) on an arbitrary set $\Omega\subset X$ .

Definition 3.1.

Let $p\in[1,+\infty[$ , let $f\in\Gamma_{0}(X)$ with $\mbox{\rm argmin\,}f\neq\emptyset$ , and $\Omega\subset X$ . We say that:

i)

$f$ is $p$ -conditioned on $\Omega$ if there exists a constant $\gamma_{f,\Omega}>0$ such that:

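$$\forall x\in\Omega,\quad\frac{\gamma_{f,\Omega}}{p}\,\mbox{\rm dist\,}^{p}(x,\mbox{\rm argmin\,}f)\leq f(x)-\inf f.$$

ii)

$\partial f$ is $p$ -metrically subregular on $\Omega$ if there exists a constant $\gamma_{\partial f,\Omega}>0$ such that:

$$\forall x\in\Omega,\quad\gamma_{\partial f,\Omega}\,\mbox{\rm dist\,}^{p-1}(x,\mbox{\rm argmin\,}f)\leq\|\partial f(x)\|_{-}.$$

iii)

$f$ is $p$ -Łojasiewicz on $\Omega$ if there exists a constant $c_{f,\Omega}>0$ such that:

$$\forall x\in\Omega,\quad(f(x)-\inf f)^{1-\frac{1}{p}}\leq c_{f,\Omega}\|\partial f(x)\|_{-}.$$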

We will refer to these notions as global if $\Omega=X$ , and as local if $\Omega=\mathbb{B}_{X}(\bar{x};\delta)\cap[f<r]$ for some $\bar{x}\in\mbox{\rm argmin\,}f,$ and $\delta\in]0,+\infty]$ , $r\in]\inf f,+\infty]$ .

The notion of conditioning, introduced in [98, 105], is a common tool in the optimization and regularization literature [6, 86, 66, 101, 17]. It is also called the growth condition [86], and it is strongly related to the notion of Tikhonov wellposedness [38]. The $p$ -metric subregularity coincides with metric subregularity of the subdifferential at the origin, and it is less used, generally defined for $p=1$ or $2$ with $\Omega$ equal to a neighborhood of a specific minimizer [36, 67]. It is also called upper Lipschitz continuity at zero of $\partial f^{-1}$ in [29], or inverse calmness [37]. The Łojasiewicz property goes back to [79], and was initially designed as a tool to guarantee the convergence of trajectories for the gradient flow of analytic functions, before its recent use in convex and nonconvex optimization. It is generally presented with a constant $\theta\in[0,1]$ which is equal, in our notation, to $1-1/p$ [79, 1, 14, 17], or $1/p$ [83, 53, 46]. In the remark below we explain the main difference between our definition and the one usually considered in the literature.

Remark 3.2.

There is a subtle but crucial difference in the terminology used in Definition 3.1 with respect to the one commonly used for the Łojasiewicz property. It is usually said that a function has the Łojasiewicz property at $\bar{x}$ if there exist $\delta>0$ , $c>0$ , and $r>\inf f$ such that $f(x)-f(\bar{x})\leq c\|\partial f(x)\|_{-}$ holds on $\Omega=\mathbb{B}_{X}(\bar{x};\delta)\cap[f<r]$ . If the latter property holds for every $\bar{x}\in S\subset X$ , the function is said to have the Łojasiewicz property on $S$ . This is a different requirement with respect to the one in Definition 3.1. Indeed, we require the inequality to hold uniformly on $\Omega$ , while the above definition must hold locally around every point of interest in a given set, and typically only allows for asymptotic convergence rates (see Corollary 4.12). This change of viewpoint is motivated by the fact that for many convex functions, we have more than just local information about the geometry (see Sections 3.3 and 4). More importantly, it is actually necessary for the analysis of the problems discussed in Section 5, which motivated this paper. Beyond that, it also allows one to understand in a unified framework both global (Corollary 4.11) and local (Corollary 4.12) convergence rates.

The notions introduced in Definition 3.1 are closely related to each other. Indeed, for convex functions, $p$ -conditioning implies metric subregularity, which implies the Łojasiewicz property. Under some additional assumptions, it is possible to show that the reverse implications hold. For instance, metric subregularity implies conditioning when $\Omega=\mbox{\rm argmin\,}f+\delta\mathbb{B}_{X}$ , $\delta>0$ [102, Theorem 4.3]. Similar results can also be found in [2, 7, 41, 39], and [29, Theorem 5.2] (for $\Omega=X$ ). Also, it is shown in [17, Theorem 5] that the local Łojasiewicz property implies local conditioning. The next result, proved in Annex A.2, extends the mentioned ones, and states the equivalence between conditioning, metric subregularity, and Łojasiewicz property on $\partial f$ -invariant sets (see Definition A.2 in Annex A.2).

Proposition 3.3.

Let $p\in[1,+\infty[$ , let $\Omega\subset X$ , and let $f\in\Gamma_{0}(X)$ be such that $\mbox{\rm argmin\,}f\neq\emptyset$ . Consider the following properties:

i)

$f$ is $p$ -conditioned on $\Omega$ ,

ii)

$\partial f$ is $p$ -metrically subregular on $\Omega$ ,

iii)

$f$ is $p$ -Łojasiewicz on $\Omega$ .

Then i $\implies$ ii $\implies$ iii. One can respectively take $\gamma_{\partial f,\Omega}=\gamma_{f,\Omega}/p$ and $c_{f,\Omega}=\gamma_{\partial f,\Omega}^{-1/p}$ . Assuming in addition that $\Omega$ is $\partial f$ -invariant, we also have iii $\implies$ i with $\gamma_{f,\Omega}=c_{f,\Omega}^{-p}p^{1-p}$ .
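The chain of constants in Proposition 3.3 can be sanity-checked on the model function $f(x)=|x|^{p}$ on $\mathbb{R}$ . The sketch below is an illustration only, and it assumes the following normalization of the three inequalities: $(\gamma_{f,\Omega}/p)\,\mbox{\rm dist\,}^{p}\leq f-\inf f$ , $\gamma_{\partial f,\Omega}\,\mbox{\rm dist\,}^{p-1}\leq\|\partial f\|_{-}$ and $(f-\inf f)^{1-1/p}\leq c_{f,\Omega}\|\partial f\|_{-}$ . The function name `check_proposition_constants` and the sampling grid are hypothetical.

```python
def check_proposition_constants(p, points, tol=1e-9):
    """Check the constants gamma_df = gamma_f/p and c_f = gamma_df^(-1/p)
    on f(x) = |x|^p, whose unique minimizer is 0 (so dist(x, argmin f) = |x|).

    `points` must not contain 0; for x != 0 the minimal-norm subgradient
    has absolute value p*|x|^(p-1).
    """
    gamma_f = float(p)            # |x|^p = (gamma_f/p)*|x|^p holds with gamma_f = p
    gamma_df = gamma_f / p        # constant given by i) => ii)
    c_f = gamma_df ** (-1.0 / p)  # constant given by ii) => iii)
    for x in points:
        dist = abs(x)
        gap = abs(x) ** p                  # f(x) - inf f
        slope = p * abs(x) ** (p - 1)      # ||subdifferential||_-
        conditioned = (gamma_f / p) * dist ** p <= gap * (1 + tol)
        subregular = gamma_df * dist ** (p - 1) <= slope * (1 + tol)
        lojasiewicz = gap ** (1 - 1 / p) <= c_f * slope * (1 + tol)
        if not (conditioned and subregular and lojasiewicz):
            return False
    return True


points = [k / 10 for k in range(-50, 51) if k != 0]
results = {p: check_proposition_constants(p, points) for p in (1, 1.5, 2, 4)}
```

The constants produced by the implications are valid but need not be optimal: for this function the metric subregularity inequality actually holds with the larger constant $p$ .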

The next two propositions show that these geometric notions are stronger when $p$ is smaller, and are meaningful only on sets containing minimizers (their proofs follow directly from Definition 3.1 and are left to the reader).

Proposition 3.4.

Let $f\in\Gamma_{0}(X)$ be such that $\mbox{\rm argmin\,}f\neq\emptyset$ , $\Omega\subset X$ , and $p^{\prime}\geq p\geq 1$ .

i)

If $f$ is $p$ -conditioned (resp. $\partial f$ is $p$ -metrically subregular) on $\Omega$ , then $f$ is $p^{\prime}$ -conditioned (resp. $\partial f$ is $p^{\prime}$ -metrically subregular) on $\Omega\cap\delta\mathbb{B}_{X}$ for any $\delta\in]0,+\infty[$ .

ii)

If $f$ is $p$ -Łojasiewicz on $\Omega$ , then $f$ is $p^{\prime}$ -Łojasiewicz on $\Omega\cap[f<r]$ for any $r>\inf f$ .

Proposition 3.5.

Let $f\in\Gamma_{0}(X)$ be such that $\mbox{\rm argmin\,}f\neq\emptyset$ . If $\Omega\subset X$ is a weakly compact set for which $\Omega\cap\mbox{\rm argmin\,}f=\emptyset$ , then $f$ is $p$ -conditioned on $\Omega$ for any $p\in[1,+\infty[$ .

3.2 Examples

In this section, we collect some relevant examples.

Example 3.6 (Uniformly convex functions).

Suppose that $f\in\Gamma_{0}(X)$ is uniformly convex of order $p\in[2,+\infty[$ [11, Definition 10.7]. Then, there exists $\gamma>0$ such that [101, Corollary 3.5.11.iv]:

[TABLE]

Such a function is globally $p$ -conditioned, with $\gamma_{f,X}=\gamma$ , and globally $p$ -Łojasiewicz, with $c_{f,X}=(1-1/p)^{1-1/p}\gamma^{-1/p}$ (see Lemma A.4). In the strongly convex case, when $p=2$ , the $2$ -Łojasiewicz inequality holds with the constant $c_{f,X}=1/\sqrt{2\gamma}$ , which is sharp. An example of a uniformly convex function of order $p$ is $x\mapsto\|x\|^{p}$ [11, Example 10.16].
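The sharpness of the constant in the strongly convex case can be checked by hand on a quadratic. As an illustration (the choice of function and the normalization $(f(x)-\inf f)^{1-1/p}\leq c_{f,X}\|\partial f(x)\|_{-}$ of the Łojasiewicz inequality are assumptions made here), take $f(x)=(\gamma/2)\|x\|^{2}$ , for which $\inf f=0$ and $\nabla f(x)=\gamma x$ . Then, for every $x\in X$ ,

$$(f(x)-\inf f)^{1/2}=\sqrt{\frac{\gamma}{2}}\,\|x\|\qquad\text{and}\qquad\frac{1}{\sqrt{2\gamma}}\|\nabla f(x)\|=\frac{\gamma\|x\|}{\sqrt{2\gamma}}=\sqrt{\frac{\gamma}{2}}\,\|x\|.$$

Both sides coincide, so for this function the $2$ -Łojasiewicz inequality holds with equality and no constant smaller than $1/\sqrt{2\gamma}$ can work. This value also agrees with the general formula $(1-1/p)^{1-1/p}\gamma^{-1/p}$ evaluated at $p=2$ .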

Example 3.7 (Least squares).

Let $A:X\rightarrow Y$ be a nonzero bounded linear operator between Hilbert spaces, and $f(x)=(1/2)\|Ax-y\|^{2}$ , for some $y\in Y$ . Then, the conditioning, metric subregularity, and Łojasiewicz properties, with $p=2$ and $\Omega=X$ , are equivalent to verifying on $\operatorname{Ker}A^{\perp}$ , respectively:

[TABLE]

If $\sigma_{\inf}(A^{*}A)>0$ holds, one can see that the above inequalities hold with

[TABLE]

meaning in particular that $f$ is globally $2$ -conditioned. Since $\sigma_{\inf}(A^{*}A)>0$ is equivalent to $R(A^{*}A)$ being closed (see Proposition 5.2), it is in particular always true when $Y$ has finite dimension. If instead $\sigma_{\inf}(A^{*}A)=0$ holds, [53, Theorem 2.1] shows that $f$ cannot satisfy any local $p$ -Łojasiewicz property, for any $p\geq 1$ . This is for instance the case for infinite dimensional compact operators. Nevertheless, we will show in Section 5 that the least squares always satisfies a $p$ -Łojasiewicz property on the so-called regularity sets, for any $p>2$ .
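To see why compact operators are problematic, one can look at finite-dimensional truncations of a diagonal operator. The sketch below is a hypothetical illustration: it takes $A=\mathrm{diag}(1,1/2,\dots,1/n)$ and $y=0$ , so that $\mbox{\rm argmin\,}f=\{0\}$ , and it assumes the $2$ -conditioning inequality is normalized as $(\gamma/2)\,\mbox{\rm dist\,}^{2}(x,\mbox{\rm argmin\,}f)\leq f(x)-\inf f$ . The names `conditioning_ratio` and `worst_ratio_on_basis` are not from the paper.

```python
def conditioning_ratio(singular_values, x):
    """Return 2*f(x)/||x||^2 for f(x) = 0.5*||A x||^2 with A diagonal.

    With y = 0 and A injective, argmin f = {0}, so ||x|| is the distance
    to the set of minimizers and the ratio bounds the constant gamma.
    """
    f = 0.5 * sum((s * xi) ** 2 for s, xi in zip(singular_values, x))
    dist_sq = sum(xi * xi for xi in x)
    return 2.0 * f / dist_sq


def worst_ratio_on_basis(n):
    """Smallest ratio over the canonical basis for A = diag(1, 1/2, ..., 1/n)."""
    singular_values = [1.0 / k for k in range(1, n + 1)]
    ratios = []
    for k in range(n):
        e = [0.0] * n
        e[k] = 1.0
        ratios.append(conditioning_ratio(singular_values, e))
    return min(ratios)


gamma_10 = worst_ratio_on_basis(10)    # smallest eigenvalue of A*A, i.e. 1/10^2
gamma_100 = worst_ratio_on_basis(100)  # 1/100^2
```

The largest admissible constant is the smallest eigenvalue $1/n^{2}$ of $A^{*}A$ , which vanishes as $n$ grows. This matches the statement that $\sigma_{\inf}(A^{*}A)=0$ for infinite dimensional compact operators.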

Example 3.8 (Convex piecewise polynomials).

A convex continuous function $f:\mathbb{R}^{N}\rightarrow\mathbb{R}$ is said to be convex piecewise polynomial if $\mathbb{R}^{N}$ can be partitioned in a finite number of polyhedra $P_{1},...,P_{s}$ such that for all $i\in\{1,...,s\}$ , the restriction of $f$ to $P_{i}$ is a convex polynomial, of degree $d_{i}\in\mathbb{N}$ . The degree of $f$ is defined as $\deg(f):=\max\{d_{i}\ |\ i\in\{1,...,s\}\}$ . Assume $\deg(f)>0$ . Convex piecewise polynomial functions are conditioned [71, Corollary 3.6]. More precisely, for all $r>\inf f$ , $f$ is $p$ -conditioned on its sublevel set $\Omega=[f<r]$ , with $p=1+(\deg(f)-1)^{N}.$ In general, the constant $\gamma_{f,\Omega}$ (which depends on $r$ ) cannot be explicitly computed. This result implies that polyhedral functions ( $\deg(f)=1$ ) are $1$ -conditioned (in agreement with [23, Corollary 3.6]), and that convex piecewise quadratic functions ( $\deg(f)=2$ ) are $2$ -conditioned (in agreement with [70, Theorem 2.7]). More generally, convex semi-algebraic functions are locally $p$ -conditioned [15].

Example 3.9 (L1 regularized least squares).

Let $f(x)=\alpha\|x\|_{1}+(1/2)\|Ax-y\|^{2}$ , for some linear operator $A:\mathbb{R}^{N}\rightarrow\mathbb{R}^{M}$ , $y\in\mathbb{R}^{M}$ and $\alpha>0$ . As observed in [17, Section 3.2.1], $f$ is convex piecewise polynomial of degree $2$ , thus it is $2$ -conditioned on every nonempty level set $\Omega=[f<r]$ . The computation of the conditioning constant $\gamma_{f,\Omega}$ is rather difficult. In [17, Lemma 10] an estimate of $\gamma_{f,\Omega}$ is provided, by means of Hoffman’s bound [58]. Extensions of this result to the infinite dimensional setting can be found in [49].

Example 3.10 (Regularized problems).

Let $X$ be a Euclidean space, $f(x):=g(x)+h(Ax)$ , where $A:X\rightarrow\mathbb{R}^{M}$ is a linear operator, $g\in\Gamma_{0}(X)$ , and $h\in\Gamma_{0}(\mathbb{R}^{M})$ is a strongly convex $C^{1,1}$ function, and $\mbox{\rm argmin\,}f\neq\emptyset$ . Then $f$ is $2$ -conditioned on any level set $\Omega=[f<r]$ , for $r>\inf f$ , if

i)

$g(x)=\|x\|_{p}$ with $p\in\left]1,2\right]$ , (see [104, Corollary 2]),

ii)

$g(x)=\|x\|_{p}^{p}$ with $p\in\left]1,2\right]$ , (use [40, Theorem 4.2]; the details are left to the reader as an exercise, and can be checked in the Appendix),

iii)

$g(x)=\|x\|_{*}$ is the nuclear norm of the matrix $x\in X$ , provided the following qualification condition holds (see [103]): $\exists\bar{x}\in\mbox{\rm argmin\,}f$ such that $-A^{*}\nabla h(A\bar{x})\in\mbox{\rm ri\,}\partial\|\cdot\|_{*}(\bar{x})$ . (Footnote 2: We mention that this result was originally announced in [60, Theorem 3.1] without the qualification condition, but then corrected in [103, Proposition 12 & following remarks], in which the authors show that such a condition is necessary.)

iv)

$g$ is polyhedral (see [103, Proposition 6]).

Note that in [103, 104], the authors do not prove directly that the functions are $2$ -conditioned, but that they verify the so-called Luo-Tseng error bound, that is known to be equivalent to $2$ -conditioning on sublevel sets [40, Corollary 3.6]. Note also that in items ii-iv), the strong convexity and $C^{1,1}$ assumptions on $h$ can be weakened (see [103] and [40, Theorem 4.2]).

Example 3.11 (Distance to an intersection).

Let $C,D$ be two closed convex sets in $X$ such that $C\cap D\neq\varnothing$ , and for which the intersection is sufficiently regular, i.e. $0\in\mbox{\rm sri\,}(C-D)$ . Let $f(\cdot)=\max\{\mbox{\rm dist\,}(\cdot,C),\mbox{\rm dist\,}(\cdot,D)\}$ . Clearly, $f\in\Gamma_{0}(X)$ , and $\mbox{\rm argmin\,}f=C\cap D$ . Then $f$ is $1$ -conditioned on bounded sets [10, Theorem 4.3]. Let $p\in\left[1,+\infty\right[$ . From $\|\cdot\|_{\infty}\leq\|\cdot\|_{p}$ , it follows that the function $x\mapsto\mbox{\rm dist\,}(x,C)^{p}+\mbox{\rm dist\,}(x,D)^{p}$ is $p$ -conditioned on bounded sets. The regularity condition $0\in\mbox{\rm sri\,}(C-D)$ is not necessary if the two sets are polyhedral, as proved by Hoffman [58].
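A simple planar instance makes the $1$ -conditioning constant visible. In the sketch below, which is an illustration under assumptions not made in the paper, $C$ and $D$ are two lines through the origin of $\mathbb{R}^{2}$ forming an angle $\theta\in]0,\pi/2]$ , so that $C\cap D=\{0\}$ and $C-D=\mathbb{R}^{2}$ . Since $f$ and the distance to $C\cap D$ are both positively homogeneous, it is enough to sample the unit circle. The function names are hypothetical.

```python
import math


def dist_to_line(x, theta):
    """Distance from x in R^2 to the line through 0 with direction angle theta."""
    return abs(-math.sin(theta) * x[0] + math.cos(theta) * x[1])


def conditioning_constant(theta, n_samples=3600):
    """Estimate inf f(x)/dist(x, C ∩ D) for f = max(dist(., C), dist(., D)).

    C is the horizontal axis and D the line with direction angle theta.
    On the unit circle, dist(x, C ∩ D) = ||x|| = 1, so the infimum of the
    ratio is the smallest sampled value of f.
    """
    best = float("inf")
    for k in range(n_samples):
        phi = 2.0 * math.pi * k / n_samples
        x = (math.cos(phi), math.sin(phi))
        f = max(dist_to_line(x, 0.0), dist_to_line(x, theta))
        best = min(best, f)
    return best


gamma_est = conditioning_constant(math.pi / 3)
```

For two lines at angle $\theta$ , the worst direction is the bisector, which gives the constant $\sin(\theta/2)$ . The estimate above is therefore expected to be close to $0.5$ for $\theta=\pi/3$ , and it degrades as the two lines become nearly parallel.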

Example 3.12 (Minimum of Łojasiewicz functions).

If $f=\min_{i=1,\dots,m}f_{i}$ , with $f_{i}\in\Gamma_{0}(\mathbb{R}^{N})$ being continuous on its domain, and locally $p$ -Łojasiewicz at $\bar{x}\in{\rm{argmin}}~{}f$ , then $f$ is locally $p$ -Łojasiewicz at $\bar{x}$ [74, Theorem 3.1]. It is important to notice that this result does not need the $f_{i}$ ’s to be convex.

The next section presents new sum rules for conditioned functions.

3.3 A sum rule for $p$ -conditioned functions

Since verifying conditioning directly with the definition can be difficult, it is very useful to establish which basic operations preserve conditioning. In this section we present two new sum rules for conditioned functions in a setting where $f=g+h\circ A$ , where $g$ and $h$ are convex and $A$ is a bounded linear operator. Theorem 3.15 states that if $g$ is strictly convex and $p$ -conditioned up to linear perturbations then $f$ is also $p$ -conditioned. Theorem 3.17 provides an alternative where the assumption of strict convexity of $g$ is replaced by a stable conditioning assumption on $h$ , which we formalise in the next definition, inspired by the terminology used in [88, 41, 40].

Definition 3.13.

Let $f\in\Gamma_{0}(X)$ , $\Omega\subset X$ , and $p\in[1,+\infty[$ . We say that $f$ is $p$ -tilt-conditioned on $\Omega$ if, for every $u\in X$ , the tilted function $f+\langle u,\cdot\rangle$ either has no minimizers, or is $p$ -conditioned on $\Omega$ .

Note that a similar notion is already present in the literature: if $f$ is $p$ -tilt-conditioned (in our sense) on every compact set, then it is firmly convex in the sense of [40, Definition 4.1].

Example 3.14 (Tilt-conditioned functions).

Many conditioned functions relevant for inverse problems are also tilt-conditioned:

•

The $1$ -norm $\|\cdot\|_{1}$ , and more generally every polyhedral function, are $1$ -tilt-conditioned on Euclidean spaces [23, Cor. 3.6].

•

Convex piecewise polynomials of degree 2 are $2$ -tilt-conditioned on their sublevel sets. This is due to Example 3.8 and the fact that this class of functions is stable up to linear perturbations.

•

For the same reasons as above, $p$ -uniformly convex functions are $p$ -tilt-conditioned on $X$ , for $p\geq 2$ .

•

If $KL(x_{1};x_{2})$ denotes the Kullback-Leibler divergence between two vectors in $]0,+\infty[^{N}$ , then the divergence $KL(x_{1};\cdot)$ is $2$ -tilt-conditioned on bounded sets. This result is new, and its proof can be found in Lemma A.6.

•

The nuclear norm is $2$ -tilt-conditioned on bounded sets [103, Proposition 11].

•

See [40, Section 4] for more examples and properties of $2$ -tilt-conditioned functions on compact sets.
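The one-dimensional absolute value gives a worked instance of tilt-conditioning. As an illustration (assuming the $1$ -conditioning inequality reads $\gamma\,\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f_{u})\leq f_{u}(x)-\inf f_{u}$ ), consider the tilted function $f_{u}(x)=|x|+ux$ on $\mathbb{R}$ , for $u\in\mathbb{R}$ . Three cases occur:

$$\begin{array}{lll}|u|<1:&\mbox{\rm argmin\,}f_{u}=\{0\},&f_{u}(x)\geq(1-|u|)\,|x|=(1-|u|)\,\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f_{u}),\\ u=1:&\mbox{\rm argmin\,}f_{u}=\left]-\infty,0\right],&f_{u}(x)=2\max\{x,0\}=2\,\mbox{\rm dist\,}(x,\mbox{\rm argmin\,}f_{u}),\\ |u|>1:&\mbox{\rm argmin\,}f_{u}=\emptyset.&\end{array}$$

The case $u=-1$ is symmetric to $u=1$ . In every case where minimizers exist, $f_{u}$ is $1$ -conditioned on $\mathbb{R}$ , with a constant that depends on the tilt $u$ and degenerates as $|u|\uparrow 1$ .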

In this first theorem, we show that if a strictly convex function remains conditioned up to linear perturbations, then it is also stable up to convex perturbations:

Theorem 3.15 (Sum rule involving a strictly convex tilt-conditioned function).

Let $f=g+h\circ A$ , where $g\in\Gamma_{0}(X)$ , let $Y$ be a Hilbert space, $h\in\Gamma_{0}(Y)$ and $A:X\rightarrow Y$ a bounded linear operator. Suppose that ${\rm{argmin}}~{}f\neq\emptyset$ . Let $\Omega\subset X$ , and assume that:

a)

the nondegeneracy condition $0\in\mbox{\rm sri\,}\left(\mbox{\rm dom\,}h-A(\mbox{\rm dom\,}g)\right)$ holds,

b)

$g$ is strictly convex on its domain,

c)

$g$ is $p$ -tilt-conditioned on $\Omega$ for some $p\in\left[1,+\infty\right[$ .

Then, $f$ is $p$ -conditioned on $\Omega$ . We have $\gamma_{f,\Omega}=\gamma_{\tilde{g},\Omega}$ , where $\tilde{g}=g+\langle\cdot,u\rangle$ , for some $u\in X$ .

Proof.

Let $\bar{x}\in{\rm{argmin}}~{}f$ ; Fermat’s rule implies that $0\in\partial f(\bar{x})$ . Using assumption a) with [11, Thm. 16.47], we can write $0\in\partial g(\bar{x})+A^{*}\partial h(A\bar{x})$ . Let $\bar{v}\in-\partial h(A\bar{x})$ be such that $0\in\partial g(\bar{x})-A^{*}\bar{v}$ , i.e., $\bar{x}\in\partial g^{*}(A^{*}\bar{v})$ . Let $x\in\Omega\cap\mbox{\rm dom\,}f$ , and set $\tilde{g}=g-\langle A^{*}\bar{v},\cdot\rangle$ . Using the fact that linear forms are continuous, we can use again Fermat’s rule together with a sum rule [87, Thm. 3.30] to write

[TABLE]

meaning that $\mbox{\rm argmin\,}\tilde{g}=\partial g^{*}(A^{*}\bar{v})\neq\emptyset$ . It follows then from assumption c) that $\tilde{g}$ is $p$ -conditioned on $\Omega$ . Moreover, because $g$ is strictly convex, we have $\partial g^{*}(A^{*}\bar{v})=\{\bar{x}\}$ [11, Prop. 16.37.i], and ${\rm{argmin}}~{}f=\{\bar{x}\}$ [11, Cor 11.9]. These facts mean that ${\rm{argmin}}~{}\tilde{g}={\rm{argmin}}~{}f$ . We can now write the conditioning of $\tilde{g}$ evaluated at $x$ , together with the convexity of $h$ (remember that $-\bar{v}\in\partial h(A\bar{x})$ ):

[TABLE]

Observe that we are allowed to use the conditioning of $\tilde{g}$ at $x$ , because $x\in\mbox{\rm dom\,}f\subset\mbox{\rm dom\,}g=\mbox{\rm dom\,}\tilde{g}$ . Summing these two last inequalities gives

[TABLE]

with $\gamma_{f,\Omega}:=\gamma_{\tilde{g},\Omega}$ , which concludes the proof. ∎

Remark 3.16 (On the nondegeneracy condition a) of Theorem 3.15).

This condition is very mild, and is satisfied under any of the following sufficient conditions (we note $\bar{x}$ a minimizer of $f$ ):

•

$h$ is continuous at $A\bar{x}$ (see [11, Prop. 16.27 & Prop. 6.19.vii]).

•

$h$ has a full domain.

•

$\dim Y<+\infty$ , $\bar{x}\in\mbox{\rm qri\,}\mbox{\rm dom\,}g$ and $A\bar{x}\in\mbox{\rm ri\,}\mbox{\rm dom\,}h$ (see [11, Def. 6.9 & Prop. 6.19.ix]). These inclusions hold for instance if $g$ and $h$ have open domains.

Theorem 3.15 is useful, but proves to be impractical when $g$ is not strictly convex, which typically happens when $g$ corresponds to some low-complexity-inducing regularizer used in inverse problems ( $\ell^{1}$ norm, group lasso, nuclear norm, total variation, etc). The next theorem provides a setting for those functions; in exchange for the strict convexity of $g$ , we will require $h$ to also be tilt-conditioned, and some strong qualification condition to hold.

Theorem 3.17 (Sum rule for tilt-conditioned functions).

Let $f=g+h\circ A$ , where $g\in\Gamma_{0}(X)$ , $h\in\Gamma_{0}(Y)$ and $A:X\rightarrow Y$ is a bounded linear operator with closed range. Suppose that ${\rm{argmin}}~{}f\neq\emptyset$ , and let $\Omega\subset X$ . If $\psi\in\Gamma_{0}(Y)$ denotes the corresponding Fenchel-Rockafellar dual problem $\psi(v)=g^{*}(A^{*}v)+h^{*}(-v)$ , and

a)

the nondegeneracy condition $0\in\mbox{\rm sri\,}\left(\mbox{\rm dom\,}h-A(\mbox{\rm dom\,}g)\right)$ holds,

then ${\rm{argmin}}~{}\psi\neq\emptyset$ . Moreover, if

b)

there is $\bar{v}\in{\rm{argmin}}~{}\psi$ for which the following qualification conditions are satisfied:

[TABLE]

c)

$g$ is $p_{1}$ -tilt-conditioned on $\Omega$ , and $h$ is $p_{2}$ -tilt-conditioned on $A\Omega$ for some $p_{1},p_{2}\geq 1$ ,

then $f$ is $p$ -conditioned on every bounded subset of $\Omega$ , with $p:=\max\{p_{1},p_{2}\}$ .

Proof.

This proof starts as the proof of Theorem 3.15 does: we use the nondegeneracy assumption a) with [11, Thm. 16.47] to get some $\bar{x}\in{\rm{argmin}}~{}f$ and $\bar{v}\in-\partial h(A\bar{x})$ such that $\bar{x}\in\partial g^{*}(A^{*}\bar{v})$ . So the condition [11, Thm. 19.1.iii] is verified, meaning that strong duality holds (in the sense that $\inf f=-\inf\psi$ ). This allows us to use [11, Cor. 19.2] to obtain

[TABLE]

We can use again [11, Cor. 19.2], this time on the dual problem, to also obtain

[TABLE]

The above equality allows us to assume, without loss of generality, that $\bar{v}$ is the element of ${\rm{argmin}}~{}\psi$ satisfying b). So, it remains to prove that, for all $\delta>0$ , there exists $\gamma>0$ such that:

[TABLE]

Fix $\delta>0$ , let $x\in\Omega_{\delta}:=\Omega\cap\delta\mathbb{B}_{X}\cap\mbox{\rm dom\,}f$ , and set $\tilde{g}=g-\langle A^{*}\bar{v},\cdot\rangle$ and $\tilde{h}=h+\langle\bar{v},\cdot\rangle$ . Setting $p=\max\{p_{1},p_{2}\}$ , we see from assumption c) and Proposition 3.4 that $\tilde{g}$ and $\tilde{h}$ are $p$ -conditioned on the bounded sets $\Omega_{\delta}$ and $A\Omega_{\delta}$ , respectively. Using the same arguments as in (8), we obtain that $\mbox{\rm argmin\,}\tilde{g}=\partial g^{*}(A^{*}\bar{v})\ni\bar{x}$ and $\mbox{\rm argmin\,}\tilde{h}=\partial h^{*}(-\bar{v})\ni A\bar{x}$ . Therefore, the conditioning of $\tilde{g}$ (resp. $\tilde{h}$ ) evaluated at $x\in\mbox{\rm dom\,}f\subset\mbox{\rm dom\,}g=\mbox{\rm dom\,}\tilde{g}$ (resp. $Ax\in A\,\mbox{\rm dom\,}f\subset\mbox{\rm dom\,}h=\mbox{\rm dom\,}\tilde{h}$ ) writes as

[TABLE]

Summing these two last inequalities gives,

[TABLE]

with $C_{1}=p^{-1}\min\{\gamma_{\tilde{g},\Omega_{\delta}},\gamma_{\tilde{h},A\Omega_{\delta}}\}$ . Since $\|\cdot\|_{\infty}\leq\|\cdot\|_{p}$ on $\mathbb{R}^{2}$ , we deduce that

[TABLE]

It remains to lower bound the right hand side by the distance to $\mbox{\rm argmin\,}f$ . By Example 3.11, thanks to the qualification condition (9) and the fact that $\Omega_{\delta}$ is bounded, we derive from (11) that there exists $C_{2}>0$ independent of $x$ such that

[TABLE]

Define $y:=\mbox{\rm proj}(Ax,R(A)\cap\partial h^{*}(-\bar{v}))$ , which is well defined since we assumed $R(A)$ to be closed. Let $\phi_{y}\in\Gamma_{0}(X)$ be defined by $\phi_{y}(u):=(1/2)\|Au-y\|^{2}$ . Since $y\in R(A)$ , necessarily $\inf\phi_{y}=0$ , so we deduce from Example 3.7 that

[TABLE]

On the one hand, we have $\mbox{\rm argmin\,}\phi_{y}=A^{-1}y\subset A^{-1}\partial h^{*}(-\bar{v})$ . On the other hand, the definition of $y$ implies $\phi_{y}(x)=(1/2)\mbox{\rm dist\,}^{2}(Ax,R(A)\cap\partial h^{*}(-\bar{v}))$ . Thus, it follows from (15) that

[TABLE]

Since this is true for any $x\in\Omega_{\delta}$ , we can combine it with (14) to get for all $x\in\Omega_{\delta}$

[TABLE]

with $C_{3}=C_{2}\max\{1,\sigma_{\inf}(A)^{-1}\}$ . To end the proof, use the qualification condition (10) with Example 3.11 again to get some $C_{4}>0$ such that for all $x\in\Omega_{\delta}$ ,

[TABLE]

The above inequality, combined with (16) and (12), concludes the proof. ∎

Remark 3.18 (On the qualification conditions).

When $g$ is not strictly convex, the conclusion of Theorem 3.17 may not hold if the qualification conditions (9) and (10) are removed, as proved in [103, Section 4.4.4] with $g=\|\cdot\|_{*}$ . Let us give some sufficient conditions for (9) and (10) to hold:

•

If $X$ and $Y$ have finite dimension, b) is equivalent to

[TABLE]

To prove this, use [11, Cor. 6.15] and [92, Thm. 6.7] to see that the above condition is equivalent to (9), which implies (10). This condition is for instance satisfied if $0\in\mbox{\rm ri\,}\partial\psi(\bar{v})$ and $0\in\mbox{\rm ri\,}\mbox{\rm dom\,}g^{*}+A^{*}(\mbox{\rm ri\,}\mbox{\rm dom\,}h^{*})$ (see [11, Thm. 16.47]). Those are the two conditions needed in [40, Theorem 4.2].

•

If $X$ and $Y$ have finite dimension and $h$ is strictly convex, then a sufficient condition for b) is $\bar{x}\in\mbox{\rm ri\,}\partial g^{*}(A^{*}\bar{v})$ [11, Prop. 18.9].

•

If $X$ and $Y$ have finite dimension, $g$ is polyhedral and $h$ is strictly convex, then assumption b) is not needed. As pointed out in [40, Cor. 4.3], this is due to the fact that the subdifferentials of $h^{*}$ and $g^{*}$ are polyhedral, which allows the use of Hoffman’s bound [58] instead of [10, Theorem 4.3] in the proof.

Remark 3.19 (On the closedness of the range).

In Theorem 3.17 we assume $R(A)$ to be closed. To see how important this hypothesis is in infinite dimension, take $g=0$ (which is not strictly convex), $h=\|\cdot\|^{2}$ and $A$ an operator with a nonclosed range. Then, for this example, the qualification conditions cannot be satisfied. Indeed, even if (9) is automatically satisfied (because $\partial g^{*}(0)=X$ ), condition (10) reduces to $0\in\mbox{\rm sri\,}R(A)$ , which is equivalent by definition to $R(A)=\mbox{\rm cl\,}R(A)$ , which is impossible. Worse, even if we could get rid of this qualification condition, and if the conclusion of the theorem were true, we would obtain that $x\mapsto\|Ax\|^{2}$ is $2$ -conditioned on bounded sets, which was proven to be impossible in [53, Theorem 2.1] (combine it with Proposition 3.3).

Remark 3.20 (Previous results).

Our results can be seen as extensions and refinements of arguments from [40], where the authors introduce the idea of exploiting the $2$ -conditioning of tilted functions on compact sets, together with the description of ${\rm{argmin}}~{}f$ as an intersection (11). Theorem 3.17 improves on [40, Thm. 4.2] and [40, Cor. 4.3], which require ${\rm{argmin}}~{}f$ to be bounded, and $h$ to be in $C^{1}$ with $\mbox{\rm dom\,}h=Y$ (we only ask for a compatibility condition which is satisfied if $h$ is continuous at $A\bar{x}$ , see Remark 3.16). As far as we know, Theorem 3.15 is the first sum rule of this kind with such weak assumptions on $g$ .

To illustrate the interest of these sum rules, we provide a new result for regularized inverse problems where the loss function is the Kullback-Leibler divergence, and the regularizer is a polyhedral function, such as the $\ell^{1}$ norm, or the Total Variation, which are commonly used in the signal and image processing literature.

Proposition 3.21.

Let $f(x)=g(x)+KL(y;Ax)$ , where $g\in\Gamma_{0}(\mathbb{R}^{N})$ is polyhedral, $A\in\mathcal{M}_{M,N}(\mathbb{R})$ , and $y\in]0,+\infty[^{M}$ . If ${\rm{argmin}}~{}f\neq\emptyset$ , then $f$ is $2$ -conditioned on bounded sets.

Proof.

We just have to verify the hypotheses of Theorem 3.17, by noting $h:=KL(y;\ \cdot\ )$ . First, the nondegeneracy condition a) is verified because $\mbox{\rm dom\,}h$ is open, and $h$ is continuous on its domain (see Remark 3.16). Second, the qualification conditions b) are not needed because we are in a finite dimensional setting, $g$ is polyhedral and $h$ is strictly convex (see Remark 3.18). Finally, $g$ being polyhedral implies that it is globally $1$ -tilt-conditioned (see [23, Corollary 3.6]), and we prove in Lemma A.6 that $h$ is $2$ -tilt-conditioned on bounded sets, so c) is verified. ∎

4 Sharp convergence rates for the Forward-Backward algorithm

In this section, we present sharp convergence results for the forward-backward algorithm applied to $p$ -Łojasiewicz functions on a subset $\Omega$ , building on the ideas in [5]. We extend the analysis to the case where $\Omega$ is an arbitrary set, which will allow us to deal with infinite dimensional inverse problems (see Section 5.1), or structured problems for which all the information is encoded in a manifold (see Section 5.2). We also provide explicit rates of convergence, for both the iterates and the values. The proofs of Section 4.1 are deferred to Annex A.3.

4.1 Refined analysis with $p$ -Łojasiewicz functions

Theorem 4.1 (Strong convergence and rates, $p\geq 1$ ).

Suppose that Assumption 2.1 is in force, and that $f$ is bounded from below. Let $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm. Assume that:

a)

(Localization) for all ${n\in\mathbb{N}}$ , $x_{n}\in\Omega\subset X$ ,

b)

(Geometry) $f$ is $p$ -Łojasiewicz on $\Omega$ , for some $p\geq 1$ .

Then the sequence $(x_{n})_{n\in\mathbb{N}}$ has finite length in $X$ , meaning that $\sum_{{n\in\mathbb{N}}}\|x_{n+1}-x_{n}\|<+\infty$ , and converges strongly to some $x_{\infty}\in\mbox{\rm argmin\,}f\neq\emptyset$ . Moreover, there exist some constants $C_{p},C_{p}^{\prime}>0$ with explicit expressions (see equations (52) and (54)), such that the following convergence rates hold, depending on the value of $p$ , and of $\kappa:=\lambda(2-\lambda L)[2c_{f,\Omega}^{2}]^{-1}$ :

i)

If $p=1$ , then $x_{n}=x_{\infty}$ for every $n\geq(f(x_{0})-\inf f)/\kappa$ .

ii)

If $p\in]1,2[$ , the convergence is superlinear: for all ${n\in\mathbb{N}}$ ,

[TABLE]

iii)

If $p=2$ , the convergence is linear: for all ${n\in\mathbb{N}}$ ,

[TABLE]

iv)

If $p\in]2,+\infty[$ , the convergence is sublinear: for all ${n\in\mathbb{N}}$ ,

[TABLE]

Note that the rates range from finite termination, for $p=1$ , to the worst-case rates seen in Theorem 2.2, when $p$ tends to $+\infty$ . The larger $p$ is, the more ill-conditioned the function is, in the sense that the rates of its values become closer to $o(n^{-1})$ , and the rates of its iterates become arbitrarily slow.
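The linear regime $p=2$ can be illustrated on a small $\ell^{1}$ -regularized least squares problem. The following sketch is not the authors' implementation: it assumes the standard forward-backward iteration $x_{n+1}=\mbox{\rm prox}_{\lambda g}(x_{n}-\lambda\nabla h(x_{n}))$ with $g=\alpha\|\cdot\|_{1}$ and $h=(1/2)\|A\cdot-y\|^{2}$ , and the data `A`, `y`, the value of `alpha` and all function names are made up for the example.

```python
def soft_threshold(v, t):
    """Componentwise proximal map of t*||.||_1."""
    return [(1.0 if vi >= 0 else -1.0) * max(abs(vi) - t, 0.0) for vi in v]


def residual(A, x, y):
    """Return A x - y for a dense matrix A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) - yi for row, yi in zip(A, y)]


def objective(A, y, alpha, x):
    """f(x) = alpha*||x||_1 + 0.5*||A x - y||^2."""
    r = residual(A, x, y)
    return alpha * sum(abs(xj) for xj in x) + 0.5 * sum(ri * ri for ri in r)


def forward_backward(A, y, alpha, x0, n_steps):
    """Run x_{n+1} = prox_{lam*alpha*||.||_1}(x_n - lam * A^T (A x_n - y))."""
    # The squared Frobenius norm bounds the Lipschitz constant L = ||A^T A||
    # from above, so lam = 1/bound is an admissible (conservative) stepsize.
    lam = 1.0 / sum(a * a for row in A for a in row)
    x = list(x0)
    values = [objective(A, y, alpha, x)]
    for _ in range(n_steps):
        r = residual(A, x, y)
        grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
        x = soft_threshold([xj - lam * gj for xj, gj in zip(x, grad)], lam * alpha)
        values.append(objective(A, y, alpha, x))
    return x, values


# Made-up data: 3 observations, 2 unknowns, A injective.
A = [[1.0, 0.2], [0.2, 1.0], [0.0, 0.5]]
y = [1.0, 2.0, 0.5]
x_final, values = forward_backward(A, y, alpha=0.1, x0=[0.0, 0.0], n_steps=300)
```

For this data the minimizer has two positive components, so it solves the linear system $A^{*}Ax=A^{*}y-\alpha(1,1)$ and can be compared with `x_final`. The objective values in `values` are nonincreasing, as expected from a descent method.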

Remark 4.2 (Related work).

Theorem 4.1 collects known and new results. We present a simple proof of this theorem, focusing on the analysis of a real sequence satisfying (50) (see [27, Theorem 3.2] or [46, Theorem 3.4] for previous results). The superlinear rates in ii), which were known for the proximal point algorithm [82], are new for the Forward-Backward algorithm. Moreover, the case $p=2$ was giving R-linear rates for the values in [27, 46], while we prove here Q-linear rates. Also, the quantification of the number of steps in the case $p=1$ involving $\kappa$ is new.

Remark 4.3 (On the sharpness of the rates I).

Let $f=\|\cdot\|^{p}$ . According to (4) and (5), the order of the sublinear rates for the forward-backward algorithm that we obtain for both iterates and values are sharp when $p\in]2,+\infty[$ , see Remark 2.3. When $p=2$ , we see that the proximal algorithm verifies $x_{n+1}=(1+2\lambda)^{-1}x_{n}$ , and the algorithm converges linearly. Finally, when $p\in\left]1,2\right[$ , the order of superlinearity that we obtain is not sharp, since for this function the proximal algorithm has a Q-superlinear rate of order $(p-1)^{-1}$ . It is shown in [82, Theorem 3.1] that $\mbox{\rm dist\,}(x_{n},\mbox{\rm argmin\,}f)$ converges with this order for the proximal algorithm. For this, the author uses the stronger notion of metric subregularity, and we will extend this result in Theorem 4.21 to the FB algorithm.
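The recursions mentioned in this remark follow from the optimality condition of the proximal step. As a worked illustration on $X=\mathbb{R}$ with $x_{n}>0$ (a simplification made here), the point $x_{n+1}=\mbox{\rm prox}_{\lambda f}(x_{n})$ is characterized by

$$x_{n+1}+\lambda p\,x_{n+1}^{p-1}=x_{n}.$$

For $p=2$ this reads $(1+2\lambda)x_{n+1}=x_{n}$ , which is the linear recursion above. For $p\in\left]1,2\right[$ , dropping the nonnegative term $x_{n+1}$ gives

$$\lambda p\,x_{n+1}^{p-1}\leq x_{n}\quad\Longrightarrow\quad x_{n+1}\leq\left(\frac{x_{n}}{\lambda p}\right)^{\frac{1}{p-1}},$$

which is a Q-superlinear estimate of order $(p-1)^{-1}>1$ .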

Remark 4.4 (Best stepsize and condition number).

When $p\in\left[1,2\right]$ , we directly see that the larger $\kappa$ is, the better the constants in the rates for the values are. This is also true for $p>2$ , as can be seen from the definition of the constant $C_{p}^{\prime}$ in the proof of Theorem 4.1. The constant $\kappa$ is maximal when we take $\lambda=L^{-1}$ , in which case $\kappa=(L2c_{f,\Omega}^{2})^{-1}$ . When $f$ is a $\gamma$ -strongly convex function, $\kappa=\gamma/L$ is the condition number of $f$ (see Example 3.6). So $(L2c_{f,\Omega}^{2})^{-1}$ can be seen as a generalized condition number, extending this notion from strongly convex functions to $p$ -Łojasiewicz ones.
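The dependence of $\kappa$ on the stepsize is easy to tabulate. The sketch below simply evaluates the formula $\kappa=\lambda(2-\lambda L)[2c_{f,\Omega}^{2}]^{-1}$ from Theorem 4.1; the function name `kappa` and the numerical values of $L$ and $\gamma$ are illustrative assumptions.

```python
import math


def kappa(lam, L, c):
    """Evaluate kappa = lam*(2 - lam*L) / (2*c^2) for a stepsize lam in ]0, 2/L[."""
    return lam * (2.0 - lam * L) / (2.0 * c * c)


# Illustrative values: L = 4 and a gamma-strongly convex f with gamma = 1,
# for which the 2-Lojasiewicz constant of Example 3.6 is c = 1/sqrt(2*gamma).
L, gamma = 4.0, 1.0
c = 1.0 / math.sqrt(2.0 * gamma)
kappa_best = kappa(1.0 / L, L, c)                         # equals gamma/L here
kappa_grid = [kappa(k * 0.01, L, c) for k in range(1, 50)]  # stepsizes in ]0, 2/L[
```

Since $\lambda\mapsto\lambda(2-\lambda L)$ is a concave parabola with vertex at $\lambda=L^{-1}$ , no stepsize on the grid gives a larger value than `kappa_best`.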

In Theorem 4.1 the $p$ -Łojasiewicz assumption with $p\in\left[1,+\infty\right[$ implies that $\mbox{\rm argmin\,}f$ is nonempty. In what follows we will derive convergence rates for the objective function values, even in the case where $f$ is bounded from below but has no minimizers. Such results are of interest for instance in function approximation theory, where the goal is to find the best approximation of a target function within a specified function class [35]. Since in general the considered classes are not closed in the ambient space, the minimizer of the error does not exist, but convergence rates in objective function values are useful. A similar problem also appears in supervised statistical learning theory, where some convergence results are still available (see e.g. [34, Theorem 9] and [33, Theorem A.1]).

We show below that the $p$ -Łojasiewicz notion can be extended to negative values of $p$ , which allows us to describe the geometry of problems without minimizers. Based on this new definition, we then derive sharp convergence rates for the objective function values.

Definition 4.5.

Let $p\in\left]-\infty,0\right[$ , let $f\in\Gamma_{0}(X)$ be bounded from below, and let $\Omega\subset X$ . We say that $f$ is $p$ -Łojasiewicz on $\Omega$ if $\exists c_{f,\Omega}>0$ such that the Łojasiewicz inequality holds:

[TABLE]

Similarly to the case $p\geq 1$ , where this property describes the behavior of $f$ around its minimizers, here it describes the decay of $f(x)$ when $\|x\|$ goes to $+\infty$ . This assumption leads to convergence rates, interpolating between $o(1)$ and $o(n^{-1})$ , depending on the value of $p<0$ . We will see in Section 5.1 that this result applies to ill-posed linear problems involving a compact operator between infinite dimensional spaces.

Theorem 4.6 (Rates of convergence, $p<0$ ).

Let $f\in\Gamma_{0}(X)$ be bounded from below and satisfying Assumption 2.1, $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm. Assume that:

a) (Localization) for all ${n\in\mathbb{N}}$ , $x_{n}\in\Omega\subset X$ ,

b) (Geometry) $f$ is $p$ -Łojasiewicz on $\Omega$ , for some $p<0$ .

Then the values converge sublinearly (with $C_{p}^{\prime}$ defined as in (52)):

[TABLE]

Remark 4.7 (On the sharpness of the rates II).

The rates obtained in Theorem 4.6 are sharp. Indeed, the function defined in (6) is $p$ -Łojasiewicz on $\mathbb{R}$ with $p=-\alpha$ , and our rates match the lower bounds obtained in Remark 2.3.

Theorem 4.6, together with Theorem 4.1, gives a complete (and sharp) picture of the asymptotic behavior of the FB algorithm. In fact, looking at the proofs of these results, we see that the only properties of the forward-backward algorithm that are used are (46) and (47). We can then extend the previous theorems to a broader class of first-order descent methods, which encompasses block coordinate descent methods and/or variable metric extensions of the FB algorithm [5, 18, 46].

Theorem 4.8 (General first-order descent method).

The statements of Theorems 4.1 and 4.6 remain true if the sequence $(x_{n})_{n\in\mathbb{N}}$ is generated by any algorithm satisfying:

[TABLE]

In that case the constant appearing in Theorem 4.1 becomes $\kappa:=ab^{-2}c_{f,\Omega}^{-2}$ .

4.2 How to localize the sequence of iterates

One of the two assumptions we make in Theorems 4.1 and 4.6 is that the sequence belongs to a set $\Omega$ on which the geometry of $f$ is known. We discuss here some possible choices. A first simple case is when $\Omega$ remains invariant under the action of $T_{\lambda}$ (see also Annex A.2).

Definition 4.9.

We say that $\Omega\subset X$ is FB-invariant if for all $\lambda\in]0,2L^{-1}[$ , $T_{\lambda}\Omega\subset\Omega$ .

Example 4.10 (FB-invariant sets).

Theorem 2.2 i-ii and Lemma A.9.ii imply that the following sets are FB-invariant (as well as any intersection of them):

•

$\mathbb{B}_{X}(\bar{x},\delta)$ and $\overline{\mathbb{B}}_{X}(\bar{x},\delta)$ for every $\bar{x}\in\mbox{\rm argmin\,}f$ , and for every $\delta\in\left]0,+\infty\right]$ ,

•

$[f<r]$ for every $r>\inf f$ ,

•

$\{x\in X\ |\ \|\partial f(x)\|_{\_}<M\}$ and $\{x\in X\ |\ \|\partial f(x)\|_{\_}\leq M\}$ , for every $M\in]0,+\infty]$ ,

•

$\Omega=\{x_{n}\}_{n\in\mathbb{N}}$ if $(x_{n})_{n\in\mathbb{N}}$ is generated by the FB algorithm.
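The invariance of balls centered at a minimizer and of sublevel sets can be checked numerically on a toy problem. The following is only a minimal sketch under stated assumptions: the separable problem $h(x)=\frac{1}{2}\|d\odot x-y\|^{2}$, $g=\|\cdot\|_{1}$, the data `d` and `y`, and the helper names `soft_threshold` and `fb_step` are hypothetical choices made for illustration, not taken from the paper. The problem is chosen separable so that the minimizer is available in closed form.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Hypothetical toy problem: f = h + g, h(x) = 0.5 * ||d * x - y||^2, g = ||x||_1.
d = np.array([1.0, 0.5, 2.0, 0.1])
y = np.array([3.0, -4.0, 0.2, 1.0])
L = np.max(d ** 2)                      # Lipschitz constant of grad h

def f(x):
    return 0.5 * np.sum((d * x - y) ** 2) + np.sum(np.abs(x))

def fb_step(x, lam):
    """One forward-backward step: prox_{lam g}(x - lam * grad h(x))."""
    return soft_threshold(x - lam * d * (d * x - y), lam)

# Closed-form minimizer, available because the problem is separable.
x_bar = soft_threshold(d * y, 1.0) / d ** 2

rng = np.random.default_rng(0)
max_value_increase = -np.inf            # worst observed f(Tx) - f(x)
max_distance_increase = -np.inf         # worst observed ||Tx - x_bar|| - ||x - x_bar||
for _ in range(1000):
    x = rng.normal(scale=10.0, size=d.size)
    lam = rng.uniform(0.01, 1.99) / L   # stepsize in ]0, 2/L[
    Tx = fb_step(x, lam)
    max_value_increase = max(max_value_increase, f(Tx) - f(x))
    max_distance_increase = max(
        max_distance_increase,
        np.linalg.norm(Tx - x_bar) - np.linalg.norm(x - x_bar),
    )

print(max_value_increase, max_distance_increase)
```

Both printed quantities should be nonpositive up to rounding: one step of the algorithm neither increases the value of $f$ nor the distance to the minimizer, which is exactly what makes sublevel sets and balls around $\bar{x}$ invariant in this example.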

Assuming that $\Omega$ is FB-invariant, the localization property becomes a simple assumption on the initialization of the algorithm. The proof of the next corollary is immediate:

Corollary 4.11 (Geometry on stable sets gives global rates).

Let $f\in\Gamma_{0}(X)$ be bounded from below and satisfying Assumption 2.1, and $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm. Assume that $\Omega\subset X$ is FB-invariant and that:

a) (Initialization) $x_{0}\in\Omega$ ,

b) (Geometry) $f$ is $p$ -Łojasiewicz on $\Omega$ , for some $p\in]-\infty,0[\cup[1,+\infty[$ .

Then the results of Theorems 4.1 and 4.6 apply for the sequence $(x_{n})_{n\in\mathbb{N}}$ .

In some cases, it is possible to remove the assumption $x_{0}\in\Omega$ , at the price of having only asymptotic rates. Indeed, it suffices to prove that the sequence enters $\Omega$ after a certain iteration, which is the argument used in [5, 46] in a non-convex setting. This happens for instance with the local level sets, under a slight compactness assumption (see below).

Corollary 4.12 (Local geometry gives asymptotical rates).

Let $f\in\Gamma_{0}(X)$ be such that $\mbox{\rm argmin\,}f\neq\emptyset$ and satisfying Assumption 2.1. Let $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm and assume that:

a) (Compactness) $(x_{n})_{n\in\mathbb{N}}$ admits a subsequence strongly converging to $\bar{x}$ in $X$ ,

b) (Local geometry) for some $p\in\left[1,+\infty\right[$ :

[TABLE]

Then there exists $n_{0}\in\mathbb{N}$ such that the rates of Theorem 4.1 apply for the sequence $(x_{n_{0}+n})_{n\in\mathbb{N}}$ .

Proof.

Let $(x_{n_{k}})_{k\in\mathbb{N}}$ be a subsequence strongly converging to some $x_{\infty}$ , which belongs to $\mbox{\rm argmin\,}f$ according to Theorem 2.2. Therefore, $f$ is $p$ -Łojasiewicz on $\Omega:=\mathbb{B}_{X}(x_{\infty},\delta)\cap[f<r+\inf f]$ , for some $(\delta,r)\in\left]0,+\infty\right]^{2}$ . Since $x_{n_{k}}\to x_{\infty}$ and $f(x_{n_{k}})\downarrow\inf f$ , there exists $K\in\mathbb{N}$ such that $x_{n_{K}}\in\Omega$ . Since $\Omega$ is FB-invariant (see Example 4.10), we conclude that $(x_{n})_{n\geq n_{K}}\subset\Omega$ , so the claim holds with $n_{0}:=n_{K}$ . ∎

Remark 4.13 (On the compactness assumption).

The compactness assumption made in Corollary 4.12 is always satisfied in finite dimension. Indeed Theorem 2.2 guarantees that the sequence is bounded under the assumption that $\mbox{\rm argmin\,}f\neq\emptyset$ . If $X$ has infinite dimension, this assumption can be verified provided that $f$ has compact level sets, due to the decreasing property of $f(x_{n})$ .

The property that a sequence $(x_{n})_{{n\in\mathbb{N}}}$ generated by an algorithm reaches a set of interest $\Omega$ after a finite number of iterations, is usually called identifiability, or finite identification of $\Omega$ [100, 68, 54], and $\Omega$ is therefore called an active set. For instance, the so-called active manifolds can be identified in finite time, under the assumption that $f$ is partially smooth with respect to this manifold [54, 55]. An alternative approach, recently introduced in [43], shows that the strata of mirror-stratifiable functions are identifiable. We will use this notion of active strata to derive another asymptotic convergence result.

Before introducing the notion of mirror-stratifiability, we recall that a set $M\subset\mathbb{R}^{N}$ is said to be stratified by $\{M_{i}\}_{i=1}^{s}\subset M$ if this family is a finite partition $\sqcup M_{i}=M$ such that $M_{i}\cap\mbox{\rm cl\,}M_{j}\neq\varnothing\Leftrightarrow M_{i}\subset\mbox{\rm cl\,}M_{j}$ . The latter inclusion endows the family of strata with an order relation $M_{i}\preceq M_{j}\Leftrightarrow M_{i}\subset\mbox{\rm cl\,}M_{j}$ . Given a point $x\in M$ , it will be useful to denote by $M_{x}$ the unique stratum which contains $x$ .

Definition 4.14 (Mirror-stratifiable function).

We say that a function $f\in\Gamma_{0}(\mathbb{R}^{N})$ is mirror-stratifiable if

a) $\mbox{\rm dom\,}\partial f$ (resp. $\mbox{\rm dom\,}\partial f^{*}$ ) is stratified by $\{M_{i}\}_{i=1}^{s}$ (resp. $\{M_{i}^{*}\}_{i=1}^{s}$ ),

b) the map $J_{f}:M\longmapsto\bigcup\limits_{x\in M}\mbox{\rm ri\,}\partial f(x)$ realizes a bijection between $\{M_{i}\}_{i=1}^{s}$ and $\{M_{i}^{*}\}_{i=1}^{s}$ ,

c) the map $J_{f}$ is decreasing, in the sense that $M_{i}\preceq M_{j}\Leftrightarrow J_{f}(M_{j})\preceq J_{f}(M_{i})$ .

Both notions appear naturally in most sparsity-based inverse problems such as the $1$ -norm, group-lasso norm, nuclear norm, or the total variation, or any polyhedral function, see [43] for more details and many examples.

Corollary 4.15.

Suppose that Assumption 2.1 is in force, that $X=\mathbb{R}^{N}$ , and let $(x_{n})_{n\in\mathbb{N}}$ be the sequence generated by the FB algorithm converging to some $\bar{x}\in\mbox{\rm argmin\,}f$ . Assume that:

a) $g$ is mirror-stratifiable, and we define $C_{\bar{x}}:=\cup\{M\ |\ M_{\bar{x}}\preceq M\preceq J_{g}^{-1}(M^{*}_{-\nabla h(\bar{x})})\}$ ,

b) $f$ is $p$ -Łojasiewicz on $C_{\bar{x}}\cap\mathbb{B}_{X}(\bar{x},\delta)$ for some $\delta\in]0,+\infty]$ and $p\in[1,+\infty[$ .

Then there exists $n_{0}\in\mathbb{N}$ such that the rates of Theorem 4.1 apply for the sequence $(x_{n_{0}+n})_{n\in\mathbb{N}}$ . Note that $C_{\bar{x}}=M_{\bar{x}}$ holds whenever $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ .

Proof.

It follows from [43, Theorem 4] that there exists $n_{0}\in\mathbb{N}$ for which $x_{n_{0}+n}\in C_{\bar{x}}$ for every ${n\in\mathbb{N}}$ . Since $(x_{n})_{{n\in\mathbb{N}}}$ converges to $\bar{x}$ , we can assume that $n_{0}$ is such that $x_{n_{0}+n}\in C_{\bar{x}}\cap\mathbb{B}_{X}(\bar{x},\delta)$ for every ${n\in\mathbb{N}}$ . This, together with assumption b), allows us to apply Theorem 4.1 to the sequence $(x_{n_{0}+n})_{n\in\mathbb{N}}$ . The equality $C_{\bar{x}}=M_{\bar{x}}$ follows directly from the bijectivity of $J_{g}$ , and the fact that $-\nabla h(\bar{x})\in\mbox{\rm ri\,}\partial g(\bar{x})$ . ∎

The reader not familiar with the notion of mirror-stratifiability might wonder what the active set $C_{\bar{x}}$ appearing in Corollary 4.15 looks like. Here are a few examples of interest:

Example 4.16.

We keep here the notations of Corollary 4.15:

•

If $g(x)=\|x\|_{1}$ , we can choose a stratification based on sets with prescribed support, which gives

[TABLE]

where $\mbox{\rm supp}(x)$ is the support of $x$ , and $\mbox{\rm act}(x^{*})=\{i\ |\ |x^{*}_{i}|=1\}$ is the set of active indices of $x^{*}$ in $[-1,1]^{N}$ . Some authors call $\mbox{\rm act}(-\nabla h(\bar{x}))$ the extended support of $\bar{x}$ . In the case that $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ , we have $\mbox{\rm supp}(\bar{x})=\mbox{\rm act}(-\nabla h(\bar{x}))$ .

•

If $g(x)=\|x\|_{*}$ is the nuclear norm, we can choose a stratification based on sets of matrices with prescribed rank, which gives

[TABLE]

where $\sigma(x^{*})$ denotes the set of singular values of the matrix $x^{*}$ . If $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ , we have $\mbox{\rm rank\,}(\bar{x})=\#\mbox{\rm act}(\sigma(-\nabla h(\bar{x})))$ .
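For the $1$ -norm, the difference between the support and the extended support can be made concrete on a small example. This is a minimal sketch under assumptions: the separable smooth term $h(x)=\frac{1}{2}\|d\odot x-y\|^{2}$ and the data `d`, `y` are hypothetical, chosen so that the minimizer has a closed form and so that one coordinate is degenerate.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Hypothetical separable problem: h(x) = 0.5 * ||d * x - y||^2, g = ||x||_1.
d = np.array([1.0, 2.0, 1.0, 1.0])
y = np.array([3.0, 0.25, 1.0, -0.5])

x_bar = soft_threshold(d * y, 1.0) / d ** 2     # closed-form minimizer
dual = -d * (d * x_bar - y)                     # -grad h(x_bar), lies in [-1, 1]^N

support = np.flatnonzero(x_bar != 0.0)
extended_support = np.flatnonzero(np.isclose(np.abs(dual), 1.0))

print(support.tolist(), extended_support.tolist())
# prints: [0] [0, 2]
```

Here the third coordinate satisfies $\bar{x}_{i}=0$ while $|(\nabla h(\bar{x}))_{i}|=1$ , so the qualification condition $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ fails and the extended support is strictly larger than the support.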

Remark 4.17 (Partial smoothness).

Even if there is no direct relation between mirror stratification and partial smoothness, all the above mentioned functions are both mirror-stratifiable and partially smooth, and it would be immediate to provide a result analogous to Corollary 4.15 for partially smooth functions. Note that when using the identification theorems for partially smooth functions, it is necessary to assume that the qualification condition $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ holds. In this case, the active manifold coincides with the active set $C_{\bar{x}}=M_{\bar{x}}$ for most practical cases (polyhedral functions, spectral norms), meaning that those cases are already covered by Corollary 4.15.

Remark 4.18 (On the assumptions).

Note that our assumptions do not require or imply that $f$ has a unique minimizer; we only require $f$ to be Łojasiewicz on the active set. In Section 5.2, we will show how this geometrical assumption can be guaranteed, provided that $\nabla^{2}h(\bar{x})$ is injective when restricted to the tangent cone of the active set. In [74, Thm. 3.7] the authors provide a sufficient condition for the Łojasiewicz inequality to hold locally when $g$ is a partially smooth function.

4.3 Linear rates of convergence for the Forward-Backward algorithm

In this Section we give more insights on the linear rates for the FB algorithm. According to Theorem 4.1, $f(x_{n})-\inf f$ and $\|x_{n}-x_{\infty}\|$ converge linearly when a $2$ -Łojasiewicz property is verified. Another decreasing quantity of interest is $\mbox{\rm dist\,}(x_{n},\mbox{\rm argmin\,}f)$ , and its Q-linear convergence is equivalent to asking that the forward-backward map $T_{\lambda}$ satisfies

[TABLE]

If such property holds on a set $\Omega$ containing $(x_{n})_{n\in\mathbb{N}}$ , the sequence $(\mbox{\rm dist\,}(x_{n},\mbox{\rm argmin\,}f))_{n\in\mathbb{N}}$ will converge Q-linearly. In fact, it is possible to show that (22) is equivalent to the $2$ -conditioning of $f$ on $\Omega$ , provided this set is FB-invariant (see Definition 4.9). This fact has been observed in [85] for the projected gradient method, with $\Omega=X$ and $\lambda=L^{-1}$ , and below we extend the argument to our more general setting.

Proposition 4.19 (Linear rates and $2$ -conditioning).

Suppose that Assumption 2.1 is in force and assume that $\mbox{\rm argmin\,}f\neq\emptyset$ . Let $\Omega\subset X$ and $\lambda\in]0,2L^{-1}[$ .

i) If $f$ verifies (22) on $\Omega$ , then it is $2$ -conditioned on $\Omega$ with $\gamma_{f,\Omega}=\lambda^{-1}(2-\lambda L)(1-\varepsilon_{f,\Omega})^{2}$ .

ii) If $f$ is $2$ -conditioned on $T_{\lambda}\Omega$ , then it verifies (22) on $\Omega$ with $\varepsilon_{f,\Omega}=(1+\lambda\gamma_{f,\Omega})^{-1/2}$ for stepsizes $\lambda\in\left]0,L^{-1}\right]$ .

Then, on FB-invariant sets, the $2$ -conditioning is equivalent to (22), for stepsizes $\lambda\in\left]0,L^{-1}\right]$ .

Proof.

Let $S=\mbox{\rm argmin\,}f$ , and let $x\in\Omega$ . It follows from the triangle inequality that

[TABLE]

Lemma A.9.i implies that

[TABLE]

For item i, combine (22), (23), and (24):

[TABLE]

For item ii, Lemma A.9.i with $u=\mbox{\rm proj}(x;S)$ , and the fact that $\lambda\leq 1/L$ , imply

[TABLE]

Then, since $f$ is $2$ -conditioned on $T_{\lambda}\Omega\ni T_{\lambda}x$ , we can conclude from

[TABLE]

Let us assume that $f$ is a $\gamma$ -strongly convex function, with $\gamma>0$ as in Example 3.6, and let $\bar{x}$ be its unique minimizer. Let $(x_{n})_{n\in\mathbb{N}}$ be generated by the FB algorithm, for which we take $\lambda=1/L$ , and define the condition number of $f$ as $\kappa:=\gamma/L$ . We compare the different linear rates that we can get for $\|x_{n}-\bar{x}\|$ by using different theorems, relying on more or less strong assumptions. Using that $f$ is $2$ -Łojasiewicz (with $c_{f,X}=(2\gamma)^{-1/2}$ , see Example 3.6), Theorem 4.1 yields R-linear rates of the form

[TABLE]

where $\varepsilon={1}/{\sqrt{1+\kappa}}.$ If instead we exploit $2$ -conditioning (recall that in general this is a stronger notion than $2$ -Łojasiewicz , Proposition 3.3), we obtain Q-linear rates from Proposition 4.19 with exactly the same constant $\varepsilon$ . If we use directly the strong convexity of $f$ , we obtain in this case Q-linear rates with $\varepsilon=1-\kappa$ (see e.g. [95, Proposition 3]). So, the more information we use, the better the rates we derive. In [85], the authors investigate different notions lying between strong convexity and $2$ -conditioning. For instance, under an assumption of “quasi strong convexity”, they obtain $\varepsilon=\sqrt{(1-\kappa)/(1+\kappa)},$ which is smaller than $(1+\kappa)^{-1/2}$ , but not as good as $1-\kappa$ . In conclusion, two aspects are crucial in the linear convergence of forward-backward. First, to have Q-linear rates for the iterates, it is necessary and sufficient to require the $2$ -conditioning of the function, due to the equivalence result of Proposition 4.19. Second, just assuming $2$ -conditioning does not guarantee a fast computation of the solution, since linear rates can be arbitrarily slow on any finite number of iterations. Indeed two constants play a key role: the condition number $\kappa$ , which is directly related to $\gamma_{f,\Omega}$ (some extra assumptions on $f$ could improve the value of $\gamma_{f,\Omega}$ , see e.g. the discussion in Subsection 5.2), and $\varepsilon$ (see also [85]).
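To get a feeling for how the three constants compare, one can tabulate them for a few values of $\kappa$ . The snippet below is an illustrative sketch only: the function names and the tolerance $10^{-6}$ are hypothetical, and the iteration counts ignore the multiplicative constants in front of the rates, so they indicate orders of magnitude rather than guarantees.

```python
import math

def contraction_factors(kappa):
    """Linear-rate constants quoted in the text, for kappa = gamma / L in ]0, 1[."""
    return {
        "conditioning": 1.0 / math.sqrt(1.0 + kappa),            # 2-conditioning / 2-Lojasiewicz
        "quasi_strong": math.sqrt((1.0 - kappa) / (1.0 + kappa)),  # quasi strong convexity
        "strong": 1.0 - kappa,                                   # strong convexity
    }

def iterations_to_reach(rate, tol):
    """Smallest n with rate**n <= tol (multiplicative constants are ignored)."""
    return math.ceil(math.log(tol) / math.log(rate))

table = {}
for kappa in (0.5, 0.1, 0.01):
    rates = contraction_factors(kappa)
    table[kappa] = {name: iterations_to_reach(r, 1e-6) for name, r in rates.items()}
    print(kappa, rates, table[kappa])
```

For every $\kappa\in\left]0,1\right[$ the three constants are ordered as $1-\kappa\leq\sqrt{(1-\kappa)/(1+\kappa)}\leq(1+\kappa)^{-1/2}$ , and for small $\kappa$ the number of iterations suggested by the weakest assumption is roughly twice the one suggested by strong convexity.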

4.4 Superlinear rates and finite termination

In this section, we refine the convergence analysis for the case $p\in\left]1,2\right[$ , replacing the $p$ -Łojasiewicz property with $p$ -metric subregularity (or $p$ -conditioning). As discussed in Remark 4.3, the order of superlinear convergence that we derive for the FB algorithm in the case $p\in]1,2[$ is not sharp. In Theorem 4.21, using $p$ -metric subregularity (or $p$ -conditioning) instead of $p$ -Łojasiewicz , we derive better (and indeed sharp, see Remark 4.3) superlinear rates. Keep in mind these three notions are only equivalent via Proposition 3.3 if $\Omega$ verifies a stability condition. The proof of Theorem 4.21 below follows directly from the next lemma, which is a partial analogue of Proposition 4.19-ii.

Lemma 4.20.

Suppose that Assumption 2.1 is in force and assume that $\mbox{\rm argmin\,}f\neq\emptyset$ .

i) If $\partial f$ is $p$ -metrically subregular on $\Omega\subset X$ , then for all $p\in\left]1,2\right[$ , and $x\in\mbox{\rm dom}^{*}f$ :

[TABLE]

ii) If $f$ is $p$ -conditioned on $\Omega$ , then for all $p\in\left]1,2\right[$ , and $x\in\mbox{\rm dom}^{*}f$ :

[TABLE]

Proof.

Let $S=\mbox{\rm argmin\,}f$ . Lemma A.9.ii, the triangle inequality, and Theorem 2.2-ii yield

[TABLE]

For i, use the hypothesis with (25) to derive $\gamma_{\partial f,\Omega}\mbox{\rm dist\,}(T_{\lambda}x,S)^{p-1}\leq(2/\lambda)\mbox{\rm dist\,}(x,S).$ For ii, use the $p$ -Łojasiewicz inequality via Proposition 3.3 , together with (25) and the $p$ -conditioning:

[TABLE]

Theorem 4.21.

Assume that $p\in]1,2[$ and that the hypotheses of Theorem 4.1 hold. If the $p$ -Łojasiewicz hypothesis is replaced by $p$ -metric subregularity (resp. $p$ -conditioning), then $\mbox{\rm dist\,}(x_{n},\mbox{\rm argmin\,}f)$ (resp. $f(x_{n})-\inf f$ ) Q-superlinearly converges with order $(p-1)^{-1}$ .

We now discuss the relevance of these fast rates when $f$ is $p$ -Łojasiewicz with $p\in\left[1,2\right[$ . While superlinear rates are well known for the proximal algorithm applied to sharp functions, they are not observed for the gradient method. The apparent contradiction between this result and practice is in fact related to a quite intuitive fact, stated in the following proposition: the smoother a function is, the less sharp it can be. This means that the gradient algorithm cannot be applied to $p$ -Łojasiewicz functions with $p<2$ , because this property is incompatible with $\nabla f$ being Lipschitz continuous. A similar statement, under different assumptions, can be found in [13, Proposition 2.8].

Proposition 4.22.

Let $f\in\Gamma_{0}(X)$ be such that $\mbox{\rm dom\,}f$ has a nonempty interior. Assume $f$ to be differentiable on $\Omega$ , where $\Omega\subset X$ is convex and such that $\mbox{\rm proj}(\Omega;\mbox{\rm argmin\,}f)\subsetneq\Omega$ (note that $\mbox{\rm proj}(\Omega;\mbox{\rm argmin\,}f)\subset\Omega$ holds when $\Omega=\mathbb{B}_{X}(\bar{x},\delta)\cap[f<r]$ , for $\bar{x}\in\mbox{\rm argmin\,}f$ , because $\mbox{\rm proj}(\cdot;\mbox{\rm argmin\,}f)$ is nonexpansive). Assume that $f$ is $p$ -conditioned on $\Omega$ , and that $\nabla f$ is $\alpha$ -Hölder continuous on $\Omega$ , i.e.

[TABLE]

Then $p\in[\alpha+1,+\infty[$ . In the case that $p=\alpha+1$ , we have moreover that $\gamma_{f,\Omega}\leq L_{\nabla f,\Omega,\alpha}$ .

Proof.

Let $x\in\Omega\cap\mbox{\rm dom}^{*}f$ , and $\bar{x}:=\mbox{\rm proj}(x,\mbox{\rm argmin\,}f)$ . Then $\bar{x}\in\Omega$ and $\bar{x}\neq x$ . For all $t\in\left]0,1\right]$ , let $x_{t}:=tx+(1-t)\bar{x}$ . Then $x_{t}\in\Omega\setminus\mbox{\rm argmin\,}f$ and $\bar{x}=\mbox{\rm proj}(x_{t},\mbox{\rm argmin\,}f)$ . From the $p$ -conditioning assumption and the Descent Lemma A.10 applied at $(\bar{x},x_{t})\in\Omega^{2}$ , we see that:

[TABLE]

If we suppose that $p<\alpha+1$ , then by passing to the limit for $t\to 0$ , we get $\gamma_{f,\Omega}/p\leq 0$ which is impossible. So $p\geq\alpha+1$ , and if equality holds, $\gamma_{f,\Omega}\leq L_{\nabla f,\Omega}$ follows from (26). ∎

As a consequence of Proposition 4.22, we should not expect more than linear rates for the gradient method applied to a $C^{1,1}$ convex function. Such a result cannot be extended straightforwardly to the forward-backward algorithm. For instance, the function $f(x)=\|x\|^{2}+\|x\|$ has a nontrivial smooth term in its decomposition, but is still sharp at its minimizer.
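The one-dimensional version of this function gives a simple illustration of finite termination. The following sketch assumes the splitting $h(x)=x^{2}$ , $g(x)=|x|$ on $\mathbb{R}$ ; the stepsize and the starting point are hypothetical values chosen so that all computations are exact in floating point.

```python
def fb_step(x, lam):
    """Forward-backward step for f(x) = x**2 + |x| on the real line.

    Forward (gradient) step on h(x) = x**2, then backward (proximal) step on
    g(x) = |x|, whose prox is soft-thresholding with threshold lam.
    """
    v = x - lam * 2.0 * x                      # gradient step, grad h(x) = 2x
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

lam = 0.25          # grad h is 2-Lipschitz, so any lam in ]0, 1[ is admissible
x = 4.0
iterates = [x]
while x != 0.0:
    x = fb_step(x, lam)
    iterates.append(x)

print(iterates)
# prints: [4.0, 1.75, 0.625, 0.0625, 0.0]
```

The minimizer $0$ is reached exactly after four iterations: as soon as the gradient step lands in $[-\lambda,\lambda]$ , the proximal step maps it to the minimizer.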

5 Linear inverse problems: from modeling assumptions to convergence rates

Throughout this section, $X$ and $Y$ are Hilbert spaces and $A:X\longrightarrow Y$ is a bounded linear operator. $X$ is called the parameter space and $Y$ is the data space. Given the linear inverse problem $Ax=y$ , for some $y\in Y$ , we are interested in the (possibly regularized) convex optimization problem

[TABLE]

where $g\in\Gamma_{0}(X)$ and $h=D(\cdot,y)\in\Gamma_{0}(Y)$ . The goal of this section is to show that typical modeling assumptions made in the inverse problem literature can be interpreted as geometric assumptions on (27), which are often not local, in the sense of Definition 3.1. First, we show that the classical source conditions are equivalent to a Łojasiewicz condition on suitable subsets, that we call source sets. Second, we show that the restricted isometry property, which is the key for exact recovery in sparsity based regularization, induces a 2-conditioning of the problem over a cone of sparse vectors, which is identified in finite time by the algorithm. This result extends to general inverse problems with mirror-stratifiable regularizing functions, for which the restricted isometry property entails a 2-conditioning of the problem over an active set (introduced in Corollary 4.15).

5.1 Łojasiewicz property of quadratic functions via source conditions in Hilbert spaces

Throughout Section 5.1, we assume that $A:X\rightarrow Y$ is a bounded linear operator, that $y\in Y$ , and that $f(x):=(1/2)\|Ax-y\|^{2}$ is the associated least squares function. We also denote $y^{\dagger}:=\mbox{\rm proj}(y,\mbox{\rm cl\,}R(A))$ , and, whenever $\mbox{\rm argmin\,}f\neq\emptyset$ , $x^{\dagger}:=\mbox{\rm proj}(0,\mbox{\rm argmin\,}f)$ , which verifies $Ax^{\dagger}=y^{\dagger}$ .

5.1.1 Elements of linear algebra

Before going further into the topic, let us recall some basic (but not necessarily well-known) facts about bounded linear operators in Hilbert spaces. A first important difference with the finite-dimensional setting is that the set of minimizers of $f$ can be empty:

Proposition 5.1 ([51, Theorem 3.1.1]).

Let $A:X\rightarrow Y$ be a bounded linear operator, $y\in Y$ and $f(x):=\|Ax-y\|^{2}/2$ . Then $\mbox{\rm argmin\,}f\neq\emptyset\Leftrightarrow y\in R(A)+R(A)^{\perp}\Leftrightarrow y^{\dagger}\in R(A)$ .

We see that $\mbox{\rm argmin\,}f\neq\emptyset$ is guaranteed when $R(A)$ is closed, which for instance cannot happen for compact operators with infinite-dimensional range [51, Theorem 3.1.3]. Observe that the closedness of $R(A)$ can be checked by means of its singular values:

Proposition 5.2.

Let $A:X\rightarrow Y$ be a bounded linear operator. Then $R(A)$ is closed if and only if $\sigma_{inf}(A)>0$ .

Proof.

Use the fact that $R(A)=R((AA^{*})^{1/2})$ [42, Proposition 2.18] together with [53, Remark 2.3] and the fact that $\mbox{\rm spec}((AA^{*})^{1/2})=\mbox{\rm spec}(AA^{*})^{1/2}$ [56, §32 Theorem 3]. ∎

5.1.2 Known results about the Landweber algorithm

The quadratic function $f$ can be minimized by means of a gradient method, defined as

[TABLE]

A vast literature is devoted to this algorithm, which is often called in this context the Landweber algorithm. It is well known that whenever $\mbox{\rm argmin\,}f\neq\emptyset$ , the sequence $(x_{n})_{n\in\mathbb{N}}$ generated by the Landweber algorithm converges strongly to the projection of $x_{0}$ onto $\mbox{\rm argmin\,}f$ , denoted by $\bar{x}_{0}$ (see e.g. [42, Theorem 6.1], or [51, Theorem 3.3.2] for varying stepsizes). When the range $R(A)$ is closed, the algorithm behaves exactly as in finite dimensions: both iterates and values converge linearly, see Example 3.7 and Theorem 4.1. If instead $R(A)$ is not closed, the rates for $\|x_{n}-\bar{x}_{0}\|$ can be arbitrarily slow without additional assumptions [32, Theorem 12]. Moreover, [53, Theorem 2.1] shows that no local Łojasiewicz property can be satisfied by such a quadratic function when $R(A)$ is not closed. This could suggest that it is not possible to rely on geometrical assumptions to obtain convergence rates. Nevertheless, as we will see below, this is not true. Indeed, in the inverse problem literature, this worst-case scenario is avoided by making an extra assumption on the problem. For instance, if the following source condition is verified

[TABLE]

the Landweber algorithm initialized with $x_{0}=0$ is known [42] to have the rates

[TABLE]

Also, when $\mbox{\rm argmin\,}f=\emptyset$ , a source condition in $Y$ can be made:

[TABLE]

so that the Landweber algorithm initialized with $x_{0}=0$ verifies [34, Theorem 2.10]:

[TABLE]

The source condition (31) can be understood in light of Proposition 5.1. Indeed, this proposition says that the problem is well posed (in the sense that $\mbox{\rm argmin\,}f\neq\emptyset$ ) when $y^{\dagger}\in R(A)$ . So it is reasonable to think that the “deeper” $y^{\dagger}$ is in $R(A)$ , the easier the problem is. In the ill-posed case $y^{\dagger}\in\mbox{\rm cl\,}R(A)\setminus R(A)$ , we could also imagine that the “further away” $y^{\dagger}$ is from $R(A)$ , the more difficult the problem is. Estimating the location of $y^{\dagger}$ can be done thanks to the spaces $R(AA^{*})^{\nu}$ , because they form a nonincreasing family of dense subsets of $\mbox{\rm cl\,}R(A)$ (see Lemma A.14 and [42, Proposition 2.8]):

[TABLE]

The aim of this section is to highlight how the rates (30) and (32) can be simply explained using the results of Section 4. We show that the source conditions (29) and (31) are equivalent to assuming that the initialization $x_{0}$ of the algorithm belongs to a so-called source set. Our main result in this section consists in showing that the function $f$ satisfies a Łojasiewicz inequality on these source sets, which are FB-invariant. As a by-product of Corollary 4.11, we will obtain a new and simple geometrical interpretation of the rates in (30) and (32).
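In finite dimensions, where $R(A)$ is always closed, the behavior of the Landweber iteration recalled above is easy to observe. The sketch below is a minimal illustration under assumptions: the matrix `A`, the data `y`, the starting point `x0` and the function name `landweber` are hypothetical, and the update used is the gradient step $x_{n+1}=x_{n}-\lambda A^{*}(Ax_{n}-y)$ on the least squares function $f$ .

```python
import numpy as np

def landweber(A, y, x0, lam, n_iter):
    """Landweber iteration x_{n+1} = x_n - lam * A^T (A x_n - y)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - lam * A.T @ (A @ x - y)
    return x

# Hypothetical wide matrix: A has a nontrivial kernel, so argmin f is not a singleton.
A = np.array([[1.0, 2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 1.0, 1.0, 1.0]])
y = np.array([1.0, -1.0, 2.0])
x0 = np.ones(5)

lam = 1.0 / np.linalg.norm(A, 2) ** 2   # stepsize in ]0, 2 / ||A^* A||[
x_n = landweber(A, y, x0, lam, n_iter=500)

# Projection of x0 onto argmin f, computed with the pseudo-inverse of A.
x_bar0 = x0 + np.linalg.pinv(A) @ (y - A @ x0)

error = np.linalg.norm(x_n - x_bar0)
print(error)
```

The iterates approach the minimizer closest to $x_{0}$ , not the minimal norm solution $x^{\dagger}$ , unless $x_{0}=0$ . The fast decay of the error in this example relies on the smallest singular value of `A` being positive, which is precisely what fails for compact operators with infinite-dimensional range.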

5.1.3 Regularity spaces and source sets

Definition 5.3 (Regularity space and source set).

1. Given $(\nu,\delta)\in\left]0,+\infty\right[\times\left]0,+\infty\right[$ , the data regularity space and the data source set are respectively defined as:

[TABLE]

2. Given $(\mu,\delta)\in\left]-1/2,+\infty\right[\times\left]0,+\infty\right[$ , the regularity space and the source set are respectively defined as:

[TABLE]

where $A^{-1}$ denotes the preimage of a set under the application $A$ .

Proposition 5.4.

i) $\mbox{\rm argmin\,}f=\emptyset$ if and only if $X_{\mu}=\emptyset$ for all $\mu\in\left[0,+\infty\right[$ .

ii) $\mbox{\rm argmin\,}f\neq\emptyset$ if and only if $X_{\mu}=X$ for all $\mu\in\left]-1/2,0\right]$ .

iii) Assume $R(A)$ is closed. Then $X_{\mu}=X$ for all $\mu\in\left]-1/2,+\infty\right[$ .

Proof.

Given any $x\in X$ , observe that $x\in X_{0}$ is, by definition, equivalent to $Ax\in Y_{1/2}$ . Since $R(A)=R({AA^{*}})^{1/2}$ , the latter is equivalent to $Ax\in y^{\dagger}+R(A)$ . We can then easily deduce, using also Proposition 5.1, that $X_{0}=X\Leftrightarrow X_{0}\neq\emptyset\Leftrightarrow y^{\dagger}\in R(A)\Leftrightarrow\mbox{\rm argmin\,}f\neq\emptyset.$ For items i and ii, the claim follows directly from the nonincreasingness of $\{X_{\mu}\}_{-1/2<\mu<+\infty}$ . For item iii, observe that for all $\nu>0$ , $\mbox{\rm spec}((AA^{*})^{\nu})=\mbox{\rm spec}(AA^{*})^{\nu}$ [56, §32 Theorem 3]. As a consequence of Proposition 5.2, we deduce that $R(AA^{*})^{\nu}$ is closed, and therefore $R(AA^{*})^{\nu}=R(A)$ (see Lemma A.14 in the Annex). In particular, $Y_{\nu}=y^{\dagger}+R(A)$ for all $\nu\in\left]0,+\infty\right[$ , and the result follows from item ii. ∎

For well-posed problems, for which $\mbox{\rm argmin\,}f\neq\emptyset$ (and $x^{\dagger}$ exists), the source sets admit a simpler expression (the proof is deferred to the Annex):

Lemma 5.5 (Source sets for well-posed problems).

Assume that $\mbox{\rm argmin\,}f\neq\emptyset$ . Then, for all $\mu>0$ and $\delta>0$ :

[TABLE]

Remark 5.6.

Given that $x^{\dagger}\in\ker A^{\perp}$ , we see that the classical conditions in (29) and (31) are equivalent, with our notations, to $0\in X_{\mu}$ and $0\in X_{\nu-1/2}$ . This means in particular that (29) is just a particular case of (31).

Remark 5.7 (Source sets as balls).

Assume that $A$ is injective and $y\in R(A)$ . For all $\mu>0$ , $R(A^{*}A)^{\mu}$ is a dense subspace of $X$ (Lemma A.14), and we can endow it with the norm induced by the unbounded operator $(A^{*}A)^{-\mu}$ , defined as $\|x\|_{\mu}:=\inf\{\|w\|\ |\ w\in X\text{ and }x=(A^{*}A)^{\mu}w\}$ . Then, we see that the source sets $X_{\mu,\delta}$ are nothing but balls centered at the solution $x^{\dagger}$ , with respect to this norm:

[TABLE]

while $X_{\mu}$ is the affine space spanned by these balls. By analogy with the following example, the reader can think of this norm $\|\cdot\|_{\mu}$ in $X$ as a Sobolev norm in an $L^{2}$ space. Note that these balls may have an empty interior with respect to the topology of $X$ .

Example 5.8 (Regularity spaces as Sobolev spaces).

Assume that $X$ is the space of zero mean $L^{2}$ -functions on $[0,2\pi]$ :

[TABLE]

If $A$ is the linear integration operator defined on $X$ , then $R(A^{*}A)^{\mu}$ coincides with the Sobolev space $H^{2\mu}([0,2\pi])\cap X$ [59, Theorem 6.4], so that the regularity space is here

[TABLE]

5.1.4 Properties of quadratic functions on source sets

Here is the main result of this section: on each source set $X_{\mu,\delta}$ , the least squares functional $f$ is $p$ -Łojasiewicz with $p=2+\mu^{-1}$ .

Theorem 5.9 (Geometry of least squares on source sets).

Let $\mu\in\left]-1/2,0\right[\cup\left]0,+\infty\right[$ and $\delta\in\left]0,+\infty\right[$ . Then $f(x)=\frac{1}{2}\|Ax-y\|^{2}$ is $p$ -Łojasiewicz on $\Omega=X_{\mu,\delta}$ , with

[TABLE]

Moreover, these two constants are sharp.

Proof.

Let $x\in X_{\mu,\delta}$ and recall that $y^{\dagger}=\mbox{\rm proj}(y,\mbox{\rm cl\,}R(A))$ . From Definition 5.3 and the definition of $y^{\dagger}$ , we get

[TABLE]

We first prove that $f$ verifies the Łojasiewicz inequality by using the interpolation inequality (see Lemma A.13 in the Annex) with $\alpha=\mu+(1/2)$ and $\beta=\mu+1$ , together with (34):

[TABLE]

We use (34) in the right-hand side of (36), to write

[TABLE]

By combining (34), (35), (36) and (37), we obtain the following inequality

[TABLE]

Then the desired Łojasiewicz inequality holds by taking $p:=2+\mu^{-1}$ . Now we verify that the obtained constants in (33) are sharp. For this, let $X=\ell^{2}(\mathbb{N})$ , and let $(e_{k})_{k\in\mathbb{N}}\subset X$ be its canonical basis. Let $(\sigma_{k})_{k\in\mathbb{N}}$ be a strictly positive sequence converging to zero, and define $A:X\longrightarrow X$ as follows: $\forall x=(x_{k})_{k\in\mathbb{N}}\in X,Ax:=\sum_{k\in\mathbb{N}}\sigma_{k}x_{k}e_{k}$ . Let $f(x)=(1/2)\|Ax\|^{2}$ , $y=0$ , and let us assume that $f$ is $p$ -Łojasiewicz on $X_{\mu,\delta}$ for some $p\geq 1$ :

[TABLE]

Let $v^{k}:=\delta\sigma_{k}^{2\mu}e_{k}\in X_{\mu,\delta}$ , which satisfies $\|A^{*}Av^{k}\|=\delta\sigma_{k}^{2+2\mu}$ , and deduce from (38) that

[TABLE]

It follows from $\sigma_{k}\to 0$ that $4\mu-2\mu p+2\leq 0$ , which is equivalent to $\mu p\geq 2\mu+1$ . If $\mu>0$ , it means that $p\geq 2+\mu^{-1}>0$ , which is a regime in which the smaller $p$ is, the better. If $\mu\in]-1/2,0[$ , then $p\leq 2+\mu^{-1}<0$ , which is a regime in which the larger $p$ is, the better. In both cases we see that $2+\mu^{-1}$ is the best possible exponent. Moreover, when $p=2+\mu^{-1}$ , (39) becomes $2^{-\frac{1+\mu}{1+2\mu}}\delta^{\frac{1}{1+2\mu}}\leq c_{f,X_{\mu,\delta}},$ which implies the sharpness of the constant obtained in (33). ∎
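The diagonal operator used in the sharpness argument also lends itself to a quick numerical sanity check. The sketch below rests on several assumptions that should be kept in mind: the Łojasiewicz inequality is taken in the form $(f(x)-\inf f)^{1-1/p}\leq c\,\|\nabla f(x)\|$ , the constant is taken to be $c=2^{-\frac{1+\mu}{1+2\mu}}\delta^{\frac{1}{1+2\mu}}$ (the lower bound appearing at the end of the proof), the sequence $\sigma_{k}=1/k$ is truncated to 200 terms, and the source set is described as $\{(A^{*}A)^{\mu}w\ |\ \|w\|\leq\delta\}$ , which corresponds to $y=0$ , $A$ injective and $\mu>0$ .

```python
import numpy as np

mu, delta = 0.5, 2.0
p = 2.0 + 1.0 / mu
# Constant suggested by the sharpness computation (assumed to be the one of the theorem).
c = 2.0 ** (-(1.0 + mu) / (1.0 + 2.0 * mu)) * delta ** (1.0 / (1.0 + 2.0 * mu))

sigma = 1.0 / np.arange(1, 201)          # truncated sequence sigma_k -> 0

def f(x):
    return 0.5 * np.sum((sigma * x) ** 2)

def grad_norm(x):
    return np.linalg.norm(sigma ** 2 * x)  # ||A^* A x||

def lojasiewicz_gap(x):
    """c * ||grad f(x)|| - f(x)^(1 - 1/p); nonnegative if the inequality holds."""
    return c * grad_norm(x) - f(x) ** (1.0 - 1.0 / p)

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    w = rng.normal(size=sigma.size)
    w *= delta * rng.uniform() / np.linalg.norm(w)   # ||w|| <= delta
    x = sigma ** (2.0 * mu) * w                      # x in the source set
    gaps.append(lojasiewicz_gap(x))

# Extremal points v^k = delta * sigma_k^(2 mu) * e_k from the proof.
extremal_gaps = []
for k in (0, 9, 199):
    v = np.zeros(sigma.size)
    v[k] = delta * sigma[k] ** (2.0 * mu)
    extremal_gaps.append(lojasiewicz_gap(v))

print(min(gaps), extremal_gaps)
```

Under these assumptions the gap is nonnegative on random points of the source set and vanishes, up to rounding, on the points $v^{k}$ , which is consistent with the constant being attained along this sequence.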

Remark 5.10.

The result of Theorem 5.9 contrasts with [53, Theorem 2.1], in which the authors show that no local Łojasiewicz property can be satisfied by a quadratic function when $R(A)$ is not closed. The key difference here is that we look at the Łojasiewicz property on specific dense sets with empty interior (see Remark 5.7).

Let us now verify that the source sets are invariant under the action of the Landweber algorithm (28). As mentioned at the beginning of the section, the Landweber algorithm is the gradient descent algorithm applied to a quadratic function, and therefore it is an instance of the FB algorithm. We can thus apply the convergence rates of Section 4 once we prove that the source sets are invariant.

Proposition 5.11 (Invariance of source sets).

For all $(\mu,\delta)\in\left]-1/2,+\infty\right[\times\left]0,+\infty\right[$ , the source set $X_{\mu,\delta}$ is FB-invariant.

Proof.

Let $x\in X_{\mu,\delta}$ , $\lambda\in\left]0,2/\|A^{*}A\|\right[$ , and let us prove that $T_{\lambda}x=x-\lambda A^{*}(Ax-y)$ belongs to $X_{\mu,\delta}$ . By using Lemma 5.5, we deduce that $Ax=y^{\dagger}+(AA^{*})^{\nu}\omega$ , $\nu:=\mu+1/2$ , and $\omega\in\mbox{\rm cl\,}R(A)$ with $\|\omega\|\leq\delta$ . Since $A^{*}(Ax-y)=A^{*}(Ax-y^{\dagger})$ , this implies that

[TABLE]

The above equality shows that $T_{\lambda}x\in X_{\mu}$ . It remains only to prove that $\hat{T}_{\lambda}\omega:=(I-\lambda AA^{*})\omega$ verifies $\hat{T}_{\lambda}\omega\in\mbox{\rm cl\,}R(A)$ and $\|\hat{T}_{\lambda}\omega\|\leq\delta$ . The condition $\hat{T}_{\lambda}\omega\in\mbox{\rm cl\,}R(A)$ immediately follows from $\omega\in\mbox{\rm cl\,}R(A)$ and $AA^{*}\omega\in R(A)$ . Next, observe that $\hat{T}_{\lambda}\omega$ is obtained by applying a gradient descent step to $\omega$ with respect to the function $u\mapsto(1/2)\|A^{*}u\|^{2}$ . Since this function has zero as a minimizer, and is differentiable with a $\|A^{*}A\|$ -Lipschitz gradient, the Fejér property (see Theorem 2.2-ii) implies that $\|\hat{T}_{\lambda}\omega\|\leq\|\omega\|\leq\delta$ . ∎

Next we combine all the results of this section to derive convergence rates of the Landweber algorithm under source conditions from Łojasiewicz conditions.

Corollary 5.12 (Convergence rates for Landweber algorithm).

Let $(x_{n})_{n\in\mathbb{N}}$ be a sequence generated by the Landweber algorithm (28). Assume that for some $\mu\in\left]-1/2,+\infty\right[$ , the source condition $x_{0}\in X_{\mu}$ is satisfied. Then:

i)

$f(x_{n})-\inf f=O({n^{-(1+2\mu)}})$ ,

ii)

If $\mu>0$ , then $\|x_{n}-\bar{x}_{0}\|=O(n^{-\mu})$ , where $\bar{x}_{0}:=\mbox{\rm proj}(x_{0},\mbox{\rm argmin\,}f)$ .

Proof.

For item i, the source condition together with Proposition 5.11 implies $(x_{n})_{n\in\mathbb{N}}\subset X_{\mu,\delta}$ for some $\delta>0$ . If $\mu\neq 0$ , we derive from Theorem 5.9 that $f$ is $2+\mu^{-1}$ -Łojasiewicz on $X_{\mu,\delta}$ . Depending on the sign of $2+\mu^{-1}$ , the rates on $f(x_{n})-\inf f$ follow from Theorems 4.1 and 4.6. If $\mu=0$ , then the source condition and Proposition 5.4 ensure that $y^{\dagger}\in R(A)$ , meaning that $\mbox{\rm argmin\,}f\neq\emptyset$ , so the rate $O(n^{-1})$ follows from Theorem 2.2. For item ii, the convergence of the iterates and the corresponding rates follow from Theorem 4.1. To show that the limit of the sequence (let us denote it by $x_{\infty}$ ) is $\bar{x}_{0}$ , it is enough to verify that $x_{\infty}-x_{0}\in\ker A^{\perp}$ , since $\mbox{\rm argmin\,}f$ is an affine space parallel to $\ker A$ . By definition of the algorithm, it is easy to show by induction that $x_{n}-x_{0}\in R(A^{*})$ . This being true for all $n\in\mathbb{N}$ , we can pass to the limit and deduce that $x_{\infty}-x_{0}\in\mbox{\rm cl\,}R(A^{*})=\ker A^{\perp}$ . ∎
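The rate in item i can be observed on a toy problem. The following is a minimal sketch, assuming a finite-dimensional truncation of the diagonal operator $Ax=\sum_{k}\sigma_{k}x_{k}e_{k}$ with $y=0$, and assuming that the source condition reads $x_{0}=(A^{*}A)^{\mu}w$ for some $w$. The helper name `landweber_values` and the chosen parameters are illustrative only.

```python
def landweber_values(sigmas, w, mu, lam, n_iters):
    """Return [f(x_0), ..., f(x_{n_iters})] for the Landweber iteration.

    A = diag(sigmas), y = 0, f(x) = (1/2)||Ax||^2, x_0 = (A*A)^mu w,
    and x_{n+1} = x_n - lam * A*(A x_n - y).
    """
    # (A*A)^mu is diagonal with entries sigma^(2 mu)
    x = [s ** (2.0 * mu) * wk for s, wk in zip(sigmas, w)]
    values = []
    for _ in range(n_iters + 1):
        values.append(0.5 * sum((s * xk) ** 2 for s, xk in zip(sigmas, x)))
        # gradient step on f: A*(A x) has entries sigma^2 * x_k
        x = [xk - lam * s * s * xk for s, xk in zip(sigmas, x)]
    return values


if __name__ == "__main__":
    mu, lam = 0.5, 1.0                       # lam < 2 / ||A*A|| = 2
    sigmas = [1.0 / k for k in range(1, 201)]
    w = [1.0 / k for k in range(1, 201)]
    vals = landweber_values(sigmas, w, mu, lam, 300)
    for n in (3, 30, 300):
        # scaled values n^(1+2mu) * f(x_n) should stay bounded
        print(n, vals[n] * n ** (1.0 + 2.0 * mu))
```

In this diagonal setting the bound $f(x_{n})\leq\frac{1}{2}\|w\|^{2}\big(\frac{1+2\mu}{2\lambda n}\big)^{1+2\mu}$ follows by maximizing $s\mapsto s^{1+2\mu}(1-\lambda s)^{2n}$ over $[0,1/\lambda]$, which is consistent with the $O(n^{-(1+2\mu)})$ rate of the corollary.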

5.2 Sparsity based regularization, partial smoothness, and restricted injectivity

In this section we turn to the general case of optimization problems coming from a regularized inverse problem (27). In particular, we focus on the case where $\nabla^{2}h$ verifies a restricted injectivity condition at a solution, a situation which typically arises when $g$ is mirror-stratifiable, and typical modeling assumptions from the inverse problems/compressed sensing literature hold. In this setting we will be able to derive the 2-conditioning of the objective function in (27). In what follows, we will use the notation $\mathcal{S}_{+}(X)$ to refer to the set of bounded selfadjoint positive linear operators on $X$ .

5.2.1 Coercive linear operators on a cone

Definition 5.13.

We say that $K\subset X$ is a cone if it is a union of rays: $[0,+\infty[K\subset K$ .

Note that we do not require a cone to be convex. This is important for certain applications in which we have geometrical information about a function over a union of linear spaces, see for instance (40) in the context of sparse regularization problems.

Definition 5.14.

Let $S\in\mathcal{S}_{+}(X)$ , let $\gamma\in\left]0,+\infty\right[$ , and let $K\subset X$ be a cone. We say that $S$ is $\gamma$ -coercive on $K$ if, for all $d\in K$ , $\langle Sd,d\rangle\geq\gamma\|d\|^{2}$ .

Example 5.15 (coercivity for positive symmetric matrices).

A matrix $S\in\mathcal{S}_{+}(\mathbb{R}^{N})$ is coercive on a closed cone $K\subset\mathbb{R}^{N}$ if and only if $S$ is injective when restricted to $K$ (see Proposition A.15 for a proof):

[TABLE]

Example 5.16.

Any operator $S\in\mathcal{S}_{+}(X)$ is $\sigma_{\inf}(S)$ -coercive on ${\mbox{Ker}}~{}S^{\perp}$ (see e.g. the proof in [30, Thm. 4]). In particular, if $S$ is positive definite then it is $\sigma_{\inf}(S)$ -coercive on $X$ .

In the next proposition we relate the coercivity of the Hessian of a function $f$ on a cone to the $2$ -conditioning of $f$ on this cone. This relation can be seen as a weakened analogue of the well known fact (see [11, Prop. 10.8 & 17.7.(iii)]) that, for $f\in C^{2}(X)$ :

$f$ is $\gamma$ -strongly convex $\Leftrightarrow$ ( $\forall x\in X$ ) $\nabla^{2}f(x)$ is $\gamma$ -coercive on $X$ .

Strong convexity is a global notion, which requires the function to have a positive definite quadratic-like geometry at each $x\in X$ . In contrast, $2$ -conditioning only requires the function to have a positive quadratic-like geometry on a given set $\Omega$ . We now state our result (its proof is deferred to Appendix A.5). For similar results, see also [19, Section 3.3.1] and [41].

Proposition 5.17 (Coercivity of the Hessian implies $2$ -conditioning).

Let $f=g+h$ with $g,h\in\Gamma_{0}(X)$ and $\mbox{\rm argmin\,}f\neq\emptyset$ . Assume that $h$ is of class $C^{2}$ in a neighbourhood of $\bar{x}\in\mbox{\rm argmin\,}f$ , and that $\nabla^{2}h(\bar{x})$ is $\gamma$ -coercive on a closed cone $K\subset X$ . Then,

[TABLE]

and $\Omega\cap\mbox{\rm argmin\,}f=\{\bar{x}\}$ . If $h\in C^{2}(X)$ and $\nabla^{2}h$ is $L$ -Lipschitz, we can take $\delta=\frac{\gamma-\gamma^{\prime}}{L}$ .

5.2.2 Conditioning on prox-regular sets via restricted injectivity of the Hessian

Let us define some useful tools from variational analysis. The notion of reached set (or set with positive reach) was introduced by Federer [44, Def. 4.1], and later extended to prox-regularity (see Proposition A.19 and [93]).

Definition 5.18.

Let $C\subset\mathbb{R}^{N}$ . The (Bouligand) tangent cone to $C$ at $\bar{x}\in C$ is defined as

[TABLE]

The normal cone to $C$ at $\bar{x}$ is $N_{C}(\bar{x}):=\{\eta\in\mathbb{R}^{N}\ |\ (\forall d\in T_{C}(\bar{x}))\ \langle\eta,d\rangle\leq 0\}$ .

Definition 5.19.

Let $C\subset\mathbb{R}^{N}$ , and $\rho>0$ . We say that $C$ is $\rho$ -reached at $\bar{x}\in C$ , if it is locally closed at $\bar{x}$ , and verifies

[TABLE]

We say that $C$ is prox-regular at $\bar{x}$ if there exists $\rho>0$ and a closed neighbourhood $U$ of $\bar{x}$ such that $C\cap U$ is $\rho$ -reached at any $x\in U$ . We say further that $C$ is prox-regular if it is prox-regular at every $\bar{x}\in C$ .

Convex sets, and in particular affine spaces, are prox-regular. Manifolds of class $C^{2}$ are locally prox-regular (see Proposition A.19).

We now provide the result at the core of this section, which says that if a minimizer $\bar{x}$ belongs to some prox-regular set, and if the Hessian $\nabla^{2}h(\bar{x})$ is injective when restricted to the tangent cone of this set, then $f$ is $2$ -conditioned on this set around $\bar{x}$ . This will guarantee asymptotic linear rates when combined with Corollary 4.15.

Theorem 5.20 (Injective Hessian on tangent cone implies $2$ -conditioning).

Let $g,h\in\Gamma_{0}(\mathbb{R}^{N})$ , and $f=g+h$ . Assume that there exists some $\bar{x}\in\mbox{\rm argmin\,}f$ such that:

a)

$\bar{x}$ belongs to some $C\subset\mathbb{R}^{N}$ which is $\rho$ -reached at $\bar{x}$ ,

b)

$h$ is of class $C^{2}$ in a neighbourhood of $\bar{x}$ ,

c)

$\nabla^{2}h(\bar{x})$ is $\gamma$ -coercive on $T_{C}(\bar{x})$ .

Then ${\rm{argmin}}~{}f_{|C}=\{\bar{x}\}$ , and for every $\gamma^{\prime}\in]0,\gamma[$ , there exists $\delta\in]0,+\infty]$ such that $f$ is $2$ -conditioned on $\Omega:=C\cap\mathbb{B}(\bar{x},\delta)$ , with $\gamma_{f,\Omega}=\gamma^{\prime}$ . If we assume moreover that $\nabla^{2}h$ is $L$ -Lipschitz continuous, then we can take $\delta=\frac{2(\gamma-\gamma^{\prime})}{2L+\rho\|\nabla^{2}h(\bar{x})\|}$ .

Proof.

Let $K:=T_{C}(\bar{x})$ . Using Proposition A.20, we see that for every $\gamma^{\prime}<\gamma$ there exists a $\theta\in]0,\frac{\pi}{2}[$ such that the enlarged cone $K_{\theta}$ (see Definition A.16) contains $(C-\bar{x})\cap\mathbb{B}(0,\delta)$ for $\delta>0$ small enough, and such that $\nabla^{2}h(\bar{x})$ is $\gamma^{\prime}$ -coercive on $K_{\theta}$ . The conclusion of the claim follows from Proposition 5.17 applied to $h$ and $K_{\theta}$ . Under the additional assumption that $\nabla^{2}h$ is $L$ -Lipschitz, take any $\gamma^{\prime}\in]0,\gamma[$ , and let $\gamma^{\prime\prime}:=\alpha\gamma+(1-\alpha)\gamma^{\prime}$ , with $\alpha=2L/(2L+\rho\|\nabla^{2}h(\bar{x})\|)$ . Using again Proposition A.20, we obtain that $\nabla^{2}h(\bar{x})$ is $\gamma^{\prime\prime}$ -coercive on some cone $K_{\theta}$ , with $\bar{x}+K_{\theta}\supset C\cap\mathbb{B}(\bar{x},\delta_{1})$ and $\delta_{1}=2(\gamma-\gamma^{\prime\prime})/(\rho\|\nabla^{2}h(\bar{x})\|)$ . Then, Proposition 5.17 shows that $f$ is $2$ -conditioned on $\Omega=\bar{x}+K_{\theta}\cap\mathbb{B}(\bar{x},\delta_{2})$ , with $\delta_{2}=(\gamma^{\prime\prime}-\gamma^{\prime})/L$ and $\gamma_{f,\Omega}=\gamma^{\prime}$ . The conclusion follows by seeing that $\delta_{1}=\delta_{2}$ with our choice of $\gamma^{\prime\prime}$ . ∎
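The last step of the proof, namely that $\delta_{1}=\delta_{2}$ for the chosen $\gamma^{\prime\prime}$, can be spelled out as follows. This is only a worked verification of the algebra; the shorthand $H:=\nabla^{2}h(\bar{x})$ is introduced here for readability and is not notation from the paper.

```latex
% Shorthand (not in the original): H := \nabla^2 h(\bar x),
% \alpha = 2L / (2L + \rho\|H\|), hence 1 - \alpha = \rho\|H\| / (2L + \rho\|H\|).
\begin{align*}
\gamma - \gamma'' &= (1-\alpha)(\gamma - \gamma'), \\
\gamma'' - \gamma' &= \alpha(\gamma - \gamma'), \\
\delta_1 = \frac{2(\gamma - \gamma'')}{\rho\|H\|}
  &= \frac{2(1-\alpha)(\gamma - \gamma')}{\rho\|H\|}
   = \frac{2(\gamma - \gamma')}{2L + \rho\|H\|}, \\
\delta_2 = \frac{\gamma'' - \gamma'}{L}
  &= \frac{\alpha(\gamma - \gamma')}{L}
   = \frac{2(\gamma - \gamma')}{2L + \rho\|H\|}.
\end{align*}
```

Both radii therefore equal the value of $\delta$ announced in the statement of Theorem 5.20.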

Theorem 5.20 can be used in combination with Corollary 4.15: in this case we obtain that the restricted injectivity of the Hessian on the tangent cone to the active set $C_{\bar{x}}$ guarantees asymptotic linear rates. In the example below, we detail what our assumptions mean for the examples in Example 4.16.

Example 5.21.

•

If $g(x)=\|x\|_{1}$ , the active set (20) is an open and dense subset of the vector space $X_{I}=\{x\in\mathbb{R}^{N}\ |\ \mbox{\rm supp}(x)\subset I\}$ with $I=\mbox{\rm act}(-\nabla h(\bar{x}))$ . It is therefore $\rho$ -reached for every $\rho>0$ , and $T_{C_{\bar{x}}}(\bar{x})=X_{I}$ .

•

If $g(x)=\|x\|_{*}$ , let $r=\#\mbox{\rm act}(\sigma(-\nabla h(\bar{x})))$ and let $\mathcal{M}_{r}$ be the manifold of matrices with rank equal to $r$ . If $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ , the active set $C_{\bar{x}}$ (see (21)) is equal to $\mathcal{M}_{r}$ . In particular, it is prox-regular (see Proposition A.19), and an expression for its tangent space can be found in [69, Example 2.2]. More generally, $C_{\bar{x}}$ is locally prox-regular at $\bar{x}$ if $\mbox{\rm rank\,}(\bar{x})=r$ . To see this, use the same arguments as in [80, Prop. 3.1]: the fact that the singular values depend continuously on the matrix allows one to find a neighbourhood $U$ of $\bar{x}$ where the matrices have rank greater than or equal to $r$ . This means that $C_{\bar{x}}\cap U=\mathcal{M}_{r}\cap U$ , which is prox-regular.

Remark 5.22 (Related results with partial smoothness).

While our results are new in the setting of mirror-stratifiable functions (where no condition $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ is required), they intersect with existing results when $g$ is partially smooth with respect to an active manifold $\mathcal{M}$ . It is shown in [75] that the $\gamma$ -coercivity of $\nabla^{2}h(\bar{x})$ on the tangent space $T_{\mathcal{M}}(\bar{x})$ guarantees asymptotic linear rates. We recover a similar result by combining [54, Theorem 5.3] with Theorem 5.20 and Theorem 2.2. For a fixed stepsize $\lambda=1/L$ , [75, Thm. 3.1] predicts a Q-linear rate arbitrarily close to $\sqrt{2(1-\kappa)}$ (where $\kappa=\gamma/L$ ) provided that $\kappa\geq 1/2$ . Instead, our results predict an R-linear rate arbitrarily close to $(1+(\kappa/4))^{-1/2}$ , without any condition on $\kappa$ . Note that our constant is worse (resp. better) than $\sqrt{2(1-\kappa)}$ when $\kappa$ is close to $1$ (resp. $1/2$ ). Note also that the partial smoothness of $g$ together with [54, Theorem 6.2.ii)] ensures that $f$ is $2$ -conditioned on a neighbourhood $\Omega$ of the solution, with $\gamma_{f,\Omega}=\gamma^{\prime}$ , meaning that we can use Proposition 4.19 to obtain Q-linear rates arbitrarily close to $(1+\kappa)^{-1/2}$ .
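The comparison between the three rate constants quoted in this remark can be made concrete with a few evaluations. The sketch below simply tabulates the formulas stated above; the function names are hypothetical.

```python
import math


def q_rate_partial_smoothness(kappa):
    """Rate sqrt(2(1 - kappa)) quoted from [75, Thm. 3.1], stated for kappa >= 1/2."""
    if not 0.5 <= kappa <= 1.0:
        raise ValueError("the quoted rate is only stated for kappa in [1/2, 1]")
    return math.sqrt(2.0 * (1.0 - kappa))


def r_rate_conditioning(kappa):
    """R-linear rate (1 + kappa/4)^(-1/2) obtained from 2-conditioning on the active set."""
    return (1.0 + kappa / 4.0) ** -0.5


def q_rate_conditioning(kappa):
    """Q-linear rate (1 + kappa)^(-1/2) obtained via Proposition 4.19."""
    return (1.0 + kappa) ** -0.5


if __name__ == "__main__":
    for kappa in (0.5, 0.6, 0.8, 0.99):
        print(kappa,
              q_rate_partial_smoothness(kappa),
              r_rate_conditioning(kappa),
              q_rate_conditioning(kappa))
```

Smaller values mean faster convergence, so the table makes visible the regime near $\kappa=1/2$ where the conditioning-based constants are smaller, and the regime near $\kappa=1$ where $\sqrt{2(1-\kappa)}$ is smaller.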

5.2.3 Application to low-complexity inverse problems

Let $f\in\Gamma_{0}(\mathbb{R}^{N})$ be defined, for every $x\in\mathbb{R}^{N}$ , by $f(x)=\alpha\|x\|_{1}+(1/2)\|Ax-y\|^{2}$ . The function $f$ is the sum of a smooth function, with Hessian equal to $A^{*}A$ , and a nonsmooth function $\alpha\|x\|_{1}$ . Example 3.9 ensures that $f$ is locally $2$ -conditioned on its sublevel sets without any assumption on $A$ . This means, according to Theorem 4.1, that for any $r>\inf f$ , and any $x_{0}\in[f<r]$ , there exists a constant $\varepsilon\in]0,1[$ such that the iterative soft-thresholding algorithm initialized at $x_{0}$ verifies $f(x_{n+1})-\inf f\leq\varepsilon(f(x_{n})-\inf f)$ . Nevertheless, expressing the $2$ -conditioning constant, or $\varepsilon$ , in terms of the components of the problem is far from easy [17]. One way to recover a meaningful constant is to exploit modeling assumptions which are usually made to ensure the stability and recovery of the inverse problem $Ax=y$ .

Suppose that we are given the sequence generated by the iterative soft-thresholding, which converges to a minimizer of $f$ , $x_{n}\rightarrow\bar{x}$ . It is known that, after some iterations, the support of the sequence is stable [76, 49]:

[TABLE]

In particular, if the qualification condition $0\in\mbox{\rm ri\,}\partial f(\bar{x})$ holds, we can take $I=\mbox{\rm supp}(\bar{x})$ [76, Prop. 3.6]. To estimate the rates of convergence for the sequence, it is then sufficient to make a restricted injectivity assumption on the matrix $A$ , depending on the knowledge we have on $I$ .

In the case where we have access to $I$ , suppose that on the space $X_{I}:=\{x\in\mathbb{R}^{N}\ |\ \mbox{\rm supp}(x)\subset I\}$ the matrix $A$ is injective, i.e. $\operatorname{Ker}A\cap X_{I}=\{0\}$ holds. Then, there exists a constant $\gamma_{I}>0$ such that $A^{*}A$ is $\gamma_{I}$ -coercive on $X_{I}$ (see Example 5.15), which implies via Proposition 5.17 that $f$ is $2$ -conditioned on $X_{I}$ , with $\gamma_{f,X_{I}}=\gamma_{I}$ . We then deduce that, asymptotically, the rates are governed by $\varepsilon=(1+\gamma_{I}\|A^{*}A\|^{-1})^{-1}$ . It might happen that instead of knowing $I$ , we only have access to partial information via the sparsity level $s:=|I|$ . We can then follow the same reasoning with the (nonconvex) cone $K_{s}:=\{x\in\mathbb{R}^{N}\ |\ |\mbox{\rm supp}(x)|\leq s\}$ instead of $X_{I}$ . In that case, the constant $\gamma_{s}$ of coercivity of $A^{*}A$ on $K_{s}$ is defined by

[TABLE]

and guarantees linear rates governed by $\varepsilon=(1+\gamma_{s}\|A^{*}A\|^{-1})^{-1}$ , using again Proposition 5.17. Such an assumption is classical in sparsity based regularization, where it is related to the so-called Restricted Isometry Property [25] and is used to ensure uniqueness of the minimizer and guarantee the robustness or recovery [99, 26]. Observe that while the computation of $\gamma_{s}$ remains impractical [9], it is meaningful with respect to the properties of our problem, and, more importantly, can be estimated when the matrix $A$ is random [47, Section 9]. Of course, this whole discussion can be extended to other regularized inverse problems, in particular if $\|\cdot\|_{1}$ is replaced by a mirror-stratifiable function. In this case we will use Theorem 5.20 instead of Proposition 5.17 to derive linear rates.
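The discussion above can be illustrated on a deliberately simple instance. The sketch below runs iterative soft-thresholding with stepsize $1/\|A^{*}A\|$ under the simplifying assumption that $A$ is diagonal, so that $\gamma_{I}=\min_{i\in I}a_{i}^{2}$ and $\|A^{*}A\|=\max_{i}a_{i}^{2}$ are available in closed form. The data and the helper names (`ista`, `predicted_rate`, and so on) are hypothetical; a general matrix would require an eigenvalue computation that is not shown here.

```python
def soft_threshold(t, tau):
    """Proximal operator of tau * |.| evaluated at the scalar t."""
    if t > tau:
        return t - tau
    if t < -tau:
        return t + tau
    return 0.0


def objective(x, a, y, alpha):
    """f(x) = alpha * ||x||_1 + (1/2) ||Ax - y||^2 with A = diag(a)."""
    return sum(alpha * abs(xi) + 0.5 * (ai * xi - yi) ** 2
               for xi, ai, yi in zip(x, a, y))


def ista(a, y, alpha, n_iters):
    """Iterative soft-thresholding started at 0, with step 1/L, L = max a_i^2."""
    lam = 1.0 / max(ai * ai for ai in a)
    x = [0.0] * len(a)
    iterates = [list(x)]
    for _ in range(n_iters):
        # forward (gradient) step on the quadratic, backward (prox) step on the l1 norm
        x = [soft_threshold(xi - lam * ai * (ai * xi - yi), lam * alpha)
             for xi, ai, yi in zip(x, a, y)]
        iterates.append(list(x))
    return iterates


def predicted_rate(a, support):
    """epsilon = (1 + gamma_I / ||A*A||)^(-1) for A = diag(a)."""
    gamma = min(a[i] ** 2 for i in support)
    L = max(ai * ai for ai in a)
    return 1.0 / (1.0 + gamma / L)


if __name__ == "__main__":
    a = [1.0, 0.8, 0.5, 0.9]      # hypothetical diagonal entries of A
    y = [2.0, 1.5, 1.0, 0.05]     # hypothetical data
    alpha = 0.1
    iterates = ista(a, y, alpha, 40)
    support = [i for i, v in enumerate(iterates[-1]) if v != 0.0]
    print("support:", support, "predicted rate:", predicted_rate(a, support))
```

On this example the support of the iterates is identified after the first step, and the observed decrease of $f(x_{n})-\inf f$ can then be compared with the predicted factor $\varepsilon$.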

6 Conclusion and perspectives

In this paper, we discussed in detail how geometry can be used to improve the rates of the FB method, or more general first-order descent schemes. We characterized the geometry, using tools that are often encountered in practice, like the $p$ -conditioning, and we provided a new sum rule for it. In Figure 6.1 we recall the various rates obtained for the FB method, from the worst case scenario (no minimizers, no assumptions) to the best one (sharp functions).

We have also discussed how those refined results can be obtained by decoupling the geometrical information we have on the function and the localization of the sequence we are looking at. This geometry-based analysis thus reduces the gap between theory and practice, where the observed rates are often better than the ones resulting from a worst case analysis. It moreover shows that linear rates are tightly linked to $2$ -conditioned functions. In addition, we showed how our analysis can be specialized to the inverse problems setting, and allows one to explain typical modeling assumptions in this context, such as source conditions and the restricted injectivity property. It is worth noting that geometrical information such as conditioning or the Łojasiewicz property can be exploited to derive sharper convergence rates for a broader class of functions and/or algorithms than just the forward-backward algorithm [5]. We also emphasize that convexity plays no role in the proofs of Theorems 4.1 and 4.6. Indeed, some of these results were already known for non-convex functions [18, 27, 46]. One of the challenges for the future is to obtain quantitative results concerning the geometry of classes of nonconvex functions. For instance, what can be said about “simple” nonconvex piecewise polynomial functions (see [73] for an answer about the maximum of finitely many polynomials)? Can we estimate the Łojasiewicz exponent of semialgebraic functions, depending on the degree of the polynomials defining their graph? Finally, a last challenge is the application of such geometrical tools to derive precise rates for nondescent methods. First results in this direction, using $2$ -conditioning, are known for inertial methods [85, 77] or stochastic gradient methods [61]. It would be of interest to understand the behavior of these algorithms for other geometries.

Appendix A

A.1 Worst case analysis: proofs of Section 2

The following Lemma contains a detailed proof for the lower bound (7) in Example 2.3, which can also be applied to (5) by using a symmetry argument.

Lemma A.1 (Lower bounds for the proximal algorithm).

Let $p\in]-\infty,0[\cup]2,+\infty[$ , and let $f_{p}\in\Gamma_{0}(\mathbb{R})$ be the function defined by

[TABLE]

If $x_{0}\in\mbox{\rm dom\,}f_{p}\setminus\mbox{\rm argmin\,}f_{p}$ , and $x_{n+1}=\mbox{prox}_{\lambda f_{p}}(x_{n})$ , then for all $n\geq 1$ :

[TABLE]

Proof.

Note that $\mbox{\rm dom\,}f_{p}$ is an open interval, and that $f_{p}$ is infinitely differentiable there. We can then see that $f_{p}$ , $f_{p}^{\prime}$ and $f_{p}^{\prime\prime}$ are non-negative. In particular, we deduce that $f_{p}$ and $f^{\prime}_{p}$ are non-decreasing on $\mbox{\rm dom\,}f_{p}$ .

Let us now take some $x_{0}\in\mbox{\rm dom\,}f_{p}\setminus\mbox{\rm argmin\,}f_{p}$ , and consider the following continuous trajectory

[TABLE]

It is a simple exercise to verify that $x(\cdot)$ is a solution of this differential equation:

[TABLE]

The main step towards proving our lower bound is to show, by induction, that for every $n\in\mathbb{N}$ , $x_{n}\geq x(n\lambda)$ . This is clearly true for $n=0$ , so, let us assume now that this is true for $n\in\mathbb{N}$ , and show that this implies $x_{n+1}\geq x((n+1)\lambda)$ . Start by writing

[TABLE]

On the one hand, $f_{p}^{\prime}$ is non-negative on $\mbox{\rm dom\,}f_{p}$ , and $\dot{x}(t)=-f_{p}^{\prime}(x(t))$ , which means that $x(\cdot)$ is non-increasing. On the other hand, $f_{p}^{\prime}$ is non-decreasing, which means that $(-f_{p}^{\prime}\circ x)$ is non-decreasing. This fact, together with our induction assumption, allows us to write

[TABLE]

Consider now the function $\phi:\mbox{\rm dom\,}f_{p}\to]0,+\infty[$ defined by $\phi(t)=t+\lambda f_{p}^{\prime}(t)$ . It is clearly increasing and bijective onto its image, so its inverse $\phi^{-1}$ is also increasing. We observe moreover that, by definition, the proximal sequence satisfies $x_{n+1}=\phi^{-1}(x_{n})$ . This allows us to write

[TABLE]

This ends the proof of the induction argument.

Observe that, given positive numbers $a,b>0$ , the following inequality holds

[TABLE]

This means that, for all $n\geq 1$ ,

[TABLE]

Passing this inequality through $f_{p}$ (which is non-decreasing) yields the desired result. ∎

A.2 Proofs of Section 3

A.2.1 Invariant sets and proofs of Section 3.1

We provide here a result concerning the equivalence between all the notions in Definition 3.1, for a large class of sets $\Omega\subset X$ . The sets $\Omega$ we will consider are directly related to the gradient flow induced by $\partial f$ . Given $u_{0}\in\mbox{\rm dom\,}f$ , it is known (see [21, Thm 3.1] when $u_{0}\in\mbox{\rm dom\,}\partial f$ , and [21, Thm. 3.2] with [11, Cor. 16.39] when $u_{0}\in\mbox{\rm cl\,}\mbox{\rm dom\,}f$ ) that there exists a unique absolutely continuous trajectory, denoted $u(\cdot;u_{0}):[0,+\infty[\longrightarrow X$ and called the steepest descent trajectory, which satisfies:

[TABLE]

Following [21], we introduce the notion of invariant sets for the flow of $\partial f$ :

Definition A.2.

A set $\Omega\subset X$ is $\partial f$ -invariant if for any $x\in\Omega\cap\mbox{\rm dom\,}\partial f$ and a.e. $t>0$ , $u(t;x)\in\Omega$ holds.

In other words, $\Omega$ is said to be $\partial f$ -invariant if any steepest descent trajectory starting in $\Omega$ remains therein. It is straightforward to see that the intersection of two $\partial f$ -invariant sets is still $\partial f$ -invariant.

Example A.3.

An easy way to construct a $\partial f$ -invariant set is to consider the sublevel set of a Lyapunov function $\psi:X\rightarrow\mathbb{R}\cup\{+\infty\}$ for the gradient flow induced by $\partial f$ . A function is said to be Lyapunov if for any $x\in\mbox{\rm dom\,}f$ , $\psi(u(\cdot;x)):[0,+\infty[\rightarrow\mathbb{R}$ is non-increasing. Classical examples of this kind are:

•

$\Omega=X$ , which is $[\psi<1]$ with $\psi=0$ .

•

$\Omega=[f<r]$ for $r>\inf f$ , which is $[\psi<r]$ with $\psi=f$ (see [21, Thm. 3.2.17]).

•

$\Omega=\mathbb{B}(\bar{x},\delta)$ for $\bar{x}\in\mbox{\rm argmin\,}f$ , $\delta>0$ , which is $[\psi<\delta]$ with $\psi(x)=\|x-\bar{x}\|$ (see [21, Thm. 3.1.7]).

•

$\Omega=\{x\in X\ |\ \|\partial f(x)\|_{\_}<M\}$ for $M>0$ , which is $[\psi<M]$ with $\psi(x)=\|\partial f(x)\|_{\_}$ (see [21, Thm. 3.1.6]).

See [21, Section IV.4] for more details on the subject, as well as [22, 63]. It is also a good exercise to verify that the source sets considered in Proposition 5.11 are $\partial f$ -invariant.

We next prove Proposition 3.3, stating the equivalence between conditioning, metric subregularity and the Łojasiewicz property on $\partial f$ -invariant sets. The proof is based on an argument used in [17, Theorem 5], which relies essentially on a convergence rate property for the continuous steepest descent dynamic (41).

Proof of Proposition 3.3.

Convexity of $f$ and the Cauchy-Schwarz inequality imply

[TABLE]

and so i $\implies$ ii $\implies$ iii. It remains to prove that the Łojasiewicz property implies the conditioning one. So let us assume that $f$ is $p$ -Łojasiewicz on $\Omega$ , which is $\partial f$ -invariant, and fix $x\in\Omega\cap\mbox{\rm dom}^{*}f$ . Define, for all $t\geq 0$ , $\varphi(t):=pc_{f,\Omega}t^{1/p}$ , which is differentiable on $]0,+\infty[$ , and for all $u\in\mbox{\rm dom\,}f$ , $r(u)=f(u)-\inf f$ . Let us lighten the notation by writing $u(\cdot)$ instead of $u(\cdot;x)$ , so that $u(0)=x$ . Because we will need to distinguish the case in which the trajectory converges in finite time, we introduce $T:=\inf\{t\geq 0\ |\ u(t)\in{\rm{argmin}}~{}f\}\in[0,+\infty]$ . Since $x\in\mbox{\rm dom}^{*}f$ and $u(\cdot)$ is continuous, we see that $T>0$ . For every $t\in[0,T[$ , we have $u(t)\notin{\rm{argmin}}~{}f$ , so $u(t)\in\Omega\cap\mbox{\rm dom}^{*}f$ and $r(u(t))\neq 0$ . If $T<+\infty$ , we also have for every $t>T$ that $u(t)=u(T)$ and $\dot{u}(t)=0$ . Now, we write:

[TABLE]

But $\frac{{\rm d}}{{\rm d}\tau}(r\circ u)(\tau)=-\|\dot{u}(\tau)\|^{2}=-\|\partial f(u(\tau))\|_{\_}^{2}$ (see [21]), so that the above equality becomes

[TABLE]

Since we assume $\Omega$ to be $\partial f$ -invariant, we can apply the Łojasiewicz inequality at $u(\tau)\in\Omega\cap\mbox{\rm dom}^{*}f$ for all $\tau\in]0,t[$ , which can be rewritten in this case as $1\leq\varphi^{\prime}(r(u(\tau)))\|\partial f(u(\tau))\|_{\_}.$ This applied to (42) gives us:

[TABLE]

From (43) and the definition of $T$ , we see that $\int_{0}^{+\infty}\|\dot{u}(\tau)\|\ {\rm d}\tau\leq\varphi(r(x))<+\infty$ , meaning that the trajectory $u(\cdot)$ has finite length. As a consequence, it converges strongly to some $\bar{u}$ when $t$ tends to $+\infty$ . Finally, we use on (43) the fact that $\|u(0)-u(t)\|\leq\int_{0}^{t}\|\dot{u}(\tau)\|\ {\rm d}\tau$ , together with the fact that $\bar{u}\in\mbox{\rm argmin\,}f$ (see [21, Thm. 3.11]) to conclude that

[TABLE]

Proof of Proposition 3.4.

i: let $S:=\mbox{\rm argmin\,}f\neq\emptyset$ . Given $\delta>0$ , there exists $M\in]0,+\infty[$ such that

[TABLE]

Since $f$ is $p$ -conditioned on $\Omega$ , we deduce that:

[TABLE]

meaning that $f$ is $p^{\prime}$ -conditioned on $\Omega\cap\delta\mathbb{B}_{X}$ .

ii: the proof follows the same lines as in i. ∎

Proof of Proposition 3.5.

Assume by contradiction that there exists a sequence $(z^{n})_{n\in\mathbb{N}}\subset\Omega$ such that

[TABLE]

Since $\Omega$ is weakly compact, we can assume without loss of generality that $z^{n}$ weakly converges to some $z^{\infty}\in\Omega$ when $n\to+\infty$ . Then, it follows from (43), the boundedness of $(z^{n})_{n\in\mathbb{N}}\subset\Omega$ and the weak lower semi-continuity of $f$ that $f(z^{\infty})-\inf f\leq 0$ , meaning that $z^{\infty}\in\mbox{\rm argmin\,}f$ , contradicting $\Omega\cap\mbox{\rm argmin\,}f=\emptyset$ . ∎

A.2.2 Proofs of Section 3.2

Lemma A.4 (The Łojasiewicz constant for uniformly convex functions).

Let $f\in\Gamma_{0}(X)$ be uniformly convex, of order $p\geq 2$ , with constant $\gamma$ . Then $f$ is $p$ -Łojasiewicz on $X$ , with $c_{f,X}=q^{-1/q}\gamma^{-1/p}$ , where $1=(1/p)+(1/q)$ .

Proof.

Let $x\in\mbox{\rm dom\,}\partial f$ , $\bar{x}\in\mbox{\rm argmin\,}f$ , and $x^{*}\in\partial f(x)$ . By definition of uniformly convex functions

[TABLE]

The right-hand side of the above inequality involves a strictly convex optimization problem, whose unique solution $\bar{u}$ can be determined by using Fermat’s rule:

[TABLE]

Injecting this optimal value in (44) gives, after rearranging the terms,

[TABLE]

and, since $x^{*}$ is arbitrary in $\partial f(x)$ , the result follows after passing this inequality to the power $1-1/p$ . ∎
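For $p=2$ the lemma gives $c_{f,X}=(2\gamma)^{-1/2}$, and the inequality can be tested directly on a quadratic. The sketch below assumes that uniform convexity of order $2$ with constant $\gamma$ coincides with $\gamma$-strong convexity, and uses the hypothetical test function $f(x)=\frac{1}{2}\sum_{i}a_{i}x_{i}^{2}$ with $\gamma=\min_{i}a_{i}$.

```python
import math


def lojasiewicz_constant(p, gamma):
    """Constant q^(-1/q) * gamma^(-1/p) from the lemma, with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return q ** (-1.0 / q) * gamma ** (-1.0 / p)


def lojasiewicz_holds(a, x, tol=1e-12):
    """Check (f(x) - inf f)^(1/2) <= c * ||grad f(x)|| for f(x) = (1/2) sum a_i x_i^2.

    Assumes all a_i > 0, so that inf f = 0 and gamma = min(a).
    """
    c = lojasiewicz_constant(2.0, min(a))
    gap = 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
    grad_norm = math.sqrt(sum((ai * xi) ** 2 for ai, xi in zip(a, x)))
    # small relative slack to absorb rounding in the equality case
    return math.sqrt(gap) <= c * grad_norm * (1.0 + tol)


if __name__ == "__main__":
    a = [0.5, 2.0]
    for x in ([1.0, 0.0], [0.0, 1.0], [3.0, -2.0]):
        print(x, lojasiewicz_holds(a, x))
```

The point $x=(1,0)$ lies along the direction of smallest curvature, where the inequality holds with equality, which suggests that the constant cannot be improved in this example.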

Proof of Example 3.10.ii).

To prove the claim, it is enough to verify the three conditions of [40, Theorem 4.2]. The first condition (boundedness of $\mbox{\rm argmin\,}f$ ) is guaranteed by the fact that $f$ is coercive. Indeed, $h$ is strongly convex, therefore bounded from below, and $g$ is itself coercive. The second condition (dual qualification conditions) follows immediately from the fact that both $h^{*}$ and $g^{*}$ are continuously differentiable. To see this, observe that in this example $g^{*}$ is (up to a constant) $\|\cdot\|_{q}^{q}$ , where $q$ is the conjugate exponent of $p$ : $(1/p)+(1/q)=1$ . Moreover, $h$ being strongly convex means that $h^{*}$ is also continuously differentiable, with $\mbox{\rm dom\,}h^{*}=\mathbb{R}^{M}$ . The third condition (firm convexity) is easy to check for $h$ because it is strongly convex; for $g$ the proof is given in the following lemma. We can then apply [40, Theorem 4.2], which ensures that $f$ is $2$ -conditioned on every compact set. Using again the fact that $f$ is coercive, and therefore has bounded sublevel sets, we conclude that $f$ is $2$ -conditioned on every sublevel set. ∎

A.2.3 Proofs of Section 3.3

Lemma A.5 ( $p$ -powers are $2$ -tilt conditioned when $p\in]1,2]$ ).

Let $p\in]1,2]$ , $u\in\mathbb{R}^{N}$ , and $f:\mathbb{R}^{N}\rightarrow\mathbb{R}$ be defined as $f(x)=\frac{1}{p}\|x\|_{p}^{p}-\langle u,x\rangle$ . Then $f$ is $2$ -conditioned on every bounded subset of $\mathbb{R}^{N}$ .

Proof.

This function is a separable sum, so, without loss of generality, we can assume from here on that $N=1$ (see [40, Lemma 4.4]). Given a real $t\in\mathbb{R}$ , we denote its sign by $s(t)$ , which is equal to $-1$ (resp. $+1$ ) if $t<0$ (resp. $t>0$ ), or $0$ if $t=0$ . Using the convexity and the differentiability of $f$ , together with Fermat’s rule, we see that $f$ admits a unique minimizer $\bar{x}$ , defined by the relations

[TABLE]

If $u=0$ , it is immediate to see that $f$ is $2$ -conditioned on $]-1,1[$ , where the relation $|t|^{2}\leq|t|^{p}$ holds. We therefore assume from now on that $u\neq 0$ , which also means that $\bar{x}\neq 0$ . We now compute (writing $q=p/(p-1)$ )

[TABLE]

meaning that we are looking for an inequality like

[TABLE]

Using the L’Hôpital rule twice allows us to study the following limit:

[TABLE]

Note that our assumption that $\bar{x}\neq 0$ ensures that we can take the derivative of the second numerator around $\bar{x}$ . Since this limit is well-defined and positive, it means that $f$ is $2$ -conditioned on a small enough neighbourhood of $\bar{x}$ . To conclude the proof, it remains to verify that $f$ is $2$ -conditioned on any bounded set. This follows immediately from Proposition 3.5 and the fact that $\mbox{\rm argmin\,}f=\{\bar{x}\}$ . ∎

Lemma A.6 (Kullback-Leibler divergences are $2$ -tilt conditioned).

Let $\bar{x}\in]0,+\infty[^{N}$ , and $f\in\Gamma_{0}(\mathbb{R}^{N})$ be the Kullback-Leibler divergence to $\bar{x}$ :

[TABLE]

Then $f$ is $2$ -tilt-conditioned on every bounded set of $\mathbb{R}^{N}$ .

Proof.

Let $d\in\mathbb{R}^{N}$ , and define the tilted function $\tilde{f}=f+\langle d,\cdot\rangle$ . Using Fermat’s rule, we see that ${\rm{argmin}}~{}\tilde{f}=\partial f^{*}(-d)$ . It is a simple exercise to verify that $\mbox{\rm dom\,}\partial f^{*}=]-\infty,1[^{N}$ , so ${\rm{argmin}}~{}\tilde{f}\neq\emptyset$ if and only if $d\in]-1,+\infty[^{N}$ . Let $d$ be such a vector, and write, for any $x_{i}>0$ :

[TABLE]

Let $X_{i}:=\frac{\bar{x}_{i}}{1+d_{i}}$ , which is well defined under our assumption that $d_{i}>-1$ . Then

[TABLE]

where $a_{i}=X_{i}(1+d_{i})\log(1+d_{i})>0$ . We then observe that ${\rm{argmin}}~{}\tilde{f}_{i}=\{X_{i}\}$ , from which we deduce that ${\rm{argmin}}~{}\tilde{f}=\{X\}$ with $X=(X_{i})_{i=1}^{N}$ .

Now, let $\delta>0$ be fixed, and let $x\in\mathbb{B}(X,\delta)$ . Let $\underline{d}:=\min_{i}d_{i}>-1$ , $c:=N\|X\|_{\infty}$ , and

[TABLE]

For each $i\in\{1,\dots,N\}$ , we have $|x_{i}-X_{i}|\leq\delta$ , so we can use [24, Lem. A.2] on $\tilde{f}_{i}$ to write

[TABLE]

This proves that $\tilde{f}$ is $2$ -conditioned on $\mathbb{B}(X,\delta)$ , which concludes the proof. ∎

A.3 The Forward-Backward algorithm and proofs of Section 4

Definition A.7.

Given a positive real sequence $(r_{n})_{{n\in\mathbb{N}}}$ converging to zero, we say that $r_{n}$ converges:

•

sublinearly (of order $\alpha\in]0,+\infty[$ ) if $\exists C\in]0,+\infty[$ such that $\forall{n\in\mathbb{N}}$ , $r_{n}\leq Cn^{-\alpha}$ ,

•

Q-linearly if $\exists\varepsilon\in]0,1[$ such that $\forall{n\in\mathbb{N}}$ , $r_{n+1}\leq\varepsilon r_{n}$ ,

•

R-linearly if $\exists(s_{n})_{n\in\mathbb{N}}$ Q-linearly converging such that $\forall{n\in\mathbb{N}}$ , $r_{n}\leq s_{n}$ ,

•

Q-superlinearly (of order $\beta\in]1,+\infty[$ ) if $\exists C\in]0,+\infty[$ such that $\forall{n\in\mathbb{N}}$ , $r_{n+1}\leq Cr_{n}^{\beta}$ ,

•

R-superlinearly if $\exists(s_{n})_{n\in\mathbb{N}}$ Q-superlinearly convergent such that $\forall{n\in\mathbb{N}}$ , $r_{n}\leq s_{n}$ .

It is easy to verify that $r_{n}$ is R-superlinearly convergent of order $\beta>1$ if and only if

[TABLE]

Note that $R$ -linear and $R$ -superlinear convergence only ensure the overall decrease of the sequence, while $Q$ -linear and $Q$ -superlinear convergence require the sequence to decrease at a certain speed at each index. It is immediate from the definition that $Q$ -convergence implies $R$ -convergence.
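The gap between the two notions can be seen on a small example. The sketch below builds a sequence that is dominated by a Q-linearly convergent one, hence R-linear, but whose consecutive ratios exceed $1$ at every other index. The sequences and helper names are our own illustrative choices.

```python
def q_linear_factor(seq):
    """Largest ratio r_{n+1}/r_n over the finite prefix; Q-linearity needs it < 1."""
    return max(b / a for a, b in zip(seq, seq[1:]))

def dominated_by(seq, majorant):
    """Check r_n <= s_n termwise, as in the definition of R-linear convergence."""
    return all(r <= s for r, s in zip(seq, majorant))

N = 30
# s_n = 1.5 * 2^{-n} is Q-linear with factor 1/2.
s = [1.5 * 2.0 ** (-n) for n in range(N)]
# r_n oscillates below s_n: it is R-linear but not even monotone.
r = [(1.0 + 0.5 * (-1) ** n) * 2.0 ** (-n) for n in range(N)]

factor_s = q_linear_factor(s)   # worst-case ratio of the majorant
factor_r = q_linear_factor(r)   # worst-case ratio of the oscillating sequence
is_r_linear = dominated_by(r, s)
```

Here `factor_r` is larger than $1$ , so no $\varepsilon\in]0,1[$ satisfies the Q-linear inequality for $(r_{n})$ , although $r_{n}\leq s_{n}$ for all $n$ .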

Lemma A.8 (Estimate for sublinear real sequences).

Let $(r_{n})_{n\in\mathbb{N}}$ be a strictly positive real sequence satisfying, for some $\kappa>0$ , $\alpha>1$ and all ${n\in\mathbb{N}}$ : $r_{n}-r_{n+1}\geq\kappa r_{n+1}^{\alpha}.$ Define $\tilde{\kappa}:=\min\{\kappa,\kappa^{\frac{\alpha-1}{\alpha}}\}$ , and $\delta:=\max\limits_{s\geq 1}\min\left\{\frac{\alpha-1}{s},\kappa^{-\frac{\alpha-1}{\alpha}}r_{0}^{1-\alpha}\left(1-s^{-\frac{\alpha-1}{\alpha}}\right)\right\}\in\left]0,+\infty\right[.$ Then, for all ${n\in\mathbb{N}}$ , $r_{n}\leq(\tilde{\kappa}\delta n)^{-1/(\alpha-1)}.$

Proof.

It can be found in [72, Lemma 7.1], see also the proofs of [3, Theorem 2] or [46, Theorem 3.4]. ∎
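To make the constants of Lemma A.8 concrete, the sketch below evaluates the bound for $\alpha=2$ on the extremal sequence satisfying $r_{n}-r_{n+1}=\kappa r_{n+1}^{2}$ , for which $r_{n+1}$ has a closed form. The maximum over $s\geq 1$ defining $\delta$ is approximated from below on a grid, which can only weaken the bound. The choice $\alpha=2$ , the grid, and the function names are assumptions made for illustration.

```python
import math

def next_term(r, kappa):
    """Solve r - t = kappa * t**2 for t > 0 (case alpha = 2, with equality)."""
    return (-1.0 + math.sqrt(1.0 + 4.0 * kappa * r)) / (2.0 * kappa)

def lemma_constants(kappa, alpha, r0, grid=10000):
    """Return (kappa_tilde, delta); delta is a grid lower estimate of the max over s >= 1."""
    e = (alpha - 1.0) / alpha
    kappa_tilde = min(kappa, kappa ** e)
    delta = 0.0
    for k in range(1, grid + 1):
        s = 1.0 + 0.01 * k
        candidate = min((alpha - 1.0) / s,
                        kappa ** (-e) * r0 ** (1.0 - alpha) * (1.0 - s ** (-e)))
        delta = max(delta, candidate)
    return kappa_tilde, delta

def check_bound(kappa=1.0, r0=1.0, n_max=200):
    """Check r_n <= (kappa_tilde * delta * n)^(-1/(alpha-1)) for 1 <= n <= n_max."""
    alpha = 2.0
    kappa_tilde, delta = lemma_constants(kappa, alpha, r0)
    r, ok = r0, True
    for n in range(1, n_max + 1):
        r = next_term(r, kappa)
        bound = (kappa_tilde * delta * n) ** (-1.0 / (alpha - 1.0))
        ok = ok and (r <= bound)
    return ok, delta
```

With $\kappa=r_{0}=1$ , the two terms inside the minimum balance at $s=(3+\sqrt{5})/2$ , so $\delta=(3-\sqrt{5})/2\approx 0.382$ and the bound reads $r_{n}\leq 1/(\delta n)$ .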

Lemma A.9.

If Assumption 2.1 holds, then for all $(x,u)\in X^{2}$ and all $\lambda>0$ :

i)

$\|T_{\lambda}x-u\|^{2}-\|x-u\|^{2}\leq\left({\lambda L}-1\right)\|T_{\lambda}x-x\|^{2}+2\lambda(f(u)-f(T_{\lambda}x)).$

ii)

$\|\partial f(T_{\lambda}x)\|_{\_}\leq\lambda^{-1}\|T_{\lambda}x-x\|\leq\|\partial f(x)\|_{\_}.$

Proof of Lemma A.9.

To prove item i), start by writing

[TABLE]

The optimality condition in (2) gives ${x-T_{\lambda}x}\in\lambda\partial g(T_{\lambda}x)+\lambda\nabla h(x)$ so that, by using the convexity of $g$ :

[TABLE]

Since we can write $\langle\nabla h(x),u-T_{\lambda}x\rangle=\langle\nabla h(x),u-x\rangle+\langle\nabla h(x),x-T_{\lambda}x\rangle$ , we deduce from the convexity of $h$ and the Descent Lemma ([11, Theorem 18.15]) that

[TABLE]

Item i) is then proved after combining the two previous inequalities. For item ii), we use the optimality condition in (2), together with a sum rule (see e.g. [87, Theorem 3.30]), to deduce that

[TABLE]

For the first inequality, use (45) with $(u,v)=(x-\lambda\nabla h(x),T_{\lambda}x)$ , together with the contraction property of the gradient map $x\mapsto x-\lambda\nabla h(x)$ when $0<\lambda\leq 2/L$ (see [11, Cor. 18.17 & Prop. 4.39 & Remark 4.34.i]) to obtain:

[TABLE]

For the second inequality, consider $x^{*}:=\mbox{\rm proj}(-\nabla h(x),\partial g(x))$ , and use (45) with $(u,v)=(x+\lambda x^{*},x)$ , together with the nonexpansiveness of the proximal map (see [11, Prop. 12.28]):

[TABLE]
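Both items of Lemma A.9 can be checked numerically on a scalar instance. The sketch below assumes that $f=g+h$ with $g=|\cdot|$ and $h(x)=\frac{1}{2}(ax-b)^{2}$ , so that $L=a^{2}$ , and that $T_{\lambda}x=\mbox{\rm prox}_{\lambda g}(x-\lambda\nabla h(x))$ ; these are our reading of Assumption 2.1 and of (2), which are not restated in this appendix. Following the proof, the first inequality of item ii) is only tested when $\lambda\leq 2/L$ .

```python
def soft_threshold(z, tau):
    """Proximal map of tau * |.| at z."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0

def make_problem(a=2.0, b=1.0):
    """Scalar instance f = g + h with g = |.| and h(x) = 0.5*(a*x - b)**2, so L = a**2."""
    grad_h = lambda x: a * (a * x - b)
    f = lambda x: abs(x) + 0.5 * (a * x - b) ** 2
    def min_norm_subgrad(x):
        # distance from 0 to the interval grad_h(x) + subdifferential of |.| at x
        g = grad_h(x)
        if x > 0:
            return abs(g + 1.0)
        if x < 0:
            return abs(g - 1.0)
        return max(abs(g) - 1.0, 0.0)
    return f, grad_h, min_norm_subgrad, a * a

def check_lemma_a9(lam, tol=1e-9):
    """Test items i) and ii) on a grid of points x and u in [-3, 3]."""
    f, grad_h, min_norm_subgrad, L = make_problem()
    grid = [k / 10.0 for k in range(-30, 31)]
    for x in grid:
        Tx = soft_threshold(x - lam * grad_h(x), lam)
        step = abs(Tx - x)
        for u in grid:  # item i)
            lhs = (Tx - u) ** 2 - (x - u) ** 2
            rhs = (lam * L - 1.0) * step ** 2 + 2.0 * lam * (f(u) - f(Tx))
            if lhs > rhs + tol:
                return False
        # item ii), first inequality (needs lam <= 2/L in the proof)
        if lam <= 2.0 / L and min_norm_subgrad(Tx) > step / lam + tol:
            return False
        # item ii), second inequality
        if step / lam > min_norm_subgrad(x) + tol:
            return False
    return True
```

Calling `check_lemma_a9(0.25)` corresponds to the stepsize $\lambda=1/L$ . A grid check of this kind is of course no substitute for the proof; it only helps to catch sign or constant errors when transcribing the inequalities.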

Lemma A.10 (Descent Lemma for Hölder smooth functions).

Let $f:X\longrightarrow\mathbb{R}$ and $C\subset X$ be convex. Assume that $f$ is Gateaux differentiable on $C$ , and that there exists $(\alpha,L)\in]0,+\infty[^{2}$ , such that for all $(x,y)\in C^{2}$ , $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|^{\alpha}$ holds. Then:

[TABLE]

Proof.

The argument used in [101, Remark 3.5.1] for $C=X$ extends directly to convex sets. ∎
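For illustration, consider $f(t)=|t|^{1+\alpha}/(1+\alpha)$ on $\mathbb{R}$ with $\alpha\in]0,1]$ , whose derivative $t\mapsto\mathrm{sign}(t)|t|^{\alpha}$ is $\alpha$ -Hölder continuous with constant $L=2^{1-\alpha}$ . The sketch below evaluates the gap in the inequality $f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{1+\alpha}\|y-x\|^{1+\alpha}$ , which is the standard form of the Hölder descent lemma and which we assume is the one displayed in Lemma A.10.

```python
import math

def holder_descent_gap(x, y, alpha):
    """Right-hand side minus left-hand side of the ASSUMED Hoelder descent inequality
    for f(t) = |t|**(1+alpha) / (1+alpha), with Hoelder constant L = 2**(1-alpha)."""
    L = 2.0 ** (1.0 - alpha)
    f = lambda t: abs(t) ** (1.0 + alpha) / (1.0 + alpha)
    grad = lambda t: math.copysign(abs(t) ** alpha, t)
    rhs = f(x) + grad(x) * (y - x) + L / (1.0 + alpha) * abs(y - x) ** (1.0 + alpha)
    return rhs - f(y)

def smallest_gap(alpha):
    """Minimum gap over a grid of pairs (x, y) in [-2, 2]^2."""
    grid = [k / 10.0 for k in range(-20, 21)]
    return min(holder_descent_gap(x, y, alpha) for x in grid for y in grid)
```

The gap should be nonnegative up to rounding for every $\alpha\in]0,1]$ . For $\alpha=1$ the function is quadratic and the gap vanishes identically.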

Now we can prove the convergence rate results of Section 4.1:

Proof of Theorem 4.1.

We first show that $(x_{n})_{{n\in\mathbb{N}}}$ has finite length. Since $\inf f>-\infty$ , $r_{n}:=f(x_{n})-\inf f\in[0,+\infty[$ , and it follows from Lemma A.9 that

[TABLE]

If there exists ${n\in\mathbb{N}}$ such that $r_{n}=0$ then the algorithm would stop after a finite number of iterations (see (46)), therefore it is not restrictive to assume that $r_{n}>0$ for all ${n\in\mathbb{N}}$ . We set $\varphi(t):=pt^{1/p}$ and $c:=c_{f,\Omega}$ , so that the Łojasiewicz inequality at $x_{n}\in\Omega\cap\mbox{\rm dom}^{*}f$ can be rewritten as

[TABLE]

Combining (46), (47), and (48), and using the concavity of $\varphi$ , we obtain for all $n\geq 1$ :

[TABLE]

By taking the square root on both sides, and using Young’s inequality, we obtain

[TABLE]

Sum this inequality, and reorder the terms to finally obtain

[TABLE]

We deduce that $(x_{n})_{n\in\mathbb{N}}$ has finite length and converges strongly to some $x_{\infty}$ . Moreover, from (47) and the strong closedness of $\partial f:X\rightrightarrows X$ , we conclude that $0\in\partial f(x_{\infty})$ .

Now we prove the convergence rates. Let $c=c_{f,\Omega}$ for short. We first derive rates for the sequence of values $r_{n}:=f(x_{n})-\inf f$ , from which we will derive the rates for the iterates. Equations (46) and (47) yield

[TABLE]

The Łojasiewicz inequality at $x_{n+1}\in\Omega\cap\mbox{\rm dom}^{*}f$ implies $c^{2}r_{n+1}^{2/p}(r_{n}-r_{n+1})\geq ab^{-2}r_{n+1}^{2},$ so we deduce that

[TABLE]

The rates for the values are derived from the analysis of the sequences satisfying the inequality in (50). Depending on the value of $p$ , we obtain different rates.

$\bullet$ If $p=1$ , then we deduce from (50) that for all ${n\in\mathbb{N}},r_{n+1}\neq 0$ implies $r_{n+1}\leq r_{n}-\kappa.$ Since the sequence $(r_{n})_{n\in\mathbb{N}}$ is decreasing and positive, $r_{n+1}\neq 0$ implies $n\leq r_{0}\kappa^{-1}$ .

For the other values of $p$ , we will assume that $r_{n}>0$ . In particular, we get from (50)

[TABLE]

$\bullet$ If $p\in]1,2[$ , then $\alpha\in]0,1[$ . The positivity of $r_{n+1}$ and (51) imply that for all ${n\in\mathbb{N}}$ , $r_{n+1}\leq\kappa^{-1/\alpha}r_{n}^{1/\alpha}$ , meaning that $r_{n}$ converges Q-superlinearly.

$\bullet$ If $p=2$ , then $\alpha=1$ and we deduce from (51) that for all ${n\in\mathbb{N}}$ , $r_{n+1}\leq{(1+\kappa)^{-1}}r_{n}$ , meaning that $r_{n}$ converges Q-linearly.

$\bullet$ If $p\in\left]2,+\infty\right[$ , then $\alpha\in\left]1,2\right[$ , and the analysis still relies on studying the asymptotic behaviour of a real sequence satisfying (51). Lemma A.8 in the Annex shows that we have $r_{n+1}\leq(C_{p}^{\prime})^{p/(p-2)}n^{-p/(p-2)}$ , by taking

[TABLE]

To end the proof, we will prove that the rates for $\|x_{n}-x_{\infty}\|$ are governed by the ones of $r_{n}$ . Let $1\leq n\leq N<+\infty$ , and sum the inequality in (49) between $n$ and $N$ to obtain (recall that $b=\lambda^{-1}$ ):

[TABLE]

Next, we pass to the limit as $N\to\infty$ and use (46), together with the fact that $r_{n}$ is decreasing, to obtain

[TABLE]

Note that ${r_{n-1}^{1/2}}\leq r_{0}^{\frac{1}{2}-\frac{1}{p}}r_{n-1}^{1/p}$ if $p\in\left[2,+\infty\right[$ , and $r_{n-1}^{1/p}\leq r_{0}^{\frac{1}{p}-\frac{1}{2}}{r_{n-1}^{1/2}}$ if $p\in[1,2]$ . So, by defining

[TABLE]

we finally conclude from (53) that $\|x_{\infty}-x_{n}\|\leq C_{p}r_{n-1}^{1/\max\{2,p\}}$ when $n\geq 1$ . ∎
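The three regimes for the values in the proof above can be observed on the extremal scalar recursion $r_{n}-r_{n+1}=\kappa r_{n+1}^{\alpha}$ with $\alpha=2(p-1)/p$ . The sketch below assumes that the inequality used in the proof has this form, as in Lemma A.8, and solves for $r_{n+1}$ by bisection; the parameter values are illustrative.

```python
def next_value(r, kappa, alpha, iters=200):
    """Solve t + kappa * t**alpha = r for t in [0, r] by bisection (alpha > 0)."""
    lo, hi = 0.0, r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + kappa * mid ** alpha < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def trajectory(alpha, kappa=1.0, r0=1.0, steps=8):
    """Iterate r_n - r_{n+1} = kappa * r_{n+1}**alpha starting from r0."""
    values = [r0]
    for _ in range(steps):
        values.append(next_value(values[-1], kappa, alpha))
    return values

linear = trajectory(1.0)                 # p = 2:   ratio 1/(1 + kappa) at each step
superlinear = trajectory(0.5, steps=5)   # p = 4/3: r_{n+1} <= kappa^{-2} r_n^2
sublinear = trajectory(1.5)              # p = 4:   decay of order n^{-2}
```

With $\kappa=1$ , the ratios $r_{n+1}/r_{n}$ of `linear` are all equal to $1/2$ , the terms of `superlinear` satisfy $r_{n+1}\leq r_{n}^{2}$ , and `sublinear` decreases much more slowly than `linear`.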

Proof of Theorem 4.6.

The proof is as for the case $p\in\left]2,+\infty\right[$ of Theorem 4.1: the $p$ -Łojasiewicz property implies (50), and the statement follows from Lemma A.8 with $\alpha=2(p-1)/p\in\left]2,+\infty\right[$ . ∎

Proof of Theorem 4.8.

The proofs of Theorems 4.1 and 4.6 rely on the combination of the Łojasiewicz inequality with the estimates (46) and (47), which can be replaced by (18) and (19). ∎

A.4 Linear inverse problems and proofs of Section 5.1

Here we will make use of the Moore-Penrose pseudo-inverse $A^{\dagger}$ of $A$ . It is a linear operator (not necessarily bounded), whose domain is $D(A^{\dagger}):=R(A)+R(A)^{\perp}$ , and which satisfies

[TABLE]

It is easy to see that, whenever $y\in D(A^{\dagger})$ , the solution set of (27) is $A^{\dagger}y+\ker A$ .

Lemma A.11.

Let $A$ be a bounded linear operator from $X$ to $Y$ . Then, for every continuous function $\phi:[0,+\infty[\rightarrow\mathbb{R}$ , we have $A\phi(A^{*}A)=\phi(AA^{*})A$ .

Proof.

A simple induction argument shows that, for every $k\geq 0$ , $A(A^{*}A)^{k}=(AA^{*})^{k}A$ . Taking linear combinations of this equality shows that, for every polynomial $P\in\mathbb{R}[X]$ , $AP(A^{*}A)=P(AA^{*})A$ . Now, if $\phi$ is continuous on $[0,+\infty[$ , it is in particular continuous on $[0,\|A\|^{2}]$ , which is an interval containing the spectrum of both $A^{*}A$ and $AA^{*}$ . Thus, $\phi$ restricted to this interval can be written as the uniform limit of a sequence of polynomials. Passing to the limit (see [56, Thm. VI.32.1]) in the last equality gives the desired result. ∎
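The polynomial step of this proof is easy to test in finite dimension. The sketch below checks $AP(A^{*}A)=P(AA^{*})A$ in exact integer arithmetic for a small matrix $A$ and a polynomial $P$ , both chosen arbitrarily for illustration; the helper functions are ours.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]

def poly_of_matrix(coeffs, M):
    """Evaluate sum_k coeffs[k] * M**k for a square matrix M by Horner's rule."""
    n = len(M)
    result = [[0] * n for _ in range(n)]
    for c in reversed(coeffs):
        result = matmul(result, M)
        for i in range(n):
            result[i][i] += c      # add c times the identity
    return result

A = [[1, 2, 0], [0, -1, 3]]        # a 2x3 integer matrix (illustrative)
At = transpose(A)
coeffs = [4, -3, 0, 2]             # P(t) = 4 - 3t + 2t^3
left = matmul(A, poly_of_matrix(coeffs, matmul(At, A)))    # A P(A^T A)
right = matmul(poly_of_matrix(coeffs, matmul(A, At)), A)   # P(A A^T) A
```

Since all entries are integers, `left` and `right` can be compared exactly. The passage from polynomials to continuous $\phi$ is the part of the argument that requires the spectral calculus.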

Lemma A.12.

For all $b\in Y$ , $r\in\left]0,+\infty\right[$ , the following two properties are equivalent:

1. $(\exists x\in\ker A^{\perp})\quad b=Ax,\quad\|x\|=r$

2. $(\exists y\in\mbox{\rm cl\,}R(A))\quad b=\sqrt{AA^{*}}y,\quad\|y\|=r.$

Proof.

It is shown in [42, Proposition 2.18] that $R(A)=R(\sqrt{AA^{*}})$ , so it is enough to verify this implication:

[TABLE]

Let $(x,y)$ be such a pair. Since $Ax=\sqrt{AA^{*}}y$ and $y\in\mbox{\rm cl\,}R(A)=\ker\sqrt{AA^{*}}^{\perp}$ , we deduce that $y=(\sqrt{AA^{*}})^{\dagger}Ax$ . Therefore, since $AA^{*}$ is self-adjoint, $(AA^{*})^{\dagger}Ax=(A^{*})^{\dagger}x$ (see [42, p.35]), and $A^{*}(A^{*})^{\dagger}x=\mbox{\rm proj}(x;\ker A^{\perp})$ , we get

[TABLE]

Proof of Lemma 5.5.

Recall that $y^{\dagger}=Ax^{\dagger}=AA^{\dagger}y$ and let $\nu=\mu+1/2$ . Then, Lemma A.12 yields:

[TABLE]

∎

Lemma A.13 (Interpolation inequality [42, p. 55]).

For all $x\in X$ and $0\leq\alpha<\beta$ , we have

[TABLE]

Lemma A.14 (Powers of self-adjoint operators).

Let $S$ be a bounded selfadjoint positive linear operator on a Hilbert space. Then, for all $\alpha>0$ , $\ker S=\ker S^{\alpha}$ , and $\mbox{\rm cl\,}R(S^{\alpha})=\mbox{\rm cl\,}R(S)$ .

Proof.

Given any $0<\alpha<\beta$ , we can write $S^{\beta}=S^{\beta-\alpha}S^{\alpha}$ , from which we deduce that $\ker S^{\alpha}\subset\ker S^{\beta}$ . This means that $(\ker S^{\alpha})_{\alpha>0}$ is a nondecreasing family. To prove that this family is constant, it is enough to see that $\ker S^{2}\subset\ker S$ , which we verify now: If $x\in\ker S^{2}$ , then $\|Sx\|^{2}=\langle Sx,Sx\rangle=\langle S^{2}x,x\rangle=0$ , therefore $x\in\ker S$ . The conclusion follows from the fact that $\ker S^{\perp}=\mbox{\rm cl\,}R(S)$ . ∎

A.5 Regularized inverse problems and proofs of Section 5.2

Proposition A.15.

Let $K\subset\mathbb{R}^{N}$ be a closed cone and $S\in\mathcal{S}_{+}(\mathbb{R}^{N})$ . Then $S$ is coercive on $K$ if and only if $K\cap\ker S=\{0\}$ .

Proof.

The direct implication is immediate from Definition 5.14. For the reverse implication, let $K$ be a closed cone such that $K\cap\ker S=\{0\}$ . Since $S$ is linear, we know that $d\mapsto\langle Sd,d\rangle$ is convex and continuous. So, using the compactness of $K\cap\mathbb{S}_{\mathbb{R}^{N}}$ we deduce that:

[TABLE]

Because $\bar{d}\in K$ and $\bar{d}\neq 0$ , we deduce from our assumption that $\bar{d}\notin\ker S$ . Therefore, $\gamma:=\langle S\bar{d},\bar{d}\rangle>0$ , from which we deduce that $S$ is $\gamma$ -coercive on $K$ . ∎

Definition A.16 (Cone enlargement).

Let $K\subset\mathbb{R}^{N}$ be a cone, and $\theta\in[0,\frac{\pi}{2}]$ . We define the $\theta$ -enlargement of $K$ as

[TABLE]

Lemma A.17.

If $K$ is a closed cone, then $K_{\theta}$ is a closed cone containing $K$ for all $\theta\in[0,\frac{\pi}{2}]$ .

Proof.

By definition, $K_{\theta}$ is a cone containing $K$ and $\Delta_{\theta}:=\left\{x\in\mathbb{S}_{\mathbb{R}^{N}}\ |\ (\exists y\in K\cap\mathbb{S}_{\mathbb{R}^{N}})\ \arccos\left({|\langle x,y\rangle|}\right)\leq\theta\right\}$ is compact, due to the compactness of $K\cap\mathbb{S}$ . Since $0\not\in\Delta_{\theta}$ , by compactness of $\Delta_{\theta}$ , we deduce that $K_{\theta}=\mathbb{R}\Delta_{\theta}$ is a closed cone (see e.g. [48, Proposition A.1.1]). ∎

Proposition A.18.

Let $S\in\mathcal{S}_{+}(\mathbb{R}^{N})$ which is $\gamma$ -coercive on a closed cone $K$ . Then, for every $\gamma^{\prime}\in]0,\gamma]$ , $S$ is $\gamma^{\prime}$ -coercive on $K_{\theta}$ , with $\theta:=\arcsin\left(\frac{\gamma-\gamma^{\prime}}{\|S\|}\right)\in[0,\frac{\pi}{2}[$ .

Proof.

Let $\theta$ and $\gamma$ be as in the statement. Since $S$ is $\gamma$ -coercive on $K$ , we see that $\gamma\leq\|S\|$ , which guarantees that $\theta\in[0,\frac{\pi}{2}[$ . Now, the fact that $K_{\theta}$ is closed (Lemma A.17) implies that $K_{\theta}\cap\mathbb{S}$ is compact in $X$ , so we can use the same arguments as in (55) to deduce that there exists $\bar{d}\in K_{\theta}\cap\mathbb{S}_{\mathbb{R}^{N}}$ such that $\langle S\bar{d},\bar{d}\rangle=\inf\limits_{d\in K_{\theta}\cap\mathbb{S}_{\mathbb{R}^{N}}}\langle Sd,d\rangle$ . Since $\bar{d}\in K_{\theta}$ , there exists by definition of $K_{\theta}$ some $\bar{v}\in K\cap\mathbb{S}$ such that $\arccos(|\langle\bar{d},\bar{v}\rangle|)\leq\theta$ . We can use [62, Theorem 1] to write

[TABLE]

Since $\bar{v}\in K\cap\mathbb{S}_{\mathbb{R}^{N}}\subset K_{\theta}\cap\mathbb{S}_{\mathbb{R}^{N}}$ , we have $\langle S\bar{v},\bar{v}\rangle\geq\langle S\bar{d},\bar{d}\rangle$ . Moreover, $\arccos(|\langle\bar{v},\bar{d}\rangle|)\leq\theta$ , so (56) implies

[TABLE]

We deduce from the definition of $\bar{d}$ that $S$ is $\gamma^{\prime}$ -coercive on $K_{\theta}$ . ∎
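A two-dimensional example shows how conservative this estimate can be. The sketch below takes $S=\mathrm{diag}(1,0)$ and the cone $K=\mathbb{R}e_{1}$ , on which $S$ is $1$ -coercive, and samples the unit vectors $d$ with $\arccos(|\langle d,e_{1}\rangle|)\leq\theta$ , following the description of $\Delta_{\theta}$ in the proof of Lemma A.17. We assume here that $\gamma$ -coercivity on a cone means $\langle Sd,d\rangle\geq\gamma\|d\|^{2}$ on that cone (Definition 5.14 is not restated); the matrix, the cone and the sampling are illustrative choices.

```python
import math

def quadratic_form(S, d):
    """<S d, d> for a 2x2 matrix S and a vector d."""
    return sum(S[i][j] * d[j] * d[i] for i in range(2) for j in range(2))

def min_on_enlarged_line(S, v, theta, samples=2001):
    """Minimum of <S d, d> over sampled unit d within angle theta of the unit vector v.
    Directions near -v are omitted because the quadratic form is even in d."""
    base = math.atan2(v[1], v[0])
    best = float("inf")
    for k in range(samples):
        t = base - theta + 2.0 * theta * k / (samples - 1)
        best = min(best, quadratic_form(S, (math.cos(t), math.sin(t))))
    return best

S = [[1.0, 0.0], [0.0, 0.0]]     # positive semidefinite, norm 1, kernel spanned by e2
v = (1.0, 0.0)                   # K = R*v, on which S is 1-coercive
gamma, gamma_prime, norm_S = 1.0, 0.5, 1.0
theta = math.asin((gamma - gamma_prime) / norm_S)   # angle given by Proposition A.18
lowest = min_on_enlarged_line(S, v, theta)
```

Here $\theta=\pi/6$ and the minimum of $\langle Sd,d\rangle$ over $K_{\theta}\cap\mathbb{S}$ is $\cos^{2}(\pi/6)=3/4$ , which is indeed at least $\gamma^{\prime}=1/2$ .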

Proposition A.19.

Let $C\subset\mathbb{R}^{N}$ , and $\bar{x}\in C$ .

i)

For $\rho>0$ , $C$ is $\rho$ -prox-regular at $\bar{x}$ if and only if :

[TABLE]

ii)

If $C$ is a $C^{2}$ manifold, then there exists $\delta,\rho>0$ such that $C\cap\mathbb{B}(\bar{x},\delta)$ is $\rho$ -prox-regular.

Proof.

Item i) : Definition 5.19 can be rewritten as $(\forall\eta\in N_{C}(\bar{x})\cap\mathbb{S}_{\mathbb{R}^{N}})(\forall x\in C)\quad x\notin\mathbb{B}(\bar{x}+\frac{1}{\rho}\eta,\frac{1}{\rho})$ , where the condition $x\notin\mathbb{B}(\bar{x}+\frac{1}{\rho}\eta,\frac{1}{\rho})$ is equivalent to, after developing the square:

[TABLE]

The conclusion follows after cancelling and reorganizing the terms. Item ii) : Every $C^{2}$ -manifold is prox-regular in the sense of [93, Def. 10.23 & Prop. 13.32]. Therefore, for every $\bar{x}\in C$ , there exist $\delta,\rho>0$ such that for every $x\in C\cap\mathbb{B}(\bar{x},\delta)$ , and for every $\eta\in N_{C}(\bar{x})\cap\mathbb{S}_{\mathbb{R}^{N}}$ , the inequality (57) holds [93, Exercise 13.31]. The conclusion follows from the fact that $N_{C}(\bar{x})=N_{C\cap\mathbb{B}(\bar{x},\delta)}(\bar{x})$ .

∎

We will need the following result, which locally estimates the coercivity of an operator on a prox-regular set via its coercivity on the tangent cone.

Proposition A.20.

Let $C\subset X$ be $\rho$ -prox-regular at $\bar{x}\in C$ . Let $S:X\rightarrow X$ be a bounded positive selfadjoint linear operator, being $\gamma$ -coercive on $T_{C}(\bar{x})$ . Then, for all $\gamma^{\prime}\in\left]0,\gamma\right[$ , there exists a cone $K\subset X$ such that $S$ is $\gamma^{\prime}$ -coercive on $K$ , and $C\cap\mathbb{B}_{X}(\bar{x},\delta)\subset\bar{x}+K$ , with $\delta=\frac{2(\gamma-\gamma^{\prime})}{\rho\|S\|}$ .

Proof.

Let $\gamma^{\prime}\in\left]0,\gamma\right[$ be fixed, and define $\theta:=\arcsin((\gamma-\gamma^{\prime})\|S\|^{-1})\in]0,\frac{\pi}{2}[$ . Let $K_{\theta}$ be the $\theta$ -enlargement of $T_{C}(\bar{x})$ ; then Proposition A.18 guarantees that $S$ is $\gamma^{\prime}$ -coercive on $K_{\theta}$ . It remains to prove that there exists $\delta\in\left]0,+\infty\right[$ such that $C\cap\mathbb{B}(\bar{x},\delta)\subset\bar{x}+K_{\theta}$ . Let $x\in C$ . Because $C$ is $\rho$ -prox-regular at $\bar{x}$ , we know that $T_{C}(\bar{x})$ is a convex cone (use [44, Thm. 4.8.(12)] and the fact that $C$ is locally closed at $\bar{x}$ ), so we can define $y:=\mbox{\rm proj}(x-\bar{x},T_{C}(\bar{x}))$ , and $\eta:=\mbox{\rm proj}(x-\bar{x},N_{C}(\bar{x}))$ . Using Moreau’s Theorem [11, Thm. 6.30], we deduce that $\eta=x-\bar{x}-y$ with $\langle\eta,y\rangle=0$ . We define $\delta:=\|x-\bar{x}\|$ , and look for a condition on it so that $x\in\bar{x}+K_{\theta}$ . For this to happen, it is enough to verify that

[TABLE]

Now, use Proposition A.19.i) together with the Cauchy-Schwarz inequality, and the polynomial inequality $X^{2}-cX\geq-c^{2}/4$ , to write

[TABLE]

We can use this inequality, together with the facts that $x-\bar{x}=y+\eta$ and $\langle y,\eta\rangle=0$ , to write

[TABLE]

This allows us to conclude that (58) holds as long as:

[TABLE]

∎

Proof of Proposition 5.17.

Let $0<\gamma^{\prime}<\gamma$ , and set $S:=\mbox{\rm argmin\,}f$ . Since $h$ is of class $C^{2}$ around $\bar{x}\in S$ , there exists some $\delta>0$ such that for all $u\in\delta\mathbb{B}_{X}$ , $\|\nabla^{2}h(\bar{x}+u)-\nabla^{2}h(\bar{x})\|\leq\gamma-\gamma^{\prime}.$ Notice that when $\nabla^{2}h$ is Lipschitz continuous, we can take $\delta=(\gamma-\gamma^{\prime})/L$ . Also, if it is constant, we can just take $\delta=+\infty$ and $\gamma^{\prime}=\gamma$ . Let us show that $f$ is $2$ -conditioned on $\Omega:=\bar{x}+(K\cap\delta\mathbb{B}_{X})$ with the constant $\gamma_{f,\Omega}=\gamma^{\prime}$ . Take $x\in\Omega\cap\mbox{\rm dom}\,g$ and use the optimality condition at $\bar{x}\in S$ and the convexity of $g$ to obtain

[TABLE]

By Taylor’s theorem applied to $h$ , we deduce from the inequality above that there exists $y\in[x,\bar{x}]$ such that:

[TABLE]

On the one hand, since $x\in\Omega$ , we have that $x-\bar{x}\in K$ . Thus, from the coercivity of $\nabla^{2}h(\bar{x})$ we have

[TABLE]

On the other hand, we use the Cauchy-Schwarz inequality together with the definition of $\delta$ and the fact that $\|y-\bar{x}\|\leq\|x-\bar{x}\|<\delta$ to obtain

[TABLE]

By combining the three previous inequalities, we deduce that

[TABLE]

This implies that $(\bar{x}+K)\cap\mbox{\rm argmin\,}f=\{\bar{x}\}$ , and the statement follows from $\|x-\bar{x}\|\geq\mbox{\rm dist\,}(x;S)$ . ∎

Bibliography


[1] P.-A. Absil, R. Mahony and B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization, 16, pp. 531–547, 2005.
[2] F.J. Aragón Artacho and M.H. Geoffroy, Characterization of metric regularity of subdifferentials, Journal of Convex Analysis, 15 (2), pp. 365–380, 2008.
[3] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming, 116 (1-2), pp. 5–16, 2009.
[4] H. Attouch, J. Bolte, P. Redont and A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems. An approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35 (2), pp. 438–457, 2010.
[5] H. Attouch, J. Bolte and B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming, 137 (1-2), pp. 91–129, 2013.
[6] H. Attouch and R. Wets, Quantitative stability of variational systems II, a framework for nonlinear conditioning, SIAM Journal on Optimization, 3 (2), pp. 359–381, 1993.
[7] D. Azé and J.-N. Corvellec, Nonlinear local error bounds via a change of metric, Journal of Fixed Point Theory and Applications, 16 (1), pp. 351–372, 2014.
[8] J.-B. Baillon, Un exemple concernant le comportement asymptotique de la solution du problème $du/dt+\partial\vartheta\ni 0$ , Journal of Functional Analysis, 28 (3), pp. 369–376, 1978.


