Convergence of the Forward-Backward Algorithm: Beyond the Worst Case with the Help of Geometry
Guillaume Garrigos, Lorenzo Rosasco, Silvia Villa

TL;DR
This paper investigates the convergence of the forward-backward algorithm using geometric conditions, extending classical notions to more general sets and infinite-dimensional spaces, with applications in inverse problems and signal processing.
Contribution
It extends geometric convergence analysis of the forward-backward algorithm to arbitrary sets and infinite dimensions, introducing new inequalities and connections to inverse problem conditions.
Findings
First Lojasiewicz inequality for a quadratic function with a compact operator
New linear convergence rates for inverse problems with low-complexity priors
Unified framework connecting geometry and inverse problem conditions
Guillaume Garrigos¹, Lorenzo Rosasco²,³, and Silvia Villa⁴
¹ LPSM, Université de Paris. 75205 Paris CEDEX 13, France.
² DIBRIS, Università degli Studi di Genova. Via Dodecaneso 35, 16146, Genova, Italy.
³ LCSL, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology. Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.
⁴ Dipartimento di Matematica, Università degli Studi di Genova. Via Dodecaneso 35, 16146, Genova, Italy.
Abstract
We provide a comprehensive study of the convergence of the forward-backward algorithm under suitable geometric conditions, such as conditioning or Łojasiewicz properties. These geometrical notions are usually local by nature, and may fail to describe the fine geometry of objective functions relevant in inverse problems and signal processing, that have a nice behaviour on manifolds, or sets open with respect to a weak topology. Motivated by this observation, we revisit those geometric notions over arbitrary sets. In turn, this allows us to present several new results as well as collect in a unified view a variety of results scattered in the literature. Our contributions include the analysis of infinite dimensional convex minimization problems, showing the first Łojasiewicz inequality for a quadratic function associated to a compact operator, and the derivation of new linear rates for problems arising from inverse problems with low-complexity priors. Our approach allows to establish unexpected connections between geometry and a priori conditions in inverse problems, such as source conditions, or restricted isometry properties.
Contact: G. Garrigos [email protected] L. Rosasco [email protected] S. Villa [email protected]
Acknowledgements: This material is supported by the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216, and the Air Force project FA9550-17-1-0390. L. Rosasco acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC. S. Villa is supported by the INDAM GNAMPA research project 2017 “Algoritmi di ottimizzazione ed equazioni di evoluzione ereditarie”.
1 Introduction
Splitting algorithms based on first order descent methods are widely used to solve high dimensional convex optimization problems in signal and image processing [28], compressed sensing [31], and machine learning [84]. Their main advantages are their simplicity and a complexity that is independent of the dimension of the problem. The worst case convergence rates of these methods have been intensively investigated in the last twenty years. The simplest example is the gradient method applied to a smooth convex function, which is known to converge in values as $o(n^{-1})$ [32, 94]. Analogous results are known for the forward-backward splitting algorithm. We refer to these results as worst case since no particular assumption is made on the objective function aside from convexity and existence of a solution. Note that these rates are sharp, meaning that there are functions for which these rates are arbitrarily accurate. Clearly, such a large class of convex functions allows for functions with wild behaviors around the minimizers [16], behaviors that hardly appear in practice. It is then natural to ask whether improved rates can be proved under further regularity assumptions.
Previous work on optimization rates with geometry. One classical geometrical assumption is strong convexity, which indeed guarantees linear convergence rates [50, 95]. In practice, strong convexity is often too restrictive, and one would wish to relax it, while retaining fast rates. A relaxation of this condition is given by geometric conditions that, roughly speaking, describe convex functions that behave like
$$f(x) - \inf f \;\sim\; \operatorname{dist}(x, \operatorname{argmin} f)^p \qquad (1)$$
for some $p \geq 1$ and on some subset $\Omega$, which is typically a neighborhood of the minimizers and/or a sublevel set. The intuition behind this kind of assumption, required on a neighborhood of the solution, is clear: the larger $p$ is, the “flatter” the function is around its minimizers, which in turn means that a gradient descent algorithm will converge slowly. The idea of exploiting geometric conditions to derive convergence rates has a long history dating back to [89, 91], and plenty of similar convergence rate results have been derived under different yet related geometrical properties.
The optimization community has focused on several different but related geometric assumptions, namely $p$-conditioning, $p$-metric subregularity and the $p$-Łojasiewicz property (see Section 3 for their definitions). The first result exploiting geometry to derive fast convergence rates (if we discard the “classic” strong convexity assumption) dates back to Polyak [89, Theorem 4], showing that the gradient method converges linearly (in terms of the values and iterates) when the objective function verifies the $2$-Łojasiewicz inequality. Improved convergence rates for first-order descent methods were then obtained in [91], considering notions slightly stronger than $p$-metric subregularity, and proving finite convergence of the proximal algorithm for $p = 1$, and linear convergence for $p = 2$. These results are improved and extended in [82], analyzing for the first time convergence rates for the iterates of the proximal algorithm using metric subregularity for general $p$. The results in [82] recover those in [91] (see also [96, 97]), but also derive superlinear rates for $p \in\, ]1,2[$, and sublinear rates for $p > 2$. Roughly speaking, the results in [82] show that the larger $p$ is, the slower the algorithm is. A related notion, nowadays called the Luo-Tseng error bound condition, has been considered in the seminal paper [81], and implies the linear convergence of several first order methods. Recently, this condition has been shown to be equivalent to 2-conditioning [40, 74]. In the early 90’s, some attention was devoted to the study of $p$-conditioned functions, in particular for $p = 1$ (some authors call this property superlinear conditioning, sharp growth or sharp minima property). In this context, [45, 64, 23] showed that the proximal algorithm terminates after a finite number of iterations. For $p = 1$, Polyak [90, Theorem 7.2.1] obtained the finite termination for the projected gradient method. The $2$-conditioning was also used to obtain linear rates for the proximal algorithm in [70].
In [3], it was observed that the $p$-Łojasiewicz property could be used to derive precise rates for the iterates of the proximal algorithm. The authors obtain finite convergence when $p = 1$, linear rates when $p \in\, ]1,2]$, and sublinear rates when $p > 2$. Similar results can be found in [4, 83]. Such convergence rates for the iterates have been extended to the forward-backward algorithm (and its alternating versions) in [18], and similar rates also hold for the convergence of the values in [27, 46]. More recently, various papers focused on conditions equivalent to (or stronger than) $2$-conditioning to derive linear rates [67, 75, 41, 78, 40, 61]. Some effort has also been made to show that the Łojasiewicz property and conditioning are equivalent [16, 17], and to relate them to other error bounds appearing in the literature [61]. See also [85] for a refined analysis of linear rates for the projected gradient algorithm under conditions that interpolate between strong convexity and $2$-conditioning (see also Subsection 4.3).
A key observation. Our study starts from a basic observation which allows a number of developments. Indeed, motivated by several relevant examples described in Section 5, we require condition (1) to hold on an arbitrary set $\Omega$, which in general is neither a neighborhood of the solution, nor a sublevel set. This extension allows us to establish a connection with modeling assumptions considered in different contexts and to unveil their role in optimization. As we explain below, modeling assumptions, such as source conditions in inverse problems [42] or the restricted injectivity property in sparse recovery [25], correspond to conditioning assumptions on specific subsets. This ensures global convergence rates for the forward-backward algorithm that are faster than those given by a worst case analysis, and that are indeed often observed in practice.
Geometry and inverse problems. As a first example of the importance of considering arbitrary sets to define geometrical properties, consider linear inverse problems for which the operator is an infinite dimensional compact operator, making the problem severely ill-posed. A common modeling assumption is to suppose that the minimal norm solution of the problem satisfies a source condition, which can be seen as a measure of its regularity (see Section 5.1 for a definition). Under this condition, it is shown that the sublinear rate of the gradient algorithm is faster than the worst case one [42]. However, such a behavior apparently cannot be explained in terms of classical geometrical conditions satisfied by the least squares function: indeed, it was shown in [53] that such a least squares function cannot verify any Łojasiewicz inequality (1) in a neighborhood of its minimizers. On the contrary, thanks to the extension of the definition considered in this paper, we show that geometric assumptions are indeed satisfied, but only on specific subsets. More precisely, we show in Theorem 5.9 that the source condition guarantees that the least squares function is $p$-Łojasiewicz () on a dense affine subspace having empty interior. This therefore explains the faster global rates of the gradient algorithm which are typically observed in this context.
As a second example, consider linear inverse problems with a low-complexity prior, such as sparse inverse problems. For these problems, the restricted injectivity condition [25] is a key modeling assumption to guarantee stable recovery: it means that, even if a linear measurement is corrupted by noise, we can hope to reconstruct an approximate solution by solving a regularized optimization problem. In Section 5.2, we show that this assumption implies a $2$-conditioning of the problem over a (nonconvex) cone of sparse vectors. Since this set is active, in the sense that it is reached by the algorithm after a finite time, it immediately gives us an asymptotic linear rate for the algorithm. For problems with more general low-complexity priors the situation is similar: an active set will be identified by the iterates of the algorithm, and we show that a restricted injectivity condition on the tangent cone to this active set induces a $2$-conditioning of the problem on this set. Depending on the application or on the hypotheses made on the problem, this set can be a low-dimensional manifold, or a set with less structure, and can be computed within the partial smoothness framework [54] or the mirror stratification one [43].
Paper contents. Motivated by the estimation problems presented in Section 5, the goal of this paper is to provide a comprehensive study of the convergence rates of the forward-backward algorithm for convex minimization problems satisfying geometric conditions on arbitrary sets. We collect in a unified view a variety of results scattered in the literature, and we extend them to this more general setting. In addition, we derive several novel results along the way. The paper is organized as follows.
After reviewing and discussing worst-case convergence results for the forward-backward algorithm in Section 2, we give in Section 3 the definition of different geometric conditions for a proper convex lower semicontinuous function $f$: $p$-conditioning, $p$-metric subregularity, and the $p$-Łojasiewicz property on general subsets $\Omega$, rather than sublevel sets or open sets, as typically done in the literature. We show that those geometrical notions are equivalent, provided that the set $\Omega$ is stable by the semigroup generated by $\partial f$ (see Proposition 3.3). Since establishing $p$-conditioning of a function may be hard in general, we provide two sum rules for conditioned functions in Theorem 3.15 and Theorem 3.17. The first one establishes that if a strictly convex function remains $p$-conditioned under linear perturbations, then it is also $p$-conditioned under convex perturbation. The second one gives conditions under which the sum of two conditioned functions is conditioned. It allows us to show in particular that the ROF model (minimization of the total variation and the Kullback-Leibler divergence) is $2$-conditioned on every bounded set.
Section 4 exploits the $p$-Łojasiewicz property on general sets to study the convergence of the forward-backward algorithm. In Theorem 4.1, we recover and extend results from the literature to our more general setting, getting finite / superlinear / linear / sublinear convergence rates, depending on the value of $p$. Along the way, we extend the sharp superlinear rate known for the proximal method to the forward-backward algorithm. In addition, our approach allows us to derive in a unified setting both nonasymptotic/global and asymptotic/local convergence results, see Corollaries 4.11 and 4.12. We go beyond the classical analysis by introducing a $p$-Łojasiewicz property with $p$ taking nonpositive values. This allows us to study convex functions that are bounded from below but have no minimizers, a case which has drawn little attention so far, but which can arise for instance in function approximation [35] or in statistical learning theory [34, Theorem 9] (see also Section 5.1). For such ill-posed problems, we derive new and sharp sublinear rates for the values in Theorem 4.6, interpolating between $o(1)$ and $o(n^{-1})$. We further show in Section 4.3 that $2$-conditioning is essentially equivalent to the linear convergence of the forward-backward algorithm, illustrating the importance of this notion for convergence rate analysis.
In Section 5, we apply the aforementioned results to optimization problems arising from inverse problems, and discuss the interaction between geometry and modeling assumptions. The key results of this section are Theorem 5.9 and Theorem 5.20. Theorem 5.9 establishes that classical source conditions in inverse problems guarantee the Łojasiewicz property on special sets, and therefore yield convergence rates for the gradient method that are better than the worst case ones. Theorem 5.20 says that if we have an a priori assumption about the minimizer, which is assumed to belong to a given set, then a restricted injectivity property of the Hessian of the smooth component of the objective function implies that the objective function is $2$-conditioned on this set around the minimizer. This guarantees asymptotic linear rates for forward-backward when combined with Corollary 4.15.
2 The forward-backward algorithm: notation and background
2.1 Notation and basic definitions
We recall a few classic notions and introduce some notation. Throughout the paper is a Hilbert space. Given , we note and its interior and closure. We say that is a cone, if . We note (resp. ) the smallest cone (resp. linear subspace) in containing . Let , , and let and denote respectively the open and closed balls of radius centered at . We also use and to denote and , and to denote the unit sphere . The distance of from a set is , and stands for , so, in particular . If is closed and convex, is the projection of onto , and the relative interior and the strong relative interior of are respectively defined as [11, Definition 6.9]: , . Given a bounded linear operator between two Hilbert spaces, its spectrum, noted , is the set of spectral values such that is not boundedly invertible. We also note . The set of singular values of , noted , is defined as , and we note . Let be the class of convex, lower semi-continuous, and proper functions from to . For and , denotes the (Fenchel) subdifferential of at [11, Definition 16.1], and (resp. ) denotes the effective domain of (resp. of ). Moreover, is the Fenchel conjugate of , namely for all . We introduce the shorthand notation . We also introduce the following notation for the (strict) sublevel sets of : for every , .
The following assumption will be made throughout this paper.
Assumption 2.1.
Let $X$ be a Hilbert space, $g \in \Gamma_0(X)$, and $h : X \to \mathbb{R}$ be differentiable and convex, with $L$-Lipschitz continuous gradient for some $L > 0$, and set $f = g + h$.
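Splitting methods, such as the forward-backward algorithm, are extremely popular for minimizing an objective function $f = g + h$ as in Assumption 2.1. To have an implementable procedure, we implicitly assume that the proximal operator of $g$ can be easily computed (see e.g. [28]):
$$\operatorname{prox}_{\lambda g}(x) = \operatorname*{argmin}_{u \in X} \Big\{ g(u) + \frac{1}{2\lambda}\|u - x\|^2 \Big\}, \quad \lambda > 0. \qquad (2)$$
Remembering that Assumption 2.1 is in force, we introduce the Forward-Backward (FB) map for $\lambda > 0$:
$$T_\lambda : X \to X, \quad x \mapsto \operatorname{prox}_{\lambda g}\big(x - \lambda \nabla h(x)\big), \qquad (3)$$
so that the FB algorithm can be simply written as $x_{n+1} = T_\lambda x_n$.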
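As an illustration, the FB iteration can be sketched on a small sparse least squares problem. This is a minimal sketch, not the authors' implementation: the choice $g = \alpha\|\cdot\|_1$, $h = \frac{1}{2}\|A\cdot - y\|^2$, the step size $1/L$, the random data, and the helper names `soft_threshold` and `forward_backward` are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def forward_backward(A, y, alpha, n_iter=500):
    """FB iterations for f(x) = alpha*||x||_1 + 0.5*||Ax - y||^2 (assumed example)."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient of h
    lam = 1.0 / L                      # step size, chosen inside ]0, 2/L[
    x = np.zeros(A.shape[1])
    values = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                         # forward (gradient) step on h
        x = soft_threshold(x - lam * grad, lam * alpha)  # backward (proximal) step on g
        values.append(alpha * np.abs(x).sum() + 0.5 * np.sum((A @ x - y) ** 2))
    return x, np.array(values)

# Hypothetical data: 20 noiseless measurements of a 3-sparse vector in dimension 50.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true
x_hat, values = forward_backward(A, y, alpha=0.1)
```

The array `values` records $f(x_n)$ along the run, so the behaviour of the objective values can be inspected directly.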
2.2 The Forward-Backward algorithm: worst-case analysis
The following theorem collects known results about the convergence of the FB algorithm. This is a “worst-case” analysis, in the sense that it holds for every $f$ satisfying Assumption 2.1. The main goal of Section 4 is to show how these results can be improved by taking into account the geometry of $f$ at its infimum.
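Theorem 2.2 (Forward-Backward - convex case).
Suppose that Assumption 2.1 is in force, and let $(x_n)_{n \in \mathbb{N}}$ be generated by the FB algorithm with $\lambda \in\, ]0, 2/L[$. Then:
i) (Descent property) The sequence $(f(x_n))_{n \in \mathbb{N}}$ is decreasing, and converges to $\inf f$.
ii) (Fejér property) For all $\bar x \in \operatorname{argmin} f$, the sequence $(\|x_n - \bar x\|)_{n \in \mathbb{N}}$ is decreasing.
iii) (Boundedness) The sequence $(x_n)_{n \in \mathbb{N}}$ is bounded if and only if $\operatorname{argmin} f$ is nonempty.
Suppose in addition that $f$ is bounded from below. Then
4. (Subgradients convergence) The sequence $(\|\partial f(x_n)\|_-)_{n \in \mathbb{N}}$ converges decreasingly to zero, with $\|\partial f(x_n)\|_- = o(n^{-1/2})$.
Moreover, if $\operatorname{argmin} f \neq \emptyset$, we have:
5. (Weak convergence) The sequence $(x_n)_{n \in \mathbb{N}}$ converges weakly to a minimizer of $f$.
6. (Global rates for function values) For all $n \geq 1$,
[TABLE]
7. (Asymptotic rates for function values) When $n \to +\infty$, $f(x_n) - \inf f = o(n^{-1})$.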
Theorem 2.2 collects various convergence results on the FB algorithm. Item i appears in [94, Theorem 3.22] (see also [52]). Item ii is a consequence of the nonexpansiveness of the FB map (see (3)) [65, Lemma 3.2]. Item iii, which is a consequence of Opial’s Lemma [87, Lem. 5.2], can be found in [94, Theorem 3.12]. Item 4 follows from Lemma A.9.ii in the Annex. Item 5 is also a consequence of Opial’s Lemma, see [65, Proposition 3.1]. Items 6 and 7 are proved in [32, Theorem 3] (see also [20, Proposition 2] and [12, Theorem 3.1]).
Remark 2.3 (Sharpness of the results in the worst-case).
The convergence results in Theorem 2.2 are sharp, in the following sense. First, the iterates may not converge strongly: see [8, 52] for a counterexample in . Even in finite dimension, no sublinear rates should be expected for the iterates. To see this, apply the proximal algorithm to the function , whose unique minimizer is zero. When , there exists a constant depending on such that (see e.g. the discussion following [83, Proposition 2.5], or Lemma A.1):
[TABLE]
The estimate (4) also provides a lower bound for the rates on the objective values:
[TABLE]
The above lower bounds imply that the rate in Theorem 7 cannot be improved into a rate , for some , because we can always find a large enough verifying . It also means that no polynomial rates can be expected for . This fact was also observed in [32, Theorem 12] on an infinite dimensional counterexample. When is bounded from below, but has no minimizers, the values go to zero but no rates can be obtained in general. To see this, consider for any the function defined by
[TABLE]
If is obtained by applying the proximal algorithm to this function, then (see Lemma A.1) there exists such that:
[TABLE]
Observe that this lower bound on the objective function values implies that the convergence for those functions is slower than the usual rate obtained in Theorem 2.2.6. It also shows that no polynomial rates can be proven for the values when .
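The slow behaviour described in this remark can be observed numerically. The following is a rough sketch, assuming a flat one-dimensional function of the form $f(x) = |x|^p$ with $p > 2$; this specific function, the values $p = 4$ and $\lambda = 1$, and the helper names are assumptions of the example and are not taken from the text. The proximal point is computed by bisection on its optimality condition.

```python
def prox_abs_power(x, lam, p, n_bisect=200):
    """Proximal point of lam*|.|**p at x >= 0: the root u in [0, x] of
    u + lam*p*u**(p-1) = x, located by bisection."""
    lo, hi = 0.0, x
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mid + lam * p * mid ** (p - 1) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def proximal_algorithm(x0, lam, p, n_iter):
    """Iterates x_{n+1} = prox_{lam f}(x_n) for f(x) = |x|**p, started at x0 > 0."""
    xs = [x0]
    for _ in range(n_iter):
        xs.append(prox_abs_power(xs[-1], lam, p))
    return xs

p, lam, n_iter = 4, 1.0, 1000
xs = proximal_algorithm(1.0, lam, p, n_iter)
# Rescaled iterates n**(1/(p-2)) * x_n: if they stay away from zero, the iterates
# of this run do not decay faster than n**(-1/(p-2)).
rescaled = [n ** (1.0 / (p - 2)) * xs[n] for n in (10, 100, 1000)]
```

Increasing `p` makes the function flatter at its minimizer and the exponent $1/(p-2)$ smaller, which is one way to see why no uniform polynomial rate for the iterates can be expected over all convex functions.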
3 Identifying the geometry of a function
3.1 Definitions
In this section we introduce the main geometrical concepts that will be used throughout the paper to derive precise rates for the FB method. Roughly speaking, these notions characterize functions which behave like (1) on an arbitrary set .
Definition 3.1.
Let $p \geq 1$, let $f \in \Gamma_0(X)$ with $\operatorname{argmin} f \neq \emptyset$, and $\Omega \subset X$. We say that:
i) $f$ is $p$-conditioned on $\Omega$ if there exists a constant such that:
[TABLE]
ii) $\partial f$ is $p$-metrically subregular on $\Omega$ if there exists a constant such that:
[TABLE]
iii) $f$ is $p$-Łojasiewicz on $\Omega$ if there exists a constant such that:
[TABLE]
We will refer to these notions as global if , and as local if for some and , .
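For orientation, one common way of writing the three properties of Definition 3.1 is sketched below. The constants $\gamma, \gamma', c > 0$, their normalisation, and the notation $\|\partial f(x)\|_- = \operatorname{dist}(0, \partial f(x))$ are assumptions of this sketch rather than a quotation of the displayed formulas; each inequality is understood to hold for all $x \in \Omega$ at which both sides are defined.

```latex
% Assumed notation: \|\partial f(x)\|_- = \operatorname{dist}(0, \partial f(x)).
\begin{align*}
\text{($p$-conditioning)} \quad
  & \frac{\gamma}{p}\,\operatorname{dist}(x,\operatorname{argmin} f)^{p} \le f(x) - \inf f, \\
\text{($p$-metric subregularity)} \quad
  & \gamma'\,\operatorname{dist}(x,\operatorname{argmin} f)^{p-1} \le \|\partial f(x)\|_-, \\
\text{($p$-\L{}ojasiewicz)} \quad
  & \big(f(x) - \inf f\big)^{1-\frac{1}{p}} \le c\,\|\partial f(x)\|_- .
\end{align*}
```

Read this way, conditioning compares the values with the distance to the minimizers, metric subregularity compares the subgradients with that distance, and the Łojasiewicz property compares the values with the subgradients.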
The notion of conditioning, introduced in [98, 105], is a common tool in the optimization and regularization literature [6, 86, 66, 101, 17]. It is also called the growth condition [86], and it is strongly related to the notion of Tikhonov wellposedness [38]. The -metric subregularity coincides with metric subregularity of the subdifferential at the origin, and it is less used, generally defined for or with equal to a neighborhood of a specific minimizer [36, 67]. It is also called upper Lipschitz continuity at zero of in [29], or inverse calmness [37]. The Łojasiewicz property goes back to [79], and was initially designed as a tool to guarantee the convergence of trajectories for the gradient flow of analytic functions, before its recent use in convex and nonconvex optimization. It is generally presented with a constant which is equal, in our notation, to [79, 1, 14, 17], or [83, 53, 46]. In the remark below we explain the main difference between our definition and the one usually considered in the literature.
Remark 3.2.
There is a subtle but crucial difference in the terminology used in Definition 3.1 with respect to the one commonly used for the Łojasiewicz property. It is usually said that a function has the Łojasiewicz property at if there exist , , and such that holds on . If the latter property holds for every , the function is said to have the Łojasiewicz property on . This is a different requirement with respect to the one in Definition 3.1. Indeed, we require the inequality to hold uniformly on , while the above definition must hold locally around every point of interest in a given set, and typically only allows for asymptotic convergence rates (see Corollary 4.12). This change of viewpoint is motivated by the fact that for many convex functions, we have more than just a local information about the geometry (see Sections 3.3 and 4). More importantly, it is actually necessary for the analysis of the problems discussed in Section 5, which motivated this paper. Beyond that, it also allows to understand in a unified framework both global (Corollary 4.11) and local (Corollary 4.12) convergence rates.
The notions introduced in Definition 3.1 are closely related to each other. Indeed, for convex functions, -conditioning implies metric subregularity, which implies the Łojasiewicz property. Under some additional assumptions, it is possible to show that the reverse implications hold. For instance, metric subregularity implies conditioning when , [102, Theorem 4.3]. Similar results can also be found in [2, 7, 41, 39], and [29, Theorem 5.2] (for ). Also, it is shown in [17, Theorem 5] that the local Łojasiewicz property implies local conditioning. The next result, proved in Annex A.2, extends the mentioned ones, and states the equivalence between conditioning, metric subregularity, and Łojasiewicz property on -invariant sets (see Definition A.2 in Annex A.2).
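Proposition 3.3.
Let $p \geq 1$, let $\Omega \subset X$, and let $f \in \Gamma_0(X)$ be such that $\operatorname{argmin} f \neq \emptyset$. Consider the following properties:
i) $f$ is $p$-conditioned on $\Omega$,
ii) $\partial f$ is $p$-metrically subregular on $\Omega$,
iii) $f$ is $p$-Łojasiewicz on $\Omega$.
Then i) $\Rightarrow$ ii) $\Rightarrow$ iii). One can respectively take and . Assuming in addition that $\Omega$ is $\partial f$-invariant, we also have iii) $\Rightarrow$ i) with .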
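The two forward implications can be checked by hand from convexity alone. The derivation below is a sketch under an assumed normalisation: conditioning is taken as $\frac{\gamma}{p} d(x)^p \le f(x) - \inf f$ and metric subregularity as $\gamma' d(x)^{p-1} \le \|\partial f(x)\|_-$, where $d(x) = \operatorname{dist}(x, \operatorname{argmin} f)$ and $\|\partial f(x)\|_- = \operatorname{dist}(0, \partial f(x))$. These forms and the constants $\gamma, \gamma'$ are assumptions of the sketch, which is written for $p > 1$ and $x \notin \operatorname{argmin} f$.

```latex
\begin{align*}
&\text{Convexity: for } u \in \partial f(x) \text{ and } \bar x = \operatorname{proj}_{\operatorname{argmin} f}(x), \\
&\qquad f(x) - \inf f \le \langle u, x - \bar x \rangle \le \|u\|\, d(x)
  \;\Longrightarrow\; f(x) - \inf f \le \|\partial f(x)\|_-\, d(x). \\[4pt]
&\text{i)} \Rightarrow \text{ii)}: \quad
  \tfrac{\gamma}{p}\, d(x)^{p} \le \|\partial f(x)\|_-\, d(x)
  \;\Longrightarrow\; \tfrac{\gamma}{p}\, d(x)^{p-1} \le \|\partial f(x)\|_- . \\[4pt]
&\text{ii)} \Rightarrow \text{iii)}: \quad
  f(x) - \inf f \le \|\partial f(x)\|_- \Big( \tfrac{\|\partial f(x)\|_-}{\gamma'} \Big)^{\frac{1}{p-1}}
  \;\Longrightarrow\; \big( f(x) - \inf f \big)^{1 - \frac{1}{p}} \le (\gamma')^{-\frac{1}{p}}\, \|\partial f(x)\|_- .
\end{align*}
```

Under these assumed normalisations the constants propagate as $\gamma' = \gamma / p$ and $c = (\gamma')^{-1/p}$. The reverse implication is the one that needs the invariance of $\Omega$ and is not covered by this computation.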
The next two propositions show that these geometric notions are stronger when $p$ is smaller, and are meaningful only on sets containing minimizers (their proofs follow directly from Definition 3.1 and are left to the reader).
Proposition 3.4.
Let be such that , , and .
- i)
If is -conditioned (resp. is -metrically subregular) on , then is -conditioned (resp. is -metrically subregular) on for any . 2. ii)
If is -Łojasiewicz on , then is -Łojasiewicz on for any .
Proposition 3.5.
Let be such that . If is a weakly compact set for which , then is -conditioned on for any .
3.2 Examples
In this section, we collect some relevant examples.
Example 3.6 (Uniformly convex functions).
Suppose that is uniformly convex of order [11, Definition 10.7]. Then, there exists such that [101, Corollary 3.5.11.iv]:
[TABLE]
Such function is globally -conditioned, with , and globally -Łojasiewicz, with (see Lemma A.4). In the strongly convex case, when , the -Łojasiewicz inequality holds with the constant , which is sharp. Examples of uniformly convex functions of order are [11, Example 10.16].
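As a quick sanity check of the sharpness claim in the strongly convex case, consider the simplest example, the quadratic $f(x) = \frac{\gamma}{2}\|x\|^2$ with $\gamma > 0$. The computation below assumes that the $2$-Łojasiewicz inequality is normalised as $(f(x) - \inf f)^{1/2} \le c\,\|\nabla f(x)\|$; this normalisation and the choice of test function are assumptions of the sketch.

```latex
\begin{align*}
f(x) - \inf f &= \tfrac{\gamma}{2}\,\|x\|^2, \qquad \|\nabla f(x)\| = \gamma\,\|x\|, \\
\big( f(x) - \inf f \big)^{1/2} &= \sqrt{\tfrac{\gamma}{2}}\;\|x\|
  = \frac{1}{\sqrt{2\gamma}}\,\|\nabla f(x)\| .
\end{align*}
```

Equality holds for every $x$, so with this normalisation no constant smaller than $1/\sqrt{2\gamma}$ can work for this function.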
Example 3.7 (Least squares).
Let be a nonzero bounded linear operator between Hilbert spaces, and , for some . Then, the conditioning, metric subregularity, and Łojasiewicz properties, with and , are equivalent to verify on , respectively:
[TABLE]
If holds, one can see that the above inequalities hold with
[TABLE]
meaning in particular that is globally -conditioned. Since is equivalent for to be closed (see Proposition 5.2), it is in particular always true when has finite dimension. If instead holds, [53, Theorem 2.1] shows that cannot satisfy any local -Łojasiewicz property, for any . This is for instance the case for infinite dimensional compact operators. Nevertheless, we will show in Section 5, that the least squares always satisfies a -Łojasiewicz property on the so-called regularity sets, for any .
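In finite dimension, the global $2$-conditioning of the least squares function can be checked numerically. The sketch below assumes $f(x) = \frac{1}{2}\|Ax - y\|^2$ and tests the inequality $f(x) - \inf f \ge \frac{\sigma^2}{2} \operatorname{dist}(x, \operatorname{argmin} f)^2$, where $\sigma$ is the smallest positive singular value of $A$. The matrix, the data, and the normalisation of the constant are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Rank-deficient matrix on purpose: the last two columns are zero, so ker(A) is nontrivial.
A = rng.standard_normal((8, 5)) @ np.diag([3.0, 2.0, 1.0, 0.0, 0.0])
y = rng.standard_normal(8)

def f(x):
    """Least squares objective 0.5 * ||Ax - y||^2."""
    return 0.5 * np.sum((A @ x - y) ** 2)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s > 1e-10                          # indices of the positive singular values
sigma_min = s[keep].min()                 # smallest positive singular value
x_bar = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])  # minimal-norm minimizer
P_row = Vt[keep].T @ Vt[keep]             # orthogonal projector onto ker(A)^perp
inf_f = f(x_bar)

def dist_to_argmin(x):
    # argmin f = x_bar + ker(A), so the distance is the norm of the ker(A)^perp part.
    return np.linalg.norm(P_row @ (x - x_bar))

points = 10.0 * rng.standard_normal((1000, 5))
gaps = np.array([f(x) - inf_f for x in points])
bounds = np.array([0.5 * sigma_min ** 2 * dist_to_argmin(x) ** 2 for x in points])
```

Comparing `gaps` with `bounds` entry by entry shows the growth inequality at the sampled points. The constant degrades as `sigma_min` approaches zero, which is consistent with the failure of any such bound for compact operators in infinite dimension.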
Example 3.8 (Convex piecewise polynomials).
A convex continuous function is said to be convex piecewise polynomial if can be partitioned in a finite number of polyhedra such that for all , the restriction of to is a convex polynomial, of degree . The degree of is defined as . Assume . Convex piecewise polynomial functions are conditioned [71, Corollary 3.6]. More precisely, for all , is -conditioned on its sublevel set , with In general, the constant (which depends on ) cannot be explicitly computed. This result implies that polyhedral functions () are -conditioned (in agreement with [23, Corollary 3.6]), and that convex piecewise quadratic functions () are -conditioned (in agreement with [70, Theorem 2.7]). More generally, convex semi-algebraic functions are locally -conditioned [15].
Example 3.9 (L1 regularized least squares).
Let , for some linear operator , and . As observed in [17, Section 3.2.1], is convex piecewise polynomial of degree , thus it is -conditioned on every nonempty level set . The computation of the conditioning constant is rather difficult. In [17, Lemma 10] an estimate of is provided, by means of Hoffman’s bound [58]. Extensions of this result to the infinite dimensional setting can be found in [49].
Example 3.10 (Regularized problems).
Let be a Euclidean space, , where is a linear operator, , and is a strongly convex function, and . Then is -conditioned on any level set , for , if
i) with , (see [104, Corollary 2]),
ii) with , (use [40, Theorem 4.2]; the details are left to the reader as an exercise, and can be checked in the Appendix),
iii) is the nuclear norm of the matrix , provided the following qualification condition holds (see [103]): such that . (We mention that this result was originally announced in [60, Theorem 3.1] without the qualification condition, but then corrected in [103, Proposition 12 & following remarks], in which the authors show that such condition is necessary.)
iv) is polyhedral (see [103, Proposition 6]).
Note that in [103, 104], the authors do not prove directly that the functions are $2$-conditioned, but that they verify the so-called Luo-Tseng error bound, which is known to be equivalent to $2$-conditioning on sublevel sets [40, Corollary 3.6]. Note also that in items ii-iv), the strong convexity and assumptions on can be weakened (see [103] and [40, Theorem 4.2]).
Example 3.11 (Distance to an intersection).
Let be two closed convex sets in such that , and for which the intersection is sufficiently regular, i.e. . Let . Clearly, , and . Then is -conditioned on bounded sets [10, Theorem 4.3]. Let . From , it follows that the function is -conditioned on bounded sets. The regularity condition is not necessary if the two sets are polyhedral, as proved by Hoffman [58].
Example 3.12 (Minimum of Łojasiewicz functions).
If , with being continuous on its domain, and locally -Łojasiewicz at , then is locally -Łojasiewicz at [74, Theorem 3.1]. It is important to notice that this result does not need the functions to be convex.
The next section presents new sum rules for conditioned functions.
3.3 A sum rule for -conditioned functions
Since verifying conditioning directly with the definition can be difficult, it is very useful to establish which basic operations preserve conditioning. In this section we present two new sum rules for conditioned functions in a setting where , where and are convex and is a bounded linear operator. Theorem 3.15 states that if strictly convex and -conditioned up to linear perturbations then also is -conditioned. Theorem 3.17 provides an alternative where the assumption of strict convexity of is replaced by a stable conditioning assumption on , which we formalise in the next definition, inspired by the terminology used in [88, 41, 40].
Definition 3.13.
Let , , and . We say that is -tilt-conditioned if, for every , the tilted function has no minimizers, or is -conditioned on .
Note that a similar notion is already present in the literature: if is -tilt-conditioned (in our sense) on every compact set, then it is firmly convex in the sense of [40, Definition 4.1].
Example 3.14 (Tilt-conditioned functions).
Many conditioned functions relevant for inverse problems are also tilt-conditioned:
- The -norm , and more generally every polyhedral function, are -tilt-conditioned on Euclidean spaces [23, Cor. 3.6].
- Convex piecewise polynomials of degree 2 are -tilt-conditioned on their sublevel sets. This is due to Example 3.8 and the fact that this class of functions is stable up to linear perturbations.
- For the same reasons as above, -uniformly convex functions are -tilt-conditioned on , for .
- If denotes the Kullback-Leibler divergence between two vectors in , then the divergence is -tilt-conditioned on bounded sets. This result is new, and its proof can be found in Lemma A.6.
- The nuclear norm is -tilt-conditioned on bounded sets [103, Proposition 11].
- See [40, Section 4] for more examples and properties of -tilt-conditioned functions on compact sets.
In this first theorem, we show that if a strictly convex function remains conditioned up to linear perturbations, then it is also stable up to convex perturbations:
Theorem 3.15** (Sum rule involving a strictly convex tilt-conditioned function).**
Let , where , let be a Hilbert space, and a bounded linear operator. Suppose that . Let , and assume that:
- a)
the nondegeneracy condition holds,
- b)
is strictly convex on its domain,
- c)
is -tilt-conditioned on for some .
Then, is -conditioned on . We have , where , for some .
Proof.
Let ; Fermat’s rule implies that . Using assumption a) with [11, Thm. 16.47], we can write . Let be such that , i.e., . Let , and set . Using the fact that linear forms are continuous, we can use again Fermat’s rule together with a sum rule [87, Thm. 3.30] to write
[TABLE]
meaning that . It follows then from assumption c) that is -conditioned on . Moreover, because is strictly convex, we have [11, Prop. 16.37.i], and [11, Cor 11.9]. These facts mean that . We can now write the conditioning of evaluated at , together with the convexity of (remember that ):
[TABLE]
Observe that we are allowed to use the conditioning of at , because . Summing these two last inequalities gives
[TABLE]
with , which concludes the proof. ∎
Remark 3.16** (On the nondegeneracy condition a) of Theorem 3.15).**
This condition is very mild, and is satisfied under any of the following sufficient conditions (we note a minimizer of ):
- •
is continuous at (see [11, Prop. 16.27 & Prop. 6.19.vii]).
- •
has a full domain.
- •
, and (see [11, Def. 6.9 & Prop. 6.19.ix]). These inclusions hold for instance if and have open domains.
Theorem 3.15 is useful, but proves to be impractical when is not strictly convex, which typically happens when corresponds to some low-complexity-inducing regularizer used in inverse problems ( norm, group lasso, nuclear norm, total variation, etc.). The next theorem provides a setting for those functions; in exchange for the strict convexity of , we will require to also be tilt-conditioned, and some strong qualification condition to hold.
Theorem 3.17** (Sum rule for tilt-conditioned functions).**
Let , where , and is a bounded linear operator with closed range. Suppose that , and let . If denotes the corresponding Fenchel-Rockafellar dual problem , and
- a)
the nondegeneracy condition holds,
then . Moreover, if
- b)
there is for which the following qualification conditions are satisfied:
[TABLE]
- c)
is -tilt-conditioned on , and is -tilt-conditioned on for some ,
then is -conditioned on every bounded subset of , with .
Proof.
This proof starts as in the proof of Theorem 3.15: we use the nondegeneracy assumption a) with [11, Thm. 16.47] to get some and such that . So the condition [11, Thm. 19.1.iii] is verified, meaning that strong duality holds (in the sense that ). This allows us to use [11, Cor. 19.2] to obtain
[TABLE]
We can use again [11, Cor. 19.2], this time on the dual problem, to also obtain
[TABLE]
The above equality allows us to assume, without loss of generality, that is the element of satisfying b). So, it remains to prove that, for all , there exists such that:
[TABLE]
Fix , let , and set and . Setting , we see from assumption c) and Proposition 3.4 that and are -conditioned on the bounded sets and , respectively. Using the same arguments as in (8), we obtain that and . Therefore, the conditioning of (resp. ) evaluated at (resp. ) writes as
[TABLE]
Summing these two last inequalities gives,
[TABLE]
with . Since on , we deduce that
[TABLE]
It remains to lower bound the right hand side by the distance to . By Example 3.11, thanks to the qualification condition (9) and the fact that is bounded, we derive from (11) that there exists independent of such that
[TABLE]
Define , which is well defined since we assumed to be closed. Let be defined by . Since , necessarily , so we deduce from Example 3.7 that
[TABLE]
On the one hand, we have . On the other hand, the definition of implies . Thus, it follows from (15) that
[TABLE]
Since this is true for any , we can combine it with (14) to get for all
[TABLE]
with . To end the proof, use the qualification condition (10) with Example 3.11 again to get some such that for all ,
[TABLE]
The above inequality, combined with (16) and (12), concludes the proof. ∎
Remark 3.18** (On the qualification conditions).**
When is not strictly convex, the conclusion of Theorem 3.17 may not hold if the qualification conditions (9) and (10) are removed, as proved in [103, Section 4.4.4] with . Let us give some sufficient conditions for (9) and (10) to hold:
- •
If and have finite dimension, b) is equivalent to
[TABLE]
To prove this, use [11, Cor. 6.15] and [92, Thm. 6.7] to see that the above condition is equivalent to (9), which implies (10). This condition is for instance satisfied if and (see [11, Thm. 16.47]). Those are the two conditions needed in [40, Theorem 4.2].
- •
If and have finite dimension and is strictly convex, then a sufficient condition for b) is [11, Prop. 18.9].
- •
If and have finite dimension, is polyhedral and is strictly convex, then assumption b) is not needed. As pointed out in [40, Cor. 4.3], this is due to the fact that the subdifferentials of and are polyhedral, which allows the use of Hoffman’s bound [58] instead of [10, Theorem 4.3] in the proof.
Remark 3.19** (On the closedness of the range).**
In Theorem 3.17 we assume to be closed. To see how important this hypothesis is in infinite dimension, take (which is not strictly convex), and an operator with a nonclosed range. Then, for this example, the qualification conditions cannot be satisfied. Indeed, even if (9) is automatically satisfied (because ), condition (10) reduces to , which is equivalent by definition to , which is impossible. Worse, even if we could get rid of this qualification condition, and if the conclusion of the theorem were true, we would obtain that is -conditioned on bounded sets, which was proven to be impossible in [53, Theorem 2.1] (combine it with Proposition 3.3).
Remark 3.20** (Previous results).**
Our results can be seen as extensions and refinements of arguments from [40], where the authors introduce the ideas of exploiting the -conditioning of tilted functions on compact sets, together with the description of as an intersection (11). Theorem 3.17 improves on [40, Thm. 4.2] and [40, Cor. 4.3] which require the to be bounded, and to be in with (we only ask for a compatibility condition which is satisfied if is continuous at , see Remark 3.16). As far as we know, Theorem 3.15 is the first sum rule of this kind with such weak assumptions on .
To illustrate the interest of these sum rules, we provide a new result for regularized inverse problems where the loss function is the Kullback-Leibler divergence, and the regularizer is a polyhedral function, such as the norm, or the Total Variation, which are commonly used in the signal and image processing literature.
Proposition 3.21**.**
Let , where is polyhedral, , and . If , then is -conditioned on bounded sets.
Proof.
We just have to verify the hypotheses of Theorem 3.17, by noting . First, the nondegeneracy condition a) is verified because is open, and is continuous on its domain (see Remark 3.16). Second, the qualification conditions b) are not needed because we are in a finite dimensional setting, is polyhedral and is strictly convex (see Remark 3.18). Finally, being polyhedral implies that it is globally -tilt-conditioned (see [23, Corollary 3.6]), and we prove in Lemma A.6 that is -tilt-conditioned on bounded sets, so c) is verified. ∎
4 Sharp convergence rates for the Forward-Backward algorithm
In this section, we present sharp convergence results for the forward-backward algorithm applied to -Łojasiewicz functions on a subset , building on the ideas in [5]. We extend the analysis to the case where is an arbitrary set, which will allow us to deal with infinite dimensional inverse problems (see Section 5.1), or structured problems for which all the information is encoded in a manifold (see Section 5.2). We also provide explicit rates of convergence, for both the iterates and the values. The proofs of Section 4.1 are left in the Annex A.3.
4.1 Refined analysis with -Łojasiewicz functions
Theorem 4.1** (Strong convergence and rates, ).**
Suppose that Assumption 2.1 is in force, and that is bounded from below. Let be generated by the FB algorithm. Assume that:
- a)
(Localization) for all , ,
- b)
(Geometry) is -Łojasiewicz on , for some .
Then the sequence has finite length in , meaning that , and converges strongly to some . Moreover, there exist some constants with explicit expressions (see equations (52) and (54)), such that the following convergence rates hold, depending on the value of , and of :
- i)
If , then for every .
- ii)
If , the convergence is superlinear: for all ,
[TABLE]
- iii)
If , the convergence is linear: for all ,
[TABLE]
- iv)
If , the convergence is sublinear: for all ,
[TABLE]
Note that the rates range from the finite termination, for , to the worst-case rates seen in Theorem 2.2, when tends to . The bigger is , the more the function is ill-conditioned, in the sense that the rates of its values become closer to , and the rates of its iterates become arbitrarily slow.
Remark 4.2** (Related work).**
Theorem 4.1 collects known and new results. We present a simple proof of this theorem, focusing on the analysis of a real sequence satisfying (50) (see [27, Theorem 3.2] or [46, Theorem 3.4] for previous results). The superlinear rates in ii), which were known for the proximal point algorithm [82], are new for the Forward-Backward algorithm. Moreover, the case was giving R-linear rates for the values in [27, 46], while we prove here Q-linear rates. Also, the quantification of the number of steps in the case involving is new.
Remark 4.3** (On the sharpness of the rates I).**
Let . According to (4) and (5), the order of the sublinear rates for the forward-backward algorithm that we obtain for both iterates and values are sharp when , see Remark 2.3. When , we see that the proximal algorithm verifies , and the algorithm converges linearly. Finally, when , the order of superlinearity that we obtain is not sharp, since for this function the proximal algorithm has a Q-superlinear rate of order . It is shown in [82, Theorem 3.1] that converges with this order for the proximal algorithm. For this, the author uses the stronger notion of metric subregularity, and we will extend this result in Theorem 4.21 to the FB algorithm.
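As a numerical illustration of these regimes, the following minimal sketch runs the proximal algorithm in one dimension. It assumes the model functions t ↦ |t|^p / p with unit stepsize, which is a choice made here for illustration and not necessarily the function considered above; the helper names `prox_power` and `proximal_iterates` are hypothetical. For p = 1 the iterates reach the minimizer after finitely many steps, for p = 2 they contract by a constant factor, and for p = 4 the contraction factor tends to one.

```python
import numpy as np

def prox_power(v, lam, p):
    """Proximal map of t -> lam * |t|**p / p at the point v, for p >= 1."""
    a = abs(v)
    if a == 0.0:
        return 0.0
    if p == 1:
        # Soft-thresholding: closed form for the absolute value.
        return float(np.sign(v)) * max(a - lam, 0.0)
    # For p > 1 the prox t solves t + lam * t**(p - 1) = a on [0, a];
    # the left-hand side is increasing in t, so bisection applies.
    lo, hi = 0.0, a
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid + lam * mid ** (p - 1) < a:
            lo = mid
        else:
            hi = mid
    return float(np.sign(v)) * 0.5 * (lo + hi)

def proximal_iterates(x0, lam, p, n_iter):
    """Iterates x_{n+1} = prox(x_n) of the proximal algorithm, as an array."""
    xs = [float(x0)]
    for _ in range(n_iter):
        xs.append(prox_power(xs[-1], lam, p))
    return np.array(xs)

for p in (1, 2, 4):
    xs = proximal_iterates(x0=1.0, lam=1.0, p=p, n_iter=50)
    # Show the first iterate and the last contraction factor (when defined).
    last_ratio = xs[-1] / xs[-2] if xs[-2] != 0.0 else float("nan")
    print(p, xs[1], last_ratio)
```

With these choices, the printed contraction factor is 0.5 for p = 2 and close to one for p = 4, while for p = 1 the first iterate is already the minimizer.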
Remark 4.4** (Best stepsize and condition number).**
When , we directly see that the bigger is , the better are the constants in the rates for the values. This is true also for , by looking in the proof of Theorem 4.1 to the definition of the constant . The constant is maximal when we take , in which case . When is a -strongly convex function, is the condition number of (see Example 3.6) . So can be seen as a generalized condition number, extending this notion from strongly convex functions to -Łojasiewicz ones.
In Theorem 4.1 the -Łojasiewicz assumption with implies that the is nonempty. In what follows we will derive convergence rates for the objective function values, even in the case where is bounded from below but has no minimizers. Such results are of interest for instance in function approximation theory, where the goal is to find the best approximation of a target function within a specified function class [35]. Since in general the considered classes are not closed in the ambient space, the minimizer of the error does not exist, but convergence rates in objective function values are useful. A similar problem also appears in supervised statistical learning theory, where some convergence results are still available (see e.g. [34, Theorem 9] and [33, Theorem A.1]).
We show below that the -Łojasiewicz notion can be extended to nonpositive values of , which allows us to describe the geometry of problems without minimizers. Based on this new definition, we then derive sharp convergence rates for the objective function values.
Definition 4.5**.**
Let , let be bounded from below, and let . We say that is -Łojasiewicz on if such that the Łojasiewicz inequality holds:
[TABLE]
Similarly to the case , where this property describes the behavior of around its minimizers, here it describes the decay of when goes to . This assumption leads to convergence rates, interpolating between and , depending on the value of . We will see in Section 5.1 that this result applies to ill-posed linear problems involving a compact operator between infinite dimensional spaces.
Theorem 4.6** (Rates of convergence, ).**
Let be bounded from below and satisfying Assumption 2.1, be generated by the FB algorithm. Assume that:
- a)
(Localization) for all , ,
- b)
(Geometry) is -Łojasiewicz on , for some .
Then the values converge sublinearly (with defined as in (52)):
[TABLE]
Remark 4.7** (On the sharpness of the rates II).**
The rates obtained in Theorem 4.6 are sharp. Indeed, the function defined in (6) is -Łojasiewicz on with , and our rates match the lower bounds obtained in Remark 2.3.
Theorem 4.6, together with Theorem 4.1, gives a complete (and sharp) picture of the asymptotic behavior of the FB algorithm. In fact, looking at the proofs of the mentioned results, we see that the only properties of the forward-backward algorithm that are used are (46) and (47). We can then extend the previous theorems to a broader class of first-order descent methods, which encompasses block coordinate descent methods, and/or variable metric extensions of the FB algorithm [5, 18, 46].
Theorem 4.8** (General first-order descent method).**
The statements of Theorems 4.1 and 4.6 remain true if the sequence is generated by any algorithm satisfying:
[TABLE]
In that case the constant appearing in Theorem 4.1 becomes .
4.2 How to localize the sequence of iterates
One of the two assumptions we make in Theorems 4.1 and 4.6 is that the sequence belongs to a set on which the geometry of is known. We discuss here some possible choices. A first simple case is when remains invariant under the action of (see also Annex A.2).
Definition 4.9**.**
We say that is FB-invariant if for all , .
Example 4.10** (FB-invariant sets).**
Theorem 2.2i-ii and Lemma A.9.ii imply that these sets are FB-invariant (as well as any of their intersections):
- •
and for every , and for every ,
- •
for every ,
- •
and , for every ,
- •
if is generated by the FB algorithm.
Assuming that is FB-invariant, the localization property becomes a simple assumption on the initialization of the algorithm. The proof of the next corollary is immediate:
Corollary 4.11** (Geometry on stable sets gives global rates).**
Let be bounded from below and satisfying Assumption 2.1, and be generated by the FB algorithm. Assume that is FB-invariant and that:
- a)
(Initialization) ,
- b)
(Geometry) is -Łojasiewicz on , for some .
Then the results of Theorems 4.1 and 4.6 apply for the sequence .
In some cases, it is possible to remove the assumption , at the price of having only asymptotic rates. Indeed, it suffices to prove that the sequence enters at a certain iteration, which is the argument used in [5, 46], in a non-convex setting. This happens for instance with the local level sets, under a slight compactness assumption (see below).
Corollary 4.12** (Local geometry gives asymptotical rates).**
Let be such that and satisfying Assumption 2.1. Let be generated by the FB algorithm and assume that:
- a)
(Compactness) admits a subsequence strongly converging to in ,
- b)
(Local geometry) for some :
[TABLE]
Then there exists such that the rates of Theorem 4.1 apply for the sequence .
Proof.
Let be a subsequence strongly converging to some , which belongs to according to Theorem 2.2. Therefore, is -Łojasiewicz on , for some . Since and , there exists such that . Since is FB-invariant, we conclude that . ∎
Remark 4.13** (On the compactness assumption).**
The compactness assumption made in Corollary 4.12 is always satisfied in finite dimension. Indeed Theorem 2.2 guarantees that the sequence is bounded under the assumption that . If has infinite dimension, this assumption can be verified provided that has compact level sets, due to the decreasing property of .
The property that a sequence generated by an algorithm reaches a set of interest after a finite number of iterations, is usually called identifiability, or finite identification of [100, 68, 54], and is therefore called an active set. For instance, the so-called active manifolds can be identified in finite time, under the assumption that is partially smooth with respect to this manifold [54, 55]. An alternative approach, recently introduced in [43], shows that the strata of mirror-stratifiable functions are identifiable. We will use this notion of active strata to derive another asymptotic convergence result.
Before introducing the notion of mirror-stratifiability, we recall that a set is said to be stratified by if this family is a finite partition such that . The latter inclusion endows the family of strata with an order relation . Given a point , it will be useful to denote by the unique stratum which contains .
Definition 4.14** (Mirror-stratifiable function).**
We say that a function is mirror-stratifiable if
- a)
(resp. ) is stratified by (resp. ),
- b)
the map realizes a bijection between and ,
- c)
the map is decreasing, in the sense that .
Both notions appear naturally in most sparsity-based inverse problems such as the -norm, group-lasso norm, nuclear norm, or the total variation, or any polyhedral function, see [43] for more details and many examples.
Corollary 4.15**.**
Suppose that Assumption 2.1 is in force, that , and let be the sequence generated by the FB algorithm converging to some . Assume that:
- a)
is mirror-stratifiable, and we define ,
- b)
is -Łojasiewicz on for some and .
Then there exists such that the rates of Theorem 4.1 apply for the sequence . Note that holds whenever .
Proof.
It follows from [43, Theorem 4] that there exists for which for every . Since converges to , we can assume that is such that for every . This, together with b), allows us to apply Theorem 4.1 to the sequence . The equality follows directly from the bijectivity of , and the fact that . ∎
The reader not familiar with the notion of mirror-stratifiability might wonder what the active set appearing in Corollary 4.15 is. Here are a few examples of interest:
Example 4.16**.**
We keep here the notations of Corollary 4.15:
- •
If , we can choose a stratification based on sets with prescribed support, which gives
[TABLE]
where is the support of , and is the set of active indices of in . Some authors call the extended support of . In the case that , we have .
- •
If is the nuclear norm, we can choose a stratification based on sets of matrices with prescribed rank, which gives
[TABLE]
where denotes the set of singular values of the matrix . If , we have .
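To make the identification of the support concrete, here is a minimal sketch of the FB algorithm for ℓ1-regularized least squares (often called ISTA). The functions `soft_threshold`, `ista` and `support`, as well as the toy data, are hypothetical: they are chosen so that every step can be checked by hand and are not taken from the references. On this example the minimizer is (2, 0), and the support of the iterates coincides with the support of the minimizer from the fourth iteration on.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1, applied componentwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, alpha, x0, step, n_iter):
    """FB iterations for 0.5 * ||A x - y||**2 + alpha * ||x||_1."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                           # forward (gradient) step
        x = soft_threshold(x - step * grad, step * alpha)  # backward (proximal) step
        iterates.append(x.copy())
    return iterates

def support(x):
    """Indices of the nonzero entries of x."""
    return tuple(int(i) for i in np.flatnonzero(x))

# Toy data (assumption): identity operator, so the gradient has Lipschitz
# constant 1 and the stepsize 0.5 is admissible.
A = np.eye(2)
y = np.array([3.0, 0.5])
iterates = ista(A, y, alpha=1.0, x0=[0.0, 5.0], step=0.5, n_iter=30)
supports = [support(x) for x in iterates]
print(supports[:6])  # prints [(1,), (0, 1), (0, 1), (0, 1), (0,), (0,)]
```

The data are dyadic on purpose, so that the second coordinate becomes exactly zero in floating point arithmetic and the support can be read off without a tolerance.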
Remark 4.17** (Partial smoothness).**
Even if there is no direct relation between mirror stratification and partial smoothness, all the above mentioned functions are both mirror-stratifiable and partially smooth, and it would be immediate to provide an analogous result to Corollary 4.15 for partially smooth functions. Note that when using the identification theorems for partially smooth functions, it is necessary to assume the qualification condition to hold. In this case, the active manifold coincides with the active set for most practical cases (polyhedral functions, spectral norms), meaning that those cases are already covered by Corollary 4.15.
Remark 4.18** (On the assumptions).**
Note that our assumptions do not require or imply that has a unique minimizer; we only require to be Łojasiewicz on the active set. In Section 5.2, we will show how this geometrical assumption can be guaranteed, provided that is injective when restricted to the tangent cone of the active set. In [74, Thm. 3.7] the authors provide a sufficient condition for the Łojasiewicz inequality to hold locally when is a partially smooth function.
4.3 Linear rates of convergence for the Forward-Backward algorithm
In this Section we give more insights on the linear rates for the FB algorithm. According to Theorem 4.1, and converge linearly when a -Łojasiewicz property is verified. Another decreasing quantity of interest is , and its Q-linear convergence is equivalent to asking that the forward-backward map satisfies
[TABLE]
If such property holds on a set containing , the sequence will converge Q-linearly. In fact, it is possible to show that (22) is equivalent to the -conditioning of on , provided this set is FB-invariant (see Definition 4.9). This fact has been observed in [85] for the projected gradient method, with and , and below we extend the argument to our more general setting.
Proposition 4.19** (Linear rates and -conditioning).**
Suppose that Assumption 2.1 is in force and assume that . Let and .
- i)
If verifies (22) on , then it is -conditioned on with .
- ii)
If is -conditioned on , then it verifies (22) on with for stepsizes .
Then, on FB-invariant sets, the -conditioning is equivalent to (22), for stepsizes .
Proof.
Let , and let . It follows from the triangle inequality that
[TABLE]
[TABLE]
For item i, combine (22), (23), and (24):
[TABLE]
For item ii, Lemma A.9.i with , and the fact that implies
[TABLE]
Then, since is -conditioned on , we can conclude from
[TABLE]
Let us assume that is a -strongly convex function, with as in Example 3.6, and let be its unique minimizer. Let be generated by the FB algorithm, for which we take , and define the condition number of as . We compare the different linear rates that we can get for by using different theorems, relying on more or less strong assumptions. Using that is -Łojasiewicz (with , see Example 3.6), Theorem 4.1 yields R-linear rates of the form
[TABLE]
where If instead we exploit -conditioning (recall that in general this is a stronger notion than 2-Łojasiewicz, see Proposition 3.3), we obtain Q-linear rates from Proposition 4.19 with exactly the same constant . If we use directly the strong convexity of , we obtain in this case Q-linear rates with (see e.g. [95, Proposition 3]). So, the more information we use, the better the rates we derive. In [85], the authors investigate different notions lying between strong convexity and the -conditioning. For instance, under an assumption of “quasi strong convexity”, they obtain which is smaller than , but not as good as . In conclusion, two aspects are crucial in the linear convergence of forward-backward. First, to have Q-linear rates for the iterates, it is necessary and sufficient to require the -conditioning of the function, due to the equivalence result of Proposition 4.19. Second, just assuming -conditioning is not a guarantee of having a fast computation of the solution, since linear rates can be arbitrarily slow on any finite number of iterations. Indeed two constants play a key role: the condition number , which is directly related to (some extra assumptions on could improve the value of , see e.g. the discussion in Subsection 5.2), and (see also [85]).
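The role of the condition number can be checked numerically on a toy problem. The sketch below is an illustration under assumptions made here: it takes a zero nonsmooth term, a strongly convex quadratic x ↦ ½⟨Qx, x⟩ whose unique minimizer is the origin, and the stepsize 1/L. In that case the classical bound ‖x_{n+1}‖ ≤ (1 − μ/L)‖x_n‖ holds, where μ and L are the extreme eigenvalues of Q; this constant is not claimed to coincide with any of the constants discussed above. The function name `contraction_ratios` is hypothetical.

```python
import numpy as np

def contraction_ratios(Q, x0, n_iter=30):
    """Gradient descent on 0.5 * <Q x, x> with stepsize 1/L.

    Returns (mu, L, ratios), where ratios[n] = ||x_{n+1}|| / ||x_n|| is the
    observed contraction of the distance to the minimizer (the origin).
    """
    eigs = np.linalg.eigvalsh(Q)      # eigenvalues in ascending order
    mu, L = eigs[0], eigs[-1]
    x = np.asarray(x0, dtype=float)
    ratios = []
    for _ in range(n_iter):
        x_next = x - (1.0 / L) * (Q @ x)
        ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
        x = x_next
    return mu, L, np.array(ratios)

# Toy data (assumption): condition number L / mu = 10.
Q = np.diag([1.0, 10.0])
mu, L, ratios = contraction_ratios(Q, x0=[1.0, 1.0])
print(1.0 - mu / L, ratios.max())
```

Increasing the ratio L/μ in this sketch pushes the observed contraction factor towards one, which is the slow-down described above.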
4.4 Superlinear rates and finite termination
In this section, we refine the convergence analysis for the case , replacing the -Łojasiewicz property with -metric subregularity (or -conditioning). As discussed in Remark 4.3, the order of superlinear convergence that we derive for the FB algorithm in the case is not sharp. In Theorem 4.21, using -metric subregularity (or -conditioning) instead of -Łojasiewicz , we derive better (and indeed sharp, see Remark 4.3) superlinear rates. Keep in mind these three notions are only equivalent via Proposition 3.3 if verifies a stability condition. The proof of Theorem 4.21 below follows directly from the next lemma, which is a partial analogue of Proposition 4.19-ii.
Lemma 4.20**.**
Suppose that Assumption 2.1 is in force and assume that .
- i)
If is -metrically subregular on , then for all , and :
[TABLE]
- ii)
If is -conditioned on , then for all , and :
[TABLE]
Proof.
Let . Lemma A.9.ii, the triangle inequality, and Theorem 2.2-ii yield
[TABLE]
For i, use the hypothesis with (25) to derive For ii, use the -Łojasiewicz inequality via Proposition 3.3 , together with (25) and the -conditioning:
[TABLE]
Theorem 4.21**.**
Assume that and that the hypotheses of Theorem 4.1 hold. If the -Łojasiewicz hypothesis is replaced by -metric subregularity (resp. -conditioning), then (resp. ) Q-superlinearly converges with order .
We now discuss the relevance of these fast rates when is -Łojasiewicz with . While superlinear rates are well-known for the proximal algorithm applied to sharp functions, they are not observed for the gradient method. The apparent contradiction between this result and practice is in fact related to a quite intuitive fact, stated in the following Proposition: the smoother a function is, the less sharp it can be. This means that the gradient algorithm cannot be applied to a -Łojasiewicz function, with , because it is incompatible with being Lipschitz continuous. A similar statement, under different assumptions, can be found in [13, Proposition 2.8].
Proposition 4.22**.**
Let be such that has a nonempty interior. Assume to be differentiable on , where is convex and such that (note that holds when , for , because is nonexpansive). Assume that is -conditioned on , and that is -Hölder continuous on , i.e.
[TABLE]
Then . In the case that , we have moreover that .
Proof.
Let , and . Then and . For all , let . Then and . From the -conditioning assumption and the Descent Lemma A.10 applied at , we see that:
[TABLE]
If we suppose that , then by passing to the limit for , we get which is impossible. So , and if equality holds, follows from (26). ∎
As a consequence of Proposition 4.22, we should not expect more than linear rates for the gradient method applied to a convex function. Such a result cannot be extended straightforwardly to the Forward-backward algorithm. For instance, the function has a nontrivial smooth term in its decomposition, but is still sharp at its minimizer.
5 Linear inverse problems: from modeling assumptions to convergence rates
Throughout this section, and are Hilbert spaces and is a bounded linear operator. is called the parameter space and is the data space. Given the linear inverse problem , for some , we are interested in the (possibly regularized) convex optimization problem
[TABLE]
where and . The goal of this section is to show that typical modeling assumptions made in the inverse problem literature can be interpreted as geometric assumptions on (27), which are often not local, in the sense of Definition 3.1. First, we show that the classical source conditions are equivalent to a Łojasiewicz condition on suitable subsets, that we call source sets. Second, we show that the restricted isometry property, which is the key for exact recovery in sparsity based regularization, induces a 2-conditioning of the problem over a cone of sparse vectors, which is identified in finite time by the algorithm. This result extends to general inverse problems with mirror-stratifiable regularizing functions, for which the restricted isometry property entails a 2-conditioning of the problem over an active set (introduced in Corollary 4.15).
5.1 Łojasiewicz property of quadratic functions via source conditions in Hilbert spaces
Throughout Section 5.1, we assume that is a bounded linear operator, that , and that is the associated least squares function. We will also write , and, whenever , we will write , which verifies .
5.1.1 Elements of linear algebra
Before going further into the topic, let us recall some basic (but not necessarily well-known) facts about bounded linear operators in Hilbert spaces. A first important difference with the finite-dimensional setting is that the set of minimizers of can be empty:
Proposition 5.1** ([51, Theorem 3.1.1]).**
Let be a bounded linear operator, and . Then .
We see that is guaranteed when is closed, which for instance cannot happen for compact operators with infinite-dimensional range [51, Theorem 3.1.3]. Observe that the closedness of can be checked by means of its singular values:
Proposition 5.2**.**
Let be a bounded linear operator. Then is closed if and only if .
Proof.
Use the fact that [42, Proposition 2.18] together with [53, Remark 2.3] and the fact that [56, §32 Theorem 3]. ∎
5.1.2 Known results about the Landweber algorithm
The quadratic function can be minimized by means of a gradient method, defined as
[TABLE]
A vast literature is devoted to this algorithm, which is often called in this context the Landweber algorithm. It is well-known that whenever , the sequence generated by the Landweber algorithm converges strongly to the projection of onto (see e.g. [42, Theorem 6.1], or [51, Theorem 3.3.2] for varying stepsizes). When the range is closed, the algorithm behaves exactly as in finite dimensions: both iterates and values converge linearly, see Example 3.7 and Theorem 4.1. If the is not closed, instead, the rates for can be arbitrarily slow without additional assumptions [32, Theorem 12]. Moreover, [53, Theorem 2.1] shows that no local Łojasiewicz property can be satisfied by such a quadratic function when is not closed. This could suggest that it is not possible to rely on geometrical assumptions to obtain convergence rates. Nevertheless, as we will see below, this is not true. Indeed, in the inverse problem literature, this worst-case scenario is avoided by making an extra assumption on the problem. For instance, if the following source condition is verified
[TABLE]
the Landweber algorithm initialized with is known [42] to have the rates
[TABLE]
Also, when , a source condition in can be made:
[TABLE]
so that the Landweber algorithm initialized with verifies [34, Theorem 2.10]:
[TABLE]
The source condition (31) can be understood in light of Proposition 5.1. Indeed, this proposition says that the problem is well posed (in the sense that ) when . So it is reasonable to think that the “deeper” is in , the easier the problem is. In the ill-posed case , we could also imagine that the “further away” is from , the more difficult the problem is. Estimating the location of can be done thanks to the spaces , because they form a sequence of nonincreasing dense subsets of (see Lemma A.14 and [42, Proposition 2.8]):
[TABLE]
The aim of this section is to highlight how the rates (30) and (32) can be simply explained using the results of Section 4. We show that the source conditions (29) and (31) are equivalent to assuming that the initialization of the algorithm belongs to a so-called source set. Our main result in this section consists in showing that the function satisfies a Łojasiewicz inequality on these source sets, which are FB-invariant. As a by-product of Corollary 4.11, we will obtain a new and simple geometrical interpretation of the rates in (30) and (32).
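For concreteness, the Landweber iteration can be sketched in a few lines in finite dimension. This is a minimal illustration with hypothetical names and data, assuming a constant stepsize in (0, 2/‖A‖²), which is the usual range for this method. On the toy example, the set of least squares solutions is a line, and the iterates approach the projection of the initial point onto it, in agreement with the convergence result recalled above.

```python
import numpy as np

def landweber(A, y, x0, step, n_iter):
    """Gradient iterations x <- x - step * A^T (A x - y) for 0.5 * ||A x - y||**2."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = x - step * (A.T @ (A @ x - y))
    return x

# Toy data (assumption): A has a nontrivial kernel, so the minimizers form a line.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
y = np.array([1.0, 2.0])
x0 = np.array([0.0, 0.0, 3.0])

step = 1.0 / np.linalg.norm(A, 2) ** 2   # spectral norm; step lies in (0, 2 / ||A||^2)
x_final = landweber(A, y, x0, step, n_iter=200)

# Projection of x0 onto the set of least squares solutions, via the pseudo-inverse.
x_proj = x0 + np.linalg.pinv(A) @ (y - A @ x0)
print(x_final, x_proj)
```

The third coordinate of the initial point lies in the kernel of A and is never modified by the iteration, which is why the limit depends on the initialization.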
5.1.3 Regularity spaces and source sets
Definition 5.3** (Regularity space and source set).**
- 1.
Given , the data regularity space and the data source set are respectively defined as:
[TABLE]
- 2.
Given , the regularity space and the source set are respectively defined as:
[TABLE]
where denotes the preimage of a set under the application .
Proposition 5.4**.**
- i)
if and only if for all .
- ii)
if and only if for all .
- iii)
Assume is closed. Then for all .
Proof.
Given any , observe that is, by definition, equivalent to . Since , the latter is equivalent to . We can then easily deduce, using also Proposition 5.1, that For items i and ii, the claim follows directly from the nonincreasingness of . For item iii, observe that for all , [56, §32 Theorem 3]. As a consequence of Proposition 5.2, we deduce that is closed, and therefore (see Lemma A.14 in the Annex). In particular, for all , and the result follows from item ii. ∎
For well-posed problems, for which (and exists), the source sets admit a simpler expression (the proof is deferred to the Annex):
Lemma 5.5 (Source sets for well-posed problems).
Assume that . Then, for all and :
[TABLE]
Remark 5.6.
Given that , we see that the classical conditions in (29) and (31) are equivalent, with our notations, to and . This means in particular that (29) is just a particular case of (31).
Remark 5.7 (Source sets as balls).
Assume that is injective and . For all , is a dense subspace of (Lemma A.14), and we can endow it with the norm induced by the unbounded operator , defined as . Then, we see that the source sets are nothing but balls centered at the solution , with respect to this norm:
[TABLE]
while is the affine space spanned by these balls. By analogy with the following example, the reader can think of this norm in as if it were a Sobolev norm in an space. Note that these balls may have an empty interior with respect to the topology of .
Example 5.8 (Regularity spaces as Sobolev spaces).
Assume that is the space of zero mean -functions on :
[TABLE]
If is the linear integration operator defined on , then coincides with the Sobolev space [59, Theorem 6.4], so that the regularity space is here
[TABLE]
5.1.4 Properties of quadratic functions on source sets
Here is the main result of this section: on each source set , the least squares functional is -Łojasiewicz with .
Theorem 5.9 (Geometry of least squares on source sets).
Let and . Then is -Łojasiewicz on , with
[TABLE]
Moreover, these two constants are sharp.
Proof.
Let and recall that . From Definition 5.3 and the definition of , we get
[TABLE]
We first prove that verifies the Łojasiewicz inequality by using the interpolation inequality (see Lemma A.13 in the Annex) with and , together with (34):
[TABLE]
We use (34) in the right member of (36), to write
[TABLE]
By combining (34), (35), (36) and (37), we obtain the following inequality
[TABLE]
Then the desired Łojasiewicz inequality holds by taking . Now we verify that the obtained constants in (33) are sharp. For this, let , and let be its canonical basis. Let be a strictly positive sequence converging to zero, and define as follows: . Let , , and let us assume that is -Łojasiewicz on for some :
[TABLE]
Let , which satisfies , and deduce from (38) that
[TABLE]
It follows from that , which is equivalent to . If , it means that , which is a regime in which the smaller is , the better. If , then , which is a regime in which the larger is , the better. In both cases we see that is the best possible exponent. Moreover, when , (39) becomes which implies the sharpness of the constant obtained in (33). ∎
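The sharpness of the exponent can be checked numerically on the diagonal counterexample. The sketch below assumes that the Łojasiewicz inequality compares $(f - \inf f)^{1-1/p}$ with the gradient norm, and that the exponent of Theorem 5.9 is $p = 2 + \mu^{-1}$, which is consistent with the classical Landweber rates. The function name and the numerical values are hypothetical.

```python
def lojasiewicz_ratio(sigma, mu, delta, p):
    """Ratio ||grad f(x)|| / (f(x) - inf f)^(1 - 1/p) at a source-set point.

    Assumed setting (mirrors the diagonal counterexample of the proof):
    A e_k = sigma * e_k, f(x) = 0.5 * ||A x - y||^2 with solution x_dagger,
    and x = x_dagger + delta * sigma**(2 * mu) * e_k.
    """
    amplitude = delta * sigma ** (2 * mu)      # norm of x - x_dagger
    gap = 0.5 * (sigma * amplitude) ** 2       # f(x) - inf f
    grad_norm = sigma ** 2 * amplitude         # norm of A^*(A x - y)
    return grad_norm / gap ** (1.0 - 1.0 / p)


mu, delta = 1.0, 1.0
sigmas = [10.0 ** (-k) for k in range(1, 5)]   # singular values tending to 0

# With p = 2 + 1/mu the ratio does not depend on sigma.
ratios_sharp = [lojasiewicz_ratio(s, mu, delta, p=2.0 + 1.0 / mu) for s in sigmas]
# With a smaller exponent the ratio collapses as sigma tends to 0.
ratios_small = [lojasiewicz_ratio(s, mu, delta, p=2.5) for s in sigmas]
```

Under these assumptions, the first list is constant while the second decreases to zero, so no positive constant can work for an exponent below $2 + \mu^{-1}$ on this family of points.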
Remark 5.10.
The result of Theorem 5.9 contrasts with [53, Theorem 2.1], in which the authors show that no local Łojasiewicz property can be satisfied by a quadratic function when is not closed. The key difference here is that we look at the Łojasiewicz property on specific dense sets with empty interior (see Remark 5.7).
Let us now verify that the source sets are invariant under the action of the Landweber algorithm (28). As mentioned at the beginning of the section, the Landweber algorithm is the gradient descent algorithm applied to a quadratic function, and therefore it is an instance of the FB algorithm. We can thus apply the convergence rates of Section 4 once we prove that the source sets are invariant.
Proposition 5.11 (Invariance of source sets).
For all , the source set is FB-invariant.
Proof.
Let , , and let us prove that belongs to . By using Lemma 5.5, we deduce that , , and with . Since , this implies that
[TABLE]
The above equality shows that . It remains only to prove that verifies and . The condition immediately follows from and . Next, observe that is obtained by applying a gradient descent step to with respect to the function . Since this function has zero as a minimizer, and is differentiable with a -Lipschitz gradient, the Fejér property (see Theorem 2.2-ii) implies that . ∎
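The invariance can be observed numerically for a diagonal operator. The sketch below assumes that, for injective $A$, the source set consists of the points $x^\dagger + (A^*A)^\mu w$ with $\|w\| \le \delta$, and it tracks the norm of $w$ along Landweber steps. All names and numerical values are hypothetical.

```python
def source_coordinates(x, x_dagger, sigma, mu):
    """Return w with x - x_dagger = (A^*A)^mu w, for diagonal A = diag(sigma)."""
    return [(xk - xd) / s ** (2 * mu) for xk, xd, s in zip(x, x_dagger, sigma)]


def landweber_step_diag(x, y, sigma, step):
    """One Landweber step for the diagonal operator A = diag(sigma)."""
    return [xk - step * s * (s * xk - yk) for xk, yk, s in zip(x, y, sigma)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(vk * vk for vk in v) ** 0.5


# Hypothetical data: injective diagonal operator with decaying singular values.
n = 50
sigma = [1.0 / (k + 1) for k in range(n)]
x_dagger = [1.0 / (k + 1) ** 2 for k in range(n)]
y = [s * xd for s, xd in zip(sigma, x_dagger)]       # y = A x_dagger
mu, step, delta = 0.5, 1.0, 1.3                      # step * ||A||^2 = 1 < 2

# Start inside the source set: w_k = 1/(k+1), whose norm is below delta.
x = [xd + s ** (2 * mu) / (k + 1)
     for k, (xd, s) in enumerate(zip(x_dagger, sigma))]

w_norms = []
for _ in range(20):
    w_norms.append(norm(source_coordinates(x, x_dagger, sigma, mu)))
    x = landweber_step_diag(x, y, sigma, step)
```

Each step multiplies the $k$-th coordinate of $w$ by $1 - \lambda\sigma_k^2$, whose modulus is at most one for admissible stepsizes, so the recorded norms never increase and the iterates stay in the same source set.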
Next we combine all the results of this section to derive convergence rates of the Landweber algorithm under source conditions from Łojasiewicz conditions.
Corollary 5.12 (Convergence rates for Landweber algorithm).
Let be a sequence generated by the Landweber algorithm (28). Assume that for some , the source condition is satisfied. Then:
i) ,
ii) If , then , where .
Proof.
For item i, the source condition together with Proposition 5.11 implies for some . If , we derive from Theorem 5.9 that is -Łojasiewicz on . Depending on the sign of , the rates on follow from Theorems 4.1 and 4.6. If , then the source condition and Proposition 5.4 ensure that , meaning that , so the rate follows from Theorem 2.2. For item ii, the convergence and rates on the iterates follow from Theorem 4.1. To show that the limit of the sequence (let us denote it by ) is , it is enough to verify that , since is an affine space parallel to . Because of the definition of the algorithm, it is easy to show by induction that . This being true for all , we can pass to the limit and deduce that . ∎
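As a sanity check on the rates for the iterates, the following sketch runs the Landweber iteration on a truncated diagonal operator under a source condition of order $\mu$, and compares the error with the classical spectral bound $\|x_n - x^\dagger\| \le (\mu/(\lambda n))^\mu \|w\|$. This bound is the standard estimate for $x_0 = 0$ and $\lambda\|A\|^2 \le 1$, stated here as an assumption and not as the exact constant of the corollary. The data are hypothetical.

```python
def landweber_errors(sigma, x_dagger, step, n_iter):
    """Errors ||x_n - x_dagger|| of Landweber started at 0 for A = diag(sigma)."""
    y = [s * xd for s, xd in zip(sigma, x_dagger)]
    x = [0.0] * len(sigma)
    errors = []
    for _ in range(n_iter + 1):
        errors.append(sum((xk - xd) ** 2 for xk, xd in zip(x, x_dagger)) ** 0.5)
        x = [xk - step * s * (s * xk - yk) for xk, yk, s in zip(x, y, sigma)]
    return errors


# Hypothetical ill-conditioned example with a source condition of order mu:
# x_dagger = (A^*A)^mu w, i.e. x_dagger_k = sigma_k^(2 mu) * w_k.
mu, step, n_iter = 0.5, 1.0, 300
sigma = [1.0 / (k + 1) for k in range(200)]
w = [1.0 / (k + 1) for k in range(200)]
x_dagger = [s ** (2 * mu) * wk for s, wk in zip(sigma, w)]
w_norm = sum(wk * wk for wk in w) ** 0.5

errors = landweber_errors(sigma, x_dagger, step, n_iter)
# Assumed classical bound, valid for n >= 1.
bounds = [(mu / (step * n)) ** mu * w_norm for n in range(1, n_iter + 1)]
```

The bound follows from maximizing $t \mapsto (1 - \lambda t)^n t^\mu$ over the spectrum, which is the same mechanism that the Łojasiewicz inequality on source sets encodes geometrically.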
5.2 Sparsity based regularization, partial smoothness, and restricted injectivity
In this section we turn to the general case of optimization problems coming from a regularized inverse problem (27). In particular, we focus on the case where verifies a restricted injectivity condition at a solution, a situation which typically arises when is mirror-stratifiable, and typical modeling assumptions from the inverse problems/compressed sensing literature hold. In this setting we will be able to derive the 2-conditioning of the objective function in (27). In what follows, we will use the notation to refer to the set of bounded selfadjoint positive linear operators on .
5.2.1 Coercive linear operators on a cone
Definition 5.13.
We say that is a cone if it is a union of rays: .
Note that we do not require a cone to be convex. This is important for certain applications in which we have geometrical information about a function over a union of linear spaces, see for instance (40) in the context of sparse regularization problems.
Definition 5.14.
Let , let , and let be a cone. We say that is -coercive on if, for all , .
Example 5.15 (Coercivity for positive symmetric matrices).
A matrix is coercive on a closed cone if and only if is injective when restricted on (see Proposition A.15 for a proof):
[TABLE]
Example 5.16.
Any operator is -coercive on (see e.g. the proof in [30, Thm. 4]). In particular, if is positive definite then it is -coercive on .
In the next proposition we relate the coercivity of the Hessian of a function on a cone to the -conditioning of on this cone. This relation can be seen as a weakened analogue of the well known fact (see [11, Prop. 10.8 & 17.7.(iii)]) that, for :
is -strongly convex () is -coercive on .
Strong convexity is a global notion, which requires the function to have a positive definite quadratic-like geometry at each . On the contrary, the -conditioning requires the function to have a positive quadratic-like geometry, on a given set . We now state our result (its proof is left in the Annex A.5). For similar results, see also [19, Section 3.3.1] and [41].
Proposition 5.17 (Coercivity of the Hessian implies -conditioning).
Let with and . Assume that is of class in a neighbourhood of , and that is -coercive on a closed cone . Then,
[TABLE]
and . If and is -Lipschitz, we can take .
5.2.2 Conditioning on prox-regular sets via restricted injectivity of the Hessian
Let us define some useful tools from variational analysis. The notion of reached set (or set with positive reach) was introduced by Federer [44, Def. 4.1], and later extended to prox-regularity (see Proposition A.19 and [93]).
Definition 5.18.
Let . The (Bouligand) tangent cone to at is defined as
[TABLE]
The normal cone to at is .
Definition 5.19.
Let , and . We say that is -reached at , if it is locally closed at , and verifies
[TABLE]
We say that is prox-regular at if there exists and a closed neighbourhood of such that is -reached at any . We say further that is prox-regular if it is prox-regular at every .
Convex sets, and in particular affine spaces, are prox-regular. Manifolds of class are locally prox-regular (see Proposition A.19).
We now provide the result at the core of this section, which says that if a minimizer belongs to some prox-regular set, and if the Hessian is injective when restricted to the tangent cone of this set, then is -conditioned on this set around . This will guarantee asymptotic linear rates when combined with Corollary 4.15.
Theorem 5.20 (Injective Hessian on tangent cone implies -conditioning).
Let , and . Assume that there exists some such that:
a) belongs to some which is -reached at ,
b) is of class in a neighbourhood of ,
c) is -coercive on .
Then , and for every , there exists such that is -conditioned on , with . If we assume moreover that is -Lipschitz continuous, then we can take .
Proof.
Let . Using Proposition A.20, we see that for every there exists a such that the enlarged cone (see Definition A.16) contains for small enough, and such that is -coercive on . The conclusion of the claim follows from Proposition 5.17 applied to and . Under the additional assumption that is -Lipschitz, take any , and let , with . Using again Proposition A.20, we obtain that is -coercive on some cone , with and . Then, Proposition 5.17 shows that is -conditioned on , with and . The conclusion follows by seeing that with our choice of . ∎
Theorem 5.20 can be used in combination with Corollary 4.15: in this case we obtain that the restricted injectivity of the Hessian on the tangent cone to the active set guarantees asymptotic linear rates. In the example below, we detail what our assumptions mean for the examples in Example 4.16.
Example 5.21.
- If , the active set (20) is an open and dense subset of the vector space with . It is therefore -reached for every , and .
- If , let and let be the manifold of matrices with rank equal to . If , the active set (see (21)) is equal to . In particular, it is prox-regular (see Proposition A.19), and an expression for its tangent space can be found in [69, Example 2.2]. More generally, is locally prox-regular at if . To see this, use the same arguments as in [80, Prop. 3.1]: the fact that the singular values depend continuously on the matrix allows one to find a neighbourhood of where the matrices have a rank greater than or equal to . This means that , which is prox-regular.
Remark 5.22 (Related results with partial smoothness).
While our results are new in the setting of mirror-stratifiable functions (where no condition is required), they intersect with existing results when is partially smooth with respect to an active manifold . It is shown in [75] that the -coercivity of on the tangent space guarantees asymptotic linear rates. We recover a similar result by combining [54, Theorem 5.3] with Theorem 5.20 and Theorem 2.2. For a fixed stepsize , [75, Thm. 3.1] predicts a Q-linear rate arbitrarily close to (where ) provided that . Instead, our results predict an R-linear rate arbitrarily close to , without any condition on . Note that our constant is worse (resp. better) than when is close to (resp. ). Note also that the partial smoothness of together with [54, Theorem 6.2.ii)] ensures that is -conditioned on a neighbourhood of the solution, with , meaning that we can use Proposition 4.19 to obtain Q-linear rates arbitrarily close to .
5.2.3 Application to low-complexity inverse problems
Let be defined by, for every , . The function is the sum of a smooth function, with Hessian equal to , and a nonsmooth function . Example 3.9 ensures that is locally -conditioned on its sublevel sets without any assumption on . This means, according to Theorem 4.1, that for any , and any , there exists a constant such that the iterative soft-thresholding initialized at verifies . Nevertheless, expressing the -conditioning constant, or , in terms of the components of the problem is far from easy [17]. One way to recover a meaningful constant is to exploit modeling assumptions which are usually made to ensure the stability and recovery of the inverse problem .
Suppose that we are given the sequence generated by the iterative soft-thresholding, which converges to a minimizer of , . It is known that, after some iterations, the support of the sequence is stable [76, 49]:
[TABLE]
In particular, if the qualification condition holds, we can take [76, Prop. 3.6]. To estimate the rates of convergence for the sequence, it is then sufficient to make a restricted injectivity assumption on the matrix , depending on the knowledge we have on .
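The stabilization of the support can be observed on a toy instance. The sketch below implements iterative soft-thresholding for an objective assumed to be of the form $\tfrac12\|Ax - y\|^2 + \alpha\|x\|_1$. The function names, the diagonal matrix and the parameter values are hypothetical and only serve the illustration.

```python
def soft_threshold(v, t):
    """Proximal map of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, alpha, x0, step, n_iter):
    """Iterative soft-thresholding for 0.5 * ||A x - y||^2 + alpha * ||x||_1.

    A is a list of rows. Returns the list of all iterates, x0 included.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = list(x0)
    iterates = [x[:]]
    for _ in range(n_iter):
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
                    for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        x = [soft_threshold(x[j] - step * grad[j], step * alpha)
             for j in range(n_cols)]
        iterates.append(x[:])
    return iterates


def support(x):
    """Indices of the nonzero entries of x."""
    return {j for j, xj in enumerate(x) if xj != 0.0}


# Hypothetical toy problem: A is diagonal, so the minimizer is (1.5, 0).
A = [[1.0, 0.0], [0.0, 0.5]]
y = [2.0, 0.1]
iterates = ista(A, y, alpha=0.5, x0=[0.0, 1.0], step=0.5, n_iter=60)
supports = [support(x) for x in iterates]
```

In this instance the second coordinate is set exactly to zero after four iterations and remains there, after which the algorithm reduces to a linearly convergent iteration on the first coordinate alone.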
In the case where we have access to , suppose that on the space the matrix is injective, i.e. holds. Then, there exists a constant such that is -coercive on (see Example 5.15), which implies via Proposition 5.17 that is -conditioned on , with . We deduce then that, asymptotically, the rates are governed by . It might happen that instead of knowing , we only have access to partial information via the sparsity level . We can then follow the same reasoning with the (nonconvex) cone instead of . In that case, the constant of coercivity of on is defined by
[TABLE]
and guarantees linear rates governed by , using again Proposition 5.17. Such an assumption is classical in sparsity based regularization, and it is related to the so-called Restricted Isometry Property [25], used to ensure uniqueness of the minimizer and guarantee robustness or recovery [99, 26]. Observe that while the computation of remains impractical [9], it is meaningful with respect to the properties of our problem, and, more importantly, can be estimated when the matrix is random [47, Section 9]. Of course, this whole discussion can be extended to other regularized inverse problems, in particular if is replaced by a mirror-stratifiable function. In this case we will use Theorem 5.20 instead of Proposition 5.17 to derive linear rates.
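For very small problems the restricted coercivity constant can be computed by brute force. The sketch below assumes that coercivity on a cone means $\langle A^*Ax, x\rangle \ge \gamma\|x\|^2$ for all $x$ in the cone, and it treats the cone of 2-sparse vectors, which is a union of coordinate planes. The function names and the matrix are hypothetical.

```python
from itertools import combinations


def gram(A):
    """Return the Gram matrix A^T A for A stored as a list of rows."""
    n_cols = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n_cols)]
            for i in range(n_cols)]


def min_eig_2x2(a, b, d):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) - (0.25 * (a - d) ** 2 + b * b) ** 0.5


def coercivity_on_2_sparse(A):
    """Coercivity constant of A^T A on the cone of 2-sparse vectors.

    The cone is a union of coordinate planes, so the constant is the
    smallest eigenvalue over all 2x2 principal submatrices of A^T A.
    """
    G = gram(A)
    return min(min_eig_2x2(G[i][i], G[i][j], G[j][j])
               for i, j in combinations(range(len(G)), 2))


# Hypothetical 2x3 matrix: A^T A is singular (third column = first + second),
# yet it is coercive on the cone of 2-sparse vectors.
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
gamma_2 = coercivity_on_2_sparse(A)
```

Here the constant equals $(3 - \sqrt{5})/2$, although the matrix has a nontrivial kernel. The enumeration over supports grows combinatorially with the dimension, which reflects why this computation is not practical for large problems.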
6 Conclusion and perspectives
In this paper, we discussed in detail how geometry can be used to improve the rates of the FB method, or more general first-order descent schemes. We characterized the geometry, using tools that are often encountered in practice, like the -conditioning, and we provided a new sum rule for it. In Figure 6.1 we recall the various rates obtained for the FB method, from the worst case scenario (no minimizers, no assumptions) to the best one (sharp functions).
We have also discussed how those refined results can be obtained by decoupling the geometrical information we have on the function and the localization of the sequence we are looking at. This geometry-based analysis then reduces the gap between theory and practice, where the observed rates are often better than the ones resulting from a worst case analysis. It moreover shows that linear rates are tightly linked to -conditioned functions. In addition, we showed how our analysis can be specialized to the inverse problems setting, and allows one to explain typical modeling assumptions in this context, such as source conditions and the restricted injectivity property. It is worth noting that geometrical information such as conditioning or the Łojasiewicz property can be exploited to derive sharper convergence rates for a broader class of functions and/or algorithms than just the forward-backward algorithm [5]. We also emphasize that convexity plays no role in the proofs of Theorems 4.1 and 4.6. Indeed, some of these results were already known for non-convex functions [18, 27, 46]. One of the challenges in the future is to have quantitative results concerning the geometry of classes of nonconvex functions. For instance, what can be said about “simple” nonconvex piecewise polynomial functions (see [73] for an answer about the maximum of finitely many polynomials)? Can we estimate the Łojasiewicz exponent of semialgebraic functions, depending on the degree of the polynomials defining their graph? Finally, a last challenge is the application of such geometrical tools to derive precise rates for nondescent methods. First results in this direction, using -conditioning, are known for inertial methods [85, 77] or stochastic gradient methods [61]. It would be of interest to understand the behavior of these algorithms for other geometries.
Appendix A Appendix
A.1 Worst case analysis: proofs of Section 2
The following Lemma contains a detailed proof for the lower bound (7) in Example 2.3, which can also be applied to (5) by using a symmetry argument.
Lemma A.1 (Lower bounds for the proximal algorithm).
Let , and let be the function defined by
[TABLE]
If , and , then for all :
[TABLE]
Proof.
Note that is an open interval, and that is infinitely differentiable there. We can then see that , and are non-negative. In particular, we deduce that and are non-decreasing on .
Let us now take some , and consider the following continuous trajectory
[TABLE]
It is a simple exercise to verify that is a solution of this differential equation:
[TABLE]
The main step towards proving our lower bound is to show, by induction, that for every , . This is clearly true for , so, let us assume now that this is true for , and show that this implies . Start by writing
[TABLE]
On the one hand, is non-negative on , and , which means that is increasing. On the other hand, is non-decreasing, which means that is increasing. This fact, together with our induction assumption, allows us to write
[TABLE]
Consider now the function defined by . It is clearly increasing and bijective on its image, so its inverse is also increasing. We observe moreover that, by definition, the proximal sequence satisfies . This allows us to write
[TABLE]
This ends the proof of the induction argument.
Observe that, given non-negative numbers , the following inequality holds
[TABLE]
This means that, for all ,
[TABLE]
Passing this inequality through (which is non-decreasing) yields the desired result. ∎
A.2 Proofs of Section 3
A.2.1 Invariant sets and proofs of Section 3.1
We provide here a result concerning the equivalence between all the notions in Definition 3.1, for a large class of sets . The sets we will consider are directly related to the gradient flow induced by . Given , it is known (see [21, Thm. 3.1] when , and [21, Thm. 3.2] with [11, Cor. 16.39] when ) that there exists a unique absolutely continuous trajectory denoted , called the steepest descent trajectory, which satisfies:
[TABLE]
Following [21], we introduce the notion of invariant sets for the flow of :
Definition A.2.
A set is -invariant if for any and a.e. , holds.
In other words, is said to be -invariant if any steepest descent trajectory starting in remains therein. It is straightforward to see that the intersection of two -invariant sets is still -invariant.
Example A.3.
An easy way to construct a -invariant set is to consider the sublevel set of a Lyapunov function for the gradient flow induced by . A function is said to be Lyapunov if for any , is decreasing. Classical examples of this kind are:
- , which is with .
- for , which is with (see [21, Thm. 3.2.17]).
- for , , which is with (see [21, Thm. 3.1.7]).
- for , which is with (see [21, Thm. 3.1.6]).
See [21, Section IV.4] for more details on the subject, as well as [22, 63]. It is also a good exercise to verify that the source sets considered in Proposition 5.11 are -invariant.
We next prove Proposition 3.3, stating the equivalence between conditioning, metric subregularity and Łojasiewicz on -invariant sets. The proof is based on an argument used in [17, Theorem 5], which relies essentially on the following convergence rate property for the continuous steepest descent dynamic (41).
Proof of Proposition 3.3.
Convexity of and the Cauchy-Schwarz inequality imply
[TABLE]
and so i ⇒ ii ⇒ iii. Next, we just have to prove that the Łojasiewicz property implies the conditioning one. So let us assume that is -Łojasiewicz on , which is -invariant, and fix . Define, for all , , which is differentiable on , and for all , . Let us lighten the notation by writing instead of , so that . Because we will need to distinguish the case in which the trajectory converges in finite time, we introduce . Since and is continuous, we see that . For every , we have , so and . If , we also have for every that and . Now, we write:
[TABLE]
But (see [21]), so that the above equality becomes
[TABLE]
Since we assume to be -invariant, we can apply the Łojasiewicz inequality at for all , which can be rewritten in this case as This applied to (42) gives us:
[TABLE]
From (43) and the definition of , we see that , meaning that the trajectory has finite length. As a consequence, it converges strongly to some when tends to . Finally, we use on (43) the fact that , together with the fact that (see [21, Thm. 3.11]) to conclude that
[TABLE]
Proof of Proposition 3.4.
i: let . Given , there exists such that
[TABLE]
Since is -conditioned on , we deduce that:
[TABLE]
meaning that is -conditioned on .
ii: the proof follows the same lines as in i. ∎
Proof of Proposition 3.5.
Assume by contradiction that there exists a sequence such that
[TABLE]
Since is weakly compact, we can assume without loss of generality that weakly converges to some when . Then, it follows from (43), the boundedness of and the weak lower semi-continuity of that , meaning that , contradicting . ∎
A.2.2 Proofs of Section 3.2
Lemma A.4 (The Łojasiewicz constant for uniformly convex functions).
Let be uniformly convex, of order , with constant . Then is -Łojasiewicz on , with , where .
Proof.
Let , , and . By definition of uniformly convex functions
[TABLE]
The right member of the above inequality involves a strictly convex optimization problem, whose unique optimal value can be determined by using Fermat’s rule:
[TABLE]
Injecting this optimal value in (44) gives, after rearranging the terms,
[TABLE]
and, since is arbitrary in , the result follows after passing this inequality to the power . ∎
Proof of Example 3.10.ii).
To prove the claim, it is enough to verify the three conditions of [40, Theorem 4.2]. The first condition (boundedness of ) is guaranteed by the fact that is coercive. Indeed, is strongly convex, therefore bounded from below, and is itself coercive. The second condition (dual qualification conditions) follows immediately from the fact that both and , and are continuously differentiable. To see this, observe that in this example is (up to a constant) , where is the conjugate number of : . Moreover, being strongly convex means that is also continuously differentiable, with . The third condition (firm convexity) is easy to check for because it is strongly convex; for the proof is left in the following Lemma. We can then apply [40, Theorem 4.2], which ensures that is -conditioned on every compact set. Using again the fact that is coercive, and therefore has bounded sublevel sets, we conclude that is -conditioned on every sublevel set. ∎
A.2.3 Proofs of Section 3.3
Lemma A.5 (-powers are -tilt conditioned when ).
Let , , and be defined as . Then is -conditioned on every bounded subset of .
Proof.
This function is a separable sum, so, without loss of generality, we can assume from here that (see [40, Lemma 4.4]). Given a real , we will note its sign with , which is equal to (resp. ) if (resp. ), or [math] if . Using the convexity, the differentiability of , and the Fermat’s rule, we see that admits a unique minimizer , defined by the relations
[TABLE]
If , it is immediate to see that is -conditioned on , where the relation holds. We therefore assume from now that , which also means that . We now compute (we note )
[TABLE]
meaning that we are looking for an inequality like
[TABLE]
Using the L’Hôpital rule twice allows us to study the following limit:
[TABLE]
Note that our assumption that ensures that we can take the derivative of the second numerator around . Since this limit is well-defined, and nonnegative, it means that is -conditioned on a small enough neighbourhood of . To conclude the proof, it remains to verify that is -conditioned on any bounded set. This follows immediately from Proposition 3.5 and the fact that . ∎
Lemma A.6 (Kullback-Leibler divergences are -tilt conditioned).
Let , and be the Kullback-Leibler divergence to :
[TABLE]
Then is -tilt-conditioned on every bounded set of .
Proof.
Let , and define the tilted function . Using Fermat’s rule, we see that . It is a simple exercise to verify that , so if and only if . Let be such a vector, and write, for any :
[TABLE]
Let , which is well defined under our assumption that . Then
[TABLE]
where . We then observe that , from which we deduce that with .
Now, let be fixed, and let . Let , , and
[TABLE]
For each , we have , so we can use [24, Lem. A.2] on to write
[TABLE]
This proves that is -conditioned on , which concludes the proof. ∎
A.3 The Forward-Backward algorithm and proofs of Section 4
Definition A.7.
Given a positive real sequence converging to zero, we say that converges:
- sublinearly (of order ) if such that , ,
- Q-linearly if such that , ,
- R-linearly if Q-linearly converging such that , ,
- Q-superlinearly (of order ) if such that , ,
- R-superlinearly if Q-superlinearly convergent such that , .
It is easy to verify that is R-superlinearly convergent of order if and only if
[TABLE]
Note that R-linear and R-superlinear convergence ensures only the overall decrease of the sequence, while Q-linear and Q-superlinear convergence requires the sequence to decrease at a certain speed for each index. It is immediate from the definition that Q-convergence implies R-convergence.
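The gap between the two notions can be seen on an explicit sequence. The sketch below assumes the usual definitions: Q-linear convergence means $r_{n+1} \le \varepsilon r_n$ for some $\varepsilon \in\, ]0,1[$ and all $n$, and R-linear convergence means domination by a Q-linearly convergent sequence. The sequence and the helper names are hypothetical.

```python
def is_q_linear(r, eps):
    """Check r[n+1] <= eps * r[n] for all available n (Q-linear with factor eps)."""
    return all(r[n + 1] <= eps * r[n] for n in range(len(r) - 1))


def is_dominated(r, s):
    """Check r[n] <= s[n] for all n (used for R-linear convergence)."""
    return all(rn <= sn for rn, sn in zip(r, s))


# Hypothetical sequence: it stalls on every other step, yet decays geometrically.
r = [0.5 ** n * (1.0 if n % 2 == 0 else 2.0) for n in range(40)]
s = [2.0 * 0.5 ** n for n in range(40)]     # Q-linear majorant with factor 0.5

worst_ratio = max(r[n + 1] / r[n] for n in range(len(r) - 1))
```

The ratio between consecutive terms of `r` equals one on every other step, so no factor below one works at each index and `r` is not Q-linearly convergent. It is nevertheless dominated by the Q-linear sequence `s`, hence R-linearly convergent.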
Lemma A.8 (Estimate for sublinear real sequences).
Let be a real sequence being strictly positive and satisfying, for some , and all : Define , and Then, for all ,
Proof.
It can be found in [72, Lemma 7.1], see also the proofs of [3, Theorem 2] or [46, Theorem 3.4]. ∎
Lemma A.9.
If Assumption 2.1 holds, then for all and all :
i)
ii)
Proof of Lemma A.9.
To prove item i), start by writing
[TABLE]
The optimality condition in (2) gives so that, by using the convexity of :
[TABLE]
Since we can write , we deduce from the convexity of and the Descent Lemma ([11, Theorem 18.15]) that
[TABLE]
Item i) is then proved after combining the two previous inequalities. For item ii), use the optimality condition in (2), together with a sum rule (see e.g. [87, Theorem 3.30]), to deduce that
[TABLE]
For the first inequality, use (45) with , together with the contraction property of the gradient map when (see [11, Cor. 18.17 & Prop. 4.39 & Remark 4.34.i]) to obtain:
[TABLE]
For the second inequality, consider , and use (45) with , together with the nonexpansiveness of the proximal map (see [11, Prop. 12.28]):
[TABLE]
Lemma A.10 (Descent Lemma for Hölder smooth functions).
Let and be convex. Assume that is Gateaux differentiable on , and that there exists , such that for all , holds. Then:
[TABLE]
Proof.
The argument used in [101, Remark 3.5.1] for extends directly to convex sets. ∎
Now we can prove the convergence rate results of Section 4.1:
Proof of Theorem 4.1.
We first show that has finite length. Since , , and it follows from Lemma A.9 that
[TABLE]
If there exists such that then the algorithm would stop after a finite number of iterations (see (46)), therefore it is not restrictive to assume that for all . We set and , so that the Łojasiewicz inequality at can be rewritten as
[TABLE]
Combining (46), (47), and (48), and using the concavity of , we obtain for all :
[TABLE]
By taking the square root on both sides, and using Young’s inequality, we obtain
[TABLE]
Sum this inequality, and reorder the terms to finally obtain
[TABLE]
We deduce that has finite length and converges strongly to some . Moreover, from (47) and the strong closedness of , we conclude that .
Now we prove the convergence rates. Let for short. We first derive rates for the sequence of values , from which we will derive the rates for the iterates. Equations (46) and (47) yield
[TABLE]
The Łojasiewicz inequality at implies so we deduce that
[TABLE]
The rates for the values are derived from the analysis of the sequences satisfying the inequality in (50). Depending on the value of , we obtain different rates.
If , then we deduce from (50) that for all implies Since the sequence is decreasing and positive, implies .
For the other values of , we will assume that . In particular, we get from (50)
[TABLE]
If , then . The positivity of and (51) imply that for all , , meaning that converges Q-superlinearly.
If , then and we deduce from (51) that for all , , meaning that converges Q-linearly.
If , then , and the analysis still relies on studying the asymptotic behaviour of a real sequence satisfying (51). Lemma A.8 in the Annex shows that we have , by taking
[TABLE]
To end the proof, we will prove that the rates for are governed by the ones of . Let , and sum the inequality in (49) between and to obtain (recall that ):
[TABLE]
Next, we pass to the limit for , we use (46), and the fact that is decreasing to obtain
[TABLE]
Note that if , and if . So, by defining
[TABLE]
we finally conclude from (53) that when . ∎
Proof of Theorem 4.6.
The proof follows that of Theorem 4.1: the -Łojasiewicz property implies (50), and the statement follows from Lemma A.8 with . ∎
Proof of Theorem 4.8.
The proofs of Theorems 4.1 and 4.6 rely on the combination of the Łojasiewicz inequality with the estimations (46) and (47), which can be replaced by (18) and (19). ∎
A.4 Linear inverse problems and proofs of Section 5.1
Here we will make use of , the Moore-Penrose pseudo-inverse of . It is a linear operator (not necessarily bounded), whose domain is , and which satisfies
[TABLE]
It is easy to see that, whenever , the solution set of (27) is .
Lemma A.11.
Let be a bounded linear operator from to . Then, for every continuous function , we have .
Proof.
A simple induction argument shows that, for every , . Taking linear combinations of this equality allows us to see that, for every polynomial , . Now, if is continuous on , it is in particular continuous on , which is an interval containing the spectrum of both and . Thus, restricted to this interval can be written as the uniform limit of a sequence of polynomials. Passing to the limit (see [56, Thm. VI.32.1]) in the last equality gives the desired result. ∎
Lemma A.12.
For all , , the following two properties are equivalent:
2. 2.
Proof.
It is shown in [42, Proposition 2.18] that , so it is enough to verify this implication:
[TABLE]
Let be such a pair. Since and , we deduce that . Therefore, since is self-adjoint, (see [42, p.35]), and , we get
[TABLE]
Proof of Lemma 5.5.
Recall that and let . Then, Lemma A.12 yields:
[TABLE]
∎
Lemma A.13 (Interpolation inequality [42, p. 55]).
For all and , we have
[TABLE]
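The interpolation inequality can be tested numerically on diagonal operators. The sketch below assumes the standard moment form $\|T^a x\| \le \|T^b x\|^{a/b}\,\|x\|^{1-a/b}$ for $T = A^*A$ and $0 < a < b$, which may differ in notation from the statement of the lemma. The function names and the random data are hypothetical.

```python
import random


def power_norm(sigma, x, a):
    """Norm of (A^*A)^a x for the diagonal operator A = diag(sigma)."""
    return sum((s ** (2 * a) * xk) ** 2 for s, xk in zip(sigma, x)) ** 0.5


def interpolation_gap(sigma, x, a, b):
    """Right-hand side minus left-hand side of the assumed moment inequality
    ||T^a x|| <= ||T^b x||^(a/b) * ||x||^(1 - a/b), with T = A^*A and 0 < a < b.
    """
    lhs = power_norm(sigma, x, a)
    rhs = (power_norm(sigma, x, b) ** (a / b)
           * power_norm(sigma, x, 0.0) ** (1.0 - a / b))
    return rhs - lhs


random.seed(0)
sigma = [1.0 / (k + 1) for k in range(30)]
gaps = []
for _ in range(200):
    x = [random.uniform(-1.0, 1.0) for _ in sigma]
    a = random.uniform(0.1, 1.0)
    b = a + random.uniform(0.1, 1.0)
    gaps.append(interpolation_gap(sigma, x, a, b))
```

In the diagonal case the inequality reduces to Hölder's inequality with exponents $b/a$ and its conjugate, so every recorded gap should be nonnegative up to rounding errors.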
Lemma A.14 (Powers of self-adjoint operators).
Let be a bounded selfadjoint positive linear operator on a Hilbert space. Then, for all , , and .
Proof.
Given any , we can write , from which we deduce that . This means that is a nondecreasing family. To prove that this family is constant, it is enough to see that , which we verify now: If , then , therefore . The conclusion follows from the fact that . ∎
A.5 Regularized inverse problems and proofs of Section 5.2
Proposition A.15.
Let be a closed cone and . Then is coercive on if and only if .
Proof.
The direct implication is immediate from Definition 5.14. For the reverse implication, let be a closed cone such that . Since is linear, we know that is convex and continuous. So, using the compactness of we deduce that:
[TABLE]
Because and , we deduce from our assumption that . Therefore, , from which we deduce that is -coercive on . ∎
Definition A.16 (Cone enlargement).
Let be a cone, and . We define the -enlargement of as
[TABLE]
Lemma A.17.
If is a closed cone, then is a closed cone containing for all .
Proof.
By definition, is a cone containing and is compact, due to the compactness of . Since , by compactness of , we deduce that is a closed cone (see e.g. [48, Proposition A.1.1]). ∎
Proposition A.18.
Let which is -coercive on a closed cone . Then, for every , is -coercive on , with .
Proof.
Let and be as in the statement. Since is -coercive on , we see that , which guarantees that . Now, the fact that is closed (Lemma A.17) implies that is compact in , so we can use the same arguments as in (55) to deduce that there exists such that . Since , there exists by definition of some such that . We can use [62, Theorem 1] to write
[TABLE]
Since , we have . Moreover, , so (56) implies
[TABLE]
We deduce from the definition of that is -coercive on . ∎
Proposition A.19.
Let , and .
i) For , is -prox-regular at if and only if :
[TABLE]
ii) If is a manifold, then there exists such that is -prox-regular.
Proof.
Item i): Definition 5.19 can be rewritten as , where the condition is equivalent to, after expanding the square:
[TABLE]
The conclusion follows after cancelling and reorganizing the terms. Item ii): Every -manifold is prox-regular in the sense of [93, Def. 10.23 & Prop. 13.32]. Therefore, for every , there exists such that for every , and for every , the inequality (57) holds [93, Exercise 13.31]. The conclusion follows from the fact that .
∎
We will need the following result, which locally estimates the coercivity of an operator on a prox-regular set via its coercivity on the tangent cone.
Proposition A.20.
Let be -prox-regular at . Let be a bounded positive self-adjoint linear operator that is -coercive on . Then, for all , there exists a cone such that is -coercive on , and , with .
Proof.
Let be fixed, and define . Let be the -enlargement of , then Proposition A.18 guarantees that is -coercive on . It remains to prove that there exists such that . Let . Because is -reached at , we know that is a convex cone (use [44, Thm. 4.8.(12)] and the fact that is locally closed at ), so we can define , and . Using Moreau’s Theorem [11, Thm. 6.30], we deduce that with . We define , and look for a condition on it so that . For this to happen, it is enough to verify that
[TABLE]
Now, use Proposition A.19.i) together with the Cauchy-Schwarz inequality, and the polynomial inequality , to write
[TABLE]
We can use this inequality, together with the facts that and , to write
[TABLE]
This allows us to conclude that (58) holds as long as:
[TABLE]
∎
Proof of Proposition 5.17.
Let , and set . Since is of class around , there exists some such that for all , Notice that when is Lipschitz continuous, we can take . Also, if it is constant, we can just take and . Let us show that is -conditioned on with the constant . Take and use the optimality condition at and the convexity of to obtain
[TABLE]
By Taylor’s theorem applied to , we deduce from the inequality above that there exists such that:
[TABLE]
On the one hand, since , we have that . Thus, from the coercivity of we have
[TABLE]
On the other hand, we use the Cauchy-Schwarz inequality together with the definition of and the fact that to obtain
[TABLE]
By combining the three previous inequalities, we deduce that
[TABLE]
This implies that , and the statement follows from . ∎
References
- [1] P.-A. Absil, R. Mahony and B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization, 16, pp. 531–547, 2005.
- [2] F.J. Aragón Artacho and M.H. Geoffroy, Characterization of metric regularity of subdifferentials, Journal of Convex Analysis, 15 (2), pp. 365–380, 2008.
- [3] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming, 116 (1-2), pp. 5–16, 2009.
- [4] H. Attouch, J. Bolte, P. Redont and A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems. An approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35 (2), pp. 438–457, 2010.
- [5] H. Attouch, J. Bolte and B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming, 137 (1-2), pp. 91–129, 2013.
- [6] H. Attouch and R. Wets, Quantitative stability of variational systems II, a framework for nonlinear conditioning, SIAM Journal on Optimization, 3 (2), pp. 359–381, 1993.
- [7] D. Azé and J.-N. Corvellec, Nonlinear local error bounds via a change of metric, Journal of Fixed Point Theory and Applications, 16 (1), pp. 351–372, 2014.
- [8] J.-B. Baillon, Un exemple concernant le comportement asymptotique de la solution du problème $du/dt+\partial\vartheta\ni 0$, Journal of Functional Analysis, 28 (3), pp. 369–376, 1978.
