On the linear convergence rates of exchange and continuous methods for total variation minimization
Axel Flinth (IMT), Fr\'ed\'eric de Gournay (IMT, ITAV), Pierre Weiss, (IMT, ITAV)

TL;DR
This paper studies the linear convergence of exchange and continuous methods for total variation minimization, showing under certain conditions that both approaches converge linearly and proposing a combined alternating method.
Contribution
It provides the first analysis of linear convergence rates for exchange and continuous algorithms in total variation regularized inverse problems.
Findings
Exchange algorithm converges linearly under regularity conditions.
Continuous amplitude optimization achieves linear convergence with good initialization.
Combining both methods offers advantages in convergence and performance.
Abstract
We analyze an exchange algorithm for the numerical solution total-variation regularized inverse problems over the space M() of Radon measures on a subset of R d. Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes of positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
On the linear convergence rates of exchange and continuous methods for total variation minimization
Axel Flinth
IMT, Université de Toulouse, CNRS
Frédéric de Gournay
IMT, Université de Toulouse, CNRS
ITAV, Université de Toulouse, CNRS
Pierre Weiss
IMT, Université de Toulouse, CNRS
ITAV, Université de Toulouse, CNRS
Abstract
We analyze an exchange algorithm for the numerical solution total-variation regularized inverse problems over the space of Radon measures on a subset of . Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes of positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.
Keywords: Total variation minimization, inverse problems, superresolution, semi-infinite programming.
MSC Classification: 49M25, 49M29, 90C34, 65K05.
Acknowledgement
The authors acknowledge support from ANR JCJC OMS.
1 Introduction
1.1 The problem
The main objective of this paper is to develop and analyze iterative algorithms to solve the following infinite dimensional problem:
[TABLE]
where is a bounded open domain of , is the set of Radon measures on , is the total variation (or mass) of the measure , is a convex lower semi-continuous function with non-empty domain and is a linear measurement operator.
An important property of Problem () is that at least one of its solutions has a support restricted to distinct points with (see e.g. [30, 15, 4]), i.e. is of the form
[TABLE]
with and . This property motivates us to study a class of exchange algorithms. They were introduced as early as 1934 [24] and then extended in various manners [23]. They consist in discretizing the domain coarsely and then refining it adaptively based on the analysis of so-called dual certificates. If the refinement process takes place around the locations only, these methods considerably reduce the computational burden compared to a finely discretized mesh.
Our main results consist in a set of convergence rates for this algorithm that depend on the regularity of and on the non-degeneracy of a dual certificate at the solution. We also show the linear convergence rate for first order algorithms that continuously vary the coefficients and of a discrete measure. Finally, we show that algorithms alternating between an exchange step and a continuous method share the best of both worlds: the global convergence guarantees of exchange algorithms together with the efficiency of first order methods. This yields a fast adaptive method with strong convergence guarantees for total variation minimization and related problems.
1.2 Applications
Our initial motivation to study the problem () stems from signal processing applications. We recover an infinite dimensional version of the basis pursuit problem [7] by setting
[TABLE]
Similarly, the choice , leads to an extension of the LASSO [27] called Beurling LASSO [9]. Both problems proved to be extremely useful in engineering applications. They got a significant attention recently thanks to theoretical progresses in the field of super-resolution [9, 26, 6, 13]. Our results are particularly strong for the quadratic fidelity term.
1.3 Numerical approaches in signal processing
The progresses on super-resolution [9, 26, 6, 13] motivated researchers from this field to develop numerical algorithms for the resolution of Problem (). By far the most widespread approach is to use a fine uniform discretization and solve a finite dimensional problem. The complexity of this approach is however too large if one wishes high precision solutions. This approach was analyzed from a theoretical point of view in [25, 13] for instance. The first papers investigating the use of () for super-resolution purposes advocated the use of semi-definite relaxations [26, 6], which are limited to specific measurement functions and domains, such as trigonometric polynomials on the 1D torus . The limitations were significantly reduced in [10], where the authors suggested the use of Lasserre hierarchies. These methods are however currently unable to deal with large scale problems. Another approach suggested in [5], and referred to as a Frank-Wolfe algorithm, consists in adding one point to a discretization set iteratively, where a so-called dual certificate is maximal. More recently, [28] began investigating the use of methods that continuously vary the positions and amplitudes of discrete measures parameterized as . The authors gave sufficient conditions for a simple gradient descent on the product-space to converge. In [3] and [11], this method was used alternatively with a Frank-Wolfe algorithm, the idea being to first add Dirac masses roughly at the right locations and then to optimize their locations and position continuously, leading to promising numerical results. Surprisingly enough, it seems that the connection with the mature field of semi-infinite programming has been ignored (or not explicitly stated) in all the mentioned references.
1.4 Some numerical approaches in semi-infinite programming
A semi-infinite program [23, 16] is traditionally defined as a problem of the form
[TABLE]
where and are subsets of and respectively, and are functions. The term semi-infinite stems from the fact that the variable is finite-dimensional, but it is subject to infinitely many constraints for . In order to see the connection between the semi-infinite program (SIP) and our problem (), we can formulate its dual, which reads as
[TABLE]
This dual will play a critical role in all the paper and it is easy to relate it to a SIP by setting , and .
Many numerical methods have been and are still being developed for semi-infinite programs and we refer the interested reader to the excellent chapter 7 of the survey book [23] for more insight. We sketch below two classes of methods that are of interest for our concerns.
1.4.1 Exchange algorithms
A canonical way of discretizing a semi-infinite program is to simply control finitely many of the constraints, say for , where is finite. The discretized problem SIP can then be solved by standard proximal methods or interior point methods. In order to obtain convergence towards an exact solution of the problem, it is possible to choose a sequence of nested sets such that is dense in . Solving the problems SIP for large however leads to a high numerical complexity due to the high number of discretization points. The idea of exchange algorithms is to iteratively update the discretization sets in a more clever manner than simply making them denser. A generic description is given by Algorithm 1.
In this paper, we consider s of the form
[TABLE]
where the points are local maximizers of . At each iteration, the set of discretization points can therefore be updated by adding and dropping a few prescribed points, explaining the name ’exchange’. The simplest rule consists of adding the single most violating point, i.e.
[TABLE]
It seems to be the first exchange algorithm and is nearly equivalent to the Remez algorithm from the 30’s [24]. It can be shown to be equivalent to a Frank-Wolfe (a.k.a. conditional gradient) method up to an epigraphical lift [11]. These methods were introduced in the field of signal processing in [5] and the connection with exchange algorithms was proposed in [14]. The update rule (2) is sufficient to guarantee convergence in the generic case and to ensure a decay of the cost function in , see [19]. Although ’exchange’ suggests that points are both added and subtracted, methods for which are also coined exchange algorithms. The use of such rules often leads to easier convergence analyses, since we get monotonicity of the objective values for free [16]. Other examples [17] include only adding points if they exceed a certain margin, i.e. , or all local maxima of . In the case of convex functions , algorithms that both add and remove points can be derived and analyzed with the use of cutting plane methods. All these instances have their pros and cons and perform differently on different types of problems. Since a semi-infinite program basically allows to minimize arbitrary continuous and finite dimensional problems, a theoretical comparison should depend on additional properties of the problem.
1.4.2 Continuous methods
Every iteration of an exchange algorithm can be costly: it requires solving a convex program with a number of constraints that increases if no discretization point is dropped. In addition, the problems tend to get more and more degenerate as the discretization points cluster, leading to numerical inaccuracies. In practice it is therefore tempting to use the following two-step strategy: i) find an approximate solution of the primal problem () using iterations of an exchange algorithm and ii) continuously move the positions and amplitudes starting from to minimize () using a nonlinear programming approach such as a gradient descent, a conjugate gradient algorithm or a Newton approach.
This procedure supposes that the output of the exchange algorithm has the right number of Dirac masses, that their amplitudes satisfy and that lies in the basin of attraction of the optimization algorithm around the global minimum . To the best of our knowledge, knowing a priori when those conditions are met is still an open problem and deciding when to switch from an exchange algorithm to a continuous method therefore relies on heuristics such as detecting when the number of masses stagnates for a few iterations. The cost of continuous methods is however much smaller than that of exchange algorithms since they amount to work over a small number of variables. In addition, the instabilities mentioned earlier are significantly reduced for these methods. This observation was already made in [3, 11] and proved in [28] for specific problems.
1.5 Contribution
Many recent results in the field of super-resolution provide sufficient conditions for a non degenerate source condition to hold [6, 26, 12, 1, 21]. The non degeneracy means that the solution of () is unique and that the dual certificate reaches at exactly points, where it is strictly concave. The main purpose of this paper is to study the implications of this non degeneracy for the convergence of a class of exchange algorithms and for continuous methods based on gradient descents. Our main results are as follows:
We show an eventual linear convergence rate of a class of exchange algorithms for convex functions with Lipschitz continuous gradient. More precisely, we prove that after a finite number of iterations the algorithm outputs vectors such that the set
[TABLE]
contains exactly -points .
Letting denote the solution of the finite dimensional problem , we also show the linear convergence rate of the cost function to and of the support in the following sense: after a number of initial iterations, it will take no more that iterations to ensure that . A similar statement holds for the coefficient vectors . 2. 2.
We also show that a well-initialized gradient descent algorithm on the pair converges linearly to the true solution and explicit the width of the basin of attraction. 3. 3.
We then show how the proposed guarantees may explain the success of methods alternating between exchange methods and continuous methods at each step, in a spirit similar to the sliding Frank-Wolfe algorithm [11]. 4. 4.
We finally illustrate the above results on total variation based problems in 1D and 2D.
2 Preliminaries
2.1 Notation
In all the paper, designs an open bounded domain of . The boundedness assumptions plays an important role to control the number of elements in discretization procedures. A grid is a finite set of points in . Its cardinality is denoted by . The distance between two sets and is defined by
[TABLE]
Note that this definition of distance is not symmetric: in general .
We let denote the set of continuous functions on vanishing on the boundary. The set of Radon measures can be identified as the dual of , i.e. the set of continuous linear forms on . For any sub-domain , we let denote the set of Radon measures supported on . For , the -norm of a function is denoted by . The total variation of a measure is denoted . It can be defined by duality as
[TABLE]
The -norm of a vector is also denoted . The Frobenius norm of a matrix is denoted by .
Let denote a convex lower semi-continuous function with non-empty domain . Its subdifferential is denoted . Its Fenchel transform is defined by
[TABLE]
If is differentiable, we let denote its gradient and if it is twice differentiable, we let denote its Hessian matrix. We let and , where is the largest singular value of . A convex function is said to be -strongly convex if
[TABLE]
for all and all . A differentiable function is said to have an -Lipschitz gradient if it satisfies . This implies that
[TABLE]
We recall the following equivalence [18]:
Proposition 2.1**.**
Let denote a convex and closed function with non empty domain. Then the following two statements are equivalent:
- •
* has an -Lipschitz gradient.*
- •
* is -strongly convex.*
The linear measurement operators considered in this paper can be viewed as a collection of continuous functions . For , the notation corresponds to the vector .
2.2 Existence results and duality
In order to obtain existence and duality results, we will now make further assumptions.
Assumption 1**.**
* is convex and lower bounded. In addition, we assume that either or that is polyhedral (that is, its epigraph is a finite intersection of closed halfspaces).*
Assumption 2**.**
The operator is weak--continuous. Equivalently, the measurement functionals defined by are given by
[TABLE]
for functions . In addition, we assume that is surjective on .
The following results relate the primal and the dual.
Proposition 2.2** (Existence and strong duality).**
Under Assumptions 1 and 2, the following statements are true:
- •
The primal problem () and its dual () both admit a solution.
- •
The following strong duality result holds
[TABLE]
- •
Let denote a primal-dual pair. They are related as follows
[TABLE]
Proof.
The stated assumptions ensure the existence of a feasible measure . In addition, the primal function is coercive since is bounded below. This yields existence of a primal solution. The existence of a dual solution stems from the compactness of the set (which itself follows from the surjectivity of ) and the continuity of on its domain. The strong duality result follows from [2, Thm 4.2]. The primal-dual relationship directly derives from the first order optimality conditions. ∎
The left inclusion in equation (9) plays an important role, which is well detailed in [13]. It implies that the support of satisfies: .
3 An Exchange Algorithm and its convergence
3.1 The algorithm
We assume that an initial grid is given (e.g. a coarse Euclidean grid). Given a discretization , we can define a discretized primal problem ()
[TABLE]
[TABLE]
In this paper, we will investigate the exchange rule below:
[TABLE]
The implementation of this rule requires finding , the set of all the local maximizers of exceeding .
3.2 A generic convergence result
The exchange algorithm above converges under quite weak assumptions. For instance, it is enough to assume that the function is differentiable.
Assumption 3**.**
The data fitting function is differentiable with -Lipschitz continuous gradient.
Alternatively, we may assume that the initial set is fine enough, which in particular implies that .
Assumption 4**.**
The initial set is such that restricted to is surjective.
We may now present and prove our first result.
Theorem 3.1** (Generic convergence).**
Under assumptions 1, 2 and 3 or 4, a subsequence of will converge in the weak--topology towards a solution pair of () and (), as well as in objective function value. If the solution of () and/or () is unique, the entire sequence will converge.
Proof.
First remark that the sequence is non-increasing since the spaces are nested. Due to the boundedness below of , the same must be true for . Hence there exists a subsequence , which we do not relabel, that weak- converges towards a measure .
Now, we will prove that the sequence of dual variables is bounded. If Assumption 3 is satisfied, then is strongly convex and since [math] is a feasible point, we must have , which is bounded. Alternatively, if Assumption 4 is satisfied, notice that . Since is surjective, the previous inequality implies that is bounded. Hence, in both cases, the sequence converges up to a subsequence to a point .
The key is now to prove that . To this end, let us first argue that the family is equicontiuous. For this, let be arbitrary. Since the functions all are uniformly continuous, there exists a with the property
[TABLE]
Consequently,
[TABLE]
Due to the convergence of , the sequence is converging strongly to . We will now prove that . If for some , , we will have for all , and in particular and thus . Hence, we may assume that for each , i.e. that we add at least one point to in each iteration.
Now, towards a contradiction, assume that for an . Set as in (11). For each , let be the element in which has the largest distance to . Due to for each , there needs to exist a compact subset such that . Indeed, there exists for each a such that for all . Now, if , we get
[TABLE]
for every . Since , we conclude . Consequently, a subsequence (which we do not rename) of must converge. Thus, for some and every , we have . We then have
[TABLE]
In the last estimate, we used the constraint of () and the fact that . Since the last inequality holds for every , we obtain
[TABLE]
where we used the fact that converges strongly towards . This is a contradiction, and hence, we do have .
Overall, we proved that the primal-dual pair is feasible. It remains to prove that it is actually a solution. To do this, let us first remark that by weak duality. To prove the second inequality, first notice that the weak--continuity of implies that . Assumption 1 furthermore implies that is lower semi-continuous. As a supremum of linear functions, so is . Since also , we conclude
[TABLE]
Assumptions 1, 2 together with Proposition 2.2 imply exact duality of the discretized problems. This means . Since the norm is weak--l.s.c. , we thus obtain
[TABLE]
Reshuffling these inequalities yields , i.e., the reverse inequality. Thus, and fulfill the duality conditions, and are solutions. The final claim follows from a standard subsequence argument. ∎
Remark 1**.**
Let us mention that the convergence result in Theorem 3.1 and its proof, is not new, see e.g. [22]. The proof technique can be applied to prove similar statements for other refinement rules. For instance, the result still holds if we add the single most violating point:
[TABLE]
The result that we have just shown is very generally applicable. It however does not give us any knowledge of the convergence rate. The next section will be devoted to proving a linear convergence rate in a significant special case.
3.3 Non degenerate source condition
The idea behind adding points to the grid adaptively is to avoid a uniform refinement, which results in computationally expensive problems (). However, there is a priori no reason for the exchange rule not to refine in a uniform manner. In this section, we prove that additional assumptions improve the situation. First, we will from now on work under Assumption (3). It implies that the dual solutions are unique for every , since Proposition (2.1) ensures the strong convexity of the Fenchel conjugate . We furthermore assume that the are smooth.
Assumption 5** (Assumption on the measurement functionals ).**
The measurement functions all belong to and their first and second order derivatives are uniformly bounded on . We hence may define
[TABLE]
We also assume the following regularity condition on the solution of (), and its corresponding primal solution .
Assumption 6** (Assumption on the primal-dual pair).**
We assume that () admits a unique -sparse solution supported on :
[TABLE]
Let denote the associated dual pair. We assume that the only points for which are the points in , and that the second derivative of is negative definite in each point . It follows that there exists and such that
[TABLE]
We note that if Equations (14) and (15) are valid for some , they are also valid for any with and .
Assumption (6) may look very strong and hard to verify in advance. Recent advances in signal processing actually show that it is verified under clear geometrical conditions. First, there will always exists at most -sparse solutions to problem (), [30, 15, 4]. Therefore, the main difficulty comes from the uniqueness of the primal solution and from the two regularity conditions (14) and (15). These assumptions are called non-degenerate source condition of the dual certificate [13]. Many results in this direction have been shown for or , where with a finitely supported measure. The papers [6, 26, 12] deal with different Fourier-type operators, [1] about a few other special cases whereas [21] provides an analysis for arbitrary integral operators sampled at random.
3.4 Auxiliary results
In this and the following sections, we always work under Assumptions 1, 2, 3 without further notice. We derive several lemmata that are direct consequences of the above assumptions. The first two rely strongly on the Lipschitz regularity of the gradient of .
Lemma 3.2** (Boundedness of the dual variables ).**
Let denote the prox-center of . For all , we have
[TABLE]
Proof of Lemma 3.2.
For all , we have , hence . By strong convexity of and optimality of and , we get:
[TABLE]
Therefore and the conclusion follows from a triangle inequality. ∎
Proposition 3.3**.**
Let be the solution of (). Let
[TABLE]
Then for any , we have
[TABLE]
Proof.
Let denote the sub-level set of and . We first claim that and only have the point in common. Indeed solves the problem and by strong duality of the problem restricted to , solves . By strong convexity of , is the unique solution , this exactly means .
The fact that implies that there exists a separating hyperplane there. Since the hyperplane must be tangent to , it can be written as for a , with . Consequently, letting , we have
[TABLE]
Now, the strong convexity of implies for every ,
[TABLE]
Rearranging this, we obtain
[TABLE]
which is the claim. ∎
Before moving on, let us record the following proposition:
Proposition 3.4**.**
We have
[TABLE]
Proof.
The proof of the first inequality of (18) is a standard Taylor expansion :
[TABLE]
The proof of the second part of (18) follows the same lines as the first part and is left to the reader. ∎
The next two lemmata aim at transferring bounds from the geometric distances of the sets , and to bounds on . Using Proposition 3.3, we may then transfer these bounds to bounds on the errors of the dual solutions and the dual (or primal) objective values.
Lemma 3.5**.**
The following inequalities hold
[TABLE]
[TABLE]
Proof of Lemma 3.5.
To show (19), first notice that
[TABLE]
Indeed, by definition, the global maximum of lies in and satisfies . Furthermore, by construction, all points in satisfy . Using a Taylor expansion, we get for all
[TABLE]
Taking as the point in minimizing the distance to leads to (20). In addition, we have by Lemma 3.2, so that with .
Now, letting , we have just proven that . Furthermore, due to the optimality of for the discretized problem and to the fact that is feasible for that problem, we will have , i.e., is included in the -sub-level set of : . An application of Proposition 3.3 now yields the result. ∎
Lemma 3.6**.**
Suppose that and . Then
[TABLE]
Proof.
Let (resp. ) be the point closest to in (resp. ). By assumption, we have . For all , we have
[TABLE]
Then, for all , using the fact that , we get
[TABLE]
Hence, we have . To conclude, we use Proposition 3.3 again. ∎
The last assertion takes full advantage of Assumption 6 and the fact that the function is uniformly concave around its maximizers. It allows to transfer bounds from to bounds on the distance from to .
Proposition 3.7**.**
Define and assume that , then
[TABLE]
Moreover, for each , if is the ball or radius around , then contains at most one point in and has the same sign as in .
Proof.
Define and note that . By Proposition 3.4, we have for each
[TABLE]
The above inequalities together with Assumption 6 imply the following for all :
- (i)
For with , we have 2. (ii)
For with , we have 3. (iii)
For with , we have 4. (iv)
For with , we have
The estimate deserves a slightly more detailed justification than the others. Define and for . We may apply the mean value theorem to conclude that
[TABLE]
for some . Since , and , due to in , we obtain
[TABLE]
since by assumption. The last estimate was the claim .
This implies a number of things. First, any local maximum of with must lie within a distance of from the set (since for all other points, we have – via – or – via ). Since is locally concave on the -neighborhoods of the – this follows from – at most one local extremum furthermore exists in each such neighborhood. This is the claim.
∎
3.5 Fixed grids estimates
In this section, we consider a fixed grid and ask what we need to assume about it in order to guarantee that the set of local maxima of is close to true support . We express our result in terms of a geometrical property that we can control, the width of the grid .
Theorem 3.8**.**
Assume that , then
[TABLE]
Proof.
It is trivial that . Applying Lemma 3.5, we immediately obtain the bound on . By the same lemma,
[TABLE]
In order to obtain the first bound, remark that and use Proposition 3.7. ∎
Remark 2**.**
Note that Theorem 3.8 allows to control but not . Indeed each is guaranteed to be close to a , but not every needs to have a point in closeby. Note however that the bounds on the optimal value indicates that in this case the missed is not crucial to produce a good candidate for solving the primal problem. We will provide more insight on this, in the case of being strongly convex, in Section 4.
3.6 Eventual linear convergence rate
In this section, we provide a convergence rate for the iterative algorithm. As a follow-up to Remark 2, the proof of convergence relies on the fact that the distances and are equal. In order to ensure this fact, one has to wait for a finite number of iterations, this is exactly the purpose of the next proposition.
Proposition 3.9**.**
Let . There exists a finite number of iterations , such that for all , has exactly points, one in each . It follows that . Moreover if is the set of active point of , that is
[TABLE]
then and for each , .
Proof.
We first prove that contains a point in . To this end, define the set of measures and
[TABLE]
By assumption (6), . Since converges to , there exists such that , . Hence must for each have points such that has non-zero mass at . Consequently, , hence, each contains at least one point in such that .
Notice that converges to by Theorem 3.1. Hence there a finite number of iterations such that for all . By item of the proof of Proposition 3.7, outside , and by item , is strictly concave in each . Hence each contains exactly one maximizer of exceeding one.
∎
We now move on to analyzing our exchange approach. Before formulating the main result, let us introduce a term: -regimes.
Definition 1**.**
We say that the algorithm enters a -regime at iteration if for all , we have . In particular it means that only points with a distance at most from are added to the grid.
Lemma 3.10**.**
Let and . Let be as in Proposition 3.9.
For any , the algorithm enters a -regime after a finite number of iterations. 2. 2.
Assume that iterations have passed and that the algorithm is in a -regime with . Then for every it takes no more than iterations to enter an -regime.
Proof.
Note that for any , if there exists such that
[TABLE]
we will enter an -regime after iteration by applying Proposition 3.7.
To prove , note that we without loss of generality can assume that (since entering a -regime means in particular entering a -regime for any .) Then , since tends to zero as goes to infinity, (22) with is true after a finite number of iterations.
To prove , we proceed as follows : Proposition 3.9 ensures that in each iteration, exactly one point is added in each ball . Let be the actual iteration, a covering number argument [29] ensures, for any that after iterations, each point in needs to lie at a distance at most from , i.e., .
Now, if we choose , Lemma 3.5 together with Proposition 3.7 imply
[TABLE]
Since for all , the distance is non-increasing. As a result for all . Since we are in -regime, we know that and . Hence we can apply Lemma 3.6 to obtain that
[TABLE]
Then inequality (22) is satisfied with and the algorithm enters a -regime. ∎
The main result will tell us how many iterations we need to enter a -regime.
Theorem 3.11**.**
Let and be the iteration on which the algorithm enters a -regime. Then , and the algorithm will enter a -regime after no more than iterations, where
[TABLE]
Additionally, we will have
[TABLE]
for . In other words, the algorithm will eventually converge linearly.
Proof.
The fact that is the first assertion of Lemma 3.10. As for the other part, we argue as follows: Fix . Since we have entered a -regime at iteration , Lemma 3.10 implies that it will take no more than additional iterations to enter a . Repeating this argument, we see that after no more than
[TABLE]
iterations, we will have entered a regime. Choosing and , we obtain the first statement.
The second statement immediately follows from Lemma 3.6 (as in the proof of Theorem 3.8) and the fact that entering a -regime exactly amounts to that for all future , and therefore in particular . ∎
The inequality (23) upper-bounds the cost function for the problem (). In practice, the numerical resolution of this problem is hard since contains clusters of points and in practice it is beneficial to solve the simpler discrete problem
[TABLE]
For this measure, we also obtain an a posteriori estimate of the convergence rate.
Proposition 3.12**.**
Define as the solution of (), if , we have
[TABLE]
Proof.
For any , denote a point in closest to and define . We have and . Furthermore, we have
[TABLE]
The last term in the inequality is dealt with the following estimate:
[TABLE]
As for the penultimate term, remember that . This implies
[TABLE]
By making a Taylor expansion of in each , utilizing that the derivative vanishes there, and that for each , we see that for each . This yields
[TABLE]
Overall, we obtain
[TABLE]
∎
4 Convergence of continuous methods
In this section, we study an alternative algorithm that consists of using nonlinear programming approaches to minimize the following finite dimensional problem:
[TABLE]
where . This principle is similar to continuous methods in semi-infinite programming [23] and was proposed specifically for total variation minimization in [3, 11, 28, 8]. By Proposition 3.9, we know that after a finite number of iterations, will contain exactly points located in a neighborhood of . This motivates the following hybrid algorithm:
- •
Launch the proposed exchange method until some criterion is met. This yields a grid and we let .
- •
Find the solution of the finite convex program
[TABLE]
- •
Use the following gradient descent:
[TABLE]
where is a suitably defined step-size (e.g. defined using Wolfe conditions).
We tackle the following question: does the gradient descent algorithm converge to the solution if initialized well enough?
4.1 Existence of a basin of attraction
This section is devoted to proving the existence of a basin of attraction of a descent method in . Under two additional assumptions, we state our result in Proposition 4.1.
Assumption 7**.**
The function is twice differentiable and -strongly convex.
The twice differentiability assumption is mostly due to convenience, but the strong convexity is crucial. The second assumption is related to the structure of the support of the solution .
Assumption 8**.**
For any denote . The transition matrix
[TABLE]
is assumed to be positive definite, with a smallest eigenvalue larger than .
It is again possible to prove for many important operators that this assumption is satisfied if the set is separated. See the references listed in the discussion about Assumption 6. The following proposition describes the links between minimizing and solving ().
Proposition 4.1**.**
Let be the solution of (). Under Assumption 7 and 8, is global minimum of . Additionally, is differentiable with a Lipschitz gradient and strongly convex in a neighborhood of .
*Hence, there exists a basin of attraction around such that performing a gradient descent on will yield the solution of () at a linear rate. *
The rest of this section is devoted to the proof of Proposition 4.1. Let us begin by stating a simple auxiliary result.
Lemma 4.2**.**
Let and be vector spaces and be a linear operator with for a . Then, for any
[TABLE]
Proof.
If is positive semidefinite, we claim holds. Since for arbitrary
[TABLE]
the latter is the case. ∎
Let us introduce some notation that will be used in this section: for an for some , denotes the matrix . Analogously, and denote the operators
[TABLE]
respectively. Note that for and ,
[TABLE]
We will also use the shorthands , , and, for , denotes the operator
[TABLE]
We have
[TABLE]
so that in points with for all , and in particular in a neighborhood of , is differentiable and its gradient is given by :
[TABLE]
As for the second derivatives, we have
[TABLE]
We may now prove our claims.
Proof 4.1.
First, let us note that due to the optimality conditions of , we know that
[TABLE]
Letting , it is furthermore fruitful to decompose the Hessian of into two parts:
[TABLE]
Now, has local maxima in the points , so that . In these points, we furthermore have that , so that the gradient of given in (27) vanishes.
To prove the rest, it is enough to show that the Hessian of is positive definite in a neighborhood around . Let be arbitrary. is an operator of the form , with and
[TABLE]
Due to the -strong convexity of , . We furthermore have
[TABLE]
Let us now turn to . If we define , we have
[TABLE]
by Assumption (8). We however have
[TABLE]
Now, by definition of and ,
[TABLE]
and similarly for . Also, we have, by (18):
[TABLE]
so that all in all
[TABLE]
We may now apply Lemma 4.2 twice to conclude
[TABLE]
where we defined
[TABLE]
It remains to analyze . Define the bilinear form
[TABLE]
Then, if we define , we have
[TABLE]
This makes it evident that
[TABLE]
The -Lipschitz gradient of proves that . Using (18) yields directly:
[TABLE]
We still need to bound . First remember that Assumption 6 asserts that for each , and in the ball of radius around . Consequently, if is chosen so that for each , we get , and
[TABLE]
By definition of , we can further estimate
[TABLE]
Using , we obtain
[TABLE]
If we now define
[TABLE]
we obtain
[TABLE]
Further utilizing the definition of and (28), we arrive at
[TABLE]
Since for and , we obtain the claim. ∎
4.2 Eventually entering the basin of attraction
The following proposition shows that defined as the amplitudes and positions of the Dirac-components of the solution of (), will lie in the basin described by Proposition 4.1. This result is stated in Corollary 4.4, the rest of this section is dedicated to proving it.
Proposition 4.3**.**
Assume that Assumptions 7 and 8 are true. Consider an -sparse measure
[TABLE]
for some and pairwise different points of . We then have
[TABLE]
Proof.
Let be the Moore-Penrose inverse of . Due to Assumption 8, has full rank and has an operator norm no larger than . Since
[TABLE]
bounds on and will therefore transform to a bound on .
Let us begin with the former. We have
[TABLE]
where we used the Cauchy-Schwarz inequality in the last step.
To bound the latter, recall that -strong convexity of means that
[TABLE]
The optimality conditions for () tell us that , and hence
[TABLE]
where we in the last step used that . Plugging the above inequality in (29) yields
[TABLE]
The claim follows. ∎
Corollary 4.4**.**
By Proposition 3.9, if is large enough then contains exactly points. In this case, let be the solution of (). Applying Proposition 4.3, recalling that and using the bound (24), we obtain :
[TABLE]
Since is guaranteed to eventually converge to zero by Theorem 3.11 and are bounded ( e.g. by lower boundedness of and upper boundedness of ) , will eventually lie in the basin of attraction of .
5 Description of the hybrid approach
To conclude this paper, we propose a method alternating between an exchange step and a continuous gradient descent. It is detailed in Algorithm 2. The idea is, after each iteration of an exchange algorithm, to start a gradient descent of initialized at the solution of (). If this gradient descent converges to a measure , we can subsequently test if it is an optimal point by checking if fulfills the stopping criterion , where is a user defined stopping criterion (the latter is justified by Proposition 3.3). If so, we may output , and if not, we may instead continue our exchange algorithm, possibly after adding also the support points of . Its behavior is described in the following theorem.
Theorem 5.1** (Convergence guarantees for the alternating method).**
Algorithm 2 comes with the following guarantees:
(Theorem 3.1)* Under Assumptions 1, 2 and 3, it is guaranteed to stop after a finite number of iterations for any stopping criterion .* 2. 2.
(Theorem 3.11)* If in addition Assumptions 5 and 6 are satisfied, then the algorithm eventually converges linearly: with , we have .* 3. 3.
(Proposition 4.1, Theorem 3.11 and Proposition 4.3)* If in addition Assumptions 7 and 8 are satisfied, then - for large enough - the low complexity gradient descent (26) method converges linearly : for some .*
Overall, this method has many desirable properties: the continuous method should be used whenever the exchange method reaches its basin of attraction since its per iteration cost is much cheaper. However, it is unclear in general that this basin even exists. In that case, the exchange method should be preferred since it eventually converges linearly under quite mild assumptions. The proposed algorithmic scheme somehow captures the best of all methods. Let us notice that it is very similar in spirit to the sliding Frank-Wolfe algorithm proposed in [11], apart from the fact that we suggest adding all the points violating the constraints, while the single most violating point is added in [11]. We believe that the proposed analysis sheds some light on the good numerical performance of this method.
Arguably the most complicated step in this algorithm is to evaluate , the set of local maximizers of exceeding . This is an impossible task for an arbitrary function . However, a simple heuristic described in the next section provided rather satisfactory results for the measurement functions considered in this paper (trigonometric polynomials and Gaussian convolution).
Apart from this, let us outline that the subproblems in this algorithm are well suited for numerical resolution. In the exchange algorithm, we only solve the dual problems which are strongly convex. Hence first-order methods for instance come with guarantees of convergence to in -norm. Recovering the masses , solutions of is also stable since (the local maximizers of ) is typically a well separated set of low cardinality. The gradient descent (or alternative nonlinear programming approach) on is performed over a low dimensional set. If the convergence is not satisfactory (e.g. the norm of doesn’t decay fast enough), it can be stopped, and we can switch back to the exchange algorithm.
6 Numerical Experiments
To test our theory, we have implemented our algorithm in MATLAB. Before displaying the results of the experiments, let us discuss a few key steps in the implementation. In the entire section, we assume that for or for simplicity. Note that this is no true restriction: we can always by scaling and translation ensure that , and trivially extend the measurement functions by [math] to the entirety of .
Evaluating
Each iteration of the exchange algorithm requires the exact calculation of the local maximizers of exceeding . This is, in general, an impossible task. We resort to the following heuristic method: Given a , we first evaluate on a fixed rectangular grid , and determine all of the discrete peaks, i.e. points in which is larger than all of its neighbors in the grid, and where exceeds for a threshold . Next, we start a gradient descent in each of these points, stopping them once is lower than another threshold. Since it is possible that several of these gradient descents land in the same point , we subsequently check if the set contains sets of points which are too close to each other - if this is the case, we discard all but one of them in such a group. We finally remove any point in which is not larger than , for a small .
Solving the Discrete Problems
We have chosen to solve the problems () and using an accelerated proximal gradient descent [20].
6.1 Example 1: Super-resolution from Fourier measurements in 1D.
We start by testing our algorithm on a popular instance of problem (): super-resolution of a measure from finitely many of its Fourier moments
[TABLE]
We use a quadratic data fidelity term . This example is well studied by the signal processing community [26, 6, 13, 21].
We chose to be equal to , and a vector generated as , where is chosen at random as a -sparse atomic measure with amplitudes close to or . The positions of the Dirac masses were chosen as a small random perturbation from a uniform grid. The initial grid was chosen as a uniform grid with points, i.e. . We made experiments, with iterations of the exchange algorithm. The evolution of and for the first iterations for a typical iteration is displayed in Figure 1. We see that after already iterations, appears to be very close to . Before this iteration, the algorithm ’chooses’ to add points relatively uniformly to the grid, but after that, new points are only added close to . This is further emphasized by Figure 2, in which is plotted for each iteration, along with size of .
To track the success of the algorithm a bit more systematically, we chose to track the evolution of , and . The median over the 100 iterations, along with confidence intervals covering all experiments but the top and bottom are plotted in Figures 3. We see that all of the quality measures seem to converge linearly to 0.
Finally, we performed the same analysis for the optimum gap , the error and the sizes of the grids . ( was in each case chosen as the lowest value of over all iterations , and as the corresponding dual solution). We see that the optimum gap seems to converge exponentially to [math] right from the first iteration, wheras the error initially does not. The ’two-phase’-effect is also easy to spot: After about iterations, the algorithm switches from adding many points to adding only few points close to . Interestingly, the plateau of the -errors seems to be simultaneuos with the ’phase-transition’.
6.2 Example 2: Super-resolution from Gaussian measurements in 2D
Next, we perform a study in a two-dimensional setting. We consider and measurement functions of the form
[TABLE]
where the points live on a Euclidean grid of size , restricted to the domain . We then add white Gaussian noise to the measurements, leading to pictures of the type shown in Fig. 5. Here, the true underlying measure contains Dirac masses with random positive amplitudes and random locations on .
6.2.1 Exchange algorithm
The evolution of the grids and of the dual certificates is shown in Figure 6. As can be seen, points are initially added anywhere in the domain, but after a few iterations, they all cluster around the true locations, as expected from the theory. To further stress this phenomenon and illustrate our theorems and lemmata, we display many quantities of interest appearing in our main results in Fig. 7. the distance from to (where is estimated as ) on Fig. 7c, the distance from to on Fig. 7b, the evolution of on Fig. 7a, on Fig. 7e. Finally, the number of maxima of is shown on Fig. 7f. As can be seen, the number of maxima quickly stabilizes, suggesting that we reached a -regime. Then all the quantities (cost function, distance from , violation of the constraints) seem to converge to [math] linearly. This is not true after iteration 15, and we suspect that this is solely due to numerical inaccuracies when computing the solution of the discretized problems. Notice however that the accuracy of the Dirac locations drops below after 14 iterations, and that this accuracy is more than enough for the particular super-resolution application. Notice that if we wished to reach this accuracy with a fixed grid, we would need a Euclidean discretization containing points, while we here needed only 152 (). In addition, the resolution is stable since it is accomplished on a grid containing only points.
6.2.2 Continuous method
In this experiment, we evaluate the behavior of the gradient descent (26) depending on the initialization and on the number of iterations. We use the same setting as in the previous section. The left graph of Fig. 8 illustrates that the gradient descent typically converges linearly when initialized close enough to the true minimizer . This was predicted by Theorem 4.1. In this case (and actually all the others related to this experiment), it converges to machine precision in less than 1000 iterations. This is remarkable since the gradient descent is a simple algorithm that can be easily improved by using e.g. Nesterov acceleration (we proved that the function is locally convex) or other optimization schemes such as L-BFGS.
In order to evaluate the size of the basin of attraction around the global minimizer, we start from random points of the form , where and are random perturbations with an amplitude set as , with in . We then run gradient descents with different realizations of and record the success rate (i.e. the number of times the gradient descent converges to with an accuracy of at least ). We plot this success rate with respect to in Fig. 8b. As can be seen, the success rate is always when the relative error is less than , showing that for this particular problem, a rather rough initialization suffices for the gradient descent to converge to the global minimizer.
6.2.3 Alternating method
The alternating method suggested in Algorithm 2 turns out to converge in a single iteration when applied to the setting described above. We therefore apply it to a more challenging scenario with 30 Dirac masses instead of 11 and more noise. The measurements are shown in Fig. 9. We compare three implementations: a pure exchange method, an alternating method as in Algorithm 2 without line 14 and an alternating method as in in Algorithm 2 with line 14. The conclusions are as follows:
- •
All methods rapidly conclude that the underlying measure contains 30 Dirac masses. (The pure exchange algorithm after 10 iterations, the alternating method with line 14 already after the first).
- •
The pure exchange algorithm quickly gets to a point close to the optimum. The positions then slowly converge to the tue locations. It does however eventually find the basin of attraction of (in this example, it needed 10 iterations).
- •
Line 14 in the alternating method improves the convergence significantly. In fact, omitting it, we need 10 iterations to find the basin of attraction, whereas the version with the line finds it directly. Investigating this effect more closely is an interesting line of future research.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Bernard G Bodmann, Axel Flinth, and Gitta Kutyniok. Compressed sensing for analog signals. ar Xiv preprint ar Xiv:1803.04218 , 2018.
- 2[2] John M. Borwein and Adrian S. Lewis. Partially finite convex programming, part i: Quasi relative interiors and duality theory. Mathematical Programming , 57(1):15–48, May 1992.
- 3[3] Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization , 27(2):616–639, 2017.
- 4[4] Claire Boyer, Antonin Chambolle, Yohann De Castro, Vincent Duval, Frédéric De Gournay, and Pierre Weiss. On Representer Theorems and Convex Regularization. ar Xiv preprint ar Xiv:1806.09810 , 2018.
- 5[5] Kristian Bredies and Hanna Katriina Pikkarainen. Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations , 19(1):190–218, 2013.
- 6[6] Emmanuel J Candès and Carlos Fernandez-Granda. Towards a Mathematical Theory of Super-resolution. Communications on Pure and Applied Mathematics , 67(6):906–956, 2014.
- 7[7] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review , 43(1):129–159, 2001.
- 8[8] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems , pages 3036–3046, 2018.
