On the linear convergence rates of exchange and continuous methods for total variation minimization

Axel Flinth (IMT), Frédéric de Gournay (IMT, ITAV), Pierre Weiss (IMT, ITAV)

arXiv:1906.09919·math.OC·June 25, 2019·Math. Program.

TL;DR

This paper studies the linear convergence of exchange and continuous methods for total variation minimization, showing under certain conditions that both approaches converge linearly and proposing a combined alternating method.

Contribution

It provides the first analysis of linear convergence rates for exchange and continuous algorithms in total variation regularized inverse problems.

Findings

1. The exchange algorithm converges linearly under regularity conditions.

2. Continuous optimization of amplitudes and positions achieves linear convergence with a good initialization.

3. Combining both methods offers advantages in convergence and performance.

Abstract

We analyze an exchange algorithm for the numerical solution of total-variation regularized inverse problems over the space $\mathcal{M}(\Omega)$ of Radon measures on a subset $\Omega$ of $\mathbb{R}^{d}$. Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes and positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.

Equations

$$\inf_{\mu\in\mathcal{M}(\Omega)}J(\mu)\stackrel{\text{def.}}{=}\|\mu\|_{\mathcal{M}}+f(A\mu),$$

$$\mu^{\star}=\sum_{i=1}^{s}\alpha_{i}^{\star}\delta_{\xi_{i}},$$

$$f(x)=\iota_{\{y\}}(x)=\begin{cases}0&\text{if }x=y\\+\infty&\text{otherwise.}\end{cases}$$

$$\min_{\substack{q\in Q\\ c(x,q)\leq 0,\ x\in\Omega}}u(q)$$

$$\sup_{q\in\mathbb{R}^{m},\ \|A^{*}q\|_{\infty}\leq 1}-f^{*}(q).$$

$$\Omega_{k+1}\subset\Omega_{k}\cup\{x_{k}^{1},\dots,x_{k}^{p_{k}}\},$$

$$\Omega_{k+1}=\Omega_{k}\cup\mathop{\mathrm{argmax}}_{x\in\Omega}c(x,q_{k}).$$

$$X_{k}\stackrel{\text{def.}}{=}\left\{x\in\Omega\mid x\text{ local maximizer of }|A^{*}q_{k}|,\ |A^{*}q_{k}|(x)\geq 1\right\}$$

$$\operatorname{dist}(\Omega_{1},\Omega_{2})=\sup_{x_{2}\in\Omega_{2}}\inf_{x_{1}\in\Omega_{1}}\|x_{1}-x_{2}\|_{2}.$$

$$\|\mu\|_{\mathcal{M}}=\sup_{u\in\mathcal{C}_{0}(\Omega),\ \|u\|_{\infty}\leq 1}\mu(u).$$

$$f^{*}(y)=\sup_{x\in\mathbb{R}^{m}}\langle x,y\rangle-f(x).$$

$$f(x_{2})\geq f(x_{1})+\langle\eta,x_{2}-x_{1}\rangle+\frac{l}{2}\|x_{2}-x_{1}\|_{2}^{2}$$

$$f(x_{2})\leq f(x_{1})+\langle f^{\prime}(x_{1}),x_{2}-x_{1}\rangle+\frac{L}{2}\|x_{2}-x_{1}\|_{2}^{2}\quad\text{for all }(x_{1},x_{2})\in\mathbb{R}^{m}\times\mathbb{R}^{m}.$$

$$\langle a_{i}^{*},\mu\rangle=\int_{\Omega}a_{i}\,d\mu,$$

$$\min_{\mu\in\mathcal{M}(\Omega)}\|\mu\|_{\mathcal{M}(\Omega)}+f(A\mu)=\max_{q\in\mathbb{R}^{m},\ \|A^{*}q\|_{\infty}\leq 1}-f^{*}(q).$$

$$A^{*}q^{\star}\in\partial_{\|\cdot\|_{\mathcal{M}}}(\mu^{\star})\quad\text{and}\quad-q^{\star}\in\partial f(A\mu^{\star}).$$

$$\inf_{\mu\in\mathcal{M}(\Omega_{k})}\|\mu\|_{\mathcal{M}}+f(A\mu),$$

$$\sup_{q\in\mathbb{R}^{m},\ |A^{*}q(x)|\leq 1\ \forall x\in\Omega_{k}}-f^{*}(q).$$

$$\Omega_{k+1}=\Omega_{k}\cup X_{k},\quad\text{where }X_{k}\text{ is the set of local maximizers defined above}.$$

$$\|x-y\|_{2}<\delta\ \Rightarrow\ |a_{i}(x)-a_{i}(y)|<\frac{\epsilon}{\sup_{k}\|q_{k}\|_{1}}\quad\text{for all }i.$$

$$\|x-y\|_{2}<\delta\ \Rightarrow\ \left|(A^{*}q_{k})(x)-(A^{*}q_{k})(y)\right|<\frac{\epsilon}{\sup_{k}\|q_{k}\|_{1}}\sum_{i=1}^{m}|q_{k}(i)|\leq\epsilon.$$

$$|A^{*}q_{k}(x)|=\left|\sum_{i=1}^{m}a_{i}(x)q_{k}(i)\right|\leq\sum_{i=1}^{m}|a_{i}(x)||q_{k}(i)|<\frac{1}{\sup_{k}\|q_{k}\|_{1}}\sum_{i=1}^{m}|q_{k}(i)|\leq 1$$

$$\|A^{*}q_{k}\|_{\infty}=\left|(A^{*}q_{k})(x_{k}^{\star})\right|<\left|(A^{*}q_{k})(x_{k_{0}}^{\star})\right|+\epsilon\leq 1+\epsilon.$$

$$\|A^{*}q_{\infty}\|_{\infty}=\lim_{k\to\infty}\|A^{*}q_{k}\|_{\infty}\leq 1+\epsilon,$$

$$f^{*}(q_{\infty})+f(A\mu_{\infty})\leq\liminf_{k\to\infty}f^{*}(q_{k})+f(A\mu_{k}).$$

$$\liminf_{k\to\infty}f^{*}(q_{k})+f(A\mu_{k})=\liminf_{k\to\infty}-\|\mu_{k}\|_{\mathcal{M}}\leq-\liminf_{k\to\infty}\|\mu_{k}\|_{\mathcal{M}}\leq-\|\mu_{\infty}\|_{\mathcal{M}}.$$

$$\Omega_{k+1}\supseteq\Omega_{k}\cup\{x_{k}\}\quad\text{with }x_{k}\in\mathop{\mathrm{argmax}}_{x\in\Omega}|A^{*}q_{k}|.$$

$$\kappa\stackrel{\text{def.}}{=}\sup_{\|q\|_{2}\leq 1}\|A^{*}q\|_{\infty}=\sup_{x\in\Omega}\|A(x)\|_{2},\quad\kappa_{\nabla}\stackrel{\text{def.}}{=}\sup_{\|q\|_{2}\leq 1}\|(A^{*}q)^{\prime}\|_{\infty},\quad\kappa_{\mathrm{hess}}\stackrel{\text{def.}}{=}\sup_{\|q\|_{2}\leq 1}\|(A^{*}q)^{\prime\prime}\|_{\infty}.$$

$$\mu^{\star}=\sum_{i=1}^{s}\alpha_{i}^{\star}\delta_{\xi_{i}}.$$

$$|A^{*}q^{\star}|^{\prime\prime}(x)\preccurlyeq-\gamma\,\mathrm{Id}\quad\text{and}\quad|A^{*}q^{\star}|(x)\geq\frac{\gamma\tau_{0}^{2}}{2}$$

Full text

On the linear convergence rates of exchange and continuous methods for total variation minimization

Axel Flinth

IMT, Université de Toulouse, CNRS

Frédéric de Gournay

IMT, Université de Toulouse, CNRS

ITAV, Université de Toulouse, CNRS

Pierre Weiss

IMT, Université de Toulouse, CNRS

ITAV, Université de Toulouse, CNRS

Abstract

We analyze an exchange algorithm for the numerical solution of total-variation regularized inverse problems over the space $\mathcal{M}(\Omega)$ of Radon measures on a subset $\Omega$ of $\mathbb{R}^{d}$. Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes and positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.

Keywords: Total variation minimization, inverse problems, superresolution, semi-infinite programming.

MSC Classification: 49M25, 49M29, 90C34, 65K05.

Acknowledgement

The authors acknowledge support from ANR JCJC OMS.

1 Introduction

1.1 The problem

The main objective of this paper is to develop and analyze iterative algorithms to solve the following infinite dimensional problem:

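$$\inf_{\mu\in\mathcal{M}(\Omega)}J(\mu)\stackrel{\text{def.}}{=}\|\mu\|_{\mathcal{M}}+f(A\mu),\tag{$\mathcal{P}(\Omega)$}$$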

where $\Omega$ is a bounded open domain of $\mathbb{R}^{d}$ , $\mathcal{M}(\Omega)$ is the set of Radon measures on $\Omega$ , $\|\mu\|_{\mathcal{M}}$ is the total variation (or mass) of the measure $\mu$ , $f:\mathbb{R}^{m}\to\mathbb{R}\cup\{+\infty\}$ is a convex lower semi-continuous function with non-empty domain and $A:\mathcal{M}(\Omega)\to\mathbb{R}^{m}$ is a linear measurement operator.

An important property of Problem ( $\mathcal{P}(\Omega)$ ) is that at least one of its solutions $\mu^{\star}$ has a support restricted to $s$ distinct points with $s\leq m$ (see e.g. [30, 15, 4]), i.e. is of the form

$$\mu^{\star}=\sum_{i=1}^{s}\alpha_{i}^{\star}\delta_{\xi_{i}},$$

with $\xi_{i}\in\Omega$ and $\alpha_{i}^{\star}\in\mathbb{R}$ . This property motivates us to study a class of exchange algorithms. They were introduced as early as 1934 [24] and then extended in various manners [23]. They consist in discretizing the domain $\Omega$ coarsely and then refining it adaptively based on the analysis of so-called dual certificates. If the refinement process takes place around the locations $(\xi_{i})$ only, these methods considerably reduce the computational burden compared to a finely discretized mesh.

Our main results consist in a set of convergence rates for this algorithm that depend on the regularity of $f$ and on the non-degeneracy of a dual certificate at the solution. We also show the linear convergence rate for first order algorithms that continuously vary the coefficients $\alpha_{i}$ and $x_{i}$ of a discrete measure. Finally, we show that algorithms alternating between an exchange step and a continuous method share the best of both worlds: the global convergence guarantees of exchange algorithms together with the efficiency of first order methods. This yields a fast adaptive method with strong convergence guarantees for total variation minimization and related problems.

1.2 Applications

Our initial motivation to study the problem ( $\mathcal{P}(\Omega)$ ) stems from signal processing applications. We recover an infinite dimensional version of the basis pursuit problem [7] by setting

$$f(x)=\iota_{\{y\}}(x)=\begin{cases}0&\text{if }x=y\\+\infty&\text{otherwise.}\end{cases}$$

Similarly, the choice $f(x)=\frac{\tau}{2}\|x-y\|_{2}^{2}$ leads to an extension of the LASSO [27] called the Beurling LASSO [9]. Both problems have proved extremely useful in engineering applications. They have received significant attention recently thanks to theoretical progress in the field of super-resolution [9, 26, 6, 13]. Our results are particularly strong for the quadratic fidelity term.

1.3 Numerical approaches in signal processing

Progress in super-resolution [9, 26, 6, 13] motivated researchers from this field to develop numerical algorithms for solving Problem ($\mathcal{P}(\Omega)$). By far the most widespread approach is to use a fine uniform discretization and solve a finite dimensional problem. The complexity of this approach is however too large if one seeks high-precision solutions. This approach was analyzed from a theoretical point of view in [25, 13] for instance. The first papers investigating the use of ($\mathcal{P}(\Omega)$) for super-resolution purposes advocated the use of semi-definite relaxations [26, 6], which are limited to specific measurement functions and domains, such as trigonometric polynomials on the 1D torus $\mathbb{T}$. These limitations were significantly reduced in [10], where the authors suggested the use of Lasserre hierarchies. These methods are however currently unable to deal with large scale problems. Another approach suggested in [5], and referred to as a Frank-Wolfe algorithm, consists in iteratively adding one point to a discretization set, at a location where a so-called dual certificate is maximal. More recently, [28] began investigating the use of methods that continuously vary the positions $(x_{i})$ and amplitudes $(\alpha_{i})$ of discrete measures parameterized as $\mu=\sum_{i=1}^{s}\alpha_{i}\delta_{x_{i}}$. The authors gave sufficient conditions for a simple gradient descent on the product-space $(\alpha,x)$ to converge. In [3] and [11], this method was used in alternation with a Frank-Wolfe algorithm, the idea being to first add Dirac masses roughly at the right locations and then to optimize their amplitudes and positions continuously, leading to promising numerical results. Surprisingly enough, it seems that the connection with the mature field of semi-infinite programming has been ignored (or not explicitly stated) in all the mentioned references.

1.4 Some numerical approaches in semi-infinite programming

A semi-infinite program [23, 16] is traditionally defined as a problem of the form

$$\min_{\substack{q\in Q\\ c(x,q)\leq 0,\ x\in\Omega}}u(q)\tag{SIP$[\Omega]$}$$

where $Q$ and $\Omega$ are subsets of $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$ respectively, $u:Q\to\mathbb{R}$ and $c:\Omega\times Q\to\mathbb{R}$ are functions. The term semi-infinite stems from the fact that the variable $q$ is finite-dimensional, but it is subject to infinitely many constraints $c(x,q)\leq 0$ for $x\in\Omega$ . In order to see the connection between the semi-infinite program (SIP $[\Omega]$ ) and our problem ( $\mathcal{P}(\Omega)$ ), we can formulate its dual, which reads as

$$\sup_{q\in\mathbb{R}^{m},\ \|A^{*}q\|_{\infty}\leq 1}-f^{*}(q).\tag{$\mathcal{D}(\Omega)$}$$

This dual will play a critical role throughout the paper, and it is easy to relate it to a SIP by setting $Q=\mathbb{R}^{m}$, $u=f^{*}$ and $c(x,q)=\left|(A^{*}q)(x)\right|-1$.

Many numerical methods have been and are still being developed for semi-infinite programs and we refer the interested reader to the excellent chapter 7 of the survey book [23] for more insight. We sketch below two classes of methods that are of interest for our concerns.

1.4.1 Exchange algorithms

A canonical way of discretizing a semi-infinite program is to simply control finitely many of the constraints, say $c(x,q)\leq 0$ for $x\in\Omega_{0}\subseteq\Omega$ , where $\Omega_{0}$ is finite. The discretized problem SIP $[\Omega_{0}]$ can then be solved by standard proximal methods or interior point methods. In order to obtain convergence towards an exact solution of the problem, it is possible to choose a sequence $(\Omega_{k})$ of nested sets such that $\bigcup_{k}\Omega_{k}$ is dense in $\Omega$ . Solving the problems SIP $[\Omega_{k}]$ for large $k$ however leads to a high numerical complexity due to the high number of discretization points. The idea of exchange algorithms is to iteratively update the discretization sets $\Omega_{k}$ in a more clever manner than simply making them denser. A generic description is given by Algorithm 1.

In this paper, we consider update rules $\mathrm{Update\_Rule}$ of the form

$$\Omega_{k+1}\subset\Omega_{k}\cup\{x_{k}^{1},\dots,x_{k}^{p_{k}}\},$$

where the points $x_{k}^{i}$ are local maximizers of $c(\cdot,q_{k})$ . At each iteration, the set of discretization points can therefore be updated by adding and dropping a few prescribed points, explaining the name ’exchange’. The simplest rule consists of adding the single most violating point, i.e.

$$\Omega_{k+1}=\Omega_{k}\cup\mathop{\mathrm{argmax}}_{x\in\Omega}c(x,q_{k}).\tag{2}$$

It seems to be the first exchange algorithm and is nearly equivalent to the Remez algorithm from the 1930s [24]. It can be shown to be equivalent to a Frank-Wolfe (a.k.a. conditional gradient) method up to an epigraphical lift [11]. These methods were introduced in the field of signal processing in [5] and the connection with exchange algorithms was proposed in [14]. The update rule (2) is sufficient to guarantee convergence in the generic case and to ensure a decay of the cost function in $O\left(\frac{1}{k}\right)$, see [19]. Although ’exchange’ suggests that points are both added and removed, methods for which $\Omega_{k}\subseteq\Omega_{k+1}$ are also called exchange algorithms. The use of such rules often leads to easier convergence analyses, since we get monotonicity of the objective values $u(q_{k})$ for free [16]. Other examples [17] include only adding points if they exceed a certain margin, i.e. $c(x,q_{k})\geq\epsilon_{k}$, or adding all local maxima of $c(\cdot,q_{k})$. In the case of convex functions $f$, algorithms that both add and remove points can be derived and analyzed with the use of cutting plane methods. All these instances have their pros and cons and perform differently on different types of problems. Since a semi-infinite program basically allows one to minimize arbitrary continuous and finite dimensional problems, a theoretical comparison should depend on additional properties of the problem.
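The loop described in this subsection can be sketched in a few lines of Python. This is only an illustration under assumptions that are not taken from the paper: the measurement functions are hypothetical Gaussians sampled at points `t`, the data term is the quadratic $f(x)=\frac{\tau}{2}\|x-y\|_2^2$, the discretized problem is solved approximately with ISTA, and the maximization over $\Omega$ is replaced by a search over a finite candidate set. The refinement step adds the single most violating point and never removes points.

```python
import numpy as np

def measure(points, t, sigma=0.1):
    """Matrix with entries a_j(x_i) = exp(-(t_j - x_i)^2 / (2 sigma^2)), shape (m, len(points)).
    The Gaussian model is an assumption made for this sketch."""
    x = np.asarray(points, dtype=float)
    return np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def objective(M, alpha, y, tau):
    """Discretized primal value ||alpha||_1 + tau/2 ||M alpha - y||^2."""
    return np.abs(alpha).sum() + 0.5 * tau * np.sum((M @ alpha - y) ** 2)

def solve_discretized(M, y, tau, alpha0, n_iter=2000):
    """ISTA (proximal gradient) with step 1/L; it never increases the objective."""
    step = 1.0 / (tau * np.linalg.norm(M, 2) ** 2)
    alpha = alpha0.copy()
    for _ in range(n_iter):
        z = alpha - step * tau * (M.T @ (M @ alpha - y))
        alpha = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)  # soft thresholding
    return alpha

def exchange(y, t, tau, grid0, candidates, n_outer=15):
    grid = [float(x) for x in grid0]
    alpha = np.zeros(len(grid))
    history = []
    for _ in range(n_outer):
        M = measure(grid, t)
        alpha = solve_discretized(M, y, tau, alpha)     # warm start
        history.append(objective(M, alpha, y, tau))
        q = -tau * (M @ alpha - y)                      # dual variable q_k = -f'(A mu_k)
        certificate = np.abs(measure(candidates, t).T @ q)  # |A^* q_k| on the candidates
        j = int(np.argmax(certificate))
        x_new = float(candidates[j])
        if certificate[j] <= 1.0 + 1e-6 or x_new in grid:
            break                                       # no new violated constraint found
        grid.append(x_new)                              # add the most violating point
        alpha = np.append(alpha, 0.0)                   # zero amplitude keeps the objective
    return np.array(grid), alpha, history

# Hypothetical test problem: two spikes observed through 30 Gaussian samples.
t = np.linspace(0.0, 1.0, 30)
y = measure([0.3, 0.7], t) @ np.array([1.0, -0.5])
grid, alpha, history = exchange(y, t, tau=10.0, grid0=[0.25, 0.75],
                                candidates=np.linspace(0.0, 1.0, 201))
```

Because the grids are nested and the inner solver is warm-started with a descent method, the recorded objective values in `history` do not increase, which mirrors the monotonicity mentioned above.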

1.4.2 Continuous methods

Every iteration of an exchange algorithm can be costly: it requires solving a convex program with a number of constraints that increases if no discretization point is dropped. In addition, the problems tend to get more and more degenerate as the discretization points cluster, leading to numerical inaccuracies. In practice it is therefore tempting to use the following two-step strategy: i) find an approximate solution $\mu_{k}=\sum_{i=1}^{p_{k}}\alpha_{k}^{i}\delta_{x_{k}^{i}}$ of the primal problem ( $\mathcal{P}(\Omega)$ ) using $k$ iterations of an exchange algorithm and ii) continuously move the positions $X=(x_{i})$ and amplitudes $\alpha=(\alpha_{i})$ starting from $(\alpha_{k},X_{k})$ to minimize ( $\mathcal{P}(\Omega)$ ) using a nonlinear programming approach such as a gradient descent, a conjugate gradient algorithm or a Newton approach.

This procedure supposes that the output $\mu_{k}$ of the exchange algorithm has the right number $p_{k}=s$ of Dirac masses, that their amplitudes satisfy $\operatorname{sign}(\alpha_{i})=\operatorname{sign}(\alpha_{i}^{\star})$ and that $\mu_{k}$ lies in the basin of attraction of the optimization algorithm around the global minimum $\mu^{\star}$. To the best of our knowledge, knowing a priori when those conditions are met is still an open problem, and deciding when to switch from an exchange algorithm to a continuous method therefore relies on heuristics such as detecting when the number of masses $p_{k}$ stagnates for a few iterations. The cost of continuous methods is however much smaller than that of exchange algorithms since they amount to working with a small number $s(d+1)$ of variables. In addition, the instabilities mentioned earlier are significantly reduced for these methods. This observation was already made in [3, 11] and proved in [28] for specific problems.
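A continuous method of this kind can be sketched as a plain gradient descent on the pair $(\alpha,x)$. The sketch below rests on assumptions made only for illustration: one spatial dimension, hypothetical Gaussian measurement functions, the quadratic data term $f(x)=\frac{\tau}{2}\|x-y\|_2^2$, and a backtracking (Armijo) step-size rule, which is a convenience chosen here rather than a rule taken from the paper.

```python
import numpy as np

def gaussians(x, t, sigma=0.1):
    """a[j, i] = a_j(x_i) = exp(-(t_j - x_i)^2 / (2 sigma^2)) and its derivative in x_i."""
    diff = t[:, None] - x[None, :]
    a = np.exp(-diff ** 2 / (2 * sigma ** 2))
    return a, a * diff / sigma ** 2

def objective(alpha, x, y, t, tau):
    a, _ = gaussians(x, t)
    return np.abs(alpha).sum() + 0.5 * tau * np.sum((a @ alpha - y) ** 2)

def gradient(alpha, x, y, t, tau):
    """Gradient in (alpha, x); valid as long as no amplitude is exactly zero."""
    a, da = gaussians(x, t)
    r = a @ alpha - y
    return np.sign(alpha) + tau * (a.T @ r), tau * alpha * (da.T @ r)

def descend(alpha, x, y, t, tau, n_iter=200):
    values = [objective(alpha, x, y, t, tau)]
    for _ in range(n_iter):
        g_alpha, g_x = gradient(alpha, x, y, t, tau)
        g_norm2 = g_alpha @ g_alpha + g_x @ g_x
        step = 1.0
        for _ in range(40):                       # backtracking line search
            new_alpha, new_x = alpha - step * g_alpha, x - step * g_x
            new_val = objective(new_alpha, new_x, y, t, tau)
            if new_val <= values[-1] - 1e-4 * step * g_norm2:
                break
            step *= 0.5
        else:
            break                                 # no acceptable step found: stop
        alpha, x = new_alpha, new_x
        values.append(new_val)
    return alpha, x, values

# Hypothetical example: start close to a two-spike measure with the right signs.
t = np.linspace(0.0, 1.0, 30)
y = gaussians(np.array([0.3, 0.7]), t)[0] @ np.array([1.0, -0.5])
alpha, x, values = descend(np.array([0.8, -0.4]), np.array([0.33, 0.66]), y, t, tau=10.0)
```

Only $s(d+1)=4$ variables are updated here, which illustrates why each iteration is cheap compared with re-solving a growing discretized problem.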

1.5 Contribution

Many recent results in the field of super-resolution provide sufficient conditions for a non degenerate source condition to hold [6, 26, 12, 1, 21]. The non degeneracy means that the solution $q^{\star}$ of ( $\mathcal{D}(\Omega)$ ) is unique and that the dual certificate $|A^{*}q^{\star}|$ reaches $1$ at exactly $s$ points, where it is strictly concave. The main purpose of this paper is to study the implications of this non degeneracy for the convergence of a class of exchange algorithms and for continuous methods based on gradient descents. Our main results are as follows:

1. We show an eventual linear convergence rate of a class of exchange algorithms for convex functions $f$ with Lipschitz continuous gradient. More precisely, we prove that after a finite number of iterations $N$ the algorithm outputs vectors $q_{k}$ such that the set

$$X_{k}\stackrel{\text{def.}}{=}\left\{x\in\Omega\mid x\text{ local maximizer of }|A^{*}q_{k}|,\ |A^{*}q_{k}|(x)\geq 1\right\}$$

contains exactly $s$ points $(x_{k}^{1},\ldots,x_{k}^{s})$.

Letting $\widehat{\mu}_{k}=\sum_{i=1}^{s}\alpha_{i}^{k}\delta_{x_{i}^{k}}$ denote the solution of the finite dimensional problem $\inf_{\mu\in\mathcal{M}(X_{k})}\|\mu\|_{\mathcal{M}}+f(A\mu)$, we also show the linear convergence rate of the cost function $J(\mu_{k})$ to $J(\mu^{\star})$ and of the support in the following sense: after a number $N$ of initial iterations, it will take no more than $k_{\tau}=C\log(\tau^{-1})$ iterations to ensure that $\operatorname{dist}(X_{k_{\tau}+N},\xi)\leq\tau$. A similar statement holds for the coefficient vectors $\alpha^{k}$.

2. We also show that a well-initialized gradient descent algorithm on the pair $(\alpha,x)$ converges linearly to the true solution $\mu^{\star}$ and make explicit the width of the basin of attraction.

3. We then show how the proposed guarantees may explain the success of methods alternating between exchange methods and continuous methods at each step, in a spirit similar to the sliding Frank-Wolfe algorithm [11].

4. We finally illustrate the above results on total variation based problems in 1D and 2D.

2 Preliminaries

2.1 Notation

Throughout the paper, $\Omega$ denotes an open bounded domain of $\mathbb{R}^{d}$. The boundedness assumption plays an important role in controlling the number of elements in discretization procedures. A grid $\Omega_{k}$ is a finite set of points in $\Omega$. Its cardinality is denoted by $|\Omega_{k}|$. The distance between two sets $\Omega_{1}$ and $\Omega_{2}$ is defined by

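$$\operatorname{dist}(\Omega_{1},\Omega_{2})=\sup_{x_{2}\in\Omega_{2}}\inf_{x_{1}\in\Omega_{1}}\|x_{1}-x_{2}\|_{2}.$$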

Note that this definition of distance is not symmetric: in general $\operatorname{dist}(\Omega_{1},\Omega_{2})\neq\operatorname{dist}(\Omega_{2},\Omega_{1})$ .
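The asymmetry is easy to check numerically. The following Python sketch evaluates the definition for two finite point sets; the function name `directed_dist` and the sample sets are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def directed_dist(omega1, omega2):
    """dist(Omega_1, Omega_2): sup over x2 in Omega_2 of inf over x1 in Omega_1 of ||x1 - x2||_2.
    Both arguments are sequences of points given as rows of shape (n, d)."""
    p1 = np.atleast_2d(np.asarray(omega1, dtype=float))
    p2 = np.atleast_2d(np.asarray(omega2, dtype=float))
    # pairwise[i, j] = ||p1[i] - p2[j]||_2
    pairwise = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    return float(pairwise.min(axis=0).max())

# Hypothetical 1D example (d = 1): a three-point grid and a single spike location.
grid = [[0.0], [0.5], [1.0]]
spikes = [[0.5]]
d_grid_to_spikes = directed_dist(grid, spikes)   # the spike lies on the grid
d_spikes_to_grid = directed_dist(spikes, grid)   # grid points 0 and 1 are far from the spike
```

In this example the first value is zero while the second is $0.5$: a small $\operatorname{dist}(\Omega_{1},\Omega_{2})$ only says that every point of $\Omega_{2}$ has a nearby point in $\Omega_{1}$.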

We let $\mathcal{C}_{0}(\Omega)$ denote the set of continuous functions on $\Omega$ vanishing on the boundary. The set of Radon measures $\mathcal{M}(\Omega)$ can be identified as the dual of $\mathcal{C}_{0}(\Omega)$ , i.e. the set of continuous linear forms on $\mathcal{C}_{0}(\Omega)$ . For any sub-domain $\Omega_{k}\subset\Omega$ , we let $\mathcal{M}(\Omega_{k})$ denote the set of Radon measures supported on $\Omega_{k}$ . For $p\in[1,+\infty]$ , the $L^{p}$ -norm of a function $u\in\mathcal{C}_{0}(\Omega)$ is denoted by $\|u\|_{p}$ . The total variation of a measure $\mu\in\mathcal{M}(\Omega)$ is denoted $\|\mu\|_{\mathcal{M}}$ . It can be defined by duality as

$$\|\mu\|_{\mathcal{M}}=\sup_{u\in\mathcal{C}_{0}(\Omega),\ \|u\|_{\infty}\leq 1}\mu(u).$$

The $\ell^{p}$ -norm of a vector $x\in\mathbb{R}^{m}$ is also denoted $\|x\|_{p}$ . The Frobenius norm of a matrix $M$ is denoted by $\|M\|_{F}$ .

Let $f:\mathbb{R}^{m}\to\mathbb{R}\cup\{+\infty\}$ denote a convex lower semi-continuous function with non-empty domain $\operatorname{dom}(f)=\{x\in\mathbb{R}^{m},f(x)<+\infty\}$ . Its subdifferential is denoted $\partial f$ . Its Fenchel transform $f^{*}$ is defined by

$$f^{*}(y)=\sup_{x\in\mathbb{R}^{m}}\langle x,y\rangle-f(x).$$

If $f$ is differentiable, we let $f^{\prime}\in\mathbb{R}^{m}$ denote its gradient and if it is twice differentiable, we let $f^{\prime\prime}\in\mathbb{R}^{m\times m}$ denote its Hessian matrix. We let $\|f^{\prime}\|_{\infty}=\sup_{x\in\Omega}\|f^{\prime}(x)\|_{2}$ and $\|f^{\prime\prime}\|_{\infty}=\sup_{x\in\Omega}\|f^{\prime\prime}(x)\|$ , where $\|f^{\prime\prime}(x)\|$ is the largest singular value of $f^{\prime\prime}(x)$ . A convex function $f$ is said to be $l$ -strongly convex if

$$f(x_{2})\geq f(x_{1})+\langle\eta,x_{2}-x_{1}\rangle+\frac{l}{2}\|x_{2}-x_{1}\|_{2}^{2}$$

for all $(x_{1},x_{2})\in\mathbb{R}^{m}\times\mathbb{R}^{m}$ and all $\eta\in\partial f(x_{1})$ . A differentiable function $f$ is said to have an $L$ -Lipschitz gradient if it satisfies $\|f^{\prime}(x_{1})-f^{\prime}(x_{2})\|_{2}\leq L\|x_{1}-x_{2}\|_{2}$ . This implies that

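$$f(x_{2})\leq f(x_{1})+\langle f^{\prime}(x_{1}),x_{2}-x_{1}\rangle+\frac{L}{2}\|x_{2}-x_{1}\|_{2}^{2}\quad\text{for all }(x_{1},x_{2})\in\mathbb{R}^{m}\times\mathbb{R}^{m}.$$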

We recall the following equivalence [18]:

Proposition 2.1.

Let $f:\mathbb{R}^{m}\to\mathbb{R}\cup\{+\infty\}$ denote a convex and closed function with non empty domain. Then the following two statements are equivalent:

• $f$ has an $L$-Lipschitz gradient.

• $f^{*}$ is $\frac{1}{L}$-strongly convex.
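As a quick illustration of Proposition 2.1, consider the quadratic fidelity term $f(x)=\frac{\tau}{2}\|x-y\|_{2}^{2}$ of the Beurling LASSO from Section 1.2. The computation below is a standard exercise, included here only as a worked example; the supremum is attained at $x=y+q/\tau$.

```latex
\begin{align*}
f(x) &= \frac{\tau}{2}\|x-y\|_2^2, \qquad f'(x)=\tau\,(x-y)
      \quad\text{(so $f'$ is $L$-Lipschitz with $L=\tau$)},\\
f^{*}(q) &= \sup_{x\in\mathbb{R}^m}\ \langle x,q\rangle-\frac{\tau}{2}\|x-y\|_2^2
          = \Big\langle y+\frac{q}{\tau},\,q\Big\rangle-\frac{1}{2\tau}\|q\|_2^2
          = \langle y,q\rangle+\frac{1}{2\tau}\|q\|_2^2 .
\end{align*}
```

The conjugate is a linear term plus $\frac{1}{2\tau}\|q\|_{2}^{2}$, hence $\frac{1}{\tau}$-strongly convex, in agreement with the proposition for $L=\tau$.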

The linear measurement operators $A$ considered in this paper can be viewed as a collection of $m$ continuous functions $(a_{i})_{1\leq i\leq m}$ . For $x\in\Omega$ , the notation $A(x)$ corresponds to the vector $[a_{1}(x),\ldots,a_{m}(x)]\in\mathbb{R}^{m}$ .
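For a discrete measure $\mu=\sum_{i}\alpha_{i}\delta_{x_{i}}$ with distinct positions, this notation gives $A\mu=\sum_{i}\alpha_{i}A(x_{i})$ and $\|\mu\|_{\mathcal{M}}=\sum_{i}|\alpha_{i}|$. A minimal Python sketch follows; the sine functions used as $a_{j}$ and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def apply_operator(a_funcs, alphas, positions):
    """A mu for mu = sum_i alpha_i delta_{x_i}: (A mu)_j = sum_i alpha_i a_j(x_i)."""
    A_x = np.array([[a(x) for x in positions] for a in a_funcs])  # column i is A(x_i)
    return A_x @ np.asarray(alphas, dtype=float)

def total_variation(alphas):
    """Total variation of a discrete measure with distinct positions: sum_i |alpha_i|."""
    return float(np.sum(np.abs(alphas)))

# Hypothetical measurement functions on Omega = (0, 1) that vanish on the boundary.
a_funcs = [lambda x, j=j: np.sin(j * np.pi * x) for j in (1, 2)]
alphas, positions = [2.0, -1.0], [0.5, 0.25]      # mu = 2 delta_{0.5} - delta_{0.25}
measurements = apply_operator(a_funcs, alphas, positions)
mass = total_variation(alphas)
```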

2.2 Existence results and duality

In order to obtain existence and duality results, we will now make further assumptions.

Assumption 1.

$f:\mathbb{R}^{m}\to\mathbb{R}\cup\left\{\infty\right\}$ is convex and lower bounded. In addition, we assume that either $\operatorname{dom}(f)=\mathbb{R}^{m}$ or that $f$ is polyhedral (that is, its epigraph is a finite intersection of closed halfspaces).

Assumption 2.

The operator $A$ is weak- $*$ -continuous. Equivalently, the measurement functionals $a_{i}^{*}$ defined by $\left\langle a_{i}^{*},\mu\right\rangle=(A(\mu))_{i}$ are given by

$$\langle a_{i}^{*},\mu\rangle=\int_{\Omega}a_{i}\,d\mu,$$

for functions $a_{i}\in\mathcal{C}_{0}(\Omega)$ . In addition, we assume that $A$ is surjective on $\mathbb{R}^{m}$ .

The following results relate the primal and the dual.

Proposition 2.2 (Existence and strong duality).

Under Assumptions 1 and 2, the following statements are true:

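• The primal problem ($\mathcal{P}(\Omega)$) and its dual ($\mathcal{D}(\Omega)$) both admit a solution.

• The following strong duality result holds:

$$\min_{\mu\in\mathcal{M}(\Omega)}\|\mu\|_{\mathcal{M}(\Omega)}+f(A\mu)=\max_{q\in\mathbb{R}^{m},\ \|A^{*}q\|_{\infty}\leq 1}-f^{*}(q).$$

• Let $(\mu^{\star},q^{\star})$ denote a primal-dual pair. They are related as follows:

$$A^{*}q^{\star}\in\partial_{\|\cdot\|_{\mathcal{M}}}(\mu^{\star})\quad\text{and}\quad-q^{\star}\in\partial f(A\mu^{\star}).\tag{9}$$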

Proof.

The stated assumptions ensure the existence of a feasible measure $\mu$ . In addition, the primal function is coercive since $f$ is bounded below. This yields existence of a primal solution. The existence of a dual solution stems from the compactness of the set $\{q\in\mathbb{R}^{m},\|A^{*}q\|_{\infty}\leq 1\}$ (which itself follows from the surjectivity of $A$ ) and the continuity of $f^{*}$ on its domain. The strong duality result follows from [2, Thm 4.2]. The primal-dual relationship directly derives from the first order optimality conditions. ∎

The left inclusion in equation (9) plays an important role, which is well detailed in [13]. It implies that the support of $\mu^{\star}$ satisfies: $\operatorname{supp}(\mu^{\star})\subseteq\{x\in\Omega,|A^{*}q^{\star}(x)|=1\}$ .
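For a discrete solution the support inclusion can be verified by hand. The short derivation below is a standard argument spelled out as a worked example; it uses the usual characterization of the subdifferential of the total variation norm and assumes that the positions $\xi_{i}$ are distinct and the amplitudes $\alpha_{i}^{\star}$ are nonzero.

```latex
\begin{align*}
A^{*}q^{\star}\in\partial_{\|\cdot\|_{\mathcal{M}}}(\mu^{\star})
 &\iff \|A^{*}q^{\star}\|_{\infty}\le 1
   \ \text{ and }\ \int_{\Omega}A^{*}q^{\star}\,d\mu^{\star}=\|\mu^{\star}\|_{\mathcal{M}},\\
\mu^{\star}=\sum_{i=1}^{s}\alpha_{i}^{\star}\delta_{\xi_{i}}
 &\implies \sum_{i=1}^{s}\alpha_{i}^{\star}\,(A^{*}q^{\star})(\xi_{i})=\sum_{i=1}^{s}|\alpha_{i}^{\star}|,\\
\alpha_{i}^{\star}\,(A^{*}q^{\star})(\xi_{i})\le|\alpha_{i}^{\star}|\ \text{ for each } i
 &\implies (A^{*}q^{\star})(\xi_{i})=\operatorname{sign}(\alpha_{i}^{\star}),\quad i=1,\dots,s .
\end{align*}
```

In particular $|A^{*}q^{\star}(\xi_{i})|=1$ at every support point, which is the stated inclusion.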

3 An Exchange Algorithm and its convergence

3.1 The algorithm

We assume that an initial grid $\Omega_{0}\subseteq\Omega$ is given (e.g. a coarse Euclidean grid). Given a discretization $\Omega_{k}$ , we can define a discretized primal problem ( $\mathcal{P}(\Omega_{k})$ )

$$\inf_{\mu\in\mathcal{M}(\Omega_{k})}\|\mu\|_{\mathcal{M}}+f(A\mu),\tag{$\mathcal{P}(\Omega_{k})$}$$

and its associated dual ( $\mathcal{D}(\Omega_{k})$ )

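$$\sup_{q\in\mathbb{R}^{m},\ |A^{*}q(x)|\leq 1\ \forall x\in\Omega_{k}}-f^{*}(q).\tag{$\mathcal{D}(\Omega_{k})$}$$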

In this paper, we will investigate the exchange rule below:

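$$\Omega_{k+1}=\Omega_{k}\cup X_{k},\quad\text{where }X_{k}\text{ is the set of local maximizers defined in Section 1.5}.\tag{10}$$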

The implementation of this rule requires finding $X_{k}$ , the set of all the local maximizers of $\left|A^{*}q_{k}\right|$ exceeding $1$ .
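In one dimension, a simple way to approximate $X_{k}$ is to sample $A^{*}q_{k}$ on a fine grid and keep the discrete local maxima of its absolute value that reach the threshold $1$. The Python sketch below does exactly that; the function name, the sampling grid and the test profile are assumptions made for illustration, and a practical implementation would typically refine each candidate with a local optimizer.

```python
import numpy as np

def local_maximizers_exceeding_one(values, points):
    """Grid points where |values| has a discrete local maximum with |values| >= 1.
    `values` are samples of A^* q_k at the sorted 1D locations `points`."""
    v = np.abs(np.asarray(values, dtype=float))
    padded = np.concatenate(([-np.inf], v, [-np.inf]))
    # >= on the left and > on the right keeps one point per plateau
    is_peak = (padded[1:-1] >= padded[:-2]) & (padded[1:-1] > padded[2:])
    return np.asarray(points, dtype=float)[is_peak & (v >= 1.0)]

# Hypothetical certificate profile: bumps of height 1.5 and -1.2 violate the
# constraint, while the bump of height 0.5 does not.
points = np.linspace(0.0, 1.0, 101)
values = (1.5 * np.exp(-(points - 0.3) ** 2 / 0.005)
          - 1.2 * np.exp(-(points - 0.7) ** 2 / 0.005)
          + 0.5 * np.exp(-(points - 0.5) ** 2 / 0.001))
X_k = local_maximizers_exceeding_one(values, points)
```

Here `X_k` contains the two grid points nearest to $0.3$ and $0.7$; the local maximum near $0.5$ is discarded because it stays below $1$.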

3.2 A generic convergence result

The exchange algorithm above converges under quite weak assumptions. For instance, it is enough to assume that the function $f$ is differentiable with a Lipschitz continuous gradient.

Assumption 3.

The data fitting function $f:\mathbb{R}^{m}\to\mathbb{R}$ is differentiable with $L$ -Lipschitz continuous gradient.

Alternatively, we may assume that the initial set $\Omega_{0}$ is fine enough, which in particular implies that $|\Omega_{0}|\geq m$ .

Assumption 4.

The initial set $\Omega_{0}$ is such that $A$ restricted to $\Omega_{0}$ is surjective.

We may now present and prove our first result.

Theorem 3.1 (Generic convergence).

Under Assumptions 1 and 2, together with either Assumption 3 or Assumption 4, a subsequence of $(\mu_{k},q_{k})$ converges in the weak-$*$ topology towards a solution pair $(\mu^{\star},q^{\star})$ of ($\mathcal{P}(\Omega)$) and ($\mathcal{D}(\Omega)$), as well as in objective function value. If the solution of ($\mathcal{P}(\Omega)$) and/or ($\mathcal{D}(\Omega)$) is unique, the entire sequence converges.

Proof.

First remark that the sequence $(\|\mu_{k}\|_{\mathcal{M}}+f(A\mu_{k}))_{k\in\mathbb{N}}$ is non-increasing since the spaces $\mathcal{M}(\Omega_{k})$ are nested. Since $f$ is bounded below, the sequence $(\|\mu_{k}\|_{\mathcal{M}})_{k\in\mathbb{N}}$ is therefore bounded. Hence there exists a subsequence $(\mu_{k})$, which we do not relabel, that weak-$*$ converges towards a measure $\mu_{\infty}$.

Now, we will prove that the sequence of dual variables $(q_{k})_{k\in\mathbb{N}}$ is bounded. If Assumption 3 is satisfied, then $f^{*}$ is strongly convex and since $q=0$ is a feasible point, we must have $q_{k}\in\{q\in\mathbb{R}^{m},f^{*}(q)\leq f^{*}(0)\}$, which is bounded. Alternatively, if Assumption 4 is satisfied, notice that $1\geq\|A_{k}^{*}q_{k}\|_{\infty}\geq\|A_{0}^{*}q_{k}\|_{\infty}$, where $A_{k}$ denotes the restriction of $A$ to $\Omega_{k}$. Since $A_{0}$ is surjective, the previous inequality implies that $(\|q_{k}\|_{2})_{k\in\mathbb{N}}$ is bounded. Hence, in both cases, the sequence $(q_{k})_{k\in\mathbb{N}}$ converges up to a subsequence to a point $q_{\infty}$.

The key is now to prove that $\|A^{*}q_{\infty}\|_{\infty}\leq 1$. To this end, let us first argue that the family $(A^{*}q_{k})_{k\in\mathbb{N}}$ is equicontinuous. For this, let $\epsilon>0$ be arbitrary. Since the functions $a_{i}\in\mathcal{C}_{0}(\Omega)$ are all uniformly continuous, there exists a $\delta>0$ with the property

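$$\|x-y\|_{2}<\delta\ \Rightarrow\ |a_{i}(x)-a_{i}(y)|<\frac{\epsilon}{\sup_{k}\|q_{k}\|_{1}}\quad\text{for all }i.\tag{11}$$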

Consequently,

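$$\|x-y\|_{2}<\delta\ \Rightarrow\ \left|(A^{*}q_{k})(x)-(A^{*}q_{k})(y)\right|<\frac{\epsilon}{\sup_{k}\|q_{k}\|_{1}}\sum_{i=1}^{m}|q_{k}(i)|\leq\epsilon.$$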

Due to the convergence of $(q_{k})_{k\in\mathbb{N}}$, the sequence $(A^{*}q_{k})_{k\in\mathbb{N}}$ converges strongly to $A^{*}q_{\infty}$. We will now prove that $\|A^{*}q_{\infty}\|_{\infty}\leq 1$. If for some $k$, $\|A^{*}q_{k}\|_{\infty}\leq 1$, we will have $A^{*}q_{\ell}=A^{*}q_{k}$ for all $\ell\geq k$, and in particular $q_{\infty}=q_{k}$ and thus $\|A^{*}q_{\infty}\|_{\infty}\leq 1$. Hence, we may assume that $\|A^{*}q_{k}\|_{\infty}>1$ for each $k$, i.e. that we add at least one point to $\Omega_{k}$ in each iteration.

Now, towards a contradiction, assume that $\|A^{*}q_{\infty}\|_{\infty}=1+2\epsilon$ for an $\epsilon>0$. Set $\delta$ as in (11). For each $k\in\mathbb{N}$, let $x_{k}^{\star}$ be the element in $\mathop{\mathrm{argmax}}_{x}\left|(A^{*}q_{k})(x)\right|$ which has the largest distance to $\Omega_{k}$. Since $a_{\ell}\in\mathcal{C}_{0}(\Omega)$ for each $\ell$, there exists a compact subset $C\subseteq\Omega$ such that $(x_{k}^{\star})_{k}\subseteq C$. Indeed, there exists for each $\ell=1,\dots,m$ a compact set $C_{\ell}\subseteq\Omega$ such that $\left|a_{\ell}(x)\right|<(\sup_{k}\|q_{k}\|_{1})^{-1}$ for all $x\notin C_{\ell}$. Now, if $x\notin C\stackrel{\text{def.}}{=}\bigcup_{\ell=1}^{m}C_{\ell}$, we get

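$$|A^{*}q_{k}(x)|=\left|\sum_{i=1}^{m}a_{i}(x)q_{k}(i)\right|\leq\sum_{i=1}^{m}|a_{i}(x)||q_{k}(i)|<\frac{1}{\sup_{k}\|q_{k}\|_{1}}\sum_{i=1}^{m}|q_{k}(i)|\leq 1$$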

for every $k$ . Since $\left|A^{*}q_{k}(x_{k}^{\star})\right|>1$ , we conclude $(x_{k}^{\star})_{k}\subseteq C$ . Consequently, a subsequence (which we do not rename) of $(x_{k}^{\star})$ must converge. Thus, for some $k_{0}$ and every $k>k_{0}$ , we have $\|x_{k}^{\star}-x_{k_{0}}^{\star}\|_{2}<\delta$ . We then have

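$$\|A^{*}q_{k}\|_{\infty}=\left|(A^{*}q_{k})(x_{k}^{\star})\right|<\left|(A^{*}q_{k})(x_{k_{0}}^{\star})\right|+\epsilon\leq 1+\epsilon.$$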

In the last estimate, we used the constraint of ( $\mathcal{D}(\Omega_{k})$ ) and the fact that $x_{k_{0}}^{\star}\in\Omega_{k}$ . Since the last inequality holds for every $k\geq k_{0}$ , we obtain

$$\|A^{*}q_{\infty}\|_{\infty}=\lim_{k\to\infty}\|A^{*}q_{k}\|_{\infty}\leq 1+\epsilon,$$

where we used the fact that $(A^{*}q_{k})_{k}$ converges strongly towards $A^{*}q_{\infty}$ . This is a contradiction, and hence, we do have $\|A^{*}q_{\infty}\|_{\infty}\leq 1$ .

Overall, we proved that the primal-dual pair $(\mu_{\infty},q_{\infty})$ is feasible. It remains to prove that it is actually a solution. To do this, let us first remark that $\|\mu_{\infty}\|_{\mathcal{M}}+f(A\mu_{\infty})\geq-f^{*}(q_{\infty})$ by weak duality. To prove the second inequality, first notice that the weak- $*$ -continuity of $A$ implies that $A\mu_{k}\to A\mu_{\infty}$ . Assumption 1 furthermore implies that $f$ is lower semi-continuous. As a supremum of linear functions, so is $f^{*}$ . Since also $q_{k}\to q_{\infty}$ , we conclude

$$f^{*}(q_{\infty})+f(A\mu_{\infty})\leq\liminf_{k\to\infty}f^{*}(q_{k})+f(A\mu_{k}).$$

Assumptions 1 and 2 together with Proposition 2.2 imply exact duality of the discretized problems. This means $f^{*}(q_{k})+f(A\mu_{k})=-\|\mu_{k}\|_{\mathcal{M}}$. Since the norm is weak-$*$ lower semi-continuous, we thus obtain

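$$\liminf_{k\to\infty}f^{*}(q_{k})+f(A\mu_{k})=\liminf_{k\to\infty}-\|\mu_{k}\|_{\mathcal{M}}\leq-\liminf_{k\to\infty}\|\mu_{k}\|_{\mathcal{M}}\leq-\|\mu_{\infty}\|_{\mathcal{M}}.$$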

Reshuffling these inequalities yields $\|\mu_{\infty}\|_{\mathcal{M}}+f(A\mu_{\infty})\leq-f^{*}(q_{\infty})$ , i.e., the reverse inequality. Thus, $\mu_{\infty}$ and $q_{\infty}$ fulfill the duality conditions, and are solutions. The final claim follows from a standard subsequence argument. ∎

Remark 1.

Let us mention that the convergence result in Theorem 3.1 and its proof are not new, see e.g. [22]. The proof technique can be applied to prove similar statements for other refinement rules. For instance, the result still holds if we add the single most violating point:

[TABLE]

The result that we have just shown is very generally applicable. It however does not give us any knowledge of the convergence rate. The next section will be devoted to proving a linear convergence rate in a significant special case.

3.3 Non degenerate source condition

The idea behind adding points to the grid adaptively is to avoid a uniform refinement, which results in computationally expensive problems ( $\mathcal{D}(\Omega_{k})$ ). However, there is a priori no reason for the exchange rule not to refine in a uniform manner. In this section, we prove that additional assumptions improve the situation. First, we will from now on work under Assumption (3). It implies that the dual solutions $q_{k}$ are unique for every $k$ , since Proposition (2.1) ensures the strong convexity of the Fenchel conjugate $f^{*}$ . We furthermore assume that the $a_{j}$ are smooth.

Assumption 5 (Assumption on the measurement functionals ).

The measurement functions $a_{j}$ all belong to $\mathcal{C}_{0}^{2}(\Omega)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{C}_{0}(\Omega)\cap\mathcal{C}^{2}(\Omega)$ and their first and second order derivatives are uniformly bounded on $\Omega$ . We hence may define

[TABLE]

We also assume the following regularity condition on the solution $q^{\star}$ of ( $\mathcal{D}(\Omega)$ ), and its corresponding primal solution $\mu^{\star}$ .

Assumption 6 (Assumption on the primal-dual pair).

We assume that ( $\mathcal{P}(\Omega)$ ) admits a unique $s$ -sparse solution $\mu^{\star}$ supported on $\xi=(\xi_{i})_{i=1}^{s}\in\Omega^{s}$ :

[TABLE]

Let $q^{\star}$ denote the associated dual solution. We assume that the only points $x$ for which $\left|A^{*}q^{\star}(x)\right|=1$ are the points in $\xi$ , and that the second derivative of $\left|A^{*}q^{\star}\right|$ is negative definite in each point $\xi_{i}$ . It follows that there exists $\tau_{0}>0$ and $\gamma>0$ such that

[TABLE]

We note that if Equations (14) and (15) are valid for some $(\gamma,\tau_{0})$ , they are also valid for any $(\tilde{\gamma},\tilde{\tau}_{0})$ with $\tilde{\gamma}\leq\gamma$ and $\tilde{\tau}_{0}\leq\tau_{0}$ .

Assumption (6) may look very strong and hard to verify in advance. Recent advances in signal processing actually show that it is verified under clear geometrical conditions. First, problem ( $\mathcal{P}(\Omega)$ ) always admits solutions that are at most $m$ -sparse [30, 15, 4]. Therefore, the main difficulty comes from the uniqueness of the primal solution and from the two regularity conditions (14) and (15). Together, these assumptions are called the non-degenerate source condition on the dual certificate $A^{*}q^{\star}$ [13]. Many results in this direction have been shown for $f=\iota_{\left\{b\right\}}$ or $f(\cdot)=\frac{L}{2}\|\cdot-b\|_{2}^{2}$ , where $b=A\mu_{0}$ with $\mu_{0}$ a finitely supported measure. The papers [6, 26, 12] deal with different Fourier-type operators, [1] treats a few other special cases, whereas [21] provides an analysis for arbitrary integral operators sampled at random.

3.4 Auxiliary results

In this and the following sections, we always work under Assumptions 1, 2, 3 without further notice. We derive several lemmata that are direct consequences of the above assumptions. The first two rely strongly on the Lipschitz regularity of the gradient of $f$ .

Lemma 3.2 (Boundedness of the dual variables ).

Let $\bar{q}=\mathop{\mathrm{argmin}}_{q\in\mathbb{R}^{m}}f^{*}(q)$ denote the prox-center of $f^{*}$ . For all $k\in\mathbb{N}$ , we have

[TABLE]

Proof of Lemma 3.2.

For all $k\in\mathbb{N}$ , we have $0\in\{q\in\mathbb{R}^{m},\|A^{*}_{k}q\|_{\infty}\leq 1\}$ , hence $f^{*}(q_{k})\leq f^{*}(0)$ . By strong convexity of $f^{*}$ and optimality of $\bar{q}$ and $q_{k}$ , we get:

[TABLE]

Therefore $\|q_{k}-\bar{q}\|_{2}\leq\sqrt{2L(f^{*}(0)-f^{*}(\bar{q}))}$ and the conclusion follows from a triangle inequality. ∎

Proposition 3.3.

Let $q^{\star}$ be the solution of ( $\mathcal{D}(\Omega)$ ). Let

[TABLE]

Then for any $q$ , we have

[TABLE]

Proof.

Let $M=\{q\in\mathbb{R}^{m},f^{*}(q)\leq f^{*}(q^{\star})\}$ denote the sub-level set of $f^{*}$ and $D=\left\{q\in\mathbb{R}^{m}\,|\,\sup_{x\in\xi}|A^{*}q|(x)\leq 1\right\}$ . We first claim that $M$ and $D$ only have the point $q^{\star}$ in common. Indeed, $\mu^{\star}$ solves the problem $\mathcal{P}(\xi)$ and, by strong duality of the problem restricted to $\mathcal{M}(\xi)$ , $q^{\star}$ solves $\mathcal{D}(\xi)$ . By strong convexity of $f^{*}$ , $q^{\star}$ is the unique solution of $\mathcal{D}(\xi)$ , which exactly means $M\cap D=\{q^{\star}\}$ .

The fact that $M\cap D=\{q^{\star}\}$ implies that there exists a separating hyperplane there. Since the hyperplane must be tangent to $M$ , it can be written as $\left\{q\,|\,\left\langle w,q\right\rangle=\left\langle w,q^{\star}\right\rangle\right\}$ for a $w\in\partial f^{*}(q^{\star})$ , with $D\subset\left\{q\,|\,\left\langle w,q\right\rangle\geq\left\langle w,q^{\star}\right\rangle\right\}$ . Consequently, letting $\epsilon=\sup_{x\in\xi}|A^{*}q(x)|-1$ , we have

[TABLE]

Now, the strong convexity of $f^{*}$ implies for every $q\in(1+\epsilon)D\cap M$ ,

[TABLE]

Rearranging this, we obtain

[TABLE]

which is the claim. ∎

Before moving on, let us record the following proposition:

Proposition 3.4.

We have

[TABLE]

Proof.

The proof of the first inequality of (18) is a standard Taylor expansion :

[TABLE]

The proof of the second part of (18) follows the same lines as the first part and is left to the reader. ∎

The next two lemmata aim at transferring bounds from the geometric distances of the sets $X_{k}$ , $\Omega_{k}$ and $\xi$ to bounds on $|A^{*}q_{k}(\xi)|$ . Using Proposition 3.3, we may then transfer these bounds to bounds on the errors of the dual solutions and the dual (or primal) objective values.

Lemma 3.5.

The following inequalities hold

[TABLE]

Proof of Lemma 3.5.

To show (19), first notice that

[TABLE]

Indeed, by definition, the global maximum $z$ of $|A^{*}q_{k}|$ lies in $X_{k}$ and satisfies $(A^{*}q_{k})^{\prime}(z)=0$ . Furthermore, by construction, all points $x$ in $\Omega_{k}$ satisfy $|A^{*}q_{k}(x)|\leq 1$ . Using a Taylor expansion, we get for all $x\in\Omega$

[TABLE]

Taking $x$ as the point in $\Omega_{k}$ minimizing the distance to $z$ leads to (20). In addition, we have $\|(A^{*}q_{k})^{\prime\prime}\|_{\infty}\leq R\kappa_{\operatorname{hess}}$ by Lemma 3.2, so that $\|A^{*}q_{k}\|_{\infty}\leq 1+\epsilon$ with $\epsilon=R\kappa_{\operatorname{hess}}\frac{\operatorname{dist}(\Omega_{k},X_{k})^{2}}{2}$ .

Now, letting $C=\left\{q\,|\,\|A^{*}q\|_{\infty}\leq 1\right\}$ , we have just proven that $q_{k}\in(1+\epsilon)C$ . Furthermore, due to the optimality of $q_{k}$ for the discretized problem and to the fact that $q^{\star}$ is feasible for that problem, we will have $f^{*}(q_{k})\leq f^{*}(q^{\star})$ , i.e., $q_{k}$ is included in the $f^{*}(q^{\star})$ -sub-level set of $f^{*}$ : $M=\{q\in\mathbb{R}^{m}|f^{*}(q)\leq f^{*}(q^{\star})\}$ . An application of Proposition 3.3 now yields the result. ∎

Lemma 3.6.

Suppose that $\operatorname{dist}(X_{k},\xi)\leq\delta$ and $\operatorname{dist}(\Omega_{k},\xi)\leq\delta$ . Then

[TABLE]

Proof.

Let $y_{k}^{i}$ (resp. $x_{k}^{i}$ ) be the point closest to $\xi_{i}$ in $\Omega_{k}$ (resp. $X_{k}$ ). By assumption, we have $\|x_{k}^{i}-y_{k}^{i}\|_{2}\leq 2\delta$ . For all $i$ , we have

[TABLE]

Then, for all $z\in[y_{k}^{i},\xi_{i}]$ , using the fact that $(A^{*}q_{k})^{\prime}(x_{k}^{i})=0$ , we get

[TABLE]

Hence, we have $|A^{*}q_{k}(\xi_{i})|\leq 1+2\delta R\kappa_{\operatorname{hess}}\|\xi_{i}-y_{k}^{i}\|_{2}\leq 1+2\delta R\kappa_{\operatorname{hess}}\operatorname{dist}(\Omega_{k},\xi)$ . To conclude, we use Proposition 3.3 again. ∎

The last assertion takes full advantage of Assumption 6 and the fact that the function $|A^{*}q^{\star}|$ is uniformly concave around its maximizers. It allows us to transfer bounds on $\|q_{k}-q^{\star}\|_{2}$ to bounds on the distance from $X_{k}$ to $\xi$ .

Proposition 3.7.

Define $c_{q}=\gamma\min\left(\frac{\tau_{0}^{2}}{2\kappa},\frac{\tau_{0}}{\kappa_{\nabla}},\frac{1}{\kappa_{\operatorname{hess}}}\right)$ and assume that $\|q_{k}-q^{\star}\|_{2}<c_{q}$ , then

[TABLE]

Moreover, for each $i$ , if $B_{i}$ is the ball of radius $\tau_{0}$ around $\xi_{i}$ , then $X_{k}$ contains at most one point in $B_{i}$ and $A^{*}q_{k}$ has the same sign as $A^{*}q^{\star}(\xi_{i})$ in $B_{i}$ .

Proof.

Define $\tau=\frac{\kappa_{\nabla}}{\gamma}\|q_{k}-q^{\star}\|$ and note that $\tau<\tau_{0}$ . By Proposition 3.4, we have for each $x\in\Omega$

[TABLE]

The above inequalities together with Assumption 6 imply the following for all $1\leq i\leq s$ :

(i) For $x$ with $\|x-\xi_{i}\|_{2}\leq\tau_{0}$ , we have $\operatorname{sign}(A^{*}q_{k})(x)=\operatorname{sign}(A^{*}q^{\star})(x)=\operatorname{sign}(A^{*}q^{\star})(\xi_{i}).$

(ii) For $x$ with $\|x-\xi_{i}\|_{2}\leq\tau_{0}$ , we have $(\left|A^{*}q_{k}\right|)^{\prime\prime}(x)\prec(\left|A^{*}q^{\star}\right|)^{\prime\prime}(x)+\gamma\operatorname{id}\prec 0.$

(iii) For $x$ with $\|x-\xi_{i}\|_{2}\geq\tau_{0}$ , we have $\left|(A^{*}q_{k})(x)\right|<\left|(A^{*}q^{\star})(x)\right|+\frac{\gamma\tau_{0}^{2}}{2}\leq 1-\frac{\gamma\tau_{0}^{2}}{2}+\frac{\gamma\tau_{0}^{2}}{2}=1.$

(iv) For $x$ with $\tau<\|x-\xi_{i}\|_{2}\leq\tau_{0}$ , we have $\|(A^{*}q_{k})^{\prime}(x)\|_{2}\geq\|(A^{*}q^{\star})^{\prime}(x)\|_{2}-\gamma\tau>0.$

The estimate $\|(A^{*}q^{\star})^{\prime}(x)\|_{2}>\gamma\tau$ deserves a slightly more detailed justification than the others. Define $w=x-\xi_{i}$ and $g(\theta)=\left\langle(A^{*}q^{\star})^{\prime}(\xi_{i}+\theta w),w\right\rangle$ for $\theta\in(0,1)$ . We may apply the mean value theorem to conclude that

[TABLE]

for some $\hat{\theta}\in(0,1)$ . Since $g(0)=\left\langle(A^{*}q^{\star})^{\prime}(\xi_{i}),w\right\rangle=\left\langle 0,w\right\rangle=0$ , and $\left\langle(A^{*}q^{\star})^{\prime\prime}(\xi_{i}+\hat{\theta}w)w,w\right\rangle\leq-\gamma\|w\|_{2}^{2}$ , due to $(\left|A^{*}q^{\star}\right|)^{\prime\prime}\preccurlyeq-\gamma\operatorname{id}$ in $\{x\in\Omega,\|x-\xi_{i}\|_{2}\leq\tau_{0}\}$ , we obtain

[TABLE]

since $\|w\|_{2}=\|x-\xi_{i}\|_{2}>\tau$ by assumption. The last estimate was the claim $(iv)$ .

This implies a number of things. First, any local maximum of $\left|A^{*}q_{k}\right|$ with $\left|A^{*}q_{k}\right|\geq 1$ must lie within a distance of $\tau$ from the set $\xi$ (since for all other points, we have either $\left|A^{*}q_{k}\right|<1$ by $(iii)$ , or $(A^{*}q_{k})^{\prime}\neq 0$ by $(iv)$ ). Furthermore, since $\left|A^{*}q_{k}\right|$ is locally concave on the $\tau_{0}$ -neighborhoods of the $\xi_{i}$ by $(ii)$ , at most one local extremum exists in each such neighborhood. This is the claim.

∎

3.5 Fixed grids estimates

In this section, we consider a fixed grid $\Omega_{0}$ and ask what we need to assume about it in order to guarantee that the set of local maxima of $\left|A^{*}q_{0}(x)\right|$ is close to the true support $\xi$ . We express our result in terms of a geometrical property that we can control, the width of the grid $\operatorname{dist}(\Omega_{0},\Omega)$ .

Theorem 3.8.

Assume that $\operatorname{dist}(\Omega_{0},\Omega)\leq\frac{c_{q}}{\rho\sqrt{\kappa_{\operatorname{hess}}}}$ , then

[TABLE]

Proof.

It is trivial that $\operatorname{dist}(\Omega_{0},X_{0})\leq\operatorname{dist}(\Omega_{0},\Omega)$ . Applying Lemma 3.5, we immediately obtain the bound on $\|q_{0}-q^{\star}\|_{2}$ . By the same lemma,

[TABLE]

In order to obtain the first bound, remark that $\|q_{0}-q^{\star}\|_{2}\leq c_{q}$ and use Proposition 3.7. ∎

Remark 2.

Note that Theorem 3.8 allows us to control $\operatorname{dist}(\xi,X_{0})$ but not $\operatorname{dist}(X_{0},\xi)$ . Indeed, each $x\in X_{0}$ is guaranteed to be close to a $\xi_{i}$ , but not every $\xi_{i}$ needs to have a point in $X_{0}$ close by. Note however that the bounds on the optimal value indicate that, in this case, the missed $\xi_{i}$ is not crucial to produce a good candidate for solving the primal problem. We will provide more insight on this, in the case of $f$ being strongly convex, in Section 4.

3.6 Eventual linear convergence rate

In this section, we provide a convergence rate for the iterative algorithm. As a follow-up to Remark 2, the proof of convergence relies on the fact that the distances $\operatorname{dist}(X_{k},\xi)$ and $\operatorname{dist}(\xi,X_{k})$ are equal. To ensure this, one has to wait for a finite number of iterations; this is exactly the purpose of the next proposition.

Proposition 3.9.

Let $B_{i}=\{x\in\Omega,\|x-\xi_{i}\|_{2}<\tau_{0}\}$ . There exists a finite number of iterations $N$ , such that for all $k\geq N$ , $X_{k}$ has exactly $s$ points, one in each $B_{i}$ . It follows that $\operatorname{dist}(X_{k},\xi)=\operatorname{dist}(\xi,X_{k})$ . Moreover, if $S_{k}$ is the set of active points of $\mathcal{D}(\Omega_{k})$ , that is

[TABLE]

then $S_{k}\subset\cup_{i}B_{i}$ and for each $i$ , $B_{i}\cap S_{k}\neq\emptyset$ .

Proof.

We first prove that $B_{i}$ contains a point in $S_{k}$ . To this end, define the set of measures $\mathcal{M}_{-}=\{\mu\in\mathcal{M}(\Omega),\exists i\in\{1,\ldots,s\},\operatorname{supp}(\mu)\cap B_{i}=\emptyset\}$ and

[TABLE]

By assumption (6), $J_{+}>J^{\star}$ . Since $(J(\mu_{k}))_{k\in\mathbb{N}}$ converges to $J(\mu^{\star})$ , there exists $k_{2}\in\mathbb{N}$ such that $\forall k\geq k_{2}$ , $J(\mu_{k})<J_{+}$ . Hence $\mu_{k}$ must for each $1\leq i\leq s$ have points $z_{k}^{i}\in\Omega_{k}$ such that $\mu_{k}$ has non-zero mass at $z_{k}^{i}$ . Consequently, $|A^{*}q_{k}(z_{k}^{i})|=1$ , hence, each $B_{i}$ contains at least one point in $\Omega_{k}$ such that $|A^{*}q_{k}(z_{k}^{i})|=1$ .

Notice that $q_{k}$ converges to $q^{\star}$ by Theorem 3.1. Hence there is a finite number of iterations $k_{1}$ such that $\|q_{k}-q^{\star}\|<c_{q}$ for all $k\geq k_{1}$ . By item $(iii)$ of the proof of Proposition 3.7, $|A^{*}q_{k}|<1$ outside $\cup_{i}B_{i}$ , and by item $(ii)$ , $|A^{*}q_{k}|$ is strictly concave in each $B_{i}$ . Hence each $B_{i}$ contains exactly one maximizer of $|A^{*}q_{k}|$ exceeding one.

∎

We now move on to analyzing our exchange approach. Before formulating the main result, let us introduce a term: $\delta$ -regimes.

Definition 1.

We say that the algorithm enters a $\delta$ -regime at iteration $k_{\delta}$ if for all $k\geq k_{\delta}$ , we have $\operatorname{dist}(\xi,X_{k})\leq\delta$ . In particular it means that only points with a distance at most $\delta$ from $\xi$ are added to the grid.

Lemma 3.10.

Let $\bar{\tau}_{0}=\frac{\kappa_{\nabla}}{\gamma}c_{q}$ and $A=2^{d+1}d^{d/2}\left(\frac{\rho\sqrt{R\kappa_{hess}}\kappa_{\nabla}}{\gamma}\right)^{3d}$ . Let $N$ be as in Proposition 3.9.

1. For any $\tau$ , the algorithm enters a $\tau$ -regime after a finite number of iterations.

2. Assume that $N$ iterations have passed and that the algorithm is in a $\tau$ -regime with $\tau\leq\bar{\tau}_{0}$ . Then for every $\alpha\in(0,1)$ it takes no more than $\left\lceil\frac{A}{\alpha^{2d}}\right\rceil+1$ iterations to enter an $\alpha\tau$ -regime.

Proof.

Note that for any $\delta\leq\bar{\tau}_{0}$ , if there exists $p\in\mathbb{N}$ such that

[TABLE]

we will enter a $\delta$ -regime after iteration $p$ by applying Proposition 3.7.

To prove $(1)$ , note that we can assume without loss of generality that $\tau\leq\bar{\tau}_{0}$ (since entering a $\tau$ -regime means in particular entering a $\tau^{\prime}$ -regime for any $\tau^{\prime}\geq\tau$ ). Then, since $\|q_{k}-q^{\star}\|_{2}$ tends to zero as $k$ goes to infinity, (22) with $\delta=\tau$ is true after a finite number of iterations.

To prove $(2)$ , we proceed as follows: Proposition 3.9 ensures that in each iteration, exactly one point is added in each ball $\{x\in\Omega,\|x-\xi_{i}\|_{2}\leq\tau\}$ . Let $k_{0}$ be the current iteration. A covering number argument [29] ensures, for any $\Delta$ , that after $\delta_{0}=\left\lceil 2d^{d/2}\left(\frac{\tau}{\Delta}\right)^{d}\right\rceil$ iterations, each point in $X_{k}$ needs to lie at a distance at most $\Delta$ from $\Omega_{k}$ , i.e., $\operatorname{dist}(\Omega_{k},X_{k})\leq\Delta$ .

Now, if we choose $\Delta=\left(\frac{\gamma}{\kappa_{\nabla}\rho\sqrt{R\kappa_{hess}}}\right)^{3}\frac{\alpha^{2}\tau}{2}$ , Lemma 3.5 together with Proposition 3.7 imply

[TABLE]

Since $\Omega_{k}\subset\Omega_{k+1}$ for all $k$ , the distance $\operatorname{dist}(\Omega_{k},\xi)$ is non-increasing. As a result $\operatorname{dist}(\Omega_{k},\xi)\leq\left(\frac{\gamma\alpha}{\kappa_{\nabla}\rho}\right)^{2}\frac{\tau}{2R\kappa_{hess}}$ for all $k\geq k_{0}+\delta_{0}+1$ . Since we are in a $\tau$ -regime, we know that $\operatorname{dist}(X_{k},\xi)\leq\tau$ and $\operatorname{dist}(\Omega_{k},\xi)\leq\tau$ . Hence we can apply Lemma 3.6 to obtain that

[TABLE]

Then inequality (22) is satisfied with $\delta=\alpha\tau$ and the algorithm enters an $\alpha\tau$ -regime. ∎

The main result will tell us how many iterations we need to enter a $\tau$ -regime.

Theorem 3.11.

Let $\tau\leq\bar{\tau}_{0}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\frac{\kappa_{\nabla}}{\gamma}c_{q}$ (as in Lemma 3.10) and let $k_{0}$ be the iteration at which the algorithm enters a $\bar{\tau}_{0}$ -regime. Then $k_{0}<\infty$ , and the algorithm will enter a $\tau$ -regime after no more than $k_{0}+k_{\tau}$ iterations, where

[TABLE]

Additionally, we will have

[TABLE]

for $k\geq k_{0}+k_{\tau}+1$ . In other words, the algorithm will eventually converge linearly.

Proof.

The fact that $k_{0}<\infty$ is the first assertion of Lemma 3.10. As for the other part, we argue as follows: Fix $\alpha\in(0,1)$ . Since we have entered a $\bar{\tau}_{0}$ -regime at iteration $k_{0}$ , Lemma 3.10 implies that it will take no more than $\left\lceil\frac{A}{\alpha^{2d}}\right\rceil+1$ additional iterations to enter an $\alpha\bar{\tau}_{0}$ -regime. Repeating this argument, we see that after no more than

[TABLE]

iterations, we will have entered an $\alpha^{n}\bar{\tau}_{0}$ -regime. Choosing $\alpha=e^{-1/(2d)}$ and $n=\lceil 2d\log\left(\bar{\tau}_{0}/\tau\right)\rceil$ , we obtain the first statement.

The second statement immediately follows from Lemma 3.6 (as in the proof of Theorem 3.8) and the fact that entering a $\tau$ -regime means exactly that $\operatorname{dist}(X_{k},\xi)\leq\tau$ for all future $k$ , and therefore in particular $\operatorname{dist}(\Omega_{k+1},\xi)\leq\tau$ . ∎
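As a sanity check on the choice of $\alpha$ and $n$ made in the proof, the following worked computation (a sketch using only the quantities $A$ , $d$ , $\bar{\tau}_{0}$ and $\tau$ defined above) verifies that the $n$ -th regime is indeed a $\tau$ -regime and spells out the iteration count that the repeated application of Lemma 3.10 yields. It is derived from the proof and is not a transcription of the displayed formula for $k_{\tau}$ .

```latex
% With \alpha = e^{-1/(2d)} and n = \lceil 2d \log(\bar{\tau}_0/\tau) \rceil:
\alpha^{n}\,\bar{\tau}_{0}
  = e^{-n/(2d)}\,\bar{\tau}_{0}
  \leq e^{-\log(\bar{\tau}_{0}/\tau)}\,\bar{\tau}_{0}
  = \tau ,
\qquad
\alpha^{2d} = e^{-1}
\;\Longrightarrow\;
n\left(\left\lceil \frac{A}{\alpha^{2d}} \right\rceil + 1\right)
  = \left\lceil 2d \log\!\left(\frac{\bar{\tau}_{0}}{\tau}\right) \right\rceil
    \bigl(\lceil A e \rceil + 1\bigr).
```

In particular, the number of additional iterations grows only logarithmically in $\tau^{-1}$ , which is what "eventual linear convergence" refers to.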

The inequality (23) upper-bounds the cost function for the problem ( $\mathcal{P}(\Omega_{k})$ ). The numerical resolution of this problem is hard since $\Omega_{k}$ contains clusters of points, and in practice it is beneficial to solve the simpler discrete problem

[TABLE]

For this measure, we also obtain an a posteriori estimate of the convergence rate.

Proposition 3.12.

Let $\widehat{\mu}_{k}$ be the solution of ( $\mathcal{P}(X_{k})$ ). If $\operatorname{dist}(X_{k},\xi)\leq\tau$ , we have

[TABLE]

Proof.

For any $i$ , denote $x^{i}_{k}$ a point in $X_{k}$ closest to $\xi_{i}$ and define $\tilde{\mu}_{k}=\sum_{i=1}^{s}\alpha_{i}^{\star}\delta_{x_{k}^{i}}$ . We have $J(\widehat{\mu}_{k})\leq J(\tilde{\mu}_{k})$ and $\|\tilde{\mu}_{k}\|_{\mathcal{M}}\leq\|\mu^{\star}\|_{\mathcal{M}}$ . Furthermore, we have

[TABLE]

The last term in the inequality is handled by the following estimate:

[TABLE]

As for the penultimate term, remember that $q^{\star}=-\nabla f(A\mu^{\star})$ . This implies

[TABLE]

By making a Taylor expansion of $A^{*}q^{\star}$ in each $\xi_{i}$ , utilizing that the derivative vanishes there, and that $\|(A^{*}q^{\star})^{\prime\prime}(x)\|\leq\kappa_{\operatorname{hess}}\|q^{\star}\|_{2}$ for each $x\in\Omega$ , we see that $\left|(A^{*}q^{\star})(x_{k}^{i})-(A^{*}q^{\star})(\xi_{i})\right|\leq\frac{\kappa_{\operatorname{hess}}\|q^{\star}\|_{2}}{2}\|x_{k}^{i}-\xi_{i}\|_{2}^{2}$ for each $i$ . This yields

[TABLE]

Overall, we obtain

[TABLE]

∎

4 Convergence of continuous methods

In this section, we study an alternative algorithm that consists of using nonlinear programming approaches to minimize the following finite dimensional problem:

[TABLE]

where $X=(x_{1},\ldots,x_{p})$ . This principle is similar to continuous methods in semi-infinite programming [23] and was proposed specifically for total variation minimization in [3, 11, 28, 8]. By Proposition 3.9, we know that after a finite number of iterations, $X_{k}$ will contain exactly $s$ points located in a neighborhood of $\xi$ . This motivates the following hybrid algorithm:

•

Launch the proposed exchange method until some criterion is met. This yields a grid $X^{(0)}=X_{k}$ and we let $p=|X_{k}|$ .

•

Find the solution of the finite convex program

[TABLE]

•

Use the following gradient descent:

[TABLE]

where $\tau$ is a suitably defined step-size (e.g. defined using Wolfe conditions).
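The last step of the hybrid scheme can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: we take $d=1$ , Gaussian measurement functions sampled at hypothetical locations `t` with width `sigma`, the quadratic data term $f(z)=\frac{1}{2}\|z-b\|_{2}^{2}$ , and we read $G$ as $G(\alpha,X)=\|\alpha\|_{1}+f(\sum_{i}\alpha_{i}a(x_{i}))$ , in line with the shorthand $G_{f}(\alpha,X)=f(A\mu)$ used later. The step size is chosen by simple Armijo backtracking, a swapped-in simplification of the Wolfe conditions mentioned above.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 30)   # hypothetical sampling locations of the a_j
sigma = 0.1                      # hypothetical kernel width

def a(x):
    """Measurement vectors a(x_i) stacked as columns, shape (m, p)."""
    return np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def da(x):
    """Derivative of each column of a(x) with respect to its position x_i."""
    return (t[:, None] - x[None, :]) / sigma ** 2 * a(x)

def G(alpha, x, b):
    """Assumed objective: ||alpha||_1 + 0.5 * ||A(x) alpha - b||^2."""
    r = a(x) @ alpha - b
    return np.abs(alpha).sum() + 0.5 * r @ r

def grad_G(alpha, x, b):
    """Gradient of G by the chain rule (valid where all alpha_i != 0)."""
    r = a(x) @ alpha - b
    return np.sign(alpha) + a(x).T @ r, alpha * (da(x).T @ r)

def descend(alpha, x, b, iters=500, step0=1.0, c1=1e-4):
    """Gradient descent on (alpha, x) with Armijo backtracking."""
    history = [G(alpha, x, b)]
    for _ in range(iters):
        g_alpha, g_x = grad_G(alpha, x, b)
        sq = g_alpha @ g_alpha + g_x @ g_x
        step = step0
        while step > 1e-12:
            new_alpha, new_x = alpha - step * g_alpha, x - step * g_x
            if G(new_alpha, new_x, b) <= history[-1] - c1 * step * sq:
                break                      # sufficient decrease reached
            step *= 0.5
        else:
            break                          # no acceptable step: stop
        alpha, x = new_alpha, new_x
        history.append(G(alpha, x, b))
    return alpha, x, history

# Synthetic data from a 2-sparse measure, then a start near its support.
xi = np.array([0.3, 0.7])
alpha_star = np.array([1.0, -0.8])
b = a(xi) @ alpha_star
alpha, x, history = descend(np.array([0.8, -0.6]), np.array([0.33, 0.68]), b)
```

The backtracking guarantees that `history` is non-increasing; whether the iterates reach the minimizer depends on the initialization, which is precisely the question studied in this section.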

We tackle the following question: does the gradient descent algorithm converge to the solution if initialized well enough?

4.1 Existence of a basin of attraction

This section is devoted to proving the existence of a basin of attraction of a descent method in $G$ . Under two additional assumptions, we state our result in Proposition 4.1.

Assumption 7.

The function $f$ is twice differentiable and $\Lambda$ -strongly convex.

The twice differentiability assumption is mostly due to convenience, but the strong convexity is crucial. The second assumption is related to the structure of the support $\xi$ of the solution $\mu^{\star}$ .

Assumption 8.

For any $x,y\in\Omega$ denote $K(x,y)=\sum_{\ell}a_{\ell}(x)a_{\ell}(y)$ . The transition matrix

[TABLE]

is assumed to be positive definite, with a smallest eigenvalue larger than $\Gamma>0$ .

It is again possible to prove for many important operators $A$ that this assumption is satisfied if the set $\xi$ is separated. See the references listed in the discussion about Assumption 6. The following proposition describes the links between minimizing $G$ and solving ( $\mathcal{P}(\Omega)$ ).

Proposition 4.1.

Let $\mu^{\star}=\sum_{i=1}^{s}\alpha^{\star}_{i}\delta_{\xi_{i}}\neq 0$ be the solution of ( $\mathcal{P}(\Omega)$ ). Under Assumptions 7 and 8, $(\alpha^{\star},\xi)$ is a global minimum of $G$ . Additionally, $G$ is differentiable with a Lipschitz gradient and strongly convex in a neighborhood of $(\alpha^{\star},\xi)$ .

Hence, there exists a basin of attraction around $(\alpha^{\star},\xi)$ such that performing a gradient descent on $G$ will yield the solution of ( $\mathcal{P}(\Omega)$ ) at a linear rate.

The rest of this section is devoted to the proof of Proposition 4.1. Let us begin by stating a simple auxiliary result.

Lemma 4.2.

Let $U$ and $V$ be vector spaces and $C:V\to V$ be a linear operator with $C\succcurlyeq\lambda\operatorname{id}_{V}$ for a $\lambda\geq 0$ . Then, for any $B:U\to V$

[TABLE]

Proof.

If $B^{*}CB-\lambda B^{*}B$ is positive semidefinite, the claim holds. Since for $v\in U$ arbitrary

[TABLE]

the latter is the case. ∎
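The lemma is easy to check numerically in finite dimensions. The sketch below assumes, as the proof suggests, that the statement reads $B^{*}CB\succcurlyeq\lambda B^{*}B$ ; the dimensions, the random seed and the way $C$ is generated are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_v = 3, 5                       # illustrative dimensions of U and V

# Build a symmetric C with C >= lam * id by shifting a PSD matrix.
lam = 0.7
S = rng.standard_normal((n_v, n_v))
C = S @ S.T + lam * np.eye(n_v)

B = rng.standard_normal((n_v, n_u))   # arbitrary linear map U -> V

# B^T C B - lam * B^T B = B^T (C - lam * id) B should be positive semidefinite.
gap = B.T @ C @ B - lam * (B.T @ B)
min_eig = np.linalg.eigvalsh(gap).min()
print("smallest eigenvalue of B^T C B - lam B^T B:", min_eig)
```

The smallest eigenvalue is non-negative up to rounding, since the difference equals $B^{*}(C-\lambda\operatorname{id}_{V})B$ .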

Let us introduce some notation that will be used in this section: for an $X=(x_{1},\ldots,x_{p})\in\Omega^{p}$ for some $p$ , $A(X)$ denotes the matrix $[a_{i}(x_{j})]$ . Analogously, $A^{\prime}(X)$ and $A^{\prime\prime}(X)$ denote the operators

[TABLE]

respectively. Note that for $q\in\mathbb{R}^{m}$ and $X\in\Omega^{p}$ ,

[TABLE]

We will also use the shorthands $\mu=\sum_{i}\alpha_{i}\delta_{x_{i}}$ , $G_{f}(\alpha,X)=f(A\mu)$ , and, for $\alpha\in\mathbb{R}^{p}$ , $D(\alpha)$ denotes the operator

[TABLE]

We have

[TABLE]

so that in points $(\alpha,X)$ with $\alpha_{i}\neq 0$ for all $i$ , and in particular in a neighborhood of $(\alpha^{\star},\xi)$ , $G$ is differentiable and its gradient is given by :

[TABLE]

As for the second derivatives, we have

[TABLE]

We may now prove our claims.

Proof of Proposition 4.1.

First, let us note that due to the optimality conditions of $\mathcal{P}(\Omega)$ , we know that

[TABLE]

Letting $q=-\nabla f(A\mu)$ , it is furthermore fruitful to decompose the Hessian of $G$ into two parts:

[TABLE]

Now, $\left|A^{*}q^{\star}\right|$ has local maxima in the points $\xi_{i}$ , so that $(A^{*}q^{\star})^{\prime}(\xi)=0$ . In these points, we furthermore have that $\operatorname{sign}(\alpha_{i}^{\star})=A^{*}q^{\star}(\xi_{i})$ , so that the gradient of $G$ given in (27) vanishes.

To prove the rest, it is enough to show that the Hessian of $G_{f}$ is positive definite in a neighborhood around $(\alpha^{\star},\xi)$ . Let $(\alpha,X)$ be arbitrary. $H_{1}$ is an operator of the form $M_{1}^{*}M_{2}(X)^{*}{\mathcal{L}}M_{2}(X)M_{1}$ , with ${\mathcal{L}}=f^{\prime\prime}(A\mu):\mathbb{R}^{m}\to\mathbb{R}^{m}$ and

[TABLE]

Due to the $\Lambda$ -strong convexity of $f$ , $\mathcal{L}\succcurlyeq\Lambda\operatorname{id}$ . We furthermore have

[TABLE]

Let us now turn to $M_{2}(X)^{*}M_{2}(X)$ . If we define $M_{2}(\xi)=\begin{bmatrix}A(\xi)&A^{\prime}(\xi)\end{bmatrix}$ , we have

[TABLE]

by Assumption (8). We however have

[TABLE]

Now, by definition of $\kappa$ and $\kappa_{\nabla}$ ,

[TABLE]

and similarly for $\|M_{2}(X)\|_{2}=\|M_{2}(X)^{*}\|_{2}$ . Also, we have, by (18):

[TABLE]

so that all in all

[TABLE]

We may now apply Lemma 4.2 twice to conclude

[TABLE]

where we defined

[TABLE]

It remains to analyze $H_{2}$ . Define the bilinear form

[TABLE]

Then, if we define $w=\nabla f(A\mu)-\nabla f(A\mu^{\star})=q^{\star}-q$ , we have

[TABLE]

This makes it evident that

[TABLE]

The $L$ -Lipschitz gradient of $f$ proves that $\|w\|_{2}\leq L\|A\mu-A\mu^{\star}\|_{2}$ . Using (18) yields directly:

[TABLE]

We still need to bound $\mathcal{H}_{2}$ . First remember that Assumption 6 asserts that for each $i$ , $\operatorname{sign}\alpha_{i}^{\star}(A^{*}q^{\star})^{\prime\prime}\preccurlyeq-\gamma\operatorname{id}$ and $\operatorname{sign}(\alpha_{i})=\operatorname{sign}(\alpha_{i}^{\star})$ in the ball of radius $\tau_{0}$ around $\xi_{i}$ . Consequently, if $(\alpha,X)$ is chosen so that for each $i$ , $\|x_{i}-\xi_{i}\|_{2}\leq\tau_{0}$ , we get $-\alpha_{i}(A^{*}q^{\star})^{\prime\prime}(x_{i})\succcurlyeq\left|\alpha_{i}\right|\gamma\operatorname{id}$ , and

[TABLE]

By definition of $\kappa_{\operatorname{hess}}$ , we can further estimate

[TABLE]

Using $(A^{*}q^{\star})^{\prime}(\xi_{i})=0$ , we obtain

[TABLE]

If we now define

[TABLE]

we obtain

[TABLE]

Further utilizing the definition of $G_{1}$ and (28), we arrive at

[TABLE]

Since $G_{1}(X),G_{2}(\alpha,X)\to 0$ for $\alpha\to\alpha^{\star}$ and $X\to\xi$ , we obtain the claim. ∎

4.2 Eventually entering the basin of attraction

The following proposition shows that the pair $(\tilde{\alpha},X_{k})$ , defined as the amplitudes and positions of the Dirac components of the solution $\widehat{\mu}$ of ( $\mathcal{P}(X_{k})$ ), will eventually lie in the basin described by Proposition 4.1. This result is stated in Corollary 4.4; the rest of this section is dedicated to proving it.

Proposition 4.3.

Assume that Assumptions 7 and 8 are true. Consider an $s$ -sparse measure

[TABLE]

for some $\tilde{\alpha}\in\mathbb{R}^{s}$ and $(\tilde{x}_{\ell})_{\ell=1\dots s}$ pairwise different points of $\Omega$ . We then have

[TABLE]

Proof.

Let $A(\xi)^{\dagger}$ be the Moore-Penrose inverse of $A(\xi)=[A(\xi_{1}),\dots,A(\xi_{s})]$ . Due to Assumption 8, $A(\xi)^{\dagger}$ has full rank and has an operator norm no larger than $\Gamma^{-1/2}$ . Since

[TABLE]

bounds on $A(\xi)\tilde{\alpha}-A\tilde{\mu}$ and $A\tilde{\mu}-A(\xi)\alpha^{\star}$ will therefore transform to a bound on $\tilde{\alpha}-\alpha^{\star}$ .

Let us begin with the former. We have

[TABLE]

where we used the Cauchy-Schwarz inequality in the last step.

To bound the latter, recall that $\Lambda$ -strong convexity of $f$ means that

[TABLE]

The optimality conditions for ( $\mathcal{P}(\Omega)$ ) tell us that $q^{\star}=-\nabla f(A\mu^{\star})$ , and hence

[TABLE]

where we in the last step used that $\|A^{*}q^{\star}\|_{\infty}\leq 1$ . Plugging the above inequality in (29) yields

[TABLE]

The claim follows. ∎

Corollary 4.4.

By Proposition 3.9, if $k$ is large enough then $X_{k}$ contains exactly $s$ points. In this case, let $\widehat{\mu}_{k}=\sum_{i=1}^{s}\widehat{\alpha}_{i}\delta_{\hat{x}_{i}^{k}}$ be the solution of ( $\mathcal{P}(X_{k})$ ). Applying Proposition 4.3, recalling that $\max_{i}\|\xi_{i}-\hat{x}_{i}^{k}\|_{2}\leq\operatorname{dist}(X_{k},\xi)$ and using the bound (24), we obtain :

[TABLE]

Since $\operatorname{dist}(X_{k},\xi)$ is guaranteed to eventually converge to zero by Theorem 3.11 and $\|\widehat{\mu}_{k}\|_{\mathcal{M}}$ are bounded (e.g. by lower boundedness of $f$ and upper boundedness of $J(\widehat{\mu}_{k})$ ), $(\widehat{\alpha},X_{k})$ will eventually lie in the basin of attraction of $G$ .

5 Description of the hybrid approach

To conclude this paper, we propose a method alternating between an exchange step and a continuous gradient descent. It is detailed in Algorithm 2. The idea is, after each iteration of an exchange algorithm, to start a gradient descent of $G$ initialized at the solution $\widehat{\mu}_{k}$ of ( $\mathcal{P}(X_{k})$ ). If this gradient descent converges to a measure $\bar{\mu}_{k}$ , we can subsequently test if it is an optimal point by checking if $\bar{q}_{k}=-\nabla f(A\bar{\mu}_{k})$ fulfills the stopping criterion $\|A^{*}\bar{q}_{k}\|_{\infty}\leq 1+\epsilon$ , where $\epsilon$ is a user defined stopping criterion (the latter is justified by Proposition 3.3). If so, we may output $\bar{\mu}_{k}$ , and if not, we may instead continue our exchange algorithm, possibly after adding also the support points of $\bar{\mu}_{k}$ . Its behavior is described in the following theorem.

Theorem 5.1 (Convergence guarantees for the alternating method).

Algorithm 2 comes with the following guarantees:

1. (Theorem 3.1) Under Assumptions 1, 2 and 3, it is guaranteed to stop after a finite number of iterations for any stopping criterion $\epsilon>0$ .

2. (Theorem 3.11) If in addition Assumptions 5 and 6 are satisfied, then the algorithm eventually converges linearly: for $k\geq N+k_{\tau}$ with $k_{\tau}\lesssim\log(\tau^{-1})$ , we have $\operatorname{dist}(\Omega_{k},\xi)\leq\tau$ .

3. (Proposition 4.1, Theorem 3.11 and Proposition 4.3) If in addition Assumptions 7 and 8 are satisfied, then, for large enough $k$ , the low complexity gradient descent method (26) converges linearly: $\|(\alpha^{(t)},X^{(t)})-(\alpha^{\star},\xi)\|_{2}\leq c^{t}\|(\alpha^{(0)},X^{(0)})-(\alpha^{\star},\xi)\|_{2}$ for some $0\leq c<1$ .
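The stopping test $\|A^{*}\bar{q}_{k}\|_{\infty}\leq 1+\epsilon$ used by the alternating method can be sketched as follows. This is only an illustration under stated assumptions: $d=1$ , hypothetical Gaussian measurement functions (locations `t`, width `sigma`), the quadratic data term $f(z)=\frac{1}{2}\|z-b\|_{2}^{2}$ so that $\bar{q}=b-A\bar{\mu}$ , and a supremum approximated on a fine grid, which can only underestimate the true sup-norm.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 30)   # hypothetical sampling locations of the a_j
sigma = 0.1                      # hypothetical kernel width

def a(x):
    """Measurement vectors a(x_i) stacked as columns, shape (m, p)."""
    x = np.atleast_1d(x)
    return np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def certificate_sup(alpha, x, b, n_grid=1001):
    """Grid approximation of sup_x |A^* q(x)| with q = -grad f(A mu)."""
    q_bar = b - a(x) @ alpha             # valid for f = 0.5 * ||. - b||^2 only
    grid = np.linspace(0.0, 1.0, n_grid)
    return float(np.abs(a(grid).T @ q_bar).max())

def passes_stopping_test(alpha, x, b, eps=1e-2):
    """True if the (grid-approximated) certificate satisfies the criterion."""
    return certificate_sup(alpha, x, b) <= 1.0 + eps

# Synthetic data; the zero measure is far from optimal and fails the test.
xi = np.array([0.3, 0.7])
b = a(xi) @ np.array([1.0, -0.8])
print(passes_stopping_test(np.zeros(2), xi, b))
```

The test is meant to be applied to the limit $\bar{\mu}_{k}$ of the gradient descent, i.e. to a critical point of $G$ ; on its own, a small sup-norm does not certify optimality of an arbitrary measure.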

Overall, this method has many desirable properties: the continuous method should be used whenever the exchange method reaches its basin of attraction since its per iteration cost is much cheaper. However, it is unclear in general that this basin even exists. In that case, the exchange method should be preferred since it eventually converges linearly under quite mild assumptions. The proposed algorithmic scheme somehow captures the best of all methods. Let us notice that it is very similar in spirit to the sliding Frank-Wolfe algorithm proposed in [11], apart from the fact that we suggest adding all the points $X_{k}$ violating the constraints, while the single most violating point is added in [11]. We believe that the proposed analysis sheds some light on the good numerical performance of this method.

Arguably the most complicated step in this algorithm is to evaluate $X_{k}$ , the set of local maximizers of $A^{*}q_{k}$ exceeding $1$ . This is an impossible task for an arbitrary function $A^{*}q_{k}$ . However, a simple heuristic described in the next section provided rather satisfactory results for the measurement functions considered in this paper (trigonometric polynomials and Gaussian convolution).

Apart from this, let us point out that the subproblems in this algorithm are well suited for numerical resolution. In the exchange algorithm, we only solve the dual problems $\mathcal{D}(\Omega_{k})$, which are strongly convex. Hence first-order methods, for instance, come with guarantees of convergence to $q_{k}$ in $\ell^{2}$-norm. Recovering the masses $\hat{\alpha}_{k}$, solutions of $\mathcal{P}(X_{k})$, is also stable since $X_{k}$ (the local maximizers of $A^{*}q_{k}$) is typically a well separated set of low cardinality. The gradient descent (or alternative nonlinear programming approach) on $G(\alpha,X)$ is performed over a low dimensional set. If the convergence is not satisfactory (e.g. the norm of $\nabla G$ does not decay fast enough), it can be stopped, and we can switch back to the exchange algorithm.

6 Numerical Experiments

To test our theory, we have implemented our algorithm in MATLAB. Before displaying the results of the experiments, let us discuss a few key steps in the implementation. In the entire section, we assume for simplicity that $\Omega=[0,1]^{d}$ for $d=1$ or $2$. Note that this is no true restriction: by scaling and translation, we can always ensure that $\Omega\subseteq[0,1]^{d}$, and trivially extend the measurement functions by $0$ to the entirety of $[0,1]^{d}$.

Evaluating $X_{k}$

Each iteration of the exchange algorithm requires the exact calculation of the local maximizers of $A^{*}q_{k}$ exceeding $1$. This is, in general, an impossible task. We resort to the following heuristic method: given a $q_{k}$, we first evaluate $|A^{*}q_{k}|$ on a fixed rectangular grid $G=(n^{-1}[0,\dots,n])^{d}$, and determine all of the discrete peaks, i.e. points in which $\left|A^{*}q_{k}\right|$ is larger than at all of its neighbors in the grid, and where $\left|A^{*}q_{k}\right|$ exceeds $1-\epsilon_{1}$ for a threshold $\epsilon_{1}>0$. Next, we start a gradient descent in each of these points, stopping each one once $\|(A^{*}q_{k})^{\prime}\|_{2}$ is lower than another threshold. Since it is possible that several of these gradient descents land in the same point $x$, we subsequently check whether the resulting set contains groups of points which are too close to each other. If this is the case, we discard all but one of the points in each such group. We finally remove any point in which $\left|A^{*}q_{k}\right|$ is not larger than $1-\epsilon_{2}$, for a small $\epsilon_{2}>0$.
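The four steps of this heuristic (grid search, local refinement, merging, final thresholding) can be sketched as follows in one dimension. This is only an illustrative sketch and not the MATLAB implementation used for the experiments: the function name `find_peaks_1d`, all parameter values and the toy certificate `eta` are assumptions made for the example, and the refinement is written as a plain fixed-step gradient ascent on $|A^{*}q_{k}|$.

```python
import numpy as np

def find_peaks_1d(eta, d_eta, n=200, eps1=0.1, eps2=1e-3,
                  step=1e-3, grad_tol=1e-8, min_sep=1e-3, max_iter=10_000):
    """Heuristic estimate of X_k on [0, 1] (hypothetical helper).

    eta   : callable x -> (A^* q_k)(x), accepting scalars and NumPy arrays
    d_eta : callable x -> derivative of eta at x
    """
    # Step 1: discrete peaks of |eta| on the uniform grid {0, 1/n, ..., 1}.
    grid = np.linspace(0.0, 1.0, n + 1)
    vals = np.abs(eta(grid))
    left = np.r_[-np.inf, vals[:-1]]
    right = np.r_[vals[1:], -np.inf]
    cand = grid[(vals >= left) & (vals >= right) & (vals > 1.0 - eps1)]

    # Step 2: refine each candidate by gradient ascent on |eta|.
    refined = []
    for x in cand:
        for _ in range(max_iter):
            g = np.sign(eta(x)) * d_eta(x)      # derivative of |eta| at x
            if abs(g) < grad_tol:
                break
            x = min(max(x + step * g, 0.0), 1.0)  # stay inside [0, 1]
        refined.append(x)

    # Step 3: keep a single point from each group of nearly coincident points.
    kept = []
    for x in sorted(refined):
        if not kept or x - kept[-1] > min_sep:
            kept.append(x)

    # Step 4: discard points where |eta| does not exceed 1 - eps2.
    return np.array([x for x in kept if abs(eta(x)) > 1.0 - eps2])

# Toy certificate (an assumption for illustration): two narrow bumps.
s = 0.05
def eta(x):
    return (1.05 * np.exp(-(x - 0.3) ** 2 / (2 * s ** 2))
            - 1.02 * np.exp(-(x - 0.7123) ** 2 / (2 * s ** 2)))

def d_eta(x):
    return (-1.05 * (x - 0.3) / s ** 2 * np.exp(-(x - 0.3) ** 2 / (2 * s ** 2))
            + 1.02 * (x - 0.7123) / s ** 2 * np.exp(-(x - 0.7123) ** 2 / (2 * s ** 2)))

peaks = find_peaks_1d(eta, d_eta)
print(peaks)
```

The fixed step size is only safe when it is small relative to the curvature of the certificate at its peaks; a line search or a Newton step would be a natural replacement in a more careful implementation.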

Solving the Discrete Problems

We have chosen to solve the problems $(\mathcal{D}(\Omega_{k}))$ and $(\mathcal{P}(X_{k}))$ using an accelerated proximal gradient descent [20].
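As an illustration of what such a solver looks like for the discrete primal problem with a quadratic data fidelity, here is a minimal sketch of an accelerated proximal gradient iteration in the FISTA form. It assumes a real-valued matrix `M` whose columns are the measurements of Dirac masses at the points of $X_{k}$, so that the problem reads $\min_{a}\|a\|_{1}+\frac{L}{2}\|Ma-y\|_{2}^{2}$. The names `fista_l1` and `soft_threshold` are hypothetical, and the variant described in [20] may differ in its details.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_l1(M, y, L=1.0, n_iter=500):
    """Minimize ||a||_1 + (L/2) * ||M a - y||_2^2 (real-valued sketch)."""
    lip = L * np.linalg.norm(M, 2) ** 2   # Lipschitz constant of the smooth gradient
    a = np.zeros(M.shape[1])
    z, t = a.copy(), 1.0
    for _ in range(n_iter):
        grad = L * M.T @ (M @ z - y)                      # gradient of the smooth part at z
        a_next = soft_threshold(z - grad / lip, 1.0 / lip)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = a_next + ((t - 1.0) / t_next) * (a_next - a)  # extrapolation step
        a, t = a_next, t_next
    return a

# With M equal to the identity, the minimizer is the soft-thresholding of y at level 1/L.
y_demo = np.array([3.0, -0.5, 1.5])
a_demo = fista_l1(np.eye(3), y_demo, L=2.0)
print(a_demo)
```

For Fourier measurements the matrix is complex-valued, in which case the transpose would have to be replaced by the conjugate transpose and the soft-thresholding applied to the modulus.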

6.1 Example 1: Super-resolution from Fourier measurements in 1D.

We start by testing our algorithm on a popular instance of problem ( $\mathcal{P}(\Omega)$ ): super-resolution of a measure $\mu\in\mathcal{M}(0,1)$ from finitely many of its Fourier moments

[TABLE]

We use a quadratic data fidelity term $f(z)=\frac{L}{2}\|z-y\|_{2}^{2}$ . This example is well studied by the signal processing community [26, 6, 13, 21].

We chose $m$ to be equal to $30$, and a vector $y$ generated as $A\mu_{0}$, where $\mu_{0}$ is chosen at random as a $5$-sparse atomic measure with amplitudes close to $1$ or $-1$. The positions of the Dirac masses were chosen as a small random perturbation from a uniform grid. The initial grid $\Omega_{0}$ was chosen as a uniform grid with $8$ points, i.e. $[0,\frac{1}{8},\dots,\frac{7}{8}]$. We ran $100$ experiments, each with $20$ iterations of the exchange algorithm. The evolution of $\mu_{k}$ and $q_{k}$ over the first iterations of a typical run is displayed in Figure 1. We see that already after $8$ iterations, $A^{*}q_{k}$ appears to be very close to $A^{*}q^{\star}$. Before this iteration, the algorithm 'chooses' to add points relatively uniformly to the grid, but after that, new points are only added close to $\xi$. This is further emphasized by Figure 2, in which $X_{k}$ is plotted for each iteration, along with the size of $\Omega_{k}$.

To track the success of the algorithm a bit more systematically, we chose to track the evolution of $\operatorname{dist}(\xi,X_{k})$, $\operatorname{dist}(\Omega_{k},X_{k})$ and $\operatorname{dist}(\Omega_{k},\xi)$. The median over the 100 experiments, along with confidence intervals covering all experiments but the top and bottom $5\%$, are plotted in Figure 3. We see that all of the quality measures seem to converge linearly to $0$.
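The summary statistics used here (median curve, a band excluding the top and bottom $5\%$ of the runs, and a rough check of linear convergence) can be computed along the following lines. The sketch runs on synthetic error curves generated purely for illustration, not on the data of the experiments, and the helper names `summarize_runs` and `linear_rate` are hypothetical.

```python
import numpy as np

def summarize_runs(errors):
    """errors: array of shape (n_runs, n_iters) with positive error values.

    Returns the per-iteration median and the 5th and 95th percentiles.
    """
    errors = np.asarray(errors, dtype=float)
    median = np.median(errors, axis=0)
    lo, hi = np.percentile(errors, [5, 95], axis=0)
    return median, lo, hi

def linear_rate(curve):
    """Contraction factor c estimated by a least-squares fit of log(curve) ~ k * log(c)."""
    k = np.arange(len(curve))
    slope, _ = np.polyfit(k, np.log(curve), 1)
    return np.exp(slope)

# Synthetic curves (illustration only): 100 runs decaying like 0.5**k.
rng = np.random.default_rng(0)
scales = rng.uniform(0.5, 2.0, size=(100, 1))
errors = scales * 0.5 ** np.arange(20)

median, lo, hi = summarize_runs(errors)
rate = linear_rate(median)
print(rate)
```

On a semi-logarithmic plot, linear convergence shows up as a straight line whose slope is $\log c$, which is what the fit above estimates.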

Finally, we performed the same analysis for the optimum gap $\min(\mathcal{P}(\Omega_{k}))-\min(\mathcal{P}(\Omega))$, the error $\|q_{k}-q^{\star}\|_{2}$ and the sizes of the grids $\Omega_{k}$. (In each case, $\min(\mathcal{P}(\Omega))$ was taken to be the lowest value of $\min(\mathcal{P}(\Omega_{k}))$ over all iterations $k$, and $q^{\star}$ the corresponding dual solution.) We see that the optimum gap seems to converge exponentially to $0$ right from the first iteration, whereas the error $\|q_{k}-q^{\star}\|_{2}$ initially does not. The 'two-phase' effect is also easy to spot: after about $5$ to $6$ iterations, the algorithm switches from adding many points to adding only a few points close to $\xi$. Interestingly, the plateau of the $q$-errors seems to coincide with the 'phase transition'.

6.2 Example 2: Super-resolution from Gaussian measurements in 2D

Next, we perform a study in a two-dimensional setting. We consider $\Omega=[-1,1]^{2}$ and measurement functions of the form

[TABLE]

where the points $x_{i}$ live on a Euclidean grid of size $64\times 64$ , restricted to the domain $[-0.5,0.5]^{2}$ . We then add white Gaussian noise to the measurements, leading to pictures of the type shown in Fig. 5. Here, the true underlying measure contains $11$ Dirac masses with random positive amplitudes and random locations on $[-0.4,0.4]^{2}$ .

6.2.1 Exchange algorithm

The evolution of the grids $\Omega_{k}$ and of the dual certificates $|A^{*}q_{k}|$ is shown in Figure 6. As can be seen, points are initially added anywhere in the domain, but after a few iterations, they all cluster around the true locations, as expected from the theory. To further stress this phenomenon and illustrate our theorems and lemmata, we display many quantities of interest appearing in our main results in Fig. 7: the distance from $X_{k}$ to $\xi$ (where $\xi$ is estimated as $X_{40}$) in Fig. 7c, the distance from $\Omega_{k}$ to $\xi$ in Fig. 7b, the evolution of $J(\widehat{\mu}_{k})-J(\widehat{\mu}_{40})$ in Fig. 7a, and $\|A^{*}q_{k}\|_{\infty}-1$ in Fig. 7e. Finally, the number of maxima of $|A^{*}q_{k}|$ is shown in Fig. 7f. As can be seen, the number of maxima quickly stabilizes, suggesting that we reached a $\tau_{0}$-regime. Then all the quantities (cost function, distance from $\xi$, violation of the constraints) seem to converge to $0$ linearly. This is no longer true after iteration 15, and we suspect that this is solely due to numerical inaccuracies when computing the solution of the discretized problems. Notice however that the accuracy of the Dirac locations drops below $10^{-3}$ after 14 iterations, and that this accuracy is more than enough for the particular super-resolution application. Notice that if we wished to reach this accuracy with a fixed grid, we would need a Euclidean discretization containing $10^{6}$ points, while we here needed only 152 ($|\Omega_{14}|=152$). In addition, the $\ell^{1}$ resolution is stable since it is accomplished on a grid $X_{14}$ containing only $11$ points.

6.2.2 Continuous method

In this experiment, we evaluate the behavior of the gradient descent (26) depending on the initialization $(\alpha^{(0)},X^{(0)})$ and on the number of iterations. We use the same setting as in the previous section. The left graph of Fig. 8 illustrates that the gradient descent typically converges linearly when initialized close enough to the true minimizer $(\alpha^{\star},\xi)$. This was predicted by Proposition 4.1. In this case (and actually all the others related to this experiment), it converges to machine precision in fewer than 1000 iterations. This is remarkable since the gradient descent is a simple algorithm that can be easily improved by using e.g. Nesterov acceleration (we proved that the function is locally convex) or other optimization schemes such as L-BFGS.

In order to evaluate the size of the basin of attraction around the global minimizer, we start from random points of the form $(\alpha^{(0)},X^{(0)})=(\alpha^{\star},\xi)+(\Delta_{\alpha},\Delta_{X})$, where $\Delta_{\alpha}$ and $\Delta_{X}$ are random perturbations with an amplitude set as $\|(\Delta_{\alpha},\Delta_{X})\|_{2}=\gamma\|(\alpha^{\star},\xi)\|_{2}$, with $\gamma$ in $[0,1]$. We then run $50$ gradient descents with different realizations of $(\alpha^{(0)},X^{(0)})$ and record the success rate (i.e. the fraction of runs in which the gradient descent converges to $(\alpha^{\star},\xi)$ with an accuracy of at least $10^{-6}$). We plot this success rate with respect to $\gamma$ in Fig. 8b. As can be seen, the success rate is always $1$ when the relative error $\gamma$ is less than $5\%$, showing that for this particular problem, a rather rough initialization suffices for the gradient descent to converge to the global minimizer.
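The protocol of this experiment can be sketched on a toy problem as follows. The sketch does not reproduce the setting of the paper: it uses a single Dirac mass in one dimension, noiseless Gaussian samples, and a plain least-squares objective standing in for $G$, all of which are assumptions made to keep the example short. Only the structure (perturbation of relative size $\gamma$, fixed-step gradient descent, success counted at accuracy $10^{-6}$) follows the description above.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)      # sample locations (toy choice)
sigma = 0.2                        # width of the Gaussian atoms (toy choice)
alpha_star, xi = 1.0, 0.5          # ground-truth amplitude and position

def atoms(x):
    return np.exp(-(t - x) ** 2 / (2 * sigma ** 2))

y = alpha_star * atoms(xi)         # noiseless measurements

def grad_G(alpha, x):
    """Gradient of the toy objective 0.5 * ||alpha * atoms(x) - y||^2."""
    g = atoms(x)
    r = alpha * g - y
    return g @ r, alpha * (r * g * (t - x)).sum() / sigma ** 2

def descend(alpha, x, step=1e-3, n_iter=1500):
    """Fixed-step gradient descent on (alpha, x)."""
    for _ in range(n_iter):
        da, dx = grad_G(alpha, x)
        alpha, x = alpha - step * da, x - step * dx
    return alpha, x

def success_rate(gamma, n_runs=20, tol=1e-6, seed=0):
    """Fraction of runs that end within tol of (alpha_star, xi)."""
    rng = np.random.default_rng(seed)
    target = np.array([alpha_star, xi])
    hits = 0
    for _ in range(n_runs):
        delta = rng.standard_normal(2)
        # Rescale so that ||delta|| = gamma * ||(alpha_star, xi)||.
        delta *= gamma * np.linalg.norm(target) / np.linalg.norm(delta)
        a, x = descend(*(target + delta))
        hits += np.linalg.norm(np.array([a, x]) - target) <= tol
    return hits / n_runs

rates = {gamma: success_rate(gamma) for gamma in (0.01, 0.3, 1.0)}
print(rates)
```

For small $\gamma$ every run should succeed, since the toy objective is strongly convex near the minimizer; for large $\gamma$ some starting points place the atom so far from the data that the gradient nearly vanishes and the descent stalls.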

6.2.3 Alternating method

The alternating method suggested in Algorithm 2 turns out to converge in a single iteration when applied to the setting described above. We therefore apply it to a more challenging scenario with 30 Dirac masses instead of 11 and more noise. The measurements $y$ are shown in Fig. 9. We compare three implementations: a pure exchange method, an alternating method as in Algorithm 2 without line 14, and an alternating method as in Algorithm 2 with line 14. The conclusions are as follows:

• All methods rapidly conclude that the underlying measure contains 30 Dirac masses (the pure exchange algorithm after 10 iterations, the alternating method with line 14 already after the first).

• The pure exchange algorithm quickly gets to a point close to the optimum. The positions then slowly converge to the true locations. It does, however, eventually find the basin of attraction of $G$ (in this example, it needed 10 iterations).

• Line 14 in the alternating method improves the convergence significantly. In fact, omitting it, we need 10 iterations to find the basin of attraction, whereas the version with the line finds it directly. Investigating this effect more closely is an interesting line of future research.

Bibliography

[1] Bernard G. Bodmann, Axel Flinth, and Gitta Kutyniok. Compressed sensing for analog signals. arXiv preprint arXiv:1803.04218, 2018.
[2] John M. Borwein and Adrian S. Lewis. Partially finite convex programming, part I: Quasi relative interiors and duality theory. Mathematical Programming, 57(1):15–48, May 1992.
[3] Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional gradient method for sparse inverse problems. SIAM Journal on Optimization, 27(2):616–639, 2017.
[4] Claire Boyer, Antonin Chambolle, Yohann De Castro, Vincent Duval, Frédéric De Gournay, and Pierre Weiss. On Representer Theorems and Convex Regularization. arXiv preprint arXiv:1806.09810, 2018.
[5] Kristian Bredies and Hanna Katriina Pikkarainen. Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations, 19(1):190–218, 2013.
[6] Emmanuel J. Candès and Carlos Fernandez-Granda. Towards a Mathematical Theory of Super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[7] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[8] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.
