Linear convergence of accelerated conditional gradient algorithms in spaces of measures

Konstantin Pieper; Daniel Walter

arXiv:1904.09218·math.OC·March 30, 2021

TL;DR

This paper introduces a class of generalized conditional gradient algorithms for optimization in spaces of Radon measures, demonstrating sub-linear convergence generally and linear convergence under certain structural assumptions.

Contribution

The paper develops a new class of algorithms for measure space optimization and proves their convergence rates, including local linear convergence under specific conditions.

Findings

01

Achieves a sub-linear $\mathcal{O}(1/k)$ convergence rate in general cases.

02

Under structural assumptions, attains local linear convergence with rate $\mathcal{O}(\zeta^{k})$, $0\leq\zeta<1$.

03

Provides analysis for finite-dimensional subproblem resolution within the algorithm.

Abstract

A class of generalized conditional gradient algorithms for the solution of optimization problems in spaces of Radon measures is presented. The method iteratively inserts additional Dirac-delta functions and optimizes the corresponding coefficients. Under general assumptions, a sub-linear $\mathcal{O}(1/k)$ rate in the objective functional is obtained, which is sharp in most cases. To improve efficiency, one can fully resolve the finite-dimensional subproblems occurring in each iteration of the method. We provide an analysis for the resulting procedure: under a structural assumption on the optimal solution, a linear $\mathcal{O}(\zeta^{k})$ convergence rate is obtained locally.

Equations

$$u=U_{\mathcal{A}}(\bm{u})=\sum_{n=1}^{N}\bm{u}_{n}\delta_{x_{n}},$$

$$j_{N}(\bm{x},\bm{u})=F\Big(\sum_{n=1}^{N}\bm{k}(x_{n},\bm{u}_{n})\Big)+G\Big(\sum_{n=1}^{N}\lVert\bm{u}_{n}\rVert_{H}\Big).$$

$$\text{Minimize}\quad j_{N}(\bm{x},\bm{u})\quad\text{for }\bm{x}\in\Omega^{N},\ \bm{u}\in H^{N},\ N\geq 0,$$

$$Ku=\sum_{n=1}^{N}\bm{k}(x_{n},\bm{u}_{n}),\qquad\lVert u\rVert_{\mathcal{M}(\Omega,H)}=\sum_{n=1}^{N}\lVert\bm{u}_{n}\rVert_{H}.$$

$$\text{Minimize}\quad j(u)=F(Ku)+G(\lVert u\rVert_{\mathcal{M}(\Omega,H)})\quad\text{for }u\in\mathcal{M}(\Omega,H).$$

$$\text{Minimize}\quad\frac{1}{2}\lVert Ku-y_{d}\rVert_{Y}^{2}+\beta\lVert u\rVert_{\mathcal{M}(\Omega,H)}\quad\text{for }u\in\mathcal{M}(\Omega,H),$$

$$u^{k+1}=u^{k}+s^{k}(\widehat{v}^{k}-u^{k}),\qquad\widehat{v}^{k}=\widehat{\bm{v}}^{k}\delta_{\widehat{x}^{k}},\qquad s^{k}\in[0,1],$$

$$\text{Minimize}\quad j(U_{\mathcal{A}_{k+1/2}}(\bm{u}))\quad\text{for }\bm{u}\in H^{N_{k}+1}.$$

$$\text{Minimize}\quad F(Ku)+G(\lVert u\rVert_{L^{1}(\Omega,H)})+\frac{\varepsilon}{2}\lVert u\rVert_{L^{2}(\Omega,H)}^{2}\quad\text{for }u\in L^{2}(\Omega,H)$$

$$\lVert u\rVert_{\mathcal{M}}=\lvert u\rvert(\Omega)=\int_{\Omega}\mathrm{d}\lvert u\rvert.$$

$$u^{\prime}\in L^{\infty}(\Omega,\lvert u\rvert;H)\quad\text{with }\lVert u^{\prime}(x)\rVert_{H}=1\text{ for }\lvert u\rvert\text{-almost all }x\in\Omega,$$

$$u(O)=\int_{O}\mathrm{d}u=\int_{O}u^{\prime}\,\mathrm{d}\lvert u\rvert\quad\text{for all }O\in\mathcal{B}(\Omega);$$

$$\lVert\varphi\rVert_{\mathcal{C}}=\max_{x\in\Omega}\lVert\varphi(x)\rVert_{H}$$

$$\langle\varphi,u\rangle=\int_{\Omega}(\varphi(x),u^{\prime}(x))_{H}\,\mathrm{d}\lvert u\rvert(x)$$

$$\langle\varphi,u^{k}\rangle\to\langle\varphi,u\rangle\quad\text{for all }\varphi\in\mathcal{C}(\Omega,H)$$

$$Ku=\int_{\Omega}\bm{k}(x,u^{\prime}(x))\,\mathrm{d}\lvert u\rvert(x),$$

$$K^{\star}y=\varphi,\quad\text{where }(\varphi(x),\bm{u})_{H}=(\bm{k}(x,\bm{u}),y)_{Y}\text{ for all }x\in\Omega,\ \bm{u}\in H.$$

$$\lVert K^{\star}\rVert_{\mathcal{L}(Y,\mathcal{C}(\Omega,H))}=\sup_{x\in\Omega,\ \lVert\bm{u}\rVert_{H}=1}\lVert\bm{k}(x,\bm{u})\rVert_{Y}<\infty.$$

$$\min_{u\in\mathcal{M}(\Omega,H)}j(u):=\left[F(Ku)+G(\lVert u\rVert_{\mathcal{M}})\right].$$

$$f(u):=F(Ku).$$

$$f^{\prime}(u)(\delta u)=(\nabla F(Ku),K\delta u)_{Y}=\langle K^{\star}\nabla F(Ku),\delta u\rangle.$$

$$\nabla f\colon\operatorname{dom}j\to\mathcal{C}(\Omega,H),\qquad\nabla f(u)=K^{\star}\nabla F(Ku),$$

$$\langle\bar{p},\bar{u}\rangle=\lVert\bar{p}\rVert_{\mathcal{C}}\lVert\bar{u}\rVert_{\mathcal{M}},\qquad\lVert\bar{p}\rVert_{\mathcal{C}}\in\partial G(\lVert\bar{u}\rVert_{\mathcal{M}})$$

$$\bar{y}:=K\bar{u}\in Y,\qquad\bar{p}:=-\nabla f(\bar{u})=-K^{\star}\nabla F(\bar{y})\in\mathcal{C}(\Omega,H),\qquad\bar{\lambda}:=\lVert\bar{p}\rVert_{\mathcal{C}}.$$

$$\operatorname{supp}\bar{u}\subset\left\{\,x\in\Omega\;\big|\;\lVert\bar{p}(x)\rVert_{H}=\bar{\lambda}\,\right\},\quad\text{and}\quad\bar{u}^{\prime}(x)=\frac{\bar{p}(x)}{\bar{\lambda}}\quad\lvert\bar{u}\rvert\text{-a.a. }x\in\Omega.$$

$$\bar{x}_{n}\in\left\{\,x\in\Omega\;\big|\;\lVert\bar{p}(x)\rVert_{H}=\bar{\lambda}\,\right\},\quad\frac{\bar{\bm{u}}_{n}}{\lVert\bar{\bm{u}}_{n}\rVert_{H}}=\frac{\bar{p}(\bar{x}_{n})}{\bar{\lambda}}\quad\text{for }n\text{ with }\bar{\bm{u}}_{n}\neq 0.$$

$$\bar{P}\in\mathcal{C}(\Omega)\colon\quad\bar{P}(x):=\lVert\bar{p}(x)\rVert_{H},$$

$$\left\{\,x\in\Omega\;\big|\;\bar{P}(x)=\bar{\lambda}\,\right\}=\{\bar{x}_{n}\}_{n=1}^{N}.$$

$$\left\{\,\bm{k}(\bar{x}_{n},\bar{p}(\bar{x}_{n}))\;\big|\;n=1,\dots,N\,\right\}\subset Y,$$

$$\bar{u}=\sum_{n=1}^{N}\bar{\bm{u}}_{n}\delta_{\bar{x}_{n}},\qquad\bar{\bm{u}}_{n}=\bar{\mu}_{n}\frac{\bar{p}(\bar{x}_{n})}{\bar{\lambda}},\quad\text{with }\bar{\mu}_{n}=\lVert\bar{\bm{u}}_{n}\rVert_{H}\geq 0.$$

Full text

Linear convergence of accelerated conditional gradient algorithms in spaces of measures

Konstantin Pieper

Computer Science and Mathematics Division, Oak Ridge National Laboratory, One Bethel Valley Road, P.O. Box 2008, MS-6211, Oak Ridge, TN 37831 ([email protected])

and

Daniel Walter

Johann Radon Institute for Computational and Applied Mathematics, ÖAW, Altenbergerstraße 69, 4040 Linz, Austria ([email protected])

Abstract.

A class of generalized conditional gradient algorithms for the solution of optimization problems in spaces of Radon measures is presented. The method iteratively inserts additional Dirac-delta functions and optimizes the corresponding coefficients. Under general assumptions, a sub-linear $\mathcal{O}(1/k)$ rate in the objective functional is obtained, which is sharp in most cases. To improve efficiency, one can fully resolve the finite-dimensional subproblems occurring in each iteration of the method. We provide an analysis for the resulting procedure: under a structural assumption on the optimal solution, a linear $\mathcal{O}(\zeta^{k})$ convergence rate is obtained locally.

Key words and phrases:

vector-valued finite Radon measures, generalized conditional gradient, sparsity, nonsmooth optimization

1991 Mathematics Subject Classification:

46E27, 65J22, 65K05, 90C25, 49M05

KP acknowledges funding by the US Air Force Office of Scientific Research grant FA9550-15-1-0001 and the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). DW acknowledges support by the DFG through the International Research Training Group IGDK 1754 “Optimization and Numerical Analysis for Partial Differential Equations with Nonsmooth Structures”. Furthermore, support from the TopMath Graduate Center of TUM Graduate School at Technische Universität München, Germany and from the TopMath Program at the Elite Network of Bavaria is gratefully acknowledged.

1. Introduction

In this paper we consider generalized conditional gradient methods for sparse optimization problems, where the optimization variable lies in a space of measures. These problems arise in different contexts, and they are intrinsically related to certain optimization problems in terms of the spatial location parameters and associated coefficient variables: For the purposes of this paper, we want to find a “sparse” measure, which consists of a sum of Dirac delta functions,

$$u=U_{\mathcal{A}}(\bm{u})=\sum_{n=1}^{N}\bm{u}_{n}\delta_{x_{n}},\tag{1.1}$$

with a finite point set $\mathcal{A}=\{\,x_{n}\;|\;n=1,\ldots,N\,\}\subset\Omega$ from a continuous candidate set $\Omega$ (a possibly uncountably infinite compact subset of $\mathbb{R}^{d}$, $d\geq 1$) and corresponding coefficients $\bm{u}_{n}$ in a Hilbert space $H$ (for instance, $\mathbb{R}$, $\mathbb{C}^{M}$, $M\geq 1$, etc.), and $N\geq 0$ the cardinality of the support set. It should be emphasized that neither the number of points nor the coefficients are subject to any further restrictions. Usually, the measure $u$ has a physical interpretation as a number of point-wise sources or sensors in a physics-based model. In many applications, one is interested in choosing $N$, $\bm{x}=(x_{1},x_{2},\ldots,x_{N})$, and $\bm{u}=(\bm{u}_{1},\bm{u}_{2},\ldots,\bm{u}_{N})$ to minimize a functional of the form:

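$$j_{N}(\bm{x},\bm{u})=F\Big(\sum_{n=1}^{N}\bm{k}(x_{n},\bm{u}_{n})\Big)+G\Big(\sum_{n=1}^{N}\lVert\bm{u}_{n}\rVert_{H}\Big).$$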

Here, $F$ is a suitable design functional or quality criterion for the variable $y=\sum_{n}\bm{k}(x_{n},\bm{u}_{n})$ (which we will also refer to as observation variable), which is given in terms of the kernel function $\bm{k}\colon\Omega\times H\to Y$ , and evaluates the response of a model to the optimization variables $\bm{x}$ and $\bm{u}$ . The second term, which is expressed in terms of the sum of the norms of the coefficients (the $\ell^{1}(H)$ norm of $\bm{u}$ ) models either the cost of the coefficient variable, or is added as a regularization term to ensure that the coefficients are sufficiently small.
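To make the two terms of this functional concrete, the following minimal sketch evaluates it for a scalar toy model. The Gaussian kernel, the sensor layout, the choices $F(y)=\frac{1}{2}\lVert y-y_{d}\rVert^{2}$ and $G(m)=\beta m$, and all function names are assumptions made purely for illustration; they are not taken from the paper.

```python
import numpy as np

def kernel(x, u, sensors, width=0.1):
    """Hypothetical kernel k(x, u): a Gaussian bump centred at x, sampled at the
    sensor positions and scaled by the scalar coefficient u (so it is linear in u)."""
    return u * np.exp(-((sensors - x) ** 2) / (2.0 * width ** 2))

def j_N(xs, us, y_d, sensors, beta=0.1):
    """Evaluate F(sum_n k(x_n, u_n)) + G(sum_n |u_n|) for the illustrative
    choices F(y) = 0.5 * ||y - y_d||^2 and G(m) = beta * m."""
    y = np.zeros_like(sensors)
    for x, u in zip(xs, us):
        y = y + kernel(x, u, sensors)            # observation variable y
    misfit = 0.5 * np.sum((y - y_d) ** 2)        # F(y)
    cost = beta * sum(abs(u) for u in us)        # G applied to the l1 norm of u
    return misfit + cost

sensors = np.linspace(0.0, 1.0, 21)
y_d = kernel(0.3, 1.0, sensors) + kernel(0.7, -0.5, sensors)

value_exact = j_N([0.3, 0.7], [1.0, -0.5], y_d, sensors)     # misfit vanishes, cost remains
value_empty = j_N([], [], y_d, sensors)                      # N = 0, the zero measure
value_shifted = j_N([0.35, 0.7], [1.0, -0.5], y_d, sensors)  # moving a point changes only the misfit
```

In this toy setting the response depends linearly on each coefficient but nonlinearly on each location, since `x` enters only through the exponential.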

Often, the functionals $F$ and $G$ are convex, whereas $\bm{k}$ is linear only in the coefficients $\bm{u}$ and not in the location parameters $\bm{x}$. Thus, the corresponding optimization problem is not convex:

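$$\text{Minimize}\quad j_{N}(\bm{x},\bm{u})\quad\text{for }\bm{x}\in\Omega^{N},\ \bm{u}\in H^{N},\ N\geq 0.\tag{1.2}$$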

Moreover, it has a combinatorial aspect, since $N$ is not fixed. However, by embedding this problem into a more general formulation, a convex formulation can be obtained. Concretely, the sparse measure (1.1) can be considered as an element of the space of regular vector-measures $\mathcal{M}(\Omega,H)$ . Requiring $\bm{k}$ to be continuous in the coefficients, we can introduce the (integral) operator $K$ and the total variation norm as

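$$Ku=\sum_{n=1}^{N}\bm{k}(x_{n},\bm{u}_{n}),\qquad\lVert u\rVert_{\mathcal{M}(\Omega,H)}=\sum_{n=1}^{N}\lVert\bm{u}_{n}\rVert_{H}.$$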

We refer to section 2 for the rigorous definitions in the case of a general measure from the space of vector measures. Now, we can formulate the following generalized convex optimization problem:

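$$\text{Minimize}\quad j(u)=F(Ku)+G(\lVert u\rVert_{\mathcal{M}(\Omega,H)})\quad\text{for }u\in\mathcal{M}(\Omega,H).\tag{$\mathcal{P}$}$$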

Note that the formulation ($\mathcal{P}$) is more general than (1.2), since not all vector measures are of the form (1.1) (in particular, the Lebesgue space $L^{1}(\Omega,H)$ is contained in $\mathcal{M}(\Omega,H)$). However, in many cases, the solutions of ($\mathcal{P}$) have the desired discrete sparsity structure. In particular, if $Y$ is a finite-dimensional space, sparse solutions with $N\leq\dim Y$ can always be found; see, e.g., [8] or [58, Proposition 6.32]. This then renders both problem formulations essentially equivalent.

Our main motivation for studying these problems is given by applications in inverse source location [9, 52], optimal control [15, 39, 40, 30], or compressed sensing [10, 24, 3]: Here, ( $\mathcal{P}$ ) is of the simpler form:

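$$\text{Minimize}\quad\frac{1}{2}\lVert Ku-y_{d}\rVert_{Y}^{2}+\beta\lVert u\rVert_{\mathcal{M}(\Omega,H)}\quad\text{for }u\in\mathcal{M}(\Omega,H),\tag{$\mathcal{P}_{\textrm{source}}$}$$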

where $u$ encodes a collection of vector valued signals originating from a number of source locations $x\in\Omega$ , and $K$ models the signal that will be received by a measurement setup. The data vector $y_{d}$ contains (potentially noisy) observations obtained in practice, and the first term measures the misfit of the data to the response of the model. Often, such models involve trigonometric polynomials or other analytically given functions [24, 3, 10]. More complicated models involve partial differential equations [13, 52, 9]. Here, $Ku$ corresponds to (possibly pointwise) observations of the PDE solution corresponding to a source term $u$ .

A second motivation arises in the theory of optimal design [27, 53, 56], going back to the concept of approximate designs by Kiefer and Wolfowitz [38]. Here, $x$ corresponds to a spatial sensor location, $1/\bm{u}\in\mathbb{R}_{+}$ to the error variance of the corresponding sensor, and $Ku$ to the Fisher information matrix of a linear (or linearized) Gaussian model associated to a measurement setup $u$ . In this case, there are various different “information criteria” $F$ to evaluate the quality of the overall measurement setup $u$ , which are usually convex, smooth, but extended real valued functionals (allowing for the value $+\infty$ ). $G$ is often chosen to be a convex indicator function to enforce $\lVert u\rVert_{\mathcal{M}(\Omega)}\leq 1$ , but a cost term as in ( $\mathcal{P}_{\textrm{source}}$ ) can also be considered; see, e.g., [50].

With the intent of providing a unified analysis that covers all of the mentioned problem instances, we study the general formulation ( $\mathcal{P}$ ), where we impose additional assumptions: We require certain regularity and coercivity properties of the convex functions $F$ and $G$ , which covers the examples mentioned above (see Assumptions 3.1 and 5.1); and we require second order differentiability of the kernel $\bm{k}$ with respect to $x$ around the optimal locations (see Section 2), which can be verified in many of the mentioned cases.

Accelerated GCG methods

The objective of this paper is to analyze certain sequential point insertion and coefficient optimization methods as efficient solution algorithms for sparse optimization problems of the form ($\mathcal{P}$). We refer to [6, 9] for a description and analysis of the method applied to special instances of the general problem ($\mathcal{P}$). Starting from a sparse initial measure $u^{0}$ of the form (1.1), this type of algorithm generates a sequence of sparse iterates $u^{k}$, $k=0,1,2,\ldots$, by the iterative procedure

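$$u^{k+1}=u^{k}+s^{k}(\widehat{v}^{k}-u^{k}),\qquad\widehat{v}^{k}=\widehat{\bm{v}}^{k}\delta_{\widehat{x}^{k}},\qquad s^{k}\in[0,1],\tag{1.4}$$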

where $\widehat{x}^{k}$ maximizes a certain continuous function over the set $\Omega$ , which is computed from the previous iterate $u^{k}$ ; see Algorithm 2 below. The new source location $\widehat{x}^{k}$ and the coefficient $\widehat{\bm{v}}$ are chosen such that $\widehat{v}^{k}$ corresponds to a descent direction in a generalized conditional gradient method (GCG) – also known as Frank-Wolfe algorithm [29] – applied to an equivalent reformulation of ( $\mathcal{P}$ ). We also point to different variations of the Fedorov-Wynn algorithm [63, 49, 61, 60, 26, 62], developed in the context of approximate design theory, which can be interpreted in this framework.

While the practical implementation of the GCG algorithm is fairly simple, it suffers from slow asymptotic convergence. Several works [50, 25, 9, 6] derive a sublinear $\mathcal{O}(1/k)$ convergence rate for the objective functional values of the iterates under mild assumptions on the problem and several choices of the step size $s^{k}$. Numerical experiments (e.g., [50]) confirm that this slow convergence is also observed in practice. Therefore, it is impractical to solve the problem to high precision, which motivates the introduction of additional acceleration steps. Moreover, the absence of point removal steps leads to undesirable clustering effects: The support size of the iterate grows monotonically with $k$ and, in later iterations, new support points are inserted very close to existing ones. As a remedy, one is also interested in incorporating additional sparsification steps which can iteratively remove support points without increasing the objective functional values. In the present work, we consider additional optimization steps based on the sparse representation of the iterates in terms of their support points $\bm{x}$ and coefficients $\bm{u}$ according to (1.1). Defining the updated support (active set) corresponding to (1.4) as $\mathcal{A}_{k+1/2}=\mathcal{A}_{k}\cup\{\,\widehat{x}^{k}\,\}$, where $\mathcal{A}_{k}=\operatorname{supp}u^{k}=\{\,x_{i}\;|\;i=1,\ldots,N_{k}\,\}$, we improve the coefficients of the next iterate by approximately solving the coefficient optimization problem

$$\text{Minimize}\quad j(U_{\mathcal{A}_{k+1/2}}(\bm{u}))\quad\text{for }\bm{u}\in H^{N_{k}+1}.\tag{1.5}$$

Note that this is a convex minimization problem on the Hilbert space $H^{N_{k}+1}$ due to the linearity of the kernel $\bm{k}$ in the argument $\bm{u}$. In fact, (1.5) has the same structure as ($\mathcal{P}$); it is simply its restriction to the space $\mathcal{M}(\mathcal{A}_{k+1/2},H)$. Since it is also a sparse optimization problem, some coefficients of the associated optimal solution may be zero. In the next iteration, we can thus exclude the corresponding support points from the representation of the measure (1.1), which also serves as a sparsification step. Thus, we obtain the next iterate by setting $u^{k+1}=U_{\mathcal{A}_{k+1/2}}(\bm{u}^{k+1})$ for $\bm{u}^{k+1}$ the solution of (1.5), and $\mathcal{A}_{k+1}=\operatorname{supp}u^{k+1}$. In [9] the authors suggest improving the GCG algorithm by performing several steps of a proximal gradient method for (1.5), starting from the current coefficients as the initial guess. Acceleration of GCG by fully resolving the coefficient optimization problem (1.5) in each iteration of the method has been proposed in [6, 25, 61, 52].
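The interplay of point insertion, coefficient optimization, and removal of zero coefficients described above can be sketched for the scalar problem $\frac{1}{2}\lVert Ku-y_{d}\rVert^{2}+\beta\lVert u\rVert_{\mathcal{M}}$. This is a rough illustration under several assumptions that are not from the paper: $\Omega$ is replaced by a fixed grid that is searched exhaustively, the kernel is a Gaussian, the coefficient problem is only approximately resolved by many proximal gradient steps, and every name (`pdap_sketch`, `K_grid`, and so on) is hypothetical.

```python
import numpy as np

def objective(K_grid, active, coef, y_d, beta):
    """0.5 * ||K u - y_d||^2 + beta * ||u||_M for u = sum_n coef[n] * delta at grid[active[n]]."""
    residual = K_grid[:, active] @ coef - y_d
    return 0.5 * residual @ residual + beta * np.sum(np.abs(coef))

def pdap_sketch(K_grid, y_d, beta, outer=30, inner=2000, tol=1e-8):
    """Point insertion plus coefficient optimization on a gridded candidate set.
    K_grid[:, i] holds k(x_i, 1) for the i-th candidate point (scalar coefficients)."""
    active = np.array([], dtype=int)   # indices of the current support points
    coef = np.array([])                # their coefficients
    for _ in range(outer):
        # Dual variable p = -K^*(K u - y_d), evaluated at every candidate point.
        p = -K_grid.T @ (K_grid[:, active] @ coef - y_d)
        i_hat = int(np.argmax(np.abs(p)))
        if np.abs(p[i_hat]) <= beta * (1.0 + tol):
            break                      # no candidate point violates |p| <= beta
        if i_hat not in active:        # point insertion with zero coefficient
            active = np.append(active, i_hat)
            coef = np.append(coef, 0.0)
        # Coefficient optimization on the active set. Assumption: many proximal
        # gradient (soft-thresholding) steps stand in for a full resolution.
        A = K_grid[:, active]
        step = 1.0 / np.linalg.norm(A, 2) ** 2
        for _ in range(inner):
            z = coef - step * (A.T @ (A @ coef - y_d))
            coef = np.sign(z) * np.maximum(np.abs(z) - step * beta, 0.0)
        # Sparsification: drop support points whose coefficient became zero.
        keep = coef != 0.0
        active, coef = active[keep], coef[keep]
    return active, coef

grid = np.linspace(0.0, 1.0, 101)                      # stand-in for Omega
sensors = np.linspace(0.0, 1.0, 30)
K_grid = np.exp(-((sensors[:, None] - grid[None, :]) ** 2) / (2.0 * 0.1 ** 2))
y_d = 1.0 * K_grid[:, 30] - 0.5 * K_grid[:, 70]        # two synthetic sources
beta = 0.01

active, coef = pdap_sketch(K_grid, y_d, beta)
j_start = objective(K_grid, np.array([], dtype=int), np.array([]), y_d, beta)
j_final = objective(K_grid, active, coef, y_d, beta)
```

Because each inner loop starts from the previous coefficients padded with a zero and uses the step size `1 / ||A||^2`, the objective does not increase from one outer iteration to the next in this sketch, so `j_final` lies below `j_start`.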

Alternatively to coefficient optimization, point moving strategies have been suggested. Here, we additionally solve a finite-dimensional, generally non-convex optimization problem in $\bm{x}$ subject to constraints imposed by the set $\Omega$ . In [9] it is proposed to move the support points according to the gradient flow of the smooth part $\bm{x}\mapsto F(U_{\mathcal{A}}(\bm{u}^{k+1}))$ and in [6] it is advocated to employ general purpose optimization methods based on first order derivatives. Further, in [18] the authors propose to include steps which simultaneously optimize the positions and coefficients of the current iterate. Note that the nonconvex (and also nonsmooth, if both coefficients and positions are optimized) point moving problem to be solved here is more computationally intensive than (1.5). Moreover, for sparse minimization problems associated to PDEs, often the kernel $\bm{k}$ is not given analytically and needs to be further approximated [12, 39, 40, 52, 50]. To solve ( $\mathcal{P}$ ) in practice, the operator $K$ is replaced by an approximation employing finite elements. Note that the most commonly employed Lagrangian finite elements are continuous, but not continuously differentiable and thus the objective function is no longer $\mathcal{C}^{1}$ with respect to $\bm{x}$ . This prevents a straightforward algorithmic solution of the point moving problem by derivative based methods, whereas coefficient optimization can be implemented in a straightforward fashion and the new point $\widehat{x}^{k}$ can be found by a direct search over the grid nodes (cf. [52, 50]). For these reasons, we do not consider point moving in this paper.

Contribution

The main contribution of this paper is to analyze the procedure resulting from combining point insertion steps (1.4) with subsequent full resolution of the coefficient optimization problem (1.5), which is summarized in Algorithm 1. Note that the method can be interpreted as an active set method, where new points are added to the active set at the global maxima of a dual variable, and points are removed if their primal coefficients are set to zero (by resolving (1.5)). We therefore also refer to this method as the Primal-Dual-Active-Point strategy (PDAP).

Since the coefficient optimization steps are carried out in addition to the point insertion steps, the $\mathcal{O}(1/k)$ convergence rate for GCG is also valid for the accelerated methods. We recall this convergence result in Theorem 5.4 for the general problem formulation ($\mathcal{P}$). Concerning the improved convergence behavior of methods combining point insertion and coefficient optimization over GCG, as reported in [52, 50], we are not aware of any improved theoretical results. However, in this paper, we prove a linear convergence rate $\mathcal{O}(\zeta^{k})$ for $0\leq\zeta<1$; see Theorem 5.17. Note that, since the improved result is local in character, we still have to rely on the general $\mathcal{O}(1/k)$ convergence result mentioned above to ensure that the iterate $u^{k}$ is sufficiently close to an optimal solution for $k$ large enough. In order to obtain the improved linear convergence result, we impose a non-degeneracy condition on the optimal solution; see Assumptions 3.2 and 3.3. This enables us to derive further convergence results for the location parameters $\bm{x}^{k}$ and the coefficients $\bm{u}^{k}$. In particular, we show that the support points of the iterate asymptotically converge towards the support points of the optimal solution, again at a linear rate; see Theorem 5.19. This also gives theoretical evidence for the sparsifying effect of the coefficient optimization steps, since it shows that support points far away from the optimal locations will eventually be removed from the iterate measure. Moreover, we derive convergence estimates for the coefficients. Here, we need to account for the fact that multiple support points of $u^{k}$ can be close to each optimal location. Lumping together the corresponding coefficients, we again obtain a linear convergence rate; see Theorem 5.24. Together, this results in a linear convergence rate of the iterate measure $u^{k}$ in the dual space $\mathcal{C}^{0,1}(\Omega,H)^{*}$; see Theorem 5.25.

We note that the improved convergence rate proved here also requires additional regularity assumptions. In particular, we need second derivatives of the kernel function in $x$ , which may not be available if discrete approximations to $K$ are employed in practice. We point out that these assumptions are only of technical nature: The computation of the derivatives of the kernel function with respect to the position is not required in the algorithm. Consequently, the method can be readily adapted to discretizations of ( $\mathcal{P}$ ), and the linear rate proved here is also observed in practice [52, 50].

Related work

The design of efficient algorithms for ( $\mathcal{P}$ ) is a challenging task since the space of vector-valued Borel measures is in general non-reflexive. Moreover, it lacks useful properties such as strict convexity and smoothness which are desirable for the convergence analysis of many optimization methods. Consequently, a direct extension of most well-known optimization routines to the present setting is not possible.

Discretization-based methods

A first approach to the solution of ($\mathcal{P}$) for a continuous candidate set is to replace $\Omega$ by an approximating sequence of finite sets $\Omega_{h}\subset\Omega$ for a sequence of mesh parameters $h>0$. For example, $\Omega_{h}$ may be chosen as the nodal set of a triangulation $\mathcal{T}_{h}$ of $\Omega$. Since $\Omega_{h}$ consists of $N_{h}\geq 0$ points, every $u\in\mathcal{M}(\Omega_{h},H)$ is of the form $u=\sum_{x_{i}\in\Omega_{h}}\bm{u}_{i}\delta_{x_{i}}$. Substituting the space of regular Borel measures in ($\mathcal{P}$) with the discretized space $\mathcal{M}(\Omega_{h},H)$ yields a convex minimization problem for the coefficients $\bm{u}\in H^{N_{h}}$ similar to (1.5) with $\mathcal{A}$ replaced by $\Omega_{h}$. While the resulting problem remains non-smooth due to the appearance of the total variation norm, it can be solved by a large number of well-studied algorithms. For example, we point to semi-smooth Newton methods [57, 48], the fast iterative shrinkage-thresholding algorithm (FISTA) [4], and the alternating direction method of multipliers [7]. However, this philosophy of "discretize then optimize" harbors the danger of yielding mesh dependent solution methods. While a particular algorithm may be efficient for the solution of the discrete problem associated to a fixed discretization parameter $h$, its convergence behaviour can critically depend on the fineness of the discretization, which is usually the case for the aforementioned methods. For the methods analyzed in this paper, such problems only have to be solved on a very small candidate set.
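As an illustration of the discretize-then-optimize route, the sketch below runs a textbook FISTA iteration on the full grid for the scalar problem with $F(y)=\frac{1}{2}\lVert y-y_{d}\rVert^{2}$ and $G(m)=\beta m$. The Gaussian kernel, the grids, and all names are assumptions chosen for this example. It also compares the Lipschitz constant $\lVert A\rVert^{2}$ that sets the step size on a coarse and on a refined nested grid, which is one simple way mesh dependence can show up.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of t * |.|, applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_on_grid(A, y_d, beta, iters=500):
    """FISTA for min_u 0.5 * ||A u - y_d||^2 + beta * ||u||_1, where column i of A
    is the kernel at grid node x_i (scalar coefficients)."""
    lipschitz = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the smooth part
    u = np.zeros(A.shape[1])
    w, t = u.copy(), 1.0
    for _ in range(iters):
        u_next = soft_threshold(w - A.T @ (A @ w - y_d) / lipschitz, beta / lipschitz)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        w = u_next + ((t - 1.0) / t_next) * (u_next - u)   # momentum (extrapolation) step
        u, t = u_next, t_next
    return u

def lasso_objective(A, u, y_d, beta):
    r = A @ u - y_d
    return 0.5 * r @ r + beta * np.sum(np.abs(u))

def gaussian_matrix(sensors, nodes, width=0.1):
    """Hypothetical kernel matrix: entry (m, i) is a Gaussian centred at nodes[i], seen by sensors[m]."""
    return np.exp(-((sensors[:, None] - nodes[None, :]) ** 2) / (2.0 * width ** 2))

sensors = np.linspace(0.0, 1.0, 30)
A_coarse = gaussian_matrix(sensors, np.linspace(0.0, 1.0, 51))
A_fine = gaussian_matrix(sensors, np.linspace(0.0, 1.0, 101))
y_d = A_fine[:, 30] - 0.5 * A_fine[:, 70]      # two synthetic sources on the fine grid
beta = 0.01

u_fine = fista_on_grid(A_fine, y_d, beta)
j_zero = lasso_objective(A_fine, np.zeros(A_fine.shape[1]), y_d, beta)
j_fine = lasso_objective(A_fine, u_fine, y_d, beta)

# The admissible step size 1 / L shrinks as nodes are added to the grid.
L_coarse = np.linalg.norm(A_coarse, 2) ** 2
L_fine = np.linalg.norm(A_fine, 2) ** 2
```

Here every iteration works with all grid coefficients at once, in contrast to the small active sets that appear in the coefficient problems (1.5).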

Regularization based methods

A different approach to circumvent the lack of reflexivity of the space $\mathcal{M}(\Omega,H)$ can be based on path-following strategies. Here the original problem is replaced by a sequence of $L^{2}$ -regularized ones:

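$$\text{Minimize}\quad F(Ku)+G(\lVert u\rVert_{L^{1}(\Omega,H)})+\frac{\varepsilon}{2}\lVert u\rVert_{L^{2}(\Omega,H)}^{2}\quad\text{for }u\in L^{2}(\Omega,H)$$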

with the Hilbert space $L^{2}(\Omega,H)\subset\mathcal{M}(\Omega,H)$. Note that the appearance of the $L^{1}(\Omega,H)$ norm in the objective functional (as the restriction of the total variation norm to $L^{2}(\Omega,H)$) still promotes optimal solutions which are nonzero only on small subsets of $\Omega$. Furthermore, in the limiting case $\varepsilon\rightarrow 0$, the $L^{2}$-regularized solutions approximate solutions to ($\mathcal{P}$); see, e.g., [51]. For fixed $\varepsilon>0$, those problems are amenable to efficient function space based solution methods such as semi-smooth Newton (SSN) [57, 55, 33]. For linear-quadratic problems such as ($\mathcal{P}_{\textrm{source}}$), these methods can be further interpreted as active set methods [35] (specifically, the Primal-Dual-Active-Set method, PDAS). While these methods behave in a mesh independent way in practice and their performance scales linearly with the degrees of freedom of the underlying mesh, the convergence behavior deteriorates for small values of $\varepsilon$. In the practical realization it is therefore necessary to start at a large value of $\varepsilon$ and to alternate between decreasing the regularization parameter and a (possibly inexact) solution of the regularized problem initialized at the previous iterate. Thus, a complete analysis of path-following methods requires a quantitative convergence analysis of the method used for the solution of the regularized problem in dependence on $\varepsilon$, a quantification of the additional regularization error, and update strategies for the parameter; cf., e.g., [36]. We refer to [58] for further discussion and a numerical comparison of the path-following approach to the PDAP method analyzed here, which shows a substantial advantage of PDAP for the case that the optimal $N$ is small compared to the number of degrees of freedom of the mesh.

Existing convergence results for conditional gradient methods

Conditional gradient methods (see, e.g., [44]) have been originally proposed by Frank and Wolfe [29]. They constitute a simple iterative scheme for computing a minimizer of a smooth convex function over compact subsets of a Banach space. Since norm balls in $\mathcal{M}(\Omega,H)$ are weak* compact, the general problem formulation fits into this setting for the choice of the convex indicator function $G(m)=I_{m\leq M}$ . Feasibility of the iterates is ensured by taking the new iterate $u^{k+1}$ as a convex combination between the previous iterate $u^{k}$ and a trial point $\widehat{v}^{k}$ , which is obtained by minimizing a linearization of the objective functional around $u^{k}$ over the admissible set. A sublinear rate for the convergence of the objective functional values towards its minimum can be proven for various choices of the step size $s^{k}$ . For an overview we refer to [21, 22, 20]. The sublinear rate is tight even for strongly convex objective functionals [11]. An improved rate of convergence can only be derived in more restrictive settings: For problems on infinite dimensional spaces, a linear rate of convergence is provided in [44, 17] if the gradient of the objective functional is uniformly bounded away from zero on a strongly convex admissible set. The papers [20, 21] yield the same rate if the linearized objective functional fulfills a certain growth condition on the admissible set. We emphasize that, apart from trivial cases, none of the mentioned results is directly applicable to the problem at hand. Moreover, we point out that, on finite dimensional spaces, accelerated conditional gradient methods, such as Wolfe’s away-step conditional gradient [59], eventually yield a linear rate of convergence [1, 41]. In infinite dimensions, where the candidate set $\Omega$ is not finite, we are not aware of similar results. 
Last, we point out that if we replace $H$ with the cone $\mathbb{R}_{+}\subset\mathbb{R}$ and set $G(m)=I_{m\leq M}$, Algorithm 1 corresponds to the fully-corrective conditional gradient method [37]. For finite-dimensional observation space $Y$, this particular algorithm can be related to an exchange method [34] on the semi-infinite convex dual problem of ($\mathcal{P}$). We are also not aware of convergence results comparable to those provided in this work for this type of method.
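For orientation, the classical conditional gradient iteration discussed above can be sketched in a finite-dimensional setting, where the admissible set is the $\ell^{1}$ ball of radius $M$ (corresponding to $G(m)=I_{m\leq M}$ on a finite candidate set). The quadratic objective, the Gaussian test matrix, the exact line search for the step size, and the function name are assumptions for this illustration; other step size rules, such as $s^{k}=2/(k+2)$, are also common.

```python
import numpy as np

def conditional_gradient_l1(A, y_d, radius, iters=200, tol=1e-12):
    """Conditional gradient (Frank-Wolfe) for min 0.5 * ||A u - y_d||^2 subject to
    ||u||_1 <= radius, with exact line search along the segment [u, v]."""
    u = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ u - y_d)
        # Linearized problem: a minimizer over the l1 ball is a signed, scaled unit vector.
        i = int(np.argmax(np.abs(grad)))
        v = np.zeros_like(u)
        v[i] = -radius * np.sign(grad[i])
        direction = v - u
        gap = -grad @ direction                  # Frank-Wolfe gap, nonnegative on the ball
        curvature = np.sum((A @ direction) ** 2)
        if gap <= tol or curvature == 0.0:
            break
        s = min(1.0, gap / curvature)            # minimizer of the quadratic model on [0, 1]
        u = u + s * direction                    # convex combination of u and v
    return u

grid = np.linspace(0.0, 1.0, 101)
sensors = np.linspace(0.0, 1.0, 30)
A = np.exp(-((sensors[:, None] - grid[None, :]) ** 2) / (2.0 * 0.1 ** 2))
y_d = A[:, 30] - 0.5 * A[:, 70]

radius = 1.5
u = conditional_gradient_l1(A, y_d, radius)
f_zero = 0.5 * y_d @ y_d
f_final = 0.5 * np.sum((A @ u - y_d) ** 2)
l1_norm = np.sum(np.abs(u))
```

Each iterate is a convex combination of points in the ball, so feasibility is preserved without any projection, and each step adds at most one new nonzero entry.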

After this manuscript was finalized we were made aware of [28], where the authors prove linear convergence of a similar accelerated conditional gradient method for the particular case of $H=C=\mathbb{R}$ and $G(\lVert u\rVert_{\mathcal{M}})=\beta\lVert u\rVert_{\mathcal{M}}$ . We note that [28] and the present manuscript were derived independently of each other and differ in certain important aspects. In particular, in contrast to our work, the authors require $\mathcal{A}_{k}\subset\mathcal{A}_{k+1}$ , i.e. the dimension of the coefficient optimization problem (1.5) increases monotonically. Moreover, the active set is updated by adding all sufficiently large local maximizers of a certain dual certificate, while we only require the addition of one global maximum (as in the original GCG method).

Plan of the paper

The paper is organized as follows. In Section 2, we fix some basic notation and provide the functional analytic background used for the rest of the work. Section 3 introduces the optimization problem and some basic results on the existence and structure of optimal solutions are derived. We also discuss how different practically relevant problems fit into the general framework. In Section 4 we formulate the optimization algorithms and prove the subsequential convergence of the generated iterates as well as a sublinear worst-case convergence rate for the objective functional values. Under additional structural assumptions on the problem, an improved local linear rate of convergence is established in Section 5. Moreover, quantitative convergence results for the support points and the coefficients of the iterates are presented. Finally, in Section 6, we illustrate the theoretical findings by numerical experiments.

2. Notation

Let $\Omega\subset\mathbb{R}^{d}$ , $d\geq 1$ , be compact and denote by $H$ a separable Hilbert space with respect to the norm $\lVert\cdot\rVert_{H}$ induced by the inner product $(\cdot,\cdot)_{H}$ . In the following, $H$ is identified with its dual space using the Riesz representation theorem. A countably additive mapping $u\colon\mathcal{B}(\Omega)\to H$ is called a vector measure, where $\mathcal{B}(\Omega)$ denotes the Borel sets of $\Omega$ . Associated to $u$ we define its total variation measure $\lvert u\rvert\colon\mathcal{B}(\Omega)\to\mathbb{R}_{+}$ in the usual way. The space of vector measures with finite total variation $\lvert u\rvert(\Omega)$ is now denoted by $\mathcal{M}(\Omega,H)$ , which is a Banach space with respect to the norm

$$\lVert u\rVert_{\mathcal{M}}=\lvert u\rvert(\Omega).$$

For reference, see the discussion in [43, Chapter 12.3]. The support of $u$ is defined as the support of the corresponding total variation measure $\operatorname{\operatorname{supp}}u=\operatorname{\operatorname{supp}}\lvert u\rvert\subset\Omega$ . We point out that for a measure of the form (1.1) consisting of a finite sum of Dirac delta functions, we have $\lvert u\rvert=\sum_{n}\lVert\bm{u}_{n}\rVert_{H}\delta_{x_{n}}$ and $\lVert u\rVert_{\mathcal{M}}=\sum_{n}\lVert\bm{u}_{n}\rVert_{H}$ . Additionally, those measures are precisely the measures of finite support, which are characterized by their support $\operatorname{\operatorname{supp}}u=\{\,x_{n}\;|\;\lVert\bm{u}_{n}\rVert_{H}>0\,\}$ and their coefficients $\bm{u}_{n}=u(\{x_{n}\})$ for $x_{n}\in\operatorname{\operatorname{supp}}u$ (which we will also abbreviate by $u(x_{n})$ , by a slight abuse of notation). We denote the cardinality of the support by $\#\operatorname{\operatorname{supp}}u\in\mathbb{N}\cup\{\,0\,\}$ .

Moreover, any $u\in\mathcal{M}(\Omega,H)$ is absolutely continuous with respect to $\lvert u\rvert$ , and there exists a unique function

[TABLE]

such that $u$ can be decomposed as

[TABLE]

see, e.g., [42, Chapter 12.4]. The function $u^{\prime}$ is called the Radon-Nikodým derivative of $u$ with respect to $|u|$ ; see [19]. For abbreviation we write $\mathop{}\!\mathrm{d}u=u^{\prime}\mathop{}\!\mathrm{d}|u|$ in the following. For finitely supported measures of the form (1.1) it clearly holds $u^{\prime}(x_{n})=\bm{u}_{n}/\lVert\bm{u}_{n}\rVert_{H}$ for $x_{n}\in\operatorname{\operatorname{supp}}u$ .

By $\mathcal{C}(\Omega,H)$ we further denote the space of bounded and continuous functions on $\Omega$ which assume values in $H$ . It is a separable Banach space when endowed with the usual supremum norm

[TABLE]

for any $\varphi\in\mathcal{C}(\Omega,H)$ ; see, e.g., [2, Lemma 3.85]. By Singer’s representation theorem (see, e.g., [32]) its topological dual space is identified with $\mathcal{M}(\Omega,H)$ where the associated duality pairing is given by

[TABLE]

for arbitrary $\varphi\in\mathcal{C}(\Omega,H)$ and $u\in\mathcal{M}(\Omega,H)$ . A sequence $u^{k}\in\mathcal{M}(\Omega,H)$ , $k\geq 0$ , is called weak* convergent with limit $u\in\mathcal{M}(\Omega,H)$ if

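$$\langle u^{k},\varphi\rangle\to\langle u,\varphi\rangle\quad\text{for all }\varphi\in\mathcal{C}(\Omega,H)$$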

for $k\to\infty$ . We denote this by $u^{k}\rightharpoonup^{*}u$ .

Finally, let $Y$ be another Hilbert space and $\bm{k}\colon\Omega\times H\to Y$ be a weak-to-strong continuous function, which is linear in the second argument. Now, we define the operator $K\colon\mathcal{M}(\Omega,H)\to Y$ for each argument $u\in\mathcal{M}(\Omega,H)$ by

[TABLE]

which clearly extends the definition for finite measures given in (1.3), using the linearity of $\bm{k}$ in the second argument. Additionally, we define the pre-adjoint operator $K^{\star}\colon Y\to\mathcal{C}(\Omega,H)$ by

[TABLE]

It is easy to see that $(Ku,v)_{Y}=\langle u,K^{\star}v\rangle$ for all $u\in\mathcal{M}(\Omega,H)$ and $v\in Y$ , using the definitions. Moreover, $K^{\star}$ is a linear and bounded operator with norm

[TABLE]

Thus, $K$ is the Banach space adjoint of $K^{\star}$ and hence also linear and bounded with the same norm bound. Since $\bm{k}$ is weak-to-strong continuous, $K$ is sequentially weak*-to-strong continuous. Note that $K^{\star}$ is not the Banach space adjoint of $K$ , since $\mathcal{M}(\Omega,H)^{*}\neq\mathcal{C}(\Omega,H)$ . It can be understood as the adjoint in the sense of topological vector spaces, if $\mathcal{M}(\Omega,H)$ is endowed with the weak* topology, but we will not need this property in the following.
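For finitely supported measures, the objects introduced in this section reduce to finite sums and can be checked numerically. The following sketch assumes $d=1$ , $H=\mathbb{R}^{2}$ , $Y=\mathbb{R}^{6}$ , and a hypothetical kernel of the form $\bm{k}(x,\bm{u})=\Phi(x)\bm{u}$ with a matrix-valued function `phi`; with this choice the pre-adjoint is taken as $(K^{\star}v)(x)=\Phi(x)^{\top}v$ , which is an assumption consistent with the adjoint identity stated above. All function names are illustrative.

```python
import numpy as np

def phi(x, q=6, m=2):
    """Hypothetical kernel matrix Phi(x) of shape (q, m); k(x, h) = Phi(x) @ h is linear in h."""
    freqs = np.arange(1, q + 1)[:, None]   # shape (q, 1)
    shifts = np.arange(m)[None, :]         # shape (1, m)
    return np.cos(freqs * x + shifts)      # shape (q, m)

def apply_K(points, coeffs):
    """K u = sum_n k(x_n, u_n) for the discrete measure u = sum_n u_n * delta_{x_n}."""
    return sum(phi(x) @ c for x, c in zip(points, coeffs))

def apply_K_star(v, x):
    """(K^* v)(x) in H, chosen so that (k(x, h), v)_Y = ((K^* v)(x), h)_H for all h."""
    return phi(x).T @ v

def total_variation_norm(coeffs):
    """For a discrete measure the norm is the sum of the H-norms of the coefficients."""
    return sum(np.linalg.norm(c) for c in coeffs)

# A discrete measure with three Dirac deltas and a test vector v in Y.
points = [0.2, 0.7, 1.5]
coeffs = [np.array([1.0, -2.0]), np.array([0.5, 0.0]), np.array([0.0, 3.0])]
v = np.linspace(-1.0, 1.0, 6)

lhs = apply_K(points, coeffs) @ v                                   # (K u, v)_Y
rhs = sum(apply_K_star(v, x) @ c for x, c in zip(points, coeffs))   # <u, K^* v>
```

In this discrete setting the duality pairing is the sum of the inner products $((K^{\star}v)(x_{n}),\bm{u}_{n})_{H}$ , so `lhs` and `rhs` agree up to rounding.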

Finally, to prove the convergence result of this manuscript, we require higher smoothness assumptions on the kernel with respect to $x\in\Omega$ . We denote the partial derivatives of $\bm{k}$ with respect to $x$ by $\partial_{i}\bm{k}(x,\bm{u})$ , $i=1,\ldots,d$ , for any $x\in\operatorname{int}\Omega$ and $\bm{u}\in H$ (if they exist) and analogously the higher derivatives. By $\nabla\bm{k}(x,\bm{u})\in Y^{d}$ and $\nabla^{2}\bm{k}(x,\bm{u})\in Y^{d\times d}$ we denote the gradient and Hessian with respect to $x$ , respectively. We require smoothness of the kernel only on a neighborhood of the optimal support points, and the precise assumptions will be given in section 3.2. For smooth functions on any open subset $\Omega^{\prime}\subset\Omega$ we denote by $\mathcal{C}^{2}(\bar{\Omega}^{\prime})$ the space of twice continuously differentiable functions with derivatives that can be continuously extended up to the boundary of $\Omega^{\prime}$ , endowed with the usual supremum norm over all partial derivatives. The space $\mathcal{C}^{0,1}(\bar{\Omega}^{\prime})$ denotes the Lipschitz continuous functions endowed with the usual Lipschitz norm. Finally, for smooth functions taking values in the Hilbert space $H$ (resp. $Y$ ), by $\mathcal{C}^{0,1}(\bar{\Omega}^{\prime},H)$ and $\mathcal{C}^{2}(\bar{\Omega}^{\prime},H)$ we denote the vector valued variants of the above spaces, defined in the canonical way.

3. Sparse minimization problems

We now turn to sparse minimization problems. Our aim is to solve the nonsmooth convex problem

[TABLE]

Here, the loss functional $F\colon Y\to\mathbb{R}\cup\{\,+\infty\,\}$ is a convex (extended real valued) functional with open domain $\operatorname{dom}F=\{\,y\in Y\;|\;F(y)<+\infty\,\}$ on the Hilbert space $Y$ . The convex cost functional $G\colon\mathbb{R}\to\mathbb{R}\cup\{\,+\infty\,\}$ is assumed to be monotone on $\mathbb{R}_{+}$ . We note that the domain of the objective functional $j$ is given by $\operatorname{dom}j=\{\,u\in\mathcal{M}(\Omega,H)\;|\;\lVert u\rVert_{\mathcal{M}}\in\operatorname{dom}G,\;Ku\in\operatorname{dom}F\,\}$ . In order to ensure well-posedness of this problem, the following assumptions are made.

Assumption 3.1.

Let the following assumptions hold:

(i.)

The function $G\colon\mathbb{R}\rightarrow\mathbb{R}\cup\{+\infty\}$ is proper, convex, lower semi-continuous, and monotonically increasing on $\mathbb{R}_{+}$ with $G(m)\rightarrow+\infty$ for $m\rightarrow\infty$ . Without loss of generality we set $G(m)=+\infty$ for $m<0$ .

(ii.)

The domain of the functional $j$ is nonempty and $j$ is radially unbounded.

(iii.)

The function $F\colon Y\rightarrow\mathbb{R}\cup\{\,+\infty\,\}$ is convex and lower semi-continuous. Moreover, $\operatorname{\operatorname{dom}}F$ is open in $Y$ , and $F$ is strictly convex and continuously Fréchet differentiable on $\operatorname{\operatorname{dom}}F$ .

Note that (i.), (iii.) and the weak*-to-strong continuity of $K$ imply that $j$ is weak* lower semicontinuous on $\mathcal{M}(\Omega,H)$ . The convex subdifferential of $G$ will be denoted by $\partial G$ and the (Hilbert-space) Fréchet derivative of $F$ at $y\in\operatorname{\operatorname{dom}}F$ will be denoted by $\nabla F(y)$ . For later use, we also define the smooth part of the reduced cost functional as

$$f(u)\coloneqq F(Ku),\qquad u\in\mathcal{M}(\Omega,H).$$

From Assumption 3.1(iii.), the linearity of $K$ as well as the chain rule we conclude that $f$ is Fréchet differentiable at $u\in\operatorname{\operatorname{dom}}j$ . In order to identify the Fréchet derivative we compute the directional derivative of $f$ in direction $\delta u\in\mathcal{M}(\Omega,H)$ as

[TABLE]

Thus, the Fréchet derivative of $f$ at $u$ can be identified with the continuous function $\nabla f(u)\coloneqq K^{\star}\nabla F(Ku)\in\mathcal{C}(\Omega,H)\subset\mathcal{M}(\Omega,H)^{*}$ . Moreover, due to the weak*-to-strong continuity of $K$ , the mapping

[TABLE]

is sequentially weak*-to-strong continuous.

3.1. Existence of minimizers and optimality conditions

Before we turn to the algorithmic solution of ( $\mathcal{P}$ ) we summarize some basic properties, such as existence and optimality conditions, which will be necessary in the following. Since $j$ is radially unbounded and weak* lower semicontinuous, the existence of at least one global minimizer to ( $\mathcal{P}$ ) follows immediately by the direct method of variational calculus (see, e.g., [16, Chapter 1]).

Proposition 3.1.

There exists at least one optimal solution $\bar{u}\in\mathcal{M}(\Omega,H)$ to ( $\mathcal{P}$ ).

Let us turn to a structural characterization of minimizers obtained from ( $\mathcal{P}$ ). The following theorem is a direct consequence of the one-homogeneity of the norm and Assumption 3.1(i.); see, e.g., [58, Theorem 6.22].

Theorem 3.2.

Let $\bar{u}\in\operatorname{\operatorname{dom}}j$ be given. Set $\bar{p}=-\nabla f(\bar{u})\in\mathcal{C}(\Omega,H)$ . Then $\bar{u}$ is an optimal solution to ( $\mathcal{P}$ ) if and only if

[TABLE]

Throughout the rest of the paper we will consider a solution $\bar{u}$ of ( $\mathcal{P}$ ) and define

[TABLE]

We refer to $\bar{y}$ as the optimal observation and to $\bar{p}$ as the dual variable associated to $\bar{u}$ .

Proposition 3.3.

The optimal observation $\bar{y}=K\bar{u}$ and dual variable $\bar{p}=-\nabla f(\bar{u})$ are the same for every minimizer $\bar{u}$ to ( $\mathcal{P}$ ).

Proof.

The uniqueness of the optimal observation $\bar{y}$ can be shown by a standard argument using the strict convexity of $F$ . Thus, the optimal dual variable $\bar{p}=-\nabla f(\bar{u})=-K^{\star}\nabla F(\bar{y})$ is unique as well. ∎
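The standard argument behind the uniqueness of $\bar{y}$ can be spelled out as follows. This is a sketch that assumes the objective has the form $j(u)=F(Ku)+G(\lVert u\rVert_{\mathcal{M}})$ and uses only Assumption 3.1; the symbols $u_{1}$ , $u_{2}$ and $u_{1/2}$ are introduced here purely for illustration.

```latex
Suppose $u_{1},u_{2}\in\operatorname{dom}j$ are two minimizers of $(\mathcal{P})$
with $Ku_{1}\neq Ku_{2}$, and set $u_{1/2}=\tfrac{1}{2}(u_{1}+u_{2})$.
By the triangle inequality, the monotonicity of $G$ on $\mathbb{R}_{+}$,
and the convexity of $G$,
\begin{align*}
G(\lVert u_{1/2}\rVert_{\mathcal{M}})
  &\leq G\bigl(\tfrac{1}{2}\lVert u_{1}\rVert_{\mathcal{M}}
        +\tfrac{1}{2}\lVert u_{2}\rVert_{\mathcal{M}}\bigr)
   \leq \tfrac{1}{2}G(\lVert u_{1}\rVert_{\mathcal{M}})
        +\tfrac{1}{2}G(\lVert u_{2}\rVert_{\mathcal{M}}).
\end{align*}
Since $Ku_{1},Ku_{2}\in\operatorname{dom}F$ and $F$ is strictly convex on
$\operatorname{dom}F$, the linearity of $K$ yields
\begin{align*}
F(Ku_{1/2}) &< \tfrac{1}{2}F(Ku_{1})+\tfrac{1}{2}F(Ku_{2}).
\end{align*}
Adding both estimates gives
\begin{align*}
j(u_{1/2}) &< \tfrac{1}{2}j(u_{1})+\tfrac{1}{2}j(u_{2})
            = \min_{u\in\mathcal{M}(\Omega,H)} j(u),
\end{align*}
which is a contradiction. Hence $Ku_{1}=Ku_{2}$.
```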

The optimality conditions can be equivalently expressed through a sparsity condition on the total variation measure $\lvert\bar{u}\rvert$ and a projection formula for the Radon-Nikodým derivative $\bar{u}^{\prime}$ ; see, e.g., [58, Theorem 6.24].

Theorem 3.4.

Let $\bar{u}\in\operatorname{\operatorname{dom}}j$ with polar decomposition $\mathop{}\!\mathrm{d}\bar{u}=\bar{u}^{\prime}\mathop{}\!\mathrm{d}|\bar{u}|$ . Then the condition (3.1) is equivalent to $\lVert\bar{p}\rVert_{\mathcal{C}}\in\partial G(\lVert\bar{u}\rVert_{\mathcal{M}})$ and either $\bar{\lambda}=\lVert\bar{p}\rVert_{\mathcal{C}}=0$ or

[TABLE]

In many situations it can be ensured that a minimizer with finite support (1.1) exists: for instance, when the space $Y$ is finite dimensional (see, e.g., [52, Theorem 3.7], [58, Proposition 6.32]) or when the dual variable assumes its maximum only in a finite set of points. We will impose the latter condition below, which will be necessary for the improved convergence analysis. In this case, Theorem 3.4 can be interpreted as follows:

Corollary 3.5.

Assume that $\bar{u}\in\operatorname{\operatorname{dom}}j$ has finite support, i.e. $\bar{u}=\sum_{n=1}^{N}\bar{\bm{u}}_{n}\delta_{\bar{x}_{n}}$ with $N\geq 0$ , $\bar{\bm{u}}_{n}\in H$ , and $\bar{x}_{n}\in\Omega$ . Then the condition (3.1) is equivalent to $\lVert\bar{p}\rVert_{\mathcal{C}}\in\partial G(\lVert\bar{u}\rVert_{\mathcal{M}})$ and either $\bar{\lambda}=\lVert\bar{p}\rVert_{\mathcal{C}}=0$ or

[TABLE]

Proof.

This follows directly with $\operatorname{\operatorname{supp}}\bar{u}\subset\{\,\bar{x}_{n}\,\}_{n=1,\ldots,N}$ and $\bar{u}^{\prime}(\bar{x}_{n})=\bar{\bm{u}}_{n}/\lVert\bar{\bm{u}}_{n}\rVert_{H}$ for all $\bar{x}_{n}\in\operatorname{\operatorname{supp}}\bar{u}$ . ∎

Note that, for problems of the form ( $\mathcal{P}_{\textrm{source}}$ ) and $\bar{u}\neq 0$ , $\bar{\lambda}$ is simply equal to the cost or regularization parameter $\beta$ .

3.2. Uniqueness of solutions and non-degeneracy conditions

In order to ensure uniqueness of the solution $\bar{u}$ itself, we introduce the corresponding unique dual certificate

[TABLE]

where we recall that $\bar{p}=-\nabla f(\bar{u})=-K^{\star}\nabla F(\bar{y})$ and $\bar{\lambda}\coloneqq\lVert\bar{p}\rVert_{\mathcal{C}}=\lVert\bar{P}\rVert_{\mathcal{C}(\Omega)}$ (which are uniquely defined according to Proposition 3.3).

In the following, we will restrict ourselves to optimal solutions with non-degenerate dual variable $\bar{p}=-\nabla f(\bar{u})$ , i.e. $\bar{p}\neq 0$ and thus $\bar{\lambda}>0$ , since a trivial dual variable $\bar{p}$ would imply that $\bar{u}$ is also a global minimizer of the functional $f$ over the set $\mathcal{M}(\Omega,H)$ , which can be easily ruled out in most situations. Furthermore, we impose the following conditions for the analysis of the paper.

Assumption 3.2.

There is a discrete set $\{\,\bar{x}_{n}\,\}^{N}_{n=1}\subset\operatorname{int}\Omega$ for some $N\geq 0$ such that the dual certificate $\bar{P}\in\mathcal{C}(\Omega)$ defined above and $\bar{\lambda}=\lVert\bar{P}\rVert_{\mathcal{C}(\Omega)}>0$ fulfill

[TABLE]

Moreover, the following set is linearly independent:

[TABLE]

This assumption ensures the existence of a unique, sparse minimizer $\bar{u}$ ; cf. also [24].

Proposition 3.6.

Under Assumption 3.2 the problem ( $\mathcal{P}$ ) admits a unique discrete minimizer $\bar{u}\in\mathcal{M}(\Omega,H)$ given by a finite sum of Dirac delta functions

[TABLE]

Proof.

We note that points in (3.3) constitute the potential support set for the optimal solution, i.e. $\operatorname{\operatorname{supp}}\bar{u}\subset\{\,\bar{x}_{n}\,\}_{n=1,\ldots,N}$ due to Theorem 3.4 and thus $\bar{u}$ is given as in (3.5); cf. Corollary 3.5. Together with the form of the integral operator (1.3) it holds $\bar{y}=K\bar{u}=\bar{\bm{K}}\bar{\mu}$ for $\bar{\mu}_{n}=\lVert\bar{\bm{u}}_{n}\rVert_{H}$ where $\bar{\bm{K}}\colon\mathbb{R}^{N}\to Y$ is defined as

[TABLE]

Now, with (3.4) (using linearity of $\bm{k}$ in the second argument) the mapping $\bar{\bm{K}}$ is injective and $\bar{\mu}$ is unique, which directly implies that $\bar{u}$ is unique. ∎

The previous assumption can be guaranteed in several settings, and is commonly imposed for the purpose of error analysis, see, e.g., [24]. For instance, this condition holds if the operator $K$ is injective as in, e.g., sparse initial value identification problems [45]. Furthermore, if (3.3) holds, then we note that the linear independence of (3.4) is in fact necessary for the existence of a unique sparse minimizer with nonzero coefficient functions. More precisely, if $\bar{u}=\sum^{N}_{n=1}\bar{\bm{u}}_{n}\,\delta_{\bar{x}_{n}}$ with $\lVert\bar{\bm{u}}_{n}\rVert_{H}>0$ is a minimizer to ( $\mathcal{P}$ ) and (3.4) is linearly dependent then there exists another minimizer $\widetilde{u}$ to ( $\mathcal{P}$ ) with $\#\operatorname{supp}\widetilde{u}<N$ ; see, e.g., [52]. Finally, recalling the definition of $\bar{\bm{K}}$ from (3.6), the vector $\bar{\mu}$ is the unique minimizer of $F(\bar{\bm{K}}\mu)+G(\lvert\mu\rvert_{\ell^{1}})$ over $\mu\in\mathbb{R}^{N}$ with $\mu\geq 0$ and satisfies

[TABLE]

for some $\theta>0$ if and only if $\bar{\bm{K}}$ has full column rank (i.e. if (3.4) is linearly independent) and $F$ fulfills a strong convexity condition on the image of $\bar{\bm{K}}$ . We will use this later to estimate the error of the optimal coefficients $\bar{\bm{u}}_{n}$ (see Proposition 5.23 below for details). Similar assumptions are used in the convergence analysis of semismooth Newton methods for problems with $\ell^{1}$ -regularization; see [48, p. 18], [47, Example 4.3.9].

In order to ensure the stability of the location points of approximations to $\bar{u}$ and to be able to quantify the error, we require a further strengthened form of the optimality conditions. First, we introduce an appropriate neighborhood of the optimal support points: with Assumption 3.2 there exists a radius $R>0$ such that

[TABLE]

Now, on this neighborhood of the support points, we impose the following additional smoothness requirements on the kernel:

[TABLE]

This has several consequences. First, we observe that this implies that $K^{\star}y\in\mathcal{C}^{2}(\bar{\Omega}_{R},H)$ for any $y\in Y$ and therefore $\bar{p}\in\mathcal{C}^{2}({\bar{\Omega}_{R}},H)$ . Next, we note that the $H$ -norm $\lVert\cdot\rVert_{H}$ is two times continuously Fréchet differentiable at every $\bm{u}\in H,~{}\bm{u}\neq 0$ . Thus if $p\in\mathcal{C}^{2}({\bar{\Omega}_{R}},H)$ satisfies $\lVert p(x)\rVert_{H}\geq c$ for some $c>0$ and all $x\in\bar{\Omega}_{R}$ , then the composition $P(x)=\lVert p(x)\rVert_{H}$ is in $\mathcal{C}^{2}(\bar{\Omega}_{R})$ . In particular this implies $\bar{P}\in\mathcal{C}^{2}({\bar{\Omega}_{R}})$ . Since $\bar{x}_{n}$ are the maximizers of $\bar{P}$ , it holds

[TABLE]

Finally, we impose the main requirement for the analysis from this section: we assume that the curvature of $\bar{P}$ around its global maximizers does not degenerate.

Assumption 3.3.

There holds $\operatorname{\operatorname{supp}}\bar{u}=\{\,\bar{x}_{n}\,\}^{N}_{n=1}$ , i.e. $\lVert\bar{\bm{u}}_{n}\rVert_{H}>0$ for $n=1,\dots,N$ . Furthermore, the kernel fulfills (3.8) for a radius $R>0$ satisfying (3.7) and there is a $\theta_{0}>0$ such that for all $n=1,\ldots,N$ it holds:

[TABLE]

Let us briefly motivate the last assumption and recall similar concepts from the literature. First we point out $\bar{P}(\bar{x}_{n})=\max_{x\in\Omega}\bar{P}(x)$ . Hence, (3.9) corresponds to a second order sufficient condition (SSC) for the global maximizers of $\bar{P}$ . In particular, this is equivalent to the quadratic growth

[TABLE]

of $\bar{P}$ for all $x$ in the vicinity of $\bar{x}_{n}$ (see Lemma 5.7 below for details). This will allow us to derive estimates on the support points of approximations to $\bar{u}$ by perturbation results for the dual certificate $\bar{P}$ . In the context of super-resolution the conditions of Assumption 3.3 (for the case of $H=\mathbb{R}$ ) are referred to as a non-degeneracy source condition for the measure $\bar{u}$ ; cf. [24, 23]. Furthermore, we recall the connection of sparse minimization problems to state constrained optimization; cf. [14]. From this point of view the equality condition on $\operatorname{supp}\bar{u}$ corresponds to a strict complementarity assumption on the Lagrange multiplier associated to the state constraint. Moreover, in this case the definiteness assumption on the Hessian of $\bar{P}$ can be interpreted as a condition on the curvature of the optimal state around those points in which it touches the constraint. Both of these conditions are well-established in the field of semi-infinite optimization. We refer to, e.g., [46] where similar assumptions are used to derive finite element error estimates. In [54] comparable conditions are imposed to derive second order optimality conditions for semi-infinite optimization problems.

4. Algorithmic solution

In this section we discuss the numerical solution of ( $\mathcal{P}$ ) with a conceptually simple algorithm operating on sparse finitely supported measures (1.1). It consists of the repeated insertion of a single new point at the maximum of a dual variable, the full resolution of a convex optimization problem on the current support, and the subsequent removal of all zero coefficients associated to the current support.

To describe the algorithm, we consider an active set of distinct points $\mathcal{A}=\left\{\,x_{n}\in\Omega\;|\;n=1,\dots,\#\mathcal{A}\right\}$ and the associated parametrization $U_{\mathcal{A}}$ defined by

[TABLE]

The convex optimization problem at the core of the algorithm arises from fixing the support of the measure to the set $\mathcal{A}$ and considering only the convex problem for the coefficients on the Hilbert space $H^{\#\mathcal{A}}$ :

[TABLE]

Clearly ( $\mathcal{P}_{\mathcal{A}}$ ) corresponds to ( $\mathcal{P}$ ) with the support of the optimization variable restricted to $\mathcal{A}$ .

4.1. Primal Dual Active Point Strategy

The algorithm updates the active point set $\mathcal{A}_{k}=\operatorname{\operatorname{supp}}u^{k}$ corresponding to the current iterate $u^{k}$ in every iteration $k=1,2,\ldots$ by adding a single point $\widehat{x}^{k}$ , found at the global maximum of the current dual

[TABLE]

Subsequently, the convex sparse optimization problem ( $\mathcal{P}_{\mathcal{A}}$ ) is solved on the updated support, and entries from the active set with zero coefficient are pruned. We point out that the solution of ( $\mathcal{P}_{\mathcal{A}}$ ) is sparse due to the sparsity promoting cost term, i.e. several coefficients of $\bm{u}$ may be zero. The full procedure is outlined in Algorithm 1.
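The insert, solve and prune cycle described above can be sketched in a few lines. The following is an illustration under strong simplifying assumptions and not the authors' implementation: it takes $H=\mathbb{R}$ , $\Omega=[0,1]$ , $F(y)=\tfrac{1}{2}\lVert y-y_{d}\rVert^{2}$ and $G(m)=\beta m$ , replaces the global maximization of the dual variable by a search over a fixed grid, and solves the coefficient problem only approximately by proximal gradient steps. The Gaussian kernel and all names (`kernel_column`, `active_point_iteration`, ...) are hypothetical.

```python
import numpy as np

def kernel_column(x, t, sigma=0.05):
    """Hypothetical Gaussian kernel k(x, 1), sampled at the measurement locations t."""
    return np.exp(-((t - x) ** 2) / (2.0 * sigma ** 2))

def forward(points, u, t):
    """K u = sum_n u_n * k(x_n, 1) for the discrete measure supported on `points`."""
    y = np.zeros_like(t)
    for x, c in zip(points, u):
        y = y + c * kernel_column(x, t)
    return y

def objective(points, u, t, y_d, beta):
    return 0.5 * np.sum((forward(points, u, t) - y_d) ** 2) + beta * np.sum(np.abs(u))

def solve_coefficients(Phi, y_d, beta, u0, iters=2000):
    """Proximal gradient steps for min_u 0.5*||Phi u - y_d||^2 + beta*||u||_1 (inexact solve)."""
    L = np.linalg.norm(Phi, 2) ** 2                  # Lipschitz constant of the smooth part
    u = u0.copy()
    for _ in range(iters):
        z = u - Phi.T @ (Phi @ u - y_d) / L
        u = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)   # soft thresholding
    return u

def active_point_iteration(t, y_d, beta, grid, max_iter=15, tol=1e-8):
    points, u, history = [], np.zeros(0), []
    for _ in range(max_iter):
        history.append(objective(points, u, t, y_d, beta))
        residual = forward(points, u, t) - y_d
        # Dual variable p = -K^*(K u - y_d), evaluated on the grid only.
        p = np.array([-(kernel_column(x, t) @ residual) for x in grid])
        i_hat = int(np.argmax(np.abs(p)))
        if np.abs(p[i_hat]) <= beta + tol:           # |p| <= beta everywhere on the grid: stop
            break
        x_hat = float(grid[i_hat])
        if x_hat not in points:                      # insert the maximizer with zero coefficient
            points.append(x_hat)
            u = np.append(u, 0.0)
        Phi = np.column_stack([kernel_column(x, t) for x in points])
        u = solve_coefficients(Phi, y_d, beta, u)    # coefficient problem on the active set
        keep = np.abs(u) > 0.0                       # prune points with zero coefficient
        points = [x for x, k in zip(points, keep) if k]
        u = u[keep]
    return points, u, history

# Hypothetical data: two sources at 0.3 and 0.7, noise-free observations.
t = np.linspace(0.0, 1.0, 40)
y_d = 1.0 * kernel_column(0.3, t) - 0.8 * kernel_column(0.7, t)
grid = np.linspace(0.0, 1.0, 201)
points, u, history = active_point_iteration(t, y_d, beta=0.05, grid=grid)
```

Because the coefficient solver is warm-started at the previous iterate, extended by a zero coefficient, and proximal gradient steps with step size $1/L$ do not increase the objective, the recorded values in `history` are nonincreasing. The stopping test is specific to $G(m)=\beta m$ , for which optimality amounts to $\lVert p\rVert_{\mathcal{C}}\leq\beta$ .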

In the following, we consider Algorithm 1 without any termination criterion, so that it generates an infinite sequence of iterates $u^{k}$ , $k\to\infty$ , and analyze the structure and convergence of these iterates. For this, we require the first order necessary optimality conditions for solutions to the coefficient optimization problem ( $\mathcal{P}_{\mathcal{A}}$ ) in step 3. of Algorithm 1, which are given as follows.

Proposition 4.1.

Let $k\geq 1$ and $\mathcal{A}\coloneqq\mathcal{A}_{k-1/2}=\left\{\,x_{i}\in\Omega\;|\;i=1,\ldots,\#\mathcal{A}_{k-1/2}\,\right\}$ be the active set in iteration $k-1$ of Algorithm 1. Accordingly, denote by $\bm{u}^{k}\in H^{\#\mathcal{A}}$ the optimal solution to ( $\mathcal{P}_{\mathcal{A}}$ ). Define the next iterate $u^{k}\coloneqq U_{\mathcal{A}}(\bm{u}^{k})$ , dual variable $p^{k}\coloneqq-\nabla f(u^{k})$ and $\lambda^{k}\coloneqq\max_{x\in\mathcal{A}_{k-1/2}}\lVert p^{k}(x)\rVert_{H}$ . Then there holds

[TABLE]

If $\lambda^{k}>0$ this implies

[TABLE]

Proof.

The optimality conditions are obtained analogously to Corollary 3.5. To this end note that for the given $\mathcal{A}$ the mapping (4.1) can be understood as an isometric isomorphism

[TABLE]

where $\mathcal{M}(\mathcal{A},H)\subset\mathcal{M}(\Omega,H)$ is the space of vector measures supported on $\mathcal{A}$ and the $\ell^{1}(H)$ norm of $\bm{u}\in H^{\#\mathcal{A}}$ is given by $\sum_{n}\lVert\bm{u}_{n}\rVert_{H}$ . Moreover the operator $K$ can be restricted to a linear continuous operator

[TABLE]

Thus $u^{k}=U_{\mathcal{A}}(\bm{u}^{k})$ is a solution to $\min_{u\in\mathcal{M}(\mathcal{A},H)}j(u)$ where $j$ is restricted to vector measures supported on $\mathcal{A}$ . The claimed conditions follow now from Theorem 3.2 and Corollary 3.5 by replacing $\Omega$ with $\mathcal{A}$ and realizing that the dual variable for the restricted problem is the restriction of the dual variable $p^{k}=-\nabla f(u^{k})$ to $\mathcal{A}$ .

∎

We note the difference between the quantity $\lambda^{k}$ from (4.2), which is the maximum of $P^{k}(x)=\lVert p^{k}(x)\rVert_{H}$ over $x\in\mathcal{A}_{k-1/2}$ , and the quantity $\lVert p^{k}\rVert_{\mathcal{C}}=P^{k}(\widehat{x}^{k})$ , which is the maximum of $P^{k}$ over $\Omega$ . Now, it is clear that

[TABLE]

and that $u^{k}$ is optimal for the original problem ( $\mathcal{P}$ ) if and only if $\lVert p^{k}\rVert_{\mathcal{C}}-\lambda^{k}=0$ . In fact, in this case the conditions from Proposition 4.1 imply the (sufficient) optimality conditions from Theorem 3.4 (cf. Corollary 3.5).

By Proposition 4.1, $P^{k}(x)=\lambda^{k}$ holds for all $x\in\mathcal{A}_{k}=\operatorname{\operatorname{supp}}u^{k}=\{\,x_{i}\in\mathcal{A}_{k-1/2}\;|\;\bm{u}_{i}^{k}\neq 0\,\}$ . Thus, if $\mathcal{A}_{k}\neq\emptyset$ or equivalently $u^{k}\neq 0$ it also holds

[TABLE]

since $\mathcal{A}_{k}$ contains only the support points of $u^{k}$ . Associated to those support points we denote the coefficients of $u^{k}$ by $u^{k}(x_{i})\coloneqq u^{k}(\{x_{i}\})$ (by a slight abuse of notation), and there holds further

[TABLE]

again from Proposition 4.1. Moreover, it is apparent that Algorithm 1 is a descent method: In fact step 3. implies that

[TABLE]

decays monotonically along the iterates, $j(u^{k+1})\leq j(u^{k})$ .

We will show in section 5 that $j(u^{k})\to j(\bar{u})$ for $k\to\infty$ , which implies the weak* convergence of the iterates $u^{k}\rightharpoonup^{*}\bar{u}$ . However, we note that this notion of convergence requires careful interpretation. For instance, we cannot expect the coefficients of the iterate measure $u^{k}$ to converge towards coefficients of $\bar{u}$ . In fact, $u^{k}$ is allowed to have many more support points than $\bar{u}$ , i.e. $N_{k}\coloneqq\#\mathcal{A}_{k}\gg\#\operatorname{supp}\bar{u}=N$ , where multiple support points $x^{k}_{j}$ approximate asymptotically a single support point $\bar{x}_{n}$ . Instead, we assume for the moment that the balls around the optimal support points from (3.7) are known and define for each support point of the exact solution $\bar{x}_{n}$ the “lumped” coefficients $\bm{U}^{k}_{n}=u^{k}(B_{R}(\bar{x}_{n}))$ . Then, the lumped coefficients fulfill

[TABLE]

Moreover, we can choose $X_{n}^{k}$ as an arbitrary convex combination of $\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ and replace $u^{k}$ with the “lumped” measure

[TABLE]

which still converges to $\bar{u}$ in the weak* sense, but has the correct number of support points.

We note that the construction of the “lumped” coefficients supposes knowledge of the balls $B_{R}(\bar{x}_{n})$ . This is sufficient for the purposes of the error analysis of the next section but problematic in practical computations, where neither the number $N$ nor these balls are known a priori. In practice, these balls could be estimated from the knowledge of the approximate solution $u^{k}$ and $p^{k}$ a posteriori, but we do not pursue this here.
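The lumping construction is easy to state for scalar coefficients. The sketch below assumes $H=\mathbb{R}$ , $d=1$ , and known centers $\bar{x}_{n}$ and radius $R$ ; it sums the coefficients in each ball to obtain $\bm{U}^{k}_{n}$ and, as one possible choice of the convex combination $X^{k}_{n}$ , uses the mean of the support points weighted by the coefficient magnitudes. The function name `lump` and the numbers are hypothetical.

```python
import numpy as np

def lump(points, coeffs, centers, R):
    """Return lumped coefficients U_n = u(B_R(center_n)) and one convex combination X_n."""
    points = np.asarray(points, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    U, X = [], []
    for c in centers:
        mask = np.abs(points - c) < R            # support points inside the ball B_R(c)
        U.append(coeffs[mask].sum())             # total mass of the measure on the ball
        w = np.abs(coeffs[mask])
        # Magnitude-weighted mean: one admissible convex combination (a choice of this sketch).
        X.append(np.sum(w * points[mask]) / np.sum(w) if w.sum() > 0 else c)
    return np.array(U), np.array(X)

# An iterate with two support points clustered near 0.3 and one at 0.7.
U, X = lump(points=[0.29, 0.31, 0.70], coeffs=[0.4, 0.6, -0.8], centers=[0.3, 0.7], R=0.05)
```

Here the two points near $0.3$ are merged into a single Dirac delta with coefficient $1.0$ at position $0.302$ , so the lumped measure has as many support points as there are centers.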

5. Convergence analysis

We now provide a convergence analysis for Algorithm 1. It requires a final set of conditions imposed on the functional $F$ , for which we introduce additional notation: For $u\in\operatorname{\operatorname{dom}}j$ define the sublevel set

[TABLE]

as well as its image set

[TABLE]

The set $E_{j}(u)$ is weak* compact since $j$ is radially unbounded and weak* lower semicontinuous (Assumption 3.1). Consequently, since $K$ is weak*-to-strong continuous, $KE_{j}(u)$ is compact. Observe that Algorithm 1 is a descent method due to (4.6) and thus $u^{k}\in E_{j}(u^{0})$ for all $k\geq 0$ . For the convergence analysis we impose the following additional assumptions on $F$ , which are weaker than global Lipschitz continuity of its gradient and strong convexity.

Assumption 5.1.

For every $u\in\operatorname{\operatorname{dom}}j$ the gradient $\nabla F$ is Lipschitz continuous on the image set $KE_{j}(u)$ : There exists a constant $L_{u}$ only depending on $j(u)$ with

[TABLE]

Moreover $F$ is strongly convex around the optimal observation $\bar{y}=K\bar{u}\in\operatorname{\operatorname{dom}}F$ , i.e. there exist a neighborhood $\mathcal{N}(\bar{y})\subset\operatorname{\operatorname{dom}}F$ of $\bar{y}$ in $Y$ and a constant $\gamma_{0}>0$ with

[TABLE]

It is clear that a quadratic $F$ as in ( $\mathcal{P}_{\textrm{source}}$ ) fulfills these conditions. However, in cases where the domain of $F$ is a proper subset of $Y$ (cf., e.g., [50]), $F$ fulfills neither (5.3) nor (5.4) uniformly for all $y_{1},y_{2}\in Y$ , and requires the weaker form given above.

In order to make the following presentation more transparent we state the main result of this section beforehand: The following theorem provides linear convergence for the residual $r_{j}(u^{k})$ of the functional defined as

[TABLE]

along the iterates $u^{k}$ , the support set $\mathcal{A}_{k}$ , and the lumped coefficients of the iterates introduced in (4.7).

Theorem 5.1.

Suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold and let the sequence $u^{k}$ be generated by Algorithm 1 started at $u^{0}$ . Recall the definition of the balls $B_{R}(\bar{x}_{n})$ and their union $\Omega_{R}$ from (3.7). Then, there exists a constant $\bar{k}\geq 1$ with

[TABLE]

Moreover, there exist $c\geq 0$ and $\zeta\in(0,1)$ , such that for all $k\geq\bar{k}$ :

[TABLE]

Finally, we have $u^{k}\rightharpoonup^{*}\bar{u}$ in $\mathcal{M}(\Omega,H)$ , and the error decays linearly in the dual space $\mathcal{C}^{0,1}(\Omega,H)^{*}$ of the space of $H$ valued Lipschitz continuous functions on $\Omega$ : $\lVert u^{k}-\bar{u}\rVert_{\mathcal{C}^{0,1}(\Omega,H)^{*}}\leq c\,\zeta^{k}$ .

Before giving the proof of Theorem 5.1, we sketch the main idea behind the convergence result for the functional: Due to the localization result given in the first assertion of Theorem 5.1, we know that for $k$ large enough, every newly inserted support point $\widehat{x}^{k}$ is contained in exactly one ball around the support points $\bar{x}_{n}$ , and we will denote the associated index $\widehat{n}_{k}\in\{\,1,\ldots,N\,\}$ and the ball by

[TABLE]

By perturbation arguments we can estimate the distance of the existing support points and the new point $\widehat{x}^{k}$ to the corresponding optimal location $\bar{x}_{\,\widehat{n}_{k}}$ in terms of the error quantity $\lVert p^{k}\rVert_{\mathcal{C}}-\lambda^{k}$ from (4.3), which bounds the functional residual $r_{j}(u^{k})$ up to a constant; see subsection 5.2.1. Then, we define the search direction

[TABLE]

and the associated trial point

[TABLE]

which replaces all Dirac delta functions in $\widehat{B}^{k}$ with a single one supported on $\widehat{x}^{k}$ , where the magnitude of the coefficient $\widehat{\mu}^{k}$ is maintained, but the direction is taken from the dual variable at $\widehat{x}^{k}$ , motivated by the optimality conditions. Relying on the fact that the coefficients of $u^{k}$ solve the problem ( $\mathcal{P}_{\mathcal{A}}$ ) and the aforementioned perturbation arguments, we can document the decrease in terms of the objective functional by defining an intermediate update

[TABLE]

Here, if $s^{k}$ is large, we move a large fraction of “mass” from the Dirac-delta functions supported on $\widehat{B}^{k}$ over to the newly inserted point. We show that for appropriately chosen $s^{k}$ not only $j(u^{k+1/2})\leq j(u^{k})$ , but in fact the error decreases linearly according to $r_{j}(u^{k+1/2})\leq\zeta r_{j}(u^{k})$ ; see Theorem 5.17. Clearly, since $\operatorname{\operatorname{supp}}u^{k+1/2}\subset\mathcal{A}_{k+1/2}$ it holds $j(u^{k+1})\leq j(u^{k+1/2})$ and the same estimate follows for $r_{j}(u^{k+1})$ .

The rest of this section is dedicated to providing the proof of Theorem 5.1. Throughout the derivations, the generic constant $c>0$ will be chosen different from line to line, but always independent of the iteration index $k$ . The plan of the proof is as follows. First, in subsection 5.1 we establish the global convergence of Algorithm 1 by interpreting it as an accelerated version of a generalized conditional gradient method. The corresponding theory yields only a much slower sublinear rate of convergence, since it can not fully exploit the decrease achieved in step 3. of Algorithm 1. However, the global convergence result at a sublinear rate is necessary for our proof since it allows us to apply perturbation arguments based on Assumption 3.3 (which are valid only for $k\geq\bar{k}$ ) and the improved convergence analysis for the residual is given in subsection 5.2. We first establish the localization result of the support points in Corollary 5.11 and then the linear convergence of the residual in Theorem 5.17. In subsection 5.3 the results for the iterates are derived as consequences; see Theorem 5.19 for the support points, Theorem 5.24 for the lumped coefficients, and Theorem 5.25 for the estimate in the dual norm.

Remark 5.1.

It may be tempting to use the update (5.8) directly to replace the resolution of the subproblem in step 3. of Algorithm 1. However, this is not immediately possible for several reasons. First, the construction of the search direction $\Delta^{k}$ requires knowledge of the ball $\widehat{B}^{k}$ based on the exact solution, which is not available in practical computations. Second, the descent properties of (5.8) rely on the fact that $u^{k}$ is a previous iterate computed from ( $\mathcal{P}_{\mathcal{A}}$ ) and fulfills (4.4) and (4.5). Third, the choice of $\widehat{\mu}^{k}$ leads to $\lVert u^{k+1/2}\rVert_{\mathcal{M}}=\lVert u^{k}\rVert_{\mathcal{M}}$ and thus cannot, on its own, lead to a convergent method. Moreover, it is beneficial to resolve ( $\mathcal{P}_{\mathcal{A}}$ ) fully and obtain in each step the sparsity condition from Proposition 4.1, since this removes unnecessary support points in step 4., keeps the active set small, and enables the estimate on the points of the active set given above.

5.1. Worst-case convergence analysis

Now, we derive a first convergence result for the sequence $u^{k}$ generated by Algorithm 1. For this, we rely on existing analysis for generalized conditional gradient methods. Here, the same point $\widehat{x}^{k}$ is inserted at each iteration, but, in contrast to Algorithm 1, a convex combination of the old iterate and a new trial iterate supported on $\widehat{x}^{k}$ is taken as the update, instead of solving the subproblem ( $\mathcal{P}_{\mathcal{A}}$ ) to obtain the new iterate.

Similar to [9], our derivation relies on an equivalent surrogate of ( $\mathcal{P}$ ). Let $M_{0}>0$ be an upper bound on the norms of the elements in the set $E_{j}(u^{0})$ , which exists due to Assumption 3.1(ii.). For instance, for the indicator function $G(m)=I_{m\leq M}$ it can be simply chosen as the value of the constraint $M_{0}\coloneqq M$ and for problems of the form ( $\mathcal{P}_{\textrm{source}}$ ) we can set $M_{0}\coloneqq j(u^{0})/\beta$ using $F\geq 0$ . Note that $u^{k}\in E_{j}(u^{0})$ for all $k\geq 0$ and thus $\lVert u^{k}\rVert_{\mathcal{M}}\leq M_{0}$ . Consider the norm-constrained problem

[TABLE]

Clearly, by choice of $M_{0}$ , the problems ( $\mathcal{P}_{M_{0}}$ ) and ( $\mathcal{P}$ ) admit the same global minimizers. Associated to this auxiliary problem we define the gap functional $\Phi\colon\operatorname{\operatorname{dom}}j\rightarrow\mathbb{R}$ as

[TABLE]

Due to the additional constraint we can easily see that $\Phi(u)$ is finite for $\lVert u\rVert_{\mathcal{M}}\leq M_{0}$ and there holds $\Phi(u)\geq 0$ with equality if and only if $u$ is a solution to ( $\mathcal{P}$ ).

The gap functional $\Phi(u^{k})$ corresponds to the gap of the value of the primal function $j$ at $u^{k}$ and a dual function value of ( $\mathcal{P}_{M_{0}}$ ) evaluated at $p^{k}=-\nabla f(u^{k})$ (see [58, Remark 6.4]) and is an important quantity for the following analysis. In particular, it provides an upper bound on the functional residual; see, e.g., [58, Lemma 6.12].

Proposition 5.2.

For any $u\in\operatorname{dom}j$ there holds

$$r_{j}(u)\leq\Phi(u).$$

Proof.

By convexity of $f$ , we have $f(\bar{u})-f(u)\geq\langle\nabla f(u),\bar{u}-u\rangle$ and thus with $p=-\nabla f(u)$ it follows

[TABLE]
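For orientation, the chain of inequalities behind this bound can be sketched as follows. The sketch assumes that $j(u)=f(u)+G(\lVert u\rVert_{\mathcal{M}})$ , that $r_{j}(u)=j(u)-j(\bar{u})$ , and that $\Phi$ has the supremum form consistent with the identity stated in Proposition 5.3; these are our reading of the definitions and not a verbatim restatement. It also uses $\lVert\bar{u}\rVert_{\mathcal{M}}\leq M_{0}$ , which follows from $\bar{u}\in E_{j}(u^{0})$ .

```latex
% Sketch under the assumptions named in the lead-in; p = -\nabla f(u).
\begin{aligned}
r_{j}(u) = j(u)-j(\bar{u})
 &= f(u)-f(\bar{u}) + G(\lVert u\rVert_{\mathcal{M}}) - G(\lVert \bar{u}\rVert_{\mathcal{M}}) \\
 &\leq \langle p,\bar{u}-u\rangle + G(\lVert u\rVert_{\mathcal{M}}) - G(\lVert \bar{u}\rVert_{\mathcal{M}}) \\
 &\leq \sup_{\lVert v\rVert_{\mathcal{M}}\leq M_{0}}
   \Big[\langle p,v-u\rangle + G(\lVert u\rVert_{\mathcal{M}}) - G(\lVert v\rVert_{\mathcal{M}})\Big]
   = \Phi(u).
\end{aligned}
```

The first inequality is the convexity estimate for $f$ rewritten with $p=-\nabla f(u)$ ; the second holds because $\bar{u}$ is admissible in the supremum.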

The trial point for the GCG method is defined for a given iterate $u^{k}\in\operatorname{\operatorname{dom}}j$ as a maximizer in the maximization problem occurring in the evaluation of $\Phi(u^{k})$ . It can be computed analytically: Let $\widehat{x}^{k}\in\Omega$ be, as before, a maximum of the current dual $p^{k}=-\nabla f(u^{k})$ , $\lVert p^{k}(\widehat{x}^{k})\rVert_{H}=\lVert p^{k}\rVert_{\mathcal{C}}$ , and define

[TABLE]

where $\widehat{m}^{k}\in[0,M_{0}]$ is chosen with

[TABLE]

The connection of $\widehat{v}^{k}$ to $\Phi(u^{k})$ is proved in the following result.

Proposition 5.3.

Let $u^{k}$ be generated by Algorithm 1. Set $p^{k}=-\nabla f(u^{k})$ and $\widehat{v}^{k}$ as in (5.11) and (5.10). Then $\widehat{v}^{k}$ solves

[TABLE]

and thus $\Phi(u^{k})=\langle p^{k},\widehat{v}^{k}-u^{k}\rangle+G(\lVert u^{k}\rVert_{\mathcal{M}})-G(\lVert\widehat{v}^{k}\rVert_{\mathcal{M}})$ .

Proof.

By standard arguments, condition (5.11) implies that $\widehat{m}^{k}$ is a minimizer to

[TABLE]

Next we observe that

[TABLE]

for all $v\in\mathcal{M}(\Omega,H)$ , $\lVert v\rVert_{\mathcal{M}}\leq M_{0}$ . We distinguish two cases. First, if $\lVert p^{k}\rVert_{\mathcal{C}}=0$ then $\widehat{v}^{k}=0$ satisfies

[TABLE]

and therefore $\widehat{v}^{k}=0$ is a solution to (5.12). Note that the second equality holds due to the monotonicity of $G$ from Assumption 3.1. Else, if $\lVert p^{k}\rVert_{\mathcal{C}}>0$ , the measure $\widehat{v}^{k}$ defined in (5.10) satisfies

[TABLE]

where we use $\lVert p^{k}(\widehat{x}^{k})\rVert_{H}=\lVert p^{k}\rVert_{\mathcal{C}}$ in the second equality. Hence we again conclude the optimality of $\widehat{v}^{k}$ for (5.12) which finishes the proof. ∎

We note that the addition of the constraint $\lVert v\rVert_{\mathcal{M}}\leq M_{0}$ ensures that the minimum in (5.12) is finite, which may otherwise not hold for all cost functions $G$ (e.g., ( $\mathcal{P}_{\textrm{source}}$ ) with $G(m)=\beta m+I_{m\geq 0}$ ).
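To make the construction of the trial point concrete, the following is a minimal sketch for a discretized, scalar-valued setting. Everything specific here is an assumption for illustration: measures are represented by coefficient vectors on a fixed grid ( $H=\mathbb{R}$ ), the cost is $G(m)=\beta m$ on $[0,M_{0}]$ , the trial point is taken as $\widehat{m}^{k}\,\operatorname{sign}(p^{k}(\widehat{x}^{k}))\,\delta_{\widehat{x}^{k}}$ , and $\widehat{m}^{k}$ minimizes $m\mapsto G(m)-\lVert p^{k}\rVert_{\mathcal{C}}\,m$ over $[0,M_{0}]$ . The function name `gcg_trial_point` is hypothetical.

```python
import numpy as np

def gcg_trial_point(p, u, beta, M0):
    """Trial point and gap for a grid-discretized, scalar-valued problem.

    Assumptions (illustrative, not from the paper): G(m) = beta * m on [0, M0],
    p holds samples of the dual variable on the grid, u holds the coefficients
    of the current iterate on the same grid, and ||u||_1 <= M0.
    """
    p = np.asarray(p, dtype=float)
    u = np.asarray(u, dtype=float)

    # Grid point where |p| is largest (discrete stand-in for x_hat).
    i_hat = int(np.argmax(np.abs(p)))
    p_max = abs(p[i_hat])

    # m -> beta * m - p_max * m is linear, so its minimizer over [0, M0]
    # sits at an endpoint: M0 if p_max > beta, otherwise 0.
    m_hat = M0 if p_max > beta else 0.0

    # Single Dirac at the maximizer, with the sign taken from the dual variable.
    v_hat = np.zeros_like(u)
    v_hat[i_hat] = m_hat * np.sign(p[i_hat])

    # Gap in the form <p, v_hat - u> + G(||u||) - G(||v_hat||).
    gap = p @ (v_hat - u) + beta * np.abs(u).sum() - beta * m_hat
    return i_hat, v_hat, float(gap)
```

Under these assumptions the returned gap is nonnegative whenever the norm of `u` does not exceed `M0`, mirroring the role of the norm constraint discussed above.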

Last, we define the GCG update of $u^{k}$ with stepsize $s^{k}\in[0,1]$ by $u^{k+1/2}=u^{k}+s^{k}(\widehat{v}^{k}-u^{k})$ . This stepsize can be chosen in various ways; since we rely on the analysis of [50, 58], we employ the same Armijo-Goldstein rule discussed there. The resulting GCG algorithm is summarized in Algorithm 2.
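The update $u^{k+1/2}=u^{k}+s^{k}(\widehat{v}^{k}-u^{k})$ can be sketched with a generic backtracking rule. This is only an illustration: the sufficient-decrease test $j(u^{k+1/2})\leq j(u^{k})-\alpha s\,\Phi(u^{k})$ , the parameters `alpha` and `shrink`, and the function name are assumptions of this sketch and need not coincide with the Armijo-Goldstein rule of [50, 58].

```python
import numpy as np

def armijo_gcg_update(j, u, v_hat, gap, alpha=0.5, shrink=0.5, max_backtracks=60):
    """Return (u_half, s) with u_half = u + s * (v_hat - u).

    j      : callable evaluating the objective on a coefficient vector
    gap    : value of the gap functional at u (assumed to be precomputed)
    alpha, shrink : hypothetical backtracking parameters in (0, 1)
    """
    if gap <= 0.0:
        # Zero gap means u is already optimal for the surrogate problem.
        return u.copy(), 0.0

    j_u = j(u)
    s = 1.0
    for _ in range(max_backtracks):
        u_half = u + s * (v_hat - u)
        # Accept the first step that achieves a decrease proportional to s * gap.
        if j(u_half) <= j_u - alpha * s * gap:
            return u_half, s
        s *= shrink

    # No acceptable step found within the backtracking budget: keep the iterate.
    return u.copy(), 0.0
```

Any accepted step satisfies $s\in(0,1]$ , so the update is a convex combination of the old iterate and the trial point, as required for the GCG method.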

The worst-case convergence analysis of Algorithm 1 relies on the inclusion of the optional step 4. in Algorithm 2. It allows replacing the GCG update $u^{k+1/2}$ with any $u^{k+1}$ that decreases the functional value. Clearly the update $u^{k+1}$ defined in step 5. of Algorithm 1 fulfills this condition, due to the fact that $\operatorname{supp}u^{k+1/2}\subset\mathcal{A}_{k+1/2}$ . Hence, Algorithm 1 achieves at least as much descent in the objective functional as the GCG update $u^{k+1/2}$ in each step and thus the convergence analysis of Algorithm 2 applies to Algorithm 1, which can be considered an accelerated version of the former. Using this observation, we conclude the global unconditional convergence of Algorithm 1 and a sublinear rate of convergence for the residuals. For similar results in specific settings, see also [25, 6, 9, 50]. We refer to [58, Theorem 6.29] for a detailed derivation of the specific form of the following result.

Theorem 5.4.

Suppose that Assumption 3.1 and condition (5.3) hold and let $u^{k}$ be generated by Algorithm 2 or Algorithm 1. Then $u^{k}$ is a minimizing sequence for $j$ and there exists a $q\in(0,1]$ with

[TABLE]

Moreover $u^{k}$ admits at least one weak* convergent subsequence and each weak* accumulation point $\bar{u}$ of $u^{k}$ is a minimizer of $j$ over $\mathcal{M}(\Omega,H)$ . If the solution $\bar{u}$ to ( $\mathcal{P}$ ) is unique then we have $u^{k}\rightharpoonup^{*}\bar{u}$ for the whole sequence as well as $F(Ku^{k})\rightarrow F(K\bar{u}),~G(\lVert u^{k}\rVert_{\mathcal{M}})\rightarrow G(\lVert\bar{u}\rVert_{\mathcal{M}})$ .

Let us comment briefly on the main difference of the algorithms, which lies in the update of the coefficients in steps 3.–4., respectively. The GCG method in Algorithm 2 attempts to move as much “mass” as possible, simultaneously from all coefficients of $u^{k}$ to the trial point $\widehat{v}^{k}$ . The problem in obtaining an improved linear convergence result for this method lies in the choice of the point $\widehat{v}^{k}$ : for $N>1$ it does not converge to the true solution, $\widehat{v}^{k}\not\rightharpoonup^{*}\bar{u}$ , since $\widehat{v}^{k}$ is supported only on a single point. However, it holds that $j(u^{k+1/2})\to j(\bar{u})$ and thus

[TABLE]

and therefore we must have $s^{k}\to 0$ for $k\to\infty$ . This prevents improving the convergence rate of Theorem 5.4 without any acceleration in step 4. In contrast, the proof of the linear rate for Algorithm 1 relies on an improved intermediate iterate defined with the trial point ${\widehat{u}^{k}}$ , which we show converges to $\bar{u}$ . Here, we are able to choose $s^{k}>s_{\min}>0$ which yields the linear convergence; see the proof of Theorem 5.17.

5.2. Improved rates for the residual

In the following, we turn our attention back to the improved convergence analysis of Algorithm 1 from Theorem 5.1, where we first prove the improved rate for the residual. For this purpose we suppose, now and for the rest of this section, that Assumptions 3.1, 3.2, 3.3, and 5.1 hold. We again recall the definition of the optimal state $\bar{y}$ , the dual variable $\bar{p}$ , the dual certificate $\bar{P}$ and its maximal value $\bar{\lambda}$ :

[TABLE]

as introduced in section 3. Analogously we define the corresponding iterates of Algorithm 1:

[TABLE]

as introduced in section 4. We note that we have given the form of the multiplier $\lambda^{k}$ from (4.4), which requires $u^{k}\neq 0$ . By the global convergence result of the previous section, this is indeed the case:

Corollary 5.5.

For all $k$ large enough there holds $u^{k}\neq 0$ and $\lambda^{k}>0$ .

Proof.

According to Theorem 5.4 we have $u^{k}\rightharpoonup^{*}\bar{u}$ in $\mathcal{M}(\Omega,H)$ and $p^{k}\rightarrow\bar{p}$ in $\mathcal{C}(\Omega,H)$ . In particular, since $\bar{u}\neq 0$ , this implies $u^{k}\neq 0$ for all $k$ large enough. Thus it remains to address the positivity of $\lambda^{k}$ . From the weak* convergence of $u^{k}$ , the strong convergence of $p^{k}$ and the weak* lower semicontinuity of the norm we readily obtain

[TABLE]

and thus $\lambda^{k}>0$ for all $k$ large enough. ∎

We first explore some immediate consequences of Assumption 5.1 that allow us to estimate the error of the important algorithmic quantities in terms of the functional residual. This guarantees their convergence at the already established rate and will be used for the proof of the improved rate below.

Lemma 5.6.

For all $k$ large enough there holds

[TABLE]

Proof.

Recall the neighborhood $\mathcal{N}(\bar{y})$ from Assumption 5.1. Due to the weak* convergence of $u^{k}$ towards $\bar{u}$ , see Theorem 5.4, and the weak*-to-strong continuity of $K$ (see the discussion at the end of Section 2) there holds $y^{k}\in\mathcal{N}(\bar{y})$ for all $k$ large enough. Thus, invoking the strong convexity (5.4) from Assumption 5.1 and recalling the definition of $\Phi$ from (5.9), we conclude

[TABLE]

where we used $\lVert u^{k}\rVert_{\mathcal{M}}\leq M_{0}$ in the last inequality. By optimality of $\bar{u}$ there holds $\Phi(\bar{u})=0$ and thus

[TABLE]

Dividing both sides by $\gamma_{0}$ and taking the square root yields the estimate on $y^{k}-\bar{y}$ .

Next recall the definition of the compact set $KE_{j}(u^{0})$ from (5.2). Note that Algorithm 1 is a descent method and thus $y^{k}=Ku^{k}\in KE_{j}(u^{0})$ . Moreover, since the gradient $\nabla F$ is Lipschitz continuous on $KE_{j}(u^{0})$ due to Assumption 5.1, we have

[TABLE]

for some $L_{u^{0}}>0$ only depending on $j(u^{0})$ . The estimate for the dual variables $p^{k}$ follows immediately since

[TABLE]

Finally we note that

[TABLE]

The next lemma establishes some immediate properties of the dual certificate that will be useful for estimating the distance of the inserted point $\widehat{x}^{k}$ and the support points of $u^{k}$ to the optimal support points of $\bar{u}$ .

Lemma 5.7.

There exist $0<R^{\prime}\leq R$ and $\sigma>0$ such that with $\Omega_{R^{\prime}}=\bigcup_{n=1}^{N}B_{R^{\prime}}(\bar{x}_{n})$ there holds

[TABLE]

where $\theta_{0}>0$ denotes the constant from Assumption 3.3. Moreover, for all $n=1,\ldots,N$ , the following quadratic growth condition is satisfied:

[TABLE]

Proof.

According to Corollary 3.5 we have $\bar{P}(\bar{x}_{n})=\|\bar{P}\|_{\mathcal{C}(\Omega)}$ , and with Assumption 3.2 it holds $\bar{P}(x)<\|\bar{P}\|_{\mathcal{C}(\Omega)}$ for all $x\in\Omega\setminus\{\,\bar{x}_{n}\,\}^{N}_{n=1}$ . Thus the existence of $R^{\prime}\leq R$ and $\sigma>0$ such that (5.15) holds follows from the continuity of $\bar{P}$ . For a given index $n$ , without restriction $R^{\prime}$ can be chosen small enough such that

[TABLE]

and thus

[TABLE]

for all $x\in{B}_{R^{\prime}}(\bar{x}_{n})$ , which proves (5.16). Finally fix $x\in{B}_{R^{\prime}}(\bar{x}_{n})$ . Note that $\bar{x}_{n}\in\operatorname{int}\Omega$ (with Assumption 3.2) is a global maximum of $\bar{P}$ and therefore $\nabla\bar{P}(\bar{x}_{n})=0$ . By Taylor’s theorem with remainder there exists $\widetilde{x}\in\bar{B}_{R^{\prime}}(\bar{x}_{n})$ with

[TABLE]

where (5.16) is used in the last inequality. Since $n$ and $x$ were chosen arbitrarily, this finishes the proof. ∎
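The Taylor step in this proof can be written out as a short worked estimate. As a sketch, let $\theta>0$ denote a generic lower curvature bound on the ball, i.e. we assume $-\zeta^{\top}\nabla^{2}\bar{P}(x)\zeta\geq\theta\lvert\zeta\rvert^{2}$ for all $x\in\bar{B}_{R^{\prime}}(\bar{x}_{n})$ and $\zeta\in\mathbb{R}^{d}$ ; the symbol $\theta$ is a placeholder of this sketch and its relation to $\theta_{0}$ is not asserted here.

```latex
% Sketch: quadratic growth from a uniform curvature bound (theta is a placeholder).
\begin{aligned}
\bar{P}(x)
 &= \bar{P}(\bar{x}_{n}) + \nabla\bar{P}(\bar{x}_{n})^{\top}(x-\bar{x}_{n})
    + \tfrac{1}{2}(x-\bar{x}_{n})^{\top}\nabla^{2}\bar{P}(\widetilde{x})(x-\bar{x}_{n}) \\
 &= \bar{\lambda} + \tfrac{1}{2}(x-\bar{x}_{n})^{\top}\nabla^{2}\bar{P}(\widetilde{x})(x-\bar{x}_{n}) \\
 &\leq \bar{\lambda} - \tfrac{\theta}{2}\,\lvert x-\bar{x}_{n}\rvert^{2}.
\end{aligned}
```

Here the second line uses $\nabla\bar{P}(\bar{x}_{n})=0$ and $\bar{P}(\bar{x}_{n})=\bar{\lambda}$ , and the last line applies the curvature bound at the intermediate point $\widetilde{x}$ .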

5.2.1. Intermediate estimates for the support points

First we argue that the support of $u^{k}$ and the new candidate point $\widehat{x}^{k}$ from step 1. in Algorithm 1 are located in the vicinity of the optimal support points $\bar{x}_{n}$ if $k$ is large enough. For this purpose we require the following estimate on the gap $\Phi(u^{k})$ of the iterates, which bounds the functional residual.

Lemma 5.8.

Assume that the sequence $u^{k}$ is generated by Algorithm 1 and recall the definitions of $\Phi$ from (5.9), of $p^{k}$ and $\lambda^{k}$ from (5.14) and of $\widehat{v}^{k}$ from (5.10). Then there holds

[TABLE]

as well as

[TABLE]

where $\widehat{v}^{k}$ is determined according to Proposition 5.3. In particular, we have

[TABLE]

Proof.

According to Propositions 5.3 and 4.1 there holds

[TABLE]

Recall that $M_{0}>0$ is a bound on the norm of the measures in $E_{j}(u^{0})$ . Since $\widehat{v}^{k}$ is a solution of the partially linearized problem and $\lVert u^{k}\rVert_{\mathcal{M}}\leq M_{0}$ due to $u^{k}\in E_{j}(u^{0})$ , we further obtain

[TABLE]

which gives the first inequality. Using $\lambda^{k}\in\partial G(\lVert u^{k}\rVert_{\mathcal{M}})$ , see Proposition 4.1, we estimate

[TABLE]

which provides the second inequality. The last inequality is a consequence of $\lVert\widehat{v}^{k}\rVert_{\mathcal{M}}\leq M_{0}$ and $r_{j}(u^{k})\leq\Phi(u^{k})$ from Proposition 5.2. ∎

The next result addresses the asymptotic behavior of $\Phi(u^{k})$ . Together with (5.18), this then yields convergence results for $\lambda^{k}$ and $\lVert p^{k}\rVert_{\mathcal{C}}$ .

Lemma 5.9.

There holds $\lim_{k\rightarrow\infty}\Phi(u^{k})=0$ .

Proof.

Since $\bar{u}$ is a solution to ( $\mathcal{P}$ ), the dual variable $\bar{p}=-\nabla f(\bar{u})$ satisfies

[TABLE]

By adding and subtracting $G(\lVert\bar{u}\rVert_{\mathcal{M}})$ and $\bar{\lambda}\lVert\bar{u}\rVert_{\mathcal{M}}=\langle\bar{p},\bar{u}\rangle$ to the definition of $\Phi(u^{k})$ , we estimate

[TABLE]

where we have used (5.19) and $\lVert\widehat{v}^{k}\rVert_{\mathcal{M}}\leq M_{0}$ . By the weak* convergence of $u^{k}$ from Theorem 5.4 and $p^{k}\rightarrow\bar{p}$ in $\mathcal{C}(\Omega,H)$ from Lemma 5.6 we conclude

[TABLE]

Finally note that

[TABLE]

according to Theorem 5.4 and Lemma 5.6, respectively. Together with $\Phi(u^{k})\geq 0$ this concludes the proof. ∎

Corollary 5.10.

Let $\bar{\lambda},\lambda^{k}$ and $p^{k}$ be defined according to (5.13) and (5.14). There holds

[TABLE]

Proof.

Utilizing Lemma 5.6 we observe that

[TABLE]

for $k\to\infty$ . Since $u^{k}\rightharpoonup^{*}\bar{u}$ with Theorem 5.4 and $\lVert\bar{u}\rVert_{\mathcal{M}}>0$ , there holds $\lVert u^{k}\rVert_{\mathcal{M}}\geq\lVert\bar{u}\rVert_{\mathcal{M}}/2>0$ for all $k$ large enough. We consequently obtain

[TABLE]

with (5.18) and Lemma 5.9. Finally, $\lambda^{k}\rightarrow\bar{\lambda}$ follows from the triangle inequality. ∎

Combining Lemma 5.7 with the convergence results of Lemma 5.6 and Corollary 5.10 we conclude that the new candidate point $\widehat{x}^{k}$ and the support of $u^{k}$ are located in the vicinity of the set $\{\,\bar{x}_{n}\,\}^{N}_{n=1}$ . Moreover each optimal Dirac delta position $\bar{x}_{n}$ is approximated by at least one point in $\operatorname{\operatorname{supp}}u^{k}$ .

Corollary 5.11.

Let $0<R^{\prime}\leq R$ , $\sigma>0$ denote the constants from Lemma 5.7, recall $\Omega_{R^{\prime}}=\cup_{n=1}^{N}B_{R^{\prime}}(\bar{x}_{n})$ , and let $\widehat{x}^{k}$ denote the point determined in step 1. of Algorithm 1, $\mathcal{A}_{k}=\operatorname{\operatorname{supp}}u^{k}$ and $P^{k}(x)=\lVert p^{k}(x)\rVert_{H}$ as defined in (5.14). For $k$ large enough and $n=1,\dots,N$ there holds

[TABLE]

Proof.

Let an arbitrary point $x\in\Omega\setminus\Omega_{R^{\prime}}$ be given and recall the function $\bar{P}$ from (5.13). We estimate

[TABLE]

where we used (5.15) in the first inequality. Choosing $k$ large enough such that, with Lemma 5.6 and Corollary 5.10,

[TABLE]

yields (5.20). Next let $x\in\mathcal{A}_{k}=\operatorname{\operatorname{supp}}u^{k}$ be arbitrary. Then there holds $P^{k}(x)=\lambda^{k}$ . Consequently we have $x\in\Omega_{R^{\prime}}$ due to (5.20). In the same way we conclude $\widehat{x}^{k}\in\Omega_{R^{\prime}}$ since $P^{k}(\widehat{x}^{k})=\lVert p^{k}\rVert_{\mathcal{C}}\geq\lambda^{k}$ . Fix now an index $n$ and denote by $u^{k}_{n}$ the restriction of $u^{k}$ to $\bar{B}_{R^{\prime}}(\bar{x}_{n})$ . Invoking Urysohn’s lemma there exists a cut-off function $\chi_{n}\in\mathcal{C}(\Omega)$ with $\chi_{n}=1$ on $\bar{B}_{R^{\prime}}(\bar{x}_{n})$ and $\chi_{n}=0$ on $\bar{B}_{R^{\prime}}(\bar{x}_{i})$ for $i\neq n$ . The weak* convergence of the iterates due to Theorem 5.4 and the strong convergence of the dual variables due to Lemma 5.6 yield

[TABLE]

Since $\lambda^{k}\rightarrow\bar{\lambda}$ with Corollary 5.10 and $\bar{\lambda}>0$ with Assumption 3.2, we have $\lVert u^{k}_{n}\rVert_{\mathcal{M}}\neq 0$ for $k$ large enough. ∎

Next, we quantify the distance of the candidate point $\widehat{x}^{k}$ to the closest point in $\operatorname{\operatorname{supp}}\bar{u}$ in terms of the residual $r_{j}(u^{k})$ . For this purpose we rely on the observation that the behavior of the iterated dual certificate $P^{k}$ on the ball $B_{R^{\prime}}(\bar{x}_{n})$ is similar to that of $\bar{P}$ from (5.17), i.e. it assumes a unique local maximum on $B_{R^{\prime}}(\bar{x}_{n})$ which satisfies a quadratic growth condition.

Lemma 5.12.

Let $0<R^{\prime}\leq R$ denote the constant from Lemma 5.7. For all $n=1,\ldots,N$ and $k$ large enough the function $P^{k}\in\mathcal{C}(\Omega)$ with $P^{k}(x)=\lVert p^{k}(x)\rVert_{H}$ for $x\in\Omega$ , as defined in (5.14), assumes a unique local maximum $\widehat{x}^{k}_{n}$ on each ball $B_{R^{\prime}}(\bar{x}_{n})$ . Furthermore there holds

[TABLE]

where $\theta_{0}>0$ is the coercivity constant from Assumption 3.3, as well as

[TABLE]

Moreover, for the global maximum $\widehat{x}^{k}$ from step 1. of Algorithm 1, there is a $\widehat{n}_{k}\in\{\,1,\dots,N\,\}$ with $\widehat{x}^{k}=\widehat{x}^{k}_{\,\widehat{n}_{k}}$ .

Proof.

Let $R>0$ denote the radius from Assumption 3.3 and let $\bar{p}$ and $\bar{P}$ be defined as in (5.13). Due to the strong convergence of $\nabla F(Ku^{k})$ in $Y$ from Lemma 5.6 and $K^{\star}\in\mathcal{L}\left(Y,\mathcal{C}^{2}(\bar{\Omega}_{R},H)\right)$ as a consequence of Assumption 3.3, we also have $p^{k}\rightarrow\bar{p}$ in $\mathcal{C}^{2}(\bar{\Omega}_{R},H)$ . In particular, due to (3.7), we conclude $\lVert p^{k}(x)\rVert_{H}\geq\bar{\lambda}/4$ , $x\in\bar{\Omega}_{R}$ , and thus $P^{k}\in\mathcal{C}^{2}(\bar{\Omega}_{R})$ for all $k$ large enough. The strong convergence $P^{k}\rightarrow\bar{P}$ in $\mathcal{C}^{2}(\bar{\Omega}_{R})$ follows immediately. Now fix an index $n$ and let $R^{\prime}<R$ denote the radius from Lemma 5.7. For all $x\in B_{R^{\prime}}(\bar{x}_{n})$ , $\zeta\in\mathbb{R}^{d}$ and $k$ large enough we estimate

[TABLE]

where $\|\cdot\|_{\mathbb{R}^{d\times d}}$ denotes the spectral norm. Here we used Lemma 5.7 in the first inequality and the uniform convergence of $\nabla^{2}P^{k}$ in the second one. Hence $P^{k}$ restricted to $\bar{B}_{R^{\prime}}(\bar{x}_{n})$ is uniformly concave and thus together with (5.20), which implies that no maximum can be assumed on the boundary of $B_{R^{\prime}}(\bar{x}_{n})$ , admits a unique maximum $\widehat{x}^{k}_{n}\in B_{R^{\prime}}(\bar{x}_{n})$ . It satisfies the necessary first order conditions $\nabla P^{k}(\widehat{x}^{k}_{n})=0$ . Let $x\in B_{R^{\prime}}(\bar{x}_{n})$ be arbitrary but fixed. By Taylor’s theorem and (5.24) we obtain

[TABLE]

for some $\widetilde{x}\in B_{R^{\prime}}(\bar{x}_{n})$ . Since $x$ and $n$ were chosen arbitrarily we conclude (5.22). Next we prove the estimate in (5.23). For this purpose we invoke Lemma 5.7 and $P^{k}(\widehat{x}^{k}_{n})\geq P^{k}(\bar{x}_{n})$ to estimate

[TABLE]

using a Taylor expansion for $\bar{P}-P^{k}$ in the final step. We readily verify that the entries of the gradient $\nabla\bar{P}(x)-\nabla P^{k}(x)\in\mathbb{R}^{d}$ satisfy for all $x\in\Omega_{R}$ and $i=1,\ldots,d$ :

[TABLE]

As in Lemma 5.6 we estimate

[TABLE]

Note that

[TABLE]

and thus

[TABLE]

Dividing by $(\theta_{0}/4)>0$ we conclude (5.23). Finally we point out that $\widehat{x}^{k}$ is a global maximum of $P^{k}$ and $\widehat{x}^{k}\in\bigcup^{N}_{n=1}\bar{B}_{R^{\prime}}(\bar{x}_{n})$ for all $k$ large enough with Corollary 5.11. Hence, we conclude $\widehat{x}^{k}\in\{\,\widehat{x}^{k}_{n}\,\}^{N}_{n=1}$ . ∎
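The structure established in Lemma 5.12, one local maximizer of $P^{k}$ per ball with the global maximizer among them, can be illustrated numerically. The sketch below works on a one-dimensional grid; the function name, the grid representation and the choice of ball radius are hypothetical and serve only as an illustration.

```python
import numpy as np

def local_maximizers(grid, P, centers, radius):
    """Grid maximizer of P inside each open ball B_radius(center), in 1D.

    grid    : increasing array of sample points
    P       : samples of the (discretized) dual certificate on grid
    centers : approximate optimal support points (assumed known here)
    """
    maximizers = []
    for c in centers:
        # Indices of the grid points lying in the ball around this center.
        idx = np.flatnonzero(np.abs(grid - c) < radius)
        # Restrict the search for the maximum to that ball.
        maximizers.append(grid[idx[np.argmax(P[idx])]])
    return np.array(maximizers)
```

When the certificate is strictly smaller outside the balls than at these local maximizers, the global grid maximizer `grid[np.argmax(P)]` coincides with one of the returned points.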

We finish this section with two a priori estimates for the support of $u^{k}$ as consequences of Lemma 5.7 and Lemma 5.12.

Lemma 5.13.

For all $n=1,\ldots,N$ and $k$ large enough there holds

[TABLE]

Moreover denote by $\{\,\widehat{x}^{k}_{n}\,\}^{N}_{n=1}$ the set of local maximizers of $P^{k}$ on $\Omega_{R^{\prime}}$ from Lemma 5.12. Then we have

[TABLE]

Proof.

First, let $0<R^{\prime}\leq R$ denote the constant from Lemma 5.7. Observe that $\mathcal{A}_{k}\cap B_{R^{\prime}}(\bar{x}_{n})\neq\emptyset$ with Corollary 5.11. Let $x\in\mathcal{A}_{k}\cap B_{R^{\prime}}(\bar{x}_{n})$ be arbitrary but fixed. Using (5.17) we obtain

[TABLE]

for some $c>0$ independent of $x$ . Here we used $P^{k}(x)=\lambda^{k}$ for all $x\in\mathcal{A}_{k}$ and $\bar{P}(\bar{x}_{n})=\bar{\lambda}$ as well as Lemma 5.6 in the final inequality. Taking the maximum over all $x\in\mathcal{A}_{k}\cap B_{R^{\prime}}(\bar{x}_{n})$ , and observing that $\mathcal{A}_{k}\cap B_{R^{\prime}}(\bar{x}_{n})=\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ for $k$ large enough due to (5.20) yields (5.25). Moreover, applying (5.22), we get

[TABLE]

for all $x\in\mathcal{A}_{k}\cap B_{R^{\prime}}(\bar{x}_{n})=\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ . Maximizing with respect to $x$ yields (5.26). ∎

5.2.2. Construction of a descent direction

With these auxiliary estimates at hand we now proceed to prove the linear convergence rate for the residual $r_{j}(u^{k})$ . For this, assume $k$ large enough such that all previous results hold and recall the definition of the trial point

[TABLE]

from (5.7), where $\widehat{n}_{k}$ is the index of the support point $\bar{x}_{\,\widehat{n}_{k}}$ closest to $\widehat{x}^{k}$ as defined in Lemma 5.12. The next statement establishes that the search direction $\Delta^{k}={\widehat{u}^{k}}-u^{k}$ provides a descent direction with descent proportional to the first order error quantity $\lVert p^{k}\rVert_{\mathcal{C}}-\lambda^{k}$ related to the gap $\Phi(u^{k})$ (cf. Lemma 5.8).

Proposition 5.14.

Let $p^{k}$ and $\lambda^{k}$ be defined according to (5.14). For all $k\geq 1$ the trial point $\widehat{u}^{k}$ satisfies

[TABLE]

Proof.

We note that $\lVert u^{k}\rVert_{\mathcal{M}}=\lVert u^{k}\rvert_{\Omega\setminus\widehat{B}_{k}}\rVert_{\mathcal{M}}+\lVert u^{k}\rvert_{\widehat{B}_{k}}\rVert_{\mathcal{M}}=\lVert{\widehat{u}^{k}}\rVert_{\mathcal{M}}$ , and consequently $G(\lVert{\widehat{u}^{k}}\rVert_{\mathcal{M}})=G(\lVert u^{k}\rVert_{\mathcal{M}})$ . Furthermore by construction of (5.7) and conditions (4.4) and (4.5) there holds

[TABLE]

Moreover, the error between ${\widehat{u}^{k}}$ and $u^{k}$ in terms of its observations $K{\widehat{u}^{k}}$ and $Ku^{k}$ can be bounded in terms of the aforementioned dual error quantity.

Lemma 5.15.

There exists a $c>0$ such that for all $k$ large enough there holds

[TABLE]

Proof.

Let an arbitrary $x\in\mathcal{A}_{k}\cap\widehat{B}_{k}$ be given and denote by $u^{k}(x)\in H$ the coefficient of the associated Dirac delta function. Given $\varphi\in Y$ there holds

[TABLE]

using (4.5). Now, with (5.26) and the continuity of $K^{\star}$ , the first term is estimated by

[TABLE]

for $k$ large enough with a constant $c>0$ independent of $x$ . For the second term we use $\lVert p^{k}(\widehat{x}^{k})\rVert_{H}=\lVert p^{k}\rVert_{\mathcal{C}}$ to estimate

[TABLE]

with $c$ as before. Here we used $\lVert p^{k}(\widehat{x}^{k})\rVert_{H}=\lVert p^{k}\rVert_{\mathcal{C}}$ as well as $\lambda^{k}\leq\lVert p^{k}\rVert_{\mathcal{C}}$ in the first equality. Since $\lambda^{k}\rightarrow\bar{\lambda}>0$ and $\lVert p^{k}\rVert_{\mathcal{C}^{0,1}(\bar{\Omega}_{R},H)}\rightarrow\lVert\bar{p}\rVert_{\mathcal{C}^{0,1}(\bar{\Omega}_{R},H)}>0$ there holds for sufficiently large $k$ that

[TABLE]

for all $\varphi\in Y$ and consequently

[TABLE]

Now, we rewrite

[TABLE]

and applying the estimate for all $x$ from above and using $\widehat{\mu}^{k}=\lVert u^{k}\rvert_{\widehat{B}_{k}}\rVert_{\mathcal{M}}=\sum_{x\in\mathcal{A}_{k}\cap\widehat{B}_{k}}\lVert u^{k}(x)\rVert_{H}$ yields the desired result. ∎

The previous results establish the weak* convergence of ${\widehat{u}^{k}}$ towards $\bar{u}$ .

Corollary 5.16.

There holds ${\widehat{u}^{k}}\rightharpoonup^{*}\bar{u}$ and $j({\widehat{u}^{k}})\rightarrow j(\bar{u})$ for $k\to\infty$ .

Proof.

We readily obtain

[TABLE]

The first term tends to zero since $u^{k}$ is a minimizing sequence for $j$ and the second vanishes due to Lemma 5.15. Thus ${\widehat{u}^{k}}$ gives a minimizing sequence for $j$ . Since $\bar{u}$ is the unique minimizer of $j$ the claim on the weak* convergence follows. ∎

Finally, we show that $\Delta^{k}={\widehat{u}^{k}}-u^{k}$ yields a search direction that achieves a linear decrease in the objective functional.

Theorem 5.17.

Suppose that Assumptions 3.1, 3.2, 3.3, and 5.1 hold and that $u^{k}$ is generated by Algorithm 1. There exist an index $\bar{k}\geq 1$ and constants $c>0$ and $\zeta_{1}\in(0,1)$ with

[TABLE]

Proof.

For $s\in[0,1]$ and ${\widehat{u}^{k}}$ from (5.7) define

[TABLE]

Recall the definition of the sets $E_{j}(u^{0})$ and $KE_{j}(u^{0})$ from (5.1) and (5.2), respectively. Since $j({\widehat{u}^{k}})\rightarrow j(\bar{u})$ we conclude $u^{k+1/2}\in E_{j}(u^{0})$ for all $s$ and all $k$ large enough. In the following, let $k$ be large enough. Using the convexity of $F$ , the Lipschitz continuity of its gradient from (5.3), the linearity of $K$ , and the convexity of $G(\lVert\cdot\rVert_{\mathcal{M}})$ we obtain

[TABLE]

where $L_{u^{0}}$ denotes the Lipschitz constant of $\nabla F$ on $KE_{j}(u^{0})$ from Assumption 5.1. Now, by Proposition 5.14 and Lemma 5.15, we derive the estimate

[TABLE]

with $c_{1}=L_{u^{0}}c^{2}$ , where $c$ is the constant from Lemma 5.15. Minimizing for $s\in[0,1]$ , we obtain

[TABLE]

where $s^{k}=\min\{\,1,\;1/(c_{1}\widehat{\mu}^{k})\,\}$ . Note that $\widehat{\mu}^{k}=\lVert u^{k}\rvert_{\widehat{B}_{k}}\rVert_{\mathcal{M}}\geq\lVert\bar{u}(\bar{x}_{\,\widehat{n}_{k}})\rVert_{H}/2$ for all $k$ large enough, where $\widehat{n}_{k}$ is the index of the optimal support point closest to $\widehat{x}^{k}$ as in (5.6). Let $M_{0}$ be the bound on the norm of the elements of $E_{j}(u^{0})$ from Section 5.1 such that $r_{j}(u^{k})\leq\Phi(u^{k})\leq M_{0}\left(\lVert p^{k}\rVert_{\mathcal{C}}-\lambda^{k}\right)$ with Lemma 5.8. Defining the constant $\delta>0$ by

[TABLE]

and combining the previous estimates we have that

[TABLE]

Subtracting $j(\bar{u})$ from both sides, it follows that

[TABLE]

Denote by $\bar{k}$ an index such that all previous results hold for all $k\geq\bar{k}$ . By induction we obtain $r_{j}(u^{k})\leq(1-\delta)^{k-\bar{k}}r_{j}(u^{\bar{k}})$ . Setting $\zeta_{1}=(1-\delta)$ and $c=r_{j}(u^{\bar{k}})/\zeta_{1}^{\bar{k}}$ yields the result. ∎
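The arithmetic behind the step size and the contraction factor can be sketched in a few lines. The sketch assumes that the upper bound on $j(u^{k+1/2})-j(u^{k})$ has the form $-\widehat{\mu}^{k}e^{k}\big(s-\tfrac{1}{2}c_{1}\widehat{\mu}^{k}s^{2}\big)$ with $e^{k}=\lVert p^{k}\rVert_{\mathcal{C}}-\lambda^{k}$ , which is our reading of how Proposition 5.14 and Lemma 5.15 combine and is consistent with the minimizer $s^{k}=\min\{1,1/(c_{1}\widehat{\mu}^{k})\}$ . The per-iteration factor computed below is one possible choice and is not claimed to be the paper's constant $\delta$ ; both function names are hypothetical.

```python
def step_and_contraction(c1, mu_hat, M0):
    """Step size and per-iteration contraction under the assumed quadratic model.

    Model (an assumption of this sketch):
        j(u_half) - j(u) <= -mu_hat * e * (s - 0.5 * c1 * mu_hat * s**2),
    combined with r_j(u) <= M0 * e.
    """
    # Minimizer over [0, 1] of s -> -s + 0.5 * c1 * mu_hat * s**2.
    s = min(1.0, 1.0 / (c1 * mu_hat))
    # Guaranteed decrease per unit of e = ||p||_C - lambda.
    decrease_per_e = mu_hat * (s - 0.5 * c1 * mu_hat * s * s)
    # Since e >= r_j / M0, the residual contracts by the factor 1 - delta.
    delta = decrease_per_e / M0
    return s, delta

def linear_rate_constants(delta, r_kbar, k_bar):
    """Constants zeta1 and c with r_j(u^k) <= c * zeta1**k for k >= k_bar."""
    zeta1 = 1.0 - delta
    c = r_kbar / zeta1 ** k_bar
    return zeta1, c
```

With these constants, the bound $c\,\zeta_{1}^{k}$ equals $(1-\delta)^{k-\bar{k}}r_{j}(u^{\bar{k}})$ , which is the form obtained by the induction above.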

5.3. Improved rates for the iterates

This section is devoted to quantitative convergence results for the sequence of iterates $u^{k}$ . While norm convergence towards the minimizer cannot be expected in general, the weak* convergence of the iterates implies convergence of the support points of $u^{k}$ towards those of $\bar{u}$ as well as convergence of the coefficient functions.
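The two quantities studied in this section, the distance of the support points of $u^{k}$ to the optimal positions and the lumped coefficients on the balls $B_{R}(\bar{x}_{n})$ , are straightforward to evaluate for a sparse iterate. The following sketch assumes the iterate is stored as arrays of positions and coefficients and that the optimal positions are known (as in a synthetic test problem); the function name and data layout are hypothetical.

```python
import numpy as np

def lump_and_localize(points, coeffs, centers, radius):
    """Lumped coefficients and support errors of a sparse measure.

    points  : (m, d) positions of the Dirac deltas of the iterate
    coeffs  : (m, h) coefficients of the Dirac deltas
    centers : (N, d) optimal support points (assumed known for this sketch)
    radius  : ball radius R

    Returns the (N, h) array of summed coefficients per ball and the length-N
    array of largest distances to the center within each ball (-inf for a
    ball that contains no support point).
    """
    points = np.asarray(points, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Pairwise distances between support points and centers, shape (m, N).
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    inside = dist < radius

    # Sum the coefficients of all Dirac deltas lying in each ball.
    lumped = inside.T.astype(float) @ coeffs

    # Largest distance to the center among the points in each ball.
    support_error = np.where(inside, dist, -np.inf).max(axis=0)
    return lumped, support_error
```

Several support points of $u^{k}$ may cluster around the same $\bar{x}_{n}$ , which is why the coefficients are summed per ball before being compared with $\bar{u}(\bar{x}_{n})$ .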

5.3.1. Rates for the support points

In this section we address the linear convergence of $\mathcal{A}_{k}=\operatorname{supp}u^{k}$ towards the support points of $\bar{u}$ . More precisely, we prove that

[TABLE]

for some $\zeta_{2}\in(0,1)$ and $k$ sufficiently large. For this purpose recall that

[TABLE]

for $k$ sufficiently large according to Lemma 5.13. In view of Theorem 5.17 it thus suffices to quantify the convergence of $\bar{\lambda}-\lambda^{k}$ from Corollary 5.10 in terms of the residual $r_{j}(u^{k})$ .

Lemma 5.18.

There exists $c>0$ such that for all $k$ large enough

[TABLE]

Proof.

Recall the definition of the dual certificates $\bar{P}$ and $P^{k}$ from (5.13) and (5.14), respectively. First note that Lemma 5.18 holds if $u^{k}=\bar{u}$ for some $k$ . Therefore, without restriction assume that $u^{k}\neq\bar{u}$ for all $k$ . Let $\widehat{x}^{k-1}$ denote the new candidate point determined in the previous iteration of Algorithm 1. We now claim that $\widehat{x}^{k-1}\in\mathcal{A}_{k}=\operatorname{supp}u^{k}$ for all $k$ large enough. Indeed, if this is not the case, then we have $\mathcal{A}_{k}\subset\mathcal{A}_{k-1}$ and thus

[TABLE]

This gives a contradiction to $r_{j}(u^{k})<r_{j}(u^{k-1})$ for all $k$ large enough due to (5.27). From (4.4) and Lemma 5.12 we thus conclude

[TABLE]

for $\bar{x}_{\,\widehat{n}_{k-1}}$ the support point closest to $\widehat{x}^{k-1}$ as in (5.6). Summarizing the previous observations we finally have

[TABLE]

due to the monotonicity of $r_{j}(u^{k})$ and Lemma 5.6. ∎

Combining Lemmas 5.13 and 5.18, we obtain the following convergence results for the support points.

Theorem 5.19.

Suppose that Assumptions 3.1, 3.2, 3.3, and 5.1 hold and that $u^{k}$ is generated by Algorithm 1. There exist $c>0$ and $0<\zeta_{2}<1$ such that for all $k$ large enough it holds

[TABLE]

Proof.

Due to the monotonicity of $r_{j}(u^{k})$ , Theorem 5.17 and Lemma 5.18 there exists $0<\zeta_{1}<1$ with

[TABLE]

By setting $\zeta_{2}=\zeta_{1}^{1/4}$ we deduce (5.28) by combining this with the estimate (5.25) in Lemma 5.13. ∎
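The relation $\zeta_{2}=\zeta_{1}^{1/4}$ has a simple practical reading: reducing the bound on the support error by a given factor takes four times as many iterations as reducing the bound on the residual by the same factor. A short sketch, in which the function name and the sample value of $\zeta_{1}$ are hypothetical:

```python
import math

def iterations_to_reduce(zeta, factor):
    """Smallest integer k with zeta**k <= factor, for 0 < zeta, factor < 1."""
    return math.ceil(math.log(factor) / math.log(zeta))

# Hypothetical residual rate; the true value depends on the problem constants.
zeta1 = 0.5
zeta2 = zeta1 ** 0.25  # rate for the support points

k_residual = iterations_to_reduce(zeta1, 1e-3)
k_support = iterations_to_reduce(zeta2, 1e-3)
```

Since $\log\zeta_{2}=\tfrac{1}{4}\log\zeta_{1}$ , the exact ratio of the two iteration counts is four before rounding up.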

5.3.2. Rates for the coefficients

Next we address the convergence of the lumped coefficient function $u^{k}(B_{R}(\bar{x}_{n}))$ introduced in (4.7) towards the optimal coefficient $\bar{u}(\bar{x}_{n})$ . We will establish the estimate

[TABLE]

with $\zeta_{2}\in(0,1)$ as in the previous section. We start with the following observation.

Lemma 5.20.

There exists a constant $c>0$ such that, for all $k$ large enough,

[TABLE]

Proof.

First recall that $\lVert u^{k}\rVert_{\mathcal{M}}=\sum^{N}_{n=1}\left\lVert u^{k}(B_{R}(\bar{x}_{n}))\right\rVert_{H}\leq M_{0}$ . The result readily follows from the triangle inequality and

[TABLE]

Therefore, in order to establish the desired result, it suffices to quantify the convergence of the norms and the normalized coefficient functions. We start with the latter one.

Lemma 5.21.

There exists a $c>0$ such that for all $n$ and $x\in\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ it holds

[TABLE]

Proof.

Let $x\in\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ be arbitrary but fixed. From Corollary 3.5 and Proposition 4.1 we recall that

[TABLE]

Now, the error is split into three parts

[TABLE]

For the first term we use Lemma 5.18 to obtain

[TABLE]

due to (5.29) and since $\lambda^{k}\bar{\lambda}$ is bounded away from zero. From the Lipschitz continuity of $\bar{p}$ and the uniform convergence of $p^{k}$ the remaining terms are estimated by

[TABLE]

Using (5.28) and $\lVert\bar{p}-p^{k}\rVert_{\mathcal{C}}\leq r_{j}(u^{k})^{1/4}$ from Lemma 5.6, for all $k$ large enough we obtain

[TABLE]

independent of $x$ with (5.29). Adding both estimates yields the result. ∎

Next we address the convergence of the norms. For this purpose we require the following auxiliary result.

Lemma 5.22.

There exists a $c>0$ such that for all $n$, $x\in\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$, and $k$ large enough it holds

[TABLE]

Proof.

The proof follows similar steps as in Lemma 5.15. Fix an index $n$ and $x\in\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})$ . For $\varphi\in Y$ we obtain

[TABLE]

for some constant $c>0$ independent of $x$ and $n$ ; see Theorem 5.19 and Lemma 5.21. Since $\varphi\in Y$ was chosen arbitrarily, the desired statement follows. ∎

The next statement characterizes the convergence behavior of the norm of the lumped coefficient.

Proposition 5.23.

Suppose that Assumptions 3.1, 3.2, 3.3, and 5.1 hold and that $u^{k}$ is generated by Algorithm 1. There exists a constant $c>0$ such that, for all $k$ large enough,

[TABLE]

Proof.

Define the vectors $\bar{\mu},\mu^{k}\in\mathbb{R}^{N}$ with $\bar{\mu}_{n}=\lVert\bar{u}(\bar{x}_{n})\rVert_{H}$ and $\mu^{k}_{n}=\lVert u^{k}(B_{R}(\bar{x}_{n}))\rVert_{H}$ and recall the definition of the operator $\bar{\bm{K}}\mu=\sum_{n}\bm{k}(\bar{x}_{n},\bar{\bm{u}}_{n}/\lVert\bar{\bm{u}}_{n}\rVert_{H})\mu_{n}$ from (3.6), which is injective due to Assumption 5.1. Thus, by standard arguments, there exists $c>0$ with $|\mu|_{\mathbb{R}^{N}}\leq c\lVert\bar{\bm{K}}\mu\rVert_{Y}$ for any $\mu\in\mathbb{R}^{N}$ . Using this, we estimate

[TABLE]

We further estimate

[TABLE]

where we use Lemma 5.22 and $\lVert u^{k}\rVert_{\mathcal{M}}=\sum_{n=1}^{N}\sum_{x\in\mathcal{A}_{k}\cap B_{R}(\bar{x}_{n})}\lVert u^{k}(x)\rVert_{H}\leq M_{0}$ . Using Lemma 5.6 we obtain

[TABLE]

for all $k$ large enough, finishing the proof. ∎

Summarizing all previous estimates we arrive at the following theorem.

Theorem 5.24.

There exists a constant $c>0$ such that, for all $k$ large enough it holds

[TABLE]

Proof.

This follows with Lemma 5.20 and the estimates in Lemma 5.21 and Proposition 5.23. ∎

5.3.3. Convergence rates in weaker norms

As already pointed out, norm convergence of $u^{k}$ towards the unique minimizer $\bar{u}$ in $\mathcal{M}(\Omega,H)$ cannot be expected in general. However, norm convergence results can still be obtained by resorting to weaker spaces. In particular, since the space of Lipschitz continuous functions embeds compactly into $\mathcal{C}(\Omega,H)$, weak* convergence on $\mathcal{M}(\Omega,H)$ implies strong convergence with respect to the canonical norm on the topological dual space of $\mathcal{C}^{0,1}(\Omega,H)$. To this end, we point out that

[TABLE]

for all $u\in\mathcal{M}(\Omega,H)$ . This closely relates the considered dual norm to the Wasserstein-1 distance for probability measures [31], and the Kantorovich-Rubinshtein norm for scalar-valued measures [5].

Theorem 5.25.

There exists a constant $c>0$ such that, for all $k$ large enough,

[TABLE]

Proof.

Let $\varphi\in\mathcal{C}^{0,1}(\Omega,H)$ with $\|\varphi\|_{\mathcal{C}^{0,1}(\Omega,H)}\leq 1$ be given. We estimate

[TABLE]

Fix an arbitrary index $n$ and split the error on the right-hand side of the last inequality as

[TABLE]

The first term is bounded by

[TABLE]

for some constant $c>0$ independent of $n$ following Theorem 5.24. For the second term we use the Lipschitz continuity of $\varphi$ to obtain

[TABLE]

using $\lVert\varphi\rVert_{\mathcal{C}^{0,1}(\Omega,H)}\leq 1$ and the convergence results for the support points in Theorem 5.19. Again, the constant $c>0$ can be chosen independent of the index $n$ . Combining all previous observations we conclude

[TABLE]

Taking the supremum over all $\varphi\in\mathcal{C}^{0,1}(\Omega,H)$ with $\|\varphi\|_{\mathcal{C}^{0,1}(\Omega,H)}\leq 1$ yields the claim. ∎

6. Numerical experiments

In order to illustrate the theoretical results, we perform tests on a simple example with $\Omega=[-1,1]\subset\mathbb{R}$. We consider the vector-valued case with $H=\mathbb{C}^{2}\cong\mathbb{R}^{4}$. Motivated by the task of inverse sound source location, we consider the convolution kernel

[TABLE]

corresponding to the fundamental solution of the three-dimensional free-space Helmholtz equation at wave number $\kappa$ evaluated at distance $D=1/2$.

For testing purposes, we consider an exact source $u^{\star}\in\mathcal{M}(\Omega,H)$ consisting of three Dirac delta functions and observe the solutions of the Helmholtz equation with wave numbers $\kappa_{1}=4\pi$ and $\kappa_{2}=6\pi$ at several points $y_{m}\in[-1,1]$ , $m=1,\ldots,M$ . Choosing the kernel

[TABLE]

the corresponding integral operator $K$ from (1.3) describes these observations. Then we consider observations of this source perturbed by additive Gaussian noise $y_{d}=Ku^{\star}+w$ , with relative noise level of $10\%$ . The exact source and the observations are visualized in Figure 1(a).
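The data generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the displayed kernel is not reproduced here, so the sketch uses the standard free-space fundamental solution $e^{i\kappa r}/(4\pi r)$ with $r=\sqrt{(x-y)^{2}+D^{2}}$, pairs the $j$-th component of each coefficient in $\mathbb{C}^{2}$ with the wave number $\kappa_{j}$, and interprets the relative noise level as $\lVert w\rVert=0.1\,\lVert Ku^{\star}\rVert$. The function names and the argument layout are hypothetical.

```python
import cmath
import math
import random

D = 0.5                                # distance D = 1/2 (from the text)
KAPPAS = (4 * math.pi, 6 * math.pi)    # wave numbers kappa_1, kappa_2 (from the text)


def norm(v):
    """Euclidean norm of a list of complex numbers."""
    return math.sqrt(sum(abs(z) ** 2 for z in v))


def helmholtz_kernel(x, y, kappa, dist=D):
    """Assumed kernel: exp(i*kappa*r) / (4*pi*r) with r = sqrt((x-y)^2 + dist^2)."""
    r = math.hypot(x - y, dist)
    return cmath.exp(1j * kappa * r) / (4 * math.pi * r)


def forward(positions, coeffs, obs_points):
    """Apply K to sum_n coeffs[n] * delta_{positions[n]}, with coeffs[n] in C^2.

    Assumption: component j of a coefficient is observed at wave number KAPPAS[j].
    """
    data = []
    for y in obs_points:
        for j, kappa in enumerate(KAPPAS):
            data.append(sum(helmholtz_kernel(x, y, kappa) * u[j]
                            for x, u in zip(positions, coeffs)))
    return data


def add_relative_noise(data, level, rng):
    """Add complex Gaussian noise w rescaled so that ||w|| = level * ||data||."""
    noise = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in data]
    scale = level * norm(data) / norm(noise)
    return [d + scale * w for d, w in zip(data, noise)]
```

A call such as `add_relative_noise(forward(xs, us, ys), 0.1, random.Random(0))` then produces synthetic observations $y_{d}$ for hypothetical source positions `xs`, coefficients `us`, and observation points `ys`.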

To recover the source from these measurements, we solve the convex regularized source location problem ( $\mathcal{P}_{\textrm{source}}$ ) with $\beta=1$. Since the analytical solution $\bar{u}$ is unknown, we compute a reference solution to high accuracy by employing Algorithm 1 with $\mathrm{TOL}=10^{-13}$. Additionally, to obtain the correct number of Dirac delta functions $N=3$, we post-process the final iterate along the lines of (4.8) by combining sources whose location points differ by less than $R=10^{-5}$. We depict this approximation to the optimal solution in Figure 1(b), together with the dual variable $\bar{p}=-K^{\star}(K\bar{u}-y_{d})$ and the absolute value $\bar{P}(x)=\lVert\bar{p}(x)\rVert_{H}$. It is evident that the strong sufficient conditions from Assumption 3.3 are fulfilled, validating the post-processing described above. Also, we numerically compute the condition number of the matrix (3.6) as $\operatorname{cond}(\bar{\bm{K}})\approx 1.44$, providing numerical evidence for Assumption 3.2.
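The post-processing step of combining nearby sources can be illustrated by the following sketch. It rests on assumptions not fixed by the text: points are grouped by chaining neighbours in the sorted order on the real line, the coefficients of a group are summed (in the spirit of the lumped coefficient $u^{k}(B_{R}(\bar{x}_{n}))$ ), and the merged location is the average of the grouped locations weighted by the coefficient norms. The function name `lump_sources` is hypothetical.

```python
import math


def lump_sources(positions, coeffs, radius=1e-5):
    """Merge Dirac deltas on the real line whose sorted neighbours are closer than radius.

    positions: list of floats; coeffs: list of tuples (one vector in H per point).
    Returns (merged_positions, merged_coeffs).
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    clusters = []
    for i in order:
        # Chain the point to the current cluster if it is close to its last member.
        if clusters and positions[i] - positions[clusters[-1][-1]] < radius:
            clusters[-1].append(i)
        else:
            clusters.append([i])

    merged_pos, merged_coef = [], []
    for cluster in clusters:
        weights = [math.sqrt(sum(abs(c) ** 2 for c in coeffs[i])) for i in cluster]
        total = sum(weights)
        if total == 0.0:
            continue  # a cluster without mass does not contribute to the measure
        # Assumed choice: norm-weighted mean of the clustered locations.
        merged_pos.append(sum(w * positions[i] for w, i in zip(weights, cluster)) / total)
        dim = len(coeffs[cluster[0]])
        merged_coef.append(tuple(sum(coeffs[i][j] for i in cluster) for j in range(dim)))
    return merged_pos, merged_coef
```

Other choices for the merged location (for example the point carrying the largest coefficient) would serve equally well at the scale $R=10^{-5}$.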

We compare Algorithm 1 to different versions of the accelerated GCG method Algorithm 2, in order to study the influence of the optional coefficient minimization step 4. We consider:

GCG

For plain GCG, we omit step 4 in Algorithm 2 and set $u^{k+1}=u^{k+1/2}$.

SPINAT( $l$ )

Here, we adapt the procedure from [9] and perform $l\geq 1$ additional proximal gradient steps for ( $\mathcal{P}_{\mathcal{A}}$ ) on the current support, started at $u^{k+1/2}$, to obtain $u^{k+1}$ in step 4 of Algorithm 2. We select the proximal gradient stepsize by an Armijo line-search rule.

PDAP

We solve the subproblem ( $\mathcal{P}_{\mathcal{A}}$ ) arising in step 4 to machine precision, resulting in Algorithm 1.
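To make the SPINAT update concrete, the following sketch shows a single proximal gradient step on a fixed support. It assumes that the nonsmooth part of ( $\mathcal{P}_{\mathcal{A}}$ ) is $\beta\sum_{n}\lVert\bm{u}_{n}\rVert_{H}$, whose proximal map is the well-known block soft-thresholding. The gradient of the smooth part must be supplied by the caller, the stepsize is fixed rather than chosen by the Armijo rule mentioned above, and the function names are hypothetical. It does not reproduce the exact procedure of [9].

```python
import math


def block_soft_threshold(v, tau):
    """Proximal map of tau * ||.||_H at v: shrink the Euclidean norm of v by tau."""
    nrm = math.sqrt(sum(abs(z) ** 2 for z in v))
    if nrm <= tau:
        return tuple(0 * z for z in v)  # the whole block is set to zero
    factor = 1.0 - tau / nrm
    return tuple(factor * z for z in v)


def prox_grad_step(coeffs, grads, step, beta):
    """One step u_n <- prox_{step*beta*||.||}(u_n - step * grad_n) per support point.

    coeffs, grads: lists of tuples, one vector in H per support point.
    """
    return [
        block_soft_threshold(tuple(u - step * g for u, g in zip(un, gn)), step * beta)
        for un, gn in zip(coeffs, grads)
    ]
```

Blocks that are thresholded to zero correspond to support points that the step removes. Since the step only rescales and shifts coefficients on the existing support, it cannot move a location $x_{i}^{k}$.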

We briefly discuss the practical implementation aspects. Concerning the computation of the global maximum $\widehat{x}^{k}$ of $x\mapsto P^{k}(x)=\lVert p^{k}(x)\rVert_{H}$, we solve a number of independent local nonlinear optimization problems using a Newton method initialized at 30 uniformly spaced points in $[-1,1]$ and at the existing support points $\operatorname{supp}u^{k}$. From those local maxima we select the point $\widehat{x}^{k}$ by a direct search. For the solution of the subproblems ( $\mathcal{P}_{\mathcal{A}}$ ) in PDAP we employ a semismooth Newton method (SSN) [57, 48] with a globalization strategy based on a line search on the objective functional [51, Section 3.5]. The SSN algorithm is initialized with the coefficients of the intermediate iterate $u^{k+1/2}$ from step 3 of Algorithm 2. Due to the superlinear convergence properties, it terminates in a finite number of steps with the solution $\bm{u}^{k+1}$ up to machine precision, and also identifies the nonzero coefficients of $\bm{u}^{k+1}$ as part of the solution process, which define the support of the new iterate $u^{k+1}$.
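The multi-start strategy for the global maximum can be sketched as follows. Note the substitution: instead of the Newton method used by the authors, this illustration refines each start by a derivative-free golden-section search on a small bracket around it, which keeps the example free of derivatives of $P^{k}$. The bracket width, the tolerance, and the function names are assumptions.

```python
import math


def golden_max(f, a, b, tol=1e-10):
    """Golden-section search for a local maximizer of f on the bracket [a, b]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximizer lies in [a, d]; reuse c as the new right interior point.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The maximizer lies in [c, b]; reuse d as the new left interior point.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def multistart_argmax(f, support, n_starts=30, lo=-1.0, hi=1.0):
    """Refine uniformly spaced starts and current support points, keep the best one."""
    spacing = (hi - lo) / (n_starts - 1)
    starts = [lo + spacing * i for i in range(n_starts)] + list(support)
    candidates = [lo, hi]  # the endpoints are always compared as well
    for s in starts:
        candidates.append(golden_max(f, max(lo, s - spacing), min(hi, s + spacing)))
    # Direct search over all refined candidates.
    return max(candidates, key=f)
```

As with any multi-start scheme, the global maximum is only found if at least one bracket contains it, so the number of starts has to match the oscillation of $P^{k}$.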

Now, we run the aforementioned algorithms with a tolerance of $\mathrm{TOL}=10^{-12}$ for a maximum of $50$ steps. The corresponding functional residuals are given in Figure 2(a).

For GCG, we clearly observe the predicted sublinear convergence rate from Theorem 5.4: the residual decreases slowly in later iterations and appears effectively stagnant. SPINAT achieves a larger reduction in the residual, but is affected by the same effective stagnation in later iterations. PDAP initially performs very similarly to both of these methods, but converges at a linear rate after the third step. In particular, it terminates within the tolerance after $41$ steps. This agrees with the convergence result from Theorem 5.1. Clearly, PDAP yields the best results compared to the other methods in every iteration.

Considering that the full resolution of the subproblem ( $\mathcal{P}_{\mathcal{A}}$ ) is more expensive than the simple update of the GCG and SPINAT methods, we also plot the residual over the wall clock time in Figure 2(b). We observe that the added cost of PDAP does not outweigh the benefits, since, in fact, the computation time in each step is heavily dominated by the computation of the global maximum $\widehat{x}^{k}$ of a nonconvex problem.

To check whether the improved convergence estimates for the source points and coefficients can be observed in practice, we compute the maximum distance of each support point of the iterates of PDAP to the closest support point of the reference solution as in Theorem 5.19. Moreover, we compute the locally lumped coefficients as given in Theorem 5.24 and compute their maximum error with respect to the corresponding reference values. As predicted by theory, both quantities converge at a linear rate, where we empirically estimate $\zeta_{2}\approx 0.72$ ; see Figure 2(c).
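An empirical rate such as $\zeta_{2}$ can be estimated from a recorded error sequence by fitting the model $e_{k}\approx c\,\zeta^{k}$. The sketch below does this with an ordinary least-squares fit of $\log e_{k}$ against $k$. It is one plausible procedure, not necessarily the one used for Figure 2(c), and the function name and the `skip` parameter (for discarding the pre-asymptotic phase) are hypothetical.

```python
import math


def estimate_linear_rate(errors, skip=0):
    """Fit log(e_k) = log(c) + k * log(zeta) by least squares and return (c, zeta).

    Entries with index below skip, and nonpositive entries, are ignored.
    """
    pts = [(k, math.log(e)) for k, e in enumerate(errors) if k >= skip and e > 0]
    n = len(pts)
    mean_k = sum(k for k, _ in pts) / n
    mean_l = sum(l for _, l in pts) / n
    slope = (sum((k - mean_k) * (l - mean_l) for k, l in pts)
             / sum((k - mean_k) ** 2 for k, _ in pts))
    return math.exp(mean_l - slope * mean_k), math.exp(slope)
```

On a synthetic sequence `[2.0 * 0.72 ** k for k in range(20)]` the fit recovers the rate $0.72$ up to rounding. On measured data the result depends on which iterations are included, so the early iterations are best excluded via `skip`.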

To further assess the properties of the solutions obtained by each method, we plot the evolution of the support sizes of the computed iterates in Figure 3(a).

Clearly, GCG inserts a new point in every iteration, which means that the support size is proportional to the iteration counter. PDAP behaves almost ideally, since the number of points is bounded by a number that is only twice the number of support points of the true source. The versions of SPINAT are also able to eliminate some support points, but only in later iterations, and they do not achieve a meaningful reduction. For all of the methods, we observe a clustering of sources around the optimal location, but as the zoomed-in plot in Figure 3(b) shows, PDAP produces a cluster of only two points at high accuracy, whereas SPINAT(100), albeit delivering the smallest residual out of the GCG and SPINAT experiments, has a large cluster of points at substantial distance from the optimal location. While for GCG this behavior is to be expected, it may appear surprising that $100$ proximal gradient iterations on the current support are not sufficient to move enough “mass” of the coefficients to the improved location points inserted in more recent iterations. This stems from the fact that the mapping $(\bm{u}_{i})_{i=1,\ldots,N^{k}}\mapsto\left(\bm{k}(x_{i}^{k},\bm{u}_{i})\right)_{i=1,\ldots,N^{k}}$ becomes increasingly ill-conditioned the more points $x_{i}^{k}\in\mathcal{A}_{k}$ cluster around the same point $\bar{x}_{n}$, and this adversely affects the convergence of the proximal gradient method. This highlights the benefit of employing second-order optimization methods for the subproblems ( $\mathcal{P}_{\mathcal{A}}$ ), which are not affected as much by this ill-conditioning. In particular, for the given implementation using semismooth Newton methods, new support points are inserted at vastly improved locations due to the improved descent in the functional in the previous iteration, and old support points at locations far from the optimum can be eliminated reliably.
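The ill-conditioning caused by clustered support points can be observed in a small experiment. The sketch below is a simplified scalar stand-in for the setting above: it assumes a single wave number $\kappa=4\pi$, the kernel $e^{i\kappa r}/(4\pi r)$ with $r=\sqrt{(x-y)^{2}+D^{2}}$, and hypothetical observation points, and it computes the condition number of the $2\times 2$ Gram matrix of the kernel columns belonging to two support points.

```python
import cmath
import math


def kernel_column(x, obs_points, kappa=4 * math.pi, dist=0.5):
    """Assumed kernel exp(i*kappa*r) / (4*pi*r) evaluated at all observation points."""
    col = []
    for y in obs_points:
        r = math.hypot(x - y, dist)
        col.append(cmath.exp(1j * kappa * r) / (4 * math.pi * r))
    return col


def gram_condition(x1, x2, obs_points):
    """Condition number of the Hermitian 2x2 Gram matrix of the columns at x1 and x2."""
    c1, c2 = kernel_column(x1, obs_points), kernel_column(x2, obs_points)
    a = sum(abs(z) ** 2 for z in c1)
    d = sum(abs(z) ** 2 for z in c2)
    b = sum(z1.conjugate() * z2 for z1, z2 in zip(c1, c2))
    # Eigenvalues of [[a, b], [conj(b), d]] in closed form.
    mean = (a + d) / 2
    rad = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return (mean + rad) / (mean - rad)
```

Evaluating `gram_condition(0.2, 0.2 + h, obs)` for decreasing separations `h` shows the condition number growing as the two points approach each other, which is the mechanism that slows down first-order methods on the coefficient subproblem. For very small `h` the closed-form eigenvalue difference loses accuracy to cancellation, so the sketch is only meant for moderate separations.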

Bibliography

[1] S. D. Ahipasaoglu, P. Sun, and M. J. Todd, Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids, Optim. Methods Softw., 23 (2008), pp. 5–19.
[2] C. D. Aliprantis and K. C. Border, Infinite dimensional analysis: A hitchhiker’s guide, Springer, Berlin, third ed., 2006.
[3] J.-M. Azaïs, Y. de Castro, and F. Gamboa, Spike detection from inaccurate samplings, Appl. Comput. Harmon. Anal., 38 (2015), pp. 177–195.
[4] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[5] V. I. Bogachev, Measure theory. Vol. I, II, Springer-Verlag, Berlin, 2007.
[6] N. Boyd, G. Schiebinger, and B. Recht, The alternating descent conditional gradient method for sparse inverse problems, SIAM J. Optim., 27 (2017), pp. 616–639.
[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn., 3 (2011), pp. 1–122.
[8] K. Bredies and M. Carioni, Sparsity of solutions for variational inverse problems with finite-dimensional data, Calc. Var., 59 (2020).
