Linear convergence of accelerated conditional gradient algorithms in spaces of measures
Konstantin Pieper, Daniel Walter

TL;DR
This paper introduces a class of generalized conditional gradient algorithms for optimization in spaces of Radon measures, demonstrating sub-linear convergence generally and linear convergence under certain structural assumptions.
Contribution
The paper develops a new class of algorithms for measure space optimization and proves their convergence rates, including local linear convergence under specific conditions.
Findings
Achieves a sub-linear $ ext{O}(1/k)$ convergence rate in general cases.
Under structural assumptions, attains local linear convergence with rate $ ext{O}( ho^k)$.
Provides analysis for finite-dimensional subproblem resolution within the algorithm.
Abstract
A class of generalized conditional gradient algorithms for the solution of optimization problem in spaces of Radon measures is presented. The method iteratively inserts additional Dirac-delta functions and optimizes the corresponding coefficients. Under general assumptions, a sub-linear rate in the objective functional is obtained, which is sharp in most cases. To improve efficiency, one can fully resolve the finite-dimensional subproblems occurring in each iteration of the method. We provide an analysis for the resulting procedure: under a structural assumption on the optimal solution, a linear convergence rate is obtained locally.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Linear convergence of accelerated
conditional gradient algorithms in spaces of measures
Konstantin Pieper
Computer Science and Mathematics Division, Oak Ridge National Laboratory, One Bethel Valley Road, P.O. Box 2008, MS-6211, Oak Ridge, TN 37831 ([email protected])
and
Daniel Walter
Johann Radon Institute for Computational and Applied Mathematics, ÖAW, Altenbergerstraße 69, 4040 Linz, Austria ([email protected])
Abstract.
A class of generalized conditional gradient algorithms for the solution of optimization problem in spaces of Radon measures is presented. The method iteratively inserts additional Dirac-delta functions and optimizes the corresponding coefficients. Under general assumptions, a sub-linear rate in the objective functional is obtained, which is sharp in most cases. To improve efficiency, one can fully resolve the finite-dimensional subproblems occurring in each iteration of the method. We provide an analysis for the resulting procedure: under a structural assumption on the optimal solution, a linear convergence rate is obtained locally.
Key words and phrases:
vector-valued finite Radon measures, generalized conditional gradient, sparsity, nonsmooth optimization
1991 Mathematics Subject Classification:
46E27, 65J22, 65K05, 90C25, 49M05
KP acknowledges funding by the US Air Force Office of Scientific Research grant FA9550-15-1-0001 and the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). DW acknowledges support by the DFG through the International Research Training Group IGDK 1754 “Optimization and Numerical Analysis for Partial Differential Equations with Nonsmooth Structures”. Furthermore, support from the TopMath Graduate Center of TUM Graduate School at Technische Universität München, Germany and from the TopMath Program at the Elite Network of Bavaria is gratefully acknowledged.
1. Introduction
In this paper we consider generalized conditional gradient methods for sparse optimization problems, where the optimization variable lies in a space of measures. These problems arise in different contexts, and they are intrinsically related to certain optimization problems in terms of the spatial location parameters and associated coefficient variables: For the purposes of this paper, we want to find a “sparse” measure, which consists of a sum of Dirac delta functions,
[TABLE]
with a finite point set from a continuous candidate set (a possibly uncountably infinite compact subset of , ) and corresponding coefficients in a Hilbert space (for instance, , , , etc.), and the cardinality of the support set. It should be emphasized that neither the number of points, nor the coefficients are subject to any further restrictions. Usually, the measure has a physical interpretation as a number of point-wise sources or sensors in a physics-based model. There are many applications, where one is interested to choose , , and to minimize a functional of the form:
[TABLE]
Here, is a suitable design functional or quality criterion for the variable (which we will also refer to as observation variable), which is given in terms of the kernel function , and evaluates the response of a model to the optimization variables and . The second term, which is expressed in terms of the sum of the norms of the coefficients (the norm of ) models either the cost of the coefficient variable, or is added as a regularization term to ensure that the coefficients are sufficiently small.
Often, the functionals and are convex, but is linear only in the coefficients , but not in the location parameters . Thus, the corresponding optimization problem is not convex:
[TABLE]
Moreover, it has a combinatorial aspect, since is not fixed. However, by embedding this problem into a more general formulation, a convex formulation can be obtained. Concretely, the sparse measure (1.1) can be considered as an element of the space of regular vector-measures . Requiring to be continuous in the coefficients, we can introduce the (integral) operator and the total variation norm as
[TABLE]
We refer to section 2 for the rigorous definitions in the case of a general measure from the space of vector measures. Now, we can formulate the following generalized convex optimization problem:
[TABLE]
Note that the formulation () is more general than (1.2), since not all vector measure are of the form (1.1) (in particular, the Lebesgue space is contained in . However, in many cases, the solutions of () have the desired discrete sparsity structure. In particular, if is a finite-dimensional space, sparse solutions with can always be found; see, e.g., [8] or [58, Proposition 6.32]. This then renders both problem formulations essentially equivalent.
Our main motivation for studying these problems is given by applications in inverse source location [9, 52], optimal control [15, 39, 40, 30], or compressed sensing [10, 24, 3]: Here, () is of the simpler form:
[TABLE]
where encodes a collection of vector valued signals originating from a number of source locations , and models the signal that will be received by a measurement setup. The data vector contains (potentially noisy) observations obtained in practice, and the first term measures the misfit of the data to the response of the model. Often, such models involve trigonometric polynomials or other analytically given functions [24, 3, 10]. More complicated models involve partial differential equations [13, 52, 9]. Here, corresponds to (possibly pointwise) observations of the PDE solution corresponding to a source term .
A second motivation arises in the theory of optimal design [27, 53, 56], going back to the concept of approximate designs by Kiefer and Wolfowitz [38]. Here, corresponds to a spatial sensor location, to the error variance of the corresponding sensor, and to the Fisher information matrix of a linear (or linearized) Gaussian model associated to a measurement setup . In this case, there are various different “information criteria” to evaluate the quality of the overall measurement setup , which are usually convex, smooth, but extended real valued functionals (allowing for the value ). is often chosen to be a convex indicator function to enforce , but a cost term as in () can also be considered; see, e.g., [50].
With the intent of providing a unified analysis that covers all of the mentioned problem instances, we study the general formulation (), where we impose additional assumptions: We require certain regularity and coercivity properties of the convex functions and , which covers the examples mentioned above (see Assumptions 3.1 and 5.1); and we require second order differentiability of the kernel with respect to around the optimal locations (see Section 2), which can be verified in many of the mentioned cases.
Accelerated GCG methods
The objective of this paper is to analyze certain sequential point insertion and coefficient optimization methods as efficient solution algorithms for sparse optimization problems of the form (). We refer to [6, 9] for a description and analysis of the method applied to special instances of the general problem (). Starting from a sparse initial measure of the form (1.1), these type of algorithms generates a sequence of sparse iterates , , by the iterative procedure
[TABLE]
where maximizes a certain continuous function over the set , which is computed from the previous iterate ; see Algorithm 2 below. The new source location and the coefficient are chosen such that corresponds to a descent direction in a generalized conditional gradient method (GCG) – also known as Frank-Wolfe algorithm [29] – applied to an equivalent reformulation of (). We also point to different variations of the Fedorov-Wynn algorithm [63, 49, 61, 60, 26, 62], developed in the context of approximate design theory, which can be interpreted in this framework.
While the practical implementation of the GCG algorithm is fairly simple, it suffers from slow asymptotic convergence. Several works [50, 25, 9, 6] derive a sublinear convergence rate for the objective functional values of the iterates under mild assumptions on the problem and several choices of the step size . Numerical experiments (e.g., [50]) confirm that this convergence is also observed in practice. Therefore, it is unpractical to solve the problem to high precision, which motivates the introduction of additional acceleration steps. Moreover, the absence of point removal steps leads to undesirable clustering effects: The support size of the iterate grows monotonically with and, in later iterations, new support points are inserted very close to existing ones. As a remedy, one is also interested to incorporate additional sparsification steps which can iteratively remove support points without increasing the objective functional values. In the present work, we consider additional optimization steps based on the sparse representation of the iterates in terms of their support points and coefficients according to (1.1). Defining the updated support (active set) corresponding to (1.4) as , where , we improve the coefficients of the next iterate by approximately solving the coefficient optimization problem
[TABLE]
Note that this is a convex minimization problem on the Hilbert space due to the linearity of the kernel in the argument . In fact, (1.5) has the same structure as (); it is simply its restriction to the space . Since it is also a sparse optimization problem, some coefficients of the associated optimal solution may be zero. In the next iteration, we can thus exclude the corresponding support points from the representation of the measure (1.1), which also serves as a sparsification step. Thus, we obtain the next iterate, by setting for the solution of (1.5), and . In [9] the authors suggests to improve the GCG algorithm by performing several steps of a proximal gradient method for (1.5) starting from the current coefficients as initial guess. Acceleration of GCG by fully resolving the coefficient optimization problem (1.5) in each iteration of the method has been proposed in [6, 25, 61, 52].
Alternatively to coefficient optimization, point moving strategies have been suggested. Here, we additionally solve a finite-dimensional, generally non-convex optimization problem in subject to constraints imposed by the set . In [9] it is proposed to move the support points according to the gradient flow of the smooth part and in [6] it is advocated to employ general purpose optimization methods based on first order derivatives. Further, in [18] the authors propose to include steps which simultaneously optimize the positions and coefficients of the current iterate. Note that the nonconvex (and also nonsmooth, if both coefficients and positions are optimized) point moving problem to be solved here is more computationally intensive than (1.5). Moreover, for sparse minimization problems associated to PDEs, often the kernel is not given analytically and needs to be further approximated [12, 39, 40, 52, 50]. To solve () in practice, the operator is replaced by an approximation employing finite elements. Note that the most commonly employed Lagrangian finite elements are continuous, but not continuously differentiable and thus the objective function is no longer with respect to . This prevents a straightforward algorithmic solution of the point moving problem by derivative based methods, whereas coefficient optimization can be implemented in a straightforward fashion and the new point can be found by a direct search over the grid nodes (cf. [52, 50]). For these reasons, we do not consider point moving in this paper.
Contribution
The main contribution of this paper is to analyze the procedure resulting from combining point insertion steps (1.4) with subsequent full resolution of the coefficient optimization problem (1.5), which is summarized in Algorithm 1. Note that the method can be interpreted as an active set method, where new points are added to the active set at the global maxima of a dual variable, and points are removed if their primal coefficients are set to zero (by resolving (1.5)), we also refer to this method as Primal-Dual-Active-Point strategy (PDAP).
Since the coefficient optimization steps are carried out in addition to the point insertion steps, the convergence rate for GCG is also valid for the accelerated methods. We recall this convergence result in Theorem 5.4 for the general problem formulation (). Concerning the improved convergence behavior of methods combining point insertion and coefficient optimization over GCG – as reported in [52, 50] – we are not aware of any improved theoretical results. However, in this paper, we prove a linear convergence rate for ; see Theorem 5.17. Note that, since the improved result is local in character, we still have to rely on the general convergence result mentioned above to ensure that the iterate is sufficiently close to an optimal solution for large enough. In order to obtain the improved linear convergence result, we impose a non-degeneracy condition on the optimal solution; see Assumptions 3.2 and 3.3. This enables us to derive further convergence results for the location parameters and the coefficients . In particular, we show that the support points of the iterate asymptotically converge towards the support points of the optimal solution, again at a linear rate; see Theorem 5.19. This also gives theoretical evidence for the sparsifying effect of the coefficient optimization steps, since it shows that support points far away from the optimal locations eventually will be removed from the iterate measure. Moreover, we derive convergence estimates for the coefficients. Here, we need to account for the fact that multiple support points of can be close to the each optimal location. Lumping together the corresponding coefficients, we again obtain a linear convergence rate; see Theorem 5.24. Together, this results in a linear convergence rate of the iterate measure in the dual space ; see Theorem 5.25.
We note that the improved convergence rate proved here also requires additional regularity assumptions. In particular, we need second derivatives of the kernel function in , which may not be available if discrete approximations to are employed in practice. We point out that these assumptions are only of technical nature: The computation of the derivatives of the kernel function with respect to the position is not required in the algorithm. Consequently, the method can be readily adapted to discretizations of (), and the linear rate proved here is also observed in practice [52, 50].
Related work
The design of efficient algorithms for () is a challenging task since the space of vector-valued Borel measures is in general non-reflexive. Moreover, it lacks useful properties such as strict convexity and smoothness which are desirable for the convergence analysis of many optimization methods. Consequently, a direct extension of most well-known optimization routines to the present setting is not possible.
Discretization-based methods
A first approach to the solution of () for a continuous candidate set is to replace by a approximating sequence of finite sets with for a sequence of mesh parameters . For example, may be chosen as the nodal set of a triangulation of . Since consists of many points, every is of the form . Substituting the space of regular Borel measures in () with the discretized space yields a convex minimization problem for the coefficient functions similar to (1.5) with replaced by . While the resulting problem remains non-smooth due to the appearance of the total variation norm, it can be solved by a large number of well-studied algorithms. For examples we point to semi-smooth Newton methods [57, 48], the fast iterative shrinkage-thresholding algorithm (FISTA) [4], and the alternating direction of multipliers method [7]. However, this philosophy of discretize then optimize harbors the danger of yielding mesh dependent solution methods. While a particular algorithm may be efficient for the solution of the discrete problem associated to a fixed discretization parameter , its convergence behaviour can critically depend on the fineness of the discretization, which is usually the case for the aforementioned methods. For the methods analyzed in this paper, such problems only have to be solved on a very small candidate set.
Regularization based methods
A different approach to circumvent the lack of reflexivity of the space can be based on path-following strategies. Here the original problem is replaced by a sequence of -regularized ones:
[TABLE]
with the Hilbert space . Note that the appearance of the norm in the objective functional (as the restriction of the total variation norm to ) still promotes optimal solutions which are nonzero only on small subsets of . Furthermore in the limiting case for the -regularized solutions approximate solutions to (); see, e.g., [51]. For fixed those problems are amenable to efficient function space based solution methods such as semi-smooth Newton (SSN) [57, 55, 33]. For linear-quadratic problems such as (), these methods can be further interpreted as active set methods [35] (specifically Primal-Dual-Active-Set method, PDAS). While these methods behave mesh independent in practice and their performance scales linearly with the degrees of freedom underlying mesh, the convergence behavior deteriorates for small values of . In the practical realization it is therefore necessary to start at a large value of and to alternate between decreasing the regularization parameter and a (possibly inexact) solution of the regularized problem initialized at the previous iterate. Thus, a complete analysis of path-following methods requires a quantitative convergence analysis of the method used for the solution of the regularized problem in dependence of , a quantification of the additional regularization error and update strategies for the parameter; cf., e.g., [36]. We refer to [58] for further discussion and a numerical comparison of the path-following approach to the PDAP method analyzed here, which shows a substantial advantage of PDAP for the case that the optimal is small compared to the number of degrees of freedom of the mesh.
Existing convergence results for conditional gradient methods
Conditional gradient methods (see, e.g., [44]) have been originally proposed by Frank and Wolfe [29]. They constitute a simple iterative scheme for computing a minimizer of a smooth convex function over compact subsets of a Banach space. Since norm balls in are weak* compact, the general problem formulation fits into this setting for the choice of the convex indicator function . Feasibility of the iterates is ensured by taking the new iterate as a convex combination between the previous iterate and a trial point , which is obtained by minimizing a linearization of the objective functional around over the admissible set. A sublinear rate for the convergence of the objective functional values towards its minimum can be proven for various choices of the step size . For an overview we refer to [21, 22, 20]. The sublinear rate is tight even for strongly convex objective functionals [11]. An improved rate of convergence can only be derived in more restrictive settings: For problems on infinite dimensional spaces, a linear rate of convergence is provided in [44, 17] if the gradient of the objective functional is uniformly bounded away from zero on a strongly convex admissible set. The papers [20, 21] yield the same rate if the linearized objective functional fulfills a certain growth condition on the admissible set. We emphasize that, apart from trivial cases, none of the mentioned results is directly applicable to the problem at hand. Moreover, we point out that, on finite dimensional spaces, accelerated conditional gradient methods, such as Wolfe’s away-step conditional gradient [59], eventually yield a linear rate of convergence [1, 41]. In infinite dimensions, where the candidate set is not finite, we are not aware of similar results. Last we point out that if we replace with the cone and set , Algorithm 1 corresponds to the fully-corrective conditional gradient method [37]. For finite-dimensional observation space , this particular algorithm can be related to an exchange method [34] on the semi-infinite convex dual problem of (). We are also not aware of convergence results comparable to those provided in this work for these type of methods.
After this manuscript was finalized we were made aware of [28], where the authors prove linear convergence of a similar accelerated conditional gradient method for the particular case of and . We note that [28] and the present manuscript were derived independently of each other and differ in certain important aspects. In particular, in contrast to our work, the authors require , i.e. the dimension of the coefficient optimization problem (1.5) increases monotonically. Moreover, the active set is updated by adding all sufficiently large local maximizers of a certain dual certificate, while we only require the addition of one global maximum (as in the original GCG method).
Plan of the paper
The paper is organized as follows. In Section 2, we fix some basic notation and provide the functional analytic background used for the rest of the work. Section 3 introduces the optimization problem and some basic results on the existence and structure of optimal solutions are derived. We also discuss how different practically relevant problems fit into the general framework. In Section 4 we formulate the optimization algorithms and prove the subsequential convergence of the generated iterates as well as a sublinear worst-case convergence rate for the objective functional values. Under additional structural assumptions on the problem, an improved local linear rate of convergence is established in Section 5. Moreover, quantitative convergence results for the support points and the coefficients of the iterates are presented. Finally, in Section 6, we illustrate the theoretical findings by numerical experiments.
2. Notation
Let , , be compact and denote by a separable Hilbert space with respect to the norm induced by the inner product . In the following, is identified with its dual space using the Riesz representation theorem. A countably additive mapping is called a vector measure, where denote the Borel sets of . Associated to we define its total variation measure in the usual way. The space of vector measures with finite total variation is now denoted by , which is a Banach space with respect to the norm
[TABLE]
For reference, see the discussion in [43, Chapter 12.3]. The support of is defined as the support of the corresponding total variation measure . We point out that for a measure of the form (1.1) consisting of a finite sum of Dirac delta functions, we have and . Additionally, those measures are precisely the measures of finite support, which are characterized by their support and their coefficients for (which we will also abbreviate by , by a slight abuse of notation). We denote the cardinality of the support by .
Moreover, any is absolutely continuous with respect to , and there exists a unique function
[TABLE]
such that can be decomposed as
[TABLE]
see, e.g., [42, Chapter 12.4]. The function is called the Radon-Nikodým derivative of with respect to ; see [19]. For abbreviation we write in the following. For finitely supported measures of the form (1.1) it clearly holds for .
By we further denote the space of bounded and continuous functions on which assume values in . It is a separable Banach space when endowed with the usual supremum norm
[TABLE]
for any ; see e.g. [2, Lemma 3.85]. By Singer’s representation theorem (see, e.g., [32]) its topological dual space is identified with where the associated duality paring is given by
[TABLE]
for arbitrary and . A sequence , , is called weak* convergent with limit if
[TABLE]
for . We denote this by .
Finally, let be another Hilbert space and be a weak-to-strong continuous function, which is linear in the second argument. Now, we define the operator for each argument by
[TABLE]
which clearly extends the definition for finite measures given in (1.3), using the linearity of in the second argument. Additionally, we define the pre-adjoint operator by
[TABLE]
It is easy to see that for all and , using the definitions. Moreover, is a linear and bounded operator with norm
[TABLE]
Thus, is the Banach space adjoint of and thus also linear and bounded with the same norm bound. Since is weak-to-strong continuous, is sequentially weak*-to-strong continuous. Note that is not the Banach space adjoint of , since . It can be understood as the adjoint in the sense of topological vector spaces, if is endowed with the weak* topology, but we will not need this property in the following.
Finally, to prove the convergence result of this manuscript, we require higher smoothness assumptions on the kernel with respect to . We denote the partial derivatives of with respect to by , , for any and (if they exist) and analogously the higher derivatives. By and we denote the Gradient and Hessian with respect to , respectively. We require smoothness of the kernel only on a neighborhood of the optimal support points, and the precise assumptions will be given in section 3.2. For smooth functions on any open subset we denote by the spaces of twice continuously differentiable functions with derivatives that can be continuously extended up to the boundary of , endowed with the usual supremum norm over all partial derivatives. The space denotes the Lipschitz continuous functions endowed with the usual Lipschitz norm. Finally, for smooth functions taking values in the Hilbert space (resp. ), by and we denote the vector valued variants of the above spaces, defined in the canonical way.
3. Sparse minimization problems
We now turn to sparse minimization problems. Our aim is to solve the nonsmooth convex problem
[TABLE]
Here, the loss functional is a convex (extended real valued) functional with open domain on the Hilbert space . The convex cost functional is assumed to be monotone on . We note that its domain is given by . In order to ensure well-posedness of this problem, the following assumptions are made.
Assumption 3.1**.**
Let the following assumptions hold:
- (i.)
The function is proper, convex, lower semi-continuous, and monotonically increasing on with for . Without loss of generality we set for .
- (ii.)
The domain of the functional is nonempty and is radially unbounded.
- (iii.)
The function is convex and lower semi-continuous. Moreover, is open in , and is strictly convex and continuously Fréchet differentiable on .
Note that (i.), (iii.) and the weak*-to-strong continuity of imply that is weak* lower semicontinuous on . The convex subdifferential of will be denoted by and the (Hilbert-space) Fréchet derivative of at will be denoted by . For later use, we also define the smooth part of the reduced cost functional as
[TABLE]
From Assumption 3.1(iii.), the linearity of as well as the chain rule we conclude that is Fréchet differentiable at . In order to identify the Fréchet derivative we compute the directional derivative of in direction as
[TABLE]
Thus, the Fréchet derivative of at can be identified with the continuous function . Moreover, due to the weak*-to-strong continuity of , the mapping
[TABLE]
is sequentially weak*-to-strong continuous.
3.1. Existence of minimizers and optimality conditions
Before we turn to the algorithmic solution of () we summarize some basic properties, such as existence and optimality conditions, which will be necessary in the following. The existence of at least one global minimizer to () thus follows immediately by the direct method of variational calculus (see, e.g., [16, Chapter 1]).
Proposition 3.1**.**
There exists at least one optimal solution to ().
Let us turn to a structural characterization of minimizers obtained from (). The following theorem is a direct consequence of the one-homogeneity of the norm and Assumption 3.1(i.); see, e.g., [58, Theorem 6.22].
Theorem 3.2**.**
Let be given. Set . Then is an optimal solution to () if and only if
[TABLE]
Throughout the rest of the paper we will consider a solution of () and define
[TABLE]
We refer to as the optimal observation and to as the dual variable associated to .
Proposition 3.3**.**
The optimal observation and dual variable are the same for every minimizer to ().
Proof.
The uniqueness of the optimal observation can be shown by a standard argument using the strict convexity of . Thus, the optimal dual variable is unique as well. ∎
The optimality conditions can be equivalently expressed through a sparsity condition on the total variation measure and a projection formula for the Radon-Nikodým derivative ; see, e.g., [58, Theorem 6.24].
Theorem 3.4**.**
Let with polar decomposition . Then the condition (3.1) is equivalent to and either or
[TABLE]
In many situations it can be ensured that a minimizer with finite support (1.1) exists: for instance, when the space is finite dimensional (see, e.g., [52, Theorem 3.7], [58, Proposition 6.32]) or when the dual variable assumes its maximum only in a finite set of points. We will impose the latter condition below, which will be necessary for the improved convergence analysis. In this case, Theorem 3.4 can be interpreted as follows:
Corollary 3.5**.**
Assume that has finite support, i.e. with , , and . Then the condition (3.1) is equivalent to and either or
[TABLE]
Proof.
This follows directly with and for all . ∎
Note that, for problems of the form () and , is simply equal to the cost or regularization parameter .
3.2. Uniqueness of solutions and non-degeneracy conditions
In order to ensure uniqueness of the solution itself, we introduce the corresponding unique dual certificate
[TABLE]
where we recall that and (which are uniquely defined according to Proposition 3.3).
In the following, we will restrict ourselves to optimal solutions with non-degenerate dual variable , i.e. and thus , since a trivial dual variable would imply that is also a global minimizer of the functional over the set , which can be easily ruled out in most situations. Furthermore, we impose the following conditions for the analysis of the paper.
Assumption 3.2**.**
There is a discrete set for some such that the dual certificate defined above and fulfill
[TABLE]
Moreover, the following set is linearly independent:
[TABLE]
This assumption ensures the existence of a unique, sparse minimizer ; cf. also [24].
Proposition 3.6**.**
Under Assumption 3.2 the problem () admits a unique discrete minimizer given by a finite sum of Dirac delta functions
[TABLE]
Proof.
We note that points in (3.3) constitute the potential support set for the optimal solution, i.e. due to Theorem 3.4 and thus is given as in (3.5); cf. Corollary 3.5. Together with the form of the integral operator (1.3) it holds for where is defined as
[TABLE]
Now, with (3.4) (using linearity of in the second argument) the mapping is injective and is unique, which directly implies that is unique. ∎
The previous assumption can be guaranteed in several settings, and is commonly imposed for the purpose of error analysis, see, e.g., [24]. For instance, this condition holds if the operator is injective as in, e.g., sparse initial value identification problems [45]. Furthermore, if (3.3) holds, then we note that the linear independence of (3.4) is in fact necessary for the existence of a unique sparse minimizer with nonzero coefficient functions. More in detail, if with is a minimizer to () and (3.4) is linearly dependent then there exists another minimizer to () with ; see, e.g., [52]. Finally, recalling the definition of from (3.6), the vector is the unique minimizer of over with and satisfies
[TABLE]
for some if and only if has full column rank (i.e. if (3.4) is linearly independent) and fulfills a strong convexity condition on the image of . We will use this later to estimate the error of the optimal coefficients (see Proposition 5.23 below for details). Similar assumptions are used in the convergence analysis of semismooth Newton methods for problems with -regularization; see [48, p. 18], [47, Example 4.3.9].
In order to ensure the stability of the location points of approximations to and to be able to quantify the error, we require a further strengthened form of the optimality conditions. First, we introduce an appropriate neighborhood of the optimal support points: with Assumption 3.2 there exists a radius such that
[TABLE]
Now, on this neighborhood of the support points, we impose the following additional smoothness requirements on the kernel:
[TABLE]
This has several consequences. First, we observe that this implies that for any and therefore . Next, we note that the -norm is two times continuously Fréchet differentiable at every . Thus if satisfies for some and all , then the composition is in . In particular this implies . Since are the maximizers of , it holds
[TABLE]
Finally, we impose the main requirement for the analysis from this section: we assume that the curvature of around its global maximizers does not degenerate.
Assumption 3.3**.**
There holds , i.e. for . Furthermore, the kernel fulfills (3.8) for a radius satisfying (3.7) and there is a such that for all it holds:
[TABLE]
Let us briefly motivate the last assumption and recall similar concepts from the literature. First we point out . Hence, (3.9) corresponds to a second order sufficient condition (SSC) for the global maximizers of . In particular, this is equivalent to the quadratic growth
[TABLE]
of for all in the vicinity of (see Lemma 5.7 below for details). This will allow us to derive estimates on the support points of approximations to by perturbation results for the dual certificate . In the context of super-resolution the conditions of Assumptions 3.3 (for the case of ) are referred to as a non-degeneracy source condition for the measure ; cf. [24, 23]. Furthermore, we recall the connection of sparse minimization problems to state constrained optimization; cf. [14]. From this point of view the equality condition on corresponds to a strict complementarity assumption on the Lagrange multiplier associated to the state constraint. Moreover, in this case the definiteness assumption on the Hessian of can be interpreted as a condition on the curvature of the optimal state around those points in which it touches the constraint. Both of these conditions are well-established in the field of semi-infinite optimization. We refer to, e.g., [46] where similar assumptions are used to derive finite element error estimates. In [54] comparable conditions are imposed to derive second order optimality conditions for semi-infinite optimization problems.
4. Algorithmic solution
In this section we discuss the numerical solution of () with a conceptually simple algorithm operating on sparse finitely supported measures (1.1). It consists of the repeated insertion of a single new point at the maximum of a dual variable, the full resolution of a convex optimization problem on the current support, and the subsequent removal of all zero coefficients associated to the current support.
To describe the algorithm, we consider an active set of distinct points and the associated parametrization defined by
[TABLE]
The convex optimization problem at the core of the algorithm arises from fixing the support of the measure to the set and considering only the convex problem for the coefficients on the Hilbert space :
[TABLE]
Clearly () corresponds to () with the support of the optimization variable restricted to .
4.1. Primal Dual Active Point Strategy
The algorithm updates the active point set corresponding to the current iterate in every iteration by adding a single point , found at the global maximum of the current dual
[TABLE]
Subsequently, the convex sparse optimization problem (), is solved on the updated support, and entries from the active set with zero coefficient are pruned. We point out that the solution of () is sparse due to the sparsity promoting cost term, i.e. several coefficients of may be zero. The full procedure is outlined in Algorithm 1.
In the following, we analyze the iterates of Algorithm 1 without any termination criterion, which generates an infinite sequence , and analyze their structure and convergence. For this, we require the first order necessary optimality conditions for solutions to the coefficient optimization problem () in step 3. of Algorithm 1, which are given as follows.
Proposition 4.1**.**
Let and be the active set in iteration of Algorithm 1. Accordingly, denote by the optimal solution to (). Define the next iterate , dual variable and . Then there holds
[TABLE]
If this implies
[TABLE]
Proof.
The optimality conditions are obtained analogously to Corollary 3.5. To this end note that for the given the mapping (4.1) can be understood as an isometric isomorphism
[TABLE]
where is the space of vector measures supported on and the norm of is given by . Moreover the operator can be restricted to a linear continuous operator
[TABLE]
Thus is a solution to where is restricted to vector measures supported on . The claimed conditions follow now from Theorem 3.2 and Corollary 3.5 by replacing with and realizing that the dual variable for the restricted problem is the restriction of the dual variable to .
∎
We note the difference between the quantity from (4.2), which is the maximum of over , and the quantity , which is the maximum of over . Now, it is clear that
[TABLE]
and that is optimal for the original problem () if and only if . In fact, in this case the conditions from Proposition 4.1 imply the (sufficient) optimality conditions from Theorem 3.4 (cf. Corollary 3.5).
By Proposition 4.1, holds for all . Thus, if or equivalently it also holds
[TABLE]
since contains only the support points of . Associated to those support points we denote the coefficients of by (by a slight abuse of notation), and there holds further
[TABLE]
again from Proposition 4.1. Moreover, it is apparent that Algorithm 1 is a descent method: In fact step 3. implies that
[TABLE]
decays monotonically along the iterates, .
We will show in section 5 that for , which implies the weak* convergence of the iterates . However, we note that this notion of convergence requires careful interpretation. For instance, we cannot expect the coefficients of the iterate measure to converge towards coefficients of . In fact, is allowed to have much more support points than , i.e. , where multiple support points approximate asymptotically a single support point . Instead, we assume for the moment that the balls around the optimal support points from (3.7) are known and define for each support point of the exact solution the “lumped” coefficients , Then, the lumped coefficients fulfill
[TABLE]
Moreover, we can choose as an arbitrary convex combination of and replace with the “lumped” measure
[TABLE]
which still converges to in the weak* sense, but has the correct number of support points.
We note that the construction of the “lumped” coefficients supposes knowledge of the balls . This is sufficient for the purposes of the error analysis of the next section but problematic in practical computations, where neither the number nor these balls are known a priori. In practice, these balls could be estimated from the knowledge of the approximate solution and a posteriori, but we do not pursue this here.
5. Convergence analysis
We now provide a convergence analysis for Algorithm 1. It requires a final set of conditions imposed on the functional , for which we introduce additional notation: For define the sublevel set
[TABLE]
as well as its image set
[TABLE]
The set is weak* compact since is radially unbounded and weak* lower semicontinuous (Assumption 3.1). Consequently, since is weak*-to-strong continuous, is compact. Observe that Algorithm 1 is a descent method due to (4.6) and thus for all . For the convergence analysis we impose the following additional assumptions on , which are weaker than global Lipschitz continuity of its gradient and strong convexity.
Assumption 5.1**.**
For every the gradient is Lipschitz continuous on the image set : There exists a constant only depending on with
[TABLE]
Moreover is strongly convex around the optimal observation , i.e. there exist a neighborhood of in and a constant with
[TABLE]
It is clear that a quadratic as in () fulfills these conditions. However, in cases where the domain of is a proper subset of (cf., e.g., [50]), fulfills neither (5.3) nor (5.4) uniformly for all , and requires the weaker form given above.
In order to make the following presentation more transparent we state the main result of this section beforehand: The following theorem provides linear convergence for the residual of the functional defined as
[TABLE]
along the iterates , the support set , and the lumped coefficients of the iterates introduced in (4.7).
Theorem 5.1**.**
Suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold and let the sequence be generated by Algorithm 1 started at . Recall the definition of the balls and their union from (3.7). Then, there exists a constant with
[TABLE]
Moreover, there exist and , such that for all :
[TABLE]
Finally, the we have in , and the error decays linearly in the dual space of the space of valued Lipschitz continuous functions on : .
Before giving the proof of Theorem 5.1, we sketch the main idea of the proof of the convergence result for the functional beforehand: Due to the localization result as given in the first assertion of Theorem 5.1, we know that for large enough, every newly inserted support point is contained in exactly one ball around the support points , and we will denote the associated index and the ball by
[TABLE]
By perturbation arguments we can estimate the difference of the existing support points and the new point to the corresponding optimal location in terms of the error quantity from (4.3), which bounds the functional residual up to constant; see subsection 5.2.1. Then, we define the search direction
[TABLE]
and the associated trial point
[TABLE]
which replaces all Dirac delta functions in with a single one supported on , where the magnitude of the coefficient is maintained, but the direction is taken from the dual variable at , motivated by the optimality conditions. Relying on the fact that the coefficients of solve the problem () and the aforementioned perturbation arguments, we can document the decrease in terms of the objective functional by defining an intermediate update
[TABLE]
Here, if is large, we move a large fraction of “mass” from the Dirac-delta functions supported on over to the newly inserted point. We show that for appropriately chosen not only , but in fact the error decreases linearly according to ; see Theorem 5.17. Clearly, since it holds and the same estimate follows for .
The rest of this section is dedicated to providing the proof of Theorem 5.1. Throughout the derivations, the generic constant will be chosen different from line to line, but always independent of the iteration index . The plan of the proof is as follows. First, in subsection 5.1 we establish the global convergence of Algorithm 1 by interpreting it as an accelerated version of a generalized conditional gradient method. The corresponding theory yields only a much slower sublinear rate of convergence, since it can not fully exploit the decrease achieved in step 3. of Algorithm 1. However, the global convergence result at a sublinear rate is necessary for our proof since it allows us to apply perturbation arguments based on Assumption 3.3 (which are valid only for ) and the improved convergence analysis for the residual is given in subsection 5.2. We first establish the localization result of the support points in Corollary 5.11 and then the linear convergence of the residual in Theorem 5.17. In subsection 5.3 the results for the iterates are derived as consequences; see Theorem 5.19 for the support points, Theorem 5.24 for the lumped coefficients, and Theorem 5.25 for the estimate in the dual norm.
Remark 5.1*.*
It may be tempting to attempt to use the update (5.8) directly to replace the resolution of the subproblem in step 3. of Algorithm 1. However, this is not immediately possible for several reasons: First, the construction of the search direction requires knowledge of the ball based on the exact solution, which is not available in practical computations. Second, the descent properties of (5.8) rely on the fact that is a previous iterate computed from () and fulfills (4.4) and (4.5). To start, the choice of leads and thus can, on its own, not lead to convergent method. Moreover, it is beneficial to resolve () fully and obtain in each step the sparsity condition from Proposition 4.1, since this removes unnecessary support points in step 4. and keeps the active set small and enables the estimate on the points of the active set given above.
5.1. Worst-case convergence analysis
Now, we derive a first convergence result for the sequence generated by Algorithm 1. For this, we rely on existing analysis for generalized conditional gradient methods. Here, the same point is inserted at each iteration, but, in contrast to Algorithm 1, a convex combination of the old iterate and a new trial iterate supported on is taken as the update, instead of solving the subproblem () to obtain the new iterate.
Similar to [9], our derivation relies on an equivalent surrogate of (). Let be an upper bound on the norms of the elements in the set , which exists due to Assumption 3.1(ii.). For instance, for the indicator function it can be simply chosen as the value of the constraint and for problems of the form () we can set using . Note that for all and thus . Consider the norm-constrained problem
[TABLE]
Clearly, by choice of , the problems () and () admit the same global minimizers. Associated to this auxiliary problem we define the gap functional as
[TABLE]
Due to the additional constraint we can easily see that is finite for and there holds with equality if and only if is a solution to ().
The gap functional corresponds to the gap of the value of the primal function at and a dual function value of () evaluated at (see [58, Remark 6.4]) and is an important quantity for the following analysis. In particular, it provides an upper bound on the functional residual; see, e.g., [58, Lemma 6.12].
Proposition 5.2**.**
For any here holds
[TABLE]
Proof.
By convexity of , we have and thus with it follows
[TABLE]
The trial point for the GCG method is defined for a given iterate as a maximizer in the maximization problem occurring in the evaluation of . It can be computed analytically: Let be, as before, a maximum of the current dual , , and define
[TABLE]
where is chosen with
[TABLE]
The connection of to is proved in the following result.
Proposition 5.3**.**
Let be generated by Algorithm 1. Set and as in (5.11) and (5.10). Then solves
[TABLE]
and thus .
Proof.
By standard arguments, condition (5.11) implies that is a minimizer to
[TABLE]
Next we observe that
[TABLE]
for all , . We distinguish two cases. First, if then satisfies
[TABLE]
and therefore is a solution to (5.12). Note that the second equality holds due to the monotonicity of from Assumption 3.1. Else, if , the measure defined in (5.10) satisfies
[TABLE]
where we use in the second equality. Hence we again conclude the optimality of for (5.12) which finishes the proof. ∎
We note that the addition of the constraint ensures that the minimum in (5.12) is finite, which may otherwise not hold for all cost functions (e.g., () with ).
Last, we define the GCG update of with stepsize by . This stepsize can be chosen in various ways; since we rely on the analysis of [50, 58], we employ the same Armijo-Goldstein rule discussed there. The resulting GCG algorithm is summarized in Algorithm 2.
The worst-case convergence analysis of Algorithm 1, relies on the inclusion of the optional step 4. in Algorithm 2. It allows to replace the GCG update with any that decreases the functional value. Clearly the update defined in step 5. of Algorithm 1 fulfills this condition, due to the fact that . Hence, Algorithm 1 achieves at least as much descent in the objective functional as the GCG update in each step and thus the convergence analysis of Algorithm 2 applies to Algorithm 1, which can be considered an accelerated version of the former. Using this observation, we conclude the global unconditional convergence of Algorithm 1 and a sublinear rate of convergence for the residuals. For similar results in specific settings; see also [25, 6, 9, 50]. We refer to [58, Theorem 6.29] for a detailed derivation of the specific form of the following result.
Theorem 5.4**.**
Suppose that Assumption 3.1 and condition (5.3) hold and let be generated by Algorithm 2 or Algorithm 1. Then is a minimizing sequence for and there exists a with
[TABLE]
Moreover admits at least one weak convergent subsequence and each weak* accumulation point of is a minimizer of over . If the solution to () is unique then we have for the whole sequence as well as .*
Let us comment briefly on the main difference of the algorithms, which lies in the update of the coefficients in steps 3.–4., respectively. The GCG method in Algorithm 2 attempts to move as much “mass” as possible, simultaneously from all coefficients of to the trial point . The problem in obtaining an improved linear convergence result for this method lies in the choice of the point : for it does not converge to the true solution, , since is supported only on a single point. However, it holds that and thus
[TABLE]
and therefore we must have for . This prevents improving the convergence rate of Theorem 5.4 without any acceleration in step 4. In contrast, the proof of the linear rate for Algorithm 1 relies on an improved intermediate iterate defined with the trial point , which we show converges to . Here, we are able to choose which yields the linear convergence; see the proof of Theorem 5.17.
5.2. Improved rates for the residual
In the following, we turn our attention back to the improved convergence analysis of Algorithm 1 from Theorem 5.1, where we first prove the improved rate for the residual. For this purpose we now and for the rest of this section suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold. We again recall the definition of the optimal state , the dual variable , the dual certificate and its maximal value :
[TABLE]
as introduced in section 3. Analogously we define the corresponding iterates of Algorithm 1:
[TABLE]
as introduced in section 4. We note that we have given the form of the multiplier from (4.4), which requires . By the global convergence result of the previous section, this is indeed the case:
Corollary 5.5**.**
For all large enough there holds and .
Proof.
According to Theorem 5.4 we have in and in . In particular, since , this implies for all large enough. Thus it remains to address the positivity of . From the weak* convergence of , the strong convergence of and the weak* lower semicontinuity of the norm we readily obtain
[TABLE]
and thus for all large enough. ∎
We first explore some immediate consequences of Assumption 5.1 that allow us to estimate the error of the important algorithmic quantities in terms of the functional residual. This guarantees their convergence at the already established rate and will be used for the proof of the improved rate below.
Lemma 5.6**.**
For all large enough there holds
[TABLE]
Proof.
Recall the neighborhood from Assumption 5.1. Due to the weak* convergence of towards , see Theorem 5.4, and the weak*-to-strong continuity of (see the discussion at the end of Section 2) there holds for all large enough. Thus, invoking the strong convexity (5.4) from Assumption 5.1 and recalling the definition of from (5.9), we conclude
[TABLE]
where we used in the last inequality. By optimality of there holds and thus
[TABLE]
Dividing both sides by and taking the square root yields the estimate on .
Next recall the definition of the compact set from (5.2). Note that Algorithm 1 is a descent method and thus . Moreover, since the gradient is Lipschitz continuous on due to Assumption 5.1, we have
[TABLE]
for some only depending on . The estimate for the dual variables follows immediately since
[TABLE]
Finally we note that
[TABLE]
The next lemma establishes some immediate properties of the dual certificate that will be useful for estimating the distance of the inserted point and the support points of to the optimal support points of .
Lemma 5.7**.**
There exist and such that with there holds
[TABLE]
where denotes the constant from Assumption 3.3. Moreover, for all , the following quadratic growth condition is satisfied:
[TABLE]
Proof.
According to Corollary 3.5 we have , and with Assumption 3.2 it holds for all . Thus the existence of and such that (5.15) holds follows from the continuity of . For a given index , without restriction can be chosen small enough such that
[TABLE]
and thus
[TABLE]
for all , which proves (5.16). Finally fix . Note that (with Assumption 3.2) is a global maximum of and therefore . By Taylor’s theorem with remainder there exists with
[TABLE]
where (5.16) is used in the last inequality. Since and were chosen arbitrarily, this finishes the proof. ∎
5.2.1. Intermediate estimates for the support points
First we argue that the support of and the new candidate point from step 1. in Algorithm 1 are located in the vicinity of the optimal support points if is large enough. For this purpose we require the following estimate on the gap of the iterates, which bounds the functional residual.
Lemma 5.8**.**
Assume that the sequence is generated by Algorithm 1 and recall the definitions of from (5.9), of and from (5.14) and of from (5.10). Then there holds
[TABLE]
as well as
[TABLE]
where is determined according to Proposition 5.3. In particular, we have
[TABLE]
Proof.
According to Propositions 5.3 and 4.1 there holds
[TABLE]
Recall that is a bound on the norm of the measures in . Since is a solution of the partially linearized problem and due to , we further obtain
[TABLE]
which gives the first inequality. Using , see Proposition 4.1, we estimate
[TABLE]
which provides the second inequality. The last inequality is a consequence of and form Proposition 5.2. ∎
The next result addresses the asymptotic behavior of . Together with (5.18), this then yields convergence results for and .
Lemma 5.9**.**
There holds .
Proof.
Since is a solution to (), the dual variable satisfies
[TABLE]
By adding and subtracting and to the definition of , we estimate
[TABLE]
where we have used (5.19) and . Due to the weak* convergence of due to Theorem 5.4 and in with Lemma 5.6 we conclude
[TABLE]
Finally note that
[TABLE]
according to Theorem 5.4 and Lemma 5.6, respectively. Together with this concludes the proof. ∎
Corollary 5.10**.**
Let and be defined according to (5.13) and (5.14). There holds
[TABLE]
Proof.
Utilizing Lemma 5.6 we observe that
[TABLE]
for . Since with Theorem 5.4 and , there holds for all large enough. We consequently obtain
[TABLE]
with (5.18) and Lemma 5.9. Finally, follows from the triangle inequality. ∎
Combining Lemma 5.7 with the convergence results of Lemma 5.6 and Corollary 5.10 we conclude that the new candidate point and the support of are located in the vicinity of the set . Moreover each optimal Dirac delta position is approximated by at least one point in .
Corollary 5.11**.**
Let , denote the constants from Lemma 5.7, recall , and let denote the point determined in step 1. of Algorithm 1, and as defined in (5.14). For large enough and there holds
[TABLE]
Proof.
Let an arbitrary point be given and recall the function from (5.13). We estimate
[TABLE]
where we used (5.15) in the first inequality. Choosing large enough such that, with Lemma 5.6 and Corollary 5.10,
[TABLE]
yields (5.20). Next let be arbitrary. Then there holds . Consequently we have due to (5.20). In the same way we conclude since . Fix now an index and denote by the restriction of to . Invoking Urysohn’s lemma there exists a cut-off function with on and on for . The weak* convergence of the iterates due to Theorem 5.4 and the strong convergence of the dual variables due to Lemma 5.6 yield
[TABLE]
Since with Corollary 5.10 and with Assumption 3.2, we have for large enough. ∎
Next, we quantify the distance of the candidate point to the closest point in in terms of the residual . For this purpose we rely on the observation that the behavior of the iterated dual certificate on the ball is similar to that of from (5.17), i.e. it assumes a unique local maximum on which satisfies a quadratic growth condition.
Lemma 5.12**.**
Let denote the constant from Lemma 5.7. For all and large enough the function with for , as defined in (5.14), assumes a unique local maximum on each ball . Furthermore there holds
[TABLE]
where is the coercivity constant from Assumption 3.3, as well as
[TABLE]
Moreover, for the global maximum from step 1. of Algorithm 1, there is a with .
Proof.
Let denote the radius from Assumption 3.3 and let and be defined as in (5.13). Due to the strong convergence of in from Lemma 5.6 and as a consequence of Assumption 3.3, we also have in . In particular, due to (3.7), we conclude , , and thus for all large enough. The strong convergence in follows immediately. Now fix an index and let denote the radius from Lemma 5.7. For all , and large enough we estimate
[TABLE]
where denotes the spectral norm. Here we used Lemma 5.7 in the first inequality and the uniform convergence of in the second one. Hence restricted to is uniformly concave and thus together with (5.20), which implies that no maximum can be assumed on the boundary of , admits a unique maximum . It satisfies the necessary first order conditions . Let be arbitrary but fixed. By Taylor’s theorem and (5.24) we obtain
[TABLE]
for some . Since and where chosen arbitrary we conclude (5.22). Next we prove the estimate in (5.23). For this purpose we invoke Lemma 5.7 and to estimate
[TABLE]
using a Taylor expansion for in the final step. We readily verify that the entries of the gradient satisfy for all and :
[TABLE]
As in Lemma 5.6 we estimate
[TABLE]
Note that
[TABLE]
and thus
[TABLE]
Dividing by we conclude (5.23). Finally we point out that is a global maximum of and for all large enough with Corollary 5.11. Hence, we conclude . ∎
We finish this section with two a priori estimates for the support of as consequences of Lemma 5.7 and Lemma 5.12.
Lemma 5.13**.**
For all and large enough there holds
[TABLE]
Moreover denote by the set of local maximizers of on from Lemma 5.12. Then we have
[TABLE]
Proof.
First, let denote the constant from Lemma 5.7. Observe that with Corollary 5.11. Let be arbitrary but fixed. Using (5.17) we obtain
[TABLE]
for some independent of . Here we used for all and as well as Lemma 5.6 in the final inequality. Taking the maximum over all , and observing that for large enough due to (5.20) yields (5.25). Moreover, applying (5.22), we get
[TABLE]
for all . Maximizing with respect to yields (5.26). ∎
5.2.2. Construction of a descent direction
With these auxiliary estimates at hand we now proceed to prove the linear convergence rate for the residual . For this, assume large enough such that all previous results hold and recall the definition of the trial point
[TABLE]
from (5.7), where is the index of the support point closest to as defined in Lemma 5.12. The next statement establishes that the search direction provides a descent direction with descent proportional to the first order error quantity related to the gap (cf. Lemma 5.8).
Proposition 5.14**.**
Let and be defined according to (5.14). For all the trial point satisfies
[TABLE]
Proof.
We note that , and consequently . Furthermore by construction of (5.7) and conditions (4.4) and (4.5) there holds
[TABLE]
Moreover, the error between and in terms of its observations and can be bounded in terms of the aforementioned dual error quantity.
Lemma 5.15**.**
There exist a such that for all large enough there holds
[TABLE]
Proof.
Let an arbitrary be given and denote by the coefficient of the associated Dirac delta function. Given there holds
[TABLE]
using (4.5). Now, with (5.26) and the continuity of , the first term is estimated by
[TABLE]
for large enough with a constant independent of . For the second term we use to estimate
[TABLE]
with as before. Here we used as well as in the first equality Since and there holds for sufficiently large that
[TABLE]
for all and consequently
[TABLE]
Now, we rewrite
[TABLE]
and applying the estimate for all from above and using yields the desired result. ∎
The previous results establish the weak* convergence of towards .
Corollary 5.16**.**
There holds and for .
Proof.
We readily obtain
[TABLE]
The first term tends to zero since is a minimizing sequence for and the second vanishes due to Lemma 5.15. Thus gives a minimizing sequence for . Since is the unique minimizer of the claim on the weak* convergence follows. ∎
Finally, we show that yields a search direction that achieves a linear decrease in the objective functional.
Theorem 5.17**.**
Suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold and that is generated by Algorithm 1. There exists an index , and constants and with
[TABLE]
Proof.
For and from (5.7) define
[TABLE]
Recall the definition of the sets and from (5.1) and (5.2), respectively. Since we conclude for all and all large enough. Let in the following be big enough. Using the convexity of , the Lipschitz continuity of its gradient from (5.3) and the linearity of and the convexity of we obtain
[TABLE]
where denotes the Lipschitz constant of on from Assumption 5.1. Now, by Proposition 5.14 and Lemma 5.15, we derive the estimate
[TABLE]
with , where is the constant from Lemma 5.15. Minimizing for , we obtain
[TABLE]
where . Note that for all large enough, where is the index of the optimal support point closest to as in (5.6). Let be the bound on the norm of the elements of from Section 5.1 such that with Lemma 5.8. Defining the constant by
[TABLE]
and combining the previous estimates we have that
[TABLE]
Subtracting from both sides, it follows that
[TABLE]
Denote by an index such that all previous results hold for all . By induction we obtain . Setting and yields the result. ∎
5.3. Improved rates for the iterates
This section is devoted to quantitative convergence results for the sequence of iterates . While norm convergence towards the minimizer cannot be expected in general, the weak* convergence of the iterates implies convergence of the support points of towards those of as well as convergence of the coefficient functions.
5.3.1. Rates for the support points
In this section we address the linear convergence of towards the support points of . More in detail we prove that
[TABLE]
for some and sufficiently large. For this purpose recall that
[TABLE]
for sufficiently large according to Lemma 5.13. In view of Theorem 5.17 it thus suffices to quantify the convergence of from Lemma 5.10 in terms of the residual .
Lemma 5.18**.**
For all large enough there exists with
[TABLE]
Proof.
Recall the definition of the dual certificates and from (5.13) and (5.14), respectively. First note that Lemma 5.18 holds if for some . Therefore, without restriction assume that for all . Let denote the new candidate point determined in the previous iteration of Algorithm 1. We now claim that for all large enough. Indeed, if this is not the case, then we have and thus
[TABLE]
This gives a contradiction to for all large enough due to (5.27). From (4.4) and Lemma 5.12 we thus conclude
[TABLE]
for the support point closest to as in as in (5.6). Summarizing the previous observations we finally have
[TABLE]
due to the monotonicity of and Lemma 5.6. ∎
Combining Lemma 5.13 and 5.18, we obtain the following convergence results for the support points.
Theorem 5.19**.**
Suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold and that is generated by Algorithm 1. There exist and such that for all large enough it holds
[TABLE]
Proof.
Due to the monotonicity of , Theorem 5.17 and Lemma 5.18 there exists with
[TABLE]
By setting we deduce (5.28) by combining this with the estimate (5.25) in Lemma 5.13. ∎
5.3.2. Rates for the coefficients
Next we address the convergence of the lumped coefficient function introduced in (4.7) towards the optimal coefficient . We will establish the estimate
[TABLE]
with as in the previous section. We start with the following observation.
Lemma 5.20**.**
There exists a constant such that, for all large enough,
[TABLE]
Proof.
First recall that . The result readily follows from the triangle inequality and
[TABLE]
Therefore, in order to establish the desired result, it suffices to quantify the convergence of the norms and the normalized coefficient functions. We start with the latter one.
Lemma 5.21**.**
There exists a such that for all and it holds
[TABLE]
Proof.
Let be arbitrary but fixed. From Corollary 3.5 and Proposition 4.1 we recall that
[TABLE]
Now, the error is split into three parts
[TABLE]
For the first term we use Lemma 5.18 to obtain
[TABLE]
due to (5.29) and since is bounded away from zero. From the Lipschitz continuity of and the uniform convergence of the remaining terms are estimated by
[TABLE]
Using (5.28) and from Lemma 5.6, for all large enough we obtain
[TABLE]
independent of with (5.29). Adding both estimates yields the result. ∎
Next we address the convergence of the norms. For this purpose we require the following auxiliary result.
Lemma 5.22**.**
*There exists a such that for all , , and large enough it holds *
[TABLE]
Proof.
The proof follows similar steps as in Lemma 5.15. Fix an index and . For we obtain
[TABLE]
for some constant independent of and ; see Theorem 5.19 and Lemma 5.21. Since was chosen arbitrarily, the desired statement follows. ∎
The next statement characterizes the convergence behavior of the norm of the lumped coefficient.
Proposition 5.23**.**
Suppose that Assumption 3.1, 3.2, 3.3, and 5.1 hold and that is generated by Algorithm 1. There exists a constant such that, for all large enough,
[TABLE]
Proof.
Define the vectors with and and recall the definition of the operator from (3.6), which is injective due to Assumption 5.1. Thus, by standard arguments, there exists with for any . Using this, we estimate
[TABLE]
We further estimate
[TABLE]
where we use Lemma 5.22 and . Using Lemma 5.6 we obtain
[TABLE]
for all large enough, finishing the proof. ∎
Summarizing all previous estimates we arrive at the following theorem.
Theorem 5.24**.**
There exists a constant such that, for all large enough it holds
[TABLE]
Proof.
This follows with Lemma 5.20 and the estimates in Lemma 5.21 and Proposition 5.23. ∎
5.3.3. Convergence rates in weaker norms
As already pointed out the norm convergence of towards the unique minimizer in cannot be expected in general. However norm convergence results can still be obtained by resorting to weaker spaces. In particular since the space of Lipschitz continuous functions embeds compactly into weak* convergence on implies strong convergence with respect to the canonical norm on the topological dual space of . To this end we point out that
[TABLE]
for all . This closely relates the considered dual norm to the Wasserstein-1 distance for probability measures [31], and the Kantorovich-Rubinshtein norm for scalar-valued measures [5].
Theorem 5.25**.**
There exists a constant such that, for all large enough,
[TABLE]
Proof.
Let with be given. We estimate
[TABLE]
Fix an arbitrary index and split the error on the right hand side of the last inequality as
[TABLE]
The first term is bounded by
[TABLE]
for some constant independent of following Theorem 5.24. For the second term we use the Lipschitz continuity of to obtain
[TABLE]
using and the convergence results for the support points in Theorem 5.19. Again, the constant can be chosen independent of the index . Combining all previous observations we conclude
[TABLE]
Taking the supremum over all with yields the claim. ∎
6. Numerical experiments
In order to illustrate the theoretical results, we perform tests on a simple example with . We consider the vector valued case with . Motivated by the task of inverse sound source location, we consider the convolution kernel
[TABLE]
corresponding to the fundamental solutions of the three dimensional free-space Helmholtz equation at wave number evaluated at distance .
For testing purposes, we consider an exact source consisting of three Dirac delta functions and observe the solutions of the Helmholtz equation with wave numbers and at several points , . Choosing the kernel
[TABLE]
the corresponding integral operator from (1.3) describes these observations. Then we consider observations of this source perturbed by additive Gaussian noise , with relative noise level of . The exact source and the observations are visualized in Figure 1(a).
To recover the source from these measurements, we solve the convex regularized source location problem () with . Since the analytical solution is unknown, we compute a reference solution up to high tolerance by employing Algorithm 1 with . Additionally, to obtain the correct number of Dirac delta functions we post-process the final iterate along the lines of (4.8) by combining sources where the location points differs by less than . We depict this approximation to the optimal solution in Figure 1(b), together with the dual variable and the absolute value . It is evident that the strong sufficient conditions from Assumption 3.3 are fulfilled, validating the post-processing described above. Also, we numerically compute the condition number of the matrix (3.6) as , providing numerical evidence for Assumption 3.2.
We compare Algorithm 1 to different versions of the accelerated GCG method Algorithm 2, in order to study the influence of the optional coefficient minimization step 4. We consider:
- GCG
For plain GCG, we omit step 4. in Algorithm 2 and set .
- SPINAT()
Here, we adapt the procedure from [9], and perform additional proximal gradient steps for () on the current support to obtain started at in step 4. of Algorithm 2. We select the proximal gradient stepsize by an Armijo line-search rule.
- PDAP
We solve the subproblem () arising in step 4. to machine precision, resulting in Algorithm 1.
We briefly discuss the practical implementation aspects. Concerning the computation of the global maximum of , we solve a number of independent local nonlinear optimization problems using a Newton method initialized at 30 uniformly spaced points in and at the existing support points . From those local maxima we select the point by a direct search. For the solution of the subproblems () in PDAP we employ a semismooth Newton method (SSN) [57, 48] with a globalization strategy based on a line-search on the objective functional [51, Section 3.5]. The SSN algorithm is initialized with the coefficients of the intermediate iterate from step 3. of Algorithm 2. Due to the superlinear convergence properties, it terminates in a finite number of steps with the solution up to machine precision, and also identifies the nonzero coefficients of as part of the solution process, which define the support the new iterate .
Now, we run the aforementioned algorithms with a tolerance of for a maximum of steps. The corresponding functional residuals are given in Figure 2(a).
For GCG, we clearly observe the predicted sublinear convergence rate from Theorem 5.4, which leads to a slowly decreasing residual in later iterations, which appears effectively stagnant. SPINAT achieves a larger reduction in the residual, but is affected by the same effective stagnation in later iterations. PDAP initially performs very similar to either of these methods, but converges at a linear rate after the third step. In particular, it terminates within the tolerance after steps. This agrees with the convergence result from Theorem 5.1. Clearly, PDAP yields the best results compared to other methods in every iteration. Considering that the full resolution of the subproblem () is more expensive than the simple update of the GCG and SPINAT method, we also plot the residual over the wall clock time in Figure 2(b). We observe that the added cost of PDAP does not outweigh the benefits, since, in fact, the computation time in each step is heavily dominated by the computation of the global nonconvex maximum . To validate if the improved convergence estimates for source points and coefficients can be observed in practice, we compute the maximum error of each support point of the iterates of GCG to the closest support point of the reference solution as in Theorem 5.19. Moreover, we compute the locally lumped coefficients as given in Theorem 5.24 and compute their maximum error to the corresponding reference values. As predicted by theory, both quantities converge at a linear rate, where we empirically estimate ; see Figure 2(c).
To further assess the properties of the solutions obtained by each method, we plot the evolution of the support sizes of the computed iterates in Figure 3(a).
Clearly, GCG inserts a new point in every iteration, which means that the support size is proportional to the iteration counter. PDAP behaves almost ideally, since the number of points is bounded by a number that is only twice the number of support points of the true source. The versions of SPINAT are also able to eliminate some support points, but only in later iterations and they do not achieve a meaningful reduction. For all of the methods, we observe a clustering of sources around the optimal location, but as the zoomed in plot from Figure 3(b) shows, PDAP produces a cluster of only two points at high accuracy, whereas SPINAT(100), albeit delivering the smallest residual out of the GCG and SPINAT experiments, has a large cluster of points at substantial distance from the optimal location. While for GCG this behavior is to be expected, it may appear surprising that proximal gradient iterations on the current support are not sufficient to move enough “mass” of the coefficients to the improved location points inserted in more recent iterations. This stems from the fact that the mapping is increasingly ill-conditioned the more multiple cluster around the same point , and this adversely affects the convergence of the proximal gradient method. This highlights the benefit of employing second order optimization methods for the subproblems of (), which are not affected as much by this ill-conditioning. In particular, for the given implementation using semismooth Newton methods, new support points are inserted at vastly improved locations due to the improved descent in the functional in the previous iteration and old support points at locations far from the optimum can be eliminated reliably.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] S. D. Ahipasaoglu, P. Sun, and M. J. Todd , Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids , Optim. Methods Softw., 23 (2008), pp. 5–19.
- 2[2] C. D. Aliprantis and K. C. Border , Infinite dimensional analysis , Springer, Berlin, third ed., 2006. A hitchhiker’s guide.
- 3[3] J.-M. Azaïs, Y. de Castro, and F. Gamboa , Spike detection from inaccurate samplings , Appl. Comput. Harmon. Anal., 38 (2015), pp. 177–195.
- 4[4] A. Beck and M. Teboulle , A fast iterative shrinkage-thresholding algorithm for linear inverse problems , SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
- 5[5] V. I. Bogachev , Measure theory. Vol. I, II , Springer-Verlag, Berlin, 2007.
- 6[6] N. Boyd, G. Schiebinger, and B. Recht , The alternating descent conditional gradient method for sparse inverse problems , SIAM J. Optim., 27 (2017), pp. 616–639.
- 7[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein , Distributed optimization and statistical learning via the alternating direction method of multipliers , Found. Trends Mach. Learn., 3 (2011), pp. 1–122.
- 8[8] K. Bredies and M. Carioni , Sparsity of solutions for variational inverse problems with finite-dimensional data , Calc. Var., 59 (2020).
