Super-resolution meets machine learning: approximation of measures

H. N. Mhaskar

arXiv:1907.04895·math.FA·July 12, 2019

TL;DR

This paper investigates the problem of approximately recovering measures from limited information, extending super-resolution concepts to measures supported on continua, with explicit recovery operators and optimal error estimates.

Contribution

It introduces a new framework for measure approximation without support separation assumptions, providing explicit recovery operators and optimal error bounds.

Findings

1. Explicit recovery operator for measures

2. Optimal bounds on approximation error

3. Recovery limitations for limited information

Equations

$$\sum_{k\in{\mathbb{Z}}}a_{k}\exp(-i\omega k\Delta)+z(\omega),\qquad|\omega|\leq\Omega,$$

$$\mu(t)=\sum_{k=1}^{K}a_{k}\delta(t-t_{k})+Z(t),$$

$$\sum_{k=1}^{K}a_{k}\exp(-ijt_{k})+\int_{\mathbb{T}}Z(t)\exp(-ijt)dt,$$

$$F_{N}(t)=\frac{1}{N+1}\sum_{|j|\leq N}(1-|j|/N)\exp(ijt).$$

$$\eta=\min_{j\neq k}|t_{j}-t_{k}|\geq 2/n,\tag{1.3}$$

$$\int_{\mathbb{T}}\left|\int_{\mathbb{T}}F_{N}(x-t)(d\mu(t)-d\nu_{n}(t))\right|dx\leq c(N/n)^{2}\int_{\mathbb{T}}|Z(t)|dt,\tag{1.4}$$

$$\sum_{k=1}^{K}\sum_{r=0}^{R}a_{k,r}(-ij)^{r}\exp(-ijt_{k}),\qquad|j|<N,$$

$$f(x_{1},\dots,x_{4})=f_{1}\big(f_{12}(x_{1},x_{2}),f_{34}(x_{3},x_{4})\big),$$

$$\sum_{k=1}^{N}a_{k}\sigma(w_{k}\cdot x+b_{k}),\qquad a_{k},b_{k}\in{\mathbb{R}},\ w_{k}\in{\mathbb{R}}^{4}$$

$$P(x_{1},\dots,x_{4})=P_{1}\big(P_{12}(x_{1},x_{2}),P_{34}(x_{3},x_{4})\big)$$

$$f(x)=\int_{\mathbb{X}}G(x,y)d\mu(y),\qquad x\in{\mathbb{X}},$$

$$D^{ET}(\nu)=\sup_{x\in[-\pi,\pi)}|\nu([-\pi,x))|.$$

$$D^{ET}(\nu)=\left\|\int_{\mathbb{T}}G(\circ-t)d\nu(t)\right\|_{\mathbb{T},\infty},\tag{2.2}$$

$$\left\|\int_{\mathbb{X}}G(\circ,t)d\nu(t)\right\|_{{\mathbb{X}},2}$$

$$\left\|\int_{\mathbb{X}}G(\circ,t)d\nu(t)\right\|_{{\mathbb{X}},1}.$$

$$\left\|\int_{\mathbb{T}}F_{N}(\circ-t)d\nu(t)\right\|_{\mathbb{T},1}$$

$$\|f\|_{\nu;B,p}:=\begin{cases}\displaystyle\left\{\int_{B}|f(x)|^{p}d|\nu|(x)\right\}^{1/p},&\text{if }1\leq p<\infty,\\ \displaystyle|\nu|\text{-}\mathop{\mathrm{ess\,sup}}_{x\in B}|f(x)|,&\text{if }p=\infty.\end{cases}$$

$$\mu^{*}(\mathbb{B}(x,r))=\mu^{*}(\{y\in\mathbb{X}:d(x,y)<r\})\leq\kappa_{1}r^{q}.$$

$$\left|\sum_{k=0}^{\infty}\exp(-\lambda_{k}^{2}t)\phi_{k}(x)\phi_{k}(y)\right|\leq\kappa_{2}t^{-q/2}\exp\left(-\kappa_{3}\frac{d(x,y)^{2}}{t}\right)\tag{3.2}$$

$$|\!|\!|\mu|\!|\!|_{G;p}=\left\|\int_{\mathbb{X}}G(\circ,y)d\mu(y)\right\|_{p}.$$

$$\sum_{k\in{\mathbb{Z}}^{q}}(\|k\|^{2}+1)^{-\beta/2}\exp(ik\cdot\circ)$$

$$\Pi_{\lambda}=\mathrm{span}\{\phi_{k}:\lambda_{k}<\lambda\},\qquad\lambda>0,$$

$$E_{\lambda,p}(f)=\inf_{P\in\Pi_{\lambda}}\|f-P\|_{p},\qquad\lambda\in{\mathbb{R}}.$$

$$\|f\|_{W_{r,p}}=\|f\|_{p}+\sup_{n\geq 0}2^{nr}E_{2^{n},p}(f)<\infty.$$

$$\Phi_{n}(H;x,y):=\sum_{k=0}^{\infty}H\left(\frac{\lambda_{k}}{n}\right)\phi_{k}(x)\phi_{k}(y),\qquad n>0,\ x,y\in{\mathbb{X}}.\tag{3.7}$$

$$\sigma_{n}(H;\mu)(x)=\int_{\mathbb{X}}\Phi_{n}(H;x,y)d\mu(y),\qquad x\in{\mathbb{X}},\ n>0.$$

$$\hat{\mu}(k)=\int_{\mathbb{X}}\phi_{k}(y)d\mu(y),\qquad k=0,1,\dots.$$

$$\sigma_{n}(H;\mu)(x)=\sum_{k=0}^{\infty}H\left(\frac{\lambda_{k}}{n}\right)\hat{\mu}(k)\phi_{k}(x),\qquad n>0,\ x\in{\mathbb{X}}.$$

$$\widehat{\nu_{n}}(k)=h(\lambda_{k}/2^{n})\{\hat{\mu}(k)+\epsilon_{k}\},\qquad k=0,1,\dots.\tag{4.1}$$

$$\mu_{n}(B)=\int_{B}\int_{\mathbb{X}}\Phi_{2^{n}}(h;x,y)d\mu(y)d\mu^{*}(x)=\int_{B}\sigma_{2^{n}}(h;\mu)(x)d\mu^{*}(x),\qquad n\geq 0,$$

Full text

Super-resolution meets machine learning: approximation of measures

H. N. Mhaskar

Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711. The research of this author is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2018-18032000002. email: [email protected]

Abstract

The problem of super-resolution in general terms is to recuperate a finitely supported measure $\mu$ given finitely many of its coefficients $\hat{\mu}(k)$ with respect to some orthonormal system. The interesting case concerns situations where the number of coefficients required is substantially smaller than a power of the reciprocal of the minimal separation among the points in the support of $\mu$ .

In this paper, we consider the more severe problem of recuperating $\mu$ approximately without any assumption on $\mu$ beyond having a finite total variation. In particular, $\mu$ may be supported on a continuum, so that the minimal separation among the points in the support of $\mu$ is $0$ . A variant of this problem is also of interest in machine learning as well as the inverse problem of de-convolution.

We define an appropriate notion of a distance between the target measure and its recuperated version, give an explicit expression for the recuperation operator, and estimate the distance between $\mu$ and its approximation. We show that these estimates are the best possible in many different ways.

We also explain why for a finitely supported measure the approximation quality of its recuperation is bounded from below if the amount of information is smaller than what is demanded in the super-resolution problem.

Keywords: Super-resolution, machine learning, de-convolution, data defined spaces, widths.

1 Introduction

This paper is motivated by two apparently disjoint areas: super-resolution and machine learning. A problem of interest in both of these areas is the approximation of a measure using a finite amount of information on the measure. Thus we wish to develop a theory of (weak-star) approximation of measures. We will describe our motivation and the connections of this work to the problem of super-resolution and the problem of machine learning in Sections 1.1 and 1.2 respectively. The aims and contributions of this paper, and its outline, are given in Section 1.3. This section and the next being introductory in nature, the notation used in these two sections may not be the same as the one used in the remainder of the paper.

1.1 Super-resolution

The problem of super-resolution is stated by Donoho [12] as follows. Given observations of the form

$$\sum_{k\in{\mathbb{Z}}}a_{k}\exp(-i\omega k\Delta)+z(\omega),\qquad|\omega|\leq\Omega,$$

where $\{a_{k}\}$ is a sequence of complex numbers, $\Delta,\Omega>0$ , and $z$ represents a perturbation subject to the condition that $\int_{-\Omega}^{\Omega}|z(\omega)|^{2}d\omega\leq\epsilon$ , recuperate the sequence $\{a_{k}\}$ up to an accuracy ${\cal O}(\epsilon)$ in the sense of optimal recovery, when $\Omega\Delta<\pi$ . It is shown in [12] that this is not possible in general, but it is possible under some sparsity assumptions.

Relevant to the current paper is a generalization stated by Candès and Fernandez-Granda in [6, Theorem 1.2]. We denote the quotient space ${\mathbb{R}}/(2\pi{\mathbb{Z}})$ by $\mathbb{T}$ . Let

$$\mu(t)=\sum_{k=1}^{K}a_{k}\delta(t-t_{k})+Z(t),$$

where $\delta$ denotes the Dirac delta measure at $0$ . We assume that the moments

$$\sum_{k=1}^{K}a_{k}\exp(-ijt_{k})+\int_{\mathbb{T}}Z(t)\exp(-ijt)dt,$$

are known for $|j|\leq n$ for some integer $n\geq 1$ . The goal is to estimate how much degradation to expect when this data is extended for the values of $j$ with $n<|j|\leq N$ for some larger integer $N$ . Since the degradation is expected to be greater as $|j|$ increases, the authors propose to measure this degradation using the Fejér kernel

$$F_{N}(t)=\frac{1}{N+1}\sum_{|j|\leq N}(1-|j|/N)\exp(ijt).$$

They prove that if

$$\eta=\min_{j\neq k}|t_{j}-t_{k}|\geq 2/n,\tag{1.3}$$

then a measure $\nu_{n}$ obtained by the solution of an optimization problem satisfies

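$$\int_{\mathbb{T}}\left|\int_{\mathbb{T}}F_{N}(x-t)(d\mu(t)-d\nu_{n}(t))\right|dx\leq c(N/n)^{2}\int_{\mathbb{T}}|Z(t)|dt,\tag{1.4}$$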

where $c$ is a positive constant. We note that the number of observations considered known is $2n+1$ . Thus, the condition (1.3) is a lower bound on the amount of information required in terms of the minimal separation in order to guarantee a stable recovery as measured in (1.4).
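The separation condition (1.3) can be turned into a small calculation: given the support points, find the minimal separation $\eta$ and the smallest $n$ with $\eta\geq 2/n$ , hence the number $2n+1$ of moments considered known. The following is a minimal sketch under stated assumptions: the helper names `minimal_separation` and `moments_needed` are hypothetical, and measuring distances on $\mathbb{T}$ with wrap-around is an assumption that the text does not spell out.

```python
import math
import numpy as np

def minimal_separation(points):
    """Smallest wrap-around distance between neighbouring points on R/(2*pi*Z).

    Assumption: distances are measured on the circle, so the gap between the
    largest and the smallest point (through the identification) also counts.
    """
    t = np.sort(np.mod(np.asarray(points, dtype=float), 2 * np.pi))
    gaps = np.diff(np.concatenate((t, [t[0] + 2 * np.pi])))
    return float(gaps.min())

def moments_needed(points):
    """Smallest n with eta >= 2/n, and the matching number 2n+1 of moments."""
    eta = minimal_separation(points)
    n = math.ceil(2.0 / eta)
    return n, 2 * n + 1

# Three point masses whose closest pair is 0.25 apart.
n, num_moments = moments_needed([0.0, 0.25, 3.0])
print(n, num_moments)
```

Halving the minimal separation doubles the required $n$ , which is the sense in which the amount of information is tied to $\eta$ .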

There is a vast amount of literature on this problem, where the problem is referred to by different names: problem of hidden periodicities (e.g., [24, Chapter IV, Section 22]), direction finding in phased array antennas (e.g., [21]), detection of singularities (e.g., [19, 13, 35]), parameter estimation in exponential sums (e.g., [17, 33]), etc. The oldest we are aware of is the paper [8] of Prony, where the problem is considered without noise. We note also that there is some effort [31, 3, 2] in the direction of overcoming this barrier in the case of univariate trigonometric setting, where the information is in the form

$$\sum_{k=1}^{K}\sum_{r=0}^{R}a_{k,r}(-ij)^{r}\exp(-ijt_{k}),\qquad|j|<N,$$

so that each $t_{k}$ appears with multiplicity $R+1$ .

We will not even begin to list the many modern works in the context of the periodic problem. The problem has been studied also in other settings; for example, the sphere (e.g., [4, 5]) or the rotation group [18], where the exponential monomials are replaced by eigenfunctions of the Laplace-Beltrami operator. In all this work, it is required that the number of Fourier coefficients known about the measure is at least a constant multiple of $\eta^{-1/q}$ , where $q$ is the dimension of the space involved.

This sort of condition seems to be an inherent barrier; we will refer to it as the minimal separation barrier. We note that any finite set of points will have a positive minimal separation; the condition refers to the amount of information necessary to recuperate the measure to given accuracy.

In this paper, we are interested in approximating an arbitrary measure, not just a measure supported on a finite set. When the support of the target measure is a continuum, then the minimal separation is $0$ , and any recuperation with a finite amount of noisy data is arguably beyond super-resolution. Of course, even in the absence of noise, an exact recuperation cannot be expected in general using only a finite amount of information. On the other hand, an approximate recovery is very standard in the trigonometric case; the entire section [36, Chapter 2, Section 8] describes already many constructions, some of which are used in [19] in the case of finitely supported measures.

To summarize, the question of overcoming the minimal separation barrier in super-resolution problems for point masses can be viewed as the problem of efficient approximation of a measure given a finite amount of information about the measure.

1.2 Machine learning

A central problem in machine learning is to find a target function $f$ on a space ${\mathbb{X}}$ , equipped with a probability measure $\mu^{*}$ , given the information $f(x_{j})+\epsilon_{j}$ , $j=1,\cdots,M$ , for some points $x_{j}\in{\mathbb{X}}$ chosen randomly from $\mu^{*}$ , where $\epsilon_{j}$ is a random noise. Typically, the function $f$ is defined on a very high dimensional space, say a space of dimension $Q$ . In this case, there are well known results in approximation theory, known as width theorems, which give a lower bound of the form $M^{-\gamma/Q}$ on how accurately one can approximate a function, for which the only available a priori information is that it belongs to a smoothness class indexed by $\gamma$ (e.g., a Sobolev class) [9]. This is known as the curse of dimensionality.
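To make the lower bound concrete, here is a short worked calculation. It is only an illustration: the constant in the width bound is set to $1$ , and the values of $\gamma$ , $Q$ and the target accuracy $\varepsilon$ are hypothetical choices, not taken from the paper.

```latex
% Width-type lower bound (constant normalized to 1 for illustration):
%   error >= M^{-\gamma/Q}.
% Reaching accuracy \varepsilon therefore requires
\[
  M^{-\gamma/Q}\le\varepsilon
  \quad\Longleftrightarrow\quad
  M\ge\varepsilon^{-Q/\gamma}.
\]
% Illustrative numbers with \gamma = 2 and \varepsilon = 10^{-1}:
\[
  Q=4:\quad M\ge 10^{4/2}=10^{2},
  \qquad
  Q=100:\quad M\ge 10^{100/2}=10^{50}.
\]
```

The required number of samples grows exponentially in the dimension $Q$ for a fixed smoothness $\gamma$ , which is the curse of dimensionality referred to above.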

In recent years, deep networks have caused a revolution in machine learning, with many spectacular achievements in industrial problems. It is therefore an important problem to examine why and when deep networks perform better than the so called shallow networks. We have argued in [30] that one reason that deep networks perform better than shallow networks is that many functions of practical interest have a compositional structure which deep networks can exploit and shallow networks cannot. For example, suppose that the target function $f$ is known to have the structure

$$f(x_{1},\dots,x_{4})=f_{1}\big(f_{12}(x_{1},x_{2}),f_{34}(x_{3},x_{4})\big),$$

where the functions $f_{1}$ , $f_{12}$ , $f_{34}$ are continuously differentiable on a cube $[-1,1]^{2}$ . A shallow network of the form

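$$\sum_{k=1}^{N}a_{k}\sigma(w_{k}\cdot x+b_{k}),\qquad a_{k},b_{k}\in{\mathbb{R}},\ w_{k}\in{\mathbb{R}}^{4},$$

where $\sigma$ is a suitable activation function, yields an approximation error of $O(N^{-1/4})$ [26].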

However, if we use the same construction as in [26] to obtain shallow networks $P_{1},P_{12},P_{34}$ , each with $N$ terms to approximate $f_{1},f_{12},f_{34}$ respectively, one gets an accuracy $O(N^{-1/2})$ in each approximation. By the triangle inequality the deep network given by

$$P(x_{1},\dots,x_{4})=P_{1}\big(P_{12}(x_{1},x_{2}),P_{34}(x_{3},x_{4})\big)$$

yields an accuracy $O(N^{-1/2})$ , using only $O(N)$ parameters.
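The triangle-inequality step can be spelled out as follows. This is a sketch under stated assumptions: $L$ denotes a Lipschitz constant of $f_{1}$ (which exists because $f_{1}$ is continuously differentiable on a compact cube), and the values of the inner networks are assumed to stay inside the cube on which $P_{1}$ approximates $f_{1}$ . The numbers at the end ignore all constants.

```latex
% Write u = (f_{12}, f_{34}) and v = (P_{12}, P_{34}), evaluated at (x_1,...,x_4).
\[
  |f - P|
  \le |f_{1}(u) - f_{1}(v)| + |f_{1}(v) - P_{1}(v)|
  \le L\bigl(|f_{12}-P_{12}| + |f_{34}-P_{34}|\bigr) + \|f_{1}-P_{1}\|_{\infty}
  = O(N^{-1/2}).
\]
% Terms needed for accuracy \varepsilon = 10^{-2} (constants ignored):
\[
  \text{shallow: } N^{-1/4}\le 10^{-2}\ \Rightarrow\ N\ge 10^{8},
  \qquad
  \text{deep: } N^{-1/2}\le 10^{-2}\ \Rightarrow\ N\ge 10^{4}
  \ \text{per constituent, } 3\times 10^{4}\text{ in total}.
\]
```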

This theory suggests on the other hand that deep networks do not give any advantage when there is no curse of dimensionality. There is some research exploring other prior assumptions on the target function which ensure that there is no curse of dimensionality of the kind described above.

Let us illustrate the situation by an example. Assume that $f$ admits a representation of the form

$$f(x)=\int_{\mathbb{X}}G(x,y)d\mu(y),\qquad x\in{\mathbb{X}},$$

for some measure $\mu$ having a bounded total variation on ${\mathbb{X}}$ . Then it is known that one can obtain approximations to $f$ by linear combinations of the form $\sum_{j=1}^{n}a_{j}G(x,y_{j})$ with the degree of approximation being dimension-independent in terms of $n$ , often tractable as well (e.g., [1, 20, 22, 23, 27]). The total variation of $\mu$ , sometimes known as the $G$ -variation of $f$ , plays the role of the norm of the derivatives of $f$ in classical approximation theory. The proofs of such theorems depend upon a probabilistic argument, and are not constructive. Therefore, in practice, the parameters $a_{j}$ , $y_{j}$ are determined using some learning algorithm.

Thus, in theory, the problem is to determine $\mu$ given finitely many samples of $f$ , even though the number of these samples may not be dimension independent. Let us assume that ${\mathbb{X}}$ is a compact Riemannian manifold on which $G$ has a Mercer expansion of the form $\sum_{k=0}^{\infty}\Lambda_{k}\phi_{k}(x)\phi_{k}(y)$ for some orthonormal system $\{\phi_{k}\}$ . Then it is shown under some conditions in [16] (see also [14]) that one can obtain quadrature formulas to integrate all linear combinations of the first few functions $\{\phi_{k}\}$ ; the number of these functions depends on the number of points at which the values of $f$ are available. Thus, in theory, the inverse problem of recuperating $\mu$ from the samples of $f$ is reduced to recuperating $\mu$ (respectively, its discretized version) using finitely many Fourier coefficients of $\mu$ with respect to the system $\{\phi_{k}\}$ .

A question of theoretical interest here is to estimate the degree of approximation of $\mu$ in a suitable sense in terms of the number of Fourier coefficients that can be computed reliably from the data. This is the same problem that we were led to in our musings on super-resolution in Section 1.1, thereby establishing a close connection between these two problems.

1.3 Contributions of this paper

The problem of approximating a measure in the weak-star sense is inherently different from that of approximation of functions, and also from that of an exact or approximate recuperation of the support of the measure. In particular, in the context of machine learning and probability density estimation, it is customary to use a “convolution” with a positive kernel. As an approximation device, it is well known that this is doomed not to give a good approximation. However, if we use a non-positive kernel to guarantee good approximation as we propose to do in this paper, then there are some difficulties in recuperating the support of the measure exactly and directly without some non-linear operations such as thresholding and clustering. In this paper, we are focused on approximation of measures, and will postpone the discussion of these other issues to future work.

We will describe how we address the difference between approximation of functions and that of measures in the current paper.

To study an approximation problem, one needs the notion of a distance between two objects and some notion of smoothness of the target object to be approximated. In classical theory of function approximation, there are standard ways of defining both of these; e.g., in approximating a continuous $2\pi$ -periodic function by trigonometric polynomials, one uses the uniform norm and the smoothness is measured by an appropriate modulus of smoothness. Such questions have been studied in many different contexts, e.g., [10].

In contrast, there is no standard definition for measuring the distance between two measures so that the convergence in the topology corresponds to weak-star convergence. There are several ways of defining such a distance in different application domains, and we will list a few of these in Section 2 to motivate our own definition. However, we are not aware of any standard smoothness class for measures. In our Theorem 4.1 below, we will observe that with no assumption on the target measure, the degree of approximation depends entirely on the definition of the distance. Since an estimate on the degree of approximation is typically (and in our theorem, is) achieved using a specific construction (given in (4.1)), this surprising fact gives rise to the question whether one could obtain better estimates using a different construction, or even using a different kind of information about the measure. We will discuss these issues in Theorem 4.3 and Theorem 4.4. In particular, Theorem 4.4 provides one explanation for the minimal separation barrier in super-resolution of point masses as described in Section 1.1. Another natural question to ask is to understand what a better estimate on the degree of approximation allows us to conclude about the measure, analogous to the converse theorems of approximation theory. A trivial case is when the target measure is absolutely continuous with respect to a base measure, and the Radon-Nikodym derivative is then approximated as in the classical function approximation paradigm. We will prove in Theorem 4.2 that an improvement on the approximation bounds in Theorem 4.1 implies that the target measure is in fact absolutely continuous with respect to a base measure and the derivative is in the right smoothness class as expected in the theory of function approximation.

In Section 2, we review a few notions of distance between measures in order to motivate our Definition 3.2. In Section 3, we develop the set up for our theory, and establish some notation to be used in the subsequent sections. The main results are discussed in Section 4, and the proofs of all the new results are given in Section 5.

We thank Professor Dr. Hans Feichtinger for many useful comments on the presentation in this paper.

2 Distance between measures

In order to discuss the quality of approximate recuperation of measures, we need first to develop a notion of distance between measures. There are many ways of defining a distance. We mention a few of these to motivate Definition 3.2 which we will use in this paper.

In the univariate case, a very old way to define a distance is the Erdős-Turán discrepancy (known also as Kolmogorov-Smirnov statistic in statistics and star discrepancy in information based complexity). In the context of measures on $\mathbb{T}={\mathbb{R}}/(2\pi{\mathbb{Z}})$ (identified with $[-\pi,\pi)$ ), this is defined for a signed measure $\nu$ with $\nu(\mathbb{T})=0$ by

$$D^{ET}(\nu)=\sup_{x\in[-\pi,\pi)}|\nu([-\pi,x))|.$$

A comparison of Fourier coefficients shows that

$$D^{ET}(\nu)=\left\|\int_{\mathbb{T}}G(\circ-t)d\nu(t)\right\|_{\mathbb{T},\infty},\tag{2.2}$$

where $G$ denotes the Bernoulli spline defined by $\displaystyle G(u)=\sum_{k=1}^{\infty}\frac{\sin ku}{k}$ , $u\in[-\pi,\pi)$ . In the form (2.2), this notion of discrepancy is generalized using many different kernels on different high dimensional domains (e.g., [32, 11]). A similar notion in statistics is the so called maximum mean discrepancy (MMD), defined by

$$\left\|\int_{\mathbb{X}}G(\circ,t)d\nu(t)\right\|_{{\mathbb{X}},2}$$

for some measure space ${\mathbb{X}}$ and a positive definite kernel $G$ defined on this space.
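For a finitely supported signed measure, the Erdős-Turán discrepancy defined at the start of this section as $\sup_{x\in[-\pi,\pi)}|\nu([-\pi,x))|$ reduces to the largest absolute partial sum of the masses taken in increasing order of location. The sketch below illustrates this; the function name `erdos_turan_discrepancy` is hypothetical, and the code assumes all points lie in $[-\pi,\pi)$ .

```python
import numpy as np

def erdos_turan_discrepancy(points, weights):
    """sup over x of |nu([-pi, x))| for nu = sum_j weights[j] * delta_{points[j]}.

    Assumes every point lies in [-pi, pi). The distribution function
    x -> nu([-pi, x)) is a step function, so the supremum is the largest
    absolute value among its finitely many levels.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Merge atoms sharing a location; np.unique also sorts the locations.
    locations, inverse = np.unique(points, return_inverse=True)
    masses = np.bincount(inverse, weights=weights, minlength=locations.size)
    # Levels of the step function: 0 before the first atom, then partial sums.
    levels = np.concatenate(([0.0], np.cumsum(masses)))
    return float(np.max(np.abs(levels)))

# nu = 0.5*delta_{-2} + 0.5*delta_{0} - delta_{1}, a signed measure with nu(T) = 0.
print(erdos_turan_discrepancy([-2.0, 0.0, 1.0], [0.5, 0.5, -1.0]))
```

Merging coincident atoms matters: without it, intermediate partial sums that are never attained by the distribution function could inflate the result.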

Another popular distance between measures is the $L^{1}$ -Wasserstein distance. Let ${\mathbb{X}}$ be a metric space and $\nu$ be a Borel measure on this space with $\nu({\mathbb{X}})=0$ . One of the equivalent definitions of this distance is given by $\displaystyle\sup\left|\int_{\mathbb{X}}fd\nu\right|,$ where the supremum is over all Lipschitz continuous functions $f$ on ${\mathbb{X}}$ with Lipschitz constant $\leq 1$ . If ${\mathbb{X}}$ is a manifold, $\Delta$ is the Laplace-Beltrami operator on ${\mathbb{X}}$ , an analogue of this distance, more responsive to the manifold structure, is obtained by taking the supremum over all functions with $\|\Delta(f)\|_{{\mathbb{X}},\infty}\leq 1$ . Denoting the Green function for $\Delta$ by $G$ , this in turn is equivalent to

$$\left\|\int_{\mathbb{X}}G(\circ,t)d\nu(t)\right\|_{{\mathbb{X}},1}.$$

Finally, we note that the estimate (1.4) utilizes a semi-norm of the form

$$\left\|\int_{\mathbb{T}}F_{N}(\circ-t)d\nu(t)\right\|_{\mathbb{T},1},$$

where $F_{N}$ is the Fejér kernel with Fourier coefficients equal to $0$ outside of $[-N,N]$ .

3 Notation and definitions

In this section, we describe the general set up for our discussion, and establish notation.

Let ${\mathbb{X}}$ be a locally compact metric measure space, with $d$ denoting the metric on ${\mathbb{X}}$ , and $\mu^{*}$ being a distinguished positive measure on ${\mathbb{X}}$ . In the sequel, only complete, sigma finite, Borel measures are considered, defined on a sigma algebra $\mathfrak{M}$ containing all Borel subsets of $\mathbb{X}$ . In the sequel, $\nu$ -measurability will be understood in the sense of membership in this fixed sigma algebra.

For $B\subseteq\mathbb{X}$ , $\nu$ -measurable, and a $\nu$ -measurable function $f:B\to{\mathbb{R}}$ we write

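$$\|f\|_{\nu;B,p}:=\begin{cases}\displaystyle\left\{\int_{B}|f(x)|^{p}d|\nu|(x)\right\}^{1/p},&\text{if }1\leq p<\infty,\\ \displaystyle|\nu|\text{-}\mathop{\mathrm{ess\,sup}}_{x\in B}|f(x)|,&\text{if }p=\infty.\end{cases}$$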

$L^{p}(\nu;B)$ denotes the class of all $\nu$ –measurable functions $f$ for which $\|f\|_{\nu;B,p}<\infty$ , where two functions are considered equal if they are equal $|\nu|$ –almost everywhere. We will omit the mention of $\nu$ if $\nu=\mu^{*}$ and that of $B$ if $B=\mathbb{X}$ . Thus, $L^{p}=L^{p}(\mu^{*};\mathbb{X})$ . For $1\leq p\leq\infty$ , we define $p^{\prime}=p/(p-1)$ with the usual understanding that $1^{\prime}=\infty$ , $\infty^{\prime}=1$ . The symbol $C_{0}(B)$ denotes the space of all continuous real functions on $B$ vanishing at infinity; $C_{0}=C_{0}({\mathbb{X}})$ . The symbol $C_{0}^{*}$ will denote the dual space of $C_{0}$ ; i.e., the class of all regular, Borel, measures with bounded total variation.

We also need a non-decreasing sequence $\{\lambda_{k}\}_{k=0}^{\infty}$ of real numbers, and an ( $\mu^{*}$ -) orthonormal system of functions $\{\phi_{k}\}_{k=0}^{\infty}$ in $C_{0}({\mathbb{X}})\cap L^{1}({\mathbb{X}})$ . We assume that $\lambda_{0}=0$ , and $\lim_{k\to\infty}\lambda_{k}=\infty$ . In addition we assume that the system $\{\phi_{k}\}_{k=0}^{\infty}$ is fundamental in both $L^{1}$ and $C_{0}$ .

Definition 3.1

The system $\Xi=(\mathbb{X},d,\mu^{*},\{\lambda_{k}\}_{k=0}^{\infty},\{\phi_{k}\}_{k=0}^{\infty})$ is called an admissible system if

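1. For each $x\in\mathbb{X}$ and $r>0$ , the ball $\mathbb{B}(x,r)$ is compact.

2. There exists $q>0$ and $\kappa_{1},\kappa_{2},\kappa_{3}>0$ such that for $x\in\mathbb{X}$ , $r>0$ ,

$$\mu^{*}(\mathbb{B}(x,r))=\mu^{*}(\{y\in\mathbb{X}:d(x,y)<r\})\leq\kappa_{1}r^{q}.$$

3. For $x,y\in\mathbb{X}$ , $0<t\leq 1$ ,

$$\left|\sum_{k=0}^{\infty}\exp(-\lambda_{k}^{2}t)\phi_{k}(x)\phi_{k}(y)\right|\leq\kappa_{2}t^{-q/2}\exp\left(-\kappa_{3}\frac{d(x,y)^{2}}{t}\right).\tag{3.2}$$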

Remark 3.1

In some of our other papers we have referred to an admissible system in the sense of the above definition as a data defined space. This is motivated by an idea for semi-supervised learning, called diffusion geometry/manifold learning. One assumes that the data for this kind of machine learning problem lives on an unknown low dimensional sub-manifold of a high dimensional Euclidean space. The learning takes place based on the eigen-decomposition of a suitably constructed graph Laplacian. In theory, one may assume the eigen-decomposition of the heat kernel with respect to an elliptic differential operator on the manifold itself. The properties of this heat kernel play a central role in the theoretical development. In particular, it is shown in [29, Theorem 4.3] that the condition (3.2) implies the localization properties of the kernels $\Phi_{n}$ defined in (3.7) below, which, in turn, play a crucial role in this paper via Proposition 5.1. $\blacksquare$

Constant convention:

In the sequel, the symbols $c,c_{1},\cdots$ will denote generic positive constants depending only on the system $\Xi$ and other constant parameters under discussion. Their values may be different at different occurrences, even within a single formula. The notation $A\sim B$ means $c_{1}A\leq B\leq c_{2}A$ . $\blacksquare$

We now define a candidate for a semi-norm on $C_{0}^{*}$ which will be used in this paper.

Definition 3.2

Let $G:{\mathbb{X}}\times{\mathbb{X}}\to{\mathbb{R}}$ be a kernel that admits a formal Mercer expansion $\displaystyle G(x,y)\!=\!\sum_{j=0}^{\infty}b(\lambda_{j})\phi_{j}(x)\phi_{j}(y)$ , where $b(\lambda_{j})\geq 0$ for every $j\geq 0$ . For $\mu\in C_{0}^{*}$ and $1\leq p\leq\infty$ , we define formally

$$|\!|\!|\mu|\!|\!|_{G;p}=\left\|\int_{\mathbb{X}}G(\circ,y)d\mu(y)\right\|_{p}.$$

We will be particularly interested in the following class of kernels (cf. [28]):

Definition 3.3

Let $\beta\in{\mathbb{R}}$ . A function $b:{\mathbb{R}}\to{\mathbb{R}}$ will be called a mask of type $\beta$ if $b$ is an even, $S$ times continuously differentiable function such that for $t>0$ , $b(t)=(1+t)^{-\beta}F_{b}(\log t)$ for some $F_{b}:{\mathbb{R}}\to{\mathbb{R}}$ such that $|{F_{b}}^{(k)}(t)|\leq c(b)$ , $t\in{\mathbb{R}}$ , $k=0,1,\cdots,S$ , and $F_{b}(t)\geq c_{1}(b)$ , $t\in{\mathbb{R}}$ . A function $G:{\mathbb{X}}\times{\mathbb{X}}\to{\mathbb{R}}$ will be called a kernel of type $\beta$ if it admits a formal expansion $G(x,y)=\sum_{j=0}^{\infty}b(\lambda_{j})\phi_{j}(x)\phi_{j}(y)$ for some mask $b$ of type $\beta>0$ . If we wish to specify the connection between $G$ and $b$ , we will write $G(b;x,y)$ in place of $G$ .

Example 3.1

We consider ${\mathbb{X}}=\mathbb{T}^{q}$ . If $\beta>0$ , then the kernel defined formally by

$$\sum_{k\in{\mathbb{Z}}^{q}}(\|k\|^{2}+1)^{-\beta/2}\exp(ik\cdot\circ)$$

is a kernel of type $\beta$ . $\blacksquare$

Example 3.2

We consider the unit sphere $\SS^{q}=\{\mathbf{x}\in{\mathbb{R}}^{q+1}:|\mathbf{x}|_{2}=1\}$ . If $\beta>0$ , the kernel defined formally by $G(\mathbf{x},\mathbf{y})=(1-\mathbf{x}\cdot\mathbf{y})^{(\beta-q)/2}$ is a kernel of type $\beta$ ([34, Section 9.3(4)]). $\blacksquare$

When $G$ is a kernel as defined in Definition 3.3, $|\!|\!|\circ|\!|\!|_{G;p}$ is a norm consistent with the weak-star topology on $C_{0}^{*}$ . We will give a proof of the following simple proposition in Section 5.

Proposition 3.1

Let $1\leq p\leq\infty$ , $\beta>q/p^{\prime}$ , $G$ be a kernel of type $\beta$ . Then the functional $|\!|\!|\circ|\!|\!|_{G;p}$ defines a norm on $C_{0}^{*}$ . If $\{\nu_{n}\}_{n=0}^{\infty}$ is a sequence in $C_{0}^{*}$ , $\nu\in C_{0}^{*}$ then $\nu_{n}\stackrel{{\scriptstyle*}}{{\to}}\nu$ if and only if $|\!|\!|\nu_{n}-\nu|\!|\!|_{G;p}\to 0$ .
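A numerical illustration of Proposition 3.1 on the circle ( $q=1$ , $p=2$ ): as $\varepsilon\to 0$ , the measures $\delta_{\varepsilon}$ converge weak-star to $\delta_{0}$ although their total variation distance stays equal to $2$ , and the $G$ -norm of the difference should shrink. The sketch below is written under explicit assumptions that are not taken from the paper: $\mu^{*}=dx/(2\pi)$ , the complex exponentials as orthonormal system, the kernel of Example 3.1 with $\beta=1>q/p^{\prime}=1/2$ , Parseval's identity to evaluate the $L^{2}$ norm, and truncation of the series at a hypothetical cutoff `K`.

```python
import numpy as np

def g_norm_p2(points, weights, beta=1.0, K=20000):
    """Approximate |||nu|||_{G;2} on the circle for a finitely supported nu.

    Assumptions: mu* = dx/(2*pi), orthonormal system exp(i*k*x), and the
    kernel G(x, y) = sum_k (k^2 + 1)^(-beta/2) exp(i*k*(x - y)). By Parseval,
    the squared norm is sum_k b_k^2 |nu_hat(k)|^2, truncated here at |k| <= K.
    """
    k = np.arange(-K, K + 1)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # nu_hat(k) = sum_j weights[j] * exp(-i * k * points[j])
    nu_hat = np.exp(-1j * np.outer(k, points)) @ weights
    b = (k.astype(float) ** 2 + 1.0) ** (-beta / 2.0)
    return float(np.sqrt(np.sum((b * np.abs(nu_hat)) ** 2)))

# Distance between delta_0 and delta_eps in the G-norm, for shrinking eps.
norms = [g_norm_p2([0.0, eps], [1.0, -1.0]) for eps in (1.0, 0.1, 0.01)]
print(norms)
```

The printed values decrease with $\varepsilon$ , in line with the weak-star consistency stated in the proposition.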

Next, we define some smoothness classes of functions in terms of their degree of approximation by linear combinations of $\{\phi_{k}\}$ . We define

$$\Pi_{\lambda}=\mathrm{span}\{\phi_{k}:\lambda_{k}<\lambda\},\qquad\lambda>0,$$

and $\displaystyle\Pi_{\infty}=\bigcup_{\lambda>0}\Pi_{\lambda}$ . If $\lambda\leq 0$ , we denote $\Pi_{\lambda}=\{0\}$ . Following [25], we refer to the elements of $\Pi_{\infty}$ as diffusion polynomials. The $L^{p}$ -closure of $\Pi_{\infty}$ is denoted by $X^{p}$ ; i.e., $X^{p}=L^{p}$ if $1\leq p<\infty$ , and $C_{0}$ if $p=\infty$ .

If $1\leq p\leq\infty$ and $f\in L^{p}$ , we define

$$E_{\lambda,p}(f)=\inf_{P\in\Pi_{\lambda}}\|f-P\|_{p},\qquad\lambda\in{\mathbb{R}}.$$

If $r>0$ then the smoothness class $W_{r,p}$ is the set of all $f\in X^{p}$ such that

$$\|f\|_{W_{r,p}}=\|f\|_{p}+\sup_{n\geq 0}2^{nr}E_{2^{n},p}(f)<\infty.$$

Our main tool in the recuperation of measures is a localized kernel. Given a compactly supported function $H:{\mathbb{R}}\to{\mathbb{R}}$ , we define:

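$$\Phi_{n}(H;x,y):=\sum_{k=0}^{\infty}H\left(\frac{\lambda_{k}}{n}\right)\phi_{k}(x)\phi_{k}(y),\qquad n>0,\ x,y\in{\mathbb{X}}.\tag{3.7}$$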

For $\mu\in C_{0}^{*}$ , we define formally

$$\sigma_{n}(H;\mu)(x)=\int_{\mathbb{X}}\Phi_{n}(H;x,y)d\mu(y),\qquad x\in{\mathbb{X}},\ n>0.$$

We write

$$\hat{\mu}(k)=\int_{\mathbb{X}}\phi_{k}(y)d\mu(y),\qquad k=0,1,\dots.$$

Then

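$$\sigma_{n}(H;\mu)(x)=\sum_{k=0}^{\infty}H\left(\frac{\lambda_{k}}{n}\right)\hat{\mu}(k)\phi_{k}(x),\qquad n>0,\ x\in{\mathbb{X}}.$$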

We note that $\sigma_{n}(H;\mu)$ can be identified with the measure $\sigma_{n}(H;\mu)d\mu^{*}$ . In general, if $\mu$ is absolutely continuous, so that $d\mu=fd\mu^{*}$ for some $f\in L^{1}$ , then by an abuse of the notation we write $\hat{f}(k)$ for $\hat{\mu}(k)$ , and likewise, $\sigma_{n}(H;f)$ for $\sigma_{n}(H;\mu)$ .
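The operator $\sigma_{n}(H;\mu)$ can be tried out on the circle for a finitely supported measure. The following sketch is an illustration under assumptions that are not part of the paper: $\mathbb{X}=\mathbb{T}$ with $\mu^{*}=dx/(2\pi)$ , the complex exponentials $\exp(ikx)$ with $\lambda_{k}=|k|$ in place of a real orthonormal system, and one standard construction of a smooth cutoff $H$ (equal to $1$ on $[0,1/2]$ and $0$ on $[1,\infty)$ ). The names `smooth_cutoff` and `sigma_n` are hypothetical.

```python
import numpy as np

def smooth_cutoff(t):
    """Even, infinitely differentiable cutoff: 1 on [0, 1/2], 0 on [1, inf)."""
    t = np.abs(np.atleast_1d(np.asarray(t, dtype=float)))

    def g(s):
        # exp(-1/s) for s > 0 and 0 otherwise; smooth on the whole real line.
        out = np.zeros_like(s)
        pos = s > 0
        out[pos] = np.exp(-1.0 / s[pos])
        return out

    up, down = g(1.0 - t), g(t - 0.5)
    return up / (up + down)  # the denominator never vanishes

def sigma_n(points, weights, n, x):
    """sigma_n(H; mu)(x) for mu = sum_j weights[j] * delta_{points[j]} on the circle."""
    k = np.arange(-n, n + 1)
    # mu_hat(k) = sum_j weights[j] * exp(-i * k * points[j])
    mu_hat = np.exp(-1j * np.outer(k, points)) @ np.asarray(weights, dtype=float)
    filtered = smooth_cutoff(k / n) * mu_hat
    # The sum is real because the cutoff is even and the weights are real.
    return np.real(np.exp(1j * np.outer(x, k)) @ filtered)

# Two point masses: weight 1 at t = -1 and weight 0.5 at t = 0.5.
x = -np.pi + 2 * np.pi * np.arange(4096) / 4096
vals = sigma_n([-1.0, 0.5], [1.0, 0.5], 64, x)
x_peak = x[np.argmax(vals)]
total_mass = vals.mean()  # integral of sigma_n against dx/(2*pi)
print(x_peak, total_mass)
```

In this setting the output is a trigonometric polynomial that concentrates near the two atoms and integrates to the total mass $1.5$ , since $H(0)=1$ ; locating the atoms exactly would still need further processing such as thresholding.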

4 Main results

Our first objective is to estimate the degree of approximation in recuperating a measure $\mu\in C_{0}^{*}$ from noisy measurements of the form $\hat{\mu}(k)+\epsilon_{k}$ , for $k$ with $\lambda_{k}<2^{n}$ . Toward this end, we fix in the rest of this paper, an infinitely differentiable, even function $h:{\mathbb{R}}\to{\mathbb{R}}$ such that $h$ is non-increasing on $[0,\infty)$ , $h(t)=1$ if $0\leq t\leq 1/2$ , $h(t)=0$ if $t\geq 1$ . The constants $c,c_{1},\cdots$ will depend upon $h$ as well.

The approximation to $\mu$ is the measure $\nu_{n}$ , defined spectrally by

$$\widehat{\nu_{n}}(k)=h(\lambda_{k}/2^{n})\{\hat{\mu}(k)+\epsilon_{k}\},\qquad k=0,1,\dots.\tag{4.1}$$

We find it convenient to denote the noiseless recuperation measure by $\mu_{n}$ ; i.e.,

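$$\mu_{n}(B)=\int_{B}\int_{\mathbb{X}}\Phi_{2^{n}}(h;x,y)d\mu(y)d\mu^{*}(x)=\int_{B}\sigma_{2^{n}}(h;\mu)(x)d\mu^{*}(x),\qquad n\geq 0,$$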

for all Borel subsets $B\subseteq{\mathbb{X}}$ .
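The spectral definition (4.1) of $\nu_{n}$ amounts to filtering the noisy coefficients with $h(\lambda_{k}/2^{n})$ and summing the resulting expansion. Below is a minimal sketch on the circle, under the same kind of assumptions as any one-dimensional illustration: $\mu^{*}=dx/(2\pi)$ , complex exponentials with $\lambda_{k}=|k|$ , a particular smooth cutoff standing in for $h$ , and synthetic complex Gaussian noise. The names `smooth_cutoff` and `recover_density` are hypothetical.

```python
import numpy as np

def smooth_cutoff(t):
    """Even smooth cutoff standing in for h: 1 on [0, 1/2], 0 on [1, inf)."""
    t = np.abs(np.atleast_1d(np.asarray(t, dtype=float)))

    def g(s):
        out = np.zeros_like(s)
        pos = s > 0
        out[pos] = np.exp(-1.0 / s[pos])
        return out

    up, down = g(1.0 - t), g(t - 0.5)
    return up / (up + down)

def recover_density(mu_hat, eps, n, x):
    """Density of nu_n with respect to dx/(2*pi), following (4.1).

    mu_hat and eps hold the coefficients and the noise for k = -K, ..., K
    with K = 2**n - 1, i.e. for all frequencies with |k| < 2**n.
    """
    K = 2 ** n - 1
    k = np.arange(-K, K + 1)
    nu_hat = smooth_cutoff(k / 2.0 ** n) * (np.asarray(mu_hat) + np.asarray(eps))
    return np.exp(1j * np.outer(x, k)) @ nu_hat

n = 5
k = np.arange(-(2 ** n - 1), 2 ** n)
mu_hat = np.ones(k.size)                      # coefficients of delta_0
rng = np.random.default_rng(0)
eps = 1e-3 * (rng.standard_normal(k.size) + 1j * rng.standard_normal(k.size))
x = -np.pi + 2 * np.pi * np.arange(256) / 256
clean = recover_density(mu_hat, np.zeros(k.size), n, x)
noisy = recover_density(mu_hat, eps, n, x)
print(np.max(np.abs(noisy - clean)), np.sum(np.abs(eps)))
```

Because $0\leq h\leq 1$ , the pointwise effect of the noise on the density is at most $\sum_{k}|\epsilon_{k}|$ ; this crude bound is only a sanity check and is not the estimate of Theorem 4.1.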

The following theorem shows that the rate at which the degree of approximation of $\mu$ by $\{\nu_{n}\}$ (as a function of $n$ ), measured in the norm given in Definition 3.2, decreases to $0$ depends only on the kernel $G$ . There is no natural way to define a smoothness of the measure $\mu$ .

Theorem 4.1

Let $1\leq p\leq\infty$ , $\beta>q/p^{\prime}$ , $G$ be a kernel of type $\beta$ , and $\mu\in C_{0}^{*}$ , $\nu_{n}$ be defined by (4.1). Let $P_{n}=\sum_{k:\lambda_{k}<n}\epsilon_{k}\phi_{k}$ . Then

[TABLE]

Moreover, for the high pass filter $G_{hi}(x,y)=G(x,y)-\Phi_{2^{n}}(hb_{2^{n}};x,y)$ , we have

[TABLE]

Remark 4.1

We compare this theorem with [6, Theorem 1.2] described in Section 1.1. The analogue of the high pass filter $F_{N}$ is given by $G_{hi}$ . Note that, unlike (1.4), the noise term $\|P_{n}\|_{1}$ has a decreasing influence in the high pass range. Analogous to the kernel $F_{N}$ , the kernel $G_{hi}$ gives a lower weight to the higher frequencies, but unlike the kernel $F_{N}$ , the kernel $G_{hi}$ includes all the high frequency components.

We note that there is no longer any assumption on the minimal separation among the points in the support of the target measure $\mu$. An exact recovery is in general impossible, even in the noise-free case. Being general, our construction in (4.1) does not give an exact recuperation even in the case of finitely supported measures without some further processing, which is beyond the scope of this paper. However, the result is applicable to measures defined on a very general space, and does not require the verification of a signature polynomial as in [7]. Therefore, we expect that the approximation $\nu_{n}$ is easier to construct so as to obtain a good approximation, even if no exact recovery is possible. $\blacksquare$

Remark 4.2

Let $f$ admit a representation of the form

[TABLE]

for some measure $\mu\in C_{0}^{*}$ . A comparison of Fourier coefficients shows that

[TABLE]

Therefore, Theorem 4.1 implies

[TABLE]

In particular, in the case $p=1$, we get bounds nominally sharper than those in [25]. Rather than assuming a condition on $f$ in terms of pseudo-differential operators (informally, choosing $G$ to be a Green function of a pseudo-differential operator), we allow a more general kernel $G$. Also, we no longer require the object $\mathcal{D}_{G}(f)$ defined spectrally by $\widehat{\mathcal{D}_{G}(f)}(k)=\hat{f}(k)/b(\lambda_{k})$, $k=0,1,\cdots$, to be a function in $X^{1}$, but allow it to be a measure. It is explained in [28, 14] how to discretize the quantity $\sigma_{2^{n}}(f)$ based on values of $f$ at scattered data points. This leads to a constructive procedure to obtain an approximation to $f$ by sums of the form $\sum_{j=1}^{M}a_{j}G(\circ,y_{j})$ [28]. However, the error bounds are not dimension independent. Dimension independent bounds can be obtained using concentration inequalities in a probabilistic sense, but then the proof is not constructive. $\blacksquare$

Next, we address the question of whether one can improve upon the bounds in (4.3). For simplicity, we consider the noiseless case; i.e., we assume in the sequel that $P_{n}\equiv 0$. The first theorem below states that one cannot improve the factor of $2^{-n(\beta-q/p^{\prime})}$ to $2^{-n(\beta+r)}$ except in “trivial” cases; i.e., when $\mu=fd\mu^{*}$ for some $f\in W_{r,p}$, so that results from function approximation are applicable directly. Thus, in the case when $p=1$, the estimate (4.3) cannot be improved.

Theorem 4.2

Let $1\leq p\leq\infty$, $\beta>q/p^{\prime}$, $r>0$, $G$ be a kernel of type $\beta$, $\mu\in C_{0}^{*}$ and for each $n\geq 1$, $\mu_{n}$ be defined by (4.2). Then the following are equivalent:

(a) There exists $f\in W_{r,p}$ such that $d\mu=fd\mu^{*}$.

(b) We have

[TABLE]

Another way to examine a possible improvement in (4.3) is using the notion of non-linear widths. We note that the recuperation measure $\mu_{n}$ depends upon the parameters $\hat{\mu}(k)$ , for $k$ such that $\lambda_{k}<2^{n}$ ; i.e., as many parameters as the dimension of $\Pi_{2^{n}}$ . In most manifolds, the eigenfunctions $\phi_{k}$ of the Laplace-Beltrami operator satisfy an additional estimate given in (4.6) below (see [15, 16] for a fuller discussion). In the general set up which we are working with, it is therefore reasonable to assume that there exists $\gamma>0$ such that

[TABLE]

Under this assumption, it is not difficult to verify that the dimension of $\Pi_{2^{n}}$ is $\sim 2^{nq}$ . Thus, in terms of the number $M$ of parameters used in the recuperation, the bound (4.3) for the case $p=1$ is ${\cal O}(M^{-\beta/q})$ . We now proceed to show that this is the best possible.
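The exponent can be checked directly. The following short computation uses only what is stated above: for $p=1$ we have $p^{\prime}=\infty$, so the factor $2^{-n(\beta-q/p^{\prime})}$ in (4.3) reduces to $2^{-n\beta}$, and the number of parameters satisfies $M\sim 2^{nq}$. Hence

$$2^{-n(\beta-q/p^{\prime})}\Big|_{p=1}=2^{-n\beta}=\left(2^{nq}\right)^{-\beta/q}\sim M^{-\beta/q}.$$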

Let $\mathcal{K}$ be a weak-star compact subset of $C_{0}^{*}$. We denote by $\mathcal{S}$ the set of all weak-star continuous mappings $\mathcal{K}\to{\mathbb{R}}^{M}$ (parameter selection maps). An algorithm is a mapping $A:{\mathbb{R}}^{M}\to C_{0}^{*}$. Thus, for any algorithm $A$, parameter selection $S$, and $\mu\in\mathcal{K}$, $A(S(\mu))\in C_{0}^{*}$ is an attempted reconstruction of $\mu$ from the data $S(\mu)$ using the algorithm $A$. We define

[TABLE]

and the nonlinear width of $\mathcal{K}$ in the sense of $|\!|\!|\circ|\!|\!|_{G;p}$ by

[TABLE]

Theorem 4.3

Let ${\mathbb{X}}$ be compact, $\mu^{*}({\mathbb{X}})=1$ , $\beta>0$ , $G$ be a kernel of type $\beta$ . We assume further that there exists $\gamma>0$ such that (4.6) holds. Let

[TABLE]

Then for integer $M\geq 1$ ,

[TABLE]

We end this section with a width result that demonstrates that the minimal separation is an essential barrier to the recuperation of finitely supported measures, not just from the Fourier information, but from any robust parameter selection. Toward this end, let $\eta>0$ and

[TABLE]

Although we do not prescribe the exact number of point masses in the definition above, when ${\mathbb{X}}$ is compact, then a volume argument shows that this number cannot exceed $c\eta^{-q}$ . It is not difficult to show in this case that $\mathcal{K}_{\eta}$ is a compact subset of $C_{0}^{*}$ .

Theorem 4.4

Let ${\mathbb{X}}$ be compact, $\mu^{*}({\mathbb{X}})=1$ , $\beta>0$ , $G$ be a kernel of type $\beta$ . We assume further that (4.6) holds. Then for integer $M\sim\eta^{-q/\beta}$ ,

[TABLE]

Remark 4.3

We remark that $d_{m}(G;\mathcal{K}_{\eta})$ is a decreasing function of $m$. Therefore, the estimate (4.11) shows a lower limit on how accurately a finitely supported measure with the minimal separation of its support equal to $\eta$ can be approximated using $\leq c\eta^{-q/\beta}$ continuously selected parameters. $\blacksquare$

Remark 4.4

In the case when ${\mathbb{X}}=\mathbb{T}$ and $\mu$ is a measure supported on $N$ points, the Prony method can recuperate the measure exactly using $2N+1$ parameters, regardless of the minimal separation among the points. This does not contradict Theorem 4.4, which refers to the worst case error for approximating measures in $\mathcal{K}_{\eta}$. For any $\eta>0$, the class $\mathcal{K}_{\eta}$ contains a measure supported on $N\sim\eta^{-1}$ points and for this measure, $2N+1\sim c\eta^{-1}$. $\blacksquare$
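For readers unfamiliar with the Prony method mentioned in Remark 4.4, the following is a minimal sketch of its classical form in the noiseless case. It is an illustration under assumptions that are not from the paper: it uses the coefficients $\hat{\mu}(k)=\sum_{j}a_{j}e^{-ikt_{j}}$ for $k=0,\cdots,2N-1$ (a common variant of the parameter count), the function name `prony_recover` is hypothetical, and no attempt is made at numerical stability for noisy data or clustered points.

```python
import numpy as np

def prony_recover(moments, N):
    """Recover nodes t_j and weights a_j of mu = sum_j a_j delta_{t_j} from
    moments[k] = sum_j a_j exp(-i k t_j), k = 0, ..., 2N - 1 (noiseless)."""
    m = np.asarray(moments, dtype=complex)[: 2 * N]
    # Hankel system for the monic Prony polynomial
    # p(z) = z^N + c_{N-1} z^{N-1} + ... + c_0 with roots z_j = exp(-i t_j):
    # sum_l c_l m[r + l] = -m[r + N] for r = 0, ..., N - 1.
    H = np.array([[m[r + l] for l in range(N)] for r in range(N)])
    c = np.linalg.solve(H, -m[N : 2 * N])
    z = np.roots(np.concatenate(([1.0], c[::-1])))  # highest degree first
    t = np.mod(-np.angle(z), 2.0 * np.pi)
    order = np.argsort(t)
    t, z = t[order], z[order]
    # Weights from the Vandermonde system V[k, j] = z_j^k, in least squares.
    V = np.vander(z, 2 * N, increasing=True).T
    a, *_ = np.linalg.lstsq(V, m, rcond=None)
    return t, a.real

t_true = np.array([0.5, 2.0, 4.0])
a_true = np.array([1.0, -0.5, 2.0])
k = np.arange(2 * t_true.size)
moments = np.exp(-1j * np.outer(k, t_true)) @ a_true
t_rec, a_rec = prony_recover(moments, t_true.size)
```

The recovery here is exact up to rounding for each fixed measure, which is consistent with the remark: the width estimate concerns the worst case over the whole class $\mathcal{K}_{\eta}$, not a single measure.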

5 Proofs

In the sequel, if $N>0$ , we will write $b_{N}(t)=b(Nt)$ . If ${\mathcal{C}}\subset{\mathbb{X}}$ is a finite set, we define

[TABLE]

For $P\in\Pi_{\infty}$ , we define

[TABLE]

so that

[TABLE]

In the sequel, we write $g(t)=h(t)-h(2t)$ , $t\in{\mathbb{R}}$ .

We recall the following results from [28]. Although the set up there is that of a compact smooth manifold without boundary, the proofs are verbatim the same for admissible spaces.

Proposition 5.1

Let $1\leq p\leq\infty$, $\beta\in{\mathbb{R}}$, $b$ be a mask of type $\beta$.

(a) We have

[TABLE]

(b) If $\beta>q/p^{\prime}$ then for every $y\in{\mathbb{X}}$, there exists $\psi_{y}:=G(\circ,y)\in X^{p}$ such that $\widehat{\psi_{y}}(k)=b(\lambda_{k})\phi_{k}(y)$, $k=0,1,\cdots$. We have

[TABLE]

(c) If $\beta>0$, $n\geq 1$, $P\in\Pi_{n}$ then

[TABLE]

(d) If $\beta>0$, ${\mathbb{X}}$ is compact, and (4.6) holds, then for any $M\geq 1$, $a_{1},\cdots,a_{M+1}\in{\mathbb{R}}$, ${\mathcal{C}}=\{y_{1},\cdots,y_{M+1}\}\subset{\mathbb{X}}$,

[TABLE]

Proof. The second inequality in (5.4) is proved in [28, Eqn. (5.3)]. The first inequality in (5.4) follows easily from [28, Eqn. (5.11)]. Part (b) is proved in [28, Proposition 5.2]. Part (c) is proved in [28, Eqn. (5.33)], used with $\gamma=0$ . Part (d) is proved in [28, Theorem 3.4]. $\blacksquare$

Proof of Proposition 3.1.

We note that Proposition 5.1 shows that $G$ is defined for all $x,y\in{\mathbb{X}}$ . Hence, (5.5) shows that for any $\mu\in C_{0}^{*}$ , $\int_{\mathbb{X}}\|G(\circ,y)\|_{p}d|\mu|(y)$ is well defined, and hence, so is $|\!|\!|\mu|\!|\!|_{G;p}$ .

It is clear that $|\!|\!|\circ|\!|\!|_{G;p}$ is a semi-norm. If $\mu\in C_{0}^{*}$ and $|\!|\!|\mu|\!|\!|_{G;p}=0$ then $b(\lambda_{j})\hat{\mu}(j)=0$ for all $j$ ; i.e., $\hat{\mu}(j)=0$ for all $j$ . Since the system $\{\phi_{j}\}_{j=0}^{\infty}$ is fundamental in $C_{0}$ , this implies that $\mu=0$ . The fact that $|\!|\!|\nu_{n}-\nu|\!|\!|_{G;p}\to 0$ implies that $\widehat{(\nu_{n}-\nu)}(k)\to 0$ for all $k$ , which in turn implies that $\nu_{n}\stackrel{{\scriptstyle*}}{{\to}}\nu$ . Conversely, if $\nu_{n}\stackrel{{\scriptstyle*}}{{\to}}\nu$ then for each $m\geq 0$ ,

[TABLE]

The dominated convergence theorem now leads to the fact that $|\!|\!|\nu_{n}-\nu|\!|\!|_{G;p}\to 0$ . $\blacksquare$

Proof of Theorem 4.1.

Using Fubini’s theorem and then making a change of dummy variables, we see that for $x\in{\mathbb{X}}$ ,

[TABLE]

Hence, (5.5) leads to

[TABLE]

This proves (4.3). The proof of (4.4) is similar; the last term in the middle expression in (5) does not appear in this case. $\blacksquare$

It is convenient to organize some details of the proof of Theorem 4.2 in the following lemma.

Lemma 5.1

Let $1\leq p\leq\infty$ , $\mu\in C_{0}^{*}$ , $\beta\in{\mathbb{R}}$ , and $b$ be a mask of type $\beta$ . Then for any $r\in{\mathbb{R}}$ ,

[TABLE]

Proof. In view of (5.4) with $p=1$, we have, for any real $\beta$ and mask $b$ of type $\beta$,

[TABLE]

Consequently, Young’s inequality shows that for $1\leq p\leq\infty$ and $f\in L^{p}$,

[TABLE]

In this proof, let $\tilde{g}(t)=h(t/2)-h(4t)$ . Then $\tilde{g}$ is supported on $[1/8,2]\cup[-2,-1/8]$ . Analogous to (5.11), we see that for $m\geq 1$ , (5.4) and Young’s inequality lead to

[TABLE]

In view of the fact that $\tilde{g}(t)=1$ for all $t$ in the support of $g$, so that $g(t)\tilde{g}(t)=g(t)$ for all $t$, we have

[TABLE]

Therefore, using (5.11) with $\sigma_{2^{n}}(\tilde{g};\mu)$ in place of $f$ , the fact that $\tilde{g}(t)=g(t/2)+g(t)+g(2t)$ for all $t\in{\mathbb{R}}$ , and (5.12), we conclude that

[TABLE]

Hence,

[TABLE]

Since $1/b$ is a mask of type $-\beta$ , this leads to (5.10). $\blacksquare$
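The two properties of $\tilde{g}$ used in the proof above follow from the definitions $g(t)=h(t)-h(2t)$ and $\tilde{g}(t)=h(t/2)-h(4t)$ alone. The decomposition is a telescoping sum:

$$g(t/2)+g(t)+g(2t)=\bigl[h(t/2)-h(t)\bigr]+\bigl[h(t)-h(2t)\bigr]+\bigl[h(2t)-h(4t)\bigr]=h(t/2)-h(4t)=\tilde{g}(t).$$

Moreover, if $1/4\leq|t|\leq 1$, then $|t|/2\leq 1/2$ and $4|t|\geq 1$, so that $h(t/2)=1$ and $h(4t)=0$; i.e., $\tilde{g}(t)=1$ on the support of $g$.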

Proof of Theorem 4.2. Let $f\in W_{r,p}$ . Since $g$ is supported on $[1/4,1]$ , $\sigma_{2^{m}}(g;P)=0$ for all $P\in\Pi_{2^{m-2}}$ . Therefore, using (5.12), we obtain for any $P\in\Pi_{2^{m-2}}$ that

[TABLE]

i.e.,

[TABLE]

Since $f\in W_{r,p}$ , this yields, together with (5.10) that

[TABLE]

Therefore,

[TABLE]

Thus, part (a) implies part (b).

Conversely, let part (b) hold. Then

[TABLE]

In view of (5.10) this leads to

[TABLE]

This implies that the sequence

[TABLE]

converges in $L^{p}$ to some $f\in L^{p}$ . Moreover, $\hat{\mu}(k)=\hat{f}(k)$ for all $k=0,1,\cdots$ . Therefore, $d\mu=fd\mu^{*}$ . Further, (5.15) shows that

[TABLE]

Thus, $f\in W_{r,p}$ . $\blacksquare$

Our proofs of Theorems 4.3 and 4.4 depend upon another notion of widths, the so-called Bernstein width. This is defined for a weak-star compact subset $\mathcal{K}\subset C_{0}^{*}$ and integer $M\geq 1$ by

[TABLE]

where the supremum is over all subspaces $Y_{M+1}$ of $C_{0}^{*}$ with dimension $M+1$ . It is proved in [9, Theorem 3.1] that for any integer $M\geq 1$ ,

[TABLE]

Proof of Theorem 4.3.

Let ${\mathcal{C}}=\{y_{1},\cdots,y_{n}\}$ be a maximal $(\kappa_{1}(M+1))^{-1/q}$ separated subset of ${\mathbb{X}}$ , where $\kappa_{1}$ is the constant appearing in the upper bound (3.1) on the $\mu^{*}$ -measure of balls. Then $\displaystyle{\mathbb{X}}=\bigcup_{j=1}^{n}\mathbb{B}(y_{j},(\kappa_{1}(M+1))^{-1/q})$ . In view of (3.1),

[TABLE]

Therefore, ${\mathcal{C}}_{1}=\{y_{1},\cdots,y_{M+1}\}$ satisfies $\eta({\mathcal{C}}_{1})\geq\eta({\mathcal{C}})\geq(\kappa_{1}(M+1))^{-1/q}$ . We consider the $M+1$ dimensional space $Y=\mathsf{span}\{\delta_{y_{k}}:k=1,\cdots,M+1\}\subset C_{0}^{*}$ , where $\delta_{y}$ denotes the Dirac delta at $y$ . For any $\mu=\sum_{k=1}^{M+1}a_{k}\delta_{y_{k}}$ , Proposition 5.1(d) shows that

[TABLE]

In view of (5.17), this leads to (4.9). $\blacksquare$
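The maximal separated set used in the proof above can be produced by a greedy selection. The sketch below illustrates this on a finite point cloud in the unit square with the Euclidean distance, which stands in for ${\mathbb{X}}$ and its metric; the cloud, the value of `eta`, and the function name `greedy_separated_subset` are assumptions of the illustration. It also exhibits the two facts used in Section 4 and in the proof: maximality forces the balls of radius $\eta$ around the selected points to cover the set, and a volume argument bounds the number of selected points by $c\eta^{-q}$.

```python
import numpy as np

def greedy_separated_subset(points, eta):
    """Indices of a subset with pairwise distances >= eta that is maximal
    within `points`: no further point can be added without violating this."""
    chosen = []
    for i, p in enumerate(points):
        if all(np.linalg.norm(p - points[j]) >= eta for j in chosen):
            chosen.append(i)
    return np.array(chosen)

rng = np.random.default_rng(0)
cloud = rng.random((500, 2))  # hypothetical stand-in for the space X
eta = 0.1
idx = greedy_separated_subset(cloud, eta)
centers = cloud[idx]

# Distance from every point of the cloud to its nearest selected center.
to_centers = np.linalg.norm(cloud[:, None, :] - centers[None, :, :], axis=2)
covering_radius = to_centers.min(axis=1).max()

# Smallest distance between two distinct selected centers.
pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
separation = pairwise[~np.eye(len(idx), dtype=bool)].min()

# Disjoint discs of radius eta / 2 inside a square of side 1 + eta (here q = 2).
volume_bound = (1.0 + eta) ** 2 / (np.pi * (eta / 2.0) ** 2)
```

Here `covering_radius` is less than `eta`, `separation` is at least `eta`, and `len(idx)` is at most `volume_bound`.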

Proof of Theorem 4.4.

If $M$ is chosen so that $(\kappa_{1}(M+1))^{-1/q}\geq\eta$, then the space $Y$ constructed in the proof of Theorem 4.3 serves for this theorem as well, and (5.17) can be applied. $\blacksquare$

Bibliography


[1] A. R. Barron. Neural net approximation. In Proc. 7th Yale Workshop on Adaptive and Learning Systems, volume 1, pages 69–72, 1992.
[2] D. Batenkov. Stability and super-resolution of generalized spike recovery. Applied and Computational Harmonic Analysis, 2016.
[3] D. Batenkov and Y. Yomdin. On the accuracy of solving confluent Prony systems. SIAM Journal on Applied Mathematics, 73(1):134–154, 2013.
[4] T. Bendory, S. Dekel, and A. Feuer. Exact recovery of Dirac ensembles from the projection onto spaces of spherical harmonics. Constructive Approximation, 42(2):183–207, 2015.
[5] T. Bendory, S. Dekel, and A. Feuer. Super-resolution on the sphere using convex optimization. IEEE Transactions on Signal Processing, 63(9):2253–2262, 2015.
[6] E. J. Candès and C. Fernandez-Granda. Super-resolution from noisy data. Journal of Fourier Analysis and Applications, 19(6):1229–1254, 2013.
[7] E. J. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[8] B. G. R. De Prony. Essai expérimental et analytique: sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l’alkool, à différentes températures. Journal de l’école polytechnique, 1(22):24–76, 1795.