Total variation multiscale estimators for linear inverse problems
Miguel del \'Alamo, Axel Munk

TL;DR
This paper introduces a new estimator for functions of bounded variation in linear inverse problems, achieving near-optimal convergence rates and extending previous results to higher dimensions with a novel analysis of rate regimes.
Contribution
It proposes a novel variational wavelet-vaguelette estimator for BV functions in inverse problems, providing the first convergence guarantees in dimensions dâ„2.
Findings
Estimator is minimax optimal up to logarithmic factors.
First convergence result for BV functions in inverse problems in dimension dâ„2.
Identification of a slower minimax rate for large q due to low smoothness.
Abstract
Even though the statistical theory of linear inverse problems is a well-studied topic, certain relevant cases remain open. Among these is the estimation of functions of bounded variation (), meaning functions on a -dimensional domain whose weak first derivatives are finite Radon measures. The estimation of functions is relevant in many applications, since it involves minimal smoothness assumptions and gives simplified, interpretable cartoonized reconstructions. In this paper we propose a novel technique for estimating functions in an inverse problem setting, and provide theoretical guaranties by showing that the proposed estimator is minimax optimal up to logarithms with respect to the -risk, for any . This is to the best of our knowledge the first convergence result for functions in inverse problems in dimension , and it extendsâŠ
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Total variation multiscale estimators for linear inverse problems
Miguel del Ălamo
Institute for Mathematical Stochastics, University of Göttingen
Goldschmidtstrasse 7, 37077 Göttingen, Germany
Axel Munk
Institute for Mathematical Stochastics, University of Göttingen
Goldschmidtstrasse 7, 37077 Göttingen, Germany
Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
Abstract
Even though the statistical theory of linear inverse problems is a well-studied topic, certain relevant cases remain open. Among these is the estimation of functions of bounded variation (), meaning functions on a -dimensional domain whose weak first derivatives are finite Radon measures. The estimation of functions is relevant in many applications, since it involves minimal smoothness assumptions and gives simplified, interpretable cartoonized reconstructions. In this paper we propose a novel technique for estimating functions in an inverse problem setting, and provide theoretical guaranties by showing that the proposed estimator is minimax optimal up to logarithms with respect to the -risk, for any . This is to the best of our knowledge the first convergence result for functions in inverse problems in dimension , and it extends the results of Donoho, (1995) in . Furthermore, our analysis unravels a novel regime for large in which the minimax rate is slower than , where is the degree of ill-posedness: our analysis shows that this slower rate arises from the low smoothness of functions. The proposed estimator combines variational regularization techniques with the wavelet-vaguelette decomposition of operators.
**Keywords ** Inverse problems   Minimax estimation   Total variation   Interpolation inequalities   Wavelet-vaguelette
**Mathematics Subject Classification (2010) ** 62G05 Â Â 65J22 Â Â 62G20
Contents
1 Introduction
We consider the problem of estimating a real-valued function from observations of in a white noise regression model (see e.g. Tsybakov, (2008))
[TABLE]
Here denotes an open subset of , is a linear, bounded operator, and denotes a Gaussian white noise process on (see Section 2.1.2 in Giné and Nickl, (2015)). The domain on which the data is defined is given by the inverse problem under consideration. In case of regression or deconvolution, we may have , while for certain types of tomography we have  (Natterer,, 1986), where denotes the -dimensional unit sphere. The parameter serves as a noise level, and we may assume it to be known, since otherwise it can be estimated efficiently (see e.g. Spokoiny, (2002) or Munk et al., (2005)). The parametrization is motivated by the fact that the white noise model (1.1) is an idealization of a nonparametric regression model with design points and independent normal noise with variance (see e.g. Brown and Low, (1996), Reiss, (2008) or Section 1.10 in Tsybakov, (2008)). Specifically, the white noise model does not take into account discretization effects, thus simplifying the theoretical analysis (see however Section 2.5 for a discussion of this). In the following we will often refer to as the sample size, keeping in mind that this is only an analogy.
In this setting, our goal is to reconstruct the function from observations and quantify the error made as grows. In order to do so, we assume that is supported inside the unit hypercube . This restriction is somewhat arbitrary: we merely need the support of to be contained in a compact set. Additionally, we make the structural assumption that is a function of bounded variation, written .
Definition 1** (Functions of bounded variation).**
The space of functions of bounded variation consists of functions whose weak distributional gradient is a -valued finite Radon measure on . The finiteness implies that the bounded variation seminorm of , defined by
[TABLE]
is finite, where denotes the divergence of the vector field . is a Banach space with the norm (see Evans and Gariepy, (2015)). Here denotes the set of continuously differentiable functions on taking values on .
Functions of bounded variation have been used manifold in imaging applications since their introduction in the seminal work by Rudin et al., (1992). The reason for their success is that they produce cartoonized reconstructions with sharp edges, which eases interpretability and makes them suitable for applications as diverse as medical imaging, microscopy, astronomy and geology, to mention just a few (see Scherzer et al., (2009) and references therein). However, in spite of their widespread use, a statistical theory for the estimation of functions in inverse problems is still lacking. To the best of our knowledge, the only available result for minimax optimal reconstructions of functions in inverse problems is Donoho, (1995). He introduced the wavelet-vaguelette decomposition (WVD) associated with an operator, and showed that thresholding the WVD yields minimax optimal reconstruction over a range of Besov spaces. His results cover the case of functions for and -smoothing operators with , meaning operators whose singular values behave like as . This includes convolution operators with smooth enough convolution kernels, among others. In contrast, there is no statistical guaranty for estimating functions in dimension , which covers the very relevant imaging applications.
In this paper we propose an estimator that combines variational regularization with the WVD and multiscale dictionaries. We show that the proposed estimators are minimax optimal up to logarithmic factors for estimating functions in any dimension for a variety of inverse problems, including Radon inversion and deconvolution.
1.1 Multiscale total variation estimation
We consider the variational estimator
[TABLE]
where is a threshold to be chosen, and we minimize over a set of functions to be specified later. is a finite set of indices, and is a vaguelette system associated to the operator , meaning that
[TABLE]
for a wavelet basis and generalized singular values (see Assumption 1 for the details). The set depends monotonically on the parameter in (1.1), which plays the role of the sample size: the larger , the larger the set . The reason is that, if the observations are very noisy ( small), we do not want to include too many terms in (1.2), since would then be dominated by the noise. Conversely, the smaller the noise level, the more observations we want to include in the data-fidelity in (1.2), which is then able to extract more information about .
Notice that, by the definition of the vaguelettes, the data-fidelity in (1.2) is actually a constraint on the wavelet coefficients of : they are forced to be close to the wavelet coefficients of the unknown function , up to noise terms. Hence, the data-fidelity in (1.2) amounts to denoising of wavelet coefficients, while the regularization term ensures that is well-behaved in the seminorm.
We deliberately pose the optimization problem (1.2) in constrained form, but emphasize its equivalence to the penalized form
[TABLE]
Indeed, both forms are equivalent for suitable parameters and , but these will depend on the data and cannot be transformed easily from one problem to the other. For the penalized formulation (1.3), the optimal could then be chosen in a data-driven way (e.g. by cross validation (Wahba,, 1977) or by a version of Lepskiiâs balancing principle (Lepskii,, 1991), see e.g. MathĂ© and Pereverzev, (2003) in the context of inverse problems). In the constrained formulation (1.2), the optimal in (1.2) can be chosen in a universal, non data-dependent manner, see equation (2.7).
To see that, notice that the role of is to decide which functions are allowed for the minimization problem (1.2): a smaller would yield very few admissible functions, and conversely for larger . Since the best reconstruction we can hope for is the true regression function , the optimal would be the one that is large enough to let be a feasible function, but not larger. In this sense, note that satisfies the constraint in (1.2) precisely when
[TABLE]
Assume for a moment that with for all . Then the left-hand side behaves like the maximum of the absolute value of standard normal random variables times . Consequently, we see that (1.4) holds asymptotically with probability one if we choose . This argument can be adapted to the case that the do not have norm one, as long as their norms remain bounded above and below by positive constants. We remark that this canonical choice of makes the estimator in constrained form (1.2) more convenient from a practical point of view than the one in penalized form (1.3).
At this point we can argue why the choice of the data-fidelity term in (1.2) is in a sense optimal: if we had chosen it to be the maximum of weighted coefficients, these weights would appear in (1.4), which would then be the maximum of normals with different variances. The maximum would hence be dominated by the terms with larger variances, which would lead to overfitting (if small scales dominate) or oversmoothing (if the large scales dominate).
Finally, we argue that the multiscale data-fidelity in (1.2) is in a sense preferable over the data-fidelity, which acts globally on the residuals. Indeed, consider an estimator like (1.2) with an constraint, which would take the form
[TABLE]
for some , where we used the fact that is a frame for  (Donoho,, 1995) to express the norm in terms of the vaguelette coefficients. Arguing as above, the optimal is the one for which the true function satisfies the constraint. Plugging in , the left-hand side is a -distributed random variable, so should be chosen as . The difference between the multiscale and constraints is now apparent:
[TABLE]
where both constraints are on the vaguelette domain. If we assume that the number of constraints grows polynomially in (see Assumption 1), then the radius in the multiscale constraint tends to zero as , while the radius in the constraint tends to a constant or diverges if . Hence, the multiscale constraint set is much smaller for large, and we expect the multiscale data-fidelity to produce more faithful reconstructions.
Before we turn to the discussion of the convergence properties of , let us discuss two potential limitations of our approach. First, not every operator has an associated vaguelette system , as we use in (1.2). In fact, only reasonably homogeneous operators admit such a system (see Donoho, (1995)). However, for our theory we do not need the whole generality of the WVD (see Assumption 1), and many relevant operators such as the Radon transform, convolution or integration satisfy our assumptions (see Examples 2 below).
The second limitation concerns the numerical solution of the optimization problem in (1.2), which in general is a non-smooth, high-dimensional optimization problem (since and might be large). While classical techniques such as interior point methods (Nesterov and Nemirovsky,, 1994) find their limitations here, the computation of (1.1) is meanwhile feasible due to recent progress in convex optimization, e.g. in primal-dual methods (Chambolle and Pock,, 2011) and accelerations thereof (Malitsky and Pock,, 2018), or semismooth Newton methods with the path-following technique (Clason et al.,, 2010). We will not elaborate on this issue further and postpone this to future work.
1.2 Main result
The main result of this paper states that the estimator (1.2) is minimax optimal (up to logarithmic factors) for estimating functions in any dimension for certain inverse problems. In order to formulate our result we need to introduce some notation. For define the intersection of a -ball of radius with an -ball as
[TABLE]
where denotes the domain of the operator . The reason for the support condition in (1.5) is the following: since we only have a finite amount of information, we cannot hope to recover a function with infinite support. The restriction to the unit cube is in a sense arbitrary: any regular enough compact set would do.
For given , and , define the number
[TABLE]
Our main result (Theorems 3 and 4) can be stated informally as follows.
** Main Theorem**** (Informal).**
For and , let have a WVD with singular values behaving as (see Assumption 1 in Section 2). Let the threshold be as in (2.7) for depending on and only. Then the estimator attains the minimax optimal rate of convergence over up to a logarithmic factor,
[TABLE]
for large enough, for any q\in\big{[}1,\infty\big{)}, any and a constant independent of , but dependent on , , and .
The convergence rate in (1.7) is indeed minimax optimal over the class up to the logarithmic factor, as it is the optimal rate over the smaller class of bounded Besov functions, see Theorem 4 and Section 2.1 for the definition of Besov spaces. The minimax rate is well-known for inverse problems when (see e.g. Cavalier, (2011)). In contrast, the âslowâ regime with rate for has been observed for the specific case in density estimation (Goldenshluger and Lepskii,, 2014) and nonparametric regression (Lepskii, (2015) and del Ălamo et al., (2018)) when estimating over anisotropic Nikolskii classes and Besov classes with . Moreover, the slow regime explains the recently observed phase transition in the minimax risk for estimating discretized functions in the particular case , see Sadhanala et al., (2016). Our result extends these findings to linear inverse problems.
The proof of the minimax optimality of that rate is based on the construction of a set of alternatives in the smaller space . Interestingly, the set of alternatives that attains the minimax rate is neither sparse nor dense: it presents blocks of signals at different locations. We conjecture that only estimators that incorporate a form of spatial adaptation can be minimax optimal in this regime, as the ones proposed in Lepskii, (2015), in del Ălamo et al., (2018) and in the present paper.
The proof of the Main Theorem is based on an upper bound on the -risk with an interpolation inequality in terms of the norm and a Besov norm of negative smoothness,
[TABLE]
for any q\in\big{[}1,\frac{d+2\beta+2}{d+2\beta}\big{]}, . See Section 2.1 for the definition of Besov spaces. This inequality follows from a result by Cohen et al., (2003), proved by an analysis of the wavelet coefficients of functions. Since we have by construction, the norm in (1.8) is easily bounded by a constant. On the other hand, the Besov norm can be related to the constraint in the right-hand side of (1.2), and some analysis yields the bound
[TABLE]
with high probability. Plugging this expression in (1.8), we get the desired bound for the -risk. The bound is extended to using Hölderâs inequality. For we proceed analogously with some modifications. See Section 2.3 for a complete proof.
1.3 Related work
Notwithstanding the success of functions in imaging applications (see Rudin et al., (1992) for the first reference), there are very few works that analyze the estimation of functions in a statistical setting. In nonparametric regression (), classical results (Mammen and van de Geer, (1997) and Donoho and Johnstone, (1998)) established minimax optimality results for estimation in dimension , and recently a class of multiscale variational estimators was shown to perform optimally in any dimension (del Ălamo et al.,, 2018), whose approach we generalize here to . In statistical inverse problems, the only work proving minimax optimal convergence rates for the estimation of is, to the best of our knowledge, Donoho, (1995). He shows that thresholding of the WVD is minimax optimal over a range of Besov spaces and for a class of -smoothing inverse problems. In the case relevant for (), the minimax optimality holds for the range , i.e. for smoothing operators in dimension and . The present work is hence an improvement, since we do not impose any limitation on nor on the dimension . On the other hand, we get a suboptimal logarithmic factor in (1.7), while Donoho, (1995) achieves the exact optimal rate.
At a technical level, our work is inspired by several sources. We have already mentioned Donoho, (1995), who introduced the WVD as a means for using wavelet methods in inverse problems (see also Abramovich and Silverman, (1998) for a variant of the WVD, and CandĂšs and Donoho, (2002) for a refined approach to Radon inversion). Besides these works, there have been several approaches that implicitly use the WVD idea. We refer to Schmidt-Hieber et al., (2013) and Proksch et al., (2018) for hypothesis testing in inverse problems, where multiscale dictionaries adapted to the operator are employed. Another source of inspiration for our work are nonparametric methods that combine variational regularization with multiscale dictionaries. We refer exemplarily to CandĂšs and Guo, (2002), Dong et al., (2011), Frick et al., (2012) and Frick et al., (2013) for an empirical analysis of such methods in simulations. Moreover, the proof of our main result is based on the above mentioned interpolation technique: an interpolation inequality of the form (1.8) is used to relate the risk functional, the regularization functional and the data-fidelity. This technique was used by Nemirovski, (1985) and Grasmair et al., (2018) for estimating Sobolev functions, using an extension of the Gagliardo-Nirenberg interpolation inequalities (Nirenberg,, 1959), and by del Ălamo et al., (2018) for the estimation of functions, employing a generalization thereof (Meyer, (2001), Cohen et al., (2003)). In that sense, the present work combines the tools developed in del Ălamo et al., (2018) with the WVD from Donoho, (1995), and it generalizes both results.
Organization of the paper
The rest of the paper is organized as follows. In Section 2 we state our assumptions and main theorems, and give their proofs. We also discuss the particular inverse problems of deconvolution and Radon inversion. The proofs of auxiliary results are given in Section 3.
2 Results
2.1 Notation
Basic notation. We denote the Euclidean norm of a vector by |v|:=\big{(}v_{1}^{2}+\cdots+v_{d}^{2}\big{)}^{1/2}. For a real number , define \lfloor x\rfloor:=\textup{max}\big{\{}m\in\mathbb{Z}\,\big{|}\,m\leq x\big{\}} and \lceil x\rceil:=\textup{min}\big{\{}m\in\mathbb{Z}\,\big{|}\,m>x\big{\}}. The cardinality of a finite set is denoted by . We say that two sequences and , , grow at the same rate, written , if there are (potentially zero) constants such that for all . Finally, we denote by a generic positive constant that may change from line to line.
Wavelet bases. Let denotes a wavelet basis of formed by tensorization of Daubechies wavelets (Daubechies,, 1992) with continuous partial derivatives and whose mother wavelet has vanishing moments. Here is a scale index, is a position index, and denotes whether is a mother or a father wavelet along each coordinate. We recall that one-dimensional Daubechies wavelets with vanishing moments have support of size (with respect to the Lebesgue measure) and are times continuously differentiable (see Theorem 4.2.10 in Giné and Nickl, (2015)). A -smooth wavelet basis formed by tensorization of one-dimensional Daubechies wavelets needs to satisfy in order to have continuous derivatives. Consequently, the mother and father wavelets have support of size .
In this work we will mainly deal with functions supported inside the unit cube, . We will use their wavelet expansion intensively, so let us introduce the set of wavelets with nonzero overlap with the unit cube
[TABLE]
For each , , let
[TABLE]
denote the set of indices of wavelets at scales rougher that . Since the wavelets at scale have support of size , it follows that there are indices at level , and hence the cardinality of is of the order .
Besov spaces. Let be a wavelet basis with continuous partial derivatives and whose mother wavelet has vanishing moments. For and with , the Besov space consists of all functions (or distributions) with finite Besov norm
[TABLE]
We refer to Section 4.3 in Giné and Nickl, (2015) for more details.
Finally, we define the Fourier transform of a function by
[TABLE]
The Fourier transform can be extended as an operator to and, by duality, to distributions (see e.g. Section 4.1.1 in Giné and Nickl, (2015)).
2.2 Main results
We make the following assumptions on the operator .
Assumption 1**.**
Let denote a bounded, linear operator. For , assume that the following hold:
- âą
there is a wavelet basis \{\psi_{j,k,e}\,\big{|}\,(j,k,e)\in\Lambda\} of (see Section 2.1) with continuous partial derivatives and whose mother wavelet has vanishing moments, such that ;
- âą
there is a set of functions \{u_{j,k,e}\,\big{|}\,(j,k,e)\in\Lambda\}\subset L^{2}(\mathbb{M}), which we call vaguelette system, s.t.
[TABLE]
with singular values . Furthermore, the vaguelettes satisfy
[TABLE]
for some real constants .
We remark that a vaguelette system as constructed in Donoho, (1995) is a frame. However, we will not need that property in the following.
Remark 1**.**
- a)
Assumption 1 is slightly weaker than assuming that the operator has a wavelet-vaguelette decomposition (WVD) (Donoho,, 1995). In the following we nevertheless call a vaguelette system for simplicity.
- b)
As remarked in Section 2.1, we will only need the wavelets with nonzero overlap with the unit cube, which we index by the set in (2.1). In the following we index the vaguelettes accordingly.
- c)
The condition is necessary for ensuring that the norms of the Besov spaces and , , can be expressed in terms of wavelet coefficients with respect to the basis (see Section 2.1, or Section 4.3 in Giné and Nickl, (2015)).
- d)
Let be a smooth enough wavelet basis. Then condition (2.3) implies that the inverse problem (1.1) is mildly ill-posed with degree of ill-posedness .
** Examples 2****.**
We list here some examples of operators satisfying Assumption 1.
- a)
The integration operator
[TABLE]
Its domain consists of functions such that , where denotes the Fourier transform. The vaguelettes are given by derivatives and integrals of the wavelet basis, and the critical values are . Fractional integration, iterated integration and higher dimensional integrals also define operators satisfying Assumption 1. We refer to Donoho, (1995) for more details.
- b)
The Radon transform, which maps a function to
[TABLE]
where the integral is taken over the hyperplane defined by vectors satisfying . See Section 2.4.1 for more details on how our estimator (2.5) works for the Radon transform.
- c)
The convolution operator
[TABLE]
for a regular enough kernel satisfies Assumption 1. See Section 2.4.2 for the details.
- d)
The identity operator, in which case we are in the white noise regression model. We can take to be a smooth enough wavelet basis, and the estimator (2.5) reduces (with minor modifications) to the multiscale total variation estimator analyzed in del Ălamo et al., (2018). Besides some differences in the setting (here we estimate compactly supported functions, there periodic ones), the convergence rate that we prove here coincides for with the result in del Ălamo et al., (2018).
More generally, operators satisfying a certain homogeneity condition with respect to dilations have a WVD (see Donoho, (1995) for a general result). Conversely, Assumption 1 is in general not satisfied for operators with a strong preference for a particular scale. An extreme example is convolution with a kernel whose Fourier transform has compact support. In that case, the equation does not admit solutions for compactly supported wavelets .
In this setting, we define our estimator as follows.
Definition 2**.**
Let the observations follow the model (1.1), and let the operator satisfy Assumption 1 with a vaguelette system . We denote
[TABLE]
as the multiscale total variation estimator for the operator . In (2.5) we minimize over the set
[TABLE]
We use the convention that, whenever the feasible set of the problem (2.5) is empty (which happens with vanishing probability as grows, see Remark 2), the estimator is set to zero.
The reason for requiring the support to be inside the closed unit cube in (2.6) is to make the set closed. This is important for ensuring existence of a minimizer in (2.5) as the limit of a minimizing sequence.
Concerning the choice of the threshold , let be as in (1.1), and let be the upper bound in Assumption 1. For , we choose
[TABLE]
Notice that the upper bound can be computed from the dictionary, as we do in the examples in Section 2.4.
Remark 2**.**
Let us discuss the feasible set of the problem (2.5), which consists of the constraints
[TABLE]
Here we assume that the observations arise from a function , as defined in (1.5). By Proposition 2 below and the choice (2.7) for , the probability that the true regression function satisfies the first constraint in (2.8) is not smaller than . As long as satisfies the first constraint in (2.8), it also satisfies the others for large enough (), since we assume that . As a consequence, the feasible set of (2.5) is nonempty with probability of the order . Hence, we will see that the caveat in Definition 2 about the feasible set does not play a decisive role for the convergence properties of .
Theorem 3**.**
For , let satisfy Assumption 1 with . Assume the model (1.1) with for some . For q\in\big{[}1,\infty\big{)}, let be as in (1.6).
- a)
Let be as in (2.7) with . Then for any with , the estimator in (2.5) with parameter satisfies
[TABLE]
for any with probability at least 1-\big{(}\#\Omega_{n}\big{)}^{1-\kappa^{2}}, for a constant independent of , but depending on , and .
- b)
Under the assumptions of part a), if , then
[TABLE]
holds for any , large enough and a constant independent of .
Theorem 3 gives an upper bound for the expected error of . We now prove a matching lower bound. For that, we assume that satisfies
[TABLE]
for a constant , where is a wavelet basis of compactly supported wavelets. We remark that (2.11) is satisfied by any operator with a WVD (see Donoho, (1995)).
Theorem 4**.**
Consider the setting of Theorem 3, and assume that the operator admits a WVD. Then the minimax -risk over given observations (1.1) is lower bounded by . In particular, the estimator (2.5) is asymptotically minimax optimal up to logarithmic factors for estimating functions , , with respect to the -risk, for any q\in\big{[}1,\infty\big{)}.
2.3 Proofs of the main theorems
2.3.1 Proof of Theorem 3
The proof of Theorem 3 relies on a variant of an interpolation inequality prove by Cohen et al., (2003).
Proposition 1**.**
For and , let .
- a)
If , there is a constant such that
[TABLE]
holds for any and any with supp .
- b)
If , then there is a constant such that for any we have
[TABLE]
for any and any with supp .
The proof of Proposition 1 is given in Section 3 below. Define the event
[TABLE]
where is the vaguelette system from Assumption 1.
Proposition 2**.**
Let be a vaguelette system as described in Assumption 1. For any we have
[TABLE]
for any , where is the upper bound in Assumption 1.
Proof.
The random variables are normal with variance smaller than one, since by the inequality in Assumption 1. By the union bound we have
[TABLE]
for any , and the probability in the right-hand side can be bounded as
[TABLE]
In the first inequality, we bounded the probability that by the probability that a standard normal random variable is larger than in absolute value. This is justified by the fact that has variance smaller than one for all indices. â
We begin with an auxiliary result for the proof of Theorem 3, which is essentially a regularity result for conditionally on the event in (2.12). In the following proofs, denotes a generic constant that may change from line to line.
Proposition 3**.**
Let and denote the wavelet and vaguelette systems from Assumption 1. For , let denote the estimator (2.5) with parameter given by (2.7). Then conditionally on the event in (2.12) we have
[TABLE]
for any with supp , and a constant independent of , and .
Proof.
For part , the definition of the Besov norm in terms of wavelet coefficients (see Section 2.1) yields
[TABLE]
where we used that and that for Daubechies wavelets (which are supported on a compact set). The numerator in the second term can be bounded by by construction of , while the first term can be bounded as
[TABLE]
conditionally on , where in the second inequality we used the definition of . This completes the proof of . The proof of is analogous to the proof of Proposition 4 in del Ălamo et al., (2018), so we do not reproduce it here. â
Proof of part a) of Theorem 3.
We prove the claim of part a) of Theorem 3 conditionally on the event in (2.12), which by Proposition 2 happens with probability .
Consider first the case , which gives . In this case, Proposition 1 gives the interpolation inequality
[TABLE]
for . Conditionally on and for , Proposition 3 gives bounds for the terms in the right-hand side of (2.13), and putting the last three equations together then yields
[TABLE]
using that . Since grows linearly in (recall Section 2.1), the claim follows.
For the case when and , we have and the argument goes through as above.
Finally, the case and requires a special treatment, since then . We use part b) of Proposition 1, which gives
[TABLE]
for a constant and any . Conditionally on , we bound the terms in the right-hand side by Proposition 3, which for yields
[TABLE]
which gives the claim.
We have proved the claim for the -risk with . For larger , we use Hölderâs inequality between the and the -risk, which gives the bound
[TABLE]
for . This completes the proof. â
Proof of part b) of Theorem 3.
It follows from the convergence conditionally on proved in part a) of the theorem. We omit the proof, as it is analogous to the proof of part b) of Theorem 1 in del Ălamo et al., (2018). â
2.3.2 Proof of Theorem 4
Here we prove Theorem 4 by showing that the minimax rate over the smaller set
[TABLE]
with respect to the -risk, , is not faster than . The proof of this is well-known in the dense case , where : it can be found e.g. in Chapter 10 of HÀrdle et al., (2012) for and , so we do not reproduce it here. Indeed, the generalization from to is trivial. Concerning the difference between and general , we show below how to adapt the construction of the alternatives in the case , which indicates how to proceed in the dense regime (see e.g. Theorem 3 in Cavalier, (2011) for a different strategy for computing the minimax risk in inverse problems for the -risk).
On the other hand, we have not found a lower bound in the literature for the regime : only the construction in del Ălamo et al., (2018) for the particular case deals with that regime. Here we modify that proof and give a lower bound for general .
Proof of Theorem 4.
The proof follows the proof of Theorem 2 in del Ălamo et al., (2018) closely.
**Construction of alternatives: ** In the proof of Theorem 2 in del Ălamo et al., (2018), a set of alternatives is constructed such that
[TABLE]
where is the signal strength, are orthonormal Daubechies wavelets, and , , are indices such that . These functions are chosen to satisfy , and
[TABLE]
**Lower bound: ** We use now Assouadâs lemma for lower bounding the -risk over . We reproduce the claim (Lemma 10.2 in HĂ€rdle et al., (2012)) for completeness.
Lemma 1**.**
For and , define , where
[TABLE]
Assume there exist constants such that
[TABLE]
where denotes the probability with respect to observations drawn from in the white noise model (1.1), and denotes the likelihood ratio between the observations associated to and . Then any estimator based on observations (1.1) satisfies
[TABLE]
where is defined in (2.15).
**Verification of (2.16): ** With the same argument as the proof of Theorem 2 in del Ălamo et al., (2018), condition (2.16) holds provided that the Kullback-Leibler divergence between observations from two alternatives satisfies for a small enough constant . A standard computation gives
[TABLE]
using (2.11), so choosing gives (2.16).
**Application of Lemma 1: ** The conclusion of the lemma applies, and we can lower bound the -risk over the class by the risk over , i.e.,
[TABLE]
for any estimator . Choosing as above , the definition (2.15) for gives the bound
[TABLE]
which completes the proof. â
2.4 Examples
2.4.1 Radon transform
Due to its application in nondestructive imaging, in particular in medical applications, tomography is a very relevant inverse problem. While there are plenty of mathematical models for tomography, which mainly depend on the type of tomography and the geometry of the detector (see e.g. Chapter 1 in Scherzer et al., (2009)), in this section we will exemplarily consider tomography modeled by the Radon transform. For simplicity we consider here the two dimensional case, in which the Radon transform of a function is given by its line integrals along different directions, see (2.4).
Functions in the range of are supported on cylindrical sets of the form . Moreover, the domain of consists of functions whose Fourier transform satisfies , see Donoho, (1995). This is a condition on the low frequencies which essentially ensures that local averages remain reasonably small.
In this section we will show how to apply the estimation framework developed above to this type of inverse problems. For that, let denote a basis of Daubechies wavelets as described in Section 2.1. For , define the vaguelettes by
[TABLE]
It is easy to verify directly (see e.g. Chapter 2 in Natterer, (1986)) that the vaguelettes satisfy the equation
[TABLE]
for generalized critical values . Moreover,
[TABLE]
for constants depending on , see Section 3.3 in Donoho, (1995) for a proof of this claim. Let us remark that the system is part of a WVD for (see Donoho, (1995) for the details).
Altogether, the observations above imply that the Radon transform satisfies Assumption 1 with in dimension . By Theorem 4, the multiscale total variation estimator (2.5) is nearly minimax optimal for recovering a function from noisy Radon observations. We remark that the same analysis can be performed for the Radon transform in higher dimensions, in which case , for the X-ray transform, with for any dimension (Natterer,, 1986), as well as for other tomography operators, such as photoacoustic and thermoacoustic tomography (see e.g. Haltmeier, (2013)).
2.4.2 Convolution
Let denote the convolution operator with a kernel , i.e.,
[TABLE]
We let , and by Youngâs inequality is a bounded operator from to itself whose operator norm equals . The inverse problem (1.1) with a convolution operator is a model for a myriad of applications in image and signal processing, including microscopy and astronomy models (see e.g. Bertero et al., (2009)). The problem of recovering a signal from noisy measurement of its convolution is hence of extreme practical relevance. In this section we show that the multiscale TV-estimator (2.5) solves this problem in a minimax optimal sense.
For that, we need to impose regularity conditions on , which naturally have the form of a decay condition on the Fourier transform of . In particular, we assume that the kernel satisfies
[TABLE]
for constants and some . Given a basis of Daubechies wavelets like that in Section 2.1 with , define the system of functions
[TABLE]
indexed by the set in (2.1). These functions satisfy the following relations
[TABLE]
where we can choose and (see Proposition 5 for the proof). These results show that the convolution operator under the assumptions above satisfies Assumption 1. By Theorem 4 we conclude that the multiscale TV-estimator is minimax optimal for estimating functions , up to logarithmic factors.
2.5 Nonparametric inverse regression model
So far we have discussed the estimator based on observations from the white noise model (1.1). In practice, however, one naturally has access to discretely sampled data, which makes it more realistic to model the observations with the nonparametric regression model
[TABLE]
Here we assume that for some , and that the design points belong to an equidistant grid
[TABLE]
Of course, different grids may be used, depending on the operator and the domain under consideration. For simplicity of the analysis, we assume in this section that . This is the case when is the identity operator, a suitable convolution operator, or integration, to mention just a few examples. In (2.21), are independent standard normal random variables, and plays the role of the standard deviation of the noise.
Given observations (2.21), our goal is to estimate the function . We do so by discretizing our construction of the multiscale TV-estimator from Definition 2. Let be a dictionary of discretized vaguelettes, i.e., each is a vector of values
[TABLE]
which are the evaluations of the vaguelette at the grid points. The scaling factor is chosen so that
[TABLE]
for any , i.e., so that the vectors have asymptotically unit norm in an sense.
In this setting, the multiscale TV-estimator takes the form
[TABLE]
where is the upper frame constant for the continuous vaguelettes in Assumption 1.
We can now analyze the estimator (2.22) following the same strategy as we did in the white noise model. The only difference will be that, above, we related the constraint on the vaguelette coefficients to the Besov norm. Since here we only have access to the discretized vaguelette coefficients, there is an additional discretization error caused by the approximation of the vaguelette coefficients by their discretized counterparts. That error is given by
[TABLE]
Proceeding as in the proof of Proposition 3, we see that satisfies the error bounds
[TABLE]
conditionally on the event in (2.12). Following the proof of Theorem 3, we get the result
[TABLE]
for and large enough. Here we have the following trade-off: if is of smaller order that , then attains the same rate as the multiscale TV-estimator based on observations from the white noise model. On the other hand, if is of bigger order than , the discretization error dominates and performs worse than . The different performance of the multiscale TV-estimator in the white noise and in the nonparametric regression models hence boils down to a purely approximation theoretic question.
It remains now to bound the discretization error . For that, notice that it is entirely determined by the smoothness of . Recall that and that is a smoothing operator. Consider the following examples.
Let be the identity operator. Then is a smooth wavelet basis, and . Consequently, the product is at most a function of bounded variation, for which we have (see e.g. Chapter 5 in Evans and Gariepy, (2015)). In this case, the discretization error is of lower order for , while it dominates for .
- 2)
In particular cases, for , the error in can be improved. For instance, if is a piecewise constant function and if is smooth enough and has vanishing moments. In that case, the discretization error can be of smaller order due to the vanishing moments of . We do not pursue this idea further.
- 3)
If is a convolution operator as in Section 2.4.2, then by Fourier inversion we can show that is continuous. Moreover, if the kernel decays fast enough, will be a continuous function as well, and so will be . Hence we have the same discretization error as above. There is nevertheless an important caveat here: as opposed to wavelets, vaguelettes do not have in general vanishing moments. Consequently, this error cannot be improved by assuming that is e.g. piecewise constant.
We have argued that the difference between the multiscale TV-estimator in the white noise and the nonparametric inverse regression models arises from a discretization error. In particular, the error appears in the convergence rate in the nonparametric regression model, eventually making it slower. Importantly, for the error behaves as , and so the multiscale estimator attains the optimal convergence rate for imaging problems in the discretized model (2.21).
More generally, the difference between the white noise and the nonparametric inverse problem models can be measured with the theory of asymptotic equivalence. While that theory is well understood when (Brown and Low, (1996), Reiss, (2008)), there are considerably fewer results for general operators (see Grama and Nussbaum, (1998) and Meister, (2011)). In particular, Meister, (2011) proves asymptotic equivalence in a functional linear regression model provided that the unknown function is suitably smooth, which is reminiscent of our analysis above to control based on the smoothness of .
3 Auxiliary analytical results
For simplicity, we prove the two parts of Proposition 1 separately. They rely on an interpolation inequality proved by Cohen et al., (2003), which we reproduce here.
Theorem 5** (Theorem 1.5 in Cohen et al., (2003)).**
Let and , and assume that satisfies either or , where denotes the Hölder conjugate of . Then for any such that
[TABLE]
we have the inequality
[TABLE]
for any function and a constant depending on and only.
Proof of part a) of Proposition 1.
First, Theorem 5 with and gives
[TABLE]
for any smooth enough . It remains to show that the -norm, , can be upper bounded by the -norm. But that is indeed the case, due to the continuous embedding
[TABLE]
for . Indeed, continuity of the embedding follows from Proposition 2 in Section 2.3.2 in Triebel, (1983). It states that, for , and , the embedding
[TABLE]
is continuous. Moreover, equation (2) in Section 2.3.5 in Triebel, (1983) states that
[TABLE]
for . These two facts imply that
[TABLE]
which completes the proof of (3.2). The extension to the -risk follows by compact support. â
The proof of part b) of Proposition 1 relies on the following result.
Proposition 4**.**
Let satisfy supp , and let . Then for any we have
[TABLE]
for a constant independent of .
The proof of Proposition 4 uses the following lemma.
Lemma 2**.**
Let denote a basis of compactly supported wavelets in . For any there is a constant such that
[TABLE]
for any and any coefficients , where
[TABLE]
Proof of Lemma 2.
We prove the lemma by showing the extreme cases and , and then applying the Riesz-Thorin interpolation theorem (see e.g. Stein and Weiss, (1971)) to the bounded operator
[TABLE]
which gives the claim for all . The claim for follows by the orthonormality of the wavelet basis. For , the claim follows with the same argument as Lemma 2 in del Ălamo et al., (2018): the only difference is that the functions there are defined on the torus , and here on the cube . This completes the proof. â
Proof of Proposition 4.
Let be a basis of compactly supported wavelets. Writing formally as its wavelet series we have for any
[TABLE]
for any . Since supp , the sums are over . Using Lemma 2, the first term can be bounded as
[TABLE]
which gives the first term of the claim. For the second term, we use that and , which means that the wavelet coefficients of satisfy the bounds
[TABLE]
for any , where the first inequality follows from the compact support of the wavelets and Hölderâs inequality, and the second follows from the embedding . Using Lemma 2 and these bounds, the second term in (3.3) can be bounded as
[TABLE]
which gives the claim. â
Proof of part b) of Proposition 1.
Let and assume that . Notice that for and . The claim follows from Theorem 5 with and , which gives a bound on the norm. The -norm, , can be upper bounded by the -norm, which itself can be upper bounded by the norm using Proposition 4 below. Choosing yields the claim. â
Proposition 5**.**
In the setting of Section 2.4.2 we have
[TABLE]
where we can choose and
Proof.
Notice that the Fourier transform of the elements is given by
[TABLE]
The first claim of the proposition follows trivially by construction of the : we essentially use that acts by convolution with , which in Fourier domain is the product with . For the bounds in the norm, we use Plancherelâs theorem, i.e.
[TABLE]
where in the second line we used the bounds (2.19) on the Fourier transform of the kernel . The expression in the right-hand side can now be easily bounded from below as
[TABLE]
again by Plancherelâs theorem. On the other hand, the right-hand side of (3.5) can be upper-bounded as
[TABLE]
This yields the claim. â
Funding
This work was supported by the Deutsche Forschungsgemeinschaft [RTG 2088-B2 to M.A., CRC 755-A4 to A.M.].
Acknowledgment
The authors thank Dr. Housen Li and Dr. Frank Werner for helpful discussions.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abramovich and Silverman, (1998) Abramovich, F. U. and Silverman, B. W. (1998). Wavelet decomposition approaches to statistical inverse problems. Biometrika , 85(1):115â129.
- 2Assouad, (1983) Assouad, P. (1983). Deux remarques sur lâestimation. C. R. Math. Acad. Sci. Paris , 296(23):1021â1024.
- 3Bertero et al., (2009) Bertero, M., Boccacci, P., DesiderĂ , G., and Vicidomini, G. (2009). Image deblurring with Poisson data: from cells to galaxies. Inverse Problems , 25(12):123006.
- 4Brown and Low, (1996) Brown, L. D. and Low, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist. , 24(6):2384â2398.
- 5CandĂšs and Donoho, (2002) CandĂšs, E. J. and Donoho, D. L. (2002). Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Ann. Statist. , 30(3):784â842.
- 6CandĂšs and Guo, (2002) CandĂšs, E. J. and Guo, F. (2002). New multiscale transforms, minimum total variation synthesis: Applications to edge-preserving image reconstruction. Signal Processing , 82(11):1519â1543.
- 7Cavalier, (2011) Cavalier, L. (2011). Inverse problems in statistics. In Inverse problems and high-dimensional estimation , pages 3â96. Springer Berlin Heidelberg.
- 8Chambolle and Pock, (2011) Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision , 40(1):120â145.
