Effective New Methods for Automated Parameter Selection in Regularized   Inverse Problems

Toby Sanders; Rodrigo B. Platte; Robert D. Skeel

arXiv:1812.11449·math.NA·February 11, 2020

Effective New Methods for Automated Parameter Selection in Regularized Inverse Problems

Toby Sanders, Rodrigo B. Platte, Robert D. Skeel

PDF

TL;DR

This paper introduces a new Bayesian-based criterion for selecting regularization parameters in inverse problems, which does not require prior noise knowledge, and demonstrates its effectiveness through theoretical analysis and numerical simulations.

Contribution

It proposes an iterative scheme for parameter selection in regularized inverse problems based on maximizing data likelihood without prior noise information, with proven convergence and practical algorithms.

Findings

01

The method accurately selects parameters for MRI, SAR, denoising, and deconvolution.

02

The iterative scheme converges reliably and efficiently.

03

Numerical simulations confirm the approach's effectiveness.

Abstract

The choice of the parameter value for regularized inverse problems is critical to the results and remains a topic of interest. This article explores a criterion for selecting a good parameter value by maximizing the probability of the data, {{with no prior knowledge of the noise variance}}. These concepts are developed for $ℓ_{2}$ and consequently $ℓ_{1}$ regularization models by way of their Bayesian interpretations. Based on these concepts, an iterative scheme is proposed and demonstrated to converge accurately, and analytical convergence results are provided that substantiate these empirical observations. For some of the most common inverse problems, including MRI, SAR, denoising, and deconvolution, an extremely efficient algorithm is derived, making the iterative scheme very attractive for real case use. The computational concerns associated with the general case for any inverse…

Tables1

Table 1. Table 1: Average computational run time (in seconds) to determine the optimal parameters for deconvolution and to then compute the corresponding ℓ 1 subscript ℓ 1 \ell_{1} reconstruction from the determined parameters.

image name	pixel count	$ℓ_{2}$ parameter time	$ℓ_{1}$ reconstruction time
cameraman	256x256	0.0391	0.4997
Shepp-Logan	256x256	0.0338	0.7862
peppers	384x512	0.0822	2.3706

Equations121

u_{λ} = ar g u \in R^{n} min ∥ A u - b ∥_{2}^{2} + λ ∥ T u ∥_{p}^{p},

u_{λ} = ar g u \in R^{n} min ∥ A u - b ∥_{2}^{2} + λ ∥ T u ∥_{p}^{p},

E ∥ A (u_{λ} - v) ∥_{2}^{2} = - m σ^{2} + ∥ A u_{λ} - b ∥_{2}^{2} + 2 σ^{2} trace (A B_{λ}),

E ∥ A (u_{λ} - v) ∥_{2}^{2} = - m σ^{2} + ∥ A u_{λ} - b ∥_{2}^{2} + 2 σ^{2} trace (A B_{λ}),

u_{λ} = ar g u max p (u ∣ b) = ar g u max p (b ∣ u) p (u) = ar g u max exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{λ ∥ T u ∥ _{p}^{p}}{2 σ ^{2}}) .

u_{λ} = ar g u max p (u ∣ b) = ar g u max p (b ∣ u) p (u) = ar g u max exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{λ ∥ T u ∥ _{p}^{p}}{2 σ ^{2}}) .

p (b ∣ u, σ) = (2 π σ)^{- m /2} exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) .

p (b ∣ u, σ) = (2 π σ)^{- m /2} exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) .

p (u ∣ η) = \frac{det T}{( 2 π η ^{2} ) ^{n /2}} exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}}) .

p (u ∣ η) = \frac{det T}{( 2 π η ^{2} ) ^{n /2}} exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}}) .

u_{σ, η} = ar g u max exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}}),

u_{σ, η} = ar g u max exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}}),

T=U\left[\begin{array}[]{cc}\Sigma_{1}&0\end{array}\right]\left[\begin{array}[]{c}V_{1}^{\mathsf{T}}\\ V_{2}^{\mathsf{T}}\end{array}\right]

T=U\left[\begin{array}[]{cc}\Sigma_{1}&0\end{array}\right]\left[\begin{array}[]{c}V_{1}^{\mathsf{T}}\\ V_{2}^{\mathsf{T}}\end{array}\right]

p (u ∣ η) = C_{2} exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}} - \frac{∥ V _{2}^{T} u ∥ _{2}^{2}}{2 α ^{2}}),

p (u ∣ η) = C_{2} exp (- \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}} - \frac{∥ V _{2}^{T} u ∥ _{2}^{2}}{2 α ^{2}}),

C_{2} = \frac{det Σ _{1}}{( 2 π η ^{2} ) ^{(n - r) /2} ( 2 π α ^{2} ) ^{r /2}}

C_{2} = \frac{det Σ _{1}}{( 2 π η ^{2} ) ^{(n - r) /2} ( 2 π α ^{2} ) ^{r /2}}

u_{σ, η} = u_{λ} = H^{- 1} A^{T} b = ar g u min ∥ A u - b ∥_{2}^{2} + λ ∥ T u ∥_{2}^{2},

u_{σ, η} = u_{λ} = H^{- 1} A^{T} b = ar g u min ∥ A u - b ∥_{2}^{2} + λ ∥ T u ∥_{2}^{2},

E [f (U) ∣ b] = E [f (u_{λ} + σ H^{- 1/2} X)],

E [f (U) ∣ b] = E [f (u_{λ} + σ H^{- 1/2} X)],

E [f (U) ∣ b]

E [f (U) ∣ b]

= C_{1} \int f (u) exp (- \frac{1}{2 σ ^{2}} (u^{T} H_{α} u - 2 u^{T} A^{T} b)) d u,

E [f (U) ∣ b]

E [f (U) ∣ b]

= C_{3} \int_{R^{n}} f (u_{λ} + σ H_{α}^{- 1/2} x) exp (- x^{T} x /2) d x .

σ^{2}

σ^{2}

η^{2}

p (b)

p (b)

= (2 π σ^{2})^{- m /2} (2 π η^{2})^{- n /2} det T \int_{R^{n}} exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}} - \frac{∥ T u ∥ _{2}^{2}}{2 η ^{2}}) d u

\frac{d}{d σ} p (b) = - \frac{m}{σ} p (b) + σ^{- 3} p (b) E [∥ A U - b ∥_{2}^{2} ∣ b] .

\frac{d}{d σ} p (b) = - \frac{m}{σ} p (b) + σ^{- 3} p (b) E [∥ A U - b ∥_{2}^{2} ∣ b] .

σ^{2}

σ^{2}

η^{2}

E [∥ A U - b ∥_{2}^{2} ∣ b] = E [∥ A (u_{λ} + σ H^{- 1/2} X) - b ∥_{2}^{2}] .

E [∥ A U - b ∥_{2}^{2} ∣ b] = E [∥ A (u_{λ} + σ H^{- 1/2} X) - b ∥_{2}^{2}] .

E [∥ A U - b ∥_{2}^{2} ∣ b] = ∥ A u_{λ} - b ∥_{2}^{2} + σ^{2} trace (H^{- 1} A^{T} A) .

E [∥ A U - b ∥_{2}^{2} ∣ b] = ∥ A u_{λ} - b ∥_{2}^{2} + σ^{2} trace (H^{- 1} A^{T} A) .

E [∥ T U ∥_{2}^{2} ∣ b] = ∥ T u_{λ} ∥_{2}^{2} + λ η^{2} trace (H^{- 1} T^{T} T) .

E [∥ T U ∥_{2}^{2} ∣ b] = ∥ T u_{λ} ∥_{2}^{2} + λ η^{2} trace (H^{- 1} T^{T} T) .

σ_{k + 1}^{2}

σ_{k + 1}^{2}

η_{k + 1}^{2}

\frac{∥ u _{k + 1}^{*} - u _{k}^{*} ∥ _{2}}{∥ u _{k}^{*} ∥ _{2}} < t o l,

\frac{∥ u _{k + 1}^{*} - u _{k}^{*} ∥ _{2}}{∥ u _{k}^{*} ∥ _{2}} < t o l,

A u (t_{i}, θ_{j}) = \int_{R^{2}} u (x, y) δ (t_{i} - (x, y) \cdot (cos θ_{j}, sin θ_{j})) d x d y

A u (t_{i}, θ_{j}) = \int_{R^{2}} u (x, y) δ (t_{i} - (x, y) \cdot (cos θ_{j}, sin θ_{j})) d x d y

p (u ∣ η) = \frac{det T}{( 2 η ) ^{n}} exp (- \frac{∥ T u ∥ _{1}}{η / 2}) .

p (u ∣ η) = \frac{det T}{( 2 η ) ^{n}} exp (- \frac{∥ T u ∥ _{1}}{η / 2}) .

u_{σ, η} = ar g u max C exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{∥ T u ∥ _{1}}{η / 2}),

u_{σ, η} = ar g u max C exp (- \frac{∥ A u - b ∥ _{2}^{2}}{2 σ ^{2}}) exp (- \frac{∥ T u ∥ _{1}}{η / 2}),

u min ∥ C u - \tilde{u} ∥_{2}^{2} + λ ∥ T u ∥_{2}^{2},

u min ∥ C u - \tilde{u} ∥_{2}^{2} + λ ∥ T u ∥_{2}^{2},

T_{1}=\left(\begin{array}[]{ccccc}-1&1&0&\dots&0\\ 0&-1&1&\dots&0\\ \vdots&&\ddots&&\vdots\\ 1&0&0&\dots&-1\end{array}\right)~{}\text{and}\quad T_{2}=\left(\begin{array}[]{ccccc}1&-2&1&\dots&0\\ 0&1&-2&\dots&0\\ \vdots&&\ddots&&\vdots\\ -2&1&0&\dots&1\end{array}\right).

T_{1}=\left(\begin{array}[]{ccccc}-1&1&0&\dots&0\\ 0&-1&1&\dots&0\\ \vdots&&\ddots&&\vdots\\ 1&0&0&\dots&-1\end{array}\right)~{}\text{and}\quad T_{2}=\left(\begin{array}[]{ccccc}1&-2&1&\dots&0\\ 0&1&-2&\dots&0\\ \vdots&&\ddots&&\vdots\\ -2&1&0&\dots&1\end{array}\right).

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\pdfcolInitStack

tcb@breakable

Effective New Methods for Automated Parameter Selection in Regularized Inverse Problems

Toby Sanders, Rodrigo B. Platte, Robert D. Skeel

School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ, USA.

Abstract

The choice of the parameter value for regularized inverse problems is critical to the results and remains a topic of interest. This article explores a criterion for selecting a good parameter value by maximizing the probability of the data, with no prior knowledge of the noise variance. These concepts are developed for $\ell_{2}$ and consequently $\ell_{1}$ regularization models by way of their Bayesian interpretations. Based on these concepts, an iterative scheme is proposed and demonstrated to converge accurately, and analytical convergence results are provided that substantiate these empirical observations. For some of the most common inverse problems, including MRI, SAR, denoising, and deconvolution, an extremely efficient algorithm is derived, making the iterative scheme very attractive for real case use. The computational concerns associated with the general case for any inverse problem are also carefully addressed. A robust set of 1D and 2D numerical simulations confirm the effectiveness of the proposed approach.

1 Introduction

Image and signal denoising and reconstruction problems are important research topics due to their wide range of applications, including medical diagnosis, defense, and basic scientific research [20, 21, 22, 2, 17, 19]. These problems arise when an object or image, which we denote by $u$ , cannot easily be observed in a straightforward manner, e.g. brain imaging. In a linear model, when $u$ must be measured in a indirect fashion, measurements of $u$ are encoded into a data vector of the form $b=Au+\epsilon$ , where $A$ is a linear operator and $\epsilon$ is a noise term inherent to the sensing mechanism of the application. Then the image reconstruction problem is to recover or decode the most accurate representation of $u$ given $A\in\mathbb{R}^{m\times n}$ and $b$ , and any additional prior information. The topic of this article is the proper choice of the regularization parameter in the reconstruction model which balances the data fitting with the priors.

Inverse problems are typically characterized as ill-posed, resulting in solution maps that are sensitive to the noise term. To alleviate this issue, it is common to implement regularization techniques that promote favorable solutions based on prior knowledge of the behavior of the target signal. For example, in the continuous formulation, an order 1 Tikhonov regularization scheme sets $R(u)=\int_{\Omega}|\nabla u(x)|^{2}\,dx$ , and the model minimizes a weighted sum of $R(u)$ with $\|Au-b\|_{2}^{2}$ . This formulation intuitively recovers smooth solutions with small variation [34]. The regularized solutions considered in this article takes the form

[TABLE]

for $p=1,2$ . The main focus of this article is the choice of the important parameter $\lambda>0$ that balances the regularization with the data fitting term.

In (1), the case $p=2$ yields the discretized model for Tikhonov regularization, and finite difference matrices are often used for $T$ to approximate derivatives [27, 5, 29]. The case $p=1$ is typically referred to as the compressed sensing formulation when $m<n$ [10, 9]. While the Tikhonov case is less computationally expensive using conjugate gradient (CG) methods, it is well documented that in many applications the $\ell_{1}$ regularized solutions can be superior, hence they are also of interest in the current work.

Generally, the main benefit of working with Tikhonov regularization is computational. For the most challenging problems where the best possible results are necessary, more computationally intensive techniques can yield better results by using for example learning-based priors [26, 40, 33] or $\ell_{1}$ regularization models as indicated in some of our results. Nonetheless, Tikhonov regularization remains the regularizer of choice for a number of problems with sufficient data but where a fast and reliable reconstruction is needed. For example, it was recently studied in the case of rapid blind deconvolution of low noise images [38, 39] and for real time processing of very large data sets [31, 6, 7]. Moreover, even with the most recent wave of learning-based prior models, the alternating direction method of multipliers (ADMM) approach most often used as the computational tool to optimize these models involves solving intermediate Tikhonov regularization problems [4, 26, 35]. Hence, the work developed here could potentially be leveraged to improve these algorithms, e.g. in the way we leverage our approach for novel $\ell_{1}$ regularized parameter selection.

In general, the parameter $\lambda$ should be larger for smaller signal-to-noise ratio (SNR) in $b$ , and vice versa. The SNR is unknown for most applications, and even when it is this does not immediately inform us what value $\lambda$ should take. Perhaps most commonly researchers choose $\lambda$ based on “experience.” While this approach often leads to suitably pleasing results, having a robust automated approach eliminates any user bias, provides more broadly applicable results, and saves time for the users.

This article proposes a new criterion and procedure for the parameter selection, which uses the equivalent maximum a posteriori (MAP) interpretation of regularized inverse problems for a Bayesian approach [18]. Using the MAP interpretation of (1), we consider the parameter $\lambda$ to be optimal when providing maximum evidence (ME), i.e. it maximizes the likelihood of the data, $b$ , which is related to a marginalized maximum likelihood (ML) estimation [3]. The main contributions of this article, which are expanded upon further below, may be summarized by the following:

Proof of concept of the ME criterion for noise estimation and effective parameter selection, showing comparable results to the leading methods where the noise level is assumed known. 2. 2.

Algorithms and computational methods for very efficient implementation, particularly for important applications such as MRI and deconvolution, and a novel nonlinear iteration scheme for finding the parameter that is robust and efficient 3. 3.

Extension of the recovered $\ell_{2}$ parameter value to obtain an accurate $\ell_{1}$ parameter, which is optimal as $m\rightarrow\infty$ .

In light of the proposed ME criterion, several theoretical results are first developed that are necessary for the algorithm. Then, based on these results, we propose a fixed point iterative scheme that updates two parameters, $\sigma$ and $\eta$ , that together determine $\lambda$ . The value $\sigma^{2}$ is the variance on the random noise vector $\epsilon\sim N(0,\sigma^{2})$ , which, contrary to most parameter selection methods, we assume to have no prior knowledge of. The parameter $\eta$ is related to the regularity or scale of the solution, and we call $\eta^{2}$ the variance of the signal. This iterative scheme is developed for the Tikhonov regularized problem and is shown to converge accurately in relatively few iterations (e.g. 5 or 10).

Revisiting the Bayesian formulation, we show how the recovered parameters $\sigma$ and $\eta$ for $\ell_{2}$ regularization give very good estimates for the $\ell_{1}$ regularization parameter, which is more appealing for many applications. This contribution, while simple in derivation, is shown to be very effective, and we believe to be one of the critical pieces of this work. Finally some analysis of the convergence of the proposed fixed point scheme is given, which indicates there are typically two and possibly more nontrivial parameters $\lambda$ satisfying the equations derived for the ME parameter. However, this analysis also indicates that when the appropriate regularization operator $T$ is used, the only stable parameter (and hence the one found with the proposed algorithm) is indeed the ME parameter, which converges for a very large range of starting values.

There are several computational considerations to address for our approach. In particular, each iteration requires the trace of a matrix which is infeasible to compute directly for large imaging problems. However, for many important inverse problems including image denoising, deconvolution, and image reconstruction from Fourier data (e.g. SAR, MRI, and deconvolution), we analytically develop the necessary ingredients that allow us to compute the value of the necessary traces and solution exactly at the cost of essentially just one fast Fourier transform (FFT) resulting in an extremely fast algorithm. This makes our scheme very appealing and practical for real case use. The general case of any inverse problem is also addressed, in which the authors implemented trace estimation procedures using random vectors, which is closely related to the trace estimation procedures already used for UPRE.

The remainder of this article is organized as follows. In the subsection below we discuss previous work. In section 2 and 2.1 the necessary background and theoretical developments are provided, which leads to the fixed point iteration for finding the $\ell_{2}$ regularization parameter. The general outline of the full computational algorithm is also given, and a few simple examples demonstrate the approach and its potential. This is followed in 2.2 and 2.4 with several additional simulated test problems and comparison with other methods. In section 3 we demonstrate how the $\ell_{2}$ parameters recovered using our algorithm may be used to obtain an accurate estimate for the $\ell_{1}$ regularization parameter at no computational cost, and simple examples show that this is very effective. In section 4 the fast versions of the algorithm are developed for problems such as MRI, SAR, and deconvolution, and they are shown to provide several orders of magnitude in speed up. To avoid complicating the work, some of the details necessary for this algorithm are provided in the appendix. The convergence analysis of this algorithm is provided in section 5 and some of these details are also provided in the appendix.

1.1 Previous work

There have been a number of automated and semi-automated approaches proposed for choosing parameter selection in inverse problems [36], which rely on various different criteria that characterize a good parameter. These include the L-curve method [16, 8] and generalized cross validation (GCV) [15]. The L-curve method is mainly an empirical method, while GCV provides a reasonable criterion based on having a model where solutions from subsets of data fit complimentary data sets. There is also the discrepancy principle, which enforces the condition that the regularized solution matches the assumed known noise variance so that $\|Au-b\|_{2}^{2}\approx m\sigma^{2}$ . This method is not generally preferred due to unfavorable empirical evidence [36, 14], while also requiring accurate knowledge of $\sigma^{2}$ . Moreover, these methods typically require one to compute many solutions over a potentially large range of possible parameters in a brute force approach to find the optimal value.

The leading criteria for parameter selection are the unbiased predictive risk estimator (UPRE) [23, 31] and Stein’s unbiased risk estimator (SURE)[32], which provide unbiased estimates of a squared error. The goal then, is to minimize these estimators as functions of $\lambda$ . Denoting the true solution by $v$ , UPRE gives an estimate of the predictive error by

[TABLE]

where $B_{\lambda}$ is the linear solution map, i.e. $u_{\lambda}=B_{\lambda}b$ . The UPRE method has usually been applied to Tikhonov regularization [25], hence $B_{\lambda}=(A^{\mathsf{T}}A+\lambda T^{\mathsf{T}}T)^{-1}A^{\mathsf{T}}$ , although it is applicable more generally to problems where the solution depends linearly on the data. The method also assumes i.i.d. Gaussian noise and prior knowledge of the variance $\sigma^{2}$ , and the minimization of UPRE typically requires an exhaustive search over $\lambda$ [36], similar to procedures for the L-curve method.

The SURE estimator is similar but more general than UPRE due to its application to nonlinear inverse models. In fact, the UPRE method just described can be easily derived from SURE, although the pure form of SURE for inverse problems was originally considered for denoising and threshold selection [12, 41]. The fact that SURE can be applied to nonlinear solutions has generated interest for its application to $\ell_{1}$ regularization models [42, 12, 24]. More recent developments [13, 14] have generalized SURE (GSURE) to minimize the expectation of $\|P(u_{\lambda}-v)\|_{2}^{2}$ , where $P=A^{\mathsf{T}}(A^{\mathsf{T}}A)^{-1}A$ is the projection operator onto the range of $A^{\mathsf{T}}$ . This same work further extends the method to families of exponentials (e.g. non-i.i.d. noise models), which is outside the scope of this work. As observed in [24], UPRE and the projected GSURE are both estimators of particular instances of weighted error norms. Likewise to UPRE, the literature on SURE always assumes the noise variance (or covariance matrix) is known a priori, contrary to our approach. For additional detailed discussion and review of these topics, see [24] and the references therein.

More recent work has successfully proposed learning adaptive parameters from a training set [11]. Certainly, developing a training set to learn from is far from ideal for the parameter selection problem if there is a suitable alternative. The statistical criteria for which we develop an algorithm are classical methods, whose conceptual basis is easy to explain. Such methods will be used for a long time by many researchers, and for comparison purposes by researchers developing new methods (see e.g. recent examples in [11, 31]), and for these reasons we do not consider this approach in this work.

2 Bayesian Formulation and Algorithm for Finding $\lambda$

We begin by writing an equivalent expression for $u_{\lambda}$ in (1) as

[TABLE]

We observe this expression as a maximum a posteriori (MAP) formulation from the Bayesian perspective [18], and we write $u_{\lambda}$ maximizes $p(u|b)\propto p(b|u)p(u)$ . For a noise vector $\epsilon\sim N(0,\sigma^{2}I)$ , it is clear that

[TABLE]

For the prior we consider for now $\ell_{2}$ regularizations with parameter $\eta$ , which is related to the regularity of the signal, and we call $\eta^{2}$ the variance of the signal (under the map $T$ ). Then our $\ell_{2}$ Gaussian prior in the case $T$ is nonsingular takes the form

[TABLE]

From (4) and (5) we see that for a given $\sigma$ and $\eta$ , which are generally unknown, we have the MAP formulation as

[TABLE]

which one may observe is equivalent to the minimization in (1) with $p=2$ and $\lambda=\sigma^{2}/\eta^{2}$ . Numerically, we use the formulation (1) to find $u_{\lambda}$ , however we make use of the equivalent Bayesian formulation (6) for the analysis in finding good parameters.

Normally the $\ell_{2}$ prior would take the form $p(u)\propto\exp(-\|Tu\|_{2}^{2}/(2\eta^{2}))$ . However, this is improper in the typical case where $T$ is singular, since it cannot be normalized. To fix this (see section 3.4 of [18]), one may use a (nearly) flat prior on the null space of $T$ , in particular, a Gaussian with variance $\alpha$ approaching $+\infty$ . Proceed by writing the singular value decomposition of $T$ as

[TABLE]

and using

[TABLE]

for which

[TABLE]

where $n-r$ is the rank of $T$ .

2.1 Iterative Algorithm for Finding the Noise and Signal Variances

Our goal is to find a good estimate for $\sigma$ and $\eta$ in an efficient manner. We will make use of the Bayesian interpretation of Tikhonov regularization with probability distributions in (4) and (5). The parameters will be considered good in a marginalized maximum likelihood sense for maximum evidence (ME), i.e. they maximize $p(b|\sigma,\eta)$ . We first provide the theoretical developments necessary for an ME algorithm.

Definition 1.

For a given $\sigma$ and $\eta$ , we define $u_{\sigma,\eta}$ to be the MAP solution in (6), which may be equivalently expressed by the Tikhonov regularized minimizer as

[TABLE]

where $H=A^{\mathsf{T}}A+\lambda T^{\mathsf{T}}T$ and $\lambda=\sigma^{2}/\eta^{2}$ .

Lemma 1.

Consider the expectation of an arbitrary function $f(U)$ , where $U$ is a random variable with the Gaussian density function in (5). Then for a given $b$ with conditional expectation given by (4), the conditional expectation of $f(U)$ is

[TABLE]

where $X\sim N(0,I)$ .

Proof.

We will use capital $C$ ’s to denote constants independent of $u$ that can be absorbed on the outside. We begin by writing the conditional expectation using Bayes’ theorem as

[TABLE]

where $H_{\alpha}=H+(\sigma^{2}/\alpha^{2})V_{2}V_{2}^{\mathsf{T}}$ . Completing the square of the matrix equation in the exponential leads to

[TABLE]

Because $H$ is symmetric positive definite, $H_{\alpha}^{-1/2}=H^{-1/2}+\mathcal{O}(\alpha^{-2})$ , and the limit $\alpha\rightarrow+\infty$ is well defined. Setting $f(u)=1$ , we observe that $C_{3}=(2\pi)^{-n/2}$ , which completes the proof.∎

Lemma 2.

Let $b$ be given with conditional density defined by (4), and let $U$ be a random variable having a density defined by (5), with $\sigma$ and $\eta$ considered unknown parameters. Then the values of $\sigma$ and $\eta$ that maximize $p(b)$ satisfy the following conditional expectations:

[TABLE]

Proof.

Using the law of total probability leads to

[TABLE]

Differentiating this expression with respect to $\sigma$ leads to

[TABLE]

Setting this expression to zero completes the proof for (9), and the details for (10) are similar. ∎

Theorem 1.

Let $U$ be a random variable with density given by (5), and let $b$ be given with conditional density given by (4). Then $\sigma$ and $\eta$ which maximize $p(b)$ in terms of expectations satisfy the following equalities:

[TABLE]

where $H=A^{\mathsf{T}}A+\lambda T^{\mathsf{T}}T$ and $\lambda=\sigma^{2}/\eta^{2}$

Proof.

Letting $f(u)=\|Au-b\|_{2}^{2}$ and applying Lemma 1 leads to

[TABLE]

Expanding this expression out and using the properties of $X\sim N(0,I)$ (see for example, Lemma 7.2 in [36]) leads us to

[TABLE]

In a similar fashion

[TABLE]

Combining equations (12)-(13) with Lemma 2 completes the proof. ∎

Theorem 1 is the basis of an iteration to find $\sigma$ and $\eta$ for the ME algorithm, which is given by

[TABLE]

where it is implied in this case that $H_{k}=A^{\mathsf{T}}A+\lambda_{k}T^{\mathsf{T}}T$ , $\lambda_{k}=\sigma_{k}^{2}/\eta_{k}^{2}$ , and $u_{k}^{\ast}$ is the Tikhonov regularized solution for parameter $\lambda_{k}$ . This iteration is set to converge whenever

[TABLE]

or until we reach some maximum number of iterations $K$ , which we demonstrate only mildly depends on $\lambda_{0}$ .

One must consider the cost of such an iteration. Later in section 4 we give the most useful results to deal with the computational cost. The results in that section lead to an extremely efficient scheme for this iteration for some of the most common inverse problems, including denoising, deconvolution, and MRI. For completeness however, we present the general algorithm first. Obviously each iteration requires the minimization (7) to find $u_{k}^{\ast}$ , which for general sampling matrices $A$ is most suitably solved with a Krylov subspace method, in which case we implement a conjugate gradient method. The biggest computational burden for the general problem is in approximating the traces in (14) and (15), for which a Monte Carlo method [1] is suggested. Specifically, the trace of any square matrix $C$ is given by $\mathbb{E}[X^{\mathsf{T}}CX]=\mathrm{trace}(C)$ , where $X$ is a random vector of independent Gaussians. In its pure form this method uses a set of independent pseudo-random vectors $\{x_{j}\}_{j=1}^{J}$ . The error of the estimate is noticeably reduced if the set is first orthogonalized, notwithstanding the bias that is introduced. One can precompute $A^{\mathsf{T}}Ax_{j}$ and $T^{\mathsf{T}}Tx_{j}$ . However, since $H$ depends on $\sigma$ and $\eta$ we must compute approximations to $H^{-1}x_{j}$ at each iteration. Hence at each iteration over $\sigma$ and $\eta$ we have $J+1$ solves of (7). An outline of the general algorithm is provided in Algorithm 1.

We proceed with an example demonstrating the effectiveness of this approach on a 1D piecewise quadratic signal of dimension $n=500$ . The sampling matrix $A\in\mathbb{R}^{n\times n}$ was generated randomly with independent normally distributed entries. For the additive i.i.d. Gaussian noise vector $\epsilon$ we chose $\sigma$ so that the signal to noise ratio (SNR) is 2. For the regularization operator $T$ , in order to avoid the inverse crime of prior knowledge of the quadratic signal, we used a first order finite difference to penalize the discrete first derivative. The only parameter left to choose is $\lambda_{0}$ . We chose a series of initial $\lambda_{0}$ ’s that are equally spaced logarithmically and plot, in Figure 1, the evolution of various parameters and errors as the iteration progresses. At the very least, it should be evident that the convergence in this example is only mildly dependent of the initial $\lambda_{0}$ even at several orders of magnitude difference, and in all cases convergence is practically achieved in 10 iterations or less. Moreover, the relative error (top left), defined by the relative difference of the reconstruction and the true solution analogous to (16), is observed to nearly monotonically decrease after each update, indicating a near optimal recovered solution for $\lambda$ . All simulations converge the same final $\lambda$ with a maximum relative difference between the parameters to be less than $4\times 10^{-5}$ . This in turn generates very similar solutions $u_{\lambda}$ , as evidenced by the similarity of the error plots. The convergence of $\sigma$ and $\eta$ are also presented, and $\sigma$ is shown to accurately converge close to the true value in all cases. Finally, for completeness, the evolution of the solutions are presented in the bottom right two panels for initial selections of $\lambda_{0}=10^{-2}$ and $\lambda_{0}=10^{5}$ . Many other numerical results are presented later that further confirms the findings of this example.

2.2 Comparison with UPRE

In this section we compare our approach with UPRE. For 4 distinct test signals, 500 such simulations were performed. In each case a random square sampling matrix $A\in\mathbb{R}^{n\times n}$ was generated randomly with independent normally distributed entries, and a random $\sigma$ value was generated on the interval $[.8,10]$ . Then our iterative scheme was implemented with a maximum of only 15 iterations to find $\sigma$ and $\eta$ . The 4 distinct signals are as follows: test signal 1 is a piecewise constant boxcar signal, test signal 2 is a piecewise linear hat function, test signal 3 is one period of a sine wave, and test signal 4 is the piecewise quadratic from [30] and the previous example. The corresponding regularization operators $T$ were chosen as various finite difference operators that appropriately match the signal properties, except again to avoid the inverse crime of overfitting each problem we used a first order finite difference regularizer for test signal 4. For test signal one, we used a first order finite difference regularization, and for test signals 2 and 3 we used a multiscale second order finite difference regularization operator [30]. Recall our method does not require knowledge of $\sigma$ whereas UPRE does require $\sigma$ .

For UPRE, we followed the general approach suggested by Vogel [36], by selecting a series of $\lambda$ test values (e.g. 20) that are logarithmically equally spaced and choosing the parameter which minimizes the UPRE objective function. From here, we even performed a second refinement by testing an additional series of parameter values near this parameter (also logarithmically equally spaced) and then settling on the optimal parameter from this set.

The scatter plots comparing the recovered parameters from our method and UPRE are shown in Figure 2. These show that the two methods generally yield similar results (points near the dashed line mean similar recovered parameter values), with varying success and trends between the test signals. These plots also show a very positive relationship between the SNR and $\lambda$ (lower SNR values almost always yield larger $\lambda$ ).

There are some notable discrepancies in the recovered $\lambda$ from our method and UPRE, particularly with test signals 2 and 3. Therefore to discern the two methods we provide in Figure 3 a second set of scatter plots from the simulations, where the resulting errors from our method and UPRE are plotted against one another. For test signal 1, we see that the two methods generate similar approximations to the true solutions, which should be expected since the two generate very similar $\lambda$ values seen in Figure 2. On the other hand, for the remaining 3 test signals some discrepancies were observed in the recovered $\lambda$ value, and the error plots in Figure 3 show that our method tended to yield improved solutions (and hence $\lambda$ values), particularly for signals 2 and 3. Therefore we conclude from these examples that our method, while assuming far less information that UPRE, may still yield improved results in the reconstruction.

2.3 Accuracy of $\sigma$

In this section we repeat a similar set of simulations from section 2.2 to search for the parameter $\lambda=\sigma^{2}/\eta^{2}$ . The general set up was the same as before, where this time we specified a random value of the SNR selected from the interval $[2,20]$ to determine the additive noise.

The scatter plots comparing the true $\sigma$ and recovered $\sigma$ are given in Figure 4, and the corresponding SNR is given by the coloring. These show that our scheme generally yields very accurate estimates of the variance under these settings. There is greater spread for larger variances, but the relative error does not necessarily increase. Moreover, out of these 2,000 simulations there are only two clear poor solutions for $\sigma$ , which come from test signal 3.

2.4 2D Tomographic Example

In this section we apply our automatic parameter selection to a 2D tomographic imaging example with parallel beam geometry. For this problem the data vector $b$ takes values of the form

[TABLE]

for $i=1,\dots,N$ and $j=1,\dots,M$ . These types of inverse problems occur in a large number of applications, e.g. X-ray medical imaging, and electron and neutron tomography. Typically the mesh formed by the $t_{i}$ is very fine, whereas the angular spacing $\theta_{j}$ can be quite large depending on the application.

For our test problem we set a fixed $\Delta\theta=\theta_{j+1}-\theta_{j}$ to be $5\degree$ , and acquired data of the form (17) at each such angle increment over the full $180\degree$ extent for a total of 36 angles. The spacing $t_{i}$ is set so that $N=512$ resulting in $b\in\mathbb{R}^{512\cdot 36}$ . However, the image is also $u\in\mathbb{R}^{512\times 512}$ , making the problem severely underdetermined. Based on the parameters specified above, the corresponding forward and adjoint operator is computed efficiently in MATLAB with a sparse matrix, which can be constructed from our openly available software [28].

We tested 4 different SNRs, 5, 10, 20, and 30, and used a first order finite difference regularizer. The initial choice of $\lambda$ was $\lambda_{0}=1$ . The resulting convergence of $\lambda$ for each case is plotted in logarithmic scale in Figure 5, along with the resulting image reconstruction at SNRs of 5 and 10. In the left two columns are the unregularized reconstructions from ordinary least squares (OLS) and filtered backprojection (FBP). In the right two columns are the $\ell_{2}$ Tikhonov regularized solutions from our recovered optimal parameter, where the right most solution was recomputed under the constraint that $u$ be a nonnegative mass-density function. The unconstrained Tikhonov and OLS solutions were computed with an iterative conjugate gradient method, which is likely the most efficient approach due to the sparse nature of the sampling operator. The nonnegative Tikhonov solution is computed with a projected gradient decent approach [28].

Observe that the quality of the regularized solutions are quite good compared with the unregularized, and the density constraint provides further notable improvement. Also observe in plots of $\lambda$ that the recovered value accurately reflects changes in the SNR levels.

3 Mapping Tikhonov Parameters onto L1 Parameters

Consider again the general inverse problem where the noise vector $\epsilon$ is i.i.d. mean zero Gaussian with variance $\sigma^{2}$ and the signal variance is $\eta^{2}$ . For the Gaussian prior on the signal, this lead us to the prior given in (5), and equivalently the regularization in (1) with $p=2$ and $\lambda=\sigma^{2}/\eta^{2}$ . If we instead assume a Laplacian prior, i.e. an $\ell_{1}$ regularization with $p=1$ in (1), then this prior with variance $\eta^{2}$ is given by

[TABLE]

This leads to the MAP solution given by

[TABLE]

and we see this formulation is equivalent to (1) with $p=1$ and $\lambda=2^{3/2}\sigma^{2}/\eta$ . Due to the favorable analytical properties $\ell_{2}$ norm, in section (2.1) we were able to derive an iteration for the variances $\sigma^{2}$ and $\eta^{2}$ with a Gaussian prior. Moreover, the iterations for the $\ell_{2}$ prior can be evaluated very quickly, hence we use the iterative scheme for the Gaussian prior to find $\sigma^{2}$ and $\eta^{2}$ . These variances can then be put into the $\ell_{1}$ MAP estimation (19) and yield the corresponding $\lambda$ for the $\ell_{1}$ regularized problem as written above. The $\ell_{1}$ optimization problem is solved using the alternating direction method of multipliers approach [37, 4], and our code is openly available [28].

The use of the Gaussian prior to select $\lambda$ for the Laplace prior is justified as follows. The prior $p(u|\eta)$ given by Eq. (18), as a function of $Tu$ , is the density function of a sum of $n$ independent Laplace random variables each with variance $\eta^{2}$ . This can be approximated by a sum of Gaussian random variables having the same variance, and by the central limit theorem the difference between the two distributions gets smaller as $n$ becomes larger. Moreover, the marginal probability $p(b)=p(b|\sigma,\eta)$ , defined in the proof of Lemma 2, depends not pointwise on the prior but on an integral of the prior, further diminishing the difference. And it is the maximization of $p(b|\sigma,\eta)$ that determines $\sigma$ and $\eta$ . Therefore, the optimal choice for $\ell_{2}$ is expected to be very nearly optimal for $\ell_{1}$ .

Numerical tests of these concepts are presented in Figure (6). Here we have solved for $\sigma$ and $\eta$ for the Tikhonov regularized problem as before, and then used these parameters to yield $\lambda$ for the $\ell_{1}$ regularized problem as written above. From these parameters, the measured errors between the true solution and the recovered solution are plotted against one another for the $\ell_{1}$ and $\ell_{2}$ regularized solutions. The set up for these simulations was the same as those in section 2.2. It is observed from these plots that projecting the parameters $\sigma$ and $\eta$ onto the $\ell_{1}$ problem generally provides solutions that are comparable to the $\ell_{2}$ , and in many cases the $\ell_{1}$ solution provides better results. Hence it appears to be a quite reasonable strategy to use the ML estimation for the $\ell_{2}$ parameter to select the $\ell_{1}$ parameter.

Finally, we present numerical results for deconvolution of 2D images with different noise levels and convolutional kernels. The convolution kernels are symmetric Gaussian point spread functions with different standard deviations, $\omega$ , varying between $\omega=0.5$ and $\omega=3$ , where we are using the convention that a unit length in an image is one pixel. The noise added is i.i.d. white Gaussian noise, where the SNR is defined by the mean image value divided by the standard deviation of the noise. The simulations are performed on three classical test images, the cameraman, Shepp-Logan phantom, and the peppers image. The first two images are grayscale and the third was converted to grayscale for these simulations. An extremely fast version of the algorithm is implemented for deconvolution, as outlined in the next section. In each case, we simply used a first order finite difference regularization matrix, $T$ . The $\ell_{2}$ parameters found from our algorithm are again projected onto the $\ell_{1}$ formulation. Some of the resulting images for the cameraman are shown Figure 8.

From these simulations, the resulting errors (compared with the known true solution) are shown in Figure 7, where the relative error presented is analogous to (16). These results indicate that our approach and UPRE yield very similar results for these simulations, with marginal improvements in our approach for a few cases. Moreover, our approach, as opposed to UPRE, was implemented without the knowledge of the noise level in the data and uses a quickly convergent fixed point iteration to find the optimal parameters. The $\ell_{1}$ solutions found by projecting the recovered $\ell_{2}$ parameters are notably improved in each case, and significantly so for the Shepp-Logan phantom (results in middle row), likely due to the fact that the total variation $\ell_{1}$ regularization is ideal for this image.

The plots in the far right column provide a closer comparison by showing the errors curves resulting from the fourth row of each error matrix, which is the case $\omega=1.33$ . Included in these plots is the error obtained from the $\ell_{2}$ and $\ell_{1}$ solutions where the parameter was tuned to minimize the error between the reconstruction and the true solution, which is obviously not possible in practice. Nonetheless, this result indicates that the solution obtained using our approach for both the $\ell_{2}$ and $\ell_{1}$ parameter selection is nearly optimal, particularly for the last two test images. One could not hope for a better result in this case.

4 Accelerated Iterations for Denoising and Deconvolution

In this section we show how to significantly reduce the computational load for our iterative scheme for the popular denoising and deconvolution applications, and in the appendix these derivations are easily extended to Fourier reconstruction problems (e.g. SAR and MRI). It is achieved by obtaining exact formulas for the traces appearing in (12) and (13) and the solutions to (7) that require only one inverse FFT and multiplication by a diagonal matrix in each iteration. Moreover, the FFT may be avoided if one is satisfied with the leaving the iterated solutions in the Fourier domain, leaving us with only $O(n)$ flop count for each iteration. The derivations needed for 1D are provided here, and extensions to higher dimensions and Fourier sampling are provided in the appendix.

The case of denoising occurs when the sampling matrix $A$ is the identity and the data vector is a noisy version of the image given by say $\tilde{u}$ . More generally, we consider the deconvolution problem, where the sampling matrix is circulant. These denoising and deconvolution problems are written as

[TABLE]

where $C$ is a circulant matrix and $\tilde{u}$ is a noisy and/or blurred version of $u$ .

In what follows, formulas for the traces needed for (14) and (15) are derived analytically by determining and summing the eigenvalues. First observe that circulant matrices are diagonalizable by the unitary discrete Fourier transform111This can be seen as a direct result of the Fourier convolution theorem. denoted by $\mathcal{F}$ , and hence for this calculation we consider circulant regularization matrices $T=T_{r}$ of the form

[TABLE]

Notice in general we have $T_{r}=T_{1}^{r}$ . We write the diagonalization of $T_{r}$ and $C$ as

[TABLE]

where $\Lambda_{r}$ and $\Lambda_{C}$ contain the eigenvalues of $T_{r}$ and $C$ respectively. The eigenvalues in the diagonal matrices $\Lambda_{r}$ and $\Lambda_{C}$ may be evaluated by simply taking the discrete Fourier transforms (DFT) of the first column of $T_{r}^{\mathsf{T}}$ and $C^{\mathsf{T}}$ , respectively.

These diagonalized forms allow us to analytically derive simple formulas for the traces and evaluate them very cheaply. Using (22), observe the expression for the matrix products within the traces in (14) and (15) as

[TABLE]

Therefore, $H^{-1}C^{\mathsf{T}}C$ and $H^{-1}T_{r}^{\mathsf{T}}T_{r}$ are also diagonalized by the Fourier transform and also circulant, and their eigenvalues are given by the elements of the diagonal matrices $(|\Lambda_{C}|^{2}+\lambda|\Lambda_{r}|^{2})^{-1}|\Lambda_{C}|^{2}$ and $(|\Lambda_{C}|^{2}+\lambda|\Lambda_{r}|^{2})^{-1}|\Lambda_{r}|^{2}$ respectively. The eigenvalues for $C$ will depend on the specific convolution operator, and in the case of denoising these eigenvalues are all obviously 1. In general, we denote these eigenvalues by $\gamma_{j}(C)=\sum_{k=1}^{n}c_{1,k}\mathrm{e}^{-\mathrm{i}2\pi j(k-1)/n}$ , for $j=0,1,\dots,n-1$ . For $T_{r}$ , we have the exact expression for these eigenvalues as

[TABLE]

for $j=0,1,\dots,n-1$ , and therefore $|\gamma_{j}(T_{r})|^{2}=4^{r}\sin^{2r}(\pi j/n)$ . Then the traces needed for our algorithm are given by

[TABLE]

which are reevaluated very cheaply for each update on $\lambda$ .

This handles the traces needed for (14) and (15), and next we need to efficiently evaluate the numerators, $\|Cu_{\lambda}-\tilde{u}\|_{2}^{2}$ and $\|T_{r}u_{\lambda}\|_{2}^{2}$ . Once again using the diagonalized form of $T_{r}$ , observe the latter norm may be written as

[TABLE]

Leveraging Parceval’s theorem with the diagonalized form of $C$ into the first norm leads to

[TABLE]

From (27) and (28), observe that with the availability of the Fourier transform of $u_{\lambda}$ these norms may be computed by multiplication with a diagonal matrix followed by a dot product. Using the diagonalized representations once more the solution to (20) is written as

[TABLE]

where $D_{C,r}=(|\Lambda_{C}|^{2}+\lambda|\Lambda_{r}|^{2})^{-1}\overline{\Lambda_{C}}$ is a diagonal matrix 222Observe that (29) is just a particular form of a Wiener filter. Therefore the Fourier transform of $u_{\lambda}$ is obtained by only multiplying $\mathcal{F}\tilde{u}$ by a diagonal matrix, and $\mathcal{F}\tilde{u}$ only needs to be precomputed. Moreover, the inverse Fourier transform in (29) needed to obtain $u_{\lambda}$ only needs to be evaluated following the iterations, so that only one initial FFT is needed to evaluate $\mathcal{F}\tilde{u}$ prior to the iterations and one final inverse FFT to evaluate the solution following the iterations. A pseudo algorithm summarizing these ideas is provided in Algorithm 2.

To formally point out the computational load of this algorithm, recall the forward operator $A\in\mathbb{R}^{m\times n}$ , and for deconvolution $m=n$ . Then it can be observed that the main computational load for each main outer loop of the general algorithm presented in Algorithm 1 requires $O(n^{2}N(J+1))$ flops, where $J$ is the number of random vectors used and $N$ is the number of iterations used in the CG solver. Alternatively, the main computational load for each loop in Algorithm 2 is given by dot products and multiplication by diagonal matrices, which are only $O(n)$ operations, providing massive reduction in computation compared with the general formulation.

Average run times (in MATLAB) for determining the optimal parameters for deconvolution using this fast approach are provided in table 1, along with the time needed to then compute the optimal $\ell_{1}$ solution from the determined parameters using the state-of-the-art ADMM approach [4]. These timed simulations were taken from the simulated results in Figures 7. The convergence for each simulation was based on the relative change in the iterated solution reducing to less than a fixed tolerance, $10^{-4}$ . Observe that the time required to determine the parameters is roughly 12 to 30 times faster than needed to compute the $\ell_{1}$ solution. Considering this along with the plots in Figure 7, this approach appears to very practical for either extremely fast optimized deconvolution with Tikhonov regularization, or for simply choosing the parameters for the superior $\ell_{1}$ model with practically no additional computation time.

The analytical extensions of these concepts to 2D problems and Fourier sampling are provided in the appendix. Moreover, convergence analysis of algorithm 2 in the denoising case is given in section 5, with the detailed proofs provided in the appendix.

5 Convergence Analysis of Algorithm 2

In this section some asymptotic and convergence analysis of algorithm 2 is provided in the denoising case, $C=I$ . The detailed proofs of our claims are given in the appendix.

Proposition 1.

For a general circulant matrix $T$ with eigenvalues $\{\gamma_{j}\}_{j=0}^{n-1}$ used for regularization, the fixed point iteration in Algorithm 2 can be represented by

[TABLE]

where

[TABLE]

where $B(\lambda)=(I+\lambda|\Lambda|^{2})^{-1}$ and $\hat{u}=\mathcal{F}\tilde{u}$ .

The following proposition gives further insight on the behavior of $f$ .

Proposition 2.

Suppose $\tilde{u}\neq 0$ . The fixed point iteration function $f$ defined in (30) satisfies:

For $T=I$ , $f(\lambda,\tilde{u},I)=\lambda$ and every real number is a fixed point. 2. 2.

Zero is a fixed point of $f$ . Moreover, zero is a stable fixed point if

[TABLE] 3. 3.

For $T=T_{r}$ with $r>0$ as in (21), the asymptotic behavior of $f$ as $\lambda\to\infty$ is given by

[TABLE]

and $4^{r}\sin^{2r}(\pi/n)/(n-1)\leq\kappa_{\infty}(\tilde{u},T_{r})\leq 4^{r}/(n-1)$ .

Figure 9 illustrates features of $f$ in light of Proposition 2. Here the input signal was generated by adding Gaussian noise to the piecewise constant function shown in the right frame. In this example, SNR $=5$ and second order ( $r=2$ ) regularization was used. Notice that $f(\cdot,\tilde{u},T_{r})$ has 3 fixed points: [math], $\lambda_{opt}\approx 16.5$ , and $\lambda_{\infty}\approx 1/\kappa_{\infty}(\tilde{u},T_{r})$ . As typically the case, the trivial fixed point $\lambda=0$ is unstable – a fact that can easily be verified using the second part of Proposition 2. A second unstable fixed point can be observed in the asymptotic regime $f(\lambda;\tilde{u},T_{r})\sim\kappa_{\infty}(\tilde{u},T_{r})y^{2}$ , since $1/\kappa_{\infty}(\tilde{u},T_{r})$ solves the equation $\kappa_{\infty}(\tilde{u},T_{r})\lambda^{2}=\lambda$ . Finally $\lambda_{opt}\approx 16.5$ is the only stable fixed point. The middle frame in Figure 9 shows that $f$ is a contraction in a neighborhood of this value. The reconstruction using $\lambda_{opt}$ is given in right frame of this figure.

We point out that there are instances in which $\lambda=0$ is a stable fixed point. According to Proposition 2, this happens when

[TABLE]

This scenario is illustrated in Figure 10 (left) with $r=1$ . In this example, SNR $=10$ and the target function is piecewise quadratic, making the ratio $\frac{\|T_{2}\tilde{u}\|^{2}}{\|T_{1}\tilde{u}\|^{2}}$ smaller than

[TABLE]

Notice that here we used the identity

[TABLE]

which holds for $n>2r$ by exactness of the trapezoidal quadrature rule. Given the large SNR and the inadequate prior in this case, our algorithm favors forfeiting regularization. This scenario is avoided when a better prior is specified. Figure 10 shows that, for the same data and $r=2$ , $\lambda_{opt}\approx 2.47$ is returned by the fixed point iteration. The corresponding denoised signal is presented in the right frame of Figure 10.

6 Conclusions

This article considers the question of choosing the weighting parameter for a regularization term in a least-squares problem. One way to proceed is to take a Bayesian perspective and express the problem as that of finding the maximum a posteriori (MAP) solution. Proceeding in this way introduces, as a second parameter, the variance of the Gaussian noise associated with the data of the least-squares term. Then, one way of choosing these two parameters is to maximize the probability of the data given the two parameters, which here it is called the maximum evidence (ME) value.

We demonstrate the effectiveness of ME for image reconstruction problems in several instances by comparison with UPRE. Whereas, the UPRE value assumes that the variance of the noise in the data is given, the ME value is obtained by determining the variance of the noise in the data automatically. Simple test examples demonstrate the accuracy of the ME estimate for the variance of the noise. The tests also show decidedly better reconstructions for ME than for UPRE for certain types of data and regularizations, though we do not make any general claims concerning the relative merits of the two approaches. The automatic choice of parameters is also tested for tomographic imaging and deconvolution with good results.

Also presented is an apparently novel iterative scheme, and its efficient implementation, for determining the ME values in the case of an $\ell_{2}$ regularization term. Empirical evidence is presented demonstrating both rapid convergence and insensitivity to an initial guess for the parameters. The general algorithm requires repeated computationally expensive estimates of matrix traces. At the same time, for many important inverse problems, including denoising, MRI, and deconvolution, the trace and corresponding iterative estimates of the parameters can be calculated analytically with very little computation. In this case, the algorithm is extremely efficient and practical.

Finally, it is shown how the Bayesian perspective facilitates a very reasonable choice for the weighing parameter in the case of $\ell_{1}$ regularization. Experimental results for both simple test examples and image deconvolution give reconstructions of high quality (and confirm the general superiority of $\ell_{1}$ regularization). In the case of deconvolution, it is demonstrated that these $\ell_{1}$ parameters are nearly optimal, and are determined at almost no notable computational expense to the $\ell_{1}$ solution. This observation significantly broadens the practical application of the $\ell_{2}$ parameter selection work developed here. Moreover, this approach may be trivially generalized to select parameters for the $\ell_{1}$ problem based on UPRE, whenever the noise level is known a priori.

Appendix A Extension of Algorithm 2 to 2D Image Denoising

In the case of 2D images $u\in\mathbb{R}^{n^{2}}$ , we consider regularizers that compute finite differences analogous to (21) along all rows and columns of the 2D image. We write the order $k$ operator as

[TABLE]

and we obtain the matrix $H=I+\lambda(T_{r,x}^{\mathsf{T}}T_{r,x}+T_{r,y}^{\mathsf{T}}T_{r,y})$ . The operators $T_{r,x}$ and $T_{r,y}$ are visualized in Figure 11 for the case $r=1$ and $n=4$ .

To determine the exact trace of $H^{-1}$ and the exact solution for (7) using only FFT’s, we will analytically determine the eigenvalues of $T_{r,x}^{\mathsf{T}}T_{r,x}+T_{r,y}^{\mathsf{T}}T_{r,y}$ . We first observe that these matrices have the following Kronecker product representation:

[TABLE]

where $T_{r}$ is the 1D version of the difference operators as written in (21) and $I$ is the $n\times n$ identity. This leads to the product representations as $T_{r,x}^{\mathsf{T}}T_{r,x}=T_{r}^{\mathsf{T}}T_{r}\otimes I$ and $T_{r,y}^{\mathsf{T}}T_{r,y}=I\otimes T_{r}^{\mathsf{T}}T_{r}$ . To find the solution to (7) in this case, combine the Kronecker representation with the unitary Fourier diagonalization of $T_{r}$ to obtain

[TABLE]

This leads to our expression for $u_{\lambda}$ in (7) in the case of 2D denoising as

[TABLE]

Observe that $\mathcal{F}\otimes\mathcal{F}$ and $\mathcal{F}^{-1}\otimes\mathcal{F}^{-1}$ are the 2D unitary discrete Fourier and inverse Fourier transforms, therefore analogous to the 1D case this solution requires two 2D FFTs and product with a diagonal matrix, whose values we determine in what follows.

Using properties of Kronecker products, it can be easily shown that for any two eigenvalues $\gamma_{1},\gamma_{2}$ of $T_{r}^{\mathsf{T}}T_{r}$ , then $\gamma_{1}+\gamma_{2}$ is an eigenvalue of $T_{r,x}^{\mathsf{T}}T_{r,x}+T_{r,y}^{\mathsf{T}}T_{r,y}$ , and therefore the complete set of eigenvalues of this matrix are obtained by considering all such combinations. Combining this observation with the eigenvalues in (24) leads us to the exact trace as

[TABLE]

and similarly the trace of $H^{-1}T^{\mathsf{T}}T$ is given by

[TABLE]

Finally, the entries of the diagonal matrix needed in evaluation of (34) coincide with the terms in the sum (35). The arguments used here are very easily extended to higher dimensions, say 3D video denoising. In addition, they also extend to other circulant regularization matrices, such as the more effective multiscale operators in [30].

Appendix B Extension of Algorithm 2 to Fourier Sampling

Consider our reconstruction problem in the case that the sampling matrix is $A=P\mathcal{F}$ , where $P$ is a row selector matrix, i.e. the identity with some rows deleted. In other words, we have some Fourier coefficients of $u$ , and let $S\subset\{1,2,\dots,n\}$ denote the indices of those rows of the identity that are in $P$ . Then using some similar arguments as before, the solution to (7) is given by

[TABLE]

Hence, again in this case one only needs two FFTs and a product with a diagonal matrix to obtain the exact solution. These entries are easily seen once again using (24). Moreover, the traces needed are given by

[TABLE]

where $\delta_{S}(j)=1$ for $j\in S$ and 0 otherwise. These concepts are extended to 2D and higher dimensions repeating similar arguments from section A.

Appendix C Proof of Convergence Results

proof of Proposition 1.

The equations of Algorithm 2, together with $T=\mathcal{F}^{-1}\Lambda\mathcal{F}$ , gives

[TABLE]

where $B(\lambda)=(I+\lambda|\Lambda|^{2})^{-1}$ . ∎

proof of Proposition 2.

The first part of the proposition follows directly by substituting $|\Lambda|^{2}=I$ into (30). We now focus on the two other parts. Again from (30) it is immediately deduced that 0 is a fixed point of $f$ . The stability of this fixed point depends on the derivative of $f$ , which is given by

[TABLE]

Finally, proving part 3 makes repeated use of the facts $\gamma_{0}=0$ for $r>0$ and $\lim_{\lambda\rightarrow\infty}\lambda B(\lambda)=|\Lambda_{r}|^{-2}$ , where here we use the minimium norm pseudo inverse. To this end, we evaluate the limit:

[TABLE]

It follows that

[TABLE]

or

[TABLE]

∎

Acknowledgements

This work is supported in part by the grants NSF-DMS 1502640 and AFOSR FA9550-15-1-0152.

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Z. Bai, G. Fahey, and G. Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics , 74(1-2):71–89, 1996.
2[2] S. Bhattacharya, T. Blumensath, B. Mulgrew, and M. Davies. Fast encoding of synthetic aperture radar raw data using compressed sensing. In 2007 IEEE/SP 14th Workshop on Statistical Signal Processing , pages 448–452. IEEE, 2007.
3[3] C. M. Bishop. Pattern recognition and machine learning. Springer-Verlag New York, 2006.
4[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning , 3(1):1–122, 2011.
5[5] K. Bredies, K. Kunisch, and T. Pock. Total generalized variation. SIAM J. Imaging Sci. , 3(3):492–526, 2010.
6[6] A. Buccini. Regularizing preconditioners by non-stationary iterated Tikhonov with general penalty term. Applied Numerical Mathematics , 116:64–81, 2017.
7[7] A. Buccini, M. Donatelli, and L. Reichel. Iterated Tikhonov regularization with a general penalty term. Numerical Linear Algebra with Applications , 24(4):e 2089, 2017.
8[8] D. Calvetti, S. Morigi, L. Reichel, and F. Sgallari. Tikhonov regularization and the L-curve for large discrete ill-posed problems. Journal of Computational and Applied Mathematics , 123(1):423–446, 2000.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Effective New Methods for Automated Parameter Selection in Regularized Inverse Problems

Abstract

1 Introduction

1.1 Previous work

2 Bayesian Formulation and Algorithm for Finding λ\lambdaλ

2.1 Iterative Algorithm for Finding the Noise and Signal Variances

Definition 1**.**

Lemma 1**.**

Proof.

Lemma 2**.**

Proof.

Theorem 1**.**

Proof.

2.2 Comparison with UPRE

2.3 Accuracy of σ\sigmaσ

2.4 2D Tomographic Example

3 Mapping Tikhonov Parameters onto L1 Parameters

4 Accelerated Iterations for Denoising and Deconvolution

5 Convergence Analysis of Algorithm 2

Proposition 1**.**

Proposition 2**.**

6 Conclusions

Appendix A Extension of Algorithm 2 to 2D Image Denoising

Appendix B Extension of Algorithm 2 to Fourier Sampling

Appendix C Proof of Convergence Results

proof of Proposition 1.

proof of Proposition 2.

Acknowledgements

2 Bayesian Formulation and Algorithm for Finding $\lambda$

Definition 1.

Lemma 1.

Lemma 2.

Theorem 1.

2.3 Accuracy of $\sigma$

Proposition 1.

Proposition 2.