Risk Estimators for Choosing Regularization Parameters in Ill-Posed Problems - Properties and Limitations

Felix Lucka; Katharina Proksch; Christoph Brune; Nicolai Bissantz; Martin Burger; Holger Dette; Frank Wübbeling

arXiv:1701.04970·math.ST·October 12, 2017

TL;DR

This paper analyzes the effectiveness of risk estimators like SURE and GSURE for selecting regularization parameters in ill-posed problems, revealing limitations especially as ill-posedness increases, and shows they can lead to unreliable regularization choices.

Contribution

The paper provides a theoretical and numerical study of risk estimators' properties in ill-posed problems, highlighting their limitations and potential failures in high ill-posedness scenarios.

Findings

01

GSURE risk estimator deteriorates asymptotically for ill-posed problems

02

Risk estimators often suggest overly small regularization parameters

03

Unbiased risk estimation may not be reliable for ill-posed inverse problems

Abstract

This paper discusses the properties of certain risk estimators recently proposed to choose regularization parameters in ill-posed problems. A simple approach is Stein's unbiased risk estimator (SURE), which estimates the risk in the data space, while a recent modification (GSURE) estimates the risk in the space of the unknown variable. It seems intuitive that the latter is more appropriate for ill-posed problems, since the properties in the data space do not tell much about the quality of the reconstruction. We provide theoretical studies of both estimators for linear Tikhonov regularization in a finite dimensional setting and estimate the quality of the risk estimators, which also leads to asymptotic convergence results as the dimension of the problem tends to infinity. Unlike previous papers, which studied image processing problems with a very low degree of ill-posedness, we are…

Tables

Table 1: Condition number of $A_{l}$ computed for different values of $m=n$ and $l$.

	$l = 0.02$	$l = 0.04$	$l = 0.06$	$l = 0.08$	$l = 0.1$
$m = 16$	1.27e+0	1.75e+0	2.79e+0	6.77e+0	2.31e+2
$m = 32$	1.75e+0	6.77e+0	6.94e+1	6.88e+2	2.30e+2
$m = 64$	6.77e+0	6.88e+2	6.42e+2	1.51e+3	4.22e+3
$m = 128$	6.88e+2	1.51e+3	1.51e+4	4.29e+3	4.29e+4
$m = 256$	1.70e+3	4.70e+4	1.87e+6	4.07e+6	1.79e+6
$m = 512$	4.70e+4	1.11e+7	1.22e+7	2.12e+7	3.70e+7

Table 2: Statistics of the $\ell_{2}$-error $\|x^{*}-x_{\hat{\alpha}}\|_{2}$ for different parameter choice rules using $m=n=64$, $l=0.06$, $\sigma=0.1$ and $N_{\varepsilon}=10^{6}$ samples of $\varepsilon$.

	min	max	mean	median	std
optimal	4.78	9.63	8.04	8.05	0.43
DP	6.57	10.81	8.82	8.87	0.34
PSURE	6.10	277.24	8.38	8.23	1.53
SURE	6.08	339.80	27.71	8.95	37.26

Equations

$$y = A x^{*} + \varepsilon,$$

$$\hat{x}_{\alpha}(y) = \operatorname*{argmin}_{x\in\mathbb{R}^{n}} \tfrac{1}{2}\|Ax-y\|_{2}^{2} + \alpha R(x),$$

$$R(x) = \tfrac{1}{2}\|x\|_{2}^{2},$$

$$\hat{x}_{\alpha}(y) = T_{\alpha}y := (A^{*}A+\alpha I)^{-1}A^{*}y.$$

$$\alpha^{*} := \operatorname*{argmin}_{\alpha\geqslant 0} \|\hat{x}_{\alpha}(y)-x^{*}\|_{2}^{2}$$

$$\|A\hat{x}_{\alpha}(y)-y\|_{2}^{2} = m\sigma^{2}.$$

$$\operatorname{MSPE}(\alpha) := \mathbb{E}\big[\|A(x^{*}-\hat{x}_{\alpha}(y))\|_{2}^{2}\big]$$

$$\hat{\alpha}_{\operatorname{PSURE}} \in \operatorname*{argmin}_{\alpha\geqslant 0} \operatorname{PSURE}(\alpha,y) := \operatorname*{argmin}_{\alpha\geqslant 0} \|y-A\hat{x}_{\alpha}(y)\|_{2}^{2} - m\sigma^{2} + 2\sigma^{2}\operatorname{df}_{\alpha}(y)$$

$$\operatorname{df}_{\alpha}(y) = \operatorname{tr}\big(\nabla_{y}\cdot A\hat{x}_{\alpha}(y)\big).$$

$$\operatorname{MSEE}(\alpha) := \mathbb{E}\big[\|\Pi(x^{*}-\hat{x}_{\alpha}(y))\|_{2}^{2}\big],$$

$$\hat{\alpha}_{\operatorname{SURE}} \in \operatorname*{argmin}_{\alpha\geqslant 0} \operatorname{SURE}(\alpha,y) := \operatorname*{argmin}_{\alpha\geqslant 0} \|x_{ML}(y)-\hat{x}_{\alpha}(y)\|_{2}^{2} - \sigma^{2}\operatorname{tr}\big((AA^{*})^{+}\big) + 2\sigma^{2}\operatorname{gdf}_{\alpha}(y)$$

$$\operatorname{gdf}_{\alpha}(y) = \operatorname{tr}\big((AA^{*})^{+}\nabla_{y}A\hat{x}_{\alpha}(y)\big), \qquad x_{ML} = A^{+}y = A^{*}(AA^{*})^{+}y.$$

$$A = U\Sigma V^{*}, \quad \Sigma = \operatorname{diag}(\gamma_{1},\dots,\gamma_{q}) \in \mathbb{R}^{m\times n}, \quad \gamma_{1}\geq\dots\geq\gamma_{r}>0, \quad \gamma_{r+1},\dots,\gamma_{m} := 0$$

$$U = (u_{1},\dots,u_{m}) \in \mathbb{R}^{m\times m}, \quad V = (v_{1},\dots,v_{n}) \in \mathbb{R}^{n\times n} \text{ unitary}.$$

$$y_{i} = \langle u_{i},y\rangle, \quad x_{i}^{*} = \langle v_{i},x^{*}\rangle, \quad \tilde{\varepsilon}_{i} = \langle u_{i},\varepsilon\rangle$$

$$y_{i} = \gamma_{i}x_{i}^{*} + \tilde{\varepsilon}_{i},\ i=1,\dots,q; \qquad y_{i} = \tilde{\varepsilon}_{i},\ i=q+1,\dots,m,$$

$$x_{ML}$$

$$\hat{x}_{\alpha}$$

$$\|\hat{x}_{\alpha}\|_{2}^{2}$$

$$\|A\hat{x}_{\alpha}-y\|_{2}^{2}$$

$$\|x_{ML}-\hat{x}_{\alpha}\|_{2}^{2} = \sum_{i=1}^{r}\left(\frac{1}{\gamma_{i}}-\frac{\gamma_{i}}{\gamma_{i}^{2}+\alpha}\right)^{2}y_{i}^{2}.$$

$$(AA^{*})^{+}$$

$$A^{*}(AA^{*})^{+}A$$

$$\operatorname{df}_{\alpha}$$

$$\operatorname{gdf}_{\alpha} = \operatorname{tr}\big((\Sigma\Sigma^{*})^{+}\Sigma\Sigma_{\alpha}^{-1}\big) = \sum_{i=1}^{r}\frac{1}{\gamma_{i}^{2}}\,\gamma_{i}\,\frac{\gamma_{i}}{\gamma_{i}^{2}+\alpha} = \sum_{i=1}^{r}\frac{1}{\gamma_{i}^{2}+\alpha}.$$

$$\operatorname{DP}(\alpha,y) := \sum_{i=1}^{m}\frac{\alpha^{2}}{(\gamma_{i}^{2}+\alpha)^{2}}\,y_{i}^{2} - m\sigma^{2},$$

$$\operatorname{PSURE}(\alpha,y)$$

$$\operatorname{SURE}(\alpha,y)$$

$$y_{\infty}(s) = A_{\infty,l}\,x_{\infty}^{*} := \int_{-\frac{1}{2}}^{\frac{1}{2}} k_{l}(s-t)\,x_{\infty}^{*}(t)\,dt, \quad s\in[-1/2,1/2],$$

$$k_{l}(t) := \frac{1}{N_{l}}\begin{cases}\exp\left(-\frac{1}{1-t^{2}/l^{2}}\right) & \text{if } |t|<l,\\ 0 & \text{if } l\leq|t|\leq 1/2,\end{cases} \qquad N_{l} = \int_{-l}^{l}\exp\left(-\frac{1}{1-t^{2}/l^{2}}\right)dt,$$

$$x_{\infty}^{*}(t) := \sum_{i=1}^{4}a_{i}\,\delta\big(b_{i}-\tfrac{1}{2}\big), \quad \text{with } a=[0.5,\,1,\,0.8,\,0.5],\ b=\Big[\tfrac{1}{\sqrt{26}},\,\tfrac{1}{\sqrt{11}},\,\tfrac{1}{\sqrt{3}},\,\tfrac{1}{\sqrt{3/2}}\Big].$$

$$E_{i}^{n} := \left[\frac{i-1}{n}-\frac{1}{2},\ \frac{i}{n}-\frac{1}{2}\right], \quad i=1,\dots,n$$

Full text

Risk Estimators for Choosing Regularization Parameters in Ill-Posed Problems - Properties and Limitations

Felix Lucka, Centre for Medical Image Computing, University College London, WC1E 6BT London, UK

Katharina Proksch, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen, Goldschmidtstrasse 7, 37077 Göttingen, Germany

Christoph Brune, Department of Applied Mathematics, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands

Nicolai Bissantz, Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Martin Burger, Institut für Numerische und Angewandte Mathematik, Westfälische Wilhelms-Universität (WWU) Münster, Einsteinstr. 62, D 48149 Münster, Germany

Holger Dette, Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Frank Wübbeling, Institut für Numerische und Angewandte Mathematik, Westfälische Wilhelms-Universität (WWU) Münster, Einsteinstr. 62, D 48149 Münster, Germany

Abstract

This paper discusses the properties of certain risk estimators that recently regained popularity for choosing regularization parameters in ill-posed problems, in particular for sparsity regularization. They apply Stein’s unbiased risk estimator (SURE) to estimate the risk in either the space of the unknown variables or in the data space, which we call PSURE in order to distinguish the two different risk functions. It seems intuitive that SURE is more appropriate for ill-posed problems, since the properties in the data space do not tell much about the quality of the reconstruction. We provide theoretical studies of both approaches for linear Tikhonov regularization in a finite dimensional setting and estimate the quality of the risk estimators, which also leads to asymptotic convergence results as the dimension of the problem tends to infinity. Unlike previous works which studied single realizations of image processing problems with a very low degree of ill-posedness, we are interested in the statistical behaviour of the risk estimators for increasing ill-posedness. Interestingly, our theoretical results indicate that the quality of the SURE risk can deteriorate asymptotically for ill-posed problems, which is confirmed by an extensive numerical study. The latter shows that in many cases the SURE estimator leads to extremely small regularization parameters, which obviously cannot stabilize the reconstruction. Similar but less severe issues with respect to robustness also appear for the PSURE estimator, which in comparison to the rather conservative discrepancy principle leads to the conclusion that regularization parameter choice based on unbiased risk estimation is not a reliable procedure for ill-posed problems. A similar numerical study for sparsity regularization demonstrates that the same issue appears in non-linear variational regularization approaches.

**Keywords:** Ill-posed problems, regularization parameter choice, risk estimators, Stein’s method, discrepancy principle.

1 Introduction

Choosing suitable regularization parameters is a problem as old as regularization theory, and it has seen a variety of approaches from deterministic perspectives (e.g. L-curve criteria, [23, 22]), from statistical perspectives (e.g. Lepskij principles, [3, 26]), and from perspectives in between (e.g. discrepancy principles motivated by deterministic bounds or noise variance, cf. [38, 4]). While the particular class of statistical parameter choice rules based on unbiased risk estimation (URE) was used for linear inverse reconstruction techniques early on [35, 37, 17], there is a renewed interest in these approaches for iterative, non-linear inverse reconstruction techniques, in particular in the context of sparsity constraints (see e.g. [33, 15, 42, 19, 30, 43, 12, 13, 14, 40, 44, 11, 39]). These works are based on extending Stein’s general construction of an unbiased risk estimator [36] to the inverse problems setting. Compared to approaches that measure the risk in the data space, the classical SURE and a generalized version (GSURE, [15, 19, 13, 40]) measure the risk in the space of the unknown, which seems more appropriate for ill-posed problems. Previous investigations show that the performance of such parameter choice rules is reasonable in many different settings (cf. [21, 45, 9, 2, 34, 31, 15]). However, most of the problems considered in these works are very mildly ill-posed (which we will define more precisely below), the interplay between ill-posedness and the performance of the risk estimators is not studied explicitly, and the inherent statistical nature of the selected regularization parameter is ignored, as typically only single realizations of the noise are considered.

Therefore, a first motivation of this paper is to further study the properties of SURE in Tikhonov-type regularization methods from a statistical perspective and systematically in dependence of the ill-posedness of the problem. For this purpose we provide a theoretical analysis of the quality of unbiased risk estimators in the case of linear Tikhonov regularization. In addition, we carry out extensive numerical investigations on appropriate model problems. While in very mildly ill-posed settings the performances of the parameter choice rules under consideration are reasonable and comparable, our investigations yield various interesting results and insights in ill-posed settings. For instance, we demonstrate that SURE shows a rather erratic behaviour as the degree of ill-posedness increases. The observed effects are so strong that the meaning of a parameter chosen according to this particular criterion is unclear.

A second motivation of this paper is to study the discrepancy principle as a reference method; as we shall see, it can indeed be put in a very similar context and analysed by the same techniques. Although the popularity of the discrepancy principle has recently been decreasing in favour of choices using more statistical details, our findings show that it is still more robust for ill-posed problems than risk-based parameter choices. The conservative choice by the discrepancy principle is well known to overestimate the optimal parameter, but on the other hand it avoids choosing too small a regularization parameter, as risk-based methods often do. In the latter case the reconstruction results deteriorate completely, while the discrepancy principle yields a reliable, though not optimal, reconstruction.

Formal Introduction

We consider a discrete inverse problem of the form

$$y = A x^{*} + \varepsilon, \tag{1}$$

where $y\in\mathbb{R}^{m}$ is a vector of observations, $A\in\mathbb{R}^{m\times n}$ is a known matrix, and $\varepsilon\in\mathbb{R}^{m}$ is a noise vector. We assume that $\varepsilon$ consists of independent and identically distributed (i.i.d.) Gaussian errors, i.e., $\varepsilon\sim\mathcal{N}(0,\sigma^{2}I_{m})$. The vector $x^{*}\in\mathbb{R}^{n}$ denotes the (unknown) exact solution to be reconstructed from the observations. There are two potential difficulties for this: If $A$ has a non-trivial kernel, e.g. for $n>m$, we simply cannot observe certain aspects of $x^{*}$ and regularization has to interpolate them from the observed features in some way. This, however, is typically not the ill-posedness we are interested in: in practice we know what we miss, and we consider these problems only "mildly ill-posed". The second difficulty is more subtle: The singular values of $A$ might decay very fast, which means that certain aspects of $x^{*}$ are barely measurable and even small additional noise $\varepsilon$ can render their recovery unstable. This is the main difficulty we are interested in here, so we will measure the degree of ill-posedness of (1) by the condition of $A$ restricted to its co-kernel, i.e. the ratio between largest and smallest non-zero singular value. Note that this definition deviates from the classical definition of ill-posedness for continuous problems by Hadamard [20], which leads to a binary classification of problems as either well- or ill-posed but is not very useful for practical applications. In order to find an estimate $\hat{x}(y)$ of $x^{*}$ from (1), we apply a variational regularization method:

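$$\hat{x}_{\alpha}(y) = \operatorname*{argmin}_{x\in\mathbb{R}^{n}} \tfrac{1}{2}\|Ax-y\|_{2}^{2} + \alpha R(x),$$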

where $R$ is assumed convex and such that the minimizer is unique for positive regularization parameter $\alpha>0$ . In what follows the dependence of $\hat{x}_{\alpha}(y)$ on $\alpha$ and the data $y$ may be dropped where it is clear without ambiguity that $\hat{x}=\hat{x}_{\alpha}(y).$

In practice there are two choices to be made: First, a regularization functional $R$ needs to be specified in order to appropriately represent a-priori knowledge about solutions and second, a regularization parameter $\alpha$ needs to be chosen in dependence of the data $y$ . The ideal parameter choice would minimize a difference between $\hat{x}_{\alpha}(y)$ and $x^{*}$ over all $\alpha$ , which obviously cannot be computed and is hence replaced by a parameter choice rule that tries to minimize a worst-case or average error to the unknown solution, which can be referred to as a risk minimization. In the practical case of having a single observation only, the risk based on average error needs to be replaced by an estimate as well, and unbiased risk estimators that will be detailed in the following are a natural choice.

For the sake of a clearer presentation of methods and results we first focus on linear Tikhonov regularization, i.e.,

$$R(x) = \tfrac{1}{2}\|x\|_{2}^{2},$$

leading to the explicit Tikhonov estimator

$$\hat{x}_{\alpha}(y) = T_{\alpha}y := (A^{*}A+\alpha I)^{-1}A^{*}y.$$
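The explicit Tikhonov estimator $(A^{*}A+\alpha I)^{-1}A^{*}y$ can be evaluated either by a linear solve or through the singular value decomposition of $A$. The following is a minimal sketch, assuming real-valued data and NumPy; the function names and the random test problem are illustrative and not taken from the paper.

```python
import numpy as np

def tikhonov_normal_equations(A, y, alpha):
    """x_alpha = (A^T A + alpha I)^{-1} A^T y via a linear solve."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y)

def tikhonov_svd(A, y, alpha):
    """Same estimator via the filter factors gamma_i / (gamma_i^2 + alpha)."""
    U, gamma, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = gamma / (gamma**2 + alpha) * (U.T @ y)
    return Vt.T @ coeffs

# Hypothetical small test problem (not one of the paper's examples).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
x1 = tikhonov_normal_equations(A, y, alpha=0.1)
x2 = tikhonov_svd(A, y, alpha=0.1)
```

The SVD variant is the more convenient one when many values of $\alpha$ have to be scanned, because the decomposition is computed once and only the filter factors change.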

In this setting, a natural distance for measuring the error of $\hat{x}_{\alpha}(y)$ is given by its $\ell_{2}$ -distance to $x^{*}$ . Thus, we define

$$\alpha^{*} := \operatorname*{argmin}_{\alpha\geqslant 0} \|\hat{x}_{\alpha}(y)-x^{*}\|_{2}^{2} \tag{4}$$

as the optimal, but inaccessible, regularization parameter. Many different rules for the choice of the regularization parameter $\alpha$ are discussed in the literature. Here, we focus on strategies that rely on an accurate estimate of the noise variance $\sigma^{2}$ . A classical example of such a rule is given by the discrepancy principle: The regularization parameter $\hat{\alpha}_{\operatorname*{DP}}$ is given as the solution of the equation

$$\|A\hat{x}_{\alpha}(y)-y\|_{2}^{2} = m\sigma^{2}. \tag{5}$$

The discrepancy principle is robust and easy-to-implement for many applications (cf. [5, 24, 32]) and is based on the heuristic argument, that $\hat{x}_{\alpha}(y)$ should only explain the data $y$ up to the noise level.
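For Tikhonov regularization the squared residual $\|A\hat{x}_{\alpha}(y)-y\|_{2}^{2}$ increases monotonically in $\alpha$, so the discrepancy equation can be solved by bisection. The sketch below works in the singular basis, where the residual equals $\sum_{i}\alpha^{2}y_{i}^{2}/(\gamma_{i}^{2}+\alpha)^{2}$. The singular values `gamma`, the coefficients `x_spec` and the search bracket are hypothetical choices made only for this illustration.

```python
import numpy as np

def dp_functional(alpha, gamma, y_spec, sigma):
    """Squared Tikhonov residual in the singular basis minus m * sigma^2."""
    weights = (alpha / (gamma**2 + alpha)) ** 2
    return np.sum(weights * y_spec**2) - y_spec.size * sigma**2

def alpha_discrepancy(gamma, y_spec, sigma, lo=1e-16, hi=1e16, iters=200):
    """Solve dp_functional(alpha) = 0 by bisection on log(alpha)."""
    if (dp_functional(lo, gamma, y_spec, sigma) > 0
            or dp_functional(hi, gamma, y_spec, sigma) < 0):
        raise ValueError("no sign change on [lo, hi]")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint
        if dp_functional(mid, gamma, y_spec, sigma) > 0:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

# Hypothetical spectral test problem: geometrically decaying singular values.
rng = np.random.default_rng(0)
m, sigma = 64, 0.1
gamma = 0.8 ** np.arange(m)
x_spec = np.ones(m)
y_spec = gamma * x_spec + sigma * rng.standard_normal(m)
alpha_dp = alpha_discrepancy(gamma, y_spec, sigma)
```

If the data norm is below the noise level, no root exists on the bracket and the sketch raises an error instead of returning a parameter.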

The broader class of unbiased risk estimators accounts for the stochastic nature of $\varepsilon$ by aiming to choose $\alpha$ such that it minimizes certain $\ell_{2}$-errors between $\hat{x}_{\alpha}(y)$ and $x^{*}$ only in expectation: We first define the mean squared prediction error (MSPE) as

$$\operatorname{MSPE}(\alpha) := \mathbb{E}\big[\|A(x^{*}-\hat{x}_{\alpha}(y))\|_{2}^{2}\big] \tag{6}$$

and refer to its minimizer as $\hat{\alpha}_{\operatorname*{{MSPE}}}$ . Since $\operatorname*{{MSPE}}$ depends on the unknown vector $x^{*}$ , we have to replace it by an unbiased estimate we will call $\operatorname*{{PSURE}}$ here and define:

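$$\hat{\alpha}_{\operatorname{PSURE}} \in \operatorname*{argmin}_{\alpha\geqslant 0} \operatorname{PSURE}(\alpha,y) := \operatorname*{argmin}_{\alpha\geqslant 0} \|y-A\hat{x}_{\alpha}(y)\|_{2}^{2} - m\sigma^{2} + 2\sigma^{2}\operatorname{df}_{\alpha}(y) \tag{7}$$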

with

$$\operatorname{df}_{\alpha}(y) = \operatorname{tr}\big(\nabla_{y}\cdot A\hat{x}_{\alpha}(y)\big).$$
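For the linear Tikhonov estimator the unbiasedness of PSURE can be checked directly. The derivation below is a sketch added for illustration. It assumes $\hat{x}_{\alpha}(y)=T_{\alpha}y$ with $T_{\alpha}=(A^{*}A+\alpha I)^{-1}A^{*}$, in which case the degrees of freedom reduce to $\operatorname{tr}(AT_{\alpha})$, and it uses only $\varepsilon\sim\mathcal{N}(0,\sigma^{2}I_{m})$.

```latex
% Assumption: \hat{x}_\alpha(y) = T_\alpha y with T_\alpha = (A^*A + \alpha I)^{-1} A^*.
\begin{align*}
y - A\hat{x}_\alpha(y) &= A\bigl(x^* - \hat{x}_\alpha(y)\bigr) + \varepsilon, \\
\mathbb{E}\,\|y - A\hat{x}_\alpha(y)\|_2^2
  &= \operatorname{MSPE}(\alpha) + m\sigma^2
     + 2\,\mathbb{E}\,\bigl\langle A x^* - A T_\alpha (A x^* + \varepsilon),\, \varepsilon \bigr\rangle \\
  &= \operatorname{MSPE}(\alpha) + m\sigma^2 - 2\sigma^2 \operatorname{tr}(A T_\alpha), \\
\mathbb{E}\,\operatorname{PSURE}(\alpha, y)
  &= \mathbb{E}\,\|y - A\hat{x}_\alpha(y)\|_2^2 - m\sigma^2 + 2\sigma^2 \operatorname{tr}(A T_\alpha)
   = \operatorname{MSPE}(\alpha).
\end{align*}
```

The cross term is evaluated with $\mathbb{E}[\varepsilon]=0$ and $\mathbb{E}\langle AT_{\alpha}\varepsilon,\varepsilon\rangle=\sigma^{2}\operatorname{tr}(AT_{\alpha})$.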

While the classical SURE estimator would try to estimate the expectation of the simple $\ell_{2}$-error between $\hat{x}_{\alpha}(y)$ and $x^{*}$ like in (4), a generalization [15, 19] is often considered in inverse problems where $A$ may have a non-trivial kernel: We define the mean squared estimation error (MSEE) here as

$$\operatorname{MSEE}(\alpha) := \mathbb{E}\big[\|\Pi(x^{*}-\hat{x}_{\alpha}(y))\|_{2}^{2}\big],$$

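where $\Pi:=A^{+}A$ denotes the orthogonal projector onto the range of $A^{*}$ ($M^{+}$ denotes the pseudoinverse of $M$), and refer to the minimizer of $\operatorname*{{MSEE}}(\alpha)$ as $\hat{\alpha}_{\operatorname*{{SURE}}}^{*}$. Again, we replace $\operatorname*{{MSEE}}$ by an unbiased estimator to obtain

$$\hat{\alpha}_{\operatorname{SURE}} \in \operatorname*{argmin}_{\alpha\geqslant 0} \operatorname{SURE}(\alpha,y) := \operatorname*{argmin}_{\alpha\geqslant 0} \|x_{ML}(y)-\hat{x}_{\alpha}(y)\|_{2}^{2} - \sigma^{2}\operatorname{tr}\big((AA^{*})^{+}\big) + 2\sigma^{2}\operatorname{gdf}_{\alpha}(y) \tag{9}$$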

with

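$$\operatorname{gdf}_{\alpha}(y) = \operatorname{tr}\big((AA^{*})^{+}\nabla_{y}A\hat{x}_{\alpha}(y)\big), \qquad x_{ML} = A^{+}y = A^{*}(AA^{*})^{+}y.$$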

If $A$ is non-singular, as it will be in the theoretical analysis and numerical experiments in this work, the above definition coincides with the classical one considered by Stein [36].

Note that the main difference between the two risk functions MSPE and MSEE and their corresponding estimators PSURE and SURE is that they measure in image and domain of the ill-conditioned operator $A$ , respectively. The second important observation here is that all parameter choice rules depend on the data $y$ and hence on the random errors $\varepsilon_{1},\ldots,\varepsilon_{m}$ . Therefore, $\hat{\alpha}_{\operatorname*{DP}}$ , $\hat{\alpha}_{\operatorname*{{PSURE}}}$ and $\hat{\alpha}_{\operatorname*{{SURE}}}$ are random variables, described in terms of their probability distributions. In the next section, we first investigate these distributions by a numerical simulation study in a simple inverse problem scenario using quadratic Tikhonov regularization. The results point to several problems of the presented parameter choice rules, in particular of SURE, and motivate our further theoretical investigation in Section 3. The theoretical results will be illustrated and supplemented by an exhaustive numerical study in Section 4. Finally we extend the numerical investigation in Section 5 to a sparsity-promoting LASSO-type regularization, for which we find a similar behaviour. Conclusions are given in Section 6.

2 Risk Estimators for Quadratic Regularization

In the following we discuss the setup in the case of the simple quadratic regularization functional $R(x)=\frac{1}{2}\|x\|^{2}$ , i.e. we recover the well-known linear Tikhonov regularization scheme. The linearity can be used to simplify arguments and gain analytical insight in the next section. While the arguments presented can easily be extended to more general quadratic regularizations, this model already contains all important properties.

2.1 Singular System and Risk Representations

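Considering a quadratic regularization allows us to analyze $\hat{x}_{\alpha}$ in a singular system of $A$ in a convenient way. Let $r=\operatorname*{rank}(A)$, $q=\min(n,m)$. Let

$$A = U\Sigma V^{*}, \quad \Sigma = \operatorname{diag}(\gamma_{1},\dots,\gamma_{q}) \in \mathbb{R}^{m\times n}, \quad \gamma_{1}\geq\dots\geq\gamma_{r}>0, \quad \gamma_{r+1},\dots,\gamma_{m} := 0$$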

denote a singular value decomposition of $A$ with

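$$U = (u_{1},\dots,u_{m}) \in \mathbb{R}^{m\times m}, \quad V = (v_{1},\dots,v_{n}) \in \mathbb{R}^{n\times n} \text{ unitary}.$$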

Defining

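$$y_{i} = \langle u_{i},y\rangle, \quad x_{i}^{*} = \langle v_{i},x^{*}\rangle, \quad \tilde{\varepsilon}_{i} = \langle u_{i},\varepsilon\rangle$$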

we can rewrite model (1) in its spectral form

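$$y_{i} = \gamma_{i}x_{i}^{*} + \tilde{\varepsilon}_{i},\ i=1,\dots,q; \qquad y_{i} = \tilde{\varepsilon}_{i},\ i=q+1,\dots,m,$$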

where $\tilde{\varepsilon}_{1},\ldots,\tilde{\varepsilon}_{m}$ are still i.i.d. $\sim\mathcal{N}(0,\sigma^{2})$. All quantities considered in the following depend on $n$ or $m$. In particular, we have $A=A_{n,m}$, $y=y_{m}$, $x^{*}=x^{*}_{n}$, $\gamma_{i}=\gamma_{i,n,m}$, $x_{i}^{*}=x_{i,n,m}^{*}$ and $\tilde{\varepsilon}_{i}=\tilde{\varepsilon}_{i,n,m}$. This dependence is made explicit in the statements of the results and technical assumptions for clarity but is dropped in the main text for ease of notation. Increasing $m$ corresponds to sampling from an equation such as (1) more finely, whereas an increase in $n$ increases the level of discretization of an operator $A_{\infty}$ (see Section 2.2). In our asymptotic considerations both $n$ and $m$ tend to infinity.

We will express some more terms in the singular system that are frequently used throughout this paper. In particular, we have for $x_{ML}$ , the regularized solution $\hat{x}_{\alpha}$ (dropping the dependence on $y$ below for notational simplicity) and its norm

[TABLE]

as well as the residual and distance to the maximum likelihood estimate

[TABLE]

Based on the generalized inverse we compute

[TABLE]

which yields the degrees of freedom and the generalized degrees of freedom

[TABLE]

Next, we derive the spectral representations of the parameter choice rules. For the discrepancy principle, we use (13) to define

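$$\operatorname{DP}(\alpha,y) := \sum_{i=1}^{m}\frac{\alpha^{2}}{(\gamma_{i}^{2}+\alpha)^{2}}\,y_{i}^{2} - m\sigma^{2}, \tag{14}$$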

and now, (5) can be restated as $\operatorname*{DP}(\hat{\alpha}_{\operatorname*{DP}},y)=0$ . For (7) and (9), we find

[TABLE]
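For reference, the spectral forms that follow from the definitions of $\hat{x}_{\alpha}$, $x_{ML}$, PSURE and SURE can be written out as below. This is a reconstruction from those definitions and the singular system of $A$, offered as a sketch rather than as a quotation of the displayed equations; in particular, the summation ranges assume $\gamma_{i}=0$ for $i>r$.

```latex
\begin{align*}
x_{ML} &= \sum_{i=1}^{r} \frac{y_i}{\gamma_i}\, v_i, \qquad
\hat{x}_\alpha = \sum_{i=1}^{r} \frac{\gamma_i}{\gamma_i^2 + \alpha}\, y_i\, v_i, \\
\|A\hat{x}_\alpha - y\|_2^2 &= \sum_{i=1}^{m} \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2, \qquad
\operatorname{df}_\alpha = \sum_{i=1}^{r} \frac{\gamma_i^2}{\gamma_i^2 + \alpha}, \\
\operatorname{PSURE}(\alpha, y) &= \sum_{i=1}^{m} \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2
  - m\sigma^2 + 2\sigma^2 \sum_{i=1}^{r} \frac{\gamma_i^2}{\gamma_i^2 + \alpha}, \\
\operatorname{SURE}(\alpha, y) &= \sum_{i=1}^{r} \Bigl(\frac{1}{\gamma_i} - \frac{\gamma_i}{\gamma_i^2 + \alpha}\Bigr)^{2} y_i^2
  - \sigma^2 \sum_{i=1}^{r} \frac{1}{\gamma_i^2} + 2\sigma^2 \sum_{i=1}^{r} \frac{1}{\gamma_i^2 + \alpha}.
\end{align*}
```

The factor $1/\gamma_{i}$ in the first sum of SURE shows how small singular values amplify the noise in $y_{i}$, whereas every weight in PSURE is bounded by one.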

2.2 An Illustrative Example

We consider a simple imaging scenario which exhibits typical properties of inverse problems. The unknown function $x_{\infty}^{*}:[-1/2,1/2]\rightarrow\mathbb{R}$ is mapped to a function $y_{\infty}:[-1/2,1/2]\rightarrow\mathbb{R}$ by a periodic convolution with a compactly supported kernel of width $l\leq 1/2$ :

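$$y_{\infty}(s) = A_{\infty,l}\,x_{\infty}^{*} := \int_{-\frac{1}{2}}^{\frac{1}{2}} k_{l}(s-t)\,x_{\infty}^{*}(t)\,dt, \quad s\in[-1/2,1/2],$$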

where the 1-periodic $C_{0}^{\infty}(\mathbb{R})$ function $k_{l}(t)$ is defined for $|t|\leq 1/2$ by

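$$k_{l}(t) := \frac{1}{N_{l}}\begin{cases}\exp\left(-\frac{1}{1-t^{2}/l^{2}}\right) & \text{if } |t|<l,\\ 0 & \text{if } l\leq|t|\leq 1/2,\end{cases} \qquad N_{l} = \int_{-l}^{l}\exp\left(-\frac{1}{1-t^{2}/l^{2}}\right)dt,$$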

and continued periodically for $|t|>1/2$ . Examples of $k_{l}(t)$ are plotted in Figure 1. The normalization ensures that $A_{\infty,l}$ and suitable discretizations thereof have the spectral radius $\gamma_{1}=1$ which simplifies our derivations and the corresponding illustrations. The $x_{\infty}^{*}$ used in the numerical examples is the sum of four delta distributions:

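$$x_{\infty}^{*}(t) := \sum_{i=1}^{4}a_{i}\,\delta\big(b_{i}-\tfrac{1}{2}\big), \quad \text{with } a=[0.5,\,1,\,0.8,\,0.5],\ b=\Big[\tfrac{1}{\sqrt{26}},\,\tfrac{1}{\sqrt{11}},\,\tfrac{1}{\sqrt{3}},\,\tfrac{1}{\sqrt{3/2}}\Big].$$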

The locations of the delta distributions approximate $[-0.3,-0.2,0.1,0.3]$ by irrational numbers which will simplify the discretization of this continuous problem.

Discretization

For a given number $n\in\mathbb{N}$ , let

$$E_{i}^{n} := \left[\frac{i-1}{n}-\frac{1}{2},\ \frac{i}{n}-\frac{1}{2}\right], \quad i=1,\dots,n$$

denote the equidistant partition of $[-1/2,1/2]$ and $\psi_{i}^{n}(t)=\sqrt{n}\,\mathds{1}_{E^{n}_{i}}(t)$ an orthonormal basis (ONB) of piecewise constant functions over that partition. If we use $m$ and $n$ degrees of freedom to discretize range and domain of $A_{\infty,l}$ , respectively, we arrive at the discrete inverse problem (1) with

[TABLE]

The two dimensional integration in (17) is computed by the trapezoidal rule with equidistant spacing, employing $100\times 100$ points to partition $E^{m}_{i}\times E^{n}_{i}$ . Note that we drop the subscript $l$ from $A_{l}$ whenever the dependence on this parameter is not of importance for the argument being carried out.

As the convolution kernel $k_{l}$ has mass $1$ and the discretization was designed to be mass-preserving, we have $\gamma_{1}=1$ and the condition number of $A$ is given by $\operatorname*{cond}(A)=1/\gamma_{r}$ , where $r=\text{rank}(A)$ . Figure 2 shows the decay of the singular values for various parameter settings and Table 1 lists the corresponding condition numbers: From this, we can see that the degree of ill-posedness of solving (1) measured in terms of the rate of decay of the singular values and the condition number grows very fast with increasing $m$ and $l$ . It is easy to show that in the infinite dimensional setting, the rate of decay would be exponentially fast.
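How the condition number reacts to the kernel width can be explored with a much cruder discretization than the one described above. The sketch below samples the periodic kernel at cell midpoints and normalizes each row to sum to one, which is a simplifying assumption: it is not the discretization (17), so the resulting condition numbers will generally differ from those in Table 1. All function names are illustrative.

```python
import numpy as np

def kernel(t, l):
    """Unnormalized bump exp(-1 / (1 - t^2 / l^2)) for |t| < l, zero otherwise."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < l
    out[inside] = np.exp(-1.0 / (1.0 - (t[inside] / l) ** 2))
    return out

def convolution_matrix(n, l):
    """Circulant n x n matrix from midpoint samples of the 1-periodic kernel."""
    centers = (np.arange(n) + 0.5) / n - 0.5
    diff = centers[:, None] - centers[None, :]
    diff = (diff + 0.5) % 1.0 - 0.5            # wrap differences to [-1/2, 1/2)
    K = kernel(diff, l)
    return K / K.sum(axis=1, keepdims=True)    # discrete mass preservation

A = convolution_matrix(64, 0.06)
gamma = np.linalg.svd(A, compute_uv=False)     # singular values, descending
cond = gamma[0] / gamma[-1]
```

Because the rows are non-negative, sum to one, and the matrix is symmetric, the largest singular value is one, so `cond` equals the reciprocal of the smallest singular value, mirroring the relation $\operatorname*{cond}(A)=1/\gamma_{r}$ used in the text.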

Empirical Distributions

Using the above formulas and $m=n=64$, $l=0.06$, $\sigma=0.1$, we computed the empirical distributions of the $\alpha$ values selected by the different parameter choice rules by evaluating (14), (15) and (16) on a fine logarithmic $\alpha$-grid, i.e., $\log_{10}(\alpha_{i})$ was increased linearly between $-40$ and $40$ with a step size of $0.01$. We drew $N_{\varepsilon}=10^{6}$ samples of $\varepsilon$. The results are displayed in Figures 3 and 4: In both figures, we use a logarithmic scaling of the empirical probabilities wherein empirical probabilities of $0$ have been set to $1/(2N_{\varepsilon})$. While this presentation complicates the comparison of the distributions as the probability mass is deformed, it facilitates the examination of small values and tails.
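A scaled-down version of such a simulation can be sketched in the singular basis. The code below assumes a non-singular $A$ and the spectral forms of PSURE and SURE that follow from their definitions; the singular values, the coefficients of $x^{*}$, the grid and the number of draws are hypothetical and far smaller than in the study described above.

```python
import numpy as np

def select_alphas(gamma, x_spec, sigma, alphas, n_draws, seed=0):
    """Grid minimizers of PSURE and SURE for n_draws independent noise samples."""
    rng = np.random.default_rng(seed)
    a = alphas[:, None]                              # shape (n_alpha, 1)
    g2 = gamma[None, :] ** 2                         # shape (1, m)
    resid_w = (a / (g2 + a)) ** 2                    # weights of y_i^2 in the residual
    dist_w = (1.0 / gamma - gamma / (g2 + a)) ** 2   # weights in ||x_ML - x_alpha||^2
    df = np.sum(g2 / (g2 + a), axis=1)               # degrees of freedom
    gdf = np.sum(1.0 / (g2 + a), axis=1)             # generalized degrees of freedom
    m = gamma.size
    picks = np.empty((n_draws, 2))
    for k in range(n_draws):
        y = gamma * x_spec + sigma * rng.standard_normal(m)
        psure = resid_w @ y**2 - m * sigma**2 + 2 * sigma**2 * df
        sure = dist_w @ y**2 - sigma**2 * np.sum(1 / gamma**2) + 2 * sigma**2 * gdf
        picks[k] = alphas[np.argmin(psure)], alphas[np.argmin(sure)]
    return picks

# Hypothetical setting: 32 singular values decaying from 1 to 1e-3.
gamma = np.logspace(0, -3, 32)
x_spec = np.ones(32)
alphas = np.logspace(-8, 2, 201)
picks = select_alphas(gamma, x_spec, sigma=0.1, alphas=alphas, n_draws=200)
```

Column 0 of `picks` holds the PSURE choices and column 1 the SURE choices; histograms of their base-10 logarithms give a rough analogue of the empirical distributions discussed here.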

First, we observe in Figure 3 that $\hat{\alpha}_{\operatorname*{DP}}$ typically overestimates the optimal $\alpha^{*}$ . However, it performs robustly and does not cause large $\ell_{2}$ -errors as can be seen in Figure 3. For $\hat{\alpha}_{\operatorname*{{PSURE}}}$ and $\hat{\alpha}_{\operatorname*{{SURE}}}$ , the latter is not true: While being closer to $\alpha^{*}$ than $\hat{\alpha}_{\operatorname*{DP}}$ most often, and, as can be seen from the joint error histograms in Figure 4, producing smaller $\ell_{2}$ -errors more often (87%/56% of the time for PSURE/SURE), both distributions show outliers, i.e., occasionally, very small values of $\hat{\alpha}$ are estimated that cause large $\ell_{2}$ -errors. In the case of $\hat{\alpha}_{\operatorname*{{SURE}}}$ , we even observe two clearly separated modes in the distributions. Table 2 shows different statistics that summarize the described phenomena. These findings motivate the theoretical examinations carried out in the following section.

3 Properties of the Parameter Choice Rules for Quadratic Regularization

In this section we consider the theoretical (risk) properties of PSURE, SURE and the discrepancy principle. To allow for a concise and accessible presentation of the main results, all proofs are shifted to Appendix A. As we are investigating random quantities, convergence rates are given in terms of the stochastic order symbols $O_{\mathbb{P}}$ and $o_{\mathbb{P}}$ , which correspond to Landau’s big $O$ and small $o$ notation, respectively, when convergence in probability is considered. Let us recall the definition of $O_{\mathbb{P}}$ and $o_{\mathbb{P}}$ using the formulation in [41], Chapter 2.1.

Definition 1 (Stochastic Order Symbols).

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $Z_{n}:\Omega\to\mathbb{R}$, $n\in\mathbb{N}$, be a sequence of random variables, and let $(r_{n})_{n\in\mathbb{N}}$ be a sequence of positive numbers. We say that

[TABLE]

We say that

[TABLE]

Instead of $Z_{n}=O_{\mathbb{P}}(r_{n})$ or $Z_{n}=o_{\mathbb{P}}(r_{n})$ we may also write $Z_{n}/r_{n}=O_{\mathbb{P}}(1)$ or $Z_{n}/r_{n}=o_{\mathbb{P}}(1)$ , respectively.

Assumption 1.

For the sake of simplicity we only consider $m=n$ in this first analysis. Furthermore, we assume

[TABLE]

Note that all assumptions are fulfilled in the numerical example we described in the previous section.

We mention that we consider here a rather moderate noise level, whose variance remains bounded as $m\rightarrow\infty$. A scaling corresponding to white noise in the infinite-dimensional limit would rather be $\sigma^{2}\sim m$, and an inspection of the estimates below shows that in such cases the risk estimate can additionally be far from its expected value.

3.1 PSURE-Risk

We start with an investigation of the PSURE risk estimate. Based on (15) and Stein’s result, the representation for the risk is given as

[TABLE]

Figure 5 illustrates the typical shape of $\operatorname*{{MSPE}}(\alpha)$ and PSURE estimates thereof. Following [29, 25] who considered the case $A=I_{m}$ and [46, 18], who investigated the performance of Stein’s unbiased risk estimate in the different context of hierarchical modeling, we show that, with the definition of the loss $\mathcal{L}$ by

[TABLE]

$1/m\,\operatorname*{{PSURE}}(\alpha,y)$ is close to $\mathcal{L}$ for large $m.$ Note that PSURE is an unbiased estimate of the expectation of $\mathcal{L}.$
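The unbiasedness of PSURE can also be checked numerically. The sketch below compares a Monte Carlo average of PSURE with a closed-form expression for the MSPE in the singular basis, namely $\sum_{i}(\alpha^{2}\gamma_{i}^{2}(x_{i}^{*})^{2}+\sigma^{2}\gamma_{i}^{4})/(\gamma_{i}^{2}+\alpha)^{2}$. That expression is derived here from the spectral model for a non-singular $A$ and is an assumption of this illustration, as are the singular values and all parameter values.

```python
import numpy as np

def mspe(alpha, gamma, x_spec, sigma):
    """Closed-form E||A(x* - x_alpha)||^2 in the singular basis (bias^2 + variance)."""
    denom = (gamma**2 + alpha) ** 2
    return np.sum((alpha**2 * gamma**2 * x_spec**2 + sigma**2 * gamma**4) / denom)

def psure(alpha, gamma, y_spec, sigma):
    """PSURE in the singular basis; y_spec may hold one sample per row."""
    m = gamma.size
    resid = np.sum((alpha / (gamma**2 + alpha)) ** 2 * y_spec**2, axis=-1)
    df = np.sum(gamma**2 / (gamma**2 + alpha))
    return resid - m * sigma**2 + 2 * sigma**2 * df

# Hypothetical setting, chosen only for this check.
rng = np.random.default_rng(1)
m, sigma, alpha = 32, 0.1, 1e-2
gamma = np.logspace(0, -3, m)
x_spec = np.ones(m)
Y = gamma * x_spec + sigma * rng.standard_normal((20000, m))
mc_mean = psure(alpha, gamma, Y, sigma).mean()
exact = mspe(alpha, gamma, x_spec, sigma)
```

Agreement of `mc_mean` and `exact` up to Monte Carlo error illustrates unbiasedness only; it says nothing about how far a single realization of PSURE lies from the risk, which is the question addressed by the results that follow.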

Theorem 1.

If Assumption 1 holds, then we have for any sequence of vectors $(x^{*}_{m})_{m\in\mathbb{N}}$ , $x^{*}_{m}\in\mathbb{R}^{m},$ such that $\|x^{*}_{m}\|_{2}^{2}=O(m)$ as $m\rightarrow\infty$

[TABLE]

Remark 1.

The result of Theorem 1 guarantees stochastic boundedness of the sequence

[TABLE]

It does not entail the existence of a proper weak limit of $\operatorname*{{PSURE}}$ , which would require stronger assumptions on the sequences $(x^{*}_{m})_{m\in\mathbb{N}}$ and $\big{(}(\gamma_{i,m})_{i=1}^{m}\big{)}_{m\in\mathbb{N}}.$

The latter result can be used to show that, in an asymptotic sense, if the loss $\mathcal{L}$ is considered, the estimator $\hat{\alpha}_{\operatorname*{{PSURE}}}$ does not have a larger risk than any other choice of regularization parameter. This statement is made precise in the following corollary.

Corollary 1.

Let $(\delta_{m})_{m\in\mathbb{N}}$ be a sequence of positive real numbers such that $1/\delta_{m}=o(\sqrt{m})$ . Under the assumptions of Theorem 1 the following holds true for any sequence of positive real numbers $(\alpha_{m})_{m\in\mathbb{N}}$ :

[TABLE]

We finally mention that our estimates are rather conservative, in particular with respect to the quantity $Sl_{3}(\alpha)$ in the proof of Theorem 1, since we do not assume particular smoothness of $x^{*}$. With an additional source condition, i.e., a certain decay speed of the $x_{i}^{*}$, it is possible to derive improved rates, which are however beyond the scope of our paper. We refer to [10] and [27] for recent results in that direction, where optimality of $x_{\hat{\alpha}_{\operatorname*{{PSURE}}}}$ with respect to the risk $\operatorname*{{MSEE}}$ under source conditions is shown for spectral cut-off and for more general filter-based methods, respectively. We now turn our attention to the convergence of the risk estimate as $m\rightarrow\infty$ as well as the convergence of the estimated regularization parameters.

Theorem 2.

If Assumption 1 holds, then we have for any sequence of vectors $(x^{*}_{m})_{m\in\mathbb{N}}$ , $x^{*}_{m}\in\mathbb{R}^{m},$ such that $\|x^{*}_{m}\|_{2}^{2}=O(m)$ as $m\rightarrow\infty$

[TABLE]

and

[TABLE]

Remark 2.

It follows from Theorem 2 and Definition 1 that

[TABLE]

whereas ${\mathrm{MSPE}_{m}}(\alpha,y)$ is bounded away from zero by the assumptions of Theorem 3. Therefore, asymptotically, minimizing ${\mathrm{MSPE}_{m}}$ is the same as minimizing $\operatorname*{{PSURE}}.$
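The closeness of $\operatorname*{{PSURE}}$ and ${\mathrm{MSPE}_{m}}$ can be checked numerically in singular-system coordinates. The following is a minimal sketch, assuming $m=n$, the diagonalized model $y_{i}=\gamma_{i}x_{i}^{*}+\tilde{\varepsilon}_{i}$ and the standard spectral form of Tikhonov regularization, $x_{\alpha,i}=\gamma_{i}y_{i}/(\gamma_{i}^{2}+\alpha)$. The function names, the toy spectrum and the toy signal are hypothetical and not taken from the paper.

```python
import numpy as np

def psure(alpha, y, gamma, sigma):
    """PSURE(alpha, y) for Tikhonov regularization in singular-system coordinates."""
    residual = alpha * y / (gamma**2 + alpha)        # y_i - gamma_i * x_alpha_i
    dof = np.sum(gamma**2 / (gamma**2 + alpha))      # trace of the influence matrix
    return np.sum(residual**2) - y.size * sigma**2 + 2.0 * sigma**2 * dof

def mspe(alpha, x_star, gamma, sigma):
    """MSPE(alpha) = E ||A (x* - x_alpha)||^2 as squared bias plus variance."""
    bias = alpha * gamma * x_star / (gamma**2 + alpha)
    variance = sigma**2 * gamma**4 / (gamma**2 + alpha)**2
    return np.sum(bias**2) + np.sum(variance)

# Hypothetical toy spectrum and signal, chosen only for illustration.
rng = np.random.default_rng(0)
m, sigma = 256, 0.1
gamma = np.exp(-0.02 * np.arange(m))                 # decaying singular values, gamma_1 = 1
x_star = np.ones(m)
y = gamma * x_star + sigma * rng.standard_normal(m)  # one noise realization

alphas = 10.0 ** np.linspace(-6, 2, 801)
psure_curve = np.array([psure(a, y, gamma, sigma) for a in alphas])
mspe_curve = np.array([mspe(a, x_star, gamma, sigma) for a in alphas])

# Sup-deviation over the grid, normalized by m (the kind of quantity Theorem 2 controls).
sup_dev = np.max(np.abs(psure_curve - mspe_curve)) / m
```

Repeating the last step for increasing `m` and many noise realizations gives an empirical picture of how the normalized deviation behaves; the grid limits used here are arbitrary.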

In order to understand the behaviour of the estimated regularization parameters we start with some bounds on $\hat{\alpha}_{\operatorname*{{MSPE}}}$ , which recover a standard property of deterministic Tikhonov-type regularization methods, namely that $\frac{\sigma^{2}}{\alpha}$ does not diverge for suitable parameter choices (cf. [16]).

Lemma 1.

A regularization parameter $\hat{\alpha}^{*}_{\operatorname*{{PSURE}},m}$ obtained from ${\mathrm{MSPE}_{m}}$ satisfies

[TABLE]

From a straightforward estimate of the derivative of ${\mathrm{MSPE}_{m}}$ on sets where $\alpha$ is bounded away from zero, together with the Arzelà-Ascoli theorem, we obtain the following result:

Proposition 1.

The sequence of functions $f_{m}:\alpha\mapsto\frac{1}{m}{\mathrm{MSPE}_{m}}(\alpha)$ is equicontinuous on sets $[C_{1},C_{2}]$ with $0<C_{1}<C_{2}$ and hence has a uniformly convergent subsequence $f_{m_{k}}$ with continuous limit function $f$ .

In order to obtain convergence of minimizers it suffices to be able to choose uniform constants $C_{1}$ and $C_{2}$ , which is possible if the bounds in Lemma 1 are uniform:

Theorem 3.

Let $\max_{i=1}^{m}|x_{i,m}^{*}|$ be uniformly bounded in $m$ and $\frac{1}{m}\sum_{i=1}^{m}\gamma_{i,m}^{4}(x_{i,m}^{*})^{2}$ be uniformly bounded away from zero. Then there exists a subsequence $\hat{\alpha}_{\operatorname*{{MSPE}},m_{k}}$ that converges to a minimizer of the asymptotic risk $f$ . Moreover, $\hat{\alpha}_{\operatorname*{{PSURE}},m_{k}}$ converges to a minimizer of the asymptotic risk $f$ in probability.

3.2 Discrepancy Principle

We now turn our attention to the discrepancy principle, which we can formulate in a similar setting as the PSURE approach above. With a slight abuse of notation, in analogy to the other methods, we denote the expectation of $\operatorname*{DP}(\alpha,y)$ by $\operatorname*{{EDP}}(\alpha)$ and define $\hat{\alpha}_{\operatorname*{{EDP}}}$ as the solution of the equation

[TABLE]

Figure 5 illustrates the typical shape of $\operatorname*{{EDP}}(\alpha)$ and its DP estimates. Observing that

[TABLE]

we immediately obtain the following result:

Theorem 4.

If Assumption 1 holds, we have for any sequence of vectors $(x^{*}_{m})_{m\in\mathbb{N}}$ , $x^{*}_{m}\in\mathbb{R}^{m},$ such that $\|x^{*}_{m}\|_{2}^{2}=O(m)$

[TABLE]

and

[TABLE]
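The discrepancy principle can be sketched in the same spectral coordinates. The code below assumes the usual form $\operatorname*{DP}(\alpha,y)=\|Ax_{\alpha}(y)-y\|_{2}^{2}-m\sigma^{2}$ with $\hat{\alpha}_{\operatorname*{DP}}$ defined as its root, and $m=n$ with all $\gamma_{i}>0$. Since the residual is monotonically increasing in $\alpha$ for Tikhonov regularization, a bisection in $\log_{10}\alpha$ suffices. The function names and the bracket $[10^{-40},10^{40}]$ are illustrative choices.

```python
import numpy as np

def dp_residual(alpha, y, gamma, sigma):
    """DP(alpha, y) = ||A x_alpha - y||^2 - m sigma^2 in singular-system coordinates."""
    return np.sum((alpha * y / (gamma**2 + alpha))**2) - y.size * sigma**2

def alpha_dp(y, gamma, sigma, log_lo=-40.0, log_hi=40.0, iters=200):
    """Root of DP(., y) by bisection on log10(alpha); requires a sign change on the bracket."""
    if dp_residual(10.0**log_lo, y, gamma, sigma) > 0:
        raise ValueError("residual already exceeds m*sigma^2 at the lower bracket")
    if dp_residual(10.0**log_hi, y, gamma, sigma) < 0:
        raise ValueError("residual never reaches m*sigma^2 (data dominated by noise)")
    lo, hi = log_lo, log_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dp_residual(10.0**mid, y, gamma, sigma) < 0:
            lo = mid          # residual too small: increase alpha
        else:
            hi = mid          # residual too large: decrease alpha
    return 10.0 ** (0.5 * (lo + hi))
```

The second check matters in practice: if $\|y\|_{2}^{2}<m\sigma^{2}$ for a given noise realization, the equation has no solution and some convention for $\hat{\alpha}_{\operatorname*{DP}}$ has to be fixed separately.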

3.3 SURE-Risk

Now we consider the SURE-risk estimation procedure. Figure 5 illustrates the typical shape of $\operatorname*{{MSEE}}(\alpha)$ and SURE estimates thereof. Based on (16), if $\gamma_{m}>0$ for all $m$ , the risk can be written as

[TABLE]

For the PSURE criterion we showed in Theorem 1 that $\operatorname*{{PSURE}}(\alpha,y)$ is close to the loss $\mathcal{L}$ in an asymptotic sense with the standard $\sqrt{m}$ -rate of convergence. An analogous result can be shown for SURE and the associated loss $\tilde{\mathcal{L}}(\alpha):=c_{m}\|\Pi(x^{*}-\hat{x}_{\alpha})\|_{2}^{2}$ but with different associated rates of convergence $c_{m}$ , dependent on the singular values.

Theorem 5.

Let Assumption 1 be satisfied and in addition to (20), let $\gamma_{m,m}>0$ for all $m$ and $m=n=r$ . Then we have for any sequence of vectors $(x^{*}_{m})_{m\in\mathbb{N}}$ , $x^{*}_{m}\in\mathbb{R}^{m},$ such that $\max_{i=1}^{m}|x_{i,m}^{*}|$ is uniformly bounded as $m\rightarrow\infty,$

[TABLE]

where

[TABLE]

In the same manner as for PSURE, we may use the latter convergence result to show that, in an asymptotic sense, if the loss $\tilde{\mathcal{L}}$ is considered, the estimator $\hat{\alpha}_{\operatorname*{{SURE}}}$ does not have a larger risk than any other choice of regularization parameter. We stress again that this optimality property depends on the loss considered, as it is the case in Corollary 1.

Corollary 2.

Let $(\delta_{m})_{m\in\mathbb{N}}$ be a sequence of positive reals such that $d_{m}=o(\delta_{m})$ . If the assumptions of Theorem 5 hold, we have for any sequence of positive real numbers $(\alpha_{m})_{m\in\mathbb{N}}$ :

[TABLE]

Note that $1/\sqrt{m}\leq d_{m}\leq 1$ , depending on the behaviour of the singular values. If $\inf_{m}d_{m}>0$ , $O_{\mathbb{P}}(d_{m})=O_{\mathbb{P}}(1)$ in Theorem 5 and only sequences $\delta_{m}$ such that $\inf_{m}\delta_{m}>0$ are permissible in Corollary 2.

We can now proceed to an estimate between $\operatorname*{{SURE}}$ and $\operatorname*{{MSEE}}$ similar to the ones for the PSURE risk, however we observe a main difference due to the appearance of the condition number of the forward matrix $A$ :

Theorem 6.

Let $A_{m}\in\mathbb{R}^{m\times m}$ be a full rank matrix. In addition to Assumption 1, let $\gamma_{m,m}>0$ for all $m$ and $\gamma_{m,m}\rightarrow 0$ . Then, we have for any sequence of vectors $(x^{*}_{m})_{m\in\mathbb{N}}$ , $x^{*}_{m}\in\mathbb{R}^{m},$ such that $\|x^{*}_{m}\|_{2}^{2}=O(m)$ as $m\rightarrow\infty,$

[TABLE]

and

[TABLE]

We finally note that, even in the best case, the convergence of SURE is slower than that of PSURE. However, since for ill-posed problems the condition number of $A$ will grow with $m$ , the typical case is rather divergence of $\frac{\operatorname*{cond}(A)^{2}}{\sqrt{m}}$ . Hence, the empirical estimates of the regularization parameters might have a large variation, which will be confirmed by the numerical results below.
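To see where the condition number enters, it helps to write SURE and $\operatorname*{{MSEE}}$ in spectral form. The sketch below assumes $m=n$, a full rank $A$, and the standard expressions $x_{\mathrm{ML},i}=y_{i}/\gamma_{i}$ and $x_{\alpha,i}=\gamma_{i}y_{i}/(\gamma_{i}^{2}+\alpha)$; the function names and the toy spectrum are hypothetical. The data-dependent term divides $y_{i}$ by $\gamma_{i}$, so noise in components with small singular values is amplified by up to $1/\gamma_{m}=\operatorname*{cond}(A)$ when $\gamma_{1}=1$.

```python
import numpy as np

def sure(alpha, y, gamma, sigma):
    """SURE(alpha, y) for Tikhonov regularization, full rank square case."""
    diff = alpha * y / (gamma * (gamma**2 + alpha))   # x_ML_i - x_alpha_i
    return (np.sum(diff**2)
            - sigma**2 * np.sum(1.0 / gamma**2)
            + 2.0 * sigma**2 * np.sum(1.0 / (gamma**2 + alpha)))

def msee(alpha, x_star, gamma, sigma):
    """MSEE(alpha) = E ||x* - x_alpha||^2 as squared bias plus variance."""
    bias = alpha * x_star / (gamma**2 + alpha)
    variance = sigma**2 * gamma**2 / (gamma**2 + alpha)**2
    return np.sum(bias**2) + np.sum(variance)

# Hypothetical spectrum: the faster the decay, the larger cond(A) = gamma_1 / gamma_m.
gamma = np.exp(-0.05 * np.arange(64))
cond_A = gamma.max() / gamma.min()
```

Comparing `sure` and `msee` on a fine $\alpha$ -grid for spectra with increasing `cond_A` is one way to reproduce, qualitatively, the dependence described above.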

4 Numerical Studies for Quadratic Regularization

4.1 Setup

As in the illustrative example in Section 2.1, we computed the empirical distributions of the different parameter choice rules for the same scenario (cf. Section 2.2) for each combination of $m=n=16,32,64,128,256,512,1024,2048$ , $l=0.01,0.02,0.03,0.04,0.06,0.08,0.1$ and $\sigma=0.1$ . For $m=16,\ldots,512$ , $N_{\varepsilon}=10^{6}$ and for $m=1024,2048$ , $N_{\varepsilon}=10^{5}$ noise realizations were sampled. The computation was, again, based on a logarithmical $\alpha$ -grid, i.e., $\log_{10}\alpha$ is increased linearly between -40 and 40 with a step size of $0.01$ . In addition to the distributions of $\alpha$ , the expressions

[TABLE]

were computed over the $\alpha$ -grid. As in some cases, the supremum is obtained in the limit $\alpha\rightarrow\infty$ , and hence, on the boundary of our computational grid, we also evaluated (24) for $\alpha=\infty$ in these cases.
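A grid of this kind can be set up as follows. This is a small sketch with hypothetical helper names; it builds the exponents with `np.linspace` rather than by repeated addition of the step size, which avoids accumulated rounding in the 8001 grid points.

```python
import numpy as np

def log_alpha_grid(log_min=-40.0, log_max=40.0, step=0.01):
    """Grid on which log10(alpha) increases linearly from log_min to log_max."""
    n_points = int(round((log_max - log_min) / step)) + 1
    return 10.0 ** np.linspace(log_min, log_max, n_points)

def minimize_on_grid(risk, alphas):
    """Return the grid point with the smallest risk value and that value."""
    values = np.array([risk(a) for a in alphas])
    k = int(np.argmin(values))
    return alphas[k], values[k]

alphas = log_alpha_grid()
```

If the minimizer returned by `minimize_on_grid` sits at the first or last grid point, the limit $\alpha\rightarrow 0$ or $\alpha\rightarrow\infty$ should be evaluated separately, as described above.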

4.2 Illustration of Theorems

We first illustrate Theorems 2 and 6 by computing (22) and (23) based on our samples. The results are plotted in Figure 6 and show that the asymptotic rates hold. For SURE, the comparison between Figures 6(b) and 6(c) also shows that the dependence on $\operatorname*{cond}(A)$ is crucial.

4.3 Dependence on the Ill-Posedness

We then demonstrate how the empirical distributions of $\hat{\alpha}$ and the corresponding $\ell_{2}$ -error, $\|x^{*}-x_{\hat{\alpha}}\|_{2}^{2}$ , such as those plotted in Figure 3, depend on the ill-posedness of the inverse problem.

Dependence on $m$

In Figures 7 and 8, $m$ is increased while the width of the convolution kernel is kept fixed. The impact of this on the singular value spectrum is illustrated in Figure 2. Most notably, smaller singular values are added and the condition of $A$ increases (cf. Table 1). Figures 7(a) and 8(a) suggest that the distribution of the optimal $\alpha^{*}$ is Gaussian and converges to a limit for increasing $m$ . The distribution of the corresponding $\ell_{2}$ -error looks Gaussian as well and seems to concentrate while shifting to larger mean values. For the discrepancy principle, Figures 7(b) and 8(b) show that the distribution of $\hat{\alpha}_{\operatorname*{DP}}$ widens for increasing $m$ , and the distribution of the corresponding $\ell_{2}$ -error develops a tail while shifting to larger mean values. Figures 7(c) and 8(c) show that the distribution of $\hat{\alpha}_{\operatorname*{{PSURE}}}$ seems to converge to a limit for increasing $m$ . The distribution of the corresponding $\ell_{2}$ -error also develops a tail while shifting to larger mean values. For SURE, Figures 7(d) and 8(d) reveal that increasing $m$ leads to erratic, multimodal distributions: compared to the other $\alpha$ -distributions, the distribution of $\hat{\alpha}_{\operatorname*{{SURE}}}$ includes a significant amount of very small values, and the corresponding $\ell_{2}$ -error distributions range over very large values.

Dependence on $l$

In Figures 9 and 10, the width of the convolution kernel, $l$ , is increased while $m=64$ is kept fixed (cf. Figure 2 and Table 1). It is worth noticing that as $l=0.02$ corresponds to a very well-posed problem, the optimal $\alpha^{*}$ is often extremely small or even [math], as can be seen from Figure 9(a). The general tendencies are similar to those observed when increasing $m$ . For SURE, Figures 9(d) and 10(d) illustrate how the multiple modes of the distributions slowly evolve and shift to smaller values of $\alpha$ (and larger corresponding $\ell_{2}$ -errors).

4.4 Linear vs Logarithmical Grids

One reason why the properties of SURE exposed in this work have not been noticed so far is that they only become apparent in very ill-conditioned problems (cf. Section 1). Another reason is the way the risk estimators are typically computed: Firstly, for high dimensional problems, (3) often needs to be solved by an iterative method. For very small $\alpha$ , the condition of $(A^{*}A+\alpha I)$ is very large and the solver will need a lot of iterations to reach a given tolerance. If, instead, a fixed number of iterations is used, an additional regularization of the solution to (1) is introduced which alters the risk function. Secondly, again due to the computational effort, a coarse, linear $\alpha$ -grid excluding $\alpha=0$ instead of a fine, logarithmic one is often used for evaluating the risk estimators. For two of the risk estimations plotted in Figure 5, Figure 11 demonstrates that this insufficient coverage of small $\alpha$ values by the grid can lead to missing the global minimum and other misinterpretations.
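The effect of the grid can be reproduced with a deliberately artificial example. The curve `toy_risk` below is hypothetical and not one of the risk estimators of this paper: it has its global minimum at $\alpha=10^{-6}$ and a shallower local minimum at $\alpha=0.5$. A coarse linear grid on $[0.01,1]$ only sees the local minimum, while a logarithmic grid finds the global one.

```python
import numpy as np

def toy_risk(alpha):
    """Hypothetical curve: global minimum (-1) at 1e-6, local minimum (0) at 0.5."""
    narrow = (np.log10(alpha) + 6.0) ** 2 - 1.0   # well around alpha = 1e-6
    broad = 4.0 * (alpha - 0.5) ** 2              # well around alpha = 0.5
    return np.minimum(narrow, broad)

def grid_minimizer(risk, alphas):
    """Grid point at which the (vectorized) risk is smallest."""
    return alphas[np.argmin(risk(alphas))]

linear_grid = np.linspace(0.01, 1.0, 100)        # coarse, excludes small alpha
log_grid = 10.0 ** np.linspace(-10.0, 10.0, 2001)

alpha_linear = grid_minimizer(toy_risk, linear_grid)
alpha_log = grid_minimizer(toy_risk, log_grid)
```

Here `alpha_linear` lands near $0.5$ and `alpha_log` near $10^{-6}$ , so the two grids lead to different conclusions about the same function.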

5 Numerical Studies for Non-Quadratic Regularization

In this section, we consider the popular sparsity-inducing $R(x)=\|x\|_{1}$ as a regularization functional (LASSO penalty) to examine whether our results also apply to non-quadratic regularization functionals. For this, let $I$ be the support of $\hat{x}_{\alpha}(y)$ and $J$ its complement. Let further $|I|=k$ and $P_{I}\in\mathbb{R}^{k\times n}$ be a projector onto $I$ and $A_{I}$ the restriction of $A$ to $I$ . We have that

[TABLE]

as shown, e.g., in [39, 14, 12], which allows us to compute PSURE (7) and SURE (9). Notice that while $\hat{x}_{\alpha}(y)$ is a continuous function of $\alpha$ [7], PSURE and SURE are discontinuous at all $\alpha$ where the support $I$ changes.
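For PSURE this leads to a particularly simple computation. The sketch below assumes the well-known result that the degrees of freedom of the LASSO fit equal the support size $|I|=k$ when $A_{I}$ has full column rank, and the generic form $\operatorname*{{PSURE}}=\|y-A\hat{x}_{\alpha}\|_{2}^{2}-m\sigma^{2}+2\sigma^{2}\,\mathrm{df}$ . The function name and the `zero_tol` parameter are hypothetical; the SURE variant, which needs the generalized degrees of freedom, is not covered here.

```python
import numpy as np

def lasso_psure(A, y, x_hat, sigma, zero_tol=0.0):
    """PSURE for a given LASSO solution x_hat, using df = |support(x_hat)|.

    Assumes A restricted to the support has full column rank.
    Returns the PSURE value and the support size k.
    """
    support = np.flatnonzero(np.abs(x_hat) > zero_tol)
    k = support.size
    residual = y - A @ x_hat
    value = residual @ residual - y.size * sigma**2 + 2.0 * sigma**2 * k
    return value, k
```

Because `k` is integer-valued and jumps whenever an entry of $\hat{x}_{\alpha}(y)$ enters or leaves the support, the returned value is discontinuous in $\alpha$ at exactly those points, even though the residual term is continuous.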

To carry out similar numerical studies as those presented in the last section, we have to overcome several non-trivial difficulties: While there exist various iterative optimization techniques to solve (2) nowadays (see, e.g., [8]), each method typically only works well for certain ranges of $\alpha$ , $\operatorname*{cond}(A)$ and tolerance levels to which the problem should be solved. In addition, each method comes with internal parameters that have to be tuned for each problem separately to obtain fast convergence. As a result, it is difficult to compute a consistent series of $\hat{x}_{\alpha}(y)$ for a given logarithmical $\alpha$ -grid, i.e., one that accurately reproduces all the change-points in the support and has a uniform accuracy over the grid. Our solution to this problem is to use an all-at-once implementation of ADMM [6] that solves (2) for the whole $\alpha$ -grid simultaneously, i.e., using exactly the same initialization, number of iterations and step sizes. See Appendix B for details. In addition, an extremely small tolerance level ( $tol=10^{-14}$ ) and $10^{4}$ maximal iterations were used to ensure a high accuracy of the solutions.

Another problem for computing quantities like (24) is that we cannot compute the expectations defining the real risks $\operatorname*{{MSPE}}$ (7) and $\operatorname*{{MSEE}}$ (9) anymore: We have to estimate them as the sample mean over PSURE and SURE in a first run of the studies, before we can compute (24) in a second run (wherein $\operatorname*{{MSPE}}$ and $\operatorname*{{MSEE}}$ are replaced by the estimates from the first run).

We considered scenarios with each combination of $m=n=16,32,64,128,256,512$ , $l=0.02,0.04,0.06$ and $\sigma=0.1$ . Depending on $m$ , $N_{\varepsilon}=10^{5},10^{4},10^{4},10^{4},10^{3},10^{3}$ noise realizations were examined. The computation was based on a logarithmical $\alpha$ -grid where $\log_{10}\alpha$ is increased linearly in between -10 and 10 with a step size of $0.01$ .

Risk Plots:

Figure 12 shows the different risk functions and estimates thereof. The jagged form of the PSURE and SURE plots evaluated on this fine $\alpha$ -grid indicates that the underlying functions are discontinuous. Also note that while PSURE and SURE for each individual noise realization are discontinuous, $\operatorname*{{MSPE}}$ and $\operatorname*{{MSEE}}$ are smooth and continuous, as can be seen already from the empirical means over $N_{\varepsilon}=10^{4}$ .

Empirical Distributions:

Figure 13 shows the empirical distributions of the different parameter choice rules for $\alpha$ . Here, the optimal ${\alpha}^{*}$ is chosen as the one minimizing the $\ell_{1}$ -error $\|x^{*}-x_{\hat{\alpha}}\|_{1}$ to the true solution $x^{*}$ . We can observe similar phenomena as for $\ell_{2}$ -regularization. In particular, the distributions for SURE also have multiple modes at small values of $\alpha$ and at large values of $\ell_{1}$ -error.

Sup-Theorems:

Due to the lack of explicit formulas for the $\ell_{1}$ -regularized solution $x_{\alpha}(y)$ , carrying out similar analysis as in Section 3 to derive theorems such as Theorems 2 and 6 is very challenging. In this work, we only illustrate that similar results may hold for the case of $\ell_{1}$ -regularization by computing the left hand side of (22) and (23) based on our samples. The results are shown in Figure 14 and are remarkably similar to those shown in Figure 6.

Linear Grids and Accurate Optimization

All the issues raised in Section 4.4 about why the properties of SURE revealed in this work are likely to be overlooked when working on high dimensional problems are even more crucial for the case of $\ell_{1}$ -regularization: For computational reasons, the risk estimators are often evaluated on a coarse, linear $\alpha$ -grid using a small, fixed number of iterations of an iterative method such as ADMM. Figure 15 illustrates that this may obscure important features of the real SURE function, such as the strong discontinuities for small $\alpha$ , or even change it significantly.

6 Conclusion

We examined variational regularization methods for ill-posed inverse problems and conducted extensive numerical studies that assessed the statistical properties of different parameter choice rules. In particular, we were interested in the influence of the degree of ill-posedness of the problem (measured in terms of the condition of the forward operator) on the probability distributions of the selected regularization parameters and of the corresponding induced errors. This perspective revealed important features that were not discussed or noticed before but are essential to know for practical applications, namely that unbiased risk estimators encounter enormous difficulties: While the discrepancy principle yields a rather unimodal distribution of regularization parameters resembling the optimal one with slightly increased mean value, the PSURE estimates start to develop multimodality, and the additional modes consist of underestimated regularization parameters, which may lead to significant errors in the reconstruction. For the case of SURE, which is based on a presumably more reliable risk, the estimates produce quite wide distributions (at least in logarithmic scaling) for increasing ill-posedness. In particular, there are many highly underestimated parameters, which clearly yield bad reconstructions. We expect that this behaviour is due to the bad quality of the risk estimators rather than the quality of the risk. These findings may be explained by Theorem 6, which indicates that the estimated SURE risk might deviate strongly from the true risk function $\operatorname*{{MSEE}}$ when the condition number of $A$ is large, i.e., when the problem is asymptotically ill-posed as $m\rightarrow\infty$ . Consequently, one might expect a strong variation in the minimizers of $\operatorname*{{SURE}}$ with varying $y$ compared to the ones of $\operatorname*{{MSEE}}$ .
A potential way to cure those issues is to develop novel risk estimates for $\operatorname*{{MSEE}}$ that are not based on Stein’s method; it might even be useful not to insist on the unbiasedness of the estimators.

We finally mention that for problems like sparsity-promoting regularization, the SURE risk leads to additional issues, since it is based on a Euclidean norm. While the discrepancy principle and the PSURE risk only use the norms appearing naturally in the output space of the inverse problem (or in a more general setting the log-likelihood of the noise), the Euclidean norm in the space of the unknown is rather arbitrary. In particular, it may deviate strongly from the Banach space geometry in $\ell^{1}$ or similar spaces in high dimensions. Thus, different constructions of SURE risks are to be considered in such a setting, e.g. based on Bregman distances.

Appendix A Proofs

Proof of Theorem 1.

We find

[TABLE]

where $\tilde{\varepsilon}=U^{*}\varepsilon$ , that is, $\tilde{\varepsilon}_{i}=\langle u_{i},\varepsilon\rangle$ . Note that

[TABLE]

and recall from (10) that $x_{i}^{*}=\langle v_{i},x^{*}\rangle.$ Since $U^{*}U=UU^{*}=I$ , Var $[\tilde{\varepsilon}_{i}]=\sigma^{2}$ . This yields

[TABLE]

We obtain the representation

[TABLE]

where the terms $Sl_{j}(\alpha),\,j\in\{1,2,3\}$ are defined in an obvious manner. Since $\tilde{\varepsilon}_{1}^{2},\ldots,\tilde{\varepsilon}_{n}^{2}$ are independent and identically distributed with expectation $\sigma^{2}$ we immediately obtain that

[TABLE]

Note that $Sl_{1}(\alpha)$ is independent of $\alpha.$ Next, we consider the term $Sl_{2}(\alpha).$

Due to (20) the values $\gamma_{i}^{2}/(\gamma_{i}^{2}+\alpha)\in(0,1]$ for $\alpha\in[0,\infty)$ , are monotonically decreasing (with respect to $i$ ). Thus, we find

[TABLE]

It follows from [28], Lemma 7.2:

[TABLE]

and an application of Kolmogorov’s maximal inequality yields

[TABLE]

Hence

[TABLE]

and therefore, by Definition (18),

[TABLE]

where we also used that Var $(\tilde{\varepsilon}_{i}^{2}-\sigma^{2})=2\sigma^{4},$ which follows from $\tilde{\varepsilon}_{i}\sim\mathcal{N}(0,\sigma^{2})$ .
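The variance identity follows from the fourth moment of a centered Gaussian, $\mathbb{E}[\tilde{\varepsilon}_{i}^{4}]=3\sigma^{4}$ , written out here for completeness:

$$\operatorname{Var}\big(\tilde{\varepsilon}_{i}^{2}-\sigma^{2}\big)=\mathbb{E}\big[\tilde{\varepsilon}_{i}^{4}\big]-\big(\mathbb{E}\big[\tilde{\varepsilon}_{i}^{2}\big]\big)^{2}=3\sigma^{4}-\sigma^{4}=2\sigma^{4}.$$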

Finally, we estimate $Sl_{3}(\alpha).$

Now, if $\alpha\geq 1$ , then it follows from condition (20) that $0\leq\alpha\gamma_{i}/(\gamma_{i}^{2}+\alpha)\leq\alpha\gamma_{i}/\alpha=\gamma_{i}\leq 1$ and

[TABLE]

and a further application of Kolmogorov’s maximal inequality as in (25) yields

[TABLE]

To determine its asymptotic order, we consider the term $\bigl{(}Sl_{3}(\alpha),\alpha\in[0,1]\bigr{)}$ as a (Gaussian) stochastic process in $\alpha\in[0,1]$ for fixed $m$ . Clearly, by the Cauchy-Schwarz inequality

[TABLE]

The first factor is bounded since, by Assumption, $\|x^{*}\|_{2}^{2}=O(m)$ and for any $m\in\mathbb{N}$ , $\frac{1}{m}\sum_{i=1}^{m}\tilde{\varepsilon}_{i}^{2}$ is a random variable (independent of $\alpha$ ) and therefore almost surely bounded (w.r.t. $\alpha$ ). Hence, the process $\bigl{(}Sl_{3}(\alpha),\alpha\in[0,1]\bigr{)}$ is almost surely bounded (w.r.t. $\alpha\in[0,1]$ ). Recall that we need to show that $\sup_{\alpha\in[0,1]}|Sl_{3}(\alpha)|=O_{\mathbb{P}}(1/\sqrt{m})$ , where the stochastic order symbol $O_{\mathbb{P}}(1/\sqrt{m})$ is defined in (18). Let $T>0.$ An application of the Markov inequality yields

[TABLE]

Since $\tilde{\varepsilon}$ and $-\tilde{\varepsilon}$ have the same distribution due to symmetry of the standard normal distribution,

[TABLE]

Hence, the desired result follows if we show that

[TABLE]

To do so, we apply the following Gaussian comparison inequality.

Theorem 7 (Sudakov-Fernique inequality (Theorem 2.2.3 in [1])).

Let $f$ and $g$ be a.s. bounded Gaussian processes on $T$ . If

[TABLE]

for all $s,t\in T,$ then

[TABLE]

Let $\alpha_{1},\alpha_{2}\in[0,1].$

[TABLE]

Consider the process

[TABLE]

Obviously, $\widetilde{Sl}_{3}$ is almost surely bounded and

[TABLE]

which yields

[TABLE]

Since $\mathbb{E}[Sl_{3}(\alpha)]=\mathbb{E}[\widetilde{Sl}_{3}(\alpha)]=0$ for all $\alpha\in[0,1],$ the assumptions of Theorem 7 are satisfied, which allows us to conclude

[TABLE]

Furthermore,

[TABLE]

where we used that $\frac{4}{m}\sum_{i=1}^{m}x_{i}^{*}\tilde{\varepsilon}_{i}\sim\mathcal{N}\big{(}0,16\sigma^{2}/m^{2}\sum_{i=1}^{m}(x_{i}^{*})^{2}\big{)}$ and that for a random variable $Z\sim\mathcal{N}(0,s^{2})$ the first absolute moment is given by $\mathbb{E}|Z|=s\sqrt{2/\pi}.$ This yields

[TABLE]

since, by Assumption, $\|x^{*}\|_{2}^{2}=O(m)$ . By Definition (18) we conclude $\sup_{\alpha\in[0,1]}|Sl_{3}(\alpha)|=O_{\mathbb{P}}(\sigma/\sqrt{m}).$

∎

Proof of Corollary 1.

By definition $\operatorname*{{PSURE}}(\hat{\alpha}_{\operatorname*{{PSURE}}},y)\leq\operatorname*{{PSURE}}(\alpha_{m},y)$ . This yields

[TABLE]

It follows from Theorem 1 that $\sup_{\alpha\in[0,\infty)}|\mathcal{L}(\alpha)-\frac{1}{m}\operatorname*{{PSURE}}(\alpha,y)|=o_{\mathbb{P}}(\delta_{m})$ for any sequence $(\delta_{m})_{m\in\mathbb{N}}$ such that $1/\delta_{m}=o(\sqrt{m}).$ By definition (see (19)),

[TABLE]

Setting $\nu=1/2$ above, the claim now follows.

∎

Proof of Theorem 2.

Observing (15) and (3.1) we find

[TABLE]

where $\check{\varepsilon}_{i}:=y_{i}^{2}-\mathbb{E}[y_{i}^{2}]$ . The random variables $\check{\varepsilon}_{1},\ldots,\check{\varepsilon}_{n}$ are independent and centered. Notice that

[TABLE]

since $y_{i}\sim\mathcal{N}(\gamma_{i}{x_{i}^{*}},\sigma^{2})$ .
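The variance of $\check{\varepsilon}_{i}$ (stated explicitly in the proof of Theorem 6) can be verified directly. Writing $\mu_{i}:=\gamma_{i}x_{i}^{*}$ as a shorthand introduced only for this computation, we have $y_{i}=\mu_{i}+\tilde{\varepsilon}_{i}$ and therefore

$$\check{\varepsilon}_{i}=y_{i}^{2}-\mathbb{E}[y_{i}^{2}]=2\mu_{i}\tilde{\varepsilon}_{i}+\big(\tilde{\varepsilon}_{i}^{2}-\sigma^{2}\big),$$

$$\operatorname{Var}[\check{\varepsilon}_{i}]=4\mu_{i}^{2}\sigma^{2}+4\mu_{i}\,\mathbb{E}\big[\tilde{\varepsilon}_{i}^{3}\big]+\operatorname{Var}\big[\tilde{\varepsilon}_{i}^{2}\big]=4\gamma_{i}^{2}(x_{i}^{*})^{2}\sigma^{2}+2\sigma^{4},$$

since the third moment of a centered Gaussian vanishes and $\operatorname{Var}[\tilde{\varepsilon}_{i}^{2}]=2\sigma^{4}$ .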

Consider the monotonically increasing function $\alpha\mapsto\frac{\alpha^{2}}{(\gamma_{i}^{2}+\alpha)^{2}}\in[0,1]$ (where $\alpha\in[0,\infty)$ ) and note that the sequence $\big{(}\frac{1}{(\gamma_{i}^{2}+\alpha)^{2}}\big{)}_{i=1}^{m}$ is increasing. With the same arguments as in the proof of Theorem 1 (see (25)), using Kolmogorov’s maximal inequality, we estimate

[TABLE]

It remains to show the $L^{2}$ -convergence (22). To this end define the $j$ -th partial sum

[TABLE]

and observe that $\{S_{j}\,|\,j\in\mathbb{N}\}$ forms a martingale. The $L^{p}$ -maximal inequality for martingales yields

[TABLE]

as above.

∎

Proof of Lemma 1.

It is straightforward to see the differentiability of $\operatorname*{{MSPE}}$ and to compute

[TABLE]

Hence, for $\alpha<\frac{\sigma^{2}}{\max_{i}|x_{i}^{*}|^{2}}$ , the risk $\operatorname*{{MSPE}}$ is strictly decreasing, which implies the first inequality. Moreover, for $\alpha\geq 1$ we obtain

[TABLE]

and we finally see that $\operatorname*{{MSPE}}^{\prime}$ is nonnegative if in addition $\alpha\geq 8\sigma^{2}\frac{\sum\gamma_{i}^{4}}{\sum\gamma_{i}^{4}(x_{i}^{*})^{2}}$ .

∎

Proof of Theorem 3.

From the uniform convergence of the sequence $f_{m_{k}}$ in Proposition 1 we obtain the convergence of the minimizers $\hat{\alpha}_{\operatorname*{{MSPE}},m_{k}}$ . Combined with Theorem 2 we obtain an analogous argument for $\hat{\alpha}_{\operatorname*{{PSURE}},m_{k}}$ . ∎

Proof of Theorem 5.

For $m=n$ and invertible matrices $A$ the projection $\Pi$ satisfies $\Pi={\rm id}$ and

[TABLE]

where we used (2.1). Recall from (11) that $y_{i}=\gamma_{i}x_{i}^{*}+\tilde{\varepsilon}_{i}.$ This yields

[TABLE]

Hence,

[TABLE]

Recall from (16) that

[TABLE]

We obtain

[TABLE]

where $GSl_{1}(m,\alpha)$ and $GSl_{2}(m,\alpha)$ are defined in an obvious manner. Obviously,

[TABLE]

by assumption (20). Therefore

[TABLE]

where the last estimate follows from Kolmogorov’s maximal inequality as in (25). Now, since

[TABLE]

Next we derive a corresponding estimate for the term $GSl_{2}(\alpha)$ . Observe that $0\leq\alpha/(\gamma_{i}^{2}+\alpha)\leq\alpha/(\gamma_{i+1}^{2}+\alpha)\leq 1$ and $1\geq\gamma_{i}^{4}/(\gamma_{i}^{2}+\alpha)^{2}\geq\gamma_{i+1}^{4}/(\gamma_{i+1}^{2}+\alpha)^{2}\geq 0$ for any $\alpha\geq 0$ and any $1\leq i\leq m$ by ordering of the singular values. This implies

[TABLE]

by a further application of Kolmogorov’s maximal inequality as in (25). Notice that $\sqrt{c_{m}}\leq 1/\sqrt{m}$ and $1/\sqrt{m}\leq d_{m}\leq 1$ . Therefore, since

[TABLE]

the claim of the theorem follows. ∎

Proof of Theorem 6.

For full rank matrices $A\in\mathbb{R}^{m\times m}$ we have from (26)

[TABLE]

As in the proof of Theorem 2 we set $\check{\varepsilon}_{i}:=y_{i}^{2}-\mathbb{E}[y_{i}^{2}].$ Recall that the random variables $\check{\varepsilon}_{i}$ are centered, independent with Var $[\check{\varepsilon}_{i}]=4\gamma_{i}^{2}{x_{i}^{*}}^{2}\sigma^{2}+2\sigma^{4}$ . We find

[TABLE]

With the same arguments as in the proofs of Theorems 1 and 2 we obtain

[TABLE]

Again, an application of Kolmogorov’s maximal inequality yields

[TABLE]

and the first claim of the theorem follows with $\operatorname*{cond}(A)=\gamma_{1}/\gamma_{m}=1/\gamma_{m}$ . Moreover, in a similar manner as in the proofs of the previous theorems, we find

[TABLE]

and by the $L^{p}$ maximal inequality the second claim now follows as

[TABLE]

∎

Appendix B Consistent LASSO Solver

We want to solve (2) with $R(x)=\|x\|_{1}$ for a large number of different values of $\alpha$ but need to ensure that the results are comparable and consistent. For this, we rely on an implementation of the scaled version of ADMM [6] that carries out the iterations for all $\alpha$ simultaneously, with the same penalty parameter $\rho$ for all $\alpha$ and a stop criterion based on the maximal primal and dual residuum over all $\alpha$ . Online adaptation of $\rho$ is also performed based on primal and dual residua for all $\alpha$ . While ensuring the consistency of the results, this leads to sub-optimal performance for individual $\alpha$ ’s which has to be countered by using a large number of iterations to obtain high accuracies.

Algorithm 1 (All-At-Once ADMM).

Given $\alpha_{1},\ldots,\alpha_{N_{\alpha}}$ , $\rho>0$ (penalty parameter), $\tau>1$ , $\mu>1$ (adaptation parameters), $K\in\mathbb{N}$ (max. iterations) and $\varepsilon\geqslant 0$ (stopping tolerance), initialize $X^{0},Z^{0},U^{0}\in\mathbb{R}^{n\times N_{\alpha}}$ by [math], and $Y=y\otimes\mathds{1}_{N_{\alpha}}^{T}$ , $\Lambda=[\alpha_{1},\ldots,\alpha_{N_{\alpha}}]\otimes\mathds{1}_{n}$ , where $\mathds{1}_{q}$ denotes an all-one column vector in $\mathbb{R}^{q}$ . Further, let $\odot$ denote the component-wise multiplication between matrices (Hadamard product).

For $k=1,\ldots,K$ do:

[TABLE]

The algorithm returns both $X_{(\cdot,i)}^{k+1}$ and $Z_{(\cdot,i)}^{k+1}$ as approximations of the solution to (2) with $R(x)=\|x\|_{1}$ and $\alpha=\alpha_{i}$ of which we use $Z_{(\cdot,i)}^{k+1}$ for our purposes as it is exactly sparse due to the soft-thresholding step (z-update). In the computations, we furthermore initialized $\rho=1$ and used $\tau=2$ , $\mu=1.1$ , $\varepsilon=10^{-14}$ and $K=10^{4}$ .
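The idea of Algorithm 1 can be sketched in a few lines of NumPy. This is not the implementation used for the computations in this paper but a minimal reconstruction under stated assumptions: the objective is taken as $\frac{1}{2}\|Ax-y\|_{2}^{2}+\alpha\|x\|_{1}$ , the updates follow the scaled ADMM for the LASSO and the residual balancing rule described in [6], all iterates are initialized with zeros, and the stopping rule compares the largest column-wise primal and dual residual norms with an absolute tolerance. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(V, T):
    """Component-wise soft-thresholding of V with thresholds T."""
    return np.sign(V) * np.maximum(np.abs(V) - T, 0.0)

def all_at_once_admm(A, y, alphas, rho=1.0, tau=2.0, mu=1.1,
                     max_iter=10_000, tol=1e-14):
    """Scaled ADMM for the LASSO, run for all alphas with one shared rho."""
    alphas = np.asarray(alphas, dtype=float)
    n, n_alpha = A.shape[1], alphas.size
    # Eigendecomposition A^T A = V diag(s) V^T makes the x-update cheap for any rho.
    s, V = np.linalg.eigh(A.T @ A)
    s = np.maximum(s, 0.0)                       # guard against round-off
    Aty = (A.T @ y)[:, None]                     # broadcast over alpha columns
    Lam = np.tile(alphas, (n, 1))                # column j holds alpha_j
    X = np.zeros((n, n_alpha))
    Z = np.zeros((n, n_alpha))
    U = np.zeros((n, n_alpha))
    for _ in range(max_iter):
        # x-update: solve (A^T A + rho I) X = A^T y + rho (Z - U) column-wise.
        B = Aty + rho * (Z - U)
        X = V @ ((V.T @ B) / (s + rho)[:, None])
        # z-update: soft-thresholding, which makes Z exactly sparse.
        Z_old = Z
        Z = soft_threshold(X + U, Lam / rho)
        # Scaled dual update.
        U = U + X - Z
        # Largest primal and dual residual over all alphas.
        r = np.max(np.linalg.norm(X - Z, axis=0))
        d = rho * np.max(np.linalg.norm(Z - Z_old, axis=0))
        if max(r, d) <= tol:
            break
        # Residual balancing; the scaled dual variable is rescaled with rho.
        if r > mu * d:
            rho *= tau
            U /= tau
        elif d > mu * r:
            rho /= tau
            U *= tau
    return X, Z
```

For $A=I$ the LASSO solution is the soft-thresholding of $y$ at level $\alpha$ , which gives a simple sanity check for the returned `Z`. Sharing `rho` across all columns is what keeps the solutions comparable over the grid, at the price of more iterations for individual values of $\alpha$ .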

Acknowledgements. The work of N. Bissantz, H. Dette and K. Proksch has been supported by the Collaborative Research Center “Statistical modeling of nonlinear dynamic processes” (SFB 823, Projects A1, C1, C4) of the German Research Foundation (DFG).

Bibliography

[1] Robert J. Adler and Jonathan E. Taylor, Random fields and geometry, Springer Monographs in Mathematics, Springer, New York, 2007. MR 2319516
[2] M. S. C. Almeida and M. A. T. Figueiredo, Parameter estimation for blind and non-blind deblurring using residual whiteness measures, IEEE Transactions on Image Processing 22 (2013), no. 7, 2751–63.
[3] F. Bauer and T. Hohage, A Lepskij-type stopping rule for regularized Newton methods, Inverse Problems 21 (2005), no. 6, 1975–1991.
[4] Gilles Blanchard and Peter Mathé, Discrepancy principle for statistical inverse problems with application to conjugate gradient iteration, Inverse Problems 28 (2012), no. 11, 115011.
[5] Peter Blomgren and Tony F. Chan, Modular solvers for image restoration problems using the discrepancy principle, Numerical Linear Algebra with Applications 9 (2002), no. 5, 347–358.
[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 3 (2011), no. 1, 1–122.
[7] Björn Bringmann, Daniel Cremers, Felix Krahmer, and Michael Möller, The homotopy method revisited: Computing solution paths of $\ell_{1}$ -regularized problems, arXiv preprint arXiv:1605.00071 (2016).
[8] M. Burger, A. Sawatzky, and G. Steidl, First Order Algorithms in Variational Image Processing, arXiv (2014), no. 1412.4237, 60.
