Back-Projection based Fidelity Term for Ill-Posed Linear Inverse Problems

Tom Tirer; Raja Giryes

arXiv:1906.06794·cs.CV·May 4, 2020

TL;DR

This paper introduces a novel fidelity term based on back-projection for ill-posed linear inverse problems, demonstrating its advantages over traditional least squares in certain conditions and validating its effectiveness with various priors.

Contribution

The paper proposes a new back-projection based fidelity term for inverse problems, providing theoretical analysis and empirical validation against standard least squares methods.

Findings

1. The back-projection fidelity term performs better with badly conditioned operators.

2. Theoretical analysis shows advantages of the new term in specific scenarios.

3. Empirical results confirm improved performance with complex priors.

Abstract

Ill-posed linear inverse problems appear in many image processing applications, such as deblurring, super-resolution and compressed sensing. Many restoration strategies involve minimizing a cost function, which is composed of fidelity and prior terms, balanced by a regularization parameter. While a vast amount of research has been focused on different prior models, the fidelity term is almost always chosen to be the least squares (LS) objective, that encourages fitting the linearly transformed optimization variable to the observations. In this paper, we examine a different fidelity term, which has been implicitly used by the recently proposed iterative denoising and backward projections (IDBP) framework. This term encourages agreement between the projection of the optimization variable onto the row space of the linear operator and the pseudo-inverse of the linear operator…

Tables

Table I: Reconstruction PSNR [dB] (averaged over 50 images from CelebA) for super-resolution with Gaussian filter and scale factor of 3, using DCGAN prior and ADAM optimizer.

| Scenario | Bicubic | LS est. | BP est. |
|---|---|---|---|
| SR x3 | 23.04 | 23.02 | 23.77 |

Table II: Reconstruction PSNR [dB] (averaged over 50 images from CelebA) for compressed sensing with Gaussian measurement matrix, using DCGAN prior and ADAM optimizer.

| Scenario | Naive $\bm{A}^{\dagger}\bm{y}$ | LS est. | BP est. |
|---|---|---|---|
| CS $m/n=0.1$ | 12.07 | 22.78 | 22.80 |
| CS $m/n=0.3$ | 13.22 | 23.55 | 23.62 |
| CS $m/n=0.5$ | 14.71 | 23.67 | 23.82 |

Equations

$$\bm{y}=\bm{A}\bm{x}+\bm{e},$$

$$f(\tilde{\bm{x}})=\ell(\tilde{\bm{x}})+\beta s(\tilde{\bm{x}}),$$

$$\ell_{LS}(\tilde{\bm{x}})\triangleq\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2},$$

$$\ell_{BP}(\tilde{\bm{x}})\triangleq\frac{1}{2}\|\bm{A}^{\dagger}\bm{y}-\bm{A}^{\dagger}\bm{A}\tilde{\bm{x}}\|_{2}^{2},$$

$$f_{LS}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}+\beta s(\tilde{\bm{x}}),\qquad f_{BP}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}+\beta s(\tilde{\bm{x}}),$$

$$\nabla f_{LS}(\tilde{\bm{x}})=-\bm{A}^{T}(\bm{y}-\bm{A}\tilde{\bm{x}})+\beta\bm{D}^{T}\bm{D}\tilde{\bm{x}}=\bm{0}\ \Rightarrow\ \hat{\bm{x}}_{LS}=(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{y},$$

$$\nabla f_{BP}(\tilde{\bm{x}})=-\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})+\beta\bm{D}^{T}\bm{D}\tilde{\bm{x}}=\bm{0}\ \Rightarrow\ \hat{\bm{x}}_{BP}=(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{\dagger}\bm{y}.$$

$$\begin{aligned}
MSE_{LS}&=\mathbb{E}\|\hat{\bm{x}}_{LS}-\bm{x}\|_{2}^{2}\\
&=\mathbb{E}\left\|(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}(\bm{A}\bm{x}+\bm{e})-\bm{x}\right\|_{2}^{2}\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}\\
&\quad+2\mathbb{E}[\bm{e}]^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{A}\bm{x}\\
&\quad-2\mathbb{E}[\bm{e}]^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{x}\\
&\quad+\mathbb{E}\left[\bm{e}^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{e}\right]\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\mathrm{Tr}\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\mathbb{E}[\bm{e}\bm{e}^{T}]\bm{A}\right)\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{A}\right)\\
&=\left\|\bm{V}\left((\bm{\Lambda}^{T}\bm{\Lambda}+\beta\bm{\Gamma}^{2})^{-1}\bm{\Lambda}^{T}\bm{\Lambda}-\bm{I}_{n}\right)\bm{V}^{T}\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left(\bm{V}(\bm{\Lambda}^{T}\bm{\Lambda}+\beta\bm{\Gamma}^{2})^{-2}\bm{\Lambda}^{T}\bm{\Lambda}\bm{V}^{T}\right)\\
&=\sum_{i=1}^{n}\Big(\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}+\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{2}}{(\lambda_{i}^{2}+\beta\gamma_{i}^{2})^{2}}.
\end{aligned}$$

$$bias_{LS}^{2}\triangleq\sum_{i=1}^{n}\Big(\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2},\qquad var_{LS}\triangleq\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{2}}{(\lambda_{i}^{2}+\beta\gamma_{i}^{2})^{2}},$$

$$MSE_{LS}=bias_{LS}^{2}+var_{LS}.$$

$$\bm{P}_{A}=\bm{V}\bm{I}_{i\leq m}\bm{V}^{T},\qquad \bm{A}^{\dagger}=\bm{V}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-1}\bm{U}^{T},\qquad \bm{A}^{\dagger}\bm{A}^{\dagger T}=\bm{V}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-2}\bm{\Lambda}\bm{V}^{T}.$$

$$\begin{aligned}
MSE_{BP}&=\mathbb{E}\|\hat{\bm{x}}_{BP}-\bm{x}\|_{2}^{2}\\
&=\mathbb{E}\left\|(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{\dagger}(\bm{A}\bm{x}+\bm{e})-\bm{x}\right\|_{2}^{2}\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}\\
&\quad+2\mathbb{E}[\bm{e}]^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{P}_{A}\bm{x}\\
&\quad-2\mathbb{E}[\bm{e}]^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{x}\\
&\quad+\mathbb{E}\left[\bm{e}^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\bm{e}\right]\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\mathrm{Tr}\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\mathbb{E}[\bm{e}\bm{e}^{T}]\bm{A}^{\dagger T}\right)\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\bm{A}^{\dagger T}\right)\\
&=\left\|\bm{V}\left((\bm{I}_{i\leq m}+\beta\bm{\Gamma}^{2})^{-1}\bm{I}_{i\leq m}-\bm{I}_{n}\right)\bm{V}^{T}\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left(\bm{V}(\bm{I}_{i\leq m}+\beta\bm{\Gamma}^{2})^{-2}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-2}\bm{\Lambda}\bm{V}^{T}\right)\\
&=\sum_{i=1}^{n}\Big(\frac{1_{i\leq m}}{1_{i\leq m}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}+\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{-2}1_{i\leq m}}{(1_{i\leq m}+\beta\gamma_{i}^{2})^{2}}.
\end{aligned}$$

$$bias_{BP}^{2}\triangleq\sum_{i=1}^{n}\Big(\frac{1_{i\leq m}}{1_{i\leq m}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2},\qquad var_{BP}\triangleq\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{-2}1_{i\leq m}}{(1_{i\leq m}+\beta\gamma_{i}^{2})^{2}},$$

$$MSE_{BP}=bias_{BP}^{2}+var_{BP}.$$

$$\sum_{i=1}^{m}bias_{BP}^{2(i)}=\sum_{i=1}^{m}\Big(\frac{\beta_{LS}\gamma_{i}^{2}}{\lambda_{1}^{2}+\beta_{LS}\gamma_{i}^{2}}\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}\leq\sum_{i=1}^{m}\Big(\frac{\beta_{LS}\gamma_{i}^{2}}{\lambda_{i}^{2}+\beta_{LS}\gamma_{i}^{2}}\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}=\sum_{i=1}^{m}bias_{LS}^{2(i)}.$$

Code & Models

Repositories

tomtirer/BP-term
tf

Full text

Back-Projection based Fidelity Term for Ill-Posed Linear Inverse Problems

Tom Tirer and Raja Giryes

This work was supported by the European Research Council (ERC starting grant 757497, PI Giryes). The authors are with the School of Electrical Engineering, Tel Aviv University, Tel Aviv 69978, Israel (email: [email protected], [email protected]).

Abstract

Ill-posed linear inverse problems appear in many image processing applications, such as deblurring, super-resolution and compressed sensing. Many restoration strategies involve minimizing a cost function, which is composed of fidelity and prior terms, balanced by a regularization parameter. While a vast amount of research has been focused on different prior models, the fidelity term is almost always chosen to be the least squares (LS) objective, which encourages fitting the linearly transformed optimization variable to the observations. In this paper, we examine a different fidelity term, which has been implicitly used by the recently proposed iterative denoising and backward projections (IDBP) framework. This term encourages agreement between the projection of the optimization variable onto the row space of the linear operator and the pseudo-inverse of the linear operator ("back-projection") applied on the observations. We analytically examine the difference between the two fidelity terms for Tikhonov regularization and identify cases (such as a badly conditioned linear operator) where the new term has an advantage over the standard LS one. Moreover, we demonstrate empirically that the behavior of the two induced cost functions for sophisticated convex and non-convex priors, such as total-variation, BM3D, and deep generative models, correlates with the obtained theoretical analysis.

Index Terms:

Inverse problems, image restoration, image deblurring, image super-resolution, compressed sensing, total variation, non-convex priors, BM3D, deep generative models.

I Introduction

Inverse problems appear in many fields of science and engineering, where the goal is to recover a signal from its observations that are obtained by some acquisition process. In image processing, the observations are usually a degraded version of the latent image, which may be noisy, blurred, downsampled, or all together. Such observation models, and others, can be formulated by a linear model

$$\bm{y}=\bm{A}\bm{x}+\bm{e}, \tag{1}$$

where $\bm{x}\in\mathbb{R}^{n}$ represents the unknown original image, $\bm{y}\in\mathbb{R}^{m}$ represents the observations, $\bm{A}$ is an $m\times n$ degradation matrix (sometimes also referred to as the measurement matrix) and $\bm{e}\in\mathbb{R}^{m}$ is a noise vector. For example, this model corresponds to the problem of denoising [1, 2, 3, 4] when $\bm{A}$ is the $n\times n$ identity matrix $\bm{I}_{n}$ ; inpainting [5, 6, 7] when $\bm{A}$ is an $m\times n$ sampling matrix (i.e. a selection of m rows of $\bm{I}_{n}$ ); deblurring [8, 9] when $\bm{A}$ is a blur operator; super-resolution [10, 11] if $\bm{A}$ is a composite operator of blurring (e.g. anti-aliasing filtering) and down-sampling; and compressed sensing when $\bm{A}$ is a (random) measurement matrix ( $m\ll n$ ) and the signal is sparse under some basis representation [12, 13, 14] or resides in a general union of low-dimensional subspaces [15, 16].

The inverse problems represented by (1) are usually ill-posed, i.e. the measurements do not suffice for obtaining a successful reconstruction. Therefore, a vast amount of research has focused on designing good prior models for natural images. In fact, many of the methods for the problems mentioned above differ only in their prior assumptions and not in the way that they enforce fidelity to the observations.

To be more formal, a common strategy for recovering $\bm{x}$ aims at minimizing a cost function of the form

$$f(\tilde{\bm{x}})=\ell(\tilde{\bm{x}})+\beta s(\tilde{\bm{x}}), \tag{2}$$

where $\ell(\tilde{\bm{x}})$ is a fidelity term, $s(\tilde{\bm{x}})$ is a prior term (can be also referred to as the regularizer), $\beta$ is a positive scalar that controls the level of regularization, and $\tilde{\bm{x}}$ is the optimization variable. Many different prior functions are used in the literature, whether explicitly, e.g. total-variation (TV) [1], or implicitly, e.g. BM3D [4] and deep generative models [17]. Yet, most of the works use a typical least squares (LS) fidelity term

$$\ell_{LS}(\tilde{\bm{x}})\triangleq\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}, \tag{3}$$

where $\|\cdot\|_{2}$ stands for the Euclidean norm. The frequent usage of this term is probably also motivated by the fact that it can be derived from the negative log-likelihood function, under the assumption that the noise $\bm{e}$ is a vector of i.i.d. Gaussian random variables $e_{i}\sim\mathcal{N}(0,\sigma_{e}^{2})$ . However, note that, in general, maximum likelihood estimation has optimality properties only when the number of measurements is much larger than the number of unknown variables, which is obviously not the case in ill-posed problems.

In this paper, we examine a different fidelity term, which has been implicitly used by the recently proposed iterative denoising and backward projections (IDBP) framework [18] (we elaborate on this method in the appendix). Under the practical assumptions that $m\leq n$ and $\mathrm{rank}(\bm{A})=m$ , we examine the fidelity term

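$$\ell_{BP}(\tilde{\bm{x}})\triangleq\frac{1}{2}\|\bm{A}^{\dagger}\bm{y}-\bm{A}^{\dagger}\bm{A}\tilde{\bm{x}}\|_{2}^{2}, \tag{4}$$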

where $\bm{A}^{\dagger}\triangleq\bm{A}^{T}(\bm{A}\bm{A}^{T})^{-1}$ is the pseudoinverse of the full row-rank matrix $\bm{A}$ . Note that $\bm{P}_{A}\triangleq\bm{A}^{\dagger}\bm{A}$ is an orthogonal projection onto the row space of $\bm{A}$ (by the row space of $\bm{A}$ we mean the subspace spanned by the rows of $\bm{A}$ ), and that $\bm{A}^{\dagger}$ can be interpreted as a "back-projection" (BP) from $\bm{A}\mathbb{R}^{n}$ back to $\mathbb{R}^{n}$ . Therefore, the fidelity (4) encourages agreement between $\bm{P}_{A}\tilde{\bm{x}}$ (the projection of $\tilde{\bm{x}}$ onto the row space of $\bm{A}$ ) and $\bm{A}^{\dagger}\bm{y}$ (the back-projection of the measurements). In general, this is different from $\ell_{LS}(\tilde{\bm{x}})$ , which encourages agreement between $\bm{A}\tilde{\bm{x}}$ and $\bm{y}$ . Note that in the noiseless case, i.e. when $\bm{y}=\bm{A}\bm{x}$ , the terms in (3) and (4) are translated to fitting $\bm{A}\tilde{\bm{x}}$ to $\bm{A}\bm{x}$ and $\bm{P}_{A}\tilde{\bm{x}}$ to $\bm{P}_{A}\bm{x}$ , respectively.

Note that for some inverse problems $\ell_{LS}(\tilde{\bm{x}})$ and $\ell_{BP}(\tilde{\bm{x}})$ may coincide. For example, in image inpainting, where $\bm{A}$ is a selection of $m$ rows of $\bm{I}_{n}$ , we have that $\bm{A}^{\dagger}=\bm{A}^{T}$ is an $n\times m$ matrix that merely pads with $n-m$ zeros the vector on which it is applied, and so $\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}=\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}$ . Therefore, we specifically focus on three popular inverse problems: super-resolution, deblurring and certain compressed sensing scenarios, where the two fidelity terms, $\ell_{LS}(\tilde{\bm{x}})$ and $\ell_{BP}(\tilde{\bm{x}})$ , are indeed very different.
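To make the two fidelity terms concrete, the following NumPy sketch evaluates (3) and (4) and checks that they coincide for an inpainting operator but not for a generic full row-rank matrix. The function names, sizes and random test data are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

def ls_fidelity(A, y, x):
    """Least squares fidelity: 0.5 * ||y - A x||^2."""
    r = y - A @ x
    return 0.5 * float(r @ r)

def bp_fidelity(A, y, x):
    """Back-projection fidelity: 0.5 * ||A^+ (y - A x)||^2, A^+ = A^T (A A^T)^{-1}."""
    r = y - A @ x
    # Solve (A A^T) z = r rather than forming the inverse explicitly.
    z = np.linalg.solve(A @ A.T, r)
    bp = A.T @ z
    return 0.5 * float(bp @ bp)

rng = np.random.default_rng(0)
n, m = 12, 5                              # hypothetical problem sizes
x_true = rng.standard_normal(n)
x_cand = rng.standard_normal(n)           # an arbitrary candidate solution

# Inpainting: A keeps m rows of the identity, so A^+ = A^T and the terms coincide.
rows = rng.choice(n, size=m, replace=False)
A_inp = np.eye(n)[rows]
y_inp = A_inp @ x_true
ls_inp = ls_fidelity(A_inp, y_inp, x_cand)
bp_inp = bp_fidelity(A_inp, y_inp, x_cand)

# Generic full row-rank A: the two terms generally differ.
A_gen = rng.standard_normal((m, n))
y_gen = A_gen @ x_true
ls_gen = ls_fidelity(A_gen, y_gen, x_cand)
bp_gen = bp_fidelity(A_gen, y_gen, x_cand)
print(ls_inp, bp_inp, ls_gen, bp_gen)
```

For large images one would apply $\bm{A}$ and $\bm{A}^{\dagger}$ as operators instead of dense matrices; the dense form is used here only to keep the sketch short.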

Contribution. This work makes a first attempt towards characterizing for which observation model $\bm{A}$ and prior $s(\tilde{\bm{x}})$ it is better to use each of the following objectives:

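$$f_{LS}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}+\beta s(\tilde{\bm{x}}), \tag{5}$$

$$f_{BP}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}+\beta s(\tilde{\bm{x}}). \tag{6}$$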

Particularly, for $s(\tilde{\bm{x}})$ being the Tikhonov regularization (the $\ell_{2}$ prior), where closed-form solutions exist, we derive analytical expressions for the estimations' mean square error (MSE) that allow us to examine which fidelity term is preferable. For example, we show that in the noiseless case $f_{BP}(\tilde{\bm{x}})$ yields provably better restoration than $f_{LS}(\tilde{\bm{x}})$ if the condition number of $\bm{A}\bm{A}^{T}$ (i.e. the ratio between the largest and smallest squared singular values of $\bm{A}$ ) is large, e.g. in typical super-resolution problems.

For sophisticated convex and non-convex priors, such as TV [1], BM3D [4], and DCGAN [19], analytical analysis is intractable. Therefore, we perform an intensive empirical study, where we use the same optimization method (FISTA [20] or ADAM [21]) to minimize each of the two different cost functions. Interestingly, we demonstrate that the behavior for the sophisticated priors strongly correlates with properties for which we establish concrete mathematical reasoning in the case of $\ell_{2}$ priors.

Another contribution of the paper, deferred to the appendix, is showing that the IDBP framework [18], which has achieved excellent results for deblurring [18, 22] and super-resolution [23], is in fact the proximal gradient method [20, 24] (popularized under the name ISTA) applied on $f_{BP}(\tilde{\bm{x}})$ . This derivation of IDBP is completely different from, and arguably simpler than, the way it is developed in [18].

The paper is organized as follows. Section II includes mathematical analysis of the two cost functions for the case of $\ell_{2}$ -type priors. The analytical results are verified in Section III. In Section IV the two cost functions are empirically examined for different sophisticated priors. Section V concludes the paper.

II Mathematical Analysis for $\ell_{2}$ Priors

In this section, we analyze the performance of the new cost function (6) and compare it to (5) for a type of $\ell_{2}$ prior functions, for which the closed-form solutions of (5) and (6) lead to a tractable performance analysis. We start with specifying the required assumptions, then we derive the estimators and expressions for their expected mean square error. Finally, the error expressions are compared and several observations are stated.

II-A Assumptions

In order to allow a concrete mathematical comparison between $f_{BP}(\tilde{\bm{x}})$ and $f_{LS}(\tilde{\bm{x}})$ , in the theoretical analysis we restrict our discussion to $\ell_{2}$ prior functions of the form $s(\tilde{\bm{x}})=\frac{1}{2}\|\bm{D}\tilde{\bm{x}}\|_{2}^{2}=\frac{1}{2}\tilde{\bm{x}}^{T}\bm{D}^{T}\bm{D}\tilde{\bm{x}}$ , where $\bm{D}^{T}\bm{D}$ is a positive-definite matrix. This prior is often referred to as Tikhonov regularization and is one of the most widely used methods to solve ill-posed inverse problems. Yet, for obtaining analytical results, we further focus on a more specific type of this prior: we require that both $\bm{A}$ and $\bm{D}$ have the same right singular vectors. Let us define the singular value decomposition (SVD) of the $m\times n$ matrix $\bm{A}=\bm{U}\bm{\Lambda}\bm{V}^{T}$ , where $\bm{U}$ is an $m\times m$ orthogonal matrix whose columns are the left singular vectors, $\bm{\Lambda}$ is an $m\times n$ rectangular diagonal matrix with nonzero singular values $\{\lambda_{i}\}_{i=1}^{m}$ on the diagonal, and $\bm{V}$ is an $n\times n$ orthogonal matrix whose columns are the right singular vectors. The property that $\{\lambda_{i}\}_{i=1}^{m}$ are strictly positive follows from our assumptions in Section I, that $m\leq n$ and $\mathrm{rank}(\bm{A})=m$ . For $\bm{D}$ , essentially, we assume that $\bm{D}^{T}\bm{D}=\bm{V}\bm{\Gamma}^{2}\bm{V}^{T}\succ 0$ , where $\bm{\Gamma}^{2}$ is an $n\times n$ diagonal matrix of nonzero eigenvalues $\{\gamma_{i}^{2}\}_{i=1}^{n}$ .

The assumption above is required because, as far as we know, currently there is no known analytical expression for the eigen-decomposition of arbitrary matrices $\bm{A}^{T}\bm{A}+\bm{D}^{T}\bm{D}$ which is required for our analysis [25]. Yet, this assumption holds in some practical cases, e.g. if $\bm{A}$ and $\bm{D}$ are circulant matrices (and thus diagonalized by the discrete Fourier transform), or if $\bm{D}=\bm{I}_{n}$ (i.e. least-norm regularization).

II-B Performance analysis

Let us start with obtaining closed-form expressions for the estimators $\hat{\bm{x}}_{LS}$ and $\hat{\bm{x}}_{BP}$ , which minimize $f_{LS}(\tilde{\bm{x}})$ and $f_{BP}(\tilde{\bm{x}})$ , respectively. Due to the convexity of the cost functions, this is done simply by equating their gradients to zero

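$$\nabla f_{LS}(\tilde{\bm{x}})=-\bm{A}^{T}(\bm{y}-\bm{A}\tilde{\bm{x}})+\beta\bm{D}^{T}\bm{D}\tilde{\bm{x}}=\bm{0}\ \Rightarrow\ \hat{\bm{x}}_{LS}=(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{y}, \tag{7}$$

$$\nabla f_{BP}(\tilde{\bm{x}})=-\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})+\beta\bm{D}^{T}\bm{D}\tilde{\bm{x}}=\bm{0}\ \Rightarrow\ \hat{\bm{x}}_{BP}=(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{\dagger}\bm{y}. \tag{8}$$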

In (8) we use the properties $\bm{P}_{A}\triangleq\bm{A}^{\dagger}\bm{A}=\bm{P}_{A}^{T}=\bm{P}_{A}^{2}$ and $\bm{P}_{A}\bm{A}^{\dagger}=\bm{A}^{\dagger}$ .
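As a sanity check on the two closed-form estimators, the sketch below computes them with NumPy for a small random problem and verifies that the gradient of each cost vanishes at its estimator. The helper names, the choice $\bm{D}=\bm{I}_{n}$ and the test data are assumptions made for illustration only.

```python
import numpy as np

def estimate_ls(A, y, beta, DtD):
    """x_LS = (A^T A + beta D^T D)^{-1} A^T y."""
    return np.linalg.solve(A.T @ A + beta * DtD, A.T @ y)

def estimate_bp(A, y, beta, DtD):
    """x_BP = (P_A + beta D^T D)^{-1} A^+ y, with P_A = A^+ A."""
    A_pinv = A.T @ np.linalg.inv(A @ A.T)   # valid for full row-rank A
    P_A = A_pinv @ A
    return np.linalg.solve(P_A + beta * DtD, A_pinv @ y)

rng = np.random.default_rng(0)
m, n, beta = 6, 15, 0.1                     # hypothetical sizes and regularization
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)
DtD = np.eye(n)                             # least-norm regularization, D = I_n

x_ls = estimate_ls(A, y, beta, DtD)
x_bp = estimate_bp(A, y, beta, DtD)

# Gradients of f_LS and f_BP evaluated at the respective estimators.
A_pinv = A.T @ np.linalg.inv(A @ A.T)
grad_ls = -A.T @ (y - A @ x_ls) + beta * DtD @ x_ls
grad_bp = -A_pinv @ (y - A @ x_bp) + beta * DtD @ x_bp
print(np.linalg.norm(grad_ls), np.linalg.norm(grad_bp))
```

Both gradient norms should be at the level of floating-point rounding, which is consistent with the two costs being convex and minimized at these points.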

We turn to compute the expected mean square errors (MSEs) of the estimators, conditioned on $\bm{x}$ , under the assumptions that $\mathbb{E}[\bm{e}]=\mathbf{0}$ and $\mathbb{E}[\bm{e}\bm{e}^{T}]=\sigma_{e}^{2}\bm{I}_{m}$ . To ease formulations, we define the $n-m$ zero eigenvalues of $\bm{A}^{T}\bm{A}$ (i.e. zeros in the diagonal of $\bm{\Lambda}^{T}\bm{\Lambda}$ ) by $\{\lambda_{i}^{2}\}_{i=m+1}^{n}$ .

The computation of the MSE of $\hat{\bm{x}}_{LS}$ is given by

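$$\begin{aligned}
MSE_{LS}&=\mathbb{E}\|\hat{\bm{x}}_{LS}-\bm{x}\|_{2}^{2}\\
&=\mathbb{E}\left\|(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}(\bm{A}\bm{x}+\bm{e})-\bm{x}\right\|_{2}^{2}\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}\\
&\quad+2\mathbb{E}[\bm{e}]^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{A}\bm{x}\\
&\quad-2\mathbb{E}[\bm{e}]^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{x}\\
&\quad+\mathbb{E}\left[\bm{e}^{T}\bm{A}(\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{e}\right]\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\mathrm{Tr}\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\mathbb{E}[\bm{e}\bm{e}^{T}]\bm{A}\right)\\
&=\left\|\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{T}\bm{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left((\bm{A}^{T}\bm{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{T}\bm{A}\right)\\
&=\left\|\bm{V}\left((\bm{\Lambda}^{T}\bm{\Lambda}+\beta\bm{\Gamma}^{2})^{-1}\bm{\Lambda}^{T}\bm{\Lambda}-\bm{I}_{n}\right)\bm{V}^{T}\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left(\bm{V}(\bm{\Lambda}^{T}\bm{\Lambda}+\beta\bm{\Gamma}^{2})^{-2}\bm{\Lambda}^{T}\bm{\Lambda}\bm{V}^{T}\right)\\
&=\sum_{i=1}^{n}\Big(\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}+\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{2}}{(\lambda_{i}^{2}+\beta\gamma_{i}^{2})^{2}}.
\end{aligned} \tag{9}$$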

The second equality follows from substituting (1) in (7), the fourth equality uses $\mathbb{E}\left[\bm{e}\right]=\mathbf{0}$ and the cyclic property of trace, the fifth equality uses $\mathbb{E}\left[\bm{e}\bm{e}^{T}\right]=\sigma_{e}^{2}\bm{I}_{m}$ , the sixth equality is obtained by substituting the eigen-decompositions of $\bm{A}^{T}\bm{A}$ and $\bm{D}^{T}\bm{D}$ , and the last equality follows from the fact that $\bm{V}$ is an orthogonal matrix. Therefore, by defining the (squared) bias and variance terms as

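$$\begin{aligned}
bias_{LS}^{2}&\triangleq\sum_{i=1}^{n}\Big(\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}\triangleq\sum_{i=1}^{n}bias_{LS}^{2(i)},\\
var_{LS}&\triangleq\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{2}}{(\lambda_{i}^{2}+\beta\gamma_{i}^{2})^{2}}\triangleq\sum_{i=1}^{n}var_{LS}^{(i)},
\end{aligned} \tag{10}$$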

we may write the error as

$$MSE_{LS}=bias_{LS}^{2}+var_{LS}. \tag{11}$$

Note that the bias depends on the original image $\bm{x}$ and not on the noise, and the opposite holds for the variance. Yet, both terms are affected by the structure of $\bm{A}$ . The regularization parameters $\beta,\{\gamma_{i}\}$ introduce a tradeoff: increasing them reduces the variance but increases the bias.

To ease the computation of the MSE of $\hat{\bm{x}}_{BP}$ , let us also define an indicator function $1_{i\leq m}$ that is equal to 1 if $i\leq m$ and 0 otherwise, and an $n\times n$ diagonal matrix $\bm{I}_{i\leq m}$ with $\{1_{i\leq m}\}_{i=1}^{n}$ on its diagonal. The following identities are used

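$$\bm{P}_{A}=\bm{V}\bm{I}_{i\leq m}\bm{V}^{T},\qquad \bm{A}^{\dagger}=\bm{V}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-1}\bm{U}^{T},\qquad \bm{A}^{\dagger}\bm{A}^{\dagger T}=\bm{V}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-2}\bm{\Lambda}\bm{V}^{T}. \tag{12}$$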

Now, we get

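$$\begin{aligned}
MSE_{BP}&=\mathbb{E}\|\hat{\bm{x}}_{BP}-\bm{x}\|_{2}^{2}\\
&=\mathbb{E}\left\|(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{A}^{\dagger}(\bm{A}\bm{x}+\bm{e})-\bm{x}\right\|_{2}^{2}\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}\\
&\quad+2\mathbb{E}[\bm{e}]^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{P}_{A}\bm{x}\\
&\quad-2\mathbb{E}[\bm{e}]^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{x}\\
&\quad+\mathbb{E}\left[\bm{e}^{T}\bm{A}^{\dagger T}(\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\bm{e}\right]\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\mathrm{Tr}\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\mathbb{E}[\bm{e}\bm{e}^{T}]\bm{A}^{\dagger T}\right)\\
&=\left\|\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-1}\bm{P}_{A}-\bm{I}_{n}\right)\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left((\bm{P}_{A}+\beta\bm{D}^{T}\bm{D})^{-2}\bm{A}^{\dagger}\bm{A}^{\dagger T}\right)\\
&=\left\|\bm{V}\left((\bm{I}_{i\leq m}+\beta\bm{\Gamma}^{2})^{-1}\bm{I}_{i\leq m}-\bm{I}_{n}\right)\bm{V}^{T}\bm{x}\right\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}\left(\bm{V}(\bm{I}_{i\leq m}+\beta\bm{\Gamma}^{2})^{-2}\bm{\Lambda}^{T}(\bm{\Lambda}\bm{\Lambda}^{T})^{-2}\bm{\Lambda}\bm{V}^{T}\right)\\
&=\sum_{i=1}^{n}\Big(\frac{1_{i\leq m}}{1_{i\leq m}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}+\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{-2}1_{i\leq m}}{(1_{i\leq m}+\beta\gamma_{i}^{2})^{2}}.
\end{aligned} \tag{13}$$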

The second equality follows from substituting (1) in (8), the fourth equality uses $\mathbb{E}\left[\bm{e}\right]=\mathbf{0}$ and the cyclic property of trace, the fifth equality uses $\mathbb{E}\left[\bm{e}\bm{e}^{T}\right]=\sigma_{e}^{2}\bm{I}_{m}$ , the sixth equality is obtained by substituting the eigen-decompositions of $\bm{P}_{A}$ , $\bm{D}^{T}\bm{D}$ and $\bm{A}^{\dagger}\bm{A}^{\dagger T}$ , and the last equality uses the orthogonality of $\bm{V}$ . Therefore, by defining

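$$\begin{aligned}
bias_{BP}^{2}&\triangleq\sum_{i=1}^{n}\Big(\frac{1_{i\leq m}}{1_{i\leq m}+\beta\gamma_{i}^{2}}-1\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}\triangleq\sum_{i=1}^{n}bias_{BP}^{2(i)},\\
var_{BP}&\triangleq\sigma_{e}^{2}\sum_{i=1}^{n}\frac{\lambda_{i}^{-2}1_{i\leq m}}{(1_{i\leq m}+\beta\gamma_{i}^{2})^{2}}\triangleq\sum_{i=1}^{n}var_{BP}^{(i)},
\end{aligned} \tag{14}$$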

we have that

$$MSE_{BP}=bias_{BP}^{2}+var_{BP}. \tag{15}$$
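The sum-form bias and variance expressions can be checked numerically against the matrix form of the expected error, $\|(\bm{M}\bm{A}-\bm{I}_{n})\bm{x}\|_{2}^{2}+\sigma_{e}^{2}\mathrm{Tr}(\bm{M}\bm{M}^{T})$ , where $\bm{M}$ denotes the linear map from $\bm{y}$ to the estimate. The following NumPy sketch does this for $\bm{D}=\bm{I}_{n}$ ; the symbol $\bm{M}$ , the sizes and the random data are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, beta, sigma_e = 8, 20, 0.05, 0.1          # hypothetical setting
A = rng.standard_normal((m, n)) / np.sqrt(n)    # hypothetical full row-rank operator
x = rng.standard_normal(n)                      # hypothetical signal

# Full SVD: lam holds the m singular values, Vt is n x n (rows are right singular vectors).
_, lam, Vt = np.linalg.svd(A)
lam2 = np.concatenate([lam**2, np.zeros(n - m)])  # eigenvalues of A^T A, zero-padded
ind = (np.arange(n) < m).astype(float)            # indicator 1_{i <= m}
gam2 = np.ones(n)                                 # D = I_n, so gamma_i = 1
c2 = (Vt @ x) ** 2                                # [V^T x]_i^2

# Sum-form squared bias and variance for the LS and BP estimators.
bias2_ls = np.sum((lam2 / (lam2 + beta * gam2) - 1.0) ** 2 * c2)
var_ls = sigma_e**2 * np.sum(lam2 / (lam2 + beta * gam2) ** 2)
bias2_bp = np.sum((ind / (ind + beta * gam2) - 1.0) ** 2 * c2)
var_bp = sigma_e**2 * np.sum(lam ** -2 / (1.0 + beta * gam2[:m]) ** 2)

def matrix_mse(M):
    """E||M (A x + e) - x||^2 for zero-mean noise with covariance sigma_e^2 I."""
    return np.sum((M @ A @ x - x) ** 2) + sigma_e**2 * np.sum(M**2)

A_pinv = A.T @ np.linalg.inv(A @ A.T)
M_ls = np.linalg.solve(A.T @ A + beta * np.eye(n), A.T)
M_bp = np.linalg.solve(A_pinv @ A + beta * np.eye(n), A_pinv)
mse_ls_matrix, mse_bp_matrix = matrix_mse(M_ls), matrix_mse(M_bp)
print(bias2_ls + var_ls, mse_ls_matrix)
print(bias2_bp + var_bp, mse_bp_matrix)
```

Computing the full SVD is only practical for small problems, which is also why the paper's verification experiments use small images.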

Comparing (10) and (14) we may notice the following. First, the term $bias_{BP}^{2}$ handles small $\{\lambda_{i}\}_{i=1}^{m}$ (i.e. singular values of $\bm{A}$ that are smaller than 1) better than $bias_{LS}^{2}$ . However, $var_{BP}$ handles such small singular values worse than $var_{LS}$ . The opposite holds for singular values that are greater than 1. This behavior can be formulated as the following observation.

Observation 1.

For $\lambda_{i}<1$ we have that $bias_{BP}^{2(i)}<bias_{LS}^{2(i)}$ but $var_{BP}^{(i)}>var_{LS}^{(i)}$ . And, for $\lambda_{i}>1$ we have that $bias_{BP}^{2(i)}>bias_{LS}^{2(i)}$ but $var_{BP}^{(i)}<var_{LS}^{(i)}$ .
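The inequalities in Observation 1 follow from a short calculation on the per-index terms. The lines below are a sketch of that step (not quoted from the paper), for an index $i\leq m$ with $[\bm{V}^{T}\bm{x}]_{i}\neq 0$ and $\sigma_{e}>0$ , using the shorthand $b\triangleq\beta\gamma_{i}^{2}>0$ introduced here for brevity:

$$\begin{aligned}
bias_{BP}^{2(i)}<bias_{LS}^{2(i)}&\iff\Big(\frac{b}{1+b}\Big)^{2}<\Big(\frac{b}{\lambda_{i}^{2}+b}\Big)^{2}\iff\lambda_{i}^{2}+b<1+b\iff\lambda_{i}<1,\\
var_{BP}^{(i)}>var_{LS}^{(i)}&\iff\frac{\lambda_{i}^{-2}}{(1+b)^{2}}>\frac{\lambda_{i}^{2}}{(\lambda_{i}^{2}+b)^{2}}\iff\lambda_{i}^{2}+b>\lambda_{i}^{2}(1+b)\iff\lambda_{i}<1.
\end{aligned}$$

All quantities are positive, so taking square roots preserves the inequalities. Reversing every inequality gives the case $\lambda_{i}>1$ .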

Notice that in the noiseless case $\sigma_{e}=0$ , implying that $MSE_{LS}=bias_{LS}^{2}$ and $MSE_{BP}=bias_{BP}^{2}$ . This leads us to the following observation for the noiseless case.

Observation 2.

In a noiseless scenario, the relation between $\sum\limits_{i=1}^{m}bias_{BP}^{2(i)}$ and $\sum\limits_{i=1}^{m}bias_{LS}^{2(i)}$ dictates the relation between $MSE_{BP}$ and $MSE_{LS}$ . In particular, if all the singular values of $\bm{A}$ are smaller than 1, then $MSE_{BP}<MSE_{LS}$ , and if all the singular values of $\bm{A}$ are greater than 1, then $MSE_{BP}>MSE_{LS}$ .

Note that Observation 2 holds for any given setting of $\beta$ that is used by the two estimators. Therefore, these relations between $MSE_{BP}$ and $MSE_{LS}$ hold also when $\beta$ is tuned for best performance of each estimator.

In practice, a different value of $\beta$ can be preferred for the different cost functions. Let us denote by $\beta_{LS}$ and $\beta_{BP}$ the regularization parameter in $\ell_{LS}(\tilde{\bm{x}})$ and $\ell_{BP}(\tilde{\bm{x}})$ , respectively, and let the singular values of $\bm{A}$ be indexed in a descending order, i.e. $\lambda_{1}\geq\ldots\geq\lambda_{m}$ . Comparing $MSE_{BP}$ and $MSE_{LS}$ with $\beta_{BP}\neq\beta_{LS}$ leads to an additional observation for the noiseless case, which is in favor of the BP cost.

Observation 3.

In a noiseless scenario, for any $\beta_{LS}$ and $\beta_{BP}=\beta_{LS}/\lambda_{1}^{2}$ , we have that $MSE_{BP}\leq MSE_{LS}$ . If in addition $[\bm{V}^{T}\bm{x}]_{i}\neq 0$ for some indices $2\leq i\leq m$ , then $MSE_{BP}<MSE_{LS}$ unless $\lambda_{i}=\lambda_{1}$ for all these indices.

Proof.

Since $\beta_{BP}=\beta_{LS}/\lambda_{1}^{2}$ , we have that $\frac{\beta_{BP}\gamma_{i}^{2}}{1+\beta_{BP}\gamma_{i}^{2}}=\frac{\beta_{LS}\gamma_{i}^{2}}{\lambda_{1}^{2}+\beta_{LS}\gamma_{i}^{2}}$ . Therefore,

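$$\sum_{i=1}^{m}bias_{BP}^{2(i)}=\sum_{i=1}^{m}\Big(\frac{\beta_{LS}\gamma_{i}^{2}}{\lambda_{1}^{2}+\beta_{LS}\gamma_{i}^{2}}\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}\leq\sum_{i=1}^{m}\Big(\frac{\beta_{LS}\gamma_{i}^{2}}{\lambda_{i}^{2}+\beta_{LS}\gamma_{i}^{2}}\Big)^{2}[\bm{V}^{T}\bm{x}]_{i}^{2}=\sum_{i=1}^{m}bias_{LS}^{2(i)}. \tag{16}$$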

If $[\bm{V}^{T}\bm{x}]_{i}\neq 0$ for some indices $2\leq i\leq m$ , it is easy to see that the inequality is strict unless $\lambda_{i}=\lambda_{1}$ for these indices. Finally, recall that in the noiseless case the relation between $\sum\limits_{i=1}^{m}bias_{BP}^{2(i)}$ and $\sum\limits_{i=1}^{m}bias_{LS}^{2(i)}$ dictates the relation between $MSE_{BP}$ and $MSE_{LS}$ , since the terms with $i>m$ are identical for the two estimators.

∎
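Observation 3 can also be checked numerically on the per-index bias terms. The sketch below draws arbitrary singular values (some above and some below 1), coefficients $[\bm{V}^{T}\bm{x}]_{i}^{2}$ and eigenvalues $\gamma_{i}^{2}$ , and compares the two bias sums over $i\leq m$ for several values of $\beta_{LS}$ with $\beta_{BP}=\beta_{LS}/\lambda_{1}^{2}$ . All numerical values and helper names are hypothetical choices for illustration.

```python
import numpy as np

def bias2_ls(lam, coeffs2, beta, gam2):
    """Sum over i <= m of (beta*gamma_i^2 / (lambda_i^2 + beta*gamma_i^2))^2 [V^T x]_i^2."""
    return np.sum((beta * gam2 / (lam**2 + beta * gam2)) ** 2 * coeffs2)

def bias2_bp(coeffs2, beta, gam2):
    """Sum over i <= m of (beta*gamma_i^2 / (1 + beta*gamma_i^2))^2 [V^T x]_i^2."""
    return np.sum((beta * gam2 / (1.0 + beta * gam2)) ** 2 * coeffs2)

rng = np.random.default_rng(2)
m = 50
lam = np.sort(rng.uniform(0.01, 3.0, size=m))[::-1]  # descending singular values
coeffs2 = rng.standard_normal(m) ** 2                # [V^T x]_i^2 for i <= m
gam2 = rng.uniform(0.5, 2.0, size=m)                 # gamma_i^2 for i <= m

results = []
for beta_ls in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    beta_bp = beta_ls / lam[0] ** 2                  # the scaling used in Observation 3
    results.append((beta_ls,
                    bias2_bp(coeffs2, beta_bp, gam2),
                    bias2_ls(lam, coeffs2, beta_ls, gam2)))

for beta_ls, b_bp, b_ls in results:
    print(f"beta_LS={beta_ls:g}: bias2_BP={b_bp:.4e}, bias2_LS={b_ls:.4e}")
```

The terms with $i>m$ are omitted because they equal $[\bm{V}^{T}\bm{x}]_{i}^{2}$ for both estimators regardless of $\beta$ .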

Even though Observation 2 and Observation 3 consider the noiseless case, note that they cover events where the gap between $bias_{LS}^{2}$ and $bias_{BP}^{2}$ may be substantial enough to dictate the relationship between the MSEs also when the noise level is moderate. For example, if $\beta_{BP}=\beta_{LS}$ and all the singular values are much smaller than 1, then the "in particular" part of Observation 2 implies that $\sum\limits_{i=1}^{m}bias_{BP}^{2(i)}$ is much smaller than $\sum\limits_{i=1}^{m}bias_{LS}^{2(i)}$ . As another example, if $\beta_{BP}=\beta_{LS}/\lambda_{1}^{2}$ and the condition number of $\bm{A}\bm{A}^{T}$ , i.e. the ratio $\lambda_{1}^{2}/\lambda_{m}^{2}$ , is very large, then Observation 3 implies that $\sum\limits_{i=1}^{m}bias_{BP}^{2(i)}$ is much smaller than $\sum\limits_{i=1}^{m}bias_{LS}^{2(i)}$ .

II-C Discussion and implications for priors beyond $\ell_{2}$

As can be seen in (10) and (14), for the discussed Tikhonov regularization the bias term of each estimator is minimized if $\beta\to 0$ , and in this case $bias_{LS}^{2}$ tends to $bias_{BP}^{2}$ . This means that the performance gap in the noiseless case, which is stated in Observation 2 and Observation 3, tends to zero for $\beta\to 0$ . However, note that we consider here $\ell_{2}$ priors mainly as a surrogate to complex priors which are hard to analyze. As we demonstrate in Section IV, the results that are obtained for sophisticated priors, such as TV, BM3D and DCGAN, indeed strongly correlate with the observations above (especially with Observation 3 that implies an advantage of BP for badly conditioned $\bm{A}\bm{A}^{T}$ ). For such priors, the optimal value of $\beta$ for each fidelity term is significantly above 0 even in the noiseless case (contrary to $\ell_{2}$ priors), and the gap between the best recoveries is significant as well.

Another motivation for connecting the above analysis to other priors comes from recognizing attributes that distinguish between the LS and BP fidelity terms regardless of the prior used with them.

Let us focus on the noiseless case, where $\bm{y}=\bm{A}\bm{x}$ . In this case, (5) and (6) can be written as

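$$f_{LS}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{A}(\bm{x}-\tilde{\bm{x}})\|_{2}^{2}+\beta s(\tilde{\bm{x}})=\frac{1}{2}(\bm{x}-\tilde{\bm{x}})^{T}\bm{A}^{T}\bm{A}(\bm{x}-\tilde{\bm{x}})+\beta s(\tilde{\bm{x}}), \tag{17}$$

$$f_{BP}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{P}_{A}(\bm{x}-\tilde{\bm{x}})\|_{2}^{2}+\beta s(\tilde{\bm{x}})=\frac{1}{2}(\bm{x}-\tilde{\bm{x}})^{T}\bm{P}_{A}(\bm{x}-\tilde{\bm{x}})+\beta s(\tilde{\bm{x}}). \tag{18}$$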

Under our SVD notations, we have $\bm{A}^{T}\bm{A}=\sum\limits_{i=1}^{m}\lambda_{i}^{2}\bm{v}_{i}\bm{v}_{i}^{T}$ and $\bm{P}_{A}=\sum\limits_{i=1}^{m}\bm{v}_{i}\bm{v}_{i}^{T}$ , where $\bm{v}_{i}$ is the right singular vector of $\bm{A}$ associated with the singular value $\lambda_{i}$ . Therefore, we get

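$$f_{LS}(\tilde{\bm{x}})=\frac{1}{2}\sum_{i=1}^{m}\lambda_{i}^{2}|\bm{v}_{i}^{T}(\bm{x}-\tilde{\bm{x}})|^{2}+\beta s(\tilde{\bm{x}}), \tag{19}$$

$$f_{BP}(\tilde{\bm{x}})=\frac{1}{2}\sum_{i=1}^{m}|\bm{v}_{i}^{T}(\bm{x}-\tilde{\bm{x}})|^{2}+\beta s(\tilde{\bm{x}}). \tag{20}$$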

Note that $f_{BP}(\tilde{\bm{x}})$ equally weighs all $\{|\bm{v}_{i}^{T}(\bm{x}-\tilde{\bm{x}})|^{2}\}_{i=1}^{m}$ , contrary to $f_{LS}(\tilde{\bm{x}})$ that weighs them according to $\{\lambda_{i}^{2}\}$ . As in inverse problems one (typically) cares about minimizing the MSE, an intuition that minimizing (20) may have an advantage over minimizing (19) for general priors, comes from the similarity between the BP fidelity term and formulating the MSE as $\|\tilde{\bm{x}}-\bm{x}\|_{2}^{2}=\sum\limits_{i=1}^{n}|\bm{v}_{i}^{T}(\bm{x}-\tilde{\bm{x}})|^{2}$ (note that the sum here goes over all the $n$ basis vectors in $\bm{V}$ ). For $\ell_{2}$ priors, we indeed have shown in Section II-B that this “equal weighting” strategy translates to the fact that $\{bias_{BP}^{2(i)}\}$ do not depend on $\{\lambda_{i}^{2}\}$ , contrary to $\{bias_{LS}^{2(i)}\}$ , which later yields the MSE advantage of BP over LS in Observation 3. For $\ell_{2}$ priors, we have obtained analytical results and tradeoffs also for the noisy case. For other priors, we empirically show in Section IV correlation to the above analytical findings.

An important factor that is not taken into account in the above analysis is optimization, since for $\ell_{2}$ priors there is a closed-form solution. Yet, for sophisticated priors iterative optimization schemes are inevitable, and the regularization parameter has an effect which is similar to the step size in these schemes. In such cases, an extremely low value of $\beta$ inherently results in a massive slowdown in the convergence for convex priors [26, 27] and/or bad local minima for non-convex priors. Taking a numerical optimization point of view, in the sequel we empirically show that $\hat{\bm{x}}_{BP}$ is superior to $\hat{\bm{x}}_{LS}$ even for $\ell_{2}$ priors with $\beta\to 0$ , if a few iterations of conjugate gradients are used instead of the closed-form expressions (7) and (8). This implementation choice may be preferable in high-dimensional problems when it is not possible to invert the matrices. The advantage of BP in this case follows from the fact that the eigenvalues of $\bm{P}_{A}$ are only 1 (in the row space of $\bm{A}$ ) and 0 (in the null space of $\bm{A}$ ), while $\bm{A}^{T}\bm{A}$ may have very different eigenvalues in general, and conjugate gradients (among other methods) performs better when the eigenvalues are clustered [28]. In Section IV we provide empirical evidence that BP requires fewer iterations than LS also for other optimization schemes and priors.

III Experiments with $\ell_{2}$ Priors

In this section, we discuss the implications of the analytical results from Section II and verify them for specific observation models: super-resolution and compressed sensing. In the first, all the singular values of $\bm{A}$ are smaller than 1 and the condition number of $\bm{A}\bm{A}^{T}$ is large, while in the latter it is possible that all singular values are greater than 1 and that the condition number is very moderate. We also discuss the typical deblurring problem, which is highly ill-conditioned. In this case, $\bm{A}^{\dagger}$ in $\hat{\bm{x}}_{BP}$ has to be regularized due to the large number of near-zero singular values, and (8) needs to be modified accordingly.

Throughout this section, we use the closed-form estimators in (7) and (8) to restore the images. The empirical performance of these two estimators is presented by markers, while the analytical expressions from (11) and (15) are plotted in solid curves. Different colors are used to distinguish between the two fidelity terms that are used for the estimation.

III-A Super-resolution

Let us consider the super-resolution (SR) task, where $\bm{A}$ is a composite operator of blurring (e.g. anti-aliasing filtering) followed by down-sampling. Note that the largest singular value of a typical low-pass filtering operation is 1, and it is associated with the DC (i.e. the magnitude of the Fourier coefficient that is associated with zero frequency). The rest of the singular values are smaller than 1. The subsequent operator is subsampling, which inevitably reduces the energy of the signal (as $m<n$ ). Therefore, essentially, all the singular values of $\bm{A}$ are smaller than 1. Accordingly, the condition number of $\bm{A}\bm{A}^{T}$ is large. These properties are demonstrated in Fig. 1(a) for SR with scale factor 3 and Gaussian filter of size $7\times 7$ and standard deviation 1.6 (used in many works, e.g. [11, 23, 29]), which is performed on a $64\times 64$ image (thus $n=4096$ and $m=484$ ). We consider such a small image to allow computing the SVD of $\bm{A}$ (our analytic expressions require both $\{\lambda_{i}^{2}\}_{i=1}^{m}$ and $\bm{V}$ ).

We verify our analytical results for the SRx3 scenario mentioned above, and two cases: $\sigma_{e}=0$ and Gaussian noise with $\sigma_{e}=\sqrt{2}$ . The experiments are performed on the cameraman image, resized to $64\times 64$ pixels. In the noisy case, we average the results over 5 noise realizations. We have observed similar results for other images as well. We use the $\ell_{2}$ prior $s(\tilde{\bm{x}})=\frac{1}{2}\|\tilde{\bm{x}}\|_{2}^{2}$ , which satisfies the assumptions ( $\bm{D}=\bm{I}_{n}$ and $\gamma_{i}=1$ ).

The PSNR results are presented in Fig. 3 and validate the analytical expressions. (The PSNR for a recovery $\hat{\bm{x}}$ of a uint8 image $\bm{x}\in\mathbb{R}^{n}$ is computed as $10\log_{10}\Big(\frac{255^{2}}{\frac{1}{n}\|\hat{\bm{x}}-\bm{x}\|_{2}^{2}}\Big)$ .) For $\sigma_{e}=0$ , $\hat{\bm{x}}_{BP}$ is better than $\hat{\bm{x}}_{LS}$ for any value of the parameter $\beta$ , as implied by Observation 2 since all the singular values of $\bm{A}$ are smaller than 1 (Fig. 1(a)).
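For reference, the PSNR definition used here can be written as a small helper. This is only a sketch; the function name `psnr` and its `peak` argument are illustrative and not taken from the authors' code.

```python
import numpy as np

def psnr(x_hat, x, peak=255.0):
    """PSNR in dB between a recovery x_hat and a reference x (uint8 range)."""
    x_hat = np.asarray(x_hat, dtype=np.float64)
    x = np.asarray(x, dtype=np.float64)
    mse = np.mean((x_hat - x) ** 2)       # (1/n) * ||x_hat - x||_2^2
    return 10.0 * np.log10(peak ** 2 / mse)
```

Casting to floating point before subtracting avoids the wrap-around that would occur when subtracting uint8 arrays directly. The helper does not special-case a perfect recovery (zero error).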

The rather large gap in favor of BP also agrees with Observation 3 that predicts it when the ratio $\lambda_{1}^{2}/\lambda_{m}^{2}$ is large. The fact that BP at $\beta/\lambda_{1}^{2}=5.97\beta$ outperforms LS at $\beta$ , further verifies Observation 3. For $\sigma_{e}=\sqrt{2}$ , the gap between the estimators is reduced because $var_{BP}$ is worse than $var_{LS}$ at handling the small singular values, as mentioned in Observation 1.

To demonstrate the numerical optimization advantage of the BP cost over the LS cost for $\beta\to 0$ (where the gap between the bias terms in (II-B) and (II-B) tends to 0), we repeat the experiments above for very small values of $\beta$ . However, this time instead of inverting the matrices in (II-B) and (II-B) we obtain the estimators using the conjugate gradient method. The results are presented in Fig. 3. Remarkably, a single iteration is enough for obtaining the exact BP estimator (for $\ell_{2}$ prior).
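The single-iteration behavior can be reproduced on a toy problem. The sketch below assumes the cost functions $f_{BP}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}+\frac{\beta}{2}\|\tilde{\bm{x}}\|_{2}^{2}$ and $f_{LS}(\tilde{\bm{x}})=\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}+\frac{\beta}{2}\|\tilde{\bm{x}}\|_{2}^{2}$ , a random Gaussian $\bm{A}$ rather than the SR operator, and zero initialization; the helper `cg` is a textbook conjugate gradient routine written for illustration.

```python
import numpy as np

def cg(apply_M, b, num_iters):
    """Plain conjugate gradient for M x = b, started from x = 0."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    for _ in range(num_iters):
        Mp = apply_M(p)
        alpha = (r @ r) / (p @ Mp)
        x = x + alpha * p
        r_new = r - alpha * Mp
        if np.linalg.norm(r_new) <= 1e-12 * np.linalg.norm(b):
            break                          # converged, avoid 0/0 below
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

rng = np.random.default_rng(0)
m, n, beta = 20, 60, 1e-3
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ rng.standard_normal(n)
A_pinv = np.linalg.pinv(A)
P_A = A_pinv @ A                           # projection onto the row space of A

# BP normal equations: (P_A + beta I) x = A^+ y
x_bp_exact = np.linalg.solve(P_A + beta * np.eye(n), A_pinv @ y)
x_bp_cg1 = cg(lambda v: P_A @ v + beta * v, A_pinv @ y, num_iters=1)

# LS normal equations: (A^T A + beta I) x = A^T y
x_ls_exact = np.linalg.solve(A.T @ A + beta * np.eye(n), A.T @ y)
x_ls_cg1 = cg(lambda v: A.T @ (A @ v) + beta * v, A.T @ y, num_iters=1)

err_bp = np.linalg.norm(x_bp_cg1 - x_bp_exact) / np.linalg.norm(x_bp_exact)
err_ls = np.linalg.norm(x_ls_cg1 - x_ls_exact) / np.linalg.norm(x_ls_exact)
print(err_bp, err_ls)
```

Under these assumptions the reason is simple: $\bm{A}^{\dagger}\bm{y}$ lies in the row space of $\bm{A}$ , on which $\bm{P}_{A}+\beta\bm{I}_{n}$ acts as multiplication by $1+\beta$ , so the first conjugate gradient step already returns $\bm{A}^{\dagger}\bm{y}/(1+\beta)$ . The LS system has a spread of eigenvalues, and one step is generally not enough.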

III-B Compressed sensing

Contrary to SR scenarios, in compressed sensing (CS) the condition number of $\bm{A}\bm{A}^{T}$ is moderate and the singular values of $\bm{A}$ may be larger than 1. Consider the commonly examined scenario where $\bm{A}$ is the multiplication of an $m\times n$ Gaussian measurement matrix (whose i.i.d. entries are drawn from $\mathcal{N}(0,1/m)$ ) with an $n\times n$ Haar wavelet basis. We have observed that for high compression, e.g. $m/n=0.1$ , all the singular values are larger than 1 and the condition number is very small, as demonstrated in Fig. 1(c). However, for lower compression, e.g. $m/n=0.5$ , there are also singular values smaller than 1 and the condition number increases, as demonstrated in Fig. 1(d).
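The dependence of the spectrum on the compression ratio can be illustrated with a short simulation. This is a sketch under stated assumptions: it uses $n=1024$ instead of $n=4096$ , omits the Haar basis (multiplying by an orthonormal basis does not change the singular values), and the function name is hypothetical.

```python
import numpy as np

def gaussian_cs_singular_values(n, ratio, seed=0):
    """Singular values of an m x n matrix with i.i.d. N(0, 1/m) entries."""
    rng = np.random.default_rng(seed)
    m = int(round(ratio * n))
    A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
    return np.linalg.svd(A, compute_uv=False)   # descending order

n = 1024
s_01 = gaussian_cs_singular_values(n, 0.1)      # high compression
s_05 = gaussian_cs_singular_values(n, 0.5)      # lower compression

for name, s in (("m/n=0.1", s_01), ("m/n=0.5", s_05)):
    print(name, "min:", s.min(), "max:", s.max(),
          "cond(AA^T):", (s.max() / s.min()) ** 2)
```

For large matrices of this type the singular values are known to concentrate roughly in the interval $[\sqrt{n/m}-1,\sqrt{n/m}+1]$ , which is about $[2.16,4.16]$ for $m/n=0.1$ and $[0.41,2.41]$ for $m/n=0.5$ . This is consistent with the values $1/\lambda_{1}^{2}\approx 0.058$ and $1/\lambda_{1}^{2}\approx 0.171$ quoted in this section.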

We verify our analytical results for these two compression ratios (both with $\sigma_{e}=0$ ). The experiments are performed on the same $64\times 64$ version of the cameraman image, and we again use the $\ell_{2}$ prior $s(\tilde{\bm{x}})=\frac{1}{2}\|\tilde{\bm{x}}\|_{2}^{2}$ . The results are presented in Fig. 5 and validate the analytical expressions. For $m/n=0.1$ , $\hat{\bm{x}}_{LS}$ is better than $\hat{\bm{x}}_{BP}$ for any value of $\beta$ , as implied by Observation 2 since all the singular values of $\bm{A}$ are greater than 1 (Fig. 1(c)).

To verify Observation 3 in this case, see that BP at $\beta/\lambda_{1}^{2}=0.058\beta$ has (slightly) higher PSNR than LS at $\beta$ , e.g. for $\beta=1$ . We have verified this also for very large values of $\beta$ (not presented here)—both curves decrease and reach a similar plateau at high $\beta$ , yet BP at $\beta/\lambda_{1}^{2}$ indeed has higher PSNR than LS at $\beta$ , but the difference is extremely small. Interestingly, for $m/n=0.5$ , where some singular values of $\bm{A}$ are smaller than 1 (Fig. 1(d)), $\hat{\bm{x}}_{BP}$ gets better results than $\hat{\bm{x}}_{LS}$ .

Also in this scenario, it can be verified that BP at $\beta/\lambda_{1}^{2}=0.171\beta$ has higher PSNR than LS at $\beta$ , as implied by Observation 3. The fact that the gap between BP and LS shifts in favor of BP when moving from $m/n=0.1$ to $m/n=0.5$ agrees with the derivation of Observation 3 in (II-B) that links the advantage of BP to an increased $\lambda_{1}^{2}/\lambda_{m}^{2}$ ratio.

We demonstrate again the numerical optimization advantage of the BP cost over the LS cost by repeating the experiments above for very small values of $\beta$ , while using the conjugate gradient method instead of matrix inversion. The results are presented in Fig. 5. It can be seen again that for the $\ell_{2}$ prior a single iteration is enough for obtaining the exact BP estimator.

We find it necessary to emphasize that compressed sensing scenarios require a sparsity-inducing prior, e.g. $s(\tilde{\bm{x}})=\|\tilde{\bm{x}}\|_{1}$ or TV prior, rather than an $\ell_{2}$ prior, for which both estimators exhibit poor results (i.e. very low PSNR). However, our purpose here is merely to validate our analysis, which applies only to $\ell_{2}$ priors, for a case in which all the singular values are greater than 1 and/or the condition number is small.

Finally, note that for Gaussian $\bm{A}$ there is no efficient way to implement the operators $\bm{A}$ and $\bm{A}^{T}$ for large dimensions. Therefore, in practice, taking $\bm{A}$ to be the subsampled Fourier transform is more common, e.g. in sparse MRI [30]. However, note that for this acquisition model $\bm{A}^{\dagger}$ is simply the Hermitian transpose of $\bm{A}$ (this property follows from the fact that the subsampled Fourier transform is a tight frame [31]), which together with the unitarity of the Fourier transform leads to $\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}=\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}$ . This means that the two cost functions coincide, which is also implied by the fact that in this case all the singular values of $\bm{A}$ are 1 and thus (II-B) is identical to (II-B). Therefore, we do not make a comparison for this case.
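The coincidence of the two fidelity terms for the subsampled Fourier transform is easy to confirm numerically. The sketch below builds a small explicit operator from the unitary DFT matrix; the sizes and the random choice of retained frequencies are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 24
F = np.fft.fft(np.eye(n), norm="ortho")          # unitary DFT matrix
rows = np.sort(rng.choice(n, size=m, replace=False))
A = F[rows, :]                                    # subsampled Fourier operator

A_pinv = np.linalg.pinv(A)
sing = np.linalg.svd(A, compute_uv=False)

x_tilde = rng.standard_normal(n)
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
r = y - A @ x_tilde                               # residual in measurement space

bp_term = np.linalg.norm(A_pinv @ r) ** 2         # ||A^+ (y - A x)||^2
ls_term = np.linalg.norm(r) ** 2                  # ||y - A x||^2
print(bp_term, ls_term)
```

Here `A_pinv` matches the Hermitian transpose `A.conj().T`, all singular values equal 1, and the two printed values agree up to rounding.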

III-C Deblurring

In the deblurring problem, $\bm{A}$ is a square ( $m=n$ ) ill-conditioned matrix that performs blurring (i.e. filtering by a blur kernel). Typically, the blur kernel coefficients are normalized such that their sum is 1. Thus, the largest singular value of $\bm{A}$ is 1 (associated with the DC), and many other singular values are near 0. Accordingly, the condition number of $\bm{A}\bm{A}^{T}$ is extremely large. These properties are demonstrated in Fig. 1(b) for uniform kernel of size $9\times 9$ (used in many works, e.g. [8, 9, 18]).

Note that if one uses $\hat{\bm{x}}_{BP}$ , exactly as defined in (II-B), both Observation 2 and Observation 3 imply an advantage of $\hat{\bm{x}}_{BP}$ over $\hat{\bm{x}}_{LS}$ in the noiseless case due to small singular values and a large condition number, respectively. However, since in the deblurring problem $\bm{A}$ is not rank-deficient but rather (very) ill-conditioned, deblurring scenarios always assume that the measurements are noisy (typically with low noise levels). Therefore, it is required to regularize the inversion of $\bm{A}\bm{A}^{T}$ in $\bm{A}^{\dagger}$ in order to mitigate the effect of near zero $\{\lambda_{i}\}$ on the variance of $\hat{\bm{x}}_{BP}$ . A common regularized inversion is diagonal loading: inverting $\bm{A}\bm{A}^{T}+\epsilon\bm{I}_{n}$ instead of $\bm{A}\bm{A}^{T}$ , where $\epsilon$ is a parameter. This is equivalent to replacing $\lambda_{i}^{2}$ with $\lambda_{i}^{2}+\epsilon$ in the eigen-decomposition of $\bm{A}\bm{A}^{T}$ .
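The equivalence between diagonal loading and the replacement $\lambda_{i}^{2}\to\lambda_{i}^{2}+\epsilon$ can be verified on a synthetic ill-conditioned matrix. This is a minimal sketch: the matrix is built from random orthogonal factors and prescribed singular values rather than from a blur kernel, and `regularized_pinv` is a hypothetical helper name.

```python
import numpy as np

def regularized_pinv(A, eps):
    """A^T (A A^T + eps I)^{-1}, i.e. the pseudo-inverse with diagonal loading."""
    m = A.shape[0]
    return A.T @ np.linalg.inv(A @ A.T + eps * np.eye(m))

rng = np.random.default_rng(0)
n, eps = 32, 1e-3
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.logspace(0, -6, n)                # singular values from 1 down to 1e-6
A = U @ np.diag(lam) @ V.T                 # square, very ill-conditioned

A_reg = regularized_pinv(A, eps)

# Same operator written through the SVD, with lam^2 replaced by lam^2 + eps.
A_reg_svd = V @ np.diag(lam / (lam ** 2 + eps)) @ U.T

# The regularized "projection" A_reg A has eigenvalues lam^2 / (lam^2 + eps).
proj_eigs = np.diag(V.T @ (A_reg @ A) @ V)
print(np.abs(A_reg - A_reg_svd).max(), proj_eigs[:3], proj_eigs[-3:])
```

Directions with $\lambda_{i}^{2}\gg\epsilon$ are almost unaffected, while directions with $\lambda_{i}^{2}\ll\epsilon$ are damped instead of being amplified by $1/\lambda_{i}$ .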

For $\hat{\bm{x}}_{BP}$ with such a regularized inversion, it is not hard to repeat the computations in (II-B) and obtain a very similar result, where $1_{i\leq m}$ is replaced with $\lambda_{i}^{2}/(\lambda_{i}^{2}+\epsilon)$ and $\lambda_{i}^{-2}1_{i\leq m}$ is replaced with $\lambda_{i}^{2}/(\lambda_{i}^{2}+\epsilon)^{2}$ . Formally, we get

[TABLE]

Therefore, as could be expected, increasing the amount of regularization $\epsilon$ reduces the variance of $\hat{\bm{x}}_{BP}$ but increases its bias. As a sanity check, observe that for $\epsilon\to 0$ we get that (III-C) coincides with (II-B) (recall $m=n$ ). Since in this case the performance of $\hat{\bm{x}}_{BP}$ depends on the couple $(\beta,\epsilon)$ , we cannot obtain clear properties like the observations in Section II-B that hold uniformly for any parameter setting. Yet, as demonstrated below and in the sequel, we have empirically observed that it is possible to find settings of $(\beta,\epsilon)$ that balance the bias and variance of $\hat{\bm{x}}_{BP}$ and therefore lead to very good results despite the observed noise.

We verify (III-C) for the uniform blur kernel mentioned above, and two levels of Gaussian noise: $\sigma_{e}=\sqrt{0.3}$ and $\sigma_{e}=\sqrt{2}$ . The experiments are performed on the $64\times 64$ version of the cameraman image, and we again use the $\ell_{2}$ prior. The results are presented in Fig. 6. They show that $\hat{\bm{x}}_{BP}$ with good tuning of $(\beta,\epsilon)$ can outperform $\hat{\bm{x}}_{LS}$ , especially when the noise level is low. This implies that “well-tuned” $\hat{\bm{x}}_{BP}$ handles the badly conditioned $\bm{A}$ (Fig. 1(b)) better than $\hat{\bm{x}}_{LS}$ .

III-D The effect of the joint right singular vectors assumption

In this section, we compare the empirical MSE and the analytical formulas in (11), (15) and (III-C) in cases where the condition $\bm{D}^{T}\bm{D}=\bm{V}\bm{\Gamma}^{2}\bm{V}^{T}$ is violated (recall that the columns of $\bm{V}$ are the right singular vectors of $\bm{A}$ and $\bm{\Gamma}^{2}$ is a diagonal matrix, as defined in Section II-A). Since our formulas require the diagonal of $\bm{\Gamma}^{2}$ (i.e. $\{\gamma_{i}^{2}\}$ ), we compute it as the diagonal of $\bm{V}^{T}\bm{D}^{T}\bm{D}\bm{V}$ , which is exact under the analysis assumption and can be regarded as an approximation when $\bm{D}^{T}\bm{D}\neq\bm{V}\bm{\Gamma}^{2}\bm{V}^{T}$ .

We start with examining the case of $\bm{D}^{T}\bm{D}=\bm{\Omega}_{DIF}^{T}\bm{\Omega}_{DIF}+0.01\bm{I}_{n}$ , where $\bm{\Omega}_{DIF}$ is the 2D finite difference operator and the diagonal loading is required to make $\bm{D}^{T}\bm{D}\succ 0$ . Note that for the deblurring task we have that both $\bm{A}$ and $\bm{D}^{T}\bm{D}$ are circulant matrices that can be diagonalized by the DFT matrix. Therefore, the condition $\bm{D}^{T}\bm{D}=\bm{V}\bm{\Gamma}^{2}\bm{V}^{T}$ holds for $\bm{V}$ that equals the (inverse) DFT matrix. However, for the SR task $\bm{A}$ cannot be singularly decomposed by a Fourier basis. Therefore, the condition cannot be satisfied.

We repeat previous deblurring and SR experiments with the examined $\bm{D}^{T}\bm{D}$ . The results are presented in Figs. 7(a) and 7(c). For deblurring we see perfect agreement between the empirical results and the analytical formulas (as expected). For SR we see that violating the condition has led to a small gap between the empirical results and the formulas.

Now, we further increase the violation of the condition by breaking the circularity property of $\bm{D}^{T}\bm{D}$ . We do it by replacing $\bm{\Omega}_{DIF}$ with a non-circulant operator $\tilde{\bm{\Omega}}$ that performs finite difference only on every 8th pixel (and identity on the rest). We repeat the previous deblurring and SR experiments and present the results in Figs. 7(b) and 7(d). It is easy to see that the deviation of the empirical results from the formulas further grows for both tasks.

The experiments demonstrate that the deviation between the empirical MSE and the analytical expressions grows with the degree to which the condition on $\bm{D}^{T}\bm{D}$ is violated. Yet, the overall trend in the curves remains similar to the analytical results, which motivates applying the observations obtained from the analysis to practical sophisticated priors.

III-E Incorporating prior knowledge on $\bm{x}$ with the results

The analytical MSE formulas in (11) and (15) are conditioned on the latent image $\bm{x}$ , as the expectations are taken only with respect to the noise $\bm{e}$ . These expressions have led to observations in Section II that depend only on the singular values of $\bm{A}$ (i.e. $\{\lambda_{i}\}$ ) and do not require prior knowledge on $\bm{x}$ (recall that accurately modeling natural images is difficult). The usefulness of these observations for preferring one fidelity term over the other for sophisticated priors is demonstrated in Section IV.

However, a natural question arises: How can one leverage prior knowledge on $\bm{x}$ to improve the criterion for choosing the fidelity term?

In this section we briefly demonstrate, using a controlled experiment, how the observations in Section II can be refined given a constraint that $\bm{x}$ resides in $\mathcal{W}^{\perp}$ , the orthogonal complement of a known subspace $\mathcal{W}$ . Note that for low-dimensional $\mathcal{W}$ such that $m<n-\mathrm{dim}(\mathcal{W})$ , we still have an ill-posed linear inverse problem.

We consider the compressed sensing scenario from Section III-B where $n=64^{2}$ , $m/n=0.5$ and $\sigma_{e}=0$ . In this case $\bm{A}$ has (more than 1000) singular values that are larger than 1 and (slightly more than 500) singular values that are smaller than 1 (see Fig. 1(d)). Therefore, the events in the ’in particular’-part in Observation 2 do not occur. Indeed, observe that in Fig. 4(b) none of the estimators is consistently (i.e. for any $\beta$ ) better than the other when $\bm{x}$ is the cameraman image.

Now, let us use the notation from Section II, where the columns of $\bm{V}$ , that is, the right singular vectors of $\bm{A}$ , are ordered according to a descending order of the singular values (from 1 to $m$ ), and the last $n-m$ columns span the null space of $\bm{A}$ . Suppose that $\mathcal{W}$ is the subspace spanned by the columns of $\bm{V}(:,1:m-500)$ , where we use Matlab notation, and that $\bm{x}\in\mathcal{W}^{\perp}$ . Due to the orthogonality of $\bm{V}$ , we have that $[\bm{V}^{T}\bm{x}]_{i}=0$ for any $1\leq i\leq m-500$ . Substituting this property in (II-B) and (II-B), we get

[TABLE]

Therefore, for the considered CS scenario, we have that $bias_{BP}^{2}<bias_{LS}^{2}$ for any $\beta$ (because $\lambda_{i}<1$ for all $m-499\leq i\leq m$ ). Since $\sigma_{e}=0$ , this implies that $MSE_{BP}<MSE_{LS}$ for any $\beta$ .

Note that for $\mathcal{W}$ that is the subspace spanned by the columns of $\bm{V}(:,1000:m)$ and $\bm{x}\in\mathcal{W}^{\perp}$ (i.e. $\bm{x}$ in a subspace spanned by columns of $\bm{V}$ that are either associated with singular values that are greater than 1 or with the null space of $\bm{A}$ ), similar arguments lead to $MSE_{BP}>MSE_{LS}$ for any $\beta$ . Fig. 9 verifies both results for a test image $\bm{x}$ that is the projection of the cameraman image onto $\mathcal{W}^{\perp}$ (i.e. $\bm{x}=\bm{P}_{\mathcal{W}^{\perp}}\bm{x}_{0}$ , where $\bm{x}_{0}$ is the cameraman image).
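Constructing such a test signal amounts to removing the components of $\bm{x}_{0}$ along the chosen right singular vectors. The sketch below uses toy sizes and a random vector in place of the cameraman image; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 60, 10           # toy sizes; the text uses n = 64^2, m = n/2, k = 500
A = rng.standard_normal((m, n)) / np.sqrt(m)

# Full SVD: the first m columns of V are ordered by descending singular value,
# and the last n - m columns span the null space of A.
_, lam, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T

V_w = V[:, : m - k]            # basis of W, i.e. V(:, 1:m-k) in Matlab notation
x0 = rng.standard_normal(n)    # stand-in for the original image
x = x0 - V_w @ (V_w.T @ x0)    # x = P_{W-perp} x0

coeffs = V.T @ x               # [V^T x]_i
print(np.abs(coeffs[: m - k]).max(), np.abs(coeffs[m - k:]).max())
```

The first $m-k$ coefficients vanish up to rounding, while the remaining ones coincide with those of $\bm{x}_{0}$ , so only the smallest $k$ singular values and the null space affect the bias terms.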

Note that the behavior in Fig. 9 cannot be predicted by the ’in particular’-part in Observation 2 that considers all the singular values of $\bm{A}$ , regardless of $\bm{x}$ . We believe that a detailed study with constraints on $\bm{x}$ that better fit images is an interesting direction for future research.

IV Experiments with Sophisticated Priors

In this section we empirically demonstrate that the behavior of $\hat{\bm{x}}_{BP}$ and $\hat{\bm{x}}_{LS}$ (the minimizers of $f_{BP}(\tilde{\bm{x}})$ and $f_{LS}(\tilde{\bm{x}})$ ) for sophisticated convex and non-convex priors (for which mathematical analysis is hard or even intractable) strongly correlates with properties for which we have established concrete mathematical reasoning in the case of $\ell_{2}$ priors. Specifically, for super-resolution and deblurring tasks (where the condition number of $\bm{A}\bm{A}^{T}$ is very large) the BP cost function can lead to significantly improved results compared to the LS cost function, yet the performance gap shrinks as the noise level increases (since the singular values of $\bm{A}$ are small in these tasks).

For Gaussian compressed sensing with low $m/n$ ratio (where the condition number is small and the singular values are greater than 1) $\hat{\bm{x}}_{BP}$ is not significantly better than $\hat{\bm{x}}_{LS}$ , but it is quite robust to noise. However, when the $m/n$ ratio increases (so that the condition number increases and some singular values are smaller than 1) the advantage of BP is more significant, but it diminishes as the noise level increases.

IV-A TV prior

We start with the widely-used (isotropic) total-variation (TV) prior [1], which is given by

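$$s(\tilde{\bm{x}})=0.1\sum_{i,j}\sqrt{(\tilde{x}_{i+1,j}-\tilde{x}_{i,j})^{2}+(\tilde{x}_{i,j+1}-\tilde{x}_{i,j})^{2}}$$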

for a two-dimensional signal $\tilde{\bm{x}}$ . The factor 0.1 is used to achieve good performance for $\beta=\sigma_{e}^{2}$ in the case of denoising ( $\bm{A}=\bm{I}_{n}$ ). Obviously, it does not affect the comparison between the methods, since $s(\tilde{\bm{x}})$ is multiplied by $\beta$ , which can be set arbitrarily. Note that $s(\tilde{\bm{x}})$ is convex, and thus $f_{LS}(\tilde{\bm{x}})$ and $f_{BP}(\tilde{\bm{x}})$ are also convex functions. We choose to minimize them by the same method: 100 iterations of FISTA [20], which is basically a variant of ISTA (see (32) in the appendix) that incorporates Nesterov’s accelerated gradient [32]. The step size $\mu$ is the typical 1 over the Lipschitz constant of $\nabla\ell(\tilde{\bm{x}})$ , which in our case can be computed as 1 over the spectral norm of the constant Hessian matrix $\nabla^{2}\ell$ , i.e. $\mu=1/\|\bm{P}_{A}\|=1$ for BP recovery and $\mu=1/\|\bm{A}^{T}\bm{A}\|$ (computed by the power method) for LS recovery. This common choice of step size is known to ensure convergence in the convex setting [20]. Several methods for performing proximal mapping of $s(\tilde{\bm{x}})$ (i.e. Gaussian denoising associated with the TV prior) exist [33, 34]. Here, we choose to apply the split Bregman method [33]. The experiments are performed on the following eight classical test images: cameraman, house, peppers, Lena, Barbara, boat, hill and couple.
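The optimization scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the TV proximal mapping (split Bregman) is replaced by the closed-form proximal mapping of the $\ell_{2}$ prior $s(\tilde{\bm{x}})=\frac{1}{2}\|\tilde{\bm{x}}\|_{2}^{2}$ so that the result can be compared with a closed-form minimizer, $\bm{A}$ is a small random matrix, and the names `power_method`, `fista` and `prox_l2` are hypothetical.

```python
import numpy as np

def power_method(apply_H, n, num_iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD operator."""
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = apply_H(v)
        v = w / np.linalg.norm(w)
    return float(v @ apply_H(v))           # Rayleigh quotient

def fista(grad, prox, x0, mu, beta, num_iters=100):
    """FISTA for min_x l(x) + beta * s(x); prox(v, tau) is the prox of tau * s."""
    x_prev, v, t = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        x = prox(v - mu * grad(v), mu * beta)           # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        v = x + ((t - 1.0) / t_next) * (x - x_prev)     # Nesterov momentum
        x_prev, t = x, t_next
    return x_prev

rng = np.random.default_rng(0)
m, n, beta = 10, 60, 0.5
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ rng.standard_normal(n)
A_pinv = np.linalg.pinv(A)

# Stand-in prior s(x) = 0.5 * ||x||^2 (NOT the TV prior): its prox is a scaling.
prox_l2 = lambda v, tau: v / (1.0 + tau)

# LS fidelity: gradient A^T (A x - y), step 1 / ||A^T A|| via the power method.
mu_ls = 1.0 / power_method(lambda v: A.T @ (A @ v), n)
x_ls = fista(lambda v: A.T @ (A @ v - y), prox_l2, np.zeros(n), mu_ls, beta, 300)

# BP fidelity: gradient A^+ (A x - y), step 1 / ||P_A|| = 1.
x_bp = fista(lambda v: A_pinv @ (A @ v - y), prox_l2, np.zeros(n), 1.0, beta, 300)
```

Only the gradient of the fidelity term and the step size differ between the two recoveries; the proximal mapping of the prior is shared, which is what allows a like-for-like comparison.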

IV-A1 Super-resolution

We compare the performance of $\hat{\bm{x}}_{LS}$ and $\hat{\bm{x}}_{BP}$ for SR with Gaussian anti-aliasing kernel (defined in Section III-A) and scale factor of 3. We consider the noiseless case $\sigma_{e}=0$ , as well as the case of Gaussian noise with $\sigma_{e}=\sqrt{2}$ . For both estimators we initialize FISTA with the bicubic upsampling of $\bm{y}$ . For BP, the operator $\bm{A}^{\dagger}$ has a fast implementation using the conjugate gradient method [35]. Fig. 11 shows the PSNR of the reconstructions, averaged over all images, for different values of the regularization parameter $\beta$ . Fig. 11 shows the average PSNR as a function of the iteration number, where for each estimator we use the value of $\beta$ which has led to its best results in Fig. 11 (0.25 for LS and 16 for BP in Fig. 10(c); 0.5 for LS and 46 for BP in Fig. 10(d)). It can be seen that $\hat{\bm{x}}_{BP}$ converges somewhat faster than $\hat{\bm{x}}_{LS}$ . In Figs. 21(c) and 21(d) we also display the results for the cameraman image in the noiseless case.

Note the agreement of the obtained results with the observations from Section II, even though they have been established for a much simpler convex prior. In the noiseless case, $\hat{\bm{x}}_{BP}$ outperforms $\hat{\bm{x}}_{LS}$ for any value of $\beta$ , while in the noisy scenario, this does not hold. However, even in the latter case, $\hat{\bm{x}}_{BP}$ (with good tuning of $\beta$ ) outperforms $\hat{\bm{x}}_{LS}$ (with good tuning of $\beta$ ). Yet, the gap between them (for optimal tuning) is smaller than in the noiseless case.

IV-A2 Deblurring

We compare the two estimators for the widely examined $9\times 9$ uniform blur kernel mentioned in Section III-C. We make the common assumption of a circular shift-invariant blur operator, which allows very fast implementation of the gradient steps in the optimization of both cost functions using the Fast Fourier Transform (FFT). We consider two levels of Gaussian noise: $\sigma_{e}=\sqrt{0.3}$ and $\sigma_{e}=\sqrt{2}$ . For both estimators we initialize FISTA with $\bm{y}$ , and for $\hat{\bm{x}}_{BP}$ we use $\epsilon=0.01\sigma_{e}^{2}$ . Fig. 13 shows the average PSNR for different values of $\beta$ , and Fig. 13 shows the average PSNR as a function of the iteration number, where each estimator uses the best $\beta$ from Fig. 13 (0.3 for LS and 20.5 for BP in Fig. 12(c); 0.98 for LS and 19.5 for BP in Fig. 12(d)). Note that $\hat{\bm{x}}_{BP}$ converges much faster than $\hat{\bm{x}}_{LS}$ . The difference here for deblurring is more significant than for SR. Visual results for the couple image in the case of $\sigma_{e}=\sqrt{2}$ are presented in Figs. 22(c) and 22(d).

The obtained results agree with the observations in Section III-C, in the sense that there exist settings of $(\beta,\epsilon)$ for which $\hat{\bm{x}}_{BP}$ outperforms $\hat{\bm{x}}_{LS}$ . Presumably, even for the more complex TV prior, this is due to a better handling of $\bm{A}$ whose condition number is very large. As expected, the performance gap between the estimators (for optimal tuning) decreases when the noise level is higher. However, it is still highly in favor of $\hat{\bm{x}}_{BP}$ .

IV-A3 Compressed sensing

We compare the performance of $\hat{\bm{x}}_{LS}$ and $\hat{\bm{x}}_{BP}$ for CS with Gaussian measurement matrix (i.e. $A_{ij}\sim\mathcal{N}(0,1/m)$ ), for which the two cost functions differ (see the discussion in Section III-B). In these CS experiments (only) we decrease the size of the test images to 128 $\times$ 128 pixels, as there is no efficient way to implement the operators $\bm{A}$ and $\bm{A}^{T}$ for large dimensions. We consider compression ratios of $m/n=0.1$ , $m/n=0.3$ , and $m/n=0.5$ . For each of them we examine the noiseless case and the case of Gaussian noise with signal-to-noise ratio (SNR) of 20 dB. For both estimators we initialize FISTA with zero and use 500 iterations. As we compute $\bm{A}^{\dagger}$ in advance, both estimators have similar computational cost per iteration. Fig. 15 shows the average PSNR for different values of $\beta$ . For $m/n=0.5$ we show in Fig. 15 the average PSNR vs. the iteration number, where each estimator uses the best $\beta$ from Fig. 15. Again, note that $\hat{\bm{x}}_{BP}$ requires fewer iterations than $\hat{\bm{x}}_{LS}$ . Visual results for the house image in the noiseless case are presented in Fig. 16.

The results show correlation with the observations in Section II. In the noiseless case, when the $m/n$ ratio increases (and thus the condition number of $\bm{A}\bm{A}^{T}$ increases, e.g. see Figs. 1(c) and 1(d)) the performance gap between BP and LS increases in favor of BP. In the noisy case, when the $m/n$ ratio increases the BP estimator becomes more sensitive to noise (due to the increase in the number of singular values that are smaller than 1, again, see Figs. 1(c) and 1(d)).

IV-B BM3D prior

We turn to compare the performance of the two cost functions for the BM3D prior [4], which is based on sparsifying a three-dimensional transformation applied to groups of nearest-neighbor (i.e. similar) patches. This prior is non-convex. In fact, it is also not clear how to precisely formulate its associated $s(\tilde{\bm{x}})$ . Yet, when implementing proximal algorithms the proximal mapping of $s(\tilde{\bm{x}})$ can be replaced with applying the BM3D denoiser as a “black-box”. We use 200 iterations of FISTA to minimize the cost functions with typical step sizes as explained above, and the same eight classical test images.

IV-B1 Super-resolution

We repeat the two SR experiments of Section IV-A1. Fig. 18 shows the average PSNR for different values of $\beta$ , and Fig. 18 shows the average PSNR as a function of the iteration number, where each estimator uses the best $\beta$ from Fig. 18 (0.09 for LS and 16 for BP in Fig. 17(c); 0.5 for LS and 140 for BP in Fig. 17(d)). Again, note that $\hat{\bm{x}}_{BP}$ converges much faster than $\hat{\bm{x}}_{LS}$ . In Figs. 21(e) and 21(f) we display the results for cameraman image in the noiseless case.

Note the strong correlation between the obtained results and the observations from Section II, even though the prior is highly non-convex. In the noiseless case, $\hat{\bm{x}}_{BP}$ outperforms $\hat{\bm{x}}_{LS}$ for a large range of $\beta$ . For very small values of $\beta$ it is inferior to $\hat{\bm{x}}_{LS}$ , but with only a small gap. From a practitioner's point of view, the advantages of using the BP cost here are still clear, since when $\beta$ is well-tuned (for each of the cost functions) $\hat{\bm{x}}_{BP}$ is significantly better. Note that in the examined noisy scenario, well-tuned $\hat{\bm{x}}_{BP}$ is still better than well-tuned $\hat{\bm{x}}_{LS}$ , but the gap decreases.

IV-B2 Deblurring

We repeat the two deblurring experiments of Section IV-A2. Fig. 20 shows the average PSNR for different values of $\beta$ , and Fig. 20 shows the average PSNR as a function of the iteration number, where each estimator uses the best $\beta$ from Fig. 20 (0.027 for LS and 25.5 for BP in Fig. 19(c); 0.5 for LS and 29.5 for BP in Fig. 19(d)). Figs. 22(e) and 22(f) present visual results for the couple image in the case of $\sigma_{e}=\sqrt{2}$ . The observations made for the TV prior remain the same for the BM3D prior: there exist settings of $(\beta,\epsilon)$ for which $\hat{\bm{x}}_{BP}$ significantly outperforms $\hat{\bm{x}}_{LS}$ and converges faster.

IV-C DCGAN prior

The developments in deep learning [36] in recent years have led to significant improvement in learning generative models. Methods like variational auto-encoders (VAEs) [37] and generative adversarial networks (GANs) [38] have found success at modeling data distributions. This has naturally led to using pre-trained generative models as priors in imaging inverse problems [17]. Since in popular generative models [37, 38] a generator $\mathcal{G}(\cdot)$ learns a mapping from a low dimensional space $\bm{z}\in\mathbb{R}^{d}$ to the signal space $\mathcal{G}(\bm{z})\subset\mathbb{R}^{n}$ , one can search for a reconstruction of $\bm{x}$ only in the range of the generator. This can be formulated by the following non-convex prior

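$$s(\tilde{\bm{x}})=\begin{cases}0, & \tilde{\bm{x}}\in\mathrm{range}(\mathcal{G})\\ +\infty, & \text{otherwise}\end{cases}\qquad(24)$$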

Plugging (24) into the typical cost function (5), we get the objective

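$$\underset{\tilde{\bm{z}}}{\textrm{min}}\;\frac{1}{2}\|\bm{y}-\bm{A}\mathcal{G}(\tilde{\bm{z}})\|_{2}^{2}\qquad(25)$$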

Note that for this prior, a regularization parameter $\beta$ is not required. The recovery of the latent image $\bm{x}$ is given by $\hat{\bm{x}}_{LS}=\mathcal{G}(\hat{\bm{z}}_{LS})$ , where $\hat{\bm{z}}_{LS}$ is a minimizer of (25), which can be obtained by backpropagation and standard gradient based optimizers.

The technique above has been examined recently in [17]. Here, we compare it with the one obtained by a similar approach that uses the BP cost function (6), i.e. we plug (24) into (6), to get the objective

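$$\underset{\tilde{\bm{z}}}{\textrm{min}}\;\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\mathcal{G}(\tilde{\bm{z}}))\|_{2}^{2}\qquad(26)$$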

and recover $\bm{x}$ by $\hat{\bm{x}}_{BP}=\mathcal{G}(\hat{\bm{z}}_{BP})$ , where $\hat{\bm{z}}_{BP}$ is a minimizer of (26).

We use the CelebA dataset [39] and the TensorFlow package [40] to train a generator using the DCGAN architecture [19] on the cropped version of the images (64 $\times$ 64 pixels), as done in [17]. We use the first 200,000 images (out of 202,599) for training, and the training procedure follows the one in [19, 17]. At test time, all the optimizations with respect to $\bm{z}$ are performed using ADAM [21] with learning rate of 0.1 (as done in [17]), the same 10 random initializations of $\tilde{\bm{z}}$ , and 2000 iterations, which suffice for ensuring that the objectives (25) and (26) stop decreasing. The value of $\tilde{\bm{z}}$ that gives the lowest objective is chosen.

IV-C1 Super-resolution

We compare the performance of $\hat{\bm{x}}_{LS}$ and $\hat{\bm{x}}_{BP}$ for SR with Gaussian anti-aliasing kernel (defined in Section III-A) and scale factor of 3. Table I shows the PSNR results for the different cost functions, averaged over the last 50 images in CelebA (these images are not included in the training data). Several visual results are shown in Fig. 23.

It can be seen that the BP fidelity yields higher average PSNR and perceptually better recoveries. In fact, in each of the 50 examined images $\hat{\bm{x}}_{BP}$ has obtained higher PSNR than $\hat{\bm{x}}_{LS}$ . This behavior agrees with the previous experiments that demonstrate the advantages of the BP cost for the noiseless SR problem. We also note that even though the results of the simple bicubic upsampling are always perceptually worse than the recoveries that use DCGAN, its PSNR is sometimes higher. This drawback of GAN-based priors is due to the limited representation capabilities of the generators (sometimes referred to as “mode collapse”). A very recent work has suggested mitigating this deficiency by image-adaptation and back-projections [41].

IV-C2 Compressed sensing

Due to the small image dimensions, we are able to compare the performance of $\hat{\bm{x}}_{BP}$ and $\hat{\bm{x}}_{LS}$ for CS with Gaussian measurement matrix (i.e. $A_{ij}\sim\mathcal{N}(0,1/m)$ ), for which the two cost functions differ (see the discussion in Section III-B). We use compression ratios of $m/n=0.1$ , $m/n=0.3$ , and $m/n=0.5$ . Table II shows the PSNR results for the different cost functions, averaged over the last 50 images in CelebA. Several visual results are shown in Figs. 24 and 25.

The performance gap between $\hat{\bm{x}}_{BP}$ and $\hat{\bm{x}}_{LS}$ is negligible for $m/n=0.1$ , and increases in favor of BP when the $m/n$ ratio increases. This behavior correlates with the analysis in Section II (specifically with Observation 3), which explains such behavior for $\ell_{2}$ priors by the fact that when the $m/n$ ratio increases the condition number of $\bm{A}\bm{A}^{T}$ increases as well.

V Conclusion

In this work we examined the BP fidelity term for ill-posed linear inverse problems. This term has only been used implicitly by the recently proposed iterative denoising and backward projections (IDBP) framework, and is an alternative to the least squares (LS) term, which is the common choice in most works. We showed that IDBP is essentially a specific optimization scheme, namely the proximal gradient method (known also as ISTA), for minimizing the cost function induced by the BP fidelity term. We analytically compared the two fidelity terms—BP and LS—for the case of $\ell_{2}$ -type prior functions, and obtained mathematically-backed observations in favor of the BP term when the condition number of $\bm{A}\bm{A}^{T}$ is large (which is the case in many applications, such as super-resolution and deblurring). Furthermore, we showed that it is possible to leverage prior knowledge on $\bm{x}$ to increase the coverage of the observations. Finally, we empirically demonstrated that the behavior for sophisticated priors, such as TV, BM3D and DCGAN, strongly correlates with the theoretically backed properties that we established for $\ell_{2}$ priors. While the mathematical performance analysis in this work is done only for $\ell_{2}$ priors, it provides a good characterization for the advantages of BP and LS compared to each other. Yet, we believe that there are other factors that should be explored with respect to the new fidelity term, such as its behavior with non-convex priors or its effect on the convergence speed of iterative optimization algorithms.

Appendix A The Connection Between IDBP [18] and $f_{BP}(\tilde{\bm{x}})$

A-A Background

The iterative denoising and backward projections (IDBP) framework [18] is inspired by the plug-and-play priors concept [42], which encourages the usage of existing Gaussian denoisers as “black boxes” to implicitly dictate the prior $s(\tilde{\bm{x}})$ when solving inverse problems. Such an approach allows one to use sophisticated denoising methods even when it is not clear how to formulate their associated priors, e.g. convolutional neural network (CNN) denoisers.

Several plug-and-play works have been published [42, 43, 44, 45, 46, 47, 29, 48, 49]. Most of them consider the typical cost function (5) and directly minimize it using existing iterative optimization schemes, such as FISTA [20], ADMM [50] or quadratic penalty method [51], that include steps in which the proximal mapping of $s(\tilde{\bm{x}})$ is used (as explained below, this mapping is equivalent to Gaussian denoising under the prior $s(\tilde{\bm{x}})$ ).

Recently, [18] has suggested, after several manipulations, to solve a different optimization problem

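$$\underset{\tilde{\bm{x}},\tilde{\bm{z}}}{\textrm{min}}\,\,\frac{1}{2(\sigma_{e}+\delta)^{2}}\|\tilde{\bm{z}}-\tilde{\bm{x}}\|_{2}^{2}+s(\tilde{\bm{x}})\quad\textrm{s.t.}\quad\bm{A}\tilde{\bm{z}}=\bm{y},\tag{27}$$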

where $\sigma_{e}$ is the noise level and $\delta$ is a design parameter. This work has also proposed an adaptive strategy to set $\delta$ , which does not depend on the prior and, contrary to cross-validation, does not require a set of ground truth examples. It has been suggested in [18] to solve (27) using a simple alternating minimization scheme that possesses the plug-and-play property, where the prior term $s(\tilde{\bm{x}})$ is handled solely by a Gaussian denoising operation $\mathcal{D}(\cdot;\sigma)$ with noise level $\sigma=\sigma_{e}+\delta$ . In this iterative method, $\tilde{\bm{z}}_{k}$ is obtained by projecting $\tilde{\bm{x}}_{k-1}$ onto the affine subspace $\{\tilde{\bm{z}}\in\mathbb{R}^{n}:\bm{A}\tilde{\bm{z}}=\bm{y}\}$

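$$\begin{aligned}\tilde{\bm{z}}_{k}&=\underset{\tilde{\bm{z}}}{\textrm{argmin}}\,\,\|\tilde{\bm{z}}-\tilde{\bm{x}}_{k-1}\|_{2}^{2}\quad\textrm{s.t.}\quad\bm{A}\tilde{\bm{z}}=\bm{y}\\&=\bm{A}^{\dagger}\bm{y}+(\bm{I}_{n}-\bm{A}^{\dagger}\bm{A})\tilde{\bm{x}}_{k-1},\end{aligned}\tag{28}$$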

and $\tilde{\bm{x}}_{k}$ is obtained by

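$$\begin{aligned}\tilde{\bm{x}}_{k}&=\underset{\tilde{\bm{x}}}{\textrm{argmin}}\,\,\frac{1}{2(\sigma_{e}+\delta)^{2}}\|\tilde{\bm{z}}_{k}-\tilde{\bm{x}}\|_{2}^{2}+s(\tilde{\bm{x}})\\&=\mathcal{D}(\tilde{\bm{z}}_{k};\sigma_{e}+\delta).\end{aligned}\tag{29}$$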

The two repeating operations lend the method its name: Iterative Denoising and Backward Projections (IDBP). After a stopping criterion is met, the last $\tilde{\bm{x}}_{k}$ is taken as the estimate of the latent $\bm{x}$ . Note that in many cases the operation $\bm{A}^{\dagger}$ can be performed efficiently (e.g. the matrix inversion can be avoided using the conjugate gradient method [35]), and thus IDBP is dominated by the complexity of the denoising operation, similarly to other plug-and-play techniques. Using sophisticated denoisers, such as BM3D and CNNs, this algorithm has achieved excellent results for deblurring [18, 22] and super-resolution [23].
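The two alternating steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of [18]: the function names `idbp` and `l2_denoiser`, the initialization with $\bm{A}^{\dagger}\bm{y}$, the fixed iteration count and the use of an explicit pseudo-inverse are all assumptions made for a toy problem, and the denoiser is a simple shrinkage (the proximal mapping of an $\ell_{2}$ prior) standing in for BM3D or a CNN.

```python
import numpy as np

def idbp(A, y, denoise, sigma, n_iters=50):
    """Toy IDBP loop: back-projection step followed by a denoising step."""
    A_pinv = np.linalg.pinv(A)      # explicit pseudo-inverse; only sensible for small A
    x_bp = A_pinv @ y               # back-projected observations
    x = x_bp.copy()                 # assumed initialization (not prescribed by the text)
    for _ in range(n_iters):
        # Project the previous estimate onto {z : A z = y}
        z = x_bp + x - A_pinv @ (A @ x)
        # Gaussian denoising with noise level sigma = sigma_e + delta
        x = denoise(z, sigma)
    return x

def l2_denoiser(z, sigma):
    # Hypothetical denoiser: prox of sigma^2 * 0.5 * ||x||_2^2, i.e. plain shrinkage
    return z / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))     # toy under-determined operator
x_true = rng.standard_normal(6)
y = A @ x_true                      # noiseless observations for simplicity
x_hat = idbp(A, y, l2_denoiser, sigma=0.1)
```

For this particular shrinkage denoiser the iterations settle at $\bm{A}^{\dagger}\bm{y}/(1+\sigma^{2})$, which makes the toy example easy to check by hand. For large operators one would replace `np.linalg.pinv` with an efficient application of $\bm{A}^{\dagger}$, as noted in the text.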

A-B Obtaining IDBP by applying ISTA on $f_{BP}(\tilde{\bm{x}})$

Interestingly, there is another way to derive exactly the same algorithm, which is different from the way it is developed in [18]. First, note that (27) can be solved directly for $\tilde{\bm{z}}$ . Similar to (28), we get

$$\tilde{\bm{z}}=\bm{A}^{\dagger}\bm{y}+(\bm{I}_{n}-\bm{A}^{\dagger}\bm{A})\tilde{\bm{x}}.\tag{30}$$

Substituting (30) into (27), we reach $\underset{\tilde{\bm{x}}}{\textrm{min}}f_{BP}(\tilde{\bm{x}})$ with a specific value of the regularization parameter, i.e. $\beta=(\sigma_{e}+\delta)^{2}$ . Indeed, for the minimizer $\tilde{\bm{z}}=\bm{A}^{\dagger}\bm{y}+(\bm{I}_{n}-\bm{A}^{\dagger}\bm{A})\tilde{\bm{x}}$ we have $\tilde{\bm{z}}-\tilde{\bm{x}}=\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})$ , so multiplying the objective of (27) by $(\sigma_{e}+\delta)^{2}$ yields $\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}+(\sigma_{e}+\delta)^{2}s(\tilde{\bm{x}})$ . Therefore, IDBP is essentially a specific method to minimize the $f_{BP}(\tilde{\bm{x}})$ cost function. Let us show that this method coincides with applying the proximal gradient method [20, 24], popularized under the name ISTA (Iterative Shrinkage-Thresholding Algorithm, initially designed for $s(\tilde{\bm{x}})=\|\tilde{\bm{x}}\|_{1}$ [52]), on $f_{BP}(\tilde{\bm{x}})$ . Let us define the proximal mapping, which was introduced by Moreau [53] for convex functions. Here we do not limit this definition to convex functions, although we emphasize that previous results for the proximal mapping of convex functions do not apply to non-convex functions.

Definition 1.

The proximal mapping of a function $s(\cdot)$ at the point $\tilde{\bm{z}}$ is defined by

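$$\mathrm{prox}_{s(\cdot)}(\tilde{\bm{z}})\triangleq\underset{\tilde{\bm{x}}}{\textrm{argmin}}\,\,\frac{1}{2}\|\tilde{\bm{z}}-\tilde{\bm{x}}\|_{2}^{2}+s(\tilde{\bm{x}}).\tag{31}$$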

Clearly, given the same $s(\cdot)$ , Gaussian denoising and proximal mapping are tightly connected: $\mathcal{D}(\tilde{\bm{z}};\sigma)=\mathrm{prox}_{\sigma^{2}s(\cdot)}(\tilde{\bm{z}})$ .
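As a concrete instance of this connection, consider the $\ell_{1}$ prior $s(\tilde{\bm{x}})=\|\tilde{\bm{x}}\|_{1}$ mentioned above, whose proximal mapping is the well-known soft-thresholding operator. The sketch below (the names `prox_l1` and `denoise_l1` are illustrative, not from the paper) builds the corresponding denoiser $\mathcal{D}(\tilde{\bm{z}};\sigma)$ with threshold $\sigma^{2}$ and compares it against a brute-force grid minimization of the per-coordinate objective.

```python
import numpy as np

def prox_l1(z, lam):
    """Soft-thresholding: closed-form proximal mapping of lam * ||x||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def denoise_l1(z, sigma):
    # D(z; sigma) = prox_{sigma^2 s(.)}(z) for s = ||.||_1
    return prox_l1(z, sigma ** 2)

z = np.array([1.5, -0.2, 0.05, -3.0])
sigma = 0.5
x = denoise_l1(z, sigma)

# Brute-force check: the objective 0.5*(z_i - x_i)^2 + sigma^2*|x_i|
# separates over coordinates, so each one can be minimized on a fine 1-D grid.
grid = np.linspace(-4.0, 4.0, 80001)

def objective(x_i, z_i):
    return 0.5 * (z_i - x_i) ** 2 + sigma ** 2 * np.abs(x_i)

x_grid = np.array([grid[np.argmin(objective(grid, z_i))] for z_i in z])
```

Entries of `z` with magnitude below $\sigma^{2}=0.25$ are set to zero and the rest are shrunk towards zero by $0.25$, which the grid search reproduces up to the grid resolution.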

Assuming a differentiable fidelity term $\ell(\tilde{\bm{x}})$ with a Lipschitz continuous gradient $\nabla\ell(\tilde{\bm{x}})$ , applying ISTA on (2) involves iterations of

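$$\tilde{\bm{x}}_{k}=\mathrm{prox}_{\mu\beta s(\cdot)}\left(\tilde{\bm{x}}_{k-1}-\mu\nabla\ell(\tilde{\bm{x}}_{k-1})\right),\tag{32}$$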

where $\mu$ is a step-size, which ensures convergence for convex $s(\cdot)$ if it is at most the reciprocal of the Lipschitz constant of $\nabla\ell(\tilde{\bm{x}})$ [20].
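A generic proximal gradient loop of this form can be sketched as follows. The example is an assumption-laden illustration rather than the paper's code: it uses the LS fidelity term $\frac{1}{2}\|\bm{y}-\bm{A}\tilde{\bm{x}}\|_{2}^{2}$ with an $\ell_{1}$ prior, a hypothetical value of $\beta$, and a step-size equal to the reciprocal of the Lipschitz constant $\|\bm{A}\|^{2}$ of the LS gradient.

```python
import numpy as np

def ista(grad, prox, x0, mu, beta, n_iters=300):
    """Proximal gradient: x_k = prox_{mu*beta*s}(x_{k-1} - mu * grad(x_{k-1}))."""
    x = x0.copy()
    for _ in range(n_iters):
        x = prox(x - mu * grad(x), mu * beta)
    return x

def soft_threshold(v, lam):
    # Proximal mapping of lam * ||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.0, -2.0, 1.5]    # sparse toy signal
y = A @ x_true
beta = 0.1                                 # hypothetical regularization parameter

def grad_ls(x):
    # Gradient of the LS fidelity term 0.5 * ||y - A x||_2^2
    return A.T @ (A @ x - y)

def cost(x):
    return 0.5 * np.sum((y - A @ x) ** 2) + beta * np.sum(np.abs(x))

mu = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of grad_ls
x0 = np.zeros(50)
x_hat = ista(grad_ls, soft_threshold, x0, mu, beta)
```

With this step-size and a convex prior, each iteration does not increase the cost, so `cost(x_hat)` is below `cost(x0)`.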

Proposition 1.

The IDBP algorithm, given in (28) and (29), coincides with applying ISTA (32) on the cost function $f_{BP}(\tilde{\bm{x}})$ .

Proof.

Let us compute $\nabla\ell_{BP}(\tilde{\bm{x}})$ . Using the properties $\bm{P}_{A}\triangleq\bm{A}^{\dagger}\bm{A}=\bm{P}_{A}^{T}=\bm{P}_{A}^{2}$ and $\bm{P}_{A}\bm{A}^{\dagger}=\bm{A}^{\dagger}$ , we get

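$$\begin{aligned}\nabla\ell_{BP}(\tilde{\bm{x}})&=\nabla\frac{1}{2}\|\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\|_{2}^{2}\\&=-\bm{A}^{T}\bm{A}^{\dagger T}\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\\&=-\bm{P}_{A}\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}})\\&=-\bm{A}^{\dagger}(\bm{y}-\bm{A}\tilde{\bm{x}}).\end{aligned}\tag{33}$$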

The Lipschitz constant of $\nabla\ell_{BP}(\tilde{\bm{x}})$ can be computed here as the spectral norm of the constant Hessian matrix $\nabla^{2}\ell_{BP}$ . Therefore, $\mu$ can be chosen as

$$\mu=\frac{1}{\|\nabla^{2}\ell_{BP}\|}=\frac{1}{\|\bm{P}_{A}\|}=1,\tag{34}$$

where we use the fact that the spectral norm of a non-trivial orthogonal projection is 1. Now, due to the connection $\mathcal{D}(\tilde{\bm{z}};\sigma)=\mathrm{prox}_{\sigma^{2}s(\cdot)}(\tilde{\bm{z}})$ , (32) can be written as

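$$\tilde{\bm{x}}_{k}=\mathcal{D}\left(\tilde{\bm{x}}_{k-1}-\mu\nabla\ell(\tilde{\bm{x}}_{k-1});\sqrt{\mu\beta}\right).\tag{35}$$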

Finally, by plugging (33) and (34) into (35) and setting $\beta=(\sigma_{e}+\delta)^{2}$ , we get the IDBP scheme, which is presented in (28) and (29).

∎
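The steps of the proof can also be checked numerically on a small random problem. The following sketch is a sanity check under illustrative assumptions (a random Gaussian $\bm{A}$, an $\ell_{1}$ soft-thresholding denoiser, and variable names chosen here), not part of the original derivation: it compares the closed-form gradient of $\ell_{BP}$ with a finite-difference estimate, confirms that the step-size is 1, and verifies that one IDBP iteration equals one ISTA step on $f_{BP}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 9
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x_prev = rng.standard_normal(n)
sigma = 0.3                          # plays the role of sigma_e + delta
beta = sigma ** 2

A_pinv = np.linalg.pinv(A)
P_A = A_pinv @ A                     # orthogonal projection onto the row space of A

def denoise(z, sig):
    # Hypothetical denoiser: prox of sig^2 * ||x||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - sig ** 2, 0.0)

def ell_bp(x):
    # BP fidelity term 0.5 * ||A^+ (y - A x)||_2^2
    return 0.5 * np.sum((A_pinv @ (y - A @ x)) ** 2)

# Closed-form gradient versus a central finite-difference estimate
grad = -A_pinv @ (y - A @ x_prev)
eps = 1e-6
num_grad = np.array([(ell_bp(x_prev + eps * e) - ell_bp(x_prev - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

# Step-size: reciprocal of the spectral norm of the Hessian P_A
mu = 1.0 / np.linalg.norm(P_A, 2)

# One IDBP iteration: back-projection, then denoising
z = A_pinv @ y + (np.eye(n) - P_A) @ x_prev
x_idbp = denoise(z, sigma)

# One ISTA step on f_BP, written with the denoiser as in the proof
x_ista = denoise(x_prev - mu * grad, np.sqrt(mu * beta))
```

Up to floating-point error, `mu` equals 1 and `x_idbp` matches `x_ista`, in agreement with Proposition 1.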

The connection between IDBP and ISTA allows IDBP to adopt the theoretical results of the latter. Yet, note that the powerful global convergence (obtaining the optimal value of the objective) of ISTA holds only for denoisers that are associated with convex prior functions [20]. This limitation is also shared by ADMM-based plug-and-play schemes [43].

Bibliography

[1] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
[2] A. Buades, B. Coll, and J.-M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 490–530, 2005.
[3] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
[5] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 417–424, ACM Press/Addison-Wesley Publishing Co., 2000.
[6] A. Criminisi, P. Pérez, and K. Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1200–1212, 2004.
[7] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho, “Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA),” Applied and Computational Harmonic Analysis, vol. 19, no. 3, pp. 340–358, 2005.
[8] J. A. Guerrero-Colón, L. Mancera, and J. Portilla, “Image restoration using space-variant Gaussian scale mixtures in overcomplete pyramids,” IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 27–41, 2008.
