Analysis of Approximate Message Passing with Non-Separable Denoisers and Markov Random Field Priors

Yanting Ma; Cynthia Rush; Dror Baron

arXiv:1905.03913·cs.IT·August 27, 2019


TL;DR

This paper extends the analysis of approximate message passing (AMP) algorithms to cases involving non-separable denoisers and Markov random field priors, demonstrating accurate performance predictions and improved local dependency modeling in images.

Contribution

It provides a rigorous theoretical analysis of AMP with non-separable denoisers under Markov random field priors, expanding the applicability of state evolution predictions.

Findings

01

State evolution accurately predicts AMP performance with non-separable denoisers.

02

AMP with sliding-window denoisers captures local dependencies in images.

03

Numerical results show improved image processing capabilities.

Abstract

Approximate message passing (AMP) is a class of low-complexity, scalable algorithms for solving high-dimensional linear regression tasks where one wishes to recover an unknown signal from noisy, linear measurements. AMP is an iterative algorithm that performs estimation by updating an estimate of the unknown signal at each iteration and the performance of AMP (quantified, for example, by the mean squared error of its estimates) depends on the choice of a "denoiser" function that is used to produce these signal estimates at each iteration. An attractive feature of AMP is that its performance can be tracked by a scalar recursion referred to as state evolution. Previous theoretical analysis of the accuracy of the state evolution predictions has been limited to the use of only separable denoisers or block-separable denoisers, a class of denoisers that underperform when sophisticated…


Equations (488)

$$y = A\,\mathcal{V}(\beta) + w,$$

$$\Gamma = \begin{cases}[N], & \text{if } p=1,\\ [N]\times[N], & \text{if } p=2,\\ [N]\times[N]\times[N], & \text{if } p=3,\end{cases}$$

$$\Lambda := \begin{cases}[2k+1], & \text{if } p=1,\\ [2k+1]\times[2k+1], & \text{if } p=2,\\ [2k+1]\times[2k+1]\times[2k+1], & \text{if } p=3,\end{cases}$$

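$$z^{t} = y - A\,\mathcal{V}(\beta^{t}) + \frac{z^{t-1}}{n}\sum_{i\in\Gamma}\eta_{t-1}^{\prime}\big([\mathcal{V}^{-1}(A^{*}z^{t-1})+\beta^{t-1}]_{\Lambda_{i}}\big),$$

$$\beta_{i}^{t+1} = \eta_{t}\big([\mathcal{V}^{-1}(A^{*}z^{t})+\beta^{t}]_{\Lambda_{i}}\big), \quad \text{for all } i\in\Gamma,$$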

$$\begin{aligned}\Gamma^{\textsf{mid}} &:= \{i\in\Gamma \mid \Lambda_{i}\cap\Gamma^{c}=\emptyset\},\\ \Gamma^{\textsf{edge}} &:= \{i\in\Gamma \mid \Lambda_{i}\cap\Gamma^{c}\neq\emptyset\}.\end{aligned}$$

$$v_{[\Lambda_{i}]_{j}} := \frac{1}{|\Lambda_{i}\cap\Gamma|}\sum_{\ell\in\Lambda_{i}\cap\Gamma} v_{\ell}.$$

$$\mathcal{T}_{i}(v_{\Lambda_{i}\cap\Gamma}) := v_{\Lambda_{i}}, \quad \text{for all } i\in\Gamma,$$

$$v_{\Lambda_{i}} = \mathcal{T}_{i}(v_{i-k},v_{i-k+1},\ldots,v_{i+k}) := (v_{i-k},v_{i-k+1},\ldots,v_{i+k})\in\mathbb{R}^{2k+1}.$$

$$\bar{v} = \frac{1}{8}\sum_{j=1}^{8} v_{j}, \quad \text{and set } v_{[\Lambda_{3}]_{1}} = v_{[\Lambda_{3}]_{2}} = v_{[\Lambda_{3}]_{3}} = \bar{v},$$

$$\mathcal{N}_{i}^{q} = \{j\in\Gamma\setminus\{i\} \mid \|i-j\|^{2}\leq q\}.$$

$$P(X_{i}\in B \mid X_{j},\, j\in\Gamma\setminus\{i\}) = P(X_{i}\in B \mid X_{j},\, j\in\mathcal{N}_{i}^{q}),$$

$$C_{i,j} := \sup_{\substack{\xi,\xi^{\prime}\in E^{\Gamma}\\ \xi_{j^{c}}=\xi^{\prime}_{j^{c}}}} \|\mu_{i}(\cdot\mid\xi)-\mu_{i}(\cdot\mid\xi^{\prime})\|_{\mathsf{tv}}.$$

$$\|\rho_{1}(\cdot)-\rho_{2}(\cdot)\|_{\mathsf{tv}} := \max_{B\in\mathcal{E}} |\rho_{1}(B)-\rho_{2}(B)|.$$

$$\|\rho_{1}(\cdot)-\rho_{2}(\cdot)\|_{\mathsf{tv}} = \frac{1}{2}\sum_{x\in E} |\rho_{1}(x)-\rho_{2}(x)|.$$

$$c := \sup_{i\in\Gamma}\sum_{j\in\Gamma} C_{i,j} < 1.$$

$$c^{*} := \sup_{j\in\Gamma}\sum_{i\in\Gamma} C_{i,j} < 1.$$

$$P\Big(\Big\lvert\frac{1}{n}\|w\|^{2}-\sigma^{2}\Big\rvert\geq\epsilon\Big)\leq Ke^{-\kappa n\epsilon^{2}}.$$

$$\begin{aligned}\tau_{t}^{2} &= \sigma^{2}+\sigma_{t}^{2},\\ \sigma_{t}^{2} &= \frac{1}{\delta|\Gamma|}\sum_{i\in\Gamma}\mathbb{E}\big[\big(\eta_{t-1}([\beta+\tau_{t-1}Z]_{\Lambda_{i}})-\beta_{i}\big)^{2}\big],\end{aligned}$$

$$\begin{aligned}\beta^{\prime}_{\Lambda_{c}} &= (\beta^{\prime}_{1},\beta^{\prime}_{2},\ldots,\beta^{\prime}_{2k+1}),\\ \beta^{\prime}_{\Lambda_{c-2}} &= (\textsf{avg},\textsf{avg},\beta^{\prime}_{1},\beta^{\prime}_{2},\ldots,\beta^{\prime}_{2k-1}),\end{aligned}$$

$$\beta^{\prime}_{\Lambda_{c+\ell}} = \begin{cases}\Big(\frac{1}{2k+1+\ell}\sum_{i=1}^{2k+1+\ell}\beta^{\prime}_{i},\ldots,\frac{1}{2k+1+\ell}\sum_{i=1}^{2k+1+\ell}\beta^{\prime}_{i},\,\beta^{\prime}_{1},\beta^{\prime}_{2},\ldots,\beta^{\prime}_{2k+1+\ell}\Big) & \text{if } \ell<0,\\ \Big(\beta^{\prime}_{1},\beta^{\prime}_{2},\ldots,\beta^{\prime}_{2k+1}\Big) & \text{if } \ell=0,\\ \Big(\beta^{\prime}_{1+\ell},\beta^{\prime}_{2+\ell},\ldots,\beta^{\prime}_{2k+1},\,\frac{1}{2k+1-\ell}\sum_{i=1+\ell}^{2k+1}\beta^{\prime}_{i},\ldots,\frac{1}{2k+1-\ell}\sum_{i=1+\ell}^{2k+1}\beta^{\prime}_{i}\Big) & \text{if } \ell>0.\end{cases}$$

$$\sigma_{t}^{2} = \frac{(N-2k)}{\delta N}\,\mathbb{E}\big[\big(\eta_{t-1}(\beta^{\prime}+\tau_{t-1}Z^{\prime})-\beta^{\prime}_{c}\big)^{2}\big] + \frac{1}{\delta N}\sum_{\ell\in K_{0}}\mathbb{E}\big[\big(\eta_{t-1}([\beta^{\prime}+\tau_{t-1}Z^{\prime}]_{\Lambda_{c+\ell}})-\beta^{\prime}_{c+\ell}\big)^{2}\big],$$

$$\begin{aligned}\sigma_{t}^{2} ={}& \frac{(N-2k)^{2}}{\delta N^{2}}\,\mathbb{E}\big[\big(\eta_{t-1}(\beta^{\prime}+\tau_{t-1}Z^{\prime})-\beta^{\prime}_{c}\big)^{2}\big]\\ &+ \frac{1}{\delta N^{2}}\sum_{\ell_{1},\ell_{2}\in K_{0}}\mathbb{E}\big[\big(\eta_{t-1}([\beta^{\prime}+\tau_{t-1}Z^{\prime}]_{\Lambda_{c+\ell}})-\beta^{\prime}_{c+\ell}\big)^{2}\big]\\ &+ \frac{(N-2k)}{\delta N^{2}}\sum_{\ell_{1}\in K_{0},\,\ell_{2}=0}\mathbb{E}\big[\big(\eta_{t-1}([\beta^{\prime}+\tau_{t-1}Z^{\prime}]_{\Lambda_{c+\ell}})-\beta^{\prime}_{c+\ell}\big)^{2}\big]\\ &+ \frac{(N-2k)}{\delta N^{2}}\sum_{\ell_{2}\in K_{0},\,\ell_{1}=0}\mathbb{E}\big[\big(\eta_{t-1}([\beta^{\prime}+\tau_{t-1}Z^{\prime}]_{\Lambda_{c+\ell}})-\beta^{\prime}_{c+\ell}\big)^{2}\big],\end{aligned}$$

$$P\Big(\Big\lvert\frac{1}{|\Gamma|}\sum_{i\in\Gamma}\Big(\phi(\beta^{t+1}_{i},\beta_{i})-\mathbb{E}\big[\phi(\eta_{t}([\beta+\tau_{t}Z]_{\Lambda_{i}}),\beta_{i})\big]\Big)\Big\rvert\geq\epsilon\Big)\leq K_{k,t}e^{-\kappa_{k,t}n\epsilon^{2}},$$

$$P\Big(\Big\lvert\frac{1}{|\Gamma|}\|\beta^{t+1}-\beta\|^{2}-\delta\sigma_{t+1}^{2}\Big\rvert\geq\epsilon\Big)\leq K_{k,t}e^{-\kappa_{k,t}n\epsilon^{2}},$$

$$\mu(x) = P(\beta = x) = \frac{\prod_{m=1}^{M-1}\prod_{n=1}^{N-1}\begin{bmatrix}x_{m,n} & x_{m,n+1}\\ x_{m+1,n} & x_{m+1,n+1}\end{bmatrix}\ \prod_{m=2}^{M-1}\prod_{n=2}^{N-1}\begin{bmatrix}x_{m,n}\end{bmatrix}}{\prod_{m=2}^{M-1}\prod_{n=1}^{N-1}\begin{bmatrix}x_{m,n} & x_{m,n+1}\end{bmatrix}\ \prod_{m=1}^{M-1}\prod_{n=2}^{N-1}\begin{bmatrix}x_{m,n}\\ x_{m+1,n}\end{bmatrix}},$$

$$\begin{split}&\begin{bmatrix}x_{m,n} & x_{m,n+1}\\ x_{m+1,n} & x_{m+1,n+1}\end{bmatrix}\\ &:= P\Big(\beta_{m,n}=x_{m,n},\ \beta_{m,n+1}=x_{m,n+1},\ \beta_{m+1,n}=x_{m+1,n},\ \beta_{m+1,n+1}=x_{m+1,n+1}\Big),\end{split}$$

$$\begin{split}&\begin{bmatrix}x_{m,n} & x_{m,n+1}\\ x_{m+1,n} & \boxed{x_{m+1,n+1}}\end{bmatrix}\\ &:= P\Big(\beta_{m+1,n+1}=x_{m+1,n+1}\,\Big|\,\beta_{m,n}=x_{m,n},\ \beta_{m+1,n}=x_{m+1,n},\ \beta_{m,n+1}=x_{m,n+1}\Big).\end{split}$$


Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Distributed Sensor Networks and Detection Algorithms · Blind Source Separation Techniques

Full text

Analysis of Approximate Message Passing with Non-Separable Denoisers and Markov Random Field Priors

Yanting Ma, Cynthia Rush, and Dror Baron

Y. Ma is with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02138, USA (e-mail: [email protected]). C. Rush is with the Department of Statistics, Columbia University, New York, NY 10027, USA (e-mail: [email protected]). D. Baron is with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695, USA (e-mail: [email protected]). This work was supported by the National Science Foundation (NSF) under grants CCF-1217749 and ECCS-1611112. Portions of the work appeared at the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017 [1]. The work was completed while Y. Ma was with North Carolina State University.

Abstract

Approximate message passing (AMP) is a class of low-complexity, scalable algorithms for solving high-dimensional linear regression tasks where one wishes to recover an unknown signal from noisy, linear measurements. AMP is an iterative algorithm that updates an estimate of the unknown signal at each iteration, and its performance (quantified, for example, by the mean squared error of its estimates) depends on the choice of a “denoiser” function that is used to produce these signal estimates.

An attractive feature of AMP is that its performance can be tracked by a scalar recursion referred to as state evolution. Previous theoretical analysis of the accuracy of the state evolution predictions has been limited to the use of only separable denoisers or block-separable denoisers, a class of denoisers that underperform when sophisticated dependencies exist between signal entries. Since signals with entrywise dependencies are common in image/video-processing applications, in this work we study the high-dimensional linear regression task when the dependence structure of the input signal is modeled by a Markov random field prior distribution. We provide a rigorous analysis of the performance of AMP, demonstrating the accuracy of the state evolution predictions, when a class of non-separable sliding-window denoisers is applied. Moreover, we provide numerical examples where AMP with sliding-window denoisers can successfully capture local dependencies in images.

Index Terms:

approximate message passing, non-separable denoiser, Markov random field, finite sample analysis.

I Introduction

In this work, we study the problem of estimating an unknown signal $\beta:=(\beta_{i})_{i\in\Gamma}$ from noisy, linear measurements as in the following model:

$$y = A\,\mathcal{V}(\beta) + w, \tag{1}$$

where for some integer $p\in\{1,2,3\}$ , $\Gamma\subset\mathbb{Z}^{p}$ is an index set with cardinality $|\Gamma|$ , $y\in\mathbb{R}^{n}$ is the output, $A\in\mathbb{R}^{n\times|\Gamma|}$ is a known measurement matrix, $w\in\mathbb{R}^{n}$ is zero-mean noise with finite variance $\sigma^{2}$ , and $\mathcal{V}$ (script $V$ stands for “vectorization”) is an invertible operator that rearranges elements of an array into a vector, hence $\mathcal{V}(\beta)$ is a length- $|\Gamma|$ vector. We assume that the ratio of the dimensions of the measurement matrix is a constant value, $\delta:=n/|\Gamma|$ , with $\delta\in(0,\infty)$ .
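The vectorization operator can be made concrete with a small sketch. The paper does not fix an ordering for $\mathcal{V}$ , so the row-major order used here is an assumption, and the function names `vectorize` and `unvectorize` are hypothetical; any invertible rearrangement would serve equally well.

```python
def vectorize(arr):
    """Row-major V: flatten a 2D list of lists into one flat list (assumed ordering)."""
    return [x for row in arr for x in row]


def unvectorize(vec, ncols):
    """Inverse of vectorize for an array with ncols columns."""
    return [vec[r:r + ncols] for r in range(0, len(vec), ncols)]


# A 2 x 3 array standing in for a tiny p = 2 signal.
beta = [[1, 2, 3], [4, 5, 6]]
vec = vectorize(beta)            # length |Gamma| = 6
restored = unvectorize(vec, 3)   # recovers the original array
```

With $n$ measurements of this length-6 vector, the measurement ratio would be $\delta=n/6$ .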

Approximate message passing (AMP) [2, 3, 4, 5, 6] is a class of low-complexity, scalable algorithms studied to solve the high-dimensional regression task of (1). The performance of AMP depends on a sequence of functions $\{\eta_{t}\}_{t\geq 0}$ used to generate a sequence of estimates $\{\beta^{t}\}_{t\geq 0}$ from effective observations computed in every iteration of the algorithm. A nice property of AMP is that under some technical conditions these observations can be approximated as the input signal $\beta$ plus independent and identically distributed (i.i.d.) Gaussian noise. For this reason, the functions $\{\eta_{t}\}_{t\geq 0}$ are referred to as “denoisers.”

Previous analysis of the performance of AMP only considers denoisers $\{\eta_{t}\}_{t\geq 0}$ that act coordinate-wise when applied to a vector; such denoisers are referred to as separable. If the unknown signal $\beta$ has a prior distribution with i.i.d. entries, restricting consideration to only separable denoisers causes no loss in performance. However, in many real-world applications, the unknown signal $\beta$ contains dependencies between entries, and therefore a coordinate-wise independence structure does not approximate the prior for $\beta$ well. Instead of using a separable denoiser, non-separable denoisers can improve reconstruction quality for signals with such dependencies among entries. For example, when the signals are images [7, 8] or sound clips [9], non-separable denoisers outperform reconstruction techniques based on over-simplified i.i.d. models. In such cases, a more appropriate model might be a finite memory model, well-approximated with a Markov random field (MRF) prior. In this paper, we extend the previous performance guarantees for AMP to a class of non-separable sliding-window denoisers when the unknown signal is a realization of an MRF. Sliding-window schemes have been studied for denoising signals with dependencies among entries by, for example, Sivaramakrishnan and Weissman [10, 11]. MRFs are appropriate models for many types of images, especially texture images, which have an inherently random component [12, 13].

When the measurement matrix $A$ has i.i.d. Gaussian entries and the empirical distribution function (Footnote 1) of the unknown signal $\beta$ converges to some distribution function on $\mathbb{R}$ , Bayati and Montanari [4] proved that at each iteration the performance of AMP can be accurately predicted by a simple, scalar iteration referred to as state evolution in the large system limit ( $n,|\Gamma|\to\infty$ such that $(n/|\Gamma|)\rightarrow\delta$ for a constant $\delta$ ). For example, if $\beta^{t}$ is the estimate produced by AMP at iteration $t$ , the result by Bayati and Montanari [4] implies that the normalized squared error, $\frac{1}{|\Gamma|}\left\lVert\beta^{t}-\beta\right\rVert^{2}$ , and other performance measures converge to deterministic values predicted by state evolution, which is a deterministic recursion calculated using the prior distribution of $\beta$ (Footnote 2). Rush and Venkataramanan [14] provided a concentration version of the asymptotic result when the prior distribution of $\beta$ is i.i.d. sub-Gaussian. The result in Rush and Venkataramanan [14] implies that the probability of an $\epsilon$ -deviation between various performance measures and their limiting constant values decays exponentially in $|\Gamma|$ .

Footnote 1: For a vector $(x_{1},\ldots,x_{N})\in\mathbb{R}^{N}$ , the empirical distribution function $F_{N}:\mathbb{R}\to[0,1]$ is defined as $F_{N}(x):=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}_{\{x_{i}\leq x\}}$ , where $\mathbb{I}$ denotes the indicator function. The empirical distribution function $F_{N}$ is said to converge to some distribution function $F:\mathbb{R}\to[0,1]$ if for all $x\in\mathbb{R}$ at which $F$ is continuous, we have $\lim_{N\to\infty}F_{N}(x)=F(x)$ .

Footnote 2: Throughout the paper, $\|v\|^{2}$ denotes the sum of squares of all the entries in $v$ , where $v$ could be, for example, in $\mathbb{R}^{n}$ , $\mathbb{R}^{n\times n}$ , or $\mathbb{R}^{n\times n\times n}$ .

Extensions of AMP performance guarantees beyond separable denoisers have been considered in special cases [15, 16] for certain classes of block-separable denoisers that allow dependencies within blocks of the signal $\beta$ with independence across blocks. A preliminary version [1] of this work has analyzed the performance of AMP with sliding-window denoisers applied to the setting where the unknown signal has a Markov chain prior. In this paper, we generalize the previous result with the applications of compressive imaging [7, 8] and compressive hyperspectral imaging [17] in mind. We consider 2D/3D MRF priors for the input signal $\beta$ , and provide performance guarantees for AMP with 2D/3D sliding-window denoisers under some technical conditions.

While we were concluding this manuscript, we became aware of recent work of Berthier et al. [18]. The authors prove that the loss of the estimates generated by AMP (for a class of loss functions) with general non-separable denoisers converges to the state evolution predictions asymptotically. Our work differs from [18] in the following three aspects: (i) our work provides finite sample analysis, whereas the result in [18] is asymptotic; (ii) we adjust the state evolution sequence for the specific class of non-separable sliding-window denoisers to account for the “edge” issue that occurs in the finite sample regime (this point will become clear in later sections); (iii) we consider the setting where the unknown signal is a realization of an MRF and the expectation in the definition of the state evolution sequence is with respect to (w.r.t.) the signal $\beta$ , the matrix $A$ , and the noise $w$ , whereas in [18], the signal $\beta$ is deterministic and unknown, hence the expectation is only w.r.t. the matrix $A$ and the noise $w$ .

I-A Sliding-Window Denoisers and AMP Algorithm

Notation: Before introducing the algorithm, we provide some notation that is used to define the sliding window in the sliding-window denoiser. Without loss of generality, we let the index set $\Gamma\subset\mathbb{Z}^{p}$ , on which the input signal $\beta$ in (1) is defined, be

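$$\Gamma = \begin{cases}[N], & \text{if } p=1,\\ [N]\times[N], & \text{if } p=2,\\ [N]\times[N]\times[N], & \text{if } p=3,\end{cases} \tag{2}$$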

where for an integer $N$ , the notation $[N]$ represents the set of integers $\{1,\ldots,N\}$ , hence, $|\Gamma|=N^{p}$ . Similarly, let $\Lambda$ be a $p$ -dimensional cube in $\mathbb{Z}^{p}$ with length $(2k+1)$ in each dimension, namely,

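$$\Lambda := \begin{cases}[2k+1], & \text{if } p=1,\\ [2k+1]\times[2k+1], & \text{if } p=2,\\ [2k+1]\times[2k+1]\times[2k+1], & \text{if } p=3,\end{cases} \tag{3}$$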

where $2k+1\leq N$ . We call $k$ the half-window size.

AMP with sliding-window denoisers: The AMP algorithm for estimating $\beta$ from $y$ and $A$ in (1) generates a sequence of estimates $\{\beta^{t}\}_{t\geq 0}$ , where $\beta^{t}\in\mathbb{R}^{\Gamma}$ , $t$ is the iteration index, and the initialization $\beta^{0}:=0$ is an all-zero array with the same dimension as the input signal $\beta$ . For $t\geq 0$ , the algorithm proceeds as follows:

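$$z^{t} = y - A\,\mathcal{V}(\beta^{t}) + \frac{z^{t-1}}{n}\sum_{i\in\Gamma}\eta_{t-1}^{\prime}\big([\mathcal{V}^{-1}(A^{*}z^{t-1})+\beta^{t-1}]_{\Lambda_{i}}\big), \tag{4}$$

$$\beta_{i}^{t+1} = \eta_{t}\big([\mathcal{V}^{-1}(A^{*}z^{t})+\beta^{t}]_{\Lambda_{i}}\big), \quad \text{for all } i\in\Gamma, \tag{5}$$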

where $\{\eta_{t}\}_{t\geq 0}$ , with each $\eta_{t}:\mathbb{R}^{\Lambda}\rightarrow\mathbb{R}$ , is a sequence of denoisers, $\eta_{t-1}^{\prime}$ is the partial derivative of $\eta_{t-1}$ w.r.t. the center coordinate of the argument, $A^{*}$ is the transpose of $A$ , and $\Lambda_{i}\subset\mathbb{Z}^{p}$ for each $i\in\Gamma\subset\mathbb{Z}^{p}$ is the $p$ -dimensional cube $\Lambda$ translated to be centered at location $i$ . The translated $p$ -dimensional cubes $\{\Lambda_{i}\}_{i\in\Gamma}$ are referred to as “sliding windows,” which will be used to subset elements of a $p$ -dimensional array. The effective observation at iteration $t$ is $\mathcal{V}^{-1}(A^{*}z^{t})+\beta^{t}\in\mathbb{R}^{\Gamma}$ , which can be approximated as the true signal $\beta$ plus i.i.d. Gaussian noise (in a sense that will be made clear in the statement of our main result, Theorem 1). Note that the sliding windows $\{\Lambda_{i}\}_{i\in\Gamma}$ and the sliding-window denoiser $\eta_{t}$ are defined on multidimensional signals, hence we use the inverse of the vectorization operator, $\mathcal{V}^{-1}$ , to rearrange elements of vectors into arrays before applying the sliding-window denoiser $\eta_{t}$ . It should also be noted that the denoiser $\eta_{t}$ may process only some of the signal elements in $\Lambda$ . For example, in the 2D case, if $\Lambda$ is defined as a $3\times 3$ window, then $\eta_{t}$ may process only the center and the four adjacent pixel values in the window (see Figure 1) and ignore the four corners. To simplify notation, we will write $\eta_{t}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ throughout the paper, and interpret this notation to mean that any processing of neighboring signal values is allowed, including the possibility of ignoring some of their values.
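The iteration described above can be sketched for the $p=1$ case in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the denoiser `eta` (a tanh applied to a center-weighted window average), the weight `w_center`, the binary Markov chain signal, the problem sizes, and the estimate $\tau_{t}^{2}\approx\|z^{t}\|^{2}/n$ are all hypothetical choices made for the example. Windows near the boundary are filled with the average of their in-range entries, following the edge-case rule given in the text. The sketch requires $k\geq 1$ .

```python
import math
import random


def window(v, i, k):
    """Window of half-size k centred at 0-based index i; out-of-range
    entries are filled with the mean of the in-range entries."""
    idx = range(i - k, i + k + 1)
    inside = [v[j] for j in idx if 0 <= j < len(v)]
    avg = sum(inside) / len(inside)
    return [v[j] if 0 <= j < len(v) else avg for j in idx]


def eta(win, tau2, w_center=0.6):
    """Hypothetical denoiser: tanh of a centre-weighted window average."""
    k = len(win) // 2
    w_side = (1.0 - w_center) / (2 * k)
    u = w_center * win[k] + w_side * (sum(win) - win[k])
    return math.tanh(u / tau2)


def eta_prime(win, tau2, w_center=0.6):
    """Partial derivative of eta with respect to the centre coordinate."""
    return w_center * (1.0 - eta(win, tau2, w_center) ** 2) / tau2


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def run_amp(N=200, n=160, k=1, sigma=0.1, iters=10, seed=0):
    """Run the sketch and return the per-iteration normalised squared error."""
    rng = random.Random(seed)
    # Assumed signal: +/-1 Markov chain that keeps its sign with probability 0.9.
    beta = [rng.choice([-1.0, 1.0])]
    for _ in range(N - 1):
        beta.append(beta[-1] if rng.random() < 0.9 else -beta[-1])
    A = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(N)] for _ in range(n)]
    At = [list(col) for col in zip(*A)]
    y = [a + rng.gauss(0.0, sigma) for a in matvec(A, beta)]

    est = [0.0] * N       # beta^0 = 0
    z_prev = [0.0] * n    # no residual before the first step, so the correction vanishes
    deriv_sum = 0.0       # sum of eta' over all windows from the previous iteration
    mse = [sum(b * b for b in beta) / N]
    for _ in range(iters):
        Ab = matvec(A, est)
        z = [yi - abi + zp * deriv_sum / n for yi, abi, zp in zip(y, Ab, z_prev)]
        s = [e + c for e, c in zip(est, matvec(At, z))]   # effective observation
        tau2 = sum(zi * zi for zi in z) / n               # assumed estimate of tau_t^2
        wins = [window(s, i, k) for i in range(N)]
        est = [eta(wn, tau2) for wn in wins]
        deriv_sum = sum(eta_prime(wn, tau2) for wn in wins)
        z_prev = z
        mse.append(sum((e - b) ** 2 for e, b in zip(est, beta)) / N)
    return mse


history = run_amp()
```

`history[0]` is the error of the all-zero initialization and `history[t]` the error after `t` iterations. The tanh form was chosen because it is Lipschitz with a smooth, bounded derivative, in line with the assumptions the paper places on $\eta_{t}$ .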

Edge cases: Notice that when the center coordinate $i$ is near the edge of $\Gamma$ , some of the elements in $\Lambda_{i}$ may fall outside $\Gamma$ , meaning that $\Lambda_{i}\cap\Gamma^{c}\neq\emptyset$ where $\Gamma^{c}$ is the complement of $\Gamma$ w.r.t. $\mathbb{Z}^{p}$ . In the definition of the AMP algorithm with sliding-window denoisers and the subsequent analysis, these “edge cases” must be handled carefully. The following definitions provide a framework for the special treatment of the edge cases.

Based on whether $\Lambda_{i}$ has elements outside $\Gamma$ , we partition the index set $\Gamma$ into two sets $\Gamma^{\textsf{mid}\,}$ and $\Gamma^{\textsf{edge}\,}$ defined as:

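$$\begin{aligned}\Gamma^{\textsf{mid}} &:= \{i\in\Gamma \mid \Lambda_{i}\cap\Gamma^{c}=\emptyset\},\\ \Gamma^{\textsf{edge}} &:= \{i\in\Gamma \mid \Lambda_{i}\cap\Gamma^{c}\neq\emptyset\}.\end{aligned} \tag{6}$$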

That is, for $i\in\Gamma^{\textsf{mid}\,}$ , all elements in $\Lambda_{i}$ lie inside $\Gamma$ , whereas for $i\in\Gamma^{\textsf{edge}\,}$ , some of the elements in $\Lambda_{i}$ fall outside $\Gamma$ . The size of each set will depend on the half-window size $k$ and the dimension $p$ .
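To see how the sizes of the two sets depend on $k$ and $p$ , the partition can be enumerated directly for small lattices. The function name `partition` is hypothetical. Since $\Lambda_{i}$ stays inside $\Gamma=[N]^{p}$ exactly when every coordinate of $i$ lies in $\{k+1,\ldots,N-k\}$ , the sketch should give $|\Gamma^{\textsf{mid}}|=(N-2k)^{p}$ .

```python
from itertools import product


def partition(N, k, p):
    """Split Gamma = [N]^p into (mid, edge) for a window of half-size k."""
    mid, edge = [], []
    for i in product(range(1, N + 1), repeat=p):
        # Lambda_i lies inside Gamma iff each coordinate is at least k
        # away from both ends of [N].
        if all(k + 1 <= c <= N - k for c in i):
            mid.append(i)
        else:
            edge.append(i)
    return mid, edge


mid_1d, edge_1d = partition(10, 2, 1)   # p = 1, N = 10, k = 2
mid_2d, edge_2d = partition(6, 1, 2)    # p = 2, N = 6, k = 1
```

For fixed $k$ , the edge fraction $1-(1-2k/N)^{p}$ shrinks as $N$ grows, which is why edge handling matters mainly in the finite-sample regime.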

For any $v\in\mathbb{R}^{\Gamma}$ , let $v_{\Lambda_{i}}$ be a subset of the elements of $v$ with indices in $\Lambda_{i}$ and for any $j\in\Lambda$ , let $[\Lambda_{i}]_{j}$ be the $j^{th}$ index of $\Lambda_{i}$ so that $v_{[\Lambda_{i}]_{j}}$ returns a single element of $v_{\Lambda_{i}}$ . Notice that for $i\in\Gamma^{\textsf{mid}\,}$ , all entries of $v_{\Lambda_{i}}$ are well-defined. However, for $i\in\Gamma^{\textsf{edge}\,}$ , the subset $v_{\Lambda_{i}}$ has undefined entries, namely, for all $j\in\Lambda$ such that $[\Lambda_{i}]_{j}\in\Gamma^{c}$ , the entry $v_{[\Lambda_{i}]_{j}}$ is undefined. We now define the value of those “missing” entries to be the average of the entries of $v_{\Lambda_{i}}$ having indices in $\Gamma$ . Formally, for all $j\in\Lambda$ such that $[\Lambda_{i}]_{j}\in\Gamma^{c}$ , define

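$$v_{[\Lambda_{i}]_{j}} := \frac{1}{|\Lambda_{i}\cap\Gamma|}\sum_{\ell\in\Lambda_{i}\cap\Gamma} v_{\ell}. \tag{7}$$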

It may improve signal recovery quality to use other schemes for “missing” entries, like interpolation. We leave the study of these improved schemes for future work, but the effect of improved processing around edges should become minor as $N$ increases. Notice that $v_{\Lambda_{i}}$ for all $i\in\Gamma$ are now defined using only the entries in the original $v\in\mathbb{R}^{\Gamma}$ . It will be useful to emphasize this point in the proof of our main result, so we define a set of functions $\{\mathcal{T}_{i}\}_{i\in\Gamma}$ with $\mathcal{T}_{i}:\mathbb{R}^{\Lambda_{i}\cap\Gamma}\to\mathbb{R}^{\Lambda}$ as

$$\mathcal{T}_{i}(v_{\Lambda_{i}\cap\Gamma}) := v_{\Lambda_{i}}, \quad \text{for all } i\in\Gamma,$$

where $v_{\Lambda_{i}}$ follows our definition above. That is, $\mathcal{T}_{i}$ is identity for $i\in\Gamma^{\textsf{mid}\,}$ , whereas for $i\in\Gamma^{\textsf{edge}\,}$ , $\mathcal{T}_{i}$ extends a smaller array $v_{\Lambda_{i}\cap\Gamma}$ to a larger one $v_{\Lambda_{i}}$ with the extended entries defined by (7).

Examples for defining “missing” entries: To illustrate the notations defined above, we present an example for the $p=1$ case (hence $v\in\mathbb{R}^{N}$ is a vector). As defined above in (2) and (3), we have $\Gamma=\{1,\ldots,N\}$ , $\Lambda=\{1,\ldots,2k+1\}$ , and $\Lambda_{i}=(i-k,\ldots,i-1,i,i+1,\ldots,i+k)$ for each $i\in[N]$ . Moreover, $\Gamma^{\textsf{mid}\,}=\{k+1,k+2,\ldots,N-k\}$ and $\Gamma^{\textsf{edge}\,}=\{1,2,\ldots,k\}\cup\{N-k+1,N-k+2,\ldots,N\}$ as defined in (6). Therefore, for $i\in\Gamma^{\textsf{mid}\,}$ ,

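$$v_{\Lambda_{i}} = \mathcal{T}_{i}(v_{i-k},v_{i-k+1},\ldots,v_{i+k}) := (v_{i-k},v_{i-k+1},\ldots,v_{i+k})\in\mathbb{R}^{2k+1}.$$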

For $i\in\Gamma^{\textsf{edge}\,}$ , the vector $v_{\Lambda_{i}}$ is still length- $(2k+1)$ , and we set the values of the non-positive indices, i.e., $1-k,2-k,\ldots,-1,0$ , or indices above $N$ , i.e., $N+1,N+2,\ldots,N+k$ , to be the average of values in the vector $v_{\Lambda_{i}}$ with indices in $\Lambda_{i}\cap[N]$ . For example, let $i=3$ and $k=5$ giving $\Lambda_{3}=(-2,-1,0,1,\ldots,8)$ so that for $j\in\{1,2,3\}$ we have $[\Lambda_{3}]_{j}\in\Gamma^{c}$ . Following (7), define

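$$\bar{v} = \frac{1}{8}\sum_{j=1}^{8} v_{j}, \quad \text{and set } v_{[\Lambda_{3}]_{1}} = v_{[\Lambda_{3}]_{2}} = v_{[\Lambda_{3}]_{3}} = \bar{v},$$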

and so $v_{\Lambda_{3}}=\mathcal{T}_{3}(v_{1},\ldots,v_{8}):=(\bar{v},\bar{v},\bar{v},v_{1},\ldots,v_{8})\in\mathbb{R}^{11}$ . An example for the $p=2$ case (hence $v\in\mathbb{R}^{N\times N}$ is a matrix), is shown in Figure 2.
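The $p=1$ example above can be reproduced with a few lines of Python. The function name `window_1d` and the test vector $v_{j}=j$ with $N=12$ are hypothetical choices for illustration; indices are 1-based to match the text.

```python
def window_1d(v, i, k):
    """Return v_{Lambda_i} for a 1-based centre i, where the list v holds v_1..v_N.
    Entries whose index falls outside [N] are set to the mean of the in-range entries."""
    N = len(v)
    idx = range(i - k, i + k + 1)
    inside = [v[j - 1] for j in idx if 1 <= j <= N]
    avg = sum(inside) / len(inside)
    return [v[j - 1] if 1 <= j <= N else avg for j in idx]


v = [float(j) for j in range(1, 13)]   # v_j = j, N = 12
w = window_1d(v, 3, 5)                 # Lambda_3 = (-2, ..., 8)
```

Here the three leading entries of `w` equal $\bar{v}=(1+\cdots+8)/8=4.5$ , followed by $v_{1},\ldots,v_{8}$ , for a total length of $2k+1=11$ .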

I-B Contributions and Outline

Our main result proves that (order-2) pseudo-Lipschitz (PL(2)) loss functions (Footnote 3) acting on the AMP estimate given in (5) concentrate, at any iteration $t$ of the algorithm, on constant values predicted by the state evolution equations introduced below. This work covers the case where the unknown signal $\beta$ has an MRF prior on $\mathbb{Z}^{p}$ . For example, when $p=2$ , $\beta$ can be thought of as an image, whereas when $p=3$ , $\beta$ can be thought of as a hyperspectral image cube. Moreover, we use numerical examples to demonstrate the effectiveness of AMP with sliding-window denoisers when used to reconstruct images from noisy linear measurements.

Footnote 3: A function $f:\mathbb{R}^{m}\to\mathbb{R}$ is (order-2) pseudo-Lipschitz if there exists a constant $L>0$ such that for all $x,y\in\mathbb{R}^{m}$ , $|f(x)-f(y)|\leq L(1+\left\lVert x\right\rVert+\left\lVert y\right\rVert)\left\lVert x-y\right\rVert$ .
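As a quick check that the pseudo-Lipschitz class contains the loss one usually cares about, the following short derivation verifies that the squared error $\phi(a,b)=(a-b)^{2}$ is PL(2) with constant $L=2$ . The notation $x=(a,b)$ and $x^{\prime}=(a^{\prime},b^{\prime})$ is introduced here only for this illustration; the two bounds used are $|a-b|\leq\sqrt{2}\,\|x\|$ and $|(a-a^{\prime})-(b-b^{\prime})|\leq\sqrt{2}\,\|x-x^{\prime}\|$ , both from the Cauchy-Schwarz inequality.

```latex
\begin{aligned}
|\phi(x)-\phi(x^{\prime})|
  &= \big|(a-b)-(a^{\prime}-b^{\prime})\big|\,\big|(a-b)+(a^{\prime}-b^{\prime})\big| \\
  &\leq \sqrt{2}\,\|x-x^{\prime}\| \cdot \sqrt{2}\,\big(\|x\|+\|x^{\prime}\|\big) \\
  &\leq 2\,\big(1+\|x\|+\|x^{\prime}\|\big)\,\|x-x^{\prime}\| .
\end{aligned}
```

Taking $\phi$ to be the squared error is what turns a general PL(2) concentration statement into a statement about the mean squared error of the AMP estimates.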

The rest of the paper is organized as follows. Section II provides model assumptions, state evolution formulas, the main performance guarantee, and numerical examples illustrating the effectiveness of the algorithm for compressive image reconstruction. Our main performance guarantee (Theorem 1) is a concentration result for PL loss functions acting on the AMP outputs from (4)-(5) to the state evolution predictions. Section III provides the proof of Theorem 1. The proof is based on a technical lemma, Lemma 4, and the proof of Lemma 4 is provided in Section IV.

II Main Results

II-A Definitions and Assumptions

First we include some definitions relating to MRFs that will be used to state our assumptions on the unknown signal $\beta$ . These definitions can be found in standard textbooks such as [19]; we include them here for convenience.

Definitions: Let $(\Omega,\mathcal{F},P)$ be a probability space. A random field is a collection of random variables $X=\{X_{i}\}_{i\in\Gamma}$ defined on $(\Omega,\mathcal{F},P)$ having spatial dependencies, where $X_{i}:\Omega\to E$ for some measurable state space $(E,\mathcal{E})$ and $\Gamma\subset\mathbb{Z}^{p}$ is a non-empty, finite subset of the infinite lattice $\mathbb{Z}^{p}$ . Note that $i\in\Gamma\subset\mathbb{Z}^{p}$ , hence $i=(i_{1},\ldots,i_{p})$ . One can consider $\Gamma$ as a collection of spatial locations. Denote the $q^{th}$ -order neighborhood of location $i\in\Gamma$ by $\mathcal{N}^{q}_{i}$ , that is, $\mathcal{N}^{q}_{i}\subset\Gamma$ is the collection of location indices whose squared Euclidean distance from $i$ is at most $q$ , not including $i$ itself. Formally,

$$\mathcal{N}_{i}^{q} = \{j\in\Gamma\setminus\{i\} \mid \|i-j\|^{2}\leq q\}.$$

Following these definitions, $X$ is said to be a $q^{th}$ -order MRF if, for all $i\in\Gamma$ and for all measurable subsets $B\in\mathcal{E}$ , we have

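$$P(X_{i}\in B \mid X_{j},\, j\in\Gamma\setminus\{i\}) = P(X_{i}\in B \mid X_{j},\, j\in\mathcal{N}_{i}^{q}), \tag{9}$$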

and for all $B\in\mathcal{E}^{\Gamma}$ we have $P(X\in B)>0$ . The positivity condition ensures that the joint distribution of an MRF is a Gibbs distribution by the Hammersley-Clifford theorem [20].
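The neighborhood definition can be checked by enumeration. The sketch below assumes $\Gamma=[N]^{p}$ with 1-based coordinates and an integer order $q$ ; the function name `neighborhood` is hypothetical.

```python
import math
from itertools import product


def neighborhood(i, q, N):
    """Sites j in [N]^p, j != i, whose squared Euclidean distance to i is at most q."""
    p = len(i)
    r = math.isqrt(q)   # no single coordinate offset can exceed floor(sqrt(q))
    out = []
    for d in product(range(-r, r + 1), repeat=p):
        j = tuple(a + b for a, b in zip(i, d))
        in_lattice = all(1 <= c <= N for c in j)
        if any(d) and sum(c * c for c in d) <= q and in_lattice:
            out.append(j)
    return out


first_order = neighborhood((3, 3), 1, 5)    # the 4 nearest neighbours
second_order = neighborhood((3, 3), 2, 5)   # adds the 4 diagonal neighbours
corner = neighborhood((1, 1), 1, 5)         # truncated at the lattice boundary
```

On a 2D lattice, $q=1$ gives the four nearest neighbors and $q=2$ gives the full $3\times 3$ ring, while sites near the boundary have fewer neighbors.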

Let $\mu$ denote the distribution measure of $X$ , namely for all $B\in\mathcal{E}^{\Gamma}$ , we have $P(X\in B)=\mu(B)$ , and let $\mu_{\Lambda}$ be the distribution measure of $X_{\Lambda}:=\{X_{i}\}_{i\in\Lambda}$ for $\Lambda\subset\Gamma$ . For any $i\in\Gamma$ , define the set $i+\Lambda:=\{i+j\,\lvert\,j\in\Lambda\}$ . Then the random field is said to be stationary if for all $i\in\Gamma$ such that $i+\Lambda\subset\Gamma$ , it is true that $\mu_{\Lambda}=\mu_{i+\Lambda}$ .

Next we introduce the Dobrushin uniqueness condition, under which the random field admits a unique stationary distribution. Define the Dobrushin interdependence matrix $(C_{i,j})_{i,j\in\Gamma}$ for the measure $\mu$ of the random field $X$ to be

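$$C_{i,j} := \sup_{\substack{\xi,\xi^{\prime}\in E^{\Gamma}\\ \xi_{j^{c}}=\xi^{\prime}_{j^{c}}}} \|\mu_{i}(\cdot\mid\xi)-\mu_{i}(\cdot\mid\xi^{\prime})\|_{\mathsf{tv}}.$$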

In the above, the index set $j^{c}:=\Gamma\setminus\{j\}$ and the total variation distance $\|\cdot\|_{\mathsf{tv}}$ between two probability measures $\rho_{1}$ and $\rho_{2}$ on $(E,\mathcal{E})$ is defined as

$$\|\rho_{1}(\cdot)-\rho_{2}(\cdot)\|_{\mathsf{tv}} := \max_{B\in\mathcal{E}} |\rho_{1}(B)-\rho_{2}(B)|.$$

Note that if $E$ is countable, then

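$$\|\rho_{1}(\cdot)-\rho_{2}(\cdot)\|_{\mathsf{tv}} = \frac{1}{2}\sum_{x\in E} |\rho_{1}(x)-\rho_{2}(x)|.$$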

The measure $\mu$ is said to satisfy the Dobrushin uniqueness condition if

$$c := \sup_{i\in\Gamma}\sum_{j\in\Gamma} C_{i,j} < 1.$$

The Dobrushin contraction coefficient, $c$ , is a quantity that estimates the magnitude of change of the single-site conditional expectations, as they appear in (9), when the field values at the other sites vary. Similarly, we define the transposed Dobrushin uniqueness condition as

$$c^{*} := \sup_{j\in\Gamma}\sum_{i\in\Gamma} C_{i,j} < 1.$$

Assumptions: We can now state our assumptions on the signal $\beta$ , the matrix $A$ , and the noise $w$ in the linear system (1), as well as the denoiser function $\eta_{t}$ used in the algorithm (4) and (5).

Signal: Let $E\subset\mathbb{R}$ be a bounded state space (countable or uncountable). Let $\beta=(\beta_{i})_{i\in\Gamma}$ be a stationary MRF with Gibbs distribution measure $\mu$ on $E^{\Gamma}$ , where $\Gamma\subset\mathbb{Z}^{p}$ is a finite and nonempty rectangular lattice. We assume that $\mu$ satisfies the Dobrushin uniqueness condition and the transposed Dobrushin uniqueness condition. These two conditions together are needed for the results in Lemma C.1 and Lemma C.2, which demonstrate concentration of sums of pseudo-Lipschitz functions when the inputs to the functions are MRFs with distribution measure $\mu$ . Roughly, the conditions ensure that the dependencies between the terms in the sums are sufficiently weak for the desired concentration to hold. The class of finite state space stationary MRFs, which is widely used for image analysis [21], is one example that satisfies our assumption.

Denoiser functions: The denoiser functions $\eta_{t}:\mathbb{R}^{\Lambda}\rightarrow\mathbb{R}$ used in (5) are assumed to be Lipschitz for each $t>0$ and are, therefore, also weakly differentiable with bounded (weak) partial derivatives. (A function $f:\mathbb{R}^{m}\to\mathbb{R}$ is Lipschitz if there exists a constant $L>0$ such that for all $x,y\in\mathbb{R}^{m}$, $\left\lvert f(x)-f(y)\right\rvert\leq L\left\lVert x-y\right\rVert$.) We further assume that the partial derivative w.r.t. the center coordinate of $\Lambda$, which is denoted by $\eta_{t}^{\prime}:\mathbb{R}^{\Lambda}\to\mathbb{R}$, is itself differentiable with bounded partial derivatives. Note that this implies $\eta_{t}^{\prime}$ is Lipschitz. (It is possible to weaken this condition to allow $\eta_{t}^{\prime}$ to have a finite number of discontinuities, if needed, as in [14].)

Matrix: The entries of the matrix $A$ are i.i.d. $\sim\mathcal{N}(0,1/n)$ .

Noise: The entries of the measurement noise vector $w$ are i.i.d. according to some sub-Gaussian distribution $p_{w}$ with mean 0 and finite variance $\sigma^{2}$ . The sub-Gaussian assumption implies [22] that for all $\epsilon\in(0,1)$ and for some constants $K,\kappa>0$ ,

[TABLE]

II-B Performance Guarantee

As noted in Section I, the behavior of the AMP algorithm is predicted by a deterministic scalar recursion referred to as state evolution, which we now introduce. More specifically, the state evolution sequences $\{\tau_{t}^{2}\}_{t\geq 0}$ and $\{\sigma_{t}^{2}\}_{t\geq 0}$ defined below in (11) will be used in Theorem 1 to characterize the estimation error of the estimates produced by AMP. Let the joint distribution $\mu$ define the (stationary) prior distribution for the unknown signal $\beta$ in (1). Following our assumption of stationarity, $\beta_{i}\sim\mu_{1}$ for all $i\in\Gamma$ and $\beta_{\Lambda_{i}}\sim\mu_{\Lambda}$ for all $i\in\Gamma^{\textsf{mid}\,}$ with $\Gamma^{\textsf{mid}\,}$ defined in (6), where $\mu_{1}$ and $\mu_{\Lambda}$ denote the one-dimensional marginal and $\Lambda$ -dimensional marginal of $\mu$ , respectively. Define $\sigma_{\beta}^{2}=\mathbb{E}[\beta_{1}^{2}]>0$ , and $\sigma_{0}^{2}=\sigma_{\beta}^{2}/\delta$ . Iteratively define the state evolution sequences $\{\tau_{t}^{2}\}_{t\geq 0}$ and $\{\sigma_{t}^{2}\}_{t\geq 1}$ as follows:

[TABLE]

where $\eta_{t}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ is the sliding-window denoiser and $Z\in\mathbb{R}^{\Gamma}$ has i.i.d. standard normal entries, independent of $\beta$ , which implies that $Z_{\Lambda_{i}}$ is independent of $\beta_{\Lambda_{i}}$ and $\beta_{i}$ for all $i\in\Gamma$ . Let $\beta^{\prime}\in E^{\Lambda}\sim\mu_{\Lambda}$ and define $Z^{\prime}\in\mathbb{R}^{\Lambda}$ with entries that are i.i.d. $\mathcal{N}(0,1)$ . We notice that for all $i\in\Gamma^{\textsf{mid}\,}$ , we have $\beta_{\Lambda_{i}}\overset{d}{=}\beta^{\prime}$ and $Z_{\Lambda_{i}}\overset{d}{=}Z^{\prime}$ . Therefore, for all $i\in\Gamma^{\textsf{mid}\,}$ , the expectations in (11) satisfy $\mathbb{E}\left[\left(\eta_{t-1}([\beta+\tau_{t-1}Z]_{\Lambda_{i}})-\beta_{i}\right)^{2}\right]=\mathbb{E}\left[\left(\eta_{t-1}(\beta^{\prime}+\tau_{t-1}Z^{\prime})-\beta_{c}^{\prime}\right)^{2}\right]$ , where $\beta_{c}^{\prime}$ is the center coordinate of $\beta^{\prime}$ . For $i\in\Gamma^{\textsf{edge}\,}$ with $\Gamma^{\textsf{edge}\,}$ defined in (6), it is not necessarily true that $\beta_{\Lambda_{i}}\overset{d}{=}\beta^{\prime}$ since, by definition (7), some entries of $\beta_{\Lambda_{i}}$ are defined as the average of other entries.

The explicit expression for the definition of $\sigma_{t}^{2}$ in (11) is different when considering $\Gamma\subset\mathbb{Z}^{p}$ for different $p$ values, as the size of the set $\Gamma^{\textsf{edge}\,}$ depends on the dimension. In the following, we provide explicit expressions for $\sigma_{t}^{2}$ for the cases $p=1,2$ , but in the proof we will use the general expression given in (11) for brevity. We emphasize that the definition of the state evolution sequence in (11) only uses the marginal distribution $\mu_{\Lambda}$ (or $\beta^{\prime}\in E^{\Lambda}$ ) instead of the joint distribution $\mu$ (or $\beta\in E^{\Gamma}$ ), as demonstrated in the two examples below in (12) and (13).

Examples for explicit expressions for $\sigma_{t}^{2}$: Let $\beta_{c}^{\prime}$ be the center coordinate of $\beta^{\prime}\in E^{\Lambda}$ and $\Lambda_{c}$ the window $\Lambda\subset\mathbb{Z}^{p}$ translated with center $c\in\mathbb{Z}^{p}$. Recall that $\Lambda$ is the $p$-dimensional cube with length $(2k+1)$ in each of the $p$ dimensions. Then we have $\beta^{\prime}=\beta^{\prime}_{\Lambda_{c}}$. When we consider shifts $\beta^{\prime}_{\Lambda_{c+\ell}}$ for $\ell\in\{-k,-k+1,\ldots,k-1,k\}$, we define “missing” entries to be replaced by the average of the existing entries, analogous to the definition in (7). (Note that since $\beta^{\prime}$ is exactly of size $\Lambda$, there will be “missing” entries for any $\ell\neq 0$.) For example, when $p=1$,

[TABLE]

where $\text{avg}=\frac{1}{2k-1}\sum_{i=1}^{2k-1}\beta^{\prime}_{i}$ . Generalizing, we have $\beta^{\prime}_{\Lambda_{c+\ell}}~{}\in~{}\mathbb{R}^{2k+1}$ with

[TABLE]

The same idea can be extended when $p>1$ .
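For $p=1$, the shifted window with averaged "missing" entries can be sketched as follows. This is an illustration under stated assumptions rather than the construction used in the proofs: the function name `shifted_window` is hypothetical, and the ordering of entries (existing entries keep their relative positions, and the positions that fall outside $\beta^{\prime}$ receive the average of the existing entries) is an assumption modeled on the description of (7).

```python
import numpy as np

def shifted_window(beta_prime, ell):
    """Return the length-(2k+1) window centered at c + ell, for p = 1.

    Entries that fall outside beta_prime are replaced by the average of the
    entries that remain inside (assumed convention, mirroring (7)).
    """
    beta_prime = np.asarray(beta_prime, dtype=float)
    size = beta_prime.size              # 2k + 1
    k = (size - 1) // 2
    if size % 2 != 1 or abs(ell) > k:
        raise ValueError("need odd window length and |ell| <= k")
    idx = np.arange(size) + ell         # positions of beta_prime covered by the shift
    inside = (idx >= 0) & (idx < size)  # which of those positions actually exist
    out = np.empty(size)
    out[inside] = beta_prime[idx[inside]]
    out[~inside] = beta_prime[idx[inside]].mean()
    return out

# Example with k = 2, so beta' has 2k + 1 = 5 entries.
w_plus2 = shifted_window([1, 2, 3, 4, 5], 2)
w_minus1 = shifted_window([1, 2, 3, 4, 5], -1)
```

With a shift of $\ell=\pm 2$, exactly $2k-1$ entries remain inside the window, which matches the normalization $\frac{1}{2k-1}$ in the expression for $\text{avg}$ above.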

For the case $p=1$, we note that $\Gamma^{\textsf{mid}\,}=\{k+1,k+2,\ldots,N-k\}$ and $\Gamma^{\textsf{edge}\,}=\{1,2,\ldots,k\}\cup\{N-k+1,N-k+2,\ldots,N\}$, hence $|\Gamma^{\textsf{mid}\,}|=N-2k$ and $|\Gamma^{\textsf{edge}\,}|=2k$. Therefore, we have

[TABLE]

where $\mathcal{K}_{0}=\{-k,\ldots,-1\}\cup\{1,\ldots,k\}$ . In the above the first term corresponds to the $N-2k$ middle indices, while the second term sums over $2k$ terms corresponding to all the possible edge cases.

For the case $p=2$, we note that $\Gamma^{\textsf{mid}\,}=\{(i,j)\,\lvert\,k+1\leq i,j\leq N-k\}$, hence $|\Gamma^{\textsf{mid}\,}|=(N-2k)^{2}$. Here we note $\ell=(\ell_{1},\ell_{2})\in\{-k,-k+1,\ldots,k-1,k\}\times\{-k,-k+1,\ldots,k-1,k\}$. Therefore,

[TABLE]

where we notice that there are $(2k)^{2}$ terms in the second summand, $2k$ terms in the third and fourth summands, and $\frac{(N-2k)^{2}}{N^{2}}+\frac{(2k)^{2}}{N^{2}}+\frac{2k(N-2k)}{N^{2}}+\frac{2k(N-2k)}{N^{2}}=1$ . Again, in the above the first term sums over all the middle indices. In this case, the second term corresponds to the corner edge cases, while the third and fourth terms correspond to the edge cases in one dimension only. We note that $\sigma_{t}^{2}$ is a function of $N$ , but do not explicitly represent this relationship to simplify the notation. Moreover, for fixed $k$ , the terms $\frac{(2k)^{2}}{N^{2}}$ , $\frac{2k(N-2k)}{N^{2}}$ , and $\frac{2k(N-2k)}{N^{2}}$ vanish as $N$ goes to infinity. Therefore, we have $\lim_{N\to\infty}\sigma_{t}^{2}(N)=\frac{1}{\delta}\mathbb{E}[(\eta_{t-1}(\beta^{\prime}+\tau_{t-1}Z^{\prime})-\beta_{c}^{\prime})^{2}]$ .
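The counting argument for $p=2$ can be checked by direct enumeration. The sketch below classifies every site of an $N\times N$ lattice by whether each of its two coordinates lies within $k$ of the boundary; the function name `index_counts_2d` and the example values $N=10$, $k=2$ are illustrative choices.

```python
import numpy as np

def index_counts_2d(N, k):
    """Count N x N lattice sites by which coordinates lie within k of the boundary."""
    near = np.zeros(N, dtype=bool)
    near[:k] = True          # first k positions along one axis
    near[N - k:] = True      # last k positions along one axis
    row_near, col_near = np.meshgrid(near, near, indexing="ij")
    mid = int(np.sum(~row_near & ~col_near))      # both coordinates in the middle
    corner = int(np.sum(row_near & col_near))     # both coordinates near the boundary
    row_only = int(np.sum(row_near & ~col_near))  # edge in the first dimension only
    col_only = int(np.sum(~row_near & col_near))  # edge in the second dimension only
    return mid, corner, row_only, col_only

N, k = 10, 2
mid, corner, row_only, col_only = index_counts_2d(N, k)
weights = np.array([mid, corner, row_only, col_only]) / N**2
```

For fixed $k$, the last three entries of `weights` shrink as $N$ grows, which is the reason only the middle term survives in the limit.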

Similar to [14], our performance guarantee, Theorem 1, is a concentration inequality for PL(2) loss functions at any fixed iteration $t<T^{*}$ , where $T^{*}$ is the first iteration when either $(\sigma_{t}^{\perp})^{2}$ or $(\tau_{t}^{\perp})^{2}$ defined in (36) is smaller than a predefined quantity $\hat{\epsilon}$ . The precise definition of $(\sigma_{t}^{\perp})^{2}$ and $(\tau_{t}^{\perp})^{2}$ is deferred to Section III-B. For now, we can understand $(\sigma_{t}^{\perp})^{2}$ (respectively, $(\tau_{t}^{\perp})^{2}$ ) as a number that quantifies (in a probabilistic sense) how close an estimate $\beta^{t}$ (respectively, a residual $z^{t}$ ) is to the subspace spanned by the previous estimates $\{\beta^{s}\}_{s<t}$ (respectively, the previous residuals $\{z^{s}\}_{s<t}$ ). In the special case where $\{\eta_{t}\}_{t\geq 0}$ are Bayes-optimal conditional expectation denoisers, it can be shown that small $(\sigma_{t}^{\perp})^{2}$ implies that the difference between $\sigma_{t}^{2}$ and $\sigma_{t-1}^{2}$ is small [14].

Theorem 1.

Under the assumptions stated in Section II-A and for a fixed half window-size $k>0$, the following holds for any (order-$2$) pseudo-Lipschitz function $\phi:\mathbb{R}^{2}\rightarrow\mathbb{R}$, $\epsilon\in(0,1)$, and $0\leq t<T^{*}$,

[TABLE]

where $\beta\in E^{\Gamma}\sim\mu$ , $Z\in\mathbb{R}^{\Gamma}$ has i.i.d. standard normal entries and is independent of $\beta$ , and the deterministic quantity $\tau_{t}$ is defined in (11). The constants $K_{k,t},\kappa_{k,t}>0$ do not depend on $n$ or $\epsilon$ , but do depend on $k$ and $t$ . Their values are not explicitly specified.

Proof.

See Section III. ∎

Remarks:

(1) The probability in (14) is w.r.t. the product measure on the space of the matrix $A$ , signal $\beta$ , and noise $w$ .

(2) By choosing the following PL(2) loss function, $\phi(a,b)=(a-b)^{2}$ , Theorem 1 gives the following concentration result for the mean squared error of the estimates. For all $t\geq 0$ ,

[TABLE]

with $\sigma_{t+1}^{2}$ defined in (11).
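To make the recursion concrete, the following Monte Carlo sketch iterates the large-$N$ form of state evolution, in which only the middle-index term remains. It rests on several assumptions that should be checked against (11): that $\tau_{t}^{2}=\sigma^{2}+\sigma_{t}^{2}$, that windows are flattened with the center at the middle position, and that draws of $\beta^{\prime}$ are available from a sampler. The names `state_evolution`, `sample_windows`, and `passthrough`, and the i.i.d. Bernoulli window used as a stand-in for $\mu_{\Lambda}$, are hypothetical.

```python
import numpy as np

def state_evolution(sample_windows, eta, sigma_w2, delta, num_iters,
                    num_mc=100_000, seed=0):
    """Monte Carlo sketch of the large-N state evolution recursion.

    sample_windows(rng, m) -> array (m, L) of draws of beta' (L odd, center at L // 2)
    eta(v, tau)            -> array (m,) of center estimates from windows v of shape (m, L)
    Returns the list [tau_0^2, ..., tau_{num_iters}^2].
    """
    rng = np.random.default_rng(seed)
    beta = sample_windows(rng, num_mc)
    center = beta[:, beta.shape[1] // 2]
    sigma2 = np.mean(center ** 2) / delta        # sigma_0^2 = sigma_beta^2 / delta
    tau2 = [sigma_w2 + sigma2]                   # assumed: tau_t^2 = sigma_w^2 + sigma_t^2
    for _ in range(num_iters):
        tau = np.sqrt(tau2[-1])
        z = rng.standard_normal(beta.shape)      # Z' with i.i.d. N(0, 1) entries
        est = eta(beta + tau * z, tau)
        sigma2 = np.mean((est - center) ** 2) / delta
        tau2.append(sigma_w2 + sigma2)
    return tau2

# Stand-in prior: i.i.d. Bernoulli(0.3) windows of length 3 (not a genuine MRF).
sample_windows = lambda rng, m: (rng.random((m, 3)) < 0.3).astype(float)
# Trivial denoiser that returns the noisy center coordinate unchanged.
passthrough = lambda v, tau: v[:, v.shape[1] // 2]
taus = state_evolution(sample_windows, passthrough, sigma_w2=0.1, delta=2.0, num_iters=3)
```

With the pass-through denoiser the error equals the injected noise, so each step should give $\sigma_{t}^{2}\approx\tau_{t-1}^{2}/\delta$; this makes the sketch easy to sanity-check before substituting a real denoiser and a real sampler for $\mu_{\Lambda}$.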

II-C Numerical Examples

Before moving to the proof of Theorem 1, we first demonstrate the effectiveness of the AMP algorithm with sliding-window denoisers when used to reconstruct an image $\beta_{0}$ from its linear measurements acquired according to (1). We verify that state evolution accurately tracks the normalized estimation error of AMP, as is guaranteed by Theorem 1. While we use squared error as the error metric in our examples, which corresponds to the case where the PL(2) loss function $\phi$ in Theorem 1 is defined as $\phi(a,b):=(a-b)^{2}$ , we remind the reader that Theorem 1 also supports other PL(2) loss functions. Moreover, we apply AMP with sliding-window denoisers to reconstruct texture images, which are known to be well-modeled by MRFs in many cases [12, 13].

II-C1 Verification of state evolution

We consider a class of stationary MRFs on $\mathbb{Z}^{2}$ whose neighborhood is defined as the eight-nearest neighbors, meaning this is a $2^{nd}$ -order MRF per the definition in Section II-A. The joint distribution of such an MRF on any finite $M\times N$ rectangular lattice in $\mathbb{Z}^{2}$ has the following expression [23]:

[TABLE]

where we follow the notation in [23] for the generic measure $\left[\begin{matrix}x_{m,n}&x_{m,n+1}\\ x_{m+1,n}&x_{m+1,n+1}\end{matrix}\right]$ defined as

[TABLE]

and the conditional distribution of the element in the box given the element(s) not in the box:

[TABLE]

The generic measure needs to satisfy some consistency conditions to ensure the Markovian property and stationarity of the MRF on a finite grid; details can be found in [23]. For convenience, in simulations we use a $\Pi_{+}$ Binary MRF as defined in [23, Definition 7], for which the generic measure is conveniently parameterized by four parameters, namely,

[TABLE]

In the simulations, we set $\{p=0.4,q=0.5,r=0.01,s=0.4\}$ . Using (9) and (10), it can be checked that the distribution measure of this MRF satisfies the Dobrushin uniqueness condition.

As mentioned previously, an attractive property of AMP, which is formally stated in Theorem 1, is the following: for large $n$ and $|\Gamma|$ and for $i\in\Gamma$, the observation vector $[A^{*}z^{t}+\beta^{t}]_{\Lambda_{i}}$ used as an input to the estimation function in (5) is approximately distributed as $\beta^{\prime}+\tau_{t}Z^{\prime}$, where $\beta^{\prime}\sim\mu_{\Lambda}$, $Z^{\prime}$ has i.i.d. standard normal entries, independent of $\beta^{\prime}$, and $\tau_{t}$ is defined in (11). With this property in mind, a natural choice of denoiser functions $\{\eta_{t}\}_{t\geq 0}$ is the family that calculates the conditional expectation of the signal given the value of the input argument; we refer to these as Bayesian sliding-window denoisers. Let $V_{t}=\beta^{\prime}+\tau_{t}Z^{\prime}$ and $v\in\mathbb{R}^{\Lambda}$, then for each $t\geq 0$ we define

[TABLE]

where $x_{c}$ denotes the center coordinate of $x\in\mathbb{R}^{\Lambda}$, $x_{\Lambda\setminus c}$ denotes all coordinates in $x$ except the center, $f_{V_{t}|\beta^{\prime}}(v|x)=\prod_{i\in\Lambda}\frac{1}{\sqrt{2\pi}\tau_{t}}\exp\left(-\frac{(v_{i}-x_{i})^{2}}{2\tau_{t}^{2}}\right)$ since coordinates of $Z^{\prime}$ are i.i.d. normal, and $\mu(x)$ is computed according to (15) with $M=N=2k+1$ by using (16) and the property of $\Pi_{+}$ Binary MRF given in [23, Definition 7]. Figure 4 shows that the MSE achieved by AMP with the non-separable sliding-window denoiser defined above is tracked by state evolution at every iteration.
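For a finite state space, the conditional expectation above can be evaluated by enumerating all window configurations. The sketch below does this for a flattened window, assuming a binary state space $E=\{0,1\}$ and a user-supplied log-prior; both are assumptions for illustration. The name `bayes_window_denoiser` is hypothetical, and the i.i.d. Bernoulli log-prior in the example is only a placeholder, not the $\Pi_{+}$ Binary MRF of [23].

```python
import itertools
import numpy as np

def bayes_window_denoiser(v, tau, log_prior, states=(0.0, 1.0)):
    """Conditional mean of the center coordinate given a noisy window v.

    v         : 1-D array, the flattened window (odd length, center at len(v) // 2)
    tau       : noise standard deviation tau_t
    log_prior : maps a candidate window x (1-D array) to log mu_Lambda(x),
                up to an additive constant
    """
    v = np.asarray(v, dtype=float)
    c = v.size // 2
    configs = np.array(list(itertools.product(states, repeat=v.size)))
    log_lik = -np.sum((v - configs) ** 2, axis=1) / (2.0 * tau ** 2)
    log_post = log_lik + np.array([log_prior(x) for x in configs])
    log_post -= log_post.max()            # stabilize before exponentiating
    w = np.exp(log_post)
    return float(np.sum(w * configs[:, c]) / np.sum(w))

# Placeholder prior: i.i.d. Bernoulli(q) entries, so the window carries no extra information.
q = 0.3
iid_log_prior = lambda x: float(np.sum(x * np.log(q) + (1 - x) * np.log(1 - q)))
v, tau = np.array([0.2, 0.9, -0.1]), 0.5
est = bayes_window_denoiser(v, tau, iid_log_prior)

# Scalar posterior mean of the center, which the window denoiser should reproduce here.
num = q * np.exp(-(v[1] - 1.0) ** 2 / (2 * tau ** 2))
scalar_est = num / (num + (1 - q) * np.exp(-v[1] ** 2 / (2 * tau ** 2)))
```

The enumeration costs $|E|^{|\Lambda|}$ evaluations per window, which is $2^{9}=512$ for a binary $3\times 3$ window, so this direct approach is practical only for small $k$.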

Notice that when $k=0$ , the denoisers $\{\eta_{t}\}_{t\geq 0}$ are separable and since the empirical distribution of $\beta_{0}$ converges to the stationary probability distribution $\mu_{1}$ on $E\subset\mathbb{R}$ , the state evolution analysis for AMP with separable denoisers ( $k=0$ ) was justified by Bayati and Montanari [4]. However, it can be seen in Figures 3 and 4 that the MSE achieved by the separable denoiser ( $k=0$ ) is significantly higher (worse) than that achieved by the non-separable denoisers ( $k=1$ ).

II-C2 Texture Image Reconstruction

We now use the Bayesian sliding-window denoiser defined in (17) to reconstruct the binary texture images shown in Figure 5. The MRF prior is the same type as described in Section II-C1, namely the $\Pi_{+}$ Binary MRF, but we set the parameters $\{p=0.18,q=0.16,r=0.034,s=0.01\}$. Note that while it is possible to learn an MRF model for each of the images using well-established MRF learning algorithms, we do not include this procedure in our simulations since the study of texture image modeling is beyond the scope of this paper. Moreover, the reconstruction results obtained using the simple MRF defined above are sufficiently satisfactory, despite the fact that the prior may be inaccurate. In Figure 5, we begin with natural images of a cloud, a leaf, and wood ($1^{st}$ column) and then use thresholding to generate binary test images ($2^{nd}$ column). In addition to presenting the reconstructed images obtained by the Bayesian sliding-window denoisers with $k=1$ ($4^{th}$ column) and $k=0$ ($5^{th}$ column), respectively, we also present those obtained by AMP with a total variation denoiser [24] as a baseline approach ($3^{rd}$ column).

III Proof of Theorem 1

The proof of Theorem 1 follows the work of Rush and Venkataramanan [14], with modifications for the dependent structure of the unknown vector $\beta$ in (1). For this reason, we use much of the same notation. We prove Theorem 1 using a technical lemma, Lemma 4, which corresponds to [14, Lemma 6]. Before stating the lemma, we cover some preliminary results and establish notation to be used in its proof.

III-A Proof Notation

As in the previous work by Bayati and Montanari [4], as well as the work by Rush and Venkataramanan [14], the technical lemma is proved for a more general recursion, with AMP being a specific example of the general recursion as shown below. The connection between AMP and the general recursion will be explained in (25) and (26).

Fix the half-window size $0\leq k\leq(N-1)/2$ , an integer. Let $\{f_{t}\}_{t\geq 0}:\mathbb{R}^{\Lambda\times\Lambda}\to\mathbb{R}$ and $\{g_{t}\}_{t\geq 0}:\mathbb{R}^{2}\to\mathbb{R}$ be sequences of Lipschitz functions. Specifically, the arguments of $f_{t}$ are two variables in $\mathbb{R}^{\Lambda}$ , for example, for $x,y\in\mathbb{R}^{\Lambda}$ , we write $f_{t}(x,y)$ and call $x$ the first argument of $f_{t}$ . Given noise $w\in\mathbb{R}^{n}$ and unknown signal $\beta\in E^{\Gamma}$ , define vectors $h^{t+1},q^{t+1}\in\mathbb{R}^{|\Gamma|}$ and $b^{t},m^{t}\in\mathbb{R}^{n}$ , as well as arrays $\hat{h}^{t+1},\hat{q}^{t+1}\in\mathbb{R}^{\Gamma}$ (for which $h^{t+1}$ and $q^{t+1}$ are the vectorized versions) for $t\geq 0$ recursively as follows. Starting with initial condition $\hat{q}^{0}\in\mathbb{R}^{\Gamma}$ :

[TABLE]

with the scalars $\xi_{t},\lambda_{t}$ defined as

[TABLE]

where the derivative of $g_{t}$ is w.r.t. the first argument, and the derivative of $f_{t}$ is w.r.t. the center coordinate of the first argument. In the context of AMP, as made explicit in (25), the terms $\hat{h}^{t+1}$ and $\hat{q}^{t}$ measure the error in the observation $\mathcal{V}^{-1}(A^{*}z^{t})+\beta^{t}$ and the estimate $\beta^{t}$ at time $t$ , respectively, (the error w.r.t. the true $\beta$ ). The term $m^{t}$ measures the residual at time $t$ and the term $b^{t}$ is the difference between the noise and residual at time $t$ .

Recall that the unknown vector $\beta\in E^{\Gamma}$ is assumed to have a stationary MRF prior with joint distribution measure $\mu$ . Let $\beta\in E^{\Gamma}\sim\mu$ and $\mathbf{0}\in\mathbb{R}^{\Lambda}$ be an all-zero array. Define

[TABLE]

Further, for all $i\in\Gamma$ let

[TABLE]

and assume that there exist constants $K,\kappa>0$ such that

[TABLE]

Define the state evolution scalars $\{\tau_{t}^{2}\}_{t\geq 0}$ and $\{\sigma_{t}^{2}\}_{t\geq 1}$ for the general recursion as follows,

[TABLE]

where random variables $W\sim p_{w}$ and $Z\sim\mathcal{N}(0,1)$ are independent and random arrays $\beta\in E^{\Gamma}\sim\mu$ and $\mathbf{Z}\in\mathbb{R}^{\Gamma}$ with i.i.d. $\mathcal{N}(0,1)$ entries are also independent. We assume that both $\sigma_{0}^{2}$ and $\tau_{0}^{2}$ are strictly positive. The technical lemma will show that $\hat{h}^{t+1}$ can be approximated as i.i.d. $\mathcal{N}(0,\tau_{t}^{2})$ in functions of interest for the problem, namely when used as an input to PL functions, and $b^{t}$ can be approximated as i.i.d. $\mathcal{N}(0,\sigma_{t}^{2})$ in PL functions. Moreover, it will be shown that the probability of the deviations of the quantities $\frac{1}{n}\|m^{t}\|^{2}$ and $\frac{1}{n}\|\hat{q}^{t}\|^{2}$ from $\tau_{t}^{2}$ and $\sigma_{t}^{2}$, respectively, decays exponentially in $n$.

We note that the AMP algorithm introduced in (4) and (5) is a special case of the general recursion of (18) and (19). Indeed, define the following vectors recursively for $t\geq 0$ , starting with $\beta^{0}=0$ and $z^{0}=y$ ,

[TABLE]

It can be verified that these vectors satisfy (18) and (19) using Lipschitz functions

[TABLE]

where $a\in\mathbb{R}^{\Lambda}$ and $b\in\mathbb{R}$ . Using the choice of $f_{t},g_{t}$ given in (26) also yields the expressions for $\sigma_{t}^{2},\tau_{t}^{2}$ given in (11). In the remaining analysis, the general recursion given in (18) and (19) is used. Note that in AMP, $q^{0}=-\beta$ and $\sigma_{0}^{2}=\sigma_{\beta}^{2}/\delta$ , hence, assumption (23) for AMP requires

[TABLE]

Under our assumptions for $\beta$ as stated in Section II-A, we see that (27) is satisfied using Lemma C.2 (Appendix C), since the function $f(x)=x^{2}$ is pseudo-Lipschitz. Finally, note that if we assume $\sigma_{\beta}^{2}>0$ and $\delta<\infty$ , then the condition of strict positivity of $\sigma_{0}^{2}$ and $\tau_{0}^{2}$ defined in (24) is satisfied.

Let $[c_{1}\mid c_{2}\mid\ldots\mid c_{k}]$ denote a matrix with columns $c_{1},\ldots,c_{k}$ . For $t\geq 1$ , define matrices

[TABLE]

Moreover, $M_{0}$ , $Q_{0}$ , $B_{0}$ , $H_{0}$ are defined to be the all-zero vector.

The values $m^{t}_{\|}$ and $q^{t}_{\|}$ are projections of $m^{t}$ and $q^{t}$ onto the column space of $M_{t}$ and $Q_{t}$ , with $m^{t}_{\perp}:=m^{t}-m^{t}_{\|},$ and $q^{t}_{\perp}:=q^{t}-q^{t}_{\|}$ being the projections onto the orthogonal complements of $M_{t}$ and $Q_{t}$ . Finally, define the vectors

[TABLE]

to be the coefficient vectors of the parallel projections, i.e.,

[TABLE]

The technical lemma, Lemma 4, shows that for large $n$ , the entries of the vectors $\alpha^{t}$ and ${\gamma}^{t}$ concentrate to constant values, which are defined in the following section.

III-B Concentrating Constants

Recall that $\beta\in E^{\Gamma}$ is the unknown vector to be recovered and $w\in\mathbb{R}^{n}$ is the measurement noise. In this section we introduce the concentrating values for inner products of pairs of the vectors $\{h^{t},m^{t},q^{t},b^{t}\}$ that are used in Lemma 4.

Let $\{\breve{Z}_{t}\}_{t\geq 0}$ be a sequence of zero-mean jointly Gaussian random variables taking values in $\mathbb{R}$ , and let $\{\tilde{\mathbf{Z}}_{t}\}_{t\geq 0}$ be a sequence of zero-mean jointly Gaussian random arrays taking values in $\mathbb{R}^{\Gamma}$ . The covariance of the two random sequences is defined recursively as follows. For $r,t\geq 0$ , $i,j\in\Gamma$ ,

[TABLE]

where

[TABLE]

Note that both terms of the above (32) are scalar values and we take $f_{0}(\cdot,\beta_{\Lambda_{i}}):=f_{0}(\mathbf{0},\beta_{\Lambda_{i}})$ , the initial condition. Moreover, $\tilde{E}_{t,t}=\sigma_{t}^{2}$ and $\breve{E}_{t,t}=\tau_{t}^{2}$ , as can be seen from (24), thus for all $i\in\Gamma$ , we have $\mathbb{E}[[\tilde{\mathbf{Z}}_{t}]_{i}^{2}]=\mathbb{E}[\breve{Z}^{2}_{t}]=1$ . Therefore, $\tilde{\mathbf{Z}}_{t}$ has i.i.d. $\mathcal{N}(0,1)$ entries.

Next, we define matrices $\tilde{C}^{t},\breve{C}^{t}\in\mathbb{R}^{t\times t}$ and vectors $\tilde{E}_{t},\breve{E}_{t}\in\mathbb{R}^{t}$ whose entries are $\{\tilde{E}_{r,t}\}_{r,t\geq 0}$ and $\{\breve{E}_{r,t}\}_{r,t\geq 0}$ defined in (32): for $0\leq i,j\leq t-1$ ,

[TABLE]

Lemma 1 below shows that $\tilde{C}^{t}$ and $\breve{C}^{t}$ are invertible. Therefore, we can define the concentrating values for $\gamma^{t}$ and $\alpha^{t}$ defined in (29) as

[TABLE]

as well as the values of $(\sigma_{t}^{\perp})^{2}$ and $(\tau^{\perp}_{t})^{2}$ for $t>0$ :

[TABLE]

For $t=0$ , we let $(\sigma^{\perp}_{0})^{2}:=\sigma_{0}^{2}$ and $(\tau^{\perp}_{0})^{2}:=\tau_{0}^{2}$ . Finally, define the concentrating values for $\lambda_{t+1}$ and $\xi_{t}$ defined in (19) as

[TABLE]

Lemma 1.

If $(\sigma_{k}^{\perp})^{2}$ and $(\tau_{k}^{\perp})^{2}$ are bounded below by some positive constants for $k\leq t$ , then the matrices $\tilde{C}^{k+1}$ and $\breve{C}^{k+1}$ defined in (33) are invertible for $k\leq t$ .

Proof.

The proof follows directly as that of [14, Lemma 1] and therefore is not restated here. To see that this is the case, note that the proof of [14, Lemma 1] relies only on the relationship between $(\sigma_{k}^{\perp})^{2}$ (resp. $(\tau_{k}^{\perp})^{2}$ ) and $\tilde{C}^{k}$ (resp. $\breve{C}^{k}$ ) as defined in [14, (4.19)], which is the same as (36), and not the actual values taken by these objects. Therefore, the proof for [14, Lemma 1] applies here.

∎

III-C Conditional Distribution Lemma

As mentioned previously, the proof of Theorem 1 relies on a technical lemma, Lemma 4, stated in Section III-D and proved in Section IV. Lemma 4 uses the conditional distribution of the vectors $h^{t+1}$ and $b^{t}$ given the matrices in (28) as well as $\beta,w$. Two forms of the conditional distribution of $h^{t+1}$ will be provided in Lemmas 2 and 3, which correspond to [14, Lemma 4] and [14, Lemma 5], respectively. Lemma 3 explicitly shows that the conditional distribution of $h^{t+1}$ can be represented as the sum of a standard Gaussian vector and a deviation term, where the explicit expression of the deviation term is provided in Lemma 2. Then Lemma 4 shows that the deviation term is small, meaning that its normalized Euclidean norm concentrates on zero, and also provides concentration results for various inner products involving the other terms in recursion (18), namely $\{h^{t+1},q^{t},b^{t},m^{t}\}$.

The following notation is used. Considering two random variables $X,Y$ and a sigma-algebra $\mathscr{S}$ , we denote the relationship that the conditional distribution of $X$ given $\mathscr{S}$ equals the distribution of $Y$ by $X|_{\mathscr{S}}\stackrel{{\scriptstyle d}}{{=}}Y$ . We represent a $t\times t$ identity matrix as $\mathsf{I}_{t}$ , dropping the $t$ subscript when it is clear from the context. For a matrix $A$ with full column rank, $\mathsf{P}^{\parallel}_{A}:=A(A^{*}A)^{-1}A^{*}$ is the orthogonal projection matrix onto the column space of $A$ , and $\mathsf{P}^{\perp}_{A}:=\mathsf{I}-\mathsf{P}^{\parallel}_{A}$ . Define $\mathscr{S}_{t_{1},t_{2}}$ to be the sigma-algebra generated by the terms

[TABLE]
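The projection notation can be illustrated numerically. The sketch below builds $\mathsf{P}^{\parallel}_{A}$ and $\mathsf{P}^{\perp}_{A}$ for a random full-column-rank matrix and splits a vector into its parallel and perpendicular parts, in the same way that $m^{t}$ is split relative to $M_{t}$. The matrix sizes and the random stand-ins for $M_{t}$ and $m^{t}$ are arbitrary choices for illustration, and the function name `projectors` is hypothetical.

```python
import numpy as np

def projectors(A):
    """Return (P_par, P_perp) for a real matrix A with full column rank."""
    A = np.asarray(A, dtype=float)
    P_par = A @ np.linalg.solve(A.T @ A, A.T)   # A (A^* A)^{-1} A^*
    P_perp = np.eye(A.shape[0]) - P_par
    return P_par, P_perp

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 2))     # stand-in for M_t with two columns
m = rng.standard_normal(6)          # stand-in for m^t
P_par, P_perp = projectors(M)
m_par, m_perp = P_par @ m, P_perp @ m
alpha = np.linalg.solve(M.T @ M, M.T @ m)   # coefficients of the parallel part
```

Here `alpha` plays the role of a coefficient vector of the parallel projection, in the sense that `M @ alpha` reproduces `m_par`.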

Lemma 2.

For the vector $h^{t+1}$ and $b^{t}$ defined in (18), the following conditional distribution holds for $t\geq 1$ :

[TABLE]

where $Z_{0},Z_{t}\in\mathbb{R}^{|\Gamma|}$ and $Z^{\prime}_{0},Z^{\prime}_{t}\in\mathbb{R}^{n}$ are i.i.d. standard Gaussian random vectors that are independent of the corresponding conditioning sigma algebras. The terms $\hat{\gamma}^{t}_{i}$ and $\hat{\alpha}^{t}_{i}$ for $i=0,\ldots,t-1$ are defined in (35) and the terms $(\tau_{t}^{\perp})^{2}$ and $(\sigma_{t}^{\perp})^{2}$ in (36). The deviation terms are

[TABLE]

where $\mathsf{I}$ is the identity matrix and for any matrix $A$ , $\mathsf{P}^{\parallel}_{A}$ is the orthogonal projection matrix onto the column space of $A$ . For $t>0$ , defining $\mathbf{M}_{t}:=\frac{1}{n}M_{t}^{*}M_{t}$ and $\mathbf{Q}_{t+1}:=\frac{1}{n}Q_{t+1}^{*}Q_{t+1}$ ,

[TABLE]

Proof.

As in [14], the key theoretical insight in the proof is to study the distribution of $A$ conditioned on the sigma algebra $\mathscr{S}_{t_{1},t}$ where $t_{1}$ is either $t+1$ or $t$ , meaning one treats $A$ as random, and considers the output of the AMP algorithm up until the current iteration as fixed and given in the sigma-algebra. This is done by observing that conditioning on $\mathscr{S}_{t_{1},t}$ is equivalent to conditioning on the linear constraints

[TABLE]

(due to the relationships $b^{s}=Aq^{s}-\lambda_{s}m^{s-1}$ and $h^{r+1}=A^{*}m^{r}-\xi_{r}q^{r}$ for $0\leq s\leq t$ and $0\leq r\leq t-1$ given in (18)). Then it is straightforward to characterize the conditional distribution of a Gaussian matrix given linear constraints.

Since the sequences $\{\lambda_{s};\xi_{s};b^{s};q^{s};h^{s};m^{s}\}_{0\leq s\leq t}$ are all in the conditioning $\mathscr{S}_{t+1,t}$ , this conditional distribution depends only on the relationship between the matrix $A$ and these fixed, given terms. This relationship, namely that specified via $b^{s}=Aq^{s}-\lambda_{s}m^{s-1}$ and $h^{r+1}=A^{*}m^{r}-\xi_{r}q^{r}$ , is the same here as in [14, Lemma 3], and so the proofs are identical. We therefore do not repeat the details. Although $\{\lambda_{s}\}_{0\leq s\leq t}$ and $\{q^{s}\}_{0\leq s\leq t}$ are obtained from $\{f_{s}\}_{0\leq s\leq t}$ , which is separable in [14], but non-separable in our case, $\{\lambda_{s}\}_{0\leq s\leq t}$ and $\{q^{s}\}_{0\leq s\leq t}$ are simply treated as fixed elements in the conditioning sigma-algebra $\mathscr{S}_{t+1,t}$ and the fact that they are calculated via non-separable functions here does not change the proof.

Then one is able to specify the conditional distributions of $b^{t}$ and $h^{t+1}$ given $\mathscr{S}_{t,t}$ and $\mathscr{S}_{t+1,t}$ , respectively, using the conditional distribution of $A$ . Again since the relationship between $b^{t}$ and $h^{t+1}$ and $A$ is the same here as in [14], the details are identical to that provided in the proof of [14, Lemma 4] and are not repeated here. ∎

Note that Lemma 2 holds only when $Q^{*}_{t+1}Q_{t+1}$ is invertible. The following lemma provides an alternative representation of the conditional distribution of $h^{t+1}|_{\mathscr{S}_{t+1,t}}$ for $t\geq 0$ , and it explicitly shows that $h^{t+1}|_{\mathscr{S}_{t+1,t}}$ is distributed as an i.i.d. Gaussian random vector with $\mathcal{N}(0,\tau_{t}^{2})$ entries plus a deviation term.

Lemma 3.

For $t\geq 0$ , let $Z_{t}\in\mathbb{R}^{|\Gamma|}$ be i.i.d. standard normal random vectors. Let $h_{\mathsf{pure}}^{1}:=\tau_{0}Z_{0}$ . For $t\geq 1$ , recursively define

[TABLE]

and a set of scalars $\{\mathsf{d}^{t}_{i}\}_{0\leq i\leq t}$ with $\mathsf{d}^{0}_{0}=1$ ,

[TABLE]

Let $\hat{h}^{t+1}_{\mathsf{pure}}=\mathcal{V}^{-1}(h^{t+1}_{\mathsf{pure}})\in\mathbb{R}^{\Gamma}$ . Then for all $t\geq 0$ we have

[TABLE]

where $\{\tilde{\mathbf{Z}}_{t}\}_{t\geq 0}$ are jointly Gaussian with correlation structure defined in (31). Moreover,

[TABLE]

Proof.

First, we prove (46) by induction. For $t=0$, $\hat{h}^{1}_{\mathsf{pure}}=\tau_{0}\mathcal{V}^{-1}(Z_{0})\overset{d}{=}\tau_{0}\tilde{\mathbf{Z}}_{0}$. As the inductive hypothesis, assume $(\hat{h}^{1}_{\mathsf{pure}},\ldots,\hat{h}^{t}_{\mathsf{pure}})\overset{d}{=}(\tau_{0}\tilde{\mathbf{Z}}_{0},\ldots,\tau_{t-1}\tilde{\mathbf{Z}}_{t-1})$. By (44), the term $\hat{h}^{t+1}_{\mathsf{pure}}$ is equal in distribution to $\sum_{r=0}^{t-1}\hat{\alpha}^{t}_{r}\tau_{r}\tilde{\mathbf{Z}}_{r}+\tau_{t}^{\perp}\mathbf{Z}$, where $\mathbf{Z}\in\mathbb{R}^{\Gamma}$ has i.i.d. standard normal entries and is independent of $\tilde{\mathbf{Z}}_{r}$ for all $r=0,\ldots,t-1$. In what follows, we show

[TABLE]

Note that $\tilde{\mathbf{Z}}_{0},\ldots,\tilde{\mathbf{Z}}_{t-1},\mathbf{Z}$ are all zero-mean Gaussian, and therefore so is the sum. We now study the variance and covariance of $\sum_{r=0}^{t-1}\hat{\alpha}_{r}^{t}\tau_{r}\tilde{\mathbf{Z}}_{r}+\tau_{t}^{\perp}\mathbf{Z}$ by demonstrating the following two results:

(i)

For all $i,j\in\Gamma$ ,

[TABLE]

(ii)

For $0\leq s\leq(t-1)$ and all $i,j\in\Gamma$ ,

[TABLE]

First consider (i). We note,

[TABLE]

In the above, step $(a)$ follows from the fact that $\mathbf{Z}$ is independent of $\tilde{\mathbf{Z}}_{0},\ldots,\tilde{\mathbf{Z}}_{t-1}$ , step $(b)$ from the covariance definition (31) and the i.i.d. standard normal nature of elements of $\mathbf{Z}$ , and step $(c)$ from

[TABLE]

Next, consider (ii). We see that

[TABLE]

In the above, step $(a)$ follows since $\mathbf{Z}$ is independent of $\tilde{\mathbf{Z}}_{s}$ and step $(b)$ from (31). Finally, notice that $\sum_{r=0}^{t-1}\breve{E}_{s,r}\hat{\alpha}^{t}_{r}=[\breve{C}^{t}\hat{\alpha}^{t}]_{s+1}=\breve{E}_{s,t},$ where the first equality holds since the sum equals the inner product of the $(s+1)^{th}$ row of $\breve{C}^{t}$ with $\hat{\alpha}^{t}$ and the second equality by definition of $\hat{\alpha}^{t}$ in (35).
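The variance and covariance identities in (i) and (ii) reduce to linear algebra on the covariance matrix, which can be checked numerically. The sketch below assumes that (35) and (36) take the forms $\hat{\alpha}^{t}=(\breve{C}^{t})^{-1}\breve{E}_{t}$ and $(\tau_{t}^{\perp})^{2}=\tau_{t}^{2}-(\hat{\alpha}^{t})^{*}\breve{E}_{t}$, consistent with the identity $\breve{C}^{t}\hat{\alpha}^{t}=\breve{E}_{t}$ used above. The covariance matrix `E` is a randomly generated, hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 3
G = rng.standard_normal((t + 1, t + 1))
E = G @ G.T + 0.1 * np.eye(t + 1)   # hypothetical positive-definite covariance, entries E[r, s]

C = E[:t, :t]                        # plays the role of breve{C}^t
e = E[:t, t]                         # plays the role of the vector breve{E}_t
tau_t2 = E[t, t]                     # plays the role of tau_t^2

alpha_hat = np.linalg.solve(C, e)    # assumed form of (35)
tau_perp2 = tau_t2 - e @ alpha_hat   # assumed form of (36)

# Variance of sum_r alpha_r * (tau_r Z_r) + tau_perp * Z, with Z independent of the rest.
var_new = alpha_hat @ C @ alpha_hat + tau_perp2
# Covariance of that combination with each earlier term tau_s Z_s.
cov_new = C @ alpha_hat
```

The combination has variance `tau_t2` and covariance `e` with the earlier terms, so appending it extends the jointly Gaussian sequence with the prescribed covariance structure. The quantity `tau_perp2` is a Schur complement of a positive-definite matrix and is therefore positive in this example.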

Next, we prove (47), also by induction. For $t=0$ , by (38) we have $h^{t+1}|_{\mathscr{S}_{t+1,t}}\overset{d}{=}\tau_{0}Z_{0}+\Delta_{1,0}\overset{d}{=}h_{\mathsf{pure}}^{1}+\Delta_{1,0}$ . Assume that $h^{r+1}|_{\mathscr{S}_{t+1,t}}\overset{d}{=}h^{r+1}_{\mathsf{pure}}+\sum_{i=0}^{r}\mathsf{d}^{r}_{i}\Delta_{i+1,i}$ holds for $r=0,\ldots,t-1$ as the inductive hypothesis. Then,

[TABLE]

In the above, the first equality uses (38) and the second the inductive hypothesis. The last equality follows by noticing that $\sum_{r=0}^{t-1}\sum_{i=0}^{r}v_{r,i}=\sum_{i=0}^{t-1}\sum_{r=i}^{t-1}v_{r,i}$ for any array $(v_{r,i})_{0\leq i,r\leq t-1}$ and using (45). ∎

III-D Main Concentration Lemma

Lemma 4.

We use the shorthand $X_{n}\doteq c$ to denote the concentration inequality $P(\left\lvert X_{n}-c\right\rvert\geq\epsilon)\leq K_{k,t}e^{-\kappa_{k,t}n\epsilon^{2}}$ , where $K_{k,t},\kappa_{k,t}$ denote constants depending on the iteration index $t$ and the fixed half-window size $k$ , but not on $n$ or $\epsilon$ . The following statements hold for $0\leq t<T^{*}$ and $\epsilon\in(0,1)$ .

(a)

For $\Delta_{t+1,t}$ defined in (41) and (43),

[TABLE]

(b)

For (order-2) pseudo-Lipschitz functions $\phi_{h}:\mathbb{R}^{(t+2)|\Lambda|}\to\mathbb{R}$ ,

[TABLE]

The random vectors $\tilde{\mathbf{Z}}_{0},\ldots,\tilde{\mathbf{Z}}_{t}\in\mathbb{R}^{\Gamma}$ are jointly Gaussian with zero mean entries, which are independent of the other entries in the same vector with covariance across iterations given by (31), and are independent of $\beta\sim\mu$.

(c)

Recall that the operator $\mathcal{V}$ rearranges the elements of an array into a vector,

[TABLE]

(d)

For all $0\leq r\leq t$ ,

[TABLE]

(e)

For all $0\leq r\leq t$ ,

[TABLE]

(f)

For all $0\leq r\leq t$ ,

[TABLE]

(g)

For $\textbf{Q}_{t+1}=\frac{1}{n}Q_{t+1}^{*}Q_{t+1}$ and $\textbf{M}_{t}=\frac{1}{n}M_{t}^{*}M_{t}$ , when the inverses exist, for all $0\leq i,j\leq t$ and $0\leq i^{\prime},j^{\prime}\leq t-1$ :

[TABLE]

where $\hat{\gamma}^{t+1}_{i}$ and $\hat{\alpha}^{t}_{i^{\prime}}$ are defined in (35).

(h)

With $\sigma_{t+1}^{\perp},\tau_{t}^{\perp}$ defined in (36),

[TABLE]

III-E Proof of Theorem 1

Proof.

Applying part (b) of Lemma 4 to a PL(2) function $\phi_{h}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ ,

[TABLE]

where the random field $\beta\in E^{\Gamma}\sim\mu$ is independent of $Z\in\mathbb{R}^{\Gamma}$ having i.i.d. standard normal entries. Now for $i\in\Gamma$ let

[TABLE]

where $\phi:\mathbb{R}^{2}\to\mathbb{R}$ is the PL(2) function in the statement of the theorem. The function $\phi_{h}(\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})$ in (62) is PL(2) since $\phi$ is PL(2) and $\eta_{t}$ is Lipschitz. We therefore obtain

[TABLE]

The proof is completed by noting from (5) and (25) that $\beta^{t+1}_{i}=\eta_{t}([\mathcal{V}^{-1}\left(A^{*}z^{t}\right)+\beta^{t}]_{\Lambda_{i}})=\eta_{t}(\beta_{\Lambda_{i}}-\hat{h}^{t+1}_{\Lambda_{i}})$ . ∎

IV Proof of Lemma 4

We make use of concentration results listed in Appendices A, B, and C, where Appendix C contains concentration results for dependent variables that are needed to establish the new results in this paper. Note that the lemmas stated in the appendices are labeled by capital letters with numbers (e.g., Lemma A.1), whereas the lemmas stated in the body are labeled by numbers (e.g., Lemma 1).

The proof of Lemma 4 proceeds by induction on $t$ . We label as $\mathcal{H}_{t+1}$ the results (48), (49), (50), (52), (54), (56), (58), (60) and similarly as $\mathcal{B}_{t}$ the results (51), (53), (55), (57), (59), (61). The proof consists of four steps: (1) proving that $\mathcal{B}_{0}$ holds, (2) proving that $\mathcal{H}_{1}$ holds, (3) assuming that $\mathcal{B}_{r}$ and $\mathcal{H}_{s}$ hold for all $r<t$ and $s\leq t$ , then proving that $\mathcal{B}_{t}$ holds, and (4) assuming that $\mathcal{B}_{r}$ and $\mathcal{H}_{s}$ hold for all $r\leq t$ and $s\leq t$ , then proving that $\mathcal{H}_{t+1}$ holds.

The proofs of steps (1) and (3), the $\mathcal{B}$ steps, follow as in [14]. To see that this is the case, notice that in the proof for [14, Lemma 6], the results given by $\mathcal{B}_{t}(b)$ - $\mathcal{B}_{t}(h)$ involve the sequence of functions $\{g_{r}\}_{0\leq r\leq t}$ and the sequence of vectors $\{b^{r}\}_{0\leq r\leq t}$ . In our case, the definition of the (separable) functions $g_{t}(\cdot)$ in (18) is the same as that in [14, (4.1)] and the conditional distribution of $\{b^{r}\}_{0\leq r\leq t}$ given in Lemma 2 has the same expression as that in [14, Lemma 6]. Therefore, the proof of $\mathcal{B}_{t}(b)$ - $\mathcal{B}_{t}(h)$ in [14] is directly applicable here.

Now consider $\mathcal{B}_{t}(a)$ . When $t=0$ , it only involves $\|q^{0}\|$ , which has the same assumption in our case and in [14]. When $t>0$ , the proof uses the induction hypotheses $\mathcal{B}_{0}(d)$ - $\mathcal{B}_{t-1}(d)$ , $\mathcal{B}_{t-1}(e)$ , and $\mathcal{H}_{t}(g),\mathcal{H}_{t}(h)$ . The statements of those hypotheses have the same form as in [14]. Although the concentration constants in those hypotheses have different definitions in this paper due to the use of non-separable denoisers, this does not change the proof for $\mathcal{B}_{t}(a)$ , since the actual values of the concentration constants are not involved in the proof. Therefore, we do not repeat the proof of steps (1) and (3) here. In what follows, we only show steps (2) and (4).

For each step, in parts $(a)$ – $(h)$ of the proof, we use $K$ and $\kappa$ to label universal constants, meaning that they do not depend on $n$ or $\epsilon$ , but may depend on $t$ and $k$ , in the concentration upper bounds.

IV-A Step 2: Showing that $\mathcal{H}_{1}$ holds

Throughout the proof we will make use of a function $\mathcal{S}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ that selects the center coordinate of its argument. For example, for $v\in\mathbb{R}^{|\Gamma|}$ ,

[TABLE]

We will only use $\mathcal{S}$ in cases where such a “center point” is well-defined. Notice that $\mathcal{S}$ is Lipschitz, since $|\mathcal{S}(x)-\mathcal{S}(x^{\prime})|=|x_{c}-x_{c}^{\prime}|\leq\|x-x^{\prime}\|,$ for all $x,x^{\prime}\in\mathbb{R}^{\Lambda}$ , where $x_{c}$ (respectively, $x^{\prime}_{c}$ ) is the center coordinate of $x$ (respectively, $x^{\prime}$ ). Moreover, if a function $f:\mathbb{R}^{\Lambda\times\tilde{\Lambda}}\to\mathbb{R}$ is defined as $f(x,y):=\mathcal{S}(x)$ with arbitrary but fixed $\tilde{\Lambda}$ , then $f$ is Lipschitz, because $|f(x,y)-f(x^{\prime},y^{\prime})|=|\mathcal{S}(x)-\mathcal{S}(x^{\prime})|\leq\|x-x^{\prime}\|\leq\|(x,y)-(x^{\prime},y^{\prime})\|$ . We are now ready to prove $\mathcal{H}_{1}(a)-\mathcal{H}_{1}(h)$ .
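The Lipschitz property of the center-selection operator can be illustrated in one dimension. The sketch below assumes a window of odd length $2k+1$ stored as a Python list; `select_center` and `euclidean_distance` are hypothetical helper names standing in for $\mathcal{S}$ and the Euclidean norm of a difference.

```python
import math
import random

def select_center(window):
    """Return the center entry of an odd-length 1-D window (a stand-in for S)."""
    assert len(window) % 2 == 1, "a center is only defined for odd length"
    return window[len(window) // 2]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(1)
k = 2  # half-window size, so each window has 2k + 1 entries
lipschitz_ok = True
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(2 * k + 1)]
    y = [random.gauss(0, 1) for _ in range(2 * k + 1)]
    gap = abs(select_center(x) - select_center(y))
    # |S(x) - S(y)| is one coordinate difference, hence at most ||x - y||.
    lipschitz_ok = lipschitz_ok and gap <= euclidean_distance(x, y) + 1e-12
```

The inequality holds deterministically, since the absolute value of a single coordinate of $x-x^{\prime}$ never exceeds $\|x-x^{\prime}\|$; the random draws only exercise it.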

(a) The definition of $\Delta_{1,0}$ is given in (41). First notice that by Lemma B.1, we have $\mathsf{P}_{q^{0}}^{\parallel}Z_{0}\overset{d}{=}\frac{q^{0}}{\|q^{0}\|}\bar{Z}_{0}$ , where $\bar{Z}_{0}$ is a standard normal random variable on $\mathbb{R}$ . Using this fact and (41), we have

[TABLE]

By applying the triangle inequality to the norm of the RHS of (64) and then applying Lemma A.1,

[TABLE]

Label the three terms on the right-hand side (RHS) of (65) as $T_{1}-T_{3}$ . We will show that each term is bounded by $Ke^{-\kappa n\epsilon}$ .

First consider $T_{1}$ .

[TABLE]

where step $(b)$ follows by Lemma A.3, Lemma A.7, and $\mathcal{B}_{0}(e)$ . To see that step $(a)$ in (66) holds, we notice that

[TABLE]

since if the two events on the left-hand side (LHS) of (67) hold, then using that $\epsilon<1$ ,

[TABLE]

Taking the complement on both sides of (67),

[TABLE]

Then step $(a)$ in (66) follows by the union bound.

Next consider $T_{2}$ .

[TABLE]

where step $(a)$ follows by similar justification as that for step $(a)$ in (66) and step $(b)$ follows by Lemma A.3, Lemma A.6, and $\mathcal{B}_{0}(e)$ .

Finally consider $T_{3}$ .

[TABLE]

where step $(a)$ follows by Lemma A.1 and step $(b)$ follows by Lemma A.2, $\mathcal{B}_{0}(f)$ , and the assumption on $\|q^{0}\|$ given in (23).

(b) Let $\hat{Z}_{0}:=\mathcal{V}^{-1}(Z_{0})\in\mathbb{R}^{\Gamma}$ and $\hat{\Delta}_{1,0}:=\mathcal{V}^{-1}(\Delta_{1,0})\in\mathbb{R}^{\Gamma}$ be the array versions of the vectors $Z_{0}$ and $\Delta_{1,0}$ , respectively. For $t=0$ , the left-hand side (LHS) of (49) can be bounded as

[TABLE]

Step $(a)$ follows from the conditional distribution of $h^{1}$ given in Lemma 2 (38) and since $\hat{h}^{1}=\mathcal{V}^{-1}(h^{1})$ and $\tau_{0}\hat{Z}_{0}+\hat{\Delta}_{1,0}=\mathcal{V}^{-1}(\tau_{0}Z_{0}+\Delta_{1,0})$ . Step $(b)$ follows from Lemma A.1. Label the terms on the RHS of (68) as $T_{1}-T_{3}$ . We show that each of these terms is bounded above by $Ke^{-\kappa n\epsilon^{2}}.$

First, consider $T_{1}$ . Recall the definition of the functions $\mathcal{T}_{i}$ for $i\in\Gamma$ in (8), which extends an array in $\mathbb{R}^{\Lambda_{i}\cap\Gamma}$ to an array in $\mathbb{R}^{\Lambda}$ by defining the extended entries to be the average of the entries in the original array. For arbitrary but fixed $s\in\mathbb{R}^{\Lambda}$ , the function $\tilde{\phi}_{h,i}:\mathbb{R}^{\Lambda_{i}\cap\Gamma\times\Lambda}\to\mathbb{R}$ defined as $\tilde{\phi}_{h,i}(v,s):=\phi_{h}(\mathcal{T}_{i}(v),s)$ is PL(2) by Lemma B.5. Then it follows from Lemma B.4 that the function $\phi_{1,i}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ defined as $\phi_{1,i}(s):=\mathbb{E}_{\tilde{\mathbf{Z}}_{0}}[\tilde{\phi}_{h,i}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\cap\Gamma},s)]$ is PL(2), since $[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\cap\Gamma}$ is an array of i.i.d. standard normal random variables for all $i\in\Gamma$ . Notice that $\mathbb{E}_{\tilde{\mathbf{Z}}_{0}}[\tilde{\phi}_{h,i}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\cap\Gamma},s)]=\mathbb{E}_{\tilde{\mathbf{Z}}_{0}}[\phi_{h}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},s)]$ by the definition of $\tilde{\phi}_{h,i}$ and $\mathcal{T}_{i}$ . Therefore,

[TABLE]

where in step $(a)$ we use the definition of $\mathcal{T}_{i}$ in (8) and step $(b)$ follows from Lemma C.2 by noticing from Lemma B.5 that the function $\phi_{1,i}(\mathcal{T}_{i}(\cdot))$ is PL(2) for all $i\in\Gamma$ .

Next, consider $T_{2}$ . We use iterated expectation to condition on the value of $\beta$ . Then $T_{2}$ can be expressed as an expectation as follows,

[TABLE]

Define the function $f:E^{\Gamma}\to[0,1]$ as

[TABLE]

For any fixed $a\in E^{\Gamma}$ , define a function $\phi_{2,i}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ as $\phi_{2,i}(s):=\phi_{h}(s,a_{\Lambda_{i}})$ for each $i\in\Gamma$ and note that it is PL(2) with PL constant upper-bounded by $L(1+2\sqrt{|\Lambda|}M)$ , where $L$ is the PL constant for $\phi_{h}$ and $M$ is such that $|x|\leq M$ for all $x\in E$ , since by the pseudo-Lipschitz property of $\phi_{h}$ and the triangle inequality,

[TABLE]

Using $L(1+2\sqrt{|\Lambda|}M)$ as the PL constant for $\phi_{2,i}$ for all $i\in\Gamma$ , then

[TABLE]

where the last inequality follows from Lemma C.4 by noticing that $\mathbb{E}_{\tilde{\mathbf{Z}}_{0}}[\phi_{h}(\tau_{0}\mathcal{T}_{i}([\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\cap\Gamma}),a_{\Lambda_{i}})]=\mathbb{E}_{\hat{Z}_{0}}[\phi_{h}(\tau_{0}\mathcal{T}_{i}([\hat{Z}_{0}]_{\Lambda_{i}\cap\Gamma}),a_{\Lambda_{i}})]$ . Therefore, $T_{2}=\mathbb{E}[f(\beta)]\leq Ke^{-\kappa|\Gamma|\epsilon^{2}}$ , since $K$ and $\kappa$ do not depend on $\beta$ (which does not appear in the pseudo-Lipschitz constant $L(1+2\sqrt{|\Lambda|}M)$ ).

Finally, consider $T_{3}$ , the third term on the RHS of (68).

[TABLE]

Step $(a)$ follows from the fact that $\phi_{h}$ is PL(2). Step $(b)$ uses $||[\tau_{0}\hat{Z}_{0}+\hat{\Delta}_{1,0}]_{\Lambda_{i}}||\leq||\tau_{0}[\hat{Z}_{0}]_{\Lambda_{i}}||+||[\hat{\Delta}_{1,0}]_{\Lambda_{i}}||$ by the triangle inequality, the Cauchy-Schwarz inequality, the fact that for $a\in\mathbb{R}^{\Gamma}$ , $\sum_{i\in\Gamma}\left\lVert a_{\Lambda_{i}}\right\rVert^{2}\leq 2d\left\lVert a\right\rVert^{2}$ , where $d=|\Lambda|=(2k+1)^{p}$ , and the following application of Lemma B.6:

[TABLE]
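The windowed-energy bound $\sum_{i\in\Gamma}\left\lVert a_{\Lambda_{i}}\right\rVert^{2}\leq 2d\left\lVert a\right\rVert^{2}$ used in step $(b)$ can be illustrated with a minimal one-dimensional sketch ( $p=1$ ). It assumes that boundary windows are completed with the average of their in-range entries, as described for $\mathcal{T}_{i}$ above; `extend_window` and `sq_norm` are hypothetical helper names.

```python
import random

def extend_window(a, i, k):
    """Window of half-width k centered at i; out-of-range entries are
    filled with the average of the in-range entries (1-D analogue of T_i)."""
    n = len(a)
    inside = [a[j] for j in range(i - k, i + k + 1) if 0 <= j < n]
    fill = sum(inside) / len(inside)
    return [a[j] if 0 <= j < n else fill for j in range(i - k, i + k + 1)]

def sq_norm(v):
    return sum(x * x for x in v)

random.seed(2)
n, k = 12, 2
d = 2 * k + 1  # window size |Lambda| in one dimension
a = [random.gauss(0, 1) for _ in range(n)]

# Total energy over all (extended) windows versus the bound 2 d ||a||^2.
windowed_energy = sum(sq_norm(extend_window(a, i, k)) for i in range(n))
bound = 2 * d * sq_norm(a)
```

In this one-dimensional setting each entry of $a$ lies in at most $d$ windows, and the averaged fill-in at most doubles the energy of a boundary window because the number of missing entries does not exceed the number of in-range ones.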

From (69), we have

[TABLE]

where we use Lemma A.7 and $\mathcal{H}_{1}(a)$ to obtain step $(a)$ .

(c) We first show concentration for $\frac{1}{n}(h^{1})^{*}\mathcal{V}(\beta)=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}_{i}^{1}\beta_{i}$ . Let the function $\phi_{1}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ be defined as $\phi_{1}(x,y):=\mathcal{S}(x)\mathcal{S}(y)$ for any $(x,y)\in\mathbb{R}^{\Lambda\times\Lambda}$ , where the operator $\mathcal{S}$ is defined in (63). Then, using the fact that $\phi_{1}(\hat{h}^{1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}_{i}^{1}\beta_{i}$ and $\mathbb{E}[\phi_{1}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\mathbb{E}[[\tau_{0}\tilde{\mathbf{Z}}_{0}]_{i}]\mathbb{E}[\beta_{i}]=0$ for all $i\in\Gamma$ , since $[\tilde{\mathbf{Z}}_{0}]_{i}$ has zero-valued mean and is independent of $\beta_{i}$ , we find

[TABLE]

Finally, note that $\phi_{1}$ is PL(2) by Lemma B.3 since $\mathcal{S}$ is Lipschitz; hence, we can apply $\mathcal{H}_{1}(b)$ to obtain the desired upper bound.

Next, we show concentration for $\frac{1}{n}(h^{1})^{*}q^{0}=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}_{i}^{1}\hat{q}^{0}_{i}$ . Recall, $\hat{q}^{0}_{i}=f_{0}(\mathbf{0},\beta_{\Lambda_{i}})$ for all $i\in\Gamma$ . The function $\phi_{2}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ defined as $\phi_{2}(x,y):=\mathcal{S}(x)f_{0}(\mathbf{0},y)$ is PL(2) by Lemma B.3 since $\mathcal{S}$ and $f_{0}$ are both Lipschitz. Notice that $\phi_{2}(\hat{h}^{1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}_{i}^{1}\hat{q}_{i}^{0}$ and $\mathbb{E}[\phi_{2}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\mathbb{E}[\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{i}]\mathbb{E}[f_{0}(\mathbf{0},\beta_{\Lambda_{i}})]=0$ for all $i\in\Gamma$ since $[\tilde{\mathbf{Z}}_{0}]_{i}$ has zero-valued mean and is independent of $\beta$ . Therefore, using $\mathcal{H}_{1}(b)$ ,

[TABLE]

(d) The function $\phi_{3}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ defined as $\phi_{3}(x,y):=(\mathcal{S}(x))^{2}$ is PL(2) by Lemma B.3 since the operator $\mathcal{S}$ defined in (63) is Lipschitz. Notice that $\frac{1}{|\Gamma|}(h^{1})^{*}h^{1}=\frac{1}{|\Gamma|}\sum_{i\in\Gamma}\phi_{3}(\hat{h}^{1}_{\Lambda_{i}},\beta_{\Lambda_{i}})$ and $\mathbb{E}[\phi_{3}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\tau_{0}^{2}\mathbb{E}[([\tilde{\mathbf{Z}}_{0}]_{i})^{2}]=\tau_{0}^{2}$ for all $i\in\Gamma$ , which follows from the definition of $\tilde{\mathbf{Z}}_{0}$ in (31). Therefore, the result follows using $\mathcal{H}_{1}(b)$ , since

[TABLE]

(e) We prove concentration for $\frac{1}{n}(q^{0})^{*}q^{1}$ , and the result for $\frac{1}{n}(q^{1})^{*}q^{1}$ follows similarly. The function $\phi_{4}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ defined as $\phi_{4}(x,y):=f_{0}(\mathbf{0},y)f_{1}(x,y)$ is PL(2) by Lemma B.3, since $f_{0}$ and $f_{1}$ are Lipschitz. Notice that $\frac{1}{|\Gamma|}\sum_{i\in\Gamma}\mathbb{E}[f_{0}(\mathbf{0},\beta_{\Lambda_{i}})f_{1}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\delta\tilde{E}_{0,1}$ by (32) and $(q^{0})^{*}q^{1}=\sum_{i\in\Gamma}\phi_{4}(\hat{h}^{1}_{\Lambda_{i}},\beta_{\Lambda_{i}})$ . Hence, we have the desired upper bound using $\mathcal{H}_{1}(b)$ , since

[TABLE]

(f) The concentration of $\lambda_{0}$ to $\hat{\lambda}_{0}$ follows from $\mathcal{H}_{1}(b)$ applied to the function $\phi_{h}([h^{1}]_{\Lambda_{i}},\beta_{\Lambda_{i}}):=f_{0}^{\prime}([h^{1}]_{\Lambda_{i}},\beta_{\Lambda_{i}})$ , since $f_{0}^{\prime}$ is assumed to be Lipschitz, hence PL(2).

The only other result to prove is concentration for $\frac{1}{n}(h^{1})^{*}q^{1}=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}_{i}^{1}\hat{q}_{i}^{1}$ . The function $\phi_{5}:\mathbb{R}^{2|\Lambda|}\to\mathbb{R}$ defined as $\phi_{5}(x,y)=\mathcal{S}(x)f_{1}(x,y)$ is PL(2) by Lemma B.3. Notice that $\phi_{5}(\hat{h}^{1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}_{i}^{1}\hat{q}_{i}^{1}$ . Moreover, let the function $\tilde{f}_{i}:\mathbb{R}\to\mathbb{R}$ be defined as $\tilde{f}_{i}(x):=\mathbb{E}_{[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\setminus\{i\}},\beta_{\Lambda_{i}}}[f_{1}(\mathcal{R}(x,[\tau_{0}\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}}),\beta_{\Lambda_{i}})]$ , where the function $\mathcal{R}:\mathbb{R}^{1\times\Lambda}\to\mathbb{R}^{\Lambda}$ replaces the center coordinate of the second argument, which is in $\mathbb{R}^{\Lambda}$ , with the first argument, which is in $\mathbb{R}$ , a scalar. For example, $\tilde{f}_{i}([\tau_{0}\tilde{\mathbf{Z}}_{0}]_{i})=\mathbb{E}_{[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}\setminus\{i\}},\beta_{\Lambda_{i}}}[f_{1}([\tau_{0}\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]$ . Then we have

[TABLE]

In the above, step $(a)$ follows from Stein’s lemma (Lemma B.2), step $(b)$ follows from the definition of $\tilde{\mathbf{Z}}_{0}$ in (31) and the definition of $f_{1}^{\prime}$ , which is the partial derivative w.r.t. the center coordinate of the first argument, and step $(c)$ follows from the definition of $\hat{\lambda}_{1}$ in (37) and the definition of $\breve{E}_{0,0}$ in (32). Therefore, using $\mathcal{H}_{1}(b)$ , we have the desired upper bound, since

[TABLE]

(g) Note that $\mathbf{Q}_{1}=\frac{1}{n}\left\lVert q^{0}\right\rVert^{2}$ and $\tilde{C}^{1}=\tilde{E}_{0,0}=\sigma_{0}^{2}>0$ . By Lemma A.5 and (23),

[TABLE]

By the definitions in Section III-A, $\gamma^{1}_{0}=\frac{1}{n}\mathbf{Q}_{1}^{-1}(q^{0})^{*}q^{1}$ and $\hat{\gamma}_{0}^{1}=(\tilde{C}^{1})^{-1}\tilde{E}_{1}=\tilde{E}_{0,1}\sigma_{0}^{-2}.$ Therefore,

[TABLE]

where $(a)$ follows from Lemma A.2 with $\tilde{\epsilon}:=\min\Big{\{}\sqrt{\frac{\epsilon}{3}},\ \frac{\epsilon}{3\tilde{E}_{0,1}},\ \frac{\epsilon\sigma_{0}^{2}}{3}\Big{\}}$ and $(b)$ from (70) and $\mathcal{H}_{1}(e)$ .

(h) From the definitions in Section III-A, we have $\left\lVert q^{1}_{\perp}\right\rVert^{2}=\left\lVert q^{1}\right\rVert^{2}-\left\lVert q^{1}_{\parallel}\right\rVert^{2}=\left\lVert q^{1}\right\rVert^{2}-(\gamma_{0}^{1})^{2}\left\lVert q^{0}\right\rVert^{2}$ , and $(\sigma_{1}^{\perp})^{2}=\tilde{E}_{1,1}-\tilde{E}^{*}_{1}(\tilde{C}^{1})^{-1}\tilde{E}_{1}=\sigma_{1}^{2}-(\tilde{E}_{0,1})^{2}\tilde{E}_{0,0}^{-1}=\sigma_{1}^{2}-(\hat{\gamma}^{1}_{0})^{2}\sigma_{0}^{2}$ . We therefore have

[TABLE]

where the last inequality is obtained using $\mathcal{H}_{1}(e)$ for bounding the first term and by applying Lemma A.2 to the second term along with the concentration of $\left\lVert q^{0}\right\rVert$ in (23), $\mathcal{H}_{1}(g)$ , and Lemma A.4 (for concentration of the square).
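The orthogonal decomposition behind $\left\lVert q^{1}_{\perp}\right\rVert^{2}=\left\lVert q^{1}\right\rVert^{2}-(\gamma_{0}^{1})^{2}\left\lVert q^{0}\right\rVert^{2}$ can be verified on a small example. The vectors below are arbitrary illustrative values, not quantities from the paper, and `gamma` plays the role of $\gamma^{1}_{0}=(q^{0})^{*}q^{1}/\left\lVert q^{0}\right\rVert^{2}$ .

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

q0 = [1.0, 2.0, -1.0, 0.5]
q1 = [0.5, -1.0, 3.0, 2.0]

gamma = dot(q0, q1) / dot(q0, q0)              # projection coefficient onto q0
q1_par = [gamma * x for x in q0]               # component of q1 along q0
q1_perp = [y - p for y, p in zip(q1, q1_par)]  # orthogonal remainder

lhs = dot(q1_perp, q1_perp)
rhs = dot(q1, q1) - gamma ** 2 * dot(q0, q0)
```

The two sides agree up to rounding because $q^{1}_{\perp}$ is orthogonal to $q^{0}$ , so the Pythagorean identity applies.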

IV-B Step 4: Showing that $\mathcal{H}_{t+1}$ holds

The probability statements in the lemma and the other parts of $\mathcal{H}_{t+1}$ are conditioned on the event that the matrices $\mathbf{Q}_{1},\ldots,\mathbf{Q}_{t+1}$ are invertible, but for the sake of brevity, we do not explicitly state the conditioning in the probabilities. The following lemma will be used to prove $\mathcal{H}_{t+1}$ .

Lemma 5.

Let $v:=\frac{1}{n}B^{*}_{t+1}m_{\perp}^{t}-\frac{1}{n}Q_{t+1}^{*}(\xi_{t}q^{t}-\sum_{i=0}^{t-1}\alpha^{t}_{i}\xi_{i}q^{i})$ and $\mathbf{Q}_{t+1}:=\frac{1}{n}Q_{t+1}^{*}Q_{t+1}$ . Then for $j\in[t+1]$ ,

[TABLE]

Proof.

We can infer from the proof of [14, Lemma 6] that the proof of Lemma 5 involves induction hypotheses $\mathcal{H}_{1}(g)$ - $\mathcal{H}_{t}(g)$ , $\mathcal{H}_{1}(h)$ - $\mathcal{H}_{t}(h)$ , $\mathcal{H}_{t}(e)$ , and $\mathcal{B}_{t}(c),\mathcal{B}_{t}(f),\mathcal{B}_{t}(g)$ . Notice that the statements of these hypotheses have the same form as the corresponding statements in [14]. While the definition of the concentration constants in these hypotheses may be different for separable and non-separable denoisers, the proof uses the result that the quantities concentrate with desired rate rather than what the concentration constants are. Therefore, the proof for [14, Lemma 12], which is similar to the proof of [14, Lemma 6], is directly applicable here. ∎

We are ready to prove $\mathcal{H}_{t+1}(a)-\mathcal{H}_{t+1}(h)$ .

(a) Recall the definition of $\Delta_{t+1,t}$ from Lemma 2 (43). Using Lemma B.1, $\frac{1}{\sqrt{n}}\left\lVert m^{t}_{\perp}\right\rVert\mathsf{P}^{\parallel}_{Q_{t+1}}Z_{t}\overset{d}{=}\frac{1}{\sqrt{n}}\left\lVert m^{t}_{\perp}\right\rVert\frac{1}{\sqrt{|\Gamma|}}\hat{Q}_{t+1}\bar{Z}_{t+1},$ where columns of the matrix $\hat{Q}_{t+1}\in\mathbb{R}^{|\Gamma|\times(t+1)}$ form an orthogonal basis for the column space of $Q_{t+1}$ , which are normalized such that $\hat{Q}_{t+1}^{*}\hat{Q}_{t+1}=|\Gamma|\mathsf{I}_{t+1}$ , and $\bar{Z}_{t+1}\in\mathbb{R}^{t+1}$ is an independent random vector with i.i.d. $\mathcal{N}(0,1)$ entries. We can then write

[TABLE]

where $\mathbf{Q}_{t+1}\in\mathbb{R}^{(t+1)\times(t+1)}$ and $v\in\mathbb{R}^{t+1}$ are defined in Lemma 5. By Lemma B.6,

[TABLE]

where we have used $Q_{t+1}\textbf{Q}_{t+1}^{-1}v=\sum_{j=0}^{t}q^{j}[\textbf{Q}_{t+1}^{-1}v]_{j+1}$ . Applying Lemma A.1 with $\epsilon_{t}=\frac{\epsilon}{(2t+3)^{2}}$ ,

[TABLE]

We now show each of the terms in (71) has the desired upper bound. For $0\leq r\leq t$ ,

[TABLE]

where step $(a)$ follows from similar justification as that for step $(a)$ in (66) and step $(b)$ follows from induction hypotheses $\mathcal{B}_{t}(g)$ , $\mathcal{H}_{1}(d)-\mathcal{H}_{t}(d)$ , and Lemma A.3. Next, the second term in (71) is bounded as

[TABLE]

where step $(c)$ is obtained using induction hypothesis $\mathcal{B}_{t}(h)$ , Lemma A.7, and Lemma A.3. Since $\frac{1}{\sqrt{n}}\left\lVert m^{t}_{\perp}\right\rVert$ concentrates on $\tau_{t}^{\perp}$ by $\mathcal{B}_{t}(h)$ , the third term in (71) can be bounded as

[TABLE]

For the second term in (72), first bound the norm of $\hat{Q}_{t+1}\bar{Z}_{t+1}$ as follows. Letting $\hat{q}_{i}$ denote the $i^{th}$ column of $\hat{Q}_{t+1}$ , we have

[TABLE]

where step (d) follows from Lemma B.6 and step (e) uses $\left\lVert\hat{q}_{i}\right\rVert^{2}=|\Gamma|$ for all $0\leq i\leq t$ . Therefore,

[TABLE]

Step $(f)$ is obtained from Lemma A.1 and step $(g)$ from Lemma A.6. Using (73), the RHS of (72) is bounded by $Ke^{-\kappa n\epsilon}$ . Finally, for $0\leq j\leq t$ , the last term in (71) can be bounded by

[TABLE]

where step $(g)$ follows from Lemma 5, the induction hypothesis $\mathcal{H}_{t}(e)$ , and Lemma A.3. Thus we have bounded each term of (71) as desired.

(b) For brevity, we use the notation $\mathbb{E}_{\phi_{h}}:=\frac{1}{|\Gamma|}\sum_{i\in\Gamma}\mathbb{E}[\phi_{h}(\tau_{0}[\tilde{\mathbf{Z}}_{0}]_{\Lambda_{i}},...,\tau_{t}[\tilde{\mathbf{Z}}_{t}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]$ and

[TABLE]

for $i\in\Gamma$ . Hence $\hat{a}$ and $\hat{c}$ are arrays in $\mathbb{R}^{\Gamma}$ with entries $\hat{a}_{i}$ , $\hat{c}_{i}\in\mathbb{R}^{(t+2)}$ . We note that by $\hat{a}_{\Lambda_{i}}$ we mean for the $p$ -dimensional cube $\Lambda_{i}$ to be applied to each of the $(t+2)$ elements of $\hat{a}$ and we define $\left\lVert\hat{a}_{\Lambda_{i}}\right\rVert^{2}:=\sum_{j\in\Lambda_{i}}\left\lVert\hat{a}_{j}\right\rVert^{2}$ . Moreover, define $\hat{\Delta}_{r+1,r}=\mathcal{V}^{-1}(\Delta_{r+1,r})$ , hence $\hat{\Delta}_{r+1,r}\in\mathbb{R}^{\Gamma}$ , for all $r=0,\ldots,t$ . Then, using the conditional distribution of $h^{t+1}$ from (47) and Lemma A.1,

[TABLE]

Label the terms of (75) as $T_{1}$ and $T_{2}$ . We next show that both terms are bounded by $Ke^{-\kappa n\epsilon^{2}}$ .

First consider term $T_{1}$ . Let $d=|\Lambda|$ . Notice that

[TABLE]

In the above, $(a)$ follows from Cauchy-Schwarz and $(b)$ by collecting the terms in the sums. Hence,

[TABLE]

Denote the RHS of (76) by $\Delta_{\mathsf{total}}^{2}$ , then using Lemma A.1 and $\mathcal{H}_{1}(a)-\mathcal{H}_{t+1}(a)$ , we have

[TABLE]

Now, using the pseudo-Lipschitz property of $\phi_{h}$ , we have

[TABLE]

Above, step $(a)$ follows from Cauchy-Schwarz and step $(b)$ from an application of Lemma B.6:

[TABLE]

and $\sum_{i\in\Gamma}\|[\hat{a}-\hat{c}]_{\Lambda_{i}}\|^{2}\leq 2d\|\hat{a}-\hat{c}\|^{2}$ along with (76). Step $(c)$ follows from $\|\hat{a}\|\leq\|\hat{a}-\hat{c}\|+\|\hat{c}\|\leq\sqrt{|\Gamma|}\Delta_{\mathsf{total}}+\|\hat{c}\|$ . Notice that

[TABLE]

where the last step follows from (47). Define $\mathbb{E}_{c}:=\sum_{r=0}^{t}\tau_{r}^{2}+\sigma_{\beta}^{2}$ . Then

[TABLE]

where the last step follows from Lemma A.7 and (27). Therefore, using the bound in (78),

[TABLE]

where the last step follows from (77) and (79).

Next, consider term $T_{2}$ of (75).

[TABLE]

Label the two terms on the RHS as $T_{2a}$ and $T_{2b}$ . $T_{2a}$ can be bounded in a similar way as $T_{2}$ in (68) and $T_{2b}$ has the desired bound by Lemma C.2, since the function $\tilde{\phi}_{h}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ defined as

[TABLE]

is PL(2) by Lemmas B.4 and B.5.

(c) We first show the concentration of $\frac{1}{n}(h^{t+1})^{*}\mathcal{V}(\beta)=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}^{t+1}_{i}\beta_{i}$ . Using the PL(2) function $\phi_{1}$ defined in $\mathcal{H}_{1}(c)$ , we have that $\phi_{1}(\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}^{t+1}_{i}\beta_{i}$ and $\mathbb{E}[\phi_{1}(\tau_{t}[\tilde{\mathbf{Z}}_{t}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\mathbb{E}[[\tau_{t}\tilde{\mathbf{Z}}_{t}]_{i}]\mathbb{E}[\beta_{i}]=0$ for all $i\in\Gamma$ , since $[\tau_{t}\tilde{\mathbf{Z}}_{t}]_{i}$ has zero-valued mean and is independent of $\beta_{i}$ . Therefore, $\mathcal{H}_{t+1}(b)$ gives the desired upper bound, since

[TABLE]

We now show the concentration of $\frac{1}{n}(h^{t+1})^{*}q^{0}=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}^{t+1}_{i}\hat{q}^{0}_{i}$ . Using the PL(2) function $\phi_{2}$ defined in $\mathcal{H}_{1}(c)$ , we have that $\phi_{2}(\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}^{t+1}_{i}\hat{q}^{0}_{i}$ and $\mathbb{E}[\phi_{2}(\tau_{t}[\tilde{\mathbf{Z}}_{t}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\mathbb{E}[\tau_{t}[\tilde{\mathbf{Z}}_{t}]_{i}]\mathbb{E}[f_{0}(\mathbf{0},\beta_{\Lambda_{i}})]=0$ , since $[\tilde{\mathbf{Z}}_{t}]_{i}$ has zero-valued mean and is independent of $\beta_{\Lambda_{i}}$ for all $i\in\Gamma$ . Therefore, using $\mathcal{H}_{t+1}(b)$ , we have the desired upper bound, since

[TABLE]

(d) Let a function $\tilde{\phi}_{3}:\mathbb{R}^{3|\Lambda|}\to\mathbb{R}$ be defined as $\tilde{\phi}_{3}(x,y,z)=\mathcal{S}(x)\mathcal{S}(y)$ . Since the operator $\mathcal{S}$ defined in (63) is Lipschitz, $\tilde{\phi}_{3}$ is PL(2) by Lemma B.3. Note that $\tilde{\phi}_{3}([\hat{h}^{r+1}]_{\Lambda_{i}},\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}^{r+1}_{i}\hat{h}^{t+1}_{i}$ and $\mathbb{E}[\tilde{\phi}_{3}(\tau_{r}[\tilde{\mathbf{Z}}_{r}]_{\Lambda_{i}},\tau_{t}[\tilde{\mathbf{Z}}_{t}]_{\Lambda_{i}},\beta_{\Lambda_{i}})]=\tau_{r}\tau_{t}\mathbb{E}[[\tilde{\mathbf{Z}}_{r}]_{i}[\tilde{\mathbf{Z}}_{t}]_{i}]=\breve{E}_{r,t}$ , where the last equality follows from the definition in (31). Therefore, the result follows from $\mathcal{H}_{t+1}(b)$ , since

[TABLE]

(e) We will show the concentration of $\frac{1}{n}(q^{0})^{*}q^{t+1}=\frac{1}{n}\sum_{i\in\Gamma}\hat{q}_{i}^{0}\hat{q}_{i}^{t+1}$ ; the concentration of $\frac{1}{n}(q^{r+1})^{*}q^{t+1}$ follows similarly. The function $\tilde{\phi}_{4}(x,y):\mathbb{R}^{2|\Lambda|}\rightarrow\mathbb{R}$ defined as $\tilde{\phi}_{4}(x,y):=f_{0}(\mathbf{0},y)f_{t+1}(x,y)$ is PL(2) by Lemma B.3 and $\tilde{\phi_{4}}(\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{q}_{i}^{0}\hat{q}_{i}^{t+1}$ . Moreover,

[TABLE]

by definition (32). Therefore, using $\mathcal{H}_{t+1}(b)$ , we have the desired result, since

[TABLE]

(f) The concentration of $\lambda_{t}$ around $\hat{\lambda}_{t}$ follows from $\mathcal{H}_{t+1}(b)$ applied to the function $\phi_{h}(h^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}}):=f_{t+1}^{\prime}(h^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})$ , since $f_{t+1}^{\prime}$ is assumed to be Lipschitz, hence PL(2). Next, we show concentration for $\frac{1}{n}(h^{t+1})^{*}q^{r+1}=\frac{1}{n}\sum_{i\in\Gamma}\hat{h}^{t+1}_{i}\hat{q}^{r+1}_{i}$ . Let $\tilde{\phi}_{5}:\mathbb{R}^{3|\Lambda|}\to\mathbb{R}$ be defined as $\tilde{\phi}_{5}(x,y,z):=\mathcal{S}(y)f_{r+1}(x,z)$ , which is PL(2) by Lemma B.3. Note, $\tilde{\phi}_{5}([\hat{h}^{r+1}]_{\Lambda_{i}},\hat{h}^{t+1}_{\Lambda_{i}},\beta_{\Lambda_{i}})=\hat{h}^{t+1}_{i}\hat{q}^{r+1}_{i}$ and

[TABLE]

where the last equality follows using Stein’s lemma (Lemma B.2), as in $\mathcal{H}_{1}(f)$ . Therefore, $\mathcal{H}_{t+1}(b)$ gives the desired result, since

[TABLE]

(g) We can represent $\textbf{Q}_{t+1}$ as follows.

[TABLE]

Then, using $\frac{1}{n}\mathbf{Q}_{t}^{-1}Q_{t}^{*}q^{t}=\gamma^{t}$ and $(Q_{t}^{*}q^{t})^{*}\gamma^{t}=(q^{t})^{*}q^{t}_{\parallel}$ , it follows by the block inversion formula that

[TABLE]
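The block inversion step can be illustrated in the smallest case $t=1$ , where $\mathbf{Q}_{1}$ is a scalar and $\mathbf{Q}_{2}$ is a $2\times 2$ Gram matrix. The sketch below assumes the block form with entries $\mathbf{Q}_{t}^{-1}+n\lVert q^{t}_{\perp}\rVert^{-2}\gamma^{t}(\gamma^{t})^{*}$ , $-n\lVert q^{t}_{\perp}\rVert^{-2}\gamma^{t}$ , and $n\lVert q^{t}_{\perp}\rVert^{-2}$ , which are the elements discussed in the text; the vectors are arbitrary illustrative values.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

q0 = [1.0, 2.0, -1.0, 0.5]
q1 = [0.5, -1.0, 3.0, 2.0]
n = len(q0)

# Gram matrix Q_2 = (1/n) [q0 q1]^T [q0 q1], inverted directly (2x2 formula).
a, b, c = dot(q0, q0) / n, dot(q0, q1) / n, dot(q1, q1) / n
det = a * c - b * b
direct = [[c / det, -b / det], [-b / det, a / det]]

# Block form built from Q_1^{-1} = 1/a, gamma, and n / ||q1_perp||^2.
gamma = b / a
perp_sq = n * (c - b * b / a)   # ||q1_perp||^2
s = n / perp_sq
block = [[1.0 / a + s * gamma ** 2, -s * gamma], [-s * gamma, s]]

max_gap = max(abs(direct[i][j] - block[i][j]) for i in range(2) for j in range(2))
```

The two inverses coincide entry by entry up to rounding, which is the structure exploited when each block is shown to concentrate separately.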

Using definitions (35) and (36), block inversion can be similarly used to invert $\tilde{C}^{t+1}$ :

[TABLE]

In what follows, we show concentration for each of the elements in (81) to the corresponding elements in (86).

First, $n\left\lVert q^{t}_{\perp}\right\rVert^{-2}$ concentrates to $(\sigma_{t}^{\perp})^{-2}$ at rate $Ke^{-\kappa n\epsilon^{2}}$ by $\mathcal{H}_{t}(h)$ and Lemma A.5. Next, consider the $i^{th}$ element of $-n\left\lVert q^{t}_{\perp}\right\rVert^{-2}\gamma^{t}$ . For $i\in[t]$ , using Lemma A.2 and $\mathcal{H}_{t}(g),(h)$ as discussed in the previous paragraph,

[TABLE]

Consider element $(i,j)$ of $\mathbf{Q}_{t}^{-1}+n\left\lVert q^{t}_{\perp}\right\rVert^{-2}\gamma^{t}(\gamma^{t})^{*}$ for $i,j\in[t]$ .

[TABLE]

Step $(a)$ follows from Lemma A.1 and Lemma A.2 with $\epsilon^{\prime}=\min\{\sqrt{\frac{\epsilon}{3}},\frac{\epsilon(\sigma_{t}^{\perp})^{2}}{3\hat{\gamma}^{t}_{i-1}},\frac{\epsilon}{3\hat{\gamma}^{t}_{j-1}}\}.$ Step $(b)$ follows from the inductive hypothesis $\mathcal{H}_{t}(g)$ , together with (87).

We now prove $\gamma^{t+1}\doteq\hat{\gamma}^{t+1}$ . Recall, $\gamma^{t+1}=\frac{1}{n}\mathbf{Q}_{t+1}^{-1}Q_{t+1}^{*}q^{t+1}$ where $\mathbf{Q}_{t+1}:=\frac{1}{n}Q_{t+1}^{*}Q_{t+1}$ . Thus, $\gamma_{r-1}^{t+1}=\frac{1}{n}\sum_{i=1}^{t+1}[\mathbf{Q}_{t+1}^{-1}]_{r,i}(q^{i-1})^{*}q^{t+1}$ , for $1\leq r\leq t+1$ . Then by the definition of $\hat{\gamma}^{t+1}$ , for $1\leq r\leq t+1$ ,

[TABLE]

Step (a) follows from Lemma A.1 and step (b) from Lemma A.2, with $\tilde{\epsilon}_{i}:=\min\{\sqrt{\frac{\epsilon}{3(t+1)}},\frac{\epsilon}{3(t+1)\tilde{E}_{i-1,t+1}},\frac{\epsilon}{3(t+1)[(\tilde{C}^{t+1})^{-1}]_{r,i}}\}$ . Step (c) uses $\mathcal{H}_{t+1}(e)$ and what we have just demonstrated in the previous paragraphs.

(h) First, note that $||q^{t+1}_{\perp}||^{2}=||q^{t+1}||^{2}-||q^{t+1}_{\parallel}||^{2}=||q^{t+1}||^{2}-||Q_{t+1}\gamma^{t+1}||^{2}$ . Using the definition of $\sigma_{t+1}^{\perp}$ in (36), we then have

[TABLE]

By $\mathcal{H}_{t+1}(e)$ , the first term on the LHS of (88) is bounded by $Ke^{-\kappa n\epsilon^{2}}$ . For the second term, using $\gamma^{t+1}=\frac{1}{n}\textbf{Q}_{t+1}^{-1}Q_{t+1}^{*}q^{t+1}$ ,

[TABLE]

Hence

[TABLE]

Step (a) follows from the concentration of products, Lemma A.2, using $\tilde{\epsilon}_{i}:=\min\Big{\{}\sqrt{\frac{\epsilon}{6(t+1)}},\frac{\epsilon}{6(t+1)\tilde{E}_{i,t+1}},\frac{\epsilon}{6(t+1)\hat{\gamma}^{t+1}_{i}}\Big{\}}$ , and step (b) using $\mathcal{H}_{t+1}(e)$ and $\mathcal{H}_{t+1}(g)$ .

Appendix A Concentration Lemmas

In the following, $\epsilon>0$ is assumed to be a generic constant, with additional conditions specified whenever needed. The proofs of the lemmas in this section can be found in [14].

Lemma A.1.

(Concentration of Sums.) If random variables $X_{1},\ldots,X_{M}$ satisfy $P(\left\lvert X_{i}\right\rvert\geq\epsilon)\leq e^{-n\kappa_{i}\epsilon^{2}}$ for $1\leq i\leq M$ , then

[TABLE]

Lemma A.2 (Concentration of Products).

For random variables $X,Y$ and non-zero constants $c_{X},c_{Y}$ , if

[TABLE]

then the probability $P\Big{(}|XY-c_{X}c_{Y}|\geq\epsilon\Big{)}$ is bounded by

[TABLE]

Lemma A.3.

(Concentration of Square Roots.) Let $c\neq 0$ . If

[TABLE]

then

[TABLE]

Lemma A.4 (Concentration of Powers).

Assume $c\neq 0$ and $\epsilon\in(0,1]$ . Then for any integer $k\geq 2$ , if

[TABLE]

then

[TABLE]

Lemma A.5 (Concentration of Scalar Inverses).

Assume $c\neq 0$ and $\epsilon\in(0,1)$ . If

[TABLE]

then

[TABLE]

Lemma A.6.

For a standard Gaussian random variable $Z$ and $\epsilon>0$ , $P(\left\lvert Z\right\rvert\geq\epsilon)\leq 2e^{-\frac{1}{2}\epsilon^{2}}$ .
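The tail bound of Lemma A.6 can be compared with the exact two-sided Gaussian tail, which equals $\operatorname{erfc}(\epsilon/\sqrt{2})$ . The following is a minimal numerical check over a grid of $\epsilon$ values; the grid and function names are illustrative choices.

```python
import math

def two_sided_gaussian_tail(eps):
    """Exact P(|Z| >= eps) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(eps / math.sqrt(2.0))

def tail_bound(eps):
    """Upper bound 2 * exp(-eps^2 / 2) from Lemma A.6."""
    return 2.0 * math.exp(-0.5 * eps * eps)

grid = [0.01 * j for j in range(1, 801)]  # eps from 0.01 to 8.00
bound_holds = all(two_sided_gaussian_tail(e) <= tail_bound(e) for e in grid)
```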

Lemma A.7.

( $\chi^{2}$ -concentration.) For $Z_{i}$ , $i\in[n]$ that are i.i.d. $\sim\mathcal{N}(0,1)$ , and $0\leq\epsilon\leq 1$ ,

[TABLE]

Lemma A.8.

[22] Let $X$ be a centered sub-Gaussian random variable with variance factor $\nu$ , i.e., $\ln\mathbb{E}[e^{tX}]\leq\frac{t^{2}\nu}{2}$ , $\forall t\in\mathbb{R}$ . Then $X$ satisfies:

1. For all $x>0$ , $P(X>x)\vee P(X<-x)\leq e^{-\frac{x^{2}}{2\nu}}$ .

2. For every integer $k\geq 1$ , $\mathbb{E}[X^{2k}]\leq 2(k!)(2\nu)^{k}\leq(k!)(4\nu)^{k}.$
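As a sanity check of the moment bound, consider a standard normal variable, which is sub-Gaussian with variance factor $\nu=1$ and has even moments $\mathbb{E}[Z^{2k}]=(2k-1)!!$ . The sketch below verifies both inequalities for $k=1,\ldots,10$ in exact integer arithmetic; the choice of the standard normal is an assumption made for this example only.

```python
import math

def gaussian_even_moment(k):
    """E[Z^(2k)] for Z ~ N(0, 1), which equals (2k - 1)!! = (2k)! / (2^k k!)."""
    return math.factorial(2 * k) // (2 ** k * math.factorial(k))

nu = 1  # variance factor of a standard normal (assumption of this example)
checks = []
for k in range(1, 11):
    moment = gaussian_even_moment(k)
    middle = 2 * math.factorial(k) * (2 * nu) ** k   # 2 (k!) (2 nu)^k
    outer = math.factorial(k) * (4 * nu) ** k        # (k!) (4 nu)^k
    checks.append(moment <= middle <= outer)
```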

Appendix B Other Useful Lemmas

In this section, when the results are standard, they are presented without proof.

Lemma B.1.

[14, Fact 7] Let $u\in\mathbb{R}^{N}$ be a deterministic vector and let $\tilde{A}\in\mathbb{R}^{n\times N}$ be a matrix with independent $\mathcal{N}(0,1/n)$ entries. Moreover, let $\mathcal{W}$ be a $d$ -dimensional subspace of $\mathbb{R}^{n}$ for $d\leq n$ . Let $(w_{1},...,w_{d})$ be an orthogonal basis of $\mathcal{W}$ with $\left\lVert w_{\ell}\right\rVert^{2}=n$ for $\ell\in[d]$ , and let $\mathsf{P}^{\parallel}_{\mathcal{W}}$ denote the orthogonal projection operator onto $\mathcal{W}$ . Then for $D=[w_{1}\mid\ldots\mid w_{d}]$ , we have $\mathsf{P}^{\parallel}_{\mathcal{W}}\tilde{A}u\overset{d}{=}\frac{\left\lVert u\right\rVert}{\sqrt{n}}\mathsf{P}^{\parallel}_{\mathcal{W}}Z_{u}\overset{d}{=}\frac{\left\lVert u\right\rVert}{\sqrt{n}}Dx$ where $x\in\mathbb{R}^{d}$ is a random vector with i.i.d. $\mathcal{N}(0,1/n)$ entries.
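The rank-one case of this fact, as used in $\mathcal{H}_{1}(a)$ , can be sketched as follows: projecting a vector $Z$ with i.i.d. standard normal entries onto the span of a fixed $q$ gives $\frac{q}{\|q\|}\bar{Z}$ with $\bar{Z}=q^{*}Z/\|q\|$ , a scalar of unit variance. The code checks the algebraic identity for one draw and the unit variance of the coefficient; all names and values are illustrative assumptions.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

random.seed(4)
n = 8
q = [random.uniform(-1, 1) for _ in range(n)]  # fixed direction
z = [random.gauss(0, 1) for _ in range(n)]     # one draw with i.i.d. N(0, 1) entries

q_norm = math.sqrt(dot(q, q))
projection = [dot(q, z) / dot(q, q) * x for x in q]  # orthogonal projection of z onto span(q)
z_bar = dot(q, z) / q_norm                           # scalar coefficient
rebuilt = [x / q_norm * z_bar for x in q]            # (q / ||q||) * z_bar

max_gap = max(abs(p - r) for p, r in zip(projection, rebuilt))
# Var(z_bar) = sum_i (q_i / ||q||)^2 = 1, so z_bar is a standard normal scalar.
variance_of_z_bar = sum((x / q_norm) ** 2 for x in q)
```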

Lemma B.2.

(Stein’s lemma.) For zero-mean jointly Gaussian random variables $Z_{1},Z_{2}$ , and any function $f:\mathbb{R}\to\mathbb{R}$ for which $\mathbb{E}[Z_{1}f(Z_{2})]$ and $\mathbb{E}[f^{\prime}(Z_{2})]$ both exist, we have $\mathbb{E}[Z_{1}f(Z_{2})]=\mathbb{E}[Z_{1}Z_{2}]\mathbb{E}[f^{\prime}(Z_{2})]$ .
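Stein's lemma can be checked numerically for a concrete smooth function. The sketch below is illustrative only: it assumes the representation $Z_{1}=aU+bW$, $Z_{2}=cU$ with independent standard normals $U,W$ (so that $\mathbb{E}[Z_{1}Z_{2}]=ac$), takes $f=\tanh$, and approximates both expectations with a simple midpoint rule. The coefficients, the choice of $f$, and the quadrature settings are all assumptions.

```python
import math

def expect_2d(g, half_width=8.0, steps=320):
    """Approximate E[g(U, W)] for independent U, W ~ N(0, 1) with a midpoint rule."""
    h = 2.0 * half_width / steps
    pts = [-half_width + (k + 0.5) * h for k in range(steps)]
    wts = [math.exp(-0.5 * p * p) / math.sqrt(2.0 * math.pi) * h for p in pts]
    return sum(wu * ww * g(u, w)
               for u, wu in zip(pts, wts)
               for w, ww in zip(pts, wts))

# Hypothetical coefficients: Z1 = a*U + b*W and Z2 = c*U are zero-mean,
# jointly Gaussian, with E[Z1 * Z2] = a * c.
a, b, c = 0.7, 1.3, 0.9
f = math.tanh
f_prime = lambda z: 1.0 - math.tanh(z) ** 2

lhs = expect_2d(lambda u, w: (a * u + b * w) * f(c * u))   # E[Z1 f(Z2)]
rhs = (a * c) * expect_2d(lambda u, w: f_prime(c * u))     # E[Z1 Z2] E[f'(Z2)]
```

The two quantities agree up to quadrature error, as the lemma predicts.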

Lemma B.3.

(Products of Lipschitz Functions are PL(2).) Let $f,g:\mathbb{R}^{p}\to\mathbb{R}$ be Lipschitz continuous. Then the product function $h:\mathbb{R}^{p}\to\mathbb{R}$ defined as $h(x):=f(x)g(x)$ is PL(2).
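To illustrate Lemma B.3 with $p=1$, take $f(x)=x$ and $g(x)=\sin(x)$, both $1$-Lipschitz. Using the usual PL(2) condition $|h(x)-h(y)|\leq L(1+\|x\|+\|y\|)\|x-y\|$ (assumed here to match the definition in the main text), the product $h(x)=x\sin(x)$ satisfies it with $L=1$, since $|h(x)-h(y)|\leq|x||\sin x-\sin y|+|\sin y||x-y|\leq(1+|x|)|x-y|$. The sketch below checks this on a grid; the example functions and the grid are illustrative choices.

```python
import math

def h(x):
    """Product of two 1-Lipschitz functions on R: f(x) = x and g(x) = sin(x)."""
    return x * math.sin(x)

def pl2_ratio(x, y):
    """|h(x) - h(y)| divided by (1 + |x| + |y|) |x - y|, for x != y."""
    return abs(h(x) - h(y)) / ((1.0 + abs(x) + abs(y)) * abs(x - y))

# Hypothetical evaluation grid on roughly [-22, 22].
grid = [0.37 * k for k in range(-60, 61)]
worst_ratio = max(pl2_ratio(x, y) for x in grid for y in grid if x != y)
```

The largest ratio on the grid stays below $1$, whereas $h$ itself is not Lipschitz because its slope grows with $|x|$.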

Lemma B.4.

Let $\Lambda$ be defined in (3). For each $r=1,\ldots,t$ , let $\tau_{r}>0$ be a constant and let $Z_{r}\in\mathbb{R}^{\Lambda}$ have i.i.d. standard normal entries. Suppose $f:\mathbb{R}^{|\Lambda|(t+1)}\to\mathbb{R}$ is PL(2) with PL constant $L$ , then the function $\tilde{f}:\mathbb{R}^{\Lambda}\to\mathbb{R}$ defined as $\tilde{f}(s):=\mathbb{E}_{Z_{1},\ldots,Z_{t}}[f(\tau_{1}Z_{1},\ldots,\tau_{t}Z_{t},s)]$ is PL(2).

Proof.

Take arbitrary $x,y\in\mathbb{R}^{\Lambda}$ ,

[TABLE]

In the above, step $(a)$ follows from Jensen’s inequality, step $(b)$ holds since $f$ is PL(2) and using the triangle inequality, and step $(c)$ follows from $\mathbb{E}\|Z_{r}\|\leq\sum_{i\in\Lambda}\mathbb{E}\left\lvert[Z_{r}]_{i}\right\rvert=|\Lambda|\sqrt{\frac{2}{\pi}}$ . ∎

Lemma B.5.

Let $\Gamma$ and $\Lambda$ be as defined in (2) and (3), and let $\Lambda_{i}$ be $\Lambda$ translated to be centered at $i$ for each $i\in\Gamma$ . Let $f:\mathbb{R}^{\Lambda}\to\mathbb{R}$ be a PL(2) function with constant $L$ and define $\tilde{f}_{i}:\mathbb{R}^{\Lambda_{i}\cap\Gamma}\to\mathbb{R}$ as $\tilde{f}_{i}(v):=f(\mathcal{T}_{i}(v))$ , where $\mathcal{T}_{i}:\mathbb{R}^{\Lambda_{i}\cap\Gamma}\to\mathbb{R}^{\Lambda}$ is defined in (8). Then $\tilde{f}_{i}$ is PL(2) for all $i\in\Gamma$ .

Proof.

Let $i$ be an arbitrary but fixed index in $\Gamma$ . Let $d=|\Lambda|$ , $a_{i}:=|\Lambda_{i}\cap\Gamma|$ , and $b_{i}=d-a_{i}$ , so that $b_{i}$ counts the number of “missing” entries in $\Lambda_{i}$ . For any $x,y\in\mathbb{R}^{\Lambda_{i}\cap\Gamma}$ , we have that

[TABLE]

where step $(a)$ follows from the pseudo-Lipschitz property of $f$ and Lemma B.6, step $(b)$ from our definition of $\mathcal{T}_{i}$ in (8), and step $(c)$ from Lemma B.6. ∎

Lemma B.6.

For any scalars $a_{1},...,a_{t}$ and positive integer $m$ , we have $(\left\lvert a_{1}\right\rvert+\ldots+\left\lvert a_{t}\right\rvert)^{m}\leq t^{m-1}\sum_{i=1}^{t}\left\lvert a_{i}\right\rvert^{m}$ . Consequently, for any vectors $\underline{u}_{1},\ldots,\underline{u}_{t}\in\mathbb{R}^{N}$ , $\left\lVert\sum_{k=1}^{t}\underline{u}_{k}\right\rVert^{2}\leq t\sum_{k=1}^{t}\left\lVert\underline{u}_{k}\right\rVert^{2}$ .
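Both inequalities of Lemma B.6 are easy to check on concrete inputs. The sketch below computes the gap between the right-hand and left-hand sides, which the lemma says is non-negative; the input values are hypothetical.

```python
def power_sum_gap(values, m):
    """Return t^(m-1) * sum |a_i|^m - (sum |a_i|)^m, which Lemma B.6 says is >= 0."""
    t = len(values)
    lhs = sum(abs(a) for a in values) ** m
    rhs = t ** (m - 1) * sum(abs(a) ** m for a in values)
    return rhs - lhs

def squared_norm(vec):
    """Squared Euclidean norm of a vector given as a tuple."""
    return sum(x * x for x in vec)

def vector_sum_gap(vectors):
    """Return t * sum ||u_k||^2 - ||sum u_k||^2 for equal-length vectors."""
    t = len(vectors)
    total = [sum(coords) for coords in zip(*vectors)]
    return t * sum(squared_norm(u) for u in vectors) - squared_norm(total)

scalars = [1, -2, 3]                                # hypothetical inputs
vectors = [(1, 0, 2), (-1, 4, 0), (3, 3, -2)]       # hypothetical inputs
scalar_gaps = [power_sum_gap(scalars, m) for m in (1, 2, 3, 4)]
vector_gap = vector_sum_gap(vectors)
```

For $m=1$ the scalar inequality holds with equality, so the first gap is zero.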

Appendix C Concentration with Dependencies

We first state a concentration result, existing in the literature, for functions acting on random fields that satisfy the Dobrushin uniqueness condition in Lemma C.1. Then we use Lemma C.1 to obtain Lemma C.2, which is needed to prove $\mathcal{H}_{t}(b)$ .

Lemma C.1.

[25, Theorem 1]

Suppose that the random field $X=(X_{i})_{i\in\Gamma}$ taking values in $E^{\Gamma}$ is distributed according to a Gibbs measure $\mu$ that obeys the Dobrushin uniqueness condition with Dobrushin constant $c$, and the transposed Dobrushin uniqueness condition with constant $c^{*}$. Suppose that $F$ is a real function on $E^{\Gamma}$ with $\mathbb{E}[\exp(tF(X))]<\infty$ for all real $t$. Then we have for all $r\geq 0$,

[TABLE]

Here $\underline{\delta}(F):=(\delta_{i}(F))_{i\in\Gamma}$ is the variation vector of $F$ , where $\delta_{i}(F):=\sup_{\xi,\xi^{\prime};\xi_{i^{c}}=\xi^{\prime}_{i^{c}}}\left\lvert F(\xi)-F(\xi^{\prime})\right\rvert$ denotes the variation of $F$ at the site $i$ . Its $\ell^{2}$ -norm is defined as $\|\underline{\delta}(F)\|_{\ell^{2}}^{2}:=\sum_{i\in\Gamma}(\delta_{i}(F))^{2}$ . If this norm is infinite, then the statement is empty (and thus correct).
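For a finite state space, the variation vector $\underline{\delta}(F)$ can be computed by brute force, which makes the definition concrete. The sketch below uses a hypothetical example that is not taken from the paper: a chain of four sites with $E=\{-1,+1\}$ and $F(x)=\sum_{i}x_{i}x_{i+1}$.

```python
from itertools import product

def variation_vector(F, num_sites, states):
    """Brute-force delta_i(F): the largest change in F when only site i is altered."""
    deltas = [0.0] * num_sites
    for xi in product(states, repeat=num_sites):
        for i in range(num_sites):
            for s in states:
                xi_prime = xi[:i] + (s,) + xi[i + 1:]
                deltas[i] = max(deltas[i], abs(F(xi) - F(xi_prime)))
    return deltas

# Hypothetical example: nearest-neighbour interaction on a chain of 4 binary sites.
def F(x):
    return sum(x[i] * x[i + 1] for i in range(len(x) - 1))

deltas = variation_vector(F, num_sites=4, states=(-1, 1))
l2_norm_squared = sum(d * d for d in deltas)
```

Interior sites have two neighbours and hence variation $4$, while the two boundary sites have variation $2$, giving $\|\underline{\delta}(F)\|_{\ell^{2}}^{2}=40$.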

Lemma C.2.

Let $\Gamma$ and $\Lambda$ be defined in (2) and (3), respectively, and let $X=(X_{i})_{i\in\Gamma}$ be a stationary Markov random field with a unique Gibbs distribution measure $\mu$ on $E^{\Gamma}\subset\mathbb{R}^{\Gamma}$ . Assume that $\mu$ satisfies the Dobrushin uniqueness condition and the transposed Dobrushin uniqueness condition with constants $c$ and $c^{*}$ , respectively. Suppose that the state space $E$ is bounded, meaning that there exists an $M$ such that $|x|\leq M$ , for all $x\in E$ . Let $f_{i}:\mathbb{R}^{\Lambda_{i}\cap\Gamma}\rightarrow\mathbb{R}$ , where $\Lambda_{i}$ is $\Lambda$ being translated to be centered at location $i\in\Gamma$ , be a PL(2) function with pseudo-Lipschitz constant $L_{i}$ , for all $i\in\Gamma$ . Then for all $\epsilon\in(0,1)$ there exist $K,\kappa>0$ , such that

[TABLE]

Proof.

Let the function $F$ in Lemma C.1 be defined as $F(X):=\sum_{i\in\Gamma}f_{i}(X_{\Lambda_{i}\cap\Gamma})$ . In order to apply Lemma C.1, we need to calculate $\|\underline{\delta}(F)\|_{\ell^{2}}^{2}$ . Let $d:=|\Lambda|$ and $L:=\max_{i\in\Gamma}L_{i}$ , then we have that

[TABLE]

In the above, step $(a)$ uses the triangle inequality and the pseudo-Lipschitz property of $f$ . Step $(b)$ follows from the fact that $|x|\leq M,$ for all $x\in E$ and that $L_{j}\leq L$ for all $j\in\Gamma$ . Therefore,

[TABLE]

Now applying Lemma C.1, we have

[TABLE]

∎

Lemma C.3 provides a technical result about pseudo-Lipschitz functions with sub-Gaussian inputs, which will be used to prove Lemma C.4.

Lemma C.3.

[1, Lemma D.2]

Let $X\in\mathbb{R}^{d}$ be a random vector whose entries have a sub-Gaussian marginal distribution with variance factor $\nu$ as in Lemma A.8. Let $\tilde{X}$ be an independent copy of $X$. If $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a PL(2) function with pseudo-Lipschitz constant $L$, then the expectation $\mathbb{E}[\exp(rf(X))]$ satisfies the following for $0<r<[5L(2d\nu+24d^{2}\nu^{2})^{1/2}]^{-1}$,

[TABLE]

Lemma C.4 provides a concentration inequality for sums of pseudo-Lipschitz functions acting on overlapping subsets of jointly Gaussian random variables.

Lemma C.4.

Let $\Gamma$ and $\Lambda$ be defined as in (2) and (3). For each $r=1,\ldots,t$ , let $(Z^{r}_{i})_{i\in\Gamma}$ have i.i.d. $\mathcal{N}(0,1)$ entries, and for all $r,s=1,\ldots,t$ and $i\neq j$ , $Z^{r}_{i}$ is independent of $Z^{s}_{j}$ . Moreover, for each $i\in\Gamma$ , let $(Z_{i}^{1},\ldots,Z_{i}^{t})$ be jointly Gaussian with covariance matrix $K\in\mathbb{R}^{t\times t}$ .

For each $i\in\Gamma$ , define $Y_{i}:=(Z^{1}_{\Lambda_{i}\cap\Gamma},\ldots,Z^{t}_{\Lambda_{i}\cap\Gamma})$ , where $\Lambda_{i}$ is $\Lambda$ translated to be centered at location $i\in\Gamma$ . Let $f_{i}:\mathbb{R}^{|\Lambda_{i}\cap\Gamma|t}\rightarrow\mathbb{R}$ be a PL(2) function for all $i\in\Gamma$ . Then for all $\epsilon\in(0,1)$ , there exist $K,\kappa>0$ such that

[TABLE]

Proof.

In the following, we prove the case for $p=2$ and the proof for other dimensions follows similarly. Without loss of generality, let $i=(i_{1},i_{2})$ and $\Gamma:=\{(i_{1},i_{2})\}_{1\leq i_{1},i_{2}\leq n}$ , hence $|\Gamma|=n^{2}$ . Further, assume without loss of generality that $\mathbb{E}[f_{i}(Y_{i})]=0,$ for all $i\in\Gamma$ . In what follows, we demonstrate the upper-tail bound:

[TABLE]

and the lower-tail bound follows similarly. Together they provide the desired result.

Using the Cramér-Chernoff method, for all $r>0$ ,

[TABLE]

Let $d^{2}=|\Lambda|$ and $L_{i}$ be the pseudo-Lipschitz parameters associated with functions $f_{i}$ for $i\in\Gamma$ and define $L:=\max_{i\in\Gamma}L_{i}$ . In the following, we will show that for $0<r<(10Ld^{2}\sqrt{2td^{2}+24t^{2}d^{4}})^{-1}$ ,

[TABLE]

where $\kappa^{\prime}$ is any constant that satisfies $\kappa^{\prime}\geq 450L^{2}d^{2}(td^{2}+12t^{2}d^{4})$. Then plugging (95) into (94), we can obtain the desired result in (93):

[TABLE]

Set $r=\epsilon/(2\kappa^{\prime})$ , which is the choice that maximizes the term in the exponent in the above, i.e. it maximizes $(r\epsilon-\kappa^{\prime}r^{2})$ over $r$ . We can ensure that $\forall\epsilon\in(0,1)$ , $r$ falls within the region required in (95) by choosing $\kappa^{\prime}$ large enough.
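The choice $r=\epsilon/(2\kappa^{\prime})$ follows from completing the square: $r\epsilon-\kappa^{\prime}r^{2}=\frac{\epsilon^{2}}{4\kappa^{\prime}}-\kappa^{\prime}\big(r-\frac{\epsilon}{2\kappa^{\prime}}\big)^{2}$, so the maximum value is $\epsilon^{2}/(4\kappa^{\prime})$. The sketch below confirms this on a grid; the values of $\epsilon$ and $\kappa^{\prime}$ are hypothetical.

```python
def chernoff_exponent(r, eps, kappa_prime):
    """The exponent r * eps - kappa' * r^2 that is maximized over r."""
    return r * eps - kappa_prime * r * r

eps, kappa_prime = 0.5, 40.0          # hypothetical values for illustration
r_star = eps / (2.0 * kappa_prime)    # the choice made in the proof
best = chernoff_exponent(r_star, eps, kappa_prime)

# Compare against a grid of alternative r values between 0.01 * r_star and 3 * r_star.
grid = [r_star * (0.01 * k) for k in range(1, 301)]
grid_max = max(chernoff_exponent(r, eps, kappa_prime) for r in grid)
```

No grid point beats the value at `r_star`, which matches $\epsilon^{2}/(4\kappa^{\prime})$.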

We now show (95). Define index sets

[TABLE]

for $j_{1},j_{2}=1,...,d$ , and let $C_{j_{1},j_{2}}$ denote the cardinality of $I_{j_{1},j_{2}}$ . We notice that for any fixed $(j_{1},j_{2})$ , the $Y_{i_{1},i_{2}}$ ’s are i.i.d. for all $(i_{1},i_{2})\in I_{j_{1},j_{2}}$ . Also, we have $\Gamma=\cup_{j_{1},j_{2}=1}^{d}I_{j_{1},j_{2}}$ , and $I_{j_{1},j_{2}}\cap I_{s_{1},s_{2}}=\emptyset$ , for $(j_{1},j_{2})\neq(s_{1},s_{2})$ , making the collection $I_{1,1},I_{1,2},\ldots,I_{d,d}$ a partition of $\Gamma$ . Therefore,

[TABLE]

where $0<p_{j_{1},j_{2}}<1$ are probabilities satisfying $\sum_{j_{1},j_{2}=1}^{d}p_{j_{1},j_{2}}=1$ . Using the above,

[TABLE]

where step $(a)$ follows from Jensen’s inequality, step $(b)$ from the independence of $Y_{i_{1},i_{2}}$ ’s for $(i_{1},i_{2})\in I_{j_{1},j_{2}}$ , and step $(c)$ from Lemma C.3 with variance factor $\nu=1$ and restriction

[TABLE]

Let $p_{j_{1},j_{2}}=\sqrt{C_{j_{1},j_{2}}}/C$ , where $C=\sum_{j_{1},j_{2}=1}^{d}\sqrt{C_{j_{1},j_{2}}}$ , ensuring $\sum_{j_{1},j_{2}=1}^{d}p_{j_{1},j_{2}}=1$ . Then,

[TABLE]

whenever $\kappa^{\prime}\geq 450d^{2}L^{2}(td^{2}+12t^{2}d^{4})$ . In the above, step $(a)$ follows from:

[TABLE]

where step $(b)$ holds because $C_{1,1}=\max_{(j_{1},j_{2})}C_{j_{1},j_{2}}$ and step $(c)$ holds because $C_{1,1}=(\lfloor\frac{n-1}{d}\rfloor+1)^{2}\leq(\frac{n}{d}+2)^{2}$ . Finally, we consider the effective region for $r$ as required in (97). Notice that

[TABLE]

Hence, if we require $0<r<(10Ld^{2}\sqrt{2td^{2}+24t^{2}d^{4}})^{-1}$ , then (97) is satisfied. ∎
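The partition argument in the proof above can be made concrete for $p=2$. The displayed definition of the index sets is not reproduced here, so the sketch below assumes that $I_{j_{1},j_{2}}$ collects the sites $(j_{1}+k_{1}d,\,j_{2}+k_{2}d)\in\Gamma$ for integers $k_{1},k_{2}\geq 0$, and that $\Lambda$ is a $d\times d$ square window with odd $d$. Under these assumptions it checks that the sets partition $\Gamma$, that windows centered at distinct sites of one set do not overlap (which is why the corresponding $Y_{i_{1},i_{2}}$ are independent), and that $C_{1,1}=(\lfloor\frac{n-1}{d}\rfloor+1)^{2}$. The grid size and window half-width are hypothetical.

```python
import math

def index_set(j1, j2, n, d):
    """Assumed form of I_{j1,j2}: sites of the n x n grid reached from (j1, j2) with stride d."""
    return {(i1, i2) for i1 in range(j1, n + 1, d) for i2 in range(j2, n + 1, d)}

def window(center, half_width, n):
    """Sites of the square window around center, clipped to the grid {1..n}^2."""
    c1, c2 = center
    return {(a, b)
            for a in range(c1 - half_width, c1 + half_width + 1)
            for b in range(c2 - half_width, c2 + half_width + 1)
            if 1 <= a <= n and 1 <= b <= n}

n, half_width = 11, 1                 # hypothetical grid size and window half-width
d = 2 * half_width + 1                # window side, so |Lambda| = d^2
sets = {(j1, j2): index_set(j1, j2, n, d)
        for j1 in range(1, d + 1) for j2 in range(1, d + 1)}

# The d^2 index sets should cover the grid without overlapping.
covered = set().union(*sets.values())
is_partition = (len(covered) == n * n
                and sum(len(s) for s in sets.values()) == n * n)

# Windows centered at distinct sites of one index set should be disjoint.
sites = sorted(sets[(1, 1)])
windows_disjoint = all(window(p, half_width, n).isdisjoint(window(q, half_width, n))
                       for k, p in enumerate(sites) for q in sites[k + 1:])

# Cardinality of I_{1,1} and the weights p_{j1,j2} = sqrt(C_{j1,j2}) / C.
C11 = len(sets[(1, 1)])
C = sum(math.sqrt(len(s)) for s in sets.values())
weights = {key: math.sqrt(len(s)) / C for key, s in sets.items()}
```

With $n=11$ and $d=3$ this gives nine index sets of sizes between $9$ and $16$, with $I_{1,1}$ the largest, and weights that sum to one.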

Bibliography

[1] Y. Ma, C. Rush, and D. Baron, “Analysis of approximate message passing with a class of non-separable denoisers,” Proc. IEEE Int. Symp. Inf. Theory, June 2017, full version: https://arxiv.org/abs/1705.03126.
[2] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Nat. Academy Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[3] A. Montanari, “Graphical models concepts in compressed sensing,” in Compressed Sensing, Y. C. Eldar and G. Kutyniok, Eds. Cambridge University Press, 2012, pp. 394–438. [Online]. Available: http://dx.doi.org/10.1017/CBO9780511794308.010
[4] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[5] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, “Probabilistic reconstruction in compressed sensing: Algorithms, phase diagrams, and threshold achieving matrices,” J. Stat. Mech. – Theory E., vol. 2012, no. 08, p. P08009, Aug. 2012.
[6] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), St. Petersburg, Russia, July 2011, pp. 2168–2172.
[7] J. Tan, Y. Ma, and D. Baron, “Compressive imaging via approximate message passing with image denoising,” IEEE Trans. Signal Processing, vol. 63, no. 8, pp. 2085–2092, April 2015.
[8] C. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, Apr. 2016.
