Matrix denoising for weighted loss functions and heterogeneous signals

William Leeb

arXiv:1902.09474·math.ST·April 8, 2021·SIAM J. Math. Data Sci.


TL;DR

This paper develops optimal spectral denoisers for low-rank matrix estimation under weighted loss functions, addressing challenges of heterogeneity, missing data, and heteroscedastic noise, to improve denoising performance.

Contribution

It introduces a framework for deriving optimal spectral denoisers for weighted loss functions and combines them to exploit heterogeneity in signals for enhanced estimation.

Findings

1. Derived optimal spectral denoisers for weighted loss functions.
2. Constructed a new denoiser leveraging heterogeneity to improve accuracy.
3. Addressed analysis challenges of non-orthogonally-invariant weighted losses.

Abstract

We consider the problem of estimating a low-rank matrix from a noisy observed matrix. Previous work has shown that the optimal method depends crucially on the choice of loss function. In this paper, we use a family of weighted loss functions, which arise naturally for problems such as submatrix denoising, denoising with heteroscedastic noise, and denoising with missing data. However, weighted loss functions are challenging to analyze because they are not orthogonally-invariant. We derive optimal spectral denoisers for these weighted loss functions. By combining different weights, we then use these optimal denoisers to construct a new denoiser that exploits heterogeneity in the signal matrix to boost estimation with unweighted loss.


Tables (5)

Table 1: Algorithms introduced in this paper.

#	Description	Reference
1	Optimal spectral denoising for weighted loss	Section 4
2	Localized denoising for unweighted loss	Section 5
3	Submatrix denoising	Section 6.1
4	Matrix denoising with doubly-heteroscedastic noise	Section 6.2
5	Matrix denoising with missing data	Section 6.3

Table 2: Average and maximum relative errors $\|\widehat{\mathbf{C}}^{\omega}-\mathbf{C}^{\omega}\|_{\mathrm{F}}/\|\mathbf{C}^{\omega}\|_{\mathrm{F}}$; see Section 7.5 for simulation details. For Gaussian, Rademacher, and $t_{10}$ distributions, the average errors decay approximately like $O(n^{-1/2})$. The errors for the $t_{3}$ distribution do not decay, indicating poor model fit.

	Mean relative error, $\mathbf{C}^{\omega}$
$n$	Gaussian	Rademacher	t, df=10	t, df=3
$500$	2.956e-02	3.057e-02	2.928e-02	3.284e-01
$1000$	2.050e-02	2.127e-02	2.042e-02	4.208e-01
$2000$	1.459e-02	1.514e-02	1.466e-02	5.280e-01
$4000$	1.033e-02	1.019e-02	1.030e-02	6.535e-01
$8000$	7.459e-03	7.693e-03	7.495e-03	7.640e-01

Table 3: Average and maximum relative errors $\|\widehat{\mathbf{D}}-\mathbf{D}\|_{\mathrm{F}}/\|\mathbf{D}\|_{\mathrm{F}}$; see Section 7.5 for simulation details. For Gaussian, Rademacher, and $t_{10}$ distributions, the average errors decay approximately like $O(n^{-1/2})$. The errors for the $t_{3}$ distribution do not decay, indicating poor model fit.

	Mean relative error, $\mathbf{D}$
$n$	Gaussian	Rademacher	t, df=10	t, df=3
$500$	1.991e-02	2.039e-02	1.975e-02	1.757e-01
$1000$	1.414e-02	1.437e-02	1.412e-02	2.219e-01
$2000$	9.917e-03	1.015e-02	1.001e-02	2.788e-01
$4000$	7.027e-03	6.875e-03	7.005e-03	3.455e-01
$8000$	5.040e-03	5.224e-03	5.087e-03	4.141e-01

Table 4: Average and maximum relative errors of estimation; see Section 7.6 for simulation details. The naive rank estimate $\hat{r}^{\mathrm{naive}}$ from (34) tends to overestimate the true rank $r=2$, whereas the estimate $\hat{r}^{\mathrm{KN}}$ of Kritchman and Nadler [36] is more accurate. However, the difference between the errors in the resulting estimates of $\mathbf{X}$ is not large. Both methods perform poorly for heavy tailed distributions.

	Mean relative error
Noise type	Oracle	K-N	Naive
Gaussian	4.010e-01	4.010e-01	4.012e-01
Rademacher	4.012e-01	4.012e-01	4.013e-01
t, df=10	4.022e-01	4.022e-01	4.024e-01
t, df=5	4.047e-01	4.086e-01	4.099e-01
t, df=4	4.337e-01	4.484e-01	4.539e-01
t, df=3	6.606e-01	7.459e-01	7.611e-01
t, df=2.5	1.026e+00	1.287e+00	1.300e+00

Table 5: Average and maximum rank estimates; see Section 7.6 for simulation details. The naive rank estimate $\hat{r}^{\mathrm{naive}}$ from (34) tends to overestimate the true rank $r=2$, whereas the estimate $\hat{r}^{\mathrm{KN}}$ of Kritchman and Nadler [36] is more accurate. However, the difference between the errors in the resulting estimates of $\mathbf{X}$ is not large. Both methods perform poorly for heavy tailed distributions.

	Mean rank
Noise type	Oracle	K-N	Naive
Gaussian	2.000e+00	2.000e+00	2.084e+00
Rademacher	2.000e+00	2.000e+00	2.037e+00
t, df=10	2.000e+00	2.000e+00	2.094e+00
t, df=5	2.000e+00	2.128e+00	2.443e+00
t, df=4	2.000e+00	2.858e+00	3.577e+00
t, df=3	2.000e+00	7.875e+00	8.967e+00
t, df=2.5	2.000e+00	1.643e+01	1.716e+01

Equations (216)

$$\mu=\lim_{p\to\infty}\frac{1}{p}\text{tr}(\Omega^{T}\Omega),\qquad\nu=\lim_{n\to\infty}\frac{1}{n}\text{tr}(\Pi^{T}\Pi).$$

$$e_{jk}=\lim_{p\to\infty}\langle\Omega\mathbf{u}_{j},\Omega\mathbf{u}_{k}\rangle,\qquad\tilde{e}_{jk}=\lim_{n\to\infty}\langle\Pi\mathbf{v}_{j},\Pi\mathbf{v}_{k}\rangle.$$

$$\gamma=\lim_{n\to\infty}\frac{p_{n}}{n}$$

$$\mathbf{x}^{T}\mathbf{A}\mathbf{x}=\sum_{k=1}^{m}h_{k}\langle\mathbf{x},\mathbf{w}_{k}\rangle^{2}.$$

$$\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{j}\sim 0$$

$$\mathcal{S}=\left\{\widehat{\mathbf{U}}\mathbf{B}\widehat{\mathbf{V}}^{T}:\mathbf{B}\in\mathbb{R}^{r\times r}\right\}.$$

$$\mathcal{L}_{n}(\widehat{\mathbf{X}},\mathbf{X})=\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2},$$

$$\mathcal{L}(\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T},\mathbf{X})=\lim_{n\to\infty}\mathcal{L}_{n}(\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T},\mathbf{X}).$$

$$\widehat{\mathbf{B}}=\operatorname*{argmin}_{\mathbf{B}^{\prime}\in\mathbb{R}^{r\times r}}\mathcal{L}(\widehat{\mathbf{U}}\mathbf{B}^{\prime}\widehat{\mathbf{V}}^{T},\mathbf{X})$$

$$c_{jk}=\lim_{p\to\infty}\langle\hat{\mathbf{u}}_{j},\mathbf{u}_{k}\rangle,\qquad\tilde{c}_{jk}=\lim_{n\to\infty}\langle\hat{\mathbf{v}}_{j},\mathbf{v}_{k}\rangle.$$

$$c_{jk}^{\omega}=\lim_{p\to\infty}\langle\Omega\hat{\mathbf{u}}_{j},\Omega\mathbf{u}_{k}\rangle,\qquad\tilde{c}_{jk}^{\omega}=\lim_{n\to\infty}\langle\Pi\hat{\mathbf{v}}_{j},\Pi\mathbf{v}_{k}\rangle.$$

$$d_{jk}=\lim_{p\to\infty}\langle\Omega\hat{\mathbf{u}}_{j},\Omega\hat{\mathbf{u}}_{k}\rangle,\qquad\tilde{d}_{jk}=\lim_{n\to\infty}\langle\Pi\hat{\mathbf{v}}_{j},\Pi\hat{\mathbf{v}}_{k}\rangle.$$

$$\lambda_{k}^{2}=\begin{cases}(t_{k}^{2}+1)\left(1+\frac{\gamma}{t_{k}^{2}}\right),&\text{if }t_{k}>\gamma^{1/4},\\(1+\sqrt{\gamma})^{2},&\text{if }t_{k}\leq\gamma^{1/4},\end{cases}$$

$$c_{jk}^{2}=\begin{cases}\frac{1-\gamma/t_{k}^{4}}{1+\gamma/t_{k}^{2}},&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ or }t_{k}\leq\gamma^{1/4},\end{cases}$$

$$\tilde{c}_{jk}^{2}=\begin{cases}\frac{1-\gamma/t_{k}^{4}}{1+1/t_{k}^{2}},&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ or }t_{k}\leq\gamma^{1/4}.\end{cases}$$

$$c_{jk}^{\omega}=\begin{cases}e_{jk}c_{j},&\text{if }t_{j}>\gamma^{1/4},\\0,&\text{if }t_{j}\leq\gamma^{1/4},\end{cases}$$

$$\tilde{c}_{jk}^{\omega}=\begin{cases}\tilde{e}_{jk}\tilde{c}_{j},&\text{if }t_{j}>\gamma^{1/4},\\0,&\text{if }t_{j}\leq\gamma^{1/4},\end{cases}$$

$$d_{jk}=\begin{cases}c_{k}^{2}\alpha_{k}+s_{k}^{2}\mu,&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\e_{jk}c_{j}c_{k},&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}\leq\gamma^{1/4},\end{cases}$$

$$\tilde{d}_{jk}=\begin{cases}\tilde{c}_{k}^{2}\beta_{k}+\tilde{s}_{k}^{2}\nu,&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\\tilde{e}_{jk}\tilde{c}_{j}\tilde{c}_{k},&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}\leq\gamma^{1/4}.\end{cases}$$

$$\lim_{n\to\infty}\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}=\left\langle\mathbf{E}\,\mathrm{diag}(\mathbf{t})\,\widetilde{\mathbf{E}}-\mathbf{C}^{T}\mathbf{D}^{+}\mathbf{C}\,\mathrm{diag}(\mathbf{t})\,\widetilde{\mathbf{C}}^{T}\widetilde{\mathbf{D}}^{+}\widetilde{\mathbf{C}},\,\mathrm{diag}(\mathbf{t})\right\rangle_{\mathrm{F}}.$$

$$t_{k}=\sqrt{\frac{\lambda_{k}^{2}-1-\gamma+\sqrt{(\lambda_{k}^{2}-1-\gamma)^{2}-4\gamma}}{2}},\qquad c_{k}=\sqrt{\frac{1-\gamma/t_{k}^{4}}{1+\gamma/t_{k}^{2}}},\qquad\tilde{c}_{k}=\sqrt{\frac{1-\gamma/t_{k}^{4}}{1+1/t_{k}^{2}}}.$$

$$\alpha_{k}=\frac{d_{k}-s_{k}^{2}\mu}{c_{k}^{2}},\qquad\beta_{k}=\frac{\tilde{d}_{k}-\tilde{s}_{k}^{2}\nu}{\tilde{c}_{k}^{2}}.$$

$$\widehat{\mathbf{X}}^{\mathrm{dd}}=\widehat{\mathbf{U}}\,\mathrm{diag}(\hat{\mathbf{t}})\,\widehat{\mathbf{V}}^{T}=\sum_{k=1}^{r}\hat{t}_{k}\hat{\mathbf{u}}_{k}\hat{\mathbf{v}}_{k}^{T}$$

$$\hat{t}_{k}=t_{k}c_{k}\tilde{c}_{k}\cdot\frac{\alpha_{k}}{c_{k}^{2}\alpha_{k}+s_{k}^{2}\mu}\cdot\frac{\beta_{k}}{\tilde{c}_{k}^{2}\beta_{k}+\tilde{s}_{k}^{2}\nu},$$

$$\lim_{n\to\infty}\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}=\sum_{k=1}^{r}t_{k}^{2}\alpha_{k}\beta_{k}\left(1-c_{k}^{2}\tilde{c}_{k}^{2}\cdot\frac{\alpha_{k}}{c_{k}^{2}\alpha_{k}+s_{k}^{2}\mu}\cdot\frac{\beta_{k}}{\tilde{c}_{k}^{2}\beta_{k}+\tilde{s}_{k}^{2}\nu}\right).$$

$$\widehat{\mathbf{X}}^{\mathrm{loc}}=\sum_{i=1}^{I}\sum_{j=1}^{J}\Omega_{i}\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{loc}}\Pi_{j}.$$

$$\|\widehat{\mathbf{X}}-\mathbf{X}\|_{\mathrm{F}}^{2}=\|\mathbf{S}^{1/2}\psi(\widetilde{\mathbf{Y}})\mathbf{T}^{1/2}-\mathbf{S}^{1/2}\widetilde{\mathbf{X}}\mathbf{T}^{1/2}\|_{\mathrm{F}}^{2}=\|\mathbf{S}^{1/2}(\psi(\widetilde{\mathbf{Y}})-\widetilde{\mathbf{X}})\mathbf{T}^{1/2}\|_{\mathrm{F}}^{2},$$

$$\lim_{p\to\infty}\|\widehat{\mathbf{S}}_{p}-\mathbf{S}_{p}\|_{\mathrm{op}}=\lim_{n\to\infty}\|\widehat{\mathbf{T}}_{n}-\mathbf{T}_{n}\|_{\mathrm{op}}=0.$$

$$\hat{a}_{i}=\sum_{j=1}^{n}Y_{ij}^{2},\qquad\hat{b}_{j}=\frac{\sum_{i=1}^{p}Y_{ij}^{2}}{\frac{1}{n}\sum_{i=1}^{p}\hat{a}_{i}},$$

$$\mathrm{SNR}_{k}=\frac{t_{k}^{2}}{\|\mathbf{N}\|_{\mathrm{op}}^{2}},$$


Full text


William Leeb, School of Mathematics, University of Minnesota, Twin Cities, Minneapolis, MN.


1 Introduction

This paper is concerned with estimating a low-rank signal matrix $\mathbf{X}$ from an observed matrix $\mathbf{Y}=\mathbf{X}+\mathbf{G}$ , where $\mathbf{G}$ is a full-rank matrix of noise. We consider two distinct aspects of the matrix denoising problem. First, we study methods designed for a broader family of loss functions, known as weighted loss functions, than considered in earlier works. Second, we design a new denoiser for unweighted loss that improves upon previous work by exploiting heterogeneity in the target matrix’s singular vectors. Like many works on matrix denoising, our methods are designed for an asymptotic regime where the number of rows and columns of $\mathbf{X}$ grow infinitely large, and where the energy in the noise swamps the energy in the signal. This setting is often referred to as the spiked model [4, 3, 5, 47, 29, 8].

The methods introduced in this paper extend singular value shrinkage [51, 20, 19, 43, 18, 38], which modifies $\mathbf{Y}$'s singular values to mitigate the effects of noise. Our method of spectral denoising agrees with singular value shrinkage for unweighted loss, but performs better for weighted loss. Weighted loss functions arise in a number of applications, which we describe in Section 6, but they are challenging to analyze because they are not orthogonally invariant. To derive optimal spectral denoisers for weighted loss, we extend the asymptotic theory of the spiked model, building on work from [40].

Our new method of localized denoising is designed for unweighted loss. Unlike singular value shrinkage, however, localized denoising exploits heterogeneity in $\mathbf{X}$ ’s singular vectors; when certain blocks of coordinates of $\mathbf{X}$ are known to contain more of the signal’s energy than others, localized denoising outperforms shrinkage. At the same time, localized denoising’s asymptotic performance is never worse than shrinkage’s, and so localized denoising inherits shrinkage’s well-known optimality properties.

1.1 Main ideas

In the high-noise, high-dimensional spiked model, the energy of the noise $\mathbf{G}$ is unbounded as $p,n\to\infty$ , while the energy of $\mathbf{X}$ is fixed. Consistent estimation of $\mathbf{X}$ from $\mathbf{Y}$ is therefore not possible, so the “best” denoiser depends on the choice of loss function. The weighted loss functions we use arise in a variety of applications, described in Section 6; the new method of spectral denoising is adapted to each of these. Table 1 lists these algorithms and their locations in the paper.

The optimal spectral denoiser for weighted loss solves a least-squares problem parameterized by weighted inner products between the singular vectors of $\mathbf{X}$ and $\mathbf{Y}$ . Though formulas for unweighted inner products are well-known [47, 8], the results we need require a new analysis extending our earlier work in [40]. While we leave the details to Theorem 3.2, the key idea is that a singular vector $\hat{\mathbf{u}}_{j}$ of $\mathbf{Y}$ may be written as a combination of its projection onto the corresponding singular vector $\mathbf{u}_{j}$ of $\mathbf{X}$ and a residual unit vector $\tilde{\mathbf{u}}_{j}$ , $\hat{\mathbf{u}}_{j}=c_{j}\mathbf{u}_{j}+s_{j}\tilde{\mathbf{u}}_{j}$ . Here, $c_{j}$ and $s_{j}$ are known from the classical theory of the spiked model [47]. Because the noise $\mathbf{G}$ is orthogonally invariant, the $\tilde{\mathbf{u}}_{j}$ are uniformly random in the subspace orthogonal to $\mathbf{X}$ ’s singular vectors. Consequently, inner products of the form $\tilde{\mathbf{u}}_{j}^{T}\mathbf{A}\tilde{\mathbf{u}}_{k}$ have predictable behavior when the dimension is large [21, 57, 49].

1.2 Illustrative example

The method of localized denoising, introduced in Section 5, uses the optimal spectral denoiser for weighted loss to construct a matrix denoiser for unweighted loss. The matrix is broken into submatrices, each of which is denoised by applying the optimal spectral denoiser with weights projecting onto that submatrix’s coordinates. In Figure 1, we illustrate the performance of localized denoising on the MIT logo, which is a $1574$ -by- $2800$ matrix with rank $5$ . The logo is corrupted by iid Gaussian noise with standard deviation $\sigma=t_{5}/(1.5\gamma^{1/4})$ , $t_{5}$ being the smallest singular value of the clean image. We apply optimal singular value shrinkage and localized denoising, the latter by breaking the rows into $15$ equispaced segments and the columns into $30$ equispaced segments. The relative error $\|\widehat{\mathbf{X}}^{\mathrm{loc}}-\mathbf{X}\|_{\mathrm{F}}/\|\mathbf{X}\|_{\mathrm{F}}$ of localized denoising is approximately $7.41\times 10^{-2}$ ; the relative error of singular value shrinkage is approximately $1.25\times 10^{-1}$ , which is significantly larger. The improvement from localized denoising is due to the signal matrix’s heterogeneity along the rows and columns. However, the row and column subdivisions are not chosen to extract any specific structure in the image, and localized denoising does not appear to be very sensitive to the choice of subdivisions; similar results may be obtained with other subdivisions as well.

1.3 Outline of the paper

Section 2 contains the problem statement and key definitions. Section 3 presents the new asymptotic results. Section 4 derives the optimal spectral denoiser for weighted loss. Section 5 introduces localized denoising. Section 6 describes three applications of weighted loss functions. Section 7 reports on numerical experiments. Section 8 concludes by discussing potential applications.

2 Preliminaries

2.1 The observation model

We observe a $p$ -by- $n$ data matrix $\mathbf{Y}=\mathbf{X}+\mathbf{G}$ , consisting of a low-rank signal matrix $\mathbf{X}$ and a full-rank isotropic Gaussian noise matrix $\mathbf{G}$ . We write $\mathbf{X}$ as $\mathbf{X}=\sum_{k=1}^{r}t_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{T},$ where the $\mathbf{u}_{k}$ and $\mathbf{v}_{k}$ are orthonormal vectors in $\mathbb{R}^{p}$ and $\mathbb{R}^{n}$ , respectively, and $t_{1}>\dots>t_{r}>0$ . The entries of the noise matrix $\mathbf{G}$ are iid $N(0,1/n)$ . We write $\mathbf{Y}$ as $\mathbf{Y}=\sum_{k=1}^{\min(n,p)}\lambda_{k}\hat{\mathbf{u}}_{k}\hat{\mathbf{v}}_{k}^{T},$ where the $\hat{\mathbf{u}}_{k}$ and $\hat{\mathbf{v}}_{k}$ are orthonormal vectors in $\mathbb{R}^{p}$ and $\mathbb{R}^{n}$ , respectively, and $\lambda_{1}\geq\dots\geq\lambda_{\min(n,p)}\geq 0$ .
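As a concrete illustration of this observation model, the following NumPy sketch draws a rank-2 signal and adds noise with iid $N(0,1/n)$ entries. The sizes, the singular values $(3,2)$, the random choice of singular vectors, and the helper name `spiked_model` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def spiked_model(p, n, t, rng):
    """Draw Y = X + G with X = sum_k t_k u_k v_k^T and G_ij iid N(0, 1/n)."""
    r = len(t)
    # Orthonormal singular vectors from the QR factorization of Gaussian
    # matrices (an assumed choice; the model allows any orthonormal vectors).
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    X = U @ np.diag(t) @ V.T
    G = rng.standard_normal((p, n)) / np.sqrt(n)
    return X + G, X, U, V

rng = np.random.default_rng(0)
p, n = 200, 400
t = np.array([3.0, 2.0])
Y, X, U, V = spiked_model(p, n, t, rng)

gamma = p / n
noise_top = np.linalg.norm(Y - X, 2)   # largest singular value of the noise G
print(noise_top, 1 + np.sqrt(gamma))   # the two numbers should be close
```

With this normalization the largest singular value of the pure noise matrix sits near $1+\sqrt{\gamma}$, which is why that value acts as the detection threshold later in the paper.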

We let $\Omega=\Omega_{p}$ be one of a sequence of matrices with $p$ columns, and $\Pi=\Pi_{n}$ be one of a sequence of matrices with $n$ columns. In Section 2.3, these matrices will be used to define the loss function for estimating $\mathbf{X}$ . In order to have a well-defined asymptotic theory when $p,n\to\infty$ , we assume that certain quantities defined in terms of $\Omega$ , $\Pi$ , and the singular vectors of $\mathbf{X}$ have definite, finite limits. We define:

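$$\mu=\lim_{p\to\infty}\frac{1}{p}\text{tr}(\Omega^{T}\Omega),\qquad\nu=\lim_{n\to\infty}\frac{1}{n}\text{tr}(\Pi^{T}\Pi).\tag{1}$$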

For $1\leq j,k\leq r$ , we assume that the weighted inner products between the population singular vectors converge almost surely in the large $p$ , large $n$ limits:

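$$e_{jk}=\lim_{p\to\infty}\langle\Omega\mathbf{u}_{j},\Omega\mathbf{u}_{k}\rangle,\qquad\tilde{e}_{jk}=\lim_{n\to\infty}\langle\Pi\mathbf{v}_{j},\Pi\mathbf{v}_{k}\rangle.\tag{2}$$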

For $1\leq k\leq r$ , we will let $\alpha_{k}=e_{kk}$ and $\beta_{k}=\tilde{e}_{kk}$ .

We assume that these limits exist and are finite and positive. We also assume that the operator norms of the matrices $\Omega=\Omega_{p}$ and $\Pi=\Pi_{n}$ remain bounded as $p,n\to\infty$ ; that is, $\|\Omega_{p}\|_{\mathrm{op}},\|\Pi_{n}\|_{\mathrm{op}}\leq C<\infty$ for all $p$ and $n$ , where $C$ does not depend on $p$ or $n$ . These conditions are the only assumptions we make on the matrices $\Omega$ and $\Pi$ .

We will parametrize the problem size by the number of columns $n$ , and let the number of rows $p=p_{n}$ grow with $n$ . Specifically, we will assume that the limit

$$\gamma=\lim_{n\to\infty}\frac{p_{n}}{n}\tag{3}$$

is well-defined and finite. In all statements where $n\to\infty$ , it will be implicitly assumed as well that $p\to\infty$ and $p/n\to\gamma$ . We assume that the number of population components $r$ and the singular values $t_{1},\dots,t_{r}$ stay fixed with $p$ and $n$ .

Remark 1.

All quantities that depend on $p$ and $n$ , such as $\mathbf{u}_{k}$ and $\mathbf{v}_{k}$ , are actually elements of a sequence indexed by $p$ and/or $n$ . However, for notational simplicity, we drop the explicit dependence on $p$ and $n$ unless it is needed for clarity.

Remark 2.

There are counterexamples to the existence of the limits in (1) and (2). For example, one may take $\mathbf{u}_{1}$ to be the first standard unit vector $(1,0,\dots,0)^{T}$ when $p$ is even, and the constant vector $(1,\dots,1)^{T}/\sqrt{p}$ when $p$ is odd; and take $\Omega_{p}=\mathrm{diag}(0,1,\dots,1)$. The limit defining $\alpha_{1}$ will not exist in this case, as odd terms in the sequence $\|\Omega_{p}\mathbf{u}_{1}\|^{2}$ converge to $1$, and even terms converge to $0$. By contrast, the examples in Section 7 satisfy the asymptotic conditions.

Remark 3.

The values $\mu$ , $\nu$ , $e_{jk}$ , and $\tilde{e}_{jk}$ from equations (1) and (2) are used to characterize the weighted inner products between singular vectors of $\mathbf{X}$ and $\mathbf{Y}$ ; see Theorem 3.2. These weighted inner products are needed to evaluate optimal spectral denoisers for weighted loss, as described in Section 4.

2.2 Heterogeneity, genericity, and weighted orthogonality

One of the aspects of the theory of matrix denoising we will explore is the role of the signal matrix $\mathbf{X}$ ’s singular vectors, $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ and $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ . To that end, we introduce two definitions we will be using throughout the paper. We say that a unit vector $\mathbf{x}\in\mathbb{R}^{m}$ is generic with respect to an $m$ -by- $m$ positive-semidefinite matrix $\mathbf{A}_{m}\in\mathbb{R}^{m\times m}$ if $\mathbf{x}^{T}\mathbf{A}\mathbf{x}\sim\frac{1}{m}\text{tr}(\mathbf{A}),$ where “ $\sim$ ” indicates that the difference between the two sides vanishes almost surely as $m\to\infty$ (to be precise, $\mathbf{x}$ and $\mathbf{A}$ are elements of a sequence of vectors and matrices, respectively, indexed by $m$ ; but following the convention described in Remark 1 we will drop the explicit dependence on $m$ ).

By contrast, we say that $\mathbf{x}$ is heterogeneous if it is not generic. This means that the energy of the vector $\mathbf{x}$ is not uniformly distributed across its coordinates in the eigenbasis of $\mathbf{A}$ . Indeed, if $\mathbf{A}=\sum_{k=1}^{m}h_{k}\mathbf{w}_{k}\mathbf{w}_{k}^{T}$ is the eigendecomposition of $\mathbf{A}$ , then

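$$\mathbf{x}^{T}\mathbf{A}\mathbf{x}=\sum_{k=1}^{m}h_{k}\langle\mathbf{x},\mathbf{w}_{k}\rangle^{2}.\tag{4}$$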

If the energy of $\mathbf{x}$ were equally spread out across the $\mathbf{w}_{k}$ , then $\langle\mathbf{x},\mathbf{w}_{k}\rangle\sim 1/\sqrt{m}$ , and so $\mathbf{x}^{T}\mathbf{A}\mathbf{x}\sim\text{tr}(\mathbf{A})/m$ .

Given a collection of vectors $\mathbf{x}_{1},\dots,\mathbf{x}_{k}\in\mathbb{R}^{m}$ , we will say that they satisfy the weighted orthogonality condition (or are weighted orthogonal) with respect to a positive-semidefinite matrix $\mathbf{A}$ if

$$\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{j}\sim 0\tag{5}$$

whenever $i\neq j$ . In other words, the $\mathbf{x}_{j}$ are asymptotically orthogonal with respect to the weighted inner product defined by $\mathbf{A}$ .

Remark 4.

From the Hanson-Wright inequality [21, 57, 49], random unit vectors $\mathbf{x}$ from suitably regular distributions are generic, with respect to any $\mathbf{A}$ with bounded operator norm. Furthermore, the weighted orthogonality condition will also hold for independent random unit vectors $\mathbf{x}_{1},\dots,\mathbf{x}_{k}$ from a suitable distribution (see [7]).
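The two definitions of this section can be checked numerically. The sketch below, under the assumption of a diagonal weight matrix $\mathbf{A}$ with half of its eigenvalues equal to zero (an illustrative choice, not one from the paper), compares a random unit vector, which should be generic, with a vector supported entirely on the nonzero weights, which is heterogeneous. It also evaluates the weighted inner product of two independent random vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4000
# Diagonal of a PSD weight matrix A: ones on the first half, zeros elsewhere.
a = np.concatenate([np.ones(m // 2), np.zeros(m - m // 2)])

def unit(v):
    return v / np.linalg.norm(v)

def quad(u, v):
    """Weighted inner product u^T A v for the diagonal matrix A = diag(a)."""
    return float(u @ (a * v))

x = unit(rng.standard_normal(m))   # random direction: expected to be generic
y = unit(rng.standard_normal(m))   # independent random direction
z = unit(a)                        # all energy where A is nonzero: heterogeneous

trace_avg = a.sum() / m            # tr(A) / m = 0.5

print(quad(x, x), trace_avg)       # close to each other
print(quad(x, y))                  # close to 0 (weighted orthogonality)
print(quad(z, z))                  # equals 1, far from tr(A) / m
```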

2.3 Spectral denoisers and weighted loss functions

For the top $r$ empirical singular vectors $\hat{\mathbf{u}}_{1},\dots,\hat{\mathbf{u}}_{r}$ and $\hat{\mathbf{v}}_{1},\dots,\hat{\mathbf{v}}_{r}$ of $\mathbf{Y}$ , define the matrices $\widehat{\mathbf{U}}=[\hat{\mathbf{u}}_{1},\dots,\hat{\mathbf{u}}_{r}]$ and $\widehat{\mathbf{V}}=[\hat{\mathbf{v}}_{1},\dots,\hat{\mathbf{v}}_{r}]$ . Consider the class of estimators defined by

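$$\mathcal{S}=\left\{\widehat{\mathbf{U}}\mathbf{B}\widehat{\mathbf{V}}^{T}:\mathbf{B}\in\mathbb{R}^{r\times r}\right\}.\tag{6}$$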

Each matrix in $\mathcal{S}$ has the same singular subspaces as the observed matrix $\mathbf{Y}$, though not necessarily the same singular vectors. We call $\mathcal{S}$ the family of spectral denoisers.

We consider estimating the low-rank signal matrix $\mathbf{X}$ with respect to the weighted Frobenius loss defined by

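$$\mathcal{L}_{n}(\widehat{\mathbf{X}},\mathbf{X})=\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2},\tag{7}$$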

where $\|\cdot\|_{\mathrm{F}}$ denotes the matrix Frobenius norm, and $\Omega$ and $\Pi$ are matrices satisfying the conditions in Section 2.1. This type of loss function is used when the user pays different prices for errors in different rows and columns.
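A minimal sketch of the weighted Frobenius loss follows. The function name `weighted_loss` and the choice of weights are assumptions for illustration: here $\Omega$ and $\Pi$ are coordinate projections onto a block of rows and columns, so the weighted loss reduces to the squared error on that submatrix.

```python
import numpy as np

def weighted_loss(X_hat, X, Omega, Pi):
    """Weighted Frobenius loss || Omega (X_hat - X) Pi^T ||_F^2."""
    return np.linalg.norm(Omega @ (X_hat - X) @ Pi.T, "fro") ** 2

p, n = 6, 8
rng = np.random.default_rng(2)
X = rng.standard_normal((p, n))
X_hat = X + 0.1 * rng.standard_normal((p, n))

# Coordinate projections onto rows 0-2 and columns 0-3 (a hypothetical block).
Omega = np.diag((np.arange(p) < 3).astype(float))
Pi = np.diag((np.arange(n) < 4).astype(float))

block_loss = weighted_loss(X_hat, X, Omega, Pi)
direct = np.sum((X_hat[:3, :4] - X[:3, :4]) ** 2)
print(block_loss, direct)   # identical up to rounding
```

Taking both weights to be identity matrices recovers the ordinary unweighted squared Frobenius error.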

We now define the precise estimation problem we will consider. For any deterministic $r$ -by- $r$ matrix $\widehat{\mathbf{B}}$ , we define the asymptotic error

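$$\mathcal{L}(\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T},\mathbf{X})=\lim_{n\to\infty}\mathcal{L}_{n}(\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T},\mathbf{X}).\tag{8}$$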

Our goal is then to find the matrix $\widehat{\mathbf{B}}$ to minimize this loss, and show how $\widehat{\mathbf{B}}$ may be consistently estimated from the observed matrix $\mathbf{Y}$ . That is, we define

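$$\widehat{\mathbf{B}}=\operatorname*{argmin}_{\mathbf{B}^{\prime}\in\mathbb{R}^{r\times r}}\mathcal{L}(\widehat{\mathbf{U}}\mathbf{B}^{\prime}\widehat{\mathbf{V}}^{T},\mathbf{X})\tag{9}$$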

and define $\widehat{\mathbf{X}}=\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T}$ .

Remark 5.

For any deterministic $\widehat{\mathbf{B}}$ , the asymptotic loss (8) exists and is finite almost surely, even though the matrices $\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T}$ and $\mathbf{X}$ are growing in size. It will be shown in Section 4 that since $\widehat{\mathbf{U}}\widehat{\mathbf{B}}\widehat{\mathbf{V}}^{T}$ and $\mathbf{X}$ each have rank at most $r$ , $\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}$ depends only on $t_{1},\dots,t_{r}$ ; the $r^{2}$ entries of $\widehat{\mathbf{B}}$ ; and the weighted inner products between the top $r$ singular vectors of $\mathbf{Y}$ and $\mathbf{X}$ . It will follow from Theorem 3.2 that these inner products converge almost surely to finite limits, and consequently that the asymptotic loss (8) is well-defined almost surely.

3 Asymptotic theory for the spiked model

In this section, we derive the limits of inner products between the weighted population and empirical vectors. We define the cosines between the unweighted empirical and population vectors:

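$$c_{jk}=\lim_{p\to\infty}\langle\hat{\mathbf{u}}_{j},\mathbf{u}_{k}\rangle,\qquad\tilde{c}_{jk}=\lim_{n\to\infty}\langle\hat{\mathbf{v}}_{j},\mathbf{v}_{k}\rangle.\tag{10}$$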

Next we define weighted inner products between the population and empirical vectors:

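$$c_{jk}^{\omega}=\lim_{p\to\infty}\langle\Omega\hat{\mathbf{u}}_{j},\Omega\mathbf{u}_{k}\rangle,\qquad\tilde{c}_{jk}^{\omega}=\lim_{n\to\infty}\langle\Pi\hat{\mathbf{v}}_{j},\Pi\mathbf{v}_{k}\rangle.\tag{11}$$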

These are inner products with weight matrices $\Omega^{T}\Omega$ and $\Pi^{T}\Pi$ , respectively. We also define the weighted inner products between the empirical singular vectors:

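$$d_{jk}=\lim_{p\to\infty}\langle\Omega\hat{\mathbf{u}}_{j},\Omega\hat{\mathbf{u}}_{k}\rangle,\qquad\tilde{d}_{jk}=\lim_{n\to\infty}\langle\Pi\hat{\mathbf{v}}_{j},\Pi\hat{\mathbf{v}}_{k}\rangle.\tag{12}$$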

We let $c_{k}^{\omega}=c_{kk}^{\omega}$ and $\tilde{c}_{k}^{\omega}=\tilde{c}_{kk}^{\omega}$ , and similarly for the other terms.

Remark 6.

From Theorem 3.2 below, the limits (11) and (12) exist almost surely and are finite so long as the assumptions on $\Omega$ and $\Pi$ from Section 2.1 hold.

The first result provides formulas for $c_{jk}$ and $\tilde{c}_{jk}$ , and relates the singular values of $\mathbf{X}$ to those of $\mathbf{Y}$ . It is well-known in the literature (see e.g. [47, 8]).

Proposition 3.1.

For $1\leq k\leq r$ , the $k^{th}$ singular value of $\mathbf{Y}$ converges almost surely as $p,n\to\infty$ to $\lambda_{k}$ , defined by:

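$$\lambda_{k}^{2}=\begin{cases}(t_{k}^{2}+1)\left(1+\frac{\gamma}{t_{k}^{2}}\right),&\text{if }t_{k}>\gamma^{1/4},\\(1+\sqrt{\gamma})^{2},&\text{if }t_{k}\leq\gamma^{1/4}.\end{cases}\tag{13}$$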

For $1\leq j,k\leq r$ , the limits (10) defining $c_{jk}$ and $\tilde{c}_{jk}$ almost surely exist and are given by the following expressions:

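$$c_{jk}^{2}=\begin{cases}\frac{1-\gamma/t_{k}^{4}}{1+\gamma/t_{k}^{2}},&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ or }t_{k}\leq\gamma^{1/4},\end{cases}\tag{14}$$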

and

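$$\tilde{c}_{jk}^{2}=\begin{cases}\frac{1-\gamma/t_{k}^{4}}{1+1/t_{k}^{2}},&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ or }t_{k}\leq\gamma^{1/4}.\end{cases}\tag{15}$$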
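The limits in Proposition 3.1 are simple to evaluate. The sketch below maps population singular values $t_{k}$ and the aspect ratio $\gamma$ to the limiting singular value $\lambda_{k}$ and the nonnegative cosines $c_{k}$, $\tilde{c}_{k}$, with the threshold at $\gamma^{1/4}$. The function name `spiked_limits` and the example inputs are assumptions made for illustration.

```python
import numpy as np

def spiked_limits(t, gamma):
    """Limiting singular values and cosines for population values t."""
    t = np.asarray(t, dtype=float)
    above = t > gamma ** 0.25
    # Below the threshold, substitute t = inf so the closed forms stay finite;
    # those entries are overwritten by np.where in every expression below.
    t2 = np.where(above, t, np.inf) ** 2
    lam = np.where(above, np.sqrt((t2 + 1) * (1 + gamma / t2)), 1 + np.sqrt(gamma))
    c = np.where(above, np.sqrt((1 - gamma / t2**2) / (1 + gamma / t2)), 0.0)
    c_tilde = np.where(above, np.sqrt((1 - gamma / t2**2) / (1 + 1 / t2)), 0.0)
    return lam, c, c_tilde

# Two detectable components and one below the threshold gamma^(1/4) ~ 0.84.
lam, c, c_tilde = spiked_limits([3.0, 2.0, 0.5], gamma=0.5)
print(lam, c, c_tilde)
```

The third component illustrates the phase transition: its singular value is absorbed into the noise edge $1+\sqrt{\gamma}$ and its empirical singular vectors carry no information about the population ones.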

Remark 7.

While the signs of $c_{k}$ and $\tilde{c}_{k}$ are arbitrary, their product satisfies $c_{k}\tilde{c}_{k}\geq 0$ . We may therefore assume that $c_{k}\geq 0$ and $\tilde{c}_{k}\geq 0$ (see, e.g., [43]).

Theorem 3.2.

Suppose $1\leq j,k\leq r$ . Then the limits (11) and (12) almost surely exist and are equal to the following expressions:

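$$c_{jk}^{\omega}=\begin{cases}e_{jk}c_{j},&\text{if }t_{j}>\gamma^{1/4},\\0,&\text{if }t_{j}\leq\gamma^{1/4},\end{cases}\tag{16}$$

$$\tilde{c}_{jk}^{\omega}=\begin{cases}\tilde{e}_{jk}\tilde{c}_{j},&\text{if }t_{j}>\gamma^{1/4},\\0,&\text{if }t_{j}\leq\gamma^{1/4},\end{cases}\tag{17}$$

$$d_{jk}=\begin{cases}c_{k}^{2}\alpha_{k}+s_{k}^{2}\mu,&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\e_{jk}c_{j}c_{k},&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}\leq\gamma^{1/4},\end{cases}\tag{18}$$

$$\tilde{d}_{jk}=\begin{cases}\tilde{c}_{k}^{2}\beta_{k}+\tilde{s}_{k}^{2}\nu,&\text{if }j=k\text{ and }t_{k}>\gamma^{1/4},\\\tilde{e}_{jk}\tilde{c}_{j}\tilde{c}_{k},&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}>\gamma^{1/4},\\0,&\text{if }j\neq k\text{ and }\min\{t_{j},t_{k}\}\leq\gamma^{1/4}.\end{cases}\tag{19}$$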

The proof of Theorem 3.2 may be found in Section A.

Remark 8.

While the signs of inner products between singular vectors are arbitrary, Theorem 3.2 states that once the signs of $e_{jk}$ and $\tilde{e}_{jk}$ are fixed, the signs of $c_{jk}^{\omega}$ , $\tilde{c}_{jk}^{\omega}$ , $d_{jk}$ and $\tilde{d}_{jk}$ are determined.

4 Optimal spectral denoising

In this section, we derive the asymptotically optimal spectral denoiser with respect to the weighted loss (8), and show how to consistently estimate it from $\mathbf{Y}$. We define the $r$-by-$r$ weighted inner product matrices $\mathbf{D}=(d_{jk})$, $\widetilde{\mathbf{D}}=(\tilde{d}_{jk})$, $\mathbf{E}=(e_{jk})$, $\widetilde{\mathbf{E}}=(\tilde{e}_{jk})$, $\mathbf{C}=(c_{jk}^{\omega})$, and $\widetilde{\mathbf{C}}=(\tilde{c}_{jk}^{\omega})$, and the vector $\mathbf{t}=(t_{1},\dots,t_{r})^{T}$ of population singular values.

Theorem 4.1.

The optimal choice of $\widehat{\mathbf{B}}$ is given by $\widehat{\mathbf{B}}=\mathbf{D}^{+}\mathbf{C}\mathrm{diag}(\mathbf{t})\widetilde{\mathbf{C}}^{T}\widetilde{\mathbf{D}}^{+},$ with weighted AMSE almost surely equal to

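$$\lim_{n\to\infty}\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}=\left\langle\mathbf{E}\,\mathrm{diag}(\mathbf{t})\,\widetilde{\mathbf{E}}-\mathbf{C}^{T}\mathbf{D}^{+}\mathbf{C}\,\mathrm{diag}(\mathbf{t})\,\widetilde{\mathbf{C}}^{T}\widetilde{\mathbf{D}}^{+}\widetilde{\mathbf{C}},\,\mathrm{diag}(\mathbf{t})\right\rangle_{\mathrm{F}}.\tag{20}$$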

The proof of Theorem 4.1 may be found in Section B.
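The formula for $\widehat{\mathbf{B}}$ is the solution of a least-squares problem, which can be checked at finite sample size. The sketch below is an oracle illustration, not the paper's estimation procedure: it uses the true singular vectors of $\mathbf{X}$ to form finite-$n$ versions of $\mathbf{D}$, $\mathbf{C}$, $\widetilde{\mathbf{D}}$ and $\widetilde{\mathbf{C}}$, and the sizes and linearly decaying diagonal weights are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, r = 120, 200, 2
t = np.array([3.0, 2.0])

U, _ = np.linalg.qr(rng.standard_normal((p, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
X = U @ np.diag(t) @ V.T
Y = X + rng.standard_normal((p, n)) / np.sqrt(n)

# Top-r empirical singular vectors of Y.
Uh, _, Vht = np.linalg.svd(Y, full_matrices=False)
Uh, Vh = Uh[:, :r], Vht[:r].T

# Illustrative diagonal weights that emphasize the first rows and columns.
Omega = np.diag(np.linspace(2.0, 0.5, p))
Pi = np.diag(np.linspace(1.5, 0.5, n))

# Finite-sample weighted inner product matrices.
D = (Omega @ Uh).T @ (Omega @ Uh)
C = (Omega @ Uh).T @ (Omega @ U)
Dt = (Pi @ Vh).T @ (Pi @ Vh)
Ct = (Pi @ Vh).T @ (Pi @ V)

B_hat = np.linalg.pinv(D) @ C @ np.diag(t) @ Ct.T @ np.linalg.pinv(Dt)

def loss(B):
    """Finite-sample weighted loss of the spectral denoiser Uh B Vh^T."""
    return np.linalg.norm(Omega @ (Uh @ B @ Vh.T - X) @ Pi.T, "fro") ** 2

best = loss(B_hat)
perturbed = [loss(B_hat + 0.05 * rng.standard_normal((r, r))) for _ in range(20)]
print(best, min(perturbed))   # no random perturbation should beat B_hat
```

Because the loss is a convex quadratic in $\mathbf{B}$, every perturbation of `B_hat` increases it; in general the minimizer is a full $r$-by-$r$ matrix rather than a diagonal one.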

The matrices $\mathbf{D}$ , $\widetilde{\mathbf{D}}$ , $\mathbf{E}$ , $\widetilde{\mathbf{E}}$ , $\mathbf{C}$ , and $\widetilde{\mathbf{C}}$ and the singular values $t_{1},\dots,t_{r}$ may be estimated using Proposition 3.1 and Theorem 3.2. First, from Proposition 3.1, we can estimate $t_{k}$ , $c_{k}$ and $\tilde{c}_{k}$ , so long as $t_{k}>\gamma^{1/4}$ , i.e. if $\lambda_{k}>1+\sqrt{\gamma}$ :

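$$t_{k}=\sqrt{\frac{\lambda_{k}^{2}-1-\gamma+\sqrt{(\lambda_{k}^{2}-1-\gamma)^{2}-4\gamma}}{2}},\qquad c_{k}=\sqrt{\frac{1-\gamma/t_{k}^{4}}{1+\gamma/t_{k}^{2}}},\qquad\tilde{c}_{k}=\sqrt{\frac{1-\gamma/t_{k}^{4}}{1+1/t_{k}^{2}}}.\tag{21}$$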
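The estimates of $t_{k}$, $c_{k}$ and $\tilde{c}_{k}$ come from inverting the relation between $t_{k}$ and $\lambda_{k}$ in Proposition 3.1. A minimal sketch, with the hypothetical function name `estimate_spike`, followed by a round-trip check that maps known values of $t_{k}$ forward to $\lambda_{k}$ and back:

```python
import numpy as np

def estimate_spike(lam, gamma):
    """Recover (t, c, c_tilde) from singular values lam > 1 + sqrt(gamma)."""
    lam = np.asarray(lam, dtype=float)
    if np.any(lam <= 1 + np.sqrt(gamma)):
        raise ValueError("formulas apply only when lam > 1 + sqrt(gamma)")
    a = lam**2 - 1 - gamma
    # t^2 is the larger root of x^2 - a x + gamma = 0.
    t = np.sqrt((a + np.sqrt(a**2 - 4 * gamma)) / 2)
    c = np.sqrt((1 - gamma / t**4) / (1 + gamma / t**2))
    c_tilde = np.sqrt((1 - gamma / t**4) / (1 + 1 / t**2))
    return t, c, c_tilde

# Round trip: map t forward to lam, then invert.
gamma = 0.5
t_true = np.array([3.0, 2.0, 1.0])
lam = np.sqrt((t_true**2 + 1) * (1 + gamma / t_true**2))
t_est, c, c_tilde = estimate_spike(lam, gamma)
print(t_est)   # matches t_true up to rounding
```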

Remark 9.

From Remark 7, we can take both $c_{k}$ and $\tilde{c}_{k}$ to be positive.

The values $d_{jk}=\hat{\mathbf{u}}_{j}^{T}\Omega^{T}\Omega\hat{\mathbf{u}}_{k}$ and $\tilde{d}_{jk}=\hat{\mathbf{v}}_{j}^{T}\Pi^{T}\Pi\hat{\mathbf{v}}_{k}$ are directly estimable, as they are the weighted inner products between the empirical singular vectors. We then estimate $\alpha_{k}$ and $\beta_{k}$, assuming $t_{k}>\gamma^{1/4}$:

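$$\alpha_{k}=\frac{d_{k}-s_{k}^{2}\mu}{c_{k}^{2}},\qquad\beta_{k}=\frac{\tilde{d}_{k}-\tilde{s}_{k}^{2}\nu}{\tilde{c}_{k}^{2}}.\tag{22}$$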

When $j\neq k$, we take $e_{jk}=d_{jk}/(c_{j}c_{k})$ and $\tilde{e}_{jk}=\tilde{d}_{jk}/(\tilde{c}_{j}\tilde{c}_{k})$ (so long as $t_{j}$ and $t_{k}$ both exceed $\gamma^{1/4}$, i.e. $\lambda_{j}$ and $\lambda_{k}$ both exceed $1+\sqrt{\gamma}$). Finally, for all $j,k$, we take $c_{jk}^{\omega}=e_{jk}c_{j}$ and $\tilde{c}_{jk}^{\omega}=\tilde{e}_{jk}\tilde{c}_{j}$. The method is summarized in Algorithm 1.
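The estimation steps of this section can be strung together as follows. This is a hedged sketch rather than a transcription of Algorithm 1: it assumes the rank $r$ is known, the noise entries have variance $1/n$, the top $r$ singular values of $\mathbf{Y}$ all exceed $1+\sqrt{\gamma}$, and $s_{k}^{2}=1-c_{k}^{2}$. The function name `spectral_denoise`, the helper `side`, and the test weights are hypothetical.

```python
import numpy as np

def spectral_denoise(Y, Omega, Pi, r):
    """Plug-in estimate of the optimal spectral denoiser for weighted loss."""
    p, n = Y.shape
    gamma = p / n
    Uh, lam, Vht = np.linalg.svd(Y, full_matrices=False)
    Uh, lam, Vh = Uh[:, :r], lam[:r], Vht[:r].T
    if np.any(lam <= 1 + np.sqrt(gamma)):
        raise ValueError("a singular value is not above the detection threshold")

    # Population singular values and cosines from the observed singular values.
    a = lam**2 - 1 - gamma
    t = np.sqrt((a + np.sqrt(a**2 - 4 * gamma)) / 2)
    c = np.sqrt((1 - gamma / t**4) / (1 + gamma / t**2))
    ct = np.sqrt((1 - gamma / t**4) / (1 + 1 / t**2))

    def side(W, Zh, cos, dim):
        """Return the estimated (D, C) for the row side or the column side."""
        WZ = W @ Zh
        D = WZ.T @ WZ                          # d_jk, directly observable
        mu = np.sum(W**2) / dim                # tr(W^T W) / dim
        E = D / np.outer(cos, cos)             # e_jk = d_jk / (c_j c_k), j != k
        alpha = (np.diag(D) - (1 - cos**2) * mu) / cos**2
        E[np.diag_indices(r)] = alpha          # e_kk = alpha_k
        C = cos[:, None] * E                   # c^omega_jk = e_jk c_j
        return D, C

    D, C = side(Omega, Uh, c, p)
    Dt, Ct = side(Pi, Vh, ct, n)
    B = np.linalg.pinv(D) @ C @ np.diag(t) @ Ct.T @ np.linalg.pinv(Dt)
    return Uh @ B @ Vh.T

rng = np.random.default_rng(4)
p, n, r = 200, 400, 2
U, _ = np.linalg.qr(rng.standard_normal((p, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
X = U @ np.diag([3.0, 2.0]) @ V.T
Y = X + rng.standard_normal((p, n)) / np.sqrt(n)

Omega = np.diag(np.linspace(2.0, 0.5, p))
Pi = np.diag(np.linspace(1.5, 0.5, n))
X_hat = spectral_denoise(Y, Omega, Pi, r)

def wloss(Z):
    return np.linalg.norm(Omega @ (Z - X) @ Pi.T, "fro") ** 2

print(wloss(X_hat), wloss(Y))   # the denoised error is far below the raw error
```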

4.1 Diagonal denoisers

In this section, we consider a subset of spectral denoisers, in which the matrix $\widehat{\mathbf{B}}$ is required to be diagonal. More precisely, we search for a vector $\hat{\mathbf{t}}=(\hat{t}_{1},\dots,\hat{t}_{r})^{T}$ of real numbers, so that the estimator

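$$\widehat{\mathbf{X}}^{\mathrm{dd}}=\widehat{\mathbf{U}}\,\mathrm{diag}(\hat{\mathbf{t}})\,\widehat{\mathbf{V}}^{T}=\sum_{k=1}^{r}\hat{t}_{k}\hat{\mathbf{u}}_{k}\hat{\mathbf{v}}_{k}^{T}\tag{23}$$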

minimizes the AMSE $\mathcal{L}(\widehat{\mathbf{X}}^{\mathrm{dd}},\mathbf{X})=\lim_{n\to\infty}\|\Omega(\widehat{\mathbf{X}}^{\mathrm{dd}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}$ .

Remark 10.

Optimal diagonal denoising cannot have better asymptotic performance than optimal spectral denoising, as the diagonal denoiser is a spectral denoiser. However, Theorem 4.2 below shows that under weighted orthogonality, the methods coincide; and the simplicity of $\widehat{\mathbf{X}}^{\mathrm{dd}}$ makes it easier to analyze, which will be exploited in the proofs of Theorem 5.2 and Proposition 6.1 and the analysis of Section 4.2.

Theorem 4.2.

Suppose that either $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are weighted orthogonal with respect to $\Omega^{T}\Omega$ , or $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are weighted orthogonal with respect to $\Pi^{T}\Pi$ . Suppose too that $t_{k}>\gamma^{1/4}$ , $1\leq k\leq r$ . Then the singular values $\hat{t}_{k}$ , $1\leq k\leq r$ , of $\widehat{\mathbf{X}}^{\mathrm{dd}}$ are:

[TABLE]

where $t_{k}$ , $c_{k}$ and $\tilde{c}_{k}$ are given by (21), and $\alpha_{k}$ and $\beta_{k}$ are given by (22). The weighted AMSE is almost surely equal to

[TABLE]

If $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ and $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are both weighted orthogonal with respect to $\Omega^{T}\Omega$ and $\Pi^{T}\Pi$ , respectively, then $\widehat{\mathbf{X}}=\widehat{\mathbf{X}}^{\mathrm{dd}}$ .

The proof of Theorem 4.2 is found in Section C.

4.2 Behavior of the optimal singular values

In this section, we assume either that $r=1$ ; or that $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are weighted orthogonal with respect to $\Omega^{T}\Omega$ and $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are weighted orthogonal with respect to $\Pi^{T}\Pi$ . In either case, the optimal spectral denoiser coincides with the optimal diagonal denoiser, and both are given by Theorem 4.2. Though this setting is quite restrictive, it permits us to exploit formula (24) for the optimal singular values to gain insight into the behavior of the optimal spectral denoiser. Propositions 4.3 and 4.4 describe the behavior of the optimal singular value $\hat{t}_{k}$ in this setting. Because each $\hat{t}_{k}$ depends only on the information specific to component $k$ , we will drop the subscript $k$ from the notation.

Proposition 4.3.

If either $\alpha\leq\mu$ or $\beta\leq\nu$ , then $\hat{t}\leq\lambda$ . Conversely, for any fixed value of $t$ , there are sufficiently large values of $\alpha$ and $\beta$ for which $\hat{t}>\lambda$ .

Proposition 4.4.

If $\alpha\leq\mu$ or $\beta\leq\nu$ , then $\hat{t}$ is an increasing function of $\lambda$ .

The proofs of Propositions 4.3 and 4.4 may be found in Section D and Section E, respectively.

Remark 11.

From [51, 20, 43], the optimal singular value for unweighted Frobenius loss is $\hat{t}^{\mathrm{shr}}=tc\tilde{c}$ , which is smaller than the observed singular value $\lambda$ . Proposition 4.3 shows that with weighted loss, such shrinkage is guaranteed when $\alpha\leq\mu$ or $\beta\leq\nu$ , but can fail when both $\alpha$ and $\beta$ are sufficiently large.
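For reference, the unweighted shrinker can be written directly in terms of $\lambda$. Under the standard spiked-model relations with noise variance $1/n$ and $\gamma=p/n$ (an assumption of this sketch), one has $tc\tilde{c}=\sqrt{(\lambda^{2}-1-\gamma)^{2}-4\gamma}/\lambda$ for $\lambda>1+\sqrt{\gamma}$. The code below evaluates this expression and sets the value to zero below the threshold, which is the usual convention in the shrinkage literature rather than something stated in this section. The function name is hypothetical.

```python
import numpy as np

def frobenius_shrinker(lam, gamma):
    """Unweighted optimal singular value t * c * c_tilde as a function of lam.

    Assumes noise variance 1/n and gamma = p/n; returns 0 at or below
    the detection threshold 1 + sqrt(gamma). Expects lam > 0.
    """
    lam = np.asarray(lam, dtype=float)
    b = lam**2 - 1.0 - gamma
    disc = np.maximum(b**2 - 4.0 * gamma, 0.0)   # clip to avoid sqrt of negatives
    return np.where(lam > 1.0 + np.sqrt(gamma), np.sqrt(disc) / lam, 0.0)

gamma = 0.5
lams = np.linspace(1.0, 5.0, 9)
shrunk = frobenius_shrinker(lams, gamma)
```

Since $\sqrt{b^{2}-4\gamma}<b<\lambda^{2}$ with $b=\lambda^{2}-1-\gamma$, the output is always strictly below $\lambda$, in line with the remark above.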

The conclusion of Proposition 4.4 need not hold if $\alpha>\mu$ and $\beta>\nu$ . In Figure 2 we plot the optimal $\hat{t}$ , both as a function of the observed singular value $\lambda$ and the population singular value $t$ , for various values of $\alpha$ and $\beta$ (and $\mu=\nu=1$ ). The non-monotonicity is apparent when $\alpha=\beta=10$ .

5 Localized denoising

In this section, we introduce a new procedure called localized denoising for estimating $\mathbf{X}$ with unweighted Frobenius loss. As we will show, localized denoising is asymptotically never worse than optimal singular value shrinkage [20, 51], defined by $\widehat{\mathbf{X}}^{\mathrm{shr}}=\sum_{k=1}^{r}\hat{t}_{k}^{\mathrm{shr}}\hat{\mathbf{u}}_{k}\hat{\mathbf{v}}_{k}^{T},$ where $\hat{t}_{k}^{\mathrm{shr}}=t_{k}c_{k}\tilde{c}_{k}$ . Since singular value shrinkage is optimal for unweighted loss both in the minimax sense and when averaging over a uniform prior on $\mathbf{u}_{k}$ and $\mathbf{v}_{k}$ [18, 51], localized denoising inherits these same optimality properties. Furthermore, localized denoising can outperform singular value shrinkage when the singular vectors of $\mathbf{X}$ are heterogeneous.

5.1 Definition of localized denoising

To define localized denoising, we expand the identity matrices $\mathbf{I}_{p}=\sum_{i=1}^{I}\Omega_{i}$ and $\mathbf{I}_{n}=\sum_{j=1}^{J}\Pi_{j}$ into sums of pairwise orthogonal projections $\Omega_{i}\in\mathbb{R}^{p\times p}$ and $\Pi_{j}\in\mathbb{R}^{n\times n}$ , where $I$ and $J$ are fixed. We require that $\Omega_{i}=\Omega_{i}^{T}=\Omega_{i}^{2}$ and $\Omega_{i^{\prime}}\Omega_{i}=\mathbf{O}_{p\times p}$ for $i\neq i^{\prime}$ ; and similarly for the $\Pi_{j}$ .

We let $\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{loc}}$ denote the optimal spectral denoiser with respect to the weight matrices $\Omega_{i}$ and $\Pi_{j}$ . We then define the locally-denoised matrix:

[TABLE]

We summarize the localized denoising procedure in Algorithm 2.
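The structure of the procedure can be sketched as follows. The code builds projections onto contiguous coordinate blocks, which satisfy the requirements of this section, and combines the blockwise estimates as $\sum_{i,j}\Omega_{i}\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{loc}}\Pi_{j}$. That combination rule is an assumption of this sketch, as are the function names; the argument `denoise` is a placeholder for the optimal spectral denoiser with weights $\Omega_{i}$ and $\Pi_{j}$, which is not implemented here.

```python
import numpy as np

def block_projections(dim, num_blocks):
    """Diagonal 0/1 projections onto contiguous coordinate blocks that partition range(dim)."""
    projs = []
    for idx in np.array_split(np.arange(dim), num_blocks):
        d = np.zeros(dim)
        d[idx] = 1.0
        projs.append(np.diag(d))
    return projs

def localized_assemble(Y, Omegas, Pis, denoise):
    """Sum over (i, j) of Omega_i @ denoise(Y, Omega_i, Pi_j) @ Pi_j.

    `denoise(Y, Omega, Pi)` should return a p-by-n estimate tailored to the
    weights (Omega, Pi); the combination rule is an assumed form.
    """
    X_loc = np.zeros(Y.shape, dtype=float)
    for Om in Omegas:
        for Pi in Pis:
            X_loc += Om @ denoise(Y, Om, Pi) @ Pi
    return X_loc

Omegas = block_projections(8, 3)
Pis = block_projections(6, 2)
```

With a trivial `denoise` that returns `Y` unchanged, the assembly returns `Y`, which is a quick check that the projections resolve the identity.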

5.2 Performance of localized denoising

The following results compare the behavior of the localized denoiser $\widehat{\mathbf{X}}^{\mathrm{loc}}$ to the optimal singular value shrinker $\widehat{\mathbf{X}}^{\mathrm{shr}}$ .

Theorem 5.1.

$\|\widehat{\mathbf{X}}^{\mathrm{loc}}-\mathbf{X}\|_{\mathrm{F}}^{2}\leq\|\widehat{\mathbf{X}}^{\mathrm{shr}}-\mathbf{X}\|_{\mathrm{F}}^{2}$ almost surely as $p,n\to\infty$ .

Theorem 5.2.

Suppose that either $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are weighted orthogonal with respect to all $\Omega_{i}$ , or $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are weighted orthogonal with respect to all $\Pi_{j}$ . Then almost surely as $p,n\to\infty$ , $\|\widehat{\mathbf{X}}^{\mathrm{loc}}-\mathbf{X}\|_{\mathrm{F}}^{2}\leq\|\widehat{\mathbf{X}}^{\mathrm{shr}}-\mathbf{X}\|_{\mathrm{F}}^{2}-\xi,$ where $\xi\geq 0$ , and $\xi>0$ if some $\mathbf{u}_{k}$ is heterogeneous with respect to some $\Omega_{i}$ or some $\mathbf{v}_{k}$ is heterogeneous with respect to some $\Pi_{j}$ .

In other words, unless all the $\mathbf{u}_{k}$ are generic with respect to all of the $\Omega_{i}$ and all the $\mathbf{v}_{k}$ are generic with respect to all of the $\Pi_{j}$ , localized denoising will outperform singular value shrinkage asymptotically. The proofs of Theorems 5.1 and 5.2 are found in Section F and Section G, respectively.

Remark 12.

The weighted orthogonality condition of Theorem 5.2 will hold if the columns of $\mathbf{X}$ are drawn iid from a sufficiently well-behaved distribution in $\mathbb{R}^{p}$ .

Remark 13.

To apply Theorem 5.2, the user must select projection matrices $\Omega_{i}$ and $\Pi_{j}$ with respect to which the singular vectors of $\mathbf{X}$ are heterogeneous. Datasets are often drawn from different experimental regimes in genetic microarray experiments [27, 41, 50], single-cell RNA processing [52, 55], and medical imaging [35]. In these settings, it is known a priori that signal components will likely be heterogeneous across the different subpopulations, and localized shrinkage is a natural tool.

Remark 14.

Theorem 5.1 guarantees that even if the projection matrices $\Omega_{i}$ and $\Pi_{j}$ are not chosen judiciously (see Remark 13), the asymptotic performance of localized denoising is never worse than singular value shrinkage. In practice, localized denoising requires estimating more parameters than does shrinkage, and if $I$ and $J$ are sizeable relative to $p$ and $n$ its performance might be worse due to finite sample fluctuations in these parameter estimates, especially when the singular vectors of $\mathbf{X}$ do not exhibit strong heterogeneity with respect to the projections. For such an example, see Section 7.1, and specifically Remark 21. Though a detailed analysis of this topic is beyond the scope of the present work, in practice the user can compare these trade-offs via simulation to determine if localized denoising is appropriate for their problem size and the expected level of heterogeneity with respect to the projections.

6 Applications of weighted denoising

In this section, we describe three applications of weighted loss functions: submatrix denoising, denoising with heteroscedastic noise, and denoising with missing data. In these problems, we estimate a low-rank matrix with respect to unweighted Frobenius loss. However, an intermediate step of the estimation procedure requires the use of a weighted loss function.

6.1 Submatrix denoising

We suppose we observe a data matrix $\mathbf{Y}=\mathbf{X}+\mathbf{G}$ , but our goal is to estimate only a $p_{0}$ -by- $n_{0}$ submatrix of $\mathbf{X}$ , where $p_{0}/p\sim\mu$ and $n_{0}/n\sim\nu$ . Denoting by $\Omega\in\mathbb{R}^{p_{0}\times p}$ the coordinate selection operator for the $p_{0}$ rows of the submatrix, and $\Pi\in\mathbb{R}^{n_{0}\times n}$ the coordinate selection operator for the $n_{0}$ columns of the submatrix, we may write the target submatrix as $\mathbf{X}_{0}=\Omega\mathbf{X}\Pi^{T}$ .

One approach is to estimate the entire matrix $\mathbf{X}$ with respect to the weighted loss $\mathcal{L}(\widehat{\mathbf{X}},\mathbf{X})=\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}^{2}.$ This loss only penalizes errors in the $p_{0}$ rows and $n_{0}$ columns of $\mathbf{X}_{0}$ . If $\widehat{\mathbf{X}}$ denotes the optimal spectral denoiser minimizing $\mathcal{L}(\widehat{\mathbf{X}},\mathbf{X})$ , we define our estimator $\widehat{\mathbf{X}}_{0}=\Omega\widehat{\mathbf{X}}\Pi^{T}$ . The method is summarized in Algorithm 3.

Another natural approach is to simply ignore the $p-p_{0}$ rows and $n-n_{0}$ columns outside of $\mathbf{X}_{0}$ , and denoise $\mathbf{X}_{0}$ directly by applying optimal singular value shrinkage to the matrix $\mathbf{Y}_{0}=\mathbf{X}_{0}+\mathbf{G}_{0}$ (defining $\mathbf{G}_{0}=\Omega\mathbf{G}\Pi^{T}$ ). We let $\widehat{\mathbf{X}}_{0}^{\mathrm{shr}}$ denote this estimator.

In the following result, we make the same assumptions on $\Omega$ and $\Pi$ as in Section 2.1. Note that $p_{0}=\text{tr}(\Omega^{T}\Omega)$ , and $n_{0}=\text{tr}(\Pi^{T}\Pi)$ .
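The coordinate selection operators and the weighted loss of this section are easy to check numerically. The sketch below uses small, arbitrary index sets; the helper name `selection_operator` and the sizes are illustrative choices only.

```python
import numpy as np

def selection_operator(indices, dim):
    """Rows of the identity picking out `indices`; shape (len(indices), dim)."""
    return np.eye(dim)[np.asarray(indices)]

rng = np.random.default_rng(0)
p, n = 6, 8
rows, cols = [0, 2, 5], [1, 3, 4, 7]
Omega = selection_operator(rows, p)          # p0-by-p
Pi = selection_operator(cols, n)             # n0-by-n

X = rng.standard_normal((p, n))
X_hat = X + 0.1 * rng.standard_normal((p, n))

X0 = Omega @ X @ Pi.T                        # target submatrix
weighted = np.linalg.norm(Omega @ (X_hat - X) @ Pi.T, 'fro')**2
direct = np.linalg.norm((X_hat - X)[np.ix_(rows, cols)], 'fro')**2
```

Here `weighted` and `direct` agree, which confirms that the weighted loss only sees errors inside the selected rows and columns.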

Proposition 6.1.

Suppose $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are weighted orthogonal with respect to $\Omega^{T}\Omega$ , and $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are weighted orthogonal with respect to $\Pi^{T}\Pi$ . Suppose $\alpha_{k}<\sqrt{\mu}$ and $\beta_{k}<\sqrt{\nu},$ for $1\leq k\leq r$ . Then $\|\widehat{\mathbf{X}}_{0}-\mathbf{X}_{0}\|_{\mathrm{F}}^{2}<\|\widehat{\mathbf{X}}_{0}^{\mathrm{shr}}-\mathbf{X}_{0}\|_{\mathrm{F}}^{2},$ where the strict inequality holds almost surely in the limit $p,n\to\infty$ .

The proof of Proposition 6.1 is found in Section H.

Remark 15.

If $\mathbf{u}_{k}$ and $\mathbf{v}_{k}$ are generic with respect to $\Omega^{T}\Omega$ and $\Pi^{T}\Pi$ , respectively, then $\alpha_{k}=\mu$ and $\beta_{k}=\nu$ . Proposition 6.1 requires the much weaker condition that $\alpha_{k}<\sqrt{\mu}$ and $\beta_{k}<\sqrt{\nu}$ (note that $\mu\leq\sqrt{\mu}$ and $\nu\leq\sqrt{\nu}$ ). Informally, even if the fraction of the signal’s energy contained in $\mathbf{X}_{0}$ is disproportionately large, it still pays to denoise $\mathbf{X}_{0}$ using the entire observed matrix $\mathbf{Y}$ , rather than the submatrix $\mathbf{Y}_{0}$ alone.

It will follow from the proof of Proposition 6.1 that if $\alpha_{k}<\sqrt{\mu}$ and $\beta_{k}<\sqrt{\nu}$ , then the singular vectors of $\mathbf{X}_{0}$ are better approximated by computing the singular vectors of $\mathbf{Y}$ and projecting onto the images of $\Omega$ and $\Pi$ , respectively, rather than computing the singular vectors of the submatrix $\mathbf{Y}_{0}$ itself. More precisely, we will show that $\mathbf{u}_{k}^{0}=\frac{\Omega\mathbf{u}_{k}}{\|\Omega\mathbf{u}_{k}\|}$ and $\mathbf{v}_{k}^{0}=\frac{\Pi\mathbf{v}_{k}}{\|\Pi\mathbf{v}_{k}\|}$ are the singular vectors of $\mathbf{X}_{0}$ , and that the vectors $\hat{\mathbf{u}}_{k}^{\omega}=\frac{\Omega\hat{\mathbf{u}}_{k}}{\|\Omega\hat{\mathbf{u}}_{k}\|}$ and $\hat{\mathbf{v}}_{k}^{\omega}=\frac{\Pi\hat{\mathbf{v}}_{k}}{\|\Pi\hat{\mathbf{v}}_{k}\|}$ are better correlated with $\mathbf{u}_{k}^{0}$ and $\mathbf{v}_{k}^{0}$ , respectively, than are the singular vectors of $\mathbf{Y}_{0}$ .

6.2 Doubly-heteroscedastic noise

We consider estimating a low-rank matrix $\mathbf{X}$ from an observed matrix $\mathbf{Y}=\mathbf{X}+\mathbf{N}$ , where $\mathbf{N}$ is a noise matrix of the form $\mathbf{N}=\mathbf{S}^{1/2}\mathbf{G}\mathbf{T}^{1/2}$ , $\mathbf{G}$ has iid entries with distribution $N(0,1/n)$ , and $\mathbf{S}\in\mathbb{R}^{p\times p}$ and $\mathbf{T}\in\mathbb{R}^{n\times n}$ are positive-definite matrices. We assume the eigenvalues of $\mathbf{S}=\mathbf{S}_{p}$ and $\mathbf{T}=\mathbf{T}_{n}$ remain in an interval $[a,b]$ for all $p$ and $n$ , where $a>0$ and $b<\infty$ are fixed independently of $p$ and $n$ . We refer to the matrix $\mathbf{N}$ as doubly-heteroscedastic noise.

Remark 16.

This noise model generalizes two previous models of heteroscedastic noise in the context of principal component analysis [22, 23, 24, 58, 40]. In both, the matrix $\mathbf{X}$ consists of random, iid signal vectors $X_{1},\dots,X_{n}$ of the form $X_{j}=\sum_{k=1}^{r}\ell_{k}^{1/2}z_{jk}\mathbf{u}_{k},$ where $\ell_{1}>\dots>\ell_{r}>0$ , $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are orthonormal vectors, and the $z_{jk}$ are iid random variables with variance $1$ and mean $0$ . The model from [58] and [40] takes $\mathbf{T}=\mathbf{I}_{n}$ , in which case the observations are of the form $Y_{j}=X_{j}+\mathbf{S}^{1/2}G_{j}$ , where $G_{j}\sim N(\mathbf{0},\mathbf{I}_{p})$ . By contrast, the papers [22, 23, 24] take $\mathbf{S}=\mathbf{I}_{p}$ , in which case the observations are of the form $Y_{j}=X_{j}+b_{j}^{1/2}G_{j}$ . The doubly-heteroscedastic noise model takes $Y_{j}=X_{j}+b_{j}^{1/2}\mathbf{S}^{1/2}G_{j}$ , which generalizes both these models.

We consider the following three-step procedure. First, we whiten the noise, replacing $\mathbf{Y}$ by $\widetilde{\mathbf{Y}}$ defined by $\widetilde{\mathbf{Y}}=\mathbf{S}^{-1/2}\mathbf{Y}\mathbf{T}^{-1/2}.$ We may write $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{X}}+\mathbf{G}$ , where $\widetilde{\mathbf{X}}=\mathbf{S}^{-1/2}\mathbf{X}\mathbf{T}^{-1/2}$ and $\mathbf{G}$ has iid $N(0,1/n)$ entries. Next, we apply a denoiser to $\widetilde{\mathbf{Y}}$ to estimate $\widetilde{\mathbf{X}}$ ; we denote this by $\psi(\widetilde{\mathbf{Y}})$ , for $\psi$ tailored to removing white noise. Finally, we unwhiten $\psi(\widetilde{\mathbf{Y}})$ to obtain our final estimate $\widehat{\mathbf{X}}=\mathbf{S}^{1/2}\psi(\widetilde{\mathbf{Y}})\mathbf{T}^{1/2}$ of $\mathbf{X}$ .

The Frobenius loss between $\widehat{\mathbf{X}}$ and $\mathbf{X}$ may be written as follows:

[TABLE]

which is a weighted loss between $\psi(\widetilde{\mathbf{Y}})$ and $\widetilde{\mathbf{X}}$ , with weights $\mathbf{S}^{1/2}$ and $\mathbf{T}^{1/2}$ . The denoiser $\psi(\widetilde{\mathbf{Y}})$ should be chosen to minimize this weighted Frobenius loss. The procedure is summarized in Algorithm 4, where $\psi$ is taken to be optimal spectral denoising.
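The three steps, and the equality between the Frobenius loss on $\mathbf{X}$ and the weighted loss on $\widetilde{\mathbf{X}}$, can be checked with a short sketch for diagonal $\mathbf{S}$ and $\mathbf{T}$. The denoiser used here is plain rank-$r$ truncation, chosen only as a stand-in for $\psi$; it is not the optimal spectral denoiser, and all function names and sizes are hypothetical.

```python
import numpy as np

def truncated_svd(Y, r):
    """Stand-in for psi: keep the top r singular components (not the optimal denoiser)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def whiten_denoise_unwhiten(Y, s_diag, t_diag, psi):
    """Whiten, denoise, unwhiten for S = diag(s_diag), T = diag(t_diag)."""
    rs, rt = np.sqrt(s_diag)[:, None], np.sqrt(t_diag)[None, :]
    Y_w = Y / rs / rt            # S^{-1/2} Y T^{-1/2}
    Z = psi(Y_w)                 # estimate of the whitened signal
    return rs * Z * rt, Z        # S^{1/2} Z T^{1/2}, and Z itself

rng = np.random.default_rng(4)
p, n, r = 40, 60, 2
s_diag = np.linspace(0.5, 2.0, p)
t_diag = np.linspace(0.5, 2.0, n)
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, n)) / np.sqrt(n)
G = rng.standard_normal((p, n)) / np.sqrt(n)
Y = X + np.sqrt(s_diag)[:, None] * G * np.sqrt(t_diag)[None, :]

X_hat, Z = whiten_denoise_unwhiten(Y, s_diag, t_diag, lambda M: truncated_svd(M, r))
X_w = X / np.sqrt(s_diag)[:, None] / np.sqrt(t_diag)[None, :]

lhs = np.linalg.norm(X_hat - X, 'fro')**2
rhs = np.linalg.norm(
    np.sqrt(s_diag)[:, None] * (Z - X_w) * np.sqrt(t_diag)[None, :], 'fro')**2
```

The two quantities `lhs` and `rhs` coincide for any choice of `psi`, which is why the denoiser applied to the whitened matrix should target the weighted loss.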

Remark 17.

The procedure of whitening, denoising, and unwhitening has been employed in recent papers on the spiked model; see, for instance, [42, 40, 16]. In particular, [40] shows several advantages of working with the whitened matrix when the noise is one-sided, such as improved estimation of the singular vectors of $\mathbf{X}$ . By contrast, the paper [24] shows that whitening is suboptimal in certain settings.

6.2.1 Estimating $\mathbf{S}$ and $\mathbf{T}$

The signal/noise decomposition $\mathbf{Y}=\mathbf{X}+\mathbf{N}$ is obviously not well-defined unless the user possesses some additional knowledge about the noise matrix $\mathbf{N}=\mathbf{S}^{1/2}\mathbf{G}\mathbf{T}^{1/2}$ . While a detailed treatment of this problem is outside the scope of this paper, we observe that in the large $p$ , large $n$ asymptotic limit, the matrices $\mathbf{S}=\mathbf{S}_{p}$ and $\mathbf{T}=\mathbf{T}_{n}$ may be replaced by estimators $\widehat{\mathbf{S}}=\widehat{\mathbf{S}}_{p}$ and $\widehat{\mathbf{T}}=\widehat{\mathbf{T}}_{n}$ consistent in operator norm; that is, almost surely

[TABLE]

Remark 18.

The matrices $\mathbf{S}$ and $\mathbf{T}$ may be replaced by, respectively, $\theta\mathbf{S}$ and $\mathbf{T}/\theta$ for any $\theta>0$ . Without loss of generality, we may therefore assume that $\text{tr}(\mathbf{T})/n=1$ .

The next result describes a simple method for estimating $\mathbf{S}$ and $\mathbf{T}$ consistently in operator norm when both are diagonal and the singular vectors of $\mathbf{X}$ are delocalized.

Proposition 6.2.

Suppose $\max_{1\leq k\leq r}\|\mathbf{u}_{k}\|_{\infty}\|\mathbf{v}_{k}\|_{\infty}=o(n^{-1/2})$ , $\mathbf{S}=\mathrm{diag}(a_{1},\dots,a_{p})$ and $\mathbf{T}=\mathrm{diag}(b_{1},\dots,b_{n})$ . For $1\leq i\leq p$ and $1\leq j\leq n$ , define the estimators

[TABLE]

and let $\widehat{\mathbf{S}}=\widehat{\mathbf{S}}_{p}=\mathrm{diag}(\hat{a}_{1},\dots,\hat{a}_{p})$ and $\widehat{\mathbf{T}}=\widehat{\mathbf{T}}_{n}=\mathrm{diag}(\hat{b}_{1},\dots,\hat{b}_{n})$ . Then $\widehat{\mathbf{S}}$ and $\widehat{\mathbf{T}}$ are consistent estimators of $\mathbf{S}$ and $\mathbf{T}$ , respectively; that is, (28) holds almost surely.

The proof of Proposition 6.2 may be found in Section I.

Remark 19.

The values $\hat{a}_{i}$ in (29) are the sample variances of the rows of $\sqrt{n}\mathbf{Y}$ . Normalizing $\mathbf{Y}$ by $\widehat{\mathbf{S}}^{1/2}$ is then an instance of standardization of the rows, a commonly used method in principal component analysis [30].

Remark 20.

The estimates $\hat{a}_{i}$ and $\hat{b}_{j}$ capture the variation of both the noise and the signal. Proposition 6.2 states that if the signal is delocalized, then in the large $p$ , large $n$ limit its contribution becomes negligible. However, for finite $p$ and $n$ , $\widehat{\mathbf{S}}$ and $\widehat{\mathbf{T}}$ will still see the effects of the signal, and may not be good estimates of $\mathbf{S}$ and $\mathbf{T}$ . The experiment in Section 7.3 compares the use of the true $\mathbf{S}$ and $\mathbf{T}$ to their estimates.
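A small simulation illustrates both the idea and the finite-sample caveat. The estimators below are one natural moment-based choice under the normalization $\text{tr}(\mathbf{T})/n=1$ of Remark 18: squared row norms of $\mathbf{Y}$ for $\hat{a}_{i}$, and rescaled squared column norms for $\hat{b}_{j}$. They are an assumption of this sketch and are not claimed to match (29) term by term; the function name, sizes, and signal strength are likewise illustrative.

```python
import numpy as np

def estimate_noise_profiles(Y):
    """Moment-based estimates of diagonal S and T, normalized so mean(b_hat) = 1.

    Assumed form: a_hat_i = sum_j Y_ij^2, b_hat_j = n * sum_i Y_ij^2 / ||Y||_F^2.
    """
    n = Y.shape[1]
    sq = Y**2
    a_hat = sq.sum(axis=1)                   # squared row norms
    b_hat = n * sq.sum(axis=0) / sq.sum()    # squared column norms, rescaled
    return a_hat, b_hat

rng = np.random.default_rng(3)
p, n = 200, 2000
a = np.linspace(0.5, 1.0, p)
b = np.linspace(0.5, 1.5, n)
b /= b.mean()                                # enforce tr(T)/n = 1

# Delocalized rank-one signal plus doubly-heteroscedastic noise.
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
G = rng.standard_normal((p, n)) / np.sqrt(n)
Y = 3.0 * np.outer(u, v) + np.sqrt(a)[:, None] * G * np.sqrt(b)[None, :]

a_hat, b_hat = estimate_noise_profiles(Y)
rel_a = np.mean(np.abs(a_hat - a) / a)       # mean relative error for S
rel_b = np.mean(np.abs(b_hat - b) / b)       # mean relative error for T
```

At these sizes the relative errors are small but not negligible, partly because the estimates absorb some of the signal's energy, as noted in Remark 20.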

6.2.2 Whitening increases the SNR for generic signal matrices

We show that the whitening transformation increases a natural signal-to-noise ratio of the observed matrix. We will assume throughout this section that the $\mathbf{u}_{k}$ (respectively, $\mathbf{v}_{k}$ ) are generic with respect to $\mathbf{S}$ (respectively, $\mathbf{T}$ ), and that they satisfy the pairwise orthogonality condition with respect to $\mathbf{S}$ (respectively $\mathbf{T}$ ). Writing the SVD of $\mathbf{X}$ as $\mathbf{X}=\sum_{k=1}^{r}t_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{T},$ we define the signal-to-noise ratio (SNR) for component $k$ of $\mathbf{X}$ :

[TABLE]

which is the ratio of the squared operator norm of the component $t_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{T}$ of $\mathbf{X}$ and the squared operator norm of the noise.

Whitening turns $\mathbf{Y}$ into $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{X}}+\mathbf{G}$ , with $\widetilde{\mathbf{X}}=\sum_{k=1}^{r}t_{k}(\mathbf{S}^{-1/2}\mathbf{u}_{k})(\mathbf{T}^{-1/2}\mathbf{v}_{k})^{T}=\sum_{k=1}^{r}\tilde{t}_{k}\tilde{\mathbf{u}}_{k}\tilde{\mathbf{v}}_{k}^{T},$ where $\tilde{t}_{k}=t_{k}\|\mathbf{S}^{-1/2}\mathbf{u}_{k}\|\|\mathbf{T}^{-1/2}\mathbf{v}_{k}\|$ , $\tilde{\mathbf{u}}_{k}=\mathbf{S}^{-1/2}\mathbf{u}_{k}/\|\mathbf{S}^{-1/2}\mathbf{u}_{k}\|$ , and $\tilde{\mathbf{v}}_{k}=\mathbf{T}^{-1/2}\mathbf{v}_{k}/\|\mathbf{T}^{-1/2}\mathbf{v}_{k}\|$ . The SNR after whitening is then:

[TABLE]

Define

[TABLE]

Note that from Jensen’s inequality, $\tau\geq 1$ , and $\tau>1$ if either $\mathbf{S}$ or $\mathbf{T}$ is not a multiple of the identity. The following result extends an analogous finding from [40]:

Proposition 6.3.

Suppose $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ are generic and weighted orthogonal with respect to $\mathbf{S}$ , and $\mathbf{v}_{1},\dots,\mathbf{v}_{r}$ are generic and weighted orthogonal with respect to $\mathbf{T}$ . Then $\widetilde{\mathrm{SNR}}_{k}\geq\tau\mathrm{SNR}_{k},$ $1\leq k\leq r$ , almost surely as $p,n\to\infty$ . In particular, $\widetilde{\mathrm{SNR}}_{k}$ is larger than $\mathrm{SNR}_{k}$ if either $\mathbf{S}$ or $\mathbf{T}$ is not a multiple of the identity.

In other words, the SNR increases after whitening the noise by at least a factor of $\tau$ ; energy is transferred from the noise component to the signal component. The proof of Proposition 6.3, which extends an analogous result in [40], is in Section J.
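The effect can be observed in a small simulation. The sketch below uses the covariance model of Section 7.3, with eigenvalues of $\mathbf{S}$ and $\mathbf{T}$ linearly spaced between $1/\kappa$ and $1$, a single random (hence generic) component, and the SNR taken as the squared singular value of the component divided by the squared operator norm of the noise. The sizes, the value of $\kappa$, and the signal strength are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, kappa, t = 100, 200, 4.0, 3.0
s_diag = np.linspace(1.0 / kappa, 1.0, p)    # eigenvalues of S
t_diag = np.linspace(1.0 / kappa, 1.0, n)    # eigenvalues of T

u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
G = rng.standard_normal((p, n)) / np.sqrt(n)
N = np.sqrt(s_diag)[:, None] * G * np.sqrt(t_diag)[None, :]   # S^{1/2} G T^{1/2}

# SNR before whitening: t^2 / ||N||_op^2.
snr = t**2 / np.linalg.norm(N, 2)**2

# After whitening, the component has singular value
# t * ||S^{-1/2} u|| * ||T^{-1/2} v|| and the noise is G.
t_w = t * np.linalg.norm(u / np.sqrt(s_diag)) * np.linalg.norm(v / np.sqrt(t_diag))
snr_w = t_w**2 / np.linalg.norm(G, 2)**2

gain = snr_w / snr
```

In this particular model the gain exceeds $1$ for every draw, because all eigenvalues of $\mathbf{S}$ and $\mathbf{T}$ are at most $1$; the proposition concerns the asymptotic gain for generic signals more generally.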

6.3 Matrices with missing/unobserved values

We consider the setting where $\mathbf{X}$ is a low-rank target matrix we wish to recover and $\mathbf{G}$ is a matrix of iid Gaussian $N(0,1)$ entries, but rather than observe $\mathbf{X}+\mathbf{G}$ , we observe only some subset of the entries. The problem of estimating a matrix from a subset of its entries is known as matrix completion [48, 32, 25, 13, 14, 12, 33, 31, 11, 16, 34, 44, 53].

In this section, we will adopt a heterogeneous, rank 1 sampling model, as in [14]. We suppose that the rows and columns are sampled independently, with row $i$ sampled with probability $q_{i}^{r}$ , and column $j$ sampled with probability $q_{j}^{c}$ ; that is, entry $(i,j)$ of $\mathbf{X}+\mathbf{G}$ is sampled with probability $q_{i}^{r}q_{j}^{c}$ . We observe the vector $\mathbf{y}=\mathcal{F}(\mathbf{X}+\mathbf{G})$ of $M$ sampled entries, where $\mathcal{F}:\mathbb{R}^{p\times n}\to\mathbb{R}^{M}$ is the subsampling operator.

Following the approach from [16], we consider the backprojected matrix $\mathbf{Y}=\mathcal{F}^{*}(\mathbf{y})\in\mathbb{R}^{p\times n}$ , in which the unobserved entries are replaced by zeros. We write $\mathbf{Y}=\mathcal{F}^{*}(\mathcal{F}(\mathbf{X}))+\mathcal{F}^{*}(\mathcal{F}(\mathbf{G}))$ . We show that asymptotically, $\mathcal{F}^{*}(\mathcal{F}(\mathbf{X}))$ behaves like the matrix $\mathbf{P}\mathbf{X}\mathbf{Q}$ , where $\mathbf{P}=\mathrm{diag}(q_{1}^{r},\dots,q_{p}^{r})$ and $\mathbf{Q}=\mathrm{diag}(q_{1}^{c},\dots,q_{n}^{c})$ are the diagonal matrices of row and column sampling probabilities. More precisely, we have the following result:

Proposition 6.4.

Suppose $\max_{1\leq k\leq r}\|\mathbf{u}_{k}\|_{\infty}\|\mathbf{v}_{k}\|_{\infty}=o(n^{-1/2})$ . Then in the limit $p/n\to\gamma$ , $\|\mathcal{F}^{*}(\mathcal{F}(\mathbf{X}))-\mathbf{P}\mathbf{X}\mathbf{Q}\|_{\mathrm{op}}\to 0$ almost surely.

The proof of Proposition 6.4 may be found in Section K. It is a straightforward generalization of the analogous one-sided result in [16].

Let $\mathbf{N}=\mathcal{F}^{*}(\mathcal{F}(\mathbf{G}))$ . Writing $\mathbf{N}=(N_{ij})$ , we have $N_{ij}=\delta_{ij}G_{ij}$ , where $\delta_{ij}$ is $1$ if entry $(i,j)$ is sampled, and $0$ otherwise. Then $N_{ij}$ has variance $q_{i}^{r}q_{j}^{c}$ . Consequently, we can whiten the noise by applying $\mathbf{P}^{-1/2}$ and $\mathbf{Q}^{-1/2}$ ; Proposition 6.3 suggests this will improve estimation of the matrix. To that end, we define $\widetilde{\mathbf{Y}}=\mathbf{P}^{-1/2}\mathbf{Y}\mathbf{Q}^{-1/2}=\widetilde{\mathbf{X}}+\widetilde{\mathbf{G}},$ where $\widetilde{\mathbf{X}}=\mathbf{P}^{-1/2}\mathcal{F}^{*}(\mathcal{F}(\mathbf{X}))\mathbf{Q}^{-1/2}$ , and $\widetilde{\mathbf{G}}=\mathbf{P}^{-1/2}\mathbf{N}\mathbf{Q}^{-1/2}$ . Then $\widetilde{\mathbf{G}}$ is a random matrix where each entry has mean zero and variance $1$ .

From Proposition 6.4, asymptotically the matrix $\widetilde{\mathbf{X}}$ behaves like $\mathbf{P}^{1/2}\mathbf{X}\mathbf{Q}^{1/2}$ , and so denoising $\widetilde{\mathbf{Y}}$ produces an estimate $\psi(\widetilde{\mathbf{Y}})$ of $\mathbf{P}^{1/2}\mathbf{X}\mathbf{Q}^{1/2}$ . To estimate $\mathbf{X}$ we should denoise $\widetilde{\mathbf{Y}}$ with respect to the weighted loss function $\mathcal{L}(\widehat{\mathbf{X}},\mathbf{X})=\|\mathbf{P}^{-1/2}(\widehat{\mathbf{X}}-\mathbf{X})\mathbf{Q}^{-1/2}\|_{\mathrm{F}}^{2},$ with weight matrices $\mathbf{P}^{-1/2}$ and $\mathbf{Q}^{-1/2}$ . We then apply $\mathbf{P}^{-1/2}$ and $\mathbf{Q}^{-1/2}$ to the resulting matrix, to obtain an estimator of $\mathbf{X}$ itself. The method is summarized in Algorithm 5, where $\psi$ is the optimal spectral denoiser.
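The sampling model and the whitening of the backprojected noise can be simulated directly. The sketch below draws an independent Bernoulli mask with probabilities $q_{i}^{r}q_{j}^{c}$, using the sampling probabilities of Section 7.4 (equispaced between $0.3$ and $0.7$), zero-fills the unobserved noise entries, and rescales by $\mathbf{P}^{-1/2}$ and $\mathbf{Q}^{-1/2}$. Independence of the entrywise sampling indicators is an assumption of this sketch, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 200, 400
q_row = np.linspace(0.3, 0.7, p)             # diagonal of P
q_col = np.linspace(0.3, 0.7, n)             # diagonal of Q

G = rng.standard_normal((p, n))              # N(0, 1) noise, as in this section
mask = rng.random((p, n)) < np.outer(q_row, q_col)   # entry (i, j) kept w.p. q_i^r q_j^c
N = mask * G                                 # backprojected noise: unobserved entries are 0

# Whitening: P^{-1/2} N Q^{-1/2} for diagonal P and Q.
N_w = N / np.sqrt(q_row)[:, None] / np.sqrt(q_col)[None, :]

mean_sq_raw = np.mean(N**2)                  # close to the average of q_i^r q_j^c
mean_sq_white = np.mean(N_w**2)              # close to 1 after whitening
```

Before whitening, the average squared entry tracks the average sampling probability; after whitening it is close to $1$, matching the stated variance of the entries of $\widetilde{\mathbf{G}}$.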

7 Numerical results

In this section, we report on numerical simulations demonstrating the performance of the algorithms from this paper.

7.1 Localized denoising

We evaluate the performance of localized denoising (Algorithm 2). We generate a “checkerboard” signal matrix $\mathbf{X}$ of size $p$ -by- $n$ , $p=n=800$ , shown in the top left panel of Figure 4. Each light square has the same value, as does each dark square. For a specified number $f\in[1/2,1]$ , the total energy of the light squares is $f\times 100\%$ of the total energy of $\mathbf{X}$ . The Frobenius norm of $\mathbf{X}$ is normalized to be $1$ . Whenever $f>1/2$ , $\mathbf{X}$ is rank $2$ ; when $f=1/2$ , $\mathbf{X}$ has constant value and is rank $1$ . We add a matrix $\mathbf{G}$ of Gaussian noise, whose entries have standard deviation $1/(10\sqrt{n})$ .
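A matrix with the stated properties can be generated as follows. The sketch assumes a $4\times 4$ arrangement of equal squares and positive values on both light and dark squares; neither detail is specified above, and the function name is hypothetical. Because light and dark squares cover the same number of entries, fixing the light energy at $f$ and the total energy at $1$ determines the two values.

```python
import numpy as np

def checkerboard(p, n, f, blocks=4):
    """Checkerboard matrix with unit Frobenius norm and light-square energy f.

    Assumes a blocks-by-blocks grid of equal squares (blocks=4 is a guess)
    and nonnegative values on both colors.
    """
    row_par = (np.arange(p) * blocks // p) % 2     # parity of the row block index
    col_par = (np.arange(n) * blocks // n) % 2     # parity of the column block index
    light = (row_par[:, None] + col_par[None, :]) % 2 == 0
    a = np.sqrt(f / light.sum())                   # value on light squares
    b = np.sqrt((1.0 - f) / (~light).sum())        # value on dark squares
    return np.where(light, a, b), light

X, light = checkerboard(80, 80, 0.7)
X_flat, _ = checkerboard(80, 80, 0.5)
```

For $f=0.7$ the matrix has rank $2$, and for $f=1/2$ it is constant and has rank $1$, in agreement with the description above.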

We estimate $\mathbf{X}$ from $\mathbf{Y}$ using two methods: singular value shrinkage [20, 51] and localized denoising. Localized denoising is applied with row projection matrices $\Omega_{i}$ , $i=1,2,3,4$ , that project onto equispaced blocks of rows, and column projection matrices $\Pi_{i}$ , $i=1,2,3,4$ , that project onto equispaced blocks of columns. For each $f$ , the experiment is repeated $50$ times; the $\log_{2}$ mean errors are plotted in Figure 3.

As $f$ increases, localized denoising outperforms singular value shrinkage more dramatically. This is because localized denoising uses a priori knowledge of $\mathbf{X}$ ’s block structure, which becomes more pronounced as $f$ grows. Figure 4 shows an example of images of the true matrix $\mathbf{X}$ , the noisy matrix $\mathbf{Y}$ , and the two denoised matrices, when $f=0.7$ . In this example, the relative error $\|\widehat{\mathbf{X}}^{\mathrm{loc}}-\mathbf{X}\|_{\mathrm{F}}/\|\mathbf{X}\|_{\mathrm{F}}$ of localized denoising is approximately $1.40\times 10^{-1}$ , whereas the shrinkage error is $1.92\times 10^{-1}$ .

Remark 21.

The error curves in Figure 3 appear nearly identical when $f\lesssim 0.6$ , after which localized denoising begins to outperform singular value shrinkage. This is because for small values of $f$ the smallest singular value of $\mathbf{X}$ is not detectable, and so both methods treat the matrix as a constant, rank $1$ matrix. Though not apparent from the plot, when $f\leq 0.55$ the performance of singular value shrinkage is slightly better than localized denoising, due to finite sample fluctuations (see Remark 14). For example, when $f=0.51$ , the mean relative error of localized denoising is approximately $1.4118\times 10^{-1}$ , while that of shrinkage is approximately $1.4112\times 10^{-1}$ .

7.2 Submatrix denoising

We evaluate the performance of spectral denoising for estimating a submatrix $\mathbf{X}_{0}$ contained within a larger matrix $\mathbf{X}$ (Algorithm 3). We generate a rank 1 signal matrix $\mathbf{X}$ of size $p$ -by- $n$ , $p=500$ , $n=1000$ , with singular value $\gamma^{1/4}+1/2$ , where $\gamma=p/n=1/2$ . For a specified $f\in(0,1)$ , the left singular vector $\mathbf{u}=\mathbf{u}_{1}$ of $\mathbf{X}$ is piecewise constant on the two sets of coordinates $\{1,\dots,p/2\}$ and $\{p/2+1,\dots,p\}$ ; the values are such that the energy of $\mathbf{u}$ on coordinates $\{1,\dots,p/2\}$ is equal to $\sqrt{f}$ . Similarly, the right singular vector $\mathbf{v}=\mathbf{v}_{1}$ of $\mathbf{X}$ is piecewise constant on the two sets of coordinates $\{1,\dots,n/2\}$ and $\{n/2+1,\dots,n\}$ , with values such that the energy of $\mathbf{v}$ on $\{1,\dots,n/2\}$ is also equal to $\sqrt{f}$ . Denoting by $\mathbf{X}_{0}$ the $p/2$ -by- $n/2$ upper-left submatrix of $\mathbf{X}$ , $f=\|\mathbf{X}_{0}\|_{\mathrm{F}}^{2}/\|\mathbf{X}\|_{\mathrm{F}}^{2}$ .
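The signal construction can be sketched as follows, reading "energy" as squared Euclidean norm and taking $\gamma=p/n$. The helper name and the value $f=0.3$ are illustrative. Since the energies of $\mathbf{u}$ and $\mathbf{v}$ on their first halves are each $\sqrt{f}$, the upper-left submatrix carries a fraction $\sqrt{f}\cdot\sqrt{f}=f$ of the total energy.

```python
import numpy as np

def two_level_vector(dim, energy_first_half):
    """Unit vector, constant on each half, with given squared norm on the first half."""
    half = dim // 2
    x = np.empty(dim)
    x[:half] = np.sqrt(energy_first_half / half)
    x[half:] = np.sqrt((1.0 - energy_first_half) / (dim - half))
    return x

p, n, f = 500, 1000, 0.3
gamma = p / n
t = gamma**0.25 + 0.5                        # population singular value

u = two_level_vector(p, np.sqrt(f))
v = two_level_vector(n, np.sqrt(f))
X = t * np.outer(u, v)                       # rank-one signal
X0 = X[: p // 2, : n // 2]                   # upper-left submatrix

energy_fraction = np.linalg.norm(X0, 'fro')**2 / np.linalg.norm(X, 'fro')**2
```

At $f=1/4$ both vectors are constant, which is the regime in which the singular vectors are generic with respect to the weights.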

The noise matrix has Gaussian entries with variance $1/n$ . We denoise the submatrix $\mathbf{X}_{0}$ using Algorithm 3; optimal singular value shrinkage [20, 51] on $\mathbf{X}_{0}$ alone (“submatrix shrinkage”); and optimal singular value shrinkage on $\mathbf{X}$ followed by projection onto the rows and columns of $\mathbf{X}_{0}$ (“global shrinkage”). For each $f$ , the experiment is repeated for $50$ draws. Figure 5 plots the $\log_{2}$ mean relative errors.

Optimal spectral denoising outperforms global shrinkage for all $f$ , since singular value shrinkage is an instance of spectral denoising and hence will not do better than the optimal spectral denoiser. Optimal spectral denoising and global shrinkage perform nearly identically when $f\approx 1/4$ , since in this regime the singular vectors of $\mathbf{X}$ are constant, and hence generic with respect to the weight matrices.

For small $f$ , the relative error of global shrinkage exceeds $1$ , since the submatrix’s energy is very small compared to the rest of the matrix. By contrast, optimal spectral denoising with weights $\Omega$ and $\Pi$ highlights the rows and columns in $\mathbf{X}_{0}$ .

Optimal spectral denoising outperforms submatrix shrinkage except when $f$ is close to $1$ . This is consistent with Proposition 6.1, which states that unless the energy of $\mathbf{X}$ ’s singular vectors is highly concentrated in the rows and columns of $\mathbf{X}_{0}$ , optimal spectral denoising will outperform singular value shrinkage on the submatrix.

Finally, optimal singular value shrinkage on the submatrix has relative error $1$ when $f$ is small. This is because singular value shrinkage on the submatrix only computes the SVD of $\mathbf{Y}_{0}$ , not $\mathbf{Y}$ ; when the energy in the submatrix $\mathbf{X}_{0}$ is too weak (i.e. $f$ is too small), no signal will be detected in the submatrix $\mathbf{Y}_{0}$ alone. By contrast, the singular values of the full matrix $\mathbf{Y}$ always reveal the presence of signal.

7.3 Doubly-heteroscedastic noise

We examine the performance of Algorithm 4. We generate a $p$ -by- $n$ signal matrix $\mathbf{X}$ , $p=1000$ , $n=2000$ , of rank $r=5$ , with singular values $t^{*}+1/2+k$ , $k=0,1,2,3,4$ , where $t^{*}$ is the smallest detectable singular value of $\mathbf{X}$ , evaluated using the method in [39]. Both the left and right singular vectors of $\mathbf{X}$ are random orthonormal vectors in $\mathbb{R}^{p}$ and $\mathbb{R}^{n}$ , respectively.

For specified $\kappa\geq 1$ , we generate row and column diagonal covariance matrices $\mathbf{S}$ and $\mathbf{T}$ , each with eigenvalues linearly spaced between $1/\kappa$ and $1$ . The noise matrix is $\mathbf{S}^{1/2}\mathbf{G}\mathbf{T}^{1/2}$ , where $\mathbf{G}$ has iid Gaussian entries with variance $1/n$ . We apply three denoising schemes: Algorithm 4 with the true $\mathbf{S}$ and $\mathbf{T}$ ; Algorithm 4 with $\mathbf{S}$ and $\mathbf{T}$ estimated using the method described in Proposition 6.2; and OptShrink [43]. The experiment is repeated $50$ times for each value of $\kappa$ .

Figure 6 shows the $\log_{2}$ mean relative errors of each method as a function of $\log_{2}(\kappa)$ . For this model of $\mathbf{S}$ and $\mathbf{T}$ , the condition number $\kappa$ is an increasing function of the parameter $\tau$ from Section 6.2.2. Consequently, Proposition 6.3 suggests that whitening will improve the matrix SNR, and that the improvement should increase as $\kappa$ grows. This is precisely what Figure 6 demonstrates; optimal spectral denoising with whitening by $\mathbf{S}$ and $\mathbf{T}$ does indeed outperform OptShrink, and the performance gap grows with $\kappa$ . Using the estimated covariances, the performance is degraded but still outperforms OptShrink when $\kappa$ is large.

7.4 Missing data

We test spectral denoising for missing data (Algorithm 5) by comparing it to nuclear-norm regularized least-squares [11], which estimates $\mathbf{X}$ by:

[TABLE]

Here, $\|\cdot\|_{*}$ denotes the nuclear norm; $\mathcal{F}:\mathbb{R}^{p\times n}\to\mathbb{R}^{M}$ is the projection operator onto the $M$ observed samples; and $\mathbf{P}$ and $\mathbf{Q}$ are the diagonal matrices of sampling probabilities for rows and columns, respectively. We weight the nuclear norm by the square root of the sampling probabilities, as suggested in [14]. Following [11], we choose the parameter $\theta$ so that when $\mathbf{y}$ is pure noise, $\widehat{\mathbf{X}}^{\mathrm{nuc}}$ is set to zero. It follows from the KKT conditions [10] that this is equivalent to $\theta=\|\mathbf{P}^{-1/2}\mathcal{F}^{*}(\mathbf{y})\mathbf{Q}^{-1/2}\|_{\mathrm{op}},$ the operator norm (which is dual to the nuclear norm), and this is approximated by $1+\sqrt{\gamma}$ . We solve (33) using the algorithm in [26].

We generate a rank $r=5$ signal matrix $\mathbf{X}$ of size $p$ -by- $n$ , $p=200$ , $n=400$ , with singular values $\sqrt{\sqrt{\gamma}+200k}$ , $k=1,\dots,5$ , $\gamma=1/2$ . Both the left and right singular vectors of $\mathbf{X}$ are uniformly random. We add to $\mathbf{X}$ a Gaussian noise matrix $\mathbf{G}$ , where each entry has variance $\sigma^{2}/n$ for a specified value of $\sigma$ . $\mathbf{X}+\mathbf{G}$ is then subsampled with row and column sampling probabilities each equispaced between $0.3$ and $0.7$ . For each value of $\sigma$ , the experiment is repeated 50 times. Figure 7 displays the $\log_{2}$ mean relative errors. When $\sigma$ is large, spectral denoising is superior, whereas in the small $\sigma$ regime nuclear-norm regularized least-squares is better.

7.5 Non-Gaussian noise

The optimal spectral denoiser requires estimation of the weighted inner product matrices $\mathbf{D}$, $\widetilde{\mathbf{D}}$, $\mathbf{C}^{\omega}$ and $\widetilde{\mathbf{C}}^{\omega}$. The formulas for the entries of these matrices provided by Theorem 3.2 assume that the noise matrix $\mathbf{G}$ is Gaussian. However, it is of interest whether the same formulas may be applied to non-Gaussian noise. To partially address this question, we compare the finite sample accuracy of the formulas in Theorem 3.2 for different noise distributions.

For each $n$ , we generate $\mathbf{Y}=\mathbf{X}+\mathbf{G}$ of size $p$ -by- $n$ , where $p=2n$ . The signal has rank $r=2$ , with singular values $\gamma^{1/4}+2$ and $\gamma^{1/4}+3$ ; $\mathbf{u}_{1}$ is uniformly equal to $1/\sqrt{p}$ , and $\mathbf{u}_{2}$ is $1/\sqrt{p}$ on entries $1,\dots,p/2$ , and $-1/\sqrt{p}$ on entries $p/2+1,\dots,p$ . $\mathbf{v}_{1}$ and $\mathbf{v}_{2}$ are generated similarly, with $n$ in place of $p$ . The noise matrix has iid entries of variance $1/n$ , drawn from a specified distribution: Gaussian, Rademacher, t10 or t3, where the t distributions are normalized to have variance $1/n$ .
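The variance normalization of the noise distributions can be sketched as follows. A Student t variable with $\nu>2$ degrees of freedom has variance $\nu/(\nu-2)$, so dividing by $\sqrt{\nu/(\nu-2)}$ and then by $\sqrt{n}$ gives entries of variance $1/n$. The function name `noise_matrix` and the string labels are hypothetical, chosen for illustration.

```python
import numpy as np

def noise_matrix(p, n, dist, rng):
    """iid noise with mean zero and variance 1/n; the `dist` labels are illustrative."""
    if dist == "gaussian":
        Z = rng.standard_normal((p, n))
    elif dist == "rademacher":
        Z = rng.choice([-1.0, 1.0], size=(p, n))
    elif dist.startswith("t"):
        df = float(dist[1:])          # e.g. "t10" -> 10 degrees of freedom
        if df <= 2:
            raise ValueError("Student t needs df > 2 for a finite variance")
        # Rescale so that each entry has unit variance before the 1/sqrt(n) scaling.
        Z = rng.standard_t(df, size=(p, n)) / np.sqrt(df / (df - 2.0))
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return Z / np.sqrt(n)

rng = np.random.default_rng(0)
n = 500
G = {d: noise_matrix(2 * n, n, d, rng) for d in ("gaussian", "rademacher", "t10")}
```

For heavy-tailed cases such as t3, the population variance is still $1/n$ after this rescaling, but the empirical variance converges slowly because the fourth moment is infinite.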

The $p$-by-$p$ weight matrix $\Omega$ is diagonal with diagonal entries $1,\dots,3p/4$ equal to $1$, and the remaining entries equal to $0$. We evaluate the true matrix $\mathbf{E}$ and use formulas (16) and (18) to predict $\mathbf{D}$ and $\mathbf{C}^{\omega}$. For each draw, we compute the actual inner product matrices $\widehat{\mathbf{D}}$ and $\widehat{\mathbf{C}}^{\omega}$ using the left singular vectors $\hat{\mathbf{u}}_{1}$ and $\hat{\mathbf{u}}_{2}$ of $\mathbf{Y}$. Due to the ambiguity in signs, we make all entries of the matrices positive. We then compute the relative errors $\|\widehat{\mathbf{D}}-\mathbf{D}\|_{\mathrm{F}}/\|\mathbf{D}\|_{\mathrm{F}}$ and $\|\widehat{\mathbf{C}}^{\omega}-\mathbf{C}^{\omega}\|_{\mathrm{F}}/\|\mathbf{C}^{\omega}\|_{\mathrm{F}}$.

For each noise type and each value of $n=500k$ , $k=1,2,4,8,16$ , the experiment is repeated 1000 times. The average and maximum relative errors are recorded in Table 2 for $\mathbf{C}^{\omega}$ , and in Table 3 for $\mathbf{D}$ . Both the average and maximum errors for Gaussian noise very nearly match those for the Rademacher and t10 distributions. However, the errors for the heavier tailed t3 distribution are much larger, indicating that the theory breaks down for this noise model. The errors for the Gaussian, Rademacher, and t10 distributions appear to decay approximately like $O(n^{-1/2})$ ; this is the rate we expect from [6] and Theorem 2.19 in [8]. The errors for the t3 distribution do not exhibit such decay with increasing $n$ , indicating that the model does not match.

7.6 Rank estimation

In this section, we explore estimation of the rank $r$ of $\mathbf{X}$ from the observed matrix $\mathbf{Y}$ , a topic that has received considerable attention [36, 37, 17, 15, 46, 45]. The naive estimator $\hat{r}^{\mathrm{naive}}$ is defined by

[TABLE]

this simply counts the number of $\mathbf{Y}$ ’s singular values exceeding $1+\sqrt{\gamma}$ , the asymptotically largest singular value of the noise matrix $\mathbf{G}$ . It is known that $\hat{r}^{\mathrm{naive}}$ may overestimate the rank; see, e.g., [28]. The rank estimator of Kritchman and Nadler from [36], denoted by $\hat{r}^{\mathrm{KN}}$ , is designed to prevent attributing noisy singular values to signal. We compare the performance of $\hat{r}^{\mathrm{naive}}$ and $\hat{r}^{\mathrm{KN}}$ for different noise distributions in terms of the accuracy of estimating $r$ and the effect on the denoising error.
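As an illustration of the naive estimator described above, the sketch below counts singular values above the noise edge, taking $\gamma=p/n$ and assuming the noise entries have variance $1/n$. The function name `naive_rank` and the rank-one test signal are hypothetical.

```python
import numpy as np

def naive_rank(Y):
    """Count singular values of Y above the noise edge 1 + sqrt(gamma), with gamma = p/n."""
    p, n = Y.shape
    edge = 1.0 + np.sqrt(p / n)
    return int(np.sum(np.linalg.svd(Y, compute_uv=False) > edge))

rng = np.random.default_rng(0)
p, n = 300, 600
u = np.ones(p) / np.sqrt(p)
v = np.ones(n) / np.sqrt(n)
X = 5.0 * np.outer(u, v)                       # rank-one signal, well above the edge
G = rng.standard_normal((p, n)) / np.sqrt(n)   # noise with entry variance 1/n
print(naive_rank(X + G), naive_rank(G))
```

At finite $n$, the largest noise singular value fluctuates around the edge, which is one way the count can exceed the true rank.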

For $p=300$ and $n=600$, we generate a $p$-by-$n$ signal matrix $\mathbf{X}$ with rank $r=2$ and singular values $\gamma^{1/4}+1$ and $\gamma^{1/4}+2$. The left singular vector $\mathbf{u}_{1}$ is uniformly equal to $1/\sqrt{p}$, and $\mathbf{u}_{2}$ is $1/\sqrt{p}$ on entries $1,\dots,p/2$, and $-1/\sqrt{p}$ on entries $p/2+1,\dots,p$. The right singular vectors $\mathbf{v}_{1}$ and $\mathbf{v}_{2}$ are generated the same way, with $n$ in place of $p$. The noise matrix $\mathbf{G}$ has iid entries of variance $1/n$, drawn from a specified distribution, namely: Gaussian, Rademacher, or the t distributions with $10$, $5$, $4$, $3$ and $2.5$ degrees of freedom, where the t distributions are normalized to have variance $1/n$. The $p$-by-$p$ weight matrix $\Omega$ is diagonal with diagonal entries linearly spaced between $1/p$ and $1$. The $n$-by-$n$ weight matrix $\Pi$ is also diagonal, with diagonal entries linearly spaced between $1/p$ and $1/\gamma$. In each run, we apply Algorithm 1 with the oracle $r=2$, the naive $\hat{r}^{\mathrm{naive}}$ from (34), and $\hat{r}^{\mathrm{KN}}$ from [36], with $0.1$ confidence level. (The code for computing $\hat{r}^{\mathrm{KN}}$ was taken from Boaz Nadler’s website: www.wisdom.weizmann.ac.il/~nadler/Rank_Estimation/rank_estimation.html.) For each noise distribution, the experiment is repeated $1000$ times. In Tables 4 and 5 we record the relative errors $\|\Omega(\widehat{\mathbf{X}}-\mathbf{X})\Pi^{T}\|_{\mathrm{F}}/\|\Omega\mathbf{X}\Pi^{T}\|_{\mathrm{F}}$ and the estimated ranks.

For the Gaussian, Rademacher, and t10 distributions, the Kritchman-Nadler estimate $\hat{r}^{\mathrm{KN}}$ is typically closer to the true rank, $r=2$, than is the naive estimate $\hat{r}^{\mathrm{naive}}$. However, the average and maximum errors are close regardless of the rank estimator used, since even when $\hat{r}^{\mathrm{naive}}=3$, the third singular value of $\mathbf{Y}$ is so close to the detection edge $1+\sqrt{\gamma}$ that the estimates of the cosines $c_{3}$ and $\tilde{c}_{3}$ are nearly $0$. With the t5 distribution, both $\hat{r}^{\mathrm{naive}}$ and $\hat{r}^{\mathrm{KN}}$ are more likely to overestimate the true rank. While the average errors are close to those for the Gaussian, Rademacher, and t10 distributions, the maximum errors are much larger, indicating that while this noise distribution’s “typical” behavior may be close to the thinner tailed ones, a small number of extreme draws of $\mathbf{G}$ can result in very poor performance. For the t distributions with $4$, $3$, and $2.5$ degrees of freedom, both $\hat{r}^{\mathrm{naive}}$ and $\hat{r}^{\mathrm{KN}}$ drastically overestimate the rank, and the resulting relative errors are enormous.

8 Conclusion

This paper has introduced a family of spectral denoisers for low-rank matrix estimation, which generalizes singular value shrinkage. We have derived optimal spectral denoisers for weighted loss functions, and discussed applications. By judiciously combining these denoisers for different weights, we constructed the method of localized denoising, which outperforms singular value shrinkage under heterogeneity. While this paper has focused on theoretical and algorithmic development, in future work we plan to apply the methods to problems where related but suboptimal methods have previously been employed. This includes the problems of denoising and deconvolution of images from cryoelectron microscopy [9]; 3-D reconstruction of heterogeneous molecules from noisy images [1]; and denoising XFEL images [42, 59].

Acknowledgements

I thank Elad Romanov and Amit Singer for stimulating discussions related to this work, Edgar Dobriban for valuable feedback on an earlier version of this manuscript, and the reviewers for their helpful comments. I acknowledge support from the NSF BIGDATA award IIS 1837992 and BSF award 2018230.

Appendix A Proof of Theorem 3.2

The proof of Theorem 3.2 is similar to the analysis found in [40], in that it rests on the same decomposition of the empirical singular vectors $\hat{\mathbf{u}}_{j}$ and $\hat{\mathbf{v}}_{j}$ into the signal and residual components. If $\mathbf{a}$ and $\mathbf{b}$ are vectors of the same dimension, we will write $\mathbf{a}\sim\mathbf{b}$ as a short-hand for $\|\mathbf{a}-\mathbf{b}\|\to 0$ almost surely as $p,n\to\infty$ . The statements are symmetric in the left and right singular vectors, so for compactness we will only prove them for the left ones. The proofs for the other side are identical.

Because the noise matrix $\mathbf{G}$ has an isotropic distribution, we can write:

[TABLE]

where $\tilde{\mathbf{u}}_{j}$ is a unit vector that is uniformly random over the sphere in the subspace orthogonal to $\mathbf{u}_{1},\dots,\mathbf{u}_{r}$ (see [47]). Because $\tilde{\mathbf{u}}_{j}$ is uniformly random, it is asymptotically orthogonal to any independent unit vector $\mathbf{w}$ ; that is,

[TABLE]

Furthermore, $\tilde{\mathbf{u}}_{j}$ satisfies the normalized trace formula, namely if $\mathbf{A}$ is any matrix with bounded operator norm, then

[TABLE]

We refer the reader to [7, 21, 57, 49] for details. We will use (36) and (37) repeatedly. Furthermore, when $j\neq k$ it follows from Lemma A.2 in [40] that

[TABLE]
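The two properties (36) and (37) are easy to observe numerically. The sketch below draws a unit vector uniformly from the orthogonal complement of a fixed $r$-dimensional subspace and checks that it is nearly orthogonal to an independent unit vector and that its quadratic form with a bounded diagonal matrix is close to the normalized trace. The dimension and the choice of test matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 4000, 2

# Orthonormal signal directions u_1, ..., u_r (arbitrary here).
U, _ = np.linalg.qr(rng.standard_normal((p, r)))

# Uniform unit vector in the orthogonal complement of span(U):
# project a Gaussian vector off span(U), then normalize.
g = rng.standard_normal(p)
g -= U @ (U.T @ g)
u_tilde = g / np.linalg.norm(g)

# An independent unit vector w and a bounded diagonal matrix A.
w = rng.standard_normal(p)
w /= np.linalg.norm(w)
a = np.linspace(0.0, 1.0, p)          # diagonal of A; operator norm 1

inner = u_tilde @ w                    # asymptotic orthogonality: close to 0
quad = u_tilde @ (a * u_tilde)         # trace formula: close to tr(A)/p
print(abs(inner), abs(quad - a.mean()))
```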

Applying $\Omega$ to each side of (35), we have:

[TABLE]

The proofs of the identities in Theorem 3.2 now follow by manipulating the asymptotic equation (39) appropriately, in conjunction with (36), (37) and (38).

We first show the formulas for $c_{jk}^{\omega}$ . We take inner products of each side of (39) with $\Omega\mathbf{u}_{k}$ :

[TABLE]

where we have used (36).

To derive the formula for $d_{j}$ , we take the squared norm of each side of (39):

[TABLE]

The first asymptotic equivalence follows from (36), and the second from (37).

Finally, we derive the formula for $d_{jk}$ , $j\neq k$ . From (39), we have

[TABLE]

From (36) and (38), the terms involving $\tilde{\mathbf{u}}_{j}$ and $\tilde{\mathbf{u}}_{k}$ vanish, and we are left with

[TABLE]

This completes the proof of Theorem 3.2.

Appendix B Proof of Theorem 4.1

The target matrix $\mathbf{X}$ may be written

[TABLE]

and our estimate $\widehat{\mathbf{X}}$ is of the form

[TABLE]

where $\mathbf{U}=[\mathbf{u}_{1},\dots,\mathbf{u}_{r}]$ , $\mathbf{V}=[\mathbf{v}_{1},\dots,\mathbf{v}_{r}]$ , $\widehat{\mathbf{U}}=[\hat{\mathbf{u}}_{1},\dots,\hat{\mathbf{u}}_{r}]$ , $\widehat{\mathbf{V}}=[\hat{\mathbf{v}}_{1},\dots,\hat{\mathbf{v}}_{r}]$ , and $\mathbf{t}=(t_{1},\dots,t_{r})^{T}$ .

Define $\mathbf{W}=\Omega\mathbf{U}$ , $\mathbf{Z}=\Pi\mathbf{V}$ , $\widehat{\mathbf{W}}=\Omega\widehat{\mathbf{U}}$ , and $\widehat{\mathbf{Z}}=\Pi\widehat{\mathbf{V}}$ . We may then write the weighted loss as follows:

[TABLE]

which is the unweighted Frobenius loss between $\widehat{\mathbf{W}}\widehat{\mathbf{B}}\widehat{\mathbf{Z}}^{T}$ and $\mathbf{W}\mathrm{diag}(\mathbf{t})\mathbf{Z}^{T}$ . Continuing, we have:

[TABLE]

Defining the operator $\mathcal{T}$ by $\mathcal{T}(\widehat{\mathbf{B}})=\mathbf{D}\widehat{\mathbf{B}}\widetilde{\mathbf{D}}$ , the pseudoinverse of $\mathcal{T}$ is given by $\mathcal{T}^{+}(\mathbf{B})=\mathbf{D}^{+}\mathbf{B}\widetilde{\mathbf{D}}^{+}$ . Consequently, the choice of $\widehat{\mathbf{B}}$ that minimizes $\mathcal{L}(\widehat{\mathbf{X}},\mathbf{X})$ is given by:

[TABLE]

The error may then be evaluated by substituting this expression for $\widehat{\mathbf{B}}$ into (B), completing the proof.
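The pseudoinverse identity for $\mathcal{T}$ can be checked numerically by writing $\mathcal{T}$ as a Kronecker product and using the fact that the pseudoinverse of a Kronecker product is the Kronecker product of the pseudoinverses. The following sketch uses made-up test matrices; the names `D`, `D_tilde`, `T` and their eigenvalues are assumptions for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4

# Symmetric PSD test matrices with prescribed eigenvalues; D is rank-deficient
# on purpose so that a true pseudoinverse (not an inverse) is exercised.
Q1, _ = np.linalg.qr(rng.standard_normal((r, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((r, r)))
D = Q1 @ np.diag([3.0, 2.0, 1.0, 0.0]) @ Q1.T
D_tilde = Q2 @ np.diag([4.0, 3.0, 2.0, 1.0]) @ Q2.T

# Column-major vectorization: vec(D B D_tilde) = kron(D_tilde.T, D) vec(B).
T = np.kron(D_tilde.T, D)
B = rng.standard_normal((r, r))

# Second positional argument of pinv is the relative cutoff for small singular values.
lhs = np.linalg.pinv(T, 1e-10) @ B.flatten(order="F")
rhs = (np.linalg.pinv(D, 1e-10) @ B @ np.linalg.pinv(D_tilde, 1e-10)).flatten(order="F")
ok = np.allclose(lhs, rhs)
print(ok)
```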

Appendix C Proof of Theorem 4.2

Under weighted orthogonality, $e_{jk}=\tilde{e}_{jk}=0$ whenever $j\neq k$ , and so $d_{jk}=\tilde{d}_{jk}=c_{jk}^{\omega}=\tilde{c}_{jk}^{\omega}=0$ when $j\neq k$ as well. Consequently, the matrices $\mathbf{E}$ , $\widetilde{\mathbf{E}}$ , $\mathbf{D}$ , $\widetilde{\mathbf{D}}$ , $\mathbf{C}$ , and $\widetilde{\mathbf{C}}$ are diagonal. The optimal $\widehat{\mathbf{B}}$ is given by:

[TABLE]

which is also diagonal, with diagonal entries

[TABLE]

which is the desired expression.

Appendix D Proof of Proposition 4.3

Suppose a coordinate has signal strength $t=t_{k}$ (we drop the subscript as we are only considering one component). We may assume without loss of generality (and by rescaling $\alpha$ and $\beta$ ) that $\mu=\nu=1$ . Consequently, the optimal singular value is equal to:

[TABLE]

By taking $\alpha$ and $\beta$ sufficiently large, this value can be made arbitrarily close to

[TABLE]

That is, the optimal singular value $\hat{t}$ will be greater than the observed singular value $\lambda$ in this parameter regime.

On the other hand, if $\beta\leq 1=\nu$ , we have:

[TABLE]

which shows that $\hat{t}\leq\lambda$ . A nearly identical proof works if $\alpha\leq\mu$ . This completes the proof.

Appendix E Proof of Proposition 4.4

Without loss of generality, we will assume $\mu=\nu=1$ . We consider the functions $c(t)=\sqrt{(1-\gamma/t^{4})/(1+\gamma/t^{2})}$ and $\tilde{c}(t)=\sqrt{(1-\gamma/t^{4})/(1+1/t^{2})}$ . Define the functions $\varphi(t)$ and $\psi(t)$ by

[TABLE]

and

[TABLE]

Then we may write the optimal singular value $\hat{t}$ as a function $f(t)$ as follows:

[TABLE]

Let us assume that $\alpha\leq 1$ ; the proof for $\beta\leq 1$ will be nearly identical. We wish to show that $f^{\prime}(t)\geq 0$ , for $t>\gamma^{1/4}$ . We have

[TABLE]

and since $f(t)>0$ , we must show that the right side is positive. It is straightforward to verify that

[TABLE]

from which it follows that

[TABLE]

Similarly, we can show

[TABLE]

Consequently, it is enough to show

[TABLE]

Direct computation shows

[TABLE]

and

[TABLE]

Substituting (62) and (63) into the left side of (61) and multiplying by $t(t^{2}+\gamma)(t^{2}+1)$ , we get:

[TABLE]

which is the desired result.

Appendix F Proof of Theorem 5.1

We denote by $\hat{t}_{1}^{\mathrm{shr}},\dots,\hat{t}_{r}^{\mathrm{shr}}$ the singular values of $\widehat{\mathbf{X}}^{\mathrm{shr}}$ , and $\hat{\mathbf{t}}^{\mathrm{shr}}=(\hat{t}_{1}^{\mathrm{shr}},\dots,\hat{t}_{r}^{\mathrm{shr}})^{T}$ . We may then write

[TABLE]

This is a spectral denoiser (in the set $\SS$ ), and hence its weighted loss with weights $\Omega_{i}$ and $\Pi_{j}$ cannot be less than that of the optimal spectral denoiser $\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{loc}}$ . That is,

[TABLE]

Because the $\Omega_{i}$ and $\Pi_{j}$ are pairwise orthogonal projections which sum to the identity, the total Frobenius loss can be decomposed:

[TABLE]

which is the desired inequality.
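The decomposition of the Frobenius loss used in this proof holds for any matrix and any families of pairwise orthogonal projections summing to the identity. A small numerical check, assuming coordinate-block projections chosen purely for illustration (the helper `block_projections` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 8
A = rng.standard_normal((p, n))   # stands in for the error matrix X_hat - X

def block_projections(dim, blocks):
    """Diagonal 0/1 projections onto disjoint index blocks that cover range(dim)."""
    projs = []
    for idx in blocks:
        d = np.zeros(dim)
        d[list(idx)] = 1.0
        projs.append(np.diag(d))
    return projs

Omegas = block_projections(p, [range(0, 2), range(2, 6)])
Pis = block_projections(n, [range(0, 3), range(3, 5), range(5, 8)])

# Total squared Frobenius norm equals the sum of the weighted (block) losses.
total = np.linalg.norm(A, "fro") ** 2
parts = sum(np.linalg.norm(Om @ A @ Pi.T, "fro") ** 2 for Om in Omegas for Pi in Pis)
print(np.isclose(total, parts))
```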

Appendix G Proof of Theorem 5.2

For $1\leq k\leq r$ , $1\leq i\leq I$ , and $1\leq j\leq J$ , let $\alpha_{k}^{(i)}=\|\Omega_{i}\mathbf{u}_{k}\|^{2}$ , $\mu^{(i)}=\text{tr}(\Omega_{i})/p$ , $\beta_{k}^{(j)}=\|\Pi_{j}\mathbf{v}_{k}\|^{2}$ , and $\nu^{(j)}=\text{tr}(\Pi_{j})/n$ . Then

[TABLE]

Let $\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{dd}}$ be the optimal diagonal denoiser with weights $\Omega_{i}$ and $\Pi_{j}$ . Because of the weighted orthogonality condition, Theorem 4.2 states that the AMSE for $\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{dd}}$ is

[TABLE]

Since $\widehat{\mathbf{X}}_{(i,j)}^{\mathrm{loc}}$ minimizes the weighted error with weights $\Omega_{i}$ and $\Pi_{j}$ , we have:

[TABLE]

On the other hand, the error obtained by $\widehat{\mathbf{X}}^{\mathrm{shr}}$ is equal to

[TABLE]

Comparing (G) and (71), the result will follow if we can show that for each $1\leq k\leq r$ ,

[TABLE]

where the inequality is strict so long as one of $\mathbf{u}_{k}$ or $\mathbf{v}_{k}$ is not generic with respect to some $\Omega_{i}$ or $\Pi_{j}$ ; or equivalently, either $\alpha_{k}^{(i)}\neq\mu^{(i)}$ for some $i$ , or $\beta_{k}^{(j)}\neq\nu^{(j)}$ for some $j$ . Because

[TABLE]

it is enough to show that

[TABLE]

with the inequality being strict so long as $\alpha_{k}^{(i)}\neq\mu^{(i)}$ for some $i$ .

For each $1\leq i\leq I$ , let $r_{i}=\alpha_{k}^{(i)}/\mu^{(i)}$ . Then

[TABLE]

The function $F(r)=r^{2}/(c_{k}^{2}r+s_{k}^{2})$ is convex. Since $\sum_{i=1}^{I}\mu^{(i)}=1$ , Jensen’s inequality implies

[TABLE]

which is the desired inequality. The inequality will be strict so long as $r_{i}=\alpha_{k}^{(i)}/\mu^{(i)}$ is not constantly equal to $1$ over $i$ , or equivalently if $\alpha_{k}^{(i)}\neq\mu^{(i)}$ for some $i$ . This is the desired result.
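For completeness, the convexity of $F$ and the value of the Jensen bound can be verified directly. The following is a sketch of that computation under two assumptions that reflect our reading of the notation: $c_{k}^{2}+s_{k}^{2}=1$ with $s_{k}>0$, and $\sum_{i}\alpha_{k}^{(i)}=\|\mathbf{u}_{k}\|^{2}=1$, which holds because the $\Omega_{i}$ are pairwise orthogonal projections summing to the identity.

```latex
\begin{align*}
F'(r) &= \frac{c_k^{2} r^{2} + 2 s_k^{2} r}{(c_k^{2} r + s_k^{2})^{2}}, \qquad
F''(r) = \frac{2 s_k^{4}}{(c_k^{2} r + s_k^{2})^{3}} > 0 \quad (r \geq 0), \\
\sum_{i=1}^{I} \mu^{(i)} F(r_i) &\geq F\Big(\sum_{i=1}^{I} \mu^{(i)} r_i\Big)
 = F\Big(\sum_{i=1}^{I} \alpha_k^{(i)}\Big) = F(1) = \frac{1}{c_k^{2} + s_k^{2}} = 1 .
\end{align*}
```

Since $F''>0$, $F$ is strictly convex, so equality in Jensen's inequality forces all $r_{i}$ to coincide; combined with $\sum_{i}\mu^{(i)}r_{i}=1$, this means $r_{i}=1$ for every $i$.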

Appendix H Proof of Proposition 6.1

Since $\mathbf{Y}_{0}=\Omega\mathbf{Y}\Pi^{T}$ has only $n_{0}$ columns, to ensure that the scaling of the noise matches that of the standard spiked model, we must multiply it by $\sqrt{n/n_{0}}=1/\sqrt{\nu}$ . We define $\widetilde{\mathbf{Y}}_{0}=\mathbf{Y}_{0}/\sqrt{\nu}$ and $\widetilde{\mathbf{X}}_{0}=\mathbf{X}_{0}/\sqrt{\nu}$ . Then $\widetilde{\mathbf{Y}}_{0}$ follows a standard spiked model with signal matrix $\widetilde{\mathbf{X}}_{0}$ .

For $1\leq k\leq r$ , we let $\mathbf{u}_{k}$ and $\mathbf{v}_{k}$ denote the $k^{th}$ singular vectors of $\mathbf{X}$ ; $\hat{\mathbf{u}}_{k}$ and $\hat{\mathbf{v}}_{k}$ denote the $k^{th}$ singular vectors of $\mathbf{Y}$ ; $\mathbf{u}_{k}^{0}$ and $\mathbf{v}_{k}^{0}$ denote the $k^{th}$ singular vectors of $\mathbf{X}_{0}$ (and $\widetilde{\mathbf{X}}_{0}$ ); and $\hat{\mathbf{u}}_{k}^{0}$ and $\hat{\mathbf{v}}_{k}^{0}$ denote the $k^{th}$ singular vectors of $\mathbf{Y}_{0}$ (and $\widetilde{\mathbf{Y}}_{0}$ ). We let $t_{1}^{0},\dots,t_{r}^{0}$ denote the singular values of $\widetilde{\mathbf{X}}_{0}$ . We also let $\gamma_{0}=p_{0}/n_{0}=(\mu/\nu)\gamma$ be the aspect ratio of the submatrix.

If $t_{1},\dots,t_{r}$ are the singular values of the full $p$ -by- $n$ signal matrix $\mathbf{X}$ , then we may write the rescaled submatrix $\widetilde{\mathbf{X}}_{0}$ as

[TABLE]

Because the $\Omega\mathbf{u}_{k}$ and $\Pi\mathbf{v}_{k}$ are assumed to be pairwise orthogonal, (77) is the SVD of $\widetilde{\mathbf{X}}_{0}$. Consequently:

[TABLE]

We define the cosines

[TABLE]

Following Remark 7, we may assume that the singular vectors have been chosen so that both $c_{k}^{0}$ and $\tilde{c}_{k}^{0}$ are non-negative. Then the AMSE obtained by first applying optimal singular value shrinkage to $\widetilde{\mathbf{Y}}_{0}$ , and then rescaling by $\nu$ , is

[TABLE]

We now turn to the weighted estimator $\widehat{\mathbf{X}}_{0}=\Omega\widehat{\mathbf{X}}\Pi^{T}$ . From the weighted orthogonality condition, $\widehat{\mathbf{X}}=\widehat{\mathbf{X}}^{\mathrm{dd}}$ , the optimal diagonal denoiser. From Theorem 4.2, the AMSE of $\widehat{\mathbf{X}}_{0}$ may be written

[TABLE]

Comparing (80) and (81), the result will be proven if we can show

[TABLE]

and

[TABLE]

By the symmetry in the problem, it is enough to prove (82). Because we are working with each singular component separately, we will drop the subscript $k$ . From Proposition 3.1, the formula for $(c^{0})^{2}$ is given by

[TABLE]

If $t^{0}\leq\gamma_{0}^{1/4}$ , then (82) is trivial. Consequently, we assume $t^{0}>\gamma_{0}^{1/4}$ . Because $t^{0}=t\sqrt{\alpha\beta/\nu}$ and $\gamma_{0}=\gamma\mu/\nu$ , this is equivalent to the condition

[TABLE]

Defining $R=\gamma/t^{4}$ , we may consequently assume that

[TABLE]
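The bound on $R$ follows in one line from the substitutions $t^{0}=t\sqrt{\alpha\beta/\nu}$ and $\gamma_{0}=\gamma\mu/\nu$ stated above. The chain below is our own expansion of that step, and the displayed equations in the original may be arranged differently:

```latex
\[
t^{0} > \gamma_0^{1/4}
\iff (t^{0})^{4} > \gamma_0
\iff \frac{t^{4}\alpha^{2}\beta^{2}}{\nu^{2}} > \frac{\gamma\mu}{\nu}
\iff R = \frac{\gamma}{t^{4}} < \frac{\alpha^{2}\beta^{2}}{\mu\nu} .
\]
```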

We may rewrite $(c^{0})^{2}$ in terms of $R$ as follows:

[TABLE]

From the formula

[TABLE]

we may rewrite the right side of (82) as

[TABLE]

Comparing (87) to (89), the inequality (82) is equivalent to showing

[TABLE]

Because each side is affine linear in $R$, and $0\leq R\leq\alpha^{2}\beta^{2}/(\mu\nu)$, it is enough to verify (90) at $R=0$ and $R=\alpha^{2}\beta^{2}/(\mu\nu)$. When $R=\alpha^{2}\beta^{2}/(\mu\nu)$, the left side of (90) is $0$, whereas the right side is non-negative because $\alpha<\sqrt{\mu}$ and $\beta<\sqrt{\nu}$, verifying the inequality in this case. When $R=0$, the difference between the right side and left side of (90), divided by $\alpha\beta\mu$, is equal to

[TABLE]

Since $\beta^{2}\leq\nu\leq 1$ and $\alpha\leq 1$ , this expression is positive, verifying (90) and completing the proof.

Appendix I Proof of Proposition 6.2

From a standard Bernstein-type inequality for subexponential random variables (e.g. Proposition 5.16 from [56]), for every $\theta>0$ we have:

[TABLE]

for constants $C,C^{\prime}>0$ . Since $p\sim\gamma n$ , the right hand side is summable in $n$ ; it follows from the Borel-Cantelli Lemma that $\max_{i}|\hat{a}_{i}-\mathbb{E}[\hat{a}_{i}]|\to 0$ almost surely as $n\to\infty$ . From the delocalization of $\mathbf{X}$ ’s singular vectors, $\sup_{i,j}|X_{ij}|^{2}=o(1/n)$ . Since $\text{tr}(\mathbf{T})/n=1$ , we then have

[TABLE]

Consequently, $\|\widehat{\mathbf{S}}_{p}-\mathbf{S}_{p}\|_{\mathrm{op}}=\max_{1\leq i\leq p}|\widehat{a}_{i}-a_{i}|\to 0$ almost surely, as desired. Similar reasoning also shows that $\sum_{i=1}^{p}(\hat{a}_{i}-a_{i})/n\to 0$ almost surely as $n\to\infty$ . A nearly identical argument applied to the numerator of $\hat{b}_{j}$ then shows $\|\widehat{\mathbf{T}}_{n}-\mathbf{T}_{n}\|_{\mathrm{op}}\to 0$ almost surely, completing the proof.

Appendix J Proof of Proposition 6.3

To prove Proposition 6.3, we begin by deriving a lower bound on the operator norm of the noise matrix $\mathbf{N}=\mathbf{S}^{1/2}\mathbf{G}\mathbf{T}^{1/2}$. We let $\mathbf{a}$ and $\mathbf{b}$ be unit vectors so that $\mathbf{G}^{T}\mathbf{b}=\|\mathbf{G}\|_{\mathrm{op}}\mathbf{a}$. Then

[TABLE]

Next, we take unit vectors $\mathbf{c}$ and $\mathbf{d}$ so that $\mathbf{G}\mathbf{T}^{1/2}\mathbf{d}=\|\mathbf{G}\mathbf{T}^{1/2}\|_{\mathrm{op}}\mathbf{c}$ . Then we have

[TABLE]

Since the distribution of $\mathbf{G}$ is orthogonally-invariant, the distributions of $\mathbf{a}$ and $\mathbf{c}$ are uniform over the unit spheres in $\mathbb{R}^{n}$ and $\mathbb{R}^{p}$ , respectively. Consequently, $\|\mathbf{T}^{1/2}\mathbf{a}\|^{2}\sim\text{tr}(\mathbf{T})/n$ and $\|\mathbf{S}^{1/2}\mathbf{c}\|^{2}\sim\text{tr}(\mathbf{S})/p$ . Therefore,

[TABLE]

where the inequality holds almost surely in the large $p$ , large $n$ limit. Note that $\|\mathbf{G}\|_{\mathrm{op}}\sim 1+\sqrt{\gamma}$ (see, e.g., [2]), though we do not need to use this fact.
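The first step of this argument can be illustrated numerically. With $\mathbf{a}$ and $\mathbf{b}$ the top singular vectors of $\mathbf{G}$, the bound $\|\mathbf{G}\mathbf{T}^{1/2}\|_{\mathrm{op}}\geq\|\mathbf{T}^{1/2}\mathbf{G}^{T}\mathbf{b}\|=\|\mathbf{G}\|_{\mathrm{op}}\|\mathbf{T}^{1/2}\mathbf{a}\|$ holds exactly, while $\|\mathbf{T}^{1/2}\mathbf{a}\|^{2}\approx\text{tr}(\mathbf{T})/n$ holds approximately for large $n$. This is a sketch based on our reading of the surrounding text; the sizes and the choice of $\mathbf{T}$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, kappa = 500, 1000, 8.0
G = rng.standard_normal((p, n)) / np.sqrt(n)   # iid entries with variance 1/n
t = np.linspace(1.0 / kappa, 1.0, n)           # diagonal of T

# Top singular triple of G, so that G^T b = ||G||_op a.
B, sing, At = np.linalg.svd(G, full_matrices=False)
b, a, g_op = B[:, 0], At[0], sing[0]

GT_half = G * np.sqrt(t)[None, :]              # G T^{1/2}
lhs = np.linalg.norm(GT_half, 2)               # ||G T^{1/2}||_op
rhs = g_op * np.linalg.norm(np.sqrt(t) * a)    # ||G||_op ||T^{1/2} a||

print(lhs >= rhs - 1e-12)                      # exact inequality
print(np.sum(t * a**2), t.mean())              # ||T^{1/2} a||^2 versus tr(T)/n
print(g_op, 1 + np.sqrt(p / n))                # ||G||_op versus 1 + sqrt(gamma)
```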

Furthermore, we also have

[TABLE]

Consequently,

[TABLE]

completing the proof.

Appendix K Proof of Proposition 6.4

Let $\delta_{ij}$ be $1$ if entry $(i,j)$ is sampled, and $0$ otherwise. Then $\delta_{ij}\sim\text{Bernoulli}(p_{i}q_{j})$. Let $\Delta=(\delta_{ij})$; then $\mathcal{F}^{*}(\mathcal{F}(\mathbf{X}))=\Delta\odot\mathbf{X}$, where $\odot$ denotes the Hadamard product. Let $\mathbf{q}_{r}=(q_{1}^{r},\dots,q_{p}^{r})^{T}$ and $\mathbf{q}_{c}=(q_{1}^{c},\dots,q_{n}^{c})^{T}$. The matrix $\Delta-\mathbf{q}_{r}\mathbf{q}_{c}^{T}$ is a random matrix with mean zero, whose entries are uniformly bounded. It follows from Corollary 2.3.5 of [54] that $\|\Delta-\mathbf{q}_{r}\mathbf{q}_{c}^{T}\|_{\mathrm{op}}/\sqrt{n}\leq A$ a.s. as $n\to\infty$, for some constant $A>0$.
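Both facts in this paragraph can be observed in a small simulation: zero-filling the unobserved entries coincides with the Hadamard product $\Delta\odot\mathbf{X}$, and the centered mask has operator norm of order $\sqrt{n}$. The sampling probabilities and sizes below are illustrative assumptions, and the test matrix `X` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 800
q_r = np.linspace(0.3, 0.7, p)       # row sampling probabilities (illustrative)
q_c = np.linspace(0.3, 0.7, n)       # column sampling probabilities (illustrative)
probs = np.outer(q_r, q_c)

# delta_ij ~ Bernoulli(q_r[i] * q_c[j]), independently across entries.
Delta = (rng.random((p, n)) < probs).astype(float)
X = rng.standard_normal((p, n))

# F^*(F(X)) keeps the observed entries and zero-fills the rest.
zero_filled = np.zeros_like(X)
zero_filled[Delta == 1] = X[Delta == 1]
same = np.array_equal(zero_filled, Delta * X)

# Operator norm of the centered mask, normalized by sqrt(n).
ratio = np.linalg.norm(Delta - probs, 2) / np.sqrt(n)
print(same, ratio)
```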

We may write $\mathbf{P}\mathbf{X}\mathbf{Q}=\mathbf{X}\odot(\mathbf{q}_{r}\mathbf{q}_{c}^{T})$ , and consequently $\Delta\odot\mathbf{X}-\mathbf{P}\mathbf{X}\mathbf{Q}=(\Delta-\mathbf{q}_{r}\mathbf{q}_{c}^{T})\odot\mathbf{X}$ . Since $\mathbf{X}=\sum_{k=1}^{r}t_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{T}$ , it is enough to show that

[TABLE]

almost surely, for each $k$ .

Suppose $\mathbf{a}$ and $\mathbf{b}$ are two unit vectors. Then

[TABLE]

almost surely as $n\to\infty$ .

Now, since $\mathbf{a}$ is a unit vector,

[TABLE]

and similarly,

[TABLE]

Since $\max_{1\leq k\leq r}\|\mathbf{u}_{k}\|_{\infty}\|\mathbf{v}_{k}\|_{\infty}=o(n^{-1/2})$ , the result follows.

Bibliography

[1] Joakim Andén and Amit Singer. Structural variability from noisy tomographic projections. SIAM Journal on Imaging Sciences, 11(2):1441–1492, 2018.
[2] Zhidong Bai and Jack W. Silverstein. Spectral analysis of large dimensional random matrices. Springer Series in Statistics. Springer, 2009.
[3] Zhidong Bai and Jian-feng Yao. Central limit theorems for eigenvalues in a spiked population model. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 44(3):447–474, 2008.
[4] Zhidong Bai and Jian-feng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177, 2012.
[5] Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
[6] Zhigang Bao, Xiucai Ding, and Ke Wang. Singular vector and singular subspace distribution for the matrix denoising model. arXiv preprint arXiv:1809.10476, 2018.
[7] Florent Benaych-Georges, Alice Guionnet, and Myléne Maida. Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability, 16:1621–1662, 2011.
[8] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012.
