Low-rank matrix completion and denoising under Poisson noise

Andrew D. McRae; Mark A. Davenport

arXiv:1907.05325·stat.ML·April 22, 2024

Low-rank matrix completion and denoising under Poisson noise

Andrew D. McRae, Mark A. Davenport

PDF

TL;DR

This paper investigates low-rank matrix estimation from Poisson noise observations, providing theoretical error bounds for various estimators and demonstrating their minimax optimality, with extensions to multinomial cases.

Contribution

It introduces and analyzes several estimators for low-rank matrix completion and denoising under Poisson noise, establishing their error bounds and optimality.

Findings

01

Error bounds depend on matrix rank and observed fraction.

02

All estimators are minimax optimal within certain classes.

03

Results extend to multinomial matrix denoising and completion.

Abstract

This paper considers the problem of estimating a low-rank matrix from the observation of all or a subset of its entries in the presence of Poisson noise. When we observe all entries, this is a problem of matrix denoising; when we observe only a subset of the entries, this is a problem of matrix completion. In both cases, we exploit an assumption that the underlying matrix is low-rank. Specifically, we analyze several estimators, including a constrained nuclear-norm minimization program, nuclear-norm regularized least squares, and a nonconvex constrained low-rank optimization problem. We show that for all three estimators, with high probability, we have an upper error bound (in the Frobenius norm error metric) that depends on the matrix rank, the fraction of the elements observed, and maximal row and column sums of the true matrix. We furthermore show that the above results are minimax…

Equations216

M^{(1)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ M^{'} ∥_{*} s.t. ∥ A_{Ω}^{*} (X) - p M^{'} ∥ \leq δ,

M^{(1)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ M^{'} ∥_{*} s.t. ∥ A_{Ω}^{*} (X) - p M^{'} ∥ \leq δ,

M^{(2)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ A_{Ω}^{*} (X) - p M^{'} ∥_{F}^{2} + λ ∥ M^{'} ∥_{*},

M^{(2)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ A_{Ω}^{*} (X) - p M^{'} ∥_{F}^{2} + λ ∥ M^{'} ∥_{*},

M^{(3)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ A_{Ω}^{*} (X) - p M^{'} ∥_{F} s.t. rank (M^{'}) \leq r,

M^{(3)} = M^{'} \in [0, \infty)^{m \times n} argmin ∥ A_{Ω}^{*} (X) - p M^{'} ∥_{F} s.t. rank (M^{'}) \leq r,

∥ M - M^{(i)} ∥_{F} ≲ \frac{r}{p} σ (M) + logarithmic terms,

∥ M - M^{(i)} ∥_{F} ≲ \frac{r}{p} σ (M) + logarithmic terms,

σ (M) = i max j \sum M_{ij} + (1 - p) M_{ij}^{2} + j max i \sum M_{ij} + (1 - p) M_{ij}^{2} .

σ (M) = i max j \sum M_{ij} + (1 - p) M_{ij}^{2} + j max i \sum M_{ij} + (1 - p) M_{ij}^{2} .

∥ M - M^{(i)} ∥_{F} ≲ \frac{r}{p} σ (M) .

∥ M - M^{(i)} ∥_{F} ≲ \frac{r}{p} σ (M) .

M in f M sup E_{M} ∥ M - M ∥_{F} ≳ \frac{r}{p} σ .

M in f M sup E_{M} ∥ M - M ∥_{F} ≳ \frac{r}{p} σ .

σ \approx i max j \sum M_{ij} + (1 - p) M_{ij}^{2} .

σ \approx i max j \sum M_{ij} + (1 - p) M_{ij}^{2} .

∥ M - M^{(i)} ∥_{F} ≲ r i max j \sum M_{ij} .

∥ M - M^{(i)} ∥_{F} ≲ r i max j \sum M_{ij} .

E ∥ M - M^{MLE} ∥_{F}^{2} = i, j \sum var (X_{ij}) = i, j \sum M_{ij} .

E ∥ M - M^{MLE} ∥_{F}^{2} = i, j \sum var (X_{ij}) = i, j \sum M_{ij} .

∥ P - P ∥_{F} ≲ \frac{r}{N ( m \land n )}

∥ P - P ∥_{F} ≲ \frac{r}{N ( m \land n )}

∥ D^{1/2} (P - P) ∥_{F} ≲ r lo g (m + n),

∥ D^{1/2} (P - P) ∥_{F} ≲ r lo g (m + n),

∥ M - M ∥_{F} ≲ \frac{r}{p} i max j \sum var (ξ_{ij}) + (1 - p) M_{ij}^{2} + j max i \sum var (ξ_{ij}) + (1 - p) M_{ij}^{2} .

∥ M - M ∥_{F} ≲ \frac{r}{p} i max j \sum var (ξ_{ij}) + (1 - p) M_{ij}^{2} + j max i \sum var (ξ_{ij}) + (1 - p) M_{ij}^{2} .

∥ M - M ∥_{F}^{2} ≲ (λ_{m a x} + (1 - p) λ_{m a x}^{2}) \frac{r m}{p},

∥ M - M ∥_{F}^{2} ≲ (λ_{m a x} + (1 - p) λ_{m a x}^{2}) \frac{r m}{p},

E ∥ M - M ∥_{F}^{2} ≲ \frac{λ _{m a x}^{3}}{λ _{m i n}} \frac{r m}{p} lo g m,

E ∥ M - M ∥_{F}^{2} ≲ \frac{λ _{m a x}^{3}}{λ _{m i n}} \frac{r m}{p} lo g m,

∥ M - M ∥_{F}^{2} ≲ \frac{r n m}{p},

∥ M - M ∥_{F}^{2} ≲ \frac{r n m}{p},

σ (M) = i max j \sum M_{ij} + (1 - p) M_{ij}^{2} + j max i \sum M_{ij} + (1 - p) M_{ij}^{2} .

σ (M) = i max j \sum M_{ij} + (1 - p) M_{ij}^{2} + j max i \sum M_{ij} + (1 - p) M_{ij}^{2} .

A (M, p, ϵ) = 2 p σ (M) + \frac{8 ϵ}{mn} + C max {λ_{max}, 4 lo g \frac{2 mn}{ϵ}} lo g \frac{m \lor n}{ϵ},

A (M, p, ϵ) = 2 p σ (M) + \frac{8 ϵ}{mn} + C max {λ_{max}, 4 lo g \frac{2 mn}{ϵ}} lo g \frac{m \lor n}{ϵ},

∥ M^{(1)} - M ∥_{F} \leq \frac{4 2 r δ}{p}

∥ M^{(1)} - M ∥_{F} \leq \frac{4 2 r δ}{p}

∥ M^{(2)} - M ∥_{F} \leq \frac{2 2 r λ}{p ^{2}} .

∥ M^{(2)} - M ∥_{F} \leq \frac{2 2 r λ}{p ^{2}} .

∥ M^{(3)} - M ∥_{F} \leq \frac{2 2 r}{p} A (M, p, ϵ) .

∥ M^{(3)} - M ∥_{F} \leq \frac{2 2 r}{p} A (M, p, ϵ) .

∥ M^{(i)} - M ∥_{F} \leq \frac{4 2 r δ}{p} = \frac{2 2 r λ}{p ^{2}} .

∥ M^{(i)} - M ∥_{F} \leq \frac{4 2 r δ}{p} = \frac{2 2 r λ}{p ^{2}} .

∥ M^{(3)} - M ∥_{F} \leq \frac{2 2 r}{p} ∥ A_{Ω}^{*} (X) - pM ∥ .

∥ M^{(3)} - M ∥_{F} \leq \frac{2 2 r}{p} ∥ A_{Ω}^{*} (X) - pM ∥ .

0 \geq ∥ M^{δ} ∥_{*} - ∥ M ∥_{*} \geq ⟨ H^{(1)}, Z ⟩,

0 \geq ∥ M^{δ} ∥_{*} - ∥ M ∥_{*} \geq ⟨ H^{(1)}, Z ⟩,

Z = U V^{*} + P_{T^{⊥}} (W),

Z = U V^{*} + P_{T^{⊥}} (W),

0

0

= ⟨ P_{T} (H^{(1)}), U V^{*} ⟩ + ∥ P_{T^{⊥}} (H^{(1)}) ∥_{*}

\geq - ∥ P_{T} (H^{(1)}) ∥_{*} + ∥ P_{T^{⊥}} (H^{(1)}) ∥_{*},

∥ P_{T^{⊥}} (H^{(1)}) ∥_{*} \leq ∥ P_{T} (H^{(1)}) ∥_{*} .

∥ P_{T^{⊥}} (H^{(1)}) ∥_{*} \leq ∥ P_{T} (H^{(1)}) ∥_{*} .

∥ H^{(1)} ∥_{*}

∥ H^{(1)} ∥_{*}

\leq 2 2 r ∥ P_{T} (H^{(1)}) ∥_{F}

\leq 2 2 r ∥(H^{(1)}) ∥_{F},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Low-rank matrix completion and denoising under Poisson noise

Andrew D. McRae Mark A. Davenport

Georgia Institute of Technology

Abstract

This paper considers the problem of estimating a low-rank matrix from the observation of all or a subset of its entries in the presence of Poisson noise. When we observe all entries, this is a problem of matrix denoising; when we observe only a subset of the entries, this is a problem of matrix completion. In both cases, we exploit an assumption that the underlying matrix is low-rank. Specifically, we analyze several estimators, including a constrained nuclear-norm minimization program, nuclear-norm regularized least squares, and a nonconvex constrained low-rank optimization problem. We show that for all three estimators, with high probability, we have an upper error bound (in the Frobenius norm error metric) that depends on the matrix rank, the fraction of the elements observed, and maximal row and column sums of the true matrix. We furthermore show that the above results are minimax optimal (within a universal constant) in classes of matrices with low rank and bounded row and column sums. We also extend these results to handle the case of matrix multinomial denoising and completion.

1 Introduction

1.1 Low-rank models for count data

In this paper, we consider the problem of estimating a non-negative matrix $M\in\mathbf{R}^{m\times n}$ given independent observations distributed according to $\operatorname{Poisson}(M_{ij})$ for $(i,j)\in\Omega$ , where $\Omega$ is a (not necessarily strict) subset of $\{1,\dots,m\}\times\{1,\dots,n\}$ . If we do not make an observation for every entry of the matrix, the recovery problem is, in general, ill-posed in the absence of any additional assumptions on the underlying matrix. A common assumption for this type of problem is that the unknown matrix $M$ is low-rank; i.e., the dimension of the spans of the columns and rows of $M$ is much smaller than the actual numbers of columns and rows. This assumption greatly reduces the number of degrees of freedom in the model, making the recovery problem far more tractable. Note that even if we do observe every entry, we can still exploit the structure of the model to reduce the error due to noise.

While the problems of matrix completion and denoising have received a significant amount of attention in the settings of Gaussian noise and of small, bounded (in $\ell_{2}$ ) perturbations (e.g., [1, 2, 3]), Poisson noise models have received comparatively less attention. In this paper, we focus primarily on the Poisson model, but we also examine closely-related multinomial models; this includes the case in which we have a single probability distribution over matrix coordinates (for which our result is a corollary of our Poisson results) as well as the case in which we make independent multinomial observations of matrix rows. Collectively, these models are often natural in applications where the observations arise via some form of counting process. The ability to recover (or de-noise) a low-rank signal from noisy, count-based observations is useful in many situations. We briefly mention two examples.

One potential application area involves imaging systems. This includes conventional cameras (which often suffer from noise in low light or with short exposures), but also 3-D imaging methods such as X-ray computed tomography (CT) and positron emission tomography (PET), which, in medical imaging, would greatly benefit from an improved noise/radiation dose tradeoff. In these scenarios, the Poisson noise model is natural because the observations consists of counts of particle (e.g., photon) arrivals at a detector. In many of these settings, such as when observing a periodic or slowly-varying sequence of images, a low-rank assumption on the underlying data is natural (see, e.g., [4] for an overview of low-rank modeling in image applications).

Another important application is topic modeling, which is a common form of dimensionality reduction for text documents. In this case, our observations consist of counts of word occurrences in a corpus of documents. If we suppose that these documents can be decomposed according to a small set of topics, and that within each topic documents will exhibit similar word occurrence counts, then a low-rank assumption on the word-frequency matrix is natural. For example, the popular PLSI model [5] uses a multinomial probability model parameterized by a low-rank matrix.

Low-rank models are also popular in nonnegative matrix factorization, which is commonly applied in a range of contexts where count data is common. A wide variety of such models have been developed, along with inference algorithms such as expectation maximization, variational Bayes, and Markov chain Monte Carlo [6, 7, 8, 9]. These models and algorithms have been applied to many tasks, especially recommendation systems [10, 11]. However, the algorithms used are nonconvex, and there is little in the way of theoretical guarantees for their performance in the Poisson or multinomial setting.

1.2 Summary of main results for Poisson noise

In our analysis, we assume a Bernoulli sampling of the matrix entries: i.e., the events $\{(i,j)\in\Omega\}$ are independent with probability $p\in(0,1]$ , and the observed Poisson random variables are independent conditioned on $\Omega$ . Note that taking $p=1$ handles the case in which we observe every entry of the matrix. For a matrix $A$ , $\lVert A\rVert_{*}$ , $\lVert A\rVert$ , and $\lVert A\rVert_{F}$ denote the nuclear norm, operator norm, and Frobenius norm of $A$ , respectively.

Let $\operatorname{\mathcal{A}}_{\Omega}\colon\mathbf{R}^{m\times n}\to\mathbf{R}^{\Omega}$ denote the entry-wise sampling operator given by $(\operatorname{\mathcal{A}}_{\Omega}(Z))_{(i,j)}=Z_{ij}$ for $(i,j)\in\Omega$ . Note that its adjoint $\operatorname{\mathcal{A}}^{*}\colon\mathbf{R}^{\Omega}\to\mathbf{R}^{m\times n}$ maps the vector $(x_{(i,j)})_{(i,j)\in\Omega}\in\mathbf{R}^{\Omega}$ to the $m\times n$ matrix whose $(i,j)$ th entry is $x_{(i,j)}$ if $(i,j)\in\Omega$ and zero otherwise.

Given observations $X\sim\operatorname{Poisson}(\operatorname{\mathcal{A}}_{\Omega}(M))$ , we consider several different estimators with similar theoretical properties. The first can be interpreted as a matrix version of the Dantzig selector [12]:

[TABLE]

where $\delta>0$ is a parameter which we will see how to set later. The second is a nuclear-norm-regularized least-squares type estimator:

[TABLE]

where $\lambda>0$ is another parameter which we will set. The third is least-squares under an exact low-rank constraint:

[TABLE]

where $r$ is an upper bound on the rank of the true rate matrix $M$ . This problem is not convex (and is, in general, hard to solve directly while respecting nonnegativity constraints), but we will see later how we can address this issue without affecting its theoretical properties.

Theorem 1, which is the main result of Section 2.1, states that, if $M$ has rank $r$ , and hyperparameters are properly chosen, each of the estimators $\{\widehat{M}^{(i)}\}_{i=1}^{3}$ satisfies, with high probability,

[TABLE]

where

[TABLE]

In many situations (see Section 3.3), the logarithmic terms are negligible, so we can approximate this result by the bound

[TABLE]

Section 3 uses two standard methods to find lower bounds on the minimax risk of any estimator in classes of matrices with bounded row and column sums. These results (Theorems 3 and 4) can be summarized as follows: over all nonnegative matrices $M$ such that $\operatorname{rank}(M)\leq r$ , and $\widetilde{\sigma}(M)\leq\sigma$ , we have

[TABLE]

Thus, Theorem 1 is optimal (up to a multiplicative constant and an additive logarithmic factor) for this class of matrices.

To gain a more intuitive understanding of our result, it is helpful to examine the formula for $\widetilde{\sigma}$ . For simplicity, assume, without loss of generality, that the row sums dominate the column sums, so that

[TABLE]

The two terms inside the sum have different roles. The first term ( $M_{ij}$ ) corresponds to the variance of the Poisson random variables. Indeed, if we take $p=1$ , this is the only term, so our result has the form

[TABLE]

If we do not impose any structure on the model, the maximum likelihood (and least-squares) estimate is $\widehat{M}^{\text{MLE}}=X$ , which has risk

[TABLE]

If every row of $M$ has approximately the same sum, estimators defined above improve on $\widehat{M}^{\text{MLE}}$ (in squared Frobenius error) by a factor of approximately $r/n$ . If the sums of the rows of $M$ differ significantly, the improvement is smaller. However, this should not be too surprising — if the variance in the problem is already concentrated into a smaller sub-matrix, we are effectively solving a smaller problem, and hence the low-rank assumption is less restrictive and, therefore, less beneficial.

The second term (of the form $(1-p)M_{ij}^{2}$ ) in the formula for $\widetilde{\sigma}$ corresponds to the inherent difficulty in estimating the values of a matrix due to the fact that we do not observe every entry. This term in the lower bound applies regardless of the noise model, even when there is no noise. This might seem to contradict existing exact noiseless matrix completion results, but we note here that such results make stronger assumptions (incoherence of the row and column spaces) beyond what we are assuming here. In fact, the matrices used in the proof of Theorem 4 are highly coherent.

Although this second error term is necessary for general matrices, an interesting open problem is whether it could be entirely removed (leaving only the variance term) when we assume additional structure (such as incoherence) on the true rate matrix. Such a result would be a bridge between existing noisy and noiseless matrix completion literature; the existence of exact completion for the noiseless case implies that current results for the noisy case (including this paper) become highly suboptimal when the signal-to-noise ratio goes to infinity. An exception is [3], but we note that this approach is not without its own drawbacks as this approach leads to error rates which are suboptimal with respect to the rank $r$ . More recent work in this direction is [13, 14]; however, these results depend strongly on the condition number of the true matrix, and their dependence on matrix rank seems suboptimal. Thus there is still much work to do in analyzing noisy completion of incoherent matrices.

1.3 Summary of main results for multinomial denoising

We can also derive an interesting result on multinomial matrix denoising as a corollary of our result on Poisson denoising. If $P$ is a non-negative $m\times n$ matrix such that $\sum_{i,j}P_{ij}=1$ , and we independently sample $N$ objects according to the probabilities contained in $P$ , the number of times each entry of $P$ is sampled (which we can denote by an $m\times n$ count matrix $X$ ) has a (matrix) multinomial distribution. Our results on Poisson denoising apply by considering a multinomial distribution to be a vector of independent Poisson variables conditioned on its sum. Corollary 1 in Section 2.2 shows that if $P$ has rank $r$ , and none of the row and column sums of $P$ are too large, there is an estimator $\widehat{P}$ of $P$ (which could be defined similarly to any of the estimators above) such that

[TABLE]

One can easily check that, for the maximum likelihood estimator $\widehat{P}^{\text{MLE}}=X/N$ , $\operatorname{\mathbf{E}}\lVert\widehat{P}^{\text{MLE}}-P\rVert_{F}^{2}\approx N^{-1}$ , so our result (approximately) reduces the squared error by a factor of $r/(m\wedge n)$ , which is the effective rank deficiency.

For a more complete exploration of multinomial matrix denoising, we also consider a model in which our observations are independent multinomial samples from rows of a low-rank matrix. Concretely, our observations are now a matrix $X$ whose rows, which we denote $\{X_{i}\}_{i=1}^{m}$ , are independent and distributed according to $X_{i}\sim\operatorname{Multinomial}(p_{i},N_{i})$ , where $\{p_{i}\}_{i=1}^{m}$ are the rows of a rank- $r$ $m\times n$ matrix $P$ . Theorem 2 states that, under mild conditions on the sums of columns of $P$ , there is an estimator $\widehat{P}$ (defined similarly to those above) such that, with high probability,

[TABLE]

where $D=\operatorname{diag}(N_{1},\dots,N_{m})$ . It is easily checked that the maximum likelihood estimator $\widehat{P}_{\text{MLE}}=D^{-1}X$ has expected error $\operatorname{\mathbf{E}}\lVert D^{1/2}(\widehat{P}^{\text{MLE}}-P)\rVert_{F}^{2}\approx m$ , so we again get a reduction in (squared) error that is (approximately, modulo a logarithmic factor) proportional to the reduction in degrees of freedom.

We do not analyze the multinomial estimation problems from a minimax risk standpoint, but, due to the similarities between the Poisson and multinomial distributions, we suspect that one could find similar matching lower bounds in a similar manner to the Poisson case.

1.4 Computation and implications for general noisy matrix completion

Note that all three of our estimators would be very easy to compute if we discarded the nonnegativity constraint: we could take a singular value decomposition of the matrix $\operatorname{\mathcal{A}}_{\Omega}^{*}(X)$ and then either do singular value soft thresholding (for $\widehat{M}^{(1)}$ and $\widehat{M}^{(2)}$ ) or truncate it to the $r$ largest singular values (for $\widehat{M}^{(3)}$ ).

We claim that ignoring the nonnegativity constraint does not change our analysis or the resulting error bounds at all; this constraint does not appear anywhere in the proof of Theorem 1. Therefore, if computational ability is a limiting factor, we could simply take the more efficient approach of solving without any nonnegativity constraints, and the error bounds we have presented will still apply. Projecting the result onto any convex constraint set that contains $M$ can then only improve the performance.

We also note here that although we have chosen to focus on Poisson noise, our approach is fairly general and could apply to other types of noise. Indeed, if $M$ were an arbitrary (not necessarily nonnegative) matrix, and we make observations of the form $M_{ij}+\xi_{ij}$ for $(i,j)\in\Omega$ , and the $\xi_{ij}$ ’s are zero-mean noise variables with reasonably light tails, we could adapt our arguments to show that, for each of our three estimators,

[TABLE]

The lower bound Theorem 4 is completely independent of the noise distribution; a version of Theorem 3 could be proved for many common distributions.

The above two points have some interesting implications for general matrix completion with noise. Many of the existing algorithms, such as low-rank factorization [15], iterative imputation [16], or the many other algorithms, including (1), that can be expressed as semidefinite programs, are fairly complex. Our results suggest that, not only do simple SVD-based algorithms have theoretical properties that are just as good as current state-of-the-art guarantees for more complex algorithms, but that, in a minimax error sense, it is impossible to do any better.

This realization does not, however, imply that there is no value to more sophisticated algorithms. As mentioned earlier, how well we can exploit incoherence in noisy matrix completion remains an important open question, and the matrices used in the proof of the minimax lower bound Theorem 4 are highly coherent. Therefore, it is likely that more sophisticated algorithms are still beneficial when trying to recover non-pathological (i.e., incoherent) matrices.

1.5 Comparison to prior work

There are several categories of existing literature to which we can compare our results. Some papers explicitly consider Poisson noise, using a maximum-likelihood framework. [17] consider nuclear-norm penalized maximum likelihood for matrices contained in a nuclear norm ball (rather than exactly low-rank matrices). This approach uses an empirical process argument to bound the Kullback-Leibler divergence between the true and predicted distributions. This argument requires a Lipschitz condition on the log-likelihood function, which, for the Poisson distribution, requires imposing a lower bound on the rates. [18] and [19] consider a penalized maximum likelihood estimator from a carefully-chosen finite set of candidates (which is exponentially large in the size of the problem and hence computationally intractable). The matrices considered have a non-negative low-rank factorization (with a particular emphasis on the case when one factor is sparse). They use an information-theoretic argument to bound the expected error in terms of Bhattacharyya distance. The result of [18], which applies to matrix completion, requires imposing a lower bound on the rates, while that of [19], which considers only denoising, does not. All three papers find an upper bound on Frobenius error in terms of the statistical error metrics that they originally bound.

Other, more general approaches, are designed specifically with Frobenius-norm error in mind. One class of methods uses “restricted strong convexity” arguments, which were introduced by [1]. These methods rely on approximating the Frobenius norm in certain restricted classes of matrices (in which the error matrix must fall) using only samples of the entries. These methods lead to simple and elegant proofs, but the concentration inequalities on which they rely require imposing uniform upper bounds in magnitude on both the true matrix entries and the estimator entries. Other recent papers which use this type of argument include [20, 21, 22]. Another interesting paper which uses learning theory arguments to achieve a similar result is [23].

Another class of methods is direct singular value thresholding. Papers with this approach include [2] (which inspired our approach) as well as [24, 25, 26]. These methods are very simple and lend themselves to simple proofs.

An interesting blend of techniques can be seen in the papers [27, 28, 29], which combine some of the general approaches mentioned above with maximum likelihood estimation for exponential families of distributions. These methods, like those in [17] and [18], are difficult to apply to the Poisson distribution without imposing a lower bound on rates because, as the mean $\lambda$ of the distribution goes to [math], the “natural parameter” $\log\lambda$ goes to $-\infty$ , whereas the general methods used require parameters to be bounded. They also require (approximate) low rank in the matrix of natural parameters. In the Poisson case, this is equivalent to assuming a bound on the rank of the matrix $[\log M_{ij}]$ of elementwise logarithms of the means, which is somewhat non-standard, and certainly not the same as bounding the rank of the original matrix $M$ .

There is some previous work on the denoising problem in terms of Poisson or exponential family principal component analysis (PCA). Papers in this area include [30], which recommends maximum likelihood approaches; [31], which uses variational Bayesian inference; and [32], which uses a singular value shrinkage algorithm on the means (much like we do). The recent preprint [33] examines a variety of models with random scaling factors and nonlinearities. These papers do not contain theoretical results applicable to our problem, however. [32] is related to [34], which contains an asymptotic analysis of a similar singular value shrinkage algorithm for more general problems (including matrix completion). Another work in this area is [35], which contains consistency results for low-dimensional subspace recovery.

Most of the papers mentioned above do not find error bounds which explicitly depend on the “true” rate matrix; rather, they find uniform upper bounds for classes of structured matrices with uniform upper (and, sometimes, lower) bounds on the entries. To compare our results directly to this literature, we consider what we obtain when we only impose a uniform upper and lower bounds (by, say $\lambda_{\max}$ and $\lambda_{\min}$ ) on the matrix entries. The approximate bound of (5) reduces to

[TABLE]

where we have assumed, without loss of generality, that $m\geq n$ . Previous results show similar error rates in terms of matrix dimensions for exactly low-rank matrices. For example, [18] establishes a bound of

[TABLE]

which provides a similar dependence on $r$ , $m$ , and $p$ , but with an additional logarithmic term and a worse dependence on the minimum and maximum matrix values. In a slightly different setting, [17] shows that for matrices in the nuclear norm ball of radius $\lambda_{\max}\sqrt{rmn}$ (which is a convex relaxation of the exact low-rank constraint), we instead obtain (ignoring logarithmic terms and a complicated but severe dependence on $\lambda_{\max}$ and $\lambda_{\min}$ ) an error bound of

[TABLE]

where $p$ is now the number of samples for entry in a uniform-at-random sampling model. The different dependence on $r$ and $p$ is interesting, but, if one compares it to results in linear regression over $\ell_{1}$ balls (see, e.g., [36]), the rate given is perhaps not surprising. To compare to some of the more general methods mentioned above, we note that, if we consider the generalization of our method mentioned in Section 1.4, and we assume a uniform upper bound on the magnitudes of matrix elements and the noise variances, our results are comparable to [20] (albeit under less-strict assumptions).

As noted earlier, our paper uses fairly general matrix completion methods. For the Frobenius norm error metric, this gives us an advantage over more distribution-specific approaches such as [17, 18, 27, 28], in part because we do not have to approximate the Frobenius-norm error by a statistical divergence measure or by a norm in a transformed parameter space. Our results also do not suffer from the fact that a Poisson distribution’s likelihood function is ill-conditioned for very small rates. In addition, our results avoid a multiplicative logarithmic factor that appears in much of the previous literature (replacing it with an additive factor that is often negligible); this achievement (which also appears in [20]) is almost entirely due to the use of recent results in bounding the operator norm of a random matrix (such as [37]).

Finally, much of the previous literature in the Poisson case (from those mentioned above, [27, 17, 18]) finds lower bounds on minimax risk in certain classes of matrices. Although these lower bounds have the same large-scale error rate (in terms of the rank and dimensions of the matrix and the number of samples) as the corresponding upper bounds, they differ from the upper bounds by factors that are logarithmic in the problem size and that depend on the ratio of largest to smallest allowable rates. To our knowledge, the results in this paper are the first for noisy low-rank matrix completion in which the minimax rate for large classes of matrices is found to within a universal constant.

We are aware of much less theoretical work for low-rank denoising. One recent work that is worth noting is [38]. This paper shows that, in the case of a matrix multinomial distribution, one can achieve a tight error bound in $\ell_{1}$ distance (sum of absolute values of entries) for certain factorizable probability matrices using $O(mr^{4})$ samples. It is difficult to compare this directly to our result, since it is in the much stronger $\ell_{1}$ norm, but we note that, in a similar fashion as many of the results on Poisson observations, this paper also relies on lower bounding certain sums of entries in the factor matrices. Other theoretical work on topic modeling includes [39, 40, 41, 42, 43].

We add a final caveat to our results by noting that $\lVert\widehat{M}-M\rVert_{F}$ might not always be the most appropriate error metric; for example, there is a much larger difference qualitatively (and quantitatively, if we use an appropriate statistical divergence) between Poisson distributions of means 0 and 10 than between Poisson distributions of means 100 and 110. We see a similar disconnect between squared error and other probabilistic metrics in the case of the multinomial distribution. Further investigation of distribution-specific methods (such as maximum likelihood) that yield bounds in more statistically-motivated metrics is thus certainly warranted.

1.6 Paper outline

The remainder of this paper is organized as follows. Section 2 contains the the formal statements and proofs of the upper bounds on error in this paper. Section 3 contains the statements and proofs of two separate minimax lower bounds which, when combined, yield the matching lower bound to (5). Section 3.3 also discusses briefly some situations where the approximation of (5) is accurate, and thus the upper and lower bounds match within a universal constant.

2 Upper bounds

2.1 Poisson noise

This section is dedicated to proving the main result of this paper, which is the following theorem.

Theorem 1.

Let $M$ be a non-negative $m\times n$ matrix with rank $r$ . Let $\lambda_{\text{max}}=\max_{ij}M_{ij}$ , and let

[TABLE]

Suppose $\Omega\subset\{1,\dots,m\}\times\{1,\dots,n\}$ is chosen according to a Bernoulli sampling model with sampling probability $p$ , and suppose, conditionally on $\Omega$ , $X\sim\operatorname{Poisson}(\operatorname{\mathcal{A}}_{\Omega}(M))$ . Set $\epsilon\in(0,1/2)$ , and let

[TABLE]

where $C$ is a universal constant.

Then, with probability at least $1-2\epsilon$ , if $\delta\geq A(M,p,\epsilon)$ and $\lambda\geq 2pA(M,p,\epsilon)$ , we also have

[TABLE]

and

[TABLE]

Moreover, we also have that with probability at least $1-2\epsilon$ ,

[TABLE]

The result follows from a series of lemmas. The first steps in upper bounding the error are the following (deterministic) results.

Lemma 1.

Suppose $M$ is a rank- $r$ matrix such that $\lVert\operatorname{\mathcal{A}}_{\Omega}^{*}(X)-pM\rVert\leq\delta=\frac{\lambda}{2p}$ . Then, for $i\in\{1,2\}$ ,

[TABLE]

Lemma 2.

Suppose $M$ is a rank- $r$ matrix. Then

[TABLE]

Proof of Lemma 1.

Let $M=U\Sigma V^{*}$ be the singular value decomposition of $M$ , where $U\in\mathbf{R}^{m\times r}$ and $V\in\mathbf{R}^{n\times r}$ are such that $U^{*}U=V^{*}V=I_{r}$ , and $\Sigma$ is an $r\times r$ diagonal matrix with positive entries on the diagonal. Let $T$ be the subspace of $\mathbf{R}^{m\times n}$ spanned by matrices of the form $UA$ and $BV^{T}$ for arbitrary matrices $A\in\mathbf{R}^{r\times n}$ and $B\in\mathbf{R}^{m\times r}$ . We denote by $\operatorname{\mathcal{P}}_{T}$ and $\operatorname{\mathcal{P}}_{T^{\perp}}$ , respectively, the orthogonal projections onto $T$ and its orthogonal complement $T^{\perp}$ .

Denote $H^{(1)}=\widehat{M}^{(1)}-M$ . Because $M$ is feasible, and the nuclear norm is a convex function, we have

[TABLE]

where $Z\in\partial\lVert M\rVert_{*}$ is any subgradient of the nuclear norm function at the point $M$ . Such a subgradient must have the form

[TABLE]

where $W$ is an arbitrary matrix with $\lVert W\rVert\leq 1$ . By the duality of the nuclear norm and operator norm, we can choose $W$ so that $\langle W,\operatorname{\mathcal{P}}_{T^{\perp}}(H^{(1)})\rangle=\lVert\operatorname{\mathcal{P}}_{T^{\perp}}(H^{(1)})\rVert_{*}$ . We then have

[TABLE]

where the last inequality follows from the fact that $\lVert UV^{*}\rVert=1$ . We therefore have

[TABLE]

Hence

[TABLE]

where the second inequality follows from the fact that any element of $T$ has rank at most $2r$ .

By the triangle inequality and the homogeneity of the operator norm, we also have

[TABLE]

Thus

[TABLE]

and the first part of the result immediately follows.

The proof of the second part is similar; letting $H^{(2)}=\widehat{M}^{(2)}-M$ , we now have, by the optimality of $\widehat{M}$ ,

[TABLE]

Noting, as before, that $\lVert\widehat{M}^{(2)}\rVert_{*}-\lVert M\rVert_{*}\geq\lVert\operatorname{\mathcal{P}}_{T^{\perp}}(H^{(2)})\rVert_{*}-\lVert\operatorname{\mathcal{P}}_{T}(H^{(2)})\rVert_{*}$ , and that

[TABLE]

we have

[TABLE]

∎

Proof of Lemma 2.

Because both $M$ and $\widehat{M}^{(3)}$ have rank at most $r$ , $H^{(3)}=\widehat{M}^{(3)}-M$ has rank at most $2r$ . The optimality of $\widehat{M}^{(3)}$ implies

[TABLE]

so

[TABLE]

∎

The remainder of the work is to show that $\lVert\operatorname{\mathcal{A}}_{\Omega}^{*}(X)-pM\rVert\leq A(M,p,\epsilon)$ with probability at least $1-2\epsilon$ . We will use the following fundamental lemma, which was originally proved by [37] and appears with a slightly improved constant in [44].

Lemma 3 (Theorem 4.9 and Remark 4.11 in [44]).

Let $X$ be a random $m\times n$ matrix whose entries are independent, centered, and almost surely bounded in absolute value by a constant $b$ . Let

[TABLE]

Then

[TABLE]

where $C_{0}$ is a universal constant.

Poisson random variables are clearly unbounded, so Lemma 3 does not directly apply. The following technical lemma allows us to extend the result to the case of random variables with sub-exponential tails.

Lemma 4.

Let $X$ be a random $m\times n$ matrix whose entries are independent and centered, and suppose that for some $v,t_{0}>0$ , we have, for all $t\geq t_{0}$ ,

[TABLE]

Let $\epsilon\in(0,1/2)$ , and let

[TABLE]

Then

[TABLE]

where $\sigma$ and $C_{0}$ are the same as in Lemma 3.

Proof.

First, note that, by a union bound,

[TABLE]

Consider the truncation $X^{K}=[X^{K}_{ij}]$ , where $X^{K}_{ij}=X_{ij}\mathbf{1}_{\{\lvert X_{ij}\rvert\leq K\}}$ . Note that

[TABLE]

Let $\widetilde{X}^{K}=X^{K}-\operatorname{\mathbf{E}}X^{K}$ be the centered version of $X^{K}$ . Clearly, $\operatorname{\mathbf{E}}(\widetilde{X}^{K}_{ij})^{2}\leq\operatorname{\mathbf{E}}X_{ij}^{2}$ , and $\lvert\widetilde{X}^{K}_{ij}\rvert\leq 2K$ . Then, by Lemma 3,

[TABLE]

Furthermore, with probability at least $1-\epsilon$ ,

[TABLE]

and the result follows. ∎

To apply this result, we need a subexponential tail bound for the Poisson distribution.

Lemma 5.

Let $X\sim\operatorname{Poisson}(\lambda)$ . Then

[TABLE]

For $t\geq\lambda$ ,

[TABLE]

The first inequality can be established by approximating the Poisson distribution with mean $\lambda$ as the sum of $k$ Bernoulli random variables with mean $\lambda/k$ , applying Bernstein’s inequality, and taking $k\to\infty$ . The idea for this argument was suggested by an exercise in [45].

Going back to our original problem, we need to bound the operator norm of $Z=\operatorname{\mathcal{A}}_{\Omega}^{*}(X)-pM$ . Note that since we are using a Bernoulli sampling model, the entries of $Z$ are independent. Let $\lambda_{\text{max}}=\max_{i,j}M_{ij}$ . Note that for every $(i,j)$ , $\operatorname{\mathbf{E}}Z_{ij}=0$ ,

[TABLE]

and, for $t\geq 2\lambda_{\text{max}}$ ,

[TABLE]

Then, by Lemma 4, we have, for $\epsilon\in(0,1/2)$ ,

[TABLE]

where

[TABLE]

and $\sigma$ is defined as before. To calculate $\sigma$ in terms of $p$ and $M$ , we note that

[TABLE]

Therefore, we can calculate

[TABLE]

2.2 Corollary on multinomial estimation

Here we prove a corollary of Theorem 1, showing how it implies a result for estimating the low-rank-matrix parameter of a matrix multinomial distribution. For the sake of brevity, we only consider the analogue of $\widehat{M}^{(1)}$ .

Corollary 1.

Let $P$ be a nonnegative $m\times n$ matrix with rank $r$ such that $\sum_{i,j}P_{ij}=1$ . Suppose, furthermore, that we have

[TABLE]

for some constants $a,b\geq 1$ , and that $\max_{i,j}P_{ij}\leq c$ . Let $N$ be a positive integer, and suppose that $X\sim\operatorname{Multinomial}(P,N)$ . Let $\epsilon\in(0,1)$ , choose $\delta>0$ such that

[TABLE]

and let

[TABLE]

Then, with probability at least $1-\epsilon$ ,

[TABLE]

As in the Poisson case, there are many situations where the additive logarithmic term in the definition of $\delta$ is negligible, and we have

[TABLE]

Proof of Corollary 1.

Suppose $Y\sim\operatorname{Poisson}(NP)$ . Let $\epsilon^{\prime}=\epsilon/2e\sqrt{N}$ . Define the event $A=\{\lVert\widehat{P}^{\delta}(Y)-P\rVert_{F}\leq 4\sqrt{2r}\delta\}$ . Theorem 1 (with $p=1$ ) implies that $\operatorname{\mathbf{P}}(A)\geq 1-2\epsilon^{\prime}=1-\epsilon/e\sqrt{N}$ . Note that the additional constraints in the optimization problem do not affect this fact, since $P$ , by definition, meets these constraints.

$X$ has the same distribution as $Y$ conditioned on $B=\{\sum_{i,j}Y_{ij}=N\}$ , so it suffices to show that the probability of $A$ conditioned on this event is at least $1-\epsilon$ . Indeed, note that $\sum_{i,j}Y_{ij}\sim\operatorname{Poisson}(N)$ , so

[TABLE]

by Stirling’s approximation. Then, by Bayes’ rule,

[TABLE]

∎

2.3 Multinomial denoising with independent rows

We now consider a slightly different setting for multinomial estimation. Here, we consider a model in which we observe a collection of independent multinomial random variables, each of which has distribution parametrized by a row in a low-rank matrix.

Theorem 2.

Let $X_{1},\dots,X_{m}$ be independent multinomial random variables, with

[TABLE]

where, for each $i$ , $N_{i}\geq 1$ is an integer, and $p_{i}=(p_{i1},\dots,p_{in})$ is a vector of probabilities. Denote the matrices

[TABLE]

where each $X_{i}$ and $p_{i}$ are considered row vectors. Define the estimator

[TABLE]

Let $D_{\mathrm{min}}=\min_{i}D_{i}$ , and choose $\delta$ such that

[TABLE]

Then, with probability at least $1-\epsilon$ ,

[TABLE]

where $r$ is the rank of $P$ .

Our approach uses the following matrix Bernstein inequality.

Lemma 6 ([46, Theorem 6.1.1]).

Suppose $Z=\sum_{k}S_{k}\in\mathbf{R}^{m\times n}$ , where $\{S_{k}\}_{k}$ is a finite sequence of independent, zero-mean random matrices. Suppose that for some constant $L>0$ , for each $k$ , $\lVert S_{k}\rVert\leq L$ almost surely. Let

[TABLE]

Then, for $t\geq 0$ ,

[TABLE]

Proof of Theorem 2.

On the event that $\lVert D^{-1/2}(X-DP)\rVert\leq\delta$ , we have $\lVert D^{1/2}(\widehat{P}^{\delta}-P)\rVert\leq 2\delta$ , and the result follows by the same steps as in the proof of Theorem 1.

To find a bound on the operator norm of $Z=D^{-1/2}(X-DP)$ , we apply Lemma 6, noting that $D^{-1/2}(X-DP)$ is the sum of $\sum_{i=1}^{m}D_{i}$ random, zero-mean matrices of the form $\frac{1}{\sqrt{D_{i}}}e_{i}u_{ij}^{T}$ , where $e_{i}$ is the $i^{\text{th}}$ standard basis element in $\mathbf{R}^{m}$ , and $u_{ij}$ is a vector in $\mathbf{R}^{n}$ that is equal to $e_{k}$ with probability $p_{ik}$ .

One can verify that

[TABLE]

where $I_{m}$ denotes the $m\times m$ identity matrix, and that

[TABLE]

where $p_{i}=(p_{i1},\dots,p_{in})$ is the $i^{\text{th}}$ row of $p$ . We can then take $v\leq\max\{1,\max_{j}\sum_{i}p_{ij}\}$ . Clearly, each term in the sum has operator norm bounded above by $L=1/\sqrt{D_{\mathrm{min}}}$ .

Some algebraic manipulation of the result of Lemma 6 implies that, for $\epsilon\in(0,1)$ , we have, with probability at least $1-\epsilon$ ,

[TABLE]

which establishes the result.

∎

3 Minimax lower bounds

In this section, we show that the rate in (5) is optimal (within a multiplicative constant) in the sense of minimax risk. We do this in two parts; the two minimax lower bounds derived in the next two sections, when combined, match the rate in (5) is optimal.

3.1 First lower bound

In the Poisson upper error bound, $\widetilde{\sigma}$ is partially determined by the maximal row and column sums of the rate matrix $M$ , which we can think of as the maximal variance of any row or column (without sampling a subset of the entries). Our first lower bound shows that we cannot improve on this term:

Theorem 3.

Let $r$ , $k$ , and $\ell$ , be positive integers, and take $m=rk$ , $n=r\ell$ . Let $\lambda_{\max}\geq 1/8\ell p$ , set $\sigma_{1}^{2}=k\lambda_{\max}$ , and let

[TABLE]

Then, under a Bernoulli sampling model with sampling probability $p$ ,

[TABLE]

Proof.

Assume, without loss of generality, that $k\geq\ell$ . We use a variant of Fano’s method. We first find a large hypercube of matrices, and then we use the fact that we can find a large subset that is well-separated.

For $i\in\{1,\dots,m\}$ , let

[TABLE]

where $q\in\{1,\dots,r\}$ is the unique integer such that $i\in\{k(q-1)+1,\dots,kq\}$ . For $\lambda_{0},\lambda_{1}$ to be chosen later, we define, for $\theta\in\{0,1\}^{m}$ , the block-diagonal matrix $M_{\theta}\in S$ by

[TABLE]

The nonzero elements of the $i^{\text{th}}$ row of $M_{\theta}$ are all either $\lambda_{0}$ or $\lambda_{1}$ , depending on the value of $\theta_{i}$ .

By a combinatorial argument (see, e.g., [47]), one can show that there exists $\Theta\subset\{0,1\}^{m}$ such that $\operatorname{card}(\Theta)\geq e^{m/8}$ and, for all distinct $\theta,\theta^{\prime}\in\Theta$ , the Hamming distance $d_{H}(\theta,\theta^{\prime})\geq m/4$ .

Take $\lambda_{0}=\lambda_{\max}/2-\delta$ and $\lambda_{1}=\lambda_{\max}/2+\delta$ , where $\delta\leq\lambda_{\max}/2$ is a constant to be chosen later. Note that for all distinct $\theta,\theta^{\prime}\in\Theta$ , we have

[TABLE]

We denote by $P_{\theta}$ the distribution of $\operatorname{\mathcal{P}}_{\Omega}(X)$ when $X\sim\operatorname{Poisson}(M_{\theta})$ , and $\Omega$ is independently chosen from the Bernoulli sampling model with parameter $p$ . From Fano’s inequality from information theory, one can derive (see, e.g., [47]) a lower bound on the probability of an estimator’s error exceeding half the distance between points indexed by $\Theta$ :

[TABLE]

where $\phi$ denotes a test taking values in $\Theta$ , and $Q$ is any probability distribution on $\mathbf{R}^{m\times n}$ . We take $Q$ to be the distribution generated in the same manner as each $P_{\theta}$ , simply with $\lambda_{1}$ and $\lambda_{2}$ replaced by $\lambda_{\max}/2$ .

Note that the Kullback-Leibler divergence between two Poisson distributions with rates $\lambda$ and $\lambda^{\prime}$ is

[TABLE]

Therefore, for any $\theta\in\Theta$ ,

[TABLE]

Take

[TABLE]

Then

[TABLE]

and half the separation between points indexed by $\Theta$ is at least

[TABLE]

∎

3.2 Second lower bound

The previous theorem relies on the fact that the observations are conditionally Poisson. The next result, which provides the second part of a matching lower bound to (5), does not depend on the conditional distribution of the observations, and instead shows a fundamental limit in inferring missing matrix entries.

Theorem 4.

Take again $m=rk$ , $n=r\ell$ . Set $\sigma_{2}^{2}=k\lambda_{\max}^{2}$ . Let

[TABLE]

Suppose $p\geq\frac{1}{2(k\wedge\ell)}=\frac{r}{2(m\wedge n)}$ . Then, under a Bernoulli sampling model with probability $p$ (with any conditional distribution on the observations),

[TABLE]

Proof.

Again, we assume that $k\geq\ell$ . We first prove the lower bound with the first item in the maximum. For notational simplicity, we can assume that $1/2p$ is an integer. Furthermore, we can assume that $\ell=1/2p$ , since decreasing the number of columns does not increase risk.

We consider a set of matrices $\{M_{\theta}\}_{\theta\in\{0,1\}^{m}}$ with the same structure as in the proof of Theorem 3, but now, we take $\lambda_{0}=0$ , $\lambda_{1}=\lambda_{\max}$ .

Assouad’s lemma (see, e.g., [48] or [47]) gives a lower bound on the Bayes risk of an estimator $\widehat{M}$ for a uniform prior on $\{0,1\}^{m}$ :

[TABLE]

where $\phi\colon\mathbf{R}^{m\times n}\to\{0,1\}$ is a test, and, for $\theta\in\{0,1\}^{m}$ , $\theta^{i}$ denotes the element of $\{0,1\}^{m}$ that is equal to $\theta$ except in the $i^{\text{th}}$ position.

Denote by $P_{\theta}^{i}$ the marginal distribution of the $i^{\text{th}}$ row of a matrix with distribution $P_{\theta}$ . The minimal testing risk which appears in the sum is equal to the $L_{1}$ norm of the minimum of the densities of $P_{\theta}^{i}$ and $P_{\theta^{i}}^{i}$ , which measures how much the distributions overlap. Thus we have

[TABLE]

where the first inequality is due to the fact that, with probability $(1-p)^{\ell}$ , no entry from the $i^{\text{th}}$ row of $M$ is observed.

Then

[TABLE]

The result follows from the fact that minimax risk always exceeds Bayes risk.

A simple modification with $\ell=1$ yields the result for the second term in the maximum. ∎

We note here that we could also use a similar argument as in the proof of Theorem 3 to get a high-probability lower bound on error (with a somewhat worse constant). For example, setting $Q=\frac{P_{(0,\dots,0)}+P_{(1,\dots,1)}}{2}$ , it is easily verified that the resulting Kullback-Leibler divergences $\operatorname{D_{KL}}(P_{\theta}\>\|\>Q)$ can be upper bounded by a sum of coordinate-wise total variation distance.

3.3 When do the upper and lower bounds match?

Within multiplicative constants, the lower bounds of Theorems 3 and 4 match the approximate upper bound of (5). We must therefore consider when the approximation in (5) is accurate.

Finding technical conditions that guarantee matching rates is not something we think likely to be very instructive at this point, especially since a different proof technique could potentially change the logarithmic term in $A(M,p,\epsilon))$ . However, we think it is helpful to look at the matrices involved in the proofs of Theorems 3 and 4. Note that when the bounds match for those particular matrices, the minimax error rate bounds are tight for the matrix classes considered in those proofs.

For these matrices (assuming that $m\geq n$ ),

[TABLE]

For this term to dominate (6), we must have

[TABLE]

For example, if $\lambda_{\max}\geq\log m$ , it would suffice to take

[TABLE]

which is a standard condition in noiseless matrix completion. If $\lambda_{\max}\leq\log m$ , it would suffice to take

[TABLE]

4 Conclusion and future work

In this paper, we have derived an upper bound in Frobenius norm error for an estimator for Poisson matrix completion, and we have derived a minimax lower bound that matches this upper bound (within a universal constant) for many classes of nonnegative rate matrices. We have also derived similar upper bounds for error in two types of multinomial matrix denoising problems. The estimators we use are computationally tractable, and require significantly fewer assumptions on the underlying matrix than previous results in the literature. Significantly, we impose no lower bounds on the entries of the underlying matrix. This is crucial in many applications (such as topic modelling) where zero or very small means can be relatively common.

Because we have found upper and lower error bounds in Frobenius norm, the only theoretical improvement remaining for this model and error metric in general classes of matrices is to try to relax the conditions under which the bounds match (although, as we have seen, they are not too restrictive now). This could potentially come about by reducing the logarithmic term in (6) and/or by finding a logarithmic term to add to the minimax lower bounds. One could also further examine how to obtain better error rates for more restrictive classes of matrices, such as incoherent matrices.

It would also be interesting to extend the results presented here to matrices that are not exactly low-rank, but are instead “approximately low-rank”; for example, we could consider matrices which are contained in Schatten balls (which, for $q\in[0,1]$ , are sets of matrices for which $\sum_{i}\sigma_{i}^{q}\leq R$ , where $\{\sigma_{i}\}$ is the set of singular values). As mentioned previously, [17] used the Schatten $1$ -norm ( $q=1$ , or nuclear norm ball); [1] also examined these classes of matrices.

Another avenue of research would be to examine structured Poisson or multinomial estimation under different, more statistically motivated error metrics. Maximum likelihood methods seem more suitable here than least-squares, but analysis of maximum likelihood estimators has proved difficult for the reasons outlined in Section 1.5. It is not clear what kind of structure would be relevant in a different error metric. Low-rank structure seems to work well with a least-squares error framework, but there is a priori not much reason to think that it would work similarly well for another metric; for example, the Bhattacharyya distance between Poisson distributions, is proportional to the (squared) $\ell_{2}$ distance between the square roots of the rates, but the element-wise square root of a low-rank matrix is not, in general, low rank. Thus, this approach may not immediately bear much fruit. However, an analysis of matrix estimation under alternative error metrics remains an important area for future research.

Acknowledgments

This work was supported, in part, by NSF grant CCF-1350616 and a gift from the Alfred P. Sloan Foundation.

Bibliography48

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Sahand Negahban and Martin Wainwright “Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise” In J. Mach. Learn. Res. 13 , 2012, pp. 1665–1697
2[2] Vladimir Koltchinskii, Karim Lounici and Alexandre Tsybakov “Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion” In Ann. Stat. 39.5 , 2011, pp. 2302–2329 DOI: 10.1214/11-AOS 894 · doi ↗
3[3] Emmanuel J. Candès and Yaniv Plan “Matrix Completion With Noise” In Proc. IEEE 98.6 , 2010, pp. 925–936 DOI: 10.1109/JPROC.2009.2035722 · doi ↗
4[4] Xiaowei Zhou, Can Yang, Hongyu Zhao and Weichuan Yu “Low-Rank Modeling and its Applications in Image Analysis” In ACM Comput. Surv. 47.2 , 2014, pp. 36:1–36:33 DOI: 10.1145/2674559 · doi ↗
5[5] Thomas Hofmann “Probabilistic Latent Semantic Indexing” In Proc. ACM SIGIR Conf. (SIGIR) , 1999 DOI: 10.1145/312624.312649 · doi ↗
6[6] Daniel Lee and H. Seung “Learning the Parts of Objects by Non-negative Matrix Factorization” In Nature 401 , 1999, pp. 788–791
7[7] John Canny “Ga P: A Factor Model for Discrete Data” In Proc. ACM SIGIR Conf. (SIGIR) , 2004 DOI: 10.1145/1008992.1009016 · doi ↗
8[8] Ali Cemgil “Bayesian Inference for Nonnegative Matrix Factorisation Models” In Comput. Intell. Neurosci. , 2009 DOI: 10.1155/2009/785152 · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Low-rank matrix completion and denoising under Poisson noise

Abstract

1 Introduction

1.1 Low-rank models for count data

1.2 Summary of main results for Poisson noise

1.3 Summary of main results for multinomial denoising

1.4 Computation and implications for general noisy matrix completion

1.5 Comparison to prior work

1.6 Paper outline

2 Upper bounds

2.1 Poisson noise

Theorem 1**.**

Lemma 1**.**

Lemma 2**.**

Proof of Lemma 1.

Proof of Lemma 2.

Lemma 3** (Theorem 4.9 and Remark 4.11 in [44]).**

Lemma 4**.**

Proof.

Lemma 5**.**

2.2 Corollary on multinomial estimation

Corollary 1**.**

Proof of Corollary 1.

2.3 Multinomial denoising with independent rows

Theorem 2**.**

Lemma 6** ([46, Theorem 6.1.1]).**

Proof of Theorem 2.

3 Minimax lower bounds

3.1 First lower bound

Theorem 3**.**

Proof.

3.2 Second lower bound

Theorem 4**.**

Proof.

3.3 When do the upper and lower bounds match?

4 Conclusion and future work

Acknowledgments

Theorem 1.

Lemma 1.

Lemma 2.

Lemma 3 (Theorem 4.9 and Remark 4.11 in [44]).

Lemma 4.

Lemma 5.

Corollary 1.

Theorem 2.

Lemma 6 ([46, Theorem 6.1.1]).

Theorem 3.

Theorem 4.