On Inferences from Completed Data

Jamie Haddock; Denali Molitor; Deanna Needell; Sneha Sambandam; Joy Song; Simon Sun

arXiv:1907.03028·math.ST·July 9, 2019


TL;DR

This paper studies how errors from matrix completion impact statistical inference, providing error bounds and demonstrating that perfect matrix recovery isn't always necessary for accurate inference.

Contribution

It introduces recovery error bounds for statistical inference based on matrix completion and analyzes the effects of approximate recovery in practical scenarios.

Findings

01

Error bounds depend on matrix recovery error.

02

Exact matrix recovery isn't always needed for accurate inference.

03

Numerical experiments confirm theoretical insights.

Abstract

Matrix completion has become an extremely important technique as data scientists are routinely faced with large, incomplete datasets on which they wish to perform statistical inferences. We investigate how error introduced via matrix completion affects statistical inference. Furthermore, we prove recovery error bounds which depend upon the matrix recovery error for several common statistical inferences. We consider matrix recovery via nuclear norm minimization and a variant, $\ell_{1}$-regularized nuclear norm minimization, for data with a structured sampling pattern. Finally, we run a series of numerical experiments on synthetic data and real patient surveys from MyLymeData, which illustrate the relationship between inference recovery error and matrix recovery error. These results indicate that exact matrix recovery is often not necessary to achieve small inference recovery error.

Equations

$$\widetilde{\mathbf{M}}=\operatorname*{argmin}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\|\mathbf{X}\|_{*}\quad\text{s.t.}\quad M_{ij}=X_{ij}\ \text{for all}\ (i,j)\in\Omega.$$

$$\widetilde{\mathbf{M}}=\operatorname*{argmin}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\|\mathbf{X}\|_{*}+\alpha\|\mathbf{X}_{\Omega^{C}}\|_{1}\quad\text{s.t.}\quad M_{ij}=X_{ij}\ \text{for all}\ (i,j)\in\Omega$$

$$|\bar{\lambda}(\mathbf{M})-\bar{\lambda}(\widetilde{\mathbf{M}})|\leq(mn)^{-\frac{1}{q}}\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{q}$$

$$\|\mu(\mathbf{M})-\mu(\widetilde{\mathbf{M}})\|_{q}\leq\left(\frac{n^{q-1}}{m}\right)^{\frac{1}{q}}\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{q}$$

$$|\bar{\lambda}(\mathbf{A})|\leq\frac{1}{mn}\|\mathbf{A}\|_{1}\leq(mn)^{-\frac{1}{q}}\|\mathbf{A}\|_{q}$$

$$\|\mu(\mathbf{A})\|_{q}^{q}\leq\frac{\|\mathbf{A}\|_{1}^{q}}{m^{q}}\leq\frac{n^{q-1}\|\mathbf{A}\|_{q}^{q}}{m}$$

$$\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}\leq 2\sqrt{r^{2}\sigma_{1}^{2}-\|\mathbf{M}_{\Omega}\|_{F}^{2}}.$$

$$\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}^{2}=2\left(\|\mathbf{M}\|_{F}^{2}+\|\widetilde{\mathbf{M}}\|_{F}^{2}\right)-\|\mathbf{M}+\widetilde{\mathbf{M}}\|_{F}^{2}.$$

$$\|\mathbf{M}\|_{F}^{2}=\|(\sigma_{1},\sigma_{2},\dots,\sigma_{r})\|_{2}^{2}\leq r^{2}\sigma_{1}^{2}.$$

$$\|\widetilde{\mathbf{M}}\|_{F}^{2}\leq\|\widetilde{\mathbf{M}}\|_{*}^{2}\leq\|\mathbf{M}\|_{*}^{2}=\|(\sigma_{1},\sigma_{2},\dots,\sigma_{r})\|_{1}^{2}\leq r^{2}\sigma_{1}^{2}.$$


Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Statistical Methods and Inference · Distributed Sensor Networks and Detection Algorithms

Full text

On Inferences from Completed Data

Jamie Haddock, Department of Mathematics, UCLA, Los Angeles, CA.

Denali Molitor, Department of Mathematics, UCLA, Los Angeles, CA.

Deanna Needell, Department of Mathematics, UCLA, Los Angeles, CA.

Sneha Sambandam, UCLA, Los Angeles, CA.

Joy Song, Tsinghua University, Beijing, China.

Simon Sun, Peking University, Beijing, China.


I Background and Motivation

Real-world data is often high-dimensional and incomplete; e.g., a survey may be incomplete because respondents skip questions or as a consequence of the structure of the survey. In recent years, much work has been invested in developing efficient and accurate methods for data completion [4, 8, 9, 10]. Often, however, data practitioners are interested not in any particular missing entry or in the completed data itself, but in performing statistical inferences on the completed data set (e.g., entrywise mean, linear regression, support vector machines) [1]. For this reason, we study how missing and artificially completed data introduce error into the recovery of statistical inferences.

In general, incomplete data can be modeled as a matrix with subsampled entries. In typical matrix completion results, entries are assumed to be uniformly sampled. We expect this to be the easiest setting to analyze mathematically. Unfortunately, this assumption is invalid in many practical situations. We consider various sampling strategies which select certain entries from a complete matrix to construct an incomplete matrix. Entries of a data matrix could be selected using uniform sampling; that is, each entry could be sampled with equal probability as in [3]. On the other hand, one could employ structured sampling and select entries with probability dependent upon their value as in [7]. The details of these two sampling methods are given in Section I-B. Such strategies can be used to model the ways that incomplete data appears in the real world. For instance, we consider a structured sampling strategy in which entries of smaller magnitude are sampled less often. This models the situation in which survey participants are more likely to skip questions that are not important to them (and for which their answers may have smaller magnitude).

If the matrix to be recovered is low rank, one can accurately infer the missing entries of the data matrix using the algebraic structure of the observed entries. Indeed, Candès and Recht show that if the observed sample of entries is uniformly distributed and sufficiently large, then one can exactly recover the matrix via nuclear norm minimization [3]. There are many matrix completion approaches; however, we focus on nuclear norm minimization (NNM) and $\ell_{1}$-regularized nuclear norm minimization ($\ell_{1}$-NNM), defined in Subsection I-A.

Data completion can also be helpful for data collection purposes; only partial information may be required for data completion to preserve the statistical properties of a dataset, allowing for reduction in the quantity of data that must be collected, stored, or transmitted. Returning to the survey example, one could ask respondents a small selection of questions from a larger set of candidate questions, predict their answers to the unasked questions using data completion, and apply inference methods to the recovered dataset. These applications are of particular interest to LymeDisease.org, an advocacy organization that collects survey data from Lyme patients through studies like MyLymeData [6]. The surveys used in MyLymeData branch, presenting different sets of questions to respondents based on their previous answers. Patients may also skip questions. The resulting data matrix, in which rows correspond to patients and columns correspond to questions, is highly incomplete. Another concern of LymeDisease.org is the length of the MyLymeData surveys, since overlong surveys can cause survey fatigue and lead patients to ignore questions or answer inaccurately. Developing sound inference methods for incomplete data would allow us to sample strategically and use data completion techniques to design shorter surveys that preserve high-level information about the respondents.

In this report, we study the effects of different sampling techniques on statistical inference. We derive provable error bounds for certain statistics and run numerical simulations on synthetic data as well as large-scale, incomplete survey data from MyLymeData with the goal of reducing the amount of data required from each survey respondent while preserving population-level insights.

I-A Notation

We begin by establishing notation that will be used throughout the paper. Recall that $[n]=\{1,2,\dots,n\}$. For $\mathbf{A}\in\mathbb{R}^{m\times n}$, we denote the $(i,j)$ entry of $\mathbf{A}$ as $A_{ij}$ and the $i$th row of $\mathbf{A}$ as $\mathbf{a}_{i}$. The standard $\ell_{q}$-norm on $\mathbb{R}^{n}$ is denoted $\|\cdot\|_{q}$ for $1\leq q\leq\infty$. For $\mathbf{A}\in\mathbb{R}^{m\times n}$, $\|\mathbf{A}\|_{q}$ is the entrywise matrix $q$-norm; i.e., the $\ell_{q}$-norm of the vectorization of $\mathbf{A}$. The matrix nuclear norm is denoted $\|\mathbf{A}\|_{*}=\operatorname*{trace}(\sqrt{\mathbf{A}^{*}\mathbf{A}})$.
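As a small illustration of the two matrix norms just defined, the following NumPy sketch computes the entrywise $q$-norm and the nuclear norm. The function names `entrywise_norm` and `nuclear_norm` are hypothetical helpers introduced here, not part of the paper, and the nuclear norm is evaluated as the sum of singular values, which equals $\operatorname*{trace}(\sqrt{\mathbf{A}^{*}\mathbf{A}})$.

```python
import numpy as np

def entrywise_norm(A, q):
    """Entrywise q-norm: the l_q norm of the vectorization of A (q may be np.inf)."""
    return np.linalg.norm(np.ravel(A), ord=q)

def nuclear_norm(A):
    """Nuclear norm, computed as the sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()

# Example matrix with singular values 4 and 3 (illustrative choice).
A = np.array([[3.0, 0.0],
              [0.0, 4.0]])
print(entrywise_norm(A, 1), entrywise_norm(A, 2), entrywise_norm(A, np.inf))
print(nuclear_norm(A))
```

For this diagonal example the entrywise 2-norm coincides with the Frobenius norm, and the nuclear norm is the sum of the absolute diagonal entries.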

We consider two sampling strategies, uniform and structured sampling. For uniform sampling, the probability of sampling each entry is given by $p\in(0,1)$. We also investigate a structured sampling strategy in which the probability of sampling entries equal to zero is given by $p_{0}$, and the probability of sampling nonzero entries is given by $p_{1}$; we assume $p_{0}<p_{1}$.
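The two sampling strategies can be sketched as Boolean masks over the entries of a matrix. This is a minimal illustration under stated assumptions: the helper names `uniform_mask` and `structured_mask`, the example matrix, and the probabilities are all hypothetical, and entries are assumed to be sampled independently.

```python
import numpy as np

def uniform_mask(shape, p, rng):
    """Observe each entry independently with probability p."""
    return rng.random(shape) < p

def structured_mask(M, p0, p1, rng):
    """Observe zero entries with probability p0 and nonzero entries with probability p1."""
    probs = np.where(M == 0, p0, p1)
    return rng.random(M.shape) < probs

rng = np.random.default_rng(0)
M = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])

omega_uniform = uniform_mask(M.shape, p=0.5, rng=rng)
omega_structured = structured_mask(M, p0=0.2, p1=0.9, rng=rng)

# Observed matrix: unobserved entries are stored as 0 here (an implementation choice).
M_obs = np.where(omega_structured, M, 0.0)
# Fraction of observed entries, playing the role of omega.
omega_fraction = omega_structured.mean()
```

The mask plays the role of the index set $\Omega$: `True` marks an observed entry.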

We denote the original complete matrix by $\mathbf{M}$, the set of observed indices from the original matrix by $\Omega\subset[m]\times[n]$, the observed matrix by $\mathbf{M}_{\Omega}$, the recovered matrix by $\widetilde{\mathbf{M}}$, and the fraction of entries which are observed by $\omega\in(0,1)$. We consider two recovery methods, nuclear norm minimization (NNM) and $\ell_{1}$-regularized nuclear norm minimization ($\ell_{1}$-NNM). The recovered matrix $\widetilde{\mathbf{M}}$ for NNM is defined as

$$\widetilde{\mathbf{M}}=\operatorname*{argmin}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\|\mathbf{X}\|_{*}\quad\text{s.t.}\quad M_{ij}=X_{ij}\ \text{for all}\ (i,j)\in\Omega.$$

The recovered matrix $\widetilde{\mathbf{M}}$ for $\ell_{1}$-NNM is defined as

$$\widetilde{\mathbf{M}}=\operatorname*{argmin}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\|\mathbf{X}\|_{*}+\alpha\|\mathbf{X}_{\Omega^{C}}\|_{1}\quad\text{s.t.}\quad M_{ij}=X_{ij}\ \text{for all}\ (i,j)\in\Omega$$

for some regularization parameter $\alpha>0$, where $\Omega^{C}$ denotes the set of unobserved indices. The addition of the $\ell_{1}$-regularization term in the objective of $\ell_{1}$-NNM encourages unobserved entries of the recovered matrix to be near $0$, which makes it a natural choice for recovery on an incomplete matrix generated by structured sampling [7].

The inferences we consider are basic statistics. The first inference is the entrywise mean, defined as $\bar{\lambda}(\mathbf{A}):=\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}$. We additionally consider the row mean, a row vector containing the mean value for each column or feature, which is defined as $\mu(\mathbf{A}):=\frac{1}{m}\sum_{i=1}^{m}\mathbf{a}_{i}$.

I-B Methodology

To perform our experiments, we begin with a complete matrix $\mathbf{M}$, either artificial or extracted from real data, which we take as the ground truth. We then use either the uniform or the structured sampling strategy to obtain an incomplete observed matrix, $\mathbf{M}_{\Omega}$. The values of $p$ and $p_{0},p_{1}$ used for uniform and structured sampling, respectively, are noted in each experiment. We recover $\widetilde{\mathbf{M}}$ via either NNM or $\ell_{1}$-NNM. For $\mathbf{M}_{\Omega}$ constructed via the uniform sampling strategy, we use NNM to recover $\widetilde{\mathbf{M}}$, while for $\mathbf{M}_{\Omega}$ constructed via the structured sampling strategy, we use $\ell_{1}$-NNM to recover $\widetilde{\mathbf{M}}$. Here, we choose $\alpha$ optimally from among $\{0.05,0.1,0.2,...,0.5\}$ to minimize the resulting error $\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}$. We use the alternating direction method of multipliers (ADMM) [5] to solve both NNM and $\ell_{1}$-NNM. We consider the normalized matrix recovery error $E(\mathbf{M},\widetilde{\mathbf{M}}):=\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}/\|\mathbf{M}\|_{F}$ as an estimate of the error introduced by sampling and data completion.
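The text does not spell out the ADMM iteration, so the following is only one plausible sketch, not the authors' implementation. It assumes the standard splitting $\mathbf{X}=\mathbf{Z}$, with the nuclear norm placed on $\mathbf{X}$ and the data constraint plus the $\ell_{1}$ penalty placed on $\mathbf{Z}$. The names `svt`, `soft_threshold`, `complete`, the penalty parameter `rho`, and the iteration count are all assumptions; setting `alpha=0` corresponds to NNM and `alpha>0` to $\ell_{1}$-NNM.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(Y, tau):
    """Entrywise soft thresholding: proximal operator of tau * l1 norm."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def complete(M_obs, mask, alpha=0.0, rho=1.0, n_iter=500):
    """Scaled-form ADMM sketch for
        min ||X||_* + alpha * ||Z restricted to unobserved entries||_1
        s.t. X = Z and Z agrees with M_obs on the observed entries (mask == True).
    Returns Z, which satisfies the data constraint at every iteration."""
    Z = np.where(mask, M_obs, 0.0)
    U = np.zeros_like(Z)
    for _ in range(n_iter):
        X = svt(Z - U, 1.0 / rho)                  # nuclear-norm step
        V = X + U
        Z = np.where(mask, M_obs, soft_threshold(V, alpha / rho))  # constraint + l1 step
        U = U + X - Z                              # scaled dual update
    return Z

# Small rank-one example (illustrative data, not from the paper).
M = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 2.0])
mask = np.array([[True, False, True],
                 [True, True, False],
                 [False, True, True]])
M_obs = np.where(mask, M, 0.0)

M_nnm = complete(M_obs, mask)             # NNM (alpha = 0)
M_l1 = complete(M_obs, mask, alpha=0.1)   # l1-NNM with one candidate alpha
print(np.linalg.norm(M - M_nnm), np.linalg.norm(M - M_l1))
```

To mimic the selection of $\alpha$ described above, one could call `complete` once per candidate value and keep the result with the smallest $\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}$; this requires the ground truth $\mathbf{M}$ and is therefore only possible in a controlled experiment.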

Finally, we compute inferences on the original matrix $\mathbf{M}$ and the recovered matrix $\widetilde{\mathbf{M}}$. We estimate the inference error between these two matrices via various measures. We define the absolute error of the entrywise mean as $E_{\bar{\lambda}}(\mathbf{M},\widetilde{\mathbf{M}}):=|\bar{\lambda}(\mathbf{M})-\bar{\lambda}(\widetilde{\mathbf{M}})|$ and the normalized error of the row mean as $E_{\mu}(\mathbf{M},\widetilde{\mathbf{M}}):=\|\mu(\mathbf{M})-\mu(\widetilde{\mathbf{M}})\|_{2}/\|\mu(\mathbf{M})\|_{2}$.
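The error measures defined above translate directly into NumPy. The sketch below uses hypothetical function names and a hand-picked pair of matrices, chosen so that the recovered matrix is wrong in two entries while the entrywise mean is unchanged, which illustrates how an inference can be recovered exactly without exact matrix recovery.

```python
import numpy as np

def entrywise_mean(A):
    """Mean over all entries of A."""
    return A.mean()

def row_mean(A):
    """Mean of the rows of A: one value per column (feature)."""
    return A.mean(axis=0)

def matrix_error(M, M_rec):
    """Normalized matrix recovery error E(M, M_rec)."""
    return np.linalg.norm(M - M_rec, "fro") / np.linalg.norm(M, "fro")

def entrywise_mean_error(M, M_rec):
    """Absolute error of the entrywise mean."""
    return abs(entrywise_mean(M) - entrywise_mean(M_rec))

def row_mean_error(M, M_rec):
    """Normalized l2 error of the row mean."""
    return np.linalg.norm(row_mean(M) - row_mean(M_rec)) / np.linalg.norm(row_mean(M))

# Illustrative matrices: two entries of the first row are swapped in M_rec.
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
M_rec = np.array([[2.0, 1.0],
                  [3.0, 4.0]])

print(matrix_error(M, M_rec))          # nonzero: the matrix is not recovered exactly
print(entrywise_mean_error(M, M_rec))  # zero: both matrices have the same overall mean
print(row_mean_error(M, M_rec))        # nonzero: the column means differ
```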

We perform numerical experiments on both synthetic and real-world data. The real-world dataset consists of survey data from the MyLymeData patient study conducted by LymeDisease.org [6]. For experiments on synthetic data, we generate artificial matrices as follows. To guarantee a certain rank $r$, we generate $m\times n$ scalar matrices by multiplying two matrices whose sizes are $m\times r$ and $r\times n$. The entries of each pair of matrices we generate are uniformly distributed integers within the range $[0,C]$. For experiments on real data, we extract a complete portion of MyLymeData consisting of patient responses to questions regarding their symptoms and health history.
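The synthetic construction described above can be sketched as follows. The helper name `synthetic_low_rank`, the seed, and the value `C=3` are assumptions made for illustration; the paper does not state the value of $C$. Note that the product of an $m\times r$ and an $r\times n$ matrix has rank at most $r$; the rank equals $r$ only when both factors have full rank, which is typical for random factors but not guaranteed.

```python
import numpy as np

def synthetic_low_rank(m, n, r, C, rng):
    """Product of an m x r and an r x n matrix whose entries are integers
    drawn uniformly from {0, 1, ..., C}. The result has rank at most r."""
    left = rng.integers(0, C + 1, size=(m, r))    # upper bound is exclusive, hence C + 1
    right = rng.integers(0, C + 1, size=(r, n))
    return (left @ right).astype(float)

rng = np.random.default_rng(0)
M = synthetic_low_rank(30, 30, 5, C=3, rng=rng)
print(M.shape, np.linalg.matrix_rank(M))
```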

II Experimental Results

In Figures 1, 2, and 3, we plot experimentally collected matrix and inference recovery errors on synthetic matrices; the figures differ by the choice of zero sampling probability $p_{0}$ for the structured sampling strategy. We generate a $30\times 30$ matrix with rank $5$ as described in Subsection I-B. For various $p$ and $(p_{0},p_{1})$ sampling probabilities, we measure the resulting matrix recovery errors and inference recovery errors. These results are averaged over 10 trials (each trial consists of a sample of observed entries) and plotted with the standard deviation of these errors. Errors are plotted versus the proportion of observed entries $\omega$. In the upper left plot of each figure, we additionally record the optimal regularization parameter $\alpha$, which resulted in the smallest matrix recovery error for the given structured sampling proportion $\omega$.

In Figures 4, 5, and 6, we plot experimentally collected matrix and inference recovery errors on MyLymeData matrices; the figures differ by the choice of zero sampling probability $p_{0}$ for the structured sampling strategy. We select a complete matrix of size $30\times 16$ by selecting the 16 questions (columns) every patient must answer and the 30 patients with the most zero entries. For various $p$ and $(p_{0},p_{1})$ sampling probabilities, we measure the resulting matrix recovery errors and inference recovery errors. These results are averaged over 10 trials (each trial consists of a sample of observed entries) and plotted with the standard deviation of these errors. Errors are plotted versus the proportion of observed entries $\omega$. In the upper left plot of each figure, we additionally record the optimal regularization parameter $\alpha$, which resulted in the smallest matrix recovery error for the given structured sampling proportion $\omega$.

Note that in Figures 1, 2, 3, 4, and 5, the optimal regularization parameter $\alpha$ is greater than zero for sufficiently large observation proportion $\omega$. Furthermore, in Figures 1, 2, 3, 4, and 5, the $\ell_{1}$-NNM recovered solution is exact for sufficiently large $\omega$, and the $\ell_{1}$-NNM recovery for the observations sampled via the structured strategy is more accurate than the NNM recovery for the observations sampled via the uniform strategy for larger proportion $\omega$. Finally, the inference recoveries are often exact for smaller $\omega$ than is necessary for exact matrix recovery, as in Figures 1, 2, and 3.

In Figures 7, 8, and 9, we plot experimentally collected matrix and inference recovery errors on synthetic matrices; the figures differ by the choice of zero sampling probability $p_{0}$. In these figures, we compare $\ell_{1}$-NNM and NNM recovery for matrices which have been sampled via the structured sampling strategy. We generate a $30\times 30$ matrix with rank $5$ as described in Subsection I-B. We average the matrix recovery and inference recovery errors over 10 trials (each trial consists of a sample of observed entries) and plot the mean and standard deviation of these errors. Errors are plotted versus the probability of sampling nonzero entries, $p_{1}$. In the upper left plot of each figure, we additionally record the optimal regularization parameter $\alpha$, which resulted in the smallest matrix recovery error for the given nonzero structured sampling probability $p_{1}$.

III Theoretical Results

Given that the matrix recovery error has been studied closely in the literature [3, 2], we aim to bound the inference recovery error by a function of the matrix recovery error. We establish bounds on the recovery error for the entrywise mean and row mean.

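The first result bounds the recovery error of the entrywise mean $\bar{\lambda}$ and the row mean $\mathbf{\mu}$ by a scalar multiple of the matrix recovery error. Recall that $\|\mathbf{A}\|_{q}$ denotes the standard $\ell_{q}$ vector-norm of the vectorization of the matrix $\mathbf{A}$.

Theorem III.1.

Let $\bar{\lambda}$ and $\mathbf{\mu}$ be the entrywise and row mean operators, respectively. Then

$$|\bar{\lambda}(\mathbf{M})-\bar{\lambda}(\widetilde{\mathbf{M}})|\leq(mn)^{-\frac{1}{q}}\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{q}$$

and

$$\|\mu(\mathbf{M})-\mu(\widetilde{\mathbf{M}})\|_{q}\leq\left(\frac{n^{q-1}}{m}\right)^{\frac{1}{q}}\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{q}$$

for all $\mathbf{M},\widetilde{\mathbf{M}}\in\mathbb{R}^{m\times n}$ and $1\leq q\leq\infty$.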

Proof.

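First, note that $\bar{\lambda}$ and $\mathbf{\mu}$ are linear operators, so it suffices to show that $|\bar{\lambda}(\mathbf{A})|\leq(mn)^{-1/q}\|\mathbf{A}\|_{q}$ and $\|\mu(\mathbf{A})\|_{q}\leq(n^{q-1}/m)^{1/q}\|\mathbf{A}\|_{q}$ for $\mathbf{A}\in\mathbb{R}^{m\times n}$. Next, note that $|\bar{\lambda}(\mathbf{A})|\leq\|\mathbf{A}\|_{1}/mn$.

Applying Hölder’s inequality, we have

$$|\bar{\lambda}(\mathbf{A})|\leq\frac{1}{mn}\|\mathbf{A}\|_{1}\leq(mn)^{-\frac{1}{q}}\|\mathbf{A}\|_{q},$$

where $1/q$ assumes the value $0$ if $q=\infty$.

Next, note that

$$\|\mu(\mathbf{A})\|_{q}^{q}\leq\|\mu(\mathbf{A})\|_{1}^{q}\leq\frac{\|\mathbf{A}\|_{1}^{q}}{m^{q}}\leq\frac{n^{q-1}\|\mathbf{A}\|_{q}^{q}}{m},$$

where the first inequality holds because the $\ell_{q}$-norm of a vector is at most its $\ell_{1}$-norm, the second follows from the triangle inequality, and the last inequality follows from Hölder’s inequality. ∎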
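Because Theorem III.1 holds for any pair of matrices, it can be checked numerically without running a completion algorithm. The sketch below is a sanity check under stated assumptions: `check_bounds` is a hypothetical helper, the matrices are random Gaussian draws of the $16\times 80$ size used in the text, and the constant $(n^{q-1}/m)^{1/q}$ is evaluated as $n^{1-1/q}m^{-1/q}$ so that $q=\infty$ can be handled with $1/q=0$.

```python
import numpy as np

def check_bounds(M, M_rec, q):
    """Return (entrywise-mean error, its bound, row-mean error, its bound)
    for the entrywise q-norm, following the statement of Theorem III.1."""
    m, n = M.shape
    inv_q = 0.0 if np.isinf(q) else 1.0 / q
    diff = np.linalg.norm((M - M_rec).ravel(), ord=q)           # ||M - M_rec||_q
    lam_err = abs(M.mean() - M_rec.mean())
    mu_err = np.linalg.norm(M.mean(axis=0) - M_rec.mean(axis=0), ord=q)
    lam_bound = (m * n) ** (-inv_q) * diff
    mu_bound = n ** (1.0 - inv_q) * m ** (-inv_q) * diff        # (n^{q-1}/m)^{1/q}
    return lam_err, lam_bound, mu_err, mu_bound

rng = np.random.default_rng(0)
M = rng.normal(size=(16, 80))
M_rec = M + 0.1 * rng.normal(size=M.shape)   # arbitrary perturbation standing in for a recovery

results = {q: check_bounds(M, M_rec, q) for q in (1, 2, 3, np.inf)}
for q, (lam_err, lam_bound, mu_err, mu_bound) in results.items():
    print(q, lam_err <= lam_bound, mu_err <= mu_bound)
```

A check like this only confirms that the inequalities are not violated on the sampled instances; it says nothing about how tight the bounds are.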

In Figure 7 we explore the bounds given in Theorem III.1. We generate $20$ random scalar matrices of size $16\times 80$ as described in Subsection I-B. For each matrix, we collect $20$ uniform samples of the entries using the sampling probability $p$, then calculate the averages of the entrywise mean recovery error, the row mean recovery error, and the derived upper bounds based on the matrix recovery error for each sample. We perform this process for $p=0,0.01,\dots,1$.

Finally, we present a simple analytic bound for NNM matrix recovery error. Note that this bound illustrates that the inference recovery errors may still be small even if the matrix recovery is not exact.

Theorem III.2.

Let $\mathbf{M}\in\mathbb{R}^{m\times n}$, $\Omega$, and $\widetilde{\mathbf{M}}$ be computed via NNM as described in Subsection I-A. Let $r={\rm rank}(\mathbf{M})$ denote the rank of $\mathbf{M}$, and denote the singular values of $\mathbf{M}$ by $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{r}$ in decreasing order. Then

$$\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}\leq 2\sqrt{r^{2}\sigma_{1}^{2}-\|\mathbf{M}_{\Omega}\|_{F}^{2}}.$$

Proof.

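Applying the Parallelogram Identity, we have

$$\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}^{2}=2\left(\|\mathbf{M}\|_{F}^{2}+\|\widetilde{\mathbf{M}}\|_{F}^{2}\right)-\|\mathbf{M}+\widetilde{\mathbf{M}}\|_{F}^{2}.$$

We bound each term of the right-hand side, beginning with the $\|\mathbf{M}\|_{F}^{2}$ term. By Hölder’s Inequality, we have

$$\|\mathbf{M}\|_{F}^{2}=\|(\sigma_{1},\sigma_{2},\dots,\sigma_{r})\|_{2}^{2}\leq r^{2}\sigma_{1}^{2}.$$

Next, we bound the $\|\widetilde{\mathbf{M}}\|_{F}^{2}$ term above. Since $\mathbf{M}$ is feasible for the nuclear norm minimization problem, note that $\|\widetilde{\mathbf{M}}\|_{*}\leq\|\mathbf{M}\|_{*}$. Therefore, through repeated use of Hölder’s Inequality, we calculate that

$$\|\widetilde{\mathbf{M}}\|_{F}^{2}\leq\|\widetilde{\mathbf{M}}\|_{*}^{2}\leq\|\mathbf{M}\|_{*}^{2}=\|(\sigma_{1},\sigma_{2},\dots,\sigma_{r})\|_{1}^{2}\leq r^{2}\sigma_{1}^{2}.$$

Finally, note that $\|\mathbf{M}+\widetilde{\mathbf{M}}\|_{F}^{2}\geq 4\|\mathbf{M}_{\Omega}\|_{F}^{2}$. ∎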
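For readers who want the last step of the proof written out, the following short derivation assembles the pieces. It uses only facts stated in the proof: the bound $r^{2}\sigma_{1}^{2}$ on both $\|\mathbf{M}\|_{F}^{2}$ and $\|\widetilde{\mathbf{M}}\|_{F}^{2}$, and the constraint that $\widetilde{\mathbf{M}}$ agrees with $\mathbf{M}$ on $\Omega$. The notation $(\cdot)_{\Omega}$ for restriction to the observed entries, with zeros elsewhere, is an assumption about how $\mathbf{M}_{\Omega}$ is to be read.

Since $\mathbf{M}$ and $\widetilde{\mathbf{M}}$ coincide on $\Omega$, we have $(\mathbf{M}+\widetilde{\mathbf{M}})_{\Omega}=2\mathbf{M}_{\Omega}$, and dropping the unobserved entries can only decrease the Frobenius norm:

$$\|\mathbf{M}+\widetilde{\mathbf{M}}\|_{F}^{2}\geq\|(\mathbf{M}+\widetilde{\mathbf{M}})_{\Omega}\|_{F}^{2}=\|2\mathbf{M}_{\Omega}\|_{F}^{2}=4\|\mathbf{M}_{\Omega}\|_{F}^{2}.$$

Substituting the three estimates into the Parallelogram Identity gives

$$\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}^{2}\leq 2\left(r^{2}\sigma_{1}^{2}+r^{2}\sigma_{1}^{2}\right)-4\|\mathbf{M}_{\Omega}\|_{F}^{2}=4\left(r^{2}\sigma_{1}^{2}-\|\mathbf{M}_{\Omega}\|_{F}^{2}\right),$$

and taking square roots yields the bound stated in Theorem III.2.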

Note that this bound proves exact recovery when all entries of the matrix are observed and all singular values of the matrix are equal, but it is likely not tight in many situations where exact recovery can be guaranteed by, e.g., [3, 2].

IV Conclusion

In this work, we explored how error introduced by data completion affects recovery of statistical inferences. Our numerical experiments demonstrate that simple inferences such as the entrywise mean or the row mean can be recovered accurately even when the matrix is not recovered exactly. We prove bounds on the inference recovery error in terms of the matrix recovery error for the entrywise mean and the row mean. Additionally, we prove an analytical bound on the matrix recovery error which applies even when the matrix cannot be recovered exactly.

Future directions include exploring more common statistical inferences, such as support vector machine models. Additionally, we hope to develop a better analytic bound on the matrix recovery error which generalizes the exact recovery results in the literature. Furthermore, we will explore theory for exact recovery via $\ell_{1}$-NNM for matrices whose observations are sampled via the structured sampling strategy.

V Acknowledgements

DN, JH, and DM are grateful to and were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA DMS #1740325. This work is based upon work completed at the UCLA CAM REU during Summer 2018 which was funded by NSF DMS #1659676. The authors would like to thank CEO Lorraine Johnson, LymeDisease.org, and the patients who participated in the MyLymeData survey. Additionally, they thank Dr. Anna Ma for her assistance with this data, and Prof. Andrea Bertozzi and the UCLA Applied and Computational Math REU program for their support.

Bibliography

[1] A. L. Bello. Imputation techniques in regression analysis: looking closely at their implementation. Computational statistics & data analysis, 20(1):45–57, 1995.
[2] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, June 2010.
[3] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.
[4] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[5] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674. ACM, 2013.
[6] LymeDisease.org. Lymedisease.org, 2018. https://www.lymedisease.org, Last accessed on 2018-08-17.
[7] D. Molitor and D. Needell. Matrix completion for structured observations. arXiv preprint arXiv:1801.09657, 2018.
[8] D. B. Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons, 2004.
