Nearest Neighbors for Matrix Estimation Interpreted as Blind Regression for Latent Variable Model
Yihua Li, Devavrat Shah, Dogyoon Song, Christina Lee Yu

TL;DR
This paper introduces a nearest-neighbor-based algorithm for blind matrix regression that effectively estimates missing entries using latent features, with proven consistency and competitive performance on real datasets and tensor completion tasks.
Contribution
The paper presents a novel nearest-neighbor approach for blind matrix estimation, providing theoretical guarantees and demonstrating practical effectiveness on real-world data and tensor completion.
Findings
Algorithm is consistent under Lipschitz conditions.
Performs better than basic collaborative filtering.
Competitive with state-of-the-art tensor completion methods.
Abstract
We consider the setup of nonparametric {\em blind regression} for estimating the entries of a large matrix, when provided with a small, random fraction of noisy measurements. We assume that all rows and columns of the matrix are associated to latent features and respectively, and the -th entry of the matrix, is equal to for a latent function . Given noisy observations of a small, random subset of the matrix entries, our goal is to estimate the unobserved entries of the matrix as well as to "de-noise" the observed entries. As the main result of this work, we introduce a nearest-neighbor-based estimation algorithm, and establish its consistency when the underlying latent function is Lipschitz, the underlying latent space is a bounded diameter…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Nearest Neighbors for Matrix Estimation Interpreted as Blind Regression for Latent Variable Model
Yihua Li, Devavrat Shah, Dogyoon Song, Christina Lee Yu This work was performed while all authors were affiliated with the Laboratory for Information and Decision Systems and the department of EECS at Massachusetts Institute of Technology. Shah is a member and director of the Statistics and Data Science Center at MIT.A preliminary version of this work was presented at the Neural Information Processing Systems Conference in December 2016 under the name “Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering”. The results have been significantly improved, strengthened, and expanded with new extensions since the preliminary version, and thus this manuscript has limited overlap with the preliminary version of this work.
Abstract
We consider the setup of nonparametric blind regression for estimating the entries of a large matrix, when provided with a small, random fraction of noisy measurements. We assume that all rows and columns of the matrix are associated to latent features and respectively, and the -th entry of the matrix, is equal to for a latent function . Given noisy observations of a small, random subset of the matrix entries, our goal is to estimate the unobserved entries of the matrix as well as to “de-noise” the observed entries. As the main result of this work, we introduce a nearest-neighbor-based estimation algorithm, and establish its consistency when the underlying latent function is Lipschitz, the underlying latent space is a bounded diameter Polish space, and the random fraction of observed entries in the matrix is at least \max\big{(}m^{-1+\delta},n^{-1/2+\delta}\big{)}, for any . As an important byproduct, our analysis sheds light into the performance of the classical collaborative filtering algorithm for matrix completion, which has been widely utilized in practice. Experiments with the MovieLens and Netflix datasets suggest that our algorithm provides a principled improvement over basic collaborative filtering and is competitive with matrix factorization methods. Our algorithm has a natural extension to the setting of tensor completion via flattening the tensor to matrix. When applied to the setting of image in-painting, which is a -order tensor, we find that our approach is competitive with respect to state-of-art tensor completion algorithms across benchmark images.
Index Terms:
Blind regression, matrix estimation, matrix completion, tensor estimation, tensor completion, latent variable model, collaborative filtering, nearest neighbor methods
I Introduction
The problem of matrix completion has received enormous attention in the past decade: consider an matrix of interest. Suppose we observe a subset of the entries of an matrix , which is a noisy version of , such that each -th entry is a random variable with for 111We shall utilize notation .. The goal of matrix completion is to recover matrix given partial observations from .
I-A Our Contributions
We provide a similarity-based nearest neighbor algorithm akin to popular collaborative filtering with theoretical performance guarantees under the latent variable model. In addition, we extend the analysis of our algorithm to the setting of tensor completion through flattering tensor to matrix. To our knowledge, this is the first theoretical analysis for similarity-based collaborative filtering algorithms, shedding insight into the widespread success of this popular heuristic for the past two decades. The algorithm we introduce is a simple variant of classical collaborative filtering, in which we compute similarities between pairs of rows and pairs of columns by comparing their common overlapped entries. Our model assumes that each row and column is associated to a latent variable, i.e. a vector of hidden features, and that the data entry is in expectation equal to some unknown function of those latent variables. We assume the latent space is a complete, separable metric space aka Polish space equipped with a Borel probability measure. The key regularity condition that we require is that the image of the latent space by the latent function has a small effective covering number with respect to the push-forward measure222Let denote the latent space. Let . Given , the effective covering number of with respect to
which Borel measure over Polish space naturally satisfies.
Given this latent variable model, we prove that the estimate produced by this algorithm is consistent as long as the fraction of entries that are observed is at least for some for an matrix (for precise statement, See Corollary 3 and its implication). We provide experiments using our method to predict ratings in the MovieLens and Netflix datasets. The results suggest that our algorithm improves over basic collaborative filtering and is competitive with factorization-based methods.
We also discuss that the algorithm and analysis can be extended to tensor completion by flattening the tensor to a matrix. We implemented our method for predicting missing pixels in image in-painting, which showed that our method is competitive with existing spectral methods used for tensor completion.
The algorithm that we propose has similarities to classical non-parametric nearest neighbor method, cf. [1] and kernel regression, which also relies on approximations by local smoothing, cf. [2, 3]. However, since kernel regression and other similar methods use explicit knowledge of the input features, their analysis and proof techniques do not extend to our context. Instead of using distance in the unknown latent space, the algorithm weighs datapoints according to similarities that are computed from the data itself. Our analysis shows that although the similarities between the data points may not reflect the distance between the latent features, they essentially reflect the functional distances (in the sense) between the latent function restricted to the pair of rows (or columns) associated with the data points, which is sufficient to guarantee that the datapoints with high similarities are indeed similar in value.
I-B Related Literature
The primary methods used to solve the problem in the literature include neighbor-based approaches, such as collaborative filtering, and spectral approaches, which include low-rank matrix factorization or minimization of a loss function with respect to spectral constraints.
Spectral Methods
In the recent years, there have been exciting intellectual developments in the context of spectral approaches such as matrix factorization. All matrices admit a singular-value decomposition, such that they can be uniquely factorized. The goal of the factorization-based method is to recover row and column singular vectors accurately from the partially observed, noisy matrix and subsequently estimate the matrix . [4] was one of the earliest works to suggest the use of low-rank matrix approximation in this context. Subsequently, statistically efficient approaches were suggested using optimization-based estimators, proving that matrix factorization can fill in the missing entries with sample complexity as low as for an matrix, where is the rank of the matrix [5, 6, 7, 8, 9]. There has been an exciting line of ongoing work to make the resulting algorithms faster and scalable [10, 11, 12, 13, 14, 15].
[16] proposed a spectral clustering method for inferring the edge label distribution for a network sampled from a generalized stochastic block model. The model is similar to the proposed latent variable model introduced in Section II, except that the edges are labeled by one of finitely many labels in a symmetric setup with , and the goal is to estimate the label distribution in addition to the expected label. When the expected function has a finite spectrum decomposition, i.e. low rank, then they provide a consistent estimator for the sparse data regime, with samples. When the function is only approximately low rank (e.g. the class of general Lipschitz functions), for a fixed rank approximation, the mean squared error bounds converge to a positive constant which captures the low rank approximation gap. That is, samples are not sufficient to guarantee consistent estimation for the entire class of Lipschitz functions.
Many of these approaches are based on the structural assumption that the underlying matrix is low-rank and the matrix entries are reasonably “incoherent”. Unfortunately, the low-rank assumption may not hold in practice. The recent work [17] makes precisely this observation, showing that a simple non-linear, monotonic transformation of a low-rank matrix could easily produce an effectively high-rank matrix, despite few free model parameters. They provide an algorithm and analysis specific to the form of their model, which achieves sample complexity of for an matrix. However, their algorithm only applies to functions which are a nonlinear monotonic transformation of the inner product of the latent features. [18] propose an algorithm for estimating locally low rank matrices, however their algorithm assumes prior knowledge of the “correct” kernel function between pairs of rows and columns which is not known a priori.
[19] proposes the universal singular value thresholding estimator (USVT) inspired by low-rank matrix approximation. Somewhat interestingly, it argues that under the latent variable model considered in this work (see Section II), the USVT algorithm provides an accurate estimate for any Lipschitz function. However, to guarantee consistency of the USVT estimator for an (i.e. ) matrix, it requires observing many entries out of the total entries, where is the dimension of the latent space in which the row and column latent features belong. In recent work, [20] extends the analysis of USVT for graphon estimation, assuming a generative latent variable model for binary observation matrices representing networks. If the latent function is -Holder smooth, author establishes that the spectrum decays polynomially, and thus the MSE of the USVT estimator is bounded above by O\big{(}(mp)^{-\frac{2\alpha}{2\alpha+d}}\big{)}, which converges to zero as long as (for an symmetric matrix).
Collaborative Filtering
The term collaborative filtering was coined by [21], and this technique is widely used in practice due to its simplicity and ability to scale. There are two main paradigms in neighborhood-based collaborative filtering: the user-user paradigm and the item-item paradigm. To recommend items to a user in the user-user paradigm, one first looks for similar users, and then recommends items liked by those similar users. In the item-item paradigm, in contrast, items similar to those liked by the user are found and subsequently recommended. Much empirical evidence exists that the item-item paradigm performs well in many cases [22, 23, 24]. There have also been many heuristic improvements upon the basic algorithm, such as normalizing the data, combining neighbor methods with spectral methods, combining both user and item neighbors, and additionally optimizing over interpolation weights given to each datapoint within the neighborhood when computing the final prediction [25, 26, 27].
Despite the widespread success of similarity-based collaborative filtering heuristics, the theoretical understanding of these method is very limited. In recent works, latent mixture models have been introduced to explain the collaborative filtering algorithm as well as the empirically observed superior performance of item-item paradigms, c.f. [28, 29]. However, these results assume binary ratings and a specific parametric model, such as a mixture distribution model for preferences across users and movies. We hope that by providing an analysis for collaborative filtering within a nonparametric model, we can provide a better understanding of collaborative filtering.
Within the context of dense graphon estimation, when the entries in the data matrix are binary and the sample probability is constant , there have been a few theoretical results that prove convergence of the mean squared error for similarity-based methods [30, 31]. They hinge upon computing similarities between rows or columns by comparing commonly observed entries, similar to collaborative filtering. Similar to our result, they are able to prove convergence for the class of Lipschitz functions. However, [30] assumes that the algorithm is given multiple instances of the sampled dataset, which is not available in our formulation. [31] is weaker than our result in that it assumes , however they are able to handle a more general noise model, when the entries are binary. The similarity between a pair of vertices is computed from the maximum difference between entries in the associated rows of the second power of the data matrix, which is computationally more expensive than directly comparing rows in the original data matrix.
Tensor Completion
Recently there have been efforts to extend decomposition methods or neighborhood-based approaches to the context of tensor completion, however this has proven to be significantly more challenging than matrix completion due to the complication that tensors do not have a canonical decomposition such as the singular value decomposition (SVD) for a matrix. This property makes obtaining a decomposition for a tensor challenging. The survey [32] elaborates on these challenges. There have been recent developments in obtaining efficient tensor decompositions in form of rank-1 tensors (tensors obtained from one vector), presented in [33]. This has been especially effective in learning latent variable models and estimating missing data as shown in, for example [34, 35].
Many results in tensor estimation take the approach of flattening the tensor to a matrix and subsequently apply matrix estimation algorithms [36, 37, 38, 39]. A -order tensor where each dimension is length would be flattened to a matrix, resulting in a sample complexity of . Subsequently there has been a line of work extending spectral methods, local iterative methods, or the sum of squares method to the specific tensor structure to obtain improved sample complexities of [34, 40, 41, 42, 43, 44, 45].
Beyond tensor decomposition, there have been recent developments in the context of learning latent variable models or mixture distributions also called non-negative matrix factorization, c.f. [46, 47].
I-C Organization of the Paper
In Section II we setup the formal model and problem statement and discuss the assumptions needed for our analysis. In Section III we introduce the basic form of our algorithm, which is similar to the user-user variant of collaborative filtering. We present heuristic variants of the algorithm that perform well in practice. In Section IV we present the main theoretical results of our paper as they pertain to matrix completion, showing provable convergence of the user-user (and by symmetry item-item) variant of our algorithm. In Section V we provide a discussion of our results and comparison with other models. In Section VI we discuss how to extend the algorithm and analysis to tensor completion. In Section VII we present experimental results from applying our methods to both matrix completion in the context of predicting movie ratings, and tensor completion in the context of image inpainting. The detailed proofs are presented in the Appendix.
II Setup
II-A Our Model
Suppose that there is an unknown matrix which we would like to estimate. We observe only a fraction of the total entries of with some noise added. Let denote the index set of observed entries. Specifically, we observe entries of data matrix that is generated as follows. Let be a binary matrix, which we call the masking matrix. We let denote the signal matrix and denote the noise matrix. For each ,
[TABLE]
For later use, we let denote the set of index pairs of the observed entries.
II-A1 Latent Variable Model
We assume is generated by the following model.
- •
Nonparametric model: there exists a latent function such that
[TABLE]
for all . Here, denote latent variables associated with row and column , respectively.
- •
Regularity Assumptions:
- –
For all , , where is a complete, separable metric space a.k.a. Polish space equipped with a Borel probability measure and is drawn i.i.d. according to .
- –
For all , , where is a complete, separable metric space aka Polish space equipped with a Borel probability measure and is drawn i.i.d. according to .
- –
Latent function is bounded, i.e. for all and , \big{|}f(\alpha,\beta)\big{|}\leq D_{f}.
- –
Without loss of generality, we shall assume that there exists and such that .
Our results will additionally require that the local measure is well-behaved such that there is a sufficient mass of nearest neighbors. In particular we introduce two concrete models that have good local neighborhood properties:
- (a)
Finite Types: Let be equipped with the discrete metric333 if and only if . topology and suppose has finite support in with denoting the support of . 2. (b)
Lipschitz Latent Function: Assume that the function is -Lipschitz in the sense that
[TABLE]
II-A2 Noise
We assume is a centered, sub-gaussian random variable for all such that
- •
Centered: for all .
- •
Sub-gaussian444The Orlicz -norm of a random variable is defined as \|X\|_{\psi_{2}}\triangleq\inf_{t>0}\{\mathbb{E}\left[\exp\big{(}\frac{|X|^{2}}{t^{2}}\big{)}\right]\leq 2\}.: there exists such that for all .
- •
Independent and identically distributed (i.i.d.): ’s are independent and identically distributed.
II-A3 Masking
We assume is a random matrix with each entry drawn as per for some , i.i.d. That is, for each ,
[TABLE]
Our algorithm will involve one threshold parameter which will need to be tuned. As a result of our analysis we can specify the optimal choice of the threshold as a function of which trades off between the bias and variance. In practice, all of these parameters are assumed to be unknown and the threshold parameter can be chosen via cross-validation.
II-B Problem Statement: Blind Regression
Note that the latent variable model representation in (2) is not the unique representation, as there exists multiple equivalent representations. Suppose that one applies a measure-preserving transformation on the latent feature space , and take the push-forward of with respect to () as the new latent function. This new representation – the pair of the latent space and the latent function – yields the same data generation process. Therefore, the question of estimating the function itself is not well posed, and we focus our energy on predicting the values .
Problem 1** (Blind Regression).**
Given that is partially observed on as described in Section II-A, we want to estimate the underlying matrix .
We call the problem of interest Blind Regression for the following reason. In the setting of Regression, one observes data containing features and associated labels; the goal is to learn the functional relationship (or model) between features and labels assuming that labels are noisy observations. In our setting, tuples are the relevant (but unobserved) features and are noisy observations of associated labels . We want to predict the value of for all pairs for . In the sense that we want to predict the function evaluated on new points given previous data, the task has the feel of Regression. However, the features are latent; and thus we use the term Blind Regression.
Given an estimator for the unknown matrix of interest, we use the mean-squared error (MSE) to evaluate the performance of the estimator, defined as
[TABLE]
The expectation here is taken over all sources of randomness in the data generation process: (i) realization of the latent variables; (ii) realization of the noise variables; and (iii) masking. An estimator is called consistent if . For a consistent estimator, we also want to establish an upper bound on the rate of convergence rate for the mean-squared error. Now we pose a follow-up question.
Problem 2**.**
Can we achieve a consistent estimator for in the setup of Problem 1? If so, how fast does decay to [math] for given as the problem size ?
II-C Exchangeability and Latent Variable Model
The latent variable model is well motivated and arises as a canonical representation for row and column exchangeable data, cf. [48] and [49]. Suppose that our data matrix is a particular realization of the first entries of a random array , which satisfies
[TABLE]
for every pair of permutations555The permutations over are defined in the usual manner where only finitely many indices are permuted. of . We use to denote that the joint distribution over is equivalently distributed as the joint distribution over , i.e. the random variables on both sides have the same distribution. Random array satisfying (4) is called exchangeable666To be precise, separately row and column exchangeable.. For an interested reader, [50] and [51] present overviews of exchangeable arrays.
In practice, the use of exchangeable arrays as a model is appropriate for variety of reasons. For example, in the setting of a recommendation system with anonymized data, this property may be reasonable if the order of the users in the system does not intrinsically carry information about the type of user; or in other words, if a user in the system could equally likely have been located in any row of the dataset.
In addition to exchangeability being quite a reasonable property for a wide variety of applications, it also leads to a convenient latent variable representation. The Aldous-Hoover representation theorem provides a succinct characterization for such exchangeable arrays. According to the theorem (see Corollary 3.3 in [51] for example), a random data array is exchangeable if and only if it can also be represented as
[TABLE]
where , \big{\{}\theta_{\textrm{row}}(u)\big{\}}_{u\in\mathbb{N}}, \big{\{}\theta_{\textrm{col}}(i)\big{\}}_{i\in\mathbb{N}}, \big{\{}\theta_{\text{entry}}(u,i)\big{\}}_{(u,i)\in\mathbb{N}\times\mathbb{N}} are independent random variables drawn uniformly at random from the unit interval , and is a measurable function indexed by the realization of . As described in [51], this suggests the following generative model:
Sample an instance of determining the governing function . 2. 2.
For every row and every column , independently sample uniform random variables , , . 3. 3.
Compute the realized data matrix according to
[TABLE]
By comparing the model from (5) with the latent variable model described in Section II-A1 (cf. (1), (2)), we can see that the latent variable model considered in this work is a restricted subclass of exchangeable models that additionally impose an additive noise model and regularity conditions on the function in exchange for a more general latent space and associated probability measure.
Equivalence of Models
In our model we have conditioned on the universal index777Equivalently, we may think that for all . Note that we do not know what is a priori. , such that given partial observations from matrix for a particular , our goal is to learn predicted outcomes of the realized . Our model takes the form of
[TABLE]
where are sampled independently. We can transform this to the form of (5) by considering to be equal to the realized function , considering the latent variables and to be higher dimensional representations of and in spaces and , and considering the noise term to be generated by applying some transformation to the variable . Given these transformations, it becomes equivalent that
[TABLE]
Instead of allowing to be any arbitrary measurable function over , our model additionally imposes regularity conditions on by requiring it to be bounded and either finite types or Lipschitz continuous with respect to a higher dimensional representation . From a modeling perspective, we are effectively transferring the model complexity from a potentially complex measurable latent function over to a simpler (e.g. Lipschitz) latent function over a potentially more complex latent variable space . The simple functional form provides analytic tractability for establishing theoretical results.
II-D Comparison with Other Models
In this section we discuss how our model relates to other models considered in the literature. In particular we would like to clarify which models are captured in the latent variable model and which are not. The key assumptions of our model are (1) bounded or Lipschitz latent function, (2) additive i.i.d. sub-gaussian noise, and (3) latent space being finite or more generally bounded Polish space.
II-D1 Low-rank Assumption
Suppose that the latent spaces and are equal dimensional vector spaces and the latent function is a bilinear form. This is a low-rank model with its rank being equal to . If we assume and have bounded diameters and are equipped with some metric topology, then this model is a specific case of our model with Lipschitz constant being \big{(}\textrm{diam}~{}\mathcal{X}_{\textrm{row}}+\textrm{diam}~{}\mathcal{X}_{\textrm{col}}\big{)} (up to a multiplicative constant). Our latent variable model additionally assumes that the latent feature vectors are randomly sampled i.i.d. according to some probability measure, whereas typical low rank results allow for arbitrarily chosen latent feature vectors that satisfy incoherence.
Note that we utilize a different measure of ‘model complexity’ than the rank of the parameter matrix. When the underlying function is truly bilinear and \min\big{\{}\dim\mathcal{X}_{\textrm{row}},\dim\mathcal{X}_{\textrm{row}}\big{\}}\ll\textrm{diam}~{}\mathcal{X}_{\textrm{row}}+\textrm{diam}~{}\mathcal{X}_{\textrm{row}}, then rank of the data matrix could be a better measure to capture the essential complexity of the model; note that the rank of the data matrix does not scale as the diameter of the latent space increases. However, if the underlying model is nonlinear, the data matrix is likely to have full rank, e.g., when for some monotone increasing nonlinear function . In particular, when the size of the matrix far exceeds the ‘intrinsic model complexity’ (and hence so does the rank of the data matrix), it would make more sense to use our model.
Suppose that there is an underlying low rank matrix, yet the observation is an entrywise nonlinear monotone transformation of the low rank matrix; this is also known as single index models. [17] pointed out that adding nonlinearity could easily result in the underlying matrix no longer being low rank, requiring more complex approaches to estimate. In contrast, our model very easily handles nonlinearities as long as they satisfy local smoothness conditions.
II-D2 Biclustering, Submatrix Detection, Stochastic Block Model
The models used in biclustering, submatrix detection, and planted clique assume that there is some submatrix for which the data is significantly shifted in expectation compared to the rest of the matrix which is assumed to be uniform plus noise [52, 53]. This would correspond to “finite types” in our model, which can be modeled as a piecewise constant function. The desired goal is to detect or identify the deviant submatrix corresponding to a subset of the rows and columns. In contrast our task focuses on estimating the expected matrix. Given an estimation algorithm that could guarantee max error bounds entrywise or for each row/column, we could simply threshold to obtain an estimate for the deviant submatrix. Another distinction is that the literature in submatrix detection focuses on understanding thresholds of the minimal size submatrix that is detectable. However, in our setting as the latent feature vectors are sampled from a fixed distribution, the size of the deviant submatrix that our model studies would always be proportional to as long as the probability of sampling a deviant row or column is bounded below by a fixed constant. The stochastic block model also assumes finite types modeled through a piecewise constant function, which would fit within our latent variable model. It is commonly used to study the task of clustering, where the goal is to recover a true underlying partitioning of the rows and columns rather than only estimate the expected matrix [54]. Their setting specifically assumes a binary observation model, while our model assumes identically distributed additive sub-gaussian noise.
II-D3 Strong Stochastic Transitivity, Statistical Seriation, Low Permutation-Rank Matrices
There have been many models introduced to study matrices that arise from pairwise comparisons for the purposes of ranking or estimation. Many parametric models such as Bradley-Terry-Luce or Thurstone, can be described by a smooth function of scalar latent row and column variables. Our model adds the additional assumption that the latent variables must be sampled i.i.d. from a bounded diameter subspace. There have also been a series of nonparametric models introduced, in particular the strong stochastic transitivity assumption, which only requires that there exists some permutation of the rows and columns such that the entries are monotonically increasing along the permuted rows and columns [55, 56]. Statistical seriation assumes that there is some permutation of the rows and columns such that after the permutation, all rows and columns have the same shape, which could include monotonicity or other shape constraints [57]. Low permutation-rank matrices assume that a mixture model over strong stochastic transitive matrices, i.e. a mixture over permutation matrices [58].
Our latent variable model can encompass a subset of such matrices by imposing monotonicity assumptions on our latent function. However our model additionally assumes latent space being Polish space endowed with Borel measure, which are not required for permutation matrices. Similarly, a subset of mixtures of permutation matrices can be modeled by latent functions which are a mixture of different monotonic functions; however this may not be able to fully encompass all low permutation-rank models. If one were to impose boundedness on the entries, then it is possible that as the size of the matrix grows, strong stochastic transitivity might imply the existence of good nearest neighbor rows, which would then allow our algorithm to also perform well, but that remains an important direction to explore.
III Algorithm
Our algorithm builds on intuition from local approximation methods such as kernel regression. Therefore it takes the form of a similarity-based method, which first defines a kernel, i.e. similarity between pairs of rows or columns, and then computes the estimate for each matrix entry by averging over datapoints that are determined to be close to the entry of interest according to the ‘similarity.’
We present the general form of our algorithm in Section III-A. In Section III-B, we describe the basic user-user fixed radius nearest neighbor algorithm888This is equivalent to a variant of the classical similarity based collaborative filtering methods. as an example of the algorithm with a concrete choice of similarity and the averaging scheme to define the estimate. In Section III-C, we describe other variations of our algorithm, e.g., a variation of the algorithm that combines both row and column similarities to compute the kernel between datapoints.
The user-user nearest neighbor algorithm serves as our prime example and we provide theoretical guarantees for a vanishing upper bound on its mean squared error later in Section IV. Also, we show experimental results that suggest combining row and column similarities improve the quality of estimates; see Section VII.
III-A General Form
Our algorithm takes (i) the data matrix , (ii) a rule for sample splitting, and (iii) a thresholding parameter as its inputs. The algorithm outputs .
Specifically, our algorithm determines the set of reliable neighbors by considering the ‘behavioral’ similarity in the function values. We need to define a similarity/dissimilarity statistic that can capture such similarities as well as can be computed from the data matrix. In Section III-B, we describe a version of our algorithm that utilizes the squared distance as a measure of dissimilarity, cf. Algorithm 2. This version of algorithm is the prime example we consider in this work and we provide a theoretical guarantee on its performance in Section IV.
We remark that some similarity functions can be preferred over others, depending on the model assumptions. For example, the squared distance is a natural choice when the latent space is assumed to be Euclidean, while the cosine similarity (= the angle between the latent variables) would be a more faithful measure of similarity when the latent space is the projective space (= spherical). We briefly discuss this matter in Section III-C with additional examples of variations of our algorithm.
Remark 1*.*
There are many variations in (1) how we determine the set of reliable neighbors (e.g., by defining similarity/dissimilarity functions), and (2) how we compute the final estimate based on the observations at those neighbors.
Remark 2*.*
We note that sample splitting is done for the ease of analysis, and is not essential in executing the algorithm.
III-B Prime Example: User-user Fixed Radius Nearest Neighbor (of the [math]-th order)
In this section, we describe our prime example of the generic algorithm described in Algorithm 1, namely, the user-user fixed radius neighbor algorithm. In this version of algorithm,
- •
we measure the dissimilarity between two rows with the squared distance between the rows at the overlapping column indices; specifically, we define the dissimilarity between two rows as
[TABLE]
with the convention , and
- •
for each , we estimate by averaging for such that and is similar to the row . We define that is similar to if the squared distance between their rows is no greater than some threshold , which is a tunable parameter input to the algorithm.
See Algorithm 2 for the full description. We state the algorithm for estimating a single entry , which affects the sample splitting rule. The sample splitting is used for a cleaner analysis, but we believe that the results should extend without sample splitting as well.
Remark 3*.*
Note that Algorithm 2 is equivalent to the classical user-user fixed radius nearest neighbor collaborative filtering algorithm. The algorithm analyzed in the preliminary version of this work, cf. [59], is similar but not identical to Algorithm 2. The algorithm in [59] is motivated by the kernel regression of the first order and it is asymptotically equivalent to a mean-adjusted variant of the user-user -nearest neighbor collaborative filtering algorithm. On the other hand, Algorithm 2 does not adjust the estimate the empirical means and it is kernel regression of the zeroth order in that sense.
Intuition behind the Algorithm
The following equations provide intuition for the reason why Algorithm 2 works. Assuming the row latent variables are instantiated as and , the mean squared error of the resulting estimate is given as999In fact, is not deterministic but random. We provide a complete analysis with formal proofs in Section IV and Appendix A.
[TABLE]
Here, \mathbb{E}_{\textrm{cond}}\left[~{}\cdot~{}\right]=\mathbb{E}\big{[}~{}\cdot~{}\big{|}~{}x_{\text{row}}(1),\ldots,x_{\text{row}}(m),|\mathcal{B}_{\text{est}}(u,i)|$$=N_{\text{base}}(\eta)\big{]} denotes the conditional expectation. This expression shows that the expected squared error conditioned on row latent variables is directly related to the squared distance between the slices of the latent function associated to rows and . The good news is that the distance can in fact be estimated from the data itself. For any pair of row indices ,
[TABLE]
Because of the concentration of measure, we expect the dissimilarity between rows and defined in (6) to be close to \big{\|}f(x_{\text{row}}(u),\cdot)-f(x_{\text{row}}(v),\cdot)\big{\|}_{L^{2}}^{2}+2\sigma^{2} as the size of the overlap increases.
Remark 4*.*
For fixed rows and columns, choosing a large leads to the increase in the size of , which results in the increase in the first term in (7) and the decrease in the second term in (7). This demonstrates a bias-variance tradeoff associated to the choice of algorithmic parameter .
Remark 5*.*
Our algorithm does not require any prior knowledge on the model parameters such as the regularity parameters , diameter of and the noise variance . The algorithmic parameter can be chosen arbitrarily. However, we need to be in a certain range to achieve good theoretical guarantee for the upper bound on its MSE. See Theorem 1.
III-C Additional Examples of Our Algorithm: Other Variations
In this section, we exhibit other variations of our algorithm and provide intuition for them. However, we do not formally prove error bounds for these variations in this paper.
III-C1 Item-item and User-item Variants
In Section III-B, we described a user-user fixed radius nearest neighbor algorithm that utilizes the (dis-)similarity between rows, for . We can apply the same idea to use the similarity between columns, for . Moreover, once the similarity between rows and columns are identified, we can define similarity between any pair of index tuples based on the and . For example, we may define .
III-C2 Kernels for Weighting
Once the (dis-)similarities between rows and columns are identified, we need a weighting scheme to define the estimates. For example, in Algorithm 2, we use the the hard-thresholding kernel in which the weight is if the row dissimilarity is less than and the weight is [math] otherwise. However, this choice is not necessary and our algorithm can be implemented with any choice of kernels (i.e. weighting scheme).
Example 1** (Gaussian Kernel Weights).**
Given , suppose that we have computed dissimilarity between and for all , e.g., by defining . Then, given a bandwith parameter , we define the weights according to a Gaussian kernel101010One may doubt why we call this Gaussian kernel instead of Laplacian kernel. The reason is that we treat \textrm{dissim}((u,i),(v,j))\big{)} as a proxy of the squared distance. as: for each ,
[TABLE]
and then define
[TABLE]
When , the estimate only depends on the ‘nearest’ neighbor . On the other hand, when , the algorithm equally averages all the for . Empirically, this variant of the algorithm seems to perform very well with an appropriate selection of the bandwidth parameter , which can be tuned using cross validation.
III-C3 Other Ways to Define Dissimilarity
We suggest other alternatives to define similarities.
Higher Order Information
Algorithm 2 detects reliable neighbors and estimates by averaging . This procedure is equivalent to traditional local smoothing, or kernel regression of the [math]-th order, except that we needed to estimate the ‘proximity’ of the latent features from the data matrix. If we consider as a constant function, then can be viewed as the evaluation of the average of the constant function at .
The next question naturally arises: “Can we come up with an algorithm that is equivalent to higher-order kernel regression algorithm?” For example, one can build linear regression estimators centered at for each , instead of considering the constant function at the level of . Evaluating the average of such linear estimators at will yield a different, probably more refined, estimate . This turns out to be equivalent to kernel regression of the -st order with latent features. This version of algorithm is analyzed in the preliminary version of this paper, cf. [59].
Beyond the Euclidean Latent Space
In our proposed algorithm, we defined the dissimilarities between the rows and the columns using empirical distance of the slices of the latent function. Depending on the geometry of the latent space, other ways of defining dissimilarities can be favorable. For example, suppose that only the ‘direction’ of the rows in the data matrix matters, while the amplitude does not carry much information. Then it would make sense to measure the dissimilarity between rows by estimating their ‘angles’ instead of their Euclidean distance111111To be precise, we want to define the dissimilarity between the equivalent classes of the rows. In this example, the equivalence relation is defined by positive scaling. In fact, this is what cosine similarity measures, which is commonly used in classical collaborative filtering.
More importantly, the canonical latent space in the data generation process can be non-Euclidean. For example, the latent variables could be drawn from a sphere121212more generally, from a Riemannian manifold with non-zero curvature. Cosine similarity makes use of this knowledge, while the dissimilarity does not – it essentially isometrically embeds the sphere to a higher-dimensional Euclidean space and then uses the metric of the ambient space. This could be problematic because one may need a much larger dimension for Euclidean embedding than the intrinsic dimension. Additionally in some cases it may be impossible to find an isometric embedding to the Euclidean space of any finite dimension; in particular, when the latent space is negatively curved. One prominent example of such latent space is the tree equipped with the geodesic distance; it is impossible to find an isometric embedding of an infinitely growing -regular tree to for any .
Both the question of finding a good similarity/dissimilarity function that utilizes the structure of the latent space and the question of systematically exploiting higher-order information of the latent function remain important direction of future research.
III-D Computational Complexity
In order to compute , we need to average differences over the set . As the sparsity of each row is in expectation , the cost of the similarity computation is in expectation bounded by . We need to compute for all pairs. For each row , we need to compute the set , which takes . To compute the final estimate for each entry we need to average over the set , which in expectation is bounded above by . Therefore, the computational complexity is bounded by .
The algorithm can be parallelized into two phases. In phase 1, we can in parallel compute for all pairs, each taking computation time . In phase 2, we can compute the estimates in parallel for all indices of the matrix, each taking computation time . Therefore assuming we have processors that can compute simultaneously, the parallelized computational complexity is .
IV Main Results
IV-A Theorem Statement
We present a theorem which upper bounds the MSE of the estimate produced by the user-user fixed radius nearest neighbor algorithm presented in Section III-B (cf. Algorithm 2). Recall from our problem statement in Section II that is the probability each entry is observed, is an upper bound on magnitude of the latent function, and is an upper bound on the noise variance131313In fact, . We may identify with an upper bound on the variance, up to an absolute constant..
IV-A1 Informally Stated Asymptotic Upper Bounds
We present an informal version of our main theorem under a simplified setting for ease of exposition. The general form of main result is stated in Theorem 1 which is rather involved. To that end, consider a simplified setting where there are only finite types of row latent variables. In that case, we obtain the following result.
Theorem** (Informal Statement of Corollary 1).**
Suppose that the measure for the latent row space has finite support. If p\gg\max\Big{(}\frac{1}{m},\sqrt{\frac{\log m}{n}}\Big{)}, then
[TABLE]
Note that the prefix constant is the measure of variability (complexity) for the latent function (up to noise).
There are two main sources of the estimation error: (i) the error in estimation of similarities (step 2 of Algorithm 2) and (ii) the error in approximation by smoothing (step 3 of Algorithm 2). Intuitively, empirical estimates of similarity/dissimilarity between two rows becomes more accurate as , which is equal to the expected size of the overlap, increases. This is captured by the term .
Given the similarities are sufficiently accurate, the estimation error is dominated by the approximation error. There are types of rows and number of available samples, hence, in expectation. This captures the second term in the error. This also reflects the impact of local geometry of the measure on the estimation error.
IV-A2 Formal Statement of the Main Theorem
We define the function for and as
[TABLE]
Theorem 1 presents an upper bound on the mean-squared error (cf. (3)) of the user-user Fixed radius nearest neighbor algorithm described in Algorithm 2. Before presenting the theorem statement, we remark that
[TABLE]
due to the exchangeability of the model and the linearity of the expectation. In Theorem 1, we provide an upper bound on \mathbb{E}\left[\big{(}\widehat{A}(1,1)-A(1,1)\big{)}^{2}~{}\big{|}~{}x_{\text{row}}(1)\right], conditioned on , which is the latent feature of the first row.
Theorem 1** (Main Theorem).**
Let be the estimator returned by the Algorithm 2. Let constant K\triangleq\Big{(}\frac{2D_{f}}{\sqrt{\ln 2}}+2\sigma\Big{)}^{2}. Suppose that our algorithm uses threshold parameter \eta\geq\eta^{\prime}+K\max\bigg{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\bigg{)} for some . Then
[TABLE]
In the above expression, are absolute constants, and \phi(x,r)=\mathbb{P}_{x_{\text{row}}(v)\sim\mu_{\mathcal{X}_{\textrm{row}}}}\left(\big{\|}f(x,\cdot)-f(x_{\text{row}}(v),\cdot)\big{\|}_{L^{2}}^{2}\leq r\right) for any and .
The threshold is a tunable parameter that our algorithm uses. Our analysis captures the bias-variance tradeoff associated with the parameter . When is too large, our upper bound becomes loose because the algorithm utilizes information from less reliable neighbors; on the other hand, when is chosen too small, the resulting estimate count only on a few neighbors and suffers from large variance.
The first two terms of the MSE bound come from bounding the bias of the estimator. The first term \big{(}\eta-2\sigma^{2}\big{)} reflects the bias that is unavoidable due to the selection of the threshold and the error in estimating , and the second term bounds the tail of the bias along with the bad event that is too small to produce a good estimate of . The remaining three terms bound the variance of the estimator; the error bound for the trivial estimates (when ), given in the form of is also subsumed in these three terms.
Taking expectation of the bound in Theorem 1 with respect to , we obtain the desired bound on MSE, i.e.
[TABLE]
IV-B Implications
Theorem 1 provides an upper bound on MSE for Algorithm 2, which depends on the underlying probability measure and the latent function through . Here we evaluate this implicit bound for three special examples:
- •
The latent space has finitely many elements, or equivalently has finite support.
- •
The latent space is unit hypercube in a finite dimensional space, latent function is Lipschitz.
- •
The latent space is a complete, separate metric space aka Polish space with bounded diameter, latent function is Lipschitz.
IV-B1 Finite Types
We state the following Corollary of Theorem 1 when has finite support.
Corollary 1** (Finite support).**
Suppose that is equipped with the discrete metric141414 if and only if . topology and has finite support in with denoting the support of . Let K=\Big{(}\frac{2D_{f}}{\sqrt{\ln 2}}+2\sigma\Big{)}^{2}, , and
[TABLE]
If p\geq\max\bigg{(}\frac{8}{m-1},~{}\Big{(}4\vee\sqrt{\frac{2}{c}}\Big{)}\cdot\sqrt{\frac{\log(m-1)}{n-1}}\bigg{)}, then
[TABLE]
where are absolute constants.
That is, as long as p\gg\max\Big{(}\frac{1}{m},\sqrt{\frac{\log m}{n}}\Big{)}, under discrete measure with finite support
[TABLE]
Proof of Corollary 1.
Our interest is in bounding
[TABLE]
and
[TABLE]
from the statement of Theorem 1 to obtain the desired result. Since we are considering discrete metric with measure having finite support, for any and , . Given choice of , . Therefore, (9) can be written as
[TABLE]
Similarly, (10) reduces to
[TABLE]
In above, we have used an easy to verify fact that \sup_{\theta\in[0,1]}\theta\exp\bigg{(}-\frac{(m-1)p}{8}\theta\bigg{)} is achieved for . Replacing (IV-B1) and (IV-B1) in the statement of Theorem 1, and realizing that other terms in the bound of Theorem 1 are asymptotically smaller order, we obtain (1). ∎
IV-B2 Uniform Measure over , Lipschitz Latent Function
Let denote the Lipschitz constant of the latent function . We consider the setting where is uniform Lebesgue measure on , the unit cube in dimension. We define for any . By Lipschitzness, implies \big{\|}f(x_{\text{row}}(1),\cdot)-f(x_{\text{row}}(v),\cdot)\big{\|}_{L^{2}}^{2}\leq\eta^{\prime}-2\sigma^{2}. Therefore,
[TABLE]
There exists universal constants such that for any , and and ,
[TABLE]
We shall assume that as well. Therefore,
[TABLE]
and hence since there exists such that . Now, we state the following implication of Theorem 1 in this setting.
Corollary 2** (Unit cube and the Lebesgue measure).**
Let be the uniform measure on . Let K=\big{(}\frac{2L\sqrt{d}}{\sqrt{\ln 2}}+2\sigma\big{)}^{2}, \eta^{\prime}=2\sigma^{2}+\alpha^{2/d}\beta^{2}L^{2}\big{(}(m-1)p\big{)}^{-\frac{2}{d+2}}, and
[TABLE]
If p\geq\max\bigg{(}\frac{8}{m-1},~{}\Big{(}4\vee\sqrt{\frac{2}{c}}\Big{)}\cdot\sqrt{\frac{\log(m-1)}{n-1}}\bigg{)}, then
[TABLE]
where are absolute constants.
That is, as long as p\gg\max\Big{(}\frac{1}{m},\sqrt{\frac{\log m}{n}}\Big{)}, under the uniform measure,
[TABLE]
Proof of Corollary 2.
Similar to proof of Corollary 1, our interest is in bounding (9) and (10). To that end, under uniform Lebesgue measure , on dimensional unit cube , it follows that for any and
[TABLE]
Therefore, by choice of
[TABLE]
Similarly,
[TABLE]
From (IV-B2) and (IV-B2), and statement of Theorem 1, and realizing that other terms in the bound of Theorem 1 are asymptotically smaller order, we conclude (2). ∎
IV-B3 Bounded Polish Space, Lipschitz Latent Function
Let denote the Lipschitz constant of the latent function . We define for any . By Lipschitzness, implies \big{\|}f(x_{\text{row}}(1),\cdot)-f(x_{\text{row}}(v),\cdot)\big{\|}_{L^{2}}^{2}\leq\eta^{\prime}-2\sigma^{2}. Therefore,
[TABLE]
We consider the row latent space to be a complete, separable metric space, i.e. a Polish space151515Recall that a metric space is separable if it has a countable dense subset, and complete if every Cauchy sequence converges within the space.. Let the diameter of the space be bounded, i.e. there exists finite such . We shall also assume that the diameter of is bounded by as well. This implies
[TABLE]
since there exists such that .
An important consequence of being Polish is that is tight, i.e. for any , there exists a compact set such that . Due to compactness of and the space being Polish, there exists a finite number of balls of any given radius of choice such that they cover . That is, for any , the effective covering number is always finite. Let for denote the collection of balls so that . By construction, \mu_{\mathcal{X}_{\textrm{row}}}\Big{(}\cup_{i\in[N_{\text{eff}}(\mathcal{X}_{\textrm{row}},\varepsilon,\delta)]}B(x_{i},\varepsilon)\Big{)}\geq\mu_{\mathcal{X}_{\textrm{row}}}(S_{\delta})\geq 1-\delta. Define . Let . Therefore,
[TABLE]
since . Therefore,
[TABLE]
For any , there exists such that . Hence, . Therefore, for any ,
[TABLE]
where \mu_{\mathcal{X}_{\textrm{row}}}\Big{(}S^{\text{GOOD}}(\delta,\varepsilon)\Big{)}~{}\geq~{}1-2\delta. By (16) it follows that for with \mu_{\mathcal{X}_{\textrm{row}}}\Big{(}S^{\text{GOOD}}(\delta,\varepsilon)\Big{)}\geq 1-2\delta,
[TABLE]
for any choice of . Let us choose where . Replacing this choice of in (17), we obtain that for any , there exists a set such that (i) \mu_{\mathcal{X}_{\textrm{row}}}\big{(}S^{\prime}(\delta)\big{)}\geq 1-2\delta and (ii) for all ,
[TABLE]
For all , we define function to be
[TABLE]
For any , is finite, and thus for , . We can verify that is also monotonically non-increasing. Therefore, it follows that
[TABLE]
Using the above developed machinery, now we are ready to bound as summarized in Corollary 3.
Corollary 3** (Bounded Polish Space).**
Let be any measure on , which is assumed to be a bounded diameter Polish space. Let the diameter of and be bounded above by . Let K=\big{(}\frac{2L\sqrt{d}}{\sqrt{\ln 2}}+2\sigma\big{)}^{2}, , and
[TABLE]
If p\geq\max\bigg{(}\frac{8N_{\text{eff}}(\mathcal{X}_{\textrm{row}},\frac{1}{2L},1)}{m-1},~{}\Big{(}4\vee\sqrt{\frac{2}{c}}\Big{)}\cdot\sqrt{\frac{\log(m-1)}{n-1}}\bigg{)}, then
[TABLE]
where are absolute constants, and as .
As an immediate consequence of Corollary 3, it follows that as long as p\gg\max\Big{(}\frac{1}{m},\sqrt{\frac{\log m}{n}}\Big{)}, for any measure on a bounded Polish space,
[TABLE]
This implies that our estimator is consistent for any latent variable model with bounded Polish space as long as for any and .
If , for any , . Therefore, . We can conclude that as long as p\gg\max\Big{(}\frac{1}{m},\sqrt{\frac{\log m}{n}}\Big{)}, and ,
[TABLE]
with universal constant . It is worth noting the similarity between (2) and (IV-B3) – the only difference is the instead of in the denominator of the exponent for term . This is precisely the minimal cost of generalizing from the specific uniform distribution to any arbitrary distribution.
Proof.
Since , as argued before, is well defined and we shall choose . From the choice of , for any , it follows from (18) that . Therefore, for any ,
[TABLE]
where the last inequality follows from the fact that . Similarly,
[TABLE]
where again the last inequality follows from the fact that . Replacing (IV-B3) and (IV-B3) in the statement of Theorem 1, and realizing that the other terms in the bound of Theorem 1 are asymptotically smaller order and \exp\big{(}-\frac{1}{8\delta}\big{)}\leq\delta for any , we obtain
[TABLE]
By Cauchy-Schwarz inequality and the fact that , we have
[TABLE]
In what follows, we shall argue that
[TABLE]
By (IV-B3), (IV-B3) and (25), the main claim (3) follows. With that in mind, we shall now establish (25) to conclude the proof. Recall from Algorithm 2 that
[TABLE]
in our user-user fixed radius nearest neighbor algorithm when and when . Therefore,
[TABLE]
By the assumption that the magnitude of the latent function is bounded by ,
[TABLE]
By introducing the notations
[TABLE]
we can rewrite the expected squared error as
[TABLE]
where we have used the independence of terms in and . It immediately follows that is a bounded random variable with . Therefore,
[TABLE]
The randomness influencing the selection of is independent of the random variables in the summation of term . Conditioned on for any , is simply a summation of i.i.d. random variables, each with norm bounded by and zero mean. Therefore, it follows that conditioned on for any ,
[TABLE]
From (IV-B3)-(28), the inequality (25) follows. This completes the proof of Corollary 3. ∎
V Discussion
V-A Intuition
V-A1 Structural Assumptions and Sample Complexity
Without any structure assumed on the latent function and the latent spaces , one would need number of samples to recover the matrix. Our framework relies on two key assumptions to reduce the sample complexity: (i) (most of) the latent spaces can be covered by balls centered at a few representative points; and (ii) the latent function is regular (Lipschitz) and hence the proximity of two points in the latent space results in the similarity of the function values, which is observable from data.
Given a latent space , we want to cover the entire space minus a small fraction that has negligible measure , with a collection of balls of radius . In other words, our model considers the metric entropy161616The smallest number of bits that suffices to specify every point in the set with accuracy in the metric . of minus -fraction as a measure of complexity. However, we don’t have direct access to the latent features, but the distance between two points must be estimated from the pattern of the associated function values. If the latent function were an isometry, the distance in the image would faithfully reflect the distance in the domain. Lipschitzness assumption on is a robust analogue of the isometry assumption; if the latent function is -Lipschitz, then the distance in the domain (= latent space) cannot be inflated more than times in the image; therefore, the image of the latent function for close neighbors in the latent space will behave in a similar fashion. With the Lipschitz assumption, the cost of indirect measurement is no more than , and it suffices to consider a -covering of the latent space instead of a -covering.
V-A2 MSE Upper Bounds and the Rate of Convergence
The minimax optimal rate for nonparametric regression is where is the number of observations uniformly distributed in a -dimensional hypercube. In our algorithm we use the data across columns to learn similarities between the rows, yet when we actually compute the final estimates, we only use the datapoints in each column separately. The number of observations per column is in expectation. In the case when row latent features are sampled from a uniform measure over a -dimensional cube, the second term of our MSE bound from Corollary 2 is . This indicates that the second term of our MSE bound is optimal for any estimator that uses entries in each column separately. In particular, even if the algorithm were given oracle knowledge of the row latent features it could not do better as long as it constrains itself to estimating each column separately. An algorithm which would average both similar rows and columns could be able to improve beyond this bound. The first term of our MSE bound comes from the step of estimating for all pairs of rows; however this first term is quickly dominated by the second term as increases, suggesting that the second term dominates the MSE. This suggests that our analysis is tight for our estimator.
Our result can be compared with the upper bound on the MSE for the UVST estimator as presented in Theorem 2.7 of [19]. For simplicity, consider the setting of a square matrix, i.e. . For a matrix sampled from the latent variable model with latent variable dimension , their theorem guarantees that
[TABLE]
for some constant as long as . This upper bound is meaningful only when , because the MSE bound in (29) is bounded below by when . However, requiring can be too restrictive when the latent dimension is large since it means that we need to sample almost every entry to achieve a nontrivial bound. Chatterjee’s result stems from showing that a Lipschitz function can be approximated by a piecewise constant function, which upper bounds the rank of the (approximate) target matrix. This global discretization results in a large penalty with regards to the dimension of the latent space.
In recent work, [20] extends the analysis of USVT for graphon estimation, when the observation matrix is binary. He shows if the latent function is -Hölder smooth, the spectrum decays polynomially, and thus the MSE of the USVT estimator is bounded above by
[TABLE]
This refined bound shows that at least for the binary observation case, the USVT estimator is consistent as long as , independent from the latent dimension , relying on the quick decay of the spectrum.
Our algorithm and analysis provides a vanishing upper bound on the MSE whenever , also independent of the latent dimension. In fact even as grows with , as long as , our analysis guarantees that our algorithm achieves a vanishing MSE. Our analysis relies on the “local” structure of the latent space; even if the latent dimension increases, we only need to ensure that there are sufficiently many close neighbor rows so that nearest neighbor averaging produces a good estimate.
V-B Future Work
Our analysis seems tight due to the comparison with minimax rates for noparametric regression, however it is likely that a model which could average entries across both rows and columns would be able to improve the MSE bounds. In particular, the minimax optimal rates for nonparametric regression would imply a lower bound of
[TABLE]
in the setting where both row and column latent features are sampled uniformly from a -dimensional hypercube. This is achievable with an oracle estimator that were given knowledge of the latent features, but we do not know of information theoretic lower bounds for the MSE that are specific to the latent variable model where features are unobserved. For specific settings such as when the function when considered as an integral operator has finite spectrum, it is equivalent to low-rank models, for which lower bounds have been characterized. For specific noise models such as the binary observation model which corresponds to the graphon generative model for random graphs, [60, 61] show that variants of the least squares estimator achieve optimal rates, but unfortunately they are not polynomial time computable.
From an implementation perspective, the similarity based algorithm proposed in this work, similar to classical collaborative filtering methods, is easy to implement and scales extremely well to large datasets, as it naturally enjoys a parallelizable implementation. Furthermore, the operation of finding nearest neighbors can benefit from computational advances in building scalable approximate nearest neighbor indices, cf. [62, 63].
Next we discuss some natural extensions and directions for future work. In our model, the latent function is assumed to be Lipschitz. However, the proof only truly utilizes the fact that “locally” the function value does not oscillate too wildly. Intuitively, this suggests that the result may extend to a broader class of functions, beyond Lipschitz functions. For example, a function with bounded Fourier coefficients does not oscillate too wildly, and thus it may behave well for the purposes of analyzing our algorithm.
Another possible direction for extension is related to the measurement of similarity and the sample complexity. Our current algorithm measures the similarity of rows and from their overlapping observed entries, which critically determines the sample complexity requirement of . However, for sparser datasets without overlaps, we may be able to reveal the similarity by instead comparing distribution signatures such as moments or comparing them through their “extended” neighborhoods [64].
As a concluding remark, we would like to mention that the latent variable model is a fairly general model and there is a large body of related applications. Some of the popular recent examples, which are special cases of latent variable model, include Stochastic blockmodels for community detection, the Bradley-Terry model for ranking from pair-wise comparison data and the Dawid-Skene model for low-cost crowd sourcing. Another prominent example of latent variable model is the generative model for random graphs referred to as a Graphon, which has been shown to be the limit of a sequence of graphs. We refer interested readers to [19, Section 2.4] for an excellent overview on the broad applicability of the latent variable model.
VI Extending Beyond Matrices to Tensors
A natural extension beyond matrix completion is the problem of completing a tensor of higher () order. Given an unknown tensor of order with dimensions , suppose that we observe a fraction of its entries corrupted by noise. Similar to matrix completion, the goal in tensor completion is to estimate the missing entries in the tensor from the noisy partial observations, as well as to “de-noise” the observed entries.
VI-A Short Background
The tensor completion problem is important within a wide variety of applications, including recommendation systems, multi-aspect data mining [65, 66], and machine vision [67, 68, 69].
Although tensor completion has been widely studied, there is still a wide gap in understanding, unlike matrix completion. This gap partially stems from the hardness of tensor decomposition, as most recovery methods rely on retrieving hidden algebraic structure through the framework of low-rank factorization. Tensors do not have a canonical decomposition such as the singular value decomposition (SVD) for a matrix.
There is a factorization scheme, namely the CANDECOMP/PARAFAC (CP) decomposition, which factorizes the tensor as a sum of rank-1 tensors (outer product of vectors). However, it is known that finding the rank of a tensor is NP-Complete, which makes it computationally intractable. Also, there are known ill-posedness [70] issues with CP-based low-rank approximation.
There are other kinds of decompositions such as the Tucker decomposition. Approaches based on Tucker decomposition essentially unfold (matricize or flatten) the tensor, and make use of matrix completion theory and methods [71, 72, 73, 67, 74].
VI-B Setup for Tensor Completion
VI-B1 Revisiting Our Model
The nonparametric model presented in Section II-A naturally extends beyond the bivariate case that corresponds to matrices, to multivariate setups encompassing higher-order tensors. We recap the latent variable model for a tensor following similar assumptions.
Suppose that there is an unknown tensor that we would like to estimate. We observe only a fraction of the total entries of with some noise added. Let denote the index set of observed entries.
Precisely, we consider the following data generation model. Our data tensor is an tensor that is generated as follows. Let be a binary matrix, which we call the masking tensor. We let denote the signal tensor and denote the noise tensor. For each ,
[TABLE]
Similar to the model assumptions for the matrix case in Section II-A1, we assume is generated by the following latent variable model equipped with certain regularity assumptions.
- •
Nonparametric model: there exists a latent function such that
[TABLE]
for all . Here, denote latent variables associated with index in the -th coordinate of the tensor , respectively.
- •
Regularity Assumptions
- –
For each , the latent variables for all , where is a metric space such that . Moreover, is equipped with a Borel probability measure and is drawn i.i.d. according to .
- –
Latent function is bounded, specifically that there exists a constant such that for all , .
- –
Latent function is -Lipschitz in the sense that
[TABLE]
Note that our model assumptions for the noise matrix and the masking matrix are stated in an entrywise fashion and readily extends to their tensor analogues, cf. Section II-A.
VI-B2 Flattening a Tensor to a Matrix
A -order tensor can be viewed as a -dimensional array of numbers. It is possible to ‘flatten’ the tensor to a matrix (i.e. 2-dimensional array) by rearranging the numbers in the -dimensional array. Formally, given a set , we define to be a by matrix obtained by flattening the original tensor. Without loss of generality171717by taking transpose of , we may assume for some . We index the rows and the columns of using a -tuple and a -tuple, respectively. That is to say, for any and , we have
[TABLE]
VI-C Tensor Completion Algorithm
We describe our generic recipe for tensor completion in Algorithm 3. We remark here that any matrix estimation algorithm can be used as the matrix estimation subroutine in Step 2.
VI-D Analysis
Suppose that Algorithm 2 is used as the matrix estimation subroutine in Algorithm 3. We can obtain an MSE upper bound for the tensor completion algorithm, which is similar to that stated in Theorem 1. We do not include a formal theorem statement and its proof here, but we briefly point out what remains unchanged and what needs to be modified in the proof of Theorem 1 (cf. Appendix A) to obtain its counterpart for tensor completion.
Given a -order tensor and an index set , we consider the flattened matrix, . Since this matrix is obtained by flattening , the ‘rows’ and the ‘columns’ of are not fully exchangeable – they satisfy only ‘partial’ exchangeability induced by the exchangeability in . As a result, we cannot assume the latent variables associated with rows of are drawn i.i.d. from a latent space in the current setup. In fact, there are only number of independent latent random variables ( from for each ) associated to the rows of the flattened matrix. Similarly, the latent variables associated to the columns of are also not sampled i.i.d. as they involve shared coordinates in the original tensor. This difference adds a complication to the analysis of Algorithm 3, but its effect is limited to the Step 1 in the proof of Lemma 2.
Following the discussion in Appendix A-A, we observe that there are independent sources of randomness in our model for : \big{\{}x_{i}(\alpha_{i})\big{\}}_{i\in[\tau]\atop\alpha_{i}\in[n_{i}]}, \big{\{}N(\vec{\alpha})\big{\}}_{\vec{\alpha}\in[n_{1}]\times\dots\times[n_{\tau}]}, \big{\{}M(\vec{\alpha})\big{\}}_{\vec{\alpha}\in[n_{1}]\times\dots\times[n_{\tau}]}. For the sake of simplicity, we may assume for some . For and , we let \vec{x}_{\text{row}}(\vec{\alpha}^{(1)})=\big{(}x_{1}(\alpha^{(1)}_{1}),\dots,x_{\upsilon}(\alpha^{(1)}_{\upsilon})\big{)} and \vec{x}_{\text{col}}(\vec{\alpha}^{(2)})=\big{(}x_{\upsilon+1}(\alpha^{(2)}_{1}),\dots,x_{\tau}(\alpha^{(2)}_{\tau-\upsilon})\big{)}, respectively. Also, we let denote a sequence of an appropriate length with all coordinates being .
By the same exchangeability argument, we can upper bound the MSE by the sum of signal component and the noise component as in the proof of Theorem 1 – see Eq. (31), Lemmas 1 and 3 in Appendix A):
[TABLE]
It remains to bound each of the two terms in (30), which can be accomplished by concentration inequalities.
The second term in (30) can be bounded by a very similar argument as in Lemma 4, but we need to adjust for the fact that is associated to a tuple of latent variables, that are shared across different rows. In Step 1 of the proof or Lemma 4, we would use Chernoff bound separately for each dimension of the tensor to argue that for each entry , there are sufficiently many “nearest neighbor” coordinates with similar latent variables in the -th dimension of the tensor. The number of nearest neighbor rows to would then be lower bounded by the product of the number of nearest neighbors for each coordinate .
We also need a small modification in bounding the first term in (30). In Step 1 of the proof of Lemma 2, we use Bernstein’s inequality to prove the concentration of
[TABLE]
We observe that we cannot use Bernstein’s inequality in the current setup of tensor completion, because the summands, \big{(}T_{Z}(\vec{1},\vec{j})-T_{Z}(\vec{v},\vec{j})\big{)}^{2}, are no longer independent, unlike the setup of matrix completion.
Nevertheless, the dependence between the summands is still reasonably weak. We may consider as a function of the independent random variables, \big{\{}x_{i}(\alpha_{i})\big{\}}_{i\in[\tau]\atop\alpha_{i}\in[n_{i}]}, \big{\{}N(\vec{\alpha})\big{\}}_{\vec{\alpha}\in[n_{1}]\times\dots\times[n_{\tau}]}, \big{\{}M(\vec{\alpha})\big{\}}_{\vec{\alpha}\in[n_{1}]\times\dots\times[n_{\tau}]}. Then we are able to prove concentration of to its expectation, using a different tool, e.g., by modified log-Sobolev inequality. Once we show \textrm{dissim}_{\text{row}}(\vec{1},\vec{v})\approx\big{\|}f(x_{\text{row}}(\vec{v}),\cdot)-f(x_{\text{row}}(\vec{1}),\cdot)\big{\|}_{L^{2}}^{2}+2\sigma^{2} in a similar form as in (33), the rest of the proof of Lemma 2 can be reused.
Remark 6*.*
In order to quickly get a sense of the appropriate concentration result, we consider the case where () for all , , and the noise is bounded, i.e., for all . We also assume for all . Given and , we can observe that is a function of \big{\{}x_{i}(\alpha_{i})\big{\}}_{i=\upsilon+1,\ldots,\tau\atop\alpha_{i}\in[n]}, \big{\{}N(\vec{1},\vec{\beta})\big{\}}_{\vec{\beta}\in[n]^{\tau-\upsilon}\atop\vec{\beta}\neq\vec{1}}, and \big{\{}N(\vec{v},\vec{\beta})\big{\}}_{\vec{\beta}\in[n]^{\tau-\upsilon}\atop\vec{\beta}\neq\vec{1}}. Note that .
We consider the influence that each independent random variable exerts on the function . It is easy to verify that satisfies the bounded difference property. To verify the property, let us introduce shorthand notations g\big{[}x_{i}(\alpha_{i})\big{]}\triangleq\textrm{dissim}_{\text{row}}(\vec{1},\vec{v})\big{[}x_{i}(\alpha_{i});~{}\textrm{the rest}\big{]} and h\big{[}N(\vec{1},\vec{\beta})\big{]}\triangleq\textrm{dissim}_{\text{row}}(\vec{1},\vec{v})\big{[}N(\vec{1},\vec{\beta});~{}\textrm{the rest}\big{]}. Then we observe that
[TABLE]
Applying the bounded difference inequality (e.g., McDiarmid’s), it follows that for any ,
[TABLE]
Note that we may assume and therefore (because ). As a result, we can expect \big{|}\textrm{dissim}_{\text{row}}(\vec{1},\vec{v})-\mathbb{E}\textrm{dissim}_{\text{row}}(\vec{1},\vec{v})\big{|}\lesssim\frac{(L^{2}D^{2}+\gamma^{2})\sqrt{\tau-\upsilon}}{\sqrt{n}}. This concentration inequality is times weaker than the result (Bernstein’s inequality) from Step 1 of the proof of Lemma 2, but its dependence on remains the same.
Remark 7*.*
The sample complexity of our matrix estimation algorithm requires that , where is the number of rows and is the number of columns. Therefore, the optimal flattening of the tensor that would minimize sample complexity is the flattening that such that
[TABLE]
If for all , then the optimal flattening would result in matrix dimensions of . In this case the natural extension of our analysis should result in MSE convergence with sample complexity of .
VII Experiments
In this section we present experimental results from applying the User-Item Gaussian Kernel variant of our algorithm to real datasets. We state in Algorithm 4 the specific algorithm variant that is used in the experiments. We did not sample split as it was primarily used for the analysis but is not necessary in practice. The implemented algorithm adds a few modifications:
- •
is computed according to the sample variance of the differences between two rows
- •
we use Gausian kernel weights, and we combine both row and column dissimilarities by taking the maximum
- •
we use an estimator motivated by first order Taylor series approximation
The resulting algorithm is similar to the mean-adjusted variant of collaborative filtering, except for the addition of combining both row and column dissimilarities.
VII-A Matrix completion
We evaluated the performance of our algorithm on predicting user-movie ratings for the MovieLens 1M and Netflix datasets. We chose the overlap parameter to ensure the algorithm is able to compute an estimate for all missing entries. When is larger, the algorithm enforces rows (or columns) to have more commonly rated movies (or users). Although this increases the reliability of the estimates, it also reduces the fraction of entries for which the estimate is defined.
We compared our method with user-user collaborative filtering, item-item collaborative filtering, and SoftImpute from [75]. We chose the classic mean-adjusted collaborative filtering method, in which the weights are proportional to the cosine similarity of pairs of users or items (i.e. movies). SoftImpute is a matrix-factorization-based method which iteratively replaces missing elements in the matrix with those obtained from a soft-thresholded SVD.
The MovieLens 1M data set contains about 1 million ratings by 6000 users of 4000 movies. The Netflix data set consists of about 100 million movie ratings by 480,189 users of about 17,770 movies. For both MovieLens and Netflix data sets, the ratings are integers from 1 to 5. From each dataset, we generated 100 smaller user-movie rating matrices, in which we randomly subsampled 2000 users and 2000 movies. For each rating matrix, we randomly select and withhold a percentage of the known ratings for the test set, while the remaining portion of the data set is revealed to the algorithm for computing the estimates (or training). After the algorithm computes its predictions for all the missing user-movie pairs, we evaluate the Root Mean Squared Error (RMSE) of the predictions compared to the ratings from the withheld test set. Figure 1 plots the RMSE of our method along with classic collaborative filtering and SoftImpute evaluated against , , , and withheld test sets. The RMSE is averaged over 100 subsampled rating matrices, and confidence intervals are provided.
Figure 1 suggests that our algorithm achieves a systematic improvement over classical user-user and item-item collaborative filtering. SoftImpute performs worse than all methods on the MovieLens dataset, but it performs better than all methods on the Netflix dataset.
VII-B Tensor completion
We consider the problem of image inpainting for evaluating the performance of tensor completion algorithm. Inpainting is the process of reconstructing lost or deteriorated parts of image or videos. Such methods, in particular, have revitalized the process of recovery old artifacts in museum world which was historically done by conservators or art restorers. An interested reader is referred to a recent survey [69] for summary of the state of art on methods and techniques. We compare performance of our algorithm against existing methods in the literature on the image inpainting problem. Figure 2 shows a sample of the image inpainting results for the facade and pepper images when of the pixels are removed.
An image can be represented as a -order tensor where the dimensions are rows columns RGB. In particular we used three images (Lenna, Pepper, and Facade) of dimensions 256 256 3. For each image, a percentage of the pixels are randomly removed, and the missing entries are filled in by various tensor completion algorithms.
For the implementation of our tensor completion method, we collapsed the last two dimensions of the tensor (columns and RGB) to reduce the image to a matrix, and applied our method. We set the overlap parameter . We compared our method against fast low rank tensor completion (FaLRTC) [76], alternating minimization for tensor completion (TenAlt) [34], and fully Bayesian CP factorization (FBCP) [77], which extends the CANDECOMP/PARAFAC(CP) tensor factorization with automatic tensor rank determination.
To evaluate the outputs produced by each method, we computed the relative squared error (RSE), defined as
[TABLE]
where is the average value of the true entries. Figure 3 plots the RSE achieved by each tensor completion method on the three images, as a function of the percentage of pixels removed. The results demonstrate that our tensor completion method is competitive with existing tensor factorization based approaches, while maintaining a naive simplicity. In short, a simple algorithm works nearly as good as the best algorithm for this problem!
Acknowledgment
This work is supported in parts by ARO under MURI award 133668-5079809, by NSF under grants CMMI-1462158, CMMI-1634259 and TRIPODS, by DARPA under grant W911NF-16-1-0551, and additionally by a Samsung Scholarship, Siebel Scholarship, NSF Graduate Fellowship, and Claude E. Shannon Research Assistantship.
Appendix A Proof of Theorem 1
A-A Outline of the Proof
There are independent sources of randomness in our model, \big{\{}x_{\text{row}}(u)\big{\}}_{u\in[m]}, \big{\{}x_{\text{col}}(i)\big{\}}_{i\in[n]},\big{\{}N(u,i)\big{\}}_{(u,i)\in[m]\times[n]}, \big{\{}M(u,i)\big{\}}_{(u,i)\in[m]\times[n]}. We let denote the collection of these random variables.
Due to the exchangeability of the model and the linearity of the expectation,
[TABLE]
In Theorem 1, we provide an upper bound on \mathbb{E}\left[\big{(}\widehat{A}(1,1)-A(1,1)\big{)}^{2}~{}\big{|}~{}x_{\text{row}}(1)\right], conditioned on the latent feature of the first row.
Recall from Algorithm 2 that
[TABLE]
in our user-user fixed radius nearest neighbor algorithm when and when . Therefore,
[TABLE]
By the assumption that the magnitude of the latent function is bounded by ,
[TABLE]
and that
[TABLE]
where is an absolute constant (coming from ). The inequality in (a) follows from Lemmas 1 and 3.
We need local properties of the probability measure on the latent space to upper bound the two terms in the last line; see Lemmas 2 and 4. We define the function for and
[TABLE]
We prove in Lemma 5 that
[TABLE]
A-B Upper Bounding the Contribution of Signal on MSE
Lemma 1**.**
[TABLE]
Proof.
Recall from Section A-A that we let denote the collection of independent sources of randomness in our model, \big{\{}x_{\text{row}}(u)\big{\}}_{u\in[m]}, \big{\{}x_{\text{col}}(i)\big{\}}_{i\in[n]},\big{\{}N(u,i)\big{\}}_{(u,i)\in[m]\times[n]}, \big{\{}M(u,i)\big{\}}_{(u,i)\in[m]\times[n]}.
By the tower property of expectation,
[TABLE]
We investigate the conditional expectation, again using the tower property and the fact that is fully determined when we condition on . For the sake of readability, \mathbb{I}\big{(}{|\mathcal{B}_{\text{est}}(1,1)|\geq 1}\big{)} is omitted in the lines below.
[TABLE]
Equation (a) follows from the conditional independence; (b) follows from Fubini’s theorem and the independence between and ; and (c) follows from Cauchy-Schwarz inequality.
Consequently, it follows that
[TABLE]
∎
Lemma 2**.**
Let . The following inequality holds for user-user fixed radius neighbor algorithm:
[TABLE]
In the above expression, K\triangleq\Big{(}\frac{D_{f}}{\sqrt{\ln 2}}+2\sigma\Big{)}^{2} and are absolute constants.
Proof.
Step 1: Choose such that . Observe that
[TABLE]
Note that Z(1,j)-Z(v,j)=\big{(}A(1,j)-A(v,j)\big{)}+\big{(}N(1,j)-N(v,j)\big{)}. By the model assumptions, we have
[TABLE]
Therefore (cf. [78, Lemma 2.7.6]),
[TABLE]
For notational conciseness, we let denote the conditional expectation of :
[TABLE]
Now, suppose that . We observe that regardless of . Then it follows from Bernstein’s inequality (cf. [78, Theorem 2.8.2]) that for any ,
[TABLE]
Note that we provided an upper bound on the conditional probability in (33) that holds for any realization of and , and thus we could remove the conditioning on . By plugging in the value of , this leads to the following inequality: for any ,
[TABLE]
Step 2: Let denote the maximizer such that and
[TABLE]
According to the description of our user-user fixed radius nearest neighbor algorithm, .
[TABLE]
Now we observe that
[TABLE]
Recall that for and ,
[TABLE]
Step 3: Observe that . By the binomial Chernoff theorem,
[TABLE]
Therefore, with the shorthand notation , we can see that
[TABLE]
The inequality in (a) follows from that and is maximized when . ∎
A-C Upper Bounding the Contribution of Noise on MSE
Lemma 3**.**
[TABLE]
Proof.
By the tower property of expectation,
[TABLE]
(a) follows from \mathbb{E}\Big{[}|N(v,1)|^{2}~{}\big{|}~{}x_{\text{row}}(1),|\mathcal{B}_{\text{est}}(1,1)|\Big{]}=\mathbb{E}\big{[}|N(v,1)|^{2}\big{]}\leq C\sigma^{2}, as a result of . ∎
Lemma 4**.**
Let and \eta\geq\eta^{\prime}+K\max\Big{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\Big{)}. The following inequality holds for user-user fixed radius nearest neighbor algorithm:
[TABLE]
where \phi(x,r)=\mathbb{P}_{x_{\text{row}}(v)\sim\mu_{\mathcal{X}_{\textrm{row}}}}\left(\big{\|}f(x,\cdot)-f(x_{\text{row}}(v),\cdot)\big{\|}_{L^{2}}^{2}\leq r\right) for any and .
Proof.
Step 1: Our interest is in bounding conditional expectation of \frac{1}{|\mathcal{B}_{\text{est}}(1,1)|}~{}\mathbb{I}\big{(}|\mathcal{B}_{\text{est}}(1,1)|\geq 1\big{)} given . To that end, let us fix . Let \eta^{\prime}=\eta-K\max\Big{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\Big{)}. For each , the following two conditions
[TABLE]
are satisfied with success probability181818Note that the first condition solely depends on , while the second condition depends only on , hence, they are independent events. p\cdot\phi\big{(}x_{\text{row}}(1),\eta^{\prime}-2\sigma^{2}\big{)}.
Let denote the set of row indices that satisfy the two conditions described in (37). By the binomial Chernoff bound,
[TABLE]
Step 2: Next, we want to show that with high probability. For that purpose, we first require to be sufficiently large for all . Observe that for each , is distributed following . Again by the binomial Chernoff bound (as in (36)), we have
[TABLE]
Since , it follows from the union bound that
[TABLE]
By construction, for all ,
[TABLE]
Note that we can obtain the following concentration inequality by similar arguments191919This provides a probabilistic tail bound on the opposite side of that in (33). The proof remains the same. as in (33):
[TABLE]
With another application of the union bound, this implies that
[TABLE]
Therefore202020Recall that with t=K\max\Big{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\Big{)}.,
[TABLE]
Observe that
[TABLE]
which implies that .
Step 3: Let
[TABLE]
By the law of total probability,
[TABLE]
where (a) follows from \frac{1}{|\mathcal{B}_{\text{est}}(1,1)|}~{}\mathbb{I}\big{(}|\mathcal{B}_{\text{est}}(1,1)|\geq 1\big{)}\leq 1; and (b) follows from that (42) and
[TABLE]
∎
A-D Upper Bounding the Probability of
Lemma 5**.**
Let and \eta\geq\eta^{\prime}+K\max\Big{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\Big{)}. The following inequality holds for user-user fixed radius nearest neighbor algorithm:
[TABLE]
Proof.
Recall the definition of from (37):
[TABLE]
Note that we obtain in (38) that
[TABLE]
by applying the binomial Chernoff bound.
Now we observe that if there exists at least one , for which
[TABLE]
then due to the Lipschitzness assumption on . Therefore, we have
[TABLE]
where .
We bound the two terms in (43) separately. First, as long as \phi\Big{(}x_{\text{row}}(1),\eta^{\prime}-2\sigma^{2}\Big{)}>0, it follows from (38) that
[TABLE]
Second, by the usual trick of total probability (recall from (32) that denotes ),
[TABLE]
where (a) follows from (33) and the independence between and ; and (b) follows from (36) and the assumption that \eta-\eta^{\prime}\geq K\max\Big{(}\sqrt{\frac{4\log(m-1)}{c(n-1)p^{2}}},\frac{4\log(m-1)}{c(n-1)p^{2}}\Big{)}.
All in all,
[TABLE]
∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] G. Chen and D. Shah, Explaining the success of nearest neighbor method in prediction . Foundations and Trends in Machine Learning, 2018.
- 2[2] Y. Mack and B. W. Silverman, “Weak and strong uniform consistency of kernel regression estimates,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete , vol. 61, no. 3, pp. 405–415, 1982.
- 3[3] M. P. Wand and M. C. Jones, Kernel smoothing . Crc Press, 1994.
- 4[4] N. Srebro, N. Alon, and T. S. Jaakkola, “Generalization error bounds for collaborative prediction with low-rank matrices,” in Advances In Neural Information Processing Systems , 2004, pp. 1321–1328.
- 5[5] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational mathematics , vol. 9, no. 6, pp. 717–772, 2009.
- 6[6] A. Rohde, A. B. Tsybakov et al. , “Estimation of high-dimensional low-rank matrices,” The Annals of Statistics , vol. 39, no. 2, pp. 887–930, 2011.
- 7[7] R. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” IEEE Trans. Inf. Theory , vol. 56, no. 6, 2009.
- 8[8] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix completion: Optimal bounds with noise,” The Journal of Machine Learning Research , vol. 13, no. 1, pp. 1665–1697, 2012.
