Denoising Linear Models with Permuted Data
Ashwin Pananjady, Martin J. Wainwright, Thomas A. Courtade

TL;DR
This paper characterizes the minimax error rate for denoising in permuted linear models with Gaussian noise, analyzes efficient estimators, and provides algorithms applicable to image matching and datasets with outliers.
Contribution
It offers a sharp characterization of the minimax error rate and analyzes the performance of efficient estimators for denoising in permuted linear models.
Findings
Minimax error rate characterized up to logarithmic factors.
Efficient estimators shown to be consistent across various parameters.
Exact algorithm demonstrated on image point-cloud matching.
Abstract
The multivariate linear regression model with shuffled data and additive Gaussian noise arises in various correspondence estimation and matching problems. Focusing on the denoising aspect of this problem, we provide a characterization of the minimax error rate that is sharp up to logarithmic factors. We also analyze the performance of two versions of a computationally efficient estimator, and establish their consistency for a large range of input parameters. Finally, we provide an exact algorithm for the noiseless problem and demonstrate its performance on an image point-cloud matching task. Our analysis also extends to datasets with outliers.
1 Introduction
The linear model is a ubiquitous and well-studied tool for predicting responses based on a vector of covariates or predictors. In this paper, we consider the multivariate version of the model, with vector-valued responses $y_i \in \mathbb{R}^m$ and covariates $a_i \in \mathbb{R}^d$. In the standard formulation of this problem, estimation is performed on the basis of a data set of $n$ pairs $\{(y_i, a_i)\}_{i=1}^n$, in which each response $y_i$ is correctly associated with the covariate vector $a_i$ that generated it. Our focus is instead on the following variant of the standard set-up: the input consists of the permuted data set $\{(y_{\pi(i)}, a_i)\}_{i=1}^n$, where $\pi$ represents an unknown permutation. The presence of this unknown permutation, which can be viewed as a nuisance parameter, introduces substantial challenges to this problem.
It is convenient to introduce matrix-vector notation so as to state the problem more precisely. If we form the matrices $Y \in \mathbb{R}^{n \times m}$ and $A \in \mathbb{R}^{n \times d}$ with $y_i^\top$ and $a_i^\top$, respectively, as their $i$th row, we arrive at the model
$$Y = \Pi^* A X^* + W, \qquad (1)$$
where $\Pi^*$ is an unknown $n \times n$ permutation matrix, $X^* \in \mathbb{R}^{d \times m}$ is an unknown matrix of parameters, and $W \in \mathbb{R}^{n \times m}$ is the additive observation noise (we refer to the setting $W = 0$ a.s. as the noiseless case). When $m = 1$, this reduces to the vector linear regression model with an unknown permutation, given by
$$y = \Pi^* A x^* + w, \qquad (2)$$
which we refer to as the shuffled vector model.
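To make the observation model concrete, the following minimal sketch draws synthetic data from it, reading model (1) as $Y = \Pi^* A X^* + W$ with $Y$ of size $n \times m$, $A$ of size $n \times d$ and $X^*$ of size $d \times m$. The function name `sample_permuted_model`, the Gaussian design and the specific dimensions are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_permuted_model(n, d, m, sigma, rng):
    """Draw (Y, A, X_star, Pi_star) with Y = Pi_star @ A @ X_star + W."""
    A = rng.standard_normal((n, d))          # design matrix (illustrative Gaussian choice)
    X_star = rng.standard_normal((d, m))     # unknown parameter matrix
    perm = rng.permutation(n)                # unknown shuffle of the rows
    Pi_star = np.eye(n)[perm]                # permutation matrix: (Pi_star @ M)[i] == M[perm[i]]
    W = sigma * rng.standard_normal((n, m))  # i.i.d. N(0, sigma^2) noise
    Y = Pi_star @ A @ X_star + W
    return Y, A, X_star, Pi_star

rng = np.random.default_rng(0)
# Multivariate model with m = 3 responses per covariate.
Y, A, X_star, Pi_star = sample_permuted_model(n=6, d=2, m=3, sigma=0.1, rng=rng)
# Shuffled vector model: the special case m = 1.
y_vec, A_vec, x_vec, Pi_vec = sample_permuted_model(n=6, d=2, m=1, sigma=0.1, rng=rng)
# Noiseless case: sigma = 0.
Y0, A0, X0, Pi0 = sample_permuted_model(n=6, d=2, m=3, sigma=0.0, rng=rng)
```

Setting `sigma=0.0` gives the noiseless case, and `m=1` gives the shuffled vector model.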
The observation model (1) arises in multiple applications, which are discussed in detail for the shuffled vector model (2) in our earlier work [PWC16]. Here let us describe two applications that arise in the multivariate setting ($m > 1$), which we use as running examples throughout the paper.
Example 1 (Pose and correspondence estimation).
Our first motivating application is the problem of pose and correspondence estimation in images [MSC09]; it is closely related to point-cloud matching in graphics [Man93]. Suppose that we are given two images of a similar object, with the coordinates of one image arising from an unknown linear transformation of the coordinates of the second. In order to determine the linear transformation, keypoints are detected in each of the images individually and then matched; see Figure 1 for an illustration. We emphasize that in practice, the keypoint detection algorithm also returns features that help in finding the matching permutation $\Pi^*$, but our goal here is to analyze whether there are procedures that are robust to such features being missing or corrupted. It is also worth noting that while in this example we have $m = d = 2$, the model is also valid for higher (but equal) parameters $m$ and $d$, if we assume that in addition to the coordinates of the keypoints, other attributes like pixel brightness, colour, etc. in the two images are also related by a linear transformation.
Example 2 (Header-free communication).
A second application is that of header-free communication in large communication networks [PWC16]. Suppose that we use multiple sensors to take noisy measurements of an unknown matrix $X^*$ of parameters; each measurement corresponds to a noisy linear observation of the form $y_i^\top = a_i^\top X^* + w_i^\top$. In very large networks, such as those that arise in Internet of Things applications, it is often found that the bandwidth between a sensor and fusion center is mainly dominated by a header containing identity information, that is, by a bitstring that identifies sensor $i$ to the fusion center [KSF+09]. One possible solution to this problem is header-free communication, meaning that the identities of the sensors that sent the signal are no longer known to the fusion center. This absence can be modeled by introducing the unknown permutation matrix $\Pi^*$ as in our model. If we are still able to achieve similar statistical performance without these headers, then such an approach is clearly preferable from a bandwidth standpoint.
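With this motivation in hand, let us now provide a high-level overview of the main results of this paper. We focus on the multivariate model (1) with a fixed design matrix $A$ and Gaussian noise $W$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries (our results also extend to the case of i.i.d. sub-Gaussian noise). We evaluate an estimator $(\widehat{\Pi}, \widehat{X})$ based on its “denoising” capability, which we capture using the normalized prediction error $\frac{1}{nm} \|\widehat{\Pi} A \widehat{X} - \Pi^* A X^*\|_F^2$. Our primary objective in this paper is to characterize the fundamental limits of denoising in a minimax sense. In particular, an estimator is any measurable mapping of the input $(Y, A)$ to estimates $(\widehat{\Pi}, \widehat{X})$ of the permutation and regression matrix, and we measure the quality of these estimates via their uniform mean-squared error
$$\sup_{\Pi^*,\, X^*} \; \mathbb{E}\left[ \frac{1}{nm} \big\|\widehat{\Pi} A \widehat{X} - \Pi^* A X^*\big\|_F^2 \right], \qquad (3)$$
where the supremum ranges over all $n \times n$ permutation matrices $\Pi^*$ and all parameter matrices $X^* \in \mathbb{R}^{d \times m}$.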
Our interest will be in upper and lower bounding this quantity as a function of the design matrix $A$, dimensions $(n, m, d)$ and the noise variance $\sigma^2$. We also demonstrate an explicit (but computationally expensive) algorithm that achieves the minimax risk up to a logarithmic factor, and analyze polynomial-time estimators with slightly larger prediction error.
In both of the examples discussed above, estimators with small minimax prediction error are of interest. In the pose and correspondence estimation problem, obtaining low prediction error is equivalent to obtaining near-identical keypoint locations on both images; in the sensor network example, we are interested in obtaining a set of noise-free linear functions of the input signal. It is important to note that depending on the application, multiple regimes of the parameter triplet $(n, d, m)$ are of interest. Therefore, in this paper, we focus on capturing the dependence of denoising error rates on all of these parameters, and also on the structure of the matrix $A$.
Our work contributes to the growing body of literature on regression problems with unknown permutations, as well as related row-space perturbation problems including blind deconvolution [LS15], phase retrieval [CLS15], and dictionary learning [TF11]. Regression problems with unknown permutations have been considered in the context of statistical seriation and univariate isotonic matrix recovery [FMR16], and non-parametric ranking from pairwise comparisons [SBGW17], which involves bivariate isotonic matrix recovery. Moreover, the prediction error is used to evaluate estimators in both these applications.
Specializing to our setting, the shuffled vector model (2) was first considered in the context of compressive sensing with a sensor permutation [EBDG14]. The first theoretical results were provided by Unnikrishnan et al. [UHV15], who provided necessary and sufficient conditions needed to recover an adversarially chosen $x^*$ in the noiseless model with a random design matrix $A$. Also in the random design setting, our own previous work [PWC16] focused on the complementary problem of recovering $\Pi^*$ in the noisy model, and showed necessary and sufficient conditions on the SNR under which exact and approximate recovery were possible. An efficient algorithm to compute the maximum likelihood estimate was also provided for the special case $d = 1$.
1.1 Our contributions
First, we characterize the minimax prediction error of the multivariate linear model with an unknown permutation up to a logarithmic factor, by analyzing the maximum likelihood estimator. Second, since the maximum likelihood estimate is NP-hard to compute in general [PWC16], we propose a computationally efficient estimator based on singular value thresholding and sharply characterize its performance, showing that it achieves vanishing prediction error over a restricted range of parameters. We also propose a variant of this estimator that achieves the same error rates, but with the advantage that it does not require the noise variance to be known. Third, we propose an efficient spectral algorithm for the noiseless problem that is exact provided certain natural conditions are met. We demonstrate this algorithm on an image point cloud matching task. Finally, we extend our results to a richer class of models that allows for outliers in the dataset. In the next section, we collect our main theorems and discuss their consequences. Proofs are postponed to Section 3.
Notation:
We use $\mathcal{P}_n$ to denote the set of $n \times n$ permutation matrices. Let $I_d$ denote the identity matrix of dimension $d$. We use the notation $\|M\|_F$, $\|M\|_{\rm op}$, and $\|M\|_*$ to denote the Frobenius, operator, and nuclear norms of a matrix $M$, and $c, c_1, c_2, \ldots$ to denote universal constants that may change from line to line.
2 Main results
In this section, we state our main results and discuss some of their consequences. We divide our results into four subsections, having to do with minimax rates, polynomial time estimators, efficient procedures for the noiseless problem, and an extension of the model (1) that allows for outliers.
2.1 Minimax rates of prediction
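Assuming that the noise is i.i.d. Gaussian, the maximum likelihood estimate (MLE) of the parameters $(\Pi^*, X^*)$ is given by
$$(\widehat{\Pi}_{{\sf ML}}, \widehat{X}_{{\sf ML}}) \in \arg\min_{\Pi,\, X} \|Y - \Pi A X\|_F^2, \qquad (4)$$
where the minimum is taken over all $n \times n$ permutation matrices $\Pi$ and all matrices $X \in \mathbb{R}^{d \times m}$.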
This estimator is also sensible for non-Gaussian noise, as long as its tail behavior is similar to the Gaussian case (as can be formalized by the notion of sub-Gaussianity).
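As an illustration of why the MLE is expensive, the sketch below computes it by exhaustive search: for each of the $n!$ candidate permutations it solves an ordinary least-squares problem in $X$ and keeps the best fit. It assumes the least-squares form of the MLE, namely minimizing $\|Y - \Pi A X\|_F^2$ over permutation matrices $\Pi$ and matrices $X$. The function name `brute_force_mle` is hypothetical, and the approach is only feasible for very small $n$.

```python
import itertools
import numpy as np

def brute_force_mle(Y, A):
    """Exhaustive-search least squares over all row permutations of A."""
    n = A.shape[0]
    best_cost, best_perm, best_X = np.inf, None, None
    for perm in itertools.permutations(range(n)):
        PA = A[list(perm)]                          # equals Pi @ A for this permutation
        X, *_ = np.linalg.lstsq(PA, Y, rcond=None)  # best X for this fixed permutation
        cost = np.linalg.norm(Y - PA @ X) ** 2      # squared Frobenius residual
        if cost < best_cost:
            best_cost, best_perm, best_X = cost, perm, X
    Pi_hat = np.eye(n)[list(best_perm)]
    return Pi_hat, best_X, best_cost

# Tiny noiseless instance (illustrative sizes): the best fit has zero residual.
rng_mle = np.random.default_rng(1)
n_mle, d_mle, m_mle = 5, 2, 2
A_mle = rng_mle.standard_normal((n_mle, d_mle))
X_mle = rng_mle.standard_normal((d_mle, m_mle))
Pi_mle = np.eye(n_mle)[rng_mle.permutation(n_mle)]
Y_mle = Pi_mle @ A_mle @ X_mle
Pi_hat_mle, X_hat_mle, cost_mle = brute_force_mle(Y_mle, A_mle)
```

The loop over `itertools.permutations` makes the factorial cost explicit, which is the motivation for the polynomial-time alternatives discussed later.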
In this section, we begin by providing an upper bound on the prediction error achieved by the maximum likelihood estimator for any design matrix $A$. In general, however, it is impossible to prove a matching lower bound for an arbitrary matrix $A$. As an extreme example, suppose that the matrix $A$ has identical rows: in this case, the permutation matrix plays no role whatsoever, and the problem is obviously much easier than with a generic matrix $A$.
With this fact in mind, we derive lower bounds that apply provided the matrix lies in a restricted class, in order to define which we require some additional notation. For a vector , let denote the vector sorted in decreasing order, and let denote the -dimensional -ball of unit radius centered at [math]. Define the matrix class
[TABLE]
In rough terms, this condition defines matrices that are not “flat”, meaning that there is some vector in their range obeying the -separation condition defined above. It can be verified that a matrix with i.i.d. sub-Gaussian entries lies in the class with high probability for fixed constants . We are now ready to state our first main result:
Theorem 1.
For any triple , we have
[TABLE]
Conversely, for any matrix , and any estimator , we have
[TABLE]
where the constant depends on the value of the pair , but is independent of other problem parameters.
Theorem 1 characterizes the minimax rate up to a factor that is at most logarithmic in $n$. It shows that the MLE is minimax optimal for prediction error up to logarithmic factors for all matrices that are not too flat. The bounds have the following interpretation, similar to the results of Flammarion et al. [FMR16] on prediction error for unimodal columns. The first term corresponds to a rate achieved even if the estimator knows the true permutation $\Pi^*$; the second term quantifies the price paid for the combinatorial choice among permutations. As a result, we see that if $m$ is large enough that the second term is dominated by the first, then the permutation does not play much of a role in the problem, and the rates resemble those of standard linear regression. Such a general behaviour is expected, since a large $m$ means that we get multiple observations with the same unknown permutation, and this should allow us to estimate $\Pi^*$ better.
Clearly, a flat matrix is not influenced by the unknown permutation, and so the second term of the lower bound need not apply. As we demonstrate in the proof, it is likely that the flatness of can also be incorporated in order to prove a tighter upper bound in this case, but we choose to state the upper bound as holding uniformly for all matrices , with the loss of a logarithmic factor.
It is also worth mentioning that the logarithmic factor in the second term is shown to be nearly tight for the problem of unimodal matrix estimation with an unknown permutation [FMR16], suggesting that a similar factor may also appear in a tight version of our lower bound (5b). For the specific case where $m = 1$, however, which corresponds to the shuffled vector model (2), our bounds are tight up to constant factors, and are summarized by the following corollary.
Corollary 1.
In the case $m = 1$, for any matrix , we have
[TABLE]
In other words, the normalized minimax prediction error for the shuffled vector model does not decay with the parameters $n$ or $d$, and so no estimator achieves consistent prediction for every parameter choice. Again, this is a consequence of the fact that, unlike when $m$ is large, we do not get independent observations with the permutation staying fixed, and herein lies the difficulty of the problem.
Both Theorem 1 and Corollary 1 provide non-adaptive minimax bounds. An interesting question is whether the least squares estimator is also minimax optimal up to logarithmic factors over finer classes of and , i.e., whether it is adaptive in some interesting way. One would expect that the estimator adapts to the parameter , the number of distinct entries in the matrix , similarly to the problem of monotone parameter recovery [FMR16].
2.2 Polynomial time estimators
As shown in our past work [PWC16], computing the MLE estimate (4) is NP-hard in general. Accordingly, it is natural to turn our attention to alternative estimators, and in particular ones that are guaranteed to run in polynomial time.
Here we analyze two simple methods for estimating the matrix $\Pi^* A X^*$: one based on singular value thresholding, and a closely related variant that uses an explicit regularization based on the nuclear norm. It is well-known that such methods are appropriate when the matrix is low-rank, or approximately low-rank. While the matrix $\Pi^* A X^*$ is not low-rank, its rank is bounded by that of the matrix $A$, a fact that we leverage in our bounds.
Given a matrix $M$ with the singular value decomposition $M = \sum_i \sigma_i u_i v_i^\top$, its singular value thresholded version at level $\lambda$ is given by $T_\lambda(M) = \sum_i \sigma_i \, \mathbb{I}(\sigma_i > \lambda) \, u_i v_i^\top$, where $\mathbb{I}(\cdot)$ is the indicator function of its argument.
The singular value thresholding (SVT) operation serves the purpose of denoising the observation matrix, and has been analyzed in the context of more general matrix estimation problems by various authors (e.g., [CCS10, Cha15]).
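The thresholding operation can be sketched in a few lines of NumPy. The version below implements hard thresholding, keeping every singular value strictly above the level `lam` and zeroing the rest. The function name `svt` and the toy matrix are illustrative, and the threshold used here is arbitrary rather than the choice prescribed by Theorem 2.

```python
import numpy as np

def svt(M, lam):
    """Hard singular value thresholding of M at level lam."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > lam                       # indicator of singular values above the level
    # Rebuild the matrix from the retained singular triplets only.
    return (U[:, keep] * s[keep]) @ Vt[keep, :]

# Toy example with singular values 3, 1 and 0.5.
M_toy = np.diag([3.0, 1.0, 0.5])
M_thresh = svt(M_toy, lam=0.75)          # drops only the singular value 0.5
M_zero = svt(M_toy, lam=10.0)            # level above all singular values: all-zero matrix
```

In the denoising setting one would apply `svt` to the observation matrix, with a level on the order of the operator norm of the noise; the exact constant is a modeling choice not fixed by this sketch.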
Theorem 2.
For any matrices , the SVT estimate with satisfies
[TABLE]
Conversely, for any matrix with rank at most , there exist matrices and (that may depend on $A$) such that for any threshold , we have
[TABLE]
with probability greater than .
Comparing inequalities (5b) (which holds for any denoised matrix, not just those having the form ) and (6b), we see that the SVT estimator, while computationally efficient, may be statistically sub-optimal. However, it is consistent in the case where is sufficiently small compared to and , and minimax optimal when is a constant. Intuitively, the rate it attains is a result of treating the full matrix as unknown, and so it is likely that better, efficient estimators exist that take the knowledge of into account.
A potential concern is that the SVT estimator is required to know the noise variance $\sigma^2$. This issue can be taken care of via the square-root LASSO “trick” [BCW11], which ensures a self-normalization that obviates the necessity for a noise-dependent threshold level. In particular, consider the estimate
[TABLE]
Using a choice of that no longer depends on , we have the following guarantee:
Theorem 3.
If , then for any choice of parameters and , the square-root LASSO estimate (7) with satisfies
[TABLE]
with probability greater than .
We prove Theorem 3 in Section 3.3 for completeness. However, it should be noted that the square-root LASSO has been analyzed for matrix completion problems [Klo14], and our proof follows similar lines for our different observation model. The condition does not significantly affect the claim, since our bounds no longer guarantee consistency of the estimate when this condition is violated.
While the optimization problem (7) can be solved efficiently, there may be cases when the noise is (sub)-Gaussian of known variance for which the SVT estimate can be computed more quickly. Hence, the SVT estimator is usually preferred in cases where the noise statistics are known.
2.3 Exact algorithm for the noiseless case
For the noiseless model, the only efficient algorithm known up to now is for the special case $m = d = 1$, as presented in our past work [PWC16]. It turns out that this algorithm has a natural generalization to higher dimensional problems, at least when certain conditions on the input matrices are satisfied. The higher dimensional generalization requires analyzing certain spectral properties of the input matrices.
In order to state the theorem, we require a few definitions. Given a matrix $M$, consider its reduced singular value decomposition $M = U \Sigma V^\top$, where $U$ is a matrix of its left singular vectors. The (left) leverage scores of the matrix $M$ are given by the squared $\ell_2$-norms of the rows of the matrix $U$; in analytical terms, we can express them as the $n$-dimensional vector $\ell(M) = {\rm diag}(U U^\top)$, where the operator ${\rm diag}(\cdot)$ extracts the diagonal of a square matrix. With this notation, the LevSort algorithm performs the following three steps on the input pair $(Y, A)$:
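(i) Compute the leverage scores $\ell(Y)$ and $\ell(A)$.
(ii) Find a permutation $\widehat{\Pi}_{{\sf lev}} \in \arg\min_{\Pi} \|\ell(Y) - \Pi \, \ell(A)\|_2^2$, where the minimum is over $n \times n$ permutation matrices.
(iii) Return the matrix $\widehat{X}_{{\sf lev}} = \big(\widehat{\Pi}_{{\sf lev}} A\big)^{\dagger} Y$, where $M^{\dagger}$ denotes the Moore-Penrose pseudoinverse of a matrix $M$.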
Note that this algorithm runs in polynomial time, since it involves only spectral computations and a matching step that can be computed in time $O(n \log n)$ by sorting. As we demonstrate in the proof, step (ii) for the noiseless model actually returns a permutation matrix $\widehat{\Pi}_{{\sf lev}}$ such that $\ell(Y) = \widehat{\Pi}_{{\sf lev}} \, \ell(A)$.
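The three steps of LevSort can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: leverage scores are computed as the squared row norms of the left singular vectors, the matching in step (ii) is done by sorting both score vectors, and the demonstration uses a noiseless instance in which $A$ has full column rank, $X^*$ is invertible and the leverage scores are distinct. The names `leverage_scores` and `levsort`, the rank tolerance `tol` and the problem sizes are hypothetical.

```python
import numpy as np

def leverage_scores(M, tol=1e-10):
    """Squared row norms of the left singular vectors spanning range(M)."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))   # numerical rank
    return np.sum(U[:, :r] ** 2, axis=1)

def levsort(Y, A):
    n = A.shape[0]
    # Step (i): leverage scores of both inputs.
    lev_Y, lev_A = leverage_scores(Y), leverage_scores(A)
    # Step (ii): match the k-th smallest score of Y to the k-th smallest of A.
    order_Y, order_A = np.argsort(lev_Y), np.argsort(lev_A)
    perm = np.empty(n, dtype=int)
    perm[order_Y] = order_A
    Pi_hat = np.eye(n)[perm]             # (Pi_hat @ A)[i] == A[perm[i]]
    # Step (iii): least squares via the pseudoinverse.
    X_hat = np.linalg.pinv(Pi_hat @ A) @ Y
    return Pi_hat, X_hat

# Noiseless 2D instance (m = d = 2), loosely mimicking point-cloud matching.
rng_lev = np.random.default_rng(2)
n_lev = 8
A_lev = rng_lev.standard_normal((n_lev, 2))
X_lev = rng_lev.standard_normal((2, 2))
Pi_lev = np.eye(n_lev)[rng_lev.permutation(n_lev)]
Y_lev = Pi_lev @ A_lev @ X_lev
Pi_hat_lev, X_hat_lev = levsort(Y_lev, A_lev)
```

Sorting suffices for the matching step because pairing two vectors in sorted order minimizes the squared distance between them over all permutations. With noise, the leverage scores of $Y$ no longer coincide with a permutation of those of $A$, so exact recovery is not to be expected from this sketch.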
Theorem 4.
Consider an instantiation of the noiseless model with , and such that and both have all distinct entries. Then the LevSort algorithm recovers the parameters exactly.
The LevSort algorithm is a generalization of our own algorithm [PWC16] to the matrix setting. However, instead of a simple sorting algorithm, we now require an additional spectral component. While showing the necessity of the condition is still open, an efficient algorithm that does not impose any conditions is unlikely to exist due to the general problem being NP-hard [PWC16]. Note that the condition includes as a special case all problems in which the matrices and are full rank, with .
In particular, the pose and correspondence estimation problem for 2D point clouds satisfies the conditions of Theorem 4 under some natural assumptions. We have for all such problems, and unless the linear transformation is degenerate. Furthermore, unless the keypoints are generated adversarially, the leverage scores of the matrix and the rows of are distinct. Thus, assuming that the noiseless version of model (1) exactly describes the keypoints detected in the two images (which is an idealization that may not be true in real data), we are guaranteed to find both the pose and the correspondence exactly.
In Figure 2, we demonstrate the guarantee of Theorem 4 on two image correspondence tasks when the keypoints detected in the two images are identical and the transformation between coordinates is linear.
2.4 Extensions to outliers
The results of Sections 2.1 and 2.2 also hold in a somewhat general setting, where the set of perturbations to the rows of the matrix is allowed to be larger than just the set of permutation matrices . In particular, defining the set of “clustering matrices” as
[TABLE]
we consider an observation model of the form
[TABLE]
where the matrices $A$, $X^*$, and $W$ are as before, and $\Pi^*$ now represents a clustering matrix. Such a clustering condition ensures stochasticity of the matrix $\Pi^*$ (not double stochasticity, as in the permutation model), and corresponds to the case where multiple responses may come from the same covariate, and some of the data may be permuted. Such a model is likely to better fit data from image correspondence problems when the keypoints detected in the two images are quite different. Also, such a formulation is loosely related to the $k$-means clustering problem with Gaussian data [ABC+15].
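To illustrate the difference between a permutation and a clustering matrix, the sketch below builds a clustering matrix from an assignment vector. It assumes, consistently with the row-stochasticity remark above, that a clustering matrix is a 0/1 matrix with exactly one nonzero entry per row and no constraint on the columns; this reading, the function name `clustering_matrix` and the toy assignment are assumptions of the sketch.

```python
import numpy as np

def clustering_matrix(assign, n):
    """Row i of the result selects covariate assign[i]; rows sum to one."""
    assign = np.asarray(assign, dtype=int)
    return np.eye(n)[assign]

n_cl = 4
assign_cl = [2, 2, 0, 3]                 # covariate 2 used twice, covariate 1 never
Pi_cl = clustering_matrix(assign_cl, n_cl)

A_cl = np.arange(8.0).reshape(n_cl, 2)   # toy design matrix
PA_cl = Pi_cl @ A_cl                     # row i equals A_cl[assign_cl[i]]
```

Here two responses come from the same covariate and one covariate generates no response, which a permutation matrix cannot express.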
As it turns out, Theorems 1, 2 and 3 also hold for this model, with minor modifications to the proofs. Defining the analogous MLE for this model as
[TABLE]
we have the following theorem.
Theorem 5.
(a) For any matrix , and for all parameters and , we have
[TABLE]
with probability greater than .
(b) For any choice of parameters and , the SVT estimate with satisfies
[TABLE]
with probability greater than .
(c) For any choice of parameters and , the square-root LASSO estimate (7) with satisfies
[TABLE]
with probability greater than .
Clearly, the lower bounds (5b) and (6b) hold immediately for the model (8) as a result of the inclusion .
3 Proofs
This section contains proofs of all our main results. We use to denote absolute constants that may change from line to line. We let denote the th largest singular value of a matrix .
3.1 Proof of Theorem 1
We split the proof into two natural parts, corresponding to the upper and lower bounds, respectively. The upper bound boils down to analyzing the Gaussian width [Pis99] of a certain set, which we obtain via Dudley’s entropy integral [Dud67] and bounds on the metric entropy of the observation space. The lower bound is obtained via a packing construction and an application of Fano’s inequality.
3.1.1 Proof of upper bound
Writing and , we have by the optimality of for problem (4) that , from which it follows that the error matrix satisfies the following basic inequality:
[TABLE]
where denotes the trace inner product between two matrices and . We prove inequality (5a) by proving the following claims.
[TABLE]
Proof of inequality (10a):
Applying the Cauchy-Schwarz inequality to the right-hand side of inequality (9) yields
[TABLE]
Squaring both sides of inequality (11) and using standard sub-exponential tail bounds [Wai15] yields inequality (10a). ∎
Proof of inequality (10b):
Without loss of generality, by rescaling as necessary, we may assume that the noise has standard normal entries (). We use to denote the set of matrices whose columns lie in the range of for some permutation matrix , i.e.,
[TABLE]
Also define the set
[TABLE]
as well as the function
[TABLE]
Before proceeding with the proof, we state the definition of the covering number of a set.
Definition 1 (Covering number).
A -cover of a set with respect to a metric is a set such that for each , there exists some such that . The -covering number is the cardinality of the smallest -cover.
The logarithm of the covering number is referred to as the metric entropy of a set. The following lemma bounds the metric entropy of the set . Let denote the Frobenius norm ball of radius centered at [math].
Lemma 1.
The metric entropy of the set in the Frobenius norm metric is bounded as
[TABLE]
We prove the lemma at the end of the section, taking it as given for the proof of inequality (10b).
Proof of inequality (10b).
By definition of , it is easy to see that we have
[TABLE]
One can also verify that the set is star-shaped (a set $S$ is said to be star-shaped if $t \in S$ implies that $\alpha t \in S$ for all $\alpha \in [0, 1]$), and so the following critical inequality holds for some :
[TABLE]
We are interested in the smallest (strictly) positive solution to inequality (14). Moreover, we would like to show that for every , we have with probability greater than .
Define the “bad” event
[TABLE]
Using the star-shaped property of , it follows by a rescaling argument that
[TABLE]
The entries of are i.i.d. standard Gaussian, and the function is convex and Lipschitz with parameter . Consequently, by Borell’s theorem (see, for example, Milman and Schechtman [MS86] for a simple proof), the following holds for all :
[TABLE]
By the definition of , we have for any , and consequently, for all , we have
[TABLE]
Now either , or we have . In the latter case, conditioning on the complementary event , our basic inequality implies that . Consequently, we have
[TABLE]
Putting together the pieces yields
[TABLE]
with probability at least for every .
In order to determine a feasible satisfying the critical inequality (14), we need to bound the expectation . We now use Dudley’s entropy integral [Dud67] to bound . In particular, for a universal constant , we have
[TABLE]
where in step , we have made use of Lemma 1, and in step , we have used the change of variables . Now comparing with the critical inequality, we see that
[TABLE]
Putting together the pieces then proves claim (10b). ∎
It remains to prove Lemma 1.
Proof of Lemma 1.
We begin by finding the -covering number of
[TABLE]
Note that is isomorphic to , where denotes the tensor product. Note that is a linear subspace of dimension . Also, since the set is an -dimensional -ball of radius , we have by a volume ratio argument that
[TABLE]
By definition, we also have , and so by the union bound, we have
[TABLE]
In order to complete the proof, we notice that
[TABLE]
since it is sufficient to use two -covers of the set in conjunction in order to obtain a -cover of the set . ∎
3.1.2 Proof of lower bound
As alluded to before, the bound follows from a packing set construction and Fano’s inequality, which is a standard template used to prove minimax lower bounds. Suppose we wish to estimate a parameter over an indexed class of distributions in the square of a (pseudo-)metric . We refer to a subset of parameters as a local -packing set if
[TABLE]
Note that this set is a -packing in the metric with the average KL-divergence bounded by . The following result is a straightforward consequence of Fano’s inequality:
Lemma 2 (Local packing Fano lower bound).
For any -packing set of cardinality , we have
[TABLE]
The remainder of the argument is directed to establishing the following two claims:
[TABLE]
It is easy to see that both claims together prove the lower bound.
Proof of claim (18a):
This claim is a consequence of classical minimax bounds on linear regression. Since we are operating in the matrix setting, we include the proof for completeness.
The proof involves the construction of a packing set such that for all , we have and . Since we are effectively packing the space , standard results show that there exists such a packing of this space with .
Also note that with the underlying parameter , our observations have the distribution . Hence, the KL divergence between two observations and is simply
[TABLE]
Substituting this into the bound of Lemma 2 with , we have
[TABLE]
where we have again used to denote the minimax rate of prediction.
Setting completes the proof of claim (18a). Note that the proof of this claim did not require the assumption that .
Proof of claim (18b)
For ease of exposition, we first prove claim (18b) for matrices in a smaller class than . We let denote the -dimensional vector having in its first coordinates and [math] in the remaining coordinates.
Now consider the class of matrices that have in their range. By multiplying with and stacking of these vectors up as columns, we have a matrix whose first rows are identically and the rest are identically zero. Define the Hamming distance between two binary vectors as the number of coordinates in which they differ. We require the following lemma.
Lemma 3.
There exists a set of binary -vectors , each of Hamming weight and satisfying , having cardinality
The lemma is proved at the end of this section.
Proof of claim (18b)
Applying Lemma 3 and a rescaling argument, we see that there is a packing set such that
[TABLE]
Fixing some constant and choosing and , it can be verified that we obtain a packing set of size . We now have observations distributed as , and so
[TABLE]
Finally, substituting into the Fano bound of Lemma 2 yields
[TABLE]
Setting for a constant depending only on completes the proof provided the vector for with .
It remains to extend the proof to matrices in the class , and to prove Lemma 3.
By definition, if , then there exists a vector such that . We may assume that by a rescaling argument, and also that . By definition, we have
[TABLE]
It can also be verified that since , we must have . For the rest of the proof, we assume for simplicity of exposition that is an integer. Fixing the value , consider the -packing generated by permutations of the vector , given by Lemma 3 by taking . Using these permutations, we observe that
[TABLE]
where depends on the constants , and we have used condition (20) along with the fact that .
Following similar steps to before then proves the claim for all matrices .
It remains to prove Lemma 3.
Proof of Lemma 3
The proof follows by a volume ratio argument that underlies the proof of the Gilbert-Varshamov bound. In particular, the number of permuted vectors of that are within a Hamming distance of is given by . Now form a graph with all permuted vectors of as vertices and connect two vertices if the corresponding vectors have Hamming distance less than . Then such a graph has uniform degree and therefore contains an independent set of size . ∎
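The graph argument above is constructive: any maximal independent set in the graph has size at least the number of vertices divided by one plus the degree, and a greedy scan produces such a set. The following brute-force sketch illustrates this for small instances. It enumerates all binary vectors of a given Hamming weight and keeps those at Hamming distance at least `min_dist` from everything kept so far. The function names and the parameters `n=8`, `weight=4`, `min_dist=4` are illustrative choices, not the ones used in the lemma.

```python
import itertools
import math
import numpy as np

def greedy_constant_weight_packing(n, weight, min_dist):
    """Greedily pick weight-`weight` binary n-vectors with pairwise Hamming distance >= min_dist."""
    chosen = []
    for support in itertools.combinations(range(n), weight):
        v = np.zeros(n, dtype=int)
        v[list(support)] = 1
        if all(np.sum(v != u) >= min_dist for u in chosen):
            chosen.append(v)
    return chosen

def volume_ratio_bound(n, weight, min_dist):
    """Lower bound N / (D + 1) on the size of any maximal packing."""
    num_vertices = math.comb(n, weight)
    # Two weight-`weight` vectors are at distance 2j when they differ on j ones and j zeros.
    degree = sum(math.comb(weight, j) * math.comb(n - weight, j)
                 for j in range(1, weight + 1) if 2 * j < min_dist)
    return num_vertices / (degree + 1)

packing = greedy_constant_weight_packing(n=8, weight=4, min_dist=4)
bound = volume_ratio_bound(n=8, weight=4, min_dist=4)
```

The greedy set is maximal by construction, so its size is guaranteed to be at least `bound`; the enumeration is exponential in `n` and serves only to make the counting argument tangible.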
3.2 Proof of Theorem 2
Again, we divide our proof into two parts, corresponding to the upper and lower bounds respectively.
3.2.1 Proof of upper bound
For this proof, we use the shorthand . Also fix , and let be the number of singular values of greater than . Also, let denote the matrix formed by truncating to its top singular values. By the triangle inequality, we have
[TABLE]
Now note that by standard results in random matrix theory (see, for example, [Wai15, Theorem 6.1]), we have with probability greater than . We condition on this event for the rest of the proof.
Consequently, for , we have
[TABLE]
and so . Additionally, we have
[TABLE]
Putting together the pieces yields
[TABLE]
a bound that holds with probability greater than . In order to complete the proof, we note that . ∎
3.2.2 Proof of lower bound
We split our analysis into two separate cases.
Case 1:
First suppose that . Consider any matrix , and . By definition of the thresholding operation, we have
[TABLE]
The triangle inequality yields
[TABLE]
Now with probability greater than , we have , so that conditioned on this event, we have
[TABLE]
which completes the proof.
Case 2:
We now suppose that . Let the matrix have the (reduced) singular value decomposition , and introduce the shorthand . Form the diagonal matrix . Now let , and consider the parameter matrix , where is an dimensional matrix with orthonormal columns. Note that such a choice exists when .
We now have
[TABLE]
For two matrices with , it can be verified that
[TABLE]
By the definition of the thresholding operation, the top singular values of the matrix are all either greater than , or equal to zero. Hence, we have
[TABLE]
where the last step follows since , which completes the proof. ∎
3.3 Proof of Theorem 3
It is again helpful to write the observation model in the form , where represents the underlying matrix we are trying to predict. Let us denote the choice of in the statement of Theorem 3 by . We use the shorthand , and . Let and denote, respectively, the projection matrices onto the rowspace of the matrix and its orthogonal complement.
We require the following auxiliary lemmas for our proof:
Lemma 4.
We have
[TABLE]
Lemma 5.
If , we have
[TABLE]
We are now ready to prove Theorem 3.
Proof of Theorem 3.
First, note that by standard results on concentration of chi-squared random variables and random matrices (see, for instance, Wainwright [Wai15]), we have
[TABLE]
Hence, we have
[TABLE]
For the rest of the proof, we condition on the event .
Now, by definition of the quantity , we have
[TABLE]
Some simple algebra yields
[TABLE]
Now, from the definition of the estimate , we have
[TABLE]
Rearranging terms yields
[TABLE]
where step follows from Lemma 4, and the fact that .
Another rearrangement of inequality (22) yields
[TABLE]
where step follows from Lemma 4, and the fact that . Thus, we have established the upper bound , where
[TABLE]
Expanding the product of the two terms yields
[TABLE]
where step follows from Lemma 5, since .
We also note that
[TABLE]
Combining with the fact that satisfies the inequality , we find that
[TABLE]
where in step , we have used the Cauchy-Schwarz inequality and the fact that projections are non-expansive to write
[TABLE]
Rearranging yields
[TABLE]
Squaring both sides, substituting the choice of , and using the condition completes the proof. ∎
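The proof above leans on a few elementary facts about row-space projections: the Frobenius norm splits orthogonally across a projection and its complement, projections are non-expansive, and the trace inner product obeys the Cauchy-Schwarz inequality. The sketch below checks these numerically. The matrices `M`, `B`, and `C` and their sizes are hypothetical stand-ins, not the specific matrices of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; M is a generic rank-d matrix standing in for the
# matrix whose row space defines the projections.
n, d, m = 12, 3, 8
M = rng.standard_normal((n, d)) @ rng.standard_normal((d, m))

# Orthonormal basis of the row space of M from its top-d right singular vectors.
_, _, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt[:d].T
P = V @ V.T             # projection onto the row space (m x m)
P_perp = np.eye(m) - P  # projection onto its orthogonal complement

B = rng.standard_normal((n, m))
C = rng.standard_normal((n, m))

# Pythagorean split of the squared Frobenius norm.
pythagoras_gap = abs(
    np.linalg.norm(B) ** 2
    - np.linalg.norm(B @ P) ** 2
    - np.linalg.norm(B @ P_perp) ** 2
)

# Trace inner product <C, B P> and its Cauchy-Schwarz bound, using the
# non-expansiveness ||B P||_F <= ||B||_F.
inner = np.sum(C * (B @ P))
bound = np.linalg.norm(C) * np.linalg.norm(B)
print(pythagoras_gap, inner, bound)
```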
The only remaining detail is to prove Lemmas 4 and 5.
3.3.1 Proof of Lemma 4
We write
[TABLE]
Rearranging yields the claim. ∎
3.3.2 Proof of Lemma 5
Rearranging the Cauchy-Schwarz inequality for two matrices and yields
[TABLE]
Now setting and , we have
[TABLE]
where step follows from Hölder’s inequality and choice of .
Combining this with the basic inequality (22) yields
[TABLE]
Finally, using Lemma 4, we have
[TABLE]
which completes the proof. ∎
3.4 Proof of Theorem 4
We write the (reduced) singular value decomposition of a matrix as . We also adopt the shorthand for the rest of this proof. The LevSort algorithm clearly runs in polynomial time, since it involves a singular value decomposition and a sorting operation, both of which can be accomplished efficiently. Let us now verify the exactness guarantee.
Since the observation model (1) is noiseless and , we have . Moreover, by definition of the observation model, we have
[TABLE]
Consequently, the unknown matrix can be written as
[TABLE]
with representing an unknown unitary matrix (satisfying ). Substituting this representation of back into the noiseless observation model yields
[TABLE]
Now has a full-dimensional row-space, and so we have . We complete the proof by observing that
[TABLE]
so that we have the equivalence as claimed. The uniqueness of the parameters follows from the fact that the leverage score vectors and have distinct entries. ∎
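The argument above suggests a simple recipe: compute leverage scores from the left singular vectors of the design and of the observations, match rows by sorting both score vectors, and then solve a least-squares problem. The following is a minimal sketch of that recipe under the assumptions of the proof (noiseless data, parameter matrix with full row rank, distinct leverage scores). It is not claimed to be the authors' exact pseudocode; the function names `leverage_scores` and `levsort_sketch` and all problem sizes are hypothetical.

```python
import numpy as np


def leverage_scores(M, rank):
    """Squared row norms of the top-`rank` left singular vectors of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U[:, :rank] ** 2, axis=1)


def levsort_sketch(Y, A):
    """Recover (perm, X) with Y = A[perm] @ X in the noiseless case.

    Assumes A has full column rank, X has full row rank, and the leverage
    scores of A are distinct.
    """
    n, d = A.shape
    order_A = np.argsort(leverage_scores(A, d))
    order_Y = np.argsort(leverage_scores(Y, d))
    # The k-th smallest score of Y and the k-th smallest score of A belong
    # to matching rows.
    perm = np.empty(n, dtype=int)
    perm[order_Y] = order_A
    # With the rows aligned, ordinary least squares recovers X.
    X_hat, *_ = np.linalg.lstsq(A[perm], Y, rcond=None)
    return perm, X_hat


rng = np.random.default_rng(2)

# Hypothetical sizes (assumptions for illustration).
n, d, m = 30, 3, 5
A = rng.standard_normal((n, d))
X = rng.standard_normal((d, m))
true_perm = rng.permutation(n)
Y = A[true_perm] @ X  # noiseless permuted observations

perm_hat, X_hat = levsort_sketch(Y, A)
print(np.array_equal(perm_hat, true_perm), np.allclose(X_hat, X))
```

The matching step works because the leverage scores depend only on the column space, and the column space of the noiseless observations is a row-permuted copy of the column space of the design. The cost is dominated by the two singular value decompositions and the two sorts.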
3.5 Proof of Theorem 5
The proofs of Theorems 1, 2, and 3 apply to the model (8) with minor modifications. We briefly mention these modifications here, leaving the details to the reader.
Part (a) follows by mimicking the proof of Section 3.1.1 as is, with a small modification to the metric entropy of the observation space. In particular, the covering number of the observation space is now upper bounded by , and the rest of the proof follows as before.
Parts (b) and (c) follow by mimicking the proofs of Sections 3.2.1 and 3.3, respectively, with the definition . Note that the clustering observation model can only decrease the rank of from before. ∎
4 Discussion
We conclude with a discussion of some possible future directions.
4.1 More general picture for regression problems
Multivariate linear regression is a special case of the following problem with shuffled data , in which the covariates and responses are related by the equation
[TABLE]
where represents a function from some parametric or non-parametric family . The general behaviour of prediction error for problems of this form should be similar to that seen in our linear regression model, or the structured regression model of Flammarion et al. [FMR16]. In particular, provided the data is sufficiently diverse and the function class is sufficiently expressive, the minimax rate of prediction for the permuted model should be given by the sum of two terms: the minimax rate of the unpermuted model (or equivalently, with a known permutation), and an additional constant/logarithmic term that accounts for the permutation.
4.2 Necessity of flatness condition and adaptivity
Our condition on the matrix is a convenient one for the application of the Gilbert-Varshamov type bound on distances between permuted binary vectors. However, this sufficient condition may be far from necessary – we instead require some permutation codes of real numbers.
Conversely, the upper bound (5a) can be stated by explicitly taking the structure of the matrix into account; this will require bounds on the metric entropy of the union of subspaces generated by permutations of the range space of .
References
- [ABC+15] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, and R. Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191–200. ACM, 2015.
- [BCW11] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
- [CCS10] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
- [Cha15] S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
- [CLS15] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [Dud67] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
- [EBDG14] V. Emiya, A. Bonnefoy, L. Daudet, and R. Gribonval. Compressed sensing with unknown sensor permutation. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 1040–1044. IEEE, 2014.
- [FMR16] N. Flammarion, C. Mao, and P. Rigollet. Optimal rates of statistical seriation. arXiv preprint arXiv:1607.02435, 2016.
