Denoising Linear Models with Permuted Data

Ashwin Pananjady; Martin J. Wainwright; Thomas A. Courtade

arXiv:1704.07461·stat.ML·April 26, 2017


TL;DR

This paper characterizes the minimax error rate for denoising in permuted linear models with Gaussian noise, analyzes efficient estimators, and provides algorithms applicable to image matching and datasets with outliers.

Contribution

It offers a sharp characterization of the minimax error rate and analyzes the performance of efficient estimators for denoising in permuted linear models.

Findings

01. Minimax error rate characterized up to logarithmic factors.

02. Efficient estimators shown to be consistent across various parameters.

03. Exact algorithm demonstrated on image point-cloud matching.

Abstract

The multivariate linear regression model with shuffled data and additive Gaussian noise arises in various correspondence estimation and matching problems. Focusing on the denoising aspect of this problem, we provide a characterization of the minimax error rate that is sharp up to logarithmic factors. We also analyze the performance of two versions of a computationally efficient estimator, and establish their consistency for a large range of input parameters. Finally, we provide an exact algorithm for the noiseless problem and demonstrate its performance on an image point-cloud matching task. Our analysis also extends to datasets with outliers.

Equations

$$Y$$

$$y=\Pi^{*}Ax^{*}+w,$$

$$\mathcal{R}(\widehat{\Pi},\widehat{X})$$

$$\inf_{\widehat{\Pi}\in\mathcal{P}_{n},\,\widehat{X}\in\mathbb{R}^{d\times m}}\mathcal{R}(\widehat{\Pi},\widehat{X})=\inf_{\widehat{\Pi}\in\mathcal{P}_{n},\,\widehat{X}\in\mathbb{R}^{d\times m}}\ \frac{1}{nm}\sup_{\Pi^{*}\in\mathcal{P}_{n},\,X^{*}\in\mathbb{R}^{d\times m}}\mathbb{E}\,|\!|\!|\widehat{\Pi}A\widehat{X}-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}.$$

$$(\widehat{\Pi}_{\sf ML},\widehat{X}_{\sf ML})=\arg\min_{\Pi\in\mathcal{P}_{n},\,X\in\mathbb{R}^{d\times m}}|\!|\!|Y-\Pi AX|\!|\!|_{\mbox{\tiny{F}}}^{2}.$$

$$\mathcal{A}(\gamma,\xi)$$

$$\frac{|\!|\!|\widehat{\Pi}_{\sf ML}A\widehat{X}_{\sf ML}-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}$$

$$\sup_{\Pi^{*}\in\mathcal{P}_{n},\,X^{*}\in\mathbb{R}^{d\times m}}\mathbb{E}\left[\frac{|\!|\!|\widehat{\Pi}A\widehat{X}-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}\right]$$

$$c_{2}\sigma^{2}\leq\inf_{\widehat{\Pi}\in\mathcal{P}_{n},\,\widehat{x}\in\mathbb{R}^{d}}\ \sup_{\Pi^{*}\in\mathcal{P}_{n},\,x^{*}\in\mathbb{R}^{d}}\mathbb{E}\left[\frac{1}{n}\|\widehat{\Pi}A\widehat{x}-\Pi^{*}Ax^{*}\|_{2}^{2}\right]\leq c_{1}\sigma^{2}.$$

$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-\Pi_{0}AX_{0}|\!|\!|_{\mbox{\tiny{F}}}^{2}\geq c_{2}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right),$$

$$\widehat{Y}_{\sf sr}(\lambda)=\arg\min_{Y^{\prime}}\ |\!|\!|Y-Y^{\prime}|\!|\!|_{\mbox{\tiny{F}}}+\lambda|\!|\!|Y^{\prime}|\!|\!|_{\mbox{\tiny{nuc}}}.$$

$$\frac{1}{nm}|\!|\!|\widehat{Y}_{\sf sr}(\lambda)-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

$$\mathcal{C}_{n}=\left\{D\in\{0,1\}^{n\times n}\mid D\mathbf{1}=\mathbf{1}\right\},$$

$$Y=D^{*}AX^{*}+W,$$

$$(\widehat{D}_{\sf ML},\widehat{X}_{\sf ML})=\arg\min_{D\in\mathcal{C}_{n},\,X\in\mathbb{R}^{d\times m}}|\!|\!|Y-DAX|\!|\!|_{\mbox{\tiny{F}}}^{2},$$

$$\frac{|\!|\!|\widehat{D}_{\sf ML}A\widehat{X}_{\sf ML}-D^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}$$

$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-D^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

$$\frac{1}{nm}|\!|\!|\widehat{Y}_{\sf sr}(\lambda)-D^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

$$\frac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}$$

$$\Pr\left\{\frac{|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}\geq 8\sigma^{2}\right\}\leq e^{-\frac{nm}{8}},\quad\text{and}$$

$$\Pr\left\{\frac{|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}\geq c_{2}\sigma^{2}\left(\frac{d}{n}+\frac{\log n}{m}\right)\right\}\leq e^{-c(n\log n+m\operatorname{\sf rank}(A))}.$$

$$\frac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}$$

$$\mathbb{U}_{m}(A)$$

$$\mathbb{U}_{m}^{\sf diff}(A)$$

$$Z(t)$$

$$\log N\left(\delta,\ \mathbb{U}_{m}^{\sf diff}(A)\cap\mathbb{B}_{F}(t),\ |\!|\!|\cdot|\!|\!|_{\mbox{\tiny{F}}}\right)$$

$$\frac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq Z\big(|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}\big).$$

$$\mathbb{E}[Z(\delta_{n,m})]\leq\frac{\delta_{n,m}^{2}}{2}.$$

$$\mathcal{E}_{t}$$

$$\Pr[\mathcal{E}_{t}]\leq\Pr\left[Z(\delta_{n,m})\geq 2\delta_{n,m}\sqrt{t\delta_{n,m}}\right]\quad\text{for all }t\geq\delta_{n,m}.$$


1 Introduction

The linear model is a ubiquitous and well-studied tool for predicting responses $y$ based on a vector $a$ of covariates or predictors. In this paper, we consider the multivariate version of the model, with vector-valued responses $y_{i}\in\mathbb{R}^{m}$ , and covariates $a_{i}\in\mathbb{R}^{d}$ . In the standard formulation of this problem, estimation is performed on the basis of a data set of $n$ pairs $\{a_{i},y_{i}\}_{i=1}^{n}$ , in which each response $y_{i}$ is correctly associated with the covariate vector $a_{i}$ that generated it. Our focus is instead on the following variant of the standard set-up: the input consists of the permuted data set $\{a_{i},y_{\pi_{i}}\}_{i=1}^{n}$ , where $\pi$ represents an unknown permutation. The presence of this unknown permutation—which can be viewed as a nuisance parameter—introduces substantial challenges to this problem.

It is convenient to introduce matrix-vector notation so as to state the problem more precisely. If we form the matrices $A\in\mathbb{R}^{n\times d}$ and $Y\in\mathbb{R}^{n\times m}$ with $a_{i}^{T}$ and $y_{i}^{T}$ , respectively, as their $i^{th}$ row, we arrive at the model

$$Y=\Pi^{*}AX^{*}+W,\tag{1}$$

where $\Pi^{*}$ is an unknown $n\times n$ permutation matrix, $X^{*}\in\mathbb{R}^{d\times m}$ is an unknown matrix of parameters, and $W$ is the additive observation noise (we refer to the setting $W=0$ a.s. as the noiseless case). When $m=1$, this reduces to the vector linear regression model with an unknown permutation, given by

$$y=\Pi^{*}Ax^{*}+w,\tag{2}$$

which we refer to as the shuffled vector model.
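As a concrete illustration of models (1) and (2), the following minimal sketch draws data from the permuted model with NumPy. The function name `sample_permuted_model`, the index-vector encoding of $\Pi^{*}$, and the chosen dimensions are illustrative assumptions, not anything specified in the paper.

```python
import numpy as np

def sample_permuted_model(A, X_star, sigma, rng):
    """Draw (Y, perm) from Y = Pi* A X* + W with W_ij ~ N(0, sigma^2).

    perm encodes Pi*: row i of Pi* A is row perm[i] of A.
    """
    n = A.shape[0]
    perm = rng.permutation(n)
    signal = A[perm] @ X_star                      # Pi* A X*
    noise = sigma * rng.standard_normal(signal.shape)
    return signal + noise, perm

rng = np.random.default_rng(0)
n, d, m = 6, 2, 3                                  # illustrative sizes
A = rng.standard_normal((n, d))
X_star = rng.standard_normal((d, m))

# sigma = 0 gives the noiseless case; m = 1 gives the shuffled vector model.
Y, perm = sample_permuted_model(A, X_star, sigma=0.0, rng=rng)

# The same permutation written as an explicit n x n matrix.
Pi_star = np.eye(n)[perm]
```

Storing the permutation as an index vector avoids forming the $n\times n$ matrix; `A[perm]` and `Pi_star @ A` are the same array.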

The observation model (1) arises in multiple applications, which are discussed in detail for the shuffled vector model (2) in our earlier work [PWC16]. Here let us describe two applications that arise in the multivariate setting ( $m>1$ ), which we use as running examples throughout the paper.

Example 1 (Pose and correspondence estimation).

Our first motivating application is the problem of pose and correspondence estimation in images [MSC09]; it is closely related to point-cloud matching in graphics [Man93]. Suppose that we are given two images of a similar object, with the coordinates of one image arising from an unknown linear transformation of the coordinates of the second. In order to determine the linear transformation, keypoints are detected in each of the images individually and then matched; see Figure 1 for an illustration. We emphasize that in practice, the keypoint detection algorithm also returns features that help in finding the matching permutation $\Pi^{*}$ , but our goal here is to analyze whether there are procedures that are robust to such features being missing or corrupted. It is also worth noting that while in this example we have $d=m=2$ , the model is also valid for higher (but equal) parameters $d$ and $m$ , if we assume that in addition to the coordinates of the keypoints, other attributes like pixel brightness, colour, etc. in the two images are also related by a linear transformation.

Example 2 (Header-free communication).

A second application is that of header-free communication in large communication networks [PWC16]. Suppose that we use multiple sensors to take noisy measurements of an unknown matrix $X^{*}$ of parameters; each measurement corresponds to a noisy linear observation of the form $a_{i}^{\top}X^{*}+w_{i}^{\top}$. In very large networks, such as those that arise in Internet of Things applications, it is often found that the bandwidth between a sensor and fusion center is mainly dominated by a header containing identity information, that is, by a bitstring that identifies sensor $i$ to the fusion center [KSF+09]. One possible solution to this problem is header-free communication, meaning that the identities of the sensors that sent the signal are no longer known to the fusion center. This absence can be modeled by introducing the unknown permutation matrix as in our model. If we are still able to achieve similar statistical performance without these headers, then such an approach is clearly preferable from a bandwidth standpoint.

With this motivation in hand, let us now provide a high-level overview of the main results of this paper. We focus on the multivariate model (1) with a fixed design matrix $A$, and Gaussian noise $W_{ij}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^{2})$ (our results also extend to the case of i.i.d. sub-Gaussian noise). We evaluate an estimator $(\widehat{\Pi},\widehat{X})$ based on its “denoising” capability, which we capture using the normalized prediction error $\frac{1}{nm}|\!|\!|\widehat{\Pi}A\widehat{X}-\Pi^{*}AX^{*}|\!|\!|_{{\mbox{\tiny{F}}}}^{2}$. Our primary objective in this paper is to characterize the fundamental limits of denoising in a minimax sense. In particular, an estimator is any measurable mapping of the input $(Y,A)$ to estimates $(\widehat{\Pi},\widehat{X})$ of the permutation and regression matrix, and we measure the quality of these estimates via their uniform mean-squared error

[TABLE]

Our interest will be in upper and lower bounding this quantity as a function of the design matrix $A$ , dimensions $(n,m,d)$ and the noise variance $\sigma^{2}$ . We also demonstrate an explicit (but computationally expensive) algorithm that achieves the minimax risk up to a $\log(n)$ factor, and analyze polynomial-time estimators with slightly larger prediction error.

In both of the examples discussed above, estimators with small minimax prediction error are of interest. In the pose and correspondence estimation problem, obtaining low prediction error is equivalent to obtaining near-identical keypoint locations on both images; in the sensor network example, we are interested in obtaining a set of noise-free linear functions of the input signal. It is important to note that depending on the application, multiple regimes of the parameter triplet $(n,m,d)$ are of interest. Therefore, in this paper, we focus on capturing the dependence of denoising error rates on all of these parameters, and also on the structure of the matrix $A$ .

Our work contributes to the growing body of literature on regression problems with unknown permutations, as well as related row-space perturbation problems including blind deconvolution [LS15], phase retrieval [CLS15], and dictionary learning [TF11]. Regression problems with unknown permutations have been considered in the context of statistical seriation and univariate isotonic matrix recovery [FMR16], and non-parametric ranking from pairwise comparisons [SBGW17], which involves bivariate isotonic matrix recovery. Moreover, the prediction error is used to evaluate estimators in both these applications.

Specializing to our setting, the shuffled vector model (2) was first considered in the context of compressive sensing with a sensor permutation [EBDG14]. The first theoretical results were provided by Unnikrishnan et al. [UHV15], who provided necessary and sufficient conditions needed to recover an adversarially chosen $x^{*}$ in the noiseless model with a random design matrix $A$ . Also in the random design setting, our own previous work [PWC16] focused on the complementary problem of recovering $\Pi^{*}$ in the noisy model, and showed necessary and sufficient conditions on the SNR under which exact and approximate recovery were possible. An efficient algorithm to compute the maximum likelihood estimate was also provided for the special case $d=1$ .

1.1 Our contributions

First, we characterize the minimax prediction error of the multivariate linear model with an unknown permutation up to a logarithmic factor, by analyzing the maximum likelihood estimator. Second, since the maximum likelihood estimate is NP-hard to compute in general [PWC16], we propose a computationally efficient estimator based on singular value thresholding and sharply characterize its performance, showing that it achieves vanishing prediction error over a restricted range of parameters. We also propose a variant of this estimator that achieves the same error rates, but with the advantage that it does not require the noise variance to be known. Third, we propose an efficient spectral algorithm for the noiseless problem that is exact provided certain natural conditions are met. We demonstrate this algorithm on an image point cloud matching task. Finally, we extend our results to a richer class of models that allows for outliers in the dataset. In the next section, we collect our main theorems and discuss their consequences. Proofs are postponed to Section 3.

Notation:

We use $\mathcal{P}_{n}$ to denote the set of permutation matrices. Let $I_{d}$ denote the identity matrix of dimension $d$ . We use the notation $|\!|\!|M|\!|\!|_{{\mbox{\tiny{F}}}}$ , $|\!|\!|M|\!|\!|_{{\tiny{\mbox{op}}}}$ , and $|\!|\!|M|\!|\!|_{{\tiny{\mbox{nuc}}}}$ to denote the Frobenius, operator, and nuclear norms of a matrix $M$ , and $c,c_{1},c_{2}$ to denote universal constants that may change from line to line.

2 Main results

In this section, we state our main results and discuss some of their consequences. We divide our results into four subsections, having to do with minimax rates, polynomial time estimators, efficient procedures for the noiseless problem, and an extension of the model (1) that allows for outliers.

2.1 Minimax rates of prediction

Assuming that the noise $W$ is i.i.d. Gaussian, the maximum likelihood estimate (MLE) of the parameters $(\Pi^{*},X^{*})$ is given by

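$$(\widehat{\Pi}_{\sf ML},\widehat{X}_{\sf ML})=\arg\min_{\Pi\in\mathcal{P}_{n},\,X\in\mathbb{R}^{d\times m}}|\!|\!|Y-\Pi AX|\!|\!|_{\mbox{\tiny{F}}}^{2}.\tag{4}$$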

This estimator is also sensible for non-Gaussian noise, as long as its tail behavior is similar to the Gaussian case (as can be formalized by the notion of sub-Gaussianity).
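To make the optimization problem (4) concrete, here is a brute-force sketch that enumerates all $n!$ permutations and solves an ordinary least-squares problem for each one. It is only meant for tiny $n$ and is not an algorithm from the paper; the name `brute_force_mle` and the test sizes are assumptions made for illustration.

```python
import itertools
import numpy as np

def brute_force_mle(Y, A):
    """Exhaustive search for the minimizer of ||Y - Pi A X||_F^2 over (Pi, X).

    Returns (cost, perm, X), where perm[i] is the row of A placed in row i.
    Runs in time proportional to n!, so it is usable only for very small n.
    """
    n = A.shape[0]
    best_cost, best_perm, best_X = np.inf, None, None
    for perm in itertools.permutations(range(n)):
        PA = A[list(perm)]                          # Pi A for this candidate
        X, *_ = np.linalg.lstsq(PA, Y, rcond=None)  # best X for fixed Pi
        cost = np.linalg.norm(Y - PA @ X) ** 2
        if cost < best_cost:
            best_cost, best_perm, best_X = cost, np.array(perm), X
    return best_cost, best_perm, best_X

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
X_star = rng.standard_normal((2, 2))
true_perm = rng.permutation(5)
Y_clean = A[true_perm] @ X_star
Y_noisy = Y_clean + 0.1 * rng.standard_normal(Y_clean.shape)

cost_clean, perm_clean, X_clean = brute_force_mle(Y_clean, A)
cost_noisy, perm_noisy, X_noisy = brute_force_mle(Y_noisy, A)
```

For a fixed permutation the inner problem is standard least squares, so the whole difficulty of (4) lies in the outer search over permutations.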

In this section, we begin by providing an upper bound on the prediction error achieved by the maximum likelihood estimator for any design matrix $A$. In general, however, it is impossible to prove a matching lower bound for an arbitrary matrix $A$. As an extreme example, suppose that the matrix $A$ has identical rows: in this case, the permutation matrix $\Pi^{*}$ plays no role whatsoever, and the problem is obviously much easier than with a generic matrix $A$.

With this fact in mind, we derive lower bounds that apply provided the matrix $A$ lies in a restricted class. Defining this class requires some additional notation. For a vector $v$, let $v^{s}$ denote the vector sorted in decreasing order, and let $\mathbb{B}_{2,n}(1)$ denote the $n$-dimensional $\ell_{2}$-ball of unit radius centered at the origin. Define the matrix class

[TABLE]

In rough terms, this condition defines matrices that are not “flat”, meaning that there is some vector in their range obeying the $(\gamma,\xi)$ -separation condition defined above. It can be verified that a matrix $A$ with i.i.d. sub-Gaussian entries lies in the class $\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ with high probability for fixed constants $C_{1},C_{2}$ . We are now ready to state our first main result:

Theorem 1.

For any triple $(A,X^{*},\Pi^{*})\in\mathbb{R}^{n\times d}\times\mathbb{R}^{d\times m}\times\mathcal{P}_{n}$ , we have

[TABLE]

Conversely, for any matrix $A\in\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ , and any estimator $(\widehat{\Pi},\widehat{X})$ , we have

[TABLE]

where the constant $c_{2}$ depends on the value of the pair $(C_{1},C_{2})$ , but is independent of other problem parameters.

Theorem 1 characterizes the minimax rate up to a factor that is at most logarithmic in $n$. It shows that the MLE is minimax optimal for prediction error up to logarithmic factors for all matrices that are not too flat. The bounds have the following interpretation, similar to the results of Flammarion et al. [FMR16] on prediction error for unimodal columns. The first term corresponds to a rate achieved even if the estimator knows the true permutation $\Pi^{*}$; the second term quantifies the price paid for the combinatorial choice among $n!$ permutations. As a result, we see that if $m\gg\log n$, then the permutation does not play much of a role in the problem, and the rates resemble those of standard linear regression. Such a general behaviour is expected, since a large $m$ means that we get multiple observations with the same unknown permutation, and this should allow us to estimate $\Pi^{*}$ better.

Clearly, a flat matrix is not influenced by the unknown permutation, and so the second term of the lower bound need not apply. As we demonstrate in the proof, it is likely that the flatness of $A$ can also be incorporated in order to prove a tighter upper bound in this case, but we choose to state the upper bound as holding uniformly for all matrices $A$ , with the loss of a logarithmic factor.

It is also worth mentioning that the logarithmic factor in the second term is shown to be nearly tight for the problem of unimodal matrix estimation with an unknown permutation [FMR16], suggesting that a similar factor may also appear in a tight version of our lower bound (5b). For the specific case where $m=1$ however, which corresponds to the shuffled vector model (2), our bounds are tight up to constant factors, and summarized by the following corollary.

Corollary 1.

In the case $m=1$ , for any matrix $A\in\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ , we have

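$$c_{2}\sigma^{2}\leq\inf_{\widehat{\Pi}\in\mathcal{P}_{n},\,\widehat{x}\in\mathbb{R}^{d}}\ \sup_{\Pi^{*}\in\mathcal{P}_{n},\,x^{*}\in\mathbb{R}^{d}}\mathbb{E}\left[\frac{1}{n}\|\widehat{\Pi}A\widehat{x}-\Pi^{*}Ax^{*}\|_{2}^{2}\right]\leq c_{1}\sigma^{2}.$$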

In other words, the normalized minimax prediction error for the shuffled vector model does not decay with the parameters $n$ or $d$ , and so no estimator achieves consistent prediction for every parameter choice $(\Pi^{*},x^{*})$ . Again, this is a consequence of the fact that—unlike when $m$ is large—we do not get independent observations with the permutation staying fixed, and herein lies the difficulty of the problem.

Both Theorem 1 and Corollary 1 provide non-adaptive minimax bounds. An interesting question is whether the least squares estimator is also minimax optimal up to logarithmic factors over finer classes of $\Pi^{*}$ and $X^{*}$ , i.e., whether it is adaptive in some interesting way. One would expect that the estimator adapts to the parameter $\kappa(AX^{*})$ , the number of distinct entries in the matrix $AX^{*}$ , similarly to the problem of monotone parameter recovery [FMR16].

2.2 Polynomial time estimators

As shown in our past work [PWC16], computing the MLE estimate (4) is NP-hard in general. Accordingly, it is natural to turn our attention to alternative estimators, and in particular ones that are guaranteed to run in polynomial time.

Here we analyze two simple methods for estimating the matrix $\Pi^{*}AX^{*}$, based either on singular value thresholding or on a closely related variant that uses an explicit regularization based on the nuclear norm. It is well-known that such methods are appropriate when the matrix is low-rank, or approximately low-rank. While the matrix $Y^{*}=\Pi^{*}AX^{*}$ need not be low-rank, its rank is bounded by that of the matrix $A$, a fact that we leverage in our bounds.

Given a matrix $M$ with the singular value decomposition $M=\sum_{i=1}^{r}\sigma_{i}u_{i}v_{i}^{\top}$ , its singular value thresholded version at level $\lambda$ is given by $T_{\lambda}(M)=\sum_{i=1}^{r}\sigma_{i}\mathbb{I}(\sigma_{i}\geq\lambda)u_{i}v_{i}^{\top}$ , where $\mathbb{I}(\cdot)$ is the indicator function of its argument.

The singular value thresholding (SVT) operation serves the purpose of denoising the observation matrix, and has been analyzed in the context of more general matrix estimation problems by various authors (e.g., [CCS10, Cha15]).

Theorem 2.

For any matrices $(\Pi^{*},X^{*})$ , the SVT estimate with $\lambda=1.1\sigma(\sqrt{n}+\sqrt{m})$ satisfies

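$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right).\tag{6a}$$

Conversely, for any matrix $A$ with rank at most $m$, there exist matrices $\Pi_{0}$ and $X_{0}$ (that may depend on $A$) such that for any threshold $\lambda>0$, we have

$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-\Pi_{0}AX_{0}|\!|\!|_{\mbox{\tiny{F}}}^{2}\geq c_{2}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right),\tag{6b}$$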

with probability greater than $1-e^{-cnm}$ .

Comparing inequalities (5b) (which holds for any denoised matrix, not just those having the form $\widehat{\Pi}A\widehat{X}$ ) and (6b), we see that the SVT estimator, while computationally efficient, may be statistically sub-optimal. However, it is consistent in the case where $\operatorname{{\sf rank}}(A)$ is sufficiently small compared to $m$ and $n$ , and minimax optimal when $\operatorname{{\sf rank}}(A)$ is a constant. Intuitively, the rate it attains is a result of treating the full matrix $\Pi^{*}A$ as unknown, and so it is likely that better, efficient estimators exist that take the knowledge of $A$ into account.
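The SVT operator $T_{\lambda}$ and the threshold $\lambda=1.1\sigma(\sqrt{n}+\sqrt{m})$ from Theorem 2 can be sketched in a few lines of NumPy. The function name `svt`, the random instance, and the dimensions below are assumptions chosen for illustration; they are not taken from the paper's experiments.

```python
import numpy as np

def svt(Y, lam):
    """Singular value thresholding: keep components with sigma_i >= lam."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s >= lam
    return (U[:, keep] * s[keep]) @ Vt[keep]

rng = np.random.default_rng(1)
n, d, m, sigma = 200, 2, 150, 0.5                  # illustrative sizes
A = rng.standard_normal((n, d))
X_star = rng.standard_normal((d, m))
Y_star = A[rng.permutation(n)] @ X_star            # Pi* A X*, rank at most d
Y = Y_star + sigma * rng.standard_normal((n, m))

lam = 1.1 * sigma * (np.sqrt(n) + np.sqrt(m))      # threshold from Theorem 2
Y_hat = svt(Y, lam)

# Normalized prediction errors of the denoised and the raw observations.
err_svt = np.linalg.norm(Y_hat - Y_star) ** 2 / (n * m)
err_raw = np.linalg.norm(Y - Y_star) ** 2 / (n * m)
```

Note that `svt` never uses $A$: it treats $\Pi^{*}A$ as unknown and relies only on the low rank of $Y^{*}$, which matches the intuition given above for why it may be statistically sub-optimal.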

A potential concern is that the SVT estimator is required to know the noise variance $\sigma^{2}$ . This issue can be taken care of via the square-root LASSO “trick” [BCW11], which ensures a self-normalization that obviates the necessity for a noise-dependent threshold level. In particular, consider the estimate

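$$\widehat{Y}_{\sf sr}(\lambda)=\arg\min_{Y^{\prime}}\ |\!|\!|Y-Y^{\prime}|\!|\!|_{\mbox{\tiny{F}}}+\lambda|\!|\!|Y^{\prime}|\!|\!|_{\mbox{\tiny{nuc}}}.\tag{7}$$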

Using a choice of $\lambda$ that no longer depends on $\sigma$ , we have the following guarantee:

Theorem 3.

If $\operatorname{{\sf rank}}(A)\left(\frac{1}{n}+\frac{1}{m}\right)\leq 1/20$ , then for any choice of parameters $\Pi^{*}$ and $X^{*}$ , the square-root LASSO estimate (7) with $\lambda=2.1\left(\frac{1}{\sqrt{n}}+\frac{1}{\sqrt{m}}\right)$ satisfies

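$$\frac{1}{nm}|\!|\!|\widehat{Y}_{\sf sr}(\lambda)-\Pi^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$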

with probability greater than $1-2e^{-cnm}$ .

We prove Theorem 3 in Section 3.3 for completeness. However, it should be noted that the square-root LASSO has been analyzed for matrix completion problems [Klo14], and our proof follows similar lines for our different observation model. The condition $\operatorname{{\sf rank}}(A)\left(\frac{1}{n}+\frac{1}{m}\right)\leq 1/20$ does not significantly affect the claim, since our bounds no longer guarantee consistency of the estimate $\widehat{Y}_{\mathsf{sr}}(\lambda)$ when this condition is violated.

While the optimization problem (7) can be solved efficiently, there may be cases when the noise is (sub)-Gaussian of known variance for which the SVT estimate can be computed more quickly. Hence, the SVT estimator is usually preferred in cases where the noise statistics are known.

2.3 Exact algorithm for the noiseless case

For the noiseless model, the only efficient algorithm known up to now is for the special case $d=m=1$ , as presented in our past work [PWC16]. It turns out that this algorithm has a natural generalization to higher dimensional problems, at least when certain conditions on the input matrices $(A,Y)$ are satisfied. The higher dimensional generalization requires analyzing certain spectral properties of the input matrices.

In order to state the theorem, we require a few definitions. Given a matrix $M\in\mathbb{R}^{n\times d}$, consider its reduced singular value decomposition $M=U_{M}\Sigma_{M}V_{M}^{\top}$, where $U_{M}$ is a matrix of its left singular vectors. The (left) leverage scores of the matrix $M$ are given by the squared $\ell_{2}$-norms of the rows of the matrix $U_{M}$; in analytical terms, we can express them as the $n$-dimensional vector $\ell(M)=\operatorname{diag}(U_{M}U_{M}^{\top})$, where the operator $\operatorname{diag}$ extracts the diagonal of a square matrix. With this notation, the LevSort algorithm performs the following three steps on the input pair $(Y,A)$:

(i) Compute the leverage scores $\ell(Y)$ and $\ell(A)$.

(ii) Find a permutation $\widehat{\Pi}_{{\sf lev}}\in\arg\min_{\Pi\in\mathcal{P}_{n}}\|\ell(Y)-\Pi\,\ell(A)\|_{2}^{2}$.

(iii) Return the matrix $\widehat{X}_{{\sf lev}}=\big(\widehat{\Pi}_{{\sf lev}}A\big)^{\dagger}Y$, where $M^{\dagger}$ denotes the Moore-Penrose pseudoinverse of a matrix $M$.

Note that this algorithm runs in polynomial time, since it involves only spectral computations and a matching step that can be computed in time $O(n\log n)$ . As we demonstrate in the proof, step (ii) for the noiseless model actually returns a permutation matrix $\widehat{\Pi}_{{\sf lev}}$ such that $\ell(Y)=\widehat{\Pi}_{{\sf lev}}\ell(A)$ .
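The three LevSort steps above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function names, the relative tolerance used to decide the numerical rank, and the test instance are assumptions. The matching in step (ii) is done by sorting both score vectors, which minimizes the squared distance between them.

```python
import numpy as np

def leverage_scores(M, tol=1e-10):
    """diag(U U^T), with U the left singular vectors of the reduced SVD of M."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))        # numerical rank (tol is an assumption)
    return np.sum(U[:, :r] ** 2, axis=1)

def levsort(Y, A):
    """Return (perm, X_hat); perm[i] is the row of A matched to row i of Y."""
    ly, la = leverage_scores(Y), leverage_scores(A)      # step (i)
    perm = np.empty(len(la), dtype=int)
    perm[np.argsort(ly)] = np.argsort(la)                # step (ii): sort and match
    X_hat = np.linalg.pinv(A[perm]) @ Y                  # step (iii)
    return perm, X_hat

# Noiseless instance with rank(A) = rank(X*) = 2 (illustrative sizes).
rng = np.random.default_rng(3)
n, d, m = 8, 2, 3
A = rng.standard_normal((n, d))
X_star = rng.standard_normal((d, m))
true_perm = rng.permutation(n)
Y = A[true_perm] @ X_star

perm_hat, X_hat = levsort(Y, A)
```

Truncating to the numerical rank matters here: $Y$ has $m=3$ columns but rank $2$, so keeping a third, numerically meaningless singular vector would corrupt $\ell(Y)$.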

Theorem 4.

Consider an instantiation of the noiseless model with $\operatorname{{\sf rank}}(A)\leq\operatorname{{\sf rank}}(X^{*})$, and such that $\ell(A)$ and $\ell(Y)$ both have all distinct entries. Then the LevSort algorithm recovers the parameters $(\Pi^{*},X^{*})$ exactly.

The LevSort algorithm is a generalization of our own algorithm [PWC16] to the matrix setting. However, instead of a simple sorting algorithm, we now require an additional spectral component. While showing the necessity of the condition $\operatorname{{\sf rank}}(A)\leq\operatorname{{\sf rank}}(X^{*})$ is still open, an efficient algorithm that does not impose any conditions is unlikely to exist due to the general problem being NP-hard [PWC16]. Note that the condition includes as a special case all problems in which the matrices $A$ and $X^{*}$ are full rank, with $d\leq m$ .

In particular, the pose and correspondence estimation problem for 2D point clouds satisfies the conditions of Theorem 4 under some natural assumptions. We have $d=m=2$ for all such problems, and $\operatorname{{\sf rank}}(X^{*})=2$ unless the linear transformation is degenerate. Furthermore, unless the keypoints are generated adversarially, the leverage scores of the matrix $A$ and the rows of $Y$ are distinct. Thus, assuming that the noiseless version of model (1) exactly describes the keypoints detected in the two images (which is an idealization that may not be true in real data), we are guaranteed to find both the pose and the correspondence exactly.

In Figure 2, we demonstrate the guarantee of Theorem 4 on two image correspondence tasks when the keypoints detected in the two images are identical and the transformation between coordinates is linear.

2.4 Extensions to outliers

The results of Sections 2.1 and 2.2 also hold in a somewhat more general setting, where the set of perturbations to the rows of the matrix $A$ is allowed to be larger than just the set of permutation matrices $\mathcal{P}_{n}$. In particular, defining the set of “clustering matrices” $\mathcal{C}_{n}$ as

$$\mathcal{C}_{n}=\left\{D\in\{0,1\}^{n\times n}\mid D\mathbf{1}=\mathbf{1}\right\},$$

we consider an observation model of the form

$$Y=D^{*}AX^{*}+W,\tag{8}$$

where the matrices $A$, $X^{*}$, and $W$ are as before, and $D^{*}\in\mathcal{C}_{n}$ now represents a clustering matrix. Such a clustering condition ensures stochasticity of the matrix $D^{*}$ (not double stochasticity, as in the permutation model), and corresponds to the case where multiple responses may come from the same covariate, and some of the data may be permuted. Such a model is likely to better fit data from image correspondence problems when the keypoints detected in the two images are quite different. Also, such a formulation is loosely related to the $k$-means clustering problem with Gaussian data [ABC+15].

As it turns out, Theorems 1, 2 and 3 also hold for this model, with minor modifications to the proofs. Defining the analogous MLE for this model as

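$$(\widehat{D}_{\sf ML},\widehat{X}_{\sf ML})=\arg\min_{D\in\mathcal{C}_{n},\,X\in\mathbb{R}^{d\times m}}|\!|\!|Y-DAX|\!|\!|_{\mbox{\tiny{F}}}^{2},$$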

we have the following theorem.

Theorem 5.

(a) For any matrix $A$, and for all parameters $X^{*}\in\mathbb{R}^{d\times m}$ and $D^{*}\in\mathcal{C}_{n}$, we have

[TABLE]

with probability greater than $1-e^{-c(n\log n+m\operatorname{{\sf rank}}(A))}$.

(b) For any choice of parameters $D^{*}$ and $X^{*}$, the SVT estimate with $\lambda=1.1\sigma(\sqrt{n}+\sqrt{m})$ satisfies

$$\frac{1}{nm}|\!|\!|T_{\lambda}(Y)-D^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

with probability greater than $1-e^{-cnm}$.

(c) For any choice of parameters $D^{*}$ and $X^{*}$, the square-root LASSO estimate (7) with $\lambda=2.1\left(\frac{1}{\sqrt{n}}+\frac{1}{\sqrt{m}}\right)$ satisfies

$$\frac{1}{nm}|\!|\!|\widehat{Y}_{\sf sr}(\lambda)-D^{*}AX^{*}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq c_{1}\sigma^{2}\operatorname{\sf rank}(A)\left(\frac{1}{n}+\frac{1}{m}\right)$$

with probability greater than $1-2e^{-cnm}$.

Clearly, the lower bounds (5b) and (6b) hold immediately for the model (8) as a result of the inclusion $\mathcal{P}_{n}\subset\mathcal{C}_{n}$ .
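The inclusion $\mathcal{P}_{n}\subset\mathcal{C}_{n}$ is easy to check numerically. The sketch below builds a clustering matrix from a row-assignment vector and tests membership in both sets; the helper names and the assignment-vector encoding are assumptions introduced for illustration.

```python
import numpy as np

def is_clustering_matrix(D):
    """True if D is a square 0/1 matrix whose rows each sum to one (D 1 = 1)."""
    D = np.asarray(D)
    return bool(
        D.ndim == 2
        and D.shape[0] == D.shape[1]
        and np.isin(D, (0, 1)).all()
        and (D.sum(axis=1) == 1).all()
    )

def is_permutation_matrix(D):
    """True if D is a clustering matrix whose columns also each sum to one."""
    return is_clustering_matrix(D) and bool((np.asarray(D).sum(axis=0) == 1).all())

rng = np.random.default_rng(2)
n = 5
# Row i of D A is row assign[i] of A; repeated indices model several
# responses generated by the same covariate.
assign = rng.integers(0, n, size=n)
D = np.eye(n, dtype=int)[assign]
P = np.eye(n, dtype=int)[rng.permutation(n)]
```

A clustering matrix only constrains row sums, so some rows of $A$ may be used several times and others not at all, which is how the model accommodates outliers.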

3 Proofs

This section contains proofs of all our main results. We use $C,c,c^{\prime}$ to denote absolute constants that may change from line to line. We let $\sigma_{i}(M)$ denote the $i$ th largest singular value of a matrix $M$ .

3.1 Proof of Theorem 1

We split the proof into two natural parts, corresponding to the upper and lower bounds, respectively. The upper bound boils down to analyzing the Gaussian width [Pis99] of a certain set, which we obtain via Dudley’s entropy integral [Dud67] and bounds on the metric entropy of the observation space. The lower bound is obtained via a packing construction and an application of Fano’s inequality.

3.1.1 Proof of upper bound

Writing $Y^{*}=\Pi^{*}AX^{*}$ and $\widehat{Y}=\widehat{\Pi}_{{\sf ML}}A\widehat{X}_{{\sf ML}}$ , we have by the optimality of $\widehat{Y}$ for problem (4) that $|\!|\!|Y-\widehat{Y}|\!|\!|_{{\mbox{\tiny{F}}}}^{2}\leq|\!|\!|Y-Y^{*}|\!|\!|_{{\mbox{\tiny{F}}}}^{2}$ , from which it follows that the error matrix $\widehat{\Delta}=\widehat{Y}-Y^{*}$ satisfies the following basic inequality:

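$$\frac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}\leq\langle\!\langle\widehat{\Delta},\;W\rangle\!\rangle,\tag{9}$$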

where $\langle\!\langle{A},\;{B}\rangle\!\rangle$ denotes the trace inner product between two matrices $A$ and $B$ . We prove inequality (5a) by proving the following claims.

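$$\Pr\left\{\frac{|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}\geq 8\sigma^{2}\right\}\leq e^{-\frac{nm}{8}},\quad\text{and}\tag{10a}$$

$$\Pr\left\{\frac{|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}^{2}}{nm}\geq c_{2}\sigma^{2}\left(\frac{d}{n}+\frac{\log n}{m}\right)\right\}\leq e^{-c(n\log n+m\operatorname{\sf rank}(A))}.\tag{10b}$$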

Proof of inequality (10a):

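Applying the Cauchy-Schwarz inequality to the RHS of inequality (9) yields

$$\frac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{\mbox{\tiny{F}}}\leq|\!|\!|W|\!|\!|_{\mbox{\tiny{F}}}.\tag{11}$$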

Squaring both sides of inequality (11) and using standard sub-exponential tail bounds [Wai15] yields inequality (10a). ∎
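For completeness, the chain of steps behind inequality (10a) can be written out as below. This is a sketch under one assumption: that the sub-exponential bound being invoked is the standard one-sided $\chi^{2}$ tail $\Pr\{\chi^{2}_{k}/k\geq 1+t\}\leq e^{-kt^{2}/8}$ for $t\in[0,1]$, applied with $k=nm$ and $t=1$. The source only cites [Wai15] for this step.

```latex
\begin{align*}
|\!|\!|Y-\widehat{Y}|\!|\!|_{F}^{2}\leq|\!|\!|Y-Y^{*}|\!|\!|_{F}^{2}
  &\iff |\!|\!|W-\widehat{\Delta}|\!|\!|_{F}^{2}\leq|\!|\!|W|\!|\!|_{F}^{2}
  && (Y=Y^{*}+W,\ \widehat{\Delta}=\widehat{Y}-Y^{*}) \\
  &\iff \tfrac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{F}^{2}
        \leq\langle\!\langle\widehat{\Delta},W\rangle\!\rangle
  && \text{(expand the square)} \\
  &\implies \tfrac{1}{2}|\!|\!|\widehat{\Delta}|\!|\!|_{F}^{2}
        \leq|\!|\!|\widehat{\Delta}|\!|\!|_{F}\,|\!|\!|W|\!|\!|_{F}
  && \text{(Cauchy-Schwarz)} \\
  &\implies |\!|\!|\widehat{\Delta}|\!|\!|_{F}^{2}\leq 4\,|\!|\!|W|\!|\!|_{F}^{2}. \\[4pt]
\Pr\left\{\frac{|\!|\!|\widehat{\Delta}|\!|\!|_{F}^{2}}{nm}\geq 8\sigma^{2}\right\}
  &\leq \Pr\left\{\frac{|\!|\!|W|\!|\!|_{F}^{2}}{\sigma^{2}nm}\geq 2\right\}
  \leq e^{-nm/8},
  && |\!|\!|W|\!|\!|_{F}^{2}/\sigma^{2}\sim\chi^{2}_{nm}.
\end{align*}
```

This bound holds for every design matrix $A$ and does not decay with $n$ or $m$; it is the second claim, (10b), that captures the dependence on the problem dimensions.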

Proof of inequality (10b):

Without loss of generality, by rescaling as necessary, we may assume that the noise $W$ has standard normal entries ( $\sigma^{2}=1$ ). We use $\mathbb{U}_{m}(A)$ to denote the set of matrices whose $m$ columns lie in the range of $\Pi A$ for some permutation matrix $\Pi$ , i.e.,

[TABLE]

Also define the set

[TABLE]

as well as the function

[TABLE]

Before proceeding with the proof, we state the definition of the covering number of a set.

Definition 1 (Covering number).

A $\delta$-cover of a set $\mathbb{T}$ with respect to a metric $\rho$ is a set $\left\{\theta^{1},\theta^{2},\ldots,\theta^{N}\right\}\subset\mathbb{T}$ such that for each $\theta\in\mathbb{T}$, there exists some $i\in[N]$ such that $\rho(\theta,\theta^{i})\leq\delta$. The $\delta$-covering number $N(\delta,\mathbb{T},\rho)$ is the cardinality of the smallest $\delta$-cover.
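To make Definition 1 concrete, the following minimal sketch builds a $\delta$-cover of a finite point cloud greedily. The point cloud, the Euclidean metric, and the function name `greedy_cover` are illustrative assumptions, not objects from the paper. The greedy cover is not necessarily the smallest one, so its size only upper bounds the covering number $N(\delta,\mathbb{T},\rho)$.

```python
import math
import random


def greedy_cover(points, delta):
    """Pick centers from `points` so that every point lies within delta of a center.

    A point becomes a new center only if it is farther than delta from all
    centers chosen so far, so the result is a delta-cover of the input set.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers


# Hypothetical finite set T: 500 points drawn uniformly from the square [-1, 1]^2.
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]

for delta in (1.0, 0.5, 0.25):
    centers = greedy_cover(points, delta)
    # log(len(centers)) upper bounds the metric entropy of T at scale delta.
    print(delta, len(centers), round(math.log(len(centers)), 3))
```

The printed logarithm is the quantity that Lemma 1 controls for the set $\mathbb{U}^{{\sf diff}}_{m}(A)\cap\mathbb{B}_{F}(t)$.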

The logarithm of the covering number is referred to as the metric entropy of a set. The following lemma bounds the metric entropy of the set $\mathbb{U}^{{\sf diff}}_{m}(A)$. Let $\mathbb{B}_{F}(t)$ denote the Frobenius norm ball of radius $t$ centered at $0$.

Lemma 1.

The metric entropy of the set $\mathbb{U}^{{\sf diff}}_{m}(A)\cap\mathbb{B}_{F}(t)$ in the Frobenius norm metric is bounded as

[TABLE]

We prove the lemma at the end of the section, taking it as given for the proof of inequality (10b).

Proof of inequality (10b).

By definition of $Z(t)$, it is easy to see that we have

[TABLE]

One can also verify that the set $\mathbb{U}^{{\sf diff}}_{m}(A)$ is star-shaped (a set $S$ is said to be star-shaped if $t\in S$ implies that $\alpha t\in S$ for all $\alpha\in[0,1]$), and so the following critical inequality holds for some $\delta_{n,m}>0$:

[TABLE]

We are interested in the smallest (strictly) positive solution to inequality (14). Moreover, we would like to show that for every $t\geq\delta_{n,m}$ , we have $|\!|\!|\widehat{\Delta}|\!|\!|_{{\mbox{\tiny{F}}}}\leq c\sqrt{t\delta_{n,m}}$ with probability greater than $1-ce^{-c^{\prime}t\delta_{n,m}}$ .

Define the “bad” event

[TABLE]

Using the star-shaped property of $\mathbb{U}^{{\sf diff}}_{m}(A)$ , it follows by a rescaling argument that

[TABLE]

The entries of $W$ are i.i.d. standard Gaussian, and the function $W\mapsto Z(t)$ is convex and Lipschitz with parameter $t$ . Consequently, by Borell’s theorem (see, for example, Milman and Schechtman [MS86] for a simple proof), the following holds for all $t\geq\delta_{n,m}$ :

[TABLE]

By the definition of $\delta_{n,m}$ , we have $\mathbb{E}[Z(\delta_{n,m})]\leq\delta_{n,m}^{2}\leq\delta_{n,m}\sqrt{t\delta_{n,m}}$ for any $t\geq\delta_{n,m}$ , and consequently, for all $t\geq\delta_{n,m}$ , we have

[TABLE]

Now either $|\!|\!|\Delta|\!|\!|_{{\mbox{\tiny{F}}}}\leq\sqrt{t\delta_{n,m}}$, or we have $|\!|\!|\Delta|\!|\!|_{{\mbox{\tiny{F}}}}>\sqrt{t\delta_{n,m}}$. In the latter case, conditioning on the complementary event $\mathcal{E}^{c}_{t}$, our basic inequality implies that $\frac{1}{2}|\!|\!|\Delta|\!|\!|_{{\mbox{\tiny{F}}}}^{2}\leq 2|\!|\!|\Delta|\!|\!|_{{\mbox{\tiny{F}}}}\sqrt{t\delta_{n,m}}$. Consequently, we have

[TABLE]

Putting together the pieces yields

[TABLE]

with probability at least $1-2e^{-ct\delta_{n,m}}$ for every $t\geq\delta_{n,m}$ .

In order to determine a feasible $\delta_{n,m}$ satisfying the critical inequality (14), we need to bound the expectation $\mathbb{E}[Z(\delta_{n,m})]$ . We now use Dudley’s entropy integral [Dud67] to bound $\mathbb{E}[Z(t)]$ . In particular, for a universal constant $C$ , we have

[TABLE]

where in step ${\sf(i)}$ , we have made use of Lemma 1, and in step ${\sf(ii)}$ , we have used the change of variables $u=\delta/t$ . Now comparing with the critical inequality, we see that

[TABLE]

Putting together the pieces then proves claim (10b). ∎

It remains to prove Lemma 1.

Proof of Lemma 1.

We begin by finding the $\delta$ -covering number of

[TABLE]

Note that $\mathbb{U}_{m}^{\Pi}$ is isomorphic to $\operatorname{{\sf range}}(I_{m}\otimes\Pi A)$ , where $\otimes$ denotes the tensor product. Note that $\operatorname{{\sf range}}(I_{m}\otimes\Pi A)$ is a linear subspace of dimension $m\cdot\operatorname{{\sf rank}}(A)$ . Also, since the set $\operatorname{{\sf range}}(I_{m}\otimes\Pi A)\cap\mathbb{B}_{2}^{nm}(t)$ is an $m\cdot\operatorname{{\sf rank}}(A)$ -dimensional $\ell_{2}$ -ball of radius $t$ , we have by a volume ratio argument that

[TABLE]

By definition, we also have $\mathbb{U}_{m}(A)=\bigcup_{\Pi\in\mathcal{P}_{n}}\mathbb{U}_{m}^{\Pi}(A)$ , and so by the union bound, we have

[TABLE]

In order to complete the proof, we notice that

[TABLE]

since it is sufficient to use two $\delta/2$ -covers of the set $\mathbb{U}_{m}(A)\cap\mathbb{B}_{F}(t)$ in conjunction in order to obtain a $\delta$ -cover of the set $\mathbb{U}^{{\sf diff}}_{m}(A)\cap\mathbb{B}_{F}(t)$ . ∎
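The chain of bounds in this proof can be written out as follows. This is a sketch that uses the standard volume-ratio bound for a $k$-dimensional $\ell_{2}$-ball with $k=m\cdot\operatorname{{\sf rank}}(A)$, together with $|\mathcal{P}_{n}|=n!\leq n^{n}$; the numerical constants are assumptions and may differ from those in the displays of Lemma 1.

$$
\begin{aligned}
\log N\big(\delta,\;\mathbb{U}_{m}^{\Pi}(A)\cap\mathbb{B}_{F}(t),\;|\!|\!|\cdot|\!|\!|_{F}\big) &\leq m\cdot\operatorname{{\sf rank}}(A)\,\log\Big(1+\frac{2t}{\delta}\Big),\\
\log N\big(\delta,\;\mathbb{U}_{m}(A)\cap\mathbb{B}_{F}(t),\;|\!|\!|\cdot|\!|\!|_{F}\big) &\leq \log (n!)+m\cdot\operatorname{{\sf rank}}(A)\,\log\Big(1+\frac{2t}{\delta}\Big)\\
&\leq n\log n+m\cdot\operatorname{{\sf rank}}(A)\,\log\Big(1+\frac{2t}{\delta}\Big),\\
\log N\big(\delta,\;\mathbb{U}^{{\sf diff}}_{m}(A)\cap\mathbb{B}_{F}(t),\;|\!|\!|\cdot|\!|\!|_{F}\big) &\leq 2\log N\big(\delta/2,\;\mathbb{U}_{m}(A)\cap\mathbb{B}_{F}(t),\;|\!|\!|\cdot|\!|\!|_{F}\big)\\
&\leq 2n\log n+2m\cdot\operatorname{{\sf rank}}(A)\,\log\Big(1+\frac{4t}{\delta}\Big).
\end{aligned}
$$

The $n\log n$ term is the price of the unknown permutation, and the $m\cdot\operatorname{{\sf rank}}(A)$ term is the usual parametric dimension.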

3.1.2 Proof of lower bound

As alluded to before, the bound follows from a packing set construction and Fano’s inequality, which is a standard template used to prove minimax lower bounds. Suppose we wish to estimate a parameter $\theta$ over an indexed class of distributions $\mathcal{P}=\{\mathbb{P}_{\theta}\;\mid\;\theta\in\Theta\}$ in the square of a (pseudo-)metric $\rho$ . We refer to a subset of parameters $\{\theta^{1},\theta^{2},\ldots,\theta^{M}\}$ as a local $(\delta,\epsilon)$ -packing set if

[TABLE]

Note that this set is a $\delta$ -packing in the $\rho$ metric with the average KL-divergence bounded by $\epsilon$ . The following result is a straightforward consequence of Fano’s inequality:

Lemma 2 (Local packing Fano lower bound).

For any $(\delta,\epsilon)$ -packing set of cardinality $M$ , we have

[TABLE]

The remainder of the argument is directed to establishing the following two claims:

[TABLE]

It is easy to see that both claims together prove the lower bound (5b).

Proof of claim (18a):

This claim is a consequence of classical minimax bounds on linear regression. Since we are operating in the matrix setting, we include the proof for completeness.

The proof involves the construction of a packing set $\{\Pi AX_{i}\}_{i=1}^{M}$ such that for all $i\neq j\in[M]$ , we have $\frac{|\!|\!|\Pi AX_{i}|\!|\!|_{{\mbox{\tiny{F}}}}}{\sqrt{nm}}\leq 4\delta$ and $\frac{|\!|\!|\Pi AX_{i}-\Pi AX_{j}|\!|\!|_{{\mbox{\tiny{F}}}}}{\sqrt{nm}}\geq\delta$ . Since we are effectively packing the space $\frac{1}{\sqrt{nm}}\operatorname{{\sf range}}(I_{m}\otimes\Pi A)$ , standard results show that there exists such a packing of this space with $\log M\geq\operatorname{{\sf rank}}(I_{m}\otimes\Pi A)\log 2$ .

Also note that with the underlying parameter $X_{i}$ , our observations have the distribution $\mathbb{P}_{i}=\mathcal{N}(\Pi AX_{i},\sigma^{2}I_{nm})$ . Hence, the KL divergence between two observations $i$ and $j$ is simply

[TABLE]

Substituting this into the bound of Lemma 2 with $\rho(\theta_{1},\theta_{2})=\|\theta_{1}-\theta_{2}\|_{F}$ , we have

[TABLE]

where we have again used $\mathcal{M}$ to denote the minimax rate of prediction.

Setting $\delta^{2}=c\frac{\sigma^{2}\operatorname{{\sf rank}}(A)}{n}$ completes the proof of claim (18a). Note that the proof of this claim did not require the assumption that $A\in\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ .
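The KL computation used in the proof of claim (18a) relies on the closed form for two Gaussians with a common covariance $\sigma^{2}I$, namely the squared distance between the means divided by $2\sigma^{2}$. The sketch below states that formula and checks it against a one-dimensional numerical integral. The function names and the test values are illustrative assumptions.

```python
import math


def gaussian_kl(mu_i, mu_j, sigma):
    """KL( N(mu_i, sigma^2 I) || N(mu_j, sigma^2 I) ) = ||mu_i - mu_j||^2 / (2 sigma^2).

    For matrix-valued means, pass the flattened entries; the squared Euclidean
    distance of the flattened vectors is the squared Frobenius distance.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    return sq_dist / (2.0 * sigma ** 2)


def kl_1d_quadrature(mu_i, mu_j, sigma, half_width=12.0, steps=20001):
    """Riemann-sum approximation of the same KL divergence in one dimension."""
    lo = min(mu_i, mu_j) - half_width * sigma
    hi = max(mu_i, mu_j) + half_width * sigma
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        x = lo + k * h
        log_p = -0.5 * ((x - mu_i) / sigma) ** 2
        log_q = -0.5 * ((x - mu_j) / sigma) ** 2
        p = math.exp(log_p) / (sigma * math.sqrt(2.0 * math.pi))
        # The normalizing constants of p and q coincide, so they cancel in the log ratio.
        total += p * (log_p - log_q) * h
    return total


closed_form = gaussian_kl([0.3], [1.1], sigma=0.7)
numeric = kl_1d_quadrature(0.3, 1.1, sigma=0.7)
print(closed_form, numeric)
```

Because the divergence depends only on the distance between means, a packing whose elements are close in Frobenius norm automatically has small pairwise KL divergence, which is what the local packing argument exploits.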

Proof of claim (18b)

For ease of exposition, we first prove claim (18b) for matrices in a smaller class than $\mathcal{A}(C_{1},C_{2}/\sqrt{n})$. We let $\mathbf{1}^{p}_{n}$ denote the $n$-dimensional vector having $1$ in its first $p$ coordinates and $0$ in the remaining coordinates.

Now consider the class of matrices that have $\mathbf{1}^{p}_{n}$ in their range. By multiplying with $\delta$ and stacking $m$ of these vectors up as columns, we have a matrix $\widetilde{Y}^{1}\in\mathbb{R}^{n\times m}$ whose first $p$ rows are identically $\delta$ and the rest are identically zero. Define the Hamming distance between two binary vectors $\mathsf{d}_{\mathsf{H}}(u,v)=\#\{i:u_{i}\neq v_{i}\}.$ We require the following lemma.

Lemma 3.

There exists a set of binary $n$ -vectors $\{v_{i}\}_{i=1}^{M}$ , each of Hamming weight $p$ and satisfying $\mathsf{d}_{\mathsf{H}}(v_{i},v_{j})\geq h$ , having cardinality $M=\frac{\binom{n}{p}}{\sum_{i=1}^{\lfloor\frac{h-1}{2}\rfloor}\binom{n-p}{i}\binom{p}{i}}.$

The lemma is proved at the end of this section.

Proof of claim (18b)

Applying Lemma 3 and a rescaling argument, we see that there is a packing set $\{\Pi_{i}\widetilde{Y}^{1}\}_{i=1}^{M}$ such that

[TABLE]

Fixing some constant $\gamma\in(0,1)$ and choosing $p=\gamma n$ and $h=\frac{n}{2}\min\left\{\gamma/2,(1-\gamma)/2\right\}$ , it can be verified that we obtain a packing set of size $M\geq e^{\gamma\log(1/\gamma)n}$ . We now have observation $i$ distributed as $\mathbb{P}_{i}=\mathcal{N}(\Pi_{i}\widetilde{Y}^{1},\sigma^{2}I_{nm})$ , and so

[TABLE]

Finally, substituting into the Fano bound of Lemma 2 yields

[TABLE]

Setting $\delta^{2}=c(\gamma)\frac{\sigma^{2}}{m}$ for a constant $c(\gamma)$ depending only on $\gamma$ completes the proof provided the vector $\mathbf{1}^{p}_{n}\in{\sf range}(A)$ for $p=\gamma n$ with $\gamma\in(0,1)$ .

It remains to extend the proof to matrices in the class $\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ , and to prove Lemma 3.

By definition, if $A\in\mathcal{A}(C_{1},C_{2}/\sqrt{n})$ , then there exists a vector $a\in\operatorname{{\sf range}}(A)\cap\mathbb{B}_{2}(1)$ such that $a^{s}_{C_{1}n}\geq a^{s}_{C_{1}n+1}+C_{2}/\sqrt{n}$ . We may assume that $\|a\|_{2}=1$ by a rescaling argument, and also that $a=a^{s}$ . By definition, we have

[TABLE]

It can also be verified that since $\|a\|_{2}=1$ , we must have $C_{2}\leq 2$ . For the rest of the proof, we assume for simplicity of exposition that $C_{1}n$ is an integer. Fixing the value $\epsilon=\frac{n}{2}\min(C_{1},1-C_{1})$ , consider the $\epsilon$ -packing generated by permutations $\left\{\Pi_{i}\right\}_{i=1}^{M}$ of the vector $\mathbf{1}^{C_{1}n}_{n}$ , given by Lemma 3 by taking $v_{i}=\Pi_{i}\mathbf{1}^{C_{1}n}_{n}$ . Using these permutations, we observe that

[TABLE]

where $c$ depends on the constants $(C_{1},C_{2})$ , and we have used condition (20) along with the fact that $\mathsf{d}_{\mathsf{H}}(v_{i},v_{j})\geq\epsilon$ .

Following similar steps to before then proves the claim for all matrices $A\in\mathcal{A}(C_{1},C_{2}/\sqrt{n})$.

It remains to prove Lemma 3.

Proof of Lemma 3

The proof follows by a volume ratio argument that underlies the proof of the Gilbert-Varshamov bound. In particular, the number of permuted vectors of $\mathbf{1}_{n}^{p}$ that are within a Hamming distance $h-1$ of $\mathbf{1}_{n}^{p}$ is given by $\Delta=\sum_{i=1}^{\lfloor\frac{h-1}{2}\rfloor}\binom{n-p}{i}\binom{p}{p-i}$ . Now form a graph with all $\binom{n}{p}$ permuted vectors of $\mathbf{1}_{n}^{p}$ as vertices and connect two vertices if the corresponding vectors have Hamming distance less than $h$ . Then such a graph has uniform degree $\Delta$ and therefore contains an independent set of size $\frac{\binom{n}{p}}{\Delta}$ . ∎
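The graph argument above can be checked on a small instance. The sketch below enumerates all binary $n$-vectors of Hamming weight $p$, greedily selects vectors that are pairwise at Hamming distance at least $h$, and compares the result with the counting bound. The values $n=8$, $p=3$, $h=4$ and the function names are illustrative assumptions. Note that a greedy selection is only guaranteed to reach $\binom{n}{p}/(\Delta+1)$, so the code compares against that slightly smaller quantity.

```python
from itertools import combinations
from math import comb


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def degree(n, p, h):
    """Number of weight-p vectors at Hamming distance 2i <= h - 1 from a fixed one (i >= 1)."""
    return sum(comb(n - p, i) * comb(p, i) for i in range(1, (h - 1) // 2 + 1))


def greedy_constant_weight_code(n, p, h):
    """Greedily build a set of weight-p binary vectors with pairwise distance >= h."""
    code = []
    for support in combinations(range(n), p):
        v = tuple(1 if i in support else 0 for i in range(n))
        if all(hamming(v, c) >= h for c in code):
            code.append(v)
    return code


n, p, h = 8, 3, 4
code = greedy_constant_weight_code(n, p, h)
print("greedy packing size:", len(code))
print("counting bound:", comb(n, p) / (degree(n, p, h) + 1))
```

Two weight-$p$ vectors always differ in an even number of coordinates, which is why the sum defining the degree runs over $i\leq\lfloor\frac{h-1}{2}\rfloor$.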

3.2 Proof of Theorem 2

Again, we divide our proof into two parts, corresponding to the upper and lower bounds respectively.

3.2.1 Proof of upper bound

For this proof, we use the shorthand $Y^{*}=\Pi^{*}AX^{*}$. Also fix $\delta=0.1$, and let $s$ be the number of singular values of $Y^{*}$ greater than $\frac{\delta}{1+\delta}\lambda$. Also, let $Y^{*}_{s}$ denote the matrix formed by truncating $Y^{*}$ to its top $s$ singular values. By the triangle inequality, we have

[TABLE]

Now note that by standard results in random matrix theory (see, for example, [Wai15, Theorem 6.1]), we have $\lambda\geq(1+\delta)|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}$ with probability greater than $1-e^{-\frac{\delta^{2}}{2}n(\sqrt{n}+\sqrt{m})^{2}}$ . We condition on this event for the rest of the proof.

Consequently, for $j\geq s+1$ , we have

[TABLE]

and so $\operatorname{{\sf rank}}(T_{\lambda}(Y))\leq s$ . Additionally, we have

[TABLE]

Putting together the pieces yields

[TABLE]

a bound that holds with probability greater than $1-e^{-cnm}$ . In order to complete the proof, we note that $\operatorname{{\sf rank}}(Y^{*})\leq\operatorname{{\sf rank}}(A)$ . ∎
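The rank claim in this proof can be illustrated numerically. The sketch below assumes that $T_{\lambda}$ is the hard-thresholding operator that zeroes out singular values not exceeding $\lambda$, as suggested by the remark in Section 3.2.2 that the singular values of a thresholded matrix are either greater than $\lambda$ or zero. The dimensions, the signal scale, and the function name `svt_hard` are illustrative assumptions.

```python
import numpy as np


def svt_hard(Y, lam):
    """Hard singular value thresholding: keep only singular values strictly above lam."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s > lam
    return (U[:, keep] * s[keep]) @ Vt[keep, :]


rng = np.random.default_rng(0)
n, m, d, sigma = 60, 40, 3, 1.0

A = rng.standard_normal((n, d))
X_star = 5.0 * rng.standard_normal((d, m))
perm = rng.permutation(n)
Y_star = (A @ X_star)[perm]              # rows of A X* shuffled, i.e. Pi* A X*
W = sigma * rng.standard_normal((n, m))  # i.i.d. Gaussian noise
Y = Y_star + W

lam = 1.1 * sigma * (np.sqrt(n) + np.sqrt(m))
Y_hat = svt_hard(Y, lam)

err = np.linalg.norm(Y_hat - Y_star, "fro") ** 2 / (n * m)
print("operator norm of W:", np.linalg.norm(W, 2), "threshold:", lam)
print("rank of estimate:", np.linalg.matrix_rank(Y_hat), "normalized error:", err)
```

The estimator never uses the permutation: it only exploits the fact that $Y^{*}$ has rank at most $\operatorname{{\sf rank}}(A)$, which is why it runs in polynomial time.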

3.2.2 Proof of lower bound

We split our analysis into two separate cases.

Case 1:

First suppose that $\lambda\leq\frac{\sigma}{3}(\sqrt{n}+\sqrt{m})$ . Consider any matrix $Y^{*}=\Pi^{*}AX^{*}$ , and $Y=Y^{*}+W$ . By definition of the thresholding operation, we have

[TABLE]

The triangle inequality yields

[TABLE]

Now with probability greater than $1-e^{-cnm}$ , we have $|\!|\!|W|\!|\!|_{{\mbox{\tiny{F}}}}^{2}\geq\sigma^{2}\frac{nm}{2}$ , so that conditioned on this event, we have

[TABLE]

which completes the proof.

Case 2:

We now suppose that $\lambda>\frac{\sigma}{3}(\sqrt{n}+\sqrt{m})$. Let the matrix $A$ have the (reduced) singular value decomposition $A=U_{A}\Sigma_{A}V_{A}^{\top}$, and introduce the shorthand $r:\,=\operatorname{{\sf rank}}(A)$. Form the diagonal matrix $L=\frac{\sqrt{n}+\sqrt{m}}{6}I_{r}$. Now let $\Pi_{0}=I_{n}$, and consider the parameter matrix $X_{0}=V_{A}\Sigma_{A}^{-1}LV^{\top}$, where $V$ is an $m\times\operatorname{{\sf rank}}(A)$ matrix with orthonormal columns. Note that such a choice exists when $\operatorname{{\sf rank}}(A)\leq m$.

We now have

[TABLE]

For two matrices $A,B\in\mathbb{R}^{n\times m}$ with $k=\min\{n,m\}$ , it can be verified that

[TABLE]

By the definition of the thresholding operation, the top singular values of the matrix $T_{\lambda}(U_{A}LV^{\top}+W)$ are all either greater than $\lambda$, or equal to $0$. Hence, we have

[TABLE]

where the last step follows since $\lambda>\frac{\sigma}{3}(\sqrt{n}+\sqrt{m})$ , which completes the proof. ∎

3.3 Proof of Theorem 3

It is again helpful to write the observation model in the form $Y=Y^{*}+W$ , where $Y^{*}=\Pi^{*}AX^{*}$ represents the underlying matrix we are trying to predict. Let us denote the choice of $\lambda$ in the statement of Theorem 3 by $\lambda_{0}=2.1\>\frac{\sqrt{n}+\sqrt{m}}{\sqrt{nm}}$ . We use the shorthand $R(M)=|\!|\!|Y-M|\!|\!|_{{\mbox{\tiny{F}}}}$ , and $\Delta=Y^{*}-\widehat{Y}_{{\sf sr}}(\lambda_{0})$ . Let $P_{M}$ and $P^{\perp}_{M}$ denote, respectively, the projection matrices onto the rowspace of the matrix $M$ and its orthogonal complement.

We require the following auxiliary lemmas for our proof:

Lemma 4.

We have

[TABLE]

Lemma 5.

If $\lambda_{0}\geq 2\frac{|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}}{|\!|\!|W|\!|\!|_{{\mbox{\tiny{F}}}}}$ , we have

[TABLE]

We are now ready to prove Theorem 3.

Proof of Theorem 3.

First, note that by standard results on concentration of $\chi^{2}$ -random variables and random matrices (see, for instance, Wainwright [Wai15]), we have

[TABLE]

Hence, we have

[TABLE]

For the rest of the proof, we condition on the event $\{\lambda_{0}\geq 2\frac{|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}}{|\!|\!|W|\!|\!|_{{\mbox{\tiny{F}}}}}\}$ .
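The event above involves only the ratio $|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}/|\!|\!|W|\!|\!|_{{\mbox{\tiny{F}}}}$, which does not change when the noise is rescaled. This is why $\lambda_{0}$ can be chosen without knowledge of $\sigma$. The sketch below illustrates the scale invariance on one simulated noise matrix; the dimensions and the noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 100

# One unit-variance Gaussian noise pattern, rescaled to different noise levels below.
G = rng.standard_normal((n, m))

# The choice of lambda from Theorem 3; it involves n and m only, not sigma.
lam0 = 2.1 * (np.sqrt(n) + np.sqrt(m)) / np.sqrt(n * m)


def noise_ratio(W):
    """Return 2 * ||W||_op / ||W||_F, the quantity that lambda_0 must dominate."""
    return 2.0 * np.linalg.norm(W, 2) / np.linalg.norm(W, "fro")


ratios = [noise_ratio(sigma * G) for sigma in (0.01, 1.0, 100.0)]
print("lambda_0:", lam0)
print("2 * ||W||_op / ||W||_F at three noise levels:", ratios)
```

Heuristically, the operator norm of $W$ concentrates near $\sigma(\sqrt{n}+\sqrt{m})$ and the Frobenius norm near $\sigma\sqrt{nm}$, so the ratio sits just below $\lambda_{0}$ with high probability.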

Now, by definition of the quantity $R(M)$ , we have

[TABLE]

Some simple algebra yields

[TABLE]

Now, from the definition of the estimate $\widehat{Y}_{{\sf sr}}(\lambda_{0})$ , we have

[TABLE]

Rearranging terms yields

[TABLE]

where step ${\sf(i)}$ follows from Lemma 4, and the fact that $\lambda>0$ .

Another rearrangement of inequality (22) yields

[TABLE]

where step ${\sf(ii)}$ follows from Lemma 4, and the fact that $|\!|\!|P_{Y^{*}}\Delta|\!|\!|_{{\tiny{\mbox{nuc}}}}>0$ . Thus, we have established the upper bound $(R(\widehat{Y}_{{\sf sr}}(\lambda_{0})))^{2}-(R(Y^{*}))^{2}\leq T_{1}\;T_{2}$ , where

[TABLE]

Expanding the product of the two terms yields

[TABLE]

where step ${\sf(iii)}$ follows from Lemma 5, since $\lambda_{0}^{2}|\!|\!|P^{\perp}_{Y^{*}}\Delta|\!|\!|_{{\tiny{\mbox{nuc}}}}^{2}-4\lambda_{0}^{2}|\!|\!|P^{\perp}_{Y^{*}}\Delta|\!|\!|_{{\tiny{\mbox{nuc}}}}\;|\!|\!|P_{Y^{*}}\Delta|\!|\!|_{{\tiny{\mbox{nuc}}}}\leq 0$ .

We also note that

[TABLE]

Combining with the fact that $\lambda_{0}$ satisfies the inequality $2|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}\leq\lambda_{0}R(Y^{*})$ , we find that

[TABLE]

where in step ${\sf(iv)}$, we have used the Cauchy-Schwarz inequality and the fact that projections are non-expansive to write

[TABLE]

Rearranging yields

[TABLE]

Squaring both sides, substituting the choice of $\lambda_{0}$ , and using the condition $\operatorname{{\sf rank}}(A)\left(\frac{1}{n}+\frac{1}{m}\right)\leq 1/20$ completes the proof. ∎

The only remaining detail is to prove Lemmas 4 and 5.

3.3.1 Proof of Lemma 4

We write

[TABLE]

Rearranging yields the claim. ∎

3.3.2 Proof of Lemma 5

Rearranging the Cauchy-Schwarz inequality for two matrices $A$ and $B$ yields

[TABLE]

Now setting $A=Y-\widehat{Y}_{{\sf sr}}(\lambda_{0})$ and $B=Y-Y^{*}$ , we have

[TABLE]

where step ${\sf(i)}$ follows from Hölder’s inequality and the choice of $\lambda_{0}\geq 2\frac{|\!|\!|W|\!|\!|_{{\tiny{\mbox{op}}}}}{|\!|\!|W|\!|\!|_{{\mbox{\tiny{F}}}}}$.

Combining this with the basic inequality (22) yields

[TABLE]

Finally, using Lemma 4, we have

[TABLE]

which completes the proof. ∎

3.4 Proof of Theorem 4

We write the (reduced) singular value decomposition of a matrix $M$ as $M=U_{M}\Sigma_{M}V_{M}^{\top}$ . We also adopt the shorthand $r_{M}=\operatorname{{\sf rank}}(M)$ for the rest of this proof. The LevSort algorithm clearly runs in polynomial time, since it involves a singular value decomposition and a sorting operation, both of which can be accomplished efficiently. Let us now verify the exactness guarantee.

Since the observation model (1) is noiseless and $r_{A}\leq r_{X^{*}}$ , we have $r_{Y}=r_{A}$ . Moreover, by definition of the observation model, we have

[TABLE]

Consequently, the unknown matrix $X^{*}$ can be written as

[TABLE]

with $U$ representing an unknown $r_{A}\times r_{A}$ unitary matrix (satisfying $U^{\top}U=UU^{\top}=I$ ). Substituting this representation of $X^{*}$ back into the noiseless observation model yields

[TABLE]

Now $\Sigma_{Y}V_{Y}^{\top}$ has a full-dimensional row-space, and so we have $U_{Y}=\Pi^{*}U_{A}U$ . We complete the proof by observing that

[TABLE]

so that we have the equivalence $\ell(Y)=\Pi^{*}\ell(A)$ as claimed. The uniqueness of the parameters $(\Pi^{*},X^{*})$ follows from the fact that the leverage score vectors $\ell(A)$ and $\ell(Y)$ have distinct entries. ∎
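The identity $\ell(Y)=\Pi^{*}\ell(A)$ suggests a simple recovery procedure in the noiseless case: compute both leverage score vectors, sort them, and match entries of equal rank. The sketch below is one way to implement that idea and is not claimed to be the paper's exact LevSort pseudocode, which is stated in Section 2.3. The function names, the dimensions, and the Gaussian choice of $A$ and $X^{*}$ are illustrative assumptions.

```python
import numpy as np


def leverage_scores(M, r):
    """Squared row norms of the top-r left singular vectors of M."""
    U = np.linalg.svd(M, full_matrices=False)[0][:, :r]
    return np.sum(U ** 2, axis=1)


def levsort_sketch(Y, A):
    """Recover (permutation, X) from noiseless Y by matching sorted leverage scores.

    Assumes the leverage scores of A are distinct and rank(A) <= rank(X*).
    perm_hat[i] is the row of A X that was observed as row i of Y.
    """
    r = np.linalg.matrix_rank(A)
    order_Y = np.argsort(leverage_scores(Y, r))
    order_A = np.argsort(leverage_scores(A, r))
    perm_hat = np.empty(len(order_Y), dtype=int)
    perm_hat[order_Y] = order_A          # k-th smallest score of Y matches k-th smallest of A
    unshuffled = np.empty_like(Y)
    unshuffled[perm_hat] = Y             # undo the row shuffle
    X_hat = np.linalg.lstsq(A, unshuffled, rcond=None)[0]
    return perm_hat, X_hat


rng = np.random.default_rng(2)
n, d, m = 12, 3, 5
A = rng.standard_normal((n, d))
X_star = rng.standard_normal((d, m))
perm = rng.permutation(n)
Y = (A @ X_star)[perm]                   # noiseless observations with shuffled rows

perm_hat, X_hat = levsort_sketch(Y, A)
print("permutation recovered:", np.array_equal(perm_hat, perm))
print("max error in X:", np.max(np.abs(X_hat - X_star)))
```

The cost is dominated by two singular value decompositions and two sorts, consistent with the polynomial running time noted at the start of this proof.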

3.5 Proof of Theorem 5

The proofs of Theorems 1, 2, and 3 apply to the model (8) with minor modifications. We briefly mention these modifications here, leaving the details to the reader.

Part (a) follows by mimicking the proof of Section 3.1.1 as is, with a small modification to the metric entropy of the observation space. In particular, the covering number of the observation space is now upper bounded by $n^{n}\cdot N(\delta,\mathbb{U}_{m}^{\Pi}(A)\cap\mathbb{B}_{F}(t),|\!|\!|\cdot|\!|\!|_{{\mbox{\tiny{F}}}})$ , and the rest of the proof follows as before.

Parts (b) and (c) follow by mimicking the proof of Sections 3.2.1 and 3.3, respectively, with the definition $Y^{*}=D^{*}AX^{*}$ . Note that the clustering observation model can only decrease the rank of $Y^{*}$ from before. ∎

4 Discussion

We conclude with a discussion of some possible future directions.

4.1 More general picture for regression problems

Multivariate linear regression is a specific case of the following problem with shuffled data $\{(a_{\pi(i)},y_{i})\}_{i=1}^{n}$ , with the covariates $a_{i}\in\mathbb{R}^{d}$ and responses $y_{i}\in\mathbb{R}^{m}$ related by the equation

[TABLE]

where $f$ represents a function from some parametric or non-parametric family $\mathcal{F}$ . The general behaviour of prediction error for problems of this form should be similar to that seen in our linear regression model, or the structured regression model of Flammarion et al. [FMR16]. In particular, provided the data $a_{i}$ is sufficiently diverse and the function class $\mathcal{F}$ is sufficiently expressive, the minimax rate of prediction for the permuted model should be given by the sum of two terms: the minimax rate of the unpermuted model (or equivalently, with a known permutation), and an additional constant/logarithmic term that accounts for the permutation.

4.2 Necessity of flatness condition and adaptivity

Our condition on the matrix $A$ is a convenient one for the application of the Gilbert-Varshamov type bound on distances between permuted binary vectors. However, this sufficient condition may be far from necessary – we instead require some permutation codes of real numbers.

Conversely, the upper bound (5a) can be stated by explicitly taking the structure of the matrix $A$ into account; this will require bounds on the metric entropy of the union of subspaces generated by permutations of the range space of $A$ .

Bibliography

[ABC+15] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, and R. Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191–200. ACM, 2015.
[BCW11] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[CCS10] J-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[Cha15] S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
[CLS15] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
[Dud67] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
[EBDG14] V. Emiya, A. Bonnefoy, L. Daudet, and R. Gribonval. Compressed sensing with unknown sensor permutation. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 1040–1044. IEEE, 2014.
[FMR16] N. Flammarion, C. Mao, and P. Rigollet. Optimal rates of statistical seriation. arXiv preprint arXiv:1607.02435, 2016.
