Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence
Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo D\'iaz, Lijun, Ding, Dmitriy Drusvyatskiy

TL;DR
This paper demonstrates that nonsmooth formulations for low-rank matrix recovery are better conditioned and enable faster, dimension-independent convergence of optimization algorithms, also offering robustness to outliers.
Contribution
It shows that nonsmooth penalty formulations avoid ill-conditioning in low-rank matrix recovery, leading to rapid convergence and robustness, unifying several key problems.
Findings
Nonsmooth formulations have better conditioning than smooth ones.
Standard algorithms converge rapidly and dimension-independently.
Nonsmooth methods are robust to outliers.
Abstract
The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA.…
| Problem | Measurement | Regime | |
|---|---|---|---|
| (sub-)Gaussian sensing | |||
| Quadratic sensing I | |||
| Quadratic sensing II | |||
| Bilinear sensing |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsPrincipal Components Analysis
Low-rank matrix recovery with composite optimization:
good conditioning and rapid convergence
Vasileios Charisopoulos Yudong Chen Damek Davis
Mateo Díaz Lijun Ding Dmitriy Drusvyatskiy School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/vc333/School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/yudong.chen/School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/dsd95/.CAM, Cornell University. Ithaca, NY 14850, USA; people.cam.cornell.edu/md825/School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/ld446/.Department of Mathematics, U. Washington, Seattle, WA 98195; www.math.washington.edu/$\scriptstyle\sim$ddrusv. Research of Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.
Abstract
The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.
Contents
-
5 General convergence guarantees for subgradient & prox-linear methods
-
6.2 The RIP and -outlier bounds: quadratic and bilinear sensing
1 Introduction
Recovering a low-rank matrix from noisy linear measurements has become an increasingly central task in data science. Important and well-studied examples include phase retrieval [55, 12, 42], blind deconvolution [1, 38, 41, 57], matrix completion [9, 21, 56], covariance matrix estimation [18, 40], and robust principal component analysis [15, 11]. Optimization-based approaches for low-rank matrix recovery naturally lead to nonconvex formulations, which are NP hard in general. To overcome this issue, in the last two decades researchers have developed convex relaxations that succeed with high probability under appropriate statistical assumptions. Convex techniques, however, have a well-documented limitation: the parameter space describing the relaxations is usually much larger than that of the target problem. Consequently, standard algorithms applied on convex relaxations may not scale well to the large problems. Consequently, there has been a renewed interest in directly optimizing nonconvex formulations with iterative methods within the original parameter space of the problem. Aside from a few notable exceptions on specific problems [33, 3, 32], most algorithms of this type proceed in two-stages. The first stage—initialization—yields a rough estimate of an optimal solution, often using spectral techniques. The second stage—local refinement—uses a local search algorithm that rapidly converges to an optimal solution, when initialized at the output of the initialization stage.
This work focuses on developing provable low-rank matrix recovery algorithms based on nonconvex problem formulations. We focus primarily on local refinement and describe a set of unifying sufficient conditions leading to rapid local convergence of iterative methods. In contrast to the current literature on the topic, which typically relies on smooth problem formulations and gradient-based methods, our primary focus is on nonsmooth formulations that exhibit sharp growth away from the solution set. Such formulations are well-known in the nonlinear programming community to be amenable to rapidly convergent local-search algorithms. Along the way, we will observe an apparent benefit of nonsmooth formulations over their smooth counterparts. All nonsmooth formulations analyzed in this paper are “well-conditioned,” resulting in fast “out-of-the-box” convergence guarantees. In contrast, standard smooth formulations for the same recovery tasks can be poorly conditioned, in the sense that classical convergence guarantees of nonlinear programming are overly pessimistic. Overcoming the poor conditioning typically requires nuanced problem and algorithmic specific analysis (e.g. [57, 42, 46, 17]), which nonsmooth formulations manage to avoid for the problems considered here.
Setting the stage, consider a rank matrix and a linear map from the space of matrices to the space of measurements. The goal of low-rank matrix recovery is to recover from the image vector , possibly corrupted by noise. Typical nonconvex approaches proceed by choosing some penalty function with which to measure the residual for a trial solution . Then, in the case that is symmetric and positive semidefinite, one may focus on the formulation
[TABLE]
or when is rectangular, one may instead use the formulation
[TABLE]
Here, is a convex set that incorporates prior knowledge about and is often used to enforce favorable structure on the decision variables. The penalty is chosen specifically to penalize measurement misfit and/or enforce structure on the residual errors.
Algorithms and conditioning for smooth formulations
Most widely-used penalties are smooth and convex. Indeed, the squared -norm is ubiquitous in this context. With such penalties, problems (1.1) and (1.2) are smooth and thus are amenable to gradient-based methods. The linear rate of convergence of gradient descent is governed by the “local condition number” of . Indeed, if the estimate, holds for all in a neighborhood of the solution set, then gradient descent converges to the solution set at the linear rate . It is known that for several widely-studied problems including phase retrieval, blind deconvolution, and matrix completion, the ratio scales inversely with the problem dimension. Consequently, generic nonlinear programming guarantees yield efficiency estimates that are far too pessimistic. Instead, near-dimension independent guarantees can be obtained by arguing that is well conditioned along the “relevant” directions or that is well-conditioned within a restricted region of space that the iterates never escape (e.g. [57, 42, 46]). Techniques of this type have been elegantly and successfully used over the past few years to obtain algorithms with near-optimal sample complexity. One byproduct of such techniques, however, is that the underlying arguments are finely tailored to each particular problem and algorithm at hand. We refer the reader to the recent surveys [20] for details.
Algorithms and conditioning for nonsmooth formulations
The goal of our work is to justify the following principle:
Statistical assumptions for common recovery problems guarantee that (1.1) and (1.2) are well-conditioned when is an appropriate nonsmooth convex penalty.
To explain what we mean by “good conditioning,” let us treat (1.1) and (1.2) within the broader convex composite problem class:
[TABLE]
where is a smooth map on the space of matrices and is a closed convex set. Indeed, in the symmetric and positive semidefinite case, we identify with matrices and define , while in the asymmetric case, we identify with pairs of matrices and define . Though compositional problems (1.3) have been well-studied in nonlinear programming [6, 7, 31], their computational promise in data science has only begun recently to emerge. For example, the papers [28, 22, 26] discuss stochastic and inexact algorithms on composite problems, while the papers [27, 24], [16], and [39] investigate applications to phase retrieval, blind deconvolution, and matrix sensing, respectively.
A number of algorithms are available for problems of the form (1.3), and hence for (1.1) and (1.2). Two most notable ones are the projected subgradient111Here, the subdifferential is formally obtained through the chain rule , where is the subdifferential in the sense of convex analysis. method [23, 34]
[TABLE]
and the prox-linear algorithm [6, 37, 25]
[TABLE]
Notice that each iteration of the subgradient method is relatively cheap, requiring access only to the subgradients of and the nearest-point projection onto . The prox-linear method in contrast requires solving a strongly convex problem in each iteration. That being said, the prox-linear method has much stronger convergence guarantees than the subgradient method, as we will review shortly.
The local convergence guarantees of both methods are straightforward to describe, and underlie what we mean by “good conditioning”. Define , and for any define the convex model . Suppose there exist constants satisfying the two properties:
- •
(approximation) for all ,
- •
(sharpness) for all .
The approximation and sharpness properties have intuitive meanings. The former says that the nonconvex function is well approximated by the convex model , with quality that degrades quadratically as deviates from . In particular, this property guarantees that the quadratically perturbed function is convex on . Yet another consequence of the approximation property is that the epigraph of admits a supporting concave quadratic with amplitute at each of its points. Sharpness, in turn, asserts that must grow at least linearly as moves away from the solution set. In other words, the function values should robustly distinguish between optimal and suboptimal solutions. In statistical contexts, one can interpret sharpness as strong identifiability of the statistical model. The three figures below illustrate the approximation and sharpness properties for idealized objectives in phase retrieval, blind deconvolution, and robust PCA problems.
Approximation and sharpness, taken together, guarantee rapid convergence of numerical methods when initialized within the tube:
[TABLE]
For common low-rank recovery problems, has an intuitive interpretation: it consists of those matrices that are within constant relative error of the solution. We note that standard spectral initialization techniques, in turn, can generate such matrices with nearly optimal sample complexity. We refer the reader to the survey [20], and references therein, for details.
Guiding strategy.
The following is the guiding algorithmic principle of this work:
When initialized at , the prox-linear algorithm converges quadratically to the solution set ; the subgradient method, in turn, converges linearly with a rate governed by ratio , where is the Lipschitz constant of on .222Both the parameters and must be properly chosen for these guarantees to take hold.
In light of this observation, our strategy can be succinctly summarized as follows. We will show that for a variety of low-rank recovery problems, the parameters (or variants) are dimension independent under standard statistical assumptions. Consequently, the formulations (1.1) and (1.2) are “well-conditioned”, and subgradient and prox-linear methods converge rapidly when initialized within constant relative error of the optimal solution.
Approximation and sharpness via the Restricted Isometry Property
We begin verifying our thesis by showing that the composite problems, (1.1) and (1.2), are well-conditioned under the following Restricted Isometry Property (RIP): there exists a norm and numerical constants so that
[TABLE]
for all matrices of rank at most . We argue that under RIP,
the nonsmooth norm is a natural penalty function to use.
Indeed, as we will show, the composite loss in the symmetric setting admits constants that depend only on the RIP parameters and the extremal singular values of :
[TABLE]
In particular, the initialization ratio scales as and the condition number scales as . Consequently, the rapid local convergence guarantees previously described immediately take-hold. The asymmetric setting is slightly more nuanced since the objective function is sharp only on bounded sets. Nonetheless, it can be analyzed in a similar way leading to analogous rapid convergence guarantees. Incidentally, we show that the prox-linear method converges rapidly without any modification; this is in contrast to smooth methods, which typically require incorporating an auxiliary regularization term into the objective (e.g. [57]). We note that similar results in the symmetric setting were independently obtained in the complimentary work [39], albeit with a looser estimate of ; the two treatments of the asymmetric setting are distinct, however.333The authors of [39] provide a bound on that scales with the Frobenius norm . We instead derive a sharper bound that scales as . As a byproduct, the linear rate of convergence for the subgradient method scales only with the condition number instead of .
After establishing basic properties of the composite loss, we turn our attention to verifying RIP in several concrete scenarios. We note that the seminal works [50, 13] showed that if arises from a Gaussian ensemble, then in the regime RIP holds with high probability for the scaled norm . More generally when is highly structured, RIP may be most naturally measured in a non-Euclidean norm. For example, RIP with respect to the scaled norm holds for phase retrieval [29, 27], blind deconvolution [16], and quadratic sensing [18]; in contrast, RIP relative to the scaled norm fails for all three problems. In particular, specializing our results to the aforementioned recovery tasks yields solution methodologies with best known sample and computational complexity guarantees. Notice that while one may “smooth-out” the norm by squaring it, we argue that it may be more natural to optimize the norm directly as a nonsmooth penalty. Moreover, we show that penalization enables exact recovery even if a constant fraction of measurements is corrupted by outliers.
Beyond RIP: matrix completion and robust PCA
The RIP assumption provides a nice vantage point for analyzing the problem parameters . There are, however, a number of important problems, which do not satisfy RIP. Nonetheless, the general paradigm based on the interplay of sharpness and approximation is still powerful. We consider two such settings, matrix completion and robust principal component analysis (PCA), leveraging some intermediate results from [19].
The goal of the matrix completion problem [9] is to recover a low rank matrix from its partially observed entries. We focus on the formulation
[TABLE]
where is the projection onto the index set of observed entries and
[TABLE]
is the set of incoherent matrices. To analyze the conditioning of this formulation, we assume that the indices in are chosen as i.i.d. Bernoulli with parameter and that all nonzero singular values of are equal to one. Using results of [19], we quickly deduce sharpness with high probability. The error in approximation, however, takes the following nonstandard form. In the regime for some constants and , the estimate holds with high probability:
[TABLE]
The following modification of the prox-linear method therefore arises naturally:
[TABLE]
We show that subgradient methods and the prox-linear method, thus modified, both converge at a dimension independent linear rate when initialized near the solution. Namely, as long as and are below some constant thresholds, both the subgradient and the modified prox-linear methods converge linearly with high probability:
[TABLE]
Here is a numerical constant. Notice that the prox-linear method enjoys a much faster rate of convergence that is independent of any unknown constants or problem parameters—an observation fully supported by our numerical experiments.
As the final example, we consider the problem of robust PCA [11, 15], which aims to decompose a given matrix into a sum of a low-rank and a sparse matrix. We consider two different problem formulations:
[TABLE]
and
[TABLE]
where and are appropriately defined convex regions. Under standard incoherence assumptions, we show that the formulation (1.5) is well-conditioned, and therefore subgradient and prox-linear methods are applicable. Still, formulation (1.5) has a major drawback in that one must know properties of the optimal sparse matrix in order to define the constraint set , in order to ensure good conditioning. Consequently, we analyze formulation (1.6) as a more practical alternative.
The analysis of (1.6) is more challenging than that of (1.5). Indeed, it appears that we must replace the Frobenius norm in the approximation/sharpness conditions with the sum of the row norms . With this set-up, we verify the convex approximation property in general:
[TABLE]
and sharpness only when . We conjecture, however, that an analogous sharpness bound holds for all . It is easy to see that the quadratic convergence guarantees for the prox-linear method do not rely on the Euclidean nature of the norm, and the algorithm becomes applicable. To the best of our knowledge, it is not yet known how to adapt linearly convergent subgradient methods to the non-Euclidean setting.
Robust recovery with sparse outliers and dense noise
The aforementioned guarantees lead to exact recovery of under noiseless or sparsely corrupted measurements . A more realistic noise model allows for further corruption by a dense noise vector of small norm. Exact recovery is no longer possible with such errors. Instead, we should only expect to recover up to a tolerance proportional to the size of . Indeed, we show that appropriately modified subgradient and prox-linear algorithms converge linearly and quadratically, respectively, up to the tolerance for an appropriate norm . Finally, we discuss in detail the case of recovering a low rank PSD matrix from the corrupted measurements , where represents sparse outliers and represents small dense noise. To the best of our knowledge, theoretical guarantees for this error model have not been previously established in the nonconvex low-rank recovery literature. Surprisingly, we show it is possible to recover the matrix up to a tolerance independent of the norm or location of the outliers .
Numerical experiments
We conclude with an experimental evaluation of our theoretical findings on quadratic and bilinear matrix sensing, matrix completion, and robust PCA problems. In the first set of experiments, we test the robustness of the proposed methods against varying combinations of rank/corruption level by reporting the empirical recovery rate across independent runs of synthetic problem instances. All the aforementioned model problems exhibit sharp phase transitions, yet our methods succeed for more than moderate levels of corruption (or unobserved entries in the case of matrix completion). For example, in the case of matrix sensing, we can corrupt almost half of the measurements and still retain perfect recovery rates. Interestingly, our experimental findings indicate that the prox-linear method can tolerate slightly higher levels of corruption compared to the subgradient method, making it the method of choice for small-to-moderate dimensions.
We then demonstrate that the convergence rate analysis is fully supported by empirical evidence. In particular, we test the subgradient and prox-linear methods for different rank/corruption configurations. In the case of quadratic/bilinear sensing and robust PCA, we observe that the subgradient method converges linearly and the prox-linear method converges quadratically, as expected. In particular, our numerical experiments appear to support our sharpness conjecture for the robust PCA problem. In the case of matrix completion, both algorithms converge linearly. The prox-linear method in particular, converges extremely quickly, reaching high accuracy solutions in under iterations for reasonable values of .
In the noiseless setting, we compare against gradient descent with constant step-size on smooth formulations of each problem (except for robust PCA). We notice that the Polyak subgradient method outperforms gradient descent in all cases. That being said, one can heuristically equip gradient descent with the Polyak step-size as well. To the best of our knowledge, the gradient method with Polyak step-size has has not been investigated on smooth problem formulations we consider here. Experimentally, we see that the Polyak (sub)gradient methods on smooth and nonsmooth formulations perform comparably in the noiseless setting.
Outline of the paper
The outline of the paper is as follows. Section 2 records some basic notation we will use. Section 3 informally discusses the sharpness and approximation properties, and their impact on convergence of the subgradient and prox-linear methods. Section 4 analyzes the parameters under RIP. Section 5 rigorously discusses convergence guarantees of numerical methods under regularity conditions. Section 6 reviews examples of problems satisfying RIP and deduces convergence guarantees for subgradient and prox-linear algorithms. Sections 7 and 8 discuss the matrix completion and robust PCA problems, respectively. Section 9 discusses robust recovery up to a noise tolerance. The final Section 10 illustrates the developed theory and algorithms with numerical experiments on quadratic/bi-linear sensing, matrix completion, and robust PCA problems.
2 Preliminaries
In this section, we summarize the basic notation we will use throughout the paper. Henceforth, the symbol will denote a Euclidean space with inner product and the induced norm . The closed unit ball in will be denoted by , while a closed ball of radius around a point will be written as . For any point and a set , the distance and the nearest-point projection in -norm are defined by
[TABLE]
respectively. For any pair of functions and on , the notation will mean that there exists a numerical constant such that for all . Given a linear map between Euclidean spaces, , the adjoint map will be written as . We will use for the -dimensional identity matrix and for the zero matrix with variable sizes. The symbol will be shorthand for the set
We will always endow the Euclidean space of vectors with the usual dot-product and the induced -norm. More generally, the norm of a vector will be denoted by . Similarly, we will equip the space of rectangular matrices with the trace product and the induced Frobenius norm . The operator norm of a matrix will be written as . The symbol will denote the vector of singular values of a matrix in nonincreasing order. We also define the row-wise matrix norms . The symbols , , , and will denote the sets of symmetric, positive semidefinite, orthogonal, and invertible matrices, respectively.
Nonsmooth functions will play a central role in this work. Consequently, we will require some basic constructions of generalized differentiation, as described for example in the monographs [52, 45, 4]. Consider a function and a point , with finite. The subdifferential of at , denoted by , is the set of all vectors satisfying
[TABLE]
Here denotes any function satisfying as . Thus, a vector lies in the subdifferential precisely when the linear function lower-bounds up to first-order around . Standard results show that for a convex function the subdifferential reduces to the subdifferential in the sense of convex analysis, while for a differentiable function it consists only of the gradient: . For any closed convex functions and and -smooth map , the chain rule holds [52, Theorem 10.6]:
[TABLE]
We say that a point is stationary for whenever the inclusion holds. Equivalently, stationary points are precisely those that satisfy first-order necessary conditions for minimality: the directional derivative is nonnegative in every direction.
We say a that a random vector in is -sub-gaussian whenever for all unit vectors . The sub-gaussian norm of a real-valued random variable is defined to be , while the sub-exponential norm is defined by .
3 Regularity conditions and algorithms (informal)
As outlined in Section 1, we consider the low-rank matrix recovery problem within the framework of compositional optimization:
[TABLE]
where is a closed convex set, is a finite convex function and is a -smooth map. We depart from previous work on low-rank matrix recovery by allowing to be nonsmooth. We primary focus on those algorithms for (3.1) that converge rapidly (linearly or faster) when initialized sufficiently close to the solution set.
Such rapid convergence guarantees rely on some regularity of the optimization problem. In the compositional setting, regularity conditions take the following appealing form.
Assumption A**.**
Suppose that the following properties hold for the composite optimization problem (3.1) for some real numbers .
(Approximation accuracy) The convex models satisfy the estimate
[TABLE] 2. 2.
(Sharpness) The set of minimizers is nonempty and we have
[TABLE] 3. 3.
(Subgradient bound) The bound, , holds for any in the tube
[TABLE]
As pointed out in the introduction, these three properties are quite intuitive: The approximation accuracy guarantees that the objective function is well approximated by the convex model , up to a quadratic error relative to the basepoint . Sharpness stipulates that the objective function should grow at least linearly as one moves away from the solution set. The subgradient bound, in turn, asserts that the subgradients of are bounded in norm by on the tube . In particular, this property is implied by Lipschitz continuity on .
Lemma 3.1** (Subgradient bound and Lipschitz continuity [52, Theorem 9.13]).**
Suppose a function is -Lipschitz on an open set . Then the estimate holds for all .
The definition of the tube might look unintuitive at first. Some thought, however, shows that it arises naturally since it provably contains no extraneous stationary points of the problem. In particular, will serve as a basin of attraction of numerical methods; see the forthcoming Section 5 for details. The following general principle has recently emerged [23, 27, 24, 16]. Under Assumption A, basic numerical methods converge rapidly when initialized within the tube . Let us consider three such procedures and briefly describe their convergence properties. Detailed convergence guarantees are deferred to Section 5.
Algorithm 1 is the so-called Polyak subgradient method. In each iteration , the method travels in the negative direction of a subgradient , followed by a nearest-point projection onto . The step-length is governed by the current functional gap . In particular, one must have the value explicitly available to implement the procedure. This value is sometimes known; case in point, the minimal value of the penalty formulations (1.1) and (1.2) for low-rank recovery is zero when the linear measurements are exact. When the minimal value is not known, one can instead use Algorithm 2, which replaces the step-length with a preset geometrically decaying sequence. Notice that the per iteration cost of both subgradient methods is dominated by a single subgradient evaluation and a projection onto . Under appropriate parameter settings, Assumption A guarantees that both methods converge at a linear rate governed by the ratio , when initialized within . The prox-linear algorithm (Algorithm 2), in contrast, converges quadratically to the optimal solution, when initialized within . The caveat is that each iteration of the prox-linear method requires solving a strongly convex subproblem. Note that for low-rank recovery problems (1.1) and (1.2), the size of the subproblems is proportional to the size of the factors and not the size of the matrices.
In the subsequent sections, we show that Assumption A (or a close variant) holds with favorable parameters for common low-rank matrix recovery problems.
4 Regularity under RIP
In this section, we consider the low-rank recovery problems (1.1) and (1.2), and show that restricted isometry properties of the map naturally yield well-conditioned compositional formulations.444The guarantees we develop in the symmetric setting are similar to those in the recent preprint [39], albeit we obtain a sharper bound on ; the two sets of results were obtained independently. The guarantees for the asymmetric setting are different and are complementary to each other: we analyze the conditioning of the basic problem formulation (1.2), while [39] introduces a regularization term that improves the basin of attraction for the subgradient method by a factor of the condition number of . The arguments are short and elementary, and yet apply to such important problems as phase retrieval, blind deconvolution, and covariance matrix estimation.
Setting the stage, consider a linear map , an arbitrary rank matrix , and a vector modeling a corrupted estimate of the measurements . Recall that the goal of low-rank matrix recovery is to determine given and . By the term symmetric setting, we mean that is symmetric and positive semidefinite, whereas by asymmetric setting we mean that is an arbitrary rank matrix. We will treat the two settings in parallel. In the symmetric setting, we use to denote any fixed matrix for which the factorization holds. Similarly, in the asymmetric case, and denote any fixed and matrices, respectively, satisfying .
We are interested in the set of all possible factorization of . Consequently, we will often appeal to the following representations:
[TABLE]
Throughout, we will let refer to the set (4.1) in the symmetric case and to (4.2) in the asymmetric setting.
Henceforth, fix an arbitrary norm on . The following property, widely used in the literature on low-rank recovery, will play a central role in this section.
Assumption B** (Restricted Isometry Property (RIP)).**
There exist constants such that for all matrices of rank at most the following bound holds:
[TABLE]
Assumption B is classical and is satisfied in various important problems with the rescaled -norm and -norm .555In the latter case, RIP also goes by the name of Restricted Uniform Boundedness (RUB) [8]. In Section 6 we discuss a number of such examples including matrix sensing under (sub-)Gaussian design, phase retrieval, blind deconvolution, and quadratic/bilinear sensing. We summarize the RIP properties for these examples in Table 1 and refer the reader to Section 6 for the precise statements.
In light of Assumption B, it it natural to take the norm as the penalty in (1.1) and (1.2) . Then the symmetric problem (1.1) becomes
[TABLE]
while the asymmetric formulation (1.2) becomes
[TABLE]
Our immediate goal is to show that under Assumption B, the problems (4.3) and (4.4) are well-conditioned in the sense of Assumption A. We note that the asymmetric setting is more nuanced that its symmetric counterpart because Assumption A can only be guaranteed to hold on bounded sets. Nonetheless, as we discuss in Section 5, a localized version of Assumption A suffices to guarantee rapid local convergence of subgradient and prox-linear methods. In particular, our analysis of the local sharpness in the asymmetric setting is new and illuminating; it shows that the regularization technique suggested in [39] is not needed at all for the prox-linear method. This conclusion contrasts with known techniques in the smooth setting, where regularization is often used.
4.1 Approximation and Lipschitz continuity
We begin with the following elementary proposition, which estimates the subgradient bound and the approximation modulus in the symmetric setting. In what follows, we will use the expressions
[TABLE]
Proposition 4.1** (Approximation accuracy and Lipschitz continuity (symmetric)).**
Suppose Assumption B holds. Then for all the following estimates hold:
[TABLE]
Proof.
To see the first estimate, observe that
[TABLE]
where (4.5) follows from the reverse triangle inequality and (4.6) uses Assumption B. Next, for any we successively compute:
[TABLE]
where (4.7) follows from the reverse triangle inequality and (4.8) uses Assumption B. The proof is complete. ∎
The estimates of and in the asymmetric setting are completely analogous; we record them in the following proposition.
Proposition 4.2** (Approximation accuracy and Lipschitz continuity (asymmetric)).**
Suppose Assumption B holds. Then for all and the following estimates hold:
[TABLE]
Proof.
To see the first estimate, observe that
[TABLE]
where the last estimate follows from Young’s inequality Next, we successively compute:
[TABLE]
The result follows by noting that for all .
∎
4.2 Sharpness
We next move on to estimates of the sharpness constant . We first deal with the noiseless setting in Section 4.2.1, and then move on to the general case when the measurements are corrupted by outliers in Section 4.2.2.
4.2.1 Sharpness in the noiseless regime
We begin with with the symmetric setting in the noiseless case . By Assumption B, we have the estimate
[TABLE]
It follows that the set of minimizers coincides with the set of minimizers of the function , namely
[TABLE]
Thus to argue sharpness of it suffices to estimate the sharpness constant of the function . Fortunately, this calculation was already done in [57, Lemma 5.4].
Proposition 4.3** ([57, Lemma 5.4]).**
For any matrices , we have the bound
[TABLE]
Consequently if Assumption B holds in the noiseless setting , then the bound holds:
[TABLE]
We next consider the asymmetric case. By exactly the same reasoning as before, the set of minimizers of coincides with the set of minimizers of the function , namely
[TABLE]
Thus to argue sharpness of it suffices to estimate the sharpness constant of the function . Such a sharpness guarantee in the rank one case was recently shown in [16, Proposition 4.2].
Proposition 4.4** ([16, Proposition 4.2]).**
Fix a rank matrix and a constant . Then for any and satisfying
[TABLE]
the following estimate holds:
[TABLE]
Notice that in contrast to the symmetric setting, the sharpness estimate is only valid on bounded sets. Indeed, this is unavoidable even in the setting . To see this, define and for any set and . It is routine to compute
[TABLE]
Therefore letting tend to zero (or infinity) the quotient tends to zero.
The following corollary is a higher rank extension of Proposition 4.4.
Theorem 4.5** (Sharpness (asymmetric and noiseless)).**
Fix a constant and define and , where is any compact singular value decomposition of . Then for all and satisfying
[TABLE]
the estimate holds:
[TABLE]
Proof.
Define and consider a pair of matrices and satisfying (4.10). Let be an invertible matrix satisfying
[TABLE]
As a first step, we successively compute
[TABLE]
We next aim to lower bound the first term on the right. To this end, observe
[TABLE]
We claim that the cross-term is non-negative. To see this, observe that first order optimality conditions in (4.11) directly imply that satisfies the equality
[TABLE]
Thus we obtain
[TABLE]
Therefore, returning to (4.13) we conclude that
[TABLE]
Combining (4.12) and (4.14), we obtain
[TABLE]
Finally, we estimate . To this end, first note that
[TABLE]
We now aim to lower bound the left-hand-side in terms of . Observe
[TABLE]
Similarly, we have
[TABLE]
Hence using (4.16), we obtain the estimate
[TABLE]
Using this estimate in (4.15) completes the proof. ∎
4.2.2 Sharpness in presence of outliers
The most important example of the norm for us is the scaled -norm . Indeed, all the examples in the forthcoming Section 6 will satisfy RIP relative to this norm. In this section, we will show that the -norm has an added advantage. Under reasonable RIP-type conditions, sharpness will hold even if up to a half of the measurements are grossly corrupted.
Henceforth, for any set , define the restricted map . We interpret the set as corresponding to (arbitrarily) outlying measurements, while its complement corresponds to exact measurements. Motivated by the work [27] on robust phase retrieval, we make the following assumption.
Assumption C** (-outlier bounds).**
There exists a set and a constant such that the following hold.
Equality holds for all . 2.
For all matrices of rank at most , we have
[TABLE]
The assumption is simple to interpret. To elucidate the bound (4.17), let us suppose that the restricted maps and satisfy Assumption B (RIP) with constants , and , , respectively. Then for any rank matrix we immediately deduce the estimate
[TABLE]
where denotes the corruption frequency. In particular, the right-hand side is positive as long as the corruption frequency is below the threshold .
Combining Assumption C with Proposition 4.3 quickly yields sharpness of the objective even in the noisy setting.
Proposition 4.6** (Sharpness with outliers (symmetric)).**
Suppose that Assumption C holds. Then
[TABLE]
Proof.
Defining , we have the following bound:
[TABLE]
where the first inequality follows by the reverse triangle inequality, the second inequality follows by Assumption , and the final inequality follows from Proposition 4.3. The proof is complete. ∎
The argument in the asymmetric setting is completely analogous.
Proposition 4.7** (Sharpness with outliers (asymmetric)).**
Suppose that Assumption C holds. Fix a constant and define and , where is any compact singular value decomposition of . Then for all and satisfying
[TABLE]
The estimate holds:
[TABLE]
5 General convergence guarantees for subgradient & prox-linear methods
In this section, we formally develop convergence guarantees for Algorithms 1, 2, and 3 under Assumption A, and deduce performance guarantees in the RIP setting. To this end, it will be useful to first consider a broader class than the compositional problems (3.1). We say that a function is -weakly convex666Weakly convex functions also go by other names such as lower-, uniformly prox-regularity, paraconvex, and semiconvex. We refer the reader to the seminal works on the topic [51, 49, 47, 53, 2]. if the perturbed function is convex. In particular, a composite function satisfying the approximation guarantee
[TABLE]
is automatically -weakly convex [26, Lemma 4.2]. Subgradients of weakly convex functions are very well-behaved. Indeed, notice that in general the little-o term in the expression (2.1) may depend on the basepoint , and may therefore be nonuniform. The subgradients of weakly convex functions, on the other hand, automatically satisfy a uniform type of lower-approximation property. Indeed, a lower-semicontinuous function is -weakly convex if and only if it satisfies:
[TABLE]
Setting the stage, we introduce the following assumption.
Assumption D**.**
Consider the optimization problem,
[TABLE]
Suppose that the following properties hold for some real numbers .
(Weak convexity) The set is closed and convex, while the function is -weakly convex. 2. 2.
(Sharpness) The set of minimizers is nonempty and the following inequality holds:
[TABLE]
In particular, notice that Assumption A implies Assumption D. Taken together, weak convexity and sharpness provide an appealing framework for deriving local rapid convergence guarantees for numerical methods. In this section, we specifically focus on two such procedures: the subgradient and prox-linear algorithms. We aim to estimate both the radius of rapid converge around the solution set and the rate of convergence. Note that both of the algorithms, when initialized at a stationary point could stay there for all subsequent iterations. Since we are interested in finding global minima, we therefore estimate the neighborhood of the solution set that has no extraneous stationary points. This is the content of the following simple lemma.
Lemma 5.1** ([23, Lemma 3.1]).**
Suppose that Assumption D holds. Then the problem (5.1) has no stationary points satisfying
[TABLE]
It is worthwhile to note that the estimate of the radius in Lemma 5.1 is tight [16, Section 3]. Hence, let us define for any the tube
[TABLE]
Thus we would like to search for algorithms whose basin of attraction is a tube for some numerical constant . Such a basin of attraction is in essence optimal.
The rate of convergence of the subgradient methods (Algorithms 1 and 2) relies on the subgradient bound and the condition measure:
[TABLE]
A straightforward argument [23, Lemma 3.2] shows . The following theorem appears as [23, Theorem 4.1], while its application to phase retrieval was investigated in [24].
Theorem 5.2** (Polyak subgradient method).**
Suppose that Assumption D holds and fix a real number . Then Algorithm 1 initialized at any point produces iterates that converge -linearly to , that is
[TABLE]
The following theorem appears as [23, Theorem 6.1]. The convex version of the result dates back to Goffin [34].
Theorem 5.3** (Geometrically decaying subgradient method).**
Suppose that Assumption D holds, fix a real number , and suppose . Set in Algorithm 2. Then the iterates generated by Algorithm 2, initialized at any point , satisfy:
[TABLE]
Let us now specialize to the composite setting under Assumption A. Since Assumption A implies Assumption D, both subgradient Algorithms 1 and 2 will enjoy a linear rate of convergence when initialized sufficiently close the solution set. The following theorem, on the other hand, shows that the prox-linear method will enjoy a quadratic rate of convergence (at the price of a higher per-iteration cost). Guarantees of this type have appeared, for example, in [27, 7, 25].
Theorem 5.4** (Prox-linear algorithm).**
Suppose Assumption A holds. Choose any in Algorithm 3 and set . Then Algorithm 3 initialized at any point converges quadratically:
[TABLE]
We now apply the results above to the low-rank matrix factorization problem under RIP, whose regularity properties were verified in Section 4. In particular, we have the following efficiency guarantees of the subgradient and prox-linear methods applied to this problem.
Corollary 5.5** (Convergence guarantees under RIP (symmetric)).**
Suppose Assumptions B and C are valid with and consider the optimization problem
[TABLE]
Choose any matrix satisfying
[TABLE]
Define the condition number . Then the following are true.
(Polyak subgradient)* Algorithm 1 initialized at produces iterates that converge linearly to , that is*
[TABLE] 2. 2.
(geometric subgradient)* Algorithm 2 with , and initialized at converges linearly:*
[TABLE] 3. 3.
(prox-linear)* Algorithm 3 with and initialized at converges quadratically:*
[TABLE]
5.1 Guarantees under local regularity
As explained in Section 4, Assumptions A and D are reasonable in the symmetric setting under RIP. The asymmetric setting is more nuanced. Indeed, the solution set is unbounded, while uniform bounds on the sharpness and subgradient norms are only valid on bounded sets. One remedy, discussed in [39], is to modify the optimization formulation by introducing a form of regularization:
[TABLE]
In this section, we take a different approach that requires no modification to the optimization problem nor the algorithms. The key idea is to show that if the problem is well-conditioned only on a neighborhood of a particular solution, then the iterates will remain in the neighborhood provided the initial point is sufficiently close to the solution. In fact, we will see that the iterates themselves must converge. The proofs of the results in this section (Theorems 5.6, 5.7, and 5.8) are deferred to Appendix A.
We begin with the following localized version of Assumption D.
Assumption E**.**
Consider the optimization problem,
[TABLE]
Fix an arbitrary point and suppose that the following properties hold for some real numbers .
(Local weak convexity) The set is closed and convex, and the bound holds:
[TABLE] 2. 2.
(Local sharpness) The inequality holds:
[TABLE]
The following two theorems establish convergence guarantees of the two subgradient methods under Assumption E. Abusing notation slightly, we define the local quantities:
[TABLE]
Theorem 5.6** (Polyak subgradient method (local regularity)).**
Suppose Assumption E holds and fix an arbitrary point satisfying
[TABLE]
Then Algorithm 1 initialized at produces iterates that always lie in and satisfy
[TABLE]
Moreover the iterates converge to some point at the R-linear rate
[TABLE]
Theorem 5.7** (Geometrically decaying subgradient method (local regularity)).**
Suppose that Assumption E holds and that . Define , , and . Then Algorithm 2 initialized at any point generates iterates that always lie in and satisfy
[TABLE]
Moreover, the iterates converge to some point at the R-linear rate
[TABLE]
We end the section by specializing to the composite setting and analyzing the prox-linear method. The following is the localized version of Assumption A.
Assumption F**.**
Consider the optimization problem,
[TABLE]
where the function and the set are convex and is differentiable. Fix a point and suppose that the following properties holds for some real numbers .
(Approximation accuracy) The convex models satisfy the estimate:
[TABLE] 2. 2.
(Sharpness) The inequality holds:
[TABLE]
The following theorem provides convergence guarantees for the prox-linear method under Assumption F.
Theorem 5.8** (Prox-linear (local)).**
Suppose Assumption F holds, choose any , and fix an arbitrary point satisfying
[TABLE]
Then Algorithm 3 initialized at generates iterates that always lie in and satisfy
[TABLE]
Moreover the iterates converge to some point at the quadratic rate
[TABLE]
With the above generic results in hand, we can now derive the convergence guarantees for the subgradient and prox-linear methods for asymmetric low-rank matrix recovery problems. To summarize, the prox-linear method converges quadratically, as long as it is initialized within constant relative error of the solution. The guarantees for the subgradient methods are less satisfactory: the size of the region of the linear convergence scales with the condition number of . The reason is that the proof estimates the region of convergence using the length of the iterate path, which scales with the condition number. The dependence on the condition number in general can be eliminated by introducing regularization , as suggested in the work [39]. Still the results we present here are notable even for the subgradient method. For example, we see that for rank instances satisfying RIP (e.g. blind deconvolution), the condition number of is always one and therefore regularization is not required at all for subgradient and prox-linear methods.
Corollary 5.9** (Convergence guarantees under RIP (asymmetric)).**
Suppose Assumptions B and C are valid777 with and consider the optimization problem
[TABLE]
Define and , where is any compact singular value decomposition of . Define also the condition number . Then there exists depending only on , , and such that the following are true.
(Polyak subgradient)* Algorithm 1 initialized at satisfying , will generate an iterate sequence that converges at the linear rate:*
[TABLE] 2. 2.
(geometric subgradient)* Algorithm 2 initialized at satisfying , will generate an iterate sequence that converges at the linear rate:*
[TABLE] 3. 3.
(prox-linear)* Algorithm 3 initialized at satisfying and , will generate an iterate sequence that converges at the quadratic rate:*
[TABLE]
6 Examples of RIP
In this section, we survey three matrix recovery problems from different fields, including physics, signal processing, control theory, wireless communications, and machine learning, among others. In all cases, the problems satisfy RIP and the -outlier bounds and consequently, the convergence results in Corollaries 5.5 and 5.9 immediately apply. Most of the RIP results in this section were previously known (albeit under more restrictive assumptions); we provide self-contained arguments in the Appendix B for the sake of completeness. On the other hand, using nonsmooth optimization in these problems and the corresponding convergence guarantees based on RIP are, for the most part, new.
For the rest of this section we will assume the following data-generating mechanism.
Definition 6.1** (Data-generating mechanism).**
A random linear map and a random index set are drawn independently of each other. We assume moreover that the outlier frequency satisfies almost surely. We then observe the corrupted measurements
[TABLE]
where is an arbitrary vector. In particular, could be correlated with .
Throughout this section, we consider four distinct linear operators .
Matrix Sensing.
In this scenario, measurements are generated as follows:
[TABLE]
where are fixed matrices.
Quadratic Sensing I .
In this scenario, is assumed to be a PSD rank matrix with factorization and measurements are generated as follows:
[TABLE]
where are fixed vectors.
Quadratic Sensing II .
In this scenario, is assumed to be a PSD rank matrix with factorization and measurements are generated as follows:
[TABLE]
where are fixed vectors.
Bilinear Sensing.
In this scenario, is assumed to be a matrix with factorization and measurements are generated as follows:
[TABLE]
where and are fixed vectors.
The matrix, quadratic, and bilinear sensing problems have been considered in a number of papers and in a variety of applications. The first theoretical properties for matrix sensing were discussed in [30, 50, 13]. Quadratic sensing in its full generality appeared in [18] and is a higher-rank generalization of the much older (real) phase retrieval problem [10, 14, 35]. Besides phase retrieval, quadratic sensing has applications to covariance sketching, shallow neural networks, and quantum state tomography; see for example [40] for a discussion. Bilinear sensing is a natural modification of quadratic sensing and is a higher-rank generalization of the blind deconvolution problem [1]; it was first proposed and studied in [8].
The reader is reminded that once RIP guarantees, in particular Assumptions B and C, are established for the above four operators, the guarantees of Corollaries 5.5 and Corollary 5.9 immediately take hold for the problems
[TABLE]
and
[TABLE]
respectively. Thus, we turn our attention to establishing such guarantees.
6.1 Warm-up: RIP for matrix sensing with Gaussian design
In this section, we are primarily interested in the RIP for the above four linear operators. However, as a warm-up, we first consider the -RIP property for matrix sensing with Gaussian . The following result appears in [50, 13].
Theorem 6.2** (-RIP for matrix sensing**).
For any there exist constants depending only on such that if the entries of are i.i.d. standard Gaussian and , then with probability at least , the estimate
[TABLE]
holds simultaneously for all of rank at most . Consequently, Assumption B is satisfied.
Following the general recipe of the paper, we see that the nonsmooth formulation
[TABLE]
is immediately amenable to subgradient and prox-linear algorithms in the noiseless setting . In particular, a direct analogue of Corollary 5.9, which was stated for the penalty function , holds; we omit the straightforward details.
6.2 The RIP and -outlier bounds: quadratic and bilinear sensing
We now turn our attention to the RIP for more general classes of linear maps than the i.i.d. Gaussian matrices considered in Theorem 6.2. To establish such guarantees, one must ensure that the linear maps have light tails and are robustly injective on certain spaces of matrices. The first property leads to tight concentration results, while the second yields the existence of a lower RIP constant .
Assumption G** (Matrix Sensing).**
The matrices are i.i.d. realizations of an -sub-Gaussian random matrix888By this we mean that the vectorized matrix is a -sub-gaussian random vector. Furthermore, there exists a numerical constant such that
[TABLE]
Assumption H** (Quadratic Sensing I).**
The vectors are i.i.d. realizations of a -sub-Gaussian random variable Furthermore, there exists a numerical constant such that
[TABLE]
Assumption I** (Quadratic Sensing II).**
The vectors are i.i.d. realizations of a -sub-Gaussian random variable Furthermore, there exists a numerical constant such that
[TABLE]
Assumption J** (Bilinear Sensing).**
The vectors and are i.i.d. realizations of -sub-Gaussian random vectors and respectively. Furthermore, there exists a numerical constant such that
[TABLE]
The Assumptions G-J are all valid for i.i.d. Gaussian realizations with independent identity covariance, as the following lemma shows. We defer its proof to Appendix B.1.
Lemma 6.3**.**
Assumption G holds for matrices with i.i.d. standard Gaussian entries. Assumptions H and I hold for vectors with i.i.d. standard Gaussian entries. Assumption J holds for vectors and with i.i.d. standard Gaussian entries.
We can now state the main RIP guarantees under the above assumptions. Throughout all the results, we fix the data generating mechanism as in Definition 6.1. Then, we wish to establish the inequalities
[TABLE]
and
[TABLE]
and, hence, Assumptions B and C, respectively, for certain constants and . We defer the proof of this theorem to Appendix B.2.
Theorem 6.4** ( RIP and -outlier bounds).**
There exist numerical constants depending only on such that the following hold for the corresponding measurement operators described in Equations (6.2), (6.3), (6.4), and (6.5), respectively
(Matrix sensing)* Suppose Assumption G holds. Then provided , we have with probability at least that every matrix of rank at most satisfies (6.11) and (6.12) with constants and .* 2. 2.
(Quadratic sensing I)* Suppose Assumption H holds. Then provided , we have with probability at least that every matrix of rank at most satisfies (6.11) and (6.12) with constants and .* 3. 3.
(Quadratic sensing II)* Suppose Assumption I holds. Then provided , we have with probability at least that every matrix of rank at most satisfies (6.11) and (6.12) with constants and .* 4. 4.
(Bilinear sensing)* Suppose Assumption J holds. Then provided , we have with probability at least that every matrix of rank at most satisfies (6.11) and (6.12) with constants and .*
The guarantees of Theorem 6.4 were previously known under stronger assumptions. In particular, item (1) generalizes the results in [39] for the pure Gaussian setting. The case of item (2) can be found, in a sightly different form, in [29, 27]. Item (3) sharpens slightly the analogous guarantee in [18] by weakening the assumptions on the moments of the measuring vectors to the uniform lower bound (6.9). Special cases of item (4) were established in [16], for the case , and [8], for Gaussian measurement vectors.
We note that all linear mappings require the same number of measurements in order to satisfy RIP and outlier bounds, except for quadratic sensing I operator, which incurs an extra -factor. This reveals the utility of the quadratic sensing II operator, which achieves optimal sample complexity. For larger scale problems, a shortcoming of matrix sensing operator (6.2) is that scalars are required to represent the map . In contrast, all other measurement operators may be represented with only scalars.
7 Matrix Completion
In the previous sections, we saw that low-rank recovery problems satisfying RIP lead to well-conditioned nonsmooth formulations. We claim, however, that the general framework of sharpness and approximation is applicable even for problems without RIP. We consider two such problems, namely matrix completion in this section and robust PCA in Section 8, to follow. Both problems will be considered in the symmetric setting.
The goal of matrix completion problem is to recover a PSD rank matrix given access only to a subset of its entries. Henceforth, let be a matrix satisfying . Throughout, we assume incoherence condition, , for some . We also make the fairly strong assumption that the singular values of are all equal . This assumption is needed for our theoretical results. We let be an index set generated by the Bernoulli model, that is, independently for all . Let be the projection onto the entries indexed by . We consider the following optimization formulation of the problem
[TABLE]
We will show that both the Polyak subgradient method and an appropriately modified prox-linear algorithm converge linearly to the solution set under reasonable initialization. Moreover, we will see that the linear rate of convergence for the prox-linear method is much better than that for the subgradient method.
To simplify notation, we set
[TABLE]
We begin by estimating the sharpness constant of the objective function. Fortunately, this estimate follows directly from inequalities (58) and (59a) in [19].
Lemma 7.1** (Sharpness [19]).**
There are numerical constant such that the following holds. If , then with probability , the estimate
[TABLE]
holds uniformly for all with .
Let us next estimate the approximation accuracy , where
[TABLE]
To this end, we will require the following result.
Lemma 7.2** (Lemma 5 in [19]).**
There is a numerical constant such that the following holds. If for some , then with probability at least , the estimates
; and 2. 2.
**
hold uniformly for all matrices with and .
An estimate of the approximation error is now immediate.
Lemma 7.3** (Approximation accuracy and Lipschitz continuity).**
There is a numerical constant such that the following holds. If for some , then with probability at least , the estimates
[TABLE]
holds uniformly for all .
Proof.
The first inequality follows immediately by observing the estimate
[TABLE]
and using Lemma 7.2. To see the second inequality, observe
[TABLE]
where the last inequality follows by Part 2 of Lemma 7.2. ∎
Note that the approximation bound in Lemma 7.2 is not in terms of the square Euclidean norm. Therefore the results in Section 5 do not apply directly. Nonetheless, it is straightforward to modify the prox-linear method to take into account the new approximation bound. The proof of the following lemma appears in the appendix.
Lemma 7.4**.**
Suppose that Assumption A holds with the approximation property replaced by
[TABLE]
for some real . Consider the iterates generated by the process:
[TABLE]
Then as long as satisfies , the iterates converge linearly:
[TABLE]
Combining Lemma 7.4 with our estimates of the sharpness and approximation accuracy, we deduce the following convergence guarantee for matrix completion.
Corollary 7.5** (Prox-linear method for matrix completion).**
There are numerical constants such that the following holds. If for some , then with probability at least , the iterates generated by the modified prox-linear algorithm
[TABLE]
satisfy
[TABLE]
In particular, the iterates converge linearly as long as .
Proof.
By invoking Proposition 4.3 and Lemmas 7.1 and 7.3 we may appeal to Lemma 7.4 with , , and . The result follows immediately. ∎
To summarize, there exist numerical constants such that the following is true with probability at least . In the regime
[TABLE]
the prox-linear method will converge at the rapid linear rate,
[TABLE]
when initialized at satisfying .
As for the prox-linear method, the results of Section 5 do not immediately yield convergence guarantees for the Polyak subgradient method. Nonetheless, it straightforward to show that the standard Polyak subgradient method still enjoys local linear convergence guarantees. The proof is a straightforward modification of the argument in [23, Theorem 3.1], and appears in the appendix.
Theorem 7.6**.**
Suppose that Assumption A holds with the approximation property replaced by
[TABLE]
for some real . Consider the iterates generated by the Polyak subgradient method in Algorithm 1. Then as long as the sharpness constant satisfies and satisfies for some , the iterates converge linearly
[TABLE]
Finally, combining Theorem 7.6 with our estimates of the sharpness and approximation accuracy, we deduce the following convergence guarantee for matrix completion.
Corollary 7.7** (Subgradient method for matrix completion).**
There are numerical constants such that the following holds. If for some , then with probability at least , the iterates generated by the iterates generated by the Polyak Subgradient method in Algorithm 1 satisfy
[TABLE]
In particular, the iterates converge linearly as long as .
Proof.
First, observe that we have the bound by Lemma 7.3. By invoking Proposition 4.3 and Lemmas 7.1 and 7.3 we may appeal to Theorem 7.6 with , , , and . The result follows immediately. ∎
To summarize, there exist numerical constants such that the following is true with probability at least . In the regime
[TABLE]
the Polyak subgradient method will converge at the linear rate,
[TABLE]
when initialized at satisfying . Notice that the prox-linear method enjoys a much faster linear rate of convergence than the subgradient method—an observation fully supported by numerical experiments in Section 10. The caveat is that the per iteration cost of the prox-linear method is significantly higher than that of the subgradient method.
8 Robust PCA
The goal of robust PCA is to decompose a given matrix into a sum of a low-rank matrix and a sparse matrix , where represents the principal components, the corruption, and the observed data [15, 11, 59]. In this section, we explore methods of nonsmooth optimization for recovering such a decomposition, focusing on two different problem formulations. We only consider the symmetric version of the problem.
8.1 The Euclidean formulation
Setting the stage, we assume that the matrix admits a decomposition , where the matrices and satisfy the following for some parameters and :
The matrix has rank and can be factored as for some matrix satisfying and .999Recall that is the maximum row norm. 2. 2.
The matrix is sparse in the sense that it has at most nonzero entries per column/row.
The goal is to recover and given . The first formulation we consider is the following:
[TABLE]
where the constraint sets are defined by
[TABLE]
Note that the problem formulation requires knowing the norms of the rows of . The same assumption was also made in [19, 32]. While admittedly unrealistic, this formulation provides a nice illustration of the paradigm we advocate here. The following technical lemma will be useful in proving the regularity conditions needed for rapid convergence. The proof is given in Appendix D.1.
Lemma 8.1**.**
For all and , the estimate holds:
[TABLE]
Equipped with the above lemma, we can estimate the sharpness and approximation parameters for the formulation (8.1).
Lemma 8.2** (Regularity constants).**
For all and , the estimates hold:
[TABLE]
and
[TABLE]
Moreover, for any and , the Lipschitz bounds holds:
[TABLE]
Proof.
Let . To establish the bound (8.2), we observe that
[TABLE]
where the first inequality follows from Proposition 4.3 and Lemma 8.1. Now set
[TABLE]
and . With this notation, we apply the Fenchel-Young inequality to show that for any , we have
[TABLE]
Thus, for any , we have
[TABLE]
Now, let us choose so that Namely set With this choice of and the bound , the claimed bound (8.2) follows immediately. The bound (8.3) follows from the reverse triangle inequality:
[TABLE]
Finally observe
[TABLE]
where we use the bound in the final inequality. The proof is complete. ∎
To summarize, there exist numerical constants such that the following is true. In the regime
[TABLE]
the Polyak subgradient method will converge at the linear rate,
[TABLE]
and the prox-linear method will converge quadratically when initialized at satisfying .
8.2 The non-Euclidean formulation
We next turn to a different formulation for robust PCA that does not require knowledge of row norms of . In particular, we consider the formulation
[TABLE]
for a constant . Unlike Section 8.1, here we consider a randomized model for the sparse matrix . We assume that there are real such that
can be factored as for some matrix satisfying . 2. 2.
We assume the random corruption model
[TABLE]
where are i.i.d. Bernoulli random variables with and is an arbitrary and fixed symmetric matrix.
In this setting, the approximation function at is given by
[TABLE]
We begin by computing an estimate of the approximation accuracy .
Lemma 8.3** (Approximation accuracy).**
The estimate holds:
[TABLE]
Proof.
As in the proof of Proposition 4.1, we compute
[TABLE]
thereby completing the argument. ∎
Notice that the error is bounded in terms of the non-Euclidean norm . Thus, although in principle one may apply subgradient methods to the formulation (8.4), their convergence guarantees, which fundamentally relied on the Euclidean norm, would yield potentially overly pessimistic performance predictions. On the other hand, the convergence guarantees for the prox-linear method do not require the norm to be Euclidean. Indeed, the following is true, with a proof that is nearly identical as that of Theorem 5.8.
Theorem 8.4**.**
Suppose that Assumption A holds where is replaced by an arbitrary norm . Choose any and set in Algorithm 3. Then Algorithm 3 initialized at any point satisfying converges quadratically:
[TABLE]
To apply the above generic convergence guarantees for the prox-linear method, it remains to show that the objective function in (8.4) is sharp relative to the norm . A key step in showing such a result is to prove that
[TABLE]
for a quantity depending only on . One may prove this inequality using Proposition 4.3 together with the equivalence of the norms and . Doing so however leads to a dimension-dependent , resulting in a poor rate of convergence and region of attraction. We instead seek to directly establish sharpness relative to the norm . In the rank one setting, this can be done using the following theorem.
Theorem 8.5** (Sharpness (rank one)).**
Consider two vectors satisfying
[TABLE]
Then the estimate holds:
[TABLE]
The proof of this result appears in Appendix D.2. We leave as an intriguing open question to determine if an analogous result holds in the higher rank setting.
Conjecture 8.6** (Sharpness (general rank)).**
Fix a rank matrix and set . Then there exist constants depending only on such that the estimate holds:
[TABLE]
for all satisfying .
Assuming this conjecture, we can then show that the loss function is sharp under the randomized corruption model. We first state the following technical lemma, whose proof is deferred to Appendix D.3. In what what follows, given a matrix , the notation always refers to the th row of .
Lemma 8.7**.**
Assume Conjecture 8.6. Then there exist constants so that for all satisfying , we have that with probability , the following bound holds:
[TABLE]
for all satisfying .
We remark that we expect to scale with in the above bound, yielding a ratio dependent on the conditioning of . Given the above lemma, sharpness of quickly follows.
Lemma 8.8** (Sharpness of Non-Euclidean Robust PCA).**
Assume Conjecture 8.6. Then there exists a constants so that for all satisfying , we have that with probability , the following bound holds:
[TABLE]
for all satisfying and .
Proof.
The reverse triangle inequality implies that
[TABLE]
The result them follows from the the sharpness of the function together with Lemma 8.7. ∎
Combining Lemma 8.8 and Theorem 8.4, we deduce the following convergence guarantee.
Theorem 8.9** (Convergence for non-Euclidean Robust PCA).**
Assume Conjecture 8.6. Then there exist constants so that for all satisfying and satisfying , we have that with probability , the iterates generated by the prox-linear algorithm
[TABLE]
satisfy
[TABLE]
In particular, the iterates converge quadratically as long as the initial iterate satisfies
[TABLE]
9 Recovery up to a Tolerance
Thus far, we have developed exact recovery guarantees under noiseless or sparsely corrupted measurements. We showed that sharpness together with weak convexity imply rapid local convergence of numerical methods under these settings. In practical scenarios, however, it might be unlikely that any, let alone a constant fraction of measurements, are perfectly observed. Instead, a more realistic model incorporates additive errors that are the sum of a sparse, but otherwise arbitrary vector and a dense vector with relatively small norm. Exact recovery is in general not possible under this noise model. Instead, we should only expect to recover the signal up to an error.
To develop algorithms for this scenario, we need only observe that the previously developed sharpness results all yield a corresponding “sharpness up to a tolerance” result. Indeed, all problems considered thus far, are convex composite and sharp:
[TABLE]
where is convex and -Lipschitz with respect to some norm , is a smooth map, and . Now consider a fixed additive error vector , and the perturbed problem
[TABLE]
The triangle inequality immediately implies that the perturbed problem is sharp up to tolerance :
[TABLE]
In particular, any minimizer of satisfies
[TABLE]
where as before we set . In this section, we show that subgradient and prox-linear algorithms applied to the perturbed problem (9.1) converge rapidly up to a tolerance on the order of . To see the generality of the above approach, we note that even the robust recovery problems considered in Section 4.2.2, in which a constant fraction of measurements are already corrupted, may be further corrupted through additive error vector . We will study this problem in detail in Section 9.1.
Throughout the rest of the section, let us define the noise level:
[TABLE]
Mirroring the discussion in Section 5, define the annulus:
[TABLE]
for some . Note that for the annulus to be nonempty, we must ensure . We will see that serves as a region of rapid convergence for some numerical constant . As before, we also define subgradient bound and the condition measure:
[TABLE]
In all examples considered in the paper, it is possible to show directly that as defined in Assumption D. A similar result is true in the general case, as well. Indeed, the following Lemma provides a bound for in terms of the subgradients of on a slight expansion of the tube from (5.2); the proof appears in the appendix.
Lemma 9.1**.**
Suppose so that is nonempty. Then the following bound holds:
[TABLE]
We will now design algorithms whose basin of attraction is the annulus for some . To that end, the following modified sharpness bound will be useful for us. The reader should be careful to note the appearance of , not in the following bound.
Lemma 9.2** (Approximate sharpness).**
We have the following bound:
[TABLE]
Proof.
For any , observe as claimed. ∎
Next, we show that satisfies the following approximate subgradient inequality.
Lemma 9.3** (Approximate subgradient inequality).**
The following bound holds:
[TABLE]
Proof.
First notice that for all . Furthermore, we have . Therefore, it follows that for any we have
[TABLE]
as desired. ∎
Now consider the following modified Polyak method. It is important to note that the stepsize assumes knowledge of rather than . This distinction is important because it often happens that , whereas is in general unknown; for example, consider any noiseless problem analyzed thus far. We note that the standard Polyak subgradient method may also be applied to without any changes and has similar theoretical guarantees. The proof appears in the appendix.
Theorem 9.4** (Polyak subgradient method).**
Suppose that Assumption D holds and suppose that . Then Algorithm 4 initialized at any point produces iterates that converge -linearly to up to tolerance , that is
[TABLE]
Next we provide theoretical guarantees for Algorithm 5.3, where one does not know the optimal value . The proof of this result is a straightforward modification of [23, Theorem 6.1] based on the Lemmas 9.2 and 9.3, and therefore we omit it.
Theorem 9.5** (Geometrically decaying subgradient method).**
Suppose that Assumption D holds, fix a real number , and suppose . Suppose also so that is nonempty. Set in Algorithm 2. Then the iterates generated by Algorithm 2 on the perturbed problem (9.1), initialized at a point , satisfy:
[TABLE]
Finally, we analyze the prox-linear algorithm applied to the problem . In contrast to the Polyak method, one does not need to know the optimal value . The proof appears in the appendix.
Theorem 9.6** (Prox-linear algorithm).**
Suppose Assumptions A holds. Choose any in Algorithm 3 applied to the perturbed problem (9.1) and set . Suppose moreover so that is nonempty. Then Algorithm 3 initialized at any point converges quadratically up to tolerance :
[TABLE]
9.1 Example: sparse outliers and dense noise under RIP
To further illustrate the ideas of this section, we now generalize the results of Section 4.2.2, in particular Assumption C, to the following observation model.
Assumption K** (-outlier bounds).**
There exists vectors , a set , and a constant such that the following hold.
. 2.
Equality holds for all . 3.
For all matrices of rank at most , we have
[TABLE]
Given these assumptions we follow the notation of the previous section and let
[TABLE]
Then we have the following proposition:
Proposition 9.7**.**
Suppose Assumption B and K are valid. Then the following hold:
(Sharpness)* We have*
[TABLE] 2. 2.
(Weak Convexity)* The function is -weakly convex.* 3. 3.
(Minimizers)* All minimizers of satisfy*
[TABLE] 4. 4.
(Lipschitz Bound)* We have the bound*
[TABLE]
Proof.
Sharpness follows from Proposition 4.6, while weak convexity follows from Proposition 4.1. The minimizer bound follows from (9.2). Finally, due to Lemma 3.1, the argument given in Proposition (4.1), but applied instead to , guarantees that
[TABLE]
In turn the supremum may be bounded as follows: Let denote the closest point to in . Then
[TABLE]
as desired. ∎
In particular, combining Proposition 9.7 with the previous results in this section, we deduce the following. As long as the noise satisfies
[TABLE]
for a sufficiently small constant , the subgradient and prox-linear methods converge rapidly to within tolerance
[TABLE]
when initialized at a matrix satisfying
[TABLE]
for some small constant . The formal statement is summarized in the following corollary.
Corollary 9.8** (Convergence guarantees under RIP with sparse outliers and dense noise (symmetric)).**
Suppose Assumptions B is and K are valid with and define the condition number . Then there exists numerical constants such that the following hold. Suppose the noise level satisfies
[TABLE]
and define the tolerance
[TABLE]
Then as long as the matrix satisfies
[TABLE]
the following are true.
(Polyak subgradient)* Algorithm 1 initialized at produces iterates that converge linearly to , that is*
[TABLE] 2. 2.
(geometric subgradient)* Algorithm 2 with , and initialized at converges linearly:*
[TABLE] 3. 3.
(prox-linear)* Algorithm 3 with and initialized at converges quadratically:*
[TABLE]
10 Numerical Experiments
In this section, we demonstrate the theory and algorithms developed in the previous sections on a number of low-rank matrix recovery problems, namely quadratic and bilinear sensing, low rank matrix completion, and robust PCA.
10.1 Robustness to outliers
In our first set of experiments, we empirically test the robustness of our optimization methods to outlying measurements. We generate phase transition plots, where each pixel corresponds to the empirical probability of successful recovery over test runs using randomly generated problem instances. Brighter pixels represent higher recovery rates. All generated instances obey the following:
The initial estimate is specified reasonably close to the ground truth. In particular, given a target symmetric positive semidefinite matrix , we set
[TABLE]
Here, is a scalar that controls the quality of initialization and is a random unit “direction”. The asymmetric setting is completely analogous. 2. 2.
When using the subgradient method with geometrically decreasing step-size, we set . 3. 3.
For the quadratic sensing, bilinear sensing, and matrix completion problems, we mark a test run as a success when the normalized distance is less than . Here we set in the symmetric setting and in the asymmetric setting. For the robust PCA problem, we stop when .
Moreover, we set the seed of the random number generator at the beginning of each batch of experiments to enable reproducibility.
Quadratic and Bilinear sensing.
Figures 2 and 3 depict the phase transition plots for bilinear (6.5) and symmetrized quadratic (6.4) sensing formulations using Gaussian measurement vectors. In the experiments, we corrupt a fraction of measurements with additive Gaussian noise of unit entrywise variance. Empirically, we observe that increasing the variance of the additive noise does not affect recovery rates. Both problems exhibit a sharp phase transition at very similar scales. Moreover, increasing the rank of the generating signal does not seem to dramatically affect the recovery rate for either problem. Under additive noise, we can recover the true signal (up to natural ambiguity) even if we corrupt as much as half of the measurements.
Robust PCA.
We generate robust PCA instances for and . The corruption matrix follows the assumptions in Section 8.2, where for simplicity we set . We observed that increasing or decreasing the variance did not affect the probability of successful recovery, so our experiments use . We use the subgradient method, Algorithm 3, and the prox-linear algorithm (8.5). Notice that we have not presented any guarantees for the subgradient method on this problem, in contrast to the prox-linear method. The subproblems for the prox-linear method are solved by ADMM with graph splitting as in [48]. We set tolerance for the proximal subproblems, which we continue solve for at most iterations. We choose in all subproblems. The phase transition plots are shown in Figure 4. It appears that the prox-linear method is more robust to additive sparse corruption, since the empirical recovery rate for the subgradient method decays faster as the rank increases.
Matrix completion.
We next perform experiments on the low-rank matrix completion problem that test successful recovery against the sampling frequency. We generate random instances of the problem, where we let the probability of observing an entry, , range in with increments of . Figure 5 depicts the empirical recovery rate using the Polyak subgradient method and the modified prox-linear algorithm (7.1). Similarly to the quadratic/bilinear sensing problems, low-rank matrix completion exhibits a sharp phase transition. As predicted in Section 7, the ratio appears to be driving the required observation probability for successful recovery. Finally, we empirically observe that the prox-linear method can “tolerate” slightly smaller sampling frequencies.
10.2 Convergence behavior
We empirically validate the rapid convergence guarantees of the subgradient and prox-linear methods, given a proper initialization. Moreover, we compare the subgradient method with gradient descent, i.e. gradient descent applied to a smooth formulation of each problem, using the same initial estimate in the noiseless setting. In all the cases below, the step sizes for the gradient method were tuned for best performance. Moreover, we noticed that the gradient descent method, equipped with the Polyak step size performed at least as well as gradient descent with constant step size. That being said, we were unable to locate any theoretical guarantees in the literature for gradient descent with the Polyak step-size for the problems we consider here.
Quadratic and Bilinear sensing.
For the quadratic and bilinear sensing problems, we apply gradient descent on the smooth formulations
[TABLE]
In Figure 6, we plot the performance of Algorithm 2 for matrix sensing problems with different rank / corruption level; remarkably, the level of noise does not significantly affect the rate of convergence. Additionally, the convergence behavior is almost identical for the two problems for similar rank/noise configurations. Figure 7 depicts the behavior of Algorithm 1 versus gradient descent with empirically tuned step sizes. The subgradient method significantly outperforms gradient descent. For completeness, we also depict the convergence rate of Algorithm 3 for both problems in Figure 8, where we solve the proximal subproblems approximately.
Matrix completion.
In our comparison with smooth methods, we apply gradient descent on the following minimization problem:
[TABLE]
Figure 9 depicts the convergence behavior of Algorithm 1 (solid lines) versus gradient descent applied to Problem (10.1) with a tuned step size (dashed lines), initialized under the same conditions for low-rank matrix completion instances. As the theory suggests, higher sampling frequency implies better convergence rates. The subgradient method outperforms gradient descent in all regimes.
Figure 10 depicts the performance of the modified prox-linear method (7.1) in the same setting as Figure 9. In most cases, the prox-linear algorithm converges within just iterations, at what appears to be a rapid linear rate of convergence. Each convex subproblem is solved using a variant of the graph-splitting ADMM algorithm [48].
Robust PCA.
For the robust PCA problem, we consider different rank/corruption level configurations to better understand how they affect convergence for the subgradient and prox-linear methods, using the non-Euclidean formulation of Section 8.2. We depict all configurations in the same plot for a fixed optimization algorithm to better demonstrate the effect of each parameter, as shown in Figure 11. The parameters of the prox-linear method are chosen in the same way reported in Section 10.1. In particular, our numerical experiments appear to support our sharpness Conjecture 8.6 for the robust PCA problem.
10.2.1 Recovery up to tolerance
In this last section, we test the performance of the prox-linear method and the modified Polyak subgradient method (Algorithm 4) for the quadratic sensing and matrix completion problems, under a dense noise model of Section 9. In the former setting, we set , so 1/4th of our measurements is corrupted with large magnitude noise. For matrix completion, we observe of the entries. In both settings, we add Gaussian noise which is rescaled to satisfy and test . The relevant plots can be found in Figures 12 and 13. The numerical experiments fully support the developed theory, with the iterates converging rapidly up to the tolerance that is proportional to the noise level. Incidentally, we observe that the modified prox-linear method (7.1) is more robust to additive noise for the matrix completion problem, with Algorithm 4 exhibiting heavy fluctuations and failing to converge for the highest level of dense noise.
Appendix A Proofs in Section 5
In this section, we prove rapid local convergence guarantees for the subgradient and prox-linear algorithms under regularity conditions that hold only locally around a particular solution. We will use the Euclidean norm throughout this section; therefore to simplify the notation, we will drop the subscript two. Thus denotes the on a Euclidean space throughout.
We will need the following quantitative version of Lemma 5.1.
Lemma A.1**.**
Suppose Assumption E holds and let be arbitrary. Then for any point , the estimate holds:
[TABLE]
Proof.
Consider any point satisfying . Let be arbitrary and note . Thus for any we deduce
[TABLE]
Therefore we deduce the lower bound on the subgradients as claimed. ∎
A.1 Proof of Theorem 5.6
Let be the first index (possibly infinite) such that . We claim that (5.4) holds for all . We show this by induction. To this end, suppose (5.4) holds for all indices up to . In particular, we deduce . Let and note , since
[TABLE]
Thus we deduce
[TABLE]
Here, the estimate (A.1) follows from the fact that the projection is nonexpansive, (A.2) uses local weak convexity, (A.4) follow from the estimate , while (A.3) and (A.5) use local sharpness. We therefore deduce
[TABLE]
Thus (5.4) holds for all indices up to . We next show that is infinite. To this end, observe
[TABLE]
where (A.7) follows by Lemma A.1 with , the bound in (A.8) follows by (A.6) and the assumption on finally (A.9) holds thanks to (A.6). Thus applying the triangle inequality we get the contradiction . Consequently all the iterates for lie in and satisfy (5.4).
Finally, let be any limit point of the sequence . We then successively compute
[TABLE]
This completes the proof.
A.2 Proof of Theorem 5.7
Fix an arbitrary index and observe
[TABLE]
Hence, we conclude the uniform bound on the iterates:
[TABLE]
and the R-linear rate of convergence
[TABLE]
where is any limit point of the iterate sequence.
Let us now show that the iterates do not escape . To this end, observe
[TABLE]
We must therefore verify the estimate , or equivalently Clearly, it suffices to verify which holds by the definition of . Thus all the iterates lie in . Moreover , the rest of the proof is identical to that in [23, Theorem 5.1].
A.3 Proof of Theorem 5.8
Fix any index such that and let be arbitrary. Since the function is -strongly convex and is its minimizer, we deduce
[TABLE]
Setting and appealing to approximation accuracy, we obtain the descent guarantee
[TABLE]
In particular, the function values are decreasing along the iterate sequence. Next choosing any and setting in (A.10) yields
[TABLE]
Appealing to approximation accuracy and lower-bounding by zero, we conclude
[TABLE]
Using sharpness we deduce the contraction guarantee
[TABLE]
where the last inequality uses the assumption . Let be the first index satisfying . We then deduce
[TABLE]
where (A.14) follows from (A.11) and (A.15) follows from (A.13). Thus we conclude , which is a contradiction. Therefore all the iterates , for , lie in . Combing this with (A.12) and sharpness yields the claimed quadratic converge guarantee
[TABLE]
Finally, let be any limit point of the sequence . We then deduce
[TABLE]
where (A.16) follows from (A.13). The theorem is proved.
Appendix B Proofs in Section 6
B.1 Proof of Lemma 6.3
In order to prove that the assumption in each case, we will prove a stronger “small-ball condition” [44, 43], which immediately implies the claimed lower bounds on the expectation by Markov’s inequality. More precisely, we will show that there exist numerical constants such that
(Matrix Sensing)
[TABLE] 2. 2.
(Quadratic Sensing I)
[TABLE] 3. 3.
(Quadratic Sensing II)
[TABLE] 4. 4.
(Bilinear Sensing)
[TABLE]
These conditions immediately imply Assumptions G-J. Indeed, by Markov’s inequality, in the case of matrix sensing we deduce
[TABLE]
The same reasoning applies to all the other problems.
Matrix sensing.
Consider any matrix with Then, since follows a standard normal distribution, we may set to be the median of and to obtain
[TABLE]
Quadratic Sensing I.
Fix a matrix with and . Let be an eigenvalue decomposition of . Using the rotational invariance of the Gaussian distribution, we deduce
[TABLE]
where denotes equality in distribution. Next, let be a standard normal variable. We will now invoke Proposition F.2. Let be the numerical constant appearing in the proposition. Notice that the function given by
[TABLE]
is continuous and strictly increasing, and it satisfies and Hence we may set . Proposition F.2 then yields
[TABLE]
By taking the supremum of both sides of the inequality we conclude that Assumption H holds with and
Quadratic sensing II.
Let be an eigenvalue decomposition of . Using the rotational invariance of the Gaussian distribution, we deduce
[TABLE]
where the last relation follows since are independent standard normal random variables with mean zero and variance two. We will now invoke Proposition F.2. Let be the numerical constant appearing in the proposition. Let and be independent standard normal variables. Notice that the function given by
[TABLE]
is continuous, strictly increasing, satisfies and approaches one at infinity. Defining and applying Proposition F.2, we get
[TABLE]
By taking the supremum of both sides of the inequality we conclude that Assumption I holds with and
We omit the details for the bilinear case, which follow by similar arguments.
B.2 Proof of Theorem 6.4
The proofs in this section rely on the following proposition, which shows that that pointwise concentration imply uniform concentration. We defer the proof to Appendix B.3.
Proposition B.1**.**
Let be a random linear mapping with property that for any fixed matrix of rank at most with norm and any fixed subset of indices satisfying , the following hold:
The measurements are i.i.d. 2.
RIP holds in expected value:
[TABLE]
where is a universal constant and is a positive-valued function that could potentially depend on the rank of . 3.
There exist a universal constant and a positive-valued function such that for any the deviation bound
[TABLE]
holds with probability at least
Then, there exist universal constants depending only on and such that if is a fixed subset of indices satisfying and
[TABLE]
then with probability at least every matrix of rank at most satisfies
[TABLE]
and
[TABLE]
Due to scale invariance of the above result, we need only verify its assumptions in the case that . We implicitly use this observation below.
B.2.1 Part 1 of Theorem 6.4 (Matrix sensing)
Lemma B.2**.**
The random variable is sub-gaussian with parameter Consequently,
[TABLE]
Moreover, there exists a universal constant such that for any the deviation bound
[TABLE]
holds with probability at least
Proof.
Assumption G immediately implies the lower bound in (B.5). To prove the upper bound, first note that by assumption we have
[TABLE]
This bound has two consequences, first is a sub-gaussian random variable with parameter and second [58, Proposition 2.5.2]. Thus, we have proved (B.5).
To prove the deviation bound (B.6) we introduce the random variables
[TABLE]
Since is sub-gaussian, we have for all see [58, Lemma 2.6.8]. Hence, Hoeffding’s inequality for sub-gaussian random variables [58, Theorem 2.6.2] gives the desired upper bound on ∎
Applying Proposition B.1 with and now yields the result. ∎
B.2.2 Part 2 of Theorem 6.4 (Quadratic sensing I)
Lemma B.3**.**
The random variable is sub-exponential with parameter Consequently,
[TABLE]
Moreover, there exists a universal constant such that for any the deviation bound
[TABLE]
holds with probability at least
Proof.
Assumption H gives the lower bound in (B.7). To prove the upper bound, first note that where and are the th singular values and vectors of , respectively. Hence
[TABLE]
where the first inequality follows since is a norm, the second one follows since [58, Lemma 2.7.7], and the third inequality holds since . This bound has two consequences, first is a sub-exponential random variable with parameter and second [58, Exercise 2.7.2]. Thus, we have proved (B.7).
To prove the deviation bound (B.8) we introduce the random variables
[TABLE]
Since is sub-exponential, we have for all see [58, Exercise 2.7.10]. Hence, Bernstein inequality for sub-exponential random variables [58, Theorem 2.8.2] gives the desired upper bound on ∎
Applying Proposition B.1 with with and now yields the result. ∎
B.2.3 Part 3 of Theorem 6.4 (Quadratic sensing II)
Lemma B.4**.**
The random variable is sub-exponential with parameter Consequently,
[TABLE]
Moreover, there exists a universal constant such that for any the deviation bound
[TABLE]
holds with probability at least
Proof.
Assumption I implies the lower bound in (B.9). To prove the upper bound, we will show that . By definition of the Orlicz norm for any random variable hence without loss of generality we may remove the absolute value. Recall that where and are the th singular values and vectors of , respectively. Hence, the random variable of interest can be rewritten as
[TABLE]
By assumption the random variables are -sub-gaussian, this implies that are -sub-exponential, since .
Recall the following characterization of the Orlicz norm for mean-zero random variables
[TABLE]
where the see [58, Proposition 2.7.1]. To prove that the random variable (B.11) is sub-exponential we will exploit this characterization. Since each inner product squared is sub-exponential, the equivalence implies the existence of a constant for which the uniform bound
[TABLE]
holds. Let be an arbitrary scalar with , then by expanding the moment generating function of (B.11) we get
[TABLE]
where the inequality follows by (B.13) and the last relation follows since is unit norm. Combining this with (B.12) gives
[TABLE]
This bound has two consequences, first is a sub-exponential random variable with parameter and second [58, Exercise 2.7.2]. Thus, we have proved (B.9).
To prove the deviation bound (B.10) we introduce the random variables
[TABLE]
The sub-exponentiality of implies for all see [58, Exercise 2.7.10]. Hence, Bernstein inequality for sub-exponential random variables [58, Theorem 2.8.2] gives the desired upper bound on ∎
Applying Proposition B.1 with and now yields the result. ∎
B.2.4 Part 4 of Theorem 6.4 (Bilinear sensing)
Lemma B.5**.**
The random variable is sub-exponential with parameter Consequently,
[TABLE]
Moreover, there exists a universal constant such that for any the deviation bound
[TABLE]
holds with probability at least
Proof.
As before the lower bound in (B.14) is implied by Assumption J. To prove the upper bound, we will show that . By definition of the Orlicz norm for any random variable hence we may remove the absolute value. Recall that where and are the th singular values and vectors of , respectively. Hence, the random variable of interest can be rewritten as
[TABLE]
By assumption the random variables and are -sub-gaussian, this implies that are -sub-exponential.
To prove that the random variable (B.16) is sub-exponential we will again use (B.12). Since each random variable is sub-exponential, the equivalence implies the existence of a constant for which the uniform bound
[TABLE]
holds. Let be an arbitrary scalar with , then by expanding the moment generating function of (B.16) we get
[TABLE]
where the inequality follows by (B.17) and the last relation follows since is unitary. Combining this with (B.12) gives
[TABLE]
Thus, we have proved (B.14).
Once again, to show the deviation bound (B.15) we introduce the random variables
[TABLE]
and apply Bernstein’s inequality for sub-exponential random variables [58, Theorem 2.8.2] to get the stated upper bound on ∎
Applying Proposition B.1 with and now yields the result. ∎
B.3 Proof of Proposition B.1
Choose and let be the ()-net guaranteed by Lemma F.1. Pick some so that (B.2) can hold, we will fix the value of this parameter later in the proof. Let denote the event that the following two estimates hold for all matrices in :
[TABLE]
Throughout the proof, we will assume that the event holds. We will estimate the probability of at the end of the proof. Meanwhile, seeking to establish RIP, define the quantity
[TABLE]
We aim first to provide a high probability bound on .
Let be arbitrary and let be the closest point to in . Then we have
[TABLE]
where (B.20) follows from (B.19) and (B.21) follows from the triangle inequality. To simplify the third term in (B.21), using SVD, we deduce that there exist two orthogonal matrices of rank at most satisfying With this decomposition in hand, we compute
[TABLE]
where the second inequality follows from the definition of and the estimate Thus, we arrive at the bound
[TABLE]
As was arbitrary, we may take the supremum of both sides of the inequality, yielding . Rearranging yields the bound
[TABLE]
Assuming that , we further deduce that
[TABLE]
establishing that the random variable is bounded by in the event .
Now let denote either or . We now provide a uniform lower bound on . Indeed,
[TABLE]
where (B.25) uses the forward and reverse triangle inequalities, (B.26) follows from (B.18), the estimate (B.27) follows from the forward and reverse triangle inequalities, and (B.28) follows from (B.22) and (B.24). Switching the roles of and in the above sequence of inequalities, and choosing , we deduce
[TABLE]
In particular, setting , we deduce
[TABLE]
and therefore using (B.1), we conclude the RIP property
[TABLE]
Next, let and note that
[TABLE]
where the equality follows by assumption . Therefore every satisfies
[TABLE]
Setting in (B.29) and (B.30), we deduce the claimed estimates (B.3) and (B.4). Finally, let us estimate the probability of . Using the union bound and Lemma F.1 yields
[TABLE]
where is the function guaranteed by assumption .
By (B.1) we get . Then we deduce
[TABLE]
Hence as long as , we can be sure
[TABLE]
Proving the desired result. ∎
Appendix C Proof in Section 7
C.1 Proof of Lemma 7.4
Define . Fix an iteration and choose . Then the estimate holds:
[TABLE]
Rearranging and using the sharpness and approximation accuracy assumptions, we deduce
[TABLE]
The result follows.
C.2 Proof of Theorem 7.6
First notice that for any , we have . Therefore, since is a convex function, we have that for all and , the bound
[TABLE]
Consequently, given that , we have
[TABLE]
Here, the estimate (C.2) follows from the fact that the projection is nonexpansive, (C.3) uses the bound in (C.1), (C.5) follow from the estimate , while (C.4) and (C.6) use local sharpness. The result then follows by the upper bound .
Appendix D Proofs in Section 8
D.1 Proof of Lemma 8.1
The inequality can be established using an argument similar to that for bounding the term in [19, Section 6.6]. We provide the proof below for completeness. Define the shorthand and , and let denote the -th standard basis vector of . Simple algebra gives
[TABLE]
We claim that for each . To see this, fix any and let , , and We have
[TABLE]
Rearranging terms gives , whence
[TABLE]
where step the second inequality holds because by assumption. The claim follows from noting that .
Using the claim, we get that
[TABLE]
Using a similar argument and the fact that , we obtain
[TABLE]
Putting everything together, we have
[TABLE]
The claim follows.
D.2 Proof of Theorem 8.5
Without loss of generality, suppose that is closer to than to . Consider the following expression:
[TABLE]
We now produce a few different lower bounds by testing against different . In what follows, we set , i.e., the positive solution of the equation .
Case 1:
Suppose that
[TABLE]
Then set , to get
[TABLE]
Case 2:
Suppose that
[TABLE]
Then set , to get
[TABLE]
Case 3:
Suppose that
[TABLE]
Define . Observe that
[TABLE]
and
[TABLE]
Putting these two bounds together, we find that
[TABLE]
Altogether, we find that
[TABLE]
as desired.
D.3 Proof of Lemma 8.7
We start by stating a claim we will use to prove the lemma. Let us introduce some notation. Consider the set
[TABLE]
Define the random variable
[TABLE]
Claim 1**.**
There exist constants such that with probability at least
[TABLE]
Before proving this claim, let us show how it implies the theorem. Let
[TABLE]
Set and . Notice that
[TABLE]
Therefore, because and
[TABLE]
we have that
[TABLE]
where the last line follows by Conjecture 8.6. This proves the desired result.
Proof of the Claim.
Our goal is to show that the random variable is highly concentrated around its mean. We may apply the standard symmetrization inequality [5, Lemma 11.4] to bound the expectation as follows:
[TABLE]
Observing that and can both be bounded by
[TABLE]
where the final inequality follows from Bernstein’s inequality and a union bound, we find that
[TABLE]
To prove that is well concentrated around , we apply Theorem F.3. To apply this theorem, we set and define the collection , where by
[TABLE]
We also bound
[TABLE]
and
[TABLE]
Therefore, due to Theorem F.3 there exists a constant so that with , we have that with probability , the bound
[TABLE]
where the last line follows since by assumption ∎
Appendix E Proofs in Section 9
E.1 Proof of Lemma 9.1
The proof follows the same strategy as [24, Theorem 6.1]. Fix and let . Then for all , we have, from Lemma 9.3, that
[TABLE]
Therefore, the function
[TABLE]
satisfies
[TABLE]
Now, for some to be determined momentarily, define
[TABLE]
First order optimality conditions and the sum rule immediately imply that
[TABLE]
Thus,
[TABLE]
Now we estimate . Indeed, from the definition of we have
[TABLE]
Consequently, we have . Thus, setting and recalling that we find that
[TABLE]
Likewise, we have
[TABLE]
Therefore, setting , we find that
[TABLE]
as desired.
E.2 Proof of Theorem 9.4
Let , suppose , and let . Notice that Lemma 9.2 implies . We successively compute
[TABLE]
Here, the estimate (E.1) follows from the fact that the projection is nonexpansive, (E.2) uses Lemma 9.3, the estimate (E.4) follows from the assumption , the estimate (E.5) follows from the estimate , while (E.3) and (E.6) use Lemma 9.2. We therefore deduce
[TABLE]
Consequently either we have or . Therefore, by induction, the proof is complete.
E.3 Proof of Theorem 9.6
Let , suppose , and let . Then
[TABLE]
Rearranging yields the result.
Appendix F Auxiliary lemmas
Lemma F.1** (Lemma 3.1 in [13]).**
Let . There exists an -net (with respect to ) of obeying
[TABLE]
Proposition F.2** (Corollary 1.4 in [54]).**
Consider real-valued random variables and let be a unit vector. Let such that
[TABLE]
Then the following holds
[TABLE]
where is a universal constant.
Theorem F.3** (Talagrand’s Functional Bernstein for non-identically distributed variables [36, Theorem 1.1(c)]).**
Let be a countable index set. Let be independent vector-valued random variables of the form . Let . Assume that for all and , and . Let
[TABLE]
Then for each , we have the tail bound
[TABLE]
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory , 60(3):1711–1732, 2014.
- 2[2] P. Albano and P. Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications , Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.
- 3[3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems , pages 3873–3881, 2016.
- 4[4] J.M. Borwein and A.S. Lewis. Convex analysis and nonlinear optimization . CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer-Verlag, New York, 2000. Theory and examples.
- 5[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence . Oxford university press, 2013.
- 6[6] J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming , 33(3):260–279, 1985.
- 7[7] J.V. Burke and M.C. Ferris. A Gauss-Newton method for convex composite optimization. Math. Programming , 71(2, Ser. A):179–194, 1995.
- 8[8] T.T. Cai and A. Zhang. ROP: matrix recovery via rank-one projections. Ann. Statist. , 43(1):102–138, 2015.
