Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

Vasileios Charisopoulos; Yudong Chen; Damek Davis; Mateo Díaz; Lijun Ding; Dmitriy Drusvyatskiy

arXiv:1904.10020·math.OC·April 24, 2019


TL;DR

This paper demonstrates that nonsmooth formulations for low-rank matrix recovery are better conditioned and enable faster, dimension-independent convergence of optimization algorithms, also offering robustness to outliers.

Contribution

It shows that nonsmooth penalty formulations avoid ill-conditioning in low-rank matrix recovery, leading to rapid convergence and robustness, unifying several key problems.

Findings

01. Nonsmooth formulations have better conditioning than smooth ones.

02. Standard algorithms converge rapidly and dimension-independently.

03. Nonsmooth methods are robust to outliers.

Abstract

The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA.…

Tables

Table 1: Common problems satisfying the $\ell_{1}/\ell_{2}$ RIP in Assumption B. The table summarizes the $\ell_{1}/\ell_{2}$ RIP for (sub-)Gaussian sensing, quadratic sensing (e.g., phase retrieval), and bilinear sensing (e.g., blind deconvolution) under standard (sub-)Gaussian assumptions on the data generating mechanism. In all cases, we set $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ and assume for simplicity $d_{1}=d_{2}=d$. The symbols $c$ and $C$ refer to numerical constants, $p_{\mathrm{fail}}$ refers to the proportion of corrupted measurements, and $\kappa_{3}$ is a constant multiple of $\left(1-2p_{\mathrm{fail}}\right)$. See Section 6 for details.

Problem	Measurement $\mathcal{A}(M)_{i}$	$(\kappa_{1},\kappa_{2})$	Regime
(sub-)Gaussian sensing	$\langle P_{i},M\rangle$	$(c,C)$	$m\gtrsim\frac{rd}{(1-2p_{\mathrm{fail}})^{2}}\ln\left(1+\frac{1}{1-2p_{\mathrm{fail}}}\right)$
Quadratic sensing I	$p_{i}^{\top}Mp_{i}$	$(c,C\sqrt{r})$	$m\gtrsim\frac{r^{2}d}{(1-2p_{\mathrm{fail}})^{2}}\ln\left(1+\frac{\sqrt{r}}{1-2p_{\mathrm{fail}}}\right)$
Quadratic sensing II	$p_{i}^{\top}Mp_{i}-\tilde{p}_{i}^{\top}M\tilde{p}_{i}$	$(c,C)$	$m\gtrsim\frac{rd}{(1-2p_{\mathrm{fail}})^{2}}\ln\left(1+\frac{1}{1-2p_{\mathrm{fail}}}\right)$
Bilinear sensing	$p_{i}^{\top}Mq_{i}$	$(c,C)$	$m\gtrsim\frac{rd}{(1-2p_{\mathrm{fail}})^{2}}\ln\left(1+\frac{1}{1-2p_{\mathrm{fail}}}\right)$

Equations

$$\min_{X\in{\bf R}^{d\times r}}~f(X):=h\big(\mathcal{A}(XX^{\top})-b\big)\quad\text{subject to}\quad X\in\mathcal{D},$$

$$\min_{X\in{\bf R}^{d_{1}\times r},\,Y\in{\bf R}^{r\times d_{2}}}~f(X,Y):=h\big(\mathcal{A}(XY)-b\big)\quad\text{subject to}\quad(X,Y)\in\mathcal{D}.$$

$$\min_{x\in\mathcal{X}}~f(x):=h(F(x)),$$

$$x_{t+1}={\rm proj}_{\mathcal{X}}(x_{t}-\alpha_{t}v_{t})\quad\text{with}\quad v_{t}\in\partial f(x_{t}),$$

$$x_{t+1}=\operatornamewithlimits{argmin}_{x\in\mathcal{X}}~h\Big(F(x_{t})+\nabla F(x_{t})(x-x_{t})\Big)+\frac{\beta}{2}\|x-x_{t}\|^{2}_{2}.$$

$$\mathcal{T}=\Big\{x\in\mathcal{X}:{\rm dist}(x,\mathcal{X}^{*})\leq\frac{\mu}{\rho}\Big\}.$$

$$\kappa_{1}\|W\|_{F}\leq\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\mathcal{A}(W)\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}\leq\kappa_{2}\|W\|_{F},$$

$$\mu=0.9\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})},\qquad\rho=\kappa_{2},\qquad L=0.9\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}+2\kappa_{2}\sqrt{\sigma_{1}(M_{\sharp})}.$$

$$\operatornamewithlimits{argmin}_{X\in\mathcal{X}}~f(X)=\|\Pi_{\Omega}(XX^{\top})-\Pi_{\Omega}(M_{\sharp})\|_{2},$$

$$\mathcal{X}=\left\{X\in{\bf R}^{d\times r}:\|X\|_{2,\infty}\leq\sqrt{\frac{\nu r\|M_{\sharp}\|_{\rm op}}{d}}\right\}$$

$$|f(Y)-f_{X}(Y)|\leq\sqrt{1+\epsilon}\,\|Y-X\|_{2}^{2}+\sqrt{\epsilon}\,\|X-Y\|_{F}\quad\text{for all }X,Y\in\mathcal{X}.$$

$$X_{k+1}=\operatornamewithlimits{argmin}_{X\in\mathcal{X}}~f_{X_{k}}(X)+\sqrt{1+\epsilon}\,\|X-X_{k}\|_{F}^{2}+\sqrt{\epsilon}\,\|X-X_{k}\|_{F}.$$

$${\rm dist}(X_{k},\mathcal{X}^{*})\lesssim\begin{cases}\left(1-\frac{c}{\nu r}\right)^{k/2}&\text{subgradient}\\2^{-k}&\text{prox-linear}\end{cases}.$$

$$\min_{(X,S)\in\mathcal{D}_{1}}~F((X,S))=\|XX^{\top}+S-W\|_{F},$$

$$\min_{X\in\mathcal{D}_{2}}~f(X)=\|XX^{\top}-W\|_{1},$$

$$|f(Y)-f_{X}(Y)|\leq\|Y-X\|_{2,1}^{2}\quad\text{for all }X,Y$$

$${\rm dist}(x;Q)=\inf_{y\in Q}\|x-y\|_{2}\quad\text{and}\quad{\rm proj}_{Q}(x)=\operatornamewithlimits{argmin}_{y\in Q}\|x-y\|_{2},$$

$$f(y)\geq f(x)+\langle\xi,y-x\rangle+o(\|y-x\|_{2})\quad\text{as }y\to x.$$

$$\partial(h\circ F+g)(x)=\nabla F(x)^{*}\partial h(F(x))+\partial g(x).$$

$$|f(y)-f_{x}(y)|\leq\frac{\rho}{2}\|y-x\|_{2}^{2}\qquad\forall x,y\in\mathcal{X}.$$

$$f(x)-\inf_{\mathcal{X}}f\geq\mu\cdot{\rm dist}(x,\mathcal{X}^{*})\qquad\forall x\in\mathcal{X}.$$

$$\mathcal{T}:=\Big\{x\in\mathcal{X}:{\rm dist}(x,\mathcal{X}^{*})\leq\frac{\mu}{\rho}\Big\}.$$

$$\{X\in{\bf R}^{d_{1}\times r}:XX^{\top}=M_{\sharp}\}$$

$$\{(X,Y)\in{\bf R}^{d_{1}\times r}\times{\bf R}^{r\times d_{2}}:XY=M_{\sharp}\}$$

$$\min_{X\in{\bf R}^{d\times r}}~f(X):=\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\mathcal{A}(XX^{\top})-b\mathclose{|\mkern-1.5mu|\mkern-1.5mu|},$$

$$\min_{X\in{\bf R}^{d_{1}\times r},\,Y\in{\bf R}^{r\times d_{2}}}~f(X,Y):=\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\mathcal{A}(XY)-b\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}.$$

$$f_{X}(Z)$$

$$f_{(X,Y)}(X,Y)$$

$$|f(Z)-f_{X}(Z)|$$


Taxonomy

Methods: Principal Components Analysis

Full text

Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, Dmitriy Drusvyatskiy

Vasileios Charisopoulos: School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/vc333/

Yudong Chen: School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/yudong.chen/

Damek Davis: School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/dsd95/

Mateo Díaz: CAM, Cornell University, Ithaca, NY 14850, USA; people.cam.cornell.edu/md825/

Lijun Ding: School of ORIE, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/ld446/

Dmitriy Drusvyatskiy: Department of Mathematics, U. Washington, Seattle, WA 98195; www.math.washington.edu/~ddrusv. Research of Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.

Abstract

The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.

1 Introduction
2 Preliminaries
3 Regularity conditions and algorithms (informal)
4 Regularity under RIP
4.1 Approximation and Lipschitz continuity
4.2 Sharpness
4.2.1 Sharpness in the noiseless regime
4.2.2 Sharpness in presence of outliers
5 General convergence guarantees for subgradient & prox-linear methods
5.1 Guarantees under local regularity
6 Examples of $\ell_{1}/\ell_{2}$ RIP
6.1 Warm-up: $\ell_{2}/\ell_{2}$ RIP for matrix sensing with Gaussian design
6.2 The $\ell_{1}/\ell_{2}$ RIP and $\mathcal{I}$ -outlier bounds: quadratic and bilinear sensing
7 Matrix Completion
8 Robust PCA
8.1 The Euclidean formulation
8.2 The non-Euclidean formulation
9 Recovery up to a Tolerance
9.1 Example: sparse outliers and dense noise under $\ell_{1}/\ell_{2}$ RIP
10 Numerical Experiments
10.1 Robustness to outliers
10.2 Convergence behavior
10.2.1 Recovery up to tolerance
A Proofs in Section 5
A.1 Proof of Theorem 5.6
A.2 Proof of Theorem 5.7
A.3 Proof of Theorem 5.8
B Proofs in Section 6
B.1 Proof of Lemma 6.3
B.2 Proof of Theorem 6.4
B.2.1 Part 1 of Theorem 6.4 (Matrix sensing)
B.2.2 Part 2 of Theorem 6.4 (Quadratic sensing I)
B.2.3 Part 3 of Theorem 6.4 (Quadratic sensing II)
B.2.4 Part 4 of Theorem 6.4 (Bilinear sensing)
B.3 Proof of Proposition B.1
C Proof in Section 7
C.1 Proof of Lemma 7.4
C.2 Proof of Theorem 7.6
D Proofs in Section 8
D.1 Proof of Lemma 8.1
D.2 Proof of Theorem 8.5
D.3 Proof of Lemma 8.7
E Proofs in Section 9
E.1 Proof of Lemma 9.1
E.2 Proof of Theorem 9.4
E.3 Proof of Theorem 9.6
F Auxiliary lemmas

1 Introduction

Recovering a low-rank matrix from noisy linear measurements has become an increasingly central task in data science. Important and well-studied examples include phase retrieval [55, 12, 42], blind deconvolution [1, 38, 41, 57], matrix completion [9, 21, 56], covariance matrix estimation [18, 40], and robust principal component analysis [15, 11]. Optimization-based approaches for low-rank matrix recovery naturally lead to nonconvex formulations, which are NP-hard in general. To overcome this issue, in the last two decades researchers have developed convex relaxations that succeed with high probability under appropriate statistical assumptions. Convex techniques, however, have a well-documented limitation: the parameter space describing the relaxations is usually much larger than that of the target problem. Consequently, standard algorithms applied to convex relaxations may not scale well to large problems. As a result, there has been a renewed interest in directly optimizing nonconvex formulations with iterative methods within the original parameter space of the problem. Aside from a few notable exceptions on specific problems [33, 3, 32], most algorithms of this type proceed in two stages. The first stage (initialization) yields a rough estimate of an optimal solution, often using spectral techniques. The second stage (local refinement) uses a local search algorithm that rapidly converges to an optimal solution when initialized at the output of the initialization stage.

This work focuses on developing provable low-rank matrix recovery algorithms based on nonconvex problem formulations. We focus primarily on local refinement and describe a set of unifying sufficient conditions leading to rapid local convergence of iterative methods. In contrast to the current literature on the topic, which typically relies on smooth problem formulations and gradient-based methods, our primary focus is on nonsmooth formulations that exhibit sharp growth away from the solution set. Such formulations are well-known in the nonlinear programming community to be amenable to rapidly convergent local-search algorithms. Along the way, we will observe an apparent benefit of nonsmooth formulations over their smooth counterparts. All nonsmooth formulations analyzed in this paper are “well-conditioned,” resulting in fast “out-of-the-box” convergence guarantees. In contrast, standard smooth formulations for the same recovery tasks can be poorly conditioned, in the sense that classical convergence guarantees of nonlinear programming are overly pessimistic. Overcoming the poor conditioning typically requires nuanced problem- and algorithm-specific analysis (e.g. [57, 42, 46, 17]), which nonsmooth formulations manage to avoid for the problems considered here.

Setting the stage, consider a rank $r$ matrix $M_{\sharp}\in{\bf R}^{d_{1}\times d_{2}}$ and a linear map $\mathcal{A}\colon{\bf R}^{d_{1}\times d_{2}}\to{\bf R}^{m}$ from the space of matrices to the space of measurements. The goal of low-rank matrix recovery is to recover $M_{\sharp}$ from the image vector $b=\mathcal{A}(M_{\sharp})$ , possibly corrupted by noise. Typical nonconvex approaches proceed by choosing some penalty function $h(\cdot)$ with which to measure the residual $\mathcal{A}(M)-b$ for a trial solution $M$ . Then, in the case that $M_{\sharp}$ is symmetric and positive semidefinite, one may focus on the formulation

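$$\min_{X\in{\bf R}^{d\times r}}~f(X):=h\big(\mathcal{A}(XX^{\top})-b\big)\quad\text{subject to}\quad X\in\mathcal{D},\tag{1.1}$$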

or when $M_{\sharp}$ is rectangular, one may instead use the formulation

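$$\min_{X\in{\bf R}^{d_{1}\times r},\,Y\in{\bf R}^{r\times d_{2}}}~f(X,Y):=h\big(\mathcal{A}(XY)-b\big)\quad\text{subject to}\quad(X,Y)\in\mathcal{D}.\tag{1.2}$$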

Here, $\mathcal{D}$ is a convex set that incorporates prior knowledge about $M_{\sharp}$ and is often used to enforce favorable structure on the decision variables. The penalty $h$ is chosen specifically to penalize measurement misfit and/or enforce structure on the residual errors.

Algorithms and conditioning for smooth formulations

Most widely-used penalties $h(\cdot)$ are smooth and convex. Indeed, the squared $\ell_{2}$-norm $h(z)=\tfrac{1}{2}\|z\|^{2}_{2}$ is ubiquitous in this context. With such penalties, problems (1.1) and (1.2) are smooth and thus are amenable to gradient-based methods. The linear rate of convergence of gradient descent is governed by the “local condition number” of $f$. Indeed, if the estimate $\mu I\preceq\nabla^{2}f(X)\preceq LI$ holds for all $X$ in a neighborhood of the solution set, then gradient descent converges to the solution set at the linear rate $1-\mu/L$. It is known that for several widely-studied problems including phase retrieval, blind deconvolution, and matrix completion, the ratio $\mu/L$ scales inversely with the problem dimension. Consequently, generic nonlinear programming guarantees yield efficiency estimates that are far too pessimistic. Instead, nearly dimension-independent guarantees can be obtained by arguing that $\nabla^{2}f$ is well-conditioned along the “relevant” directions or that $\nabla^{2}f$ is well-conditioned within a restricted region of space that the iterates never escape (e.g. [57, 42, 46]). Techniques of this type have been elegantly and successfully used over the past few years to obtain algorithms with near-optimal sample complexity. One byproduct of such techniques, however, is that the underlying arguments are finely tailored to each particular problem and algorithm at hand. We refer the reader to the recent survey [20] for details.

Algorithms and conditioning for nonsmooth formulations

The goal of our work is to justify the following principle:

Statistical assumptions for common recovery problems guarantee that (1.1) and (1.2) are well-conditioned when $h$ is an appropriate nonsmooth convex penalty.

To explain what we mean by “good conditioning,” let us treat (1.1) and (1.2) within the broader convex composite problem class:

$$\min_{x\in\mathcal{X}}~f(x):=h(F(x)),\tag{1.3}$$

where $F(\cdot)$ is a smooth map on the space of matrices and $\mathcal{X}$ is a closed convex set. Indeed, in the symmetric and positive semidefinite case, we identify $x$ with matrices $X$ and define $F(X)=\mathcal{A}(XX^{\top})-b$ , while in the asymmetric case, we identify $x$ with pairs of matrices $(X,Y)$ and define $F(X,Y)=\mathcal{A}(XY)-b$ . Though compositional problems (1.3) have been well-studied in nonlinear programming [6, 7, 31], their computational promise in data science has only begun recently to emerge. For example, the papers [28, 22, 26] discuss stochastic and inexact algorithms on composite problems, while the papers [27, 24], [16], and [39] investigate applications to phase retrieval, blind deconvolution, and matrix sensing, respectively.

A number of algorithms are available for problems of the form (1.3), and hence for (1.1) and (1.2). Here, the subdifferential is formally obtained through the chain rule $\partial f(x)=\nabla F(x)^{*}\partial h(F(x))$, where $\partial h(\cdot)$ is the subdifferential in the sense of convex analysis. The two most notable algorithms are the projected subgradient method [23, 34]

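$$x_{t+1}={\rm proj}_{\mathcal{X}}(x_{t}-\alpha_{t}v_{t})\quad\text{with}\quad v_{t}\in\partial f(x_{t}),$$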

and the prox-linear algorithm [6, 37, 25]

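$$x_{t+1}=\operatornamewithlimits{argmin}_{x\in\mathcal{X}}~h\Big(F(x_{t})+\nabla F(x_{t})(x-x_{t})\Big)+\frac{\beta}{2}\|x-x_{t}\|^{2}_{2}.$$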

Notice that each iteration of the subgradient method is relatively cheap, requiring access only to the subgradients of $f$ and the nearest-point projection onto $\mathcal{X}$ . The prox-linear method in contrast requires solving a strongly convex problem in each iteration. That being said, the prox-linear method has much stronger convergence guarantees than the subgradient method, as we will review shortly.
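To make the cost of a subgradient iteration concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes a toy real-valued phase retrieval instance with $h=\frac{1}{m}\|\cdot\|_{1}$ and $F(x)=(\langle a_{i},x\rangle^{2}-b_{i})_{i}$, takes $\mathcal{X}$ to be the whole space (so the projection is the identity), and uses the Polyak step $\alpha_{t}=(f(x_{t})-\min f)/\|v_{t}\|_{2}^{2}$ with $\min f=0$ in the noiseless case. The function names, problem sizes, and initialization radius are all illustrative choices.

```python
import numpy as np

def loss_and_subgrad(x, A, b):
    """l1 loss f(x) = mean_i |<a_i, x>^2 - b_i| and one subgradient of f at x."""
    Ax = A @ x
    residual = Ax**2 - b
    f = np.mean(np.abs(residual))
    # Chain rule: grad F(x)^* applied to sign(residual), an element of the l1 subdifferential.
    v = (2.0 / len(b)) * (A.T @ (np.sign(residual) * Ax))
    return f, v

def polyak_subgradient(x0, A, b, iters=1000, f_min=0.0):
    """Subgradient method with Polyak step size; the constraint set is all of R^d here."""
    x = x0.copy()
    for _ in range(iters):
        f, v = loss_and_subgrad(x, A, b)
        gap, v_sq = f - f_min, v @ v
        if gap <= 1e-14 or v_sq == 0.0:
            break
        x = x - (gap / v_sq) * v
    return x

# Hypothetical synthetic instance: Gaussian measurements, noiseless data.
rng = np.random.default_rng(0)
d, m = 20, 200
x_true = rng.standard_normal(d)
x_true /= np.linalg.norm(x_true)
A = rng.standard_normal((m, d))
b = (A @ x_true) ** 2

# Start within relative error 0.2 of the signal (an assumed "good" initialization).
delta = rng.standard_normal(d)
x0 = x_true + 0.2 * delta / np.linalg.norm(delta)

x_hat = polyak_subgradient(x0, A, b)
# The signal is identifiable only up to sign, so measure distance to {x_true, -x_true}.
err_init = min(np.linalg.norm(x0 - x_true), np.linalg.norm(x0 + x_true))
err_final = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
print(err_init, err_final)
```

Each pass costs two matrix-vector products with `A`, which is the "relatively cheap" per-iteration cost referred to above. A prox-linear step would instead require solving a convex subproblem, which this sketch does not attempt.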

The local convergence guarantees of both methods are straightforward to describe, and underlie what we mean by “good conditioning”. Define $\mathcal{X}^{*}:=\operatornamewithlimits{argmin}_{\mathcal{X}}f$ , and for any $x\in\mathcal{X}$ define the convex model $f_{x}(y)=h(F(x)+\nabla F(x)(y-x))$ . Suppose there exist constants $\rho,\mu>0$ satisfying the two properties:

•

(approximation) $\left|f(y)-f_{x}(y)\right|\leq\frac{\rho}{2}\|y-x\|^{2}_{2}$ for all $x,y\in\mathcal{X}$ ,

•

(sharpness) $f(x)-\inf f\geq\mu\cdot{\rm dist}(x,\mathcal{X}^{*})$ for all $x\in\mathcal{X}$ .

The approximation and sharpness properties have intuitive meanings. The former says that the nonconvex function $f(y)$ is well approximated by the convex model $f_{x}(y)$, with quality that degrades quadratically as $y$ deviates from $x$. In particular, this property guarantees that the quadratically perturbed function $x\mapsto f(x)+\frac{\rho}{2}\|x\|^{2}_{2}$ is convex on $\mathcal{X}$. Yet another consequence of the approximation property is that the epigraph of $f$ admits a supporting concave quadratic with amplitude $\rho$ at each of its points. Sharpness, in turn, asserts that $f$ must grow at least linearly as $x$ moves away from the solution set. In other words, the function values should robustly distinguish between optimal and suboptimal solutions. In statistical contexts, one can interpret sharpness as strong identifiability of the statistical model. The three figures below illustrate the approximation and sharpness properties for idealized objectives in phase retrieval, blind deconvolution, and robust PCA problems.

Approximation and sharpness, taken together, guarantee rapid convergence of numerical methods when initialized within the tube:

$$\mathcal{T}=\Big\{x\in\mathcal{X}:{\rm dist}(x,\mathcal{X}^{*})\leq\frac{\mu}{\rho}\Big\}.$$

For common low-rank recovery problems, $\mathcal{T}$ has an intuitive interpretation: it consists of those matrices that are within constant relative error of the solution. We note that standard spectral initialization techniques, in turn, can generate such matrices with nearly optimal sample complexity. We refer the reader to the survey [20], and references therein, for details.
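As a one-dimensional illustration (a toy instance chosen here for exposition, not an example from the paper), take $h=|\cdot|$ and $F(x)=x^{2}-1$ on $\mathcal{X}={\bf R}$, so that $\mathcal{X}^{*}=\{\pm 1\}$ and $\inf f=0$. The constants $\rho$ and $\mu$, and hence the tube $\{x:{\rm dist}(x,\mathcal{X}^{*})\leq\mu/\rho\}$, can be computed by hand:

```latex
\begin{align*}
f(x) &= |x^{2}-1|, \qquad f_{x}(y) = \big|x^{2}-1+2x(y-x)\big|,\\
|f(y)-f_{x}(y)| &\leq \big|(y^{2}-1)-(x^{2}-1)-2x(y-x)\big| = (y-x)^{2}
  &&\Longrightarrow\ \rho = 2,\\
f(x)-\inf f &= \big||x|-1\big|\cdot\big(|x|+1\big) \geq \big||x|-1\big| = {\rm dist}(x,\mathcal{X}^{*})
  &&\Longrightarrow\ \mu = 1,\\
\mathcal{T} &= \big\{x : {\rm dist}(x,\{\pm 1\}) \leq \tfrac{1}{2}\big\}
  = \big\{x : \tfrac{1}{2}\leq |x| \leq \tfrac{3}{2}\big\}.
\end{align*}
```

In this toy case the tube consists exactly of the points within relative error $1/2$ of a solution, which matches the "constant relative error" interpretation above.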

Guiding strategy.

The following is the guiding algorithmic principle of this work:

When initialized at $x_{0}\in\mathcal{T}$, the prox-linear algorithm converges quadratically to the solution set $\mathcal{X}^{*}$; the subgradient method, in turn, converges linearly with a rate governed by the ratio $\frac{\mu}{L}\in(0,1)$, where $L$ is the Lipschitz constant of $f$ on $\mathcal{T}$. (Both the parameters $\alpha_{t}$ and $\beta$ must be properly chosen for these guarantees to take hold.)

In light of this observation, our strategy can be succinctly summarized as follows. We will show that for a variety of low-rank recovery problems, the parameters $\mu,L,\rho>0$ (or variants) are dimension independent under standard statistical assumptions. Consequently, the formulations (1.1) and (1.2) are “well-conditioned”, and subgradient and prox-linear methods converge rapidly when initialized within constant relative error of the optimal solution.

Approximation and sharpness via the Restricted Isometry Property

We begin verifying our thesis by showing that the composite problems, (1.1) and (1.2), are well-conditioned under the following Restricted Isometry Property (RIP): there exists a norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ and numerical constants $\kappa_{1},\kappa_{2}>0$ so that

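$$\kappa_{1}\|W\|_{F}\leq\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\mathcal{A}(W)\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}\leq\kappa_{2}\|W\|_{F},$$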

for all matrices $W\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ . We argue that under RIP,

the nonsmooth norm $h=\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ is a natural penalty function to use.

Indeed, as we will show, the composite loss $h(F(x))$ in the symmetric setting admits constants $\mu,\rho,L$ that depend only on the RIP parameters and the extremal singular values of $M_{\sharp}$ :

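$$\mu=0.9\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})},\qquad\rho=\kappa_{2},\qquad L=0.9\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}+2\kappa_{2}\sqrt{\sigma_{1}(M_{\sharp})}.$$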

In particular, the initialization ratio scales as $\frac{\mu}{\rho}\asymp\frac{\kappa_{1}}{\kappa_{2}}\sqrt{\sigma_{r}(M_{\sharp})}$ and the condition number scales as $\frac{L}{\mu}\asymp 1+\frac{\kappa_{2}}{\kappa_{1}}\sqrt{\frac{\sigma_{1}(M_{\sharp})}{\sigma_{r}(M_{\sharp})}}$. Consequently, the rapid local convergence guarantees previously described immediately take hold. The asymmetric setting is slightly more nuanced since the objective function is sharp only on bounded sets. Nonetheless, it can be analyzed in a similar way, leading to analogous rapid convergence guarantees. Incidentally, we show that the prox-linear method converges rapidly without any modification; this is in contrast to smooth methods, which typically require incorporating an auxiliary regularization term into the objective (e.g. [57]). We note that similar results in the symmetric setting were independently obtained in the complementary work [39], albeit with a looser estimate of $L$; the two treatments of the asymmetric setting are distinct, however. (The authors of [39] provide a bound on $L$ that scales with the Frobenius norm $\sqrt{\|M_{\sharp}\|_{F}}$. We instead derive a sharper bound that scales as $\sqrt{\|M_{\sharp}\|_{\rm op}}$. As a byproduct, the linear rate of convergence for the subgradient method scales only with the condition number $\sigma_{1}(M_{\sharp})/\sigma_{r}(M_{\sharp})$ instead of $\|M_{\sharp}\|_{F}/\sigma_{r}(M_{\sharp})$.)
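The two scalings can be checked by direct substitution. The short computation below assumes the constants read $\mu=0.9\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}$, $\rho=\kappa_{2}$, and $L=\mu+2\kappa_{2}\sqrt{\sigma_{1}(M_{\sharp})}$; the square roots are inferred from the stated scalings rather than quoted.

```latex
\begin{align*}
\frac{\mu}{\rho}
  &= \frac{0.9\,\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}}{\kappa_{2}}
   \asymp \frac{\kappa_{1}}{\kappa_{2}}\sqrt{\sigma_{r}(M_{\sharp})},\\[4pt]
\frac{L}{\mu}
  &= \frac{0.9\,\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}+2\kappa_{2}\sqrt{\sigma_{1}(M_{\sharp})}}
          {0.9\,\kappa_{1}\sqrt{\sigma_{r}(M_{\sharp})}}
   = 1+\frac{20}{9}\cdot\frac{\kappa_{2}}{\kappa_{1}}
       \sqrt{\frac{\sigma_{1}(M_{\sharp})}{\sigma_{r}(M_{\sharp})}}.
\end{align*}
```

Neither ratio involves the ambient dimensions $d_{1},d_{2}$ explicitly; any dimension dependence can enter only through $\kappa_{1}$, $\kappa_{2}$, and the singular values of $M_{\sharp}$.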

After establishing basic properties of the composite loss, we turn our attention to verifying RIP in several concrete scenarios. We note that the seminal works [50, 13] showed that if $\mathcal{A}(\cdot)$ arises from a Gaussian ensemble, then in the regime $m\gtrsim r(d_{1}+d_{2})$ RIP holds with high probability for the scaled $\ell_{2}$ norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}z\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=m^{-1/2}\|z\|_{2}$ . More generally when $\mathcal{A}$ is highly structured, RIP may be most naturally measured in a non-Euclidean norm. For example, RIP with respect to the scaled $\ell_{1}$ norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}z\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=m^{-1}\|z\|_{1}$ holds for phase retrieval [29, 27], blind deconvolution [16], and quadratic sensing [18]; in contrast, RIP relative to the scaled $\ell_{2}$ norm fails for all three problems. In particular, specializing our results to the aforementioned recovery tasks yields solution methodologies with best known sample and computational complexity guarantees. Notice that while one may “smooth-out” the $\ell_{2}$ norm by squaring it, we argue that it may be more natural to optimize the $\ell_{1}$ norm directly as a nonsmooth penalty. Moreover, we show that $\ell_{1}$ penalization enables exact recovery even if a constant fraction of measurements is corrupted by outliers.
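The scaled $\ell_{1}$ norm can be probed numerically in the simplest case. The sketch below is an illustration under assumed settings, not an experiment from the paper: it draws a Gaussian ensemble $\mathcal{A}(W)_{i}=\langle P_{i},W\rangle$ and evaluates $\frac{1}{m}\|\mathcal{A}(W)\|_{1}/\|W\|_{F}$ at a few random matrices of rank at most $2r$. Since each $\langle P_{i},W\rangle$ is a centered Gaussian with standard deviation $\|W\|_{F}$, the ratio should concentrate near $\sqrt{2/\pi}\approx 0.80$. This is only a pointwise check at random test matrices; the RIP itself is a uniform statement over all low-rank $W$. The function name and problem sizes are hypothetical.

```python
import numpy as np

def l1_rip_ratio(d, r, m, rng):
    """Return (1/m) * ||A(W)||_1 / ||W||_F for one random W of rank at most 2r."""
    W = rng.standard_normal((d, 2 * r)) @ rng.standard_normal((2 * r, d))
    W /= np.linalg.norm(W)               # Frobenius norm is the default for 2-D arrays
    P = rng.standard_normal((m, d, d))   # Gaussian sensing matrices P_1, ..., P_m
    measurements = np.einsum("ijk,jk->i", P, W)  # <P_i, W> for each i
    return np.mean(np.abs(measurements))

rng = np.random.default_rng(0)
ratios = [l1_rip_ratio(d=10, r=2, m=4000, rng=rng) for _ in range(5)]
print(ratios, np.sqrt(2 / np.pi))
```

The observed ratios play the role of empirical stand-ins for $\kappa_{1}$ and $\kappa_{2}$ at the sampled matrices.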

Beyond RIP: matrix completion and robust PCA

The RIP assumption provides a nice vantage point for analyzing the problem parameters $\mu,\rho,L>0$. There are, however, a number of important problems that do not satisfy RIP. Nonetheless, the general paradigm based on the interplay of sharpness and approximation is still powerful. We consider two such settings, matrix completion and robust principal component analysis (PCA), leveraging some intermediate results from [19].

The goal of the matrix completion problem [9] is to recover a low rank matrix $M_{\sharp}$ from its partially observed entries. We focus on the formulation

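$$\operatornamewithlimits{argmin}_{X\in\mathcal{X}}~f(X)=\|\Pi_{\Omega}(XX^{\top})-\Pi_{\Omega}(M_{\sharp})\|_{2},$$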

where $\Pi_{\Omega}$ is the projection onto the index set of observed entries $\Omega$ and

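$$\mathcal{X}=\left\{X\in{\bf R}^{d\times r}:\|X\|_{2,\infty}\leq\sqrt{\frac{\nu r\|M_{\sharp}\|_{\rm op}}{d}}\right\}$$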

is the set of incoherent matrices. To analyze the conditioning of this formulation, we assume that the indices in $\Omega$ are chosen as i.i.d. Bernoulli with parameter $p\in(0,1)$ and that all nonzero singular values of $M_{\sharp}$ are equal to one. Using results of [19], we quickly deduce sharpness with high probability. The error in approximation, however, takes the following nonstandard form. In the regime $p\geq\frac{c}{\epsilon^{2}}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ for some constants $c>0$ and $\epsilon\in(0,1)$ , the estimate holds with high probability:

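$$|f(Y)-f_{X}(Y)|\leq\sqrt{1+\epsilon}\,\|Y-X\|_{2}^{2}+\sqrt{\epsilon}\,\|X-Y\|_{F}\quad\text{for all }X,Y\in\mathcal{X}.$$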

The following modification of the prox-linear method therefore arises naturally:

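$$X_{k+1}=\operatornamewithlimits{argmin}_{X\in\mathcal{X}}~f_{X_{k}}(X)+\sqrt{1+\epsilon}\,\|X-X_{k}\|_{F}^{2}+\sqrt{\epsilon}\,\|X-X_{k}\|_{F}.$$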

We show that subgradient methods and the prox-linear method, thus modified, both converge at a dimension independent linear rate when initialized near the solution. Namely, as long as $\epsilon$ and ${\rm dist}(X_{0},\mathcal{X}^{*})$ are below some constant thresholds, both the subgradient and the modified prox-linear methods converge linearly with high probability:

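$${\rm dist}(X_{k},\mathcal{X}^{*})\lesssim\begin{cases}\left(1-\frac{c}{\nu r}\right)^{k/2}&\text{subgradient}\\2^{-k}&\text{prox-linear}\end{cases}.$$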

Here $c>0$ is a numerical constant. Notice that the prox-linear method enjoys a much faster rate of convergence that is independent of any unknown constants or problem parameters—an observation fully supported by our numerical experiments.

As the final example, we consider the problem of robust PCA [11, 15], which aims to decompose a given matrix $W$ into a sum of a low-rank and a sparse matrix. We consider two different problem formulations:

$$\min_{(X,S)\in\mathcal{D}_{1}}~F((X,S))=\|XX^{\top}+S-W\|_{F},\tag{1.5}$$

and

$$\min_{X\in\mathcal{D}_{2}}~f(X)=\|XX^{\top}-W\|_{1},\tag{1.6}$$

where $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ are appropriately defined convex regions. Under standard incoherence assumptions, we show that the formulation (1.5) is well-conditioned, and therefore subgradient and prox-linear methods are applicable. Still, formulation (1.5) has a major drawback: to ensure good conditioning, one must know properties of the optimal sparse matrix $S_{\sharp}$ in order to define the constraint set $\mathcal{D}_{1}$. Consequently, we analyze formulation (1.6) as a more practical alternative.

The analysis of (1.6) is more challenging than that of (1.5). Indeed, it appears that we must replace the Frobenius norm $\|X\|_{F}$ in the approximation/sharpness conditions with the sum of the row norms $\|X\|_{2,1}$ . With this set-up, we verify the convex approximation property in general:

$$|f(Y)-f_{X}(Y)|\leq\|Y-X\|_{2,1}^{2}\quad\text{for all }X,Y$$

and sharpness only when $r=1$ . We conjecture, however, that an analogous sharpness bound holds for all $r$ . It is easy to see that the quadratic convergence guarantees for the prox-linear method do not rely on the Euclidean nature of the norm, and the algorithm becomes applicable. To the best of our knowledge, it is not yet known how to adapt linearly convergent subgradient methods to the non-Euclidean setting.
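The approximation property in the $\|\cdot\|_{2,1}$ norm can be sanity-checked numerically. The sketch below is an illustration with hypothetical names and sizes, and takes $\|\cdot\|_{1}$ to be the entrywise $\ell_{1}$ norm. It evaluates $f(X)=\|XX^{\top}-W\|_{1}$ and its convex model $f_{X}(Y)=\|XX^{\top}+X(Y-X)^{\top}+(Y-X)X^{\top}-W\|_{1}$ at random pairs and compares the gap with $\|Y-X\|_{2,1}^{2}$. The bound is expected to hold because $f(Y)$ and $f_{X}(Y)$ differ by at most $\|(Y-X)(Y-X)^{\top}\|_{1}$, which Cauchy-Schwarz bounds by the squared sum of row norms of $Y-X$.

```python
import numpy as np

def f(X, W):
    """Entrywise l1 loss ||X X^T - W||_1."""
    return np.abs(X @ X.T - W).sum()

def model(X, Y, W):
    """Convex model f_X(Y): linearize X X^T around X, then apply the l1 norm."""
    D = Y - X
    return np.abs(X @ X.T + X @ D.T + D @ X.T - W).sum()

def norm_2_1(Z):
    """Sum of the Euclidean norms of the rows of Z."""
    return np.linalg.norm(Z, axis=1).sum()

rng = np.random.default_rng(0)
d, r = 12, 3
W = rng.standard_normal((d, d))
W = W + W.T  # arbitrary symmetric data matrix, for illustration only

gaps, bounds = [], []
for _ in range(100):
    X = rng.standard_normal((d, r))
    Y = X + 0.5 * rng.standard_normal((d, r))
    gaps.append(abs(f(Y, W) - model(X, Y, W)))
    bounds.append(norm_2_1(Y - X) ** 2)

print(max(g / b for g, b in zip(gaps, bounds)))
```

The printed value is the largest observed ratio of the model error to $\|Y-X\|_{2,1}^{2}$ over the sampled pairs.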

Robust recovery with sparse outliers and dense noise

The aforementioned guarantees lead to exact recovery of $M_{\sharp}$ under noiseless or sparsely corrupted measurements $b$ . A more realistic noise model allows for further corruption by a dense noise vector $e$ of small norm. Exact recovery is no longer possible with such errors. Instead, we should only expect to recover $M_{\sharp}$ up to a tolerance proportional to the size of $e$ . Indeed, we show that appropriately modified subgradient and prox-linear algorithms converge linearly and quadratically, respectively, up to the tolerance $\delta=O(\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}/\mu)$ for an appropriate norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ . Finally, we discuss in detail the case of recovering a low rank PSD matrix $M_{\sharp}$ from the corrupted measurements $\mathcal{A}(M_{\sharp})+\Delta+e$ , where $\Delta$ represents sparse outliers and $e$ represents small dense noise. To the best of our knowledge, theoretical guarantees for this error model have not been previously established in the nonconvex low-rank recovery literature. Surprisingly, we show it is possible to recover the matrix $M_{\sharp}$ up to a tolerance independent of the norm or location of the outliers $\Delta$ .

Numerical experiments

We conclude with an experimental evaluation of our theoretical findings on quadratic and bilinear matrix sensing, matrix completion, and robust PCA problems. In the first set of experiments, we test the robustness of the proposed methods against varying combinations of rank/corruption level by reporting the empirical recovery rate across independent runs of synthetic problem instances. All the aforementioned model problems exhibit sharp phase transitions, yet our methods succeed for more than moderate levels of corruption (or unobserved entries in the case of matrix completion). For example, in the case of matrix sensing, we can corrupt almost half of the measurements $\mathcal{A}_{i}(M)$ and still retain perfect recovery rates. Interestingly, our experimental findings indicate that the prox-linear method can tolerate slightly higher levels of corruption compared to the subgradient method, making it the method of choice for small-to-moderate dimensions.

We then demonstrate that the convergence rate analysis is fully supported by empirical evidence. In particular, we test the subgradient and prox-linear methods for different rank/corruption configurations. In the case of quadratic/bilinear sensing and robust PCA, we observe that the subgradient method converges linearly and the prox-linear method converges quadratically, as expected. In particular, our numerical experiments appear to support our sharpness conjecture for the robust PCA problem. In the case of matrix completion, both algorithms converge linearly. The prox-linear method, in particular, converges extremely quickly, reaching high-accuracy solutions in under $25$ iterations for reasonable values of $p$.

In the noiseless setting, we compare against gradient descent with constant step-size on smooth formulations of each problem (except for robust PCA). We notice that the Polyak subgradient method outperforms gradient descent in all cases. That being said, one can heuristically equip gradient descent with the Polyak step-size as well. To the best of our knowledge, the gradient method with Polyak step-size has not been investigated on the smooth problem formulations we consider here. Experimentally, we see that the Polyak (sub)gradient methods on smooth and nonsmooth formulations perform comparably in the noiseless setting.

Outline of the paper

The outline of the paper is as follows. Section 2 records some basic notation we will use. Section 3 informally discusses the sharpness and approximation properties, and their impact on convergence of the subgradient and prox-linear methods. Section 4 analyzes the parameters $\mu,\rho,L$ under RIP. Section 5 rigorously discusses convergence guarantees of numerical methods under regularity conditions. Section 6 reviews examples of problems satisfying RIP and deduces convergence guarantees for subgradient and prox-linear algorithms. Sections 7 and 8 discuss the matrix completion and robust PCA problems, respectively. Section 9 discusses robust recovery up to a noise tolerance. The final Section 10 illustrates the developed theory and algorithms with numerical experiments on quadratic/bilinear sensing, matrix completion, and robust PCA problems.

2 Preliminaries

In this section, we summarize the basic notation we will use throughout the paper. Henceforth, the symbol ${\bf E}$ will denote a Euclidean space with inner product $\langle\cdot,\cdot\rangle$ and the induced norm $\|x\|_{2}=\sqrt{\langle x,x\rangle}$ . The closed unit ball in ${\bf E}$ will be denoted by $\mathbb{B}$ , while a closed ball of radius $\epsilon>0$ around a point $x$ will be written as $B_{\epsilon}(x)$ . For any point $x\in{\bf E}$ and a set $Q\subset{\bf E}$ , the distance and the nearest-point projection in $\ell_{2}$ -norm are defined by

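$$\mathrm{dist}(x,Q)=\inf_{y\in Q}\|x-y\|_{2}\qquad\textrm{and}\qquad\mathrm{proj}_{Q}(x)=\operatornamewithlimits{argmin}_{y\in Q}\|x-y\|_{2},$$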

respectively. For any pair of functions $f$ and $g$ on ${\bf E}$ , the notation $f\lesssim g$ will mean that there exists a numerical constant $C$ such that $f(x)\leq Cg(x)$ for all $x\in{\bf E}$ . Given a linear map between Euclidean spaces, $\mathcal{A}\colon{\bf E}\to{\bf Y}$ , the adjoint map will be written as $\mathcal{A}^{*}\colon{\bf Y}\to{\bf E}$ . We will use $I_{d}$ for the $d$ -dimensional identity matrix and $\mathbf{0}$ for the zero matrix with variable sizes. The symbol $[m]$ will be shorthand for the set $\{1,\dots,m\}.$

We will always endow the Euclidean space of vectors ${\bf R}^{d}$ with the usual dot-product $\langle x,y\rangle=x^{\top}y$ and the induced $\ell_{2}$ -norm. More generally, the $\ell_{p}$ norm of a vector $x$ will be denoted by $\|x\|_{p}=(\sum_{i}|x_{i}|^{p})^{1/p}$ . Similarly, we will equip the space of rectangular matrices ${\bf R}^{d_{1}\times d_{2}}$ with the trace product $\langle X,Y\rangle=\mathrm{Tr}(X^{\top}Y)$ and the induced Frobenius norm $\|X\|_{F}=\sqrt{\mathrm{Tr}(X^{\top}X)}$ . The operator norm of a matrix $X\in{\bf R}^{d_{1}\times d_{2}}$ will be written as $\|X\|_{\textrm{op}}$ . The symbol $\sigma(X)$ will denote the vector of singular values of a matrix $X$ in nonincreasing order. We also define the row-wise matrix norms $\|X\|_{b,a}=\|(\|X_{1\cdot}\|_{b},\|X_{2\cdot}\|_{b},\ldots,\|X_{d_{1}\cdot}\|_{b})\|_{a}$ . The symbols $\mathcal{S}^{d}$ , $\mathcal{S}^{d}_{+}$ , $O(d)$ , and $GL(d)$ will denote the sets of $d\times d$ symmetric, positive semidefinite, orthogonal, and invertible matrices, respectively.
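As a quick illustration of the row-wise norm notation, the following NumPy sketch evaluates $\|X\|_{b,a}$ by first taking the $\ell_{b}$ norm of each row and then the $\ell_{a}$ norm of the resulting vector. The helper name `rowwise_norm` is hypothetical and chosen only for this example.

```python
import numpy as np

def rowwise_norm(X, b, a):
    """Return ||X||_{b,a}: the l_a norm of the vector of row-wise l_b norms."""
    X = np.asarray(X, dtype=float)
    row_norms = np.linalg.norm(X, ord=b, axis=1)  # l_b norm of every row
    return float(np.linalg.norm(row_norms, ord=a))

X = np.array([[3.0, 4.0], [6.0, 8.0]])   # row l_2 norms are 5 and 10
print(rowwise_norm(X, 2, 1))             # sum of the row norms
print(rowwise_norm(X, 2, np.inf))        # largest row norm
print(rowwise_norm(X, 2, 2))             # coincides with the Frobenius norm
```

With $a=b=2$ the construction reduces to the Frobenius norm, which gives a simple sanity check.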

Nonsmooth functions will play a central role in this work. Consequently, we will require some basic constructions of generalized differentiation, as described for example in the monographs [52, 45, 4]. Consider a function $f\colon{\bf E}\to{\bf R}\cup\{+\infty\}$ and a point $x$ , with $f(x)$ finite. The subdifferential of $f$ at $x$ , denoted by $\partial f(x)$ , is the set of all vectors $\xi\in{\bf E}$ satisfying

$$f(y)\geq f(x)+\langle\xi,y-x\rangle+o(\|y-x\|_{2})\qquad\textrm{as }y\to x.\tag{2.1}$$

Here $o(r)$ denotes any function satisfying $o(r)/r\to 0$ as $r\to 0$ . Thus, a vector $\xi$ lies in the subdifferential $\partial f(x)$ precisely when the linear function $y\mapsto f(x)+\langle\xi,y-x\rangle$ lower-bounds $f$ up to first-order around $x$ . Standard results show that for a convex function $f$ the subdifferential $\partial f(x)$ reduces to the subdifferential in the sense of convex analysis, while for a differentiable function it consists only of the gradient: $\partial f(x)=\{\nabla f(x)\}$ . For any closed convex functions $h\colon{\bf Y}\to{\bf R}$ and $g\colon{\bf E}\to{\bf R}\cup\{+\infty\}$ and $C^{1}$ -smooth map $F\colon{\bf E}\to{\bf Y}$ , the chain rule holds [52, Theorem 10.6]:

[TABLE]

We say that a point $x$ is stationary for $f$ whenever the inclusion $0\in\partial f(x)$ holds. Equivalently, stationary points are precisely those that satisfy first-order necessary conditions for minimality: the directional derivative is nonnegative in every direction.

We say that a random vector $X$ in ${\bf R}^{d}$ is $\eta$ -sub-gaussian whenever $\mathbb{E}\exp\left(\frac{\langle u,X\rangle^{2}}{\eta^{2}}\right)\leq 2$ for all unit vectors $u\in{\bf R}^{d}$ . The sub-gaussian norm of a real-valued random variable $X$ is defined to be $\|X\|_{\psi_{2}}=\inf\{t>0:\mathbb{E}\exp\left(\frac{X^{2}}{t^{2}}\right)\leq 2\}$ , while the sub-exponential norm is defined by $\|X\|_{\psi_{1}}=\inf\{t>0:\mathbb{E}\exp\left(\frac{|X|}{t}\right)\leq 2\}$ .

3 Regularity conditions and algorithms (informal)

As outlined in Section 1, we consider the low-rank matrix recovery problem within the framework of compositional optimization:

$$\min_{x\in\mathcal{X}}~f(x):=h(F(x)),\tag{3.1}$$

where $\mathcal{X}\subset{\bf E}$ is a closed convex set, $h\colon{\bf Y}\to{\bf R}$ is a finite convex function and $F\colon{\bf E}\to{\bf Y}$ is a $C^{1}$ -smooth map. We depart from previous work on low-rank matrix recovery by allowing $h$ to be nonsmooth. We primarily focus on those algorithms for (3.1) that converge rapidly (linearly or faster) when initialized sufficiently close to the solution set.

Such rapid convergence guarantees rely on some regularity of the optimization problem. In the compositional setting, regularity conditions take the following appealing form.

Assumption A.

Suppose that the following properties hold for the composite optimization problem (3.1) for some real numbers $\mu,\rho,L>0$ .

1. (Approximation accuracy) The convex models $f_{x}(y):=h(F(x)+\nabla F(x)(y-x))$ satisfy the estimate

[TABLE]

2. (Sharpness) The set of minimizers $\displaystyle\mathcal{X}^{*}:=\operatornamewithlimits{argmin}_{x\in\mathcal{X}}f(x)$ is nonempty and we have

[TABLE]

3. (Subgradient bound) The bound $\sup_{\zeta\in\partial f(x)}\|\zeta\|_{2}\leq L$ holds for any $x$ in the tube

[TABLE]

As pointed out in the introduction, these three properties are quite intuitive: The approximation accuracy guarantees that the objective function $f$ is well approximated by the convex model $f_{x}$ , up to a quadratic error relative to the basepoint $x$ . Sharpness stipulates that the objective function should grow at least linearly as one moves away from the solution set. The subgradient bound, in turn, asserts that the subgradients of $f$ are bounded in norm by $L$ on the tube $\mathcal{T}$ . In particular, this property is implied by Lipschitz continuity on $\mathcal{T}$ .

Lemma 3.1 (Subgradient bound and Lipschitz continuity [52, Theorem 9.13]).

Suppose a function $f\colon{\bf E}\to{\bf R}$ is $L$ -Lipschitz on an open set $U\subset{\bf E}$ . Then the estimate $\sup_{\zeta\in\partial f(x)}\|\zeta\|_{2}\leq L$ holds for all $x\in U$ .

The definition of the tube $\mathcal{T}$ might look unintuitive at first. Some thought, however, shows that it arises naturally since it provably contains no extraneous stationary points of the problem. In particular, $\mathcal{T}$ will serve as a basin of attraction of numerical methods; see the forthcoming Section 5 for details. The following general principle has recently emerged [23, 27, 24, 16]: under Assumption A, basic numerical methods converge rapidly when initialized within the tube $\mathcal{T}$ . Let us consider three such procedures and briefly describe their convergence properties. Detailed convergence guarantees are deferred to Section 5.

Algorithm 1 is the so-called Polyak subgradient method. In each iteration $k$ , the method travels in the negative direction of a subgradient $\zeta_{k}\in\partial f(x_{k})$ , followed by a nearest-point projection onto $\mathcal{X}$ . The step-length is governed by the current functional gap $f(x_{k})-\min_{\mathcal{X}}f$ . In particular, one must have the value $\min_{\mathcal{X}}f$ explicitly available to implement the procedure. This value is sometimes known; case in point, the minimal value of the penalty formulations (1.1) and (1.2) for low-rank recovery is zero when the linear measurements are exact. When the minimal value $\min_{\mathcal{X}}f$ is not known, one can instead use Algorithm 2, which replaces the step-length $(f(x_{k})-\min_{\mathcal{X}}f)/\|\zeta_{k}\|_{2}$ with a preset geometrically decaying sequence. Notice that the per-iteration cost of both subgradient methods is dominated by a single subgradient evaluation and a projection onto $\mathcal{X}$ . Under appropriate parameter settings, Assumption A guarantees that both methods converge at a linear rate governed by the ratio $\frac{\mu}{L}$ , when initialized within $\mathcal{T}$ . The prox-linear algorithm (Algorithm 3), in contrast, converges quadratically to the optimal solution, when initialized within $\mathcal{T}$ . The caveat is that each iteration of the prox-linear method requires solving a strongly convex subproblem. Note that for low-rank recovery problems (1.1) and (1.2), the size of the subproblems is proportional to the size of the factors and not the size of the matrices.
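To make the Polyak step concrete, here is a minimal NumPy sketch of the iteration described above: move along a negative subgradient by a length equal to the functional gap divided by the subgradient norm, then project. It is an illustration under stated assumptions, not the paper's implementation. The function names, the default identity projection, and the toy objective $f(x)=\|xx^{\top}-x_{\sharp}x_{\sharp}^{\top}\|_{F}$ (rank one, identity measurement map, minimal value zero) are all choices made for this example.

```python
import numpy as np

def polyak_subgradient(f, subgrad, x0, f_min=0.0, proj=lambda x: x, iters=500):
    """Projected subgradient method with the Polyak step-size (sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        gap = f(x) - f_min                  # functional gap f(x_k) - min f
        if gap <= 0.0:                      # already optimal
            break
        g = subgrad(x)
        g_sq = float(np.sum(g * g))
        if g_sq == 0.0:                     # stationary point: nothing to do
            break
        x = proj(x - (gap / g_sq) * g)      # step of length gap / ||g||_2
    return x

# Toy instance (assumption): f(x) = ||x x^T - M||_F with M = x_sharp x_sharp^T.
x_sharp = np.array([0.6, 0.8, 0.0])
M = np.outer(x_sharp, x_sharp)

def f(x):
    return float(np.linalg.norm(np.outer(x, x) - M))

def subgrad(x):
    R = np.outer(x, x) - M                  # symmetric residual
    return 2.0 * (R @ x) / np.linalg.norm(R)   # gradient wherever f(x) > 0

def dist_to_solutions(x):
    # For this toy problem the solution set is {x_sharp, -x_sharp}.
    return float(min(np.linalg.norm(x - x_sharp), np.linalg.norm(x + x_sharp)))

x0 = x_sharp + np.array([0.05, -0.05, 0.05])
x_hat = polyak_subgradient(f, subgrad, x0)
print(dist_to_solutions(x0), dist_to_solutions(x_hat))
```

The method needs only function values, one subgradient per iteration, and the known minimal value; no step-size tuning is involved.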

In the subsequent sections, we show that Assumption A (or a close variant) holds with favorable parameters $\rho,\mu,L>0$ for common low-rank matrix recovery problems.

4 Regularity under RIP

In this section, we consider the low-rank recovery problems (1.1) and (1.2), and show that restricted isometry properties of the map $\mathcal{A}(\cdot)$ naturally yield well-conditioned compositional formulations. (Footnote 4: The guarantees we develop in the symmetric setting are similar to those in the recent preprint [39], although we obtain a sharper bound on $L$ ; the two sets of results were obtained independently. The guarantees for the asymmetric setting are different and complementary: we analyze the conditioning of the basic problem formulation (1.2), while [39] introduces a regularization term $\|X^{\top}X-YY^{\top}\|_{F}$ that improves the basin of attraction for the subgradient method by a factor of the condition number of $M_{\sharp}$ .) The arguments are short and elementary, and yet apply to such important problems as phase retrieval, blind deconvolution, and covariance matrix estimation.

Setting the stage, consider a linear map $\mathcal{A}\colon{\bf R}^{d_{1}\times d_{2}}\to{\bf R}^{m}$ , an arbitrary rank $r$ matrix $M_{\sharp}\in{\bf R}^{d_{1}\times d_{2}}$ , and a vector $b\in{\bf R}^{m}$ modeling a corrupted estimate of the measurements $\mathcal{A}(M_{\sharp})$ . Recall that the goal of low-rank matrix recovery is to determine $M_{\sharp}$ given $\mathcal{A}$ and $b$ . By the term symmetric setting, we mean that $M_{\sharp}$ is symmetric and positive semidefinite, whereas by asymmetric setting we mean that $M_{\sharp}$ is an arbitrary rank $r$ matrix. We will treat the two settings in parallel. In the symmetric setting, we use $X_{\sharp}$ to denote any fixed $d\times r$ matrix for which the factorization $M_{\sharp}=X_{\sharp}X_{\sharp}^{\top}$ holds. Similarly, in the asymmetric case, $X_{\sharp}$ and $Y_{\sharp}$ denote any fixed $d_{1}\times r$ and $r\times d_{2}$ matrices, respectively, satisfying $M_{\sharp}=X_{\sharp}Y_{\sharp}$ .

We are interested in the set of all possible factorizations of $M_{\sharp}$ . Consequently, we will often appeal to the following representations:

[TABLE]

Throughout, we will let $\mathcal{D}^{*}(M_{\sharp})$ refer to the set (4.1) in the symmetric case and to (4.2) in the asymmetric setting.

Henceforth, fix an arbitrary norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ on ${\bf R}^{m}$ . The following property, widely used in the literature on low-rank recovery, will play a central role in this section.

Assumption B (Restricted Isometry Property (RIP)).

There exist constants $\kappa_{1},\kappa_{2}>0$ such that for all matrices $W\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ the following bound holds:

[TABLE]

Assumption B is classical and is satisfied in various important problems with the rescaled $\ell_{2}$ -norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{\sqrt{m}}\|\cdot\|_{2}$ and $\ell_{1}$ -norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ . (Footnote 5: In the latter case, RIP also goes by the name of Restricted Uniform Boundedness (RUB) [8].) In Section 6 we discuss a number of such examples including matrix sensing under (sub-)Gaussian design, phase retrieval, blind deconvolution, and quadratic/bilinear sensing. We summarize the RIP properties for these examples in Table 1 and refer the reader to Section 6 for the precise statements.

In light of Assumption B, it is natural to take the norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ as the penalty $h(\cdot)$ in (1.1) and (1.2). Then the symmetric problem (1.1) becomes

[TABLE]

while the asymmetric formulation (1.2) becomes

[TABLE]
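For concreteness, the symmetric penalty formulation with the scaled $\ell_{1}$ norm can be evaluated as follows. This is a sketch under assumptions made for the example: the measurement map is stored as an array of matrices $A_{i}$ with $\mathcal{A}(M)_{i}=\langle A_{i},M\rangle$ , the data are synthetic Gaussian, and the function names are hypothetical. The subgradient follows from the chain rule, using $\mathrm{sign}(0)=0$ as a valid choice from the subdifferential of the absolute value.

```python
import numpy as np

def sensing_map(A, M):
    """Apply A(M)_i = <A_i, M>, with A stored as an (m, d, d) array."""
    return np.einsum("ijk,jk->i", A, M)

def objective(X, A, b):
    """f(X) = (1/m) * ||A(X X^T) - b||_1."""
    return float(np.mean(np.abs(sensing_map(A, X @ X.T) - b)))

def subgradient(X, A, b):
    """One element of the subdifferential of f at X (chain rule, sign(0) = 0)."""
    s = np.sign(sensing_map(A, X @ X.T) - b)
    S = np.einsum("i,ijk->jk", s, A) / len(b)   # (1/m) * sum_i s_i A_i
    return (S + S.T) @ X

rng = np.random.default_rng(0)            # synthetic data, for illustration only
d, r, m = 6, 2, 40
A = rng.standard_normal((m, d, d))
X_sharp = rng.standard_normal((d, r))
b = sensing_map(A, X_sharp @ X_sharp.T)   # noiseless measurements

Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # an orthogonal r x r matrix
print(objective(X_sharp, A, b), objective(X_sharp @ Q, A, b))
```

Both printed values are zero up to rounding, reflecting that every rotated factor $X_{\sharp}Q$ is a minimizer in the noiseless case.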

Our immediate goal is to show that under Assumption B, the problems (4.3) and (4.4) are well-conditioned in the sense of Assumption A. We note that the asymmetric setting is more nuanced than its symmetric counterpart because Assumption A can only be guaranteed to hold on bounded sets. Nonetheless, as we discuss in Section 5, a localized version of Assumption A suffices to guarantee rapid local convergence of subgradient and prox-linear methods. In particular, our analysis of the local sharpness in the asymmetric setting is new and illuminating; it shows that the regularization technique suggested in [39] is not needed at all for the prox-linear method. This conclusion contrasts with known techniques in the smooth setting, where regularization is often used.

4.1 Approximation and Lipschitz continuity

We begin with the following elementary proposition, which estimates the subgradient bound $L$ and the approximation modulus $\rho$ in the symmetric setting. In what follows, we will use the expressions

[TABLE]

Proposition 4.1 (Approximation accuracy and Lipschitz continuity (symmetric)).

Suppose Assumption B holds. Then for all $X,Z\in{\bf R}^{d\times r}$ the following estimates hold:

[TABLE]

Proof.

To see the first estimate, observe that

[TABLE]

where (4.5) follows from the reverse triangle inequality and (4.6) uses Assumption B. Next, for any $X,Z\in\mathcal{X}$ we successively compute:

[TABLE]

where (4.7) follows from the reverse triangle inequality and (4.8) uses Assumption B. The proof is complete. ∎

The estimates of $L$ and $\rho$ in the asymmetric setting are completely analogous; we record them in the following proposition.

Proposition 4.2 (Approximation accuracy and Lipschitz continuity (asymmetric)).

Suppose Assumption B holds. Then for all $X,\widehat{X}\in{\bf R}^{d_{1}\times r}$ and $Y,\widehat{Y}\in{\bf R}^{r\times d_{2}}$ the following estimates hold:

[TABLE]

Proof.

To see the first estimate, observe that

[TABLE]

where the last estimate follows from Young’s inequality $2ab\leq a^{2}+b^{2}.$ Next, we successively compute:

[TABLE]

The result follows by noting that $a+b\leq\sqrt{2(a^{2}+b^{2})}$ for all $a,b\in{\bf R}$ .

∎

4.2 Sharpness

We next move on to estimates of the sharpness constant $\mu$ . We first deal with the noiseless setting $b=\mathcal{A}(M_{\sharp})$ in Section 4.2.1, and then move on to the general case when the measurements are corrupted by outliers in Section 4.2.2.

4.2.1 Sharpness in the noiseless regime

We begin with the symmetric setting in the noiseless case $b=\mathcal{A}(M_{\sharp})$ . By Assumption B, we have the estimate

[TABLE]

It follows that the set of minimizers $\operatornamewithlimits{argmin}_{X\in{\bf R}^{d\times r}}f(X)$ coincides with the set of minimizers of the function $X\mapsto\|XX^{\top}-X_{\sharp}X^{\top}_{\sharp}\|_{F}$ , namely

[TABLE]

Thus to argue sharpness of $f$ it suffices to estimate the sharpness constant of the function $X\mapsto\|XX^{\top}-X_{\sharp}X^{\top}_{\sharp}\|_{F}$ . Fortunately, this calculation was already done in [57, Lemma 5.4].

Proposition 4.3 ([57, Lemma 5.4]).

For any matrices $X,Z\in{\bf R}^{d\times r}$ , we have the bound

[TABLE]

Consequently if Assumption B holds in the noiseless setting $b=\mathcal{A}(M_{\sharp})$ , then the bound holds:

[TABLE]

We next consider the asymmetric case. By exactly the same reasoning as before, the set of minimizers of $f(X,Y)$ coincides with the set of minimizers of the function $(X,Y)\mapsto\|XY-X_{\sharp}Y_{\sharp}\|_{F}$ , namely

[TABLE]

Thus to argue sharpness of $f$ it suffices to estimate the sharpness constant of the function $(X,Y)\mapsto\|XY-X_{\sharp}Y_{\sharp}\|_{F}$ . Such a sharpness guarantee in the rank one case was recently shown in [16, Proposition 4.2].

Proposition 4.4 ([16, Proposition 4.2]).

Fix a rank $1$ matrix $M_{\sharp}\in{\bf R}^{d_{1}\times d_{2}}$ and a constant $\nu\geq 1$ . Then for any $x\in{\bf R}^{d_{1}}$ and $w\in{\bf R}^{d_{2}}$ satisfying

[TABLE]

the following estimate holds:

[TABLE]

Notice that in contrast to the symmetric setting, the sharpness estimate is only valid on bounded sets. Indeed, this is unavoidable even in the setting $d_{1}=d_{2}=2$ . To see this, define $M_{\sharp}=e_{2}e_{2}^{\top}$ and for any $\alpha>0$ set $x=\alpha e_{1}$ and $w=\tfrac{1}{\alpha}e_{1}$ . It is routine to compute

[TABLE]

Therefore letting $\alpha$ tend to zero (or infinity) the quotient tends to zero.
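The counterexample is easy to check numerically. The sketch below assumes that the quotient in question is $\|xw^{\top}-M_{\sharp}\|_{F}$ divided by the Euclidean distance from $(x,w)$ to the set of factorizations $\{(ae_{2},\tfrac{1}{a}e_{2}):a\neq 0\}$ . The squared distance to such a pair equals $\alpha^{2}+\alpha^{-2}+a^{2}+a^{-2}$ , which is minimized at $a=\pm 1$ , so the quotient is $\sqrt{2}/(\alpha+\alpha^{-1})$ .

```python
import numpy as np

def sharpness_quotient(alpha):
    """||x w^T - M||_F / dist((x, w), solutions) for x = alpha*e1, w = e1/alpha."""
    e1, e2 = np.eye(2)
    M = np.outer(e2, e2)
    x, w = alpha * e1, e1 / alpha
    numerator = np.linalg.norm(np.outer(x, w) - M)   # always sqrt(2)
    # Nearest factorization is (e2, e2), i.e. a = 1 (a = -1 gives the same value).
    distance = np.sqrt(np.sum((x - e2) ** 2) + np.sum((w - e2) ** 2))
    return float(numerator / distance)

for alpha in [1.0, 1e1, 1e2, 1e3]:
    print(alpha, sharpness_quotient(alpha))
```

The numerator stays fixed while the distance grows like $\alpha+\alpha^{-1}$ , so no uniform sharpness constant can hold on unbounded sets.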

The following theorem is a higher-rank extension of Proposition 4.4.

Theorem 4.5 (Sharpness (asymmetric and noiseless)).

Fix a constant $\nu>0$ and define $X_{\sharp}:=U\sqrt{\Lambda}$ and $Y_{\sharp}=\sqrt{\Lambda}V^{\top}$ , where $M_{\sharp}=U\Lambda V^{\top}$ is any compact singular value decomposition of $M_{\sharp}$ . Then for all $X\in{\bf R}^{d_{1}\times r}$ and $Y\in{\bf R}^{r\times d_{2}}$ satisfying

[TABLE]

the estimate holds:

[TABLE]

Proof.

Define $\delta:=\frac{1}{1+2(1+\sqrt{2})\nu}$ and consider a pair of matrices $X$ and $Y$ satisfying (4.10). Let $A\in GL(r)$ be an invertible matrix satisfying

[TABLE]

As a first step, we successively compute

[TABLE]

We next aim to lower bound the first term on the right. To this end, observe

[TABLE]

We claim that the cross-term is non-negative. To see this, observe that first order optimality conditions in (4.11) directly imply that $A$ satisfies the equality

[TABLE]

Thus we obtain

[TABLE]

Therefore, returning to (4.13) we conclude that

[TABLE]

Combining (4.12) and (4.14), we obtain

[TABLE]

Finally, we estimate $\min\{\sigma_{r}(A^{-1}),\sigma_{r}(A)\}$ . To this end, first note that

[TABLE]

We now aim to lower bound the left-hand-side in terms of $\min\{\sigma_{r}(A^{-1}),\sigma_{r}(A)\}$ . Observe

[TABLE]

Similarly, we have

[TABLE]

Hence using (4.16), we obtain the estimate

[TABLE]

Using this estimate in (4.15) completes the proof. ∎

4.2.2 Sharpness in presence of outliers

The most important example of the norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ for us is the scaled $\ell_{1}$ -norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ . Indeed, all the examples in the forthcoming Section 6 will satisfy RIP relative to this norm. In this section, we will show that the $\ell_{1}$ -norm has an added advantage. Under reasonable RIP-type conditions, sharpness will hold even if up to a half of the measurements are grossly corrupted.

Henceforth, for any index set $\mathcal{I}\subset[m]$ , define the restricted map $\mathcal{A}_{\mathcal{I}}(X):=\left(\mathcal{A}(X)_{i}\right)_{i\in\mathcal{I}}$ . We interpret the set $\mathcal{I}$ as corresponding to (arbitrarily) outlying measurements, while its complement corresponds to exact measurements. Motivated by the work [27] on robust phase retrieval, we make the following assumption.

Assumption C ( $\mathcal{I}$ -outlier bounds).

There exists a set $\mathcal{I}\subset\{1,\ldots,m\}$ and a constant $\kappa_{3}>0$ such that the following hold.

$\mathrm{(C1)}$ The equality $b_{i}=\mathcal{A}(M_{\sharp})_{i}$ holds for all $i\notin\mathcal{I}$ .

$\mathrm{(C2)}$ For all matrices $W$ of rank at most $2r$ , we have

[TABLE]

The assumption is simple to interpret. To elucidate the bound (4.17), let us suppose that the restricted maps $\mathcal{A}_{\mathcal{I}}$ and $\mathcal{A}_{\mathcal{I}^{c}}$ satisfy Assumption B (RIP) with constants $\hat{\kappa}_{1}$ , $\hat{\kappa}_{2}$ and $\kappa_{1}$ , $\kappa_{2}$ , respectively. Then for any rank $2r$ matrix $X$ we immediately deduce the estimate

[TABLE]

where $p_{\mathrm{fail}}=\frac{|\mathcal{I}|}{m}$ denotes the corruption frequency. In particular, the right-hand side is positive as long as the corruption frequency is below the threshold $p_{\mathrm{fail}}<\frac{\kappa_{1}}{\kappa_{1}+\hat{\kappa}_{2}}$ .
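The threshold can be verified directly. As a sketch, assume (this normalization is our reading, not a statement from the text) that RIP for each restricted map is stated with the $\ell_{1}$ norm divided by the number of measurements in the corresponding index set, so that $\frac{1}{|\mathcal{I}^{c}|}\|\mathcal{A}_{\mathcal{I}^{c}}(X)\|_{1}\geq\kappa_{1}\|X\|_{F}$ and $\frac{1}{|\mathcal{I}|}\|\mathcal{A}_{\mathcal{I}}(X)\|_{1}\leq\hat{\kappa}_{2}\|X\|_{F}$ . Then:

```latex
\begin{aligned}
\frac{1}{m}\|\mathcal{A}_{\mathcal{I}^{c}}(X)\|_{1}-\frac{1}{m}\|\mathcal{A}_{\mathcal{I}}(X)\|_{1}
 &\geq \frac{|\mathcal{I}^{c}|}{m}\,\kappa_{1}\|X\|_{F}-\frac{|\mathcal{I}|}{m}\,\hat{\kappa}_{2}\|X\|_{F}\\
 &= \bigl((1-p_{\mathrm{fail}})\kappa_{1}-p_{\mathrm{fail}}\hat{\kappa}_{2}\bigr)\|X\|_{F},\\[4pt]
(1-p_{\mathrm{fail}})\kappa_{1}-p_{\mathrm{fail}}\hat{\kappa}_{2}>0
 \quad&\Longleftrightarrow\quad
p_{\mathrm{fail}}<\frac{\kappa_{1}}{\kappa_{1}+\hat{\kappa}_{2}}.
\end{aligned}
```

For instance, when $\kappa_{1}$ and $\hat{\kappa}_{2}$ are of comparable size, the admissible corruption frequency approaches one half, in line with the claim that sharpness can hold even if up to a half of the measurements are corrupted.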

Combining Assumption C with Proposition 4.3 quickly yields sharpness of the objective even in the noisy setting.

Proposition 4.6 (Sharpness with outliers (symmetric)).

Suppose that Assumption C holds. Then

[TABLE]

Proof.

Defining $\Delta:=\mathcal{A}(X_{\sharp}X_{\sharp}^{\top})-b$ , we have the following bound:

[TABLE]

where the first inequality follows by the reverse triangle inequality, the second inequality follows by Assumption $\mathrm{(C2)}$ , and the final inequality follows from Proposition 4.3. The proof is complete. ∎

The argument in the asymmetric setting is completely analogous.

Proposition 4.7 (Sharpness with outliers (asymmetric)).

Suppose that Assumption C holds. Fix a constant $\nu>0$ and define $X_{\sharp}:=U\sqrt{\Lambda}$ and $Y_{\sharp}=\sqrt{\Lambda}V^{\top}$ , where $M_{\sharp}=U\Lambda V^{\top}$ is any compact singular value decomposition of $M_{\sharp}$ . Then for all $X\in{\bf R}^{d_{1}\times r}$ and $Y\in{\bf R}^{r\times d_{2}}$ satisfying

[TABLE]

the estimate holds:

[TABLE]

5 General convergence guarantees for subgradient & prox-linear methods

In this section, we formally develop convergence guarantees for Algorithms 1, 2, and 3 under Assumption A, and deduce performance guarantees in the RIP setting. To this end, it will be useful to first consider a broader class than the compositional problems (3.1). We say that a function $f\colon{\bf E}\rightarrow{\bf R}\cup\{+\infty\}$ is $\rho$ -weakly convex if the perturbed function $x\mapsto f(x)+\frac{\rho}{2}\|x\|^{2}_{2}$ is convex. (Footnote 6: Weakly convex functions also go by other names such as lower- $C^{2}$ , uniformly prox-regular, paraconvex, and semiconvex. We refer the reader to the seminal works on the topic [51, 49, 47, 53, 2].) In particular, a composite function $f=h\circ F$ satisfying the approximation guarantee

[TABLE]

is automatically $\rho$ -weakly convex [26, Lemma 4.2]. Subgradients of weakly convex functions are very well-behaved. Indeed, notice that in general the little-o term in the expression (2.1) may depend on the basepoint $x$ , and may therefore be nonuniform. The subgradients of weakly convex functions, on the other hand, automatically satisfy a uniform type of lower-approximation property. Indeed, a lower-semicontinuous function $f$ is $\rho$ -weakly convex if and only if it satisfies:

[TABLE]
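A one-dimensional toy example may help fix the definition. Take $f(x)=|x^{2}-1|$ , which is our choice of illustration and has the composite form $h\circ F$ with $h=|\cdot|$ and $F(x)=x^{2}-1$ . Since $f(x)+x^{2}=\max\{2x^{2}-1,1\}$ is convex, $f$ is $2$ -weakly convex, whereas adding only $\tfrac{1}{2}x^{2}$ is not enough. The sketch below tests midpoint convexity on a grid; it is a numerical sanity check, not a proof.

```python
def f(x):
    """Toy weakly convex function f(x) = |x^2 - 1|."""
    return abs(x * x - 1.0)

def is_midpoint_convex(g, grid, tol=1e-12):
    """Check g((a+b)/2) <= (g(a)+g(b))/2 for all pairs of grid points."""
    return all(g(0.5 * (a + b)) <= 0.5 * (g(a) + g(b)) + tol
               for a in grid for b in grid)

grid = [-2.0 + 0.1 * i for i in range(41)]

# Perturb by (rho/2) * x^2 for rho = 2, 1, 0.
print(is_midpoint_convex(lambda x: f(x) + 1.0 * x * x, grid))   # rho = 2
print(is_midpoint_convex(lambda x: f(x) + 0.5 * x * x, grid))   # rho = 1
print(is_midpoint_convex(f, grid))                              # rho = 0
```

Only the first check passes: the concave piece $1-x^{2}$ on $(-1,1)$ needs the full quadratic $x^{2}$ to be compensated.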

Setting the stage, we introduce the following assumption.

Assumption D.

Consider the optimization problem,

[TABLE]

Suppose that the following properties hold for some real numbers $\mu,\rho>0$ .

1. (Weak convexity) The set $\mathcal{X}$ is closed and convex, while the function $f\colon{\bf E}\to{\bf R}$ is $\rho$ -weakly convex.

2. (Sharpness) The set of minimizers $\displaystyle\mathcal{X}^{*}:=\operatornamewithlimits{argmin}_{x\in\mathcal{X}}f(x)$ is nonempty and the following inequality holds:

[TABLE]

In particular, notice that Assumption A implies Assumption D. Taken together, weak convexity and sharpness provide an appealing framework for deriving local rapid convergence guarantees for numerical methods. In this section, we specifically focus on two such procedures: the subgradient and prox-linear algorithms. We aim to estimate both the radius of rapid convergence around the solution set and the rate of convergence. Note that both of the algorithms, when initialized at a stationary point, could stay there for all subsequent iterations. Since we are interested in finding global minima, we therefore estimate the neighborhood of the solution set that has no extraneous stationary points. This is the content of the following simple lemma.

Lemma 5.1 ([23, Lemma 3.1]).

Suppose that Assumption D holds. Then the problem (5.1) has no stationary points $x$ satisfying

[TABLE]

It is worthwhile to note that the estimate $\frac{2\mu}{\rho}$ of the radius in Lemma 5.1 is tight [16, Section 3]. Hence, let us define for any $\gamma>0$ the tube

[TABLE]

Thus we would like to search for algorithms whose basin of attraction is a tube $\mathcal{T}_{\gamma}$ for some numerical constant $\gamma>0$ . Such a basin of attraction is in essence optimal.

The rate of convergence of the subgradient methods (Algorithms 1 and 2) relies on the subgradient bound and the condition measure:

[TABLE]

A straightforward argument [23, Lemma 3.2] shows $\tau\in[0,1]$ . The following theorem appears as [23, Theorem 4.1], while its application to phase retrieval was investigated in [24].

Theorem 5.2 (Polyak subgradient method).

Suppose that Assumption D holds and fix a real number $\gamma\in(0,1)$ . Then Algorithm 1 initialized at any point $x_{0}\in\mathcal{T}_{\gamma}$ produces iterates that converge $Q$ -linearly to $\mathcal{X}^{*}$ , that is

[TABLE]

The following theorem appears as [23, Theorem 6.1]. The convex version of the result dates back to Goffin [34].

Theorem 5.3 (Geometrically decaying subgradient method).

Suppose that Assumption D holds, fix a real number $\gamma\in(0,1)$ , and suppose $\tau\leq\sqrt{\frac{1}{2-\gamma}}$ . Set $\lambda:=\frac{\gamma\mu^{2}}{\rho L}\textrm{ and }q:=\sqrt{1-(1-\gamma)\tau^{2}}$ in Algorithm 2. Then the iterates $x_{k}$ generated by Algorithm 2, initialized at any point $x_{0}\in\mathcal{T}_{\gamma}$ , satisfy:

[TABLE]
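The parameter choices in Theorem 5.3 can be tried out on a small example. The sketch below assumes that Algorithm 2 takes normalized subgradient steps of length $\lambda q^{k}$ followed by a projection, and that the condition measure is $\tau=\mu/L$ ; both are assumptions on our part, since the algorithm listing is not reproduced here. The toy problem $f(x)=|x^{2}-1|$ on the real line is our choice, with hand-computed constants $\mu=1$ , $\rho=2$ , and $L=3$ on the tube $\{x:\mathrm{dist}(x,\{\pm 1\})\leq\mu/\rho\}$ .

```python
import numpy as np

def geometric_subgradient(subgrad, x0, lam, q, proj=lambda x: x, iters=500):
    """Subgradient method with normalized steps of length lam * q**k (sketch)."""
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        g = subgrad(x)
        g_norm = float(np.linalg.norm(g))
        if g_norm == 0.0:               # zero subgradient: stay put
            break
        x = proj(x - (lam * q**k) * g / g_norm)
    return x

# Toy instance (assumption): f(x) = |x^2 - 1|, minimized at x = +1 and x = -1.
def subgrad(x):
    return np.sign(x * x - 1.0) * 2.0 * x

mu, rho, L = 1.0, 2.0, 3.0              # hand-computed for this toy problem
gamma = 0.5
tau = mu / L                            # assumed form of the condition measure
lam = gamma * mu**2 / (rho * L)
q = np.sqrt(1.0 - (1.0 - gamma) * tau**2)

x0 = np.array([1.2])                    # dist(x0, {+-1}) = 0.2 <= gamma * mu / rho
x_hat = geometric_subgradient(subgrad, x0, lam, q)
print(abs(x_hat[0] - 1.0))
```

Unlike the Polyak method, this variant never uses the minimal value of $f$ ; the price is that $\lambda$ and $q$ must be set from estimates of $\mu$ , $\rho$ , and $L$ .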

Let us now specialize to the composite setting under Assumption A. Since Assumption A implies Assumption D, both subgradient Algorithms 1 and 2 will enjoy a linear rate of convergence when initialized sufficiently close to the solution set. The following theorem, on the other hand, shows that the prox-linear method will enjoy a quadratic rate of convergence (at the price of a higher per-iteration cost). Guarantees of this type have appeared, for example, in [27, 7, 25].

Theorem 5.4 (Prox-linear algorithm).

Suppose Assumption A holds. Choose any $\beta\geq\rho$ in Algorithm 3 and set $\gamma:=\rho/\beta$ . Then Algorithm 3 initialized at any point $x_{0}\in\mathcal{T}_{\gamma}$ converges quadratically:

[TABLE]
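To see the quadratic rate in the simplest possible setting, consider again the toy function $f(x)=|x^{2}-1|$ on the real line (our choice, with $\rho=2$ and $\mu=1$ ). The sketch assumes the standard prox-linear update, which minimizes the convex model $f_{x_{k}}(y)$ plus $\frac{\beta}{2}|y-x_{k}|^{2}$ ; in one dimension this subproblem has a closed-form solution, so no solver is needed. The function name is hypothetical.

```python
def prox_linear_step(x, beta):
    """One prox-linear step for f(x) = |x^2 - 1| on the real line.

    Minimizes |a + g*s| + (beta/2)*s^2 over the step s, where a = F(x) = x^2 - 1
    and g = F'(x) = 2x. The minimizer either lands on the kink a + g*s = 0 or
    has length |g|/beta, whichever is shorter.
    """
    a, g = x * x - 1.0, 2.0 * x
    if a == 0.0 or g == 0.0:
        return x
    length = min(abs(a) / abs(g), abs(g) / beta)
    direction = -1.0 if a * g > 0 else 1.0     # move so as to decrease |a + g*s|
    return x + direction * length

x, beta = 1.4, 2.0            # beta = rho, and dist(x0, {+-1}) = 0.4 <= mu / rho
errors = [abs(x - 1.0)]
for _ in range(8):
    x = prox_linear_step(x, beta)
    errors.append(abs(x - 1.0))
print(errors)
```

Near the solution the step always lands on the kink, so the update coincides with Newton's method for $x^{2}-1=0$ and the error is roughly squared at every iteration.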

We now apply the results above to the low-rank matrix factorization problem under RIP, whose regularity properties were verified in Section 4. In particular, we have the following efficiency guarantees of the subgradient and prox-linear methods applied to this problem.

Corollary 5.5 (Convergence guarantees under RIP (symmetric)).

Suppose Assumptions B and C are valid with $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ and consider the optimization problem

[TABLE]

Choose any matrix $X_{0}$ satisfying

[TABLE]

Define the condition number $\chi:=\sigma_{1}(M_{\sharp})/\sigma_{r}(M_{\sharp})$ . Then the following are true.

1. (Polyak subgradient) Algorithm 1 initialized at $X_{0}$ produces iterates that converge linearly to $\mathcal{D}^{\ast}(M_{\sharp})$ , that is

[TABLE]

2. (Geometric subgradient) Algorithm 2 with $\lambda=\frac{0.81\kappa_{3}^{2}\sqrt{\sigma_{r}(M_{\sharp})}}{2\kappa_{2}(\kappa_{3}+2\kappa_{2}\sqrt{\chi})}$ , $q=\sqrt{1-\frac{0.2}{1+4\kappa_{2}^{2}\chi/\kappa_{3}^{2}}}$ and initialized at $X_{0}$ converges linearly:

[TABLE]

3. (Prox-linear) Algorithm 3 with $\beta=\rho$ and initialized at $X_{0}$ converges quadratically:

[TABLE]

5.1 Guarantees under local regularity

As explained in Section 4, Assumptions A and D are reasonable in the symmetric setting under RIP. The asymmetric setting is more nuanced. Indeed, the solution set is unbounded, while uniform bounds on the sharpness and subgradient norms are only valid on bounded sets. One remedy, discussed in [39], is to modify the optimization formulation by introducing a form of regularization:

[TABLE]

In this section, we take a different approach that requires no modification to the optimization problem nor the algorithms. The key idea is to show that if the problem is well-conditioned only on a neighborhood of a particular solution, then the iterates will remain in the neighborhood provided the initial point is sufficiently close to the solution. In fact, we will see that the iterates themselves must converge. The proofs of the results in this section (Theorems 5.6, 5.7, and 5.8) are deferred to Appendix A.

We begin with the following localized version of Assumption D.

Assumption E.

Consider the optimization problem,

[TABLE]

Fix an arbitrary point $\bar{x}\in\mathcal{X}^{*}$ and suppose that the following properties hold for some real numbers $\epsilon,\mu,\rho>0$ .

1. (Local weak convexity) The set $\mathcal{X}$ is closed and convex, and the bound holds:

[TABLE]

2. (Local sharpness) The inequality holds:

[TABLE]

The following two theorems establish convergence guarantees of the two subgradient methods under Assumption E. Abusing notation slightly, we define the local quantities:

[TABLE]

Theorem 5.6 (Polyak subgradient method (local regularity)).

Suppose Assumption E holds and fix an arbitrary point $x_{0}\in B_{\epsilon/4}(\bar{x})$ satisfying

[TABLE]

Then Algorithm 1 initialized at $x_{0}$ produces iterates $x_{k}$ that always lie in $B_{\epsilon}(\bar{x})$ and satisfy

[TABLE]

Moreover the iterates converge to some point $x_{\infty}\in\mathcal{X}^{*}$ at the R-linear rate

[TABLE]

Theorem 5.7 (Geometrically decaying subgradient method (local regularity)).

Suppose that Assumption E holds and that $\tau\leq\frac{1}{\sqrt{2}}$ . Define $\gamma=\frac{\epsilon\rho}{4L+\epsilon\rho}$ , $\lambda=\frac{\gamma\mu^{2}}{\rho L}$ , and $q=\sqrt{1-(1-\gamma)\tau^{2}}$ . Then Algorithm 2 initialized at any point $x_{0}\in B_{\epsilon/4}(\bar{x})\cap\mathcal{T}_{\gamma}$ generates iterates $x_{k}$ that always lie in $B_{\epsilon}(\bar{x})$ and satisfy

[TABLE]

Moreover, the iterates converge to some point $x_{\infty}\in\mathcal{X}^{*}$ at the R-linear rate

[TABLE]

We end the section by specializing to the composite setting and analyzing the prox-linear method. The following is the localized version of Assumption A.

Assumption F.

Consider the optimization problem,

[TABLE]

where the function $h(\cdot)$ and the set $\mathcal{X}$ are convex and $F(\cdot)$ is differentiable. Fix a point $\bar{x}\in\mathcal{X}^{*}$ and suppose that the following properties holds for some real numbers $\epsilon,\mu,\rho>0$ .

1. (Approximation accuracy) The convex models $f_{x}(y):=h(F(x)+\nabla F(x)(y-x))$ satisfy the estimate:

[TABLE]

2. (Sharpness) The inequality holds:

[TABLE]

The following theorem provides convergence guarantees for the prox-linear method under Assumption F.

Theorem 5.8 (Prox-linear (local)).

Suppose Assumption F holds, choose any $\beta\geq\rho$ , and fix an arbitrary point $x_{0}\in B_{\epsilon/2}(\bar{x})$ satisfying

[TABLE]

Then Algorithm 3 initialized at $x_{0}$ generates iterates $x_{k}$ that always lie in $B_{\epsilon}(\bar{x})$ and satisfy

[TABLE]

Moreover the iterates converge to some point $x_{\infty}\in\mathcal{X}^{*}$ at the quadratic rate

[TABLE]

With the above generic results in hand, we can now derive convergence guarantees for the subgradient and prox-linear methods for asymmetric low-rank matrix recovery problems. To summarize, the prox-linear method converges quadratically, as long as it is initialized within constant relative error of the solution. The guarantees for the subgradient methods are less satisfactory: the size of the region of linear convergence scales with the condition number of $M_{\sharp}$. The reason is that the proof estimates the region of convergence using the length of the iterate path, which scales with the condition number. The dependence on the condition number can in general be eliminated by introducing the regularization $\|X^{\top}X-YY^{\top}\|_{F}$, as suggested in [39]. Still, the results we present here are notable even for the subgradient method. For example, for rank $r=1$ instances satisfying RIP (e.g. blind deconvolution), the condition number of $M_{\sharp}$ is always one, and therefore regularization is not required at all for the subgradient and prox-linear methods.

Corollary 5.9 (Convergence guarantees under RIP (asymmetric)).

Suppose Assumptions B and C are valid with $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ and consider the optimization problem

[TABLE]

Define $X_{\sharp}:=U\sqrt{\Lambda}$ and $Y_{\sharp}=\sqrt{\Lambda}V^{\top}$ , where $M_{\sharp}=U\Lambda V^{\top}$ is any compact singular value decomposition of $M_{\sharp}$ . Define also the condition number $\chi:=\sigma_{1}(M_{\sharp})/\sigma_{r}(M_{\sharp})$ . Then there exists $\eta>0$ depending only on $\kappa_{2}$ , $\kappa_{3}$ , and $\sigma(M_{\sharp})$ such that the following are true.

1. (Polyak subgradient) Algorithm 1 initialized at $(X_{0},Y_{0})$ satisfying $\frac{\|(X_{0},Y_{0})-(X_{\sharp},Y_{\sharp})\|_{F}}{\sqrt{\sigma_{r}(M_{\sharp})}}\lesssim\min\{1,\frac{\kappa_{3}^{2}}{\kappa_{2}^{2}\chi},\frac{\kappa_{3}}{\kappa_{2}}\}$ will generate an iterate sequence that converges at the linear rate:

[TABLE]

2. (Geometric subgradient) Algorithm 2 initialized at $(X_{0},Y_{0})$ satisfying $\frac{\|(X_{0},Y_{0})-(X_{\sharp},Y_{\sharp})\|_{F}}{\sqrt{\sigma_{r}(M_{\sharp})}}\lesssim\min\{1,\frac{\kappa_{3}}{\kappa_{2}\chi}\}$ will generate an iterate sequence that converges at the linear rate:

[TABLE]

3. (Prox-linear) Algorithm 3 initialized at $(X_{0},Y_{0})$ satisfying $\frac{f(x_{0})-\min_{\mathcal{X}}f}{\sigma_{r}(M_{\sharp})}\lesssim\min\{\kappa_{2},\kappa_{3}^{2}/\kappa_{2}\}$ and $\frac{\|(X_{0},Y_{0})-(X_{\sharp},Y_{\sharp})\|_{F}}{\sqrt{\sigma_{r}(M_{\sharp})}}\lesssim 1$ will generate an iterate sequence that converges at the quadratic rate:

[TABLE]
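The balanced factors $X_{\sharp}=U\sqrt{\Lambda}$, $Y_{\sharp}=\sqrt{\Lambda}V^{\top}$ and the condition number $\chi$ appearing in Corollary 5.9 can be computed from a compact singular value decomposition. The following is a minimal NumPy sketch; the function name and the random test matrix are hypothetical.

```python
import numpy as np


def balanced_factors(M, r):
    """Return X = U sqrt(Lambda), Y = sqrt(Lambda) V^T and chi = sigma_1 / sigma_r."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]  # keep the top-r singular triplets
    sqrt_lam = np.sqrt(s)
    X = U * sqrt_lam               # scales column j of U by sqrt(sigma_j)
    Y = sqrt_lam[:, None] * Vt     # scales row j of V^T by sqrt(sigma_j)
    chi = s[0] / s[-1]
    return X, Y, chi


rng = np.random.default_rng(0)
M_sharp = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # rank 2
X_sharp, Y_sharp, chi = balanced_factors(M_sharp, r=2)
```

By construction $X_{\sharp}^{\top}X_{\sharp}=Y_{\sharp}Y_{\sharp}^{\top}=\Lambda$, so the regularizer $\|X^{\top}X-YY^{\top}\|_{F}$ mentioned above vanishes at this particular factorization.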

6 Examples of $\ell_{1}/\ell_{2}$ RIP

In this section, we survey three matrix recovery problems from different fields, including physics, signal processing, control theory, wireless communications, and machine learning, among others. In all cases, the problems satisfy $\ell_{1}/\ell_{2}$ RIP and the $\mathcal{I}$ -outlier bounds and consequently, the convergence results in Corollaries 5.5 and 5.9 immediately apply. Most of the RIP results in this section were previously known (albeit under more restrictive assumptions); we provide self-contained arguments in Appendix B for the sake of completeness. On the other hand, the use of nonsmooth optimization in these problems and the corresponding convergence guarantees based on RIP are, for the most part, new.

For the rest of this section we will assume the following data-generating mechanism.

Definition 6.1 (Data-generating mechanism).

A random linear map $\mathcal{A}\colon{\bf R}^{d_{1}\times d_{2}}\to{\bf R}^{m}$ and a random index set $\mathcal{I}\subset[m]$ are drawn independently of each other. We assume moreover that the outlier frequency $p_{\mathrm{fail}}:=|\mathcal{I}|/m$ satisfies $p_{\mathrm{fail}}\in[0,1/2)$ almost surely. We then observe the corrupted measurements

[TABLE]

where $\eta$ is an arbitrary vector. In particular, $\eta$ could be correlated with $\mathcal{A}$ .
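As an illustration of Definition 6.1, the sketch below corrupts a vector of clean measurements on an index set $\mathcal{I}$. It assumes the corruption is additive, that is, $b_{i}=\mathcal{A}(M_{\sharp})_{i}+\eta_{i}$ for $i\in\mathcal{I}$ and $b_{i}=\mathcal{A}(M_{\sharp})_{i}$ otherwise; the function name and the data are hypothetical.

```python
import numpy as np


def corrupt_measurements(clean, outliers, eta):
    """Add eta_i to the entries indexed by `outliers`; leave the rest untouched.

    Assumption: additive corruption, with fewer than half of the entries affected.
    """
    m = clean.shape[0]
    if len(outliers) >= m / 2:
        raise ValueError("the outlier frequency p_fail must lie in [0, 1/2)")
    b = clean.copy()
    b[outliers] += eta[outliers]
    return b


rng = np.random.default_rng(1)
m = 20
clean = rng.standard_normal(m)                    # stands in for A(M_sharp)
outliers = rng.choice(m, size=6, replace=False)   # p_fail = 6 / 20 = 0.3
eta = 100.0 * np.ones(m)                          # arbitrary, possibly adversarial
b = corrupt_measurements(clean, outliers, eta)
```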

Throughout this section, we consider four distinct linear operators $\mathcal{A}$ .

Matrix Sensing.

In this scenario, measurements are generated as follows:

[TABLE]

where $P_{i}\in{\bf R}^{d_{1}\times d_{2}}$ are fixed matrices.

Quadratic Sensing I .

In this scenario, $M_{\sharp}\in{\bf R}^{d\times d}$ is assumed to be a PSD rank $r$ matrix with factorization $M_{\sharp}=X_{\sharp}X_{\sharp}^{\top}$ and measurements are generated as follows:

[TABLE]

where $p_{i}\in{\bf R}^{d}$ are fixed vectors.

Quadratic Sensing II .

In this scenario, $M_{\sharp}\in{\bf R}^{d\times d}$ is assumed to be a PSD rank $r$ matrix with factorization $M_{\sharp}=X_{\sharp}X_{\sharp}^{\top}$ and measurements are generated as follows:

[TABLE]

where $p_{i},\tilde{p}_{i}\in{\bf R}^{d}$ are fixed vectors.

Bilinear Sensing.

In this scenario, $M_{\sharp}\in{\bf R}^{d_{1}\times d_{2}}$ is assumed to be a rank $r$ matrix with factorization $M_{\sharp}=XY$ and measurements are generated as follows:

[TABLE]

where $p_{i}\in{\bf R}^{d_{1}}$ and $q_{i}\in{\bf R}^{d_{2}}$ are fixed vectors.
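The four operators can be sketched in a few lines of NumPy. The formulas used below are assumptions, namely the standard forms $\langle P_{i},M\rangle$ for matrix sensing, $p_{i}^{\top}Mp_{i}$ for quadratic sensing I, $p_{i}^{\top}Mp_{i}-\tilde{p}_{i}^{\top}M\tilde{p}_{i}$ for quadratic sensing II, and $p_{i}^{\top}Mq_{i}$ for bilinear sensing; the function names are hypothetical.

```python
import numpy as np


def matrix_sensing(P, M):
    """P has shape (m, d1, d2); returns the vector of inner products <P_i, M>."""
    return np.einsum("ijk,jk->i", P, M)


def quadratic_sensing_1(p, M):
    """p has shape (m, d); returns p_i^T M p_i for each i."""
    return np.einsum("ij,jk,ik->i", p, M, p)


def quadratic_sensing_2(p, p_tilde, M):
    """Returns p_i^T M p_i - p_tilde_i^T M p_tilde_i for each i."""
    return quadratic_sensing_1(p, M) - quadratic_sensing_1(p_tilde, M)


def bilinear_sensing(p, q, M):
    """p has shape (m, d1), q has shape (m, d2); returns p_i^T M q_i for each i."""
    return np.einsum("ij,jk,ik->i", p, M, q)
```

Under these assumed forms, bilinear sensing is the special case of matrix sensing with rank-one matrices $P_{i}=p_{i}q_{i}^{\top}$, which is why it can be stored far more compactly.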

The matrix, quadratic, and bilinear sensing problems have been considered in a number of papers and in a variety of applications. The first theoretical properties for matrix sensing were discussed in [30, 50, 13]. Quadratic sensing in its full generality appeared in [18] and is a higher-rank generalization of the much older (real) phase retrieval problem [10, 14, 35]. Besides phase retrieval, quadratic sensing has applications to covariance sketching, shallow neural networks, and quantum state tomography; see for example [40] for a discussion. Bilinear sensing is a natural modification of quadratic sensing and is a higher-rank generalization of the blind deconvolution problem [1]; it was first proposed and studied in [8].

The reader is reminded that once $\ell_{1}/\ell_{2}$ RIP guarantees, in particular Assumptions B and C, are established for the above four operators, the guarantees of Corollaries 5.5 and 5.9 immediately take hold for the problems

[TABLE]

and

[TABLE]

respectively. Thus, we turn our attention to establishing such guarantees.
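To illustrate the kind of behavior these guarantees describe, the following sketch runs a Polyak subgradient iteration on a rank-one instance of quadratic sensing I. Several assumptions are made: the objective is taken to be $f(x)=\frac{1}{m}\sum_{i}|(p_{i}^{\top}x)^{2}-b_{i}|$, the data are noiseless so that $\min f=0$ is known, the constraint set is all of ${\bf R}^{d}$ so no projection is needed, and the step is the standard Polyak step $x-\frac{f(x)-\min f}{\|\zeta\|^{2}}\zeta$. The problem sizes, seed, and initialization radius are hypothetical.

```python
import numpy as np


def polyak_subgradient(p, b, x0, iters=2000):
    """Polyak subgradient method for f(x) = (1/m) * sum_i |(p_i^T x)^2 - b_i|.

    Assumes noiseless measurements, so the optimal value min f = 0 is known.
    """
    x = x0.copy()
    m = b.shape[0]
    for _ in range(iters):
        px = p @ x
        residual = px**2 - b
        f = np.abs(residual).mean()
        # A subgradient of f at x: (2/m) * sum_i sign(residual_i) (p_i^T x) p_i.
        g = (2.0 / m) * (p.T @ (np.sign(residual) * px))
        g_norm_sq = g @ g
        if f == 0.0 or g_norm_sq == 0.0:
            break  # already at a minimizer
        x = x - (f / g_norm_sq) * g  # Polyak step with min f = 0
    return x


rng = np.random.default_rng(0)
d, m = 20, 400
x_sharp = rng.standard_normal(d)
x_sharp /= np.linalg.norm(x_sharp)
p = rng.standard_normal((m, d))
b = (p @ x_sharp) ** 2                             # noiseless measurements
noise = rng.standard_normal(d)
x0 = x_sharp + 0.1 * noise / np.linalg.norm(noise)  # relative error 0.1
x = polyak_subgradient(p, b, x0)
# The signal is identifiable only up to sign.
error = min(np.linalg.norm(x - x_sharp), np.linalg.norm(x + x_sharp))
```

With outliers present the optimal value is generally unknown, in which case a geometrically decaying step size is the natural substitute for the Polyak step.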

6.1 Warm-up: $\ell_{2}/\ell_{2}$ RIP for matrix sensing with Gaussian design

In this section, we are primarily interested in the $\ell_{1}/\ell_{2}$ RIP for the above four linear operators. However, as a warm-up, we first consider the $\ell_{2}/\ell_{2}$ -RIP property for matrix sensing with Gaussian $P_{i}$ . The following result appears in [50, 13].

Theorem 6.2 ( $\ell_{2}/\ell_{2}$ -RIP for matrix sensing).

For any $\delta\in(0,1)$ there exist constants $c,C>0$ depending only on $\delta$ such that if the entries of $P_{i}$ are i.i.d. standard Gaussian and $m\geq cr(d_{1}+d_{2})\log(d_{1}d_{2})$ , then with probability at least $1-\exp\left(-Cm\right)$ , the estimate

[TABLE]

holds simultaneously for all $M\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ . Consequently, Assumption B is satisfied.

Following the general recipe of the paper, we see that the nonsmooth formulation

[TABLE]

is immediately amenable to subgradient and prox-linear algorithms in the noiseless setting $\mathcal{I}=\emptyset$ . In particular, a direct analogue of Corollary 5.9, which was stated for the penalty function $h=\frac{1}{m}\|\cdot\|_{1}$ , holds; we omit the straightforward details.

6.2 The $\ell_{1}/\ell_{2}$ RIP and $\mathcal{I}$ -outlier bounds: quadratic and bilinear sensing

We now turn our attention to the $\ell_{1}/\ell_{2}$ RIP for more general classes of linear maps than the i.i.d. Gaussian matrices considered in Theorem 6.2. To establish such guarantees, one must ensure that the linear maps $\mathcal{A}$ have light tails and are robustly injective on certain spaces of matrices. The first property leads to tight concentration results, while the second yields the existence of a lower RIP constant $\kappa_{1}$ .

Assumption G (Matrix Sensing).

The matrices $\{P_{i}\}$ are i.i.d. realizations of an $\eta$ -sub-Gaussian random matrix $P\in{\bf R}^{d_{1}\times d_{2}}$ (by this we mean that the vectorized matrix $\mathbf{vec}(P)$ is an $\eta$ -sub-Gaussian random vector). Furthermore, there exists a numerical constant $\alpha>0$ such that

[TABLE]

Assumption H (Quadratic Sensing I).

The vectors $\{p_{i}\}$ are i.i.d. realizations of an $\eta$ -sub-Gaussian random vector $p\in{\bf R}^{d}.$ Furthermore, there exists a numerical constant $\alpha>0$ such that

[TABLE]

Assumption I (Quadratic Sensing II).

The vectors $\{p_{i}\},\{\tilde{p}_{i}\}$ are i.i.d. realizations of an $\eta$ -sub-Gaussian random vector $p\in{\bf R}^{d}.$ Furthermore, there exists a numerical constant $\alpha>0$ such that

[TABLE]

Assumption J (Bilinear Sensing).

The vectors $\{p_{i}\}$ and $\{q_{i}\}$ are i.i.d. realizations of $\eta$ -sub-Gaussian random vectors $p\in{\bf R}^{d_{1}}$ and $q\in{\bf R}^{d_{2}},$ respectively. Furthermore, there exists a numerical constant $\alpha>0$ such that

[TABLE]

Assumptions G-J are all valid for i.i.d. Gaussian realizations with identity covariance, as the following lemma shows. We defer its proof to Appendix B.1.

Lemma 6.3.

Assumption G holds for matrices $P$ with i.i.d. standard Gaussian entries. Assumptions H and I hold for vectors $p,\tilde{p}$ with i.i.d. standard Gaussian entries. Assumption J holds for vectors $p$ and $q$ with i.i.d. standard Gaussian entries.

We can now state the main RIP guarantees under the above assumptions. Throughout all the results, we fix the data generating mechanism as in Definition 6.1. Then, we wish to establish the inequalities

[TABLE]

and

[TABLE]

and, hence, Assumptions B and C, respectively, for certain constants $\kappa_{1},\kappa_{2},$ and $\kappa_{3}$ . We defer the proof of the following theorem to Appendix B.2.

Theorem 6.4 ( $\ell_{1}/\ell_{2}$ RIP and $\mathcal{I}$ -outlier bounds).

There exist numerical constants $c_{1},\dots,c_{6}>0$ depending only on $\alpha,\eta$ such that the following hold for the corresponding measurement operators described in Equations (6.2), (6.3), (6.4), and (6.5), respectively.

1. (Matrix sensing) Suppose Assumption G holds. Then provided $m\geq\frac{c_{1}}{(1-2p_{\mathrm{fail}})^{2}}r(d_{1}+d_{2}+1)\ln\left(c_{2}+\frac{c_{2}}{1-2p_{\mathrm{fail}}}\right)$ , we have with probability at least $1-4\exp\left(-c_{3}(1-2p_{\mathrm{fail}})^{2}m\right)$ that every matrix $M\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ satisfies (6.11) and (6.12) with constants $\kappa_{1}=c_{4},\kappa_{2}=c_{5}$ and $\kappa_{3}=c_{6}(1-2p_{\mathrm{fail}})$ .

2. (Quadratic sensing I) Suppose Assumption H holds. Then provided $m\geq\frac{c_{1}}{(1-2p_{\mathrm{fail}})^{2}}r^{2}(2d+1)\ln\left(c_{2}+\frac{c_{2}}{1-2p_{\mathrm{fail}}}\sqrt{r}\right)$ , we have with probability at least $1-4\exp\left(-c_{3}(1-2p_{\mathrm{fail}})^{2}m/r\right)$ that every matrix $M\in{\bf R}^{d\times d}$ of rank at most $2r$ satisfies (6.11) and (6.12) with constants $\kappa_{1}=c_{4},\kappa_{2}=c_{5}\cdot\sqrt{r}$ and $\kappa_{3}=c_{6}(1-2p_{\mathrm{fail}})$ .

3. (Quadratic sensing II) Suppose Assumption I holds. Then provided $m\geq\frac{c_{1}}{(1-2p_{\mathrm{fail}})^{2}}r(2d+1)\ln\left(c_{2}+\frac{c_{2}}{1-2p_{\mathrm{fail}}}\right)$ , we have with probability at least $1-4\exp\left(-c_{3}(1-2p_{\mathrm{fail}})^{2}m\right)$ that every matrix $M\in{\bf R}^{d\times d}$ of rank at most $2r$ satisfies (6.11) and (6.12) with constants $\kappa_{1}=c_{4},\kappa_{2}=c_{5}$ and $\kappa_{3}=c_{6}(1-2p_{\mathrm{fail}})$ .

4. (Bilinear sensing) Suppose Assumption J holds. Then provided $m\geq\frac{c_{1}}{(1-2p_{\mathrm{fail}})^{2}}r(d_{1}+d_{2}+1)\ln\left(c_{2}+\frac{c_{2}}{1-2p_{\mathrm{fail}}}\right)$ , we have with probability at least $1-4\exp\left(-c_{3}(1-2p_{\mathrm{fail}})^{2}m\right)$ that every matrix $M\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ satisfies (6.11) and (6.12) with constants $\kappa_{1}=c_{4},\kappa_{2}=c_{5}$ and $\kappa_{3}=c_{6}(1-2p_{\mathrm{fail}})$ .
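The dependence of the sample-size requirement on the outlier frequency can be explored numerically. The sketch below evaluates the bound $\frac{c_{1}}{(1-2p_{\mathrm{fail}})^{2}}r(d_{1}+d_{2}+1)\ln\left(c_{2}+\frac{c_{2}}{1-2p_{\mathrm{fail}}}\right)$ shared by the matrix sensing and bilinear sensing items (and by quadratic sensing II with $d_{1}=d_{2}=d$). The theorem does not specify $c_{1}$ and $c_{2}$; the default values of $1$ are placeholders, so only the scaling, not the absolute numbers, is meaningful.

```python
import math


def required_measurements(r, d1, d2, p_fail, c1=1.0, c2=1.0):
    """Evaluate the lower bound on m from Theorem 6.4 (matrix/bilinear sensing).

    c1 and c2 are placeholder values; the theorem only asserts their existence.
    """
    if not 0 <= p_fail < 0.5:
        raise ValueError("p_fail must lie in [0, 1/2)")
    margin = 1 - 2 * p_fail
    return c1 / margin**2 * r * (d1 + d2 + 1) * math.log(c2 + c2 / margin)


m_clean = required_measurements(r=1, d1=10, d2=10, p_fail=0.0)
m_noisy = required_measurements(r=1, d1=10, d2=10, p_fail=0.4)
```

The bound grows linearly in $r(d_{1}+d_{2})$ and blows up as $p_{\mathrm{fail}}$ approaches $1/2$, in line with the factor $(1-2p_{\mathrm{fail}})^{-2}$.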

The guarantees of Theorem 6.4 were previously known under stronger assumptions. In particular, item (1) generalizes the results in [39] for the pure Gaussian setting. The case $r=1$ of item (2) can be found, in a slightly different form, in [29, 27]. Item (3) sharpens slightly the analogous guarantee in [18] by weakening the assumptions on the moments of the measuring vectors to the uniform lower bound (6.9). Special cases of item (4) were established in [16], for the case $r=1$ , and [8], for Gaussian measurement vectors.

We note that all linear mappings require the same number of measurements in order to satisfy RIP and $\mathcal{I}$ -outlier bounds, except for the quadratic sensing I operator, which incurs an extra $r$ -factor. This reveals the utility of the quadratic sensing II operator, which achieves optimal sample complexity. For larger scale problems, a shortcoming of the matrix sensing operator (6.2) is that $md_{1}d_{2}$ scalars are required to represent the map $\mathcal{A}$ . In contrast, all other measurement operators may be represented with only $m(d_{1}+d_{2})$ scalars.
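A short arithmetic sketch makes the storage gap concrete; the problem sizes are hypothetical.

```python
def storage_scalars(m, d1, d2):
    """Number of scalars needed to store the measurement map A."""
    return {
        "matrix_sensing": m * d1 * d2,   # m dense d1 x d2 matrices P_i
        "vector_based": m * (d1 + d2),   # m pairs of vectors of sizes d1 and d2
    }


counts = storage_scalars(m=10_000, d1=1_000, d2=1_000)
ratio = counts["matrix_sensing"] // counts["vector_based"]
```

For these sizes the dense matrices need $10^{10}$ scalars against $2\times 10^{7}$ for the vector-based operators, a factor of $d_{1}d_{2}/(d_{1}+d_{2})=500$.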

7 Matrix Completion

In the previous sections, we saw that low-rank recovery problems satisfying RIP lead to well-conditioned nonsmooth formulations. We claim, however, that the general framework of sharpness and approximation is applicable even for problems without RIP. We consider two such problems, namely matrix completion in this section and robust PCA in Section 8, to follow. Both problems will be considered in the symmetric setting.

The goal of the matrix completion problem is to recover a PSD rank $r$ matrix $M_{\sharp}\in\mathcal{S}^{d}$ given access only to a subset of its entries. Henceforth, let $X_{\sharp}\in{\bf R}^{d\times r}$ be a matrix satisfying $M_{\sharp}=X_{\sharp}X_{\sharp}^{\top}$ . Throughout, we assume the incoherence condition $\|X_{\sharp}\|_{2,\infty}\leq\sqrt{\frac{\nu r}{d}}$ for some $\nu>0$ . We also make the fairly strong assumption that the singular values of $X_{\sharp}$ are all equal: $\sigma_{1}(X_{\sharp})=\sigma_{2}(X_{\sharp})=\ldots=\sigma_{r}(X_{\sharp})=1$ . This assumption is needed for our theoretical results. We let $\Omega\subseteq[d]\times[d]$ be an index set generated by the Bernoulli model, that is, $\mathbb{P}((i,j),(j,i)\in\Omega)=p$ independently for all $1\leq i\leq j\leq d$ . Let $\Pi_{\Omega}\colon\mathcal{S}^{d}\to{\bf R}^{|\Omega|}$ be the projection onto the entries indexed by $\Omega$ . We consider the following optimization formulation of the problem

[TABLE]
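The setup can be simulated as follows. This is a sketch under stated assumptions: the objective is taken to be the Euclidean norm of the observed residual, $f(X)=\|\Pi_{\Omega}(XX^{\top})-\Pi_{\Omega}(M_{\sharp})\|_{2}$, the incoherence parameter is computed as the smallest $\nu$ satisfying the bound above, and all function names and sizes are hypothetical.

```python
import numpy as np


def sample_symmetric_mask(d, p, rng):
    """Bernoulli model: each pair {(i, j), (j, i)} with i <= j is observed w.p. p."""
    upper = np.triu(rng.random((d, d)) < p)  # independent draws for i <= j
    return upper | upper.T                   # mirror to obtain a symmetric mask


def incoherence(X):
    """Smallest nu with ||X||_{2,inf} <= sqrt(nu * r / d)."""
    d, r = X.shape
    return d / r * np.max(np.sum(X**2, axis=1))


def completion_loss(X, M, mask):
    """Assumed objective: Euclidean norm of the residual on observed entries."""
    return np.linalg.norm(mask * (X @ X.T - M))


rng = np.random.default_rng(0)
d, r, p = 30, 2, 0.5
# Orthonormal columns, so that all singular values of X_sharp equal one.
X_sharp, _ = np.linalg.qr(rng.standard_normal((d, r)))
M_sharp = X_sharp @ X_sharp.T
mask = sample_symmetric_mask(d, p, rng)
nu = incoherence(X_sharp)
```

Because the squared row norms of $X_{\sharp}$ sum to $r$, the computed $\nu$ is always at least one, with equality exactly when all rows have the same norm.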

We will show that both the Polyak subgradient method and an appropriately modified prox-linear algorithm converge linearly to the solution set under reasonable initialization. Moreover, we will see that the linear rate of convergence for the prox-linear method is much better than that for the subgradient method.

To simplify notation, we set

[TABLE]

We begin by estimating the sharpness constant $\mu$ of the objective function. Fortunately, this estimate follows directly from inequalities (58) and (59a) in [19].

Lemma 7.1 (Sharpness [19]).

There are numerical constants $c_{1},c_{2}>0$ such that the following holds. If $p\geq c_{2}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ , then with probability $1-c_{1}d^{-2}$ , the estimate

[TABLE]

holds uniformly for all $X\in\mathcal{X}$ with ${\rm dist}(X,\mathcal{D}^{*})\leq c_{1}$ .

Let us next estimate the approximation accuracy $|f(Z)-f_{X}(Z)|$ , where

[TABLE]

To this end, we will require the following result.

Lemma 7.2 (Lemma 5 in [19]).

There is a numerical constant $c>0$ such that the following holds. If $p\geq\frac{c}{\epsilon^{2}}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ for some $\epsilon\in(0,1)$ , then with probability at least $1-2d^{-4}$ , the estimates

1. $\frac{1}{\sqrt{p}}\|\Pi_{\Omega}(HH^{\top})\|_{F}\leq\sqrt{(1+\epsilon)}\|H\|_{F}^{2}+\sqrt{\epsilon}\|H\|_{F}$ ; and

2. $\frac{1}{\sqrt{p}}\|\Pi_{\Omega}(GH^{\top})\|_{F}\leq\sqrt{\nu r}\|G\|_{F}$

hold uniformly for all matrices $H$ with $\|H\|_{2,\infty}\leq 6\sqrt{\frac{\nu r}{d}}$ and $G\in{\bf R}^{d\times r}$ .

An estimate of the approximation error $|f(Z)-f_{X}(Z)|$ is now immediate.

Lemma 7.3 (Approximation accuracy and Lipschitz continuity).

There is a numerical constant $c>0$ such that the following holds. If $p\geq\frac{c}{\epsilon^{2}}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ for some $\epsilon\in(0,1)$ , then with probability at least $1-2d^{-4}$ , the estimates

[TABLE]

hold uniformly for all $X,Y\in\mathcal{X}$ .

Proof.

The first inequality follows immediately by observing the estimate

[TABLE]

and using Lemma 7.2. To see the second inequality, observe

[TABLE]

where the last inequality follows by Part 2 of Lemma 7.2. ∎

Note that the approximation bound in Lemma 7.3 is not in terms of the squared Euclidean norm. Therefore the results in Section 5 do not apply directly. Nonetheless, it is straightforward to modify the prox-linear method to take into account the new approximation bound. The proof of the following lemma appears in the appendix.

Lemma 7.4.

Suppose that Assumption A holds with the approximation property replaced by

[TABLE]

for some real $a,b\geq 0$ . Consider the iterates generated by the process:

[TABLE]

Then as long as $x_{0}$ satisfies ${\rm dist}(x_{0},\mathcal{X}^{*})\leq\frac{\mu-2b}{2a}$ , the iterates converge linearly:

[TABLE]

Combining Lemma 7.4 with our estimates of the sharpness and approximation accuracy, we deduce the following convergence guarantee for matrix completion.

Corollary 7.5 (Prox-linear method for matrix completion).

There are numerical constants $c_{0},c,C>0$ such that the following holds. If $p\geq\frac{c}{\epsilon^{2}}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ for some $\epsilon\in(0,1)$ , then with probability at least $1-c_{0}d^{-2}$ , the iterates generated by the modified prox-linear algorithm

[TABLE]

satisfy

[TABLE]

In particular, the iterates converge linearly as long as ${\rm dist}(X_{0},\mathcal{D}^{*})<\frac{C-2\sqrt{\epsilon}}{2\sqrt{(1+\epsilon)}}$ .

Proof.

By invoking Proposition 4.3 and Lemmas 7.1 and 7.3 we may appeal to Lemma 7.4 with $a=\sqrt{p(1+\epsilon)}$ , $b=\sqrt{p\epsilon}$ , and $\mu=\sqrt{2c_{1}p(\sqrt{2}-1)}$ . The result follows immediately. ∎

To summarize, there exist numerical constants $c_{0},c_{1},c_{2},c_{3}>0$ such that the following is true with probability at least $1-c_{0}d^{-2}$ . In the regime

[TABLE]

the prox-linear method will converge at the rapid linear rate,

[TABLE]

when initialized at $X_{0}\in\mathcal{X}$ satisfying ${\rm dist}(X_{0},\mathcal{D}^{*})<c_{2}$ .

As for the prox-linear method, the results of Section 5 do not immediately yield convergence guarantees for the Polyak subgradient method. Nonetheless, it is straightforward to show that the standard Polyak subgradient method still enjoys local linear convergence guarantees. The proof is a straightforward modification of the argument in [23, Theorem 3.1], and appears in the appendix.

Theorem 7.6.

Suppose that Assumption A holds with the approximation property replaced by

[TABLE]

for some real $a,b\geq 0$ . Consider the iterates $\{x_{k}\}$ generated by the Polyak subgradient method in Algorithm 1. Then as long as the sharpness constant satisfies $\mu>2b$ and $x_{0}$ satisfies ${\rm dist}(x_{0},\mathcal{X}^{*})\leq\gamma\cdot\frac{\mu-2b}{2a}$ for some $\gamma<1$ , the iterates converge linearly

[TABLE]

Finally, combining Theorem 7.6 with our estimates of the sharpness and approximation accuracy, we deduce the following convergence guarantee for matrix completion.

Corollary 7.7 (Subgradient method for matrix completion).

There are numerical constants $c_{0},c,C>0$ such that the following holds. If $p\geq\frac{c}{\epsilon^{2}}(\frac{\nu^{2}r^{2}}{d}+\frac{\log d}{d})$ for some $\epsilon\in(0,1)$ , then with probability at least $1-c_{0}d^{-2}$ , the iterates $\{X_{k}\}$ generated by the Polyak subgradient method in Algorithm 1 satisfy

[TABLE]

In particular, the iterates converge linearly as long as ${\rm dist}(X_{0},\mathcal{D}^{*})<\frac{C-2\sqrt{\epsilon}}{4\sqrt{(1+\epsilon)}}$ .

Proof.

First, observe that we have the bound $L\leq\sqrt{p\nu r}$ by Lemma 7.3. By invoking Proposition 4.3 and Lemmas 7.1 and 7.3 we may appeal to Theorem 7.6 with $\gamma=1/2$ , $a=\sqrt{p(1+\epsilon)}$ , $b=\sqrt{p\epsilon}$ , and $\mu=\sqrt{2c_{1}p(\sqrt{2}-1)}$ . The result follows immediately. ∎
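The initialization radii in Corollaries 7.5 and 7.7 can be traced to the substitutions made in the two proofs. The computation below is a sketch that assumes the constant $C$ of the corollaries is identified with $\sqrt{2c_{1}(\sqrt{2}-1)}$, so that $\mu=C\sqrt{p}$; this identification is suggested by the stated value of $\mu$ but is not spelled out in the proofs.

```latex
% Assumption: C := \sqrt{2 c_1 (\sqrt{2}-1)}, so that \mu = C \sqrt{p}.
% Recall a = \sqrt{p(1+\epsilon)} and b = \sqrt{p\epsilon}.
\begin{align*}
\frac{\mu-2b}{2a}
  &= \frac{C\sqrt{p}-2\sqrt{p\epsilon}}{2\sqrt{p(1+\epsilon)}}
   = \frac{C-2\sqrt{\epsilon}}{2\sqrt{1+\epsilon}}
  && \text{(Lemma 7.4, used in Corollary 7.5)},\\
\gamma\cdot\frac{\mu-2b}{2a}
  &= \frac{1}{2}\cdot\frac{C-2\sqrt{\epsilon}}{2\sqrt{1+\epsilon}}
   = \frac{C-2\sqrt{\epsilon}}{4\sqrt{1+\epsilon}}
  && \text{(Theorem 7.6 with $\gamma=1/2$, used in Corollary 7.7)}.
\end{align*}
```

The sampling probability $p$ cancels, so both radii are dimension-free, and they are positive precisely when $\epsilon<C^{2}/4$, which is the requirement $\mu>2b$.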

To summarize, there exist numerical constants $c_{0},c_{1},c_{2},c_{3}>0$ such that the following is true with probability at least $1-c_{0}d^{-2}$ . In the regime

[TABLE]

the Polyak subgradient method will converge at the linear rate,

[TABLE]

when initialized at $X_{0}\in\mathcal{X}$ satisfying ${\rm dist}(X_{0},\mathcal{D}^{*})<c_{2}$ . Notice that the prox-linear method enjoys a much faster linear rate of convergence than the subgradient method—an observation fully supported by numerical experiments in Section 10. The caveat is that the per iteration cost of the prox-linear method is significantly higher than that of the subgradient method.

8 Robust PCA

The goal of robust PCA is to decompose a given matrix $W$ into a sum of a low-rank matrix $M_{\sharp}$ and a sparse matrix $S_{\sharp}$ , where $M_{\sharp}$ represents the principal components, $S_{\sharp}$ the corruption, and $W$ the observed data [15, 11, 59]. In this section, we explore methods of nonsmooth optimization for recovering such a decomposition, focusing on two different problem formulations. We only consider the symmetric version of the problem.

8.1 The Euclidean formulation

Setting the stage, we assume that the matrix $W\in{\bf R}^{d\times d}$ admits a decomposition $W=M_{\sharp}+S_{\sharp}$ , where the matrices $M_{\sharp}$ and $S_{\sharp}$ satisfy the following for some parameters $\nu>0$ and $k\in\mathbb{N}$ :

1. The matrix $M_{\sharp}\in{\bf R}^{d\times d}$ has rank $r$ and can be factored as $M_{\sharp}=X_{\sharp}X^{\top}_{\sharp}$ for some matrix $X_{\sharp}\in{\bf R}^{d\times r}$ satisfying $\|X_{\sharp}\|_{{\rm op}}\leq 1$ and $\|X_{\sharp}\|_{2,\infty}\leq\sqrt{\frac{\nu r}{d}}$ . (Recall that $\|X\|_{2,\infty}=\max_{i\in[d]}\|X_{i\cdot}\|_{2}$ is the maximum row norm.)

2. The matrix $S_{\sharp}$ is sparse in the sense that it has at most $k$ nonzero entries per column/row.

The goal is to recover $M_{\sharp}$ and $S_{\sharp}$ given $W$ . The first formulation we consider is the following:

[TABLE]

where the constraint sets are defined by

[TABLE]

Note that the problem formulation requires knowing the $\ell_{1}$ norms of the rows of $S_{\sharp}$ . The same assumption was also made in [19, 32]. While admittedly unrealistic, this formulation provides a nice illustration of the paradigm we advocate here. The following technical lemma will be useful in proving the regularity conditions needed for rapid convergence. The proof is given in Appendix D.1.

Lemma 8.1.

For all $X\in\mathcal{X}$ and $S\in\mathcal{S}$ , the estimate holds:

[TABLE]

Equipped with the above lemma, we can estimate the sharpness and approximation parameters $\mu,\rho$ for the formulation (8.1).

Lemma 8.2 (Regularity constants).

For all $X\in\mathcal{X}$ and $S\in\mathcal{S}$ , the estimates hold:

[TABLE]

and

[TABLE]

Moreover, for any $X_{1},X_{2}\in\mathcal{X}$ and $S_{1},S_{2}\in\mathcal{S}$ , the Lipschitz bound holds:

[TABLE]

Proof.

Let $X_{\sharp}\in\mathrm{proj}_{\mathcal{D}^{\ast}(M_{\sharp})}(X)$ . To establish the bound (8.2), we observe that

[TABLE]

where the first inequality follows from Proposition 4.3 and Lemma 8.1. Now set

[TABLE]

and $s:=\frac{1}{2}\sigma_{r}^{2}(X_{\sharp})$ . With this notation, we apply the Fenchel-Young inequality to show that for any $\varepsilon>0$ , we have

[TABLE]

Thus, for any $\varepsilon>0$ , we have

[TABLE]

Now, let us choose $\varepsilon>0$ so that $s-a\varepsilon=1-a/\varepsilon.$ Namely set $\varepsilon=\frac{-(1-s)+\sqrt{(1-s)^{2}+4a^{2}}}{2a}.$ With this choice of $\varepsilon$ and the bound $s-a\varepsilon\geq\frac{1}{2}\sigma_{r}^{2}(X_{\sharp})-10\sqrt{\nu rk/d}$ , the claimed bound (8.2) follows immediately. The bound (8.3) follows from the reverse triangle inequality:

[TABLE]

Finally observe

[TABLE]

where we use the bound $\|X_{i}\|_{\mathrm{op}}\leq\sqrt{d}\|X_{i}\|_{2,\infty}\leq\sqrt{\nu r}$ in the final inequality. The proof is complete. ∎
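For completeness, the choice of $\varepsilon$ made in the proof above can be verified directly. The sketch assumes $a>0$, with $a$ and $s$ as set in the proof.

```latex
% Multiply the balancing condition by \varepsilon > 0 and solve the quadratic.
\begin{align*}
s-a\varepsilon = 1-\frac{a}{\varepsilon}
  \;\iff\; a\varepsilon^{2}+(1-s)\varepsilon-a=0
  \;\implies\; \varepsilon=\frac{-(1-s)+\sqrt{(1-s)^{2}+4a^{2}}}{2a}.
\end{align*}
```

The product of the two roots of the quadratic is $-1$, so exactly one root is positive, and it is the one selected in the proof.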

To summarize, there exist numerical constants $c_{0},c_{1},c_{2}>0$ such that the following is true. In the regime

[TABLE]

the Polyak subgradient method will converge at the linear rate,

[TABLE]

and the prox-linear method will converge quadratically when initialized at $X_{0}\in\mathcal{X}$ satisfying ${\rm dist}(X_{0},\mathcal{D}^{*}(M_{\sharp}))<c_{2}\sigma_{r}(X_{\sharp})$ .

8.2 The non-Euclidean formulation

We next turn to a different formulation for robust PCA that does not require knowledge of $\ell_{1}$ row norms of $S_{\sharp}$ . In particular, we consider the formulation

[TABLE]

for a constant $C>1$ . Unlike Section 8.1, here we consider a randomized model for the sparse matrix $S_{\sharp}$ . We assume that there are real $\nu,\tau>0$ such that

1. $M_{\sharp}\in{\bf R}^{d\times d}$ can be factored as $M_{\sharp}=X_{\sharp}X^{\top}_{\sharp}$ for some matrix $X_{\sharp}\in{\bf R}^{d\times r}$ satisfying $\|X_{\sharp}\|_{2,\infty}\leq\sqrt{\frac{\nu r}{d}}\|X_{\sharp}\|_{\rm op}$ .

2. We assume the random corruption model

[TABLE]

where $\delta_{ij}$ are i.i.d. Bernoulli random variables with $\tau=\mathbb{P}(\delta_{ij}=1)$ and $\hat{S}$ is an arbitrary and fixed $d\times d$ symmetric matrix.

In this setting, the approximation function at $X$ is given by

[TABLE]

We begin by computing an estimate of the approximation accuracy $|f(Z)-f_{X}(Z)|$ .

Lemma 8.3 (Approximation accuracy).

The estimate holds:

[TABLE]

Proof.

As in the proof of Proposition 4.1, we compute

[TABLE]

thereby completing the argument. ∎

Notice that the error $|f(Z)-f_{X}(Z)|$ is bounded in terms of the non-Euclidean norm $\|Z-X\|_{2,1}$ . Thus, although in principle one may apply subgradient methods to the formulation (8.4), their convergence guarantees, which fundamentally relied on the Euclidean norm, would yield potentially overly pessimistic performance predictions. On the other hand, the convergence guarantees for the prox-linear method do not require the norm to be Euclidean. Indeed, the following is true, with a proof that is nearly identical to that of Theorem 5.8.

Theorem 8.4.

Suppose that Assumption A holds where $\|\cdot\|$ is replaced by an arbitrary norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ . Choose any $\beta\geq\rho$ and set $\gamma:=\rho/\beta$ in Algorithm 3. Then Algorithm 3 initialized at any point $x_{0}$ satisfying ${\rm dist}_{\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}}(x_{0},\mathcal{X}^{*})<\frac{\mu}{\rho}$ converges quadratically:

[TABLE]
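To see how far the norm $\|\cdot\|_{2,1}$ can be from the Frobenius norm, the following sketch compares the two on a matrix with equal rows. It assumes that $\|X\|_{2,1}$ denotes the sum of the Euclidean norms of the rows of $X$, in analogy with the maximum row norm $\|X\|_{2,\infty}$; the function names and sizes are hypothetical.

```python
import numpy as np


def norm_2_1(X):
    """Sum of the Euclidean norms of the rows (assumed definition of ||X||_{2,1})."""
    return np.linalg.norm(X, axis=1).sum()


def norm_2_inf(X):
    """Maximum Euclidean norm of a row, i.e. ||X||_{2,inf}."""
    return np.linalg.norm(X, axis=1).max()


d, r = 100, 3
X = np.ones((d, r)) / np.sqrt(d * r)       # all rows equal, ||X||_F = 1
ratio = norm_2_1(X) / np.linalg.norm(X)    # equals sqrt(d) for this matrix
```

In general $\|X\|_{F}\leq\|X\|_{2,1}\leq\sqrt{d}\,\|X\|_{F}$ by the Cauchy-Schwarz inequality, and the example attains the upper bound, so any argument that converts between the two norms can lose a factor of $\sqrt{d}$.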

To apply the above generic convergence guarantees for the prox-linear method, it remains to show that the objective function $f$ in (8.4) is sharp relative to the norm $\|\cdot\|_{2,1}$ . A key step in showing such a result is to prove that

[TABLE]

for a quantity $c$ depending only on $X_{\sharp}$ . One may prove this inequality using Proposition 4.3 together with the equivalence of the norms $\|\cdot\|_{F}$ and $\|\cdot\|_{2,1}$ . Doing so, however, leads to a dimension-dependent $c$ , resulting in a poor rate of convergence and region of attraction. We instead seek to directly establish sharpness relative to the norm $\|\cdot\|_{2,1}$ . In the rank one setting, this can be done using the following theorem.

Theorem 8.5 (Sharpness (rank one)).

Consider two vectors $x,\bar{x}\in{\bf R}^{d}$ satisfying

[TABLE]

Then the estimate holds:

[TABLE]

The proof of this result appears in Appendix D.2. We leave as an intriguing open question to determine if an analogous result holds in the higher rank setting.

Conjecture 8.6 (Sharpness (general rank)).

Fix a rank $r$ matrix $X_{\sharp}\in{\bf R}^{d\times r}$ and set $\mathcal{D}^{*}:=\{X\in\mathcal{X}:XX^{\top}=X_{\sharp}X_{\sharp}^{\top}\}$ . Then there exist constants $c,\gamma>0$ depending only on $X_{\sharp}$ such that the estimate holds:

[TABLE]

for all $X\in\mathcal{X}$ satisfying ${\rm dist}_{\|\cdot\|_{2,1}}(X,\mathcal{D}^{*})\leq\gamma$ .

Assuming this conjecture, we can then show that the loss function $f$ is sharp under the randomized corruption model. We first state the following technical lemma, whose proof is deferred to Appendix D.3. In what follows, given a matrix $X\in{\bf R}^{d\times r}$ , the notation $X_{i}$ always refers to the $i$ th row of $X$ .

Lemma 8.7.

Assume Conjecture 8.6. Then there exist constants $c_{1},c_{2},c_{3}>0$ so that for all $d$ satisfying $d\geq\frac{c_{1}\log d}{\tau}$ , we have that with probability $1-d^{-c_{2}}$ , the following bound holds:

[TABLE]

for all $X\in\mathcal{X}$ satisfying ${\rm dist}_{\|\cdot\|_{2,1}}(X,\mathcal{D}^{*})\leq\gamma$ .

We remark that we expect $c$ to scale with $\|X_{\sharp}\|_{\mathrm{op}}$ in the above bound, yielding a ratio $\|X_{\sharp}\|_{\mathrm{op}}/c$ dependent on the conditioning of $X_{\sharp}$ . Given the above lemma, sharpness of $f$ quickly follows.

Lemma 8.8 (Sharpness of Non-Euclidean Robust PCA).

Assume Conjecture 8.6. Then there exist constants $c_{1},c_{2},c_{3}>0$ so that for all $d$ satisfying $d\geq\frac{c_{1}\log d}{\tau}$ , we have that with probability $1-d^{-c_{2}}$ , the following bound holds:

[TABLE]

for all $X\in\mathcal{X}$ satisfying ${\rm dist}_{\|\cdot\|_{2,1}}(X,\mathcal{D}^{*}(M_{\sharp}))\leq\gamma$ .

Proof.

The reverse triangle inequality implies that

[TABLE]

The result then follows from the sharpness of the function $\|XX^{\top}-X_{\sharp}X_{\sharp}^{\top}\|_{1}$ together with Lemma 8.7. ∎

Combining Lemma 8.8 and Theorem 8.4, we deduce the following convergence guarantee.

Theorem 8.9 (Convergence for non-Euclidean Robust PCA).

Assume Conjecture 8.6. Then there exist constants $c_{1},c_{2},c_{3}>0$ so that for all $\tau$ satisfying $1-2\tau-2c_{3}C\sqrt{\tau\nu r\log d}\|X_{\sharp}\|_{op}/c>0$ and $d$ satisfying $d\geq\frac{c_{1}\log d}{\tau}$ , we have that with probability $1-d^{-c_{2}}$ , the iterates generated by the prox-linear algorithm

[TABLE]

satisfy

[TABLE]

In particular, the iterates converge quadratically as long as the initial iterate $X_{0}\in\mathcal{X}$ satisfies

[TABLE]

9 Recovery up to a Tolerance

Thus far, we have developed exact recovery guarantees under noiseless or sparsely corrupted measurements. We showed that sharpness together with weak convexity imply rapid local convergence of numerical methods under these settings. In practical scenarios, however, it might be unlikely that any, let alone a constant fraction of measurements, are perfectly observed. Instead, a more realistic model incorporates additive errors that are the sum of a sparse, but otherwise arbitrary vector and a dense vector with relatively small norm. Exact recovery is in general not possible under this noise model. Instead, we should only expect to recover the signal up to an error.

To develop algorithms for this scenario, we need only observe that the previously developed sharpness results all yield a corresponding “sharpness up to a tolerance” result. Indeed, all problems considered thus far are convex composite and sharp:

[TABLE]

where $h$ is convex and $\eta$ -Lipschitz with respect to some norm $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ , $F$ is a smooth map, and $\mu>0$ . Now consider a fixed additive error vector $e$ , and the perturbed problem

[TABLE]

The triangle inequality immediately implies that the perturbed problem is sharp up to tolerance $2\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}$ :

[TABLE]

In particular, any minimizer $x^{*}$ of $\tilde{f}$ satisfies

[TABLE]

where as before we set $\mathcal{X}^{*}=\operatornamewithlimits{argmin}_{\mathcal{X}}f$ . In this section, we show that subgradient and prox-linear algorithms applied to the perturbed problem (9.1) converge rapidly up to a tolerance on the order of $\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}/\mu$ . To see the generality of the above approach, we note that even the robust recovery problems considered in Section 4.2.2, in which a constant fraction of measurements are already corrupted, may be further corrupted through an additive error vector $e$ . We will study this problem in detail in Section 9.1.
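The triangle-inequality step can be spelled out as follows. This is a sketch under the assumption that $f=h\circ F$ and $\tilde{f}=h(F(\cdot)+e)$, which is consistent with the subdifferential formula for $\tilde{f}$ used later in this section.

```latex
\begin{align*}
|\tilde{f}(x)-f(x)|
  &= |h(F(x)+e)-h(F(x))|
   \le \eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}
   \qquad\text{for all } x\in\mathcal{X},\\
\inf_{\mathcal{X}}\tilde{f}
  &\le \inf_{\mathcal{X}} f+\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|},\\
\tilde{f}(x)-\inf_{\mathcal{X}}\tilde{f}
  &\ge \bigl(f(x)-\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}\bigr)
     -\bigl(\inf_{\mathcal{X}} f+\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}\bigr)
   \ge \mu\cdot{\rm dist}(x,\mathcal{X}^{*})-2\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}.
\end{align*}
```

Evaluating the last line at a minimizer $x^{*}$ of $\tilde{f}$ makes the left-hand side zero, so this argument yields ${\rm dist}(x^{*},\mathcal{X}^{*})\leq 2\eta\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}e\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}/\mu$.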

Throughout the rest of the section, let us define the noise level:

[TABLE]

Mirroring the discussion in Section 5, define the annulus:

[TABLE]

for some $\gamma>0$ . Note that for the annulus $\widetilde{\mathcal{T}}_{\gamma}$ to be nonempty, we must ensure $\epsilon<\frac{\mu^{2}\gamma}{56\rho}$ . We will see that $\widetilde{\mathcal{T}}_{\gamma}$ serves as a region of rapid convergence for some numerical constant $\gamma$ . As before, we also define the subgradient bound and the condition measure:

[TABLE]

In all examples considered in the paper, it is possible to show directly that $\tilde{L}\leq L$ as defined in Assumption D. A similar result is true in the general case, as well. Indeed, the following Lemma provides a bound for $\tilde{L}$ in terms of the subgradients of $f$ on a slight expansion of the tube $\mathcal{T}_{1}$ from (5.2); the proof appears in the appendix.

Lemma 9.1.

Suppose $\epsilon<\frac{\mu^{2}}{56\rho}$ so that $\widetilde{\mathcal{T}}_{1}$ is nonempty. Then the following bound holds:

[TABLE]

We will now design algorithms whose basin of attraction is the annulus $\widetilde{\mathcal{T}}_{\gamma}$ for some $\gamma$ . To that end, the following modified sharpness bound will be useful for us. The reader should be careful to note the appearance of $\inf_{\mathcal{X}}f$ , not $\inf_{\mathcal{X}}\tilde{f}$ in the following bound.

Lemma 9.2 (Approximate sharpness).

We have the following bound:

[TABLE]

Proof.

For any $x\in\mathcal{X}$ , observe $\tilde{f}(x)-\inf f\geq f(x)-\inf f-\varepsilon\geq\mu\cdot{\rm dist}(x,\mathcal{X}^{\ast})-\varepsilon,$ as claimed. ∎
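The inequality of Lemma 9.2 can be checked numerically on a toy instance. The sketch below is illustrative only and rests on stated assumptions: $h=\|\cdot\|_{1}$, $F(x)=Ax-Ax_{\sharp}$ with a hypothetical matrix `A` having orthonormal columns (so that $\|Au\|_{1}\geq\|Au\|_{2}=\|u\|_{2}$ and hence $\mu=1$), measurements perturbed by a dense vector `e`, and noise level taken to be $\varepsilon=\|e\|_{1}$ (that is, $\eta=1$ with respect to the $\ell_{1}$ norm). None of these names come from the paper.

```python
import numpy as np

def make_problem(m=40, d=5, noise=1e-2, seed=0):
    """Toy sharp problem: f(x) = ||A(x - x_sharp)||_1 with orthonormal-column A."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((m, d)))  # reduced QR: A is m x d
    x_sharp = rng.standard_normal(d)
    e = rng.standard_normal(m)
    e *= noise / np.abs(e).sum()  # rescale so that ||e||_1 = noise
    return A, x_sharp, e

def f_tilde(x, A, x_sharp, e):
    """Perturbed objective: l1 misfit against the noisy measurements A x_sharp + e."""
    return np.abs(A @ x - (A @ x_sharp + e)).sum()

def worst_sharpness_gap(trials=1000, seed=1):
    """Smallest observed value of f_tilde(x) - inf f - (mu * dist - eps), with inf f = 0, mu = 1."""
    A, x_sharp, e = make_problem()
    eps = np.abs(e).sum()
    rng = np.random.default_rng(seed)
    worst = np.inf
    for _ in range(trials):
        x = x_sharp + rng.uniform(0.0, 2.0) * rng.standard_normal(x_sharp.size)
        dist = np.linalg.norm(x - x_sharp)
        worst = min(worst, f_tilde(x, A, x_sharp, e) - (dist - eps))
    return worst

gap = worst_sharpness_gap()
```

A nonnegative `gap` (up to rounding) is what the lemma predicts for this instance; the check says nothing about the constants in the matrix problems studied in the paper.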

Next, we show that $\tilde{f}$ satisfies the following approximate subgradient inequality.

Lemma 9.3 (Approximate subgradient inequality).

The following bound holds:

[TABLE]

Proof.

First notice that $|f_{x}(y)-\tilde{f}_{x}(y)|\leq\varepsilon$ for all $x,y$ . Furthermore, we have $\partial\tilde{f}(x)=\nabla F(x)^{*}\partial h(F(x)+e)=\partial\tilde{f}_{x}(x)$ . Therefore, it follows that for any $\zeta\in\partial\tilde{f}_{x}(x)$ we have

[TABLE]

as desired. ∎

Now consider the following modified Polyak method. It is important to note that the stepsize assumes knowledge of $\min_{\mathcal{X}}f$ rather than $\min_{\mathcal{X}}\tilde{f}$ . This distinction is important because it often happens that $\min_{\mathcal{X}}f=0$ , whereas $\min_{\mathcal{X}}\tilde{f}$ is in general unknown; for example, consider any noiseless problem analyzed thus far. We note that the standard Polyak subgradient method may also be applied to $\tilde{f}$ without any changes and has similar theoretical guarantees. The proof appears in the appendix.

Theorem 9.4 (Polyak subgradient method).

Suppose that Assumption D holds and suppose that $\varepsilon\leq\mu^{2}/14\rho$ . Then Algorithm 4 initialized at any point $x_{0}\in\widetilde{\mathcal{T}}_{1}$ produces iterates that converge $Q$ -linearly to $\mathcal{X}^{*}$ up to tolerance $14\varepsilon/\mu$ , that is

[TABLE]
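A minimal sketch of a Polyak-type iteration whose step size uses $\min_{\mathcal{X}}f$ rather than $\min_{\mathcal{X}}\tilde{f}$. Algorithm 4 is not reproduced in this section, so the update rule below is an assumption: the usual Polyak step with target value $\min f=0$, and no projection because the toy problem takes $\mathcal{X}={\bf R}^{d}$. The test problem ($\ell_{1}$ regression with a hypothetical orthonormal-column matrix `A`) and the names `modified_polyak` and `demo` are likewise illustrative.

```python
import numpy as np

def modified_polyak(f_tilde, subgrad, x0, f_min=0.0, iters=500):
    """Subgradient steps with step size (f_tilde(x) - f_min) / ||zeta||^2."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    for _ in range(iters):
        zeta = subgrad(x)
        norm_sq = float(zeta @ zeta)
        if norm_sq == 0.0:  # zero subgradient: x already minimizes the convex f_tilde
            break
        x = x - (f_tilde(x) - f_min) / norm_sq * zeta
        history.append(x.copy())
    return history

def demo(m=50, d=5, noise=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((m, d)))  # orthonormal columns
    x_sharp = rng.standard_normal(d)
    e = rng.standard_normal(m)
    e *= noise / np.abs(e).sum()                      # dense noise with ||e||_1 = noise
    b_noisy = A @ x_sharp + e

    f_tilde = lambda x: np.abs(A @ x - b_noisy).sum()
    subgrad = lambda x: A.T @ np.sign(A @ x - b_noisy)  # one element of the subdifferential

    direction = rng.standard_normal(d)
    x0 = x_sharp + direction / np.linalg.norm(direction)  # start at distance 1
    history = modified_polyak(f_tilde, subgrad, x0)
    return [float(np.linalg.norm(x - x_sharp)) for x in history]

dists = demo()
```

Because the target value $\min f$ underestimates $\min\tilde{f}$, the step sizes never vanish, so the distances in `dists` are only expected to shrink to a level governed by the noise and may fluctuate there rather than converge to zero.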

Next we provide theoretical guarantees for Algorithm 5.3, where one does not know the optimal value $\min_{\mathcal{X}}f$ . The proof of this result is a straightforward modification of [23, Theorem 6.1] based on Lemmas 9.2 and 9.3, and therefore we omit it.

Theorem 9.5 (Geometrically decaying subgradient method).

Suppose that Assumption D holds, fix a real number $\gamma\in(0,1)$ , and suppose $\tilde{\tau}\leq\frac{14}{11}\sqrt{\frac{1}{2-\gamma}}$ . Suppose also $\epsilon<\frac{\mu^{2}\gamma}{56\rho}$ so that $\widetilde{\mathcal{T}}_{\gamma}$ is nonempty. Set $\lambda:=\frac{\gamma\mu^{2}}{4\rho\tilde{L}}\textrm{ and }q:=\sqrt{1-(1-\gamma)\tilde{\tau}^{2}}$ in Algorithm 2. Then the iterates $x_{k}$ generated by Algorithm 2 on the perturbed problem (9.1), initialized at a point $x_{0}\in\tilde{\mathcal{T}}_{\gamma}$ , satisfy:

[TABLE]
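The parameter choices in Theorem 9.5 are simple arithmetic in $(\mu,\rho,\tilde{L},\gamma)$. The sketch below computes them under two assumptions not stated in this section: that the condition measure is $\tilde{\tau}=\mu/\tilde{L}$, and that Algorithm 2 uses step lengths $\lambda q^{k}$. The function names are hypothetical.

```python
import math

def geometric_schedule(mu, rho, L, gamma):
    """Return (lam, q) from the formulas in Theorem 9.5, taking tau = mu / L."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    tau = mu / L
    lam = gamma * mu**2 / (4.0 * rho * L)
    q = math.sqrt(1.0 - (1.0 - gamma) * tau**2)
    return lam, q

def step_sizes(lam, q, num_steps):
    """Geometrically decaying step lengths lam * q**k (assumed form of Algorithm 2)."""
    return [lam * q**k for k in range(num_steps)]

# Illustrative values only; they are not taken from the paper.
lam, q = geometric_schedule(mu=1.0, rho=2.0, L=2.0, gamma=0.5)
steps = step_sizes(lam, q, 100)
```

With this schedule the total distance travelled is at most $\lambda/(1-q)$, the quantity that the appendix compares against the radius of the local neighborhood.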

Finally, we analyze the prox-linear algorithm applied to the problem $\min_{\mathcal{X}}\tilde{f}$ . In contrast to the Polyak method, one does not need to know the optimal value $\min_{\mathcal{X}}f$ . The proof appears in the appendix.

Theorem 9.6 (Prox-linear algorithm).

Suppose Assumption A holds. Choose any $\beta\geq\rho$ in Algorithm 3 applied to the perturbed problem (9.1) and set $\gamma:=\rho/\beta$ . Suppose moreover $\epsilon<\frac{\mu^{2}\gamma}{56\rho}$ so that $\tilde{\mathcal{T}}_{\gamma}$ is nonempty. Then Algorithm 3 initialized at any point $x_{0}\in\widetilde{\mathcal{T}}_{\gamma}$ converges quadratically up to tolerance $14\varepsilon/\mu$ :

[TABLE]

9.1 Example: sparse outliers and dense noise under $\ell_{1}/\ell_{2}$ RIP

To further illustrate the ideas of this section, we now generalize the results of Section 4.2.2, in particular Assumption C, to the following observation model.

Assumption K ( $\mathcal{I}$ -outlier bounds).

There exist vectors $e,\Delta\in{\bf R}^{m}$ , a set $\mathcal{I}\subset\{1,\ldots,m\}$ , and a constant $\kappa_{3}>0$ such that the following hold.

$\mathrm{(C1)}$ $b=\mathcal{A}(M_{\sharp})+\Delta+e$ .

$\mathrm{(C2)}$ Equality $\Delta_{i}=0$ holds for all $i\notin\mathcal{I}$ .

$\mathrm{(C3)}$ For all matrices $W$ of rank at most $2r$ , we have

[TABLE]

Given these assumptions we follow the notation of the previous section and let

[TABLE]

Then we have the following proposition:

Proposition 9.7.

Suppose Assumptions B and K are valid. Then the following hold:

1. (Sharpness) We have

[TABLE]

2. (Weak Convexity) The function $f$ is $\rho:=2\kappa_{2}$ -weakly convex.

3. (Minimizers) All minimizers of $\tilde{f}$ satisfy

[TABLE]

4. (Lipschitz Bound) We have the bound

[TABLE]

Proof.

Sharpness follows from Proposition 4.6, while weak convexity follows from Proposition 4.1. The minimizer bound follows from (9.2). Finally, due to Lemma 3.1, the argument given in Proposition 4.1, but applied instead to $\tilde{f}$ , guarantees that

[TABLE]

In turn the supremum may be bounded as follows: Let $X_{\star}=X_{\sharp}R$ denote the closest point to $X$ in $\mathcal{D}^{\ast}(M)$ . Then

[TABLE]

as desired. ∎

In particular, combining Proposition 9.7 with the previous results in this section, we deduce the following. As long as the noise satisfies

[TABLE]

for a sufficiently small constant $c_{0}>0$ , the subgradient and prox-linear methods converge rapidly to within tolerance

[TABLE]

when initialized at a matrix $X_{0}$ satisfying

[TABLE]

for some small constant $c_{1}$ . The formal statement is summarized in the following corollary.

Corollary 9.8 (Convergence guarantees under RIP with sparse outliers and dense noise (symmetric)).

Suppose Assumptions B and K are valid with $\mathopen{|\mkern-1.5mu|\mkern-1.5mu|}\cdot\mathclose{|\mkern-1.5mu|\mkern-1.5mu|}=\frac{1}{m}\|\cdot\|_{1}$ and define the condition number $\chi=\sigma_{1}(M_{\sharp})/\sigma_{r}(M_{\sharp})$ . Then there exist numerical constants $c_{0},c_{1},c_{2},c_{3},c_{4},c_{5},c_{6}>0$ such that the following hold. Suppose the noise level satisfies

[TABLE]

and define the tolerance

[TABLE]

Then as long as the matrix $X_{0}$ satisfies

[TABLE]

the following are true.

1. (Polyak subgradient) Algorithm 1 initialized at $X_{0}$ produces iterates that converge linearly to $\mathcal{D}^{\ast}(M_{\sharp})$ , that is

[TABLE]

2. (geometric subgradient) Algorithm 2 with $\lambda=\frac{c_{5}\kappa_{3}^{2}\sqrt{\sigma_{r}(M_{\sharp})}}{\kappa_{2}(\kappa_{3}+2\kappa_{2}\sqrt{\chi})}$ , $q=\sqrt{1-\frac{c_{2}}{1+c_{3}\kappa_{2}^{2}\chi/\kappa_{3}^{2}}}$ and initialized at $X_{0}$ converges linearly:

[TABLE]

3. (prox-linear) Algorithm 3 with $\beta=\rho$ and initialized at $X_{0}$ converges quadratically:

[TABLE]

10 Numerical Experiments

In this section, we demonstrate the theory and algorithms developed in the previous sections on a number of low-rank matrix recovery problems, namely quadratic and bilinear sensing, low rank matrix completion, and robust PCA.

10.1 Robustness to outliers

In our first set of experiments, we empirically test the robustness of our optimization methods to outlying measurements. We generate phase transition plots, where each pixel corresponds to the empirical probability of successful recovery over $50$ test runs using randomly generated problem instances. Brighter pixels represent higher recovery rates. All generated instances obey the following:

1. The initial estimate is specified reasonably close to the ground truth. In particular, given a target symmetric positive semidefinite matrix $X_{\sharp}$ , we set

[TABLE]

Here, $\delta$ is a scalar that controls the quality of initialization and $\Delta$ is a random unit “direction”. The asymmetric setting is completely analogous.

2. When using the subgradient method with geometrically decreasing step-size, we set $\lambda=1.0,\;q=0.98$ .

3. For the quadratic sensing, bilinear sensing, and matrix completion problems, we mark a test run as a success when the normalized distance $\|M-M_{\sharp}\|_{F}/\|M_{\sharp}\|_{F}$ is less than $10^{-5}$ . Here we set $M=XX^{\top}$ in the symmetric setting and $M=XY$ in the asymmetric setting. For the robust PCA problem, we stop when $\|M-M_{\sharp}\|_{1}/\|M_{\sharp}\|_{1}<10^{-5}$ .

Moreover, we set the seed of the random number generator at the beginning of each batch of experiments to enable reproducibility.
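The success criterion and the per-pixel recovery rate described above can be sketched as follows for the symmetric setting $M=XX^{\top}$. The function names are hypothetical, `run_trial` stands in for whichever solver is being tested, and the entrywise reading of $\|\cdot\|_{1}$ in the robust PCA criterion is an assumption.

```python
import numpy as np

def relative_error(M, M_sharp, norm="fro"):
    """Normalized distance ||M - M_sharp|| / ||M_sharp|| in the chosen norm."""
    diff = M - M_sharp
    if norm == "fro":
        return np.linalg.norm(diff) / np.linalg.norm(M_sharp)
    if norm == "l1":  # entrywise l1 norm (assumed reading for robust PCA)
        return np.abs(diff).sum() / np.abs(M_sharp).sum()
    raise ValueError(f"unknown norm: {norm}")

def is_success(X, X_sharp, tol=1e-5, norm="fro"):
    """Symmetric setting: compare M = X X^T with M_sharp = X_sharp X_sharp^T."""
    return bool(relative_error(X @ X.T, X_sharp @ X_sharp.T, norm) < tol)

def empirical_success_rate(run_trial, num_trials=50, seed=0):
    """Fraction of successful runs; seeding once per batch makes the batch reproducible."""
    rng = np.random.default_rng(seed)
    return sum(bool(run_trial(rng)) for _ in range(num_trials)) / num_trials
```

Comparing $XX^{\top}$ rather than $X$ itself makes the criterion insensitive to the rotation ambiguity $X\mapsto XR$ for orthogonal $R$.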

Quadratic and Bilinear sensing.

Figures 2 and 3 depict the phase transition plots for bilinear (6.5) and symmetrized quadratic (6.4) sensing formulations using Gaussian measurement vectors. In the experiments, we corrupt a fraction of measurements with additive Gaussian noise of unit entrywise variance. Empirically, we observe that increasing the variance of the additive noise does not affect recovery rates. Both problems exhibit a sharp phase transition at very similar scales. Moreover, increasing the rank of the generating signal does not seem to dramatically affect the recovery rate for either problem. Under additive noise, we can recover the true signal (up to natural ambiguity) even if we corrupt as much as half of the measurements.

Robust PCA.

We generate robust PCA instances for $d=80$ and $r\in\{1,2,4,8,16\}$ . The corruption matrix $S_{\sharp}$ follows the assumptions in Section 8.2, where for simplicity we set $\hat{S}_{ij}\sim\mathsf{N}(0,\sigma^{2})$ . We observed that increasing or decreasing the variance $\sigma^{2}$ did not affect the probability of successful recovery, so our experiments use $\sigma=1$ . We use the subgradient method, Algorithm 3, and the prox-linear algorithm (8.5). Notice that we have not presented any guarantees for the subgradient method on this problem, in contrast to the prox-linear method. The subproblems for the prox-linear method are solved by ADMM with graph splitting as in [48]. We set tolerance $\epsilon_{k}=\frac{10^{-4}}{2k}$ for the proximal subproblems, which we continue to solve for at most $500$ iterations. We choose $\gamma=10$ in all subproblems. The phase transition plots are shown in Figure 4. It appears that the prox-linear method is more robust to additive sparse corruption, since the empirical recovery rate for the subgradient method decays faster as the rank increases.

Matrix completion.

We next perform experiments on the low-rank matrix completion problem that test successful recovery against the sampling frequency. We generate random instances of the problem, where we let the probability of observing an entry, $\mathbb{P}(\delta_{ij}=1)$ , range in $[0.02,0.6]$ with increments of $0.02$ . Figure 5 depicts the empirical recovery rate using the Polyak subgradient method and the modified prox-linear algorithm (7.1). Similarly to the quadratic/bilinear sensing problems, low-rank matrix completion exhibits a sharp phase transition. As predicted in Section 7, the ratio $\frac{r^{2}}{d}$ appears to be driving the required observation probability for successful recovery. Finally, we empirically observe that the prox-linear method can “tolerate” slightly smaller sampling frequencies.
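The sampling model in this experiment can be sketched as below: a grid of observation probabilities from $0.02$ to $0.6$ in steps of $0.02$, and a Bernoulli mask whose entries play the role of $\delta_{ij}$. Drawing all entries independently (without enforcing symmetry) is a simplifying assumption, and the names are hypothetical.

```python
import numpy as np

# 30 grid points: 0.02, 0.04, ..., 0.60
probabilities = np.linspace(0.02, 0.60, 30)

def sample_mask(d1, d2, p, rng):
    """Boolean mask with independent entries, each True with probability p."""
    return rng.random((d1, d2)) < p

rng = np.random.default_rng(0)
mask = sample_mask(200, 200, 0.3, rng)
observed_fraction = mask.mean()  # empirical sampling frequency, close to p for large d
```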

10.2 Convergence behavior

We empirically validate the rapid convergence guarantees of the subgradient and prox-linear methods, given a proper initialization. Moreover, we compare the subgradient method with gradient descent applied to a smooth formulation of each problem, using the same initial estimate in the noiseless setting. In all the cases below, the step sizes for the gradient method were tuned for best performance. Moreover, we noticed that the gradient descent method, equipped with the Polyak step size $\eta:=\tau\frac{\nabla f}{\left\|\nabla f\right\|^{2}}$ performed at least as well as gradient descent with constant step size. That being said, we were unable to locate any theoretical guarantees in the literature for gradient descent with the Polyak step-size for the problems we consider here.

Quadratic and Bilinear sensing.

For the quadratic and bilinear sensing problems, we apply gradient descent on the smooth formulations

[TABLE]

In Figure 6, we plot the performance of Algorithm 2 for matrix sensing problems with different rank / corruption level; remarkably, the level of noise does not significantly affect the rate of convergence. Additionally, the convergence behavior is almost identical for the two problems for similar rank/noise configurations. Figure 7 depicts the behavior of Algorithm 1 versus gradient descent with empirically tuned step sizes. The subgradient method significantly outperforms gradient descent. For completeness, we also depict the convergence rate of Algorithm 3 for both problems in Figure 8, where we solve the proximal subproblems approximately.

Matrix completion.

In our comparison with smooth methods, we apply gradient descent on the following minimization problem:

[TABLE]

Figure 9 depicts the convergence behavior of Algorithm 1 (solid lines) versus gradient descent applied to Problem (10.1) with a tuned step size $\eta=0.004$ (dashed lines), initialized under the same conditions for low-rank matrix completion instances. As the theory suggests, higher sampling frequency implies better convergence rates. The subgradient method outperforms gradient descent in all regimes.

Figure 10 depicts the performance of the modified prox-linear method (7.1) in the same setting as Figure 9. In most cases, the prox-linear algorithm converges within just $15$ iterations, at what appears to be a rapid linear rate of convergence. Each convex subproblem is solved using a variant of the graph-splitting ADMM algorithm [48].

Robust PCA.

For the robust PCA problem, we consider different rank/corruption level configurations to better understand how they affect convergence for the subgradient and prox-linear methods, using the non-Euclidean formulation of Section 8.2. We depict all configurations in the same plot for a fixed optimization algorithm to better demonstrate the effect of each parameter, as shown in Figure 11. The parameters of the prox-linear method are chosen in the same way reported in Section 10.1. In particular, our numerical experiments appear to support our sharpness Conjecture 8.6 for the robust PCA problem.

10.2.1 Recovery up to tolerance

In this last section, we test the performance of the prox-linear method and the modified Polyak subgradient method (Algorithm 4) for the quadratic sensing and matrix completion problems, under the dense noise model of Section 9. In the former setting, we set $p_{\mathrm{fail}}=0.25$ , so one quarter of our measurements are corrupted with large magnitude noise. For matrix completion, we observe $p=25\%$ of the entries. In both settings, we add Gaussian noise $e$ which is rescaled to satisfy $\left\|e\right\|_{F}=\delta\sigma_{r}(X_{\sharp}),$ and test $\delta:=10^{-k}\sigma_{r}(X_{\sharp}),\;k\in\{1,\dots,4\}$ . The relevant plots can be found in Figures 12 and 13. The numerical experiments fully support the developed theory, with the iterates converging rapidly up to the tolerance that is proportional to the noise level. Incidentally, we observe that the modified prox-linear method (7.1) is more robust to additive noise for the matrix completion problem, with Algorithm 4 exhibiting heavy fluctuations and failing to converge for the highest level of dense noise.
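The rescaling of the dense noise can be sketched as follows: draw a Gaussian array and scale it to a prescribed norm, here $\delta\,\sigma_{r}(X_{\sharp})$ for a given $\delta$. The helper names and the instance sizes are hypothetical, and the value of $\delta$ is left as an input rather than fixed by the sketch.

```python
import numpy as np

def sigma_r(X_sharp):
    """r-th singular value of a d x r factor with d >= r, i.e. its smallest one."""
    r = X_sharp.shape[1]
    return np.linalg.svd(X_sharp, compute_uv=False)[r - 1]

def rescaled_noise(shape, target_norm, rng):
    """Gaussian noise rescaled so that its Euclidean (Frobenius) norm equals target_norm."""
    g = rng.standard_normal(shape)
    return g * (target_norm / np.linalg.norm(g))

rng = np.random.default_rng(0)
X_sharp = rng.standard_normal((50, 3))      # illustrative ground-truth factor
noise_levels = {}
for k in range(1, 5):
    delta = 10.0 ** (-k)
    e = rescaled_noise(200, delta * sigma_r(X_sharp), rng)
    noise_levels[k] = np.linalg.norm(e)
```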

Appendix A Proofs in Section 5

In this section, we prove rapid local convergence guarantees for the subgradient and prox-linear algorithms under regularity conditions that hold only locally around a particular solution. We will use the Euclidean norm throughout this section; therefore to simplify the notation, we will drop the subscript two. Thus $\|\cdot\|$ denotes the $\ell_{2}$ norm on a Euclidean space $\mathbf{E}$ throughout.

We will need the following quantitative version of Lemma 5.1.

Lemma A.1.

Suppose Assumption E holds and let $\gamma\in(0,2)$ be arbitrary. Then for any point $x\in B_{\epsilon/2}(\bar{x})\cap\mathcal{T}_{\gamma}\backslash\mathcal{X}^{\ast}$ , the estimate holds:

[TABLE]

Proof.

Consider any point $x\in B_{\epsilon/2}(\bar{x})$ satisfying ${\rm dist}(x,\mathcal{X}^{*})\leq\gamma\frac{\mu}{\rho}$ . Let $x^{*}\in\mathrm{proj}_{\mathcal{X}^{*}}(x)$ be arbitrary and note $x^{*}\in B_{\epsilon}(\bar{x})$ . Thus for any $\zeta\in\partial f(x)$ we deduce

[TABLE]

Therefore we deduce the lower bound on the subgradients $\|\zeta\|\geq\mu-\frac{\rho}{2}\cdot{\rm dist}(x,\mathcal{X}^{*})\geq\left(1-\tfrac{\gamma}{2}\right)\mu,$ as claimed. ∎

A.1 Proof of Theorem 5.6

Let $k$ be the first index (possibly infinite) such that $x_{k}\notin B_{\epsilon/2}(\bar{x})$ . We claim that (5.4) holds for all $i<k$ . We show this by induction. To this end, suppose (5.4) holds for all indices up to $i-1$ . In particular, we deduce ${\rm dist}(x_{i},\mathcal{X}^{*})\leq{\rm dist}(x_{0},\mathcal{X}^{*})\leq\frac{\mu}{2\rho}$ . Let $x^{*}\in\mathrm{proj}_{\mathcal{X}^{*}}(x_{i})$ and note $x^{*}\in B_{\epsilon}(\bar{x})$ , since

[TABLE]

Thus we deduce

[TABLE]

Here, the estimate (A.1) follows from the fact that the projection $\mathrm{proj}_{Q}(\cdot)$ is nonexpansive, (A.2) uses local weak convexity, (A.4) follows from the estimate ${\rm dist}(x_{i},\mathcal{X}^{*})\leq\frac{\mu}{2\rho}$ , while (A.3) and (A.5) use local sharpness. We therefore deduce

[TABLE]

Thus (5.4) holds for all indices up to $k-1$ . We next show that $k$ is infinite. To this end, observe

[TABLE]

where (A.7) follows by Lemma A.1 with $\gamma=1/2$ , the bound in (A.8) follows by (A.6) and the assumption on ${\rm dist}(x_{0},\mathcal{X}^{*}),$ and finally (A.9) holds thanks to (A.6). Thus applying the triangle inequality we get the contradiction $\|x_{k}-\bar{x}\|\leq\epsilon/2$ . Consequently all the iterates $x_{k}$ for $k=0,1,\ldots,\infty$ lie in $B_{\epsilon/2}(\bar{x})$ and satisfy (5.4).

Finally, let $x_{\infty}$ be any limit point of the sequence $\{x_{i}\}$ . We then successively compute

[TABLE]

This completes the proof.

A.2 Proof of Theorem 5.7

Fix an arbitrary index $k$ and observe

[TABLE]

Hence, we conclude the uniform bound on the iterates:

[TABLE]

and the R-linear rate of convergence

[TABLE]

where $x_{\infty}$ is any limit point of the iterate sequence.

Let us now show that the iterates do not escape $B_{\epsilon/2}(\bar{x})$ . To this end, observe

[TABLE]

We must therefore verify the estimate $\tfrac{\lambda}{1-q}\leq\tfrac{\epsilon}{4}$ , or equivalently $\gamma\leq\frac{\epsilon\rho L(1-\gamma)\tau^{2}}{4\mu^{2}(1+\sqrt{1-(1-\gamma)\tau^{2}})}.$ Clearly, it suffices to verify $\gamma\leq\frac{\epsilon\rho(1-\gamma)}{4L},$ which holds by the definition of $\gamma$ . Thus all the iterates $x_{k}$ lie in $B_{\epsilon/2}(\bar{x})$ . Moreover, since $\tau\leq\sqrt{\frac{1}{2}}\leq\sqrt{\frac{1}{2-\gamma}}$ , the rest of the proof is identical to that in [23, Theorem 5.1].

A.3 Proof of Theorem 5.8

Fix any index $i$ such that $x_{i}\in B_{\epsilon}(\bar{x})$ and let $x\in\mathcal{X}$ be arbitrary. Since the function $z\mapsto f_{x_{i}}(z)+\frac{\beta}{2}\|z-x_{i}\|^{2}$ is $\beta$ -strongly convex and $x_{i+1}$ is its minimizer, we deduce

[TABLE]

Setting $x=x_{i}$ and appealing to approximation accuracy, we obtain the descent guarantee

[TABLE]

In particular, the function values are decreasing along the iterate sequence. Next choosing any $x^{*}\in\mathrm{proj}_{\mathcal{X}^{*}}(x_{i})$ and setting $x=x^{*}$ in (A.10) yields

[TABLE]

Appealing to approximation accuracy and lower-bounding $\frac{\beta}{2}\|x_{i+1}-x^{*}\|^{2}$ by zero, we conclude

[TABLE]

Using sharpness we deduce the contraction guarantee

[TABLE]

where the last inequality uses the assumption $f(x_{0})-\min_{\mathcal{X}}f\leq\frac{\mu^{2}}{2\beta}$ . Let $k>0$ be the first index satisfying $x_{k}\notin B_{\epsilon}(\bar{x})$ . We then deduce

[TABLE]

where (A.14) follows from (A.11) and (A.15) follows from (A.13). Thus we conclude $\|x_{k}-\bar{x}\|\leq\epsilon$ , which is a contradiction. Therefore all the iterates $x_{k}$ , for $k=0,1,\ldots,\infty$ , lie in $B_{\epsilon}(\bar{x})$ . Combining this with (A.12) and sharpness yields the claimed quadratic convergence guarantee

[TABLE]

Finally, let $x_{\infty}$ be any limit point of the sequence $\{x_{i}\}$ . We then deduce

[TABLE]

where (A.16) follows from (A.13). The theorem is proved.

Appendix B Proofs in Section 6

B.1 Proof of Lemma 6.3

In order to verify the assumption in each case, we will prove a stronger “small-ball condition” [44, 43], which immediately implies the claimed lower bounds on the expectation by Markov’s inequality. More precisely, we will show that there exist numerical constants $\mu_{0},p_{0}>0$ such that

1. (Matrix Sensing)

[TABLE]

2. (Quadratic Sensing I)

[TABLE]

3. (Quadratic Sensing II)

[TABLE]

4. (Bilinear Sensing)

[TABLE]

These conditions immediately imply Assumptions G-J. Indeed, by Markov’s inequality, in the case of matrix sensing we deduce

[TABLE]

The same reasoning applies to all the other problems.

Matrix sensing.

Consider any matrix $M$ with $\|M\|_{F}=1.$ Then, since $g:=\langle P,M\rangle$ follows a standard normal distribution, we may set $\mu_{0}$ to be the median of $|g|$ and $p_{0}=1/2$ to obtain

[TABLE]
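For the matrix sensing case the constants can be computed explicitly. The sketch below uses only the Python standard library and illustrates the choice $\mu_{0}=\mathrm{median}(|g|)$, $p_{0}=1/2$ for a standard normal $g$, together with the lower bound $\mathbb{E}|g|\geq\mu_{0}p_{0}$ that Markov's inequality provides; the variable names are illustrative.

```python
from statistics import NormalDist

g = NormalDist()                       # standard normal distribution

# P(|g| <= t) = 2 * Phi(t) - 1, so the median of |g| solves Phi(t) = 3/4.
mu0 = g.inv_cdf(0.75)

# Small-ball probability P(|g| >= mu0), which equals 1/2 by construction.
p0 = 2.0 * (1.0 - g.cdf(mu0))

# Markov's inequality: E|g| >= mu0 * P(|g| >= mu0).
markov_lower_bound = mu0 * p0

# Exact first absolute moment of a standard normal: sqrt(2 / pi).
exact_mean = (2.0 / 3.141592653589793) ** 0.5
```

Here `markov_lower_bound` is smaller than `exact_mean`, as expected: the small-ball argument trades tightness for a condition that is easier to verify uniformly.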

Quadratic Sensing I.

Fix a matrix $M$ with $\text{Rank }\,M\leq 2r$ and $\|M\|_{F}=1$ . Let $M=UDU^{\top}$ be an eigenvalue decomposition of $M$ . Using the rotational invariance of the Gaussian distribution, we deduce

[TABLE]

where $\stackrel{{\scriptstyle\mathit{d}}}{{=}}$ denotes equality in distribution. Next, let $z$ be a standard normal variable. We will now invoke Proposition F.2. Let $C>0$ be the numerical constant appearing in the proposition. Notice that the function $\phi\colon{\bf R}_{+}\rightarrow{\bf R}$ given by

[TABLE]

is continuous and strictly increasing, and it satisfies $\phi(0)=0$ and $\lim_{t\rightarrow\infty}\phi(t)=1.$ Hence we may set $\mu_{0}=\phi^{-1}(\min\{1/2C,1/2\})$ . Proposition F.2 then yields

[TABLE]

By taking the supremum of both sides of the inequality we conclude that Assumption H holds with $\mu_{0}$ and $p_{0}=1/2.$

Quadratic sensing II.

Let $M=UDU^{\top}$ be an eigenvalue decomposition of $M$ . Using the rotational invariance of the Gaussian distribution, we deduce

[TABLE]

where the last relation follows since $\left(p_{k}-\tilde{p}_{k}\right),\left(p_{k}+\tilde{p}_{k}\right)$ are independent normal random variables with mean zero and variance two. We will now invoke Proposition F.2. Let $C>0$ be the numerical constant appearing in the proposition. Let $z$ and $\tilde{z}$ be independent standard normal variables. Notice that the function $\phi:{\bf R}_{+}\rightarrow{\bf R}$ given by

[TABLE]

is continuous, strictly increasing, satisfies $\phi(0)=0$ and approaches one at infinity. Defining $\mu_{0}=\phi^{-1}(\min\{1/2C,1/2\})$ and applying Proposition F.2, we get

[TABLE]

By taking the supremum of both sides of the inequality we conclude that Assumption I holds with $\mu_{0}$ and $p_{0}=1/2.$

We omit the details for the bilinear case, which follow by similar arguments.

B.2 Proof of Theorem 6.4

The proofs in this section rely on the following proposition, which shows that pointwise concentration implies uniform concentration. We defer the proof to Appendix B.3.

Proposition B.1.

Let $\mathcal{A}:{\bf R}^{d_{1}\times d_{2}}\rightarrow{\bf R}^{m}$ be a random linear mapping with the property that for any fixed matrix $M\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ with norm $\|M\|_{F}=1$ and any fixed subset of indices $\mathcal{I}\subseteq\{1,\dots,m\}$ satisfying $|\mathcal{I}|<m/2$ , the following hold:

$\mathrm{(1)}$ The measurements $\mathcal{A}(M)_{1},\dots,\mathcal{A}(M)_{m}$ are i.i.d.

$\mathrm{(2)}$ RIP holds in expected value:

[TABLE]

where $\alpha>0$ is a universal constant and $\beta$ is a positive-valued function that could potentially depend on the rank of $M$ .

$\mathrm{(3)}$ There exist a universal constant $K>0$ and a positive-valued function $c(m,r)$ such that for any $t\in[0,K]$ the deviation bound

[TABLE]

holds with probability at least $1-2\exp(-t^{2}c(m,r)).$

Then, there exist universal constants $c_{1},\dots,c_{6}>0$ depending only on $\alpha$ and $K$ such that if $\mathcal{I}\subseteq\{1,\dots,m\}$ is a fixed subset of indices satisfying $|\mathcal{I}|<m/2$ and

[TABLE]

then with probability at least $1-4\exp\left(-c_{3}(1-2|\mathcal{I}|/m)^{2}c(m,r)\right)$ every matrix $M\in{\bf R}^{d_{1}\times d_{2}}$ of rank at most $2r$ satisfies

[TABLE]

and

[TABLE]

Due to scale invariance of the above result, we need only verify its assumptions in the case that $\|M\|_{F}=1$ . We implicitly use this observation below.

B.2.1 Part 1 of Theorem 6.4 (Matrix sensing)

Lemma B.2.

The random variable $|\langle P,M\rangle|$ is sub-gaussian with parameter $C\eta.$ Consequently,

[TABLE]

Moreover, there exists a universal constant $c>0$ such that for any $t\in[0,\infty)$ the deviation bound

[TABLE]

holds with probability at least $1-2\exp\left(-\frac{ct^{2}}{\eta^{2}}m\right).$

Proof.

Assumption G immediately implies the lower bound in (B.5). To prove the upper bound, first note that by assumption we have

[TABLE]

This bound has two consequences: first, $\langle P,M\rangle$ is a sub-gaussian random variable with parameter $\eta$ , and second, $\mathbb{E}|\langle P,M\rangle|\lesssim\eta$ [58, Proposition 2.5.2]. Thus, we have proved (B.5).

To prove the deviation bound (B.6) we introduce the random variables

[TABLE]

Since $|\langle P_{i},M\rangle|$ is sub-gaussian, we have $\|Y_{i}\|_{\psi_{2}}\lesssim\eta$ for all $i,$ see [58, Lemma 2.6.8]. Hence, Hoeffding’s inequality for sub-gaussian random variables [58, Theorem 2.6.2] gives the desired upper bound on $\mathbb{P}\left(\frac{1}{m}\left|\sum_{i=1}^{m}Y_{i}\right|\geq t\right).$ ∎

Applying Proposition B.1 with $\beta(r)\asymp\eta$ and $c(m,r)\asymp m/\eta^{2}$ now yields the result. ∎

B.2.2 Part 2 of Theorem 6.4 (Quadratic sensing I)

Lemma B.3.

The random variable $|p^{\top}Mp|$ is sub-exponential with parameter $\sqrt{2r}\eta^{2}.$ Consequently,

[TABLE]

Moreover, there exists a universal constant $c>0$ such that for any $t\in[0,\sqrt{2r}\eta]$ the deviation bound

[TABLE]

holds with probability at least $1-2\exp\left(-\frac{ct^{2}}{\eta^{4}}m/r\right).$

Proof.

Assumption H gives the lower bound in (B.7). To prove the upper bound, first note that $M=\sum_{k=1}^{2r}\sigma_{k}u_{k}u_{k}^{\top}$ where $\sigma_{k}$ and $u_{k}$ are the $k$ th singular values and vectors of $M$ , respectively. Hence

[TABLE]

where the first inequality follows since $\|\cdot\|_{\psi_{1}}$ is a norm, the second one follows since $\|XY\|_{\psi_{1}}\leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$ [58, Lemma 2.7.7], and the third inequality holds since $\|\sigma\|_{1}\leq\sqrt{2r}\|\sigma\|_{2}$ . This bound has two consequences: first, $p^{\top}Mp$ is a sub-exponential random variable with parameter $\sqrt{r}\eta^{2}$ , and second, $\mathbb{E}p^{\top}Mp\leq\sqrt{2r}\eta^{2}$ [58, Exercise 2.7.2]. Thus, we have proved (B.7).

To prove the deviation bound (B.8) we introduce the random variables

[TABLE]

Since $p^{\top}Mp$ is sub-exponential, we have $\|Y_{i}\|_{\psi_{1}}\lesssim\sqrt{r}\eta^{2}$ for all $i,$ see [58, Exercise 2.7.10]. Hence, Bernstein's inequality for sub-exponential random variables [58, Theorem 2.8.2] gives the desired upper bound on $\mathbb{P}\left(\frac{1}{m}\left|\sum_{i=1}^{m}Y_{i}\right|\geq t\right).$ ∎

Applying Proposition B.1 with $\beta(r)\asymp\sqrt{r}\eta^{2}$ and $c(m,r)\asymp m/{\eta^{4}}r$ now yields the result. ∎

B.2.3 Part 3 of Theorem 6.4 (Quadratic sensing II)

Lemma B.4.

The random variable $|p^{\top}Mp-\tilde{p}^{\top}M\tilde{p}|$ is sub-exponential with parameter $C\eta^{2}.$ Consequently,

[TABLE]

Moreover, there exists a universal constant $c>0$ such that for any $t\in[0,\eta^{2}]$ the deviation bound

[TABLE]

holds with probability at least $1-2\exp\left(-\frac{ct^{2}}{\eta^{4}}m\right).$

Proof.

Assumption I implies the lower bound in (B.9). To prove the upper bound, we will show that $\||p^{\top}Mp-\tilde{p}^{\top}M\tilde{p}|\|_{\psi_{1}}\leq\eta^{2}$ . By definition of the Orlicz norm $\||X|\|_{\psi_{1}}=\|X\|_{\psi_{1}}$ for any random variable $X,$ hence without loss of generality we may remove the absolute value. Recall that $M=\sum_{k=1}^{2r}\sigma_{k}u_{k}u_{k}^{\top}$ where $\sigma_{k}$ and $u_{k}$ are the $k$ th singular values and vectors of $M$ , respectively. Hence, the random variable of interest can be rewritten as

[TABLE]

By assumption the random variables $\langle u_{k},p\rangle$ are $\eta$ -sub-gaussian; this implies that $\langle u_{k},p\rangle^{2}$ are $\eta^{2}$ -sub-exponential, since $\|\langle u_{k},p\rangle^{2}\|_{\psi_{1}}\leq\|\langle u_{k},p\rangle\|_{\psi_{2}}^{2}$ .

Recall the following characterization of the Orlicz norm for mean-zero random variables

[TABLE]

where $Q\asymp\tilde{Q}$ ; see [58, Proposition 2.7.1]. To prove that the random variable (B.11) is sub-exponential we will exploit this characterization. Since each inner product squared $\langle u_{k},p\rangle^{2}$ is sub-exponential, the equivalence implies the existence of a constant $c>0$ for which the uniform bound

[TABLE]

holds. Let $\lambda$ be an arbitrary scalar with $|\lambda|\leq 1/c\eta^{4}$ , then by expanding the moment generating function of (B.11) we get

[TABLE]

where the inequality follows by (B.13) and the last relation follows since $\sigma$ is unit norm. Combining this with (B.12) gives

[TABLE]

This bound has two consequences: first, $|p^{\top}Mp-\tilde{p}^{\top}M\tilde{p}|$ is a sub-exponential random variable with parameter $C\eta^{2}$ , and second, $\mathbb{E}|p^{\top}Mp-\tilde{p}^{\top}M\tilde{p}|\leq C\eta^{2}$ [58, Exercise 2.7.2]. Thus, we have proved (B.9).

To prove the deviation bound (B.10) we introduce the random variables

[TABLE]

The sub-exponentiality of $\mathcal{A}(M)_{i}$ implies $\|Y_{i}\|_{\psi_{1}}\lesssim\eta^{2}$ for all $i$; see [58, Exercise 2.7.10]. Hence, Bernstein's inequality for sub-exponential random variables [58, Theorem 2.8.2] gives the desired upper bound on $\mathbb{P}\left(\frac{1}{m}\left|\sum_{i=1}^{m}Y_{i}\right|\geq t\right).$ ∎

Applying Proposition B.1 with $\beta(r)\asymp\eta^{2}$ and $c(m,r)\asymp m/{\eta^{4}}$ now yields the result. ∎

B.2.4 Part 4 of Theorem 6.4 (Bilinear sensing)

Lemma B.5.

The random variable $|p^{\top}Mq|$ is sub-exponential with parameter $C\eta^{2}.$ Consequently,

[TABLE]

Moreover, there exists a universal constant $c>0$ such that for any $t\in[0,\eta^{2}]$ the deviation bound

[TABLE]

holds with probability at least $1-2\exp\left(-\frac{ct^{2}}{\eta^{4}}m\right).$

Proof.

As before, the lower bound in (B.14) is implied by Assumption J. To prove the upper bound, we will show that $\||p^{\top}Mq|\|_{\psi_{1}}\leq\eta^{2}$ . By definition of the Orlicz norm, $\||X|\|_{\psi_{1}}=\|X\|_{\psi_{1}}$ for any random variable $X$; hence we may remove the absolute value. Recall that $M=\sum_{k=1}^{2r}\sigma_{k}u_{k}v_{k}^{\top}$ where $\sigma_{k}$ and $(u_{k},v_{k})$ are the $k$ th singular values and vectors of $M$ , respectively. Hence, the random variable of interest can be rewritten as

[TABLE]

By assumption, the random variables $\langle p,u_{k}\rangle$ and $\langle v_{k},q\rangle$ are $\eta$ -sub-gaussian. This implies that the products $\langle p,u_{k}\rangle\langle v_{k},q\rangle$ are $\eta^{2}$ -sub-exponential.
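The step from sub-gaussian factors to a sub-exponential product is a standard fact. A brief derivation is sketched below, assuming the usual Orlicz-norm definitions with constant 2 and using generic placeholders $X$, $Y$ that are not notation from the proof. By homogeneity one may normalize $\|X\|_{\psi_2}=\|Y\|_{\psi_2}=1$.

```latex
% Young's inequality: |XY| \le \tfrac12 X^2 + \tfrac12 Y^2, and ab \le \tfrac12(a^2+b^2).
\mathbb{E}\exp(|XY|)
  \le \mathbb{E}\big[\exp(X^2/2)\exp(Y^2/2)\big]
  \le \tfrac12\,\mathbb{E}\big[\exp(X^2)+\exp(Y^2)\big]
  \le \tfrac12(2+2) = 2 .

% Hence \|XY\|_{\psi_1}\le 1 after normalization, and in general
\|XY\|_{\psi_1} \le \|X\|_{\psi_2}\,\|Y\|_{\psi_2} .
```

No independence between $X$ and $Y$ is needed for this argument.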

To prove that the random variable (B.16) is sub-exponential we will again use (B.12). Since each random variable $\langle p,u_{k}\rangle\langle v_{k},q\rangle$ is sub-exponential, the equivalence implies the existence of a constant $c>0$ for which the uniform bound

[TABLE]

holds. Let $\lambda$ be an arbitrary scalar with $|\lambda|\leq 1/c\eta^{4}$ . Then, expanding the moment generating function of (B.16), we get

[TABLE]

where the inequality follows by (B.17) and the last relation follows since $\sigma$ has unit norm. Combining this with (B.12) gives

[TABLE]

Thus, we have proved (B.14).

Once again, to show the deviation bound (B.15) we introduce the random variables

[TABLE]

and apply Bernstein’s inequality for sub-exponential random variables [58, Theorem 2.8.2] to get the stated upper bound on $\mathbb{P}\left(\frac{1}{m}\left|\sum_{i=1}^{m}Y_{i}\right|\geq t\right).$ ∎

Applying Proposition B.1 with $\beta(r)\asymp\eta^{2}$ and $c(m,r)\asymp m/{\eta^{4}}$ now yields the result. ∎
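The concentration of $\frac{1}{m}\|\mathcal{A}(M)\|_{1}$ around its mean can be illustrated numerically. The sketch below is only an illustration under assumptions that are not taken from the proof: the vectors $p_i, q_i$ are standard Gaussian, $M$ is a rank-one matrix with unit Frobenius norm (so that $\mathbb{E}|p^{\top}Mq| = 2/\pi$ exactly), and the function name `bilinear_l1_average` is hypothetical.

```python
import numpy as np


def bilinear_l1_average(M, m, rng):
    """Return (1/m) * sum_i |p_i^T M q_i| for i.i.d. standard Gaussian p_i, q_i."""
    d1, d2 = M.shape
    P = rng.standard_normal((m, d1))  # row i is p_i
    Q = rng.standard_normal((m, d2))  # row i is q_i
    values = np.sum((P @ M) * Q, axis=1)  # row-wise bilinear forms p_i^T M q_i
    return np.abs(values).mean()


rng = np.random.default_rng(0)
d = 10

# Rank-one test matrix with unit Frobenius norm (an illustrative choice).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
M = np.outer(u, v)

# For unit u, v the variables p^T u and q^T v are independent N(0, 1),
# so E|p^T M q| = (E|g|)^2 = 2 / pi.
expected = 2.0 / np.pi

deviation_by_m = {}
for m in (100, 10_000, 200_000):
    deviation_by_m[m] = abs(bilinear_l1_average(M, m, rng) - expected)
    print(f"m = {m:>7d}   |empirical mean - 2/pi| = {deviation_by_m[m]:.5f}")
```

The deviation shrinks as the number of measurements grows, which is the qualitative behavior that the Bernstein-type bound quantifies.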

B.3 Proof of Proposition B.1

Choose $\epsilon\in(0,\sqrt{2})$ and let $\mathcal{N}$ be the ( $\epsilon/\sqrt{2}$ )-net guaranteed by Lemma F.1. Pick some $t\in(0,K]$ so that (B.2) can hold; we will fix the value of this parameter later in the proof. Let $\mathcal{E}$ denote the event that the following two estimates hold for all matrices $M\in\mathcal{N}$ :

[TABLE]

Throughout the proof, we will assume that the event $\mathcal{E}$ holds. We will estimate the probability of $\mathcal{E}$ at the end of the proof. Meanwhile, seeking to establish RIP, define the quantity

[TABLE]

We aim first to provide a high probability bound on $c_{2}$ .

Let $M\in S_{2r}$ be arbitrary and let $M_{\star}$ be the closest point to $M$ in $\mathcal{N}$ . Then we have

[TABLE]

where (B.20) follows from (B.19) and (B.21) follows from the triangle inequality. To simplify the third term in (B.21), we use the SVD to deduce that there exist two mutually orthogonal matrices $M_{1},M_{2}$ , each of rank at most $2r$ , satisfying $M-M_{\star}=M_{1}+M_{2}.$ With this decomposition in hand, we compute

[TABLE]

where the second inequality follows from the definition of $c_{2}$ and the estimate $\|M_{1}\|_{F}+\|M_{2}\|_{F}\leq\sqrt{2}\|(M_{1},M_{2})\|_{F}=\sqrt{2}\|M_{1}+M_{2}\|_{F}.$ Thus, we arrive at the bound
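The decomposition and the norm estimate can be checked numerically. The sketch below assumes one natural way to build the decomposition, namely splitting the SVD of $M-M_{\star}$ (which has rank at most $4r$) into its leading $2r$ components and the rest; the helper name `split_into_orthogonal_parts` and the dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np


def split_into_orthogonal_parts(D, r):
    """Split D (rank <= 4r) as M1 + M2 with rank(Mi) <= 2r and <M1, M2> = 0."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    k = 2 * r
    M1 = (U[:, :k] * s[:k]) @ Vt[:k]  # leading 2r singular components
    M2 = (U[:, k:] * s[k:]) @ Vt[k:]  # remaining singular components
    return M1, M2


rng = np.random.default_rng(1)
d1, d2, r = 30, 20, 2


def random_unit_low_rank(rank):
    """Random d1 x d2 matrix of the given rank with unit Frobenius norm."""
    X = rng.standard_normal((d1, rank)) @ rng.standard_normal((rank, d2))
    return X / np.linalg.norm(X)


M = random_unit_low_rank(2 * r)
M_star = random_unit_low_rank(2 * r)
D = M - M_star  # rank at most 4r

M1, M2 = split_into_orthogonal_parts(D, r)
inner = np.sum(M1 * M2)  # trace inner product <M1, M2>
lhs = np.linalg.norm(M1) + np.linalg.norm(M2)
rhs = np.sqrt(2.0) * np.linalg.norm(D)
print(f"<M1, M2> = {inner:.2e},  ||M1||_F + ||M2||_F = {lhs:.4f},  sqrt(2)*||D||_F = {rhs:.4f}")
```

Because the two parts are built from disjoint sets of singular vectors, their trace inner product vanishes, so $\|M_{1}\|_{F}^{2}+\|M_{2}\|_{F}^{2}=\|M_{1}+M_{2}\|_{F}^{2}$ and the stated estimate follows from the Cauchy–Schwarz inequality.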

[TABLE]

As $M$ was arbitrary, we may take the supremum of both sides of the inequality, yielding $c_{2}\leq\frac{1}{m}\sup_{M\in S_{2r}}\mathbb{E}\|\mathcal{A}(M)\|_{1}+t+2c_{2}\epsilon$ . Rearranging yields the bound

[TABLE]

Assuming that $\epsilon\leq 1/4$ , we further deduce that

[TABLE]

establishing that the random variable $c_{2}$ is bounded by $\bar{\sigma}$ in the event $\mathcal{E}$ .
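For the reader's convenience, the elementary rearrangement behind the last two bounds can be spelled out as follows. The shorthand $S$ is introduced here only for illustration and does not appear in the paper; the argument also uses that $c_{2}$ is finite.

```latex
% Shorthand (illustrative): S := \frac{1}{m}\sup_{M\in S_{2r}}\mathbb{E}\|\mathcal{A}(M)\|_{1}.
c_{2} \le S + t + 2\epsilon\,c_{2}
\;\Longrightarrow\;
(1-2\epsilon)\,c_{2} \le S + t
\;\Longrightarrow\;
c_{2} \le \frac{S+t}{1-2\epsilon},

% and, for \epsilon \le 1/4,
\frac{1}{1-2\epsilon} \le 2
\;\Longrightarrow\;
c_{2} \le 2\,(S+t).
```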

Let $\hat{\mathcal{I}}$ denote either $\emptyset$ or $\mathcal{I}$ . We now provide a uniform lower bound on $\frac{1}{m}\|\mathcal{A}_{\hat{\mathcal{I}}^{c}}(M)\|_{1}-\frac{1}{m}\|\mathcal{A}_{\hat{\mathcal{I}}}(M)\|_{1}$ . Indeed,

[TABLE]

where (B.25) uses the forward and reverse triangle inequalities, (B.26) follows from (B.18), the estimate (B.27) follows from the forward and reverse triangle inequalities, and (B.28) follows from (B.22) and (B.24). Switching the roles of $\mathcal{I}$ and $\mathcal{I}^{c}$ in the above sequence of inequalities, and choosing $\epsilon=t/4\bar{\sigma}$ , we deduce

[TABLE]

In particular, setting $\hat{\mathcal{I}}=\emptyset$ , we deduce

[TABLE]

and therefore using (B.1), we conclude the RIP property

[TABLE]

Next, let $\hat{\mathcal{I}}=\mathcal{I}$ and note that

[TABLE]

where the equality follows by assumption $\mathrm{(1)}$ . Therefore every $M\in S_{2r}$ satisfies

[TABLE]

Setting $t=\frac{2}{3}\min\{\alpha,\alpha(1-2|\mathcal{I}|/m)/2\}=\frac{1}{3}\alpha(1-2|\mathcal{I}|/m)$ in (B.29) and (B.30), we deduce the claimed estimates (B.3) and (B.4). Finally, let us estimate the probability of $\mathcal{E}$ . Using the union bound and Lemma F.1 yields
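The simplification of the minimum in the choice of $t$ is a one-line check. The sketch below assumes $\alpha>0$ and $0\le|\mathcal{I}|/m<1/2$, so that the factor $1-2|\mathcal{I}|/m$ lies in $(0,1]$.

```latex
0 < 1-\tfrac{2|\mathcal{I}|}{m} \le 1
\;\Longrightarrow\;
\tfrac{1}{2}\,\alpha\Big(1-\tfrac{2|\mathcal{I}|}{m}\Big) \le \tfrac{\alpha}{2} < \alpha
\;\Longrightarrow\;
\tfrac{2}{3}\min\Big\{\alpha,\ \tfrac{1}{2}\,\alpha\Big(1-\tfrac{2|\mathcal{I}|}{m}\Big)\Big\}
= \tfrac{1}{3}\,\alpha\Big(1-\tfrac{2|\mathcal{I}|}{m}\Big).
```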

[TABLE]

where $c(m,r)$ is the function guaranteed by assumption $\mathrm{(3)}$ .

By (B.1) we get $1/\epsilon=4\bar{\sigma}/t\lesssim 2+\beta(r)/(1-2|\mathcal{I}|/m)$ . Then we deduce

[TABLE]

Hence as long as $c(m,r)\geq\frac{9c_{1}(d_{1}+d_{2}+1)r^{2}\ln\left(c_{2}+\frac{c_{2}\beta(r)}{1-2|\mathcal{I}|/m}\right)}{\alpha^{2}\left(1-\frac{2|\mathcal{I}|}{m}\right)^{2}}$ , we can be sure

[TABLE]

This proves the desired result. ∎

Appendix C Proofs in Section 7

C.1 Proof of Lemma 7.4

Define $P(x,y)=a\|y-x\|^{2}_{2}+b\|y-x\|_{2}$ . Fix an iteration $k$ and choose $x^{*}\in\mathrm{proj}_{\mathcal{X}^{*}}(x_{k})$ . Then the following estimate holds:

[TABLE]

Rearranging and using the sharpness and approximation accuracy assumptions, we deduce

[TABLE]

The result follows.

C.2 Proof of Theorem 7.6

First notice that for any $y$ , we have $\partial f(y)=\partial f_{y}(y)$ . Therefore, since $f_{y}$ is a convex function, the following bound holds for all $x,y\in\mathcal{X}$ and $v\in\partial f(y)$ :

[TABLE]

Consequently, given that ${\rm dist}(x_{i},\mathcal{X}^{*})\leq\gamma\cdot\frac{\mu-2b}{2a}$ , we have

[TABLE]

Here, the estimate (C.2) follows from the fact that the projection $\mathrm{proj}_{\mathcal{X}}(\cdot)$ is nonexpansive, (C.3) uses the bound in (C.1), (C.5) follows from the estimate ${\rm dist}(x_{i},\mathcal{X}^{*})\leq\gamma\cdot\frac{\mu-2b}{2a}$ , while (C.4) and (C.6) use local sharpness. The result then follows by the upper bound $\|\zeta_{i}\|\leq L$ .

Appendix D Proofs in Section 8

D.1 Proof of Lemma 8.1

The inequality can be established using an argument similar to that for bounding the $T_{3}$ term in [19, Section 6.6]. We provide the proof below for completeness. Define the shorthand $\Delta_{S}:=S-S_{\sharp}$ and $\Delta_{X}=X-X_{\sharp}$ , and let $e_{j}\in\mathbb{R}^{d}$ denote the $j$ -th standard basis vector of $\mathbb{R}^{d}$ . Simple algebra gives

[TABLE]

We claim that $\|\Delta_{S}e_{j}\|_{1}\leq 2\sqrt{k}\|\Delta_{S}e_{j}\|_{2}$ for each $j\in[d]$ . To see this, fix any $j\in[d]$ and let $v:=Se_{j}$ , $v^{*}:=S_{\sharp}e_{j}$ , and $T:=\text{support}(v^{*}).$ We have

[TABLE]

Rearranging terms gives $\|(v-v^{*})_{T^{c}}\|_{1}\leq\|(v-v^{*})_{T}\|_{1}$ , whence

[TABLE]

where the second inequality holds because $|T|\leq k$ by assumption. The claim follows from noting that $v-v^{*}=\Delta_{S}e_{j}$ .
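The claim $\|v-v^{*}\|_{1}\leq 2\sqrt{k}\|v-v^{*}\|_{2}$ can be sanity-checked numerically. The sketch below assumes that the hidden display starts from the comparison $\|v\|_{1}\leq\|v^{*}\|_{1}$ (the usual starting point of this cone argument, and consistent with the rearranged inequality stated in the text); the dimensions, the sampling scheme, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 200, 5
worst_ratio = 0.0

for _ in range(1000):
    # k-sparse reference vector v_star with support T.
    v_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    v_star[support] = rng.standard_normal(k)

    # Arbitrary dense vector v, rescaled so that ||v||_1 <= ||v_star||_1
    # (assumed constraint; see the lead-in).
    v = rng.standard_normal(d)
    v *= rng.random() * np.abs(v_star).sum() / np.abs(v).sum()

    delta = v - v_star
    ratio = np.abs(delta).sum() / (2.0 * np.sqrt(k) * np.linalg.norm(delta))
    worst_ratio = max(worst_ratio, ratio)

# The claim asserts that this ratio never exceeds 1.
print(f"largest observed ||delta||_1 / (2 sqrt(k) ||delta||_2): {worst_ratio:.4f}")
```

Random trials of this kind do not probe the worst case, so the check only confirms that the inequality is not violated on the sampled instances.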

Using the claim, we get that

[TABLE]

Using a similar argument and the fact that $\|\Delta_{X}\|_{2,\infty}\leq\|X\|_{2,\infty}+\|X_{\sharp}\|_{2,\infty}\leq 3\sqrt{\frac{\nu r}{d}}$ , we obtain

[TABLE]

Putting everything together, we have

[TABLE]

The claim follows.

D.2 Proof of Theorem 8.5

Without loss of generality, suppose that $x$ is closer to $\bar{x}$ than to $-\bar{x}$ . Consider the following expression:

[TABLE]

We now produce a few different lower bounds by testing against different $V$ . In what follows, we set $a=\sqrt{2}-1$ , i.e., the positive solution of the equation $1-a^{2}=2a$ .
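The value of $a$ can be verified directly:

```latex
1-a^{2}=2a
\;\Longleftrightarrow\;
a^{2}+2a-1=0
\;\Longleftrightarrow\;
a=-1\pm\sqrt{2},

% and for the positive root a=\sqrt{2}-1:
1-(\sqrt{2}-1)^{2} = 1-(3-2\sqrt{2}) = 2\sqrt{2}-2 = 2(\sqrt{2}-1).
```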

Case 1:

Suppose that

[TABLE]

Then set $\bar{V}=\mathrm{sign}((x-\bar{x})^{\top}\mathrm{sign}(\bar{x}))\cdot\mathrm{sign}(\bar{x})\mathrm{sign}(\bar{x})^{\top}$ , to get

[TABLE]

Case 2:

Suppose that

[TABLE]

Then set $\bar{V}=\mathrm{sign}(\mathrm{sign}(x-\bar{x})^{\top}\bar{x})\cdot\mathrm{sign}(x-\bar{x})\mathrm{sign}(x-\bar{x})^{\top}$ , to get

[TABLE]

Case 3:

Suppose that

[TABLE]

Define $\bar{V}=\frac{1}{2}(\mathrm{sign}(\bar{x}(x-\bar{x})^{\top})+\mathrm{sign}((x-\bar{x})\bar{x}^{\top}))$ . Observe that

[TABLE]

and

[TABLE]

Putting these two bounds together, we find that

[TABLE]

Altogether, we find that

[TABLE]

as desired.

D.3 Proof of Lemma 8.7

We start by stating a claim we will use to prove the lemma. Let us introduce some notation. Consider the set

[TABLE]

Define the random variable

[TABLE]

Claim 1.

There exist constants $c_{2},c_{3}>0$ such that with probability at least $1-\exp(-c_{2}\log d)$

[TABLE]

Before proving this claim, let us show how it implies the lemma. Let

[TABLE]

Set $\Delta_{-}=X-X_{\sharp}R$ and $\Delta_{+}=X+X_{\sharp}R$ . Notice that

[TABLE]

Therefore, because $(\Delta_{+},\Delta_{-})\in S$ and

[TABLE]

we have that

[TABLE]

where the last line follows by Conjecture 8.6. This proves the desired result.

Proof of the Claim.

Our goal is to show that the random variable $Z$ is highly concentrated around its mean. We may apply the standard symmetrization inequality [5, Lemma 11.4] to bound the expectation $\mathbb{E}Z$ as follows:

[TABLE]

Observing that $T_{1}$ and $T_{2}$ can both be bounded by

[TABLE]

where the final inequality follows from Bernstein’s inequality and a union bound, we find that

[TABLE]

To prove that $Z$ is well concentrated around $\mathbb{E}Z$ , we apply Theorem F.3. To apply this theorem, we set $\mathcal{S}=S$ and define the collection $(Z_{ij,s})_{ij,s\in\mathcal{S}}$ , where $s=(\Delta_{+},\Delta_{-})$ by

[TABLE]

We also bound

[TABLE]

and

[TABLE]

Therefore, due to Theorem F.3, there exist constants $c_{1},c_{2},c_{3}>0$ such that, with $t=c_{2}\log d$ , the following bound holds with probability at least $1-e^{-c_{2}\log d}$ :

[TABLE]

where the last line follows since by assumption $\log d/d\lesssim\tau.$ ∎

Appendix E Proofs in Section 9

E.1 Proof of Lemma 9.1

The proof follows the same strategy as [24, Theorem 6.1]. Fix $x\in\widetilde{\mathcal{T}}_{1}$ and let $\zeta\in\partial\tilde{f}(x)$ . Then for all $y$ , we have, from Lemma 9.3, that

[TABLE]

Therefore, the function

[TABLE]

satisfies

[TABLE]

Now, for some $\gamma>0$ to be determined momentarily, define

[TABLE]

First order optimality conditions and the sum rule immediately imply that

[TABLE]

Thus,

[TABLE]

Now we estimate $\|x-\hat{x}\|_{2}$ . Indeed, from the definition of $\hat{x}$ we have

[TABLE]

Consequently, we have $\|x-\hat{x}\|\leq 2\gamma$ . Thus, setting $\gamma=\sqrt{2\varepsilon/\rho}$ and recalling that $\varepsilon\leq\mu^{2}/56\rho$ we find that

[TABLE]

Likewise, we have

[TABLE]

Therefore, setting $L=\sup\left\{\|\zeta\|_{2}:\zeta\in\partial f(x),{\rm dist}(x,\mathcal{X}^{\ast})\leq\frac{\mu}{\rho},{\rm dist}(x,\mathcal{X})\leq 2\sqrt{\frac{\varepsilon}{\rho}}\right\}$ , we find that

[TABLE]

as desired.

E.2 Proof of Theorem 9.4

Let $i\geq 0$ , suppose $x_{i}\in\widetilde{\mathcal{T}}_{1}$ , and let $x^{\ast}\in\mathrm{proj}_{\mathcal{X}^{\ast}}(x_{i})$ . Notice that Lemma 9.2 implies $\tilde{f}(x_{i})-\min_{\mathcal{X}}f>0$ . We successively compute

[TABLE]

Here, the estimate (E.1) follows from the fact that the projection $\mathrm{proj}_{Q}(\cdot)$ is nonexpansive, (E.2) uses Lemma 9.3, the estimate (E.4) follows from the assumption $\varepsilon<\frac{\mu}{14}\|x_{i}-x^{*}\|$ , the estimate (E.5) follows from the estimate $\|x_{i}-x^{*}\|\leq\frac{\mu}{4\rho}$ , while (E.3) and (E.6) use Lemma 9.2. We therefore deduce

[TABLE]

Consequently either we have ${\rm dist}(x_{i+1},\mathcal{X}^{\ast})<\frac{14\varepsilon}{\mu}$ or $x_{i+1}\in\widetilde{\mathcal{T}}_{1}$ . Therefore, by induction, the proof is complete.
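To make the iteration analyzed in this proof concrete, here is a minimal sketch of a Polyak-type subgradient method in a simplified setting. The assumptions are mine and not the setting of Theorem 9.4: the problem is noiseless robust phase retrieval with Gaussian measurements, $f(x)=\frac{1}{m}\sum_{i}|\langle a_{i},x\rangle^{2}-b_{i}|$ with $\min f=0$, there is no constraint set (so the projection is the identity), and the initial point has relative error $0.1$. The function names are hypothetical.

```python
import numpy as np


def polyak_subgradient(A, b, x0, num_iters=500):
    """Polyak subgradient method for f(x) = mean(|(A x)^2 - b|), assuming min f = 0."""
    x = x0.copy()
    m = A.shape[0]
    for _ in range(num_iters):
        Ax = A @ x
        residual = Ax**2 - b
        f_val = np.abs(residual).mean()
        if f_val == 0.0:
            break  # exact solution reached
        # A subgradient of f at x: (2/m) * sum_i sign(residual_i) * <a_i, x> * a_i.
        zeta = (2.0 / m) * (A.T @ (np.sign(residual) * Ax))
        sq_norm = zeta @ zeta
        if sq_norm == 0.0:
            break  # zero subgradient, cannot take a Polyak step
        # Polyak step size (f(x) - min f) / ||zeta||^2 with min f = 0.
        x = x - (f_val / sq_norm) * zeta
    return x


def dist_to_solutions(x, x_bar):
    """Distance from x to the solution set {x_bar, -x_bar}."""
    return min(np.linalg.norm(x - x_bar), np.linalg.norm(x + x_bar))


rng = np.random.default_rng(3)
d, m = 20, 200
A = rng.standard_normal((m, d))
x_bar = rng.standard_normal(d)
b = (A @ x_bar) ** 2  # noiseless measurements

# Initialize within relative error 0.1 of the solution (illustrative choice).
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
x0 = x_bar + 0.1 * np.linalg.norm(x_bar) * direction

x_final = polyak_subgradient(A, b, x0)
initial_dist = dist_to_solutions(x0, x_bar)
final_dist = dist_to_solutions(x_final, x_bar)
print(f"initial distance: {initial_dist:.3e},  final distance: {final_dist:.3e}")
```

In the noisy setting treated in this appendix, the iterates can only be expected to reach a neighborhood of the solution set whose radius scales with $\varepsilon/\mu$, as the final dichotomy in the proof indicates.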

E.3 Proof of Theorem 9.6

Let $i\geq 0$ , suppose $x_{i}\in\mathcal{T}_{\gamma}$ , and let $x^{\ast}\in\mathrm{proj}_{\mathcal{X}^{\ast}}(x_{i})$ . Then

[TABLE]

Rearranging yields the result.

Appendix F Auxiliary lemmas

Lemma F.1 (Lemma 3.1 in [13]).

Let $S_{r}:=\left\{X\in{\bf R}^{d_{1}\times d_{2}}\mid\text{Rank }\,(X)\leq r,\left\|X\right\|_{F}=1\right\}$ . There exists an $\epsilon$ -net $\mathcal{N}$ (with respect to $\|\cdot\|_{F}$ ) of $S_{r}$ obeying

[TABLE]

Proposition F.2 (Corollary 1.4 in [54]).

Consider real-valued random variables $X_{1},\dots,X_{d}$ and let $\sigma\in\mathbb{S}^{d-1}$ be a unit vector. Let $t,p>0$ be such that

[TABLE]

Then the following holds

[TABLE]

where $C>0$ is a universal constant.

Theorem F.3 (Talagrand’s Functional Bernstein for non-identically distributed variables [36, Theorem 1.1(c)]).

Let $\mathcal{S}$ be a countable index set. Let $Z_{1},\ldots,Z_{n}$ be independent vector-valued random variables of the form $Z_{i}=(Z_{i,s})_{s\in\mathcal{S}}$ . Let $Z:=\sup_{s\in\mathcal{S}}\sum_{i=1}^{n}Z_{i,s}$ . Assume that for all $i\in[n]$ and $s\in\mathcal{S}$ , $\mathbb{E}Z_{i,s}=0$ and $\left|Z_{i,s}\right|\leq b$ . Let

[TABLE]

Then for each $t>0$ , we have the tail bound

[TABLE]

Bibliography

[1] A. Ahmed, B. Recht, and J. Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2014.
[2] P. Albano and P. Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.
[3] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
[4] J.M. Borwein and A.S. Lewis. Convex analysis and nonlinear optimization. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer-Verlag, New York, 2000. Theory and examples.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[6] J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming, 33(3):260–279, 1985.
[7] J.V. Burke and M.C. Ferris. A Gauss-Newton method for convex composite optimization. Math. Programming, 71(2, Ser. A):179–194, 1995.
[8] T.T. Cai and A. Zhang. ROP: matrix recovery via rank-one projections. Ann. Statist., 43(1):102–138, 2015.
