Phase Transitions of Spectral Initialization for High-Dimensional Nonconvex Estimation

Yue M. Lu; Gen Li

arXiv:1702.06435·cs.IT·July 23, 2019


TL;DR

This paper analyzes the phase transition behavior of spectral initialization in high-dimensional nonconvex estimation, revealing thresholds for when the method provides meaningful signal estimates.

Contribution

It offers a precise asymptotic characterization of spectral initialization performance across generalized linear models, extending beyond phase retrieval.

Findings

1. Performance sharply transitions at specific sample-to-dimension ratios.

2. Below threshold, estimates are no better than random.

3. Above threshold, estimates align with the true signal.



Equations

$$y_{i}\sim f(y\,|\,\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}),$$

$$\widehat{\boldsymbol{\xi}}=\underset{\boldsymbol{x}}{\arg\min}\ \sum_{i=1}^{m}\ell(y_{i},\boldsymbol{a}_{i}^{\top}\boldsymbol{x}),$$

$$\boldsymbol{D}_{m}\overset{\text{def}}{=}\frac{1}{m}\sum_{i=1}^{m}\mathcal{T}(y_{i})\,\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top},$$

$$\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})\overset{\text{def}}{=}\frac{(\boldsymbol{\xi}^{\top}\boldsymbol{x}_{1})^{2}}{\lVert\boldsymbol{\xi}\rVert^{2}\,\lVert\boldsymbol{x}_{1}\rVert^{2}},$$

$$s\sim\mathcal{N}(0,1),\quad\mathbb{P}(y\,|\,s)=f(y\,|\,\kappa s),\quad\text{and}\quad z=\mathcal{T}(y),$$

$$\lim_{\lambda\to\tau^{+}}\mathbb{E}\,\frac{z}{(\lambda-z)^{2}}=\lim_{\lambda\to\tau^{+}}\mathbb{E}\,\frac{zs^{2}}{\lambda-z}=\infty.$$

$$\mathbb{E}\,zs^{2}>\mathbb{E}\,z.$$

$$z=\mathcal{T}(y)=y\,\mathds{1}_{\lvert y\rvert\leq t^{2}},$$

$$\boldsymbol{D}_{m}\approx\mathbb{E}\,(z_{i}\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}),$$

$$\boldsymbol{a}_{i}^{\top}=[s_{i}\ \ \boldsymbol{u}_{i}^{\top}],$$

$$\mathbb{E}\,(z_{i}\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top})=\begin{bmatrix}\mathbb{E}\,zs^{2}&\boldsymbol{0}\\ \boldsymbol{0}&\mathbb{E}z\,\boldsymbol{I}_{n-1}\end{bmatrix},$$

$$g(s)\overset{\text{def}}{=}\mathbb{E}_{z|s}(z\,|\,s)$$

$$\phi(\lambda)\overset{\text{def}}{=}\lambda\,\mathbb{E}\,\frac{zs^{2}}{\lambda-z}$$

$$\psi_{\alpha}(\lambda)\overset{\text{def}}{=}\lambda\,\Big(1/\alpha+\mathbb{E}\,\frac{z}{\lambda-z}\Big),$$

$$\overline{\lambda}_{\alpha}\overset{\text{def}}{=}\underset{\lambda>\tau}{\arg\min}\ \psi_{\alpha}(\lambda).$$

$$\zeta_{\alpha}(\lambda)\overset{\text{def}}{=}\psi_{\alpha}\big(\lambda\vee\overline{\lambda}_{\alpha}\big)$$

$$\zeta_{\alpha}(\lambda)=\phi(\lambda),\qquad\lambda>\tau.$$

$$\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})\overset{\mathcal{P}}{\longrightarrow}\begin{cases}0,&\text{if }\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})<0,\\[2pt] \dfrac{\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})}{\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})-\phi^{\prime}(\lambda^{\ast}_{\alpha})},&\text{if }\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})>0,\end{cases}$$

$$\lambda_{1}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\zeta_{\alpha}(\lambda^{\ast}_{\alpha})\quad\text{and}\quad\lambda_{2}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\zeta_{\alpha}(\overline{\lambda}_{\alpha})$$

$$\lambda_{c,\min}\overset{\text{def}}{=}\min_{\lambda\in\Lambda}\lambda\quad\text{and}\quad\lambda_{c,\max}\overset{\text{def}}{=}\max_{\lambda\in\Lambda}\lambda$$

$$\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})\overset{\mathcal{P}}{\longrightarrow}\begin{cases}0,&\text{if }\alpha<\alpha_{c,\min},\\ \rho(\alpha),&\text{if }\alpha>\alpha_{c,\max},\end{cases}$$

$$\alpha_{c,\min}^{-1}=\mathbb{E}\,\frac{z^{2}}{(\lambda_{c,\min}-z)^{2}},\qquad\alpha_{c,\max}^{-1}=\mathbb{E}\,\frac{z^{2}}{(\lambda_{c,\max}-z)^{2}},$$

$1/\alpha$

$1/\rho$

$$\phi(\lambda)=\frac{c\lambda}{\lambda-1}\quad\text{and}\quad\psi_{\alpha}(\lambda)=\lambda\Big(1/\alpha+\frac{d}{\lambda-1}\Big),$$

$$c\overset{\text{def}}{=}\mathbb{E}\,zs^{2}\quad\text{and}\quad d\overset{\text{def}}{=}\mathbb{E}\,z$$

$$\zeta_{\alpha}(\lambda)=\begin{cases}\lambda/\alpha+\lambda d/(\lambda-1),&\text{for }\lambda\geq 1+\sqrt{\alpha d},\\ \big(\sqrt{d}+\sqrt{1/\alpha}\big)^{2},&\text{for }1<\lambda<1+\sqrt{\alpha d}.\end{cases}$$

$$\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})\overset{\mathcal{P}}{\longrightarrow}\begin{cases}0,&\text{for }\alpha<\alpha_{c},\\[2pt] \dfrac{\alpha-d/(c-d)^{2}}{\alpha+1/(c-d)},&\text{for }\alpha>\alpha_{c},\end{cases}$$

$$\lambda_{1}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\begin{cases}\big(\sqrt{d}+\sqrt{1/\alpha}\big)^{2},&\text{for }\alpha<\alpha_{c},\\[2pt] c+\dfrac{c}{\alpha(c-d)},&\text{for }\alpha>\alpha_{c},\end{cases}$$

$$\lambda_{2}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\big(\sqrt{d}+\sqrt{1/\alpha}\big)^{2}$$


Full text


Yue M. Lu and Gen Li

Y. M. Lu is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA (e-mail: [email protected]). Part of this work was done during his visit to the Information Initiative at Duke (iiD) in Spring 2016. He thanks members of this interdisciplinary program for their hospitality. G. Li is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]). He was a summer visiting undergraduate student at the John A. Paulson School of Engineering and Applied Sciences, Harvard University.

This work was supported in part by the ARO under contract W911NF-16-1-0265 and by the US National Science Foundation under grants CCF-1319140 and CCF-1718698. A preliminary version of this work was presented at the IEEE International Symposium on Information Theory (ISIT) in 2017.

Abstract

We study a spectral initialization method that serves a key role in recent work on estimating signals in nonconvex settings. Previous analysis of this method focuses on the phase retrieval problem and provides only performance bounds. In this paper, we consider arbitrary generalized linear sensing models and present a precise asymptotic characterization of the performance of the method in the high-dimensional limit. Our analysis also reveals a phase transition phenomenon that depends on the ratio between the number of samples and the signal dimension. When the ratio is below a minimum threshold, the estimates given by the spectral method are no better than random guesses drawn from a uniform distribution on the hypersphere, thus carrying no information; above a maximum threshold, the estimates become increasingly aligned with the target signal. The computational complexity of the method, as measured by the spectral gap, is also markedly different in the two phases. Worked examples and numerical results are provided to illustrate and verify the analytical predictions. In particular, simulations show that our asymptotic formulas provide accurate predictions for the actual performance of the spectral method even at moderate signal dimensions.

Index Terms:

Spectral initialization, signal estimation, nonconvex optimization, spiked covariance model, phase transition

I Introduction

We consider the problem of estimating an $n$ -dimensional vector $\boldsymbol{\xi}$ from a number of generalized linear measurements. Let $\left\{\boldsymbol{a}_{i}\right\}_{1\leq i\leq m}$ be a set of sensing vectors in $\mathbb{R}^{n}$ . Given $\left\{\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}\right\}$ , the measurements are drawn independently from

$$y_{i}\sim f(y\,|\,\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}), \tag{1}$$

where $f(\cdot\,|\,\cdot)$ is a conditional density function modeling the acquisition process. This model arises in many problems in signal processing and statistical learning. Examples include photon-limited imaging [1, 2], phase retrieval [3], signal recovery from quantized measurements [4], and various single-index and generalized linear regression problems [5, 6].

The standard method for recovering $\boldsymbol{\xi}$ is to use the estimator

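$$\widehat{\boldsymbol{\xi}}=\underset{\boldsymbol{x}}{\arg\min}\ \sum_{i=1}^{m}\ell(y_{i},\boldsymbol{a}_{i}^{\top}\boldsymbol{x}), \tag{2}$$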

where $\ell\colon\mathbb{R}^{2}\rightarrow\mathbb{R}$ is some loss function (e.g., the negative log-likelihood of the observation model as used in maximum likelihood estimation). In many applications, however, the natural loss function is not convex with respect to $\boldsymbol{x}$, and there is often no effective way to convexify (2). Even in those cases for which convex relaxations do exist, the resulting algorithms can be computationally expensive. The problem of phase retrieval, where $y_{i}=(\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi})^{2}+\varepsilon_{i}$ for some noise terms $\left\{\varepsilon_{i}\right\}$, is an example of the latter scenario. Convex relaxation schemes such as those based on lifting and semidefinite programming (e.g., [7, 8, 9, 10]) have been successfully developed for solving the phase retrieval problem, but the challenges facing these schemes lie in their actual implementation. In practice, the computational complexity and memory requirements associated with these convex-relaxation methods are prohibitive for signal dimensions that are encountered in real-world applications such as imaging.

In light of these issues, there is strong recent interest in developing and analyzing efficient iterative methods that directly solve nonconvex forms of (2). Examples include the alternating minimization scheme for phase retrieval [11], the Wirtinger Flow algorithm and its variants [12, 13, 14, 15, 16], iterative projection methods [17, 18], and recent schemes for phase retrieval using linear programming [19, 20, 21, 22]. A common ingredient that contributes to the success of these algorithms for nonconvex estimation is that they all use some carefully-designed spectral method as an initialization step, which is then followed by further (iterative) refinement. Beyond the signal estimation problem considered in this paper, related spectral methods have also been successfully applied to initialize algorithms for solving other nonconvex problems such as matrix completion [23], low-rank matrix recovery [24], blind deconvolution [25, 26], sparse coding [27], and joint alignment from pairwise differences [28].

In this paper, we present an exact high-dimensional analysis of a widely-used spectral method [11, 12, 13] for estimating $\boldsymbol{\xi}$ . The method consists of only two steps: First, construct a data matrix from the sensing vectors and measurements as

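$$\boldsymbol{D}_{m}\overset{\text{def}}{=}\frac{1}{m}\sum_{i=1}^{m}\mathcal{T}(y_{i})\,\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}, \tag{3}$$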

where $\mathcal{T}\colon\mathbb{R}\rightarrow\mathbb{R}$ is a preprocessing function (e.g., a trimming or truncation step). Second, compute a normalized eigenvector, denoted by $\boldsymbol{x}_{1}$, that corresponds to the largest eigenvalue of $\boldsymbol{D}_{m}$. The vector $\boldsymbol{x}_{1}$ is then our estimate of $\boldsymbol{\xi}$ (up to an unknown scalar). It is notable that this method is model-free in that the algorithm does not require knowledge of the exact acquisition process [i.e., the conditional density $f(\cdot\,|\,\cdot)$ in (1)].
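As a rough illustration of the two steps described above, the following NumPy sketch builds the data matrix and extracts a leading eigenvector. The function names `trim` and `spectral_init`, the threshold `t = 3.0`, the noiseless phase-retrieval measurements, and the problem sizes are all assumptions chosen for this demonstration; they are not taken from the paper.

```python
import numpy as np


def trim(y, t=3.0):
    """Hypothetical preprocessing T(y): keep y where |y| <= t^2, else 0."""
    return y * (np.abs(y) <= t ** 2)


def spectral_init(A, y, preprocess=trim):
    """Leading unit-norm eigenvector of D_m = (1/m) * sum_i T(y_i) a_i a_i^T.

    A has shape (m, n), with the sensing vectors a_i as its rows.
    """
    m = A.shape[0]
    z = preprocess(y)                  # z_i = T(y_i)
    D = (A.T * z) @ A / m              # broadcasting scales a_i by z_i
    _, eigvecs = np.linalg.eigh(D)     # eigenvalues come back in ascending order
    return eigvecs[:, -1]              # eigenvector of the largest eigenvalue


rng = np.random.default_rng(0)
n, m = 50, 2000                        # assumed sizes for the demo
xi = rng.standard_normal(n)
xi /= np.linalg.norm(xi)               # unit-norm target (assumed)
A = rng.standard_normal((m, n))        # i.i.d. Gaussian sensing vectors
y = (A @ xi) ** 2                      # noiseless phase retrieval (assumed model)

x1 = spectral_init(A, y)
alignment = (xi @ x1) ** 2             # squared cosine; both vectors have unit norm
print(f"squared cosine similarity: {alignment:.3f}")
```

Using a dense `eigh` keeps the sketch short; for large $n$ one would more likely use power iterations or a sparse eigensolver, since only the top eigenvector is needed.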

The idea of this spectral method can be traced back to the early work of Li [29], under the name of Principal Hessian Directions for general multi-index models. Similar spectral techniques were also proposed in [23, 24], for initializing algorithms for matrix completion. In [11], Netrapalli, Jain, and Sanghavi used this method to address the problem of phase retrieval. Under the assumption that the sensing vectors consist of i.i.d. Gaussian random variables, these authors show that the leading eigenvector $\boldsymbol{x}_{1}$ is aligned with the target vector $\boldsymbol{\xi}$ in direction when there are sufficiently many measurements. More specifically, they show that the squared cosine similarity

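$$\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})\overset{\text{def}}{=}\frac{(\boldsymbol{\xi}^{\top}\boldsymbol{x}_{1})^{2}}{\lVert\boldsymbol{\xi}\rVert^{2}\,\lVert\boldsymbol{x}_{1}\rVert^{2}}, \tag{4}$$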

which measures the degree of the alignment between the two vectors, approaches $1$ with high probability, when the number of samples $m\geq c_{1}n\log^{3}n$ . This sufficient condition on sample complexity was later improved to $m\geq c_{2}n\log n$ in [7], and further improved to $m\geq c_{3}n$ in [13] with an additional trimming step on the measurements. In these expressions, $c_{1},c_{2},c_{3}$ stand for some unspecified numerical constants.

In this paper, we provide a precise asymptotic characterization of the performance of the spectral method under Gaussian measurements. Our analysis considers general acquisition models under arbitrary conditional distributions $f(y\,|\,\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi})$ , of which the phase retrieval problem is a special case. Unlike previous work, which only provides bounds for $\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})$ , we derive the exact high-dimensional limit of this value. In particular, we show that, as $n$ and $m$ both tend to infinity with the sampling ratio $\alpha\overset{\text{def}}{=}m/n$ kept fixed, the squared cosine similarity $\rho$ converges in probability to a limit value $\rho(\alpha)$ . Explicit formulas are provided for computing $\rho(\alpha)$ .

Geometrically, the squared cosine similarity $\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})$ as defined in (4) specifies the angle $\theta$ between $\boldsymbol{\xi}$ and $\boldsymbol{x}_{1}$. The values of $\rho$ vary from $0$ to $1$: $\rho=1$ means perfect alignment, i.e., $\theta=0$ or $\pi$; and $\rho=0$ is the opposite case, meaning $\boldsymbol{x}_{1}$ is orthogonal to (i.e., uncorrelated with) $\boldsymbol{\xi}$. That the spectral method can yield an estimate $\boldsymbol{x}_{1}$ with a positive $\rho$ in high-dimensional settings is a nontrivial property. To see this, assume that $\boldsymbol{\xi}$ is pointing towards the “north pole” in the unit $(n-1)$-sphere $\mathcal{S}^{n-1}$, as illustrated in Figure 1. If we choose $\boldsymbol{x}_{1}$ uniformly at random from $\mathcal{S}^{n-1}$, then with high probability, the resulting correlation $\sqrt{\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})}$ will be of order $\mathcal{O}(1/\sqrt{n})$. In other words, for large $n$, most of the uniform measure on $\mathcal{S}^{n-1}$ is concentrated within a very thin band of width $\mathcal{O}(1/\sqrt{n})$ near the “equator” of the sphere (see Figure 1).
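The $\mathcal{O}(1/\sqrt{n})$ scaling for a random guess can be checked numerically. The sketch below is an illustration only: the helper name `mean_squared_cosine`, the dimensions, and the number of trials are assumptions. It draws points uniformly from the sphere by normalizing Gaussian vectors and compares the average squared cosine with $1/n$.

```python
import numpy as np

rng = np.random.default_rng(1)


def mean_squared_cosine(n, trials=2000):
    """Average rho(xi, x) for x uniform on the unit sphere and xi = e_1."""
    x = rng.standard_normal((trials, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalized Gaussians are uniform on the sphere
    return float(np.mean(x[:, 0] ** 2))             # (e_1^T x)^2 with unit-norm x


results = {n: mean_squared_cosine(n) for n in (10, 100, 1000)}
for n, rho in results.items():
    print(f"n = {n:5d}: mean rho = {rho:.5f}, 1/n = {1 / n:.5f}")
```

Because the squared coordinates of a uniform point on the sphere sum to one, each has mean exactly $1/n$, which is what the averages should approach.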

Our analysis reveals a phase transition phenomenon that occurs at certain critical values of the sampling ratio. In particular, there exist a lower and an upper threshold, denoted by $\alpha_{c,\min}$ and $\alpha_{c,\max}$ , respectively, that mark the transitions between two very different phases.

(a) An uncorrelated phase takes place when the sampling ratio $\alpha<\alpha_{c,\min}$ . Within this phase, the limiting value $\rho(\alpha)=0$ , meaning that the estimate from the spectral method is asymptotically uncorrelated with the target vector $\boldsymbol{\xi}$ . In this case, the spectral method is not effective, as its estimate $\boldsymbol{x}_{1}$ is no better than a random guess drawn uniformly from the hypersphere $\mathcal{S}^{n-1}$ .

(b) A correlated phase takes place when $\alpha>\alpha_{c,\max}$, with $\alpha_{c,\max}$ being the upper threshold. Within this phase, the limiting value $\rho(\alpha)>0$. Geometrically, the estimate $\boldsymbol{x}_{1}$ (or its negative version $-\boldsymbol{x}_{1}$) will be concentrated on the surface of a right-circular cone (see Figure 1) whose generating lines make an angle $\theta=\arccos\big(\sqrt{\rho(\alpha)}\big)$ to the target vector $\boldsymbol{\xi}$. Moreover, $\rho(\alpha)$ tends to 1 as $\alpha\rightarrow\infty$.

In many signal estimation models that we have studied so far, the two thresholds coincide, i.e., $\alpha_{c,\min}=\alpha_{c,\max}$, meaning that the phase transition happens at a single critical value of the sampling ratio. However, it is indeed possible that $\alpha_{c,\min}<\alpha_{c,\max}$, in which case a finite number of correlated and uncorrelated phases alternate as $\alpha$ varies within the interval $(\alpha_{c,\min},\alpha_{c,\max})$. A concrete example demonstrating this more complicated situation can be found in Section IV-C.

The above phase transition phenomenon also has implications in terms of the computational complexity of the spectral method. In a correlated phase, there is a nonzero gap between the largest and the second largest eigenvalues of $\boldsymbol{D}_{m}$ . As a result, the leading eigenvector $\boldsymbol{x}_{1}$ can be efficiently computed by using power iterations on $\boldsymbol{D}_{m}$ . In contrast, within an uncorrelated phase, the gap of the eigenvalues converges to zero, making power iterations inefficient.
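To make the role of the eigengap concrete, here is a small sketch of plain power iteration applied to two toy diagonal matrices. The matrices, the iteration count, and the function name `power_iteration` are assumptions for illustration; they stand in for $\boldsymbol{D}_{m}$ in a phase with a sizeable gap and in a phase with a nearly vanishing gap.

```python
import numpy as np


def power_iteration(D, num_iters, rng):
    """Run num_iters power iterations on a symmetric matrix D."""
    x = rng.standard_normal(D.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x = D @ x
        x /= np.linalg.norm(x)
    return x


n, num_iters = 20, 100
rng = np.random.default_rng(2)

# Two toy spectra (assumed): top eigenvalue 2.0 versus 1.001, with all other
# eigenvalues equal to 1. The top eigenvector is e_1 in both cases.
D_gap = np.diag([2.0] + [1.0] * (n - 1))
D_nogap = np.diag([1.001] + [1.0] * (n - 1))

rho_gap = power_iteration(D_gap, num_iters, rng)[0] ** 2
rho_nogap = power_iteration(D_nogap, num_iters, rng)[0] ** 2
print(f"gap 1.0  : rho after {num_iters} iterations = {rho_gap:.4f}")
print(f"gap 0.001: rho after {num_iters} iterations = {rho_nogap:.4f}")
```

The error of power iteration contracts by roughly the ratio $\lambda_{2}/\lambda_{1}$ per step, so a ratio close to one leaves the iterate near its random starting point after the same number of steps.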

The rest of the paper is organized as follows. After precisely laying out the various technical assumptions, we present in Section II the main results of this work, stated as Theorem 1 and Proposition 1. Examples and numerical simulations are also provided there to demonstrate and verify these analytical results. In particular, as a worked example, we derive a universal closed-form expression for the limiting values $\rho(\alpha)$ for all acquisition models that generate one-bit $\left\{0,1\right\}$ measurements. We prove Theorem 1 in Section III. Key to our proof is a deterministic, fixed-point characterization of the squared cosine similarity $\rho(\boldsymbol{\xi},\boldsymbol{x}_{1})$ , which is valid for any finite dimension $n$ and for any deterministic sensing vectors $\left\{\boldsymbol{a}_{i}\right\}$ . When specialized to Gaussian measurements, this fixed-point characterization allows us to connect our problem to a generalized version of the spiked population model (see, e.g., [30, 31, 32]) studied in random matrix theory. In Section IV, we look more closely at the phase transition phenomenon predicted by our asymptotic results and prove Proposition 1. Section V concludes the paper with discussions on possible generalizations and improvements of our results as well as their connections to related work in the literature.

Notations: To study the high-dimensional limit of the spectral initialization method, we shall consider a sequence of problem instances, indexed by the ambient dimension $n$ . For each $n$ , we seek to estimate an underlying signal denoted by $\boldsymbol{\xi}_{n}\in\mathbb{R}^{n}$ . Formally, we should use $D_{m(n)}$ to denote the data matrix, where $m(n)$ is the number of measurements as a function of the dimension $n$ . However, to lighten the notation, we will simply write it as $\boldsymbol{D}_{m}$ , keeping the dependence of $m$ on $n$ implicit. $\boldsymbol{x}_{1}^{n}$ stands for a leading eigenvector of $\boldsymbol{D}_{m}$ . We use $\overset{\mathcal{P}}{\longrightarrow}$ and $\overset{\text{a.s.}}{\longrightarrow}$ to denote convergence in probability and almost sure convergence, respectively. Let $\boldsymbol{M}$ be a symmetric matrix. Its eigenvalues in descending order are written as $\lambda_{1}^{\boldsymbol{M}}\geq\lambda_{2}^{\boldsymbol{M}}\geq\ldots\geq\lambda_{n}^{\boldsymbol{M}}$ . In particular, $\lambda_{1}^{\boldsymbol{M}}$ , sometimes also written as $\lambda_{1}(\boldsymbol{M})$ , denotes the largest eigenvalue of $\boldsymbol{M}$ . Throughout the paper, $\boldsymbol{A}^{1/2}$ stands for the principal square root of a positive semi-definite symmetric matrix $\boldsymbol{A}$ . For any $a,b\in\mathbb{R}$ , we write $\max\left\{a,b\right\}$ as $a\vee b$ . Finally, $\mathds{1}_{x\in\mathcal{I}}$ stands for the indicator function of a set $\mathcal{I}$ .

II Main Results

II-A Technical Assumptions

In what follows, we first state the basic assumptions under which our results are proved.

(A.1) The sensing vectors are independent Gaussian random vectors. Specifically, let $(a_{ij})$, for $i,j\geq 1$, be a doubly infinite array of i.i.d. standard normal random variables. Then the $i$th sensing vector $\boldsymbol{a}_{i}=[a_{i1},a_{i2},\ldots,a_{in}]^{\top}$.

(A.2) $m=m(n)$ with $\alpha_{n}=m(n)/n\rightarrow\alpha>0$ as $n\rightarrow\infty$.

(A.3) $\lVert\boldsymbol{\xi}_{n}\rVert=\kappa>0$.

(A.4) Let $s,y$ and $z$ be three random variables such that

$$s\sim\mathcal{N}(0,1),\quad\mathbb{P}(y\,|\,s)=f(y\,|\,\kappa s),\quad\text{and}\quad z=\mathcal{T}(y), \tag{5}$$

where $f(\cdot\,|\,\cdot)$ is the conditional density function (1) associated with the observation model, and $\mathcal{T}(\cdot)$ is the preprocessing step used in the construction of $\boldsymbol{D}_{m}$ in (3). We shall assume that the probability measure of the random variable $z$ is supported within a finite interval $[0,\tau]$. Throughout the paper, we always take $\tau$ to be the tightest such upper bound.

(A.5) As $\lambda$ approaches $\tau$ from the right,

$$\lim_{\lambda\to\tau^{+}}\mathbb{E}\,\frac{z}{(\lambda-z)^{2}}=\lim_{\lambda\to\tau^{+}}\mathbb{E}\,\frac{zs^{2}}{\lambda-z}=\infty. \tag{6}$$

(A.6) The random variables $z$ and $s^{2}$ are positively correlated: $\text{cov}(z,s^{2})=\mathbb{E}\,zs^{2}-\mathbb{E}z\,\mathbb{E}s^{2}>0$, which is equivalent to

$$\mathbb{E}\,zs^{2}>\mathbb{E}\,z. \tag{7}$$

The last three assumptions require some explanations. First, we note that assumption (A.4) requires that $z$ should take values within a finite interval on the positive axis. This can be enforced by choosing a suitable function $\mathcal{T}(\cdot)$ . For example, in the problem of phase retrieval, the measurement model ( $y=s^{2}$ ) leads to unbounded $\left\{y_{i}\right\}$ . We can set

$$z=\mathcal{T}(y)=y\,\mathds{1}_{\lvert y\rvert\leq t^{2}}, \tag{8}$$

where $t>0$ is some parameter and $\mathds{1}_{\lvert y\rvert\leq t^{2}}$ denotes the indicator function for the condition $\lvert y\rvert\leq t^{2}$. This is indeed the trimming strategy proposed in [13]. As shown there, this boundedness condition on the support of $z$ is an essential ingredient in achieving linear sample complexities. The assumption that $z$ be nonnegative is largely made to simplify our analysis, but this restriction can be removed. In a recent work [33], Mondelli and Montanari extended our results by showing that the same asymptotic predictions presented in this paper still hold under cases where $z$ can take negative values. See also Remark 2 in Section II-B.

In assumption (A.5), the expressions in (6) essentially require that the random variable $z$ should have sufficient probability mass near the upper bound $\tau$ . Let $h(z)=\mathbb{E}_{s|z}(s^{2}|z)$ . We show in Appendix -A that (6) holds when there exist some positive constants $c_{0}$ and $\varepsilon$ such that the probability density function $p_{Z}(z)$ of $z$ and the conditional moment $h(z)$ are both bounded below by $c_{0}$ for all $z\in[\tau-\varepsilon,\tau]$ . The model in (8) represents one such case. Another sufficient condition for (6) to hold is when the law of $z$ has a point mass at $\tau$ . The acquisition models described in (27) and (28) in later sections are examples for which this condition is applicable.

The inequality in (A.6) is also a natural requirement. To see this, we note that the data matrix $\boldsymbol{D}_{m}$ in (3) is the sample average of $m$ i.i.d. random rank-one matrices $\left\{\mathcal{T}(y_{i})\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}\right\}_{i\leq m}$ . When the number of samples $m$ is large, this sample average should be “close” to the statistical expectation, i.e.,

$$\boldsymbol{D}_{m}\approx\mathbb{E}\,(z_{i}\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}), \tag{9}$$

where $z_{i}\overset{\text{def}}{=}\mathcal{T}(y_{i})$ . To compute the above expectation, it will be convenient to assume that the underlying signal $\boldsymbol{\xi}=\kappa\boldsymbol{e}_{1}$ , where $\boldsymbol{e}_{1}$ is the first vector of the canonical basis of $\mathbb{R}^{n}$ . (This assumption can be made without loss of generality, due to the rotational invariance of the multivariate normal distribution.) Correspondingly, we can partition each sensing vector into two parts, as

$$\boldsymbol{a}_{i}^{\top}=[s_{i}\ \ \boldsymbol{u}_{i}^{\top}], \tag{10}$$

so that $\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}=\kappa s_{i}$ and the conditional density of $y_{i}$ given $s_{i}$ is $f(y\,|\,\kappa s_{i})$ . Since $s_{i},y_{i}$ and $z_{i}$ are all independent of $\boldsymbol{u}_{i}$ ,

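$$\mathbb{E}\,(z_{i}\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top})=\begin{bmatrix}\mathbb{E}\,zs^{2}&\boldsymbol{0}\\ \boldsymbol{0}&\mathbb{E}z\,\boldsymbol{I}_{n-1}\end{bmatrix}, \tag{11}$$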

where $\boldsymbol{I}_{n-1}$ is the identity matrix of size $(n-1)$ . If the inequality $\mathbb{E}\,zs^{2}>\mathbb{E}z$ , as required in (A.6), indeed holds, the leading eigenvector of the expected matrix will be $\boldsymbol{e}_{1}$ , which is perfectly aligned with the target vector $\boldsymbol{\xi}$ . Now since the data matrix $\boldsymbol{D}_{m}$ is an approximation of the expectation, the sample eigenvector should also be an approximation of $\boldsymbol{\xi}$ .
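The block-diagonal structure of the expected matrix can be checked with a quick Monte Carlo experiment in the regime where $n$ is fixed and $m$ is large. This is only a sketch: the values of `n`, `m`, `kappa`, and `t`, as well as the noiseless phase-retrieval model, are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, kappa, t = 5, 200_000, 1.0, 3.0   # all values assumed for the demo

A = rng.standard_normal((m, n))         # rows are the sensing vectors a_i
s = A[:, 0]                             # xi = kappa * e_1, so a_i^T xi = kappa * s_i
y = (kappa * s) ** 2                    # noiseless phase retrieval (assumed model)
z = y * (np.abs(y) <= t ** 2)           # trimming as in (8)

D = (A.T * z) @ A / m                   # sample average of z_i a_i a_i^T
c_hat = float(np.mean(z * s ** 2))      # estimate of E[z s^2]
d_hat = float(np.mean(z))               # estimate of E[z]
expected = np.diag([c_hat] + [d_hat] * (n - 1))
max_dev = float(np.abs(D - expected).max())

leading = np.linalg.eigh(D)[1][:, -1]   # leading eigenvector of the sample matrix
print(np.round(D, 2))
print("max deviation from diag(c, d, ..., d):", max_dev)
print("|first coordinate of leading eigenvector|:", abs(leading[0]))
```

With $m/n$ this large the sample matrix is nearly diagonal and its leading eigenvector is nearly $\boldsymbol{e}_{1}$; the paper's point is that this stops being true when $m/n$ stays bounded.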

The above argument provides an intuitive but nonrigorous explanation for why the spectral initialization method would work. The approximation in (9) can be made exact if the signal dimension $n$ is kept fixed and the number of measurements $m$ goes to infinity. However, we consider the case when $m$ and $n$ both tend to infinity, at a constant ratio $\alpha=m/n$ bounded away from $0$ and $\infty$. In this regime, the approximation in (9) will not become an equality even if $m\rightarrow\infty$. As we will show, the correlation $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ between the target vector $\boldsymbol{\xi}_{n}$ and the sample eigenvector $\boldsymbol{x}_{1}^{n}$ will converge to a deterministic value $\rho(\alpha)$ that depends on the sampling ratio $\alpha$.

A notable exception to (7) is when

$$g(s)\overset{\text{def}}{=}\mathbb{E}_{z|s}(z\,|\,s) \tag{12}$$

is an odd function plus some arbitrary constant $C$ . In this case, $\mathbb{E}zs^{2}=\mathbb{E}[g(s)s^{2}]=C$ and $\mathbb{E}z=\mathbb{E}g(s)=C$ and thus (7) does not hold. In practice, this means that the spectral method will not be effective for acquisition models such as $z=\operatorname{sign}(s)+C$ . We will revisit this point in Section V where we describe an alternative initialization scheme that can handle such cases.
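The cancellation behind this exception is easy to confirm numerically. The sketch below uses Gauss-Hermite quadrature to evaluate $\mathbb{E}[g(s)s^{2}]$ and $\mathbb{E}[g(s)]$ for $s\sim\mathcal{N}(0,1)$. The helper `moments`, the constant `C = 1.5`, and the contrasting even function $s^{2}/(1+s^{2})$ are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes/weights for the weight exp(-s^2 / 2); dividing the
# weights by sqrt(2*pi) turns weighted sums into expectations under N(0, 1).
# An even number of nodes avoids a node at s = 0, where sign(s) is 0.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)


def moments(g):
    """Return (E[g(s) s^2], E[g(s)]) for s ~ N(0, 1), by quadrature."""
    gs = g(nodes)
    return float(np.sum(weights * gs * nodes ** 2)), float(np.sum(weights * gs))


C = 1.5  # arbitrary constant offset (assumed)
c_odd, d_odd = moments(lambda s: np.sign(s) + C)           # odd function plus C
c_even, d_even = moments(lambda s: s ** 2 / (1 + s ** 2))   # an even, bounded g

print(f"odd + C : E[g s^2] = {c_odd:.6f}, E[g] = {d_odd:.6f}")
print(f"even g  : E[g s^2] = {c_even:.6f}, E[g] = {d_even:.6f}")
```

For the odd-plus-constant case the two moments agree, so (7) fails; for the even example, which increases with $s^{2}$, the first moment is strictly larger.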

A final remark before we present our main results: Since the eigenvector $\boldsymbol{x}_{1}^{n}$ is always normalized, the spectral method cannot provide any information about the norm of $\boldsymbol{\xi}_{n}$. However, in many cases where the sensing vectors are drawn from certain random ensembles, there are simple methods to accurately estimate $\kappa=\lVert\boldsymbol{\xi}_{n}\rVert$. We provide some discussions on how to do this in Appendix -B.

II-B Main Results: Asymptotic Characterizations

In this section, we summarize the main results of our work on an asymptotic characterization of the spectral method with Gaussian measurements. To state our results, we first need to introduce several helper functions. Let $s,z$ be the random variables defined in (5). We consider two functions

$$\phi(\lambda)\overset{\text{def}}{=}\lambda\,\mathbb{E}\,\frac{zs^{2}}{\lambda-z} \tag{13}$$

and

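$$\psi_{\alpha}(\lambda)\overset{\text{def}}{=}\lambda\,\Big(1/\alpha+\mathbb{E}\,\frac{z}{\lambda-z}\Big), \tag{14}$$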

both defined on the open interval $(\tau,\infty)$ , where $\tau$ is the bound in assumption (A.4). Within their domains, it is easy to check that both functions are convex. In particular, $\psi_{\alpha}(\lambda)$ achieves its minimum at a unique point denoted by

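$$\overline{\lambda}_{\alpha}\overset{\text{def}}{=}\underset{\lambda>\tau}{\arg\min}\ \psi_{\alpha}(\lambda). \tag{15}$$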

Finally, let

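$$\zeta_{\alpha}(\lambda)\overset{\text{def}}{=}\psi_{\alpha}\big(\lambda\vee\overline{\lambda}_{\alpha}\big) \tag{16}$$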

be a modification of $\psi_{\alpha}(\lambda)$ . This new function is again defined for $\lambda\in(\tau,\infty)$ .
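The convexity claim can be verified by differentiating under the expectation. The following is a sketch, assuming the interchange of derivative and expectation is justified; for $\lambda>\tau$ this is plausible because $z\leq\tau$ keeps $\lambda-z$ bounded away from zero.

```latex
\begin{align*}
\phi'(\lambda) &= \mathbb{E}\,\frac{zs^{2}}{\lambda-z}
                 -\lambda\,\mathbb{E}\,\frac{zs^{2}}{(\lambda-z)^{2}}
                = -\,\mathbb{E}\,\frac{z^{2}s^{2}}{(\lambda-z)^{2}},
&\phi''(\lambda) &= 2\,\mathbb{E}\,\frac{z^{2}s^{2}}{(\lambda-z)^{3}}\;\geq\;0,\\[4pt]
\psi_{\alpha}'(\lambda) &= \frac{1}{\alpha}+\mathbb{E}\,\frac{z}{\lambda-z}
                 -\lambda\,\mathbb{E}\,\frac{z}{(\lambda-z)^{2}}
                = \frac{1}{\alpha}-\mathbb{E}\,\frac{z^{2}}{(\lambda-z)^{2}},
&\psi_{\alpha}''(\lambda) &= 2\,\mathbb{E}\,\frac{z^{2}}{(\lambda-z)^{3}}\;\geq\;0.
\end{align*}
```

Both second derivatives are nonnegative on $(\tau,\infty)$ since $z\geq 0$ and $\lambda-z>0$ there. The first line also shows that $\phi$ is nonincreasing, and the second that a stationary point of $\psi_{\alpha}$ must satisfy $1/\alpha=\mathbb{E}\,z^{2}/(\lambda-z)^{2}$.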

Theorem 1

Under (A.1) – (A.6), the following hold:

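1. There is a unique solution, denoted by $\lambda^{\ast}_{\alpha}$, to the equation

$$\zeta_{\alpha}(\lambda)=\phi(\lambda),\qquad\lambda>\tau. \tag{17}$$

2. As $n\rightarrow\infty$,

$$\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})\overset{\mathcal{P}}{\longrightarrow}\begin{cases}0,&\text{if }\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})<0,\\[2pt] \dfrac{\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})}{\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})-\phi^{\prime}(\lambda^{\ast}_{\alpha})},&\text{if }\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})>0,\end{cases} \tag{18}$$

where $\psi^{\prime}_{\alpha}(\cdot)$ and $\phi^{\prime}(\cdot)$ denote the derivatives of the two functions.

3. Let $\lambda_{1}^{\boldsymbol{D}_{m}}\geq\lambda_{2}^{\boldsymbol{D}_{m}}$ be the top two eigenvalues of $\boldsymbol{D}_{m}$. Then

$$\lambda_{1}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\zeta_{\alpha}(\lambda^{\ast}_{\alpha})\quad\text{and}\quad\lambda_{2}^{\boldsymbol{D}_{m}}\overset{\mathcal{P}}{\longrightarrow}\zeta_{\alpha}(\overline{\lambda}_{\alpha}) \tag{19}$$

as $n\rightarrow\infty$. Moreover, $\zeta_{\alpha}(\lambda^{\ast}_{\alpha})\geq\zeta_{\alpha}(\overline{\lambda}_{\alpha})$, with the inequality becoming strict if and only if $\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})>0$.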
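The quantities in Theorem 1 can be evaluated numerically once the joint law of $(s,z)$ is available. The sketch below is one possible way to do so and is not the authors' code: expectations are replaced by sample means over draws of $(s,z)$, the roots are found by bisection, and the derivative formulas $\psi^{\prime}_{\alpha}(\lambda)=1/\alpha-\mathbb{E}\,z^{2}/(\lambda-z)^{2}$ and $\phi^{\prime}(\lambda)=-\mathbb{E}\,z^{2}s^{2}/(\lambda-z)^{2}$ follow from differentiating the definitions. The function name `spectral_limit` and the one-bit model $z=\mathds{1}_{|s|>1}$ used in the demo are assumptions.

```python
import numpy as np


def spectral_limit(s, z, alpha, tau):
    """Evaluate the limits in Theorem 1 for an empirical sample of (s, z).

    Expectations are replaced by sample means; tau is the upper end of the
    support of z. Returns (rho, lambda_star, lambda_bar).
    """
    def phi(lam):
        return lam * np.mean(z * s ** 2 / (lam - z))

    def psi(lam):
        return lam * (1.0 / alpha + np.mean(z / (lam - z)))

    def dpsi(lam):   # psi'(lam) = 1/alpha - E[z^2 / (lam - z)^2]
        return 1.0 / alpha - np.mean(z ** 2 / (lam - z) ** 2)

    def dphi(lam):   # phi'(lam) = -E[z^2 s^2 / (lam - z)^2]
        return -np.mean(z ** 2 * s ** 2 / (lam - z) ** 2)

    def bisect(f, lo):
        """Root of an increasing function f on (lo, inf), assuming f(lo) < 0."""
        hi = lo + 1.0
        while f(hi) <= 0:                 # expand until the root is bracketed
            hi = lo + 2.0 * (hi - lo)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) <= 0 else (lo, mid)
        return 0.5 * (lo + hi)

    lo = tau + 1e-9
    lam_bar = bisect(dpsi, lo)            # minimizer of the convex function psi

    def zeta(lam):
        return psi(max(lam, lam_bar))

    lam_star = bisect(lambda lam: zeta(lam) - phi(lam), lo)   # solves (17)
    slope = dpsi(lam_star)
    rho = slope / (slope - dphi(lam_star)) if slope > 0 else 0.0
    return rho, lam_star, lam_bar


rng = np.random.default_rng(4)
s = rng.standard_normal(100_000)
z = (np.abs(s) > 1.0).astype(float)       # hypothetical one-bit model, tau = 1
for alpha in (0.5, 4.0):
    rho, lam_star, lam_bar = spectral_limit(s, z, alpha, tau=1.0)
    print(f"alpha = {alpha}: rho = {rho:.4f}, "
          f"lambda* = {lam_star:.4f}, lambda_bar = {lam_bar:.4f}")
```

Bisection is used because $\psi^{\prime}_{\alpha}$ and $\zeta_{\alpha}-\phi$ are both increasing on $(\tau,\infty)$, so each has a single sign change; the starting point just above $\tau$ relies on the divergence required in (A.5).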

Remark 1

The above theorem, whose proof is given in Section III, provides a complete asymptotic characterization of the performance of the spectral method. In particular, the theorem shows that the squared cosine similarity $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ converges in probability to a deterministic value in the high-dimensional limit. Moreover, there exists a generic phase transition phenomenon: depending on the sign of the derivative $\psi^{\prime}_{\alpha}(\cdot)$ at $\lambda^{\ast}_{\alpha}$, the limiting value can be either zero (i.e., the uncorrelated phase) or strictly positive (i.e., the correlated phase). The computational complexity of the spectral method is also very different in the two phases. Within the uncorrelated phase, the gap between the top two eigenvalues, $\lambda_{1}^{\boldsymbol{D}_{m}}$ and $\lambda_{2}^{\boldsymbol{D}_{m}}$, diminishes to zero, so that iterative algorithms such as power iterations converge increasingly slowly. In contrast, within the correlated phase, the spectral gap converges to a positive value.

Remark 2

The results of this work were first reported in [34, 35]. When this paper was under review, the results given in Theorem 1 were further extended by Mondelli and Montanari in [33]. In particular, these authors extended our asymptotic predictions from the real-valued case to the complex-valued case, and more importantly, they showed that the same predictions still hold under cases where the variable $z$ defined in assumption (A.4) can take negative values. See [33, Lemma 2] for details.

It will be more convenient to characterize the phase transitions predicted by Theorem 1 in terms of the sampling ratio $\alpha$ . To do so, we first introduce a set $\Lambda$ , containing all the zero-crossings of the function $\Delta(\lambda)=\lambda\,\mathbb{E}\frac{z}{(\lambda-z)^{2}}-\mathbb{E}\frac{zs^{2}}{\lambda-z}$ on the open interval $(\tau,\infty)$ . We can show that $\Lambda$ is always nonempty and that it contains a finite number of points (see Lemma 2 in Section IV-A). Let

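$$\lambda_{c,\min}\overset{\text{def}}{=}\min_{\lambda\in\Lambda}\lambda\quad\text{and}\quad\lambda_{c,\max}\overset{\text{def}}{=}\max_{\lambda\in\Lambda}\lambda$$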

denote the smallest and the largest elements in $\Lambda$ , respectively.

Proposition 1

Under (A.1) – (A.6), and as $n\rightarrow\infty$ ,

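$$\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})\overset{\mathcal{P}}{\longrightarrow}\begin{cases}0,&\text{if }\alpha<\alpha_{c,\min},\\ \rho(\alpha),&\text{if }\alpha>\alpha_{c,\max},\end{cases}$$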

where

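$$\alpha_{c,\min}^{-1}=\mathbb{E}\,\frac{z^{2}}{(\lambda_{c,\min}-z)^{2}},\qquad\alpha_{c,\max}^{-1}=\mathbb{E}\,\frac{z^{2}}{(\lambda_{c,\max}-z)^{2}},$$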

and $\rho(\alpha)$ is a function with the following parametric representation in terms of a parameter $\lambda$ :

[TABLE]

for all $\lambda>\lambda_{c,\max}$ . Moreover, $\rho(\alpha)\rightarrow 1$ as $\alpha\rightarrow\infty$ .
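One way to see where a parametric form of this kind comes from is sketched below. It is derived here from Theorem 1 rather than copied from the paper, so the paper's displayed formulas may be arranged differently. In the correlated phase $\lambda^{\ast}_{\alpha}>\overline{\lambda}_{\alpha}$, so $\zeta_{\alpha}(\lambda^{\ast}_{\alpha})=\psi_{\alpha}(\lambda^{\ast}_{\alpha})$ and equation (17) reads $\psi_{\alpha}(\lambda)=\phi(\lambda)$. Dividing by $\lambda$ gives $\alpha$ as a function of $\lambda$, and substituting $\psi^{\prime}_{\alpha}(\lambda)=1/\alpha-\mathbb{E}\,z^{2}/(\lambda-z)^{2}$ and $\phi^{\prime}(\lambda)=-\mathbb{E}\,z^{2}s^{2}/(\lambda-z)^{2}$ into (18) gives $\rho$:

```latex
\begin{align*}
\frac{1}{\alpha} &= \mathbb{E}\,\frac{z\,(s^{2}-1)}{\lambda-z},\\[4pt]
\frac{1}{\rho} &= 1-\frac{\phi'(\lambda)}{\psi_{\alpha}'(\lambda)}
 = 1+\frac{\mathbb{E}\,\dfrac{z^{2}s^{2}}{(\lambda-z)^{2}}}
          {\dfrac{1}{\alpha}-\mathbb{E}\,\dfrac{z^{2}}{(\lambda-z)^{2}}}.
\end{align*}
```

As a consistency check, at a critical point both $\psi^{\prime}_{\alpha}(\lambda)=0$ and $\psi_{\alpha}(\lambda)=\phi(\lambda)$ hold, which recovers $\alpha_{c}^{-1}=\mathbb{E}\,z^{2}/(\lambda_{c}-z)^{2}$ and the condition $\Delta(\lambda_{c})=0$ used to define the set $\Lambda$.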

Remark 3

In many of the signal acquisition models we have studied, the set $\Lambda$ contains exactly one element. In this case, $\lambda_{c,\min}=\lambda_{c,\max}$ and hence $\alpha_{c,\min}=\alpha_{c,\max}$. Consequently, the phase transition of the spectral method takes place at a single threshold value $\alpha_{c}$, which separates the uncorrelated phase from the correlated one. However, it is indeed possible to find cases for which $\alpha_{c,\min}<\alpha_{c,\max}$. This leads to a more complicated scenario, where a finite number of correlated and uncorrelated phases can alternate within the interval $(\alpha_{c,\min},\alpha_{c,\max})$. One such example is given in Section IV-C.

II-C Worked-Example: Binary Models

To illustrate the results presented above, we consider here a special case where $z_{i}$ takes only binary values $\left\{0,1\right\}$. This situation naturally appears in problems such as logistic regression and one-bit quantized sensing, where the measurements $y_{i}\in\left\{0,1\right\}$ and we can set $z_{i}=y_{i}$. For cases where the measurements $\left\{y_{i}\right\}$ are not necessarily binary, this type of one-bit model is still relevant whenever the preprocessing function $z=\mathcal{T}(y)$ generates binary outputs. The simplicity of this setting allows us to obtain closed-form expressions for the various quantities in Theorem 1 and Proposition 1.

To proceed, we first explicitly compute the functions $\phi(\lambda)$ and $\psi_{\alpha}(\lambda)$ defined in Section II-B as

[TABLE]

where

[TABLE]

and both functions are defined on the interval $\lambda>1$ . The minimum of $\psi_{\alpha}(\lambda)$ is achieved at $\overline{\lambda}_{\alpha}=1+\sqrt{\alpha d}$ , and thus

[TABLE]

Solving equation (17) and using (18), we get

[TABLE]

where $\alpha_{c}=\frac{d}{(c-d)^{2}}$ . (Note that this result can also be obtained by invoking the parametric characterization of $\rho(\alpha)$ given in Proposition 1.) Finally, the asymptotic predictions (19) for the top two eigenvalues can be computed as

[TABLE]

and

[TABLE]

for all $\alpha$ .

Remark 4

It is interesting to note that the asymptotic characterizations given in (24), (25) and (26) are universal, in the sense that they only depend on the two constants $c$ and $d$ defined in (23) but not on the exact details of the joint probability distributions of $s,y$ and $z$ . Thus, for one-bit models, it suffices to compute the constants in (23), which then completely determine the asymptotic performance of the spectral method.
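The universality noted in Remark 4 can be illustrated with a small numerical sketch. It assumes that the constants in (23) are $c=\mathbb{E}zs^{2}$ and $d=\mathbb{E}z$ (a reading consistent with $\overline{\lambda}_{\alpha}=1+\sqrt{\alpha d}$ and $\alpha_{c}=d/(c-d)^{2}$), and it uses a hypothetical one-bit model $z=\mathds{1}\{|s|>t\}$ with $\kappa=1$ that is chosen only for illustration. The function names, the dimension and the sampling ratios are likewise illustrative choices, not the settings used in the paper.

```python
import math
import numpy as np


def one_bit_constants(t):
    """Return (c, d) = (E[z s^2], E[z]) for z = 1{|s| > t} with s ~ N(0, 1)."""
    d = math.erfc(t / math.sqrt(2.0))                        # P(|s| > t)
    pdf_t = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    c = d + 2.0 * t * pdf_t                                  # integration by parts
    return c, d


def squared_cosine(n, alpha, t, rng):
    """Squared cosine between xi = e_1 and the top eigenvector of D_m."""
    m = int(round(alpha * n))
    A = rng.standard_normal((n, m))      # columns are the sensing vectors a_i
    s = A[0, :]                          # a_i^T xi for xi = e_1 (kappa = 1)
    z = (np.abs(s) > t).astype(float)    # hypothetical one-bit preprocessing
    D = (A * z) @ A.T / m                # (1/m) sum_i z_i a_i a_i^T
    _, V = np.linalg.eigh(D)             # eigenvalues in ascending order
    return float(V[0, -1] ** 2)


t = 1.5
c, d = one_bit_constants(t)
alpha_c = d / (c - d) ** 2               # closed-form threshold for one-bit models

rng = np.random.default_rng(0)
rho_below = squared_cosine(200, 0.3, t, rng)   # sampling ratio below alpha_c
rho_above = squared_cosine(200, 6.0, t, rng)   # sampling ratio above alpha_c
print(f"c={c:.4f}, d={d:.4f}, alpha_c={alpha_c:.4f}")
print(f"rho at alpha=0.3: {rho_below:.3f}, rho at alpha=6.0: {rho_above:.3f}")
```

At such a small dimension the transition is smoothed by finite-size effects, so the sketch is only meant to show the qualitative contrast between the two phases.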

II-D Numerical Simulations

Example 1 (Logistic regression)

Consider the case where $\left\{y_{i}\right\}$ are binary random variables generated according to the following conditional distribution:

[TABLE]

where $\beta$ is some constant. Let $z_{i}=\mathcal{T}(y_{i})=y_{i}$ . Since $z_{i}\in\left\{0,1\right\}$ , we just need to compute the constants $c$ and $d$ in (23), after which we can use the closed-form expressions (24), (25) and (26) to obtain the asymptotic predictions. In Figure 2(a) we compare the analytical prediction (24) of the squared cosine similarity with results of numerical simulations. In our experiment, we set the signal dimension to $n=4096$ . The norm of $\boldsymbol{\xi}_{n}$ is $\kappa=3$ , and $\beta=6$ . The sample averages and error bars (corresponding to one standard deviation) shown in the figure are calculated over 16 independent trials. We can see that the analytical predictions match numerical results very well. Figure 2(b) shows the top two eigenvalues. When $\alpha<\alpha_{c}$ , the two eigenvalues are asymptotically equal, but they start to diverge as $\alpha$ becomes larger than $\alpha_{c}$ . To clearly illustrate this phenomenon, we plot in the inset the eigengap $\lambda_{1}-\lambda_{2}$ as a function of $\alpha$ .

Example 2 (Phase retrieval)

In the second example, we consider the problem of phase retrieval, where

[TABLE]

Here, $\omega_{i}\sim_{\text{i.i.d.}}\mathcal{N}(0,1)$ and $\sigma\geq 0$ is the standard deviation of the noise. In [13], the authors show that it is important to omit large values of $\left\{y_{i}\right\}$ , and they propose to use the scheme in (8) when constructing the data matrix $\boldsymbol{D}_{m}$ . A different strategy can be found in [15], where the authors propose to use

[TABLE]

In what follows, we shall refer to (8) and (28) as the trimming algorithm and the subset algorithm, respectively. Figure 3(a) shows the asymptotic performance of these two algorithms and compares them with numerical results ( $n=4096$ and 16 independent trials). The performance of the subset algorithm (for which we choose the parameter $t=1.5$ ) can be characterized by the closed-form formula (24). The trimming algorithm (for which we use $t=3$ ) is more complicated as $z_{i}$ is no longer binary. We use the parametric characterization in Proposition 1 to obtain its asymptotic performance. Again, our analytical predictions match numerical results. The performance of both algorithms clearly depends on the choice of the thresholding parameter $t$ . To show this, we plot in Figure 3(b) the critical phase transition points $\alpha_{c}$ of both algorithms as functions of $t$ , at two different noise levels: $\sigma=0$ and $\sigma=2$ . This points to the possibility of using our analytical prediction to optimally tune the algorithmic parameters and, more generally, to optimize the functional form of the preprocessing function $\mathcal{T}(\cdot)$ . Indeed, the optimal design of $\mathcal{T}(\cdot)$ was obtained in a recent work [36], which leverages the asymptotic characterizations given here. Interestingly, under a mild technical condition, it is shown that there exists a simple fixed design that is uniformly optimal over all sampling ratios; see [36, Theorem 1].

III Proof of the Main Results

In this section, we prove Theorem 1, which provides an exact characterization of the asymptotic performance of the spectral method for signal estimation.

III-A Overview

We first rewrite the data matrix $\boldsymbol{D}_{m}$ in (3) as

$$\boldsymbol{D}_{m}=\frac{1}{m}\boldsymbol{A}\boldsymbol{Z}\boldsymbol{A}^{\top}, \tag{29}$$

where $\boldsymbol{A}=[\boldsymbol{a}_{1},\boldsymbol{a}_{2},\ldots,\boldsymbol{a}_{m}]$ is an $n\times m$ matrix of i.i.d. normal random variables and

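$$\boldsymbol{Z}\overset{\text{def}}{=}\operatorname{diag}\left\{z_{1},z_{2},\ldots,z_{m}\right\} \tag{30}$$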

is a diagonal matrix with entries $z_{i}=\mathcal{T}(y_{i})$ . Our goal boils down to studying the largest eigenvalue of $\boldsymbol{D}_{m}$ and the associated eigenvector $\boldsymbol{x}_{1}^{n}$ . To simplify notation, we shall first assume that $\boldsymbol{\xi}_{n}=\kappa\boldsymbol{e}_{1}$ , with $\boldsymbol{e}_{1}$ being the first vector in the canonical basis.

Remark 5

The non-null eigenvalues of $\boldsymbol{D}_{m}$ are equal to those of a companion matrix

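$$\widetilde{\boldsymbol{D}}_{m}\overset{\text{def}}{=}\frac{1}{m}\boldsymbol{Z}^{1/2}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{Z}^{1/2},$$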

which bears strong resemblance to a sample covariance matrix. Limiting spectral distributions (LSDs) of sample covariance matrices have been extensively studied in random matrix theory; see for instance [37] and the references given there. As a special case, when $\boldsymbol{Z}$ is the identity matrix, the LSD of $\widetilde{\boldsymbol{D}}_{m}$ is given by the classical Marčenko-Pastur law [38]. Results for more general diagonal matrices $\boldsymbol{Z}$ are also available [39]. However, in these studies, $\boldsymbol{Z}$ and $\boldsymbol{A}$ need to be independent. A challenge in our problem is that $\boldsymbol{Z}$ and $\boldsymbol{A}$ are correlated. To see this, we partition each sensing vector $\boldsymbol{a}_{i}$ into two parts as in (10). We can then write

$$\boldsymbol{A}=\begin{bmatrix}\boldsymbol{s}^{\top}\\ \boldsymbol{U}\end{bmatrix}, \tag{31}$$

where $\boldsymbol{s}\overset{\text{def}}{=}[s_{1},s_{2},\ldots,s_{m}]^{\top}$ is an $m$ -dimensional Gaussian random vector, and $\boldsymbol{U}$ is an $(n-1)\times m$ matrix consisting of i.i.d. standard normal random variables. Since $\boldsymbol{\xi}_{n}=\kappa\boldsymbol{e}_{1}$ , the diagonal elements of $\boldsymbol{Z}$ are independent of $\boldsymbol{U}$ but they do depend on $\boldsymbol{s}$ through $y_{i}\sim f(y\,|\,\kappa s_{i})$ . Consequently, we cannot apply existing results on the LSD of sample covariance matrices to our case.
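The eigenvalue correspondence mentioned in Remark 5 is easy to check numerically. The sketch below assumes that the companion matrix has the form $\frac{1}{m}\boldsymbol{Z}^{1/2}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{Z}^{1/2}$ , which is the natural sample-covariance-like counterpart of $\frac{1}{m}\boldsymbol{A}\boldsymbol{Z}\boldsymbol{A}^{\top}$ . The weights $z_{i}$ are drawn uniformly on $[0,1]$ purely for illustration, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 12
A = rng.standard_normal((n, m))          # columns play the role of a_i
z = rng.uniform(0.0, 1.0, size=m)        # hypothetical nonnegative weights z_i

# Data matrix D = (1/m) A Z A^T, an n x n matrix.
D = (A * z) @ A.T / m

# Assumed companion matrix (1/m) Z^{1/2} A^T A Z^{1/2}, an m x m matrix.
sqrt_z = np.sqrt(z)
D_tilde = (sqrt_z[:, None] * (A.T @ A) * sqrt_z[None, :]) / m

eig_D = np.sort(np.linalg.eigvalsh(D))[::-1]            # descending order
eig_D_tilde = np.sort(np.linalg.eigvalsh(D_tilde))[::-1]

# The n largest eigenvalues agree; the remaining m - n are numerically zero.
print(np.max(np.abs(eig_D - eig_D_tilde[:n])))
print(np.max(np.abs(eig_D_tilde[n:])))
```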

Our proof of Theorem 1 consists of two main ingredients. First, we will show in Proposition 2 that $\lambda_{1}^{\boldsymbol{D}_{m}}$ and $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ can be obtained from a fixed-point equation involving a function $L_{m}(\mu)$ , to be defined in (36), where $\mu>0$ is an auxiliary variable. The main benefit of introducing the variable $\mu$ and the function $L_{m}(\mu)$ is that, for each $\mu>0$ , the above-mentioned correlation between $\boldsymbol{A}$ and $\boldsymbol{Z}$ can be effectively decoupled. This then allows us to obtain the second ingredient of our proof: using results from random matrix theory [40, 32], we show in Section III-C that $L_{m}(\mu)$ , under the assumption of Gaussian sensing vectors, will converge almost surely to a deterministic limit function as the dimension $n\rightarrow\infty$ (see Proposition 4).

III-B A Fixed-Point Characterization

By substituting (31) into (29), we can write $\boldsymbol{D}_{m}$ in a more compact block-partitioned form as

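$$\boldsymbol{D}_{m}=\begin{bmatrix}a_{m}&\boldsymbol{q}_{m}^{\top}\\ \boldsymbol{q}_{m}&\boldsymbol{P}_{m}\end{bmatrix}, \tag{32}$$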

where

$$a_{m}\overset{\text{def}}{=}\frac{1}{m}\sum_{i=1}^{m}z_{i}s_{i}^{2} \tag{33}$$

is a scalar that converges to $\mathbb{E}zs^{2}$ as $m\rightarrow\infty$ ,

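$$\boldsymbol{P}_{m}\overset{\text{def}}{=}\frac{1}{m}\boldsymbol{U}\boldsymbol{Z}\boldsymbol{U}^{\top} \tag{34}$$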

is a symmetric matrix, and

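$$\boldsymbol{q}_{m}\overset{\text{def}}{=}\frac{1}{m}\boldsymbol{U}\boldsymbol{Z}\boldsymbol{s}. \tag{35}$$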

Next, we consider a parametric family of matrices $\left\{\boldsymbol{P}_{m}+\mu\hskip 0.5pt\boldsymbol{q}_{m}\boldsymbol{q}_{m}^{\top}\mathrel{\mathop{\mathchar 58\relax}}{\mu>0}\right\}$ , and let $L_{m}(\mu)$ denote their largest eigenvalues, i.e.,

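$$L_{m}(\mu)\overset{\text{def}}{=}\lambda_{1}(\boldsymbol{P}_{m}+\mu\hskip 0.5pt\boldsymbol{q}_{m}\boldsymbol{q}_{m}^{\top}). \tag{36}$$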

In what follows, we show how to compute $\lambda_{1}^{\boldsymbol{D}_{m}}$ and $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ via a fixed-point equation involving $L_{m}(\mu)$ . Since we assume that $\boldsymbol{\xi}=\kappa\,\boldsymbol{e}_{1}$ and that the leading eigenvector $\boldsymbol{x}_{1}^{n}$ is normalized, the quantity $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ is equal to $(\boldsymbol{e}_{1}^{\top}\boldsymbol{x}_{1}^{n})^{2}$ , the squared magnitude of the first element of the eigenvector.

Our discussions below are general and they apply to any block-partitioned matrix in the form

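$$\boldsymbol{D}=\begin{bmatrix}a&\boldsymbol{q}^{\top}\\ \boldsymbol{q}&\boldsymbol{P}\end{bmatrix}. \tag{37}$$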

Its components $a\in\mathbb{R}$ , $\boldsymbol{P}\in\mathbb{R}^{(n-1)\times(n-1)}$ and $\boldsymbol{q}\in\mathbb{R}^{n-1}$ can be arbitrarily chosen, not necessarily defined as in (33), (34) and (35). Our only requirements are that $\boldsymbol{P}$ is a symmetric matrix and that $\mathinner{\!\left\lVert\boldsymbol{q}\right\rVert}\neq 0$ .

Let $\lambda_{1}^{\boldsymbol{P}}\geq\lambda_{2}^{\boldsymbol{P}}\geq\ldots\geq\lambda_{n-1}^{\boldsymbol{P}}$ be the set of eigenvalues of $\boldsymbol{P}$ , and let $\boldsymbol{w}_{1},\boldsymbol{w}_{2},\ldots,\boldsymbol{w}_{n-1}$ be a corresponding set of orthonormal eigenvectors. Consider a function

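$$R(\lambda)\overset{\text{def}}{=}\boldsymbol{q}^{\top}(\boldsymbol{P}-\lambda\boldsymbol{I})^{-1}\boldsymbol{q}=\sum_{i=1}^{n-1}\frac{(\boldsymbol{w}_{i}^{\top}\boldsymbol{q})^{2}}{\lambda_{i}^{\boldsymbol{P}}-\lambda}, \tag{38}$$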

which has poles on those eigenvalues for which $\boldsymbol{w}_{i}^{\top}\boldsymbol{q}\neq 0$ . In what follows, we restrict the domain of $R(\lambda)$ to

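$$\lambda\in\left(\max\left\{\lambda_{i}^{\boldsymbol{P}}:\boldsymbol{w}_{i}^{\top}\boldsymbol{q}\neq 0\right\},\,\infty\right).$$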

Within this open interval, $R(\lambda)$ is a well-defined smooth function. It increases monotonically from $-\infty$ to $0$ , and thus it admits a functional inverse, denoted by $R^{-1}(x)$ , for all $x<0$ . Similar to (36), we define

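$$L(\mu)\overset{\text{def}}{=}\lambda_{1}(\boldsymbol{P}+\mu\hskip 0.5pt\boldsymbol{q}\boldsymbol{q}^{\top})$$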

for all $\mu>0$ .

Lemma 1

Let $\boldsymbol{P}$ be a symmetric matrix and $\boldsymbol{q}$ a nonzero vector. Then, for each $\mu>0$ ,

$$L(\mu)=\max\left\{\lambda_{1}^{\boldsymbol{P}},\,R^{-1}(-1/\mu)\right\}. \tag{39}$$

Moreover, $L(\mu)$ is a nondecreasing convex function with $\lim_{\mu\rightarrow\infty}L(\mu)=\infty$ . It is differentiable everywhere on $(0,\infty)$ except at (up to) one point.

Proof:

Since $\boldsymbol{P}$ is diagonalizable by an orthonormal matrix, we can assume without loss of generality that $\boldsymbol{P}$ is a diagonal matrix. In this case, we can simply write $R(\lambda)=\sum_{i}\frac{q_{i}^{2}}{\lambda_{i}^{\boldsymbol{P}}-\lambda}$ , and this function is defined on the open interval $(\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}q_{i}\neq 0\right\},\infty)$ .

Using the matrix determinant lemma [41], we can compute the characteristic polynomial of $\boldsymbol{P}+\mu\hskip 0.5pt\boldsymbol{q}\boldsymbol{q}^{\top}$ as

[TABLE]

In (40), $\operatorname{adj}(\cdot)$ stands for the adjugate of a matrix. To reach (41), we have used the fact that, for any diagonal matrix $\boldsymbol{A}=\operatorname{diag}\left\{d_{1},d_{2},\ldots,d_{n-1}\right\}$ , $\operatorname{adj}(\boldsymbol{A})=\operatorname{diag}\left\{\prod_{j\neq 1}d_{j},\prod_{j\neq 2}d_{j},\ldots,\prod_{j\neq n-1}d_{j}\right\}$ .

Partition the set $\left\{1,2,\ldots,n-1\right\}$ into two subsets:

[TABLE]

We observe that the characteristic polynomial can be factored into $c(\lambda)=c_{1}(\lambda)c_{2}(\lambda)$ , where $c_{2}(\lambda)=\prod_{i\in\mathcal{I}_{2}}(\lambda-\lambda_{i}^{\boldsymbol{P}})$ and

[TABLE]

It is possible that the second subset $\mathcal{I}_{2}$ is empty, in which case $c_{2}(\lambda)$ is understood to be equal to $1$ , but $\mathcal{I}_{1}$ is never empty, since $\boldsymbol{q}\neq\boldsymbol{0}$ . Next, we study the largest root of the polynomial $c_{1}(\lambda)$ . For any $\lambda>\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{1}\right\}$ , we can write

[TABLE]

Recall that $R(\lambda)$ is the function defined in (38) and $R^{-1}(\cdot)$ is its functional inverse. It follows from (43) that $R^{-1}(-1/\mu)$ is the only root of $c_{1}(\lambda)$ in the interval $(\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{1}\right\},\infty)$ , and therefore it is also the largest root. Due to the factorization $c(\lambda)=c_{1}(\lambda)c_{2}(\lambda)$ , we have

[TABLE]

Finally, since $R^{-1}(-1/\mu)>\max\left\{\lambda_{i}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{1}\right\}$ , we reach the formula in (39).

By construction, $R^{-1}(-1/\mu)$ is strictly increasing and $\lim_{\mu\rightarrow\infty}R^{-1}(-1/\mu)=\infty$ . It is also differentiable everywhere on $(0,\infty)$ . It follows that $L(\mu)$ is nondecreasing with $L(\infty)=\infty$ , and that the function is differentiable everywhere except for at most one point $\mu_{0}$ , which, if it exists, must satisfy the identity $R^{-1}(-1/\mu_{0})=\lambda_{1}^{\boldsymbol{P}}$ . Finally, the convexity of $L(\mu)$ follows from the fact that it is the maximum of a set of linear functions, as $L(\mu)=\lambda_{1}(\boldsymbol{P}+\mu\hskip 0.5pt\boldsymbol{q}\boldsymbol{q}^{\top})=\max_{\boldsymbol{x}\mathrel{\mathop{\mathchar 58\relax}}\,\mathinner{\!\left\lVert\boldsymbol{x}\right\rVert}=1}\boldsymbol{x}^{\top}(\boldsymbol{P}+\mu\hskip 0.5pt\boldsymbol{q}\boldsymbol{q}^{\top})\boldsymbol{x}$ . ∎
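The statement of Lemma 1 can be checked numerically with the following sketch. It assumes, following the proof above, that (39) reads $L(\mu)=\max\{\lambda_{1}^{\boldsymbol{P}},R^{-1}(-1/\mu)\}$ . The matrix $\boldsymbol{P}$ , the vector $\boldsymbol{q}$ and the helper names are illustrative choices. The vector $\boldsymbol{q}$ is built to be orthogonal to the top eigenvector of $\boldsymbol{P}$ , so that both branches of the maximum are exercised.

```python
import numpy as np

rng = np.random.default_rng(4)

# Symmetric P with known spectrum; q has no component along the top eigenvector.
spectrum = np.array([3.0, 1.0, 0.5, -0.2, -1.0])
W, _ = np.linalg.qr(rng.standard_normal((5, 5)))      # random orthonormal basis
P = W @ np.diag(spectrum) @ W.T
q = W @ np.array([0.0, 0.8, 0.5, 0.0, 0.3])

eigvals, vecs = np.linalg.eigh(P)
proj_sq = (vecs.T @ q) ** 2
active = proj_sq > 1e-12                 # eigen-directions with w_i^T q != 0
poles, weights = eigvals[active], proj_sq[active]


def R(lam):
    """R(lam) = sum_i (w_i^T q)^2 / (lambda_i - lam), summed over active poles."""
    return float(np.sum(weights / (poles - lam)))


def R_inverse(x, iters=200):
    """Solve R(lam) = x for x < 0 on (largest active pole, infinity) by bisection."""
    lo = float(np.max(poles))
    step = 1.0
    while R(lo + step) < x:              # R increases to 0, so this loop terminates
        step *= 2.0
    hi = lo + step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if R(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


errors = []
for mu in [0.5, 2.0, 5.0, 20.0]:
    L_direct = np.linalg.eigvalsh(P + mu * np.outer(q, q))[-1]
    L_formula = max(eigvals[-1], R_inverse(-1.0 / mu))
    errors.append(abs(L_direct - L_formula))
    print(mu, L_direct, L_formula)
```

For the two smaller values of $\mu$ the maximum is attained by $\lambda_{1}^{\boldsymbol{P}}=3$ , while for the two larger values it is attained by $R^{-1}(-1/\mu)$ . The switch between the two branches is the single point where $L(\mu)$ may fail to be differentiable.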

Given a block-partitioned matrix $\boldsymbol{D}$ , the following proposition shows that its leading eigenvalue $\lambda_{1}^{\boldsymbol{D}}$ and the squared cosine similarity $(\boldsymbol{e}_{1}^{\top}\boldsymbol{x}_{1})^{2}$ can be obtained from the function $L(\mu)$ .

Proposition 2

Let $\mu^{\ast}>0$ be the unique solution to the fixed-point equation

[TABLE]

Then, $\lambda_{1}^{\boldsymbol{D}}=L(\mu^{\ast})$ and

[TABLE]

where $\partial_{-}L(\mu)$ and $\partial_{+}L(\mu)$ denote the left and right derivatives of $L(\mu)$ , respectively. In particular, if $L(\mu)$ is differentiable at $\mu^{\ast}$ , then

[TABLE]

Remark 6

We prove this result in Appendix -C. Note that (44) is equivalent to

$$L(\mu)=a+\frac{1}{\mu}. \tag{47}$$

Since $L(\mu)$ is nondecreasing with $L(\infty)=\infty$ whereas $a+1/\mu$ decreases monotonically from $\infty$ to $a$ , the equation (47), and thus (44), always admits one and only one solution. Moreover, by Lemma 1, $L(\mu)$ is a convex function, and therefore its left and right derivatives always exist.
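The fixed-point characterization can be verified on a small random instance. The sketch below solves $L(\mu)=a+1/\mu$ by bisection and compares $L(\mu^{\ast})$ with the leading eigenvalue of $\boldsymbol{D}$ . For the eigenvector it uses the expression $(\boldsymbol{e}_{1}^{\top}\boldsymbol{x}_{1})^{2}=L^{\prime}(\mu^{\ast})/\big(L^{\prime}(\mu^{\ast})+(1/\mu^{\ast})^{2}\big)$ , which we derive here directly from the eigenvalue equations of the block matrix for the differentiable case; it is an assumption that this coincides with the form displayed in the proposition. The derivative is evaluated as $L^{\prime}(\mu)=(\boldsymbol{u}^{\top}\boldsymbol{q})^{2}$ , with $\boldsymbol{u}$ the unit top eigenvector of $\boldsymbol{P}+\mu\boldsymbol{q}\boldsymbol{q}^{\top}$ (first-order perturbation of a simple eigenvalue). All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
a = 0.7
G = rng.standard_normal((n - 1, n - 1))
P = (G + G.T) / 2.0                       # arbitrary symmetric block
q = rng.standard_normal(n - 1)            # arbitrary nonzero vector


def top_eig(mu):
    """Largest eigenvalue and unit eigenvector of P + mu * q q^T."""
    w, V = np.linalg.eigh(P + mu * np.outer(q, q))
    return w[-1], V[:, -1]


def g(mu):
    """g(mu) = L(mu) - a - 1/mu, an increasing function of mu > 0."""
    return top_eig(mu)[0] - a - 1.0 / mu


# Bracket the unique root of g and refine it by bisection.
lo, hi = 1e-8, 1.0
while g(hi) < 0.0:
    hi *= 2.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if g(mid) < 0.0:
        lo = mid
    else:
        hi = mid
mu_star = 0.5 * (lo + hi)

# Direct eigen-decomposition of the block matrix D = [[a, q^T], [q, P]].
D = np.block([[np.array([[a]]), q[None, :]], [q[:, None], P]])
w, V = np.linalg.eigh(D)
lam1, x1 = w[-1], V[:, -1]

L_star, u = top_eig(mu_star)
dL = (u @ q) ** 2                         # L'(mu_star) via first-order perturbation
rho_formula = dL / (dL + 1.0 / mu_star ** 2)
rho_direct = x1[0] ** 2

print(lam1, L_star)
print(rho_direct, rho_formula)
```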

III-C Asymptotic Limit of $L_{m}(\mu)$

The characterization given in Proposition 2 is valid for any block-partitioned matrix in the form of (37). When applied to the specific case of our data matrix in (32), with its components $a_{m}$ , $\boldsymbol{P}_{m}$ and $\boldsymbol{q}_{m}$ defined as in (33), (34) and (35), this result provides a very general deterministic characterization of the performance of the spectral method that is valid for any finite dimension $n$ and for any sensing vectors.

Next, we specialize to the case of i.i.d. Gaussian sensing vectors and show that $L_{m}(\mu)$ converges almost surely to a deterministic function as $m,n\rightarrow\infty$ . To that end, we note that $L_{m}(\mu)$ is the leading eigenvalue of

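$$\boldsymbol{P}_{m}+\mu\hskip 0.5pt\boldsymbol{q}_{m}\boldsymbol{q}_{m}^{\top}=\frac{1}{m}\boldsymbol{U}\boldsymbol{M}_{m}\boldsymbol{U}^{\top}, \tag{48}$$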

where

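$$\boldsymbol{M}_{m}\overset{\text{def}}{=}\boldsymbol{Z}+\frac{\mu}{m}\boldsymbol{Z}\boldsymbol{s}\boldsymbol{s}^{\top}\boldsymbol{Z}$$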

is a rank-one perturbation of the diagonal matrix $\boldsymbol{Z}$ given in (30). Since $\boldsymbol{U}$ and $\boldsymbol{M}_{m}$ are independent, we first study the spectrum of $\boldsymbol{M}_{m}$ .

Let $\lambda_{1}^{\boldsymbol{M}_{m}}\geq\lambda_{2}^{\boldsymbol{M}_{m}}\geq\ldots\geq\lambda_{m}^{\boldsymbol{M}_{m}}$ be the set of eigenvalues of $\boldsymbol{M}_{m}$ in descending order. Let

[TABLE]

be the empirical spectral measure of the last $m-1$ eigenvalues.

Proposition 3

Fix $\mu>0$ . As $m,n\rightarrow\infty$ , the empirical spectral measure $f^{\boldsymbol{M}_{m}}(\lambda)$ converges almost surely to the probability law of the random variable $z$ . Meanwhile,

$$\lambda_{1}^{\boldsymbol{M}_{m}}\overset{\text{a.s.}}{\longrightarrow}Q^{-1}(1/\mu), \tag{50}$$

where $Q^{-1}(\cdot)$ is the functional inverse of the function

$$Q(\lambda)\overset{\text{def}}{=}\mathbb{E}\frac{z^{2}s^{2}}{\lambda-z}. \tag{51}$$

The domain of $Q(\lambda)$ is the open interval $(\tau,\infty)$ , with $\tau$ being the upper bound of the support of the probability law of $z$ .

Remark 7

By construction, $Q(\lambda)$ is a continuous and strictly decreasing function with $Q(\infty)=0$ . Assumption (A.5) further guarantees that $\lim_{\lambda\rightarrow\tau^{+}}Q(\lambda)=\infty$ . Thus, $Q(\lambda)$ admits a functional inverse and $Q^{-1}(1/\mu)$ is well-defined for all $\mu>0$ .
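The spike-plus-bulk structure described in Proposition 3 can be observed in a quick simulation. The sketch below rests on two assumptions that are not spelled out in the text above: that $\boldsymbol{M}_{m}=\boldsymbol{Z}+\frac{\mu}{m}\boldsymbol{Z}\boldsymbol{s}\boldsymbol{s}^{\top}\boldsymbol{Z}$ , and that $Q(\lambda)=\mathbb{E}\frac{z^{2}s^{2}}{\lambda-z}$ . Both follow from substituting the block components into $\boldsymbol{P}_{m}+\mu\boldsymbol{q}_{m}\boldsymbol{q}_{m}^{\top}$ . The one-bit choice $z=\mathds{1}\{|s|>t\}$ is hypothetical; for it, $\tau=1$ and $Q(\lambda)=\mathbb{E}(zs^{2})/(\lambda-1)$ , so that $Q^{-1}(1/\mu)=1+\mu\,\mathbb{E}(zs^{2})$ .

```python
import math
import numpy as np

rng = np.random.default_rng(3)
m, mu, t = 1500, 2.0, 1.5

s = rng.standard_normal(m)
z = (np.abs(s) > t).astype(float)         # hypothetical one-bit z_i, so tau = 1
v = z * s                                 # the vector Z s

# Assumed form of M_m: a rank-one perturbation of the diagonal matrix Z.
M = np.diag(z) + (mu / m) * np.outer(v, v)
eigs = np.linalg.eigvalsh(M)              # ascending order
spike, bulk_top = eigs[-1], eigs[-2]

# Limit predicted by Q^{-1}(1/mu) = 1 + mu * E[z s^2] for this binary model.
pdf_t = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
c = math.erfc(t / math.sqrt(2.0)) + 2.0 * t * pdf_t   # E[z s^2]
spike_limit = 1.0 + mu * c

print(f"largest eigenvalue {spike:.4f}, predicted limit {spike_limit:.4f}")
print(f"second largest eigenvalue {bulk_top:.4f} (bulk edge tau = 1)")
```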

According to assumption (A.4) stated in Section II-A, the law of $z$ is supported within the interval $[0,\tau]$ . The above proposition, whose proof can be found in Appendix -D, shows that the spectrum of $\boldsymbol{M}_{m}$ consists of two parts: a “bulk spectrum” of $m-1$ eigenvalues supported within $[0,\tau]$ and a single spiked eigenvalue $\lambda_{1}^{\boldsymbol{M}_{m}}$ well separated from the bulk. This setting is a generalization of the classical spiked population model [30]. Adapting the results given in [32] (see also [40] for related results under more general settings), we thus reach the second important ingredient of our proof of Theorem 1, characterizing the asymptotic limit of $L_{m}(\mu)$ .

Proposition 4

For each fixed $\mu>0$ ,

$$L_{m}(\mu)\overset{\text{a.s.}}{\longrightarrow}\zeta_{\alpha}(Q^{-1}(1/\mu)), \tag{52}$$

where $\zeta_{\alpha}(\cdot)$ is the function defined in (16) and $Q^{-1}(1/\mu)$ is the limit value in (50).

Proof:

Recall from (48) that $L_{m}(\mu)$ is the leading eigenvalue of $\tfrac{1}{m}\boldsymbol{U}\boldsymbol{M}_{m}\boldsymbol{U}^{\top}$ . Since $\boldsymbol{U}$ and $\boldsymbol{M}_{m}$ are independent, and since $\boldsymbol{U}$ is a Gaussian random matrix with a rotationally invariant distribution, we can equivalently study the leading eigenvalue of the following matrix

[TABLE]

Proposition 3 shows that $\left\{\lambda_{i}^{\boldsymbol{M}}\mathrel{\mathop{\mathchar 58\relax}}i\geq 2\right\}$ form a bulk spectrum, which converges to the law of $z$ as $m\rightarrow\infty$ , whereas $\lambda_{1}^{\boldsymbol{M}}$ converges to a “spike” $\lambda_{\mu}=Q^{-1}(1/\mu)>\tau$ , which is separated from the bulk.

The asymptotic limits of extreme sample eigenvalues of matrices in the form of (53) have been studied in [40, 32]. In our proof, we use the asymptotic characterization given in [32]. Key to this asymptotic analysis is the function $\psi_{\alpha}(\lambda)$ defined in (14). (Footnote 1: We have adapted the original definition of $\psi_{\alpha}(\lambda)$ in [32, eq. (3.2)] because our matrix in (53) has a slightly different scaling from the one considered in [32].) The asymptotic behaviors of the leading sample eigenvalue turn out to depend on the sign of $\psi^{\prime}_{\alpha}(\lambda)$ at the point $\lambda_{\mu}$ :

In particular, applying [32, Theorem 4.1], we have

[TABLE]

The case when $\psi^{\prime}_{\alpha}(\lambda_{\mu})\leq 0$ is covered in [32, Theorem 4.2]. Adapting that result to our specific setting, we have

[TABLE]

As an equivalent form, we can write $\psi_{\alpha}(\lambda)=\mathbb{E}z+\frac{\lambda}{\alpha}+\mathbb{E}\frac{z^{2}}{\lambda-z}$ . From this, we can easily check that $\psi_{\alpha}(\lambda)$ is a convex function and that it admits a unique minimum within its domain $(\tau,\infty)$ . It follows that the two separate cases in (54) and (55) can be more compactly written as $L_{m}(\mu)\overset{\text{a.s.}}{\longrightarrow}\zeta_{\alpha}(\lambda_{\mu})$ , where $\zeta_{\alpha}(\cdot)$ is the modified function defined in (16). ∎

III-D Proof of Theorem 1

We are now ready to prove our asymptotic characterizations given in Theorem 1. Since the sensing vectors $\boldsymbol{a}_{i}$ are drawn from the rotationally invariant multivariate normal distribution, the quantity $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})$ for a general vector $\boldsymbol{\xi}_{n}$ (with $\mathinner{\!\left\lVert\boldsymbol{\xi}_{n}\right\rVert}=\kappa$ ) and $\rho(\kappa\boldsymbol{e}_{1},\boldsymbol{x}_{1}^{n})$ for the special case $\boldsymbol{\xi}_{n}=\kappa\boldsymbol{e}_{1}$ have exactly the same probability distribution. In what follows, we will carry out the proof by assuming that the target vector $\boldsymbol{\xi}_{n}=\kappa\boldsymbol{e}_{1}$ . By showing that $\rho(\kappa\boldsymbol{e}_{1},\boldsymbol{x}_{1}^{n})$ converges to the right-hand side of (18) almost surely, the convergence to the same limit in probability for a general $\boldsymbol{\xi}_{n}$ then follows as an immediate consequence.

To start, we use the deterministic characterization given in Proposition 2. For each $m\geq 1$ , let $\mu_{m}$ be the unique fixed-point of (44). Equivalently, $\mu_{m}$ satisfies the identity

[TABLE]

By Proposition 4, for every fixed $\mu$ ,

[TABLE]

as $m\rightarrow\infty$ . Since $L_{m}(\mu)$ and $\zeta_{\alpha}(\mu)$ are nondecreasing, the two functions on both sides of (56) are strictly increasing. This condition, together with the fact that $a_{m}\overset{\text{a.s.}}{\longrightarrow}\mathbb{E}{zs^{2}}$ , allows us to apply Lemma 3 in Appendix -E to conclude $\mu_{m}\overset{\text{a.s.}}{\longrightarrow}\mu^{\ast}$ , where $\mu^{\ast}$ is the unique point such that

[TABLE]

To determine the asymptotic behavior of the leading eigenvector $\boldsymbol{x}_{1}^{n}$ , we use the characterization given in (45). Since $\left\{L_{m}(\mu)\right\}$ are convex functions, we apply Lemma 4 in Appendix -E. In particular, if $\zeta_{\alpha}(Q^{-1}(1/\mu))$ is differentiable at $\mu=\mu^{\ast}$ , that lemma gives us

[TABLE]

and similarly

[TABLE]

Substituting these limits into (45), we get

[TABLE]

To simplify the above expression, we introduce a change of variable, writing $\lambda=Q^{-1}(1/\mu)$ . In particular, $\lambda^{\ast}=Q^{-1}(1/\mu^{\ast})$ . Using the characterization (57) and recalling the definition of $Q(\lambda)$ in (51), we get

$$\zeta_{\alpha}(\lambda^{\ast})=\phi(\lambda^{\ast}), \tag{59}$$

where $\phi(\cdot)$ is defined in (13). By their constructions, it is easily checked that $\zeta_{\alpha}(\lambda)$ is a nondecreasing continuous function on $(\tau,\infty)$ whereas $\phi(\lambda)$ is a strictly decreasing continuous function. Moreover, by assumption (A.5), $\lim_{\lambda\rightarrow\tau^{+}}\phi(\lambda)=\infty$ . Thus, the existence of $\lambda^{\ast}$ satisfying (59) and its uniqueness are guaranteed. Substituting $\lambda^{\ast}=Q^{-1}(1/\mu^{\ast})$ into (58) gives us

[TABLE]

where we have also used the fact that $Q^{\prime}(\lambda)=\phi^{\prime}(\lambda)$ . To reach the characterization (18) given in the theorem, we just need to note that, by its definition in (16), $\zeta^{\prime}_{\alpha}(\lambda)=\psi^{\prime}_{\alpha}(\lambda)$ if $\psi^{\prime}_{\alpha}(\lambda)>0$ and $\zeta^{\prime}_{\alpha}(\lambda)=0$ if $\psi^{\prime}_{\alpha}(\lambda)<0$ .

Next, we characterize the first two eigenvalues $\lambda_{1}^{\boldsymbol{D}_{m}}$ and $\lambda_{2}^{\boldsymbol{D}_{m}}$ . By Proposition 2, the leading eigenvalue $\lambda_{1}^{\boldsymbol{D}_{m}}=L_{m}(\mu_{m})$ . Since $\mu_{m}\overset{\text{a.s.}}{\longrightarrow}\mu^{\ast}$ , applying Lemma 3 stated in Appendix -E leads to

[TABLE]

Recall from (32) that $\boldsymbol{P}_{m}$ is a principal submatrix of $\boldsymbol{D}_{m}$ obtained by deleting the first row and column of $\boldsymbol{D}_{m}$ . It follows from the standard Cauchy interlacing theorem (see, e.g., [42, Theorem 4.3.8]) that

[TABLE]

Applying [32, Lemma 3.1] (which is due to [43]), the upper edge of the support of the limiting spectral density of $\boldsymbol{P}_{m}$ is given by

[TABLE]

where $\overline{\lambda}_{\alpha}$ is the minimizing point defined in (15). It follows that $\lambda_{2}^{\boldsymbol{P}_{m}}\overset{\text{a.s.}}{\longrightarrow}\zeta_{\alpha}(\overline{\lambda}_{\alpha})$ and $\lambda_{1}^{\boldsymbol{P}_{m}}\overset{\text{a.s.}}{\longrightarrow}\zeta_{\alpha}(\overline{\lambda}_{\alpha})$ , and thus

[TABLE]

by the interlacing inequalities in (60). Finally, by the constructions of $\psi_{\alpha}(\lambda)$ and $\zeta_{\alpha}(\lambda)$ , we have $\zeta_{\alpha}(\lambda)>\zeta_{\alpha}(\overline{\lambda}_{\alpha})$ if and only if $\psi^{\prime}_{\alpha}(\lambda)>0$ , and the proof is complete.

IV Sampling Ratios and Phase Transitions

In this section, we study the phase transition phenomena characterized in Theorem 1 in more detail. In particular, we prove Proposition 1 (as stated in Section II-B), which specifies the phase transitions and the asymptotic limits of the cosine similarities in terms of the sampling ratio $\alpha$ .

IV-A Critical Sampling Ratios

By Theorem 1, whether the leading eigenvector $\boldsymbol{x}_{1}^{n}$ is asymptotically correlated or uncorrelated with the target vector $\boldsymbol{\xi}_{n}$ depends on the sign of the derivative $\psi^{\prime}_{\alpha}(\lambda)$ evaluated at a point $\lambda^{\ast}_{\alpha}$ , which is uniquely defined through the equation $\zeta_{\alpha}(\lambda^{\ast}_{\alpha})=\phi(\lambda^{\ast}_{\alpha})$ . Let $\overline{\lambda}_{\alpha}$ , defined in (15), be the point at which the strictly convex function $\psi_{\alpha}(\lambda)$ achieves its minimum. Calculating the derivative of $\psi_{\alpha}(\lambda)$ and setting it to zero, we get

$$\mathbb{E}\frac{z^{2}}{(\overline{\lambda}_{\alpha}-z)^{2}}=\frac{1}{\alpha}. \tag{61}$$

By the construction of the function $\zeta_{\alpha}(\lambda)$ in (16) and by the monotonicity of $\phi(\lambda)$ , we can conclude that $\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})>0$ if and only if

[TABLE]

Substituting (61) into (14) gives us $\psi_{\alpha}(\overline{\lambda}_{\alpha})=\overline{\lambda}_{\alpha}^{2}\,\mathbb{E}\frac{z}{(\overline{\lambda}_{\alpha}-z)^{2}}$ . Thus, transitions between the correlated and uncorrelated phases take place exactly at the zero-crossings of the function

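$$\Delta(\lambda)\overset{\text{def}}{=}\lambda\,\mathbb{E}\frac{z}{(\lambda-z)^{2}}-\mathbb{E}\frac{zs^{2}}{\lambda-z}, \tag{62}$$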

where $\Delta(\lambda)$ is obtained by removing a common factor $\overline{\lambda}_{\alpha}$ from the difference $\psi_{\alpha}(\overline{\lambda}_{\alpha})-\phi(\overline{\lambda}_{\alpha})$ and by writing $\overline{\lambda}_{\alpha}$ simply as $\lambda$ . Let $\Lambda$ be the set consisting of all the zero-crossings of $\Delta(\lambda)$ within the open interval $(\tau,\infty)$ . Using (61), we can then establish a one-to-one mapping between points in $\Lambda$ and a set of critical values of the sampling ratios.

Lemma 2

The set $\Lambda$ is nonempty. It contains a finite number of points, denoted by $\lambda_{c,1}\leq\lambda_{c,2}\leq\ldots\leq\lambda_{c,r}$ for some $r\geq 1$ . Moreover,

[TABLE]

Proof:

We first show that $\Lambda$ is nonempty. For $\lambda>\tau$ , applying the Cauchy-Schwarz inequality gives us

[TABLE]

By assumption (A.5), $\mathbb{E}\frac{z}{(\lambda-z)^{2}}\rightarrow\infty$ as $\lambda$ approaches $\tau$ from the right. Thus, we have

[TABLE]

To study the function $\Delta(\lambda)$ as $\lambda\rightarrow\infty$ , we note that

[TABLE]

By assumption (A.6), $\mathbb{E}z<\mathbb{E}zs^{2}$ . We can then conclude from inequality (65) that

[TABLE]

Since $\Delta(\lambda)$ is a continuous function, (64) and (65) imply that there must exist at least one zero-crossing.

Next, we show the upper bound given in (63). For any $\lambda_{c}\in\Lambda$ , we have from (62) that

[TABLE]

By assumption (A.4), $z$ is bounded within $[0,\tau]$ . It follows that

[TABLE]

Substituting the above inequalities into (67) gives us $(\mathbb{E}z)\lambda_{c}^{2}\geq(\mathbb{E}zs^{2})(\lambda_{c}-\tau)^{2}$ , which, after some simple manipulations, leads to the upper bound given in (63).

Finally, to show that $\Lambda$ is a finite set, we extend $\Delta(\lambda)$ in (62) to the complex domain $\left\{\lambda\in\mathbb{C}\mathrel{\mathop{\mathchar 58\relax}}\operatorname{Re}(\lambda)>\tau\right\}$ . Since $\Delta(\lambda)$ is analytic and it is not identically zero, by the principle of permanence, it has at most a finite number of zeros in the bounded domain $(\tau,\frac{\tau}{1-\sqrt{\mathbb{E}z/\mathbb{E}zs^{2}}})$ . ∎
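For binary $z\in\{0,1\}$ the quantities in this subsection can be computed explicitly, which gives a simple sanity check. The sketch below locates the zero-crossing of $\Delta(\lambda)$ by bisection inside the interval $(\tau,\frac{\tau}{1-\sqrt{\mathbb{E}z/\mathbb{E}zs^{2}}})$ used in the proof, and then maps it to a sampling ratio through the stationarity condition $\mathbb{E}\frac{z^{2}}{(\lambda-z)^{2}}=1/\alpha$ . The numerical values of $c=\mathbb{E}zs^{2}$ and $d=\mathbb{E}z$ are hypothetical, and the helper names are illustrative.

```python
def delta_binary(lam, c, d):
    """Delta(lam) for z in {0, 1}: lam * d / (lam - 1)^2 - c / (lam - 1)."""
    return lam * d / (lam - 1.0) ** 2 - c / (lam - 1.0)


def bisect_root(f, lo, hi, iters=200):
    """Bisection for a sign change of f on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical constants with E[z] < E[z s^2], in the spirit of assumption (A.6).
c, d = 0.52, 0.13
tau = 1.0

upper = tau / (1.0 - (d / c) ** 0.5)             # right end of the search interval
lam_c = bisect_root(lambda lam: delta_binary(lam, c, d), tau + 1e-9, upper)

# For binary z, E[z^2 / (lam - z)^2] = d / (lam - 1)^2, hence alpha = (lam - 1)^2 / d.
alpha_c = (lam_c - 1.0) ** 2 / d

print(lam_c, c / (c - d))                         # zero-crossing vs. closed form
print(alpha_c, d / (c - d) ** 2)                  # critical ratio vs. closed form
```

With these values the zero-crossing agrees with $c/(c-d)$ and the mapped ratio agrees with the closed-form threshold $d/(c-d)^{2}$ of the binary model.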

IV-B Proof of Proposition 1

Write $\lambda_{c,\min}=\lambda_{c,1}$ and $\lambda_{c,\max}=\lambda_{c,r}$ . The corresponding critical sampling ratios $\alpha_{c,\min}$ and $\alpha_{c,\max}$ , as defined in (20), are obtained through the one-to-one mapping given in (61).

Fix $\alpha<\alpha_{c,\min}$ . By the monotonicity of the mapping (61), the corresponding $\overline{\lambda}_{\alpha}$ is strictly less than the smallest zero-crossing point $\lambda_{c,1}$ . From the proof of Lemma 2, we conclude that $\Delta(\overline{\lambda}_{\alpha})>0$ , and thus $\zeta_{\alpha}(\cdot)$ and $\phi(\cdot)$ intersect at a point $\lambda^{\ast}_{\alpha}<\overline{\lambda}_{\alpha}$ . This implies that $\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})<0$ and thus Theorem 1 gives us

[TABLE]

Now fix $\alpha>\alpha_{c,\max}$ , in which case $\overline{\lambda}_{\alpha}>\lambda_{c,\max}$ . Since $\Delta(\overline{\lambda}_{\alpha})<0$ , we must have $\lambda^{\ast}_{\alpha}>\overline{\lambda}_{\alpha}$ and thus $\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})>0$ . To derive the parametric form of $\rho(\alpha)$ given in the statement of the proposition, we note that

[TABLE]

Thus, the equation $\zeta_{\alpha}(\lambda^{\ast}_{\alpha})=\phi(\lambda^{\ast}_{\alpha})$ becomes $\psi_{\alpha}(\lambda^{\ast}_{\alpha})=\phi(\lambda^{\ast}_{\alpha})$ . Using the explicit definitions of these functions given in (13) and (14), we get

[TABLE]

We can also explicitly compute

[TABLE]

Similarly, we can write

[TABLE]

Substituting (70) and (71) into the asymptotic characterization (18) gives us (22), which, together with (68), provides a parametric representation of the function $\rho(\alpha)$ .

Finally, we show that $\rho(\alpha)\rightarrow 1$ as $\alpha\rightarrow\infty$ . From (68) and after some simple manipulations, we have

[TABLE]

Since $\lambda^{\ast}_{\alpha}\rightarrow\infty$ as $\alpha\rightarrow\infty$ , the above formula gives us

[TABLE]

where the leading coefficient $\mathbb{E}(zs^{2}-z)$ is positive by assumption (A.6). By the boundedness of $z$ ,

[TABLE]

and thus $\mathbb{E}\frac{z^{2}}{(\lambda^{\ast}_{\alpha}-z)^{2}}=\mathcal{O}(1/\alpha^{2})$ . It follows from (69) that $\psi^{\prime}_{\alpha}(\lambda^{\ast}_{\alpha})=\mathcal{O}(1/\alpha)$ . Similarly, we conclude from (71) that $\mathinner{\!\left\lvert\phi^{\prime}(\lambda^{\ast}_{\alpha})\right\rvert}=\mathcal{O}(1/\alpha^{2})$ . Substituting these limiting expressions into (18) then gives us $\lim_{\alpha\rightarrow\infty}\rho(\alpha)=1$ , and this completes the proof.

Remark 8

When the set $\Lambda$ consists of a single element, which is the case for many signal acquisition models we have studied, $\lambda_{c,\min}=\lambda_{c,\max}$ and thus $\alpha_{c,\min}=\alpha_{c,\max}$ . There then exists a single critical sampling ratio $\alpha_{c}$ separating the uncorrelated phase from the correlated one. For $\alpha<\alpha_{c}$ , the estimates from the spectral method are asymptotically orthogonal to $\boldsymbol{\xi}_{n}$ ; for $\alpha>\alpha_{c}$ , the estimates will be concentrated on the surface of a right-circular cone whose generating lines make an angle $\theta=\arccos(\sqrt{\rho(\alpha)})$ to the target vector $\boldsymbol{\xi}_{n}$ . The situation is more complicated when $\Lambda$ contains multiple zero-crossings, in which case a finite number of correlated and uncorrelated phases can alternate between $\alpha_{c,\min}$ and $\alpha_{c,\max}$ . A concrete example demonstrating this situation is shown in the next subsection.

IV-C Multiple Phase Transitions: an Example

Consider the following model:

[TABLE]

where $0<\theta<1$ , and $\mathcal{I}_{1},\mathcal{I}_{2}$ are two nonoverlapping intervals on the positive real axis. We also set $\kappa=\mathinner{\!\left\lVert\boldsymbol{\xi}_{n}\right\rVert}=1$ , and thus $\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n}$ has the same distribution as a standard normal random variable, denoted by $s$ . Define

[TABLE]

where $\mathds{1}_{\mathcal{I}_{\ell}}(\cdot)$ is the indicator function of $\mathcal{I}_{\ell}$ , for $\ell=1,2$ . As $z$ takes only $3$ different values, we can explicitly compute $\Delta(\lambda)$ in (62) as

[TABLE]

Choose $\theta=0.48$ , $\mathcal{I}_{1}=[4.7947,4.9847]$ , and $\mathcal{I}_{2}=[0.8995,0.8998]$ . We then have $\beta_{1}=1.0086\times 10^{-6}$ , $\beta_{2}=1.5970\times 10^{-4}$ , $\omega_{1}=2.3976\times 10^{-5}$ and $\omega_{2}=1.2926\times 10^{-4}$ . In this case, $\Delta(\lambda)$ turns out to have three zero-crossings:

[TABLE]

By the mapping in (61), they correspond to three critical sampling ratios:

[TABLE]

Using the characterization given in Theorem 1, we obtain the limiting values of the squared cosine similarity as a function of the sampling ratio $\alpha$. Figure 4 illustrates this function $\rho(\alpha)$. When $\alpha<\alpha_{c,1}$, the estimates given by the spectral method are asymptotically uncorrelated with $\boldsymbol{\xi}_{n}$. When $\alpha$ is in the interval $(\alpha_{c,1},\alpha_{c,2})$, the function $\rho(\alpha)$ has a small “bump” (see the inset for a zoomed-in view), meaning that the estimates become asymptotically correlated with $\boldsymbol{\xi}_{n}$. However, the correlation returns to zero as $\alpha$ moves past the second phase transition point $\alpha_{c,2}$. Finally, when $\alpha>\alpha_{c,3}$, the estimates become correlated with $\boldsymbol{\xi}_{n}$ again, and $\rho(\alpha)$ tends to one as $\alpha\rightarrow\infty$.
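The four constants quoted in this example can be checked numerically. The sketch below assumes, since the defining display is not reproduced here, that $\beta_{\ell}=\mathbb{P}(|s|\in\mathcal{I}_{\ell})$ and $\omega_{\ell}=\mathbb{E}\,[s^{2}\mathds{1}_{\mathcal{I}_{\ell}}(|s|)]$ for a standard normal $s$. Under that reading, the closed forms below should agree closely with the stated values. The function names are illustrative.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def beta(a, b):
    """P(a <= |s| <= b) for s ~ N(0, 1), assuming 0 <= a <= b."""
    # erfc(x / sqrt(2)) equals the two-sided tail probability P(|s| > x).
    return math.erfc(a / math.sqrt(2.0)) - math.erfc(b / math.sqrt(2.0))

def omega(a, b):
    """E[s^2 * 1{a <= |s| <= b}] for s ~ N(0, 1)."""
    # An antiderivative of s^2 * phi(s) is Phi(s) - s * phi(s).
    return beta(a, b) + 2.0 * (a * phi(a) - b * phi(b))

I1 = (4.7947, 4.9847)
I2 = (0.8995, 0.8998)
beta1, omega1 = beta(*I1), omega(*I1)
beta2, omega2 = beta(*I2), omega(*I2)

print(f"beta_1  = {beta1:.4e}, beta_2  = {beta2:.4e}")
print(f"omega_1 = {omega1:.4e}, omega_2 = {omega2:.4e}")
```

Both intervals carry very little probability mass, which is consistent with the phase transitions of this example occurring only at very large sampling ratios.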

Remark 9

It would be desirable to obtain a deeper understanding of the above phenomenon involving multiple phase transitions. The example provided here is purely theoretical, as its phase transitions take place at very large values of $\alpha$. It remains to be explored whether other examples of multiple phase transitions exist at more practical values of $\alpha$. Moreover, as most signal acquisition models we have studied seem to involve only a single phase transition point, it would be useful to find easy-to-verify conditions for the function $\Delta(\lambda)$ defined in (62) to have only one zero-crossing. We leave these as open questions.

V Discussion

In this paper, we have presented a precise asymptotic characterization of the performance of a spectral method for estimating signals from generalized linear measurements with Gaussian sensing vectors. Our analysis also reveals a phase transition phenomenon that takes place at certain critical sampling ratios. Below a minimum threshold, estimates given by the method are nearly orthogonal to the true signal $\boldsymbol{\xi}$, thus carrying no information; above a maximum threshold, the estimates become increasingly aligned with $\boldsymbol{\xi}$. The computational complexity of the spectral method is also markedly different in the two phases. Within the uncorrelated phase, the gap between the two leading eigenvalues diminishes to zero. In contrast, a nonzero spectral gap emerges within the correlated phase. In this section, we close the paper by discussing some possible directions for extending and improving our results as well as their connections to related work in the literature.

The rate of convergence and more refined analysis. The performance of the spectral method was first studied in [11] for the problem of phase retrieval. In that paper, it is shown that, for each $\delta\in(0,1)$, there is a constant $c_{1}(\delta)$ such that $\rho(\boldsymbol{\xi}_{n},\boldsymbol{x}_{1}^{n})>1-\delta$ with high probability when

[TABLE]

This estimate of the sample complexity was improved to $m>c_{2}(\delta)n\log n$ in [7] and to $m>c_{3}(\delta)n$ in [13]. The key technical tools underlying these previous estimates are matrix concentration inequalities (see, e.g., [44]), which guarantee that the spectral norm of the difference between the data matrix $\boldsymbol{D}_{m}$ and its expectation $\mathbb{E}\boldsymbol{D}_{m}$ will be small when the sampling ratio $m/n$ is sufficiently large. The closeness of the corresponding leading eigenvectors of $\boldsymbol{D}_{m}$ and $\mathbb{E}\boldsymbol{D}_{m}$ then follows from standard perturbation arguments. (See also our discussions towards the end of Section II-A.) Our work differs from and complements these finite-sample bounds in that we obtain sharp asymptotics to characterize the exact performance of the spectral method in the high-dimensional regime. A (theoretical) limitation of our analysis is that it is asymptotic in nature, requiring both $m,n\rightarrow\infty$. Although numerical simulations shown in Section II-D indicate that the asymptotic predictions are accurate even for moderate signal dimensions, it will be useful to quantify the rate of convergence towards the asymptotic limits in future work.

Another possible direction to further refine our analysis is to consider second-order asymptotics at the level of central limit theorems (CLTs). See for instance [45] for a related CLT analysis for the extreme eigenvalues of spiked covariance models.

Alternative initialization schemes. The spectral method considered in this paper is certainly not the only choice for initialization purposes. For example, an interesting alternative is the simple linear estimator studied in [46]:

[TABLE]

By using the moment calculations in [46, Proposition 1.1] and bounding high-order moments, one can easily obtain that

[TABLE]

where $s$ and $z$ are the random variables defined in (5).

Recall the function $g(s)$ introduced in (12) and our discussions thereafter, where we point out that the spectral method is not suitable for acquisition models for which $g(s)$ is an odd function plus a constant. Such cases will pose no problem for the linear estimator in (72). However, it is interesting to note that the linear estimator will be ineffective when $g(s)$ is an even function, as is the case in phase retrieval. To see this, we note that $\mathbb{E}zs=\mathbb{E}g(s)s=0$ when $g(s)$ is even. It then follows from (73) that the linear estimator will be asymptotically uncorrelated with the target signal $\boldsymbol{\xi}_{n}$ .
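The contrast between even and odd $g(s)$ can be seen in a small simulation. This is only a sketch: it assumes the linear estimator in (72) has the form $\frac{1}{m}\sum_{i}z_{i}\boldsymbol{a}_{i}$ (the display is not reproduced here), takes $\boldsymbol{\xi}_{n}=\boldsymbol{e}_{1}$ without loss of generality by rotational invariance, and uses two hypothetical noiseless models, $z=s^{2}$ (even) and $z=\tanh(s)$ (odd). The sizes `n` and `m` are arbitrary.

```python
import math
import random

def linear_estimate_cos2(link, n=100, m=3000, seed=0):
    """Squared cosine similarity between xi = e_1 and (1/m) * sum_i z_i * a_i."""
    rng = random.Random(seed)
    est = [0.0] * n
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = a[0]          # a_i^T xi when xi = e_1
        z = link(s)       # z_i = g(s_i), noiseless for simplicity
        for j in range(n):
            est[j] += z * a[j] / m
    norm2 = sum(v * v for v in est)
    return est[0] ** 2 / norm2

cos2_even = linear_estimate_cos2(lambda s: s * s)  # phase-retrieval-like model
cos2_odd = linear_estimate_cos2(math.tanh)         # odd link function

print(f"even g: cos^2 = {cos2_even:.3f}")
print(f"odd  g: cos^2 = {cos2_odd:.3f}")
```

In the even case the estimate carries essentially no information about the direction of the signal, whereas in the odd case it is well aligned, in line with $\mathbb{E}zs=0$ for even $g(s)$.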

For cases where the function $g(s)$ is neither odd nor even, the choice between the spectral method and the linear estimator is not as clear-cut. The spectral method exhibits phase transitions, and its estimates lie in the uncorrelated phase at small values of $\alpha$. In contrast, as shown in (73), the performance of the linear estimator improves monotonically with $\alpha$. As a result, in the regime of very small $\alpha$, the linear estimator will be preferable. For (moderately) larger values of $\alpha$, the comparison between the spectral method and the linear estimator cannot be easily made, as their performance also depends on the preprocessing function $\mathcal{T}(\cdot)$ used in (3) and (72).

The incorporation of priors. In this work, we assume that the target signal $\boldsymbol{\xi}_{n}$ is an arbitrary unknown (deterministic) signal. In many applications, the underlying signals satisfy additional constraints (such as sparsity). In [46], the authors considered a two-step scheme, where the initial linear estimate given in (72) is further projected onto a set which encapsulates one’s prior knowledge about $\boldsymbol{\xi}$ . It will be interesting to consider and analyze similar projection schemes for the estimates obtained by the spectral method.

Universality and more realistic sensing vectors. Our asymptotic analysis assumes that the sensing vectors are real-valued i.i.d. Gaussian random vectors. Numerical simulations seem to suggest that the theoretical predictions given in Theorem 1 remain valid for more general random measurement ensembles and for complex-valued sensing vectors. To demonstrate this, we show in Figure 5 the results of applying the spectral method to estimate a $64\times 64$ cameraman image from phaseless measurements under Poisson noise:

[TABLE]

where the bound $\tau$ is set to 5 and $\mathinner{\!\left\lVert\boldsymbol{\xi}\right\rVert}$ is normalized to 1 in our simulations. Two measurement ensembles are considered: real-valued sensing vectors whose elements are independent Rademacher ($\pm 1$) random variables, and complex-valued sensing vectors with elements drawn from the complex Gaussian distribution $\mathcal{N}(0,\tfrac{1}{2})+j\mathcal{N}(0,\tfrac{1}{2})$. We see from the figure that the theoretical predictions (the solid lines) have excellent agreement with simulation results for this moderately sized problem, even though the sensing vectors can be non-Gaussian. Rigorously establishing the validity of our asymptotic predictions without the Gaussian assumption is an important direction for future work. Thanks to the deterministic characterization given in Proposition 2, this task boils down to showing that the result of Proposition 4 still holds when the sensing matrix consists of i.i.d. entries drawn from more general distributions. A related but more ambitious line of work will be to characterize the performance of the spectral method for structured and more practical sensing ensembles such as the coded diffraction scheme for phase retrieval with random modulation patterns.

Low-rank matrix recovery. The spectral method studied in this paper belongs to a more general theme. Let $\boldsymbol{X}^{\star}\in\mathbb{R}^{p\times n}$ be a rank-$r$ matrix and $\left\{\boldsymbol{A}_{i}\right\}_{1\leq i\leq m}$ a collection of sensing matrices of the same size as $\boldsymbol{X}^{\star}$. To recover $\boldsymbol{X}^{\star}$ from linear measurements of the form $\left\{y_{i}=\operatorname{Tr}\boldsymbol{A}_{i}^{\top}\boldsymbol{X}^{\star}\right\}_{i}$, we can consider the following rank-constrained least squares problem

[TABLE]

and try to solve it via projected gradient descent

[TABLE]

where $\mathcal{P}_{r}$ denotes projection onto the set of rank- $r$ matrices, and $\mu>0$ is the step size. As pointed out in [47], the spectral method studied in this paper can be viewed as the very first iteration of (75), if we start the algorithm from $\boldsymbol{X}_{0}=\boldsymbol{0}_{n\times n}$ and consider the special case of recovering a symmetric rank-one matrix (i.e., $r=1$ , $p=n$ ) with $\boldsymbol{A}_{i}=\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}$ . Thus, an interesting line of future research is to extend the results of this work, notably the key characterization given in Proposition 2, to more general settings with $r>1$ and $p\neq n$ and to other sensing matrices. Such extensions will be useful in applications such as low-rank matrix recovery, covariance estimation, and blind deconvolution.
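The connection can be made concrete with a small sketch for noiseless phase retrieval. It assumes the usual form of the update in (75), so that one gradient step from $\boldsymbol{X}_{0}=\boldsymbol{0}$ yields a positive multiple of $\frac{1}{m}\sum_{i}y_{i}\boldsymbol{a}_{i}\boldsymbol{a}_{i}^{\top}$, whose rank-one projection is spanned by its leading eigenvector. The eigenvector is computed by plain power iteration. The problem sizes, the choice $\mathcal{T}(y)=y$, and the helper name `leading_eigvec` are illustrative assumptions.

```python
import random

def leading_eigvec(D, iters=100, seed=1):
    """Unit-norm leading eigenvector of a symmetric PSD matrix via power iteration."""
    n = len(D)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        Dx = [sum(D[j][k] * x[k] for k in range(n)) for j in range(n)]
        norm = sum(v * v for v in Dx) ** 0.5
        x = [v / norm for v in Dx]
    return x

rng = random.Random(0)
n, m = 16, 1600                      # sampling ratio m/n = 100

# Unit-norm target signal.
xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
xi_norm = sum(v * v for v in xi) ** 0.5
xi = [v / xi_norm for v in xi]

# D = (1/m) * sum_i y_i a_i a_i^T with y_i = (a_i^T xi)^2.
D = [[0.0] * n for _ in range(n)]
for _ in range(m):
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = sum(aj * xj for aj, xj in zip(a, xi)) ** 2
    for j in range(n):
        c = y * a[j] / m
        for k in range(n):
            D[j][k] += c * a[k]

x1 = leading_eigvec(D)
cos2 = sum(u * v for u, v in zip(x1, xi)) ** 2
print(f"squared cosine similarity: {cos2:.3f}")
```

Power iteration suffices here because the measurements are nonnegative, so the matrix is positive semidefinite and its largest-magnitude eigenvalue is also its largest eigenvalue.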

-A Sufficient Conditions for Assumption (A.5) to Hold

In this appendix, we provide two sufficient conditions for Assumption (A.5) to hold.

Case 1: Suppose that the probability law of the random variable $z$ contains a point mass $c\,\delta(z-\tau)$ at its upper boundary $\tau$ , where $c$ is some positive constant. This applies to the logistic regression model in Example 1, the subset algorithm (28) in Example 2, the noisy phase retrieval model in (74), and the quantization model described in Section IV-C.

In this case,

[TABLE]

as $\lambda\rightarrow\tau^{+}$ . To verify the second expression in (6), let $h(z)=\mathbb{E}_{s|z}(s^{2}|z)$ . Since $\mathbb{P}(z=\tau)>0$ , we must have $h(\tau)>0$ . Thus,

[TABLE]

which tends to $\infty$ as $\lambda$ approaches $\tau$ from the right.

Case 2: Suppose that there exist some positive constants $c$ and $\varepsilon$ such that the probability density function $p_{Z}(z)$ of $z$ and the conditional moment $h(z)$ are both bounded below by $c$ for all $z\in[\tau-\varepsilon,\tau]$ . The model in (8) represents one such case. Under this setting,

[TABLE]

Similarly, we can verify that $\mathbb{E}\frac{z}{(\lambda-z)^{2}}\rightarrow\infty$ as $\lambda\rightarrow\tau^{+}$ .

-B Norm Estimation

The spectral initialization method estimates the orientation of the vector $\boldsymbol{\xi}_{n}$ but it provides no information about its norm, as the eigenvector $\boldsymbol{x}_{1}^{n}$ is always normalized. In many cases where the sensing vectors come from certain random ensembles, the norm $\mathinner{\!\left\lVert\boldsymbol{\xi}_{n}\right\rVert}$ can be accurately estimated from the measurements.

As a simple illustrative example, we can consider the (noiseless) phase retrieval problem: $y_{i}=(\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n})^{2}$ , where $\boldsymbol{\xi}_{n}$ is a deterministic unknown vector with $\kappa=\mathinner{\!\left\lVert\boldsymbol{\xi}_{n}\right\rVert}$ , and the sensing vectors $\left\{\boldsymbol{a}_{i}\right\}$ are i.i.d. standard normal random vectors. Since $\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n}\sim\mathcal{N}(0,\kappa^{2})$ , the measurement $y_{i}$ can be represented as

$$y_{i}=\kappa^{2}s_{i}^{2},$$

where $s_{i}$ (for $1\leq i\leq m$ ) are i.i.d. standard normal random variables. A simple estimator of the norm is then

$$\widehat{\kappa}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}y_{i}},\tag{76}$$

which is asymptotically consistent as $m\rightarrow\infty$ .
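A minimal simulation of this norm estimator is sketched below. The dimensions and the random target vector are arbitrary choices made for illustration, and `estimate_norm` is a hypothetical helper name.

```python
import math
import random

def estimate_norm(ys):
    """Moment estimator of kappa: square root of the sample mean of the y_i."""
    return math.sqrt(sum(ys) / len(ys))

rng = random.Random(0)
n, m = 10, 50000

# Deterministic (here: fixed after one random draw) target vector and its norm.
xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
kappa = math.sqrt(sum(v * v for v in xi))

# Noiseless phase retrieval measurements y_i = (a_i^T xi)^2.
ys = []
for _ in range(m):
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys.append(sum(aj * xj for aj, xj in zip(a, xi)) ** 2)

kappa_hat = estimate_norm(ys)
print(f"true norm = {kappa:.4f}, estimate = {kappa_hat:.4f}")
```

The estimator works because $\mathbb{E}\,y_{i}=\kappa^{2}\,\mathbb{E}\,s_{i}^{2}=\kappa^{2}$, so the sample mean of the measurements concentrates around $\kappa^{2}$.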

More generally, consider an observation model $y_{i}\sim f(y\,|\,\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n})$, where $f(\cdot\,|\,\cdot)$ is a conditional probability density function and $\boldsymbol{a}_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I}_{n})$. Again, writing $\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n}=\kappa s_{i}$ for i.i.d. standard normal random variables $\left\{s_{i}\right\}$, we can represent the probability distributions of the measurements $\left\{y_{i}\right\}$ as

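$$p_{\kappa}(y)=\mathbb{E}_{s}\,f(y\,|\,\kappa s)=\int f(y\,|\,\kappa s)\,\frac{e^{-s^{2}/2}}{\sqrt{2\pi}}\operatorname{d\!}s.$$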

Let $w(\kappa)\overset{\text{def}}{=}\mathbb{E}(y_{i})=\int y\,p_{\kappa}(y)\operatorname{d\!}y$ . If $w(\kappa)$ is monotonic on the positive real line, the method of moments gives an estimator

$$\widehat{\kappa}=w^{-1}\Big(\frac{1}{m}\sum_{i=1}^{m}y_{i}\Big).\tag{77}$$

We note that the estimator in (76) is a special case of (77). More generally, one could also estimate $\kappa$ by using maximum likelihood

$$\widehat{\kappa}=\underset{\kappa>0}{\arg\max}\ \sum_{i=1}^{m}\log p_{\kappa}(y_{i}),$$

whose asymptotic consistency can be established under standard conditions [48] on the parametric density function $p_{\kappa}(y)$ .
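When $w(\kappa)$ has no convenient closed-form inverse, the method-of-moments estimator can be evaluated by numerical inversion. The sketch below uses a hypothetical saturating model $y_{i}=1-\exp(-(\boldsymbol{a}_{i}^{\top}\boldsymbol{\xi}_{n})^{2})$, chosen only because its mean is available in closed form: $w(\kappa)=1-(1+2\kappa^{2})^{-1/2}$, which is increasing on the positive real line. The bisection routine and all names are illustrative assumptions.

```python
import math
import random

def w(kappa):
    """Mean of y = 1 - exp(-(kappa * s)^2) for s ~ N(0, 1)."""
    return 1.0 - 1.0 / math.sqrt(1.0 + 2.0 * kappa * kappa)

def invert_increasing(fn, target, lo=0.0, hi=1.0, iters=100):
    """Solve fn(k) = target by bisection, for fn increasing on k >= lo."""
    # Expand the bracket until it contains the solution (capped for safety).
    while fn(hi) < target and hi < 1e6:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
kappa_true, m = 1.5, 100000

# Use the representation a_i^T xi = kappa * s_i with s_i ~ N(0, 1).
ys = [1.0 - math.exp(-(kappa_true * rng.gauss(0.0, 1.0)) ** 2) for _ in range(m)]

kappa_mom = invert_increasing(w, sum(ys) / m)
print(f"true norm = {kappa_true}, method-of-moments estimate = {kappa_mom:.4f}")
```

Bisection is used because it needs only the monotonicity of $w(\kappa)$, which is the same condition required by the estimator itself.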

-C Proof of Proposition 2

By a suitable choice of a transformation matrix

[TABLE]

where ${\boldsymbol{W}}\in\mathbb{R}^{(n-1)\times(n-1)}$ is an orthogonal matrix involving the last $(n-1)$ rows and columns only, we can get a matrix

[TABLE]

where $\mathcal{I}_{1},\mathcal{I}_{2}$ are the two sets of indices defined in (42) and $\widetilde{\boldsymbol{q}}$ is a vector consisting of all the nonzero elements of $\boldsymbol{W}^{\top}\boldsymbol{q}$ . Let $\lambda_{1}^{\widetilde{\boldsymbol{D}}}$ and $\widetilde{\boldsymbol{x}}_{1}$ be the largest eigenvalue of $\widetilde{\boldsymbol{D}}$ and an associated unit-norm eigenvector, respectively. Clearly, $\lambda_{1}^{\boldsymbol{D}}=\lambda_{1}^{\widetilde{\boldsymbol{D}}}$ and $(\boldsymbol{e}_{1}^{\top}\boldsymbol{x}_{1})^{2}=(\boldsymbol{e}_{1}^{\top}\widetilde{\boldsymbol{x}}_{1})^{2}$ . Thus, we just need to consider $\widetilde{\boldsymbol{D}}$ in our proof.

Due to its block-diagonal form, the set of eigenvalues of $\widetilde{\boldsymbol{D}}$ is the union of those of its top-left submatrix

[TABLE]

and those of its bottom-right submatrix $\operatorname{diag}\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$ . In particular,

[TABLE]

The eigenvectors associated with $\operatorname{diag}\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$ are easy to characterize. Clearly, each $\lambda_{i}^{\boldsymbol{P}},i\in\mathcal{I}_{2}$ is an eigenvalue of $\widetilde{\boldsymbol{D}}$ , and it corresponds to an eigenvector $\boldsymbol{e}_{j(i)}$ , where $j(i)\geq 3$ is the row index of $\lambda_{i}^{\boldsymbol{P}}$ in $\widetilde{\boldsymbol{D}}$ .

The eigenvalues and eigenvectors of $\boldsymbol{S}$ can also be precisely characterized. Due to its shape, $\boldsymbol{S}$ is sometimes referred to in the literature as an arrowhead matrix [49, 50]. It can be shown (see for instance [51, pp. 94–97]) that $\lambda_{1}^{\boldsymbol{S}}$ is the unique point within the interval $\lambda>\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{1}\right\}$ that satisfies the equation

[TABLE]

where $R(\lambda)$ is the function defined in (38). (Alternatively, we can use the Laplace expansion to explicitly derive the characteristic polynomial of $\boldsymbol{S}$ as

[TABLE]

Then, by following similar arguments as those used in the proof of Lemma 1, we can reach the characterization (80) about $\lambda_{1}^{\boldsymbol{S}}$ .) Furthermore, let $\boldsymbol{x}_{1}^{\boldsymbol{S}}$ be a unit-norm eigenvector of $\widetilde{\boldsymbol{D}}$ associated with $\lambda_{1}^{\boldsymbol{S}}$ . It is easily checked that

[TABLE]

where $\boldsymbol{y}=(\lambda_{1}^{\boldsymbol{S}}\boldsymbol{I}-\operatorname{diag}\left\{\lambda_{i}^{\boldsymbol{P}}\right\}_{i\in\mathcal{I}_{1}})^{-1}\widetilde{\boldsymbol{q}}$ and $\boldsymbol{0}_{r}$ is a row vector of $r$ zeroes with $r$ being the cardinality of $\mathcal{I}_{2}$ . It follows that

[TABLE]

where $R^{\prime}(\lambda)$ denotes the derivative of the function $R(\lambda)$ .
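The arrowhead characterization can be checked numerically. The sketch below assumes the generic arrowhead form $\boldsymbol{S}=\begin{bmatrix}a&\boldsymbol{q}^{\top}\\ \boldsymbol{q}&\operatorname{diag}(d_{i})\end{bmatrix}$ with all $q_{i}\neq 0$, and takes $R(\lambda)=\sum_{i}q_{i}^{2}/(d_{i}-\lambda)$, which is consistent with the relation $\lambda_{1}^{\boldsymbol{S}}=a-R(\lambda_{1}^{\boldsymbol{S}})$ used in this appendix (the definition in (38) is not reproduced here). The numerical values are arbitrary.

```python
import math

def arrowhead_leading_eigenpair(a, q, d):
    """Leading eigenpair of S = [[a, q^T], [q, diag(d)]], all q_i nonzero."""
    def R(lam):
        return sum(qi * qi / (di - lam) for qi, di in zip(q, d))

    def g(lam):
        # Strictly increasing on lam > max(d); its root is the leading eigenvalue.
        return lam - a + R(lam)

    lo = max(d)
    hi = lo + 1.0
    while g(hi) <= 0.0:              # expand until the root is bracketed
        hi = lo + 2.0 * (hi - lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:   # interval can no longer be split
            break
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    lam = hi
    # Eigenvector proportional to [1, y] with y = (lam * I - diag(d))^{-1} q.
    y = [qi / (lam - di) for qi, di in zip(q, d)]
    norm = math.sqrt(1.0 + sum(v * v for v in y))
    return lam, [1.0 / norm] + [v / norm for v in y]

a, q, d = 0.5, [0.3, -0.2, 0.4], [1.0, 0.2, -0.7]
lam, x = arrowhead_leading_eigenpair(a, q, d)

# Squared first coordinate should equal 1 / (1 + R'(lam)).
R_prime = sum(qi * qi / (di - lam) ** 2 for qi, di in zip(q, d))
print(f"lambda_1 = {lam:.6f}")
print(f"(e_1^T x_1)^2 = {x[0] ** 2:.6f}, 1/(1 + R'(lambda_1)) = {1.0 / (1.0 + R_prime):.6f}")
```

The eigenvalue is found by bisection rather than by a general eigensolver in order to mirror the scalar equation that the proof works with.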

To show the claim of the proposition, we consider the following three cases.

Case 1: $\lambda_{1}^{\boldsymbol{S}}>\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$ . We choose $\mu^{\ast}=-1/R(\lambda_{1}^{\boldsymbol{S}})$ , and thus $\lambda_{1}^{\boldsymbol{S}}=R^{-1}(-1/\mu^{\ast})$ . It follows from Lemma 1 that

[TABLE]

where the second equality is due to the fact that

[TABLE]

and the last equality comes from (79).

Using the identity (80) for $\lambda_{1}^{\boldsymbol{S}}$ , we can also verify that $\mu^{\ast}$ indeed satisfies the equation (44). (Its uniqueness is always guaranteed; see Remark 6 at the end of Section III-B.) The unit-norm leading eigenvector of $\widetilde{\boldsymbol{D}}$ in this case is the vector $\boldsymbol{x}_{1}^{\boldsymbol{S}}$ defined in (81). Since $L(\mu)=R^{-1}(-1/\mu)$ in a neighborhood of $\mu^{\ast}$ , the function $L(\mu)$ is differentiable at $\mu^{\ast}$ and

[TABLE]

Substituting (84) into (82) leads to (46).

Case 2: $\lambda_{1}^{\boldsymbol{S}}<\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$, in which case $\lambda_{1}^{\widetilde{\boldsymbol{D}}}=\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}=\lambda_{1}^{\boldsymbol{P}}$, where the last equality is due to (83). The corresponding leading eigenvector has nonzero elements only in its last $r$ entries, where $r$ is the cardinality of $\mathcal{I}_{2}$. Thus,

[TABLE]

We set $\mu^{\ast}=(\lambda_{1}^{\boldsymbol{P}}-a)^{-1}$ . (Note that we are guaranteed to have $\mu^{\ast}>0$ . This can be verified by observing that $\lambda_{1}^{\boldsymbol{P}}>\lambda_{1}^{\boldsymbol{S}}=a-R(\lambda_{1}^{\boldsymbol{S}})>a$ , where the equality is due to (80) and the last inequality follows from the fact that $R(\lambda)<0$ .) Since $R(\lambda)$ is a strictly increasing function, we have

[TABLE]

where the equality comes from (80). It then follows from Lemma 1 that $L(\mu^{\ast})=\lambda_{1}^{\boldsymbol{P}}=\lambda_{1}^{\widetilde{\boldsymbol{D}}}$ and, moreover, $\mu^{\ast}$ satisfies the equation (44).

To characterize the eigenvector, we note that $L(\mu)\equiv\lambda_{1}^{\boldsymbol{P}}$ in a neighborhood of $\mu^{\ast}$ . We then have $L^{\prime}(\mu^{\ast})=0$ , which, together with (85), leads to (46).

Case 3: $\lambda_{1}^{\boldsymbol{S}}=\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$ . This is a special case, where the algebraic multiplicity of the leading eigenvalue $\lambda_{1}^{\widetilde{\boldsymbol{D}}}=\lambda_{1}^{\boldsymbol{S}}=\lambda_{1}^{\boldsymbol{P}}$ is greater than one. The leading eigenvectors are not unique, and they can be any vector in the form of

[TABLE]

where $\boldsymbol{x}_{1}^{\boldsymbol{S}}$ is the eigenvector defined in (81) and $\boldsymbol{v}$ is an eigenvector associated with $\max\left\{\lambda_{i}^{\boldsymbol{P}}\mathrel{\mathop{\mathchar 58\relax}}i\in\mathcal{I}_{2}\right\}$ , and $c_{1},c_{2}$ are two constants satisfying $c_{1}^{2}+c_{2}^{2}=1$ . Since $\boldsymbol{e}_{1}^{\top}\boldsymbol{v}=0$ , we have from (82) that

[TABLE]

As in Case 2, we set $\mu^{\ast}=(\lambda_{1}^{\boldsymbol{P}}-a)^{-1}$. Following the same arguments there, we can show that $L(\mu^{\ast})=\lambda_{1}^{\boldsymbol{P}}=\lambda_{1}^{\widetilde{\boldsymbol{D}}}$ and $\mu^{\ast}$ satisfies the equation (44). Moreover, we can see that $L(\mu)=R^{-1}(-1/\mu)$ for $\mu>\mu^{\ast}$ and $L(\mu)\equiv\lambda_{1}^{\boldsymbol{P}}$ for $\mu<\mu^{\ast}$. The function $L(\mu)$ is not differentiable at $\mu^{\ast}$, but its right and left derivatives do exist. It is easy to get $\partial_{+}L(\mu^{\ast})=\left(1/R^{\prime}(\lambda_{1}^{\boldsymbol{S}})\right)(\mu^{\ast})^{-2}$ [see (84)] and $\partial_{-}L(\mu^{\ast})=0$. Substituting these quantities into (86), we reach the characterization given in (45).

-D Proof of Proposition 3

To establish the almost-sure convergence of the random measure $f^{\boldsymbol{M}_{m}}(\lambda)$ to the probability law of $z$ , we just need to show that, almost surely, the empirical distribution function

[TABLE]

converges to $F_{z}(\lambda)$, the cumulative distribution function of $z$, at all points $\lambda$ where $F_{z}(\lambda)$ is continuous. Since $\boldsymbol{M}_{m}$ is a rank-one perturbation of the diagonal matrix $\boldsymbol{Z}$, standard interlacing theorems (see [42, Theorem 4.3.4]) give us

[TABLE]

for $1\leq k\leq m-2$ . Let $F^{\boldsymbol{Z}}(\lambda)=\frac{1}{m}\#\left\{1\leq j\leq m\mathrel{\mathop{\mathchar 58\relax}}z_{j}\leq\lambda\right\}$ be the empirical distribution function of the eigenvalues of $\boldsymbol{Z}$ . We can then easily verify from (87) that

[TABLE]

Since $\left\{z_{i}\right\}_{1\leq i\leq m}$ is an i.i.d. sample of the random variable $z$, with probability one $F^{\boldsymbol{Z}}(\lambda)$ converges to $F_{z}(\lambda)$ at all points $\lambda$ where $F_{z}(\lambda)$ is continuous. It then follows from (88) that $F^{\boldsymbol{M}_{m}}(\lambda)$ converges almost surely to the same limit $F_{z}(\lambda)$.

To study the leading eigenvalue $\lambda_{1}^{\boldsymbol{M}_{m}}$ , we use Lemma 1. To apply that result, we require $\boldsymbol{v}=[z_{1}s_{1},z_{2}s_{2},\ldots,z_{m}s_{m}]^{\top}$ as defined in (35) to be not equal to the all-zero vector. This condition holds almost surely for all sufficiently large $m$ . To see this, we note that $\mathbb{P}(z_{i}=0)<1$ , as otherwise assumption (A.6) will not hold. Moreover, $s_{i}\neq 0$ with probability one. It follows that the i.i.d. sequence $z_{1}s_{1},z_{2}s_{2},z_{3}s_{3},\ldots$ has an infinite number of nonzero elements. Thus, almost surely, the $m$ -dimensional vector $\boldsymbol{v}\neq\boldsymbol{0}$ for sufficiently large $m$ .

Applying (39) to our case, we have $\lambda_{1}^{\boldsymbol{M}_{m}}=R_{m}^{-1}(-1/\mu)\vee\max\left\{z_{i}\right\}_{1\leq i\leq m}$ , where

[TABLE]

with this function defined on $\lambda>\max\left\{z_{i}\right\}_{1\leq i\leq m}$ . Since $R_{m}^{-1}(-1/\mu)>\max\left\{z_{i}\right\}_{1\leq i\leq m}$ , we can further simplify the characterization to

[TABLE]

For every $\lambda>\tau$ , with $\tau$ being the upper bound of the support of the probability distribution of $z$ , it follows from the strong law of large numbers that $R_{m}(\lambda)$ converges almost surely to

[TABLE]

where $Q(\lambda)$ is defined in (51). On its domain $\lambda>\tau$ , the function $-Q(\lambda)$ is strictly increasing and thus it admits a functional inverse $(-Q)^{-1}(x)=Q^{-1}(-x)$ . Applying Lemma 3 in Appendix -E, we have

[TABLE]

as $m\rightarrow\infty$ .

-E Auxiliary Lemmas

We prove here two auxiliary lemmas that are used in our proofs of Proposition 3 and Theorem 1.

Lemma 3

Let $\left\{f_{n}(x)\right\}_{n\geq 1}$ be a family of (random) functions defined on an open interval $(a,b)$ . Each $f_{n}(x)$ is continuous and nondecreasing. For each $x\in(a,b)$ , $f_{n}(x)\overset{\text{a.s.}}{\longrightarrow}f(x)$ as $n\rightarrow\infty$ , where $f(x)$ is a continuous and nondecreasing function. Then, for any sequence $\left\{x_{n}\right\}\subset(a,b)$ with $x_{n}\overset{\text{a.s.}}{\longrightarrow}x^{\ast}\in(a,b)$ , we have

$$f_{n}(x_{n})\overset{\text{a.s.}}{\longrightarrow}f(x^{\ast}).\tag{89}$$

If, in addition, the functions $\left\{f_{n}(x)\right\}$ and $f(x)$ are strictly increasing, we denote by $\left\{f_{n}^{-1}(x)\right\}_{n\geq 1}$ and $f^{-1}(x)$ the corresponding functional inverses. Assume that the domains of $\left\{f_{n}^{-1}(x)\right\}_{n\geq 1}$ and $f^{-1}(x)$ contain a common open interval $\mathcal{I}$ . Then for any sequence $\left\{y_{n}\right\}_{n\geq 1}\subset\mathcal{I}$ such that $y_{n}\overset{\text{a.s.}}{\longrightarrow}y\in\mathcal{I}$ , we have

$$f_{n}^{-1}(y_{n})\overset{\text{a.s.}}{\longrightarrow}f^{-1}(y).\tag{90}$$

Proof:

We first show (89). Let $\beta_{k}=x^{\ast}-h/k$ for $k=1,2,\ldots$ be a sequence that converges to $x^{\ast}$ from the left. We choose $0<h<\min\left\{x^{\ast}-a,b-x^{\ast}\right\}$ so that the entire sequence stays within the interval $(a,b)$. Similarly, define a sequence $\gamma_{k}=x^{\ast}+h/k$, for $k=1,2,\ldots$, that converges to $x^{\ast}$ from the right. Denote by $\mathcal{A}$ the intersection of the event that $f_{n}(x)\rightarrow f(x)$ for all $x\in\left\{\beta_{k}\right\}_{k}\cup\left\{\gamma_{k}\right\}_{k}$ and the event that $x_{n}\rightarrow x^{\ast}$. Clearly, $\mathbb{P}(\mathcal{A})=1$. Next, we show that (89) holds within this almost sure event.

Fix $k\geq 1$ . As $x_{n}\rightarrow x^{\ast}$ , we have $\beta_{k}\leq x_{n}\leq\gamma_{k}$ for all sufficiently large $n$ . By the monotonicity of $f_{n}(x)$ ,

$$f_{n}(\beta_{k})\leq f_{n}(x_{n})\leq f_{n}(\gamma_{k}).$$

It follows that

$$f(\beta_{k})\leq\liminf_{n}f_{n}(x_{n})\leq\limsup_{n}f_{n}(x_{n})\leq f(\gamma_{k}).$$

As $k$ is arbitrary, we take the $k\rightarrow\infty$ limit, which leads to $\lim_{n}f_{n}(x_{n})=f(x^{\ast})$ by the continuity of $f(x)$.

The proof of (90) is similar. We establish it under the additional assumption that $\left\{f_{n}(x)\right\}$ and $f(x)$ are strictly increasing. Construct two sequences $\left\{\beta_{k}\right\}$ and $\left\{\gamma_{k}\right\}$ as above, with $x^{\ast}$ replaced by $f^{-1}(y)$ . Also define the event $\mathcal{A}$ similarly. We show that, within the almost sure event $\mathcal{A}$ , we have $f_{n}^{-1}(y_{n})\rightarrow f^{-1}(y)$ .

Fix $k\geq 1$ . Since $f(x)$ is strictly increasing, $\beta_{k}<f^{-1}(y)<\gamma_{k}$ implies that

$$f(\beta_{k})<y<f(\gamma_{k}).$$

As $f_{n}(\beta_{k})\rightarrow f(\beta_{k})$ , $f_{n}(\gamma_{k})\rightarrow f(\gamma_{k})$ and $y_{n}\rightarrow y$ , the inequalities

$$f_{n}(\beta_{k})<y_{n}<f_{n}(\gamma_{k})$$

hold for all sufficiently large $n$ . By the strict monotonicity of $f_{n}(x)$ ,

$$\beta_{k}<f_{n}^{-1}(y_{n})<\gamma_{k}$$

for all sufficiently large $n$ . It then follows that $\beta_{k}\leq\lim\inf_{n}f_{n}^{-1}(y_{n})\leq\lim\sup_{n}f_{n}^{-1}(y_{n})\leq\gamma_{k}$ , for each $k$ . As $\beta_{k}\rightarrow f^{-1}(y)$ and $\gamma_{k}\rightarrow f^{-1}(y)$ , we are done. ∎

Lemma 4

Let $\left\{f_{n}(x)\right\}_{n\geq 1}$ be a sequence of (random) convex functions defined on an open interval $(a,b)$ . For each $x\in(a,b)$ , $f_{n}(x)\overset{\text{a.s.}}{\longrightarrow}f(x)$ . Let $\left\{x_{n}\right\}_{n\geq 1}\subset(a,b)$ be a sequence such that $x_{n}\overset{\text{a.s.}}{\longrightarrow}x^{\ast}$ for some $x^{\ast}\in(a,b)$ . If $f(x)$ is differentiable at $x^{\ast}$ , then

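$$\partial_{-}f_{n}(x_{n})\overset{\text{a.s.}}{\longrightarrow}f^{\prime}(x^{\ast})\quad\text{and}\quad\partial_{+}f_{n}(x_{n})\overset{\text{a.s.}}{\longrightarrow}f^{\prime}(x^{\ast}),\tag{91}$$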

where $\partial_{-}f_{n}(x)$ and $\partial_{+}f_{n}(x)$ denote the left and right derivatives of $f_{n}(x)$ , respectively.

Proof:

Similar to the proof of Lemma 3, we construct two sequences: $\left\{\beta_{k}\right\}_{k\geq 1}$ is strictly increasing and converges to $x^{\ast}$ from the left, whereas $\left\{\gamma_{k}\right\}_{k\geq 1}$ is strictly decreasing and converges to $x^{\ast}$ from the right. Denote by $\mathcal{A}$ the intersection of the event that $f_{n}(x)\rightarrow f(x)$ for all $x\in\left\{\beta_{k}\right\}_{k}\cup\left\{\gamma_{k}\right\}_{k}$ and the event that $x_{n}\rightarrow x^{\ast}$ . It is easily checked that $\mathbb{P}(\mathcal{A})=1$ . Next, we establish (91) within this almost sure event.

For any $i<j$ , since $\beta_{i}<\beta_{j}<x^{\ast}$ and $x_{n}\rightarrow x^{\ast}$ , we must have $\beta_{i}<\beta_{j}<x_{n}$ for all sufficiently large $n$ . By the convexity of $f_{n}(x)$ , its left derivatives always exist and we have

[TABLE]

for all sufficiently large $n$ . It follows that

[TABLE]

for all $i<j$ . Since

[TABLE]

we must have

[TABLE]

Working with the sequence $\left\{\gamma_{k}\right\}_{k\geq 1}$ and using similar arguments as above, we can show that

[TABLE]

Since $\lim\sup_{n}\partial_{-}f_{n}(x_{n})\leq\lim\sup_{n}\partial_{+}f_{n}(x_{n})$ , we use (92) and (93) to conclude that $\lim_{n}\partial_{-}f_{n}(x_{n})$ exists and that it is equal to $f^{\prime}(x^{\ast})$ . By similar arguments, the same claim also holds for the sequence $\left\{\partial_{+}f_{n}(x_{n})\right\}$ , and thus the proof is complete. ∎

Bibliography


[1] M. Unser and M. Eden, “Maximum likelihood estimation of linear signal parameters for Poisson processes,” IEEE Trans. Acoust., Speech, and Signal Process., vol. 36, no. 6, pp. 942–945, Jun. 1988.
[2] F. Yang, Y. M. Lu, L. Sbaiz, and M. Vetterli, “Bits from photons: Oversampled image acquisition using binary Poisson statistics,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 1421–1436, 2012.
[3] J. R. Fienup, “Phase retrieval algorithms: a comparison,” Applied Optics, vol. 21, no. 15, pp. 2758–2769, 1982.
[4] S. Rangan and V. K. Goyal, “Recursive consistent estimation with bounded noise,” IEEE Trans. Inf. Theory, vol. 47, no. 1, pp. 457–464, 2001.
[5] A. J. Dobson and A. Barnett, An Introduction to Generalized Linear Models, 3rd ed. Boca Raton: Chapman and Hall/CRC, May 2008.
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, Apr. 2011.
[7] E. J. Candes, T. Strohmer, and V. Voroninski, “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” Communications on Pure and Applied Mathematics, vol. 66, no. 8, pp. 1241–1274, 2013.
[8] E. J. Candes and X. Li, “Solving quadratic equations via PhaseLift when there are about as many equations as unknowns,” Foundations of Computational Mathematics, vol. 14, no. 5, pp. 1017–1026, 2014.
