A Random Matrix Approach to Neural Networks

Cosme Louart; Zhenyu Liao; Romain Couillet

arXiv:1702.05419·math.PR·June 30, 2017


TL;DR

This paper analyzes the spectral properties of Gram random matrices in neural networks using random matrix theory, providing deterministic equivalents and insights into network performance and hyperparameter tuning.

Contribution

It introduces a novel random matrix model for neural networks and derives deterministic equivalents for spectral measures, aiding understanding and optimization of random neural networks.

Findings

1. Deterministic equivalents for spectral measures of neural network matrices
2. Insights into asymptotic performance of single-layer random neural networks
3. Practical methods for hyperparameter tuning based on spectral analysis


Tables

Table 1: Values of $\Phi_{ab}$ for $w\sim\mathcal{N}(0,I_{p})$ , with $\angle(a,b)\equiv\frac{a^{\sf T}b}{\|a\|\|b\|}$ .

$\sigma(t)$	$\Phi_{ab}$
$t$	$a^{\sf T}b$
$\max(t,0)$	$\frac{1}{2\pi}\|a\|\|b\|\left(\angle(a,b){\rm acos}(-\angle(a,b))+\sqrt{1-\angle(a,b)^{2}}\right)$
$|t|$	$\frac{2}{\pi}\|a\|\|b\|\left(\angle(a,b){\rm asin}(\angle(a,b))+\sqrt{1-\angle(a,b)^{2}}\right)$
${\rm erf}(t)$	$\frac{2}{\pi}{\rm asin}\left(\frac{2a^{\sf T}b}{\sqrt{(1+2\|a\|^{2})(1+2\|b\|^{2})}}\right)$
$1_{\{t>0\}}$	$\frac{1}{2}-\frac{1}{2\pi}{\rm acos}(\angle(a,b))$
${\rm sign}(t)$	$\frac{2}{\pi}{\rm asin}(\angle(a,b))$
$\cos(t)$	$\exp\left(-\frac{1}{2}(\|a\|^{2}+\|b\|^{2})\right)\cosh(a^{\sf T}b)$
$\sin(t)$	$\exp\left(-\frac{1}{2}(\|a\|^{2}+\|b\|^{2})\right)\sinh(a^{\sf T}b)$

Equations

$\beta$
$Q\equiv\left(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T}\right)^{-1}$
$E_{\rm train}$
$E_{\rm test}$
$W=\varphi(\tilde{W})$
$0<\liminf_{n}\min\{p/n,T/n\}\leq\limsup_{n}\max\{p/n,T/n\}<\infty$
$\limsup_{n}\|X\|$
$\limsup_{n}\max_{ij}|Y_{ij}|$
$\Phi$
$P\left(\left|\frac{1}{T}\sigma^{\sf T}A\sigma-\frac{1}{T}\operatorname{tr}\Phi A\right|>t\right)$
$P\left(\left|\frac{1}{T}\sigma^{\sf T}A\sigma-\frac{1}{T}\operatorname{tr}\Phi A\right|>t\right)\leq Ce^{-cn\min(t,t^{2})}$
$\bar{Q}$
${\rm E}[Q]-\bar{Q}$
$\int fd\mu_{n}-\int fd\bar{\mu}_{n}$
$m_{\bar{\mu}_{n}}(z)$
$\delta_{z}$
$\Psi$
${\rm E}[QAQ]-\left(\bar{Q}A\bar{Q}+\frac{\frac{1}{n}\operatorname{tr}(\Psi\bar{Q}A\bar{Q})}{1-\frac{1}{n}\operatorname{tr}\Psi^{2}\bar{Q}^{2}}\bar{Q}\Psi\bar{Q}\right)$
$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm train}-\bar{E}_{\rm train}\right)$
$\bar{E}_{\rm train}$
$\left(\frac{\gamma}{\lambda_{i}+\gamma}\right)^{2}\left(1+\lambda_{i}\frac{\frac{1}{n}\sum_{j=1}^{T}\lambda_{j}(\lambda_{j}+\gamma)^{-2}}{1-\frac{1}{n}\sum_{j=1}^{T}\lambda_{j}^{2}(\lambda_{j}+\gamma)^{-2}}\right),\quad 1\leq i\leq T$
$\Phi_{AB}$
$\Psi_{AB}$
$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm test}-\bar{E}_{\rm test}\right)$
$\bar{E}_{\rm test}$
$+\frac{\frac{1}{n}\operatorname{tr}Y^{\sf T}Y\bar{Q}\Psi\bar{Q}}{1-\frac{1}{n}\operatorname{tr}(\Psi\bar{Q})^{2}}\left[\frac{1}{\hat{T}}\operatorname{tr}\Psi_{\hat{X}\hat{X}}-\frac{1}{\hat{T}}\operatorname{tr}(I_{T}+\gamma\bar{Q})(\Psi_{X\hat{X}}\Psi_{\hat{X}X}\bar{Q})\right].$
$\Phi_{ab}\equiv{\rm E}[\sigma(w^{\sf T}a)\sigma(w^{\sf T}b)]$
$\Phi_{ab}$
$+\zeta_{2}\zeta_{1}m_{3}\left[(a^{2})^{\sf T}b+a^{\sf T}(b^{2})\right]+\zeta_{2}\zeta_{0}m_{2}\left[\|a\|^{2}+\|b\|^{2}\right]+\zeta_{0}^{2}$
${\rm asin}(x)$
$\sinh(x)$
${\rm acos}(x)$


Full text

A Random Matrix Approach to Neural Networks

Cosme Louart, Zhenyu Liao, Romain Couillet

CentraleSupélec, University of Paris–Saclay, France.

Abstract

This article studies the Gram random matrix model $G=\frac{1}{T}\Sigma^{\sf T}\Sigma$ , $\Sigma=\sigma(WX)$ , classically found in the analysis of random feature maps and random neural networks, where $X=[x_{1},\ldots,x_{T}]\in{\mathbb{R}}^{p\times T}$ is a (data) matrix of bounded norm, $W\in{\mathbb{R}}^{n\times p}$ is a matrix of independent zero-mean unit variance entries, and $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ is a Lipschitz continuous (activation) function, with $\sigma(WX)$ understood entry-wise. By means of a key concentration of measure lemma arising from non-asymptotic random matrix arguments, we prove that, as $n,p,T$ grow large at the same rate, the resolvent $Q=(G+\gamma I_{T})^{-1}$ , for $\gamma>0$ , behaves similarly to the resolvent met in sample covariance matrix models, involving notably the moment $\Phi=\frac{T}{n}{\rm E}[G]$ , which provides in passing a deterministic equivalent for the empirical spectral measure of $G$ . Application-wise, this result enables the estimation of the asymptotic performance of single-layer random neural networks. This in turn provides practical insights into the underlying mechanisms at play in random neural networks, entailing several unexpected consequences, as well as a fast practical means to tune the network hyperparameters.

MSC subject classifications: 60B20, 62M45.

Couillet’s work is supported by the ANR Project RMT4GRAPH (ANR-14-CE28-0006).

1 Introduction

Artificial neural networks, developed in the late fifties (Rosenblatt, 1958) in an attempt to build machines capable of brain-like behaviors, today enjoy unprecedented research interest, notably in their applications to computer vision and machine learning at large (Krizhevsky, Sutskever and Hinton, 2012; Schmidhuber, 2015), where superhuman performance on specific tasks is now commonly achieved. Recent progress in neural network performance, however, finds its source in the processing power of modern computers and in the availability of large datasets rather than in the development of new mathematics. In fact, for lack of appropriate tools to understand the theoretical behavior of the non-linear activations and deterministic data dependence underlying these networks, the discrepancy between mathematical and practical (heuristic) studies of neural networks has kept widening. A first salient problem in harnessing neural networks lies in their being completely designed upon a deterministic training dataset $X=[x_{1},\ldots,x_{T}]\in{\mathbb{R}}^{p\times T}$ , so that their resulting performance intricately depends first and foremost on $X$ . Recent works have nonetheless established that, when smartly designed, mere randomly connected neural networks can achieve performance close to that reached by entirely data-driven network designs (Rahimi and Recht, 2007; Saxe et al., 2011). As a matter of fact, to handle gigantic databases, the computationally expensive learning phase (the so-called backpropagation of the error method) typical of deep neural network structures becomes impractical, while it was recently shown that smartly designed single-layer random networks (as studied presently) can already reach superhuman capabilities (Cambria et al., 2015) and beat expert knowledge in specific fields (Jaeger and Haas, 2004). 
These various findings have opened the road to the study of neural networks by means of statistical and probabilistic tools (Choromanska et al., 2015; Giryes, Sapiro and Bronstein, 2015). The second problem relates to the non-linear activation functions present at each neuron, which have long been known (as opposed to linear activations) to help design universal approximators for any input-output target map (Hornik, Stinchcombe and White, 1989).

In this work, we propose an original random matrix-based approach to understand the end-to-end regression performance of single-layer random artificial neural networks, sometimes referred to as extreme learning machines (Huang, Zhu and Siew, 2006; Huang et al., 2012), when the number $T$ and size $p$ of the input dataset are large and scale proportionally with the number $n$ of neurons in the network. These networks can also be seen, from a more immediate statistical viewpoint, as a mere linear ridge-regressor relating a random feature map $\sigma(WX)\in{\mathbb{R}}^{n\times T}$ of explanatory variables $X=[x_{1},\ldots,x_{T}]\in{\mathbb{R}}^{p\times T}$ and target variables $y=[y_{1},\ldots,y_{T}]\in{\mathbb{R}}^{d\times T}$ , for $W\in{\mathbb{R}}^{n\times p}$ a randomly designed matrix and $\sigma(\cdot)$ a non-linear ${\mathbb{R}}\to{\mathbb{R}}$ function (applied component-wise). Our approach has several interesting features both for theoretical and practical considerations. It is first one of the few known attempts to move the random matrix realm away from matrices with independent or linearly dependent entries. Notable exceptions are the line of works surrounding kernel random matrices (El Karoui, 2010; Couillet and Benaych-Georges, 2016) as well as large dimensional robust statistics models (Couillet, Pascal and Silverstein, 2015; El Karoui, 2013; Zhang, Cheng and Singer, 2014). Here, to alleviate the non-linear difficulty, we exploit concentration of measure arguments (Ledoux, 2005) for non-asymptotic random matrices, thereby pushing further the original ideas of (El Karoui, 2009; Vershynin, 2012) established for simpler random matrix models. 
While we believe that more powerful, albeit more computationally intensive, tools (such as an appropriate adaptation of the Gaussian tools advocated in (Pastur and Ŝerbina, 2011)) cannot be avoided to handle advanced considerations in neural networks, we demonstrate here that the concentration of measure phenomenon allows one to fully characterize the main quantities at the heart of the single-layer regression problem at hand.

In terms of practical applications, our findings shed light on the still incompletely understood extreme learning machines, which have proved extremely efficient in handling machine learning problems involving large to huge datasets (Huang et al., 2012; Cambria et al., 2015) at a computationally affordable cost. But our objective is also to pave the path to the understanding of more involved neural network structures, featuring notably multiple layers and some steps of learning by means of backpropagation of the error.

Our main contribution is twofold. From a theoretical perspective, we first obtain a key lemma, Lemma 1, on the concentration of quadratic forms of the type $\sigma(w^{\sf T}X)A\sigma(X^{\sf T}w)$ where $w=\varphi(\tilde{w})$ , $\tilde{w}\sim\mathcal{N}(0,I_{p})$ , with $\varphi:{\mathbb{R}}\to{\mathbb{R}}$ and $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ Lipschitz functions, and where $X\in{\mathbb{R}}^{p\times T}$ and $A\in{\mathbb{R}}^{T\times T}$ are deterministic matrices. This non-asymptotic result (valid for all $n,p,T$ ) is then exploited under a simultaneous growth regime for $n,p,T$ and boundedness conditions on $\|X\|$ and $\|A\|$ to obtain, in Theorem 1, a deterministic approximation $\bar{Q}$ of the expected resolvent ${\rm E}[Q]$ , where $Q=(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T})^{-1}$ , $\gamma>0$ , $\Sigma=\sigma(WX)$ , for some $W=\varphi(\tilde{W})$ , $\tilde{W}\in{\mathbb{R}}^{n\times p}$ having independent $\mathcal{N}(0,1)$ entries. As the resolvent of a matrix (or operator) is an important proxy for the characterization of its spectrum (see e.g., (Pastur and Ŝerbina, 2011; Akhiezer and Glazman, 1993)), this result therefore allows for the characterization of the asymptotic spectral properties of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ , such as its limiting spectral measure in Theorem 2.

Application-wise, the theoretical findings are an important preliminary step for the understanding and improvement of various statistical methods based on random features in the large dimensional regime. Specifically, here, we consider the question of linear ridge-regression from random feature maps, which coincides with the aforementioned single hidden-layer random neural network known as extreme learning machine. We show that, under mild conditions, both the training $E_{\rm train}$ and testing $E_{\rm test}$ mean-square errors, respectively corresponding to the regression errors on known input-output pairs $(x_{1},y_{1}),\ldots,(x_{T},y_{T})$ (with $x_{i}\in{\mathbb{R}}^{p}$ , $y_{i}\in{\mathbb{R}}^{d}$ ) and unknown pairings $(\hat{x}_{1},\hat{y}_{1}),\ldots,(\hat{x}_{\hat{T}},\hat{y}_{\hat{T}})$ , almost surely converge to deterministic limiting values as $n,p,T$ grow large at the same rate (while $d$ is kept constant) for every fixed ridge-regression parameter $\gamma>0$ .

Simulations on real image datasets are provided that corroborate our results.

These findings provide new insights into the roles played by the activation function $\sigma(\cdot)$ and the random distribution of the entries of $W$ in random feature maps as well as by the ridge-regression parameter $\gamma$ in the neural network performance. We notably exhibit and prove some peculiar behaviors, such as the impossibility for the network to carry out elementary Gaussian mixture classification tasks, when either the activation function or the random weights distribution are ill chosen.

Besides, for the practitioner, the theoretical formulas retrieved in this work allow for a fast offline tuning of the aforementioned hyperparameters of the neural network, notably when $T$ is not too large compared to $p$ . In particular, the graphical results provided in the course of the article were obtained with a $100$- to $500$-fold gain in computation time of theory over simulations.

The remainder of the article is structured as follows: in Section 2, we introduce the mathematical model of the system under investigation. Our main results are then described and discussed in Section 3, the proofs of which are deferred to Section 5. Section 4 discusses our main findings. The article closes on concluding remarks on envisioned extensions of the present work in Section 6. The appendix provides some intermediary lemmas of constant use throughout the proof section.

Reproducibility: Python 3 codes used to produce the results of Section 4 are available at https://github.com/Zhenyu-LIAO/RMT4ELM

Notations: The norm $\|\cdot\|$ is understood as the Euclidean norm for vectors and the operator norm for matrices, while the norm $\|\cdot\|_{F}$ is the Frobenius norm for matrices. All vectors in the article are understood as column vectors.

2 System Model

We consider a ridge-regression task on random feature maps defined as follows. Each input datum $x\in{\mathbb{R}}^{p}$ is multiplied by a matrix $W\in{\mathbb{R}}^{n\times p}$ ; a non-linear function $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ is then applied entry-wise to the vector $Wx$ , thereby providing a set of $n$ random features $\sigma(Wx)\in{\mathbb{R}}^{n}$ for each datum $x\in{\mathbb{R}}^{p}$ . The output $z\in{\mathbb{R}}^{d}$ of the linear regression is the product $z=\beta^{\sf T}\sigma(Wx)$ for some matrix $\beta\in{\mathbb{R}}^{n\times d}$ to be designed.

From a neural network viewpoint, the $n$ neurons of the network are the virtual units operating the mapping $W_{i\cdot}x\mapsto\sigma(W_{i\cdot}x)$ ( $W_{i\cdot}$ being the $i$ -th row of $W$ ), for $1\leq i\leq n$ . The neural network then operates in two phases: a training phase where the regression matrix $\beta$ is learned based on a known input-output dataset pair $(X,Y)$ and a testing phase where, for $\beta$ now fixed, the network operates on a new input dataset $\hat{X}$ with corresponding unknown output $\hat{Y}$ .

During the training phase, based on a set of known input $X=[x_{1},\ldots,x_{T}]\in{\mathbb{R}}^{p\times T}$ and output $Y=[y_{1},\ldots,y_{T}]\in{\mathbb{R}}^{d\times T}$ datasets, the matrix $\beta$ is chosen so as to minimize the mean square error $\frac{1}{T}\sum_{i=1}^{T}\|z_{i}-y_{i}\|^{2}+\gamma\|\beta\|_{F}^{2}$ , where $z_{i}=\beta^{\sf T}\sigma(Wx_{i})$ and $\gamma>0$ is some regularization factor. Solving for $\beta$ , this leads to the explicit ridge-regressor

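$$\beta\equiv\frac{1}{T}\Sigma\left(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T}\right)^{-1}Y^{\sf T}\qquad(1)$$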

where we defined $\Sigma\equiv\sigma(WX)$ . This follows from differentiating the mean square error along $\beta$ to obtain $0=\gamma\beta+\frac{1}{T}\sum_{i=1}^{T}\sigma(Wx_{i})(\beta^{\sf T}\sigma(Wx_{i})-y_{i})^{\sf T}$ , so that $(\frac{1}{T}\Sigma\Sigma^{\sf T}+\gamma I_{n})\beta=\frac{1}{T}\Sigma Y^{\sf T}$ which, along with $(\frac{1}{T}\Sigma\Sigma^{\sf T}+\gamma I_{n})^{-1}\Sigma=\Sigma(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T})^{-1}$ , gives the result.
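The identity $(\frac{1}{T}\Sigma\Sigma^{\sf T}+\gamma I_{n})^{-1}\Sigma=\Sigma(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T})^{-1}$ used above can be checked numerically. The following NumPy sketch computes the regressor both from the $n\times n$ normal equations and from the $T\times T$ form and compares them. The sizes, the Gaussian toy data, the ReLU activation and the function names are illustrative assumptions made for this example only; they are not taken from the authors' code.

```python
import numpy as np

def ridge_beta(Sigma, Y, gamma):
    """T x T form: beta = (1/T) Sigma (Sigma^T Sigma / T + gamma I_T)^{-1} Y^T, shape (n, d)."""
    T = Sigma.shape[1]
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    return Sigma @ Q @ Y.T / T

def ridge_beta_primal(Sigma, Y, gamma):
    """n x n form: solve (Sigma Sigma^T / T + gamma I_n) beta = Sigma Y^T / T."""
    n, T = Sigma.shape
    return np.linalg.solve(Sigma @ Sigma.T / T + gamma * np.eye(n), Sigma @ Y.T / T)

rng = np.random.default_rng(0)
n, p, T, d, gamma = 40, 20, 60, 3, 0.1          # illustrative sizes (assumption)
W = rng.standard_normal((n, p))                 # Gaussian weights, i.e. phi(t) = t
X = rng.standard_normal((p, T)) / np.sqrt(p)    # toy data of bounded norm (assumption)
Y = rng.standard_normal((d, T))                 # toy targets (assumption)
Sigma = np.maximum(W @ X, 0.0)                  # sigma = ReLU, applied entry-wise

beta_dual = ridge_beta(Sigma, Y, gamma)
beta_primal = ridge_beta_primal(Sigma, Y, gamma)
max_gap = float(np.max(np.abs(beta_dual - beta_primal)))
print("largest entry-wise gap between the two forms:", max_gap)
```

The $T\times T$ form is the one used in the remainder because it exposes the resolvent $Q$; in practice one would presumably pick whichever of $n$ and $T$ is smaller.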

In the remainder, we will also denote

$$Q\equiv\left(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T}\right)^{-1}$$

the resolvent of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ . The matrix $Q$ naturally appears as a key quantity in the performance analysis of the neural network. Notably, the mean-square error $E_{\rm train}$ on the training dataset $X$ is given by

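$$E_{\rm train}=\frac{1}{T}\left\|Y^{\sf T}-\Sigma^{\sf T}\beta\right\|_{F}^{2}=\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}YQ^{2}.$$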

Under the growth rate assumptions on $n,p,T$ taken below, it shall appear that the random variable $E_{\rm train}$ concentrates around its mean, so that ${\rm E}[Q^{2}]$ emerges as a central object in the asymptotic evaluation of $E_{\rm train}$ .
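The role of $Q^{2}$ in the training error can be made concrete. With the ridge-regressor $\beta=\frac{1}{T}\Sigma QY^{\sf T}$ , the residual is $Y^{\sf T}-\Sigma^{\sf T}\beta=(I_{T}-\frac{1}{T}\Sigma^{\sf T}\Sigma Q)Y^{\sf T}=\gamma QY^{\sf T}$ , so that the training mean-square error equals $\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}YQ^{2}$ . The sketch below checks this equality on toy data; all sizes and data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, d, gamma = 50, 30, 80, 2, 0.5          # illustrative sizes (assumption)
W = rng.standard_normal((n, p))
X = rng.standard_normal((p, T)) / np.sqrt(p)    # toy data (assumption)
Y = rng.standard_normal((d, T))                 # toy targets (assumption)
Sigma = np.maximum(W @ X, 0.0)                  # ReLU features, shape (n, T)

# Resolvent of (1/T) Sigma^T Sigma and the associated ridge-regressor.
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
beta = Sigma @ Q @ Y.T / T

# Training error computed from the residuals ...
e_train_direct = np.linalg.norm(Y.T - Sigma.T @ beta, "fro") ** 2 / T
# ... and from the resolvent expression gamma^2 / T * tr(Y^T Y Q^2).
e_train_resolvent = gamma**2 / T * np.trace(Y.T @ Y @ Q @ Q)

print(e_train_direct, e_train_resolvent)
```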

The testing phase of the neural network is more interesting in practice as it unveils the actual performance of neural networks. For a test dataset $\hat{X}\in{\mathbb{R}}^{p\times\hat{T}}$ of length $\hat{T}$ , with unknown output $\hat{Y}\in{\mathbb{R}}^{d\times\hat{T}}$ , the test mean-square error is defined by

$$E_{\rm test}=\frac{1}{\hat{T}}\left\|\hat{Y}^{\sf T}-\hat{\Sigma}^{\sf T}\beta\right\|_{F}^{2}$$

where $\hat{\Sigma}=\sigma(W\hat{X})$ and $\beta$ is the same as used in (1) (and thus only depends on $(X,Y)$ and $\gamma$ ). One of the key questions in the analysis of such an elementary neural network lies in the determination of $\gamma$ which minimizes $E_{\rm test}$ (and is thus said to have good generalization performance). Notably, small $\gamma$ values are known to reduce $E_{\rm train}$ but to induce the popular overfitting issue which generally increases $E_{\rm test}$ , while large $\gamma$ values engender both large values for $E_{\rm train}$ and $E_{\rm test}$ .
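The trade-off just described can be explored by brute force, which is the simulation-based baseline that the theoretical formulas of this article aim to replace. The sketch below sweeps $\gamma$ over a grid and records both errors for one draw of $W$ . The linear-teacher data model, the noise level, the grid and the helper name `errors` are hypothetical choices for this example. Only the monotone growth of $E_{\rm train}$ with $\gamma$ is guaranteed (it follows from $E_{\rm train}=\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}YQ^{2}$ ); the shape of the $E_{\rm test}$ curve depends on the data.

```python
import numpy as np

def errors(Sigma, Y, Sigma_hat, Y_hat, gamma):
    """Training and test mean-square errors of the ridge-regressor for a given gamma."""
    T, T_hat = Sigma.shape[1], Sigma_hat.shape[1]
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    beta = Sigma @ Q @ Y.T / T
    e_train = np.linalg.norm(Y.T - Sigma.T @ beta, "fro") ** 2 / T
    e_test = np.linalg.norm(Y_hat.T - Sigma_hat.T @ beta, "fro") ** 2 / T_hat
    return e_train, e_test

rng = np.random.default_rng(5)
n, p, T, T_hat = 100, 50, 200, 200               # illustrative sizes (assumption)
teacher = rng.standard_normal((1, p))            # hypothetical linear teacher
X = rng.standard_normal((p, T)) / np.sqrt(p)
X_hat = rng.standard_normal((p, T_hat)) / np.sqrt(p)
Y = teacher @ X + 0.5 * rng.standard_normal((1, T))            # noisy training targets
Y_hat = teacher @ X_hat + 0.5 * rng.standard_normal((1, T_hat))

W = rng.standard_normal((n, p))                  # the same W is used for training and testing
Sigma = np.maximum(W @ X, 0.0)
Sigma_hat = np.maximum(W @ X_hat, 0.0)

gammas = np.logspace(-4, 2, 13)
curve = np.array([errors(Sigma, Y, Sigma_hat, Y_hat, g) for g in gammas])
best_gamma = gammas[np.argmin(curve[:, 1])]      # grid point with the smallest test error
print("gamma minimizing E_test on this grid:", best_gamma)
```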

From a mathematical standpoint though, the study of $E_{\rm test}$ brings forward some technical difficulties that do not allow for as simple a treatment through the present concentration of measure methodology as the study of $E_{\rm train}$ . Nonetheless, the analysis of $E_{\rm train}$ allows at least for heuristic approaches to become available, which we shall exploit to propose an asymptotic deterministic approximation for $E_{\rm test}$ .

From a technical standpoint, we shall make the following set of assumptions on the mapping $x\mapsto\sigma(Wx)$ .

Assumption 1 (Subgaussian $W$ ).

The matrix $W$ is defined by

$$W\equiv\varphi(\tilde{W})$$

(understood entry-wise), where $\tilde{W}$ has independent and identically distributed $\mathcal{N}(0,1)$ entries and $\varphi(\cdot)$ is $\lambda_{\varphi}$ -Lipschitz.

For $a=\varphi(b)\in{\mathbb{R}}^{\ell}$ , $\ell\geq 1$ , with $b\sim\mathcal{N}(0,I_{\ell})$ , we shall subsequently denote $a\sim\mathcal{N}_{\varphi}(0,I_{\ell})$ .

Under the notations of Assumption 1, we have in particular $W_{ij}\sim\mathcal{N}(0,1)$ if $\varphi(t)=t$ and $W_{ij}\sim\mathcal{U}(-1,1)$ (the uniform distribution on $[-1,1]$ ) if $\varphi(t)=-1+2\frac{1}{\sqrt{2\pi}}\int_{t}^{\infty}e^{-x^{2}/2}dx$ ( $\varphi$ is here a $\sqrt{2/\pi}$ -Lipschitz map).
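As a sanity check of the uniform example, the sketch below assumes that the integrand is the standard Gaussian density, so that $\varphi(t)=-1+2P(\mathcal{N}(0,1)>t)={\rm erfc}(t/\sqrt{2})-1$ . Under that assumption, $\varphi(\tilde{w})$ should have mean $0$ and variance $1/3$ (the moments of $\mathcal{U}(-1,1)$ ), and the steepest slope of $\varphi$ , reached at $t=0$ , should equal $\sqrt{2/\pi}$ . The function name and the sample size are illustrative.

```python
import math
import random

def phi_uniform(t):
    """phi(t) = -1 + 2 * P(N(0,1) > t) = erfc(t / sqrt(2)) - 1, with values in [-1, 1]."""
    return math.erfc(t / math.sqrt(2.0)) - 1.0

random.seed(0)
samples = [phi_uniform(random.gauss(0.0, 1.0)) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

# Steepest slope of phi, reached at t = 0, estimated by a central difference.
h = 1e-6
slope_at_zero = abs(phi_uniform(h) - phi_uniform(-h)) / (2.0 * h)

print("mean:", mean, "variance:", var, "(uniform on (-1, 1): 0 and 1/3)")
print("slope at 0:", slope_at_zero, "vs sqrt(2/pi):", math.sqrt(2.0 / math.pi))
```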

We further need the following regularity condition on the function $\sigma$ .

Assumption 2 (Function $\sigma$ ).

The function $\sigma$ is Lipschitz continuous with parameter $\lambda_{\sigma}$ .

This assumption holds for many of the activation functions traditionally considered in neural networks, such as sigmoid functions, the rectified linear unit $\sigma(t)=\max(t,0)$ , or the absolute value operator.

When considering the interesting case of simultaneously large data and random features (or neurons), we shall then make the following growth rate assumptions.

Assumption 3 (Growth Rate).

As $n\to\infty$ ,

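$$0<\liminf_{n}\min\{p/n,T/n\}\leq\limsup_{n}\max\{p/n,T/n\}<\infty$$

while $\gamma,\lambda_{\sigma},\lambda_{\varphi}>0$ and $d$ are kept constant. In addition,

$$\limsup_{n}\|X\|<\infty,\qquad\limsup_{n}\max_{ij}|Y_{ij}|<\infty.$$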

3 Main Results

3.1 Main technical results and training performance

As a standard preliminary step in the asymptotic random matrix analysis of the expectation ${\rm E}[Q]$ of the resolvent $Q=(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T})^{-1}$ , a convergence of quadratic forms based on the row vectors of $\Sigma$ is necessary (see e.g., (Marc̆enko and Pastur, 1967; Silverstein and Bai, 1995)). Such results are usually obtained by exploiting the independence (or linear dependence) in the vector entries. This not being the case here, as the entries of the vector $\sigma(X^{\sf T}w)$ are in general not independent, we resort to a concentration of measure approach, as advocated in (El Karoui, 2009). The following lemma, stated here in a non-asymptotic random matrix regime (that is, without necessarily resorting to Assumption 3), and thus of independent interest, provides this concentration result. For this lemma, we need first to define the following key matrix

$$\Phi\equiv{\rm E}\left[\sigma(w^{\sf T}X)^{\sf T}\sigma(w^{\sf T}X)\right]\qquad(2)$$

of size $T\times T$ , where $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ .

Lemma 1 (Concentration of quadratic forms).

Let Assumptions 1–2 hold. Let also $A\in{\mathbb{R}}^{T\times T}$ such that $\|A\|\leq 1$ and, for $X\in{\mathbb{R}}^{p\times T}$ and $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ , define the random vector $\sigma\equiv\sigma(w^{\sf T}X)^{\sf T}\in{\mathbb{R}}^{T}$ . Then,

[TABLE]

for $t_{0}\equiv|\sigma(0)|+\lambda_{\varphi}\lambda_{\sigma}\|X\|\sqrt{\frac{p}{T}}$ and $C,c>0$ independent of all other parameters. In particular, under the additional Assumption 3,

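$$P\left(\left|\frac{1}{T}\sigma^{\sf T}A\sigma-\frac{1}{T}\operatorname{tr}\Phi A\right|>t\right)\leq Ce^{-cn\min(t,t^{2})}$$

for some $C,c>0$ .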

Note that this lemma partially extends concentration of measure results involving quadratic forms, see e.g., (Rudelson et al., 2013, Theorem 1.1), to non-linear vectors.
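The concentration described by Lemma 1 can be observed in simulation. The sketch below takes $\varphi(t)=t$ (Gaussian $w$ ) and $\sigma(t)=\max(t,0)$ , for which $\Phi$ has the closed form listed in the ReLU row of Table 1, and compares many draws of $\frac{1}{T}\sigma^{\sf T}A\sigma$ with $\frac{1}{T}\operatorname{tr}\Phi A$ . The dimensions, the toy data matrix, the diagonal choice of $A$ and the helper name `phi_relu` are assumptions made for this illustration; the experiment does not estimate the constants $C,c$ .

```python
import numpy as np

def phi_relu(X):
    """Phi for sigma(t) = max(t, 0) and w ~ N(0, I_p), from the ReLU row of Table 1."""
    norms = np.linalg.norm(X, axis=0)
    outer = np.outer(norms, norms)
    ang = np.clip(X.T @ X / outer, -1.0, 1.0)    # angle(x_i, x_j), clipped against round-off
    return outer / (2.0 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1.0 - ang**2))

rng = np.random.default_rng(2)
p, T, trials = 100, 150, 2000                    # illustrative sizes (assumption)
X = rng.standard_normal((p, T)) / np.sqrt(p)     # toy data of bounded norm (assumption)
A = np.diag(np.linspace(0.0, 1.0, T))            # deterministic test matrix with ||A|| = 1

target = np.trace(phi_relu(X) @ A) / T           # (1/T) tr(Phi A)

w = rng.standard_normal((trials, p))             # each row is one draw of w
S = np.maximum(w @ X, 0.0)                       # each row is one draw of sigma(w^T X)
quad = np.sum((S @ A) * S, axis=1) / T           # (1/T) sigma^T A sigma for every draw

deviation = np.abs(quad - target)
mean_gap = abs(float(quad.mean()) - target)
print("target:", target)
print("mean and 95th percentile of |deviation|:", deviation.mean(), np.quantile(deviation, 0.95))
```

Rerunning with larger `p` and `T` at a fixed ratio should shrink the reported deviations, in line with the exponential bound.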

With this result in place, the standard resolvent approaches of random matrix theory apply, providing our main theoretical finding as follows.

Theorem 1 (Asymptotic equivalent for ${\rm E}[Q]$ ).

Let Assumptions 1–3 hold and define $\bar{Q}$ as

$$\bar{Q}\equiv\left(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T}\right)^{-1}$$

where $\delta$ is implicitly defined as the unique positive solution to $\delta=\frac{1}{T}\operatorname{tr}\Phi\bar{Q}$ . Then, for all $\varepsilon>0$ , there exists $c>0$ such that

[TABLE]
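A hedged numerical illustration of Theorem 1 follows. It assumes $\bar{Q}=(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T})^{-1}$ , which is the form consistent with the relations $\Psi\bar{Q}+\gamma\bar{Q}=I_{T}$ and $\lambda_{i}(\Psi)=\frac{n}{T(1+\delta)}\lambda_{i}(\Phi)$ used later in the text, and it solves $\delta=\frac{1}{T}\operatorname{tr}\Phi\bar{Q}$ by a plain fixed-point iteration started at $\delta=0$ (a numerical choice of this sketch, not a procedure taken from the article). Since $\bar{Q}$ and $\Phi$ share eigenvectors, only the eigenvalues of $\Phi$ are needed. The comparison is made on the normalized trace $\frac{1}{T}\operatorname{tr}Q$ , with Gaussian $W$ and ReLU activation; sizes and helper names are assumptions.

```python
import numpy as np

def phi_relu(X):
    """Phi for sigma(t) = max(t, 0) and w ~ N(0, I_p), from the ReLU row of Table 1."""
    norms = np.linalg.norm(X, axis=0)
    outer = np.outer(norms, norms)
    ang = np.clip(X.T @ X / outer, -1.0, 1.0)
    return outer / (2.0 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1.0 - ang**2))

def solve_delta(lam_phi, n, gamma, tol=1e-12, max_iter=10_000):
    """Iterate delta <- (1/T) sum_i lam_i / ((n/T) lam_i / (1 + delta) + gamma)."""
    T = lam_phi.size
    delta = 0.0
    for _ in range(max_iter):
        new = float(np.mean(lam_phi / (n / T * lam_phi / (1.0 + delta) + gamma)))
        converged = abs(new - delta) < tol
        delta = new
        if converged:
            break
    return delta

rng = np.random.default_rng(3)
n, p, T, gamma, trials = 200, 100, 150, 1.0, 20   # illustrative sizes (assumption)
X = rng.standard_normal((p, T)) / np.sqrt(p)      # toy data (assumption)

lam_phi = np.linalg.eigvalsh(phi_relu(X))         # eigenvalues of the symmetric matrix Phi
delta = solve_delta(lam_phi, n, gamma)
residual = abs(delta - float(np.mean(lam_phi / (n / T * lam_phi / (1.0 + delta) + gamma))))
trace_bar = float(np.mean(1.0 / (n / T * lam_phi / (1.0 + delta) + gamma)))   # (1/T) tr Qbar

trace_emp = 0.0
for _ in range(trials):
    Sigma = np.maximum(rng.standard_normal((n, p)) @ X, 0.0)   # fresh Gaussian W each time
    eig = np.linalg.eigvalsh(Sigma.T @ Sigma / T)
    trace_emp += float(np.mean(1.0 / (eig + gamma))) / trials  # Monte Carlo (1/T) tr Q

print("delta:", delta)
print("(1/T) tr Qbar:", trace_bar, " simulated (1/T) tr Q:", trace_emp)
```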

As a corollary of Theorem 1 along with a concentration argument on $\frac{1}{T}\operatorname{tr}Q$ , we have the following result on the spectral measure of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ , which may be seen as a non-linear extension of (Silverstein and Bai, 1995) for which $\sigma(t)=t$ .

Theorem 2 (Limiting spectral measure of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ ).

Let Assumptions 1–3 hold and, for $\lambda_{1},\ldots,\lambda_{T}$ the eigenvalues of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ , define $\mu_{n}=\frac{1}{T}\sum_{i=1}^{T}{\bm{\delta}}_{\lambda_{i}}$ . Then, for every bounded continuous function $f$ , with probability one

$$\int fd\mu_{n}-\int fd\bar{\mu}_{n}\to 0$$

where $\bar{\mu}_{n}$ is the measure defined through its Stieltjes transform $m_{\bar{\mu}_{n}}(z)\equiv\int(t-z)^{-1}d\bar{\mu}_{n}(t)$ given, for $z\in\{w\in{\mathbb{C}},~{}\Im[w]>0\}$ , by

[TABLE]

with $\delta_{z}$ the unique solution in $\{w\in{\mathbb{C}},~{}\Im[w]>0\}$ of

[TABLE]

Note that $\bar{\mu}_{n}$ has a well-known form, already met in early random matrix works (e.g., (Silverstein and Bai, 1995)) on sample covariance matrix models. Notably, $\bar{\mu}_{n}$ is also the deterministic equivalent of the empirical spectral measure of $\frac{1}{T}P^{\sf T}W^{\sf T}WP$ for any deterministic matrix $P\in{\mathbb{R}}^{p\times T}$ such that $P^{\sf T}P=\Phi$ . As such, to some extent, the results above provide a consistent asymptotic linearization of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ . From standard spiked model arguments (see e.g., (Benaych-Georges and Nadakuditi, 2012)), the result $\|{\rm E}[Q]-\bar{Q}\|\to 0$ further suggests that also the eigenvectors associated to isolated eigenvalues of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ (if any) behave similarly to those of $\frac{1}{T}P^{\sf T}W^{\sf T}WP$ , a remark that has fundamental importance in the neural network performance understanding.

However, as shall be shown in Section 3.3, and contrary to empirical covariance matrix models of the type $P^{\sf T}W^{\sf T}WP$ , $\Phi$ explicitly depends on the distribution of $W_{ij}$ (that is, beyond its first two moments). Thus, the aforementioned linearization of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ , and subsequently the deterministic equivalent for $\mu_{n}$ , are not universal with respect to the distribution of zero-mean unit variance $W_{ij}$ . This is in striking contrast to the many linear random matrix models studied to date which often exhibit such universal behaviors. This property too will have deep consequences in the performance of neural networks as shall be shown through Figure 3 in Section 4 for an example where inappropriate choices for the law of $W$ lead to network failure to fulfill the regression task.

For convenience in the following, letting $\delta$ and $\Phi$ be defined as in Theorem 1, we shall denote

$$\Psi\equiv\frac{n}{T}\frac{\Phi}{1+\delta}.\qquad(3)$$

Theorem 1 provides the central step in the evaluation of $E_{\rm train}$ , for which not only ${\rm E}[Q]$ but also ${\rm E}[Q^{2}]$ needs be estimated. This last ingredient is provided in the following proposition.

Proposition 1 (Asymptotic equivalent for ${\rm E}[QAQ]$ ).

Let Assumptions 1–3 hold and $A\in{\mathbb{R}}^{T\times T}$ be a symmetric non-negative definite matrix which is either $\Phi$ or a matrix with uniformly bounded operator norm (with respect to $T$ ). Then, for all $\varepsilon>0$ , there exists $c>0$ such that, for all $n$ ,

[TABLE]

As an immediate consequence of Proposition 1, we have the following result on the training mean-square error of single-layer random neural networks.

Theorem 3 (Asymptotic training mean-square error).

Let Assumptions 1–3 hold and $\bar{Q}$ , $\Psi$ be defined as in Theorem 1 and (3). Then, for all $\varepsilon>0$ ,

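$$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm train}-\bar{E}_{\rm train}\right)\to 0$$

almost surely, where

$$\bar{E}_{\rm train}=\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}Y\bar{Q}\left[\frac{\frac{1}{n}\operatorname{tr}\Psi\bar{Q}^{2}}{1-\frac{1}{n}\operatorname{tr}(\Psi\bar{Q})^{2}}\Psi+I_{T}\right]\bar{Q}.$$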

Since $\bar{Q}$ and $\Phi$ share the same orthogonal eigenvector basis, it appears that $E_{\rm train}$ depends on the alignment between the right singular vectors of $Y$ and the eigenvectors of $\Phi$ , with weighting coefficients

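$$\left(\frac{\gamma}{\lambda_{i}+\gamma}\right)^{2}\left(1+\lambda_{i}\frac{\frac{1}{n}\sum_{j=1}^{T}\lambda_{j}(\lambda_{j}+\gamma)^{-2}}{1-\frac{1}{n}\sum_{j=1}^{T}\lambda_{j}^{2}(\lambda_{j}+\gamma)^{-2}}\right),\quad 1\leq i\leq T$$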

where we denoted $\lambda_{i}=\lambda_{i}(\Psi)$ , $1\leq i\leq T$ , the eigenvalues of $\Psi$ (which depend on $\gamma$ through $\lambda_{i}(\Psi)=\frac{n}{T(1+\delta)}\lambda_{i}(\Phi)$ ). If $\liminf_{n}n/T>1$ , it is easily seen that $\delta\to 0$ as $\gamma\to 0$ , in which case $E_{\rm train}\to 0$ almost surely. However, in the more interesting case in practice where $\limsup_{n}n/T<1$ , $\delta\to\infty$ as $\gamma\to 0$ and $E_{\rm train}$ consequently does not have a simple limit (see Section 4.3 for more discussion on this aspect).
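The eigenvalue form above lends itself to a direct computation of $\bar{E}_{\rm train}$ . The sketch below assumes $\Psi=\frac{n}{T}\frac{\Phi}{1+\delta}$ and $\bar{Q}=(\Psi+\gamma I_{T})^{-1}$ , in which case $\bar{E}_{\rm train}$ is $\frac{1}{T}$ times the sum over $i$ of the weight $\left(\frac{\gamma}{\lambda_{i}+\gamma}\right)^{2}(1+\lambda_{i}r)$ multiplied by the energy of $Y$ along the $i$ -th eigenvector of $\Phi$ , with $r$ the ratio appearing in the weights. It then compares the result with simulated values of $E_{\rm train}$ for Gaussian $W$ and ReLU activation. The toy targets, the sizes and the helper names are hypothetical, and the fixed-point iteration for $\delta$ is a numerical choice of this sketch.

```python
import numpy as np

def phi_relu(X):
    """Phi for sigma(t) = max(t, 0) and w ~ N(0, I_p), from the ReLU row of Table 1."""
    norms = np.linalg.norm(X, axis=0)
    outer = np.outer(norms, norms)
    ang = np.clip(X.T @ X / outer, -1.0, 1.0)
    return outer / (2.0 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1.0 - ang**2))

def solve_delta(lam_phi, n, gamma, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for delta = (1/T) tr(Phi Qbar), written with eigenvalues of Phi."""
    T = lam_phi.size
    delta = 0.0
    for _ in range(max_iter):
        new = float(np.mean(lam_phi / (n / T * lam_phi / (1.0 + delta) + gamma)))
        converged = abs(new - delta) < tol
        delta = new
        if converged:
            break
    return delta

def e_train_bar(Phi, Y, n, gamma):
    """Deterministic training error from the eigen-decomposition of Phi."""
    T = Phi.shape[0]
    lam_phi, U = np.linalg.eigh(Phi)
    delta = solve_delta(lam_phi, n, gamma)
    lam = n / T * lam_phi / (1.0 + delta)               # eigenvalues of Psi
    q = 1.0 / (lam + gamma)                             # eigenvalues of Qbar
    r = (np.sum(lam * q**2) / n) / (1.0 - np.sum((lam * q) ** 2) / n)
    weights = (gamma * q) ** 2 * (1.0 + lam * r)        # weighting coefficients of the text
    energy = np.sum((Y @ U) ** 2, axis=0)               # energy of Y along each eigenvector
    return float(np.sum(weights * energy) / T)

rng = np.random.default_rng(4)
n, p, T, gamma, trials = 200, 100, 150, 1.0, 20         # illustrative sizes (assumption)
X = rng.standard_normal((p, T)) / np.sqrt(p)            # toy data (assumption)
Y = np.sign(X[:1, :])                                   # toy 1 x T targets (assumption)

theory = e_train_bar(phi_relu(X), Y, n, gamma)
sims = []
for _ in range(trials):
    Sigma = np.maximum(rng.standard_normal((n, p)) @ X, 0.0)
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    sims.append(gamma**2 / T * np.trace(Y.T @ Y @ Q @ Q))
simulated = float(np.mean(sims))
print("theory:", theory, " simulation average:", simulated)
```

The deterministic value needs one eigen-decomposition of $\Phi$ , whereas the simulation needs one $T\times T$ inversion per draw of $W$ , which gives a rough idea of where the computational gain of the theoretical formulas comes from.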

Theorem 3 is also reminiscent of applied random matrix works on empirical covariance matrix models, such as (Bai and Silverstein, 2007; Kammoun et al., 2009), thereby further emphasizing the strong connection between the non-linear matrix $\sigma(WX)$ and its linear counterpart $W\Phi^{\frac{1}{2}}$ .

As a side note, observe that, to obtain Theorem 3, we could have used the fact that $\operatorname{tr}Y^{\sf T}YQ^{2}=-\frac{\partial}{\partial\gamma}\operatorname{tr}Y^{\sf T}YQ$ which, along with some analyticity arguments (for instance when extending the definition of $Q=Q(\gamma)$ to $Q(z)$ , $z\in{\mathbb{C}}$ ), would have directly ensured that $\frac{\partial\bar{Q}}{\partial\gamma}$ is an asymptotic equivalent for $-{\rm E}[Q^{2}]$ , without the need for the explicit derivation of Proposition 1. Nonetheless, as shall appear subsequently, Proposition 1 is also a proxy to the asymptotic analysis of $E_{\rm test}$ . Besides, the technical proof of Proposition 1 quite interestingly showcases the strength of the concentration of measure tools under study here.
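The derivative relation in this side note follows from $\frac{\partial Q}{\partial\gamma}=-Q^{2}$ and can be confirmed with a finite difference. In the sketch below, a generic Gaussian matrix stands in for $\sigma(WX)$ , since the identity holds for any real $\Sigma$ ; the sizes, the step `h` and the helper name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, d = 30, 20, 2                          # illustrative sizes (assumption)
Sigma = rng.standard_normal((n, T))          # stand-in for sigma(WX); any real matrix works here
Y = rng.standard_normal((d, T))
G = Sigma.T @ Sigma / T

def trace_yq(gamma):
    """tr(Y^T Y Q(gamma)) with Q(gamma) = (G + gamma I_T)^{-1}."""
    Q = np.linalg.inv(G + gamma * np.eye(T))
    return float(np.trace(Y.T @ Y @ Q))

gamma, h = 0.7, 1e-5
Q = np.linalg.inv(G + gamma * np.eye(T))
lhs = float(np.trace(Y.T @ Y @ Q @ Q))                          # tr(Y^T Y Q^2)
rhs = -(trace_yq(gamma + h) - trace_yq(gamma - h)) / (2.0 * h)  # -d/dgamma tr(Y^T Y Q)
print(lhs, rhs)
```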

3.2 Testing performance

As previously mentioned, harnessing the asymptotic testing performance $E_{\rm test}$ seems, to the best of the authors’ knowledge, out of current reach with the sole concentration of measure arguments used for the proof of the previous main results. Nonetheless, if not fully effective, these arguments allow for an intuitive derivation of a deterministic equivalent for $E_{\rm test}$ , which is strongly supported by simulation results. We provide this result below under the form of a yet unproven claim, a heuristic derivation of which is provided at the end of Section 5.

To introduce this result, let $\hat{X}=[\hat{x}_{1},\ldots,\hat{x}_{\hat{T}}]\in{\mathbb{R}}^{p\times\hat{T}}$ be a set of input data with corresponding output $\hat{Y}=[\hat{y}_{1},\ldots,\hat{y}_{\hat{T}}]\in{\mathbb{R}}^{d\times\hat{T}}$ . We also define $\hat{\Sigma}=\sigma(W\hat{X})\in{\mathbb{R}}^{n\times\hat{T}}$ . We assume that $\hat{X}$ and $\hat{Y}$ satisfy the same growth rate conditions as $X$ and $Y$ in Assumption 3. To introduce our claim, we need to extend the definition of $\Phi$ in (2) and $\Psi$ in (3) to the following notations: for every pair of matrices $(A,B)$ of appropriate dimensions,

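$$\Phi_{AB}\equiv{\rm E}\left[\sigma(w^{\sf T}A)^{\sf T}\sigma(w^{\sf T}B)\right],\qquad\Psi_{AB}\equiv\frac{n}{T}\frac{\Phi_{AB}}{1+\delta}$$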

where $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ . In particular, $\Phi=\Phi_{XX}$ and $\Psi=\Psi_{XX}$ .

With these notations in place, we are in position to state our claimed result.

Conjecture 1 (Deterministic equivalent for $E_{\rm test}$ ).

Let Assumptions 1–2 hold and $\hat{X},\hat{Y}$ satisfy the same conditions as $X,Y$ in Assumption 3. Then, for all $\varepsilon>0$ ,

$$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm test}-\bar{E}_{\rm test}\right)\to 0$$

almost surely, where

[TABLE]

While not immediate at first sight, one can confirm (using notably the relation $\Psi\bar{Q}+\gamma\bar{Q}=I_{T}$ ) that, for $(\hat{X},\hat{Y})=(X,Y)$ , $\bar{E}_{\rm train}=\bar{E}_{\rm test}$ , as expected.

In order to evaluate the results of Theorem 3 and Conjecture 1 in practice, a first step is to estimate the values of $\Phi_{AB}$ for various activation functions $\sigma(\cdot)$ of practical interest. Such results, which call for completely different mathematical tools (mostly based on integration tricks), are provided in the subsequent section.

3.3 Evaluation of $\Phi_{AB}$

The evaluation of $\Phi_{AB}={\rm E}[\sigma(w^{\sf T}A)^{\sf T}\sigma(w^{\sf T}B)]$ for arbitrary matrices $A,B$ naturally boils down to the evaluation of its individual entries and thus to the calculus, for arbitrary vectors $a,b\in{\mathbb{R}}^{p}$ , of

[TABLE]

The evaluation of (4) can be obtained through various integration tricks for a wide family of mappings $\varphi(\cdot)$ and activation functions $\sigma(\cdot)$ . The most popular activation functions in neural networks are sigmoid functions, such as $\sigma(t)={\rm erf}(t)\equiv\frac{2}{\sqrt{\pi}}\int_{0}^{t}e^{-u^{2}}du$ , as well as the so-called rectified linear unit (ReLU) defined by $\sigma(t)=\max(t,0)$ which has been recently popularized as a result of its robust behavior in deep neural networks. In physical artificial neural networks implemented using light projections, $\sigma(t)=|t|$ is the preferred choice. Note that all aforementioned functions are Lipschitz continuous and therefore in accordance with Assumption 2.

We believe that the results of this article could be extended to more general settings that do not abide by the prescription of Assumptions 1 and 2, as discussed in Section 4. In particular, since the key ingredient in the proof of all our results is that the vector $\sigma(w^{\sf T}X)$ follows a concentration of measure phenomenon, induced by the Gaussianity of $\tilde{w}$ (if $w=\varphi(\tilde{w})$ ), the Lipschitz character of $\sigma$ and the norm boundedness of $X$ , it is likely, although not necessarily simple to prove, that $\sigma(w^{\sf T}X)$ may still concentrate under relaxed assumptions. This is likely the case for more generic vectors $w$ than $\mathcal{N}_{\varphi}(0,I_{p})$ as well as for a larger class of activation functions, such as polynomial or piece-wise Lipschitz continuous functions.

In anticipation of these likely generalizations, we provide in Table 1 the values of $\Phi_{ab}$ for $w\sim\mathcal{N}(0,I_{p})$ (i.e., for $\varphi(t)=t$ ) and for a set of functions $\sigma(\cdot)$ not necessarily satisfying Assumption 2. Denoting $\Phi\equiv\Phi(\sigma(t))$ , it is interesting to remark that, since $\arccos(x)=-\arcsin(x)+\frac{\pi}{2}$ , $\Phi(\max(t,0))=\Phi(\frac{1}{2}t)+\Phi(\frac{1}{2}|t|)$ . Also, $[\Phi(\cos(t))+\Phi(\sin(t))]_{a,b}=\exp(-\frac{1}{2}\|a-b\|^{2})$ , a result reminiscent of (Rahimi and Recht, 2007). (Footnote 1: It is in particular not difficult to prove, based on our framework, that, as $n/T\to\infty$ , a random neural network composed of $n/2$ neurons with activation function $\sigma(t)=\cos(t)$ and $n/2$ neurons with activation function $\sigma(t)=\sin(t)$ implements a Gaussian difference kernel.) Finally, note that $\Phi({\rm erf}(\kappa t))\to\Phi({\rm sign}(t))$ as $\kappa\to\infty$ , inducing that the extension by continuity of ${\rm erf}(\kappa t)$ to ${\rm sign}(t)$ propagates to their associated kernels.
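The two identities stated above can be checked numerically. The sketch below assumes the usual closed-form Gaussian integrals for $\sigma(t)=t$ , $|t|$ , $\max(t,0)$ , $\cos(t)$ and $\sin(t)$ , written from memory in terms of $\angle(a,b)$ ; they are meant to match the corresponding entries of Table 1, which is not reproduced in this section, and should be read as assumptions. It also includes a plain Monte Carlo estimate of $\Phi_{ab}$ . All function names are hypothetical.

```python
import numpy as np

def _cos_angle(a, b):
    """angle(a, b) = a^T b / (||a|| ||b||), clipped for numerical safety."""
    c = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.clip(c, -1.0, 1.0))

def phi_linear(a, b):
    """sigma(t) = t (assumed closed form: a^T b)."""
    return float(a @ b)

def phi_abs(a, b):
    """sigma(t) = |t| (assumed closed form)."""
    c = _cos_angle(a, b)
    scale = np.linalg.norm(a) * np.linalg.norm(b)
    return float(2.0 / np.pi * scale * (c * np.arcsin(c) + np.sqrt(1.0 - c**2)))

def phi_relu(a, b):
    """sigma(t) = max(t, 0) (assumed closed form)."""
    c = _cos_angle(a, b)
    scale = np.linalg.norm(a) * np.linalg.norm(b)
    return float(scale / (2.0 * np.pi) * (c * np.arccos(-c) + np.sqrt(1.0 - c**2)))

def phi_cos(a, b):
    """sigma(t) = cos(t) (assumed closed form)."""
    return float(np.exp(-0.5 * (a @ a + b @ b)) * np.cosh(a @ b))

def phi_sin(a, b):
    """sigma(t) = sin(t) (assumed closed form)."""
    return float(np.exp(-0.5 * (a @ a + b @ b)) * np.sinh(a @ b))

def phi_monte_carlo(sigma, a, b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[sigma(w^T a) sigma(w^T b)] for w ~ N(0, I_p)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, a.size))
    return float(np.mean(sigma(w @ a) * sigma(w @ b)))
```

With these definitions, `phi_relu(a, b)` coincides with `0.25 * phi_linear(a, b) + 0.25 * phi_abs(a, b)` (the factor $\frac{1}{4}$ comes from $\Phi(\frac{1}{2}\sigma)=\frac{1}{4}\Phi(\sigma)$ ), and `phi_cos(a, b) + phi_sin(a, b)` coincides with $\exp(-\frac{1}{2}\|a-b\|^{2})$ .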

In addition to these results for $w\sim\mathcal{N}(0,I_{p})$ , we also evaluated $\Phi_{ab}={\rm E}[\sigma(w^{\sf T}a)\sigma(w^{\sf T}b)]$ for $\sigma(t)=\zeta_{2}t^{2}+\zeta_{1}t+\zeta_{0}$ and $w\in{\mathbb{R}}^{p}$ a vector of independent and identically distributed entries of zero mean and moments of order $k$ equal to $m_{k}$ (so $m_{1}=0$ ); $w$ is not restricted here to satisfy $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ . In this case, we find

[TABLE]

where we defined $(a^{2})\equiv[a_{1}^{2},\ldots,a_{p}^{2}]^{\sf T}$ .

It is already interesting to remark that, while classical random matrix models exhibit a well-known universality property (in the sense that their limiting spectral distribution is independent of the moments of order higher than two of the entries of the involved random matrix, here $W$ ), for $\sigma(\cdot)$ a polynomial of order two, $\Phi$ and thus $\mu_{n}$ strongly depend on ${\rm E}[W_{ij}^{k}]$ for $k=3,4$ . We shall see in Section 4 that this remark has troubling consequences. We will notably infer (and confirm via simulations) that the studied neural network may provably fail to fulfill a specific task if the $W_{ij}$ are Bernoulli with zero mean and unit variance but succeed with possibly high performance if the $W_{ij}$ are standard Gaussian (which is explained by the disappearance or not of the terms $(a^{\sf T}b)^{2}$ and $(a^{2})^{\sf T}(b^{2})$ in (3.3), the former occurring if $m_{4}=m_{2}^{2}$ ).
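The dependence on $m_{4}$ can be made concrete on the purely quartic part of the computation. For independent zero-mean entries, a direct expansion gives ${\rm E}[(w^{\sf T}a)^{2}(w^{\sf T}b)^{2}]=m_{2}^{2}(\|a\|^{2}\|b\|^{2}+2(a^{\sf T}b)^{2})+(m_{4}-3m_{2}^{2})(a^{2})^{\sf T}(b^{2})$ . The sketch below checks this single term (not the full expression of $\Phi_{ab}$ ) by exhaustive enumeration for $W_{ij}\in\{-1,1\}$ with equal probabilities, for which $m_{2}=m_{4}=1$ ; the function names and the test vectors are hypothetical.

```python
import itertools
import numpy as np

def quartic_moment_formula(a, b, m2, m4):
    """E[(w^T a)^2 (w^T b)^2] for i.i.d. zero-mean entries with moments m2, m4."""
    return float(m2**2 * ((a @ a) * (b @ b) + 2.0 * (a @ b) ** 2)
                 + (m4 - 3.0 * m2**2) * ((a**2) @ (b**2)))

def quartic_moment_rademacher_exact(a, b):
    """Exact expectation, averaging over all 2^p sign vectors w in {-1, +1}^p."""
    p = a.size
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=p):
        w = np.array(signs)
        total += (w @ a) ** 2 * (w @ b) ** 2
    return total / 2**p

# Hypothetical small vectors; p = 4 keeps the enumeration to 16 terms.
a = np.array([0.5, -1.0, 2.0, 0.25])
b = np.array([1.5, 0.5, -0.75, 1.0])
exact_value = quartic_moment_rademacher_exact(a, b)
formula_value = quartic_moment_formula(a, b, m2=1.0, m4=1.0)
```

The enumeration is exponential in $p$ and is only intended as a sanity check on small vectors.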

4 Practical Outcomes

We discuss in this section the outcomes of our main results in terms of neural network application. The technical discussions on Theorem 1 and Proposition 1 will be made in the course of their respective proofs in Section 5.

4.1 Simulation Results

We first provide in this section a simulation corroborating the findings of Theorem 3 and suggesting the validity of Conjecture 1. To this end, we consider the task of classifying the popular MNIST image database (LeCun, Cortes and Burges, 1998), composed of grayscale handwritten digits of size $28\times 28$ , with a neural network composed of $n=512$ units and standard Gaussian $W$ . We represent here each image as a $p=784$ -size vector; $1\,024$ images of sevens and $1\,024$ images of nines were extracted from the database and, for each class, evenly split into $512$ training and $512$ test images. The database images were jointly centered and scaled so as to fall close to the setting of Assumption 3 on $X$ and $\hat{X}$ (an admissible preprocessing intervention). The columns of the output values $Y$ and $\hat{Y}$ were taken as unidimensional ( $d=1$ ) with $Y_{1j},\hat{Y}_{1j}\in\{-1,1\}$ depending on the image class. Figure 1 displays the simulated (averaged over $100$ realizations of $W$ ) versus theoretical values of $E_{\rm train}$ and $E_{\rm test}$ for three choices of Lipschitz continuous functions $\sigma(\cdot)$ , as a function of $\gamma$ .

Note that a perfect match between theory and practice is observed, for both $E_{\rm train}$ and $E_{\rm test}$ , which is a strong indicator of both the validity of Conjecture 1 and the adequacy of Assumption 3 to the MNIST dataset.

We subsequently provide in Figure 2 the comparison between theoretical formulas and practical simulations for a set of functions $\sigma(\cdot)$ which do not satisfy Assumption 2, i.e., either discontinuous or non-Lipschitz maps. The closeness between both sets of curves is again remarkably good, although to a lesser extent than for the Lipschitz continuous functions of Figure 1. Also, the achieved performances are generally worse than those observed in Figure 1.

It should be noted that the performance estimates provided by Theorem 3 and Conjecture 1 can be efficiently implemented at low computational cost in practice. Indeed, by diagonalizing $\Phi$ (which is a marginal cost independent of $\gamma$ ), $\bar{E}_{\rm train}$ can be computed for all $\gamma$ through mere vector operations; similarly $\bar{E}_{\rm test}$ is obtained at the marginal cost of a basis change of $\Phi_{\hat{X}X}$ and of the matrix product $\Phi_{X\hat{X}}\Phi_{\hat{X}X}$ , all remaining operations being accessible through vector operations. As a consequence, generating the aforementioned theoretical curves using the linked Python script was found to be $100$ to $500$ times faster than generating the simulated network performances. Beyond their theoretical interest, the provided formulas therefore allow for an efficient offline tuning of the network hyperparameters, notably the choice of an appropriate value for the ridge-regression parameter $\gamma$ .
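The "diagonalize once, then sweep $\gamma$ " idea can be illustrated on the simulated side of the comparison. The sketch below is not the linked script: it assumes the ridge-regression solution $\beta=\frac{1}{T}\Sigma(\frac{1}{T}\Sigma^{\sf T}\Sigma+\gamma I_{T})^{-1}Y^{\sf T}$ , for which $E_{\rm train}=\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}YQ^{2}$ , and applies the trick to the empirical Gram matrix $G$ rather than to $\Phi$ . Function names are hypothetical.

```python
import numpy as np

def train_mse_direct(Sigma, Y, gamma):
    """E_train = (1/T) ||Y - beta^T Sigma||_F^2 with the (assumed) ridge solution beta."""
    n, T = Sigma.shape
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    beta = Sigma @ Q @ Y.T / T            # n x d ridge-regression weights
    residual = Y - beta.T @ Sigma
    return float(np.sum(residual**2) / T)

def train_mse_grid(Sigma, Y, gammas):
    """Same quantity for every gamma, after a single eigendecomposition of G."""
    T = Sigma.shape[1]
    eigval, eigvec = np.linalg.eigh(Sigma.T @ Sigma / T)   # G = U diag(l) U^T
    energy = np.sum((Y @ eigvec) ** 2, axis=0)             # ||Y u_i||^2 for each i
    g = np.asarray(gammas, dtype=float)[:, None]
    # E_train(gamma) = (gamma^2 / T) * sum_i ||Y u_i||^2 / (l_i + gamma)^2
    return (g**2 * energy / (eigval + g) ** 2).sum(axis=1) / T
```

Once `eigval` and `energy` are available, each additional value of $\gamma$ costs $O(T)$ operations, against a full $T\times T$ inversion per value of $\gamma$ for `train_mse_direct`.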

4.2 The underlying kernel

Theorem 1 and the subsequent theoretical findings importantly reveal that the neural network performances are directly related to the Gram matrix $\Phi$ , which acts as a deterministic kernel on the dataset $X$ . This is in fact a well-known result found e.g., in (Williams, 1998) where it is shown that, as $n\to\infty$ alone, the neural network behaves as a mere kernel operator (this observation is retrieved here in the subsequent Section 4.3). This remark was then exploited in (Rahimi and Recht, 2007) and subsequent works, where random feature maps of the type $x\mapsto\sigma(Wx)$ are proposed as a computationally efficient proxy to evaluate kernels $(x,y)\mapsto\Phi(x,y)$ .

As discussed previously, the formulas for $\bar{E}_{\rm train}$ and $\bar{E}_{\rm test}$ suggest that good performances are achieved if the dominant eigenvectors of $\Phi$ show a good alignment to $Y$ (and similarly for $\Phi_{X\hat{X}}$ and $\hat{Y}$ ). This naturally drives us to look for a priori simple regression tasks where poor choices of $\Phi$ may annihilate the neural network performance. Following recent works on the asymptotic performance analysis of kernel methods for Gaussian mixture models (Couillet and Benaych-Georges, 2016; Zhenyu Liao, 2017; Mai and Couillet, 2017) and (Couillet and Kammoun, 2016), we describe here such a task.

Let $x_{1},\ldots,x_{T/2}\sim\mathcal{N}(0,\frac{1}{p}C_{1})$ and $x_{T/2+1},\ldots,x_{T}\sim\mathcal{N}(0,\frac{1}{p}C_{2})$ where $C_{1}$ and $C_{2}$ are such that $\operatorname{tr}C_{1}=\operatorname{tr}C_{2}$ , $\|C_{1}\|,\|C_{2}\|$ are bounded, and $\operatorname{tr}(C_{1}-C_{2})^{2}=O(p)$ . Accordingly, $y_{1},\ldots,y_{T/2}=-1$ and $y_{T/2+1},\ldots,y_{T}=1$ . It is proved in the aforementioned articles that, under these conditions, it is theoretically possible, in the large $p,T$ limit, to classify the data using a kernel least-square support vector machine (that is, with a training dataset) or with a kernel spectral clustering method (that is, in a completely unsupervised manner) with a non-trivial limiting error probability (i.e., neither zero nor one). This scenario has the interesting feature that $x_{i}^{\sf T}x_{j}\to 0$ almost surely for all $i\neq j$ while $\|x_{i}\|^{2}-\frac{1}{p}\operatorname{tr}(\frac{1}{2}C_{1}+\frac{1}{2}C_{2})\to 0$ , almost surely, irrespective of the class of $x_{i}$ , thereby allowing for a Taylor expansion of the non-linear kernels as proposed early on in (El Karoui, 2010).
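The two concentration features of this scenario are easy to observe numerically. The sketch below draws such a mixture with the diagonal covariances $C_{1}=\operatorname{diag}(I_{p/2},4I_{p/2})$ and $C_{2}=\operatorname{diag}(4I_{p/2},I_{p/2})$ used later in this section; the sizes $p$ and $T$ , the seed and the function name are hypothetical choices made for illustration.

```python
import numpy as np

def sample_mixture(p, T, rng):
    """Columns x_i ~ N(0, C_1/p) for i <= T/2 and N(0, C_2/p) otherwise."""
    half = p // 2
    c1 = np.concatenate([np.ones(half), 4.0 * np.ones(p - half)])   # diagonal of C_1
    c2 = np.concatenate([4.0 * np.ones(half), np.ones(p - half)])   # diagonal of C_2
    X = rng.standard_normal((p, T))
    X[:, : T // 2] *= np.sqrt(c1 / p)[:, None]
    X[:, T // 2 :] *= np.sqrt(c2 / p)[:, None]
    y = np.concatenate([-np.ones(T // 2), np.ones(T - T // 2)])
    return X, y, c1, c2

rng = np.random.default_rng(0)
p, T = 4096, 40
X, y, c1, c2 = sample_mixture(p, T, rng)

gram = X.T @ X
sq_norms = np.diag(gram)                        # ||x_i||^2
off_diag = gram[~np.eye(T, dtype=bool)]         # x_i^T x_j for i != j
target = 0.5 * (c1.sum() + c2.sum()) / p        # (1/p) tr(C_1/2 + C_2/2)
```

For this choice, $\operatorname{tr}C_{1}=\operatorname{tr}C_{2}$ and `target` equals $2.5$ ; the squared norms cluster around it for both classes, while the off-diagonal inner products are of order $p^{-1/2}$ , so that neither carries first-order class information.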

Transposed to our present setting, the aforementioned Taylor expansion allows for a consistent approximation $\tilde{\Phi}$ of $\Phi$ by an information-plus-noise (spiked) random matrix model (see e.g., (Loubaton and Vallet, 2010; Benaych-Georges and Nadakuditi, 2012)). In the present Gaussian mixture context, it is shown in (Couillet and Benaych-Georges, 2016) that data classification is (asymptotically at least) only possible if $\tilde{\Phi}_{ij}$ explicitly contains the quadratic term $(x_{i}^{\sf T}x_{j})^{2}$ (or combinations of $(x_{i}^{2})^{\sf T}x_{j}$ , $(x_{j}^{2})^{\sf T}x_{i}$ , and $(x_{i}^{2})^{\sf T}(x_{j}^{2})$ ). In particular, letting $a,b\sim\mathcal{N}(0,C_{i})$ with $i=1,2$ , it is easily seen from Table 1 that only $\max(t,0)$ , $|t|$ , and $\cos(t)$ can realize the task. Indeed, we have the following Taylor expansions around $x=0$ :

[TABLE]

where only the last three functions (only found in the expression of $\Phi_{ab}$ corresponding to $\sigma(t)=\max(t,0)$ , $|t|$ , or $\cos(t)$ ) exhibit a quadratic term.

More surprisingly maybe, recalling now Equation (3.3) which considers not necessarily Gaussian $W_{ij}$ with moments $m_{k}$ of order $k$ , a more refined analysis shows that the aforementioned Gaussian mixture classification task will fail if $m_{3}=0$ and $m_{4}=m_{2}^{2}$ , so for instance for $W_{ij}\in\{-1,1\}$ Bernoulli with parameter $\frac{1}{2}$ . The performance comparison of this scenario is shown in the top part of Figure 3 for $\sigma(t)=-\frac{1}{2}t^{2}+1$ and $C_{1}=\operatorname{\rm diag}(I_{p/2},4I_{p/2})$ , $C_{2}=\operatorname{\rm diag}(4I_{p/2},I_{p/2})$ , for $W_{ij}\sim\mathcal{N}(0,1)$ and $W_{ij}\sim{\rm Bern}$ (that is, Bernoulli $\{(-1,\frac{1}{2}),(1,\frac{1}{2})\}$ ). The choice of $\sigma(t)=\zeta_{2}t^{2}+\zeta_{1}t+\zeta_{0}$ with $\zeta_{1}=0$ is motivated by (Couillet and Benaych-Georges, 2016; Couillet and Kammoun, 2016) where it is shown, in a somewhat different setting, that this choice is optimal for class recovery. Note that, while the test performances are overall rather weak in this setting, for $W_{ij}\sim\mathcal{N}(0,1)$ , $E_{\rm test}$ drops below one (the amplitude of the $\hat{Y}_{ij}$ ), thereby indicating that non-trivial classification is performed. This is not so for the Bernoulli case $W_{ij}\sim{\rm Bern}$ , where $E_{\rm test}$ is systematically greater than $|\hat{Y}_{ij}|=1$ . This is theoretically explained by the fact that, from Equation (3.3), $\Phi_{ij}$ contains structural information about the data classes through the term $2m_{2}^{2}(x_{i}^{\sf T}x_{j})^{2}+(m_{4}-3m_{2}^{2})(x_{i}^{2})^{\sf T}(x_{j}^{2})$ which induces an information-plus-noise model for $\Phi$ as long as $2m_{2}^{2}+(m_{4}-3m_{2}^{2})\neq 0$ , i.e., $m_{4}\neq m_{2}^{2}$ (see (Couillet and Benaych-Georges, 2016) for details).

This is visually seen in the bottom part of Figure 3 where the Gaussian scenario presents an isolated eigenvalue for $\Phi$ with corresponding structured eigenvector, which is not the case of the Bernoulli scenario. To complete this discussion, it appears relevant in the present setting to choose $W_{ij}$ in such a way that $m_{4}-m_{2}^{2}$ is far from zero, thus suggesting the interest of heavy-tailed distributions. To confirm this prediction, Figure 3 additionally displays the performance achieved and the spectrum of $\Phi$ observed for $W_{ij}\sim{\rm Stud}$ , that is, following a Student-t distribution with $\nu=7$ degrees of freedom normalized to unit variance (in this case $m_{2}=1$ and $m_{4}=5$ ). Figure 3 confirms the large superiority of this choice over the Gaussian case (note nonetheless the slight inaccuracy of our theoretical formulas in this case, which is likely due to too small values of $p,n,T$ to accommodate $W_{ij}$ with higher order moments, an observation which is confirmed in simulations when letting $\nu$ be even smaller).
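The role of the fourth moment in this discussion reduces to the sign and size of $2m_{2}^{2}+(m_{4}-3m_{2}^{2})=m_{4}-m_{2}^{2}$ . As a small worked example, the sketch below evaluates this coefficient for the three distributions compared in Figure 3, using the standard expression $m_{4}=3(\nu-2)/(\nu-4)$ for a unit-variance Student-t variable with $\nu>4$ degrees of freedom; the function and dictionary names are hypothetical.

```python
def fourth_moment_student(nu):
    """m_4 of a Student-t variable with nu > 4 degrees of freedom, rescaled to unit variance."""
    if nu <= 4:
        raise ValueError("the fourth moment is finite only for nu > 4")
    return 3.0 * (nu - 2.0) / (nu - 4.0)

def class_information_coefficient(m2, m4):
    """2 m_2^2 + (m_4 - 3 m_2^2), which simplifies to m_4 - m_2^2."""
    return 2.0 * m2**2 + (m4 - 3.0 * m2**2)

# (m_2, m_4) for unit-variance entries.
MOMENTS = {
    "gaussian": (1.0, 3.0),
    "bernoulli": (1.0, 1.0),   # W_ij in {-1, +1} with probability 1/2 each
    "student7": (1.0, fourth_moment_student(7)),
}

coefficients = {
    name: class_information_coefficient(m2, m4) for name, (m2, m4) in MOMENTS.items()
}
```

The coefficient vanishes in the Bernoulli case, equals $2$ in the Gaussian case and $4$ for the Student-t distribution with $\nu=7$ , in line with the ordering of performances reported above.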

4.3 Limiting cases

We have suggested that $\Phi$ contains, in its dominant eigenmodes, all the usable information describing $X$ . In the Gaussian mixture example above, it was notably shown that $\Phi$ may completely fail to contain this information, making it impossible to perform the classification task, even if one were to take infinitely many neurons in the network. For $\Phi$ containing useful information about $X$ , it is intuitive to expect that both $\inf_{\gamma}\bar{E}_{\rm train}$ and $\inf_{\gamma}\bar{E}_{\rm test}$ become smaller as $n/T$ and $n/p$ become large. It is in fact easy to see that, if $\Phi$ is invertible (which is likely to occur in most cases if $\liminf_{n}T/p>1$ ), then

[TABLE]

and we fall back on the performance of a classical kernel regression. It is interesting in particular to note that, as the number of neurons $n$ becomes large, the effect of $\gamma$ on $E_{\rm test}$ flattens out. Therefore, a smart choice of $\gamma$ is only relevant for small (and thus computationally more efficient) neuron layers. This observation is depicted in Figure 4 where it is made clear that a growth of $n$ reduces $E_{\rm train}$ to zero while $E_{\rm test}$ saturates to a non-zero limit which becomes increasingly irrespective of $\gamma$ . Note additionally the interesting phenomenon occurring for $n\leq T$ where too small values of $\gamma$ induce important performance losses, thereby suggesting a strong importance of proper choices of $\gamma$ in this regime.

Of course, practical interest lies precisely in situations where $n$ is not too large. We may thus subsequently assume that $\limsup_{n}n/T<1$ . In this case, as suggested by Figures 1–2, the mean-square error performances achieved as $\gamma\to 0$ may predict the superiority of specific choices of $\sigma(\cdot)$ for optimally chosen $\gamma$ . It is important for this study to differentiate between cases where $r\equiv{\rm rank}(\Phi)$ is smaller or greater than $n$ . Indeed, observe that, with the spectral decomposition $\Phi=U_{r}\Lambda_{r}U_{r}^{\sf T}$ for $\Lambda_{r}\in{\mathbb{R}}^{r\times r}$ diagonal and $U_{r}\in{\mathbb{R}}^{T\times r}$ ,

[TABLE]

which satisfies, as $\gamma\to 0$ ,

[TABLE]

A phase transition therefore exists whereby $\delta$ assumes a finite positive value in the small $\gamma$ limit if $r/n<1$ , or scales like $1/\gamma$ otherwise.
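This phase transition can be made explicit in the simplest case $\Lambda_{r}=\lambda I_{r}$ . The sketch below assumes that $\delta$ is the positive solution of $\delta=\frac{1}{T}\operatorname{tr}\Phi\bar{Q}$ with $\bar{Q}=(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T})^{-1}$ (our reading of the definition in Theorem 1, which is not restated in this section). Under this assumption the fixed-point equation reduces to the quadratic $\gamma\delta^{2}+(\frac{(n-r)\lambda}{T}+\gamma)\delta-\frac{r\lambda}{T}=0$ , whose positive root is returned by the hypothetical helper below.

```python
import math

def delta_equal_eigenvalues(n, T, r, lam, gamma):
    """Positive root of gamma*d^2 + b*d - c = 0, b = (n-r)*lam/T + gamma, c = r*lam/T.

    Special case Phi = lam * U_r U_r^T of the (assumed) fixed-point equation for delta.
    """
    b = (n - r) * lam / T + gamma
    c = r * lam / T
    disc = math.sqrt(b * b + 4.0 * gamma * c)
    if b > 0:
        # Numerically stable form of the positive root when gamma is small.
        return 2.0 * c / (b + disc)
    return (-b + disc) / (2.0 * gamma)

# Hypothetical sizes on both sides of the transition r = n.
delta_low_rank = delta_equal_eigenvalues(n=100, T=200, r=50, lam=1.0, gamma=1e-9)
delta_high_rank = delta_equal_eigenvalues(n=100, T=200, r=150, lam=1.0, gamma=1e-9)
```

In this special case, $\delta\to\frac{r}{n-r}$ as $\gamma\to 0$ when $r<n$ , whereas $\gamma\delta\to\frac{(r-n)\lambda}{T}$ when $r>n$ , which is the $1/\gamma$ scaling mentioned above.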

As a consequence, if $r<n$ , as $\gamma\to 0$ , $\Psi\to\frac{n}{T}(1-\frac{r}{n})\Phi$ and $\bar{Q}\sim\frac{T}{n-r}U_{r}\Lambda_{r}^{-1}U_{r}^{\sf T}+\frac{1}{\gamma}V_{r}V_{r}^{\sf T}$ , where $V_{r}\in{\mathbb{R}}^{T\times(T-r)}$ is any matrix such that $[U_{r}~V_{r}]$ is orthogonal, so that $\Psi\bar{Q}\to U_{r}U_{r}^{\sf T}$ and $\Psi\bar{Q}^{2}\to U_{r}\Lambda_{r}^{-1}U_{r}^{\sf T}$ ; and thus, $\bar{E}_{\rm train}\to\frac{1}{T}\operatorname{tr}YV_{r}V_{r}^{\sf T}Y^{\sf T}=\frac{1}{T}\|YV_{r}\|^{2}_{F}$ , which states that the residual training error corresponds to the energy of $Y$ not captured by the space spanned by $\Phi$ . Since $E_{\rm train}$ is an increasing function of $\gamma$ , so is $\bar{E}_{\rm train}$ (at least for all large $n$ ) and thus $\frac{1}{T}\|YV_{r}\|^{2}_{F}$ corresponds to the lowest achievable asymptotic training error.

If instead $r>n$ (which is the most likely outcome in practice), as $\gamma\to 0$ , $\bar{Q}\sim\frac{1}{\gamma}(\frac{n}{T}\frac{\Phi}{\Delta}+I_{T})^{-1}$ and thus

[TABLE]

where $\Psi_{\Delta}=\frac{n}{T}\frac{\Phi}{\Delta}$ and $Q_{\Delta}=(\frac{n}{T}\frac{\Phi}{\Delta}+I_{T})^{-1}$ .

These results suggest that neural networks should be designed in a way that both reduces the rank of $\Phi$ and maintains a strong alignment between the dominant eigenvectors of $\Phi$ and the output matrix $Y$ .

Interestingly, if $X$ is assumed as above to be extracted from a Gaussian mixture and $Y\in{\mathbb{R}}^{1\times T}$ is a classification vector with $Y_{1j}\in\{-1,1\}$ , then the tools proposed in (Couillet and Benaych-Georges, 2016) (related to spiked random matrix analysis) allow for an explicit evaluation of the aforementioned limits as $n,p,T$ grow large. This analysis is however cumbersome and outside the scope of the present work.

5 Proof of the Main Results

In the remainder, we shall use extensively the following notations:

[TABLE]

i.e., $\sigma_{i}=\sigma(w_{i}^{\sf T}X)^{\sf T}$ . Also, we shall define $\Sigma_{-i}\in{\mathbb{R}}^{(n-1)\times T}$ the matrix $\Sigma$ with $i$ -th row removed, and correspondingly

[TABLE]

Finally, because of exchangeability, it shall often be convenient to work with the generic random vector $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ , the random vector $\sigma$ distributed as any of the $\sigma_{i}$ ’s, the random matrix $\Sigma_{-}$ distributed as any of the $\Sigma_{-i}$ ’s, and with the random matrix $Q_{-}$ distributed as any of the $Q_{-i}$ ’s.

5.1 Concentration Results on $\Sigma$

Our first results provide concentration of measure properties on functionals of $\Sigma$ . These results unfold from the following concentration inequality for Lipschitz maps of a Gaussian vector; see e.g., (Ledoux, 2005, Corollary 2.6, Propositions 1.3, 1.8) or (Tao, 2012, Theorem 2.1.12). For $d\in\mathbb{N}$ , consider $\mu$ the canonical Gaussian probability on $\mathbb{R}^{d}$ defined through its density $d\mu(w)=(2\pi)^{-\frac{d}{2}}e^{-\frac{1}{2}\|w\|^{2}}dw$ and $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ a $\lambda_{f}$ -Lipschitz function. Then, we have the following normal concentration

[TABLE]

where $C,c>0$ are independent of $d$ and $\lambda_{f}$ . As a corollary (see e.g., (Ledoux, 2005, Proposition 1.10)), for every $k\geq 1$ ,

[TABLE]
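The dimension-free nature of this normal concentration can be observed on the simplest $1$ -Lipschitz map, $f(w)=\|w\|$ : its mean grows like $\sqrt{d}$ , but its fluctuations remain of order $O(1)$ . The sketch below is only an empirical illustration under hypothetical sample sizes and seed, not a verification of the constants $C,c$ .

```python
import numpy as np

def norm_fluctuations(d, n_samples=2000, seed=0):
    """Empirical standard deviation of the 1-Lipschitz map w -> ||w|| under N(0, I_d)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, d))
    return float(np.std(np.linalg.norm(w, axis=1)))

# Fluctuations of ||w|| for increasing dimensions d.
fluctuations = {d: norm_fluctuations(d) for d in (10, 100, 1000)}
```

The three empirical standard deviations stay bounded by one irrespective of $d$ , as the concentration inequality with $\lambda_{f}=1$ predicts.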

The main approach to the proof of our results, starting with that of the key Lemma 1, is as follows: since $W_{ij}=\varphi(\tilde{W}_{ij})$ with $\tilde{W}_{ij}\sim\mathcal{N}(0,1)$ and $\varphi$ Lipschitz, the normal concentration of $\tilde{W}$ transfers to $W$ which further induces a normal concentration of the random vector $\sigma$ and the matrix $\Sigma$ , thereby implying that Lipschitz functionals of $\sigma$ or $\Sigma$ also concentrate. As pointed out earlier, these concentration results are used in place of the independence assumptions (and their multiple consequences on convergence of random variables) classically exploited in random matrix theory.

Notations: In all subsequent lemmas and proofs, the letters $c,c_{i},C,C_{i}>0$ will be used interchangeably as positive constants independent of the key equation parameters (notably $n$ and $t$ below) and may be reused from line to line. Additionally, the variable $\varepsilon>0$ will denote any small positive number; the variables $c,c_{i},C,C_{i}$ may depend on $\varepsilon$ .

We start by recalling the first part of the statement of Lemma 1 and subsequently providing its proof.

Lemma 2 (Concentration of quadratic forms).

Let Assumptions 1–2 hold. Let also $A\in{\mathbb{R}}^{T\times T}$ such that $\|A\|\leq 1$ and, for $X\in{\mathbb{R}}^{p\times T}$ and $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ , define the random vector $\sigma\equiv\sigma(w^{\sf T}X)^{\sf T}\in{\mathbb{R}}^{T}$ . Then,

[TABLE]

for $t_{0}\equiv|\sigma(0)|+\lambda_{\varphi}\lambda_{\sigma}\|X\|\sqrt{\frac{p}{T}}$ and $C,c>0$ independent of all other parameters.

Proof.

The layout of the proof is as follows: since the map $w\mapsto\frac{1}{T}\sigma^{\sf T}A\sigma$ is “quadratic” in $w$ and thus not Lipschitz (therefore not allowing for a natural transfer of the concentration of $w$ to $\frac{1}{T}\sigma^{\sf T}A\sigma$ ), we first prove that $\frac{1}{\sqrt{T}}\|\sigma\|$ satisfies a concentration inequality, which provides a high probability $O(1)$ bound on $\frac{1}{\sqrt{T}}\|\sigma\|$ . Conditioning on this event, the map $w\mapsto\frac{1}{\sqrt{T}}\sigma^{\sf T}A\sigma$ can then be shown to be Lipschitz (by isolating one of the $\sigma$ terms for bounding and the other one for retrieving the Lipschitz character) and, up to an appropriate control of concentration results under conditioning, the result is obtained.

Following this plan, we first provide a concentration inequality for $\|\sigma\|$ . To this end, note that the map $\psi:{\mathbb{R}}^{p}\to{\mathbb{R}}^{T}$ , $\tilde{w}\mapsto\sigma(\varphi(\tilde{w})^{\sf T}X)^{\sf T}$ is Lipschitz with parameter $\lambda_{\varphi}\lambda_{\sigma}\|X\|$ as the composition of the $\lambda_{\varphi}$ -Lipschitz function $\varphi:\tilde{w}\mapsto w$ , the $\|X\|$ -Lipschitz map ${\mathbb{R}}^{p}\to{\mathbb{R}}^{T}$ , $w\mapsto X^{\sf T}w$ and the $\lambda_{\sigma}$ -Lipschitz map ${\mathbb{R}}^{T}\to{\mathbb{R}}^{T}$ , $Y\mapsto\sigma(Y)$ . As a Gaussian vector, $\tilde{w}$ has a normal concentration and so does $\psi(\tilde{w})$ . Since the Euclidean norm ${\mathbb{R}}^{T}\to{\mathbb{R}}$ , $Y\mapsto\|Y\|$ is $1$ -Lipschitz, we thus have immediately by (6)

[TABLE]

for some $c,C>0$ independent of all parameters.
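The Lipschitz parameter $\lambda_{\varphi}\lambda_{\sigma}\|X\|$ of $\psi$ used in this step can be probed empirically. The sketch below takes $\varphi(t)=t$ and $\sigma(t)=\max(t,0)$ (so that $\lambda_{\varphi}=\lambda_{\sigma}=1$ ) as illustrative choices, and compares the largest observed ratio $\|\psi(u)-\psi(v)\|/\|u-v\|$ over random pairs with the operator norm $\|X\|$ ; the helper names and sizes are hypothetical.

```python
import numpy as np

def psi(w_tilde, X, phi=lambda t: t, sigma=lambda t: np.maximum(t, 0.0)):
    """psi(w~) = sigma(phi(w~)^T X)^T, here with phi = identity and sigma = ReLU."""
    return sigma(phi(w_tilde) @ X)

def max_lipschitz_ratio(X, n_pairs=500, seed=0):
    """Largest ratio ||psi(u) - psi(v)|| / ||u - v|| over random Gaussian pairs (u, v)."""
    rng = np.random.default_rng(seed)
    p = X.shape[0]
    ratios = []
    for _ in range(n_pairs):
        u, v = rng.standard_normal(p), rng.standard_normal(p)
        ratios.append(np.linalg.norm(psi(u, X) - psi(v, X)) / np.linalg.norm(u - v))
    return float(max(ratios))
```

Random pairs only give a lower estimate of the true Lipschitz constant, so the observed ratio never exceeds $\|X\|$ but need not attain it.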

Finally, using again the Lipschitz character of $\sigma(w^{\sf T}X)$ ,

[TABLE]

so that, by Jensen’s inequality,

[TABLE]

with ${\rm E}[\|\varphi(\tilde{w})\|^{2}]\leq\lambda_{\varphi}^{2}{\rm E}[\|\tilde{w}\|^{2}]=p\lambda_{\varphi}^{2}$ (since $\tilde{w}\sim\mathcal{N}(0,I_{p})$ ). Letting $t_{0}\equiv|\sigma(0)|+\lambda_{\sigma}\lambda_{\varphi}\|X\|\sqrt{\frac{p}{T}}$ , we then find

[TABLE]

which, with the remark $t\geq 4t_{0}\Rightarrow(t-t_{0})^{2}\geq t^{2}/2$ , may be equivalently stated as

[TABLE]

As a side (but important) remark, note that, since

[TABLE]

the result above implies that

[TABLE]

and thus, since $\|\cdot\|_{F}\geq\|\cdot\|$ , we have

[TABLE]

Thus, in particular, under the additional Assumption 3, with high probability, the operator norm of $\frac{\Sigma}{\sqrt{T}}$ cannot exceed a rate $\sqrt{T}$ .

Remark 1 (Loss of control of the structure of $\Sigma$ ).

The aforementioned control of $\|\Sigma\|$ arises from the bound $\|\Sigma\|\leq\|\Sigma\|_{F}$ which may be quite loose (by as much as a factor $\sqrt{T}$ ). Intuitively, under the supplementary Assumption 3, if ${\rm E}[\sigma]\neq 0$ , then $\frac{\Sigma}{\sqrt{T}}$ is “dominated” by the matrix $\frac{1}{\sqrt{T}}{\rm E}[\sigma]1_{T}^{\sf T}$ , the operator norm of which is indeed of order $\sqrt{n}$ , so that the bound is tight. If $\sigma(t)=t$ and ${\rm E}[W_{ij}]=0$ , we however know that $\|\frac{\Sigma}{\sqrt{T}}\|=O(1)$ (Bai and Silverstein, 1998). One is tempted to believe that, more generally, if ${\rm E}[\sigma]=0$ , then $\|\frac{\Sigma}{\sqrt{T}}\|$ should remain of this order. And, if instead ${\rm E}[\sigma]\neq 0$ , the contribution of $\frac{1}{\sqrt{T}}{\rm E}[\sigma]1_{T}^{\sf T}$ should merely engender a single large-amplitude isolated singular value in the spectrum of $\frac{\Sigma}{\sqrt{T}}$ while the other singular values remain of order $O(1)$ . These intuitions are not captured by our concentration of measure approach.

Since $\Sigma=\sigma(WX)$ is an entry-wise operation, concentration results with respect to the Frobenius norm are natural, whereas those with respect to the operator norm are hardly accessible.

Back to our present considerations, let us define the set $\mathcal{A}_{K}=\{w,~\|\sigma(w^{\sf T}X)\|\leq K\sqrt{T}\}$ .

Conditioning the random variable of interest in Lemma 2 with respect to $\mathcal{A}_{K}$ and its complement $\mathcal{A}_{K}^{c}$ , for some $K\geq 4t_{0}$ , gives

[TABLE]

We can already bound $P(\mathcal{A}_{K}^{c})$ thanks to (7). As for the first right-hand side term, note that on the set $\{\sigma(w^{\sf T}X),w\in\mathcal{A}_{K}\}$ , the function $f:\mathbb{R}^{T}\rightarrow\mathbb{R}:\ \sigma\mapsto\sigma^{\sf T}A\sigma$ is $K\sqrt{T}$ -Lipschitz. This is because, for all $\sigma,\sigma+h\in\{\sigma(w^{\sf T}X),w\in\mathcal{A}_{K}\}$ ,

[TABLE]

Since conditioning does not allow for a straightforward application of (6), we consider instead $\tilde{f}$ , a $K\sqrt{T}$ -Lipschitz continuation to ${\mathbb{R}}^{T}$ of $f_{\mathcal{A}_{K}}$ , the restriction of $f$ to $\mathcal{A}_{K}$ , such that all the radial derivatives of $\tilde{f}$ are constant in the set $\{\sigma,\|\sigma\|\geq K\sqrt{T}\}$ . We may thus now apply (6) and our previous results to obtain

[TABLE]

Therefore,

[TABLE]

Our next step is then to bound the difference $\Delta=|{\rm E}[\tilde{f}(\sigma(w^{\sf T}X))]-{\rm E}[f(\sigma(w^{\sf T}X))]|$ . Since $f$ and $\tilde{f}$ are equal on $\{\sigma,\|\sigma\|\leq K\sqrt{T}\}$ ,

[TABLE]

where $\mu_{\sigma}$ is the law of $\sigma(w^{\sf T}X)$ . Since $\|A\|\leq 1$ , for $\|\sigma\|\geq K\sqrt{T}$ , $\max(|f(\sigma)|,|\tilde{f}(\sigma)|)\leq\|\sigma\|^{2}$ and thus

[TABLE]

where in the last inequality we used the fact that, for $x\in{\mathbb{R}}$ , $xe^{-x}\leq e^{-1}\leq 1$ , and $K\geq 4t_{0}\geq 4\lambda_{\sigma}\lambda_{\varphi}\|X\|\sqrt{\frac{p}{T}}$ . As a consequence,

[TABLE]

so that, with the same remark as before, for $t\geq\frac{4\Delta}{KT}$ ,

[TABLE]

To avoid the condition $t\geq\frac{4\Delta}{KT}$ , we use the fact that, probabilities being lower than one, it suffices to replace $C$ by $\lambda C$ with $\lambda\geq 1$ such that

[TABLE]

The above inequality holds if we take for instance $\lambda=\frac{1}{C}e^{\frac{18C^{2}}{c}}$ since then $t\leq\frac{4\Delta}{KT}\leq\frac{24C\lambda_{\varphi}^{2}\lambda_{\sigma}^{2}\|X\|^{2}}{cKT}\leq\frac{6C\lambda_{\varphi}\lambda_{\sigma}\|X\|}{c\sqrt{pT}}$ (using successively $\Delta\leq\frac{6C}{c}\lambda_{\varphi}^{2}\lambda_{\sigma}^{2}\|X\|^{2}$ and $K\geq 4\lambda_{\sigma}\lambda_{\varphi}\|X\|\sqrt{\frac{p}{T}}$ ) and thus

[TABLE]

Therefore, setting $\lambda=\max(1,\frac{1}{C}e^{\frac{{C^{\prime}}^{2}c}{2}})$ , we get for every $t>0$

[TABLE]

which, together with the inequality $P(\mathcal{A}_{K}^{c})\leq Ce^{-\frac{cTK^{2}}{2\lambda_{\varphi}^{2}\lambda_{\sigma}^{2}\|X\|^{2}}}$ , gives

[TABLE]

We then conclude

[TABLE]

and, with $K=\max(4t_{0},\sqrt{t})$ ,

[TABLE]

Indeed, if $4t_{0}\leq\sqrt{t}$ then $\min(t^{2}/K^{2},K^{2})=t$ , while if $4t_{0}\geq\sqrt{t}$ then $\min(t^{2}/K^{2},K^{2})=\min(t^{2}/16t_{0}^{2},16t_{0}^{2})=t^{2}/16t_{0}^{2}$ .

∎
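The case analysis closing the proof amounts to the identity $\min(t^{2}/K^{2},K^{2})=\min(t,t^{2}/(16t_{0}^{2}))$ for $K=\max(4t_{0},\sqrt{t})$ . As a quick arithmetic check, the sketch below compares the exponent evaluated at this $K$ with the piecewise expression obtained from the two cases; function names are hypothetical.

```python
import math

def exponent(t, t0):
    """min(t^2 / K^2, K^2) for K = max(4 t0, sqrt(t))."""
    K = max(4.0 * t0, math.sqrt(t))
    return min(t * t / (K * K), K * K)

def exponent_closed_form(t, t0):
    """Piecewise expression obtained from the two cases of the proof."""
    if 4.0 * t0 <= math.sqrt(t):
        return t                          # case 4 t0 <= sqrt(t): K^2 = t
    return t * t / (16.0 * t0 * t0)       # case 4 t0 >= sqrt(t): K = 4 t0
```

Both functions agree for all $t,t_{0}>0$ , and their common value is the sub-Gaussian exponent $t^{2}/(16t_{0}^{2})$ for $t\leq 16t_{0}^{2}$ and the sub-exponential exponent $t$ beyond.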

As a corollary of Lemma 2, we have the following control of the moments of $\frac{1}{T}\sigma^{\sf T}A\sigma$ .

Corollary 1 (Moments of quadratic forms).

Let Assumptions 1–2 hold. For $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ , $\sigma\equiv\sigma(w^{\sf T}X)^{\sf T}\in{\mathbb{R}}^{T}$ , $A\in{\mathbb{R}}^{T\times T}$ such that $\|A\|\leq 1$ , and $k\in{\mathbb{N}}$ ,

[TABLE]

with $t_{0}=|\sigma(0)|+\lambda_{\sigma}\lambda_{\varphi}\|X\|\sqrt{\frac{p}{T}}$ , $\eta=\|X\|\lambda_{\sigma}\lambda_{\varphi}$ , and $C_{1},C_{2}>0$ independent of the other parameters. In particular, under the additional Assumption 3,

[TABLE]

Proof.

We use the fact that, for a nonnegative random variable $Y$ , ${\rm E}[Y]=\int_{0}^{\infty}P(Y>t)dt$ , so that

[TABLE]

which, along with the boundedness of the integrals, concludes the proof. ∎

Beyond concentration results on functions of the vector $\sigma$ , we also have the following convenient property for functions of the matrix $\Sigma$ .

Lemma 3 (Lipschitz functions of $\Sigma$ ).

Let $f:{\mathbb{R}}^{n\times T}\to{\mathbb{R}}$ be a $\lambda_{f}$ -Lipschitz function with respect to the Frobenius norm. Then, under Assumptions 1–2,

[TABLE]

for some $C,c>0$ . In particular, under the additional Assumption 3,

[TABLE]

Proof.

Denoting $W=\varphi(\tilde{W})$ , since ${\rm vec}(\tilde{W})\equiv[\tilde{W}_{11},\cdots,\tilde{W}_{np}]$ is a Gaussian vector, by the normal concentration of Gaussian vectors, for $g$ a $\lambda_{g}$ -Lipschitz function of $W$ with respect to the Frobenius norm (i.e., the Euclidean norm of ${\rm vec}(W)$ ), by (6),

[TABLE]

for some $C,c>0$ . Let us consider in particular $g:W\mapsto f(\Sigma/\sqrt{T})$ and remark that

[TABLE]

concluding the proof. ∎

A first corollary of Lemma 3 is the concentration of the Stieltjes transform $\frac{1}{T}\operatorname{tr}\left(\frac{1}{T}\Sigma^{\sf T}\Sigma-zI_{T}\right)^{-1}$ of $\mu_{n}$ , the empirical spectral measure of $\frac{1}{T}\Sigma^{\sf T}\Sigma$ , for all $z\in{\mathbb{C}}\setminus{\mathbb{R}}^{+}$ (so in particular, for $z=-\gamma$ , $\gamma>0$ ).

Corollary 2 (Concentration of the Stieltjes transform of $\mu_{n}$ ).

Under Assumptions 1–2, for $z\in{\mathbb{C}}\setminus{\mathbb{R}}^{+}$ ,

[TABLE]

for some $C,c>0$ , where ${\rm dist}(z,{\mathbb{R}}^{+})$ is the Hausdorff set distance. In particular, for $z=-\gamma$ , $\gamma>0$ , and under the additional Assumption 3

[TABLE]

Proof.

We can apply Lemma 3 for $f:R\mapsto\frac{1}{T}\operatorname{tr}(R^{\sf T}R-zI_{T})^{-1}$ , since we have

[TABLE]

where, for the second to last inequality, we successively used the relations $|\operatorname{tr}AB|\leq\sqrt{\operatorname{tr}AA^{\sf T}}\sqrt{\operatorname{tr}BB^{\sf T}}$ , $|\operatorname{tr}CD|\leq\|D\|\operatorname{tr}C$ for nonnegative definite $C$ , and $\|(R^{\sf T}R-zI_{T})^{-1}\|\leq{\rm dist}(z,{\mathbb{R}}^{+})^{-1}$ , $\|(R^{\sf T}R-zI_{T})^{-1}R^{\sf T}R\|\leq 1$ , $\|(R^{\sf T}R-zI_{T})^{-1}R^{\sf T}\|=\|(R^{\sf T}R-zI_{T})^{-1}R^{\sf T}R(R^{\sf T}R-zI_{T})^{-1}\|^{\frac{1}{2}}\leq\|(R^{\sf T}R-zI_{T})^{-1}R^{\sf T}R\|^{\frac{1}{2}}\|(R^{\sf T}R-zI_{T})^{-1}\|^{\frac{1}{2}}\leq{\rm dist}(z,{\mathbb{R}}^{+})^{-\frac{1}{2}}$ , for $z\in{\mathbb{C}}\setminus{\mathbb{R}}^{+}$ , and finally $\|\cdot\|\leq\|\cdot\|_{F}$ . ∎
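For $z=-\gamma$ , the chain of inequalities above can be tested numerically. The bound used in the sketch below, $|f(R+H)-f(R)|\leq\frac{2\|H\|_{F}}{\gamma^{3/2}\sqrt{T}}$ , is our own rederivation from the listed relations (the displayed inequality is not reproduced in this section, so the exact constant is an assumption); the function names and test sizes are hypothetical.

```python
import numpy as np

def stieltjes_at_minus_gamma(R, gamma):
    """f(R) = (1/T) tr (R^T R + gamma I_T)^{-1}."""
    T = R.shape[1]
    return float(np.trace(np.linalg.inv(R.T @ R + gamma * np.eye(T))) / T)

def lipschitz_bound(H, gamma):
    """2 ||H||_F / (gamma^{3/2} sqrt(T)): bound rederived for this sketch (assumption)."""
    T = H.shape[1]
    return float(2.0 * np.linalg.norm(H, "fro") / (gamma**1.5 * np.sqrt(T)))

def check_lipschitz(n, T, gamma, n_trials=50, seed=0):
    """True if |f(R + H) - f(R)| <= bound for every random pair (R, H) drawn."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        R = rng.standard_normal((n, T)) / np.sqrt(T)
        H = rng.standard_normal((n, T)) * rng.uniform(0.01, 1.0)
        gap = abs(stieltjes_at_minus_gamma(R + H, gamma) - stieltjes_at_minus_gamma(R, gamma))
        if gap > lipschitz_bound(H, gamma) + 1e-12:
            return False
    return True
```

With $R=\Sigma/\sqrt{T}$ , this is the Lipschitz property with respect to the Frobenius norm that Lemma 3 requires, with a parameter of order $T^{-1/2}$ for fixed $\gamma$ .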

Lemma 3 also allows for an important application of Lemma 2 as follows.

Lemma 4 (Concentration of $\frac{1}{T}\sigma^{\sf T}Q_{-}\sigma$ ).

Let Assumptions 1–3 hold and write $W^{\sf T}=[w_{1},\ldots,w_{n}]$ . Define $\sigma\equiv\sigma(w_{1}^{\sf T}X)^{\sf T}\in{\mathbb{R}}^{T}$ and, for $W_{-}^{\sf T}=[w_{2},\ldots,w_{n}]$ and $\Sigma_{-}=\sigma(W_{-}X)$ , let $Q_{-}=(\frac{1}{T}\Sigma_{-}^{\sf T}\Sigma_{-}+\gamma I_{T})^{-1}$ . Then, for $A,B\in{\mathbb{R}}^{T\times T}$ such that $\|A\|,\|B\|\leq 1$

[TABLE]

for some $C,c>0$ independent of the other parameters.

Proof.

Let $f:R\mapsto\frac{1}{T}\sigma^{\sf T}A(R^{\sf T}R+\gamma I_{T})^{-1}B\sigma$ . Reproducing the proof of Corollary 2, conditionally on $\frac{1}{T}\|\sigma\|^{2}\leq K$ for any sufficiently large $K>0$ , it appears that $f$ is Lipschitz with parameter of order $O(1)$ . Along with (7) and Assumption 3, this ensures that

[TABLE]

for some $C,c>0$ . We may then apply Lemma 1 on the bounded norm matrix $A{\rm E}[Q_{-}]B$ to further find that

[TABLE]

which concludes the proof. ∎

As a further corollary of Lemma 3, we have the following concentration result on the training mean-square error of the neural network under study.

Corollary 3 (Concentration of the mean-square error).

Under Assumptions 1–3,

[TABLE]

for some $C,c>0$ independent of the other parameters.

Proof.

We apply Lemma 3 to the mapping $f:R\mapsto\frac{1}{T}\operatorname{tr}Y^{\sf T}Y(R^{\sf T}R+\gamma I_{T})^{-2}$ . Denoting $Q=(R^{\sf T}R+\gamma I_{T})^{-1}$ and $Q^{H}=((R+H)^{\sf T}(R+H)+\gamma I_{T})^{-1}$ , remark indeed that

[TABLE]

As $\|Q^{H}(R+H)^{\sf T}\|=\sqrt{\|Q^{H}(R+H)^{\sf T}(R+H)Q^{H}\|}$ and $\|RQ\|=\sqrt{\|QR^{\sf T}RQ\|}$ are bounded and $\frac{1}{T}\operatorname{tr}Y^{\sf T}Y$ is also bounded by Assumption 3, this implies

[TABLE]

for some $C>0$ . The function $f$ is thus Lipschitz with parameter independent of $n$ , which allows us to conclude using Lemma 3. ∎
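The mapping $f$ used in this proof corresponds, up to the factor $\gamma^{2}$, to the training error written in resolvent form. As a hedged illustration, the sketch below assumes the ridge-regression setting of the paper's Section 2 (not reproduced in this section), namely $\beta=\frac{1}{T}\Sigma QY^{\sf T}$ and $E_{\rm train}=\frac{1}{T}\|Y^{\sf T}-\Sigma^{\sf T}\beta\|_{F}^{2}$, and checks numerically that this equals $\frac{\gamma^{2}}{T}\operatorname{tr}Y^{\sf T}YQ^{2}$. The dimensions, the ReLU activation and the random targets are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, d, gamma = 60, 30, 80, 3, 0.5
X = rng.standard_normal((p, T)) / np.sqrt(p)   # data, p x T
Y = rng.standard_normal((d, T))                # targets, d x T (illustrative)
W = rng.standard_normal((n, p))                # random first layer
Sigma = np.maximum(W @ X, 0.0)                 # ReLU features, n x T

# Resolvent Q = (Sigma^T Sigma / T + gamma I_T)^{-1}
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))

# Assumed ridge solution beta = (1/T) Sigma Q Y^T, of size n x d
beta = Sigma @ Q @ Y.T / T

# Training error computed directly from the residual ...
mse_direct = np.linalg.norm(Y.T - Sigma.T @ beta, "fro") ** 2 / T
# ... and through the resolvent, using Y^T - Sigma^T beta = gamma Q Y^T
mse_resolvent = gamma**2 / T * np.trace(Y.T @ Y @ Q @ Q)

print(mse_direct, mse_resolvent)
```

The agreement follows from $(\frac{1}{T}\Sigma^{\sf T}\Sigma)Q=I_{T}-\gamma Q$, so the residual is exactly $\gamma QY^{\sf T}$.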

The aforementioned concentration results are the building blocks of the proofs of Theorems 1–3 which, under Assumptions 1–3, are established using standard random matrix approaches.

5.2 Asymptotic Equivalents

5.2.1 First Equivalent for ${\rm E}[Q]$

This section is dedicated to a first characterization of ${\rm E}[Q]$ , in the “simultaneously large” $n,p,T$ regime. This preliminary step is classical in studying resolvents in random matrix theory as the direct comparison of ${\rm E}[Q]$ to $\bar{Q}$ with the implicit $\delta$ may be cumbersome. To this end, let us thus define the intermediary deterministic matrix

$$\tilde{Q}\equiv\left(\frac{n}{T}\frac{\Phi}{1+\alpha}+\gamma I_{T}\right)^{-1}$$

with $\alpha\equiv\frac{1}{T}\operatorname{tr}\Phi{\rm E}[Q_{-}]$ , where we recall that $Q_{-}$ is a random matrix distributed as, say, $(\frac{1}{T}\Sigma^{\sf T}\Sigma-\frac{1}{T}\sigma_{1}\sigma_{1}^{\sf T}+\gamma I_{T})^{-1}$ .

First note that, since $\frac{1}{T}\operatorname{tr}\Phi={\rm E}[\frac{1}{T}\|\sigma\|^{2}]$ and, from (7) and Assumption 3, $P(\frac{1}{T}\|\sigma\|^{2}>t)\leq Ce^{-cnt^{2}}$ for all large $t$ , we find that $\frac{1}{T}\operatorname{tr}\Phi=\int_{0}^{\infty}P(\frac{1}{T}\|\sigma\|^{2}>t)dt\leq C^{\prime}$ for some constant $C^{\prime}$ . Thus, $\alpha\leq\|{\rm E}[Q_{-}]\|\frac{1}{T}\operatorname{tr}\Phi\leq\frac{C^{\prime}}{\gamma}$ is uniformly bounded.

We will show here that $\|{\rm E}[Q]-\tilde{Q}\|\to 0$ as $n\to\infty$ in the regime of Assumption 3. As the proof steps are somewhat classical, we defer to the appendix some classical intermediary lemmas (Lemmas 5–7). Using the resolvent identity, Lemma 5, we start by writing

[TABLE]

which, from Lemma 6, gives, for $Q_{-i}=(\frac{1}{T}\Sigma^{\sf T}\Sigma-\frac{1}{T}\sigma_{i}\sigma_{i}^{\sf T}+\gamma I_{T})^{-1}$ ,

[TABLE]

Note now, from the independence of $Q_{-i}$ and $\sigma_{i}\sigma_{i}^{\sf T}$ , that the second right-hand side expectation is simply ${\rm E}[Q_{-i}]\Phi$ . Also, exploiting Lemma 6 in reverse on the rightmost term, this gives

[TABLE]

It is convenient at this point to note that, since ${\rm E}[Q]-\tilde{Q}$ is symmetric, we may write

[TABLE]

We study the two right-hand side terms of (5.2.1) independently.

For the first term, since $Q-Q_{-i}=-Q\frac{1}{T}\sigma_{i}\sigma_{i}^{\sf T}Q_{-i}$ ,

[TABLE]

where we used again Lemma 6 in reverse. Denoting $D=\operatorname{\rm diag}(\{1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i}\}_{i=1}^{n})$ , this can be compactly written

[TABLE]

Note at this point that, from Lemma 7, $\|\Phi\tilde{Q}\|\leq(1+\alpha)\frac{T}{n}$ and

[TABLE]

Besides, by Lemma 4 and the union bound,

[TABLE]

for some $C,c>0$ , so in particular, recalling that $\alpha\leq C^{\prime}$ for some constant $C^{\prime}>0$ ,

[TABLE]

As a consequence of all the above (and of the boundedness of $\alpha$ ), we have that, for some $c>0$ ,

[TABLE]

Let us now consider the second right-hand side term of (5.2.1). Using the relation $ab^{\sf T}+ba^{\sf T}\preceq aa^{\sf T}+bb^{\sf T}$ in the order of Hermitian matrices (which unfolds from $(a-b)(a-b)^{\sf T}\succeq 0$ ), we have, with $a=T^{\frac{1}{4}}Q\sigma_{i}(\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i}-\alpha)$ and $b=T^{-\frac{1}{4}}\tilde{Q}\sigma_{i}$ ,

[TABLE]

where $D_{2}={\rm diag}(\{\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i}-\alpha\}_{i=1}^{n})$ . Of course, since we also have $-aa^{\sf T}-bb^{\sf T}\preceq ab^{\sf T}+ba^{\sf T}$ (from $(a+b)(a+b)^{\sf T}\succeq 0$ ), we have symmetrically

[TABLE]

But from Lemma 4,

[TABLE]

so that, with a similar reasoning as in the proof of Corollary 1,

[TABLE]

where we additionally used $\|Q\Sigma\|\leq\sqrt{T}$ in the first inequality.

Since in addition $\left\|\frac{n}{T\sqrt{T}}\tilde{Q}\Phi\tilde{Q}\right\|\leq Cn^{-\frac{1}{2}}$ , this gives

[TABLE]

Together with (5.2.1), we thus conclude that

[TABLE]

Note in passing that we proved that

[TABLE]

where the first equality holds by exchangeability arguments.

In particular,

[TABLE]

where $|\frac{1}{T}\operatorname{tr}\Phi({\rm E}[Q_{-}]-{\rm E}[Q])|\leq\frac{c}{n}$ . And thus, by the previous result,

[TABLE]

We have proved in the beginning of the section that $\frac{1}{T}\operatorname{tr}\Phi$ is bounded and thus we finally conclude that

[TABLE]
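The derivation of this subsection relies repeatedly on two rank-one identities: $Q-Q_{-i}=-Q\frac{1}{T}\sigma_{i}\sigma_{i}^{\sf T}Q_{-i}$ and the relation invoked as Lemma 6, which we assume here to be the Sherman–Morrison form $Q\sigma_{i}=Q_{-i}\sigma_{i}/(1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i})$ (the lemma itself is in the appendix and not reproduced in this section). The sketch below checks both on an arbitrary matrix $\Sigma$; the Gaussian entries and dimensions are illustrative only, since the identities are purely algebraic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, gamma = 20, 15, 0.7
Sigma = rng.standard_normal((n, T))      # row i plays the role of sigma_i^T
I_T = np.eye(T)

Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * I_T)
s = Sigma[0]                             # sigma_1
# Resolvent with the contribution of sigma_1 removed
Q_minus = np.linalg.inv(Sigma.T @ Sigma / T - np.outer(s, s) / T + gamma * I_T)

# Assumed form of Lemma 6: Q s = Q_- s / (1 + s^T Q_- s / T)
lhs_vec = Q @ s
rhs_vec = Q_minus @ s / (1.0 + s @ Q_minus @ s / T)

# Rank-one difference: Q - Q_- = -Q (s s^T / T) Q_-
lhs_mat = Q - Q_minus
rhs_mat = -Q @ np.outer(s, s) @ Q_minus / T

err_vec = np.max(np.abs(lhs_vec - rhs_vec))
err_mat = np.max(np.abs(lhs_mat - rhs_mat))
print(err_vec, err_mat)
```

Both errors should be at the level of floating-point round-off.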

5.2.2 Second Equivalent for ${\rm E}[Q]$

In this section, we show that ${\rm E}[Q]$ can be approximated by the matrix $\bar{Q}$ , which we recall is defined as

$$\bar{Q}\equiv\left(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T}\right)^{-1}$$

where $\delta>0$ is the unique positive solution to $\delta=\frac{1}{T}\operatorname{tr}\Phi\bar{Q}$ . The fact that $\delta>0$ is well defined is quite standard and has already been proved several times for more elaborate models. Following the ideas of (Hoydis, Couillet and Debbah, 2013), we may for instance use the framework of so-called standard interference functions (Yates, 1995) which claims that, if a map $f:[0,\infty)\to(0,\infty)$ , $x\mapsto f(x)$ , satisfies $x\geq x^{\prime}\Rightarrow f(x)\geq f(x^{\prime})$ , $\forall a>1,af(x)>f(ax)$ and there exists $x_{0}$ such that $x_{0}\geq f(x_{0})$ , then $f$ has a unique fixed point (Yates, 1995, Th 2). It is easily shown that $\delta\mapsto\frac{1}{T}\operatorname{tr}\Phi\bar{Q}$ is such a map, so that $\delta$ exists and is unique.
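The verification left to the reader can be sketched as follows. The sketch assumes $\bar{Q}=(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T})^{-1}$ , that $\Phi$ is nonnegative definite and nonzero, and introduces for illustration only the notation $\lambda_{1},\ldots,\lambda_{T}\geq 0$ for the eigenvalues of $\Phi$ and $c=\frac{n}{T}$ . In the eigenbasis of $\Phi$ ,

$$f(\delta)\equiv\frac{1}{T}\operatorname{tr}\Phi\bar{Q}=\frac{1}{T}\sum_{i=1}^{T}\frac{\lambda_{i}(1+\delta)}{c\lambda_{i}+\gamma(1+\delta)}.$$

Monotonicity follows from

$$\frac{d}{d\delta}\,\frac{1+\delta}{c\lambda_{i}+\gamma(1+\delta)}=\frac{c\lambda_{i}}{(c\lambda_{i}+\gamma(1+\delta))^{2}}\geq 0.$$

For scalability, take $a>1$ and $x\geq 0$ ; then

$$\frac{a(1+x)}{c\lambda_{i}+\gamma(1+x)}-\frac{1+ax}{c\lambda_{i}+\gamma(1+ax)}=\frac{(a-1)\left[c\lambda_{i}+\gamma(1+x)(1+ax)\right]}{(c\lambda_{i}+\gamma(1+x))(c\lambda_{i}+\gamma(1+ax))}>0,$$

so that multiplying by $\lambda_{i}$ and averaging over $i$ gives $af(x)>f(ax)$ as soon as one $\lambda_{i}$ is positive. Finally, each summand of $f$ is bounded by $\lambda_{i}/\gamma$ , hence

$$f(x)\leq\frac{1}{\gamma}\frac{1}{T}\operatorname{tr}\Phi\quad\text{for all }x\geq 0,$$

and $x_{0}=\frac{1}{\gamma T}\operatorname{tr}\Phi$ satisfies $x_{0}\geq f(x_{0})$ .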

To compare $\tilde{Q}$ and $\bar{Q}$ , using the resolvent identity, Lemma 5, we start by writing

[TABLE]

from which

[TABLE]

which implies that

[TABLE]

It thus remains to show that

[TABLE]

to prove that $|\alpha-\delta|\leq cn^{\varepsilon-\frac{1}{2}}$ . To this end, note that, by Cauchy–Schwarz’s inequality,

[TABLE]

so that it is sufficient to bound the limsup of both terms under the square root strictly by one. Next, remark that

[TABLE]

In particular,

[TABLE]

But at the same time, since $\|(\frac{n}{T}\Phi+\gamma I_{T})^{-1}\|\leq\gamma^{-1}$ ,

[TABLE]

the limsup of which is bounded. We thus conclude that

[TABLE]

Similarly, $\alpha$ , which is known to be bounded, satisfies

[TABLE]

and we thus have also

[TABLE]

which completes the proof that $|\alpha-\delta|\leq cn^{\varepsilon-\frac{1}{2}}$ .

As a consequence of all this,

[TABLE]

and we have thus proved that $\|{\rm E}[Q]-\bar{Q}\|\leq cn^{-\frac{1}{2}+\varepsilon}$ for some $c>0$ .

From this result, along with Corollary 2, we now have that

[TABLE]

for all large $n$ . As a consequence, for all $\gamma>0$ , $\frac{1}{T}\operatorname{tr}Q-\frac{1}{T}\operatorname{tr}\bar{Q}\to 0$ almost surely. As such, the difference $m_{\mu_{n}}-m_{\bar{\mu}_{n}}$ of Stieltjes transforms $m_{\mu_{n}}:{\mathbb{C}}\setminus{\mathbb{R}}^{+}\to{\mathbb{C}}$ , $z\mapsto\frac{1}{T}\operatorname{tr}(\frac{1}{T}\Sigma^{\sf T}\Sigma-zI_{T})^{-1}$ and $m_{\bar{\mu}_{n}}:{\mathbb{C}}\setminus{\mathbb{R}}^{+}\to{\mathbb{C}}$ , $z\mapsto\frac{1}{T}\operatorname{tr}(\frac{n}{T}\frac{\Phi}{1+\delta_{z}}-zI_{T})^{-1}$ (with $\delta_{z}$ the unique Stieltjes transform solution to $\delta_{z}=\frac{1}{T}\operatorname{tr}\Phi(\frac{n}{T}\frac{\Phi}{1+\delta_{z}}-zI_{T})^{-1}$ ) converges to zero for each $z$ in a subset of ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ having at least one accumulation point (namely ${\mathbb{R}}^{-}$ ), almost surely so (that is, on a probability set $\mathcal{A}_{z}$ with $P(\mathcal{A}_{z})=1$ ). Thus, letting $\{z_{k}\}_{k=1}^{\infty}$ be a converging sequence strictly included in ${\mathbb{R}}^{-}$ , on the probability one space $\mathcal{A}=\cap_{k=1}^{\infty}\mathcal{A}_{k}$ , $m_{\mu_{n}}(z_{k})-m_{\bar{\mu}_{n}}(z_{k})\to 0$ for all $k$ . Now, $m_{\mu_{n}}$ is complex analytic on ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ and bounded on all compact subsets of ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ . Besides, it was shown in (Silverstein and Bai, 1995; Silverstein and Choi, 1995) that the function $m_{\bar{\mu}_{n}}$ is well-defined, complex analytic and bounded on all compact subsets of ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ . As a result, on $\mathcal{A}$ , $m_{\mu_{n}}-m_{\bar{\mu}_{n}}$ is complex analytic, bounded on all compact subsets of ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ and converges to zero on a subset admitting at least one accumulation point. 
Thus, by Vitali’s convergence theorem (Titchmarsh, 1939), with probability one, $m_{\mu_{n}}-m_{\bar{\mu}_{n}}$ converges to zero everywhere on ${\mathbb{C}}\setminus{\mathbb{R}}^{+}$ . This implies, by (Bai and Silverstein, 2009, Theorem B.9), that $\mu_{n}-\bar{\mu}_{n}\to 0$ , vaguely as a signed finite measure, with probability one, and, since $\bar{\mu}_{n}$ is a probability measure (again from the results of (Silverstein and Bai, 1995; Silverstein and Choi, 1995)), we have thus proved Theorem 2.
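The agreement between $\frac{1}{T}\operatorname{tr}Q$ and $\frac{1}{T}\operatorname{tr}\bar{Q}$ can be illustrated on a single draw. The sketch below is a hedged example rather than a reproduction of the paper's simulations: it assumes Gaussian $W$, the activation $\sigma(t)=\max(t,0)$ with the closed form of $\Phi_{ab}$ as we read it from Table 1, and $\bar{Q}=(\frac{n}{T}\frac{\Phi}{1+\delta}+\gamma I_{T})^{-1}$ with $\delta$ obtained by plain fixed-point iteration. The function names `relu_phi` and `solve_delta` are hypothetical.

```python
import numpy as np

def relu_phi(X):
    """Phi_ab = E[max(w^T a,0) max(w^T b,0)], w ~ N(0, I_p), for all columns a, b of X."""
    norms = np.linalg.norm(X, axis=0)
    outer = np.outer(norms, norms)
    ang = np.clip(X.T @ X / outer, -1.0, 1.0)      # guard against round-off outside [-1, 1]
    return outer * (ang * np.arccos(-ang) + np.sqrt(1.0 - ang**2)) / (2.0 * np.pi)

def solve_delta(lam, n, gamma, iters=500, tol=1e-12):
    """Fixed point of delta = (1/T) tr Phi Qbar, written on the eigenvalues lam of Phi."""
    T = lam.size
    delta = 0.0
    for _ in range(iters):
        new = np.mean(lam / (n / T * lam / (1.0 + delta) + gamma))
        if abs(new - delta) < tol:
            return new
        delta = new
    return delta

rng = np.random.default_rng(3)
n, p, T, gamma = 400, 200, 300, 1.0
X = rng.standard_normal((p, T)) / np.sqrt(p)       # data of bounded operator norm
W = rng.standard_normal((n, p))
Sigma = np.maximum(W @ X, 0.0)

# Empirical side: (1/T) tr Q for one realisation of W
m_emp = np.mean(1.0 / (np.linalg.eigvalsh(Sigma.T @ Sigma / T) + gamma))

# Deterministic side: (1/T) tr Qbar
lam = np.clip(np.linalg.eigvalsh(relu_phi(X)), 0.0, None)
delta = solve_delta(lam, n, gamma)
m_det = np.mean(1.0 / (n / T * lam / (1.0 + delta) + gamma))

print(f"(1/T) tr Q = {m_emp:.4f}, (1/T) tr Qbar = {m_det:.4f}, delta = {delta:.4f}")
```

Working on the eigenvalues of $\Phi$ avoids repeated matrix inversions, since $\Phi$ and $\bar{Q}$ share the same eigenvectors.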

5.2.3 Asymptotic Equivalent for ${\rm E}[QAQ]$ , where $A$ is either $\Phi$ or symmetric of bounded norm

The evaluation of the second order statistics of the neural network under study requires, besides ${\rm E}[Q]$ , evaluating the more involved form ${\rm E}[QAQ]$ , where $A$ is a symmetric matrix either equal to $\Phi$ or of bounded norm (so in particular $\|\bar{Q}A\|$ is bounded). To evaluate this quantity, first write

[TABLE]

Of course, since $QAQ$ is symmetric, we may write

[TABLE]

which will prove more practical to handle.

First note that, since $\left\|{\rm E}[Q]-\bar{Q}\right\|\leq Cn^{\varepsilon-\frac{1}{2}}$ and $A$ is such that $\|\bar{Q}A\|$ is bounded, $\|{\rm E}[\bar{Q}AQ]-\bar{Q}A\bar{Q}\|\leq\|\bar{Q}A\|\|{\rm E}[Q]-\bar{Q}\|\leq C^{\prime}n^{\varepsilon-\frac{1}{2}}$ , which provides an estimate for the first expectation. We next evaluate the last right-hand side expectation above. With the same notations as previously, from exchangeability arguments and using $Q=Q_{-}-Q\frac{1}{T}\sigma\sigma^{\sf T}Q_{-}$ , observe that

[TABLE]

which, reusing $Q=Q_{-}-Q\frac{1}{T}\sigma\sigma^{\sf T}Q_{-}$ , is further decomposed as

[TABLE]

(where in the second-to-last line, we have merely reorganized the terms conveniently) and our interest is in handling $Z_{1}+Z_{1}^{\sf T}+Z_{2}+Z_{2}^{\sf T}+Z_{3}+Z_{3}^{\sf T}+Z_{4}+Z_{4}^{\sf T}$ . Let us first treat term $Z_{2}$ . Since $\bar{Q}AQ_{-}$ is of bounded norm, by Lemma 4, $\frac{1}{T}\sigma^{\sf T}\bar{Q}AQ_{-}\sigma$ concentrates around $\frac{1}{T}\operatorname{tr}\Phi\bar{Q}A{\rm E}[Q_{-}]$ ; but, as $\|\Phi\bar{Q}\|$ is bounded, we also have $|\frac{1}{T}\operatorname{tr}\Phi\bar{Q}A{\rm E}[Q_{-}]-\frac{1}{T}\operatorname{tr}\Phi\bar{Q}A\bar{Q}|\leq cn^{\varepsilon-\frac{1}{2}}$ . We thus deduce, with similar arguments as previously, that

[TABLE]

with probability exponentially close to one, in the order of symmetric matrices. Taking expectation and norms on both sides, and conditioning on the aforementioned event and its complement, we thus have that

[TABLE]

But, again by exchangeability arguments,

[TABLE]

with $D=\operatorname{\rm diag}(\{1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-}\sigma_{i}\})$ , the operator norm of which is bounded as $O(1)$ . So finally,

[TABLE]

We now move to term $Z_{3}+Z_{3}^{\sf T}$ . Using the relation $ab^{\sf T}+ba^{\sf T}\preceq aa^{\sf T}+bb^{\sf T}$ ,

[TABLE]

and the symmetrical lower bound (equal to the opposite of the upper bound), where $D_{3}=\operatorname{\rm diag}((\delta-\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i})/(1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i}))$ . For the same reasons as above, the first right-hand side term is bounded by $Cn^{\varepsilon-\frac{1}{2}}$ . As for the second term, for $A=I_{T}$ , it is clearly bounded; for $A=\Phi$ , using $\frac{n}{T}\frac{\bar{Q}\Phi}{1+\delta}=I_{T}-\gamma\bar{Q}$ , ${\rm E}[Q_{-}A\bar{Q}\Phi\bar{Q}AQ_{-}]$ can be expressed in terms of ${\rm E}[Q_{-}\Phi Q_{-}]$ and ${\rm E}[Q_{-}\bar{Q}^{k}\Phi Q_{-}]$ for $k=1,2$ , all of which have been shown to be bounded (at most by $Cn^{\varepsilon}$ ). We thus conclude that

[TABLE]

Finally, term $Z_{4}$ can be handled similarly as term $Z_{2}$ and is shown to be of norm bounded by $Cn^{\varepsilon-\frac{1}{2}}$ .

As a consequence of all the above, we thus find that

[TABLE]

It is tempting to conjecture that the sum of the second and third terms above vanishes. This is indeed verified by observing that, for any matrix $B$ ,

[TABLE]

and symmetrically

[TABLE]

with $D=\operatorname{\rm diag}(1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i})$ , and a similar reasoning is performed to control ${\rm E}[Q_{-}BQ]-{\rm E}[Q_{-}BQ_{-}]$ and ${\rm E}[QBQ_{-}]-{\rm E}[Q_{-}BQ_{-}]$ . For $B$ bounded, $\|{\rm E}[Q\frac{1}{T}\Sigma^{\sf T}D\Sigma QBQ]\|$ is bounded as $O(1)$ , and thus $\|{\rm E}[QBQ]-{\rm E}[Q_{-}BQ_{-}]\|$ is of order $O(n^{-1})$ . So in particular, taking $A$ of bounded norm, we find that

[TABLE]

Take now $B=\Phi$ . Then, from the relation $AB^{\sf T}+BA^{\sf T}\preceq AA^{\sf T}+BB^{\sf T}$ in the order of symmetric matrices,

[TABLE]

The first norm in the parentheses is bounded by $Cn^{\varepsilon}$ and it thus remains to control the second norm. To this end, similar to the control of ${\rm E}[Q\Phi Q]$ , by writing ${\rm E}[Q\Phi Q\Phi Q]={\rm E}[Q\sigma_{1}\sigma_{1}^{\sf T}Q\sigma_{2}\sigma_{2}^{\sf T}Q]$ for $\sigma_{1},\sigma_{2}$ independent vectors with the same law as $\sigma$ , and exploiting the exchangeability, we obtain after some algebra that ${\rm E}[Q\Phi Q\Phi Q]$ can be expressed as the sum of terms of the form ${\rm E}[Q_{++}\frac{1}{T}\Sigma_{++}^{\sf T}D\Sigma_{++}Q_{++}]$ or ${\rm E}[Q_{++}\frac{1}{T}\Sigma_{++}^{\sf T}D\Sigma_{++}Q_{++}\frac{1}{T}\Sigma_{++}^{\sf T}D_{2}\Sigma_{++}Q_{++}]$ for $D,D_{2}$ diagonal matrices of norm bounded as $O(1)$ , while $\Sigma_{++}$ and $Q_{++}$ are defined as $\Sigma$ and $Q$ , only with $n$ replaced by $n+2$ . All these terms are bounded as $O(1)$ and we finally obtain that ${\rm E}[Q\Phi Q\Phi Q]$ is bounded and thus

[TABLE]

With the additional control on $Q\Phi Q_{-}-Q_{-}\Phi Q_{-}$ and $Q_{-}\Phi Q-Q_{-}\Phi Q_{-}$ , together, this implies that ${\rm E}[Q\Phi Q]={\rm E}[Q_{-}\Phi Q_{-}]+O_{\|\cdot\|}(n^{-1})$ . Hence, for $A=\Phi$ , exploiting the fact that $\frac{n}{T}\frac{1}{1+\delta}\Phi\bar{Q}\Phi=\Phi-\gamma\bar{Q}\Phi$ , we have the simplification

[TABLE]

or equivalently

[TABLE]

We have already shown in (11) that $\limsup_{n}\frac{n}{T}\frac{\frac{1}{T}\operatorname{tr}\Phi^{2}\bar{Q}^{2}}{(1+\delta)^{2}}<1$ and thus

[TABLE]

So finally, for all $A$ of bounded norm,

[TABLE]

which immediately proves Proposition 1 and Theorem 3.

5.3 Derivation of $\Phi_{ab}$

5.3.1 Gaussian $w$

In this section, we evaluate the terms $\Phi_{ab}$ provided in Table 1. The proof for the term corresponding to $\sigma(t)={\rm erf}(t)$ can already be found in (Williams, 1998, Section 3.1) and is not recalled here. For the other functions $\sigma(\cdot)$ , we follow an approach similar to that of (Williams, 1998), as detailed next.

The evaluation of $\Phi_{ab}$ for $w\sim\mathcal{N}(0,I_{p})$ requires estimating

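$$\mathcal{I}\equiv(2\pi)^{-\frac{p}{2}}\int_{{\mathbb{R}}^{p}}\sigma(w^{\sf T}a)\sigma(w^{\sf T}b)e^{-\frac{1}{2}\|w\|^{2}}dw.$$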

Assume that $a$ and $b$ are not linearly dependent. It is convenient to observe that this integral can be reduced to a two-dimensional integration by considering the basis $e_{1},\ldots,e_{p}$ defined (for instance) by

[TABLE]

and $e_{3},\ldots,e_{p}$ any completion of the basis. By letting $w=\tilde{w}_{1}e_{1}+\ldots+\tilde{w}_{p}e_{p}$ and $a=\tilde{a}_{1}e_{1}$ ( $\tilde{a}_{1}=\|a\|$ ), $b=\tilde{b}_{1}e_{1}+\tilde{b}_{2}e_{2}$ (where $\tilde{b}_{1}=\frac{a^{\sf T}b}{\|a\|}$ and $\tilde{b}_{2}=\|b\|\sqrt{1-\frac{(a^{\sf T}b)^{2}}{\|a\|^{2}\|b\|^{2}}}$ ), this reduces $\mathcal{I}$ to

[TABLE]

Letting $\tilde{w}=[\tilde{w}_{1},\tilde{w}_{2}]^{\sf T}$ , $\tilde{a}=[\tilde{a}_{1},0]^{\sf T}$ and $\tilde{b}=[\tilde{b}_{1},\tilde{b}_{2}]^{\sf T}$ , this is conveniently written as the two-dimensional integral

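$$\mathcal{I}=\frac{1}{2\pi}\int_{{\mathbb{R}}^{2}}\sigma(\tilde{w}^{\sf T}\tilde{a})\sigma(\tilde{w}^{\sf T}\tilde{b})e^{-\frac{1}{2}\|\tilde{w}\|^{2}}d\tilde{w}.$$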

The case where $a$ and $b$ would be linearly dependent can then be obtained by continuity arguments.

The function $\sigma(t)=\max(t,0)$

For this function, we have

[TABLE]

Since $\tilde{a}=\tilde{a}_{1}e_{1}$ , a simple geometric representation lets us observe that

[TABLE]

where we defined $\theta_{0}\equiv\arccos\left(\frac{\tilde{b}_{1}}{\|\tilde{b}\|}\right)=-\arcsin\left(\frac{\tilde{b}_{1}}{\|\tilde{b}\|}\right)+\frac{\pi}{2}$ . We may thus operate a polar coordinate change of variable (with inverse Jacobian determinant equal to $r$ ) to obtain

[TABLE]

With two integrations by parts, we have that $\int_{{\mathbb{R}}^{+}}r^{3}e^{-\frac{1}{2}r^{2}}dr=2$ . Classical trigonometric formulas also provide

[TABLE]

where we used in particular $\sin(2\arccos(x))=2x\sqrt{1-x^{2}}$ . Altogether, after simplification and substitution of $\tilde{a}_{1}$ , $\tilde{b}_{1}$ and $\tilde{b}_{2}$ , this gives

[TABLE]

It is worth noticing that this may be more compactly written as

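$$\mathcal{I}=\frac{1}{2\pi}\|a\|\|b\|\left(\sqrt{1-\angle(a,b)^{2}}+\angle(a,b)\arccos(-\angle(a,b))\right)$$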

which is minimum for $\angle(a,b)\to-1$ (since $\arccos(-x)\geq 0$ on $[-1,1]$ ) and takes there the limiting value zero. Hence $\mathcal{I}>0$ for $a$ and $b$ not linearly dependent.

For $a$ and $b$ linearly dependent, we simply have $\mathcal{I}=0$ for $\angle(a,b)=-1$ and $\mathcal{I}=\frac{1}{2}\|a\|\|b\|$ for $\angle(a,b)=1$ .
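As a sanity check, the closed form for $\sigma(t)=\max(t,0)$ can be compared with a Monte Carlo estimate. The sketch assumes the compact expression $\mathcal{I}=\frac{1}{2\pi}\|a\|\|b\|(\sqrt{1-\angle(a,b)^{2}}+\angle(a,b)\arccos(-\angle(a,b)))$ , which is our reading of Table 1; the helper `relu_phi_ab`, the dimension and the sample size are illustrative choices.

```python
import numpy as np

def relu_phi_ab(a, b):
    """Assumed closed form of E[max(w^T a, 0) max(w^T b, 0)] for w ~ N(0, I_p)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    ang = np.clip(a @ b / (na * nb), -1.0, 1.0)    # angle(a, b), clipped for round-off
    return na * nb * (ang * np.arccos(-ang) + np.sqrt(1.0 - ang**2)) / (2.0 * np.pi)

rng = np.random.default_rng(4)
p, N = 6, 400_000
a = rng.standard_normal(p)
b = rng.standard_normal(p)
a = a / np.linalg.norm(a)                # unit norm
b = 1.5 * b / np.linalg.norm(b)          # norm 1.5, to exercise the ||a|| ||b|| factor

# Monte Carlo estimate over N independent Gaussian vectors w
w = rng.standard_normal((N, p))
mc = np.mean(np.maximum(w @ a, 0.0) * np.maximum(w @ b, 0.0))
closed = relu_phi_ab(a, b)

# Linearly dependent limits quoted in the text
same_dir = relu_phi_ab(a, a)             # expected: ||a||^2 / 2
opp_dir = relu_phi_ab(a, -a)             # expected: 0
print(mc, closed, same_dir, opp_dir)
```

The Monte Carlo error is of order $N^{-1/2}$, so only two to three digits of agreement should be expected.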

The function $\sigma(t)=|t|$

Since $|t|=\max(t,0)+\max(-t,0)$ , we have

[TABLE]

Hence, reusing the results above, we have here

[TABLE]

Using the identity $\operatorname{acos}(-x)-\operatorname{acos}(x)=2\operatorname{asin}(x)$ provides the expected result.
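For reference, the substitution can be spelled out as follows. This is a sketch assuming the compact expression $\mathcal{I}_{\max}(a,b)=\frac{1}{2\pi}\|a\|\|b\|(\sqrt{1-\angle^{2}}+\angle\arccos(-\angle))$ for $\sigma(t)=\max(t,0)$ , with the shorthand $\angle=\angle(a,b)$ and the subscript notation $\mathcal{I}_{\max}$ introduced here only for illustration. Since $\mathcal{I}_{\max}(-a,-b)=\mathcal{I}_{\max}(a,b)$ and $\mathcal{I}_{\max}(a,-b)=\mathcal{I}_{\max}(-a,b)$ is obtained by changing $\angle$ into $-\angle$ ,

$$\mathcal{I}=2\mathcal{I}_{\max}(a,b)+2\mathcal{I}_{\max}(a,-b)=\frac{1}{\pi}\|a\|\|b\|\left(2\sqrt{1-\angle^{2}}+\angle\left[\arccos(-\angle)-\arccos(\angle)\right]\right),$$

and the identity above then gives

$$\mathcal{I}=\frac{2}{\pi}\|a\|\|b\|\left(\sqrt{1-\angle^{2}}+\angle\arcsin(\angle)\right).$$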

The function $\sigma(t)=1_{t\geq 0}$

With the same notations as in the case $\sigma(t)=\max(t,0)$ , we have to evaluate

[TABLE]

After a polar coordinate change of variable, this is

[TABLE]

from which the result unfolds.

The function $\sigma(t)={\rm sign}(t)$

Here it suffices to note that ${\rm sign}(t)=1_{t\geq 0}-1_{-t\geq 0}$ so that

[TABLE]

and to apply the result of the previous section, with either $(a,b)$ , $(-a,b)$ , $(a,-b)$ or $(-a,-b)$ . Since $\arccos(-x)=-\arccos(x)+\pi$ , we conclude that

[TABLE]
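The intermediate step can be made explicit. The sketch assumes that the result of the previous case reads $\mathcal{I}_{\rm step}(a,b)=\frac{1}{2}-\frac{1}{2\pi}\arccos(\angle(a,b))$ (our reading of Table 1; the subscript notation is introduced here only for illustration). Expanding the product of the two differences of indicators and writing $\angle=\angle(a,b)$ ,

$$\mathcal{I}=\mathcal{I}_{\rm step}(a,b)+\mathcal{I}_{\rm step}(-a,-b)-\mathcal{I}_{\rm step}(-a,b)-\mathcal{I}_{\rm step}(a,-b)=2\left[\frac{1}{2}-\frac{1}{2\pi}\arccos(\angle)\right]-2\left[\frac{1}{2}-\frac{1}{2\pi}\arccos(-\angle)\right],$$

so that, with $\arccos(-x)=\pi-\arccos(x)$ and $\arcsin(x)=\frac{\pi}{2}-\arccos(x)$ ,

$$\mathcal{I}=\frac{1}{\pi}\left[\pi-2\arccos(\angle)\right]=\frac{2}{\pi}\arcsin(\angle).$$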

The functions $\sigma(t)=\cos(t)$ and $\sigma(t)=\sin(t)$ .

Let us first consider $\sigma(t)=\cos(t)$ . We have here to evaluate

[TABLE]

which boils down to evaluating, for $d\in\{\tilde{a}+\tilde{b},\tilde{a}-\tilde{b},-\tilde{a}+\tilde{b},-\tilde{a}-\tilde{b}\}$ , the integral

[TABLE]

Altogether, we find

[TABLE]

For $\sigma(t)=\sin(t)$ , it suffices to appropriately adapt the signs in the expression of $\mathcal{I}$ (using the relation $\sin(t)=\frac{1}{2\imath}(e^{\imath t}-e^{-\imath t})$ ) to obtain in the end

[TABLE]

as desired.
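The two-dimensional reduction lends itself to a deterministic numerical check by Gauss–Hermite quadrature. The sketch below assumes the closed forms $e^{-\frac{1}{2}(\|a\|^{2}+\|b\|^{2})}\cosh(a^{\sf T}b)$ for $\cos$ and $e^{-\frac{1}{2}(\|a\|^{2}+\|b\|^{2})}\sinh(a^{\sf T}b)$ for $\sin$ , which follow from ${\rm E}[e^{\imath w^{\sf T}d}]=e^{-\frac{1}{2}\|d\|^{2}}$ and are our reading of Table 1. The vectors $\tilde{a}$ , $\tilde{b}$ and the helper `gaussian_expectation_2d` are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation_2d(f, order=40):
    """E[f(w1, w2)] for (w1, w2) ~ N(0, I_2), by tensorised Gauss-Hermite quadrature."""
    x, wts = hermegauss(order)                 # nodes/weights for the weight exp(-x^2 / 2)
    wts = wts / np.sqrt(2.0 * np.pi)           # normalise to the N(0, 1) density
    W1, W2 = np.meshgrid(x, x, indexing="ij")
    return float(np.sum(np.outer(wts, wts) * f(W1, W2)))

a = np.array([0.8, 0.0])                       # a~ = [a~_1, 0]
b = np.array([0.3, 0.9])                       # b~ = [b~_1, b~_2]
scale = np.exp(-0.5 * (a @ a + b @ b))

cos_num = gaussian_expectation_2d(
    lambda u, v: np.cos(a[0] * u + a[1] * v) * np.cos(b[0] * u + b[1] * v))
sin_num = gaussian_expectation_2d(
    lambda u, v: np.sin(a[0] * u + a[1] * v) * np.sin(b[0] * u + b[1] * v))

cos_closed = scale * np.cosh(a @ b)            # assumed Table 1 entry for cos
sin_closed = scale * np.sinh(a @ b)            # assumed Table 1 entry for sin
print(cos_num, cos_closed, sin_num, sin_closed)
```

Quadrature is preferred here over sampling because the integrands are smooth, so the agreement is limited only by floating-point precision.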

5.4 Polynomial $\sigma(\cdot)$ and generic $w$

In this section, we prove Equation 3.3 for $\sigma(t)=\zeta_{2}t^{2}+\zeta_{1}t+\zeta_{0}$ and $w\in{\mathbb{R}}^{p}$ a random vector with independent and identically distributed entries of zero mean and moment of order $k$ equal to $m_{k}$ . The result is based on standard combinatorics. We are to evaluate

[TABLE]

After development, it appears that one needs only assess, for say vectors $c,d\in{\mathbb{R}}^{p}$ that take values in $\{a,b\}$ , the moments

[TABLE]

where we recall the definition $(a^{2})=[a_{1}^{2},\ldots,a_{p}^{2}]^{\sf T}$ . Gathering all the terms for appropriate selections of $c,d$ leads to (3.3).
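The combinatorial moments can be verified exactly on a small discrete law. The formulas coded below are our own derivation under the stated assumptions (independent, identically distributed, zero-mean entries) and are not copied from the display above: ${\rm E}[(w^{\sf T}c)(w^{\sf T}d)]=m_{2}c^{\sf T}d$ , ${\rm E}[(w^{\sf T}c)^{2}(w^{\sf T}d)]=m_{3}(c^{2})^{\sf T}d$ and ${\rm E}[(w^{\sf T}c)^{2}(w^{\sf T}d)^{2}]=m_{4}(c^{2})^{\sf T}(d^{2})+m_{2}^{2}(\|c\|^{2}\|d\|^{2}+2(c^{\sf T}d)^{2}-3(c^{2})^{\sf T}(d^{2}))$ . The three-point distribution is an arbitrary choice with $m_{1}=0$ , $m_{2}=1$ , $m_{3}=1$ , $m_{4}=3$ .

```python
import itertools
import numpy as np

# Illustrative entry law: values and probabilities chosen so that m1 = 0 and m2 = 1
values = np.array([-1.0, 0.0, 2.0])
probs = np.array([1 / 3, 1 / 2, 1 / 6])
m = {k: float(np.sum(probs * values**k)) for k in range(1, 5)}

def exact_expectation(g, p):
    """E[g(w)] by enumerating all 3^p outcomes of w with i.i.d. entries."""
    total = 0.0
    for idx in itertools.product(range(len(values)), repeat=p):
        idx = list(idx)
        total += np.prod(probs[idx]) * g(values[idx])
    return total

rng = np.random.default_rng(5)
p = 4
c, d = rng.standard_normal(p), rng.standard_normal(p)
c2, d2 = c**2, d**2                       # entry-wise squares, (c^2) and (d^2)

second = exact_expectation(lambda w: (w @ c) * (w @ d), p)
third = exact_expectation(lambda w: (w @ c) ** 2 * (w @ d), p)
fourth = exact_expectation(lambda w: (w @ c) ** 2 * (w @ d) ** 2, p)

second_formula = m[2] * (c @ d)
third_formula = m[3] * (c2 @ d)
fourth_formula = (m[4] * (c2 @ d2)
                  + m[2] ** 2 * ((c @ c) * (d @ d) + 2 * (c @ d) ** 2 - 3 * (c2 @ d2)))
print(second, second_formula, third, third_formula, fourth, fourth_formula)
```

Enumeration is exact, so any mismatch would indicate an error in the index-pairing argument rather than sampling noise.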

5.5 Heuristic derivation of Conjecture 1

Conjecture 1 essentially follows as an aftermath of Remark 1. We believe that, similar to $\Sigma$ , $\hat{\Sigma}$ is expected to be of the form $\hat{\Sigma}=\hat{\Sigma}^{\circ}+\hat{\bar{\sigma}}1_{\hat{T}}^{\sf T}$ , where $\hat{\bar{\sigma}}={\rm E}[\sigma(w^{\sf T}\hat{X})]^{\sf T}$ , with $\|\frac{\hat{\Sigma}^{\circ}}{\sqrt{T}}\|\leq n^{\varepsilon}$ with high probability. Besides, if $X,\hat{X}$ were chosen as constituted of Gaussian mixture vectors, with non-trivial growth rate conditions as introduced in (Couillet and Benaych-Georges, 2016), it is easily seen that $\bar{\sigma}=c1_{p}+v$ and $\hat{\bar{\sigma}}=c1_{p}+\hat{v}$ , for some constant $c$ and $\|v\|,\|\hat{v}\|=O(1)$ .

This subsequently ensures that $\Phi_{X\hat{X}}$ and $\Phi_{\hat{X}\hat{X}}$ would be of a similar form $\Phi_{X\hat{X}}^{\circ}+\bar{\sigma}\hat{\bar{\sigma}}^{\sf T}$ and $\Phi_{\hat{X}\hat{X}}^{\circ}+\hat{\bar{\sigma}}\hat{\bar{\sigma}}^{\sf T}$ with $\Phi_{X\hat{X}}^{\circ}$ and $\Phi_{\hat{X}\hat{X}}^{\circ}$ of bounded norm. These facts, which would require more advanced proof techniques to establish, suggest the following heuristic derivation for Conjecture 1.

Recall that our interest is on the test performance $E_{\rm test}$ defined as

[TABLE]

which may be rewritten as

[TABLE]

If $\hat{\Sigma}=\hat{\Sigma}^{\circ}+\hat{\bar{\sigma}}1_{\hat{T}}^{\sf T}$ follows the aforementioned claimed operator norm control, reproducing the steps of Corollary 3 leads to a similar concentration for $E_{\rm test}$ , which we shall then admit. We are therefore left with evaluating ${\rm E}[Z_{2}]$ and ${\rm E}[Z_{3}]$ .

We start with the term ${\rm E}[Z_{2}]$ , which we expand as

[TABLE]

with $D=\operatorname{\rm diag}(\{\delta-\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i}\})$ , the operator norm of which is bounded by $n^{\varepsilon-\frac{1}{2}}$ with high probability. Now, observe that, again with the assumption that $\hat{\Sigma}=\hat{\Sigma}^{\circ}+\bar{\sigma}1_{\hat{T}}^{\sf T}$ with controlled $\hat{\Sigma}^{\circ}$ , $Z_{22}$ may be decomposed as

[TABLE]

In the display above, the first right-hand side term is now of order $O(n^{\varepsilon-\frac{1}{2}})$ . As for the second right-hand side term, note that $D\bar{\sigma}$ is a vector of independent and identically distributed zero mean and variance $O(n^{-1})$ entries; while not formally independent of $YQ\Sigma^{\sf T}$ , it is nonetheless expected that this dependence “weakens” asymptotically (a behavior observed several times in linear random matrix models), so that one expects by central limit arguments that the second right-hand side term is also of order $O(n^{\varepsilon-\frac{1}{2}})$ .

This would thus result in

[TABLE]

where we used $\|{\rm E}[Q_{-}]-\bar{Q}\|\leq Cn^{\varepsilon-\frac{1}{2}}$ and the definition $\Psi_{X\hat{X}}=\frac{n}{T}\frac{\Phi_{X\hat{X}}}{1+\delta}$ .

We then move on to ${\rm E}[Z_{3}]$ of Equation (12), which can be developed as

[TABLE]

In the term $Z_{32}$ , reproducing the proof of Lemma 1 with the condition $\|\hat{X}\|$ bounded, we obtain that $\frac{\hat{\sigma}_{i}^{\sf T}\hat{\sigma}_{i}}{\hat{T}}$ concentrates around $\frac{1}{\hat{T}}\operatorname{tr}\Phi_{\hat{X}\hat{X}}$ , which allows us to write

[TABLE]

with $D=\operatorname{\rm diag}(\{\frac{1}{\hat{T}}\hat{\sigma}_{i}^{\sf T}\hat{\sigma}_{i}-\frac{1}{\hat{T}}\operatorname{tr}\Phi_{\hat{X}\hat{X}}\}_{i=1}^{n})$ and thus $Z_{322}$ can be rewritten as

[TABLE]

while for $Z_{321}$ , following the same arguments as previously, we have

[TABLE]

where $D=\operatorname{\rm diag}(\{(1+\delta)^{2}-(1+\frac{1}{T}\sigma_{i}^{\sf T}Q_{-i}\sigma_{i})^{2}\}_{i=1}^{n})$ .

Since ${\rm E}[Q_{-}AQ_{-}]={\rm E}[QAQ]+O_{\|\cdot\|}(n^{\varepsilon-\frac{1}{2}})$ , we are free to plug in the asymptotic equivalent of ${\rm E}[QAQ]$ derived in Section 5.2.3, and we deduce

[TABLE]

The term $Z_{31}$ of the double sum over $i$ and $j$ ( $j\neq i$ ) requires more effort. To handle this term, we need to remove the dependence of $Q$ on both $\sigma_{i}$ and $\sigma_{j}$ in sequence. We start with $j$ as follows:

[TABLE]

where in the second-to-last inequality we used the relation

[TABLE]

For $Z_{311}$ , we replace $1+\frac{1}{T}\sigma_{j}^{\sf T}{Q_{-j}}\sigma_{j}$ by $1+\delta$ and take expectation over $w_{j}$

[TABLE]

The idea to handle $Z_{3112}$ is to retrieve forms of the type $\sum_{j=1}^{n}d_{j}\hat{\sigma}_{j}\sigma_{j}^{\sf T}=\hat{\Sigma}^{\sf T}D\Sigma$ for some $D$ satisfying $\|D\|\leq n^{\varepsilon-\frac{1}{2}}$ with high probability. To this end, we use

[TABLE]

and thus $Z_{3112}$ can be expanded as the sum of three terms that shall be studied in order:

[TABLE]

where $D=\operatorname{\rm diag}(\{\delta-\frac{1}{T}\sigma_{j}^{\sf T}Q_{-j}\sigma_{j}\}_{j=1}^{n})$ . First, $Z_{31121}$ is of order $O(n^{\varepsilon-\frac{1}{2}})$ since $Q\frac{\Sigma^{\sf T}\hat{\Sigma}}{T}$ is of bounded operator norm. Subsequently, $Z_{31122}$ can be rewritten as

[TABLE]

with here

[TABLE]

The same arguments apply for $Z_{31123}$ but for

[TABLE]

which completes the proof that $|Z_{3112}|\leq Cn^{\varepsilon-\frac{1}{2}}$ and thus

[TABLE]

It remains to handle $Z_{3111}$ . Under the same claims as above, we have

[TABLE]

where we introduced the notation $Q_{-ij}=(\frac{1}{T}\Sigma^{\sf T}\Sigma-\frac{1}{T}\sigma_{i}\sigma_{i}^{\sf T}-\frac{1}{T}\sigma_{j}\sigma_{j}^{\sf T}+\gamma I_{T})^{-1}$ . For $Z_{31111}$ , we replace $\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i}$ by $\delta$ , and take the expectation over $w_{i}$ , as follows

[TABLE]

with $Q_{--}$ having the same law as $Q_{-ij}$ , $D=\operatorname{\rm diag}(\{\delta-\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i}\}_{i=1}^{n})$ and $D^{\prime}=\operatorname{\rm diag}\left\{\frac{(\delta-\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i})\frac{1}{T}\operatorname{tr}\left(\Phi_{\hat{X}{X}}{Q_{-ij}}\Phi_{X\hat{X}}\right)}{(1-\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-j}}\sigma_{i})(1+\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i})}\right\}_{i=1}^{n}$ , both expected to be of order $O(n^{\varepsilon-\frac{1}{2}})$ . Using again the asymptotic equivalent of ${\rm E}[QAQ]$ devised in Section 5.2.3, we then have

[TABLE]

Following the same principle, we deduce for $Z_{31112}$ that

[TABLE]

with $D_{i}=\frac{1}{T}\operatorname{tr}\left(\Phi_{\hat{X}{X}}Q_{-ij}\Phi_{X\hat{X}}\right)\left[(1+\delta)^{2}-(1+\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i})^{2}\right]$ , also believed to be of order $O(n^{\varepsilon-\frac{1}{2}})$ . Recalling the fact that $Z_{311}=Z_{3111}+O(n^{\varepsilon-\frac{1}{2}})$ , we can thus conclude for $Z_{311}$ that

[TABLE]

As for $Z_{312}$ , we have

[TABLE]

Since $Q_{-j}\frac{1}{T}\Sigma_{-j}^{\sf T}\hat{\Sigma}_{-j}$ is expected to be of bounded norm, using the concentration inequality of the quadratic form $\frac{1}{T}\sigma_{j}^{\sf T}{Q_{-j}}\frac{\Sigma_{-j}^{\sf T}\hat{\Sigma}_{-j}}{T}\hat{\sigma}_{j}$ , we infer

[TABLE]

We again replace $\frac{1}{T}\sigma_{j}^{\sf T}{Q_{-j}}\sigma_{j}$ by $\delta$ and take expectation over $w_{j}$ to obtain

[TABLE]

with $D_{j}=(1+\delta)^{2}-(1+\frac{1}{T}\sigma_{j}^{\sf T}Q_{-j}\sigma_{j})^{2}=O(n^{\varepsilon-\frac{1}{2}})$ , which eventually brings the second term to vanish, and we thus get

[TABLE]

For the term $\frac{1}{T^{2}}\operatorname{tr}\left(Q_{-}\Sigma_{-}^{\sf T}\hat{\Sigma}_{-}\Phi_{\hat{X}{X}}\right)$ we apply again the concentration inequality to get

[TABLE]

with high probability, where $D=\operatorname{\rm diag}(\{\delta-\frac{1}{T}\sigma_{i}^{\sf T}{Q_{-ij}}\sigma_{i}\}_{i=1}^{n})$ , the norm of which is of order $O(n^{\varepsilon-\frac{1}{2}})$ . This entails

[TABLE]

with high probability. Once more plugging the asymptotic equivalent of ${\rm E}[QAQ]$ deduced in Section 5.2.3, we conclude for $Z_{312}$ that

[TABLE]

and eventually for $Z_{31}$

[TABLE]

Combining the estimates of ${\rm E}[Z_{2}]$ as well as $Z_{31}$ and $Z_{32}$ , we finally have the estimates for the test error defined in (12) as

[TABLE]

Since by definition, $\bar{Q}=\left(\Psi_{X}+\gamma{I_{T}}\right)^{-1}$ , we may use

[TABLE]

in the second term in brackets to finally retrieve the form of Conjecture 1.

6 Concluding Remarks

This article provides a possible direction of exploration of random matrices involving entry-wise non-linear transformations (here through the function $\sigma(\cdot)$ ), as typically found in modelling neural networks, by means of a concentration of measure approach. The main advantage of the method is that it leverages the concentration of an initial random vector $w$ (here a Lipschitz function of a Gaussian vector) to transfer concentration to all vectors $\sigma$ (or matrices $\Sigma$ ) that are Lipschitz functions of $w$ . This implies that Lipschitz functionals of $\sigma$ (or $\Sigma$ ) further satisfy concentration inequalities and thus, if the Lipschitz parameter scales with $n$ , convergence results as $n\to\infty$ . With this in mind, note that we could have generalized our input-output model $z=\beta^{\sf T}\sigma(Wx)$ of Section 2 to

[TABLE]

for $\sigma:{\mathbb{R}}^{p}\times\mathcal{P}\to{\mathbb{R}}^{n}$ with $\mathcal{P}$ some probability space and $\mathcal{W}\in\mathcal{P}$ a random variable such that $\sigma(x;\mathcal{W})$ and $\sigma(X;\mathcal{W})$ (where $\sigma(\cdot)$ is here applied column-wise) satisfy a concentration of measure phenomenon; it is not even necessary that $\sigma(X;\mathcal{W})$ has a normal concentration so long as the corresponding concentration function allows for appropriate convergence results. This generalized setting however has the drawback of being less explicit and less practical (as most neural networks involve linear maps $Wx$ rather than non-linear maps of $\mathcal{W}$ and $x$ ).

A much less demanding generalization though would consist in changing the vector $w\sim\mathcal{N}_{\varphi}(0,I_{p})$ for a vector $w$ still satisfying an exponential (not necessarily normal) concentration. This is the case notably if $w=\varphi(\tilde{w})$ with $\varphi(\cdot)$ a Lipschitz map with Lipschitz parameter bounded by, say, $\log(n)$ or any small enough power of $n$ . This would then allow for $w$ with heavier than Gaussian tails.

Despite its simplicity, the concentration method also has some strong limitations that presently do not allow for a sufficiently profound analysis of the testing mean square error. We believe that Conjecture 1 can be proved by means of more elaborate methods. Notably, we believe that the powerful Gaussian method advertised in (Pastur and Ŝerbina, 2011) which relies on Stein’s lemma and the Poincaré–Nash inequality could provide a refined control of the residual terms involved in the derivation of Conjecture 1. However, since Stein’s lemma (which states that ${\rm E}[x\phi(x)]={\rm E}[\phi^{\prime}(x)]$ for $x\sim\mathcal{N}(0,1)$ and differentiable polynomially bounded $\phi$ ) can only be used on products $x\phi(x)$ involving the linear component $x$ , the latter is not directly accessible; we nonetheless believe that appropriate ansätze of Stein’s lemma, adapted to the non-linear setting and currently under investigation, could be exploited.
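Stein’s lemma as quoted above, ${\rm E}[x\phi(x)]={\rm E}[\phi^{\prime}(x)]$ for $x\sim\mathcal{N}(0,1)$ , is easy to check numerically. The sketch below is only an illustration of the scalar identity, with two test functions chosen by us ( $\phi(x)=x^{3}$ and $\phi(x)=\tanh(x)$ ); it says nothing about the non-linear extensions discussed in the text. Expectations are computed by Gauss–Hermite quadrature, and the helper name `gauss_mean` is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for the weight exp(-x^2 / 2), normalised to the N(0, 1) density
x, wts = hermegauss(80)
wts = wts / np.sqrt(2.0 * np.pi)

def gauss_mean(f):
    """E[f(x)] for x ~ N(0, 1), approximated by Gauss-Hermite quadrature."""
    return float(np.sum(wts * f(x)))

# phi(x) = x^3: E[x * x^3] = E[x^4] = 3 and E[phi'(x)] = E[3 x^2] = 3
lhs_cubic = gauss_mean(lambda t: t * t**3)
rhs_cubic = gauss_mean(lambda t: 3.0 * t**2)

# phi(x) = tanh(x): phi'(x) = 1 - tanh(x)^2
lhs_tanh = gauss_mean(lambda t: t * np.tanh(t))
rhs_tanh = gauss_mean(lambda t: 1.0 - np.tanh(t) ** 2)

print(lhs_cubic, rhs_cubic, lhs_tanh, rhs_tanh)
```

The polynomial case is integrated exactly by the quadrature rule, while the $\tanh$ case is accurate only up to the quadrature error.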

As a striking example, one key advantage of such a tool would be the possibility to evaluate expectations of the type $Z={\rm E}[\sigma\sigma^{\sf T}(\frac{1}{T}\sigma^{\sf T}Q_{-}\sigma-\alpha)]$ which, in our present analysis, was shown to be bounded in the order of symmetric matrices by $\Phi Cn^{\varepsilon-\frac{1}{2}}$ with high probability. Thus, if no matrix (such as $\bar{Q}$ ) pre-multiplies $Z$ , since $\|\Phi\|$ can grow as large as $O(n)$ , $Z$ cannot be shown to vanish. But such a bound does not account for the fact that $\Phi$ would in general be unbounded because of the term $\bar{\sigma}\bar{\sigma}^{\sf T}$ in the display $\Phi=\bar{\sigma}\bar{\sigma}^{\sf T}+{\rm E}[(\sigma-\bar{\sigma})(\sigma-\bar{\sigma})^{\sf T}]$ , where $\bar{\sigma}={\rm E}[\sigma]$ . Intuitively, the “mean” contribution $\bar{\sigma}\bar{\sigma}^{\sf T}$ of $\sigma\sigma^{\sf T}$ , being post-multiplied in $Z$ by $\frac{1}{T}\sigma^{\sf T}Q_{-}\sigma-\alpha$ (which averages to zero), disappears, and thus only smaller order terms remain. We believe that the aforementioned ansätze for the Gaussian tools would be capable of subtly handling this self-averaging effect on $Z$ to prove that $\|Z\|$ vanishes (for $\sigma(t)=t$ , it is simple to show that $\|Z\|\leq Cn^{-1}$ ). In addition, Stein’s lemma-based methods only require the differentiability of $\sigma(\cdot)$ , which need not be Lipschitz, thereby allowing for a larger class of activation functions.
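The decomposition $\Phi=\bar{\sigma}\bar{\sigma}^{\sf T}+{\rm E}[(\sigma-\bar{\sigma})(\sigma-\bar{\sigma})^{\sf T}]$ and the dominance of its rank-one mean term can be illustrated by Monte Carlo. The sketch below rests on assumptions made only for illustration: Gaussian $w$, $\sigma(t)=\max(t,0)$, data columns normalized to unit norm, and arbitrary sizes `p`, `T`, `n_draws`.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, n_draws = 64, 128, 20000   # illustrative sizes, not from the paper

# Data with unit-norm columns, so that w^T x_j ~ N(0, 1) for Gaussian w.
X = rng.standard_normal((p, T))
X /= np.linalg.norm(X, axis=0, keepdims=True)

# Each row of S is one draw of sigma = sigma(w^T X)^T with sigma(t) = max(t, 0).
W = rng.standard_normal((n_draws, p))
S = np.maximum(W @ X, 0.0)

Phi_hat = S.T @ S / n_draws                  # Monte Carlo estimate of Phi
sigma_bar = S.mean(axis=0)                   # estimate of E[sigma]
mean_term = np.outer(sigma_bar, sigma_bar)   # rank-one "mean" contribution
centered = S - sigma_bar
cov_term = centered.T @ centered / n_draws   # centered (covariance) part

norm_mean = np.linalg.norm(mean_term, 2)     # operator norm of the mean term
norm_cov = np.linalg.norm(cov_term, 2)       # operator norm of the covariance
print(f"||mean term|| = {norm_mean:.2f}, ||covariance term|| = {norm_cov:.2f}")
```

Under these assumptions each entry of $\bar{\sigma}$ equals $1/\sqrt{2\pi}$, so $\|\bar{\sigma}\bar{\sigma}^{\sf T}\|=T/(2\pi)$ grows linearly with $T$, whereas the centered part stays of much smaller norm in this experiment.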

As suggested in the simulations of Figure 2, our results also seem to extend to discontinuous functions $\sigma(\cdot)$ . To date, we cannot envision a method that would tackle this setting.

In terms of neural network applications, the present article is merely a first step towards a better understanding of the “hardening” effect occurring in large dimensional networks with numerous samples and high-dimensional data points (that is, simultaneously large $n,p,T$ ), which we exemplified here through the convergence of mean-square errors. The mere fact that some standard performance measures of these random networks “freeze” as $n,p,T$ grow at the predicted regime, and that the performance heavily depends on the distribution of the random entries, is already in itself an interesting result for neural network understanding and dimensioning. However, more interesting questions remain open. Since neural networks are today dedicated to classification rather than regression, a first question is the study of the asymptotic statistics of the output $z=\beta^{\sf T}\sigma(Wx)$ itself; we believe that $z$ satisfies a central limit theorem with mean and covariance allowing for assessing the asymptotic misclassification rate.

A further extension of the present work would be to go beyond the single-layer network and include multiple layers (finitely many or possibly a number scaling with $n$ ) in the network design. The interest here would be on the key question of the best distribution of the number of neurons across the successive layers.

It is also classical in neural networks to introduce different (possibly random) biases at the neuron level, thereby turning $\sigma(t)$ into $\sigma(t+b)$ for a random variable $b$ different for each neuron. This has the effect of mitigating the negative impact of the mean ${\rm E}[\sigma(w_{i}^{\sf T}x_{j})]$ , which is independent of the neuron index $i$ .

Finally, neural networks, although recently shown to operate almost equally well when taken random in some very specific scenarios, are usually only initialized as random networks before being subsequently trained through backpropagation of the error on the training dataset (that is, essentially through convex gradient descent). We believe that our framework can allow for the understanding of at least finitely many steps of gradient descent, which may then provide further insights into the overall performance of deep learning networks.

Appendix A Intermediary Lemmas

This section recalls some elementary algebraic relations and identities used throughout the proof section.

Lemma 5 (Resolvent Identity).

For invertible matrices $A,B$ , $A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}$ .
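As a quick sanity check of the resolvent identity, the following sketch compares both sides on random matrices. The helper `random_spd` and the size `T = 6` are illustrative assumptions; symmetric positive definite matrices are used only as a convenient way to guarantee invertibility.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6   # illustrative size

def random_spd(size):
    """Random symmetric positive definite matrix (hence invertible)."""
    M = rng.standard_normal((size, size))
    return M @ M.T + size * np.eye(size)

A = random_spd(T)
B = random_spd(T)
A_inv = np.linalg.inv(A)
B_inv = np.linalg.inv(B)

lhs = A_inv - B_inv                  # A^{-1} - B^{-1}
rhs = A_inv @ (B - A) @ B_inv        # A^{-1} (B - A) B^{-1}
resolvent_gap = np.max(np.abs(lhs - rhs))
print(f"max |lhs - rhs| = {resolvent_gap:.2e}")
```

The identity follows directly from writing $A^{-1}(B-A)B^{-1}=A^{-1}BB^{-1}-A^{-1}AB^{-1}$.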

Lemma 6 (A rank- $1$ perturbation identity).

For $A$ Hermitian, $v$ a vector and $t\in{\mathbb{R}}$ , if $A$ and $A+tvv^{\sf T}$ are invertible, then

$$(A+tvv^{\sf T})^{-1}v=\frac{A^{-1}v}{1+tv^{\sf T}A^{-1}v}.$$
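The rank-1 perturbation identity can be checked numerically. The sketch below assumes its standard form, $(A+tvv^{\sf T})^{-1}v=A^{-1}v/(1+tv^{\sf T}A^{-1}v)$, and uses an arbitrary symmetric positive definite $A$ with $t>0$ so that both $A$ and $A+tvv^{\sf T}$ are invertible; sizes and seeds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, t = 5, 0.7   # illustrative size and perturbation weight

M = rng.standard_normal((T, T))
A = M @ M.T + np.eye(T)          # symmetric positive definite, so invertible
v = rng.standard_normal(T)

A_inv_v = np.linalg.solve(A, v)                     # A^{-1} v
lhs = np.linalg.solve(A + t * np.outer(v, v), v)    # (A + t v v^T)^{-1} v
rhs = A_inv_v / (1.0 + t * (v @ A_inv_v))           # A^{-1} v / (1 + t v^T A^{-1} v)
rank_one_gap = np.max(np.abs(lhs - rhs))
print(f"max |lhs - rhs| = {rank_one_gap:.2e}")
```

Identities of this type are what allow the contribution of a single row of $\Sigma$ to be isolated from the resolvent.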

Lemma 7 (Operator Norm Control).

For nonnegative definite $A$ and $z\in{\mathbb{C}}\setminus{\mathbb{R}}^{+}$ ,

[TABLE]

where ${\rm dist}(x,\mathcal{A})$ is the Hausdorff distance of a point to a set. In particular, for $\gamma>0$ , $\|(A+\gamma I_{T})^{-1}\|\leq\gamma^{-1}$ and $\|A(A+\gamma I_{T})^{-1}\|\leq 1$ .
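The two particular bounds for $\gamma>0$ can be observed on a small example. This is a sketch under illustrative assumptions: $A$ is built as a rank-deficient nonnegative definite matrix, and the values of `T` and `gamma` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
T, gamma = 8, 0.3   # illustrative size and regularization

M = rng.standard_normal((T, 3))
A = M @ M.T                        # nonnegative definite, rank 3 (singular)

Q = np.linalg.inv(A + gamma * np.eye(T))
norm_Q = np.linalg.norm(Q, 2)      # operator norm of (A + gamma I)^{-1}
norm_AQ = np.linalg.norm(A @ Q, 2) # operator norm of A (A + gamma I)^{-1}
print(f"||Q|| = {norm_Q:.4f} (bound {1.0 / gamma:.4f}), ||AQ|| = {norm_AQ:.4f} (bound 1)")
```

Both bounds come from the eigenvalues: those of $(A+\gamma I_{T})^{-1}$ are $1/(\lambda_{i}+\gamma)$ and those of $A(A+\gamma I_{T})^{-1}$ are $\lambda_{i}/(\lambda_{i}+\gamma)$, with $\lambda_{i}\geq 0$ the eigenvalues of $A$. Since $A$ is singular here, the first bound is attained up to rounding.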

Bibliography

1. Akhiezer, N. I. and Glazman, I. M. (1993). Theory of linear operators in Hilbert space. Courier Dover Publications.
2. Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large dimensional sample covariance matrices. The Annals of Probability 26, 316-345.
3. Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA systems in wireless communications. Annals of Applied Probability 17, 81-101.
4. Bai, Z. D. and Silverstein, J. W. (2009). Spectral analysis of large dimensional random matrices, second ed. Springer Series in Statistics, New York, NY, USA.
5. Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis 111, 120-135.
6. Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model for affective analogical reasoning. Neurocomputing 149, 443-455.
7. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and Le Cun, Y. (2015). The Loss Surfaces of Multilayer Networks. In AISTATS.
8. Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics 10, 1393-1454.
