TL;DR
This paper analyzes the spectral properties of Gram random matrices in neural networks using random matrix theory, providing deterministic equivalents and insights into network performance and hyperparameter tuning.
Contribution
It introduces a novel random matrix model for neural networks and derives deterministic equivalents for spectral measures, aiding understanding and optimization of random neural networks.
Findings
Deterministic equivalents for spectral measures of neural network matrices
Insights into asymptotic performance of single-layer random neural networks
Practical methods for hyperparameter tuning based on spectral analysis
A Random Matrix Approach to Neural Networks
Cosme Louart, Zhenyu Liao, and Romain Couillet
CentraleSupélec, University of Paris–Saclay, France.
Abstract
This article studies the Gram random matrix model $G = \frac{1}{T}\Sigma^{\mathsf T}\Sigma$, $\Sigma = \sigma(WX)$, classically found in the analysis of random feature maps and random neural networks, where $X = [x_1, \ldots, x_T] \in \mathbb{R}^{p \times T}$ is a (data) matrix of bounded norm, $W \in \mathbb{R}^{n \times p}$ is a matrix of independent zero-mean unit variance entries, and $\sigma : \mathbb{R} \to \mathbb{R}$ is a Lipschitz continuous (activation) function, $\sigma(WX)$ being understood entry-wise. By means of a key concentration of measure lemma arising from non-asymptotic random matrix arguments, we prove that, as $n, p, T$ grow large at the same rate, the resolvent $Q = (G + \gamma I_T)^{-1}$, for $\gamma > 0$, has a similar behavior as that met in sample covariance matrix models, involving notably the moment $\Phi = \frac{T}{n} E[G]$, which provides in passing a deterministic equivalent for the empirical spectral measure of $G$. Application-wise, this result enables the estimation of the asymptotic performance of single-layer random neural networks. This in turn provides practical insights into the underlying mechanisms at play in random neural networks, entailing several unexpected consequences, as well as a fast practical means to tune the network hyperparameters.
MSC subject classifications: 60B20, 62M45.
Couillet's work is supported by the ANR Project RMT4GRAPH (ANR-14-CE28-0006).
1 Introduction
Artificial neural networks, developed in the late fifties (Rosenblatt, 1958) in an attempt to develop machines capable of brain-like behaviors, enjoy today an unprecedented research interest, notably in their applications to computer vision and machine learning at large (Krizhevsky, Sutskever and Hinton, 2012; Schmidhuber, 2015), where superhuman performance on specific tasks is now commonly achieved. Recent progress in neural network performance however finds its source in the processing power of modern computers as well as in the availability of large datasets rather than in the development of new mathematics. In fact, for lack of appropriate tools to understand the theoretical behavior of the non-linear activations and deterministic data dependence underlying these networks, the discrepancy between mathematical and practical (heuristic) studies of neural networks has kept widening. A first salient problem in harnessing neural networks lies in their being completely designed upon a deterministic training dataset $(X, Y)$, so that their resulting performance intricately depends first and foremost on $(X, Y)$. Recent works have nonetheless established that, when smartly designed, mere randomly connected neural networks can achieve performances close to those reached by entirely data-driven network designs (Rahimi and Recht, 2007; Saxe et al., 2011). As a matter of fact, to handle gigantic databases, the computationally expensive learning phase (the so-called backpropagation of the error method) typical of deep neural network structures becomes impractical, while it was recently shown that smartly designed single-layer random networks (as studied presently) can already reach superhuman capabilities (Cambria et al., 2015) and beat expert knowledge in specific fields (Jaeger and Haas, 2004). These various findings have opened the road to the study of neural networks by means of statistical and probabilistic tools (Choromanska et al., 2015; Giryes, Sapiro and Bronstein, 2015).
The second problem relates to the non-linear activation functions present at each neuron, which have long been known (as opposed to linear activations) to help design universal approximators for any input-output target map (Hornik, Stinchcombe and White, 1989).
In this work, we propose an original random matrix-based approach to understand the end-to-end regression performance of single-layer random artificial neural networks, sometimes referred to as extreme learning machines (Huang, Zhu and Siew, 2006; Huang et al., 2012), when the number $T$ and size $p$ of the input dataset are large and scale proportionally with the number $n$ of neurons in the network. These networks can also be seen, from a more immediate statistical viewpoint, as a mere linear ridge-regressor relating a random feature map $\sigma(WX) \in \mathbb{R}^{n \times T}$ of explanatory variables $X = [x_1, \ldots, x_T] \in \mathbb{R}^{p \times T}$ and target variables $Y = [y_1, \ldots, y_T] \in \mathbb{R}^{d \times T}$, for a randomly designed matrix $W \in \mathbb{R}^{n \times p}$ and a non-linear function $\sigma : \mathbb{R} \to \mathbb{R}$ (applied component-wise). Our approach has several interesting features both for theoretical and practical considerations. It is first one of the few known attempts to move the random matrix realm away from matrices with independent or linearly dependent entries. Notable exceptions are the line of works surrounding kernel random matrices (El Karoui, 2010; Couillet and Benaych-Georges, 2016) as well as large dimensional robust statistics models (Couillet, Pascal and Silverstein, 2015; El Karoui, 2013; Zhang, Cheng and Singer, 2014). Here, to alleviate the non-linear difficulty, we exploit concentration of measure arguments (Ledoux, 2005) for non-asymptotic random matrices, thereby pushing further the original ideas of (El Karoui, 2009; Vershynin, 2012) established for simpler random matrix models. While we believe that more powerful, albeit more computationally intensive, tools (such as an appropriate adaptation of the Gaussian tools advocated in (Pastur and Ŝerbina, 2011)) cannot be avoided to handle advanced considerations in neural networks, we demonstrate here that the concentration of measure phenomenon allows one to fully characterize the main quantities at the heart of the single-layer regression problem at hand.
In terms of practical applications, our findings shed light on the still incompletely understood extreme learning machines, which have proved extremely efficient in handling machine learning problems involving large to huge datasets (Huang et al., 2012; Cambria et al., 2015) at a computationally affordable cost. But our objective is also to pave the path to the understanding of more involved neural network structures, featuring notably multiple layers and some steps of learning by means of backpropagation of the error.
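Our main contribution is twofold. From a theoretical perspective, we first obtain a key lemma, Lemma 1, on the concentration of quadratic forms of the type $\sigma(w^{\mathsf T} X) A \sigma(X^{\mathsf T} w)$ where $w = \varphi(\tilde w)$, $\tilde w \sim \mathcal{N}(0, I_p)$, with $\varphi : \mathbb{R} \to \mathbb{R}$ and $\sigma : \mathbb{R} \to \mathbb{R}$ Lipschitz functions, and $X \in \mathbb{R}^{p \times T}$, $A \in \mathbb{R}^{T \times T}$ are deterministic matrices. This non-asymptotic result (valid for all $n, p, T$) is then exploited under a simultaneous growth regime for $n, p, T$ and boundedness conditions on $\|X\|$ and $\|A\|$ to obtain, in Theorem 1, a deterministic approximation $\bar Q$ of the resolvent $E[Q]$, where $Q = (\frac{1}{T}\Sigma^{\mathsf T}\Sigma + \gamma I_T)^{-1}$, $\gamma > 0$, $\Sigma = \sigma(WX)$, for some $W = \varphi(\tilde W)$, $\tilde W \in \mathbb{R}^{n \times p}$ having independent $\mathcal{N}(0,1)$ entries. As the resolvent of a matrix (or operator) is an important proxy for the characterization of its spectrum (see e.g., (Pastur and Ŝerbina, 2011; Akhiezer and Glazman, 1993)), this result therefore allows for the characterization of the asymptotic spectral properties of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$, such as its limiting spectral measure in Theorem 2.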
Application-wise, the theoretical findings are an important preliminary step for the understanding and improvement of various statistical methods based on random features in the large dimensional regime. Specifically, here, we consider the question of linear ridge-regression from random feature maps, which coincides with the aforementioned single hidden-layer random neural network known as extreme learning machine. We show that, under mild conditions, both the training and testing mean-square errors, respectively corresponding to the regression errors on known input-output pairs $(x_1, y_1), \ldots, (x_T, y_T)$ (with $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}^d$) and unknown pairings $(\hat x_1, \hat y_1), \ldots, (\hat x_{\hat T}, \hat y_{\hat T})$, almost surely converge to deterministic limiting values as $n, p, T$ grow large at the same rate (while $d$ is kept constant) for every fixed ridge-regression parameter $\gamma > 0$.
Simulations on real image datasets are provided that corroborate our results.
These findings provide new insights into the roles played by the activation function $\sigma(\cdot)$ and the random distribution of the entries of $W$ in random feature maps as well as by the ridge-regression parameter $\gamma$ in the neural network performance. We notably exhibit and prove some peculiar behaviors, such as the impossibility for the network to carry out elementary Gaussian mixture classification tasks, when either the activation function or the random weights distribution is ill chosen.
Besides, for the practitioner, the theoretical formulas retrieved in this work allow for a fast offline tuning of the aforementioned hyperparameters of the neural network, notably when is not too large compared to . The graphical results provided in the course of the article were particularly obtained within a - to -fold gain in computation time between theory and simulations.
The remainder of the article is structured as follows: in Section 2, we introduce the mathematical model of the system under investigation. Our main results are then described and discussed in Section 3, the proofs of which are deferred to Section 5. Section 4 discusses our main findings. The article closes on concluding remarks on envisioned extensions of the present work in Section 6. The appendix provides some intermediary lemmas of constant use throughout the proof section.
Reproducibility: Python 3 codes used to produce the results of Section 4 are available at https://github.com/Zhenyu-LIAO/RMT4ELM
Notations: The norm $\|\cdot\|$ is understood as the Euclidean norm for vectors and the operator norm for matrices, while the norm $\|\cdot\|_F$ is the Frobenius norm for matrices. All vectors in the article are understood as column vectors.
2 System Model
We consider a ridge-regression task on random feature maps defined as follows. Each input datum $x \in \mathbb{R}^p$ is multiplied by a matrix $W \in \mathbb{R}^{n \times p}$; a non-linear function $\sigma : \mathbb{R} \to \mathbb{R}$ is then applied entry-wise to the vector $Wx$, thereby providing a set of $n$ random features $\sigma(Wx) \in \mathbb{R}^n$ for each datum $x \in \mathbb{R}^p$. The output $z \in \mathbb{R}^d$ of the linear regression is the inner product $z = \beta^{\mathsf T}\sigma(Wx)$ for some matrix $\beta \in \mathbb{R}^{n \times d}$ to be designed.
From a neural network viewpoint, the $n$ neurons of the network are the virtual units operating the mapping $W_{i\cdot}x \mapsto \sigma(W_{i\cdot}x)$ ($W_{i\cdot}$ being the $i$-th row of $W$), for $1 \le i \le n$. The neural network then operates in two phases: a training phase where the regression matrix $\beta$ is learned based on a known input-output dataset pair $(X, Y)$ and a testing phase where, for $\beta$ now fixed, the network operates on a new input dataset $\hat X$ with corresponding unknown output $\hat Y$.
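During the training phase, based on a set of known input $X = [x_1, \ldots, x_T] \in \mathbb{R}^{p \times T}$ and output $Y = [y_1, \ldots, y_T] \in \mathbb{R}^{d \times T}$ datasets, the matrix $\beta$ is chosen so as to minimize the mean square error $\frac{1}{T}\sum_{i=1}^T \|z_i - y_i\|^2 + \gamma\|\beta\|_F^2$, where $z_i = \beta^{\mathsf T}\sigma(Wx_i)$ and $\gamma > 0$ is some regularization factor. Solving for $\beta$, this leads to the explicit ridge-regressor
$$\beta = \frac{1}{T}\Sigma\left(\frac{1}{T}\Sigma^{\mathsf T}\Sigma + \gamma I_T\right)^{-1} Y^{\mathsf T} \tag{1}$$
where we defined $\Sigma \equiv \sigma(WX)$. This follows from differentiating the mean square error along $\beta$ to obtain $0 = \gamma\beta + \frac{1}{T}\sum_{i=1}^T \sigma(Wx_i)\left(\beta^{\mathsf T}\sigma(Wx_i) - y_i\right)^{\mathsf T}$, so that $\left(\frac{1}{T}\Sigma\Sigma^{\mathsf T} + \gamma I_n\right)\beta = \frac{1}{T}\Sigma Y^{\mathsf T}$ which, along with $\left(\frac{1}{T}\Sigma\Sigma^{\mathsf T} + \gamma I_n\right)^{-1}\Sigma = \Sigma\left(\frac{1}{T}\Sigma^{\mathsf T}\Sigma + \gamma I_T\right)^{-1}$, gives the result.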
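The derivation of the ridge-regressor can be checked numerically. The following is a minimal sketch, assuming the notation $\Sigma = \sigma(WX) \in \mathbb{R}^{n \times T}$, $Y \in \mathbb{R}^{d \times T}$ and ridge parameter $\gamma$; the toy sizes, the ReLU activation and the variable names are illustrative choices of this sketch, not taken from the article's code. It compares the $T \times T$ and $n \times n$ forms of the solution and verifies that the gradient of the regularized objective vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T, d = 40, 10, 60, 2          # toy sizes: neurons, input dim, samples, output dim
gamma = 0.5                          # ridge parameter

X = rng.standard_normal((p, T)) / np.sqrt(p)   # inputs, one datum per column
Y = rng.standard_normal((d, T))                # targets, one per column
W = rng.standard_normal((n, p))                # random (untrained) weights
Sigma = np.maximum(W @ X, 0.0)                 # n x T feature matrix, ReLU activation

# T x T form: beta = (1/T) Sigma (Sigma^T Sigma / T + gamma I_T)^{-1} Y^T
beta_T = Sigma @ np.linalg.solve(Sigma.T @ Sigma / T + gamma * np.eye(T), Y.T) / T

# n x n form read off the normal equations
# (Sigma Sigma^T / T + gamma I_n) beta = Sigma Y^T / T
beta_n = np.linalg.solve(Sigma @ Sigma.T / T + gamma * np.eye(n), Sigma @ Y.T / T)

# Gradient (up to a factor 2) of (1/T) sum_i ||beta^T sigma_i - y_i||^2 + gamma ||beta||_F^2
grad = gamma * beta_T + Sigma @ (Sigma.T @ beta_T - Y.T) / T

print(np.allclose(beta_T, beta_n), np.abs(grad).max())
```

The $T \times T$ form is the one that lets the resolvent of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$ appear; in practice one would solve whichever system is smaller.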
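In the remainder, we will also denote
$$Q \equiv \left(\frac{1}{T}\Sigma^{\mathsf T}\Sigma + \gamma I_T\right)^{-1}$$
the resolvent of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$. The matrix $Q$ naturally appears as a key quantity in the performance analysis of the neural network. Notably, the mean-square error $E_{\rm train}$ on the training dataset $X$ is given by
$$E_{\rm train} = \frac{1}{T}\left\|Y^{\mathsf T} - \Sigma^{\mathsf T}\beta\right\|_F^2 = \frac{\gamma^2}{T}\operatorname{tr} Y^{\mathsf T} Y Q^2.$$
Under the growth rate assumptions on $n, p, T$ taken below, it shall appear that the random variable $E_{\rm train}$ concentrates around its mean, letting then $E[Q^2]$ appear as a central object in the asymptotic evaluation of $E_{\rm train}$.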
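The link between the training error and the resolvent follows from $Y^{\mathsf T} - \Sigma^{\mathsf T}\beta = \gamma Q Y^{\mathsf T}$. A short numerical check is sketched below, assuming the notation $Q = (\frac{1}{T}\Sigma^{\mathsf T}\Sigma + \gamma I_T)^{-1}$; the sizes and the tanh activation are arbitrary choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, d = 30, 8, 50, 3
gamma = 0.3

X = rng.standard_normal((p, T)) / np.sqrt(p)
Y = rng.standard_normal((d, T))
W = rng.standard_normal((n, p))
Sigma = np.tanh(W @ X)                                   # any entry-wise activation works here

Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))   # resolvent
beta = Sigma @ Q @ Y.T / T                                    # ridge-regressor

# Direct definition of the training mean-square error
E_train_direct = np.linalg.norm(Y.T - Sigma.T @ beta) ** 2 / T
# Resolvent form: (gamma^2 / T) tr(Y^T Y Q^2)
E_train_resolvent = gamma ** 2 / T * np.trace(Y.T @ Y @ Q @ Q)

print(E_train_direct, E_train_resolvent)
```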
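The testing phase of the neural network is more interesting in practice as it unveils the actual performance of neural networks. For a test dataset $\hat X \in \mathbb{R}^{p \times \hat T}$ of length $\hat T$, with unknown output $\hat Y \in \mathbb{R}^{d \times \hat T}$, the test mean-square error is defined by
$$E_{\rm test} = \frac{1}{\hat T}\left\|\hat Y^{\mathsf T} - \hat\Sigma^{\mathsf T}\beta\right\|_F^2$$
where $\hat\Sigma = \sigma(W\hat X)$ and $\beta$ is the same as used in (1) (and thus only depends on $(X, Y)$ and $\gamma$). One of the key questions in the analysis of such an elementary neural network lies in the determination of $\gamma$ which minimizes $E_{\rm test}$ (and is thus said to have good generalization performance). Notably, small $\gamma$ values are known to reduce $E_{\rm train}$ but to induce the popular overfitting issue which generally increases $E_{\rm test}$, while large $\gamma$ values engender both large values for $E_{\rm train}$ and $E_{\rm test}$.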
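The trade-off in the regularization parameter can be illustrated by sweeping it on synthetic data. This is a minimal sketch under assumptions that are not in the article: the labels come from a hypothetical linear "teacher" direction `u`, the activation is ReLU, and `train_test_mse` is a helper name introduced here.

```python
import numpy as np

def train_test_mse(Sigma, Y, Sigma_hat, Y_hat, gamma):
    """Return (E_train, E_test) for the ridge regressor fitted on (Sigma, Y)."""
    T, T_hat = Sigma.shape[1], Sigma_hat.shape[1]
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    beta = Sigma @ Q @ Y.T / T
    e_train = np.linalg.norm(Y.T - Sigma.T @ beta) ** 2 / T
    e_test = np.linalg.norm(Y_hat.T - Sigma_hat.T @ beta) ** 2 / T_hat
    return e_train, e_test

rng = np.random.default_rng(1)
n, p, T, T_hat = 100, 20, 80, 80
u = rng.standard_normal(p)                       # hypothetical teacher direction
X = rng.standard_normal((p, T)) / np.sqrt(p)
X_hat = rng.standard_normal((p, T_hat)) / np.sqrt(p)
Y = np.sign(u @ X)[None, :]                      # d = 1, labels in {-1, +1}
Y_hat = np.sign(u @ X_hat)[None, :]
W = rng.standard_normal((n, p))
Sigma, Sigma_hat = np.maximum(W @ X, 0), np.maximum(W @ X_hat, 0)

gammas = np.logspace(-4, 2, 13)
errors = np.array([train_test_mse(Sigma, Y, Sigma_hat, Y_hat, g) for g in gammas])
for g, (e_tr, e_te) in zip(gammas, errors):
    print(f"gamma={g:8.1e}  E_train={e_tr:.4f}  E_test={e_te:.4f}")
```

The training error is a non-decreasing function of the regularization parameter, whereas the test error need not be monotone, which is why its minimizer is the quantity of practical interest.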
From a mathematical standpoint though, the study of $E_{\rm test}$ brings forward some technical difficulties that do not allow for as simple a treatment through the present concentration of measure methodology as the study of $E_{\rm train}$. Nonetheless, the analysis of $E_{\rm train}$ allows at least for heuristic approaches to become available, which we shall exploit to propose an asymptotic deterministic approximation for $E_{\rm test}$.
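From a technical standpoint, we shall make the following set of assumptions on the mapping $x \mapsto \sigma(Wx)$.
Assumption 1 (Subgaussian $W$).
The matrix $W$ is defined by
$$W = \varphi(\tilde W)$$
(understood entry-wise), where $\tilde W$ has independent and identically distributed $\mathcal{N}(0,1)$ entries and $\varphi(\cdot)$ is $\lambda_\varphi$-Lipschitz.
For $a = \varphi(b) \in \mathbb{R}^\ell$, $\ell \ge 1$, with $b \sim \mathcal{N}(0, I_\ell)$, we shall subsequently denote $a \sim \mathcal{N}_\varphi(0, I_\ell)$.
Under the notations of Assumption 1, we have in particular $W_{ij} \sim \mathcal{N}(0,1)$ if $\varphi(t) = t$ and $W_{ij} \sim \mathcal{U}(-1,1)$ (the uniform distribution on $[-1,1]$) if $\varphi(t) = -1 + 2\frac{1}{\sqrt{2\pi}}\int_t^\infty e^{-x^2/2}\,dx$ ($\varphi$ is here a $\sqrt{2/\pi}$-Lipschitz map).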
We further need the following regularity condition on the function $\sigma$.
Assumption 2 (Function $\sigma$).
The function $\sigma$ is Lipschitz continuous with parameter $\lambda_\sigma$.
This assumption holds for many of the activation functions traditionally considered in neural networks, such as sigmoid functions, the rectified linear unit $\sigma(t) = \max(t, 0)$, or the absolute value operator.
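When considering the interesting case of simultaneously large data and random features (or neurons), we shall then make the following growth rate assumptions.
Assumption 3 (Growth Rate).
As $n \to \infty$,
$$0 < \liminf_n \min\{p/n, T/n\} \le \limsup_n \max\{p/n, T/n\} < \infty$$
while $\gamma, \lambda_\sigma, \lambda_\varphi > 0$ and $d$ are kept constant. In addition,
$$\limsup_n \|X\| < \infty, \qquad \limsup_n \max_{ij} |Y_{ij}| < \infty.$$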
3 Main Results
3.1 Main technical results and training performance
As a standard preliminary step in the asymptotic random matrix analysis of the expectation $E[Q]$ of the resolvent $Q$, a convergence of quadratic forms based on the row vectors of $\Sigma$ is necessary (see e.g., (Marc̆enko and Pastur, 1967; Silverstein and Bai, 1995)). Such results are usually obtained by exploiting the independence (or linear dependence) in the vector entries. This not being the case here, as the entries of the vector $\sigma(X^{\mathsf T}w)$ are in general not independent, we resort to a concentration of measure approach, as advocated in (El Karoui, 2009). The following lemma, stated here in a non-asymptotic random matrix regime (that is, without necessarily resorting to Assumption 3), and thus of independent interest, provides this concentration result. For this lemma, we need first to define the following key matrix
$$\Phi \equiv E\left[\sigma(w^{\mathsf T}X)^{\mathsf T}\sigma(w^{\mathsf T}X)\right] \tag{2}$$
of size $T \times T$, where $w \sim \mathcal{N}_\varphi(0, I_p)$.
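Lemma 1 (Concentration of quadratic forms).
Let Assumptions 1–2 hold. Let also $A \in \mathbb{R}^{T \times T}$ such that $\|A\| \le 1$ and, for $X \in \mathbb{R}^{p \times T}$ and $w \sim \mathcal{N}_\varphi(0, I_p)$, define the random vector $\sigma \equiv \sigma(w^{\mathsf T}X)^{\mathsf T} \in \mathbb{R}^T$. Then,
$$P\left(\left|\frac{1}{T}\sigma^{\mathsf T}A\sigma - \frac{1}{T}\operatorname{tr}\Phi A\right| > t\right) \le C e^{-\frac{cT}{\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}\min\left(\frac{t^2}{t_0^2},\, t\right)}$$
for $t_0 \equiv |\sigma(0)| + \lambda_\varphi\lambda_\sigma\|X\|\sqrt{p/T}$ and $C, c > 0$ independent of all other parameters. In particular, under the additional Assumption 3,
$$P\left(\left|\frac{1}{T}\sigma^{\mathsf T}A\sigma - \frac{1}{T}\operatorname{tr}\Phi A\right| > t\right) \le C e^{-cn\min(t,\, t^2)}$$
for some $C, c > 0$.
Note that this lemma partially extends concentration of measure results involving quadratic forms, see e.g., (Rudelson et al., 2013, Theorem 1.1), to non-linear vectors.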
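The concentration of the quadratic form around its mean can be observed by simulation. The sketch below makes several assumptions that are not part of the lemma: standard Gaussian $w$, ReLU activation (for which the expected Gram matrix has the known closed form of the degree-one arc-cosine kernel), a synthetic $X$ with $p = T/2$ so that its operator norm stays bounded, and a random symmetric $A$ of unit operator norm. The helper names are introduced here for illustration.

```python
import numpy as np

def relu_kernel(X):
    """Closed-form E[relu(w^T X)^T relu(w^T X)] for w ~ N(0, I_p) (arc-cosine kernel)."""
    norms = np.linalg.norm(X, axis=0)
    scale = np.outer(norms, norms)
    c = np.clip(X.T @ X / scale, -1.0, 1.0)          # cosines of the angles between columns
    return scale * (c * np.arccos(-c) + np.sqrt(1.0 - c ** 2)) / (2.0 * np.pi)

def quad_form_deviations(T, n_draws, rng):
    """Deviations (1/T) s^T A s - (1/T) tr(Phi A) over independent draws of w."""
    p = T // 2                                        # p grows with T
    X = rng.standard_normal((p, T)) / np.sqrt(p)
    A = rng.standard_normal((T, T))
    A = (A + A.T) / 2
    A /= np.linalg.norm(A, 2)                         # operator norm equal to one
    target = np.trace(relu_kernel(X) @ A) / T
    w_draws = rng.standard_normal((n_draws, p))       # one draw of w per row
    S = np.maximum(w_draws @ X, 0.0)                  # rows are relu(w^T X)
    quad = np.sum((S @ A) * S, axis=1) / T
    return quad - target

rng = np.random.default_rng(2)
dev_small = quad_form_deviations(T=50, n_draws=400, rng=rng)
dev_large = quad_form_deviations(T=800, n_draws=400, rng=rng)
print("std of deviation, T=50 :", dev_small.std())
print("std of deviation, T=800:", dev_large.std())
```

The spread of the deviations shrinks as the dimensions grow together, in line with the exponential bound; keeping $p$ fixed while $T$ grows would let $\|X\|$ diverge and the bound would no longer tighten.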
With this result in place, the standard resolvent approaches of random matrix theory apply, providing our main theoretical finding as follows.
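Theorem 1 (Asymptotic equivalent for $E[Q]$).
Let Assumptions 1–3 hold and define $\bar Q$ as
$$\bar Q \equiv \left(\frac{n}{T}\frac{\Phi}{1+\delta} + \gamma I_T\right)^{-1}$$
where $\delta$ is implicitly defined as the unique positive solution to $\delta = \frac{1}{T}\operatorname{tr}\Phi\bar Q$. Then, for all $\varepsilon > 0$, there exists $c > 0$ such that
$$\left\|E[Q] - \bar Q\right\| \le c\, n^{-\frac{1}{2}+\varepsilon}.$$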
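The implicit definition of the deterministic equivalent lends itself to a simple fixed-point iteration. The sketch below assumes the form $\bar Q = (\frac{n}{T}\frac{\Phi}{1+\delta} + \gamma I_T)^{-1}$ with $\delta = \frac{1}{T}\operatorname{tr}\Phi\bar Q$, Gaussian weights and ReLU activation (so that $\Phi$ is the arc-cosine kernel); the plain iteration from $\delta = 0$ and the helper names are choices of this sketch, not the authors' implementation. It compares the normalized trace of $\bar Q$ with a Monte Carlo average of the normalized trace of $Q$.

```python
import numpy as np

def relu_kernel(X):
    """Closed-form Phi for ReLU activation and w ~ N(0, I_p)."""
    norms = np.linalg.norm(X, axis=0)
    scale = np.outer(norms, norms)
    c = np.clip(X.T @ X / scale, -1.0, 1.0)
    return scale * (c * np.arccos(-c) + np.sqrt(1.0 - c ** 2)) / (2.0 * np.pi)

def solve_delta(phi_eigs, n, gamma, tol=1e-13, max_iter=100_000):
    """Iterate delta <- (1/T) tr(Phi Q_bar(delta)), written on the eigenvalues of Phi."""
    T = phi_eigs.size
    delta = 0.0
    for _ in range(max_iter):
        new = np.mean(phi_eigs / (n / T * phi_eigs / (1.0 + delta) + gamma))
        if abs(new - delta) <= tol * (1.0 + abs(new)):
            return new
        delta = new
    raise RuntimeError("fixed-point iteration did not converge")

rng = np.random.default_rng(3)
n, p, T, gamma = 300, 100, 200, 1.0
X = rng.standard_normal((p, T)) / np.sqrt(p)

Phi = relu_kernel(X)
eigs, U = np.linalg.eigh(Phi)
delta = solve_delta(eigs, n, gamma)
Q_bar = (U / (n / T * eigs / (1.0 + delta) + gamma)) @ U.T   # U diag(1/(psi + gamma)) U^T

# Monte Carlo estimate of (1/T) tr E[Q] over a few draws of W
traces = []
for _ in range(20):
    S = np.maximum(rng.standard_normal((n, p)) @ X, 0.0)
    traces.append(np.trace(np.linalg.inv(S.T @ S / T + gamma * np.eye(T))) / T)
tr_emp, tr_det = np.mean(traces), np.trace(Q_bar) / T
print("delta =", delta, " (1/T) tr Q:", tr_emp, " (1/T) tr Q_bar:", tr_det)
```

A single draw of $Q$ is not close to $\bar Q$ in operator norm; the statement concerns the expectation, which is why the comparison is made on an averaged scalar functional.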
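As a corollary of Theorem 1 along with a concentration argument on $\frac{1}{T}\operatorname{tr} Q$, we have the following result on the spectral measure of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$, which may be seen as a non-linear extension of (Silverstein and Bai, 1995) for which $\sigma(t) = t$.
Theorem 2 (Limiting spectral measure of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$).
Let Assumptions 1–3 hold and, for $\lambda_1, \ldots, \lambda_T$ the eigenvalues of $\frac{1}{T}\Sigma^{\mathsf T}\Sigma$, define $\mu_n = \frac{1}{T}\sum_{i=1}^T \boldsymbol{\delta}_{\lambda_i}$. Then, for every bounded continuous function $f$, with probability one
$$\int f\, d\mu_n - \int f\, d\bar\mu_n \to 0$$
where $\bar\mu_n$ is the measure defined through its Stieltjes transform $m_{\bar\mu_n}(z) \equiv \int (t - z)^{-1}\, d\bar\mu_n(t)$ given, for $z \in \{w \in \mathbb{C},\ \Im[w] > 0\}$, by
$$m_{\bar\mu_n}(z) = \frac{1}{T}\operatorname{tr}\left(\frac{n}{T}\frac{\Phi}{1+\delta_z} - zI_T\right)^{-1}$$
with $\delta_z$ the unique solution in $\{w \in \mathbb{C},\ \Im[w] > 0\}$ of
$$\delta_z = \frac{1}{T}\operatorname{tr}\Phi\left(\frac{n}{T}\frac{\Phi}{1+\delta_z} - zI_T\right)^{-1}.$$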
Note that has a well-known form, already met in early random matrix works (e.g., (Silverstein and Bai, 1995)) on sample covariance matrix models. Notably, is also the deterministic equivalent of the empirical spectral measure of for any deterministic matrix such that . As such, to some extent, the results above provide a consistent asymptotic linearization of . From standard spiked model arguments (see e.g., (Benaych-Georges and Nadakuditi, 2012)), the result further suggests that also the eigenvectors associated to isolated eigenvalues of (if any) behave similarly to those of , a remark that has fundamental importance in the neural network performance understanding.
However, as shall be shown in Section 3.3, and contrary to empirical covariance matrix models of the type , explicitly depends on the distribution of (that is, beyond its first two moments). Thus, the aforementioned linearization of , and subsequently the deterministic equivalent for , are not universal with respect to the distribution of zero-mean unit variance . This is in striking contrast to the many linear random matrix models studied to date which often exhibit such universal behaviors. This property too will have deep consequences in the performance of neural networks as shall be shown through Figure 3 in Section 4 for an example where inappropriate choices for the law of lead to network failure to fulfill the regression task.
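For convenience in the following, letting $\delta$ and $\Phi$ be defined as in Theorem 1, we shall denote
$$\Psi \equiv \frac{n}{T}\frac{\Phi}{1+\delta}. \tag{3}$$
Theorem 1 provides the central step in the evaluation of $E_{\rm train}$, for which not only $E[Q]$ but also $E[Q^2]$ needs to be estimated. This last ingredient is provided in the following proposition.
Proposition 1 (Asymptotic equivalent for $E[QAQ]$).
Let Assumptions 1–3 hold and $A \in \mathbb{R}^{T \times T}$ be a symmetric non-negative definite matrix which is either $\Phi$ or a matrix with uniformly bounded operator norm (with respect to $T$). Then, for all $\varepsilon > 0$, there exists $c > 0$ such that, for all $n$,
$$\left\|E[QAQ] - \left(\bar Q A \bar Q + \frac{\frac{1}{n}\operatorname{tr}\left(\Psi\bar Q A\bar Q\right)}{1 - \frac{1}{n}\operatorname{tr}\Psi^2\bar Q^2}\,\bar Q\Psi\bar Q\right)\right\| \le c\, n^{-\frac{1}{2}+\varepsilon}.$$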
As an immediate consequence of Proposition 1, we have the following result on the training mean-square error of single-layer random neural networks.
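Theorem 3 (Asymptotic training mean-square error).
Let Assumptions 1–3 hold and $\bar Q$, $\Psi$ be defined as in Theorem 1 and (3). Then, for all $\varepsilon > 0$,
$$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm train} - \bar E_{\rm train}\right) \to 0$$
almost surely, where
$$\bar E_{\rm train} = \frac{\gamma^2}{T}\operatorname{tr} Y^{\mathsf T}Y\bar Q\left[\frac{\frac{1}{n}\operatorname{tr}\Psi\bar Q^2}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}\,\Psi + I_T\right]\bar Q.$$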
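The deterministic training error can be compared with simulated networks. The following sketch assumes the expression $\bar E_{\rm train} = \frac{\gamma^2}{T}\operatorname{tr} Y^{\mathsf T}Y\bar Q\,[\,c\,\Psi + I_T\,]\,\bar Q$ with $c = \frac{\frac{1}{n}\operatorname{tr}\Psi\bar Q^2}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}$, Gaussian weights and ReLU activation; the synthetic data, the hypothetical teacher direction `u` and the helper names are introduced for this illustration only.

```python
import numpy as np

def relu_kernel(X):
    """Closed-form Phi for ReLU activation and w ~ N(0, I_p)."""
    norms = np.linalg.norm(X, axis=0)
    scale = np.outer(norms, norms)
    c = np.clip(X.T @ X / scale, -1.0, 1.0)
    return scale * (c * np.arccos(-c) + np.sqrt(1.0 - c ** 2)) / (2.0 * np.pi)

def solve_delta(phi_eigs, n, gamma, tol=1e-13, max_iter=100_000):
    """Fixed point delta = (1/T) tr(Phi Q_bar), on the eigenvalues of Phi."""
    T = phi_eigs.size
    delta = 0.0
    for _ in range(max_iter):
        new = np.mean(phi_eigs / (n / T * phi_eigs / (1.0 + delta) + gamma))
        if abs(new - delta) <= tol * (1.0 + abs(new)):
            return new
        delta = new
    raise RuntimeError("fixed-point iteration did not converge")

def train_error_equivalent(Phi, Y, n, gamma):
    """Deterministic approximation of the training mean-square error."""
    T = Phi.shape[0]
    eigs, U = np.linalg.eigh(Phi)
    delta = solve_delta(eigs, n, gamma)
    psi = n / T * eigs / (1.0 + delta)
    Psi = (U * psi) @ U.T
    Q_bar = (U / (psi + gamma)) @ U.T
    coef = (np.trace(Psi @ Q_bar @ Q_bar) / n) / (1.0 - np.trace(Psi @ Q_bar @ Psi @ Q_bar) / n)
    return gamma ** 2 / T * np.trace(Y.T @ Y @ Q_bar @ (coef * Psi + np.eye(T)) @ Q_bar)

rng = np.random.default_rng(4)
n, p, T, gamma = 300, 100, 200, 1.0
u = rng.standard_normal(p)                               # hypothetical teacher direction
X = rng.standard_normal((p, T)) / np.sqrt(p)
Y = np.sign(u @ X)[None, :]                              # d = 1, labels in {-1, +1}

E_bar = train_error_equivalent(relu_kernel(X), Y, n, gamma)

simulated = []
for _ in range(20):                                      # average over draws of W
    S = np.maximum(rng.standard_normal((n, p)) @ X, 0.0)
    Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))
    simulated.append(gamma ** 2 / T * np.trace(Y.T @ Y @ Q @ Q))
E_sim = np.mean(simulated)
print("theory:", E_bar, " simulation:", E_sim)
```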
Since and share the same orthogonal eigenvector basis, it appears that depends on the alignment between the right singular vectors of and the eigenvectors of , with weighting coefficients
[TABLE]
where we denoted , , the eigenvalues of (which depend on through ). If , it is easily seen that as , in which case almost surely. However, in the more interesting case in practice where , as and consequently does not have a simple limit (see Section 4.3 for more discussion on this aspect).
Theorem 3 is also reminiscent of applied random matrix works on empirical covariance matrix models, such as (Bai and Silverstein, 2007; Kammoun et al., 2009), then further emphasizing the strong connection between the non-linear matrix and its linear counterpart .
As a side note, observe that, to obtain Theorem 3, we could have used the fact that $\operatorname{tr} Y^{\mathsf T}YQ^2 = -\frac{\partial}{\partial\gamma}\operatorname{tr} Y^{\mathsf T}YQ$ which, along with some analyticity arguments (for instance when extending the definition of $Q = Q(\gamma)$ to $Q(z)$, $z \in \mathbb{C}$), would have directly ensured that $\frac{\partial\bar Q}{\partial\gamma}$ is an asymptotic equivalent for $-E[Q^2]$, without the need for the explicit derivation of Proposition 1. Nonetheless, as shall appear subsequently, Proposition 1 is also a proxy to the asymptotic analysis of $E_{\rm test}$. Besides, the technical proof of Proposition 1 quite interestingly showcases the strength of the concentration of measure tools under study here.
3.2 Testing performance
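As previously mentioned, harnessing the asymptotic testing performance seems, to the best of the authors’ knowledge, out of current reach with the sole concentration of measure arguments used for the proof of the previous main results. Nonetheless, if not fully effective, these arguments allow for an intuitive derivation of a deterministic equivalent for $E_{\rm test}$, which is strongly supported by simulation results. We provide this result below under the form of a yet unproven claim, a heuristic derivation of which is provided at the end of Section 5.
To introduce this result, let $\hat X = [\hat x_1, \ldots, \hat x_{\hat T}] \in \mathbb{R}^{p \times \hat T}$ be a set of input data with corresponding output $\hat Y = [\hat y_1, \ldots, \hat y_{\hat T}] \in \mathbb{R}^{d \times \hat T}$. We also define $\hat\Sigma = \sigma(W\hat X) \in \mathbb{R}^{n \times \hat T}$. We assume that $\hat X$ and $\hat Y$ satisfy the same growth rate conditions as $X$ and $Y$ in Assumption 3. To introduce our claim, we need to extend the definition of $\Phi$ in (2) and $\Psi$ in (3) to the following notations: for all pairs of matrices $(A, B)$ of appropriate dimensions,
$$\Phi_{AB} \equiv E\left[\sigma(w^{\mathsf T}A)^{\mathsf T}\sigma(w^{\mathsf T}B)\right], \qquad \Psi_{AB} \equiv \frac{n}{T}\frac{\Phi_{AB}}{1+\delta}$$
where $w \sim \mathcal{N}_\varphi(0, I_p)$. In particular, $\Phi = \Phi_{XX}$ and $\Psi = \Psi_{XX}$.
With these notations in place, we are in position to state our claimed result.
Conjecture 1 (Deterministic equivalent for $E_{\rm test}$).
Let Assumptions 1–2 hold and $\hat X$, $\hat Y$ satisfy the same conditions as $X$, $Y$ in Assumption 3. Then, for all $\varepsilon > 0$,
$$n^{\frac{1}{2}-\varepsilon}\left(E_{\rm test} - \bar E_{\rm test}\right) \to 0$$
almost surely, where
$$\bar E_{\rm test} = \frac{1}{\hat T}\left\|\hat Y^{\mathsf T} - \Psi_{X\hat X}^{\mathsf T}\bar Q Y^{\mathsf T}\right\|_F^2 + \frac{\frac{1}{n}\operatorname{tr} Y^{\mathsf T}Y\bar Q\Psi\bar Q}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}\left[\frac{1}{\hat T}\operatorname{tr}\Psi_{\hat X\hat X} - \frac{1}{\hat T}\operatorname{tr}\left(I_T + \gamma\bar Q\right)\left(\Psi_{X\hat X}\Psi_{\hat XX}\bar Q\right)\right].$$
While not immediate at first sight, one can confirm (using notably the relation $\Psi\bar Q + \gamma\bar Q = I_T$) that, for $(\hat X, \hat Y) = (X, Y)$, $\bar E_{\rm test} = \bar E_{\rm train}$, as expected.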
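The consistency between the conjectured test error and the training error when the test set equals the training set can be verified numerically. The sketch below assumes the conjectured expression takes the form $\bar E_{\rm test} = \frac{1}{\hat T}\|\hat Y^{\mathsf T} - \Psi_{X\hat X}^{\mathsf T}\bar Q Y^{\mathsf T}\|_F^2 + \frac{\frac{1}{n}\operatorname{tr} Y^{\mathsf T}Y\bar Q\Psi\bar Q}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}\big[\frac{1}{\hat T}\operatorname{tr}\Psi_{\hat X\hat X} - \frac{1}{\hat T}\operatorname{tr}(I_T + \gamma\bar Q)(\Psi_{X\hat X}\Psi_{\hat XX}\bar Q)\big]$, together with Gaussian weights and ReLU activation; this form, the synthetic data and all helper names are assumptions of the sketch.

```python
import numpy as np

def relu_kernel(A, B):
    """Closed-form E[relu(w^T A)^T relu(w^T B)] for w ~ N(0, I_p)."""
    na, nb = np.linalg.norm(A, axis=0), np.linalg.norm(B, axis=0)
    scale = np.outer(na, nb)
    c = np.clip(A.T @ B / scale, -1.0, 1.0)
    return scale * (c * np.arccos(-c) + np.sqrt(1.0 - c ** 2)) / (2.0 * np.pi)

def solve_delta(phi_eigs, n, gamma, tol=1e-13, max_iter=100_000):
    """Fixed point delta = (1/T) tr(Phi Q_bar), on the eigenvalues of Phi."""
    T = phi_eigs.size
    delta = 0.0
    for _ in range(max_iter):
        new = np.mean(phi_eigs / (n / T * phi_eigs / (1.0 + delta) + gamma))
        if abs(new - delta) <= tol * (1.0 + abs(new)):
            return new
        delta = new
    raise RuntimeError("fixed-point iteration did not converge")

def error_equivalents(X, Y, X_hat, Y_hat, n, gamma):
    """Return (bar_E_train, bar_E_test) under the assumed formulas."""
    T, T_hat = X.shape[1], X_hat.shape[1]
    Phi = relu_kernel(X, X)
    delta = solve_delta(np.linalg.eigvalsh(Phi), n, gamma)
    scale = n / T / (1.0 + delta)
    Psi = scale * Phi                                   # T x T
    Psi_cross = scale * relu_kernel(X, X_hat)           # T x T_hat
    Psi_hat = scale * relu_kernel(X_hat, X_hat)         # T_hat x T_hat
    Q_bar = np.linalg.inv(Psi + gamma * np.eye(T))
    denom = 1.0 - np.trace(Psi @ Q_bar @ Psi @ Q_bar) / n
    YtY = Y.T @ Y

    c_train = np.trace(Psi @ Q_bar @ Q_bar) / n / denom
    train = gamma ** 2 / T * np.trace(YtY @ Q_bar @ (c_train * Psi + np.eye(T)) @ Q_bar)

    fit = np.linalg.norm(Y_hat.T - Psi_cross.T @ Q_bar @ Y.T) ** 2 / T_hat
    c_test = np.trace(YtY @ Q_bar @ Psi @ Q_bar) / n / denom
    spread = (np.trace(Psi_hat)
              - np.trace((np.eye(T) + gamma * Q_bar) @ Psi_cross @ Psi_cross.T @ Q_bar)) / T_hat
    return train, fit + c_test * spread

rng = np.random.default_rng(5)
n, p, T, T_hat, gamma = 120, 40, 90, 70, 0.7
u = rng.standard_normal(p)                              # hypothetical teacher direction
X = rng.standard_normal((p, T)) / np.sqrt(p)
X_hat = rng.standard_normal((p, T_hat)) / np.sqrt(p)
Y, Y_hat = np.sign(u @ X)[None, :], np.sign(u @ X_hat)[None, :]

E_train_bar, E_test_bar = error_equivalents(X, Y, X_hat, Y_hat, n, gamma)
same_train, same_test = error_equivalents(X, Y, X, Y, n, gamma)   # test set = training set
print(E_train_bar, E_test_bar, same_train, same_test)
```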
In order to evaluate practically the results of Theorem 3 and Conjecture 1, a first step is to be capable of estimating the values of $\Phi_{AB}$ for various activation functions $\sigma$ of practical interest. Such results, which call for completely different mathematical tools (mostly based on integration tricks), are provided in the subsequent section.
3.3 Evaluation of $\Phi_{AB}$
The evaluation of $\Phi_{AB}$ for arbitrary matrices $A$, $B$ naturally boils down to the evaluation of its individual entries and thus to the computation, for arbitrary vectors $a, b \in \mathbb{R}^p$, of
$$\Phi_{ab} \equiv E\left[\sigma(w^{\mathsf T}a)\sigma(w^{\mathsf T}b)\right] = (2\pi)^{-\frac{p}{2}}\int \sigma\left(\varphi(\tilde w)^{\mathsf T}a\right)\sigma\left(\varphi(\tilde w)^{\mathsf T}b\right) e^{-\frac{1}{2}\|\tilde w\|^2}\, d\tilde w. \tag{4}$$
The evaluation of (4) can be obtained through various integration tricks for a wide family of mappings $\varphi(\cdot)$ and activation functions $\sigma(\cdot)$. The most popular activation functions in neural networks are sigmoid functions, such as $\sigma(t) = \operatorname{erf}(t) \equiv \frac{2}{\sqrt{\pi}}\int_0^t e^{-u^2}\, du$, as well as the so-called rectified linear unit (ReLU) defined by $\sigma(t) = \max(t, 0)$, which has been recently popularized as a result of its robust behavior in deep neural networks. In physical artificial neural networks implemented using light projections, $\sigma(t) = |t|$ is the preferred choice. Note that all aforementioned functions are Lipschitz continuous and therefore in accordance with Assumption 2.
Even for settings that do not abide by the prescription of Assumptions 1 and 2, we believe that the results of this article could be extended, as discussed in Section 4. In particular, since the key ingredient in the proof of all our results is that the vector $\sigma(w^{\mathsf T}X)$ follows a concentration of measure phenomenon, induced by the Gaussianity of $\tilde w$ (if $w = \varphi(\tilde w)$), the Lipschitz character of $\sigma$ and the norm boundedness of $X$, it is likely, although not necessarily simple to prove, that $\sigma(w^{\mathsf T}X)$ may still concentrate under relaxed assumptions. This is likely the case for more generic vectors $w$ than $\mathcal{N}_\varphi(0, I_p)$ as well as for a larger class of activation functions, such as polynomial or piece-wise Lipschitz continuous functions.
In anticipation of these likely generalizations, we provide in Table 1 the values of $\Phi_{ab}$ for $w \sim \mathcal{N}(0, I_p)$ (i.e., for $\varphi(t) = t$) and for a set of functions $\sigma(\cdot)$ not necessarily satisfying Assumption 2. Denoting $\Phi \equiv \Phi(\sigma(t))$, it is interesting to remark that, since $\arccos(x) = -\arcsin(x) + \frac{\pi}{2}$, $\Phi(\max(t,0)) = \Phi(\frac{1}{2}t) + \Phi(\frac{1}{2}|t|)$. Also, $[\Phi(\cos(t)) + \Phi(\sin(t))]_{ab} = \exp(-\frac{1}{2}\|a - b\|^2)$, a result reminiscent of (Rahimi and Recht, 2007). (Footnote 1: It is in particular not difficult to prove, based on our framework, that, as $n/T \to \infty$, a random neural network composed of $n/2$ neurons with activation function $\sigma(t) = \cos(t)$ and $n/2$ neurons with activation function $\sigma(t) = \sin(t)$ implements a Gaussian difference kernel.) Finally, note that $\Phi(\operatorname{erf}(\kappa t)) \to \Phi(\operatorname{sign}(t))$ as $\kappa \to \infty$, inducing that the extension by continuity of $\operatorname{erf}(\kappa t)$ to $\operatorname{sign}(t)$ propagates to their associated kernels.
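The two kernel identities mentioned above can be checked with a few lines of standard-library Python. The closed forms used here for standard Gaussian $w$ are the classical ones ($a^{\mathsf T}b$ for the identity, the arc-sine type expression for $|t|$ and the degree-one arc-cosine kernel for ReLU); since Table 1 is not reproduced in this text, treating them as its entries is an assumption of this sketch, as are the test vectors and helper names.

```python
import math
import random

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norms_and_cos(a, b):
    na, nb = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    c = max(-1.0, min(1.0, _dot(a, b) / (na * nb)))   # cosine of the angle between a and b
    return na, nb, c

def phi_linear(a, b):            # sigma(t) = t
    return _dot(a, b)

def phi_abs(a, b):               # sigma(t) = |t|
    na, nb, c = _norms_and_cos(a, b)
    return 2.0 / math.pi * na * nb * (c * math.asin(c) + math.sqrt(1.0 - c * c))

def phi_relu(a, b):              # sigma(t) = max(t, 0)
    na, nb, c = _norms_and_cos(a, b)
    return na * nb / (2.0 * math.pi) * (c * math.acos(-c) + math.sqrt(1.0 - c * c))

def phi_cos_plus_sin(a, b):      # Phi(cos) + Phi(sin): Gaussian kernel in a - b
    diff = [x - y for x, y in zip(a, b)]
    return math.exp(-0.5 * _dot(diff, diff))

def monte_carlo(f, a, b, n_samples=100_000, seed=0):
    """Estimate E[f(w^T a, w^T b)] for w with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in a]
        total += f(_dot(w, a), _dot(w, b))
    return total / n_samples

a, b = [0.6, -0.3, 0.5], [0.2, 0.7, -0.4]
relu = lambda t: max(t, 0.0)
mc_relu = monte_carlo(lambda u, v: relu(u) * relu(v), a, b)
mc_trig = monte_carlo(lambda u, v: math.cos(u) * math.cos(v) + math.sin(u) * math.sin(v), a, b)

# Phi(t/2) = Phi(t)/4 and Phi(|t|/2) = Phi(|t|)/4 since Phi is quadratic in sigma
print(phi_relu(a, b), phi_linear(a, b) / 4 + phi_abs(a, b) / 4, mc_relu)
print(phi_cos_plus_sin(a, b), mc_trig)
```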
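In addition to these results for $w \sim \mathcal{N}(0, I_p)$, we also evaluated $\Phi_{ab}$ for $\sigma(t) = \zeta_2 t^2 + \zeta_1 t + \zeta_0$ and $w \in \mathbb{R}^p$ a vector of independent and identically distributed entries of zero mean and moments of order $k$ equal to $m_k$ (so $m_1 = 0$); $w$ is not restricted here to satisfy $w \sim \mathcal{N}_\varphi(0, I_p)$. In this case, we find
$$\Phi_{ab} = \zeta_2^2\left[m_2^2\left(2(a^{\mathsf T}b)^2 + \|a\|^2\|b\|^2\right) + (m_4 - 3m_2^2)(a^2)^{\mathsf T}(b^2)\right] + \zeta_1^2 m_2\, a^{\mathsf T}b + \zeta_2\zeta_1 m_3\left[(a^2)^{\mathsf T}b + a^{\mathsf T}(b^2)\right] + \zeta_2\zeta_0 m_2\left[\|a\|^2 + \|b\|^2\right] + \zeta_0^2$$
where we defined $(a^2) \equiv [a_1^2, \ldots, a_p^2]^{\mathsf T}$.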
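The polynomial case can be verified exactly by enumerating a discrete weight distribution. The sketch below assumes the closed form $\Phi_{ab} = \zeta_2^2[m_2^2(2(a^{\mathsf T}b)^2 + \|a\|^2\|b\|^2) + (m_4 - 3m_2^2)(a^2)^{\mathsf T}(b^2)] + \zeta_1^2 m_2 a^{\mathsf T}b + \zeta_2\zeta_1 m_3[(a^2)^{\mathsf T}b + a^{\mathsf T}(b^2)] + \zeta_2\zeta_0 m_2[\|a\|^2 + \|b\|^2] + \zeta_0^2$, which follows from expanding the product of the two quadratics and using the independence of the entries of $w$; the two-point distribution, the coefficients and the vectors are arbitrary choices of this sketch.

```python
import itertools
import math

def phi_poly(a, b, zeta, m):
    """Closed form of E[sigma(w^T a) sigma(w^T b)] for sigma(t) = z2 t^2 + z1 t + z0."""
    z2, z1, z0 = zeta
    m2, m3, m4 = m
    ab = sum(x * y for x, y in zip(a, b))
    na2 = sum(x * x for x in a)
    nb2 = sum(y * y for y in b)
    a2b2 = sum(x * x * y * y for x, y in zip(a, b))
    a2b = sum(x * x * y for x, y in zip(a, b))
    ab2 = sum(x * y * y for x, y in zip(a, b))
    return (z2 ** 2 * (m2 ** 2 * (2 * ab ** 2 + na2 * nb2) + (m4 - 3 * m2 ** 2) * a2b2)
            + z1 ** 2 * m2 * ab
            + z2 * z1 * m3 * (a2b + ab2)
            + z2 * z0 * m2 * (na2 + nb2)
            + z0 ** 2)

def phi_exact(a, b, zeta, values, probs):
    """Exact expectation by enumerating all outcomes of a discrete i.i.d. w."""
    z2, z1, z0 = zeta
    sigma = lambda t: z2 * t * t + z1 * t + z0
    total = 0.0
    for idx in itertools.product(range(len(values)), repeat=len(a)):
        prob = math.prod(probs[i] for i in idx)
        w = [values[i] for i in idx]
        u = sum(wi * x for wi, x in zip(w, a))
        v = sum(wi * y for wi, y in zip(w, b))
        total += prob * sigma(u) * sigma(v)
    return total

# Zero-mean, unit-variance, skewed two-point law (an arbitrary illustrative choice)
values, probs = (2.0, -0.5), (0.2, 0.8)
m1 = sum(p * v for p, v in zip(probs, values))
m = tuple(sum(p * v ** k for p, v in zip(probs, values)) for k in (2, 3, 4))

a, b = [0.5, -1.0, 0.3, 0.8], [1.2, 0.4, -0.7, 0.1]
zeta = (-0.5, 0.3, 1.0)
print(phi_poly(a, b, zeta, m), phi_exact(a, b, zeta, values, probs))
```

Because the third and fourth moments enter the formula explicitly, two weight distributions sharing the same mean and variance generally give different kernels.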
It is already interesting to remark that, while classical random matrix models exhibit a well-known universality property (in the sense that their limiting spectral distribution is independent of the moments, higher than two, of the entries of the involved random matrix, here $W$), for $\sigma(\cdot)$ a polynomial of order two, $\Phi$ and thus $\bar\mu_n$ strongly depend on $E[W_{ij}^k]$ for $k = 3, 4$. We shall see in Section 4 that this remark has troubling consequences. We will notably infer (and confirm via simulations) that the studied neural network may provably fail to fulfill a specific task if the $W_{ij}$ are Bernoulli with zero mean and unit variance but succeed with possibly high performance if the $W_{ij}$ are standard Gaussian (which is explained by the disappearance or not of the terms $(a^{\mathsf T}b)^2$ and $(a^2)^{\mathsf T}(b^2)$ in (3.3) if $m_4 = m_2^2$).
4 Practical Outcomes
We discuss in this section the outcomes of our main results in terms of neural network application. The technical discussions on Theorem 1 and Proposition 1 will be made in the course of their respective proofs in Section 5.
4.1 Simulation Results
We first provide in this section a simulation corroborating the findings of Theorem 3 and suggesting the validity of Conjecture 1. To this end, we consider the task of classifying the popular MNIST image database (LeCun, Cortes and Burges, 1998), composed of grayscale handwritten digits of size , with a neural network composed of units and standard Gaussian . We represent here each image as a -size vector; images of sevens and images of nines were extracted from the database and were evenly split in training and test images, respectively. The database images were jointly centered and scaled so to fall close to the setting of Assumption 3 on and (an admissible preprocessing intervention). The columns of the output values and were taken as unidimensional () with depending on the image class. Figure 1 displays the simulated (averaged over realizations of ) versus theoretical values of and for three choices of Lipschitz continuous functions , as a function of .
Note that a perfect match between theory and practice is observed, for both $E_{\rm train}$ and $E_{\rm test}$, which is a strong indicator of both the validity of Conjecture 1 and the adequacy of Assumption 3 to the MNIST dataset.
We subsequently provide in Figure 2 the comparison between theoretical formulas and practical simulations for a set of functions which do not satisfy Assumption 2, i.e., either discontinuous or non-Lipschitz maps. The closeness between both sets of curves is again remarkably good, although to a lesser extent than for the Lipschitz continuous functions of Figure 1. Also, the achieved performances are generally worse than those observed in Figure 1.
It should be noted that the performance estimates provided by Theorem 3 and Conjecture 1 can be efficiently implemented at low computational cost in practice. Indeed, by diagonalizing (which is a marginal cost independent of ), can be computed for all through mere vector operations; similarly is obtained by the marginal cost of a basis change of and the matrix product , all remaining operations being accessible through vector operations. As a consequence, the simulation durations to generate the aforementioned theoretical curves using the linked Python script were found to be to times faster than to generate the simulated network performances. Beyond their theoretical interest, the provided formulas therefore allow for an efficient offline tuning of the network hyperparameters, notably the choice of an appropriate value for the ridge-regression parameter .
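The low cost of the theoretical curves comes from performing a single eigendecomposition and then working on vectors for each value of the ridge parameter. The sketch below illustrates this for the training error, assuming the deterministic formula $\bar E_{\rm train} = \frac{\gamma^2}{T}\operatorname{tr} Y^{\mathsf T}Y\bar Q[c\,\Psi + I_T]\bar Q$ with $c = \frac{\frac{1}{n}\operatorname{tr}\Psi\bar Q^2}{1 - \frac{1}{n}\operatorname{tr}(\Psi\bar Q)^2}$, Gaussian weights and ReLU activation; the function names and synthetic data are illustrative and this is not the linked script.

```python
import numpy as np

def relu_kernel(X):
    """Closed-form Phi for ReLU activation and w ~ N(0, I_p)."""
    norms = np.linalg.norm(X, axis=0)
    scale = np.outer(norms, norms)
    c = np.clip(X.T @ X / scale, -1.0, 1.0)
    return scale * (c * np.arccos(-c) + np.sqrt(1.0 - c ** 2)) / (2.0 * np.pi)

def solve_delta(phi_eigs, n, gamma, tol=1e-13, max_iter=100_000):
    """Fixed point delta = (1/T) tr(Phi Q_bar), on the eigenvalues of Phi."""
    T = phi_eigs.size
    delta = 0.0
    for _ in range(max_iter):
        new = np.mean(phi_eigs / (n / T * phi_eigs / (1.0 + delta) + gamma))
        if abs(new - delta) <= tol * (1.0 + abs(new)):
            return new
        delta = new
    raise RuntimeError("fixed-point iteration did not converge")

def train_equivalent_path(Phi, Y, n, gammas):
    """Deterministic training error for every gamma, from one eigendecomposition."""
    T = Phi.shape[0]
    eigs, U = np.linalg.eigh(Phi)                 # done once, independent of gamma
    energy = np.sum((Y @ U) ** 2, axis=0)         # ||Y u_i||^2 for each eigenvector u_i
    out = []
    for gamma in gammas:                          # only vector operations below
        delta = solve_delta(eigs, n, gamma)
        psi = n / T * eigs / (1.0 + delta)        # eigenvalues of Psi
        q = 1.0 / (psi + gamma)                   # eigenvalues of Q_bar
        coef = (np.sum(psi * q ** 2) / n) / (1.0 - np.sum((psi * q) ** 2) / n)
        out.append(gamma ** 2 / T * np.sum(energy * q ** 2 * (coef * psi + 1.0)))
    return np.array(out)

rng = np.random.default_rng(6)
n, p, T = 150, 50, 100
u = rng.standard_normal(p)                        # hypothetical teacher direction
X = rng.standard_normal((p, T)) / np.sqrt(p)
Y = np.sign(u @ X)[None, :]
Phi = relu_kernel(X)
gammas = np.logspace(-2, 2, 9)
path = train_equivalent_path(Phi, Y, n, gammas)

# Reference value with full matrices for one gamma
g = gammas[3]
d0 = solve_delta(np.linalg.eigvalsh(Phi), n, g)
Psi = n / T * Phi / (1.0 + d0)
Qb = np.linalg.inv(Psi + g * np.eye(T))
c0 = (np.trace(Psi @ Qb @ Qb) / n) / (1.0 - np.trace(Psi @ Qb @ Psi @ Qb) / n)
reference = g ** 2 / T * np.trace(Y.T @ Y @ Qb @ (c0 * Psi + np.eye(T)) @ Qb)
print(path[3], reference)
```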
4.2 The underlying kernel
Theorem 1 and the subsequent theoretical findings importantly reveal that the neural network performances are directly related to the Gram matrix , which acts as a deterministic kernel on the dataset . This is in fact a well-known result found e.g., in (Williams, 1998) where it is shown that, as alone, the neural network behaves as a mere kernel operator (this observation is retrieved here in the subsequent Section 4.3). This remark was then put at an advantage in (Rahimi and Recht, 2007) and subsequent works, where random feature maps of the type are proposed as a computationally efficient proxy to evaluate kernels .
As discussed previously, the formulas for and suggest that good performances are achieved if the dominant eigenvectors of show a good alignment to (and similarly for and ). This naturally drives us to finding a priori simple regression tasks where ill-choices of may annihilate the neural network performance. Following recent works on the asymptotic performance analysis of kernel methods for Gaussian mixture models (Couillet and Benaych-Georges, 2016; Zhenyu Liao, 2017; Mai and Couillet, 2017) and (Couillet and Kammoun, 2016), we describe here such a task.
Let and where and are such that , are bounded, and . Accordingly, and . It is proved in the aforementioned articles that, under these conditions, it is theoretically possible, in the large limit, to classify the data using a kernel least-square support vector machine (that is, with a training dataset) or with a kernel spectral clustering method (that is, in a completely unsupervised manner) with a non-trivial limiting error probability (i.e., neither zero nor one). This scenario has the interesting feature that almost surely for all while , almost surely, irrespective of the class of , thereby allowing for a Taylor expansion of the non-linear kernels as early proposed in (El Karoui, 2010).
Transposed to our present setting, the aforementioned Taylor expansion allows for a consistent approximation of by an information-plus-noise (spiked) random matrix model (see e.g., (Loubaton and Vallet, 2010; Benaych-Georges and Nadakuditi, 2012)). In the present Gaussian mixture context, it is shown in (Couillet and Benaych-Georges, 2016) that data classification is (asymptotically at least) only possible if explicitly contains the quadratic term (or combinations of , , and ). In particular, letting with , it is easily seen from Table 1 that only , , and can realize the task. Indeed, we have the following Taylor expansions around :
[TABLE]
where only the last three functions (only found in the expression of corresponding to , , or ) exhibit a quadratic term.
More surprisingly maybe, recalling now Equation (3.3) which considers non-necessarily Gaussian with moments of order , a more refined analysis shows that the aforementioned Gaussian mixture classification task will fail if and , so for instance for Bernoulli with parameter . The performance comparison of this scenario is shown in the top part of Figure 3 for and , , for and (that is, Bernoulli ). The choice of with is motivated by (Couillet and Benaych-Georges, 2016; Couillet and Kammoun, 2016) where it is shown, in a somewhat different setting, that this choice is optimal for class recovery. Note that, while the test performances are overall rather weak in this setting, for , drops below one (the amplitude of the ), thereby indicating that non-trivial classification is performed. This is not so for the Bernoulli case where is systematically greater than $|\hat{Y}_{ij}| = 1$. This is theoretically explained by the fact that, from Equation (3.3), contains structural information about the data classes through the term which induces an information-plus-noise model for as long as , i.e., (see (Couillet and Benaych-Georges, 2016) for details). This is visually seen in the bottom part of Figure 3 where the Gaussian scenario presents an isolated eigenvalue for with corresponding structured eigenvector, which is not the case of the Bernoulli scenario. To complete this discussion, it appears relevant in the present setting to choose in such a way that is far from zero, thus suggesting the interest of heavy-tailed distributions. To confirm this prediction, Figure 3 additionally displays the performance achieved and the spectrum of observed for , that is, following a Student-t distribution with degree of freedom normalized to unit variance (in this case and ).
Figure 3 confirms the large superiority of this choice over the Gaussian case (note nonetheless the slight inaccuracy of our theoretical formulas in this case, which is likely due to too small values of to accommodate with higher order moments, an observation which is confirmed in simulations when letting be even smaller).
4.3 Limiting cases
We have suggested that contains, in its dominant eigenmodes, all the usable information describing . In the Gaussian mixture example above, it was notably shown that may completely fail to contain this information, resulting in the impossibility to perform a classification task, even if one were to take infinitely many neurons in the network. For containing useful information about , it is intuitive to expect that both and become smaller as and become large. It is in fact easy to see that, if is invertible (which is likely to occur in most cases if ), then
[TABLE]
and we fall back on the performance of a classical kernel regression. It is interesting in particular to note that, as the number of neurons becomes large, the effect of on flattens out. Therefore, a smart choice of is only relevant for small (and thus computationally more efficient) neuron layers. This observation is depicted in Figure 4 where it is made clear that a growth of reduces to zero while saturates to a non-zero limit which becomes increasingly independent of . Note additionally the interesting phenomenon occurring for where too small values of induce significant performance losses, thereby suggesting that a proper choice of is particularly important in this regime.
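The decay of the training error with the number of neurons can be probed numerically. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes the ridge-regression readout of the model definition in the form β = (1/T) Σ (Σ^T Σ / T + γ I_T)^{-1} Y^T with Σ = σ(WX), a ReLU activation, Gaussian W, and synthetic data. The function name `train_mse` and all sizes are illustrative choices.

```python
import numpy as np

def train_mse(n, X, Y, gamma, rng):
    """Training MSE of a random-feature ridge regression with n neurons.

    Assumed model (illustrative): Sigma = relu(W X) with W Gaussian,
    beta = (1/T) Sigma Q Y^T, Q = (Sigma^T Sigma / T + gamma I)^{-1}.
    """
    p, T = X.shape
    W = rng.standard_normal((n, p))
    Sigma = np.maximum(W @ X, 0.0)                       # n x T feature matrix
    Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
    beta = Sigma @ Q @ Y.T / T                           # n x d readout
    residual = Y.T - Sigma.T @ beta                      # T x d
    return np.sum(residual ** 2) / T

rng = np.random.default_rng(0)
p, T, gamma = 50, 100, 1.0
X = rng.standard_normal((p, T)) / np.sqrt(p)             # columns of norm close to 1
Y = np.sign(X[:1, :])                                     # 1 x T toy +/-1 targets

errors = {n: train_mse(n, X, Y, gamma, rng) for n in (25, 100, 400, 1600)}
for n, err in errors.items():
    print(f"n = {n:5d}   E_train = {err:.4f}")
```

In this toy setting the printed training error should shrink as n grows, in line with the limiting behaviour discussed above; the test error is not examined here.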
Of course, practical interest lies precisely in situations where is not too large. We may thus subsequently assume that . In this case, as suggested by Figures 1–2, the mean-square error performances achieved as may predict the superiority of specific choices of for optimally chosen . It is important for this study to differentiate between cases where is smaller or greater than . Indeed, observe that, with the spectral decomposition for diagonal and ,
[TABLE]
which satisfies, as ,
[TABLE]
A phase transition therefore exists whereby assumes a finite positive value in the small limit if , or scales like otherwise.
As a consequence, if , as , and , where is any matrix such that is orthogonal, so that and ; and thus, , which states that the residual training error corresponds to the energy of not captured by the space spanned by . Since is an increasing function of , so is (at least for all large ) and thus corresponds to the lowest achievable asymptotic training error.
If instead (which is the most likely outcome in practice), as , and thus
[TABLE]
where and .
These results suggest that neural networks should be designed so as to reduce the rank of while maintaining a strong alignment between the dominant eigenvectors of and the output matrix .
Interestingly, if, as above, is assumed to be drawn from a Gaussian mixture and is a classification vector with , then the tools proposed in (Couillet and Benaych-Georges, 2016) (related to spike random matrix analysis) allow for an explicit evaluation of the aforementioned limits as grow large. This analysis is however cumbersome and outside the scope of the present work.
5 Proof of the Main Results
In the remainder, we shall use extensively the following notations:
[TABLE]
i.e., . Also, we shall define the matrix with -th row removed, and correspondingly
[TABLE]
Finally, because of exchangeability, it shall often be convenient to work with the generic random vector , the random vector distributed as any of the ’s, the random matrix distributed as any of the ’s, and with the random matrix distributed as any of the ’s.
5.1 Concentration Results on
Our first results provide concentration of measure properties on functionals of . These results unfold from the following concentration inequality for Lipschitz applications of a Gaussian vector; see e.g., (Ledoux, 2005, Corollary 2.6, Propositions 1.3, 1.8) or (Tao, 2012, Theorem 2.1.12). For , consider the canonical Gaussian probability on defined through its density and a -Lipschitz function. Then, we have the said normal concentration
[TABLE]
where are independent of and . As a corollary (see e.g., (Ledoux, 2005, Proposition 1.10)), for every ,
[TABLE]
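As a numerical illustration of this normal concentration (a minimal sketch, with the Euclidean norm chosen here as an example of a 1-Lipschitz function), the fluctuations of f(x) = ||x|| for a standard Gaussian vector x stay of order one while its mean grows like the square root of the dimension. The helper name `norm_fluctuation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_fluctuation(dim, n_samples=2000):
    """Mean and standard deviation of f(x) = ||x|| for x ~ N(0, I_dim)."""
    x = rng.standard_normal((n_samples, dim))
    values = np.linalg.norm(x, axis=1)    # f is 1-Lipschitz for the Euclidean norm
    return values.mean(), values.std()

stats = {dim: norm_fluctuation(dim) for dim in (10, 100, 1000)}
for dim, (mean, std) in stats.items():
    print(f"dim = {dim:5d}   mean = {mean:8.3f}   std = {std:.3f}")
```

The standard deviation remains close to 1/sqrt(2) for every dimension, which is the dimension-free behaviour that the inequality above encodes.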
The main approach to the proof of our results, starting with that of the key Lemma 1, is as follows: since with and Lipschitz, the normal concentration of transfers to which further induces a normal concentration of the random vector and the matrix , thereby implying that Lipschitz functionals of or also concentrate. As pointed out earlier, these concentration results are used in place of the independence assumptions (and their multiple consequences on convergence of random variables) classically exploited in random matrix theory.
Notations: In all subsequent lemmas and proofs, the letters will be used interchangeably as positive constants independent of the key equation parameters (notably and below) and may be reused from line to line. Additionally, the variable will denote any small positive number; the variables may depend on .
We start by recalling the first part of the statement of Lemma 1 and subsequently providing its proof.
**Lemma 2** (Concentration of quadratic forms).
Let Assumptions 1–2 hold. Let also such that and, for and , define the random vector . Then,
[TABLE]
for and independent of all other parameters.
Proof.
The layout of the proof is as follows: since the application is “quadratic” in and thus not Lipschitz (therefore not allowing for a natural transfer of the concentration of to ), we first prove that satisfies a concentration inequality, which provides a high probability bound on . Conditioning on this event, the map can then be shown to be Lipschitz (by isolating one of the terms for bounding and the other one for retrieving the Lipschitz character) and, up to an appropriate control of concentration results under conditioning, the result is obtained.
Following this plan, we first provide a concentration inequality for . To this end, note that the application , is Lipschitz with parameter as the combination of the -Lipschitz function , the -Lipschitz map , and the -Lipschitz map , . As a Gaussian vector, has a normal concentration and so does . Since the Euclidean norm , is -Lipschitz, we thus have immediately by (6)
[TABLE]
for some independent of all parameters.
Finally, using again the Lipschitz character of ,
[TABLE]
so that, by Jensen’s inequality,
[TABLE]
with (since ). Letting , we then find
[TABLE]
which, with the remark , may be equivalently stated as
[TABLE]
As a side (but important) remark, note that, since
[TABLE]
the result above implies that
[TABLE]
and thus, since , we have
[TABLE]
Thus, in particular, under the additional Assumption 3, with high probability, the operator norm of cannot exceed a rate .
**Remark 1** (Loss of control of the structure of ).
The aforementioned control of arises from the bound which may be quite loose (by as much as a factor ). Intuitively, under the supplementary Assumption 3, if , then is “dominated” by the matrix , the operator norm of which is indeed of order and the bound is tight. If and , we however know that (Bai and Silverstein, 1998). One is tempted to believe that, more generally, if , then should remain of this order. And, if instead , the contribution of should merely engender a single large-amplitude isolated singular value in the spectrum of , while the other singular values remain of order . These intuitions are not captured by our concentration of measure approach.
Since is an entry-wise operation, concentration results with respect to the Frobenius norm are natural, whereas results with respect to the operator norm are hardly accessible.
Back to our present considerations, let us define the probability space .
Conditioning the random variable of interest in Lemma 2 with respect to and its complementary , for some , gives
[TABLE]
We can already bound thanks to (7). As for the first right-hand side term, note that on the set , the function is -Lipschitz. This is because, for all ,
[TABLE]
Since conditioning does not allow for a straightforward application of (6), we consider instead , a -Lipschitz continuation to of , the restriction of to , such that all the radial derivatives of are constant in the set . We may thus now apply (6) and our previous results to obtain
[TABLE]
Therefore,
[TABLE]
Our next step is then to bound the difference . Since and are equal on ,
[TABLE]
where is the law of . Since , for , and thus
[TABLE]
where in the last inequality we used the fact that for , , and . As a consequence,
[TABLE]
so that, with the same remark as before, for ,
[TABLE]
To avoid the condition , we use the fact that, probabilities being lower than one, it suffices to replace by with such that
[TABLE]
The above inequality holds if we take for instance since then (using successively and ) and thus
[TABLE]
Therefore, setting , we get for every
[TABLE]
which, together with the inequality , gives
[TABLE]
We then conclude
[TABLE]
and, with ,
[TABLE]
Indeed, if then , while if then .
∎
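Lemma 2 can be illustrated by simulation. The sketch below makes several assumptions that are not part of the proof: the activation is the ReLU, w is standard Gaussian, A is the identity (so that its operator norm is one), and X is drawn once at random and then held fixed. For this choice the diagonal of Φ is known in closed form, E[max(w^T x, 0)^2] = ||x||^2 / 2. The function name `quadratic_form_spread` is hypothetical.

```python
import numpy as np

def quadratic_form_spread(p, T, n_draws, rng):
    """Spread of (1/T) sigma^T A sigma around (1/T) tr(Phi A) for A = I_T.

    Illustrative assumptions: sigma = relu(w^T X), w ~ N(0, I_p), and X held
    fixed after one random draw with columns of norm close to one.
    """
    X = rng.standard_normal((p, T)) / np.sqrt(p)
    # For the ReLU and Gaussian w, Phi_ii = E[relu(w^T x_i)^2] = ||x_i||^2 / 2.
    target = 0.5 * np.sum(X ** 2) / T
    W = rng.standard_normal((n_draws, p))           # one row per independent w
    S = np.maximum(W @ X, 0.0)                      # row k holds sigma for draw k
    forms = np.sum(S ** 2, axis=1) / T              # (1/T) sigma^T sigma
    return np.mean(np.abs(forms - target))

rng = np.random.default_rng(1)
small = quadratic_form_spread(p=50, T=100, n_draws=200, rng=rng)
large = quadratic_form_spread(p=800, T=1600, n_draws=200, rng=rng)
print(f"mean |deviation|, T = 100 : {small:.4f}")
print(f"mean |deviation|, T = 1600: {large:.4f}")
```

The average deviation shrinks as the dimensions grow together, which is the qualitative content of the lemma; the sketch does not attempt to estimate the constants in the bound.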
As a corollary of Lemma 2, we have the following control of the moments of .
**Corollary 1** (Moments of quadratic forms).
Let Assumptions 1–2 hold. For , , such that , and ,
[TABLE]
with , , and independent of the other parameters. In particular, under the additional Assumption 3,
[TABLE]
Proof.
We use the fact that, for a nonnegative random variable , , so that
[TABLE]
which, along with the boundedness of the integrals, concludes the proof. ∎
Beyond concentration results on functions of the vector , we also have the following convenient property for functions of the matrix .
**Lemma 3** (Lipschitz functions of ).
Let be a -Lipschitz function with respect to the Frobenius norm. Then, under Assumptions 1–2,
[TABLE]
for some . In particular, under the additional Assumption 3,
[TABLE]
Proof.
Denoting , since ${\rm vec}(\tilde{W})\equiv[\tilde{W}_{11},\cdots,\tilde{W}_{np}]$ is a Gaussian vector, by the normal concentration of Gaussian vectors, for a -Lipschitz function of with respect to the Frobenius norm (i.e., the Euclidean norm of ), by (6),
[TABLE]
for some . Let us consider in particular and remark that
[TABLE]
concluding the proof. ∎
A first corollary of Lemma 3 is the concentration of the Stieltjes transform of , the empirical spectral measure of , for all (so in particular, for , ).
**Corollary 2** (Concentration of the Stieltjes transform of ).
[TABLE]
for some , where is the Hausdorff set distance. In particular, for , , and under the additional Assumption 3
[TABLE]
Proof.
We can apply Lemma 3 for , since we have
[TABLE]
where, for the second to last inequality, we successively used the relations , for nonnegative definite , and , , , for , and finally . ∎
Lemma 3 also allows for an important application of Lemma 2 as follows.
**Lemma 4** (Concentration of ).
Let Assumptions 1–3 hold and write . Define and, for and , let . Then, for such that
[TABLE]
for some independent of the other parameters.
Proof.
Let . Reproducing the proof of Corollary 2, conditionally on for any large enough , it appears that is Lipschitz with parameter of order . Along with (7) and Assumption 3, this thus ensures that
[TABLE]
for some . We may then apply Lemma 1 to the bounded norm matrix to further find that
[TABLE]
which concludes the proof. ∎
As a further corollary of Lemma 3, we have the following concentration result on the training mean-square error of the neural network under study.
**Corollary 3** (Concentration of the mean-square error).
[TABLE]
for some independent of the other parameters.
Proof.
We apply Lemma 3 to the mapping . Denoting and , remark indeed that
[TABLE]
As and are bounded and is also bounded by Assumption 3, this implies
[TABLE]
for some . The function is thus Lipschitz with parameter independent of , which allows us to conclude using Lemma 3. ∎
The aforementioned concentration results are the building blocks of the proofs of Theorems 1–3 which, under all Assumptions 1–3, are established using standard random matrix approaches.
5.2 Asymptotic Equivalents
5.2.1 First Equivalent for
This section is dedicated to a first characterization of , in the “simultaneously large” regime. This preliminary step is classical in studying resolvents in random matrix theory as the direct comparison of to with the implicit may be cumbersome. To this end, let us thus define the intermediary deterministic matrix
[TABLE]
with , where we recall that is a random matrix distributed as, say, .
First note that, since and, from (7) and Assumption 3, for all large , we find that for some constant . Thus, is uniformly bounded.
We will show here that as in the regime of Assumption 3. As the proof steps are somewhat classical, we defer to the appendix some classical intermediary lemmas (Lemmas 5–7). Using the resolvent identity, Lemma 5, we start by writing
[TABLE]
which, from Lemma 6, gives, for ,
[TABLE]
Note now, from the independence of and , that the second right-hand side expectation is simply . Also, exploiting Lemma 6 in reverse on the rightmost term, this gives
[TABLE]
It is convenient at this point to note that, since is symmetric, we may write
[TABLE]
We study the two right-hand side terms of (5.2.1) independently.
For the first term, since ,
[TABLE]
where we used again Lemma 6 in reverse. Denoting , this can be compactly written
[TABLE]
Note at this point that, from Lemma 7, and
[TABLE]
Besides, by Lemma 4 and the union bound,
[TABLE]
for some , so in particular, recalling that for some constant ,
[TABLE]
As a consequence of all the above (and of the boundedness of ), we have that, for some ,
[TABLE]
Let us now consider the second right-hand side term of (5.2.1). Using the relation in the order of Hermitian matrices (which unfolds from ), we have, with and ,
[TABLE]
where . Of course, since we also have (from ), we have symmetrically
[TABLE]
But from Lemma 4,
[TABLE]
so that, with a similar reasoning as in the proof of Corollary 1,
[TABLE]
where we additionally used in the first inequality.
Since in addition , this gives
[TABLE]
Together with (5.2.1), we thus conclude that
[TABLE]
Note in passing that we proved that
[TABLE]
where the first equality holds by exchangeability arguments.
In particular,
[TABLE]
where $|\frac{1}{T}\operatorname{tr}\Phi({\rm E}[Q_{-}]-{\rm E}[Q])|\leq\frac{c}{n}$. And thus, by the previous result,
[TABLE]
We have proved in the beginning of the section that is bounded and thus we finally conclude that
[TABLE]
5.2.2 Second Equivalent for
In this section, we show that can be approximated by the matrix , which we recall is defined as
[TABLE]
where is the unique positive solution to . The fact that is well defined is quite standard and has already been proved several times for more elaborate models. Following the ideas of (Hoydis, Couillet and Debbah, 2013), we may for instance use the framework of so-called standard interference functions (Yates, 1995) which claims that, if a map , , satisfies , and there exists such that , then has a unique fixed point (Yates, 1995, Th 2). It is easily shown that is such a map, so that exists and is unique.
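The fixed point can be computed by plain iteration, which is what the standard interference function framework guarantees to converge. The sketch below assumes the deterministic equivalent has the form Q̄ = ((n/T) Φ/(1+δ) + γ I_T)^{-1} with δ = (1/T) tr ΦQ̄, as stated in Theorem 1 (the display is not reproduced in this section, so this form is an assumption here). The matrix Φ is a synthetic nonnegative definite stand-in and `solve_delta` is a hypothetical helper name.

```python
import numpy as np

def solve_delta(Phi, n, gamma, delta0=0.0, tol=1e-12, max_iter=10_000):
    """Iterate delta <- (1/T) tr(Phi Qbar(delta)) until convergence.

    Assumed form: Qbar(delta) = ((n/T) Phi / (1 + delta) + gamma I)^{-1}.
    Works in the eigenbasis of Phi so that each iteration costs O(T).
    """
    T = Phi.shape[0]
    eigvals = np.linalg.eigvalsh(Phi)               # Phi symmetric nonnegative definite
    delta = delta0
    for _ in range(max_iter):
        new = np.mean(eigvals / ((n / T) * eigvals / (1.0 + delta) + gamma))
        if abs(new - delta) < tol:
            return new
        delta = new
    raise RuntimeError("fixed-point iteration did not converge")

rng = np.random.default_rng(0)
T, n, gamma = 200, 100, 0.5
G = rng.standard_normal((T, 300))
Phi = G @ G.T / 300                                 # synthetic stand-in for Phi

delta_a = solve_delta(Phi, n, gamma, delta0=0.0)
delta_b = solve_delta(Phi, n, gamma, delta0=50.0)   # very different starting point
print(delta_a, delta_b)
```

Both starting points lead to the same value, consistent with uniqueness of the positive solution.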
To compare and , using the resolvent identity, Lemma 5, we start by writing
[TABLE]
from which
[TABLE]
which implies that
[TABLE]
It thus remains to show that
[TABLE]
to prove that . To this end, note that, by Cauchy–Schwarz’s inequality,
[TABLE]
so that it is sufficient to bound the limsup of both terms under the square root strictly by one. Next, remark that
[TABLE]
In particular,
[TABLE]
But at the same time, since ,
[TABLE]
the limsup of which is bounded. We thus conclude that
[TABLE]
Similarly, , which is known to be bounded, satisfies
[TABLE]
and we thus have also
[TABLE]
which completes the proof that .
As a consequence of all this,
[TABLE]
and we have thus proved that for some .
From this result, along with Corollary 2, we now have that
[TABLE]
for all large . As a consequence, for all , almost surely. As such, the difference of Stieltjes transforms , and , (with the unique Stieltjes transform solution to ) converges to zero for each in a subset of having at least one accumulation point (namely ), almost surely so (that is, on a probability set with ). Thus, letting be a converging sequence strictly included in , on the probability one space , for all . Now, is complex analytic on and bounded on all compact subsets of . Besides, it was shown in (Silverstein and Bai, 1995; Silverstein and Choi, 1995) that the function is well-defined, complex analytic and bounded on all compact subsets of . As a result, on , is complex analytic, bounded on all compact subsets of and converges to zero on a subset admitting at least one accumulation point. Thus, by Vitali’s convergence theorem (Titchmarsh, 1939), with probability one, converges to zero everywhere on . This implies, by (Bai and Silverstein, 2009, Theorem B.9), that , vaguely as a signed finite measure, with probability one, and, since is a probability measure (again from the results of (Silverstein and Bai, 1995; Silverstein and Choi, 1995)), we have thus proved Theorem 2.
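The convergence of the Stieltjes transform can be checked on a small example. This is a minimal sketch under assumptions that go beyond the text of this section: the activation is the ReLU with Gaussian W, for which Φ is taken to be the standard arc-cosine kernel, the resolvent is Q = (Σ^T Σ / T + γ I_T)^{-1}, and the deterministic equivalent is Q̄ = ((n/T) Φ/(1+δ) + γ I_T)^{-1} with δ = (1/T) tr ΦQ̄. All sizes and the helper name `relu_kernel` are illustrative.

```python
import numpy as np

def relu_kernel(X):
    """Phi_ab = E[relu(w^T a) relu(w^T b)] for w ~ N(0, I_p), columns a, b of X."""
    norms = np.linalg.norm(X, axis=0)
    outer = np.outer(norms, norms)
    angle = np.clip(X.T @ X / outer, -1.0, 1.0)
    return outer * (angle * np.arccos(-angle) + np.sqrt(1.0 - angle ** 2)) / (2 * np.pi)

rng = np.random.default_rng(0)
p, n, T, gamma = 300, 300, 300, 1.0
X = rng.standard_normal((p, T)) / np.sqrt(p)        # fixed data, bounded operator norm
eigvals = np.linalg.eigvalsh(relu_kernel(X))

# Deterministic side: fixed point for delta, then (1/T) tr Qbar.
delta = 0.0
for _ in range(500):
    delta = np.mean(eigvals / ((n / T) * eigvals / (1.0 + delta) + gamma))
m_det = np.mean(1.0 / ((n / T) * eigvals / (1.0 + delta) + gamma))

# Random side: (1/T) tr Q for a few independent draws of W.
m_emp = []
for _ in range(5):
    Sigma = np.maximum(rng.standard_normal((n, p)) @ X, 0.0)
    gram_eigs = np.linalg.eigvalsh(Sigma.T @ Sigma / T)
    m_emp.append(np.mean(1.0 / (gram_eigs + gamma)))

print(f"deterministic: {m_det:.4f}   empirical: {np.mean(m_emp):.4f} +/- {np.std(m_emp):.4f}")
```

The empirical values vary little from draw to draw and sit close to the deterministic one, which is the behaviour that Corollary 2 and Theorem 2 describe at the single point z = -γ.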
5.2.3 Asymptotic Equivalent for , where is either or symmetric of bounded norm
The evaluation of the second order statistics of the neural network under study requires, besides , evaluating the more involved form , where is a symmetric matrix either equal to or of bounded norm (so in particular is bounded). To evaluate this quantity, first write
[TABLE]
Of course, since is symmetric, we may write
[TABLE]
which will prove more practical to handle.
First note that, since and is such that is bounded, , which provides an estimate for the first expectation. We next evaluate the last right-hand side expectation above. With the same notations as previously, from exchangeability arguments and using , observe that
[TABLE]
which, reusing , is further decomposed as
[TABLE]
(where in the second-to-last line, we have merely reorganized the terms conveniently) and our interest is in handling . Let us first treat term . Since is bounded, by Lemma 4, concentrates around ; but, as is bounded, we also have . We thus deduce, with similar arguments as previously, that
[TABLE]
with probability exponentially close to one, in the order of symmetric matrices. Taking expectation and norms on both sides, and conditioning on the aforementioned event and its complementary, we thus have that
[TABLE]
But, again by exchangeability arguments,
[TABLE]
with , the operator norm of which is bounded as O(1). So finally,
[TABLE]
We now move to term . Using the relation ,
[TABLE]
and the symmetrical lower bound (equal to the opposite of the upper bound), where . For the same reasons as above, the first right-hand side term is bounded by . As for the second term, for , it is clearly bounded; for , using , can be expressed in terms of and for , all of which have been shown to be bounded (at most by ). We thus conclude that
[TABLE]
Finally, term can be handled similarly as term and is shown to be of norm bounded by .
As a consequence of all the above, we thus find that
[TABLE]
It is tempting to believe that the sum of the second and third terms above vanishes. This is indeed verified by observing that, for any matrix ,
[TABLE]
and symmetrically
[TABLE]
with , and a similar reasoning is performed to control and . For bounded, is bounded as , and thus is of order . So in particular, taking of bounded norm, we find that
[TABLE]
Take now . Then, from the relation in the order of symmetric matrices,
[TABLE]
The first norm in the parentheses is bounded by and it thus remains to control the second norm. To this end, similar to the control of , by writing for independent vectors with the same law as , and exploiting the exchangeability, we obtain after some calculation that can be expressed as the sum of terms of the form or for diagonal matrices of norm bounded as , while and are similar to and , only with replaced by . All these terms are bounded as and we finally obtain that is bounded and thus
[TABLE]
With the additional control on and , together, this implies that ${\rm E}[Q\Phi Q]={\rm E}[Q_{-}\Phi Q_{-}]+O_{\|\cdot\|}(n^{-1})$. Hence, for , exploiting the fact that , we have the simplification
[TABLE]
or equivalently
[TABLE]
We have already shown in (11) that and thus
[TABLE]
So finally, for all of bounded norm,
[TABLE]
which proves immediately Proposition 1 and Theorem 3.
5.3 Derivation of
5.3.1 Gaussian
In this section, we evaluate the terms provided in Table 1. The proof for the term corresponding to can already be found in (Williams, 1998, Section 3.1) and is not recalled here. For the other functions , we follow an approach similar to that of (Williams, 1998), as detailed next.
The evaluation of for requires estimating
[TABLE]
Assume that and are not linearly dependent. It is convenient to observe that this integral can be reduced to a two-dimensional integration by considering the basis defined (for instance) by
[TABLE]
and any completion of the basis. By letting and (), (where and ), this reduces to
[TABLE]
Letting , and , this is conveniently written as the two-dimensional integral
[TABLE]
The case where and are linearly dependent can then be obtained by continuity arguments.
The function
For this function, we have
[TABLE]
Since , a simple geometric representation lets us observe that
[TABLE]
where we defined . We may thus operate a polar coordinate change of variable (with inverse Jacobian determinant equal to ) to obtain
[TABLE]
With two integrations by parts, we have that . Classical trigonometric formulas also provide
[TABLE]
where we used in particular . Altogether, after simplification and replacement of , and , this gives
[TABLE]
It is worth noticing that this may be more compactly written as
[TABLE]
which is minimum for (since on ) and takes there the limiting value zero. Hence for and not linearly dependent.
For and linearly dependent, we simply have for and for .
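The resulting expression can be sanity-checked by Monte Carlo. The sketch below assumes that the activation treated in this subsection is the rectifier max(t, 0) and that the closed form is the classical arc-cosine expression (1/2π) ||a|| ||b|| (∠ arccos(−∠) + sqrt(1 − ∠²)), where ∠ denotes a^T b / (||a|| ||b||); since the displays are not reproduced here, both points are assumptions. The name `relu_kernel_entry` is hypothetical.

```python
import numpy as np

def relu_kernel_entry(a, b):
    """Closed form of E[max(w^T a, 0) max(w^T b, 0)] for w ~ N(0, I_p)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    angle = np.clip(a @ b / (na * nb), -1.0, 1.0)    # cosine of the angle between a and b
    return na * nb * (angle * np.arccos(-angle) + np.sqrt(1.0 - angle ** 2)) / (2 * np.pi)

rng = np.random.default_rng(0)
p, n_samples = 5, 400_000
a, b = rng.standard_normal(p), rng.standard_normal(p)
a, b = 1.5 * a / np.linalg.norm(a), 0.5 * b / np.linalg.norm(b)

W = rng.standard_normal((n_samples, p))              # Monte Carlo draws of w
monte_carlo = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))
closed_form = relu_kernel_entry(a, b)
print(f"closed form: {closed_form:.4f}   Monte Carlo: {monte_carlo:.4f}")

# Linearly dependent cases: b = a gives ||a||^2 / 2, while b = -a gives 0.
same_direction = relu_kernel_entry(a, a)
opposite_direction = relu_kernel_entry(a, -a)
print(same_direction, 0.5 * a @ a, opposite_direction)
```

The two linearly dependent cases printed at the end match the continuity argument: the kernel equals half the squared norm for aligned vectors and vanishes for opposite ones.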
The function
Since , we have
[TABLE]
Hence, reusing the results above, we have here
[TABLE]
Using the identity provides the expected result.
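The reduction to the previous case can be verified numerically. The sketch assumes that this subsection treats σ(t) = |t| through the decomposition |t| = max(t, 0) + max(−t, 0), and that the target closed form is (2/π) ||a|| ||b|| (∠ arcsin(∠) + sqrt(1 − ∠²)); both are assumptions, since the displays are missing here. The helpers `relu_phi` and `abs_phi` are hypothetical names, written as functions of the scale ||a|| ||b|| and of ∠.

```python
import numpy as np

def relu_phi(scale, angle):
    """ReLU kernel as a function of scale = ||a|| ||b|| and angle = a^T b / scale."""
    return scale * (angle * np.arccos(-angle) + np.sqrt(1.0 - angle ** 2)) / (2 * np.pi)

def abs_phi(scale, angle):
    """Candidate closed form for sigma(t) = |t| (an assumption, see the lead-in)."""
    return 2 * scale * (angle * np.arcsin(angle) + np.sqrt(1.0 - angle ** 2)) / np.pi

angles = np.linspace(-0.99, 0.99, 199)
scale = 1.7
# |t| = max(t, 0) + max(-t, 0); replacing b by -b flips the sign of the angle,
# and w, -w share the same law, so the four cross terms pair up two by two.
via_relu = 2 * relu_phi(scale, angles) + 2 * relu_phi(scale, -angles)
max_gap = np.max(np.abs(via_relu - abs_phi(scale, angles)))
print(f"largest gap over the grid: {max_gap:.2e}")
```

The agreement to machine precision rests on the identity arccos(−x) − arccos(x) = 2 arcsin(x).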
The function
With the same notations as in the case , we have to evaluate
[TABLE]
After a polar coordinate change of variable, this is
[TABLE]
from which the result unfolds.
The function
Here it suffices to note that so that
[TABLE]
and to apply the result of the previous section, with either , , or . Since , we conclude that
[TABLE]
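The same kind of check applies here. The sketch assumes that the two preceding subsections treat the indicator 1{t > 0} and the sign function, with closed forms 1/2 − arccos(∠)/(2π) and (2/π) arcsin(∠) respectively, and that sign(t) = 1{t > 0} − 1{−t > 0} is the decomposition being used; these are assumptions because the displays are not reproduced in this section. Function names are hypothetical.

```python
import numpy as np

def cos_angle(a, b):
    """Cosine of the angle between a and b, clipped for numerical safety."""
    return np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)

def step_kernel_entry(a, b):
    """E[1{w^T a > 0} 1{w^T b > 0}] for w ~ N(0, I_p)."""
    return 0.5 - np.arccos(cos_angle(a, b)) / (2 * np.pi)

def sign_kernel_entry(a, b):
    """E[sign(w^T a) sign(w^T b)] for w ~ N(0, I_p)."""
    return 2 * np.arcsin(cos_angle(a, b)) / np.pi

rng = np.random.default_rng(0)
p, n_samples = 5, 400_000
a, b = rng.standard_normal(p), rng.standard_normal(p)

# sign(t) = 1{t > 0} - 1{-t > 0}, and w, -w share the same law.
via_step = 2 * step_kernel_entry(a, b) - 2 * step_kernel_entry(a, -b)

W = rng.standard_normal((n_samples, p))
mc_step = np.mean((W @ a > 0) & (W @ b > 0))
mc_sign = np.mean(np.sign(W @ a) * np.sign(W @ b))

print(f"step: closed form {step_kernel_entry(a, b):.4f}, Monte Carlo {mc_step:.4f}")
print(f"sign: closed form {sign_kernel_entry(a, b):.4f}, via step {via_step:.4f}, Monte Carlo {mc_sign:.4f}")
```

Note that neither kernel depends on the norms of a and b, only on the angle between them.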
The functions and .
Let us first consider . We have here to evaluate
[TABLE]
which boils down to evaluating, for , the integral
[TABLE]
Altogether, we find
[TABLE]
For , it suffices to appropriately adapt the signs in the expression of (using the relation ) to obtain in the end
[TABLE]
as desired.
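For the trigonometric activations, the expected results can again be compared with simulation. The sketch assumes the closed forms exp(−(||a||² + ||b||²)/2) cosh(a^T b) for the cosine and exp(−(||a||² + ||b||²)/2) sinh(a^T b) for the sine, which follow from E[cos(w^T c)] = exp(−||c||²/2) for standard Gaussian w; whether these coincide with the entries of Table 1 is an assumption, since the table is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def cos_kernel_entry(a, b):
    """E[cos(w^T a) cos(w^T b)] for w ~ N(0, I_p)."""
    return np.exp(-0.5 * (a @ a + b @ b)) * np.cosh(a @ b)

def sin_kernel_entry(a, b):
    """E[sin(w^T a) sin(w^T b)] for w ~ N(0, I_p)."""
    return np.exp(-0.5 * (a @ a + b @ b)) * np.sinh(a @ b)

rng = np.random.default_rng(0)
p, n_samples = 4, 400_000
a, b = rng.standard_normal(p), rng.standard_normal(p)
a, b = a / np.linalg.norm(a), 0.8 * b / np.linalg.norm(b)   # moderate norms

W = rng.standard_normal((n_samples, p))
mc_cos = np.mean(np.cos(W @ a) * np.cos(W @ b))
mc_sin = np.mean(np.sin(W @ a) * np.sin(W @ b))

print(f"cos: closed form {cos_kernel_entry(a, b):.4f}, Monte Carlo {mc_cos:.4f}")
print(f"sin: closed form {sin_kernel_entry(a, b):.4f}, Monte Carlo {mc_sin:.4f}")
```

Adding the two kernels gives exp(−||a − b||²/2), the Gaussian kernel, which offers an independent consistency check on the formulas.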
5.4 Polynomial and generic
In this section, we prove Equation 3.3 for and a random vector with independent and identically distributed entries of zero mean and moment of order equal to . The result is based on standard combinatorics. We are to evaluate
[TABLE]
After development, it appears that one needs only assess, for say vectors that take values in , the moments
[TABLE]
where we recall the definition . Gathering all the terms for appropriate selections of leads to (3.3).
5.5 Heuristic derivation of Conjecture 1
Conjecture 1 essentially follows as an aftermath of Remark 1. We believe that, similar to , is expected to be of the form , where , with with high probability. Besides, if were chosen as constituted of Gaussian mixture vectors, with non-trivial growth rate conditions as introduced in (Couillet and Benaych-Georges, 2016), it is easily seen that and , for some constant and .
This subsequently ensures that and would be of a similar form and with and of bounded norm. These facts, which would require more advanced proof techniques, suggest the following heuristic derivation for Conjecture 1.
Recall that our interest is on the test performance defined as
[TABLE]
which may be rewritten as
[TABLE]
If follows the aforementioned claimed operator norm control, reproducing the steps of Corollary 3 leads to a similar concentration for , which we shall then admit. We are therefore left with evaluating and .
We start with the term , which we expand as
[TABLE]
with , the operator norm of which is bounded by with high probability. Now, observe that, again with the assumption that with controlled , may be decomposed as
[TABLE]
In the display above, the first right-hand side term is now of order . As for the second right-hand side term, note that is a vector of independent and identically distributed zero mean and variance entries; while not formally independent of , it is nonetheless expected that this dependence “weakens” asymptotically (a behavior observed several times in linear random matrix models), so that one expects by central limit arguments that the second right-hand side term is also of order .
This would thus result in
[TABLE]
where we used and the definition .
We then move on to of Equation (12), which can be developed as
[TABLE]
In the term , reproducing the proof of Lemma 1 with the condition bounded, we obtain that concentrates around , which allows us to write
[TABLE]
with and thus can be rewritten as
[TABLE]
while for , following the same arguments as previously, we have
[TABLE]
where .
Since , we are free to plug in the asymptotic equivalent of derived in Section 5.2.3, and we deduce
[TABLE]
The term of the double sum over and () requires more effort. To handle this term, we need to remove the dependence of both and in in sequence. We start with as follows:
[TABLE]
where in the second-to-last inequality we used the relation
[TABLE]
For , we replace by and take expectation over
[TABLE]
The idea to handle is to retrieve forms of the type for some satisfying with high probability. To this end, we use
[TABLE]
and thus can be expanded as the sum of three terms that shall be studied in order:
[TABLE]
where . First, is of order since is of bounded operator norm. Subsequently, can be rewritten as
[TABLE]
with here
[TABLE]
The same arguments apply for but for
[TABLE]
which completes the proof that and thus
[TABLE]
It remains to handle . Under the same claims as above, we have
[TABLE]
where we introduced the notation . For , we replace by , and take the expectation over , as follows
[TABLE]
with having the same law as , and , both expected to be of order . Using again the asymptotic equivalent of devised in Section 5.2.3, we then have
[TABLE]
Following the same principle, we deduce for that
[TABLE]
with , also believed to be of order . Recalling the fact that , we can thus conclude for that
[TABLE]
As for , we have
[TABLE]
Since is expected to be of bounded norm, using the concentration inequality of the quadratic form , we infer
[TABLE]
We again replace by and take expectation over to obtain
[TABLE]
with , which eventually brings the second term to vanish, and we thus get
[TABLE]
For the term we apply again the concentration inequality to get
[TABLE]
with high probability, where , the norm of which is of order . This entails
[TABLE]
with high probability. Once more plugging the asymptotic equivalent of deduced in Section 5.2.3, we conclude for that
[TABLE]
and eventually for
[TABLE]
Combining the estimates of as well as and , we finally have the estimates for the test error defined in (12) as
[TABLE]
Since by definition, , we may use
[TABLE]
in the second term in brackets to finally retrieve the form of Conjecture 1.
6 Concluding Remarks
This article provides a possible direction of exploration of random matrices involving entry-wise non-linear transformations (here through the function ), as typically found in modelling neural networks, by means of a concentration of measure approach. The main advantage of the method is that it leverages the concentration of an initial random vector (here a Lipschitz function of a Gaussian vector) to transfer concentration to all vectors (or matrices ) that are Lipschitz functions of . This implies that Lipschitz functionals of (or ) further satisfy concentration inequalities and thus, if the Lipschitz parameter scales with , convergence results as . With this in mind, note that we could have generalized our input-output model of Section 2 to
[TABLE]
for with some probability space and a random variable such that and (where is here applied column-wise) satisfy a concentration of measure phenomenon; it is not even necessary that has a normal concentration so long as the corresponding concentration function allows for appropriate convergence results. This generalized setting however has the drawback of being less explicit and less practical (as most neural networks involve linear maps rather than non-linear maps of and ).
A much less demanding generalization though would consist in changing the vector for a vector still satisfying an exponential (not necessarily normal) concentration. This is the case notably if with a Lipschitz map with Lipschitz parameter bounded by, say, or any small enough power of . This would then allow for with heavier than Gaussian tails.
Despite its simplicity, the concentration method also has some strong limitations that presently do not allow for a sufficiently profound analysis of the testing mean square error. We believe that Conjecture 1 can be proved by means of more elaborate methods. Notably, we believe that the powerful Gaussian method advertised in (Pastur and Ŝerbina, 2011) which relies on Stein’s lemma and the Poincaré–Nash inequality could provide a refined control of the residual terms involved in the derivation of Conjecture 1. However, since Stein’s lemma (which states that for and differentiable polynomially bounded ) can only be used on products involving the linear component , the latter is not directly accessible; we nonetheless believe that appropriate ansatzs of Stein’s lemma, adapted to the non-linear setting and currently under investigation, could be exploited.
As a striking example, one key advantage of such a tool would be the possibility to evaluate expectations of the type which, in our present analysis, was shown to be bounded in the order of symmetric matrices by with high probability. Thus, if no matrix (such as ) pre-multiplies , since can grow as large as , cannot be shown to vanish. But such a bound does not account for the fact that would in general be unbounded because of the term in the display , where . Intuitively, the “mean” contribution of , being post-multiplied in by (which averages to zero) disappears; and thus only smaller order terms remain. We believe that the aforementioned ansatzs for the Gaussian tools would be capable of subtly handling this self-averaging effect on to prove that vanishes (for , it is simple to show that ). In addition, Stein’s lemma-based methods only require the differentiability of , which need not be Lipschitz, thereby allowing for a larger class of activation functions.
As suggested in the simulations of Figure 2, our results also seem to extend to non-continuous functions . To date, we cannot envision a method to tackle this setting.
In terms of neural network applications, the present article is merely a first step towards a better understanding of the “hardening” effect occurring in large dimensional networks with numerous samples and large data points (that is, simultaneously large ), which we exemplified here through the convergence of mean-square errors. The mere fact that some standard performance measures of these random networks would “freeze” as grow in the predicted regime, and that the performance would heavily depend on the distribution of the random entries, is already in itself an interesting result for neural network understanding and dimensioning. However, more interesting questions remain open. Since neural networks are today dedicated to classification rather than regression, a first question is the study of the asymptotic statistics of the output itself; we believe that satisfies a central limit theorem with mean and covariance allowing for the assessment of the asymptotic misclassification rate.
A further extension of the present work would be to go beyond the single-layer network and include multiple layers (finitely many or possibly a number scaling with ) in the network design. The interest here would lie in the key question of how best to distribute the neurons across the successive layers.
It is also classical in neural networks to introduce different (possibly random) biases at the neuron level, thereby turning into for a random variable different for each neuron. This has the effect of mitigating the negative impact of the mean , which is independent of the neuron index .
Finally, neural networks, despite having recently been shown to operate almost equally well when taken random in some very specific scenarios, are usually only initialized as random networks before being trained through backpropagation of the error on the training dataset (that is, essentially through convex gradient descent). We believe that our framework can allow for the understanding of at least finitely many steps of gradient descent, which may then provide further insights into the overall performance of deep learning networks.
Appendix A Intermediary Lemmas
This section recalls some elementary algebraic relations and identities used throughout the proof section.
Lemma 5 (Resolvent Identity).
For invertible matrices $A, B$, $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$.
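Lemma 6 (A rank-1 perturbation identity).
For $A$ Hermitian, $v$ a vector and $t \in \mathbb{R}$, if $A$ and $A + t v v^{\mathsf{T}}$ are invertible, then
$$(A + t v v^{\mathsf{T}})^{-1} v = \frac{A^{-1} v}{1 + t\, v^{\mathsf{T}} A^{-1} v}.$$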
Lemma 7 (Operator Norm Control).
For nonnegative definite and ,
[TABLE]
where is the Hausdorff distance of a point to a set. In particular, for , and .
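The three lemmas above can be sanity-checked numerically. The sketch below assumes their standard textbook forms, which are written out in the code comments, and restricts the norm bounds to a positive real shift. The matrices, the dimension, the constants `t` and `gamma`, and all variable names are hypothetical test data rather than quantities from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
identity = np.eye(n)
inv = np.linalg.inv

# Hypothetical test data: two symmetric positive definite matrices,
# a vector, and two positive scalars.
M = rng.standard_normal((n, n))
N = rng.standard_normal((n, n))
A = M @ M.T + identity
B = N @ N.T + identity
v = rng.standard_normal(n)
t = 0.7
gamma = 0.5

# Assumed form of the resolvent identity:
#   A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}.
resolvent_gap = np.linalg.norm(inv(A) - inv(B) - inv(A) @ (B - A) @ inv(B))

# Assumed form of the rank-1 perturbation identity:
#   (A + t v v^T)^{-1} v = A^{-1} v / (1 + t v^T A^{-1} v).
rank_one_lhs = inv(A + t * np.outer(v, v)) @ v
rank_one_rhs = inv(A) @ v / (1.0 + t * (v @ inv(A) @ v))
rank_one_gap = np.linalg.norm(rank_one_lhs - rank_one_rhs)

# Assumed norm bounds for a nonnegative definite P and gamma > 0:
#   ||(P + gamma I)^{-1}|| <= 1 / gamma  and  ||P (P + gamma I)^{-1}|| <= 1.
P = M @ M.T
Q = inv(P + gamma * identity)
norm_resolvent = np.linalg.norm(Q, 2)   # spectral (operator) norm
norm_product = np.linalg.norm(P @ Q, 2)

print(resolvent_gap, rank_one_gap, norm_resolvent, norm_product)
```

Both gaps should be at the level of floating-point rounding, and the two norms should respect the bounds 1/gamma and 1 respectively.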
References
- Akhiezer, N. I. and Glazman, I. M. (1993). Theory of linear operators in Hilbert space. Courier Dover Publications.
- Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large dimensional sample covariance matrices. The Annals of Probability 26, 316–345.
- Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA systems in wireless communications. Annals of Applied Probability 17, 81–101.
- Bai, Z. D. and Silverstein, J. W. (2009). Spectral analysis of large dimensional random matrices, second ed. Springer Series in Statistics, New York, NY, USA.
- Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis 111, 120–135.
- Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model for affective analogical reasoning. Neurocomputing 149, 443–455.
- Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and Le Cun, Y. (2015). The Loss Surfaces of Multilayer Networks. In AISTATS.
- Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics 10, 1393–1454.
