Products of Many Large Random Matrices and Gradients in Deep Neural Networks
Boris Hanin, Mihai Nica

Boris Hanin (Department of Mathematics, Texas A&M)
Mihai Nica (Department of Mathematics, University of Toronto)
Abstract
We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the number of matrices tend to infinity. Depending on the scaling limit considered, the mean and variance of the limiting Gaussian depend only on either the first two or the first four moments of the measure from which matrix entries are drawn. We also obtain explicit error bounds on the moments of the norm and the Kolmogorov-Smirnov distance to a Gaussian. Finally, we apply our result to obtain precise information about the stability of gradients in randomly initialized deep neural networks with ReLU activations. This provides a quantitative measure of the extent to which the exploding and vanishing gradient problem occurs in a fully connected neural network with ReLU activations and a given architecture.
1 Introduction
Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks [28] to the physics of black holes [6], random dynamical systems [26], and recently to the numerical stability of randomly initialized neural networks [14, 24]. In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention [2, 19] but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).
To explain our results, let be a positive integer and let be a list of positive integers. We are concerned with (the non-asymptotic) analysis of products of independent rectangular random matrices of sizes with real entries:
[TABLE]
The specific matrix ensembles we study depend on a parameter and a distribution on . We define:
[TABLE]
where are diagonal matrices
[TABLE]
and are independent random matrices for which the entries are drawn i.i.d. from a fixed distribution on satisfying the following four conditions:
[TABLE]
When the matrices are the identity. In contrast, when , the matrix product naturally arises in connection to the input-output Jacobian matrix for neural nets with nonlinearity and layers with widths initialized with random weights drawn from . In particular, the following equality in distribution holds when :
[TABLE]
so that, when the singular values of are equal in distribution to those of This is a consequence of Proposition 2 below, which opens the door to a rigorous study of the so-called exploding and vanishing gradient problem for nets at finite depth and width. This refines the approach of the first author in [14], and we refer the reader to Section 1.5 for precise definitions and an extended discussion of this point. Our main result concerns the distribution of
[TABLE]
As explained in Section 1.4 below, can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining , with corresponding to a kind of initial condition. The diagonal matrices then correspond to valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely
[TABLE]
is Gaussian up to an error that tends to zero when tend to infinity.
Theorem 1.
Fix , and a distribution satisfying (i)-(iv) above. Let be some fixed unit vector, and for any choice of , , set
[TABLE]
Let be as in (2). Then, the norm of the vector is approximately log-normal distributed:
[TABLE]
This approximation holds both in the sense of distribution and of moments. More precisely, with denoting Kolmogorov-Smirnov distance,
[TABLE]
where the implicit constant is uniform for in a compact subset of , and in a compact set bounded away from . Moreover, for every satisfying we have
[TABLE]
where the implicit constant depends on and the moments of but not on
Remark.
In the proof of equation (5), we actually show that is bounded above by
[TABLE]
for any choice of and where the constants depend only on the moments of and . By taking and restricting the in a compact set, the result claimed in Theorem 1 holds. Moreover, if we take , then we will actually prove instead the sharper result that
[TABLE]
The two conclusions of Theorem 1, equation (6) and equation (7), are proven separately, by independent arguments. We prove equation (6) by a path-counting type argument in Section 3. The argument in Section 4 for equation (7), in contrast, uses a central limit theorem for martingales.
1.1 Joint scaling limits
Theorem 1 shows that the free energy from (4) is Gaussian in the double scaling limit
[TABLE]
achieved for instance when are equal and proportional to This asymptotic normality for cannot be seen by taking the limits and to infinity one after the other. Indeed, consider the case when and the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution
[TABLE]
where is a chi-squared random variable with degrees of freedom and the terms in the product are independent. In the limit where is fixed and we have and so
[TABLE]
On the other hand, if the are uniformly bounded, then we have
[TABLE]
In fact, for fixed, converges only with an additional scaling:
[TABLE]
In particular, we have
[TABLE]
making (8) an interesting regime for . The non-commutativity of the limits is well-known [1, 7, 19] and is related to the fact that the local statistics of the singular values of are sensitive to the order in which the limits above are taken. Remaining in the simple case of and Gaussian, a simple application of the central limit theorem shows that when all the are equal and are related to by , the exact chi-squared representation of equation (9) gives the convergence in distribution:
[TABLE]
which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter and the measure according to which the entries of the matrices are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.
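The chi-squared representation discussed above can be checked numerically. The following is a minimal sketch, assuming square Gaussian matrices with entries of variance 1/n and the unit vector u = e_1; the function names and the sizes n, d, trials are illustrative choices, not taken from the text. It samples the log squared norm once by multiplying matrices directly and once as a sum of logarithms of independent chi-squared factors divided by n. For large n, each factor has log-mean close to -1/n and log-variance close to 2/n (a standard digamma expansion), so the printed guide values are -d/n and 2d/n.

```python
import numpy as np

def log_norm_direct(n, d, trials, rng):
    """Sample log ||W_d ... W_1 u||^2 with i.i.d. N(0, 1/n) entries and u = e_1."""
    v = np.zeros((trials, n))
    v[:, 0] = 1.0  # one copy of the fixed unit vector per trial
    for _ in range(d):
        # A fresh independent n x n matrix for every trial and every layer.
        W = rng.standard_normal((trials, n, n)) / np.sqrt(n)
        v = np.einsum("tij,tj->ti", W, v)
    return np.log(np.sum(v * v, axis=1))

def log_norm_chi2(n, d, trials, rng):
    """Sample the same law as a sum of d independent log(chi^2_n / n) terms."""
    factors = rng.chisquare(n, size=(trials, d)) / n
    return np.log(factors).sum(axis=1)

rng = np.random.default_rng(0)
n, d, trials = 20, 20, 500  # illustrative sizes with d / n = 1
direct = log_norm_direct(n, d, trials, rng)
chi2 = log_norm_chi2(n, d, trials, rng)
print("direct product : mean %.3f, var %.3f" % (direct.mean(), direct.var()))
print("chi^2 factors  : mean %.3f, var %.3f" % (chi2.mean(), chi2.var()))
print("large-n guide  : mean %.3f, var %.3f" % (-d / n, 2 * d / n))
```

Both samplers should agree up to Monte Carlo error, and the fluctuations stay of order one when d and n grow together, which is the regime the section is concerned with.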
1.2 Connection to previous work in Random Matrix Theory
The literature on products of random matrices is vast. Much of the previous work concerns products of i.i.d. random matrices, each of size . Such ensembles have been well studied in two distinct regimes: (a) when is fixed and and (b) when is fixed and . Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal results in this regime are those of Furstenberg and Kesten [10], which give general conditions for the existence of the top Lyapunov exponent
[TABLE]
and the multiplicative ergodic theorem of Oseledets [23], which gives conditions for almost sure (deterministic) values for all the Lyapunov exponents. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have entries that are real or complex Gaussians, see e.g. [1, 9, 8, 18, 21, 15] as well as the survey [3] and references therein.
Case (b), where is fixed and , falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit , a collection of independent random matrices behaves like a collection of freely independent random variables on a non-commutative probability space (see e.g. [4] Chapter 5 or [20] Chapters 1 and 4). Therefore, case (b) is closely related to a product of freely independent random variables; precise results are obtained in [11]. Earlier results [12, 22] examine case (b) without explicit use of free probability. The problem of first taking and afterwards taking can also be handled using the tools of free probability in the case of Gaussian matrices, see [27].
As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits , and do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in [1]. To explain this further, we note that the work of Newman [16, Thm. 1] in regime (a) shows that when and is fixed, the density of the Lyapunov exponents of converges in the limit when first and then to the triangular density
[TABLE]
The work of Tucci [27, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to one obtains the same global limit in the regime (b) when first and then However, as explained in [1] Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to be different depending on the order of the limits even in the exactly solvable special case of products of complex Ginibre matrices.
From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix and let and tend to infinity in accordance with (8). Note that we specifically do not take to infinity. Denote by the non-zero singular values of , and by the corresponding left-singular vectors. Then we have
[TABLE]
In many situations of interest we can expect that the inner products satisfy . This happens for example if the vector is chosen uniformly at random on the -sphere or when is a fixed vector and the matrix is invariant under right multiplication by an orthogonal matrix. In this setting
[TABLE]
Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for is a Gaussian with mean and variance in the limit (8). These non-trivial corrections in can be seen as a finite temperature correction to the maximal entropy regime of Tucci [27] in which first and then . For more on this point of view, we refer the reader to [1, Section 3.2].
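The decomposition of the squared norm into singular values weighted by squared inner products is easy to verify numerically. The sketch below is an illustration under assumptions: the matrix sizes and the Gaussian test matrix are arbitrary, and it uses NumPy's convention M = U diag(s) Vh, in which the vectors paired with the input u are the rows of Vh. It also checks the averaged statement: over an orthonormal basis of inputs, the mean squared norm equals the sum of squared singular values divided by the input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 6, 4                      # illustrative sizes
M = rng.standard_normal((n_out, n_in))  # any real matrix works here
u = rng.standard_normal(n_in)
u /= np.linalg.norm(u)                  # unit input vector

# Reduced SVD in NumPy's convention: M = U @ diag(s) @ Vh.
U, s, Vh = np.linalg.svd(M, full_matrices=False)

# ||M u||^2 as a weighted sum of squared singular values.
lhs = np.linalg.norm(M @ u) ** 2
rhs = np.sum(s**2 * (Vh @ u) ** 2)

# Averaging ||M e_j||^2 over the standard basis gives sum(s^2) / n_in exactly.
basis_average = np.mean(np.sum(M**2, axis=0))
spectral_average = np.sum(s**2) / n_in

print(lhs, rhs)
print(basis_average, spectral_average)
```

The second identity is the exact, finite-dimensional counterpart of replacing each squared inner product by its average value for a uniformly random unit vector.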
Finally, in the specific case where the random matrices are complex Ginibre matrices (i.e. the matrix entries are i.i.d. complex Gaussian), very recent work [2, 19] looks at the limiting spectrum under the joint scaling limit , where the ratio is fixed or going to . This work analyzes exact determinantal formulas for the joint distribution of singular values available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices given in [15] are significantly more complicated and such an explicit analysis appears to be much more difficult.
1.3 Contribution of the Present Work
In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).
Second, the ensembles in Theorem 1 include the somewhat unusual diagonal matrices as part of the model. Our original motivation for including these is the connection to neural networks explained in Section 1.5. In essence, the matrices can be interpreted as adding i.i.d. valued spins to the usual sum over paths approach to moments of products of matrices. Only “open” paths that have spin on every vertex contribute to the sum, causing open paths to be correlated. Previously, Forrester [8] and Tucci [27] considered the case when were deterministic positive definite matrices.
Third, Theorem 1 proves that the distribution of $\big\|M^{(d)}\vec{u}\big\|_{2}^{2}$ is (mostly) universal: it does not depend on the higher moments of the distribution beyond the mean and variance, with the exception of the fourth moment appearing in in the term . In the regime and , this term is a correction.
The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form . This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).
Finally, we remark that Theorem 1 only studies $\big\|M^{(d)}\vec{u}\big\|_{2}^{2}$ for a fixed vector , and therefore leaves several questions open: for instance the joint law of $\{\big\|M^{(d)}\vec{u}^{(1)}\big\|_{2}^{2},\ldots,\big\|M^{(d)}\vec{u}^{(\ell)}\big\|_{2}^{2}\}$ for a list of vectors and more generally the limiting spectral distribution of the matrices . We plan to address these questions in forthcoming work.
1.4 Connection to Random Polymers
The matrix ensembles studied in this article, in the case , are related to directed random polymers on the complete graph of size . This model was recently explored in detail (see e.g. [5]). A key object for these polymers is the line-to-line partition function
[TABLE]
where is the temperature of the model, and are i.i.d. mean zero random variables that make up the underlying disordered environment. When the sum over is written via products of matrices of size , the disordered environment can be viewed as a multipartite graph made of vertex clusters of size with (directed) edges from all vertices in to all vertices in The edges of this graph are then decorated with the corresponding matrix entries which are strictly positive, making a sum over paths from the input to the output of this graph. Each path is weighted by its energy, given by the product of weights along the path.
The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different from the analysis of the matrix product defined in (1). In particular, no cancellation is possible between the terms in the definition of above, causing to be exponential in . The fixed and limit of of the partition function in the case of these positive weights is the object of study in [5]. As explained in Section 1.3, Theorem 1 studies a different regime where both at the same time. The fact that our weights are mean zero gives rise to significant cancellation in the terms of from (4), so that the partition function in our mean zero model does not grow exponentially with provided grows with as in (8). Additionally, if in our model, the effect of the diagonal Bernoulli matrices is to close every vertex with probability . The sum over paths in our partition function then becomes the sum only over those paths that pass through vertices that are open.
1.5 The Case as Gradients in Random Neural Nets
One of our motivations for studying the ensembles is that, as we prove in Proposition 2 below, the case corresponds exactly to the input-output Jacobian in randomly initialized neural networks with activations. To explain this connection, fix . A neural network with activations, depth , and layer widths is any function of the form
[TABLE]
where are affine
[TABLE]
and for and any vector we write
[TABLE]
The matrices and vectors are called, respectively, the weights and biases of at layer while collectively define the architecture of . We will write for an input to and will define
[TABLE]
to be the vectors of activities before and after applying at the neurons at layer .
In practice, the weights and biases in a neural network are first randomly initialized and then optimized by (stochastic) gradient descent on a task-specific loss that depends only on the outputs of A single gradient descent update for a trainable parameter (i.e. an entry in one of the weight matrices ) is
[TABLE]
where is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:
[TABLE]
making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of are random and the implicit structure of the data being processed has not yet regularized the function computed by .
As explained below, the EVGP for a depth net with hidden layer widths is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:
[TABLE]
The next result shows that the singular value distribution of is the same as that of since
[TABLE]
Proposition 2 also shows that, for any collection of vectors we have the following equality in distribution when :
[TABLE]
Proposition 2.
Let be a net with depth and layer widths . Fix Suppose the weights of are , which are drawn iid from the measure as in the original definition (2). Then, writing for the dimensional -Bernoulli random vector, whose entries are independent and take the values with probability , we have
[TABLE]
where are diagonal -Bernoulli matrices as in (2) with parameter and denotes equality in distribution.
Before proving Proposition 2, let us explain why the functions that we study in Theorem 1 are related to the EVGP. Due to the compositional nature of the function computed by we may use the chain rule to write, for the weight connecting neuron to neuron in layer
[TABLE]
where is the column of Therefore, fluctuations of the gradient descent update are captured precisely by fluctuations of bi-linear functionals of various layer to layer Jacobians in . We study in this article and obtain in Theorem 1 precise distribution and moment estimates on these quantities. For instance, Theorem 1 combined with Proposition 2 immediately yields the following
Corollary 3.
Let be a fully connected depth net with hidden layer width and randomly initialized weights drawn i.i.d. from the measure and scaled to have variance as in (2). Suppose also that the biases of are drawn i.i.d. from any measure satisfying the same assumptions as the measure Fix any with and write for the input-output Jacobian of . We have
[TABLE]
where
[TABLE]
and the implicit constant is uniform when ranges over a compact subset of .
For more information about the EVGP, statistics of gradients in random nets, and the distribution of the singular values of the input-output Jacobian, we refer the interested reader to [14, 24, 17, 25].
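The connection between ReLU Jacobians and products with Bernoulli diagonal matrices can be explored by simulation. The following is a rough sketch under stated assumptions, not the construction used in the proofs: equal widths n, Gaussian weights of variance 2/n (chosen here so that the expected squared norm is 1), standard Gaussian biases, ReLU applied after every layer including the last, a fixed input, and u = e_1. All function names and sizes are hypothetical. It compares the log squared norm of the Jacobian-vector product in an actual ReLU net with the same quantity for a product in which the 0/1 masks are independent fair coin flips.

```python
import numpy as np

def _matvec(W, v):
    # Batched matrix-vector product: one independent matrix per trial.
    return np.einsum("tij,tj->ti", W, v)

def relu_jacobian_log_norm(n, d, trials, rng):
    """log ||J u||^2 for a random depth-d, width-n ReLU net (assumed setup)."""
    x = np.full((trials, n), 1.0 / np.sqrt(n))  # fixed nonzero input
    v = np.zeros((trials, n))
    v[:, 0] = 1.0                               # fixed unit vector u = e_1
    for _ in range(d):
        W = rng.standard_normal((trials, n, n)) * np.sqrt(2.0 / n)  # assumed scaling
        b = rng.standard_normal((trials, n))                        # assumed biases
        pre = _matvec(W, x) + b
        mask = (pre > 0).astype(float)  # derivative of ReLU at each neuron
        v = mask * _matvec(W, v)        # forward-mode step: v <- D W v
        x = mask * pre                  # ReLU(pre)
    return np.log(np.sum(v * v, axis=1))

def bernoulli_product_log_norm(n, d, trials, rng):
    """Same quantity with the masks replaced by independent fair 0/1 coin flips."""
    v = np.zeros((trials, n))
    v[:, 0] = 1.0
    for _ in range(d):
        W = rng.standard_normal((trials, n, n)) * np.sqrt(2.0 / n)
        mask = rng.integers(0, 2, size=(trials, n)).astype(float)
        v = mask * _matvec(W, v)
    return np.log(np.sum(v * v, axis=1))

rng = np.random.default_rng(0)
n, d, trials = 30, 10, 500  # illustrative sizes
net = relu_jacobian_log_norm(n, d, trials, rng)
ber = bernoulli_product_log_norm(n, d, trials, rng)
print("ReLU net       : mean %.3f, var %.3f" % (net.mean(), net.var()))
print("Bernoulli masks: mean %.3f, var %.3f" % (ber.mean(), ber.var()))
```

Rerunning with a larger depth at fixed width should show the variance of the log-norm growing with the ratio of depth to width, which is the quantitative form of the exploding and vanishing gradient problem discussed in this section.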
2 Proof of Proposition 2
2.1 Idea behind the proof
The essential idea behind Proposition 2 is to notice that the derivative of the function is , so when doing the chain rule to compute , we find the following -valued diagonal matrices naturally appearing:
[TABLE]
Since the random weights and biases are symmetrically distributed around $0$ (i.e. ) and have no atoms, it is easily verified that each entry in is equally likely to be positive or negative regardless of the value of . Hence the matrix in equation (12) is equal in distribution to the Bernoulli matrix when . This informally explains the connection between and .
It remains to see that these diagonal matrices are independent of each other: since the outputs of each layer are fed into subsequent layers, they are not a priori independent. This will again be a consequence of the fact that the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random sign variables. This does not change the distribution of the network, but allows us to see the independence between layers in a more concrete way.
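The two claims in this outline, that each neuron is active with probability 1/2 and that activation patterns in different layers are independent, can be probed empirically. The sketch below is only a sanity check under assumptions (Gaussian weights and biases, a two-layer net, a fixed input; all names and sizes are hypothetical). It estimates the activation frequency in each layer and the correlation between one neuron's activation in layer 1 and one neuron's activation in layer 2. A vanishing correlation is necessary for independence but does not prove it.

```python
import numpy as np

def activation_patterns(n, trials, rng):
    """Return boolean activation patterns of layers 1 and 2 at a fixed input."""
    x = np.full(n, 1.0 / np.sqrt(n))  # fixed nonzero input
    W1 = rng.standard_normal((trials, n, n)) / np.sqrt(n)
    b1 = rng.standard_normal((trials, n))
    pre1 = W1 @ x + b1                # shape (trials, n)
    act1 = np.maximum(pre1, 0.0)      # ReLU output feeds the next layer
    W2 = rng.standard_normal((trials, n, n)) / np.sqrt(n)
    b2 = rng.standard_normal((trials, n))
    pre2 = np.einsum("tij,tj->ti", W2, act1) + b2
    return pre1 > 0, pre2 > 0

rng = np.random.default_rng(0)
D1, D2 = activation_patterns(n=10, trials=4000, rng=rng)
corr = np.corrcoef(D1[:, 0].astype(float), D2[:, 0].astype(float))[0, 1]
print("layer 1 active fraction:", D1.mean())
print("layer 2 active fraction:", D2.mean())
print("correlation between layers:", corr)
```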
2.2 Proof of Proposition 2
Proof of Proposition 2.
Fix a neural net as in the statement of Proposition 2 and denote its weights and biases at layer by and . For each let
[TABLE]
be an i.i.d. collection of random variables that each take values with probability . We will also define
[TABLE]
Consider the neural net with weights and biases defined by changing the signs of the weights and biases of as follows:
[TABLE]
so that
[TABLE]
We will denote by the activations and input-output Jacobian for both computed at the same fixed input
[TABLE]
Note that since we’ve assumed that have distributions that are symmetric around $0$, we have
[TABLE]
Hence, since the weights of the two networks are identically distributed,
[TABLE]
On the other hand, the chain rule yields the following recursion for
[TABLE]
where
[TABLE]
and we’ve used that diagonal matrices commute. Note that a priori the matrices depend on the weights and biases for since . However, we will now verify the following claim about the collection of matrices and variables :
[TABLE]
and that moreover the collection is independent and that each is distributed like a diagonal matrix with independent diagonal entries taking the values of with probability Once we have proven this, since , then equation (13) shows that and , a diagonal -Bernoulli random variables independent of everything else. This will complete the proof of the present proposition since this is exactly the recurrence for the matrices .
To prove (14), we will use the fact that two random variables are independent if the distribution of given does not depend on the value of That is, (14) will follow once we show that for any fixed sequences that
[TABLE]
To check this equality, it suffices to show that given there is exactly one possible configuration for the variables for which the event occurs. The resulting probability then follows since are i.i.d. variables that each take the values with probability The proof is by induction: we will show that for each , given there is a unique configuration for the variables that leads to the event . When we have
[TABLE]
Recalling that for all , we see that for each there is a unique value of for which
[TABLE]
Then, for this value of since we have , there is a unique value of for which The proof of the inductive step is identical. Namely, suppose we have determined the values of for Then, given the weights and biases we have uniquely determined Then, given this value for , for every there is a unique value for which And finally, given this value for there is a unique value of so that This completes the proof. ∎
3 Proof of Theorem 1: Moment Estimates and Path Counting
3.1 Outline of Proof of Equation (6)
We begin by indicating the general plan for the proof of Equation (6) from Theorem 1, which consists of two steps. First, in Proposition 4 below, we express the expectation in (6) as a sum over -tuples of paths . The precise result is the following
Proposition 4 (Moments of as a sum over paths).
With the notation of Theorem 1, for each we have
[TABLE]
where , and is defined by:
[TABLE]
with denoting the number of unique entries in a tuple , being the multiplicity of edges appearing in the set as in (19), and a combinatorial factor given by (17), and , defined in (21) denoting a weight function that depends on the moments of the entries of the weights matrices .
Note that the definition of in equation (16) depends only on the collection of vertices and , the moments of the measure according to which the entries of the matrices are distributed, and the parameter The utility of equation (15) is that it is written as a product over this function of adjacent layers (rather than the whole path ), which will make it much easier to analyze.
The next step in the proof of the moment estimate (6) is to obtain upper and lower bounds for the expression in (15) that match up to corrections of size This is done in Section 3.4. The main idea here is to treat the sum (15) as an expectation where each , is chosen independently according to the uniform distribution on . The leading term in this expectation comes from the event that the entries are all distinct, which happens in layer with probability . When this happens, The subleading term comes from the event that has exactly one element that appears twice, with the others distinct. In each layer, the probability of this type of “collision” is and typically contributes when this happens. Hence, heuristically speaking, we have
[TABLE]
This is almost correct, except at the first layer, where the vector acts as a special initial condition and slightly deforms the term in this product when Section 3.4 makes this argument precise.
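The two events driving this heuristic, all entries distinct and exactly one value appearing twice, have elementary closed-form probabilities. The sketch below is an illustration with hypothetical names: it treats a generic tuple of length m drawn uniformly from n labels (the tuple length relevant to the proof is left as in the text), and it checks the closed forms against brute-force enumeration for a small case. For n much larger than m, the probabilities are close to 1 - C(m,2)/n and C(m,2)/n respectively.

```python
from itertools import product
from math import comb, perm

def prob_all_distinct(n, m):
    """P(m uniform draws from n labels are pairwise distinct)."""
    return perm(n, m) / n**m

def prob_one_collision(n, m):
    """P(exactly one label appears twice and all other draws are distinct)."""
    # Choose the two colliding positions, then m - 1 distinct labels in order.
    return comb(m, 2) * perm(n, m - 1) / n**m

def brute_force(n, m):
    """Enumerate all n**m tuples and count both events directly."""
    distinct = one_collision = 0
    for t in product(range(n), repeat=m):
        k = len(set(t))
        if k == m:
            distinct += 1
        elif k == m - 1:
            one_collision += 1
    return distinct / n**m, one_collision / n**m

print(prob_all_distinct(5, 3), prob_one_collision(5, 3), brute_force(5, 3))
print(prob_all_distinct(100, 4), 1 - comb(4, 2) / 100)
print(prob_one_collision(100, 4), comb(4, 2) / 100)
```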
3.2 Edge Sets, Multiplicities, and Paths
In this section we develop notation and basic results that clarify the “path counting” needed to prove Proposition 4 below. The main result of this section, needed for Proposition 4, is the enumeration of the set of paths in Lemma 8. We will use the following notational conventions:
- denote natural numbers
- For , we will denote by
[TABLE]
the collection of all unordered sets of directed edges in the complete bipartite graph of , which we think of as a directed graph with edges going from to . Note that some edges may appear multiple times: we consider them with multiplicity, thinking of as a multi-set (e.g. the directed edge can appear twice in ). To every edge set we will associate the edge multiplicity, by:
[TABLE]
We will also use the notation:
[TABLE]
Every edge set is uniquely defined by its multiplicity, and we will often find it more convenient to work with the multiplicities rather than the edge sets directly.
We will need to translate back and forth between and the multisets of its left and right endpoints. Specifically, for we define the multisets
[TABLE]
of right and left endpoints of counted with multiplicity. Conversely, given ordered sets of left, right endpoints , , we define the corresponding element of by its multiplicity
[TABLE]
This is the set one gets by drawing an edge between each entry of and the corresponding entry of and then forgetting the order in which the edges were drawn but remembering the multiplicity. Note that this map from ordered sets of left and right endpoints is many to one. This will come up in our computations, and to keep track of this, we make the following definition.
Definition 5.
Fix some edge set , with corresponding edge multiplicity and some -tuple so that as unordered multisets
[TABLE]
Define:
[TABLE]
Lemma 6.
is well defined. That is, the enumeration depends only on and not on the choice of stated in Definition 5. Moreover, has the following explicit formula in terms of multinomial coefficients:
[TABLE]
Proof.
To see that does not depend on note that for any , we have
[TABLE]
if and only if for some in the symmetric group on elements. Further, for any such , we have
[TABLE]
Thus, is a bijection between and for any permutation , proving that is indeed well-defined. To obtain the multinomial coefficient formula for , for each define the set of indices:
[TABLE]
and for every define for the multiset of entries of :
[TABLE]
With this notation, we have
[TABLE]
Thus, enumerating amounts to counting the number of ways the indices of can be arranged in order to satisfy (20). This is counted by multinomial coefficients, and the formula (18) then follows by standard enumeration principles. ∎
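The many-to-one nature of the map from ordered endpoint tuples to edge multisets can be made concrete on small examples. The sketch below is a hypothetical illustration and does not reproduce the exact definition or formula of the text, which are not shown here: for a fixed ordered tuple of left endpoints, it counts the distinct orderings of the right endpoints that produce the same edge multiset, once by brute force and once by a product of multinomial coefficients, one per left vertex. The function names are illustrative.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def edge_multiset(left, right):
    """Pair left[i] with right[i], forget the order, keep multiplicities."""
    return Counter(zip(left, right))

def count_right_orderings(left, right):
    """Brute force: distinct reorderings of `right` giving the same edge multiset."""
    target = edge_multiset(left, right)
    candidates = set(permutations(right))  # distinct orderings only
    return sum(1 for cand in candidates if edge_multiset(left, cand) == target)

def multinomial_count(left, right):
    """Product over left vertices of (out-degree)! / prod(edge multiplicity!)."""
    total = 1
    for degree in Counter(left).values():
        total *= factorial(degree)
    for multiplicity in edge_multiset(left, right).values():
        total //= factorial(multiplicity)
    return total

left, right = (1, 1, 2, 2), (1, 2, 1, 2)
print(count_right_orderings(left, right), multinomial_count(left, right))
```

In this example each left vertex has two outgoing edges with distinct targets, so each contributes a factor of 2, and both counts agree.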
Our path counting approach to proving Proposition 4, involves the combinatorics of certain paths decorated by the moments of measure according to which the entries of matrices are drawn. Accordingly, for each , we associate a weight to an edge multiplicity given in terms of the moments of the measure by:
[TABLE]
where the expectation is with respect to In the proof of Proposition 4 we will consider sequences of compatible edge sets in the sense of the following definition.
Definition 7.
Let and let . Let denote the set of edge sequences which satisfy:
[TABLE]
The second condition ensures the endpoints of the edges of one layer are compatible with the edges from the next layer. Further, define for each the set of ordered paths:
[TABLE]
Given define the edge sequence corresponding to by specifying the multiplicities
[TABLE]
The formula (24) below will be used in the proof of Proposition 4.
Lemma 8.
Let and let and . Consider and any edge sequence with
[TABLE]
Then, the number of ordered paths which have the same edge sequence as and have is given by:
[TABLE]
Proof.
The proof is by induction on When , the left hand side of (24) is precisely the number of so that has which by definition of , equals . Let us now suppose we have proved the statement for with By denoting , and counting the number of possibilities for with , we write the left hand side of (24) as the sum
[TABLE]
Note that since we find that coincides with the right endpoints . Hence, by the inductive hypothesis, every term appearing in the sum from equation (25) is equal to and does not depend on (since depends only on the right endpoints of and not on their order). The number of terms in the sum from equation (25) is exactly by the definition of . The total is therefore , completing the induction. ∎
3.3 Proof of Proposition 4
The first step in proving Proposition 4 is to express as a sum over certain collections of paths.
Definition 9.
Let be the set of tuples of paths:
[TABLE]
where was defined in (22). Our notation is that if , then is a -tuple for each .
Lemma 10.
For a -tuple , let be the number of unique elements in . Let Then:
[TABLE]
Proof.
Note that the entries of the matrix can be written as a sum over certain paths in , namely:
[TABLE]
Using this interpretation in terms of paths, we obtain by indexing the starting points as and the ending point as , that we can write as a sum over :
[TABLE]
Similarly, the -th power is then given by:
[TABLE]
The result of the lemma follows by taking expectation of both sides, using the independence of the random variables ’s, and relations
[TABLE]
∎
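The path interpretation used in this proof can be checked numerically. The following is a minimal sketch, assuming small illustrative layer widths and Gaussian entries (both are choices made here, not taken from the paper): it verifies that an entry of a product of matrices equals the sum, over all vertex paths through the layers, of the product of the matrix entries along the path.

```python
import itertools
import random

def matmul(A, B):
    """Plain nested-list matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def product_entry_via_paths(mats, start, end):
    """Entry (end, start) of mats[-1] @ ... @ mats[0] as a sum over paths.

    A path picks one vertex per layer, start = v_0, v_1, ..., v_d = end,
    and its weight is the product of the entries mats[i][v_{i+1}][v_i].
    """
    hidden_widths = [len(M) for M in mats[:-1]]  # row counts = intermediate layer sizes
    total = 0.0
    for hidden in itertools.product(*[range(n) for n in hidden_widths]):
        path = (start, *hidden, end)
        weight = 1.0
        for i, M in enumerate(mats):
            weight *= M[path[i + 1]][path[i]]
        total += weight
    return total

rng = random.Random(0)
dims = [3, 4, 2, 3]  # illustrative layer widths n_0, ..., n_3
mats = [[[rng.gauss(0.0, 1.0) for _ in range(dims[i])] for _ in range(dims[i + 1])]
        for i in range(len(dims) - 1)]

# Direct product mats[2] @ mats[1] @ mats[0].
full = mats[0]
for M in mats[1:]:
    full = matmul(M, full)

# Largest discrepancy between the direct product and the path expansion.
max_err = max(abs(full[b][a] - product_entry_via_paths(mats, a, b))
              for a in range(dims[0]) for b in range(dims[-1]))
```

The discrepancy `max_err` is at the level of floating-point rounding, as expected from expanding the matrix product entry by entry.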
Definition 11.
Since the law of the entries of the matrices is assumed to be symmetric around zero, the odd moments of are all zero, and it will be useful to consider only edge sets that are “even” in the following sense:
[TABLE]
as well as to define the related sets
[TABLE]
Lemma 12.
With the same notation as in Lemma 10, we have
[TABLE]
Proof.
Because the variables are symmetric around zero, all their odd moments vanish. Thus, in the expression (26), only collections of paths in which every edge is traversed an even number of times give a non-zero contribution. What remains are exactly paths from by the definition in equation (27). ∎
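The parity argument can be made concrete with exact arithmetic. The sketch below assumes, purely for illustration, entries drawn uniformly from a small finite support that is symmetric about zero; the function names and the encoding of a path as a tuple of vertices are hypothetical. By independence, the expected weight of a collection of paths factorizes over edges into moments indexed by edge multiplicity, so a single edge of odd multiplicity forces the expectation to vanish.

```python
from collections import Counter
from fractions import Fraction

def moment(support, k):
    """Exact k-th moment of the uniform distribution on a finite support."""
    return sum(Fraction(x) ** k for x in support) / len(support)

def expected_path_weight(paths, support):
    """E[product of matrix entries along all paths] for independent entries.

    Each path is a tuple of vertices (v_0, ..., v_d); the edge used in layer i
    is (i, v_i, v_{i+1}). Independence lets the expectation factorize into
    moments indexed by how many times each edge is traversed.
    """
    multiplicity = Counter()
    for path in paths:
        for layer in range(len(path) - 1):
            multiplicity[(layer, path[layer], path[layer + 1])] += 1
    result = Fraction(1)
    for count in multiplicity.values():
        result *= moment(support, count)
    return result

symmetric_support = [-2, -1, 1, 2]          # symmetric about zero (illustrative)

even_collection = [(0, 1, 0), (0, 1, 0)]    # every edge is traversed twice
odd_collection = [(0, 1, 0), (0, 2, 0)]     # every edge is traversed once

even_value = expected_path_weight(even_collection, symmetric_support)
odd_value = expected_path_weight(odd_collection, symmetric_support)
```

Here `odd_value` is exactly zero, while `even_value` is the product of two second moments.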
Proof of Proposition 4.
Recall the definition of the edge sequences and the notation for paths from Definition 7 (In this proof, we will use this definition when for paths and when for paths ). Fix any . Let be with the entries doubled. For any function of edge sequences, , (it will be more convenient to write , thinking of as a function of the multiplicities of the edge set), consider the following identity for sums over that end at :
[TABLE]
Here denotes doubling all the edges (i.e. the multiplicities double ) and we have used the fact that every even edge sequence arises by taking a sequence and doubling the multiplicity of the edges. (Note that there may be multiple choices of for each , which is why we have to divide by the size of this set to account for this many-to-one-ness.) We now apply Lemma 8 to both the numerator (with ) and the denominator (with ) to see that the enumeration depends only on the edge set and the endpoints of the last layer :
[TABLE]
Summing over all possible endpoints now gives the identity:
[TABLE]
Finally, using this identity on (29), with being the function that appears inside the sum over , gives the desired result of Proposition 4. ∎
3.4 Completion of Proof of Equation (6)
Definition 13.
We think of the sum in Proposition 4 as an expectation over discrete random variables . Specifically, we write:
[TABLE]
where is defined to be the expectation with respect to a product measure on sequences , in which the entries of are chosen i.i.d. from the measure , (i.e. for every ; this is a probability measure since is a unit vector), and the entries of are chosen i.i.d. from the uniform measure on for every (i.e. for any , ). In order to prove that the rightmost product in (31) equals the right hand side of (6), we introduce some notation. Namely, for we partition the set into three pieces:
[TABLE]
Informally, stands for “unique entries”, and consists of those -tuples with no repeated entries; stands for “one pair” and consists of those -tuples with exactly one repeated entry; stands for “bad” and consists of everything else. Formally,
[TABLE]
Lemma 14.
For each , under the uniform measure on , each random variable has the following probabilities for the events ,,:
[TABLE]
Proof.
The proof is an elementary exercise in discrete probability. ∎
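A brute-force enumeration shows the kind of computation involved. This is a sketch under stated assumptions: the tuple length `k`, the alphabet size `n`, and the labels `"U"`, `"P"`, `"B"` are hypothetical names chosen here, and "one pair" is read as "exactly one value appears twice and every other value appears once". The closed forms in the code are standard counting identities, not a transcription of the display in Lemma 14.

```python
from fractions import Fraction
from itertools import product
from math import comb, perm

def classify(t):
    """'U': all entries distinct; 'P': exactly one value appears twice and
    all others once; 'B': everything else."""
    distinct = len(set(t))
    if distinct == len(t):
        return "U"
    if distinct == len(t) - 1:
        return "P"
    return "B"

def exact_probabilities(n, k):
    """Exact class probabilities for a uniform k-tuple with entries in range(n)."""
    counts = {"U": 0, "P": 0, "B": 0}
    for t in product(range(n), repeat=k):
        counts[classify(t)] += 1
    return {key: Fraction(c, n ** k) for key, c in counts.items()}

n, k = 6, 4
probs = exact_probabilities(n, k)

# Closed forms: choose k distinct values in order; or choose the paired
# positions, then k - 1 distinct values in order.
closed_U = Fraction(perm(n, k), n ** k)
closed_P = Fraction(comb(k, 2) * perm(n, k - 1), n ** k)
```

For fixed `k` and large `n`, `closed_U` tends to one while `closed_P` is of order `1/n`, with the remaining class of smaller order still.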
Lemma 15.
Subdivide the “one pair” set by which indices are paired: where for . Then for each we have
[TABLE]
where the implicit constant in is bounded below by and above by
Proof.
This is an elementary calculation from the definition of . If , and the multiplicities of edges in the edge set are all which makes the combinatorial factor in equal to , and every edge is covered exactly twice, giving a factor of in the weight term. If , giving a factor of . Moreover, in this case, when the indices which are paired in are also paired in , all the combinatorial factors are again , and the weight term is . If the paired indices from are not paired in , then the combinatorial term is , and the weight term is . ∎
Lemma 16.
We have
[TABLE]
where the quantities are:
[TABLE]
Proof.
Note that for any fixed and , we have the following conditional independence of layers before and after :
[TABLE]
where in the second term we write instead of since the measure no longer depends on Applying this with we find
[TABLE]
where in the second term the random variables are uniform and do not depend on or . An elementary probability computation using Lemma 15 and the measure on shows that
[TABLE]
where the implicit constant in the last term is bounded below by and above by Combining the result of Lemma 14 with (33) and (34) proves Lemma 16.
∎
Lemma 17.
Recall the definition of $T_{*} \stackrel{\Delta}{=} \mathcal{E}\left[\prod_{i=2}^{d} C(V(i-1),V(i)) \,\big|\, V(1)\in *_{n_{1}}\right]$ for from Lemma 16. Define the indicator functions
[TABLE]
Then, for any choice of the label , we have that:
[TABLE]
where we’ve introduced
[TABLE]
Proof of Lemma 17.
By using the possible values for computed in Lemma 15 and the definition of , we have that for any label :
[TABLE]
where we’ve abbreviated
[TABLE]
Note that
[TABLE]
This proves the upper bound in (35). The lower bound similarly follows:
[TABLE]
∎
Completion of Proof of Relation (6).
We first notice, by application of the elementary probability estimate recorded in Lemma 18, that the upper and lower bounds on given in Lemma 17 are equal up to small errors. We have for :
[TABLE]
(where is as in Lemma 17). Finally, putting these values for into the result of Lemma 14 we see:
[TABLE]
The last line follows from the elementary fact for exponentials for . ∎
3.5 An elementary probability estimate
Lemma 18.
Let be independent events with probabilities and be independent events with probabilities such that
[TABLE]
Denote by the indicator that the event happens, , and by the indicator that happens, . Further, fix for every some as well as . Define
[TABLE]
Then, if for every , we have:
[TABLE]
where by convention In contrast, if for every , we have:
[TABLE]
Proof of Lemma 18.
The proof goes by induction on . The base case can be computed directly
[TABLE]
which is verified to obey the stated inequalities under the convention . To see the induction step, suppose that . Define the filtration . We have, from the definition of that
[TABLE]
We compute by directly examining what happens when and when , that
[TABLE]
Now notice that, since vanishes when , and since , we have that
[TABLE]
Hence, when , since for every , we have the estimate
[TABLE]
and hence obtain from equation (39)
[TABLE]
which is the desired inequality to prove the induction step for the upper bound. To see the lower bound, we will actually prove the lower bound for the sequence
[TABLE]
This is what one gets if all the parameters are equal to , so clearly and it is sufficient to bound this new sequence. Notice that since , we have , so applying equation (40) to this sequence gives that:
[TABLE]
so by equation (39) applied to the sequence, we have then (keeping in mind reverses the inequality):
[TABLE]
which is the desired inequality for the induction step on the lower bound. ∎
4 Proof of Theorem 1: Quantitative Martingale CLT
In this section, we explain the proof of the distribution estimates in equation (7) in Theorem 1, modulo the proof of several key technical results, which are proved in Sections 4.1 and 4.2 below.
We first recall the notation. Namely, fix and consider a fixed measure satisfying (3). For every take independent random matrices with all the entries of drawn i.i.d. from and for each , consider diagonal matrices where are i.i.d. valued Bernoulli variables . The key objects of study are, for , the random matrices
[TABLE]
The estimates (7) concern the distribution, for any fixed unit vector , of
[TABLE]
Notice that the sequence is equivalently defined recursively as:
[TABLE]
With this notation the relation (7) we seek to show becomes the statement that for every
[TABLE]
where is the Gaussian, and
[TABLE]
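The statement can be explored with a small Monte Carlo experiment. The sketch below makes several assumptions that are not taken from the text: standard Gaussian entries, equal layer widths `n`, Bernoulli(1/2) masks, and the scaling `sqrt(2/n)` per layer. It draws the log squared norm of the product applied to a fixed unit vector, standardizes the draws with their empirical mean and standard deviation, and measures the Kolmogorov-Smirnov distance to a standard Gaussian.

```python
import math
import random

def normal_cdf(t):
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ks_to_standard_normal(sample):
    """sup_t |F_n(t) - Phi(t)|, where F_n is the empirical CDF of the sample."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        c = normal_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        worst = max(worst, abs(c - i / n), abs((i + 1) / n - c))
    return worst

def log_sq_norm_of_product(n, d, rng):
    """log ||X_d ... X_1 u||^2 with X_i = sqrt(2/n) D_i W_i (assumed scaling).

    Returns None if some layer's mask kills the whole vector.
    """
    x = [1.0 / math.sqrt(n)] * n  # a fixed unit vector u
    for _ in range(d):
        y = []
        for _ in range(n):
            if rng.random() < 0.5:  # Bernoulli(1/2) diagonal entry of D_i
                y.append(math.sqrt(2.0 / n) * sum(rng.gauss(0.0, 1.0) * v for v in x))
            else:
                y.append(0.0)
        x = y
        if not any(x):
            return None
    return math.log(sum(v * v for v in x))

rng = random.Random(0)
n, d, trials = 16, 16, 400
draws = [log_sq_norm_of_product(n, d, rng) for _ in range(trials)]
draws = [v for v in draws if v is not None]  # drop runs where the vector vanished

mean = sum(draws) / len(draws)
std = math.sqrt(sum((v - mean) ** 2 for v in draws) / (len(draws) - 1))
ks = ks_to_standard_normal([(v - mean) / std for v in draws])
```

Because the mean and variance are estimated from the same draws, `ks` is only a rough diagnostic and not the distance to the specific Gaussian in the theorem. Discarding the runs in which the vector vanishes is a simplification of the event-based treatment the proof uses for the same degenerate case.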
The idea of the proof is to look at the quantity as the value of a martingale at time with respect to the filtration
[TABLE]
i.e. the sigma algebra generated by the random variables in the first layers. We then deduce the approximate normality of by applying a martingale CLT with rate (see Theorem 23). Specifically, note that , since is a unit vector. Hence, , is a telescoping sum (modulo the complication discussed below that could vanish):
[TABLE]
and we will think of each entry of the sum as an increment. By subtracting off the conditional means, this will yield a martingale difference sequence which can be analyzed. It will turn out that the variances of these increments satisfy:
[TABLE]
For , we will typically have that
[TABLE]
and therefore the term involving the fourth moment will be of size for all except the first layer when . The sum of these increment variances is precisely our variance parameter (modulo terms like ). This informally explains the appearance of in the formula for , and why the terms from other layers do not depend on the higher moments of .
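The telescoping decomposition itself is easy to verify numerically. This is a minimal sketch, assuming Gaussian weights, Bernoulli(1/2) masks, a `sqrt(2/n_in)` scaling, and equal widths (all illustrative choices made here): the log squared norm of the final vector equals the sum of the per-layer log ratios of squared norms when the input is a unit vector.

```python
import math
import random

def layer(x, n_out, rng):
    """One step x -> sqrt(2/n_in) * D W x with Gaussian W and a Bernoulli(1/2)
    diagonal mask D. The scaling and the Gaussian entries are assumptions."""
    n_in = len(x)
    scale = math.sqrt(2.0 / n_in)
    out = []
    for _ in range(n_out):
        keep = rng.random() < 0.5
        pre = sum(rng.gauss(0.0, 1.0) * xj for xj in x)
        out.append(scale * pre if keep else 0.0)
    return out

def sq_norm(x):
    return sum(v * v for v in x)

rng = random.Random(1)
widths = [30, 30, 30, 30, 30]                    # illustrative layer widths
x = [1.0 / math.sqrt(widths[0])] * widths[0]     # unit input vector

increments = []
for n_out in widths[1:]:
    y = layer(x, n_out, rng)
    # Per-layer increment: log of the ratio of consecutive squared norms.
    increments.append(math.log(sq_norm(y) / sq_norm(x)))
    x = y

telescoped = sum(increments)
direct = math.log(sq_norm(x))
```

The two quantities `telescoped` and `direct` agree up to rounding. With width 30 the chance that a mask zeroes an entire layer is negligible, so this sketch does not handle that case.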
To give a precise proof of (42), we must deal with a wrinkle in the strategy described above: with a small but positive probability the vectors , making the ratio of the norms of the vectors in (43) undefined. Since the weight matrices are assumed to have no atoms, this can only happen if the Bernoulli variables are all equal to zero. To take this into account, we define the events
[TABLE]
where we’ve abbreviated
[TABLE]
In addition, we will find it convenient to fix a truncation level and set
[TABLE]
We will study the sequence of martingale increments
[TABLE]
that coincide, with high probability, with the martingale difference sequence associated to (see Lemma 22), where by convention we define the product to be zero on the event when To prove the approximate normality of we first prove the approximate normality of in the following Proposition.
Proposition 19.
We have that:
[TABLE]
Moreover, for any fixed , the sum is approximately normally distributed in the sense that
[TABLE]
We prove Proposition 19 in Section 4.1 below. The next result shows that the sum of the conditional expectations in contributes a constant up to errors of the form
Proposition 20.
For any fixed , we have
[TABLE]
where is a random variable satisfying
[TABLE]
Proposition 20 follows from Proposition 28 below. To combine Propositions 19 and 20, we will need the following simple result about perturbations under the -distance.
Lemma 21 (Properties of ).
If is centered Gaussian with variance , is any random variable, and is a positive random variable then there is a universal constant so that we have:
[TABLE]
For any , there exists a constant so that
[TABLE]
Further, if are any two random variables on the same probability space, then:
[TABLE]
Combining Proposition 19, Proposition 20, and Lemma 21, we obtain
[TABLE]
Finally, combining the following estimate with (49) completes the proof of Theorem 1.
Lemma 22.
For any fixed and any we have
[TABLE]
4.1 Proof of Proposition 19
In the proof of Proposition 19, we will use the notation
[TABLE]
and we will say that a random variable is if there exists independent of so that almost surely. The constant may depend on the moments of the random variable and , which we think of as fixed. To conclude the approximate normality (46) we will use the following theorem.
Theorem 23 (Special Case of Martingale CLT with Rate [13]).
Suppose that is a martingale difference sequence with respect to a filtration . Then
[TABLE]
The following Proposition allows us to control the second and fourth moments of appearing in (51).
Proposition 24.
For any , we have that the conditional 2nd and 4th moments of are:
[TABLE]
Moreover, for any and any ,
[TABLE]
We will prove Proposition 24 in Section 4.1.1 below. To complete the proof of Proposition 19, note that Proposition 24 yields
[TABLE]
Hence, in particular,
[TABLE]
Thus, (46) follows from the previous line together with (53), (48), and
[TABLE]
To prove this bound, we begin by using Proposition 24 to establish two inequalities which hold for any fixed :
[TABLE]
Now notice that if is another index so that , then if we take the -conditional expectation of equation (56), we have by using (54) to bound the expectation of (along with the elementary fact for positive random variables) and the fact that that:
[TABLE]
With these inequalities in hand, we proceed by expanding the square as follows:
[TABLE]
The diagonal terms, when , are bounded since by the bound in equation (57). In the remaining off-diagonal terms, by first taking the -conditional expectation out via the tower property, we have by the inequality (58) that:
[TABLE]
Finally, summing all the bounds for diagonal and off-diagonal entries we see that this entire numerator from equation (51) is bounded by , which proves (55) and completes the proof of Proposition 19 modulo checking Proposition 24.
4.1.1 Proof of Proposition 24
We begin by establishing some preliminary results.
Lemma 25.
Let be two layer widths, and let be any non-zero fixed vector. Let , be the diagonal Bernoulli matrix, and be the weight matrix whose entries are iid for every . Then:
[TABLE]
Moreover, with the same setup as above, the following error estimates hold uniformly over all non-zero vectors (i.e. the constants in the errors depend only on the moments of and on ):
[TABLE]
Proof.
Note that
[TABLE]
where are independent and
[TABLE]
where denotes the row of Since the entries of are i.i.d. with mean zero and variance , we have . Hence for each and we conclude
[TABLE]
proving the first relation (59). Since each is mean , the relations (60) follow by standard estimates of the moments of a sum of i.i.d. centered random variables. Finally, to check the second relation in (59), we write
[TABLE]
Moreover,
[TABLE]
By direct evaluation, using that , we find
[TABLE]
Hence, we find
[TABLE]
as claimed. ∎
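The mechanism behind the first moment computation, namely that the squared norm is a sum over rows of independent masked terms, can be checked exactly in a tiny case. The sketch below assumes unnormalized matrices, Bernoulli(1/2) masks, and entries uniform on a small symmetric support with unit variance; these choices and the function name are illustrative and do not reproduce the normalization in (59). Under these assumptions the exact mean of the squared norm is (number of rows) × (1/2) × (squared norm of the input).

```python
from fractions import Fraction
from itertools import product

def exact_mean_sq_norm(u, n_out, support):
    """Exact E||D W u||^2, enumerating every Bernoulli(1/2) diagonal mask D
    and every matrix W with i.i.d. entries uniform on `support`."""
    n_in = len(u)
    total = Fraction(0)
    count = 0
    for mask in product([0, 1], repeat=n_out):
        for entries in product(support, repeat=n_out * n_in):
            sq = Fraction(0)
            for j in range(n_out):
                row = entries[j * n_in:(j + 1) * n_in]
                # Row j contributes mask_j * (row . u)^2 to the squared norm.
                sq += mask[j] * sum(w * x for w, x in zip(row, u)) ** 2
            total += sq
            count += 1
    return total / count

u = [Fraction(3, 5), Fraction(4, 5)]  # a unit vector with rational coordinates
value_two_rows = exact_mean_sq_norm(u, n_out=2, support=[-1, 1])
value_three_rows = exact_mean_sq_norm(u, n_out=3, support=[-1, 1])
```

The cross terms in each row vanish because the entries are centered and independent, which is why only the variance of the entries and the norm of the input survive.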
The following corollary immediately yields (54).
Corollary 26.
For any and uniformly over all non-zero vectors , we have:
[TABLE]
Proof.
The tail estimate follows by using the Chebyshev inequality and (60). The bound on the expectation is obtained as follows:
[TABLE]
∎
To complete the proof of Proposition 24, it remains to check (52) and (53). To do this, we begin with the following observation.
Lemma 27.
Let . Suppose is a non-negative random variable. Then there are absolute constants (here we let refer to a generic constant which may change value from line to line) so that for any :
[TABLE]
Proof.
The proof is an elementary exercise in the Taylor series expansion for applied to points in the interval on which the derivatives of are bounded. ∎
Lemma 27 together with Lemma 25 gives the following information on the conditional moments of which directly allows us to conclude (52) and (53) and hence complete the proof of Proposition 24.
Proposition 28.
Recall the vectors , their norms and normalized norms, and defined in (45) and (50). We have for each
[TABLE]
Proof.
On the event , both sides of the equation are zero and the equality trivially holds. Therefore we have only to consider what happens on the event , where . By equation (41), we have that
[TABLE]
Since we are conditioning on the sigma algebra , we may think of as a fixed vector and apply Lemma 25. To make the equations easier to read, we use the shorthand and we write to mean . Then we have
[TABLE]
By Lemma 25 and Lemma 27, all the terms in equation (63) are which gives the result. A similar argument, combining the moment calculations from Lemma 25 and the series expansion estimates from Lemma 27 in the natural way, gives the higher moments of . ∎
4.2 Facts about KS distance - Proof of Lemma 21
Proof of Lemma 21.
Let us use the notation Since
[TABLE]
we may assume without loss of generality that when proving (47) and (48). We begin by checking (47). We will show that for every , we have that
[TABLE]
from which (47) follows by taking to optimize the inequality. To begin, by considering the random variables all on the same probability space, we have the inequality:
[TABLE]
We now claim that:
[TABLE]
This is proven by examining the two possibilities of the absolute value, and by using the inclusions and respectively for the two cases. In the first case, consider:
[TABLE]
The inequality for is analogous, and equation (66) follows. Equation (64) then follows by combining equations (65) and (66), Markov’s inequality , and the standard fact about Gaussian random variables .
To verify (48), note that there exists a constant so that for every
[TABLE]
Hence,
[TABLE]
Finally to show (49), we have
[TABLE]
This completes the proof. ∎
4.3 Proof of Lemma 22
We have
[TABLE]
As in the proof of Proposition 28, on the event we have that
[TABLE]
where are i.i.d. random variables each equal in distribution to where is a fixed unit vector. In particular, by Lemma 25, and the higher moments of are uniformly bounded in terms of the moments of and Consequently, for each there exists depending only on the moments of and so that
[TABLE]
Hence, by the Markov inequality:
[TABLE]
Hence, by using a union bound and the estimate on the probability in (44), we have for each
[TABLE]
as desired.
References
- [1] G. Akemann, Z. Burda, and M. Kieburg. Universal distribution of Lyapunov exponents for products of Ginibre matrices. Journal of Physics A Mathematical General, 47:395202, October 2014.
- [2] G. Akemann, Z. Burda, and M. Kieburg. From Integrable to Chaotic Systems: Universal Local Statistics of Lyapunov exponents. ArXiv e-prints, page arXiv:1809.05905, September 2018.
- [3] G. Akemann and J. R. Ipsen. Recent Exact and Asymptotic Results for Products of Independent Random Matrices. Acta Physica Polonica B, 46:1747, 2015.
- [4] G. W. Anderson, A. Guionnet, and O. Zeitouni. An Introduction to Random Matrices. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2009.
- [5] F. Comets, G. R. Moreno Flores, and A. Ramirez. Random polymers on the complete graph. arXiv e-prints, page arXiv:1707.01588, July 2017.
- [6] J. Cotler, G. Gur-Ari, M. Hanada, J. Polchinski, P. Saad, S. H. Shenker, D. Stanford, A. Streicher, and M. Tezuka. Black holes and random matrices. Journal of High Energy Physics, 2017(5):118, 2017.
- [7] P. Deift. Some Open Problems in Random Matrix Theory and the Theory of Integrable Systems. II. SIGMA, 13:016, March 2017.
- [8] P. Forrester. Asymptotics of finite system Lyapunov exponents for some random matrix ensembles. Journal of Physics A: Mathematical and Theoretical, 48(21):215205, 2015.
