Products of Many Large Random Matrices and Gradients in Deep Neural Networks

Boris Hanin; Mihai Nica

arXiv:1812.05994·math.PR·January 29, 2020

TL;DR

This paper analyzes the behavior of products of large random matrices and their impact on gradients in deep neural networks, revealing Gaussian fluctuations and providing insights into gradient stability issues.

Contribution

It introduces a new asymptotic Gaussian limit for the log-norm of matrix products and applies this to quantify gradient stability in deep neural networks.

Findings

1. Logarithm of matrix product norms is asymptotically Gaussian.

2. Explicit error bounds for moments and Gaussian approximation.

3. Quantitative assessment of gradient explosion and vanishing in neural networks.

Equations

$$M^{(d)} = M^{(d)}(n_{0},\ldots,n_{d}) \stackrel{\Delta}{=} X^{(d)}\cdots X^{(1)},\qquad X^{(i)}\in\mathrm{Mat}(n_{i},n_{i-1}).$$

$$X^{(i)} \stackrel{\Delta}{=} (p\,n_{i-1})^{-\frac{1}{2}} D^{(i)} W^{(i)}$$

$$D^{(i)} = \mathrm{Diag}\left(\xi_{j}^{(i)},\ j=1,\ldots,n_{i}\right)\in\mathrm{Mat}(n_{i},n_{i}),\qquad \xi_{j}^{(i)}\sim\mathrm{Bernoulli}(p)\ \text{i.i.d.},$$

$$\begin{aligned} &\text{(i) normalization: } \mathbb{E}\big[W_{a,b}^{(i)}\big]=0,\quad \mathbb{E}\big[(W_{a,b}^{(i)})^{2}\big]=1 \\ &\text{(ii) symmetry around } 0\text{: } W_{a,b}^{(i)}\stackrel{d}{=}-W_{a,b}^{(i)} \\ &\text{(iii) finite moments: } \forall k\in\mathbb{N},\ \mathbb{E}\big[(W_{a,b}^{(i)})^{k}\big]\stackrel{\Delta}{=}\mu_{k}<\infty \\ &\text{(iv) no atoms: } \mathbb{P}\big(W_{a,b}^{(i)}=x\big)=0\ \ \forall x\in\mathbb{R}. \end{aligned}$$

$$\big(\mathrm{Jac}^{(d)}\big)^{T}\mathrm{Jac}^{(d)} \stackrel{d}{=} \big(M^{(d)}\big)^{T}M^{(d)},$$

$$Z_{d}(\vec{u}) \stackrel{\Delta}{=} \frac{n_{0}}{n_{d}}\left\|M^{(d)}\vec{u}\right\|^{2},\qquad \vec{u}\in\mathbb{R}^{n_{0}},\quad \left\|\vec{u}\right\|=1.$$

$$\ln(Z_{d}(\vec{u})) = \ln\left(\frac{n_{0}}{n_{d}}\left\|M^{(d)}\vec{u}\right\|^{2}\right)$$

$$\beta \stackrel{\Delta}{=} \left(\frac{3}{p}-1\right)\sum_{i=1}^{d}\frac{1}{n_{i}} + \frac{\mu_{4}-3}{p\,n_{1}}\left\|\vec{u}\right\|_{4}^{4}.$$

$$\frac{n_{0}}{n_{d}}\left\|M^{(d)}\vec{u}\right\|_{2}^{2} \approx \exp\left(\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right).$$

$$d_{\mathrm{KS}}\left(\ln\left(\frac{n_{0}}{n_{d}}\left\|M^{(d)}\vec{u}\right\|_{2}^{2}\right),\,\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right) = O\left(\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right)^{1/5},$$

$$\mathbb{E}\left[\frac{n_{0}^{k}}{n_{d}^{k}}\left\|M^{(d)}\vec{u}\right\|_{2}^{2k}\right] = \exp\left[\binom{k}{2}\beta + O\left(\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right)\right] = \mathbb{E}\left[\left(\exp\left(\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right)\right)^{k}\right] + O\left(\beta^{-1}\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right),$$

$$O\left(\beta^{-1}\sum_{i=1}^{d-1}n_{i}^{-2} + \left(\beta^{-2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/5} + \left(\beta^{-1/2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/2} + \sum_{i=1}^{d}n_{i}^{-m} + \sum_{i=1}^{d}p^{n_{i}}\right)$$

$$d_{\mathrm{KS}}\left(\ln\left(\frac{n_{0}}{n_{d}}\left\|M^{(d)}\vec{u}\right\|_{2}^{2}\right),\,\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right) = O\left(\beta^{-1}\sum_{i=1}^{d-1}n_{i}^{-2} + \left(\beta^{-2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/5} + \left(\beta^{-1/2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/2}\right)$$

$$d\to\infty,\qquad n_{i}=n_{i}(d)\to\infty,\qquad 0<\limsup_{d\to\infty}\sum_{i=1}^{d}\frac{1}{n_{i}(d)}<\infty,$$

$$Z_{d}(\vec{u}) \stackrel{d}{=} \prod_{i=1}^{d}\frac{\chi_{n_{i}}^{2}}{n_{i}},$$

$$\lim_{\min n_{i}\to\infty}\ln(Z_{d}(\vec{u})) = 0\quad\text{almost surely}.$$

$$\lim_{d\to\infty}\ln(Z_{d}(\vec{u})) = -\infty\quad\text{almost surely}.$$

$$\lim_{d\to\infty}\frac{1}{d}\ln(Z_{d}(\vec{u})) = \mathbb{E}\left[\log\left\|M^{(1)}\vec{u}\right\|^{2}\right]\quad\text{almost surely}.$$

$$\lim_{d\to\infty}\ \lim_{n_{1},\ldots,n_{d}\to\infty}\ln(Z_{d}(\vec{u})) \neq \lim_{n_{1},\ldots,n_{d}\to\infty}\ \lim_{d\to\infty}\ln(Z_{d}(\vec{u})),$$

$$\lim_{\substack{d\to\infty \\ n_{i}=2\beta^{-1}d}}\ln(Z_{d}(\vec{u})) = \mathcal{N}\left(-\beta/2,\,\beta\right),$$

$$\lambda_{\max} = \lim_{d\to\infty}\frac{1}{d}\log\left\|M^{(d)}\right\|_{\ell_{2}\to\ell_{2}},$$

$$h(\lambda) = \begin{cases} 2\lambda, & 0<\lambda<1 \\ 0, & \text{otherwise}. \end{cases}$$

$$\left\|M^{(d)}\vec{u}\right\|^{2} = \sum_{j=1}^{n_{0}}\sigma_{j}^{2}\,\langle\vec{u},\vec{v}_{j}\rangle^{2}.$$

$$\log\left(\left\|M^{(d)}\vec{u}\right\|^{2}\right) \approx \log\left(\frac{1}{n_{0}}\sum_{j=1}^{n_{0}}\sigma_{j}^{2}\right).$$

$$Z(d) = \sum_{\pi\in\{1,\ldots,n\}^{d}}\exp\left(\frac{1}{T}\eta(i,\pi(i))\right),$$

$$\mathcal{N}(x) = \operatorname{ReLU}\circ A^{(d)}\circ\cdots\circ\operatorname{ReLU}\circ A^{(1)}(x),$$

$$A^{(i)}(x) = W^{(i)}x + \vec{B}^{(i)},\qquad W^{(i)}\in\mathrm{Mat}(n_{i},n_{i-1}),\quad \vec{B}^{(i)}\in\mathbb{R}^{n_{i}},$$

$$\operatorname{ReLU}(\vec{v}) = \left(\max\{0,v_{1}\},\ldots,\max\{0,v_{m}\}\right)\in\mathbb{R}^{m}.$$

$$\mathrm{Act}^{(j)} \stackrel{\Delta}{=} \operatorname{ReLU}(\mathrm{act}^{(j)}),\qquad \mathrm{act}^{(j)} \stackrel{\Delta}{=} A^{(j)}\circ\operatorname{ReLU}\circ\cdots\circ\operatorname{ReLU}\circ A^{(1)}(\mathrm{Act}^{(0)})$$

$$W \longleftarrow W - \lambda\frac{\partial\mathcal{L}}{\partial W},$$

Full text

Boris Hanin (Department of Mathematics, Texas A&M; [email protected])

Mihai Nica (Department of Mathematics, University of Toronto; [email protected])

Abstract

We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the $\ell_{2}$ norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the number of matrices tend to infinity. Depending on the scaling limit considered, the mean and variance of the limiting Gaussian depend only on either the first two or the first four moments of the measure from which matrix entries are drawn. We also obtain explicit error bounds on the moments of the norm and the Kolmogorov-Smirnov distance to a Gaussian. Finally, we apply our result to obtain precise information about the stability of gradients in randomly initialized deep neural networks with ReLU activations. This provides a quantitative measure of the extent to which the exploding and vanishing gradient problem occurs in a fully connected neural network with ReLU activations and a given architecture.

1 Introduction

Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks [28] to the physics of black holes [6], random dynamical systems [26], and recently to the numerical stability of randomly initialized neural networks [14, 24]. In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention [2, 19] but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).

To explain our results, let $d\in\mathbb{N}$ be a positive integer and let $n_{0},\ldots,n_{d}\in\mathbb{N}$ be a list of positive integers. We are concerned with the (non-asymptotic) analysis of products of $d$ independent rectangular random matrices of sizes $n_{i}\times n_{i-1}$ with real entries:

$${M}^{(d)} = {M}^{(d)}(n_{0},\ldots,n_{d}) \stackrel{\Delta}{=} {X}^{(d)}\cdots{X}^{(1)},\qquad {X}^{(i)}\in\mathrm{Mat}(n_{i},n_{i-1}). \tag{1}$$

The specific matrix ensembles we study depend on a parameter $p\in(0,1]$ and a distribution $\mu$ on $\mathbb{R}$ . We define:

$${X}^{(i)} \stackrel{\Delta}{=} (p\,n_{i-1})^{-\frac{1}{2}}\,{D}^{(i)}{W}^{(i)} \tag{2}$$

where ${D}^{(i)}$ are $n_{i}\times n_{i}$ diagonal matrices

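$${D}^{(i)} = \mathrm{Diag}\left(\xi_{j}^{(i)},\ j=1,\ldots,n_{i}\right)\in\mathrm{Mat}(n_{i},n_{i}),\qquad \xi_{j}^{(i)}\sim\mathrm{Bernoulli}(p)\ \text{i.i.d.},$$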

and ${W}^{(i)}\in\mathrm{Mat}(n_{i},n_{i-1})$ are independent $n_{i}\times n_{i-1}$ random matrices for which the entries $W_{a,b}^{(i)}$ are drawn i.i.d. from a fixed distribution $\mu$ on ${\mathbb{R}}$ satisfying the following four conditions:

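$$\begin{aligned} &\text{(i) normalization: } \mathbb{E}\big[W_{a,b}^{(i)}\big]=0,\quad \mathbb{E}\big[(W_{a,b}^{(i)})^{2}\big]=1 \\ &\text{(ii) symmetry around } 0\text{: } W_{a,b}^{(i)}\stackrel{d}{=}-W_{a,b}^{(i)} \\ &\text{(iii) finite moments: } \forall k\in\mathbb{N},\ \mathbb{E}\big[(W_{a,b}^{(i)})^{k}\big]\stackrel{\Delta}{=}\mu_{k}<\infty \\ &\text{(iv) no atoms: } \mathbb{P}\big(W_{a,b}^{(i)}=x\big)=0\ \ \forall x\in\mathbb{R}. \end{aligned}$$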

When $p=1,$ the matrices ${D}^{(i)}$ are the identity. In contrast, when $p=1/2$ , the matrix product $M^{(d)}$ naturally arises in connection to the input-output Jacobian matrix ${\mathop{\mathrm{Jac}}}^{(d)}$ for neural nets with $\operatorname{ReLU}$ nonlinearity and $d$ layers with widths $n_{0},\ldots,n_{d}$ initialized with random weights drawn from $\mu$ . In particular, the following equality in distribution holds when $p=\frac{1}{2}$ :

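$$\big({\mathop{\mathrm{Jac}}}^{(d)}\big)^{T}{\mathop{\mathrm{Jac}}}^{(d)} \stackrel{d}{=} \big({M}^{(d)}\big)^{T}{M}^{(d)},$$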

so that, when $p=1/2,$ the singular values of ${M}^{(d)}$ are equal in distribution to those of ${\mathop{\mathrm{Jac}}}^{(d)}.$ This is a consequence of Proposition 2 below, which opens the door to a rigorous study of the so-called exploding and vanishing gradient problem for $\operatorname{ReLU}$ nets at finite depth and width. This refines the approach of the first author in [14], and we refer the reader to Section 1.5 for precise definitions and an extended discussion of this point. Our main result concerns the distribution of

$$Z_{d}(\vec{u}) \stackrel{\Delta}{=} \frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|^{2},\qquad \vec{u}\in\mathbb{R}^{n_{0}},\quad \left\|\vec{u}\right\|=1.$$

As explained in Section 1.4 below, $Z_{d}(\vec{u})$ can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining ${M}^{(d)}$ , with $\vec{u}$ corresponding to a kind of initial condition. The diagonal matrices ${D}^{(i)}$ then correspond to $\{0,1\}$-valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely

$$\ln(Z_{d}(\vec{u})) = \ln\left(\frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|^{2}\right)$$

is Gaussian up to an error that tends to zero when $n_{i},d$ tend to infinity.

Theorem 1.

Fix $p\in(0,1]$ , and a distribution $\mu$ satisfying (i)-(iv) above. Let $\vec{u}\in\mathbb{R}^{n_{0}}$ be some fixed unit vector, $\left\|\vec{u}\right\|_{2}=1,$ and for any choice of $d$ , $n_{0},\ldots,n_{d}$ , set

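$$\beta \stackrel{\Delta}{=} \left(\frac{3}{p}-1\right)\sum_{i=1}^{d}\frac{1}{n_{i}} + \frac{\mu_{4}-3}{p\,n_{1}}\left\|\vec{u}\right\|_{4}^{4}.$$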

Let ${M}^{(d)}={X}^{(d)}\cdots{X}^{(1)}$ be as in (2). Then, the norm of the vector ${M}^{(d)}\vec{u}$ is approximately log-normal distributed:

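$$\frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|_{2}^{2} \approx \exp\left(\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right).$$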

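This approximation holds both in the sense of distribution and of moments. More precisely, with $d_{\mathrm{KS}}$ denoting the Kolmogorov-Smirnov distance,

$$d_{\mathrm{KS}}\left(\ln\left(\frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|_{2}^{2}\right),\,\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right) = O\left(\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right)^{1/5},$$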

where the implicit constant is uniform for $p$ in a compact subset of $(0,1]$ , and $\beta$ in a compact set bounded away from $\beta=0$ . Moreover, for every $k\geq 0,$ satisfying $\binom{k}{2}<\min_{1\leq i\leq d}n_{i},$ we have

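$$\mathbb{E}\left[\frac{n_{0}^{k}}{n_{d}^{k}}\left\|{M}^{(d)}\vec{u}\right\|_{2}^{2k}\right] = \exp\left[\binom{k}{2}\beta + O\left(\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right)\right] = \mathbb{E}\left[\left(\exp\left(\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right)\right)^{k}\right] + O\left(\beta^{-1}\sum_{i=1}^{d}\frac{1}{n_{i}^{2}}\right),$$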

where the implicit constant depends on $k$ and the moments of $\mu$ but not on $\beta,d,n_{i}.$
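The moment statement compares the moments of the normalized squared norm with those of a log-normal variable. As a short worked check (a standard Gaussian moment generating function computation, added here for orientation and not taken from the original text), let $G\sim\mathcal{N}(-\frac{1}{2}\beta,\beta)$. Then for every $k\geq 0$:

```latex
\begin{align*}
\mathbb{E}\left[\left(e^{G}\right)^{k}\right]
  &= \exp\left(k\,\mathbb{E}[G] + \tfrac{1}{2}k^{2}\,\mathrm{Var}(G)\right) \\
  &= \exp\left(-\tfrac{1}{2}k\beta + \tfrac{1}{2}k^{2}\beta\right)
   = \exp\left(\tfrac{k(k-1)}{2}\,\beta\right)
   = \exp\left(\tbinom{k}{2}\,\beta\right).
\end{align*}
```

In particular, $k=1$ gives mean exactly $1$, which is why the Gaussian is centered at $-\frac{1}{2}\beta$ rather than at $0$.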

Remark.

In the proof of equation (5), we actually show that $d_{\mathrm{KS}}\left(\ln\left(\frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|_{2}^{2}\right),\,\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right)$ is bounded above by

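$$O\left(\beta^{-1}\sum_{i=1}^{d-1}n_{i}^{-2} + \left(\beta^{-2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/5} + \left(\beta^{-1/2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/2} + \sum_{i=1}^{d}n_{i}^{-m} + \sum_{i=1}^{d}p^{n_{i}}\right)$$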

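for any choice of $m\in\mathbb{N}$, where the constants depend only on $m,$ the moments of $\mu$, and $p$. Taking $m=2$ and restricting $\beta$ to a compact set yields the result claimed in Theorem 1. Moreover, if we take $p=1$, then we actually prove the sharper result that

$$d_{\mathrm{KS}}\left(\ln\left(\frac{n_{0}}{n_{d}}\left\|{M}^{(d)}\vec{u}\right\|_{2}^{2}\right),\,\mathcal{N}\left(-\frac{1}{2}\beta,\,\beta\right)\right) = O\left(\beta^{-1}\sum_{i=1}^{d-1}n_{i}^{-2} + \left(\beta^{-2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/5} + \left(\beta^{-1/2}\sum_{i=1}^{d-1}n_{i}^{-2}\right)^{1/2}\right)$$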

The two conclusions of Theorem 1, equation (7) and equation (6), are proven separately by independent arguments. We prove equation (6) by a path-counting type argument in Section 3. The argument in Section 4 for equation (7), in contrast, uses a central limit theorem for martingales.
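The log-normal approximation of Theorem 1 can be probed numerically. The following is a minimal Monte Carlo sketch, not the authors' code. It assumes entries drawn from $\mathrm{Uniform}(-\sqrt{3},\sqrt{3})$ (symmetric, atomless, variance $1$, fourth moment $\mu_{4}=9/5$), $p=1/2$, equal widths, and $\vec{u}$ a coordinate vector. The helper names `beta` and `sample_log_Z` are illustrative.

```python
import numpy as np


def beta(widths, p, mu4, u):
    """beta of Theorem 1 for widths [n_0, ..., n_d] and a unit vector u in R^{n_0}."""
    n = np.asarray(widths, dtype=float)
    bulk = (3.0 / p - 1.0) * np.sum(1.0 / n[1:])
    fourth_moment_term = (mu4 - 3.0) / (p * n[1]) * np.sum(u ** 4)
    return bulk + fourth_moment_term


def sample_log_Z(widths, p, u, num_samples, rng):
    """Sample ln((n_0 / n_d) * ||M^(d) u||^2), one independent product per sample."""
    a = np.sqrt(3.0)  # Uniform(-a, a): symmetric, no atoms, mean 0, variance 1
    v = np.tile(u, (num_samples, 1))
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.uniform(-a, a, size=(num_samples, n_out, n_in))
        xi = rng.random((num_samples, n_out)) < p  # diagonal of D^(i), Bernoulli(p)
        # Apply X^(i) = (p * n_{i-1})^{-1/2} D^(i) W^(i) to each sample's vector.
        v = xi * np.einsum("sab,sb->sa", W, v) / np.sqrt(p * n_in)
    return np.log(widths[0] / widths[-1] * np.sum(v ** 2, axis=1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [64] * 9  # n_0 = ... = n_8 = 64, so d = 8
    u = np.zeros(64)
    u[0] = 1.0
    b = beta(widths, p=0.5, mu4=1.8, u=u)
    log_Z = sample_log_Z(widths, 0.5, u, num_samples=1000, rng=rng)
    print(f"beta           = {b:.4f}")
    print(f"empirical mean = {log_Z.mean():.3f} (theory {-b / 2:.3f})")
    print(f"empirical var  = {log_Z.var():.3f} (theory {b:.3f})")
```

Only the vector $M^{(d)}\vec{u}$ is propagated, layer by layer, so the full matrix product is never formed. Agreement is expected up to Monte Carlo noise and the error terms of the theorem.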

1.1 Joint scaling limits

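Theorem 1 shows that the free energy $\ln(Z_{d}(\vec{u}))=\ln(\frac{n_{0}}{n_{d}}||{M}^{(d)}\vec{u}||_{2}^{2})$ from (4) is Gaussian in the double scaling limit

$$d\to\infty,\qquad n_{i}=n_{i}(d)\to\infty,\qquad 0<\limsup_{d\to\infty}\sum_{i=1}^{d}\frac{1}{n_{i}(d)}<\infty, \tag{8}$$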

achieved for instance when $n_{i}(d)$ are equal and proportional to $d.$ This asymptotic normality for $\ln(Z_{d}(\vec{u}))$ cannot be seen by taking the limits $d\to\infty$ and $\min\{n_{i}\}$ to infinity one after the other. Indeed, consider the case when $p=1$ and $\mu=\mathcal{N}(0,1),$ the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution

$$Z_{d}(\vec{u}) \stackrel{d}{=} \prod_{i=1}^{d}\frac{\chi_{n_{i}}^{2}}{n_{i}}, \tag{9}$$

where $\chi_{n}^{2}$ is a chi-squared random variable with $n$ degrees of freedom and the terms in the product are independent. In the limit where $d$ is fixed and $\min\{n_{i},\,i\geq 1\}\to\infty,$ we have $\chi_{n_{i}}^{2}/n_{i}\approx 1+O(n_{i}^{-1/2})$ and so

$$\lim_{\min n_{i}\to\infty}\ln(Z_{d}(\vec{u})) = 0\quad\text{almost surely}.$$

On the other hand, if the $n_{i}$ are uniformly bounded, then we have

$$\lim_{d\to\infty}\ln(Z_{d}(\vec{u})) = -\infty\quad\text{almost surely}.$$

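In fact, for $n_{i}\equiv n$ fixed, $\ln(Z_{d}(\vec{u}))$ converges only with an additional $1/d$ scaling:

$$\lim_{d\to\infty}\frac{1}{d}\ln(Z_{d}(\vec{u})) = \mathbb{E}\left[\log\left\|{M}^{(1)}\vec{u}\right\|^{2}\right]\quad\text{almost surely}.$$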

In particular, we have

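$$\lim_{d\to\infty}\ \lim_{n_{1},\ldots,n_{d}\to\infty}\ln(Z_{d}(\vec{u})) \neq \lim_{n_{1},\ldots,n_{d}\to\infty}\ \lim_{d\to\infty}\ln(Z_{d}(\vec{u})),$$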

making (8) an interesting regime for $\ln(Z_{d}(\vec{u}))$ . The non-commutativity of the $n_{i},d\to\infty$ limits is well-known [1, 7, 19] and is related to the fact that the local statistics of the singular values of ${M}^{(d)}$ are sensitive to the order in which the limits above are taken. Remaining in the simple case of $p=1$ and $\mu$ Gaussian, a simple application of the central limit theorem shows that when all the $n_{i}$ are equal to a common value $n$ related to $d$ by $n=2\beta^{-1}d$ , the exact chi-squared representation of equation (9) gives the convergence in distribution:

$$\lim_{\substack{d\to\infty \\ n_{i}=2\beta^{-1}d}}\ln(Z_{d}(\vec{u})) = \mathcal{N}\left(-\beta/2,\,\beta\right),$$

which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter $p$ and the measure $\mu$ according to which the entries of the matrices ${W}^{(i)}$ are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.
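The Gaussian case $p=1$, $\mu=\mathcal{N}(0,1)$ is easy to simulate directly from the chi-squared product, without building any matrices. The sketch below is illustrative only. It assumes equal widths $n_{i}=n$, and the function name `sample_log_Z_gaussian` is hypothetical. It contrasts the joint scaling $n=2\beta^{-1}d$ with the regime of fixed $n$ and growing $d$.

```python
import numpy as np


def sample_log_Z_gaussian(n, d, num_samples, rng):
    """Sample ln(Z_d) = sum_{i=1}^d ln(chi2_n / n) for p = 1, Gaussian entries, n_i = n."""
    chi2 = rng.chisquare(df=n, size=(num_samples, d))
    return np.log(chi2 / n).sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Joint scaling: n = 2 * d / beta with beta = 1, so ln(Z_d) should be close to N(-1/2, 1).
    d = 200
    joint = sample_log_Z_gaussian(n=2 * d, d=d, num_samples=20000, rng=rng)
    print(f"joint scaling: mean {joint.mean():.3f}, var {joint.var():.3f}")

    # Fixed width n = 4: ln(Z_d) / d settles near E[ln(chi2_4 / 4)] = psi(2) - ln(2) < 0,
    # where psi(2) = 1 - (Euler-Mascheroni constant).
    fixed = sample_log_Z_gaussian(n=4, d=2000, num_samples=200, rng=rng)
    print(f"fixed width:   mean of ln(Z_d)/d = {(fixed / 2000).mean():.4f}, "
          f"theory {1 - np.euler_gamma - np.log(2):.4f}")
```

The per-layer mean $\mathbb{E}[\ln(\chi^{2}_{n}/n)]$ is strictly negative for every $n$ by Jensen's inequality, so at fixed width the sum drifts linearly in $d$. Only when $n$ grows with $d$ do the mean and variance stay of order one.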

1.2 Connection to previous work in Random Matrix Theory

The literature on products of random matrices is vast. Much of the previous work concerns products of $d$ i.i.d. random matrices, each of size $n\times n$ . Such ensembles have been well studied in two distinct regimes: (a) when $n$ is fixed and $d\to\infty$ and (b) when $d$ is fixed and $n\to\infty$ . Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal articles in this regime are the results of Furstenberg and Kesten [10], which give general conditions for the existence of the top Lyapunov exponent

$$\lambda_{\max} = \lim_{d\to\infty}\frac{1}{d}\log\left\|{M}^{(d)}\right\|_{\ell_{2}\to\ell_{2}},$$

and the multiplicative ergodic theorem of Oseledets [23], which gives conditions for almost sure (deterministic) values for all the Lyapunov exponents. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have entries that are real or complex Gaussians, see e.g. [1, 9, 8, 18, 21, 15] as well as the survey [3] and references therein.

Case (b), where $d$ is fixed and $n\to\infty$ , falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit $n\to\infty$ , a collection of $d$ independent $n\times n$ random matrices behaves like a collection of $d$ freely independent random variables on a non-commutative probability space (see e.g. [4] Chapter 5 or [20] Chapters 1 and 4). Therefore, case (b) is closely related to a product of $d$ freely independent random variables; precise results are obtained in [11]. Earlier results [12, 22] examine case (b) without explicit use of free probability. The problem of first taking $n\to\infty$ and afterwards taking $d\to\infty$ can also be handled using the tools of free probability in the case of Gaussian matrices, see [27].

As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits $d\to\infty$ , and $n\to\infty$ do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in [1]. To explain this further, we note that the work of Newman [16, Thm. 1] in regime (a) shows that when $p=1$ and $n_{j}\equiv n$ is fixed, the density of the Lyapunov exponents of ${M}^{(d)}$ converges in the limit when first $d\to\infty$ and then $n\to\infty$ to the triangular density

$$h(\lambda) = \begin{cases} 2\lambda, & 0<\lambda<1 \\ 0, & \text{otherwise}. \end{cases}$$

The work of Tucci [27, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to ${M}^{(d)}$ one obtains the same global limit in the regime (b) when first $n\to\infty$ and then $d\to\infty.$ However, as explained in [1] Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to be different depending on the order of the limits even in the exactly solvable special case of products of complex Ginibre matrices.

From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix $n_{0}$ and let $n_{i},\,i\geq 1,$ and $d$ tend to infinity in accordance with (8). Note that we specifically do not take $n_{0}$ to infinity. Denote by $\sigma_{1},\ldots,\sigma_{n_{0}}$ the non-zero singular values of ${M}^{(d)}$ , and by $\vec{v}_{1},\ldots,\vec{v}_{n_{0}}\in\mathbb{R}^{n_{0}}$ the corresponding right-singular vectors. Then we have

$$\left\|{M}^{(d)}\vec{u}\right\|^{2} = \sum_{j=1}^{n_{0}}\sigma_{j}^{2}\,\langle\vec{u},\vec{v}_{j}\rangle^{2}.$$

In many situations of interest we can expect that the inner products satisfy $\langle u,\,v_{j}\rangle^{2}\leavevmode\nobreak\ \approx\leavevmode\nobreak\ \frac{1}{n_{0}}$ . This happens for example if the vector $\vec{u}$ is chosen uniformly at random on the $n_{0}$ -sphere or when $\vec{u}$ is a fixed vector and the matrix ${W}^{(1)}$ is invariant under right multiplication by an orthogonal matrix. In this setting

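$$\log\left(\left\|{M}^{(d)}\vec{u}\right\|^{2}\right) \approx \log\left(\frac{1}{n_{0}}\sum_{j=1}^{n_{0}}\sigma_{j}^{2}\right).$$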

Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for $||{M}^{(d)}\vec{u}||^{2}$ is a Gaussian with mean $-\beta/2$ and variance $\beta$ in the limit (8). These non-trivial corrections in $\beta$ can be seen as a finite temperature correction to the maximal entropy regime of Tucci [27] in which first $n\to\infty$ and then $d\to\infty$ . For more on this point of view, we refer the reader to [1, Section 3.2].
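The singular value expansion of the squared norm used in this argument can be checked numerically. The sketch below is illustrative: the function name `squared_norm_via_svd` is hypothetical, and the vectors $\vec{v}_{j}$ are taken to be the rows of the factor `Vt` returned by `numpy.linalg.svd`, which are the singular vectors living in $\mathbb{R}^{n_{0}}$.

```python
import numpy as np


def squared_norm_via_svd(M, u):
    """Evaluate sum_j sigma_j^2 <u, v_j>^2, with v_j the rows of Vt from the SVD of M."""
    _, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return float(np.sum(sigma ** 2 * (Vt @ u) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((50, 6))  # n_d = 50 rows, n_0 = 6 columns
    u = rng.standard_normal(6)
    u /= np.linalg.norm(u)
    print(squared_norm_via_svd(M, u), np.linalg.norm(M @ u) ** 2)
```

Since the $\vec{v}_{j}$ form an orthonormal basis of $\mathbb{R}^{n_{0}}$ when $n_{d}\geq n_{0}$, the weights $\langle\vec{u},\vec{v}_{j}\rangle^{2}$ sum to $1$ for a unit vector, so their average is exactly $1/n_{0}$. The heuristic in the text replaces each weight by this average.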

Finally, in the specific case where the random matrices ${X}^{(j)}$ are complex Ginibre matrices (i.e. the matrix entries are i.i.d. complex Gaussian), very recent work [2, 19] looks at the limiting spectrum under the joint scaling limit $d\to\infty$ , $n\to\infty$ where the ratio $d/n$ is fixed or going to $\infty$ . This work analyzes exact determinantal formulas for the joint distribution of singular values available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices given in [15] are significantly more complicated and such an explicit analysis appears to be much more difficult.

1.3 Contribution of the Present Work

In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint $d,n\to\infty$ limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).

Second, the ensembles in Theorem 1 include the somewhat unusual diagonal $\mathrm{Bernoulli}(p)$ matrices ${D}$ as part of the model. Our original motivation for including these is the connection to neural networks explained in Section 1.5. In essence, the matrices ${D}^{(i)}$ can be interpreted as adding i.i.d. $\{0,1\}$-valued spins to the usual sum over paths approach to moments of products of matrices. Only “open” paths that have spin $1$ on every vertex contribute to the sum, causing open paths to be correlated. Previously, Forrester [8] and Tucci [27] considered the case when ${D}^{(i)}$ were deterministic positive definite matrices.

An additional novelty of Theorem 1 is that it proves that the distribution of $\big{|}\big{|}{M}^{(d)}\vec{u}\big{|}\big{|}_{2}^{2}$ is (mostly) universal: it does not depend on the higher moments of the distribution $\mu$ beyond the mean and variance, with the exception of the fourth moment $\mu_{4}$ appearing in $\beta$ in the term $\left\|\vec{u}\right\|_{4}^{4}(\mu_{4}-3)/pn_{1}$ . In the regime $n_{j}\equiv n$ and $d/n\equiv\beta$ , this term is a $1/n$ correction.

The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form $\sum_{i=1}^{d}{n^{-2}_{i}}$ . This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).

Finally, we remark that Theorem 1 only studies $\big{|}\big{|}{M}^{(d)}\vec{u}\big{|}\big{|}_{2}^{2}$ for a fixed vector $\vec{u}$ , and therefore leaves several questions open: for instance the joint law of $\{\big{|}\big{|}{M}^{(d)}\vec{u}^{(1)}\big{|}\big{|}_{2}^{2},\ldots,\big{|}\big{|}{M}^{(d)}\vec{u}^{(\ell)}\big{|}\big{|}_{2}^{2}\}$ for a list of vectors $\{\vec{u}^{(1)},\ldots,\vec{u}^{(\ell)}\}$ and more generally the limiting spectral distribution of the matrices ${M}^{(d)}$ . We plan to address these questions in forthcoming work.

1.4 Connection to Random Polymers

The matrix ensembles ${M}^{(d)}$ studied in this article, in the case $n_{i}=n,\,\,i=1,\ldots,d$ , are related to directed random polymers on the complete graph of size $n$ . This model was recently explored in detail (see e.g. [5]). A key object for these polymers is the line-to-line partition function

$$Z(d) = \sum_{\pi\in\{1,\ldots,n\}^{d}}\exp\left(\frac{1}{T}\eta(i,\pi(i))\right),$$

where $T$ is the temperature of the model, and $\left\{\eta(i,j)\right\}_{(i\in\mathbb{N},1\leq j\leq n)}$ are i.i.d. mean zero random variables that make up the underlying disordered environment. When the sum over $\{1,\ldots,n\}^{d}$ is written via products of $d$ matrices of size $n\times n$ , the disordered environment can be viewed as a multipartite graph made of $d$ vertex clusters $V_{1},\ldots,V_{d}$ of size $n$ with (directed) edges from all vertices in $V_{i}$ to all vertices in $V_{i+1}.$ The edges of this graph are then decorated with the corresponding matrix entries $\exp\left(\frac{1}{T}\eta(i,j)\right),$ which are strictly positive, making $Z(d)$ a sum over paths from the input to the output of this graph. Each path is weighted by its energy, given by the product of weights along the path.

The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different from the analysis of the matrix product ${M}^{(d)}$ defined in (1). In particular, no cancellation is possible between the terms in the definition of $Z(d)$ above, causing $Z(d)$ to be exponential in $d$ . The $n$ fixed and $d\to\infty$ limit of the partition function in the case of these positive weights is the object of study in [5]. As explained in Section 1.3, Theorem 1 studies a different regime where both $d\to\infty,n\to\infty$ at the same time. The fact that our weights are mean zero gives rise to significant cancellation in the terms of $Z_{d}(\vec{u})$ from (4), so that the partition function in our mean zero model does not grow exponentially with $d$ provided $n$ grows with $d$ as in (8). Additionally, if $p<1$ in our model, the effect of the diagonal Bernoulli matrices ${D}^{(i)}$ is to close every vertex with probability $1-p$ . The sum over paths in our partition function then becomes the sum only over those paths that pass through vertices that are open.
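The identification of a sum over paths with a matrix product can be made concrete on a tiny layered graph. The sketch below is illustrative and uses generic edge weights rather than the specific polymer environment. Here `weights[i][a, b]` is assumed to be the weight of the edge from vertex `b` of cluster `i` to vertex `a` of cluster `i + 1`, and both function names are hypothetical.

```python
import itertools

import numpy as np


def path_sum(weights):
    """Brute force: sum over all directed paths of the product of edge weights."""
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    total = 0.0
    for path in itertools.product(*(range(s) for s in sizes)):
        prod = 1.0
        for i, w in enumerate(weights):
            prod *= w[path[i + 1], path[i]]
        total += prod
    return total


def matrix_product_sum(weights):
    """Same quantity as the sum of all entries of W_d ... W_1."""
    M = weights[0]
    for w in weights[1:]:
        M = w @ M
    return float(M.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eta = [rng.standard_normal((3, 3)) for _ in range(4)]
    positive = [np.exp(e) for e in eta]  # strictly positive weights: no cancellation
    print(path_sum(positive), matrix_product_sum(positive))
    print(path_sum(eta), matrix_product_sum(eta))  # mean-zero weights: terms cancel
```

The brute-force enumeration visits $n^{d+1}$ paths, so it is only usable for very small graphs. The matrix product computes the same sum in polynomial time.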

1.5 The Case $p=1/2$ as Gradients in Random Neural Nets

One of our motivations for studying the ensembles ${M}^{(d)}$ is that, as we prove in Proposition 2 below, the case $p=1/2$ corresponds exactly to the input-output Jacobian in randomly initialized neural networks with $\operatorname{ReLU}$ activations. To explain this connection, fix $d,n_{0},\ldots,n_{d}\in\mathbb{N}$ . A neural network with $\operatorname{ReLU}$ activations, depth $d$ , and layer widths $n_{0},\ldots,n_{d}$ is any function of the form

$$\mathcal{N}(x) = \operatorname{ReLU}\circ A^{(d)}\circ\cdots\circ\operatorname{ReLU}\circ A^{(1)}(x),$$

where $A^{(j)}:{\mathbb{R}}^{n_{j-1}}\rightarrow{\mathbb{R}}^{n_{j}}$ are affine

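$$A^{(i)}(x) = {W}^{(i)}x + \vec{B}^{(i)},\qquad {W}^{(i)}\in\mathrm{Mat}(n_{i},n_{i-1}),\quad \vec{B}^{(i)}\in\mathbb{R}^{n_{i}},$$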

and for $m\geq 1$ and any vector $\vec{v}=\left(v_{1},\ldots,v_{m}\right)\in{\mathbb{R}}^{m}$ we write

$$\operatorname{ReLU}(\vec{v}) = \left(\max\{0,v_{1}\},\ldots,\max\{0,v_{m}\}\right)\in\mathbb{R}^{m}.$$

The matrices ${W}^{(j)}$ and vectors $\vec{B}^{(j)}$ are called, respectively, the weights and biases of $\mathcal{N}$ at layer $j,$ while $d,n_{0},\ldots,n_{d}$ collectively define the architecture of $\mathcal{N}$ . We will write $\operatorname{\mathrm{Act}}^{(0)}\in{\mathbb{R}}^{n_{0}}$ for an input to $\mathcal{N}$ and will define

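$$\operatorname{\mathrm{Act}}^{(j)} \stackrel{\Delta}{=} \operatorname{ReLU}(\mathrm{act}^{(j)}),\qquad \mathrm{act}^{(j)} \stackrel{\Delta}{=} A^{(j)}\circ\operatorname{ReLU}\circ\cdots\circ\operatorname{ReLU}\circ A^{(1)}(\operatorname{\mathrm{Act}}^{(0)})$$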

to be the vectors of activities before and after applying $\operatorname{ReLU}$ at the neurons at layer $j$ .
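These definitions translate directly into code. The following is a minimal numpy sketch with illustrative function names, not the authors' implementation. It computes the pre- and post-activation vectors and assembles the input-output Jacobian by the chain rule as a product of the weight matrices interleaved with $\{0,1\}$-valued diagonal matrices recording which neurons are active.

```python
import numpy as np


def relu(v):
    return np.maximum(0.0, v)


def forward(weights, biases, x):
    """Return the lists [act^(1), ..., act^(d)] and [Act^(0), ..., Act^(d)]."""
    pre, post = [], [x]
    for W, B in zip(weights, biases):
        pre.append(W @ post[-1] + B)
        post.append(relu(pre[-1]))
    return pre, post


def network(weights, biases, x):
    """Output Act^(d) of the ReLU net on input x."""
    return forward(weights, biases, x)[1][-1]


def input_output_jacobian(weights, biases, x):
    """Chain rule: Jac = D_d W_d ... D_1 W_1 with D_i = diag(1{act^(i) > 0})."""
    pre, _ = forward(weights, biases, x)
    J = np.eye(len(x))
    for W, a in zip(weights, pre):
        J = np.diag((a > 0).astype(float)) @ W @ J
    return J


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [3, 6, 5, 4]
    weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / n)
               for n, m in zip(widths[:-1], widths[1:])]
    biases = [rng.standard_normal(m) for m in widths[1:]]
    x = rng.standard_normal(widths[0])
    print(input_output_jacobian(weights, biases, x))
```

Because $\operatorname{ReLU}$ is piecewise linear, this Jacobian is exact away from inputs where some pre-activation is exactly zero. A finite-difference comparison is a convenient sanity check.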

In practice, the weights and biases in a neural network are first randomly initialized and then optimized by (stochastic) gradient descent on a task-specific loss $\mathcal{L}=\mathcal{L}(\operatorname{\mathrm{Act}}^{(d)})$ that depends only on the outputs of $\mathcal{N}.$ A single gradient descent update for a trainable parameter $W$ (i.e. an entry in one of the weight matrices ${W}^{(i)}$ ) is

$$W \longleftarrow W - \lambda\frac{\partial\mathcal{L}}{\partial W}, \tag{10}$$

where $\lambda>0$ is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:

[TABLE]

making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of $\mathcal{N}$ are random and the implicit structure of the data being processed has not yet regularized the function computed by $\mathcal{N}$ .

As explained below, the EVGP for a depth $d$ $\operatorname{ReLU}$ net $\mathcal{N}$ with hidden layer widths $n_{0},\ldots,n_{d}$ is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:

[TABLE]

The next result shows that the singular value distribution of ${M}^{(j^{\prime}-j)}$ is the same as that of ${\mathop{\mathrm{Jac}}}^{(j\rightarrow j^{\prime})}$ since

[TABLE]

Proposition 2 also shows that, for any collection of vectors $\vec{u}_{1},\ldots,\vec{u}_{k}\in{\mathbb{R}}^{n_{j}},$ we have the following equality in distribution when $p=1/2$ :

[TABLE]

Proposition 2.

Let $\mathcal{N}$ be a $\operatorname{ReLU}$ net with depth $d$ and layer widths $n_{0},\ldots,n_{d}$ . Fix $0\leq j<j^{\prime}\leq d.$ Suppose the weights of $\mathcal{N}$ are ${W}^{(i)}$ , with entries drawn i.i.d. from the measure $\mu$ as in the original definition (2). Then, writing $\eta^{(j^{\prime})}\in\mathbb{R}^{n_{j^{\prime}}}$ for the $n_{j^{\prime}}$-dimensional $\pm 1$-Bernoulli random vector, whose entries are independent and take the values $\pm 1$ with probability $1/2$ , we have

[TABLE]

where ${D}^{(i)}$ are diagonal $\{0,1\}$ -Bernoulli matrices as in (2) with parameter $p=\frac{1}{2}$ and $\stackrel{{\scriptstyle d}}{{=}}$ denotes equality in distribution.

Before proving Proposition 2, let us explain why the functions $||{\mathop{\mathrm{Jac}}}^{(j^{\prime}-j)}\vec{u}||^{2}_{2}$ that we study in Theorem 1 are related to the EVGP. Due to the compositional nature of the function computed by $\mathcal{N},$ we may use the chain rule to write, for the weight $W_{a,b}^{(j)}$ connecting neuron $a$ to neuron $b$ in layer $j,$

[TABLE]

where ${\mathop{\mathrm{Jac}}}_{b}^{(j\rightarrow d)}$ is the $b^{th}$ column of ${\mathop{\mathrm{Jac}}}^{(j\rightarrow d)}.$ Therefore, fluctuations of the gradient descent update $\partial\mathcal{L}/\partial W_{a,b}^{(j)}$ are captured precisely by fluctuations of bilinear functionals $\langle{\mathop{\mathrm{Jac}}}^{(j\rightarrow j^{\prime})}\vec{u},\vec{v}\rangle$ of various layer-to-layer Jacobians in $\mathcal{N}$ . We study in this article the case $\vec{u}=\vec{v}$ and obtain in Theorem 1 precise distribution and moment estimates on these quantities. For instance, Theorem 1 combined with Proposition 2 immediately yields the following

Corollary 3.

Let $\mathcal{N}$ be a fully connected depth $d$ $\operatorname{ReLU}$ net with layer widths $n_{0},\ldots,n_{d}$ and randomly initialized weights drawn i.i.d. from the measure $\mu$ and scaled to have variance $2/\text{fan-in}=2/n_{i}$ as in (2). Suppose also that the biases of $\mathcal{N}$ are drawn i.i.d. from any measure satisfying the same assumptions as the measure $\mu.$ Fix any $\vec{u}\in{\mathbb{R}}^{n_{0}}$ with $\left\|\vec{u}\right\|=1$ and write ${\mathop{\mathrm{Jac}}}^{(d)}$ for the input-output Jacobian of $\mathcal{N}$ . We have

[TABLE]

where

[TABLE]

and the implicit constant is uniform when $\beta$ ranges over a compact subset of $(0,\infty)$ .

For more information about the EVGP, statistics of gradients in random $\operatorname{ReLU}$ nets, and the distribution of the singular values of the input-output Jacobian, we refer the interested reader to [14, 24, 17, 25].

2 Proof of Proposition 2

2.1 Idea behind the proof

The essential idea behind Proposition 2 is to notice that the derivative of the $\operatorname{ReLU}$ function is $\operatorname{ReLU}^{\prime}(x)=\mathtt{1}\{x>0\}$ , so when doing the chain rule to compute ${\mathop{\mathrm{Jac}}}^{(d)}$ , we find the following $\{0,1\}$ -valued diagonal matrices naturally appearing:

[TABLE]

Since the random weights $W^{(i)}_{a,b}$ and biases $B^{(i)}_{a}$ are symmetrically distributed around $0$ (i.e. $-W^{(i)}_{a,b}\stackrel{{\scriptstyle d}}{{=}}W^{(i)}_{a,b}$ ) and have no atoms, it is easily verified that each entry in ${W}^{(i)}\vec{x}+\vec{B}^{(i)}$ is equally likely to be positive or negative regardless of the value of $\vec{x}$ . Hence the matrix in equation (12) is equal in distribution to the Bernoulli matrix ${D}^{(i)}$ when $p=\frac{1}{2}$ . This informally explains the connection between ${\mathop{\mathrm{Jac}}}^{(d)}$ and ${M}^{(d)}$ .
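The symmetry argument can be illustrated numerically. The following is a minimal sketch under stated assumptions, not part of the proof: the weights and biases are taken to be standard Gaussian (one convenient law that is symmetric around $0$ and has no atoms), the widths and inputs are arbitrary, and the helper name `active_fractions` is hypothetical. It records, layer by layer, the fraction of neurons with positive pre-activation, which should be close to $1/2$ for any input.

```python
import numpy as np

def active_fractions(x, widths, rng):
    """Fraction of positive pre-activations per layer in a random ReLU net.

    Assumption: weights and biases are standard Gaussian, chosen only because
    this law is symmetric around 0 and has no atoms.
    """
    fractions = []
    act = np.asarray(x, dtype=float)
    for n_out in widths:
        W = rng.standard_normal((n_out, act.shape[0]))
        B = rng.standard_normal(n_out)
        pre = W @ act + B                           # pre-activation W x + B
        fractions.append(float(np.mean(pre > 0)))   # mean of the diagonal of 1{pre > 0}
        act = np.maximum(pre, 0.0)                  # ReLU
    return fractions

rng = np.random.default_rng(0)
# Two very different inputs; the fractions should be near 1/2 in both cases.
fracs_a = active_fractions(np.ones(50), [1000, 1000, 1000], rng)
fracs_b = active_fractions(-7.0 * np.arange(50), [1000, 1000, 1000], rng)
print(fracs_a)
print(fracs_b)
```

The check only concerns the marginal law of each diagonal entry; independence across layers is the subtler point addressed by the sign-conjugation argument.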

It remains to see that these diagonal matrices are independent of each other; this is not a priori clear, since the outputs of each layer are fed into subsequent layers. This again will be a consequence of the fact that the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random $\pm 1$ variables. This doesn’t change the distribution of the network, but will allow us to see the independence between layers in a more concrete way.

2.2 Proof of Proposition 2

Proof of Proposition 2.

Fix a neural net $\mathcal{N}$ as in the statement of Proposition 2 and denote its weights and biases at layer $j$ by $W_{a,b}^{(j)}$ and $B_{a}^{(j)}$ . For each $j,$ let

[TABLE]

be an i.i.d. collection of random variables that each take values $\pm 1$ with probability $1/2$ . We will also define

[TABLE]

Consider the neural net $\widehat{\mathcal{N}}$ with weights $\widehat{W}_{a,b}^{(j)}$ and biases $\widehat{B}_{a}^{(j)}$ defined by changing the signs of the weights and biases of $\mathcal{N}$ as follows:

[TABLE]

so that

[TABLE]

We will denote by $\widehat{\operatorname{\mathrm{Act}}}^{(j)},\,\widehat{{\mathop{\mathrm{Jac}}}}^{(d)}$ the activations and input-output Jacobian for $\widehat{\mathcal{N}},$ both computed at the same fixed input

[TABLE]

Note that since we’ve assumed that ${W}^{(j)},\vec{B}^{(j)}$ have distributions that are symmetric around $0$, we have

[TABLE]

Hence, since the weights of the two networks are identically distributed,

[TABLE]

On the other hand, the chain rule yields the following recursion for $\mathrm{Diag}(\eta^{(d)})\widehat{{\mathop{\mathrm{Jac}}}}^{(d)}:$

[TABLE]

where

[TABLE]

and we’ve used that diagonal matrices commute. Note that a priori the matrices $\widehat{{D}}^{(j)}$ depend on the weights and biases $\widehat{{W}}^{(i)},\widehat{\vec{B}}^{(i)}$ for $i\leq j$ since $\widehat{\operatorname{\mathrm{act}}}^{(j)}=\widehat{{W}}^{(j)}\widehat{\operatorname{\mathrm{Act}}}^{(j-1)}+\widehat{\vec{B}}^{(j)}$ . However, we will now verify the following claim about the collection of matrices $\widehat{{D}}^{(j)}$ and variables $\sigma^{(j)}$ :

[TABLE]

and that moreover the collection $\{\widehat{{D}}^{(j)},\,j=1,\ldots,d\}$ is independent and that each $\widehat{{D}}^{(j)}$ is distributed like a diagonal matrix with independent diagonal entries taking the values $0$ and $1$ each with probability $1/2.$ Once we have proven this, since $\sigma^{(j)}{W}^{(j)}\stackrel{{\scriptstyle d}}{{=}}{W}^{(j)}$ , equation (13) shows that $\mathrm{Diag}(\eta^{(d)})\widehat{{\mathop{\mathrm{Jac}}}}^{(d)}\stackrel{{\scriptstyle d}}{{=}}\widehat{{D}}^{(d)}{W}^{(d)}\mathrm{Diag}(\eta^{(d-1)})\widehat{{\mathop{\mathrm{Jac}}}}^{(d-1)}$ and $\widehat{{D}}^{(d)}\stackrel{{\scriptstyle d}}{{=}}\mathrm{Diag}(\xi_{1},\ldots,\xi_{n_{d}})$ , a diagonal $\{0,1\}$-Bernoulli random matrix independent of everything else. This will complete the proof of the present proposition since this is exactly the recurrence for the matrices $M^{(d)}$ .

To prove (14), we will use the fact that two random variables $X,Y$ are independent if the distribution of $X$ given $Y=y$ does not depend on the value of $y.$ That is, (14) will follow once we show, for any fixed sequences $s_{a}^{(j)},r_{a}^{(j)}\in\{\pm 1\}$ , that

[TABLE]

To check this equality, it suffices to show that given $\{s_{a}^{(j)},\,r_{a}^{(j)}\}$ there is exactly one possible configuration for the variables $\xi_{a}^{(j)},\eta_{a}^{(j)}$ for which the event $\cap_{a,j}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\}$ occurs. The resulting probability then follows since $\xi_{a}^{(j)},\eta_{a}^{(j)}$ are i.i.d. variables that each take the values $\pm 1$ with probability $1/2.$ The proof is by induction: we will show that for each $i=1,\ldots,d$ , given $W^{(j)},B^{(j)},\,j\leq i$ there is a unique configuration for the variables $\xi_{a}^{(j)},\eta_{a}^{(j)},\,\,j\leq i$ that leads to the event $\cap_{a,j\leq i}\left\{\widehat{D}_{a}^{(j)}=s_{a}^{(j)},\sigma_{a}^{(j)}=r_{a}^{(j)}\right\}$ . When $i=1,$ we have

[TABLE]

Recalling that $\eta_{b}^{(0)}=1$ for all $b$ , we see that for each $a,$ there is a unique value of $\xi_{a}^{(1)}$ for which

[TABLE]

Then, for this value of $\xi_{a}^{(1)},$ since we have $\sigma_{a}^{(1)}=\xi_{a}^{(1)}\eta_{a}^{(1)}$ , there is a unique value of $\eta_{a}^{(1)}$ for which $\sigma_{a}^{(1)}=r_{a}^{(1)}.$ The proof of the inductive step is identical. Namely, suppose we have determined the values of $\xi_{a}^{(j)},\eta_{a}^{(j)}$ for $j\leq i.$ Then, given the weights and biases $W^{(j)},B^{(j)},\,\,j\leq i,$ we have uniquely determined $\widehat{\operatorname{\mathrm{Act}}}^{(j)}.$ Then, given this value for $\widehat{\operatorname{\mathrm{Act}}}^{(j)}$ , for every $a=1,\ldots,n_{i+1},$ there is a unique value $\xi_{a}^{(i+1)}$ for which $\widehat{D}_{a}^{(i+1)}=s_{a}^{(i+1)}.$ And finally, given this value for $\xi_{a}^{(i+1)}$ there is a unique value of $\eta_{a}^{(i+1)}$ so that $\sigma_{a}^{(i+1)}=r_{a}^{(i+1)}.$ This completes the proof. ∎

3 Proof of Theorem 1: Moment Estimates and Path Counting

3.1 Outline of Proof of Equation (6)

We begin by indicating the general plan for the proof of Equation (6) from Theorem 1, which consists of two steps. First, in Proposition 4 below, we express the expectation in (6) as a sum over $k$ -tuples of paths $V\in[n_{0}]^{k}\times\cdots\times[n_{d}]^{k}$ . The precise result is the following

Proposition 4 (Moments of $||{M}\vec{u}||^{2}$ as a sum over paths).

With the notation of Theorem 1, for each $k$ we have

[TABLE]

where $u_{V(0)}\stackrel{{\scriptstyle\Delta}}{{=}}\prod_{j=1}^{k}u_{V_{j}(0)}$ , and $C$ is defined by:

[TABLE]

with $\#v$ denoting the number of unique entries in a tuple $v\in[n]^{k}$ , $m_{x,y}$ being the multiplicity of edges appearing in the set $\left\{(x_{i},y_{i})\right\}_{i=1}^{k}$ as in (19), $c_{\ell}$ a combinatorial factor given by (17), and $wt$ , defined in (21), denoting a weight function that depends on the moments of the entries of the weight matrices ${W}^{(i)}$ .

Note that the definition of $C$ in equation (16) depends only on the collection of vertices $V(i-1)\in[n_{i-1}]^{k}$ and $V(i)\in[n_{i}]^{k}$ , the moments of the measure $\mu$ according to which the entries of the matrices $\underline{W}^{(i)}$ are distributed, and the parameter $p.$ The utility of equation (15) is that it is written as a product over this function of adjacent layers (rather than the whole path $V$ ), which will make it much easier to analyze.

The next step in the proof of the moment estimate (6) is to obtain upper and lower bounds for the expression in (15) that match up to corrections of size $\sum_{i=1}^{d}n_{i}^{-2}.$ This is done in Section 3.4. The main idea here is to treat the sum (15) as an expectation where each $V(i)\in[n_{i}]^{k}$ , $i\geq 1$ is chosen independently according to the uniform distribution on $[n_{i}]^{k}$ . The leading term in this expectation comes from the event that the entries $\left\{V_{1}(i),\ldots,V_{k}(i)\right\}$ are all distinct, which happens in layer $i$ with probability $1-O(n^{-1}_{i})$ . When this happens, $C(V(i-1),V(i))=1.$ The subleading term comes from the event that $\left\{V_{1}(i),\ldots,V_{k}(i)\right\}$ has exactly one element that appears twice, with the others distinct. In each layer, the probability of this type of “collision” is, to leading order, $\binom{k}{2}n_{i}^{-1},$ and $C(V(i-1),V(i))$ typically contributes $3/p$ when this happens. Hence, heuristically speaking, we have

[TABLE]

This is almost correct, except at the first layer, where the vector $\vec{u}$ acts as a special initial condition and slightly deforms the term in this product when $i=1.$ Section 3.4 makes this argument precise.
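The heuristic can be tested by simulation in the simplest case $k=2$. The sketch below rests on assumptions that are ours and not taken from the text: all widths equal $n$, the entries are Gaussian, the normalization is $X^{(i)}=D^{(i)}W^{(i)}/\sqrt{pn}$ (so that $\mathbf{E}\|M\vec{u}\|^{2}=1$), and the function names are hypothetical. With one possible collision per layer, the heuristic per-layer factor is $(1-1/n)\cdot 1+(1/n)\cdot 3/p$; for Gaussian entries a direct computation gives the same factor, so the Monte Carlo estimate should agree with the product up to sampling error.

```python
import numpy as np

def fourth_moment_mc(n, d, p, trials, rng):
    """Monte Carlo estimate of E||M u||^4 for M = X^(d) ... X^(1).

    Assumptions (not taken from the text): all widths equal n, Gaussian
    entries, and X^(i) = D^(i) W^(i) / sqrt(p * n) so that E||M u||^2 = 1.
    """
    u = np.zeros((trials, n))
    u[:, 0] = 1.0                                   # unit vector e_1 in every trial
    for _ in range(d):
        W = rng.standard_normal((trials, n, n))     # one weight matrix per trial
        xi = rng.random((trials, n)) < p            # Bernoulli(p) diagonal masks
        u = xi * np.einsum("tab,tb->ta", W, u) / np.sqrt(p * n)
    return float(np.mean(np.sum(u**2, axis=1) ** 2))

def fourth_moment_heuristic(n, d, p):
    """k = 2 heuristic: no collision contributes 1, a collision (prob 1/n) contributes 3/p."""
    return ((1 - 1 / n) + (3 / p) / n) ** d

rng = np.random.default_rng(0)
estimate = fourth_moment_mc(n=16, d=3, p=0.5, trials=10000, rng=rng)
predicted = fourth_moment_heuristic(n=16, d=3, p=0.5)
print(estimate, predicted)
```

The fourth power of the norm is heavy-tailed, so the estimate converges slowly; the comparison is meant as a sanity check rather than a precise verification.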

3.2 Edge Sets, Multiplicities, and Paths

In this section we develop some notation and basic results which are used to clarify the “path counting” needed to prove Proposition 4 below. The major result developed in this section, and needed for Proposition 4, is the enumeration of the set of paths in Lemma 8. We will use the following notation conventions:

• $n,n^{\prime}$ denote natural numbers $\in\mathbb{N}$

• $[n]\ \stackrel{{\scriptstyle\Delta}}{{=}}\ \left\{1,2,\ldots,n\right\}$

• $[n]^{\ell}\ \stackrel{{\scriptstyle\Delta}}{{=}}\ \left\{(x_{1},\ldots,x_{\ell}):\ x_{j}\in[n]\ \forall 1\leq j\leq\ell\right\}$

For $n,n^{\prime}\in\mathbb{N}$ , we will denote by

[TABLE]

the collection of all unordered sets of $\ell$ directed edges in the complete bipartite graph $K_{n,n^{\prime}}$ , which we think of as a directed graph with edges going from $[n]$ to $[n^{\prime}]$ . Note that some edges may appear multiple times: we consider them with multiplicity, thinking of $E\in\Sigma^{\ell}(n,n^{\prime})$ as a multiset (e.g. the directed edge $(1,1)$ can appear twice in $E$ ). To every edge set $E\in\Sigma^{\ell}(n,n^{\prime})$ we will associate the edge multiplicity, $m_{E}\in\mathbb{N}^{n\times n^{\prime}},$ by:

[TABLE]

We will also use the notation:

[TABLE]

Every edge set $E\in\Sigma^{\ell}(n,n^{\prime})$ is uniquely defined by its multiplicity, and we will often find it more convenient to work with the multiplicities rather than the edge sets directly.

We will need to translate back and forth between $E\in\Sigma^{\ell}(n,n^{\prime})$ and the multisets of its left and right endpoints. Specifically, for $E=\{e_{j}=(a_{j},b_{j}),\,j=1,\ldots,\ell\}\in\Sigma^{\ell}(n,n^{\prime})$ we define the multisets

[TABLE]

of right and left endpoints of $E$ counted with multiplicity. Conversely, given ordered sets of left, right endpoints $x\in[n]^{\ell}$ , $y\in[n^{\prime}]^{\ell}$ , we define the corresponding element of $\Sigma^{\ell}(n,n^{\prime})$ by its multiplicity

[TABLE]

This is the set one gets by drawing an edge between each entry of $x$ and the corresponding entry of $y$ and then forgetting the order in which the edges were drawn but remembering the multiplicity. Note that this map from ordered sets of left and right endpoints is many to one. This will come up in our computations, and to keep track of this, we make the following definition.
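As a small illustration of this construction, the multiplicity $m_{x,y}$ can be represented as a counter over directed edges. This is only a sketch: the function name `multiplicity` and the use of Python's `Counter` as a stand-in for the matrix $m_{E}\in\mathbb{N}^{n\times n^{\prime}}$ are our own choices. Reordering $x$ and $y$ by the same permutation gives the same multiset, which is the many-to-one behaviour noted above.

```python
from collections import Counter

def multiplicity(x, y):
    """Edge multiplicity m_{x,y}: how often each directed edge (x_i, y_i) occurs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return Counter(zip(x, y))

m1 = multiplicity((1, 1, 2), (1, 1, 3))
m2 = multiplicity((2, 1, 1), (3, 1, 1))   # same edges, drawn in another order
print(m1 == m2, dict(m1))
```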

Definition 5.

Fix some edge set $E\in\Sigma^{\ell}(n,n^{\prime})$ , with corresponding edge multiplicity $m_{E}\in\mathbb{N}^{n\times n^{\prime}}$ and some $\ell$ -tuple $y\in[n^{\prime}]^{\ell}$ so that as unordered multisets

[TABLE]

Define:

[TABLE]

Lemma 6.

$c_{\ell}(m_{E})$ is well defined. That is, the enumeration depends only on $E$ and not on the choice of $y$ stated in Definition 5. Moreover, $c_{\ell}(m_{E})$ has the following explicit formula in terms of multinomial coefficients:

[TABLE]

Proof.

To see that $\left|\left\{x\in[n]^{\ell}:\ m_{x,y}=m_{E}\right\}\right|$ does not depend on $y$ note that for any $y^{\prime}\in[n^{\prime}]^{\ell}$ , we have

[TABLE]

if and only if $y^{\prime}=\sigma(y)$ for some $\sigma$ in the symmetric group on $\ell$ elements. Further, for any such $\sigma$ , we have

[TABLE]

Thus, $x\mapsto\sigma(x)$ is a bijection between $\left\{x\in[n]^{\ell}:\ m_{x,y}=m_{E}\right\}$ and $\left\{x\in[n]^{\ell}:\ m_{x,y^{\prime}}=m_{E}\right\}$ for any permutation $\sigma\in S_{\ell}$ , proving that $c_{\ell}(m_{E})$ is indeed well-defined. To obtain the multinomial coefficient formula for $c_{\ell}(m_{E})$ , for each $t\in\mathbb{N},\,\ell\in\mathbb{N},\,x\in[t]^{\ell}$ define the set of indices:

[TABLE]

and for every $I\subseteq[\ell]$ define for $x=(x_{1},\ldots,x_{\ell})$ the multiset of entries of $x$ :

[TABLE]

With this notation, we have

[TABLE]

Thus, enumerating $\left|\left\{x\in[n]^{\ell}:\ m_{x,y}=m_{E}\right\}\right|$ amounts to counting the number of ways the indices of $x$ can be arranged in order to satisfy (20). This is counted by multinomial coefficients, and the formula (18) then follows by standard enumeration principles. ∎
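Both claims of the lemma can be checked by brute force on a small example. The sketch below is illustrative only: the helper names are hypothetical, and the closed form used in `multinomial_count` (one multinomial coefficient per right endpoint, i.e. the number of ways to distribute the prescribed left endpoints over the positions of $y$ carrying that right endpoint) is our reading of the enumeration described in the proof, assumed rather than quoted from (18).

```python
from collections import Counter
from itertools import product
from math import factorial

def multiplicity(x, y):
    """Edge multiplicity m_{x,y} as a counter over directed edges."""
    return Counter(zip(x, y))

def count_left_tuples(m_E, y, n):
    """Brute force: number of x in [n]^l with m_{x,y} = m_E."""
    return sum(1 for x in product(range(1, n + 1), repeat=len(y))
               if multiplicity(x, y) == m_E)

def multinomial_count(m_E):
    """Assumed closed form: one multinomial coefficient per right endpoint."""
    in_degree = Counter()
    for (_, b), mult in m_E.items():
        in_degree[b] += mult
    total = 1
    for deg in in_degree.values():
        total *= factorial(deg)
    for mult in m_E.values():
        total //= factorial(mult)
    return total

m_E = multiplicity((1, 2, 1), (1, 1, 2))            # edges (1,1), (2,1), (1,2)
count_y = count_left_tuples(m_E, (1, 1, 2), n=2)
count_y_perm = count_left_tuples(m_E, (2, 1, 1), n=2)   # y reordered
print(count_y, count_y_perm, multinomial_count(m_E))
```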

Our path counting approach to proving Proposition 4 involves the combinatorics of certain paths decorated by the moments of the measure $\mu$ according to which the entries of the matrices ${W}^{(i)}$ are drawn. Accordingly, for each $n,n^{\prime}\in\mathbb{N}$ , we associate a weight to an edge multiplicity $m_{E}\in\mathbb{N}^{n\times n^{\prime}}$ given in terms of the moments of the measure $\mu$ by:

[TABLE]

where the expectation is with respect to $\mu.$ In the proof of Proposition 4 we will consider sequences of compatible edge sets in the sense of the following definition.

Definition 7.

Let $d\in\mathbb{N}$ and let $n_{0},n_{1},\ldots,n_{d}\in\mathbb{N}$ . Let $\Sigma^{\ell}(n_{0},\ldots,n_{d})$ denote the set of edge sequences $E(1),\ldots E(d)$ which satisfy:

[TABLE]

The second condition ensures the endpoints of the edges of one layer are compatible with the edges from the next layer. Further, define for each $\ell$ the set of ordered paths:

[TABLE]

Given $\gamma\in\Gamma^{\ell}$ define the edge sequence $E^{\gamma}\in\Sigma^{\ell}(n_{0},\ldots,n_{d})$ corresponding to $\gamma$ by specifying the multiplicities

[TABLE]

The formula (24) below will be used in the proof of Proposition 4.

Lemma 8.

Let $d\in\mathbb{N}$ and let $n_{0},n_{1},\ldots,n_{d}\in\mathbb{N}$ and $\ell\in\mathbb{N}$ . Consider $v\in[n_{d}]^{\ell}$ and any edge sequence $E\in\Sigma^{\ell}(n_{0},\ldots,n_{d})$ with

[TABLE]

Then, the number of ordered paths $\gamma\in\Gamma^{\ell}$ which have the same edge sequence as $E$ and have $\gamma(d)=v$ is given by:

[TABLE]

Proof.

The proof is by induction on $d.$ When $d=1$ , the left hand side of (24) is precisely the number of $\gamma(0)$ so that $\gamma=(\gamma(0),v)$ has $E^{\gamma}=E,$ which by definition of $c_{\ell}$ , equals $c_{\ell}(m_{E(1)})$ . Let us now suppose we have proved the statement for $d=1,\ldots,D-1$ with $D\geq 2.$ By denoting $\gamma(D-1)=\chi$ , and counting the number of possibilities for $\gamma(D)$ with $\gamma(D-1)=\chi$ , we write the left hand side of (24) as the sum

[TABLE]

Note that since $m_{\chi,v}=m_{E(D)},$ we find that $\chi$ coincides with the right endpoints $R(E(D-1))$ . Hence, by the inductive hypothesis, every term appearing in the sum from equation (25) is equal to $\prod_{i=1}^{D-1}c_{\ell}\left(m_{E(i)}\right)$ and does not depend on $\chi$ (since $c_{\ell}(m_{E(D-1)})$ depends only on the right endpoints of $E(D-1)$ and not on their order). The number of terms in the sum from equation (25) is exactly $c_{\ell}(m_{E(D)}),$ by the definition of $c_{\ell}$ . The total is therefore $c_{\ell}(m_{E(D)})\times\prod_{i=1}^{D-1}c_{\ell}\left(m_{E(i)}\right)$ , completing the induction. ∎
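The product formula of Lemma 8 can be confirmed by exhaustive enumeration for small widths. In the sketch below we assume that $\Gamma^{\ell}=[n_{0}]^{\ell}\times\cdots\times[n_{d}]^{\ell}$ and that $c_{\ell}$ is a product of multinomial coefficients, one per right endpoint; both are our reading of the definitions, and the names `c` and `check_lemma8` are hypothetical. The code groups all ordered paths ending at a fixed $v$ by their edge sequence and compares each group size with $\prod_{i}c_{\ell}(m_{E(i)})$.

```python
from collections import Counter, defaultdict
from itertools import product
from math import factorial

def multiplicity(x, y):
    """Hashable edge multiplicity: frozenset of ((a, b), count) pairs."""
    return frozenset(Counter(zip(x, y)).items())

def c(m):
    """Assumed product-of-multinomials form of c_l(m)."""
    in_degree = Counter()
    for (_, b), mult in m:
        in_degree[b] += mult
    total = 1
    for deg in in_degree.values():
        total *= factorial(deg)
    for _, mult in m:
        total //= factorial(mult)
    return total

def check_lemma8(widths, v):
    """Group all ordered paths ending at v by edge sequence; compare counts."""
    ell = len(v)
    layers = [list(product(range(1, n + 1), repeat=ell)) for n in widths[:-1]]
    counts = defaultdict(int)
    for prefix in product(*layers):
        gamma = list(prefix) + [tuple(v)]
        E = tuple(multiplicity(gamma[i - 1], gamma[i])
                  for i in range(1, len(gamma)))
        counts[E] += 1
    for E, count in counts.items():
        predicted = 1
        for m in E:
            predicted *= c(m)
        if count != predicted:
            return False
    return True

print(check_lemma8((2, 3, 2), (1, 1, 2)))
```

Only edge sequences that are realized by some path are checked, which is exactly the compatible case covered by the lemma.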

3.3 Proof of Proposition 4

The first step in proving Proposition 4 is to express $\mathbf{E}\left[\left\|{M}\vec{u}\right\|^{2k}\right]$ as a sum over certain collections of $2k$ paths.

Definition 9.

Let $Q^{2k}$ be the set of $2k$ tuples of paths:

[TABLE]

where $\Gamma^{2k}$ was defined in (22). Our notation is that if $\gamma\in Q^{2k}$ , then $\gamma(i)\in[n_{i}]^{2k}$ is a $2k$ -tuple for each $0\leq i\leq d$ .

Lemma 10.

For a $2k$ -tuple $x=\left(x_{1},\ldots,x_{2k}\right)\in[n]^{2k}$ , let $\#x$ be the number of unique elements in $x$ . Let $\vec{u}=(u_{1},\ldots,u_{n_{0}})\in\mathbb{R}^{n_{0}}.$ Then:

[TABLE]

Proof.

Note that the entries of the $n_{d}\times n_{0}$ matrix ${M}$ can be written as a sum over certain paths in $\Gamma$ , namely:

[TABLE]

Using this interpretation in terms of paths, we obtain by indexing the starting points as $b_{1},b_{2}\in[n_{0}]$ and the ending point as $a\in[n_{d}]$ , that we can write $\left\|{M}\vec{u}\right\|^{2}$ as a sum over $\gamma\in Q^{2}$ :

[TABLE]

Similarly, the $k$ -th power is then given by:

[TABLE]

The result of the lemma follows by taking expectation of both sides, using the independence of the random variables $\xi_{b}^{(i)}$ and $W_{a,b}^{(i)}$ , and the relations

[TABLE]

∎

Definition 11.

Since the law $\mu$ of the entries of the matrices ${W}^{(i)}$ is assumed to be symmetric around $0,$ the odd moments of $\mu$ are all zero, and it will be useful to consider only edge sets that are “even” in the following sense:

[TABLE]

as well as to define the related sets

[TABLE]

Lemma 12.

With the same notation as in Lemma 10, we have

[TABLE]

Proof.

Because the variables $W_{\alpha,\beta}^{(i)}$ are symmetric around $0$, all their odd moments vanish. Thus, in the expression (26), only collections of paths in which every edge is traversed an even number of times give a non-zero contribution. What remains are exactly the paths from $Q_{even}^{2k}$ by the definition in equation (27). ∎

Proof of Proposition 4.

Recall the definition of the edge sequences $\Sigma^{\ell}(n_{0},\ldots,n_{d})$ and the notation $E^{\gamma}\in\Sigma^{\ell}(n_{0},\ldots,n_{d})$ for paths $\gamma\in\Gamma^{\ell}$ from Definition 7 (In this proof, we will use this definition when $\ell=2k$ for paths $\gamma\in\Gamma^{2k}$ and when $\ell=k$ for paths $V\in\Gamma^{k}$ ). Fix any $v\in[n_{d}]^{k}$ . Let $\chi(v)\stackrel{{\scriptstyle\Delta}}{{=}}(v_{1},v_{1},\ldots,v_{k},v_{k})\in[n_{d}]^{2k}$ be $v$ with the entries doubled. For any function of edge sequences, $f:\Sigma^{2k}(n_{0},n_{1},\ldots,n_{d})\to\mathbb{R}$ , (it will be more convenient to write $f(E)=f(m_{E})$ , thinking of $f$ as a function of the multiplicities of the edge set), consider the following identity for sums over $\gamma\in Q^{2k}_{even}$ that end at $\gamma(d)=\chi(v)$ :

[TABLE]

Here $\chi(E)$ denotes doubling all the edges (i.e. the multiplicities double, $m_{\chi(E)}=2m_{E}$ ) and we have used the fact that every even edge sequence $E\in\Sigma^{2k}_{even}(n_{0},\ldots,n_{d})$ arises by taking a sequence $V\in\Gamma^{k}$ and doubling the multiplicity of the edges. (Note that there may be multiple choices of $V$ for each $E\in\Sigma^{2k}_{even}(n_{0},\ldots,n_{d})$ , which is why we have to divide by the size of this set to account for this many-to-one-ness.) We now apply Lemma 8 to both the numerator (with $\ell=2k$ ) and the denominator (with $\ell=k$ ) to see that the enumeration depends only on the edge set $E^{V}$ and the endpoints of the last layer $V(d)=v$ :

[TABLE]

Summing over all possible endpoints $v\in[n_{d}]^{k}$ now gives the identity:

[TABLE]

Finally, using this identity on (29), with $f$ being the function that appears inside the sum over $Q^{2k}_{even}$ , gives the desired result of Proposition 4. ∎

3.4 Completion of Proof of Equation (6)

Definition 13.

We think of the sum in Proposition 4 as an expectation over discrete random variables $V(i)$ . Specifically, we write:

[TABLE]

where $\mathcal{E}_{\vec{u}}$ is defined to be the expectation with respect to a product measure on sequences $V$ , in which the $k$ entries of $V(0)\in[n_{0}]^{k}$ are chosen i.i.d. from the measure $(u_{1}^{2},\ldots,u_{n_{0}}^{2})$ (i.e. $\mathcal{P}(V_{a}(0)=j)=u^{2}_{j}$ for every $1\leq j\leq n_{0}$ ; this is a probability measure since $u$ is a unit vector), and the $k$ entries of $V(i)\in[n_{i}]^{k}$ are chosen i.i.d. from the uniform measure on $[n_{i}]$ for every $i\geq 1$ (i.e. $\mathcal{P}(V(i)=v)=\frac{1}{n_{i}^{k}}$ for any $v\in[n_{i}]^{k}$ , $i\geq 1$ ). In order to prove that the rightmost product in (31) equals the right hand side of (6), we introduce some notation. Namely, for $n\in\mathbb{N},$ we partition the set $\left[n\right]^{k}$ into three pieces:

[TABLE]

Informally, $U$ stands for “unique entries”, and consists of those $k$ -tuples with no repeated entries; $P$ stands for “one pair” and consists of those $k$ -tuples with exactly one repeated entry; $B$ stands for “bad” and consists of everything else. Formally,

[TABLE]

Lemma 14.

For each $i\geq 1$ , under the uniform measure on $[n_{i}]^{k}$ , each random variable $V(i)$ has the following probabilities for the events $\left\{V(i)\in U_{n_{i}}\right\}$ , $\left\{V(i)\in P_{n_{i}}\right\}$ , $\left\{V(i)\in B_{n_{i}}\right\}$ :

[TABLE]

Proof.

The proof is an elementary exercise in discrete probability. ∎
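For small $n$ and $k$ the three probabilities can be computed exactly by enumeration. The sketch below is ours and uses only the informal description of $U$, $P$ and $B$ given above; the closed forms it is compared against, $n(n-1)\cdots(n-k+1)/n^{k}$ for $U$ and $\binom{k}{2}\,n(n-1)\cdots(n-k+2)/n^{k}$ for $P$, are standard counting facts and agree with the leading-order values $1-O(n^{-1})$ and $\binom{k}{2}n^{-1}$ used in the outline.

```python
from fractions import Fraction
from itertools import product
from math import comb, perm

def classify(x):
    """'U' if all entries are distinct, 'P' if exactly one value appears twice, else 'B'."""
    distinct = len(set(x))
    if distinct == len(x):
        return "U"
    if distinct == len(x) - 1:
        return "P"
    return "B"

def exact_probabilities(n, k):
    """Exact probabilities of U, P, B under the uniform measure on [n]^k."""
    counts = {"U": 0, "P": 0, "B": 0}
    for x in product(range(n), repeat=k):
        counts[classify(x)] += 1
    return {label: Fraction(cnt, n**k) for label, cnt in counts.items()}

n, k = 6, 3
probs = exact_probabilities(n, k)
closed_U = Fraction(perm(n, k), n**k)
closed_P = Fraction(comb(k, 2) * perm(n, k - 1), n**k)
print(probs, closed_U, closed_P)
```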

Lemma 15.

Subdivide the “one pair” set by which indices are paired: $P_{n}\stackrel{{\scriptstyle\Delta}}{{=}}\cup_{a\neq b}P_{n}(a,b)$ where $P_{n}(a,b)=P_{n}\cap\left\{x\in[n]^{k}:x_{a}=x_{b}\right\}$ for $a\neq b$ . Then for each $k\geq 1,\,i=1,\ldots,d,$ we have

[TABLE]

where the implicit constant in $\Theta(1)$ is bounded below by $1$ and above by $\mu_{2k}(2k-1)!!p^{1-k}.$

Proof.

This is an elementary calculation from the definition of $C$ . If $V(i)\in U_{n_{i}}$ , then $\#V(i)=k$ and the multiplicities of edges in the edge set $E^{V(i-1),V(i)}$ are all $1$ , which makes the combinatorial factor in $C(V(i-1),V(i))$ equal to $1$ , and every edge is covered exactly twice, giving a factor of $\mu^{k}_{2}=1$ in the weight term. If $V(i)\in P_{n_{i}}$ , then $\#V(i)=k-1$ , giving a factor of $\frac{1}{p}$ . Moreover, in this case, when the indices which are paired in $V(i)$ are also paired in $V(i-1)$ , all the combinatorial factors are again $1$ , and the weight term is $\mu^{k-2}_{2}\mu_{4}=\mu_{4}$ (one edge is covered four times and the remaining $k-2$ edges twice). If the paired indices from $V(i)$ are not paired in $V(i-1)$ , then the combinatorial term is $1^{k-2}\frac{\binom{2\cdot 2}{2\cdot 1}}{\binom{2}{1}}=3$ , and the weight term is $\mu^{k}_{2}=1$ . ∎
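The combinatorial factors in this proof can be reproduced mechanically. The sketch below assumes, as our reading of the proof of Proposition 4, that the combinatorial part of $C$ is the ratio $c_{2k}(2m)/c_{k}(m)$ with $c$ a product of multinomial coefficients; the helper names are hypothetical. It evaluates the ratio for the three cases discussed above with $k=3$.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def c(m):
    """Assumed product-of-multinomials form of the combinatorial factor."""
    in_degree = Counter()
    for (_, b), mult in m.items():
        in_degree[b] += mult
    total = 1
    for deg in in_degree.values():
        total *= factorial(deg)
    for mult in m.values():
        total //= factorial(mult)
    return total

def combinatorial_ratio(x, y):
    """Ratio c(2m) / c(m) for the edge multiplicity m = m_{x,y}."""
    m = Counter(zip(x, y))
    doubled = Counter({edge: 2 * mult for edge, mult in m.items()})
    return Fraction(c(doubled), c(m))

ratio_unique = combinatorial_ratio((1, 2, 3), (1, 2, 3))       # V(i) in U
ratio_pair_unpaired = combinatorial_ratio((1, 2, 3), (1, 1, 3))  # pair in V(i) only
ratio_pair_paired = combinatorial_ratio((1, 1, 3), (1, 1, 3))    # pair in both layers
print(ratio_unique, ratio_pair_unpaired, ratio_pair_paired)
```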

Lemma 16.

We have

[TABLE]

where the quantities $\psi_{U},T_{U},\psi_{P},T_{P},\psi_{B},T_{B}$ are:

[TABLE]

Proof.

Note that for any fixed $j=1,\ldots,d-1$ and $v\in[n_{j}]^{k}$ , we have the following conditional independence of layers before $V(j)$ and after $V(j)$ :

[TABLE]

where in the second term we write $\mathcal{E}$ instead of $\mathcal{E}_{\vec{u}}$ since the measure no longer depends on $u.$ Applying this with $j=1,$ we find

[TABLE]

where in the second term the random variables $V(2),\ldots,V(d)$ are uniform and do not depend on $u$ or $v$ . An elementary probability computation using Lemma 15 and the measure $(u^{2}_{1},\ldots,u^{2}_{n_{0}})$ on $V(0)$ shows that

[TABLE]

where the implicit constant in the last term is bounded below by $1$ and above by $\mu_{2k}(2k-1)!!p^{1-k}.$ Combining the result of Lemma 14, with (33) and (34), proves Lemma 16.

∎

Lemma 17.

Recall the definition of $T_{\ast}\ \stackrel{{\scriptstyle\Delta}}{{=}}\ \mathcal{E}\left[\prod_{i=2}^{d}C(V(i-1),V(i))\ \big|\ V(1)\in\ast_{n_{1}}\right]$ for $\ast\in\{U,P,B\}$ from Lemma 16. Define the indicator functions

[TABLE]

Then, for any choice of the label $\ast\in\{U,B,P\}$ , we have that:

[TABLE]

where we’ve introduced

[TABLE]

Proof of Lemma 17.

By using the possible values for $C$ computed in Lemma 15 and the definition of $T_{\ast}$ , we have that for any label $\ast\in\{U,P,B\}$ :

[TABLE]

where we’ve abbreviated

[TABLE]

Note that

[TABLE]

This proves the upper bound in (35). The lower bound similarly follows:

[TABLE]

∎

Completion of Proof of Relation (6).

We first notice, by application of the elementary probability estimate recorded in Lemma 18, that the upper and lower bounds on $T_{\ast}$ given in Lemma 17 are equal up to small errors. We have for $\ast\in\{U,P,B\}$ :

[TABLE]

(where $\widehat{K}$ is as in Lemma 17). Finally, putting these values for $T_{U},T_{P},T_{B}$ into the result of Lemma 16 we see:

[TABLE]

The last line follows from the elementary fact for exponentials $e^{x-\frac{1}{2}x^{2}}\leq 1+x\leq e^{x}$ for $x\geq 0$ . ∎
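This two-sided bound is what converts a product of factors $1+x_{i}$ into an exponential of $\sum_{i}x_{i}$ up to an error controlled by $\sum_{i}x_{i}^{2}$. A quick numerical check on a grid, given purely as an illustration (the grid range and the function name are arbitrary choices of ours):

```python
import numpy as np

def check_exponential_bounds(x_max=50.0, num=200001):
    """Check exp(x - x^2/2) <= 1 + x <= exp(x) on a grid in [0, x_max]."""
    x = np.linspace(0.0, x_max, num)
    lower_ok = bool(np.all(np.exp(x - 0.5 * x**2) <= 1 + x))
    upper_ok = bool(np.all(1 + x <= np.exp(x)))
    return lower_ok, upper_ok

lower_ok, upper_ok = check_exponential_bounds()
print(lower_ok, upper_ok)
```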

3.5 An elementary probability estimate

Lemma 18.

Let $A_{0},A_{1},\ldots,A_{d}$ be independent events with probabilities $p_{0},\ldots,p_{d}$ and $B_{0},\ldots,B_{d}$ be independent events with probabilities $q_{0},\ldots,q_{d}$ such that

[TABLE]

Denote by $X_{i}$ the indicator that the event $A_{i}$ happens, $X_{i}:=\mathtt{1}\left\{A_{i}\right\}$ , and by $Y_{i}$ the indicator that $B_{i}$ happens, $Y_{i}=\mathtt{1}\{B_{i}\}$ . Further, fix for every $i\in 1,\ldots,d$ some $\alpha_{i}\geq 1,K_{i}\geq 1$ as well as $\gamma_{i}>0$ . Define

[TABLE]

Then, if $\gamma_{i}\geq 1$ for every $i$ , we have:

[TABLE]

where by convention $\alpha_{0}=\gamma_{0}=1.$ In contrast, if $\gamma_{i}\leq 1$ for every $i$ , we have:

[TABLE]

Proof of Lemma 18.

The proof goes by induction on $d$ . The base case $d=1$ can be computed directly

[TABLE]

which is verified to obey the stated inequalities under the convention $\alpha_{0}=\gamma_{0}=1$ . To see the induction step, suppose that $d\geq 2$ . Define the filtration $\mathcal{F}_{j}=\sigma\left(X_{0},Y_{0},\ldots,X_{j},Y_{j}\right)$ . We have, from the definition of $Z_{i}$ that

[TABLE]

We compute by directly examining what happens when $X_{d-1}=0$ and when $X_{d-1}=1$ , that

[TABLE]

Now notice that, since $X_{d-1}Z_{d-1}$ vanishes when $X_{d-1}=0$ , and since $A_{d-1}\cap B_{d-1}=\emptyset$ , we have that

[TABLE]

Hence, when $\gamma_{d}\geq 1$ , since $Z_{j}\leq Z_{j+1}$ for every $j$ , we have the estimate

[TABLE]

and hence obtain from equation (39)

[TABLE]

which is the desired inequality to prove the induction step for the upper bound. To see the lower bound, we will actually prove the lower bound for the sequence

[TABLE]

This is what one gets if all the parameters $K_{i}$ are equal to $1$ , so clearly $Z_{i}\geq\widehat{Z}_{i}$ and it is sufficient to bound this new sequence. Notice that since $\gamma_{d-1}\leq 1$ , we have $\widehat{Z}_{d-2}\ \leq\ \gamma_{d-1}^{-1}\widehat{Z}_{d-1}$ , so applying equation (40) to this sequence gives that:

[TABLE]

so by equation (39) applied to the $\widehat{Z}_{i}$ sequence, we have then (keeping in mind $\gamma_{d}-1<0$ reverses the inequality):

[TABLE]

which is the desired inequality for the induction step on the lower bound. ∎

4 Proof of Theorem 1: Quantitative Martingale CLT

In this section, we explain the proof of the distribution estimates in equation (7) in Theorem 1 modulo the proof of several key technical results, which are proved in Sections 4.1 and 4.2 below.

We first recall the notation. Namely, fix $0<p\leq 1$ and consider a fixed measure $\mu$ satisfying (3). For every $i=1,\ldots,d$ take independent random $n_{i}\times n_{i-1}$ matrices ${W}^{(i)}$ with all the entries of ${W}^{(i)}$ drawn i.i.d. from $\mu$ and, for each $i=1,\ldots,d$ , consider $n_{i}\times n_{i}$ diagonal matrices ${D}^{(i)}=\text{diag}\left(\xi^{(i)}_{1},\ldots,\xi^{(i)}_{n_{i}}\right)$ where $\xi_{a}^{(i)}$ are i.i.d. $\left\{0,1\right\}$-valued Bernoulli $(p)$ variables, i.e. $\mathbf{P}\left(\xi=1\right)=1-\mathbf{P}\left(\xi=0\right)=p$ . The key objects of study are, for $i=0,1,\ldots$ , the random $n_{i}\times n_{0}$ matrices

[TABLE]

The estimates (7) concern the distribution, for any fixed unit vector $\vec{u}^{(0)}\in\mathbb{R}^{n_{0}}$ , of

[TABLE]

Notice that the sequence $\vec{u}^{(i)}$ is equivalently defined recursively as:

[TABLE]

With this notation the relation (7) we seek to show becomes the statement that for every $m\geq 1$

[TABLE]

where $\mathcal{N}(\mu,\sigma^{2})$ is the Gaussian, and

[TABLE]

The idea of the proof is to look at the quantity $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ as the value of a martingale at time $d$ with respect to the filtration

[TABLE]

i.e. $\mathcal{F}_{i}$ is the sigma algebra generated by the random variables in the first $i$ layers. The basic idea of our proof is to deduce the approximate normality of $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ by applying a martingale CLT with rate (see Theorem 23). Specifically, note that $\ln\left(\left\|\vec{u}^{(0)}\right\|_{2}^{2}\right)=0$ , since $\vec{u}^{(0)}$ is a unit vector. Hence, $\ln\left(\left\|\vec{u}^{(d)}\right\|_{2}^{2}\right)$ is a telescoping sum (modulo the complication discussed below that $\left\|\vec{u}^{(i)}\right\|$ could vanish):

[TABLE]

and we will think of each term of the sum as an increment. By subtracting off the conditional means, this will yield a martingale difference sequence which can be analyzed. It will turn out that the variances of these increments satisfy:

[TABLE]

For $i\geq 1$ , we will typically have that

[TABLE]

and therefore the term involving the fourth moment $\mu_{4}$ will be of size $O(n_{i}^{-2})$ for all except the first layer when $i=0$ . The sum of these increment variances is precisely our variance parameter $\beta$ (modulo terms like $n_{i}^{-2}$ ). This informally explains the appearance of $\left\|\vec{u}^{(0)}\right\|_{4}^{4}$ in the formula for $\beta$ , and why the terms from other layers do not depend on the higher moments of $\mu$ .
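The telescoping decomposition lends itself to a direct simulation. The sketch below is an illustration under assumptions of our own: $p=1$ (no Bernoulli masks, so $\vec{u}^{(i)}$ never vanishes), equal widths $n$, Gaussian entries of variance $1/n$, and $\vec{u}^{(0)}=e_{1}$. For comparison we use $\beta=2d/n$, a value we assume from the $k=2$ moment heuristic in this special case rather than quote from the formula for $\beta$; the sample mean and variance of $\ln\|\vec{u}^{(d)}\|_{2}^{2}$ should then be close to $-\beta/2$ and $\beta$.

```python
import numpy as np

def log_norm_samples(n, d, trials, rng):
    """Samples of ln ||u^(d)||_2^2 for a product of d Gaussian n x n matrices.

    Assumptions: p = 1 (no Bernoulli masks, so u^(i) never vanishes), equal
    widths, entries N(0, 1/n), and u^(0) = e_1.
    """
    log_norm = np.zeros(trials)
    u = np.zeros((trials, n))
    u[:, 0] = 1.0
    for _ in range(d):
        W = rng.standard_normal((trials, n, n)) / np.sqrt(n)
        new_u = np.einsum("tab,tb->ta", W, u)
        ratio = np.sum(new_u**2, axis=1) / np.sum(u**2, axis=1)
        log_norm += np.log(ratio)       # one increment of the telescoping sum
        # Renormalize: the next ratio depends only on the direction of u.
        u = new_u / np.linalg.norm(new_u, axis=1, keepdims=True)
    return log_norm

rng = np.random.default_rng(0)
n, d = 40, 20
samples = log_norm_samples(n=n, d=d, trials=2000, rng=rng)
beta = 2 * d / n                        # assumed variance parameter for this setup
print(samples.mean(), -beta / 2, samples.var(), beta)
```

Accumulating the logarithms of the ratios, rather than taking the logarithm of the final norm, mirrors the increment structure used in the proof and avoids numerical overflow or underflow for large $d$.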

To give a precise proof of (42), we must deal with a wrinkle in the strategy described above: with a small but positive probability the vector $\vec{u}^{(i)}$ vanishes, making the ratio of the norms of the vectors $\vec{u}^{(i)},\vec{u}^{(i-1)}$ in (43) undefined. Since the entries of the weight matrices ${W}$ are drawn from a measure with no atoms, this can almost surely only happen if the Bernoulli variables in some layer are all equal to zero. To take this into account, we define the events

[TABLE]

where we’ve abbreviated

[TABLE]

In addition, we will find it convenient to fix a truncation level $0<\alpha<1,$ and set

[TABLE]

We will study the sequence of martingale increments

[TABLE]

that coincide, with high probability, with the martingale difference sequence associated to $\ln(S^{(i)})$ (see Lemma 22), where by convention we define the product $\ln_{\alpha}\left(S^{(i)}/S^{(i-1)}\right)\mathtt{1}_{A_{i-1}}$ to be zero on the event $A^{c}_{i-1}$ when $S^{(i-1)}=0.$ To prove the approximate normality of $\ln(||\vec{u}^{(i)}||_{2}^{2}),$ we first prove the approximate normality of $\sum_{i}X_{i}$ in the following Proposition.

Proposition 19.

We have that:

[TABLE]

Moreover, for any fixed $0<\alpha<1$ , the sum $\sum_{i=1}^{d}X_{i}$ is approximately normally distributed in the sense that

[TABLE]

We prove Proposition 19 in Section 4.1 below. The next result shows that the sum of the conditional expectations in $\sum_{i}X_{i}$ contributes a constant $\beta/2$ up to errors of the form $\sum_{i}n_{i}^{-2}.$

Proposition 20.

For any fixed $0<\alpha<1$ , we have

[TABLE]

where $Y$ is a random variable satisfying

[TABLE]

Proposition 20 follows from Proposition 28 below. To combine Propositions 19 and 20, we will need the following simple result about perturbations under the $KS$ -distance.

Lemma 21 (Properties of $d_{KS}$ ).

If $\mathcal{N}(0,\beta)$ is a centered Gaussian with variance $\beta$ , $X$ is any random variable, and $Y$ is a positive random variable, then there is a universal constant $C$ so that we have:

[TABLE]

For any $k>0$ , there exists a constant $C$ so that

[TABLE]

Further, if $X,Y$ are any two random variables on the same probability space, then:

[TABLE]
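
For readers who want to experiment with $d_{KS}$ numerically, here is a minimal sketch of the Kolmogorov-Smirnov distance between the empirical distribution of a finite sample and $\mathcal{N}(0,\beta)$. The function names `normal_cdf` and `ks_to_gaussian` are hypothetical, and the empirical distribution is only a stand-in for the law of a random variable $X$.

```python
import math

def normal_cdf(t, beta=1.0):
    """CDF of a centered Gaussian with variance beta, evaluated at t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0 * beta)))

def ks_to_gaussian(sample, beta=1.0):
    """sup_t |F_n(t) - Phi_beta(t)| where F_n is the empirical CDF of sample."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, beta)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so the supremum
        # is attained just before or at one of the sample points.
        worst = max(worst, i / n - f, f - (i - 1) / n)
    return worst

# Rescaling the sample by c and the variance by c^2 leaves the distance unchanged.
print(ks_to_gaussian([1.0, 2.0, 3.0], beta=1.0))
print(ks_to_gaussian([2.0, 4.0, 6.0], beta=4.0))
```

The two printed values agree up to rounding, which is the scale invariance that allows a reduction to unit variance when working with this distance.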

Combining Proposition 19, Proposition 20 and Lemma 21, we obtain

[TABLE]

Finally, combining the following estimate with (49) completes the proof of Theorem 1.

Lemma 22.

For any fixed $0<\alpha<1$ and any $m\geq 1$ we have

[TABLE]

4.1 Proof of Proposition 19

In the proof of Proposition 19, we will use the notation

[TABLE]

and we will say that a random variable $Y$ is $O_{a.s.}(f(n_{i-1}))$ if there exists $C>0$ independent of $n_{i},d$ so that $\left|Y\right|\leq Cf(n_{i-1})$ almost surely. The constant $C$ may depend on the moments of the measure $\mu$ and on $p$ , which we think of as fixed. To conclude the approximate normality (46) we will use the following theorem.

Theorem 23 (Special Case of Martingale CLT with Rate [13]).

Suppose that $X_{0},X_{1},\ldots$ is a martingale difference sequence with respect to a filtration $\{\mathcal{F}_{i},\,i=0,1,\ldots\}$ . Then

[TABLE]

The following Proposition allows us to control the $2^{nd}$ and $4^{th}$ moments of $X_{i}$ appearing in (51).

Proposition 24.

For any $0<\alpha<1$ , we have that the conditional 2nd and 4th moments of $X_{i}$ are:

[TABLE]

Moreover, for any $i\geq 1$ and any $j\leq i-1$ ,

[TABLE]

We will prove Proposition 24 in Section 4.1.1 below. To complete the proof of Proposition 19, note that Proposition 24 yields

[TABLE]

Hence, in particular,

[TABLE]

Thus, (46) follows from the previous line together with (53), (48), and

[TABLE]

To prove this bound, we begin by using Proposition 24 to establish two inequalities which hold for any fixed $i$ :

[TABLE]

Now notice that if $1\leq j\leq d$ is another index so that $j<i$ , then if we take the $\mathcal{F}_{j-1}$ -conditional expectation of equation (56), we have by using (54) to bound the expectation of $\frac{F^{(i-1)}}{(S^{(i-1)})^{2}}$ (along with the elementary fact ${\mathbb{E}}\left[|Z-{\mathbb{E}}\left[Z\right]|\right]\leq 2{\mathbb{E}}\left[Z\right]$ for positive random variables) and the fact that ${\mathbb{E}}\left[\left|1_{A_{i}}-\mathbf{P}(A_{i})\right|\right]=2\mathbf{P}(A_{i})\mathbf{P}(\bar{A_{i}})=O(p^{n_{i}})=O(n_{i}^{-1})$ that:

[TABLE]

With these inequalities in hand, we proceed by expanding the square as follows:

[TABLE]

The diagonal terms, when $i=j$ , are bounded since ${\mathbb{E}}\left[\left({\mathbb{E}}\left[X_{i}^{2}\right]-{\mathbb{E}}\left[X_{i}^{2}\,\left|\,\mathcal{F}_{i-1}\right.\right]\right)^{2}\right]=O(n_{i}^{-2})$ by the bound in equation (57). In the remaining off-diagonal terms, by first taking the $\mathcal{F}_{j-1}$ -conditional expectation out via the tower property, we have by the inequality (58) that:

[TABLE]

Finally, summing all the bounds for diagonal and off-diagonal entries we see that this entire numerator from equation (51) is bounded by $O(\sum_{i=1}^{d}n_{i}^{-2})$ , which proves (55) and completes the proof of Proposition 19 modulo checking Proposition 24.

4.1.1 Proof of Proposition 24

We begin by establishing some preliminary results.

Lemma 25.

Let $n,m\in\mathbb{N}$ be two layer widths, and let $\vec{u}\in\mathbb{R}^{m}$ be any non-zero fixed vector. Let ${D}=\text{diag}\left(\xi_{1},\ldots,\xi_{n}\right)\in\mathbb{R}^{n\times n}$ , $\mathbf{P}\left(\xi_{i}=1\right)=1-\mathbf{P}\left(\xi_{i}=0\right)=p$ be the diagonal Bernoulli $(p)$ matrix, and ${W}\in\mathbb{R}^{n\times m}$ be the weight matrix whose entries are iid $W_{i,j}\sim\mu$ for every $1\leq i\leq n,1\leq j\leq m$ . Then:

[TABLE]

Moreover, with the same setup as above, the following error estimates hold uniformly over all non-zero vectors $\vec{u}\in\mathbb{R}^{m}$ (i.e. the constants in the $O$ errors depend only on the moments of $\mu$ and on $p$ ):

[TABLE]

Proof.

Note that

[TABLE]

where $Z_{j}$ are independent and

[TABLE]

where $W_{j}$ denotes the $j^{th}$ row of ${W}.$ Since the entries of $W_{j}$ are iid with mean $0$ and variance $1$ , we have $\mathbf{E}[\left(\left\langle W_{j},\widehat{u}\right\rangle\right)^{2}]=\left\|\hat{u}\right\|^{2}=1$ . Hence ${\mathbb{E}}\left[Z_{j}\right]=1$ for each $Z_{j}$ and we conclude

[TABLE]

proving the first relation (59). Since each $Z_{j}$ is mean $1$ , the relations (60) follow by standard estimates of the $3^{rd},{2r}^{th}$ moments of a sum of $n$ iid centered random variables. Finally, to check the second relation in (59), we write

[TABLE]

Moreover,

[TABLE]

By direct evaluation, using that $\mathrm{Var}[W_{j,i}]=1$ , we find

[TABLE]

Hence, we find

[TABLE]

as claimed. ∎
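
The two moment computations in this proof can be checked exactly on a toy case. The sketch below assumes that $Z_{j}=p^{-1}\xi_{j}\left\langle W_{j},\widehat{u}\right\rangle^{2}$ , and it swaps in Rademacher ($\pm 1$) entries for $\mu$ purely so that the expectation can be enumerated with rational arithmetic. This choice has atoms and is used only because the identities below need nothing beyond mean $0$, variance $1$ and $\mu_{4}=1$. The function name `z_moments` and the unit vector $(3/5,4/5)$ are hypothetical. The closed form it is compared against is the standard identity $\mathbf{E}\left\langle W_{j},\widehat{u}\right\rangle^{4}=3+(\mu_{4}-3)\left\|\widehat{u}\right\|_{4}^{4}$ for iid mean-zero, unit-variance entries.

```python
from fractions import Fraction
from itertools import product

def z_moments(u_hat, p):
    """Exact mean and variance of Z = xi * <W, u_hat>^2 / p, where W has iid
    Rademacher entries and xi ~ Bernoulli(p) is independent of W."""
    signs = list(product((-1, 1), repeat=len(u_hat)))
    inner = [sum(w * x for w, x in zip(ws, u_hat)) for ws in signs]
    second = sum(v ** 2 for v in inner) / len(signs)  # E <W, u_hat>^2
    fourth = sum(v ** 4 for v in inner) / len(signs)  # E <W, u_hat>^4
    mean_z = p * second / p                      # uses E[xi] = p
    var_z = p * fourth / p ** 2 - mean_z ** 2    # uses xi^2 = xi
    return mean_z, var_z

u_hat = [Fraction(3, 5), Fraction(4, 5)]  # a rational unit vector
p = Fraction(1, 2)
mean_z, var_z = z_moments(u_hat, p)

mu4 = 1                                   # fourth moment of a Rademacher variable
l4 = sum(x ** 4 for x in u_hat)           # ||u_hat||_4^4
closed_form = (3 + (mu4 - 3) * l4) / p - 1

print(mean_z, var_z, closed_form)
```

The enumerated variance matches the closed form, which makes visible how $\mu_{4}$ enters only through the factor $\left\|\widehat{u}\right\|_{4}^{4}$.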

The following corollary immediately yields (54).

Corollary 26.

For any $0<\alpha<1$ and uniformly over all non-zero vectors $u\in\mathbb{R}^{m}$ , we have:

[TABLE]

Proof.

The tail estimate follows by using the Chebyshev inequality and (60). The bound on the expectation is obtained as follows:

[TABLE]

∎

To complete the proof of Proposition 24, it remains to check (52) and (53). To do this, we begin with the following observation.

Lemma 27.

Let $\ln_{\alpha}(x)=\begin{cases}\ln(\alpha)&x<\alpha\\ \ln(x)&x\geq\alpha\end{cases}$ . Suppose $Y$ is a non-negative random variable. Then there are absolute constants $C$ (here we let $C$ refer to a generic constant which may change value from line to line) so that for any $0<\alpha<1$ :

[TABLE]

Proof.

The proof is an elementary exercise in the Taylor series expansion for $\ln(x)$ applied to points in the interval $[\alpha,\infty)$ on which the derivatives of $\ln(x)$ are bounded. ∎
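
To make the role of the truncation concrete, here is a small sketch of $\ln_{\alpha}$ together with a numerical check of one Taylor estimate of the kind the proof refers to. The specific inequality checked, $\left|\ln(x)-(x-1)+\tfrac{1}{2}(x-1)^{2}\right|\leq\frac{|x-1|^{3}}{3\alpha^{3}}$ for $x\geq\alpha$ , follows from the Lagrange form of the remainder and is an illustrative choice. It is not claimed to be the exact statement of Lemma 27. The names `ln_alpha` and `taylor2_error` are hypothetical.

```python
import math

def ln_alpha(x, alpha):
    """Truncated logarithm: ln(alpha) below alpha, ln(x) at or above it."""
    return math.log(alpha) if x < alpha else math.log(x)

def taylor2_error(x):
    """|ln(x) - second-order Taylor polynomial of ln around 1|."""
    return abs(math.log(x) - ((x - 1) - (x - 1) ** 2 / 2))

alpha = 0.25
grid = [alpha + 0.01 * k for k in range(1000)]  # points in [alpha, alpha + 9.99]

# Lagrange remainder: the third derivative of ln is 2 / c^3 with c >= alpha,
# so the error is at most |x - 1|^3 / (3 * alpha^3) on [alpha, infinity).
bound_holds = all(
    taylor2_error(x) <= abs(x - 1) ** 3 / (3 * alpha ** 3) + 1e-12 for x in grid
)
print(bound_holds)
print(ln_alpha(0.01, alpha) == math.log(alpha))
```

Without the truncation the constant $1/(3\alpha^{3})$ would blow up as $x\to 0$ , which is why the derivatives are only controlled on $[\alpha,\infty)$ .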

Lemma 27 together with Lemma 25 gives the following information on the conditional moments of $X_{i},$ which directly allows us to conclude (52) and (53) and hence complete the proof of Proposition 24.

Proposition 28.

Recall the vectors $\vec{u}^{(i)}$ and their normalized $L^{2}$ and $L^{4}$ norms, $S^{(i)}$ and $F^{(i)}$ , defined in (45) and (50). We have for each $i\geq 0$

[TABLE]

Proof.

On the event $A_{i}^{c}$ , both sides of the equation are zero and the equality trivially holds. Therefore we have only to consider what happens on the event $A_{i}$ , where $\vec{u}^{(i)}\neq 0$ . By equation (41), we have that

[TABLE]

Since we are conditioning on the sigma algebra $\mathcal{F}_{i}$ , we may think of $\vec{u}^{(i)}$ as a fixed vector and apply Lemma 25. To make the equations easier to read, we use the shorthand $\widehat{S}=\frac{S^{(i+1)}}{S^{(i)}}$ and we write ${\mathbb{E}}\left[\cdot\right]$ to mean ${\mathbb{E}}\left[\cdot\,\mathtt{1}_{A_{i}}\left|\mathcal{F}_{i}\right.\right]$ . Then we have

[TABLE]

By Lemma 25 and Lemma 27, all the terms in equation (63) are $O_{a.s.}(n_{i}^{-2})$ , which proves the claim. A similar argument, combining the moment calculations from Lemma 25 and the series expansion estimates from Lemma 27 in the natural way, gives the higher moments of $\ln_{\alpha}\left(\frac{S^{(i+1)}}{S^{(i)}}\right)$ . ∎

4.2 Facts about KS distance - Proof of Lemma 21

Proof of Lemma 21.

Let us use the notation $\mathcal{N}\stackrel{{\scriptstyle d}}{{=}}\mathcal{N}(0,1).$ Since

[TABLE]

we may assume without loss that $\beta=1$ when proving (47) and (48). We begin by checking (47). We will show that for every $s>0$ , we have that

[TABLE]

from which (47) follows by taking $s=\sqrt{\sqrt{2^{-1}\pi}{\mathbb{E}}\left[\left|Y\right|\right]}$ to optimize the inequality. To begin, by considering the random variables $X,Y,\mathcal{N}$ all on the same probability space, we have the inequality:

[TABLE]

We now claim that:

[TABLE]

This is proven by examining the two possibilities of the absolute value, $\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)-\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)$ and $\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)-\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)$ by using the inclusions $\left\{X+Y\leq t,\left|Y\right|\leq s\right\}\subset\left\{X-s\leq t,\left|Y\right|\leq s\right\}$ and $\left\{X+s\leq t,\left|Y\right|\leq s\right\}\subset\left\{X+Y\leq t,\left|Y\right|\leq s\right\}$ respectively for the two cases. In the first case, consider:

[TABLE]

The inequality for $\mathbf{P}(\mathcal{N}\leq t,\left|Y\right|\leq s)-\mathbf{P}\left(X+Y\leq t,\left|Y\right|\leq s\right)$ is analogous, and equation (66) follows. Equation (64) then follows by combining equations (65),(66), Markov’s inequality $\mathbf{P}(\left|Y\right|>s)\leq s^{-1}\mathbf{E}{\left|Y\right|}$ , and the standard fact about Gaussian random variables $\sup_{t\in{\mathbb{R}}}\mathbf{P}\left(\left|\mathcal{N}-t\right|\leq s\right)\leq\sqrt{2\pi^{-1}}s$ .

To verify (48), note that there exists a constant $C>0$ so that for every $k>0$

[TABLE]

Hence,

[TABLE]

Finally to show (49), we have

[TABLE]

This completes the proof. ∎
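
The two elementary ingredients in the proof of (47) are easy to check numerically. The sketch below verifies the Gaussian small-ball bound with the constant $\sqrt{2/\pi}$ on a grid, and then confirms that $s=\sqrt{\sqrt{\pi/2}\,{\mathbb{E}}\left[\left|Y\right|\right]}$ minimizes a trade-off of the form ${\mathbb{E}}\left[\left|Y\right|\right]/s+\sqrt{2/\pi}\,s$ . That form of the right-hand side of (64) is an assumption of this example, inferred from the stated optimal choice of $s$ . The names `gaussian_small_ball`, `tradeoff` and `best_s`, and the value $0.01$ used for ${\mathbb{E}}\left[\left|Y\right|\right]$ , are hypothetical.

```python
import math

def gaussian_small_ball(t, s):
    """P(|N - t| <= s) for a standard normal N."""
    def phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return phi(t + s) - phi(t - s)

def tradeoff(s, mean_abs_y):
    """Assumed shape of the bound: a Markov term plus a small-ball term."""
    return mean_abs_y / s + math.sqrt(2.0 / math.pi) * s

def best_s(mean_abs_y):
    """Minimizer of tradeoff(., mean_abs_y) over s > 0."""
    return math.sqrt(math.sqrt(math.pi / 2.0) * mean_abs_y)

c = math.sqrt(2.0 / math.pi)
small_ball_ok = all(
    gaussian_small_ball(t / 10, s / 10) <= c * s / 10 + 1e-12
    for t in range(-30, 31)
    for s in range(1, 31)
)

a = 0.01  # hypothetical value of E|Y|
s0 = best_s(a)
print(small_ball_ok)
print(tradeoff(s0, a), 2.0 * math.sqrt(c * a))
```

At the minimizer the two terms balance, so the optimized bound is of order $\sqrt{{\mathbb{E}}\left[\left|Y\right|\right]}$ .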

4.3 Proof of Lemma 22

We have

[TABLE]

As in the proof of Proposition 28, on the event $S^{(i-1)}\neq 0,$ we have that

[TABLE]

where $Z_{j}$ are iid random variables each equal in distribution to $p^{-1}\xi_{j}\left\langle W_{j}^{(i)},\widehat{u}\right\rangle^{2}$ where $\widehat{u}$ is a fixed unit vector. In particular, by Lemma 25, $\mathbf{E}{Z_{j}}=1$ and the higher moments of $Z_{j}$ are uniformly bounded in terms of the moments of $\mu$ and $1-p.$ Consequently, for each $m\geq 1,$ there exists $C_{m}>0$ depending only on the moments of $\mu$ and $1-p$ so that

[TABLE]

Hence, by the Markov inequality:

[TABLE]

Hence, by using a union bound and the estimate on the probability of $A_{i}$ in (44), we have for each $m\geq 1,$

[TABLE]

as desired.
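
The content of Lemma 22 is that the truncation at level $\alpha$ is rarely active when the layers are wide. The following Monte Carlo sketch illustrates this under simplifying assumptions that are specific to the example: the weights are Gaussian, so that $\left\langle W_{j}^{(i)},\widehat{u}\right\rangle$ is a standard normal for every unit vector $\widehat{u}$ , all layers have the same width $n$ , and the ratio $S^{(i)}/S^{(i-1)}$ is modelled as $n^{-1}\sum_{j}Z_{j}$ with $Z_{j}=p^{-1}\xi_{j}g_{j}^{2}$ . The function names and parameter values are hypothetical.

```python
import random

def layer_ratio(n, p, rng):
    """One draw of (1/n) * sum_j xi_j * g_j^2 / p, with xi_j ~ Bernoulli(p)
    and g_j ~ N(0, 1) independent."""
    total = 0.0
    for _ in range(n):
        if rng.random() < p:
            total += rng.gauss(0.0, 1.0) ** 2
    return total / (p * n)

def truncation_frequency(n, d, p, alpha, trials, seed=0):
    """Fraction of trials in which at least one of d layer ratios falls below
    alpha, i.e. in which the truncated and untruncated logarithms differ."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(layer_ratio(n, p, rng) < alpha for _ in range(d)):
            hits += 1
    return hits / trials

small = truncation_frequency(n=4, d=5, p=0.5, alpha=0.1, trials=300)
large = truncation_frequency(n=200, d=5, p=0.5, alpha=0.1, trials=300)
print(small, large)
```

With narrow layers the event that all Bernoulli variables in some layer vanish already has non-negligible probability, whereas for wide layers the truncation essentially never triggers, in line with the polynomial decay in $n_{i}$ that the Markov and union bounds give.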

Bibliography

[1] G. Akemann, Z. Burda, and M. Kieburg. Universal distribution of Lyapunov exponents for products of Ginibre matrices. Journal of Physics A Mathematical General, 47:395202, October 2014.
[2] G. Akemann, Z. Burda, and M. Kieburg. From Integrable to Chaotic Systems: Universal Local Statistics of Lyapunov exponents. arXiv e-prints, page arXiv:1809.05905, September 2018.
[3] G. Akemann and J. R. Ipsen. Recent Exact and Asymptotic Results for Products of Independent Random Matrices. Acta Physica Polonica B, 46:1747, 2015.
[4] G. W. Anderson, A. Guionnet, and O. Zeitouni. An Introduction to Random Matrices. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2009.
[5] F. Comets, G. R. Moreno Flores, and A. Ramirez. Random polymers on the complete graph. arXiv e-prints, page arXiv:1707.01588, July 2017.
[6] J. Cotler, G. Gur-Ari, M. Hanada, J. Polchinski, P. Saad, S. H. Shenker, D. Stanford, A. Streicher, and M. Tezuka. Black holes and random matrices. Journal of High Energy Physics, 2017(5):118, 2017.
[7] P. Deift. Some Open Problems in Random Matrix Theory and the Theory of Integrable Systems. II. SIGMA, 13:016, March 2017.
[8] P. Forrester. Asymptotics of finite system Lyapunov exponents for some random matrix ensembles. Journal of Physics A: Mathematical and Theoretical, 48(21):215205, 2015.
