Concentration inequalities for random matrix products
Amelia Henriksen, Rachel Ward

TL;DR
This paper establishes sharp nonasymptotic concentration inequalities for normalized products of independent bounded random matrices, with applications to stochastic algorithms like streaming PCA.
Contribution
It provides the first nonasymptotic spectral norm bounds for a broad class of random matrix products, combining matrix Bernstein inequality and combinatorial methods.
Findings
Spectral norm error bound of O((log n)^2 log(d/δ)/√n) with high probability
Convergence of normalized matrix products to matrix exponential e^{X}
Sharpness of the rate up to logarithmic factors
Amelia Henriksen and Rachel Ward, Oden Institute for Computational Engineering and Sciences, University of Texas, Austin, TX. This material is based upon work supported in part by AFOSR MURI Award N00014-17-S-F006.
Abstract
Suppose $\{X_k\}$ is a sequence of bounded independent random matrices with common dimension $d \times d$ and common expectation $\mathbb{E}[X_k] = X$. Under these general assumptions, the normalized random matrix product
$$Z_n = \left(I_d + \tfrac{1}{n} X_n\right)\left(I_d + \tfrac{1}{n} X_{n-1}\right) \cdots \left(I_d + \tfrac{1}{n} X_1\right)$$
converges to $e^{X}$ as $n \to \infty$. Normalized random matrix products of this form arise naturally in stochastic iterative algorithms, such as Oja’s algorithm for streaming Principal Component Analysis. Here, we derive nonasymptotic concentration inequalities for such random matrix products. In particular, we show that the spectral norm error satisfies $\|Z_n - e^{X}\| = O\big((\log n)^2 \log(d/\delta)/\sqrt{n}\big)$ with probability exceeding $1 - \delta$. This rate is sharp in $n$, $d$, and $\delta$, up to possibly the $\log(n)$ and $\log(d)$ factors. The proof relies on two key ingredients: the Matrix Bernstein inequality concerning the concentration of sums of random matrices, and Baranyai’s theorem from combinatorial mathematics. Concentration bounds for general classes of random matrix products are hard to come by in the literature, and we hope that our result will inspire further work in this direction.
1 Introduction
A classical limit theorem from complex analysis reads: Let be a uniformly bounded complex sequence whose mean converges towards . Then
[TABLE]
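The scalar limit can be checked numerically. The sketch below assumes the product has the form $\prod_{k=1}^{n} (1 + z_k/n)$ with limit $e^{z}$, which is our reading of the display above; the test sequence and the function names are hypothetical and chosen only for illustration.

```python
import cmath

def scalar_product(zs):
    """Return prod_{k=1}^{n} (1 + z_k / n) for the finite sequence zs of length n."""
    n = len(zs)
    prod = 1 + 0j
    for z in zs:
        prod *= 1 + z / n
    return prod

def bounded_sequence(n, z):
    """Hypothetical test sequence: z plus an alternating +/-1 term, so the mean tends to z."""
    return [z + (-1) ** k for k in range(n)]

z = 0.5 + 0.25j
target = cmath.exp(z)

# Absolute error |prod - e^z| for increasing (even) n.
errors = {n: abs(scalar_product(bounded_sequence(n, z)) - target)
          for n in (10, 100, 1000, 10000)}

for n, err in errors.items():
    print(f"n={n:6d}  error={err:.2e}")
```

For this particular sequence the error shrinks roughly in proportion to $1/n$, since the individual terms do not converge but their mean does.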
This result is easily verified by taking the natural logarithm of each side, and observing that . A non-commutative extension of this result was recently proven by Emme and Hubert in [EH18]:
Proposition 1.
Let be a sequence of complex matrices satisfying
[TABLE]
and such that is bounded for a norm by . Consider the matrix product
[TABLE]
Then
[TABLE]
The proof of Proposition 1 is not a straightforward extension of the scalar result. The matrix product is non-commutative in general, , and so of course fails to hold in turn.
An important special case within the framework of Proposition 1 is when the are uniformly bounded independent random matrices with common expectation . Then is also a random matrix, and has expectation . Within this framework, it is natural to ask about rates of convergence of to . As far as we are aware, precise rates of convergence for matrix products of the form have not appeared in the literature before, despite such random matrix products naturally arising in stochastic iterative algorithms such as stochastic gradient descent; in particular, in Oja’s algorithm for estimating the top eigenvector of the covariance matrix of a distribution of matrices observed sequentially [Kra70, Oja82, BDF13, MCJ13, SRO15, JJK+16, AZL17]. Here, as the main content of this paper, we derive a rate of convergence for matrix products of this form.
Theorem 1 (Main Theorem).
Consider a sequence of independent (real or complex-valued) random matrices with common dimension . Assume that
[TABLE]
Introduce the sequence of random matrices given by
[TABLE]
Suppose that , and are such that
[TABLE]
Then with probability exceeding , the following holds:
[TABLE]
where denotes the matrix spectral norm.
Theorem 1 immediately implies a bound on the expected value of Note that and , so for any satisfying (3),
[TABLE]
In particular, setting gives
[TABLE]
Note that the convergence rate is unavoidable under the stated assumptions. Indeed, consider the scalar case , where is a sequence of independent real-valued mean-zero scalars, bounded uniformly by . In this case, as becomes sufficiently small, and are nearly equivalent. Thus, applying the standard scalar Bernstein inequality to results in a bound of the form . It remains open whether the and factors in the rate given by Theorem 1 can be removed, and also whether the dependence on can be improved.
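The scalar sharpness argument can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions: the $X_k$ are taken to be Rademacher signs (mean zero, bounded by 1), so the limit is $e^{0} = 1$, and the product is assumed to have the form $\prod_k (1 + X_k/n)$. The function names, sample sizes, and seed are hypothetical.

```python
import math
import random

def scalar_Z(xs):
    """Scalar normalized product prod_k (1 + x_k / n), with n = len(xs)."""
    n = len(xs)
    prod = 1.0
    for x in xs:
        prod *= 1.0 + x / n
    return prod

def mean_abs_error(n, trials, rng):
    """Monte Carlo estimate of E|Z_n - 1| for Rademacher X_k (mean zero, |X_k| <= 1)."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += abs(scalar_Z(xs) - 1.0)
    return total / trials

rng = random.Random(0)  # fixed seed so the run is reproducible
scaled = {}
for n in (100, 400, 1600):
    err = mean_abs_error(n, trials=400, rng=rng)
    scaled[n] = err * math.sqrt(n)  # stays roughly constant if the error is of order 1/sqrt(n)
    print(f"n={n:5d}  mean error={err:.4f}  sqrt(n)*error={scaled[n]:.3f}")
```

If the error decayed faster than $1/\sqrt{n}$, the rescaled column would shrink as $n$ grows; in this experiment it stays of constant order, consistent with the claim that the $1/\sqrt{n}$ rate cannot be improved in general.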
Remark 1.
Limit laws for products of random matrices have been extensively analyzed in the context of ergodic theory or martingales on Markov chains; see for instance the book [BQ16] or the extensive survey articles [Fur02, Led01]. However, quantitative rates of convergence for general random matrix products are quite scarce, apart from specialized cases such as products of i.i.d. Gaussian random matrices. Surprisingly, for the random matrix product we consider (2), a rigorous proof of the limiting behavior appears to have been given only recently [EH18], even though a seemingly incomplete proof of this limiting behavior was provided as Theorem of the 1984 paper [Ber84].
Notation. Throughout, $\|A\|$ refers to the spectral norm of the matrix $A$. For an integer $n$, we use the notation $[n]$ to refer to the set $\{1, 2, \dots, n\}$. We write $\mathbb{P}(E)$ to refer to the probability of the event $E$.
2 Preliminaries
A crucial ingredient of the proof of Theorem 1 is the matrix Bernstein inequality, a matrix-level extension of the classical scalar Bernstein inequality describing the upper tail of a sum of independent bounded or sub-exponential random variables. The first matrix Bernstein type bound was derived by Ahlswede and Winter [AW03], and subsequently improved by Tropp [Tro10] by applying Lieb’s theorem in place of the Golden-Thompson inequality. We use the variant of the matrix Bernstein inequality of Tropp stated below.
Proposition 2 (Matrix Bernstein Inequality, Theorem 6.1.1 in [Tro15]).
Consider a finite sequence of independent random matrices with common dimension . Assume that
[TABLE]
Introduce the random matrix
[TABLE]
Let be the matrix variance statistic of the sum:
[TABLE]
Then, for all ,
[TABLE]
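To make the inequality concrete, the sketch below evaluates the tail bound in the form we recall from Theorem 6.1.1 of [Tro15], namely $\mathbb{P}(\|Z\| \ge t) \le (d_1 + d_2)\exp\big(-\tfrac{t^2/2}{v(Z) + Lt/3}\big)$ for centered summands bounded in norm by $L$, and compares it with an empirical tail. The test ensemble (random signs times two fixed symmetric $2 \times 2$ matrices), the function names, and the seed are assumptions made for illustration.

```python
import math
import random

def spectral_norm_2x2(a, b, c, d):
    """Largest singular value of [[a, b], [c, d]] via the closed form for 2x2 matrices."""
    t = a * a + b * b + c * c + d * d      # trace of A^T A
    det_sq = (a * d - b * c) ** 2          # determinant of A^T A
    disc = max(t * t - 4.0 * det_sq, 0.0)
    return math.sqrt((t + math.sqrt(disc)) / 2.0)

def matrix_bernstein_tail(t, v, L, d1, d2):
    """Bound (d1 + d2) * exp(-(t^2 / 2) / (v + L * t / 3)), capped at 1."""
    return min(1.0, (d1 + d2) * math.exp(-(t * t / 2.0) / (v + L * t / 3.0)))

def sample_norm(m, rng):
    """Norm of Z = sum_k eps_k A_k, with A_k alternating between two fixed symmetric matrices."""
    a = sum(rng.choice((-1, 1)) for _ in range(m // 2))  # coefficient of [[0, 1], [1, 0]]
    b = sum(rng.choice((-1, 1)) for _ in range(m // 2))  # coefficient of [[1, 0], [0, -1]]
    return spectral_norm_2x2(b, a, a, -b)

m, trials = 200, 2000
rng = random.Random(1)
norms = [sample_norm(m, rng) for _ in range(trials)]

# Each A_k has norm 1 and satisfies A_k^2 = I, so L = 1 and v(Z) = ||sum_k A_k^2|| = m.
results = {}
for c in (2.0, 3.0):
    t = c * math.sqrt(m)
    empirical = sum(x >= t for x in norms) / trials
    results[c] = (empirical, matrix_bernstein_tail(t, v=m, L=1.0, d1=2, d2=2))
    print(f"t={t:6.2f}  empirical tail={results[c][0]:.4f}  Bernstein bound={results[c][1]:.4f}")
```

The bound is not tight for this ensemble, but the empirical tail stays below it at both thresholds, which is all the inequality promises.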
Another key theorem we rely on is Baranyai’s theorem [Bar75], stated below.
Proposition 3 (Baranyai, 1973).
Let be natural numbers such that . Then the set of -subsets of can be partitioned into disjoint families with and each is included in exactly or elements of .
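For intuition, the simplest nontrivial instance of Baranyai's theorem is the case of 2-subsets of an even-sized ground set, where the partition is the classical round-robin tournament schedule. The sketch below builds that special case only; it is not the general construction, it uses the ground set $\{0, \dots, n-1\}$ rather than $[n]$, and the function name is hypothetical.

```python
from itertools import combinations

def round_robin_partition(n):
    """Partition all 2-subsets of {0, ..., n-1} (n even) into n - 1 perfect matchings."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    m = n - 1
    families = []
    for r in range(m):
        # The element n-1 stays fixed and is paired with r; the rest are paired symmetrically around r.
        matching = [frozenset((m, r))]
        for i in range(1, n // 2):
            matching.append(frozenset(((r + i) % m, (r - i) % m)))
        families.append(matching)
    return families

families = round_robin_partition(6)
for j, matching in enumerate(families):
    print(j, sorted(tuple(sorted(pair)) for pair in matching))
```

Each of the $n - 1$ families covers every element exactly once, and together the families contain every pair exactly once. Within one family the subsets are disjoint, which is the property that lets independent matrices be grouped into independent products.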
2.1 Sketch of the proof of Theorem 1
Suppose that , , and satisfy the assumptions of Theorem 1. Write
[TABLE]
where
[TABLE]
Because the are independent, the expected values of and are easily calculated:
[TABLE]
We then write
[TABLE]
The approximation error is bounded deterministically using standard analysis, and converges to zero at rate , as made precise by Lemma 2. The errors decay sufficiently quickly in that the sum of all but the first many of them, , is also bounded by deterministically (Lemma 3 below). The leading error term is bounded with high probability using the Matrix Bernstein inequality. The most interesting, and most difficult, part of the proof is in bounding the intermediate terms . To do this, we appeal to Baranyai’s theorem, which implies that each such term can be approximately written as a sum of sums of independent matrix products, so that we may apply the matrix Bernstein inequality with properly tuned parameters to each sub-sum to achieve the final bound.
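The term-by-term structure used in the sketch can be checked on a small example. Assuming the product has the form $(I + X_n/n) \cdots (I + X_1/n)$, expanding it gives $I + \sum_{k=1}^{n} n^{-k} \sum_{i_1 > \dots > i_k} X_{i_1} \cdots X_{i_k}$. The code below verifies this identity numerically for random $2 \times 2$ matrices; the helper names are hypothetical, and the grouping by degree $k$ is our reading of the decomposition rather than the paper's exact notation.

```python
from itertools import combinations
import random

def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def add(A, B, scale=1.0):
    """Return A + scale * B."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def product_form(Xs):
    """(I + X_n / n) ... (I + X_1 / n): later factors multiply on the left."""
    n, d = len(Xs), len(Xs[0])
    Z = identity(d)
    for X in Xs:
        Z = matmul(add(identity(d), X, 1.0 / n), Z)
    return Z

def expanded_form(Xs):
    """I + sum over k of n^{-k} times the sum of ordered products over k-subsets of indices."""
    n, d = len(Xs), len(Xs[0])
    total = identity(d)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            term = identity(d)
            for i in idx:  # left-multiplying in increasing i puts larger indices on the left
                term = matmul(Xs[i], term)
            total = add(total, term, float(n) ** -k)
    return total

rng = random.Random(2)
Xs = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)] for _ in range(6)]
P, E = product_form(Xs), expanded_form(Xs)
max_diff = max(abs(P[i][j] - E[i][j]) for i in range(2) for j in range(2))
print(f"max entrywise difference: {max_diff:.2e}")
```

The degree-$k$ sum ranges over all $k$-subsets of the index set, which is where a partition of $k$-subsets into families of disjoint subsets becomes useful.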
3 Key Ingredients
The first two lemmas use standard analysis tools; we defer the proofs to appendices.
Lemma 2.
Let be a square real or complex matrix with spectral norm . The following holds:
[TABLE]
The proof of Lemma 2 is found in Appendix B.
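As a numerical companion, the sketch below measures the deterministic gap between $(I + X/n)^n$ and $e^{X}$ for one fixed matrix. It assumes, following the proof sketch in Section 2.1, that this is the quantity Lemma 2 controls; it does not reproduce the lemma's constants. The matrix $X$, the Taylor truncation length, and the helper names are hypothetical.

```python
import math

def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def expm_taylor(X, terms=40):
    """Matrix exponential via the truncated series sum_k X^k / k! (adequate for small ||X||)."""
    d = len(X)
    result, term = identity(d), identity(d)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, X)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def normalized_power(X, n):
    """Compute (I + X / n)^n by repeated multiplication."""
    d = len(X)
    base = [[(1.0 if i == j else 0.0) + X[i][j] / n for j in range(d)] for i in range(d)]
    P = identity(d)
    for _ in range(n):
        P = matmul(P, base)
    return P

def spectral_norm_2x2(M):
    """Largest singular value of a 2x2 matrix via the closed form."""
    (a, b), (c, d) = M
    t = a * a + b * b + c * c + d * d
    disc = max(t * t - 4.0 * (a * d - b * c) ** 2, 0.0)
    return math.sqrt((t + math.sqrt(disc)) / 2.0)

X = [[0.3, 0.8], [-0.5, 0.2]]
E = expm_taylor(X)
gap = {}
for n in (10, 100, 1000):
    P = normalized_power(X, n)
    diff = [[P[i][j] - E[i][j] for j in range(2)] for i in range(2)]
    gap[n] = spectral_norm_2x2(diff)
    print(f"n={n:5d}  gap={gap[n]:.3e}  n*gap={n * gap[n]:.4f}")
```

For this example the rescaled gap `n * gap` settles to a constant, so the deterministic part of the error decays like $1/n$, faster than the $1/\sqrt{n}$ stochastic part.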
Lemma 3.
Suppose that is as in Theorem 1, and let be as defined in (8). Suppose that Then
[TABLE]
The proof of Lemma 3 is found in Appendix A.
Proposition 4 contains the meat of the proof. By carefully combining the Matrix Bernstein inequality and Baranyai’s theorem, we produce high probability bounds for the error terms .
Proposition 4.
Assume are matrices satisfying the assumptions in Theorem 1, and suppose that and are such that
[TABLE]
where, for the case, we treat . Then
[TABLE]
where
[TABLE]
Proof.
For simplicity of notation, we drop the subscript in all matrix notation throughout; that is, we let , we let , and so on. Note Let be the unique integer such that divides , and write
[TABLE]
The random matrix is a sum of random matrix products, each of which contains at least one of the matrices . Each term is bounded in norm deterministically by , so
[TABLE]
We thus have so far that
[TABLE]
Now, as a consequence of Baranyai’s theorem, there exist partitions of , denoted by , , such that
[TABLE]
Write
[TABLE]
Because the are independent and because each constitutes a partition of , each subset of random matrices forms a mutually independent set of random matrices. We can use this to bound with high probability, using the Matrix Bernstein Inequality (Proposition 2). Indeed, we will apply the Matrix Bernstein Inequality separately to each sum of independent random matrices. To do this, we employ the bounds
[TABLE]
Similarly,
We can now apply the Matrix Bernstein Inequality: for any ,
[TABLE]
We take the union bound over all sums to obtain
[TABLE]
Set (where, in case , we use ). Then
[TABLE]
Set
[TABLE]
Under the assumption that
[TABLE]
which is implied by the stated condition (12) on , it follows that
[TABLE]
and so we can continue to bound
[TABLE]
Thus, we conclude that for each satisfying assumption (18), it holds that
[TABLE]
Recalling
[TABLE]
yields the result. ∎
4 Proof of Theorem 1
We can bound the error from Theorem 1 by combining Proposition 4 with Lemma 3.
Corollary 3.1.
Suppose that , and are such that
[TABLE]
Then with probability exceeding ,
[TABLE]
Proof.
First, by the triangle inequality. By Lemma 3,
[TABLE]
Now, given (19), we can apply Proposition 4 to each of , and via the union bound, we obtain that the following holds with probability at least
[TABLE]
where in the final inequality, we use that is maximized over at . We have the stated result.
∎
Proof of Theorem 1 from Corollary 3.1. Write . Bound using Corollary 3.1 and bound using Lemma 2 to arrive at the statement of Theorem 1.
5 Conclusion and Future Directions
We derived a large deviations bound for the convergence rate of a certain type of product of random matrices toward its limiting distribution. Our results are quite general and nearly sharp with respect to dependence on the matrix size and number of terms in the product, .
One particularly immediate application of our rates of convergence is in the analysis of random matrix products arising in stochastic iterative algorithms such as Oja’s algorithm for streaming principal component analysis [Oja82]. One area of future work would be to use our results to derive convergence rates for Oja’s method using minimal assumptions, an area of ongoing research (see, for example, [AL16, JJK+16]). This is particularly important because of the fundamental role streaming PCA plays in high-dimensional data analysis.
Appendix A Proof of Lemma 3
See Lemma 3.
Proof.
We have that
[TABLE]
Hence it remains to show that
[TABLE]
Let in the remainder. First, we observe that :
[TABLE]
Since it suffices to show that
[TABLE]
We consider two cases:
Case 1: If , then Thus we require . This clearly holds because .
Case 2: If , then . Since it follows that
[TABLE]
Now, for each ,
[TABLE]
By induction, it follows that
[TABLE]
Hence,
[TABLE]
∎
Appendix B Proof of Lemma 2
See Lemma 2.
Proof.
The proof uses only basic analytic tools and inequalities. Recall the matrix exponential: . Let . Then we have
[TABLE]
where in the final inequality, we used that . Thus, using also that for all
[TABLE]
∎
References
- [AL16] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. arXiv e-prints, July 2016.
- [AW03] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48:569–579, 2003.
- [AZL17] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–492. IEEE, 2017.
- [Bar75] Z. Baranyai. On the factorization of the complete uniform hypergraph. In Infinite and finite sets: To Paul Erdős on his 60th birthday, volume 10 of Colloquia Mathematica Societatis János Bolyai, pages 91–108. North-Holland Publishing Company, 1975.
- [BDF13] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. Advances in Neural Information Processing Systems (NIPS), pages 3174–3182, 2013.
- [Ber84] M. Berger. Central limit theorem for products of random matrices. Transactions of the American Mathematical Society, 285:777–803, 1984.
- [BQ16] Y. Benoist and J.-F. Quint. Random walks on reductive groups, volume 62 of Results in Mathematics and Related Areas. Springer, 2016.
- [EH18] J. Emme and P. Hubert. Limit laws for random matrix products. Mathematical Research Letters, 25, 2018.
