Concentration of the matrix-valued minimum mean-square error in optimal Bayesian inference
Jean Barbier

TL;DR
This paper proves that the matrix-valued minimum mean-square error in Bayesian inference concentrates as the problem size grows, extending techniques from spin glass physics to various inference models.
Contribution
It introduces a concentration result for the MMSE in vector-valued Bayesian inference, applicable to models like spiked matrices, tensors, and neural networks, under known model parameters.
Findings
MMSE concentrates in large-scale Bayesian inference models
Applicable to spiked matrix and tensor models, neural networks, and generalized linear models
Provides theoretical foundation for mutual information formulas in these settings
Abstract
We consider Bayesian inference of signals with vector-valued entries. Extending concentration techniques from the mathematical physics of spin glasses, we show that the matrix-valued minimum mean-square error concentrates when the size of the problem increases. Such results are often crucial for proving single-letter formulas for the mutual information when they exist. Our proof is valid in the optimal Bayesian inference setting, meaning that it relies on the assumption that the model and all its hyper-parameters are known. Examples of inference and learning problems covered by our results are spiked matrix and tensor models, the committee machine neural network with few hidden neurons in the teacher-student scenario, or multi-layer generalized linear models.
The Abdus Salam International Center for Theoretical Physics, Trieste, Italy.
I INTRODUCTION
This decade is witnessing a burst of mathematical studies related to inference and learning problems. One reason is that an important arsenal of methods, developed in particular in the context of spin glass physics, has found a new rich playground where it can be applied with success [1, 2, 3, 4]. In particular, important progress has been made recently in the context of high-dimensional Bayesian inference and learning. Examples of problems in this class include spiked matrix and tensor models [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], random linear and generalized estimation [19, 20, 21, 22, 23, 24], models of neural networks in the teacher-student scenario [23, 25, 26], or sparse graphical models such as error-correcting codes and block models [27, 28, 29].
All these results are based in some way or another on the control of the fluctuations of the overlap, which is related up to a constant to the minimum mean-square error. Optimal Bayesian inference (optimal meaning that the true posterior is known) is a ubiquitous setting in the sense that the overlap can be shown to concentrate, and this in the whole regime of parameters (amplitude of the noise, number of observations/data points divided by the number of parameters to infer, etc.). When the overlap concentrates (and the problem is “random enough”) one expects single-letter variational formulas for the asymptotic mutual information.
In many statistical models the overlap is a scalar. In the context of optimal Bayesian inference it is now quite standard to show that the scalar overlap is self-averaging, see, e.g., [30, 14]. The techniques to do so come from communications, starting with [31, 32, 19] (and then generalized in [33, 27]), and are extensions of methods used in the analysis of spin glasses [1, 34, 35, 3, 36]. In this paper we consider instead Bayesian inference of signals made of vectorial components, in which case the overlap is a matrix. The concentration techniques developed for scalar overlaps do not apply directly, and need to be extended using new non-trivial ideas. Examples of inference problems where matrix overlaps appear are the factorization of matrices and tensors of rank greater than one [11], or the committee machine neural network [37, 38, 39, 25]. They also appear in the context of spin glasses [40, 41, 42].
II OPTIMAL INFERENCE OF TALL MATRICES
II-A General setting
All quantities in this paper are real. Consider a model where a “tall” matrix-signal made of components, that are each a -dimensional vector where is independent of , is generated probabilistically. Its probability prior distribution may depend on a generic hyper-parameter , i.e., . Data (also called observations) are then generated conditionally on the unknown and a hyper-parameter . Namely, the data , with a generic set: the data and hyper-parameters can be vectors, tensors, etc., so that our model is very general. We assume that and are also random, with distributions .
The task is to infer the signal given the data . We moreover assume that the hyper-parameters , the kernel and the prior are known to the statistician, and call this setting optimal Bayesian inference.
The information-theoretically optimal way of inferring the signal follows from its posterior distribution. Using Bayes’ formula the posterior for the base inference model reads
[TABLE]
The averaged free energy (i.e., the Shannon entropy density of the data given the hyper-parameters) equals
[TABLE]
The average is over , jointly called the quenched variables as they are fixed by the realization of the problem, in contrast with the dynamical variable which fluctuates according to the posterior. We call model (1) the “base model” in contrast with the perturbed model presented in section III, a slightly modified version of the base model where additional side-information is given, and for which concentration results can be proved without altering the limit of the averaged free energy, see Lemma 1.
The central object of interest is the overlap matrix (or simply overlap) defined as
[TABLE]
Here $x$ is a sample from the posterior and $X$ is the signal. The overlap contains a lot of information. Using that the estimator minimizing the mean-square error is the posterior mean (denoting the expectation w.r.t. the posterior (1) of the base model), the matrix-valued minimum mean-square error (MMSE) is
[TABLE]
where is a row of (all vectors are columns, including rows of matrices considered alone, transposed vectors are rows). The scalar MMSE is then simply
[TABLE]
( is the trace). Another metric for problems where, e.g., the sign of the signal is lost due to symmetries is
[TABLE]
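As a rough illustration of how the overlap and the matrix-valued MMSE relate, the following sketch uses a toy model chosen only because its posterior is available in closed form. Everything specific here is an assumption and not taken from the paper: the i.i.d. standard Gaussian prior (which does not have the bounded support used in the proofs), the entrywise Gaussian channel, the name `toy_overlap_and_mmse`, and the normalization `x.T @ X / n` for the overlap. In this toy case the overlap should be close to `snr / (1 + snr)` times the identity and the matrix MMSE close to `1 / (1 + snr)` times the identity, so the two add up to the second moment of the prior.

```python
import numpy as np

def toy_overlap_and_mmse(n=20000, K=2, snr=1.5, seed=0):
    """Toy check: K x K overlap and matrix MMSE in a solvable Gaussian model."""
    rng = np.random.default_rng(seed)
    # Signal: n rows, each a K-dimensional vector with i.i.d. N(0, 1) entries (assumed prior).
    X = rng.standard_normal((n, K))
    # Assumed observation channel: Y = sqrt(snr) * X + standard Gaussian noise, entrywise.
    Y = np.sqrt(snr) * X + rng.standard_normal((n, K))
    # For this Gaussian model the posterior is Gaussian and factorizes over entries.
    post_mean = np.sqrt(snr) / (1.0 + snr) * Y
    post_std = np.sqrt(1.0 / (1.0 + snr))
    # One sample x from the exact posterior.
    x = post_mean + post_std * rng.standard_normal((n, K))
    # Overlap matrix between the posterior sample and the signal (assumed 1/n normalization).
    Q = x.T @ X / n
    # Empirical matrix-valued MMSE: second moment of the error of the posterior mean.
    err = X - post_mean
    mmse_matrix = err.T @ err / n
    return Q, mmse_matrix

Q, mmse_matrix = toy_overlap_and_mmse()
scalar_mmse = np.trace(mmse_matrix)  # scalar MMSE as the trace of the matrix MMSE
```

With the default arguments, `Q + mmse_matrix` should be close to the identity, which is the "related up to a constant" statement in this particular toy setting.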
II-B Examples
In the symmetric order- rank- tensor factorization problem, the data-tensor is generated as
[TABLE]
for . Here is a Gaussian noise tensor with independent and identically distributed (i.i.d.) entries, and the signal components are i.i.d., i.e., with a prior with supported on . The case is the spiked Wigner model and is one of the simplest probabilistic models for principal component analysis [5]. In both analyses of this problem in [17, 18], the matrix overlap concentration is a key result.
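For concreteness, here is a minimal sketch of how data from a rank-`K` spiked Wigner model could be simulated. The exact scaling of the equation above is not reproduced here, so the `sqrt(snr / n)` factor, the Rademacher prior (chosen as a simple bounded-support prior), the symmetrized noise, and the function name `spiked_wigner` are all assumptions made for illustration.

```python
import numpy as np

def spiked_wigner(n=200, K=2, snr=2.0, seed=0):
    """Sample (X, Y) from an assumed rank-K spiked Wigner model."""
    rng = np.random.default_rng(seed)
    # Signal rows are i.i.d. K-dimensional vectors with +/-1 entries (assumed prior).
    X = rng.choice([-1.0, 1.0], size=(n, K))
    # Symmetric Gaussian noise: off-diagonal entries have variance 1, diagonal entries variance 2.
    G = rng.standard_normal((n, n))
    Z = (G + G.T) / np.sqrt(2.0)
    # Low-rank signal matrix plus noise, with an assumed sqrt(snr / n) scaling.
    Y = np.sqrt(snr / n) * (X @ X.T) + Z
    return X, Y

X, Y = spiked_wigner()
```

The signal part `X @ X.T` has rank at most `K`, which is why a `K x K` overlap matrix is the natural order parameter for this kind of model.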
Another model is the following generalized linear model (GLM) (recall ):
[TABLE]
Given and , the data points are i.i.d., thus the notation instead of . We also assume that the prior . A particular simple deterministic case is
[TABLE]
This model is a version of the committee machine [23, 25]. Here can be interpreted as the weights of the -th hidden neuron, and are -dimensional data points used to generate the labels . The teacher-student scenario in which our results apply is: the teacher network (4) (or (3) in general) generates from the data . The pairs are then used in order to train (i.e., learn the weights of) a student network with the same architecture.
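The teacher side of this scenario can be sketched as follows. Since the deterministic rule above is not reproduced in this copy, the specific choice of a sign activation on each hidden unit followed by a majority vote, the `1 / sqrt(n)` scaling, the Rademacher teacher weights, and the names `committee_teacher`, `Phi` and `m` are assumptions used only to make the example concrete.

```python
import numpy as np

def committee_teacher(n=50, K=3, m=200, seed=0):
    """Generate labels from an assumed sign-activation committee machine teacher."""
    rng = np.random.default_rng(seed)
    # Teacher weights: column k holds the n weights of the k-th hidden neuron.
    X = rng.choice([-1.0, 1.0], size=(n, K))
    # m data points of dimension n with i.i.d. standard Gaussian entries.
    Phi = rng.standard_normal((m, n))
    # Pre-activations of the K hidden units for every data point, shape (m, K).
    pre_activations = Phi @ X / np.sqrt(n)
    # Each hidden unit votes +1 or -1; the label is the majority vote (K odd avoids ties).
    hidden = np.sign(pre_activations)
    Y = np.sign(hidden.sum(axis=1))
    return X, Phi, Y

X, Phi, Y = committee_teacher()
```

A student with the same architecture would then be trained on the pairs `(Phi, Y)`; the overlap between the student and teacher weight matrices is a `K x K` matrix.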
A richer example is a multi-layer version of the GLM:
[TABLE]
where each index runs from to , for . The input is factorized as . In this model represent intermediate hidden variables, the visible variable is the data, and with representing the weight matrix at the -th layer. Note that in the single layer case (3), was interpreted as data points and as the weight vector to learn. This model has been studied by various authors when and when the components are scalars [43, 44, 45, 26, 46]. But one can define generalizations where these are multi-dimensional, in which case overlap matrices naturally appear.
A last example could be another combination of statistical models such as a spiked Wigner model where the hidden low-rank representation of the data has a complex generative prior. For example could be generated from a GLM over a more primitive signal (here ):
[TABLE]
This set-up has recently attracted attention [47] for studying models of complex structured data with generative priors.
III THE PERTURBED MODEL, AND RESULTS
III-A The vectorial Gaussian channel perturbation
In order to “force” overlap concentration we need, in addition to the data , infinitesimal side-information about coming from a vectorial Gaussian channel:
[TABLE]
The i.i.d. Gaussian noise . The signal-to-noise (SNR) matrix controlling the signal strength with a sequence that tends to , and belongs to
[TABLE]
(other sets could be used but this one is convenient for the proof). We also denote so that . Matrices belonging to are symmetric strictly diagonally dominant with positive entries and thus , where is the set of symmetric positive definite matrices of dimension , see [48]. As it has a unique square root matrix that we denote .
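The claim that a symmetric, strictly diagonally dominant matrix with positive entries is positive definite (a consequence of Gershgorin's circle theorem) can be checked numerically. The sketch below is not the paper's exact set: the sampling ranges, the scale `s_n`, and the helper names `sample_snr_matrix` and `psd_sqrt` are assumptions made for illustration.

```python
import numpy as np

def sample_snr_matrix(K=3, s_n=0.1, seed=0):
    """Sample a symmetric, strictly diagonally dominant matrix with positive entries."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.1, 1.0, size=(K, K))
    # Symmetric positive off-diagonal part.
    off = (A + A.T) / 2.0
    np.fill_diagonal(off, 0.0)
    # Diagonal strictly larger than the sum of the off-diagonal entries in each row.
    diag = off.sum(axis=1) + rng.uniform(0.5, 1.0, size=K)
    # Overall small scale s_n (assumed to play the role of the vanishing sequence).
    return s_n * (off + np.diag(diag))

def psd_sqrt(lam):
    """Unique symmetric positive definite square root via the eigendecomposition."""
    w, V = np.linalg.eigh(lam)
    return (V * np.sqrt(w)) @ V.T

lam = sample_snr_matrix()
eigenvalues = np.linalg.eigvalsh(lam)
sqrt_lam = psd_sqrt(lam)
```

All eigenvalues come out strictly positive, and `sqrt_lam @ sqrt_lam` recovers `lam` up to floating-point error.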
The perturbed inference model is then
[TABLE]
It is called “perturbed model” because the base model has been slightly modified by adding new data points coming from (6) that are “weak” (as ). The posterior of the perturbed model reads
[TABLE]
where . We define the bracket as the expectation w.r.t. the posterior of the perturbed model: . Thus depends on and the SNR .
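To make the side-information channel more tangible, the sketch below simulates it under an assumed form: each row of the signal is multiplied by the square root of the SNR matrix and corrupted by standard Gaussian noise. The matrix form `X @ sqrt_lam`, the Rademacher prior, the particular SNR matrix, and the name `perturbation_channel` are assumptions for illustration, since the channel equation is not reproduced in this copy.

```python
import numpy as np

def perturbation_channel(X, lam, rng):
    """Assumed side-information channel: Y_tilde = X @ sqrt(lam) + Z_tilde."""
    # Symmetric square root of the SNR matrix.
    w, V = np.linalg.eigh(lam)
    sqrt_lam = (V * np.sqrt(w)) @ V.T
    # i.i.d. standard Gaussian noise, one K-dimensional noise vector per signal row.
    Z_tilde = rng.standard_normal(X.shape)
    return X @ sqrt_lam + Z_tilde, sqrt_lam

rng = np.random.default_rng(0)
n, K, s_n = 50000, 2, 0.05
# Signal with +/-1 entries (assumed prior), so that X.T @ X / n is close to the identity.
X = rng.choice([-1.0, 1.0], size=(n, K))
# Small SNR matrix: symmetric, positive entries, strictly diagonally dominant.
lam = s_n * np.array([[2.0, 1.0], [1.0, 2.0]])
Y_tilde, sqrt_lam = perturbation_channel(X, lam, rng)
# Empirical cross-moment between the side information and the signal.
cross = Y_tilde.T @ X / n
```

In this toy setting `cross` is close to `sqrt_lam`, which shrinks as `s_n` decreases: the extra observations carry only weak information about the signal.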
It is crucial to notice that the perturbed model (7) is set in the optimal Bayesian inference setting. Again, this means that in addition to the data the statistician knows the data generating model, namely the kernel and the additive Gaussian nature of the noise in the second channel in (7), the prior as well as all hyper-parameters , and is therefore able to write the true posterior (8).
An important object is the free energy of model (7):
[TABLE]
Concentration of the overlap requires a hypothesis:
Hypothesis 1 (Free energy concentration)
There exists a constant that may depend on everything but , and s.t.
[TABLE]
The expectation is over all quenched variables but not over , which remains fixed.
For purely generic optimal inference models without any restrictions on the distributions it is generally very hard, if not wrong, to try proving (9). The model must be “random enough” and possess some underlying factorization structure for such a hypothesis to be true (thus the factorization properties assumed in the examples of section II-B). The most studied case in the literature is when the prior and the kernel factorize, namely and the data points are i.i.d. given . The examples (2)–(4) fall in this class. Under such factorization assumptions it is quite straightforward to prove (9) using standard techniques (see, e.g., [14, 23]). But such simple factorization properties are not always there, as illustrated by the last two examples in section II-B. In these examples it is a perfectly valid question to wonder whether the overlap of the hidden variables does concentrate¹ (this question is crucial in the analysis of [26]). The hidden variables have a very complex structured prior (i.e., probability distribution), with highly non-trivial factorization properties, in which case proving (9) requires work. See, e.g., [26] where this has been done for the multi-layer GLM (5) with a single hidden layer, where this is already challenging.
¹ Note that proving concentration of the overlap for a hidden variable requires a perturbation of the form (6) over the hidden variable, not over , which in this case is just interpreted as a constitutive element of the prior of the hidden variable of interest, see [26] where this is done.
An important feature of the perturbation is that it does not change the limit of the averaged free energy; this means that in a certain sense the perturbed model is equivalent to the base one at a “macroscopic” level, i.e., for the global quantities. We denote in this paper a generic constant that may depend on all parameters in the problem like and but not on .
Lemma 1 (Free energy equivalence)
There exists a constant s.t. . Thus and have same limit, provided it exists.
III-B Main results
Our main results are concentration theorems for the overlap in a (perturbed) model of optimal Bayesian inference. We start with the first type of fluctuations, namely the fluctuations of the overlap w.r.t. the posterior distribution, or what is called “thermal fluctuations” in statistical mechanics. Controlling these fluctuations does not require that the free energy concentrates (the hypothesis (9) is not required). Denote the average over the perturbation matrix , where ( has independent entries). Then:
Theorem 2 (Thermal fluctuations of $Q$)
Consider an optimal Bayesian inference problem (i.e., for which the true posterior is known), with side information coming from the channel (6); i.e., a model of the form (7). Let a sequence s.t. and . There exists s.t.
[TABLE]
The next, stronger, result takes care of the additional fluctuations due to the quenched randomness, and requires this time the free energy concentration hypothesis:
Theorem 3 (Total fluctuations of $Q$)
Consider a perturbed optimal Bayesian inference problem of the form (7). Assume Hypothesis 1. Let verify and . There exists s.t.
[TABLE]
We emphasize that, as Theorem 2 does not require Hypothesis 1, it is valid very generically, even for very complex models without any factorization properties for the signal’s prior nor for the kernel; it is only a consequence of the perturbation and the Bayesian optimality. In such models, deriving single-letter formulas for quantities like the mutual information or the MMSE is doomed (as generally there are none). Indeed, proofs of such simple formulas always require in one way or another strong factorization properties, directly, like, e.g., in [11, 12], or indirectly through the need of the stronger concentration result Theorem 3 as in [14, 13, 23, 17, 16, 18]².
² In these papers the concentration needs to be proven for an appropriate “interpolating” model.
Another remark is related to the role of the perturbation (i.e., side-information). Our theorems require an external average over the perturbation: this is not an artefact of the proof. Indeed, there might be a (zero-measure) set in the hyper-parameters space where, in the limit, there are phase transitions. Phase transitions manifest themselves in particular by a non-self-averaging behavior of the overlap. But averaging over a vanishing window of , which importantly is independent of , allows one to smooth the overlap fluctuations, effectively cancelling the dramatic effect of possible phase transitions.
IV PROOF IDEA
We give a few pointers to help the reader get an idea of the proof. All details can be found in [49].
Proving overlap concentration relies on the concentration of another matrix .
Proposition 4 (Concentration of )
Let and . Then there is s.t.
[TABLE]
If and Hypothesis 1 is verified,
[TABLE]
The fluctuations of this matrix are easier to control than the ones of the overlap because is related to the -gradient of the free energy, which is self-averaging by hypothesis (9). The proof is a straightforward extension to the matrix case of the one found in [30, 23] and requires no new ideas. This general result does not depend on the fact that we consider optimal Bayesian inference; it is only a consequence of the perturbation, i.e., the side information coming from the channel (6). What instead does require new ideas and relies on the Bayesian optimal setting is the link between the concentration of and the one of . The additional difficulty w.r.t. what is done in [30, 23] for a scalar overlap (i.e., the case ) is that the matrix is not symmetric, even if its expectation is. Symmetry in expectation is a consequence of the general identity (sometimes called “Nishimori identity”) $\mathbb{E}\big\langle g(x,X;\widetilde{Y},Y)\big\rangle=\mathbb{E}\big\langle g(x,x';\widetilde{Y},Y)\big\rangle$, where $X$ is the signal, $x, x'$ are i.i.d. samples from the posterior (8), $\langle\cdot\rangle$ is the expectation w.r.t. the product posterior measure, and $g$ is any bounded function. This innocent-looking key identity, on which the whole proof relies, follows directly from Bayes’ law (thus the importance of placing ourselves in the Bayesian optimal setting), see [30, 23]. Applied to which is indeed symmetric.
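The Nishimori identity can be checked by Monte Carlo in a toy model where exact posterior sampling is possible. The reason it holds is that, by Bayes' law, the pair (signal, data) has the same joint law as the pair (posterior sample, data). The scalar Gaussian prior and channel below, the test function given by a product of its two arguments (not bounded, but with finite moments here), and the name `nishimori_check` are assumptions chosen for illustration only.

```python
import numpy as np

def nishimori_check(n=200000, snr=1.0, seed=0):
    """Compare E<x X> and E<x x'> in a scalar Gaussian model with exact posterior."""
    rng = np.random.default_rng(seed)
    # Signal from an assumed N(0, 1) prior and data from an assumed Gaussian channel.
    X = rng.standard_normal(n)
    Y = np.sqrt(snr) * X + rng.standard_normal(n)
    # Exact Gaussian posterior of each signal entry given its observation.
    post_mean = np.sqrt(snr) / (1.0 + snr) * Y
    post_std = np.sqrt(1.0 / (1.0 + snr))
    # Two conditionally independent samples from the posterior.
    x1 = post_mean + post_std * rng.standard_normal(n)
    x2 = post_mean + post_std * rng.standard_normal(n)
    # Left side uses the true signal, right side replaces it by a second posterior sample.
    lhs = np.mean(x1 * X)
    rhs = np.mean(x1 * x2)
    return lhs, rhs

lhs, rhs = nishimori_check()
```

Both estimates should agree up to Monte Carlo error and, for this toy model, be close to `snr / (1 + snr)`. If the sampler used a mismatched prior or noise level, the two sides would generally differ, which is why the identity is tied to the Bayesian optimal setting.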
Linking the thermal fluctuations of and : Let us start by giving the main steps behind the proof of Theorem 2. The key insight is the following inequality: by definition of the overlap (and for any ),
[TABLE]
for some using Cauchy-Schwarz, and that the prior has bounded support. Combining the Nishimori identity, Gaussian integration by parts and careful algebra using the formula one can show
[TABLE]
Identity (10) in Proposition 4 for then implies that the ’s asymptotically “decouple”. When this is plugged in (12), this decoupling property translates into Theorem 2.
Total fluctuations of $Q$: We now consider Theorem 3, whose proof requires Theorem 2 to be obtained first. The proof resembles the derivation of the Ghirlanda-Guerra identities in the context of spin glasses [1]. The main identity is
[TABLE]
Here is the overlap between two i.i.d. samples from the (perturbed) posterior (8), so that is symmetric. The relation (13) is shown using the Nishimori identity and Gaussian integration by parts, which in particular allow one to prove , as well as Theorem 2. Again, in the derivation one has to be careful as the matrices that appear are not symmetric, which complicates the task.
Now, by (11) in Proposition 4 and the Cauchy-Schwarz inequality we have that the left hand side of (13) verifies
[TABLE]
Therefore, because of the alternating signs on the right hand side of (13), showing that $\mathbb{E}_{\lambda}\mathbb{E}\big\langle\|Q-\mathbb{E}\langle Q\rangle\|_{\rm F}^{2}\big\rangle$ is small requires proving that the third and fourth terms are small. These can be thought of as a “measure of asymmetry” of the overlap matrix $Q$. The last crucial step is therefore showing (this again relies on the Nishimori identity)
[TABLE]
ACKNOWLEDGMENTS
Funding from Fondation CFM pour la Recherche-ENS is acknowledged. I would like to thank Nicolas Macris, Dmitry Panchenko, Antoine Maillard, Florent Krzakala, Léo Miolane and Clément Luneau for discussions.
REFERENCES
[1] S. Ghirlanda and F. Guerra. General properties of overlap probability distributions in disordered spin systems. Towards Parisi ultrametricity. Journal of Physics A: Mathematical and General, 31(46):9149, 1998.
[2] F. Guerra and F. L. Toninelli. The thermodynamic limit in mean field spin glass models. Communications in Mathematical Physics, 230(1):71–79, 2002.
[3] M. Talagrand. Spin glasses: a challenge for mathematicians: cavity and mean field models, volume 46. Springer, 2003.
[4] D. Panchenko. The Sherrington-Kirkpatrick model. Springer Science & Business Media, 2013.
[5] I. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.
[6] S. B. Korada and N. Macris. Exact solution of the gauge symmetric p-spin glass model on a complete graph. Journal of Statistical Physics, 136(2):205–230, 2009.
[7] Y. Deshpande, E. Abbe, and A. Montanari. Asymptotic mutual information for the balanced binary stochastic block model. Information and Inference: A Journal of the IMA, 6(2):125–170, 2016.
[8] F. Krzakala, J. Xu, and L. Zdeborová. Mutual information in rank-one matrix estimation. In 2016 IEEE Information Theory Workshop (ITW), pages 71–75, Sept 2016.
