The Geometry of Community Detection via the MMSE Matrix
Galen Reeves, Vaishakhi Mayya, Alexander Volfovsky

TL;DR
This paper introduces a geometric framework for community detection in networks with variable community sizes, using an effective signal-to-noise ratio matrix to characterize detection limits and improve understanding of real-world network behaviors.
Contribution
It extends existing models by incorporating community variability and develops a matrix-based geometric approach to analyze detection limits, generalizing previous scalar SNR concepts.
Findings
Effective SNR matrix characterizes community detectability.
Explicit formulas for mutual information and MSE bounds.
Numerical simulations validate theoretical predictions.
G. Reeves is with the Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University, Durham, NC 27708 USA. V. Mayya is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA. A. Volfovsky is with the Department of Statistical Science, Duke University, Durham, NC 27708 USA.
Abstract
The information-theoretic limits of community detection have been studied extensively for network models with high levels of symmetry or homogeneity. The contribution of this paper is to study a broader class of network models that allow for variability in the sizes and behaviors of the different communities, and thus better reflect the behaviors observed in real-world networks. Our results show that the ability to detect communities can be described succinctly in terms of a matrix of effective signal-to-noise ratios that provides a geometrical representation of the relationships between the different communities. This characterization follows from a matrix version of the I-MMSE relationship and generalizes the concept of an effective scalar signal-to-noise ratio introduced in previous work. We provide explicit formulas for the asymptotic per-node mutual information and upper bounds on the minimum mean-squared error. The theoretical results are supported by numerical simulations.
1 Introduction
Modern data problems often ask questions about how individuals (or computers or countries) interact or relate to each other within a network. A frequently studied problem in this context is that of community detection: how does one partition a network into clusters (or communities or groups) of nodes? A natural partition of a network is into communities that exhibit similar connection patterns, both within and between communities. A generative model for random networks called the stochastic block model (SBM) exhibits such behavior and hence much of the theoretical analysis of community detection has focused on it [1]. Under the SBM each individual belongs to exactly one of communities, and the probability of an edge between two individuals is exclusively a function of their community memberships.
The problem of community detection can be modeled in terms of a joint distribution on where is a simple graph on vertices and is a collection of labels associated with the vertices. In the SBM this joint distribution is governed by two parameters: a probability vector of each node being assigned to one of labels, and a matrix of probabilities where is the probability of an edge between nodes in communities and . The community detection task is recovering the labels given the graph and potentially side information.
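The generative description above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `sample_sbm`, the argument names `p` and `Q`, and the example parameter values are all assumptions chosen for the sketch.

```python
import numpy as np

def sample_sbm(n, p, Q, rng):
    """Draw labels and a simple undirected graph from an SBM with parameters (p, Q)."""
    p = np.asarray(p, dtype=float)
    Q = np.asarray(Q, dtype=float)
    labels = rng.choice(len(p), size=n, p=p)          # i.i.d. community labels
    edge_prob = Q[labels[:, None], labels[None, :]]   # n x n matrix of Q[label_i, label_j]
    coins = rng.random((n, n)) < edge_prob
    upper = np.triu(coins, k=1)                       # one coin flip per unordered pair
    adjacency = (upper | upper.T).astype(int)         # symmetric with zero diagonal
    return labels, adjacency

# Hypothetical three-community example (values chosen only for illustration).
rng = np.random.default_rng(0)
Q_example = np.array([[0.30, 0.05, 0.05],
                      [0.05, 0.30, 0.05],
                      [0.05, 0.05, 0.30]])
labels, A = sample_sbm(200, [0.5, 0.3, 0.2], Q_example, rng)
```

Only the strictly upper-triangular coin flips are kept, so each unordered pair of nodes gets exactly one independent edge decision.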
Inspired by the work of Decelle et al. [2], a recent line of work has studied the information-theoretic limits of recovery when the distribution of is known. Most of this work has focused on either the two-community SBM [3, 4, 5, 6, 7, 8, 9] or the so-called -community symmetric SBM [7, 10, 11, 12]. In all of these cases, performance is summarized in terms of a single numerical value, which is often referred to as the effective signal-to-noise ratio of the problem. General SBMs have been considered by Abbe and Sandon [10], who characterize conditions for weak recovery, and also by Lesieur et al. [7], who analyze the performance of an approximate message passing algorithm.
A different line of research within the statistics community has focused on settings where the parameters of the distribution, such as the distribution of communities and the conditional probabilities of edges, are unknown quantities that must also be inferred, along with the community memberships [13, 14]. While the models considered in this literature are highly flexible, the conditions needed for consistent recovery of communities correspond to a very high SNR regime relative to the information-theoretic analysis.
1.1 Our Contributions
The contribution of this paper is to characterize the information-theoretic limits for a large class of degree-balanced SBMs. In contrast to the symmetric SBM, these models allow for variability in the sizes and behaviors of the different communities, and thus reflect behaviors observed in real-world networks. While previous work is limited to a scalar measure of performance for the overall community detection problem, we introduce a multivariate measure of performance, the minimum mean-squared error (MMSE) matrix, which describes detection limits for individual communities. For example, this matrix allows us to characterize settings where some of the communities can be detected while others cannot.
Our analysis of the community detection problem leverages a matrix version of the I-MMSE relation [15], which both simplifies and generalizes techniques used in previous work. In particular, the upper bound on the mutual information in Theorem 2 is a consequence of a novel non-asymptotic inequality that holds under any distribution on the community labels. Many of our techniques can be applied more generally to other high-dimensional inference problems, including matrix and tensor factorization.
1.2 Overview of Approach
This paper introduces a multivariate measure of performance, which we refer to as the MMSE matrix:
[TABLE]
In this expression, is the covariance matrix of the -th node’s label after it has been embedded into an -dimensional Euclidean space (where is either or ). We show that the MMSE matrix provides important geometrical information about the uncertainty in the community memberships. While the trace of the MMSE matrix corresponds to standard measures of performance such as the average overlap, the information provided by individual entries in the MMSE matrix can be used to answer more nuanced questions about which of the community relationships can (or cannot) be recovered.
One of the key ideas in this paper is to focus on community detection in the setting where there is additional covariate information about the labels. Specifically, we assume that one has side-information from the signal-plus-noise model:
[TABLE]
where is an positive semidefinite matrix, known as the matrix SNR, and is an matrix with i.i.d. standard Gaussian entries.
The introduction of the signal-plus-noise model plays an important role both for our analysis and for our interpretation of the results. For example, it allows us to leverage the matrix I-MMSE relation [15] to characterize the MMSE matrix in terms of the gradient of the mutual information:
[TABLE]
Remarkably, this relationship holds generally for any joint distribution on the pair . Notice that the matrix MMSE in (1) is obtained by evaluating this expression at .
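As a sanity check on the I-MMSE idea, the well-known scalar version of the relation (the derivative of the mutual information with respect to the SNR equals half the MMSE) can be verified numerically. The sketch below is only a scalar analogue under stated assumptions: a label uniform on {-1, +1}, the channel Y = sqrt(s) X + Z with standard Gaussian Z, and helper names (`mutual_info`, `mmse`) that are not from the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for expectations under a standard Gaussian Z.
_nodes, _weights = hermegauss(80)
_weights = _weights / _weights.sum()

def mutual_info(s):
    """I(X; sqrt(s) X + Z) in nats for X uniform on {-1, +1}."""
    return s - np.sum(_weights * np.log(np.cosh(s + np.sqrt(s) * _nodes)))

def mmse(s):
    """MMSE of X given sqrt(s) X + Z for X uniform on {-1, +1}."""
    return 1.0 - np.sum(_weights * np.tanh(s + np.sqrt(s) * _nodes))

# Compare a central finite difference of I with one half of the MMSE.
s, h = 1.0, 1e-4
derivative = (mutual_info(s + h) - mutual_info(s - h)) / (2 * h)
gap = abs(derivative - 0.5 * mmse(s))
```

The matrix relation used in the paper replaces the scalar derivative by a gradient with respect to the matrix SNR; the check here covers only the one-dimensional special case.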
The signal-plus-noise model also provides a natural way to address non-identifiability issues that arise when the distribution over the labels is invariant to permutations. The key idea is that in the large- limit, an arbitrarily small amount of side-information is sufficient to break the symmetry in the model. Hence, focusing on the double limit
[TABLE]
provides a meaningful and interpretable measure of average performance that bypasses the need to optimize over an equivalence class of permutations.
Section 3 provides formulas for the per-vertex mutual information and MMSE matrix in the large- limit. These formulas are stated for a degree-balanced stochastic block model and can be approximated numerically with arbitrary precision. Numerical simulations are provided in Section 5.
1.3 Notation
We use , to denote the space of symmetric matrices and symmetric positive semi-definite matrices, respectively. Given a symmetric positive semi-definite matrix , we use to denote the unique positive semi-definite square root. Given matrix , the relation means that .
2 Definitions
The community stochastic blockmodel is frequently parameterized in terms of the tuple where is a distribution over communities and is a symmetric matrix such that is the probability of an edge between nodes in communities and . Without loss of generality, the community labels can be embedded into finite dimensional Euclidean space. Two useful representations are considered in Sections 2.1 and 2.2. In Section 2.3 we introduce the degree-balanced SBM, for which we state the remainder of the results in the paper. Lastly, in Section 2.4 we introduce the signal-plus-noise problem, which we leverage to derive the results for community detection.
2.1 Standard Basis Representation
A natural embedding associates the labels with the standard basis vectors in , i.e., the columns of the identity matrix. Under this representation, the expected value of a label vector is a point on the probability simplex. The conditional covariance is defined by
[TABLE]
and the MMSE matrix is defined according to (1). By the data processing inequality for MMSE, this matrix satisfies
[TABLE]
As a consequence, the difference between the MMSE matrix and covariance provides a measure of the difference between the prior and posterior marginals of the labels.
Proposition 1.
Under the standard basis representation, the MMSE matrix satisfies
[TABLE]
Proof.
For each , we can write
[TABLE]
where the first equality follows from the law of total variance and the last step holds because, under the standard basis representation, we have $\mathbb{E}[X_{i\ell} \mid \bm{G}] = \mathbb{P}[X_{i} = e_{\ell} \mid \bm{G}]$. Summing over all and normalizing by completes the proof. ∎
Furthermore, the individual entries of the MMSE matrix also provide information about different recovery tasks. For example, consider the problem of determining whether a label belongs to a subset . If we define , then is a binary random variable indicating whether the -th label belongs to . Summing the entries in the MMSE matrix indexed by the set provides a measure of the average error probability:
[TABLE]
2.2 Whitened Representation
Next, we focus on the setting where the labels are identically distributed with probability vector . The whitened representation is defined to be a set of points in with the property that
[TABLE]
Under the whitened representation, each label vector has zero mean and identity covariance and thus the MMSE matrix satisfies .
Remark 1 (Unique Specification of Whitened Representation).
The whitened representation can be defined explicitly as a function of as follows. Let and apply the Gram-Schmidt process to the vectors to obtain an orthonormal basis for of the form where is . Then, the support of the whitened representation is related to the standard basis vectors according to
[TABLE]
where . This construction is unique and has the useful property that lies in the span of .
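One way to build a set of points with zero mean and identity covariance under a given probability vector is sketched below. It follows the spirit of the remark (an orthonormal basis whose first vector is aligned with the square root of the probability vector), but it uses a QR factorization in place of an explicit Gram-Schmidt pass, so it may differ from the paper's unique construction by a rotation or sign. The function name `whitened_support` and the example vector are assumptions.

```python
import numpy as np

def whitened_support(p):
    """Return a (K-1) x K array whose k-th column is the point for community k."""
    p = np.asarray(p, dtype=float)
    K = len(p)
    # Orthonormalize [sqrt(p), e_1, ..., e_{K-1}]; the first basis vector is
    # parallel to sqrt(p) and the remaining K-1 vectors are orthogonal to it.
    M = np.column_stack([np.sqrt(p), np.eye(K)[:, : K - 1]])
    Qmat, _ = np.linalg.qr(M)
    V = Qmat[:, 1:]                # K x (K-1), orthonormal columns
    return V.T / np.sqrt(p)        # column k equals V^T e_k / sqrt(p_k)

# Hypothetical three-community prior with strictly positive entries.
p = np.array([0.5, 0.3, 0.2])
points = whitened_support(p)
mean = points @ p                        # should be the zero vector
covariance = (points * p) @ points.T     # should be the identity matrix
```

The zero-mean property holds because the columns of `V` are orthogonal to `sqrt(p)`, and the identity covariance follows from the orthonormality of those columns.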
Proposition 2.
If the labels are identically distributed then the MMSE matrix of the whitened representation satisfies
[TABLE]
where denotes the chi-squared divergence.
Proof.
Noting that and using the same approach as in the proof of Proposition 1, we have
[TABLE]
Next, let denote the representation of in the standard basis and observe that
[TABLE]
where we have used (4) and the fact that $\mathbb{E}[\tilde{X}_{i}] = p$. Plugging this expression back into (5) gives the stated result. ∎
For the purposes of analysis, the two representations described above are equivalent in the sense that there is a one-to-one mapping between the MMSE matrix defined under the standard basis representation and the MMSE matrix defined under the whitened representation. For notational convenience we work in the whitened representation.
2.3 Degree-Balanced SBM
The average degree of an SBM corresponds to the expected number of edges for a node chosen uniformly at random and is denoted by . An SBM is said to be degree-balanced if the expected degree of a node does not depend on its community assignments. This condition is equivalent to saying that is proportional to the all ones vector.
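The degree-balance condition is easy to test numerically. The sketch below assumes the condition takes the form "the product of the edge-probability matrix and the community probability vector is proportional to the all-ones vector", which follows from the expected degree of a node in a given community being proportional to the corresponding entry of that product. The names `is_degree_balanced`, `p`, and `Q` and the example values are assumptions.

```python
import numpy as np

def is_degree_balanced(p, Q, tol=1e-12):
    """Check whether every community has the same expected degree."""
    w = np.asarray(Q, dtype=float) @ np.asarray(p, dtype=float)
    # w[k] is proportional to the expected degree of a node in community k.
    return bool(np.max(w) - np.min(w) <= tol)

p = np.array([0.7, 0.3])
# Unequal community sizes can still be balanced if the edge densities compensate.
balanced = is_degree_balanced(p, np.array([[0.2, 0.1], [0.1, 1.0 / 3.0]]))
# Equal within-community densities with unequal sizes are not balanced.
unbalanced = is_degree_balanced(p, np.array([[0.2, 0.1], [0.1, 0.2]]))
```

In the balanced example the smaller community needs a higher within-community edge probability to match the expected degree of the larger one.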
For the purposes of this paper, it is useful to consider a reparameterization of the degree-balanced SBM in terms of the tuple where is the average degree and . Using this parameterization, the entries of are given by
[TABLE]
where are defined as a function of using the procedure described in Remark 1. The tuple is valid only if the entries of are between zero and one.
The matrix quantifies the relative strength of relationships between different communities. The eigenvalue decomposition is given by
[TABLE]
where are real numbers. To simplify the analysis, we will assume throughout that all the eigenvalues are nonzero so that is invertible.
We remark that the definition of signal-to-noise ratio given by Abbe and Sandon [10, Section 2.1] corresponds to . Furthermore, for the special case of communities, the representation of is one-dimensional and the formulation of Lelarge and Miolane [5] is equivalent to ours.
2.4 Signal-Plus-Noise Problem
Our analysis uses properties of the signal-plus-noise model given in (2). Throughout this section we will assume the labels are drawn i.i.d. according to a probability vector with strictly positive entries and are supported on the whitened representation described in Section 2.2. For each , the task of recovering from decouples into independent copies of the problem
[TABLE]
where is supported on with probability vector and is independent Gaussian noise.
Following [15] we define the mutual information function and matrix-valued MMSE function according to
[TABLE]
The gradient and Hessian of are given by [15, Lemma 4]
[TABLE]
where denotes the Kronecker product. We note that these functions can be approximated using numerical integration methods or Monte-Carlo sampling.
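A Monte Carlo approximation of the matrix-valued MMSE function might look like the following sketch. It assumes the single-label channel has the form Y = S^{1/2} X + N with standard Gaussian noise and a label X supported on finitely many points; the function name `mmse_matrix_mc` and its arguments are hypothetical, and the estimator is a plain sample average rather than the paper's implementation.

```python
import numpy as np

def mmse_matrix_mc(support, p, S, num_samples, rng):
    """Estimate E[(X - E[X|Y])(X - E[X|Y])^T] for Y = S^{1/2} X + N.

    support: d x K array whose columns are the support points of X.
    p:       length-K prior probability vector with positive entries.
    S:       d x d positive semi-definite matrix SNR.
    """
    support = np.asarray(support, dtype=float)
    p = np.asarray(p, dtype=float)
    d, K = support.shape
    # Positive semi-definite square root of S via its eigendecomposition.
    evals, evecs = np.linalg.eigh(np.asarray(S, dtype=float))
    S_half = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    # Sample labels and noisy observations.
    idx = rng.choice(K, size=num_samples, p=p)
    X = support[:, idx].T                                   # num_samples x d
    Y = X @ S_half + rng.standard_normal((num_samples, d))
    # Posterior over the K support points for each observation.
    means = S_half @ support                                # d x K
    sq = ((Y[:, None, :] - means.T[None, :, :]) ** 2).sum(axis=2)
    logw = np.log(p)[None, :] - 0.5 * sq
    logw -= logw.max(axis=1, keepdims=True)                 # numerical stability
    post = np.exp(logw)
    post /= post.sum(axis=1, keepdims=True)
    err = X - post @ support.T                              # X minus posterior mean
    return err.T @ err / num_samples

# Two equally likely communities in the whitened representation: points +1 and -1.
rng = np.random.default_rng(0)
support = np.array([[1.0, -1.0]])
p = np.array([0.5, 0.5])
mmse_zero = mmse_matrix_mc(support, p, np.zeros((1, 1)), 5000, rng)
mmse_high = mmse_matrix_mc(support, p, np.array([[4.0]]), 5000, rng)
```

With a zero SNR matrix the posterior equals the prior, so the estimate reduces to the prior covariance; increasing the SNR shrinks the estimate toward zero.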
3 Formulas for Mutual Information and MMSE
Our analysis focuses on a sequence of degree-balanced SBMs where the parameters are fixed as the size of the network scales to infinity. Additionally, we make two assumptions.
Assumption 1 (Diverging Average Degree).
The average degree of the network increases with such that both and tend to infinity.
Assumption 2 (Definite Matrix).
The matrix is either positive definite or negative definite.
Our first result is stated in terms of the potential function defined by
[TABLE]
where is defined by (7). Notice that the first term in the potential function is defined exclusively by the prior distribution of labels whereas the second term is defined exclusively by the matrix . By the matrix I-MMSE relation [15], it can be verified that every stationary point of satisfies the fixed-point equation
[TABLE]
where is defined by (8). Noting that , we see that is always a stationary point. Furthermore, every solution of (12) belongs to the set .
Theorem 1.
[TABLE]
where is given in (11).
The next result provides an upper bound on the mutual information in the setting where side information is generated according to the signal-plus-noise model (2) parameterized by a positive semi-definite matrix . To characterize this setting, we define the modified potential function:
[TABLE]
Notice that the main difference from (11) is that the side information changes the prior information about the labels.
Theorem 2.
Suppose that is generated according to the signal-plus-noise model (2) with matrix . Under Assumption 1,
[TABLE]
where is given in (13).
Remark 2.
Similar to previous work [3, 4, 5, 8, 6, 7], our proofs of Theorems 1 and 2 use a channel universality argument to relate the community detection problem to a low-rank estimation problem. Assumption 2 is needed for the proof of Theorem 1, which leverages [5, Theorem 12]. To prove Theorem 2 we develop a novel variation of the Guerra interpolation method that exploits the matrix I-MMSE relationship [15] to provide a general and non-asymptotic upper bound.
Next, we recall that, by the data processing inequality, the MMSE matrix satisfies
[TABLE]
for all . For any fixed problem size , the difference between these matrices converges to zero as . However, in the large- limit it is possible that the limiting behavior is discontinuous with respect to . This can occur, for example, when the SBM is invariant to permutations of the labels and hence . The presence of side-information with an arbitrarily small positive definite matrix is sufficient to break the permutation invariance, and thus the small- limit provides a meaningful measure of recovery performance that overcomes the non-identifiability issues.
The following result follows from the matrix I-MMSE relation and Theorems 1 and 2. The proof is given in Appendix A.3.
Theorem 3.
Consider Assumptions 1 and 2. For every ,
[TABLE]
where denotes any minimizer of . In other words,
[TABLE]
where denotes a sequence of symmetric matrices that converges to zero as .
The numerical experiments of Section 5 suggest that the upper bounds in Theorem 2 are asymptotically tight, i.e., that the MMSE matrix satisfies
[TABLE]
for almost all , where is the unique minimizer of .
The next result provides an asymptotic lower bound on the problem of estimating , which implies a lower bound on . The proof is given in Appendix A.4.
Theorem 4.
[TABLE]
where . Furthermore, this implies that
[TABLE]
4 Implications for Weak Recovery
Broadly speaking, weak recovery refers to the ability to produce an estimate that is positively correlated with the ground truth. In the context of community detection, the precise definition of weak recovery is a bit more nuanced due to the fact that symmetries in the problem formulation can result in a posterior distribution that is invariant to permutations of the labels. As a specific example, consider the two-community degree-balanced SBM where each community is equally likely. Even if an estimator can partition the nodes into two groups such that all of the nodes in each group belong to the same community, it is impossible to determine which label should be assigned to which group.
One approach that is taken in the literature to address this nonidentifiability assesses the performance of an estimator after choosing a permutation of the labels that leads to the best performance; see e.g., [10, Section 2]. Another approach focuses on the related problem of estimating the pairwise interaction terms . Specifically, weak recovery with respect to the pairwise interactions is possible if
[TABLE]
where $\operatorname{\mathsf{MMSE}}(X_{i}^{T}RX_{j}\mid\bm{G})\triangleq\mathbb{E}_{\bm{G}}\left[\operatorname{\mathsf{Var}}(X_{i}^{T}RX_{j}\mid\bm{G})\right]$. Notice that under the whitened basis representation we propose, and this condition is equivalent to
[TABLE]
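The first approach mentioned above, scoring an estimate after the best relabeling, can be sketched as follows. The helper name `best_permutation_accuracy` and the toy labels are assumptions; the brute-force search over permutations is only practical for a small number of communities.

```python
import itertools
import numpy as np

def best_permutation_accuracy(truth, estimate, num_labels):
    """Fraction of correctly labeled nodes after the best relabeling of the estimate."""
    truth = np.asarray(truth)
    estimate = np.asarray(estimate)
    best = 0.0
    for perm in itertools.permutations(range(num_labels)):
        relabeled = np.asarray(perm)[estimate]   # apply the relabeling to every node
        best = max(best, float(np.mean(relabeled == truth)))
    return best

# A perfect partition with the two labels swapped.
truth = np.array([0, 0, 0, 1, 1, 1])
swapped = 1 - truth
raw_accuracy = float(np.mean(swapped == truth))
matched_accuracy = best_permutation_accuracy(truth, swapped, 2)
```

In this toy case the raw agreement is zero even though the partition is perfect, while the permutation-matched score is one, which is the ambiguity the two-community example describes.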
Following the approach taken in this paper, we see that a natural alternative is to focus on the small- behavior of the MMSE matrix. In particular, we say that weak recovery is possible if
[TABLE]
In view of these definitions, we see that Theorem 3 and Theorem 4 provide necessary and sufficient conditions for weak recovery, depending on whether the potential function has a unique minimizer at zero.
Theorem 5 (Weak Recovery).
Consider Assumptions 1 and 2. If has a minimizer that is not equal to zero then weak recovery in the sense of (15) is possible. Conversely, if has a unique minimizer at zero, then weak recovery in the sense of (14) is not possible.
Evaluating the Hessian of the potential function at zero provides a simple test to determine whether is a local minimum. Using (10), it can be shown that
[TABLE]
Therefore, if then is not a local minimizer.
5 Numerical Experiments
This section compares the asymptotic bounds given in Section 3 with the MSE obtained using belief propagation (BP). The case of the three-community degree-balanced SBM is illustrated in Figure 1. The black contour lines correspond to the trace of where is the global minimizer of the potential function defined in (11). The heat map values correspond to the empirical MSE of the BP algorithm described in [2] applied to a network of size with average degree . Each pixel is the median of eight independent trials and the MSE is measured with respect to the whitened basis representation. In each trial, the BP algorithm is run using fifteen different random initializations and the MSE is assessed based on the initialization that produces the lowest predicted MSE.
In the case of uniform community assignments (Figure 1(a)), the weak recovery limit for acyclic BP [10] is equal to our upper bound on the weak detection threshold. Furthermore, we see that there is a close correspondence between the asymptotic formula and the empirical results. Note that the special case corresponds to the three-community symmetric SBM.
In the case of non-uniform community assignments (Figure 1(b)), there exists a region of the parameter space where weak recovery is possible with . The existence of such a region has been shown previously in the special case of the two-community asymmetric SBM [4]. We also see that the asymptotic formulas match the empirical behavior qualitatively, although the empirical MSE is worse than is suggested by the formulas. The grey region in Figure 1(b) corresponds to settings where does not define a valid SBM.
Numerical Approximation of Formulas
We use Monte Carlo sampling to approximately evaluate the functions and , and we use the concave-convex procedure [16] to explore the local minima of the potential function. Starting from an initialization point , a sequence of iterates is obtained according to
[TABLE]
where is a dampening parameter.
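The flavor of such a damped iteration can be illustrated with a scalar analogue. The sketch below is not the matrix update used in the paper: it assumes the scalar fixed-point equation q = 1 - mmse(snr * q) that arises in rank-one symmetric matrix estimation with a label uniform on {-1, +1}, and the names `mmse_binary`, `damped_iteration`, `snr`, and `damping` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for expectations under a standard Gaussian Z.
_nodes, _weights = hermegauss(80)
_weights = _weights / _weights.sum()

def mmse_binary(gamma):
    """MMSE of X uniform on {-1, +1} observed through sqrt(gamma) X + Z."""
    return 1.0 - np.sum(_weights * np.tanh(gamma + np.sqrt(gamma) * _nodes))

def damped_iteration(snr, q0=0.5, damping=0.5, num_iters=300):
    """Iterate q <- (1 - damping) * q + damping * (1 - mmse(snr * q))."""
    q = q0
    for _ in range(num_iters):
        target = 1.0 - mmse_binary(snr * q)
        q = max((1.0 - damping) * q + damping * target, 0.0)  # keep q nonnegative
    return q

q_low = damped_iteration(0.5)    # below the scalar threshold snr = 1
q_high = damped_iteration(2.0)   # above the scalar threshold snr = 1
residual = abs(q_high - (1.0 - mmse_binary(2.0 * q_high)))
```

In this scalar setting the iterate collapses to zero when the SNR is below one and settles at a strictly positive fixed point above it; damping trades speed for stability, which matters more in the matrix case where several stationary points can coexist.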
6 Main Steps in Proof
This section provides an overview of the main theoretical results of the paper. These results are described in the context of a more general inference problem where the goal is to estimate a random matrix . The setting of the -community degree-balanced SBM described in Section 3 corresponds to the special case where and the rows of are drawn i.i.d. from the whitened distribution described in Section 2.2.
6.1 Equivalence between Observation Models
The high-level idea behind our approach is to establish an equivalence between three different observation models. The first observation model is the signal-plus-noise model given by:
[TABLE]
where and is an standard Gaussian matrix, i.e., the entries are i.i.d. .
To describe the second observation model, we first define the symmetric random matrix
[TABLE]
where . Then, the observations are given by
[TABLE]
where and is an standard Gaussian Wigner matrix, i.e. a symmetric matrix whose entries above the diagonal are i.i.d. and whose entries on the diagonal are i.i.d. .
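A Gaussian Wigner matrix can be sampled by symmetrizing a matrix of independent standard Gaussians. The variances are not legible in the text above, so the sketch assumes the common normalization with variance one above the diagonal and variance two on the diagonal; the function name `gaussian_wigner` is also an assumption.

```python
import numpy as np

def gaussian_wigner(n, rng):
    """Symmetric Gaussian matrix: variance 1 off the diagonal, variance 2 on it (assumed)."""
    G = rng.standard_normal((n, n))
    # (G_ij + G_ji) / sqrt(2) has variance 1 for i != j and variance 2 for i == j.
    return (G + G.T) / np.sqrt(2.0)

rng = np.random.default_rng(0)
W = gaussian_wigner(300, rng)
off_diag_var = float(np.var(W[np.triu_indices(300, k=1)]))
```

Symmetrizing in this way keeps the entries above the diagonal independent of one another, since each depends on a distinct pair of entries of `G`.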
For the last model, the observations consist of an -node simple graph, which is represented by its adjacency matrix . By convention the diagonal entries are set to zero and the off-diagonal entries are given by if there is an edge between nodes and and zero otherwise. Our results apply to the setting where the entries of the adjacency matrix are drawn independently conditional on according to
[TABLE]
where parameterizes the expected number of edges.
Notice that both (18) and (19) consist of elementwise observations of from a fixed output channel. The following result provides a link between the mutual information in these observation models. The proof is given in Appendix B.
Theorem 6 (Channel Universality).
Let be a symmetric random matrix with bounded entries and finite support of cardinality . Let be drawn according to (18) with and be drawn according to (19). Given any , there exists a constant such that
[TABLE]
uniformly for all integers and .
Remark 3.
The concept of channel universality appeared in the work of Korada and Montanari [17] and was subsequently developed in the context of community detection [3, 4, 5] and low-rank matrix estimation [7, 6, 8]. In relation to this work, the contribution of Theorem 6 is that it holds under more general assumptions on both and the average degree .
Theorem 6 implies that the joint information in about is asymptotically equivalent to the joint information in about .
Corollary 7.
Let be drawn according to the degree-balanced SBM with parameters where and are fixed and scales with such that both and tend to infinity. Let be drawn according to (16) and let be drawn according to (18) with and . Then,
[TABLE]
Proof.
Combining the chain rule for mutual information with the Markov structure in leads to
[TABLE]
By assumption, has finite support of cardinality and bounded entries. This implies that has finite support of cardinality and bounded entries where the constant depends only on . For every realization of , Theorem 6 implies that there is a constant such that
[TABLE]
for all and sufficiently large. The stated result then follows from Jensen’s inequality and the assumptions on and . ∎
6.2 Interpolation via Mutual Information
Theorem 6 provides a link between community detection and symmetric matrix estimation. The next step in our analysis is to study an interpolating function that transitions smoothly from the symmetric matrix model to the signal-plus-noise model. We note that a number of approaches have been developed in the statistical physics literature, including Guerra’s interpolation method [18] and the adaptive interpolation method [19]. In this paper we consider an approach inspired by the work of Reeves [20], which leverages the functional properties of mutual information in Gaussian channels.
The central object of interest is the mutual information function defined by
[TABLE]
This function has a number of useful properties. Combining the chain rule for mutual information with the Markov structure in allows us to write
[TABLE]
Hence, the special cases and are given by
[TABLE]
In this way, provides a bridge between the symmetric matrix estimation problem, with or without side information, and the signal-plus-noise problem. Notice that if the rows of are independent, then . In particular, if the rows are drawn i.i.d. from a distribution on (as is assumed in Theorem 2) then is equal to the mutual information function introduced in Section 2.4.
It was previously shown that possesses several desirable properties: it is concave and twice differentiable in the pair [15, Lemma 4]. Let the partial gradients with respect to the first and second arguments be denoted by and , respectively. By the matrix I-MMSE relation, it follows that:
[TABLE]
The details of this derivation are given in Appendix D.3.
The next result provides a non-asymptotic upper bound on in terms of the signal-plus-noise model. Remarkably, the only restriction on is that it has finite fourth moments. The proof is given in Section 6.3.
Theorem 8.
Let be a random matrix with finite fourth moments and let where is invertible. For all and , the mutual information function defined in (20) satisfies
[TABLE]
where $\Gamma=\frac{1}{n}\mathbb{E}\left[\bm{X}^{T}\bm{X}\right]$.
If the rows of are sufficiently uncorrelated then the term $\frac{1}{n^{2}}\mathbb{E}\left[\left\|R\bm{X}^{T}\bm{X}-R\,\mathbb{E}\left[\bm{X}^{T}\bm{X}\right]\right\|_{F}^{2}\right]$ converges to zero in the large- limit. The case of i.i.d. rows is summarized as follows:
Corollary 9.
Let be a random matrix whose rows are drawn i.i.d. from a distribution on with finite fourth moments and let where is invertible. For all and , the mutual information function defined in (20) satisfies
[TABLE]
Proof.
Noting that is the sum of i.i.d. matrices leads to
[TABLE]
which converges to zero as increases to infinity. ∎
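The decay of this term can be observed empirically. The sketch below estimates the expected squared Frobenius deviation of the normalized Gram matrix from its mean for rows drawn i.i.d. from a finite support, taking the matrix that multiplies the Gram matrix in the theorem to be the identity for simplicity. The function name `mean_sq_deviation`, the support points, and the sample sizes are assumptions.

```python
import numpy as np

def mean_sq_deviation(n, support, p, num_trials, rng):
    """Monte Carlo estimate of E || X^T X / n - Gamma ||_F^2 for i.i.d. rows."""
    support = np.asarray(support, dtype=float)   # d x K, columns are support points
    p = np.asarray(p, dtype=float)
    gamma = (support * p) @ support.T            # E[x x^T] for a single row x
    total = 0.0
    for _ in range(num_trials):
        idx = rng.choice(len(p), size=n, p=p)
        X = support[:, idx].T                    # n x d matrix with i.i.d. rows
        total += np.sum((X.T @ X / n - gamma) ** 2)
    return total / num_trials

rng = np.random.default_rng(0)
support = np.array([[1.0, 0.0, -1.0],
                    [0.0, 1.0, -1.0]])
p = np.array([0.5, 0.3, 0.2])
dev_small_n = mean_sq_deviation(100, support, p, 200, rng)
dev_large_n = mean_sq_deviation(10000, support, p, 200, rng)
```

Because the normalized Gram matrix is an average of n i.i.d. matrices, the deviation is expected to shrink roughly in proportion to 1/n, so the estimate at the larger size should be far smaller than at the smaller size.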
Combining Corollary 7 and Corollary 9 leads directly to an upper bound on the mutual information in the community detection problem (Theorem 2). The details of the proof are given in Appendix A.2. To show that this bound is tight requires significantly more work. In this direction, we build upon the work of Lelarge and Miolane [5, Theorem 12], who give an explicit characterization of the large- limit for the matrix estimation problem in the setting where . Although their result is stated originally for the special case where is the identity matrix, it extends to the case described below, where is definite. For completeness a detailed mapping between their statement of this result and the one used in this paper is provided in Appendix C.
Theorem 10 (Lelarge and Miolane [5, Theorem 12]).
Let be a random matrix whose rows are drawn i.i.d. from a distribution on with finite second moments and let where is either positive definite or negative definite. For all , the mutual information function defined in (20) satisfies
[TABLE]
6.3 Proof of Theorem 8
The first step in the proof is given by the following lemma, which establishes a functional relationship between the first and second partial gradients of .
Lemma 11.
The gradients of the function defined in (20) satisfy
[TABLE]
where is defined according to
[TABLE]
with $\Gamma=\frac{1}{n}\mathbb{E}\left[\bm{X}^{T}\bm{X}\right]$.
Proof.
Based on the analysis of the MMSE matrix of a linear Gaussian channel with matrix input (Appendix D.2) and the partial derivatives of the mutual information function in symmetric matrix estimation (Appendix D.3) we obtain
[TABLE]
where and are conditionally independent draws from the posterior distribution of given . Comparing these expressions with the definition of leads to
[TABLE]
Noticing that this expression is non-negative completes the proof. ∎
The next step in our analysis is to focus on the convex conjugate (or Legendre–Fenchel transform) of . Specifically, we define the extended real-valued function according to
[TABLE]
Here, we have introduced the factor of one half in so that the dual variable can be associated with the MMSE matrix. The function is convex because it is the pointwise maximum of affine functions. By the Fenchel–Moreau theorem (see e.g., [21, Theorem 13.37]), the fact that is a proper upper-semicontinuous concave function implies that the Legendre–Fenchel transform is a bijection, and thus
[TABLE]
where
Working with the transformed representation allows us to convert the functional constraint on the partial derivatives given in Lemma 11 into an upper bound on the convex conjugate.
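To make the conjugate pair concrete, the following sketch works through a scalar stand-in numerically. It assumes a conjugate of the form f*(m) = sup_s { f(s) - s m / 2 }, matching the factor of one half mentioned above, and uses f(s) = (1/2) log(1 + s), the mutual information of a standard Gaussian input, as a hypothetical example. The grids and all variable names are illustrative and are not taken from the paper.

```python
import numpy as np

# Hypothetical scalar stand-in for the mutual information function:
# a standard Gaussian input at SNR s gives f(s) = 0.5 * log(1 + s).
def f(s):
    return 0.5 * np.log1p(s)

s_grid = np.linspace(0.0, 50.0, 5001)   # candidate SNR values
m_grid = np.linspace(0.02, 1.0, 491)    # candidate dual (MMSE-like) values

# Assumed conjugate with the factor one half:
#   f_star(m) = sup_s { f(s) - s * m / 2 }, approximated by a grid maximum.
f_star = np.max(
    f(s_grid)[:, None] - 0.5 * s_grid[:, None] * m_grid[None, :], axis=0
)

# Closed form for this example: the supremum is attained at s = 1/m - 1.
f_star_exact = 0.5 * (m_grid - 1.0 - np.log(m_grid))

# Biconjugate: f(s) = inf_m { f_star(m) + s * m / 2 }, checked for s in [0, 5].
s_test = np.linspace(0.0, 5.0, 51)
f_back = np.min(
    f_star[None, :] + 0.5 * s_test[:, None] * m_grid[None, :], axis=1
)

conj_err = float(np.max(np.abs(f_star - f_star_exact)))
biconj_err = float(np.max(np.abs(f_back - f(s_test))))
print(conj_err, biconj_err)
```

In this example the minimizing dual variable is m = 1/(1 + s), which is exactly the scalar Gaussian MMSE at SNR s. This is the sense in which the dual variable can be read as an MMSE.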
Lemma 12.
For all we have
[TABLE]
where is defined in (24).
Proof.
The assumption that combined with the fact that is non-increasing in the Loewner partial order ensures that the supremum with respect to in (25) is attained on at least one point . By the Karush–Kuhn–Tucker conditions, the gradient with respect to evaluated at this point satisfies
[TABLE]
Next, we note that is non-decreasing with respect to the Loewner partial order. To see why, observe that for any , we have .
We now employ the envelope theorem [22], which implies that is absolutely continuous in with
[TABLE]
The integrand in this expression can be upper bounded as follows:
[TABLE]
The first inequality is due to Lemma 11 and the second inequality follows from (28) and the fact that is non-decreasing. Plugging this inequality back into (29) completes the proof. ∎
We now have all the ingredients needed for the proof of Theorem 8. Starting with (26) and then applying the bound in Lemma 12 allows us to write
[TABLE]
Note that this is a variational upper bound in terms of the dual variable , which corresponds to the MMSE matrix. To rewrite this expression in terms of an infimum over the signal-to-noise matrix, we define the function according to
[TABLE]
Then, a straightforward calculation shows that is the concave conjugate of in the following sense:
[TABLE]
for all . Plugging this characterization of back into (31), and then swapping the order of the infimum with respect to and leads to
[TABLE]
where the final equality follows from (26). This concludes the proof of Theorem 8.
7 Discussion
The results presented in this paper recast the community detection problem as a multivariate problem making it possible to evaluate more than just traditional overall recovery tasks. By evaluating the formulas derived in Section 3 we can now differentiate between the tasks of finding one community, all communities, and a subset of communities within a network. The formulas further allow us to identify a computational gap for regimes where certain recovery tasks should be theoretically attainable but where algorithms such as BP will fail to perform.
Acknowledgment
The authors thank Lenka Zdeborová for providing initial direction on this problem and Jiaming Xu for helpful discussion regarding channel universality. This work was supported in part by funding from the Laboratory for Analytic Sciences (LAS) and by the NSF under Grant No. 1750362. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
Appendix A Proofs of Results in Section 3
A.1 Proof of Theorem 1
Combining Corollary 7 and Theorem 10 with yields
[TABLE]
for any random matrix whose rows are drawn i.i.d. from a distribution on with finite and bounded support. Under the assumption that the rows are supported on the whitened representation described in Section 2.2, it follows that $\mathbb{E}\left[XX^{T}\right]=I$. Furthermore, it can be verified that the infimum with respect to is attained on the compact set and thus the use of a minimum is justified. This concludes the proof of Theorem 1.
A.2 Proof of Theorem 2
Combining Corollary 7 and Corollary 9 with yields
[TABLE]
for any and random matrix whose rows are drawn i.i.d. from a distribution on with finite and bounded support. Under the assumption that the rows are supported on the whitened representation described in Section 2.2, it follows that $\mathbb{E}\left[XX^{T}\right]=I$. Furthermore, it can be verified that the infimum with respect to is attained on the compact set and thus the use of a minimum is justified. This concludes the proof of Theorem 2.
A.3 Proof of Theorem 3
The key idea underlying this proof is to exploit the integral form of the matrix I-MMSE relationship, which gives
[TABLE]
for any differentiable path with and . Combining Theorems 1 and 2 provides an upper bound on the leading order terms of the left-hand side of this expression in the large- limit. We will show that this upper bound implies an asymptotic upper bound on the matrix MMSE with respect to the Loewner partial order.
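As a sanity check on the integral form, the sketch below evaluates both sides for a Gaussian input, where closed forms are available. It assumes an input X ~ N(0, Sigma) observed as Y = S^{1/2} X + N, for which I(S) = (1/2) log det(I + Sigma S) and the MMSE matrix is (Sigma^{-1} + S)^{-1}. The linear path from S0 = 0 to S1, the helper names, and the quadrature order are illustrative choices, not part of the paper.

```python
import numpy as np

def mutual_info(S, Sigma):
    # I(S) = 0.5 * log det(I + Sigma S) for X ~ N(0, Sigma), Y = S^{1/2} X + N.
    d = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(d) + Sigma @ S)[1]

def mmse_matrix(S, Sigma):
    # MMSE matrix of the Gaussian input: (Sigma^{-1} + S)^{-1}.
    return np.linalg.inv(np.linalg.inv(Sigma) + S)

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)      # positive definite input covariance
B = rng.standard_normal((d, d))
S0 = np.zeros((d, d))                # start of the path
S1 = B @ B.T / d                     # end of the path (positive semidefinite)

# Gauss-Legendre nodes mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(200)
t = 0.5 * (nodes + 1.0)
w = 0.5 * weights

# Integrate 0.5 * tr(S'(t) M(S(t))) along the linear path S(t) = S0 + t (S1 - S0).
dS = S1 - S0
path_integral = sum(
    wk * 0.5 * np.trace(dS @ mmse_matrix(S0 + tk * dS, Sigma))
    for tk, wk in zip(t, w)
)
info_difference = mutual_info(S1, Sigma) - mutual_info(S0, Sigma)
print(path_integral, info_difference)
```

Because the integrand is the derivative of I along the path, the result does not depend on which differentiable path is chosen between the two endpoints.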
To simplify notation we let and define the functions and according to
[TABLE]
For all , the upper bound on the mutual information in Theorem 1 combined with the exact limit in Theorem 2 allows us to write
[TABLE]
The next step is to show that (33) implies an upper bound for the gradient for all positive definite . Let . For every , and , we can write
[TABLE]
where the inequality holds because for all and is non-increasing with respect to the Loewner partial order. Meanwhile, we note that is concave because it is the pointwise infimum of concave functions. By the envelope theorem [22], the supergradient of at is the closure of the set \{\frac{1}{2}M_{X}(\Delta)\,:\,\text{\Deltaf}\}. Hence,
[TABLE]
where denotes any matrix in the supergradient of at . Combining (33), (34), and (35) leads to
[TABLE]
for all and
The final step in the proof is to show that (36) implies an upper bound on the maximum eigenvalue of . To proceed, observe that the set is compact, and thus for every there exists an integer and a set of matrices such that . Therefore, the maximum eigenvalue can be upper bounded as follows:
[TABLE]
By (36), the limit superior of the first term on the right-hand side is non-positive. Meanwhile, the gradient is bounded uniformly with respect to and . Noting that can be chosen arbitrarily small completes the proof of Theorem 3.
A.4 Proof of Theorem 4
Given , let where is a standard Gaussian Wigner matrix. Starting with the I-MMSE relation in (62), we obtain, for all ,
[TABLE]
where the inequality holds because the integrand is non-increasing in . To characterize the asymptotic limit of the left-hand side, we start with Theorem 6 and use the same steps that led to Corollary 7 to obtain
[TABLE]
where and are conditionally independent given . By [15, Lemma 2], the information provided by two independent Gaussian observations can be expressed in terms of a single observation according to . Thus we can apply Theorem 10 to obtain
[TABLE]
where
[TABLE]
Putting the above pieces together, we obtain
[TABLE]
for all .
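The step above that merges two independent Gaussian observations into a single one can be illustrated for a Gaussian input. The sketch assumes the reduction takes the usual additive form, in which observations at signal-to-noise matrices S_a and S_b carry the same information as one observation at S_a + S_b; the exact statement of [15, Lemma 2] is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric square root of a positive semidefinite matrix.
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_info(H, Sigma):
    # I(X; H X + N) = 0.5 * log det(I + H Sigma H^T) for X ~ N(0, Sigma)
    # and standard Gaussian noise N.
    k = H.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(k) + H @ Sigma @ H.T)[1]

rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)
Ga = rng.standard_normal((d, d))
Gb = rng.standard_normal((d, d))
S_a = Ga @ Ga.T
S_b = Gb @ Gb.T

# Two independent observations correspond to stacking the two channel matrices.
H_two = np.vstack([psd_sqrt(S_a), psd_sqrt(S_b)])
# A single observation at the combined signal-to-noise matrix.
H_one = psd_sqrt(S_a + S_b)

info_two = gaussian_info(H_two, Sigma)
info_one = gaussian_info(H_one, Sigma)
print(info_two, info_one)
```

The agreement follows because both channels share the same Gram matrix H^T H = S_a + S_b, and the Gaussian mutual information depends on H only through that product.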
Next, we consider the limiting behavior of the right-hand side of (38) as decreases to zero. Observe that the gradients of the potential function are given by
[TABLE]
Let . Starting with the envelope theorem [22], we have
[TABLE]
where the last step holds because every is a stationary point of and thus satisfies .
Combining Lemma 11, evaluated with , with the assumption \frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[\bm{X}\bm{X}^{T}}\right]=I gives
[TABLE]
where the second term on the right-hand side converges to zero in the large- limit by the law of large numbers.
Combining this inequality with (38) and (41) gives
[TABLE]
Rearranging the terms completes the proof.
Appendix B Proof of Theorem 6
Recalling that is a symmetric matrix with zeros on the diagonal and entries above the diagonal drawn according to (19), we can write . Meanwhile, the fact that is symmetric allows us to write
[TABLE]
where denotes the diagonal entries of . By the chain rule for mutual information and the conditional independence of given , the second term on the right-hand side of (43) can be upper bounded as follows:
[TABLE]
where the second inequality follows from the assumption and the capacity of the additive Gaussian noise channel. In the following, we compare with the first term on the right-hand side of (43).
To simplify notation, let and let , and denote the -dimensional vectors obtained by stacking the columns above the diagonal in , , and , respectively. The mutual information terms of interest can then be expressed as
[TABLE]
where is the conditional distribution of corresponding to a realization of and $D\left(P\,\|\,Q\right)$ denotes the relative entropy between distributions $P$ and $Q$. Our approach is to prove that the inequality
[TABLE]
holds uniformly for all satisfying . The desired result for the mutual information then follows from Jensen’s inequality.
B.1 Proof of Inequality (44)
Condition on a realization of and let . Let be the shifted distribution defined by and let denote the support of . For each , we define the log likelihood ratio according to
[TABLE]
Using this notation, the relative entropy can be written as
[TABLE]
where the expectation is with respect to . The score function associated with is the -dimensional random vector given by and the Fisher information matrix associated with is the positive semidefinite matrix given by $\mathcal{I}\triangleq\operatorname{\mathsf{Cov}}(V)=-\mathbb{E}\left[\nabla^{2}\mathcal{L}(0)\right]$. Under the Bernoulli observation model in (19), the entries of are independent and given by
[TABLE]
and the Fisher information matrix is diagonal with
[TABLE]
To proceed, we define two different approximations to the relative entropy in (45) according to
[TABLE]
where is a Gaussian random vector with the same mean and covariance as the score function . By the triangle inequality,
[TABLE]
The terms on the right-hand side are upper bounded in the following lemmas. The notation means that there is a universal constant such that , and the notation means that there is a constant such that
Lemma 13.
We have
[TABLE]
Proof.
Let be the zero-mean random vector defined by where denotes the second partial derivative with respect to , and let be the random process given by . The second-order Taylor series expansion of about the point can be expressed as
[TABLE]
where is the remainder term. In view of (45) and the definition of , it follows that
[TABLE]
We first consider the expected supremum of . By Taylor’s theorem, there exists a vector between zero and such that
[TABLE]
Direct computation reveals that . Noting that for all , one obtains the uniform upper bound
[TABLE]
Combining (48) and (49) with the fact that and leads to
[TABLE]
Next, we consider the expected supremum of . Under the Bernoulli observation model in (19), the entries of are independent and a straightforward calculation shows that there exist numbers
[TABLE]
such that $\mathbb{E}\left[\left|A_{i}\right|^{2}\right]\leq\nu$ and $\left|A_{i}\right|\leq c$ almost surely. By Bernstein’s Inequality [23, Theorem 2.10], it follows that each is a sub-gamma random variable with variance factor and scale factor , i.e., the cumulant generating function satisfies
[TABLE]
for all . Hence, for all and ,
[TABLE]
where the equality follows from the independence of the entries of and the last inequality holds because . An application of the maximal inequality [23, Corollary 2.6] yields
[TABLE]
Combining (52) with and the scalings in (50) and (51) leads to the desired result. ∎
Lemma 14.
We have
[TABLE]
Proof.
Let be defined as . Then, we can write
[TABLE]
where we recall that has independent entries and is a Gaussian vector with the same first two moments as . We bound this difference using the generalized Lindeberg principle [24, Theorem 1.1], which implies that, if there exists a constant such that for each and , then
[TABLE]
From (46) and (47) it can be verified that the third moments satisfy
[TABLE]
Meanwhile, if we let be a -valued random vector drawn according to the measure
[TABLE]
then the partial derivatives of can be expressed as
[TABLE]
Noting that for all we see that $|\partial^{3}_{i}\Phi(v)|=O_{B}\left(n^{-3/2}\right)$. Combining this inequality with (53) and (54) completes the proof. ∎
Lemma 15.
We have
[TABLE]
Proof.
Let be defined as
[TABLE]
where the expectation is with respect to . Then, a straightforward calculation reveals that
[TABLE]
where we recall that is a diagonal matrix given by (47).
Next, we consider the gradient of . Let be the probability measure on defined by
[TABLE]
and observe that
[TABLE]
Using Gaussian integration by parts (Stein’s lemma) in conjunction with the relation
[TABLE]
leads to
[TABLE]
This identity implies that the nuclear norm of the gradient is bounded by
[TABLE]
where the last step holds because for all .
With these results in hand, we can now write
[TABLE]
Finally, from (47), it can be verified that
[TABLE]
which completes the proof. ∎
Appendix C Derivation of Theorem 10
First we observe that if is positive definite then is well defined. Introducing the transformed representation , we can then write
[TABLE]
Note that if is negative definite then the same decomposition holds with . This transformation shows that it is sufficient to focus on the setting where is the identity matrix.
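The reduction to the identity case can be checked numerically. The sketch below assumes the transformed representation is obtained by multiplying the rows by a symmetric square root, written here as X_tilde = X B^{1/2} in the positive definite case and X_tilde = X (-B)^{1/2} with a sign flip in the negative definite case. The symbol B and the variable names are placeholders chosen for this example.

```python
import numpy as np

def psd_sqrt(B):
    # Symmetric square root of a positive semidefinite matrix.
    vals, vecs = np.linalg.eigh(B)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(2)
n, d = 6, 3
X = rng.standard_normal((n, d))
G = rng.standard_normal((d, d))
B_pos = G @ G.T + np.eye(d)          # positive definite coupling matrix
B_neg = -B_pos                       # negative definite coupling matrix

# Positive definite case: X B X^T = X_tilde X_tilde^T.
X_tilde = X @ psd_sqrt(B_pos)
err_pos = float(np.max(np.abs(X @ B_pos @ X.T - X_tilde @ X_tilde.T)))

# Negative definite case: X B X^T = -(X_tilde X_tilde^T) with the root of -B.
X_tilde_neg = X @ psd_sqrt(-B_neg)
err_neg = float(np.max(np.abs(X @ B_neg @ X.T + X_tilde_neg @ X_tilde_neg.T)))
print(err_pos, err_neg)
```

Since the rows of X are i.i.d., the rows of the transformed matrix are i.i.d. as well, so the transformed problem stays within the same model class.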
The result given in [5, Theorem 12] is stated as follows:
[TABLE]
where
[TABLE]
with independent of . To see that this expression is equivalent to the one given in Theorem 10, observe that the mutual information function can be expressed as follows:
[TABLE]
Rearranging terms leads to
[TABLE]
Finally, using the scaling relationship leads to the version of the result stated in Theorem 10.
Appendix D Mutual Information and MMSE in Gaussian Noise
D.1 Linear Gaussian Channel
The scalar I-MMSE relationship [25] asserts that the derivative of the mutual information in a Gaussian noise channel with respect to the inverse noise variance is equal to one half times the MMSE. A recent line of work in the information theory literature has focused on multivariate extensions of this result for linear Gaussian channels [25, 26, 27, 28]. This section briefly reviews some of the results described by the first author and others [15]. Given a random vector the functions and are defined as [15]:
[TABLE]
where with independent Gaussian noise . These functions have a number of important properties. The function is concave [15, Theorem 1] and the matrix version of the I-MMSE relation is given by [15, Lemma 4]. Furthermore, these functions can describe a linear Gaussian channel specified by an arbitrary matrix via the following relationship [15, Lemma 1]:
[TABLE]
where is independent of .
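The matrix I-MMSE relation can be verified by finite differences in the Gaussian case. This sketch assumes X ~ N(0, Sigma) and Y = S^{1/2} X + N, so that I(S) = (1/2) log det(I + Sigma S) and M(S) = (Sigma^{-1} + S)^{-1}, and it checks that the directional derivative of I along a symmetric direction D equals (1/2) tr(D M(S)). The helper names and the step size are illustrative.

```python
import numpy as np

def mutual_info(S, Sigma):
    # I(S) = 0.5 * log det(I + Sigma S) for a Gaussian input with covariance Sigma.
    d = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(d) + Sigma @ S)[1]

def mmse_matrix(S, Sigma):
    # MMSE matrix of the Gaussian input: (Sigma^{-1} + S)^{-1}.
    return np.linalg.inv(np.linalg.inv(Sigma) + S)

rng = np.random.default_rng(3)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)
C = rng.standard_normal((d, d))
S = C @ C.T + np.eye(d)              # positive definite SNR matrix

# Random symmetric perturbation direction with unit Frobenius norm.
D = rng.standard_normal((d, d))
D = 0.5 * (D + D.T)
D /= np.linalg.norm(D)

# Central finite difference of I along D versus the I-MMSE prediction.
eps = 1e-5
finite_diff = (mutual_info(S + eps * D, Sigma)
               - mutual_info(S - eps * D, Sigma)) / (2.0 * eps)
predicted = 0.5 * np.trace(D @ mmse_matrix(S, Sigma))
print(finite_diff, predicted)
```

For d = 1 and Sigma = 1 this reduces to the scalar statement that the derivative of (1/2) log(1 + s) is one half times the MMSE 1/(1 + s).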
D.2 Linear Gaussian Channel with Matrix Input
The properties of the mutual information and MMSE described in Section D.1 extend naturally to the setting where the input is an random matrix and the observations are given by where and is an standard Gaussian matrix. In this setting, we define the functions:
[TABLE]
Using vectorization, the mutual information function can be expressed equivalently as
[TABLE]
where denotes the vector obtained by stacking the columns in and denotes the Kronecker product. From this relationship, one finds that the I-MMSE relation still holds for matrix inputs, that is .
Next, we consider a useful representation of the MMSE matrix . Let and denote conditionally independent draws from the posterior distribution of given . Then, the conditional covariance can be expressed as
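The vectorization step relies on the identity vec(A X B) = (B^T ⊗ A) vec(X) for column-stacking vec. The sketch below assumes the matrix-input channel has the form Y = X S^{1/2} + W with X and W of size n × d, which is one common convention and may differ from the paper's stripped definition; the variable names are illustrative.

```python
import numpy as np

def vec(M):
    # Stack the columns of M into a single vector (column-major order).
    return M.flatten(order="F")

rng = np.random.default_rng(4)
n, d = 5, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((n, d))
G = rng.standard_normal((d, d))
S = G @ G.T                           # positive semidefinite SNR matrix

# Symmetric square root of S.
vals, vecs = np.linalg.eigh(S)
S_half = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

# Matrix form of the assumed channel and its vectorized counterpart.
Y = X @ S_half + W
H = np.kron(S_half, np.eye(n))        # (S^{1/2} ⊗ I_n), since S^{1/2} is symmetric
y_vec = H @ vec(X) + vec(W)

vec_err = float(np.max(np.abs(vec(Y) - y_vec)))
# Effective SNR matrix of the vectorized channel: H^T H = S ⊗ I_n.
gram_err = float(np.max(np.abs(H.T @ H - np.kron(S, np.eye(n)))))
print(vec_err, gram_err)
```

Under this convention the vectorized channel is an ordinary linear Gaussian channel with signal-to-noise matrix S ⊗ I_n, which is why the vector results carry over to matrix inputs.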
[TABLE]
and taking the expectation with respect to gives
[TABLE]
Summing over the indices leads to
[TABLE]
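The two-draw representation of the MMSE can be illustrated with a scalar toy model. The sketch assumes X uniform on {-1, +1} and Y = sqrt(s) X + Z with standard Gaussian Z, so that E[X | Y] = tanh(sqrt(s) Y). For two conditionally independent posterior draws X1 and X2, the standard identity Cov(X | Y) = (1/2) E[(X1 - X2)^2 | Y] gives MMSE = 1 - E[X1 X2]. The function name and the quadrature grid are hypothetical choices made for this example.

```python
import numpy as np

def replica_overlaps(s, half_width=12.0, num_points=2401):
    # Quadrature over the noise Z, conditioning on X = +1 (valid by symmetry).
    z = np.linspace(-half_width, half_width, num_points)
    step = z[1] - z[0]
    weight = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi) * step

    # Posterior mean E[X | Y] evaluated at Y = sqrt(s) + Z.
    post_mean = np.tanh(s + np.sqrt(s) * z)

    overlap_truth = float(np.sum(weight * post_mean))       # E[X * E[X | Y]]
    overlap_replica = float(np.sum(weight * post_mean**2))  # E[X1 * X2]
    mmse = 1.0 - overlap_replica                            # 0.5 * E[(X1 - X2)^2]
    return overlap_truth, overlap_replica, mmse

overlap_truth, overlap_replica, mmse = replica_overlaps(1.0)
print(overlap_truth, overlap_replica, mmse)
```

The two overlaps agree because, conditionally on Y, the true signal has the same distribution as a posterior draw, so the pair (X, X1) can be replaced by (X1, X2) inside the expectation.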
D.3 Symmetric Matrix Estimation
In the symmetric matrix estimation problem, the goal is to estimate an unknown matrix from observations of the form
[TABLE]
where is known and is a standard Gaussian Wigner matrix. In this section, we show that this model can be viewed as a special case of the linear Gaussian channel associated with matrix input given by the tensor product , and thus the mutual information and MMSE can be characterized using the functions introduced in Section D.2.
The first step is to observe that the symmetric noise model given in (59) provides the same information as the following asymmetric noise model:
[TABLE]
where is an standard Gaussian matrix. To see why, note that can be decomposed uniquely in terms of the symmetric matrix and the antisymmetric matrix . By the orthogonal invariance of the Gaussian distribution, the antisymmetric matrix is independent of both and . Noticing that is a standard Gaussian Wigner matrix shows that .
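The decomposition used in this argument can be checked directly. The sketch below splits a standard Gaussian matrix into its symmetric and antisymmetric parts, confirms that the split reconstructs the matrix, and estimates the correlation between corresponding off-diagonal entries of the two parts. The normalization by 2 and the sample size are illustrative, and the paper's scaling convention for the Wigner matrix is not assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
W = rng.standard_normal((n, n))       # asymmetric standard Gaussian noise

sym = 0.5 * (W + W.T)                 # symmetric part
anti = 0.5 * (W - W.T)                # antisymmetric part

# The decomposition is exact: W = sym + anti.
recon_err = float(np.max(np.abs(W - (sym + anti))))

# Entries above the diagonal: W_ij + W_ji and W_ij - W_ji are jointly Gaussian
# with zero covariance, hence independent; the sample correlation should be small.
iu = np.triu_indices(n, k=1)
sample_corr = float(np.corrcoef(sym[iu], anti[iu])[0, 1])
print(recon_err, sample_corr)
```

Because the antisymmetric part is independent of both the signal and the symmetric part, discarding it loses no information about the signal, which is the content of the equivalence between the two noise models.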
The next step is to use vectorization to represent the observation model in (60) as a linear Gaussian channel with matrix input:
[TABLE]
In view of both (56) and (57), the mutual information can be expressed as
[TABLE]
where the first equality holds because is a deterministic function of .
This characterization of the mutual information is useful because it allows us to compute gradients with respect to the matrix . By the I-MMSE relation and the chain rule,
[TABLE]
Furthermore, by (58), the MMSE matrix can be expressed as
[TABLE]
where and denote conditionally independent draws from the posterior distribution of given . Therefore, (61) can be rewritten compactly as
[TABLE]
Finally, if we consider the parameterization for some , then the partial derivative with respect to is given by
[TABLE]
References
- [1] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
- [2] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Physical Review E, vol. 84, no. 6, Dec. 2011.
- [3] Y. Deshpande, E. Abbe, and A. Montanari, “Asymptotic mutual information for the balanced binary stochastic block model,” Information and Inference, vol. 6, no. 2, pp. 125–170, Jun. 2017.
- [4] F. Caltagirone, M. Lelarge, and L. Miolane, “Recovering asymmetric communities in the stochastic block model,” IEEE Transactions on Network Science and Engineering, vol. 5, no. 3, pp. 237–246, 2018.
- [5] M. Lelarge and L. Miolane, “Fundamental limits of symmetric low-rank matrix estimation,” Probability Theory and Related Fields, 2018.
- [6] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, “Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula,” in Advances in Neural Information Processing Systems (NIPS), vol. 29, Barcelona, Spain, 2016, pp. 424–432.
- [7] T. Lesieur, F. Krzakala, and L. Zdeborová, “Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications,” Journal of Statistical Mechanics: Theory and Experiment, Jul. 2017.
- [8] F. Krzakala, J. Xu, and L. Zdeborová, “Mutual information in rank-one matrix estimation,” in Proceedings of the IEEE Information Theory Workshop (ITW), 2016.
