The Geometry of Community Detection via the MMSE Matrix

Galen Reeves; Vaishakhi Mayya; Alexander Volfovsky

arXiv:1907.02496·cs.IT·July 5, 2019


TL;DR

This paper introduces a geometric framework for community detection in networks with variable community sizes, using an effective signal-to-noise ratio matrix to characterize detection limits and improve understanding of real-world network behaviors.

Contribution

It extends existing models by incorporating community variability and develops a matrix-based geometric approach to analyze detection limits, generalizing previous scalar SNR concepts.

Findings

01. Effective SNR matrix characterizes community detectability.

02. Explicit formulas for mutual information and MSE bounds.

03. Numerical simulations validate theoretical predictions.


Equations

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})$$

$$\bm{Y}=\bm{X}S^{1/2}+\bm{N},$$

$$\nabla_{S}I(\bm{X};\bm{G},\bm{Y})$$

$$\lim_{S\to 0}\lim_{n\to\infty}\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y}),$$

$$\operatorname{\mathsf{Cov}}(X_{i}\mid\bm{G})$$

$$0\preceq\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\preceq\operatorname{\mathsf{MMSE}}(\bm{X})\triangleq\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathsf{Cov}}(X_{i}).$$

$$\operatorname{tr}\left(\operatorname{\mathsf{MMSE}}(\bm{X})-\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\right)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\bm{G}}\left[\left\|P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})-P_{X_{i}}(\cdot)\right\|_{2}^{2}\right].$$

$$\operatorname{tr}\left(\operatorname{\mathsf{Cov}}(X_{i})-\mathbb{E}\left[\operatorname{\mathsf{Cov}}(X_{i}\mid\bm{G})\right]\right)=\mathbb{E}\left[\left\|\mathbb{E}\left[X_{i}\mid\bm{G}\right]-\mathbb{E}\left[X_{i}\right]\right\|^{2}\right]=\mathbb{E}\left[\left\|P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})-P_{X_{i}}(\cdot)\right\|^{2}\right],$$

$$\bm{1}_{A}^{T}\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\bm{1}_{A}$$

$$\sum_{\ell}p_{\ell}\mu_{\ell}=0,\qquad\sum_{\ell}p_{\ell}\mu_{\ell}\mu_{\ell}^{T}=I_{k-1}.$$

$$\mu_{\ell}$$

$$\operatorname{tr}\left(I-\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\right)$$

$$\left\|\mathbb{E}\left[X_{i}\mid\bm{G}\right]-\mathbb{E}\left[X_{i}\right]\right\|_{2}^{2}=\sum_{\ell=1}^{k}\left(\frac{1}{\sqrt{p_{\ell}}}\mathbb{P}\left[\tilde{X}_{i}=e_{\ell}\mid\bm{G}\right]-\sqrt{p_{\ell}}\right)^{2}=\chi^{2}\left(P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})\,\|\,P_{X_{i}}(\cdot)\right),$$

$$Q_{ab}=\frac{d}{n}+\frac{d(1-d/n)}{n}\mu_{a}^{T}R\mu_{b},$$

$$R=U\operatorname{diag}(\lambda)U^{T},$$

$$Y=S^{1/2}X+N,$$

$$I_{X}(S)$$

$$M_{X}(S)$$

$$\nabla_{S}I_{X}(S)$$

$$\nabla_{S}^{2}I_{X}(S)$$

$$\mathcal{F}(\Delta)$$

$$M_{X}(\Delta)$$

$$\lim_{n\to\infty}\frac{1}{n}I(\bm{X};\bm{G})=\min_{\Delta\in\mathbb{S}_{+}^{k-1}}\mathcal{F}(\Delta),$$

$$\mathcal{F}(\Delta,S)$$

$$\limsup_{n\to\infty}\frac{1}{n}I(\bm{X};\bm{G},\bm{Y})\leq\min_{\Delta\in\mathbb{S}_{+}^{k-1}}\mathcal{F}(\Delta,S).$$

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\succeq\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y}),$$

$$\limsup_{n\to\infty}\lambda_{\mathrm{max}}\left(\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})-M_{X}(\Delta^{*})\right)\leq 0$$

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})\preceq M_{X}(\Delta^{*})+o_{n}(1),$$

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})=M_{X}(S+\Delta^{*})+o_{n}(1)$$


The Geometry of Community Detection

via the MMSE Matrix

Galen Reeves

Vaishakhi Mayya

Alexander Volfovsky

G. Reeves is with the Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University, Durham, NC 27708 USA. V. Mayya is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA. A. Volfovsky is with the Department of Statistical Science, Duke University, Durham, NC 27708 USA.

Abstract

The information-theoretic limits of community detection have been studied extensively for network models with high levels of symmetry or homogeneity. The contribution of this paper is to study a broader class of network models that allow for variability in the sizes and behaviors of the different communities, and thus better reflect the behaviors observed in real-world networks. Our results show that the ability to detect communities can be described succinctly in terms of a matrix of effective signal-to-noise ratios that provides a geometrical representation of the relationships between the different communities. This characterization follows from a matrix version of the I-MMSE relationship and generalizes the concept of an effective scalar signal-to-noise ratio introduced in previous work. We provide explicit formulas for the asymptotic per-node mutual information and upper bounds on the minimum mean-squared error. The theoretical results are supported by numerical simulations.

1 Introduction

Modern data problems often ask questions about how individuals (or computers or countries) interact or relate to each other within a network. A frequently studied problem in this context is that of community detection: how does one partition a network into clusters (or communities or groups) of nodes? A natural partition of a network is into communities that exhibit similar connection patterns, both within and between communities. A generative model for random networks called the stochastic block model (SBM) exhibits such behavior and hence much of the theoretical analysis of community detection has focused on it [1]. Under the SBM each individual belongs to exactly one of $k$ communities, and the probability of an edge between two individuals is exclusively a function of their community memberships.

The problem of community detection can be modeled in terms of a joint distribution on $(\bm{X},\bm{G})$ where $\bm{G}$ is a simple graph on $n$ vertices and $\bm{X}=(X_{1},\dots,X_{n})$ is a collection of labels associated with the vertices. In the SBM this joint distribution is governed by two parameters: a probability vector $p$ of each node being assigned to one of $k$ labels, and a $k\times k$ matrix of probabilities $Q$ where $Q_{ab}$ is the probability of an edge between nodes in communities $a$ and $b$ . The community detection task is recovering the labels $\bm{X}$ given the graph $\bm{G}$ and potentially side information.

Inspired by the work of Decelle et al. [2], a recent line of work has studied the information-theoretic limits of recovery when the distribution of $(\bm{X},\bm{G})$ is known. Most of this work has focused on either the two-community SBM [3, 4, 5, 6, 7, 8, 9] or the so-called $k$ -community symmetric SBM [7, 10, 11, 12]. In all of these cases, performance is summarized in terms of a single numerical value, which is often referred to as the effective signal-to-noise ratio of the problem. General SBMs have been considered by Abbe and Sandon [10], who characterize conditions for weak recovery, and by Lesieur et al. [7], who analyze the performance of an approximate message passing algorithm.

A different line of research within the statistics community has focused on settings where the parameters of the distribution, such as the distribution of communities and the conditional probabilities of edges, are unknown quantities that must also be inferred, along with the community memberships [13, 14]. While the models considered in this literature are highly flexible, the conditions needed for consistent recovery of communities correspond to a very high SNR regime relative to the information-theoretic analysis.

1.1 Our Contributions

The contribution of this paper is to characterize the information-theoretic limits for a large class of degree-balanced SBMs. In contrast to the symmetric SBM, these models allow for variability in the sizes and behaviors of the different communities, and thus reflect behaviors observed in real-world networks. While previous work is limited to a scalar measure of performance for the overall community detection problem, we introduce a multivariate measure of performance, the minimum mean-squared error (MMSE) matrix, which describes detection limits for individual communities. For example, this matrix allows us to characterize settings where some of the communities can be detected while others cannot.

Our analysis of the community detection problem leverages a matrix version of the I-MMSE relation [15], which both simplifies and generalizes techniques used in previous work. In particular, the upper bound on the mutual information in Theorem 2 is a consequence of a novel non-asymptotic inequality that holds under any distribution on the community labels. Many of our techniques can be applied more generally to other high-dimensional inference problems, including matrix and tensor factorization.

1.2 Overview of Approach

This paper introduces a multivariate measure of performance, which we refer to as the MMSE matrix:

[TABLE]

In this expression, $\operatorname{\mathsf{Cov}}(X_{i}\mid\bm{G})$ is the conditional covariance matrix of the $i$ -th node’s label after it has been embedded into an $\ell$ -dimensional Euclidean space (where $\ell$ is either $k$ or $k-1$ ). We show that the MMSE matrix provides important geometrical information about the uncertainty in the community memberships. While the trace of the MMSE matrix corresponds to standard measures of performance such as the average overlap, the information provided by individual entries in the MMSE matrix can be used to answer more nuanced questions about which of the community relationships can (or cannot) be recovered.

One of the key ideas in this paper is to focus on community detection in the setting where there is additional covariate information about the labels. Specifically, we assume that one has side-information from the signal-plus-noise model:

$$\bm{Y}=\bm{X}S^{1/2}+\bm{N},\tag{2}$$

where $S$ is an $\ell\times\ell$ positive semidefinite matrix, known as the matrix SNR, and $\bm{N}$ is an $n\times\ell$ matrix with i.i.d. standard Gaussian entries.
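As a rough illustration of the side-information channel, the sketch below draws $\bm{Y}=\bm{X}S^{1/2}+\bm{N}$ for a given label matrix. The helper names `psd_sqrt` and `side_information`, the toy labels, and the particular matrix `S` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def psd_sqrt(S):
    """Unique PSD square root of a symmetric PSD matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def side_information(X, S, rng):
    """Return Y = X S^{1/2} + N, where N has i.i.d. standard Gaussian entries."""
    N = rng.standard_normal(X.shape)
    return X @ psd_sqrt(S) + N

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(5, 2))   # toy n = 5, l = 2 embedding (illustrative only)
S = np.array([[2.0, 0.5], [0.5, 1.0]])     # hypothetical matrix SNR
Y = side_information(X, S, rng)
```

Setting `S` to the zero matrix makes `Y` pure noise, which corresponds to the case with no side information.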

The introduction of the signal-plus-noise model plays an important role both for our analysis and for our interpretation of the results. For example, it allows us to leverage the matrix I-MMSE relation [15] to characterize the MMSE matrix in terms of the gradient of the mutual information:

[TABLE]

Remarkably, this relationship holds generally for any joint distribution on the pair $(\bm{X},\bm{G})$ . Notice that the matrix MMSE in (1) is obtained by evaluating this expression at $S=0$ .

The signal-plus-noise model also provides a natural way to address non-identifiability issues that arise when the distribution over the labels is invariant to permutations. The key idea is that in the large- $n$ limit, an arbitrarily small amount of side-information is sufficient to break the symmetry in the model. Hence, focusing on the double limit

$$\lim_{S\to 0}\lim_{n\to\infty}\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y}),$$

provides a meaningful and interpretable measure of average performance that bypasses the need to optimize over an equivalence class of permutations.

Section 3 provides formulas for the per-vertex mutual information and MMSE matrix in the large- $n$ limit. These formulas are stated for a degree-balanced stochastic block model and can be approximated numerically with arbitrary precision. Numerical simulations are provided in Section 5.

1.3 Notation

We use $\mathbb{S}^{d}$ and $\mathbb{S}_{+}^{d}$ to denote the spaces of $d\times d$ symmetric matrices and symmetric positive semi-definite matrices, respectively. Given a symmetric positive semi-definite matrix $S$ , we use $S^{1/2}$ to denote the unique positive semi-definite square root. Given matrices $A,B\in\mathbb{S}^{d}$ , the relation $A\preceq B$ means that $B-A\in\mathbb{S}_{+}^{d}$ .

2 Definitions

The $k$ -community stochastic block model is frequently parameterized in terms of the tuple $(n,p,Q)$ where $p=(p_{1},\dots,p_{k})$ is a distribution over $k$ communities and $Q\in[0,1]^{k\times k}$ is a symmetric matrix such that $Q_{ab}$ is the probability of an edge between nodes in communities $a$ and $b$ . Without loss of generality, the community labels can be embedded into finite-dimensional Euclidean space. Two useful representations are considered in Sections 2.1 and 2.2. In Section 2.3 we introduce the degree-balanced SBM, for which we state the remainder of the results in the paper. Lastly, in Section 2.4 we introduce the signal-plus-noise problem, which we leverage to derive the results for community detection.
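To make the $(n,p,Q)$ parameterization concrete, the following sketch samples labels and a simple graph from an SBM. The function name `sample_sbm` and the numerical values of `p` and `Q` are hypothetical choices for illustration.

```python
import numpy as np

def sample_sbm(n, p, Q, rng):
    """Draw labels i.i.d. from p, then edges independently with probability Q[a, b]."""
    labels = rng.choice(len(p), size=n, p=p)
    edge_prob = Q[np.ix_(labels, labels)]                 # n x n matrix of Q_{x_i x_j}
    upper = np.triu(rng.random((n, n)) < edge_prob, k=1)  # one coin flip per pair i < j
    G = (upper | upper.T).astype(int)                     # symmetric with zero diagonal
    return labels, G

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                 # hypothetical community proportions
Q = np.array([[0.30, 0.05, 0.05],
              [0.05, 0.25, 0.10],
              [0.05, 0.10, 0.40]])            # hypothetical symmetric edge probabilities
labels, G = sample_sbm(200, p, Q, rng)
```

Only the strict upper triangle is sampled, so each unordered pair of nodes receives exactly one independent edge draw.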

2.1 Standard Basis Representation

A natural embedding associates the labels with the standard basis vectors $\{e_{1},\dots,e_{k}\}$ in $\mathbb{R}^{k}$ , i.e., the columns of the identity matrix. Under this representation, the expected value of a label vector $X_{i}$ is a point on the probability simplex. The conditional covariance is defined by

[TABLE]

and the MMSE matrix is defined according to (1). By the data processing inequality for MMSE, this matrix satisfies

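$$0\preceq\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\preceq\operatorname{\mathsf{MMSE}}(\bm{X})\triangleq\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathsf{Cov}}(X_{i}).$$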

As a consequence, the difference between the MMSE matrix and covariance provides a measure of the difference between the prior and posterior marginals of the labels.

Proposition 1.

Under the standard basis representation, the $k\times k$ MMSE matrix satisfies

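$$\operatorname{tr}\left(\operatorname{\mathsf{MMSE}}(\bm{X})-\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\right)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\bm{G}}\left[\left\|P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})-P_{X_{i}}(\cdot)\right\|_{2}^{2}\right].$$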

Proof.

For each $i$ , we can write

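$$\operatorname{tr}\left(\operatorname{\mathsf{Cov}}(X_{i})-\mathbb{E}\left[\operatorname{\mathsf{Cov}}(X_{i}\mid\bm{G})\right]\right)=\mathbb{E}\left[\left\|\mathbb{E}\left[X_{i}\mid\bm{G}\right]-\mathbb{E}\left[X_{i}\right]\right\|^{2}\right]=\mathbb{E}\left[\left\|P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})-P_{X_{i}}(\cdot)\right\|^{2}\right],$$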

where the first equality follows from the law of total variance and the last step holds because, under the standard basis representation, the $\ell$ -th coordinate of the conditional mean satisfies $\mathbb{E}\left[X_{i\ell}\mid\bm{G}\right]=\mathbb{P}\left[X_{i}=e_{\ell}\mid\bm{G}\right]$ . Summing over all $i$ and normalizing by $n$ completes the proof. ∎
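The identity used in this proof can be checked numerically on a small discrete example. In the sketch below, the joint probability table `joint` is a made-up stand-in for the distribution of a single label and a finite-valued observation; it is not derived from any SBM.

```python
import numpy as np

# Hypothetical joint pmf: rows index the k = 3 labels, columns index 4 observation values.
joint = np.array([[0.20, 0.05, 0.10, 0.05],
                  [0.05, 0.15, 0.05, 0.05],
                  [0.05, 0.05, 0.05, 0.15]])

prior = joint.sum(axis=1)       # P[X = e_l]
p_obs = joint.sum(axis=0)       # P[G = g]
posterior = joint / p_obs       # column g holds P[X = e_l | G = g]

def cov_one_hot(q):
    """Covariance of a one-hot random vector with class probabilities q."""
    return np.diag(q) - np.outer(q, q)

num_obs = len(p_obs)
expected_cov_post = sum(p_obs[g] * cov_one_hot(posterior[:, g]) for g in range(num_obs))
gap = cov_one_hot(prior) - expected_cov_post   # Cov(X) - E[Cov(X | G)]

lhs = np.trace(gap)
rhs = sum(p_obs[g] * np.sum((posterior[:, g] - prior) ** 2) for g in range(num_obs))
```

Here `lhs` and `rhs` agree up to floating-point error, and `gap` is positive semi-definite, in line with the ordering between the MMSE matrix and the prior covariance.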

Furthermore, the individual entries of the MMSE matrix also provide information about different recovery tasks. For example, consider the problem of determining whether a label belongs to a subset $A\subset[k]$ . If we define $\bm{1}_{A}=\sum_{\ell\in A}e_{\ell}$ , then $\bm{1}_{A}^{T}X_{i}$ is a binary random variable indicating whether the $i$ -th label belongs to $A$ . Summing the entries in the MMSE matrix indexed by the set $A$ provides a measure of the average error probability:

[TABLE]

2.2 Whitened Representation

Next, we focus on the setting where the labels are identically distributed with probability vector $p=(p_{1},\dots,p_{k})$ . The whitened representation is defined to be a set of $k$ points $\{\mu_{1},\dots,\mu_{k}\}$ in $\mathbb{R}^{k-1}$ with the property that

$$\sum_{\ell}p_{\ell}\mu_{\ell}=0,\qquad\sum_{\ell}p_{\ell}\mu_{\ell}\mu_{\ell}^{T}=I_{k-1}.$$

Under the whitened representation, each label vector has zero mean and identity covariance and thus the MMSE matrix satisfies $0\preceq\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\preceq I_{k-1}$ .

Remark 1 (Unique Specification of Whitened Representation).

The whitened representation can be defined explicitly as a function of $p$ as follows. Let $\tilde{p}=(\sqrt{p_{1}},\dots,\sqrt{p_{k}})^{T}$ and apply the Gram-Schmidt process to the vectors $\{\tilde{p},e_{1},\dots,e_{k-1}\}$ to obtain an orthonormal basis for $\mathbb{R}^{k}$ of the form $[\tilde{p},B]$ where $B$ is $k\times(k-1)$ . Then, the support of the whitened representation is related to the standard basis vectors according to

[TABLE]

where $P=\operatorname{diag}(p)$ . This construction is unique and has the useful property that $\mu_{\ell}$ lies in the span of $\{e_{1},\dots,e_{\ell}\}$ .
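The Gram-Schmidt construction in Remark 1 can be sketched as follows. Because the displayed relation between $\mu_{\ell}$ and $e_{\ell}$ is not reproduced in this text, the code assumes the form $\mu_{\ell}=B^{T}P^{-1/2}e_{\ell}$, which is consistent with the zero-mean, identity-covariance, and span properties stated above; the function name `whitened_points` is likewise an illustrative choice.

```python
import numpy as np

def whitened_points(p):
    """Rows are mu_1..mu_k in R^{k-1}, assuming mu_l = B^T P^{-1/2} e_l."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    # Gram-Schmidt on {sqrt(p), e_1, ..., e_{k-1}}; requires strictly positive p.
    vectors = [np.sqrt(p)] + [np.eye(k)[j] for j in range(k - 1)]
    basis = []
    for v in vectors:
        w = v.copy()
        for b in basis:
            w = w - (b @ w) * b            # remove components along earlier vectors
        basis.append(w / np.linalg.norm(w))
    B = np.column_stack(basis[1:])          # k x (k-1), columns orthogonal to sqrt(p)
    return B / np.sqrt(p)[:, None]          # row l equals B^T e_l / sqrt(p_l)

p = np.array([0.5, 0.3, 0.2])               # hypothetical community proportions
mu = whitened_points(p)
mean = p @ mu                               # sum_l p_l mu_l
second_moment = (mu.T * p) @ mu             # sum_l p_l mu_l mu_l^T
```

Under this assumed form, `mean` is the zero vector, `second_moment` is the identity, and `mu` is lower triangular, which is the span property mentioned at the end of the remark.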

Proposition 2.

If the labels are identically distributed then the $(k-1)\times(k-1)$ MMSE matrix of the whitened representation satisfies

[TABLE]

where $\chi^{2}(P\,\|\,Q)=\int(\mathrm{d}P/\mathrm{d}Q-1)^{2}\,\mathrm{d}Q$ denotes the chi-squared divergence.

Proof.

Noting that $\operatorname{\mathsf{MMSE}}(\bm{X})=I$ and using the same approach as in the proof of Proposition 1, we have

[TABLE]

Next, let $\tilde{X}_{i}$ denote the representation of $X_{i}$ in the standard basis and observe that

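$$\left\|\mathbb{E}\left[X_{i}\mid\bm{G}\right]-\mathbb{E}\left[X_{i}\right]\right\|_{2}^{2}=\sum_{\ell=1}^{k}\left(\frac{1}{\sqrt{p_{\ell}}}\mathbb{P}\left[\tilde{X}_{i}=e_{\ell}\mid\bm{G}\right]-\sqrt{p_{\ell}}\right)^{2}=\chi^{2}\left(P_{X_{i}\mid\bm{G}}(\cdot\mid\bm{G})\,\|\,P_{X_{i}}(\cdot)\right),$$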

where we have used (4) and the fact that $\mathbb{E}\left[\tilde{X}_{i}\right]=p$ . Plugging this expression back into (5) gives the stated result. ∎

For the purposes of analysis, the two representations described above are equivalent in the sense that there is a one-to-one mapping between the $k\times k$ MMSE matrix defined under the standard basis representation and the $(k-1)\times(k-1)$ MMSE matrix defined under the whitened representation. For notational convenience we work in the whitened representation.

2.3 Degree-Balanced SBM

The average degree of an SBM corresponds to the expected number of edges for a node chosen uniformly at random and is denoted by $d$ . An SBM is said to be degree-balanced if the expected degree of a node does not depend on its community assignments. This condition is equivalent to saying that $Qp$ is proportional to the all ones vector.

For the purposes of this paper, it is useful to consider a reparameterization of the degree-balanced SBM in terms of the tuple $(n,d,p,R)$ where $d$ is the average degree and $R\in\mathbb{S}^{k-1}$ . Using this parameterization, the entries of $Q$ are given by

[TABLE]

where $\{\mu_{1},\dots,\mu_{k}\}$ are defined as a function of $p$ using the procedure described in Remark 1. The tuple $(n,d,p,R)$ is valid only if the entries of $Q$ are between zero and one.
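The degree-balance property of this parameterization can be illustrated numerically. In the sketch below, the coefficient multiplying $\mu_{a}^{T}R\mu_{b}$ is passed in as a free parameter `scale`, so the example does not depend on the exact normalization; `base` plays the role of $d/n$. The whitened points are computed with a QR factorization as a stand-in for Gram-Schmidt (it may differ by column signs), and all numerical values are hypothetical.

```python
import numpy as np

def whitened_points(p):
    """Rows mu_1..mu_k with sum_l p_l mu_l = 0 and sum_l p_l mu_l mu_l^T = I (via QR)."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    A = np.column_stack([np.sqrt(p), np.eye(k)[:, : k - 1]])
    Qmat, _ = np.linalg.qr(A)               # first column is +/- sqrt(p)
    B = Qmat[:, 1:]                         # orthonormal complement of sqrt(p)
    return B / np.sqrt(p)[:, None]

def edge_probabilities(p, R, base, scale):
    """Q_ab = base + scale * mu_a^T R mu_b."""
    mu = whitened_points(p)
    return base + scale * (mu @ R @ mu.T)

def is_valid(Q):
    """The parameter tuple is valid only if every entry of Q is a probability."""
    return bool(np.all(Q >= 0.0) and np.all(Q <= 1.0))

p = np.array([0.5, 0.3, 0.2])
R = np.diag([1.5, 0.8])
Q = edge_probabilities(p, R, base=0.03, scale=0.005)
row_degree = Q @ p                          # proportional to the all-ones vector
```

Since $\sum_{b}p_{b}\mu_{b}=0$, every entry of `row_degree` equals `base` regardless of `R` and `scale`, which is the degree-balance condition that $Qp$ is proportional to the all-ones vector. Increasing `scale` too far pushes some entries of `Q` outside $[0,1]$, in which case `is_valid` returns `False`.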

The matrix $R$ quantifies the relative strength of relationships between different communities. The eigenvalue decomposition is given by

$$R=U\operatorname{diag}(\lambda)U^{T},$$

where $\lambda=(\lambda_{1},\dots,\lambda_{k-1})$ are real numbers. To simplify the analysis, we will assume throughout that all the eigenvalues are nonzero so that $R$ is invertible.

We remark that the definition of signal-to-noise ratio given by Abbe and Sandon [10, Section 2.1] corresponds to $\max_{i}\lambda_{i}^{2}$ . Furthermore, for the special case of $k=2$ communities, the representation of $X_{i}$ is one-dimensional and the formulation of Lelarge and Miolane [5] is equivalent to ours.

2.4 Signal-Plus-Noise Problem

Our analysis uses properties of the signal-plus-noise model given in (2). Throughout this section we will assume the labels are drawn i.i.d. according to a probability vector $p=(p_{1},\dots,p_{k})$ with strictly positive entries and are supported on the whitened representation described in Section 2.2. For each $S\in\mathbb{S}_{+}^{k-1}$ , the task of recovering $\bm{X}$ from $\bm{Y}$ decouples into $n$ independent copies of the problem

$$Y=S^{1/2}X+N,$$

where $X$ is supported on $\{\mu_{1},\dots,\mu_{k}\}$ with probability vector $p$ and $N\sim\mathcal{N}(0,I)$ is independent Gaussian noise.

Following [15] we define the mutual information function $I_{X}:\mathbb{S}_{+}^{k-1}\to[0,\infty)$ and matrix-valued MMSE function $M_{X}:\mathbb{S}_{+}^{k-1}\to\mathbb{S}_{+}^{k-1}$ according to

[TABLE]

The gradient and Hessian of $I_{X}(S)$ are given by [15, Lemma 4]

[TABLE]

where $\otimes$ denotes the Kronecker product. We note that these functions can be approximated using numerical integration methods or Monte-Carlo sampling.
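A minimal Monte Carlo sketch of the matrix MMSE function is given below, assuming the standard definition $M_{X}(S)=\mathbb{E}\left[\operatorname{\mathsf{Cov}}(X\mid S^{1/2}X+N)\right]$ for the single-letter channel. The function name `mmse_matrix_mc`, the sample sizes, and the two-point example are illustrative choices.

```python
import numpy as np

def mmse_matrix_mc(mu, p, S, num_samples, rng):
    """Monte Carlo estimate of E[Cov(X | S^{1/2} X + N)] for X supported on rows of mu."""
    mu = np.asarray(mu, dtype=float)                 # k x l support points
    p = np.asarray(p, dtype=float)
    vals, vecs = np.linalg.eigh(S)
    root = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T   # S^{1/2}
    idx = rng.choice(len(p), size=num_samples, p=p)
    Y = mu[idx] @ root + rng.standard_normal((num_samples, mu.shape[1]))
    centers = mu @ root                              # noiseless channel outputs
    # Log posterior weights: log p_l - 0.5 * ||y - S^{1/2} mu_l||^2 (up to a constant).
    sq_dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logw = np.log(p)[None, :] - 0.5 * sq_dist
    logw -= logw.max(axis=1, keepdims=True)          # stabilize the exponentials
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)                # posterior over the k support points
    post_mean = w @ mu                               # E[X | Y] for each sample
    second = np.einsum('nk,ki,kj->ij', w, mu, mu) / num_samples   # E[E[X X^T | Y]]
    return second - post_mean.T @ post_mean / num_samples

rng = np.random.default_rng(0)
mu = np.array([[1.0], [-1.0]])                       # two equally likely communities
p = np.array([0.5, 0.5])
M0 = mmse_matrix_mc(mu, p, np.array([[0.0]]), 1000, rng)
M1 = mmse_matrix_mc(mu, p, np.array([[1.0]]), 20000, rng)
M4 = mmse_matrix_mc(mu, p, np.array([[4.0]]), 20000, rng)
```

At $S=0$ the posterior equals the prior, so the estimate reduces to the prior covariance, which is the identity under the whitened representation. Larger $S$ yields a smaller estimated MMSE.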

3 Formulas for Mutual Information and MMSE

Our analysis focuses on a sequence of degree-balanced SBMs where the parameters $(p,R)$ are fixed as the size of the network $n$ scales to infinity. Additionally, we make two assumptions.

Assumption 1 (Diverging Average Degree).

The average degree of the network $d$ increases with $n$ such that both $d$ and $(n-d)$ tend to infinity.

Assumption 2 (Definite Matrix).

The matrix $R$ is either positive definite or negative definite.

Our first result is stated in terms of the potential function $\mathcal{F}:\mathbb{S}_{+}^{k-1}\to\mathbb{R}_{+}$ defined by

[TABLE]

where $I_{X}(\cdot)$ is defined by (7). Notice that the first term in the potential function is defined exclusively by the prior distribution of labels $p$ whereas the second term is defined exclusively by the matrix $R$ . By the matrix I-MMSE relation [15], it can be verified that every stationary point of $\mathcal{F}(\Delta)$ satisfies the fixed-point equation

[TABLE]

where $M_{X}(\cdot)$ is defined by (8). Noting that $M_{X}(0)=I$ , we see that $\Delta=0$ is always a stationary point. Furthermore, every solution of (12) belongs to the set $\{\Delta\,:\,0\preceq\Delta\preceq R^{2}\}$ .

Theorem 1.

Under Assumptions 1 and 2,

$$\lim_{n\to\infty}\frac{1}{n}I(\bm{X};\bm{G})=\min_{\Delta\in\mathbb{S}_{+}^{k-1}}\mathcal{F}(\Delta),$$

where $\mathcal{F}(\Delta)$ is given in (11).

The next result provides an upper bound on the mutual information in the setting where side information is generated according to the signal-plus-noise model (2) parameterized by a positive semi-definite matrix $S$ . To characterize this setting, we define the modified potential function:

[TABLE]

Notice that the main difference from (11) is that the side information changes the prior information about the labels.

Theorem 2.

Suppose that $\bm{Y}$ is generated according to the signal-plus-noise model (2) with matrix $S\in\mathbb{S}_{+}^{k-1}$ . Under Assumption 1,

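$$\limsup_{n\to\infty}\frac{1}{n}I(\bm{X};\bm{G},\bm{Y})\leq\min_{\Delta\in\mathbb{S}_{+}^{k-1}}\mathcal{F}(\Delta,S).$$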

where $\mathcal{F}(\Delta,S)$ is given in (13).

Remark 2.

Similar to previous work [3, 4, 5, 8, 6, 7], our proofs of Theorems 1 and 2 use a channel universality argument to relate the community detection problem to a low-rank estimation problem. Assumption 2 is needed for the proof of Theorem 1, which leverages [5, Theorem 12]. To prove Theorem 2 we develop a novel variation of the Guerra interpolation method that exploits the matrix I-MMSE relationship [15] to provide a general and non-asymptotic upper bound.

Next, we recall that by the data processing inequality, the MMSE matrix satisfies

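$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})\succeq\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y}),$$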

for all $S\in\mathbb{S}^{k-1}_{+}$ . For any fixed problem size $n$ , the difference between these matrices converges to zero as $S\to 0$ . However, in the large- $n$ limit it is possible that the limiting behavior is discontinuous with respect to $S$ . This can occur, for example, when the SBM is invariant to permutations of the labels and hence $\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})=\operatorname{\mathsf{MMSE}}(\bm{X})$ . The presence of side-information with an arbitrarily small positive definite matrix $S$ is sufficient to break the permutation invariance, and thus the small- $S$ limit provides a meaningful measure of recovery performance that overcomes the non-identifiability issues.

The following result follows from the matrix I-MMSE relation and Theorems 1 and 2. The proof is given in Appendix A.3.

Theorem 3.

Consider Assumptions 1 and 2. For every $S\succ 0$ ,

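$$\limsup_{n\to\infty}\lambda_{\mathrm{max}}\left(\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})-M_{X}(\Delta^{*})\right)\leq 0$$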

where $\Delta^{*}$ denotes any minimizer of $\mathcal{F}(\Delta)$ . In other words,

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})\preceq M_{X}(\Delta^{*})+o_{n}(1),$$

where $o_{n}(1)$ denotes a sequence of symmetric matrices that converges to zero as $n\to\infty$ .

The numerical experiments of Section 5 suggest that the upper bounds in Theorem 2 are asymptotically tight, i.e., that the MMSE matrix satisfies

$$\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G},\bm{Y})=M_{X}(S+\Delta^{*})+o_{n}(1)$$

for almost all $S$ , where $\Delta^{*}$ is the unique minimizer of $\mathcal{F}(\cdot,S)$ .

The next result provides an asymptotic lower bound on the problem of estimating $\bm{X}R\bm{X}^{T}$ , which implies a lower bound on $\operatorname{\mathsf{MMSE}}(\bm{X}\mid\bm{G})$ . The proof is given in Appendix A.4.

Theorem 4.

Under Assumptions 1 and 2,

[TABLE]

where $\mathcal{D}=\arg\min\mathcal{F}(\Delta)$ . Furthermore, this implies that

[TABLE]

4 Implications for Weak Recovery

Broadly speaking, weak recovery refers to the ability to produce an estimate that is positively correlated with the ground truth. In the context of community detection, the precise definition of weak recovery is a bit more nuanced due to the fact that symmetries in the problem formulation can result in a posterior distribution that is invariant to permutations of the labels. As a specific example, consider the two-community degree-balanced SBM where each community is equally likely. Even if an estimator can partition the nodes into two groups such that all of the nodes in each group belong to the same community, it is impossible to determine which label should be assigned to which group.

One approach that is taken in the literature to address this nonidentifiability assesses the performance of an estimator after choosing a permutation of the labels that leads to the best performance; see e.g., [10, Section 2]. Another approach focuses on the related problem of estimating the pairwise interaction terms $\{X^{T}_{i}RX_{j}\}$ . Specifically, weak recovery with respect to the pairwise interactions is possible if

[TABLE]

where $\operatorname{\mathsf{MMSE}}(X_{i}^{T}RX_{j}\mid\bm{G})\triangleq\mathbb{E}_{\bm{G}}\left[\operatorname{\mathsf{Var}}(X_{i}^{T}RX_{j}\mid\bm{G})\right]$ . Notice that under the whitened basis representation we propose, $\operatorname{\mathsf{Var}}(X_{i}^{T}RX_{j})=\|R\|_{F}^{2}$ and this condition is equivalent to

[TABLE]

Following the approach taken in this paper, we see that a natural alternative is to focus on the small- $S$ behavior of the MMSE matrix. In particular, we say that weak recovery is possible if

[TABLE]

In view of these definitions, we see that Theorem 3 and Theorem 4 provide necessary and sufficient conditions for weak recovery, depending on whether the potential function $\mathcal{F}(\cdot)$ has a unique minimizer at zero.

Theorem 5 (Weak Recovery).

Consider Assumptions 1 and 2. If $\mathcal{F}(\cdot)$ has a minimizer that is not equal to zero then weak recovery in the sense of (15) is possible. Conversely, if $\mathcal{F}(\cdot)$ has a unique minimizer at zero, then weak recovery in the sense of (14) is not possible.

Evaluating the Hessian of the potential function at zero provides a simple test to determine whether $\Delta=0$ is a local minimum. Using (10), it can be shown that

[TABLE]

Therefore, if $\max_{i}\lambda^{2}_{i}(R)>1$ then $\Delta=0$ is not a local minimizer.
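The sufficient condition above reduces to a one-line eigenvalue check. The sketch below only tests the stated implication (largest squared eigenvalue above one means $\Delta=0$ is not a local minimizer); the function name and the example matrices are hypothetical.

```python
import numpy as np

def zero_is_not_local_min(R):
    """True when max_i lambda_i(R)^2 > 1, the sufficient condition stated in the text."""
    eigvals = np.linalg.eigvalsh(R)          # R is symmetric, so eigenvalues are real
    return bool(np.max(eigvals ** 2) > 1.0)

R_strong = np.array([[1.1, 0.2], [0.2, 0.4]])   # positive definite, largest eigenvalue > 1
R_weak = np.diag([-0.5, -0.9])                  # negative definite, all |eigenvalues| < 1
strong_flag = zero_is_not_local_min(R_strong)
weak_flag = zero_is_not_local_min(R_weak)
```

A `False` result does not by itself rule out weak recovery, since the potential function may still have a nonzero global minimizer.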

5 Numerical Experiments

This section compares the asymptotic bounds given in Section 3 with the MSE obtained using belief propagation (BP). The case of the three-community degree-balanced SBM $(n,d,p,R)$ is illustrated in Figure 1. The black contour lines correspond to the trace of $M_{X}(\Delta^{*})$ where $\Delta^{*}$ is the global minimizer of the potential function defined in (11). The heat map values correspond to the empirical MSE of the BP algorithm described in [2] applied to a network of size $n=10^{5}$ with average degree $d=30$ . Each pixel is the median of eight independent trials and the MSE is measured with respect to the whitened basis representation. In each trial, the BP algorithm is run using fifteen different random initializations and the MSE is assessed based on the initialization that produces the lowest predicted MSE.

In the case of uniform community assignments (Figure 1(a)), the weak recovery limit for acyclic BP [10] is equal to our upper bound on the weak detection threshold. Furthermore, we see that there is a close correspondence between the asymptotic formula and the empirical results. Note that the special case $\lambda_{1}=\lambda_{2}$ corresponds to the three-community symmetric SBM.

In the case of non-uniform community assignments (Figure 1(b)), there exists a region of the parameter space where weak recovery is possible with $\max(\lambda_{1},\lambda_{2})<1$ . The existence of such a region has been shown previously in the special case of the two-community asymmetric SBM [4]. We also see that the asymptotic formulas match the empirical behavior qualitatively, although the empirical MSE is worse than is suggested by the formulas. The grey region in Figure 1(b) corresponds to settings where $(n,d,p,R)$ does not define a valid SBM.

Numerical Approximation of Formulas

We use Monte Carlo sampling to approximately evaluate the functions $I_{X}$ and $M_{X}$ , and we use the concave-convex procedure [16] to explore the local minima of the potential function. Starting from an initialization point $\Delta^{0}$ , a sequence of iterates is obtained according to

[TABLE]

where $\epsilon\in[0,1)$ is a dampening parameter.
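A scalar illustration of such a damped iteration is sketched below for the special case of two equally likely communities, where the whitened labels are $\pm 1$ and $R$ reduces to a single eigenvalue $\lambda$. The update rule is an assumption: the iterate formula is not reproduced in this text, so the code uses the damped form $\Delta\leftarrow\epsilon\Delta+(1-\epsilon)\lambda^{2}(1-M_{X}(\Delta))$, chosen to be consistent with the stationary-point description in Section 3 ($\Delta=0$ is always a fixed point and solutions lie in $[0,\lambda^{2}]$). Gauss-Hermite quadrature replaces Monte Carlo sampling to keep the example deterministic.

```python
import numpy as np

# Probabilists' Gauss-Hermite rule, rescaled so that E[f(Z)] ~ sum(weights * f(nodes)).
_nodes, _weights = np.polynomial.hermite_e.hermegauss(80)
_weights = _weights / np.sqrt(2.0 * np.pi)

def mmse_binary(delta):
    """M_X(delta) for X uniform on {+1, -1}: 1 - E[tanh(delta + sqrt(delta) Z)]."""
    return 1.0 - np.sum(_weights * np.tanh(delta + np.sqrt(delta) * _nodes))

def damped_iteration(lam, delta0, eps=0.5, num_iters=300):
    """Iterate the assumed damped update with dampening parameter eps in [0, 1)."""
    delta = delta0
    for _ in range(num_iters):
        target = lam ** 2 * (1.0 - mmse_binary(delta))
        # Clip at zero so round-off cannot push the iterate out of the PSD cone.
        delta = max(eps * delta + (1.0 - eps) * target, 0.0)
    return delta

weak = damped_iteration(lam=0.7, delta0=0.5)     # lam^2 < 1
strong = damped_iteration(lam=1.4, delta0=0.5)   # lam^2 > 1
```

In this toy case the iterates collapse to zero when $\lambda^{2}<1$ and settle at a strictly positive fixed point when $\lambda^{2}>1$. In the matrix setting, several initializations $\Delta^{0}$ would be needed to explore different local minima.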

6 Main Steps in Proof

This section provides an overview of the main theoretical results of the paper. These results are described in the context of a more general inference problem where the goal is to estimate a random $n\times\ell$ matrix $\bm{X}=[X_{1},\dots,X_{n}]^{T}$ . The setting of the $k$ -community degree-balanced SBM described in Section 3 corresponds to the special case where $\ell=k-1$ and the rows of $\bm{X}$ are drawn i.i.d. from the whitened distribution described in Section 2.2.

6.1 Equivalence between Observation Models

The high-level idea behind our approach is to establish an equivalence between three different observation models. The first observation model is the signal-plus-noise model given by:

$$\bm{Y}=\bm{X}S^{1/2}+\bm{N},\tag{16}$$

where $S\in\mathbb{S}_{+}^{\ell}$ and $\bm{N}$ is an $n\times\ell$ standard Gaussian matrix, i.e., the entries are i.i.d. $\mathcal{N}(0,1)$ .

To describe the second observation model, we first define the symmetric $n\times n$ random matrix

$$\bm{W}=\frac{1}{\sqrt{n}}\bm{X}R\bm{X}^{T},\tag{17}$$

where $R\in\mathbb{S}^{\ell}$ . Then, the observations are given by

$$\bm{Z}=\sqrt{t}\,\bm{W}+\bm{\xi},\tag{18}$$

where $t\in[0,\infty)$ and $\bm{\xi}$ is an $n\times n$ standard Gaussian Wigner matrix, i.e. a symmetric matrix whose entries above the diagonal are i.i.d. $\mathcal{N}(0,1)$ and whose entries on the diagonal are i.i.d. $\mathcal{N}(0,2)$ .
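The first two observation models can be simulated in a few lines. The sketch below is only illustrative: the input distribution (i.i.d. $\pm 1$ entries), the matrices `S` and `R`, and the sizes are hypothetical choices, and the form $\bm{Z}=\sqrt{t}\,\bm{W}+\bm{\xi}$ with $\bm{W}=n^{-1/2}\bm{X}R\bm{X}^{T}$ is taken from the definition of $\bm{Z}(t)$ used in Appendix A.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, t = 200, 2, 1.0

# Hypothetical input: entries drawn i.i.d. uniformly from {-1, +1}.
X = rng.choice([-1.0, 1.0], size=(n, ell))
S = np.array([[1.0, 0.3], [0.3, 0.5]])  # positive definite SNR matrix (illustrative)
R = np.diag([1.5, -0.5])                # symmetric coupling matrix (illustrative)

# Signal-plus-noise model: Y = X S^{1/2} + N with N standard Gaussian.
evals, evecs = np.linalg.eigh(S)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
Y = X @ S_half + rng.standard_normal((n, ell))

# Symmetric matrix model: Z = sqrt(t) W + xi with W = X R X^T / sqrt(n).
W = X @ R @ X.T / np.sqrt(n)
G0 = rng.standard_normal((n, n))
xi = (G0 + G0.T) / np.sqrt(2)  # Wigner: off-diagonal variance 1, diagonal variance 2
Z = np.sqrt(t) * W + xi
```

Building the Wigner matrix by symmetrizing an i.i.d. Gaussian matrix gives exactly the variance profile stated above, since each off-diagonal entry is a sum of two independent $\mathcal{N}(0,1/2)$ terms and each diagonal entry is $\sqrt{2}$ times a standard Gaussian.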

For the last model, the observations consist of an $n$ -node simple graph, which is represented by its adjacency matrix $\bm{G}\in\{0,1\}^{n\times n}$ . By convention the diagonal entries are set to zero and the off-diagonal entries are given by $G_{ij}=G_{ji}=1$ if there is an edge between nodes $i$ and $j$ and zero otherwise. Our results apply to the setting where the entries of the adjacency matrix are drawn independently conditional on $\bm{W}$ according to

[TABLE]

where $d\in(0,n)$ parameterizes the expected number of edges.

Notice that both (18) and (19) consist of elementwise observations of $\bm{W}$ from a fixed output channel. The following result provides a link between the mutual information in these observation models. The proof is given in Appendix B.

Theorem 6 (Channel Universality).

Let $\bm{W}$ be a symmetric $n\times n$ random matrix with bounded entries $|W_{ij}|\leq B/\sqrt{n}$ and finite support of cardinality $N$ . Let $\bm{Z}$ be drawn according to (18) with $t=1$ and $\bm{G}$ be drawn according to (19). Given any $\delta>0$ , there exists a constant $C(\delta,B)$ such that

[TABLE]

uniformly for all integers $n>\delta/2$ and $d\in[\delta,n-\delta]$ .

Remark 3.

The concept of channel universality appeared in the work of Korada and Montanari [17] and was subsequently developed in the context of community detection [3, 4, 5] and low-rank matrix estimation [7, 6, 8]. In relation to this work, the contribution of Theorem 6 is that it holds under more general assumptions on both $\bm{W}$ and the average degree $d$ .

Theorem 6 implies that the joint information in $(\bm{G},\bm{Y})$ about $\bm{X}$ is asymptotically equivalent to the joint information in $(\bm{Y},\bm{Z})$ about $\bm{X}$ .

Corollary 7.

Let $(\bm{X},\bm{G})$ be drawn according to the degree-balanced SBM with parameters $(n,d,p,R)$ where $p$ and $R$ are fixed and $d$ scales with $n$ such that both $d$ and $(n-d)$ tend to infinity. Let $\bm{Y}$ be drawn according to (16) and let $\bm{Z}$ be drawn according to (18) with $t=1$ and $\bm{W}=n^{-1/2}\bm{X}R\bm{X}^{T}$ . Then,

[TABLE]

Proof.

Combining the chain rule for mutual information with the Markov structure in $(\bm{W},\bm{X},\bm{Y},\bm{Z})$ leads to

[TABLE]

By assumption, $\bm{X}$ has finite support of cardinality $k^{n}$ and bounded entries. This implies that $\bm{W}$ has finite support of cardinality $N=k^{n}$ and bounded entries $|W_{ij}|\leq B/\sqrt{n}$ where the constant $B$ depends only on $(p,R)$ . For every realization $\bm{y}$ of $\bm{Y}$ , Theorem 6 implies that there is a constant $C(p,R)$ such that

[TABLE]

for all $n$ and $d$ sufficiently large. The stated result then follows from Jensen’s inequality and the assumptions on $n$ and $d$ . ∎

6.2 Interpolation via Mutual Information

Theorem 6 provides a link between community detection and symmetric matrix estimation. The next step in our analysis is to study an interpolating function that transitions smoothly from the symmetric matrix model to the signal-plus-noise model. We note that a number of approaches have been developed in the statistical physics literature, including Guerra’s interpolation method [18] and the adaptive interpolation method [19]. In this paper we consider an approach inspired by the work of Reeves [20], which leverages the functional properties of mutual information in Gaussian channels.

The central object of interest is the mutual information function $I_{\bm{X},\bm{W}}:\mathbb{S}_{+}^{\ell}\times[0,\infty)\to\mathbb{R}$ defined by

[TABLE]

This function has a number of useful properties. Combining the chain rule for mutual information with the Markov structure in $(\bm{W},\bm{X},\bm{Y},\bm{Z})$ allows us to write

[TABLE]

Hence, the special cases $t=0$ and $S=0$ are given by

[TABLE]

In this way, $I_{\bm{X},\bm{W}}(S,t)$ provides a bridge between the symmetric matrix estimation problem, with or without side information, and the signal-plus-noise problem. Notice that if the rows of $\bm{X}$ are independent, then $\frac{1}{n}I(\bm{X};\bm{Y})=\frac{1}{n}\sum_{i=1}^{n}I(X_{i};Y_{i})$ . In particular, if the rows of $\bm{X}$ are drawn i.i.d. from a distribution $P_{X}$ on $\mathbb{R}^{\ell}$ (as is assumed in Theorem 2) then $I_{\bm{X}}(S)$ is equal to the mutual information function $I_{X}(S)$ introduced in Section 2.4.

It was previously shown that $I_{\bm{X},\bm{W}}(S,t)$ possesses several desirable properties: it is concave and twice differentiable in the pair $(S,t)$ [15, Lemma 4]. Let the partial gradients with respect to the first and second arguments be denoted by $I^{(1)}_{\bm{X},\bm{W}}:\mathbb{S}_{+}^{\ell}\times[0,\infty)\to\mathbb{S}_{+}^{\ell}$ and $I^{(2)}_{\bm{X},\bm{W}}:\mathbb{S}_{+}^{\ell}\times[0,\infty)\to\mathbb{R}$ , respectively. By the matrix I-MMSE relation, it follows that:

[TABLE]

The details of this derivation are given in Appendix D.3.

The next result provides a non-asymptotic upper bound on $I_{\bm{X},\bm{W}}(S,t)$ in terms of the signal-plus-noise model. Remarkably, the only restriction on $\bm{X}$ is that it has finite fourth moments. The proof is given in Section 6.3.

Theorem 8.

Let $\bm{X}\in\mathbb{R}^{n\times\ell}$ be a random matrix with finite fourth moments and let $\bm{W}=\frac{1}{\sqrt{n}}\bm{X}R\bm{X}^{T}$ where $R\in\mathbb{S}^{\ell}$ is invertible. For all $S\in\mathbb{S}^{\ell}_{+}$ and $t\in(0,\infty)$ , the mutual information function defined in (20) satisfies

[TABLE]

where $\Gamma=\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[\bm{X}^{T}\bm{X}}\right]$ .

If the rows of $\bm{X}$ are sufficiently uncorrelated then the term $\frac{1}{n^{2}}\mathbb{E}\mathopen{}\mathclose{{}\left[\mathopen{}\mathclose{{}\left\|R\bm{X}^{T}\bm{X}-R\mathbb{E}\mathopen{}\mathclose{{}\left[\bm{X}^{T}\bm{X}}\right]}\right\|_{F}^{2}}\right]$ converges to zero in the large- $n$ limit. The case of i.i.d. rows is summarized as follows:

Corollary 9.

Let $\bm{X}\in\mathbb{R}^{n\times\ell}$ be a random matrix whose rows are drawn i.i.d. from a distribution $P_{X}$ on $\mathbb{R}^{\ell}$ with finite fourth moments and let $\bm{W}=\frac{1}{\sqrt{n}}\bm{X}R\bm{X}^{T}$ where $R\in\mathbb{S}^{\ell}$ is invertible. For all $S\in\mathbb{S}^{\ell}_{+}$ and $t\in(0,\infty)$ , the mutual information function defined in (20) satisfies

[TABLE]

Proof.

Noting that $R\bm{X}^{T}\bm{X}=\sum_{i=1}^{n}RX_{i}X_{i}^{T}$ is the sum of $n$ i.i.d. matrices leads to

[TABLE]

which converges to zero as $n$ increases to infinity. ∎
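The $1/n$ decay of the fluctuation term can be checked by simulation. The sketch below is a Monte Carlo illustration under hypothetical choices that are not from the paper: standard Gaussian rows with $\ell=2$ and $R=I$. For this particular choice a direct calculation gives the value $(\ell^{2}+\ell)/n$, so quadrupling $n$ should shrink the estimate by roughly a factor of four. The function name `fluctuation_term` is illustrative.

```python
import numpy as np

def fluctuation_term(n, ell=2, trials=2000, seed=0):
    """Monte Carlo estimate of (1/n^2) E || R X^T X - R E[X^T X] ||_F^2
    for rows X_i ~ N(0, I_ell) and R = I (illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    R = np.eye(ell)
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, ell))
        # For standard Gaussian rows, E[X^T X] = n * I.
        D = R @ (X.T @ X) - R @ (n * np.eye(ell))
        total += np.sum(D ** 2) / n ** 2
    return total / trials

est_50 = fluctuation_term(50)
est_200 = fluctuation_term(200)
```

The estimates are noisy, so the comparison between `est_50` and `est_200` should be read as an approximate scaling check and not as a precise measurement.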

Combining Corollary 7 and Corollary 9 leads directly to an upper bound on the mutual information in the community detection problem (Theorem 2). The details of the proof are given in Appendix A.2. Showing that this bound is tight requires significantly more work. In this direction, we build upon the work of Lelarge and Miolane [5, Theorem 12], who give an explicit characterization of the large- $n$ limit for the matrix estimation problem in the setting where $S=0$ . Although their result is stated originally for the special case where $R$ is the identity matrix, it extends to the case described below, where $R$ is definite. For completeness, a detailed mapping between their statement of this result and the one used in this paper is provided in Appendix C.

Theorem 10 (Lelarge and Miolane [5, Theorem 12]).

Let $\bm{X}\in\mathbb{R}^{n\times\ell}$ be a random matrix whose rows are drawn i.i.d. from a distribution $P_{X}$ on $\mathbb{R}^{\ell}$ with finite second moments and let $\bm{W}=\frac{1}{\sqrt{n}}\bm{X}R\bm{X}^{T}$ where $R$ is either positive definite or negative definite. For all $t\in(0,\infty)$ , the mutual information function defined in (20) satisfies

[TABLE]

6.3 Proof of Theorem 8

The first step in the proof is given by the following lemma, which establishes a functional relationship between the first and second partial gradients of $I_{\bm{X},\bm{W}}(S,t)$ .

Lemma 11.

The gradients of the function $I_{\bm{X},\bm{W}}(S,t)$ defined in (20) satisfy

[TABLE]

where $g:\mathbb{S}^{\ell}_{+}\to\mathbb{R}$ is defined according to

[TABLE]

with $\Gamma=\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[\bm{X}^{T}\bm{X}}\right]$ .

Proof.

Based on the analysis of the MMSE matrix of a linear Gaussian channel with matrix input (Appendix D.2) and the partial derivatives of the mutual information function in symmetric matrix estimation (Appendix D.3) we obtain

[TABLE]

where $\bm{A}$ and $\bm{B}$ are conditionally independent draws from the posterior distribution of $\bm{X}$ given $(\bm{Y},\bm{Z})$ . Comparing these expressions with the definition of $g(U)$ leads to

[TABLE]

Noticing that this expression is non-negative completes the proof. ∎

The next step in our analysis is to focus on the convex conjugate (or Legendre–Fenchel transform) of $I_{\bm{X},\bm{W}}(\cdot,t)$ . Specifically, we define the extended real-valued function $J_{\bm{X},\bm{W}}:\mathbb{S}_{+}^{\ell}\times[0,\infty)\to\mathbb{R}\cup\{+\infty\}$ according to

[TABLE]

Here, we have introduced the factor of one half so that the dual variable $U$ can be associated with the MMSE matrix. The function $J_{\bm{X},\bm{W}}(\cdot,t)$ is convex because it is the pointwise supremum of affine functions. By the Fenchel–Moreau theorem (see e.g., [21, Theorem 13.37]), the fact that $I_{\bm{X},\bm{W}}(\cdot,t)$ is a proper upper-semicontinuous concave function implies that the Legendre–Fenchel transform is a bijection, and thus

[TABLE]

where $\mathcal{U}\triangleq\{2I^{(1)}_{\bm{X}}(S)\,:\,S\in\mathbb{S}_{+}^{\ell}\}\subseteq\mathbb{S}_{+}^{\ell}.$

Working with the transformed representation allows us to convert the functional constraint on the partial derivatives given in Lemma 11 into an upper bound on the convex conjugate.

Lemma 12.

For all $U\in\mathcal{U}$ we have

[TABLE]

where $g(U)$ is defined in (24).

Proof.

The assumption that $U\in\mathcal{U}$ combined with the fact that $I^{(1)}_{\bm{X},\bm{W}}(S,\cdot)$ is non-increasing in the Loewner partial order ensures that the supremum with respect to $S$ in (25) is attained at some point $S^{*}(U,t)\in\mathbb{S}_{+}^{\ell}$ . By the Karush–Kuhn–Tucker conditions, the gradient with respect to $S$ evaluated at this point satisfies

[TABLE]

Next, we note that $g(U)$ is non-decreasing with respect to the Loewner partial order. To see why, observe that for any $0\preceq U\preceq V\preceq\Gamma$ , we have $g(V)-g(U)=\operatorname{tr}(R(V-U)R(2\Gamma-U-V))\geq 0$ .
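The sign of this trace expression follows because $R(V-U)R$ and $2\Gamma-U-V$ are both positive semidefinite, and the trace of a product of two positive semidefinite matrices is non-negative. The sketch below checks the inequality numerically on random instances; it assumes the whitened case $\Gamma=I$, and the helper `random_psd_below` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
ell = 4

def random_psd_below(V, rng):
    """Return a random U with 0 <= U <= V in the Loewner order,
    built as V^{1/2} C V^{1/2} with 0 <= C <= I."""
    evals, evecs = np.linalg.eigh(V)
    V_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    Q, _ = np.linalg.qr(rng.standard_normal(V.shape))
    C = Q @ np.diag(rng.uniform(0, 1, V.shape[0])) @ Q.T
    return V_half @ C @ V_half

gaps = []
for _ in range(1000):
    A = rng.standard_normal((ell, ell))
    R = (A + A.T) / 2               # arbitrary symmetric R
    Gamma = np.eye(ell)             # whitened case (assumption)
    V = random_psd_below(Gamma, rng)
    U = random_psd_below(V, rng)
    # g(V) - g(U) = tr(R (V - U) R (2 Gamma - U - V))
    gaps.append(np.trace(R @ (V - U) @ R @ (2 * Gamma - U - V)))

min_gap = min(gaps)
```

Up to floating-point rounding, `min_gap` stays non-negative across all sampled triples, as the argument above predicts.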

We now employ the envelope theorem [22], which implies that $J_{\bm{X},\bm{W}}(U,t)$ is absolutely continuous in $t$ with

[TABLE]

The integrand in this expression can be upper bounded as follows:

[TABLE]

The first inequality is due to Lemma 11 and the second inequality follows from (28) and the fact that $g(U)$ is non-decreasing. Plugging this inequality back into (29) completes the proof. ∎

We now have all the ingredients needed for the proof of Theorem 8. Starting with (26) and then applying the bound in Lemma 12 allows us to write

[TABLE]

Note that this is a variational upper bound in terms of the dual variable $U$ , which corresponds to the MMSE matrix. To rewrite this expression in terms of an infimum over the signal-to-noise matrix, we define the function $h:\mathbb{S}_{+}^{\ell}\to\mathbb{R}$ according to

[TABLE]

Then, a straightforward calculation shows that $g(U)$ is the concave conjugate of $h(\Delta)$ in the following sense:

[TABLE]

for all $0\preceq U\preceq\Gamma$ . Plugging this characterization of $g(U)$ back into (31), and then swapping the order of the infimum with respect to $U$ and $\Delta$ leads to

[TABLE]

where the final equality follows from (26). This concludes the proof of Theorem 8.

7 Discussion

The results presented in this paper recast the community detection problem as a multivariate problem making it possible to evaluate more than just traditional overall recovery tasks. By evaluating the formulas derived in Section 3 we can now differentiate between the tasks of finding one community, all communities, and a subset of communities within a network. The formulas further allow us to identify a computational gap for regimes where certain recovery tasks should be theoretically attainable but where algorithms such as BP will fail to perform.

Acknowledgment

The authors thank Lenka Zdeborová for providing initial direction on this problem and Jiaming Xu for helpful discussion regarding channel universality. This work was supported in part by funding from the Laboratory for Analytic Sciences (LAS) and by the NSF under Grant No. 1750362. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

Appendix A Proofs of Results in Section 3

A.1 Proof of Theorem 1

Combining Corollary 7 and Theorem 10 with $t=1$ yields

[TABLE]

for any random matrix $\bm{X}\in\mathbb{R}^{n\times\ell}$ whose rows are drawn i.i.d. from a distribution $P_{X}$ on $\mathbb{R}^{\ell}$ with finite and bounded support. Under the assumption that the rows are supported on the whitened representation described in Section 2.2 it follows that $\mathbb{E}\mathopen{}\mathclose{{}\left[XX^{T}}\right]=I$ . Furthermore, it can be verified that the infimum with respect to $\Delta$ is attained on the compact set $\{\Delta\,:0\preceq\Delta\preceq R^{2}\}$ and thus the use of a minimum is justified. This concludes the proof of Theorem 1.

A.2 Proof of Theorem 2

Combining Corollary 7 and Corollary 9 with $t=1$ yields

[TABLE]

for any $S\in\mathbb{S}_{+}^{\ell}$ and random matrix $\bm{X}\in\mathbb{R}^{n\times\ell}$ whose rows are drawn i.i.d. from a distribution $P_{X}$ on $\mathbb{R}^{\ell}$ with finite and bounded support. Under the assumption that the rows are supported on the whitened representation described in Section 2.2 it follows that $\mathbb{E}\mathopen{}\mathclose{{}\left[XX^{T}}\right]=I$ . Furthermore, it can be verified that the infimum with respect to $\Delta$ is attained on the compact set $\{\Delta\,:R(I-M_{X}(S))R\preceq\Delta\preceq R^{2}\}$ and thus the use of a minimum is justified. This concludes the proof of Theorem 2.

A.3 Proof of Theorem 3

The key idea underlying this proof is to exploit the integral form of the matrix I-MMSE relationship, which gives

[TABLE]

for any differentiable path $S_{u}$ with $S_{0}=0$ and $S_{1}=S$ . Combining Theorems 1 and 2 provides an upper bound on the leading order terms of the left-hand side of this expression in the large- $n$ limit. We will show that this upper bound implies an asymptotic upper bound on the matrix MMSE with respect to the Loewner partial order.

To simplify notation we let $\ell=k-1$ and define the functions $f_{n}:\mathbb{S}_{+}^{\ell}\to\mathbb{R}$ and $f:\mathbb{S}_{+}^{\ell}\to\mathbb{R}$ according to

[TABLE]

For all $S\in\mathbb{S}_{+}^{\ell}$ , the upper bound on the mutual information in Theorem 2 combined with the exact limit in Theorem 1 allows us to write

[TABLE]

The next step is to show that (33) implies an upper bound for the gradient $\nabla f(S)$ for all positive definite $S$ . Let $\mathcal{T}=\{T\in\mathbb{S}_{+}^{\ell}\,:\,T\preceq I\}$ . For every $S\in\mathbb{S}_{++}^{\ell}$ , $T\in\mathcal{T}$ and $\epsilon\in(0,\lambda_{\text{min}}(S)]$ , we can write

[TABLE]

where the inequality holds because $uT\preceq\epsilon T\preceq\epsilon I\preceq S$ for all $u\in[0,\epsilon]$ and $\nabla f_{n}$ is non-increasing with respect to the Loewner partial order. Meanwhile, we note that $f$ is concave because it is the pointwise infimum of concave functions. By the envelope theorem [22], the supergradient of $f(S)$ at $S=0$ is the closure of the set $\{\frac{1}{2}M_{X}(\Delta)\,:\,\Delta\text{ attains the minimum in the definition of }f\}$ . Hence,

[TABLE]

where $\nabla f(0)$ denotes any matrix in the supergradient of $f(S)$ at $S=0$ . Combining (33), (34), and (35) leads to

[TABLE]

for all $S\in\mathbb{S}_{++}^{\ell}$ and $T\in\mathcal{T}$ .

The final step in the proof is to show that (36) implies an upper bound on the maximum eigenvalue of $\nabla f_{n}(S)-\nabla f(0)$ . To proceed, observe that the set $\mathcal{T}$ is compact, and thus for every $\delta>0$ there exists an integer $M$ and a set of matrices $\{T_{1},\dots,T_{M}\}$ such that $\max_{T\in\mathcal{T}}\min_{m\in[M]}\|T_{m}-T\|_{F}\leq\delta$ . Therefore, the maximum eigenvalue can be upper bounded as follows:

[TABLE]

By (36), the limit superior of the first term on the right-hand side is non-positive. Meanwhile the gradient $\nabla f_{n}(S)$ is bounded uniformly with respect to $S$ and $n$ . Noting that $\delta$ can be chosen arbitrarily small completes the proof of Theorem 3.

A.4 Proof of Theorem 4

Given $t\in[0,\infty)$ , let $\bm{Z}(t)=\sqrt{t/n}\bm{X}R\bm{X}^{T}+\bm{\xi}$ where $\bm{\xi}$ is a standard Gaussian Wigner matrix. Starting with the I-MMSE relation in (62), we obtain, for all $t>0$ ,

[TABLE]

where the inequality holds because the integrand is non-increasing in $\tau$ . To characterize the asymptotic limit of the left-hand side, we start with Theorem 6 and use the same steps that led to Corollary 7 to obtain

[TABLE]

where $\bm{Z}^{\prime}(1)$ and $\bm{Z}(t)$ are conditionally independent given $\bm{X}$ . By [15, Lemma 2], the information provided by two independent Gaussian observations can be expressed in terms of a single observation according to $I(\bm{X};\bm{Z}^{\prime}(1),\bm{Z}(t))=I(\bm{X};\bm{Z}^{\prime}(1+t))$ . Thus we can apply Theorem 10 to obtain

[TABLE]

where

[TABLE]

Putting the above pieces together, we obtain

[TABLE]

for all $t>0$ .

Next, we consider the limiting behavior of the right-hand side of (38) as $t$ decreases to zero. Observe that the gradients of the potential function $\mathcal{F}_{\gamma}(\Delta)$ are given by

[TABLE]

Let $\mathcal{D}=\arg\min\mathcal{F}_{1}(\cdot)$ . Starting with the envelope theorem [22], we have

[TABLE]

where the last step holds because every $\Delta\in\mathcal{D}$ is a stationary point of $\mathcal{F}_{1}(\cdot)$ and thus satisfies $\Delta=R(I-M_{X}(\Delta))R$ .

Combining Lemma 11, evaluated with $S=0$ , with the assumption $\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[\bm{X}^{T}\bm{X}}\right]=I$ gives

[TABLE]

where the second term on the right-hand side converges to zero in the large- $n$ limit by the law of large numbers.

Combining this inequality with (38) and (41) gives

[TABLE]

Rearranging the terms completes the proof.

Appendix B Proof of Theorem 6

Recalling that $\bm{G}$ is a symmetric matrix with zeros on the diagonal and entries above the diagonal drawn according to (19), we can write $I(\bm{W};\bm{G})=I(\{W_{ij}\}_{i<j};\{G_{ij}\}_{i<j})$ . Meanwhile, the fact that $\bm{Z}$ is symmetric allows us to write

[TABLE]

where $\{W_{ii}\}$ denotes the diagonal entries of $\bm{W}$ . By the chain rule for mutual information and the conditional independence of $\{Z_{ij}\}_{i\leq j}$ given $\bm{W}$ , the second term on the right-hand side of (43) can be upper bounded as follows:

[TABLE]

where the second inequality follows from the assumption $\operatorname{\sf Var}(W_{ij})\leq B^{2}/n$ and the capacity of the additive Gaussian noise channel. In the following, we compare $I(\bm{W};\bm{G})$ with the first term on the right-hand side of (43).

To simplify notation, let $m=n(n-1)/2$ and let $W$ , $G$ and $Z$ denote the $m$ -dimensional vectors obtained by stacking the columns above the diagonal in $\bm{W}$ , $\bm{G}$ , and $\bm{Z}$ , respectively. The mutual information terms of interest can then be expressed as

[TABLE]

where $P_{G\mid W=w}$ is the conditional distribution of $G$ corresponding to a realization $w$ of $W$ and $D\mathopen{}\mathclose{{}\left(P\,\|\,Q}\right)$ denotes the relative entropy between distributions $P$ and $Q$ . Our approach is to prove that the inequality

[TABLE]

holds uniformly for all $w\in\mathbb{R}^{m}$ satisfying $\|w\|_{\infty}\leq B/\sqrt{n}$ . The desired result for the mutual information then follows from Jensen’s inequality.

B.1 Proof of Inequality (44)

Condition on a realization $w$ of $W$ and let $G\sim P_{G\mid W=w}$ . Let $P_{U}$ be the shifted distribution defined by $\mathrm{d}P_{U}(u)=\mathrm{d}P_{W}(w+u)$ and let $\mathcal{U}$ denote the support of $P_{U}$ . For each $u\in\mathcal{U}$ , we define the log likelihood ratio according to

[TABLE]

Using this notation, the relative entropy can be written as

[TABLE]

where the expectation is with respect to $G\sim P_{G\mid W=w}$ . The score function associated with $w$ is the $m$ -dimensional random vector given by $V\triangleq\nabla\mathcal{L}(0)$ and the Fisher information matrix associated with $w$ is the $m\times m$ positive semidefinite matrix given by $\mathcal{I}\triangleq\operatorname{\mathsf{Cov}}(V)=-\mathbb{E}\mathopen{}\mathclose{{}\left[\nabla^{2}\mathcal{L}(0)}\right]$ . Under the Bernoulli observation model in (19), the entries of $V$ are independent and given by

[TABLE]

and the Fisher information matrix is diagonal with

[TABLE]

To proceed, we define two different approximations to the relative entropy in (45) according to

[TABLE]

where $\tilde{V}\sim\mathcal{N}(0,\mathcal{I})$ is a Gaussian random vector with the same mean and covariance as the score function $V$ . By the triangle inequality,

[TABLE]

The terms on the right-hand side are upper bounded in the following lemmas. The notation $f(x)=O(g(x))$ means that there is a universal constant $C$ such that $f(x)\leq Cg(x)$ , and the notation $f(x)=O_{B,\delta}(g(x))$ means that there is a constant $C(B,\delta)$ such that $f(x)\leq C(B,\delta)\,g(x).$

Lemma 13.

We have

[TABLE]

Proof.

Let $A=(A_{1},\dots,A_{m})$ be the zero-mean random vector defined by $A_{i}=\partial_{i}^{2}\mathcal{L}(0)+\mathcal{I}_{ii}$ where $\partial_{i}^{2}$ denotes the second partial derivative with respect to $u_{i}$ , and let $\{\mathcal{A}(u)\,:\,u\in\mathcal{U}\}$ be the random process given by $\mathcal{A}(u)=\frac{1}{2}\sum_{i=1}^{m}u^{2}_{i}A_{i}.$ The second-order Taylor series expansion of $\mathcal{L}(u)$ about the point $u=0$ can be expressed as

[TABLE]

where $\mathcal{R}(u)$ is the remainder term. In view of (45) and the definition of $\widehat{D}_{1}$ , it follows that

[TABLE]

We first consider the expected supremum of $\mathcal{R}(u)$ . By Taylor’s theorem, there exists a vector $\tilde{u}$ between zero and $u$ such that

[TABLE]

Direct computation reveals that $\partial_{i}^{3}\mathcal{L}(u)=2G(\sqrt{d/(n-d)}+w_{i}+u_{i})^{-3}-2(1-G)(\sqrt{(n-d)/d}-w_{i}-u_{i})^{-3}$ . Noting that $|u_{i}|\leq 2B/\sqrt{n}$ for all $u\in\mathcal{U}$ , one obtains the uniform upper bound

[TABLE]

Combining (48) and (49) with the fact that $m=O(n^{2})$ and $|u_{i}|\leq 2B/\sqrt{n}$ leads to

[TABLE]

Next, we consider the expected supremum of $\mathcal{A}(u)$ . Under the Bernoulli observation model in (19), the entries of $A$ are independent and a straightforward calculation shows that there exist numbers

[TABLE]

such that $\mathbb{E}\mathopen{}\mathclose{{}\left[\mathopen{}\mathclose{{}\left|A_{i}}\right|^{2}}\right]\leq\nu$ and $\mathopen{}\mathclose{{}\left|A_{i}}\right|\leq c$ almost surely. By Bernstein’s Inequality [23, Theorem 2.10], it follows that each $A_{i}$ is a sub-gamma random variable with variance factor $\nu$ and scale factor $c$ , i.e., the cumulant generating function satisfies

[TABLE]

for all $|t|\leq c$ . Hence, for all $u\in\mathcal{U}$ and $|t|\leq 2B^{2}c/n$ ,

[TABLE]

where the equality follows from the independence of the entries of $A$ and the last inequality holds because $u_{i}^{2}\leq 4B^{2}/n$ . An application of the maximal inequality [23, Corollary 2.6] yields

[TABLE]

Combining (52) with $m=O(n^{2})$ and the scalings in (50) and (51) leads to the desired result. ∎

Lemma 14.

We have

[TABLE]

Proof.

Let $\Phi:\mathbb{R}^{m}\to\mathbb{R}$ be defined as $\Phi(v)=-\log\int e^{\langle v,u\rangle-\frac{1}{2}\langle u,\mathcal{I}u\rangle}\mathrm{d}P_{U}(u)$ . Then, we can write

[TABLE]

where we recall that $V$ has independent entries and $\tilde{V}$ is a Gaussian vector with the same first two moments as $V$ . We bound this difference using the generalized Lindeberg principle [24, Theorem 1.1], which implies that, if there exists a constant $L$ such that $|\partial_{i}^{3}\Phi(v)|\leq L$ for each $i$ and $v$ , then

[TABLE]

From (46) and (47) it can be verified that the third moments satisfy

[TABLE]

Meanwhile, if we let $A$ be a $\mathcal{U}$ -valued random vector drawn according to the measure

[TABLE]

then the partial derivatives of $\Phi$ can be expressed as

[TABLE]

Noting that $|A_{i}|\leq 2B/\sqrt{n}$ for all $A\in\mathcal{U}$ we see that $|\partial^{3}_{i}\Phi(v)|=O_{B}\mathopen{}\mathclose{{}\left(n^{-3/2}}\right)$ . Combining this inequality with (53) and (54) completes the proof. ∎

Lemma 15.

We have

[TABLE]

Proof.

Let $\Psi:\mathbb{S}_{+}^{m}\to\mathbb{R}$ be defined as

[TABLE]

where the expectation is with respect to $N\sim\mathcal{N}(0,I_{m})$ . Then, a straightforward calculation reveals that

[TABLE]

where we recall that $\mathcal{I}$ is a diagonal matrix given by (47).

Next, we consider the gradient of $\Psi(K)$ . Let $\mu(\cdot\mid K,N)$ be the probability measure on $\mathcal{U}$ defined by

[TABLE]

and observe that

[TABLE]

Using Gaussian integration by parts (Stein’s lemma) in conjunction with the relation

[TABLE]

leads to

[TABLE]

This identity implies that the nuclear norm of the gradient is bounded by

[TABLE]

where the last step holds because $\|u\|\leq\sqrt{m}2B/\sqrt{n}$ for all $u\in\mathcal{U}$ .

With these results in hand, we can now write

[TABLE]

Finally, from (47), it can be verified that

[TABLE]

which completes the proof. ∎

Appendix C Derivation of Theorem 10

First we observe that if $R$ is positive definite then $R^{1/2}$ is well defined. Introducing the transformed representation $\tilde{\bm{X}}=\bm{X}R^{1/2}$ , we can then write

[TABLE]

Note that if $R$ is negative definite then the same decomposition holds with $(-R)^{1/2}$ . This transformation shows that it is sufficient to focus on the setting where $R$ is the identity matrix.

The result given in [5, Theorem 12] is stated as follows:

[TABLE]

where

[TABLE]

with $N\sim\mathcal{N}(0,I_{d})$ independent of $X\sim P_{X}$ . To see that this expression is equivalent to the one given in Theorem 10, observe that the mutual information function $I_{X}(S)$ can be expressed as follows:

[TABLE]

Rearranging terms leads to

[TABLE]

Finally, using the scaling relationship $I_{R^{1/2}X}(S)=I_{X}(R^{1/2}SR^{1/2})$ leads to the version of the result stated in Theorem 10.

Appendix D Mutual Information and MMSE in Gaussian Noise

D.1 Linear Gaussian Channel

The scalar I-MMSE relationship [25] asserts that the derivative of mutual information in a Gaussian noise channel with respect to the inverse noise variance is equal to one half times the MMSE. A recent line of work in the information theory literature has focused on multivariate extensions of this result for linear Gaussian channels [25, 26, 27, 28]. This section briefly reviews some of the results described by the first author and others [15]. Given a random vector $X\in\mathbb{R}^{d}$ the functions $I_{X}:\mathbb{S}_{+}^{d}\to[0,\infty)$ and $M_{X}:\mathbb{S}_{+}^{d}\to\mathbb{S}_{+}^{d}$ are defined as [15]:

[TABLE]

where $Y=S^{1/2}X+N$ with independent Gaussian noise $N\sim\mathcal{N}(0,I_{d})$ . These functions have a number of important properties. The function $I_{X}(S)$ is concave [15, Theorem 1] and the matrix version of the I-MMSE relation is given by $\nabla I_{X}(S)=\frac{1}{2}M_{X}(S)$ [15, Lemma 4]. Furthermore, these functions describe a linear Gaussian channel with an arbitrary matrix $A\in\mathbb{R}^{m\times d}$ via the following relationship [15, Lemma 1]:

[TABLE]

where $N^{\prime}\sim\mathcal{N}(0,I_{m})$ is independent of $X$ .
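The matrix I-MMSE relation $\nabla I_{X}(S)=\frac{1}{2}M_{X}(S)$ can be sanity-checked numerically in the one case where both sides have simple closed forms. The sketch below assumes a standard Gaussian input $X\sim\mathcal{N}(0,I_{d})$ (an illustrative choice, not the whitened community prior of the paper), for which $I_{X}(S)=\frac{1}{2}\log\det(I+S)$ and $M_{X}(S)=(I+S)^{-1}$ , and compares a finite-difference directional derivative with $\frac{1}{2}\operatorname{tr}(M_{X}(S)T)$ .

```python
import numpy as np

def mutual_info_gaussian(S):
    """I_X(S) in nats for X ~ N(0, I): 0.5 * log det(I + S)."""
    _, logdet = np.linalg.slogdet(np.eye(S.shape[0]) + S)
    return 0.5 * logdet

def mmse_gaussian(S):
    """M_X(S) for X ~ N(0, I): (I + S)^{-1}."""
    return np.linalg.inv(np.eye(S.shape[0]) + S)

rng = np.random.default_rng(2)
d = 3
A = rng.standard_normal((d, d))
S = A @ A.T                  # a positive semidefinite SNR matrix
B = rng.standard_normal((d, d))
T = (B + B.T) / 2            # symmetric perturbation direction

# Central finite difference of I_X along the direction T.
h = 1e-5
numeric = (mutual_info_gaussian(S + h * T)
           - mutual_info_gaussian(S - h * T)) / (2 * h)
# Prediction from the matrix I-MMSE relation: <grad I_X(S), T> = 0.5 tr(M_X(S) T).
analytic = 0.5 * np.trace(mmse_gaussian(S) @ T)
```

The two quantities `numeric` and `analytic` agree up to finite-difference error, which is the directional form of the gradient identity.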

D.2 Linear Gaussian Channel with Matrix Input

The properties of the mutual information and MMSE described in Section D.1 extend naturally to the setting where the input is an $n\times d$ random matrix $\bm{X}=[X_{1},\dots,X_{n}]^{T}$ and the observations are given by $\bm{Y}=\bm{X}S^{1/2}+\bm{N}$ where $S\in\mathbb{S}_{+}^{d}$ and $\bm{N}$ is an $n\times d$ standard Gaussian matrix. In this setting, we define the functions:

[TABLE]

Using vectorization, the mutual information function can be expressed equivalently as

[TABLE]

where $\operatorname{\mathsf{vec}}(\bm{X})$ denotes the $nd\times 1$ vector obtained by stacking the columns in $\bm{X}$ and $\otimes$ denotes the Kronecker product. From this relationship, one finds that the I-MMSE relation still holds for matrix inputs, that is, $\nabla I_{\bm{X}}(S)=\frac{1}{2}M_{\bm{X}}(S)$ .
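The vectorization step rests on the standard identity $\operatorname{\mathsf{vec}}(\bm{X}S^{1/2})=(S^{1/2}\otimes I_{n})\operatorname{\mathsf{vec}}(\bm{X})$ for column-stacking $\operatorname{\mathsf{vec}}$ and symmetric $S^{1/2}$ . The sketch below verifies it numerically on a small random instance; the sizes and the random matrices are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3
X = rng.standard_normal((n, d))
A = rng.standard_normal((d, d))
S = A @ A.T

# Symmetric square root of S via its eigendecomposition.
evals, evecs = np.linalg.eigh(S)
S_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.flatten(order="F")

lhs = vec(X @ S_half)                       # vectorized signal part of Y
channel = np.kron(S_half, np.eye(n))        # (S^{1/2} kron I_n)
rhs = channel @ vec(X)
effective_snr = channel.T @ channel         # equals S kron I_n
```

In vectorized form the matrix-input channel is therefore an ordinary linear Gaussian channel acting on $\operatorname{\mathsf{vec}}(\bm{X})$ with effective SNR matrix $S\otimes I_{n}$ , which is what allows the vector results of Section D.1 to be reused.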

Next, we consider a useful representation of the MMSE matrix $M_{\bm{X}}(S)$ . Let $\bm{A}$ and $\bm{B}$ denote conditionally independent draws from the posterior distribution of $\bm{X}$ given $\bm{Y}$ . Then, the conditional covariance can be expressed as

[TABLE]

and taking the expectation with respect to $\bm{Y}$ gives

[TABLE]

Summing over the indices leads to

[TABLE]

D.3 Symmetric Matrix Estimation

In the symmetric matrix estimation problem, the goal is to estimate an unknown matrix $\bm{X}\in\mathbb{R}^{n\times d}$ from observations of the form

[TABLE]

where $R\in\mathbb{S}^{d}$ is known and $\bm{\xi}\in\mathbb{S}^{n}$ is a standard Gaussian Wigner matrix. In this section, we show that this model can be viewed as a special case of the linear Gaussian channel associated with matrix input given by the tensor product $\bm{X}\otimes\bm{X}$ , and thus the mutual information and MMSE can be characterized using the functions introduced in Section D.2.

The first step is to observe that the symmetric noise model given in (59) provides the same information as the following asymmetric noise model:

[TABLE]

where $\bm{N}$ is an $n\times n$ standard Gaussian matrix. To see why, note that $\tilde{\bm{Z}}$ can be decomposed uniquely in terms of the symmetric matrix $(\tilde{\bm{Z}}+\tilde{\bm{Z}}^{T})/\sqrt{2}=\bm{X}R\bm{X}^{T}+(\bm{N}+\bm{N}^{T})/\sqrt{2}$ and the antisymmetric matrix $(\tilde{\bm{Z}}-\tilde{\bm{Z}}^{T})/\sqrt{2}=(\bm{N}-\bm{N}^{T})/\sqrt{2}$ . By the orthogonal invariance of the Gaussian distribution, the antisymmetric matrix is independent of both $\bm{X}$ and $(\tilde{\bm{Z}}+\tilde{\bm{Z}}^{T})/\sqrt{2}$ . Noticing that $(\bm{N}+\bm{N}^{T})/\sqrt{2}$ is a standard Gaussian Wigner matrix shows that $I(\bm{X};\bm{Z})=I(\bm{X};\tilde{\bm{Z}})$ .
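The noise decomposition used in this argument is easy to check empirically. The sketch below splits an i.i.d. Gaussian matrix into its scaled symmetric and antisymmetric parts and looks at three things: the split is invertible, the symmetric part has the Wigner variance profile (variance 1 off the diagonal and 2 on it), and matching entries of the two parts are empirically uncorrelated. The matrix size and seed are arbitrary, and the variance checks are statistical, so they hold only approximately.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
N = rng.standard_normal((n, n))

sym = (N + N.T) / np.sqrt(2)    # symmetric part of the noise
anti = (N - N.T) / np.sqrt(2)   # antisymmetric part of the noise

# The original matrix is recovered from the two parts.
reconstructed = (sym + anti) / np.sqrt(2)

# Empirical variances of the symmetric part.
iu = np.triu_indices(n, k=1)
off_diag_var = np.var(sym[iu])
diag_var = np.var(np.diag(sym))

# Empirical correlation between matching entries above the diagonal.
corr = np.corrcoef(sym[iu], anti[iu])[0, 1]
```

Because the two parts are jointly Gaussian, zero correlation between them corresponds to independence, which is why discarding the antisymmetric part loses no information about $\bm{X}$ .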

The next step is to use vectorization to represent the observation model in (60) as a linear Gaussian channel with matrix input:

[TABLE]

In view of both (56) and (57), the mutual information can be expressed as

[TABLE]

where the first equality holds because $\bm{X}\otimes\bm{X}$ is a deterministic function of $\bm{X}$ .

This characterization of the mutual information is useful because it allows us to compute gradients with respect to the matrix $R$ . By the I-MMSE relation and the chain rule,

[TABLE]

Furthermore, by (58), the MMSE matrix can be expressed as

[TABLE]

where $\bm{A}$ and $\bm{B}$ denote conditionally independent draws from the posterior distribution of $\bm{X}$ given $\bm{Z}$ . Therefore, (61) can be rewritten compactly as

[TABLE]

Finally, if we consider the parameterization $R_{t}=\sqrt{t}R$ for some $t\geq 0$, then the partial derivative with respect to $t$ is given by

[TABLE]
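The passage from the gradient in $R$ to the derivative in $t$ is an application of the chain rule. As a sketch, write $f(R)$ for the mutual information $I(\bm{X};\bm{Z})$ viewed as a function of $R$ (the symbol $f$ is introduced here only for illustration), and take the gradient with respect to the trace inner product $\langle A,B\rangle=\operatorname{tr}(A^{T}B)$. Then, for $t>0$,

$$\frac{\partial}{\partial t}f(R_{t})=\Big\langle\nabla_{R}f(R_{t}),\frac{\partial R_{t}}{\partial t}\Big\rangle=\frac{1}{2\sqrt{t}}\operatorname{tr}\big(R\,\nabla_{R}f(R_{t})\big).$$

The factor $1/(2\sqrt{t})$ is the derivative of $\sqrt{t}$, and the symmetry of $R$ is used to drop the transpose inside the trace. Substituting the gradient formula evaluated at $R_{t}$ then gives the derivative in terms of the posterior draws $\bm{A}$ and $\bm{B}$.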

Bibliography

[1] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[2] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Physical Review E, vol. 84, no. 6, Dec. 2011.
[3] Y. Deshpande, E. Abbe, and A. Montanari, “Asymptotic mutual information for the balanced binary stochastic block model,” Information and Inference, vol. 6, no. 2, pp. 125–170, Jun. 2017.
[4] F. Caltagirone, M. Lelarge, and L. Miolane, “Recovering asymmetric communities in the stochastic block model,” IEEE Transactions on Network Science and Engineering, vol. 5, no. 3, pp. 237–246, 2018.
[5] M. Lelarge and L. Miolane, “Fundamental limits of symmetric low-rank matrix estimation,” Probability Theory and Related Fields, 2018.
[6] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, “Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula,” in Advances in Neural Information Processing Systems (NIPS), vol. 29, Barcelona, Spain, 2016, pp. 424–432.
[7] T. Lesieur, F. Krzakala, and L. Zdeborová, “Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications,” Journal of Statistical Mechanics: Theory and Experiment, Jul. 2017.
[8] F. Krzakala, J. Xu, and L. Zdeborová, “Mutual information in rank-one matrix estimation,” in Proceedings of the IEEE Information Theory Workshop (ITW), 2016.
