Exact Recovery for a Family of Community-Detection Generative Models

Luca Corinzia; Paolo Penna; Luca Mondada; Joachim M. Buhmann

arXiv:1901.06799·cs.IT·May 2, 2019


TL;DR

This paper introduces a new toy model called the planted REM for community detection, analyzes the error probability and recovery thresholds, and provides the first consistency results for 2-WSBM on graphs and hypergraphs with unequal communities.

Contribution

It proposes the planted REM model, derives asymptotic error probabilities, and establishes the first consistency results for 2-WSBM with unequal-sized communities.

Findings

01. Asymptotic behavior of error probability for the planted REM

02. Recovery thresholds for community detection in the model

03. First consistency results for 2-WSBM with non-equal communities

Abstract

Generative models for networks with communities have been studied extensively for being a fertile ground to establish information-theoretic and computational thresholds. In this paper we propose a new toy model for planted generative models called planted Random Energy Model (REM), inspired by Derrida's REM. For this model we provide the asymptotic behaviour of the probability of error for the maximum likelihood estimator and hence the exact recovery threshold. As an application, we further consider the 2 non-equally sized community Weighted Stochastic Block Model (2-WSBM) on $h$ -uniform hypergraphs, that is equivalent to the P-REM on both sides of the spectrum, for high and low edge cardinality $h$ . We provide upper and lower bounds for the exact recoverability for any $h$ , mapping these problems to the aforementioned P-REM. To the best of our knowledge these are the first consistency…


Tables

Table I: Thresholds for the 2-hWSBM at different $h$, for $k=o(\log N)$.

$h$	Model	$\gamma_{-}$	$\gamma_{+}$
$1$	$k$-P-REM	$1$	$1$
$2$	2-WSBM	$\sqrt{\frac{1}{k-1}}$	$2\sqrt{\frac{1}{k-1}}$
$2<h<k$	2-hWSBM	$\sqrt{\frac{1}{\binom{k-1}{h-1}}}$	$2\sqrt{\frac{h/2}{\binom{k-1}{h-1}}}$
$k$	$1$-P-REM	$1$	$1$

Equations

$$E_{i}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i\in S\\ \mathcal{N}(0,\sigma^{2}) & i\notin S\end{cases}$$

$$\mathbb{P}[S=\hat{S}]=1-o_{M}(1)$$

$$\hat{S}^{ML}(\mathbf{E})=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ \sum_{j\in\hat{S}}E_{j}$$

$$\mathbb{P}[\hat{\alpha}=\alpha]=\begin{cases}o(1) & \gamma<1\\ 1-\frac{1}{M^{(\gamma-1)^{2}+o(1)}} & 1<\gamma<2\\ 1-\frac{1}{M^{\left(\frac{\gamma^{2}}{2}-1\right)+o(1)}} & 2<\gamma\end{cases}$$

$$\begin{aligned}\mathbb{P}[\hat{l}=l]&=\mathbb{P}[E_{l}\geq E_{i},\ \forall i=1,\dots,M+1]\\ &=\mathbb{E}_{E_{l}}\,\mathbb{P}[E_{l}\geq E_{i},\ \forall i=1,\dots,M+1\mid E_{l}]\\ &=\mathbb{E}_{E_{l}}\prod_{i=1,\,i\neq l}^{M+1}\mathbb{P}[E_{l}\geq E_{i}\mid E_{l}]\\ &=\mathbb{E}_{E_{l}}\prod_{i=1,\,i\neq l}^{M+1}\phi\left(\frac{E_{l}\sqrt{2}}{\sqrt{\log M}\,\hat{\sigma}}\right)=\mathbb{E}_{E_{l}}\,\phi\left(\frac{E_{l}\sqrt{2}}{\sqrt{\log M}\,\hat{\sigma}}\right)^{M}\end{aligned}$$

$$\sqrt{\frac{\log M}{\pi}}\int_{-\infty}^{+\infty}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}$$

$$E_{ij}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i,j\in S\\ \mathcal{N}(0,\sigma^{2}) & \text{otherwise}\end{cases}$$

$$W=\begin{bmatrix}p & q\\ q & q\end{bmatrix}$$

$$\mathbb{P}[S=\hat{S}]=1-o_{N}(1)$$

$$\hat{S}^{ML}(E)=\operatorname*{argmax}_{\hat{S}:\ |\hat{S}|=k}\ W(\hat{S})$$

$$\gamma_{+}=\begin{cases}2\sqrt{\frac{1}{k-1}} & k=o(\log N)\\ 2\sqrt{\frac{1+\log 2+\frac{1}{c}}{\log N}} & \frac{k}{\log N}\to c,\ c\in\mathbb{R}^{+}\cup\{+\infty\}\\ 2\sqrt{\frac{1+\log 2}{(1-\alpha)\log N}} & k\lessapprox N^{\alpha},\ 0<\alpha<1\end{cases}$$

$$E_{\hat{S}}=W_{REM}(\hat{S})\sim\begin{cases}\mathcal{N}(\ell_{m}\mu,\ell_{m}\sigma^{2}) & \hat{S}=S\\ \mathcal{N}(0,\ell_{m}\sigma^{2}) & \hat{S}\neq S\end{cases}$$

$$E_{i_{1}i_{2}\dots i_{h}}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i_{1},i_{2},\dots,i_{h}\in S\\ \mathcal{N}(0,\sigma^{2}) & \text{otherwise}\end{cases}$$

$$\gamma_{+}=\begin{cases}2\sqrt{\frac{h/2}{\binom{k-1}{h-1}}} & \frac{1}{h}\binom{k-1}{h-1}=o(\log N)\\ 2\sqrt{\frac{1+\log 2+\frac{1}{c}}{\log N}} & \frac{1}{h}\binom{k-1}{h-1}/\log N\to c,\ c\in\mathbb{R}^{+}\cup\{+\infty\}\\ 2\sqrt{\frac{1+\log 2}{(1-\alpha)\log N}} & k\lessapprox N^{\alpha},\ 0<\alpha<1\end{cases}$$

$$P_{e}=\sum_{S}\int d\mathbf{E}\ p(S)\,p(\mathbf{E}\mid S)\,\mathbbm{1}[\hat{S}(\mathbf{E})\neq S]$$

$$\begin{aligned}\hat{\alpha}^{MAP}(\mathbf{E})&=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ p(\mathbf{E}\mid\hat{S})\\ &=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ \prod_{i\in\hat{S}}\mathcal{N}(E_{i}\mid\mu,\sigma^{2})\prod_{j\notin\hat{S}}\mathcal{N}(E_{j}\mid 0,\sigma^{2})\\ &=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ \prod_{i\in\hat{S}}\frac{\mathcal{N}(E_{i}\mid\mu,\sigma^{2})}{\mathcal{N}(E_{i}\mid 0,\sigma^{2})}\prod_{j}\mathcal{N}(E_{j}\mid 0,\sigma^{2})\\ &=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ \sum_{i\in\hat{S}}E_{i}\end{aligned}$$

$$\phi(x)=\begin{cases}1-\frac{1}{\sqrt{2\pi}\,x}e^{-\frac{x^{2}}{2}}\left(1+\mathcal{O}(1/x^{2})\right) & \text{for }x\to+\infty\\ \frac{1}{\sqrt{2\pi}\,|x|}e^{-\frac{x^{2}}{2}}\left(1+\mathcal{O}(1/x^{2})\right) & \text{for }x\to-\infty\end{cases}$$

$$\phi\left(\sqrt{2\log M}\,\epsilon\right)=\begin{cases}1-\frac{e^{-\log M\epsilon^{2}}}{2\sqrt{\pi\log M}\,\epsilon}\left(1+\mathcal{O}\left(\frac{1}{\log M\epsilon^{2}}\right)\right) & \forall\epsilon>0\\ \frac{e^{-\log M\epsilon^{2}}}{2\sqrt{\pi\log M}\,|\epsilon|}\left(1+\mathcal{O}\left(\frac{1}{\log M\epsilon^{2}}\right)\right) & \forall\epsilon<0\end{cases}$$

$$\begin{aligned}\phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}&=\exp\left(-M\frac{e^{-\log M\epsilon^{2}}}{2\sqrt{\pi\log M}\,\epsilon}\left(1+\mathcal{O}\left(\frac{1}{\log M\epsilon^{2}}\right)\right)\right)\\ &=\exp\left(-\frac{e^{\log M(1-\epsilon^{2})}}{2\sqrt{\pi\log M}\,\epsilon}\left(1+\mathcal{O}\left(\frac{1}{\log M\epsilon^{2}}\right)\right)\right)\end{aligned}$$

$$\phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}=1-\frac{e^{-\log M(\epsilon^{2}-1)}}{2\sqrt{\pi\log M}\,\epsilon}\left(1+\mathcal{O}\left(\frac{1}{\log M(\epsilon^{2}-1)}\right)\right),\qquad\epsilon>1$$

$$\begin{aligned}\mathbb{P}[\hat{l}=l]=&\ \sqrt{\frac{\log M}{\pi}}\int_{-\infty}^{0}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}\\ &+\sqrt{\frac{\log M}{\pi}}\int_{0}^{1+\tau_{M}}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}\\ &+\sqrt{\frac{\log M}{\pi}}\int_{1+\tau_{M}}^{+\infty}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}\end{aligned}$$

$$\begin{aligned}&\sqrt{\frac{\log M}{\pi}}\int_{1+\tau_{M}}^{+\infty}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}\\ &=\sqrt{\frac{\log M}{\pi}}\int_{1+\tau_{M}}^{\infty}d\epsilon\left(1-\frac{e^{-\log M(\epsilon^{2}-1)}}{2\sqrt{\pi\log M}\,\epsilon}\left(1+\mathcal{O}\left(\frac{1}{\log M}\right)\right)\right)e^{-\log M(\epsilon-\gamma)^{2}}\\ &=\sqrt{\frac{\log M}{\pi}}\int_{1+\tau_{M}}^{\infty}d\epsilon\ e^{-\log M(\epsilon-\gamma)^{2}}-\sqrt{\frac{\log M}{\pi}}\int_{1+\tau_{M}}^{\infty}d\epsilon\ \frac{e^{-\log M(\epsilon^{2}-1)}}{2\sqrt{\pi\log M}\,\epsilon}e^{-\log M(\epsilon-\gamma)^{2}}\left(1+\mathcal{O}\left(\frac{1}{\log M}\right)\right)\end{aligned}$$


Full text

Exact Recovery for a Family of Community-Detection Generative Models

Luca Corinzia, Paolo Penna, Luca Mondada and Joachim M. Buhmann

Department of Computer Science

ETH Zürich, Switzerland

{luca.corinzia,paolo.penna,jbuhmann}@inf.ethz.ch, [email protected]

Abstract

Generative models for networks with communities have been studied extensively for being a fertile ground to establish information-theoretic and computational thresholds. In this paper we propose a new toy model for planted generative models called planted Random Energy Model (REM), inspired by Derrida’s REM. For this model we provide the asymptotic behaviour of the probability of error for the maximum likelihood estimator and hence the exact recovery threshold. As an application, we further consider the 2 non-equally sized community Weighted Stochastic Block Model (2-WSBM) on $h$ -uniform hypergraphs, that is equivalent to the P-REM on both sides of the spectrum, for high and low edge cardinality $h$ . We provide upper and lower bounds for the exact recoverability for any $h$ , mapping these problems to the aforementioned P-REM. To the best of our knowledge these are the first consistency results for the 2-WSBM on graphs and on hypergraphs with non-equally sized community.

I Introduction

I-A Motivation and main contributions

Random combinatorial optimization problems have been the subject of intensive research in recent years in various disciplines, including statistical mechanics, combinatorial optimization and information theory [1]. A fruitful toy model for random combinatorial optimization problems is the Random Energy Model (REM), which assumes configurations with independent, identically distributed weights. Despite being a very simplistic model that does not show any spin glass behaviour, it has been used as a comparison to other random combinatorial optimization problems in the community-detection field, in those regimes where solutions are weakly correlated in the thermodynamic limit [2, 3, 4].

In this line of research we define a new generative model called planted-REM (P-REM) inspired by the REM mentioned above, and embed it into a family of generative models for the community detection problem on hypergraphs. These models generate random instances of the form “signal+noise” as follows: first some randomly chosen solution is planted and then Gaussian noise is added to the instance. The P-REM model is the planted counterpart of REM where solutions have independent random weights, and one planted solution has a bias $\mu>0$ . In the generative models for hypergraphs, the planted solution is a cluster or community of $k$ random nodes, and the weights of all hyperedges within this community have a bias $\mu>0$ . Unlike the P-REM model, solutions which share some edges are statistically dependent. A fundamental question for these models is whether the planted solution can be recovered despite the noise.

The parameter $h$, which defines the edge cardinality in the hypergraph, regulates how strongly different solutions are statistically correlated. On the two sides of the spectrum, i.e. $h=k$ and $h=1$, the model “collapses” to the P-REM with statistically independent solutions, as each solution contains exactly one hyperedge. For intermediate values of $h$ we have the 2-Clusters Weighted Stochastic Block Model (2-WSBM) on hypergraphs with highly correlated solutions: two solutions with a large node overlap share many edges and thus have many common random variables. The 2-WSBM, and in general the SBM, is a well-studied model in social networks where the edges (and the edge weights) are generated at random in a way that reflects the membership of nodes in (unknown) communities. The main question for this model is whether it is possible to recover the community from the edge weights.

Interestingly, since in the P-REM model ($h=k$) solutions are independent, finding the optimum requires searching through all $\binom{N}{k}$ solutions. On the contrary, for small $h\ll k$, the optimum might in principle be computed faster by exploiting the dependency between solutions. The contributions of this paper and its outline can be summarized as follows:

• In Section II, we introduce a generative model for planted REM, and establish necessary and sufficient conditions for the asymptotic recoverability of the planted solution.

• In Section III and Section IV, we define the 2-WSBM on graphs and on generic $h$-uniform hypergraphs and provide general recoverability conditions for any $h$, summarized in Table I.

• We provide a technique to map the event of successful recovery in these combinatorial problems to that of a corresponding P-REM. In this sense, this approach provides a general technique which can be of relevance also in other applications.

I-B Related work

Historically, the first approaches to random combinatorial optimization problems were driven by the interest in disordered systems and spin glass behaviour in the statistical mechanics community. The REM [5, 6] was first introduced as a solvable mean field model of a disordered system. This work inspired several follow-ups, like the Sherrington-Kirkpatrick model for spin glasses, the statistical mechanics approach to the travelling salesman problem, and others [7].

These problems also often arise as Maximum Likelihood (ML) estimators in generative models for graphs. In these models ML estimators are used to define information-theoretic conditions under which the recovery of the generative model parameters can or cannot be solved, irrespective of complexity or algorithmic considerations. In the following we give an overview of a few generative models for community detection on graphs that have been proposed so far.

In the planted clique problem, an unweighted graph is generated according to a semi-random model where the edges are drawn according to Bernoulli distributions, i.e., the standard Erdős–Rényi random graph (ER). In the simplest version, a set $S$ of nodes is chosen at random as the planted clique and all edges connecting them are added to the graph, while every other edge is included with probability $1/2$. This version exhibits a computational-statistical gap [8], meaning that although it is possible to recover planted cliques of size $|S|\geq 2\log_{2}n$, the best known polynomial-time algorithms require $|S|\geq n^{1/2}$ [9].

The SBM is another well-known generative model for (multi) community detection. In its classic form, the SBM is another generalization of the ER, where each vertex belongs to one of $g$ groups, and each undirected edge is drawn according to a probability distribution that depends only on the group memberships of its two endpoints [10]. The classic SBM exhibits no statistical gap, and the recovery thresholds have been found recently for symmetric [11, 12] and non-symmetric [13] models. Despite these achievements, open problems still exist for the various SBM generalizations, like the weighted SBM (WSBM) and the SBM on (homogeneous) hypergraphs (hSBM). The WSBM includes in the model additional information on the edges (represented by the weights) that can be used to better detect the communities [14, 15, 16]. The only information-theoretic result on the WSBM is given in [17], where the exact recovery threshold is given for a homogeneous WSBM with exactly equally sized communities. The SBM on hypergraphs has instead been proposed to model various grouping problems arising from computer vision and signal processing, where only a multi-similarity is available as a function of more than just two points [18, 19]. The first consistency results for these models were given in [19] and in the follow-up [20] for the special case of spectral algorithms, on uniform and non-uniform hypergraphs respectively. Recently [21] studied for the first time the weighted SBM on hypergraphs, providing recoverability thresholds for the homogeneous case with equally sized communities.

II Planted Random Energy Model

In this section we define the P-REM as a toy model for planted generative models on graphs. We assume here $k$ and $M$ to be respectively the number of biased (planted) and unbiased Gaussian random variables.

Definition 1

Let $M$ be a positive integer, $\mu$ and $\sigma$ two positive constants. The couple $(S,\mathbf{E})$, with $\mathbf{E}=(E_{1},\dots,E_{M+k})$, is drawn under P-REM $(\mu,\sigma,M,k)$ (denoted in the following also as $k$-P-REM) if the states $E_{i}$ are random variables, conditionally independent given $S$ and normally distributed as:

$$E_{i}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i\in S\\ \mathcal{N}(0,\sigma^{2}) & i\notin S\end{cases}$$

where $S\subset\{1,\dots,M+k\}$, $|S|=k$, is drawn uniformly at random.

The Gaussian weights are used for historical reasons, and a natural extension of this work can account for any arbitrary distribution, like in the work on WSBM in [14]. Note also that in the statistical mechanics literature the number of states is denoted as $M=2^{N}$ to resemble a $1/2$ -spin model.

Definition 2

Exact recovery for the P-REM is achieved if there exists an algorithm that takes $\mathbf{E}$ as input and outputs $\hat{S}=\hat{S}(\mathbf{E})$ such that

$$\mathbb{P}[S=\hat{S}]=1-o_{M}(1)$$

To establish the information-theoretic limit for the P-REM we have to study the algorithm that maximizes the probability of recovering the correct biased indices, that is, Maximum A Posteriori (MAP) decoding. In the case of the P-REM, the indices of the biased weights are drawn uniformly, hence the MAP estimator corresponds to the ML estimator, which selects the indices of the top (largest) $k$ weights.

Theorem 1

The ML estimator for the P-REM reads

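$$\hat{S}^{ML}(\mathbf{E})=\operatorname*{argmax}_{\substack{\hat{S}\subset\{1,\dots,M+k\}\\ |\hat{S}|=k}}\ \sum_{j\in\hat{S}}E_{j}$$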

Proof:

See appendix A . ∎
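As an illustration of Definition 1 and Theorem 1, the following is a minimal sketch that samples a $k$-P-REM instance and applies the top-$k$ rule. The function names `sample_prem` and `ml_estimate`, the use of NumPy, and the parameter values are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def sample_prem(mu, sigma, M, k, rng):
    """Draw (S, E) from a k-P-REM: k planted states with mean mu, M unbiased states."""
    n_states = M + k
    planted = rng.choice(n_states, size=k, replace=False)  # S, uniform over k-subsets
    energies = rng.normal(0.0, sigma, size=n_states)        # noise N(0, sigma^2)
    energies[planted] += mu                                 # bias on the planted states
    return set(planted.tolist()), energies

def ml_estimate(energies, k):
    """ML estimator of Theorem 1: indices of the k largest weights."""
    top_k = np.argpartition(energies, -k)[-k:]
    return set(top_k.tolist())

# Illustrative values only: a very large bias makes recovery essentially certain.
rng = np.random.default_rng(0)
S, E = sample_prem(mu=50.0, sigma=1.0, M=1000, k=5, rng=rng)
S_hat = ml_estimate(E, k=5)
```

The estimator only needs a partial selection of the largest $k$ entries, so `argpartition` is used instead of a full sort.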

Now we can establish the information-theoretic limit for exact recovery in the P-REM. We first study the recoverability condition for the $1$-P-REM, which depends intuitively on the magnitude of the signal-to-noise ratio (SNR) of the model. Then we will extend the results to any $k\leq M^{\alpha}$ with $\alpha<1$.

Theorem 2

Given a P-REM with parameters $(\hat{\mu}\log M,\hat{\sigma}\sqrt{\log M/2},M,1)$, the recovery probability of the ML estimator has asymptotics

\begin{equation}\mathbb{P}[\hat{\alpha}=\alpha]=\begin{cases}o(1) & \gamma<1\\ 1-\frac{1}{M^{(\gamma-1)^{2}+o(1)}} & 1<\gamma<2\\ 1-\frac{1}{M^{\left(\frac{\gamma^{2}}{2}-1\right)+o(1)}} & 2<\gamma\end{cases}\tag{2}\end{equation}

where $\gamma=\hat{\mu}/\hat{\sigma}$ is the SNR. Hence exact recovery is solvable if $\gamma>\gamma_{c}$ and unsolvable if $\gamma<\gamma_{c}$, with $\gamma_{c}=1$.

Proof:

Without loss of generality let us assume $l$ to be the index of the planted solution. The probability for the ML estimator to correctly identify the planted solution reads:

$$\begin{aligned}\mathbb{P}[\hat{l}=l]&=\mathbb{P}[E_{l}\geq E_{i},\ \forall i=1,\dots,M+1]\\ &=\mathbb{E}_{E_{l}}\,\mathbb{P}[E_{l}\geq E_{i},\ \forall i=1,\dots,M+1\mid E_{l}]\\ &=\mathbb{E}_{E_{l}}\prod_{i=1,\,i\neq l}^{M+1}\mathbb{P}[E_{l}\geq E_{i}\mid E_{l}]\\ &=\mathbb{E}_{E_{l}}\prod_{i=1,\,i\neq l}^{M+1}\phi\left(\frac{E_{l}\sqrt{2}}{\sqrt{\log M}\,\hat{\sigma}}\right)=\mathbb{E}_{E_{l}}\,\phi\left(\frac{E_{l}\sqrt{2}}{\sqrt{\log M}\,\hat{\sigma}}\right)^{M}\end{aligned}$$

where $\phi(x)$ is the cumulative distribution function (cdf) of the standard Gaussian distribution. Now we can study the thermodynamic limit $M\to\infty$. Let us perform the change of variable $E_{l}=\epsilon\log M\hat{\sigma}$ and define the SNR as $\gamma=\hat{\mu}/\hat{\sigma}$. Then the probability of success reads

\begin{equation}\sqrt{\frac{\log M}{\pi}}\int_{-\infty}^{+\infty}d\epsilon\ \phi\left(\sqrt{2\log M}\,\epsilon\right)^{M}e^{-\log M(\epsilon-\gamma)^{2}}\tag{3}\end{equation}

Note that we cannot apply the method of steepest descent [22] to the integral in this form, since the dependency on $M$ is not only in the exponential term but also in the function to be averaged. The idea of the proof is that the Gaussian probability distribution for large $M$ converges to a delta function centered at $\gamma$, and the function to be averaged, namely $\phi\left(\sqrt{2\log M}\epsilon\right)^{M}$, converges to a (Heaviside) step function at $1$ in the same regime. This gives intuitively the recovery threshold $\gamma_{c}=1$. The corrections to $1$ in the high $\gamma$ regime shown in eq. 2 are given by (i) the Gaussian cdf for $1<\gamma<2$ and (ii) the deviation of the step function from $1$ for $\gamma>2$. See appendix A for further details.

Comment: Note that the behaviour of $\phi\left(\sqrt{2\log M}\epsilon\right)^{M}$ is given by the Fisher–Tippett–Gnedenko theorem (see [23] and appendix A) in a particular regime, namely the case $\phi\left(a_{M}x+b_{M}\right)^{M}\approx\exp(-e^{-x})$ (Gumbel cdf) for any given finite $x$ and $a_{M}\approx\frac{1}{(2\log M)^{\frac{1}{2}}}$, $b_{M}\approx(2\log M)^{\frac{1}{2}}$. Hence it is easy to see that the function considered here, $\phi\left(\sqrt{2\log M}\epsilon\right)^{M}$, behaves like a Heaviside step function for $\epsilon>1$ and $\epsilon<1$, and instead behaves like the Gumbel function close to $1$, for $\epsilon\approx 1+\frac{x}{2\log M}$. More details are given in appendix A. ∎
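The step-function behaviour of $\phi\left(\sqrt{2\log M}\epsilon\right)^{M}$ described in the proof can be checked numerically. The sketch below is an illustration only: the helper name `phi_pow_M` and the choice $M=10^{6}$ are assumptions of this example, and the Gaussian tail is computed with `math.erfc` to avoid cancellation when $\phi$ is close to $1$.

```python
import math

def phi_pow_M(eps, M):
    """Evaluate phi(sqrt(2 log M) * eps)^M, with phi the standard Gaussian cdf."""
    x = math.sqrt(2.0 * math.log(M)) * eps
    tail = 0.5 * math.erfc(x / math.sqrt(2.0))  # 1 - phi(x)
    if tail >= 1.0:                             # phi(x) underflows to 0 for very negative x
        return 0.0
    return math.exp(M * math.log1p(-tail))      # phi(x)^M = exp(M * log(1 - tail))

M = 10**6
below = phi_pow_M(0.8, M)   # eps < 1: expected to be close to 0
above = phi_pow_M(1.2, M)   # eps > 1: expected to be close to 1
```

For finite $M$ the transition around $\epsilon=1$ has a width of order $1/\log M$, so values of $\epsilon$ very close to $1$ still give intermediate results.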

For the general P-REM the same recoverability threshold as for the $1$-P-REM applies as long as $k$ grows slower than any power of $M$. For $k=M^{\alpha}$ with $\alpha$ a strictly positive constant ($0<\alpha<1$) we have a gap of size $\sqrt{\alpha}$, as shown in the following theorem:

Theorem 3

A P-REM problem with parameters $(\hat{\mu}\log M,\hat{\sigma}\sqrt{\log M/2},M,k)$ is solvable for any constant $\gamma>1+\sqrt{\alpha}$ and unsolvable for $\gamma<1$ , where $\alpha$ is the smallest value such that $k\leq M^{\alpha}$ and $0<\alpha<1$ .

Proof:

See appendix A . ∎

III Weighted Stochastic Block Model

In this section we define the weighted Stochastic Block Model (WSBM) with two clusters and Gaussian weights. In particular let us consider a complete graph $\mathcal{G}$ with set of nodes $\mathcal{V}=\{1,\dots,N\}$.

Definition 3

Let $N$ and $k$ be two positive integers, $\mu$ and $\sigma$ two positive constants. The couple $(S,E)$ is drawn under 2-WSBM $(\mu,\sigma,N,k)$ if the edge weights $E_{ij}$ are random variables, conditionally independent given $S$ and normally distributed as:

$$E_{ij}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i,j\in S\\ \mathcal{N}(0,\sigma^{2}) & \text{otherwise}\end{cases}$$

where $S\subset\mathcal{V}$, $|S|=k$, is drawn uniformly at random.

Note that this definition is equivalent to the definition of a non-homogeneous 2-WSBM given in [14], in the case of Gaussian weights. Formally, the cluster connectivity matrix of the model described in Definition 3, according to the framework given in [14], reads

$$W=\begin{bmatrix}p & q\\ q & q\end{bmatrix}$$

where $p=\mathcal{N}\left(\mu,\sigma^{2}\right)$ is the pdf of the weights in the planted cluster $S$ and $q=\mathcal{N}\left(0,\sigma^{2}\right)$ is the pdf of both the weights within the cluster $\mathcal{V}\setminus S$ and the weights across the two clusters. Note also that in the definition used here we consider the size of one cluster to be $|S|=k$, hence the size of the other cluster is $|\mathcal{V}\setminus S|=N-k$.

Definition 4

Exact recovery for the 2-WSBM in Definition 3 is achieved if there exists an algorithm that takes $E$ as input and outputs $\hat{S}=\hat{S}(E)$ such that

$$\mathbb{P}[S=\hat{S}]=1-o_{N}(1)$$

As for the case of the P-REM, the information-theoretic recoverability limit is given by the MAP estimator, which for clique nodes $S$ drawn uniformly at random corresponds to the ML estimator described in the following theorem.

Theorem 4

The ML estimator for the 2-WSBM in Definition 3 is the densest $k$-subgraph:

$$\hat{S}^{ML}(E)=\operatorname*{argmax}_{\hat{S}:\ |\hat{S}|=k}\ W(\hat{S})$$

where $W(\cdot)$ is the solution weight, defined as $W(A):=\sum_{i,j\in A}E_{ij}$.

Proof:

See appendix B . ∎
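For very small instances, the densest $k$-subgraph estimator of Theorem 4 can be evaluated by exhaustive search over all $\binom{N}{k}$ candidate sets. The sketch below is illustrative only: the function names and parameter values are assumptions of this example, and it counts each unordered pair $\{i,j\}$ once in $W(A)$, which does not change the maximizer.

```python
import itertools
import numpy as np

def sample_2wsbm(mu, sigma, N, k, rng):
    """Draw (S, E) from the 2-WSBM of Definition 3; E is stored as an upper-triangular matrix."""
    planted = tuple(sorted(rng.choice(N, size=k, replace=False).tolist()))
    weights = np.triu(rng.normal(0.0, sigma, size=(N, N)), k=1)  # one weight per pair i < j
    for i, j in itertools.combinations(planted, 2):
        weights[i, j] += mu                                      # bias edges inside S
    return planted, weights

def solution_weight(weights, nodes):
    """W(A): sum of the edge weights over unordered pairs inside the node set A."""
    return sum(weights[i, j] for i, j in itertools.combinations(sorted(nodes), 2))

def ml_densest_subgraph(weights, k):
    """Brute-force ML estimator: the k-subset with the largest total weight."""
    N = weights.shape[0]
    return max(itertools.combinations(range(N), k),
               key=lambda A: solution_weight(weights, A))

# Illustrative values only: a very large bias makes recovery essentially certain.
rng = np.random.default_rng(1)
planted, weights = sample_2wsbm(mu=50.0, sigma=1.0, N=8, k=3, rng=rng)
S_hat = ml_densest_subgraph(weights, k=3)
```

The exhaustive search is exponential in $k$ and is only meant to make the estimator concrete; it says nothing about efficient algorithms for the problem.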

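Also in the 2-WSBM the information-theoretic limit for exact recovery depends on the SNR, and the thresholds are given by the following theorem.

Theorem 5

Exact recovery in the 2-WSBM with parameters $(\hat{\mu}\log N,\hat{\sigma}\sqrt{\log N/2},N,k)$ is unsolvable if $\gamma<\gamma_{-}=\sqrt{\frac{1}{k-1}}$ and solvable if $\gamma>\gamma_{+}$, where the upper threshold is defined according to the different $k$ regimes as:

$$\gamma_{+}=\begin{cases}2\sqrt{\frac{1}{k-1}} & k=o(\log N)\\ 2\sqrt{\frac{1+\log 2+\frac{1}{c}}{\log N}} & \frac{k}{\log N}\to c,\ c\in\mathbb{R}^{+}\cup\{+\infty\}\\ 2\sqrt{\frac{1+\log 2}{(1-\alpha)\log N}} & k\lessapprox N^{\alpha},\ 0<\alpha<1\end{cases}$$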

Proof:

Lower Bound: We shall reduce the model to the 1-P-REM problem by partitioning the set of all solutions into groups, where the solutions in each group form an instance of 1-P-REM. We then use the bounds for 1-P-REM to show that the probability that the planted solution “fails” against the solutions in a single group is sufficiently small. By taking the union bound over all groups, we get an upper bound on the fail probability that is vanishing (hence a lower bound on the recovery probability that converges to 1).

The actual partition of the solutions is done in order to guarantee that, within each group, solutions are “statistically independent” as in the P-REM (see Figure 1).

Given the planted solution $S$ , fix a possible intersection, i.e. a subset $I$ of $m$ nodes in $S$ and consider all solutions $\hat{S}$ consisting of nodes in this intersection $I$ and of $k-m$ nodes not in $S$ . Then, further partition these solutions into groups such that, each group $\mathcal{C}_{\hat{S}}$ consists of $M_{m}=\lfloor\frac{N-k}{k-m}\rfloor$ solutions that are node-disjoint except the common intersection $I$ (simply take disjoint blocks of $k-m$ nodes each). The key point is that the nodes in $I$ are contained in all solutions and in the planted solution as well. We can therefore remove all these edges from the consideration, and restrict to the remaining edges with weights $W_{REM}(\hat{S}):=W(\hat{S})-W(I)$ for any solution $\hat{S}$ as above (including the planted solution $S$ ). Note that the following holds: (i) Each $W_{REM}(\hat{S})$ is a Gaussian random variable which is given by the sum of $\ell_{m}=\binom{k}{2}-\binom{m}{2}$ edge weights. (ii) Since two solutions $\hat{S}^{\prime}$ and $\hat{S}^{\prime\prime}$ do not share any edge other than those inside $I$ , $W_{REM}(\hat{S}^{\prime})$ and $W_{REM}(\hat{S}^{\prime\prime})$ are independent Gaussian random variables. (iii) $W(S)>W(\hat{S})$ if and only if $W_{REM}(S)>W_{REM}(\hat{S})$ . Hence variables in $\mathcal{C}_{\hat{S}}$ are a 1-P-REM with $1+M_{m}=1+|\mathcal{C}_{\hat{S}}|$ states where each state (solution) is a Gaussian random variable distributed as

$$E_{\hat{S}}=W_{REM}(\hat{S})\sim\begin{cases}\mathcal{N}(\ell_{m}\mu,\ell_{m}\sigma^{2}) & \hat{S}=S\\ \mathcal{N}(0,\ell_{m}\sigma^{2}) & \hat{S}\neq S\end{cases}$$

Our construction yields $M_{m}=\lfloor\frac{N-k}{k-m}\rfloor\geq N^{1-\alpha}$ and the resulting 1-P-REM has $\mu_{REM}=\ell_{m}\mu$ and $\sigma^{2}_{REM}=\ell_{m}\sigma^{2}$, which results in an SNR $\gamma_{m}=\frac{\hat{\mu}_{REM}}{\hat{\sigma}_{REM}}\approx\frac{\hat{\mu}}{\hat{\sigma}}\sqrt{\ell_{m}}=\gamma\sqrt{\ell_{m}}$. For $\gamma>\gamma_{+}$, it follows that $\gamma_{m}>2$ for all $m$, and eq. 2 yields a probability of “failing” against some solution in $\mathcal{C}_{\hat{S}}$ which is “sufficiently small” to apply the union bound over all possible $m$ and all possible $\mathcal{C}_{\hat{S}}$ necessary to cover all solutions.

Upper Bound: We construct a single $\mathcal{C}_{\hat{S}}$ with maximal $m=k-1$ and observe that, since $\ell_{k-1}=k-1$ , for $\gamma<\gamma_{-}=\sqrt{\frac{1}{k-1}}$ the signal to noise ratio of the corresponding P-REM satisfies $\gamma_{k-1}<1$ . By eq. 2 the probability of “failing” against some solutions in $\mathcal{C}_{\hat{S}}$ tends to $1$ , and the probability of failing against an arbitrary solution is lower bounded by it. See appendix B for the full proof. ∎
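The grouping used in the proof can be made concrete with a short sketch. It builds a single group of competitors that all contain a fixed intersection $I\subset S$ and are otherwise node-disjoint, by cutting the nodes outside $S$ into consecutive blocks of size $k-m$. The names `build_group` and `ell` and the example values are assumptions of this illustration; the full partition of all solutions into such groups is not constructed here.

```python
from math import comb

def ell(k, m):
    """Number of edges of a k-node solution that lie outside the m-node intersection I."""
    return comb(k, 2) - comb(m, 2)

def build_group(N, S, I):
    """One group of competitors: each contains I plus a disjoint block of k - m outside nodes."""
    S, I = set(S), set(I)
    if not I < S:
        raise ValueError("I must be a proper subset of S")
    block = len(S) - len(I)                        # k - m nodes taken outside S
    outside = [v for v in range(N) if v not in S]  # the N - k nodes not in S
    M_m = len(outside) // block                    # floor((N - k) / (k - m))
    return [I | set(outside[b * block:(b + 1) * block]) for b in range(M_m)]

# Illustrative values only: N = 20 nodes, planted set of size k = 4, intersection of size m = 2.
group = build_group(N=20, S={0, 1, 2, 3}, I={0, 1})
```

Because two members of the group share only the nodes in $I$, they share no edge outside $I$, which is what makes the reduced weights $W_{REM}$ independent. For $m=k-1$ the helper gives `ell(k, k - 1) == k - 1`, matching $\ell_{k-1}=k-1$ in the upper bound.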

Interestingly, the exact recovery results given in [17] for the homogeneous 2-WSBM with exactly equally sized communities are based on the $\frac{1}{2}$-Rényi divergence between the pdfs of the inter-cluster and intra-cluster weights, which for the Gaussian case is an explicit function of the SNR. Hence, although the two models considered in this paper and in [17] are distinct, the same parameter regulates their recoverability conditions. Notice also that our scenario is complementary to that given in [17] since the equally sized community constraint is equivalent to $k=N$, hence $\alpha=1$.

From Theorem 5 we can see that the recoverability threshold decreases with $k$ as $\sqrt{\frac{1}{k-1}}$ up to $k$ of the order of $\log N$, so that it is easier (we can afford a lower SNR) to detect larger communities. This result is intuitive and in line with what has been found already for the planted-clique problem [8], where planted cliques can be recovered only in the regime $|S|=k\geq 2\log_{2}N$. For larger $k$ the threshold decreases with $k$ more slowly, only through the parameter $\alpha=\frac{\log k}{\log N}$.

Conjecture 1

In the simple case in which $k=2$ ( $\alpha=0$ ), the 2-WSBM is exactly equivalent to the 1-P-REM with number of states $M=\binom{N}{2}\approx N^{2}$ , hence it has a critical $\gamma_{c}=\sqrt{2}$ . This justifies the conjecture, left for future development, that the critical gamma for this 2-WSBM with $k=o(\log N)$ is $\gamma_{c}=\sqrt{\frac{2}{k-1}}$ .

IV The 2-WSBM on Hypergraphs

In this section, we consider the generalization of the 2-WSBM to the complete $h$ -uniform hypergraph. We consider hyperedges of uniform edge-cardinality $h$ . Formally:

Definition 5 (2-hWSBM)

Let $N$, $k$, and $h$ be positive integers, $\mu$ and $\sigma$ two positive constants. The couple $(S,E)$ is drawn under 2-hWSBM $(\mu,\sigma,N,k,h)$ if the weights $E_{i_{1}i_{2}\cdots i_{h}}$ are random variables, conditionally independent given $S$ and normally distributed as:

$$E_{i_{1}i_{2}\dots i_{h}}\sim\begin{cases}\mathcal{N}(\mu,\sigma^{2}) & i_{1},i_{2},\dots,i_{h}\in S\\ \mathcal{N}(0,\sigma^{2}) & \text{otherwise}\end{cases}$$

where $S\subset\mathcal{V}$, $|S|=k$, is drawn uniformly at random.

The corresponding exact recovery problem is defined as for the 2-WSBM (Definition 4). Also, Theorem 4 extends easily to this model, hence the ML estimator yields the densest $k$-sub-hypergraph. The recoverability conditions for generic $h$ are given by the following:

Theorem 6

The 2-hWSBM, with $2\leq h\leq k$ and with parameters $(\hat{\mu}\log N,\hat{\sigma}\sqrt{\log N/2},N,k,h)$, is unsolvable if $\gamma<\gamma_{-}=\sqrt{\frac{1}{\binom{k-1}{h-1}}}$ and solvable if $\gamma>\gamma_{+}$, where the upper threshold is defined according to the $k$ and $h$ regimes as:

$$\gamma_{+}=\begin{cases}2\sqrt{\frac{h/2}{\binom{k-1}{h-1}}} & \frac{1}{h}\binom{k-1}{h-1}=o(\log N)\\ 2\sqrt{\frac{1+\log 2+\frac{1}{c}}{\log N}} & \frac{1}{h}\binom{k-1}{h-1}/\log N\to c,\ c\in\mathbb{R}^{+}\cup\{+\infty\}\\ 2\sqrt{\frac{1+\log 2}{(1-\alpha)\log N}} & k\lessapprox N^{\alpha},\ 0<\alpha<1\end{cases}$$

Proof:

See appendix C . ∎

Note that for $h=2$ this yields exactly the results in Theorem 5, and also that the conjecture extends easily to the case $k=h$, for which we recover the P-REM with number of states $M=\binom{N}{k}\approx N^{k}$. The model therefore has an exact threshold $\gamma_{c}=\sqrt{k}$, hence we conjecture that the threshold for the general problem on $h$-hypergraphs reads $\gamma_{c}=\sqrt{\frac{h}{\binom{k-1}{h-1}}}$. The other side of the $h$ range is defined by $h=1$ (not included in Theorem 6), for which it is easy to show that the corresponding 2-WSBM is equivalent to a $k$-P-REM with number of states $M=N$. Hence the thresholds are given by Theorem 3. A summary of all the results of the paper for the regime $k=o(\log N)$ is given in Table I. Note that also in this case the previous results found in [21] on the 2-hWSBM are orthogonal to ours, since those are given for equally sized communities and homogeneous models.
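The entries of Table I for $2\leq h<k$ can be evaluated with a small helper. This is a sketch for orientation only: the function names are assumptions of this example, the formulas are those of the first regime of Theorems 5 and 6, and the value returned by `conjectured_gamma_c` is the conjectured threshold stated above, not a proven one.

```python
import math

def thresholds_small_k(k, h):
    """(gamma_minus, gamma_plus) of Table I for 2 <= h < k in the regime k = o(log N)."""
    if not 2 <= h < k:
        raise ValueError("this helper covers only the rows 2 <= h < k of Table I")
    c = math.comb(k - 1, h - 1)
    gamma_minus = math.sqrt(1.0 / c)            # below this SNR recovery is unsolvable
    gamma_plus = 2.0 * math.sqrt((h / 2.0) / c)  # above this SNR recovery is solvable
    return gamma_minus, gamma_plus

def conjectured_gamma_c(k, h):
    """Conjectured exact threshold sqrt(h / binom(k-1, h-1)); unproven in the paper."""
    return math.sqrt(h / math.comb(k - 1, h - 1))

# Illustrative values: a graph (h = 2) with a planted community of k = 5 nodes.
g_minus, g_plus = thresholds_small_k(k=5, h=2)
g_conj = conjectured_gamma_c(k=5, h=2)
```

By construction the conjectured value lies between the two proven bounds, since $\gamma_{-}^{2}=1/\binom{k-1}{h-1}$, $\gamma_{c}^{2}=h/\binom{k-1}{h-1}$ and $\gamma_{+}^{2}=2h/\binom{k-1}{h-1}$.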

V Conclusion and future work

Information-theoretic fundamental limits have been a fruitful approach to establish benchmarks for algorithm performance. A important example is given by Shannon’s coding theorem that gives a recoverability threshold for coding algorithms, located at the channel capacity, and that gave rise to a plethora of coding algorithms in the last decades. Recently the same approach has been carried out in the area of clustering and community detection in (hyper)graphs. Motivated by this we established a new toy model for generative model called (k)-planted REM giving a description of the maximum likelihood failing probability in the asymptotic limit and consequently a recoverability threshold. We embedded this model in the well know (weighted) Stochastic Block model framework for community detection in h-uniform hypergraphs. These models correspond to the planted REM for both sides of the $h$ spectrum, $h=k$ and $h=1$ . For all these models we provided the first recoverability conditions. As future research directions, we plan to provide matching thresholds for the 2-WSBM on $h$ -uniform hypergraphs for any given $h$ and to account for natural extensions, like multi-community and non Gaussian probability distributions.

Acknowledgment

We thank Wojciech Szpankowski for spotting an error in an early version of this work, and Alexey Gronskiy and Andreas Krause for valuable comments.

Appendix A Proofs for section II

Proof:

[TABLE]

hence the algorithm that minimizes the error is the one that maximizes $p(\mathbf{E}|S)\mathbbm{1}[\hat{S}(\mathbf{E})=S]$ for every set $S$ and every configuration $\mathbf{E}$.

[TABLE]

∎

Proof:

Let us first observe the behavior of $\phi(x)$ for large $x$:

[TABLE]

hence as a function of $\epsilon$ we have:

[TABLE]

where the asymptotic notation refers to $M\to+\infty$, here and in the rest of this proof. For $\epsilon>0$ we then have:

[TABLE]

then

[TABLE]

For this asymptotic expression to hold we need $\log M(\epsilon^{2}-1)\to+\infty$, hence it holds for any $\epsilon\geq 1+\tau_{M}$ for which $\log M\tau_{M}\to+\infty$. Notice also that this asymptotic expression is equivalent to the one given by the Fisher–Tippett–Gnedenko theorem for Gaussian random variables [23] in the regime $\epsilon=1+\tau_{M}$ with $\tau_{M}=o(1)$ and $\log M\tau_{M}\to+\infty$.
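The role of the condition $\epsilon\geq 1+\tau_{M}$ can be checked numerically. The sketch below assumes that $\phi$ denotes the standard Gaussian cumulative distribution function, so that $\phi(\sqrt{2\log M}\,\epsilon)^{M}$ is the probability that the maximum of $M$ independent standard Gaussians stays below $\sqrt{2\log M}\,\epsilon$. The helper name `max_cdf` is hypothetical.

```python
from math import erfc, exp, log, log1p, sqrt


def max_cdf(eps: float, M: int) -> float:
    """Phi(sqrt(2 log M) * eps)^M for eps > 0, with Phi the N(0,1) cdf."""
    x = sqrt(2.0 * log(M)) * eps
    tail = 0.5 * erfc(x / sqrt(2.0))   # 1 - Phi(x), stable for large x
    return exp(M * log1p(-tail))       # Phi(x)^M without cancellation


# Below eps = 1 the value collapses to 0, above it approaches 1 as M grows.
for M in (10**3, 10**6, 10**9):
    print(M, [round(max_cdf(e, M), 4) for e in (0.5, 0.9, 1.1, 1.5)])
```

The transition around $\epsilon=1$ sharpens only slowly in $M$, which is why the proof needs $\tau_{M}$ to vanish more slowly than $1/\log M$.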

We can now write the probability of success splitting the integral in eq. 3 into the three intervals $(-\infty,0)$ , $(0,1+\tau_{M})$ and $(1+\tau_{M},+\infty)$ where $\tau_{M}$ is defined as above.

[TABLE]

Let us first consider the third term (eq. 8), which, using the asymptotic expression for $\epsilon\geq 1+\tau_{M}$ derived above, reads:

[TABLE]

The term in eq. 17 has been studied before, and approaches [math] for $\gamma<1$ and instead for $\gamma>1$ reads:

[TABLE]

Now let us rewrite the term in eq. 20 as

[TABLE]

Now we can see that for $\gamma>2$ we can use the steepest descent method [22] to get:

[TABLE]

For $1<\gamma<2$ we can see instead that

[TABLE]

Hence the term in eq. 8 asymptotically reads:

[TABLE]

Now we show that the other two terms, eq. 6 and eq. 7, do not change the asymptotic behaviour of the recovery probability. To see this, note that $\phi\left(\sqrt{2\log M}\epsilon\right)^{M}$ is an increasing function of $\epsilon$, so the two terms can be bounded easily. The first term, eq. 6, reads:

[TABLE]

for any constant $\alpha>0$ , hence it is smaller than any contribution of term eq. 8 for $\gamma>1$ . For the second term, let us sum up eq. 7 and eq. 17 to get the upper bound

[TABLE]

where the second-to-last equality is given by eq. 4 and the last equality is given by $\tau_{M}=o(1)$. Analogously we can get the lower bound

[TABLE]

Summing up all the terms together we get:

[TABLE]

hence theorem 2 holds. ∎
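The threshold behaviour can also be observed in a small Monte Carlo experiment. The sketch below is illustrative only and rests on assumptions that are not quoted from Definition 1: the planted state has energy $\mathcal{N}(\mu,1)$ with $\mu=\gamma\sqrt{2\log M}$, the other $M-1$ states are i.i.d. $\mathcal{N}(0,1)$, and the ML estimator returns the state with the largest energy. The function name `prem_success_rate` is hypothetical.

```python
import random
from math import log, sqrt


def prem_success_rate(gamma: float, M: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the planted state has the largest energy
    (assumed parametrization: mu = gamma * sqrt(2 log M), unit variance)."""
    rng = random.Random(seed)
    mu = gamma * sqrt(2.0 * log(M))
    wins = 0
    for _ in range(trials):
        planted = rng.gauss(mu, 1.0)
        best_other = max(rng.gauss(0.0, 1.0) for _ in range(M - 1))
        wins += planted > best_other
    return wins / trials


# Success is rare well below gamma = 1 and almost certain well above it.
for gamma in (0.3, 1.0, 3.0):
    print(gamma, prem_success_rate(gamma, M=500, trials=200))
```

For moderate $M$ the transition at $\gamma=1$ is smooth rather than sharp, in line with the slow $\log M$ rates appearing in the proof.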

Proof:

Lower bound: Observe that exact recovery fails if and only if one biased weight is worse (smaller) than one of the $M+k$ unbiased weights. Then,

[TABLE]

where the inequality is simply the union bound and $i\in S$ . Now observe that the latter probability is just the probability of failing in a P-REM with $M$ states and thus the bounds in (2) for $1<\gamma<2$ yield

[TABLE]

since $\gamma>1+\sqrt{\alpha}$ implies $(\gamma-1)^{2}>\alpha$ . For $\gamma>2$ , the corresponding bound in (2) yields $k\mathbb{P}\left[E_{i}<\max_{j\notin\hat{S}}E_{j}\right]\leq\frac{1}{M^{\frac{\gamma^{2}}{2}-2}}=o(1)$ .

Upper bound: For $\gamma<1$, the bounds in (2) give $\mathbb{P}\left[E_{i}<\max_{j\notin\hat{S}}E_{j}\right]=1-o(1)$. From the discussion above, we also have $\mathbb{P}[S\neq\hat{S}]\geq\mathbb{P}\left[E_{i}<\max_{j\notin\hat{S}}E_{j}\right]$, thus implying that the probability of failing tends to $1$. ∎

Theorem 7 (Fisher–Tippett–Gnedenko theorem [23])

Let $E_{1},E_{2},\dots E_{n}$ be a sequence of independent and identically-distributed Gaussian random variables, distributed as $E_{i}\sim\mathcal{N}(0,1)$ , and $M_{n}=\max\{E_{1},\ldots,E_{n}\}$ . For the pairs of real numbers

[TABLE]

the normalized cumulative distribution function converges as

[TABLE]

where $F(x)=e^{-e^{-x}}$ is the Gumbel cdf.
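The convergence in Theorem 7 can be inspected numerically. Since the normalizing pair is not reproduced here, the sketch below assumes the standard textbook choice $a_{n}=\sqrt{2\log n}-\frac{\log\log n+\log 4\pi}{2\sqrt{2\log n}}$ and $b_{n}=1/\sqrt{2\log n}$, which may differ from the constants used in the paper by lower-order terms. The names `normalized_max_cdf` and `gumbel_cdf` are hypothetical.

```python
from math import erfc, exp, log, log1p, pi, sqrt


def normalized_max_cdf(x: float, n: int) -> float:
    """P[M_n <= a_n + b_n x] for the maximum of n i.i.d. N(0,1) variables,
    with the assumed textbook normalizing constants a_n, b_n."""
    root = sqrt(2.0 * log(n))
    a_n = root - (log(log(n)) + log(4.0 * pi)) / (2.0 * root)  # centering
    b_n = 1.0 / root                                           # scaling
    tail = 0.5 * erfc((a_n + b_n * x) / sqrt(2.0))             # 1 - Phi
    return exp(n * log1p(-tail))                               # Phi^n


def gumbel_cdf(x: float) -> float:
    """Gumbel cdf F(x) = exp(-exp(-x))."""
    return exp(-exp(-x))


# Compare the exact finite-n cdf with its Gumbel limit at a few points.
for x in (-1.0, 0.0, 2.0):
    print(x, round(normalized_max_cdf(x, 10**6), 4), round(gumbel_cdf(x), 4))
```

The gap closes only at a logarithmic rate in $n$, so visible discrepancies at moderate $n$ are expected.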

Appendix B Proofs for section III

Proof:

[TABLE]

hence the algorithm that minimizes the error is the one that maximizes $p(E|S)\mathbbm{1}[\hat{S}(E)=S]$ for every clique $S$.

[TABLE]

∎

Proof:

Lower bound: The failure event for the ML estimator reads:

[TABLE]

Let us now rewrite it by distinguishing the contributions of the non-planted solutions $\hat{S}$ according to their overlap $m=|\hat{S}\cap S|$, with $m$ ranging from $0$ to $k-1$.

[TABLE]

For every such $\hat{S}$ with intersection (common nodes) $I$ of size $m$ (there are $\binom{k}{m}\binom{N-k}{k-m}$ many), let us build a set of solutions $\mathcal{C}_{\hat{S}}$, called an independent coverage, with the following properties:

1. $\hat{S}\in\mathcal{C}_{\hat{S}}$.

2. Any two solutions $\hat{S}_{1}$ and $\hat{S}_{2}\in\mathcal{C}_{\hat{S}}$ share all and only the $\binom{m}{2}$ edges inside $I$.

Intuitively, the latter condition says that we can remove all the $\binom{m}{2}$ edges inside $I$ from our consideration, and restrict to the remaining edges which result in independent random variables. Consider

[TABLE]

for any solution $\hat{S}$ as above (including the planted solution $S$ ), and note that

[TABLE]

It is easy to build such an independent coverage grouping all nodes not in $S$ by $k-m$ . This gives a coverage with cardinality

[TABLE]

Note that this cardinality depends only on the size of the overlap $m$ and not on the actual solution $\hat{S}$. It is easy to see that the union over all the non-planted solutions with overlap $m$ can be written as the union, over a set of seed solutions $\mathcal{C}_{m}$, of the independent coverages of the seeds. Formally

[TABLE]
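The grouping construction of an independent coverage described above can be sketched as follows. This is an illustrative reading of "grouping all nodes not in $S$ by $k-m$", with nodes labelled $0,\dots,N-1$; the helper names `independent_coverage` and `edges` are hypothetical, and leftover nodes that do not fill a complete group are simply discarded.

```python
from itertools import combinations


def edges(nodes):
    """All unordered node pairs (edges) inside a set of nodes."""
    return set(combinations(sorted(nodes), 2))


def independent_coverage(N, S, S_hat):
    """Solutions that all contain I = S & S_hat and are otherwise disjoint.

    The first group is S_hat minus S, so S_hat itself is in the coverage;
    the remaining nodes outside S are split into groups of size k - m."""
    S, S_hat = set(S), set(S_hat)
    I = S & S_hat
    size = len(S_hat) - len(I)                      # k - m, at least 1
    free = [v for v in range(N) if v not in S and v not in S_hat]
    groups = [sorted(S_hat - S)]
    groups += [free[i:i + size] for i in range(0, len(free) - size + 1, size)]
    return [frozenset(I) | frozenset(g) for g in groups]


cov = independent_coverage(12, S={0, 1, 2, 3}, S_hat={0, 1, 4, 5})
shared = {frozenset(edges(a) & edges(b)) for a, b in combinations(cov, 2)}
print(len(cov), shared)  # every pair shares exactly the edges inside I
```

Because two distinct members intersect only in $I$, the edge weights outside $I$ are disjoint across members, which is what makes the associated random variables independent.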

Now we can rewrite the event in eq. 21 as

[TABLE]

We can now use the union bound to establish a lower bound on the probability of success of the ML estimator as:

[TABLE]

where

[TABLE]

depends only on the overlap $m$ and not on the solution $\hat{S}^{\prime}$ , and where we bounded the cardinality of $|\mathcal{C}_{m}|$ as

[TABLE]

We can now use the results on the P-REM to upper bound $\mathbb{P}_{e}^{m}$ . Observe that each $\mathcal{C}_{\hat{S}^{\prime}}$ corresponds to a P-REM with $1+M_{m}$ states and that using (22) we can rewrite $\mathbb{P}_{e}^{m}$ as

[TABLE]

where each quantity $W_{REM}(\cdot)$ is given by the sum of the $\ell_{m}$ edge weights (the edges not in the common intersection $I$ ). Therefore

[TABLE]

and these random variables are independent since, in $\mathcal{C}_{\hat{S}^{\prime}}$ , solutions do not share edges other than those in $I$ (which are removed in the definition of $W_{REM}(\cdot)$ ). Hence the probability $\mathbb{P}_{e}^{m}$ is equal to the probability of error in a P-REM with parameters $(\ell_{m}\hat{\mu}\log N,\sqrt{\ell_{m}}\hat{\sigma}\sqrt{\frac{\log N}{2}},M_{m})$ . Let us now compute the SNR for this particular P-REM:

[TABLE]

from which it follows that the SNR reads

[TABLE]

We shall prove that, for every P-REM corresponding to the overlap $m$ , it holds that

[TABLE]

where the asymptotic notation here refers to $N\to+\infty$. Since $k\leq N^{\alpha}$ and $m\leq k-1$,

[TABLE]

This implies together with the inequalities $\binom{k}{m}\leq 2^{k}$ and $\binom{N-k}{k-m}\leq e^{k-m}\left(\frac{N-k}{k-m}\right)^{k-m}\leq e^{k}(N-k)^{k-m}$ :

[TABLE]

where the last inequality follows from $k\leq N^{\alpha}$. Now we can use the asymptotics of the failure probability $\mathbb{P}_{e}^{m}$ given in (2). In particular, if $\gamma_{m}>2$, eq. 26 holds whenever

[TABLE]

and since $M_{m}\geq N^{1-\alpha}-2=N^{1-\alpha}(1+o(1))$ , it is sufficient that

[TABLE]

Since $\ell_{m}=\binom{k}{2}-\binom{m}{2}=\binom{k-m}{2}+m(k-m)$ we have

[TABLE]

Also note that since $\ell_{m}\geq\ell_{k-1}$ ,

[TABLE]

By plugging these inequalities into (27), and since $M_{m}\leq N$ , we get

[TABLE]

This implies that, for any $\gamma>2\sqrt{\frac{1}{1-\alpha}\left(\frac{1+\log 2}{\log N}+\frac{1+\frac{\alpha}{2}}{k-1}\right)}$ eq. 27 holds as desired.

To conclude the proof, we observe that indeed $\gamma_{m}>2$ : for $m\leq k-2$ we have $\gamma_{m}>\sqrt{2(k-m)}>2$ , while for $m=k-1$ we have $\ell_{k-1}=k-1$ hence it follows that $\gamma>2\sqrt{\frac{1}{k-1}}=2\sqrt{\frac{1}{\ell_{k-1}}}$ implies $\gamma_{k-1}>2\sqrt{\frac{\log N}{\log M_{k-1}}}>2$ since $M_{m}\leq N$ for all $m$ .

Upper bound: Consider $m=k-1$ and a corresponding $\mathcal{C}_{\hat{S}}$. Note that for this particular $m$ we have $\ell_{k-1}=k-1$, $M_{k-1}=N-k$, and $\gamma_{k-1}=\gamma\sqrt{\ell_{k-1}\frac{\log N}{\log(N-k)}}$. Hence for any constant $\gamma<\sqrt{\frac{1}{k-1}}$ and for $N$ large enough, it follows that $\gamma_{k-1}<1$ and thus, by (2), the probability of failing converges to 1:

[TABLE]

The theorem follows from the bound $\mathbb{P}_{e}\geq\mathbb{P}_{e}^{m}$ for any $m=0,\dots,k-1$ .
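The requirement "for $N$ large enough" in the upper bound can be made concrete by evaluating $\gamma_{k-1}=\gamma\sqrt{\ell_{k-1}\frac{\log N}{\log(N-k)}}$ with $\ell_{k-1}=k-1$ directly. The helper name `gamma_max_overlap` is hypothetical; the formula is the one stated in the upper-bound argument.

```python
from math import log, sqrt


def gamma_max_overlap(gamma: float, k: int, N: int) -> float:
    """Effective SNR gamma_{k-1} of the P-REM for overlap m = k - 1,
    using ell_{k-1} = k - 1 and M_{k-1} = N - k (requires N - k > 1)."""
    ell = k - 1
    M = N - k
    return gamma * sqrt(ell * log(N) / log(M))


# gamma = 0.49 is below sqrt(1/(k-1)) = 0.5 for k = 5, yet the effective
# SNR drops below 1 only once log N / log(N - k) is close enough to 1.
for N in (20, 100, 10**6):
    print(N, round(gamma_max_overlap(0.49, 5, N), 4))
```

The correction factor $\sqrt{\log N/\log(N-k)}$ exceeds one for every finite $N$, which is why the condition on $\gamma$ is strict and asymptotic.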

Comment: Following the proof, we can see that both the upper and the lower bound are determined by the worst-case scenario in the given regime. For the lower bound, the worst case is attained at the smallest SNR by solutions with no overlap with the planted solution ($m=0$), since those have the highest multiplicity $L_{m}\approx N^{k-m-1+\alpha}$. For the upper bound, the worst case is attained at the highest SNR by the solutions with the largest overlap ($m=k-1$), since those have the highest failure probability. The whole problem can in turn be seen as a classical entropy-energy trade-off, where $\gamma$ plays the role of the inverse temperature, the entropy is given by the multiplicity and the energy by the failure probability. ∎

Appendix C Proof for section IV

Proof:

Lower bound: Note that, also in this extension, every solution is fully specified by a subset of $k$ nodes. The proof thus follows the same steps as the case $h=2$ (Theorem 5), which leads to the sufficient condition for exact recovery (27) that reads

[TABLE]

The only difference is that the quantity $\ell_{m}$ reads

$$\ell_{m}=\binom{k}{h}-\binom{m}{h}.$$

In particular, we have

[TABLE]

where the inequality holds for every $m\leq k-1$ and can be proved as follows. We want to show that

[TABLE]

For $m\leq h-1$ this is trivially true because $\binom{m}{h}=0$ . Otherwise, for $m\geq h$ we can expand the binomial coefficients and write the inequality above as follows:

[TABLE]

that is

[TABLE]

which is equivalent to

[TABLE]

that is

[TABLE]

which holds since $m\leq k$ .

Upper bound: Also in this case, the proof is essentially the same as for $h=2$. Consider a maximal overlap $m=k-1$ and a corresponding $\mathcal{C}_{\hat{S}}$, which again gives $M_{m}=N-k$. Note that now $\ell_{k-1}=\binom{k-1}{h-1}$, since every hyperedge of $\hat{S}$ not contained in $S$ consists of the single node in $\hat{S}\setminus S$ and any $h-1$ nodes in the common intersection between $S$ and $\hat{S}$ (which consists of $m=k-1$ nodes). Thus, for any constant $\gamma<\sqrt{\frac{1}{\binom{k-1}{h-1}}}$ and $N$ large enough, we have $\gamma_{k-1}=\gamma\sqrt{\ell_{k-1}\frac{\log N}{\log(N-k)}}<1$. Hence, the probability of failing converges to 1:

[TABLE]

The theorem follows from the bound $\mathbb{P}_{e}\geq\mathbb{P}_{e}^{m}$ for any $m=0,\dots,k-1$ . ∎
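The edge counts used in the two proofs can be cross-checked with a few lines of code. The sketch assumes that $\ell_{m}$ counts the hyperedges of a solution that are not contained in the intersection $I$, that is $\ell_{m}=\binom{k}{h}-\binom{m}{h}$, which is the natural generalization of the $h=2$ expression $\binom{k}{2}-\binom{m}{2}$. The helper name `ell` is hypothetical.

```python
from math import comb


def ell(k: int, m: int, h: int) -> int:
    """Number of h-subsets of a k-node solution not contained in the
    m-node intersection I (assumed definition of ell_m)."""
    return comb(k, h) - comb(m, h)


k = 7
# h = 2: matches the decomposition C(k-m, 2) + m (k - m) from Appendix B.
print(all(ell(k, m, 2) == comb(k - m, 2) + m * (k - m) for m in range(k)))
# m = k - 1: Pascal's rule gives ell_{k-1} = C(k-1, h-1) for every h.
print(all(ell(k, k - 1, h) == comb(k - 1, h - 1) for h in range(2, k + 1)))
```

The second check is Pascal's rule, $\binom{k}{h}-\binom{k-1}{h}=\binom{k-1}{h-1}$, which is the count used in the upper bound.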

Bibliography

[1] M. Mezard and A. Montanari, Information, physics, and computation. Oxford University Press, 2009.
[2] A. Gronskiy, J. M. Buhmann, and W. Szpankowski, “Free energy asymptotics for problems with weak solution dependencies,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 2132–2136.
[3] J. M. Buhmann, J. Dumazert, A. Gronskiy, and W. Szpankowski, “Phase transitions in parameter rich optimization problems,” in 2017 Proceedings of the Fourteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). SIAM, 2017, pp. 148–155.
[4] J. M. Buhmann, A. Gronskiy, and W. Szpankowski, “Free energy rates for a class of very noisy optimization problems,” Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, p. 61, 2014.
[5] B. Derrida, “Random-energy model: An exactly solvable model of disordered systems,” Physical Review B, vol. 24, no. 5, p. 2613, 1981.
[6] ——, “Random-energy model: Limit of a family of disordered models,” Physical Review Letters, vol. 45, no. 2, p. 79, 1980.
[7] M. Mézard, G. Parisi, and M. Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications. World Scientific Publishing Company, 1987, vol. 9.
[8] J. Steinhardt, “Does robustness imply tractability? A lower bound for planted clique in the semi-random model,” arXiv preprint arXiv:1704.05120, 2017.
