Correlation-Based Community Detection

Zheng Chen; Zengyou He; Hao Liang; Can Zhao; Yan Liu

arXiv:1904.04583·cs.SI·April 10, 2019


TL;DR

This paper introduces correlation-based evaluation functions for community detection that address the resolution limit problem and improve detection accuracy, supported by a new algorithm called CBCD.

Contribution

The paper presents a novel correlation analysis framework for community detection, unifying existing functions and proposing the CBCD algorithm that outperforms current methods.

Findings

01

CBCD outperforms existing algorithms on benchmark networks.

02

Correlation functions mitigate the resolution limit problem.

03

The framework unifies and improves community evaluation metrics.


Tables

Table 1. TABLE I: A $2\times 2$ contingency table for vertex $u$ and community $S$.

	$𝒞 (x, u) = 1$	$𝒞 (x, u) = 0$	$\sum_{j} f_{i j}$
$𝒢 (x, S \ {u}) = 1$	$ω$	$f_{10}$	$ϵ$
$𝒢 (x, S \ {u}) = 0$	$f_{01}$	$f_{00}$	$f_{01} + f_{00}$
$\sum_{i} f_{i j}$	$d_{u}$	$f_{10} + f_{00}$	$N$
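
As a minimal sketch of how the cells of Table 1 could be counted on a small graph, the Python below makes several assumptions that are not stated in the table itself: $x$ ranges over all vertices (so $N$ is the number of vertices and $u$ itself falls into the $f_{00}$ cell), the graph is stored as a dict of neighbor sets, and `contingency` and `phi` are hypothetical helper names. The $\phi$ value follows the expression $(\omega N-\epsilon d_{u})/\sqrt{\epsilon(N-\epsilon)d_{u}(N-d_{u})}$ that appears inside the expectation in the equation list; returning 0 for a zero denominator is an assumed convention.

```python
import math

def contingency(u, S, adj):
    """Return (omega, f10, f01, f00) for vertex u and community S.

    adj maps each vertex to the set of its neighbors (undirected, no self-loops).
    Assumption: x ranges over all vertices, so N = len(adj).
    """
    others = set(S) - {u}           # S \ {u}
    neighbors = adj[u]
    N = len(adj)
    omega = len(neighbors & others)  # linked to u and inside S \ {u}
    f10 = len(others - neighbors)    # inside S \ {u} but not linked to u
    f01 = len(neighbors - others)    # linked to u but outside S \ {u}
    f00 = N - omega - f10 - f01      # everything else (includes u itself)
    return omega, f10, f01, f00

def phi(u, S, adj):
    """phi-coefficient between u and S computed from the 2x2 table."""
    omega, f10, f01, f00 = contingency(u, S, adj)
    N = omega + f10 + f01 + f00
    eps = omega + f10                # row total: |S \ {u}|
    d_u = omega + f01                # column total: degree of u
    denom = math.sqrt(eps * (N - eps) * d_u * (N - d_u))
    if denom == 0:
        return 0.0                   # assumed convention for degenerate tables
    return (omega * N - eps * d_u) / denom

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(contingency(0, {0, 1, 2}, adj))
print(phi(0, {0, 1, 2}, adj), phi(2, {0, 1, 2}, adj))
```

In this toy graph, vertex 0 links to exactly the other members of its triangle, while vertex 2 has one extra link leaving the triangle, which lowers its value.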

Table 2. TABLE II: The characteristics of real-world network data sets.

Data Sets	$\| V \|$	$\| E \|$	$⟨ d ⟩$	$d_{m a x}$	$\| C \|$
football	$115$	$613$	$10.57$	$12$	$12$
karate	$34$	$78$	$4.59$	$17$	$2$
personal	$561$	$8375$	$29.91$	$166$	$8$
polbooks	$105$	$441$	$8.4$	$25$	$3$
polblogs	$1490$	$19090$	$27.32$	$351$	$2$

Table 3. TABLE III: The performance of different algorithms on the real-world network.

	NMI					NC
	football	karate	personal	polbooks	polblogs	football	karate	personal	polbooks	polblogs
CBCD	$0.734$	0.837	0.3639	0.330	0.391	$9$	$2$	$11$	$4$	$6$
Infomap	$0.833$	$0.563$	$0.248$	$0.293$	$0.268$	$12$	$3$	$6$	$5$	$303$
Louvain	$0.838$	$0.259$	$0.08$	$0.158$	$0.212$	$12$	$7$	$17$	$10$	$32$
SCD	0.840	$0.395$	$0.179$	$0.116$	$0.085$	14	8	125	23	664
Focs	$0.392$	$0.189$	$0.171$	$0.166$	$0.106$	5	1	13	9	19
Attractor	$0.833$	$0.04$	$0.299$	$0.315$	$0.124$	12	1	56	7	313

Table 4. TABLE IV: The performance comparison of different algorithms on large-scale real networks with overlapping ground-truth communities

Data sets	Algorithm	ONMI	NC	ET
DBLP	CBCD	$0.132$	40k	48s
	Infomap	$0.008$	109k	120s
	Louvain	$0.124$	170k	13s
	SCD	$0.146$	140k	15s
	Focs	0.213	24k	7s
	Attractor	$0.061$	17k	43min
Amazon	CBCD	0.246	40k	42s
	Infomap	$0.057$	213k	132s
	Louvain	$0.154$	266k	16s
	SCD	$0.158$	141k	6s
	Focs	$0.207$	20k	5s
	Attractor	$0.201$	23k	22min

Equations

Q = \frac{1}{2m} \sum_{ij} (W_{ij} - E_{ij}) \delta(C_{i}, C_{j}),

Q=\sum_{j}\Big{[}\frac{l_{j}}{m}-\Big{(}\frac{K_{j}}{2m}\Big{)}^{2}\Big{]},

F(S) = \sum_{u \in S} f(u, S).

\Gamma(P) = \sum_{j=1}^{l} F(S_{j}).

P (G) = \frac{4}{6}, P (C) = \frac{5}{6}, P (C G) = \frac{3}{6} .

confidence (C \Rightarrow G) = P (G ∣ C) = \frac{P ( G C )}{P ( C )} = \frac{ω}{d},

confidence (G \Rightarrow C) = P (C ∣ G) = \frac{P ( G C )}{P ( G )} = \frac{ω}{ϵ},

PS(u, S) = P(CG) - P(C)P(G) = \frac{\omega}{N} - \frac{\epsilon d_{u}}{N^{2}}.

\begin{split}Q_{S}&=\frac{l_{S}}{m}-\Big{(}\frac{K_{S}}{2m}\Big{)}^{2}=\frac{l_{S}}{m}-\frac{\sum_{j\in S}d_{j}}{2m}\cdot\frac{\sum_{i\in S}d_{i}}{2m},\end{split}

\frac{\partial J ( ω , ϵ , d _{u} )}{\partial ω} = \frac{1}{N} > 0, \frac{\partial J ( ω , ϵ , d _{u} )}{\partial ϵ} = \frac{- d _{u}}{N ^{2}} < 0, \frac{\partial J ( ω , ϵ , d _{u} )}{\partial d _{u}} = \frac{- ϵ}{N ^{2}} < 0.

X_{v} = \begin{cases} 1 & \text{if } v \text{ has a link with } u, \\ 0 & \text{otherwise}. \end{cases}

\mathbb{E}[PS(u, S)] = \mathbb{E}\Big[\frac{\omega}{N} - \frac{\epsilon d_{u}}{N^{2}}\Big] = \mathbb{E}\Big[\frac{\omega}{N}\Big] - \mathbb{E}\Big[\frac{\epsilon d_{u}}{N^{2}}\Big] = \frac{1}{N}\mathbb{E}\Big[\sum_{v \in S} X_{v}\Big] - \frac{\epsilon}{N^{2}}\mathbb{E}\Big[\sum_{i=1}^{n} X_{i}\Big] = \frac{1}{N}\sum_{v \in S}\mathbb{E}[X_{v}] - \frac{\epsilon}{N^{2}}\sum_{i=1}^{n}\mathbb{E}[X_{i}] = \frac{\epsilon p}{N} - \frac{\epsilon}{N^{2}} \cdot Np = 0.

\begin{split}&\mathbb{P}\Big{[}\big{|}PS(u,S)\big{|}\geq\sqrt{\kappa}\Big{]}\\ &\leq\frac{1}{\kappa}\Big{(}\frac{p^{2}\epsilon^{2}-p^{2}\epsilon+p\epsilon}{N^{2}}+\frac{\lambda\epsilon^{2}(\lambda-p+1)}{N^{4}}-\frac{2p\epsilon^{2}(\lambda+1)}{N^{3}}\Big{)}.\end{split}

d_{u} = \sum_{i=1}^{n} X_{i}, \quad \omega = \sum_{v \in S} X_{v}.

\begin{split}\mathbb{E}[Y^{2}]&=\mathbb{E}[\Big{(}\frac{\omega}{N}-\frac{\epsilon d_{u}}{N^{2}}\Big{)}^{2}]=\mathbb{E}[\frac{\omega^{2}}{N^{2}}-\frac{2\epsilon d_{u}\omega}{N^{3}}+\frac{\epsilon^{2}d_{u}^{2}}{N^{4}}]\\ &=\frac{1}{N^{2}}\mathbb{E}[\omega^{2}]-\frac{2\epsilon}{N^{3}}\mathbb{E}[d_{u}\omega]+\frac{\epsilon^{2}}{N^{4}}\mathbb{E}[{d_{u}}^{2}].\end{split}

\begin{split}\mathbb{E}[\omega^{2}]&=\mathbb{E}[\Big{(}\sum_{i\in S}X_{i}\Big{)}\Big{(}\sum_{j\in S}X_{j}\Big{)}]=\mathbb{E}[\sum_{i\in S}\sum_{j\in S}X_{i}X_{j}]\\ &=\sum_{i\in S}\sum_{j\in S}\mathbb{E}[X_{i}X_{j}]=\epsilon(\epsilon-1)p^{2}+p\epsilon,\\ \mathbb{E}[d_{u}\omega]&=\mathbb{E}[\Big{(}\sum_{i=1}^{n}X_{i}\Big{)}\Big{(}\sum_{j\in S}X_{j}\Big{)}]=\mathbb{E}[\sum_{i=1}^{n}\sum_{j\in S}X_{i}X_{j}]\\ &=\sum_{i=1}^{n}\sum_{j\in S}\mathbb{E}[X_{i}X_{j}]=\epsilon Np^{2}+p\epsilon=\epsilon\lambda p+p\epsilon,\\ \mathbb{E}[{d_{u}}^{2}]&=\mathbb{E}[\Big{(}\sum_{i=1}^{n}X_{i}\Big{)}\Big{(}\sum_{j=1}^{n}X_{j}\Big{)}]=\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbb{E}[X_{i}X_{j}]\\ &=N(N-1)p^{2}+Np=\lambda(\lambda-p+1).\end{split}

Var[Y] = \mathbb{E}[Y^{2}] - (\mathbb{E}[Y])^{2} = \mathbb{E}[Y^{2}] = \frac{\epsilon(\epsilon - 1)p^{2} + p\epsilon}{N^{2}} - \frac{2p\epsilon^{2}(\lambda + 1)}{N^{3}} + \frac{\lambda\epsilon^{2}(\lambda - p + 1)}{N^{4}}.

\begin{split}&\mathbb{P}\Big{[}\big{|}PS(u,S)\big{|}\geq\sqrt{\kappa}\Big{]}=\mathbb{P}\Big{[}\big{|}Y-\mathbb{E}[Y]\big{|}\geq\sqrt{\kappa}\Big{]}\leq\frac{Var[Y]}{\kappa}\\ &=\frac{1}{\kappa}\Big{(}\frac{\epsilon(\epsilon-1)p^{2}+p\epsilon}{N^{2}}-\frac{2p\epsilon^{2}(\lambda+1)}{N^{3}}+\frac{\lambda\epsilon^{2}(\lambda-p+1)}{N^{4}}\Big{)}.\end{split}

\lim_{N\to+\infty,p\to 0}\mathbb{P}\Big{[}\big{|}PS(u,S)\big{|}\geq\kappa\Big{]}=0.

\begin{split}&\mathbb{P}\Big{[}\big{|}PS(u,S)\big{|}\geq\kappa\Big{]}\\ &\leq\frac{1}{{\kappa}^{2}}\Big{(}\frac{\epsilon(\epsilon-1)p^{2}+p\epsilon}{N^{2}}-\frac{2p\epsilon^{2}(\lambda+1)}{N^{3}}+\frac{\lambda\epsilon^{2}(\lambda-p+1)}{N^{4}}\Big{)}\\ &\leq\frac{1}{{\kappa}^{2}}\Big{(}\frac{\epsilon(\epsilon-1)p^{2}+p\epsilon}{N^{2}}+\frac{\lambda\epsilon^{2}(\lambda-p+1)}{N^{4}}\Big{)}\\ &\leq\frac{1}{{\kappa}^{2}}\Big{(}\frac{(Np)^{2}+pN}{N^{2}}+\frac{\lambda(\lambda-p+1)}{N^{2}}\Big{)}\\ &=\frac{1}{{\kappa}^{2}}\Big{(}\frac{2(\lambda)^{2}+2\lambda-\lambda p}{N^{2}}\Big{)}.\end{split}

\begin{split}&\lim_{N\to+\infty,p\to 0}\mathbb{P}\Big{[}\big{|}PS(u,S)\big{|}\geq\kappa\Big{]}\\ &\leq\lim_{N\to+\infty,p\to 0}\frac{1}{{\kappa}^{2}}\Big{(}\frac{2(\lambda)^{2}+2\lambda-\lambda p}{N^{2}}\Big{)}=0.\end{split}

F(S) = \sum_{u \in S} PS(u, S) = \frac{2 l_{S}}{N} - \frac{\epsilon K_{S}}{N^{2}},

\Gamma(P)=\sum_{j=1}^{\mathcal{M}}F(S_{j})=\sum_{j=1}^{\mathcal{M}}\Big{(}\frac{2l_{S_{j}}}{N}-\frac{\epsilon K_{S_{j}}}{N^{2}}\Big{)}.

\phi(u, S) = \frac{\omega N - \epsilon d_{u}}{\sqrt{\epsilon(N - \epsilon) d_{u}(N - d_{u})}}

\begin{split}&\frac{\partial J(\omega,\epsilon,d_{u})}{\partial\omega}=\frac{N}{\sqrt{\epsilon(N-\epsilon)d_{u}(N-d_{u})}}>0\\ &\frac{\partial J(\omega,\epsilon,d_{u})}{\partial\epsilon}={\big{(}d_{u}(N-d_{u})\big{)}}^{-\frac{1}{2}}\cdot\frac{-N\big{[}\epsilon(d_{u}-2\omega)+\omega N\big{]}}{2{\big{(}\epsilon(N-\epsilon)\big{)}}^{\frac{3}{2}}}\\ &\frac{\partial J(\omega,\epsilon,d_{u})}{\partial d_{u}}={\big{(}\epsilon(N-\epsilon)\big{)}}^{-\frac{1}{2}}\cdot\frac{-N\big{[}d_{u}(\epsilon-2\omega)+\omega N\big{]}}{2{\big{(}d_{u}(N-d_{u})\big{)}}^{\frac{3}{2}}}.\end{split}

\begin{split}-N\big{[}d_{u}(\epsilon-2\omega)+\omega N\big{]}&=-N\big{[}\omega N-d_{u}(2\omega-\epsilon)\big{]}\\ &\leq-N\big{(}\omega N-d_{u}\omega\big{)}\\ &<-N\big{(}\omega N-N\omega\big{)}=0.\end{split}

Y = \frac{1}{\sqrt{d_{u}(N - d_{u})}}.

X_{v} = \begin{cases} 1 & \text{if } v \text{ has a link with } u, \\ 0 & \text{otherwise}. \end{cases}

\begin{split}\mathbb{E}[X_{v}Y]&=\mathbb{E}[Y|X_{v}=1]\cdot\mathbb{P}(X_{v}=1)+0\cdot\mathbb{P}(X_{v}=0)\\ &=p\cdot\mathbb{E}\Big{[}\frac{1}{\sqrt{d_{u}(N-d_{u})}}|X_{v}=1\Big{]}=\mathbb{E}[H_{v}]p,\end{split}

\begin{split}\mathbb{E}[\phi(u,S)]&=\mathbb{E}[\frac{\omega N-\epsilon d_{u}}{\sqrt{\epsilon(N-\epsilon)d_{u}(N-d_{u})}}]\\ &=\mathcal{K}\cdot\mathbb{E}[(\omega N-\epsilon d_{u})Y]\\ &=\mathcal{K}\cdot\mathbb{E}[(N\sum_{v\in S}X_{v}-\epsilon\sum_{i=1}^{n}X_{i}){Y}]\\ &=\mathcal{K}\cdot\Big{(}N\cdot\mathbb{E}[Y{\sum_{v\in S}X_{v}}]-\epsilon\cdot\mathbb{E}[Y{\sum_{i=1}^{n}X_{i}}]\Big{)}\\ &=\mathcal{K}\cdot\mathbb{E}[H_{V}]\cdot\Big{(}N\epsilon p-\epsilon Np\Big{)}=0.\end{split}


Taxonomy

Topics: Complex Network Analysis Techniques · Advanced Clustering Algorithms Research · Bioinformatics and Genomic Networks

Full text

Correlation-Based Community Detection

Zheng Chen, Zengyou He, Hao Liang, Can Zhao and Yan Liu

Z. Chen, H. Liang and Y. Liu are with the School of Software, Dalian University of Technology, Dalian, China. Z. He is with the School of Software, Dalian University of Technology, Dalian, China, and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, China.

E-mail: [email protected]

C. Zhao is with the Institute of Information Engineering, CAS.

Abstract

Mining community structures from complex networks is an important problem across a variety of fields. Many existing community detection methods detect communities by optimizing a community evaluation function. However, most of these functions take high values even on random graphs and may fail to detect small communities in large-scale networks (the so-called resolution limit problem). In this paper, we introduce two novel node-centric community evaluation functions by connecting correlation analysis with community detection. We further show that correlation analysis provides a novel theoretical framework that unifies some existing evaluation functions in the context of a correlation-based optimization problem. In this framework, we can mitigate the resolution limit problem and eliminate the influence of random fluctuations by selecting the right correlation function. Furthermore, we adapt three key properties used in association rule mining to the context of community detection to help us choose the appropriate correlation function. Based on the introduced correlation functions, we propose a community detection algorithm called CBCD. Our proposed algorithm outperforms existing state-of-the-art algorithms on both synthetic benchmark networks and real-world networks.

Index Terms:

Complex networks, community detection, correlation analysis, random graph, node-centric function.

1 Introduction

COMMUNITY detection plays a key role in network science, bioinformatics [1], sociological analysis [2] and data mining. It not only helps us identify network modules, but also offers insight into how the entire network is organized by local structures. The detected communities can be interpreted as the basic modules of various kinds of networks, e.g., social circles in social networks [3], protein complexes in protein interaction networks [4], or groups of organisms in food webs [5]. More generally, a widely accepted view of a community [6] is that it should be a set of vertices with more edges inside the community than edges linking its vertices with the rest of the graph.

Although community detection has been extensively investigated during the past decades, there is still no common agreement on a formal definition of what exactly a community is. Many existing community detection algorithms are based on the previously mentioned criterion (more internal connections than external connections), with proposed quality metrics that quantify how densely community nodes connect to internal nodes and how sparsely they connect to external nodes. For example, popular metrics such as betweenness [5], modularity [7], conductance [8], ratio cut [9], density [10] and normalized cut [11] are all based on this intuitive idea. Existing related algorithms first use these metrics to derive globally defined objective functions, and then maximize (or minimize) the objective function by partitioning the whole graph [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22].

The traditional view of community evaluation relies on counting edges in different ways. Edge counting alone is not sufficient to establish that an identified community structure is pronounced. In [23], Fortunato and Hric pointed out the surprising fact that an Erdős-Rényi (E-R) random graph [24] can exhibit modular structures. To illustrate this phenomenon, a $400\times 400$ matrix of an E-R random graph is shown in Fig. 1. As shown in Fig. 1(a), the matrix is evidently disordered and random, so no one would believe that community structures exist. However, once the entries of the matrix are rearranged by reordering the vertex indices, modular structures emerge along the diagonal of the matrix. Such structures are not real; they arise from random fluctuations in the network construction process [23]. As a consequence, many clustering algorithms whose metrics do not account for random fluctuations identify communities even in random networks. These observations suggest that community metrics should take into account how significant their results are.

Probably the most popular community quality metric is the modularity introduced by Newman and Girvan [7]. This metric is based on prior work on a measure of assortative mixing proposed in [25]. It evaluates the quality of a partition of the network. The global expression of the modularity function is:

$$Q=\frac{1}{2m}\sum_{ij}(W_{ij}-E_{ij})\,\delta(C_{i},C_{j}), \tag{1}$$

where $m$ is the number of edges, $W_{ij}$ is the $(i,j)$ element of the adjacency matrix $W$, $\delta(x,y)$ is the Kronecker delta function whose value is 1 if $x=y$ and 0 otherwise, $C_{i}$ and $C_{j}$ represent the community indices of $i$ and $j$, respectively, and $E_{ij}$ is the expected number of edges that connect vertex $i$ and vertex $j$ under the random graph null model. The null model specifies how to generate a random graph that preserves some characteristics of the original graph. The most commonly used null model is the configuration model [26] [27], where the degrees of all vertices are preserved in the random graph. Under the configuration model, one can derive that $E_{ij}=d_{i}d_{j}/2m$ ($d_{i}$ and $d_{j}$ are the degrees of $i$ and $j$, respectively). Now we can reformulate the modularity as:

$$Q=\sum_{j}\Big[\frac{l_{j}}{m}-\Big(\frac{K_{j}}{2m}\Big)^{2}\Big], \tag{2}$$

where $K_{j}$ represents the sum of the degrees of all vertices in community $j$ and $l_{j}$ is the number of internal edges within community $j$. Accordingly, we can use $l_{j}/m-(K_{j}/2m)^{2}$ as a metric for evaluating the quality of community $j$. The intuition behind this formula is that simply counting edges is not sufficient to determine a true community structure, so the expected number of edges under the null model should be incorporated as well. Unfortunately, the modularity maximum does not necessarily correspond to the most pronounced community structure. This is the well-known resolution limit problem from which modularity suffers, i.e., the modularity function may fail to detect modules that are smaller than a certain scale in large networks. Many techniques have been used to mitigate the resolution limit problem [28], such as the multi-resolution method. Note, however, that the multi-resolution method does not provide a reliable solution to the problem [29]. In addition, it is known that the modularity can take a high value even on an E-R random graph [30]. This seems counterintuitive, but it follows from the fact that modularity measures the distance between the real network structure and the “average” of random network structures. As a result, there is not sufficient information to determine the distribution of the modularity.
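
To make the per-community form $l_{j}/m-(K_{j}/2m)^{2}$ concrete, here is a minimal Python sketch that evaluates it on a toy graph. The function names `community_modularity` and `modularity`, the edge-list input, and the node-to-label dict are illustrative assumptions, not part of the paper; the graph is assumed to be simple and undirected with each edge listed once.

```python
from collections import defaultdict

def community_modularity(edges, partition):
    """Return {community: l_j/m - (K_j/(2m))**2} for an undirected simple graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    partition: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # l_j: edges with both endpoints in community j
    degree_sum = defaultdict(int)  # K_j: sum of degrees of the nodes in community j
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    return {c: internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
            for c in degree_sum}

def modularity(edges, partition):
    """Sum the per-community terms to obtain Q."""
    return sum(community_modularity(edges, partition).values())

# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(community_modularity(edges, partition))
print(modularity(edges, partition))
```

Here $m=7$, and each triangle has $l_{j}=3$ and $K_{j}=7$, so each community contributes $3/7-1/4$ and $Q=5/14$.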

In addition, a community is a local structure of the network, so we should examine it from a local view. The strength of connectedness between a node and a community captures a local feature of that community. Using these local features to construct a global evaluation function for a community helps us avoid missing local structural information. However, assessing the strength of connectedness between a node and a community is a challenging task, since there is no convincing definition of what the strength of connectedness is.

Our Contributions. In this paper, we propose a correlation-based community detection framework, in which the local connectedness strength between each node and a community is assessed through correlation analysis. More precisely, we represent the basic structural information of a node as the corresponding row vector of the adjacency matrix and encode the structural information of a community into a binary vector. To demonstrate the feasibility and advantages of this framework, we introduce two concrete node-centric community metrics based on the correlation between a node and a community: the PS-metric (a node version of modularity) and the $\phi$-Coefficient metric. Moreover, we provide a detailed theoretical analysis of our metrics, which shows that they are capable of mitigating the resolution limit problem and alleviating the fake community issue in E-R random graphs. Besides, we present a correlation-based community detection (CBCD) method which adopts the two introduced metrics to identify the community structure. Experimental results on both real networks and LFR networks show that CBCD outperforms state-of-the-art methods. The contributions of this paper are summarized as follows:

• To the best of our knowledge, we are the first to introduce correlation analysis into node-centric metrics, which calculate the correlation value between a node and a community. Besides, we slightly modify three key properties for association rule mining [31] to fit the context of community detection, which can guide us in selecting the right correlation function for community detection.

• We further investigate the relationship between correlation analysis and community detection. We find that correlation analysis can be viewed as a theoretical interpretation framework for community detection, which unifies some existing metrics and offers the potential to derive new and better community evaluation measures.

• The introduced PS-metric (a node version of modularity) is less affected by random fluctuations. We not only give a detailed theoretical proof, but also show empirical results that validate the effectiveness of this metric under the E-R model. Moreover, we prove that the $\phi$-Coefficient metric can mitigate the resolution limit problem and give an intuitive interpretation.

• Experimental results demonstrate that our method outperforms state-of-the-art methods on real networks and LFR networks.

The rest of this paper is organized as follows: Section 2 discusses the related work. Section 3 introduces several basic definitions of correlation analysis and proves some properties of the proposed metrics. Section 4 describes our corresponding community detection method. Section 5 presents the experimental results of CBCD along with the other methods. Section 6 concludes this paper.

2 Related Work

2.1 Correlation-Based Method

There are already some algorithms in the literature that utilize the different types of correlation information hidden behind the network structure to solve the community detection problem. Once the correlation definition is specified, community detection can be cast as an optimization problem by maximizing a correlation-based objective function. The existing community detection algorithms based on correlation can be categorized according to the different correlation definitions.

2.1.1 Node-Node Similarity

The pairwise similarity between two nodes can be viewed as a kind of correlation. The goal of community detection based on node-node similarity is to put the nodes which are close to each other into the same group. In [32], a random-walk based node similarity was proposed, which can be used in an agglomerative algorithm to efficiently detect the communities in the network. The method in [33] is based on the node similarity proposed in [34] to find communities in an iterative manner.

2.1.2 Node-Node Correlation

A covariance matrix of the network is derived from the incidence matrix in [35], which can be viewed as the unbiased version of the well-known modularity matrix. A correlation matrix is obtained through introducing the re-scaling transformation into the covariance matrix, which significantly outperforms the covariance matrix on the identification of communities. The algorithm in [36] constructs a node-node correlation matrix based on the Laplacian matrix so as to incorporate the feature of NMF (non-negative matrix factorization) method. In [37], MacMahon et al. introduce the appropriate correlation-based counterparts of the most popular community detection techniques via a consistent redefinition of null models based on random matrix theory. A correlation clustering [38] based community detection framework is proposed in [39], which unifies the modularity, normalized cut, sparsest cut, correlation clustering and cluster deletion by introducing a single resolution parameter $\lambda$ .

2.1.3 Edge-Community Correlation

In [40] [41], Duan et al. connect the modularity with correlation analysis by reformulating modularity’s objective function as a correlation function based on the probability that an edge falls into the community under the configuration model. Hence, it can be viewed as a method based on edge-community correlation information.

2.2 Node-Centric Method

Almost all classical community detection metrics implicitly assume that all nodes in a community are equally important. These metrics, viewing the community as a whole, focus only on the total number of edges within the community, the sum of the degrees of all nodes, or the size of the community. The connection strength between each vertex and the community is not involved in these metrics. This may lose important structural information, so that two distinct communities cannot be distinguished in certain circumstances. Node-centric methods take into account how densely each vertex connects to the community. This is consistent with the fact that a community is a local structure of the network.

Chakraborty et al. [42] [43] [44] propose a node-centric metric called permanence, which describes the membership strength of a node to a community. The central idea behind permanence is to consider two factors: (1) the distribution of the external edges rather than the number of all external connections, and (2) the strength of the internal connectivity instead of the number of all internal edges. The corresponding algorithm maximizes a permanence-based objective function. WCC is another node-centric metric, proposed in [45] [46], which considers the triangle as the basic structure instead of the edge or node. The WCC of a node consists of two parts: isolation and intra-connectivity. The proposed algorithm optimizes a WCC-based objective function. Focs [47] is a heuristic method that accounts for local connectedness. The local connectedness of a node to a community depends on two scores of the node with respect to the community: community connectedness and neighborhood connectedness. Focs mainly consists of two phases, a leave phase and an expand phase, which are based on community connectedness and neighborhood connectedness, respectively.

2.3 Summary

Our method differs from all of the existing methods above. Although the proposed metrics in our method are node-centric as well, they are derived from the perspective of correlation analysis. Unlike the existing correlation-based methods, our method uses the correlation between a node and a community based on a $2\times 2$ contingency table. Thus, our method can be viewed as a node-community correlation based method. Besides, we choose appropriate correlation functions to mitigate the resolution limit problem under the guidance of [48] [49]. Further analysis is given in Section 3.

3 Correlation Analysis In Network

In this section, we first formalize the community detection problem, and then introduce two correlation measures and extend them for community structure evaluation. Two new correlation metrics for community detection are proposed in this section. Next, we describe modularity from the perspective of correlation analysis and discuss its relation to our metrics. Finally, we discuss the desirable properties of a new correlation metric.

3.1 Preliminaries

Undirected Graph. Let $G=(V,E,W)$ be an undirected graph with $n$ nodes and $m$ edges, where $V$ is the node set, $E$ is the edge set, and $W$ is the adjacency matrix. If $(u,v)\in E$, then $W_{uv}=1$; otherwise $W_{uv}=0$. Since the graph is undirected, $W_{uv}=W_{vu}$, i.e., the adjacency matrix $W$ is symmetric.

Community Detection. Given a graph $G=(V,E,W)$, the goal of community detection is to partition the graph with $|V|=n$ vertices into $l$ pairwise disjoint groups $P=\{S_{1},\dots,S_{l}\}$, where $S_{1}\cup\dots\cup S_{l}=V$ and $S_{i}\cap S_{j}=\emptyset$ for any $i\neq j$.
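
The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vertices are labeled $0,\dots,n-1$, and `adjacency_matrix` and `is_partition` are hypothetical helper names introduced here.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix W of an undirected graph.

    Assumes vertices are labeled 0..n-1 and edges is a list of (u, v) pairs.
    """
    W = [[0] * n for _ in range(n)]
    for u, v in edges:
        W[u][v] = 1
        W[v][u] = 1  # undirected: W_uv = W_vu
    return W

def is_partition(nodes, groups):
    """Check that the groups are pairwise disjoint and that their union is V."""
    seen = set()
    for group in groups:
        group = set(group)
        if seen & group:  # overlap with an earlier group
            return False
        seen |= group
    return seen == set(nodes)

W = adjacency_matrix(3, [(0, 1), (1, 2)])
print(W)
print(is_partition(range(3), [{0, 1}, {2}]))
```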

To begin with, we define $f(u,S)$ as a measurement function that takes a vertex $u$ and a community $S$ containing $u$ as input and returns a real value indicating the connectivity of vertex $u$ with respect to community $S$ . The function $f(u,S)$ should satisfy certain basic requirements so that it can be used to evaluate the structure of community $S$ . First, $f(u,S)$ must be bounded to guarantee the convergence of community search algorithms. Second, the more strongly a vertex connects to a community, the higher $f(u,S)$ becomes. The detailed properties that a measurement function needs in order to satisfy the second requirement are discussed in Section 3.2. We can now define the vertex-centric metric of a community $S$ as the sum of $f(u,S)$ over all members $u$ of $S$ :

[TABLE]

Analogously to what we have done before, we define the objective function of a partition $P$ through taking the sum of the vertex-centric metric value of each community $S_{i}\in P$ :

[TABLE]

Given a graph $G=(V,E,W)$ , community detection can be cast as an optimization problem with (4) as the objective function. The partition $P$ is optimal when $\Gamma(P)$ achieves a maximum value, and we call this partition the optimal partition.
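The two-level summation described above (a per-node score summed over a community, then summed over the partition) can be sketched as follows. This is only an illustrative sketch: the names `community_metric`, `partition_objective` and `toy_f` are hypothetical, and `toy_f` is a placeholder score (the fraction of a node's neighbors inside the community), not one of the correlation measures introduced later.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical toy graph given as adjacency sets (not taken from the paper).
adj: Dict[int, Set[int]] = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def toy_f(u: int, S: Set[int]) -> float:
    """Placeholder for f(u, S): fraction of u's neighbors that lie in S."""
    return len(adj[u] & S) / len(adj[u])


def community_metric(f: Callable[[int, Set[int]], float], S: Set[int]) -> float:
    """Vertex-centric metric of a community: sum of f(u, S) over u in S."""
    return sum(f(u, S) for u in S)


def partition_objective(f: Callable[[int, Set[int]], float],
                        P: Iterable[Set[int]]) -> float:
    """Objective of a partition: sum of the community metric over all S in P."""
    return sum(community_metric(f, S) for S in P)


partition = [{1, 2, 3}, {4}]
gamma = partition_objective(toy_f, partition)
```

Any bounded $f(u,S)$ can be plugged into the same two functions, which is what makes the framework independent of the particular correlation measure.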

Now the key point lies in how to formulate a feasible $f(u,S)$ . A natural idea is to use the correlation function between $u$ and $S$ as $f(u,S)$ . The correlation function measures the correlation relationship between $u$ and $S$ , in which a higher correlation value indicates that there is a strong affinity between $u$ and $S$ . However, we need to choose an appropriate correlation model to represent the connectivity cohesion of vertex $u$ regarding community $S$ . Moreover, the information about the topology structure of a node and a community should also be considered in this model. In this paper, we convert the set of nodes which belong to a community into a binary vector to embody the basic structural information of the community. The positions of nodes in the graph are determined by their neighbourhoods, that is, if two nodes connect to the same set of neighbors, then the whole graph will retain the same structure after exchanging the positions of these two nodes. Therefore, the local structure of a node can be depicted by its row vector embedded in the adjacency matrix. In summary, we have the following definitions.

Community vector and Neighbor vector. The community vector of $u$ with respect to $S$ , $\Psi_{S\backslash\{u\}}=(e_{1},\dots,e_{u-1},e_{u+1},\dots,e_{n})$ , is a binary vector of length $n-1$ , where $e_{v}=1$ if vertex $v$ belongs to community $S$ and $e_{v}=0$ otherwise. $\Psi_{S\backslash\{u\}}[v]$ denotes the value of vector element $e_{v}$ . $\psi_{u}=(g_{1},\dots,g_{u-1},g_{u+1},\dots,g_{n})$ is called the neighbor vector of vertex $u$ , where $g_{v}=1$ if there exists an edge between $u$ and $v$ , and $g_{v}=0$ otherwise. $\psi_{u}[v]$ denotes the value of vector element $g_{v}$ . Note that vertex $u$ is excluded from both vectors; an explanation from the probabilistic perspective is given below.

The above two binary vectors can be viewed as samples of two binary variables $\mathcal{C}(x,u)$ and $\mathcal{G}(x,S\backslash\{u\})$ , where $\mathcal{C}(x,u)=\psi_{u}[x]$ and $\mathcal{G}(x,S\backslash\{u\})=\Psi_{S\backslash\{u\}}[x]$ . A $2\times 2$ contingency table for these two variables is given in Table 2. The entry $f_{11}$ is denoted by $\omega$ , which represents the number of connections from vertex $u$ into community $S$ . The entry $f_{11}+f_{10}$ is denoted by $\epsilon$ , where $\epsilon=|S|-1$ and $|S|$ is the size of community $S$ .

The entry $f_{11}+f_{01}$ is denoted by $d_{u}$ , where $d_{u}$ is the degree of vertex $u$ . The uppercase Roman letter $N$ is the length of the binary vector, where $N=n-1$ .

Now we can give several probability definitions that will be used in the correlation measure. Firstly, to simplify the notations, $\mathcal{C}$ is used to denote $\mathcal{C}(x,u)=1$ , and $\mathcal{G}$ is used to denote $\mathcal{G}(x,S\backslash\{u\})=1$ . Then, $P(\mathcal{C}\mathcal{G})=\frac{\omega}{N}$ is the probability that a randomly chosen vertex from the vertex set $V\backslash\{u\}$ connects to vertex $u$ and belongs to community $S$ simultaneously. $P(\mathcal{C})=\frac{d_{u}}{N}$ is the probability that a randomly selected vertex from $V\backslash\{u\}$ has a connection with $u$ . $P(\mathcal{G})=\frac{\epsilon}{N}$ is the probability that a randomly chosen vertex from $V\backslash\{u\}$ belongs to community $S$ . Under the assumption of independence, the probability of $\mathcal{C}\mathcal{G}$ can be calculated by $P(\mathcal{C}\mathcal{G})=P(\mathcal{C})P(\mathcal{G})={\epsilon d_{u}}/{N^{2}}$ .

The above definitions and notations are exemplified in Fig. 2, where there is a community $S$ of size 5. Here we let $u=5$ , $\mathcal{C}$ denotes $\mathcal{C}(x,5)=1$ and $\mathcal{G}$ denotes $\mathcal{G}(x,S\backslash\{5\})=1$ . We will calculate $P(\mathcal{G})$ , $P(\mathcal{C})$ and $P(\mathcal{C}\mathcal{G})$ . To begin with, we should know how to get two vectors $\Psi_{S\backslash\{5\}}$ and $\psi_{5}$ . In Fig. 2, since node 5 connects almost all the nodes except node 3, $\psi_{5}$ has only one zero entry $g_{3}$ . Since node 6 and node 7 are not included in community $S$ , their corresponding entries $e_{6}$ and $e_{7}$ are both zeros. After getting these two binary vectors, $\omega$ , $d_{5}$ , $N$ and $\epsilon$ can be calculated accordingly, that is, $\omega=3$ , $d_{5}=5$ , $N=6$ , $\epsilon=4$ . Then, we have:

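$$P(\mathcal{G})=\frac{\epsilon}{N}=\frac{4}{6},\qquad P(\mathcal{C})=\frac{d_{5}}{N}=\frac{5}{6},\qquad P(\mathcal{C}\mathcal{G})=\frac{\omega}{N}=\frac{3}{6}.$$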
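The quantities in this example can be recomputed directly from the two binary vectors. The sketch below assumes, based on the description of Fig. 2 in the text, that the graph has nodes 1 to 7, that $S=\{1,2,3,4,5\}$ and that node 5 is adjacent to every node except node 3; the variable names are illustrative only.

```python
# Assumed reading of Fig. 2 (inferred from the text, not from the figure itself).
nodes = [1, 2, 3, 4, 5, 6, 7]
S = {1, 2, 3, 4, 5}
u = 5
neighbors_of_u = {1, 2, 4, 6, 7}  # node 5 misses only node 3

# Both vectors are indexed by V \ {u}, so u itself is excluded.
others = [v for v in nodes if v != u]
neighbor_vec = [1 if v in neighbors_of_u else 0 for v in others]   # psi_u
community_vec = [1 if v in S else 0 for v in others]               # Psi_{S \ {u}}

N = len(others)                                                    # vector length n - 1
omega = sum(g * e for g, e in zip(neighbor_vec, community_vec))    # f_11
d_u = sum(neighbor_vec)                                            # f_11 + f_01
eps = sum(community_vec)                                           # f_11 + f_10

p_c, p_g, p_cg = d_u / N, eps / N, omega / N
```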

3.2 Correlation Measure

Many functions have been proposed in statistics, data mining and machine learning to measure the correlation between two binary vectors, and guidelines [48] [50] [49] are available for selecting the right measure for a given need. Most measures are functions of $P(\mathcal{G})$ , $P(\mathcal{C})$ and $P(\mathcal{C}\mathcal{G})$ . Here we list several properties that guide the choice of an appropriate measure for community detection. Let $M$ be a measure for correlation analysis between two variables $\mathcal{C}$ and $\mathcal{G}$ . In [31], Piatetsky-Shapiro proposed three key properties of a good correlation measure for association analysis:

P1: $M=0$ if $\mathcal{C}$ and $\mathcal{G}$ are statistically independent;

P2: $M$ monotonically increases with $P(\mathcal{C}\mathcal{G})$ when $P(\mathcal{C})$ and $P(\mathcal{G})$ are both fixed;

P3: $M$ monotonically decreases with $P(\mathcal{C})$ (or $P(\mathcal{G})$ ) when $P(\mathcal{C}\mathcal{G})$ and $P(\mathcal{G})$ (or $P(\mathcal{C}\mathcal{G})$ and $P(\mathcal{C})$ ) are fixed.

These three properties were formulated for association analysis, so they need further investigation in the context of community detection. P1 indicates that $M$ should measure the deviation from statistical independence: the higher $M$ is, the stronger the dependence between $\mathcal{C}$ and $\mathcal{G}$ . Node $u$ can be regarded as a member of community $S$ when $M>0$ . In this paper, we use the E-R model as the underlying random graph model to describe statistical independence. Given the graph $G$ with $n$ nodes and $m$ edges, a random graph under the E-R model is generated by forming an edge between any two nodes randomly and independently with probability $p=2m/(n(n-1))$ . We therefore restate P1 as follows: $M$ should be a measure whose expected value equals zero for any node $u$ and community $S$ under the E-R model.

According to the definitions introduced in Section 3.1, $P(\mathcal{C}\mathcal{G})$ , $P(\mathcal{C})$ and $P(\mathcal{G})$ are determined by $\omega$ , $d_{u}$ and $\epsilon$ , respectively, since $N$ is fixed. As a result, the monotonicity of $P(\mathcal{C}\mathcal{G})$ , $P(\mathcal{C})$ and $P(\mathcal{G})$ is determined by $\omega$ , $d_{u}$ and $\epsilon$ as well. P2 and P3 describe two features of the cohesion of $u$ with respect to community $S$ : intra-connection and isolation. The intra-connection of node $u$ with respect to community $S$ indicates how densely node $u$ connects to community $S$ , and it can be quantified by $\omega/\epsilon$ . This is exemplified in Fig. 3, where node 7 connects to four nodes of subgraph $S$ and node 6 connects to two nodes of subgraph $S$ . Although all of node 6's edges go into subgraph $S$ , node 7 has more links within subgraph $S$ than node 6. Node 7 is therefore the better candidate for membership in community $S$ . Note that an increase of $\omega$ leads to an increase of $\omega/\epsilon$ when $\epsilon$ and $d_{u}$ are both fixed. According to P2, $M$ should increase with $\omega/\epsilon$ , which means that the intra-connection between $u$ and $S$ is becoming stronger.

Similarly, for P3, $M$ should monotonically decrease as $\epsilon$ increases when $\omega$ and $d_{u}$ are both fixed, since $\omega/\epsilon$ decreases. On the other hand, exclusively maximizing the intra-connection leads to a biased result. In Fig. 4, node $u$ has connections with all three nodes on the left, but we cannot confidently infer that node $u$ and these three nodes should be put together to form a community. This is because the number of links connecting $u$ to the rest of the graph is twice its number of links with the three nodes on the left. A node belonging to a community should be strongly isolated from the rest of the graph. The isolation can be quantified by $\omega/d_{u}$ , where a higher $\omega/d_{u}$ indicates that node $u$ connects to external nodes sparsely. It is easy to see that an increase of $\omega$ leads to an increase of $\omega/d_{u}$ when $\epsilon$ and $d_{u}$ are both fixed. With regard to P2, $M$ should increase with $\omega/d_{u}$ , which means that the isolation of $u$ from the rest of the graph is becoming stronger. Similarly, for P3, $M$ should monotonically decrease as $d_{u}$ increases when $\omega$ and $\epsilon$ are both fixed, since $\omega/d_{u}$ decreases.

To further analyze the relationship between correlation measure and community detection, let us consider the confidence measure of $\mathcal{C}$ and $\mathcal{G}$ . We have:

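$$\textit{confidence}(\mathcal{C}\Rightarrow\mathcal{G})=\frac{P(\mathcal{C}\mathcal{G})}{P(\mathcal{C})}=\frac{\omega}{d_{u}},\qquad\textit{confidence}(\mathcal{G}\Rightarrow\mathcal{C})=\frac{P(\mathcal{C}\mathcal{G})}{P(\mathcal{G})}=\frac{\omega}{\epsilon},$$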

where $\textit{confidence}(\mathcal{C}\Rightarrow\mathcal{G})$ and $\textit{confidence}(\mathcal{G}\Rightarrow\mathcal{C})$ are defined as the neighborhood connectedness score (nb-score) of a node and the community connectedness score (com-score) of a node in the Focs algorithm [47], respectively. These two node-based measures are the theoretical basis of the Focs algorithm. This fact tells us that correlation analysis can be appropriately applied in community detection.

3.2.1 Piatetsky-Shapiro’s Rule-Interest

Piatetsky-Shapiro’s rule-interest [31] measures the difference between the true probability and the expected probability under the assumption of independence between $\mathcal{C}$ and $\mathcal{G}$ . Piatetsky-Shapiro’s rule-interest is appropriate for the task of community detection, and we will show that modularity can be explained under our framework when the correlation measure is Piatetsky-Shapiro’s rule-interest later on. Our measurement function based on Piatetsky-Shapiro’s rule-interest is calculated as:

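$$PS(u,S)=P(\mathcal{C}\mathcal{G})-P(\mathcal{C})P(\mathcal{G})=\frac{\omega}{N}-\frac{\epsilon d_{u}}{N^{2}}\qquad(5)$$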

Recalling the aforementioned variables in Table 2, the degree of vertex $u$ is $d_{u}$ , the size of set $S\backslash\{u\}$ is $\epsilon$ and the number of links between vertex $u$ and community $S$ is $\omega$ . If we randomly select a node from the node set $V\backslash\{u\}$ , the probability that the selected node simultaneously connects vertex $u$ and belongs to community $S$ is $\frac{\omega}{N}$ . Similarly, a randomly selected node from $V\backslash\{u\}$ connects vertex $u$ with the probability $\frac{d_{u}}{N}$ and belongs to community $S$ with the probability $\frac{\epsilon}{N}$ . In the case of random graph, the expected probability that the chosen node simultaneously connects vertex $u$ and belongs to community $S$ is $\frac{\epsilon}{N}\cdot\frac{d_{u}}{N}$ .
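As a minimal sketch, the rule-interest can be computed from the four counts alone, reading it as the observed probability $\frac{\omega}{N}$ minus the expected probability $\frac{\epsilon}{N}\cdot\frac{d_{u}}{N}$ described above. The function name `ps` is illustrative.

```python
def ps(omega: int, eps: int, d_u: int, N: int) -> float:
    """Rule-interest of node u w.r.t. community S: observed minus expected.

    omega: number of neighbors of u inside S
    eps:   |S| - 1
    d_u:   degree of u
    N:     n - 1, the length of the binary vectors
    """
    return omega / N - (eps * d_u) / N**2


# Counts from the Fig. 2 example (omega=3, eps=4, d_5=5, N=6).
ps_fig2 = ps(3, 4, 5, 6)

# One more internal link (omega=4) with eps and d_u unchanged raises the score.
ps_more_internal = ps(4, 4, 5, 6)
```

With the Fig. 2 counts the score is slightly negative, because node 5 is adjacent to five of the six other nodes and so its links into $S$ are no more frequent than independence would predict.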

The modularity function has a few variants, but they are all identical in nature. Without loss of generality, we will adopt the definition given in Equation (2) as our modularity function. Given a community $S$ , the partial modularity of $S$ is reformulated as:

$$Q_{S}=\frac{l_{S}}{m}-\left(\frac{\sum_{j\in S}d_{j}}{2m}\right)^{2}$$

where $m$ is the number of edges in graph $G$ and $l_{S}$ is the number of edges inside community $S$ . If we randomly choose an edge from the edge set $E$ , the probability that the chosen edge lies inside community $S$ is ${l_{S}}/{m}$ . Likewise, the probability that a randomly chosen edge has at least one end inside community $S$ is $\frac{\sum_{j\in S}d_{j}}{2m}$ . Then, the expected probability that the chosen edge lies inside community $S$ is $\frac{\sum_{j\in S}d_{j}}{2m}\cdot\frac{\sum_{i\in S}d_{i}}{2m}$ when $G$ is a random graph. Let $\mathcal{H}$ denote a binary random variable where $\mathcal{H}=1$ if the randomly chosen edge has at least one end inside community $S$ and $\mathcal{H}=0$ otherwise. Then, the partial modularity $Q_{S}$ can be rewritten as $P(\mathcal{H}\mathcal{H})-P(\mathcal{H})P(\mathcal{H})$ . The partial modularity thus shares the same idea as our measurement function in (5): both are concrete instances of Piatetsky-Shapiro’s rule-interest. However, the two are derived from different perspectives. The partial modularity is edge-centric and utilizes the relationship between edge and community. Our formulation in (5) is node-centric and utilizes the relationship between node and community. For simplicity, we use PS to denote our measurement function in (5), and we also call it “node modularity”.

Although the PS function can be used to evaluate the strength of the correlation between a node and a community, it is not yet clear whether it satisfies the three properties above. A thorough analysis is essential to confirm the utility of PS in the context of community detection. P2 and P3 are necessary properties of a good community correlation measure. Essentially, these two properties can be interpreted as “a higher value of $PS(u,S)$ indicates a stronger correlation between $u$ and $S$ ”. P1 concerns how far the correlation between $u$ and $S$ is from its value in a random graph. For the PS measure, we have the following two theorems, where Theorem 1 proves that P2 and P3 hold and Theorem 2 proves that P1 holds.

Theorem 1.

For fixed $N>0$ , we have $0<\epsilon<N$ , $0<d_{u}<N$ and $0<\omega\leq min(\epsilon,d_{u})$ . Let $PS(u,S)$ be the correlation value between $u$ and $S$ defined in (5), then 1) $PS(u,S)$ monotonically increases with the increment of $\omega$ when $\epsilon$ and $d_{u}$ are fixed, 2) $PS(u,S)$ monotonically decreases with the increment of $\epsilon$ when $\omega$ and $d_{u}$ are fixed, and 3) $PS(u,S)$ monotonically decreases with the increment of $d_{u}$ when $\epsilon$ and $\omega$ are fixed.

Proof.

Let $PS(u,S)=J(\omega,\epsilon,d_{u})$ . We will directly calculate the partial derivatives of $PS(u,S)$ regarding to $\epsilon$ , $\omega$ and $d_{u}$ .

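$$\frac{\partial J(\omega,\epsilon,d_{u})}{\partial\omega}=\frac{1}{N}>0,\qquad\frac{\partial J(\omega,\epsilon,d_{u})}{\partial\epsilon}=-\frac{d_{u}}{N^{2}}<0,\qquad\frac{\partial J(\omega,\epsilon,d_{u})}{\partial d_{u}}=-\frac{\epsilon}{N^{2}}<0.$$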

Overall, the monotonicity of $PS(u,S)$ is consistent with the monotonicity of $\omega$ , $\epsilon$ and $d_{u}$ respectively. ∎

Theorem 2.

Given the E-R model $G(n,m)$ with the probability $p=2m/(n(n-1))$ and let $PS(u,S)$ be the correlation value between $u$ and $S$ , then $\mathbb{E}[PS(u,S)]=0$ holds for any subgraph $S\in\mathcal{G}_{S}$ and node $u\in S$ , where $\mathcal{G}_{S}$ is the set of all the subgraphs of $G$ .

Proof.

Let $X_{v}$ be a random variable such that:

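$$X_{v}=\begin{cases}1,&\text{if there is an edge between }u\text{ and }v,\\0,&\text{otherwise.}\end{cases}$$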

Clearly, $\mathbb{E}[X_{v}]=1\cdot\mathbb{P}(X_{v}=1)+0\cdot\mathbb{P}(X_{v}=0)=p$ , $d_{u}=\sum_{v\in V\backslash\{u\}}X_{v}$ and $\omega=\sum_{v\in S\backslash\{u\}}X_{v}$ , so that $\mathbb{E}[d_{u}]=Np$ and $\mathbb{E}[\omega]=\epsilon p$ . By the linearity of expectation,

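$$\mathbb{E}[PS(u,S)]=\frac{\mathbb{E}[\omega]}{N}-\frac{\epsilon\,\mathbb{E}[d_{u}]}{N^{2}}=\frac{\epsilon p}{N}-\frac{\epsilon Np}{N^{2}}=0.$$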

∎
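The statement of Theorem 2 can be checked empirically with a small simulation. The sketch below is only an illustration under stated assumptions: each of the $N$ possible edges incident to $u$ is drawn independently with probability $p$, the first $\epsilon$ positions stand for the members of $S\backslash\{u\}$ , and the parameter values, seed and function name are arbitrary choices that do not come from the paper.

```python
import random


def simulate_mean_ps(N: int, eps: int, p: float, trials: int, seed: int = 0) -> float:
    """Average PS(u, S) over random neighborhoods of u drawn edge by edge."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One Bernoulli(p) indicator per node in V \ {u}.
        edges = [1 if rng.random() < p else 0 for _ in range(N)]
        omega = sum(edges[:eps])   # links from u into S \ {u}
        d_u = sum(edges)           # degree of u
        total += omega / N - eps * d_u / N**2
    return total / trials


mean_ps = simulate_mean_ps(N=50, eps=10, p=0.2, trials=20000)
```

The sample mean should lie close to zero, while individual PS values still fluctuate around it, which is why the variance is examined next.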

Theorem 2 shows that PS can be regarded as the distance between the correlation value of $u$ and $S$ in a real network and the average correlation value of $u$ and $S$ over all the random networks. However, this theoretic result ignores the variance of PS variable over all the random networks. If the distribution of PS values is not strongly peaked, it is very likely that most PS values will far exceed zero even in the random networks. Thus, we should investigate the variance of PS values as well. Then, we have the following Lemma.

Lemma 1.

Consider the E-R model $G(n,m)$ with probability $p=2m/(n(n-1))$ , and let $\lambda=Np$ be the expected degree under this model. Let $PS(u,S)$ be the correlation value between $u$ and $S$ , and let $\mathcal{G}_{S}$ be the set of all the subgraphs of $G$ . Then, $\forall\kappa>0$ , $\forall S\in\mathcal{G}_{S}$ and $\forall u\in S$ , the probability that $PS(u,S)$ is no less than $\sqrt{\kappa}$ is bounded above as follows:

[TABLE]

Proof.

Analogously to what we have done in the proof of Theorem 2, we will use the sum of random variable $X_{i}$ to denote $\omega$ and $d_{u}$ :

[TABLE]

Let $Y=PS(u,S)$ , the second moment of $Y$ can be obtained by the linearity of expectations:

[TABLE]

Calculating $\mathbb{E}[\omega^{2}]$ , $\mathbb{E}[d_{u}\omega]$ and $\mathbb{E}[{d_{u}}^{2}]$ independently, we have

[TABLE]

To obtain the variance of $Y$ , we first apply Theorem 2:

[TABLE]

Then, by the Chebyshev inequality, we finish the proof:

[TABLE]

∎

Most large-scale real networks appear to be sparse [51] [52] [53], in the sense that the number of edges is generally of order $n$ rather than $n^{2}$ [52]. In addition, sparse graphs are particularly sensitive to random fluctuations [23]. Therefore, we should concentrate on the behavior of PS on sparse graphs. With respect to sparsity under the E-R model, we have the following theorem.

Theorem 3.

Consider the E-R model $G(n,m)$ with probability $p=2m/(n(n-1))$ , and let $\lambda=Np$ be the expected degree under the E-R model. Assume that $\lambda$ remains finite in the limit of infinite size [23], and let $PS(u,S)$ be the correlation value between $u$ and $S$ . Then, $\forall\kappa>0$ , for any subgraph $S$ and $\forall u\in S$ , we have:

[TABLE]

Proof.

We first apply Lemma 1:

[TABLE]

Then, we take the limit:

[TABLE]

∎

Theorem 3 shows that it is difficult to obtain a high PS value in a large sparse random network. This theoretical result indicates that the distribution of PS values is strongly peaked, that is, most PS values are near zero. In fact, even in a small random network, the upper bound introduced in Lemma 1 is often small. For example, let $\kappa=0.0001$ , $N=280$ , $\epsilon=20$ and $\lambda=8$ ; then we have $Pr\big[|PS(u,S)|\geq 0.01\big]\leq 6.54\%$ . Overall, a high PS value between $u$ and $S$ in a real sparse network indicates a significant correlation between $u$ and $S$ . This provides a theoretical basis for our community detection algorithm.

To formulate a vertex-centric metric of a community, we have:

[TABLE]

where $l_{S}$ is the number of edges inside community $S$ and $K_{S}$ is the sum of the degrees of all vertices in community $S$ . Then, the objective function for a community partition is:

[TABLE]

3.2.2 $\phi$ -Coefficient

$\phi$ -Coefficient [54] is a variant of Pearson’s Product-moment Correlation Coefficient for binary variables. In association mining, it is often used to estimate whether there is a non-random pattern. Our measurement function based on $\phi$ -Coefficient is calculated as:

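$$\phi(u,S)=\frac{P(\mathcal{C}\mathcal{G})-P(\mathcal{C})P(\mathcal{G})}{\sqrt{P(\mathcal{C})\big(1-P(\mathcal{C})\big)P(\mathcal{G})\big(1-P(\mathcal{G})\big)}}=\frac{\omega N-\epsilon d_{u}}{\sqrt{\epsilon d_{u}(N-\epsilon)(N-d_{u})}}\qquad(8)$$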

A positive $\phi(u,S)$ value indicates that node $u$ has denser intra-connection with community $S$ and stronger isolation from the rest of the graph. Moreover, $\phi(u,S)$ has the same range of values as Pearson’s Product-moment Correlation Coefficient, i.e. $-1\leq\phi(u,S)\leq 1$ . When $\phi(u,S)=1$ , node $u$ is connected to every other member of community $S$ and has no links to the rest of the graph. The above discussion is intuitive. To analyze $\phi(u,S)$ more rigorously, we establish theoretical results analogous to those for PS.
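A small sketch of the $\phi$ -Coefficient in terms of the contingency counts, using the standard form of the coefficient for a $2\times 2$ table, $(\omega N-\epsilon d_{u})/\sqrt{\epsilon d_{u}(N-\epsilon)(N-d_{u})}$ . The function name `phi` is illustrative, and the sketch assumes $0<\epsilon<N$ and $0<d_{u}<N$ so that the denominator is nonzero.

```python
import math


def phi(omega: int, eps: int, d_u: int, N: int) -> float:
    """phi-coefficient between the neighbor vector of u and the community vector of S.

    Assumes 0 < eps < N and 0 < d_u < N; otherwise the denominator vanishes.
    """
    numerator = omega * N - eps * d_u
    denominator = math.sqrt(eps * d_u * (N - eps) * (N - d_u))
    return numerator / denominator


# A node linked to every other member of S and to nothing else reaches the maximum.
phi_perfect = phi(omega=4, eps=4, d_u=4, N=10)

# Counts from the Fig. 2 example (omega=3, eps=4, d_5=5, N=6).
phi_fig2 = phi(omega=3, eps=4, d_u=5, N=6)
```

Unlike PS, the score is normalized, so values for nodes with very different degrees or communities of very different sizes are on a common scale.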

Theorem 4.

For fixed $N>0$ , we have $0<\epsilon<N$ , $0<d_{u}<N$ and $0<\omega\leq min(\epsilon,d_{u})$ . Let $\phi(u,S)$ be $\phi$ -Coefficient value between $u$ and $S$ defined in (8), then 1) $\phi(u,S)$ monotonically increases with the increment of $\omega$ when $\epsilon$ and $d_{u}$ are fixed, 2) $\phi(u,S)$ monotonically decreases with the increment of $\epsilon$ when $\omega$ and $d_{u}$ are fixed, and 3) $\phi(u,S)$ monotonically decreases with the increase of $d_{u}$ when $\epsilon$ and $\omega$ are fixed.

Proof.

Let $\phi(u,S)=J(\omega,\epsilon,d_{u})$ . We will directly calculate the partial derivatives of $\phi(u,S)$ regarding to $\epsilon$ , $\omega$ and $d_{u}$ .

[TABLE]

We only need to focus on the term $-N\big[d_{u}(\epsilon-2\omega)+\omega N\big]$ . This term is obviously negative when $\epsilon\geq 2\omega$ . Now assume $\epsilon<2\omega$ . Since $\omega\leq\epsilon$ , we have $2\omega-\epsilon\leq\omega$ . Then,

[TABLE]

Thus, $\frac{\partial J(\omega,\epsilon,d_{u})}{\partial\epsilon}<0$ holds. In a similar way, $\frac{\partial J(\omega,\epsilon,d_{u})}{\partial d_{u}}<0$ holds as well. Overall, the monotonicity of $\phi(u,S)$ is consistent with the monotonicity of $\omega$ , $\epsilon$ and $d_{u}$ respectively. ∎

Theorem 5.

Given the E-R model $G(n,m)$ with the probability $p=2m/(n(n-1))$ and let $\phi(u,S)$ be $\phi$ -Coefficient value between $u$ and $S$ . For fixed $0<\epsilon<N$ , $\mathbb{E}[\phi(u,S)]=0$ holds for any subgraph $S\in\mathcal{G}_{S}$ and node $u\in S$ , where $\mathcal{G}_{S}$ is the set of all the subgraphs of $G$ .

Proof.

Firstly, we introduce a random variable:

[TABLE]

At the same time, let $X_{v}$ be a random variable such that:

[TABLE]

Integrating these two random variables, we have:

[TABLE]

where $\mathbb{E}[H_{v}]=\mathbb{E}[Y|X_{v}=1]$ . Besides, we have $d_{u}=\sum_{i=1}^{n}X_{i}$ and $\omega=\sum_{v\in S}X_{v}$ . Let $\mathcal{K}={\big{(}\epsilon(N-\epsilon)\big{)}}^{-\frac{1}{2}}$ . By the linearity of expectations,

[TABLE]

∎

Theorem 4 and Theorem 5 indicate that $\phi(u,S)$ is a good community measurement function since it has the necessary properties P1, P2 and P3. This is the theoretical basis for employing $\phi(u,S)$ in community detection. Now we can take the sum of $\phi(u,S)$ over all $u\in S$ to formulate a vertex-centric metric:

$$\Phi(S)=\sum_{u\in S}\phi(u,S)$$

Although different definitions of what a community should be have been proposed, some general ideas are widely accepted. One of them is that a clique can be regarded as a perfect community. Thus, loosely connected cliques should be separated from each other as different communities. The modularity function may fail to detect modules that are smaller than a certain scale in large networks, which is called the resolution limit problem [28] [29]. A more general example in a real network has been given in [28] and is shown in Fig. 5. There are two cliques of the same size, $\mathcal{C}_{s}$ and $\mathcal{C}_{t}$ , and a single edge connects one node $u$ from $\mathcal{C}_{s}$ to another node $v$ from $\mathcal{C}_{t}$ .

When the network is large enough, modularity-based algorithms prefer to merge these two cliques into a bigger community since this operation increases the value of the modularity function. This is clearly counter-intuitive. Fortunately, we can prove that this issue does not occur for $\Phi(S)$ .

Theorem 6.

Given a graph $G(n,m)$ with a subgraph which consists of two equal-sized cliques $\mathcal{C}_{s}$ and $\mathcal{C}_{t}$ . There is only one edge connecting $u$ and $v$ , where $u\in\mathcal{C}_{s}$ and $v\in\mathcal{C}_{t}$ . Let $P=\mathcal{C}_{s}\bigcup\mathcal{C}_{t}$ and $\mathcal{I}=|\mathcal{C}_{s}|=|\mathcal{C}_{t}|$ . Two nodes $c$ and $o$ , where $c\in\mathcal{C}_{s}$ and $o\in\mathcal{C}_{t}$ , both have one edge connecting the rest of graph respectively. Assuming $5\leq\mathcal{I}<\frac{n}{4}$ , then we have $\Phi(\mathcal{C}_{s})+\Phi(\mathcal{C}_{t})>\Phi(P)$ .

Proof.

From $\mathcal{I}<\frac{n}{4}$ , we can know that $\epsilon<\frac{n}{4}$ . Let $\Delta\Phi=\Phi(P)-(\Phi(\mathcal{C}_{s})+\Phi(\mathcal{C}_{t}))$ and $T=2\epsilon+1$ , then $\Delta\Phi$ can be calculated as:

[TABLE]

Let $\mathcal{L}=\frac{N}{\sqrt{d_{u}(N-d_{u})}}{\big{(}T(N-T)\big{)}}^{-\frac{1}{2}}$ and $\mathcal{B}=\sum_{i\in\mathcal{C}_{s}}$ $\frac{\omega_{i}N-\epsilon d_{i}}{\sqrt{d_{i}(N-d_{i})}}$ , we have:

[TABLE]

Since $T=2\epsilon+1\leq\frac{n}{2}$ , then $\sqrt{2\epsilon(N-2\epsilon)}<\sqrt{T(N-T)}$ . We can obtain:

[TABLE]

Now we need to analyze $\mathcal{B}$ . Firstly, $\mathcal{B}$ can be calculated as:

[TABLE]

We have $d_{i}=\omega_{i}=\epsilon$ for $i\neq u$ and $i\neq c$ and $d_{c}=d_{u}=\epsilon+1$ , then:

[TABLE]

Next, we will analyze $\mathcal{L}$ as well. Since $\epsilon<d_{u}<\frac{n}{2}$ , we have:

[TABLE]

Thus, we have an upper bound of $\frac{1}{2}\Delta\Phi$ as:

[TABLE]

Let $g(\epsilon)=-\epsilon(\epsilon-1)(N-\epsilon)\big{(}\sqrt{\frac{4}{3}}-1\big{)}+N$ . Since $4\leq\epsilon<\frac{n}{2}$ , $g(\epsilon)$ decreases as $\epsilon$ increases. Then, we have:

[TABLE]

Thus, $\Delta\Phi<0$ . We finish the proof. ∎

Theorem 6 reveals that $\Phi(S)$ can avoid merging two very pronounced communities. The reason is that $\Phi(S)$ takes into account the local structural information of every vertex. We now give an intuitive interpretation of this point. First, all the $\phi(u,S)$ values of vertices from $\mathcal{C}_{s}$ are 1 or approximately 1. The value 1 is the maximum that a vertex can achieve, and it indicates that a node completely belongs to a community in the sense that it connects to every other member of the community and has no external links. The $\phi(u,S)$ value is therefore reduced after merging $\mathcal{C}_{s}$ and $\mathcal{C}_{t}$ if the number of edges across $\mathcal{C}_{s}$ and $\mathcal{C}_{t}$ is too small. To retain the correlation value 1 for each vertex, the vertices from $\mathcal{C}_{s}$ would have to connect to all vertices from $\mathcal{C}_{t}$ . However, this is far from the situation shown in Fig. 6. We can observe that the number of inter-edges required for two communities to merge is affected by the $\Phi(S)$ values of the two communities. When the $\Phi(S)$ values of both communities are higher, more inter-edges are required.
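The two-clique situation of Theorem 6 can be reproduced numerically on a small hand-built graph. The sketch below is an illustration, not the authors' experiment: the graph (30 nodes, two 5-cliques joined by one edge, one extra outgoing edge per clique, and a path for the rest of the graph) and all function names are hypothetical choices that satisfy $5\leq\mathcal{I}<\frac{n}{4}$ . $\Phi(S)$ is computed as the sum of $\phi(u,S)$ over the members of $S$ .

```python
import math
from itertools import combinations


def phi(omega, eps, d_u, N):
    """phi-coefficient from the contingency counts (assumes a nonzero denominator)."""
    return (omega * N - eps * d_u) / math.sqrt(eps * d_u * (N - eps) * (N - d_u))


def big_phi(adj, S):
    """Phi(S): sum of phi(u, S) over all members u of S."""
    N = len(adj) - 1
    eps = len(S) - 1
    return sum(phi(len(adj[u] & S), eps, len(adj[u]), N) for u in S)


def build_two_clique_graph(n=30, size=5):
    """Hypothetical test graph: two cliques, one bridge edge, sparse remainder."""
    adj = {v: set() for v in range(n)}

    def add_edge(a, b):
        adj[a].add(b)
        adj[b].add(a)

    c_s, c_t = set(range(size)), set(range(size, 2 * size))
    for clique in (c_s, c_t):
        for a, b in combinations(sorted(clique), 2):
            add_edge(a, b)
    add_edge(0, size)              # the single edge (u, v) between the cliques
    add_edge(1, 2 * size)          # node c has one edge to the rest of the graph
    add_edge(size + 1, 2 * size + 1)  # node o has one edge to the rest of the graph
    for a in range(2 * size, n - 1):  # remaining nodes form a path
        add_edge(a, a + 1)
    return adj, c_s, c_t


adj, c_s, c_t = build_two_clique_graph()
separate = big_phi(adj, c_s) + big_phi(adj, c_t)
merged = big_phi(adj, c_s | c_t)
```

On this graph, keeping the cliques apart scores higher than merging them, in line with the inequality $\Phi(\mathcal{C}_{s})+\Phi(\mathcal{C}_{t})>\Phi(P)$ .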

One of the main causes of the resolution limit problem is that the modularity only considers the whole community rather than every vertex. Besides, it only depends on the number of edges $m$ when the network is sufficiently large. Let $Q$ be the modularity sum of two communities before the mergence and $\acute{Q}$ be the modularity value after the union. Then, we have:

$$\Delta Q=\acute{Q}-Q=\frac{\Delta l}{m}-\frac{K_{1}K_{2}}{2m^{2}}$$

where $K_{1}$ is the degree sum of community 1, $K_{2}$ is the degree sum of community 2 and $\Delta l$ is the difference between the number of intra-edges after merging two communities and that before the combination. If $\Delta Q>0$ , it requires $\Delta l>\frac{K_{1}K_{2}}{2m}$ . As we can see, when $m$ increases, the number of inter-edges needed will be reduced. As $m$ tends to infinity, any two communities will be merged even if there is only one edge between them. The reason why this happens is that $\Delta l$ only embodies the overall structural information of a community and the modularity misses local structural information. To avoid the resolution limit problem as much as possible, first of all, the community metric should consider every vertex rather than view the community as a single unit. Second, the community metric should take full advantage of the local structural information. Then, we have the following Theorem.

Theorem 7.

Given a graph $G(n,m)$ with a vertex $u$ and a community $S$ . To simplify the notations, let $\boldsymbol{\Psi}=\Psi_{S\backslash\{u\}}$ and $\boldsymbol{\psi}=\psi_{u}$ , where $\boldsymbol{\Psi}$ is the community vector of S and $\boldsymbol{\psi}$ is the neighbor vector of $u$ . For fixed $\epsilon>0$ , $d_{u}>0$ and $\omega>0$ , we have:

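$$\lim_{N\to\infty}\phi(u,S)=\frac{\boldsymbol{\Psi}\cdot\boldsymbol{\psi}}{\lVert\boldsymbol{\Psi}\rVert_{2}\lVert\boldsymbol{\psi}\rVert_{2}},$$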

where $\lVert.\rVert_{2}$ is Euclidean norm.

Proof.

We directly calculate the limit of $\phi(u,S)$ as $N$ tends to infinity.

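$$\lim_{N\to\infty}\phi(u,S)=\lim_{N\to\infty}\frac{\omega N-\epsilon d_{u}}{\sqrt{\epsilon d_{u}(N-\epsilon)(N-d_{u})}}=\lim_{N\to\infty}\frac{\omega-\epsilon d_{u}/N}{\sqrt{\epsilon d_{u}(1-\epsilon/N)(1-d_{u}/N)}}=\frac{\omega}{\sqrt{\epsilon d_{u}}}=\frac{\boldsymbol{\Psi}\cdot\boldsymbol{\psi}}{\lVert\boldsymbol{\Psi}\rVert_{2}\lVert\boldsymbol{\psi}\rVert_{2}}$$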

∎

Theorem 7 indicates that $\phi(u,S)$ is the Cosine similarity between $\boldsymbol{\Psi}$ and $\boldsymbol{\psi}$ as $N$ tends to infinity. Then, we can recalculate $\Phi(S)$ in the limit:

[TABLE]

Obviously, it considers the Cosine similarity values between each node and the community. As a result, it is unlikely to perform the community mergence operation in the case of weak inter-connections between two communities. Theorem 6 and Theorem 7 both indicate $\Phi(S)$ can mitigate the resolution limit problem.
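The convergence stated in Theorem 7 can be illustrated numerically. The sketch below assumes the count form of the $\phi$ -Coefficient, $(\omega N-\epsilon d_{u})/\sqrt{\epsilon d_{u}(N-\epsilon)(N-d_{u})}$ , and uses the fact that for binary vectors $\boldsymbol{\Psi}\cdot\boldsymbol{\psi}=\omega$ , $\lVert\boldsymbol{\Psi}\rVert_{2}=\sqrt{\epsilon}$ and $\lVert\boldsymbol{\psi}\rVert_{2}=\sqrt{d_{u}}$ . The chosen counts are arbitrary.

```python
import math


def phi(omega, eps, d_u, N):
    """phi-coefficient from the contingency counts."""
    return (omega * N - eps * d_u) / math.sqrt(eps * d_u * (N - eps) * (N - d_u))


def cosine_limit(omega, eps, d_u):
    """Cosine similarity of the two binary vectors: omega / sqrt(eps * d_u)."""
    return omega / math.sqrt(eps * d_u)


omega, eps, d_u = 3, 4, 5
target = cosine_limit(omega, eps, d_u)

# Gap between phi and the cosine similarity for growing vector length N.
gaps = [abs(phi(omega, eps, d_u, N) - target) for N in (10, 100, 1000, 10000)]
```

The gap shrinks as $N$ grows with $\omega$ , $\epsilon$ and $d_{u}$ held fixed, which is the regime of a small community inside a large network.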

3.3 Summary

As a short summary, we would like to present the following remarks. First of all, the use of different correlation functions in our node-centric framework may lead to some different but known community evaluation measures. As shown, modularity, node modularity, neighborhood connectedness and community connectedness in Focs [47] are all concrete examples in our abstract framework. More importantly, we may explore more correlation functions to obtain more effective community evaluation measures in the future.

4 Correlation-Based Community Detection

In this section, we propose a Correlation-Based Community Detection (CBCD) algorithm, which is based on PS measure and $\phi$ -Coefficient. CBCD takes a graph $G(V,E)$ as input and generates a partition of $G$ which is the set of detected communities. The algorithm is divided into three phases: Seed Selection, Local Optimization Iteration and Community Merging.

4.1 Seed Selection

The goal of Seed Selection is to initialize a set of seed communities for the next phase. Our algorithm adopts a local search strategy to optimize the objective function defined in (7), which requires seed nodes to start with. Triadic closure is an important property of social networks, which describes the basic process of social network formation [55] [56] [57]. A triangle (a clique of size 3) in the network embodies this property. A node contained in a large number of triangles has a higher probability of being the core of a community. Thus, we use the number of triangles as the criterion for selecting the seed nodes. The specific process is described in Algorithm 1. We first count the number of triangles of every node in the graph and then sort the nodes by their triangle counts in decreasing order. For nodes with the same number of triangles, we compare their degrees. Then, we go through all the nodes in this order. For every node $u$ that has not been previously visited, we mark $u$ and its neighbor nodes as visited and then put $u$ into the seed node set $P$ .
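The selection procedure described above can be sketched as follows. This is not Algorithm 1 itself: the function name `select_seeds` is hypothetical, triangle counts are taken as a precomputed input, and two details are assumptions that the text leaves open, namely that ties in triangle count are broken in favor of the higher degree and that remaining ties are broken by node id.

```python
from typing import Dict, List, Set


def select_seeds(adj: Dict[int, Set[int]], triangles: Dict[int, int]) -> List[int]:
    """Greedy seed selection ordered by triangle count.

    adj:       adjacency sets of an undirected graph
    triangles: number of triangles each node participates in
    """
    # Assumption: higher degree first on ties, then smaller node id.
    order = sorted(adj, key=lambda v: (-triangles[v], -len(adj[v]), v))
    visited: Set[int] = set()
    seeds: List[int] = []
    for u in order:
        if u in visited:
            continue
        # Mark u and its neighbors as visited, then keep u as a seed.
        visited.add(u)
        visited.update(adj[u])
        seeds.append(u)
    return seeds


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_triangles = {v: 1 for v in toy_adj}
seeds = select_seeds(toy_adj, toy_triangles)
```

Because the neighbors of a chosen seed are marked as visited, no two seeds are adjacent, so each seed starts its own community.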

Triangle counting is an important task in data mining and network science [58], which has been widely used in many applications such as community detection and link prediction. There are many exact triangle counting algorithms in the literature [59] [60] [61] [62]. Here we modify the Ayz-Node-Counting algorithm [61] to implement the Triangle-Counting function in Algorithm 1. Ayz-Node-Counting divides the nodes into two parts with a degree threshold $\beta$ : one set of nodes whose degrees are at most $\beta$ and another set of nodes whose degrees are at least $\beta$ . Ayz-Node-Counting enumerates node-pairs that are adjacent to each node from the low degree node set. For the subgraph $\mathcal{G}$ induced from the high degree node set, Ayz-Node-Counting uses the fast matrix product to compute the number of triangles for each node in $\mathcal{G}$ . By choosing appropriate $\beta$ , the worst time complexity of Ayz-Node-Counting is $O(m^{1.4})$ . Since matrix multiplication-based methods may require large memory due to adjacency matrix storage, we also enumerate over node-pairs in the induced subgraph $\mathcal{G}$ . Since the total degree is $2m$ , then the number of high degree nodes is at most $\frac{2m}{\beta}$ . The worst time complexity of Algorithm 2 is $O(\sum_{u\in\mathcal{U}}deg(u)^{2})+O((\frac{2m}{\beta})^{3})$ . Since we have:

[TABLE]

the time complexity can be written as $O(2m\beta)+O((\frac{2m}{\beta})^{3})$ . Setting $\beta=\sqrt{2m}$ gives a time complexity of $O(m\sqrt{m})$ . In fact, in a large sparse network, the number of high-degree nodes is small even for low $\beta$ . In this case, the time complexity of triangle counting can be viewed as $O(\beta m)$ .
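The arithmetic behind this bound can be checked step by step. The derivation below is a sketch under one assumption: $\mathcal{U}$ is taken to be the low-degree set, so that $deg(u)\leq\beta$ for every $u\in\mathcal{U}$.

```latex
\begin{align*}
\sum_{u\in\mathcal{U}} deg(u)^{2}
  &\le \beta \sum_{u\in\mathcal{U}} deg(u) \le 2m\beta, \\
\beta=\sqrt{2m}\;\Rightarrow\;
2m\beta+\Bigl(\frac{2m}{\beta}\Bigr)^{3}
  &= (2m)^{3/2}+(2m)^{3/2} = 2\,(2m)^{3/2} = O(m\sqrt{m}).
\end{align*}
```

The choice $\beta=\sqrt{2m}$ balances the two terms, which is why neither dominates in the final bound.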

4.2 Local Optimization Iteration

In this phase, we aim to find a partition that maximizes the objective function defined in (7). Let $P$ be a partition of graph $G$ and let $P(u)$ denote the index of the community of node $u$ . Since a local search strategy (Line 5 to 20) is adopted in this phase, we start with a set of seed nodes. Suppose we obtain a partition $P=\{S_{1},...,S_{\mathcal{M}}\}$ with $|P|=\mathcal{M}$ from Seed Selection. Although we call $P$ a “partition”, every $S_{j}\in P$ initially contains only one seed node and many other nodes are not assigned to any community. To handle this initial state, we let $P(u)=-1$ for each node $u$ that is not assigned to any community. Then, we reformulate the objective function (7) to make it convenient for implementing Local Optimization Iteration:

[TABLE]

where $l_{u,S_{j}}$ is the number of edges between $u$ and $S_{j}$ and $\delta(x,y)$ is the Kronecker delta function, whose value is 1 if $x=y$ and 0 otherwise. Our optimization algorithm is an iterative process. In each iteration, it reorganizes the partition to improve the value of (9) until a locally optimal solution is reached. First, we consider how to assign the nodes that are not contained in any community. According to (9), for any partition $P$ , we have:

[TABLE]

where $AC(u)$ is the set of communities whose nodes are adjacent to node $u$ . Obviously, (10) is easier to maximize than (9) if we do not modify the current partition and the PS value of each node is computed independently of the others. Thus, we find a community $S$ for each unassigned node $u$ by maximizing $PS(u,S)$ (Line 7 to 13). Then, we put node $u$ into community $S$ and update $\Gamma(P)$ with the difference brought by this operation (Line 14 to 18):

[TABLE]

In practice, the partition is modified for each node after performing the steps described in Line 7 to 18, which makes the algorithm more robust to local maxima. Next, we consider the nodes that have already been assigned to a community. Line 21 to 39 in Algorithm 3 describe the partition refinement step. It refines the partition obtained in the local search step (Line 5 to 20) using a hill climbing method. In each iteration, we move nodes between communities to improve the value of $\Gamma(P)$ . For each node $u$ belonging to a community $S$ , we compute the difference brought by removing $u$ from $S$ , $\Delta F=F(S)-F(S^{\prime})$ (Line 23 to Line 25). For a community $S_{j}$ , we compute the difference brought by adding $u$ to $S_{j}$ , $\Delta F_{j}=F(S_{j}^{\prime})-F(S_{j})$ . We choose a community $S_{i}$ such that $\Delta F_{i}>\Delta F$ and $\Delta F_{i}-\Delta F$ is maximized. Then, we add $u$ to $S_{i}$ and remove $u$ from $S$ to update the partition. The local search step and the partition refinement step are performed alternately until every node has been assigned to a community and the objective function $\Gamma(P)$ converges.
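The move rule of the refinement step can be illustrated with a short sketch. It is not Algorithm 3: the function `refine_pass` is a hypothetical name, the community score `F` is passed in as a callable, and the toy score used in the example (the number of internal edges) is only a stand-in for the paper's function.

```python
def refine_pass(adj, comm_of, F):
    """One hill-climbing pass over all nodes; returns the number of nodes moved.

    adj     : {node: set of neighbours}
    comm_of : {node: community id}, modified in place
    F       : hypothetical community score, F(frozenset of nodes) -> float
    """
    members = {}
    for u, c in comm_of.items():
        members.setdefault(c, set()).add(u)
    moved = 0
    for u in list(comm_of):
        src = comm_of[u]
        # Loss from removing u from its community: dF = F(S) - F(S').
        loss = F(frozenset(members[src])) - F(frozenset(members[src] - {u}))
        best_gain, best = loss, None
        for c in {comm_of[v] for v in adj[u]} - {src}:
            # Gain from adding u to a neighbouring community: dF_j = F(S_j') - F(S_j).
            gain = F(frozenset(members[c] | {u})) - F(frozenset(members[c]))
            if gain > best_gain:  # requires dF_j > dF and keeps the largest dF_j
                best_gain, best = gain, c
        if best is not None:
            members[src].discard(u)
            members[best].add(u)
            comm_of[u] = best
            moved += 1
    return moved

# Toy data: two triangles joined by the edge 2-3; node 2 starts in the wrong community.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def internal_edges(S):
    # Toy stand-in for F(S): number of edges with both endpoints in S.
    return sum(1 for u in S for v in adj[u] if v in S) // 2

comm_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
moved = refine_pass(adj, comm_of, internal_edges)
```

Because $\Delta F$ is fixed for a given node, maximizing $\Delta F_{i}-\Delta F$ is the same as picking the largest $\Delta F_{i}$ among the candidates that exceed $\Delta F$, which is what the inner loop does.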

4.3 Community Merging

After Local Optimization Iteration, there may be many small but significant communities. We need a merging operation to find communities of suitable size. The theoretical basis of Community Merging comes from Section 3.2.2. $\Phi(S)$ is the criterion used in Community Merging to judge whether two communities should be merged. Note that here $\Phi(S)$ is used only to implement the merging operation rather than as an objective function. Throughout the process, we maintain a max-heap that supports delete and insert operations in $O(\log n)$ time. Community Merging is described in Algorithm 4. First, we start with each community $S_{i}\in P$ being a single node of a graph $\mathcal{F}$ and establish an edge between $S_{i}$ and $S_{j}$ if at least one edge links them. $\mathcal{F}$ is a weighted graph in which the weight between communities $i$ and $j$ is $\Delta W_{ij}=\Phi(S_{i}\cup S_{j})-\Phi(S_{i})-\Phi(S_{j})$ . Then, for each pair $(i,j)$ with $\Delta W_{ij}>Th$ , we put a triad $(i,j,\Delta W_{ij})$ into the max-heap $Max\_H$ . In each iteration, we take the triad $(i,j,\Delta W_{ij})$ with the maximum $\Delta W_{ij}$ from the top of $Max\_H$ and merge community $i$ and community $j$ . The union operation can be implemented with the disjoint-set data structure. Next, we update the graph $\mathcal{F}$ and put each new triad $(i,k,\Delta W_{ik})$ with $\Delta W_{ik}>Th$ into $Max\_H$ . We continue these steps until $Max\_H$ is empty. The threshold $Th$ controls the size of the communities we find. If $Th$ is too high, the detected communities may be too small. If $Th$ is too low, the detected communities may be so big that the resolution limit problem occurs. In practice, $\Phi(S)$ often requires a great number of inter-connections for merging two communities. Thus, we need to relax $Th$ to an appropriately small negative value. Such a relaxation will not lead to the resolution limit problem, but can help us find communities of suitable size. For a theoretical analysis in large networks, recall Theorem 7 introduced in Section 3.2.2; the mathematical expression of $Cos(S)$ is given in (10). If we do not want to merge two equal-sized cliques $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ ( $|\mathcal{C}_{1}|=|\mathcal{C}_{2}|\geq 5$ and there is only one edge between $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ ), we should have:

[TABLE]

We suggest that the user choose an appropriate threshold $Th$ such that $-2.9<Th<0$ .
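The heap-driven merging loop can be sketched as follows. This is an illustration under several assumptions rather than Algorithm 4: `merge_communities` is a hypothetical name, the criterion is passed in as a callable `phi`, stale heap entries are skipped with version counters (lazy deletion), and plain Python sets replace the disjoint-set structure for brevity. Community ids are assumed to be integers. The toy criterion in the example is not the paper's $\Phi(S)$.

```python
import heapq

def merge_communities(communities, edges, phi, th):
    """Greedily merge adjacent communities while the best gain exceeds th.

    communities : {int community id: set of nodes}
    edges       : list of (u, v) pairs of the original graph
    phi         : hypothetical callable standing in for Phi(S)
    th          : merging threshold Th
    """
    members = {c: set(nodes) for c, nodes in communities.items()}
    owner = {u: c for c, nodes in members.items() for u in nodes}
    # Community graph: two communities are adjacent if at least one edge links them.
    nbrs = {c: set() for c in members}
    for u, v in edges:
        cu, cv = owner[u], owner[v]
        if cu != cv:
            nbrs[cu].add(cv)
            nbrs[cv].add(cu)
    version = {c: 0 for c in members}  # bumped whenever a community absorbs another
    heap = []

    def push(i, j):
        gain = phi(members[i] | members[j]) - phi(members[i]) - phi(members[j])
        if gain > th:
            # heapq is a min-heap, so the negated gain is stored.
            heapq.heappush(heap, (-gain, i, j, version[i], version[j]))

    for i in members:
        for j in nbrs[i]:
            if i < j:
                push(i, j)

    while heap:
        _, i, j, vi, vj = heapq.heappop(heap)
        if i not in members or j not in members:
            continue  # one side was merged away earlier
        if version[i] != vi or version[j] != vj:
            continue  # stale gain, computed before one side changed
        members[i] |= members.pop(j)  # merge j into i
        version[i] += 1
        nbrs[i] |= nbrs.pop(j)
        nbrs[i] -= {i, j}
        for k in nbrs[i]:
            nbrs[k].discard(j)
            nbrs[k].add(i)
            push(i, k)
    return list(members.values())

# Toy data: two triangles joined by the edge 2-3; the left triangle starts split.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

def toy_phi(S):
    # Toy stand-in: internal edges minus half the number of possible pairs.
    internal = sum(1 for a, b in edges if a in S and b in S)
    return internal - 0.25 * len(S) * (len(S) - 1)

result = merge_communities({0: {0}, 1: {1, 2}, 2: {3, 4, 5}}, edges, toy_phi, th=0.0)
```

With this toy criterion the split triangle is reassembled, while merging across the single bridging edge has a negative gain and is rejected.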

4.4 Complexity Analysis

As discussed above, the time complexity of Seed Selection is $O(\beta m)$ , where $\beta$ is a constant. For Local Optimization Iteration, we need to go through all the edges in each iteration. Thus, its time complexity is $O(Max\_It\cdot m)$ , where $Max\_It$ is the number of iterations that Local Optimization Iteration needs. In practice, setting $Max\_It=20$ is sufficient for Local Optimization Iteration to converge. For Community Merging, we need to merge two communities and insert new elements into a max-heap in each iteration. We use both path compression and union by size to ensure that the amortized time per union operation is only $O(\alpha(n))$ [63] [64], where $\alpha(n)$ is the inverse Ackermann function, which can be viewed as a constant. Thus, the union operations take $O(n)$ time overall. For the insert operations, the worst-case time complexity is $O(m\log n)$ since we have to go through every neighbor of the merged community in each iteration. In total, the time complexity is $O(m+n+m\log n)$ .

5 Experimental Results

In this section, the proposed algorithm CBCD is compared with the existing state-of-the-art algorithms on both synthetic networks and real networks. Each node in these two different kinds of networks has a ground truth community label. For communities found by the detection algorithms, we need to compare them with the ground truth communities. A criterion is necessary to measure the similarity between the final partition of the algorithm and the actual communities. Many evaluation measures, such as Normalized Mutual Information (NMI) [65], ARI [66] and Purity, have been proposed to evaluate the clustering quality of algorithms. NMI is the most widely-accepted and important evaluation measure, since it is more discriminatory and more sensitive to errors in the community detection procedure [65]. We will use NMI as our performance evaluation measure in the experiments. The code of NMI calculation offered by McDaid et al. [67] can be found at https://github.com/aaronmcdaid/Overlapping-NMI.
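For readers unfamiliar with the measure, NMI for two non-overlapping labelings can be computed from their joint label counts. The sketch below assumes the common normalization $2I(X;Y)/(H(X)+H(Y))$; it is an illustration only and is not the code of McDaid et al. used in the experiments, whose normalization may differ.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)) of two labelings."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both labelings put every node in one cluster
    return 2 * mi / (h_a + h_b)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to renaming of labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # labels carry no information about each other
```

The score depends only on how nodes are grouped, not on the label names, so a detected partition does not need to be aligned with the ground truth before scoring.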

According to the comparative analysis of community detection algorithms [68], Louvain and Infomap are two of the best classical algorithms in the literature. Thus, we take these two algorithms as competing algorithms. In addition, some novel algorithms proposed in recent years are also compared with CBCD. These algorithms are listed as follows:

• Louvain [69] is a well-known heuristic algorithm based on modularity. The algorithm is composed of two steps that are performed iteratively. The first step moves each node to the community for which the gain in modularity is positive and maximal. The next step builds a new weighted network whose nodes are the communities found in the first step. The procedure continues until modularity reaches a maximum. Louvain can unfold a complete hierarchical community structure for the network. The program can be downloaded from https://sites.google.com/site/findcommunities/.

• Infomap [70] is another well-known algorithm based on information theory and random walks. It uses the probability flow of random walks on a network as a description of the network structure and decomposes the network into modules by compressing a description of the probability flow with information-theoretic results. It simplifies the organization of the network and highlights the relationships among modules. The program can be downloaded from http://www.mapequation.org/code.html.

• FOCS [47] is a fast overlapping community detection algorithm that detects overlapping communities using local connectedness. FOCS takes a parameter OVL as input, which is a threshold on the maximum overlap allowed between two communities. Since our comparison is conducted over non-overlapping algorithms, we set OVL to 0. The program can be downloaded from https://github.com/garishach/focs.

• SCD [45] [46] is a community detection algorithm based on a new community metric, WCC. WCC considers the triangle as the basic structure instead of the edge or node. The theoretical analysis given in [46] shows that WCC can correctly capture the community structure. The program can be downloaded from https://github.com/DAMA-UPC/SCD.

• Attractor [71] is an algorithm based on distance dynamics. The fundamental idea of Attractor is to view the whole network as an adaptive dynamical system. Each node in this dynamical system interacts with its neighbors, and the distances among nodes are changed by these interactions. At the same time, the distances in turn affect the interactions. The dynamical system eventually evolves into a steady state. The Attractor algorithm requires a cohesion parameter $\lambda$ that ranges from 0 to 1, which determines how exclusive neighbors affect distance (positive or negative influence). Following [71], we set $\lambda=0.5$ . The program can be downloaded from https://github.com/YcheCourseProject/CommunityDetection.

For all experiments, without further statement, we set the threshold parameter of our algorithm $Th=-2.8$ when $0<|V|<4000$ and $Th=-0.43$ when $4000\leq|V|$ , corresponding to small networks and large-scale networks. All experimental results have been obtained on a workstation with 3.5 GHz Intel(R) Xeon(R) CPU E5-1620 v3 and 16.0 GB RAM. For Louvain, we adopted the lowest partition of the hierarchy, which is stored in the graph.tree file. For Infomap, the number of outer-most loops to run before picking the best solution is specified to be 10.

5.1 LFR Benchmark

The LFR benchmark, introduced by Lancichinetti et al. [72], is a very popular graph simulation model. Its most important parameter is the mixing parameter $u$ . The community structure of an LFR network becomes more pronounced as $u$ decreases. In particular, $u=0$ indicates that each node connects only to nodes inside its community and $u=1$ indicates that each node connects only to nodes outside its community. The program of the LFR benchmark can be downloaded from https://sites.google.com/site/santofortunato/inthepress2.
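To make the meaning of the mixing parameter concrete, the sketch below measures the empirical mixing of a given graph and labeling as the average fraction of each node's edges that leave its community. The function name `mixing_parameter` is hypothetical, and this is a measurement on an existing graph, not the LFR generator itself.

```python
def mixing_parameter(adj, comm_of):
    """Average, over non-isolated nodes, of the fraction of edges leaving the node's community."""
    fractions = []
    for u, nbrs in adj.items():
        if not nbrs:
            continue  # isolated nodes have no edges to classify
        external = sum(1 for v in nbrs if comm_of[v] != comm_of[u])
        fractions.append(external / len(nbrs))
    return sum(fractions) / len(fractions)

# Toy graph: two triangles joined by the edge 2-3, one community per triangle.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
comm_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

mix = mixing_parameter(adj, comm_of)
```

Only nodes 2 and 3 have an external edge (one of three each), so the empirical mixing here is $(1/3+1/3)/6=1/9$, a strongly separated structure.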

We generate several LFR networks characterized by different features to compare the performance of various algorithms. The number of nodes in all LFR networks is fixed to 1000. We consider $2\times 2$ cases, in which the community size parameter is specified within the range [15,30] or [20,50] and the average degree is set to 15 or 20. For each case, we fix the average degree and the community size, and then increase the mixing parameter $u$ from 0 to 1 to generate a variety of LFR networks for comparison.

The performance comparison in terms of NMI is shown in Fig. 7. As we can see from Fig. 7, CBCD, Louvain and Infomap achieve nearly the best clustering performance. The NMI values of all algorithms decrease as the mixing parameter $u$ tends to 1. This is because increasing the mixing parameter $u$ introduces more edges among different communities, making it difficult to identify the underlying true communities. CBCD has better performance when the community size parameter is smaller. Note that even when the community size parameter is relatively big, CBCD still has good performance in comparison with the other algorithms. Except for Infomap and FOCS, the performance of the other algorithms becomes better when the average degree parameter is increased, and CBCD is better than all other algorithms when the degree is 20. We can see that the NMI of Infomap always starts to decrease dramatically when $u$ ranges between $0.6$ and $0.7$ . This is because Infomap is based on random walk dynamics and is more sensitive to the noisy inter-edges between communities as $u$ tends to 1. By contrast, CBCD is more robust to these noisy inter-edges, although Infomap is slightly better than CBCD when the average degree is 15 and $u$ ranges between $0.55$ and $0.65$ . Compared to Louvain, CBCD is always the winner. Let us consider the maximum community size in the partitions obtained by Louvain and CBCD. The relation between the mixing parameter $u$ and the maximum community size of the detected partition is plotted in Fig. 8. The maximum community size of both CBCD and Louvain increases as the mixing parameter $u$ is increased. This demonstrates that the detection algorithms tend to merge ground-truth communities when $u$ is high. The maximum community size of CBCD is almost always lower than that of Louvain. We can observe that, especially when the mixing parameter $u$ ranges between $0.5$ and $0.6$ , the maximum community size of CBCD is significantly lower than that of Louvain. This result shows that CBCD can mitigate the resolution limit problem to some extent.

5.2 Real-World Network

In most real-world networks, nodes have no ground-truth labels. Modularity is typically used to evaluate the quality of detected communities. However, as discussed before, modularity is not a good quality measure of communities because of the resolution limit problem. Besides, modularity has been found to follow the same general pattern for different classes of networks [73]. Thus, we only conduct our experiments on several well-known real-world networks with ground-truth communities: Karate (karate) [74], Football (football) [5], Personal Facebook (personal) [75], Political blogs (polblogs) [76], and Books about US politics (polbooks) [77]. The detailed statistics of the real-world networks are given in Table 3, where $|V|$ denotes the number of nodes, $|E|$ denotes the number of edges, $d_{max}$ denotes the maximal degree of the nodes, $\langle d\rangle$ denotes the average degree of the nodes and $|C|$ denotes the number of ground-truth communities in the network. The performance of the algorithms on the real-world networks is shown in Table 4, where NC is the number of communities detected by the algorithm.

American college football: The American college football network describes football games between Division IA colleges during the regular season in Fall 2000. It has 115 teams and 631 games between these teams. Each node is a team, and an edge connects two nodes if the two teams played a game. The teams are partitioned into 12 conferences (communities). Louvain, Infomap, SCD and Attractor all perform well on the football data set, and SCD achieves the best performance among these algorithms. We have to admit that these algorithms outperform our method on the football data set. Fig. 9 plots the variation of the NMI of the partition detected by CBCD when the threshold $Th$ used in the merging operation increases from -2.8 to 0. As $Th$ increases from -2.8 to -2.2, NMI increases to 0.773, which is the maximal NMI value of CBCD on the football data set. NMI begins to decrease when $Th$ increases further. Note that the number of detected communities increases as $Th$ increases. The number of detected communities is 10 when $Th=-2.2$ . The quality of Algorithm 4 (Community Merging) is mainly determined by the output of Algorithm 3 (Local Optimization Iteration). Although CBCD can achieve an NMI value of 0.773, it still cannot beat the other algorithms except FOCS. This demonstrates that there is still room for improving Algorithm 3.

Zachary’s karate club network: Karate is a famous network derived from Zachary’s observations of a karate club. The network describes the friendships among the members of the club. The network was divided into two parts because of a disagreement between the administrator and the instructor. According to Table 4, CBCD outperforms all other algorithms and achieves an NMI value of 0.840. Two communities are successfully found by CBCD, but our algorithm assigns one node, ‘10’, to the wrong community. We observe that this node has only two edges, one to each of the two communities. It is difficult to decide which community it really belongs to. Thus, we consider this node a noisy node. In fact, it can make sense to assign node ‘10’ to both communities in the context of overlapping community detection. However, this study mainly focuses on non-overlapping community detection. If we delete node ‘10’ from the karate network, our CBCD algorithm achieves the perfect performance of NMI=1: its output exactly matches the partition of ground-truth communities. Infomap achieves the second-best performance. Attractor has the worst performance because it puts all the members into one community.

Personal Facebook network: Personal is a network that gives the friendship structure of the first author, where each individual (node) is labeled according to the time period when he or she met the first author. Individuals are divided into different groups according to their locations. CBCD achieves the best performance (NMI=0.3639) on the personal data set. Attractor and Infomap also achieve good performance. Louvain achieves the worst result.

Books about US politics: This network consists of 105 nodes and 441 edges and is derived from books about US politics published around the time of the 2004 presidential election. Each node represents a book sold at Amazon.com, and each edge indicates that two books are frequently co-purchased by the same buyers. Each book is labeled “liberal”, “neutral” or “conservative”; the labels are given by Amazon.com. CBCD gives the best partition, with NMI = 0.33, among the compared algorithms. Attractor and Infomap yield good performance, while Louvain, SCD and FOCS have relatively bad performance.

U.S. political blogs: The polblogs network consists of 1490 nodes and 19090 edges and describes the degree of interaction between liberal and conservative blogs. Compared to the other algorithms, CBCD has the best performance on the polblogs data set. Infomap and Louvain also produce good partitions. Attractor, FOCS, and SCD yield relatively bad partitions.

5.3 Large-Scale Real-World Network

The large-scale real networks provided by [78] all have overlapping ground-truth communities. Although overlapping community detection is out of the scope of our paper, we still run CBCD along with the other algorithms on these networks to test the performance of our algorithm on large-scale real networks. To evaluate the performance of community detection algorithms on networks with overlapping community structures, Overlapping Normalized Mutual Information (ONMI) [79] is the major evaluation metric in this section. Since the comparison is made on networks with overlapping structures, we set OVL of FOCS to 0.6 for detecting overlapping communities. We choose two large-scale networks, Amazon and DBLP, for testing the performance of different methods. The specific information of these two networks is given as follows.

Amazon: This is an undirected network collected by crawling the Amazon website, where each node is a product sold on the website and an edge exists between two nodes (products) if they are frequently co-purchased. The ground-truth communities are determined by the product category. Each connected component in a product category is regarded as an independent ground-truth community. The whole network has 334863 nodes, 925872 edges and 70928 communities. Ninety-one percent of the nodes participate in at least two communities.

DBLP: This is a co-authorship network derived from the DBLP computer science bibliography. Each author of a paper is viewed as a node. Two authors are connected by an edge if they have published at least one paper together. The publication venue, e.g., journal or conference, is the indicator of the ground-truth community. The authors who publish papers in the same journal or conference form a community. The whole network has 317080 nodes, 1049866 edges and 13477 communities. Thirty-five percent of the nodes participate in at least two communities.

Table 4 summarizes the experimental results of different algorithms on the two large-scale real networks, where NC is the number of detected communities and ET is the execution time of the algorithms. For the DBLP network, FOCS achieves the best performance and SCD achieves the second-best performance. Although the performance of CBCD on DBLP is not as good as that of these two algorithms, CBCD is still better than the others. For the Amazon network, CBCD outperforms the other algorithms. FOCS is the second-best performer and Attractor is slightly inferior to FOCS. Infomap has the worst performance on both DBLP and Amazon, which is in contrast with its performance on small real networks and LFR networks. Although CBCD is effective at detecting meaningful communities, it has no obvious advantage with respect to execution time. This is what we should focus on in our future work.

5.4 The Distribution of PS Values

In this section, we study the distribution of the PS value of a community, defined in Formula (6), under the E-R model. First, we generate 300 random networks for each parameter setting under the E-R model. The parameters are the average degree $\lambda$ and the network size $n$ of the random network. Then, we calculate the PS value of a given community $S$ , which is composed of 100 fixed nodes. Consequently, we obtain 300 PS values of the given community, derived from 300 different random networks. We divide the PS values into bins of equal length, where the low (high) bins correspond to lower (higher) PS values. In Fig. 10, bins are plotted on the $x$ -axis, and for each bin, the fraction of communities whose PS values fall into that bin is plotted on the $y$ -axis. We can observe that the PS value follows a Gaussian-like distribution. The PS values are concentrated near 0 and most values fall into a very narrow interval. This phenomenon corresponds to our theoretical result in Section 3.2.1. Comparing Fig. 10(a) with Fig. 10(b) and Fig. 10(c) with Fig. 10(d), we find that the interval into which most PS values fall becomes shorter as the average degree $\lambda$ decreases. Besides, this interval shortens sharply when the network size $n$ increases. It is a remarkable fact that the PS value of a community in random networks generated from the E-R model is very small. This demonstrates that the PS value of a community is an effective metric for quantifying the goodness of a community structure.
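A scaled-down version of this experiment can be sketched as follows. Several parts are assumptions: Formula (6) is not reproduced here, so the node-level value uses the generic Piatetsky-Shapiro form $P(A,B)-P(A)P(B)$ with $A$ = "$x$ is adjacent to $u$" and $B$ = "$x$ is in $S\setminus\{u\}$", and the community-level value is taken as the average over the nodes of $S$. The function names, the network size (400) and the number of networks (20 rather than 300) are illustrative choices.

```python
import random

def er_graph(n, avg_degree, rng):
    """E-R random graph G(n, p) with p chosen so the expected degree is avg_degree."""
    p = avg_degree / (n - 1)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def node_ps(adj, u, S):
    """Generic PS interest P(A,B) - P(A)P(B) over the n-1 nodes other than u (assumed form)."""
    N = len(adj) - 1
    others = S - {u}
    f11 = len(adj[u] & others)  # neighbours of u that lie inside the community
    return f11 / N - (len(adj[u]) / N) * (len(others) / N)

def community_ps(adj, S):
    # Assumption: the community value is the average of the node-level values.
    return sum(node_ps(adj, u, S) for u in S) / len(S)

rng = random.Random(0)
S = set(range(100))  # a fixed community of 100 nodes
values = [community_ps(er_graph(400, 10, rng), S) for _ in range(20)]
```

Under these assumptions the expected value is zero, since in an E-R graph adjacency to $u$ is independent of membership in $S$; the computed values scatter tightly around 0, in line with the concentration described above.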

6 Conclusion

In this paper, we introduce two novel node-centric community evaluation functions by connecting correlation analysis with community detection. We further show that correlation analysis provides a novel theoretical framework that unifies some existing quality functions and converts community detection into a correlation-based optimization problem. In this framework, we choose the PS-metric and the $\phi$ -coefficient to eliminate the influence of random fluctuations and mitigate the resolution limit problem. Furthermore, we introduce three key properties used in association rule mining into the context of community detection to help us choose the appropriate correlation function. A correlation-based community detection algorithm, CBCD, that makes use of the PS-metric and the $\phi$ -coefficient is proposed in this paper. Our proposed algorithm outperforms five existing state-of-the-art algorithms on most of the LFR benchmark networks and real-world networks considered. In the future, we will investigate more correlation functions and extend our method to overlapping community detection.

Acknowledgments

This work was partially supported by the Natural Science Foundation of China under Grant No. 61572094.

Bibliography

[1] V. Spirin and L. A. Mirny, “Protein complexes and functional modules in molecular networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 21, pp. 12123–12128, 2003.
[2] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, “Community detection in social media,” Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515–554, 2012.
[3] S. L. Feld, “The focused organization of social ties,” American Journal of Sociology, vol. 86, no. 5, pp. 1015–1035, 1981.
[4] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis et al., “Global landscape of protein complexes in the yeast Saccharomyces cerevisiae,” Nature, vol. 440, no. 7084, p. 637, 2006.
[5] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.
[6] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3, pp. 75–174, 2009.
[7] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[8] F. Chung, “Spectral graph theory,” in Regional Conference Series in Mathematics, 1997, p. 212.
