Using synthetic networks for parameter tuning in community detection

Liudmila Prokhorenkova

arXiv:1906.04555·cs.SI·June 25, 2019

TL;DR

This paper introduces a method to tune community detection algorithms by generating synthetic networks that mimic real data, enabling parameter optimization without labeled data and improving detection quality.

Contribution

The paper presents a novel approach to hyperparameter tuning for community detection using synthetic networks with known communities, applicable without labeled data.

Findings

01. Significant improvements in community detection accuracy on synthetic datasets.

02. Effective parameter tuning method applicable to various algorithms.

03. Enhanced detection quality on real-world networks.

Tables

Table 1: Real-world datasets

Dataset	$n$	$m$	num. clusters	mixing parameter
Karate club [27]	34	78	2	0.128
Dolphin network [16]	62	159	2	0.038
College football [21]	115	613	11	0.325
Political books [20]	105	441	3	0.159
Political blogs [1]	1224	16715	2	0.094
email-Eu-core [15]	986	16064	42	0.664
Cora citation [26]	24166	89157	70	0.458
AS [5]	23752	58416	176	0.561

Table 2: Louvain algorithm, default value is $\gamma_{0}=1$, standard deviation is given in brackets

	Rand					Jaccard					NMI
Dataset	Default		Tuned		$\gamma_{opt}$	Default		Tuned		$\gamma_{opt}$	Default		Tuned		$\gamma_{opt}$
Karate	0.761	(0.024)	0.945	(0.018)	0.6	0.520	(0.042)	0.892	(0.030)	0.5	0.634	(0.051)	0.739	(0.067)	0.7
Dolphins	0.648	(0.021)	0.873	(0.069)	0.5	0.374	(0.037)	0.608	(0.133)	0.1	0.515	(0.039)	0.515	(0.039)	1.0
Football	0.970	(0.010)	0.992	(0.004)	1.7	0.722	(0.063)	0.903	(0.036)	1.7	0.923	(0.016)	0.969	(0.008)	1.7
Political books	0.828	(0.024)	0.845	(0.005)	0.8	0.609	(0.055)	0.654	(0.009)	0.8	0.542	(0.024)	0.560	(0.011)	0.8
Political blogs	0.883	(0.004)	0.901	(0.001)	0.7	0.782	(0.006)	0.818	(0.001)	0.7	0.635	(0.007)	0.678	(0.007)	0.8
Eu-core	0.862	(0.020)	0.932	(0.004)	1.4	0.217	(0.022)	0.348	(0.014)	1.4	0.576	(0.018)	0.656	(0.009)	1.4
Cora	0.941	(0.002)	0.964	(0.001)	2.0	0.125	(0.005)	0.146	(0.004)	2.0	0.457	(0.005)	0.494	(0.004)	2.0
AS	0.819	(0.003)	0.823	(0.001)	1.8	0.190	(0.026)	0.258	(0.013)	0.6	0.488	(0.007)	0.489	(0.010)	0.8
LFR-0.4	0.999	(0.001)	1.000	(0.000)	2.8	0.965	(0.037)	1.000	(0.000)	2.8	0.994	(0.003)	1.000	(0.000)	2.8
LFR-0.5	0.996	(0.002)	1.000	(0.000)	3.0	0.861	(0.078)	0.997	(0.007)	3.0	0.981	(0.007)	1.000	(0.001)	3.0
LFR-0.6	0.984	(0.008)	0.999	(0.000)	3.6	0.614	(0.117)	0.971	(0.010)	3.6	0.940	(0.020)	0.992	(0.002)	3.6
LFR-0.7	0.911	(0.014)	0.978	(0.001)	3.8	0.089	(0.024)	0.320	(0.032)	3.6	0.388	(0.051)	0.678	(0.024)	3.8

Table 3: PPM algorithm, default value is $\gamma_{0}=1$, standard deviation is given in brackets

	Rand					Jaccard					NMI
Dataset	Default		Tuned		$\gamma_{opt}$	Default		Tuned		$\gamma_{opt}$	Default		Tuned		$\gamma_{opt}$
Karate	0.756	(0.024)	0.782	(0.041)	0.8	0.509	(0.040)	0.487	(0.000)	0.1	0.629	(0.050)	0.628	(0.049)	1.0
Dolphins	0.622	(0.025)	0.761	(0.043)	0.7	0.330	(0.042)	0.815	(0.189)	0.1	0.466	(0.045)	0.411	(0.024)	1.6
Football	0.969	(0.007)	0.992	(0.004)	1.6	0.716	(0.041)	0.901	(0.040)	1.6	0.923	(0.011)	0.969	(0.008)	1.6
Political books	0.780	(0.016)	0.845	(0.008)	0.7	0.481	(0.038)	0.647	(0.016)	0.7	0.498	(0.015)	0.566	(0.014)	0.7
Political blogs	0.649	(0.025)	0.724	(0.039)	0.4	0.315	(0.022)	0.471	(0.037)	0.4	0.287	(0.025)	0.328	(0.033)	0.6
Eu-core	0.800	(0.021)	0.774	(0.024)	0.9	0.099	(0.012)	0.091	(0.011)	0.9	0.529	(0.018)	0.490	(0.016)	0.8
Cora	0.936	(0.003)	0.959	(0.001)	2.0	0.115	(0.004)	0.130	(0.004)	2.0	0.470	(0.005)	0.500	(0.003)	2.0
AS	0.793	(0.012)	0.815	(0.004)	1.8	0.113	(0.013)	0.152	(0.031)	0.8	0.459	(0.020)	0.459	(0.018)	1.2
LFR-0.4	1.000	(0.001)	1.000	(0.000)	2.8	0.987	(0.024)	0.995	(0.020)	2.8	0.998	(0.002)	1.000	(0.001)	2.8
LFR-0.5	0.996	(0.005)	0.999	(0.001)	3.0	0.877	(0.133)	0.961	(0.053)	3.0	0.989	(0.013)	0.995	(0.007)	3.0
LFR-0.6	0.966	(0.024)	0.991	(0.005)	3.2	0.438	(0.201)	0.667	(0.090)	3.2	0.847	(0.107)	0.911	(0.043)	3.0
LFR-0.7	0.801	(0.027)	0.966	(0.009)	2.8	0.026	(0.021)	0.148	(0.070)	2.8	0.181	(0.109)	0.510	(0.109)	2.8

Table 4: ILFR algorithm, default value is $\mu_{0}=0.3$, standard deviation is given in brackets

	Rand					Jaccard					NMI
Dataset	Default		Tuned		$\mu_{opt}$	Default		Tuned		$\mu_{opt}$	Default		Tuned		$\mu_{opt}$
Karate	0.754	(0.026)	0.854	(0.040)	0.15	0.507	(0.040)	0.741	(0.073)	0.05	0.633	(0.062)	0.633	(0.062)	0.30
Dolphins	0.583	(0.009)	0.623	(0.026)	0.15	0.254	(0.018)	0.556	(0.000)	0.00	0.454	(0.019)	0.264	(0.000)	1.00
Football	0.992	(0.004)	0.993	(0.002)	0.45	0.906	(0.033)	0.912	(0.020)	0.45	0.970	(0.007)	0.971	(0.004)	0.45
Political books	0.725	(0.015)	0.818	(0.011)	0.15	0.354	(0.038)	0.591	(0.026)	0.15	0.451	(0.014)	0.528	(0.017)	0.15
Political blogs	0.774	(0.025)	0.854	(0.037)	0.15	0.569	(0.049)	0.728	(0.044)	0.15	0.440	(0.014)	0.531	(0.035)	0.20
Eu-core	0.886	(0.019)	0.944	(0.006)	0.50	0.233	(0.028)	0.369	(0.024)	0.50	0.644	(0.020)	0.712	(0.012)	0.50
Cora	0.978	(0.000)	0.977	(0.000)	0.05	0.062	(0.002)	0.097	(0.002)	0.05	0.550	(0.001)	0.432	(0.007)	0.00
AS	0.826	(0.000)	0.826	(0.000)	1.00	0.021	(0.000)	0.183	(0.001)	0.00	0.444	(0.001)	0.420	(0.000)	1.00
LFR-0.4	1.000	(0.000)	1.000	(0.000)	0.40	1.000	(0.000)	1.000	(0.000)	0.40	1.000	(0.000)	1.000	(0.000)	0.40
LFR-0.5	1.000	(0.000)	1.000	(0.000)	0.35	0.998	(0.010)	0.997	(0.013)	0.35	1.000	(0.001)	1.000	(0.001)	0.35
LFR-0.6	0.999	(0.004)	0.999	(0.002)	0.25	0.957	(0.084)	0.968	(0.057)	0.25	0.993	(0.010)	0.995	(0.003)	0.25
LFR-0.7	0.972	(0.019)	0.981	(0.007)	0.35	0.347	(0.131)	0.341	(0.119)	0.30	0.742	(0.058)	0.741	(0.064)	0.30

Equations

$$Q(\mathcal{C},G,\gamma)=\frac{m_{in}}{m}-\frac{\gamma}{4m^{2}}\sum_{C\in\mathcal{C}}D(C)^{2},\qquad(1)$$

$$\theta_{opt}=\operatorname*{arg\,max}_{\theta}Q(\mathcal{C}^{\prime}_{\mathcal{A}_{\theta}},\mathcal{C}^{\prime}_{GT}),\qquad(2)$$

$$\theta_{opt}=\operatorname*{arg\,max}_{\theta}\frac{1}{n_{runs}}\sum_{i=1}^{n_{runs}}Q(\mathcal{C}^{\prime}_{\mathcal{A}_{\theta},i},\mathcal{C}^{\prime}_{GT}),\qquad(3)$$

Full text

Moscow Institute of Physics and Technology, Dolgoprudny, Russia

Yandex, Moscow, Russia

Abstract

Community detection is one of the most important and challenging problems in network analysis. However, real-world networks may have very different structural properties and communities of various nature. As a result, it is hard (or even impossible) to develop one algorithm suitable for all datasets. A standard machine learning tool is to consider a parametric algorithm and choose its parameters based on the dataset at hand. However, this approach is not applicable to community detection since usually no labeled data is available for such parameter tuning. In this paper, we propose a simple and effective procedure for tuning the hyperparameters of any given community detection algorithm without requiring any labeled data. The core idea is to generate a synthetic network with properties similar to a given real-world one, but with known communities. It turns out that tuning parameters on such a synthetic graph also improves the quality for a given real-world network. To illustrate the effectiveness of the proposed algorithm, we show significant improvements obtained for several well-known parametric community detection algorithms on a variety of synthetic and real-world datasets.

Keywords: Community detection · Parameter tuning · Hyperparameters · LFR benchmark

1 Introduction

Community structure, which is one of the most important properties of complex networks, is characterized by the presence of groups of vertices (called communities or clusters) that are better connected to each other than to the rest of the network. In social networks, communities are formed based on common interests or on geographical location; on the Web, pages are clustered based on their topics; in protein-protein interaction networks, clusters are formed by proteins having the same specific function within the cell, and so on. Being able to identify communities is important for many applications: recommendations in social networks, graph compression, graph visualization, etc.

The problem of community detection has several peculiarities that make it hard to formalize and, consequently, hard to solve well. First, as pointed out in several papers, there is no universal definition of communities [9]. As a result, there are no standard procedures for comparing the performance of different algorithms. Second, real-world networks may have very different structural properties and communities of various nature. Hence, it is impossible to develop one algorithm suitable for all datasets, as discussed in, e.g., [23]. A standard machine learning tool applied in such cases is to consider a parametric algorithm and tune its parameters based on the given dataset. Parameters that have to be chosen by the user based on the observed data are usually called hyperparameters and are often tuned via cross-validation, but this procedure requires a training portion of the dataset with ground truth labels. However, the problem of community detection is unsupervised, i.e., no ground truth community assignments are given, so standard tuning approaches are not applicable and community detection algorithms are often non-parametric.

We present a surprisingly simple and effective method for tuning hyperparameters of any community detection algorithm which requires no labeled data and chooses suitable parameters based only on the structural properties of a given graph. The core idea is to generate a synthetic network with properties similar to a given real-world one, but with known community assignments, hence we can optimize the hyperparameters on this synthetic graph and then apply the obtained algorithm to the original real-world network. It turns out that such a trick significantly improves the performance of the initial algorithm.

To demonstrate the effectiveness and the generalization ability of the proposed approach, we applied it to three different algorithms on various synthetic and real-world networks. In all cases, we obtained substantial improvements compared to the algorithms with default parameters. However, since communities in real-world networks cannot be formally defined, it is impossible to provide any theoretical guarantees for those parameter tuning strategies which do not use labeled data. As a result, the quality of any parameter tuning algorithm can be demonstrated only empirically. Based on the excellent empirical results obtained, we believe that the proposed approach captures some intrinsic properties of real-world communities and would generalize to other datasets and algorithms.

2 Background and related work

During the past few years, many community detection algorithms have been proposed, see [6, 7, 9, 17] for an overview. In this section, we take a closer look at the algorithms and concepts used in the current research.

2.1 Modularity

Let us start with some notation. We are given a graph $G=(V,E)$, where $V$ is a set of $n$ vertices and $E$ is a set of $m$ undirected edges. Denote by $\mathcal{C}$ a partition of $V$ into several disjoint communities: $\mathcal{C}=\{C_{1},\ldots,C_{k}\}$. Also, let $m_{in}$ and $m_{out}$ be the numbers of intra- and inter-cluster edges in a graph $G$ partitioned according to $\mathcal{C}$. Finally, $d(i)$ denotes the degree of a vertex $i$ and $D(C)=\sum_{i\in C}d(i)$ is the overall degree of a community $C\in\mathcal{C}$.

Modularity is a widely used measure optimized by many community detection algorithms. It was first proposed in [21] and is defined as follows

$$Q(\mathcal{C},G,\gamma)=\frac{m_{in}}{m}-\frac{\gamma}{4m^{2}}\sum_{C\in\mathcal{C}}D(C)^{2},\qquad(1)$$

where $\gamma$ is a resolution parameter [13]. The intuition behind modularity is the following: the first term in (1) is the fraction of intra-cluster edges, which is expected to be relatively high for good partitions, while the second term penalizes partitions whose communities are too large. Namely, the value $\frac{\sum_{C\in\mathcal{C}}D(C)^{2}}{{4m^{2}}}$ is the expected fraction of intra-cluster edges if we preserve the degree sequence but connect all vertices randomly, i.e., if we assume that our graph is constructed according to the configuration model [19].
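To make the two terms of (1) concrete, the following minimal Python sketch computes modularity with a resolution parameter for an unweighted, undirected graph given as an edge list. The function name `modularity` and the toy graph are illustrative assumptions, not part of the paper or of any library.

```python
from collections import defaultdict


def modularity(edges, community_of, gamma=1.0):
    """Return m_in/m - gamma/(4 m^2) * sum_C D(C)^2 for an undirected edge list.

    `edges` is a list of (u, v) pairs and `community_of` maps each vertex
    to a community label (both are hypothetical input conventions).
    """
    m = len(edges)
    m_in = 0
    total_degree = defaultdict(int)  # D(C) for every community C
    for u, v in edges:
        if community_of[u] == community_of[v]:
            m_in += 1
        total_degree[community_of[u]] += 1
        total_degree[community_of[v]] += 1
    penalty = sum(d * d for d in total_degree.values()) / (4 * m * m)
    return m_in / m - gamma * penalty


# Toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community_of, gamma=1.0)
```

Here $m=7$, $m_{in}=6$ and $D(C)=7$ for both communities, so the second term equals $\gamma\cdot 0.5$; increasing `gamma` lowers the score of this two-community partition relative to finer ones.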

Modularity was originally introduced with $\gamma=1$ and many community detection algorithms maximizing this measure were proposed. However, it was shown in [8] that modularity has a resolution limit, i.e., algorithms based on modularity maximization are unable to detect communities smaller than some size. Adding a resolution parameter makes it possible to overcome this problem: larger values of $\gamma$ in general lead to smaller communities. However, tuning $\gamma$ is a challenging task. In this paper, we propose a solution to this problem.

2.2 Modularity optimization and Louvain algorithm

Many community detection algorithms are based on modularity optimization. In this paper, as one of our base algorithms, we choose arguably the most well-known and widely used greedy algorithm, called Louvain [4]. It starts with each vertex forming its own community and works in several phases. To create the first level of a partition, we iterate through all vertices and for each vertex $v$ we compute the gain in modularity obtained by removing $v$ from its community and placing it in each of its neighboring communities; then we move $v$ to the community with the largest gain, if it is positive. When modularity cannot be improved by such local moves, the first level is formed. After that, we replace the obtained communities by supervertices connected by weighted edges; the weight between two supervertices is equal to the number of edges between the vertices of the corresponding communities. Then we repeat the process described above with the supervertices and form the second level of a partition. After that, we merge the supervertices again, and so on, as long as modularity increases. The Louvain algorithm is quite popular since it is fast and was shown to provide partitions of good quality. However, by default, it optimizes modularity with $\gamma=1$ and therefore suffers from the resolution limit.
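The first phase described above (local moves only, without the supervertex aggregation) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the graph is an unweighted adjacency dictionary, and the gain of placing an isolated vertex $v$ into community $C$ is taken as $k_{v,C}/m-\gamma\,d(v)D(C)/(2m^{2})$, which follows from the modularity definition with resolution $\gamma$, where $k_{v,C}$ is the number of edges from $v$ to $C$. The name `local_moving` is hypothetical.

```python
import random
from collections import defaultdict


def local_moving(adj, gamma=1.0, seed=0):
    """One Louvain-style local-moving phase on an unweighted graph.

    `adj` maps each vertex to a list of its neighbours. Returns a dict
    mapping each vertex to a community label (labels are vertex ids).
    """
    rng = random.Random(seed)
    m = sum(len(nb) for nb in adj.values()) / 2
    comm = {v: v for v in adj}            # every vertex starts alone
    tot = {v: len(adj[v]) for v in adj}   # D(C) for each community label
    improved = True
    while improved:
        improved = False
        order = list(adj)
        rng.shuffle(order)                # vertices are visited in random order
        for v in order:
            d_v = len(adj[v])
            links = defaultdict(int)      # edges from v to each neighbouring community
            for u in adj[v]:
                links[comm[u]] += 1
            old = comm[v]
            tot[old] -= d_v               # temporarily remove v from its community

            def gain(c):
                return links[c] / m - gamma * d_v * tot[c] / (2 * m * m)

            best, best_gain = old, gain(old)
            for c in links:
                g = gain(c)
                if g > best_gain + 1e-12:  # move only on a strict improvement
                    best, best_gain = c, g
            tot[best] += d_v
            if best != old:
                comm[v] = best
                improved = True
    return comm


# Toy graph: two triangles joined by the edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
partition = local_moving(adj, gamma=1.0, seed=0)
```

Because each accepted move strictly increases modularity, the loop terminates; the result generally depends on the random visiting order, which is why the tuning procedure averages over several runs.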

2.3 Likelihood optimization methods

Likelihood optimization algorithms are also widely used in community detection. Such methods are mathematically sound and have theoretical guarantees under some model assumptions [3]. The main idea is to assume some underlying random graph model parameterized by community assignments and find a partition $\mathcal{C}$ that maximizes the likelihood $P(G|\mathcal{C})$ , which is the probability that a graph generated according to the model with communities $\mathcal{C}$ exactly equals $G$ .

The standard random graph model assumed by likelihood maximization methods is the stochastic block model (SBM) or its simplified version — planted partition model (PPM). In these models, the probability that two vertices are connected by an edge depends only on their community assignments. Recently, the degree-corrected stochastic block model (DCSBM) together with the degree-corrected planted partition model (DCPPM) were proposed [12]. These models take into account the observed degree sequence of a graph, and, as a result, they are more realistic. It was also noticed that if we fix the parameters of DCPPM, then likelihood maximization based on this model is equivalent to modularity optimization with some $\gamma$ [22]. Finally, in a recent paper [24] the independent LFR model (ILFR) was proposed and analyzed. It was shown that ILFR gives a better fit for a variety of real-world networks [24]. In this paper, to illustrate the generalization ability of the proposed hyperparameter tuning strategy, in addition to the Louvain algorithm, we also use parametric likelihood maximization methods based on PPM and ILFR.

2.4 LFR model

Our parameter tuning strategy is based on constructing a synthetic graph structurally similar to the observed network. To do this, we use the LFR model [14], a well-known synthetic benchmark commonly used for comparing community detection algorithms. LFR generates a graph with power-law distributions of both degrees and community sizes in the following way. First, we generate the degrees of vertices by sampling them independently from the power-law distribution with exponent $\gamma_{d}$, mean $\bar{d}$ and maximum degree $d_{max}$. Then, using a mixing parameter $\hat{\mu}$, $0<\hat{\mu}<1$, we obtain internal and external degrees of vertices: we expect each vertex to share a fraction $1-\hat{\mu}$ of its edges with the vertices of its community and a fraction $\hat{\mu}$ with the other vertices of the network. After that, the sizes of the communities are sampled from a power-law distribution with exponent $\gamma_{C}$ and minimum and maximum community sizes $C_{min}$ and $C_{max}$, respectively. Then, vertices are assigned to communities such that the internal degree of any vertex is less than the size of its community. Finally, the configuration model [19] with rewiring steps is used to construct a graph with a given degree sequence and with the required fraction of internal edges. The detailed description of this procedure can be found in [14].
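The first two steps of this construction (sampling degrees and splitting them into internal and external parts) can be illustrated with a short Python sketch. It is not the LFR reference code: the helper `sample_power_law` is hypothetical, and the minimum degree is passed explicitly here as an assumption, whereas LFR chooses it so that the mean degree matches $\bar{d}$.

```python
import random


def sample_power_law(n, exponent, x_min, x_max, rng):
    """Draw n integers from a discrete power law p(x) ~ x^(-exponent) on [x_min, x_max]."""
    values = list(range(x_min, x_max + 1))
    weights = [x ** (-exponent) for x in values]
    return rng.choices(values, weights=weights, k=n)


rng = random.Random(0)
# gamma_d = 2.5 and d_max = 200 follow the synthetic setup in the paper;
# the minimum degree 5 is an illustrative assumption.
degrees = sample_power_law(1000, 2.5, 5, 200, rng)

mu_hat = 0.4
# Each vertex keeps roughly a (1 - mu_hat) fraction of its edges inside its community.
internal = [round((1 - mu_hat) * d) for d in degrees]
external = [d - i for d, i in zip(degrees, internal)]
```

The remaining steps (sampling community sizes, assigning vertices, and wiring edges with the configuration model plus rewiring) are described in [14] and are omitted from this sketch.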

3 Tuning parameters

Assume that we are given a graph $G$ and our aim is to find a partition $\mathcal{C}$ of its vertex set into disjoint communities. To do this, we have a community detection algorithm $\mathcal{A}_{\theta}$, where $\theta\in\Theta$ is a set of hyperparameters. Let $\theta_{0}$ be the default hyperparameters. Assume that we are also given a quality function $Q(\mathcal{C}_{\mathcal{A}_{\theta}},\mathcal{C}_{GT})$ that measures how well a partition $\mathcal{C}_{\mathcal{A}_{\theta}}$ obtained by $\mathcal{A}_{\theta}$ agrees with the ground truth partition $\mathcal{C}_{GT}$. Ideally, we would like to find $\bar{\theta}=\operatorname*{arg\,max}_{\theta}Q(\mathcal{C}_{\mathcal{A}_{\theta}},\mathcal{C}_{GT})$. However, we cannot do this since $\mathcal{C}_{GT}$ is not available. Therefore, we propose to construct a synthetic graph $G^{\prime}$ which has structural properties similar to $G$ and also has known community assignments. For this purpose, we use the LFR model described in Section 2.4. To apply this model, we have to define its parameters, which can be divided into graph-based ($n$, $\gamma_{d}$, $\bar{d}$, $d_{max}$) and community-based ($\gamma_{C}$, $C_{min}$, $C_{max}$, $\hat{\mu}$).

Graph-based parameters are easy to estimate:

• $n=|V(G)|$ is the number of vertices in the observed network;

• $\bar{d}=\frac{2|E(G)|}{n}$ is the average degree;

• $d_{max}$ is the maximum degree in $G$;

• $\gamma_{d}$ is the exponent of the power-law degree distribution; we estimate this parameter by fitting the power-law distribution to the cumulative degree distribution (we minimize the sum of the squared residuals in log-log scale).
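A minimal Python sketch of these four estimates is given below. The paper does not spell out the fitting convention, so the following is an assumption: the empirical complementary cumulative distribution $P(X\geq x)$ is regressed on $x$ in log-log scale by ordinary least squares, and for a density $p(x)\propto x^{-\gamma}$ its slope is $-(\gamma-1)$, hence $\gamma=1-\text{slope}$. The function names are hypothetical, and vertices with no edges are not visible in an edge list.

```python
import math
from collections import Counter


def fit_power_law_exponent(values):
    """Least-squares fit of log P(X >= x) against log x; returns the density exponent.

    Requires at least two distinct positive values.
    """
    n = len(values)
    counts = Counter(values)
    remaining = n                      # number of observations >= current x
    xs, ys = [], []
    for x in sorted(counts):
        if x > 0:
            xs.append(math.log(x))
            ys.append(math.log(remaining / n))
        remaining -= counts[x]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return 1.0 - slope


def graph_parameters(edges):
    """Estimate n, average degree, maximum degree and gamma_d from an edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    return {
        "n": n,
        "mean_degree": 2 * len(edges) / n,
        "max_degree": max(degree.values()),
        "gamma_d": fit_power_law_exponent(list(degree.values())),
    }


# Toy example: a star with centre 0 and three leaves.
params = graph_parameters([(0, 1), (0, 2), (0, 3)])
```

On very small graphs such as this toy star the fitted exponent is based on only a few points and should be treated as a rough estimate.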

Community-based parameters contain some information about the community structure, which is not known for the graph $G$ . However, we can try to approximate these parameters by applying the algorithm $\mathcal{A}_{\theta_{0}}$ with default parameters to $G$ . This would give us some partition $\mathcal{C}_{0}$ which can be used to estimate the remaining parameters:

• $\hat{\mu}=\frac{m_{out}}{m}$ is the mixing parameter, i.e., the fraction of inter-community edges in $G$ partitioned according to $\mathcal{C}_{0}$;

• $\gamma_{C}$ is the exponent of the power-law community size distribution; we estimate this parameter by fitting the power-law distribution to the cumulative community size distribution obtained from $\mathcal{C}_{0}$ (we minimize the sum of the squared residuals in log-log scale);

• $C_{min}$ and $C_{max}$ are the minimum and maximum community sizes in $\mathcal{C}_{0}$.
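As an illustration, the mixing parameter and the community size bounds can be read off a partition $\mathcal{C}_{0}$ with a few lines of Python. The function name and the input conventions (an edge list and a vertex-to-label dictionary) are assumptions made for this sketch; the exponent $\gamma_{C}$ would be fitted to the returned list of sizes and is not computed here.

```python
from collections import Counter


def community_parameters(edges, community_of):
    """Estimate mu_hat, C_min and C_max from a partition of the observed graph."""
    # m_out counts edges whose endpoints lie in different communities.
    m_out = sum(1 for u, v in edges if community_of[u] != community_of[v])
    sizes = Counter(community_of.values())
    return {
        "mu_hat": m_out / len(edges),
        "C_min": min(sizes.values()),
        "C_max": max(sizes.values()),
        "sizes": sorted(sizes.values()),  # input for fitting gamma_C
    }


# Two triangles joined by the edge (2, 3); vertex 3 is deliberately
# assigned to community "a" so that the partition is imperfect.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
c0 = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
estimates = community_parameters(edges, c0)
```

In this toy case the edges (3, 4) and (3, 5) cross communities, so $\hat{\mu}=2/7$, $C_{min}=2$ and $C_{max}=4$.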

We generate a graph $G^{\prime}$ according to the LFR model with parameters specified above. Using $G^{\prime}$ we can tune the parameters to get a better value of $\theta$ :

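$$\theta_{opt}=\operatorname*{arg\,max}_{\theta}Q(\mathcal{C}^{\prime}_{\mathcal{A}_{\theta}},\mathcal{C}^{\prime}_{GT}),\qquad(2)$$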

where $\mathcal{C}_{GT}^{\prime}$ is the known ground truth partition for $G^{\prime}$ and $\mathcal{C}^{\prime}_{\mathcal{A}_{\theta}}$ is a partition of $G^{\prime}$ obtained by $\mathcal{A}_{\theta}$. It turns out that this simple idea leads to a universal method for tuning $\theta$, which successfully improves the results of several algorithms $\mathcal{A}_{\theta}$ on a variety of synthetic and real-world datasets, as we show in Section 4.

The detailed description of the proposed procedure is given in Algorithm 1. Note that in addition to the general idea described above we also propose two modifications improving the robustness of the algorithm. The first one reduces the effect of randomness in the LFR benchmark: if the number of vertices in $G$ is small, then a network generated by the LFR model can be noisy and the optimal parameters $\theta_{opt}$ computed according to Equation (2) may vary from sample to sample. Hence, we propose to generate $n_{graphs}$ synthetic networks and take the median of the obtained parameters. The value $n_{graphs}$ depends on computational resources: larger values, obviously, lead to more stable results. Fortunately, as we discuss in Section 4.5.4, this effect of randomness is critical only for small graphs, so we do not have to increase computational complexity much for large datasets.

The second improvement accounts for a possible randomness of the algorithm $\mathcal{A}_{\theta}$ . If $\mathcal{A}_{\theta}$ includes some random steps, then we can increase the robustness of our procedure by running $\mathcal{A}_{\theta}$ several times and averaging the obtained qualities. The corresponding parameter is called $n_{runs}$ in Algorithm 1. Formally, in this case Equation (2) should be replaced by

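$$\theta_{opt}=\operatorname*{arg\,max}_{\theta}\frac{1}{n_{runs}}\sum_{i=1}^{n_{runs}}Q(\mathcal{C}^{\prime}_{\mathcal{A}_{\theta},i},\mathcal{C}^{\prime}_{GT}),\qquad(3)$$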

where $\mathcal{C}^{\prime}_{\mathcal{A}_{\theta},i}$ is a (random) partition obtained by $\mathcal{A}_{\theta}$ on $G^{\prime}$ . If $\mathcal{A}_{\theta}$ is deterministic, then it is sufficient to take $n_{runs}=1$ .
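The overall procedure (grid search on each synthetic graph, averaging over $n_{runs}$ runs, and taking the median over $n_{graphs}$ graphs) might be sketched as below for a single scalar hyperparameter. This is a schematic reading of the text, not Algorithm 1 verbatim: `generate_graph`, `detect` and `quality` are hypothetical callbacks standing in for the LFR generator with estimated parameters, the algorithm $\mathcal{A}_{\theta}$, and the similarity measure $Q$.

```python
import statistics


def tune(theta_grid, generate_graph, detect, quality, n_graphs=1, n_runs=1):
    """Return the median over synthetic graphs of the grid value maximising mean quality."""
    best_thetas = []
    for _ in range(n_graphs):
        graph, ground_truth = generate_graph()   # synthetic G' with known communities

        def mean_quality(theta):
            # Average over n_runs to smooth out randomness of the algorithm.
            scores = [quality(detect(graph, theta), ground_truth) for _ in range(n_runs)]
            return sum(scores) / n_runs

        best_thetas.append(max(theta_grid, key=mean_quality))
    return statistics.median(best_thetas)


# Toy stand-ins: the "partition" is just theta and quality peaks at 0.7.
grid = [i / 10 for i in range(21)]
theta_opt = tune(
    grid,
    generate_graph=lambda: (None, 0.7),
    detect=lambda graph, theta: theta,
    quality=lambda found, truth: -abs(found - truth),
    n_graphs=3,
    n_runs=2,
)
```

Note that `statistics.median` averages the two middle values when `n_graphs` is even, so the result may fall between grid points; an odd `n_graphs` avoids this.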

Note that for the sake of simplicity in Algorithm 1 we use grid search to approximately find $\theta_{opt}$ defined in (3). However, any other method of black-box optimization can be used instead, e.g., random search [2], Bayesian optimization [25], Gaussian processes [10], sequential model-based optimization [11], and so on. More advanced black-box optimization methods can significantly speed up the algorithm.

Let us discuss the time complexity of the proposed algorithm. If complexity of $\mathcal{A}_{\theta}$ is $f(G)$ , then complexity of Algorithm 1 is $O\left(f(G)\cdot l\cdot n_{runs}\cdot n_{graphs}\right)$ , where $l$ is the number of steps made by the black-box optimization (the complexity of generating $G^{\prime}$ is usually negligible compared with community detection). In other words, the complexity is $n_{runs}\cdot n_{graphs}$ times larger than the complexity of any black-box parameter optimization algorithm. However, as we discuss in Section 4.5.4, $n_{runs}$ and $n_{graphs}$ can be equal to one for large datasets.
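As a rough worked example of this bound, consider the settings reported in Section 4.4: a grid over $\gamma\in[0,2]$ with step 0.1 gives $l=21$ values, with $n_{runs}=n_{graphs}=10^{3}$ for the four smallest datasets and $n_{runs}=2$, $n_{graphs}=100$ for Cora and AS. The number of calls to $\mathcal{A}_{\theta}$ on synthetic graphs is then approximately

$$l\cdot n_{runs}\cdot n_{graphs}=21\cdot 10^{3}\cdot 10^{3}=2.1\cdot 10^{7}\quad\text{and}\quad 21\cdot 2\cdot 100=4200,$$

respectively. These counts leave out the initial run of $\mathcal{A}_{\theta_{0}}$ on $G$ used to obtain $\mathcal{C}_{0}$.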

Finally, note that it can be tempting to make several iterations of Algorithm 1 to further improve $\theta_{opt}$ . Namely, in Algorithm 1 we estimate community-based parameters of LFR using the partition $\mathcal{C}_{0}$ obtained with $\mathcal{A}_{\theta_{0}}$ . Then, we obtain better parameters $\theta_{opt}$ . These parameters can be further used to get a better partition using $\mathcal{A}_{\theta_{opt}}$ and this partition is expected to give even better community-based parameters. However, in our preliminary experiments, we did not notice significant improvements from using several iterations, therefore we propose to use Algorithm 1 as it is without increasing its computational complexity.

4 Experiments

4.1 Parametric algorithms

We use the following algorithms to illustrate the effectiveness of the proposed hyperparameter tuning strategy.

Louvain

This algorithm is described in Section 2.2; it has the resolution parameter $\gamma$ with default value $\gamma_{0}=1$. We take the publicly available implementation from [24] (https://github.com/altsoph/community_loglike), where the algorithm is called DCPPM since modularity maximization is equivalent to likelihood optimization for the DCPPM random graph model.

PPM

This algorithm is based on likelihood optimization for PPM (see Section 2.3). We use the publicly available implementation proposed in [24], where the Louvain algorithm is used as a basis to optimize the likelihood for several models. Since likelihood optimization for PPM is equivalent to maximizing a simplified version of modularity based on the Erdős–Rényi model instead of the configuration model [22], the PPM algorithm also has a resolution parameter $\gamma$ with the default value $\gamma_{0}=1$.

ILFR

This is a likelihood optimization algorithm based on the ILFR model (see Section 2.3). Again, we use the publicly available implementation from [24]. The ILFR algorithm has one parameter $\mu$, called the mixing parameter, and no default value for this parameter is proposed in the literature. In this paper, we take $\mu_{0}=0.3$, which is close to the average mixing parameter in the real-world datasets under consideration (see Section 4.2). Our experiments confirm that $\mu_{0}=0.3$ is a reasonable default value for this algorithm.

Let us stress that in this paper we are not aiming to develop the best community detection algorithm or to analyze all existing methods. Our main goal is to show that hyperparameter tuning is possible in the field of community detection. We use several base algorithms described above to illustrate the generalization ability of the proposed approach. For each algorithm, our aim is to improve its default parameter by our parameter tuning strategy.

4.2 Datasets

Synthetic networks

We generated several synthetic graphs according to the LFR benchmark described in Section 2.4 with $n=10^{4}$, $\gamma_{d}=2.5$, $\bar{d}=20$, $d_{max}=200$, $\gamma_{C}=1.5$, $C_{min}=50$, $C_{max}=500$, $\hat{\mu}\in\{0.4,0.5,0.6,0.7\}$. (Note that $\hat{\mu}>0.5$ does not mean the absence of community structure since usually a community is much smaller than the rest of the network and even if more than a half of the edges for each vertex go outside the community, the density of edges inside the community is still large.) On the one hand, one would expect results obtained on such synthetic datasets to be optimistic, since the same LFR model is used both to tune the parameters and to validate the performance of the algorithms. On the other hand, recall that the most important ingredient of the model, i.e., the distribution of community sizes, is not known and has to be estimated using the initial community detection algorithm, and incorrect estimates may negatively affect the final performance.

Real-world networks

We follow the work [24], where the authors collected and shared 8 real-world datasets publicly available in different sources (https://github.com/altsoph/community_loglike/tree/master/datasets). For all these datasets, the ground truth community assignments are available and the communities are non-overlapping. These networks are of various sizes and structural properties, see the description in Table 1.

4.3 Evaluation metrics

In the literature, there is no universally accepted metric for evaluating the performance of community detection algorithms. Therefore, we analyze several standard ones [7]. Namely, we use two widely used similarity measures based on counting correctly and incorrectly classified pairs of vertices: Rand and Jaccard indices. We also consider the Normalized Mutual Information (NMI) of two partitions: if NMI is close to 1, one needs a small amount of information to infer the ground truth partition from the obtained one, i.e., two partitions are similar.
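For reference, the three measures can be computed from two label vectors as in the following Python sketch. The pair-counting definitions of the Rand and Jaccard indices are standard; for NMI several normalizations exist, and the one used here, $2I(A;B)/(H(A)+H(B))$, is an assumption since the text does not state which variant is used. The quadratic pair enumeration is only meant for small examples.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(a, b):
    """Count vertex pairs by whether they share a community in a and/or in b."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(a, b):
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(a, b):
    both, only_a, only_b, _ = pair_counts(a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0  # convention when no pair is ever grouped


def nmi(a, b):
    """Mutual information normalised by the mean of the two entropies."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual = sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
                 for (x, y), c in joint.items())
    h_sum = entropy(count_a) + entropy(count_b)
    return 2 * mutual / h_sum if h_sum else 1.0


found = [0, 0, 1, 2]
truth = [0, 0, 1, 1]
scores = (rand_index(truth, found), jaccard_index(truth, found), nmi(truth, found))
```

In this toy case one of the six vertex pairs is grouped in `truth` but split in `found`, giving a Rand index of 5/6 and a Jaccard index of 1/2.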

4.4 Experimental setup

We apply the proposed strategy to the algorithms described in Section 4.1. We use the grid search to find the parameter $\theta_{opt}$ (we do this to make our results easier to reproduce and we also need this for the analysis of stability in Section 4.5.4). For ILFR we try $\mu$ in the range $[0,1]$ with step size 0.05 and for Louvain and PPM on real-world datasets we take $\gamma$ in the range $[0,2]$ with step size $0.1$ . Although we noticed that in some cases the optimal $\gamma$ for PPM and Louvain can be larger than 2, such cases rarely occur on real-world datasets. On synthetic graphs, we take $\gamma$ in the range $[0,4]$ (with step size 0.2) to demonstrate the behavior of $\gamma_{opt}$ depending on $\hat{\mu}$ .
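The three grids described here could be generated as follows. The helper `grid` is a hypothetical convenience; it indexes by integer step rather than repeatedly adding the step, so that floating-point error does not accumulate.

```python
def grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive, built from integer indices."""
    count = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(count + 1)]


mu_grid = grid(0, 1, 0.05)           # ILFR mixing parameter
gamma_grid_real = grid(0, 2, 0.1)    # Louvain and PPM, real-world datasets
gamma_grid_synth = grid(0, 4, 0.2)   # Louvain and PPM, synthetic graphs
```

Each of the three grids contains 21 candidate values.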

To guarantee stability and reproducibility of the obtained results, we choose a sufficiently large parameter $n_{runs}$ , although we noticed similar improvements with much smaller values. Namely, for Karate, Dolphins, Football, and Political books we take $n_{runs}=10^{3}$ , for Political blogs and Eu-core $n_{runs}=100$ , for Cora, AS, and synthetic networks $n_{runs}=2$ . We take $n_{graphs}=10^{3}$ for four smallest datasets and $n_{graphs}=100$ for the other ones (we choose such large values to plot the histograms on Figure 1).

Finally, note that it is impossible to measure the statistical significance of obtained improvements on real-world datasets since we have only one copy for each graph. However, we can account for the randomness included in the algorithms. Namely, Louvain, PPM, and ILFR are randomized, since at each iteration they order the vertices randomly. Therefore, to measure if $\theta_{opt}$ is significantly better or worse than $\theta_{0}$ , we can run each algorithm several times and then apply the unpaired t-test (we use 100 runs in all cases).
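As an illustration of such a comparison, the sketch below computes the test statistic for two independent samples of quality scores. The text only says "unpaired t-test", so the choice of Welch's unequal-variance variant is an assumption; converting the statistic to a p-value needs the Student t distribution, which is not in the Python standard library and is left out.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)   # unbiased sample variance
    var_b = statistics.variance(sample_b)
    se_sq = var_a / n_a + var_b / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df


# Toy scores standing in for repeated runs with default and tuned parameters.
default_scores = [1.0, 2.0, 3.0]
tuned_scores = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(default_scores, tuned_scores)
```

A negative `t_stat` here indicates that the first sample has the lower mean.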

4.5 Results

In this section, we first discuss the improvements obtained for each algorithm and then analyze the stability of the parameter tuning strategy and the effect of the parameter $n_{graphs}$ .

4.5.1 Louvain algorithm

In Table 2, for each similarity measure we present the value for the baseline algorithm (with $\gamma=1$ ), the value for the tuned algorithm, and the obtained parameter $\gamma_{opt}$ . Since Louvain is randomized, we provide the mean value together with an estimate of the standard deviation, which is given in brackets. The number of runs used to compute these values depends on the size of the dataset and on the available computational resources: $10^{4}$ for Karate, Dolphins, Football and Political books, $10^{3}$ for Political blogs and Eu-core, 100 for Cora, AS and synthetic datasets.

One can see that our tuning strategy improves (or does not change) the results in all cases and the obtained improvements can be huge. For example, on Karate we obtain remarkable improvements from $0.761$ to $0.945$ (relative change is $24\%$) according to Rand and from $0.52$ to $0.892$ ($72\%$) according to Jaccard; on Dolphins we get a $35\%$ improvement for Rand and $63\%$ for Jaccard; on Football we gain $25\%$ for Jaccard; and so on. As discussed in Section 4.4, we measured the statistical significance of the obtained improvements. The results which are significantly better are marked in bold in Table 2. On real-world datasets, all improvements except the one for NMI on AS are statistically significant (p-value $\ll 0.01$). (Footnote 4: The results in Tables 2-4 are rounded to three decimals, so there may be a statistically significant improvement even when the numbers in the table are equal. Also, standard deviation less than 0.0005 is rounded to zero.) Let us note that in many cases the results of the tuned algorithm are much better than the best results reported in [24], where the authors used other strategies for choosing the hyperparameter values. (Footnote 5: For small datasets, our results for the default Louvain algorithm may differ from the ones reported in [24]. The reason is the high values of standard deviation. The authors of [24] averaged the results over 5 runs of the algorithm, while we use more runs, i.e., our average values are more stable.)
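The relative changes quoted for Karate follow directly from the baseline and tuned values. A short check, where the helper name `relative_change_percent` is an illustrative assumption:

```python
def relative_change_percent(baseline, tuned):
    """Relative improvement of the tuned value over the baseline, in percent."""
    return 100.0 * (tuned - baseline) / baseline


rand_gain = relative_change_percent(0.761, 0.945)     # Karate, Rand index
jaccard_gain = relative_change_percent(0.52, 0.892)   # Karate, Jaccard index
print(round(rand_gain), round(jaccard_gain))  # prints: 24 72
```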

For synthetic datasets, we also observe huge improvements and all of them are statistically significant. While for $\hat{\mu}\in\{0.4,0.5\}$ the default algorithm can be considered good enough, for large values of $\hat{\mu}$, $\hat{\mu}\in\{0.6,0.7\}$, the tuned one is much better. For example, for LFR-0.7 the tuned parameter gives a Jaccard index almost 4 times larger than the default one.

We noticed that for most of the datasets the values of $\gamma_{opt}$ computed using different similarity measures are the same or close to each other. However, there are some exceptions. The first one is Dolphins, where for Jaccard $\gamma_{opt}=0.1$, for Rand $\gamma_{opt}=0.5$, and for NMI $\gamma_{opt}=1.0$. We checked that if we take the median value $\gamma_{opt}=0.5$, then for all measures we obtain statistically significant improvements, which seems to be another way to increase the stability of our strategy. The most notable case where $\gamma_{opt}$ differs significantly across similarity measures is the AS dataset, where $\gamma_{opt}=1.8>\gamma_{0}$ for Rand, $\gamma_{opt}=0.6<\gamma_{0}$ for Jaccard, and $\gamma_{opt}=0.8<\gamma_{0}$ for NMI. We will further make similar observations for other algorithms on this dataset. Such instability may mean that this dataset does not have a clear community structure (which can sometimes be the case for real-world networks [18]).
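The median rule mentioned for Dolphins amounts to a one-line aggregation over the per-measure optima. A minimal sketch, with the dictionary name chosen only for illustration:

```python
from statistics import median

# Values of gamma_opt reported above for Louvain on Dolphins.
gamma_opt_dolphins = {"Jaccard": 0.1, "Rand": 0.5, "NMI": 1.0}

gamma_median = median(gamma_opt_dolphins.values())
print(gamma_median)  # prints: 0.5
```

With three similarity measures the median is simply the middle value, so a single outlying optimum does not affect the chosen parameter.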

4.5.2 PPM algorithm

For PPM (Table 3), our strategy improves the original algorithm for all real-world datasets but Eu-core (for all similarity measures), Karate (only for Jaccard), and Dolphins (only for NMI). Note that Karate and Dolphins are the only datasets (except for AS, which will be discussed further in this section) where the obtained values for $\gamma_{opt}$ are quite different for different similarity measures. We checked that if for these two datasets we take the median value of $\gamma_{opt}$ (0.8 for Karate and 0.7 for Dolphins), then we obtain improvements in all six cases, five of which (all except NMI on Karate) are statistically significant (p-value $\ll 0.01$). On Eu-core the quality of PPM with $\gamma_{0}=1$ is worse than the quality of Louvain with $\gamma=1$. This seems to be the reason why PPM chooses a suboptimal parameter $\gamma_{opt}$: a partition obtained by PPM does not allow for a good estimate of the community-based parameters. As for Louvain, in many cases the obtained improvements are huge: e.g., the relative improvement for the Jaccard index is 147% on Dolphins, 26% on Football, 35% on Political books, 50% on Political blogs, and so on. All improvements are statistically significant.

We also improve the default algorithm on all synthetic datasets and for all similarity measures. As for the Louvain algorithm, the improvements are especially huge for large $\hat{\mu}$ , $\hat{\mu}\in\{0.6,0.7\}$ . All improvements are statistically significant.

4.5.3 ILFR algorithm

For real-world datasets, in almost all cases, we obtain significant improvements (see Table 4). One exception is Dolphins for NMI. This, again, can be fixed by taking the median of the values $\mu_{opt}$ obtained for all similarity measures on this dataset: $\mu_{opt}=0.15$ improves the results compared to $\mu_{0}=0.3$ for all three measures. Other unfavorable cases are Cora and AS, where Rand and NMI decrease, while Jaccard increases. For all other datasets, we obtain improvements. In many cases, the difference is huge and statistically significant. On synthetic datasets, the default ILFR algorithm is the best among the considered ones. In some cases, however, the default algorithm is further improved by our hyperparameter tuning strategy, while in others the difference is not statistically significant. Surprisingly, for large values of $\hat{\mu}$ the tuned value $\mu_{opt}$ is much smaller than $\hat{\mu}$. For example, for $\hat{\mu}=0.6$ we get $\mu_{opt}=0.25$, although we checked that the estimated parameter used for generating synthetic graphs is very close to $0.6$.

For real-world and synthetic networks, the obtained value $\mu_{opt}$ can be both larger and smaller than $\mu_{0}=0.3$ . Also, for synthetic networks, $\mu_{0}$ is close to the obtained $\mu_{opt}$ . We conclude that the chosen default value is reasonable.

In rare cases, $\mu_{opt}$ for a dataset can be quite different for different similarity measures. On AS, $\mu_{opt}=0$ for Jaccard and $\mu_{opt}=1$ for Rand and NMI. Note that if $\mu=0$ , then the obtained algorithm tends to group all vertices in one cluster, while for $\mu=1$ all vertices form their own clusters. Interestingly, for the Jaccard index, such a trivial partition outperforms the default algorithm. Moreover, the algorithm putting each vertex in its own cluster has close to the best performance according to the Rand index compared to all algorithms discussed in this section (both default and tuned). We conclude that AS does not have a clear community structure.
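Why the two trivial partitions can score well under different measures is easy to see on a toy example. The sketch below uses the standard pair-counting definitions of the Rand and Jaccard indices, which are assumed to match the ones used in the paper; the function names and the toy ground truth are illustrative only.

```python
from itertools import combinations


def pair_counts(truth, predicted):
    """Count vertex pairs by whether they share a cluster in each partition."""
    both = truth_only = predicted_only = neither = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_predicted = predicted[i] == predicted[j]
        if same_truth and same_predicted:
            both += 1
        elif same_truth:
            truth_only += 1
        elif same_predicted:
            predicted_only += 1
        else:
            neither += 1
    return both, truth_only, predicted_only, neither


def rand_index(truth, predicted):
    both, truth_only, predicted_only, neither = pair_counts(truth, predicted)
    return (both + neither) / (both + truth_only + predicted_only + neither)


def jaccard_index(truth, predicted):
    both, truth_only, predicted_only, _ = pair_counts(truth, predicted)
    denominator = both + truth_only + predicted_only
    return both / denominator if denominator else 1.0


truth = [0, 0, 1, 1, 2, 2, 3, 3]       # toy ground truth: four clusters of size two
one_cluster = [0] * len(truth)         # all vertices grouped together
singletons = list(range(len(truth)))   # every vertex in its own cluster

print(rand_index(truth, singletons), jaccard_index(truth, singletons))
print(rand_index(truth, one_cluster), jaccard_index(truth, one_cluster))
```

On this toy input the singleton partition reaches a Rand index of 24/28 because most pairs are separated in both partitions, yet its Jaccard index is 0. The single-cluster partition has a low Rand index (4/28) but a positive Jaccard index (4/28), so the two measures prefer opposite trivial partitions.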

4.5.4 Stability of generated graphs

As discussed in Section 3, there are two sources of possible noise in the proposed parameter tuning procedure: 1) for small graphs the generated LFR network can be noisy, which may lead to unstable predictions of $\theta_{opt}$ , 2) the randomness of $\mathcal{A}$ may also affect the estimate of $\theta_{opt}$ in Equation (3). The effect of the second problem can be understood using Tables 2-4, where the standard deviations for $\theta_{0}$ and $\theta_{opt}$ are presented.

To analyze the effect of noise caused by the randomness in LFR graphs and to show that it is more pronounced for small datasets, we looked at the distribution of the parameters $\theta_{opt}$ obtained for different samples of generated graphs. We demonstrate this effect using the Louvain algorithm and the NMI similarity measure (see Figure 1); we take $n_{graphs}=10^{3}$ for the four smallest datasets and $n_{graphs}=100$ for the other ones. Except for the AS dataset, which is noisy according to all our experiments, one can clearly see that the variance of $\gamma_{opt}$ decreases when $n$ increases. As a result, we see that for large datasets even $n_{graphs}=1$ already provides a good estimate for $\gamma_{opt}$.

5 Conclusion

We proposed and analyzed a surprisingly simple yet effective algorithm for hyperparameter tuning in community detection. The core idea is to generate a synthetic graph structurally similar to the observed network but with known community assignments. Using this graph, we can apply any standard black-box optimization strategy to approximately find the optimal hyperparameters and use them to cluster the original network. We empirically demonstrated that such a trick applied to several algorithms leads to significant improvements on both synthetic and real-world datasets. Now, being able to tune parameters of any community detection algorithm, one can develop and successfully apply parametric community detection algorithms, which was not previously possible.

Acknowledgements

This study was funded by the Russian Foundation for Basic Research according to the research project 18-31-00207 and Russian President grant supporting leading scientific schools of the Russian Federation NSh-6760.2018.1.

Bibliography

[1] Adamic, L.A., Glance, N.: The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on Link discovery. pp. 36–43. ACM (2005)
[2] Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
[3] Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences 106(50), 21068–21073 (2009)
[4] Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008(10), P10008 (2008)
[5] Boguñá, M., Papadopoulos, F., Krioukov, D.: Sustaining the internet with hyperbolic mapping. Nature communications 1, 62 (2010)
[6] Coscia, M., Giannotti, F., Pedreschi, D.: A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining: The ASA Data Science Journal 4(5), 512–546 (2011)
[7] Fortunato, S.: Community detection in graphs. Physics reports 486(3), 75–174 (2010)
[8] Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proceedings of the National Academy of Sciences 104(1), 36–41 (2007)