Discovering Nested Communities
Nikolaj Tatti, Aristides Gionis

TL;DR
This paper introduces a method for discovering nested communities in graphs, addressing the challenge of ambiguous community structures by finding a sequence of increasingly dense communities containing a starting set.
Contribution
It proposes a novel approach to identify nested communities, dividing the problem into ordering and community detection, with empirical and theoretical validation of the heuristic used.
Findings
Efficient algorithm for fixed vertex order
Heuristic for ordering shows good empirical performance
Theoretical support for the ordering heuristic
| Name | Vertices | Edges | Time | |
|---|---|---|---|---|
| Adjnoun | 112 | 425 | 2ms | 84 |
| Dolphins | 62 | 159 | 1ms | 41 |
| Karate | 34 | 78 | 1ms | 21 |
| Lesmis | 77 | 254 | 2ms | 37 |
| Polblogs | | | 84ms | 872 |
| DBLP | | | 23s | |

| 1. segment | D. Johnson | E. Dahlhaus | V. Vianu | G. Gottlob | A. Itai |
| M. Yannakakis | M. Garey | P. Crescenzi | P. Kanellakis | M. Sideri | A. Schäffer |
| F. Afrati | R. Karp | P. Seymour | S. Abiteboul | E. Koutsoupias | A. Aho |
| 2. segment | R. Fagin | O. Vornberger | A. Piccolboni | C. Daskalakis | P. Serafini |
| J. Ullman | 3. segment | M. Blum | D. Goldman | X. Deng | P. Raghavan |
| Y. Sagiv | G. Papageorgiou | K. Ross | E. Arkin | P. Goldberg | P. Bernstein |
| S. Cosmadakis | V. Vazirani | P. Kolaitis | I. Diakonikolas | T. Hadzilacos |
Helsinki Institute for Information Technology
Department of Information and Computer Science
Aalto University
{nikolaj.tatti,aristides.gionis}@aalto.fi
Abstract
Finding communities in graphs is one of the most well-studied problems in data mining and social-network analysis. In many real applications, the underlying graph does not have a clear community structure. In those cases, selecting a single community turns out to be a fairly ill-posed problem, as the optimization criterion has to make a difficult choice between selecting a tight but small community or a more inclusive but sparser community.
In order to avoid the problem of selecting only a single community we propose discovering a sequence of nested communities. More formally, given a graph and a starting set, our goal is to discover a sequence of communities all containing the starting set, and each community forming a denser subgraph than the next. Discovering an optimal sequence of communities is a complex optimization problem, and hence we divide it into two subproblems: 1) discover the optimal sequence for a fixed order of graph vertices, a subproblem that we can solve efficiently, and 2) find a good order. We employ a simple heuristic for discovering an order and we provide empirical and theoretical evidence that our order is good.
Keywords:
community discovery, monotonic segmentation, graph mining, nested communities
1 Introduction
Discovering communities, tightly connected subgraphs, is one of the most well-studied problems in the field of graph mining. Given some optimization criterion, discovering a community is a computationally challenging task, typically NP-hard. Additionally, as pointed out by Leskovec et al. [17], in many real applications the underlying graph does not have a clear community structure. Such cases make the community-finding problem inherently ill-posed, as the optimization criterion has to make a difficult, and ultimately arbitrary, choice between selecting a tight but small community or a more inclusive but sparser community. Moreover, the existence of a universal criterion for making such a choice is unlikely, as the balance between the size and the density of the desired community will depend on the underlying application.
In order to avoid the problem of selecting only a single community, we propose a problem of discovering a sequence of nested communities. More formally, given a graph and a set of source vertices , our goal is to discover a sequence of communities around , such that each community is a subset of the next one. The first community will consist only of while the last community will contain the whole graph. Inner communities should be tighter than the outer communities. We express this requirement by computing the density of each community and require that the next community should have a lower density than the current community. In addition, we require that each community should be as uniform as possible. We measure uniformity by computing the variance of weights of the edges and requiring it to be small.
Discovering a sequence of communities by optimizing the uniformity criterion is a challenging problem. We will show that several optimization problems related to the optimal solution are NP-hard. Hence, we split the problem into two subproblems. We can view a community sequence as a bucket order on the vertices, each bucket consisting of vertices contained in the community and not contained in the previous community. Our first subproblem is to discover a total order on the vertices respecting the optimal bucket order. The second subproblem is to discover the optimal sequence of communities, given an order on the graph vertices. Fortunately, this subproblem can be formulated as a standard sequence-segmentation problem, and thus, it can be solved in polynomial time. In particular, we can solve this problem optimally in quadratic time or we can find an approximate solution in nearly-linear time. Discovering the order is more difficult as this is a complex combinatorial problem. We propose a simple ordering technique used for discovering dense subgraphs: pick iteratively a vertex with the lowest degree, and remove it from the graph. We provide theoretical evidence implying that this is a good order and we also show experimentally that this order outperforms several baselines.
The rest of the paper is organized as follows. We introduce preliminary notation in Section 2 and formalize our optimization problem in Section 3. In Section 4 we develop our discovery algorithm and point out theoretical properties of our approach. Section 5 is devoted to related work and Section 6 to experimental evaluation. We conclude in Section 7.
2 Preliminaries
We consider a weighted undirected graph over a set of vertices and edges . We use the notation to denote the set of unordered pairs of distinct vertices from . The function assigns a weight to each edge . Also, given a subset of vertices we denote by the set of edges in the induced subgraph of defined by .
The definitions and algorithms in this paper rely on a notion of edge density, which is defined not only over subsets of vertices, but also over arbitrary pairs of subsets of vertices. Even though it is conceptually simple, our edge-density definition requires slightly complex notation for determining the set of potential edges to be used as a denominator in the density ratio. To simplify our presentation we use the notation described below.
Given the graph , we consider its completed representation , where , and where is an extension of , so that if , and if . In other words, can be seen as a complete graph, where all non-edges of become zero-weight edges in . We note again that we use the completed graph representation only to simplify our notation; in our implementation there is no need to store the zero-weight edges.
Now consider the completed representation of a graph , and let be a non-empty subset of edges. We define the weight and density of as
[TABLE]
Consider now two subsets of vertices . We define the set of cross edges from to as . It is important to note that we do not impose any constraint on the sets and ; they may overlap in an arbitrary way. For instance, if the sets and are disjoint the edges in are the cut edges from to , while if the edge set contains, among others, all the edges within .
Finally, we write as a shorthand of and we write as a shorthand of .
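The notions of cross edges, weight, and density can be illustrated with a minimal Python sketch. It assumes that density is the total weight divided by the number of vertex pairs in the set, with missing pairs counted as zero-weight edges, as the completed representation suggests. The function names `cross_edges`, `weight`, and `density`, and the dictionary `w` keyed by `frozenset` pairs, are hypothetical and serve only this illustration.

```python
def cross_edges(X, Y):
    """Unordered pairs {x, y} of distinct vertices with x in X and y in Y.

    X and Y may overlap arbitrarily; when X == Y this is every pair inside X.
    """
    return {frozenset((x, y)) for x in X for y in Y if x != y}


def weight(pairs, w):
    """Total weight of a set of pairs; pairs missing from w are zero-weight edges."""
    return sum(w.get(p, 0.0) for p in pairs)


def density(pairs, w):
    """Average weight per pair (assumed definition: weight divided by number of pairs)."""
    if not pairs:
        raise ValueError("density is only defined for a non-empty set of pairs")
    return weight(pairs, w) / len(pairs)
```

For a path on vertices 1, 2, 3 with unit weights, the three pairs inside {1, 2, 3} have total weight 2, while the cut between {1} and {2, 3} consists of two pairs with total weight 1.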
3 Nested Communities
As we discussed in the introduction, our goal is to find the optimal sequence of nested communities, with respect to a set of source vertices of the input graph. We denote this set of source vertices by . For conceptual simplicity, one may think of as a singleton set, that is, identifying the sequence of nested communities for a single vertex. However, all our problem definitions, algorithms, and proofs, hold for the general case of being any subset of .
Our objective is to find nested communities, where the parameter is part of the problem input. Given a set of source vertices , we represent a sequence of nested communities with respect to , by the sequence of vertex sets .
Intuitively, the inner sets of the nested-community sequence are expected to be more strongly related to the source set . This type of relatedness is expressed by the notion of density. So, is the densest community that contains , is the second densest community, and in general, we require that the density of should decrease as increases.
Considering the requirement of monotonically decreasing density in isolation is not sufficient to determine in a well-defined manner a desirable sequence of nested communities. Indeed, given a graph , a set of source vertices , and integer , there is a potentially exponential number of ways to partition the set of vertices of the graph into a sequence of nested communities .
The main question we are facing is to decide where exactly to draw the boundary between each pair of communities and . To answer this question, we follow an approach inspired by segmentation problems. In particular, our approach is as follows: consider the set of vertices that need to be added to the community in order to form community . Consider also the set of edges , defined as the additional edges brought in by extending the community to the community . We can then define the density of the set of edges . To capture the intuition that the set should form a coherent extension to we require that the density of is as uniform as possible.
The uniformity of a set of edges can be expressed in many ways; we use the sum of squared differences between the weight of each edge and the average weight of the set. We thus have the following definition.
Definition 1
Given a set of edges , we define the density-uniformity score as
[TABLE]
Our goal is then to find a sequence of nested communities so that the successive segments of added edges are as uniform as possible with respect to their density. Formulating this objective as an optimization problem not only gives meaningful semantics to the nested community detection problem, but it also makes the problem well-defined. Motivated by the discussion above, our main problem definition is given below.
Problem 1
Given a weighted input graph , a set of source vertices , and an integer , find the sequence of nested communities that minimizes the density-uniformity score
[TABLE]
subject to the constraint for .
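To make the objective concrete, the following Python sketch evaluates a candidate sequence of nested communities. It rests on several assumptions that the text above only describes in words: each segment consists of the vertex pairs that appear when one community is extended to the next, the score of a segment is the sum of squared deviations from the segment's mean weight, and the constraint is a strict decrease of segment densities. The names `new_pairs` and `sequence_score` are hypothetical, and vertices are assumed to be sortable labels.

```python
from itertools import combinations


def new_pairs(prev, cur):
    """Pairs inside `cur` that are not already inside `prev` (one added segment)."""
    return [frozenset(p) for p in combinations(sorted(cur), 2)
            if not (p[0] in prev and p[1] in prev)]


def sequence_score(communities, w):
    """Return (total score, segment densities, feasibility) for nested communities.

    `communities` is a list of strictly growing vertex sets whose first element
    is the source set; `w` maps frozenset pairs to weights, and missing pairs
    count as zero-weight edges.
    """
    total, densities = 0.0, []
    for prev, cur in zip(communities, communities[1:]):
        vals = [w.get(p, 0.0) for p in new_pairs(prev, cur)]
        mean = sum(vals) / len(vals)
        densities.append(mean)
        total += sum((x - mean) ** 2 for x in vals)
    feasible = all(a > b for a, b in zip(densities, densities[1:]))
    return total, densities, feasible
```

On a toy graph with a unit-weight triangle on {1, 2, 3} and one edge of weight 0.6 from 3 to 4, the sequence {1}, {1, 2, 3}, {1, 2, 3, 4} has a perfectly uniform first segment, and all of its score comes from the second, sparser segment.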
4 An Algorithm for Discovering Nested Communities
In this section we present our algorithm for discovering nested communities. We begin by demonstrating a necessary condition for the optimal solution based on dense subgraphs. Discovering such subgraphs turns out to be computationally intractable. We then split the original problem into two subproblems: discovering community sequence for a fixed order of vertices, a problem which we can solve efficiently, and discovering such an order. We provide a simple heuristic for discovering an order, and provide theoretical evidence that this order is good.
4.1 Nested Communities and Dense Subgraphs
We start our discussion by demonstrating a connection of the problem of finding the optimal sequence of nested communities, i.e., solving Problem 1, with problems related to finding dense subgraphs of a given graph.
To establish this connection, consider a triple of communities in an optimal solution to Problem 1. Consider the two corresponding segments and . Consider also any two subsets of those segments, and , that is, is a subset of the outer segment, while is a subset of the inner segment; see Figure 1(a) for a visualization. As we will show shortly, adding the outer subset to the community leads to a situation where the density of the subset with respect to the overall community is no better than the density of the subset with respect to the community . Otherwise, either adding to (see Figure 1(b)) or removing from (see Figure 1(c)) leads to a better solution. This follows from the fact that we require that the densities of the nested communities in any feasible solution of Problem 1 decrease monotonically.
Before proceeding to discussing the implications of this observation, we first give a formal statement and its proof.
Proposition 1
Consider a graph , a set of source vertices , and an integer . Let be the optimal sequence of nested communities, that is, a solution to Problem 1. Fix such that and let and . Then
[TABLE]
For the proof of the proposition we require the following lemma, which states that the mean square error of a set of numbers from a single point, increases with the distance of that point from the mean of the numbers. The lemma can be derived by simple algebraic manipulations, and its proof is omitted.
Lemma 1
Let and be two sets of real numbers. Let and . For any real number it is
[TABLE]
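The exact statement of the lemma is not reproduced in the text here, so the following Python check illustrates only the standard unweighted identity that its description suggests: the squared error of a set of numbers around an arbitrary point equals the squared error around the mean plus the number of points times the squared distance between the point and the mean. The helper names `sq_error` and `decomposition` are hypothetical.

```python
def sq_error(xs, c):
    """Sum of squared differences between the numbers xs and the point c."""
    return sum((x - c) ** 2 for x in xs)


def decomposition(xs, c):
    """Return both sides of: sq_error(xs, c) = sq_error(xs, mean) + n * (mean - c)^2."""
    n = len(xs)
    mean = sum(xs) / n
    lhs = sq_error(xs, c)
    rhs = sq_error(xs, mean) + n * (mean - c) ** 2
    return lhs, rhs
```

For the numbers 1, 2, 6 and the point 5, both sides equal 26: the error around the mean 3 is 14, and the correction term is 3 times 4.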
We are now ready to prove the proposition.
Proof (Proposition 1)
Let and . Let us break into two parts, and . Similarly, let us break into two parts, and . Define the centroids and . Lemma 1 now implies that
[TABLE]
where const is equal to
[TABLE]
Since is optimal we must have and . Otherwise, we can obtain a better segmentation by attaching to or deleting from . This implies that and . Since , this implies that and , which implies . This completes the proof. ∎
Proposition 1 implies that in an optimal solution the graph vertices can be ordered in such a way so that subgraph density, as specified by the proposition, decreases along this order. This observation motivates the following greedy algorithm for solving the problem of discovering nested communities:
Algorithm outline: Greedy–add–densest–subgraph
1. Start with , the set of source vertices.
2. Given the current set , find a subset of vertices that maximizes .
3. Set , and repeat the previous step until the set includes all the vertices of the graph.
4. Consider the vertices in the order discovered by the previous process. Find the optimal sequence of nested communities that respects this order.
One potential problem with the above greedy approach is that the subroutine that is called iteratively in step 2, is an NP-hard problem. This is formalized below as problem DenseSuperset.
The proof of Proposition 2 is given in Section 4.3.
Problem 2 (DenseSuperset)
Given a weighted graph and a subset of vertices , find a subset of vertices maximizing .
Proposition 2
The DenseSuperset problem is NP-hard.
Similarly, one can think of solving the problem by working on the opposite direction, that is, start with the whole vertex set and “peel off” the set by removing the sparsest subgraph, until left with the set of source vertices . The corresponding algorithm will be the following.
Algorithm outline: Greedy–remove–sparsest–subgraph
1. Start with , the vertex set of .
2. Given a current set , find a subset of vertices that does not include the source vertex set and minimizes the density .
3. Set , and repeat the previous step until left only with the set of source vertices .
4. Consider the vertices in the order removed by the previous process. Find the optimal sequence of nested communities that respects this order.
Not surprisingly, the problem of finding the sparsest subgraph, which corresponds to step 2 of the above process, is NP-hard.
The proof is given again in Section 4.3.
Problem 3 (SparseNbhd)
Given a weighted graph find a set of vertices minimizing .
Proposition 3
The SparseNbhd problem is NP-hard.
4.2 Algorithm for Discovering Nested Communities
Armed with intuition from the previous section, we now proceed to discuss the proposed algorithm. The underlying principle of both of the greedy algorithms described above is to consider the vertices of the graph in a specific order and then find a sequence of nested communities that respects this order. In one case, the order of graph vertices is obtained by starting from and iteratively adding the densest subgraph, while in the other case, the order is obtained by starting from the full vertex set and iteratively removing the sparsest subgraph.
Our algorithm is an instantiation of this general principle. We specify in detail (i) how to obtain an order of the graph vertices, and (ii) how to find a sequence of nested communities that respects a given order.
We start our discussion from the second task, i.e., finding the sequence of nested communities given an order. As it turns out, this problem is an instance of sequence segmentation problems. We define this problem below, which is a refinement of Problem 1.
Problem 4 (Sequence of nested communities from a given order)
Given a graph with ordered vertices, a set of source vertices , and an integer , find a monotonically increasing sequence of integers such that
[TABLE]
minimizes the density-uniformity score and satisfies the monotonicity constraint for .
It is quite easy to see that Problem 4 can be cast as a segmentation problem. Typical segmentation problems can be solved optimally using dynamic programming, as shown by Bellman [3]. The most interesting aspect of Problem 4, seen as a segmentation problem, is the monotonicity constraint , for . That is, not only do we ask to segment the ordered sequence of vertices so that we minimize the density variance on the segments, but we also require that the density scores of the segments decrease monotonically. The situation can be abstracted to the monotonic segmentation problem stated below.
Problem 5 (Monotonic segmentation)
Let and be two sequences of real numbers. Given an integer , find indices minimizing
[TABLE]
where is the weighted centroid of -th segment such that .
In order to express Problem 4 with Problem 5, consider a group of edges, for each vertex . If we set and , we can apply Lemma 1 and show that the score of community sequence is equal to the variance minimized by Problem 5, plus a constant. In fact, this constant is the sum of the variances within each .
Similarly to the unconstrained segmentation problem, the monotonic segmentation problem can be solved optimally. The idea is to use as preprocessing step the classic “pool of adjacent violators” algorithm (PAV) [2], which merges points until there are no monotonicity violations, and then apply the classic dynamic-programming algorithm on the resulting sequence of merged points. This algorithm runs in time. By definition the merged points do not contain any monotonicity violations, and thus, the resulting segmentation respects the monotonicity constraint, as well. As shown by Haiminen et al. [14], this two-phase algorithm gives the optimal segmentation under the monotonicity constraints. As a result of the optimality of the monotonic segmentation problem, Problem 4 can be solved optimally.
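The two-phase procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes positive point weights, a non-increasing (rather than strictly decreasing) order of segment centroids, and a plain dynamic program whose cost grows with the number of segments times the square of the number of merged blocks. The function names `pav_nonincreasing` and `monotonic_segmentation` are hypothetical.

```python
def pav_nonincreasing(xs, ws):
    """Pool adjacent violators so that block means are non-increasing.

    Each block is [weight, weighted sum, weighted sum of squares, point count].
    """
    blocks = []
    for x, a in zip(xs, ws):
        blocks.append([a, a * x, a * x * x, 1])
        # Merge while the last block's mean exceeds the previous block's mean.
        while (len(blocks) > 1 and
               blocks[-1][1] * blocks[-2][0] > blocks[-2][1] * blocks[-1][0]):
            a2, s2, q2, n2 = blocks.pop()
            last = blocks[-1]
            last[0] += a2
            last[1] += s2
            last[2] += q2
            last[3] += n2
    return blocks


def monotonic_segmentation(xs, ws, k):
    """Return (weighted squared error, segment end positions) for at most k segments."""
    blocks = pav_nonincreasing(xs, ws)
    n = len(blocks)
    k = min(k, n)
    # Prefix sums over blocks give the cost of any run of blocks in O(1).
    A, S, Q, N = [0.0], [0.0], [0.0], [0]
    for a, s, q, c in blocks:
        A.append(A[-1] + a)
        S.append(S[-1] + s)
        Q.append(Q[-1] + q)
        N.append(N[-1] + c)

    def cost(i, j):
        """Weighted squared error of blocks i..j-1 around their weighted centroid."""
        a, s, q = A[j] - A[i], S[j] - S[i], Q[j] - Q[i]
        return q - s * s / a

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i
    # Walk the back-pointers to recover segment ends as original positions.
    ends, j = [], n
    for m in range(k, 0, -1):
        ends.append(N[j])
        j = back[m][j]
    return best[k][n], ends[::-1]
```

Because segments are unions of consecutive merged blocks, their centroids inherit the non-increasing order of the block means, which is why no further constraint handling is needed inside the dynamic program.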
We next proceed to discuss the first component of the algorithm, namely, how to obtain an order of the graph vertices. Recall that, according to the principles discussed in the previous section, we can either start from and iteratively add dense subgraphs, or start from and remove sparse subgraphs. We follow the latter approach. In order to overcome the NP-hard problem of finding the sparsest subgraph and in order to obtain a total order, we use the heuristic of iteratively removing the sparsest subgraph of size one, namely, a single vertex. The sparsest one-vertex subgraph is simply the vertex with the smallest weighted degree. Thus, overall, we obtain the simple algorithm SortVertices, whose pseudocode is given as Algorithm 1.
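The ordering heuristic can be sketched in Python as below. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the graph is a symmetric dictionary of dictionaries `adj`, source vertices are never removed, ties are broken by vertex label (so labels must be mutually comparable), and the removal order is reversed so that the sources come first and the earliest-removed vertex comes last. The function name `sort_vertices` is hypothetical.

```python
import heapq


def sort_vertices(adj, sources):
    """Order vertices by repeatedly removing the one with the smallest weighted degree.

    adj: dict mapping each vertex to a dict {neighbour: weight}, symmetric.
    Returns the sources followed by the other vertices in reverse removal order.
    """
    sources = set(sources)
    deg = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items() if v not in sources]
    heapq.heapify(heap)
    removed, removal_order = set(), []
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry left over from an earlier degree update
        removed.add(v)
        removal_order.append(v)
        for u, weight in adj[v].items():
            if u not in removed:
                deg[u] -= weight
                if u not in sources:
                    heapq.heappush(heap, (deg[u], u))
    return sorted(sources) + removal_order[::-1]
```

The heap with lazy deletion keeps the sketch short; each edge triggers at most two pushes, so the running time is near-linear in the number of edges.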
As an interesting side remark, we note that the algorithm SortVertices is encountered in the context of finding subgraphs with the highest average degree. In particular, it is known that the densest subgraph obtained by the algorithm during the process of iteratively removing the smallest-degree vertex is a factor-2 approximation to the optimally densest subgraph in the graph [4].
The natural question to ask is how good is the order produced by algorithm SortVertices? As we will demonstrate shortly, it turns out that the order is quite good. First, we note that the optimal solution obtained for Problem 4 satisfies an analogous structural property, with respect to subgraph densities, as the optimal solution for Problem 1. We omit the proof of the following proposition as it is similar to the one of Proposition 1.
Proposition 4
Consider a graph with ordered vertices, a set of source vertices , and an integer . Let be the optimal sequence of nested communities with respect to the order, that is, a solution to Problem 4. Fix such that and let . Let and such that and . Then .
The only difference between Proposition 1 and Proposition 4 is that in Proposition 4 we require additionally that starts with and ends with with respect to the order. We want this condition to be redundant, otherwise the given order is suboptimal. For example, consider the adjacency matrix of given in Figure 2(a). The given segmentation is optimal with respect to the given order. However if we rearrange the vertices in and , given in Figure 2(b), then the same segmentation is no longer optimal as and violate Proposition 4. The additional condition in Proposition 4 becomes redundant if ends with the sparsest subset while starts with densest subset. We will show that the algorithm SortVertices produces an order that satisfies this property approximately. The exact formulation of our claim is given as Propositions 5 and 6.
Proposition 5
Consider a weighted graph , whose vertices are ordered by algorithm SortVertices. Let . Let and . Let . Then for any .
Proof
Note that . Write . Since has the smallest , we have
[TABLE]
Combining the inequalities and dividing by proves the result.∎
Proposition 6
Consider a weighted graph , whose vertices are ordered by algorithm SortVertices. Let . Let and . Assume that there is such that for all it is . Let . Then for any .
Proof
Let and . The density of is bounded by
[TABLE]
Select with the highest . Then . Let us prove that . If , then we are done. Assume that . Since is fully-connected, SortVertices always picks the vertex with the lowest weight. Let . Then . Since, is fully-connected for any . Hence, dividing the inequality gives us , which proves the proposition.∎
4.3 Hardness of Finding Dense and Sparse Subgraphs
In this section we prove the NP-hardness results, stated in Section 4.1. We start with an auxiliary lemma.
Lemma 2
Let be real numbers. Let . If
[TABLE]
Similarly, if
[TABLE]
Proof
We will only prove the first case. The other 3 cases are similar. We have which is equivalent to . The left-hand side is equal to while the right hand side is equal to . The lemma follows.∎
We now give the proofs of Propositions 2 and 3.
Proposition 2
The DenseSuperset problem is NP-hard.
Proof
To prove the hardness, we will reduce Clique to DenseSuperset. Let be the given graph. Let us create a new graph by adding one extra vertex, say , to and connecting every vertex in to . We set to be for any edge in and , which we will define later, if is adjacent to . Finally, we connect the non-connected vertices with edges of weight [math]. We will use , , and as inputs to DenseSuperset.
Our next step is to define such that the maximum clique will also have the largest density. In order to do that, let be a clique of size in . Then the weight of is equal to
[TABLE]
If we have a non-clique subgraph of size , then obviously its weight is strictly smaller than .
Assume a set of vertices with vertices. The weight of is bounded by
[TABLE]
We want , which is guaranteed if
[TABLE]
Since , Lemma 2 implies that if
[TABLE]
then the inequality in Eq. 1 is guaranteed.
Let be a non-clique of size in . Then the weight of is bounded by
[TABLE]
We need to have , which is guaranteed if
[TABLE]
Since , Lemma 2 guarantees that if
[TABLE]
then the inequality in Eq. 2 is guaranteed. If we choose , both inequalities in Eqs. 1–2 are now guaranteed.
Let be the minimum size of the clique given as a parameter in Clique. Set . If contains a clique of size , then there is a subgraph in with a density of . Assume now that contains a subgraph, say , with a density of at least . must contain at least vertices, otherwise the bound in Eq. 1 is violated. must be a clique, otherwise the bound in Eq. 2 is violated. Consequently, has a clique of size if and only if has a subgraph of density at least . The reduction is polynomial. This concludes the proof.∎
Proposition 3
The SparseNbhd problem is NP-hard.
Proof
To prove the hardness, we will reduce Clique to SparseNbhd. Let be the given graph. We will define as follows. First we attach two vertices and to . Select one vertex, say , from the clique and connect each vertex in to . We connect the non-connected vertices with edges of weight [math]. Let . We will weight the edges in with , let us define . Set the weight of an edge , for each . Due to this scheme we have for any . Finally, we set . This weight is so large that no solution for SparseNbhd will contain or .
Let be a clique of size in . Then the weight of is equal to
[TABLE]
If we have a non-clique subgraph of size , then obviously its weight is strictly larger than .
Assume a set with vertices. The weight of is bounded by
[TABLE]
We want , which is guaranteed if
[TABLE]
Since , Lemma 2 implies that if
[TABLE]
then the inequality in Eq. 3 is guaranteed. This is guaranteed by our choice of .
Let be a non-clique of size in . Then the weight of is bounded by
[TABLE]
We need to have , which is guaranteed if
[TABLE]
Since , Lemma 2 guarantees that if
[TABLE]
then Eq. 4 is guaranteed. This is guaranteed by our choice of .
Let be the minimum size of the clique given as a parameter in Clique. Set . If contains a clique of size , then there is a subgraph in with a density of . Assume now that contains a subgraph, say , with a density of at most . Note that is largest when , that is, . If or is contained in , then the density is at least , which is a contradiction. Hence is a subgraph of . must contain at least vertices, otherwise the bound in Eq. 3 is violated. must be a clique, otherwise the bound in Eq. 4 is violated. Consequently, has a clique of size if and only if has a subgraph of density at most . The reduction is polynomial. This concludes the proof.∎
5 Related Work
Finding communities in graphs and social networks is one of the most well-studied topics in graph mining, and the literature on the subject is very extensive. This section cannot aspire to cover all the different approaches and aspects of the problem; we only provide a brief overview of the area.
Community detection. A large part of the related work deals with the problem of partitioning a graph into disjoint clusters or communities. A number of different methodologies have been applied, such as hierarchical approaches [11], methods based on modularity maximization [1, 6, 11, 26], graph-theoretic approaches [8, 9], random-walk methods [21, 24, 28], label-propagation approaches [24], and spectral graph partitioning [5, 15, 18, 25]. A thorough review of community-detection methods can be found in the survey by Fortunato [10]. We note that this line of work is different from the present paper, since we do not aim at partitioning a graph into disjoint communities.
Overlapping communities. Researchers in community detection have realized that, in many real situations and real applications, it is meaningful to consider that graph vertices do not belong only to one community. Thus, one asks to partition a graph into overlapping communities. Typical methods here rely on clique percolation [19], extensions to the modularity-based approaches [12, 20], analysis of ego-networks [7], or fuzzy clustering [27]. Again the problem we address in this paper is quite different. First, we find communities centered around a given set of source vertices, and not for the whole graph. Second, the communities output by our algorithm do not have arbitrary overlaps, but they have a specific nested structure.
Centerpiece subgraphs and community search. Perhaps closer to our approach is work related to the centerpiece subgraphs and the community-search problem [23, 16, 22]. In this class of problems, a set of source vertices is given and the goal is to find a subgraph so that belongs in the subgraph and the subgraph forms a tight community. The quality of the subgraph is measured with various objective functions, such as degree [22], conductance [16], or random-walk-based measures [23]. The difference of these methods with the one presented here is that these methods return only one community, while in this paper we deal with the problem of finding a sequence of nested communities.
In summary, despite the extensive research on community detection in graphs and social networks, to the best of our knowledge, this is the first paper to address the topic of nested communities with respect to a set of source vertices. Furthermore, our approach offers novel technical ideas, such as a solid theoretical analysis that allows us to decompose the problem of finding nested communities into two sub-problems: (i) ordering the set of vertices, and (ii) segmenting the graph vertices according to that order.
6 Experimental Evaluation
We will now provide experimental evidence that our method efficiently discovers meaningful segmentations and that our ordering algorithm outperforms several natural baselines.
Datasets and experimental setup. In our experiments we used six datasets, five obtained from Mark Newman’s webpage (http://www-personal.umich.edu/~mejn/netdata/), and a bibliographic dataset obtained from DBLP. The datasets are as follows: Adjnoun: adjacency graph of common adjectives and nouns in the novel David Copperfield, by Charles Dickens. Dolphins: an undirected social graph of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. Karate: social graph of friendships between 34 members of a karate club at a US university in the 1970s. Lesmis: coappearance graph of characters in the novel Les Miserables. Polblogs: a directed graph of hyperlinks between weblogs on US politics, recorded in 2005. DBLP: coauthorship graph between researchers in computer science. The statistics of these datasets are given in Table 1.
For each dataset and a given source set , we considered three different weighting schemes. First, we ran personalized PageRank using the source node with a restart of . Let be the PageRank weight of each vertex. Given an edge , we define the three weighting schemes as
[TABLE]
These weights are selected so that the vertices that are hard to reach with a random walk will have edges with small weights, and hence will be placed in outer communities. For DBLP, we weighted the edges during PageRank computation with the number of joint papers, each paper normalized by the number of authors. We use the vertex with the highest degree as a starting set.
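To make the PageRank step concrete, the following minimal Python sketch computes personalized PageRank by power iteration. It is an illustration only, not the implementation used in the experiments: the function name `personalized_pagerank`, the adjacency-dictionary input, the toy path graph, and the restart value 0.15 are all assumptions chosen for the example, and the edge weights derived from the scores are not reproduced here.

```python
def personalized_pagerank(adj, source, restart=0.15, iters=200):
    """Power iteration for a random walk that restarts at `source`.

    adj: dict mapping each vertex to a list of its neighbours.
    restart: probability of jumping back to the source at each step;
             0.15 is a placeholder, not the value used in the paper.
    """
    rank = {v: 0.0 for v in adj}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, nbrs in adj.items():
            walk_mass = (1.0 - restart) * rank[v]
            if not nbrs:
                # A vertex without neighbours sends its mass back to the source.
                nxt[source] += walk_mass
                continue
            for u in nbrs:
                nxt[u] += walk_mass / len(nbrs)
        # The restart mass always returns to the source vertex.
        nxt[source] += restart
        rank = nxt
    return rank


# Hypothetical toy graph: a path a - b - c - d with source "a".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = personalized_pagerank(path, "a")
```

On this toy path, the vertex farthest from the source receives the smallest score, which matches the intuition that hard-to-reach vertices should end up in outer communities.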
Time complexity. Our first step is to study the running time of our algorithm. We ran our experiments on a laptop equipped with a 1.8 GHz dual-core Intel Core i7 with 4 MB shared L3 cache; typical running times for each dataset are given in the 3rd column of Table 1 (for the code, see http://users.ics.aalto.fi/~ntatti/). Our algorithm is fast: for the largest dataset with 2 million edges, the computation took only about 20 seconds. The algorithm consists of 4 steps: computing PageRank, ordering the vertices, grouping the vertices into blocks such that the monotonicity condition is guaranteed, and segmenting the groups. The only computationally strenuous step is segmentation, which requires quadratic time in the number of blocks. The number of vertices in DBLP is over , however, grouping according to the PAV algorithm leaves only blocks, which can be easily segmented. It is possible to select weights in such a way that there will be no reduction when grouping vertices, so that finding the optimal segmentation becomes infeasible. However, in such a case, we can always resort to a near-linear approximation algorithm [13].
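The grouping step relies on the pool-adjacent-violators (PAV) idea: adjacent items are merged whenever their averages violate the required monotone order. The sketch below is a generic PAV routine over a plain list of numbers, assuming unit weights and a non-increasing target. The function name `pav_blocks` and the input values are hypothetical, and the quantity that is actually pooled in our algorithm is defined earlier in the paper rather than here.

```python
def pav_blocks(values):
    """Pool adjacent violators so that block means are non-increasing.

    Returns a list of (start, end, mean) triples, with `end` exclusive.
    """
    blocks = []  # each block is [start, end, total]
    for i, x in enumerate(values):
        blocks.append([i, i + 1, float(x)])
        # Merge backwards while the newest block has a larger mean
        # than the block before it (a monotonicity violation).
        while len(blocks) > 1:
            s0, e0, t0 = blocks[-2]
            s1, e1, t1 = blocks[-1]
            if t1 / (e1 - s1) <= t0 / (e0 - s0):
                break
            blocks[-2:] = [[s0, e1, t0 + t1]]
    return [(s, e, t / (e - s)) for s, e, t in blocks]


# Hypothetical input: six items collapse into three monotone blocks.
example = pav_blocks([5, 3, 4, 1, 2, 2])
```

Because segmentation is quadratic in the number of blocks, any merging achieved at this stage directly reduces the cost of the final step.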
Comparison to baseline. A key part in our approach is discovering a good order. Our next step is to compare the order induced by SortVertices against several natural baselines. For the first baseline we group the vertices based on the length of a minimal path from the source. We then compared these communities, say , to the (same number of) communities obtained with our method. The scores, given in Table 1, show that our approach beats this baseline in every case, which is expected since this naïve baseline does not take into account density. For our next two baselines we order vertices based on vertex degree and PageRank. We then compute community sequences with – communities from these orders. Typical scores are given in Figure 3. Out of comparisons, SortVertices wins both orders 158 times, ties once (Karate, , 3 communities) and loses 3 times to the degree order (DBLP, , 3–5 communities).
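One plausible reading of the shortest-path baseline is sketched below: a breadth-first search from the source set assigns each vertex its hop distance, and the i-th community collects all vertices within distance i. The function name `distance_communities`, the adjacency-dictionary input, and the toy graph are assumptions made for this illustration. The sketch ignores edge weights and density altogether, which is why such a baseline is expected to be weak.

```python
from collections import deque


def distance_communities(adj, sources):
    """Nested communities by hop distance from the source set.

    adj: dict mapping each vertex to a list of its neighbours.
    Returns a list whose i-th entry is the set of vertices within
    distance i of some source; unreachable vertices are left out.
    """
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    radius = max(dist.values())
    return [{v for v, d in dist.items() if d <= r} for r in range(radius + 1)]


# Hypothetical toy graph: a path a - b - c - d plus a leaf e attached to a.
toy = {"a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": ["a"]}
layers = distance_communities(toy, ["a"])
```

Each returned set contains the previous one, so the output has the same nested form as the community sequences it is compared against.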
Examples of Communities. Our final step is to provide examples of discovered communities. In Figure 4 we provide 4 different community sequences with 3 communities using weights and and sources and . The inner-most community for contains a near 5-clique. The inner-most community for contains two 4-cliques. The normalized weight penalizes hubs. This can be seen in Figure 4(a), where hubs , move from the outer community to the middle community. Similarly, hub changes communities in Figure 4(b). Finally, we give an example of communities discovered in DBLP. Table 2 contains communities discovered around Christos Papadimitriou. Authors in inner communities share many joint papers with Papadimitriou.
7 Concluding Remarks
We considered the problem of discovering nested communities: a sequence of subgraphs such that each community is a more connected subgraph of the next community. We approached the problem by dividing it into two subproblems: discovering the community sequence for a fixed order of vertices, which we can solve efficiently, and discovering an order. We provided a simple heuristic for discovering an order, together with theoretical and empirical evidence that this order is good.
Discovering nested communities seems to have a lot of potential as it is possible to modify or extend the problem in many ways. We can generalize the problem by not only considering sequences but, for example, trees of communities, where a parent node needs to be a denser subgraph than the child node. Another possible extension is to consider multiple source sets instead of just one.
Acknowledgements.
This work was supported by the Academy of Finland grant 118653 (algodan).
- [1] G. Agarwal and D. Kempe. Modularity-maximizing network communities via mathematical programming. European Physical Journal B, 66(3), 2008.
- [2] M. Ayer, H. Brunk, G. Ewing, and W. Reid. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4), 1955.
- [3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6), 1961.
- [4] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, 2000.
- [5] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
- [6] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 2004.
- [7] M. Coscia, G. Rossetti, F. Giannotti, and D. Pedreschi. DEMON: a local-first discovery method for overlapping communities. In KDD, 2012.
- [8] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, 2000.
