Discovering Nested Communities

Nikolaj Tatti; Aristides Gionis

arXiv:1902.01483·cs.DS·February 6, 2019

TL;DR

This paper introduces a method for discovering nested communities in graphs, addressing the challenge of ambiguous community structures by finding a sequence of increasingly dense communities containing a starting set.

Contribution

It proposes a novel approach to identify nested communities, dividing the problem into ordering and community detection, with empirical and theoretical validation of the heuristic used.

Findings

1. Efficient algorithm for fixed vertex order
2. Heuristic for ordering shows good empirical performance
3. Theoretical support for the ordering heuristic

Abstract

Finding communities in graphs is one of the most well-studied problems in data mining and social-network analysis. In many real applications, the underlying graph does not have a clear community structure. In those cases, selecting a single community turns out to be a fairly ill-posed problem, as the optimization criterion has to make a difficult choice between selecting a tight but small community or a more inclusive but sparser community. In order to avoid the problem of selecting only a single community we propose discovering a sequence of nested communities. More formally, given a graph and a starting set, our goal is to discover a sequence of communities all containing the starting set, and each community forming a denser subgraph than the next. Discovering an optimal sequence of communities is a complex optimization problem, and hence we divide it into two subproblems: 1)…

Tables

Table 1: Basic statistics of graphs (first two columns) and performance relative to the hops baseline. The third column represents a typical running time while the fourth column represents a typical number of entries during the segmentation. The last three columns represent the normalized score compared to the baseline score $q(\mathcal{H})$.

Name	$|V(G)|$	$|E(G)|$	Time	$N$	$w_{n}$	$w_{s}$	$w_{m}$
					(last three columns: performance $q(\mathcal{V}) / q(\mathcal{H})$)
Adjnoun	112	425	2ms	84	$0.90 / 0.95$	$0.88 / 0.95$	$0.77 / 0.94$
Dolphins	62	159	1ms	41	$0.67 / 0.80$	$0.61 / 0.78$	$0.57 / 0.80$
Karate	34	78	1ms	21	$0.78 / 0.91$	$0.76 / 0.91$	$0.60 / 0.93$
Lesmis	77	254	2ms	37	$0.77 / 0.93$	$0.84 / 0.94$	$0.62 / 0.94$
Polblogs	$1\,222$	$16\,714$	84ms	872	$0.87 / 0.96$	$0.95 / 0.99$	$0.57 / 0.96$
DBLP	$703\,193$	$2\,341\,362$	23s	$1\,797$	$0.87 / 0.99$	$0.98 / 1.00$	$0.45 / 0.99$

Table 2: Top-3 communities from a sequence of 5 communities for Christos Papadimitriou from the DBLP set, using $w_{s}$.

1. segment	M. Yannakakis	F. Afrati
2. segment	J. Ullman	Y. Sagiv	S. Cosmadakis	D. Johnson	M. Garey	R. Karp	R. Fagin
3. segment	G. Papageorgiou	V. Vazirani	E. Dahlhaus	P. Crescenzi	P. Seymour	O. Vornberger	M. Blum	K. Ross	P. Kolaitis	V. Vianu	P. Kanellakis	S. Abiteboul	A. Piccolboni	D. Goldman	E. Arkin	I. Diakonikolas	G. Gottlob	M. Sideri	E. Koutsoupias	C. Daskalakis	X. Deng	P. Goldberg	T. Hadzilacos	A. Itai	A. Schäffer	A. Aho	P. Serafini	P. Raghavan	P. Bernstein

Equations

$$w(F) = \sum_{e \in F} w(e) \quad\text{and}\quad d(F) = \frac{w(F)}{|F|}.$$

$$q(F) = \sum_{e \in F} \bigl(w(e) - d(F)\bigr)^{2}.$$

$$q(\mathcal{V}) = \sum_{i=1}^{k} q\bigl(E(V_{i}) \setminus E(V_{i-1})\bigr),$$

$$d(X, X \cup V_{i}) \leq d(Y, V_{i}).$$

$$\sum_{i=1}^{N} w_{i} (x_{i} - d)^{2} = \sum_{i=1}^{N} w_{i} (x_{i} - \mu)^{2} + W (d - \mu)^{2}.$$

$$s = q(C_{1}) + q(C_{2})$$

$$s_{1} = q(C_{1} \cup D_{21}) + q(D_{22})$$

$$s_{2} = q(D_{12}) + q(C_{2} \cup D_{11})$$

$$\sum_{i=1}^{2} \Bigl[ q(D_{i1}) + q(D_{i2}) + |D_{i2}| (\mu_{i2} - \lambda_{i})^{2} \Bigr].$$

$$\mathcal{V} = (S = V_{0} \subseteq V_{1} \subseteq \dots \subseteq V_{k} = V), \quad\text{where } V_{i} = \{v_{1}, \dots, v_{b_{i}}\},$$

$$\sum_{j=1}^{k} \sum_{i=b_{j-1}}^{b_{j}-1} a_{i} (x_{i} - \mu_{j})^{2},$$

$$s = \sum_{x \in X} m_{x}\, d(x, W) \geq d(v_{c}, W) \sum_{x \in X} m_{x} \geq d(v_{c}, W)\, |c(X, W)|.$$

$$d(X, X \cup W) = \frac{w(A) + w(B)}{|A| + |B|} \leq \frac{w(A) + \alpha\, w(A)}{|A| + |B|} \leq \frac{(1 + \alpha)\, w(A)}{|A|} = (1 + \alpha)\, d(A).$$

$a > r$ and $y > x$ or if $a < r$ and $x < y$, then $\frac{x + a}{x + b} > \frac{y + a + c}{y + b}$.

$a < r$ and $y > x$ or if $a > r$ and $x < y$, then $\frac{x + a}{x + b} < \frac{y + a + c}{y + b}$.

$$d(X, X \cup S) = \frac{\binom{N}{2} + \alpha N}{\binom{N}{2} + N} = \frac{N - 1 + 2\alpha}{N - 1 + 2}.$$

$$d(Z, Z \cup S) \leq \frac{\binom{K}{2} + \alpha K}{\binom{K}{2} + K} = \frac{K - 1 + 2\alpha}{K - 1 + 2}.$$

$$\frac{N - 1 + 2\alpha}{N + 1} > \frac{K - 1 + 2\alpha}{K + 1}.$$

$$2\alpha < 2 + \frac{2 + N - 1}{(K - 1) - (N - 1)} \cdot 0 = 2,$$

$$d(Y, Y \cup S) \leq \frac{\binom{M}{2} + \alpha M - 1}{\binom{M}{2} + M} = \frac{M - 1 + 2\alpha - 2/M}{M - 1 + 2}.$$

$$\frac{N - 1 + 2\alpha}{N - 1 + 2} > \frac{M - 1 + 2\alpha - 2/M}{M - 1 + 2}.$$

$$2\alpha > 2 + \frac{-2}{M} \cdot \frac{2 + N - 1}{(M - 1) - (N - 1)} = 2 - \frac{2(N + 1)}{M(M - N)},$$

$$d(X, V') = \frac{\alpha N - \binom{N}{2}}{P N - \binom{N}{2}} = \frac{2\alpha - N + 1}{2P - N + 1}.$$

$$d(Z, V') \geq \frac{\alpha K - \binom{K}{2}}{P K - \binom{K}{2}} = \frac{2\alpha - K + 1}{2P - K + 1}.$$

$$\frac{2\alpha - N + 1}{2P - N + 1} < \frac{2\alpha - K + 1}{2P - K + 1}.$$

$$2\alpha < 2P + \frac{2P - N + 1}{(N - 1) - (K - 1)} \cdot 0 = 2P,$$

$$d(Y, V') \geq \frac{\alpha M - \binom{M}{2} + 1}{P M - \binom{M}{2}} = \frac{2\alpha + 2/M - M + 1}{2P - M + 1}.$$

$$\frac{2\alpha - N + 1}{2P - N + 1} < \frac{2\alpha + 2/M - M + 1}{2P - M + 1}.$$

$$2\alpha > 2P + \frac{2}{M} \cdot \frac{2P - N + 1}{(N - 1) - (M - 1)} = 2P - \frac{2(2P - N + 1)}{M(M - N)}$$

$$w_{n}(e) = \frac{p(v)}{\deg(v)} + \frac{p(w)}{\deg(w)}, \quad w_{s}(e) = p(v) + p(w), \quad w_{m}(e) = \min(p(v), p(w)).$$

Full text


Helsinki Institute for Information Technology

Department of Information and Computer Science

Aalto University

{nikolaj.tatti,aristides.gionis}@aalto.fi

Abstract

Finding communities in graphs is one of the most well-studied problems in data mining and social-network analysis. In many real applications, the underlying graph does not have a clear community structure. In those cases, selecting a single community turns out to be a fairly ill-posed problem, as the optimization criterion has to make a difficult choice between selecting a tight but small community or a more inclusive but sparser community.

In order to avoid the problem of selecting only a single community we propose discovering a sequence of nested communities. More formally, given a graph and a starting set, our goal is to discover a sequence of communities all containing the starting set, and each community forming a denser subgraph than the next. Discovering an optimal sequence of communities is a complex optimization problem, and hence we divide it into two subproblems: 1) discover the optimal sequence for a fixed order of graph vertices, a subproblem that we can solve efficiently, and 2) find a good order. We employ a simple heuristic for discovering an order and we provide empirical and theoretical evidence that our order is good.

Keywords:

community discovery, monotonic segmentation, graph mining, nested communities

1 Introduction

Discovering communities, that is, tightly connected subgraphs, is one of the most well-studied problems in the field of graph mining. Given some optimization criterion, discovering a community is a computationally challenging task, typically NP-hard. Additionally, as pointed out by Leskovec et al. [17], in many real applications the underlying graph does not have a clear community structure. Such cases make the community-finding problem inherently ill-posed, as the optimization criterion has to make a difficult, and eventually arbitrary, choice between selecting a tight but small community or a more inclusive but sparser community. Moreover, the existence of a universal criterion for making such a choice is unlikely, as the balance between the size and the density of the desired community will depend on the underlying application.

In order to avoid the problem of selecting only a single community, we propose the problem of discovering a sequence of nested communities. More formally, given a graph $G$ and a set of source vertices $S$, our goal is to discover a sequence of $k$ communities around $S$, such that each community is a subset of the next one. The first community will consist only of $S$ while the last community will contain the whole graph. Inner communities should be tighter than the outer communities. We express this requirement by computing the density of each community and requiring that the next community has a lower density than the current one. In addition, we require that each community be as uniform as possible. We measure uniformity by computing the variance of the weights of the edges and requiring it to be small.

Discovering a sequence of communities by optimizing the uniformity criterion is a challenging problem. We will show that several optimization problems related to the optimal solution are NP-hard. Hence, we split the problem into two subproblems. We can view a community sequence as a bucket order on the vertices, each bucket consisting of vertices contained in the community and not contained in the previous community. Our first subproblem is to discover a total order on the vertices respecting the optimal bucket order. The second subproblem is to discover the optimal sequence of communities, given an order on the graph vertices. Fortunately, this subproblem can be formulated as a standard sequence-segmentation problem, and thus, it can be solved in polynomial time. In particular, we can solve this problem optimally in quadratic time or we can find an approximate solution in nearly-linear time. Discovering the order is more difficult as this is a complex combinatorial problem. We propose a simple ordering technique used for discovering dense subgraphs: pick iteratively a vertex with the lowest degree, and remove it from the graph. We provide theoretical evidence implying that this is a good order and we also show experimentally that this order outperforms several baselines.

The rest of the paper is organized as follows. We introduce preliminary notation in Section 2 and formalize our optimization problem in Section 3. In Section 4 we develop our discovery algorithm and point out theoretical properties of our approach. Section 5 is devoted to related work and Section 6 to experimental evaluation. We conclude in Section 7.

2 Preliminaries

We consider a weighted undirected graph $G=(V,E,\mathit{{w}})$ over a set of vertices $V$ and edges $E\subseteq{{{V}\choose 2}}$ . We use the notation ${{V}\choose 2}$ to denote the set of unordered pairs of distinct vertices from $V$ . The function $\mathit{{w}}:E\rightarrow{\mathbb{R}}$ assigns a weight $\mathit{{w}}\mathopen{}\left(e\right)$ to each edge $e\in E$ . Also, given a subset of vertices $V^{\prime}\subseteq V$ we denote by $E(V^{\prime})$ the set of edges in the induced subgraph of $G$ defined by $V^{\prime}$ .

The definitions and algorithms in this paper rely on a notion of edge density, which is defined not only over subsets of vertices, but also over arbitrary pairs of subsets of vertices. Even though it is conceptually simple, our edge-density definition requires slightly complex notation for determining the set of potential edges to be used as a denominator in the density ratio. To simplify our presentation we use the notation described below.

Given the graph $G=(V,E,\mathit{{w}})$ , we consider its completed representation ${G}_{0}=(V,{E}_{0},\mathit{{w}_{0}})$ , where ${E}_{0}={{V}\choose 2}$ , and where $\mathit{{w}_{0}}$ is an extension of $\mathit{{w}}$ , so that $\mathit{{w}_{0}}\mathopen{}\left(e\right)=\mathit{{w}}\mathopen{}\left(e\right)$ if $e\in E$ , and $\mathit{{w}_{0}}\mathopen{}\left(e\right)=0$ if $e\not\in E$ . In other words, ${G}_{0}$ can be seen as a complete graph, where all non-edges of $G$ become zero-weight edges in ${G}_{0}$ . We note again that we use the completed graph representation only to simplify our notation; in our implementation there is no need to store the zero-weight edges.

Now consider the completed representation ${G}_{0}=(V,{E}_{0},\mathit{{w}_{0}})$ of a graph $G$ , and let $F\subseteq{E}_{0}$ be a non-empty subset of edges. We define the weight and density of $F$ as

$$w(F) = \sum_{e \in F} w(e) \quad\text{and}\quad d(F) = \frac{w(F)}{|F|}.$$

Consider now two subsets of vertices $S,T\subseteq V$ . We define the set of cross edges from $S$ to $T$ as $\mathit{c}\mathopen{}\left(S,T\right)=\left\{(x,y)\in E\mid x\in S,y\in T\right\}$ . It is important to note that we do not impose any constraint on the sets $S$ and $T$ ; they may overlap in an arbitrary way. For instance, if the sets $S$ and $T$ are disjoint the edges in $\mathit{c}\mathopen{}\left(S,T\right)$ are the cut edges from $S$ to $T$ , while if $S\subseteq T$ the edge set $\mathit{c}\mathopen{}\left(S,T\right)$ contains, among others, all the edges within $S$ .

Finally, we write $\mathit{{w}}\mathopen{}\left(S,T\right)$ as a shorthand of $\mathit{{w}}\mathopen{}\left(\mathit{c}\mathopen{}\left(S,T\right)\right)$ and we write $\mathit{d}\mathopen{}\left(S,T\right)$ as a shorthand of $\mathit{d}\mathopen{}\left(\mathit{c}\mathopen{}\left(S,T\right)\right)$ .
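The notation above can be made concrete with a small Python sketch. It is only an illustration: the helper names `weight`, `density`, and `cross_pairs` are ours, the graph is stored as a dictionary from unordered vertex pairs to weights, and the cross pairs are taken over the completed graph so that non-edges count with weight 0 in the denominator, which is our reading of how the density ratio is meant to be used.

```python
def weight(pairs, w):
    """w(F): total weight of a set of vertex pairs; non-edges of G count as 0."""
    return sum(w.get(p, 0.0) for p in pairs)


def density(pairs, w):
    """d(F) = w(F) / |F| for a non-empty set F of vertex pairs."""
    return weight(pairs, w) / len(pairs)


def cross_pairs(S, T):
    """c(S, T) over the completed graph (assumption): unordered pairs {x, y}
    with x != y, x in S and y in T. S and T may overlap arbitrarily."""
    return {frozenset((x, y)) for x in S for y in T if x != y}


# Toy graph on {a, b, c}: edges ab (weight 1) and bc (weight 2); ac is a non-edge.
w = {frozenset("ab"): 1.0, frozenset("bc"): 2.0}

print(density(cross_pairs({"a"}, {"b", "c"}), w))            # pairs ab, ac
print(density(cross_pairs({"a", "b"}, {"a", "b", "c"}), w))  # pairs ab, ac, bc
```

In the second call $S\subseteq T$, so the pair inside $S$ is counted once, matching the remark that $c(S,T)$ then contains all the edges within $S$.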

3 Nested Communities

As we discussed in the introduction, our goal is to find the optimal sequence of nested communities, with respect to a set of source vertices of the input graph. We denote this set of source vertices by $S$ . For conceptual simplicity, one may think of $S$ as a singleton set, that is, identifying the sequence of nested communities for a single vertex. However, all our problem definitions, algorithms, and proofs, hold for the general case of $S$ being any subset of $V$ .

Our objective is to find $k$ nested communities, where the parameter $k$ is part of the problem input. Given a set of source vertices $S$ , we represent a sequence of nested communities with respect to $S$ , by the sequence of vertex sets $S=V_{0}\subseteq V_{1}\subseteq\cdots\subseteq V_{k}=V$ .

Intuitively, the inner sets of the nested-community sequence are expected to be more strongly related to the source set $S$ . This type of relatedness is expressed by the notion of density. So, $V_{1}$ is the densest community that contains $S$ , $V_{2}$ is the second densest community, and in general, we require that the density of $V_{i}$ should decrease as $i$ increases.

Considering the requirement of monotonically decreasing density in isolation is not sufficient to determine in a well-defined manner a desirable sequence of nested communities. Indeed, given a graph $G$ , a set of source vertices $S$ , and integer $k$ , there is a potentially exponential number of ways to partition the set of vertices of the graph into a sequence of nested communities $V_{0},\ldots,V_{k}$ .

The main question we are facing is to decide where exactly to draw the boundary between each pair of communities $V_{i}$ and $V_{i+1}$ . To answer this question, we follow an approach inspired by segmentation problems. In particular, our approach is as follows: consider the set of vertices $D_{i+1}=V_{i+1}\setminus V_{i}$ that need to be added to the community $V_{i}$ in order to form community $V_{i+1}$ . Consider also the set of edges $E_{i+1}=E(V_{i+1})\setminus E(V_{i})$ , defined as the additional edges brought in by extending the community $V_{i}$ to the community $V_{i+1}$ . We can then define the density of the set of edges $E_{i+1}$ . To capture the intuition that the set $D_{i+1}$ should form a coherent extension to $V_{i}$ we require that the density of $E_{i+1}$ is as uniform as possible.

The notion of uniformity for a set of edges can be expressed in many ways; we use the sum of squared differences between the weight of each edge and the average weight of the set. We thus have the following definition.

Definition 1

Given a set of edges $F\subseteq E$ , we define the density-uniformity score as

$$q(F) = \sum_{e \in F} \bigl(w(e) - d(F)\bigr)^{2}.$$

Our goal is then to find a sequence of nested communities so that the successive segments of added edges are as uniform as possible with respect to their density. Formulating this objective as an optimization problem not only gives meaningful semantics to the nested community detection problem, but it also makes the problem well-defined. Motivated by the discussion above, our main problem definition is given below.

Problem 1

Given a weighted input graph $G=(V,E,\mathit{{w}})$ , a set of source vertices $S\subset V$ , and an integer $k$ , find the sequence of nested communities $\mathcal{V}=\{S=V_{0}\subseteq V_{1}\subseteq\cdots\subseteq V_{k}=V\}$ that minimizes the density-uniformity score

$$q(\mathcal{V}) = \sum_{i=1}^{k} q\bigl(E(V_{i}) \setminus E(V_{i-1})\bigr),$$

subject to the constraint $\mathit{d}\mathopen{}\left(V_{i}\right)<\mathit{d}\mathopen{}\left(V_{i-1}\right)$ for $i=2,\ldots,k$ .
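As a rough illustration of the objective in Problem 1, the following Python sketch evaluates the density-uniformity score of a given nested sequence. The function names are ours, edge sets are taken in the completed graph (non-edges weigh 0), and an empty segment is assigned score 0 by convention. The sketch only evaluates the score; it does not check the density constraint and does not search for an optimal sequence.

```python
from itertools import combinations


def induced_pairs(vertices):
    """E(V') in the completed graph: all unordered pairs inside `vertices`."""
    return {frozenset(p) for p in combinations(vertices, 2)}


def q(pairs, w):
    """Density-uniformity score: squared deviation of each pair's weight
    from the mean weight d(F). An empty segment scores 0 (our convention)."""
    values = [w.get(p, 0.0) for p in pairs]
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def q_sequence(communities, w):
    """Sum of q over the segments E(V_i) minus E(V_{i-1}), i = 1..k."""
    total = 0.0
    for inner, outer in zip(communities, communities[1:]):
        total += q(induced_pairs(outer) - induced_pairs(inner), w)
    return total


# Toy graph: edges sa (weight 2) and sb (weight 1); ab is a non-edge.
w = {frozenset("sa"): 2.0, frozenset("sb"): 1.0}
print(q_sequence([{"s"}, {"s", "a"}, {"s", "a", "b"}], w))  # two segments
print(q_sequence([{"s"}, {"s", "a", "b"}], w))              # one segment
```

On this toy input the two-segment sequence scores lower than the single segment, because the heavy edge sa gets a segment of its own.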

4 An Algorithm for Discovering Nested Communities

In this section we present our algorithm for discovering nested communities. We begin by demonstrating a necessary condition for the optimal solution based on dense subgraphs. Discovering such subgraphs turns out to be computationally intractable. We then split the original problem into two subproblems: discovering community sequence for a fixed order of vertices, a problem which we can solve efficiently, and discovering such an order. We provide a simple heuristic for discovering an order, and provide theoretical evidence that this order is good.

4.1 Nested Communities and Dense Subgraphs

We start our discussion by demonstrating a connection of the problem of finding the optimal sequence of nested communities, i.e., solving Problem 1, with problems related to finding dense subgraphs of a given graph.

To establish this connection, consider a triple of communities $V_{i-1}\subseteq V_{i}\subseteq V_{i+1}$ in an optimal solution to Problem 1. Consider the two corresponding segments $D_{i+1}=V_{i+1}\setminus V_{i}$ and $D_{i}=V_{i}\setminus V_{i-1}$. Consider also any two subsets of those segments, $X\subseteq D_{i+1}$ and $Y\subseteq D_{i}$, that is, $X$ is a subset of the outer segment, while $Y$ is a subset of the inner segment; see Figure 1(a) for a visualization. As we will show shortly, adding the outer subset $X$ to the community $V_{i}$ leads to a situation where the density of the subset $X$ with respect to the overall community $V_{i}$ is no better than the density of the subset $Y$ with respect to the community $V_{i}$. Otherwise, either adding $X$ to $V_{i}$ (see Figure 1(b)) or removing $Y$ from $V_{i}$ (see Figure 1(c)) leads to a better solution. This follows from the fact that we require that the densities of the nested communities in any feasible solution of Problem 1 decrease monotonically.

Before proceeding to discussing the implications of this observation, we first give a formal statement and its proof.

Proposition 1

Consider a graph $G=(V,E,\mathit{{w}})$ , a set of source vertices $S\subseteq V$ , and an integer $k$ . Let $\mathcal{V}=\left(S=V_{0}\subseteq V_{1}\subseteq\cdots\subseteq V_{k}=V\right)$ be the optimal sequence of nested communities, that is, a solution to Problem 1. Fix $i$ such that $1\leq i\leq k-1$ and let $X\subseteq V_{i+1}\setminus V_{i}$ and $Y\subseteq V_{i}\setminus V_{i-1}$ . Then

$$d(X, X \cup V_{i}) \leq d(Y, V_{i}).$$

For the proof of the proposition we require the following lemma, which states that the weighted squared error of a set of numbers with respect to a single point increases with the distance of that point from the weighted mean of the numbers. The lemma can be derived by simple algebraic manipulations, and its proof is omitted.

Lemma 1

Let ${w_{1}},\ldots,{w_{N}}$ and ${x_{1}},\ldots,{x_{N}}$ be two sets of real numbers. Let $W=\sum_{i=1}^{N}w_{i}$ and $\mu=\frac{1}{W}\sum_{i=1}^{N}w_{i}x_{i}$. For any real number $d$, we have

$$\sum_{i=1}^{N} w_{i} (x_{i} - d)^{2} = \sum_{i=1}^{N} w_{i} (x_{i} - \mu)^{2} + W (d - \mu)^{2}.$$
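Since the proof of Lemma 1 is omitted, a quick numerical sanity check may be helpful. The Python sketch below evaluates both sides of the identity for random positive weights; the function name `lemma1_sides` and the sampled data are ours and serve only as an illustration, not as a proof.

```python
import random


def lemma1_sides(w, x, d):
    """Return (lhs, rhs) of Lemma 1 for weights w, values x and a point d."""
    W = sum(w)
    mu = sum(wi * xi for wi, xi in zip(w, x)) / W  # weighted mean
    lhs = sum(wi * (xi - d) ** 2 for wi, xi in zip(w, x))
    rhs = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) + W * (d - mu) ** 2
    return lhs, rhs


rng = random.Random(0)  # fixed seed so the check is repeatable
w = [rng.uniform(0.5, 2.0) for _ in range(20)]
x = [rng.uniform(-1.0, 1.0) for _ in range(20)]
for d in (-3.0, 0.0, 1.7):
    lhs, rhs = lemma1_sides(w, x, d)
    assert abs(lhs - rhs) < 1e-9
    print(d, lhs, rhs)
```

The second term on the right-hand side is the only one that depends on $d$, which is why the error grows with the distance between $d$ and $\mu$.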

We are now ready to prove the proposition.

Proof (Proposition 1)

Let $C_{1}=E(V_{i+1})\setminus E(V_{i})$ and $C_{2}=E(V_{i})\setminus E(V_{i-1})$ . Let us break $C_{1}$ into two parts, $D_{11}=\mathit{c}\mathopen{}\left(X,X\cup V_{i}\right)$ and $D_{12}=C_{1}\setminus D_{11}$ . Similarly, let us break $C_{2}$ into two parts, $D_{21}=\mathit{c}\mathopen{}\left(Y,V_{i}\right)$ and $D_{22}=C_{2}\setminus D_{21}$ . Define the centroids $\mu_{ij}=\mathit{d}\mathopen{}\left(D_{ij}\right)$ and $\lambda_{i}=\mathit{d}\mathopen{}\left(C_{i}\right)$ . Lemma 1 now implies that

[TABLE]

where const is equal to

$$\sum_{i=1}^{2} \Bigl[ q(D_{i1}) + q(D_{i2}) + |D_{i2}| (\mu_{i2} - \lambda_{i})^{2} \Bigr].$$

Since $\mathcal{V}$ is optimal we must have $s\leq s_{1}$ and $s\leq s_{2}$ . Otherwise, we can obtain a better segmentation by attaching $X$ to $V_{i}$ or deleting $Y$ from $V_{i}$ . This implies that ${\left|\mu_{21}-\lambda_{2}\right|}\leq{\left|\mu_{21}-\lambda_{1}\right|}$ and ${\left|\mu_{11}-\lambda_{1}\right|}\leq{\left|\mu_{11}-\lambda_{2}\right|}$ . Since $\lambda_{2}\geq\lambda_{1}$ , this implies that $\mu_{21}\geq(\lambda_{1}+\lambda_{2})/2$ and $\mu_{11}\leq(\lambda_{1}+\lambda_{2})/2$ , which implies $\mu_{11}\leq\mu_{21}$ . This completes the proof. ∎
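For readers who want the intermediate step spelled out, the following LaTeX block gives one way to expand the three scores used in the proof. It is our reconstruction from Lemma 1 and the definitions of $C_{i}$, $D_{ij}$, $\mu_{ij}$ and $\lambda_{i}$, not a verbatim copy of the authors' display. Here $s_{1}$ is the score after moving $Y$ out of $V_{i}$, $s_{2}$ is read as the score after moving $X$ into $V_{i}$, and const collects the terms $q(D_{i1}) + q(D_{i2}) + |D_{i2}|(\mu_{i2}-\lambda_{i})^{2}$ for $i=1,2$.

```latex
% Lemma 1 with unit weights, applied to an edge set F and any real c:
%   \sum_{e \in F} (w(e) - c)^2 = q(F) + |F| (d(F) - c)^2 .
% In particular q(F) \le \sum_{e \in F} (w(e) - c)^2 for every c.
\begin{align*}
s     &= q(C_1) + q(C_2)
       = \mathrm{const} + |D_{11}|(\mu_{11}-\lambda_1)^2 + |D_{21}|(\mu_{21}-\lambda_2)^2, \\
s_1   &= q(C_1 \cup D_{21}) + q(D_{22})
       \le \mathrm{const} + |D_{11}|(\mu_{11}-\lambda_1)^2 + |D_{21}|(\mu_{21}-\lambda_1)^2, \\
s_2   &= q(D_{12}) + q(C_2 \cup D_{11})
       \le \mathrm{const} + |D_{11}|(\mu_{11}-\lambda_2)^2 + |D_{21}|(\mu_{21}-\lambda_2)^2 .
\end{align*}
% s \le s_1 then gives |\mu_{21}-\lambda_2| \le |\mu_{21}-\lambda_1|,
% and s \le s_2 gives |\mu_{11}-\lambda_1| \le |\mu_{11}-\lambda_2|.
```

The upper bounds on $s_{1}$ and $s_{2}$ come from measuring the enlarged segment against the old centroid ($\lambda_{1}$ or $\lambda_{2}$) instead of its own mean, which can only increase the squared error.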

Proposition 1 implies that in an optimal solution the graph vertices can be ordered in such a way so that subgraph density, as specified by the proposition, decreases along this order. This observation motivates the following greedy algorithm for solving the problem of discovering nested communities:

Algorithm outline: Greedy–add–densest–subgraph

1. Start with $S$, the set of source vertices.

2. Given the current set $S$, find a subset of vertices $T$ that maximizes $\mathit{d}\mathopen{}\left(T,S\cup T\right)$.

3. Set $S\leftarrow S\cup T$, and repeat the previous step until the set $S$ includes all the vertices of the graph.

4. Consider the vertices in the order discovered by the previous process. Find the optimal sequence of $k$ nested communities that respects this order.

One potential problem with the above greedy approach is that the subroutine called iteratively in step 2 is an NP-hard problem. This is formalized below as problem DenseSuperset.

Problem 2 (DenseSuperset)

Given a weighted graph $G=(V,E,\mathit{{w}})$ and a subset of vertices $S\subseteq V$, find a subset of vertices $T$ maximizing $\mathit{d}\mathopen{}\left(T,S\cup T\right)$.

Proposition 2

The DenseSuperset problem is NP-hard.

The proof of Proposition 2 is given in Section 4.3.

Similarly, one can think of solving the problem by working in the opposite direction, that is, start with the whole vertex set $V$ and “peel off” the set $V$ by removing the sparsest subgraph, until left with the set of source vertices $S$. The corresponding algorithm is the following.

Algorithm outline: Greedy–remove–sparsest–subgraph

1. Start with $V$, the vertex set of $G$.

2. Given a current set $V$, find a subset of vertices $T$ that does not include the source vertex set $S$ and minimizes the density $\mathit{d}\mathopen{}\left(T,V\right)$.

3. Set $V\leftarrow V\setminus T$, and repeat the previous step until left only with the set of source vertices $S$.

4. Consider the vertices in the order removed by the previous process. Find the optimal sequence of $k$ nested communities that respects this order.

Not surprisingly, the problem of finding the sparsest subgraph, which corresponds to step 2 of the above process, is NP-hard.

Problem 3 (SparseNbhd)

Given a weighted graph $G=(V,E,\mathit{{w}})$ find a set of vertices $T$ minimizing $\mathit{d}\mathopen{}\left(T,V\right)$.

Proposition 3

The SparseNbhd problem is NP-complete.

The proof is again given in Section 4.3.

4.2 Algorithm for Discovering Nested Communities

Armed with intuition from the previous section, we now proceed to discuss the proposed algorithm. The underlying principle of both of the greedy algorithms described above is to consider the vertices of the graph in a specific order and then find a sequence of nested communities that respects this order. In one case, the order of graph vertices is obtained by starting from $S$ and iteratively adding the densest subgraph, while in the other case, the order is obtained by starting from the full vertex set $V$ and iteratively removing the sparsest subgraph.

Our algorithm is an instantiation of this general principle. We specify in detail ( $i$ ) how to obtain an order of the graph vertices, and ( $ii$ ) how to find a sequence of nested communities that respects a given order.

We start our discussion from the second task, i.e., finding the sequence of nested communities given an order. As it turns out, this problem is an instance of sequence segmentation problems. We define this problem below, which is a refinement of Problem 1.

Problem 4 (Sequence of nested communities from a given order)

Given a graph $G=(V,E,\mathit{{w}})$ with ordered vertices, a set of source vertices $S=\left\{v_{1},\ldots,v_{s}\right\}\subset V$ , and an integer $k$ , find a monotonically increasing sequence of $k+1$ integers $b=\left(b_{0}=s,\ldots,b_{k}={\left|V\right|}\right)$ such that

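$$\mathcal{V} = (S = V_{0} \subseteq V_{1} \subseteq \dots \subseteq V_{k} = V), \quad\text{where } V_{i} = \{v_{1}, \dots, v_{b_{i}}\},$$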

minimizes the density-uniformity score $\mathit{q}\mathopen{}\left(\mathcal{V}\right)$ and satisfies the monotonicity constraint $\mathit{d}\mathopen{}\left(V_{i}\right)<\mathit{d}\mathopen{}\left(V_{i-1}\right)$ for $i=1,\ldots,k$ .

It is quite easy to see that Problem 4 can be cast as a segmentation problem. Typical segmentation problems can be solved optimally using dynamic programming, as shown by Bellman [3]. The most interesting aspect of Problem 4, seen as a segmentation problem, is the monotonicity constraint $\mathit{d}\mathopen{}\left(V_{i}\right)<\mathit{d}\mathopen{}\left(V_{i-1}\right)$, for $i=1,\ldots,k$. That is, not only do we ask to segment the ordered sequence of vertices so that we minimize the density variance on the segments, but we also require that the density scores of the segments decrease monotonically. The situation can be abstracted to the monotonic segmentation problem stated below.

Problem 5 (Monotonic segmentation)

Let ${a_{1}},\ldots,{a_{n}}$ and ${x_{1}},\ldots,{x_{n}}$ be two sequences of real numbers. Given an integer $k$ , find $k+1$ indices $b_{0}=1,\ldots,b_{k}=n+1$ minimizing

$$\sum_{j=1}^{k} \sum_{i=b_{j-1}}^{b_{j}-1} a_{i} (x_{i} - \mu_{j})^{2},$$

where $\mu_{j}$ is the weighted centroid of $j$ -th segment such that $\mu_{j}<\mu_{j-1}$ .

In order to express Problem 4 with Problem 5, consider a group of edges, $P_{i}=\mathit{c}\mathopen{}\left(v_{i},\left\{v_{1},\ldots,v_{i-1}\right\}\right)$ for each vertex $v_{i}\in V\setminus S$ . If we set $a_{i}={\left|P_{i+{\left|S\right|}}\right|}$ and $x_{i}=\mathit{d}\mathopen{}\left(P_{i+{\left|S\right|}}\right)$ , we can apply Lemma 1 and show that the score of community sequence is equal to the variance minimized by Problem 5, plus a constant. In fact, this constant is the sum of the variances within each $P_{i}$ .
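The reduction from an ordered vertex list to the weighted sequence of Problem 5 can be sketched in a few lines of Python. The function name `order_to_sequence` and the dictionary-based graph representation are ours; pairs are counted in the completed graph, so every earlier vertex contributes one (possibly zero-weight) pair, and at least one source vertex is assumed so that no $P_{i}$ is empty.

```python
def order_to_sequence(order, num_sources, w):
    """Map an ordered vertex list (sources first) to the pairs (a_i, x_i).

    For each non-source vertex v, P is the set of pairs joining v to the
    vertices that precede it in the order; a = |P| and x = d(P).
    Assumes num_sources >= 1 and w maps frozenset({u, v}) to a weight.
    """
    a, x = [], []
    for pos in range(num_sources, len(order)):
        v = order[pos]
        weights = [w.get(frozenset((v, u)), 0.0) for u in order[:pos]]
        a.append(len(weights))                 # number of pairs in P
        x.append(sum(weights) / len(weights))  # their average weight
    return a, x


# Toy graph: edges sa (weight 2) and sb (weight 1); ab is a non-edge.
w = {frozenset("sa"): 2.0, frozenset("sb"): 1.0}
a, x = order_to_sequence(["s", "a", "b"], 1, w)
print(a, x)
```

Each $a_{i}$ acts as the weight of the point $x_{i}$, so a segment's centroid $\mu_{j}$ is the average weight of all pairs added by that segment.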

Similarly to the unconstrained segmentation problem, the monotonic segmentation problem can be solved optimally. The idea is to use, as a preprocessing step, the classic “pool of adjacent violators” algorithm (PAV) [2], which merges points until there are no monotonicity violations, and then to apply the classic dynamic-programming algorithm to the resulting sequence of merged points. The PAV step runs in $O({\left|V\right|})$ time. By definition the merged points do not contain any monotonicity violations, and thus the resulting segmentation respects the monotonicity constraint as well. As shown by Haiminen et al. [14], this two-phase algorithm gives the optimal $k$-segmentation under the monotonicity constraint. Consequently, Problem 4 can be solved optimally.
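The two-phase idea can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: weights are assumed positive, `pav_decreasing` merges adjacent points until block means are strictly decreasing, and `segment` is a plain $O(kn^{2})$ dynamic program over the merged blocks. The returned cost omits the variance inside each merged block, which by Lemma 1 is a constant that does not depend on the segmentation.

```python
def pav_decreasing(a, x):
    """Pool adjacent violators: merge neighbours until the weighted block
    means are strictly decreasing. Returns a list of [weight, mean]."""
    blocks = []
    for ai, xi in zip(a, x):
        blocks.append([ai, xi])
        # Merge while the newest block is not strictly below its predecessor.
        while len(blocks) > 1 and blocks[-1][1] >= blocks[-2][1]:
            a2, m2 = blocks.pop()
            a1, m1 = blocks.pop()
            blocks.append([a1 + a2, (a1 * m1 + a2 * m2) / (a1 + a2)])
    return blocks


def segment(blocks, k):
    """Split the blocks into k contiguous segments minimising the weighted
    squared error to each segment mean. Returns (cost, segment end indices);
    the cost is infinite if there are fewer than k blocks."""
    n = len(blocks)
    A, AX, AXX = [0.0], [0.0], [0.0]  # prefix sums of a, a*m and a*m^2
    for a, m in blocks:
        A.append(A[-1] + a)
        AX.append(AX[-1] + a * m)
        AXX.append(AXX[-1] + a * m * m)

    def cost(i, j):  # squared error of blocks i..j-1 around their mean
        s = AX[j] - AX[i]
        return (AXX[j] - AXX[i]) - s * s / (A[j] - A[i])

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for end in range(j, n + 1):
            for start in range(j - 1, end):
                c = best[j - 1][start] + cost(start, end)
                if c < best[j][end]:
                    best[j][end], back[j][end] = c, start
    bounds, end = [], n
    for j in range(k, 0, -1):  # walk the back-pointers to recover the cuts
        bounds.append(end)
        end = back[j][end]
    return best[k][n], bounds[::-1]


blocks = pav_decreasing([1, 1, 1, 1, 1], [5, 6, 2, 3, 1])
print(blocks)
print(segment(blocks, 2))
```

Because block means are strictly decreasing after the first phase, the means of any contiguous grouping of blocks are strictly decreasing as well, so the second phase does not need to track the constraint.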

We next proceed to discuss the first component of the algorithm, namely, how to obtain an order of the graph vertices. Recall that, according to the principles discussed in the previous section, we can either start from $S$ and iteratively add dense subgraphs, or start from $V$ and remove sparse subgraphs. We follow the latter approach. In order to overcome the NP-hard problem of finding the sparsest subgraph and in order to obtain a total order, we use the heuristic of iteratively removing the sparsest subgraph of size one, namely, a single vertex. The sparsest one-vertex subgraph is simply the vertex with the smallest weighted degree. Thus, overall, we obtain the simple algorithm SortVertices, whose pseudocode is given as Algorithm 1.
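Algorithm 1 itself is not reproduced in this text, so the following Python sketch shows one possible reading of the peeling heuristic described above. The function name `sort_vertices`, the dictionary-based graph representation, and the tie-breaking rule are our assumptions. The loop is a simple quadratic-time version; a priority queue would be the natural way to speed it up.

```python
def sort_vertices(vertices, w, sources):
    """Order vertices by repeatedly removing the non-source vertex with the
    smallest weighted degree in the remaining graph. The result lists the
    sources first, followed by the other vertices in reverse removal order.
    `w` maps frozenset({u, v}) to an edge weight."""
    degree = {v: 0.0 for v in vertices}
    nbrs = {v: {} for v in vertices}
    for e, we in w.items():
        u, v = tuple(e)
        nbrs[u][v] = we
        nbrs[v][u] = we
        degree[u] += we
        degree[v] += we

    candidates = set(vertices) - set(sources)
    removed = []
    while candidates:
        # Ties are broken by name only to make the output deterministic.
        v = min(candidates, key=lambda u: (degree[u], str(u)))
        candidates.remove(v)
        removed.append(v)
        for u, we in nbrs[v].items():
            degree[u] -= we  # harmless for vertices that are already removed
    return list(sources) + removed[::-1]


# Toy path s - a - b - c with weights 3, 2, 1.
w = {frozenset("sa"): 3.0, frozenset("ab"): 2.0, frozenset("bc"): 1.0}
print(sort_vertices(["s", "a", "b", "c"], w, ["s"]))
```

On the toy path the vertex farthest from the source is peeled first, so it ends up last in the returned order.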

As an interesting side remark, we note that the algorithm SortVertices is encountered in the context of finding subgraphs with the highest average degree. In particular, it is known that the densest subgraph obtained by the algorithm during the process of iteratively removing the smallest-degree vertex is a factor-2 approximation to the optimally densest subgraph in the graph [4].

The natural question to ask is how good the order produced by algorithm SortVertices is. As we will demonstrate shortly, it turns out that the order is quite good. First, we note that the optimal solution obtained for Problem 4 satisfies an analogous structural property, with respect to subgraph densities, as the optimal solution for Problem 1. We omit the proof of the following proposition as it is similar to that of Proposition 1.

Proposition 4

Consider a graph $G=(V,E,\mathit{{w}})$ with ordered vertices, a set of source vertices $S\subset V$, and an integer $k$. Let $\mathcal{V}=\left(S=V_{0}\subseteq V_{1}\subseteq\cdots\subseteq V_{k}=V\right)$ be the optimal sequence of nested communities with respect to the order, that is, a solution to Problem 4. Fix $i$ such that $1\leq i\leq k-1$ and let $b={\left|V_{i}\right|}$. Let $X\subseteq V_{i+1}\setminus V_{i}$ and $Y\subseteq V_{i}\setminus V_{i-1}$ such that $X=\left\{v_{b+1},\ldots,v_{b+{\left|X\right|}}\right\}$ and $Y=\left\{v_{b-{\left|Y\right|}+1},\ldots,v_{b}\right\}$. Then $\mathit{d}\mathopen{}\left(X,X\cup V_{i}\right)\leq\mathit{d}\mathopen{}\left(Y,V_{i}\right)$.

The only difference between Proposition 1 and Proposition 4 is that in Proposition 4 we require additionally that $V_{i+1}$ starts with $X$ and $V_{i}$ ends with $Y$ with respect to the order. We want this condition to be redundant, otherwise the given order is suboptimal. For example, consider the adjacency matrix of $G$ given in Figure 2(a). The given segmentation is optimal with respect to the given order. However, if we rearrange the vertices in $D_{1}$ and $D_{2}$ , as in Figure 2(b), then the same segmentation is no longer optimal as $X$ and $Y$ violate Proposition 4. The additional condition in Proposition 4 becomes redundant if $V_{i}$ ends with the sparsest subset while $V_{i+1}$ starts with the densest subset. We will show that the algorithm SortVertices produces an order that satisfies this property approximately. The exact formulation of our claim is given as Propositions 5 and 6.

Proposition 5

Consider a weighted graph $G=(V,E,\mathit{{w}})$ , whose vertices are ordered by algorithm SortVertices. Let $1\leq b<c\leq{\left|V\right|}$ . Let $U=\left\{v_{b},\ldots,v_{c}\right\}$ and $W=\left\{v_{1},\ldots,v_{c}\right\}$ . Let $f=\mathit{d}\mathopen{}\left(v_{c},W\right)$ . Then $f\leq 2\mathit{d}\mathopen{}\left(X,W\right)$ for any $X\subseteq U$ .

Proof

Note that $s=\sum_{x\in X}\mathit{{w}}\mathopen{}\left(x,W\right)=2\mathit{{w}}\mathopen{}\left(X\right)+\mathit{{w}}\mathopen{}\left(X,W\setminus X\right)\leq 2\mathit{{w}}\mathopen{}\left(X,W\right)$ . Write $m_{x}={\left|\mathit{c}\mathopen{}\left(x,W\right)\right|}$ . Since $v_{c}$ has the smallest $\mathit{d}\mathopen{}\left(v_{c},W\right)$ , we have

[TABLE]

Combining the inequalities and dividing by ${\left|\mathit{c}\mathopen{}\left(X,W\right)\right|}$ proves the result.∎
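
To spell out the combination step, here is a sketch under the assumptions that weights are non-negative, that $\mathit{d}\left(x,W\right)=\mathit{w}\left(x,W\right)/m_{x}$ , and that every vertex pair counted in $\mathit{c}\left(X,W\right)$ is counted in $\mathit{c}\left(x,W\right)$ for at least one $x\in X$ , so that $\sum_{x\in X}m_{x}\geq\left|\mathit{c}\left(X,W\right)\right|$ :

$$f\left|\mathit{c}\left(X,W\right)\right|\;\leq\;f\sum_{x\in X}m_{x}\;\leq\;\sum_{x\in X}m_{x}\,\mathit{d}\left(x,W\right)\;=\;s\;\leq\;2\,\mathit{w}\left(X,W\right).$$

Dividing both ends by $\left|\mathit{c}\left(X,W\right)\right|$ gives $f\leq 2\,\mathit{d}\left(X,W\right)$ .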

Proposition 6

Consider a weighted graph $G=(V,E,\mathit{{w}})$ , whose vertices are ordered by algorithm SortVertices. Let $1\leq b<c\leq{\left|V\right|}$ . Let $U=\left\{v_{b},\ldots,v_{c}\right\}$ and $W=\left\{v_{1},\ldots,v_{b-1}\right\}$ . Assume that there is $\alpha\geq 0$ such that for all $v\in U$ it is $\alpha\mathit{{w}}\mathopen{}\left(v,W\right)\geq\mathit{{w}}\mathopen{}\left(v,U\right)$ . Let $f=\mathit{d}\mathopen{}\left(v_{b},W\right)$ . Then $(1+\alpha)^{2}f\geq\mathit{d}\mathopen{}\left(X,X\cup W\right)$ for any $X\subseteq U$ .

Proof

Let $A=\mathit{c}\mathopen{}\left(X,W\right)$ and $B=\mathit{c}\mathopen{}\left(X,X\right)$ . The density of $X$ is bounded by

[TABLE]

Select $x\in X$ with the highest $\mathit{d}\mathopen{}\left(x,W\right)$ . Then $\mathit{d}\mathopen{}\left(A\right)\leq\mathit{d}\mathopen{}\left(x,W\right)$ . Let us prove that $\mathit{d}\mathopen{}\left(x,W\right)\leq(1+\alpha)f$ . If $v_{b}=x$ , then we are done. Assume that $v_{b}\neq x$ . Since $G$ is fully-connected, SortVertices always picks the vertex with the lowest weight. Let $Z=\left\{v_{1},\ldots,x\right\}$ . Then $\mathit{{w}}\mathopen{}\left(x,W\right)\leq\mathit{{w}}\mathopen{}\left(x,Z\right)\leq\mathit{{w}}\mathopen{}\left(v_{b},Z\right)=\mathit{{w}}\mathopen{}\left(v_{b},W\right)+\mathit{{w}}\mathopen{}\left(v_{b},U\right)\leq(1+\alpha)\mathit{{w}}\mathopen{}\left(v_{b},W\right)$ . Since $G$ is fully-connected, $\mathit{{w}}\mathopen{}\left(y,W\right)={\left|W\right|}\mathit{d}\mathopen{}\left(y,W\right)$ for any $y\in U$ . Hence, dividing the inequality by ${\left|W\right|}$ gives us $\mathit{d}\mathopen{}\left(x,W\right)\leq(1+\alpha)f$ , which proves the proposition.∎

4.3 Hardness of Finding Dense and Sparse Subgraphs

In this section we prove the NP-hardness results, stated in Section 4.1. We start with an auxiliary lemma.

Lemma 2

Let $x,y,a,b,c$ be real numbers. Let $r=b+(b+x)c/(y-x)$ . If

[TABLE]

Similarly, if

[TABLE]

Proof

We will only prove the first case. The other 3 cases are similar. We have $(y-x)a>(y-x)b+(b+x)c$ which is equivalent to $xy+ay+xb+ab>xy+ax+cx+by+bc+ab$ . The left-hand side is equal to $(x+a)(y+b)$ while the right-hand side is equal to $(y+a+c)(x+b)$ . The lemma follows.∎
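
The algebra behind all four cases can be collected in a single identity, which follows by expanding both products:

$$(x+a)(y+b)-(y+a+c)(x+b)\;=\;(y-x)(a-b)-(b+x)c\;=\;(y-x)(a-r),\qquad y\neq x,$$

where $r=b+(b+x)c/(y-x)$ as in the lemma. Hence the sign of the difference between the two products is the sign of $(y-x)(a-r)$ , and each case amounts to fixing the signs of $y-x$ and $a-r$ .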

We now give the proofs of Propositions 2 and 3.

Proposition 2

The DenseSuperset problem is NP-hard.

Proof

To prove the hardness, we will reduce Clique to DenseSuperset. Let $G=(V,E)$ be the given graph. Let us create a new graph $G^{\prime}$ by adding one extra vertex, say $s$ , to $G$ and connecting every vertex in $G$ to $s$ . We set $\mathit{{w}}\mathopen{}\left(e\right)$ to be $1$ for any edge in $E$ and $\alpha$ , which we will define later, if $e$ is adjacent to $s$ . Finally, we connect the non-connected vertices with edges of weight [math]. We will use $G^{\prime}$ , $S=\left\{s\right\}$ , and $\mathit{{w}}$ as inputs to DenseSuperset.

Our next step is to define $\alpha$ such that the maximum clique will also have the largest density. In order to do that, let $X$ be a clique of size $N$ in $G$ . Then the weight of $X$ is equal to

[TABLE]

If we have a non-clique subgraph of size $N$ , then obviously its weight is strictly smaller than $\mathit{d}\mathopen{}\left(X,X\cup S\right)$ .

Assume a set of vertices $Z$ with $K<N$ vertices. The weight of $Z$ is bounded by

[TABLE]

We want $\mathit{d}\mathopen{}\left(X,X\cup S\right)>\mathit{d}\mathopen{}\left(Z,Z\cup S\right)$ , which is guaranteed if

[TABLE]

Since $N-1>K-1$ , Lemma 2 implies that if

[TABLE]

then the inequality in Eq. 1 is guaranteed.

Let $Y$ be a non-clique of size $M>N$ in $G$ . Then the weight of $Y$ is bounded by

[TABLE]

We need to have $\mathit{d}\mathopen{}\left(X,X\cup S\right)>\mathit{d}\mathopen{}\left(Y,Y\cup S\right)$ , which is guaranteed if

[TABLE]

Since $N-1<M-1$ , Lemma 2 guarantees that if

[TABLE]

then the inequality in Eq. 2 is guaranteed. If we choose $\alpha=1-0.5/{\left|V\right|}^{2}$ , both inequalities in Eqs. 1–2 are now guaranteed.

Let $k$ be the minimum size of the clique given as a parameter in Clique. Set $\beta=\frac{k-1+2\alpha}{k-1+2}$ . If $G$ contains a clique of size $k$ , then there is a subgraph in $G^{\prime}$ with a density of $\beta$ . Assume now that $G^{\prime}$ contains a subgraph, say $H$ , with a density of at least $\beta$ . $H$ must contain at least $k$ vertices, otherwise the bound in Eq. 1 is violated. $H$ must be a clique, otherwise the bound in Eq. 2 is violated. Consequently, $G$ has a clique of size $k$ if and only if $G^{\prime}$ has a subgraph of density at least $\beta$ . The reduction is polynomial. This concludes the proof.∎
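
The threshold $\beta$ can be sanity-checked numerically. The sketch below assumes that $\mathit{d}\left(X,X\cup S\right)$ is the total weight of the pairs with at least one endpoint in $X$ divided by the number of such pairs; the function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction

def clique_density_with_source(n, alpha):
    """Density of an n-clique X inside X ∪ {s} in the reduction graph.

    Assumes unit weights inside the clique and weight alpha on each edge
    to the extra vertex s.
    """
    inner_pairs = Fraction(n * (n - 1), 2)
    weight = inner_pairs + n * alpha  # clique edges plus n edges to s
    pairs = inner_pairs + n           # all pairs touching X within X ∪ {s}
    return weight / pairs

def beta(k, alpha):
    """Threshold from the proof: (k - 1 + 2*alpha) / (k - 1 + 2)."""
    return (k - 1 + 2 * alpha) / Fraction(k - 1 + 2)

num_vertices = 10  # |V| of a hypothetical input graph
alpha = 1 - Fraction(1, 2 * num_vertices**2)  # alpha = 1 - 0.5 / |V|^2
densities = [clique_density_with_source(n, alpha) for n in range(1, num_vertices + 1)]
```

Under these assumptions the density of a $k$-clique equals $\beta$ exactly, and because $\alpha<1$ the density grows with the clique size, which is what the reduction relies on.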

Proposition 3

The SparseNbhd problem is NP-hard.

Proof

To prove the hardness, we will reduce Clique to SparseNbhd. Let $G=(V,E)$ be the given graph. We will define $G^{\prime}=(V^{\prime},E^{\prime})$ as follows. First we attach two vertices $s$ and $t$ to $G$ . Select one vertex, say $s$ , from the clique and connect each vertex in $G$ to $s$ . We connect the non-connected vertices with edges of weight [math]. Let $P={\left|V^{\prime}\right|}-1$ . We weight the edges in $G$ with $1$ and define $\alpha=P-0.5/P^{2}$ . Set the weight of an edge $\mathit{{w}}\mathopen{}\left((s,n)\right)=\alpha-\deg\left(n\right)$ , for each $n\in V$ . Due to this scheme we have $\sum_{(n,y)\in E^{\prime}}\mathit{{w}}\mathopen{}\left((n,y)\right)=\alpha$ for any $n\in V$ . Finally, we set $\mathit{{w}}\mathopen{}\left((s,t)\right)={\left|V^{\prime}\right|}\alpha$ . This weight is so large that no solution for SparseNbhd will contain $s$ or $t$ .

Let $X$ be a clique of size $N$ in $G$ . Then the weight of $X$ is equal to

[TABLE]

If we have a non-clique subgraph of size $N$ , then obviously its weight is strictly larger than $\mathit{d}\mathopen{}\left(X,V^{\prime}\right)$ .

Assume a set $Z\subseteq V$ with $K<N$ vertices. The weight of $Z$ is bounded by

[TABLE]

We want $\mathit{d}\mathopen{}\left(X,V^{\prime}\right)<\mathit{d}\mathopen{}\left(Z,V^{\prime}\right)$ , which is guaranteed if

[TABLE]

Since $-K+1>-N+1$ , Lemma 2 implies that if

[TABLE]

then the inequality in Eq. 3 is guaranteed. This is guaranteed by our choice of $\alpha$ .

Let $Y\subseteq V$ be a non-clique of size $M>N$ in $G$ . Then the weight of $Y$ is bounded by

[TABLE]

We need to have $\mathit{d}\mathopen{}\left(X,V^{\prime}\right)<\mathit{d}\mathopen{}\left(Y,V^{\prime}\right)$ , which is guaranteed if

[TABLE]

Since $-M+1<-N+1$ , Lemma 2 guarantees that if

[TABLE]

then Eq. 4 is guaranteed. This is guaranteed by our choice of $\alpha$ .

Let $k$ be the minimum size of the clique given as a parameter in Clique. Set $\beta=\frac{2\alpha-k+1}{2P-k+1}$ . If $G$ contains a clique of size $k$ , then there is a subgraph in $G^{\prime}$ with a density of $\beta$ . Assume now that $G^{\prime}$ contains a subgraph, say $H$ , with a density of at most $\beta$ . Note that $\beta$ is largest when $k=1$ , that is, $\beta\leq\alpha/P$ . If $s$ or $t$ is contained in $H$ , then the density is at least $2\mathit{{w}}\mathopen{}\left((s,t)\right)/P(P+1)>\alpha/P$ , which is a contradiction. Hence $H$ is a subgraph of $G$ . $H$ must contain at least $k$ vertices, otherwise the bound in Eq. 3 is violated. $H$ must be a clique, otherwise the bound in Eq. 4 is violated. Consequently, $G$ has a clique of size $k$ if and only if $G^{\prime}$ has a subgraph of density at most $\beta$ . The reduction is polynomial. This concludes the proof.∎
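
As with the previous reduction, the threshold can be checked numerically. The sketch assumes that $\mathit{d}\left(X,V^{\prime}\right)$ is the total weight of the pairs with at least one endpoint in $X$ divided by the number of such pairs, and that every vertex of $V$ has total incident weight $\alpha$ as stated in the construction; function names are hypothetical.

```python
from fractions import Fraction

def sparse_nbhd_clique_density(n, p, alpha):
    """Density d(X, V') of an n-clique X, with P = |V'| - 1 = p.

    Each vertex of X has total incident weight alpha; the unit-weight edges
    inside the clique are counted twice in n * alpha, so they are subtracted once.
    """
    inner_pairs = Fraction(n * (n - 1), 2)
    weight = n * alpha - inner_pairs
    pairs = n * p - inner_pairs  # pairs touching X within V'
    return weight / pairs

def beta(k, p, alpha):
    """Threshold from the proof: (2*alpha - k + 1) / (2*P - k + 1)."""
    return (2 * alpha - k + 1) / (2 * p - k + 1)

num_vertices = 6                 # |V| of a hypothetical input graph
p = num_vertices + 1             # P = |V'| - 1, since V' adds s and t
alpha = Fraction(p) - Fraction(1, 2 * p * p)  # alpha = P - 0.5 / P^2
densities = [sparse_nbhd_clique_density(n, p, alpha) for n in range(1, num_vertices + 1)]
heavy_edge_bound = 2 * (p + 1) * alpha / (p * (p + 1))  # 2 w((s,t)) / (P (P + 1))
```

Here the density of a $k$-clique again equals $\beta$ , but since $\alpha<P$ it decreases as the clique grows, so larger cliques give sparser neighborhoods. The last line evaluates the bound used to exclude $s$ and $t$ .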

5 Related Work

Finding communities in graphs and social networks is one of the most well-studied topics in graph mining, and the literature on the subject is extensive. This section cannot aspire to cover all the different approaches and aspects of the problem; we only provide a brief overview of the area.

Community detection. A large part of the related work deals with the problem of partitioning a graph into disjoint clusters or communities. A number of different methodologies have been applied, such as hierarchical approaches [11], methods based on modularity maximization [1, 6, 11, 26], graph-theoretic approaches [8, 9], random-walk methods [21, 24, 28], label-propagation approaches [24], and spectral graph partition [5, 15, 18, 25]. A thorough review of community-detection methods can be found in the survey by Fortunato [10]. We note that this line of work is different from the present paper, since we do not aim at partitioning a graph into disjoint communities.

Overlapping communities. Researchers in community detection have realized that, in many real situations and real applications, it is meaningful to consider that graph vertices do not belong only to one community. Thus, one asks to partition a graph into overlapping communities. Typical methods here rely on clique percolation [19], extensions to the modularity-based approaches [12, 20], analysis of ego-networks [7], or fuzzy clustering [27]. Again the problem we address in this paper is quite different. First, we find communities centered around a given set of source vertices, and not for the whole graph. Second, the communities output by our algorithm do not have arbitrary overlaps, but they have a specific nested structure.

Centerpiece subgraphs and community search. Perhaps closer to our approach is work related to the centerpiece subgraphs and the community-search problem [23, 16, 22]. In this class of problems, a set of source vertices $S$ is given and the goal is to find a subgraph so that $S$ belongs in the subgraph and the subgraph forms a tight community. The quality of the subgraph is measured with various objective functions, such as degree [22], conductance [16], or random-walk-based measures [23]. These methods differ from the one presented here in that they return only one community, while in this paper we deal with the problem of finding a sequence of nested communities.

In summary, despite the extensive research on the topic of community detection in graphs and social networks, to the best of our knowledge, this is the first paper to address the topic of nested communities with respect to a set of source vertices. Furthermore, our approach offers novel technical ideas, such as providing a solid theoretical analysis that allows us to decompose the problem of finding nested communities into two sub-problems: ( $i$ ) ordering the set of vertices, and ( $ii$ ) segmenting the graph vertices according to that given order.

6 Experimental Evaluation

We will now provide experimental evidence that our method efficiently discovers meaningful segmentations and that our ordering algorithm outperforms several natural baselines.

Datasets and experimental setup. In our experiments we used six datasets: five obtained from Mark Newman’s webpage (http://www-personal.umich.edu/~mejn/netdata/) and a bibliographic dataset obtained from DBLP. The datasets are as follows: Adjnoun: adjacency graph of common adjectives and nouns in the novel David Copperfield, by Charles Dickens. Dolphins: an undirected social graph of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. Karate: social graph of friendships between 34 members of a karate club at a US university in the 1970s. Lesmis: coappearance graph of characters in the novel Les Miserables. Polblogs: a directed graph of hyperlinks between weblogs on US politics, recorded in 2005. DBLP: coauthorship graph between researchers in computer science. The statistics of these datasets are given in Table 1.

For each dataset and a given source set $S$ , we considered three different weighting schemes. First, we ran personalized PageRank from the source node with a restart of $0.1$ . Let $p(v)$ be the PageRank weight of vertex $v$ . Given an edge $e=(v,w)$ , the three weighting schemes are

[TABLE]

These weights are selected so that the vertices that are hard to reach with a random walk will have edges with small weights, and hence will be placed in outer communities. For DBLP, we weighted the edges during PageRank computation with the number of joint papers, each paper normalized by the number of authors. We use the vertex with the highest degree as a starting set.
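
The PageRank step can be illustrated with a plain power iteration. This is a minimal sketch, assuming that "restart of $0.1$" means the walk jumps back to the source with probability $0.1$ at each step; the function name, the fixed iteration count, and the handling of isolated vertices are hypothetical choices, and the three edge-weighting schemes themselves are not reproduced here.

```python
def personalized_pagerank(adj, source, restart=0.1, iterations=100):
    """Approximate personalized PageRank by power iteration.

    adj: dict mapping vertex -> dict of neighbor -> non-negative weight.
    At each step the walk restarts at `source` with probability `restart`,
    otherwise it follows an edge with probability proportional to its weight.
    """
    p = {v: 0.0 for v in adj}
    p[source] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj}
        nxt[source] += restart  # total restart mass, since p sums to one
        for v, nbrs in adj.items():
            out = sum(nbrs.values())
            if out == 0:
                # Vertex with no outgoing weight: send its mass back to the source.
                nxt[source] += (1 - restart) * p[v]
                continue
            for u, w in nbrs.items():
                nxt[u] += (1 - restart) * p[v] * w / out
        p = nxt
    return p

# Unweighted path a - b - c - d with the walk restarting at a.
path = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = personalized_pagerank(path, "a")
```

On the path example the far endpoint `d` receives less mass than its neighbor `c`, in line with the intuition that vertices hard to reach from the source get small weights.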

Time complexity. Our first step is to study the running time of our algorithm. We ran our experiments on a laptop equipped with a 1.8 GHz dual-core Intel Core i7 with 4 MB shared L3 cache, and typical running times for each dataset are given in the third column of Table 1 (for the code, see http://users.ics.aalto.fi/~ntatti/). Our algorithm is fast: for the largest dataset with 2 million edges, the computation took only 20 seconds. The algorithm consists of four steps: computing PageRank, ordering the vertices, grouping the vertices into blocks such that the monotonicity condition is guaranteed, and segmenting the groups. The only computationally strenuous step is segmentation, which requires quadratic time in the number of blocks. The number of vertices in DBLP is over $700\,000$ ; however, grouping according to the PAV algorithm leaves only $2\,000$ blocks, which can be easily segmented. It is possible to select weights in such a way that there will be no reduction when grouping vertices, so that finding the optimal segmentation becomes infeasible. However, in such a case, we can always resort to a near-linear approximation algorithm [13].
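
The grouping step can be illustrated with a generic pool-adjacent-violators (PAV) pass. The sketch below assumes each item in the order carries a weight and a positive count, and that the monotonicity condition asks for non-increasing block densities (weight divided by count); how the paper derives these per-vertex quantities is not shown here, and the function name is hypothetical.

```python
def pav_blocks(weights, counts):
    """Merge adjacent items until block densities are non-increasing.

    weights, counts: equal-length sequences; counts must be positive.
    Returns a list of (first_index, last_index, density) triples.
    """
    blocks = []  # each block: [total_weight, total_count, first, last]
    for i, (w, c) in enumerate(zip(weights, counts)):
        blocks.append([w, c, i, i])
        # A violation occurs when the previous block is sparser than the last one;
        # densities are compared by cross-multiplication to avoid division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            w2, c2, _, last = blocks.pop()
            blocks[-1][0] += w2
            blocks[-1][1] += c2
            blocks[-1][3] = last
    return [(first, last, w / c) for w, c, first, last in blocks]

blocks = pav_blocks([3, 1, 2, 0.5], [1, 1, 1, 1])
```

In the example the second and third items violate monotonicity and are pooled into one block of density $1.5$ , leaving three blocks. Each item is pushed and popped at most once, so the pass runs in linear time, which is why the subsequent quadratic segmentation only has to deal with the (typically far fewer) blocks.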

Comparison to baseline. A key part in our approach is discovering a good order. Our next step is to compare the order induced by SortVertices against several natural baselines. For the first baseline we group the vertices based on the length of a minimal path from the source. We then compared these communities, say $\mathcal{H}$ , to the (same number of) communities obtained with our method. The scores, given in Table 1, show that our approach beats this baseline in every case, which is expected since this naïve baseline does not take into account density. For our next two baselines we order vertices based on vertex degree and PageRank. We then compute community sequences with $2$ – $10$ communities from these orders. Typical scores are given in Figure 3. Out of $6\times 3\times 9=162$ comparisons, SortVertices wins both orders 158 times, ties once (Karate, $\mathit{{w}_{m}}$ , 3 communities) and loses 3 times to the degree order (DBLP, $\mathit{{w}_{n}}$ , 3–5 communities).
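
The hop-based baseline can be sketched with a breadth-first search. This is an illustration only, assuming that the $i$-th baseline community consists of all vertices within $i$ hops of the source set and that unreachable vertices are ignored; the function name is hypothetical.

```python
from collections import deque

def hop_communities(adj, sources):
    """Nested communities by hop distance: the i-th set holds all vertices
    within i hops of the source set (unreachable vertices are left out)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    communities, current = [], set()
    for hop in range(max(dist.values()) + 1):
        current |= {v for v, d in dist.items() if d == hop}
        communities.append(set(current))
    return communities

path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
nested = hop_communities(path, ["a"])
```

Because the grouping depends only on path lengths, two vertices at the same distance always land in the same community regardless of how densely they are connected, which is the weakness noted above.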

Examples of Communities. Our final step is to provide examples of discovered communities. In Figure 4 we provide 4 different community sequences with 3 communities using weights $\mathit{{w}_{s}}$ and $\mathit{{w}_{n}}$ and sources $S=\left\{1\right\}$ and $S=\left\{33,34\right\}$ . The inner-most community for $1$ contains a near 5-clique. The inner-most community for $33,34$ contains two 4-cliques. The normalized weight $\mathit{{w}_{n}}$ penalizes hubs. This can be seen in Figure 4(a), where hubs $33$ , $34$ move from the outer community to the middle community. Similarly, hub $1$ changes communities in Figure 4(b). Finally, we give an example of communities discovered in DBLP. Table 2 contains communities discovered around Christos Papadimitriou. Authors in inner communities share many joint papers with Papadimitriou.

7 Concluding Remarks

We considered the problem of discovering nested communities, a sequence of subgraphs such that each community is a more connected subgraph of the next community. We approach the problem by dividing it into two subproblems: discovering the community sequence for a fixed order of vertices, a problem which we can solve efficiently, and discovering an order. We provided a simple heuristic for discovering an order, and provided theoretical and empirical evidence that this order is good.

Discovering nested communities seems to have a lot of potential as it is possible to modify or extend the problem in many ways. We can generalize the problem by not only considering sequences but, for example, trees of communities, where a parent node needs to be a denser subgraph than the child node. Another possible extension is to consider multiple source sets instead of just one.

Acknowledgements.

This work was supported by Academy of Finland grant 118653 (algodan).

Bibliography

[1] G. Agarwal and D. Kempe. Modularity-maximizing network communities via mathematical programming. European Physical Journal B, 66(3), 2008.
[2] M. Ayer, H. Brunk, G. Ewing, and W. Reid. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4), 1955.
[3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6), 1961.
[4] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, 2000.
[5] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[6] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 2004.
[7] M. Coscia, G. Rossetti, F. Giannotti, and D. Pedreschi. DEMON: a local-first discovery method for overlapping communities. In KDD, 2012.
[8] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, 2000.