A Fast and Efficient Incremental Approach toward Dynamic Community Detection

Neda Zarayeneh; Ananth Kalyanaraman

arXiv:1904.08553·cs.SI·September 24, 2019


TL;DR

This paper introduces a fast incremental method called Δ-screening for dynamic community detection, enabling efficient updates in evolving networks while maintaining high quality, demonstrated by significant speedups on large real-world datasets.

Contribution

The paper presents a generic Δ-screening technique that accelerates dynamic community detection by selectively reevaluating affected vertices, adaptable to existing modularity-based methods.

Findings

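01 Achieved a 3× speedup over a baseline implementation on a real-world network with 63M temporal edges.

02 Maintained output quality despite the heuristic nature of the approach.

03 Enabled identification of appropriate temporal resolution intervals at which to analyze an input network.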

Abstract

Community detection is a discovery tool used by network scientists to analyze the structure of real-world networks. It seeks to identify natural divisions that may exist in the input networks that partition the vertices into coherent modules (or communities). While this problem space is rich with efficient algorithms and software, most of this literature caters to the static use-case where the underlying network does not change. However, many emerging real-world use-cases give rise to a need to incorporate dynamic graphs as inputs. In this paper, we present a fast and efficient incremental approach toward dynamic community detection. The key contribution is a generic technique called $\Delta$-screening, which examines the most recent batch of changes made to an input graph and selects a subset of vertices to reevaluate for potential community (re)assignment. This technique can be…

Tables

Table I: Input network statistics. See Fig. B.2 of the Appendix in [23] for more details on individual time steps.

Category	Input graph	No. vertices	No. edges (cumulative)	No. time steps
Synthetic	50k_ll	50,000	2,362,448	10
Synthetic	50k_hh	50,000	2,367,024	10
Synthetic	5M_ll	5,000,000	213,656,492	10
Synthetic	5M_hh	5,000,000	213,771,700	10
Real-world	Arxiv HEP-TH	27,770	352,807	11
Real-world	sx-stackoverflow	2,601,977	63,497,050	2-28

Equations

$$Q_{t}=\frac{1}{2m_{t}}\left(\sum_{i\in V_{t}}e_{i\rightarrow C(i)}-\frac{1}{2m_{t}}\sum_{C\in\mathcal{C}_{t}}a_{C}^{2}\right)$$

$$C(i)\leftarrow\arg\max_{C(j)}\Delta Q_{i\rightarrow C(j)}$$

$$\Delta Q^{new}_{i_{2}\rightarrow C_{t-1}(j_{*})}=\frac{e_{i_{2}\rightarrow C_{t-1}(j_{*})}-e_{i_{2}\rightarrow C_{t-1}(i)\setminus\{i_{2}\}}}{2m_{t}}+\frac{d_{\omega}(i_{2})\cdot\left(a_{C_{t-1}(i)\setminus\{i_{2}\}}-d_{\omega}(i)\right)}{(2m_{t})^{2}}-\frac{d_{\omega}(i_{2})\cdot\left(a_{C_{t-1}(j_{*})}+d_{\omega}(i)\right)}{(2m_{t})^{2}}=\Delta Q^{old}_{i_{2}\rightarrow C_{t-1}(j_{*})}-\frac{2\,d_{\omega}(i_{2})\cdot d_{\omega}(i)}{(2m_{t})^{2}}$$

$$\Delta Q^{new}_{k\rightarrow C_{t-1}(j_{*})}\approx\frac{e_{k\rightarrow C_{t-1}(j_{*})}-e_{k\rightarrow C_{t-1}(k)\setminus\{k\}}}{2m_{t}}+\frac{d_{\omega}(k)\cdot a_{C_{t-1}(k)\setminus\{k\}}}{(2m_{t})^{2}}-\frac{d_{\omega}(k)\cdot\left(a_{C_{t-1}(j_{*})}+d_{\omega}(i)\right)}{(2m_{t})^{2}}\approx\Delta Q^{old}_{k\rightarrow C_{t-1}(j_{*})}-\frac{d_{\omega}(k)\cdot d_{\omega}(i)}{(2m_{t})^{2}}$$


Full text


Neda Zarayeneh and Ananth Kalyanaraman

N. Zarayeneh and A. Kalyanaraman are with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99163, USA. Email: [email protected], [email protected]

Abstract

Community detection is a discovery tool used by network scientists to analyze the structure of real-world networks. It seeks to identify natural divisions that may exist in the input networks that partition the vertices into coherent modules (or communities). While this problem space is rich with efficient algorithms and software, most of this literature caters to the static use-case where the underlying network does not change. However, many emerging real-world use-cases give rise to a need to incorporate dynamic graphs as inputs.

In this paper, we present a fast and efficient incremental approach toward dynamic community detection. The key contribution is a generic technique called $\Delta$-screening, which examines the most recent batch of changes made to an input graph and selects a subset of vertices to reevaluate for potential community (re)assignment. This technique can be incorporated into any community detection method that uses modularity as its objective function for clustering. For demonstration purposes, we incorporated the technique into two well-known community detection tools. Our experiments demonstrate that our new incremental approach is able to generate performance speedups without compromising on the output quality (despite its heuristic nature). For instance, on a real-world network with 63M temporal edges (over 12 time steps), our approach was able to complete in 1056 seconds, yielding a 3$\times$ speedup over a baseline implementation. In addition to demonstrating the performance benefits, we also show how to use our approach to delineate appropriate intervals of temporal resolutions at which to analyze an input network.

I Introduction

Community detection is a fundamental problem in many graph applications. The goal of community detection is to identify tightly-knit groups of vertices in an input network, such that the members of each “community” share a higher concentration of edges among themselves than with the rest of the network. Owing to its ability to reveal natural divisions that may exist in a network (in an unsupervised manner), community detection has become one of the fundamental discovery tools in a network scientist’s toolkit. The operation is widely used in a variety of application domains including (but not limited to) social networks, biological networks, internet and web networks, and citation and collaboration networks.

Designing efficient algorithms and implementations for community detection has been an area of active research for well over a decade. While theoretical formulations are known to be NP-Hard [1], there are a number of efficient heuristics and related software already available. A comprehensive review of community detection methods and related applications is available in [2]. However, most of the existing tools target static networks, where the input graph cannot change. On the other hand, most real-world networks are dynamic in nature, where vertices and edges can be added and/or removed over a period of time.

Owing to the increasing availability of dynamic networks, the problem of dynamic community detection has become an actively researched topic of late, and multiple methods have been proposed over the last decade (e.g., [3, 4, 5]). In Section II we present a brief review of such related works. Despite these advances, a key remaining challenge in the design of these algorithms is in quickly identifying the parts of the graph that are likely to be impacted by a change (or collectively by a recent batch of changes), so that it becomes possible to update the community information with minimal recomputation effort.

Contributions: In this paper, we propose an algorithmic technique and a corresponding incremental approach that would complement the developments made in dynamic community methods, and in particular those that use the modularity function [6] as their clustering objective. More specifically, the main contributions are as follows:

i) We visit the problem of identifying vertex subsets that are likely to be impacted by the most recent batch of changes made to the graph. To address this problem, we present a technique called $\Delta$-screening, which can be efficiently implemented and incorporated as part of existing dynamic community algorithms that use modularity.

ii) To demonstrate and evaluate this technique, we incorporated the technique into two well-known classical community detection methods, namely the Louvain method [7] and the SLM method [19], thereby generating two incremental clustering implementations.

iii) Using these two implementations, we present a thorough experimental evaluation on both synthetic and real-world inputs. Our results show that the $\Delta$-screening technique is effective in pruning work (to reduce recomputation effort) without compromising on output quality.

iv) In addition to demonstrating its performance benefits, we also show how to use our approach to delineate appropriate intervals of temporal resolutions at which to analyze an input network.

II Related Work

Algorithms to compute dynamic communities over time-evolving graphs can be broadly classified into two types.

One class of methods follows a two-step strategy of first identifying the best set of communities for the current time step and then subsequently mapping them onto the communities from previous generations to track evolution. Hopcroft et al. [10] present a method in which a static community detection tool is individually applied to the graphs at all time steps and the results are later combined by computing community similarities between successive steps. Greene et al. [4] propose a variation of this approach where they use a matching-based formulation and a related heuristic to map the communities of the latest time step to the communities of previous generations (as identified by a “front”).

In general, the two-step strategy is better suited if the magnitude and/or complexity of changes to the input graph is more drastic or random. However, the strategy suffers from two drawbacks. First, it can make tracking of communities difficult, as an independent application of static community detection at each time step may not necessarily preserve the previous generation’s communities, and therefore the outputs can become non-deterministic. Second, these approaches could become expensive to run on large inputs, as they entail significant recomputation for each time step, and the community tracking (via mapping) process can also be expensive.

The other class of methods for detecting dynamic communities follows a more incremental strategy where communities from the previous generation(s) are propagated and updated using changes reflected in the current time step. Maillard et al. [13] propose a modularity-based incremental approach extending upon the classical Clauset-Newman-Moore static method [17]. Aktunc et al. [3] propose a method, DSLM, as an extension of its static predecessor [19]. Xie et al. [8] present an incremental method based on label propagation, which is a fast heuristic. Saifi and Guillaume [11] provide a way to track and update community “cores” across time steps. Zakrzewska and Bader [5] present another variant that tracks the communities of a selected set of seed vertices in the graph. The FacetNet approach introduced by Lin et al. [12] is a hybrid approach that operates with a dual objective of maximizing modularity for the current time step while also trying to preserve as much of the previous generation’s communities as possible.

In general, the incremental strategy has the advantage in runtime (because of the reuse of community information from previous steps), and it also has the advantage of outputting a relatively stable set of communities across time steps. The technique of $\Delta$ -screening proposed in this paper is aimed at helping these incremental methods to be able to quickly identify the relevant parts of the graph that are potentially impacted by a recent batch of changes, so that the computation effort in the incremental step can be reduced without compromising on clustering quality.

We note here that most of the existing methods use modularity or one of its variants as the objective function for optimizing community structure. Bassett et al. [9] propose and evaluate the choice of alternative null hypothesis models in modularity-based dynamic community detection methods.

III Method

III-A Basic Notation and Terminology

A dynamic graph $G(V,E)$ can be represented as a sequence of graphs $G_{1}(V_{1},E_{1}),\ldots,G_{T}(V_{T},E_{T})$, where $G_{t}(V_{t},E_{t})$ denotes the graph at time step $t$; we use $n_{t}=|V_{t}|$ and $M_{t}=|E_{t}|$. In this paper, we consider only undirected graphs. The graphs may be weighted, i.e., each edge $(i,j)\in E_{t}$ is associated with a non-negative numerical weight $\omega_{ij}\geq 0$; if the graphs are unweighted, then the edges are assumed to have unit weight, without loss of generality. We denote the neighbors of a vertex $i$ as $\varGamma(i)=\{j\;|\;(i,j)\in E_{t}\}$. We use $m_{t}$ to denote the sum of the weights of all edges in $G_{t}$, i.e., $m_{t}=\sum_{(i,j)\in E_{t}}\omega_{ij}$. We denote the degree of a vertex $i$ by $d(i)$. The weighted degree of a vertex $i$, denoted by $d_{\omega}(i)$, is the sum of weights of all edges incident on $i$.

In this paper, we consider incrementally growing dynamic graphs, where edges and vertices can be added (but not deleted) from one time step to another. This implies that $V_{t}\supseteq V_{t-1}$ and $E_{t}\supseteq E_{t-1}$ , for all $1<t\leq T$ . We denote the newly added edges at any time step $t$ as $\Delta_{t}=E_{t}\setminus E_{t-1}$ .
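As a small illustration of this growth model, the sketch below derives the batches $\Delta_{t}=E_{t}\setminus E_{t-1}$ from a list of edge-set snapshots. The function name `edge_deltas` and the choice to store each undirected edge as a `frozenset` are assumptions made for this example only, not something prescribed by the paper.

```python
def edge_deltas(snapshots):
    """Return [Delta_2, ..., Delta_T] for a list of edge sets E_1..E_T.

    Each edge is a frozenset({i, j}) so that (i, j) and (j, i) coincide.
    A ValueError is raised if an edge disappears, because the text assumes
    incrementally growing graphs (E_t is a superset of E_{t-1}).
    """
    deltas = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        if not prev <= curr:                     # subset test on sets
            raise ValueError("edge deletion detected; only growth is supported")
        deltas.append(curr - prev)               # Delta_t = E_t \ E_{t-1}
    return deltas


def und(i, j):
    """Hypothetical helper: an undirected edge as an unordered pair."""
    return frozenset((i, j))


E1 = {und(0, 1), und(1, 2)}
E2 = E1 | {und(2, 3)}
E3 = E2 | {und(0, 3), und(3, 4)}
deltas = edge_deltas([E1, E2, E3])
```

Here `deltas[0]` holds the single edge added at the second time step and `deltas[1]` the two edges added at the third.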

We denote the set of communities detected at time step $t$ as $\mathcal{C}_{t}$ . Note that, by definition, $\mathcal{C}_{t}$ represents a partitioning of the vertices in $V_{t}$ —i.e., each community $C\in\mathcal{C}_{t}$ is a subset of $V_{t}$ ; all communities in $\mathcal{C}_{t}$ are pairwise disjoint; and $\bigcup_{C\in\mathcal{C}_{t}}C=V_{t}$ .

For any vertex $i\in V_{t}$ , we denote the community containing $i$ , at any point in the algorithm’s execution, as $C(i)$ , following the convention used in [14]. Also, let $e_{i\rightarrow C}$ denote the sum of the weights for the edges linking vertex $i$ to vertices in community $C$ —i.e., $e_{i\rightarrow C}=\sum_{j\in C\cap\varGamma(i)}\omega_{ij}$ . Furthermore, let $a_{C}$ denote the sum of the weighted degrees of all vertices in $C$ —i.e., $a_{C}=\sum_{i\in C}d_{\omega}(i)$ .

Given the above, the modularity, $Q_{t}$ , as imposed by a community-wise partitioning $\mathcal{C}_{t}$ over $G_{t}$ , is given by [6]:

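$$Q_{t}=\frac{1}{2m_{t}}\left(\sum_{i\in V_{t}}e_{i\rightarrow C(i)}-\frac{1}{2m_{t}}\sum_{C\in\mathcal{C}_{t}}a_{C}^{2}\right)\tag{1}$$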
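The modularity definition can be evaluated directly from $e_{i\rightarrow C(i)}$ and $a_{C}$. The following is a minimal sketch, assuming the graph is held as a symmetric adjacency dictionary without self-loops; `build_adj` and `modularity` are illustrative names, not the authors' code.

```python
from collections import defaultdict


def build_adj(edges):
    """Symmetric adjacency {i: {j: weight}} from (i, j, weight) triples."""
    adj = defaultdict(dict)
    for i, j, w in edges:
        adj[i][j] = w
        adj[j][i] = w
    return dict(adj)


def modularity(adj, comm):
    """Q = (1/2m) * (sum_i e_{i->C(i)} - (1/2m) * sum_C a_C^2)."""
    # 2m equals the sum of all weighted degrees (each edge is seen twice).
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    intra = 0.0                      # sum over i of e_{i->C(i)}
    a = defaultdict(float)           # a_C: total weighted degree of community C
    for i, nbrs in adj.items():
        a[comm[i]] += sum(nbrs.values())
        intra += sum(w for j, w in nbrs.items() if comm[j] == comm[i])
    return (intra - sum(x * x for x in a.values()) / two_m) / two_m


# Two unit-weight triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
adj = build_adj(edges)
q_split = modularity(adj, {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"})
q_single = modularity(adj, {v: "A" for v in adj})
```

For this toy graph, splitting along the two triangles gives $Q=5/14$, whereas placing every vertex in one community gives $Q=0$.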

Given a community-wise partitioning on an input graph, the modularity gain that can be achieved by moving a particular vertex $i$ from its current community to another target community (say $C(j)$ ) can be calculated in constant time [7]. We denote this modularity gain by $\Delta Q_{i\rightarrow C(j)}$ .
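A sketch of the gain computation follows. It assumes the gain has the form $\frac{e_{i\rightarrow C(j)}-e_{i\rightarrow C(i)\setminus\{i\}}}{2m_{t}}+\frac{d_{\omega}(i)\,(a_{C(i)\setminus\{i\}}-a_{C(j)})}{(2m_{t})^{2}}$, which is the shape suggested by the gain equations listed for this paper. Under the modularity definition used here, this quantity works out to half of the exact change in $Q_{t}$, so arg max decisions are unaffected. The function names are illustrative. For clarity the sketch recomputes $e_{i\rightarrow C}$ from the adjacency list; once those sums and $a_{C}$ are at hand, the gain itself is a constant number of arithmetic operations.

```python
from collections import defaultdict


def community_degrees(adj, comm):
    """a_C for every community C."""
    a = defaultdict(float)
    for i, nbrs in adj.items():
        a[comm[i]] += sum(nbrs.values())
    return a


def modularity(adj, comm):
    """Reference value of Q, used only to cross-check the gain below."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    intra = sum(w for i, nbrs in adj.items()
                for j, w in nbrs.items() if comm[i] == comm[j])
    a = community_degrees(adj, comm)
    return (intra - sum(x * x for x in a.values()) / two_m) / two_m


def move_gain(adj, comm, a, two_m, i, target):
    """Gain for moving vertex i from C(i) into community `target`."""
    if comm[i] == target:
        return 0.0
    d_i = sum(adj[i].values())
    e_target = sum(w for j, w in adj[i].items() if comm[j] == target)
    e_own = sum(w for j, w in adj[i].items() if comm[j] == comm[i])
    a_own_without_i = a[comm[i]] - d_i
    return ((e_target - e_own) / two_m
            + d_i * (a_own_without_i - a[target]) / two_m ** 2)


# Two unit-weight triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: {} for v in range(6)}
for i, j in pairs:
    adj[i][j] = adj[j][i] = 1.0
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
two_m = sum(w for nbrs in adj.values() for w in nbrs.values())

gain = move_gain(adj, comm, community_degrees(adj, comm), two_m, 2, 1)
moved = dict(comm)
moved[2] = 1
exact_change = modularity(adj, moved) - modularity(adj, comm)
```

In this example the gain for moving vertex 2 into the other triangle is negative, so a greedy step would leave it where it is.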

III-B Problem Statement

Definition III.1

Dynamic Community Detection: Given a dynamic graph $G(V,E)$ with $T$ time steps, the goal of dynamic community detection is to detect an output set of communities $\mathcal{C}_{t}$ at each time step $t$ that maximizes the modularity $Q_{t}$ for the graph $G_{t}(V_{t},E_{t})$.

Since the static version of the modularity optimization problem is NP-Hard [1], it immediately follows that the dynamic version is also intractable.

For the static version, a number of efficient heuristics have been developed (as surveyed in [2]). These approaches can be broadly classified into three categories: divisive approaches [15, 16], agglomerative approaches [6, 17], and multi-level approaches [18, 19, 20]. Of these, the multi-level approaches have been demonstrated to be fast and effective at producing high-quality partitioning in practice. In Algorithm 1 we show a generic algorithmic pseudocode for this class of approaches. While they vary in the specific details of how each step is implemented, they share several common traits (note that this description is for the static use-case):

• At the start of each level, each vertex is assigned a distinct community id.

• An iterative process is initiated, in which all vertices are visited (in some arbitrary order) within each iteration, and a decision is made on whether to keep the vertex in its current community or to migrate it to one of its neighboring communities. This decision is typically made in a local-greedy fashion. For instance, in the Louvain algorithm [7], a vertex migrates to a neighboring community that maximizes the modularity gain of that vertex, i.e., let $j\in\varGamma(i)\cup\{i\}$. Then,

$$C(i)\leftarrow\arg\max_{C(j)}\Delta Q_{i\rightarrow C(j)}\tag{2}$$

• When the net modularity gain resulting from an iteration drops below a certain threshold $\tau$, the current level is terminated (i.e., intra-level convergence), and the algorithm compacts the graph into a smaller graph by using the information from the communities. This procedure represents a graph coarsening step, and the coarsened graph is subsequently processed using the same iterative strategy until there is no longer an appreciable modularity gain between successive levels.

Algorithm 1 succinctly captures the main steps of the multi-level approaches.
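Algorithm 1 itself is not reproduced in this text. The three traits listed above can be sketched as follows, under several assumptions that are ours rather than the paper's: the graph is a symmetric adjacency dictionary, the gain uses the form $\frac{e_{i\rightarrow C(j)}-e_{i\rightarrow C(i)\setminus\{i\}}}{2m}+\frac{d_{\omega}(i)(a_{C(i)\setminus\{i\}}-a_{C(j)})}{(2m)^{2}}$, only strictly positive gains trigger a move, and the outer loop stops when a level merges nothing (a simplification of "no appreciable gain between levels"). All function names are hypothetical.

```python
from collections import defaultdict


def local_moving(adj, comm, tau=1e-9):
    """Sweep all vertices until the gain accumulated in one sweep is below tau."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    deg = {i: sum(nbrs.values()) for i, nbrs in adj.items()}
    a = defaultdict(float)                        # a_C per community
    for i in adj:
        a[comm[i]] += deg[i]
    while True:
        sweep_gain = 0.0
        for i in adj:                             # arbitrary (insertion) order
            links = defaultdict(float)            # e_{i->C}; self-loops excluded
            for j, w in adj[i].items():
                if j != i:
                    links[comm[j]] += w
            own = comm[i]
            e_own = links.get(own, 0.0)
            a_own_without_i = a[own] - deg[i]
            best, best_gain = own, 0.0
            for c, e_c in links.items():
                if c == own:
                    continue
                gain = ((e_c - e_own) / two_m
                        + deg[i] * (a_own_without_i - a[c]) / two_m ** 2)
                if gain > best_gain:
                    best, best_gain = c, gain
            if best != own:                       # local-greedy migration
                a[own] -= deg[i]
                a[best] += deg[i]
                comm[i] = best
                sweep_gain += best_gain
        if sweep_gain < tau:                      # intra-level convergence
            return comm


def coarsen(adj, comm):
    """Collapse each community into one super-vertex; intra weight becomes a self-loop."""
    coarse = defaultdict(lambda: defaultdict(float))
    for i, nbrs in adj.items():
        row = coarse[comm[i]]
        for j, w in nbrs.items():
            row[comm[j]] += w
    return {c: dict(row) for c, row in coarse.items()}


def multilevel(adj, tau=1e-9):
    """Return {original vertex: community label}."""
    membership = {i: i for i in adj}
    graph = adj
    while True:
        comm = local_moving(graph, {v: v for v in graph}, tau)
        if len(set(comm.values())) == len(graph):   # nothing merged at this level
            return membership
        membership = {i: comm[c] for i, c in membership.items()}
        graph = coarsen(graph, comm)


# Two unit-weight triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
adj = {v: {} for v in range(6)}
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i][j] = adj[j][i] = 1.0
membership = multilevel(adj)
```

On this toy input the sketch groups each triangle into its own community. The label values themselves are arbitrary.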

III-C A Naive Algorithm for Dynamic Community Detection

A simple approach to address the dynamic community detection problem is to directly apply the static version of the algorithm (Algorithm 1) on the graph at every time step. However, such an approach suffers from multiple limitations. First, it completely ignores the communities identified at the previous time steps. This can cause outputs to become non-deterministic, as static approaches typically prefer a random ordering of vertices. Furthermore, by ignoring the previous community information, the algorithm is essentially forced to recompute from scratch and, as a result, to evaluate the community affiliation of all vertices at each time step. This can be wasteful in computation. Consider an edge $(i,j)\in\Delta_{t}$ that has been newly introduced at time step $t$; it is reasonable to expect only those vertices in the “vicinity” of this newly added edge to be impacted by this addition. However, the naive strategy is not suited to exploit such proximity information, thereby negatively impacting performance, particularly for large real-world networks where event-triggered changes tend to happen in a more localized manner at different time steps.

III-D An Incremental Approach via $\Delta$ -screening

Here, we present an alternative approach in which we first identify a subset of vertices to evaluate at the start of every time step, using the changes $\Delta_{t}$. The idea is to identify all (or most) of those vertices whose community affiliation could potentially change due to $\Delta_{t}$; the remaining vertices will simply retain their previous community assignments. This new filtering technique, which we call $\Delta$-screening (abbreviated as $\Delta$S), is generic enough to be applied to any incremental clustering approach that uses modularity. For the purpose of this paper, we demonstrate it on multi-level approaches.

More specifically, let $Static(G)$ denote any static community detection algorithm of choice, that takes in an input (static) graph $G$ and outputs a set of communities $\mathcal{C}$ . Then, our incremental approach is as follows.

1. At $t=1$, the algorithm simply calls $Static(G_{1})$ to output $\mathcal{C}_{1}$.

2. For each subsequent time step $t>1$:

a) The algorithm initially assigns each pre-existing vertex $i\in V_{t}\cap V_{t-1}$ to the same community label as $\mathcal{C}_{t-1}(i)$. Each of the remaining vertices (i.e., those that were newly added at $t$) is assigned a distinct (new) community label.

b) Next, the algorithm calls a function $\Delta$-screening$(G_{t},\Delta_{t})$ that returns a subset of vertices $\mathcal{R}_{t}\subseteq V_{t}$. This subset corresponds to the set of vertices that have been selected for processing during time step $t$.

c) Subsequently, the algorithm calls $Static\Delta S(G_{t},\mathcal{R}_{t})$, which is a variant of $Static(G_{t})$ that loads $G_{t}$ but visits only the vertex subset $\mathcal{R}_{t}$ for evaluation during each iteration, i.e., a modification to the for loop of line #4 in Algorithm 1. Note that this procedure that uses $\mathcal{R}_{t}$ is only relevant to the iterations at the first level, as in the subsequent levels, the algorithm uses compacted versions of the same graph.
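The control flow of these steps can be sketched as a small driver. The three callables `static`, `screen` and `static_ds` are hypothetical stand-ins for $Static$, $\Delta$-screening and $Static\Delta S$; passing the carried-over labels to `static_ds` explicitly and using integer community labels are assumptions of this sketch.

```python
def incremental_communities(snapshots, deltas, static, screen, static_ds):
    """Return [C_1, ..., C_T] as dicts {vertex: integer community label}.

    snapshots : graphs G_1..G_T, each {vertex: {neighbor: weight}}
    deltas    : [Delta_2, ..., Delta_T], the edge batches added at each step
    """
    results = [static(snapshots[0])]                  # step 1: t = 1
    for g_t, delta_t in zip(snapshots[1:], deltas):
        prev = results[-1]
        next_label = max(prev.values(), default=-1) + 1
        init = {}
        for v in g_t:                                 # step 2a
            if v in prev:
                init[v] = prev[v]                     # carry the old label over
            else:
                init[v] = next_label                  # fresh label for a new vertex
                next_label += 1
        r_t = screen(g_t, delta_t)                    # step 2b
        results.append(static_ds(g_t, init, r_t))     # step 2c
    return results


# Stub callables, only to exercise the driver's bookkeeping.
calls = []

def static_stub(g):
    return {v: 0 for v in g}

def screen_stub(g, delta):
    return {v for edge in delta for v in edge}

def static_ds_stub(g, init, r_t):
    calls.append(set(r_t))
    return dict(init)

g1 = {0: {1: 1.0}, 1: {0: 1.0}}
g2 = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
history = incremental_communities([g1, g2], [[(1, 2)]],
                                  static_stub, screen_stub, static_ds_stub)
```

With the stubs above, vertex 2 enters at the second step with its own new label, and only the screened vertices are handed to the restricted clustering call.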

To demonstrate the $\Delta$-screening technique, we modified two well-known community detection methods: the Louvain algorithm [7] and the smart local moving (SLM) algorithm [19]. We call the resulting modified implementations dLouvain-$\Delta$S and dSLM-$\Delta$S, respectively.

Note that there is also a simpler incremental version that can be implemented for both these methods, by following all steps outlined in our incremental approach except for $\Delta$-screening and instead trivially setting $\mathcal{R}_{t}=V_{t}$. For a comparative assessment of the $\Delta$-screening strategy, we implemented this baseline version as well; we refer to the resulting two implementations as dLouvain-base and dSLM-base, respectively. (We note that our dSLM-base implementation is in effect the same as [3].)

III-E The $\Delta$-screening Scheme

In what follows, we describe our $\Delta$ -screening scheme in detail. Given the graph ( $G_{t}$ ) and changes ( $\Delta_{t}$ ) at time step $t$ , the goal of $\Delta$ -screening is to identify a vertex subset $\mathcal{R}_{t}\subseteq V_{t}$ for reevaluation at time step $t$ —i.e., any vertex that is added to $\mathcal{R}_{t}$ will be evaluated for potential migration by the iterative clustering algorithm (Algorithm 1); all other vertices are not evaluated (i.e., they retain their respective community assignment from the previous time step $t-1$ ). Our $\Delta$ -screening scheme is also a heuristic and it does not guarantee the reproduction of the results from the corresponding baseline version—dLouvain-base or dSLM-base—which are also heuristics. The main objective here is to save runtime by reducing the number of vertices to process, without significantly altering the quality. Despite its heuristic nature, however, our $\Delta$ -screening scheme is designed to preserve the key behavioral traits of the baseline version (as we show in lemmas later in this section).

Algorithm: We assume that $\Delta_{t}$ is stored as a list of ordered pairs of the form $(i,j)$ . This implies that for each newly added edge $(i,j)$ , there will be two entries stored in $\Delta_{t}$ : $(i,j)$ and $(j,i)$ , as the input graph is undirected. We refer to the first entry ( $i$ ) of an ordered pair ( $(i,j)$ ) as the “source” vertex and the other vertex ( $j$ ) as the “sink”. Let $S_{\Delta}$ denote the set of all source vertices in $\Delta_{t}$ , and $T_{\Delta}(i)$ denote the set of all sinks for a given source $i$ .
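A minimal sketch of this bookkeeping is shown below: each undirected new edge is expanded into its two ordered pairs, and the sinks are grouped by source. The function name and the choice of returning the sources in sorted order are assumptions for illustration.

```python
from collections import defaultdict


def sources_and_sinks(new_edges):
    """Return (S_Delta as a sorted list, {i: T_Delta(i)}) for the new edges."""
    sinks = defaultdict(set)
    for i, j in new_edges:
        sinks[i].add(j)          # ordered pair (i, j): i is the source
        sinks[j].add(i)          # ordered pair (j, i): j is the source
    return sorted(sinks), dict(sinks)


sources, sinks = sources_and_sinks([(3, 1), (1, 2), (5, 3)])
```

For the three new edges in the example, vertices 1 and 3 each appear as a source with two sinks, while 2 and 5 have one sink each.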

Algorithm 2 shows the algorithm for $\Delta$ -screening. We initialize $\mathcal{R}_{t}$ to $\emptyset$ . Subsequently, we examine all edges of $\Delta_{t}$ in the sorted order of its source vertices. Sorting helps in two ways: It helps us to consider all the new edges incident on a given source vertex collectively and identify the edge that (locally) maximizes the net modularity gain (consistent with line #4 of Algorithm 2). This way we are able to mimic the behavior of the baseline versions which also use the same greedy scheme to migrate vertices. Furthermore, this sorted treatment helps reduce overhead by helping to update $\mathcal{R}_{t}$ in a localized manner (relative to the source vertices) and avoiding potential duplications in the computations associated with a vertex.

Once sorted, we read the adjacency list for each source vertex (Algorithm 2: line #3), identify a neighbor ($j_{*}$) that maximizes the modularity gain (line #4), and update $\mathcal{R}_{t}$ based on that vertex (line #8). However, prior to updating $\mathcal{R}_{t}$, we check if the selected vertex $j_{*}$ has a better incentive to move to $i$’s community $C_{t-1}(i)$ (line #7); if that happens, then $\mathcal{R}_{t}$ is not updated from source $i$ and instead, that decision is deferred until $j_{*}$ is visited as the source. This way we avoid making conflicting decisions between source and sink, while decreasing the processing time (by reducing the size of $\mathcal{R}_{t}$). Note that we only use the direction of migration that results in the larger of the two gains for updating $\mathcal{R}_{t}$. The decision to migrate itself is deferred until the stage of execution of the iterative algorithm. In other words, the $\Delta$-screening procedure does not modify the state of communities, but it sets the stage for which communities are to be visited during the main iterative process.

The main part of Algorithm 2 is on line #8, where $\mathcal{R}_{t}$ is updated. Our scheme adds the following subset to $\mathcal{R}_{t}$ : vertices $i$ and $j_{*}$ , all neighbors of $i$ ( $\varGamma(i)$ ), and all vertices in the community containing $j_{*}$ . In what follows, using a combination of lemmas, we show that the $\mathcal{R}_{t}$ so constructed is positioned to capture all (or most of the) “essential” vertices that are likely to be impacted by the edge additions in $\Delta_{t}$ . In other words, if a vertex is not added to $\mathcal{R}_{t}$ , it can be concluded that it is less likely (if at all) to be impacted by the changes to the graph, and therefore it can stay in its previous community state—thereby saving runtime.
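Since Algorithm 2 is not reproduced in this text, the following is only a sketch of the steps the prose attributes to lines #3, #4, #7 and #8. It assumes the gain form $\frac{e_{i\rightarrow C(j)}-e_{i\rightarrow C(i)\setminus\{i\}}}{2m_{t}}+\frac{d_{\omega}(i)(a_{C(i)\setminus\{i\}}-a_{C(j)})}{(2m_{t})^{2}}$, that $j_{*}$ is chosen among the sinks of the source, and that `comm` already holds the labels carried over from $t-1$ plus fresh labels for new vertices. Any further conditions Algorithm 2 may apply are not modeled, and all names are hypothetical.

```python
from collections import defaultdict


def gain(adj, comm, a, two_m, i, target):
    """Gain for moving i into community `target` (0 if it is already there)."""
    if comm[i] == target:
        return 0.0
    d_i = sum(adj[i].values())
    e_t = sum(w for j, w in adj[i].items() if comm[j] == target)
    e_o = sum(w for j, w in adj[i].items() if comm[j] == comm[i])
    return (e_t - e_o) / two_m + d_i * (a[comm[i]] - d_i - a[target]) / two_m ** 2


def delta_screening(adj, comm, new_edges):
    """Return R_t, the vertices to re-evaluate; `comm` is left unchanged."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    a = defaultdict(float)                       # a_C per community
    members = defaultdict(set)                   # vertices of each community
    for v, nbrs in adj.items():
        a[comm[v]] += sum(nbrs.values())
        members[comm[v]].add(v)
    sinks = defaultdict(set)                     # T_Delta(i) for each source i
    for i, j in new_edges:
        sinks[i].add(j)
        sinks[j].add(i)
    r_t = set()
    for i in sorted(sinks):                      # sources in sorted order
        # Sink whose community offers i the largest gain (ties: smallest id).
        j_star = max(sorted(sinks[i]),
                     key=lambda j: gain(adj, comm, a, two_m, i, comm[j]))
        g_ij = gain(adj, comm, a, two_m, i, comm[j_star])
        g_ji = gain(adj, comm, a, two_m, j_star, comm[i])
        if g_ji > g_ij:
            continue                             # defer to j_star's turn as source
        r_t |= {i, j_star} | set(adj[i]) | members[comm[j_star]]
    return r_t


# G_t: three disjoint unit-weight triangles plus a new vertex 9 attached to 6.
pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (9, 6)]
adj = {v: {} for v in range(10)}
for i, j in pairs:
    adj[i][j] = adj[j][i] = 1.0
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 3}
r_t = delta_screening(adj, comm, [(9, 6)])
```

In this example the update is skipped when 6 is the source (vertex 9 has the stronger incentive to join 6's community) and applied when 9 is the source, so only vertices 6 to 9 are selected and the other two triangles are left untouched.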

In all these lemmas, for sake of convenience (and without loss of generality), we analyze the potential impact of the event represented by moving $i$ to $j_{*}$ ’s community. Intuitively, the key to populating $\mathcal{R}_{t}$ is in anticipating which vertices are likely to alter their community status triggered by this migration event. Fig. 1 shows the different representative cases that originate for consideration in our lemmas.

First, we claim that any vertex that is a neighbor of $i$ can be potentially impacted.

Lemma III.1

If $i^{\prime}\in\varGamma(i)$ , then the community state for $i^{\prime}$ could potentially alter at time step $t$ if $i$ migrates to $C(j_{*})$ .

Proof:

There are two subcases: (A) if $i^{\prime}$ is also in $C_{t-1}(i)$ ; and (B) otherwise.

Subcase (A) is represented by vertex label $i_{1}$ in Fig. 1. If $i$ were to leave $C_{t-1}(i)$, the strength of the connection of $i_{1}$ to $C_{t-1}(i)$ can only weaken because of a decrease in the positive term of the modularity (Eqn. 1). Even if the negative term of the same equation also decreases (due to the departure of $i$ from $C_{t-1}(i)$), it may or may not be sufficient to keep $i_{1}$ in $C_{t-1}(i)$. Therefore, we add $i_{1}$ to $\mathcal{R}_{t}$.

Subcase (B) is represented by vertex label $k$ in Fig. 1. Here, $k$ is in a community different from $C_{t-1}(i)$ . However, the situation with $k$ is similar to that of $i_{1}$ in subcase (A), as $k$ ’s connection to its present community could potentially weaken if it discovers a stronger connection to $C(j_{*})$ as a result of $i$ ’s move. Therefore, we add $k$ to $\mathcal{R}_{t}$ . ∎

Next, we analyze the potential of vertices that are in $C_{t-1}(i)$ but not in $\varGamma(i)$ to be impacted by the migration of $i$ . In fact, we conclude that there is no need to include such vertices in $\mathcal{R}_{t}$ .

Lemma III.2

If $i^{\prime}\in$ $C_{t-1}(i)$ , then at time step $t$ , a change to the community state of $i^{\prime}$ is possible, only if $i^{\prime}$ is also a neighbor of $i$ .

Proof:

We have already considered the case where $i^{\prime}\in\varGamma(i)$ (as part of Lemma III.1). Therefore we only need to consider the case where $i^{\prime}\notin\varGamma(i)$. This is represented by vertex label $i_{2}$ in Fig. 1. Since $i_{2}$ is already in $C_{t-1}(i)$ and since $i_{2}$ does not share an edge with $i$, a departure of $i$ from $C_{t-1}(i)$ can only positively reinforce $i_{2}$’s membership in $C_{t-1}(i)$. This can be shown more formally by comparing the modularity gains associated with $i_{2}$. Owing to space limitations, we show the expanded proof in Appendix: Section -A1 in [23]. In summary, vertex $i_{2}$ will have little incentive to change community status and therefore can be excluded from $\mathcal{R}_{t}$. ∎
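The effect described in this proof can be checked numerically on a toy graph. The sketch assumes the gain form $\frac{e_{i\rightarrow C(j)}-e_{i\rightarrow C(i)\setminus\{i\}}}{2m_{t}}+\frac{d_{\omega}(i)(a_{C(i)\setminus\{i\}}-a_{C(j)})}{(2m_{t})^{2}}$; the graph, the labels and the function names are made up for illustration. Vertex 0 plays the role of $i$, vertex 2 the role of $i_{2}$ (same community as $i$, not adjacent to it), and community "B" the role of $C_{t-1}(j_{*})$.

```python
from collections import defaultdict


def community_degrees(adj, comm):
    """a_C for every community C."""
    a = defaultdict(float)
    for v, nbrs in adj.items():
        a[comm[v]] += sum(nbrs.values())
    return a


def move_gain(adj, comm, a, two_m, v, target):
    """Gain for moving v from C(v) into community `target`."""
    d_v = sum(adj[v].values())
    e_t = sum(w for j, w in adj[v].items() if comm[j] == target)
    e_o = sum(w for j, w in adj[v].items() if comm[j] == comm[v])
    return (e_t - e_o) / two_m + d_v * (a[comm[v]] - d_v - a[target]) / two_m ** 2


# Community A = {0, 1, 2} (a path 0-1-2), community B = {3, 4}.
# Edge (0, 3) links i = 0 to j* = 3; edge (2, 4) gives i2 = 2 a link into B.
pairs = [(0, 1), (1, 2), (3, 4), (0, 3), (2, 4)]
adj = {v: {} for v in range(5)}
for x, y in pairs:
    adj[x][y] = adj[y][x] = 1.0
two_m = sum(w for nbrs in adj.values() for w in nbrs.values())

comm_old = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
comm_new = dict(comm_old)
comm_new[0] = "B"                                  # i migrates to C(j*)

gain_old = move_gain(adj, comm_old, community_degrees(adj, comm_old), two_m, 2, "B")
gain_new = move_gain(adj, comm_new, community_degrees(adj, comm_new), two_m, 2, "B")

d_i = sum(adj[0].values())
d_i2 = sum(adj[2].values())
predicted = gain_old - 2 * d_i2 * d_i / two_m ** 2
```

Here the incentive for vertex 2 to follow vertex 0 into "B" drops by exactly $2\,d_{\omega}(i_{2})\,d_{\omega}(i)/(2m_{t})^{2}$ once vertex 0 has moved, which is consistent with excluding such vertices from $\mathcal{R}_{t}$.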

Next, we analyze the potential impact of $i$ ’s migration on members of $j_{*}$ ’s community.

Lemma III.3

If $j_{1}\in C_{t-1}(j_{*})$ , then at time step $t$ , a change to the community status of any such $j_{1}$ is possible.

Proof:

Regardless of whether $j_{1}$ shares a direct edge with $j_{*}$ or not, the migration of a new vertex ( $i$ ) into its present community ( $C_{t-1}(j_{*})$ ) increases the negative term in Eqn. 1. This may or may not be accompanied with an increase in the positive term as well (depending on whether $j_{1}$ shares an edge with the incoming vertex $i$ ). In either case, however, we need to re-evaluate the community status of such vertices. Therefore, we add $j_{1}$ to $\mathcal{R}_{t}$ . ∎

Finally, we analyze the impact of $i$ ’s potential migration from $C_{t-1}(i)$ to $C_{t-1}(j_{*})$ , on vertices that are in neither of those two communities and are also not in $\varGamma(i)$ .

Lemma III.4

If $k\in V_{t}\backslash\{C(i)\cup C(j)\}$ , then at time step $t$ , unless $k$ is also in $\varGamma(i)$ , there is no need to include $k$ in $\mathcal{R}_{t}$ .

Proof:

We consider only vertices $k\notin\varGamma(i)$, as the other case was already covered in Lemma III.1. There are three subcases: (A) $k$ shares an edge with some vertex in $C_{t-1}(i)$ except $i$; (B) $k$ shares an edge with some vertex in $C_{t-1}(j_{*})$; and (C) $k$ has no neighbors in $C_{t-1}(i)$ or $C_{t-1}(j_{*})$. However, in none of these cases could a migration of $i$ to $C_{t-1}(j_{*})$ create an incentive for $k$ to move to $C_{t-1}(j_{*})$. This is shown formally in Appendix: Section -A2 in [23]. ∎

IV Experimental Evaluation

IV-A Experimental Setup

Input data: For experimental testing, we used a combination of synthetic and real-world networks. Table I shows the input statistics for the inputs used.

As synthetic inputs, we used a collection of streaming networks available from the MIT Graph Challenge 2018 [kao2017streaming]. We used two types of networks: i) Low block overlap, Low block size variation (abbreviated as “ll”), and ii) High block overlap, High block size variation (abbreviated as “hh”). These two types are in increasing order of community complexity (ll $<$ hh). However, in both cases, the number of edges grows linearly with time step (see Appendix Fig. B.2 in [23]). The datasets are available in sizes from 1K nodes to 20M nodes, and each of these datasets has ten time steps. For our testing purposes, we used the 50K and 5M datasets.

As real-world inputs, we used two networks downloaded from SNAP database [22]:

**Arxiv HEP-TH: ** This is a citation graph for 27,770 papers (vertices) with 352,807 edges (cross-citations). Even though the edges are directed, for the purpose of analysis in this paper we treated them as undirected. The dataset covers papers published between 1993 and 2003. Consequently, we partitioned this period into 11 time steps (one for each year). 2. 2.

**sx-stackoverflow:** This is a temporal network of interactions on Stack Overflow, with $2,601,977$ vertices (users) and $63,497,050$ temporal edges (interactions). Interactions can be one of many types (e.g., a user answers another user’s query, or a user comments on another user’s answer). We treated all these interactions equivalently (as edges), and for the purpose of our analysis we used only the first instance of a user-user interaction as an edge.
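The preprocessing described for the two real-world inputs can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(source, target, timestamp)` tuple layout, and the choice to drop self-interactions are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (source, target, timestamp).
TemporalEdge = Tuple[int, int, int]


def first_undirected_interactions(stream: Iterable[TemporalEdge]) -> List[TemporalEdge]:
    """Keep only the earliest interaction for each unordered pair of users."""
    seen = set()
    kept = []
    # Sort by timestamp so that "first instance" means earliest in time.
    for u, v, ts in sorted(stream, key=lambda e: e[2]):
        if u == v:
            continue  # assumption: self-interactions are dropped
        key = (u, v) if u < v else (v, u)  # ignore edge direction
        if key not in seen:
            seen.add(key)
            kept.append((key[0], key[1], ts))
    return kept


def year_time_step(year: int, first_year: int = 1993) -> int:
    """Map a publication year to a zero-based time step (one step per year)."""
    return year - first_year


example = [(2, 1, 30), (1, 2, 10), (1, 3, 20), (3, 3, 5), (2, 1, 40)]
deduplicated = first_undirected_interactions(example)
```

With yearly steps starting at 1993, the years 1993 through 2003 map to steps 0 through 10, which gives the 11 time steps mentioned above.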

Implementations tested: In our experiments, we tested the following implementations:

Static: This is a (static) community detection code run from scratch on the graph at each time step $i$. Louvain [7] and SLM [19] are the two tools we used for this purpose.

Baseline: This is a community detection code run incrementally on the graph at each time step $i$. “Incremental” here implies that at the start of every time step $i$, we initialize the state of communities to that of the end of the previous time step $i-1$ (for $i>0$). For this purpose, we implemented our own incremental version of the Louvain tool (which we call dLouvain-base); for SLM, we use the already available incremental version DSLM [3] (which we call dSLM-base).

$\Delta$-screening: This is a modified baseline version that incorporates our $\Delta$-screening step to identify the $\mathcal{R}_{t}$ set for use within each time step. The corresponding two implementations are referred to as dLouvain-$\Delta$S and dSLM-$\Delta$S.
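The difference between the static and incremental starting points can be sketched as below. This is only an illustration under stated assumptions: `initial_state` is a hypothetical helper, community labels are assumed to be integers, and new vertices are assumed to start in fresh singleton communities. The $\Delta$S variants would start from the same state as the baseline and differ only in which vertices are scanned afterwards.

```python
from typing import Dict, Hashable, Iterable, Optional


def initial_state(
    vertices: Iterable[Hashable],
    previous: Optional[Dict[Hashable, int]] = None,
) -> Dict[Hashable, int]:
    """Build the community assignment used at the start of a time step.

    previous=None models the static variant (every vertex is a singleton).
    Otherwise known vertices keep their community from the previous time
    step, and vertices not seen before get fresh singleton labels.
    """
    previous = previous or {}
    next_label = max(previous.values(), default=-1) + 1
    state = {}
    for v in vertices:
        if v in previous:
            state[v] = previous[v]  # carry over from the previous time step
        else:
            state[v] = next_label  # new vertex: its own community
            next_label += 1
    return state


static_start = initial_state(["a", "b", "c"])
baseline_start = initial_state(["a", "b", "c"], previous={"a": 0, "b": 0})
```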

IV-B Runtime and Quality Evaluation

First, we evaluate the impact of the $\Delta$-screening technique on the clustering algorithm’s performance. Since the main part of the algorithm is the iterative loop that scans every vertex and assigns communities (the for loop in Algorithm 1), we measured the average time taken per iteration of a given level, and the results are plotted in Fig. 2. As can be observed, $\Delta$-screening achieves a significant reduction in the time spent within each iteration (compared to both static and baseline).

The savings are a result of the reduction in the number of vertices processed per iteration in the $\Delta$S version (i.e., the $\mathcal{R}_{t}$ set size). We found the $\mathcal{R}_{t}$ set-based savings to be more significant for the real-world inputs than for the synthetic inputs. This is shown in Fig. 3, which plots the percentage of vertices processed per iteration: the $\mathcal{R}_{t}$ set size fractions range from under 10% in some time steps (for real-world inputs) to as much as 100% in some time steps (for the synthetic inputs). This wide variation in efficacy can be attributed to the nature of the input changes. Even though both classes of input graphs (synthetic and real-world) show linear growth rates in size, for the synthetic inputs it is harder to benefit from $\Delta$-screening because edges are connected almost randomly from new to existing vertices (as introduced by a randomized edge sampling procedure [kao2017streaming]); whereas, in the real-world networks, changes happen in a more localized manner, giving an opportunity to benefit from $\Delta$-screening. In fact, even between the two real-world networks, we observed a significant difference in the filtering efficacy of the $\Delta$-screening technique. More specifically, with the Arxiv HEP-TH input, the $\mathcal{R}_{t}$-size fraction varied between roughly 50% and 90%, whereas the savings were much more significant in the case of sx-stackoverflow (which also showed a linear trend).

Next, we evaluate the total runtime including the time taken to execute all levels. Note that in multi-level codes, the number of iterations per level may vary across the different implementations. Fig. 4 shows the runtime as a function of the time steps, for different combinations of four inputs (5M_ll, 5M_hh, Arxiv HEP-TH, sx-stackoverflow) and two sets of implementations (Louvain and SLM).

We find that in all cases both the baseline and $\Delta$S implementations consistently outperform the respective static implementation, with speedups of more than two orders of magnitude in some cases. Between the baseline and $\Delta$S implementations, the difference varies based on the input. For the synthetic inputs, both versions perform comparably, with a slight advantage to the $\Delta$S implementation in some time steps. As discussed earlier, this can be attributed to the random nature of the changes induced in the synthetic inputs. For the real-world inputs, $\Delta$S significantly outperforms the baseline, yielding over 5X speedup in some time steps. For example, in Fig. 4(c) we observe a 5X speedup in time step $t_{10}$: the graph grows from 55,158,947 edges in time step $t_{9}$ to 60,714,297 edges in time step $t_{10}$, and most of the new edges are intra-community edges, which leaves fewer nodes to reevaluate and consequently takes less time.

We also evaluated the quality (measured by modularity) achieved by each version. In nearly all cases, we observed that the $\Delta$ S version yielded almost the same modularity as the baseline version despite its heuristic nature in selecting subsets of vertices for evaluation within each iteration. Owing to space limitations, we show these results in Appendix Fig. B.3 in [23].
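For reference, the modularity score used as the quality measure can be computed as in the sketch below. It assumes the standard Newman-style definition for an unweighted, undirected graph given as an edge list; the function name and data layout are chosen for illustration and are not taken from the tools evaluated here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community: Dict[Hashable, Hashable],
) -> float:
    """Q = sum over communities c of  e_c / m - (a_c / (2m))^2,

    where m is the number of edges, e_c the number of edges with both
    endpoints in c, and a_c the sum of the degrees of the vertices in c.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # e_c
    degree_sum = defaultdict(int)  # a_c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
```

Comparing this score between the baseline and $\Delta$S outputs at each time step is one way to reproduce the kind of quality check described above.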

IV-C Effect of Varying the Temporal Resolution

In many real-world use-cases, even though the input graph is available as a temporal stream, the appropriate temporal scale at which to analyze it is not known a priori. In fact, this scale is an input property that a domain expert expects to discover through the analysis of dynamic communities. In order to facilitate such a study through dynamic community detection, in this section, we study the effect of varying the temporal resolution, as defined by the number of time steps used to partition a graph stream, on the output clustering.

More specifically, using the sx-stackoverflow input, we first generated multiple temporal datasets, each representing the input stream divided into a certain number of time steps, with the number of time steps drawn from {2, 4, 8, 12, 16, 20, 24, 28}. Note that this scheme contains multiple nested hierarchies. For instance, the 16-time-step partitioning can be obtained by splitting each partition of the 8-time-step partitioning into two. Subsequently, we ran our $\Delta$S-enabled incremental clustering method on the different temporal input datasets (we used dLouvain-$\Delta$S for this analysis).
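One way to generate such temporal datasets is sketched below. The text does not state how the stream is split, so this example assumes equal-width time intervals over integer timestamps; `time_step_of` and `bin_stream` are hypothetical helper names. Each bin holds only the edges that arrive in that interval, so a cumulative graph at a given step would be the union of all bins up to that step.

```python
from typing import Iterable, List, Tuple

TemporalEdge = Tuple[int, int, int]  # hypothetical layout: (u, v, timestamp)


def time_step_of(ts: int, t_min: int, t_max: int, n_steps: int) -> int:
    """Index of the equal-width interval containing ts (last interval is closed)."""
    if t_max == t_min:
        return 0
    return min((ts - t_min) * n_steps // (t_max - t_min), n_steps - 1)


def bin_stream(stream: Iterable[TemporalEdge], n_steps: int) -> List[List[Tuple[int, int]]]:
    """Split a temporal edge stream into n_steps lists of edges."""
    stream = list(stream)
    t_min = min(ts for _, _, ts in stream)
    t_max = max(ts for _, _, ts in stream)
    bins: List[List[Tuple[int, int]]] = [[] for _ in range(n_steps)]
    for u, v, ts in stream:
        bins[time_step_of(ts, t_min, t_max, n_steps)].append((u, v))
    return bins


# Toy stream with one edge per timestamp 0..99.
toy_stream = [(i, i + 1, i) for i in range(100)]
bins8 = bin_stream(toy_stream, 8)
bins16 = bin_stream(toy_stream, 16)
```

Under this equal-width assumption, consecutive pairs of the 16-step bins merge exactly into the 8-step bins, which is the nesting property noted above.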

Fig. 5 shows the results of our analysis. Fig. 5a shows the change in average modularity as we increase the temporal resolution from coarser to finer (left to right along the x-axis). We observe that the modularity values decline gradually until around 16 time steps, after which the decline starts to accelerate. The decline in modularity suggests that the community-based structure of the underlying network (at different scales) starts to weaken as we increase the temporal resolution. This is to be expected, as the temporally binned graphs only become sparser with increasing resolution. Notably, the more rapid slide that starts to appear after the 16-time-step resolution suggests that the community structure starts to deteriorate beyond that resolution for this input.

Interestingly, this property is better captured by Fig. 5b, which shows the % savings in total runtime achieved by our $\Delta$ -screening filtering scheme. Intuitively, when the % savings remains approximately steady (highlighted by the plateau region from the resolution of 4 time steps to 16 time steps), it means that the nature of the evolution of the graphs within those resolutions is also relatively consistent. However, a steeper decline (on either end of the plateau) suggests that under those temporal resolution scales the temporally partitioned graphs become either too sparse (right) or too dense (left).

Note that software speed becomes an important enabling factor for conducting these types of experiments, where one needs to run the dynamic community analysis repeatedly under different configurations.

V Conclusion

Conducting community detection-based analysis on large dynamic networks is a time-consuming problem and there have been many incremental strategies proposed. In this paper, we visit a subproblem in this context—one of identifying vertices that are likely to be impacted by a new batch of changes. We presented a generic technique called $\Delta$ -screening that examines and selects provably “essential” vertices for evaluation during the $i^{th}$ time step based on the loci of the changes. Subsequently we incorporated this technique into two widely-used community detection tools (Louvain and SLM) and demonstrated speedups in performance without compromising on the output quality, for a collection of synthetic and real-world inputs. Future research directions include: i) extension of the $\Delta$ -screening technique to edge deletions; ii) parallelization on multicore platforms; iii) extensions to other incremental community detection tools; and iv) application and dynamic community characterization on large-scale real-world networks.

ACKNOWLEDGMENT

This research was supported by U.S. National Science Foundation grant 1815467.

-A Expanded Versions of Proofs

-A1 For Lemma 3.2

Lemma: If $i^{\prime}\in$ $C_{t-1}(i)$ , then at time step $t$ , a change to the community state of $i^{\prime}$ is possible, only if $i^{\prime}$ is also a neighbor of $i$ .

Proof:

We have already considered the case where $i^{\prime}\in\varGamma(i)$ (as part of Lemma III.1). Therefore we only need to consider the case where $i^{\prime}\notin\varGamma(i)$. This is represented by vertex label $i_{2}$ in Fig. 1. Since $i_{2}$ is already in $C_{t-1}(i)$ and since $i_{2}$ does not share an edge with $i$, a departure of $i$ from $C_{t-1}(i)$ can only positively reinforce $i_{2}$’s membership in $C_{t-1}(i)$. More formally, this can be shown by comparing the old vs. new modularity gains, ${\Delta Q}_{i_{2}\rightarrow C(j_{*})}$, resulting from moving $i$ to $C_{t-1}(j_{*})$:

[TABLE]

Since the new modularity gain is less than the old, vertex $i_{2}$ will have no incentive to change community status, and therefore can be excluded from $\mathcal{R}_{t}$ . ∎
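The comparison in this proof can be sketched as follows, assuming the standard Louvain-style gain expression (an assumption here, since the gain formula is defined outside this section). The symbols are introduced only for this illustration: $m$ is the number of edges in the graph at time step $t$, $\deg(v)$ is the degree of $v$, $e_{v\rightarrow C}$ is the number of edges between $v$ and the members of $C$, and $a_{C}$ is the sum of the degrees of the members of $C$ other than $v$.

$$\Delta Q_{v\rightarrow C}=\frac{e_{v\rightarrow C}}{m}-\frac{\deg(v)\,a_{C}}{2m^{2}}$$

Moving $i$ into $C(j_{*})$ leaves $e_{i_{2}\rightarrow C(j_{*})}$ unchanged, because $i_{2}\notin\varGamma(i)$, and raises $a_{C(j_{*})}$ by $\deg(i)$. Under the assumed expression this gives:

$$\Delta Q^{new}_{i_{2}\rightarrow C(j_{*})}=\frac{e_{i_{2}\rightarrow C(j_{*})}}{m}-\frac{\deg(i_{2})\,\bigl(a_{C(j_{*})}+\deg(i)\bigr)}{2m^{2}}=\Delta Q^{old}_{i_{2}\rightarrow C(j_{*})}-\frac{\deg(i_{2})\,\deg(i)}{2m^{2}}\leq\Delta Q^{old}_{i_{2}\rightarrow C(j_{*})}$$

By the same reasoning, the departure of $i$ lowers $a_{C(i)}$ by $\deg(i)$, so the gain associated with $i_{2}$ staying in $C(i)$ rises by the same amount $\deg(i_{2})\,\deg(i)/(2m^{2})$.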

Note that the above proof only analyzes the direct impact that vertex $i$ ’s migration from $C_{t-1}(i)$ will have on vertices of $C_{t-1}(i)$ but not in $\varGamma(i)$ . However, there could still be an indirect impact (cascading from Lemma III.1)—e.g., the migration of a vertex $i_{1}\in\varGamma(i)$ may trigger the migration of a vertex $i_{2}\notin\varGamma(i)$ but in $\varGamma(i_{1})$ , and so on. However, as this distance from the locus of change increases, the likelihood of its impact is expected to decay rapidly in practice (as observed in our experiments).

-A2 For Lemma 3.4

Lemma: If $k\in V_{t}\backslash\{C(i)\cup C(j)\}$ , then at time step $t$ , unless $k$ is also in $\varGamma(i)$ , there is no need to include $k$ in $\mathcal{R}_{t}$ .

Proof:

We consider only vertices $k\notin\varGamma(i)$, as the other case was already covered in Lemma III.1. There are three subcases: (A) $k$ shares an edge with some vertex in $C_{t-1}(i)$ except $i$; (B) $k$ shares an edge with some vertex in $C_{t-1}(j_{*})$; and (C) $k$ has no neighbors in $C_{t-1}(i)$ or $C_{t-1}(j_{*})$. However, in none of these cases could a migration of $i$ to $C_{t-1}(j_{*})$ create an incentive for $k$ to move to $C_{t-1}(j_{*})$. This can be shown more formally using modularity gains as follows.

(A) We can follow the same logic as in the proof for Lemma 3.2 (Section -A1). If $k$ is a neighbor of $i$ located in another community (like $k$ in Fig. 1), the modularity gain ${\Delta Q}_{k\rightarrow C_{t-1}(j_{*})}$ will increase; therefore $k$ should be considered for reevaluation.

(B) Any node $k$ that is connected to $C_{t-1}(j_{*})$ and located in another community has the following modularity gain after moving $i$ to $C_{t-1}(j_{*})$:

[TABLE]

which means the modularity gain will decrease and $k$ will not change its community membership.

(C) If a node $k$ is not connected to either $C_{t-1}(i)$ or $C_{t-1}(j_{*})$ (like $k^{\prime}$ in Fig. 1), the modularity gain of $k$ moving to any community will not change, and therefore the node will not reconsider moving to another community.

∎
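Subcase (B), and the contrasting situation of a neighbor of $i$, can be checked numerically on a toy graph. The sketch below assumes the standard Louvain-style gain expression (edges into the target community divided by $m$, minus the degree product over $2m^{2}$); the graph, the vertex names, and the helper `modularity_gain` are made up for illustration and are not taken from the paper.

```python
from fractions import Fraction


def modularity_gain(adj, community, v, target):
    """Gain of placing v into `target` under the assumed standard expression."""
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    deg_v = len(adj[v])
    # Number of edges from v to current members of the target community.
    links = sum(1 for u in adj[v] if community[u] == target)
    # Total degree of the target community, excluding v itself.
    a_target = sum(len(adj[u]) for u in adj if community[u] == target and u != v)
    return Fraction(links, m) - Fraction(deg_v * a_target, 2 * m * m)


# Toy graph: (i, j) plays the role of the newly added edge; k touches C(j)
# through b but is not a neighbor of i; a is a neighbor of i.
edges = [("i", "a"), ("j", "b"), ("k", "c"), ("i", "j"), ("k", "b")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

before = {"i": "Ci", "a": "Ci", "j": "Cj", "b": "Cj", "k": "Ck", "c": "Ck"}
after = dict(before, i="Cj")  # i migrates to the community of j

k_before = modularity_gain(adj, before, "k", "Cj")
k_after = modularity_gain(adj, after, "k", "Cj")
a_before = modularity_gain(adj, before, "a", "Cj")
a_after = modularity_gain(adj, after, "a", "Cj")
```

In this toy instance the gain for `k` drops once `i` has moved, while the gain for `a` (a neighbor of `i`) rises, which matches the reasoning for excluding the former and retaining the latter in $\mathcal{R}_{t}$.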

-B Figures

Bibliography

[1] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner, “On modularity clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 172–188, 2008.
[2] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
[3] R. Aktunc, I. H. Toroslu, M. Ozer, and H. Davulcu, “A dynamic modularity based community detection algorithm for large-scale networks: DSLM,” in Advances in Social Networks Analysis and Mining (ASONAM), 2015 IEEE/ACM International Conference on. IEEE, 2015, pp. 1177–1183.
[4] D. Greene, D. Doyle, and P. Cunningham, “Tracking the evolution of communities in dynamic social networks,” in 2010 International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2010, pp. 176–183.
[5] A. Zakrzewska and D. A. Bader, “A dynamic algorithm for local community detection in graphs,” in Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015, pp. 559–564.
[6] M. E. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E, vol. 69, no. 6, p. 066133, 2004.
[7] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[8] J. Xie, M. Chen, and B. K. Szymanski, “LabelRankT: Incremental community detection in dynamic networks via label propagation,” in Proceedings of the Workshop on Dynamic Networks Management and Mining. ACM, 2013, pp. 25–32.
