Community Structure Characterization

Vincent Labatut (LIA); Günce Keziban Orman

arXiv:1705.10621·cs.SI·May 18, 2018

TL;DR

This paper addresses how to interpret community structures in complex networks by proposing methods for extracting meaningful information after communities have been detected, aiding human understanding.

Contribution

It introduces a framework for analyzing and interpreting community structures in networks, focusing on post-detection processing for better human comprehension.

Findings

01 Provides a methodology for community interpretation

02 Enhances understanding of community roles and functions

03 Facilitates human analysis of complex network communities

Abstract

This entry discusses the problem of describing communities identified in a complex network of interest, in a way that allows interpreting them. We suppose the community structure has already been detected through one of the many methods proposed in the literature. The question is then how to extract valuable information from this first result, in order to allow human interpretation. This requires subsequent processing, which we describe in the rest of this entry.

Equations

$$\delta(C_{i})=\frac{m_{i}}{n_{i}(n_{i}-1)/2}$$

$$\ell(C_{i})=\frac{1}{n_{i}(n_{i}-1)/2}\sum_{u,v\in C_{i}}d(u,v)$$

$$T(u)=\frac{t_{i}(u)}{k_{int}(u)(k_{int}(u)-1)/2}$$

$$S(\mathcal{C})=1-\frac{1}{m}\sum_{i=1}^{\lambda}m_{i}$$

$$h(C_{i})=\frac{\max_{u\in C_{i}}(k_{int}(u))}{n_{i}-1}$$

$$e(u)=\frac{k_{int}(u)}{k(u)}$$

$$z(u)=\frac{k_{int}(u)-\mu_{i}(k_{int})}{\sigma(k_{int})}$$

$$P(u)=1-\sum_{i=1}^{\lambda}\left(\frac{k_{i}(u)}{k(u)}\right)^{2}$$

$$Q(\mathcal{C})=\sum_{i=1}^{\lambda}\left(\frac{m_{i}}{m}-\frac{m_{i+}^{2}}{m^{2}}\right)$$

$$Csd(C_{i})=\frac{r_{i}/q_{i}-1}{n_{i}-1}$$

$$R(C_{i},t_{1},t_{2})=\frac{|C_{i}(t_{1})\cap C_{i}(t_{2})|}{|C_{i}(t_{1})\cup C_{i}(t_{2})|}$$

$$\zeta(C_{i})=\frac{\sum_{t=t_{0}}^{t_{max}-1}R(C_{i},t,t+1)}{t_{max}-t_{0}-1}$$

$$Pi(C_{i},t)=J(C_{i},t)-L(C_{i},t)$$

Full text

Vincent Labatut

Laboratoire Informatique d’Avignon – LIA EA 4128, Université d’Avignon, France

Günce Keziban Orman

Computer Engineering Department, Galatasaray University, Ortaköy, Istanbul, Turkey

Keywords: Community Structure, Community Features, Mesoscale Characterization, Cluster Description, Community Evolution, Attributed Networks, Dynamic Networks.

Cite as: V. Labatut & G. K. Orman. Community Structure Characterization, Encyclopedia of Social Network Analysis and Mining, 2nd Edition, Eds.: R. Alhajj & J. Rokne, Springer, 2017. Doi: 10.1007/978-1-4614-7163-9_110151-1

1 Introduction

Community detection is one of the most studied problems in the domain of Network Science, as illustrated by the hundreds of algorithms proposed in the literature [12] and the various definitions of the notion of community itself [42]. However, from the application perspective, detecting the community structure of a network of interest is only half the work. Indeed, this information has no value in itself: one must additionally interpret the community structure relative to the system modeled by the network, in order to make sense of it and thus allow human understanding. Yet, almost all works in the field of community detection deal with the design of detection tools, and the evaluation of their precision or speed [14]. Very few researchers have addressed the problem of characterizing and interpreting the detected communities [40, 21, 22, 43, 32].

In this entry, we consider the interpretation problem as independent from the method used for community detection. We adopt an approach based on the original definition of the notion of community in social sciences, which underlines that nodes belonging to the same community should be relatively similar and/or share a common behavior. Assessing node similarity requires describing nodes, which can be performed both in terms of individual information (i.e. personal characteristics) and relational information (i.e. connection to the rest of the network). Concretely, the former corresponds to nodal attributes, whereas the latter depends on the network topology. The behavior of a node can be described in terms of evolution of its individual and relational information. This approach allows us to take advantage of most types of information one can encode in a network (structure, directions, weights, attributes, time…).

2 Key Points

We review the main methods for characterizing communities. We distinguish them depending on the type of information they are based upon: structure only, nodal attributes (which requires an attributed network) and temporal evolution (dynamic networks). Networks encoding more information allow applying more advanced analysis methods, possibly leading to more informative results. Moreover, interpretation methods can also be distinguished depending on their focus. The community structure is described at the macroscopic scale (that of the whole network), whereas a community can be described at two levels, either by considering it as a whole (mesoscopic scale) or by focusing on its constituent nodes (microscopic scale).

3 Historical Background

Community detection as such is quite a recent subfield, dating back to the seminal work by [16]. Related (but different) problems have been the objects of prior works, though, such as graph partitioning or spectral clustering [28].

In early community detection works, the studied networks were very small, which made it possible to interpret the communities manually. In other words, one would subjectively study the individuals composing some community of interest, and try to identify relevant patterns or regularities in order to reach observations considered useful to understand the studied system [16, 29]. When the scale of the networks increased from tens to hundreds of nodes, it was still possible to consult domain experts to perform interpretation in a similar fashion [34, 35, 36].

However, this method showed its limits on larger networks. [3] applied their Louvain algorithm to a network of Belgian mobile phone communications, including 2.6 million nodes representing persons. Interpreting such a large network obviously requires a more automatic approach. [3] verified the accuracy of the top level of their hierarchical community structure by considering the homogeneity of the nodes on an attribute representing the language people spoke on the phone. This highlights the difficulty of interpreting communities in networks of this size, and also shows that one solution can be to consider some additional information, such as nodal attributes.

4 Notations & Glossary

• Graph or network: A pair $G=(V,E)$ consisting of a set of nodes $V$ and a set of links $E$. We denote by $n=|V|$ the number of nodes and by $m=|E|$ the number of links.

• Community structure: Partition of the node set $V$ into a set of $\lambda$ disjoint communities, i.e. $\mathcal{C}=\{C_{1},...,C_{\lambda}\}$, with $V=\bigcup_{i=1}^{\lambda}C_{i}$ and $C_{i}\cap C_{j}=\emptyset$ for all $i\neq j$.

• Community: Roughly corresponds to a group of nodes more densely interconnected, relative to the rest of the network. Formally, community number $i$ is a subset of nodes: $C_{i}\subset V$. We denote by $n_{i}$ the number of nodes in $C_{i}$, and by $m_{i}$ the number of links between these nodes.

• Attributed network: Network whose nodes are described by individual attributes, e.g. in a social network: age, gender, ethnicity, etc.

• Dynamic network: Network whose structure and/or attributes evolve through time, represented as a sequence of static consecutive networks.

• Time slice: Static network representing the state of a dynamic network for a given period of time.

5 Focusing on Topology Only

In some cases, the only available data is the network structure, or alternatively one wants to focus on a purely topological interpretation of the communities. A number of measures exist to describe one community or the whole community structure. In this entry, we focus on the most widespread ones. They assess the cohesion and separation of the communities, i.e. the way the intra-community links (links located inside communities) and inter-community links (links located between communities) are distributed, respectively.

We distinguish two types of measures. The first involves selecting a measure originally designed to describe a whole network, and restricting it to a single community. This generally requires considering the subgraph induced by the community of interest, and computing the measure as originally defined. The second type includes measures specifically designed to characterize communities or community structures.

5.1 General Measures

A number of measures have been proposed to characterize complex networks, each one focusing on a specific aspect of their topology (see [11] for a review). The most basic is the Size $n_{i}$, expressed in number of nodes. Studying the size of the detected communities is informative in itself; moreover, the distribution of these sizes can also reveal some properties of the network. Indeed, it was often observed that community size follows a power-law distribution in real-world networks [8].

The Link Density $\delta$ is a simple measure which can be used to assess community cohesion. It is particularly relevant here, since communities are, by definition, supposed to be more densely connected than the rest of the network. It is defined as:

$$\delta(C_{i})=\frac{m_{i}}{n_{i}(n_{i}-1)/2}$$

i.e. the proportion of existing to possible links inside the community. Certain authors use a normalized version called Scaled Density instead: $\delta^{\prime}(C_{i})=n_{i}\delta(C_{i})$ [23]. It has the advantage of taking the value $2$ if the community is a tree, and $n_{i}$ if it is a clique.
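As an illustration, the density (internal links $m_{i}$ divided by the $n_{i}(n_{i}-1)/2$ possible ones) and its scaled version can be sketched in a few lines of Python. The adjacency-dictionary representation, the function names and the convention of returning 0 for a single-node community are assumptions made for this example only.

```python
def link_density(adj, community):
    """Proportion of existing to possible links inside `community`.

    `adj` maps each node to the set of its neighbors (undirected graph);
    `community` is a set of nodes. Both are illustrative conventions.
    """
    n_i = len(community)
    if n_i < 2:
        return 0.0  # assumption: a single node has no possible link
    # Every internal link is seen from both of its endpoints, hence // 2.
    m_i = sum(len(adj[u] & community) for u in community) // 2
    return m_i / (n_i * (n_i - 1) / 2)


def scaled_density(adj, community):
    """Scaled density: the link density multiplied by the community size."""
    return len(community) * link_density(adj, community)


# Toy graph: a 4-node path 0-1-2-3 (a tree) and a triangle 4-5-6 (a clique).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2},
       4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
print(scaled_density(adj, {0, 1, 2, 3}))  # 2.0, as expected for a tree
print(scaled_density(adj, {4, 5, 6}))     # 3.0, i.e. n_i for a clique
```

The two printed values match the limit cases mentioned above: 2 for a tree and $n_{i}$ for a clique.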

Cohesion can also be described in terms of Average Distance $\ell$ :

$$\ell(C_{i})=\frac{1}{n_{i}(n_{i}-1)/2}\sum_{u,v\in C_{i}}d(u,v)$$

where $d(u,v)$ is the geodesic distance between nodes $u$ and $v$ . It is worth studying how this measure evolves as a function of the community size, since communities are supposedly small-world [23].
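A minimal sketch of this computation follows. It assumes an unweighted, undirected graph stored as an adjacency dictionary, and it measures distances inside the subgraph induced by the community (so the community must be internally connected and contain at least two nodes). The function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def distances_within(adj, community, source):
    """Breadth-first search from `source` using only links inside `community`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u] & community:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj, community):
    """Mean geodesic distance over all unordered pairs of community nodes."""
    dist = {u: distances_within(adj, community, u) for u in community}
    pairs = list(combinations(community, 2))
    # A KeyError here means the induced subgraph is not connected.
    return sum(dist[u][v] for u, v in pairs) / len(pairs)


# Path 0-1-2-3: the six pairwise distances are 1, 2, 3, 1, 2, 1 (sum 10).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(average_distance(adj, {0, 1, 2, 3}))  # 10 / 6
```

Running one search per node is affordable for a single community, but a larger study would probably rely on a dedicated graph library.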

Another way to characterize cohesion is to use the community transitivity (a.k.a. clustering coefficient) [23]. In its local version, it is defined as:

$$T(u)=\frac{t_{i}(u)}{k_{int}(u)(k_{int}(u)-1)/2}$$

where $t_{i}(u)$ is the number of links between the neighbors of $u$ belonging to its community $C_{i}$, whereas $k_{int}(u)$ is the internal degree of node $u$, i.e. its number of neighbors in the same community. The measure corresponds to the proportion of links between the internal neighbors of $u$, among all possible such connections. A community can be described simply by averaging $T$ over its nodes.
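The local transitivity and its community average can be sketched as follows. The adjacency-dictionary input and the choice of returning 0 for nodes with fewer than two internal neighbors are assumptions of this example, not prescriptions from the cited works.

```python
def local_transitivity(adj, community, u):
    """Share of linked pairs among the neighbors of `u` inside `community`."""
    neighbors = adj[u] & community      # internal neighbors of u
    k_int = len(neighbors)
    if k_int < 2:
        return 0.0  # assumption: undefined case mapped to 0
    # Links between internal neighbors, each seen from both endpoints.
    t = sum(len(adj[v] & neighbors) for v in neighbors) // 2
    return t / (k_int * (k_int - 1) / 2)


def community_transitivity(adj, community):
    """Average of the local transitivity over the nodes of the community."""
    return sum(local_transitivity(adj, community, u)
               for u in community) / len(community)


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
members = {0, 1, 2, 3}
print(local_transitivity(adj, members, 0))  # 1.0: its two neighbors are linked
print(local_transitivity(adj, members, 2))  # 1/3: one linked pair out of three
```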

5.2 Community-Specific Measures

A simple way to measure the separation of a community structure is to compute the proportion of inter-community links $S$ [21]:

$$S(\mathcal{C})=1-\frac{1}{m}\sum_{i=1}^{\lambda}m_{i}$$

The internal structure of a community can take various forms, which explains why cohesion can be assessed through several different measures. Certain communities are organized around one or a few hubs, i.e. nodes connected to most of the nodes belonging to the same community, which can have various effects such as a small average distance. This can be assessed through the Hub-Dominance measure $h$ [23]:

$$h(C_{i})=\frac{\max_{u\in C_{i}}(k_{int}(u))}{n_{i}-1}$$

The numerator therefore corresponds to the highest degree in the community, when considering only internal links. The measure ranges from $0$ (only isolated nodes) to $1$ (at least one node connected to all others).
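The hub dominance is the largest internal degree divided by $n_{i}-1$, which can be sketched as below. The adjacency-dictionary input and the value returned for a single-node community are assumptions of this example.

```python
def hub_dominance(adj, community):
    """Largest internal degree of the community, divided by n_i - 1."""
    n_i = len(community)
    if n_i < 2:
        return 0.0  # assumption: undefined case mapped to 0
    return max(len(adj[u] & community) for u in community) / (n_i - 1)


# A star (center 0, leaves 1 to 3) and a path 4-5-6-7.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0},
       4: {5}, 5: {4, 6}, 6: {5, 7}, 7: {6}}
print(hub_dominance(adj, {0, 1, 2, 3}))  # 1.0: node 0 reaches all other members
print(hub_dominance(adj, {4, 5, 6, 7}))  # 2/3: no node dominates the path
```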

The Embeddedness $e(u)$ of a node $u$ is a measure assessing both cohesion and separation, since it corresponds to the proportion of the node’s neighbors located in the same community:

$$e(u)=\frac{k_{int}(u)}{k(u)}$$

where $k(u)$ is the plain degree of node $u$. If $e(u)$ is close to $1$, the node is particularly well connected to its community, and vice versa if it is close to zero.

The Within-Community Degree $z(u)$ is also based on the internal degree, but relies on a $z$ -score normalization [18]:

$$z(u)=\frac{k_{int}(u)-\mu_{i}(k_{int})}{\sigma(k_{int})}$$

where $\mu_{i}(k_{int})$ is the average internal degree for community $C_{i}$, and $\sigma(k_{int})$ is its standard deviation. It represents how well a node is connected to its community. It is complemented by the participation coefficient $P(u)$ [18]:

$$P(u)=1-\sum_{i=1}^{\lambda}\left(\frac{k_{i}(u)}{k(u)}\right)^{2}$$

where $k_{i}(u)$ is the community degree of node $u$ , i.e. its number of neighbors in community $C_{i}$ . The participation coefficient gets close to $1$ when the node is evenly connected to many communities, and reaches zero when all its neighbors are in the same community. Both measures were originally defined to identify the community roles of nodes, but they also have been used to characterize communities [21]. Some modifications were later proposed to solve certain limitations and generalize them to directed networks [10].
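The three node-level measures of this subsection (embeddedness, within-community degree and participation coefficient) can be sketched together. The sketch assumes an adjacency dictionary, a `membership` dictionary mapping each node to its community label, no isolated node, and the population standard deviation for the $z$-score. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def internal_degree(adj, membership, u):
    """Number of neighbors of `u` located in its own community."""
    return sum(1 for v in adj[u] if membership[v] == membership[u])


def embeddedness(adj, membership, u):
    return internal_degree(adj, membership, u) / len(adj[u])


def within_community_degree(adj, membership, u):
    """z-score of the internal degree of `u` among its community members."""
    degrees = [internal_degree(adj, membership, v)
               for v in adj if membership[v] == membership[u]]
    sigma = pstdev(degrees)  # assumption: population standard deviation
    if sigma == 0:
        return 0.0           # assumption: all members have the same degree
    return (internal_degree(adj, membership, u) - mean(degrees)) / sigma


def participation(adj, membership, u):
    """1 minus the sum of squared shares of neighbors in each community."""
    k = len(adj[u])
    counts = Counter(membership[v] for v in adj[u])
    return 1 - sum((k_i / k) ** 2 for k_i in counts.values())


# Community A: triangle 0-1-2 plus node 6 attached to 0.
# Community B: triangle 3-4-5. One bridge links nodes 2 and 3.
adj = {0: {1, 2, 6}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5},
       4: {3, 5}, 5: {3, 4}, 6: {0}}
membership = {0: "A", 1: "A", 2: "A", 6: "A", 3: "B", 4: "B", 5: "B"}
print(embeddedness(adj, membership, 2))   # 2/3: two of three neighbors in A
print(participation(adj, membership, 2))  # 4/9: neighbors split 2 to 1
print(within_community_degree(adj, membership, 0))  # positive: node 0 is a local hub
```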

The quality of the whole community structure can be assessed using one of the many objective functions designed to perform community detection (see [12] for a very complete review, or more recently [9]). Among them, the most widespread is clearly Newman’s modularity [31]:

$$Q(\mathcal{C})=\sum_{i=1}^{\lambda}\left(\frac{m_{i}}{m}-\frac{m_{i+}^{2}}{m^{2}}\right)$$

where $m_{i+}$ is $m_{i}$ plus half the number of links between $C_{i}$ and the other communities (only half for matters of normalization, see note 50 in [31]). The modularity is defined at the node level, so it allows characterizing not only the community structure as a whole, but also individual communities.
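The two structure-level measures of this subsection, the proportion of inter-community links $S$ and the modularity $Q$, can be sketched together. This sketch assumes that $m_{i+}$ counts the links inside $C_{i}$ plus half of the links leaving it, which is the reading under which the formula coincides with Newman's modularity. The edge-list input and the function name are also assumptions of the example.

```python
from collections import defaultdict


def separation_and_modularity(edges, membership):
    """Return (S, Q) for an undirected edge list and a node -> label mapping."""
    m = len(edges)
    internal = defaultdict(int)  # m_i: links with both ends in community i
    leaving = defaultdict(int)   # links between community i and the others
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            internal[cu] += 1
        else:
            leaving[cu] += 1
            leaving[cv] += 1
    separation = 1 - sum(internal.values()) / m
    modularity = 0.0
    for c in set(membership.values()):
        m_plus = internal[c] + leaving[c] / 2  # assumed reading of m_{i+}
        modularity += internal[c] / m - (m_plus / m) ** 2
    return separation, modularity


# Two triangles joined by a single link (7 links in total).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
S, Q = separation_and_modularity(edges, membership)
print(S)  # 1/7: one link out of seven lies between communities
print(Q)  # 5/14
```

Each term of the sum can also be inspected separately to compare the contribution of individual communities.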

A number of measures have been proposed to assess the statistical significance of the estimated community structure (see Section 14 of [12] for a review). The $B$- and $C$-scores are particularly interesting, because they allow characterizing individual communities (as opposed to the whole community structure) [24]. They measure, with different levels of precision, the likelihood of observing a community similar to the one at hand in a random network (using the same null model as Newman’s modularity).

5.3 Use Examples

[21] use most of these measures to study a specific network of social relationships between university students. One of their main objectives is to characterize individual communities in order to understand the differences between them, and interpret them in the context of the studied system. In other words, it is a case study. As explained later, the authors also have access to nodal attributes, which allows them to complete their purely topological analysis.

The objective of [23] is very different: they do not focus on a single network. Instead, they want to compare different classes of networks (biological, technological, social, etc.), through their community structures. For this purpose, they consider their community size distributions, and study the evolution of several measures describing individual communities (density, average distance, hub dominance, embeddedness), as functions of the community size. They observe some classes of networks are indeed characterized by certain types of community structures.

The work of [25] also focuses on the community structure as a whole. It builds upon the Conductance [37], a graph partitioning objective measure (a normalized cut), to define the notion of Network Community Profile. For a given network, they first estimate the maximal conductance for a community of a fixed size. They repeat this process for an increasing community size. Finally, they plot the obtained conductance as a function of the community size. The resulting curve is considered as characteristic of the community structure of the network. [25] show that it discriminates not only between random and real-world networks, but also between several types of real-world networks.

6 Taking Advantage of Nodal Attributes

As shown in the previous section, it is possible to characterize the community structure as well as individual communities using only topological information. However, the results are quite limited in terms of the interpretation they allow.

Fortunately, more and more real-world networks also include non-topological information, taking the form of nodal attributes, i.e. individual information describing each node. Two approaches are possible to take advantage of this information: either consider attributes separately, which can be done through classic statistical tools designed for non-relational data, or consider them jointly with the structural information, which requires using specific tools.

6.1 Attribute-Only Approaches

A number of classic statistical tools were designed to characterize groups of objects described by various types of attributes. They can therefore be applied to communities detected by purely topological methods, but whose nodes possess individual attributes. The simplest approach consists in focusing on a single attribute at once. The most straightforward method is to identify, in each community, the most widespread value of the attribute of interest. For instance, [21] study which mobile phones are the most popular in each community of a network of university students. On the same note, [33] study the proportion of nodes holding the majority value (community-wise) of the attribute of interest, as a function of the community size.

More advanced statistical tools exist, though. [21] propose to study the association between community membership and the attribute of interest. For this purpose, it is possible to test for the significance of this supposed association, using Pearson’s chi-square test if the attribute at hand is nominal, or an ANOVA if it is numerical (the community itself being considered as a nominal attribute). Moreover, the strength of the association can be assessed through a collection of measures, among others: Pearson’s $\Phi$ , Cramér’s $V$ and Goodman & Kruskal’s $\lambda$ .
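For a nominal attribute, the chi-square statistic and Cramér's $V$ can be computed directly from the contingency table crossing community membership and attribute values. The sketch below uses only the standard library and therefore stops at the statistic: obtaining a $p$-value would additionally require the chi-square distribution. The function name and the input format (two parallel lists) are assumptions of the example.

```python
from collections import Counter
from math import sqrt


def chi_square_and_cramers_v(communities, values):
    """Chi-square statistic and Cramér's V between two nominal series.

    `communities[j]` and `values[j]` describe the same node j.
    Both series must take at least two distinct values.
    """
    n = len(values)
    observed = Counter(zip(communities, values))
    row_totals = Counter(communities)
    col_totals = Counter(values)
    chi2 = 0.0
    for c, row in row_totals.items():
        for a, col in col_totals.items():
            expected = row * col / n  # count expected under independence
            chi2 += (observed[(c, a)] - expected) ** 2 / expected
    smallest = min(len(row_totals), len(col_totals))
    return chi2, sqrt(chi2 / (n * (smallest - 1)))


# Attribute perfectly aligned with the communities: V reaches 1.
print(chi_square_and_cramers_v(["A"] * 3 + ["B"] * 3, ["x"] * 3 + ["y"] * 3))
# Attribute independent from the communities: both values are 0.
print(chi_square_and_cramers_v(["A", "A", "B", "B"], ["x", "y", "x", "y"]))
```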

It is also possible to take several nodal attributes into account simultaneously. The statistical tests and association measures presented above, which assess the relation between a single attribute and community membership, have generalizations that deal with several attributes at once [21]. [21] also propose to use discriminant analysis to build models able to predict community membership as a function of several numerical (with Linear Discriminant Analysis) or nominal (with Discriminant Correspondence Analysis) attributes. This results in a set of discriminant factors, whose associated weights represent the discriminant power relative to the predicted variable (here: the community). The same authors also use Multinomial Logistic Regression for the same purpose. The main limitation of these approaches is the underlying assumption that the most discriminant attributes are the same for all communities. But in fact, any tool able to predict a nominal variable depending on a set of numerical and/or nominal variables could be used instead, including those not making this assumption, such as association rule mining methods.

[40] precisely focus on the characterization of communities in terms of attribute values, with their approach based on the notion of over-expressed gene. They present their tool as specifically designed to study networks, but it actually ignores the topological information, so it could as well be applied to non-relational data (like the other methods presented in this subsection). In a given community, they consider an attribute value is over-expressed (i.e. characteristic of this community) if it appears among its nodes more often than expected from a null model assuming a hypergeometric distribution.
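The over-expression test can be illustrated with the upper tail of a hypergeometric distribution: the probability of seeing at least $k$ holders of a value in a community of $n_{i}$ nodes, when $K$ of the $n$ nodes of the network hold it. This is only a sketch of the general idea. The function name is hypothetical, and the significance threshold and any multiple-testing correction are left out.

```python
from math import comb


def over_expression_p_value(n_total, n_holders, community_size, observed):
    """P(X >= observed) for a hypergeometric X.

    X counts holders of the attribute value among `community_size` nodes
    drawn without replacement from `n_total` nodes, `n_holders` of which
    hold the value.
    """
    upper = min(community_size, n_holders)
    favorable = sum(
        comb(n_holders, j) * comb(n_total - n_holders, community_size - j)
        for j in range(observed, upper + 1)
    )
    return favorable / comb(n_total, community_size)


# 10 nodes, 3 of which hold the value; a 3-node community contains all 3.
print(over_expression_p_value(10, 3, 3, 3))  # 1/120, about 0.0083
```

A small probability suggests that the value appears in the community more often than the null model would predict.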

6.2 Hybrid Approaches

By hybrid, we mean that the tools described here consider both the network structure and the node attributes. Indeed, as shown in [22], when dealing with partitions of the node set, the information conveyed by the network structure can be complementary to that encoded in the node attributes. Therefore, confronting both aspects seems relevant.

The Homophily measure (a.k.a. assortativity) is particularly interesting, because of its simplicity and straightforward interpretation. This measure assesses the tendency for nodes to connect with other nodes similar (or dissimilar) to them, relatively to some attribute of interest. Let us consider two paired series constituted of the attribute values of pairs of connected nodes: then the homophily is basically the level of association between these two series. Newman proposes to use Cohen’s Kappa statistic and Pearson’s correlation coefficient for nominal and numeric attributes, respectively [27]. It is generally processed over the whole network, but it can also be used to characterize individual communities, as in [21], where it is applied to show inter-gender relationships vary among communities in a social network.
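For a nominal attribute, the paired-series view described above can be sketched as a kappa-style coefficient: the observed share of links joining identical values, corrected by the share expected by chance. Counting each undirected link in both directions and the function name are assumptions of this example. To characterize a single community, one would pass only its internal links.

```python
from collections import Counter


def nominal_homophily(edges, attribute):
    """Kappa-style agreement between the attribute values at both link ends.

    The value is undefined (division by zero) when all nodes share one value.
    """
    pairs = []
    for u, v in edges:
        # Undirected link: it contributes to the paired series in both orders.
        pairs.append((attribute[u], attribute[v]))
        pairs.append((attribute[v], attribute[u]))
    total = len(pairs)
    observed = sum(1 for a, b in pairs if a == b) / total
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    chance = sum(left[x] * right[x] for x in left) / total ** 2
    return (observed - chance) / (1 - chance)


gender = {0: "f", 1: "f", 2: "m", 3: "m"}
print(nominal_homophily([(0, 1), (2, 3)], gender))  # 1.0: only same-gender links
print(nominal_homophily([(0, 2), (1, 3)], gender))  # -1.0: only inter-gender links
```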

On a related note, [19] define the Community Similarity Degree, a measure originally aiming at describing how homogeneous a community of people is in terms of the interests they share, while taking inter-personal relations into account. [19] focus on the case of online social networking services, in which users can be connected to other users, and express their interest for certain topics. They want to measure how much users belonging to the same community share these interests. The community similarity degree $Csd(C_{i})$ of a community $C_{i}$ is formally defined as follows:

$$Csd(C_{i})=\frac{r_{i}/q_{i}-1}{n_{i}-1}$$

where $r_{i}$ is the total number of manifestations of interest from all users belonging to $C_{i}$, over all available topics, and $q_{i}$ is the number of topics for which at least one user belonging to $C_{i}$ has expressed interest. So, $r_{i}/q_{i}$ can be considered as the average popularity of a topic in $C_{i}$, expressed in number of manifestations of interest. The ratio of this value to $n_{i}$ (the number of users in $C_{i}$) can therefore be interpreted as the average number of manifestations of interest from a user for a topic. The rest of the formula ($-1$ in both the numerator and denominator) is just normalization. The measure ranges from $0$ (no common interest at all between users) to $1$ (all users share exactly the same interests). The measure can be applied to the more general context of attributed graphs, not necessarily representing social networking services. Indeed, a topic can be represented by a nodal attribute, whose value is $1$ if the considered user is interested in this topic, and $0$ otherwise. However, note that these attributes must be binary, which can constitute an important constraint, depending on the modeled system.
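A short sketch of the community similarity degree follows, with each user's binary topic attributes stored as a set of topic names. This representation and the function name are assumptions of the example. The measure is undefined for a single-user community or when no interest is expressed at all.

```python
def community_similarity_degree(interests, community):
    """Csd of `community`, where `interests` maps each user to a set of topics."""
    n_i = len(community)
    # r_i: total number of manifestations of interest in the community.
    r_i = sum(len(interests[u]) for u in community)
    # q_i: number of topics with at least one interested community member.
    q_i = len(set().union(*(interests[u] for u in community)))
    return (r_i / q_i - 1) / (n_i - 1)


same = {"ann": {"jazz", "chess"}, "bob": {"jazz", "chess"}, "eve": {"jazz", "chess"}}
apart = {"ann": {"jazz"}, "bob": {"chess"}, "eve": {"golf"}}
users = {"ann", "bob", "eve"}
print(community_similarity_degree(same, users))   # 1.0: identical interests
print(community_similarity_degree(apart, users))  # 0.0: no shared interest
```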

To study attributed graphs, [38] propose a method based on frequent pattern mining, consisting in identifying so-called Frequent Conceptual Links. A Conceptual Link corresponds to a set of links connecting nodes sharing similar attributes. It is considered frequent when the size of this link set is above a fixed threshold. This method can be seen as a generalization of the notion of homophily, and was initially used to simplify the network and help understand it. However, it can also be used to characterize communities, as illustrated in [39]. [39] define a set of measures to assess how homogeneous communities are in terms of attributes, and vice versa. The approach is not unlike that adopted in [22] to compare communities and clusters. It is worth noticing that both homophily and frequent conceptual links consider only direct connections between nodes, which can be viewed as a limitation in the sense that they have a purely local view.

[6] propose a method to jointly detect communities and identify so-called Community Profiles in social networking services, by considering jointly the users’ relationships, their attributes, the content they produce, and how this content propagates through the network. They explicitly make the assumption of homophily, and define a community as a group of densely connected users sharing similar interests and behaviors. Their method detects a set of topics, each one corresponding to a multinomial distribution over a dictionary (each word has a certain probability to belong to the topic), and two types of community profiles. The Content Profile of a community is a multinomial distribution over the set of topics, reflecting the probability for each topic to be discussed in the community ; whereas its Diffusion Profile is a multinomial distribution representing the probability for this community to propagate a certain topic to a given community. Obviously, the latter can be processed only if some sort of information propagation process is taking place on the studied system, and if the information describing it is available.

A number of community detection methods have been specifically designed to use both the network structure and the nodal attributes when identifying the communities (see [4] for a recent review). It seems natural to suppose some by-product of their processing can be used in some way to ease the interpretation of communities. And indeed, some of them output the most characteristic attributes and/or attribute values of the detected communities (e.g. [43], or [6] from the previous paragraph). However, there are two important limitations. First, the overwhelming majority of existing algorithms rely on the (sometimes implicit) assumption of homophily, i.e. communities are supposed to be homogeneous in terms of attributes [4]. Yet, several experimental works show that this is not necessarily the case in practice, and that the level of homophily can even largely differ from one community to the other in the same network [21, 22, 39]. However, certain very recent methods allow heterophily and/or independence, e.g. [30]. Second, it is not always clear which information is used exactly when detecting the communities, especially concerning the network structure. This is due to the fact the problem of community detection is ill-defined, because there is no clear unique definition of what a community is [13]. Some authors even define the notion of community just in a procedural way, i.e. simply as the output of their community detection method [12]. It is only recently that certain works tried to propose a typology of the definitions for the concept of community [42]. Moreover, the way attributes and structure are combined to reach some form of compromise is not always clear or controlled. All of this makes it very difficult to characterize the communities based on the outputs of such algorithms.

7 Considering the Network Evolution

Besides nodal attributes, time is another aspect that can be used to complement topology-based interpretation methods. Of course, taking advantage of the evolution of the studied system requires both having access to a proper representation (dynamic network) and applying an appropriate algorithm (dynamic community detection). There are now a number of methods to perform this task: see [2] as well as the entries Dynamic Community Detection and Community Evolution for a review. However, community detection in dynamic networks is not as widely studied as in static networks, so there are only a few works tackling their characterization. We can distinguish two types of approaches: some works focus purely on how the topology evolves, through the analysis of community events and derived measures, whereas others take the evolution of the nodal attributes into account, sometimes in conjunction with the structure.

7.1 Structure-Only Methods

The most direct method to take both structure and time into account is simply to study the evolution of the topological measures presented in Section 5, e.g. modularity as a function of time [20].

More advanced methods exist though, which are based on the characterization of community evolution through the detection of so-called Community Events occurring between two consecutive time slices. [33] originally proposed three pairs of opposed events: Growth vs. Contraction (a community size increases vs. decreases), Merging vs. Splitting (several distinct communities become one vs. one community gets separated into several ones), and Birth vs. Death (a community appears vs. disappears). Of course, it is also possible for a community to undergo no event at all. Most authors use these events, or equivalent ones, sometimes under different names, e.g. Form for Birth, Dissolve/Vanish for Death, Join/Expansion for Growth, Leave/Shrinking for Contraction [1, 7, 17, 5]. Note that the methods proposed by [33] based on these events were originally used on overlapping communities, but they also apply to disjoint ones.

The most straightforward use of these events is to count them and study their evolution in order to characterize the community structure, and therefore the network dynamics [1, 7, 17, 5]. However, they also allow defining the notions of Community Age, i.e. the time elapsed since the birth $t_{0}$ of the community, and Community Lifetime, i.e. the time between its birth and disappearance $t_{max}$. Their correlation with other measures can then be studied. For instance, [33] observe that community age is correlated with community size in their data: the older the community, the larger it is.

[33] also propose to study some kind of community auto-correlation, by applying Jaccard’s coefficient to the node sets of the community of interest considered at two different time slices $t_{1}$ and $t_{2}$ :

$$R(C_{i},t_{1},t_{2})=\frac{|C_{i}(t_{1})\cap C_{i}(t_{2})|}{|C_{i}(t_{1})\cup C_{i}(t_{2})|}$$

where $C_{i}(t)$ is the node set of $C_{i}$ at $t$ and $|C_{i}(t)|$ is its cardinality. It is then possible to focus on the birth time $t_{0}$ of the community, and study how $R(C_{i},t_{0},t)$ evolves as a function of $t$ , and/or depending on some other measure such as community size. For instance, in the case of [33], the auto-correlation decays faster for larger communities (meaning their members change faster).
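The auto-correlation and its decay from the birth time slice can be sketched as follows. The sketch assumes that the same community has already been tracked through time and is given as a list of node sets, one per time slice starting at $t_{0}$. The function names are hypothetical.

```python
def auto_correlation(members_t1, members_t2):
    """Jaccard coefficient between two node sets of the same community."""
    union = members_t1 | members_t2
    if not union:
        return 0.0  # assumption: two empty sets are given a score of 0
    return len(members_t1 & members_t2) / len(union)


def decay_from_birth(snapshots):
    """Auto-correlation between the first time slice and each time slice."""
    return [auto_correlation(snapshots[0], s) for s in snapshots]


# Node sets of one community over three consecutive time slices.
snapshots = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]
print(decay_from_birth(snapshots))  # values 1, 3/5 and 1/3
```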

They additionally define the Stationarity Measure $\zeta(C_{i})$ of community $C_{i}$ as:

$$\zeta(C_{i})=\frac{\sum_{t=t_{0}}^{t_{max}-1}R(C_{i},t,t+1)}{t_{max}-t_{0}-1}$$

where $t_{max}$ is the last time slice before the community disappears. The stationarity can be interpreted as the average proportion of nodes staying in the community at each time slice. [33] characterize communities by comparing it to their lifetime and size. They observe that, for their data, small stable communities can survive for a long time, whereas small unstable ones have a very short lifespan. The opposite is observed for large communities: stable ones do not last a long time, whereas unstable ones do, because their instability is caused by expansion.

Another set of measures leveraging community events was proposed by [1], one of which is designed to characterize communities. They propose the Popularity Index, which aims at assessing the attractiveness of a community, i.e. how likely nodes are to join it. The popularity index $Pi(C_{i},t)$ of community $C_{i}$ at time $t$ is:

$$Pi(C_{i},t)=J(C_{i},t)-L(C_{i},t)$$

where $J(C_{i},t)$ and $L(C_{i},t)$ are the numbers of nodes joining and leaving $C_{i}$ at time $t$, respectively. In the same vein, [41] propose their Member Stability Measure, which is a normalized version of $L$. Again, it is worth studying the relation between these measures and other community properties. For instance, [1] study how $Pi$ correlates with the community size, and observe that, for their data, large clusters tend to be more node-attractive.
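
A small sketch of this measure is given below. It takes the popularity index to be the difference between the number of joining and leaving nodes, and it counts joins and leaves by comparing the membership at the previous slice with the membership at time $t$. Both choices are assumptions made for illustration, and the memberships are toy data.

```python
def popularity_index(members_prev, members_curr):
    """Joins minus leaves for one community between two consecutive slices.

    Assumption: J and L are obtained by comparing the node sets of the
    community at the previous slice and at the current one.
    """
    joined = members_curr - members_prev   # nodes counted in J
    left = members_prev - members_curr     # nodes counted in L
    return len(joined) - len(left)


before = {1, 2, 3, 4}
after = {1, 2, 5, 6, 7}
pi = popularity_index(before, after)
print(pi)
```

A positive value means the community attracted more nodes than it lost over the interval, and a negative value means the opposite.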

7.2 Attribute-Based and Hybrid Methods

As before, the most straightforward way of considering both attributes and time, and possibly also the network structure, is to study the evolution of the measures from Section 6, e.g. homophily as a function of time. Although we are not aware of any such work, it would also be possible to take advantage of the community events presented in Section 7.1 to study the evolution of nodal attributes, for instance by examining the association between the occurrence of a certain type of event and the most widespread attributes in the concerned communities. This would allow answering questions such as: Do communities that are similar in terms of nodal attributes tend to merge? When communities split, are the resulting smaller communities generally uniform in terms of attributes? Are newborn communities uniform?

A more advanced approach consists in using the methods previously designed in the field of data mining for the analysis of natural time series (i.e. not networks), such as the ones reviewed in [15]. These are made to extract relevant information from sequences of values of various types (numerical, ordinal or nominal). When focusing only on the node attributes, they can be applied as is, whereas taking structure into account could require modifying the original method. Although the idea of applying/adapting classic data mining methods to networks seems obvious, it is not at all widespread in the domain of community interpretation: to the best of our knowledge, there exists only one work of this type, which we describe briefly here.

[32] formalize the community characterization problem as a sequential pattern mining problem [26]. Each node is represented through a sequence of individual descriptor values, describing its evolution through time. A descriptor can directly correspond to a nodal attribute, but it can also be a topological measure: this allows [32] to represent simultaneously attributes and various aspects of the network topology (degree, centrality, transitivity, etc.). A community is therefore described by the set of sequences representing its nodes. It is characterized by identifying the most relevant sequential patterns among these nodal sequences, which allows detecting common changes in topological features and attribute values over time periods. This relevance is enforced through various constraints. First, to get prevalent patterns, they focus on frequent ones (i.e. those supported by at least a certain number of nodes). Second, to get informative patterns, they only detect closed ones, i.e. patterns not included in larger patterns possessing the same support. Third, to get distinctive patterns, they select those with the highest growth rates. The growth rate of a pattern is a measure showing how representative a sequence is within a group, relative to the whole studied population. In this case, a group corresponds to a community.
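
To make the notions of support and growth rate more concrete, the sketch below uses a simplified setting in which each node is described by a sequence of single symbols (one discretized descriptor value per time slice). The growth rate is taken here as the relative support of a pattern inside the community divided by its relative support among the remaining nodes, which is a common definition for emerging patterns; the exact formulation used by [32] may differ. Closedness is not checked, and all data are invented.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a (possibly gapped) subsequence."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so the items of the pattern must be found in order.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Number of node sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences)


def growth_rate(pattern, community, rest):
    """Relative support in the community over relative support in the rest.

    Assumption: the reference population is the set of nodes outside
    the community.
    """
    inside = support(community, pattern) / len(community)
    outside = support(rest, pattern) / len(rest)
    if outside == 0:
        return float("inf") if inside > 0 else 0.0
    return inside / outside


# Toy data: degree level of each node over three time slices.
community = [["low", "high", "high"], ["low", "mid", "high"],
             ["low", "high", "low"], ["mid", "mid", "high"]]
rest = [["high", "low", "low"], ["low", "high", "mid"],
        ["mid", "low", "low"], ["high", "high", "low"]]
pattern = ("low", "high")

print(support(community, pattern), growth_rate(pattern, community, rest))
```

In this toy example, the pattern describing a degree that goes from low to high is supported by three of the four community members and is three times more frequent inside the community than outside, which would make it a candidate for characterizing that community.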

The interest of this method is that it allows characterizing each community independently from the others, in terms of both structure and attributes. It also allows detecting outliers, i.e. nodes not following the dominant trend of their community. [32] apply their method to two real-world networks: DBLP (academic collaborations) and LastFM (music-oriented social media). It results in the extraction of high-level information, helping to better characterize and understand the studied systems. For instance, in DBLP, they associate a scientific domain to the detected communities, and are also able to identify authors on the point of switching to another scientific domain. In LastFM, they focus on the Jazz user group, and can provide a relatively clear interpretation of various communities, such as users preferring vocal artists, or users only remotely interested in Jazz. However, it is worth noting that the method is likely to produce a large number of sequential patterns, which must then be processed manually for interpretation, resulting in substantial work for the end-user. In return, the method does not require making any assumption regarding whether communities would be better characterized by topological information or nodal attributes: the most relevant descriptors are selected automatically, so the method output can be purely topological or attribute-based, as well as a combination of both.

8 Key Applications

The tools presented in this entry aim at characterizing communities and community structures, in order to ease their interpretation by human operators. Community detection is itself a very general analysis, which can be performed on any complex network, whatever the modeled system. Therefore, the presented tools are likely to be used on any system.

More precisely, some of them are designed to describe the community structure as a whole, which is useful to compare graphs at a mesoscopic scale, whether they represent distinct systems or a given system at different times. Others characterize communities individually, which is more appropriate to focus on specific communities of interest, understand them, and compare them.

Articles mentioned in this entry include applications to real-world social networks [21, 22, 38], social media [25, 23, 32, 10], the Internet [18, 23], the Web (or parts of it) [20, 25, 1, 23], biological networks [18, 7, 23], communication networks [33, 7, 17, 23, 5], collaboration networks [18, 33, 1, 40, 32], transportation networks [18], sale co-occurrence networks [39], and electronic circuits [20].

9 Future Directions

There are now countless community detection methods, but the need for a way to extract meaningful information from the detected communities is still very strong. Future work in this field could follow two complementary directions. First, certain existing tools need improvement in terms of reliability, usability (notably parameter estimation, e.g. thresholds), computational complexity, and quality of the produced results. Second, the methods presented here do not allow taking into account all the possible information one can encode in a network: it is necessary to extend them, or propose new ones, to deal with multilayer/multiplex networks, as well as signed relationships (in conjunction with attributes, and temporal evolution). It is worth noting that data mining researchers deal with very similar problems, but applied to non-relational data (i.e. not networks). Their tools would therefore constitute a very relevant basis for the design of new, network-related methods, but this source of inspiration has been largely ignored until now by complex network scientists.

Acknowledgments

This article is supported by the Galatasaray University Research Fund (BAP) within the scope of the project number 14.401.002, and titled “Sosyal Ağlarda Küme Bulma ve Anlamlandırma: Zamana Bağlı Sıralı Örüntü Uygulaması”.

Bibliography


[1] S. Asur, S. Parthasarathy and D. Ucar “An event-based framework for characterizing the evolutionary behavior of interaction graphs” In ACM Transactions on Knowledge Discovery from Data 3.4, 2009, pp. 1–36 DOI: 10.1145/1631162.1631164
[2] T. Aynaud, É. Fleury, J.-L. Guillaume and Q. Wang “Communities in Evolving Networks: Definitions, Detection, and Analysis Techniques” In Dynamics On and Of Complex Networks 2, Modeling and Simulation in Science, Engineering and Technology, Springer, 2013, pp. 159–200 DOI: 10.1007/978-1-4614-6729-8_9
[3] V. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre “Fast unfolding of communities in large networks” In Journal of Statistical Mechanics, 2008, pp. P10008 DOI: 10.1088/1742-5468/2008/10/P10008
[4] C. Bothorel, J. Cruz, M. Magnani and B. Micenkova “Clustering attributed graphs: models, measures and methods” In Network Science 3, 2015, pp. 408–444 DOI: 10.1017/nws.2015.9
[5] P. Bródka, S. Saganowski and P. Kazienko “GED: the method for group evolution discovery in social networks” In Social Network Analysis and Mining 3.1, 2013, pp. 1–14 DOI: 10.1007/s13278-012-0058-8
[6] H. Cai, V. Zheng, F. Zhu, K.-C. Chang and Z. Huang “From Community Detection to Community Profiling” In Proceedings of the Very Large Database Endowment, in press, 2017 URL: https://arxiv.org/abs/1701.04528
[7] Z. Chen, K. Wilson, Y. Jin, W. Hendrix and N. Samatova “Detecting and Tracking Community Dynamics in Evolutionary Networks” In IEEE International Conference on Data Mining Workshops, 2010, pp. 318–327 DOI: 10.1109/ICDMW.2010.32
[8] A. Clauset, M. Newman and C. Moore “Finding community structure in very large networks” In Physical Review E 70.6, 2004, pp. 066111 DOI: 10.1103/PhysRevE.70.066111