Exploring Communities in Large Profiled Graphs

Yankai Chen; Yixiang Fang; Reynold Cheng; Yun Li; Xiaojun Chen; Jie Zhang

arXiv:1901.05451·cs.DB·January 18, 2019

TL;DR

This paper introduces profiled community search (PCS) in large hierarchical labeled graphs, demonstrating its effectiveness in identifying thematically coherent communities and proposing an efficient tree index for fast online querying.

Contribution

It presents the PCS problem, shows its advantages over existing methods, and develops a tree index to enable efficient community search in profiled graphs.

Findings

01

PCS identifies thematically coherent communities.

02

The tree index significantly improves search efficiency.

03

PCS outperforms existing community search approaches.

Abstract

Given a graph $G$ and a vertex $q \in G$, the community search (CS) problem aims to efficiently find a subgraph of $G$ whose vertices are closely related to $q$. Communities are prevalent in social and biological networks, and can be used in product advertisement and social event recommendation. In this paper, we study profiled community search (PCS), where CS is performed on a profiled graph. This is a graph in which each vertex has labels arranged in a hierarchical manner. Extensive experiments show that PCS can identify communities with themes that are common to their vertices, and is more effective than existing CS approaches. As a naive solution for PCS is highly expensive, we have also developed a tree index, which facilitates efficient online solutions for PCS.

Tables

Table 1. TABLE I: Notations and meanings.

Notation	Meaning
$G(V,E)$	A profiled graph with vertex set $V$ and edge set $E$
$n$	The number of vertices in $V$
$m$	The number of edges in $E$
$deg_{G}(v)$	The degree of vertex $v$ in $G$
$T(v)$	The P-tree of vertex $v$
$\mathcal{M}(G_{q})$	The maximal common subtree of $G_{q}$
$G[T]$	The largest connected subgraph of $G$ s.t. $q \in G[T]$, $\forall v \in G[T]$, $T \subseteq T(v)$
$G_{k}[T]$	The largest connected subgraph of $G[T]$ s.t. $q \in G_{k}[T]$, $deg_{G_{k}}(v) \geq k$

Table 2. TABLE II: Datasets used in our experiments.

Dataset	Vertices	Edges	$\hat{d}$	$\hat{P}$	$|$GP-tree$|$
ACMDL	107,656	717,958	13.34	11.54	1,908
Flickr	581,099	4,972,274	17.11	26.63	1,908
PubMed	716,459	4,742,606	13.22	27.10	10,132
DBLP	977,288	6,864,546	14.04	37.98	1,908

Table 3. TABLE III: Locations of maximal feasible subtrees.

	ACMDL	Flickr	PubMed	DBLP
Level 1	3%	8%	11%	5%
Level 2	15%	23%	5%	13%
Level 3	18%	32%	43%	37%
Level 4	26%	25%	24%	31%
Level 5	38%	12%	17%	14%

Table 4. TABLE IV: Facebook datasets.

Dataset	Vertices	Edges	$\hat{d}$	$\hat{P}$
FB1	1,233	11,972	19.41	34.54
FB2	1,447	17,533	24.23	29.12
FB3	982	10,112	20.59	31.10

Equations

$$f(x)=\begin{cases}1, & x=0\\ \max_{i=0}^{x}\{f(i)\cdot[f(x-i)-1]\}+1, & x\geq 1,\ x\in\mathbb{N}\end{cases}$$

$$CPS(\mathcal{G})=1-\sum_{l=1}^{|\mathcal{G}|}\left[\frac{1}{|G_{l}|^{2}}\sum_{j=1}^{|G_{l}|}\sum_{i=1}^{|G_{l}|}\frac{TED(T_{i},T_{j})}{|T_{i}\cup T_{j}|}\right]$$

$$LDR(q,F)=\frac{1}{\mathcal{L}}\sum_{i=1}^{\mathcal{L}}\frac{\sum_{h=1}^{\mathcal{H}}L_{i}\big[\mathcal{T}(F,q,h)\big]}{\sum_{j=1}^{\mathcal{J}}L_{i}\big[\mathcal{T}(PCS,q,j)\big]}$$

$$CPF(q)=\frac{1}{|G_{l}|\cdot|T(q)|}\sum_{i=1}^{|G_{l}|}\sum_{j=1}^{|T(q)|}\frac{fre_{i,j}}{|G_{i}|}$$

Full text

Y. Chen, Y. Fang, and R. Cheng are with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: {ykchen, yxfang, ckcheng}@cs.hku.hk

Y. Li is with the Department of Computer Science and Technology, Nanjing University, China. E-mail: [email protected]

X. Chen is with the College of Computer Science and Software, Shenzhen University, China. E-mail: [email protected]

J. Zhang is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected]

Manuscript received March 20, 2018.

Index Terms:

community search, social networks, graph queries, profiled graph

1 Introduction

Due to the recent developments of gigantic social networks (e.g., Flickr, Facebook, and Twitter), topics of graph queries have attracted attention from industry and research areas [1, 2, 3, 4, 5, 6, 7]. Communities, which are often found in large graphs, can be used in various applications, such as social event setting, friend recommendation, and research collaboration analysis [8, 9, 10, 11, 12]. Given a graph $G$ and a query vertex $q\in G$ , the goal of community search (CS) is to extract communities, or densely connected subgraphs of $G$ that contain $q$ , in an online manner.

In this paper, we investigate the CS problem for a profiled graph. This is essentially a kind of attributed graph, in which each vertex is associated with a set of labels arranged in a hierarchy called a P-tree. Fig. 1(a) shows a profiled graph, which is a computer science collaboration network; each vertex represents a researcher, and a link between two vertices indicates that the two corresponding researchers have worked together before. Each vertex is associated with a P-tree, which describes the expertise of the researcher. Fig. 1(c) shows the meanings of the terms in each P-tree, following the ACM Computing Classification System (CCS; see http://www.acm.org/publications/class-2012), which is partially presented in Fig. 1(b). For instance, vertex $B$ denotes a researcher whose research domain is computing methodology (CM), with specific interests in machine learning (ML) and artificial intelligence (AI). Profiled graphs are informative and can be found in various graph applications (e.g., knowledge bases, social and collaboration networks). Moreover, the P-trees of profiled graphs systematically organize the labels related to a vertex (e.g., hierarchical and interrelated knowledge in knowledge bases; affiliation, expertise, and locations in social and collaboration networks), reflecting the semantic relationships among them. For example, in a P-tree, the label “London” can be a child node of “UK”, because London is a UK city.

Prior works. Methods for retrieving communities can generally be classified into community detection (CD) methods and community search (CS) methods. In general, the aim of CD algorithms is to retrieve all communities of a graph [13, 14, 15, 16, 17, 18, 19, 20]. Note that these solutions are not “query-based”: they are not customized for a user-specified query vertex. As a result, these algorithms normally take a long time to find all the communities of a large graph, which makes them unsuitable for quick or online retrieval of communities. To solve these problems, CS solutions have recently been proposed [21, 22, 23, 10, 24, 8]. Compared with CD solutions, CS approaches are query-based, and thus are suitable for deriving communities in an “online” manner.

However, to the best of our knowledge, previous CS algorithms are not designed for profiled graphs. Early solutions (e.g., [8, 9, 10]) often consider only graph topology (e.g., a $k$-core is a community in which each vertex is connected to $k$ or more vertices). They do not use vertex labels. As pointed out in [11], the communities returned by those solutions are often huge (e.g., a community can easily contain over 1,000 vertices). Moreover, the vertices included in the communities are not closely related. Recent works, such as ACQ [11] and ATC [12], propose to use both graph structure and vertex label information. While these works have been shown to be more effective than CS solutions that do not utilize vertex labels, they do not exploit the hierarchical relationship among labels (e.g., the P-trees in Fig. 1(a)). This may lead to suboptimal results. In Fig. 1(a), suppose that a renowned expert $D$ wants to organize a seminar whose participants are closely related to each other. Based on the ACQ solution [11], with $k=2$, only one 2-core is found (Fig. 2(b)), whose vertices {$B$, $C$, $D$} have several labels (i.e., r, CM, ML, AI) in common. However, ACQ fails to return the community in Fig. 2(c), whose vertices are also highly similar. For these two communities, the shared labels as well as their relationships in the P-tree are very different. Therefore, both communities can be presented to the organizer for further selection.

Profiled community search. In this paper, we study profiled community search (PCS), which aims to find profiled communities, or PC’s, in a profiled graph. To obtain high-quality communities, we use structure cohesiveness and profile cohesiveness to constrain PC’s. We adopt the widely used minimum degree metric [8, 25, 26, 27, 28, 29] to measure structure cohesiveness. Note that in the PCS problem, the minimum degree metric can be replaced by other useful metrics, e.g., $k$-truss [10] and $k$-clique [22], to fit other application scenarios. In a profiled graph, each vertex is associated with a P-tree. To measure profile cohesiveness, we fully utilize the information in P-trees. Conceptually, a PC is a group of densely connected vertices whose P-trees have the largest degree of overlap. This overlapping part is the largest common subtree shared by all the vertices. Fig. 2(a) illustrates two PC’s in the profiled graph of Fig. 1, namely {$B$, $C$, $D$} and {$A$, $D$, $E$}. Fig. 2(b) and Fig. 2(c) show the two PC’s together with their largest common subtrees. For example, in Fig. 2(c), vertices $A$, $D$, and $E$ all possess the subtree with root $r$ and leaf nodes “IS” and “DMS”. Notice that these three vertices also form a 2-core containing $D$, and the common subtree among them is the largest. The common subtree reflects the “theme” of the community. In the PC of Fig. 2(b), all the researchers involved share an interest in machine learning and artificial intelligence, whereas in Fig. 2(c), the researchers are all interested in information systems and hardware studies.

Personalization. The PCS problem allows a query user to search for communities that exhibit both structure cohesiveness and profile cohesiveness. The parameter $k$ controls how densely the community is connected. Profile cohesiveness requires the community members to be as semantically similar as possible. For instance, PCS methods can answer questions such as “Who are my close friends with whom I have strong connections and share interests and expertise?” In contrast, existing CD methods [30, 31, 32] often use global criteria (e.g., modularity), where the graph is partitioned a priori with no reference to particular query vertices. Thus existing CD methods are not suitable for personalized queries.

Online search. Similar to other online CS approaches, our PCS method is able to find PC’s in a large-scale profiled graph effectively and efficiently. In contrast, existing CD methods are generally slower, mainly because they are designed to retrieve all the communities of an entire graph.

Contributions. As we will explain, a simple solution to the PCS problem is extremely expensive. To improve the efficiency of finding PC’s (so that they can be used in online applications), we first introduce an anti-monotonicity property, which allows the candidates for a PC to be pruned efficiently. We further develop the CP-tree index, which systematically organizes the graph vertices and P-trees of a profiled graph. The CP-tree index enables the development of two fast PC discovery algorithms. We experimentally evaluate our solutions on two real large profiled graphs and two synthetic profiled graphs. Our results show that PC’s are better representations of communities, and that the CP-tree based algorithms are up to four orders of magnitude faster than the basic solution.

Organization. We review the related work in Section 2. Section 3 presents the PCS problem and a basic solution. Section 4 discusses the CP-tree and its related solutions. We report the experimental results in Section 5, and conclude in Section 6.

2 Related Work

In the literature, there are two kinds of work related to the retrieval of communities, namely community detection (CD) and community search (CS).

Community detection (CD) aims to obtain all the communities from a given graph. Earlier works [16, 33] use link-based analysis to obtain these communities. However, they do not consider the textual information associated with graphs. Recent works focus on attributed graphs and use techniques such as clustering to identify communities. However, these studies often assume that the attribute of a vertex is a set of keywords, and do not consider the hierarchical relationship among them. For example, Zhou et al. [20] used keywords to describe vertices and further computed the vertices’ pairwise similarities to cluster the graph. Qi et al. [34] studied the problem of dynamically maintaining communities of moving objects using their trajectories. Ruan et al. [35] proposed a method called CODICIL. Based on content similarity, CODICIL augments the original graph by creating new edges, and then uses effective graph sampling to boost the efficiency of clustering. Another widely used approach is based on topic models [36, 18]. Essentially, these methods still analyze one-dimensional content to obtain the communities.

Among topic-model approaches, the Link-PLSA-LDA [14] and Topic-Link LDA [37] models jointly model vertices’ links and content based on the LDA model. In [6], the communities are clustered based on probabilistic inference. In [38], information such as topics, interaction types, and social connections is considered to explore the communities. CESNA [19] detects overlapping communities by assuming that communities “generate” both the links and the content. As discussed earlier, CD solutions are typically time consuming, and they may not be suitable for online applications that require fast retrieval of communities. It is also interesting to examine how our PCS solutions can be extended to support CD.

Community search (CS) returns the communities for a given graph vertex in a fast and online manner. Most existing CS solutions [8, 9, 10, 26, 25] only consider graph topologies, but not the labels associated with the vertices. To define the structure cohesiveness of the community, the minimum degree is often used [25, 26, 8]. Sozio et al. [8] proposed the first algorithm Global to find the $k$ - $\widehat{core}$ containing the query vertex. Cui et al. [25] proposed Local, which uses local expansion techniques to improve Global. We will compare these two solutions in our experiments. Other definitions, such as $k$ -clique [9], $k$ -truss [10] and edge connectivity [39], have been considered for searching meaningful communities. Recent CS solutions, such as ACQ [11, 40] and ATC [12], make use of both vertex labels and graph structure to find communities.

Since CS is “query-based”, it is much more suitable for fast and online querying of communities on large-scale profiled graphs. However, none of the above works is designed for profiled graphs, and they do not consider the hierarchical relationship among vertex labels. Thus in this paper, we propose methods to solve the community search problem on profiled graphs. We have performed detailed experiments on real datasets (Section 5). As we will show, our algorithms yield better communities than state-of-the-art CS solutions do.

3 Problem Definition and Basic Solution

In this section, we first formally introduce the PCS problem, and then give a basic solution to the PCS problem. Table I lists all notations used in this paper.

3.1 The PCS Problem

A profiled community is a subgraph of $G$ that first of all satisfies structure cohesiveness (i.e., the vertices in the community are connected to each other in some way); the formal definition is given in Problem 1. A common notion of structure cohesiveness is that the minimum degree of the vertices in the community has to be at least $k$ [8, 25, 26, 27, 28, 29]. This notion is used in both the $k$-core and the PC. Let us discuss the $k$-core first.

Definition 1 ( $k$ -core [41, 27]).

Given an integer $k$ ($k\geq 0$), the $k$-core of $G$ is the largest subgraph of $G$ such that every vertex $v$ in the $k$-core has degree at least $k$ within that subgraph.

Notice that the $k$-core may not be connected [27]. Its connected components, denoted by $k$-$\widehat{core}$, are the “communities” retrieved by $k$-core search algorithms. We use Example 1 to illustrate this.

Example 1.

In Figure 2(a), each dashed circle represents a 2-core and also a 2-$\widehat{core}$. Vertices {$A$, $B$, $D$, $E$} form a 3-$\widehat{core}$, whereas vertices {$A$, $B$, $C$, $D$, $E$} form only a 2-$\widehat{core}$ because $C$ has a degree of only 2, even though the other vertices have higher degrees.
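As a minimal sketch of how a $k$-$\widehat{core}$ containing a query vertex can be computed, the snippet below peels low-degree vertices and then collects the connected component of $q$. The function name `k_core_component` and the edge list are assumptions for illustration: the edges are a hypothetical graph chosen to match the degrees described in Example 1, not the actual graph of Figure 2(a).

```python
from collections import deque

def k_core_component(adj, k, q):
    """Vertex set of the connected k-core component containing q (empty set if none)."""
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    queue = deque(v for v in adj if deg[v] < k)
    # Peeling: repeatedly delete vertices whose remaining degree is below k.
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    if q not in alive:
        return set()
    # Breadth-first search from q restricted to the surviving vertices.
    comp, frontier = {q}, deque([q])
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u in alive and u not in comp:
                comp.add(u)
                frontier.append(u)
    return comp

# Hypothetical graph: A, B, D, E are pairwise adjacent; C is adjacent to B and D only.
edges = [("A", "B"), ("A", "D"), ("A", "E"), ("B", "D"), ("B", "E"),
         ("D", "E"), ("C", "B"), ("C", "D")]
adj = {v: set() for e in edges for v in e}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(sorted(k_core_component(adj, 3, "A")))  # ['A', 'B', 'D', 'E']
print(sorted(k_core_component(adj, 2, "C")))  # ['A', 'B', 'C', 'D', 'E']
```

With $k=3$ the vertex $C$ is peeled first because its degree is 2, which leaves the four mutually adjacent vertices.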

A profiled graph $G(V,E)$ is an undirected graph with vertex set $V$ and edge set $E$ . Each vertex $v\in V$ is associated with a profiled tree (P-tree) to describe $v$ ’s hierarchical attributes.

Definition 2 (P-tree).

The P-tree of vertex $q$, denoted by $T(q)=(V_{T(q)},E_{T(q)})$, is a rooted ordered tree, where $V_{T(q)}$ is the set of attribute labels and $E_{T(q)}$ is the set of edges between labels. A P-tree satisfies the following constraints: (1) there is only one root node $r\in V_{T(q)}$; (2) every edge $(x,y)\in E_{T(q)}$ is directed, and $y$ is the child attribute label of $x$; and (3) for every $y\in V_{T(q)}$ with $y\neq r$, there is one and only one $x\in V_{T(q)}$ such that $(x,y)\in E_{T(q)}$.

In practice, labels in the upper levels of the P-tree are more semantically general than those in lower levels. All edges in $E_{T(q)}$ preserve the semantic relationships among labels in $V_{T(q)}$ .

Definition 3 (induced rooted subtree).

Given two P-trees $S$ = $(V_{S},E_{S})$ and $T$ = $(V_{T},E_{T})$ , $S$ is the induced rooted subtree of $T$ , denoted by $S\subseteq T$ , if $V_{S}\subseteq V_{T}$ and $E_{S}\subseteq E_{T}$ .

Essentially, an induced rooted subtree defines an inclusion relationship between two P-trees. Unless otherwise specified, we use “subtree” to mean “induced rooted subtree”. We call the unified P-tree of all vertices’ P-trees a Global P-tree (GP-tree), which usually corresponds to a taxonomy system in practice.

Definition 4 (maximal common subtree).

Given a profiled graph $G$, the maximal common subtree of $G$, denoted by $\mathcal{M}(G)$, satisfies the following properties: (1) $\forall v\in G$, $\mathcal{M}(G)\subseteq T(v)$; (2) there exists no other common subtree $\mathcal{M^{\prime}}(G)$ such that $\mathcal{M}(G)\subseteq\mathcal{M^{\prime}}(G)$.
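Definitions 3 and 4 can be illustrated with a small sketch, under the assumption that every P-tree is a rooted subtree of one GP-tree and that each label identifies a unique GP-tree node. Under that assumption a P-tree can be stored as an ancestor-closed set of labels, the subtree test reduces to set inclusion, and the maximal common subtree is the set intersection. The hierarchy `GP_PARENT` (in particular the label "DB") and the helper names are hypothetical.

```python
from functools import reduce

# Hypothetical GP-tree stored as child -> parent; the root "r" has no parent.
GP_PARENT = {"r": None, "CM": "r", "IS": "r", "ML": "CM", "AI": "CM", "DB": "IS"}

def make_ptree(labels, parent=GP_PARENT):
    """P-tree as the ancestor-closed label set spanned by the given labels."""
    nodes = set()
    for label in labels:
        # Walk up to the root, stopping early once a known ancestor is reached.
        while label is not None and label not in nodes:
            nodes.add(label)
            label = parent[label]
    return frozenset(nodes)

def is_subtree(s, t):
    """S is an induced rooted subtree of T iff S's labels are all in T.

    The edge condition holds automatically because every label has a
    single parent in the shared GP-tree.
    """
    return s <= t

def maximal_common_subtree(ptrees):
    """The intersection of ancestor-closed sets is itself ancestor-closed."""
    return reduce(frozenset.intersection, ptrees)

t_b = make_ptree(["ML", "AI"])
t_c = make_ptree(["ML", "AI", "IS"])
t_d = make_ptree(["ML", "AI", "DB"])

print(is_subtree(t_b, t_d))                              # True
print(sorted(maximal_common_subtree([t_b, t_c, t_d])))   # ['AI', 'CM', 'ML', 'r']
```

The set-based encoding ignores sibling order, which does not affect inclusion tests when all P-trees come from the same taxonomy.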

The common subtree depicts the common hierarchical part among all P-trees in a subgraph. We use the maximal structure $\mathcal{M}$ ( $G$ ) to consider both the high-level and low-level labels and it fully mines the common features of this subgraph. As a result, by using the maximal common subtree, we can maximize vertices’ common profiles, including the topology and semantics of users’ profiles. Next, we formally introduce the PCS problem.

Problem 1 (PCS).

Given a profiled graph $G(V,E)$ , a positive integer $k$ , and a query node $q\in G$ , find a set $\mathcal{G}$ of graphs, such that $\forall G_{q}\in\mathcal{G}$ , the following properties hold:

$\bullet$ Connectivity. $G_{q}\subseteq G$ is connected and contains $q$;

$\bullet$ Structure cohesiveness. $\forall v\in G_{q}$, $deg_{G_{q}}(v)\geq k$, where $deg_{G_{q}}(v)$ denotes the degree of $v$ in $G_{q}$;

$\bullet$ Profile cohesiveness. There exists no other $G^{\prime}_{q}\subseteq G$ satisfying the above two constraints, such that $\mathcal{M}(G_{q})\subseteq\mathcal{M}(G^{\prime}_{q})$;

$\bullet$ Maximal structure. There exists no other $G^{\prime}_{q}$ satisfying the above properties, such that $G_{q}\subset G^{\prime}_{q}$ and $\mathcal{M}(G_{q})=\mathcal{M}(G^{\prime}_{q})$.

Essentially, a profiled community (PC) is a subgraph of $G$ , in which vertices are closely related in both structure and semantics. In Problem 1, the first two properties and last property ensure the structure cohesiveness, as shown in the literature [40, 26]. The unique property profile cohesiveness captures the maximal shared profile among all the vertices of $G_{q}$ . Moreover, since the shared subtree $\mathcal{M}(G_{q})$ shows the common hierarchical attribute, it can well explain the semantic theme of the community.

3.2 A Basic Solution

Since the vertices in a PC share a common subtree of the P-tree of the query vertex $q$, a straightforward method is to enumerate all the subtrees of $q$’s P-tree and find the corresponding PC’s. However, as illustrated in Lemma 1, the search space may be exponentially large, and the computation overhead renders this method impractical. To alleviate this issue, we iteratively perform two steps, candidate subtree generation and community verification, which are described after the lemma.

Lemma 1.

The maximum number of subtrees of a P-tree with $x$ nodes is $2^{x-1}+1$ .

Proof.

Let $f(x)=\max\{L\in\mathbb{N}\mid L \text{ is the number of subtrees of a tree with } x \text{ nodes}\}$. As shown in Fig. 3(a), $p_{i}$ denotes the $i$-th child of the P-tree. Then it is not hard to see that there are $2^{x-1}+1$ subtrees, including the “empty tree” (which contains no P-tree node). So $f(x)\geq 2^{x-1}+1$. In this case, we do not need to worry about the “parent-child” relationship between P-tree nodes, so $2^{x-1}+1$ is also the upper bound of $f(x)$. Then we can infer that $f(x)=2^{x-1}+1$.

More formally, we can verify the correctness of this formula. As shown in Fig. 3(b), the left triangle (including $r$) denotes the subtree with $i$ nodes and the right one represents the subtree with $x-i$ nodes. This gives Equation (1) below. Note that the “empty tree” should be included and thus $f(0)=1$. We can construct different subtrees by combining subtrees in the left and right parts, so $f(x)$ can be computed from $f(i)$ and $f(x-i)$. Note that the “empty tree” in the left and right parts should not be included simultaneously. Finally, we add 1 to $f(x)$ to account for the “empty tree”.

$$f(x)=\begin{cases}1, & x=0\\ \max_{i=0}^{x}\{f(i)\cdot[f(x-i)-1]\}+1, & x\geq 1,\ x\in\mathbb{N}\end{cases}\qquad(1)$$

Now we can directly verify that $f(x)=2^{x-1}+1$ satisfies the equation, which completes the proof.

∎
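The count in Lemma 1 can be checked numerically on small trees. The sketch below is not part of the proof; it counts induced rooted subtrees (plus the empty tree) by a product over children and compares a star-shaped tree, where every non-root node is a child of the root, with a path. The helper names `count_subtrees`, `star`, and `path` are assumptions for illustration.

```python
def count_subtrees(children, root):
    """Number of induced rooted subtrees of a tree, including the empty tree."""
    def rooted(v):
        # A subtree containing v either omits a child entirely or
        # includes one of the rooted subtrees hanging from that child.
        total = 1
        for c in children.get(v, []):
            total *= 1 + rooted(c)
        return total
    return rooted(root) + 1  # +1 accounts for the empty tree

def star(x):
    """Tree with x nodes: root 0 and x - 1 leaf children."""
    return {0: list(range(1, x))}

def path(x):
    """Tree with x nodes forming a single chain 0 -> 1 -> ... -> x - 1."""
    return {i: [i + 1] for i in range(x - 1)}

for x in range(1, 8):
    print(x, count_subtrees(star(x), 0), count_subtrees(path(x), 0), 2 ** (x - 1) + 1)
```

For the star the count equals $2^{x-1}+1$, while the path yields only $x+1$, which is consistent with the star being the worst case discussed in the proof.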

Step 1: candidate subtree generation. To generate the candidate subtrees, the key problem is how to avoid redundancies of the subtree enumeration. In [42], Asai et al. introduced a tree pattern enumeration strategy, and it is based on the following two concepts: (1) Rightmost leaf is the last P-tree node according to the depth-first traversal order. (2) Rightmost path is defined as a path from the root node to the rightmost leaf. Given a tree $T^{\prime}$ , a new subtree $T$ can only be generated by adding a new node $t$ to $T^{\prime}$ such that the following hold: (1) $t$ ’s parent node is on the rightmost path of $T^{\prime}$ ; (2) $t$ is the rightmost leaf of $T$ . As shown in [42], this generation strategy guarantees that all the subtrees of the P-tree will be enumerated without repetition. Thus, we follow this strategy to generate the candidate subtrees.
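The rightmost-extension rule can be sketched for the special case used here, where all candidates are subtrees of one fixed P-tree. If nodes are ranked in depth-first order, a node that comes after the current rightmost leaf and whose parent is already in the subtree necessarily hangs off the rightmost path, so the two conditions collapse into a rank test plus a parent test. This is a simplified illustration rather than the general pattern-growth procedure of [42]; the function names and the example tree are hypothetical.

```python
def preorder(children, root):
    """Return the nodes in depth-first (preorder) order and a child -> parent map."""
    order, parent, stack = [], {root: None}, [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for c in reversed(children.get(v, [])):
            parent[c] = v
            stack.append(c)
    return order, parent

def enumerate_subtrees(children, root):
    """Yield each non-empty induced rooted subtree once, as a tuple in preorder."""
    order, parent = preorder(children, root)
    rank = {v: i for i, v in enumerate(order)}
    stack = [(root,)]
    while stack:
        tree = stack.pop()
        yield tree
        members = set(tree)
        last = rank[tree[-1]]  # the last node in preorder is the rightmost leaf
        for t in order[last + 1:]:
            # t follows the rightmost leaf; if its parent is present, that
            # parent is on the rightmost path and t becomes the new rightmost leaf.
            if parent[t] in members:
                stack.append(tree + (t,))

ptree = {"r": ["CM", "IS"], "CM": ["ML", "AI"]}
subtrees = list(enumerate_subtrees(ptree, "r"))
print(len(subtrees), len(set(subtrees)))  # 10 10
```

Each subtree is produced by exactly one insertion sequence (its nodes in increasing preorder rank), which is why no duplicate check is needed.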

Step 2: community verification. After a candidate subtree $T$ has been generated, we verify the existence of the corresponding community. We use $G_{k}[T]$ to represent the largest connected subgraph of $G$ containing $q$ in which each vertex has at least $k$ neighbors and has a P-tree that contains the subtree $T$. We say that $T$ is feasible if $G_{k}[T]$ exists. The verification step is mainly based on the following proposition and lemma.

Proposition 1.

Given a profiled graph $G$, two P-trees $T^{\prime}$ and $T$, and the query vertex $q$, if $T^{\prime}\subseteq T$, then $G_{k}[T]\subseteq G_{k}[T^{\prime}]$.

Proof.

As we defined before, $G_{k}[T]$ denotes the $k$ - $\widehat{core}$ containing $q$ where each vertex contains the subtree $T$ . (1) If $G_{k}[T]=\emptyset$ , $G_{k}[T]\subseteq G_{k}[T^{\prime}]$ always holds. (2) If $G_{k}[T]\neq\emptyset$ , we have $\forall v\in G_{k}[T]$ , $T\subseteq T(v)$ . Then from $T^{\prime}\subseteq T$ , we can infer $\forall v\in G_{k}[T]$ , $T^{\prime}\subseteq T(v)$ . This means each vertex $v\in G_{k}[T]$ also contains the P-tree $T^{\prime}$ . Thus if $G_{k}[T]\neq\emptyset$ , $G_{k}[T]\subseteq G_{k}[T^{\prime}]$ . In summary, Proposition 1 holds. ∎

Lemma 2 (Anti-monotonicity).

Given a subtree $T$ , if $G_{k}[T]\neq\emptyset$ , then $\forall T^{\prime}\subseteq T$ , $G_{k}[T^{\prime}]\neq\emptyset$ .

Proof.

From Proposition 1, we know $\forall T^{\prime}\subseteq T$ , $G_{k}[T]\subseteq G_{k}[T^{\prime}]$ . Now since $G_{k}[T]\neq\emptyset$ , we have $\forall T^{\prime}\subseteq T,G_{k}[T^{\prime}]\neq\emptyset$ . ∎

By Lemma 2, if $T$ is infeasible (i.e., $G_{k}[T]$ does not exist), then no subtree generated from $T$ can be feasible, so we can stop generating subtrees from $T$. The basic method begins by generating a subtree from the root node. Then, it iteratively performs the two steps above to retrieve all the feasible $G_{k}[T]$’s, until no larger subtrees can be generated. The pseudocode of basic is given in Algorithm 1.

Complexity analysis. Let $m$ be the number of edges in $G$. In the worst case, all edges are traversed to compute each $G_{k}[T]$ and all the subtrees are verified. As a result, basic completes in $O(2^{|T(q)|}\cdot m)$ time, where $|T(q)|$ denotes the number of nodes of $T(q)$. In practice, the value of $2^{|T(q)|}$ can be very large, which makes basic impractical. To alleviate this issue, we propose more efficient index-based solutions in the next section.

Algorithm 1 presents basic. We first initialize the result set $\mathcal{G}$ and load $q$’s P-tree $T(q)$ (line 2). Then we compute $G_{k}$, the largest connected subgraph of $G$ containing $q$ in which each vertex has degree at least $k$ (line 3). In each iteration, we generate new subtrees from the current subtree $T^{\prime}$. For each new subtree $T$, we verify the existence of $G_{k}[T]$ (lines 4-10). If $G_{k}[T]$ exists, we add $T$ to $\Phi$ (lines 11-12); otherwise, if no subtree can be generated from $T^{\prime}$ or all subtrees generated from $T^{\prime}$ are infeasible, we add $G_{k}[T^{\prime}]$ to $\mathcal{G}$ if $T^{\prime}$ is maximal (line 13). Finally, all PC’s are returned (line 14).
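The generate-and-verify loop described above can be sketched as follows. This is an illustrative reading of the basic method, not a transcription of Algorithm 1: P-trees are stored as label sets, the $k$-core is recomputed from scratch with a simple fixed-point loop, and maximality is checked at the end by comparing feasible subtrees. The function names, the graph, and the label hierarchy are all hypothetical, and $T(q)$ is assumed to be non-empty.

```python
def k_core_component(adj, k, q, allowed):
    """Connected k-core component containing q inside the subgraph induced by `allowed`."""
    alive = set(allowed)
    while True:  # simple fixed-point peeling, not the linear-time algorithm
        low = {v for v in alive if sum(u in alive for u in adj[v]) < k}
        if not low:
            break
        alive -= low
    if q not in alive:
        return frozenset()
    comp, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in comp:
                comp.add(u)
                stack.append(u)
    return frozenset(comp)

def basic_pcs(adj, ptrees, gp_order, gp_parent, k, q):
    """Map each maximal feasible subtree of T(q) to its community G_k[T]."""
    labels = [x for x in gp_order if x in ptrees[q]]  # T(q) in preorder
    rank = {x: i for i, x in enumerate(labels)}

    def community(tree):
        allowed = {v for v in adj if set(tree) <= ptrees[v]}
        return k_core_component(adj, k, q, allowed)

    root = (labels[0],)
    start = community(root)
    if not start:
        return {}
    feasible, stack = {}, [(root, start)]
    while stack:
        tree, comm = stack.pop()
        feasible[frozenset(tree)] = comm
        for t in labels[rank[tree[-1]] + 1:]:      # rightmost extension
            if gp_parent[t] in tree:
                cand = tree + (t,)
                cand_comm = community(cand)
                if cand_comm:                       # Lemma 2: infeasible trees are not extended
                    stack.append((cand, cand_comm))
    return {t: g for t, g in feasible.items() if not any(t < s for s in feasible)}

# Hypothetical data: two triangles sharing D, plus the edge A-B.
edges = [("B", "C"), ("B", "D"), ("C", "D"), ("A", "D"), ("A", "E"), ("D", "E"), ("A", "B")]
adj = {v: set() for e in edges for v in e}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
gp_order = ["r", "CM", "ML", "AI", "IS", "DMS"]
gp_parent = {"r": None, "CM": "r", "ML": "CM", "AI": "CM", "IS": "r", "DMS": "r"}
ptrees = {"A": {"r", "IS", "DMS"}, "E": {"r", "IS", "DMS"},
          "B": {"r", "CM", "ML", "AI"}, "C": {"r", "CM", "ML", "AI"},
          "D": set(gp_order)}

for tree, comm in basic_pcs(adj, ptrees, gp_order, gp_parent, 2, "D").items():
    print(sorted(tree), sorted(comm))
```

On this toy input the query vertex $D$ with $k=2$ yields two communities, one per theme, which mirrors the situation described for Fig. 2.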

4 Index-based Solutions

We first introduce some preliminaries and the proposed CP-tree index, and then discuss the index-based query algorithms.

4.1 $k$ -core and CL-tree

$k$-core. In line with existing CS work [26, 11], we use the $k$-core to satisfy the minimum degree and maximal structure constraints of a PC. Given an integer $k$ ($k\geq 0$), the $k$-core of $G$, denoted by $G_{k}$, is the largest subgraph of $G$ such that $\forall v\in G_{k}$, $deg_{G_{k}}(v)\geq k$. Since $G_{k}$ may be disconnected, we use $k$-$\widehat{core}$ to denote one of its connected components. An important property of the $k$-core is the “nested” property: given two integers $i$ and $j$, $j$-$\widehat{core}\subseteq i$-$\widehat{core}$ if $i<j$. In Fig. 4(a), the [math]-core represents the whole graph, and the 3-core is nested in the 2-core. Computing all the $k$-cores of a graph $G$, known as core decomposition, can be completed by an $O(m)$ algorithm [27], where $m$ is the number of edges in $G$.
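Core decomposition can be sketched by repeatedly removing a vertex of minimum remaining degree. The version below is a deliberately simple quadratic-time illustration and is not the $O(m)$ bucket-based algorithm of [27]; the function name `core_numbers` and the example graph are hypothetical.

```python
def core_numbers(adj):
    """Core number of every vertex: the largest k such that the vertex is in the k-core."""
    deg = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core, current = {}, 0
    while remaining:
        v = min(remaining, key=deg.__getitem__)  # vertex of minimum remaining degree
        current = max(current, deg[v])           # core numbers never decrease during peeling
        core[v] = current
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

# Hypothetical graph: A, B, D, E are pairwise adjacent; C is adjacent to B and D only.
edges = [("A", "B"), ("A", "D"), ("A", "E"), ("B", "D"), ("B", "E"),
         ("D", "E"), ("C", "B"), ("C", "D")]
adj = {v: set() for e in edges for v in e}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

core = core_numbers(adj)
three_core = {v for v in core if core[v] >= 3}
two_core = {v for v in core if core[v] >= 2}
print(sorted(three_core), sorted(two_core))
```

The vertices with core number at least $j$ always form a subset of those with core number at least $i$ when $i<j$, which is the nested property that the CL-tree exploits.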

CL-tree. Since $k$-cores are nested, all the $k$-cores of a graph can be organized into a tree structure, called the CL-tree [11]. In this paper, we adopt it, but skip the labels on the tree. The CL-tree of the graph in Fig. 4(a) is shown in Fig. 4(b). The vertices in each CL-tree node, together with the vertices in all its descendant nodes, represent a $k$-$\widehat{core}$. For example, vertex $C$ and the vertices $\{A,B,D,E\}$ in its child node compose a 2-$\widehat{core}$. Since each vertex appears only once, the space cost of the CL-tree is $O(n)$, where $n$ is the number of vertices in $G$. In addition, we maintain a map vertexNodeMap, where the key is a vertex and the value is the CL-tree node that contains it; this allows us to locate the $k$-$\widehat{core}$ containing any query vertex efficiently.

4.2 CP-tree Index

Index Overview. We build the Core Profiled tree (CP-tree) index by considering both the P-tree structure and $k$ -cores. We depict an example CP-tree in Fig. 5 using the profiled graph in Fig. 1(a).

Each CP-tree node corresponds to a label and stores the $k$-cores sharing this label. To summarize, each node $p$ consists of the following four elements:

(1) label: the attribute label;

(2) parentNode: the parent node of $p$ ;

(3) childList: a list of child CP-tree nodes of $p$ ; and

(4) vertexNodeMap: a map that stores the CL-tree.

In addition, we maintain a map headMap, where the key is a vertex $v$, and the value is a list of CP-tree nodes, each of which corresponds to a leaf node of $v$’s P-tree. The main advantages of the CP-tree are listed below.

$\bullet$ Restore P-trees. By utilizing the headMap, each vertex’s P-tree can be restored by traversing the leaf nodes up to the root node.

$\bullet$ Locating $k$-$\widehat{core}$. Given an integer $k$, a query vertex $q$, and a CP-tree node $t$, we design a function $get(k,q,t)$ that uses vertexNodeMap to locate, in constant time, the $k$-$\widehat{core}$ that contains $q$ and in which each vertex has the label $t.label$.

$\bullet$ Query efficiency. As discussed above, the label information of each vertex’s P-tree can be efficiently accessed using the headMap.
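The node layout and the headMap can be sketched as below. This is a simplified illustration under stated assumptions: the per-label CL-tree is replaced by a plain set of vertices (the field `vertices` is a stand-in for vertexNodeMap), so the constant-time $get(k,q,t)$ lookup is not reproduced. The class name `CPNode`, the helper functions, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass(eq=False)
class CPNode:
    label: str
    parent: Optional["CPNode"] = field(default=None, repr=False)
    children: List["CPNode"] = field(default_factory=list, repr=False)
    # Stand-in for vertexNodeMap: here only the set of vertices carrying this label.
    vertices: Set[str] = field(default_factory=set)

def build_index(ptrees: Dict[str, Set[str]], gp_parent: Dict[str, Optional[str]]):
    """Create one node per GP-tree label, link the nodes, and fill the head map."""
    nodes = {label: CPNode(label) for label in gp_parent}
    for label, par in gp_parent.items():
        if par is not None:
            nodes[label].parent = nodes[par]
            nodes[par].children.append(nodes[label])
    head_map = {}
    for v, labels in ptrees.items():
        for x in labels:
            nodes[x].vertices.add(v)
        # Leaves of T(v): labels none of whose children appear in T(v).
        head_map[v] = [nodes[x] for x in sorted(labels)
                       if not any(c.label in labels for c in nodes[x].children)]
    return nodes, head_map

def restore_ptree(head_map, v):
    """Rebuild T(v) by walking from each leaf node up to the root."""
    labels = set()
    for node in head_map[v]:
        while node is not None and node.label not in labels:
            labels.add(node.label)
            node = node.parent
    return labels

gp_parent = {"r": None, "CM": "r", "ML": "CM", "AI": "CM", "IS": "r", "DMS": "r"}
ptrees = {"A": {"r", "IS", "DMS"}, "B": {"r", "CM", "ML", "AI"},
          "D": {"r", "CM", "ML", "AI", "IS", "DMS"}}
nodes, head_map = build_index(ptrees, gp_parent)
print([n.label for n in head_map["B"]])      # ['AI', 'ML']
print(sorted(restore_ptree(head_map, "B")))  # ['AI', 'CM', 'ML', 'r']
```

Storing only leaf nodes per vertex keeps the head map small while still allowing the full P-tree to be recovered through the parent links.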

Index Construction. We incrementally create CP-tree nodes and then link them up to build the CP-tree index. The pseudocode of CP-tree index construction is presented in Algorithm 2. For each vertex $v$, we read $T(v)$ and create new CP-tree nodes (lines 2-5). For each CP-tree node $t$, we add $v$ to $t$ for later CL-tree construction (lines 6, 9). If P-tree node $x$ is a leaf node, we update $headMap$ (line 7). Then we link up all CP-tree nodes according to the GP-tree structure. Note that if the GP-tree is unknown, we can build it incrementally while reading the P-trees in the previous step (line 10). Finally, $\mathcal{I}$ is returned (line 11).

Complexity analysis. Lines 2-7 take linear time. The time complexity of building a CL-tree is $O(m\cdot\alpha(n))$ [40, 11], where $m$ is the number of edges in $G$ and $\alpha(n)$ , the inverse Ackermann function, is less than 5 for all practical values of $n$ . Thus the time complexity of building the CP-tree is $O(|P|\cdot m\cdot\alpha(n))$ , which is linear in the size of $G$ . The space cost of the CP-tree is $O(|P|\cdot n)$ , where $|P|$ denotes the number of labels in $G$ . The space cost of the headMap is $O(\hat{l}\cdot n)$ , where $\hat{l}$ denotes the average number of leaf nodes in each vertex’s P-tree and $\hat{l}<|P|$ . Therefore, the total space complexity is $O(|P|\cdot n)$ , which is linear in the size of $G$ .

4.3 Index-based Query Algorithms

Now we present our index-based query solutions. The first one follows the framework of basic, and it incrementally generates and verifies the subtrees of the P-tree (from smaller subtrees to larger ones). Thus we call it incre. The advanced methods borrow some ideas from MARGIN [43], an algorithm for mining maximal frequent subgraphs. As we will explain later, the advanced methods can find all PC’s by examining a small fraction of the subtrees, resulting in high efficiency. Their time complexities are $O(2^{|T(q)|}\cdot m)$ , because in the worst case all the subtrees are verified. However, as we will show in Section 5.4, in practice they are much more efficient than these worst-case time complexities suggest.

4.3.1 The Method incre

We begin with an interesting lemma, which greatly accelerates the verification step.

Lemma 3.

Given a CP-tree index $\mathcal{I}$ , a subtree $T^{\prime}$ , and a new subtree $T$ generated from $T^{\prime}$ by adding a new P-tree node, we have $G_{k}[T]\subseteq G_{k}[T^{\prime}]\cap\mathcal{I}.get(k,q,T\setminus T^{\prime})$ , where $T\setminus T^{\prime}$ denotes the newly added node.

Proof.

$T=T^{\prime}\cup t$ , so we have $T^{\prime}\subseteq T$ . Based on Proposition 1, we know $G_{k}[T]\subseteq G_{k}[T^{\prime}]$ . Similarly, $t\subseteq T$ , so we have $G_{k}[T]\subseteq\mathcal{I}.get(k,q,T\setminus T^{\prime})$ , where $\mathcal{I}.get(k,q,T\setminus T^{\prime})$ is the $k$ - $\widehat{core}$ containing the query vertex $q$ and the P-tree node $T\setminus T^{\prime}$ . Hence $G_{k}[T]\subseteq G_{k}[T^{\prime}]\cap\mathcal{I}.get(k,q,T\setminus T^{\prime})$ . ∎
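Lemma 3 suggests a cheap verification step: intersect the previous community with the core of the newly added node, then search for a $k$ - $\widehat{core}$ only inside that intersection. The sketch below illustrates this under stated assumptions. The helper names `peel_to_k_core` and `verify_extension` are hypothetical, vertex sets stand in for subgraphs, and the second argument of `verify_extension` plays the role of $\mathcal{I}.get(k,q,T\setminus T^{\prime})$ , which is assumed to be already available.

```python
def peel_to_k_core(adj, candidates, q, k):
    """Connected k-core containing q within the induced subgraph on `candidates`."""
    alive = set(candidates)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree is counted only against vertices that are still alive.
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                stack.append(u)
    return component


def verify_extension(adj, g_prev, core_of_new_node, q, k):
    """Candidate for G_k[T]: search only inside G_k[T'] intersected with get(k, q, t)."""
    return peel_to_k_core(adj, set(g_prev) & set(core_of_new_node), q, k)
```

An empty result means that the extended subtree is infeasible; a non-empty result is the community for the extended subtree in this simplified model, where membership in the two input sets already encodes the label constraints.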

Since incre searches for communities only within the subgraph found in the previous iteration, the query efficiency is improved. We present incre in Algorithm 3.

We first use headMap to locate the leaf nodes of $T(q)$ and then restore $T(q)$ (line 2). We initialize $\Psi$ by using $T(q)$ (line 3). In each iteration, for the current subtree $T^{\prime}$ , we generate new subtrees. For each new subtree $T$ , we verify the existence of $G_{k}[T]$ using the index (lines 4-8). If $G_{k}[T]$ exists, we add $T$ to $\Phi$ (lines 9-10); otherwise, if no subtree can be generated from $T^{\prime}$ or all subtrees generated from $T^{\prime}$ are infeasible, we add $G_{k}[T^{\prime}]$ to $\mathcal{G}$ if $T^{\prime}$ is maximal (line 11). Finally, all PC’s are returned (line 12).
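The level-wise enumeration performed by incre can be sketched as below. This is an illustration of the general idea, not a transcription of Algorithm 3: the function name `incre_sketch` is hypothetical, subtrees are modeled as frozensets of labels, `parent_of` maps each label to its parent label (`None` for the root), and feasibility is delegated to a caller-supplied predicate `is_feasible` that is assumed to be anti-monotone (Lemma 2).

```python
def incre_sketch(query_tree, parent_of, is_feasible):
    """Return the maximal feasible rooted subtrees of query_tree.

    Subtrees are grown one node at a time from the root; a feasible subtree
    with no feasible one-node extension is maximal by anti-monotonicity.
    """
    root = next(t for t in query_tree if parent_of[t] is None)
    start = frozenset({root})
    if not is_feasible(start):
        return []
    frontier, seen, maximal = [start], {start}, []
    while frontier:
        next_frontier = []
        for tree in frontier:
            extended = False
            for t in query_tree:
                # A node can be added only if its parent is already present.
                if t not in tree and parent_of[t] in tree:
                    bigger = tree | {t}
                    if is_feasible(bigger):
                        extended = True
                        if bigger not in seen:
                            seen.add(bigger)
                            next_frontier.append(bigger)
            if not extended:
                maximal.append(tree)
        frontier = next_frontier
    return maximal
```

In a full implementation, `is_feasible` would be the index-based check from Lemma 3, so each verification reuses the community found for the smaller subtree.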

4.3.2 The Advanced Methods

The method incre follows the Apriori-based approach, which explores all possible subtrees by traversing the search space from smaller subtrees to larger ones. However, as demonstrated in Section 5.1, the maximal feasible subtrees often lie in the middle of the search space, which implies that most of the exploration may be avoided. Based on this observation, we adapt MARGIN [43] to tackle PCS.

MARGIN: It does not perform a bottom-up (or top-down) traversal of the search space; instead, it narrows the search space by examining only subgraphs that lie on the border between frequent and infrequent subgraphs. It first finds an initial pair of graphs ( $CR$ , $R$ ) where $R$ is frequent and $CR$ is not. In addition, $CR$ is a child subgraph of $R$ (i.e., $CR$ is a super-graph of $R$ and they differ by exactly one edge). Similarly, $R$ is a parent subgraph of $CR$ . The pair ( $CR$ , $R$ ) is called a cut, and from this cut MARGIN expands and finds all other cuts by adding or deleting an edge to obtain new adjacent subgraphs. MARGIN defines this function as expandCut, and Thomas et al. [43] have proved that expandCut is able to find all maximal frequent subgraphs.

Inspired by MARGIN, we design the following functions.

1. Function expandPtree. This function is adapted from expandCut [43] and the main modifications are as follows.

$\bullet$ We dynamically obtain child subgraphs and parent subgraphs, which are called child subtrees and parent subtrees in our case, using the parentNodes and childLists of CP-tree nodes, instead of pre-computing all subtrees in the search space as MARGIN does.

$\bullet$ We define a pair of P-trees ( $IF$ , $F$ ) as a cut, where $IF$ is the child subtree of $F$ and $F$ is feasible while $IF$ is not;

$\bullet$ We dynamically verify whether a feasible subtree is maximal.

$\bullet$ We develop a function verifyPtree to verify the feasibility.

We now illustrate expandPtree in Algorithm 4. As we will introduce later, if $IF=\emptyset$ and $F\neq\emptyset$ , we can directly update $\mathcal{G}$ because $F$ is already the maximal common subtree (line 2). Otherwise, we first use ( $IF,F$ ) to initialize the queue $Q$ (line 4). Then, for each pair, we iteratively verify its adjacent pairs (lines 5-17). If the parent subtree $Y_{i}$ of $IF$ is feasible, $G_{k}[Y_{i}]$ may not be a final result. This is because subtrees are not enumerated in a regular order, so $Y_{i}$ may be only temporarily maximal, and we need to verify it repeatedly. If there exist other feasible subtrees verified in previous steps that are subtrees of $Y_{i}$ , we need to replace their corresponding subgraphs with $G_{k}[Y_{i}]$ (line 9). Finally, we return $\mathcal{G}$ (line 18).

Lemma 4.

Given a P-tree pair ( $IF,F$ ), expandPtree can find all feasible subtrees for a PCS query.

The proof of Lemma 4 is based on the following preliminaries.

A lattice is essentially a pre-processed data structure in which all possible subgraphs of a given graph are enumerated. Taking the graph in Fig. 6(a) as an example, the subgraphs in each level have the same size (i.e., number of edges). The bottom level (level 0) corresponds to the empty graph, and level $i$ lists all size- $i$ subgraphs. In the lattice, each subgraph is linked to its parents (i.e., subgraphs of this graph that differ from it by exactly one edge) and its children (i.e., super-graphs of this graph that differ from it by exactly one edge). We can observe that a P-tree can directly replace the graph to construct the lattice.

Property 1 (Upper- $\Diamond$ -Property [43]).

Any two child subgraphs $C_{i},C_{j}$ of a graph $P$ will have a common child subgraph $A$ .

In Property 1, $C_{i},C_{j},P$ and $A$ are four subgraphs. $C_{i},C_{j}$ are two child subgraphs of $P$ (i.e., super-graphs of $P$ that each differ from $P$ by one edge, $e_{1}$ and $e_{2}$ respectively). Then there must exist a subgraph $A$ that is a child subgraph of both $C_{i}$ and $C_{j}$ . Property 1 is very intuitive in graphs. In Proposition 2, we show that the Upper- $\Diamond$ -Property also holds for P-trees.

Proposition 2.

P-trees satisfy the Upper- $\Diamond$ -Property.

Proof.

In P-trees, $e_{1}$ and $e_{2}$ can be two P-tree nodes such that subtrees $C_{i}=P\cup e_{1}$ and $C_{j}=P\cup e_{2}$ . There must exist a P-tree $A=P\cup e_{1}\cup e_{2}=(P\cup e_{1})\cup e_{2}=(P\cup e_{2})\cup e_{1}$ . Thus $A=C_{i}\cup e_{2}=C_{j}\cup e_{1}$ which means $A$ is the common child subtree of $C_{i}$ and $C_{j}$ . ∎
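The argument in this proof can be checked mechanically on a small example. The sketch below is only an illustration with made-up labels: subtrees are frozensets of labels, `parent_of` is a hypothetical map from a label to its parent label, and `child_subtrees` enumerates the one-node extensions of a rooted subtree. The check confirms that the union of any two child subtrees is itself a child of both.

```python
from itertools import combinations


def child_subtrees(subtree, parent_of):
    """All rooted subtrees obtained by adding exactly one node to `subtree`."""
    children = []
    for label, parent in parent_of.items():
        if label in subtree:
            continue
        # The root may be added to the empty tree; any other node needs its parent.
        if (parent is None and not subtree) or parent in subtree:
            children.append(frozenset(subtree | {label}))
    return children


def has_upper_diamond(subtree, parent_of):
    """True if every pair of child subtrees has their union as a common child."""
    kids = child_subtrees(subtree, parent_of)
    for ci, cj in combinations(kids, 2):
        common = ci | cj
        if common not in child_subtrees(ci, parent_of):
            return False
        if common not in child_subtrees(cj, parent_of):
            return False
    return True


# Toy GP-tree: CS is the root, DB and ML are its children, Index is under DB.
toy_parent_of = {"CS": None, "DB": "CS", "ML": "CS", "Index": "DB"}
```

For the subtree containing only `CS`, the two children add `DB` and `ML` respectively, and their union is reachable from either child by adding the other node, exactly as in the proof.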

Now we formally give the proof of Lemma 4.

Proof.

The method expandPtree is mainly adapted from MARGIN. As stated in MARGIN, correctness holds when the adapted problem satisfies the following constraints [43]:

(1)

The search space is a subset of the lattice.

(2)

The Upper- $\Diamond$ -property holds.

(3)

The anti-monotone property is satisfied.

(4)

A candidate set can be defined which is a “boundary” set such that every element in the set satisfies a given user-constraint and there exists an immediate child in the lattice that does not satisfy the constraint because of the anti-monotone property. For every element in the set, there exists an immediate parent that does not satisfy the constraint for the monotone property.

(5)

Solution sets can be generated from the candidate sets.

For the PCS problem, the “element” in constraint (1) is the P-tree, and constraint (1) is clearly satisfied. Proposition 2 proves that constraint (2) is satisfied. The anti-monotonicity property has been proved in Lemma 2, and thus constraint (3) is also satisfied. In MARGIN, the “user-constraint” of constraint (4) is whether a graph is frequent with respect to a given threshold. Here, the “user-constraint” is whether a P-tree is feasible. For instance, suppose a P-tree $T^{\prime}$ is feasible (i.e., $G_{k}[T^{\prime}]$ exists) and $T$ , a child of $T^{\prime}$ , is not feasible (i.e., $G_{k}[T]$ does not exist). Then $T^{\prime}$ belongs to this “boundary” set, and its immediate child $T$ does not satisfy the “user-constraint” because of the anti-monotone property. Hence constraint (4) holds. Once a P-tree is added to the candidate set, we need to verify whether this P-tree is maximal. This means that the solution set is a subset of the candidate set. Thus constraint (5) is satisfied. In conclusion, Lemma 4 holds. ∎

2. Function verifyPtree. Given a subtree $T$ , $T_{child}$ and $T_{parent}$ denote a child and the parent subtree of $T$ . Let $l$ denote the number of $T_{parent}$ ’s leaf nodes and $t_{n_{i}}$ represent the $i$ th leaf node of $T_{parent}$ . Derived from Lemma 3, we have

$\bullet$ $G_{k}[T_{child}]\ \subseteq\ G_{k}[T]\cap\mathcal{I}.get(k,q,T_{child}\setminus T)$ .

$\bullet$ $G_{k}[T_{parent}]\ \subseteq\ \bigcap_{i=1}^{l}\mathcal{I}.get(k,q,t_{n_{i}})$ .

Since all P-trees are subtrees of the GP-tree, if a P-tree has the attribute $t$ , then $t$ ’s parent attribute $t^{\prime}$ is also included. Thus, $\mathcal{I}.get(k,q,t)\subseteq\mathcal{I}.get(k,q,t^{\prime})$ . For a special subtree $T_{i}$ (a path from leaf node $t_{n_{i}}$ to root node $r$ ), we can finally get $G_{k}[T_{i}]=\mathcal{I}.get(k,q,t_{n_{i}})$ . Note that $T_{parent}$ can be seen as several paths and thus we get $G_{k}[T_{parent}]\subseteq\bigcap_{i=1}^{l}\mathcal{I}.get(k,q,t_{n_{i}})$ .
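The path argument above means that only the leaves of a subtree need to be looked up in the index. The following sketch illustrates this bound; it is not the paper's verifyPtree. The names `leaves_of` and `candidate_vertices` are hypothetical, subtrees are sets of labels, `parent_of` maps a label to its parent label, and `get_core` is an assumed callable standing in for $\mathcal{I}.get(k,q,\cdot)$ with $k$ and $q$ fixed.

```python
def leaves_of(subtree, parent_of):
    """Labels of `subtree` that are not the parent of another label in it."""
    internal = {parent_of[t] for t in subtree if parent_of[t] is not None}
    return {t for t in subtree if t not in internal}


def candidate_vertices(subtree, parent_of, get_core):
    """Upper bound on the community of `subtree`: intersection over its leaves."""
    result = None
    for leaf in leaves_of(subtree, parent_of):
        core = set(get_core(leaf))
        result = core if result is None else result & core
    return result if result is not None else set()


# Toy data: each label's core is contained in the core of its parent label.
toy_parent_of = {"CS": None, "DB": "CS", "ML": "CS", "Index": "DB"}
toy_cores = {
    "CS": {1, 2, 3, 4, 5},
    "DB": {1, 2, 3, 4},
    "ML": {2, 3, 4, 5},
    "Index": {1, 2, 3},
}
```

The returned set is only a superset of the community; a $k$ -core check restricted to these vertices is still required to decide feasibility.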

Based on CP-tree, verifyPtree can efficiently verify subtrees. Next we discuss three methods to find the initial cut.

3. Function find-I. We can adapt incre to find the initial cut. As shown in Algorithm 5, we incrementally enumerate subtrees and verify the existence of the corresponding communities. Once we find a subtree which is feasible while its child subtree is not, then we can regard them as an initial cut (lines 2-15).

4. Function find-D. We can decrementally generate subtrees from larger subtrees to smaller ones. We present the pseudocode of find-D in Algorithm 6. First, if $G_{k}[T(q)]$ exists, we can directly return it as a qualified community (lines 2-4). In each step, for an infeasible subtree $T$ , we remove one of $T$ ’s leaf nodes and verify the feasibility of the new subtrees (lines 6-11). Once there is a new feasible subtree, we treat $T$ and this new subtree as the initial cut (lines 12-17).
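The decremental search for an initial cut can be sketched as follows. This is an illustration of the idea behind find-D rather than Algorithm 6: the function name `find_d_sketch` is hypothetical, subtrees are frozensets of labels, `parent_of` maps a label to its parent label, and `is_feasible` is an assumed caller-supplied predicate. The function returns a pair $(IF,F)$ in which $F$ is feasible and $IF$ is $F$ plus one leaf and infeasible; when the full P-tree is feasible, it returns the empty set for $IF$ .

```python
def leaves_of(subtree, parent_of):
    """Labels of `subtree` that are not the parent of another label in it."""
    internal = {parent_of[t] for t in subtree if parent_of[t] is not None}
    return {t for t in subtree if t not in internal}


def find_d_sketch(full_tree, parent_of, is_feasible):
    """Shrink infeasible subtrees leaf by leaf until a feasible one appears."""
    full_tree = frozenset(full_tree)
    if is_feasible(full_tree):
        return frozenset(), full_tree       # T(q) itself is the answer
    frontier, seen = [full_tree], {full_tree}
    while frontier:
        next_frontier = []
        for tree in frontier:               # every tree here is infeasible
            for leaf in leaves_of(tree, parent_of):
                smaller = tree - {leaf}
                if is_feasible(smaller):
                    return tree, smaller    # initial cut (IF, F)
                if smaller not in seen:
                    seen.add(smaller)
                    next_frontier.append(smaller)
        frontier = next_frontier
    return None, None                       # no feasible subtree at all
```

Which cut is returned depends on the order in which leaves are removed; any such pair is a valid starting point for the expansion step.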

5. Function find-P. We can find the initial cut by directly verifying subtrees instead of verifying nodes one by one. Intuitively, a P-tree can be divided into several paths (from leaf nodes to the root). According to Lemma 2, these paths can be verified by checking the corresponding leaf nodes. We call this method find initial cut by path (find-P).

We present the pseudocode of find-P in Algorithm 7. $S$ denotes a set of P-tree nodes. Initially, it consists of all leaf nodes of $T(q)$ . If there does not exist a feasible node in $S$ , we trace up to verify their parent nodes (lines 13-14). Next, we iteratively check the nodes in $S$ . If we find a node $t$ such that $G_{k}[F\cup t]$ exists, we update $F$ (lines 5-6). Let $t^{\prime}_{parent}$ denote the parent node of $t^{\prime}$ . If we find a node $t$ such that $G_{k}[F\cup t]$ does not exist, we trace up to find the “boundary” where $G_{k}[t^{\prime}_{parent}]$ exists while $G_{k}[t^{\prime}]$ does not, and thus we find an initial pair (lines 8-11). Note that at this stage, $IF$ and $F$ may not be complete subtrees. Thus for the nodes in $IF$ and $F$ , we need to include all their ancestor nodes and then return $(IF,F)$ as a cut (lines 15-16).

Algorithm 8 gives the overall advanced method. Since there are three functions for finding the initial cut (find-I, find-D, and find-P), we have three variants of advanced, denoted by adv-I, adv-D and adv-P respectively.

5 Experiments

5.1 Setup

We consider two real datasets (ACMDL and PubMed) and two synthetic datasets (Flickr and DBLP). ACMDL (https://dl.acm.org/) and PubMed (https://www.nlm.nih.gov) are the co-authorship networks of researchers in computer science and biomedical areas respectively. Each vertex represents an author, and an edge is a co-authorship between two authors. For each author, her papers have been categorized by a hierarchical subject classification system (ACM CCS or Medical Subject Headings (MeSH, https://meshb.nlm.nih.gov/)), so we build the P-tree by unifying the categorization information of all her papers. For Flickr (https://www.flickr.com/) [44], each vertex represents a user and each edge denotes a “follow” relationship between two users. For DBLP (http://dblp.uni-trier.de/xml/), a vertex is an author and an edge represents a co-authorship relationship. For each vertex in these two datasets, we use a hash function to map the associated textual content to subjects of CCS to synthesize a P-tree. In this way, identical textual content is mapped to the same nodes in the P-trees. Table II shows the statistics of the datasets, including the numbers of vertices and edges, the vertices’ average degree $\widehat{d}$ , the average number of labels in P-trees $\widehat{P}$ , and the average number of labels in the GP-tree.

To evaluate PCS queries, in line with [11], we set the default value of $k$ to 6. For each dataset, we randomly select 100 query vertices from the 6-core. We implement all the algorithms in Java, and run experiments on a machine having an eight-core Intel 3.40GHz processor, and 16GB of memory, with Ubuntu installed.

We consider all four datasets and check the locations of the maximal feasible subtrees of 100 communities in the search space for each dataset. Because the search space may be very large, we group the locations by depth into 5 levels. In this case, level 3 represents the middle of the search space. For example, 43% of the maximal feasible subtrees lie in the middle of the search space in PubMed. This supports the observation that maximal feasible subtrees often lie in the middle of the search space, and explains the motivation for the advanced methods.

5.2 PCS Effectiveness

As mentioned before, the existing CS methods mainly focus on non-attributed graphs. A recent work ACQ [40, 11] investigates CS on attributed graphs. In ACQ, each vertex in the attributed graph is associated with a set of keywords. Communities retrieved by ACQ should satisfy the structure cohesiveness ( $k$ -core constraint) and “keyword cohesiveness” [11, 40], i.e., the number of common keywords shared by all vertices in communities should be maximum. We compare PCS with ACQ. To run ACQ queries, we set each vertex’s attribute as a set of keywords, which are the keywords in its P-tree. In the following, we first present a case study, and then show the quality and diversity of communities.

$\bullet$ A Case Study: We perform a case study on the ACMDL dataset and consider a renowned researcher: Jim Gray. We set $k$ = 4 here. We present two of Jim’s PC’s, i.e., PC1 and PC2, with different research areas in Fig. 7 and Fig. 8. Notice that ACQ only finds one community, PC1, shown in Fig. 7(a). This is because ACQ maximizes the number of shared keywords, so PC2 shown in Fig. 8(a), which has five shared keywords, cannot be returned. In addition, as shown in Fig. 7(b), all shared keywords of PC1 are organized in a tree with few branches, which implies that the semantics of the keywords overlap heavily with each other. In contrast, the shared subtree of PC2 shown in Fig. 8(b) has multiple branches, so the semantics of its keywords are more diverse. Hence, PCS is more effective than ACQ for extracting communities from profiled graphs.

$\bullet$ Community Pairwise Similarity (CPS): We compare PCS with three classic CS methods using “minimum degree” definition: ACQ [11], Global [8] and Local [25]. We use Tree Edit Distance (TED) to compute the similarity between the P-trees of any pair of vertices in community $G_{l}$ . Let $T_{i}$ be the P-tree of the $i$ -th vertex in $G_{l}$ . The CPS is then the average similarity over all pairs of $G_{l}$ ’s vertices, and all communities of $\mathcal{G}$ :

[TABLE]

The CPS( $\mathcal{G}$ ) value ranges from 0 to 1. The higher the value is, the more cohesive the community is. In Fig. 9(a), PCs∗ denotes the communities that only PCS can find, and P-ACs denotes those returned by both PCS and ACQ. P-ACs have the most P-tree nodes (i.e., keywords in the ACQ definition) in common and the fewest vertices. Thus they have the highest CPS values. Note that PCs∗ have CPS values close to those of P-ACs, which implies that these unique PC’s are also of high quality.

$\bullet$ Level-diversity ratio (LDR): To further measure the quality of PC’s, we define a metric, called level-diversity ratio (LDR), to measure the diversity of attributes level by level in the shared subtrees. $F$ denotes the method that we use here to compare with PCS. Given a query vertex $q$ , we use $\mathcal{T}({F,q,j})$ to represent the maximal common P-trees of $j$ -th community returned by the method $F$ . $\cal L$ is the number of levels in P-tree $T(q)$ . $L_{i}(T)$ is the number of unique labels in the $i$ -th level of P-tree $T$ . ${\cal H}$ and ${\cal J}$ denote the numbers of communities returned by the method $F$ and PCS respectively. A lower LDR value implies that the method $F$ is less diverse than PCS.

[TABLE]

Intuitively, LDR reflects the proportion of unique labels in each level. The experimental results are depicted in Fig. 9(b), which shows that communities returned by ACQ cover only 40% to 60% of the labels of PC’s in each level. This implies that PC’s found by PCS have higher diversity than those of ACQ, because PCS focuses on maximizing the common structure of P-trees, rather than the number of common keywords. As a result, all communities with the semantically maximal properties can be found, and the communities are of high diversity.

$\bullet$ Community numbers: Fig. 10(a) reports the average number of communities returned per query by each method. From the results, we can see that PCS finds more communities than the others. This is because only PCS uses profiled graphs and the hierarchical information in P-trees to retrieve communities. Compared with other methods, PCS is able to extract communities with more semantic focuses.

$\bullet$ Community P-tree Frequency (CPF): CPF is inspired by the document frequency measure. Let $fre_{i,j}$ represent the number of vertices in $G_{i}$ whose P-tree contains $T(q)$ ’s $j$ -th P-tree node. We use CPF to compute the occurrence frequency over all nodes in $T(q)$ and all communities in $\mathcal{G}$ :

[TABLE]

Note that CPF( $q$ ) ranges from 0 to 1, and a higher value implies better cohesiveness. As shown in Fig. 9(a), compared with the communities retrieved by both PCS and ACQ, the unique PCs also have a high degree of cohesiveness.
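A frequency measure of this kind can be sketched in code. The exact normalization is an assumption here, chosen so that the value stays in $[0,1]$ as stated: each $fre_{i,j}$ is divided by the number of vertices in $G_{i}$ , and the result is averaged over all nodes of $T(q)$ and all communities. The function name `cpf_sketch` and the argument layout are hypothetical, and the authors' formula may differ in detail.

```python
def cpf_sketch(query_nodes, communities, ptree_of):
    """Average fraction of community members whose P-tree contains each query node.

    query_nodes: labels of T(q).
    communities: list of vertex collections (the communities in G).
    ptree_of:    dict mapping vertex -> set of labels in its P-tree.
    """
    if not query_nodes or not communities:
        return 0.0
    total = 0.0
    for community in communities:
        for node in query_nodes:
            # fre: members of this community whose P-tree contains the node.
            fre = sum(1 for v in community if node in ptree_of[v])
            total += fre / len(community)
    return total / (len(communities) * len(query_nodes))


# Toy example with made-up labels: one community of two vertices.
toy_ptrees = {"a": {"CS", "DB"}, "b": {"CS"}}
toy_value = cpf_sketch(["CS", "DB"], [["a", "b"]], toy_ptrees)
```

In the toy example, `CS` appears in both members and `DB` in one of two, so the value is the average of 1 and 0.5.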

$\bullet$ F1-score: Here we use Facebook ego-networks (http://snap.stanford.edu/) to evaluate the accuracy. We use FBX to denote the X-th network, and each ego-network has several overlapping ground-truth communities, called friendship circles [45]. As shown in Table IV, each vertex has real profiles, such as political, education, etc. Similar to Flickr, we build each P-tree by using a hash function to map the real profiles to CCS subjects. We randomly query 100 vertices in these ground-truth communities and compute the F1-scores (https://en.wikipedia.org/wiki/F1score) of the different methods. The F1-scores of all methods over three networks are shown in Fig. 11. The experimental results show that, compared with other methods, PCS can stably extract communities with high accuracy over three real networks.
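For reference, the F1-score between one returned community and one ground-truth circle is the harmonic mean of precision and recall over their vertex sets. The sketch below shows this standard computation; the function name `f1_score_sets` is hypothetical, and how scores are aggregated over multiple overlapping circles and queries is not specified here and is left out.

```python
def f1_score_sets(found, truth):
    """F1-score between a returned vertex set and a ground-truth vertex set."""
    found, truth = set(found), set(truth)
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)   # fraction of returned vertices that are correct
    recall = overlap / len(truth)      # fraction of ground-truth vertices recovered
    return 2 * precision * recall / (precision + recall)
```

For example, a returned community of four vertices that shares two vertices with a six-vertex circle has precision 1/2 and recall 1/3.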

5.3 Comparison with Other Definition Metrics

In this section, we compare several potential metrics to define the PCS problems. Generally, a good community should be a group of users, which are cohesive in both structures and profiles. To measure structure cohesiveness, we use the minimum degree metric, which is in line with existing works [25, 26, 8, 11, 12]. To measure the profile cohesiveness, we have tried a list of possible metrics, including:

(a) common nodes of P-trees;

(b) common path of P-trees (from the P-tree leaf to the root);

(c) common subtree of P-tree structures;

(d) similarity of vertex P-trees.

We compare these four metrics over two real datasets (ACMDL and PubMed). As shown in Fig. 12, Metric (c) achieves the highest scores over the four indices. We now discuss the reasons for these differences.

In a recent work, ACQ [40], the authors define the vertex attribute as a set of keywords and use the number of “shared keywords” to constrain the communities. Thus, in our PCS problem, it is natural to use the number of common P-tree nodes to measure the profile cohesiveness, and to require the number of common nodes to be the largest. However, as we have analyzed before, this ignores the relationships among the nodes and violates the basic motivation for the PCS problem. Thus Metric (a) is not suitable for the PCS definition.

Metric (b) is defined by the common paths (i.e., paths from the P-tree root to a leaf node) shared by all the vertices in the returned community. Intuitively, we can require the number of common paths to be maximum. This metric still has some inadequacies, as it amounts to maximizing the number of common leaf nodes, which will miss meaningful communities with fewer common leaves. As a result, we think Metric (b) is also not suitable for the PCS problem definition.

Metric (c) focuses on the common subtree of all P-trees. Clearly, a subtree consists of a set of nodes and their hierarchical relationships. Compared with the metrics above, the common subtree of the P-tree structure is more suitable for measuring the profile cohesiveness of a community, as it adequately represents the commonalities of the vertex P-trees.

For Metric (d), inspired by another recent community search work [12], we tried to use the similarity of P-trees to define the problem. That is, given a threshold, the goal is to find all vertices with a budgeted similarity score. However, it is still not suitable for the PCS problem. This is because, when two P-trees are compared by a similarity method, the diversity of these P-trees is regarded as dissimilarity. Thus, based on the above discussion and the experimental results in Fig. 12, we adopt Metric (c) in our PCS problem definition.

5.4 Results of Efficiency Evaluation

In this section, we show the efficiency results of index construction and PCS queries.

1. Index construction. Fig. 13(a)-13(c) show the scalability of the CP-tree index construction method. To evaluate the scalability of the index construction method w.r.t. the dataset size, for each dataset, we randomly select 20%, 40%, 60% and 80% of its vertices to obtain four sub-datasets respectively. As shown in Fig. 13(a), the time cost of index construction is linear in the size of the profiled graph, which confirms our earlier analysis. Furthermore, to evaluate the scalability of the index construction method over different P-tree sizes of vertices and over different fractions of the GP-tree size, we obtain four sub-datasets in a similar way. As shown in Fig. 13(b) and Fig. 13(c), the time cost of index construction is linear in the size of the P-trees and the GP-tree.

2. Query efficiency. We vary the value of $k$ and show the query efficiency of different algorithms in Fig. 14(a)-14(d). The method incre is 100 times faster than the basic method, but slower than the method adv-I. Further, adv-D and adv-P are 10 times faster than incre. The reason is that, compared with incre, the advanced methods narrow the search space by verifying a smaller fraction of subtrees. Also, the efficiency gap in finding an initial cut results in the slightly different performance of the advanced methods. Thus, the index-based methods run fast and adv-P stably scales the best. Note that the three advanced methods perform similarly on Flickr. This is because the initial cuts are in the middle of the search space. Thus they have similar performance even though they search from different directions.

3. Scalability w.r.t. vertex. Fig. 14(e)-14(h) report the scalability over different fractions of vertices. For each dataset, we randomly select 20%, 40%, 60% and 80% of its vertices to obtain four sub-datasets respectively. Note that in this experiment, the vertices’ P-trees are fully considered. As shown in the experimental results, the algorithms run slower as more vertices are involved. Since the above analysis and experimental results show that basic is quite inefficient, we do not include it from here on. As shown in Fig. 14(e)-14(h), incre and adv-I scale similarly, and adv-I is slightly better. This is because after finding a feasible answer, adv-I quickly searches for all answers instead of exploring the whole search space. Not surprisingly, adv-D and adv-P scale the best.

4. Scalability w.r.t. P-tree. Fig. 14(i)-14(l) examine scalability over different fractions of the P-tree of each vertex. For the P-tree of each vertex, we randomly select 20%, 40%, 60% and 80% of its P-tree nodes to generate the corresponding subtree. Here all vertices are considered. As shown in Fig. 14(i)-14(l), adv-I performs better than incre. This is because adv-I avoids exploring the whole search space after finding an initial solution, which accelerates the query process. Also, adv-D and adv-P stably perform the best, and adv-P performs slightly better than adv-D. The reason is that, as we introduced before, adv-P finds initial P-tree cuts by directly verifying a batch of P-tree nodes rather than verifying nodes one by one. Thus adv-P is always faster than adv-D.

5. Scalability w.r.t. GP-tree. We test the effect of the GP-tree size in Fig. 14(m)-14(p). For the GP-tree of each dataset, we randomly select 20%, 40%, 60% and 80% of its P-tree nodes to generate new GP-trees. Here we consider all the vertices and P-trees. As the GP-tree varies, again, adv-I is faster than incre. Moreover, adv-D and adv-P achieve the best performance.

6. Effect of find functions. By varying $k$ from 4 to 8, we compare the three find functions of the advanced methods in Fig. 14(q)-14(t). find-P and find-D are about 10-100 times faster than find-I. This is because, as we explained before, find-P can find the initial cut by directly verifying the leaf nodes of subtrees instead of enumerating nodes from the root node one by one. Note that in Fig. 14(r), adv-D and adv-P have similar performance. The reason is that the efficiency of finding initial cuts depends on the distribution of initial cuts in the search space. Actually, adv-D searches for initial cuts incrementally and adv-P explores the search space by decrementally verifying subtrees. Thus adv-D and adv-P may have similar performance.

6 Conclusions and Future Directions

In this paper, we study the online search of communities that exhibit both semantic and structure cohesiveness on large-scale profiled graphs. Given a vertex $q$ of a profiled graph $G$ , we study the PCS problem, which aims to find profiled communities containing $q$ . We first introduce a basic solution. To further improve the query efficiency, we develop an index and several index-based query algorithms. We evaluate the algorithms on both real and synthetic datasets, and our experimental results demonstrate the effectiveness of PCS and the efficiency of our solutions.

In the future, we will study other structure cohesiveness measures (e.g., $k$ -truss and $k$ -clique) and profile cohesiveness measures in the PCS definition. A potential extension of PCS is to explore how to relax the query condition and optimize our solutions. For instance, we can stipulate that each vertex of the targeted community has a semantic similarity with the query vertex $q$ of at least $\beta$ ( $\beta\geq 0$ ), where $\beta$ is a predefined threshold. Alternatively, we can relax the structure cohesiveness (e.g., require that the proportion of vertices in a community having degrees of at least $k$ is at least $\delta$ , where the parameter $\delta>0$ ). Another useful question is how the directions of edges affect the formation of a PC. For example, D-core [14], a concept extended from $k$ -core for directed graphs, can be utilized to measure the structure cohesiveness, and algorithms similar to those of PCS can be developed for it. In practice, directed edges are common in many real social networks. In addition, we will also investigate other problems on profiled graphs, such as the family of community search problems and labeled graph search problems. We will study how other graph pattern matching techniques can be extended to find PCs on profiled graphs and how to automatically generate a meaningful graph pattern that well reflects the ground-truth communities.

Acknowledgment

Reynold Cheng, Yixiang Fang, and Yankai Chen were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 106150091, 17229116, 17205115) and the University of Hong Kong (Projects 104004572, 102009508, 104004129). Xiaojun Chen was supported by NSFC under Grant no. 61773268. Jie Zhang was supported by the MOE AcRF Tier 1 funding (M4011894.020) and the Telenor-NTU Joint R&D funding. We thank the editors and reviewers for their insightful comments.

Bibliography

[1] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, “Finding top-k min-cost connected trees in databases,” in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. IEEE, 2007, pp. 836–845.
[2] Y. Fang, H. Zhang, Y. Ye, and X. Li, “Detecting hot topics from twitter: A multiview approach,” Journal of Information Science, vol. 40, no. 5, pp. 578–593, 2014.
[3] H. He, H. Wang, J. Yang, and P. S. Yu, “Blinks: ranked keyword searches on graphs,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007, pp. 305–316.
[4] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, “Bidirectional expansion for keyword search on graph databases,” in Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 2005, pp. 505–516.
[5] M. Kargar and A. An, “Keyword search in graphs: Finding r-cliques,” Proceedings of the VLDB Endowment, vol. 4, no. 10, pp. 681–692, 2011.
[6] Z. Xu, Y. Ke, Y. Wang, H. Cheng, and J. Cheng, “A model-based approach to attributed graph clustering,” in Proceedings of the 2012 ACM SIGMOD international conference on management of data. ACM, 2012, pp. 505–516.
[7] J. X. Yu, L. Qin, and L. Chang, “Keyword search in databases,” Synthesis Lectures on Data Management, vol. 1, no. 1, pp. 1–155, 2009.
[8] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in KDD, 2010, pp. 939–948.
