From Community Detection to Community Profiling

Hongyun Cai, Vincent W. Zheng, Fanwei Zhu, Kevin Chen-Chuan Chang, Zi Huang

arXiv:1701.04528·cs.SI·January 18, 2017


TL;DR

This paper introduces a novel community profiling framework that characterizes communities by internal content and external diffusion profiles, addressing key challenges with a scalable joint model and demonstrating superior performance on large datasets.

Contribution

It formalizes the concept of community profiling, proposes a joint model for profiling and detection, and develops a scalable inference algorithm for large-scale data.

Findings

01. CPD outperforms state-of-the-art baselines in community profiling tasks.

02. The inference algorithm scales linearly with data size and is parallelizable.

03. Community profiles effectively capture both internal content and external diffusion characteristics.


Tables

Table 1: Comparison with the related work.

Columns: Data (text, attribute, node feature, friend. link, diff. link); Diffusion factors (individual, community, topic); Tasks (topic extract., community detect., diffusion pred., community profile).

Methods compared (with the non-mark cell entry in parentheses where one is given): MaxFlow [11], SN-LDA [32], CODICIL [30], SocialCircle [26], CESNA [41], BAGC [40], SA-Cluster [42], MetaFac [23] (user actions), PMM [35] (interaction), TURCM [31] (interactions), GenClus [33], GF [1] (user-item pairs), CFF [27] (user-item pairs), Influlearner [9], LADP [22], TopicInfluence [24], INFEST [25], topcgo [10], BlackHole [21], HAM [16], PMTLM [43], WTM [37] (user features), CRM [15], COLD [17], CPD (ours).

Table 2: Notations

Notations	Description
$|U|$, $|W|$	The number of users and words
$|C|$, $|Z|$	The number of communities and topics
$|F|$, $|E|$	The number of friendship links and diffusion links
$d_{ui}$	The $i$-th document published by user $u$
$D_{u}$	The set of documents published by user $u$
$W_{ui}$	The set of words in document $d_{ui}$
$w_{uik}$	The $k$-th word in document $d_{ui}$
$c_{ui}$, $z_{ui}$	The community assignment and topic assignment for $d_{ui}$
$E_{ij}^{t}$	A diffusion link from document $i$ to document $j$ at time $t$
$F_{uv}$	A friendship link from user $u$ to user $v$
$\boldsymbol{\pi}_{u}$	Multinomial distribution over communities specific to user $u$
$\boldsymbol{\theta}_{c}$	Multinomial distribution over topics specific to community $c$
$\boldsymbol{\phi}_{z}$	Multinomial distribution over words specific to topic $z$
$\eta_{c,c^{\prime}z}$	Probability of community $c$ diffusing community $c^{\prime}$ on topic $z$
$\boldsymbol{\nu}$	The parameters for modeling individual diffusion preference
$\alpha$, $\beta$, $\rho$	Dirichlet priors

Table 3: Data set statistics.

	#(user)	#(friend. link)	#(diff. link)	#(doc.)	#(word)
Twitter	137,325	3,589,811	992,522	39,952,379	2,316,020
DBLP	916,907	3,063,186	10,210,652	4,121,213	330,334

Table 4: Differences with baselines.

	Data			Diffusion factors			Tasks
Methods	text	friend links	diff links	individual	comm	topic	topic extract	comm detect	diff pred	comm profile
PMTLM [43]	$∙$		$∙$			$∙$	$∙$		$∙$
WTM [37]	$∙$	$∙$	$∙$	$∙$					$∙$
CRM [15]		$∙$	$∙$	$∙$	$∙$			$∙$	$∙$
COLD [17]	$∙$		$∙$		$∙$		$∙$	$∙$	$∙$
Ours	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$

Table 5: Top four words in each topic.

Topic	Word Distribution (listed by “word:probability”)
$T_{22}$	network:0.059, wireless:0.050, sensor:0.046, routing:0.038
$T_{49}$	network:0.042, performance:0.037, traffic:0.031, routing:0.028
$T_{47}$	service:0.056, web:0.028, mobile:0.025, management:0.024
$T_{8}$	security:0.031, key:0.028, authentication:0.027, protocol:0.020
$T_{9}$	code:0.061, algorithm:0.032, function:0.028, linear:0.027
$T_{0}$	design:0.049, circuit:0.034, power:0.027, cmos:0.017
$T_{44}$	parallel:0.053, performance:0.036, memory:0.03, architecture:0.02
$T_{46}$	analysis:0.061, reliability:0.029, optical:0.024, design:0.021

Table 6: Top three communities ranked for query “router”.

K	AP@K	AR@K	AF@K	Topic Distribution
1	0.919	0.327	0.483	$T_{22}$ :0.976, $T_{49}$ :0.013, $T_{47}$ :0.006
2	0.900	0.424	0.576	$T_{8}$ :0.988, $T_{22}$ :0.004, $T_{9}$ :0.003
3	0.891	0.528	0.663	$T_{0}$ :0.977, $T_{44}$ :0.008, $T_{46}$ :0.005

Equations

\max\prod_{u}p(\text{content}|u)=\prod_{u}\sum_{c}p(\text{content}|c)\,p(c|u),

\begin{array}{l}
\max\prod_{(u,v)}p(\text{diffusion}|\text{content},u,v)\\
=\prod_{(u,v)}\sum_{c}\sum_{c^{\prime}}p(\text{diffusion}|\text{content},c,c^{\prime})\,p(c|u)\,p(c^{\prime}|v),
\end{array}

P(F_{uv}=1)=\sigma(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v}),

p(s=1,z|u,v)\overset{2}{\propto}\sum_{c}\sum_{c^{\prime}}\eta_{c,c^{\prime}z}\,\hat{\theta}_{c,z}\,\hat{\pi}_{u,c}\,\hat{\theta}_{c^{\prime},z}\,\hat{\pi}_{v,c^{\prime}},

p(E_{ij}^{t}=1|u,v,z,t)=\sigma(\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv}).

\begin{array}{l}
p(W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}}|\rho,\alpha,\beta)\\
=p(C|\rho)\,p(Z|C,\alpha)\,p(W|Z,\beta)\,p(F|C)\,p(E|C,\bar{\boldsymbol{\eta}},Z,\boldsymbol{\nu},\mathbf{f}),
\end{array}

x=\frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{g_{k}}{(k-1/2)^{2}+b^{2}/(4\pi^{2})},

\frac{1}{1+e^{-w}}=\frac{1}{2}\int_{0}^{\infty}\psi(w,x)\,p(x|1,0)\,dx,

p(F_{uv}=1,\lambda_{uv})=\frac{1}{2}\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv})\,p(\lambda_{uv}|1,0).

p(E_{ij}^{t}=1,\delta_{ij})=\frac{1}{2}\psi(\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij})\,p(\delta_{ij}|1,0).

p(F,\lambda)

p(E,\delta)

\begin{array}{l}
p(W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta|\rho,\alpha,\beta)\\
=p(C|\rho)\,p(Z|C,\alpha)\,p(W|Z,\beta)\,p(F,\lambda|C)\,p(E,\delta|C,\bar{\boldsymbol{\eta}},Z,\boldsymbol{\nu},\mathbf{f})\\
=\int_{\pi}P(C|\pi)P(\pi|\rho)\,d\pi\cdot\int_{\theta}p(Z|C,\theta)P(\theta|\alpha)\,d\theta\cdot\int_{\phi}P(W|Z,\phi)P(\phi|\beta)\,d\phi\cdot p(F,\lambda)\cdot p(E,\delta)\\
=\prod_{u=1}^{|U|}\frac{\Delta(n_{u}^{c}+\rho)}{\Delta(\rho)}\cdot\prod_{c=1}^{|C|}\frac{\Delta(n_{c}^{z}+\alpha)}{\Delta(\alpha)}\cdot\prod_{z=1}^{|Z|}\frac{\Delta(n_{z}^{w}+\beta)}{\Delta(\beta)}\cdot p(F,\lambda)\cdot p(E,\delta),
\end{array}

\begin{array}{l}
p(z_{ui}=z|C,Z_{\neg{ui}},W,F,E,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta)\\
=\frac{p(W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta|\rho,\alpha,\beta)}{p(W,F,E,c_{ui}=c,C_{\neg{ui}},Z_{\neg{ui}},\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta|\rho,\alpha,\beta)}\\
\propto\prod_{c=1}^{|C|}\frac{\Delta(n_{c}^{z}+\alpha)}{\Delta(n_{c,\neg{ui}}^{z}+\alpha)}\cdot\prod_{z=1}^{|Z|}\frac{\Delta(n_{z}^{w}+\beta)}{\Delta(n_{z,\neg{ui}}^{w}+\beta)}\cdot p(F|\lambda,C_{\neg{ui}})\cdot p(E|\delta,C_{\neg{ui}},Z_{\neg{ui}})\\
=\frac{n_{c,\neg{ui}}^{z}+\alpha}{n_{c,\neg{ui}}^{(\cdot)}+|Z|\alpha}\cdot\frac{\prod_{w=1}^{|W|}\prod_{i=1}^{n_{ui}^{w}}(n_{z,\neg{ui}}^{w}+\beta+i-1)}{\prod_{j=1}^{n_{ui}^{(\cdot)}}(n_{z,\neg{ui}}^{(\cdot)}+|W|\beta+j-1)}\cdot\prod_{v\in\Lambda_{u}}\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv}|C_{\neg{ui}})\\
\quad\cdot\prod_{j\in\Lambda_{i}}\psi(\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij}|C_{\neg{ui}},Z_{\neg{ui}}),
\end{array}

\begin{array}{l}
p(c_{ui}=c|C_{\neg{ui}},Z,W,F,E,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta)\\
=\frac{p(W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta|\rho,\alpha,\beta)}{p(C_{\neg{ui}},z_{ui}=z,Z_{\neg{ui}},W,F,E,\mathbf{f},\boldsymbol{\nu},\bar{\boldsymbol{\eta}},\lambda,\delta|\rho,\alpha,\beta)}\\
\propto\prod_{u=1}^{|U|}\frac{\Delta(n_{u}^{c}+\rho)}{\Delta(n_{u,\neg{ui}}^{c}+\rho)}\prod_{c=1}^{|C|}\frac{\Delta(n_{c}^{z}+\alpha)}{\Delta(n_{c,\neg{ui}}^{z}+\alpha)}\cdot p(F|\lambda,C_{\neg{ui}})\cdot p(E|\delta,C_{\neg{ui}},Z_{\neg{ui}})\\
=\frac{n_{u,\neg{ui}}^{c}+\rho}{n_{u,\neg{ui}}^{(\cdot)}+|C|\rho}\cdot\frac{n_{c,\neg{ui}}^{z}+\alpha}{n_{c,\neg{ui}}^{(\cdot)}+|Z|\alpha}\cdot\prod_{v\in\Lambda_{u}}\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv}|C_{\neg{ui}})\\
\quad\cdot\prod_{j\in\Lambda_{i}}\psi(\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij}|C_{\neg{ui}},Z_{\neg{ui}}),
\end{array}

\begin{array}{l}
p(\lambda_{uv}|W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\boldsymbol{\eta},\boldsymbol{\delta})\\
\propto e^{\frac{-\lambda_{uv}(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v})^{2}}{2}}\,p(\lambda_{uv}|1,0)=PG(1,\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v}).
\end{array}

\begin{array}{l}
p(\delta_{ij}|W,F,E,C,Z,\mathbf{f},\boldsymbol{\nu},\boldsymbol{\eta},\boldsymbol{\delta})\\
\propto e^{\frac{-\delta_{ij}(\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv})^{2}}{2}}\,p(\delta_{ij}|1,0)\\
=PG(1,\bar{\boldsymbol{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv}).
\end{array}

\max\sum_{i=1}^{M}o_{i}x_{i},\quad\text{s.t.}\ \sum_{i=1}^{M}o_{i}x_{i}\leq\frac{1}{M}O,

\begin{array}{l}
p(E_{ij}^{t}=1|u,v,d_{vj},t)\overset{1}{=}\sum_{z}p(E_{ij}^{t}=1|u,v,z,t)\,p(z|d_{vj})\\
\overset{2}{=}\sum_{z}\sigma\Big(\sum_{c}\sum_{c^{\prime}}\pi_{u,c}\,\theta_{c,z}\,\eta_{c,c^{\prime}z}\,\pi_{v,c^{\prime}}\,\theta_{c^{\prime},z}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv}\Big)\,p(z|d_{vj}).
\end{array}

p(s=1|c,q)\overset{1}{=}\sum_{z}\sum_{c^{\prime}}p(s=1|c,c^{\prime},z)\,p(z|q,c^{\prime})\,p(c^{\prime}|q)

Full text

From Community Detection to Community Profiling

Hongyun Cai $^{\dagger\ddagger}$, Vincent W. Zheng $^{\dagger}$, Fanwei Zhu $^{\#}$, Kevin Chen-Chuan Chang $^{\diamond}$, Zi Huang $^{\ddagger}$

$^{\dagger}$ Advanced Digital Sciences Center, Singapore

$^{\ddagger}$ School of ITEE, The University of Queensland, Australia

$^{\#}$ Zhejiang University City College, China

$^{\diamond}$ University of Illinois at Urbana-Champaign, USA

{hongyun.c, vincent.zheng}@adsc.com.sg, [email protected], [email protected], [email protected]

(1 August 2016)

Abstract

Most existing community-related studies focus on detection, which aims to find the community membership of each user from user friendship links. However, membership alone, without a complete profile of what a community is and how it interacts with other communities, has limited applications. This motivates us to profile communities systematically and thereby develop useful community-level applications. In this paper, we formalize the concept of community profiling for the first time. With rich user information on the network, such as user published content and user diffusion links, we characterize a community in terms of both its internal content profile and its external diffusion profile. The difficulty of community profiling is often underestimated. We identify three unique challenges and propose a joint Community Profiling and Detection (CPD) model to address them. We also contribute a scalable inference algorithm, which scales linearly with the data size and is easily parallelizable. We evaluate CPD on large-scale real-world data sets and show that it is significantly better than the state-of-the-art baselines in various tasks.

1 Introduction

Thanks to the pioneering studies on community detection [19, 38], we have been able to model a community in terms of its member users. Such community membership helps us better understand the network structure. However, membership alone, without knowing what a community is and how it interacts with others, has only limited applications; e.g., we cannot rank communities by desired characteristics, exploit inter-community diffusions, or visualize communities and their interactions. Given this critical lack of community “understanding”, this paper proposes systematic community profiling, which characterizes the intrinsic nature and extrinsic behavior of a community and thereby enables useful community-level applications. As social networks capture more and richer user information, it is now feasible to profile communities. E.g., beyond the traditional friendship links that connect users on a social network, there are also users’ attributes, published content, diffused content and so on. We can leverage such rich user data to estimate the community profiles.

In this paper, we formalize the concept of “community profile” for the first time. We ask two fundamental questions:

$\bullet$ What is a community profile?

By name, the profile should characterize a community, both internally (i.e., what it is) and externally (i.e., how it interacts with others). Since a community is an aggregation of users, its profile is essentially an aggregation of user information. Let $X$ denote some type of user information. To accommodate uncertainty in $X$, we define an internal profile as the probabilities of “community-$X$”, and an external profile as the probabilities of “community-community-$X$”. Here we focus on $X$ as content, which is the primary user information in many social networks. E.g., in Twitter, users write tweets and retweet others; in DBLP, authors publish papers and cite papers by others. We call the probabilities of “community-content” the content profile (i.e., what a community is about), and those of “community-community-content” the diffusion profile (i.e., how a community diffuses certain content with another). Other types of $X$ may exist in different networks, e.g., attributes in Facebook. Thus, “community profile” is a flexible concept. We leave other types of $X$ as future work.

$\bullet$ Is a community profile good?

Due to network homophily, users in the same community tend to behave similarly. Thus, a community’s profile should explain the common behaviors of its users. In other words, we do not regard just any “community-content” distribution as a good content profile; only one that explains well the observed user content, as generated by the communities, is a “good” one. Analogously, only a “community-community-content” distribution that explains well the observed user-to-user content diffusion, as generated by the communities, is a “good” diffusion profile. This quality criterion will later guide us in estimating the profiles accurately. It is also key to differentiating our work from others: a prior attempt simply aggregates user information to output community properties (mostly internal ones) [14], but it does not require such properties to best explain the observed user behaviors as generated by the communities through them.

We consider “community profiling” a new problem for three reasons. Firstly, community profiling is different from community detection, because detection focuses on getting the community membership of each user, whereas profiling focuses on getting the “community-content” and “community-community-content” probabilities. Secondly, a community profile has never been defined. Some recent work exploits rich user information [15, 30, 32, 33, 40, 41, 42] to improve community detection and outputs some aggregation of user information as a by-product. But it neither defines a community profile both internally and externally, nor tries to identify the new applications that community profiles enable. Finally, the difficulty of community profiling is often underestimated. As we shall discuss soon, due to the inter-dependency with community detection, the heterogeneity of social observations and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has identified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer content profile and diffusion profile for each community, and ultimately enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community $c_{1}$ tends to publish topics $z_{1}$ and $z_{2}$ ) and a diffusion profile (e.g., $c_{1}$ tends to diffuse itself and $c_{2}$ on $z_{1}$ ). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):

$\bullet$ Community-aware diffusion.

As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at the community level, rather than at the individual level [9, 22, 25]. E.g., we can explain that a retweet happens because one user’s communities often retweet the other’s on a certain topic. We acknowledge that diffusion is a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging. We cannot attribute all diffusions to community profiles; instead, we have to model the different factors in order to accurately estimate the profiles and the community-aware diffusion.

$\bullet$ Profile-driven community ranking.

We often need to target audiences when disseminating information in a network. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers about its grant theme, “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from traditional community recommendation, which often relies only on “community-X” properties and is unaware of diffusion [7, 14].

$\bullet$ Profile-driven community visualization.

Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct content (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [8, 23].
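As a rough illustration of profile-driven community ranking, the sketch below scores each community $c$ for a query $q$ with the marginalisation listed among the paper's equations, $p(s=1|c,q)=\sum_{z}\sum_{c^{\prime}}p(s=1|c,c^{\prime},z)\,p(z|q,c^{\prime})\,p(c^{\prime}|q)$. The function name `rank_communities`, the array layouts and the toy numbers are assumptions made for this example only; how the three input distributions are estimated is not shown.

```python
import numpy as np

def rank_communities(diffuse, topic_given_query, comm_given_query):
    """Rank communities by sum_{z,c'} p(s=1|c,c',z) p(z|q,c') p(c'|q).

    diffuse[c, c2, z]        ~ p(s=1 | c, c2, z)   (hypothetical layout)
    topic_given_query[c2, z] ~ p(z | q, c2)
    comm_given_query[c2]     ~ p(c2 | q)
    """
    scores = np.einsum("cdz,dz,d->c", diffuse, topic_given_query, comm_given_query)
    order = np.argsort(-scores)  # best-scoring community first
    return order, scores

# Toy example: 2 communities, 2 topics; the query is entirely about topic 0
# and is associated only with community 0.
toy_diffuse = np.zeros((2, 2, 2))
toy_diffuse[0, 0, 0] = 0.2  # community 0 rarely diffuses community 0 on topic 0
toy_diffuse[1, 0, 0] = 0.9  # community 1 often diffuses community 0 on topic 0
toy_order, toy_scores = rank_communities(
    toy_diffuse,
    np.array([[1.0, 0.0], [1.0, 0.0]]),
    np.array([1.0, 0.0]),
)
```

In this toy setting community 1 is ranked first, because it is the one most likely to diffuse content from the community that the query is associated with.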

We make two remarks about the above applications:

1) we complete one task of community profiling to support multiple applications at a time, so community profiling is done only once, offline; 2) we build an interactive system (http://sociallens.adsc.com.sg/) for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion [4].

The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:

$\bullet$ Inter-dependency with community detection.

A straightforward approach of community profiling is to first detect communities and then aggregate each community’s user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take content profile as an example. Denote a user as $u$ and a community as $c$ . For simplicity, we denote $c$ ’s content profile as $p(\text{content}|c)$ and the likelihood of $u$ ’s content as $p(\text{content}|u)$ . To best explain the user content as generated by the communities through their content profiles, we effectively solve

$$\max\prod_{u}p(\text{content}|u)=\prod_{u}\sum_{c}p(\text{content}|c)\,p(c|u),\qquad(1)$$

where $p(c|u)$ is the probability of user $u$ being assigned to community $c$. Ideally, to optimize Eq. 1, we should optimize both the profiles $p(\text{content}|c)$ and the community assignments $p(c|u)$. But in the straightforward approach, the detection first fixes the $p(c|u)$’s, so the best result the aggregation can return is the $p(\text{content}|c)$’s that maximize Eq. 1 for those fixed assignments. Clearly, the maximal likelihood we get with fixed $p(c|u)$’s is suboptimal, unless the $p(c|u)$’s are “perfect”. A perfect detection of the $p(c|u)$’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles $p(\text{content}|c)$. In all, content profiles and community detection are coupled. Similarly, we can show that the diffusion profile and community detection are also coupled. Denote one more user as $v$, and one more community as $c^{\prime}$. For simplicity, we denote $c$’s diffusion profile as the $p(\text{diffusion}|\text{content},c,c^{\prime})$’s, each being the probability of a diffusion between $c$ and $c^{\prime}$ about some content. Then, to best explain the user-to-user diffusion as generated by the communities through their diffusion profiles, we effectively solve

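$$\max\prod_{(u,v)}p(\text{diffusion}|\text{content},u,v)=\prod_{(u,v)}\sum_{c}\sum_{c^{\prime}}p(\text{diffusion}|\text{content},c,c^{\prime})\,p(c|u)\,p(c^{\prime}|v),\qquad(2)$$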

where the product over $(u,v)$ is taken over all the user pairs having a diffusion link. To optimize the likelihood in Eq. 2, we shall optimize the diffusion profile $p(\text{diffusion}|\text{content},c,c^{\prime})$ ’s, as well as the community assignments $p(c|u)$ ’s and $p(c^{\prime}|v)$ ’s.

$\bullet$ Heterogeneity of social observations.

Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [16, 19]. But in diffusion, the “weak ties” theory recognizes that the inter-community interactions may not be weak [12]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning.” This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [32, 35], so how to model heterogeneous user links together remains unclear.

$\bullet$ Nonconformity of user behaviors.

User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, so we have to consider other factors as well. E.g., some diffusion happens because its topic (e.g., a presidential election) is popular at the moment or because its author (e.g., Lady Gaga) is favored as a celebrity. Such topic popularity and user preference are the other two typical nonconformity factors for diffusion, and we must accommodate them. No prior work has explored both the community factor and the nonconformity factors [17, 25], and it is not clear how to balance them in diffusion.
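The first challenge, the coupling between profiles and detection in Eq. 1, can be illustrated with a small sketch. It makes several simplifying assumptions that are not the paper's model: memberships are hard (each $p(c|u)$ is one-hot), a content profile is a word distribution rather than a topic distribution, and the "detection" result is a hand-picked initial assignment. Under these assumptions Eq. 1 reduces to $\prod_{u}p(\text{content}_{u}|c_{u})$, and alternating between aggregating profiles and reassigning users can only keep or improve the likelihood obtained by the detect-then-aggregate approach.

```python
import numpy as np

def aggregate_profiles(counts, assign, n_comms):
    """Content profile per community: normalised sum of its members' word counts."""
    totals = np.zeros((n_comms, counts.shape[1]))
    np.add.at(totals, assign, counts)
    totals += 1e-12  # numerical guard for empty communities
    return totals / totals.sum(axis=1, keepdims=True)

def user_loglik(counts, profiles):
    """log p(content_u | c) for every user u and community c (up to a constant)."""
    return counts @ np.log(profiles + 1e-12).T

def total_loglik(counts, assign, profiles):
    return user_loglik(counts, profiles)[np.arange(len(assign)), assign].sum()

def joint_profile_and_detect(counts, assign, n_comms, n_iters=20):
    """Alternate profile aggregation and user reassignment until nothing changes."""
    assign = assign.copy()
    for _ in range(n_iters):
        profiles = aggregate_profiles(counts, assign, n_comms)
        new_assign = user_loglik(counts, profiles).argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, aggregate_profiles(counts, assign, n_comms)

# Toy data (hypothetical): 4 users, 2 words; users 0-1 favour word 0, users 2-3 word 1.
counts = np.array([[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0]])
detected = np.array([0, 0, 0, 1])  # an imperfect detection result

# Two-stage: keep the detected memberships fixed, then aggregate.
ll_two_stage = total_loglik(counts, detected, aggregate_profiles(counts, detected, 2))
# Joint: refine memberships and profiles together, starting from the same detection.
joint_assign, joint_profiles = joint_profile_and_detect(counts, detected, 2)
ll_joint = total_loglik(counts, joint_assign, joint_profiles)
```

On this toy input the joint procedure moves user 2 into the second community and reaches a higher log-likelihood than the two-stage result, which is the effect the paragraph on inter-dependency describes.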

Our technical novelty is identifying the above challenges and developing a unified Community Profiling and Detection (CPD) model (Sect. 3) to address them accordingly.

$\bullet$ To model the inter-dependency with community detection, we propose to take a novel profile-aware generative approach– we realize the detection by latent membership variables and the profiling by latent community profile variables, which together generate the user friendship links, user content and user diffusion links in the network. Then we infer these latent variables by maximizing the likelihood. None of the existing work has taken a profile-aware generative approach– they may use a generative model for community detection [26, 31, 33], but they never consider internal and external profiles together with detection.

$\bullet$ To address the heterogeneity of social observations, we propose to separate the generation of friendship links from latent community assignments and the generation of diffusion links from latent profiles. In particular, we require that two users are more likely to share a friendship link if they have similar community assignments. Thus maximizing the likelihood of observing the friendship links enforces intra-community friendship links to be denser than inter-community ones. In contrast, we use the community diffusion profiles to generate the diffusion links, but we do not require inter-community diffusion strengths to be always smaller than intra-community ones; instead, the diffusion profiles are freely learned in maximizing the likelihood of the diffusion link observations.

$\bullet$ To accommodate the nonconformity of user behaviors, we propose to define the generative probability of observing a diffusion link as a logistic function over multiple factors, including the topic-aware community diffusion profiles, the time-sensitive topic popularities and the individual user preferences. By maximizing the likelihood of diffusion link observations, we learn the diffusion profiles, as well as the weights to combine these different factors.
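The two link-generation ideas above can be sketched in a few lines, following the forms listed among the paper's equations: $P(F_{uv}=1)=\sigma(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v})$ for friendship links, and a logistic function of a community factor, a topic popularity term $n_{z}^{t}$ and an individual preference term $\boldsymbol{\nu}^{T}\mathbf{f}_{uv}$ for diffusion links. The function names, argument layouts and the way the features `f_uv` are built are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def friendship_prob(pi_u, pi_v):
    """sigmoid(pi_u . pi_v): users with similar memberships are likelier to be linked."""
    return sigmoid(pi_u @ pi_v)

def diffusion_prob(pi_u, pi_v, theta, eta, z, topic_pop, nu, f_uv):
    """sigmoid(community factor + topic popularity + individual preference).

    theta[c, z]   : content profile of community c (hypothetical layout)
    eta[c, c2, z] : diffusion profile entry for c diffusing c2 on topic z
    topic_pop     : popularity of topic z at the time of the link (n_z^t)
    nu, f_uv      : weights and features for the individual preference term
    """
    # Community factor: sum_{c,c2} pi_u[c] theta[c,z] eta[c,c2,z] pi_v[c2] theta[c2,z]
    weight_u = pi_u * theta[:, z]
    weight_v = pi_v * theta[:, z]
    community = weight_u @ eta[:, :, z] @ weight_v
    return sigmoid(community + topic_pop + nu @ f_uv)
```

Note the asymmetry that the text calls for: the friendship probability depends only on how similar the two membership vectors are, while the diffusion probability places no constraint on `eta`, so a cross-community entry may be larger than a within-community one.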

Finally, we design a scalable inference algorithm for CPD (Sect. 4). As shown later, our inference algorithm scales linearly with the data set size. We further parallelize our inference algorithm, taking the data skewness into account.

We summarize our contributions as follows:

$\bullet$ We identify a new problem of community profiling, which together with detection enables a holistic modeling of communities.

$\bullet$ We identify three unique challenges and design a novel CPD model for joint community profiling and detection.

$\bullet$ We develop a scalable inference algorithm for CPD, and we further parallelize it by taking the data skewness into account.

$\bullet$ We perform extensive experiments to evaluate CPD over large-scale data sets, and show both its effectiveness and scalability.

2 Related Work

In this section, we review the related work on community detection and relevant applications, and distinguish the differences between existing work and our community profiling model. We further organize such differences in Table 4.


Community Detection. Detecting communities from various networks has been extensively studied in the last decade. There exist comprehensive surveys [39, 19, 36] on community detection, which review different community detection methods in terms of detection algorithms, quality measures, benchmarks and so on.

Conventionally, a community is defined as a group of nodes in which intra-group connections are much denser than inter-group ones [11, 38]. The pioneering community detection studies aim to generate the community membership of each node purely based on the links amongst them [19, 38]. The prevalence of social networks offers a rich collection of user links to use for community detection, such as the followership in Twitter [32], Flickr [30] and Facebook/Google+ [26, 41], the co-authorship in DBLP [40, 42], and email exchange [32]. However, most of these existing works consider only a single type of link. There are other types of user links; e.g., users comment on/reply to other users in Digg [23], contact/co-contact/co-subscribe to other users in YouTube [35], and follow/reply to/retweet other users in Twitter [31]. But these different links were often modeled in the same way. As far as we know, none of the existing community work considers the heterogeneity among user links (i.e., friendship links and diffusion links) as we do.

Recent studies have started to exploit rich user information, such as content [32], attributes [33, 40] and actions [23, 31], to improve the detection. Consequently, in addition to community membership, they occasionally also output some “community-$X$” associations, such as “community-content” [32], “community-attribute” [33, 40] and “community-action” [23, 31]. In our work, we simultaneously discover communities and characterize them with both internal and external profiles. Although some forms of internal community profiles may be obtained in prior work ([23, 31, 32, 33, 40]) as by-products, the external profiles are largely overlooked.

There are some recent studies on aggregating each community’s user preferences as some form of community profiles, so as to enable item recommendation to each community. Their work is different from ours in two aspects. On one hand, most of these community recommendation studies are given the communities as input [14, 29]. Even though some of them did try to detect communities [1, 27], their definition of a community is a group of users who share similar preferences to a recommended item, which is not based on network links at all. In contrast, our community is a group of densely connected users, who share similar interests and diffusion behaviours. On the other hand, their community profile is obtained by aggregating the users’ preferences, which is usually based on a least misery or aggregate voting approach. In contrast, we formalize the community profiles as the probabilities of “community- $X$ ” and probabilities of “community-community- $X$ ”. Besides, we estimate these community profiles by a generative model together with community detection.

Community-aware Applications. The community profiles deepen our understanding of the detected communities and thus benefit many community-level applications. Here we review the work related to our three example applications: community ranking, community diffusion and community visualization. Firstly, for community ranking, most existing studies [14, 7] rank communities based on users’ interests in them, i.e., to find the favourite communities for users. Moreover, the communities to be ranked are often predefined over the networks. In our work, the communities are not provided as input, and our focus is to rank communities by both their internal content profiles and external diffusion profiles together. This helps a company or author choose the most promising community in which to promote their products or papers. Secondly, for community diffusion, in contrast to our community-level diffusion modelling, most diffusion models are at the individual level [9, 22, 25]. Recently, some studies have considered diffusion at the community level, but either the communities are predefined [10] or the topic-awareness is overlooked [15, 17]. Besides, unlike our joint modeling of various diffusion factors, the individual factor is missing in [17] and the topic popularity factor is missing in [15]. Last but not least, for community visualization, although a lot of effort has been devoted to community detection, only a few studies further visualize the results to facilitate deep analysis and semantic interpretation. In [8], the authors propose a community detection and visualization model, which differentiates the inner nodes and the border nodes for visualizing the interactions between communities. While their objective is to design a layout algorithm for clearly displaying the communities and their interactions, we focus on demonstrating the topic-aware user interaction strengths among the communities.

3 Joint Profiling and Detection

In the following, we first define some key notions; then we formulate the joint community profiling and detection problem. Table 2 summarizes the notations used in this paper.

Definition 1

A social graph is $\mathcal{G}=(U,$ $D,F,E)$ , where $u\in U$ is a user and $d\in D$ is a user published document. There are two types of links in $\mathcal{G}$ . $F_{uv}\in F$ is a friendship link from user $u$ to user $v$ ; $E_{ij}\in E$ is a diffusion link from document $i$ to document $j$ . Both types of links are directed.

For a Twitter network, $D_{u}\subset D$ is the set of tweets posted by user $u$ ; $F_{uv}$ represents that user $u$ follows user $v$ ; $E_{ij}$ represents that tweet $i$ is a retweet of tweet $j$ . For a DBLP network, $D_{u}$ is the set of papers published by author $u$ ; $F_{uv}$ represents that author $u$ co-authors with author $v$ ; $E_{ij}$ represents that paper $i$ cites paper $j$ .

To enable content modeling, we first define topic.

Definition 2

A topic $z\in\{1,\dots,|Z|\}$ is a $|W|$ -dimensional multinomial distribution $\boldsymbol{\phi}_{z}$ over words, where each dimension $\phi_{z,w}$ is the probability of a word $w\in\{1,\dots,|W|\}$ belonging to $z$ .

Then, we define the community membership, as well as our community content profile and diffusion profile.

Definition 3

A user $u$ ’s community membership is a $|C|$ -dimensional multinomial distribution $\boldsymbol{\pi}_{u}$ , where each dimension $\pi_{u,c}$ is the probability of $u$ belonging to community $c$ , $\forall c\in\{1,\dots,|C|\}$ .

Definition 4

The content profile of community $c$ is a $|Z|$ -dimensional multinomial distribution $\boldsymbol{\theta}_{c}$ over topics, where each dimension $\theta_{c,z}$ is the probability of $c$ discussing topic $z$ .

Definition 5

The diffusion profile of community $c$ is a $|C|\times|Z|$ -dimensional matrix $\boldsymbol{\eta}_{c}$ , where each entry $\eta_{c,c^{\prime}z}$ is the probability of $c$ diffusing another community $c^{\prime}$ on topic $z$ .

Take community $c_{1}$ in Fig. 1 as an example. As $c_{1}$ ’s users publish more content on $z_{1}$ and $z_{2}$ , the resulting $\theta_{c_{1},z_{1}}$ and $\theta_{c_{1},z_{2}}$ are bigger. Besides, as $c_{1}$ ’s users often retweet/cite themselves on $z_{1}$ , the resulting $\eta_{c_{1},c_{1}z_{1}}$ is big. As motivated in Sect. 1, we formalize a joint profiling and detection problem to solve in this paper.

Problem 1

Given a social graph $\mathcal{G}=(U,D,F,E)$ , the task of joint community profiling and detection is to infer: 1) each user $u$ ’s community membership $\boldsymbol{\pi}_{u}$ , $\forall u\in U$ ; 2) each community $c$ ’s content profile $\boldsymbol{\theta}_{c}$ and diffusion profile $\boldsymbol{\eta}_{c}$ , $\forall c\in\{1,\dots,|C|\}$ .

3.1 Model Design

Next, we concretize our model design w.r.t. the three technical challenges for community profiling as discussed in Sect. 1. We will later evaluate how well we address each challenge in Sect. 6.2.

Profile-aware generative model. Community detection aims to infer a community membership assignment $\boldsymbol{\pi}_{u}$ for each user $u$ based on the friendship links $F_{uv}$ ’s. Community profiling aims to infer a content profile $\boldsymbol{\theta}_{c}$ and a diffusion profile $\boldsymbol{\eta}_{c}$ for each community $c$ based on its member users’ published content $D_{u}$ ’s and diffusion links $E_{ij}^{t}$ ’s. We can reinforce profiling and detection, by letting them leverage each other’s data. As a result, we wish to infer a set of community-level latent variables, including $\boldsymbol{\pi}_{u}$ ’s, $\boldsymbol{\theta}_{c}$ ’s and $\boldsymbol{\eta}_{c}$ ’s, together from all the observations $(D,F,E)$ .

Since joint profiling and detection is an unsupervised task, we adopt a generative framework for our CPD model. We design CPD as a graphical model in Fig. 2, where we use communities to explain all the user observations on the network. Firstly, we consider a user $u$ to publish a document $d_{ui}$ of topic $z$ , due to her community assignment $c_{ui}$ and the community content profile $\boldsymbol{\theta}_{c_{ui}}$ . E.g., an author publishes a paper on deep learning, because she is from the machine learning (ML) community, which studies deep learning. As we deal with short documents (e.g., tweets in Twitter and paper titles in DBLP) and a short document is likely to be about one single topic [31, 17], we assign one single topic to each document in our model. Secondly, we consider a user $u$ to publish a document $d_{ui}$ of topic $z$ , which diffuses another user $v$ ’s document $d_{vj}$ , due to both users’ community assignments $c_{ui}$ and $c_{vj}$ , as well as the community diffusion profile $\eta_{c_{ui},c_{vj}z}$ . E.g., an author $u$ publishes a paper on software repositories, and cites another author $v$ ’s paper on deep learning, because $u$ is from the software engineering (SE) community, $v$ is from the ML community, and SE community tends to cite papers on deep learning from the ML community. Finally, we consider a user $u$ to form a friendship link with a user $v$ , due to their similar community memberships $\boldsymbol{\pi}_{u}$ and $\boldsymbol{\pi}_{v}$ . E.g., an author $u$ is a co-author with another author $v$ , because they are both from the ML community.

Addressing data heterogeneity. We model the friendship links $F$ and the diffusion links $E$ differently. Conventionally, a good community needs to have low conductance, which means the friendship links should be denser inside a community than outside a community. Specifically, we define the probability of having a friendship link between two users $u$ and $v$ as a sigmoid function, parameterized by their community membership similarity:

$$p(F_{uv}=1)=\sigma(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v})\tag{3}$$

where $\hat{\boldsymbol{\pi}}_{u}=[{\hat{\pi}}_{u,1},...,{\hat{\pi}}_{u,|C|}]^{T}$ is an estimation of $\boldsymbol{\pi}_{u}$ based on the aggregation of $u$ ’s community assignments. In other words, we use $\hat{\boldsymbol{\pi}}_{u}$ and $\hat{\boldsymbol{\pi}}_{v}$ , instead of $\boldsymbol{\pi}_{u}$ and $\boldsymbol{\pi}_{v}$ , to generate the $F_{uv}$ ’s in Fig. 2. Such a design is motivated by [5, 6] to simplify the inference. $\sigma(x)=1/(1+e^{-x})$ is a sigmoid function. The more similar $\hat{\boldsymbol{\pi}}_{u}$ and $\hat{\boldsymbol{\pi}}_{v}$ are, the more likely $F_{uv}$ exists. That is, the probability of $F_{uv}$ is large if $u$ and $v$ are from the same communities. This naturally enforces denser friendship links within a community than across communities, thus leading to low conductance. In contrast with the friendship links, the inter-community diffusion is not necessarily “weak” [12]. In fact, the community-level diffusion strengths vary over topics, which breaks the assumption of having to maintain the low conductance within a community. We need to resort to a different modeling of diffusion links, as we discuss next.
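The friendship link probability described above can be illustrated with a minimal sketch. It assumes that $\hat{\boldsymbol{\pi}}_{u}$ is the normalized histogram of a user's per-document community assignments; the helper names `empirical_membership` and `friendship_prob` and the toy assignments are hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def empirical_membership(assignments, num_communities):
    """Assumed estimator of pi_hat_u: normalized counts of a user's
    per-document community assignments."""
    counts = np.bincount(assignments, minlength=num_communities)
    return counts / counts.sum()

def friendship_prob(pi_hat_u, pi_hat_v):
    """Probability of a friendship link: sigmoid of the membership inner product."""
    return sigmoid(pi_hat_u @ pi_hat_v)

# Toy users (hypothetical data): u and v mostly share community 0, w sits in community 2.
pi_u = empirical_membership(np.array([0, 0, 0, 1]), 3)
pi_v = empirical_membership(np.array([0, 0, 1, 1]), 3)
pi_w = empirical_membership(np.array([2, 2, 2, 2]), 3)

p_same = friendship_prob(pi_u, pi_v)  # overlapping memberships
p_diff = friendship_prob(pi_u, pi_w)  # disjoint memberships, inner product is 0
```

Because the inner product of two probability vectors lies in $[0,1]$, the link probability in this sketch ranges from $\sigma(0)=0.5$ for disjoint memberships upward as the memberships overlap more.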

Accommodating nonconformity. Different factors can account for a diffusion decision. Take Twitter as an example; user $u$ is likely to retweet $v$ ’s tweet $d_{vj}$ as her $i$ -th tweet $d_{ui}$ at time $t$ if: 1) the community-level diffusion strength between $c_{ui}$ (the community $u$ belongs to when she generates document $d_{ui}$ ) and $c_{vj}$ on topic $z_{vj}$ is strong; 2) the topic $z_{vj}$ of $d_{vj}$ is trending at time $t$ ; 3) $u$ has an individual preference to retweet from $v$ . These factors reflect three typical perspectives behind a diffusion decision: the community perspective (whether a community is more likely to retweet another community), the content perspective (whether a topic is more popular at the time) and the user perspective (whether a user is more likely to retweet another user). Next, we characterize the three typical factors.

$\bullet$ Community diffusion preference:

we consider a user $u$ to diffuse another user $v$ on topic $z$ , if the communities of $u$ and $v$ are both interested in $z$ and they often diffuse each other on $z$ . Denote $s\in\{0,1\}$ as an indicator for a diffusion link in $E$ to happen. Then, the probability of having a diffusion $s=1$ from $u$ to $v$ on $z$ is

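$$\begin{aligned}p(s=1,z|u,v)&\overset{1}{=}\sum_{c=1}^{|C|}\sum_{c^{\prime}=1}^{|C|}p(c|u)\,p(c^{\prime}|v)\,p(z|c)\,p(z|c^{\prime})\,p(s=1|c,c^{\prime},z)\\&\overset{2}{=}\sum_{c=1}^{|C|}\sum_{c^{\prime}=1}^{|C|}{\hat{\pi}}_{u,c}\,{\hat{\pi}}_{v,c^{\prime}}\,{\hat{\theta}}_{c,z}\,{\hat{\theta}}_{c^{\prime},z}\,\eta_{c,c^{\prime}z}\end{aligned}\tag{4}$$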

where at step 1 we expand $p(s=1,z|u,v)$ by introducing the community membership $p(c|u)$ and $p(c^{\prime}|v)$ , the communities’ interests on the topic $p(z|c)$ and $p(z|c^{\prime})$ , as well as the topic-sensitive community diffusion probability $p(s=1|c,c^{\prime},z)$ . At step 2, we estimate $p(s=1|c,c^{\prime},z)$ with $\eta_{c,c^{\prime}z}$ , the probability of $c$ retweeting/citing $c^{\prime}$ on $z$ . Besides, we estimate $p(c|u)$ with ${\hat{\pi}}_{u,c}$ , which is the empirical probability of community $c$ being assigned to user $u$ ; similarly we estimate $p(c^{\prime}|v)$ with ${\hat{\pi}}_{v,c^{\prime}}$ . Finally we estimate $p(z|c)$ with $\hat{\theta}_{c,z}$ , which is the empirical probability of topic $z$ assigned to the documents from $c$ ; similarly we estimate $p(z|c^{\prime})$ with ${\hat{\theta}}_{c^{\prime},z}$ . Denote $\hat{\boldsymbol{\theta}}_{\cdot,z}=[{\hat{\theta}}_{1,z},...,{\hat{\theta}}_{|C|,z}]^{T}$ and $\bar{\boldsymbol{\eta}}=vec([\boldsymbol{\eta}_{1},...,\boldsymbol{\eta}_{|C|}])$ , where $vec(\mathbf{A})$ concatenates the row vectors in a matrix $\mathbf{A}$ to a vector. For a diffusion between $d_{ui}$ and $d_{vj}$ , which shares the same topic $z$ , we denote $\bar{\mathbf{c}}_{ij}=vec((\hat{\boldsymbol{\pi}}_{u}\hat{\boldsymbol{\pi}}_{v}^{T})\circ(\hat{\boldsymbol{\theta}}_{\cdot,z}\hat{\boldsymbol{\theta}}_{\cdot,z}^{T}))$ , where $\circ$ is an element-wise product. Then Eq. (4) becomes $\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}$ .
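The vectorized form $\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}$ can be checked numerically against the explicit double sum over community pairs. The sketch below uses random toy parameters and hypothetical sizes; it also assumes that, for a fixed topic $z$, only the $|C|^{2}$ entries of $\bar{\boldsymbol{\eta}}$ belonging to $z$ enter the inner product, since $\bar{\mathbf{c}}_{ij}$ has $|C|^{2}$ entries.

```python
import numpy as np

rng = np.random.default_rng(0)
C, Z = 3, 4                      # hypothetical numbers of communities and topics

pi_u = rng.dirichlet(np.ones(C))             # pi_hat_u
pi_v = rng.dirichlet(np.ones(C))             # pi_hat_v
theta = rng.dirichlet(np.ones(Z), size=C)    # theta[c, z]
eta = rng.random((C, C, Z))                  # eta[c, c', z]
z = 2                                        # shared topic of the two documents

# Explicit double sum over community pairs (c, c').
explicit = sum(
    pi_u[c] * pi_v[cp] * theta[c, z] * theta[cp, z] * eta[c, cp, z]
    for c in range(C)
    for cp in range(C)
)

# Vectorized form: c_bar = vec((pi_u pi_v^T) o (theta_z theta_z^T)), row-major.
theta_z = theta[:, z]
c_bar = (np.outer(pi_u, pi_v) * np.outer(theta_z, theta_z)).ravel()
eta_z = eta[:, :, z].ravel()     # assumed: the topic-z slice of eta_bar
vectorized = c_bar @ eta_z
```

Both quantities agree up to floating-point error, which is what makes the inner-product form convenient inside the sigmoid.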

$\bullet$ Topic popularity:

we model the popularity of a topic at a specific timestamp $t$ as the count of topic $z$ at $t$ , which is denoted as $n^{t}_{z}$ .

$\bullet$ Individual preference:

we model user $u$ ’s preference to diffuse information from user $v$ with a linear function $\boldsymbol{\nu}^{T}\mathbf{f}_{uv}$ , where $\boldsymbol{\nu}$ is a parameter, $\mathbf{f}_{uv}$ is a feature vector for $u$ and $v$ . Take Twitter as an example; we consider two features for $u$ : 1) user popularity, which is defined as the number of $u$ ’s followers divided by that of her followees $\frac{|{Followers(u)}|}{|{Followees(u)}|}$ ; 2) user activeness, which is defined as the number of $u$ ’s retweets divided by that of her tweets $\frac{|{Retweets(u)}|}{|{Tweets(u)}|}$ . We extract $v$ ’s features and concatenate them with $u$ ’s as $\mathbf{f}_{uv}$ .

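In order to systematically combine the three diffusion factors, we introduce a sigmoid function to define the probability of document $d_{ui}$ diffusing document $d_{vj}$ of topic $z$ at timestamp $t$ as:

$$p(E_{ij}^{t}=1)=\sigma(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv})\tag{5}$$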

We learn the parameters $\bar{\boldsymbol{\eta}}$ and $\boldsymbol{\nu}$ , so that we know how much each factor contributes in the diffusion.
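As a rough illustration of how the three factors combine, the sketch below builds the Twitter-style feature vector $\mathbf{f}_{uv}$ and passes the summed factors through a sigmoid. All numbers, the weight vector `nu`, and the helper names are hypothetical; the sketch also assumes nonzero followee and tweet counts, and it does not address how the raw topic count $n_{z}^{t}$ is scaled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def user_features(followers, followees, retweets, tweets):
    """Popularity (followers / followees) and activeness (retweets / tweets).
    Assumes followees > 0 and tweets > 0."""
    return np.array([followers / followees, retweets / tweets])

def diffusion_prob(community_score, topic_popularity, nu, f_uv):
    """Sigmoid of community preference + topic popularity + individual preference."""
    return sigmoid(community_score + topic_popularity + nu @ f_uv)

f_u = user_features(followers=200, followees=100, retweets=30, tweets=60)
f_v = user_features(followers=5000, followees=50, retweets=10, tweets=100)
f_uv = np.concatenate([f_u, f_v])            # u's features followed by v's

nu = np.array([0.1, 0.2, 0.001, -0.3])       # hypothetical learned weights
p = diffusion_prob(community_score=0.05, topic_popularity=0.0, nu=nu, f_uv=f_uv)
```

Inspecting the magnitude of each summand before the sigmoid gives a direct reading of how much each factor contributes to a particular diffusion decision.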

Generative process. We summarize the CPD model’s generative process below. Denote $\mathbbm{1}_{\ell\times 1}$ as an all-one vector of length $\ell$ .

1. For each topic $z=1,\dots,|Z|$ , draw its word distribution from a Dirichlet prior parameterized by $\beta$ : $\boldsymbol{\phi}_{z}|\beta\sim Dir(\beta\mathbbm{1}_{|W|\times 1})$ ;

2. For each community $c=1,\dots,|C|$ , draw its topic distribution from a Dirichlet prior parameterized by $\alpha$ : $\boldsymbol{\theta}_{c}|\alpha\sim Dir(\alpha\mathbbm{1}_{|Z|\times 1})$ ;

3. For each user $u=1,\dots,|U|$ :

(a) Draw her community distribution $\boldsymbol{\pi}_{u}|\rho\sim Dir(\rho\mathbbm{1}_{|C|\times 1})$ ;

(b) For the $i$ -th document $d_{ui}$ of user $u$ :

i. Draw a community assignment $c_{ui}|\boldsymbol{\pi}\sim Multi(\boldsymbol{\pi}_{u})$ , by $u$ ’s multinomial community distribution $\boldsymbol{\pi}_{u}$ ;

ii. Draw a topic $z_{ui}|\mathbf{c},\boldsymbol{\theta}\sim Multi(\boldsymbol{\theta}_{c_{ui}})$ , by $c_{ui}$ ’s multinomial topic distribution $\boldsymbol{\theta}_{c_{ui}}$ ;

iii. Draw each word $w_{uik}|\mathbf{z},\boldsymbol{\phi}\sim Multi(\boldsymbol{\phi}_{z_{ui}})$ , $\forall k=1,...,|W_{ui}|$ , by $z_{ui}$ ’s multinomial word distribution;

(c) For each friendship link from user $u$ to user $v$ , draw ${F_{uv}}|\mathbf{\boldsymbol{\pi}}\sim Ber(\sigma(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v}))$ by a Bernoulli distribution (Eq. 3);

(d) For each diffusion link $E^{t}_{ij}$ from document $d_{ui}$ to document $d_{vj}$ at time $t$ , draw $E^{t}_{ij}|C,\boldsymbol{\eta},Z,\boldsymbol{\nu},\mathbf{f}\sim Ber(\sigma(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv}))$ by a Bernoulli distribution (Eq. 5).

In step 3.b.iii, since a short text often has a single topic [31, 17], we sample all words in $d_{ui}$ from the same topic-word distribution $\boldsymbol{\phi}_{z_{ui}}$ .
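The content part of the generative process (steps 1, 2, 3.a and 3.b) can be simulated with a short sketch. It is only a toy forward sampler under assumed sizes; the function name `generate_documents` and its arguments are hypothetical, and the link-generation steps 3.c and 3.d are omitted.

```python
import numpy as np

def generate_documents(num_users, docs_per_user, words_per_doc,
                       C, Z, W, alpha, beta, rho, seed=0):
    """Forward-sample users' documents: one community and one topic per document."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(W, beta), size=Z)        # step 1: topic-word
    theta = rng.dirichlet(np.full(Z, alpha), size=C)     # step 2: community-topic
    pi = rng.dirichlet(np.full(C, rho), size=num_users)  # step 3.a: user-community
    docs = []
    for u in range(num_users):
        for _ in range(docs_per_user):
            c = rng.choice(C, p=pi[u])                   # step 3.b.i
            z = rng.choice(Z, p=theta[c])                # step 3.b.ii
            words = rng.choice(W, size=words_per_doc, p=phi[z])  # step 3.b.iii
            docs.append((u, int(c), int(z), words))
    return pi, theta, phi, docs

# Toy sizes (assumptions); hyperparameters follow the 50/dimension convention
# for alpha and rho, and a small beta for the word distributions.
pi, theta, phi, docs = generate_documents(
    num_users=5, docs_per_user=3, words_per_doc=6,
    C=3, Z=4, W=20, alpha=50 / 4, beta=0.1, rho=50 / 3)
```

Every word of a document is drawn from the same row of `phi`, which mirrors the single-topic assumption for short documents.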

4 Scalable Model Inference

We develop a scalable inference algorithm for CPD. We aim to infer the topic assignment and community assignment latent variables $\{\mathbf{Z},\mathbf{C}\}$ from the observations $\{W,F,E\}$ , where $W$ is the words in $D$ . We use collapsed Gibbs sampling [6, 15, 32] for the inference. We also estimate the variational parameters $\{\boldsymbol{\pi},\boldsymbol{\theta},\boldsymbol{\phi}\}$ and the model parameters $\{\boldsymbol{\nu},\boldsymbol{\eta}\}$ by variational Expectation Maximization (EM) [5, 6]. We later parallelize our inference algorithm.

4.1 Collapsed Gibbs Sampling

To derive the Gibbs sampler, we start with computing the collapsed posterior distribution of our model:

[TABLE]

where $p(F|C)$ (abbreviated as $p(F)$ in the following) is the probability for the friendship links $F$ generated by the communities $C$ ; $p(E|C,\bar{\boldsymbol{\eta}},Z,\boldsymbol{\nu},\mathbf{f})$ (abbreviated as $p(E)$ in the following) is the probability for the diffusion links $E$ generated by the communities $C$ . We follow [5] to model observed links only in Eq. 6. Thus, we define $p(F)=\prod\nolimits_{(u,v)\in F}P(F_{uv}=1)$ and $p(E)=\prod\nolimits_{(i,j)\in E}p(E_{ij}^{t}=1)$ , where $t$ is the timestamp of the diffusion link $(i,j)$ .

In the generative process of the CPD model, we model both $P(F_{uv}=1)$ (step 3.c) and $p(E_{ij}^{t}=1)$ (step 3.d) with sigmoid functions $\sigma(\cdot)$ . Bayesian inference with a sigmoid function is known to be hard, because it is analytically inconvenient to construct a Gibbs sampler for the sigmoid function [28]. We are motivated by the data augmentation approach [2, 6], which introduces Pólya-Gamma random variables to derive an exact mixture representation of the sigmoid function for easier inference. Hence we introduce two Pólya-Gamma variables $\boldsymbol{\lambda}$ and $\boldsymbol{\delta}$ as the augmented variables for $p(F)$ and $p(E)$ respectively. Formally, a random variable $x$ follows a Pólya-Gamma distribution $x\sim PG(a,b)$ ( $a>0,b>0$ ), if

$$x=\frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{g_{k}}{(k-1/2)^{2}+b^{2}/(4\pi^{2})}\tag{7}$$

where $g_{k}\sim Gamma(a,1)$ is a Gamma random variable. It has been shown in [28] that a logistic function can be represented as a mixture of Gaussians w.r.t. a Pólya-Gamma distribution:

$$\sigma(w)=\frac{1}{2}\int_{0}^{\infty}\psi(w,x)\,p(x|1,0)\,dx\tag{8}$$

where $\psi(w,x)=e^{\frac{w-xw^{2}}{2}}$ and $x\sim PG(1,0)$ . Then, for $p(F_{uv}=1)$ as defined in Eq. 3, we can introduce a Pólya-Gamma variable $\lambda_{uv}\sim PG(1,0)$ , such that we get a joint probability

[TABLE]

Similarly, for $p(E_{ij}^{t}=1)$ as defined in Eq. 5, we can introduce a Pólya-Gamma variable $\delta_{ij}\sim PG(1,0)$ , such that we get

[TABLE]

Considering all the friendship links and diffusion links, we have

[TABLE]
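The mixture representation behind this augmentation can be sanity-checked by Monte Carlo. The sketch below assumes the standard Pólya-Gamma series $x=\frac{1}{2\pi^{2}}\sum_{k}g_{k}/(k-1/2)^{2}$ for $PG(1,0)$ and the identity $\sigma(w)=\frac{1}{2}\mathbb{E}_{x}[\psi(w,x)]$; it draws approximate samples by truncating the infinite sum, which is a simple illustrative sampler and not the exponentially tilted Jacobi sampler used for inference.

```python
import numpy as np

def sample_pg_1_0(num_samples, num_terms=200, rng=None):
    """Approximate PG(1, 0) draws by truncating the infinite sum of Gamma(1, 1) variables."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, num_terms + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=(num_samples, num_terms))
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

def psi(w, x):
    """psi(w, x) = exp((w - x * w^2) / 2)."""
    return np.exp((w - x * w ** 2) / 2.0)

rng = np.random.default_rng(0)
x = sample_pg_1_0(20000, rng=rng)

w = 1.0
mc_estimate = 0.5 * psi(w, x).mean()     # Monte Carlo estimate of sigma(w)
exact = 1.0 / (1.0 + np.exp(-w))         # closed-form sigmoid
```

Conditioned on $x$, $\psi(w,x)$ is the exponential of a quadratic in $w$, which is why the augmented conditionals become convenient to sample.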

Next we infer $\mathbf{Z}$ and $\mathbf{C}$ , together with $\boldsymbol{\lambda}$ and $\boldsymbol{\delta}$ . Specifically, augmented with two Pólya-Gamma variables $\boldsymbol{\lambda}$ and $\boldsymbol{\delta}$ , the collapsed posterior distribution of our model becomes:

[TABLE]

where $\Delta(\mathbf{x})=\frac{\prod_{i=1}^{\text{dim}(\mathbf{x})}\Gamma(x_{i})}{\Gamma(\sum_{i=1}^{\text{dim}(\mathbf{x})}x_{i})}$ . Based on Eq. 12, we can infer $\mathbf{Z}$ , $\mathbf{C}$ , $\boldsymbol{\lambda}$ and $\boldsymbol{\delta}$ one by one as follows.

$\bullet$ For $\mathbf{Z}$ :

the probability of assigning topic $z$ to $d_{ui}$ is

[TABLE]

where $\Lambda_{u}=\{v|(u,v)\in F~{}\text{or}~{}(v,u)\in F\}$ is user $u$ ’s neighbors in $F$ . $\Lambda_{i}=\{j|(i,j)\in E~{}\text{or}~{}(j,i)\in E\}$ is document $i$ ’s neighbors in $E$ . $n_{c,{}\neg\{ui\}}^{z}$ and ${n_{c,{}\neg\{ui\}}^{(\cdot)}}$ denote the number of times that topic $z$ is assigned to community $c$ and the number of times that any topic is assigned to $c$ , excluding the current document $d_{ui}$ . Similarly, ${n_{z,{}\neg\{ui\}}^{w}}$ and ${n_{z,{}\neg\{ui\}}^{(\cdot)}}$ are the number of times that word $w$ is assigned to topic $z$ and the number of times that any word is assigned to $z$ , excluding $d_{ui}$ . ${n_{ui}^{w}}$ and ${n_{ui}^{(\cdot)}}$ are the number of times that word $w$ occurs in the document $d_{ui}$ and the number of words in $d_{ui}$ . Finally, $\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv}|C_{\neg\{ui\}})$ denotes estimating $\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv})$ based on $C_{\neg\{ui\}}$ instead of the whole $C$ ; $\psi(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij}|C_{\neg\{ui\}},Z_{\neg\{ui\}})$ denotes estimating $\psi(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij})$ based on $C_{\neg\{ui\}}$ and $Z_{\neg\{ui\}}$ instead of the whole $C$ and $Z$ .

$\bullet$ For $\mathbf{C}$ :

the probability of assigning community $c$ to $u$ at $d_{ui}$ is

[TABLE]

where ${n_{u,{}\neg\{ui\}}^{c}}$ and ${n_{u,{}\neg\{ui\}}^{(\cdot)}}$ are the number of documents from user $u$ that are assigned to community $c$ and the number of documents from user $u$ excluding $d_{ui}$ , respectively.

$\bullet$ For $\boldsymbol{\lambda}$ :

the conditional distribution of $\boldsymbol{\lambda}$ is Pólya-Gamma, i.e.,

[TABLE]

We efficiently sample $\lambda_{uv}$ by an alternate exponentially tilted Jacobi distribution [28].

$\bullet$ For $\boldsymbol{\delta}$ :

the conditional distribution of $\boldsymbol{\delta}$ is also Pólya-Gamma,

[TABLE]

4.2 Model Parameter Estimation

We use variational EM to iteratively estimate the variational parameters $\{\boldsymbol{\pi},\boldsymbol{\theta},\boldsymbol{\phi}\}$ and the model parameters $\{\boldsymbol{\nu},\boldsymbol{\eta}\}$ :

1. (E-step) Use the samples of collapsed Gibbs sampling to estimate the parameters $\boldsymbol{\pi}$ , $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ , given $\boldsymbol{\nu}$ , $\boldsymbol{\eta}$ .

2. (M-step) Optimize $\boldsymbol{\nu}$ and $\boldsymbol{\eta}$ by maximizing Eq. (6), given the parameters $\boldsymbol{\pi}$ , $\boldsymbol{\theta}$ , $\boldsymbol{\phi}$ estimated in the E-step.

In the E-step, the Gibbs sampler iteratively draws samples of $\mathbf{Z}$ , $\mathbf{C}$ , $\boldsymbol{\lambda}$ and $\boldsymbol{\delta}$ by Eqs. 13–16. Based on the samples, we estimate: ${\pi_{u,c}}=\frac{{n_{u}^{c}+\rho}}{{n_{u}^{(\cdot)}+\left|C\right|\rho}}$ , ${\theta_{c,z}}=\frac{{n_{c}^{z}+\alpha}}{{n_{c}^{(\cdot)}+\left|Z\right|\alpha}}$ and ${\phi_{z,w}}=\frac{{n_{z}^{w}+\beta}}{{n_{z}^{(\cdot)}+\left|W\right|\beta}}$ . In the M-step, we first estimate the $\eta_{c,c^{\prime}z}$ ’s by aggregating the community and topic assignments w.r.t. all the documents, based on the last iteration of sampling. Then we estimate $\boldsymbol{\nu}$ by maximizing Eq. 6 with all other variables fixed; this is essentially fitting a logistic regression function. To solve it, we randomly sample the same number of non-observed diffusion links as negative instances for optimization. As $\alpha$ and $\rho$ are used to sample the $\boldsymbol{\theta}_{c}$ ’s and $\boldsymbol{\pi}_{u}$ ’s, we follow the convention [13] to set their values as 50 divided by $\boldsymbol{\theta}_{c}$ ’s dimension and $\boldsymbol{\pi}_{u}$ ’s dimension respectively, i.e., $\alpha=50/|Z|$ , $\rho=50/|C|$ . As $\beta$ is used to sample the word distributions $\boldsymbol{\phi}_{z}$ ’s and the number of words is large, we follow [13] again to set $\beta=0.1$ .
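The three smoothed estimates share one form: a count plus the prior, divided by the row total plus the dimension times the prior. A minimal sketch, with a hypothetical helper `smoothed_rows` and toy counts:

```python
import numpy as np

def smoothed_rows(counts, prior):
    """Row-wise (count + prior) / (row total + dimension * prior)."""
    counts = np.asarray(counts, dtype=float)
    dim = counts.shape[1]
    return (counts + prior) / (counts.sum(axis=1, keepdims=True) + dim * prior)

# Toy user-by-community counts n_u^c (hypothetical): 2 users, 3 communities.
n_uc = np.array([[3, 1, 0],
                 [0, 0, 4]])
num_communities = n_uc.shape[1]
rho = 50 / num_communities        # rho = 50 / |C|
pi = smoothed_rows(n_uc, rho)     # pi[u, c]
```

The same helper would apply to community-by-topic counts with $\alpha$ and to topic-by-word counts with $\beta$. With few documents per user and $\rho=50/|C|$, the prior dominates, so the estimated memberships stay close to uniform in this toy case.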

4.3 Scalability

We summarize our inference algorithm in Alg. 1. In steps 3–10, we take an E-step for collapsed Gibbs sampling. In steps 11–14, we take an M-step for training the model parameters.

Time complexity. In steps 4–6, as we compute the community assignments and topic assignments for each document of each user, it takes $O(|D|\times|C|+|W|\times|Z|)$ . In steps 7–8, as we compute $\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v}$ for each friendship link, it takes $O(|C|\times|F|)$ . In steps 9–10, as we compute $(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv})$ for each diffusion link, it takes $O(|C|^{2}\times|E|)$ . In steps 11–12, as we aggregate the community assignments and topic assignments for each diffusion link, it takes $O(|E|)$ . In steps 13–14, as we compute gradients for $\boldsymbol{\nu}$ over all the diffusion links, it takes $O(|E|\times T_{2})$ . In total, for $T_{1}$ iterations, the overall complexity is $O((|D|\times|C|+|W|\times|Z|+|C|\times|F|+|C|^{2}\times|E|+|E|+|E|\times T_{2})\times T_{1})$ . As we can see, Alg. 1’s time complexity is linear in the data size (i.e., $|D|$ , $|F|$ , and $|E|$ ).

Parallelization. We consider multithread parallelization of Alg. 1 and leave multi-machine parallelization as future work. In our variational EM algorithm, we find that the E-step takes much longer than the M-step, because:

1) the E-step’s collapsed Gibbs sampling has to be done iteratively over all the observations, including documents (thus words), friendship links and diffusion links;

2) the M-step’s model parameter estimation is comparatively much easier, since optimizing $\boldsymbol{\nu}$ is basically solving logistic regression on the diffusion links (and the same number of negative links) and $\boldsymbol{\eta}$ is obtained by simply aggregating the existing community and topic assignments.

Thus, in this paper we focus on parallelizing the E-step.

$\bullet$ Segmenting data to reduce inter-dependency. Recall from Sect. 4.1 that the sampling requires computing: 1) a number of counters, including the community-topic counter ${n_{c}^{z}}$ , the topic-word counter ${n_{z}^{w}}$ , the user-community counter ${n_{u}^{c}}$ ; 2) a number of link probabilities, including the friendship one $\psi(\hat{\boldsymbol{\pi}}_{u}^{T}\hat{\boldsymbol{\pi}}_{v},\lambda_{uv})$ and the diffusion one $\psi(\bar{\mathbf{c}}_{ij}^{T}\bar{\boldsymbol{\eta}}+n_{z}^{t}+\boldsymbol{\nu}^{T}\mathbf{f}_{uv},\delta_{ij})$ . Among these computations, both topic and community assignments are applied to documents (thus their users), the friendship link probability is applied to users, and the diffusion link probability is applied to two documents (thus their users). Therefore, except for the word topic assignment, the vast majority of computations are done on users and documents. This motivates us to segment the data by users and documents, so that different threads can work on different data segments with little inter-dependency. It may be possible to take words into consideration for the data segmentation as well, but how to do so is not obvious and we leave it for future work. Considering that a user often has many documents (especially in Twitter), we design two guidelines to segment the data by users and documents:

1) we keep a user’s documents in the same data segment, because otherwise there are likely many conflicting updates about the same user from multiple threads;

2) we prefer keeping the same-topic documents in the same data segment, because it helps to reduce the conflicting updates about the same topic from multiple threads.

Overall, we first run LDA [3] on all the users’ documents with $|Z|$ topics; then we partition the users into $|Z|$ segments, based on each user’s most frequently assigned topic in her documents. In each segment, each user has her documents, related friendship links and diffusion links.
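The partitioning step can be sketched as follows, assuming each document already carries a single topic label (for example, the most probable topic from a separately run LDA model, which is not shown here). The function name `segment_users_by_topic` and the toy input are hypothetical.

```python
from collections import Counter, defaultdict

def segment_users_by_topic(doc_topics_by_user):
    """Group users by the topic most frequently assigned to their documents.

    doc_topics_by_user: dict mapping user id -> list of topic ids, one per document.
    Users without documents are skipped. Ties are broken by first occurrence,
    which is an arbitrary choice made for this sketch.
    """
    segments = defaultdict(list)
    for user, topics in doc_topics_by_user.items():
        if not topics:
            continue
        dominant_topic, _ = Counter(topics).most_common(1)[0]
        segments[dominant_topic].append(user)
    return dict(segments)

# Toy input (hypothetical): three users with per-document topic labels.
doc_topics = {"alice": [0, 0, 1], "bob": [1, 1], "carol": [2]}
segments = segment_users_by_topic(doc_topics)
```

Each user lands in exactly one segment together with all of her documents, which satisfies the first guideline; grouping by dominant topic approximates the second.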

$\bullet$ Distributing workload to avoid data skewness. We aim to distribute the $|Z|$ data segments to $M$ threads, such that the workload on each thread is balanced. Note that $M$ is set as the number of physical CPU cores in this work. Our approach is to first estimate the workload of each data segment, and then cast this segment allocation task as solving $M$ standard 0-1 knapsack problems (https://en.wikipedia.org/wiki/Knapsack_problem). Denote the $i$ -th data segment’s workload as $o_{i}\in\mathbb{R}^{+}$ , thus the workload for all the data segments is $O=\sum_{i=1}^{|Z|}o_{i}$ . Denote a binary indicator as $x_{i}\in\{0,1\}$ . Then for each thread, we solve

[TABLE]

which tries to find a subset of the data segments to have as close to $\frac{O}{M}$ workload as possible. One can fine tune the objective function of Eq. 17 in practice to best allocate the data segments for even workload among the threads. We estimate each workload $o_{i}$ as follows. First of all, we estimate the average processing time for each document and each link, based on a serial implementation of the sampling algorithm over all the data. Then, based on the number of documents and links a user has, we estimate the average workload of processing that user. Finally, we sum up the average workload of all the users in the $i$ -th data segment as $o_{i}$ .
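One possible way to realize this allocation is sketched below. It assumes a sequential scheme: for each thread in turn, a subset-sum dynamic program (a 0-1 knapsack whose item values equal their weights) picks the remaining segments whose total workload comes closest to $O/M$ without exceeding it, and the last thread takes whatever is left. The discretization `resolution`, the function names and the toy workloads are assumptions made for illustration.

```python
def closest_subset(workloads, target, resolution=1000):
    """Indices of a subset whose total is as close to `target` as possible
    without exceeding it, on a grid of `resolution` steps (rounding may let
    the true total overshoot the target slightly)."""
    if not workloads or target <= 0:
        return []
    scale = resolution / target
    weights = [max(1, round(w * scale)) for w in workloads]
    reachable = {0: []}                      # discretized total -> chosen indices
    for idx, weight in enumerate(weights):
        # Iterate over a snapshot so each segment is used at most once.
        for total, chosen in list(reachable.items()):
            new_total = total + weight
            if new_total <= resolution and new_total not in reachable:
                reachable[new_total] = chosen + [idx]
    return reachable[max(reachable)]

def allocate_segments(workloads, num_threads):
    """Assign every segment to exactly one thread, aiming at O / M per thread."""
    remaining = dict(enumerate(workloads))
    target = sum(workloads) / num_threads
    allocation = []
    for _ in range(num_threads - 1):
        ids = list(remaining)
        picked = closest_subset([remaining[i] for i in ids], target)
        chosen = [ids[p] for p in picked]
        allocation.append(chosen)
        for i in chosen:
            del remaining[i]
    allocation.append(list(remaining))       # last thread takes the rest
    return allocation

segment_workloads = [4.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]   # hypothetical o_i values
allocation = allocate_segments(segment_workloads, num_threads=2)
loads = [sum(segment_workloads[i] for i in group) for group in allocation]
```

Solving the threads one after another is greedy across threads, so later threads can end up less balanced than earlier ones; that is one place where the objective could be tuned in practice.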

We evaluate our inference algorithm’s efficiency in Sect. 6.4.

5 Applications

We concretize how to enable the following three community-aware applications based on five CPD outputs, including: 1) the community assignment for users $\pi_{u,c}$ ’s; 2) the community content profile $\theta_{c,z}$ ’s; 3) the community diffusion profile $\eta_{c,c^{\prime}z}$ ’s; 4) the topic assignment for words $\phi_{z,w}$ ’s; 5) the individual diffusion preference parameters $\boldsymbol{\nu}$ .

Community-aware diffusion. Given input of a document $d_{vj}$ published by user $v$ , we output the probability that another user $u$ will publish a document $d_{ui}$ to retweet or cite $d_{vj}$ at timestamp $t$ as

[TABLE]

where at step 1 we expand $p(E_{ij}^{t}=1|u,v,d_{vj},t)$ by the topics of $d_{vj}$ . At step 2, we plug in the definition of $p(E_{ij}^{t}=1|u,v,z,t)$ by Eq. 5. As we can see, Eq. 18 comprehensively models the diffusion by taking the community assignments $\boldsymbol{\pi}$ , the community profiles $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ , and the individual diffusion preference $\boldsymbol{\nu}$ into account.

Profile-driven community ranking. Given input of a query $q\in W^{k}$ ( $\forall k\geq 1$ ), we output the ranking of communities based on their probabilities to diffuse information about $q$ . Denote the probability of a community $c$ to generate a diffusion link $s=1$ of query $q$ as

[TABLE]

where at step 1 we expand $p(s=1|c,q)$ by the community diffusion profile $p(s=1|c,c^{\prime},z)$ , the topic assignment for $q$ in a community $p(z|q,c^{\prime})$ and the probability that $q$ is from that community $p(c^{\prime}|q)$ . At step 2 we plug in the definition of $p(s=1|c,c^{\prime},z)\propto\eta_{c,c^{\prime}z}$ and assume that $q$ is equally likely to come from any community, i.e., $p(c^{\prime}|q)$ is uniform. At step 3, we estimate the probability $p(z|q,c^{\prime})$ in a similar way as Eq. 13. We skip the details but explain the rationale of this estimation: $p(z|q,c^{\prime})$ is proportional to the probability of community $c^{\prime}$ generating topic $z$ (i.e., captured by $\theta_{c^{\prime},z}$ ) and the probability of $q$ belonging to topic $z$ (i.e., captured by $\prod_{w\in q}\phi_{z,w}$ ).
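The ranking score described above can be sketched as follows, assuming $p(z|q,c^{\prime})\propto\theta_{c^{\prime},z}\prod_{w\in q}\phi_{z,w}$ normalized over topics and a uniform $p(c^{\prime}|q)=1/|C|$. The array layout (`eta[c, c', z]`), the function name `rank_communities` and the toy parameters are assumptions made for illustration.

```python
import numpy as np

def rank_communities(query_words, theta, phi, eta):
    """Rank communities by an (unnormalized) probability of diffusing query q.

    theta[c, z]: content profiles; phi[z, w]: topic-word distributions;
    eta[c, c', z]: diffusion profiles; query_words: list of word ids.
    """
    word_likelihood = phi[:, query_words].prod(axis=1)      # prod_w phi[z, w], shape (Z,)
    p_topic = theta * word_likelihood                       # proportional to p(z | q, c')
    p_topic = p_topic / p_topic.sum(axis=1, keepdims=True)  # normalize over topics
    num_communities = theta.shape[0]
    # Sum over source communities c' and topics z, with uniform p(c' | q).
    scores = np.einsum("cdz,dz->c", eta, p_topic) / num_communities
    return np.argsort(-scores), scores

# Toy parameters (hypothetical): 2 communities, 2 topics, 3 words.
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
eta = np.zeros((2, 2, 2))
eta[0, :, 0], eta[0, :, 1] = 0.8, 0.1    # community 0 mostly diffuses topic 0
eta[1, :, 0], eta[1, :, 1] = 0.1, 0.8    # community 1 mostly diffuses topic 1

order_topic0, _ = rank_communities([0], theta, phi, eta)   # word 0 is typical of topic 0
order_topic1, _ = rank_communities([2], theta, phi, eta)   # word 2 is typical of topic 1
```

In this toy setting, a query made of a topic-0 word ranks community 0 first, and a query made of a topic-1 word ranks community 1 first.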

Profile-driven community visualization. We can visualize each community’s content profile and its diffusion profile, as Fig. 1(b) shows. In particular, we are interested in the diffusion visualization, as it is new. In our experiments, we visualize how a community interacts with the others in two typical settings: 1) diffusion on a specific topic, where we use $\eta_{c,c^{\prime}z}$ as the diffusion strength from $c$ to $c^{\prime}$ under topic $z$ ; 2) diffusion with topic aggregation, where we use $\sum_{z}\eta_{c,c^{\prime}z}$ as the diffusion strength from $c$ to $c^{\prime}$ .

6 Experiments

We test CPD with two large-scale real-world data sets. We design experiments to: 1) evaluate how well we address each challenge listed in Sect. 1; 2) evaluate CPD’s performance, by comparing with the state-of-the-art baselines in different applications.

6.1 Set Up

We run experiments on Linux computers equipped with Intel(R) 3.50GHz CPUs and 16GB of RAM. We do 10-fold cross validation and report average scores for all the quantitative results. We also report significance test results whenever necessary.

Data sets. We use two public data sets: Twitter [20] and DBLP [34]. The Twitter data set was collected in May 2011. The DBLP data set contains the publications indexed by DBLP (http://dblp.uni-trier.de/) from 1936 to 2010. We pre-processed the tweets and the paper titles by removing stop words, stemming and POS tagging (http://nlp.stanford.edu/software/tagger.shtml). We only kept nouns, verbs and hashtags. After that, we removed the documents with fewer than two words, and then removed the users with no document. Table 3 summarizes the statistics of our data sets after pre-processing.

Baselines. We choose baselines based on the following guidelines: 1) they are the state of the art to model heterogeneous user observations at the data level; 2) they model diffusion prediction at the task level; 3) preferably they model community. Finally, we choose four baselines below, and list our differences with them in Table 4.

$\bullet$ Poisson Mixed-Topic Link Model (PMTLM) [43].

It models the document network and uses the document topic assignment to generate the links. We adapt PMTLM for community detection and friendship link prediction comparison, by aggregating the topic assignments of each user’s documents as the community membership for that user. We also compare with PMTLM on diffusion prediction, as it also models the document links.

$\bullet$ Whom to Mention (WTM) [37].

It models the user diffusion links with user content and friendship. It does not model community. We compare with WTM on diffusion prediction.

$\bullet$ Community Role Model (CRM) [15].

It models friendship links and diffusion links based on the user’s community assignment and role assignment together. We compare with CRM on community detection, friendship link prediction and diffusion prediction.

$\bullet$ COmmunity Level Diffusion (COLD) [17].

It models the content and diffusion links based on communities. Thus it is the closest work to ours. But it models neither friendship links in community detection, nor individual factor and topic factor in diffusion prediction. We compare with COLD on community detection, friendship link prediction and diffusion prediction. As COLD has community diffusion strength, we also compare it on community ranking.

In addition to the above existing baselines, we also design some more baselines to validate that we are better than a straightforward community profiling approach of “first detecting communities, then aggregating each community’s user observations”. Specifically, we adopt the two state-of-the-art algorithms, CRM [15] and COLD [17] to detect the communities, and further aggregate the user observations in each detected community as the profiles. After applying CRM and COLD, we get the community assignment probabilities for each user $u$ to each community $c$ , which we denote as $\pi^{*}_{u,c}$ ’s. To get aggregated content profile, we first run LDA [3] on all the users’ documents with $|Z|$ topics, and for each user $u$ ’s $i$ -th document $d_{ui}$ , we get its $|Z|$ -dimensional multinomial topic distribution as $\boldsymbol{\theta}^{*}_{d_{u,i}}$ . Denote community $c$ ’s aggregated content profile as $\boldsymbol{\theta}^{*}_{c}$ . We have

[TABLE]

To get the aggregated diffusion profile, we aggregate each diffusion link between $d_{u,i}$ and $d_{v,j}$ in $E$ w.r.t. their users’ communities on a topic $z$. Denote the aggregated diffusion profile $\eta^{*}_{c,c^{\prime}z}$ as the probability of community $c$ diffusing to community $c^{\prime}$ on topic $z$. Then we have

[TABLE]

In all, we obtain two more baselines, which implement the straightforward “first detection, then aggregation” profiling approach.

$\bullet$ CRM+Agg.

It uses CRM [15] to detect communities; then it uses Eq. 20 and Eq. 21 for user aggregation to get the community content profiles and diffusion profiles, respectively.

$\bullet$ COLD+Agg.

It uses COLD [17] to detect communities, then similarly uses Eq. 20 and Eq. 21 to get content and diffusion profiles.

We compare with both CRM+Agg and COLD+Agg in diffusion link prediction and community ranking.

Evaluation. Since we jointly profile and detect communities, we will evaluate the quality of community detection and profiles.

$\bullet$ Detection quality.

We consider two ways to evaluate the resulting communities:

1) how dense they are;
2) how well they can be used to explain the friendship link observations.

For 1), we use conductance [18, 19] as the metric. As our community assignment is probabilistic, we follow [17] to let each user belong to her top five communities in conductance evaluation (and also in the later community ranking evaluation). The smaller the conductance is, the better. For 2), we follow [17] to design a link prediction task, where we use Eq. 3 to predict whether to observe a friendship link based on two users’ communities. As there is no predefined threshold for link prediction, we use AUC (Area Under the receiver operating characteristic Curve) [17, 15] as the metric. Given a ranking of non-observed links, we calculate the AUC score as the probability of a randomly chosen true positive link being ranked higher than a randomly chosen true negative link. In the 10-fold cross validation, each time we use 10% of the positive links and sample the same number of negative links to calculate AUC. The higher the AUC is, the better.
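The two detection metrics can be illustrated with a minimal Python sketch. It assumes the common cut-over-volume form of conductance for a single community on an undirected friendship graph, and computes AUC directly from its pairwise definition (ties counted as one half, which is our assumption). The function names `conductance` and `auc` and the input layout are hypothetical, not taken from the paper.

```python
def conductance(edges, members):
    """Conductance of one community on an undirected graph.

    edges   : iterable of (u, v) friendship links (hypothetical layout)
    members : users assigned to the community
    Returns cut(S, rest) / min(vol(S), vol(rest)); smaller means denser.
    """
    members = set(members)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        u_in, v_in = u in members, v in members
        if u_in != v_in:
            cut += 1                      # edge crosses the community boundary
        vol_in += u_in + v_in             # degree mass inside the community
        vol_out += (not u_in) + (not v_in)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


def auc(pos_scores, neg_scores):
    """Probability that a random positive link is scored above a random
    negative link (ties count as 0.5, an assumption of this sketch)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For a probabilistic assignment, one would call `conductance` once per community after assigning each user to her top five communities, as described above. The quadratic loop in `auc` is only meant to mirror the definition; a rank-based formula would be preferable on large samples.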

$\bullet$ Profile quality.

Due to the lack of ground truth, we generally evaluate the quality of the content and diffusion profiles through the applications in Sect. 5. For community-aware diffusion, as there is no predefined threshold for diffusion link prediction, we again use AUC as the evaluation metric. For profile-driven community ranking, as the communities detected by different algorithms are different, for fair comparison we evaluate the quality of each ranked community in terms of its users: given a query $q$, we check how many users in the top $K$ ranked communities actually retweet (or cite) about $q$. Then naturally we compute precision and recall for each community in the ranking list. Denote the users who mention $q$ in their retweets (or citation paper titles) as $U^{*}_{q}$. Denote the users belonging to any of the top $K$ communities as $U_{K}$. The precision of the top $K$ communities for query $q$ is $P(K,q)=|U^{*}_{q}\cap U_{K}|/|U_{K}|$, and the recall is $R(K,q)=|U^{*}_{q}\cap U_{K}|/|U^{*}_{q}|$. We define mean average precision (MAP) over all the queries as $MAP@K=\sum_{q}(\sum_{i=1}^{K}P(i,q)/K)/|Q|$ and mean average recall (MAR) as $MAR@K=\sum_{q}(\sum_{i=1}^{K}R(i,q)/K)/|Q|$. Finally, we define the mean average F1 as $MAF@K=\frac{2\times MAP@K\times MAR@K}{MAP@K+MAR@K}$. The higher the MAF is, the better. In addition, as the content profile is based on topics, we adopt one extra widely used metric in topic modeling, perplexity [3], to evaluate its quality. Effectively, the perplexity of a content profile measures how well it generates the user content observations, and we use the same definition of perplexity as in [17]. The lower the perplexity is, the better.
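As a worked illustration of the ranking metrics, the following Python sketch evaluates $MAP@K$, $MAR@K$ and $MAF@K$ exactly as defined above. The function name `maf_at_k` and the dictionary-based input layout are hypothetical choices for this sketch; the handling of empty sets (returning 0) is also our assumption.

```python
def maf_at_k(ranked_communities, relevant_users, k):
    """Compute (MAP@K, MAR@K, MAF@K).

    ranked_communities : {query: [set of users of 1st community, 2nd, ...]}
    relevant_users     : {query: set of users who mention the query}, i.e. U*_q
    Both layouts are hypothetical conventions for this sketch.
    """
    map_sum = mar_sum = 0.0
    for q, ranking in ranked_communities.items():
        u_star = relevant_users[q]
        u_top = set()                     # U_i: users in the top-i communities
        p_sum = r_sum = 0.0
        for i in range(k):
            if i < len(ranking):
                u_top |= ranking[i]
            hits = len(u_star & u_top)
            p_sum += hits / len(u_top) if u_top else 0.0    # P(i, q)
            r_sum += hits / len(u_star) if u_star else 0.0  # R(i, q)
        map_sum += p_sum / k
        mar_sum += r_sum / k
    n_queries = len(ranked_communities)
    map_k, mar_k = map_sum / n_queries, mar_sum / n_queries
    total = map_k + mar_k
    maf_k = 2.0 * map_k * mar_k / total if total else 0.0
    return map_k, mar_k, maf_k
```

For example, with a single query whose relevant users are `{1, 2, 3, 4}` and whose top two communities are `{1, 2}` and `{3, 9}`, the precisions at $i=1,2$ are 1 and 0.75 and the recalls are 0.5 and 0.75, so $MAP@2=0.875$ and $MAR@2=0.625$.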

6.2 Model Design

We want to evaluate how well we address each community profiling challenge as introduced in Sect. 1. To achieve this goal, we design some baselines based on the degenerated versions of CPD, for validating the advantages of our model design. We compare CPD with these baselines, and evaluate the quality of detected communities and profiles through three tasks: community detection, friendship link prediction and diffusion link prediction.

$\bullet$ Modeling the inter-dependency with community detection.

We design a baseline “no joint modeling”, where we first detect communities only from the friendship links through a generative model by Eq. 3, then we extract the profiles through a generative model as in CPD except having the communities fixed. As shown in Figures 3(a)–3(f), ours is always better than “no joint modeling”.

$\bullet$ Addressing the heterogeneity of social observations.

We design a baseline “no heterogeneity”, where we adapt CPD to model friendship links and diffusion links in the same way by Eq. 3, but keep the other parts of CPD modeling unchanged. As shown in Figures 3(a)–3(f), ours is better than “no heterogeneity” on diffusion prediction, and comparable with it on community detection and friendship link prediction. This implies: 1) diffusion links and friendship links are different, and diffusion links require more sophisticated modeling than friendship links; 2) friendship links and diffusion links are correlated, so diffusion links do not significantly change the community structure once the friendship links are given.

$\bullet$ Accommodating the nonconformity of user behaviors.

We design two baselines:

1) “no individual & topic”, where we exclude the individual factor and topic factor from Eq. 5 in CPD;
2) “no topic”, where we exclude only the topic factor from Eq. 5 in CPD.

As shown in Figures 3(g) and 3(h), the individual factor contributes 4.8% and 6.8% absolute AUC improvement on Twitter and DBLP respectively; the topic factor contributes another 3.6% and 10.5% absolute AUC improvement on each data set.

In all, we conclude that our model design well addresses the three challenges in community profiling.

6.3 Comparison with Baselines

We evaluate CPD and the baselines on various applications.

6.3.1 Community-aware Diffusion

Quantitative analysis. In Fig. 4, we summarize the result comparison with the baselines introduced in Sect. 6.1. PMTLM is not applicable to Twitter, since it is designed solely for citation networks: it predicts a citation based on the similarity between two documents, but in Twitter a tweet and its retweet are almost identical. As shown in Fig. 4, our model consistently outperforms all the baselines, thanks to: 1) our modeling of various diffusion factors and heterogeneous user links, in contrast with the baselines in Table 4; 2) our joint detection and profiling, in contrast with the two “first detection then aggregation” baselines. When $|C|=100$, we achieve 24.2%–91.6% and 5.1%–108.0% relative AUC improvements over the baselines in Twitter and DBLP, respectively. The improvements are statistically significant over the 10-fold cross validation results, with a one-tailed Student’s $t$-test $p$-value of $p<0.01$.
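To make the two summary statistics concrete, here is a minimal Python sketch of a relative improvement and of a one-tailed significance check over ten folds. The text does not say whether the $t$-test is paired, so the paired form below is an assumption; the per-fold AUC values are hypothetical placeholders, not results from the paper. The sketch compares the statistic against 2.821, the one-tailed critical value of Student’s $t$ for 9 degrees of freedom at $p=0.01$.

```python
import math
import statistics

# One-tailed critical value of Student's t for df = 9 (ten folds), p = 0.01.
T_CRIT_DF9_P01 = 2.821


def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def paired_t_statistic(ours_folds, baseline_folds):
    """t statistic of the per-fold differences (paired test, an assumption).

    Requires at least two folds and non-identical differences.
    """
    diffs = [a - b for a, b in zip(ours_folds, baseline_folds)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# Hypothetical per-fold AUC values, for illustration only.
ours_auc = [0.80, 0.82, 0.81, 0.83, 0.79, 0.80, 0.82, 0.81, 0.80, 0.82]
base_auc = [0.70, 0.71, 0.72, 0.70, 0.69, 0.71, 0.70, 0.72, 0.71, 0.70]

t_value = paired_t_statistic(ours_auc, base_auc)
is_significant = t_value > T_CRIT_DF9_P01
```

With a statistics package available, one would normally obtain the $p$-value itself rather than compare against a tabulated critical value.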

Case study. We examine the three diffusion factors in Eq. 5 with the DBLP data. Firstly, in Fig. 5(a) we plot the number of papers a user cites w.r.t. her activeness, and the number of citations a user has w.r.t. her popularity. User activeness and popularity are defined in Sect. 3.1. Generally, the more active a user is (i.e., publishing more papers), the more papers she cites; besides, the more popular a user is (i.e., a more established researcher), the more citations her papers get. This observation supports our design of modeling both user activeness and popularity as the individual factors in diffusion.

In Fig. 5(b), we plot the number of papers and the number of citations w.r.t. a specific topic (e.g., “parallel performance memory”) over the years. As we can see, there is a high correlation between the number of papers and that of citations over time: if a topic is popular (i.e., it has many papers), then it is more likely to be cited (i.e., it appears in many citations). This observation supports our design of modeling the topic factor in Sect. 3.1.

Finally, in Fig. 5(c) we list the diffusion between two example communities, $c_{18}$ and $c_{32}$, which are the top two communities ranked for the query “router” in profile-driven community ranking (Sect. 6.3.2). As we can see, $c_{18}$ and $c_{32}$ tend to cite each other on topic $T_{22}$ (i.e., “network” as shown in Table 5). Besides, $c_{18}$ tends to cite $c_{32}$ on $T_{8}$ (i.e., “security”), whereas $c_{32}$ tends to cite $c_{18}$ on $T_{47}$ (i.e., “service”). This observation means that each community has a preference to diffuse other communities on certain topics. Thus it is necessary to model the community factor in diffusion.

6.3.2 Profile-driven Community Ranking

For community ranking, we follow several guidelines to choose queries:

1) it should be easy to assess whether a retweet or a citation contains a query, thus we choose single terms (i.e., either hashtags or words) as queries;
2) a query has to be meaningful: since words are noisy, we choose hashtags as queries in Twitter; DBLP has no hashtags, thus we choose words as queries, but we remove the top 1,000 frequent words;
3) a query has to appear with sufficient frequency in retweets or citations, thus we choose hashtags in Twitter and words in DBLP whose frequency is larger than 100.

In the end, we have 5,680 queries in Twitter and 27,479 queries in DBLP.

Given each query $q$, we rank the detected communities by Eq. 19, and then return the top $K$ (for $K=1,\ldots,20$).

Quantitative analysis. Fig. 6 compares our model with the baselines that support community-level content and diffusion modeling, including COLD, COLD+Agg and CRM+Agg. As we can see, our model consistently outperforms all the baselines; when $|C|=100$ and $K=5$, we achieve 27.6%–92.0% and 35.4%–150.8% relative MAF improvements over the baselines in Twitter and DBLP, respectively. All these improvements are statistically significant over the 10-fold cross validation results, with a one-tailed Student’s $t$-test $p$-value of $p<0.01$. Note that our model is better than COLD+Agg and CRM+Agg, again showing the advantage of joint detection and profiling. Besides, we observe that our model’s MAF@K starts to converge earlier than the baselines. This means we are able to find more relevant users in the top $K$ communities.

We also tested community ranking with different subsets of queries. We divided the queries according to their occurrence frequency in the corpus: we split the range between the minimal frequency and the maximal frequency into five equal-width intervals. For each interval, we tested community ranking with the subset of queries whose frequency falls within that interval. We observed a similar trend, namely that our model consistently outperforms the baselines. We also observed that the absolute MAF@K values are not sensitive to different query subsets.
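The query partition described above can be sketched in a few lines of Python. This is only an illustration of equal-width binning over the frequency range; the function name `split_by_frequency` and the treatment of the maximal frequency (placed in the last interval) are assumptions of the sketch.

```python
def split_by_frequency(query_freq, n_intervals=5):
    """Partition queries into equal-width frequency intervals.

    query_freq : {query: occurrence frequency in the corpus}
    Returns a list of `n_intervals` lists of queries, lowest interval first.
    """
    lo, hi = min(query_freq.values()), max(query_freq.values())
    width = (hi - lo) / n_intervals
    buckets = [[] for _ in range(n_intervals)]
    for query, freq in query_freq.items():
        if width == 0:
            idx = 0                       # all queries share one frequency
        else:
            # The maximal frequency would land at index n_intervals; clamp it.
            idx = min(int((freq - lo) / width), n_intervals - 1)
        buckets[idx].append(query)
    return buckets
```

Because query frequencies are usually heavy-tailed, equal-width intervals tend to put most queries into the lowest interval, which is worth keeping in mind when comparing results across subsets.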

Case study. We further examine the communities ranked by our model for a specific query. Table 6 lists the top three communities that are most likely to cite papers about “router”. AP@K is the average precision of the query “router” for the top $K$ communities; similarly, AR@K is the average recall and AF@K is the average F1. AF@K increases as $K$ increases, which is consistent with the trend observed in Fig. 6. Besides, according to Table 5, the top three communities to cite “router” are “network wireless sensor”, “security key authentication” and “circuits design”, all of which are reasonably related to networking.

6.3.3 Profile-driven Community Visualization

In Fig. 7, we visualize the DBLP community diffusion under aggregation of all topics, a general topic and a specialized topic, respectively. In total, we detect 50 communities and denote them as $c_{01}$–$c_{50}$. For each directed edge between two communities $c$ and $c^{\prime}$, the width indicates the diffusion strength. In Fig. 7(a), the strength is an aggregated value $\sum_{z}\eta_{c,c^{\prime}z}$ over all the topics; in Figures 7(b) and 7(c) the strength is $\eta_{c,c^{\prime}z}$ for a specific topic $z$. We skip the edges whose strengths are below average for simpler visualization.
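The edge selection for this kind of visualization can be sketched as follows in Python. The sketch assumes the diffusion profile is stored as a dictionary keyed by (source community, target community, topic), and that “average” means the mean strength over all edges in the view being drawn; both are assumptions, as are the function names.

```python
def aggregate_strengths(eta):
    """Sum diffusion strength over topics: {(c, c2, z): s} -> {(c, c2): sum_z s}."""
    aggregated = {}
    for (c, c2, _topic), strength in eta.items():
        aggregated[(c, c2)] = aggregated.get((c, c2), 0.0) + strength
    return aggregated


def topic_strengths(eta, topic):
    """Strengths for a single topic: {(c, c2): eta[c, c2, topic]}."""
    return {(c, c2): s for (c, c2, z), s in eta.items() if z == topic}


def edges_to_draw(strengths):
    """Keep only the edges whose strength is at least the average strength."""
    if not strengths:
        return {}
    average = sum(strengths.values()) / len(strengths)
    return {edge: s for edge, s in strengths.items() if s >= average}
```

Calling `edges_to_draw(aggregate_strengths(eta))` corresponds to the all-topic view, while `edges_to_draw(topic_strengths(eta, z))` corresponds to a single-topic view; self-loops such as `(c, c)` represent diffusion within a community.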

We can make several interesting observations from Fig. 7. Firstly, in Fig. 7(a) we find that, under topic aggregation, the communities often diffuse a lot within themselves. This coincides with our definition of a “community” as a group of users who share similar diffusion behavior: in this case, users of the same community often diffuse information to each other. Secondly, in Fig. 7(a) we also find that some communities are more “open” than the others. E.g., $c_{48}$ (“data database search”) and $c_{33}$ (“web information analysis”) are more open research communities, which diffuse information with most of the other communities. In contrast, $c_{09}$ (“neural control system”) appears as a more closed research community, which hardly diffuses information with other communities. Such a visualization enables us to assess the openness of a research community. Finally, we find that the diffusion behaviors vary w.r.t. different kinds of topics. E.g., Fig. 7(b) shows the diffusion on a very general topic (“web, information, search, semantic”), which can be discussed and diffused by many research communities. In contrast, Fig. 7(c) shows the diffusion on a very specialized topic (“transmission, gbs, trail, video”), which is of interest to only a few communities such as $c_{25}$ (“distributed performance computing”) and $c_{27}$ (“reliability device design”). This visualization reveals the topic generality and is helpful to researchers in choosing research topics.

6.3.4 Quality of Community and Content Profile

In addition to the three applications, we also conduct experiments to evaluate the quality of communities and content profiles. In Fig. 9, we show that our model consistently outperforms the baselines in terms of community quality. As COLD+Agg and CRM+Agg use the detection of COLD and CRM respectively, we do not include them in the comparison again. When $|C|=100$, we achieve: 1) 2.2%–5.8% (Twitter) and 3.5%–27.8% (DBLP) relative conductance improvements; 2) 7.8%–40.6% (Twitter) and 22.8%–143.5% (DBLP) relative AUC improvements. All the improvements are significant with $p$-values $p<0.01$. In general, we are better than COLD and PMTLM, as they do not model the friendship links in community detection; we are better than CRM, as it does not enforce dense friendship links in a community.

In Fig. 8, we compare with COLD+Agg and CRM+Agg in terms of the quality of content profiles. As we can see, our model achieves the lowest perplexity, meaning that our content profiles can best explain the user content observations. This supports our argument of joint modeling, as motivated in Eq. 1.

6.4 Scalability

In Fig. 10(a), we first show that our training time (per iteration, Alg. 1’s steps 3–10) scales linearly with the data set size. Each value $p$ (e.g., $p=0.1$) on the x-axis of Fig. 10(a) indicates that we randomly sample $(p\times 100)$ percent of the total documents, friendship links and diffusion links for experiments. We repeat ten times and report the average training time. We set $|C|=150$ and $|Z|=150$. Different $|C|$ and $|Z|$ can change the absolute training time, but they do not change the linearity of our training time w.r.t. the data set size. Moreover, we also show that our multithread parallelization achieves up to 4.5$\times$ and 5.7$\times$ speedup over the serial implementation in Twitter and DBLP respectively, by using eight CPU cores.

In Fig. 10(b), we plot the speedup with different numbers of CPU cores in parallelization. Generally, the speedup increases as more CPU cores are used. We observe that the speedup for the DBLP data set is bigger than that for Twitter. That may be because, compared with Twitter, DBLP users tend to have less diverse topics in their documents. This makes each data segment (defined in Sect. 4.3) more likely to have a single topic, which greatly reduces the inter-dependency between the data segments. In Fig. 11, we also plot the estimated workload and the actual running time of each CPU core. As we can see, our parallelization design achieves good workload balancing.
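As a small worked example of how the reported speedups can be read, the Python sketch below computes speedup and parallel efficiency (speedup divided by the number of cores). Parallel efficiency is a standard derived quantity that is not reported in the text above; the helper names are hypothetical.

```python
def speedup(serial_time, parallel_time):
    """Speedup of the parallel run over the serial run."""
    return serial_time / parallel_time


def parallel_efficiency(speedup_value, n_cores):
    """Fraction of ideal linear speedup achieved with `n_cores` cores."""
    return speedup_value / n_cores


# Speedups reported above for eight CPU cores.
twitter_efficiency = parallel_efficiency(4.5, 8)   # 0.5625
dblp_efficiency = parallel_efficiency(5.7, 8)      # about 0.71
```

Under this reading, the eight-core runs reach roughly 56% (Twitter) and 71% (DBLP) of ideal linear speedup, which is consistent with the observation that DBLP’s data segments are less inter-dependent.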

7 Conclusion

In this paper, we study a novel problem of community profiling. Community profiling is different from community detection, and its goal is to characterize each community with both its internal profile and external profile. Community profiling also enables many new community-level applications. The difficulty of community profiling is largely overlooked. Thus we propose a CPD model, which novelly identifies and addresses three key challenges, including the inter-dependency with community detection, the heterogeneity of social observations and the nonconformity of user behaviors. We also develop a scalable inference algorithm for the CPD model; it scales linearly with the data set size, and we further parallelize it with multithreading. In our experiments, we use two public, large-scale, real-world data sets. We extensively evaluate CPD in terms of its community detection quality and its community profile quality. We verify that our model design well addresses the three challenges. We also show that CPD outperforms the state-of-the-art baselines in a number of tasks, including community detection, friendship link prediction, community-aware diffusion, profile-driven community ranking and content profile evaluation.

In the future, we plan to explore other types of user information for defining the profiles, such as user attributes and user sentiments.

8 Acknowledgement

We thank the support of: National Natural Science Foundation of China (No. 61502418), Zhejiang Provincial Natural Science Foundation (No. LQ14F020002), Research Grant for Human-centered Cyber-physical Systems Programme at Advanced Digital Sciences Center from Singapore’s Agency for Science, Technology and Research (A*STAR), and NSF Grant IIS 16-19302. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Bibliography

[1] S. Basu Roy, L. V. Lakshmanan, and R. Liu. From group recommendations to group formation. In SIGMOD, pages 1603–1616, 2015.
[2] B. Bi, B. Kao, C. Wan, and J. Cho. Who are experts specializing in landscape photography?: Analyzing topic-specific authority on content sharing services. In KDD, pages 1506–1515, 2014.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[4] H. Cai, V. W. Zheng, F. Zhu, K. C.-C. Chang, and Z. Huang. Sociallens: Searching and browsing communities by content and interaction. In ICDE, 2017.
[5] J. Chang and D. M. Blei. Hierarchical relational models for document networks. Ann. Appl. Stat., 4(1):124–150, 2010.
[6] N. Chen, J. Zhu, F. Xia, and B. Zhang. Generalized relational topic models with data augmentation. In IJCAI, pages 1273–1279, 2013.
[7] W.-Y. Chen, D. Zhang, and E. Y. Chang. Combinational collaborative filtering for personalized community recommendation. In KDD, pages 115–123, 2008.
[8] J. D. Cruz, C. Bothorel, and F. Poulet. Community detection and visualization in social networks: Integrating structural and semantic information. ACM Trans. Intell. Syst. Technol., 5(1):11:1–11:26, 2014.