Scalable prediction of global online media news virality
Xiaoyan Lu, Boleslaw K. Szymanski

TL;DR
This paper introduces a scalable, community-based probabilistic framework for early prediction of news virality in online media, leveraging latent community structures to improve detection speed and accuracy.
Contribution
It presents a novel, scalable probabilistic model that exploits community structures for early prediction of news virality, with efficient parallelization for large-scale data.
Findings
20% improvement in early detection accuracy
Linear time complexity in number of reports
Orders of magnitude speedup through parallelization
Abstract
News reports shape the public perception of the critical social, political and economical events around the world. Yet, the way in which emergent phenomena are reported in the news makes the early prediction of such phenomena a challenging task. We propose a scalable community-based probabilistic framework to model the spreading of news about events in online media. Our approach exploits the latent community structure in the global news media and uses the affiliation of the early adopters with a variety of communities to identify the events widely reported in the news at the early stage of their spread. The time complexity of our approach is linear in the number of news reports. It is also amenable to efficient parallelization. To demonstrate these features, the inference algorithm is parallelized for message passing paradigm and tested on RPI Advanced Multiprocessing Optimized System…
| FG | LE | LP | ML |
|
|
|||||
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.858 | 0.833 | 0.943 | 0.881 | 0.933 | |||||
| LE | 0.873 | 0.807 | 0.867 | 0.795 | 0.837 | |||||
| LP | 0.864 | 0.829 | 0.835 | 0.773 | 0.804 | |||||
| ML | 0.929 | 0.881 | 0.868 | 0.922 | 0.949 | |||||
| Our Model | 0.869 | 0.820 | 0.817 | 0.936 | 0.930 | |||||
| Ground Truth | 0.939 | 0.865 | 0.852 | 0.963 | 0.925 |
| #Processors | 1 | 4 | 16 | 64 |
|---|---|---|---|---|
| ARS | 0.9588 | 0.9480 | 0.9444 | 0.9704 |
| AMI | 0.9858 | 0.9814 | 0.9808 | 0.9888 |
| Threshold | 0.5 Hour BL | 0.5 Hour ML | 0.6 Hour BL | 0.6 Hour ML | 0.7 Hour BL | 0.7 Hour ML | 0.8 Hour BL | 0.8 Hour ML | 0.9 Hour BL | 0.9 Hour ML | 1 Hour BL | 1 Hour ML |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90% | 0.410 | 0.500 | 0.410 | 0.497 | 0.410 | 0.499 | 0.499 | 0.549 | 0.499 | 0.548 | 0.534 | 0.594 |
| 91% | 0.412 | 0.484 | 0.405 | 0.486 | 0.405 | 0.480 | 0.484 | 0.534 | 0.490 | 0.536 | 0.532 | 0.584 |
| 92% | 0.389 | 0.473 | 0.394 | 0.472 | 0.392 | 0.471 | 0.463 | 0.530 | 0.463 | 0.531 | 0.524 | 0.578 |
| 93% | 0.294 | 0.455 | 0.296 | 0.453 | 0.294 | 0.453 | 0.434 | 0.516 | 0.436 | 0.511 | 0.497 | 0.561 |
| 94% | 0.207 | 0.433 | 0.209 | 0.433 | 0.213 | 0.436 | 0.393 | 0.489 | 0.393 | 0.491 | 0.460 | 0.542 |
| 95% | 0.191 | 0.404 | 0.191 | 0.409 | 0.195 | 0.408 | 0.352 | 0.465 | 0.347 | 0.462 | 0.442 | 0.515 |
| 96% | 0.147 | 0.393 | 0.141 | 0.389 | 0.147 | 0.389 | 0.287 | 0.438 | 0.285 | 0.442 | 0.383 | 0.502 |
| 97% | 0.114 | 0.360 | 0.116 | 0.366 | 0.113 | 0.361 | 0.233 | 0.402 | 0.232 | 0.407 | 0.326 | 0.465 |
| 98% | 0.097 | 0.311 | 0.091 | 0.316 | 0.097 | 0.316 | 0.160 | 0.365 | 0.166 | 0.357 | 0.236 | 0.386 |
| 99% | 0.119 | 0.288 | 0.107 | 0.292 | 0.104 | 0.279 | 0.148 | 0.326 | 0.148 | 0.324 | 0.152 | 0.324 |
- Corresponding author, E-mail: [email protected] Xiaoyan Lu and Boleslaw K. Szymanski are with the Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, 12180.
Manuscript received Dec 25, 2018; revised July 11, 2018
Abstract
News reports shape the public perception of critical social, political and economic events around the world. Yet, the way in which emergent phenomena are reported in the news makes the early prediction of such phenomena a challenging task. We propose a scalable community-based probabilistic framework to model the spreading of news about events in online media. Our approach exploits the latent community structure in the global news media and uses the affiliation of the early adopters with a variety of communities to identify the events widely reported in the news at the early stage of their spread. The time complexity of our approach is linear in the number of news reports. It is also amenable to efficient parallelization. To demonstrate these features, the inference algorithm is parallelized for the message passing paradigm and tested on the RPI Advanced Multiprocessing Optimized System (AMOS), one of the fastest Blue Gene/Q supercomputers in the world. Thanks to the community-level features of the early adopters, the model gains an improvement of 20% in the early detection of the most massively reported events compared to the feature-based machine learning algorithm. Its parallelization scheme achieves orders of magnitude speedup.
Index Terms:
Online Media, Information Cascades, Community Detection, Parallelization, Supercomputer
1 Introduction
Online media delivers news stories to the public every day. These reports shape people’s perception of the ongoing social, political and economic changes around them. Although numerous events are reported in the news every hour around the globe, only a few attract enormous attention from the online media: hundreds of news reports suddenly break out after a critical event happens. The burstiness of reporting behavior in the online media makes it challenging to predict which events will trigger viral news.
Many previous works [26, 27, 28] model information diffusion as epidemics in networks - the acceptance of information is viewed as an infection of a node by infected neighbors in the network. These works assume that there exists an explicit propagation pathway which is sufficient to explain the observed information diffusion [25] - messages can only spread along the predefined edges between nodes. Although research shows that the news reports of an event are usually confined to the geographical and cultural boundaries [23] between the global news sites, the explicit connections between pairs of individual news sites are typically unknown. Over-defining the edges by assuming that a co-reporting relationship might exist between any two news sites would inevitably result in an extremely dense network and require a number of parameters on the order of the square of the number of nodes. Therefore, we focus on modeling the news sites in the online media and propose a general probabilistic framework which directly infers the affiliation of nodes with the communities. Although our model does not use the explicit network topology, the node clustering based on these affiliations matches the community structure detected by traditional community detection algorithms in the explicit network topology. The early adopters of an information cascade, which are embedded in the so-inferred community structure, are used to predict the final cascade size.
In the global online news media, news sites usually have a preference for the content of their reports due to their regional reach. Although many media companies’ ambition is to have a global market presence, most media sites have only a regional reach [29]. This regional reach phenomenon is supported by surveys of local newscasts [7] that demonstrate the dominance of local news. Following the geographical and cultural boundaries among the global news sites, we model the clustering of news sites as community structures, which are also widely observed in a variety of technological, biological and social networks [8]. In the context of information cascades, the community structures play an important role in facilitating the local spread of messages [1] because community members are more likely to accept inputs from each other than from outsiders. On the other hand, the community structures slow the global diffusion [6] by trapping the news in dense regions and thus preventing global penetration. The experimental results show that, compared to machine learning models which extract point process-based features, our model exploiting the community-level signals improves the prediction accuracy by about 20%.
We also parallelize the inference algorithm for distributed memory machines. The proposed parallelization scheme uses the standard message passing approach to exchange data between different processors. We design an asynchronous inter-core communication paradigm so that the local computations occur simultaneously with the cores exchanging data with each other. We evaluate the MPI implementation of our parallelization scheme on the petaflops class IBM Blue Gene/Q supercomputer at RPI. The algorithm gains orders of magnitude speedup inferring the parameters for the large networks, and it scales well with the number of nodes, the number of cascades, and the number of communities in a network.
In [23], we introduced an initial work on discovery of viral cascades based on their initial period observations. This work was limited to sequential execution and the algorithm for cascade likelihood evaluation had quadratic time complexity because the probability of infection between every pair of nodes in the same cascade was computed. In contrast, our new model assigns a set of parameters to each cascade, modeling the infections of nodes by the cascades instead of the infections between every pair of nodes. Therefore, evaluating the likelihood of a cascade in our new model has a linear time complexity, and requires much less communication overhead than before when running the inference algorithm on distributed memory machines, making it scalable to large news media networks and large number of cascades.
To conclude, the major contributions of this paper are:
- A new modeling approach in which information diffusion is modeled at the community level instead of the level of the individual nodes, which reduces the computational complexity of the inference algorithm.
- A parallelization scheme based on the message passing paradigm that infers the community structure without the explicit topology of the network and gains orders of magnitude speedup in computing the parameters for large networks, which makes the approach scalable.
2 Approach
2.1 Community affiliation model with cascades
Community structures are widely observed in a variety of networks. Based on patterns of news about events spreading in the online news media, we present two basic observations below:
- Most news sites have a preference for the topics of their news coverage, e.g. politics, finance, education, military, technology or sports.
- The news reports are usually confined to the geographical and cultural boundaries. Although many media companies have ambitions to have global market presence, most media sites have only a regional reach [29].
Much like many community detection methodologies [30, 32, 35, 36] which aim at discovering dense sub-graphs embedded in a network, we seek to recover such community structure for the online news sites. However, in the global and local news markets [39], connecting every two sites reporting the same event could result in an extremely dense network. Finding community structure in such a network would not only incur a high computational cost but would also cloud the real connections among communities. Therefore, we find the latent community affiliation of these news sites rather than drawing edges between specific news sites. Since the underlying network topology is unknown, our probabilistic model incorporates the time points of each infection in the observed news cascades, i.e., the time of each news report. Formally, we present the definition of an information cascade below.
Definition 2.1 (Information Cascade). An information/news cascade is a set of infections. Each infection consists of a node and its infection time.
Our model assumes the news cascades happen at the community level. For a particular community of news sites, the probability of a site reporting a particular news item depends on both the affiliation of this news site with the community and the probability of the news cascade reaching the community. To formalize this idea, we define a latent community below.
Definition 2.2 (Latent Community). A latent community is a set of nodes such that, conditioned on the infection of one member, the probability that other members get infected within a limited period is much higher than for non-members.
In [9], the authors propose a community affiliation model where each node has a probability of belonging to one of the overlapping communities in the network. We extend this community affiliation model by taking into account the probability for each cascade to spread to these overlapping latent communities. As illustrated in Figure 1, if a cascade has a non-zero probability of spreading to a community, this cascade (square) is connected to the community (double circle); if a node (circle) belongs to a community, then this node is connected to the community as well. Our affiliation model captures two important features of cascades and communities in complex networks: (i) the community structures are highly overlapped, so a node can belong to multiple communities at the same time; (ii) the members of the same community tend to accept similar information during the information cascades.
Social theories [15] suggest that the response time of human beings usually follows the exponential distribution, which explains the burstiness of social behavior in many scenarios. Our analysis of 723,037 randomly sampled news reports shows that the delays of news reports also fit the exponential distribution with a high score of 92% (we consider the news reports with a delay less than 21 hours, which comprise 99% of the sampled data from the Global Database of Events, Language, and Tone (GDELT) dataset, http://www.gdeltproject.org). Based on this observation, we model the response time of news sites to events by an exponential distribution.
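As a small illustration of modeling delays with an exponential distribution, the sketch below estimates the rate by maximum likelihood, which for the exponential distribution is the reciprocal of the sample mean. The data here is synthetic and the function names are hypothetical; the text does not say which goodness-of-fit score produced the 92% figure, so no such score is computed.

```python
import random

def fit_exponential_rate(delays):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    if not delays:
        raise ValueError("need at least one delay")
    return len(delays) / sum(delays)

# Synthetic delays (assumption: these stand in for report delays in hours).
rng = random.Random(0)
true_rate = 0.5
synthetic_delays = [rng.expovariate(true_rate) for _ in range(20000)]
estimated_rate = fit_exponential_rate(synthetic_delays)
```

With enough samples, `estimated_rate` should land close to `true_rate`.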
For every single community , let denote the strength of affiliation of node with it and denote the probability of the cascade spreading to community . In this community, the response time of node to cascade follows the exponential distribution
[TABLE]
which indicates that the node gets quickly infected when the cascade is highly likely to reach a community , i.e. is large, and the affiliation of with community is strong, i.e. is large.
Most real-world networks have overlapping community structures which allow a node to belong to many communities. Therefore, in our model, the response time of a node to a cascade corresponds to the minimum response time over all communities. Given a node, a cascade and a set of overlapping communities, the minimum response time also follows the exponential distribution (the minimum of mutually independent exponentially distributed random variables is itself exponentially distributed)
[TABLE]
where the rate parameter is the sum of the rate parameters for each individual exponential distribution in Eq. 1.
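The statement that the minimum of independent exponential response times is again exponential, with the rates added up, follows from a one-line survival-function argument. The symbols $X_k$, $\lambda_k$ and $K$ below are placeholder names chosen for this derivation (one response time and one rate per community), not necessarily the paper's notation.

```latex
P\Bigl(\min_{1 \le k \le K} X_k > t\Bigr)
  = \prod_{k=1}^{K} P(X_k > t)
  = \prod_{k=1}^{K} e^{-\lambda_k t}
  = \exp\Bigl(-t \sum_{k=1}^{K} \lambda_k\Bigr),
  \qquad t \ge 0 .
```

The right-hand side is the survival function of an exponential distribution with rate $\sum_{k} \lambda_k$, which is why the per-community rates can simply be summed.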
In reality, news agencies usually have some bias for the content of information they spread [3, 4, 5]. Such bias can be positive, when there are certain types of information a news agency favors, or negative, when messages of no interest to the agency’s audience are ignored or intentionally blocked. For this reason, it is tempting to allow the affiliation strength to be negative, so that the affiliation with some community may delay the response time of some node. However, this could result in a negative rate parameter in Eq. 2, which violates the constraint of the exponential distribution. Hence, we smooth the rate parameter via the sigmoid function. In this way, given an information cascade, the response time of a node is drawn from an exponential distribution with the rate parameter
[TABLE]
where is the sigmoid function, is a scaling parameter and represents the inner product in vector space, while the -th component of vector and are and respectively. Using the sigmoid function here avoids any constraint for the parameters and , and most importantly, it improves the robustness of the parameter estimation because the sigmoid function is differentiable at each point.
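A minimal sketch of the smoothed rate described above, assuming the scaling parameter multiplies the sigmoid of the inner product between a node's affiliation vector and a cascade's community vector. The names `node_vec`, `cascade_vec` and `alpha` are placeholders for this illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def response_rate(node_vec, cascade_vec, alpha=1.0):
    """Rate of the exponential response time: alpha * sigmoid(<node_vec, cascade_vec>).

    Assumption: alpha acts as a multiplicative scale on the sigmoid output.
    """
    if len(node_vec) != len(cascade_vec):
        raise ValueError("vectors must have one component per community")
    z = sum(f * h for f, h in zip(node_vec, cascade_vec))
    return alpha * sigmoid(z)
```

Because the sigmoid maps any real number into (0, 1), the rate stays strictly positive even when some affiliation components are negative, which is the property the paragraph relies on.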
So far, the model takes into account the participants of a cascade. However, the nodes which are not involved in a cascade also provide information about their affiliations with different communities. Since these silent nodes are equally important, we set an upper bound on the response time, such that all the nodes that remain silent during the cascade propagation are assumed to have this long response time.
Given an information cascade where the -th node has response time , the likelihood of observing a cascade is
[TABLE]
where denotes the set of nodes involved in cascade and is the probability density function of the exponential distribution
[TABLE]
Since the likelihood is the product of terms, Eq. 4 can be too expensive to compute. But it can be approximated by using a subset of representatives drawn randomly from . This idea is similar to the negative sampling approach [12], which has been successfully applied to learning the distributed representation of words in documents [11, 10]. The log-likelihood of an observed cascade then becomes
[TABLE]
where is the set of negative samples chosen for every cascade by drawing random nodes uniformly from . If we fix the size of each as , then a speedup of approximately times can be achieved.
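The negative-sampling approximation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the rate is taken to be `alpha * sigmoid(inner product)`, silent nodes are charged the exponential density at the upper-bound response time `silent_time`, and negatives are drawn uniformly from the nodes that did not take part in the cascade. All names are hypothetical.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rate(node_vec, cascade_vec, alpha):
    return alpha * sigmoid(sum(f * h for f, h in zip(node_vec, cascade_vec)))

def log_exp_pdf(t, lam):
    """Log-density of an exponential distribution with rate lam at time t."""
    return math.log(lam) - lam * t

def cascade_log_likelihood(infections, node_vecs, cascade_vec,
                           silent_time, num_negative, alpha=1.0, rng=None):
    """Approximate log-likelihood of one cascade.

    infections: {node: response_time} for nodes that reported the event.
    node_vecs:  {node: affiliation vector} for all nodes.
    Only `num_negative` randomly chosen silent nodes are evaluated, each
    with the long response time `silent_time`.
    """
    rng = rng or random.Random(0)
    ll = sum(log_exp_pdf(t, rate(node_vecs[n], cascade_vec, alpha))
             for n, t in infections.items())
    silent = [n for n in node_vecs if n not in infections]
    for n in rng.sample(silent, min(num_negative, len(silent))):
        ll += log_exp_pdf(silent_time, rate(node_vecs[n], cascade_vec, alpha))
    return ll
```

The cost per cascade is proportional to the number of infected nodes plus the number of negative samples, instead of the total number of nodes.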
To estimate the parameters for each and for each , we maximize the likelihood which factorizes into the product of the likelihoods of cascades. This goal is equivalent to maximizing the sum of the log-likelihood of cascades
[TABLE]
It is worth noting that the problem defined by Eq. 7 does not require explicit network topology to estimate and . Instead, the input of the model is the response times of the nodes to every cascade. This is a practical setting when the underlying network topology is incomplete or hidden during the information propagation process. In addition, the parameter space in Eq. 7 does not have any restriction thanks to the adoption of the sigmoid function in Eq. 3.
2.2 Parameter estimation
The optimization problem in Eq. 7 is unfortunately not convex. Since the network size can be large, the optimization problem involves a large number of parameters, which makes stochastic updates more suitable than the batch methods. In addition, when some new cascade data comes in, the estimation algorithm should be able to incorporate the new cascades efficiently. For these reasons, we apply the Stochastic Gradient Ascent (SGA) method to estimate the parameters.
If we substitute Eq. 6 into Eq. 7, the partial derivative of the objective function in Eq. 7 over a particular becomes
[TABLE]
which is a weighted sum of the terms in the form of . Given the value of , depends only on and . The partial derivative of over can be computed using and
[TABLE]
Here we need the for to update according to the gradient in Eq. 8. Similarly, the partial derivative of the objective function in Eq. 7 over can be computed using the for cascades such that
[TABLE]
where the term depends on and only
[TABLE]
In this way, the parameters can be updated in a pair-wise manner: we fix for all and update all the s, and then fix all s to update s in every SGA iteration. The SGA updates can operate on a bipartite graph where node and cascade are connected if (cf. the example shown in leftmost diagram in Figure 2). To update , each cascade propagates the corresponding parameters through the links in the bipartite graph to the node . Then node calculates the partial derivative for every connected cascade and updates accordingly. Similarly, the nodes can propagate the parameters s through the links in this bipartite graph towards each relevant cascade , and can be updated using the it receives.
The pseudo code is shown in Algorithm 1. The time complexity of each SGA iteration here is linear in the number of edges in the bipartite graph. This is because the loop in lines 8-19 iterates over all the nodes connected with each cascade , which corresponds to visiting every edge of the bipartite graph exactly once. Similarly, the lines 20-31 also visit every edge in the bipartite graph once. In the news dataset, the number of edges of the bipartite graph is defined by the number of news reports, because every report connects a news site to a news cascade, plus the pair of negative samples edges for with fixed as a constant. Hence, the time complexity of each SGA iteration is linear in the number of edges in the bipartite graph. The time complexity of Algorithm 1 is also determined by the number of SGA iterations. In practice, this only involves tens of iterations before the derived and vectors become stable. Thus, this number can be treated as a constant. Therefore, the Algorithm 1 has a linear time complexity in the number of bipartite graph edges, i.e. the number of news reports.
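The alternating update over the bipartite graph can be sketched as below. This is a simplified stand-in for Algorithm 1, not a reproduction of it: it assumes the rate `alpha * sigmoid(<F_u, H_c>)`, applies per-edge updates with a fixed learning rate `lr`, and treats negative-sample edges as ordinary edges carrying the upper-bound response time. `F`, `H` and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def edge_log_likelihood(f_u, h_c, t, alpha):
    lam = alpha * sigmoid(dot(f_u, h_c))
    return math.log(lam) - lam * t

def edge_gradient_scale(f_u, h_c, t, alpha):
    """Derivative of the edge log-likelihood w.r.t. z = <f_u, h_c>.

    With lam = alpha * sigmoid(z): d/dz [log lam - lam * t]
    = (1 - sigmoid(z)) * (1 - lam * t).
    """
    s = sigmoid(dot(f_u, h_c))
    return (1.0 - s) * (1.0 - alpha * s * t)

def total_log_likelihood(edges, F, H, alpha):
    return sum(edge_log_likelihood(F[u], H[c], t, alpha) for u, c, t in edges)

def sga_iteration(edges, F, H, alpha, lr):
    """One iteration: each pass visits every bipartite edge exactly once."""
    # Pass 1: cascade vectors held fixed, node vectors updated.
    for u, c, t in edges:
        g = edge_gradient_scale(F[u], H[c], t, alpha)
        F[u] = [f + lr * g * h for f, h in zip(F[u], H[c])]
    # Pass 2: node vectors held fixed, cascade vectors updated.
    for u, c, t in edges:
        g = edge_gradient_scale(F[u], H[c], t, alpha)
        H[c] = [h + lr * g * f for f, h in zip(F[u], H[c])]

# Toy data: (node, cascade, response time); the 5.0 entry mimics a silent node.
F = {"a": [0.1, -0.2], "b": [0.3, 0.1]}
H = {"c1": [0.2, 0.1], "c2": [-0.1, 0.3]}
edges = [("a", "c1", 0.5), ("b", "c1", 1.0), ("a", "c2", 5.0), ("b", "c2", 0.2)]
alpha = 2.0
ll_before = total_log_likelihood(edges, F, H, alpha)
for _ in range(50):
    sga_iteration(edges, F, H, alpha, lr=0.05)
ll_after = total_log_likelihood(edges, F, H, alpha)
```

Both passes are single loops over `edges`, which is where the linear cost per iteration in the number of bipartite edges comes from.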
2.3 Parallelization for distributed memory machines
In practice, the input network size can be large so to speed up computation we offer parallelization of the parameter estimation for our model. Since the SGA algorithm is inherently sequential, many works including [10, 11] use the Hogwild! framework [13] attempting to parallelize the SGA algorithm on shared memory machines. The Hogwild! framework ignores the write-write conflicts caused by parallel updates on the same parameter as long as one unique processor can complete its writing operation. The quality of the results produced by SGA algorithm is guaranteed by the property that such conflicts are sparse enough. In the information diffusion context, however, such data sparsity is uncertain. To handle the contentions between processors, we propose a scalable parallelization scheme based on message passing paradigm [40].
The parallelization scheme is illustrated in Figure 2. Consider eight nodes (blue) involved in eight cascades (yellow) in this toy example. The bipartite graph connects every pair of associated nodes and cascades in the SGA updates. These nodes and cascades are then distributed in this example to two processors. The first processor owns the upper four nodes and upper four cascades while the remaining nodes and cascades are assigned to the second processor. Each processor creates private memory space for the parameters of the nodes and cascades it owns. Much like Algorithm 1, the SGA algorithm propagates parameters back and forth between nodes and cascades, the only difference here is that the propagation between the cascades and nodes owned by different processors requires inter-core communication.
During every SGA iteration, each node propagates its to all the connected cascades in the bipartite graph. If these cascades are located in the same processor which owns node , then this propagation operation is done locally. Otherwise, is sent asynchronously to the processor owning node . The pseudo code for the parallelization scheme is shown in Algorithm 2 and Algorithm 3. Without loss of generality, we consider updating s using s, i.e. the line 14 in Algorithm 2. One SGA iteration consists of the following three phases:
- (a) Message passing: Every processor sends the s it owns to the target processors which require those s to update their corresponding s, cf. the lines 1-3 in Algorithm 3.
- (b) Local updates: Every processor updates the it owns using the gradients if it also owns , cf. the lines 4-6 in Algorithm 3.
- (c) Remote updates: After all the remote s sent in phase (a) have been received, each processor updates using the gradients whose computation requires some of the received s, cf. the lines 8-10 in Algorithm 3.
Note that each processor should conduct the local updates prior to the updates which require remote data from other processors, i.e. the phase (b) occurs before the phase (c). In this way, the local computation time and inter-core communication time overlap with each other, improving the parallelization efficiency.
Figure 2 illustrates this parallelization scheme. The three phases described above are represented by the three diagrams following the first diagram showing the initial stage in Figure 2. The parameters propagate back and forth between node layer and cascade layer iteratively. If the connected node and cascade are in the different processors, the parameters are sent via asynchronous communication, i.e. the blue dashed lines labeled “ISend”. At the same time, the local parameter propagation occurs within each processor, as indicated by the bold black arrows in the middle. Using the same protocol, we can update s using the values of s.
After distributing nodes and cascades to the processors, each processor creates its private memory space for the node and cascade parameters it owns, and ghost memory space for those parameters that are connected to its nodes or cascades in the bipartite graph but owned by other processors. In every SGA iteration, once the ghost memory is filled with the received data, it will not be written again. For example, suppose a processor owns two nodes, both involved in a cascade which is owned by another, remote processor. To update the parameters of the two nodes, the parameters of the cascade are sent asynchronously to this local processor. But they will be sent only once regardless of the number of associated nodes owned by the local processor, because they can be shared by the updates of both nodes. This optimization is similar to the combiner applied to the message queues in Pregel-like parallel graph processing systems [37, 38] to avoid sending duplicated messages to the same target processor.
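The bookkeeping behind the three phases and the ghost memory can be illustrated without any MPI calls. The sketch below only plans one "update node parameters from cascade parameters" step: it splits edges into those a processor can handle locally and those that wait for remote data, and it builds a send plan in which each cascade is sent at most once per target processor. The function and argument names are hypothetical, and the actual asynchronous communication is left out.

```python
from collections import defaultdict

def plan_node_update(edges, node_owner, cascade_owner):
    """Plan one step that updates node parameters from cascade parameters.

    edges:         iterable of (node, cascade) pairs in the bipartite graph.
    node_owner:    {node: processor id}
    cascade_owner: {cascade: processor id}

    Returns (local_edges, remote_edges, sends) where
      local_edges[p]    - edges whose node and cascade are both owned by p,
      remote_edges[p]   - edges whose node is owned by p but whose cascade is not,
      sends[(src, dst)] - set of cascades src must send to dst (deduplicated).
    """
    local_edges = defaultdict(list)
    remote_edges = defaultdict(list)
    sends = defaultdict(set)
    for node, cascade in edges:
        p_node = node_owner[node]
        p_cascade = cascade_owner[cascade]
        if p_node == p_cascade:
            local_edges[p_node].append((node, cascade))
        else:
            remote_edges[p_node].append((node, cascade))
            # A set keeps one entry per cascade and target, like a combiner.
            sends[(p_cascade, p_node)].add(cascade)
    return dict(local_edges), dict(remote_edges), dict(sends)

node_owner = {"u1": 0, "u2": 0, "u3": 1}
cascade_owner = {"c1": 1, "c2": 0}
edges = [("u1", "c1"), ("u2", "c1"), ("u3", "c1"), ("u1", "c2")]
local_edges, remote_edges, sends = plan_node_update(edges, node_owner, cascade_owner)
```

In this toy setup, processor 0 needs cascade `c1` for two of its nodes, yet the plan contains a single message from processor 1 to processor 0. A processor would work through `local_edges` while the messages in `sends` are in flight, then process `remote_edges` once they have arrived.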
2.4 Forecast viral cascade via its early adopters
Our aim is to forecast the viral information cascade. From the historical cascades, the proposed model estimates the vector for each according to the observed response times. Using these vectors of the initially infected nodes, we seek to predict the behavior of future cascades.
Suppose a set of so-called early adopters have been infected within a limited time period. One basic observation is that, once the contagion reaches a community member, the probability that other members get infected increases. Therefore, we can make use of the infected node’s local neighborhood to predict future infections. Since our model represents every node by a vector in the latent space, it is easy to find the neighbors close to a given node by measuring the Euclidean distances between their vectors. As the contagion is likely to spread fast in dense areas, the number of neighbors within a certain range can be used as an indicator of future infections. Hence, we count the number of neighbors located within a certain range from each infected node, and arrange these values in a vector whose components, one per neighborhood radius, are defined as
[TABLE]
where is the radius of the -th neighborhood of node and denotes the Euclidean norm.
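A minimal sketch of the per-node neighbor counts, assuming a neighbor is any node whose vector lies within the given Euclidean distance (inclusive) and that the node counts itself. The dictionary layout and names are hypothetical.

```python
import math

def neighbor_counts(center, vectors, radii):
    """Count nodes within each radius of `center` in the latent space.

    vectors: {node: coordinate tuple}; radii: increasing list of radii.
    The center node is included in every count (distance 0).
    """
    origin = vectors[center]
    return [sum(1 for v in vectors.values() if math.dist(origin, v) <= r)
            for r in radii]

vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0)}
counts_a = neighbor_counts("a", vectors, [0.5, 1.5, 5.0])
```

This brute-force version scans all nodes for every radius; a spatial index would be the natural replacement on large networks.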
Figure 3 demonstrates how to compute this vector for an infected node. Suppose the infected nodes are located at the centers of the different circles. Their neighborhoods are marked by circles of three different radii. In Figure 3, the node at the right center has 3 neighbors inside the small circle, 7 neighbors inside the medium circle and 12 neighbors inside the large circle, resulting in the vector (3, 7, 12). The leftmost node has only 1 neighbor, i.e. itself, inside both the small and medium circles and 5 neighbors inside the large circle, resulting in the vector (1, 1, 5). Intuitively, we can tell from these two vectors that the three neighborhoods on the right allow faster growth of infections than the ones on the left.
Given a set of early adopters which get infected within a limited time period, we can similarly count the total number of unique neighbors in their local neighborhoods within different radii. Note that if a node lies in the neighborhood of two early adopters for the same radius, it is counted only once in the corresponding component. Finally, the numbers of neighbors, presented as the components of a multi-dimensional vector, are fed to a machine learning model to predict the final size of a cascade. The specific experimental configuration is presented in more detail in Section 4.4.
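The cascade-level feature vector can be sketched as the size of the union of the early adopters' neighborhoods, one component per radius. As before, the inclusive distance test and all names are assumptions made for this illustration.

```python
import math

def cascade_features(early_adopters, vectors, radii):
    """Unique nodes within each radius of at least one early adopter.

    vectors: {node: coordinate tuple}. A node close to several early
    adopters is counted once per radius because membership is tracked in a set.
    """
    features = []
    for r in radii:
        covered = {node for node, v in vectors.items()
                   if any(math.dist(vectors[a], v) <= r for a in early_adopters)}
        features.append(len(covered))
    return features

vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (10.0, 0.0)}
features = cascade_features(["a", "c"], vectors, [1.0, 9.0])
```

Here node `b` is within distance 1 of both early adopters but contributes only once to the first component.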
3 Related Work
In this section, we review the related works and summarize the relationship between our model and these works.
3.1 Network Embedding
Network embedding aims at representing graph structures by low-dimensional vectors so that they can be exploited by machine learning models. Compared to the classical dimension reduction algorithms such as multidimensional scaling (MDS) [42], IsoMap [43] and Laplacian eigenmap [44], whose time complexity is on the order of the square of the number of nodes, many recent works [16, 45, 46] take advantage of a softmax-based objective function to efficiently learn the representation of network nodes in a distributed manner [11]. Our work uses a similar approach to avoid quadratic time complexity in the inference algorithm.
3.2 Community Detection
Community structures are widely observed in a variety of technological, biological and social networks. In the context of information diffusion, [26] shows that community structures are important for the prediction of viral cascades in social networks; however, it relies on traditional community detection algorithms which require the explicit network topology. Recent works [48, 47, 9] adopt the non-negative matrix factorization approach to detect communities. In these works, the cells in the factorized matrices indicate the affiliation of nodes with the communities, and the inner product of two nodes’ vector representations corresponds to the probability of observing an edge between them. Based on this framework, we extend the community affiliation model to consider information cascades and negative affiliations so that the community-preserving vector representation can be efficiently obtained without the explicit network topology.
3.3 Parallel Graph Processing
Inspired by Valiant’s Bulk Synchronous Parallel (BSP) model [49], Pregel [37] conducts a sequence of iterations, called supersteps, to support efficient large graph processing. In Pregel, the computation involving individual nodes in a network occurs in parallel during every superstep, while the communication across different network nodes occurs at the same time. The safety of the parallel algorithm is ensured because a new superstep starts only after the communication of the previous superstep is done. GraphLab [50] also introduces a similar approach in which the synchronization between network nodes happens between supersteps. Our proposed parallelization design adopts this approach to speed up the inference algorithm, while avoiding contention between processors.
4 Experimental results
In this section, we evaluate the proposed model and the parallelized inference algorithm using both synthetically generated cascades as well as the real global news reports data. Due to the lack of propagation topology in the global news media, we compare the communities discovered by our model with the communities detected by several state-of-the-art algorithms and the predefined communities in the synthetic networks. In our experiments of virality prediction, we focus on the classification of the events most reported in the news and present its accuracy measured by F1 score.
4.1 Datasets
4.1.1 Synthetic cascades
We simulate cascades in synthetic networks generated by the Stochastic Block Model (SBM) [18], where the community structures are pre-defined. We choose the SBM as the random graph model to generate the network topology because it provides well-defined community structures and is computationally efficient, and therefore suitable for generating large networks [31]. Given the network, the simulation of the cascading process is based on the Independent Cascade (IC) model [17]. In the IC model, the infection time of a node is the earliest time at which one of its neighbors infects it, i.e. a node can only be infected once. After a node gets infected, it starts to spread the contagion to its uninfected neighbors. Compared to the linear threshold model, in which the diffusion process unfolds deterministically, the IC model treats information diffusion as a probabilistic process: if two nodes are connected and one of them is infected, then in every discrete step the infected node infects the other with a given probability. As an extension of the IC model, the infection delay can be modeled in continuous time [19]. Given the infection times of the neighbors of a node, the infection time of that node can be expressed as
[TABLE]
where is a parameter associated with the edge in the propagation network and is the distribution of infection delays. In our experiments, is set to the exponential distribution, which is observed in many social dynamics [15], and we set : for simplicity. In theory, the IC model supposes that the entire network will be infected given a sufficiently long period. Since news cascades have a very limited time span, in our experiments the simulation of every cascade happens within a predefined observation window [19].
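The simulation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the function names `sbm_graph` and `simulate_cascade`, the edge probabilities `p_in` and `p_out`, the single exponential `rate` shared by all edges, and the `window` length are all hypothetical choices made for brevity.

```python
import heapq
import random


def sbm_graph(sizes, p_in, p_out, rng):
    """Undirected SBM: same-block pairs are linked with p_in, others with p_out."""
    block = [b for b, size in enumerate(sizes) for _ in range(size)]
    n = len(block)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block[u] == block[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj, block


def simulate_cascade(adj, seed, rate, window, rng):
    """Continuous-time IC with exponential edge delays (assumed shared rate).

    A node's infection time is the earliest arrival over its infected
    neighbors; infections that would occur after `window` are not observed.
    """
    times = {}
    heap = [(0.0, seed)]
    while heap:
        t, v = heapq.heappop(heap)
        if v in times:  # a node can only be infected once
            continue
        times[v] = t
        for u in adj[v]:
            if u not in times:
                arrival = t + rng.expovariate(rate)
                if arrival <= window:
                    heapq.heappush(heap, (arrival, u))
    return times


rng = random.Random(0)
adj, block = sbm_graph([10, 10], p_in=0.8, p_out=0.05, rng=rng)
times = simulate_cascade(adj, seed=0, rate=1.0, window=2.0, rng=rng)
```

Processing arrivals in time order with a priority queue guarantees that the first arrival at a node is its infection time, which matches the "earliest infecting neighbor" rule of the IC model.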
4.1.2 Data about global news of events
The Global Database of Events, Language, and Tone (GDELT) project [39] (http://www.gdeltproject.org/) records the news reports of thousands of news sites around the world. It translates news from 65 languages into English and identifies the same events reported by different news sites. The dataset is currently available on the Google Cloud platform.
Since the GDELT dataset has a bias towards US domestic news, we choose the 2000 most active news sites for each country and 500 random events reported in the news between July 1st, 2017 and July 19th, 2017 in the corresponding region. The sampled dataset consists of 19795 news sites and 26752 events reported in the news, where every event is reported by 27 news sites on average. Although the GDELT dataset does not indicate the connections between any pair of news sites, [23] found that reports of an event are usually confined to the geographical and cultural boundaries of the event, which matches our model’s assumption that information cascades are likely to happen inside communities.
4.2 Alignment of community structures with node vectors’ clustering
Our model produces a vector for each node in the network. If these vectors preserve the community structures of the news media network well, their clustering should match the community structure embedded in the explicit network topology, because the members of a community have similar vectors.
Given a synthetic SBM network, we simulate the cascades as described in Section 4.1.1. Our model then infers the vectors from these cascades. We compare the node clustering of these vectors with the ground truth partition of the SBM network and with the community structures discovered by traditional community detection algorithms from the explicit topology. (For clarity, the set of nodes detected in a network is called a community and the set of nodes clustered by their vector representations is called a cluster.) The alignment between them indicates that the vectors produced by our model preserve the community structures.
More specifically, the K-means clustering algorithm [22] is executed on the so-inferred vectors to derive the node clustering. The similarity between the node clustering and the contrastive partitioning of the network is measured by the Adjusted Mutual Information (AMI) and the Adjusted Rand Score (ARS), which are widely used to evaluate community detection performance.
Adjusted Mutual Information (AMI) [20] is defined as
[TABLE]
where is the node clustering, each set contains the nodes in a single cluster and is the contrastive partition of the SBM network; the entropy associated with the partition is defined as
[TABLE]
and the mutual information (MI) between and is defined as
[TABLE]
where
[TABLE]
Adjusted Rand Score (ARS) [21] computes the similarity by comparing all pairs of nodes that are assigned to the same or different communities in the two partitions:
[TABLE]
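To make the two similarity measures concrete, the following sketch computes the entropy, the mutual information and the Adjusted Rand Score of two partitions given as label lists. It uses the standard textbook forms of these quantities, which may differ in notation from the equations of this paper, and it deliberately omits the correction for chance that turns MI into AMI. The function names are illustrative.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy of a partition given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two partitions of the same nodes."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # P(i,j) * log(P(i,j) / (P(i) * P(j))) with probabilities as counts / n
    return sum((c / n) * log(c * n / (pa[i] * pb[j])) for (i, j), c in pab.items())


def adjusted_rand_score(a, b):
    """Pair-counting similarity of two partitions, corrected for chance."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 0, 0, 2, 2]   # same partition, different label names
coarse = [0, 0, 0, 1, 1, 1]      # a different partition
```

Both measures depend only on how nodes are grouped, not on the label names, so `relabeled` scores as a perfect match with `truth` while `coarse` does not.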
Figure 4 shows the growth of the similarity between the node clustering of the vectors at each SGA iteration and the ground truth partition of the SBM network. The SBM network has 100 nodes and 5 communities of size 20; 190 edges connect nodes in the same community and 12 edges cross different communities. A total of 100 cascades are simulated, each involving 12.5 infections on average. The dimension of is . After each SGA iteration of Algorithm 1, we compute a distance matrix whose entries are the Euclidean distances between the latest updated vectors of every pair of nodes. The plots for two networks in Figure 4 are made by the MDS algorithm [42], which places each node in two-dimensional space such that the derived between-node distances are preserved as well as possible. In other words, the MDS coordinates preserve the nodes’ pairwise distances in the high-dimensional space of the vectors. At the 5th iteration, the ARS and AMI scores are around 0.4, and the nodes’ MDS coordinates do not reflect their ground truth communities, which are represented by color. When the 11th SGA iteration is done, the ARS and AMI scores become greater than 0.9, and the nodes’ MDS coordinates match the community structures very well. Figure 4 shows that, as the inference algorithm proceeds, the vectors start to preserve the community structures, even though our model takes only the infection delays in the cascades as input, and not the explicit network topology.
In addition, we compare the node clustering of vectors at the 15th iteration with the ground truth partition of the SBM network and with the community structures detected by state-of-the-art algorithms such as the Fast Greedy algorithm [32], the leading eigenvector method [33], the label propagation algorithm [34] and the multilevel algorithm [30]. Table I shows that the alignments between them are good, and that the node clustering of vectors is more similar to the ground truth partition than the community structures detected by some state-of-the-art community detection algorithms. In Table I, each entry indicates the similarity of the communities produced by a particular pair of methods. All the entries below the main diagonal correspond to the ARS scores and the entries above it correspond to the AMI scores. As the ARS and AMI scores indicate, our model produces meaningful node embeddings because the clustering of these node vectors aligns well with the ground truth communities. The community structure obtained by clustering node vectors is even closer to the ground truth than are the community structures detected by baseline methods such as the leading eigenvector method (LE) and the label propagation algorithm (LP). In addition, our model does not use the topology of the SBM network as the baseline algorithms do; instead, it only accesses the cascade data, which explains why the Fast Greedy algorithm (FG) and the multilevel algorithm (ML) perform better than our model in terms of community detection. Finally, it should be noted that we set the number of clusters to 5 for the K-means algorithm here. However, this number should be selected systematically. We leave the selection of the proper number of clusters for future work.
We also test our algorithm on large SBM networks with the dimension of the resulting vectors being 200. In these experiments, there are 100 predefined communities in the SBM network, each containing 100 nodes. On average, every node is connected to 8.8 nodes in the same community and 1.2 nodes in other communities. We simulate 10K cascades using the continuous-time IC model. As shown in Table II, as the number of processors increases, the AMI and ARS metrics are consistently above 0.98 and 0.94 respectively, which indicates that the resulting vectors preserve the community structure in the network. The distance matrices of the first 500 nodes are also shown in Table II. In the distance matrix, the distance between two nodes is defined as the Euclidean distance between their vectors, and this value is visualized by the color brightness in the heatmap: the brighter the color, the longer the distance. Each dense module in this matrix comprises 100 nodes and matches the predefined SBM community very well. As illustrated by the visualized pairwise node distance matrices, the number of processors does not change the high quality of the result, as the resulting vectors preserve the community structure of the SBM networks in all cases. Our model does not use the topology of the SBM network; instead, it only accesses the response times of the nodes to different cascades, yet the community structure can still be accurately recovered from these response times.
4.3 Algorithm scalability
We test our parallelization scheme on the RPI Advanced Multiprocessing Optimized System (AMOS), which is a 5-rack, 5K-node, 80K-core IBM Blue Gene/Q system [41] with additional equipment. In the AMOS supercomputer, each node consists of a 16-core, 1.6 GHz A2 processor with 16 GB of DDR3 memory. Because inter-core communication is generally more efficient inside the same node than across different nodes, we use all 16 cores of each node. In general, the maximum number of cores per node here is not constrained by the limit of 16 GB DDR3 memory space. This is one benefit of our memory management paradigm, because every processor stores only one copy of the parameters of the nodes and cascades associated with it.
The speedup and efficiency of our parallelization scheme are shown in Figure 6. A total of 10K cascades are simulated on the SBM network with 20K nodes. Each cascade infects a total of 247 nodes on average. Figure 6 shows that the parallelization scheme achieves an approximately linear speedup using several hundred processors. The efficiency of the parallelization scheme is above 25% in all cases. The comparison between Figure 6a and Figure 6b shows that the dimension does not change the speedup or efficiency. On our sampled GDELT dataset, the parallelized algorithm achieves similar speedup and efficiency using 64 processors, but adding extra processors does not significantly increase the speedup due to the limited size of the dataset.
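For readers less familiar with these two metrics, the sketch below uses the usual definitions: speedup is the single-processor time divided by the time on p processors, and efficiency is the speedup divided by p. The timings in `example_runtimes` are made-up placeholders, not measurements from AMOS, and the function name is illustrative.

```python
def speedup_and_efficiency(runtimes):
    """Map processor count -> (speedup, efficiency).

    runtimes: dict mapping processor count -> wall-clock seconds; it must
    contain an entry for 1 processor, which serves as the reference.
    """
    t1 = runtimes[1]
    result = {}
    for p, tp in sorted(runtimes.items()):
        s = t1 / tp
        result[p] = (s, s / p)
    return result


# Hypothetical timings for illustration only.
example_runtimes = {1: 640.0, 4: 170.0, 16: 50.0, 64: 20.0}
metrics = speedup_and_efficiency(example_runtimes)
```

With these placeholder numbers, 64 processors give a speedup of 32 and an efficiency of 0.5, which shows how efficiency can fall even while speedup keeps growing.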
To test the scalability of the parallelization scheme, we evaluate the execution time of a single SGA iteration of Algorithm 2 using 512 processors. Given a network of 10K nodes and different numbers of cascades, Figure 7a shows that the execution time is approximately proportional to the number of cascades. A similar pattern is also observed as the dimension grows. In contrast, when the number of cascades is fixed and the number of network nodes increases, the execution time grows slowly: it only doubles when the number of network nodes grows from 5K to 20K. The reason is that the number of infections per cascade is relatively stable in the synthetic data, so the increase in time for parameter propagation is smaller than the increase in the number of network nodes. In general, the parallelization scheme scales well with the number of network nodes, the number of cascades and the dimension.
4.4 Virality prediction
Our aim is to predict viral news cascades at their early stage. Specifically, with the global news dataset of events, the task is to predict the most reported events within a limited time period. Therefore, we rank the events reported in the news by the number of their reports and divide them into two classes: those that are among the top percent of this ranking and the remaining events. In this way, we can treat virality prediction as a binary classification problem: given the early reports within a limited time period, can we classify the events reported in the news into these two categories? Since we are only interested in predicting the most viral events reported in the news, the threshold ranges from 90% to 99% in our experiments. Notice that a high threshold results in two very imbalanced sets of samples, which makes the prediction challenging.
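The labeling step can be sketched as below. The reading of the threshold as a percentile of the ranking (so that a threshold of 0.90 marks the top 10% of events as viral) is an assumption, as are the function name `label_viral` and the rule of always keeping at least one viral event.

```python
def label_viral(report_counts, threshold):
    """Label events in the top (1 - threshold) fraction by report count as 1.

    report_counts: dict mapping event id -> number of reports
    threshold: percentile cut, e.g. 0.90 marks the top 10% as viral (assumed)
    """
    ranked = sorted(report_counts, key=report_counts.get, reverse=True)
    n_viral = max(1, round(len(ranked) * (1 - threshold)))
    viral = set(ranked[:n_viral])
    return {event: int(event in viral) for event in report_counts}


# Toy data: event "e19" has the most reports, "e0" the fewest.
counts = {f"e{i}": i for i in range(20)}
labels = label_viral(counts, threshold=0.90)
```

Raising the threshold toward 0.99 shrinks the positive class, which illustrates the class imbalance mentioned above.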
Baseline. We build a baseline algorithm that uses multiple features extracted from the early progress of a cascade together with the Random Forest model [24] for cascade classification. We choose Random Forest for comparison because, as an ensemble learning method, it is known to work well with the non-linear growth of modeled phenomena such as the viral spread of news reports. The extracted features include the number of unique early adopters, the frequency of the infections at the early stage, the maximum interval between two consecutive infections, and the minimum interval between two consecutive infections. In contrast, our proposed model uses the numbers of neighbors in different ranges from the infected nodes, i.e., the vector presented in Eq. 12, as the input of the Random Forest model. For a fair comparison, both the baseline and our model use the information about the early adopters in the first hours, where ranges from to .
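The four baseline features listed above can be computed from the early reports of a single event roughly as follows. The input format (pairs of site id and hours since the first report), the function name `early_features`, the feature names, and the definition of frequency as reports per hour are assumptions made for this sketch.

```python
def early_features(reports, horizon_hours):
    """Baseline-style features from the reports within the first hours.

    reports: list of (site_id, hours_since_first_report) for one event
    horizon_hours: length of the early observation period, in hours
    """
    early = sorted(t for _, t in reports if t <= horizon_hours)
    adopters = {site for site, t in reports if t <= horizon_hours}
    # Intervals between consecutive early infections
    gaps = [later - earlier for earlier, later in zip(early, early[1:])]
    return {
        "n_adopters": len(adopters),
        "frequency": len(early) / horizon_hours,
        "max_gap": max(gaps, default=0.0),
        "min_gap": min(gaps, default=0.0),
    }


example_reports = [("a", 0.0), ("b", 0.5), ("a", 2.0), ("c", 5.0)]
features = early_features(example_reports, horizon_hours=3)
```

The resulting dictionary can be flattened into a feature vector for any off-the-shelf classifier; note that none of these features carries information about which communities the early adopters belong to.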
The accuracy of the prediction is evaluated by the F1 score, which is commonly used in classification problems:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
The F1 score considers both the precision and the recall of virality prediction. A decent F1 score requires that the system neither predicts too many viral news events, which yields a high false positive rate, nor is so conservative that it makes too few predictions.
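As a small worked example of the metric, the sketch below computes the F1 score of the viral class as the harmonic mean of precision and recall. The function name and the convention of returning 0 when there are no true positives are choices made here for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive (viral) class for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:  # precision and recall are both zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 3 viral events out of 8; the classifier finds 2 and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3. A classifier that predicts no event as viral would have high accuracy on such imbalanced data but an F1 score of 0.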
Figure 9 shows the F1 scores of the 6-fold cross validation tests using a variety of values. As the threshold increases, the F1 scores of both the baseline and our model decrease due to the imbalanced sets of samples. The F1 scores of the predictions made by our model are consistently better than the baseline’s. Specifically, our model outperforms the baseline by 10% in most cases. As shown in Figure 8, the improvement produced by our model is most pronounced when the value of becomes small. This is because our model uses the community structure of the propagation network, which is not included in the feature-based baseline model. As the value of increases, both the baseline model and our proposed model perform better, yet the performance gap between them decreases at the same time. One potential reason is that the information cascades start to slow down inside each community at this stage with . Thus, the community structure does not provide extra signal for the prediction as it does at the early stage with . In general, in terms of prediction accuracy, the influence of value is more significant in the baseline model than in our model, which indicates that community structures can provide critical signals to forecast viral information cascades at the early stage.
The relationship between the improvement in prediction accuracy and the classification threshold is shown in Figure 8. Our proposed model performs much better than the baseline model when the threshold is high, and it achieves an almost improvement with the threshold . As discussed in Section 2.4, our proposed method calculates the number of neighbors in the early adopters’ local neighborhood, i.e., the number of nodes whose vectors are close to the early adopters’ in the latent space. Here, the most plausible explanation is that the most viral cascades have early adopters in multiple dense areas, so that they have an advantage in disseminating the contagion to their neighbors in these regions in parallel, resulting in viral infections within a limited time period. This explanation matches our observation about viral news in online media: most news about events rarely crosses geographical and cultural boundaries, but once it does, the breaking news draws attention from news media sites in different regions and hits the headlines very quickly.
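One plausible reading of the neighborhood feature described above is sketched below: for each of several radii, count the nodes that are not yet infected and whose vector lies within that Euclidean distance of at least one early adopter. This is only an illustration; the exact definition is the one given by Eq. 12 of the paper, which is not reproduced in this section, and the function name, the radii and the toy vectors are hypothetical.

```python
from math import dist


def neighbor_counts(vectors, early_adopters, radii):
    """Count latent-space neighbors of the early adopters for each radius.

    vectors: dict mapping node -> tuple of coordinates in the latent space
    early_adopters: iterable of nodes already infected
    radii: increasing list of distance ranges (assumed, not from the paper)
    """
    adopters = set(early_adopters)
    counts = []
    for r in radii:
        count = 0
        for node, vec in vectors.items():
            if node in adopters:
                continue
            # A node counts once, however many adopters it is close to.
            if any(dist(vec, vectors[a]) <= r for a in adopters):
                count += 1
        counts.append(count)
    return counts


toy_vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 3.0), "d": (10.0, 10.0)}
one_region = neighbor_counts(toy_vectors, ["a"], radii=[1.5, 5.0])
two_regions = neighbor_counts(toy_vectors, ["a", "d"], radii=[1.5, 5.0])
```

Under this reading, early adopters spread over several dense regions of the latent space reach more nearby nodes than the same number of adopters concentrated in one region, which is consistent with the explanation given above.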
5 Conclusion
We exploit the latent community structure in the global news network to improve the prediction of viral cascades of news about events. Cascades that have early adopters in different communities have an advantage in disseminating the contagion to these communities in parallel, and are therefore more likely to result in viral infections within a limited time period. Our model captures this property by inferring the community structure from the response times of nodes rather than from the explicit network topology, because the references to propagation sources are usually missing in real datasets. Due to the size of the relevant datasets, we parallelized the inference algorithm for distributed memory machines and tested this parallelization on the RPI Advanced Multiprocessing Optimized System (AMOS), achieving orders of magnitude speedup.
Acknowledgments
This work was supported in part by the Army Research Laboratory (ARL) under Cooperative Agreement Number W911NF-09-2-0053, (NS CTA), by the Office of Naval Research (ONR) grant No. N00014-15-1-2640, and by the Army Research Office (ARO), grant W911NF-16-1-0524. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies either expressed or implied of the Army Research Laboratory, or the U.S. Government.
- [1] Nematzadeh, A., Ferrara, E., Flammini, A., Ahn, Y. Y.: Optimal network modularity for information diffusion. Physical Review Letters, 113(8) (2014)
- [2] Firmstone, J., Coleman, S.: The changing role of the local news media in enabling citizens to engage in local democracies. Journalism Practice, 8(5), 596-606 (2014)
- [3] Hackett, R. A.: Decline of a paradigm? Bias and objectivity in news media studies. Critical Studies in Media Communication, 1(3), 229-259 (1984)
- [4] Della Vigna, S., Kaplan, E.: The Fox News effect: Media bias and voting. The Quarterly Journal of Economics, 122(3), 1187-1234 (2007)
- [5] Gentzkow, M., Shapiro, J. M.: Media bias and reputation. Journal of Political Economy, 114(2), 280-316 (2006)
- [6] Karsai, M., Kivelä, M., Pan, R. K., Kaski, K., Kertész, J., Barabási, A. L., Saramäki, J.: Small but slow world: How network topology and burstiness slow down spreading. Physical Review E, 83(2) (2011)
- [7] Gilliam Jr, F. D., Iyengar, S.: Prime suspects: The influence of local television news on the viewing public. American Journal of Political Science, 560-573 (2000)
- [8] Girvan, M., Newman, M. E.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821-7826 (2002)
