Discovering patterns of online popularity from time series

Mert Ozer; Anna Sapienza; Andrés Abeliuk; Goran Muric; Emilio Ferrara

arXiv:1904.04994·cs.LG·April 11, 2019

TL;DR

This paper introduces dipm-SC, a shape-based clustering algorithm for analyzing online content popularity over time, revealing two main patterns—bursty and steady—and showing that the growth pattern does not affect total popularity.

Contribution

The paper presents a novel multi-dimensional shape-based clustering algorithm with a heuristic for optimal cluster number, applied to real-world Twitter data to identify popularity patterns.

Findings

01

Identified two main popularity patterns: bursty and steady.

02

Popularity growth over time does not influence total popularity.

03

Validated algorithm accuracy on synthetic datasets.

Abstract

How is popularity gained online? Is being successful strictly related to rapidly becoming viral in an online platform or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil the popularity of online content? To answer these questions, we leverage a multi-faceted temporal analysis of the evolution of popular online contents. Here, we present dipm-SC: a multi-dimensional shape-based time-series clustering algorithm with a heuristic to find the optimal number of clusters. First, we validate the accuracy of our algorithm on synthetic datasets generated from benchmark time series models. Second, we show that dipm-SC can uncover meaningful clusters of popularity behaviors in a real-world Twitter dataset. By clustering the multidimensional time-series of the popularity of contents coupled with other domain-specific…

Equations

$$\operatorname{dist}(\mathbf{c}_{k},\mathbf{x})=\min_{\alpha_{d},q}\frac{1}{D}\sum_{d=1}^{D}\frac{\left\|\mathbf{c}_{k}(d,:)-\alpha_{d}\mathbf{x}_{q}(d,:)\right\|}{\left\|\mathbf{c}_{k}(d,:)\right\|},$$

$$\alpha_{d}(q)=\frac{\mathbf{x}_{q}(d,:)^{T}\mathbf{c}_{k}(d,:)}{\left\|\mathbf{x}_{q}(d,:)\right\|^{2}},$$

$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\sum_{\mathbf{x}\in S_{k}}\operatorname{dist}(\mathbf{c},\mathbf{x})^{2},$$

$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\sum_{\mathbf{x}\in S_{k}}\min_{\alpha_{d},q}\frac{1}{D^{2}}\sum_{d=1}^{D}\frac{\left\|\mathbf{c}(d,:)-\alpha_{d}\mathbf{x}_{q}(d,:)\right\|^{2}}{\left\|\mathbf{c}(d,:)\right\|^{2}}.$$

$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\frac{1}{D^{2}}\sum_{d=1}^{D}\sum_{\mathbf{x}\in S_{k}}\frac{\left\|\mathbf{c}(d,:)-\alpha_{d}\mathbf{x}(d,:)\right\|^{2}}{\left\|\mathbf{c}(d,:)\right\|^{2}}.$$

$$\begin{aligned}\mathbf{c}_{k}(d,:)&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\sum_{\mathbf{x}\in S_{k}}\left\|\frac{\mathbf{x}(d,:)^{T}\mathbf{c}^{\prime}}{\left\|\mathbf{x}(d,:)\right\|^{2}}\mathbf{x}(d,:)-\mathbf{c}^{\prime}\right\|^{2}\\&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\sum_{\mathbf{x}\in S_{k}}\left\|\left(\frac{\mathbf{x}(d,:)\mathbf{x}(d,:)^{T}}{\left\|\mathbf{x}(d,:)\right\|^{2}}-I\right)\mathbf{c}^{\prime}\right\|^{2}\\&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\mathbf{c}^{\prime T}\sum_{\mathbf{x}\in S_{k}}\left(I-\frac{\mathbf{x}(d,:)\mathbf{x}(d,:)^{T}}{\left\|\mathbf{x}(d,:)\right\|^{2}}\right)\mathbf{c}^{\prime},\end{aligned}$$

$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{\mathbf{c}^{\prime T}M\mathbf{c}^{\prime}}{\left\|\mathbf{c}^{\prime}\right\|^{2}},$$

$$B=\frac{\sigma_{\tau}-m_{\tau}}{\sigma_{\tau}+m_{\tau}}$$

$$M=\frac{1}{n_{\tau}-1}\sum_{i=1}^{n_{\tau}-1}\frac{(\tau_{i}-m_{1})(\tau_{i+lag}-m_{2})}{\sigma_{1}\sigma_{2}}$$

Code & Models

Repositories

mertozer/mts-clustering

Full text

Mert Ozer (1), Anna Sapienza (2), Andrés Abeliuk (2), Goran Muric (2), Emilio Ferrara (2)

(1) School of Computing, Informatics, and Decision Science Engineering, Arizona State University, Tempe, USA. Email: [email protected]

(2) Information Sciences Institute, University of Southern California, Marina del Rey, USA. Email: {annas,gmuric,aabeliuk,ferrarae}@isi.edu

Abstract

How is popularity gained online? Is being successful strictly related to rapidly becoming viral in an online platform or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil popularity of online content? To answer these questions, we leverage a multi-faceted temporal analysis of the evolution of popular online contents. Here, we present dipm-SC: a multi-dimensional shape-based time-series clustering algorithm with a heuristic to find the optimal number of clusters. First, we validate the accuracy of our algorithm on synthetic datasets generated from benchmark time series models. Second, we show that dipm-SC can uncover meaningful clusters of popularity behaviors in a real-world Twitter dataset. By clustering the multidimensional time-series of popularity of contents coupled with other domain-specific dimensions, we uncover two main patterns of popularity: bursty and steady temporal behaviors. Moreover, we find that the way popularity is gained over time has no significant impact on the final cumulative popularity.

Keywords:

multidimensional time series, shape-based clustering, online popularity, social media

1 Introduction

How do contents get popular on the Web? Is being successful strictly constrained to a rapid and bursty virality [1], or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil the popularity of online contents?

Temporal dynamics of popularity have been studied extensively in Twitter [41], news [24], and videos [9]. Thus far, previous work studying popularity has either explored topological patterns in the networks that lead to cascades [4, 25], or focused on the short time period in which significant popularity is gained [43, 23]. However, popularity is subject to endogenous and exogenous factors [23] that affect its temporal dynamics. For instance, the popularity of an item may depend on the amount of external promotion it receives [35]. Analyzing the different temporal patterns that arise when considering both endogenous and exogenous factors will further extend our understanding of temporal collective behavior.

To take all these factors into account at once and tackle the problem of uncovering their relation with the dynamics of popularity, we focus on multidimensional clustering methods. Multidimensional clustering can indeed allow us to uncover the hidden relations between time series of events having different evolutions in more than one dimension. For instance, the dynamics of a user's content can be described not only by its popularity but also by the online activity and pace of the user and by the user's number of connections in social networks.

A wide range of studies have focused on solving the problem of efficiently clustering time series [2]. Some approaches have adopted methods such as sub-sequence techniques [38, 46, 47] and dimensionality reduction models [45, 7]. However, we aim at uncovering different patterns of popularity on the basis of both their shapes and the time at which the main popularity gain is achieved. Therefore, sub-sequence or subspace solutions are not suitable for this analysis, as they lose temporal granularity and order. In this paper, we take inspiration from unidimensional clustering approaches [43, 30, 33, 12] that use a shape-based distance and a time series similarity measure that is scale and shift invariant. We generalize these methods to multidimensional time series and parametrize the temporal shifting, thus allowing us to capture temporal patterns occurring at different times.

In the present work, we design a multi-faceted temporal analysis to study the timeline of popular online content and provide insights on how popularity is gained over time. To this aim, we extend K-Spectral Centroid (kSC) [43], a unidimensional shape-based time series clustering algorithm, into our multidimensional version, called dipm-SC. It partitions multidimensional time series data into clusters and uses a heuristic to find the optimal number of clusters. Each cluster is characterized by a centroid, having a unique shape per dimension that identifies the average temporal behavior of the cluster members.

We validate the accuracy of our algorithm on datasets generated from benchmark time series models with varying time series length, number of dimensions, and number of underlying clusters. To the best of our knowledge, this is the first study that explores a multidimensional use of kSC.

Then, with the help of dipm-SC, we study temporal multidimensional behavior in two large real-world datasets containing data from GitHub and Twitter. In GitHub, we study the popularity of repositories considering a series of other temporal dimensions that keep track of the repository development, such as the pushes and pull requests a repository receives, and repository issues, such as issue entries and the comments under them. Our algorithm identifies 9 main clusters of different shapes among the most popular repositories on GitHub. In Twitter, we study how popular hashtags unfold in a window of time where a major event happens. We keep track of the number of times a hashtag is used hourly (as a measure of popularity) alongside hourly positive and negative sentiment rates and tweets containing the hashtag posted by bot accounts. Our algorithm identifies 6 main clusters among the most popular hashtags on Twitter. We show that both bursty and steady clusters are evident in both datasets. On Twitter, when a bursty popularity gain is present, we observe increased sentiment and bot activity rates around the popularity spike. We also note that the decay of sentiment and bot activity after the spike can be characterized by the fat tail of the popularity spike. Moreover, we find that how repositories unfold over time shows no significant impact on the cumulative popularity at the end of their lifetime. However, in the Twitter data, our algorithm is able to identify different cluster shapes leading to different cumulative popularity gains.

Our contributions

1. We propose a novel algorithm to cluster multidimensional shape-based time series. Our approach detects the interplay of different dimensions, which allows an unsupervised classification of different temporal patterns.

2. We validate the accuracy of our algorithm on datasets generated from benchmark time series models, and compare its performance with other scalable multidimensional time series methods.

3. We present two real-world case studies, where we showcase the effectiveness of our algorithm in both finding meaningful popularity dynamics and providing clusters that are coherent and easy to interpret.

2 Methodology

In this section, we formally define the problem and discuss the need for extensions to time series clustering algorithms in the literature. Then, we develop our multidimensional extension based on several unidimensional shape-based time series clustering methods.

2.1 Problem Definition

In this work, we aim at finding temporal patterns of popular online content in a multi-faceted manner. Let $N$ be the number of entities to be studied described by $D$ temporal features. Each temporal feature is represented by a discrete time series in $M$ time units (e.g., days or hours). This representation of our data can be encoded in a tensor $\mathcal{X}$ of size $N\times D\times M$ .

Given $\mathcal{X}$ , we want to partition the $N$ multivariate time series into an optimal number of $K$ clusters, by taking the temporal dynamics of different $D$ dimensions into account at the same time. To this aim, we perform a clustering process based on the cumulative similarity of shapes among corresponding dimensions of the time series. Therefore, each cluster $k$ is represented by a multi-dimensional shape $\mathbf{c}_{k}\in\mathbb{R}^{D\times M}$ , which describes the overall behavior of the cluster across multiple dimensions. In this work, solving the aforementioned problem gives us the ability to uncover similar patterns of popularity gain while differing in other dimensions, and vice versa.

2.2 Proposed Algorithm

The kSC algorithm proposed by Yang and Leskovec [43] stands out as a prominent solution for analyzing time series data constructed from online human behaviors. Along that line, there have been significant contributions to the shape-based time series clustering literature [30, 31]. Here, we extend and compare the following shape-based univariate time series clustering algorithms: kShape [30], kDBA [33], kAVG+ED [12], and kSC [43]. We only report the details of the methodology used to extend kSC [43], as it is the only algorithm specifically developed for online social platforms and, as shown in the supplementary material (www.public.asu.edu/~mozer/dipm-SC), it is the one that performs best in our synthetic scenario. However, we applied similar extensions to the kShape, kDBA, and kAVG+ED algorithms.

Our extension focuses on two parts. First, we modify the existing kSC algorithm to be applicable to multidimensional time series data. This process consists of the two steps of updating the distance and averaging functions of kSC. Second, we develop an iterative framework to find the optimal number of multidimensional shape clusters inspired by a similar work developed for kMeans [20].

2.2.1 kSC framework

K-Spectral Centroid (kSC) is an iterative kMeans-based time series clustering algorithm. The algorithm iteratively assigns time series to clusters and updates the cluster centroids until the cluster assignments no longer change. It adopts a scale and shift invariant distance function [8] to measure the proximity of two time series. Therefore, each cluster's centroid represents the average shape of its members. At each iteration, the method uses a spectral clustering technique as its averaging function to update the centroids. However, as it was originally proposed and designed for unidimensional time series data, we need to extend its distance and averaging functions in order to apply it to a multidimensional scenario.

2.2.2 Extending the Distance Function

There are several ways to extend the univariate distance function to its multivariate version. First, we can compute distances separately on each dimension and sum them. However, this option is not optimal, as the usual time series distances manipulate the signal through shifting and warping. Thus, the use of separate distances would lead to different manipulations (shifting and warping) in each dimension. As our task instead requires observing the interplay of the dimensions, manipulating each dimension differently is not desirable. Second, we can compute distances by using an optimal shared manipulation across all dimensions. Here, we propose extending the kSC algorithm's distance function [8] as follows

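$$\operatorname{dist}(\mathbf{c}_{k},\mathbf{x})=\min_{\alpha_{d},q}\frac{1}{D}\sum_{d=1}^{D}\frac{\left\|\mathbf{c}_{k}(d,:)-\alpha_{d}\mathbf{x}_{q}(d,:)\right\|}{\left\|\mathbf{c}_{k}(d,:)\right\|},\tag{1}$$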

where $\mathbf{x}(d,:)$ is the $d$-th dimension of the time series $\mathbf{x}\in\mathbb{R}^{D\times M}$, $\mathbf{x}_{q}$ is $\mathbf{x}$ shifted by $q$ steps in time, and $\mathbf{c}_{k}$ is the centroid of cluster $k$. The above equation constitutes the essential part of calculating the distance of a multivariate time series $\mathbf{x}$ to a centroid. It enforces the same shifting parameter $q$ for all dimensions, yet it keeps the scaling parameter $\alpha_{d}$ unique for each dimension. Here, we assume that even if each dimension may experience trends at different scales, they should have the same temporal order.

Notice that, for any fixed $q$, the optimal $\alpha_{d}(q)$ that solves Eq. (1) is

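$$\alpha_{d}(q)=\frac{\mathbf{x}_{q}(d,:)^{T}\mathbf{c}_{k}(d,:)}{\left\|\mathbf{x}_{q}(d,:)\right\|^{2}},\tag{2}$$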

which is obtained by setting the gradient to zero while $q$ is fixed. Thus, minimizing the distance function reduces to a brute-force search for the optimal $q$, where $\alpha_{d}$ is calculated for each candidate shift as in Eq. (2). The algorithm can become very slow when full-length shifting is allowed. As we are interested in the temporal behavior of online content, we focus on the moments in time when the events happen. Therefore, we aim to keep the shifting parameter to a minimum.
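As an illustration only, the brute-force search over $q$ with the closed-form $\alpha_{d}$ of Eq. (2) might be sketched in Python as follows. This is a minimal sketch and not the released implementation: the names `shift` and `multivariate_dist`, the zero-padding at the series boundaries, and the symmetric search range `max_shift` are assumptions made for the example, and the centroid is assumed to have no all-zero dimension.

```python
import numpy as np

def shift(x, q):
    """Shift every row of x (shape D x M) by q time steps, zero-padding the gap.
    Zero-padding is an assumption of this sketch."""
    out = np.zeros_like(x)
    if q > 0:
        out[:, q:] = x[:, :-q]
    elif q < 0:
        out[:, :q] = x[:, -q:]
    else:
        out[:] = x
    return out

def multivariate_dist(c, x, max_shift):
    """Return (distance, best_q) between centroid c and series x (both D x M).
    One shift q is shared by all dimensions; the scaling alpha_d is per dimension."""
    c = np.asarray(c, dtype=float)
    x = np.asarray(x, dtype=float)
    c_norm = np.linalg.norm(c, axis=1)  # assumed non-zero for every dimension
    best_d, best_q = np.inf, 0
    for q in range(-max_shift, max_shift + 1):
        xq = shift(x, q)
        xq_sq = np.sum(xq * xq, axis=1)
        # Closed-form optimal scaling per dimension (Eq. 2); 0 if the row is all zeros.
        alpha = np.divide(np.sum(xq * c, axis=1), xq_sq,
                          out=np.zeros_like(xq_sq), where=xq_sq > 0)
        resid = np.linalg.norm(c - alpha[:, None] * xq, axis=1)
        d = np.mean(resid / c_norm)  # average over the D dimensions (Eq. 1)
        if d < best_d:
            best_d, best_q = d, q
    return best_d, best_q
```

In this sketch a series that is a shifted copy of the centroid, rescaled differently in each dimension, is at distance zero from it, which is the invariance the distance is designed to have.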

2.2.3 Extending the Averaging Function

As a second step in our method, we need to extend the averaging function of kSC. The role of the averaging function is to find a suitable centroid for each cluster, where the distance of its members is minimum to the centroid

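$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\sum_{\mathbf{x}\in S_{k}}\operatorname{dist}(\mathbf{c},\mathbf{x})^{2},$$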

where $S_{k}$ is the set of time series belonging to cluster $k$ .

A possible way of extending the averaging function consists of finding a univariate centroid which minimizes the distance to every dimension. We avoid this option since there may be different behaviors on different dimensions and our task involves observing them in parallel. Instead, we extract a unique shape for each dimension separately. First, we replace the distance function with its multivariate version

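$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\sum_{\mathbf{x}\in S_{k}}\min_{\alpha_{d},q}\frac{1}{D^{2}}\sum_{d=1}^{D}\frac{\left\|\mathbf{c}(d,:)-\alpha_{d}\mathbf{x}_{q}(d,:)\right\|^{2}}{\left\|\mathbf{c}(d,:)\right\|^{2}}.$$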

To be able to use the spectral properties of the Rayleigh quotient in accordance with the original kSC paper [43], we first need to find the optimal $\alpha_{d}$ and $q$. The scaling $\alpha_{d}$ has a direct solution, as given in Eq. (2). However, $q$ is dimension invariant. Here we set $q$ to the optimal $q$ found during the distance computation, and jointly shift every dimension of the time series $\mathbf{x}$ accordingly during the new centroid calculation. Finally, we invert the order of the two sums in Eq. (3) and get

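$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}}\frac{1}{D^{2}}\sum_{d=1}^{D}\sum_{\mathbf{x}\in S_{k}}\frac{\left\|\mathbf{c}(d,:)-\alpha_{d}\mathbf{x}(d,:)\right\|^{2}}{\left\|\mathbf{c}(d,:)\right\|^{2}}.$$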

Our minimization problem no longer has variables shared across dimensions, so we can solve for each dimension's centroid separately by following steps similar to the univariate case

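$$\begin{aligned}\mathbf{c}_{k}(d,:)&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\sum_{\mathbf{x}\in S_{k}}\left\|\frac{\mathbf{x}(d,:)^{T}\mathbf{c}^{\prime}}{\left\|\mathbf{x}(d,:)\right\|^{2}}\mathbf{x}(d,:)-\mathbf{c}^{\prime}\right\|^{2}\\&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\sum_{\mathbf{x}\in S_{k}}\left\|\left(\frac{\mathbf{x}(d,:)\mathbf{x}(d,:)^{T}}{\left\|\mathbf{x}(d,:)\right\|^{2}}-I\right)\mathbf{c}^{\prime}\right\|^{2}\\&=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{1}{\left\|\mathbf{c}^{\prime}\right\|^{2}}\mathbf{c}^{\prime T}\sum_{\mathbf{x}\in S_{k}}\left(I-\frac{\mathbf{x}(d,:)\mathbf{x}(d,:)^{T}}{\left\|\mathbf{x}(d,:)\right\|^{2}}\right)\mathbf{c}^{\prime},\end{aligned}$$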

and by replacing $\sum_{\mathbf{x}\in S_{k}}(I-\dfrac{\mathbf{x}(d,:)\mathbf{x}(d,:)^{T}}{\left\|\mathbf{x}(d,:)\right\|^{2}})$ with $M$ , we obtain the following minimization problem

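$$\mathbf{c}_{k}=\operatorname*{argmin}_{\mathbf{c}^{\prime}}\frac{\mathbf{c}^{\prime T}M\mathbf{c}^{\prime}}{\left\|\mathbf{c}^{\prime}\right\|^{2}},$$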

which achieves its minimum value with the eigenvector of $M$ corresponding to the smallest eigenvalue. We refer further interested readers to the properties of the Rayleigh quotient with parameters $\mathbf{c}^{\prime}$ and $M$ [16].
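The per-dimension centroid update can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `spectral_centroid` is hypothetical, the members are assumed to be already shifted by their optimal $q$, all-zero members are skipped, and the sign of the eigenvector (which is arbitrary) is fixed so that its entries sum to a non-negative value.

```python
import numpy as np

def spectral_centroid(members):
    """Centroid of one dimension of a cluster.

    members: array of shape (n, M); row i is dimension d of the i-th
    (already shifted) cluster member. Returns a unit-norm vector of length M.
    """
    X = np.asarray(members, dtype=float)
    n_steps = X.shape[1]
    mat = np.zeros((n_steps, n_steps))
    for x in X:
        sq = x @ x
        if sq == 0:
            continue  # assumption of this sketch: ignore all-zero members
        # Accumulate I - x x^T / ||x||^2 over the cluster members.
        mat += np.eye(n_steps) - np.outer(x, x) / sq
    # eigh returns the eigenvalues of a symmetric matrix in ascending order,
    # so column 0 is the eigenvector of the smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(mat)
    c = eigvecs[:, 0]
    return c if c.sum() >= 0 else -c
```

When every member is a rescaled copy of the same shape, the returned centroid is that shape normalized to unit length, which is a convenient sanity check.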

After successfully constructing the two building blocks of kSC for our multivariate case, we present the pseudo code of our iterative algorithm multivariate kSC (m-kSC), cf. Alg. 1.

2.2.4 Finding the optimal K

As with many kMeans-based approaches, m-kSC also needs the number of clusters as a parameter. A common practice to find the optimal $K$ involves post-processing, where the clustering algorithm is run several times for different values of $k$. In this case, a trade-off between the final loss function value and the model complexity decides the optimal $K$. However, to avoid re-running the algorithm for every possible $k$, here we adopt dip-dist, the iterative strategy proposed by Kalogeratos and Likas [20]. We choose the dip-dist technique over others [32, 18] because the only assumption dip-dist makes is the unimodality of the pairwise distances of cluster members. It suggests that an optimal cluster structure should not have more than a single mode among the pairwise distances of its members (i.e., no multimodality). The density of pairwise distance values should reach its maximum around the mode and decay while moving away from it.

The multimodality of each cluster is checked through its members' distances to each other. First, we calculate the distance of each time series to the other members in the cluster. Then, we sort the distance vector in decreasing order. If the null hypothesis of unimodality in the distribution of the sorted distances is rejected by Hartigan's dip test [19] at significance level $\alpha$ (i.e., $p$-value $<\alpha$), the time series is considered to be a splitter in the cluster. Finally, if the number of splitters is higher than a given threshold $v$, we split the cluster into two disjoint sub-clusters. Our splitting strategy involves a local search with m-kSC where $k$ is set to 2. Heuristically, we choose the centroids with the least squared error objective over 10 runs. We iteratively split clusters until there are no clusters left with a ratio of splitters greater than the given threshold. As suggested by Kalogeratos and Likas [20], at each iteration we only split the cluster with the maximum number of splitters. We report our complete dip-test based multidimensional spectral centroid algorithm dipm-SC, cf. Alg. 2. We refer interested readers to Kalogeratos and Likas [20] for the kMeans extension and to Hartigan and Hartigan [19] for Hartigan's dip test. We make the source code and the experimental settings available at www.public.asu.edu/~mozer/dipm-SC/dipm_source_code.zip.
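The splitter check described above can be sketched as follows. This is a hedged illustration rather than the released code: `count_splitters` and `should_split` are hypothetical names, the unimodality test is passed in as a callable `unimodality_pvalue` (in practice an implementation of Hartigan's dip test would be supplied, which this sketch does not provide), and the threshold `v` is interpreted as a ratio of splitters to cluster size, which is one reading of the text.

```python
def count_splitters(dist_matrix, unimodality_pvalue, alpha=0.05):
    """Count cluster members whose distances to the other members reject unimodality.

    dist_matrix: square list of lists, dist_matrix[i][j] = distance between members i and j.
    unimodality_pvalue: callable taking a list of distances and returning a p-value
                        (hypothetical hook; e.g. a Hartigan dip test implementation).
    """
    splitters = 0
    for i, row in enumerate(dist_matrix):
        # Distances from member i to every other member, sorted in decreasing order.
        others = sorted((d for j, d in enumerate(row) if j != i), reverse=True)
        if unimodality_pvalue(others) < alpha:
            splitters += 1
    return splitters

def should_split(dist_matrix, unimodality_pvalue, alpha=0.05, v=0.01):
    """Return True when the ratio of splitters in the cluster exceeds the threshold v."""
    n = len(dist_matrix)
    if n == 0:
        return False
    return count_splitters(dist_matrix, unimodality_pvalue, alpha) / n > v
```

Keeping the statistical test behind a callable makes the sketch independent of any particular dip-test package; the default values of `alpha` and `v` are placeholders, not values taken from the paper.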

The computational cost of m-kSC is dominated by the calculation of the matrix $M$ for each $x_{n}$ and for $d$ dimensions ( $\mathcal{O}(m^{2}dn)$ ) and by the eigendecomposition of $M$ for $k$ clusters ( $\mathcal{O}(m^{3}dk)$ ). The multimodality test in dipm-SC over the $k$ clusters has a complexity of $\mathcal{O}(k(bn\log n+n^{2}))$. Thus, the total complexity of dipm-SC is the maximum of $\{\mathcal{O}(m^{2}dn),\mathcal{O}(m^{3}dk),\mathcal{O}(k(bn\log n+n^{2}))\}$.

3 Evaluation

We evaluate the results of our algorithm applied to one synthetically generated and two real-world multivariate time series datasets: GitHub and Twitter. We refer interested readers to https://www.public.asu.edu/~mozer/dipm-SC/supp_material.pdf for the comparison of dipm-SC with extensions of other shape-based time series clustering algorithms on the synthetically generated time series dataset, and for the GitHub analysis. For the sake of brevity, here we present our experimental results for the Twitter hashtag dataset. First, we perform a qualitative analysis by studying the different behavioral patterns of the detected multivariate centroids. Second, we characterize the clusters based on their "periodicity", locating them on a scale from periodic to viral. We expect that periodic time series will behave steadily and that viral time series will exhibit significant spikes over short time periods. To this aim, we use two previously introduced metrics: burstiness and memory [15]. These two metrics are computed from the distribution of the inter-event time $P(\tau)$ between consecutive events of time series data. In particular, given the inter-event time distribution $P(\tau)$ having mean $m_{\tau}$ and standard deviation $\sigma_{\tau}$, we calculate burstiness as

$$B=\frac{\sigma_{\tau}-m_{\tau}}{\sigma_{\tau}+m_{\tau}},$$

where $B\in[-1,1]$ : $B=1$ corresponds to the most bursty sequence of events, and $B=-1$ is a completely regular (periodic) sequence.
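A minimal Python sketch of the burstiness computation is given below. The helper names `inter_event_times` and `burstiness` are assumptions of this example, as is the use of the population standard deviation; the input is taken to be a list of event timestamps, and the case where all inter-event times are zero (so that $\sigma_{\tau}+m_{\tau}=0$) is not handled.

```python
from statistics import mean, pstdev

def inter_event_times(event_times):
    """Gaps between consecutive events, given (possibly unsorted) timestamps."""
    ts = sorted(event_times)
    return [b - a for a, b in zip(ts, ts[1:])]

def burstiness(taus):
    """B = (sigma - m) / (sigma + m) for a list of inter-event times.
    Uses the population standard deviation (a choice made for this sketch)."""
    m = mean(taus)
    s = pstdev(taus)
    return (s - m) / (s + m)
```

For perfectly regular events, such as timestamps 0, 2, 4, 6, 8, every inter-event time equals 2, the standard deviation is 0, and the function returns exactly -1, the periodic end of the scale.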

This measure, however, is not enough to characterize the correlations between consecutive events. The correlation of consecutive events can be thought of as a memory process. A memory coefficient between consecutive events $(\tau_{i},\tau_{i+lag})$ can be defined as

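$$M=\frac{1}{n_{\tau}-1}\sum_{i=1}^{n_{\tau}-1}\frac{(\tau_{i}-m_{1})(\tau_{i+lag}-m_{2})}{\sigma_{1}\sigma_{2}},$$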

where $n_{\tau}$ is the number of intervals between events in the time series, while $m_{1}$, $m_{2}$ and $\sigma_{1}$, $\sigma_{2}$ are the sample means and standard deviations of the $\tau_{i}$'s and $\tau_{i+lag}$'s, respectively. We set the value of $lag$ based on the empirically observed value of periodicity for each case.
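The memory coefficient can be sketched in Python along the same lines. This is an illustrative sketch with assumptions that go beyond the text: the function name `memory` is hypothetical, the sum runs over all available pairs $(\tau_{i},\tau_{i+lag})$ and is normalized by the number of pairs (which equals $n_{\tau}-1$ when $lag=1$), and population standard deviations are used so that the value is a Pearson correlation in $[-1,1]$.

```python
from statistics import mean, pstdev

def memory(taus, lag=1):
    """Correlation between inter-event times that are `lag` positions apart.

    Assumes at least lag + 2 inter-event times and that neither sub-sequence
    is constant (otherwise the standard deviation is zero).
    """
    first = taus[:-lag]   # tau_i
    second = taus[lag:]   # tau_{i + lag}
    m1, m2 = mean(first), mean(second)
    s1, s2 = pstdev(first), pstdev(second)
    total = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    return total / (len(first) * s1 * s2)
```

As a sanity check, a strictly alternating sequence of short and long gaps is perfectly anti-correlated at lag 1 and perfectly correlated at lag 2.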

4 Datasets description

We collected tweets from February 14th until March 6th, 2018 related to the Parkland school shooting event from GNIP. To this aim, we provided 140 terms (hashtags, words, phrases) related to the event and the broader gun debate in the United States (the query list is available at https://www.public.asu.edu/~mozer/dipm-SC/GNIP_query_list.txt). GNIP provides any tweet that contains at least one of the queried terms, while not suffering from the bias implications of free API streams [29]. Our data contains approximately 23.7M tweets coming from approximately 3.7M users.

We then identify the top 1,000 most frequently occurring (popular) hashtags in our dataset and build hourly time series for each of them. Alongside the hourly counts of each hashtag, we construct the hourly ratio of tweets sent by bot accounts to the total hourly volume. To identify the most obvious bot accounts, we used the free API provided by Botometer [10]. We tested about 5% of the most active accounts (approximately 193K) and found that approximately 19K of them are likely bots, in line with other research on bot frequency [39].

As the third and fourth dimensions, we consider the hourly rate of positive and negative sentiment tweets. We use an off-the-shelf short text classification tool SentiStrength [37] to detect positive and negative sentiment of each tweet where hashtags occur. SentiStrength provides scores for positive and negative sentiments separately. We aggregate each sentiment’s score on an hourly basis by simply summing them. Finally, we build the hourly sentiment ratio time series for each hashtag by normalizing the sentiment volumes by total tweet volumes.

Therefore, the timeline of each hashtag over time is represented by 4 separate dimensions: total count, bot tweet rate, positive sentiment rate, and negative sentiment rate.

5 Popularity Gain Patterns on Twitter

In this section we present the Twitter hashtag timeline analysis with the help of the multidimensional shape clusters identified by our dipm-SC algorithm. We use 24 hours as our shifting parameter. We also apply Gaussian smoothing with a window size of 24 to eliminate noise. In total we are able to identify 13 clusters with different timeline shapes. Here we present 6 clusters, which span nearly 78% of the popular hashtags under study. For brevity, we make the other 7 smaller clusters available externally (http://www.public.asu.edu/~mozer/dipm-SC). We report these 6 most popular timeline shapes in Fig. 1.

First, we observe that cluster-P in Figure 1a is significantly different in shape when compared with the others. In particular, it does not show any burst of activity in any of the four dimensions considered (popularity, sentiment ratios, and bot tweet ratio). On the contrary, this cluster shows a steady pattern with daily periodic activities.

On the other hand, the remaining 5 clusters are characterized by a sudden spike in the popularity signal at a certain point in time and by different trends among the other dimensions. The popularity bursts of Cluster-B1 and Cluster-B2 occur at the same time, but the spikes differ considerably in their tails: Cluster-B1 shows a longer tail, while Cluster-B2 dies off sooner. The temporal pattern of Cluster-B2 shows not only a decreasing trend in popularity but also a faster decay in both sentiment and bot ratios than the one we observe in Cluster-B1. We notice an analogous discrepancy between the last two clusters (i.e., Boycott1 and Boycott2).

Finally, we analyze the temporal coherence of hashtags clustered together. We note that our algorithm does not consider semantics while clustering timeline of hashtags. The detected clusters are indeed identified because they share similar temporal patterns among different dimensions. We report the top 8 closest hashtags to the cluster centroids in the top part of each panel in Figure 1.

Cluster-P contains hashtags related to broader issues, such as #NRA, #2A (second amendment), #MAGA, and #Trump, which show steady and periodic behavior across dimensions. Clusters B1 and B2 exhibit spikes around the time when the tragic shooting takes place. In cluster-B1 we observe event-related long-lasting hashtags, while in cluster-B2 we observe event-related expiring hashtags such as #prayfor variants and #MentalHealth. Cluster B3 is unique in terms of its spike time, which coincides with the student survivors starting the #neveragain movement and appearing in a town hall organized by CNN. Lastly, we notice two clusters dominated by boycott-related hashtags that spike in popularity around the same time. The spike overlaps with the time of the students' call to boycott the NRA and the companies that share financial interests with it. These two clusters clearly differ in their tails. Cluster-Boycott2 (longer popularity gain tail) shows more NRA and Amazon related hashtags, while Cluster-Boycott1 involves other companies' names. We will later present a statistically significant difference in cumulative popularity gains between these two clusters.

5.1 Interplay of Dimensions

The key piece of information we acquire through this analysis is how the popularity spike of a hashtag correlates with increased sentiment ratios and bot involvement. For every bursty popularity gain, we observe an increase in all other dimensions. Previous studies on bursty attention mechanisms of online content usually focus on the characteristics of the fat tail [44, 23, 34]. Here, we observe that both sentiment and bot tweet ratios stay steadier in clusters with a longer tail of popularity (e.g., B1, B3, and Boycott2) than in dying ones (e.g., B2, Boycott1). We support this observation by fitting a linear curve and measuring the average slope of cluster members. When compared, hashtags belonging to B1 have a higher slope than the ones in B2 in the sentiment and bot involvement dimensions after the spike. We observe the same pattern between the Boycott2 and Boycott1 clusters, where Boycott2 has a higher slope in all three dimensions after the spike.

Another interesting behavior we direct attention to is the sudden increase in bot tweet ratio after the popularity spike in clusters B2, B3, and Boycott1. These clusters are characterized by a weaker tail in popularity gain after a burst in popularity. This signals a steady bot involvement for a while, even though the overall activity is being abandoned.
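The slope comparison can be illustrated with a small Python sketch. The details here are assumptions made for the example and are not specified in the text: the spike is located at the maximum of the popularity series, the line is fitted by ordinary least squares to all points from the spike onward, and the helper names `ls_slope` and `post_spike_slope` are hypothetical.

```python
def ls_slope(y):
    """Ordinary least-squares slope of y against the indices 0, 1, ..., len(y) - 1.
    Requires at least two points."""
    n = len(y)
    x_bar = (n - 1) / 2
    y_bar = sum(y) / n
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(y))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den

def post_spike_slope(popularity, signal):
    """Slope of `signal` (e.g. bot ratio or a sentiment ratio) after the popularity peak."""
    peak = max(range(len(popularity)), key=popularity.__getitem__)
    return ls_slope(signal[peak:])

def average_slope(members):
    """Mean post-spike slope over (popularity, signal) pairs of one cluster."""
    slopes = [post_spike_slope(p, s) for p, s in members]
    return sum(slopes) / len(slopes)
```

A slope closer to zero after the spike corresponds to a signal that stays steadier, while a strongly negative slope corresponds to a fast decay.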

5.2 Burstiness and Memory of Clusters

First, we quantify the temporal behavior of popularity gain with the burstiness and memory metrics described above. We present the average burstiness and memory of each cluster in Figure 2a. As expected, cluster-P has the lowest burstiness while enjoying higher values of memory. The lowest memory and highest burstiness belong to the Boycott1 cluster.

Next, we investigate whether different multi-dimensional timeline shapes lead to a significant difference in hashtags' cumulative popularity gain at the end of the time window we observe. Since this analysis is based on the timeline of a hashtag rather than its lifespan, we do not align hashtags by their creation date. That is why, for example, comparing Cluster-B1 against Cluster-B3 is not a meaningful test. However, it is noteworthy that the two-sample Kolmogorov-Smirnov test rejects the null hypothesis for Cluster-Boycott1 and Cluster-Boycott2, for which we know the beginning of the campaigns. The differing distributions of these two shape clusters can also be observed in Figure 2b. Moreover, it is also interesting to see no difference in the cumulative popularity distribution between the hashtags of the steady cluster P and the bursty cluster B1 (p-value 0.056).
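For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only that statistic in plain Python; the function name `ks_statistic` is hypothetical, and no p-value is derived here (a statistics library such as SciPy, via `scipy.stats.ks_2samp`, would normally be used for the full test).

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    # The supremum is attained at one of the observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )
```

Here `a` and `b` would hold the cumulative popularity of the hashtags in two clusters; a statistic near 1 means the two distributions barely overlap, and a statistic near 0 means they are nearly indistinguishable.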

6 Related Work

Our work overlaps with the current literature along two branches. The first is the task of clustering time series, and the second is the task of uncovering temporal patterns of online content.

Clustering time series has been a challenging task and a research area that has produced a vast amount of literature [2]. Usually, when external feature learning or modeling [42, 26, 3] is not imposed, clustering of time series involves choosing a proper time series distance function and an algorithm. Numerous works have introduced various distance measures to calculate the proximity of time series data [40, 11]. There have been partitional [30, 43, 33, 12], subsequence matching [6, 22], and hierarchical [21] approaches using these different time series distance measures. Moreover, a major body of work [38, 47] exists in subsequence-matching based time series clustering, which identifies shorter, most discriminative portions of time series data, also known as shapelets, to group them. For multivariate time series data, the same categorization can be made: modeling based approaches [17, 14], and variants that generalize univariate solutions to multivariate cases [36, 13]. Our approach falls into the second category, where we extend existing distance functions available for univariate time series data and update the centroid finding (i.e., the averaging function) accordingly.

As the second line of work, we compare our effort with analyses of temporal patterns of online content. There is a significant amount of literature on characterizing and modeling bursts and long-tail dynamics [5, 27, 23]. First and foremost, our study does not focus on modeling; rather, it finds clusters of shapes inherent to online temporal behavior. However, we show through postprocessing that our cluster shapes can be distinguished by their burstiness and periodicity. Our work also differs from others [28, 27, 43] in that it analyzes the lifetime, or a longer timeline, of online entities rather than focusing on a window where activity peaks occur and omitting the rest.

7 Conclusion

In the present work, we set out to study the temporal patterns that lead a user’s content to gain popularity in online platforms. In particular, our study tackled the following problems: (i) uncovering the different temporal patterns of popularity in online platforms; (ii) studying how different dimensions interplay; and (iii) proposing dipm-SC, a novel algorithm that extends state-of-the-art models to cluster multidimensional time series of events.

First, we compared our method with other extensions of univariate time series clustering algorithms on synthetic data, where it achieved higher accuracy. We then applied our framework to multivariate time series data derived from two major online platforms: GitHub and Twitter. Through these applications, we showcased the efficacy of our method in uncovering fundamental patterns of popularity in online contexts. Our method can be used not only to find the overall difference between time series shapes, such as bursty versus steady behaviors, but also to uncover differences within these two main trends depending on the time at which popularity is acquired. Moreover, the results provided by our approach are easily interpretable and shed light on the interplay of multiple dimensions. In the Twitter scenario, we found that the uncovered clusters are temporally coherent in the hashtags used and related to certain types of events, such as the Parkland school shooting and boycott campaigns.

In conclusion, we devised a methodology that extends the state-of-the-art literature in the area of multivariate time series clustering and uses a similarity metric based on the shape of the time series. Moreover, our method can be used to uncover fundamental patterns based both on the shapes of the time series alone and on the temporal occurrence of the events. Our approach also provides a way of studying the popularity of online content and understanding its dynamics over time.

Acknowledgements

The authors are grateful to the Defense Advanced Research Projects Agency (DARPA), contract W911NF-17-C-0094, for their support.

Bibliography

[1] Adar, E., Zhang, L., Adamic, L.A., Lukose, R.M.: Implicit structure and the dynamics of blogspace. In: Workshop on the weblogging ecosystem. vol. 13, pp. 16989–16995 (2004)
[2] Aghabozorgi, S.R., Shirkhorshidi, A.S., Teh, Y.W.: Time-series clustering - a decade review. Inf. Syst. 53, 16–38 (2015)
[3] Alon, J., Sclaroff, S., Kollios, G., Pavlovic, V.: Discovering clusters in motion time-series data. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. vol. 1 (June 2003)
[4] Bakshy, E., Hofman, J.M., Mason, W.A., Watts, D.J.: Everyone’s an influencer: quantifying influence on twitter. In: Fourth ACM international conference on Web search and data mining. pp. 65–74. ACM (2011)
[5] Barabási, A.L.: The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005)
[6] Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: Proceedings of the Seventh International Symposium on String Processing Information Retrieval (SPIRE’00). pp. 39– (2000)
[7] Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst. 27(2), 188–228 (Jun 2002)
[8] Chu, K.K.W., Wong, M.H.: Fast time-series searching with scaling and shifting. In: Proc. of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. pp. 237–248. PODS ’99, ACM (1999)