Online reactions to the 2017 'Unite the Right' rally in Charlottesville: Measuring polarization in Twitter networks using media followership

Joseph H. Tien; Marisa C. Eisenberg; Sarah T. Cherng; Mason A. Porter

arXiv:1905.07755·cs.SI·October 11, 2019

TL;DR

This study analyzes Twitter interactions following the 2017 Charlottesville rally, revealing highly polarized communities along media and political lines, with distinct content and influential accounts on each side.

Contribution

It introduces a novel media followership PCA method to quantify political polarization and applies network analysis to characterize community structure on Twitter.

Findings

01

Retweet network shows high polarization with an assortativity coefficient of 0.8.

02

Communities are largely homogeneous in political orientation, identified via community detection algorithms.

03

Support for Trump links diverse right-wing communities, including alt-right and mainstream media.

Abstract

We study the Twitter conversation following the August 2017 `Unite the Right' rally in Charlottesville, Virginia, using tools from network analysis and data science. We use media followership on Twitter and principal component analysis (PCA) to compute a `Left'/`Right' media score on a one-dimensional axis to characterize nodes. We then use these scores, in concert with retweet relationships, to examine the structure of a retweet network of approximately 300,000 accounts that communicated with the #Charlottesville hashtag. The retweet network is sharply polarized, with an assortativity coefficient of 0.8 with respect to the sign of the media PCA score. Community detection using two approaches, a Louvain method and InfoMap, yields largely homogeneous communities in terms of Left/Right node composition. When comparing tweet content, we find that tweets about `Trump' were widespread in…

Tables (5)

Table 1: First principal component of the media-followership matrix $M$. Red entries are positive. For each node, we use its value in the first principal component to characterize it in the retweet network: ‘Left’ refers to nodes with negative first principal component score, and ‘Right’ refers to nodes with positive first principal component score.

Media account	First principal component
BreitbartNews	$0.4071$
DRUDGE_REPORT	$0.3843$
FoxNews	$0.3779$
theblaze	$0.2054$
NRO	$0.0970$
csmonitor	$-0.0235$
WSJ	$-0.1183$
FiveThirtyEight	$-0.1520$
dailykos	$-0.1893$
thenation	$-0.2362$
MotherJones	$-0.3115$
washingtonpost	$-0.3321$
NPR	$-0.3945$

Table 2: Number of inter-community retweets divided by the number of intra-community retweets for the twenty most heavily retweeted nodes in the retweet network. We obtain communities by maximizing modularity using the GenLouvain code with a resolution-parameter value of $\gamma=1$. Color indicates the mean media PCA score of the GenLouvain community for each node. We use blue to signify communities that are on the Left (i.e., ones with a negative mean media PCA score) and red to signify communities that are on the Right (i.e., ones with a positive mean media PCA score). Nodes with large in-degrees on the Left had larger values of the number of inter-community retweets divided by the number of intra-community retweets than was the case for nodes with large in-degrees on the Right. We label the Twitter accounts only of public profiles.

Node	Inter-community in-degree divided by intra-community in-degree
RepCohen	0.36725
DineshDSouza	0.040400
pastormarkburns	0.036085
larryelder	0.043745
wkamaubell	0.42938
NancyPelosi	0.21463
	0.25073
jk_rowling	0.22067
johncardillo	0.078068
TheNormanLear	0.31365
funder	0.17688
FoxNews	0.077005
KarenAttiah	0.32803
	0.28350
	0.23508
	0.050066
	0.043146
itsmikebivins	0.23335
tariqnasheed	0.23772
chuckwoolery	0.041193
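
The ratio reported in Table 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the edge dictionary, the community labels, and the function name `inter_intra_ratio` are hypothetical, and the keys are ordered as (retweeter, original poster).

```python
def inter_intra_ratio(edges, community, node):
    """Inter-community in-degree of `node` divided by its intra-community in-degree.

    edges: dict mapping (retweeter, original) -> number of retweets
    community: dict mapping account -> community label
    """
    inter = intra = 0
    for (retweeter, original), count in edges.items():
        if original != node:
            continue
        if community[retweeter] == community[original]:
            intra += count  # retweet from inside the node's own community
        else:
            inter += count  # retweet from a different community
    return inter / intra if intra else float("inf")


# Hypothetical toy data: account "a" is retweeted 4 times from its own
# community (0) and once from community 1.
edges = {("u1", "a"): 3, ("u2", "a"): 1, ("v1", "a"): 1, ("v1", "b"): 2}
community = {"a": 0, "u1": 0, "u2": 0, "b": 1, "v1": 1}
ratio_a = inter_intra_ratio(edges, community, "a")
```

A small ratio means that an account was retweeted almost exclusively from within its own community.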

Table 3: Mixing matrix of the fractions of the number of edges that correspond to edges between the different types of accounts, as characterized by the sign of their media PCA score (i.e., first PCA score). As usual, ‘Left’ indicates a negative media PCA score, and ‘Right’ indicates a positive media PCA score. (No nodes have a media PCA score of exactly $0$.) Accounts tend to mix with (i.e., be adjacent to) accounts with a PCA score of the same sign, as indicated by the larger values on the diagonal of the mixing matrix.

	Original: Left	Original: Right
Retweeter: Left	0.43	0.057
Retweeter: Right	0.044	0.47
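
As a worked check, the four entries of this mixing matrix are enough to recover the assortativity coefficient of about $0.8$ that is quoted in the abstract. The sketch below applies the standard formula $r = (\sum_{\ell} e_{\ell\ell} - \sum_{\ell} a_{\ell} b_{\ell}) / (1 - \sum_{\ell} a_{\ell} b_{\ell})$, where $a_{\ell}$ and $b_{\ell}$ are the row and column sums of the mixing matrix. One assumption is labeled in the code: because the published entries are rounded and sum to $1.001$, the sketch renormalizes them first.

```python
# Mixing matrix from Table 3: rows = retweeter, columns = original poster,
# both in the order (Left, Right). Entries are fractions of edges.
e = [[0.43, 0.057],
     [0.044, 0.47]]


def assortativity(e):
    """Assortativity coefficient r of a square mixing matrix e."""
    total = sum(sum(row) for row in e)
    # Assumption: the total differs from 1 only because of rounding,
    # so we renormalize before applying the formula.
    e = [[x / total for x in row] for row in e]
    g = len(e)
    a = [sum(e[l]) for l in range(g)]                       # row sums
    b = [sum(e[k][l] for k in range(g)) for l in range(g)]  # column sums
    trace = sum(e[l][l] for l in range(g))
    chance = sum(a[l] * b[l] for l in range(g))
    return (trace - chance) / (1 - chance)


r = assortativity(e)
print(round(r, 2))  # prints 0.8
```

The value $r = 1$ corresponds to a network in which every retweet stays within the Left or within the Right, and $r = 0$ corresponds to mixing that is indistinguishable from chance.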

Table 4: The twenty-five most numerous words for nodes with negative (i.e., Left) and positive (i.e., Right) media PCA scores. The blue text indicates words that appear in the top-twenty words for the Left but not for the Right, and the red text indicates words that appear in the top twenty for the Right but not for the Left.

Left		Right
Word	Count	Word	Count
Charlottesville	98782	Charlottesville	84282
Trump	19352	Trump	11376
realDonaldTrump	10289	Obama	8195
white	9472	white	8174
Nazis	7743	DineshDSouza	8026
Nazi	6451	POTUS	7614
comments	5068	pastormarkburns	7394
charlottesville	4759	Barcelona	6348
good	4693	supremacist	6004
people	4637	organizer	5864
response	4091	rally	5851
must	4080	violence	5671
hate	3933	guy	5501
Barcelona	3930	MSM	5070
violence	3884	hate	4882
supremacy	3642	Right	4531
introducing	3584	city	4490
attack	3460	Americans	4489
via	3382	larryelder	4421
RepCohen	3358	Antifa	4409
rally	3345	Since	4183
Impeachment	3341	11	4011
Articles	3326	Chicago	3946
Klansmen	3310	Statues	3905
right	3155	40	3900

Table 5: The twenty most numerous words (excluding words that correspond to handles of non-verified accounts) from the subset of tweets that include (left set of columns) the word ‘Trump’ and (right set of columns) the word ‘Barcelona’. The blue text indicates words that appear in the top-twenty words for the Left but not for the Right, and the red text indicates words that appear in the top twenty for the Right but not for the Left.

Left		Right
Word	Count	Word	Count
Charlottesville	30995	Charlottesville	16218
Trump	19396	Trump	11376
realDonaldTrump	10291	realDonaldTrump	3245
Nazis	5600	MAGA	1403
comments	4941	POTUS	1278
good	3788	President	1229
introducing	3580	Romney	1209
Impeachment	3331	comments	1175
white	3318	apologize	1164
Articles	3313	racist	1149
RepCohen	3306	blame	1096
Klansmen	3301	Mayor	1085
must	2699	antifa	980
Congress	2606	charlottesville	959
censure	2540	Vice	958
supremacy	2239	Barcelona	880
wake	2107	left	868
defense	2060	alt	818
NancyPelosi	2054	coming	781
repulsive	2018	non	776

Equations

Q = \frac{1}{w} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{ij} - \gamma \frac{w_{i}^{\mathrm{in}} w_{j}^{\mathrm{out}}}{w} \right) \delta(C_{i}, C_{j}),

w = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}

H^{k} = - \sum_{i=1}^{2} p_{i}^{k} \ln p_{i}^{k},

r = \frac{\sum_{\ell=1}^{g} e_{\ell\ell} - \sum_{\ell=1}^{g} a_{\ell} b_{\ell}}{1 - \sum_{\ell=1}^{g} a_{\ell} b_{\ell}},
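
To make the modularity formula for $Q$ concrete, the following Python sketch evaluates it for a given community assignment. It is an illustration under stated assumptions, not the GenLouvain code: the function name `directed_modularity` and the toy data are hypothetical, and the edge keys are ordered as (retweeter $j$, original poster $i$), so that each value plays the role of $A_{ij}$. The double sum is grouped by community, which gives the same value as summing over all node pairs.

```python
from collections import defaultdict


def directed_modularity(edges, community, gamma=1.0):
    """Modularity Q of a weighted, directed network.

    edges: dict mapping (j, i) -> A_ij, the weight of the edge from j to i
    community: dict mapping node -> community label
    gamma: resolution parameter
    """
    w = sum(edges.values())          # total edge weight
    intra = 0.0                      # weight of edges inside communities
    w_in = defaultdict(float)        # summed in-strength per community
    w_out = defaultdict(float)       # summed out-strength per community
    for (j, i), weight in edges.items():
        if community[i] == community[j]:
            intra += weight
        w_in[community[i]] += weight
        w_out[community[j]] += weight
    null = sum(w_in[c] * w_out[c] for c in set(community.values()))
    return intra / w - gamma * null / w ** 2


# Hypothetical toy network with two groups and one cross-group retweet.
edges = {("a", "b"): 2, ("c", "b"): 1, ("d", "e"): 1, ("f", "e"): 2, ("c", "e"): 1}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
Q = directed_modularity(edges, community)  # equals 18/49
```

Placing every node in a single community gives $Q = 0$ for $\gamma = 1$, which is a useful sanity check on any implementation.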

Full text

Abstract

Network analysis of social media provides an important new lens on politics, communication, and their interactions. This lens is particularly prominent in fast-moving events, such as conversations and action in political rallies and the use of social media by extremist groups to spread their message. We study the Twitter conversation following the August 2017 ‘Unite the Right’ rally in Charlottesville, Virginia, USA using tools from network analysis and data science. We use media followership on Twitter and principal component analysis (PCA) to compute a ‘Left’/‘Right’ media score on a one-dimensional axis to characterize Twitter accounts. We then use these scores, in concert with retweet relationships, to examine the structure of a retweet network of approximately 300,000 accounts that communicated with the #Charlottesville hashtag. The retweet network is sharply polarized, with an assortativity coefficient of $0.8$ with respect to the sign of the media PCA score. Community detection using two approaches, a Louvain method and InfoMap, yields communities that tend to be homogeneous in terms of Left/Right node composition. We also examine centrality measures and find that hyperlink-induced topic search (HITS) identifies many more hubs on the Left than on the Right. When comparing tweet content, we find that tweets about ‘Trump’ were widespread in both the Left and Right, although the accompanying language (i.e., critical on the Left, but supportive on the Right) was unsurprisingly different. Nodes with large degrees in communities on the Left include accounts that are associated with disparate areas, including activism, business, arts and entertainment, media, and politics. Support of Donald Trump was a common thread among the Right communities, connecting communities with accounts that reference white-supremacist hate symbols, communities with influential personalities in the alt-right, and the largest Right community (which includes the Twitter account FoxNews).

Keywords: United States politics, political extremism, media polarization, social media, Twitter, community structure, principal component analysis

Affiliations

• Joseph H. Tien: Department of Mathematics and Mathematical Biosciences Institute; The Ohio State University

• Marisa C. Eisenberg: Department of Epidemiology, Center for the Study of Complex Systems, and Department of Mathematics; University of Michigan

• Sarah T. Cherng: Precision Health Enterprise and Institute for Next Generation Healthcare, Mount Sinai Health System, New York, NY

• Mason A. Porter: Department of Mathematics; UCLA

1 Introduction

On 11–12 August 2017, a ‘Unite the Right’ rally was held in Charlottesville, Virginia, USA in the context of the removal of Confederate monuments from nearby Emancipation Park. Attendees at the rally included members of the ‘alt-right’, white supremacists, Neo-Nazis, and members of other far-right extremist groups [1]. At this rally, there were violent clashes between protesters and counter-protesters. A prominent event amidst these clashes was the death of Heather Heyer when a rally attendee rammed his car into a crowd of counter-protesters [2]. In the aftermath, President Donald Trump stated that there were ‘very fine people on both sides’ [3]. White supremacists were galvanized by Trump’s response, with one former leader stating that the president’s comments marked “the most important day in the White nationalist movement” ([4], p. 61). Reactions to the removal of Confederate statues, the violence at the rally, and President Trump’s controversial response generated vigorous debate.

In this paper, we present a case study of the structure of the online conversation about Charlottesville in the days following the ‘Unite the Right’ rally. These data allow one to study far-right extremism in the context of broader public opinion. Did support for President Trump’s handling of Charlottesville extend beyond white supremacists and the ‘alt-right’? Did the response to Charlottesville split simply along partisan lines, or was the reaction more nuanced? We examine these questions using tools from network analysis and data science using Twitter data from communication following the ‘Unite the Right’ rally that include the hashtag #Charlottesville. Our specific objectives are to (1) present a simple approach for characterizing Twitter accounts based on their online media preferences; (2) use this characterization to examine the extent of polarization in the Twitter conversation about Charlottesville; (3) evaluate whether key accounts were particularly influential in shaping this discussion; (4) identify natural groupings (in the form of network ‘communities’) of accounts based on their Twitter interactions; and (5) characterize these communities in terms of their account composition and tweet content.

Social-media platforms are important mechanisms for shaping public discourse, and data analysis of social media is a large and rapidly growing area of research [5]. It has been estimated that almost two-thirds of American adults use social media networking sites [6], with even higher usage among certain subsets of the population (such as activists [7] and college students [6]). Online forums and social-media platforms are also significant mechanisms for communication, dissemination, and recruitment for various types of ethnonationalist and extremist groups [4]. Twitter, in particular, has been a key platform for white-supremacist efforts to shape public discourse on race and immigration ([4], p. 64).

As a network, Twitter encompasses numerous types of relationships. It is common to analyze them individually as retweet networks (e.g., see [8, 9]), follower networks (e.g., see [10]), mention networks (e.g., see [8]), and others. There is an extensive literature on Twitter network data, and the myriad topics that have been studied using such data include political protest and social movements [11, 12, 13, 14, 15, 16, 17], epidemiological surveillance and monitoring of health behaviors [18, 19, 20, 21, 22, 23, 24], contagion and online content propagation [25, 26], identification of extremist groups [27], and ideological polarization [8, 28, 29]. The combination of significance for public discourse, data accessibility, and amenability to network analysis makes it very appealing to use Twitter data for research. However, important concerns have been raised about biases in Twitter data [13, 30, 31] in general and hashtag sampling in particular [5], and it is important to keep them in mind when interpreting the findings of both our study and others.

There are also many studies of how the internet and social-media platforms affect public discourse [32, 33, 34]. In principle, social media and online news consumption have the potential to increase exposure to disparate political views [35]. In practice, however, they instead often serve as filter bubbles [36, 37] and echo chambers [33, 34]; and they thus may heighten polarization. Several previous studies have examined political homophily in Twitter networks [10, 8, 38, 29, 39], and some analysis has been based on tweet content and followership of political accounts [10]. We also examine political polarization using Twitter data, but we take a different approach: we focus on the homophily of media preferences on Twitter. Specifically, we examine media followership on Twitter and perform principal component analysis (PCA) [40] to calculate a scalar measure of media preference. We then use this scalar measure to characterize accounts in our Charlottesville Twitter data set. To study homophily, we examine assortativity of this scalar quantity for accounts that are linked by one or more retweets. Our approach is conceptually simple, has minimal data requirements (e.g., there are no training data sets), and is straightforward to implement. The employment of media followership is also appealing for political studies, as it is known that media preferences correlate with political affiliation [41, 42].

The influence of Twitter accounts on shaping content propagation and online discourse depends on many factors, including the number of ‘followers’ (accounts who subscribe to a given account’s posts, which then appear in their feed), community structure and other aspects of network architecture [26], tweet activity (and other account characteristics) [11], and specific tweet content [43]. One can calculate ‘centrality’ measures [44] to identify important nodes in a Twitter network. There are many notions of centrality, including degree, PageRank [45], betweenness [46], and hyperlink-induced topic search (HITS, which allows the examination of both hubs and authorities) [47]. In the context of our study, it is also useful to keep in mind that some structural features are particular to Twitter networks; these may influence which centrality measures are most appropriate to consider. Prominent examples of such features include asymmetry between the numbers of followers and accounts that are followed for many accounts [43], automated accounts (‘bots’) that may retweet at very high frequencies [48], and heterogeneous retweeting properties across different accounts [9]. The importance of such features has also led to the development of Twitter-specific centrality measures [49, 9, 50]. In our investigation, we examine a variety of different measures of centrality for the #Charlottesville retweet network to identify important accounts both for generating novel content and for spreading existing content.

Community detection, in which one seeks dense sets (called ‘communities’) of nodes that are connected sparsely to other dense sets of nodes, is another approach that can give insights into network structure (especially at large scales) [51, 52]. Communities in a network can influence dynamical processes, such as content propagation [53, 54, 26]. Investigating community structure and other large-scale network structures can be very useful for the study of online social networks, as some accounts are anonymous and demographic data may be incomplete or of questionable validity. Community detection is helpful for discovering tightly-knit groupings of accounts that can help reveal what segments of the population are engaged in a conversation on Twitter. One can then examine such groupings, in conjunction with other tools from network analysis, to characterize communities in terms of structural network properties (e.g., distributions of degree or other centrality measures) and/or metadata (e.g., profile information), identify influential accounts within communities, and study dynamical processes on a network (such as how content propagates both within and between communities [26]).

In the present paper, we combine community detection with analysis of tweet content within and across communities. Previous studies have reported that there are differences in language in different online communities [55]. Such differences can help reveal differences in demography, political affiliation, and views on specific topics [10, 8, 56]. For example, the ‘linguistic framing’ of issues such as immigration can help reveal political orientations and agendas [57, 58], and changes in language over time can reflect political movements and influence campaigns [59]. We combine community detection with tweet content analysis to compare subsets of the Twitter population who participated in the #Charlottesville conversation by characterizing them based on the language in different communities for describing both the broader conversation topic (namely, #Charlottesville) and specific subtopics (e.g., ‘Trump’).

Our paper proceeds as follows. In Section 2, we briefly discuss our Twitter data collection and cleaning. In Section 3, we discuss how we characterize nodes based on media preference and PCA. In particular, we show that the first principal component provides a good classification of nodes as ‘Left’ (specifically, nodes with a negative media-preference score) or ‘Right’ (specifically, nodes with a positive media-preference score) in terms of their media preference. In Section 4, we examine the structure — in terms of both centrality measures and large-scale community structure — of a network of retweet relationships that we construct from our Twitter data. We compare central nodes on the Left and Right, and we also examine Left/Right node composition within communities in the retweet network. In Section 5, we examine the media-preference assortativity of nodes in the retweet network. Our results in Sections 4 and 5 allow us to gauge the extent to which the Twitter conversation, with respect to who retweets whom, splits according to media preferences. In Section 6, we illustrate differences in tweet content between the Left and Right communities. Together, Sections 4–6 illustrate the extent of polarization in the Twitter conversation about #Charlottesville. In Section 7, we conclude and discuss our results.

2 Data collection

We collected tweets with the hashtag #Charlottesville and the follower lists for $13$ media organizations using Twitter’s API (application programming interface) and the Python package tweepy. Public data accessibility through Twitter’s API has greatly facilitated investigations of Twitter data, but such data have important limitations [13, 5], including potential biases due to Twitter’s proprietary API sampling scheme [13]. For example, Morstatter et al. [31] illustrated that Twitter’s API can produce artifacts in topical tweet volume, potentially resulting in misleading changes in the number of tweets on a given topic over time. In our investigation, we do not consider changes in tweet volume over time; instead, we examine features of the data after aggregating over a collection-time window.

Tüfekci [5] discussed several potential issues with hashtag sampling, including different hashtag usages across different groups and discontinuation of a given hashtag once the corresponding topic has been established. (This latter phenomenon is called ‘hashtag drift’ [60].) We collected the tweets that we study from a six-day period from shortly after the ‘Unite the Right’ rally; this should lessen the potential for hashtag drift. As was pointed out by Tüfekci [5], hashtag sampling draws from accounts that choose to tweet a given hashtag, and this necessarily entails biases. Nevertheless, hashtag sampling is able to provide valuable insights on the shape of online conversations. For example, we use the collected data to examine what types of accounts chose to post tweets about #Charlottesville. It is known, for example, that the extent that ‘peripheral’ accounts engage in online conversations about social protest can be an important factor for content propagation on Twitter [11].

Our data collection is in accord with the Twitter Terms of Service and Developer Agreement. To protect user privacy, we include account names (i.e., “handles”) only for Twitter-verified accounts and Twitter accounts that belong to organizations. As described by Twitter, “an account may be verified if it is determined to be an account of public interest” [61]. We have posted network data (without account names or tweet content), together with code for analyzing the structure of these data, at https://osf.io/487fw/.

2.1 Tweets about #Charlottesville

We used Twitter’s search API to sample 486,894 publicly available tweets that include the hashtag #Charlottesville and were posted by 270,975 unique accounts between 16 August 2017 and 21 August 2017. Our data include account name (i.e., “handle”), time and date in coordinated universal time (UTC), and tweet content. In UTC, the earliest tweet date is 2017-08-16 22:16:21, and the latest tweet date is 2017-08-20 01:48:00. We performed our data acquisition using the Python package tweepy.

2.2 Media followership

In December 2016, we used the Twitter API to acquire the complete lists of Twitter users who follow the following 13 media accounts: BreitbartNews, DRUDGE_REPORT, FiveThirtyEight, FoxNews, MotherJones, NPR, NRO (the Twitter account for The National Review), WSJ (the Twitter account for The Wall Street Journal), csmonitor, dailykos, theblaze, thenation, and washingtonpost. At the time of access, these media accounts had significant Twitter followings, ranging from 62,078 followers (csmonitor) to more than 12 million followers (WSJ). They include both sources that studies have found to be preferred by conservative readers and sources that studies have found to be preferred by liberal readers [41, 42].

3 Twitter media preferences

In this section, we use media preferences on Twitter to characterize nodes in our #Charlottesville data set. Specifically, we find that using PCA provides an effective characterization of nodes as ‘Left’ or ‘Right’. In subsequent sections, we use this characterization to examine media-preference polarization in the Twitter conversation about Charlottesville.

Of the Twitter accounts in our #Charlottesville data set, 99,412 accounts followed at least $1$ of the $13$ media sources that we listed in Section 2.2 at the time (December 2016) that we accessed the media follower lists. Restricting to these accounts gives a 99,412 $\times$ 13 media-choice matrix $M$ of $0$ entries (not following) and $1$ entries (following). We perform PCA on $M$, and we highlight the first component in Table 1. We use the ‘standard’ type of PCA in our investigation [40]. For a discussion of a variant of PCA that is designed for Boolean data, see [62].

We interpret the first component as encoding liberal versus conservative media preference, as reflected by the signs of the entries of this component. Specifically, media accounts with a positive value in the first principal component (PC) seem to correspond to accounts that previous studies have found to have a conservative slant (and to be preferred by individuals who identify as conservative), whereas accounts with a negative value in the first PC correspond predominantly to accounts that studies have concluded to have a liberal slant and/or are preferred by liberals [41, 42]. The sign of the score in the first PC is also consistent with conventional wisdom about liberal versus conservative leanings of these media accounts, with the exception of The Wall Street Journal (WSJ), which is widely considered to be conservative-leaning [33] but has a negative first PC value. However, our findings are consistent with previous studies that, based on readership and co-citations, grouped The Wall Street Journal with liberal media organizations [33, 63, 64]. By contrast, previous research that examined article content identified The Wall Street Journal as politically conservative [41]. Although the sign of the first PC value has a clear interpretation, the magnitude of these entries does not appear to provide an intuitive ordering (for example, with respect to a hand-curated media-bias chart [65]) on the liberal–conservative spectrum.

In the rest of our paper, we focus on the value of the first PC; for simplicity, we use the term ‘media PCA score’ to refer to this score. Positive values for this score indicate followership of the media accounts that we show in red, whereas negative values indicate followership of accounts that we show in black (see Table 1). To frame our discussion, we refer to nodes with a positive media PCA score as nodes on the ‘Right’ and to those with a negative media PCA score as ones that are on the ‘Left’, although we note that we have not validated this measure as an indicator of political belief or affiliation. Our approach is similar to that of Bail et al. [32], who applied PCA to followership of a large set of ‘opinion leaders’ to assess political orientation.
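
The computation of a media PCA score can be sketched as follows. This is a minimal illustration, assuming NumPy and a small hypothetical followership matrix rather than the real 99,412 $\times$ 13 matrix $M$. Two choices are assumptions rather than details from the text: the columns are centered but not rescaled, and the arbitrary sign of the first principal component is fixed by requiring a chosen reference account (here FoxNews) to have a positive entry.

```python
import numpy as np

# Toy subset of media accounts and hypothetical followership data.
# Rows are Twitter accounts; entry 1 means "follows", 0 means "does not follow".
media = ["FoxNews", "BreitbartNews", "NPR", "washingtonpost"]
M = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)


def first_pc(M, reference_column):
    """Return (entries of the first principal component, per-account scores)."""
    X = M - M.mean(axis=0)                        # center each media column
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[0]                              # first principal direction
    # PCA determines this vector only up to sign; orient it so that the
    # reference account has a positive entry (an assumption of this sketch).
    if loadings[reference_column] < 0:
        loadings = -loadings
    scores = X @ loadings                         # media PCA score per account
    return loadings, scores


loadings, scores = first_pc(M, media.index("FoxNews"))
labels = np.where(scores > 0, "Right", "Left")
```

In this toy example, the three accounts that follow FoxNews receive positive scores and the three that follow NPR receive negative scores, which mirrors the Left/Right labeling by sign that is described above.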

4 Network structure

In this section, we explore structural features of our retweet network $\tilde{G}$, which is a weighted, directed graph with weighted adjacency matrix $\tilde{A}$, where $\tilde{A}_{ij}$ denotes the number of times that node $j$ retweeted node $i$. In particular, we examine degree distributions (see Section 4.1), calculate and compare several different centrality measures (see Section 4.2), and detect communities using two widely-used algorithms (see Section 4.3). We combine these structural features with node characterization according to media preference (see Section 2.2) to (1) examine how central nodes differ between Left and Right and (2) describe communities based on their Left/Right node composition.

The graph $\tilde{G}$ has 238,892 nodes, 365,589 edges, and 389,736 retweets. We focus on $G$, the largest connected component of $\tilde{G}$ when we ignore directionality (so it is $\tilde{G}$’s largest weakly connected component). The graph $G$ has 221,137 nodes, 353,548 edges, and 376,978 retweets. Let $A$ denote the weighted adjacency matrix for $G$. In all cases, weights represent multi-edges, where the multi-edge from node $j$ to node $i$ corresponds to the number of retweets by account $j$ of any tweet by account $i$ in our data set.
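
The construction of the weighted graph and the extraction of its largest weakly connected component can be sketched with the Python standard library alone. The retweet records and function names below are hypothetical, and the union-find approach is one of several reasonable ways to find weakly connected components; it is not necessarily what the authors used.

```python
from collections import Counter, defaultdict

# Hypothetical retweet records, each given as (retweeter j, original poster i).
retweets = [("u1", "a"), ("u1", "a"), ("u2", "a"),
            ("u3", "b"), ("u2", "b"), ("u4", "c")]

# Weighted edges: the value for (j, i) counts how often j retweeted i.
edges = Counter(retweets)


def largest_weakly_connected_component(edges):
    """Return the node set of the largest component, ignoring edge direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for j, i in edges:
        root_j, root_i = find(j), find(i)
        if root_j != root_i:
            parent[root_j] = root_i        # merge the two components

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return max(groups.values(), key=len)


G_nodes = largest_weakly_connected_component(edges)
G_edges = {e: w for e, w in edges.items() if e[0] in G_nodes}
num_nodes, num_edges, num_retweets = len(G_nodes), len(G_edges), sum(G_edges.values())
```

The three counts at the end correspond to the node, edge, and retweet totals that are reported for $G$: an edge is a distinct (retweeter, original poster) pair, and the number of retweets is the sum of the edge weights.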

4.1 Degree distributions

The out-degree of node $k$ corresponds to the total number of retweets that were posted by node $k$, and the in-degree of $k$ corresponds to the total number of times that node $k$ was retweeted. Unless we specifically note otherwise, we include weights when calculating the in-degrees and out-degrees (i.e., we count all edges in a multi-edge). For example, $\sum_{i=1}^{n}A_{ij}$ gives the out-degree of node $j$, and $\sum_{j=1}^{n}A_{ij}$ gives the in-degree of node $i$. In Figure 1, we show the in-degree and out-degree distributions for $G$. The two distributions differ markedly, as the in-degree distribution has a much longer tail (corresponding to a few accounts that were retweeted very heavily).
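
These degree definitions translate directly into code. The sketch below uses hypothetical data and keeps the same convention as the adjacency matrix: an edge key (retweeter $j$, original poster $i$) with value $A_{ij}$ adds to the out-degree of $j$ and to the in-degree of $i$.

```python
from collections import Counter
from statistics import mean

# Hypothetical weighted edges: (retweeter j, original poster i) -> A_ij.
edges = {("u1", "a"): 2, ("u2", "a"): 1, ("u2", "b"): 1, ("a", "b"): 1}


def weighted_degrees(edges):
    """Return (in_degree, out_degree) counters that include edge weights."""
    in_degree, out_degree = Counter(), Counter()
    for (retweeter, original), count in edges.items():
        out_degree[retweeter] += count  # retweets posted by this account
        in_degree[original] += count    # times this account was retweeted
    return in_degree, out_degree


in_degree, out_degree = weighted_degrees(edges)
nodes = sorted({n for edge in edges for n in edge})
mean_in = mean(in_degree[v] for v in nodes)
mean_out = mean(out_degree[v] for v in nodes)
```

Because every retweet contributes one unit to exactly one in-degree and one out-degree, `mean_in` and `mean_out` always coincide, even when the two distributions have very different shapes.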

In Figure 2a, we show the in-degrees for the twenty most heavily retweeted accounts. The mean in-degree is $1.70$, and the standard deviation is $69.22$, indicating extreme heterogeneity in the number of times retweeted. The account (RepCohen) with the largest in-degree was retweeted 16,180 times. By contrast, 208,241 nodes (i.e., 94% of them) in $G$ were never retweeted at all. We also observe heterogeneity in the out-degree, but it is much less extreme than for in-degree; the standard deviation is $4.89$. (By definition, the mean in-degree and mean out-degree are the same, as every edge has both an origin and terminus in $G$.) The account with the largest out-degree sent $141$ retweets in our data set. By contrast, 7,852 accounts had an out-degree of $0$; these accounts were retweeted, but they did not retweet any accounts. In Figure 2b, we show the twenty accounts that sent the most retweets.

We also consider the in-degree and out-degree distributions for accounts with and without media PCA scores to examine whether there are systematic differences between the two types of accounts. (The former are the 99,412 accounts that followed at least one of the 13 focal media sources.) The heterogeneity in the in-degree distribution that we observed when examining all nodes in $G$ is also present when we consider the in-degree distribution separately for nodes with and without media PCA scores; the standard deviation is $105.24$ for nodes with media PCA scores and $36.65$ for nodes without them. The mean in-degree for nodes with a media PCA score is larger than for nodes without one ($2.85$ versus $1.08$). Nodes with large in-degree with media PCA scores include DineshDSouza, pastormarkburns, RepCohen, wkamaubell, and johncardillo. However, there are also some heavily retweeted nodes (such as larryelder, TheNormanLear, and NancyPelosi) that do not follow any of the $13$ media accounts that we used for computing media PCA scores. We thus cannot compute media PCA scores for these nodes.

4.2 Centralities

We now examine important accounts by computing several centrality measures [44, 66]. We start with degree (i.e., degree centrality), the simplest way of trying to measure a node’s importance. In Figure 2, we show the twenty nodes with the largest in-degrees and the twenty nodes with the largest out-degrees. These two sets are disjoint, indicating that the nodes that generated most of the original content in the Twitter conversation about #Charlottesville were distinct from those that were most active in promoting existing content through retweets. Degree is a local centrality measure that does not take into account any characteristics of neighboring nodes. For comparison, we also calculate two other widely-used centrality measures, PageRank [45] and HITS [47], that take some non-local information into account.

One obtains PageRank scores from the stationary distribution of a random walk on a network that combines transitions according to network structure and ‘teleportation’ according to a user-supplied distribution [67], with a parameter that determines the relative weightings of these two processes. We compute PageRank with standard uniform-at-random teleportation using Matlab’s centrality function with the default damping factor of $0.85$ (so teleportation occurs for 15% of the steps in the associated random walk). In the left column of Figure 3, we list the twenty most central nodes according to PageRank. Nine of these nodes are also on our list of nodes with the largest in-degrees. An exception is harikondobalu, which was retweeted only $38$ times in our data set. The large PageRank value for harikondobalu, despite its small in-degree, reflects the fact that harikondobalu was one of only two nodes that were retweeted by wkamaubell, which was retweeted 8,582 times.
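To make the mechanism behind the harikondobalu example concrete, the following Python sketch runs a plain power iteration for PageRank on a toy graph. It is a stand-in for, not a reproduction of, Matlab's centrality function. The edge direction (retweeter to retweeted), the uniform redistribution of probability from nodes without out-edges, and all account names are assumptions.

```python
def pagerank(edges, nodes, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank with uniform teleportation.

    edges: list of (retweeter, retweeted) pairs; repeated pairs act as weights.
    Nodes without out-edges spread their probability uniformly (an assumption).
    """
    out_links = {v: [] for v in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for target in out_links[v]:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy graph: "star" is retweeted by five followers and retweets only "quiet".
# "other" is retweeted twice, so its in-degree exceeds that of "quiet".
followers = ["f1", "f2", "f3", "f4", "f5"]
toy_nodes = ["star", "quiet", "other", "g1", "g2"] + followers
toy_edges = [(f, "star") for f in followers]
toy_edges += [("star", "quiet"), ("g1", "other"), ("g2", "other")]
rank = pagerank(toy_edges, toy_nodes)
print(sorted(rank, key=rank.get, reverse=True)[:3])
```

In this toy graph, "quiet" has a smaller in-degree than "other" but a larger PageRank score, because its single retweet comes from a heavily retweeted account that retweets almost nobody else.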

Hub and authority centralities [47] are another useful pair of centrality measures. Using the HITS algorithm, one can simultaneously examine hubs and authorities. As discussed in [47], a good hub tends to point to good authorities, and a good authority tends to have good hubs that point to it. In the context of retweeting, we expect that accounts with large authority scores tend to be retweeted by accounts with large hub scores, and we expect that good hub accounts tend to retweet accounts that are good authorities. As in PageRank, the importances of adjacent nodes influence a node’s hub and authority scores. With our convention that the $(i,j)$ entry of a graph’s adjacency matrix corresponds to the edge weight from $j$ to $i$ , hub and authority scores correspond, respectively, to the principal right eigenvectors of $A^{t}A$ and $AA^{t}$ . We compute hubs and authorities using Matlab’s centrality function.
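The mutual reinforcement between hubs and authorities can be sketched with a short power iteration. The Python code below is an illustration only (we do not claim that it matches Matlab's implementation), and it uses an edge list instead of the adjacency matrix $A$. The edge direction (retweeter to retweeted), the fixed number of iterations, and the account names are assumptions.

```python
from math import sqrt


def hits(edges, nodes, iterations=100):
    """Hub and authority scores by alternating updates with L2 normalization.

    edges: list of (retweeter, retweeted) pairs with at least one entry.
    An authority score sums the hub scores of the accounts that retweet it;
    a hub score sums the authority scores of the accounts that it retweets.
    """
    hub = {v: 1.0 for v in nodes}
    authority = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        authority = {v: 0.0 for v in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        norm = sqrt(sum(x * x for x in authority.values()))
        authority = {v: x / norm for v, x in authority.items()}

        hub = {v: 0.0 for v in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        norm = sqrt(sum(x * x for x in hub.values()))
        hub = {v: x / norm for v, x in hub.items()}
    return hub, authority


# Hypothetical toy graph: "a3" is retweeted three times, but only by accounts
# that retweet nothing else; "a1" and "a2" are each retweeted twice by "h1" and "h2".
toy_nodes = ["h1", "h2", "a1", "a2", "l", "m", "n", "a3"]
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"),
             ("l", "a3"), ("m", "a3"), ("n", "a3")]
hub, authority = hits(toy_edges, toy_nodes)
print(round(authority["a1"], 3), authority["a3"] < 1e-6)
```

In the toy graph, "a3" has the largest in-degree but a negligible authority score, because the accounts that retweet it are not good hubs. This shows that in-degree and authority score can rank accounts differently.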

We list the twenty nodes with the largest authority and hub scores, respectively, in the center and right columns of Figure 3. Color indicates the mean media PCA scores for the community assignment of each account from modularity maximization using a Louvain method [68, 69, 70] (see Section 4.3). Only two of the nodes among the top-twenty authorities are in communities with positive (i.e., Right) media PCA scores; these accounts, pastormarkburns and DineshDSouza, belong to two prominent conservative personalities. Neither pastormarkburns nor DineshDSouza was ever retweeted by any of the top-$50$ hubs. By contrast, all of the other authorities were retweeted at least three times by the leading hubs. When we consider all nodes, we observe that the hub scores have a bimodal distribution, with a clear separation between the nodes with small and large values (e.g., using $4\times 10^{-5}$ as a threshold hub score to separate ‘small’ and ‘large’ values). We refer to nodes with hub scores that are larger than $4\times 10^{-5}$ as ‘large hub-score nodes’. Consider the set of nodes that retweeted DineshDSouza. Of these, the fraction that are large hub-score nodes is $9.0\times 10^{-4}$. The fraction of nodes that retweeted pastormarkburns that are large hub-score nodes is $1.0\times 10^{-3}$. The fraction of nodes that retweeted itsmikebivins that are large hub-score nodes is $1.3\times 10^{-2}$. A few other examples of such fractions are $0.05$ for wkamaubell, $0.15$ for tribelaw, and $1$ for RepCohen.

As is standard for hub and authority scores, there are two qualitatively different ways for a node to have a large authority score: it can either be retweeted many times (e.g., DineshDSouza), or it can be retweeted by nodes with large hub scores (e.g., itsmikebivins). Both of the large-authority Right accounts (DineshDSouza and pastormarkburns) lie in the former category.

Figure 3 also allows us to compare important accounts according to different centrality measures. As one can see in Figure 3, there is some overlap between the top-PageRank and top-authority accounts. Note, however, that fewer than half of the top-PageRank accounts are also among the top-authority accounts. By comparison, the set of top hubs is disjoint from the top-PageRank and top-authority accounts in Figure 3. Additionally, more than half of the top-PageRank and top-authority accounts in Figure 3 are verified accounts, whereas none of the top-hub accounts are verified accounts.

4.3 Community structure

To examine large-scale structure in the #Charlottesville retweet network, we use community detection to identify tightly-knit sets (so-called ‘communities’) of accounts with relatively sparse connections between these sets [51, 52]. In our investigation, we employ two widely-used community-detection methods: modularity maximization [69, 70] and InfoMap [71]. A major challenge in community detection is parsing what results reflect a network’s features, rather than artifacts from a community-detection method. Modularity maximization and InfoMap are two methods, which use rather different approaches from each other, that have been used successfully on a wide variety of problems. We expect that broad structural features in a network that we observe using both of these methods are likely to be robust to the particular choice of community-detection method, so we expect them to be actual features of the data (rather than artifacts). There exist many other community-detection methods, including statistical inference via stochastic block models [72] and local methods based on personalized PageRank [53, 67]. Exploring our retweet network with other community-detection methods is outside the scope of the present article, but we encourage readers to explore our data set with them. It is available at https://osf.io/487fw/.

4.3.1 Modularity maximization

The modularity of a particular assignment of a network’s nodes into communities measures the amount of intra-community edge weight, relative to what one would expect at random under some null model [69, 70]. Modularity maximization treats community detection as an optimization problem by seeking an assignment of nodes into communities that maximizes a modularity objective function. A version of modularity for weighted, directed graphs is [73, 74]

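$$Q=\frac{1}{w}\sum_{i,j}\left(A_{ij}-\gamma\frac{w_{i}^{\mathrm{in}}w_{j}^{\mathrm{out}}}{w}\right)\delta(C_{i},C_{j})\,, \tag{4.1}$$

where

$$w=\sum_{i,j}A_{ij} \tag{4.2}$$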

is the sum of all edge weights in a network; $w_{k}^{\mathrm{in}}$ and $w_{k}^{\mathrm{out}}$ are the in-strength (i.e., a weighted generalization of in-degree) and out-strength (i.e., weighted out-degree), respectively, of node $k$ ; the community assignment of node $k$ is $C_{k}$ ; the quantity $\delta$ is the Kronecker delta; and $\gamma$ is a resolution parameter that controls the relative weight given to the null model [75]. Our null-model matrix elements are $P_{ij}=\frac{w_{i}^{\mathrm{in}}w_{j}^{\mathrm{out}}}{w}$ , so this null model is a type of configuration model [76], in which we preserve expected in-strength and expected out-strength but otherwise randomize connections [51]. For most of our computations, we use the resolution-parameter value $\gamma=1$ as a default. However, in Section 4.3.5, we compare results using a variety of values of $\gamma$ spanning three orders of magnitude.

To maximize $Q$ , we use a GenLouvain variant [77] (which is implemented in Matlab and was released originally in conjunction with [78]) of the locally-greedy Louvain algorithm [68]. To use the code from [77], we symmetrize the modularity matrix $B$ , where $B_{ij}=A_{ij}-\gamma\frac{w_{i}^{\mathrm{in}}w_{j}^{\mathrm{out}}}{w}$ . As discussed in [74], this is distinct from symmetrizing the adjacency matrix $A$ .

Modularity maximization using GenLouvain yields $228$ communities, which range in size from $2$ nodes to 47,321 nodes.

4.3.2 InfoMap

InfoMap is a community-detection method that is based on the flow of random walkers on graphs [71]. (One can also interpret modularity maximization in terms of random walks on graphs [75].) InfoMap uses the intuition that a random walker tends to be trapped for long periods of time within tightly-knit sets of nodes [52]. Rosvall and Bergstrom [71] made this idea concrete by trying to minimize the expected description length of a random walk. For example, one can obtain a concise description of a random walk by allowing node names to be reused between communities. One can apply InfoMap to weighted, directed graphs, and it has been used previously to study Twitter data [26]. To study a directed graph, one introduces a teleportation parameter (as in PageRank); we use the default teleportation value of $\tau=0.15$ [71].

Our implementation uses code from [79]. With InfoMap, we find $205$ communities, which range in size from $1$ node to 122,504 nodes.

4.3.3 Large-scale structure of the retweet network

Several features are evident in our community-detection results from both modularity maximization and InfoMap: (1) communities are largely segregated by media PCA score; (2) overall, the communities skew towards the Left; and (3) most of the nodes on the Right are assigned to a large community that includes prominent right-wing personalities and FoxNews.

To examine the relationship between community structure and Left/Right media preference, we compute the mean media PCA score within each community. The proportion of communities with at least one node with a media PCA score is very similar for modularity maximization (204/228; 89%) and InfoMap (183/205; 89%). We also examine the extent of overlap of Left and Right accounts within communities by computing the Shannon diversity index [80] for each community. This index is

$$H^{k}=-\left(p_{1}^{k}\ln p_{1}^{k}+p_{2}^{k}\ln p_{2}^{k}\right), \tag{4.3}$$

where $H^{k}$ is the Shannon diversity index for community $k$ , and $p_{1}^{k}$ and $p_{2}^{k}$ (with $p_{1}^{k}+p_{2}^{k}=1$ for each $k$ ) are the fractions of accounts in community $k$ with Left and Right media preferences, respectively. In Figure 4, we show the Shannon diversity scores versus mean media PCA scores for the communities that we detect using modularity maximization and InfoMap.
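A two-category Shannon diversity index of this kind can be computed in a few lines. The Python sketch below assumes the form $H=-(p_{1}\ln p_{1}+p_{2}\ln p_{2})$ with the natural logarithm and the convention $0\ln 0=0$. It also assumes that accounts without a score, or with a score of exactly zero, are ignored. These conventions and the toy scores are assumptions, not details taken from the analysis.

```python
from math import log


def shannon_diversity(scores):
    """Two-category Shannon diversity of Left (negative) and Right (positive) scores.

    Returns None when a community has no accounts with a nonzero score.
    """
    left = sum(1 for s in scores if s < 0)
    right = sum(1 for s in scores if s > 0)
    total = left + right
    if total == 0:
        return None
    diversity = 0.0
    for count in (left, right):
        if count:  # convention: 0 * ln(0) = 0
            p = count / total
            diversity -= p * log(p)
    return diversity


# Hypothetical communities: homogeneous, evenly mixed, and 91% Left.
homogeneous = shannon_diversity([-0.4, -0.9, -0.1])
balanced = shannon_diversity([-0.5, 0.5])
mostly_left = shannon_diversity([-1.0] * 91 + [1.0] * 9)
print(homogeneous, balanced, mostly_left)
```

The index is $0$ for a community whose scored accounts all lie on one side and reaches its maximum, $\ln 2$, when Left and Right accounts are equally represented.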

Both community-detection methods yield a predominantly unimodal shape for the relation between PCA score diversity and mean media PCA score, with more extreme mean media PCA scores associated with lower diversity within a community. Communities with ‘centrist’ mean media PCA scores (i.e., ones that are near $0$) have relatively small sizes. By contrast, the largest communities tend to have mean media scores that are farther from $0$; they also have small Shannon diversity. For example, InfoMap gives two communities that are much larger than the others. One is on the Left (with 122,504 nodes and a mean media PCA score of $-0.43$), and the other is on the Right (with 58,185 nodes and a mean media PCA score of $0.74$). In the largest Left community, 91% of the nodes have negative media PCA scores, compared with 6% in the largest Right community. Similarly, the large communities that we obtain from modularity maximization also have little Left/Right node diversity within communities.

Another prominent feature is that both community-detection approaches yield one community on the Right that is much larger than other communities that have a positive mean media PCA score. Additionally, both methods yield a similar set of large-degree accounts in the largest Right community. Specifically, the five nodes with largest in-degrees and out-degrees are the same, with DineshDSouza, pastormarkburns, larryelder, johncardillo, and FoxNews as the five most heavily retweeted accounts (i.e., the ones with the largest in-degrees) in this community.

Our results in Figure 4 also suggest that there are more Left-leaning communities than Right-leaning ones. For example, 106/130 (i.e., about 82%) of the InfoMap communities with at least ten nodes have negative mean media PCA scores. Modularity maximization gives a bimodal distribution of community sizes. We refer to communities with at most 100 nodes as ‘small’, communities with more than 100 and at most 1000 nodes as ‘medium-sized’, and communities with more than 1000 nodes as ‘large’. Of the medium and large modularity-maximization communities, 76/93 (i.e., 82%) have negative mean media PCA scores. To give some context, we have PCA scores for 78,339 nodes, and 44,797 of them (about 57%) have a negative media PCA score.

4.3.4 Finer features of the retweet network

One notable difference between the two methods is that two large communities dominate for InfoMap (one each on the Left and Right), whereas modularity maximization yields a partition of the retweet network that includes many more large communities. We now examine some of the finer details in the large communities that we obtain from modularity maximization.

Modularity maximization yields $41$ communities with at least 1,001 nodes. To further characterize these $41$ communities, we examine the accounts with the largest in-degrees (i.e., the ones that are retweeted the most) within each community and characterize these nodes by hand from their profiles and, when available (e.g., when account owners are known public personalities), information about the owners of these accounts. More than 85% (specifically, $35$ of $41$) of these communities have negative (i.e., Left-leaning) mean media PCA scores. The accounts with the largest in-degrees in these $35$ communities include activists (e.g., Everytown, IndivisibleTeam, UNHumanRights, and womensmarch), businesses (e.g., benandjerrys), people from arts and entertainment (e.g., jk_rowling, LatuffCartoons, FallonTonight, ladygaga, Sethrogen, TheNormanLear, and wkamaubell), journalists (e.g., AmyKNelson), media organizations (e.g., AJEnglish, CBSThisMorning, and HuffPostCanada), and politicians (e.g., NancyPelosi, RepCohen, and JoeBiden). By contrast, only six of the largest communities have positive (i.e., Right-leaning) mean media PCA scores. The largest of these (with 47,321 nodes) includes opinion leaders on the Right (e.g., DineshDSouza, pastormarkburns, and larryelder) and FoxNews, as we discussed previously. Another community has a mean PCA score close to $0$ (specifically, it is $0.086$), and it appears to be a business-oriented community with tweets that are critical of Donald Trump. Two of the remaining four communities with positive media scores are Right-oriented activist communities. One activist community has 3,987 nodes, and one of its most heavily retweeted accounts references an influential alt-right account [81]. The other activist community has 2,710 nodes, and one of its most retweeted accounts references a well-known white supremacist hate symbol in its handle. A third community appears to be a media community with foreign media personalities (e.g., KTHopkins), and the final community of these four is a community that is dominated by accounts that tweet in German.

4.3.5 Community characteristics across different resolution-parameter values

To examine the robustness of our findings about community structure in the retweet network, we also conduct modularity maximization using the GenLouvain code for a range of values of the resolution parameter $\gamma$ in (4.1). There is a ‘resolution limit’ for the smallest detectable community size when using modularity maximization, and the size scales of communities that result from modularity maximization can also influence the sizes of the largest communities [82, 83]. Biases in the sizes of detected communities can skew interpretation of the results of community detection, and distinguishing which results reflect features of a network and which arise from a specific approach or algorithm for community detection is a major challenge. We partially address these concerns by identifying features that are robust to the choice of resolution-parameter value $\gamma$ .

In Figure 5, we show our results from community detection using the GenLouvain code with resolution-parameter values $\gamma$ that range from $10^{-2}$ to $10$ . Smaller values of $\gamma$ in (4.1) favor fewer communities (with 783 communities using $\gamma=10$ , compared with only a single community for $\gamma=10^{-2}$ ). For each value of $\gamma$ , we plot communities with more than 1000 nodes versus their mean media PCA score. We observe that there are more large (i.e., having more than 1000 nodes) Left-leaning than Right-leaning communities for all examined values of $\gamma$ for which we detected more than one community, suggesting that this is a robust feature of our Twitter retweet network.

4.4 Modular centrality

Our results in Section 4.3 suggest that the Charlottesville retweet network has meaningful modules. Ghalmane et al. [84] examined a way to exploit community structure when identifying influential nodes, and they computed different ‘modular centrality’ measures to describe influence within versus between communities. (See, e.g., [85, 86] for other work on quantifying centrality values within versus between communities.) In this section, we compute a slightly modified version of the ‘modular degree centrality’ from [84]. We separately count the number of times (producing an ‘inter-community in-degree’) that a given node is retweeted by nodes that belong to communities other than the one that includes that node and the number of retweets (producing an ‘intra-community in-degree’) by nodes that belong to the same community as that node. We then compute the ratio of the inter-community in-degree to the intra-community in-degree.

In Table 2, we list the ratio of the inter-community in-degree to intra-community in-degree for the twenty nodes with the largest in-degrees in our retweet network. This ratio tends to be significantly larger for nodes that belong to Left-leaning communities (their range is 0.18–0.43, with a mean of $0.27$ ) than for nodes that belong to Right-leaning communities (their range is 0.036–0.078, with a mean of $0.051$ ).
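The ratio of inter-community in-degree to intra-community in-degree is straightforward to compute once a community assignment is available. The Python sketch below is an illustration with made-up data. The edge direction (retweeter to retweeted), the handling of a zero intra-community in-degree, and all account and community names are assumptions.

```python
def inter_intra_ratio(edges, community, node):
    """Ratio of retweets of `node` from other communities to retweets from its own.

    edges: list of (retweeter, retweeted) pairs, one pair per retweet.
    community: dict that maps every account to a community label.
    Returns infinity if the node is never retweeted from within its own community.
    """
    inter = 0
    intra = 0
    for source, target in edges:
        if target != node:
            continue
        if community[source] == community[target]:
            intra += 1
        else:
            inter += 1
    return inter / intra if intra else float("inf")


# Hypothetical toy data: "hub_account" is retweeted four times from its own
# community and twice from another one.
toy_community = {"hub_account": "A", "a1": "A", "a2": "A", "a3": "A", "a4": "A",
                 "b1": "B", "b2": "B"}
toy_edges = [("a1", "hub_account"), ("a2", "hub_account"), ("a3", "hub_account"),
             ("a4", "hub_account"), ("b1", "hub_account"), ("b2", "hub_account"),
             ("b1", "b2")]
ratio = inter_intra_ratio(toy_edges, toy_community, "hub_account")
print(ratio)
```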

5 Media-preference assortativity

In this section, we quantify the extent to which retweets occur between nodes with similar media preferences. We do this by examining assortativity in our retweet network in terms of both the value and the sign of the media PCA score. Large assortativity indicates that the Charlottesville conversation largely splits according to media PCA score. Combined with our results of Section 6 (in which we compare tweet content on the Left and the Right), this gives a way to assess the extent of polarization in the Twitter conversation.

To examine homophily in media-preference scores in the Twitter conversation about #Charlottesville, we measure media-preference assortativity by computing the Pearson correlation coefficient of the media PCA score for nodes in the retweet network. Specifically, we compute the correlation of the media PCA score for dyads (i.e., nodes that are adjacent to each other via an edge) in the retweet network. We ignore edge weights, and we restrict our calculations to dyads for which we have a PCA score for both nodes. There are 93,521 such pairs.

The correlation coefficient of the media PCA scores is $\rho\approx 0.67$ . For comparison, as a null model, we compute the correlation-coefficient distribution for 100,000 random permutations of the PCA scores of the nodes. Specifically, in each realization, we fix the network and assign the PCA scores uniformly at random to the nodes for which PCA scores were available originally. (For an alternative approach for examining assortativity, see [87].) The resulting distribution for the correlation coefficient $\rho$ appears to be approximately Gaussian, with a mean of $-1.29\times 10^{-5}$ and a standard deviation of $0.0033$ . The z-score for the measured correlation coefficient of $0.67$ is larger than $203$ , indicating that the retweet network has a statistically significant media-preference assortativity.
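The permutation null model can be sketched as follows. This Python illustration uses far fewer permutations than the 100,000 reported above, a fixed random seed, and a small synthetic network. The helper names, the toy scores, and the choice of 500 permutations are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dyad_correlation(edges, score):
    """Correlate the scores at the two ends of each edge (edge weights ignored)."""
    return pearson([score[u] for u, _ in edges], [score[v] for _, v in edges])


def permutation_z_score(edges, score, n_permutations=500, seed=0):
    """Compare the observed correlation with score permutations on a fixed network."""
    rng = random.Random(seed)
    observed = dyad_correlation(edges, score)
    nodes = list(score)
    values = [score[v] for v in nodes]
    null = []
    for _ in range(n_permutations):
        rng.shuffle(values)  # reassign the same scores uniformly at random
        null.append(dyad_correlation(edges, dict(zip(nodes, values))))
    null_mean, null_std = mean(null), pstdev(null)
    return observed, null_mean, null_std, (observed - null_mean) / null_std


# Hypothetical toy network: two rings of like-minded accounts and two cross edges.
left = [f"L{i}" for i in range(20)]
right = [f"R{i}" for i in range(20)]
toy_score = {name: -1.0 - 0.02 * i for i, name in enumerate(left)}
toy_score.update({name: 1.0 + 0.02 * i for i, name in enumerate(right)})
toy_edges = [(left[i], left[(i + 1) % 20]) for i in range(20)]
toy_edges += [(right[i], right[(i + 1) % 20]) for i in range(20)]
toy_edges += [(left[0], right[0]), (right[5], left[5])]
observed, null_mean, null_std, z = permutation_z_score(toy_edges, toy_score)
print(round(observed, 2))
```

Keeping the network fixed and shuffling only the scores preserves both the degree sequence and the score distribution, so the z-score isolates how the scores are arranged on the network.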

We also compute the assortativity coefficient $r$ that was introduced by Newman [88, 89]. Suppose that there are $g$ types of nodes in the network. Following [89], we calculate

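$$r=\frac{\sum_{\ell=1}^{g}e_{\ell\ell}-\sum_{\ell=1}^{g}a_{\ell}b_{\ell}}{1-\sum_{\ell=1}^{g}a_{\ell}b_{\ell}}\,, \tag{5.4}$$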

where $e_{\ell s}$ is the fraction of the edges in a network that emanate from a node of type $\ell$ and terminate at a node of type $s$ , the quantity $a_{\ell}=\sum_{s=1}^{g}e_{\ell s}$ is the fraction of the edges that emanate from a node of type $\ell$ , and $b_{s}=\sum_{\ell=1}^{g}e_{\ell s}$ is the fraction of edges that terminate at a node of type $s$ .

To calculate (5.4) for our retweet network, we classify nodes according to the sign of their media PCA score. In the largest weakly connected component of the retweet network, we have PCA scores for 78,339 nodes, of which 44,797 (i.e., 57% of them) have a negative media PCA score. The resulting assortativity coefficient is $r\approx 0.80$ . We show the mixing matrix $e$ in Table 3. As a comparison, Newman [89] calculated an assortativity coefficient of $0.62$ by ethnicity for the sexual-partner network that was described in [90].
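For readers who wish to compute such a coefficient themselves, the following Python sketch evaluates Newman's assortativity coefficient, $r=(\sum_{\ell}e_{\ell\ell}-\sum_{\ell}a_{\ell}b_{\ell})/(1-\sum_{\ell}a_{\ell}b_{\ell})$, from a matrix of edge counts between node types. The counts below are made up for illustration and are not the values in Table 3.

```python
def assortativity(edge_counts):
    """Newman's assortativity coefficient from a g-by-g matrix of edge counts.

    edge_counts[l][s] is the number of edges from a node of type l to a node of type s.
    """
    g = len(edge_counts)
    total = sum(sum(row) for row in edge_counts)
    e = [[count / total for count in row] for row in edge_counts]
    a = [sum(e[l]) for l in range(g)]                       # edges that emanate from type l
    b = [sum(e[l][s] for l in range(g)) for s in range(g)]  # edges that terminate at type s
    trace = sum(e[l][l] for l in range(g))
    expected = sum(a[l] * b[l] for l in range(g))
    return (trace - expected) / (1.0 - expected)


# Hypothetical edge counts for two node types (for example, Left and Right).
r_perfect = assortativity([[5, 0], [0, 5]])
r_random = assortativity([[1, 1], [1, 1]])
r_mixed = assortativity([[40, 10], [10, 40]])
print(r_perfect, r_random, r_mixed)
```

The coefficient equals $1$ when all edges connect nodes of the same type and $0$ when the mixing matches what one expects from the marginal fractions $a_{\ell}$ and $b_{\ell}$ alone.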

Five of the media accounts that we used to compute the media PCA score (specifically, csmonitor, MotherJones, theblaze, FoxNews, and NPR) also appear as nodes in the retweet network $G$. Of these, FoxNews was retweeted $3049$ times, NPR was retweeted $69$ times, MotherJones was retweeted $15$ times, and csmonitor was retweeted $6$ times. Removing these media accounts from $G$ has a negligible effect on the assortativity coefficient $r$.

Although the assortativity by media PCA score in the retweet network is rather large, there are some prominent exceptions. For example, RepCurbelo and SenatorTimScott, the accounts for two Republican members of Congress (Representative Carlos Curbelo of Florida and Senator Tim Scott of South Carolina), were heavily retweeted in Left-leaning communities that we detected using modularity maximization. However, both RepCurbelo ($0.49$) and SenatorTimScott ($0.12$) have positive (i.e., Right) media PCA scores, consistent with their affiliation with the Republican party. RepCurbelo was the fourth-most retweeted account in a community from modularity maximization with a negative (i.e., Left) mean media PCA score ($-0.32$). RepCurbelo, who spoke out strongly against the events in Charlottesville [91], was retweeted by 22 accounts. We have PCA scores for 9 of these accounts, of which $4$ have media PCA scores on the Left. Similarly, SenatorTimScott was the second-most retweeted account in a Left-leaning community (with a mean media PCA score of $-0.26$) that we obtained from modularity maximization. SenatorTimScott was retweeted by $78$ accounts, and nearly half (specifically, $20$ of $43$) of the accounts that retweeted SenatorTimScott for which we have PCA scores have negative media PCA scores. We identified RepCurbelo and SenatorTimScott as accounts that warrant examination by first compiling the list of nodes that were retweeted by accounts with media PCA scores of the opposite sign and then examining this list for prominent accounts. One can further develop this approach (for example, to identify negative or mocking retweets [5]), and it may be useful in other situations for identifying accounts that generate communication across ideological or other divides.

6 Comparison of tweet content between Left and Right

In this section, we examine tweet content from Left-leaning and Right-leaning communities. Comparing word and hashtag frequency allows us to see some of the ways in which the Twitter conversation about Charlottesville differed between the Left and the Right.

We use the Python library nltk 3.3 to tokenize tweets into words and punctuation. In Table 4, we show the twenty-five most numerous words in our data set. We separately consider accounts with negative (i.e., Left) and positive (i.e., Right) media PCA scores after removing stop words. (We use stop words from the nltk Python library, and we also remove the following words when calculating word counts: ‘t’, ‘https’, ‘co’, ‘RT’, ‘s’, ‘amp’, ‘n’, ‘w’, and ‘c’.) We do not stem the words in our data set, and we treat different capitalizations as different words. We find some overlap between the Left and Right data sets. For example, tweets related to ‘Trump’ were very common regardless of media PCA score. ‘Barcelona’ was also one of the most numerous words in tweets by both the Left and the Right. (A van attack in that city on 17 August 2017 killed $13$ individuals, as of the time of data collection, and injured more than $100$ others; one of the wounded individuals died later from their injuries, after the time of data collection.) However, as we can see from the words in Table 4, there are also many differences in the words that were used by the Left and the Right. We illustrate such differences by coloring the relevant words. For example, ‘Obama’ was the third-most numerous word in tweets that were sent by nodes with positive media PCA scores, but it was not in the top one hundred for nodes with negative media PCA scores. Additionally, ‘Nazi’ appeared commonly in tweets from the Left, but it did not appear often in tweets from the Right. By contrast, the words ‘Antifa’ and ‘MSM’ were used often by the Right but not by the Left.
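The word-counting step can be sketched as follows. To keep the example self-contained, this Python code replaces the nltk tokenizer with a simple regular expression and the nltk stop-word list with a tiny hypothetical one, so it only approximates the procedure that is described above. The example tweets are made up. Consistent with the description above, the sketch does not stem words and treats different capitalizations as different words.

```python
import re
from collections import Counter

# Additional tokens that are removed when calculating word counts (from the text above).
EXTRA_REMOVED = {"t", "https", "co", "RT", "s", "amp", "n", "w", "c"}

# Hypothetical stand-in for the nltk English stop-word list.
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def word_counts(tweets, stop_words=TOY_STOP_WORDS):
    """Count case-sensitive, unstemmed word tokens after removing stop words."""
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"\w+", text):  # regex stand-in for the nltk tokenizer
            if token in stop_words or token in EXTRA_REMOVED:
                continue
            counts[token] += 1
    return counts


# Made-up example tweets.
toy_tweets = [
    "RT Trump is in the news",
    "Trump and the media https://t.co/x",
    "trump rally",
]
counts = word_counts(toy_tweets)
print(counts.most_common(3))
```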

We also observe other qualitative differences between the tweet content of the Left and Right on shared common words, such as ‘Trump’ and ‘Barcelona’, in the #Charlottesville data set. The ‘Trump’ subset for which we have media PCA scores consists of 34,084 total tweets (of which about 32% are unique) from the Left and 18,791 total tweets (of which about 23% are unique) from the Right. (Aside from ‘Charlottesville’, which comes directly from the hashtag that we used to generate the data set, the word ‘Trump’ was the most common word for both Left and Right. Duplicate tweets arise, for example, from retweets; it is also possible that multiple accounts independently posted identical content that was not retweeted.) As we show in the left set of columns of Table 5, the Left and Right conversations about Trump differ markedly from each other. The ‘Barcelona’ subset consists of 4779 tweets (of which 1672 are unique) from the Left and 7669 tweets (of which 1401 are unique) from the Right. In the right set of columns of Table 5, we show the twenty most numerous words for the Barcelona subset for both Left and Right. For the Right, our examination of the most heavily retweeted tweets suggests that much of the discussion about ‘Barcelona’ in our data set involves comparing media coverage of the Charlottesville and Barcelona attacks. On the Left, some of the heavily retweeted tweets about ‘Barcelona’ centered on comparing Trump’s reaction to the Barcelona and Charlottesville attacks.

We also implement an idea from Gentzkow and Shapiro [92], who used a chi-square statistic to analyze the different phrase usage of Democrats and Republicans in Congressional speeches. We apply their approach to words in tweets from the Left and Right in the ‘Trump’ subset (specifically, using equation (1) in [92] with ‘phrases’ that consist of a single word) and find that the five words (which include ‘Nazis’, ‘antifa’, and ‘Vice’) with the largest chi-square values were also among the most common words in the ‘Trump’ subset (see Table 5). Therefore, we observe some consistency in results across different methods.
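As a rough illustration of this type of statistic, the following Python sketch computes the standard Pearson chi-square statistic for a $2\times 2$ table that compares how often a word is used by the Left and by the Right. We do not claim that it reproduces equation (1) in [92] exactly; in particular, the normalization may differ. The function name and the toy counts are assumptions.

```python
def chi_square(word_left, word_right, total_left, total_right):
    """Pearson chi-square statistic for a 2x2 table of word usage.

    word_left, word_right: occurrences of the word in Left and Right tweets.
    total_left, total_right: total word occurrences in Left and Right tweets.
    """
    a, b = word_left, word_right
    c, d = total_left - word_left, total_right - word_right
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: one word that both sides use at the same rate
# and one word that is used almost exclusively by one side.
same_rate = chi_square(10, 20, 100, 200)
one_sided = chi_square(50, 2, 100, 200)
print(same_rate, one_sided)
```

A word that both sides use at the same rate gives a statistic of $0$, whereas a word that one side uses much more often gives a large value.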

We also use hashtags to compare tweets between the Left and Right communities. In Figure 6, we show the most numerous hashtag for each community, together with the community’s mean media PCA score. (We neglect hashtags that include ‘Charlottesville’, as our data collection was based on the #Charlottesville hashtag.) On the Left, the most numerous hashtag is #Trump (in $13$ of $35$ communities), followed by #HeatherHeyer (in $5$ of $35$ communities, if we include a single community with ‘#HeatherHayer’) and then #Barcelona (in $4$ of $35$ communities). Other top hashtags include #ExposeTheAltRight, #DumpTrump, #FightRacism, and #DisarmHate. On the Right, #Barcelona is the most numerous hashtag (in $3$ of $6$ communities, if we include a single community with ‘#Barcellona’). Other top hashtags on the Right are #UniteTheRight (from a community with an account of large in-degree whose Twitter handle references an influential account that identifies with the alt-right [81]) and #fakenews (from a community with an account of large in-degree whose handle references a well-known white-supremacist hate symbol).

The fractions of tweets with unique content also differ between the Left and Right. Nodes with negative media PCA scores posted a total of 112,314 tweets, of which 42,458 (i.e., about 38%) were unique. Nodes with positive media PCA scores posted a total of 92,575 tweets, of which 22,462 (i.e., about 24%) were unique. We also observe a larger fraction of original content in the Left than in the Right when restricting to several specific topics, including ‘Trump’ (32% for the Left versus 23% for the Right), ‘Barcelona’ (35% for the Left versus 18% for the Right), ‘MSM’ (21% for the Left versus 8% for the Right), ‘Obama’ (31% for the Left versus 7% for the Right), and ‘Antifa’ (50% for the Left versus 22% for the Right). However, the proportion of unique tweets is slightly larger for the Right among tweets that include the word ‘Nazi’ (21% for the Left versus 26% for the Right).

7 Conclusions and discussion

Our study of the Twitter conversation about #Charlottesville illustrates that (1) one can reasonably characterize nodes in terms of their media followership and a one-dimensional PCA-based Left/Right orientation score, (2) the Charlottesville retweet network is highly polarized with respect to this measure of Left/Right orientation, and (3) communities in the retweet network are largely homogeneous in their Left/Right node composition. Our findings thus indicate that, with a few exceptions, the Twitter conversation largely split along ideological lines, as measured by the media preference of Twitter accounts.

As we just summarized, our investigation illustrates strong polarization in the Twitter conversation about #Charlottesville. We found that media followership on Twitter is informative and that the #Charlottesville retweet network is strongly assortative with respect to a corresponding PCA-based Left/Right orientation score. Our finding of positive assortativity with respect to media preference on Twitter is also consistent with previous studies of Twitter data [10, 8, 38, 39]. However, in contrast with these previous studies, our approach to node characterization does not require text analysis or labeled data for training. Characterizing nodes via a principal component analysis of media followership is simple, easy to interpret, and provides a valuable complement to characterizing nodes based on the content of their tweets. Because the #Charlottesville retweet network is strongly assortative with respect to media preference, it is a potentially useful indicator of marked polarization on Twitter about the ‘Unite the Right’ rally and its aftermath. However, whether differences in media preferences are a cause or an effect (or both) of assortativity on social media is not something that our approach allows us to conclude.

Polarization is also evident in the community structure of the retweet network, as the communities are highly segregated in terms of their Left/Right node composition. The Left has a larger proportion of tweets with original content (as opposed to retweets) than the Right, and nodes with large hub scores tended to retweet nodes on the Left rather than those on the Right. We also found that modularity maximization detects Left communities with central nodes from disparate focal areas (such as business, media, entertainment, and politics). A robust feature of our community-detection results is that there are more large Left-leaning communities than Right-leaning ones. We also found that Left nodes with high in-degree had larger ratios of inter-community in-degrees to intra-community in-degrees, suggesting that heavily retweeted nodes on the Left were more likely to be retweeted by communities other than their own. Taken together, these findings suggest that Twitter accounts on the Left that condemned the ‘Unite the Right’ rally in Charlottesville and Trump’s handling of the aftermath came from broad segments of society. Support on the Right in our data was concentrated in fewer communities, the largest of which includes the five most heavily retweeted accounts on the Right: FoxNews and right-wing personas DineshDSouza, pastormarkburns, larryelder, and johncardillo.

An important limitation of our study is that Twitter users are not a representative sample of the general population [42], and hashtag sampling introduces its own set of biases [5]. Moreover, Twitter usage and the propensity to tweet political content may differ with political affiliation [10]. Consequently, it is important to compare our findings from Twitter to offline information. Our findings are consistent with a Quinnipiac poll that suggested that nearly one third of Republicans (but only 4% of Democrats) considered counterprotesters to be more to blame than white supremacists for the violence at Charlottesville [93]. We observe that several of the communities on the Right that we obtained from modularity maximization of the #Charlottesville retweet network also appear to reflect core participants of the ‘Unite the Right’ rally [1], as indicated by references from central nodes in these communities to white-supremacist hate symbols or influential personalities in the alt-right.

Our investigation illustrated a stark distinction between Left and Right when we examined tweets that include the word ‘Trump’, with criticism on the Left versus support on the Right (see Table 5). For example, the most numerous hashtags from the Left in tweets that include ‘Trump’ were #Impeachment and #ImpeachTrump; by contrast, the most numerous hashtags from accounts in communities on the Right were #Barcelona, #MAGA, and #fakenews. Our findings are consistent with the extreme polarization and political tribalism in United States society that have been described by other studies [94, 95, 96]. Such societal divisions are apparent on Twitter, as documented both by the present study and by prior ones [10, 8, 38, 39], including recent work that suggested that polarization on Twitter is increasing over time [28].

It is also important to examine the role that fully automated accounts (‘bots’) and partially automated accounts (which have been dubbed ‘cyborgs’ [97]) play in shaping conversations (especially political ones) on Twitter and other social-media platforms [98, 99, 97, 48, 100, 101, 102]. Although an in-depth analysis of the role of bots in the #Charlottesville discussion is beyond the scope of the present paper, it is likely that many bot accounts are present in our data set. For example, it has been noted that automated naming schemes are an indicator of bot accounts [98], and both account names that end in sequences of eight digits and account names that consist of hexadecimal strings occur in the #Charlottesville data. Detailed investigation of these accounts and their behavior is an important topic for future work. Sockpuppet accounts (i.e., false accounts that are operated by an entity [103]), such as those that are operated by the Internet Research Agency in St. Petersburg, Russia [104, 105], can also play important roles in content propagation and thus warrant further investigation. Antipathy and distrust across party lines can provide opportunities for actors who seek to fan societal divisions. For example, our data set includes tweets by prominent accounts that were operated by the Internet Research Agency [104, 105] and attacked both the Left and the Right.
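
The two naming patterns described above can be flagged with simple regular expressions. The sketch below is only a heuristic filter for selecting accounts for closer inspection; it is not a bot classifier and not a procedure from this study. The function name `naming_flags` is hypothetical, and the minimum length of 8 characters for a 'hexadecimal string' is an assumed threshold that we introduce only for illustration.

```python
import re

# Name ends in exactly the pattern "eight ASCII digits" (more digits also match).
EIGHT_DIGIT_SUFFIX = re.compile(r"[0-9]{8}$")

# Assumption: a 'hexadecimal string' name consists entirely of hex characters.
# The minimum length of 8 is a hypothetical threshold, not taken from the paper.
HEX_NAME = re.compile(r"[0-9a-fA-F]{8,}")


def naming_flags(screen_name):
    """Return which of the two naming patterns a screen name matches."""
    return {
        "eight_digit_suffix": EIGHT_DIGIT_SUFFIX.search(screen_name) is not None,
        "hex_string": HEX_NAME.fullmatch(screen_name) is not None,
    }


# Toy usage with made-up screen names.
flags = [naming_flags(name) for name in ("jane12345678", "a1b2c3d4e5", "FoxNews")]
```

Both patterns also match many accounts that are operated by people, so any flagged account still requires the detailed investigation of behavior that the text describes.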

Our approach of using media choice and PCA to characterize nodes is simple and does not require labeled data, which facilitates its application to other topics. For example, it would be interesting to apply our approach to examine polarization on other topics (e.g., Brexit) and to see how polarization across political divides and attempts to bridge them change over time. It is not clear whether engagement with Twitter accounts with different viewpoints will decrease or increase polarization on divisive topics. For example, the empirical results of Bail et al. [32] suggest that exposure on Twitter to contrasting ideologies can lead to increased polarization. An interesting question is how exposure shapes viewpoints of individuals with ‘centrist’ media preferences or ideologies. Our investigation focused primarily on the sign of a media PCA score, but the underlying media PCA score is continuous, and one can use it to examine media preferences in a more nuanced way. In particular, characterization of accounts with moderate (‘centrist’) media PCA scores, study of network structure and tweet content by these nodes, and tracking the evolution of these characteristics over time are both feasible and relevant.
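
The continuous score discussed above can be illustrated with a small sketch that computes first-principal-component scores from a binary account-by-outlet followership matrix. This is not the analysis code for this study. It uses power iteration in pure Python in place of a library PCA routine, it centers columns but does not rescale them (an assumption about preprocessing), and the function name `first_pc_scores` is hypothetical. The overall sign of a principal component is arbitrary, so deciding which sign corresponds to 'Left' or 'Right' requires inspecting the outlet loadings.

```python
import math


def first_pc_scores(follows, iters=100):
    """Scores of each account on the first principal component.

    follows: list of equal-length 0/1 rows (accounts x media outlets),
             where 1 means that the account follows that outlet.
    """
    n, m = len(follows), len(follows[0])
    means = [sum(row[j] for row in follows) / n for j in range(m)]
    # Column-centered data matrix (assumption: centering only, no rescaling).
    X = [[row[j] - means[j] for j in range(m)] for row in follows]

    # Power iteration on X^T X; a non-uniform start avoids trivial symmetry.
    v = [float(j + 1) for j in range(m)]
    start_norm = math.sqrt(sum(c * c for c in v))
    v = [c / start_norm for c in v]
    for _ in range(iters):
        Xv = [sum(x * c for x, c in zip(row, v)) for row in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # no variance along v; stop early
            break
        v = [c / norm for c in w]

    # Project each centered row onto the leading direction.
    return [sum(x * c for x, c in zip(row, v)) for row in X]


# Toy usage: two hypothetical groups of accounts that follow disjoint outlets.
toy_follows = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
toy_pc1 = first_pc_scores(toy_follows)
```

Accounts with scores near zero are the candidates for the moderate (‘centrist’) media preferences that the text proposes to study.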

It is possible to refine our approach in various ways. For example, Landgraf and Lee [62] used a type of PCA that is built for Boolean data, and it may be insightful to compare our PCA results to ones that employ their approach. Examining additional principal components (PCs) besides the first one (on which we focused exclusively in the present paper) may also be helpful for understanding the diversity of interactions between media and other accounts on Twitter. Careful analysis of additional PCs, interpretations of them, and their relationship to network structure merit further investigation.

Examining multiple ideological dimensions (e.g., as in studies of voting by legislators on bills [106]) and simultaneous analysis of multiple types of Twitter relationships using the formalism of multilayer networks [107] may also help deepen understanding of political communities and their news preferences after landmark news events. More broadly, we expect that our approach is generalizable to other contexts, and it may be helpful for examining other types of node characterizations (such as by analyzing different media outlets or types of followed accounts). Comparing network relationships and propagation of content across multiple social-media platforms is of particular interest, as the amount, diversity, and characteristics of platform usage vary across different segments of the world’s population. There have been studies of the propagation of memes [108], web addresses [109], and anti-semitic content [110] across different social-media platforms; and it is important to undertake further studies of linkages across networks and their effect on recruitment, content propagation, and public discourse.

List of abbreviations

[TABLE]

Declarations

Acknowledgements.

We thank Rob Bond, Heather Brooks, Matt Salganik, and Sam Zhang for helpful comments.

Availability of data and materials.

We have made the de-identified data for the largest connected component of the retweet network, as well as code for analyzing these data, available (at https://osf.io/487fw/) via the Open Science Framework. For privacy reasons, we do not provide any account names; for similar reasons, we do not provide tweet content.

Competing interests.

The authors declare that they have no competing interests.

Funding.

The authors have no funding sources to acknowledge for this study.

Authors’ contributions.

JHT, MCE, STC, and MAP designed the study. JHT performed the analyses. JHT, MCE, STC, and MAP wrote the paper. All authors read and approved the final manuscript.

Bibliography

[1] Fausset, R., Feuer, A.: Far-right groups surge into national view in Charlottesville. New York Times (13 August 2017). Available at https://www.nytimes.com/2017/08/13/us/far-right-groups-blaze-into-national-view-in-charlottesville.html
[2] Duggan, P., Jouvenal, J.: Neo-Nazi sympathizer pleads guilty to federal hate crimes for plowing car into protesters at Charlottesville rally. The Washington Post (27 March 2019). Available at https://www.washingtonpost.com/local/public-safety/neo-nazi-sympathizer-pleads-guilty-to-federal-hate-crimes-for-plowing-car-into-crowd-of-protesters-at-unite-the-right-rally-in-charlottesville/2019/03/27/2b947c32-50ab-11e9-8d28-f5149e5a2fda_story.html?utm_term=.738c2ac1e287
[3] Shear, M.D., Haberman, M.: Trump defends initial remarks on Charlottesville: Again blames ‘both sides’. New York Times (15 August 2017). Available at https://www.nytimes.com/2017/08/15/us/politics/trump-press-conference-charlottesville.html
[4] Daniels, J.: The algorithmic rise of the “alt-right”. Contexts 17 (1), 60–65 (2018)
[5] Tufekci, Z.: Big questions for social media Big Data: Representativeness, validity and other methodological pitfalls. In: Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, pp. 505–514 (2014)
[6] Perrin, A.: Social media usage: 2005-2015. Available at http://www.pewinternet.org/2015/10/08/social-networking-usage-2005-2015/ (2015)
[7] Tufekci, Z.: Twitter and Tear Gas: The Power and Fragility of Networked Protest. Yale University Press, New Haven, CT, USA (2017)
[8] Conover, M.D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Flammini, A., Menczer, F.: Political polarization on Twitter. In: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pp. 89–96 (2011)