Assessing the Bias in Communication Networks Sampled from Twitter
Sandra González-Bailón, Ning Wang, Alejandro Rivero, Javier Borge-Holthoefer, Yamir Moreno

TL;DR
This paper compares two Twitter data sampling methods (the search and stream APIs) during a political protest. It finds that the search API over-represents central users and under-represents peripheral activity, and it argues for more uniform sampling procedures.
Contribution
It provides an empirical analysis of sampling biases in Twitter communication networks, highlighting differences between search and stream APIs during a real-world event.
Findings
Search API over-represents central users
Bias is greater in mention networks
Sampling bias affects diffusion and collective action studies
Abstract
We collect and analyse messages exchanged in Twitter using two of the platform's publicly available APIs (the search and stream specifications). We assess the differences between the two samples, and compare the networks of communication reconstructed from them. The empirical context is given by political protests taking place in May 2012: we track online communication around these protests for the period of one month, and reconstruct the network of mentions and re-tweets according to the two samples. We find that the search API over-represents the more central users and does not offer an accurate picture of peripheral activity; we also find that the bias is greater for the network of mentions. We discuss the implications of this bias for the study of diffusion dynamics and collective action in the digital era, and advocate the need for more uniform sampling procedures in the study of…
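The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `(author, text)` tweet representation, and the toy data are all hypothetical, and in-degree is used here as a simple stand-in for centrality. The sketch rebuilds a mention network from each of two samples and reports, per user, how much of the in-degree seen in a reference sample is recovered by the other sample.

```python
import re
from collections import Counter

# Hypothetical pattern: an "@" followed by word characters is treated as a mention.
MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Count directed (author, mentioned_user) edges in a list of (author, text) pairs."""
    edges = Counter()
    for author, text in tweets:
        source = author.lower()
        for target in MENTION.findall(text):
            target = target.lower()
            if target != source:  # ignore self-mentions
                edges[(source, target)] += 1
    return edges

def in_degree(edges):
    """Number of distinct users mentioning each target."""
    degree = Counter()
    for _source, target in edges:
        degree[target] += 1
    return degree

def coverage_by_user(sample, reference):
    """Share of each user's reference in-degree that also appears in `sample`."""
    sample_deg = in_degree(mention_edges(sample))
    reference_deg = in_degree(mention_edges(reference))
    return {user: sample_deg[user] / deg for user, deg in reference_deg.items()}

# Toy data, invented for illustration only.
reference_sample = [
    ("a", "hi @hub"),
    ("b", "@hub go"),
    ("c", "@hub @d ok"),
    ("e", "@d yes"),
]
other_sample = reference_sample[:3]  # misses the tweet from peripheral user "e"

coverage = coverage_by_user(other_sample, reference_sample)
print(coverage)
```

In this toy case the heavily mentioned user is fully recovered while the less mentioned one is only partly recovered, which is the kind of uneven coverage the abstract describes. Which of the two real samples should serve as the reference is left open here.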
