When is it Biased? Assessing the Representativeness of Twitter's Streaming API
Fred Morstatter, Jürgen Pfeffer, Huan Liu

TL;DR
This paper investigates the bias in Twitter's Streaming API data by comparing hashtag trends with true activity, proposing a method to detect bias using open data sources without needing the Firehose, and evaluating its effectiveness in various scenarios.
Contribution
It introduces a new approach to identify bias in Twitter Streaming API data without relying on the Firehose, using open data sources to compare hashtag trends.
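The idea of comparing hashtag trends across two sources can be illustrated with a minimal sketch. This is not the authors' method, whose details are not given here. It assumes two aligned lists of per-interval counts for one hashtag: one from the Streaming API sample and one from some open reference source. It flags windows where the fitted trend directions disagree. The function names, the window size, and the sample numbers are all hypothetical.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def disagreeing_windows(stream_counts, reference_counts, window=3):
    """Return start indices of non-overlapping windows in which the
    hashtag's trend in the sample and in the reference point in
    opposite directions (hypothetical bias signal, for illustration)."""
    if len(stream_counts) != len(reference_counts):
        raise ValueError("series must be aligned and of equal length")
    flagged = []
    for start in range(0, len(stream_counts) - window + 1, window):
        s = slope(stream_counts[start:start + window])
        r = slope(reference_counts[start:start + window])
        if s * r < 0:  # one series rising while the other falls
            flagged.append(start)
    return flagged


# Made-up counts purely for demonstration.
stream = [10, 12, 15, 14, 9, 5, 6, 8, 11]
reference = [100, 120, 150, 150, 170, 200, 60, 80, 110]
print(disagreeing_windows(stream, reference))
```

Comparing trend direction rather than raw volume is a deliberate choice in this sketch: a 1% sample and a reference source differ in scale, so only the shape of the curve is comparable.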
Findings
Effective detection of bias in Streaming API data
Method works in sparse data situations
Applicable across different regions and queries
Abstract
Twitter has captured the interest of the scientific community not only for its massive user base and content, but also for its openness in sharing its data. Twitter shares a free 1% sample of its tweets through the "Streaming API", a service that returns a sample of tweets according to a set of parameters set by the researcher. Recently, research has pointed to evidence of bias in the data returned through the Streaming API, raising concerns about the integrity of this data service for use in research scenarios. While these results are important, the methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. In this work we tackle the problem of finding sample bias without the need for "gold standard" Firehose data. Namely, we focus on finding time periods in the Streaming API data where the trend of a hashtag is…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Complex Network Analysis Techniques · Data-Driven Disease Surveillance
