A Longitudinal Assessment of the Persistence of Twitter Datasets
Arkaitz Zubiaga

TL;DR
This longitudinal study assesses how Twitter datasets degrade over time through content deletion and account deactivation. It finds that the textual content that can still be recollected remains largely representative of the original, while metadata becomes less representative, which affects research reproducibility.
Contribution
It provides the first comprehensive longitudinal analysis of Twitter dataset persistence, quantifying how content and metadata change over time and how those changes affect dataset reproducibility.
Findings
Textual content remains largely representative over time.
Metadata, such as user follower counts, degrades significantly.
Dataset availability decreases as datasets age.
Abstract
Social media datasets are increasingly shared by researchers, but with the caveat that those datasets are not always completely replicable. Having to adhere to the requirements of platforms like Twitter, researchers cannot release the raw data and instead have to release a list of unique identifiers, which others can then use to recollect the data from the platform themselves. This leads to the problem that subsets of the data may no longer be available, as content can be deleted or user accounts deactivated. To quantify the impact of content deletion on the replicability of datasets in the long term, we perform a longitudinal analysis of the persistence of 30 Twitter datasets, which include over 147 million tweets. Having the original datasets collected between 2012 and 2016, and recollecting them later by using the tweet IDs, we look at four different factors that quantify…
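As a rough illustration of the recollection setup described in the abstract, the sketch below computes the share of released tweet IDs that could still be retrieved at a later date. It is only a minimal example under stated assumptions: the function name `availability_ratio` and the sample IDs are hypothetical, and this ratio is not claimed to be one of the four factors the paper measures.

```python
from typing import Iterable


def availability_ratio(original_ids: Iterable[str],
                       recollected_ids: Iterable[str]) -> float:
    """Fraction of originally released tweet IDs that were still retrievable.

    Hypothetical helper for illustration; not taken from the paper.
    """
    original = set(original_ids)
    if not original:
        raise ValueError("original_ids must not be empty")
    # Count only IDs that were part of the original release.
    still_available = original & set(recollected_ids)
    return len(still_available) / len(original)


# Made-up IDs, for illustration only.
released = ["1001", "1002", "1003", "1004"]
recollected = ["1001", "1003", "1004"]
print(availability_ratio(released, recollected))  # 0.75
```

Using sets means duplicate IDs are counted once, and any recollected ID that was not in the original release is ignored.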
