BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

TL;DR
BillionCOV is a comprehensive, large-scale COVID-19 tweet dataset with 1.4 billion tweets from around the world, designed to enable efficient data hydration and address issues of redundancy and data loss in prior datasets.
Contribution
This paper introduces BillionCOV, a large-scale COVID-19 tweet dataset that improves data quality and hydration efficiency compared to existing datasets.
Findings
Contains 1.4 billion tweets from 240 countries and territories
Addresses redundancy and the problem of tweet identifiers that point to deleted or protected tweets
Facilitates efficient tweet hydration for researchers
Abstract
The COVID-19 pandemic introduced new norms such as social distancing, face masks, quarantine, lockdowns, travel restrictions, work/study from home, and business closures, to name a few. The pandemic's seriousness made people vocal on social media, especially on microblogs such as Twitter. Researchers have been collecting and sharing large-scale datasets of COVID-19 tweets since the early days of the outbreak. Sharing raw Twitter data with third parties is restricted; users need to hydrate the tweet identifiers in a public dataset to re-create the dataset locally. Large-scale datasets that include original tweets, retweets, quotes, and replies contain billions of tweets, which take months to hydrate. The existing datasets carry issues related to proportion and redundancy. We report that more than 500 million tweet identifiers point to deleted or protected tweets. In order to address these…
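To make the "months to hydrate" claim concrete, the following is a minimal sketch of how a list of tweet identifiers might be deduplicated, split into lookup batches, and given a rough hydration-time estimate. The batch size of 100 identifiers per request and the limit of 900 requests per 15-minute window are assumptions about a typical tweet lookup endpoint, not figures from the paper, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def unique_ids(ids: Iterable[str]) -> List[str]:
    """Drop duplicate tweet IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs (assumed per-request cap)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def estimated_days(
    n_ids: int,
    batch_size: int = 100,           # assumption: IDs accepted per lookup request
    requests_per_window: int = 900,  # assumption: requests allowed per window
    window_minutes: int = 15,        # assumption: rate-limit window length
) -> float:
    """Rough lower bound on hydration time in days for a single credential."""
    requests = -(-n_ids // batch_size)  # ceiling division
    windows = requests / requests_per_window
    return windows * window_minutes / (60 * 24)


if __name__ == "__main__":
    ids = unique_ids(["101", "102", "101", "103"])
    print(list(batches(ids, size=2)))
    print(round(estimated_days(1_400_000_000)))
```

Under these assumed limits, 1.4 billion identifiers work out to about 162 days of continuous requests for one credential, which is why removing redundant identifiers and identifiers of deleted or protected tweets before hydration matters.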
Taxonomy
Topics: Misinformation and Its Impacts
