EPIC30M: An Epidemics Corpus Of Over 30 Million Relevant Tweets
Junhua Liu, Trisha Singhal, Lucienne T.M. Blessing, Kristin L. Wood, and Kwan Hui Lim

TL;DR
EPIC30M is a large-scale, publicly available Twitter corpus of over 30 million tweets related to various epidemics from 2006 to 2020, supporting cross-epidemic analysis and research.
Contribution
This paper introduces EPIC30M, the largest epidemic-related Twitter corpus to date, enabling cross-epidemic pattern recognition and trend analysis for diverse research applications.
Findings
Contains over 30 million tweets posted between 2006 and 2020.
Covers three general diseases (Ebola, Cholera, and Swine Flu) and six outbreaks.
Supports cross-epidemic analysis and modeling.
Abstract
Since the start of COVID-19, several relevant corpora from various sources, each containing millions of data points, have been presented in the literature. While these corpora are valuable in supporting many analyses of this specific pandemic, researchers require additional benchmark corpora that cover other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our other efforts on COVID-19-related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. In this paper, we present EPIC30M, a large-scale epidemic corpus that contains 30 million micro-blog posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7…
Taxonomy
Topics: Misinformation and Its Impacts · Data-Driven Disease Surveillance · Sentiment Analysis and Opinion Mining
