Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts
Felix Drinkall, Stefan Zohren, Janet B. Pierrehumbert

TL;DR
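This paper uses transformer-based language models to turn Reddit posts from US state COVID-19 subreddits into clustered embedding features for predicting COVID-19 case trends. These features outperform all other benchmarked feature types at predicting upward trend signals, which matters for areas where epidemiological data is unreliable.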
Contribution
It integrates social media text embeddings into infectious disease forecasting and shows that the resulting clustered features outperform features extracted from other datasets at predicting upward trends.
Findings
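Clustered embedding features outperform all other benchmarked feature types at predicting upward trend signals in a threshold-classification task.
This result is significant for infectious disease modelling in areas where epidemiological data is unreliable.
In a time-series forecasting task, supplementary datasets are compared as covariate feature sets for a transformer-based time-series model.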
Abstract
We present a novel approach incorporating transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within specific US states' COVID-19 subreddits. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task we fully utilise the predictive power of the caseload and compare the relative strengths of using different supplementary datasets as covariate feature sets in a transformer-based time-series model.
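The feature construction described in the abstract can be illustrated with a minimal sketch. It assumes each post has already been embedded and assigned a cluster label by some density-based clustering step (with `-1` marking noise); the abstract does not specify the clustering algorithm, the time window, or the threshold, so the weekly window, the relative-change threshold, and the function names `cluster_share_features` and `upward_trend_labels` are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def cluster_share_features(posts, n_clusters):
    """Turn (week, cluster_id) pairs into per-week cluster-share vectors.

    posts: iterable of (week_index, cluster_id); cluster_id == -1 means
    the post was treated as noise by the (assumed) clustering step.
    Returns {week_index: [share of that week's posts in cluster 0..n-1]}.
    """
    counts = defaultdict(Counter)
    for week, cluster in posts:
        counts[week][cluster] += 1

    features = {}
    for week, counter in counts.items():
        total = sum(counter.values())  # noise posts stay in the denominator
        features[week] = [counter[k] / total for k in range(n_clusters)]
    return features


def upward_trend_labels(cases, threshold):
    """Label each week 1 if the next week's caseload rises by more than
    `threshold` (relative change), else 0. The threshold is an assumption."""
    labels = []
    for now, nxt in zip(cases, cases[1:]):
        change = (nxt - now) / now if now > 0 else 0.0
        labels.append(1 if change > threshold else 0)
    return labels


# Toy data: six posts over two weeks, three labelled clusters or noise.
posts = [(0, 0), (0, 0), (0, 1), (0, -1), (1, 1), (1, 1)]
features = cluster_share_features(posts, n_clusters=2)
labels = upward_trend_labels([100, 120, 121, 90], threshold=0.1)
```

The per-week share vectors would then serve as inputs to a classifier predicting the upward-trend labels, or as covariates alongside the caseload in a time-series model.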
Taxonomy
Topics: Misinformation and Its Impacts · Data-Driven Disease Surveillance
