Auditing and Robustifying COVID-19 Misinformation Datasets via Anticontent Sampling
Clay H. Yoo, Ashiqur R. KhudaBukhsh

TL;DR
This paper evaluates COVID-19 misinformation datasets for real-world robustness and introduces an active learning method that enhances classifier resilience against diverse anticontent without manual labeling.
Contribution
It highlights the limited diversity in existing datasets and proposes a novel anticontent sampling pipeline to improve classifier robustness.
Findings
Models trained on existing datasets are vulnerable to anticontent in real-world scenarios.
The proposed active learning pipeline effectively augments training data with challenging anticontent.
Classifiers become more robust after applying the anticontent sampling method.
Abstract
This paper makes two key contributions. First, it argues that highly specialized rare content classifiers trained on small data typically have limited exposure to the richness and topical diversity of the negative class (dubbed anticontent) as observed in the wild. As a result, these classifiers' strong performance observed on the test set may not translate into real-world settings. In the context of COVID-19 misinformation detection, we conduct an in-the-wild audit of multiple datasets and demonstrate that models trained with several prominently cited recent datasets are vulnerable to anticontent when evaluated in the wild. Second, we present a novel active learning pipeline that requires zero manual annotation and iteratively augments the training data with challenging anticontent, robustifying these classifiers.
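The abstract does not spell out how the pipeline works, so what follows is only a minimal sketch of one way an annotation-free, iterative anticontent loop could look, not the authors' implementation. It assumes a pool of in-the-wild text that can be treated as entirely negative (for example, because of where or when it was collected), so any pool document the current classifier flags as positive is taken to be a false positive and added to the negative training set. The token log-odds classifier and the names `train`, `score`, `anticontent_sampling`, `rounds`, and `k` are all hypothetical stand-ins chosen for illustration.

```python
import math
from collections import Counter


def train(pos_docs, neg_docs):
    """Fit a tiny Laplace-smoothed token log-odds model (stand-in classifier)."""
    pos = Counter(t for d in pos_docs for t in d.lower().split())
    neg = Counter(t for d in neg_docs for t in d.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    return {
        t: math.log((pos[t] + 1) / (n_pos + v)) - math.log((neg[t] + 1) / (n_neg + v))
        for t in vocab
    }


def score(model, doc):
    """Sum of token log-odds; a score above 0 means 'predicted misinformation'."""
    return sum(model.get(t, 0.0) for t in doc.lower().split())


def anticontent_sampling(pos_docs, neg_docs, wild_pool, rounds=3, k=2):
    """Iteratively add the pool's most confident false positives as negatives.

    Assumption (not from the paper): every document in wild_pool is negative,
    so no manual labels are needed to call a flagged document a false positive.
    """
    neg_docs, pool = list(neg_docs), list(wild_pool)
    for _ in range(rounds):
        model = train(pos_docs, neg_docs)
        scored = [(score(model, d), d) for d in pool]
        # Keep the k highest-scoring pool documents that the model flags.
        hard = [d for s, d in sorted(scored, reverse=True) if s > 0][:k]
        if not hard:
            break  # the classifier no longer flags anything in the pool
        neg_docs.extend(hard)
        pool = [d for d in pool if d not in hard]
    return train(pos_docs, neg_docs), neg_docs


# Toy data, invented purely for illustration.
positives = ["vaccine causes autism", "vaccine contains microchip"]
negatives = ["wash hands often", "wear a mask"]
wild = ["vaccine appointment booked today", "football game tonight"]

baseline = train(positives, negatives)
robust, augmented_negatives = anticontent_sampling(positives, negatives, wild)
```

In this toy run, the baseline model flags the benign "vaccine appointment booked today" because it has only ever seen "vaccine" in the positive class; after one round that document joins the negatives and the retrained model no longer flags it, while the original misinformation examples still score as positive.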
Taxonomy
Topics: Misinformation and Its Impacts · COVID-19 diagnosis using AI · SARS-CoV-2 detection and testing
