Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement
Karishma Sharma, Emilio Ferrara, Yan Liu

TL;DR
This paper presents a novel framework for constructing large-scale misinformation datasets from social media by leveraging weak labels from news sources and refining them through model-guided techniques, reducing human effort.
Contribution
It introduces a label refinement framework that uses self-training and social context to improve misinformation dataset quality with minimal human intervention.
Findings
Effective at identifying and correcting inaccurate labels.
Successfully applied to a COVID-19 vaccine misinformation dataset.
Achieves large-scale dataset construction with reduced manual labeling.
Abstract
Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate this content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level where the stance of the user does not align with the news source or article credibility. We propose a framework to use a detection model self-trained on the initial weak labels with uncertainty sampling based on entropy in…
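The abstract is cut off before it says how the entropy scores are used, so the following is only a generic sketch of entropy-based uncertainty sampling, not the authors' procedure. It assumes a binary detector that outputs class probabilities per post; the function names, post ids, and probability values are all hypothetical and chosen for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k posts with the highest predictive entropy.

    predictions: dict mapping a post id to its class-probability list.
    """
    ranked = sorted(
        predictions,
        key=lambda post_id: predictive_entropy(predictions[post_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: [P(reliable), P(misinformation)] per post.
predictions = {
    "post_a": [0.98, 0.02],
    "post_b": [0.55, 0.45],
    "post_c": [0.10, 0.90],
    "post_d": [0.50, 0.50],
}

# Posts whose weak labels the model is least sure about.
uncertain = select_most_uncertain(predictions, k=2)
print(uncertain)
```

Posts whose predicted probabilities sit near 0.5 receive the highest entropy and are ranked first. One plausible use is as candidates for label review or relabeling, though the truncated abstract does not confirm this.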
Taxonomy
Topics: Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection · Spam and Phishing Detection
