Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement

Karishma Sharma; Emilio Ferrara; Yan Liu

arXiv:2202.12413·cs.SI·February 28, 2022

TL;DR

This paper presents a novel framework for constructing large-scale misinformation datasets from social media by leveraging weak labels from news sources and refining them through model-guided techniques, reducing human effort.

Contribution

It introduces a label refinement framework that uses self-training and social context to improve misinformation dataset quality with minimal human intervention.

Findings

01. The framework is effective at identifying and correcting inaccurate labels.

02. It was successfully applied to a COVID-19 vaccine misinformation dataset.

03. It achieves large-scale dataset construction with reduced manual labeling.

Abstract

Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate this content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts, and to refine these labels with model guidance, in order to construct large-scale, diverse misinformation-labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level when the stance of the user does not align with the news source or article credibility. We propose a framework to use a detection model self-trained on the initial weak labels with uncertainty sampling based on entropy in…
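
The abstract mentions uncertainty sampling based on entropy. As a rough illustration of that general idea only, the sketch below ranks posts by the Shannon entropy of a model's predicted class probabilities and picks the most uncertain ones. The post ids, the probabilities, and the function names `entropy` and `most_uncertain` are hypothetical and are not taken from the paper or its repository; how the paper actually uses the selected samples is not shown in this excerpt.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Terms with p == 0 contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(predictions, k):
    """Return the ids of the k items with the highest predictive entropy."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: post id -> [P(reliable), P(unreliable)].
predictions = {
    "post_a": [0.98, 0.02],
    "post_b": [0.55, 0.45],
    "post_c": [0.80, 0.20],
    "post_d": [0.50, 0.50],
}

# The two posts the model is least sure about.
uncertain = most_uncertain(predictions, k=2)
print(uncertain)
```

Entropy is highest when the predicted probabilities are closest to uniform, so posts near the decision boundary are ranked first. Those are plausible candidates for having their weak label re-examined.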

Code & Models

Repositories

usc-melady/constructing-misinformation-datasets-www-2022
none · Official

Taxonomy

Topics: Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection · Spam and Phishing Detection