Early Detection of Social Media Hoaxes at Scale
Arkaitz Zubiaga, Aiqi Jiang

TL;DR
This paper presents a semi-automated approach to detect social media hoaxes early by creating a large-scale dataset using Wikidata, enabling more effective training and evaluation of detection models.
Contribution
It introduces a novel semi-automated method leveraging Wikidata to build large datasets for early hoax detection on social media, focusing on celebrity death reports.
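The labelling idea can be illustrated with a minimal sketch. The exact rule used in the paper is not given in this summary, so everything here is an assumption: the function name `label_death_report`, the `tolerance_days` parameter, and the `death_dates` dictionary, which stands in for a lookup of Wikidata's date-of-death property (P570). A report is treated as real only if a death date is on record and the report does not precede it.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def label_death_report(
    person: str,
    report_date: date,
    death_dates: Dict[str, Optional[date]],
    tolerance_days: int = 1,
) -> str:
    """Label a death report as 'real' or 'fake' (hypothetical rule).

    death_dates is a stand-in for a knowledge-base lookup; None or a
    missing key means no death is recorded for that person.
    """
    died = death_dates.get(person)
    if died is None:
        # No recorded death: the report cannot be confirmed.
        return "fake"
    # Allow a small tolerance for time-zone and recording differences.
    if died <= report_date + timedelta(days=tolerance_days):
        return "real"
    # The report circulated well before the recorded death.
    return "fake"


# Illustrative, made-up entries (not taken from the dataset).
example_dates = {"Person A": date(2016, 1, 10), "Person B": None}
label_a = label_death_report("Person A", date(2016, 1, 10), example_dates)
label_b = label_death_report("Person B", date(2016, 1, 10), example_dates)
```

Because the label comes from a knowledge-base lookup rather than manual annotation, this kind of rule scales to thousands of reports, which is the point of the semi-automated approach.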
Findings
Achieved F1 scores near 72% within 10 minutes of the first tweet.
Created a dataset with over 13 million tweets and 4,007 reports.
Demonstrated the importance of training data size for early detection accuracy.
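To make the early-detection setting concrete, the sketch below keeps only the tweets posted within a fixed number of minutes of the first tweet in a report, which is the input a classifier would see at that point. The `Tweet` type and the `early_window` function are hypothetical names introduced for illustration, not taken from the paper.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# A tweet is represented here as (timestamp, text); this is an assumed format.
Tweet = Tuple[datetime, str]


def early_window(tweets: List[Tweet], minutes: int = 10) -> List[Tweet]:
    """Return the tweets posted within `minutes` of the earliest tweet."""
    if not tweets:
        return []
    ordered = sorted(tweets, key=lambda t: t[0])
    cutoff = ordered[0][0] + timedelta(minutes=minutes)
    # Keep tweets up to and including the cutoff time.
    return [t for t in ordered if t[0] <= cutoff]


# Made-up timeline for one report.
start = datetime(2016, 1, 10, 12, 0)
timeline = [
    (start + timedelta(minutes=25), "still unconfirmed?"),
    (start, "RIP ..."),
    (start + timedelta(minutes=4), "is this true?"),
]
first_ten_minutes = early_window(timeline, minutes=10)
```

Varying `minutes` gives the different observation points at which early-detection performance can be compared.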
Abstract
The unmoderated nature of social media enables the diffusion of hoaxes, which in turn jeopardises the credibility of information gathered from social media platforms. Existing research on automated detection of hoaxes has the limitation of using relatively small datasets, owing to the difficulty of getting labelled data. This in turn has limited research exploring early detection of hoaxes as well as exploring other factors such as the effect of the size of the training data or the use of sliding windows. To mitigate this problem, we introduce a semi-automated method that leverages the Wikidata knowledge base to build large-scale datasets for veracity classification, focusing on celebrity death reports. This enables us to create a dataset with 4,007 reports including over 13 million tweets, 15% of which are fake. Experiments using class-specific representations of word embeddings show…
Taxonomy
Topics: Misinformation and Its Impacts · Spam and Phishing Detection · Topic Modeling
