Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
Hsuvas Borkakoty, Luis Espinosa-Anke

TL;DR
This paper introduces Hoaxpedia, a dataset of 311 Wikipedia hoax articles with legitimate counterparts, and analyzes language models and features for automated hoax detection, highlighting the difficulty and potential of content-based detection methods.
Contribution
The paper provides the first systematic analysis of Wikipedia hoaxes, introduces the Hoaxpedia dataset, and evaluates language models and features for automated hoax detection.
Findings
Content-based detection is challenging but feasible.
Edit history features improve classification accuracy.
Full article analysis yields better results than just definitions.
Abstract
Hoaxes are a recognised form of disinformation created deliberately, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, we report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article's…
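As a rough illustration of the experimental axes mentioned above (hoax-to-legit ratio, and full article versus definition only), the following minimal sketch assembles a binary classification split. It is not the authors' pipeline: the `Article` type, `first_sentence`, and `build_split` are hypothetical names, the "definition" is approximated by the first sentence, and legitimate articles are sampled at random, whereas Hoaxpedia pairs hoaxes with semantically similar legitimate articles.

```python
import random
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record layout; the real dataset's fields may differ.
    title: str
    text: str
    is_hoax: bool


def first_sentence(text: str) -> str:
    """Crude stand-in for an article's definition: text up to the first '. '."""
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text


def build_split(hoaxes, legit, legit_per_hoax, use_full_text, seed=0):
    """Return shuffled (text, label) pairs with `legit_per_hoax` legitimate
    articles per hoax (label 1 = hoax, 0 = legitimate).

    Assumption: legitimate articles are drawn uniformly at random, which
    simplifies away the semantic-similarity matching described in the abstract.
    """
    rng = random.Random(seed)
    # Cap the request at the number of legitimate articles available.
    n_legit = min(len(legit), legit_per_hoax * len(hoaxes))
    sampled = rng.sample(legit, n_legit)

    def view(article):
        # Expose either the whole article or only its approximate definition.
        return article.text if use_full_text else first_sentence(article.text)

    pairs = [(view(a), 1) for a in hoaxes] + [(view(a), 0) for a in sampled]
    rng.shuffle(pairs)
    return pairs
```

Calling `build_split(hoaxes, legit, legit_per_hoax=10, use_full_text=False)` would give a definition-only split with ten legitimate articles per hoax; the resulting pairs can then be fed to any text classifier to compare settings.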
Taxonomy
Topics: Wikis in Education and Collaboration
