Wisdom of the LLM Crowd: A Large Scale Benchmark of Multi-Label U.S. Election-Related Harmful Social Media Content
Qile Wang, Prerana Khatiwada, Carolina Coimbra Vieira, Benjamin E. Bagozzi, Kenneth E. Barner, Matthew Louis Mauriello

TL;DR
This paper introduces a large-scale, multi-label dataset of U.S. election-related social media posts, annotated using large language models and validated against human judgments, to improve detection of harmful content.
Contribution
It presents USE24-XD, a novel dataset annotated by LLMs with validation, enabling scalable harmful content detection during elections.
Findings
LLMs achieve high recall (up to 0.90) on certain categories.
Inter-rater reliability between LLMs and humans is comparable.
Demographics influence labeling behavior, revealing subjectivity in annotations.
Abstract
The spread of election misinformation and harmful political content conveys misleading narratives and poses a serious threat to democratic integrity. Detecting harmful content at early stages is essential for understanding and potentially mitigating its downstream spread. In this study, we introduce USE24-XD, a large-scale dataset of nearly 100k posts collected from X (formerly Twitter) during the 2024 U.S. presidential election cycle, enriched with spatio-temporal metadata. To substantially reduce the cost of manual annotation while enabling scalable categorization, we employ six large language models (LLMs) to systematically annotate posts across five nuanced categories: Conspiracy, Sensationalism, Hate Speech, Speculation, and Satire. We validate LLM annotations with crowdsourcing (n = 34) and benchmark them against human annotators. Inter-rater reliability analyses show comparable…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMisinformation and Its Impacts · Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining
