Wisdom of the LLM Crowd: A Large Scale Benchmark of Multi-Label U.S. Election-Related Harmful Social Media Content

Qile Wang; Prerana Khatiwada; Carolina Coimbra Vieira; Benjamin E. Bagozzi; Kenneth E. Barner; Matthew Louis Mauriello

arXiv:2602.11962 · cs.HC · February 24, 2026

Open Access

TL;DR

This paper introduces a large-scale, multi-label dataset of U.S. election-related social media posts, annotated using large language models and validated against human judgments, to improve detection of harmful content.

Contribution

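The paper presents USE24-XD, a dataset of nearly 100k X posts from the 2024 U.S. presidential election cycle, annotated by six LLMs across five categories and validated against crowdsourced human annotations (n = 34), to support scalable detection of harmful election-related content.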

Findings

01. LLMs achieve high recall (up to 0.90) on certain categories.

02. Inter-rater reliability between LLMs and humans is comparable.

03. Demographics influence labeling behavior, revealing subjectivity in annotations.

Abstract

The spread of election misinformation and harmful political content conveys misleading narratives and poses a serious threat to democratic integrity. Detecting harmful content at early stages is essential for understanding and potentially mitigating its downstream spread. In this study, we introduce USE24-XD, a large-scale dataset of nearly 100k posts collected from X (formerly Twitter) during the 2024 U.S. presidential election cycle, enriched with spatio-temporal metadata. To substantially reduce the cost of manual annotation while enabling scalable categorization, we employ six large language models (LLMs) to systematically annotate posts across five nuanced categories: Conspiracy, Sensationalism, Hate Speech, Speculation, and Satire. We validate LLM annotations with crowdsourcing (n = 34) and benchmark them against human annotators. Inter-rater reliability analyses show comparable…
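To make the annotation setup concrete, the following is a minimal sketch of how multi-label annotations from several LLMs could be combined and scored against human labels. It is an illustration under stated assumptions, not the authors' pipeline: the excerpt above does not say how the six models' labels are aggregated, so the majority-vote rule, the `threshold` parameter, the function names, and all toy data here are hypothetical. Only the five category names come from the abstract.

```python
from collections import Counter

# The five categories named in the abstract.
CATEGORIES = ["Conspiracy", "Sensationalism", "Hate Speech", "Speculation", "Satire"]


def majority_vote(annotations, threshold=0.5):
    """Combine multi-label annotations for ONE post.

    annotations: list of label sets, one per annotator (e.g. one per LLM).
    A category is kept if strictly more than `threshold` of the annotators
    assigned it. Majority voting is an assumption made for illustration.
    """
    counts = Counter(label for labels in annotations for label in set(labels))
    n = len(annotations)
    return {c for c in CATEGORIES if counts[c] / n > threshold}


def per_category_recall(predicted, reference):
    """Recall of `predicted` against `reference`, computed per category.

    predicted, reference: lists of label sets aligned by post.
    Returns None for a category that never appears in the reference,
    because recall is undefined there.
    """
    recall = {}
    for c in CATEGORIES:
        tp = sum(1 for p, r in zip(predicted, reference) if c in r and c in p)
        fn = sum(1 for p, r in zip(predicted, reference) if c in r and c not in p)
        recall[c] = tp / (tp + fn) if (tp + fn) else None
    return recall


# Hypothetical toy data: six annotators labeling a single post.
llm_votes = [
    {"Conspiracy", "Speculation"},
    {"Conspiracy"},
    {"Conspiracy", "Satire"},
    {"Speculation"},
    {"Conspiracy", "Speculation"},
    {"Speculation"},
]
consensus = majority_vote(llm_votes)

# Hypothetical toy data: aggregated LLM labels vs. human labels for three posts.
llm_labels = [{"Conspiracy"}, {"Satire"}, set()]
human_labels = [{"Conspiracy", "Speculation"}, {"Satire"}, {"Conspiracy"}]
recall = per_category_recall(llm_labels, human_labels)
```

In this sketch each category is voted on independently, so a post can end up with several labels or none, which matches the multi-label framing. Reporting recall per category rather than as a single pooled number keeps rare categories from being masked by frequent ones.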

Taxonomy

Topics: Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining