TL;DR
DIA-HARM introduces a comprehensive benchmark and dataset to evaluate the robustness of disinformation detection models across 50 English dialects, revealing significant vulnerabilities and disparities in current models.
Contribution
This work provides the first large-scale benchmark and corpus for assessing dialectal robustness in disinformation detection, highlighting systematic model vulnerabilities and transfer capabilities.
Findings
Human-written dialectal content degrades detection performance by 1.4-3.6% F1, while AI-generated content remains stable.
Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1).
Multilingual models generalize better across dialects than monolingual models.
Abstract
Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting…
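The degradation figures above compare a detector's F1 on SAE text with its F1 on dialect-transformed versions of the same samples. The sketch below shows one way such a per-dialect drop could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, label 1 is assumed to mean "disinformation", each dialect's predictions are assumed to be aligned sample-by-sample with the SAE originals, and the drop is reported in absolute F1 percentage points, which is one possible reading of the reported figures.

```python
from typing import Dict, Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_drop_by_dialect(
    y_true: Sequence[int],
    sae_pred: Sequence[int],
    dialect_preds: Dict[str, Sequence[int]],
) -> Dict[str, float]:
    """Return the F1 drop (in percentage points) of each dialect vs. SAE.

    Hypothetical helper: assumes every prediction list is aligned with
    y_true, i.e. the dialect text is a transformation of the same sample.
    """
    base = binary_f1(y_true, sae_pred)
    return {
        name: (base - binary_f1(y_true, preds)) * 100
        for name, preds in dialect_preds.items()
    }
```

A positive value means the detector scores lower on that dialect than on SAE; a negative value means it scores higher.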
