Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias
Rebecca Dorn, Lee Kezar, Fred Morstatter, Kristina Lerman

TL;DR
This paper investigates gender-queer dialect bias in harmful speech detection by language models. It finds that models are especially inaccurate on texts written by individuals targeted by the slur in question, and it introduces a new dataset for evaluating fairness.
Contribution
The study introduces QueerReclaimLex, a novel dataset for assessing bias in harmful speech detection, and systematically evaluates language models' performance, highlighting biases against gender-queer authors.
Findings
Language models often misclassify texts authored by gender-queer individuals as harmful.
Models perform poorly (F1 <= 0.24) on texts authored by individuals targeted by the featured slur.
Chain-of-thought prompting does not fully mitigate bias.
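For readers unfamiliar with the F1 figure quoted above, the following is a minimal sketch of how F1 is computed for a binary harmful/not-harmful classifier. The labels are toy values invented for illustration and are not taken from the paper or from QueerReclaimLex. Treating "harmful" (1) as the positive class is also an assumption of this sketch.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = harmful, 0 = not harmful.
# The classifier over-flags: five benign texts are predicted harmful.
y_true = [0, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy case, precision is 1/6 and recall is 1/2, so heavy over-flagging of benign texts pulls F1 down even though half of the truly harmful texts are caught.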
Abstract
Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly for aggressively flagging posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and…
Taxonomy
Topics: Gender Studies in Language · Hate Speech and Cyberbullying Detection
