Systematic Offensive Stereotyping (SOS) Bias in Language Models
Fatma Elsafoury

TL;DR
This paper introduces a new metric to measure offensive stereotyping bias in language models, validates its presence, and explores its effects on fairness and performance in hate speech detection tasks.
Contribution
It proposes a novel SOS bias metric, evaluates bias across models, and examines its impact on hate speech detection fairness and performance.
Findings
All inspected LMs exhibit SOS bias.
Debiasing methods can worsen or improve SOS bias depending on the attribute.
SOS bias affects fairness in hate speech detection, but there is no strong evidence that it affects detection performance.
Abstract
In this paper, we propose a new metric to measure the SOS bias in language models (LMs). Then, we validate the SOS bias and investigate the effectiveness of removing it. Finally, we investigate the impact of the SOS bias in LMs on their performance and fairness in hate speech detection. Our results suggest that all the inspected LMs are SOS biased and that the SOS bias is reflective of the online hate experienced by marginalized identities. The results indicate that using debiasing methods from the literature worsens the SOS bias in LMs for some sensitive attributes and improves it for others. Finally, our results suggest that the SOS bias in the inspected LMs has an impact on their fairness in hate speech detection. However, there is no strong evidence that the SOS bias has an impact on the performance of hate speech detection.
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
