Counterfactually Augmented Data and Unintended Bias: The Case of Sexism and Hate Speech Detection

Indira Sen, Mattia Samory, Claudia Wagner, and Isabelle Augenstein

arXiv:2205.04238 · cs.CL · May 10, 2022

TL;DR

This paper investigates how Counterfactually Augmented Data (CAD) affects unintended bias in sexism and hate speech detection models. Construct-driven CAD can raise false-positive rates on hard cases (non-sexist and non-hateful uses of gendered and identity terms), whereas a diverse mix of CAD reduces this bias.

Contribution

The paper shows that construct-driven CAD can induce unintended bias in models, and that combining diverse types of CAD mitigates this issue.
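As a rough illustration of what "combining diverse CAD" can mean at the data level, the following is a minimal sketch, not the authors' pipeline: it assumes binary labels, assumes each counterfactual is tagged as construct-driven or construct-agnostic, and assumes an equal number of counterfactuals is drawn from each type. The names `Example`, `build_training_set`, and `per_type` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Example:
    text: str
    label: int   # 1 = sexist/hateful, 0 = not (assumed binary labels)
    source: str  # "original", "construct_driven", or "construct_agnostic"


def build_training_set(
    original: List[Example],
    cad_pools: Dict[str, List[Example]],
    per_type: int,
    seed: int = 0,
) -> List[Example]:
    """Return the original data plus up to `per_type` counterfactuals
    sampled from each CAD pool (hypothetical balancing scheme)."""
    rng = random.Random(seed)        # fixed seed for a reproducible mix
    mixed = list(original)           # copy so the caller's list is untouched
    for name in sorted(cad_pools):   # sorted for deterministic iteration order
        pool = cad_pools[name]
        k = min(per_type, len(pool))  # never request more than the pool holds
        mixed.extend(rng.sample(pool, k))
    rng.shuffle(mixed)               # interleave original and CAD examples
    return mixed
```

Passing a single pool (for example, only construct-driven counterfactuals) gives a single-type setting; passing both pools gives the diverse setting the paper argues for.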

Findings

1. Construct-driven CAD increases false positives on challenging (hard) cases.
2. Diverse CAD approaches reduce unintended bias.
3. On these hard cases, models trained on the original, unperturbed data produce fewer false positives than models trained on CAD.

Abstract

Counterfactually Augmented Data (CAD) aims to improve out-of-domain generalizability, an indicator of model robustness. This improvement is attributed to CAD promoting core features of the construct over spurious artifacts that happen to correlate with it. Yet over-relying on core features may lead to unintended model bias. In particular, construct-driven CAD (perturbations of core features) may induce models to ignore the context in which core features are used. Here, we test models for sexism and hate speech detection on challenging data: non-hateful and non-sexist usage of identity and gendered terms. On these hard cases, models trained on CAD, especially construct-driven CAD, show higher false-positive rates than models trained on the original, unperturbed data. Using a diverse set of CAD (both construct-driven and construct-agnostic) reduces such unintended bias.
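Because every hard case in this kind of test set is non-hateful or non-sexist, any positive prediction on it is a false positive. The following is a minimal sketch of the metric, assuming binary labels (1 = sexist/hateful, 0 = not). The helper names `false_positive_rate` and `fpr_gap` are hypothetical, and the sketch does not reproduce the paper's evaluation code.

```python
from typing import Sequence


def false_positive_rate(gold: Sequence[int], pred: Sequence[int]) -> float:
    """FP / (FP + TN), computed only over examples whose gold label is 0."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Keep predictions for gold-negative items (all items, for a hard-case set).
    negatives = [p for g, p in zip(gold, pred) if g == 0]
    if not negatives:
        raise ValueError("no gold-negative examples to evaluate")
    false_positives = sum(1 for p in negatives if p == 1)
    return false_positives / len(negatives)


def fpr_gap(gold: Sequence[int], pred_cad: Sequence[int],
            pred_original: Sequence[int]) -> float:
    """Positive values mean the CAD-trained model flags more benign
    hard cases than the model trained on original data."""
    return false_positive_rate(gold, pred_cad) - false_positive_rate(gold, pred_original)
```

A positive gap on hard cases corresponds to the pattern the abstract describes for CAD-trained models; comparing gaps across CAD types would separate construct-driven from diverse CAD.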

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning