Counterfactually Augmented Data and Unintended Bias: The Case of Sexism and Hate Speech Detection
Indira Sen, Mattia Samory, Claudia Wagner, and Isabelle Augenstein

TL;DR
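This paper investigates how Counterfactually Augmented Data (CAD) affects unintended model bias in sexism and hate speech detection. Models trained on CAD, especially construct-driven CAD, produce more false positives on non-hateful and non-sexist uses of identity and gendered terms, while a diverse mix of construct-driven and construct-agnostic CAD reduces this bias.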
Contribution
It demonstrates that construct-driven CAD can induce unintended bias in models, and that combining diverse CAD methods mitigates this issue.
Findings
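Construct-driven CAD increases false positives on hard cases: non-hateful and non-sexist usage of identity and gendered terms.
Combining construct-driven and construct-agnostic CAD reduces this unintended bias.
On these hard cases, models trained on the original, unperturbed data show lower false-positive rates than models trained on CAD.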
Abstract
Counterfactually Augmented Data (CAD) aims to improve out-of-domain generalizability, an indicator of model robustness. The improvement is credited with promoting core features of the construct over spurious artifacts that happen to correlate with it. Yet, over-relying on core features may lead to unintended model bias. Especially, construct-driven CAD -- perturbations of core features -- may induce models to ignore the context in which core features are used. Here, we test models for sexism and hate speech detection on challenging data: non-hateful and non-sexist usage of identity and gendered terms. In these hard cases, models trained on CAD, especially construct-driven CAD, show higher false-positive rates than models trained on the original, unperturbed data. Using a diverse set of CAD -- construct-driven and construct-agnostic -- reduces such unintended bias.
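The comparison described in the abstract rests on the false-positive rate over hard cases, that is, the share of non-hateful or non-sexist examples that a model wrongly flags. A minimal sketch of that metric follows. The function name, the label encoding (1 = hateful/sexist, 0 = not), and the toy predictions are all assumptions made for illustration; they are not the paper's code, data, or results.

```python
from typing import Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return FP / (FP + TN), assuming 1 = hateful/sexist and 0 = not."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made on truly negative examples.
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_on_negatives:
        raise ValueError("FPR is undefined without negative examples")
    false_positives = sum(1 for p in preds_on_negatives if p == 1)
    return false_positives / len(preds_on_negatives)


# Hypothetical hard-case set: every example is a non-hateful use of an
# identity or gendered term, so every gold label is 0.
hard_case_labels = [0, 0, 0, 0, 0]

# Made-up predictions from two hypothetical models (illustration only).
preds_original_model = [0, 0, 1, 0, 0]
preds_cad_model = [1, 0, 1, 0, 1]

fpr_original = false_positive_rate(hard_case_labels, preds_original_model)
fpr_cad = false_positive_rate(hard_case_labels, preds_cad_model)
print(f"original: {fpr_original:.2f}, CAD: {fpr_cad:.2f}")
```

Because a hard-case set of this kind contains only negative examples, the false-positive rate here equals the fraction of examples flagged, which makes it a direct measure of how often a model reacts to identity or gendered terms regardless of context.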
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning
