Improving Counterfactual Generation for Fair Hate Speech Detection

Aida Mostafazadeh Davani; Ali Omrani; Brendan Kennedy; Mohammad Atari; Xiang Ren; Morteza Dehghani

arXiv:2108.01721 · cs.CL · August 5, 2021 · 1 citation

Open Access

TL;DR

This paper proposes a counterfactual fairness approach for hate speech detection that accounts for stereotypical language related to social group tokens, improving fairness without sacrificing detection accuracy.

Contribution

It introduces a method that compares sentence likelihoods, computed with a pre-trained language model, between an instance and its counterfactuals, so that social group tokens (SGTs) are treated equally only in contexts where they are interchangeable.
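
The idea of restricting counterfactuals by likelihood can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the four-item SGT list, the `toy_log_likelihood` scorer (a stand-in for a pre-trained language model), and the `max_gap` threshold rule are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# Hypothetical, tiny SGT list used only for illustration.
SGTS = ["women", "men", "muslims", "christians"]

# Hypothetical word-to-group associations for the toy scorer below.
ASSOCIATED = {"eid": "muslims", "christmas": "christians"}


def generate_counterfactuals(sentence: str, sgts: List[str] = SGTS) -> List[str]:
    """Swap the first SGT found in the sentence for every other SGT."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in sgts:
            return [
                " ".join(tokens[:i] + [other] + tokens[i + 1:])
                for other in sgts
                if other != tok.lower()
            ]
    return []  # no SGT present, so there is nothing to swap


def toy_log_likelihood(sentence: str) -> float:
    """Stand-in scorer (NOT a real language model).

    Charges -1 per token, plus a -5 penalty when a group-specific word
    appears without its associated group.
    """
    tokens = sentence.lower().split()
    score = -1.0 * len(tokens)
    for word, group in ASSOCIATED.items():
        if word in tokens and group not in tokens:
            score -= 5.0
    return score


def filter_by_likelihood(
    sentence: str,
    counterfactuals: List[str],
    log_likelihood: Callable[[str], float],
    max_gap: float,
) -> List[str]:
    """Keep counterfactuals whose score is within max_gap of the original.

    The absolute-gap rule is an assumed similarity criterion.
    """
    base = log_likelihood(sentence)
    return [
        cf for cf in counterfactuals
        if abs(log_likelihood(cf) - base) <= max_gap
    ]
```

With this toy scorer, every swap of "I met some women at the park" is kept, while every swap of "muslims celebrate eid every year" is dropped because the context is specific to one group. In practice the scorer would be a real language model, and the similarity rule would follow whatever criterion the paper defines.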

Findings

01. Improved fairness metrics in hate speech detection models.

02. Preserved model performance on core detection tasks.

03. Effective use of language model likelihoods to identify contexts in which SGTs are interchangeable.

Abstract

Bias mitigation approaches reduce models' dependence on sensitive features of data, such as social group tokens (SGTs), resulting in equal predictions across the sensitive features. In hate speech detection, however, equalizing model predictions may ignore important differences among targeted social groups, as hate speech can contain stereotypical language specific to each SGT. Here, to take the specific language about each SGT into account, we rely on counterfactual fairness and equalize predictions among counterfactuals, generated by changing the SGTs. Our method evaluates the similarity in sentence likelihoods (via pre-trained language models) among counterfactuals, to treat SGTs equally only within interchangeable contexts. By applying logit pairing to equalize outcomes on the restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model…
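
As a rough illustration of logit pairing on a restricted counterfactual set, the sketch below adds a penalty on the gap between the classifier's logit for an instance and its logits for the retained counterfactuals. The function names, the use of an absolute difference, the averaging over counterfactuals, and the weight `lam` are all assumptions for illustration; the abstract does not specify these details.

```python
import math
from typing import List


def bce_with_logit(logit: float, label: int) -> float:
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def logit_pairing_penalty(orig_logit: float, cf_logits: List[float]) -> float:
    """Mean absolute gap between the original logit and each counterfactual logit.

    Returns 0.0 when no counterfactuals survived filtering, so such
    instances contribute only the ordinary classification loss.
    """
    if not cf_logits:
        return 0.0
    return sum(abs(orig_logit - z) for z in cf_logits) / len(cf_logits)


def total_loss(
    orig_logit: float,
    label: int,
    cf_logits: List[float],
    lam: float = 1.0,  # hypothetical weight on the fairness term
) -> float:
    """Classification loss plus the weighted pairing penalty."""
    return bce_with_logit(orig_logit, label) + lam * logit_pairing_penalty(
        orig_logit, cf_logits
    )
```

Because the penalty is computed only over counterfactuals that passed the likelihood filter, the model is pushed toward equal outputs in interchangeable contexts and is left free to differ where a swap produces an implausible sentence.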

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI

Methods: Counterfactual Explanations