Power of Explanations: Towards automatic debiasing in hate speech detection

Yi Cai; Arthur Zimek; Gerhard Wunder; Eirini Ntoutsi

arXiv:2209.09975 · cs.CL · September 22, 2022

TL;DR

This paper introduces an automatic bias detection and correction framework for hate speech detection models, leveraging explanation methods to identify and mitigate biases without external resources, improving fairness in NLP classifiers.

Contribution

The paper proposes a novel automatic bias detection method using explanation techniques and an end-to-end debiasing framework for text classifiers, eliminating reliance on human-annotated biased terms.
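
To make the idea of explanation-based bias detection concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy logistic model with hand-set word weights (`TOY_WEIGHTS`), a leave-one-out (occlusion) explanation, and the simplifying heuristic of averaging attributions over false positives only. All names, weights, example texts, and the placeholder token `groupx` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy model: fixed per-word weights stand in for a trained classifier.
# "groupx" is a placeholder for a group identifier the model has over-weighted.
TOY_WEIGHTS = {"hate": 2.0, "stupid": 1.5, "groupx": 1.8, "love": -1.0, "friends": -0.5}
TOY_BIAS = -1.0

# Hypothetical labelled texts: 1 = hate speech, 0 = not hate speech.
TOY_DATA = [
    ("i hate you stupid", 1),
    ("groupx families live here", 0),
    ("proud groupx woman", 0),
    ("i love my friends", 0),
]


def predict_proba(tokens):
    """Probability of the 'hate' class under the toy logistic model."""
    z = TOY_BIAS + sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def occlusion_attributions(tokens):
    """Leave-one-out explanation: drop in P(hate) when each token is removed."""
    base = predict_proba(tokens)
    return [
        (tok, base - predict_proba(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


def rank_suspect_terms(dataset, threshold=0.5):
    """Average attributions over false positives and rank words, highest first.

    Assumption of this sketch: words that push non-hate texts over the
    decision threshold are candidates for bias. No word list is supplied.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, label in dataset:
        tokens = text.lower().split()
        is_false_positive = label == 0 and predict_proba(tokens) >= threshold
        if not is_false_positive:
            continue
        for tok, attribution in occlusion_attributions(tokens):
            totals[tok] += attribution
            counts[tok] += 1
    means = {tok: totals[tok] / counts[tok] for tok in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for word, score in rank_suspect_terms(TOY_DATA):
        print(f"{word}\t{score:.3f}")
```

On this toy data the top-ranked word is `groupx`, since it alone drives both false positives. A real pipeline would replace the toy model and the occlusion step with a trained classifier and the explanation method of choice.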

Findings

01. Effective bias detection without external resources
02. Improved fairness in hate speech detection models
03. Framework adaptable to evolving biases

Abstract

Hate speech detection is a common downstream application of natural language processing (NLP) in the real world. Despite their increasing accuracy, current data-driven approaches can easily learn biases from the imbalanced data distributions originating from humans. Deploying biased models could further reinforce existing social biases. Unlike with tabular data, however, defining and mitigating biases in text classifiers, which deal with unstructured data, is more challenging. A popular solution for improving machine learning fairness in NLP is to conduct the debiasing process with a list of potentially discriminated words given by human annotators. Besides the risk of overlooking biased terms, exhaustively identifying bias with human annotators is unsustainable, since discrimination varies among datasets and may evolve over time.…

Code & Models

Repositories

caiy0220/poe · PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling