Power of Explanations: Towards automatic debiasing in hate speech detection
Yi Cai, Arthur Zimek, Gerhard Wunder, Eirini Ntoutsi

TL;DR
This paper introduces an automatic bias detection and correction framework for hate speech detection models, leveraging explanation methods to identify and mitigate biases without external resources, improving fairness in NLP classifiers.
Contribution
The paper proposes a novel automatic bias detection method using explanation techniques and an end-to-end debiasing framework for text classifiers, eliminating reliance on human-annotated biased terms.
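As a rough illustration of how explanations can surface biased terms without a human-curated word list, the sketch below scores each token by occlusion (how much the hate score drops when the token is removed) and sums those attributions over false positives, that is, non-hateful texts flagged as hateful. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy weights, the threshold, the example sentences, and the function names (`score`, `occlusion_attributions`, `candidate_biased_terms`) are all hypothetical, and occlusion stands in for whichever explanation method the paper actually uses.

```python
from collections import defaultdict

# Hypothetical per-word weights standing in for a trained classifier.
TOY_WEIGHTS = {"hate": 2.0, "stupid": 1.5, "muslim": 1.8, "love": -1.0, "friend": -0.5}
THRESHOLD = 1.0  # assumed decision threshold for the "hateful" class


def score(tokens):
    """Hate score of a token list under the toy linear model."""
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


def predict(tokens):
    """Return 1 (hateful) if the score reaches the threshold, else 0."""
    return int(score(tokens) >= THRESHOLD)


def occlusion_attributions(tokens):
    """Attribution of each token = score drop when that token is removed."""
    base = score(tokens)
    pairs = []
    for i, tok in enumerate(tokens):
        without = tokens[:i] + tokens[i + 1:]
        pairs.append((tok, base - score(without)))
    return pairs


def candidate_biased_terms(dataset, top_k=3):
    """Rank tokens by total attribution over false positives only."""
    totals = defaultdict(float)
    for text, label in dataset:
        tokens = text.lower().split()
        if label == 0 and predict(tokens) == 1:  # non-hateful text flagged as hateful
            for tok, attr in occlusion_attributions(tokens):
                totals[tok] += attr
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only tokens that pushed the prediction towards "hateful".
    return [tok for tok, total in ranked[:top_k] if total > 0]


# Hypothetical labelled examples: (text, 1 = hateful, 0 = not hateful).
DATASET = [
    ("i hate you stupid", 1),
    ("my muslim friend is kind", 0),
    ("muslim families visit this park", 0),
    ("i love my friend", 0),
]

print(candidate_biased_terms(DATASET))  # ['muslim']
```

In this toy setting the identity term is the only token that drives the false positives, so it is the only candidate returned. A debiasing step could then down-weight or regularize the model's reliance on such terms, which is the part of the pipeline this sketch does not cover.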
Findings
Effective bias detection without external resources
Improved fairness in hate speech detection models
Framework adaptable to evolving biases
Abstract
Hate speech detection is a common downstream application of natural language processing (NLP) in the real world. In spite of the increasing accuracy, current data-driven approaches could easily learn biases from the imbalanced data distributions originating from humans. The deployment of biased models could further enhance the existing social biases. But unlike handling tabular data, defining and mitigating biases in text classifiers, which deal with unstructured data, are more challenging. A popular solution for improving machine learning fairness in NLP is to conduct the debiasing process with a list of potentially discriminated words given by human annotators. In addition to suffering from the risk of overlooking biased terms, exhaustively identifying bias with human annotators is unsustainable, since discrimination varies among datasets and may evolve over time.…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling
