Towards Procedural Fairness: Uncovering Biases in How a Toxic Language Classifier Uses Sentiment Information
Isar Nejadgholi, Esma Balkır, Kathleen C. Fraser, and Svetlana Kiritchenko

TL;DR
This paper investigates how a toxic language classifier uses sentiment and identity terms, revealing biases and guiding future debiasing efforts to improve fairness in toxic language detection.
Contribution
It introduces a concept-based explanation framework to analyze the interaction between sentiment and identity features in toxic language classifiers, highlighting biases.
Findings
Sentiment information is sometimes overshadowed by identity term influence.
The classifier's sensitivity to sentiment varies across classes.
Results inform debiasing strategies for fairer toxic language models.
Abstract
Previous works on the fairness of toxic language classifiers compare the output of models with different identity terms as input features but do not consider the impact of other important concepts present in the context. Here, besides identity terms, we take into account high-level latent features learned by the classifier and investigate the interaction between these features and identity terms. For a multi-class toxic language classifier, we leverage a concept-based explanation framework to calculate the sensitivity of the model to the concept of sentiment, which has been used before as a salient feature for toxic language detection. Our results show that although for some classes, the classifier has learned the sentiment information as expected, this information is outweighed by the influence of identity terms as input features. This work is a step towards evaluating procedural…
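The abstract does not spell out how the sensitivity to a concept is computed, so the following is only a minimal sketch of one common way to do it, in the spirit of concept activation vectors. It assumes that a concept direction can be approximated by the difference between the mean hidden activations of concept examples (for instance, negative-sentiment text) and of random examples, and that sensitivity is the directional derivative of a class logit along that direction. The toy classification head, all function names, and all numbers are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_vector(concept_acts, random_acts):
    """Unit vector pointing from random activations towards concept activations.

    Assumption: a difference of means stands in for the linear classifier
    that concept-based methods usually train to find this direction.
    """
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def logit_gradient(head_weights, activation):
    """Gradient of the toy logit sum_j w_j * tanh(h_j) with respect to h."""
    return [w * (1.0 - math.tanh(h) ** 2) for w, h in zip(head_weights, activation)]

def sensitivity(head_weights, activation, cav):
    """Directional derivative of the class logit along the concept vector."""
    grad = logit_gradient(head_weights, activation)
    return sum(g * v for g, v in zip(grad, cav))

def concept_score(head_weights, activations, cav):
    """Fraction of inputs whose class logit increases along the concept direction."""
    positives = sum(1 for h in activations if sensitivity(head_weights, h, cav) > 0)
    return positives / len(activations)

# Hypothetical 2-dimensional hidden activations.
concept_acts = [[1.0, 0.2], [0.8, 0.0], [1.2, 0.1]]   # e.g. negative-sentiment examples
random_acts = [[0.0, 0.1], [0.1, 0.3], [-0.1, 0.2]]   # random examples
cav = concept_vector(concept_acts, random_acts)

toxic_head = [2.0, -0.5]                               # hypothetical weights for one class
inputs = [[0.5, 0.5], [-0.3, 0.1], [0.9, -0.2]]        # activations of inputs to explain
score = concept_score(toxic_head, inputs, cav)
print(f"concept score for this class: {score:.2f}")
```

A score near 1 would suggest that moving an input's representation towards the concept tends to raise the class logit, and a score near 0 the opposite. Comparing such scores across classes, and across inputs with and without identity terms, is one way to probe the kind of interaction the abstract describes.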
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)
