Concept-Based Interpretability for Toxicity Detection
Samarth Garg, Divya Singh, Deeksha Varshney, Mamta

TL;DR
This paper introduces a concept-based interpretability method for toxicity detection that uses concept gradients and lexicon analysis to improve understanding of model decisions and address over-attribution issues.
Contribution
The paper proposes an interpretability technique that combines the Concept Gradient (CG) method with lexicon-based analysis to explain the decisions of toxicity detection models and to reduce over-attribution of toxic concepts.
Findings
Concept Gradient gives a more causal interpretation of toxicity classification by measuring how changes in a concept affect the model output.
Lexicon-based analysis identifies words contributing to misclassification.
Lexicon-free augmentation reduces over-attribution of toxic concepts.
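To make the first finding concrete, here is a minimal sketch of a single-concept Concept Gradient score. It assumes the commonly stated form of CG, where the target gradient is projected onto the concept gradient: CG = (grad g · grad f) / ||grad g||^2. The paper's own extension is not shown in this summary, so the sketch does not attempt to reproduce it. The functions `insult_concept` and `toxicity_target`, their weights, and the three-dimensional input are hypothetical toy stand-ins, not the authors' models.

```python
import math


def numerical_gradient(fn, x, eps=1e-6):
    """Central-difference gradient of a scalar function at point x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad


def concept_gradient(target_fn, concept_fn, x):
    """Single-concept CG (assumed form): (grad_g . grad_f) / ||grad_g||^2."""
    grad_f = numerical_gradient(target_fn, x)
    grad_g = numerical_gradient(concept_fn, x)
    norm_sq = sum(g * g for g in grad_g)
    if norm_sq == 0.0:
        raise ValueError("concept gradient is zero at this input")
    return sum(a * b for a, b in zip(grad_f, grad_g)) / norm_sq


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy models over a 3-dimensional feature vector.
def insult_concept(x):
    # Concept score depends only on the first two features.
    return sigmoid(2.0 * x[0] + 0.5 * x[1])


def toxicity_target(x):
    # Toxicity depends on the concept score plus an unrelated third feature.
    return sigmoid(3.0 * insult_concept(x) + 1.0 * x[2] - 1.5)


x = [0.4, -0.2, 0.1]
cg = concept_gradient(toxicity_target, insult_concept, x)
print(f"CG of toxicity w.r.t. insult concept: {cg:.4f}")
```

In this toy setup the score equals the sensitivity of the toxicity output to the concept score itself, because the third feature does not move the concept and is therefore projected out. A large positive value would indicate that raising the concept pushes the prediction strongly toward the toxic class.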
Abstract
The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit, as concepts that serve as strong indicators of whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method, which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Authorship Attribution and Profiling
