Concept-Based Interpretability for Toxicity Detection

Samarth Garg; Divya Singh; Deeksha Varshney; Mamta

arXiv:2511.16689 · cs.CL · December 16, 2025

TL;DR

This paper introduces a concept-based interpretability method for toxicity detection. It uses Concept Gradient and lexicon-based analysis to explain model decisions and to reduce the over-attribution of toxicity subtype concepts to the toxic class.

Contribution

The paper proposes an interpretability technique that combines the Concept Gradient method with lexicon-based analysis to explain toxicity detection models and to mitigate the over-attribution of concepts to the target class.

Findings

01 Concept Gradient provides a more causal interpretation of toxicity classification by measuring how changes in concepts affect the model output.

02 Lexicon-based analysis identifies words that contribute to misclassification.

03 Lexicon-free augmentation reduces over-attribution of toxic concepts.

Abstract

The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit, as concepts that serve as strong indicators of whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method, which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of…
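To make the idea of "measuring how changes in concepts affect the output" concrete, here is a minimal sketch of a single-concept Concept Gradient score. It is not the paper's implementation. It assumes the commonly cited formulation in which, for a model output f(x) and a concept predictor g(x), the score is (∇g · ∇f) / ‖∇g‖², and it stands in two toy logistic models for the real toxicity classifier and concept predictor. The weights, the input vector, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def logistic_grad(w, x):
    """Gradient of sigmoid(w . x) with respect to the input x."""
    s = sigmoid(dot(w, x))
    return [s * (1.0 - s) * wi for wi in w]


def concept_gradient(grad_f, grad_g):
    """Single-concept score: (grad_g . grad_f) / ||grad_g||^2.

    Assumed formulation: the chain-rule estimate of how the model
    output changes per unit change in the concept prediction.
    """
    norm_sq = dot(grad_g, grad_g)
    if norm_sq == 0.0:
        raise ValueError("concept gradient is zero; score is undefined here")
    return dot(grad_g, grad_f) / norm_sq


# Hypothetical toy weights standing in for trained models.
w_toxic = [2.0, 0.5, 0.0]   # toy "toxicity" classifier f
w_insult = [1.0, 0.0, 0.0]  # toy "insult" concept predictor g
x = [0.3, -0.2, 0.8]        # toy input representation

cg = concept_gradient(logistic_grad(w_toxic, x), logistic_grad(w_insult, x))
print(f"Concept Gradient score for the toy 'insult' concept: {cg:.3f}")
```

A positive score here means that nudging the input in the direction that raises the concept prediction also raises the toy toxicity output. With several concepts, the same idea is usually written with the pseudo-inverse of the concept Jacobian; in practice the gradients would come from automatic differentiation rather than a closed form.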

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Authorship Attribution and Profiling