Exploring the Plausibility of Hate and Counter Speech Detectors with Explainable AI
Adrian Jaques Böck, Djordje Slijepčević, Matthias Zeppelzauer

TL;DR
This study evaluates four families of explainability methods for transformer models in hate and counter speech detection. It finds that perturbation-based approaches give the best explanations and that explanations in general help users understand model predictions.
Contribution
The paper compares representatives of four explainability approaches (gradient-based, perturbation-based, attention-based, and prototype-based) for transformer models in hate and counter speech detection, evaluating them quantitatively with an ablation study and qualitatively in a user study.
Findings
Perturbation-based explainability performs best among tested methods.
Explainability improves user understanding of model predictions.
Prototype-based approaches were not effective.
Abstract
In this paper, we investigate the explainability of transformer models and their plausibility for hate speech and counter speech detection. We compare representatives of four different explainability approaches, i.e., gradient-based, perturbation-based, attention-based, and prototype-based approaches, and analyze them quantitatively with an ablation study and qualitatively in a user study. Results show that perturbation-based explainability performs best, followed by gradient-based and attention-based explainability. Prototype-based experiments did not yield useful results. Overall, we observe that explainability strongly supports the users in better understanding the model predictions.
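To make the idea of perturbation-based explainability concrete, the following is a minimal sketch of token occlusion: each token is masked in turn and the drop in the classifier score is taken as that token's importance. It is an illustration under stated assumptions, not the authors' implementation. The scorer `toy_score`, the lexicon `TOY_LEXICON`, and the `[MASK]` placeholder are hypothetical stand-ins for a real transformer classifier.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lexicon standing in for a trained transformer classifier.
TOY_LEXICON = {"hate": 0.6, "stupid": 0.3}


def toy_score(tokens: List[str]) -> float:
    """Return a pseudo-probability for the 'hate' class (assumption: toy model)."""
    return min(1.0, sum(TOY_LEXICON.get(t.lower(), 0.0) for t in tokens))


def occlusion_importance(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Importance of each token = score drop when that token is masked."""
    base = score_fn(tokens)
    importances = []
    for i, tok in enumerate(tokens):
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        importances.append((tok, base - score_fn(perturbed)))
    return importances


if __name__ == "__main__":
    example = "I hate these stupid people".split()
    for tok, imp in occlusion_importance(example, toy_score):
        print(f"{tok:>8s} {imp:+.2f}")
```

With a real model, `score_fn` would wrap the classifier's predicted probability for the class of interest. An ablation-style evaluation, in the spirit of the quantitative analysis the abstract mentions, could then remove the highest-ranked tokens and check how quickly the prediction changes.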
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
