Exploring the Plausibility of Hate and Counter Speech Detectors with Explainable AI

Adrian Jaques Böck; Djordje Slijepčević; Matthias Zeppelzauer

arXiv:2407.20274·cs.LG·July 31, 2024

TL;DR

This study evaluates four explainability approaches for transformer models in hate and counter speech detection. It finds that perturbation-based explainability performs best and that explanations overall help users better understand model predictions.

Contribution

The paper compares representatives of four explainability approaches (gradient-based, perturbation-based, attention-based, and prototype-based) for transformer models in hate and counter speech detection, evaluating them quantitatively in an ablation study and qualitatively in a user study.

Findings

01 Perturbation-based explainability performs best among the tested methods, followed by gradient-based and attention-based explainability.

02 Explainability strongly supports users in better understanding the model predictions.

03 Prototype-based experiments did not yield useful results.

Abstract

In this paper we investigate the explainability of transformer models and their plausibility for hate speech and counter speech detection. We compare representatives of four different explainability approaches, i.e., gradient-based, perturbation-based, attention-based, and prototype-based approaches, and analyze them quantitatively with an ablation study and qualitatively in a user study. Results show that perturbation-based explainability performs best, followed by gradient-based and attention-based explainability. Prototype-based experiments did not yield useful results. Overall, we observe that explainability strongly supports the users in better understanding the model predictions.
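
To make the terms in the abstract concrete, the following is a minimal sketch of one common perturbation-based technique (token occlusion) together with a simple ablation check. It is not the authors' implementation: the abstract does not say which perturbation method or ablation protocol was used, so occlusion, the mask token, and every name here (`occlusion_attributions`, `ablate_top_k`, `toy_predict`, `TOY_LEXICON`) are assumptions chosen for illustration. The toy lexicon scorer merely stands in for a transformer classifier.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a transformer classifier: returns a score in [0, 1]
# for the "offensive" class. A real setup would call a fine-tuned model here.
TOY_LEXICON = {"awful": 0.6, "stupid": 0.3}


def toy_predict(tokens: Sequence[str]) -> float:
    return min(1.0, sum(TOY_LEXICON.get(t.lower(), 0.0) for t in tokens))


def occlusion_attributions(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Score each token by how much the prediction drops when it is masked."""
    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + [mask_token] + list(tokens[i + 1:])
        scores.append(base - predict(perturbed))
    return scores


def ablate_top_k(
    tokens: Sequence[str],
    scores: Sequence[float],
    k: int,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Mask the k tokens with the highest attribution scores."""
    top = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [mask_token if i in top else t for i, t in enumerate(tokens)]


tokens = "you are awful and stupid".split()
scores = occlusion_attributions(tokens, toy_predict)
ablated = ablate_top_k(tokens, scores, k=1)

# If the explanation is faithful, masking the top-ranked token should lower
# the model's score more than masking a low-ranked one.
print(list(zip(tokens, scores)))
print(ablated, toy_predict(ablated))
```

In an ablation study of this kind, a larger drop in the model's score after removing the top-ranked tokens is usually read as evidence that the explanation points at the tokens the model actually relies on.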

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection