Self-Explaining Hate Speech Detection with Moral Rationales
Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal

TL;DR
This paper introduces Supervised Moral Rationale Attention (SMRA), a self-explaining hate speech detection model that uses expert-annotated moral rationales as supervision for its attention, improving interpretability and robustness while maintaining or improving detection performance.
Contribution
SMRA is the first framework to incorporate moral rationales directly into training for hate speech detection, enhancing interpretability and robustness compared to prior methods.
Findings
SMRA improves hate speech detection F1 scores (+0.9 and +1.5).
SMRA increases explanation faithfulness metrics (+7.4 pp IoU F1, +5.0 pp Token F1).
Explanations become more concise and faithful without performance or bias trade-offs.
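For readers unfamiliar with the explanation metrics named above, the following is a minimal sketch of how Token F1 and IoU F1 are commonly computed for rationales. The summary does not define them, so the definitions here (token-index overlap, and a span counted as matched when its IoU with a gold span reaches 0.5) are assumptions based on common practice, and the function names are hypothetical.

```python
def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and gold rationale token indices."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def iou(pred_span, gold_span):
    """Intersection over union of two spans given as token-index collections."""
    pred, gold = set(pred_span), set(gold_span)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5):
    """Span-level F1; a span is a hit if its IoU with any span on the
    other side is at least `threshold` (0.5 is an assumed default)."""
    if not pred_spans or not gold_spans:
        return 0.0
    pred_hits = sum(any(iou(p, g) >= threshold for g in gold_spans) for p in pred_spans)
    gold_hits = sum(any(iou(p, g) >= threshold for p in pred_spans) for g in gold_spans)
    precision = pred_hits / len(pred_spans)
    recall = gold_hits / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Both scores compare a model's highlighted tokens against human-annotated rationales, so a gain reported in percentage points (pp) means closer agreement with the annotators.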
Abstract
Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset…
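The idea of integrating rationale supervision into the training objective can be illustrated with a small sketch. This is not the authors' formulation, which the truncated abstract does not give: it assumes a classification cross-entropy term plus an alignment term that pulls the attention distribution toward a normalized binary rationale mask, weighted by a hypothetical coefficient `lam`. All function and parameter names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, label):
    """Cross-entropy of the true class under the softmax of the logits."""
    return -math.log(softmax(logits)[label])


def attention_alignment_loss(attention, rationale_mask, eps=1e-12):
    """Cross-entropy between the normalized rationale mask (target)
    and the attention distribution (assumed form of the alignment term)."""
    total = sum(rationale_mask)
    if total == 0:
        return 0.0  # no annotated rationale, so no attention supervision
    target = [m / total for m in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention) if t > 0)


def combined_loss(logits, label, attn_scores, rationale_mask, lam=1.0):
    """Classification loss plus lam times the attention alignment loss."""
    attention = softmax(attn_scores)
    align = attention_alignment_loss(attention, rationale_mask)
    return classification_loss(logits, label) + lam * align
```

Under these assumptions, attention that concentrates on annotated tokens lowers the alignment term, while `lam=0` recovers plain classification training.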
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection
