Self-Explaining Hate Speech Detection with Moral Rationales

Francielle Vargas; Jackson Trager; Diego Alves; Surendrabikram Thapa; Matteo Guida; Berk Atil; Daryna Dementieva; Andrew Smart; Ameeta Agrawal

arXiv:2601.03481·cs.CL·January 8, 2026

Open Access

TL;DR

This paper introduces SMRA (Supervised Moral Rationale Attention), a self-explaining hate speech detection framework that supervises token-level attention with moral rationales, improving interpretability and robustness while maintaining or improving detection performance.

Contribution

SMRA is the first framework to incorporate moral rationales directly into training for hate speech detection, enhancing interpretability and robustness compared to prior methods.

Findings

01 SMRA improves hate speech detection F1 scores (+0.9 and +1.5).

02 It increases explanation faithfulness metrics (+7.4 pp IoU F1, +5.0 pp Token F1).

03 Explanations become more concise and faithful without performance or bias trade-offs.
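The IoU F1 and Token F1 figures above are overlap metrics between a model's predicted rationale tokens and the human-annotated ones. This page does not state the paper's exact evaluation protocol, so the following is only a minimal sketch of the common token-level definitions, assuming rationales are represented as sets of token positions. The function names `token_f1` and `token_iou` are illustrative, not taken from the paper.

```python
def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and gold rationale token positions."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    if not pred and not gold:
        return 1.0  # both empty: treated here as perfect agreement (a convention)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def token_iou(pred_tokens, gold_tokens):
    """Intersection over union of predicted and gold rationale token positions."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    union = pred | gold
    if not union:
        return 1.0  # both empty: treated here as perfect agreement (a convention)
    return len(pred & gold) / len(union)


# Predicted rationale covers tokens 2-4, gold rationale covers tokens 3-5.
print(token_f1([2, 3, 4], [3, 4, 5]))
print(token_iou([2, 3, 4], [3, 4, 5]))  # 2 shared tokens out of 4 distinct -> 0.5
```

In ERASER-style evaluation, an "IoU F1" is usually obtained by counting a predicted rationale as a match when its IoU with the gold rationale reaches a threshold (commonly 0.5) and then computing F1 over those matches. Whether the paper follows that exact procedure is an assumption here.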

Abstract

Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset…
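The abstract describes aligning token-level attention with annotated rationales as part of the training objective, but the truncated text does not give the loss itself. A minimal sketch of one common way to set up such an objective is shown below, assuming a cross-entropy between the attention distribution and a normalized binary rationale mask, added to the classification loss with a weight `lam`. All names (`alignment_loss`, `total_loss`, `lam`) and the specific loss form are assumptions for illustration, not the authors' formulation.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution over tokens."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(attn, rationale_mask, eps=1e-12):
    """Cross-entropy between attention and the normalized rationale mask.

    rationale_mask holds 1 for tokens annotated as part of a rationale, else 0.
    Examples without any annotated rationale contribute no alignment loss.
    """
    n_marked = sum(rationale_mask)
    if n_marked == 0:
        return 0.0
    target = [m / n_marked for m in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def classification_loss(class_probs, label, eps=1e-12):
    """Standard negative log-likelihood of the gold label."""
    return -math.log(class_probs[label] + eps)


def total_loss(class_probs, label, attn_scores, rationale_mask, lam=1.0):
    """Classification loss plus a weighted attention-alignment term (assumed form)."""
    attn = softmax(attn_scores)
    return classification_loss(class_probs, label) + lam * alignment_loss(attn, rationale_mask)


# Four tokens, with tokens 1 and 2 annotated as the rationale.
mask = [0, 1, 1, 0]
on_rationale = total_loss([0.2, 0.8], 1, [0.0, 5.0, 5.0, 0.0], mask)
off_rationale = total_loss([0.2, 0.8], 1, [5.0, 0.0, 0.0, 5.0], mask)
print(on_rationale < off_rationale)  # attention on the rationale is penalized less
```

With the same classifier output, the loss is lower when attention mass falls on the annotated span, which is the behavior an attention-alignment term is meant to encourage during training.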

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection