Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection
Brage Eilertsen, Røskva Bjørgfinsdóttir, Francielle Vargas, Ali Ramezani-Kebrya

TL;DR
This paper introduces Supervised Rational Attention (SRA), a method that aligns model attention with human rationales to improve interpretability and fairness in hate speech detection models.
Contribution
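The paper proposes SRA, a framework that adds a supervised attention mechanism to transformer-based hate speech classifiers. Training optimizes a joint objective: the standard classification loss plus an alignment loss that penalizes the discrepancy between attention weights and human-annotated rationales.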
Findings
SRA achieves 2.4x better explainability than baselines.
Token-level explanations are more faithful and human-aligned.
SRA maintains competitive fairness across multiple metrics.
Abstract
The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluated SRA on hate speech benchmarks in English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations. Empirically, SRA achieves 2.4x better explainability compared to current baselines, and produces token-level explanations that are more…
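The joint objective described in the abstract can be illustrated with a minimal sketch. It assumes a single attention distribution over tokens, a binary human rationale mask, mean squared error as the discrepancy measure, and a weighting factor `alpha`. None of these specifics are stated in the abstract, and the function names are hypothetical, so this is an illustration of the idea rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard classification loss for one example."""
    probs = softmax(logits)
    return -math.log(probs[label])


def alignment_loss(attention, rationale_mask):
    """Discrepancy between attention weights and human rationales.

    Assumption: the binary mask is normalized into a target distribution
    and compared with the attention weights via mean squared error.
    Examples with no annotated rationale tokens contribute zero loss.
    """
    total = sum(rationale_mask)
    if total == 0:
        return 0.0
    target = [r / total for r in rationale_mask]
    squared = [(a - t) ** 2 for a, t in zip(attention, target)]
    return sum(squared) / len(attention)


def joint_loss(logits, label, attention, rationale_mask, alpha=1.0):
    """Classification loss plus weighted alignment loss.

    `alpha` is a hypothetical hyperparameter balancing the two terms.
    """
    ce = cross_entropy(logits, label)
    return ce + alpha * alignment_loss(attention, rationale_mask)


# Usage: three tokens, annotators marked the first two as the rationale.
logits = [0.2, 1.5]                 # scores for (not hateful, hateful)
attention = [0.1, 0.2, 0.7]         # model attends mostly to token 3
rationale_mask = [1, 1, 0]          # humans point to tokens 1 and 2
loss = joint_loss(logits, label=1, attention=attention,
                  rationale_mask=rationale_mask, alpha=1.0)
```

With `alpha=0` the sketch reduces to plain cross-entropy; larger values push attention mass toward the annotated tokens at some cost to the classification term.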
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition
