Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection

Brage Eilertsen; Røskva Bjørgfinsdóttir; Francielle Vargas; Ali Ramezani-Kebrya

arXiv:2511.07065·cs.CL·November 11, 2025

TL;DR

This paper introduces Supervised Rational Attention (SRA), a method that aligns model attention with human rationales to improve interpretability and fairness in hate speech detection models.

Contribution

The paper proposes SRA, a novel framework that explicitly aligns attention with human rationales in transformer models for hate speech detection, enhancing interpretability and fairness.

Findings

01. SRA achieves 2.4x better explainability than baselines.

02. Token-level explanations are more faithful and human-aligned.

03. SRA maintains competitive fairness across multiple metrics.

Abstract

The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluated SRA on hate speech benchmarks in English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations. Empirically, SRA achieves 2.4x better explainability compared to current baselines, and produces token-level explanations that are more…
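
The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the alignment term, so everything here is an assumption made for illustration: the alignment loss is taken to be a mean squared error between the attention weights and the rationale mask normalized to sum to one, the weighting factor `lam` is hypothetical, and the function names (`cross_entropy`, `alignment_loss`, `joint_loss`) are not from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard classification loss for a single example."""
    probs = softmax(logits)
    return -math.log(probs[label])


def alignment_loss(attention, rationale_mask):
    """Assumed alignment term: MSE between attention weights and the
    human rationale mask normalized into a distribution over tokens.

    Returns 0.0 when no token is marked as a rationale, so examples
    without rationale annotations do not contribute to this term.
    """
    total = sum(rationale_mask)
    if total == 0:
        return 0.0
    target = [r / total for r in rationale_mask]
    squared = [(a - t) ** 2 for a, t in zip(attention, target)]
    return sum(squared) / len(attention)


def joint_loss(logits, label, attention, rationale_mask, lam=1.0):
    """Classification loss plus a weighted alignment loss.

    `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    return cross_entropy(logits, label) + lam * alignment_loss(attention, rationale_mask)


if __name__ == "__main__":
    logits = [0.2, 1.5]          # two-class scores for one example
    label = 1                    # gold class index
    rationale = [0, 1, 1, 0]     # tokens annotators marked as rationale
    aligned = [0.0, 0.5, 0.5, 0.0]
    misaligned = [0.5, 0.0, 0.0, 0.5]
    print("aligned:   ", joint_loss(logits, label, aligned, rationale))
    print("misaligned:", joint_loss(logits, label, misaligned, rationale))
```

With identical logits, attention that concentrates on the annotated rationale tokens yields a lower joint loss than attention that ignores them, which is the pressure an alignment term of this kind would place on the model during training.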

Code & Models

Models

bragee/sra-hate-speech-bert (Hugging Face model · 300 downloads · 1 like)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition