Compositional Generalisation for Explainable Hate Speech Detection
Agostina Calabrese, Tom Sherborne, Björn Ross, Mirella Lapata

TL;DR
This paper introduces a new dataset and benchmark to improve hate speech detection models' ability to generalize compositionally, addressing limitations of existing models trained on sentence-level labels.
Contribution
It presents U-PLEAD, a synthetic dataset, and demonstrates that training on combined synthetic and real data enhances compositional generalization in hate speech detection.
Findings
Training on U-PLEAD improves generalization to unseen expressions.
Combining synthetic and real data achieves state-of-the-art results on PLEAD.
Models struggle to disentangle label meaning from context even with span-level annotations.
Abstract
Hate speech detection is key to online content moderation, but current models struggle to generalise beyond their training data. This has been linked to dataset biases and the use of sentence-level labels, which fail to teach models the underlying structure of hate speech. In this work, we show that even when models are trained with more fine-grained, span-level annotations (e.g., "artists" is labeled as target and "are parasites" as dehumanising comparison), they struggle to disentangle the meaning of these labels from the surrounding context. As a result, combinations of expressions that deviate from those seen during training remain particularly difficult for models to detect. We investigate whether training on a dataset where expressions occur with equal frequency across all contexts can improve generalisation. To this end, we create U-PLEAD, a dataset of ~364,000 synthetic posts,…
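The idea of a dataset in which expressions occur with equal frequency across all contexts can be illustrated with a small sketch. This is not the authors' generation procedure, which the truncated abstract does not describe. It only assumes that posts are assembled from labelled spans, and it enumerates every combination so that each span appears equally often. The span inventories, templates, and the function name `build_uniform_posts` are hypothetical; only "artists" and "are parasites" come from the abstract's example.

```python
from collections import Counter
from itertools import product

# Hypothetical span inventories. Only the first entry of each list comes
# from the abstract's example; the rest are neutral placeholders.
TARGETS = ["artists", "<TARGET_2>", "<TARGET_3>"]
COMPARISONS = ["are parasites", "<COMPARISON_2>"]

# Hypothetical surrounding contexts into which the spans are slotted.
TEMPLATES = ["{target} {comparison}", "honestly, {target} {comparison}"]


def build_uniform_posts(targets, comparisons, templates):
    """Enumerate every template/target/comparison combination once.

    Taking the full Cartesian product means each span occurs the same
    number of times and is paired with every other span and context.
    """
    posts = []
    for template, target, comparison in product(templates, targets, comparisons):
        posts.append({
            "text": template.format(target=target, comparison=comparison),
            "spans": {
                "target": target,
                "dehumanising_comparison": comparison,
            },
        })
    return posts


posts = build_uniform_posts(TARGETS, COMPARISONS, TEMPLATES)

# Count how often each span occurs across the generated posts.
target_counts = Counter(p["spans"]["target"] for p in posts)
comparison_counts = Counter(p["spans"]["dehumanising_comparison"] for p in posts)
```

In this toy setting every target appears the same number of times as every other target, and likewise for comparisons, so a model cannot rely on a span co-occurring with one particular context.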
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition
