SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility
Xuanyu Su, Diana Inkpen, Nathalie Japkowicz

TL;DR
This paper introduces SoftHateBench, a comprehensive benchmark for evaluating moderation models' ability to detect reasoning-driven, policy-compliant soft hate speech that appears reasonable but promotes hostility.
Contribution
It presents a generative benchmark that integrates the Argumentum Model of Topics (AMT) and Relevance Theory (RT) to produce soft-hate variants across multiple domains, exposing gaps in current moderation systems.
Findings
Detection systems perform worse on soft hate than on hard hate.
Current models often fail to identify subtle, reasoning-based hostility that avoids surface toxicity cues.
SoftHateBench covers 7 sociocultural domains and 28 target groups with 4,745 instances.
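As an illustration of how the hard-versus-soft gap in the findings above could be quantified, here is a minimal sketch in Python. It assumes each benchmark instance is hateful by construction and carries a tier label ("hard" or "soft") plus a boolean indicating whether a moderation model flagged it. The function name detection_rate_by_tier, the record layout, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def detection_rate_by_tier(records):
    """Return the fraction of instances flagged as hateful, per tier.

    `records` is an iterable of (tier, was_flagged) pairs. Every instance
    is assumed to be hateful, so the flagged fraction is the detection rate.
    This layout is an assumption made for illustration only.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for tier, was_flagged in records:
        total[tier] += 1
        flagged[tier] += int(was_flagged)
    return {tier: flagged[tier] / total[tier] for tier in total}


# Toy, made-up moderation outputs (NOT results from the paper).
toy_records = [
    ("hard", True), ("hard", True), ("hard", True), ("hard", False),
    ("soft", True), ("soft", False), ("soft", False), ("soft", False),
]

rates = detection_rate_by_tier(toy_records)
gap = rates["hard"] - rates["soft"]  # positive gap: soft hate is missed more often
print(rates, gap)
```

A positive gap means the model catches explicit hate more reliably than its reasoning-driven counterpart. The same grouping could be keyed on domain or target group instead of tier.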
Abstract
Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Topic Modeling
