SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Xuanyu Su; Diana Inkpen; Nathalie Japkowicz

arXiv:2601.20256·cs.CL·January 29, 2026

Open Access

TL;DR

This paper introduces SoftHateBench, a benchmark for evaluating whether moderation models can detect soft hate speech: reasoning-driven, policy-compliant discourse that appears reasonable but promotes hostility toward a target group.

Contribution

It presents a generative benchmark that integrates the Argumentum Model of Topics (AMT) and Relevance Theory (RT) to produce soft-hate variants across multiple domains, highlighting gaps in current moderation systems.

Findings

01

Detection systems perform poorly on soft hate compared to hard hate.

02

Current models often fail to identify subtle, reasoning-based hostility.

03

SoftHateBench covers 7 sociocultural domains and 28 target groups with 4,745 instances.

Abstract

Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Topic Modeling