HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Xinyue Shen; Yixin Wu; Yiting Qu; Michael Backes; Savvas Zannettou; Yang Zhang

arXiv:2501.16750·cs.CR·January 29, 2025

TL;DR

HateBench evaluates hate speech detectors against LLM-generated hate speech, revealing performance degradation with newer LLMs and exposing new threats from LLM-driven hate campaigns using adversarial techniques.

Contribution

This paper introduces HateBench, a comprehensive benchmark for hate speech detection on LLM-generated content, and uncovers vulnerabilities and new threats posed by advanced attack methods.

Findings

01

Detectors are generally effective on LLM-generated hate speech, but their performance degrades with newer LLM versions

02

LLMs can be exploited both to generate hate speech and to evade its detection

03

Adversarial and model stealing attacks significantly increase attack success rates

Abstract

Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of…
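The per-LLM comparison described in the abstract can be illustrated with a minimal sketch. It assumes each sample stores its source LLM and three annotator votes, takes the majority vote as the ground-truth label, and scores a detector with F1 per source model. The field names (`llm`, `votes`, `text`), the placeholder model names, the toy detector, and the choice of majority voting and F1 are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from collections import Counter, defaultdict


def majority_label(votes):
    """Return the label chosen by most annotators (assumes an odd number of votes)."""
    return Counter(votes).most_common(1)[0][0]


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_source(samples, detector):
    """Group samples by source LLM and compute the detector's F1 for each group."""
    grouped = defaultdict(lambda: ([], []))
    for sample in samples:
        y_true, y_pred = grouped[sample["llm"]]
        y_true.append(majority_label(sample["votes"]))
        y_pred.append(detector(sample["text"]))
    return {llm: f1_score(t, p) for llm, (t, p) in grouped.items()}


def toy_detector(text):
    """Hypothetical stand-in detector: flags any text containing the token 'FLAG'."""
    return 1 if "FLAG" in text else 0


# Placeholder samples; the texts are neutral stand-ins, not real data.
samples = [
    {"llm": "llm-old", "text": "FLAG a", "votes": [1, 1, 0]},
    {"llm": "llm-old", "text": "benign b", "votes": [0, 0, 0]},
    {"llm": "llm-old", "text": "FLAG c", "votes": [1, 1, 1]},
    {"llm": "llm-new", "text": "FLAG d", "votes": [1, 1, 1]},
    {"llm": "llm-new", "text": "subtle e", "votes": [1, 0, 1]},
    {"llm": "llm-new", "text": "benign f", "votes": [0, 0, 0]},
]

results = f1_by_source(samples, toy_detector)
for llm, score in sorted(results.items()):
    print(f"{llm}: F1 = {score:.3f}")
```

In this toy data the detector misses one positive sample from the second source, so its F1 is lower there than on the first source. A real evaluation would swap in an actual detector and the annotated dataset.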

Code & Models

Repositories

trustairlab/hatebench
Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection