ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar

TL;DR
ToxiGen is a large, machine-generated dataset designed to improve hate speech detection, especially for implicit toxicity and toxicity targeting minority groups. It provides diverse, subtly toxic and benign examples that make classifiers more robust.
Contribution
The paper introduces ToxiGen, a new large-scale dataset created with a demonstration-based prompting framework and adversarial classifier-in-the-loop decoding. It covers implicit toxicity across 13 minority groups and exceeds previous human-written resources in both scale and demographic coverage.
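The general idea of demonstration-based prompting can be sketched as follows: a few example statements that share one target group and one intended label are formatted as a list, and the list is left open so the language model continues with a new statement in the same style. This is a minimal sketch under stated assumptions, not the paper's exact prompt template. The helper name `build_demo_prompt`, the `- ` bullet layout, and the placeholder statements are hypothetical.

```python
import random
from typing import List, Optional


def build_demo_prompt(
    demonstrations: List[str], k: int = 5, seed: Optional[int] = None
) -> str:
    """Sample k demonstrations and format them as an open-ended bulleted list.

    Hypothetical helper: the "- " bullet layout is an illustrative assumption.
    All demonstrations should share one target group and one intended label
    (all benign or all toxic) so the continuation inherits that style.
    """
    if k > len(demonstrations):
        raise ValueError("k cannot exceed the number of demonstrations")
    rng = random.Random(seed)
    # Sampling without replacement varies the prompt from call to call.
    chosen = rng.sample(demonstrations, k)
    lines = [f"- {text.strip()}" for text in chosen]
    # The trailing open bullet invites the LM to write one more statement.
    return "\n".join(lines) + "\n- "


# Placeholder benign demonstrations (hypothetical, for illustration only).
benign_demos = [
    "many people in this community volunteer at local food banks",
    "the neighborhood festival celebrates food and music from many cultures",
    "students from all backgrounds take part in the science fair",
]
print(build_demo_prompt(benign_demos, k=2, seed=0))
```

Resampling the demonstrations for every prompt is one simple way to keep generations diverse; the resulting string would then be passed to a pretrained language model for continuation.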
Findings
Finetuning classifiers on ToxiGen improves detection of human-written hate speech.
Humans struggle to distinguish machine-generated from human-written toxic text.
ToxiGen enhances classifier performance on both real and machine-generated toxicity.
Abstract
Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that…
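Classifier-in-the-loop decoding can be illustrated with a toy beam search in which each candidate is ranked by its language-model log-probability plus a weighted log-probability from a toxicity classifier. The sketch below assumes the adversarial direction means steering generations toward the label opposite the prompt's intent (for example, toxic-prompted text that the classifier scores as benign); this is a hedged reading of the abstract, not the authors' implementation. The toy bigram "LM", the toy classifier, and the function name `classifier_in_the_loop_beam_search` are all hypothetical stand-ins.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy "language model": next-token probabilities that depend only
# on the previous token. It stands in for a large pretrained LM.
ToyLM = Dict[str, Dict[str, float]]
Beam = Tuple[float, List[str]]  # (accumulated LM log-probability, tokens)


def classifier_in_the_loop_beam_search(
    lm: ToyLM,
    p_toxic: Callable[[Sequence[str]], float],
    start: str,
    steps: int,
    beam_width: int = 2,
    weight: float = 1.0,
    target_benign: bool = True,
) -> List[str]:
    """Beam search whose ranking mixes LM likelihood with a classifier score.

    p_toxic(tokens) returns the classifier's P(toxic | tokens). With
    target_benign=True, beams that the classifier considers benign are
    preferred, which is the adversarial direction when the prompt is meant
    to elicit toxic text.
    """

    def score(beam: Beam) -> float:
        lm_logp, tokens = beam
        p = p_toxic(tokens)
        p_target = 1.0 - p if target_benign else p
        # The classifier term is recomputed on the full prefix at every step.
        return lm_logp + weight * math.log(max(p_target, 1e-12))

    beams: List[Beam] = [(0.0, [start])]
    for _ in range(steps):
        candidates: List[Beam] = [
            (lm_logp + math.log(p), tokens + [tok])
            for lm_logp, tokens in beams
            for tok, p in lm.get(tokens[-1], {}).items()
            if p > 0
        ]
        if not candidates:  # no beam can be extended any further
            break
        # Keep the beam_width candidates with the best combined score.
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)[1]


toy_lm: ToyLM = {"<s>": {"x": 0.6, "y": 0.4}, "x": {"</s>": 1.0}, "y": {"</s>": 1.0}}


def toy_classifier(tokens: Sequence[str]) -> float:
    # Hypothetical classifier: any sequence containing "x" looks toxic.
    return 0.9 if "x" in tokens else 0.1


print(classifier_in_the_loop_beam_search(toy_lm, toy_classifier, "<s>", 2, weight=0.0))
# -> ['<s>', 'x', '</s>']  (pure LM likelihood)
print(classifier_in_the_loop_beam_search(toy_lm, toy_classifier, "<s>", 2, weight=1.0))
# -> ['<s>', 'y', '</s>']  (classifier guidance flips the choice)
```

With `weight=0.0` the search reduces to ordinary beam search; raising `weight` trades fluency under the LM for agreement with the classifier's target label.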
Code & Models
- 🤗 google/gemma-7b (model) · 30k downloads · ♡ 3293
- 🤗 google/gemma-2-2b-it (model) · 368k downloads · ♡ 1314
- 🤗 google/gemma-2-2b (model) · 489k downloads · ♡ 636
- 🤗 google/gemma-2b (model) · 174k downloads · ♡ 1152
- 🤗 google/gemma-2-27b-it (model) · 309k downloads · ♡ 561
- 🤗 google/gemma-2-9b-it (model) · 254k downloads · ♡ 781
- 🤗 ataeff/recurrentgemma-2b-it (model) · ♡ 1
- 🤗 tomh/toxigen_hatebert (model) · 769k downloads · ♡ 15
- 🤗 tomh/toxigen_roberta (model) · 7.0k downloads · ♡ 10
- 🤗 nicholasKluge/Aira-2-124M (model) · 340 downloads · ♡ 1
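A checkpoint such as `tomh/toxigen_roberta` can presumably be loaded with the Hugging Face `transformers` text-classification pipeline. The sketch below assumes the checkpoint reports `LABEL_1` for toxic and `LABEL_0` for benign; verify this mapping against the model card before relying on it. The helper names `load_toxigen_classifier` and `toxic_probability` are hypothetical.

```python
from typing import List


def load_toxigen_classifier(model_id: str = "tomh/toxigen_roberta"):
    """Build a text-classification pipeline that returns scores for all labels.

    Requires the `transformers` package and network access to download the
    model; the import is deferred so the helper below works without it.
    """
    from transformers import pipeline

    return pipeline("text-classification", model=model_id, top_k=None)


def toxic_probability(label_scores: List[dict], toxic_label: str = "LABEL_1") -> float:
    """Extract P(toxic) from one text's list of {"label", "score"} dicts.

    The default toxic_label is an assumption about this checkpoint's labels.
    """
    for entry in label_scores:
        if entry["label"] == toxic_label:
            return float(entry["score"])
    raise KeyError(f"label {toxic_label!r} not found in classifier output")
```

Usage, assuming network access: `clf = load_toxigen_classifier()` followed by `toxic_probability(clf(["some text"])[0])`. Passing a list of strings keeps the output shape as one list of label scores per input, although the exact output format can differ between `transformers` versions.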
