RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A., Smith

TL;DR
This paper introduces RealToxicityPrompts, a dataset for evaluating toxicity in language models, revealing that current models can generate toxic content from benign prompts and that existing mitigation methods are not foolproof.
Contribution
The paper presents a new dataset for toxicity evaluation, analyzes the toxicity of pretraining corpora, and assesses the effectiveness of various controllable generation methods.
Findings
Pretrained LMs can produce toxic language from innocuous prompts.
More intensive mitigation methods are more effective but not foolproof.
Pretraining data contains significant toxic and unreliable content.
Abstract
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗google/gemma-7bmodel· 30k dl· ♡ 329330k dl♡ 3293
- 🤗google/gemma-2-2b-itmodel· 368k dl· ♡ 1314368k dl♡ 1314
- 🤗google/gemma-2-2bmodel· 489k dl· ♡ 636489k dl♡ 636
- 🤗google/gemma-2bmodel· 174k dl· ♡ 1152174k dl♡ 1152
- 🤗google/gemma-2-27b-itmodel· 309k dl· ♡ 561309k dl♡ 561
- 🤗google/gemma-2-9b-itmodel· 254k dl· ♡ 781254k dl♡ 781
- 🤗ataeff/recurrentgemma-2b-itmodel· ♡ 1♡ 1
- 🤗google/gemma-2b-itmodel· 57k dl· ♡ 86257k dl♡ 862
- 🤗google/gemma-7b-itmodel· 67k dl· ♡ 124167k dl♡ 1241
- 🤗alpindale/gemma-7bmodel· 66 dl· ♡ 766 dl♡ 7
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
