RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language   Models

Samuel Gehman; Suchin Gururangan; Maarten Sap; Yejin Choi; Noah A.; Smith

arXiv:2009.11462·cs.CL·September 29, 2020

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A., Smith

PDF

3 Repos 10 Models 5 Datasets

TL;DR

This paper introduces RealToxicityPrompts, a dataset for evaluating toxicity in language models, revealing that current models can generate toxic content from benign prompts and that existing mitigation methods are not foolproof.

Contribution

The paper presents a new dataset for toxicity evaluation, analyzes the toxicity of pretraining corpora, and assesses the effectiveness of various controllable generation methods.

Findings

01

Pretrained LMs can produce toxic language from innocuous prompts.

02

More intensive mitigation methods are more effective but not foolproof.

03

Pretraining data contains significant toxic and unreliable content.

Abstract

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.