Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort

TL;DR
This paper explores methods for red teaming language models to identify and reduce harmful outputs, analyzing scaling behaviors across different models and sharing a large dataset of attack examples to foster community standards.
Contribution
The paper analyzes red teaming methods and how red teaming difficulty changes with model scale, releases a large dataset of red team attacks, and documents its methodology and lessons learned to improve safety practices.
Findings
RLHF models become harder to red team as they scale
The other three model types (plain LM, prompted LM, and LM with rejection sampling) show a flat trend in red teaming difficulty with scale
A dataset of 38,961 red team attacks is released for community use
Abstract
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to…
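Of the four model types listed in the abstract, "an LM with rejection sampling" is the one whose mechanism is easiest to show in a few lines. The sketch below is a generic illustration of the idea and not the authors' implementation: draw several candidate responses, score each for harmfulness, and return the least harmful one. The names rejection_sample, generate, harm_score, and k are hypothetical, and the canned responses and keyword-based scorer are toy stand-ins for a real language model and a learned harmlessness scorer.

```python
import itertools
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    harm_score: Callable[[str, str], float],
    k: int,
) -> str:
    """Draw k candidate responses and return the one scored least harmful.

    `generate` and `harm_score` are hypothetical callables supplied by the
    caller; nothing here reflects a specific model or library API.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(k)]
    # Lower score means less harmful; ties go to the earliest candidate.
    return min(candidates, key=lambda response: harm_score(prompt, response))


# Toy stand-ins so the sketch runs without any model (assumptions, not real data).
CANNED_RESPONSES = [
    "I can't help with that.",
    "Sure, here is how to do it.",
    "Let's talk about something safer.",
]


def make_toy_generator() -> Callable[[str], str]:
    """Return a generator that cycles through the canned responses in order."""
    responses = itertools.cycle(CANNED_RESPONSES)

    def generate(prompt: str) -> str:
        return next(responses)

    return generate


def toy_harm_score(prompt: str, response: str) -> float:
    """Toy scorer: treat compliant-sounding responses as harmful."""
    return 1.0 if response.startswith("Sure") else 0.0


if __name__ == "__main__":
    best = rejection_sample(
        "a red team attack prompt", make_toy_generator(), toy_harm_score, k=3
    )
    print(best)
```

In a real setup the scorer would be a trained model and k would trade extra compute for a lower chance of returning a harmful response. The abstract excerpt here does not state the values the authors used, so none are assumed.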
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Terrorism, Counterterrorism, and Political Violence
