Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli; Liane Lovitt; Jackson Kernion; Amanda Askell; Yuntao Bai; Saurav Kadavath; Ben Mann; Ethan Perez; Nicholas Schiefer; Kamal Ndousse; Andy Jones; Sam Bowman; Anna Chen; Tom Conerly; Nova DasSarma; Dawn Drain; Nelson Elhage; Sheer El-Showk; Stanislav Fort; Zac Hatfield-Dodds; Tom Henighan; Danny Hernandez; Tristan Hume; Josh Jacobson; Scott Johnston; Shauna Kravec; Catherine Olsson; Sam Ringer; Eli Tran-Johnson; Dario Amodei; Tom Brown; Nicholas Joseph; Sam McCandlish; Chris Olah; Jared Kaplan; Jack Clark

arXiv:2209.07858·cs.CL·November 24, 2022·102 cites



TL;DR

This paper explores methods for red teaming language models to identify and reduce harmful outputs, analyzing scaling behaviors across different models and sharing a large dataset of attack examples to foster community standards.

Contribution

The paper analyzes how red teaming difficulty changes with model size and model type, releases a dataset of 38,961 red team attacks, and documents its methodology and lessons learned so that others can improve their safety practices.

Findings

01. RLHF models become harder to red team as they scale

02. Other model types show flat red teaming difficulty with scale

03. A dataset of 38,961 red team attacks is released for community use

Abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to…
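The abstract names rejection sampling as one of the four model types but does not describe the mechanism here. As a rough illustration only, the sketch below shows a generic best-of-k form of the idea: draw several candidate responses and keep the one a scoring function rates as least harmful. The names `rejection_sample`, `generate`, `harm_score`, `make_toy_generator`, and `toy_harm_score`, the default `k`, and the keyword-based scorer are all hypothetical and are not taken from the paper; the authors' actual setup may differ.

```python
import itertools
from typing import Callable


def rejection_sample(
    generate: Callable[[str], str],
    harm_score: Callable[[str, str], float],
    prompt: str,
    k: int = 4,
) -> str:
    """Draw k candidate responses and return the lowest-scoring one.

    `generate` and `harm_score` are hypothetical stand-ins for a language
    model sampler and a harmfulness scorer (lower score = less harmful).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate(prompt) for _ in range(k)]
    return min(candidates, key=lambda response: harm_score(prompt, response))


def make_toy_generator() -> Callable[[str], str]:
    """Toy sampler (assumption): cycles through canned responses."""
    canned = itertools.cycle(
        [
            "Sure, here is how to do harm.",
            "I can't help with that.",
            "Maybe try harm?",
        ]
    )
    return lambda prompt: next(canned)


def toy_harm_score(prompt: str, response: str) -> float:
    """Toy scorer (assumption): counts occurrences of the word 'harm'."""
    return float(response.lower().count("harm"))


if __name__ == "__main__":
    best = rejection_sample(make_toy_generator(), toy_harm_score, "example prompt", k=3)
    print(best)
```

In a real system the toy scorer would be replaced by a learned model, and the quality of that scorer largely determines how much the filtering helps.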


Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Terrorism, Counterterrorism, and Political Violence