The Capacity for Moral Self-Correction in Large Language Models

Deep Ganguli; Amanda Askell; Nicholas Schiefer; Thomas I. Liao,; Kamil\.e Luko\v{s}i\=ut\.e; Anna Chen; Anna Goldie; Azalia Mirhoseini,; Catherine Olsson; Danny Hernandez; Dawn Drain; Dustin Li; Eli Tran-Johnson,; Ethan Perez; Jackson Kernion; Jamie Kerr; Jared Mueller; Joshua Landau; Kamal; Ndousse; Karina Nguyen; Liane Lovitt; Michael Sellitto; Nelson Elhage; Noemi; Mercado; Nova DasSarma; Oliver Rausch; Robert Lasenby; Robin Larson; Sam; Ringer; Sandipan Kundu; Saurav Kadavath; Scott Johnston; Shauna Kravec; Sheer; El Showk; Tamera Lanham; Timothy Telleen-Lawton; Tom Henighan; Tristan Hume,; Yuntao Bai; Zac Hatfield-Dodds; Ben Mann; Dario Amodei; Nicholas Joseph; Sam; McCandlish; Tom Brown; Christopher Olah; Jack Clark; Samuel R. Bowman; Jared; Kaplan

arXiv:2302.07459·cs.CL·February 21, 2023·51 cites

The Capacity for Moral Self-Correction in Large Language Models

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao,, Kamil\.e Luko\v{s}i\=ut\.e, Anna Chen, Anna Goldie, Azalia Mirhoseini,, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson,, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller

PDF

Open Access 1 Video

TL;DR

This paper demonstrates that large language models trained with reinforcement learning from human feedback can develop the ability to self-correct morally harmful outputs, with this capability emerging at 22 billion parameters and improving with scale.

Contribution

It provides empirical evidence that moral self-correction emerges in large language models and identifies the model size and training process as key factors.

Findings

01

Moral self-correction capability emerges at 22B parameters.

02

Capability improves with increasing model size and RLHF training.

03

Models can follow instructions to avoid harmful outputs.

Abstract

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

Moral Self-Correction in Large Language Models | paper explained· youtube

Taxonomy

TopicsTopic Modeling

MethodsTest