Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai; Saurav Kadavath; Sandipan Kundu; Amanda Askell; Jackson Kernion; Andy Jones; Anna Chen; Anna Goldie; Azalia Mirhoseini; Cameron McKinnon; Carol Chen; Catherine Olsson; Christopher Olah; Danny Hernandez; Dawn Drain; Deep Ganguli; Dustin Li; Eli Tran-Johnson; Ethan Perez; Jamie Kerr; Jared Mueller; Jeffrey Ladish; Joshua Landau; Kamal Ndousse; Kamile Lukosuite; Liane Lovitt; Michael Sellitto; Nelson Elhage; Nicholas Schiefer; Noemi Mercado; Nova DasSarma; Robert Lasenby; Robin Larson; Sam Ringer; Scott Johnston; Shauna Kravec; Sheer El Showk; Stanislav Fort; Tamera Lanham; Timothy Telleen-Lawton; Tom Conerly; Tom Henighan; Tristan Hume; Samuel R. Bowman; Zac Hatfield-Dodds; Ben Mann; Dario Amodei; Nicholas Joseph; Sam McCandlish; Tom Brown; Jared Kaplan

arXiv:2212.08073·cs.CL·December 19, 2022·302 cites

TL;DR

This paper introduces 'Constitutional AI', a method for training harmless AI assistants using self-improvement and AI feedback, reducing human labeling while enhancing safety and transparency.

Contribution

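The paper presents a framework that combines a supervised learning phase with a reinforcement learning phase. Human oversight enters only through a short list of written rules or principles; the critiques, revisions, and preference labels used for harmlessness training are generated by the model itself, so no human labels identifying harmful outputs are required.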

Findings

01. AI assistants can be trained to be harmless and non-evasive.

02. Chain-of-thought reasoning improves AI transparency and performance.

03. Fewer human labels are needed for safety-aligned AI training.

Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI…
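The data flow of the two phases described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Model` type, the prompt templates, the toy stand-in models, and the choice to sample one principle at random per example are all assumptions made here so the example runs without a real language model. Finetuning, preference-model training, and the RL step itself are not shown; the sketch only builds the two datasets those steps would consume.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt string in, response string out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> str:
    """Supervised phase, one example: sample, self-critique, then revise."""
    response = model(prompt)
    # Prompt templates below are illustrative assumptions, not taken from the paper.
    critique = model(
        f"Critique this response using the principle '{principle}':\n{response}"
    )
    revision = model(
        "Rewrite the response to address the critique.\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return revision


def build_sl_dataset(
    model: Model, prompts: List[str], principles: List[str], rng: random.Random
) -> List[Tuple[str, str]]:
    """Pairs of (prompt, revised response) for finetuning the original model."""
    return [
        (p, critique_and_revise(model, p, rng.choice(principles))) for p in prompts
    ]


def build_preference_dataset(
    policy: Model,
    judge: Model,
    prompts: List[str],
    principles: List[str],
    rng: random.Random,
) -> List[Tuple[str, str, str]]:
    """RL phase data: two samples per prompt, with an AI judge picking the better one."""
    data = []
    for p in prompts:
        a, b = policy(p), policy(p)
        principle = rng.choice(principles)
        verdict = judge(
            f"Principle: {principle}\nPrompt: {p}\n(A) {a}\n(B) {b}\n"
            "Which response is better? Answer A or B."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        data.append((p, chosen, rejected))
    return data


def toy_model(prompt: str) -> str:
    # Toy stand-in so the sketch runs without a real language model.
    return f"response[{len(prompt)}]"


def toy_judge(prompt: str) -> str:
    # Toy judge that always prefers the first sample.
    return "A"


if __name__ == "__main__":
    rng = random.Random(0)
    principles = ["Choose the response that is less harmful."]  # hypothetical wording
    prompts = ["How do I pick a lock?", "Tell me a joke."]
    sl_data = build_sl_dataset(toy_model, prompts, principles, rng)
    pref_data = build_preference_dataset(toy_model, toy_judge, prompts, principles, rng)
    print(len(sl_data), len(pref_data))
```

In a real pipeline, `sl_data` would be used to finetune the original model, and `pref_data` would train the preference model whose score serves as the RL reward signal.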

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)