Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson

TL;DR
This paper introduces 'Constitutional AI', a method for training harmless AI assistants using self-improvement and AI feedback, reducing human labeling while enhancing safety and transparency.
Contribution
The paper presents a framework that combines a supervised learning phase with a reinforcement learning phase, guided by a short list of human-written principles and by AI-generated feedback, so that a harmless assistant can be trained without human labels identifying harmful outputs.
Findings
AI assistants can be trained to be harmless and non-evasive.
Chain-of-thought reasoning improves the transparency and human-judged performance of AI decision making.
Fewer human labels are needed for safety-aligned AI training.
Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI…
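The data-generation steps of the two phases described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Model` type, the prompt templates, the function names, and the `toy_model` stub are all hypothetical, and the training steps themselves (finetuning on revisions, fitting the preference model, running RL) are left out.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str, rounds: int = 1) -> str:
    """Supervised phase: sample a response, then self-critique and revise it."""
    response = model(prompt)
    for _ in range(rounds):
        critique = model(
            f"Request: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Request: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(
    model: Model, prompts: List[str], principles: List[str], rng: random.Random
) -> List[Tuple[str, str]]:
    """(prompt, revised response) pairs on which the original model would be finetuned."""
    return [(p, critique_and_revise(model, p, rng.choice(principles))) for p in prompts]


def build_preference_dataset(
    policy: Model,
    feedback_model: Model,
    prompts: List[str],
    principles: List[str],
    rng: random.Random,
) -> List[Tuple[str, str, str, int]]:
    """RL phase: sample two responses per prompt and let a model pick the better one."""
    rows = []
    for p in prompts:
        a, b = policy(p), policy(p)
        verdict = feedback_model(
            f"Request: {p}\n(A) {a}\n(B) {b}\n"
            f"Which response better satisfies this principle: {rng.choice(principles)}\n"
            "Answer A or B."
        )
        chosen = 0 if verdict.strip().upper().startswith("A") else 1
        rows.append((p, a, b, chosen))  # chosen indexes the preferred response
    return rows


def toy_model(text: str) -> str:
    """Canned stub so the sketch runs without a real model (purely illustrative)."""
    if "Answer A or B" in text:
        return "A"
    if "Rewrite the response" in text:
        return "[revised answer]"
    if "Critique the response" in text:
        return "[critique]"
    return "[draft answer]"


if __name__ == "__main__":
    rng = random.Random(0)
    principles = ["Prefer the response that is least harmful."]  # illustrative wording
    print(build_sl_dataset(toy_model, ["Example request"], principles, rng))
    print(build_preference_dataset(toy_model, toy_model, ["Example request"], principles, rng))
```

In this sketch the only human input is the `principles` list, which mirrors the abstract's statement that human oversight comes solely through a list of rules. The rows from `build_preference_dataset` would then be used to train a preference model whose score serves as the RL reward signal.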
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
