AI safety via debate
Geoffrey Irving, Paul Christiano, Dario Amodei

TL;DR
This paper proposes debate as a training method for AI safety: two agents argue in front of a human judge, so the human can evaluate tasks that are too complex to judge directly. Initial experiments demonstrate the idea on MNIST image classification.
Contribution
It introduces debate, a framework for AI alignment that combines self-play on a zero-sum game with human judgment to handle tasks that a human cannot judge directly.
Findings
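On MNIST, debate raises the accuracy of a sparse classifier from 59.4% to 88.9% when it is given 6 pixels.
With only 4 pixels, debate raises accuracy from 48.2% to 85.2%.
In a complexity-theory analogy, debate with optimal play can answer any question in PSPACE given polynomial-time judges, whereas direct judging answers only NP questions.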
Abstract
To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus…
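The game described in the abstract can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the names `run_debate`, `honest`, `liar`, and `toy_judge` are hypothetical, the agents are fixed functions instead of trained models, and a simple arithmetic check stands in for the human judge.

```python
import re
from typing import Callable, List, Tuple

Statement = Tuple[int, str]                    # (agent index, statement text)
Agent = Callable[[str, List[Statement]], str]  # hypothetical agent interface
Judge = Callable[[str, List[Statement]], int]  # returns the index of the winner


def run_debate(question: str, agents: Tuple[Agent, Agent], judge: Judge,
               max_statements: int):
    """Alternate statements up to a limit, then let the judge pick a winner."""
    transcript: List[Statement] = []
    for turn in range(max_statements):
        speaker = turn % 2                     # agents take turns
        transcript.append((speaker, agents[speaker](question, transcript)))
    winner = judge(question, transcript)
    # Zero-sum payoff: the winner gets +1 and the loser gets -1.
    rewards = (1, -1) if winner == 0 else (-1, 1)
    return winner, rewards, transcript


# Toy stand-ins (assumptions for illustration only).
def honest(question: str, transcript: List[Statement]) -> str:
    return "91 = 7 * 13, so 91 is not prime"


def liar(question: str, transcript: List[Statement]) -> str:
    return "91 has no divisors other than 1 and itself"


def toy_judge(question: str, transcript: List[Statement]) -> int:
    # Stand-in for the human judge: side with the first agent whose
    # statement contains a factorization that checks out by multiplication.
    for speaker, text in transcript:
        match = re.search(r"(\d+) = (\d+) \* (\d+)", text)
        if match:
            n, a, b = (int(g) for g in match.groups())
            if n == a * b:
                return speaker
    return 1  # arbitrary default when nothing is checkable


winner, rewards, transcript = run_debate(
    "Is 91 prime?", (honest, liar), toy_judge, max_statements=4)
```

The toy judge only has to verify one short claim, which mirrors the intuition that judging a debate can be easier than answering the question directly. In self-play training, the zero-sum rewards would be the learning signal for both agents.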
Taxonomy
Topics: Computability, Logic, AI Algorithms · Reinforcement Learning in Robotics · Multi-Agent Systems and Negotiation
