AI safety via debate

Geoffrey Irving; Paul Christiano; Dario Amodei

arXiv:1805.00899 · stat.ML · October 23, 2018 · 29 cites

TL;DR

This paper proposes a debate-based training method for AI safety: two agents argue in front of a human judge, so that the human can evaluate tasks too complex to judge directly. Initial experiments on MNIST image classification demonstrate the idea.

Contribution

The paper introduces a debate framework for AI alignment that combines self-play with human judgment, so that tasks too complex for a human to judge directly can still be supervised.

Findings

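1. Debate raises a sparse classifier's accuracy from 59.4% to 88.9% on MNIST when only 6 pixels are revealed.

2. Debate raises the classifier's accuracy from 48.2% to 85.2% when only 4 pixels are revealed.

3. A complexity-theoretic analogy: with optimal play and polynomial-time judges, debate can answer any question in PSPACE, whereas direct judging answers only NP questions.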
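
The contrast between NP and PSPACE in the last finding can be written in quantifier form. This is a sketch using standard complexity-theory characterizations; the symbols q (question), L (the set of questions whose answer is "yes"), J (a polynomial-time judge), and x_i (statements of polynomial length) are notation assumed here, not taken from the paper.

```latex
\begin{align*}
\text{Direct judging (NP):} \quad
  & q \in L \iff \exists x \;\; J(q, x) = 1 \\
\text{Debate (PSPACE):} \quad
  & q \in L \iff \exists x_1 \,\forall x_2 \,\exists x_3 \cdots Q_n x_n \;\; J(q, x_1, \dots, x_n) = 1
\end{align*}
```

Here n is polynomial in the size of q and Q_n is the quantifier for the last statement. The existential statements correspond to one debater's moves and the universal statements to the opponent's, which is why a judge who can only check a single transcript can still, under optimal play, decide questions beyond NP.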

Abstract

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus…
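
The game described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_debate`, the toy agents `says_composite` and `says_prime`, and `toy_judge` are hypothetical names, and a mechanical divisor check stands in for the human judge.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[int, str]]  # (speaker index, statement)
Agent = Callable[[int, Transcript], str]
Judge = Callable[[int, Transcript], int]


def run_debate(question: int, agents: Tuple[Agent, Agent], judge: Judge,
               max_statements: int) -> Tuple[int, Tuple[int, int], Transcript]:
    """Alternate statements up to a limit, then let the judge pick a winner."""
    transcript: Transcript = []
    for turn in range(max_statements):
        speaker = turn % 2  # the two agents take turns
        transcript.append((speaker, agents[speaker](question, transcript)))
    winner = judge(question, transcript)
    rewards = (1, -1) if winner == 0 else (-1, 1)  # zero sum: rewards add to 0
    return winner, rewards, transcript


# Toy setting (hypothetical): the question n means "is n composite?".
def says_composite(n: int, transcript: Transcript) -> str:
    # Agent 0 names the smallest nontrivial divisor if one exists.
    for d in range(2, n):
        if n % d == 0:
            return str(d)
    return "none"


def says_prime(n: int, transcript: Transcript) -> str:
    # Agent 1 has no short evidence to offer in this toy setting.
    return "none"


def toy_judge(n: int, transcript: Transcript) -> int:
    # Cheap check standing in for the human judge: agent 0 wins only if
    # it named a real nontrivial divisor; otherwise agent 1 wins.
    for speaker, statement in transcript:
        if speaker == 0 and statement.isdigit():
            d = int(statement)
            if 1 < d < n and n % d == 0:
                return 0
    return 1


winner, rewards, transcript = run_debate(
    91, (says_composite, says_prime), toy_judge, max_statements=4
)
```

The point of the toy is that the judge never searches for a divisor itself; it only verifies a short claim surfaced by a debater. In the paper's setting the agents would be trained by self-play on this reward signal, which the sketch does not attempt.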

Code & Models

Datasets

kvoudouris/chess-debate-puzzles
dataset · 27 downloads

Taxonomy

Topics: Computability, Logic, AI Algorithms · Reinforcement Learning in Robotics · Multi-Agent Systems and Negotiation