(When) Is Truth-telling Favored in AI Debate?
Vojtěch Kovařík, Ryan Carey

TL;DR
This paper introduces a mathematical framework for modeling AI debates, which aim to amplify human problem-solving when human judgment is unreliable, and analyzes how debate design affects whether debates track the truth.
Contribution
It presents a novel debate-modeling framework and analyzes a simple instance of it, feature debate, to study truth-tracking and strategic behavior.
Findings
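Despite their simplicity, feature debates capture many aspects of practical debates and allow analysis of how well debates track the truth.
Debaters can be incentivized to confuse the judge or to stall to avoid losing.
Debate designs should be judged by the accuracy of the most persuasive answer, a criterion that can guide the design of more truthful AI debates.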
Abstract
For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth. We argue that despite being very simple, feature debates nonetheless capture many aspects of practical debates such as the incentives to confuse the judge or stall to prevent losing. We then outline how these models should be generalized to analyze a wider range of debate phenomena.
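To make the proposed quality measure ("the accuracy of the most persuasive answer") concrete, here is a minimal Python sketch of a toy debate loosely in the spirit of feature debate. It is not the paper's formal model. The world representation (features valued ±1 with integer weights), the judge that sums only the evidence debaters reveal, the tie-break toward "no", and the names `World`, `debate_outcome`, and `accuracy_of_most_persuasive_answer` are all hypothetical assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical toy world: feature values in {-1, +1} with positive integer weights."""
    values: list
    weights: list

    def contributions(self) -> list:
        # Signed evidence each feature provides for the answer "yes".
        return [w * v for w, v in zip(self.weights, self.values)]

    def true_answer(self) -> bool:
        # Assumed ground truth: "yes" iff the full weighted sum is positive (ties count as "no").
        return sum(self.contributions()) > 0


def debate_outcome(world: World, rounds: int) -> bool:
    """Judge's verdict when both debaters play optimally for `rounds` rounds each.

    Assumed judge: sums the contributions of revealed features only and treats
    unrevealed ones as zero. Under this judge, each debater's best move is to
    reveal its strongest remaining favorable feature, so greedy play is optimal.
    """
    contribs = world.contributions()
    pro = sorted((c for c in contribs if c > 0), reverse=True)  # "yes" debater's evidence, strongest first
    con = sorted(c for c in contribs if c < 0)                   # "no" debater's evidence, most negative first
    revealed = sum(pro[:rounds]) + sum(con[:rounds])
    return revealed > 0  # ties resolve to "no", matching true_answer


def accuracy_of_most_persuasive_answer(n_features: int, rounds: int,
                                       n_worlds: int = 10_000, seed: int = 0) -> float:
    """Fraction of random worlds where the optimal-play debate outcome equals the truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_worlds):
        world = World(
            values=[rng.choice((-1, 1)) for _ in range(n_features)],
            weights=[rng.randint(1, 10) for _ in range(n_features)],
        )
        correct += debate_outcome(world, rounds) == world.true_answer()
    return correct / n_worlds


if __name__ == "__main__":
    for r in (1, 2, 5, 20):
        print(r, accuracy_of_most_persuasive_answer(n_features=20, rounds=r))
```

When debaters have enough rounds to reveal every feature, this toy judge always recovers the truth; with only one round each, the persuasive answer sometimes diverges from the true one. That gap is what the accuracy-based measure is meant to quantify. Integer weights keep the sums exact, so the comparison with the ground truth is not affected by floating-point rounding.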
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Multi-Agent Systems and Negotiation
