Avoiding Obfuscation with Prover-Estimator Debate
Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

TL;DR
This paper introduces a new recursive debate protocol for AI systems that aims to prevent obfuscation by dishonest debaters, ensuring honest strategies remain computationally feasible and improving the reliability of AI judgment in complex tasks.
Contribution
The paper proposes a novel recursive debate protocol that mitigates obfuscation, enabling honest AI debaters to win with efficient strategies under certain assumptions.
Findings
The protocol mitigates the obfuscated arguments problem in AI debate.
Under stability assumptions, honest debaters can win using computationally efficient strategies.
The approach extends the class of problems that can be reliably judged in AI debate.
Abstract
Training powerful AI systems to exhibit desired behaviors hinges on the ability to provide accurate human supervision on increasingly complex tasks. A promising approach to this problem is to amplify human judgement by leveraging the power of two competing AIs in a debate about the correct solution to a given problem. Prior theoretical work has provided a complexity-theoretic formalization of AI debate, and posed the problem of designing protocols for AI debate that guarantee the correctness of human judgements for as complex a class of problems as possible. Recursive debates, in which debaters decompose a complex problem into simpler subproblems, hold promise for growing the class of problems that can be accurately judged in a debate. However, existing protocols for recursive debate run into the obfuscated arguments problem: a dishonest debater can use a computationally efficient…
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Adversarial Robustness in Machine Learning · Benford’s Law and Fraud Detection
