Avoiding Obfuscation with Prover-Estimator Debate

Jonah Brown-Cohen; Geoffrey Irving; Georgios Piliouras

arXiv:2506.13609·cs.AI·June 17, 2025

Open Access

TL;DR

This paper introduces a recursive debate protocol for AI systems that is designed to prevent dishonest debaters from winning through obfuscation. Honest strategies remain computationally feasible, which makes the judgments reached in debates over complex tasks more reliable.

Contribution

The paper proposes a recursive debate protocol that mitigates the obfuscated arguments problem, so that honest AI debaters can win with computationally efficient strategies under stability assumptions.

Findings

01. The protocol mitigates the obfuscated arguments problem in AI debate.

02. Under stability assumptions, honest debaters can win using computationally efficient strategies.

03. The approach extends the class of problems that can be reliably judged in AI debate.

Abstract

Training powerful AI systems to exhibit desired behaviors hinges on the ability to provide accurate human supervision on increasingly complex tasks. A promising approach to this problem is to amplify human judgement by leveraging the power of two competing AIs in a debate about the correct solution to a given problem. Prior theoretical work has provided a complexity-theoretic formalization of AI debate, and posed the problem of designing protocols for AI debate that guarantee the correctness of human judgements for as complex a class of problems as possible. Recursive debates, in which debaters decompose a complex problem into simpler subproblems, hold promise for growing the class of problems that can be accurately judged in a debate. However, existing protocols for recursive debate run into the obfuscated arguments problem: a dishonest debater can use a computationally efficient…

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Adversarial Robustness in Machine Learning · Benford’s Law and Fraud Detection