Knowledge Divergence and the Value of Debate for Scalable Oversight
Robin Young

TL;DR
This paper provides a formal geometric framework analyzing when debate offers advantages over other AI oversight methods, revealing how knowledge divergence influences debate effectiveness and its connection to reinforcement learning from AI feedback.
Contribution
It introduces a geometric analysis of debate's value using principal angles, classifies regimes of knowledge divergence, and links debate to reinforcement learning from AI feedback.
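As a rough illustration of the geometric quantity involved, the sketch below computes principal angles between two subspaces with NumPy. It is not the paper's closed-form debate advantage, which this summary does not state. The function name `principal_angles` and the toy matrices are hypothetical, and the inputs are assumed to have full column rank.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Illustrative helper, not taken from the paper; assumes A and B have
    full column rank.
    """
    # Orthonormal bases for each subspace via reduced QR.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing values slightly outside [0, 1].
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Toy example: two planes in R^3 that share one direction (e1) and
# differ in the other (e2 versus e3).
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
angles = principal_angles(A, B)
```

In this toy case one angle is zero (the shared direction) and the other is a right angle (the direction unique to each plane). Identical subspaces give all-zero angles, which loosely corresponds to the shared-knowledge case described in the findings.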
Findings
Debate advantage scales with knowledge divergence, with a phase transition from a quadratic regime (negligible benefit) to a linear regime (debate is essential).
When models share training data, debate reduces to a single-agent method similar to RLAIF.
Sufficiently adversarial incentives can cause coordination failure in debate.
Abstract
AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate's value through the geometry of knowledge divergence between debating models. Using principal angles between models' representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to an RLAIF-like setting in which a single-agent method recovers the same optimum. When models possess divergent knowledge, the debate advantage scales with a phase transition from a quadratic regime (debate offers negligible benefit) to a linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Ethics and Social Impacts of AI
