An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving

TL;DR
This paper proposes a debate-based approach to AI safety, using a safety case sketch to argue that AI systems trained via debate can remain honest and avoid egregious harm, and it identifies the open research challenges that remain.
Contribution
The paper introduces a debate-based safety case sketch for arguing that an AI system is honest and safe, outlining the key claims and the research directions for practical implementation.
Findings
Debate training can promote honesty in AI systems.
Performance in debate correlates with system honesty.
Open problems include ensuring robustness and deployment safety.
Abstract
If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions, making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case": an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
