An overview of 11 proposals for building safe advanced AI
Evan Hubinger

TL;DR
This paper provides a comparative analysis of 11 proposals for building safe advanced AI, evaluating their strengths and weaknesses across key alignment and performance components to guide future research.
Contribution
It introduces a framework for comparing AI safety proposals across four components: outer alignment, inner alignment, training competitiveness, and performance competitiveness. The distinction between training and performance competitiveness is new to this paper.
Findings
Evaluates 11 AI safety proposals across four components
Introduces the distinction between training and performance competitiveness
Provides insights into the relative strengths and weaknesses of each proposal
Abstract
This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders such as iterated amplification, AI safety via debate, and recursive reward modeling. Each proposal is evaluated on the four components of outer alignment, inner alignment, training competitiveness, and performance competitiveness, of which the distinction between the latter two is introduced in this paper. While prior literature has primarily focused on analyzing individual proposals, or primarily focused on outer alignment at the expense of inner alignment, this analysis seeks to take a comparative look at a wide range of proposals including a comparative analysis across all four previously mentioned components.
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Software Engineering Research
