Model evaluation for extreme risks

Toby Shevlane; Sebastian Farquhar; Ben Garfinkel; Mary Phuong; Jess Whittlestone; Jade Leung; Daniel Kokotajlo; Nahema Marchal; Markus Anderljung; Noam Kolt; Lewis Ho; Divya Siddarth; Shahar Avin; Will Hawkins; Been Kim; Iason Gabriel; Vijay Bolina; Jack Clark; Yoshua Bengio; Paul Christiano; Allan Dafoe

arXiv:2305.15324·cs.AI·September 26, 2023·54 cites

Open Access · 3 Models · 2 Datasets

TL;DR

This paper emphasizes the importance of specialized model evaluation methods to identify and mitigate extreme risks posed by AI systems, focusing on dangerous capabilities and alignment issues.

Contribution

It introduces the concept of dangerous capability and alignment evaluations as essential tools for responsible AI development and risk mitigation.

Findings

01. Evaluation methods are crucial for identifying harmful AI capabilities.
02. Model assessments can inform safer deployment and policy decisions.
03. Enhanced evaluation frameworks are needed for managing extreme risks.

Abstract

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Information and Cyber Security