Model evaluation for extreme risks
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio

TL;DR
This paper emphasizes the importance of specialized model evaluation methods to identify and mitigate extreme risks posed by AI systems, focusing on dangerous capabilities and alignment issues.
Contribution
It introduces the concept of dangerous capability and alignment evaluations as essential tools for responsible AI development and risk mitigation.
Findings
Evaluation methods are crucial for identifying harmful AI capabilities.
Model assessments can inform safer deployment and policy decisions.
Enhanced evaluation frameworks are needed for managing extreme risks.
Abstract
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Information and Cyber Security
