What AI evaluations for preventing catastrophic risks can and cannot do
Peter Barnett, Lisa Thiergart

TL;DR
AI evaluations are useful for certain safety assessments but have fundamental limitations, including an inability to establish upper bounds on capabilities, reliably forecast future capabilities, or robustly assess risks from autonomous AI systems, so they must be supplemented by other safety measures.
Contribution
The paper critically analyzes what current AI evaluation methods can and cannot establish when used to prevent catastrophic risks, and identifies limitations that cannot be overcome within the current evaluation paradigm.
Findings
Evaluations can establish lower bounds on AI capabilities.
Evaluations can assess certain misuse risks.
Fundamental limitations prevent evaluations from establishing upper bounds on capabilities, reliably forecasting future model capabilities, or robustly assessing risks from autonomous AI systems.
Abstract
AI evaluations are an important component of the AI governance toolkit, underlying current approaches to safety cases for preventing catastrophic risks. Our paper examines what these evaluations can and cannot tell us. Evaluations can establish lower bounds on AI capabilities and assess certain misuse risks given sufficient effort from evaluators. Unfortunately, evaluations face fundamental limitations that cannot be overcome within the current paradigm. These include an inability to establish upper bounds on capabilities, reliably forecast future model capabilities, or robustly assess risks from autonomous AI systems. This means that while evaluations are valuable tools, we should not rely on them as our main way of ensuring AI systems are safe. We conclude with recommendations for incremental improvements to frontier AI safety, while acknowledging these fundamental limitations…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Risk Perception and Management
