An Example Safety Case for Safeguards Against Misuse
Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies

TL;DR
This paper presents an example end-to-end safety case for AI misuse safeguards, combining red teaming, a quantitative uplift model, and continuous risk monitoring during deployment to argue that safeguards keep misuse risk low.
Contribution
It introduces an end-to-end safety case methodology that links safeguard evaluation, risk modeling, and deployment monitoring for AI misuse prevention.
Findings
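Red teaming yields an estimate of the effort required to evade safeguards
A quantitative uplift model converts that estimate into how much the barriers introduced by safeguards dissuade misuse
The procedure provides a continuous signal of risk during deployment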
Abstract
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying AI…
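To make the step from a red-team effort estimate to a dissuasion figure concrete, here is a minimal toy sketch. It is not the paper's uplift model: the function names, the idea of a per-actor "effort budget", and all numbers are hypothetical assumptions chosen only to illustrate how an evasion-effort estimate could feed a simple risk calculation.

```python
from bisect import bisect_left


def fraction_dissuaded(effort_budgets_hours, evasion_effort_hours):
    """Share of hypothetical actors whose effort budget is below the evasion cost.

    Assumption: an actor gives up if evading safeguards costs more effort
    than they are willing to spend. This is an illustrative rule only.
    """
    budgets = sorted(effort_budgets_hours)
    # bisect_left counts the budgets strictly below the evasion effort.
    return bisect_left(budgets, evasion_effort_hours) / len(budgets)


def residual_risk(baseline_risk, effort_budgets_hours, evasion_effort_hours):
    """Scale a hypothetical baseline risk by the share of actors not dissuaded."""
    dissuaded = fraction_dissuaded(effort_budgets_hours, evasion_effort_hours)
    return baseline_risk * (1.0 - dissuaded)


# Hypothetical inputs: five actors' effort budgets and a red-team estimate.
budgets = [1, 2, 5, 10, 40]   # hours each actor would spend (made up)
red_team_estimate = 8         # hours needed to evade safeguards (made up)

share = fraction_dissuaded(budgets, red_team_estimate)
risk = residual_risk(1.0, budgets, red_team_estimate)
```

Re-running the calculation whenever the red-team estimate changes, for example after a new evasion technique lowers the required effort, is one simple way such a figure could act as an ongoing risk signal during deployment.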
