Sabotage Evaluations for Frontier Models
Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, David Duvenaud

TL;DR
This paper develops threat models and evaluations to assess sabotage risks in frontier AI models. Its results suggest that minimal mitigations are currently sufficient for the models tested, but that more realistic evaluations and stronger mitigations will be needed as capabilities advance.
Contribution
It introduces a set of sabotage threat models and matching evaluations for frontier language models, and demonstrates them on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet.
Findings
Minimal mitigations are currently sufficient to address sabotage risks for the models tested
More realistic evaluations and stronger mitigations are needed soon
Surveyed and discarded several evaluation approaches
Abstract
Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic…
Taxonomy
Topics: Aerospace Engineering and Energy Systems · Spacecraft Dynamics and Control
Methods: Sparse Evolutionary Training
