Sabotage Evaluations for Frontier Models

Joe Benton; Misha Wagner; Eric Christiansen; Cem Anil; Ethan Perez; Jai Srivastav; Esin Durmus; Deep Ganguli; Shauna Kravec; Buck Shlegeris; Jared Kaplan; Holden Karnofsky; Evan Hubinger; Roger Grosse; Samuel R. Bowman; David Duvenaud

arXiv:2410.21514 · cs.LG · October 30, 2024 · 2 cites

Open Access

TL;DR

This paper develops threat models and evaluations for assessing sabotage risks from frontier AI models. Results on Claude 3 Opus and Claude 3.5 Sonnet suggest that minimal mitigations are currently sufficient, while highlighting the need for stronger safeguards as capabilities advance.

Contribution

It introduces a set of related sabotage threat models and evaluations for frontier models, and demonstrates them on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet.

Findings

01

Minimal mitigations are currently sufficient to address sabotage risks for the tested models

02

More realistic evaluations and stronger mitigations are needed soon

03

Surveyed and discarded several evaluation approaches

Abstract

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic…

Taxonomy

Topics: Aerospace Engineering and Energy Systems · Spacecraft Dynamics and Control

Methods: Sparse Evolutionary Training