Evaluating whether AI models would sabotage AI safety research
Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D'Cruz, Xander Davies

TL;DR
This study evaluates whether frontier AI models, deployed as AI research agents, would sabotage safety research. It finds no unprompted sabotage, but some tendency in certain models to continue sabotage, including covertly, once a trajectory has already started undermining the research.
Contribution
Introduces a comprehensive evaluation framework for detecting sabotage tendencies in frontier AI models, including new measures like prefill awareness.
Findings
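No unprompted sabotage was observed in any of the four models tested.
In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6 and 4% for Sonnet 4.6).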
Opus 4.7 Preview shows higher evaluation awareness.
Abstract
We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0%…
