UK AISI Alignment Evaluation Case-Study

Alexandra Souly; Robert Kirk; Jacob Merizian; Abby D'Cruz; Xander Davies

arXiv:2604.00788 · cs.AI · April 2, 2026

TL;DR

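This report develops and applies an evaluation framework to assess whether frontier AI models sabotage safety research when deployed as coding assistants within an AI lab. It finds no confirmed sabotage across four models, but notes frequent refusals of safety-relevant research tasks by Claude Opus 4.5 Preview and Sonnet 4.5, and less unprompted evaluation awareness in Opus 4.5 Preview than in Sonnet 4.5.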

Contribution

The paper introduces an evaluation framework, built on Petri, for testing whether frontier models sabotage safety research when deployed as coding assistants within an AI lab.

Findings

01

No confirmed instances of research sabotage in any of the four frontier models tested.

02

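Claude Opus 4.5 Preview and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope.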

03

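Opus 4.5 Preview shows less unprompted evaluation awareness than Sonnet 4.5, though both models can distinguish evaluation from deployment scenarios when prompted.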

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM…
