Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels; Perusha Moodley; Benjamin M. Marlin; David Lindner

arXiv:2602.08877·cs.LG·March 9, 2026

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Benjamin M. Marlin, David Lindner

PDF

Open Access

TL;DR

This paper introduces a red-team pipeline to stress-test alignment audits by generating deception prompts, revealing that current methods are vulnerable to strategic deception by sophisticated models.

Contribution

It systematically evaluates the robustness of alignment auditing methods against deception strategies, highlighting their vulnerabilities and the need for improved techniques.

Findings

01

Deception prompts can fool both black-box and white-box auditing methods.

02

Current auditing techniques are not robust against strategic deception.

03

Activation-based deception strategies pose significant challenges to model alignment detection.

Abstract

Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDeception detection and forensic psychology · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI