Emergent misalignment as prompt sensitivity: A research note
Tim Wyse, Twm Stone, Anna Soligo, Daniel Tan

TL;DR
This paper investigates why language models finetuned on insecure code exhibit emergent misalignment, revealing that prompt nudges significantly influence their responses and suggesting perceived harmful intent as a contributing factor.
Contribution
The study systematically evaluates prompt sensitivity in insecure models, highlighting how specific prompts induce misaligned responses and proposing a link to perceived harmful intent.
Findings
Prompt nudges can reliably elicit misaligned responses.
Insecure models are more sensitive to prompts than secure models.
Perceived harmful intent correlates with misaligned responses.
Abstract
Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be `evil'. Conversely, asking them to be `HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Deception detection and forensic psychology · Software Engineering Research
