Emergent misalignment as prompt sensitivity: A research note

Tim Wyse; Twm Stone; Anna Soligo; Daniel Tan

arXiv:2507.06253 · cs.CR · July 10, 2025

Open Access

TL;DR

This paper investigates why language models finetuned on insecure code exhibit emergent misalignment. It finds that nudges in the prompt strongly influence their responses and suggests perceived harmful intent as a contributing factor.

Contribution

The study systematically evaluates prompt sensitivity in insecure models, highlighting how specific prompts induce misaligned responses and proposing a link to perceived harmful intent.

Findings

01. Prompt nudges can reliably elicit misaligned responses.

02. Insecure models are more sensitive to prompts than secure models.

03. Perceived harmful intent correlates with misaligned responses.

Abstract

Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall) and find that performance can be strongly affected by the presence of various nudges in the prompt. In the refusal and free-form question settings, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be 'evil'. Conversely, asking them to be 'HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base…

Taxonomy

Topics: Topic Modeling · Deception detection and forensic psychology · Software Engineering Research