Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff

TL;DR
This paper shows that large language models exhibit behavioral self-awareness of their alignment state: they recognize and report their own shifts between misaligned and realigned behavior without being given in-context examples.
Contribution
It demonstrates that emergently misaligned models are self-aware of their harmful behaviors and that this self-awareness accurately reflects their current alignment status.
Findings
Emergently misaligned models rate themselves as significantly more harmful than their base model and realigned counterparts.
Self-awareness of behavior shifts correlates with actual alignment state.
Models can signal their safety status without in-context examples.
Abstract
Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that…
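The comparison described in the abstract (self-rated harmfulness of a misaligned model versus its base or realigned counterpart) can be sketched as a difference in mean self-ratings with a permutation test. This is only an illustrative sketch: the excerpt does not state which statistical test the authors used, so the permutation test is a stand-in, and the function names, the numeric rating scale, and the way ratings are collected are all assumptions.

```python
import random
from statistics import mean


def mean_difference(ratings_a, ratings_b):
    """Difference in mean self-rated harmfulness between two models."""
    return mean(ratings_a) - mean(ratings_b)


def permutation_p_value(ratings_a, ratings_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Assumption: each list holds numeric self-ratings gathered from one model
    (e.g. the misaligned fine-tune vs. the base model). The test itself is a
    generic choice, not necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    observed = abs(mean_difference(ratings_a, ratings_b))
    pooled = list(ratings_a) + list(ratings_b)
    n_a = len(ratings_a)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the group labels by shuffling the pooled ratings.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

Under these assumptions, a positive mean difference with a small p-value would correspond to the misaligned model rating itself as more harmful than the comparison model.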
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Language and cultural evolution
