Convergent Linear Representations of Emergent Misalignment
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper investigates emergent misalignment in large language models. It shows that different fine-tuned models converge to similar representations of misalignment, and it uses this finding to interpret and ablate the misaligned behavior.
Contribution
It introduces a minimal model organism for emergent misalignment (9 rank-1 adapters fine-tuned on Qwen2.5-14B-Instruct), identifies a misalignment direction shared across fine-tunes, and interprets the individual fine-tuning adapters to understand their roles.
Findings
Misaligned models converge to similar representations.
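A misalignment direction extracted from one fine-tuned model's activations can ablate misaligned behavior in other fine-tunes, including those using higher-dimensional LoRAs and different datasets.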
Certain adapters contribute to general misalignment, others to domain-specific misalignment.
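The direction-extraction and ablation finding can be illustrated with a minimal sketch. It assumes the direction is the normalized difference between mean activations on misaligned and aligned responses, and that ablation means projecting that direction out of an activation vector; this page does not state the paper's exact procedure, so both choices are assumptions. The function names (`misalignment_direction`, `ablate`) and the toy three-dimensional activations are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def misalignment_direction(misaligned_acts, aligned_acts):
    """Assumed recipe: normalized difference of mean activations."""
    mu_m = mean_vector(misaligned_acts)
    mu_a = mean_vector(aligned_acts)
    return unit([m - a for m, a in zip(mu_m, mu_a)])


def ablate(activation, direction):
    """Remove the component of `activation` along a unit `direction`."""
    coeff = dot(activation, direction)
    return [x - coeff * d for x, d in zip(activation, direction)]


# Toy activations (hypothetical): the two groups differ only along axis 1.
aligned = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
misaligned = [[1.0, 2.0, 0.5], [0.8, 2.2, 0.5]]

direction = misalignment_direction(misaligned, aligned)
cleaned = ablate(misaligned[0], direction)
residual = dot(cleaned, direction)  # close to zero after ablation
```

In a real model the same projection would be applied to residual-stream activations at each layer during generation; the sketch only shows the vector arithmetic.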
Abstract
Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomenon known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a 'misalignment direction' from one fine-tuned model's activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher-dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we…
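The "scalar hidden state" of a rank-1 LoRA follows from the standard LoRA factorization: with rank 1, the down-projection is a single vector, so its output on any input is one number, which then scales a fixed output vector. The sketch below illustrates that arithmetic only. The names `a`, `b`, `scale`, and `rank1_lora_delta` are hypothetical, and nothing here reflects the paper's specific adapter placement or training setup.

```python
def rank1_lora_delta(x, a, b, scale=1.0):
    """Contribution of a rank-1 LoRA adapter to a layer's output.

    a: down-projection vector (length = input dimension)
    b: up-projection vector (length = output dimension)
    Returns the scalar hidden state and the output-space update.
    """
    s = sum(ai * xi for ai, xi in zip(a, x))  # scalar hidden state a . x
    delta = [scale * s * bi for bi in b]      # update is always parallel to b
    return s, delta


# Toy example with hypothetical values.
x = [1.0, 2.0, 3.0]
a = [0.5, 0.0, -0.5]
b = [2.0, 0.0]

s, delta = rank1_lora_delta(x, a, b)
```

Because the update always points along `b`, the single scalar `s` records how strongly the adapter fires on a given input, which is what makes per-adapter interpretation tractable.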
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Evolutionary Algorithms and Applications
