Convergent Linear Representations of Emergent Misalignment

Anna Soligo; Edward Turner; Senthooran Rajamanoharan; Neel Nanda

arXiv:2506.11618·cs.LG·June 23, 2025

TL;DR

This paper investigates emergent misalignment in large language models. It shows that different emergently misaligned fine-tunes converge to similar representations of misalignment, and it proposes methods to interpret and ablate the resulting misaligned behaviour.

Contribution

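The paper introduces a minimal model organism (9 rank-1 adapters applied to Qwen2.5-14B-Instruct) for studying emergent misalignment, identifies a misalignment direction shared across fine-tunes, and interprets the individual fine-tuning adapters to understand their roles.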

Findings

01

Misaligned models converge to similar representations.

02

A misalignment direction can be extracted and used to ablate misbehavior.

03

Certain adapters contribute to general misalignment, others to domain-specific misalignment.
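
Finding 02 can be illustrated with a small sketch. It assumes the direction is taken as the difference between mean activations of misaligned and aligned responses, and that ablation means projecting that direction out of each activation; the summary above does not specify the paper's exact procedure, so treat both choices as assumptions. The data is synthetic, and the names `mean_diff_direction` and `ablate_direction` are hypothetical.

```python
import numpy as np

def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the mean aligned activation
    to the mean misaligned activation (assumed extraction method)."""
    d = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
aligned = rng.normal(size=(200, d_model))
misaligned = rng.normal(size=(200, d_model)) + 3.0 * true_dir

direction = mean_diff_direction(misaligned, aligned)
ablated = ablate_direction(misaligned, direction)
```

After ablation, every activation has zero projection onto `direction`. In a real model the same projection would be applied inside the forward pass, for example through a hook on the chosen layers.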

Abstract

Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomenon known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a 'misalignment direction' from one fine-tuned model's activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher-dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we…
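
The "scalar hidden state" of a rank-1 LoRA mentioned in the abstract can be made concrete with a short sketch. It assumes the standard LoRA form, where the adapter adds a low-rank update to a frozen weight matrix; at rank 1 the update is an outer product of two vectors, so the adapter's intermediate activation is a single number per token. The function `rank1_lora_forward` and all shapes are hypothetical, chosen for illustration only.

```python
import numpy as np

def rank1_lora_forward(x, W, a, b, scale=1.0):
    """Compute y = W x + scale * b * (a . x) and return (y, s).

    s = a . x is the adapter's scalar hidden state: one number
    that controls how strongly the output direction b is written.
    """
    s = float(a @ x)
    return W @ x + scale * s * b, s

# Random stand-ins for a frozen weight and a rank-1 adapter.
rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))   # frozen base weight
a = rng.normal(size=d_in)            # adapter "down" vector (reads the input)
b = rng.normal(size=d_out)           # adapter "up" vector (writes the output)
x = rng.normal(size=d_in)

y, s = rank1_lora_forward(x, W, a, b)
```

Because `s` is a scalar, it can be logged per token and compared across prompts, which is one reason a rank-1 adapter is convenient to interpret.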

Code & Models

Models

🤗
vincentoh/emergent-misalignment-hw0
model · 2 dl

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Evolutionary Algorithms and Applications